CN109299357B - Laos language text subject classification method - Google Patents


Info

Publication number
CN109299357B
CN109299357B (application CN201811017181.3A)
Authority
CN
China
Prior art keywords
words
text
model
list
laos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811017181.3A
Other languages
Chinese (zh)
Other versions
CN109299357A (en)
Inventor
周兰江
王兴金
张建安
周枫
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201811017181.3A priority Critical patent/CN109299357B/en
Publication of CN109299357A publication Critical patent/CN109299357A/en
Application granted granted Critical
Publication of CN109299357B publication Critical patent/CN109299357B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method for classifying the topics of Laos language texts, belonging to the technical fields of natural language processing and machine learning. The invention combines an N-gram language feature extraction model with a naive Bayes mathematical model to realize topic recognition for Laos articles, eliminating the limitations of naive Bayes to a certain extent. Although the method retains the conditional-independence assumption and treats the text as a bag-of-words model without considering the order of words, it improves the recognition rate of texts by using unigram and bigram feature models simultaneously.

Description

Laos language text subject classification method
Technical Field
The invention relates to a method for classifying the topics of Laos language texts, and belongs to the technical fields of natural language processing and machine learning.
Background
As networks have grown in popularity, the information on them has grown exponentially. When a user searches for information with a search engine, thousands of related pages are often returned; how can the user locate the desired information quickly and efficiently without viewing the pages one by one? Topic identification plays an important role here: using a classifier trained in advance, it can locate the topic of the content the user wants from the limited information the user inputs and respond effectively. The naive Bayes classification model has a long history and a solid theoretical basis; it is a direct and efficient method for many problems, and many advanced natural-language-processing models have evolved from it. It converts the probability of belonging to a class given certain features into quantities that can be computed from a corpus, i.e., the posterior probability is calculated from the prior probability and the likelihood. However, it has a disadvantage: it treats all feature attributes as conditionally independent, which is equivalent to putting the text features into a bag of words without considering the order in which words appear; this often discards information and can cause the text to be misinterpreted.
Disclosure of Invention
The invention provides a method for classifying the topics of Laos texts, used to identify the topics of Laos articles of unknown category.
The technical scheme of the invention is as follows: a Laos text subject classification method comprises the following specific steps:
Step 1, crawl Laos texts using web-crawler technology, collecting five categories of texts: travel, economy, politics, education, and others; create five folders named after the categories to store the texts of each category, then preprocess the Laos texts to remove interference words unrelated to the categories, thereby constructing a corpus;
Step 2, traverse the corpus using Python: traverse the five folders created above, take the folder names as labels and the texts in the folders as the data to be trained, store the labels and the data in two lists named class_list and data_list respectively, and convert the two lists into the tuple form (class_list, data_list);
Step 3, select the fitting function of the CountVectorizer method in the sklearn module under N-gram mode, generate a bag-of-words CountVectorizer model from the texts in data_list, and convert the texts in data_list into text-vector form (a, b, c) through this bag-of-words model; create a new tuple storing class_list and the processed data_list. Here a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b;
step4, dividing the tuple of Step3 into a training set and a test set by adopting a train _ test _ split function provided by Sklearn, and inputting the training set into a naive Bayes model for training; and calculating the accuracy of prediction by using a score function in a naive Bayesian model provided by sklern in the test set, and adjusting the number of the selected high-frequency words when the fitting function is adopted in the Step3 according to the result of the accuracy so as to obtain the trained classifier of the Laos language chapter topic identification model.
The interference words comprise emoticons, numbers, spaces, and stop words; the emoticons, numbers, and spaces are removed with a regular expression, and the stop words are removed with a stop-word list.
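A minimal sketch of this preprocessing in Python, assuming a hypothetical stop-word list and space-segmented input (the patent does not publish its actual Lao stop-word list):

```python
import re

# Hypothetical stop-word list; the patent's actual Lao stop-word list is not published.
STOP_WORDS = {"ຂອງ", "ແລະ", "ໃນ"}

def preprocess(text):
    # Drop characters above U+10FF (emoji, symbols), mirroring the embodiment's
    # character-range regex u"[\u0000-\u10ff]+$"; the Lao block (U+0E80-U+0EFF) is kept.
    text = "".join(ch for ch in text if ch <= "\u10ff")
    # Remove numbers, then collapse runs of whitespace.
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove words that appear in the stop-word list.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess("ຂ່າວ 45 ຂອງ ມື້ນີ້ 😀"))  # ຂ່າວ ມື້ນີ້
```

The sketch assumes the input is already segmented into space-separated words; the patent itself does not describe the segmentation step.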
The N-gram mode adopts a combination mode of unigram and bigram.
The naive Bayes model is implemented using the MultinomialNB algorithm provided by sklearn.
The invention has the beneficial effects that:
1. The invention combines an N-gram language feature extraction model with a naive Bayes mathematical model to realize topic recognition for Laos articles, eliminating the limitations of naive Bayes to a certain extent. Although the method retains the conditional-independence assumption and treats the text as a bag-of-words model without considering the order of words, it improves the recognition rate of texts by using unigram and bigram feature models simultaneously.
2. When selecting the number of feature values, an iterative loop is run and the value that yields the highest classification accuracy, 800, is chosen, maximizing the classification accuracy.
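The iterative selection can be sketched as follows, with a toy stand-in corpus (the real search runs over the crawled Lao texts and, per the patent, settles on 800 features; the candidate values here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in corpus; two classes, repeated to give enough samples.
texts = ["aa bb cc", "bb cc dd", "aa dd ee", "cc ee ff", "aa bb ff", "dd ee ff"] * 10
labels = ["x", "y", "x", "y", "x", "y"] * 10

best_n, best_acc = None, 0.0
for n_features in (200, 400, 600, 800, 1000):
    vec = CountVectorizer(ngram_range=(1, 2), max_features=n_features)
    X = vec.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.15, random_state=0)
    # Laplace smoothing via alpha=1.0, matching the patent's lambda = 1.
    acc = MultinomialNB(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    if acc > best_acc:
        best_n, best_acc = n_features, acc
print(best_n, best_acc)
```

On the toy data all candidate sizes exceed the vocabulary, so the loop is only a shape of the search; on the real corpus the cap is binding and the accuracies differ.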
3. The invention selects the multinomial model of naive Bayes, which accounts for repeated words. When the multinomial model probability is calculated as in equation (1), Laplace smoothing is added, i.e., the λ value is set to 1, avoiding unreasonable zero-probability events. If the training sample is large, the change in the estimated probability caused by adding 1 to each count when computing the conditional probability is negligible, but the method conveniently and effectively avoids the zero-probability problem.
4. The method collects Laos stop words and removes words that occur frequently but are irrelevant to judging the text category. This reduces the time spent training the model and classifying, increases speed, and improves the text-classification effect.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an illustration of a CountVectorizer.
Detailed Description
The invention will be further illustrated below with reference to the figures and an embodiment, without limiting the scope of the invention thereto.
Example 1: as shown in FIGS. 1-2, a method for classifying Laos text topics comprises the following steps. Step 1: crawl Laos texts using web-crawler technology; the crawled texts fall into five categories: economy, politics, education, tourism, and others. Store the documents in five corresponding folders, each named after its category to facilitate later retrieval and processing, then preprocess the crawled articles to remove interference words irrelevant to classification, thereby constructing the corpus. The interference words include emoticons, numbers, spaces, and stop words; the emoticons, numbers, and spaces are removed with a regular expression, and the stop words are removed with a stop-word list (i.e., words appearing in the stop-word list are deleted). The regular expression used when removing interference words is u"[\u0000-\u10ff]+$"; it matches each character of the text and can exclude characters outside the range of Laos text. Table 1 below shows how many Laos texts were crawled and the number in each category.
Table 1:
Category    Number of texts
Economy     430
Politics    731
Education   197
Tourism     145
Others       28
Step 2: traverse the corpus using Python. The five folders created above are traversed; the folder names are the labels and the texts in the folders are the data to be trained. The labels are stored in a list named class_list and the texts in a list named data_list; after processing, the two lists are converted into tuple form to match the input type of the next module. Table 2 below illustrates the storage form of part of data_list and class_list:
table 2:
[Table 2 is reproduced as images in the original patent; it shows sample Lao texts stored in data_list alongside their category labels in class_list.]
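The Step 2 traversal can be sketched as follows, assuming the hypothetical folder layout of Step 1 (one folder per category, one UTF-8 text file per article):

```python
import os

def load_corpus(root):
    # Folder name = label, file content = text to be trained.
    class_list, data_list = [], []
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        for fname in sorted(os.listdir(folder)):
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                class_list.append(category)
                data_list.append(f.read())
    # Tuple form handed to the next module, as in the patent's Step 2.
    return (class_list, data_list)
```

With a root containing e.g. economy/ and politics/ folders, load_corpus returns the paired label and text lists in matching order.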
Step 3: before the texts are input into the naive Bayes model for probability calculation, they must be converted into vector form so the model can recognize them. The CountVectorizer feature-extraction method in the sklearn module is chosen here. CountVectorizer converts a text into a vector by counting, and it provides an N-gram mode with many settings; here unigram combined with bigram is chosen. Unigram builds the vector by treating each word of the text as occurring singly, while bigram builds word vectors from combinations of two adjacent words. Combining the two ways considers not only the effect of single words on the classifier but also the effect of combined words on text classification.
Before conversion, the fitting function of the CountVectorizer is used. It selects the top high-frequency words according to the word-frequency ordering of the input training set and generates a CountVectorizer model similar to a bag-of-words model; the model uses the indexes of the high-frequency words as the mapping for text-to-vector conversion. The number of selected high-frequency words can be adjusted according to the accuracy computed at Step 5; over multiple runs, accuracy was relatively high when 800 features were selected. The CountVectorizer also generates text vectors of the form (a, b, c): a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b.
The CountVectorizer also provides a transform method, which converts subsequently input texts into text vectors according to the previously trained bag-of-words CountVectorizer model; these text vectors are the input to the naive Bayes model. The texts in data_list are fed into the CountVectorizer to obtain the bag-of-words model, which can then also convert any text to be predicted later by the naive Bayes model into a word vector. FIG. 2 illustrates the CountVectorizer: Array() on the left of the figure represents the input texts, and the right side represents the word vectors after CountVectorizer processing. Each text in data_list is converted into a corresponding word vector using the bag-of-words model, so that all text data become data the naive Bayes model can recognize. After Step 3, the texts of Table 2 become Table 3 below:
table 3:
[Table 3 is reproduced as images in the original patent; it shows the Table 2 texts converted into sparse text vectors, i.e., (index, count) entries for each feature word.]
Step 4: these data can then be input into the naive Bayes model for training. Before training, the data must be partitioned. Sklearn provides the train_test_split function, which randomly divides the processed data_list and the corresponding class_list labels into a training set (85%) and a test set (15%), stored separately. The training set is used to train the naive Bayes model and the test set to verify its accuracy. The naive Bayes model chosen here is implemented with the MultinomialNB algorithm provided by sklearn. MultinomialNB assumes the features follow a multinomial distribution, i.e., as in equation (1):
$$P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{kjl} + \lambda}{m_k + n\lambda} \qquad (1)$$
where Y represents the class, C_k represents the k-th class, x_{jl} represents the l-th value taken by the word feature X_j, m_{kjl} represents the number of times that value occurs in class C_k, m_k represents the total number of words in class C_k, n represents the number of distinct words in class C_k (repeated words counted once), and λ is usually taken as 1, i.e., Laplace smoothing, which avoids zero probabilities and ensures every n-gram is treated as appearing at least once in the corpus. In the MultinomialNB algorithm repeated words are considered, that is, a repeated word is counted each time it occurs. The formula can be understood through this example, taking the calculation of P(apple | s):
$$P(\text{apple} \mid s) = \frac{\text{count of ``apple'' in class } s + \lambda}{\text{total number of words in class } s + n\lambda}$$
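Equation (1) with λ = 1 can be checked with a small helper (the counts below are toy values, not the patent's data):

```python
def cond_prob(word, class_word_counts, vocab_size, lam=1.0):
    # P(word | class) = (count of word in class + lam) / (total words in class + n*lam),
    # where n is the vocabulary size; lam=1 is Laplace smoothing.
    m_kj = class_word_counts.get(word, 0)
    m_k = sum(class_word_counts.values())
    return (m_kj + lam) / (m_k + vocab_size * lam)

# Toy class s: "apple" occurs twice among 10 words; 5 distinct words in the vocabulary.
counts = {"apple": 2, "pear": 3, "plum": 5}
print(cond_prob("apple", counts, vocab_size=5))   # (2+1)/(10+5) = 0.2
print(cond_prob("unseen", counts, vocab_size=5))  # (0+1)/(10+5): nonzero thanks to smoothing
```

The second call shows the point of the smoothing: a word never seen in the class still gets a small positive probability instead of zero.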
Step 5: the score function of the naive Bayes model provided by sklearn is used to calculate the prediction accuracy. The test set is input to the score function, which predicts the categories of the test set with the MultinomialNB model and then computes the accuracy from the predicted and true categories, yielding the trained classifier of the Laos text topic recognition model.
Step 6: when the Laos text topic recognition classifier is used for prediction, the predict function provided by sklearn's naive Bayes model can judge the type of a text, predicting according to the MultinomialNB model trained in Step 5.
Before a text to be predicted is input into the model, it must also be converted into a text vector: the feature extractor CountVectorizer maps the text into the text-vector space to generate its vector. The text vector is then input into the predict function of naive Bayes to predict which class it belongs to.
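Steps 5-6 can be sketched together: a hypothetical toy classifier is trained, then a new text is mapped with transform() into the same vector space before predict() (toy English data stand in for the Lao corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the Lao corpus.
train_docs = ["market price trade", "school teacher exam", "market trade export"]
train_labels = ["economy", "education", "economy"]

vec = CountVectorizer(ngram_range=(1, 2))
clf = MultinomialNB(alpha=1.0).fit(vec.fit_transform(train_docs), train_labels)

# New text: use transform(), not fit_transform(), to reuse the trained vocabulary.
new_vec = vec.transform(["trade price report"])
print(clf.predict(new_vec))  # ['economy']
```

Calling fit_transform() on the new text instead would build a fresh vocabulary and break the correspondence between feature indexes and the trained model.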
When a text vector is input into the classifier, the classifier computes the conditional probability of the vector under each of the five categories: tourism, economy, education, politics, and others, as in equation (2).
$$P(Y_i \mid X) = \frac{P(X \mid Y_i)\,P(Y_i)}{P(X)} \qquad (2)$$
In the example, the formula can be instantiated to calculate the probability that Text1 belongs to the economy category:
$$P(\text{economy} \mid \text{Text1}) = \frac{P(\text{Text1} \mid \text{economy})\,P(\text{economy})}{P(\text{Text1})}$$
Here X is the text vector to be classified and Y_i is a category. The class-conditional probability P(X | Y_i) is a joint probability over all attributes, and since the training samples are limited, naive Bayes adopts the attribute conditional-independence assumption, so equation (2) can be rewritten as equation (3):
$$P(Y_i \mid X) = \frac{P(Y_i)}{P(X)} \prod_{j=1}^{d} P(x_j \mid Y_i) \qquad (3)$$
where d is the number of attributes and x_j is the value of X on the j-th attribute. In the example the formula can be understood as follows (x_j represents the j-th word in Text1):
$$P(\text{economy} \mid \text{Text1}) = \frac{P(\text{economy})}{P(\text{Text1})} \prod_{j=1}^{d} P(x_j \mid \text{economy})$$
while calculating P (x)jIf | c) is carried out using the formula (1). Since the conditional probabilities for each class are calculated as they have the same denominator, it is sufficient to calculate the numerator, i.e., equation (4).
$$P(Y_i)\prod_{j=1}^{d} P(x_j \mid Y_i) \qquad (4)$$
The value of equation (4) is calculated for tourism, economy, education, politics, and others, and the category with the highest value is chosen.
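The comparison across categories can be sketched numerically, in log space to avoid underflow (the priors and likelihoods below are hypothetical):

```python
import math

# Hypothetical smoothed values for a two-word text under two of the five classes.
priors = {"economy": 0.4, "tourism": 0.6}
likelihoods = {
    "economy": [0.05, 0.02],  # P(x_j | economy) for each word, from equation (1)
    "tourism": [0.01, 0.01],
}

# Equation (4): P(Y_i) * prod_j P(x_j | Y_i); compare via log sums.
scores = {c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
          for c in priors}
print(max(scores, key=scores.get))  # economy: 0.4*0.05*0.02 = 4e-4 beats 0.6*1e-4 = 6e-5
```

Working in log space changes nothing about the ranking, since the logarithm is monotonic; it only prevents the product of many small probabilities from underflowing to zero.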
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from its spirit.

Claims (2)

1. A Laos language text subject classification method is characterized in that: the method comprises the following specific steps:
Step 1, crawl Laos texts using web-crawler technology, collecting five categories of texts: tourism, economy, politics, education, and others; create five folders named after the categories to store the texts of each category, then preprocess the Laos texts to remove interference words irrelevant to classification, thereby constructing a corpus;
Step 2, traverse the corpus using Python: traverse the five folders created above, take the folder names as labels and the texts in the folders as the data to be trained, store the labels and the data in two lists named class_list and data_list respectively, and after processing convert the two lists into the tuple form (class_list, data_list);
Step 3, select the fitting function of the CountVectorizer method in the sklearn module under N-gram mode, generate a bag-of-words CountVectorizer model from the texts in data_list, and convert the texts in data_list into text-vector form (a, b, c) through this bag-of-words model; create a new tuple storing class_list and the processed data_list, where a represents the number of selected feature words, b is the index corresponding to each feature word, and c is the number of times each feature word appears at the index position given by b;
Step 4, divide the tuple of Step 3 into a training set and a test set with the train_test_split function provided by sklearn, and input the training set into the naive Bayes model for training; on the test set, calculate the prediction accuracy using the score function of the naive Bayes model provided by sklearn, and adjust the number of high-frequency words selected by the fitting function in Step 3 according to the accuracy, thereby obtaining the trained classifier of the Laos text topic recognition model;
the interference words comprise emoticons, numbers, spaces and stop words; wherein, the expression symbols, the numbers and the spaces are removed by adopting a regular expression, and the stop words are removed by adopting a stop word list;
the naive bayes model is implemented using the MultinomialNB algorithm provided by sklern:
$$P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{kjl} + \lambda}{m_k + n\lambda}$$
where Y represents the class, C_k represents the k-th class, x_{jl} represents the l-th value taken by the word feature X_j, m_{kjl} represents the number of times that value occurs in class C_k, m_k represents the total number of words in class C_k, n represents the number of distinct (non-repeated) words in class C_k, and λ is taken as 1, i.e., Laplace smoothing.
2. The method for classifying Laos text topics as claimed in claim 1, wherein: the N-gram mode adopts a combination mode of unigram and bigram.
CN201811017181.3A 2018-08-31 2018-08-31 Laos language text subject classification method Active CN109299357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811017181.3A CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811017181.3A CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Publications (2)

Publication Number Publication Date
CN109299357A CN109299357A (en) 2019-02-01
CN109299357B true CN109299357B (en) 2022-04-12

Family

ID=65165992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811017181.3A Active CN109299357B (en) 2018-08-31 2018-08-31 Laos language text subject classification method

Country Status (1)

Country Link
CN (1) CN109299357B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083824A (en) * 2019-03-18 2019-08-02 昆明理工大学 A kind of Laotian segmenting method based on Multi-Model Combination neural network
CN111177724A (en) * 2019-12-12 2020-05-19 河北师范大学 Automatic detection method for polymorphic worm virus
CN112084308A (en) * 2020-09-16 2020-12-15 中国信息通信研究院 Method, system and storage medium for text type data recognition
CN113869356A (en) * 2021-08-17 2021-12-31 杭州华亭科技有限公司 Method for judging escape tendency of people based on Bayesian classification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843868A (en) * 2016-03-17 2016-08-10 浙江大学 Medial case searching method based on language model
CN106886578A (en) * 2017-01-23 2017-06-23 武汉翼海云峰科技有限公司 A kind of data row mapping method and system
CN107423371A (en) * 2017-07-03 2017-12-01 湖北师范大学 A kind of positive and negative class sensibility classification method of text
CN107506472A (en) * 2017-09-05 2017-12-22 淮阴工学院 A kind of student browses Web page classification method
CN108199951A (en) * 2018-01-04 2018-06-22 焦点科技股份有限公司 A kind of rubbish mail filtering method based on more algorithm fusion models

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7260568B2 (en) * 2004-04-15 2007-08-21 Microsoft Corporation Verifying relevance between keywords and web site contents


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘伟朋 et al., "Multi-dimensional sentiment classification of Chinese microblogs based on emoticons," Journal of Hefei University of Technology (Natural Science Edition), 2014, vol. 37, no. 7 *
赵建明 et al., "Machine-learning-based style recognition of Song ci poetry," Computer Engineering and Applications, January 2018, vol. 54, no. 1, pp. 186-190 *
刘伟朋 et al., "Multi-dimensional sentiment classification of Chinese microblogs based on emoticons," Journal of Hefei University of Technology (Natural Science Edition), July 2014, vol. 37, no. 7, pp. 803-807 *

Also Published As

Publication number Publication date
CN109299357A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299357B (en) Laos language text subject classification method
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN106897371B (en) Chinese text classification system and method
CN101561805B (en) Document classifier generation method and system
US20150074112A1 (en) Multimedia Question Answering System and Method
CN106599054B (en) Method and system for classifying and pushing questions
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN107608960B (en) Method and device for linking named entities
CN112463971B (en) E-commerce commodity classification method and system based on hierarchical combination model
US11210555B2 (en) High-dimensional image feature matching method and device
CN109359302B (en) Optimization method of domain word vectors and fusion ordering method based on optimization method
CN104199965A (en) Semantic information retrieval method
CN107291895B (en) Quick hierarchical document query method
CN110909116B (en) Entity set expansion method and system for social media
CN108595546B (en) Semi-supervision-based cross-media feature learning retrieval method
CN110866102A (en) Search processing method
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
Michelson et al. Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN111079011A (en) Deep learning-based information recommendation method
CN111597400A (en) Computer retrieval system and method based on way-finding algorithm
CN109344319B (en) Online content popularity prediction method based on ensemble learning
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN114510559B (en) Commodity retrieval method based on deep learning semantic implication and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant