CN109947942B - Bayesian text classification method based on position information - Google Patents


Info

Publication number
CN109947942B
CN109947942B · Application CN201910193320.6A
Authority
CN
China
Prior art keywords: model, text, feature, bayesian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910193320.6A
Other languages
Chinese (zh)
Other versions
CN109947942A (en)
Inventor
金勇
Current Assignee
Wuhan Firehome Putian Information Technology Co ltd
Original Assignee
Wuhan Firehome Putian Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Firehome Putian Information Technology Co ltd filed Critical Wuhan Firehome Putian Information Technology Co ltd
Priority to CN201910193320.6A priority Critical patent/CN109947942B/en
Publication of CN109947942A publication Critical patent/CN109947942A/en
Application granted granted Critical
Publication of CN109947942B publication Critical patent/CN109947942B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a Bayesian text classification method based on position information, comprising the following steps: S1, transform the bag-of-words model through an input conversion module, in which a position function converts the position parameters of the bag-of-words model; S2, train on the converted data through a learning module with a built-in MNB (multinomial naive Bayes) model, obtain test results for different position parameters, and select the parameter model with the best test performance; S3, predict the emotion category of new text corpora with the trained model. By introducing a weight calculation method based on position information, the invention achieves a better result without affecting model speed, providing a reasonable way to relax the independence assumption among feature words in a text.

Description

Bayesian text classification method based on position information
Technical Field
The invention relates to the technical field of natural language processing applications, and in particular to a Bayesian text classification method based on position information.
Background
With the rapid development of network information and big-data technologies, large volumes of unstructured and semi-structured text resources have accumulated on social media (e.g., Twitter, Facebook, Weibo), news media (e.g., Sina News, Toutiao, Sohu News), and encyclopedia websites such as Wikipedia and Baidu Baike. Cleaning, integrating, classifying, and mining value information from these data resources is inseparable from natural language processing (NLP) technology. Text emotion classification is a commonly used NLP method, and effectively classifying the emotional tendency of a text plays a crucial role in the classification and integration of text information.
Current text classification (including emotion classification) methods fall into three main categories: Bayesian methods, support vector machine methods, and neural network methods. Existing Bayesian methods are mainly multinomial naive Bayes over a bag-of-words model. The bag-of-words model, however, ignores the position of words in the text, and position is particularly important in emotion classification: some important emotion words appear at the beginning or end of a sentence and strongly influence the overall tendency of the text.
Accordingly, in view of this defect of the bag-of-words model in Bayesian methods, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
The technical scheme adopted by the invention to solve this technical problem is as follows:
a Bayesian text classification method based on location information, the method comprising:
S1, transform the bag-of-words model through an input conversion module, in which a position function converts the position parameters of the bag-of-words model;
S2, train on the converted data through a learning module with a built-in MNB Bayesian model, obtain test results for different position parameters, and select the parameter model with the best test performance;
S3, predict the emotion category of new text corpora with the trained model.
Further, the result of the data conversion in step S1 includes emotion tags.
Further, the specific conversion process in step S1 is as follows:
S101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
S102, presence feature extraction, specifically, features are extracted from the text with the tf-idf method; the presence value of a feature word that occurs in the text is 1, otherwise 0;
S103, position feature wt_pos extraction, specifically, the position feature weight of each feature word is computed from its position in the text and a preset position function;
S104, feature value merging, specifically, the presence value from step S102 and the position value from step S103 are added, so that each feature word takes the value present + wt_pos.
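The preprocessing step S101 can be sketched as follows. This is a minimal illustration under assumptions: the patent does not specify how garbled characters are removed or what the unified format is, so UTF-8 decoding with `errors="ignore"` and whitespace normalization stand in for those details.

```python
def preprocess(raw: bytes) -> str:
    """Sketch of step S101 (assumed details): remove garbled bytes and
    blank lines, and normalize each text to one uniform plain-text format."""
    text = raw.decode("utf-8", errors="ignore")  # drop undecodable ("messy code") bytes
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)  # drop blank lines
```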
Further, the preset position function is
[Position function formula — rendered as an image in the original document]
wt_pos denotes the weight of a given word, n denotes the length of the sentence containing it, p denotes its position (between 1 and n), and λ is a model hyper-parameter obtained through model training.
Further, the training process in step S2 includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
Compared with the prior art, the invention has the following beneficial effects:
1. By introducing a weight calculation method based on position information, the invention achieves a better result without affecting model speed, providing a reasonable way to relax the independence assumption among feature words in a text.
2. The invention quantitatively characterizes the contribution of the position function to each feature word; meanwhile, the hyper-parameter λ gives the position-influence function more flexibility, allowing different parameter choices for different corpora.
Drawings
FIG. 1 is a diagram illustrating the word weight versus position trend in one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to embodiments and accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A Bayesian text classification method based on location information, the method comprising:
S1, transform the bag-of-words model through an input conversion module, in which a position function converts the position parameters of the bag-of-words model;
S2, train on the converted data through a learning module with a built-in MNB Bayesian model, obtain test results for different position parameters, and select the parameter model with the best test performance;
S3, predict the emotion category of new text corpora with the trained model.
In a specific implementation, the result of the data conversion in step S1 includes an emotion tag.
In a specific implementation, the conversion in step S1 proceeds as follows:
S101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
S102, presence feature extraction, specifically, features are extracted from the text with the tf-idf method; the extraction concerns the presence feature: the presence value of a feature word that occurs in the text is 1, otherwise 0. Since the method mainly targets short texts, a word generally occurs only once;
S103, position feature wt_pos extraction, specifically, the position feature weight of each feature word is computed from its position in the text and a preset position function;
S104, feature value merging, specifically, the presence value from step S102 and the position value from step S103 are added, so that each feature word takes the value present + wt_pos.
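Steps S102 to S104 can be sketched as below, assuming a vocabulary already selected by tf-idf (the `vocab` mapping here is hypothetical) and an arbitrary position function passed in as a callable; each feature word present in the text gets the value present (1) plus wt_pos, as the description states.

```python
def build_features(tokens, vocab, pos_weight):
    """Sketch of steps S102-S104 (assumed interface).

    tokens     : feature words of one short text, in order (stop words removed)
    vocab      : dict mapping feature word -> column index (tf-idf selection assumed done)
    pos_weight : callable (p, n) -> wt_pos, the preset position function
    """
    n = len(tokens)
    vec = [0.0] * len(vocab)
    for p, word in enumerate(tokens, start=1):
        col = vocab.get(word)
        if col is not None:
            vec[col] = 1.0 + pos_weight(p, n)  # present + wt_pos
    return vec
```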
In this technical scheme, the position function determines how strongly a word influences the overall emotional tendency of the sentence according to where the word appears. For example, in "I and my siblings dislike staying in a remote and inconvenient village", the emphasis falls on "dislike"; the later text describing the place has little influence on the emotional tendency, and most descriptive text follows a similar pattern. The positional importance of a word generally first increases and then decreases along the sentence, as shown in FIG. 1: the closer a word is to the end of the sentence, the smaller its weight contribution, which finally settles at a relatively low level. A simple position function such as a linear one is not expressive enough to capture this trend, so a piecewise function based on the logit function can be used instead, and the preset position function is obtained as
[Position function formula — rendered as an image in the original document]
wt_pos denotes the weight of a given word, n denotes the length of the sentence containing it, p denotes its position (between 1 and n), and λ is a model hyper-parameter obtained through model training.
λ can be given an initial interval [0, 5]. The formula above divides a sentence into three equal parts: in the first part, word weights increase with position; in the middle part, weights remain basically unchanged; in the third part, weights gradually decrease as the position moves toward the end of the sentence, because by the time the later portion of a sentence is reached, its emotional tendency is essentially determined.
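The exact formula appears only as an image in the source, so the sketch below is an assumption that reproduces just the described shape: a logit (sigmoid) rise over the first third of the sentence, a plateau over the middle third, and a symmetric decay over the last third, with λ controlling steepness and the peak capped at 0.5 (the peak value reported later in the text). The breakpoint constants are illustrative, not the patented formula.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wt_pos(p, n, lam=1.0, peak=0.5):
    """Hypothetical position function matching the described three-part shape.
    p is the word position (1..n), n the sentence length, lam the
    hyper-parameter lambda; the true patented formula is not reproduced here."""
    x = p / n  # relative position in (0, 1]
    if x <= 1 / 3:
        return peak * _sigmoid(lam * (18 * x - 3))        # rising first third
    if x <= 2 / 3:
        return peak * _sigmoid(3 * lam)                   # flat middle third
    return peak * _sigmoid(lam * (3 - 18 * (x - 2 / 3)))  # decaying last third
```

The branches agree at the breakpoints x = 1/3 and x = 2/3 (both evaluate to peak · σ(3λ)), so the sketched function is continuous.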
In a preferred embodiment, the position feature weight is computed from each feature word's position in the text and the preset position function. For the example "I and my siblings dislike staying in a remote and inconvenient village", the corresponding unigram feature words are ["I", "siblings", "dislike", "staying", "remote", "inconvenient", "village"], their positions p run from 1 to 7 (stop words excluded), and the corresponding text length n is 7. The hyper-parameter λ is given multiple candidate values (0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0).
In a specific implementation, the training process in step S2 includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
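The learning module's MNB model is standard multinomial naive Bayes; a compact self-contained version (Laplace smoothing with α = 1; a sketch, not the patent's implementation) over the converted feature vectors might look like:

```python
import math

class TinyMNB:
    """Minimal multinomial naive Bayes over nonnegative feature vectors."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) + 1.0 for col in zip(*rows)]  # Laplace smoothing
            denom = sum(totals)
            self.log_lik[c] = [math.log(t / denom) for t in totals]
        return self

    def predict(self, x):
        def log_posterior(c):
            return self.log_prior[c] + sum(
                v * ll for v, ll in zip(x, self.log_lik[c]))
        return max(self.classes, key=log_posterior)
```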
The model method provided by the invention belongs to the discriminant classification methods and trains quickly. Basic statistics of five NLP datasets collected from the web (short-text review data from different domains and periods), obtained after data cleaning, are shown in Table 1:
TABLE 1
[Table 1 — rendered as an image in the original document: basic statistics of the five datasets]
The accuracy obtained by five-fold cross-validation is shown in Table 2 (after screening and comparison, the hyper-parameter λ = 1 performs best):
TABLE 2
[Table 2 — rendered as an image in the original document: five-fold cross-validation accuracy]
These verification results show that the MNB classification method based on presence and position features performs relatively well in emotion prediction.
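The five-fold cross-validation used to compare λ settings is the ordinary kind (the five datasets themselves are not reproduced here); a plain sketch:

```python
def k_fold_accuracy(X, y, train_fn, k=5):
    """Average accuracy over k folds. `train_fn(X_train, y_train)` must
    return a predict callable; an MNB model's bound predict method fits."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved fold indices
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        X_train = [x for i, x in enumerate(X) if i not in held_out]
        y_train = [c for i, c in enumerate(y) if i not in held_out]
        predict = train_fn(X_train, y_train)
        hits = sum(predict(X[i]) == y[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / k
```

Running this once per candidate λ and keeping the best average accuracy reproduces the parameter-selection loop of step S2.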
In summary, Bayesian text classification methods generally assume that feature words are mutually independent (considering only presence features) and therefore ignore the influence of position information on emotional tendency. The method introduces a weight calculation based on position information and achieves a better result without affecting model speed. This provides a reasonable way to relax the independence assumption among feature words in a text: first, the contribution of the position function to each feature word is quantitatively characterized; second, the hyper-parameter λ gives the position-influence function more flexibility, allowing different parameter choices for different corpora; third, at λ = 1 the position information has a measurable influence on the emotion classification, but its peak weight of 0.5 remains smaller than the influence of the presence feature.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. A Bayesian text classification method based on location information, the method comprising:
S1, transform the bag-of-words model through an input conversion module, in which a position function converts the position parameters of the bag-of-words model;
S2, train on the converted data through a learning module with a built-in MNB Bayesian model, obtain test results for different position parameters, and select the parameter model with the best test performance;
S3, predict the emotion category of newly entered text corpora with the trained model;
the specific process of conversion in step S1 is as follows:
s101, data preprocessing, including blank line elimination, messy code elimination and unified format conversion;
s102, existing feature presence extraction, specifically, feature extraction is carried out on a text by using a tf-idf method, the corresponding feature value presence existing in the text is 1, and otherwise, the corresponding feature value presence is 0;
s103, extracting position characteristics wt _ pos, specifically, calculating a corresponding position characteristic weight value according to a corresponding position of each characteristic word in the text and a preset position function;
step S104, merging the feature values, specifically, adding the existence feature value and the position feature value obtained in step S102 and step S103 to obtain a value corresponding to the feature word, which is present + wt _ pos;
the preset position function is
[Position function formula — rendered as an image in the original document]
wt_pos denotes the weight of a given word, n denotes the length of the sentence containing it, p denotes its position with 1 ≤ p ≤ n, and λ is a model hyper-parameter obtained through model training.
2. The Bayesian text classification method based on location information as recited in claim 1, wherein: the result of the data conversion in step S1 includes emotion tags.
3. The Bayesian text classification method based on location information as recited in claim 1, wherein the training in step S2 further comprises: using the conversion result obtained in step S1 as a feature vector space, and a label corresponding to each feature vector as a prediction category space.
CN201910193320.6A 2019-03-14 2019-03-14 Bayesian text classification method based on position information Active CN109947942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910193320.6A CN109947942B (en) 2019-03-14 2019-03-14 Bayesian text classification method based on position information


Publications (2)

Publication Number Publication Date
CN109947942A CN109947942A (en) 2019-06-28
CN109947942B true CN109947942B (en) 2022-05-24

Family

ID=67009865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910193320.6A Active CN109947942B (en) 2019-03-14 2019-03-14 Bayesian text classification method based on position information

Country Status (1)

Country Link
CN (1) CN109947942B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905845B (en) * 2021-03-17 2022-06-21 重庆大学 Multi-source unstructured data cleaning method for discrete intelligent manufacturing application

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055673A (en) * 2016-06-06 2016-10-26 中国人民解放军国防科学技术大学 Chinese short-text sentiment classification method based on text characteristic insertion
CN106445919A (en) * 2016-09-28 2017-02-22 上海智臻智能网络科技股份有限公司 Sentiment classifying method and device
CN107633000A (en) * 2017-08-03 2018-01-26 北京微智信业科技有限公司 File classification method based on tfidf algorithms and related term weight amendment
CN108399165A (en) * 2018-03-28 2018-08-14 广东技术师范学院 A kind of keyword abstraction method based on position weighting

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof


Also Published As

Publication number Publication date
CN109947942A (en) 2019-06-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant