CN109947942B - Bayesian text classification method based on position information - Google Patents
Bayesian text classification method based on position information
- Publication number
- CN109947942B (application CN201910193320.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- text
- feature
- bayesian
- method based
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a Bayesian text classification method based on position information, comprising the following steps: S1, transform the bag-of-words model through an input conversion module, in which a built-in position function converts the position parameters of the bag-of-words model; S2, train on the conversion results through a learning module with a built-in multinomial naive Bayes (MNB) model, obtain test results for different position-parameter settings, and select the parameter model with the best test performance; S3, predict the emotion category of new text corpora with the trained model. By introducing a weight calculation method based on position information, the invention achieves better results without affecting model speed, offering a reasonable way to mitigate the independence assumption among feature words in a text.
Description
Technical Field
The invention relates to the technical field of natural language processing application, in particular to a Bayesian text classification method based on position information.
Background
With the rapid development of network information and big-data technologies, vast unstructured and semi-structured text resources have accumulated on social media (e.g., Twitter, Facebook, Weibo), news media (e.g., Sina News, Toutiao, Sohu News), and encyclopedia websites such as Wikipedia and Baidu Baike. Cleaning, integrating, classifying, and mining valuable information from these data resources is inseparable from natural language processing (NLP) technology. Text emotion classification is a commonly used NLP method, and effectively classifying the emotional tendency of texts plays a crucial role in the classification and integration of text information.
Current text classification (including emotion classification) methods fall into three main categories: Bayesian methods, support vector machine methods, and neural network methods. The existing Bayesian approach is mainly multinomial naive Bayes over a bag-of-words model. However, the bag-of-words model does not consider the position of words in the text, and position information is particularly important in emotion classification: some important emotion words appear at the beginning or end of a sentence, which strongly influences the overall tendency of the text.
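As background for the improvement described above, a minimal multinomial naive Bayes classifier over a plain bag-of-words model (the baseline the invention modifies) can be sketched as follows; the toy tokenized corpus and the add-one smoothing constant are illustrative assumptions, not taken from the patent.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Train multinomial naive Bayes on tokenized documents.

    Returns class log-priors and smoothed per-class word
    log-likelihoods (Laplace / add-alpha smoothing).
    """
    vocab = sorted({w for d in docs for w in d})
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        likelihoods[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return priors, likelihoods

def predict_mnb(doc, priors, likelihoods):
    """Pick the class maximizing log P(c) + sum_w log P(w | c)."""
    return max(priors, key=lambda c: priors[c]
               + sum(likelihoods[c].get(w, 0.0) for w in doc))
```

Because plain bag-of-words counts carry no position information, two sentences containing the same words in different orders receive identical scores, which is exactly the limitation the position function targets.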
Based on this, and addressing this defect of the bag-of-words model in the Bayesian method, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
The technical solution adopted by the invention to solve this technical problem is as follows:
A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of new text corpora using the trained model.
Further, the result of the data conversion in step S1 includes emotion tags.
Further, the specific process of the conversion in step S1 is as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method; the presence value of a feature that occurs in the text is 1, and otherwise 0;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
s104, feature-value merging: the presence value from step S102 and the position weight from step S103 are added, giving each feature word the value presence + wt_pos.
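Steps S102-S104 can be sketched as follows. The position function `pos_fn` is passed in as a parameter because the patent's concrete formula is reproduced only as an image, and using the first occurrence of a word as its position is an assumption made for illustration.

```python
def merged_features(tokens, vocab, pos_fn):
    """S102-S104: combine the binary presence feature with the
    positional weight wt_pos into a single value presence + wt_pos."""
    n = len(tokens)
    feats = {}
    for w in vocab:
        presence = 1 if w in tokens else 0           # S102: presence in {0, 1}
        # S103: weight of the word's (first) position, 1-based
        wt_pos = pos_fn(tokens.index(w) + 1, n) if presence else 0.0
        feats[w] = presence + wt_pos                 # S104: merged value
    return feats
```

A word that is absent keeps the value 0, so the merged feature stays compatible with the sparse bag-of-words representation.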
Further, the preset position function is a piecewise function based on the logit function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (between 1 and n), and λ a model hyper-parameter obtained through model training.
Further, the training process in step S2 includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
Compared with the prior art, the invention has the beneficial effects that:
1. By introducing the weight calculation method based on position information, the invention achieves a better effect without affecting model speed, providing a reasonable way to mitigate the independence assumption among feature words in a text.
2. The invention quantitatively characterizes the contribution of the position function to the feature words; meanwhile, the hyper-parameter λ makes the position-influence function more flexible, allowing different parameter choices for different corpora.
Drawings
FIG. 1 is a diagram illustrating the word weight-position trend in one embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the accompanying drawings. It should be understood that the specific embodiments described here are merely illustrative and are not intended to limit the invention.
A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of new text corpora using the trained model.
In a specific implementation, the result of the data conversion in step S1 includes an emotion tag.
In a specific implementation, the conversion in step S1 proceeds as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method, where feature extraction mainly refers to presence; the presence value of a feature occurring in the text is 1 and otherwise 0. Since the method mainly targets short texts, a word generally occurs only once;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
s104, feature-value merging: the presence value from step S102 and the position weight from step S103 are added, giving each feature word the value presence + wt_pos.
In this technical solution, the chosen position function determines how strongly a word influences the overall emotional tendency of a sentence according to where the word appears. For example, in "I and my siblings dislike staying in a remote and inconvenient village", the emphasis falls on "dislike"; the trailing description of the place has little influence on the emotional tendency, and most descriptive text behaves similarly. The positional importance of a word generally first increases and then decreases along the sentence, as shown in Fig. 1: the closer a word lies to the end of the sentence, the less weight it contributes, finally settling at a relatively low level. A simple position function such as a linear one is not expressive enough to describe the trend in the figure, so a piecewise function based on the logit function is used instead as the preset position function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (between 1 and n), and λ a model hyper-parameter obtained through model training.
λ may be given an initial interval of [0, 5]. The formula assumes that a sentence is divided evenly into three parts: in the first part, word weights increase with position; in the middle part, the weights remain essentially unchanged; and in the third part, the weights gradually decrease as the position moves toward the end, because the emotional tendency of a sentence is essentially determined by the time its later portion is reached.
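Since the exact formula is reproduced only as an image in this version of the document, the following is an illustrative reconstruction of a position function matching the description: a logit (logistic) rise over the first third of the sentence, a flat middle third, and a logistic decay toward a relatively low level in the final third. The steepness constant 12 and the even split into thirds are assumptions.

```python
import math

def wt_pos(p, n, lam=1.0):
    """Illustrative piecewise position weight (NOT the patent's exact
    formula): rising over the first third of the sentence, flat in the
    middle third, decaying in the final third.

    p is the 1-based word position, n the sentence length, and lam the
    hyper-parameter lambda scaling the curve.
    """
    x = p / n
    if x <= 1 / 3:
        # logistic rise from a low value toward the plateau
        return lam / (1 + math.exp(-12 * (x - 1 / 6)))
    elif x <= 2 / 3:
        # roughly constant middle section (plateau value)
        return lam / (1 + math.exp(-12 * (1 / 3 - 1 / 6)))
    else:
        # logistic decay toward a relatively low level
        return lam / (1 + math.exp(12 * (x - 5 / 6)))
```

The hyper-parameter `lam` scales the whole curve, which is consistent with λ being tuned per corpus in step S2.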
As a preferred embodiment, the position feature weight of each feature word is computed from its position in the text and the preset position function. For example, for "I and my siblings dislike staying in a remote and inconvenient village", the unigram feature words (after stop-word removal) are ["I", "siblings", "dislike", "staying", "remote", "inconvenient", "village"], the corresponding positions p run from 1 to 7, and the text length n is 7. The hyper-parameter λ is tried over multiple values (0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0).
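The example above, together with step S2's parameter selection, can be sketched as follows; `evaluate` is a hypothetical callback standing in for training and testing the MNB model at a given λ, not an API from the patent.

```python
# Unigram feature words from the example sentence (stop words removed);
# positions p run 1..7 and the sentence length n is 7.
words = ["i", "siblings", "dislike", "staying",
         "remote", "inconvenient", "village"]
n = len(words)
positions = {w: p for p, w in enumerate(words, start=1)}

# Step S2 tries several values of the hyper-parameter lambda and keeps
# the best-performing one.
lambda_grid = [0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]

def select_lambda(evaluate):
    """Return the lambda from the grid with the highest evaluation score."""
    return max(lambda_grid, key=evaluate)
```

With a real evaluation function, `select_lambda` would reproduce the screening step that (per Table 2) favored λ = 1.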
In a specific implementation, the training process in step S2 further includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
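A minimal sketch of assembling the feature vector space and prediction category space from the step-S1 conversion results, assuming each result is a (feature-dict, emotion-label) pair; that pair representation is an assumption made for illustration.

```python
def build_training_space(conversion_results):
    """Assemble the feature-vector space X and the prediction-category
    space y from step-S1 conversion results, assumed here to be
    (feature_dict, emotion_label) pairs over a shared vocabulary.
    """
    vocab = sorted({w for feats, _ in conversion_results for w in feats})
    X = [[feats.get(w, 0.0) for w in vocab] for feats, _ in conversion_results]
    y = [label for _, label in conversion_results]
    return vocab, X, y
```

Sorting the vocabulary keeps the column order stable, so every document maps onto the same feature dimensions.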
The model method provided by the invention belongs to the discriminant classification methods and trains quickly. Basic statistics of five NLP data sets collected from the web (short-text review data from different domains and time periods), obtained after data cleaning, are shown in Table 1:
TABLE 1
The accuracy obtained by five-fold cross validation is shown in Table 2 (after screening and comparison, the hyper-parameter λ = 1 performs best):
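The five-fold cross validation behind Table 2 can be sketched with a plain index splitter; this generic helper is not taken from the patent.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) splits for k-fold cross validation,
    distributing any remainder across the first folds."""
    indices = list(range(n_samples))
    fold_size, rem = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        size = fold_size + (1 if k < rem else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test
```

Each sample appears in exactly one test fold, so the five fold accuracies can be averaged into the figures reported in Table 2.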
TABLE 2
The above validation results show that the MNB classification method based on presence and position features performs relatively well in emotion prediction.
In summary, Bayesian text classification methods generally assume that feature words are mutually independent (considering only presence features) and therefore ignore the influence of position information on emotional tendency. The present method introduces a weight calculation based on position information and achieves a better effect without affecting model speed. This provides a reasonable way to mitigate the independence assumption among feature words in text: first, the contribution of the position function to each feature word is characterized quantitatively; second, the hyper-parameter λ makes the position-influence function more flexible, allowing different parameter choices for different corpora; third, with λ = 1 the position information has a certain influence on the emotion classification result, but its peak weight of 0.5 is smaller than the influence of the presence feature.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (3)
1. A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of newly input text corpora using the trained model;
the specific process of conversion in step S1 is as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method; the presence value of a feature that occurs in the text is 1, and otherwise 0;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
step S104, feature-value merging: the presence value obtained in step S102 and the position weight obtained in step S103 are added, giving each feature word the value presence + wt_pos;
the preset position function is a piecewise function based on the logit function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (1 ≤ p ≤ n), and λ a model hyper-parameter obtained through model training.
2. The Bayesian text classification method based on position information as recited in claim 1, wherein the result of the data conversion in step S1 includes emotion tags.
3. The Bayesian text classification method based on position information as recited in claim 1, wherein the training in step S2 further comprises: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193320.6A CN109947942B (en) | 2019-03-14 | 2019-03-14 | Bayesian text classification method based on position information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109947942A (en) | 2019-06-28 |
CN109947942B (en) | 2022-05-24 |
Family
ID=67009865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910193320.6A Active CN109947942B (en) | 2019-03-14 | 2019-03-14 | Bayesian text classification method based on position information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109947942B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112905845B (en) * | 2021-03-17 | 2022-06-21 | 重庆大学 | Multi-source unstructured data cleaning method for discrete intelligent manufacturing application |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055673A (en) * | 2016-06-06 | 2016-10-26 | 中国人民解放军国防科学技术大学 | Chinese short-text sentiment classification method based on text characteristic insertion |
CN106445919A (en) * | 2016-09-28 | 2017-02-22 | 上海智臻智能网络科技股份有限公司 | Sentiment classifying method and device |
CN107633000A (en) * | 2017-08-03 | 2018-01-26 | 北京微智信业科技有限公司 | File classification method based on tfidf algorithms and related term weight amendment |
CN108399165A (en) * | 2018-03-28 | 2018-08-14 | 广东技术师范学院 | A kind of keyword abstraction method based on position weighting |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017090051A1 (en) * | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109947942A (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Emotion classification for short texts: an improved multi-label method | |
Kumar et al. | Sentiment analysis of multimodal twitter data | |
Neelakandan et al. | A gradient boosted decision tree-based sentiment classification of twitter data | |
Li et al. | Filtering out the noise in short text topic modeling | |
Asghar et al. | T‐SAF: Twitter sentiment analysis framework using a hybrid classification scheme | |
Effrosynidis et al. | A comparison of pre-processing techniques for twitter sentiment analysis | |
Sahni et al. | Efficient Twitter sentiment classification using subjective distant supervision | |
Venugopalan et al. | Exploring sentiment analysis on twitter data | |
Baly et al. | A characterization study of arabic twitter data with a benchmarking for state-of-the-art opinion mining models | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
Alowaidi et al. | Semantic sentiment analysis of Arabic texts | |
Nagamanjula et al. | A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis | |
CN105068991A (en) | Big data based public sentiment discovery method | |
CN104965823A (en) | Big data based opinion extraction method | |
Tang et al. | Learning sentence representation for emotion classification on microblogs | |
WO2024036840A1 (en) | Open-domain dialogue reply method and system based on topic enhancement | |
CN105183765A (en) | Big data-based topic extraction method | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
Demirci | Emotion analysis on Turkish tweets | |
Dedhia et al. | Ensemble model for Twitter sentiment analysis | |
Papadakis et al. | Graph vs. bag representation models for the topic classification of web documents | |
Chou et al. | Boosted web named entity recognition via tri-training | |
Atoum | Detecting cyberbullying from tweets through machine learning techniques with sentiment analysis | |
CN109947942B (en) | Bayesian text classification method based on position information | |
Banados et al. | Optimizing support vector machine in classifying sentiments on product brands from Twitter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||