CN109947942B - Bayesian text classification method based on position information - Google Patents
Bayesian text classification method based on position information
- Publication number
- CN109947942B (application CN201910193320.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- text
- feature
- bayesian
- method based
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a Bayesian text classification method based on position information, comprising the following steps: S1, transform the bag-of-words model through an input conversion module, in which a built-in position function converts the position parameters of the bag-of-words model; S2, train on the conversion results through a learning module with a built-in multinomial naive Bayes (MNB) model, obtain test results for different position-parameter settings, and select the parameter model with the best test performance; S3, predict the emotion category of new text corpora with the trained model. By introducing a weight calculation method based on position information, the invention achieves better results without affecting model speed, offering a reasonable way to mitigate the independence assumption among feature words in a text.
Description
Technical Field
The invention relates to the technical field of natural language processing application, in particular to a Bayesian text classification method based on position information.
Background
With the rapid development of network information and big-data technologies, vast unstructured and semi-structured text resources have accumulated on social media (e.g., Twitter, Facebook, Weibo), news media (e.g., Sina News, Toutiao, Sohu News), and encyclopedia websites such as Wikipedia and Baidu Baike. Cleaning, integrating, classifying, and mining valuable information from these data resources is inseparable from natural language processing (NLP) technology. Text emotion classification is a commonly used NLP method, and effectively classifying the emotional tendency of texts plays a crucial role in the classification and integration of text information.
Current text classification (including emotion classification) methods fall into three main categories: Bayesian methods, support vector machine methods, and neural network methods. The existing Bayesian approach is mainly multinomial naive Bayes over a bag-of-words model. However, the bag-of-words model does not consider the position of words in the text, and position information is particularly important in emotion classification: some important emotion words appear at the beginning or end of a sentence, which strongly influences the overall tendency of the text.
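As background for the improvement described above, a minimal multinomial naive Bayes classifier over a plain bag-of-words model (the baseline the invention modifies) can be sketched as follows; the toy tokenized corpus and the add-one smoothing constant are illustrative assumptions, not taken from the patent.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Train multinomial naive Bayes on tokenized documents.

    Returns class log-priors and smoothed per-class word
    log-likelihoods (Laplace / add-alpha smoothing).
    """
    vocab = sorted({w for d in docs for w in d})
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        likelihoods[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return priors, likelihoods

def predict_mnb(doc, priors, likelihoods):
    """Pick the class maximizing log P(c) + sum_w log P(w | c)."""
    return max(priors, key=lambda c: priors[c]
               + sum(likelihoods[c].get(w, 0.0) for w in doc))
```

Because plain bag-of-words counts carry no position information, two sentences containing the same words in different orders receive identical scores, which is exactly the limitation the position function targets.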
Based on this, and addressing this defect of the bag-of-words model in the Bayesian method, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an improved bag-of-words model based on a position function and performs emotion classification with a better-performing multinomial naive Bayes method.
The technical solution adopted by the invention to solve this technical problem is as follows:
A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of new text corpora using the trained model.
Further, the result of the data conversion in step S1 includes emotion tags.
Further, the specific process of the conversion in step S1 is as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method; the presence value of a feature that occurs in the text is 1, and otherwise 0;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
s104, feature-value merging: the presence value from step S102 and the position weight from step S103 are added, giving each feature word the value presence + wt_pos.
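Steps S102-S104 can be sketched as follows. The position function `pos_fn` is passed in as a parameter because the patent's concrete formula is reproduced only as an image, and using the first occurrence of a word as its position is an assumption made for illustration.

```python
def merged_features(tokens, vocab, pos_fn):
    """S102-S104: combine the binary presence feature with the
    positional weight wt_pos into a single value presence + wt_pos."""
    n = len(tokens)
    feats = {}
    for w in vocab:
        presence = 1 if w in tokens else 0           # S102: presence in {0, 1}
        # S103: weight of the word's (first) position, 1-based
        wt_pos = pos_fn(tokens.index(w) + 1, n) if presence else 0.0
        feats[w] = presence + wt_pos                 # S104: merged value
    return feats
```

A word that is absent keeps the value 0, so the merged feature stays compatible with the sparse bag-of-words representation.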
Further, the preset position function is a piecewise function based on the logit function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (between 1 and n), and λ a model hyper-parameter obtained through model training.
Further, the training process in step S2 includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
Compared with the prior art, the invention has the beneficial effects that:
1. By introducing the weight calculation method based on position information, the invention achieves a better effect without affecting model speed, providing a reasonable way to mitigate the independence assumption among feature words in a text.
2. The invention quantitatively characterizes the contribution of the position function to the feature words; meanwhile, the hyper-parameter λ makes the position-influence function more flexible, allowing different parameter choices for different corpora.
Drawings
FIG. 1 is a diagram illustrating the word weight-position trend in one embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the accompanying drawings. It should be understood that the specific embodiments described here are merely illustrative and are not intended to limit the invention.
A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of new text corpora using the trained model.
In a specific implementation, the result of the data conversion in step S1 includes an emotion tag.
In a specific implementation, the conversion in step S1 proceeds as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method, where feature extraction mainly refers to presence; the presence value of a feature occurring in the text is 1 and otherwise 0. Since the method mainly targets short texts, a word generally occurs only once;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
s104, feature-value merging: the presence value from step S102 and the position weight from step S103 are added, giving each feature word the value presence + wt_pos.
In this technical solution, the chosen position function determines how strongly a word influences the overall emotional tendency of a sentence according to where the word appears. For example, in "I and my siblings dislike staying in a remote and inconvenient village", the emphasis falls on "dislike"; the trailing description of the place has little influence on the emotional tendency, and most descriptive text behaves similarly. The positional importance of a word generally first increases and then decreases along the sentence, as shown in Fig. 1: the closer a word lies to the end of the sentence, the less weight it contributes, finally settling at a relatively low level. A simple position function such as a linear one is not expressive enough to describe the trend in the figure, so a piecewise function based on the logit function is used instead as the preset position function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (between 1 and n), and λ a model hyper-parameter obtained through model training.
λ may be given an initial interval of [0, 5]. The formula assumes that a sentence is divided evenly into three parts: in the first part, word weights increase with position; in the middle part, the weights remain essentially unchanged; and in the third part, the weights gradually decrease as the position moves toward the end, because the emotional tendency of a sentence is essentially determined by the time its later portion is reached.
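Since the exact formula is reproduced only as an image in this version of the document, the following is an illustrative reconstruction of a position function matching the description: a logit (logistic) rise over the first third of the sentence, a flat middle third, and a logistic decay toward a relatively low level in the final third. The steepness constant 12 and the even split into thirds are assumptions.

```python
import math

def wt_pos(p, n, lam=1.0):
    """Illustrative piecewise position weight (NOT the patent's exact
    formula): rising over the first third of the sentence, flat in the
    middle third, decaying in the final third.

    p is the 1-based word position, n the sentence length, and lam the
    hyper-parameter lambda scaling the curve.
    """
    x = p / n
    if x <= 1 / 3:
        # logistic rise from a low value toward the plateau
        return lam / (1 + math.exp(-12 * (x - 1 / 6)))
    elif x <= 2 / 3:
        # roughly constant middle section (plateau value)
        return lam / (1 + math.exp(-12 * (1 / 3 - 1 / 6)))
    else:
        # logistic decay toward a relatively low level
        return lam / (1 + math.exp(12 * (x - 5 / 6)))
```

The hyper-parameter `lam` scales the whole curve, which is consistent with λ being tuned per corpus in step S2.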
As a preferred embodiment, the position feature weight of each feature word is computed from its position in the text and the preset position function. For example, for "I and my siblings dislike staying in a remote and inconvenient village", the unigram feature words (after stop-word removal) are ["I", "siblings", "dislike", "staying", "remote", "inconvenient", "village"], the corresponding positions p run from 1 to 7, and the text length n is 7. The hyper-parameter λ is tried over multiple values (0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0).
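The example above, together with step S2's parameter selection, can be sketched as follows; `evaluate` is a hypothetical callback standing in for training and testing the MNB model at a given λ, not an API from the patent.

```python
# Unigram feature words from the example sentence (stop words removed);
# positions p run 1..7 and the sentence length n is 7.
words = ["i", "siblings", "dislike", "staying",
         "remote", "inconvenient", "village"]
n = len(words)
positions = {w: p for p, w in enumerate(words, start=1)}

# Step S2 tries several values of the hyper-parameter lambda and keeps
# the best-performing one.
lambda_grid = [0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]

def select_lambda(evaluate):
    """Return the lambda from the grid with the highest evaluation score."""
    return max(lambda_grid, key=evaluate)
```

With a real evaluation function, `select_lambda` would reproduce the screening step that (per Table 2) favored λ = 1.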
In a specific implementation, the training process in step S2 further includes: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
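A minimal sketch of assembling the feature vector space and prediction category space from the step-S1 conversion results, assuming each result is a (feature-dict, emotion-label) pair; that pair representation is an assumption made for illustration.

```python
def build_training_space(conversion_results):
    """Assemble the feature-vector space X and the prediction-category
    space y from step-S1 conversion results, assumed here to be
    (feature_dict, emotion_label) pairs over a shared vocabulary.
    """
    vocab = sorted({w for feats, _ in conversion_results for w in feats})
    X = [[feats.get(w, 0.0) for w in vocab] for feats, _ in conversion_results]
    y = [label for _, label in conversion_results]
    return vocab, X, y
```

Sorting the vocabulary keeps the column order stable, so every document maps onto the same feature dimensions.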
The model method provided by the invention belongs to the discriminant classification methods and trains quickly. Basic statistics of five NLP data sets collected from the web (short-text review data from different domains and time periods), obtained after data cleaning, are shown in Table 1:
TABLE 1
The accuracy obtained by five-fold cross validation is shown in Table 2 (after screening and comparison, the hyper-parameter λ = 1 performs best):
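The five-fold cross validation behind Table 2 can be sketched with a plain index splitter; this generic helper is not taken from the patent.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) splits for k-fold cross validation,
    distributing any remainder across the first folds."""
    indices = list(range(n_samples))
    fold_size, rem = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        size = fold_size + (1 if k < rem else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test
```

Each sample appears in exactly one test fold, so the five fold accuracies can be averaged into the figures reported in Table 2.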
TABLE 2
The above validation results show that the MNB classification method based on presence and position features performs relatively well in emotion prediction.
In summary, Bayesian text classification methods generally assume that feature words are mutually independent (considering only presence features) and therefore ignore the influence of position information on emotional tendency. The present method introduces a weight calculation based on position information and achieves a better effect without affecting model speed. This provides a reasonable way to mitigate the independence assumption among feature words in text: first, the contribution of the position function to each feature word is characterized quantitatively; second, the hyper-parameter λ makes the position-influence function more flexible, allowing different parameter choices for different corpora; third, with λ = 1 the position information has a certain influence on the emotion classification result, but its peak weight of 0.5 is smaller than the influence of the presence feature.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (3)
1. A Bayesian text classification method based on position information, the method comprising:
s1, convert the bag-of-words model through an input conversion module; a position function built into the module converts the position parameters of the bag-of-words model;
s2, train on the data conversion results through a learning module with a built-in MNB Bayesian model, obtain test results for the different position parameters, and select the parameter model with the best test performance;
s3, predict the emotion category of newly input text corpora using the trained model;
the specific process of conversion in step S1 is as follows:
s101, data preprocessing, including removal of blank lines, removal of garbled characters, and conversion to a unified format;
s102, presence feature extraction: features are extracted from the text using the tf-idf method; the presence value of a feature that occurs in the text is 1, and otherwise 0;
s103, position feature extraction (wt_pos): the position feature weight of each feature word is computed from its position in the text using the preset position function;
step S104, feature-value merging: the presence value obtained in step S102 and the position weight obtained in step S103 are added, giving each feature word the value presence + wt_pos;
the preset position function is a piecewise function based on the logit function (the formula itself appears only as an image in the original publication); wt_pos denotes the weight of a word, n the length of the sentence containing it, p the word's position (1 ≤ p ≤ n), and λ a model hyper-parameter obtained through model training.
2. The Bayesian text classification method based on position information as recited in claim 1, wherein the result of the data conversion in step S1 includes emotion tags.
3. The Bayesian text classification method based on position information as recited in claim 1, wherein the training in step S2 further comprises: using the conversion result obtained in step S1 as the feature vector space, and the label corresponding to each feature vector as the prediction category space.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193320.6A CN109947942B (en) | 2019-03-14 | 2019-03-14 | Bayesian text classification method based on position information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109947942A (en) | 2019-06-28 |
CN109947942B (en) | 2022-05-24 |
Family
ID=67009865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910193320.6A Active CN109947942B (en) | 2019-03-14 | 2019-03-14 | Bayesian text classification method based on position information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109947942B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112905845B (en) * | 2021-03-17 | 2022-06-21 | 重庆大学 | Multi-source unstructured data cleaning method for discrete intelligent manufacturing application |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106055673A (en) * | 2016-06-06 | 2016-10-26 | 中国人民解放军国防科学技术大学 | Chinese short-text sentiment classification method based on text characteristic insertion |
CN106445919A (en) * | 2016-09-28 | 2017-02-22 | 上海智臻智能网络科技股份有限公司 | Sentiment classifying method and device |
CN107633000A (en) * | 2017-08-03 | 2018-01-26 | 北京微智信业科技有限公司 | File classification method based on tfidf algorithms and related term weight amendment |
CN108399165A (en) * | 2018-03-28 | 2018-08-14 | 广东技术师范学院 | A kind of keyword abstraction method based on position weighting |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017090051A1 (en) * | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109947942A (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Emotion classification for short texts: an improved multi-label method | |
Kumar et al. | Sentiment analysis of multimodal twitter data | |
Neelakandan et al. | A gradient boosted decision tree-based sentiment classification of twitter data | |
Li et al. | Filtering out the noise in short text topic modeling | |
Asghar et al. | T‐SAF: Twitter sentiment analysis framework using a hybrid classification scheme | |
Effrosynidis et al. | A comparison of pre-processing techniques for twitter sentiment analysis | |
Sahni et al. | Efficient Twitter sentiment classification using subjective distant supervision | |
Venugopalan et al. | Exploring sentiment analysis on twitter data | |
Baly et al. | A characterization study of arabic twitter data with a benchmarking for state-of-the-art opinion mining models | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
Alowaidi et al. | Semantic sentiment analysis of Arabic texts | |
Nagamanjula et al. | A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis | |
CN105068991A (en) | Big data based public sentiment discovery method | |
CN104965823A (en) | Big data based opinion extraction method | |
Tang et al. | Learning sentence representation for emotion classification on microblogs | |
WO2024036840A1 (en) | Open-domain dialogue reply method and system based on topic enhancement | |
CN105183765A (en) | Big data-based topic extraction method | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
Demirci | Emotion analysis on Turkish tweets | |
Dedhia et al. | Ensemble model for Twitter sentiment analysis | |
Papadakis et al. | Graph vs. bag representation models for the topic classification of web documents | |
Chou et al. | Boosted web named entity recognition via tri-training | |
Atoum | Detecting cyberbullying from tweets through machine learning techniques with sentiment analysis | |
CN109947942B (en) | Bayesian text classification method based on position information | |
Banados et al. | Optimizing support vector machine in classifying sentiments on product brands from Twitter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||