CN110909158A - Text classification method based on improved firefly algorithm and K nearest neighbor - Google Patents


Info

Publication number
CN110909158A
Authority
CN
China
Prior art keywords
firefly
text
algorithm
feature
brightness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910605245.XA
Other languages
Chinese (zh)
Other versions
CN110909158B (en)
Inventor
文武
赵成
刘颖
解如风
范荣妹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute Of Quality And Standardization
CHONGQING XINKE DESIGN Co Ltd
Original Assignee
Chongqing Institute Of Quality And Standardization
CHONGQING XINKE DESIGN Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute Of Quality And Standardization, CHONGQING XINKE DESIGN Co Ltd filed Critical Chongqing Institute Of Quality And Standardization
Priority to CN201910605245.XA priority Critical patent/CN110909158B/en
Publication of CN110909158A publication Critical patent/CN110909158A/en
Application granted granted Critical
Publication of CN110909158B publication Critical patent/CN110909158B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention claims a text classification method based on an improved firefly algorithm and K nearest neighbor. A text feature selection model is constructed by combining information gain with the firefly algorithm. First, all features are ranked by information gain; then the strong optimizing capability of the improved firefly algorithm is used to find a more representative feature subset from the top-ranked features. The step factor α in the firefly algorithm is adjusted so that both the global and local search capabilities of the algorithm are preserved, and a new fitness function is introduced that moderately reduces the feature dimensionality while improving the precision of the feature subset. Finally, the model is used for text feature selection, and the resulting feature subset is used for KNN text classification.

Description

Text classification method based on improved firefly algorithm and K nearest neighbor
Technical Field
The invention belongs to the field of Chinese text classification, and particularly relates to a text classification method based on an improved firefly algorithm and K nearest neighbor.
Background
With the rapid development of internet technology, more and more users can conveniently obtain information resources on the internet and publish information themselves; each user is simultaneously a publisher and a receiver of information. Although information is presented in increasingly rich forms, its main carrier to date is still text. Faced with such a huge amount of text data, people find it difficult to locate the information that interests them. Organizing and managing these text data by traditional manual methods would not only require a great deal of labor but would be practically infeasible. People are therefore driven to find new technology that can organize and manage this overabundant information efficiently and accurately, so that truly useful information is presented clearly. Text classification is an effective way to solve this problem: it can effectively help people organize and categorize information data and thus greatly alleviate the problem of information clutter.
At present, the precision of the feature subsets selected by the traditional feature selection methods used in text classification is not high. For example, Document Frequency (DF) deletes words that occur infrequently but carry much information; CHI-square statistics (CHI) considers only whether a word appears, not how often it occurs; Information Gain (IG) considers only a word's contribution to the corpus as a whole and does not relate it to individual categories; and Mutual Information (MI) tends to select low-frequency words.
In the standard firefly algorithm, the search strategy depends on a control parameter α, usually taken as a constant, that controls the step length of each position update. If the parameter is too large, the algorithm converges with difficulty and the number of computations increases markedly; if it is too small, the algorithm has poor global search capability and converges to a local optimum. A feasibility test of the firefly algorithm in the field of text classification shows that during the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position. At that point the individuals are already very close to the optimum, and at the next position update the original step length is likely to overshoot it, so the fireflies swing back and forth around the optimal position. This lowers the efficiency of the search for the optimal solution and degrades the algorithm's convergence precision and speed.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. The proposed text classification method based on the improved firefly algorithm and K nearest neighbor remedies the firefly algorithm's tendency to fall into local optima and its slow convergence when searching for the optimal text feature subset, thereby obtaining a more accurate subset and improving text classification accuracy. The technical scheme of the invention is as follows:
a text classification method based on an improved firefly algorithm and K nearest neighbor comprises the following steps:
step 1: acquiring texts and dividing them into a training set and a test set, preprocessing both sets, including word segmentation and stop-word removal, calculating the information gain of each word and ranking the words by it, and retaining the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initializing the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max. In the standard firefly algorithm, the search strategy depends on the control parameter α, usually taken as a constant, which controls the step length of each position update. However, in the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position; the individuals are then already very close to the optimum, and at the next position update the original step length is likely to overshoot it, causing back-and-forth oscillation, so that the search for the optimal solution is inefficient and the convergence precision and speed of the algorithm suffer. To avoid this, a formula for dynamically updating α is proposed as follows:
α_t = α_0 · (T_max − t) / T_max
where α_t denotes the step length at the t-th position update, α_0 the initialized α value, and T_max the maximum number of iterations. Thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly.
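The dynamic step-factor schedule described above can be sketched in a few lines. The patent's exact formula is rendered as an image in the source, so the linear decay below is an assumed reconstruction consistent with the stated behavior (large α early, small α late):

```python
def step_factor(alpha0: float, t: int, t_max: int) -> float:
    """Step factor for the t-th update: assumed linear decay alpha0 * (T_max - t) / T_max."""
    return alpha0 * (t_max - t) / t_max

# alpha shrinks from alpha0 toward 0 as the iterations progress
print([round(step_factor(0.5, t, 50), 3) for t in (0, 25, 50)])  # [0.5, 0.25, 0.0]
```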
step 3: the low-brightness fireflies move toward the high-brightness fireflies, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly after the position update is calculated; low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. If the iteration count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise the search continues.
step 4: performing classification with a KNN classifier using the obtained optimal text feature subset.
Further, the text preprocessing of step 1 specifically includes: 6000 articles from six categories of the Fudan University corpus are selected as the data set, with each category divided into 800 training documents and 200 test documents. The training and test sets are segmented into words with the jieba toolkit, stop words are removed from the text sets according to the HIT (Harbin Institute of Technology) stopword list, the information gain of each word is calculated, and the words are ranked from largest to smallest by the obtained value.
Further, the information gain calculation formula of step 1 is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j) log P(C_j) + P(t) Σ_{j=1}^{m} P(C_j|t) log P(C_j|t) + P(t̄) Σ_{j=1}^{m} P(C_j|t̄) log P(C_j|t̄)

where IG(t) denotes the information gain value of feature t; E(C) is the entropy of the text set when feature t is not considered; E(C|t) is the entropy of the text set when feature t is considered; P(C_j) is the probability that a document of class C_j appears in the corpus; P(t) is the probability that a document containing feature t appears; P(C_j|t) is the probability that a document containing feature t belongs to class C_j; P(t̄) is the probability that a document not containing feature t appears; P(C_j|t̄) is the probability that a document not containing feature t belongs to class C_j; m denotes the number of categories, and j indexes a category.
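As a concrete illustration of the information-gain criterion above, a minimal computation over a toy labelled corpus might look as follows (the documents and labels are invented for the example; `math.log2` is used, though any logarithm base gives the same ranking):

```python
import math

def information_gain(docs, labels, term):
    """IG(t) = E(C) - E(C|t): entropy drop from splitting on presence of `term`.
    docs: list of token sets; labels: parallel list of class labels."""
    classes = set(labels)

    def entropy(subset_labels):
        total = len(subset_labels)
        e = 0.0
        for c in classes:
            p = subset_labels.count(c) / total if total else 0.0
            if p > 0:
                e -= p * math.log2(p)
        return e

    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    p_t = len(with_t) / len(docs)
    return entropy(labels) - (p_t * entropy(with_t) + (1 - p_t) * entropy(without_t))

docs = [{"ball", "game"}, {"ball", "score"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]
print(round(information_gain(docs, labels, "ball"), 3))  # 1.0
```

Here "ball" perfectly separates the two classes, so its gain equals the full class entropy; a word that appears in only one document of one class, such as "game", scores lower.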
Further, the text feature preselection set serves as the input of the improved firefly algorithm; firefly positions are initialized randomly, and the fitness of all fireflies in the population is calculated. Specifically: the vector space model (VSM) is adopted for text representation, so a document can be regarded as a vector in an n-dimensional space and each document is treated as a firefly. At the initial moment the fireflies are randomly distributed over the whole search space; a dimly glowing firefly moves toward a brightly glowing one, positions are updated through these moves, and the optimal position is eventually found, completing the optimization. In the firefly algorithm, a firefly's attractive force is determined by its own brightness and its attractiveness. Brightness is related to the firefly's position: the better the position, the brighter the firefly. Attractiveness is proportional to brightness, i.e. the brighter a firefly, the stronger its attraction; when two fireflies are equally bright, they move randomly. The brightness and attractiveness of fireflies mimic the way light propagating through a medium is absorbed and attenuated, so both decrease as the distance between fireflies increases. The firefly algorithm follows the following three assumptions:
1) all fireflies are unisex, and any firefly may attract any other;
2) attractiveness is determined only by brightness and distance: a high-brightness firefly attracts the surrounding low-brightness fireflies, but the attraction decreases as distance increases, and the firefly with the highest brightness moves randomly;
3) a firefly's brightness is obtained by evaluating the fitness function.
Further, the firefly algorithm is initially used for solving a continuous optimization problem, and the selection of the text features belongs to a combinatorial optimization problem, so that in order to use the firefly algorithm for searching for an optimal feature subset, a Sigmoid function is introduced to convert a position update formula, and the Sigmoid function is defined as follows:
P_ij = 1 / (1 + e^{−θ_ij})
where P_ij is the probability value for the j-th dimension of the i-th firefly's vector and θ_ij sets the abscissa of the function; θ_ij is calculated as follows:
θ_ij = β_0 e^{−γ r_ij²} (x_kj − x_ij) + α (rand − 1/2)
where β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension of high-brightness firefly k; x_ij the j-th dimension of low-brightness firefly i; α a random-step parameter; and rand a random number uniformly distributed on [0,1].
Binary coding is used, with 0 and 1 indicating whether a feature is selected; each firefly represents a feature subset whose length equals the total number of features, and a candidate optimal solution has the form

x_i = (x_i1, x_i2, …, x_in)

where x_id denotes the value of the d-th dimension of the i-th firefly, x_id ∈ {0,1}, i = 1, 2, …, N, and N is the population size, i.e. the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n. If a position holds the value 0, the feature at the corresponding position is not selected; conversely, if it holds the value 1, that feature is selected.
Further, the initial population is randomly generated from a series of binary numbers, and in the iteration of the algorithm, the rule for updating the ith firefly position is as follows:
x_ij^{t+1} = 1 if rand < P_ij, otherwise 0
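A sketch of this binary position update, with the Sigmoid mapping and the θ_ij movement term reconstructed from the definitions above. The exact formulas appear as images in the source, so the constants and the α · (rand − 1/2) perturbation should be treated as assumptions:

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def update_position(x_i, x_k, beta0, gamma, alpha, rng):
    """Move low-brightness firefly x_i toward high-brightness firefly x_k, per dimension."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))  # squared distance r_ij^2
    new = []
    for xi, xk in zip(x_i, x_k):
        theta = beta0 * math.exp(-gamma * r2) * (xk - xi) + alpha * (rng.random() - 0.5)
        new.append(1 if rng.random() < sigmoid(theta) else 0)  # rand < P_ij
    return new

rng = random.Random(0)
moved = update_position([0, 0, 0, 0], [1, 1, 1, 1], beta0=1.0, gamma=1.0, alpha=0.5, rng=rng)
print(len(moved))  # 4
```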
further, in the improved firefly algorithm, the step size factor α is adjusted;
a formula for dynamically updating α is presented as follows:
α_t = α_0 · (T_max − t) / T_max

where α_0 denotes the initialized α value, α_t the value of α at the t-th iteration, T_max the maximum number of iterations, and t the current iteration.
Further, taking the reduction of the feature-subset dimensionality as a secondary optimization objective, a new fitness function is introduced as follows:
fit(x(i)) = ω · P + (1 − ω) · (1 − |x(i)| / n)

where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| is the modulus of the feature-subset vector x(i), i.e. the number of selected features, and n is the total number of features.
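A minimal sketch of this fitness function under the stated reading: the accuracy P is weighted by ω, and the remaining (1 − ω) weight rewards smaller subsets, with |x(i)| counted as the number of selected features:

```python
def fitness(accuracy, subset, omega=0.95):
    """fit = omega * P + (1 - omega) * (1 - |x(i)| / n) for a binary subset vector."""
    n = len(subset)
    return omega * accuracy + (1 - omega) * (1 - sum(subset) / n)

# same accuracy, fewer selected features -> higher fitness
print(fitness(0.9, [1, 0, 0, 0]) > fitness(0.9, [1, 1, 1, 1]))  # True
```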
Further, step 4 applies the obtained feature subset to a KNN classifier for text classification. The KNN procedure is: compute the distance between the new sample point and every sample point in the training set, select the K closest sample points, and assign the new point to the class most represented among those K points. The distance may be the Euclidean distance or the cosine similarity; when cosine similarity is used, the formula is as follows:
sim(d_i, d_j) = Σ_{k=1}^{M} w_ik · w_jk / ( sqrt(Σ_{k=1}^{M} w_ik²) · sqrt(Σ_{k=1}^{M} w_jk²) )
where d_i denotes the feature vector of the new sample point i and d_j the feature vector of sample point j in the training set; w_ik is the weight of the k-th dimension of text i and w_jk the weight of the k-th dimension of text j; M is the dimensionality of the feature vectors;
the formula of degree of membership is as follows:
μ(D_i, C_m) = Σ_{j ∈ KNN} sim(D_i, d_j) · δ(d_j, C_m)
where δ(d, C_m) indicates whether document d belongs to class C_m: if it does, δ(d, C_m) is 1, otherwise it is 0.
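The cosine similarity and the membership vote can be sketched together. The weighted vote below, which sums similarity over the K nearest neighbours per class, is one common reading of the membership formula above; the toy vectors and labels are invented for illustration:

```python
import math

def cosine_sim(a, b):
    """sim(d_i, d_j) = sum_k w_ik * w_jk / (||d_i|| * ||d_j||)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def knn_classify(new_vec, train, k):
    """train: list of (vector, label). Each class scores the summed similarity
    of its members among the K nearest neighbours (sim * delta)."""
    nearest = sorted(((cosine_sim(new_vec, v), c) for v, c in train), reverse=True)[:k]
    score = {}
    for s, c in nearest:
        score[c] = score.get(c, 0.0) + s
    return max(score, key=score.get)

train = [([1.0, 0.0], "sport"), ([0.9, 0.1], "sport"), ([0.0, 1.0], "politics")]
print(knn_classify([1.0, 0.05], train, k=2))  # sport
```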
Further, the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity of the text to be tested and all texts in the training set, sequencing, and selecting K most similar neighbor texts;
3) calculating the membership degree of the text to be detected in each category, and judging the text to be detected as the category with the maximum membership degree;
the classification effect is represented by precision (P), recall (R) and F1The values to evaluate:
P = TP / (TP + FP)

R = TP / (TP + FN)

F1 = 2 · P · R / (P + R)
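The three evaluation measures reduce to a few lines; this sketch uses the confusion-matrix counts TP, FP, FN as defined in the text just below:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_value(p, r):
    return 2 * p * r / (p + r)

# e.g. 80 true positives, 20 false positives, 20 false negatives
p, r = precision(80, 20), recall(80, 20)
print(p, r, round(f1_value(p, r), 3))  # 0.8 0.8 0.8
```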
wherein TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
The invention has the following advantages and beneficial effects:
the method combines information gain and an improved firefly algorithm to construct a new text feature selection model, firstly, the information gain is utilized to sequence all features, then, stronger optimizing capability of the firefly algorithm is utilized to find out more representative feature subsets on feature sets which are sequenced at the front, aiming at the defects that the firefly algorithm is easy to fall into local optimization, complex calculation, slow convergence and the like, step size factors α in the algorithm are adjusted, the overall searching capability of the algorithm is ensured, the local searching capability is also ensured, a new fitness function is introduced, the number of the features is properly reduced on the basis of improving the accuracy of the feature subsets, finally, the model is used for text feature selection, and the obtained feature subsets are used on a KNN classifier to improve the accuracy of text classification.
Drawings
FIG. 1 is a schematic flow chart of the preferred embodiment of the present invention
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the input text set is first segmented with the jieba toolkit, and stop words are then removed from the text set according to the HIT (Harbin Institute of Technology) stopword list. Next, the information gain of each word is calculated, the words are ranked from largest to smallest by the obtained value, and the top-ranked features are retained to obtain the text feature preselection set. The information gain calculation formula is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j) log P(C_j) + P(t) Σ_{j=1}^{m} P(C_j|t) log P(C_j|t) + P(t̄) Σ_{j=1}^{m} P(C_j|t̄) log P(C_j|t̄)

where IG(t) denotes the information gain value of feature t; E(C) is the entropy of the text set when feature t is not considered; E(C|t) is the entropy of the text set when feature t is considered; P(C_j) is the probability that a document of class C_j appears in the corpus; P(t) is the probability that a document containing feature t appears; P(C_j|t) is the probability that a document containing feature t belongs to class C_j; P(t̄) is the probability that a document not containing feature t appears; P(C_j|t̄) is the probability that a document not containing feature t belongs to class C_j; m denotes the number of categories.
The firefly algorithm is then given by the following mathematical description:
definition 1 light emission luminance of firefly:
E = E_0 · e^{−γ r²}

where E_0 denotes the brightness of the brightest firefly (the brightness at r = 0), γ is the light absorption coefficient, and r is the distance between fireflies.
Definition 2 attraction of fireflies:
β(r) = β_0 · e^{−γ r²}

where β_0 is the attractiveness at distance r = 0, i.e. the maximum attractiveness.
Define 3 firefly location update:
x_i^{t+1} = x_i^t + β_0 e^{−γ r_ij²} (x_j^t − x_i^t) + α (rand − 1/2)

where α is a random-step parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:
r_ij = ‖x_i − x_j‖ = sqrt( Σ_{d=1}^{D} (x_id − x_jd)² )

where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
The firefly algorithm was originally proposed to solve the continuous optimization problem, while the selection of text features belongs to the combinatorial optimization problem. Therefore, in order to use the firefly algorithm for searching the optimal feature subset, a Sigmoid function is introduced to convert a position updating formula. The Sigmoid function is defined as follows:
P_ij = 1 / (1 + e^{−θ_ij})

where P_ij is the probability value for the j-th dimension of the i-th firefly's vector and θ_ij sets the abscissa of the function; θ_ij is calculated as follows:

θ_ij = β_0 e^{−γ r_ij²} (x_kj − x_ij) + α (rand − 1/2)

where β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension of high-brightness firefly k; x_ij the j-th dimension of low-brightness firefly i; α a random-step parameter; and rand a random number uniformly distributed on [0,1].
The invention adopts binary coding, with 0 and 1 representing whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features. A candidate optimal solution has the form

x_i = (x_i1, x_i2, …, x_in)

where x_id denotes the value of the d-th dimension of the i-th firefly, x_id ∈ {0,1}, i = 1, 2, …, N, and N is the population size, i.e. the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n. If a position holds the value 0, the feature at the corresponding position is not selected; conversely, if it holds the value 1, that feature is selected.
The initial population is randomly generated as a series of binary strings. For example, when F = (f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8) is the initial feature set and the initial population size N is set to 3, the fireflies can be expressed as:

x_1 = (0, 1, 0, 1, 1, 0, 1, 0)
x_2 = (1, 1, 0, 0, 1, 0, 1, 0)
x_3 = (0, 0, 1, 1, 0, 1, 0, 1)

where x_1 represents the feature subset (f_2, f_4, f_5, f_7), x_2 the subset (f_1, f_2, f_5, f_7), and x_3 the subset (f_3, f_4, f_6, f_8). In the iteration of the algorithm, the rule for updating the i-th firefly's position is:

x_ij^{t+1} = 1 if rand < P_ij, otherwise 0
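The binary encoding in this example can be checked with a small helper that maps a firefly vector back to the feature subset it encodes (`decode` is a name introduced here purely for illustration):

```python
def decode(firefly, features):
    """Return the feature subset selected by a binary firefly vector."""
    return tuple(f for bit, f in zip(firefly, features) if bit == 1)

F = ("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8")
print(decode((0, 1, 0, 1, 1, 0, 1, 0), F))  # ('f2', 'f4', 'f5', 'f7')
```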
Initialize the algorithm parameters: population size N = 50, maximum number of iterations T_max = 50, light absorption coefficient γ = 1, initial step factor α_0 = 0.5, maximum attractiveness β_0 = 1, and fitness-function constant ω = 0.95.
Then, calculating the fitness value of the firefly according to the fitness function, wherein the calculation formula is as follows:
fit(x(i)) = ω · P + (1 − ω) · (1 − |x(i)| / n)

where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| is the modulus of the feature-subset vector x(i), i.e. the number of selected features, and n is the total number of features.
Next, the low-brightness fireflies move toward the high-brightness ones, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly after the position update is calculated; low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. The position update formula is as follows:
x_i^{t+1} = x_i^t + β_0 e^{−γ r_ij²} (x_j^t − x_i^t) + α (rand − 1/2)

where α is a random-step parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:

r_ij = ‖x_i − x_j‖ = sqrt( Σ_{d=1}^{D} (x_id − x_jd)² )

where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
The algorithm achieves optimization through brightness differences: fireflies move toward brighter individuals. After the iteration has progressed to a certain degree, all fireflies gather near the optimal position. The individuals are then already very close to the optimum, and at the next iteration's position update the original step length is likely to overshoot it, causing the fireflies to swing back and forth. The step factor is therefore updated dynamically:
α_t = α_0 · (T_max − t) / T_max

where α_t denotes the value of α at the t-th iteration, α_0 the initialized α value, T_max the maximum number of iterations, and t the current iteration. Thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly.
And finally, when the iteration times reach the set value, outputting the obtained optimal feature subset by the algorithm, and using the optimal feature subset for text classification.
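Putting the steps above together, an end-to-end sketch of the improved binary firefly search might look as follows. This is a simplified toy, not the patent's exact implementation: the seed is fixed, the fitness function is supplied by the caller, and brightness is recomputed once per iteration rather than after every individual move:

```python
import math
import random

def improved_firefly_search(n_features, fitness_of, pop=6, t_max=30,
                            alpha0=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Search for a good binary feature subset; fitness_of maps a 0/1 list to [0, 1]."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop)]
    best, best_fit = None, -1.0
    for t in range(t_max):
        alpha = alpha0 * (t_max - t) / t_max          # dynamic step factor
        fits = [fitness_of(x) for x in swarm]         # brightness of each firefly
        for i in range(pop):
            for k in range(pop):
                if fits[k] > fits[i]:                  # i moves toward brighter k
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[k]))
                    for j in range(n_features):
                        theta = (beta0 * math.exp(-gamma * r2)
                                 * (swarm[k][j] - swarm[i][j])
                                 + alpha * (rng.random() - 0.5))
                        p = 1.0 / (1.0 + math.exp(-theta))
                        swarm[i][j] = 1 if rng.random() < p else 0
        for x in swarm:                                # track the best subset seen
            f = fitness_of(x)
            if f > best_fit:
                best, best_fit = list(x), f
    return best, best_fit

# toy objective: recover a known 'good' feature mask
target = [1, 0, 1, 0, 1, 0, 1, 0]
best, bf = improved_firefly_search(8, lambda x: sum(a == b for a, b in zip(x, target)) / 8)
print(len(best))  # 8
```

In a real run the fitness callback would train and score the KNN classifier on the subset, which is the expensive part; the swarm mechanics themselves are cheap.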
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A text classification method based on an improved firefly algorithm and K nearest neighbor is characterized by comprising the following steps:
step 1: acquiring texts and dividing them into a training set and a test set, preprocessing both sets, including word segmentation and stop-word removal, calculating the information gain of each word and ranking the words by it, and retaining the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initializing the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max. The search strategy of the firefly algorithm depends on a control parameter α, usually taken as a constant, which controls the step length of each position update. However, in the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position; the firefly individuals are then very close to the optimum, and at the next position update the original step length is likely to overshoot it, causing back-and-forth oscillation, so that the search for the optimal solution is inefficient and the convergence precision and speed of the algorithm suffer. To avoid this, a formula for dynamically updating α is proposed as follows:
α_t = α_0 · (T_max − t) / T_max
where α_t denotes the step length at the t-th position update, α_0 the initialized α value, T_max the maximum number of iterations, and t the current iteration; thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly;
step 3: the low-brightness fireflies move toward the high-brightness fireflies, and positions are updated according to each firefly's modified per-dimension data; the fitness of each firefly after the position update is calculated, low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1; if the iteration count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise the search continues;
and step 4: classifying the obtained optimal text feature subset with a KNN classifier.
2. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 1, wherein the text preprocessing of step 1 specifically comprises: selecting 6000 articles of six categories from the Fudan University corpus as the data set, with each category divided into 800 training documents and 200 test documents; segmenting the training and test sets into words with the jieba toolkit; removing stop words from the training and test sets according to the Harbin Institute of Technology stop word list; then calculating the information gain of each word and sorting the words from largest to smallest by the obtained value.
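The preprocessing steps (word segmentation, stop-word removal) can be sketched as follows; a plain whitespace split stands in for the jieba segmenter so the example stays self-contained, and the tiny stop-word list is illustrative rather than the actual Harbin Institute of Technology list:

```python
def preprocess(doc: str, stopwords: set) -> list:
    """Tokenize a document and drop stop words.

    In the patented method jieba performs Chinese word segmentation; a
    whitespace split is used here so the sketch runs without dependencies.
    """
    tokens = doc.split()  # stand-in for jieba.lcut(doc)
    return [t for t in tokens if t not in stopwords]

stop = {"the", "a", "of"}
print(preprocess("the firefly algorithm selects a subset of features", stop))
# ['firefly', 'algorithm', 'selects', 'subset', 'features']
```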
3. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 1 or 2, wherein the information gain calculation formula of the step 1 is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j)·log P(C_j) + P(t)·Σ_{j=1}^{m} P(C_j|t)·log P(C_j|t) + P(t̄)·Σ_{j=1}^{m} P(C_j|t̄)·log P(C_j|t̄)
in the formula, IG(t) represents the information gain value of the feature t; E(C) represents the entropy of the text set without considering the feature t; E(C|t) represents the entropy of the text set when the feature t is considered; P(C_j) represents the probability that a document of class C_j occurs in the corpus; P(t) represents the probability that a document containing the feature t occurs; P(C_j|t) represents the probability that a document containing the feature t belongs to class C_j;
P(t̄)
representing the probability of occurrence of a document without the feature t;
P(C_j|t̄)
represents the probability that a document not containing the feature t belongs to class C_j; m represents the number of categories, and j denotes a particular category.
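A minimal sketch of the information-gain computation of step 1, using natural logarithms (the claim does not fix the log base, which only rescales the ranking); the toy documents and labels are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """E(C) = -sum_j P(C_j) * log P(C_j) over the class labels."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """IG(t) = E(C) - E(C|t): how much knowing feature t reduces class entropy."""
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if feature in d]
    without_t = [l for d, l in zip(docs, labels) if feature not in d]
    # E(C|t) weights the entropy of each partition by its probability.
    e_cond = sum((len(part) / n) * entropy(part) for part in (with_t, without_t) if part)
    return entropy(labels) - e_cond

docs = [{"ball", "game"}, {"ball"}, {"vote"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
# "ball" perfectly separates the two classes, so IG("ball") = E(C) = ln 2.
print(round(information_gain(docs, labels, "ball"), 4))  # 0.6931
```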
4. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 3, wherein the step of taking a preselected set of text features as the input of the improved firefly algorithm, randomly initializing the firefly positions and calculating the fitness of all fireflies in the population comprises: representing the texts with the vector space model (VSM), so that a document can be regarded as a vector in an n-dimensional space and each document is regarded as a firefly. At the initial moment the fireflies are randomly distributed over the whole search space; a firefly with weak luminescence moves toward one with strong luminescence, positions are updated through this movement, and the optimal position is finally found to complete the optimization. In the firefly algorithm, the attraction strength of a firefly is determined by its own brightness and its attractiveness. The brightness is related to the firefly's position: the better the position, the brighter the firefly. The attractiveness is proportional to the brightness, that is, the greater the brightness, the stronger the attraction; when two fireflies are equally bright, a firefly moves randomly. The brightness and attractiveness of a firefly mimic the way light propagating in a medium is absorbed and fades, so both decrease as the distance between fireflies increases. The firefly algorithm follows the following three assumptions:
1) all fireflies are unisex, so any firefly can attract any other;
2) the attraction degree is determined only by the brightness and the distance, the high-brightness firefly attracts the surrounding low-brightness firefly, but the attraction degree decreases with the increase of the distance, and the firefly with the highest brightness moves randomly;
3) the luminance of the firefly is obtained by calculating a fitness function.
5. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 4, wherein the firefly algorithm was originally designed for continuous optimization problems, while the selection of text features is a combinatorial optimization problem; therefore, in order to use the firefly algorithm to search for the optimal feature subset, a Sigmoid function is introduced to convert the position update formula, the Sigmoid function being defined as follows:
P_ij = 1 / (1 + e^(−θ_ij))
P_ij is the probability value for the j-th dimension of the i-th firefly, and θ_ij is the abscissa of the function, where θ_ij is calculated as follows:
θ_ij = β_0 · e^(−γ·r_ij²) · (x_kj − x_ij) + α · (rand − 1/2)
in the formula, β_0 represents the maximum attractiveness; γ represents the light absorption coefficient; r_ij represents the distance between firefly i and firefly j; x_kj represents the j-th dimension vector value of the high-brightness firefly k; x_ij represents the j-th dimension vector value of the low-brightness firefly i; α represents the random-step parameter; and rand is a random number uniformly distributed on [0,1];
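A sketch of the Sigmoid-based binary move of claim 5, under the assumption that θ_ij is the standard firefly increment β_0·e^(−γ·r_ij²)·(x_kj − x_ij) + α·(rand − 1/2) and that each bit is then set to 1 with probability P_ij:

```python
import math
import random

def binary_update(x_i, x_k, beta0=1.0, gamma=1.0, alpha=0.5, rng=None):
    """Move low-brightness firefly i toward high-brightness firefly k.

    Each dimension's continuous increment theta_ij is squashed through the
    Sigmoid P_ij = 1 / (1 + exp(-theta_ij)) and sampled into a 0/1 bit.
    """
    rng = rng or random.Random()
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))  # squared distance r_ij^2
    new_pos = []
    for xij, xkj in zip(x_i, x_k):
        theta = beta0 * math.exp(-gamma * r2) * (xkj - xij) + alpha * (rng.random() - 0.5)
        p = 1.0 / (1.0 + math.exp(-theta))  # Sigmoid conversion to a probability
        new_pos.append(1 if rng.random() < p else 0)
    return new_pos

moved = binary_update([0, 1, 0, 1], [1, 1, 0, 0], rng=random.Random(42))
print(all(bit in (0, 1) for bit in moved))  # True
```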
Binary coding is adopted, with 0 and 1 indicating whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features, and the optimal solution is denoted as
X_i = (x_i1, x_i2, …, x_in)
x_id represents the value of the d-th dimension of the i-th firefly, where x_ij ∈ {0,1}, i = 1,2,…,N, and N is the size of the population, i.e. the number of fireflies. Assuming a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n, where
x_ij = 1 if feature f_j is selected, and x_ij = 0 otherwise.
If the value at a certain position is 0, the feature at the corresponding position is not selected; conversely, if the value at a certain position is 1, the feature at that position is selected.
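For illustration, a binary firefly vector can be decoded back into its feature subset (the feature names here are hypothetical):

```python
def decode_subset(firefly, features):
    """Keep feature f_j exactly when the j-th bit x_ij equals 1."""
    return [f for bit, f in zip(firefly, features) if bit == 1]

print(decode_subset([1, 0, 1, 0], ["w1", "w2", "w3", "w4"]))  # ['w1', 'w3']
```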
6. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5, wherein the initial population is randomly generated as a series of binary strings, and during the iterations of the algorithm the rule for updating the position of the i-th firefly is as follows:
x_ij(t+1) = 1 if rand < P_ij; otherwise x_ij(t+1) = 0
7. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5 or 6, wherein in the improved firefly algorithm the step factor α is adjusted dynamically;
a formula for dynamically updating α is presented as follows:
α_t = α_0 · (1 − t / T_max)
where α_0 denotes the initially set α value, α_t denotes the value of α at the t-th iteration, T_max denotes the maximum number of iterations, and t denotes the t-th iteration.
8. The method for classifying texts based on the firefly algorithm and the K nearest neighbor as claimed in claim 7, wherein reducing the dimension of the feature subset is taken as a secondary optimization objective, and a new fitness function is introduced as follows:
Fitness = ω·P + (1 − ω)·(1 − |X(i)| / n)
where ω is a constant slightly less than 1, P is the classification accuracy, |X(i)| represents the modulus (the number of selected features) of the feature subset vector X(i), and n is the total number of features.
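A sketch of the fitness function, assuming accuracy and subset size are combined linearly as described; ω = 0.95 is an illustrative value for the "constant slightly less than 1":

```python
def fitness(accuracy, subset, n, omega=0.95):
    """Fitness = omega * P + (1 - omega) * (1 - |X(i)| / n).

    Rewards high classification accuracy P first, and among equally accurate
    solutions prefers the smaller feature subset (the secondary objective).
    """
    selected = sum(subset)  # |X(i)|: number of 1-bits in the binary vector
    return omega * accuracy + (1 - omega) * (1 - selected / n)

# Same accuracy, smaller subset -> higher fitness.
print(round(fitness(0.9, [1, 0, 0, 0], 4), 4))  # 0.8925
```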
9. The method for text classification based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 8, wherein step 4 uses the obtained feature subset with a KNN classifier for text classification, and the process of the KNN algorithm is as follows: calculate the distance between the new sample point and all sample points in the training set, select the K closest sample points, and assign the new sample point to the class to which most of those K points belong. The distance can be the Euclidean distance or the cosine similarity; when cosine similarity is used, the formula is as follows:
sim(d_i, d_j) = Σ_{k=1}^{M} w_ik·w_jk / ( √(Σ_{k=1}^{M} w_ik²) · √(Σ_{k=1}^{M} w_jk²) )
in the formula, d_i represents the feature vector of the new sample point i, and d_j represents the feature vector of sample point j in the training set; w_ik is the k-th dimension weight of text i, and w_jk is the k-th dimension weight of text j; M is the dimension of the feature vector;
the formula of degree of membership is as follows:
μ(D, C_m) = Σ_{i=1}^{K} sim(D, D_i)·δ(D_i, C_m)
where δ(D_i, C_m) indicates whether document D_i belongs to class C_m: if D_i belongs to class C_m, then δ(D_i, C_m) is 1; otherwise it is 0.
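The cosine-similarity KNN decision described in claims 9 and 10 can be sketched as follows, summing similarities per class as the membership degree; the toy vectors and labels are illustrative:

```python
import math

def cosine_sim(a, b):
    """sim(d_i, d_j): dot product of the weight vectors over their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_classify(query, train, labels, k=3):
    """Rank training texts by cosine similarity, keep the K nearest, and
    accumulate sim(D, D_i) per class; the class with the largest membership
    degree wins."""
    ranked = sorted(zip(train, labels), key=lambda p: cosine_sim(query, p[0]), reverse=True)
    membership = {}
    for vec, lab in ranked[:k]:
        membership[lab] = membership.get(lab, 0.0) + cosine_sim(query, vec)
    return max(membership, key=membership.get)

train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = ["sport", "sport", "politics"]
print(knn_classify([1.0, 0.05], train, labels, k=3))  # sport
```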
10. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 9, wherein the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity between the text to be tested and all texts in the training set, sorting the results, and selecting the K most similar neighbor texts;
3) calculating the membership degree of the text to be tested in each category, and judging the text to be tested as the category with the maximum membership degree;
the classification effect is evaluated by the precision (P), the recall (R) and the F1 value:
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2·P·R / (P + R)
wherein, TP represents a true positive example, TN represents a true negative example, FP represents a false positive example, and FN represents a false negative example.
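The three evaluation measures follow directly from the confusion-matrix counts:

```python
def evaluate(tp, fp, fn):
    """Precision P = TP/(TP+FP), recall R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(tuple(round(x, 3) for x in evaluate(80, 20, 20)))  # (0.8, 0.8, 0.8)
```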
CN201910605245.XA 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor Active CN110909158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910605245.XA CN110909158B (en) 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor

Publications (2)

Publication Number Publication Date
CN110909158A true CN110909158A (en) 2020-03-24
CN110909158B CN110909158B (en) 2022-10-18

Family

ID=69814440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910605245.XA Active CN110909158B (en) 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor

Country Status (1)

Country Link
CN (1) CN110909158B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000116A (en) * 2020-07-24 2020-11-27 西北工业大学 Heading angle control method of autonomous underwater vehicle based on improved firefly PID method
CN112446774A (en) * 2020-10-30 2021-03-05 杭州衡泰软件有限公司 Financial statement quality early warning method
CN113345420A (en) * 2021-06-07 2021-09-03 河海大学 Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation
CN114822823A (en) * 2022-05-11 2022-07-29 云南升玥信息技术有限公司 Tumor fine classification system based on cloud computing and artificial intelligence fusion multi-dimensional medical data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366362A (en) * 2013-04-17 2013-10-23 昆明理工大学 Glowworm optimization algorithm-based ore zone image segmentation method
AU2018100796A4 (en) * 2018-06-14 2018-07-19 Macau University Of Science And Technology A genetic feature identifying system and a search method for identifying features of genetic information
CN108388666A (en) * 2018-03-16 2018-08-10 重庆邮电大学 A kind of database multi-list Connection inquiring optimization method based on glowworm swarm algorithm
CN108876029A (en) * 2018-06-11 2018-11-23 南京航空航天大学 A kind of passenger flow forecasting based on the adaptive chaos firefly of double populations
CN109657147A (en) * 2018-12-21 2019-04-19 岭南师范学院 Microblogging abnormal user detection method based on firefly and weighting extreme learning machine
CN109711636A (en) * 2019-01-09 2019-05-03 南京工业大学 A kind of river level prediction technique promoting tree-model based on chaos firefly and gradient

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
AKSHI KUMAR et al.: "A Filter-Wrapper based Feature Selection for Optimized Website Quality Prediction", 2019 Amity International Conference on Artificial Intelligence (AICAI) *
G. VENKATA HARI PRASAD et al.: "Performance analysis of feature selection methods for feature extracted PCG signals", 2015 13th International Conference on Electromagnetic Interference and Compatibility (INCEMIC) *
LONG ZHANG et al.: "Optimal feature selection using distance-based discrete firefly algorithm with mutual information criterion", Neural Computing and Applications *
ZUO, Zhongliang et al.: "An Improved Firefly Algorithm", Microelectronics & Computer *
WANG, Aifang et al.: "A Relevance Feedback Method for Image Retrieval Based on Firefly and SVM", Journal of Chinese Computer Systems *
CHEN, Dong: "Research on Text Clustering Based on Firefly Algorithm", China Master's Theses Full-text Database (Information Science and Technology) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant