CN110909158A - Text classification method based on improved firefly algorithm and K nearest neighbor - Google Patents
- Publication number
- CN110909158A (application CN201910605245.XA)
- Authority
- CN
- China
- Prior art keywords
- firefly
- text
- algorithm
- feature
- brightness
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention claims a text classification method based on an improved firefly algorithm and K nearest neighbor (KNN). A text feature selection model is constructed by combining information gain with the firefly algorithm: first, all features are ranked by information gain; then the strong optimizing capability of the improved firefly algorithm is applied to the top-ranked features to find a more representative feature subset. The step factor α of the firefly algorithm is adjusted so that both the global and the local search capability of the algorithm are preserved, and a new fitness function is introduced that moderately reduces the feature dimensionality while improving the precision of the feature subset. Finally, the model is used for text feature selection, and the resulting feature subset is used for KNN text classification.
Description
Technical Field
The invention belongs to the field of Chinese text classification, and particularly relates to a text classification method based on an improved firefly algorithm and K nearest neighbor.
Background
With the rapid development of Internet technology, more and more users can conveniently obtain information resources on the Internet and also publish information there; users are simultaneously publishers and receivers of information. Although the forms in which information is presented have grown ever richer, the dominant form remains text. Faced with such an enormous amount of text data, people have difficulty finding the information that interests them. Organizing and managing these text data by traditional manual methods would not only demand a great deal of labor but would be practically infeasible. People are therefore driven to seek new techniques that organize and manage this abundant information efficiently and accurately, so that truly useful data can be presented clearly. Text classification is an effective way to achieve this: it helps people organize and categorize information and thereby resolves the problem of information clutter to a large extent.
At present, the precision of the feature subsets selected by the traditional text feature selection methods used in text classification is not high. For example, Document Frequency (DF) deletes words that occur rarely but carry much information; CHI-square statistics (CHI) considers only whether a word appears, not how often; Information Gain (IG) considers only a word's contribution to the corpus as a whole and ignores individual categories; Mutual Information (MI) tends to select low-frequency words.
In the standard firefly algorithm, the search strategy depends on a control parameter α, usually taken as a constant, which controls the step length of each position update. If the parameter is too large, the algorithm converges with difficulty and the number of computations rises sharply; if it is too small, the global search capability is poor and the algorithm converges to a local optimum. A feasibility analysis of the firefly algorithm in the field of text classification shows that during the solving process, i.e., after a certain number of iterations, all fireflies gather near the optimal position. At that point the individuals are already very close to the optimum, and if the next position update still uses the original step length, the move may overshoot the optimal position and the fireflies oscillate back and forth around it. This lowers the efficiency of the search for the optimal solution and degrades both the convergence precision and the convergence speed of the algorithm.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. The proposed text classification method based on the improved firefly algorithm and K nearest neighbor mitigates the firefly algorithm's tendency to fall into local optima and its slow convergence during the search for the optimal text feature subset, thereby obtaining a more accurate subset and improving text classification accuracy. The technical scheme of the invention is as follows:
a text classification method based on an improved firefly algorithm and K nearest neighbor comprises the following steps:
step 1: acquire texts and divide them into a training set and a test set; preprocess both sets, including word segmentation and stop-word removal; calculate the information gain of each word and sort accordingly; retain the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initialize the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max. In the standard firefly algorithm the search strategy depends on the control parameter α, usually a constant that controls the step length of each position update. However, during the solving process, i.e., after a certain number of iterations, all fireflies gather near the optimal position; the individuals are then very close to the optimum, and the next update with the original step length may overshoot it, causing back-and-forth oscillation, low optimal-solution search efficiency, and degraded convergence precision and speed of the algorithm. To avoid this, a formula for dynamically updating α is proposed as follows:
where α_t denotes the step size at the t-th position update, α_0 the α value set at initialization, and T_max the maximum number of iterations. Thus α is relatively large in the early stage of the algorithm, giving a wide search range that avoids trapping in local optima, and relatively small in the later stage, permitting a fine local search that quickly locates the global optimum.
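The dynamic step-factor update itself appears as an image in the published text; a minimal sketch consistent with the description (α large early, small late, smallest at T_max) is a linear decay, shown here as an assumption rather than the patent's exact formula:

```python
def step_factor(t, t_max, alpha0=0.5):
    """Linearly decaying step factor alpha_t: large early (wide global
    search), small late (fine local search). The exact update formula in
    the patent is not reproduced; this decay is an assumption matching
    the described behaviour."""
    return alpha0 * (1.0 - t / t_max)
```

For example, with α_0 = 0.5 and T_max = 50, the factor starts at 0.5 and shrinks toward 0 as the iterations approach the maximum.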
And step 3: the low-brightness fireflies move toward the high-brightness fireflies, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly after the position update is calculated, fireflies with low fitness are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. If the count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise, the search continues.
And 4, step 4: and classifying the obtained optimal text feature subset by adopting a KNN classifier.
Further, the text preprocessing of step 1 specifically comprises: 6000 articles from six categories of the Fudan University corpus are selected as the data set, and each category is divided into 800 training and 200 test documents. The words in the training and test sets are segmented with the jieba software, stop words are removed from the text set according to the Harbin Institute of Technology (HIT) stopword list, the information gain of each word is calculated, and the words are sorted from large to small by the obtained values.
Further, the information gain calculation formula of step 1 is as follows:
In the formula, IG(t) denotes the information gain value of feature t; E(C) the entropy of the text set without considering feature t; E(C|t) the entropy of the text set when feature t is considered; P(C_j) the probability that a document of class C_j appears in the corpus; P(t) the probability that a document containing feature t appears; P(C_j|t) the probability that a document containing feature t belongs to class C_j; P(t̄) the probability that a document without feature t appears; P(C_j|t̄) the probability that a document without feature t belongs to class C_j; m denotes the number of categories, and j denotes a certain category.
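As a sketch of the information-gain computation described above (IG(t) = E(C) - E(C|t)), the following helper treats each document as a set of tokens; the function names are illustrative, not the patent's:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(term) = E(C) - E(C|term): class entropy minus the entropy
    conditioned on presence/absence of `term`. `docs` is a list of token
    sets and `labels` the class of each document."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = 0.0
    for part in (with_t, without_t):
        if part:
            conditional += len(part) / n * entropy(part)
    return entropy(labels) - conditional
```

A term that perfectly separates the classes attains the full class entropy as its gain, while a term spread evenly across classes gains nothing.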
Further, the text feature preselection set is used as the input of the improved firefly algorithm, the firefly positions are initialized randomly, and the fitness of all fireflies in the population is calculated. Specifically: the vector space model (VSM) is adopted for text representation, so a document can be regarded as a vector in an n-dimensional space and each document as a firefly. At the initial moment the fireflies are randomly distributed over the whole search space; a weakly luminous firefly moves toward a strongly luminous one, updating its position as it moves, until the optimal position is found and the optimization is complete. In the firefly algorithm, the attractive force exerted by a firefly is determined by its own brightness and its attractiveness. The brightness is related to the firefly's position (the better the position, the brighter), and the attractiveness is proportional to the brightness, i.e., the brighter a firefly, the stronger its attraction; when two fireflies are equally bright they move randomly. The brightness and attractiveness imitate the way light propagating through a medium is absorbed and attenuated, so both decrease as the distance between fireflies grows. The firefly algorithm follows the following three assumptions:
1) all fireflies are unisex, so any firefly can attract any other;
2) the attractiveness is determined only by brightness and distance: a high-brightness firefly attracts the surrounding low-brightness fireflies, but the attractiveness decreases as the distance increases, and the firefly with the highest brightness moves randomly;
3) the luminance of the firefly is obtained by calculating a fitness function.
Further, the firefly algorithm was initially designed for continuous optimization problems, while text feature selection is a combinatorial optimization problem. Therefore, in order to use the firefly algorithm to search for the optimal feature subset, a Sigmoid function is introduced to convert the position update formula. The Sigmoid function is defined as follows:
where P_ij is the probability value for the j-th dimension of the i-th firefly and θ_ij denotes the abscissa of the function; θ_ij is computed as follows:
In the formula, β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension value of the high-brightness firefly k, and x_ij the j-th dimension value of the low-brightness firefly i; α is a random parameter, and rand is a random number uniformly distributed on [0,1].
Binary coding is used, with 0 and 1 representing whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features. The optimal solution has the form X_i = (x_i1, x_i2, …, x_in), where x_id denotes the value of the d-th dimension of the i-th firefly, x_ij ∈ {0,1}, i = 1,2,…,N, and N is the population size, i.e., the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n: if the value at some position is 0, the feature at the corresponding position is not selected; conversely, a value of 1 means the feature at that position is selected.
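A hedged sketch of the Sigmoid-based binary position update described above: the continuous movement term θ_ij is squashed by the Sigmoid into a probability P_ij, and each bit of the dimmer firefly is then resampled with that probability. The default parameter values and the exact composition of θ_ij are assumptions, since the patent's formulas appear as images:

```python
import math
import random

def sigmoid(theta):
    """Sigmoid squashing: maps theta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))

def binarize_move(x_i, x_k, beta0=1.0, gamma=1.0, alpha=0.5):
    """Move low-brightness firefly i toward high-brightness firefly k in
    binary space. Per dimension, theta combines the attraction term
    beta0 * exp(-gamma * r^2) * (x_kj - x_ij) with a random perturbation
    alpha * (rand - 0.5); the bit is then sampled with probability
    sigmoid(theta). Parameter defaults are illustrative assumptions."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))
    new_x = []
    for xi, xk in zip(x_i, x_k):
        theta = beta0 * math.exp(-gamma * r2) * (xk - xi) + alpha * (random.random() - 0.5)
        new_x.append(1 if random.random() < sigmoid(theta) else 0)
    return new_x
```

Because the output is sampled, repeated calls with the same inputs may yield different bit vectors; only the probabilities are fixed.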
Further, the initial population is randomly generated from a series of binary numbers, and in the iteration of the algorithm, the rule for updating the ith firefly position is as follows:
further, in the improved firefly algorithm, the step size factor α is adjusted;
a formula for dynamically updating α is presented as follows:
where α_0 denotes the α value set at initialization, α_t the α value at the t-th iteration, T_max the maximum number of iterations, and t the t-th iteration.
Further, taking the reduction of the feature-subset dimensionality as a secondary optimization objective, a new fitness function is introduced as follows:
where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| denotes the vector modulus (the number of selected features) of feature subset x(i), and n is the total number of features.
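The fitness formula itself is an image in the published text; a sketch consistent with the description (accuracy as the primary objective, subset size as a secondary one, ω slightly below 1) might combine the two terms linearly, as assumed here:

```python
def fitness(accuracy, subset, omega=0.95):
    """Assumed fitness f = omega * P + (1 - omega) * (1 - |x|/n),
    where P is classification accuracy, |x| the number of selected
    features (1-bits in `subset`) and n the subset length. Higher
    accuracy and fewer features both raise the score; the exact
    weighting is an assumption, not the patent's published formula."""
    n = len(subset)
    selected = sum(subset)
    return omega * accuracy + (1 - omega) * (1 - selected / n)
```

At equal accuracy, a smaller subset scores strictly higher, which is exactly the secondary-objective behaviour the description calls for.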
Further, in step 4 the obtained feature subset is used with a KNN classifier for text classification. The KNN algorithm proceeds as follows: compute the distance between the new sample point and every sample point in the training set, select the K closest ones, and assign the new point to the class most common among those K points. The distance may be the Euclidean distance or the cosine similarity; when the cosine similarity is used, the formula is as follows:
In the formula, d_i denotes the feature vector of the new sample point i and d_j the feature vector of sample point j in the training set; w_ik is the weight of the k-th dimension of text i and w_jk the weight of the k-th dimension of text j; M is the dimensionality of the feature vectors;
the formula of degree of membership is as follows:
where δ(d_i, C_m) indicates whether document d_i belongs to class C_m: if the new sample point d_i belongs to class C_m, then δ(d_i, C_m) is 1; otherwise it is 0.
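A minimal sketch of the KNN step with cosine similarity and a plain majority vote over the K nearest neighbors (a simple unweighted stand-in for the membership computation above):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(x, train, labels, k=3):
    """Rank training vectors by cosine similarity to x and return the
    majority class among the k most similar ones."""
    order = sorted(range(len(train)), key=lambda j: cosine(x, train[j]), reverse=True)
    votes = Counter(labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]
```

Ties in the vote fall to whichever class `Counter` lists first; a production version would break ties by similarity.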
Further, the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity of the text to be tested and all texts in the training set, sequencing, and selecting K most similar neighbor texts;
3) calculating the membership degree of the text to be detected in each category, and judging the text to be detected as the category with the maximum membership degree;
the classification effect is represented by precision (P), recall (R) and F1The values to evaluate:
where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives. The invention has the following advantages and beneficial effects:
the method combines information gain and an improved firefly algorithm to construct a new text feature selection model, firstly, the information gain is utilized to sequence all features, then, stronger optimizing capability of the firefly algorithm is utilized to find out more representative feature subsets on feature sets which are sequenced at the front, aiming at the defects that the firefly algorithm is easy to fall into local optimization, complex calculation, slow convergence and the like, step size factors α in the algorithm are adjusted, the overall searching capability of the algorithm is ensured, the local searching capability is also ensured, a new fitness function is introduced, the number of the features is properly reduced on the basis of improving the accuracy of the feature subsets, finally, the model is used for text feature selection, and the obtained feature subsets are used on a KNN classifier to improve the accuracy of text classification.
Drawings
FIG. 1 is a schematic flow chart of the preferred embodiment of the present invention
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, firstly, the text input set is segmented by using jieba software, and then the text set is subjected to stop word processing according to the stop word list of the hayage. And then calculating the information gain of each word, sequencing the words from large to small according to the obtained values, and reserving the characteristics in the front of the sequencing to obtain a text characteristic preselection set. The information gain calculation formula is as follows:
as shown in fig. 1, firstly, the text input set is segmented by using jieba software, and then the text set is subjected to stop word processing according to the stop word list of the hayage. And then calculating the information gain of each word, sequencing the words from large to small according to the obtained values, and reserving the characteristics in the front of the sequencing to obtain a text characteristic preselection set. The information gain calculation formula is as follows:
In the formula, IG(t) denotes the information gain value of feature t; E(C) the entropy of the text set without considering feature t; E(C|t) the entropy of the text set when feature t is considered; P(C_j) the probability that a document of class C_j appears in the corpus; P(t) the probability that a document containing feature t appears; P(C_j|t) the probability that a document containing feature t belongs to class C_j; P(t̄) the probability that a document without feature t appears; P(C_j|t̄) the probability that a document without feature t belongs to class C_j; m denotes the number of categories.
Then defined according to the mathematical description of the firefly algorithm:
definition 1 light emission luminance of firefly:
where E_0 denotes the luminance of the brightest firefly, γ is the light absorption coefficient, and r is the distance between fireflies.
Definition 2 attraction of fireflies:
where β_0 is the attractiveness at distance 0, i.e., the maximum attractiveness.
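Both quantities decay with distance the way light absorbed in a medium does; a sketch using the standard firefly-algorithm forms (the exponential-decay expressions are assumptions, since the patent's formula images are not reproduced):

```python
import math

def luminance(e0, gamma, r):
    """Definition 1 (assumed form): perceived luminance at distance r,
    E(r) = E0 * exp(-gamma * r^2), decaying with absorption."""
    return e0 * math.exp(-gamma * r * r)

def attractiveness(beta0, gamma, r):
    """Definition 2 (assumed form): attractiveness at distance r,
    beta(r) = beta0 * exp(-gamma * r^2); beta0 is the value at r = 0,
    i.e., the maximum attractiveness."""
    return beta0 * math.exp(-gamma * r * r)
```

At r = 0 each function returns its maximum (E_0 or β_0), and both fall off monotonically as r grows, matching the qualitative description in the text.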
Define 3 firefly location update:
where α is a random parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:
where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
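The inter-firefly distance r_ij is the ordinary Euclidean distance over the D data dimensions, e.g.:

```python
import math

def firefly_distance(x_i, x_j):
    """Euclidean distance r_ij between fireflies i and j,
    sqrt(sum over d of (x_id - x_jd)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
```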
The firefly algorithm was originally proposed to solve the continuous optimization problem, while the selection of text features belongs to the combinatorial optimization problem. Therefore, in order to use the firefly algorithm for searching the optimal feature subset, a Sigmoid function is introduced to convert a position updating formula. The Sigmoid function is defined as follows:
where P_ij is the probability value for the j-th dimension of the i-th firefly and θ_ij denotes the abscissa of the function; θ_ij is computed as follows:
In the formula, β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension value of the high-brightness firefly k, and x_ij the j-th dimension value of the low-brightness firefly i; α is a random parameter, and rand is a random number uniformly distributed on [0,1].
The invention adopts binary coding, with 0 and 1 representing whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features. The optimal solution has the form X_i = (x_i1, x_i2, …, x_in), where x_id denotes the value of the d-th dimension of the i-th firefly, x_ij ∈ {0,1}, i = 1,2,…,N, and N is the population size, i.e., the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n: if the value at some position is 0, the feature at the corresponding position is not selected; conversely, a value of 1 means the feature at that position is selected.
The initial population is randomly generated as a series of binary strings. For example, when F = (f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8) is the initial feature set and the population size N is set to 3, the fireflies can be expressed as follows: X_1 represents the feature subset (f_2, f_4, f_5, f_7), i.e., 01011010; X_2 the subset (f_1, f_2, f_5, f_7), i.e., 11001010; X_3 the subset (f_3, f_4, f_6, f_8), i.e., 00110101. In each iteration of the algorithm, the rule for updating the i-th firefly's position is as follows:
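Random generation of the initial binary population can be sketched as follows (each bit drawn uniformly from {0, 1}; bit j = 1 means feature f_(j+1) is selected):

```python
import random

def init_population(pop_size, n_features, seed=None):
    """Randomly generate the initial population: pop_size fireflies,
    each a binary vector of length n_features."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]
```

With pop_size = 3 and n_features = 8 this yields three length-8 bit vectors like the X_1, X_2, X_3 of the example above.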
initializing algorithm parameters: the maximum number of iterations T with the population size N equal to 50max50, the light absorption coefficient γ is 1, the initial step factor α0Maximum attraction β at 0.50The constant ω in the fitness function is 0.95.
Then, calculating the fitness value of the firefly according to the fitness function, wherein the calculation formula is as follows:
where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| denotes the vector modulus (the number of selected features) of feature subset x(i), and n is the total number of features.
Next, the low-brightness fireflies move toward the high-brightness ones, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly is recalculated after the position update; fireflies with low fitness are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. The position update formula is as follows:
where α is a random parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:
where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
The algorithm achieves optimization because differences in brightness make fireflies move toward brighter individuals. After a certain number of iterations, however, all fireflies gather near the optimal position; the individuals are then very close to the optimum, and in the next iteration an update with the original step length may overshoot the optimal position, producing back-and-forth oscillation. The step factor α is therefore updated dynamically:
where α_t denotes the α value at the t-th iteration, α_0 the α value set at initialization, and T_max the maximum number of iterations, t being the current iteration. Thus the α value is relatively large in the early stage of the algorithm, giving a wide search range that avoids trapping in local optima, and relatively small in the later stage, permitting a fine local search that quickly finds the global optimum.
And finally, when the iteration times reach the set value, outputting the obtained optimal feature subset by the algorithm, and using the optimal feature subset for text classification.
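Putting the pieces together, a self-contained sketch of the improved search loop: random binary initialization, Sigmoid-binarized movement toward brighter fireflies, a linearly decaying step factor, and retention of the best subset found. All concrete update expressions here are assumptions consistent with the description, since the patent's formulas appear as images:

```python
import math
import random

def improved_firefly_search(eval_subset, n_features, pop_size=10,
                            t_max=30, alpha0=0.5, beta0=1.0, gamma=1.0,
                            seed=0):
    """Sketch of the improved firefly feature search. `eval_subset` maps
    a 0/1 vector to a fitness score (brightness). Dimmer fireflies move
    toward brighter ones via a Sigmoid-binarized update; the step factor
    alpha decays linearly with the iteration count; only improving moves
    are kept; the best (fitness, subset) pair is returned."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    fit = [eval_subset(x) for x in pop]
    best = max(zip(fit, pop))
    for t in range(t_max):
        alpha = alpha0 * (1.0 - t / t_max)            # dynamic step factor
        for i in range(pop_size):
            for j in range(pop_size):
                if fit[j] > fit[i]:                   # j is brighter: i moves toward j
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    attract = beta0 * math.exp(-gamma * r2)
                    new_x = []
                    for xi, xj in zip(pop[i], pop[j]):
                        theta = attract * (xj - xi) + alpha * (rng.random() - 0.5)
                        p = 1.0 / (1.0 + math.exp(-theta))
                        new_x.append(1 if rng.random() < p else 0)
                    f = eval_subset(new_x)
                    if f > fit[i]:                    # discard non-improving moves
                        pop[i], fit[i] = new_x, f
        best = max(best, max(zip(fit, pop)))
    return best[1], best[0]
```

In practice `eval_subset` would train and score a KNN classifier restricted to the selected features; any cheap surrogate scoring function can be substituted when experimenting with the search itself.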
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (10)
1. A text classification method based on an improved firefly algorithm and K nearest neighbor is characterized by comprising the following steps:
step 1: acquire texts and divide them into a training set and a test set; preprocess both sets, including word segmentation and stop-word removal; calculate the information gain of each word and sort accordingly; retain the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initialize the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max; the search strategy of the firefly algorithm depends on the control parameter α, usually taken as a constant that controls the step length of each position update; however, during the solving process, i.e., after a certain number of iterations, all fireflies gather near the optimal position, the individuals are then very close to the optimum, and the next update with the original step length may overshoot it, causing back-and-forth oscillation, low optimal-solution search efficiency, and degraded convergence precision and speed of the algorithm; to avoid this, a formula for dynamically updating α is proposed as follows:
wherein α_t denotes the step size at the t-th position update, α_0 the α value set at initialization, and T_max the maximum number of iterations, t being the t-th iteration; thus the α value is relatively large in the early stage of the algorithm, giving a wide search range that avoids trapping in local optima, and relatively small in the later stage, permitting a fine local search that quickly finds the global optimum;
and step 3: the low-brightness fireflies move toward the high-brightness fireflies and positions are updated according to each firefly's modified per-dimension data; the fitness of each firefly after the position update is calculated, fireflies with low fitness are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1; if the count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise the search continues;
and 4, step 4: and classifying the obtained optimal text feature subset by adopting a KNN classifier.
2. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 1, wherein the step 1 text preprocessing specifically comprises: selecting 6000 articles from six categories of the Fudan University corpus as the data set, each category divided into 800 training and 200 test documents; segmenting the words of the training and test sets with the jieba software; removing stop words according to the Harbin Institute of Technology (HIT) stopword list; then calculating the information gain of each word and sorting from large to small by the obtained values.
3. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 1 or 2, wherein the information gain calculation formula of the step 1 is as follows:
In the formula, IG(t) denotes the information gain value of feature t; E(C) the entropy of the text set without considering feature t; E(C|t) the entropy of the text set when feature t is considered; P(C_j) the probability that a document of class C_j appears in the corpus; P(t) the probability that a document containing feature t appears; P(C_j|t) the probability that a document containing feature t belongs to class C_j; P(t̄) the probability that a document without feature t appears; P(C_j|t̄) the probability that a document without feature t belongs to class C_j; m denotes the number of categories, and j denotes a certain category.
4. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 3, wherein the step of taking the text feature preselection set as the input of the improved firefly algorithm, randomly initializing the firefly positions and calculating the fitness of all fireflies in the population comprises: adopting the vector space model VSM for text representation, so that a document can be regarded as a vector in an n-dimensional space and each document as a firefly; at the initial moment the fireflies are randomly distributed over the whole search space, a weakly luminous firefly moves toward a strongly luminous one, updating its position as it moves, until the optimal position is found and the optimization is complete; in the firefly algorithm the attractive force of a firefly is determined by its own brightness and its attractiveness, the brightness is related to the firefly's position (the better the position, the brighter), and the attractiveness is proportional to the brightness, that is, the brighter a firefly, the stronger its attraction; when two fireflies are equally bright they move randomly; the brightness and attractiveness imitate the attenuation of light absorbed while propagating through a medium, so both decrease as the distance between fireflies increases; the firefly algorithm follows the following three assumptions:
1) all fireflies attract each other regardless of sex;
2) attractiveness is determined only by brightness and distance: a firefly with higher brightness attracts the surrounding fireflies with lower brightness, but attractiveness decreases as the distance increases, and the firefly with the highest brightness moves randomly;
3) the brightness of a firefly is obtained by calculating a fitness function.
5. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 4, wherein the firefly algorithm was originally designed for continuous optimization problems, whereas the selection of text features is a combinatorial optimization problem; therefore, in order to use the firefly algorithm to search for the optimal feature subset, a Sigmoid function is introduced to transform the position update formula, the Sigmoid function being defined as follows:
in the formula, Pij is the probability value of the j-th dimension vector of the i-th firefly, and θij is the abscissa value fed into the function, where θij is calculated as follows:
in the formula, β0 represents the maximum attractiveness; γ represents the light absorption coefficient; rij represents the distance between firefly i and firefly j; xkj represents the j-th dimension vector value of the high-brightness firefly k, and xij represents the j-th dimension vector value of the low-brightness firefly i; α represents a random parameter, and rand is a random number uniformly distributed on [0, 1];
binary coding is adopted, with 0 and 1 representing whether a feature is selected; each firefly represents a feature subset whose length is equal to the total number of features, and the optimal solution is expressed in the same form; xid denotes the value of the d-th dimension vector of the i-th firefly, where xij ∈ {0,1}, i = 1, 2, …, N, and N is the size of the population, i.e., the number of fireflies. Assuming a feature set F(f1, f2, f3, …, fn), a firefly is represented as a binary vector of length n: if the value at a certain position is 0, the feature at the corresponding position is not selected; conversely, if the value at a certain position is 1, the feature at that position is selected.
6. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5, wherein the initial population is randomly generated as a series of binary strings, and in each iteration of the algorithm the rule for updating the position of the i-th firefly is as follows:
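The update rule itself appears only as an image in the patent; under the symbols listed in claim 5 (β0, γ, rij, xkj, xij, α, rand), a common discrete firefly update using the Sigmoid transform might look like this sketch — the attraction term β0·exp(−γ·r²)·(xkj − xij) plus a random step is an assumption, not the patent's exact formula:

```python
import math
import random

def update_position(x_i, x_k, beta0=1.0, gamma=1.0, alpha=0.5):
    """Binary position update for a dim firefly x_i moving toward a brighter
    firefly x_k. Each dimension's real-valued step theta is squashed through a
    Sigmoid into a probability, then thresholded to 0/1 (claims 5-6)."""
    # squared distance between the two binary position vectors
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))
    new_x = []
    for x_ij, x_kj in zip(x_i, x_k):
        theta = (beta0 * math.exp(-gamma * r2) * (x_kj - x_ij)
                 + alpha * (random.random() - 0.5))
        p_ij = 1.0 / (1.0 + math.exp(-theta))     # Sigmoid transform
        new_x.append(1 if random.random() < p_ij else 0)
    return new_x
```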
7. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5 or 6, wherein in the improved firefly algorithm the step size factor α is adjusted; a formula for dynamically updating α is given as follows:
where α0 denotes the initial value of α, αt denotes the value of α at the t-th iteration, Tmax represents the maximum number of iterations, and t represents the current iteration.
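The dynamic α formula is likewise given only as an image; a plausible linearly decreasing schedule using the same symbols (α0, t, Tmax) is assumed in this sketch:

```python
def alpha_at(t, alpha0=0.5, t_max=100):
    """Dynamic step-size factor (claim 7): a linear decay from alpha0 at
    t = 0 down to 0 at t = Tmax, so early iterations explore widely and
    later ones fine-tune. The exact schedule is an assumption."""
    return alpha0 * (1 - t / t_max)
```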
8. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 7, wherein reducing the dimension of the feature subset is taken as a further, secondary optimization objective, and a new fitness function is introduced as follows:
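The new fitness function also appears only as an image; a common weighted form that rewards classification accuracy while penalizing the feature-subset dimension, consistent with the stated objectives, might be:

```python
def fitness(accuracy, n_selected, n_total, w=0.9):
    """Fitness of a firefly (feature subset), claim 8: a weighted sum of
    classifier accuracy and the fraction of features left unselected.
    The weight w and the exact form are assumptions, not the patent's."""
    return w * accuracy + (1 - w) * (1 - n_selected / n_total)
```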
9. The method for text classification based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 8, wherein in step 4 the obtained feature subset is used with a KNN classifier for text classification, the KNN algorithm proceeding as follows: calculate the distance between the new sample point and all sample points in the training set, select the K closest sample points, and assign the new sample point to the class to which most of those K points belong; the distance is measured by Euclidean distance or cosine similarity, and when cosine similarity is used the formula is as follows:
in the formula, di represents the feature vector of the new sample point i, and dj represents the feature vector of sample point j in the training set; wik is the k-th dimension weight of text i, and wjk is the k-th dimension weight of text j; M is the dimension of the feature vector;
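Under these definitions, the cosine similarity can be sketched as:

```python
import math

def cosine_similarity(d_i, d_j):
    """Cosine similarity between two text feature vectors (claim 9):
    sum_k(w_ik * w_jk) / (sqrt(sum_k w_ik^2) * sqrt(sum_k w_jk^2))."""
    num = sum(wi * wj for wi, wj in zip(d_i, d_j))
    den = (math.sqrt(sum(w * w for w in d_i))
           * math.sqrt(sum(w * w for w in d_j)))
    return num / den if den else 0.0    # guard against zero vectors
```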
the formula of degree of membership is as follows:
where δ(Di, Cm) indicates whether document Di belongs to class Cm: if the new sample point Di belongs to class Cm, then δ(Di, Cm) is 1, and otherwise it is 0.
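The K-nearest-neighbor vote using the membership degree δ can be sketched as follows; the function name and the similarity-weighted membership form are illustrative assumptions:

```python
def classify(sims_with_labels, k, classes):
    """KNN classification by membership degree (claim 9).

    sims_with_labels: list of (similarity, label) pairs, one per training text.
    Membership of the test text in class Cm is the sum of similarities over
    the K nearest neighbors whose delta(Di, Cm) = 1; the class with the
    largest membership wins."""
    top_k = sorted(sims_with_labels, key=lambda s: -s[0])[:k]
    membership = {c: sum(sim for sim, lbl in top_k if lbl == c)
                  for c in classes}
    return max(membership, key=membership.get)
```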
10. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 9, wherein the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity between the text to be tested and all texts in the training set, sorting the results, and selecting the K most similar neighbor texts;
3) calculating the membership degree of the text to be tested in each category, and judging the text to be tested as the category with the maximum membership degree;
the classification effect is evaluated by precision (P), recall (R) and the F1 value:
wherein, TP represents a true positive example, TN represents a true negative example, FP represents a false positive example, and FN represents a false negative example.
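From these confusion-matrix counts, the three evaluation measures can be computed with their standard definitions:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from the counts in claim 10:
    P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```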
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910605245.XA CN110909158B (en) | 2019-07-05 | 2019-07-05 | Text classification method based on improved firefly algorithm and K nearest neighbor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110909158A true CN110909158A (en) | 2020-03-24 |
CN110909158B CN110909158B (en) | 2022-10-18 |
Family
ID=69814440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910605245.XA Active CN110909158B (en) | 2019-07-05 | 2019-07-05 | Text classification method based on improved firefly algorithm and K nearest neighbor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110909158B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112000116A (en) * | 2020-07-24 | 2020-11-27 | 西北工业大学 | Heading angle control method of autonomous underwater vehicle based on improved firefly PID method |
CN112446774A (en) * | 2020-10-30 | 2021-03-05 | 杭州衡泰软件有限公司 | Financial statement quality early warning method |
CN113345420A (en) * | 2021-06-07 | 2021-09-03 | 河海大学 | Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation |
CN114822823A (en) * | 2022-05-11 | 2022-07-29 | 云南升玥信息技术有限公司 | Tumor fine classification system based on cloud computing and artificial intelligence fusion multi-dimensional medical data |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103366362A (en) * | 2013-04-17 | 2013-10-23 | 昆明理工大学 | Glowworm optimization algorithm-based ore zone image segmentation method |
AU2018100796A4 (en) * | 2018-06-14 | 2018-07-19 | Macau University Of Science And Technology | A genetic feature identifying system and a search method for identifying features of genetic information |
CN108388666A (en) * | 2018-03-16 | 2018-08-10 | 重庆邮电大学 | A kind of database multi-list Connection inquiring optimization method based on glowworm swarm algorithm |
CN108876029A (en) * | 2018-06-11 | 2018-11-23 | 南京航空航天大学 | A kind of passenger flow forecasting based on the adaptive chaos firefly of double populations |
CN109657147A (en) * | 2018-12-21 | 2019-04-19 | 岭南师范学院 | Microblogging abnormal user detection method based on firefly and weighting extreme learning machine |
CN109711636A (en) * | 2019-01-09 | 2019-05-03 | 南京工业大学 | A kind of river level prediction technique promoting tree-model based on chaos firefly and gradient |
Non-Patent Citations (6)
Title |
---|
AKSHI KUMAR et al.: "A Filter-Wrapper based Feature Selection for Optimized Website Quality Prediction", 2019 Amity International Conference on Artificial Intelligence (AICAI) *
G. VENKATA HARI PRASAD et al.: "Performance analysis of feature selection methods for feature extracted PCG signals", 2015 13th International Conference on Electromagnetic Interference and Compatibility (INCEMIC) *
LONG ZHANG et al.: "Optimal feature selection using distance-based discrete firefly algorithm with mutual information criterion", Neural Computing and Applications *
左仲亮 et al.: "An improved firefly algorithm", Microelectronics & Computer *
王爱芳 et al.: "A relevance feedback method for image retrieval based on firefly and SVM", Journal of Chinese Computer Systems *
陈东: "Research on text clustering based on the firefly algorithm", China Master's Theses Full-text Database (Information Science and Technology) *
Also Published As
Publication number | Publication date |
---|---|
CN110909158B (en) | 2022-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110909158B (en) | Text classification method based on improved firefly algorithm and K nearest neighbor | |
CN110598029B (en) | Fine-grained image classification method based on attention transfer mechanism | |
CN109376242B (en) | Text classification method based on cyclic neural network variant and convolutional neural network | |
CN109614979B (en) | Data augmentation method and image classification method based on selection and generation | |
CN108564129B (en) | Trajectory data classification method based on generation countermeasure network | |
US8566746B2 (en) | Parameterization of a categorizer for adjusting image categorization and retrieval | |
CN109389037B (en) | Emotion classification method based on deep forest and transfer learning | |
CN107683469A (en) | A kind of product classification method and device based on deep learning | |
CN109344884A (en) | The method and device of media information classification method, training picture classification model | |
CN111833322B (en) | Garbage multi-target detection method based on improved YOLOv3 | |
CN110287985B (en) | Depth neural network image identification method based on variable topology structure with variation particle swarm optimization | |
CN111105045A (en) | Method for constructing prediction model based on improved locust optimization algorithm | |
CN112733602B (en) | Relation-guided pedestrian attribute identification method | |
CN112182221A (en) | Knowledge retrieval optimization method based on improved random forest | |
CN112883931A (en) | Real-time true and false motion judgment method based on long and short term memory network | |
CN115457332A (en) | Image multi-label classification method based on graph convolution neural network and class activation mapping | |
CN111639695A (en) | Method and system for classifying data based on improved drosophila optimization algorithm | |
CN114896398A (en) | Text classification system and method based on feature selection | |
Qiao et al. | A multi-level thresholding image segmentation method using hybrid Arithmetic Optimization and Harris Hawks Optimizer algorithms | |
CN113836330A (en) | Image retrieval method and device based on generation antagonism automatic enhanced network | |
CN112148994B (en) | Information push effect evaluation method and device, electronic equipment and storage medium | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Wang et al. | An improved interactive genetic algorithm incorporating relevant feedback | |
Bai et al. | Learning high-level image representation for image retrieval via multi-task dnn using clickthrough data | |
CN112883930A (en) | Real-time true and false motion judgment method based on full-connection network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||