CN110909158A - Text classification method based on improved firefly algorithm and K nearest neighbor - Google Patents


Info

Publication number
CN110909158A
Authority
CN
China
Prior art keywords
firefly
text
algorithm
feature
brightness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910605245.XA
Other languages
Chinese (zh)
Other versions
CN110909158B (en)
Inventor
文武
赵成
刘颖
解如风
范荣妹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute Of Quality And Standardization
CHONGQING XINKE DESIGN Co Ltd
Original Assignee
Chongqing Institute Of Quality And Standardization
CHONGQING XINKE DESIGN Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute Of Quality And Standardization, CHONGQING XINKE DESIGN Co Ltd filed Critical Chongqing Institute Of Quality And Standardization
Priority to CN201910605245.XA priority Critical patent/CN110909158B/en
Publication of CN110909158A publication Critical patent/CN110909158A/en
Application granted granted Critical
Publication of CN110909158B publication Critical patent/CN110909158B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention claims a text classification method based on an improved firefly algorithm and K nearest neighbor. A text feature selection model is constructed by combining information gain with the firefly algorithm. First, all features are ranked by information gain; then the strong optimizing capability of the improved firefly algorithm is used to find a more representative feature subset from the top-ranked features. The step factor α in the firefly algorithm is adjusted so that both the global and local search capabilities of the algorithm are preserved, and a new fitness function is introduced that moderately reduces the feature dimensionality while improving the precision of the feature subset. Finally, the model is used for text feature selection, and the resulting feature subset is used for KNN text classification.

Description

Text classification method based on improved firefly algorithm and K nearest neighbor
Technical Field
The invention belongs to the field of Chinese text classification, and particularly relates to a text classification method based on an improved firefly algorithm and K nearest neighbor.
Background
With the rapid development of internet technology, more and more users can conveniently obtain information resources on the internet and publish information themselves; each user is simultaneously a publisher and a receiver of information. Although information is presented in increasingly rich forms, its main carrier to date is still text. Faced with such a huge amount of text data, people find it difficult to locate the information that interests them. Organizing and managing these text data by traditional manual methods would not only require a great deal of labor but would be practically infeasible. People are therefore driven to find new technology that can organize and manage this overabundant information efficiently and accurately, so that truly useful information is presented clearly. Text classification is an effective way to solve this problem: it can effectively help people organize and categorize information data and thus greatly alleviate the problem of information clutter.
At present, the precision of the feature subsets selected by the traditional feature selection methods used in text classification is not high. For example, Document Frequency (DF) deletes words that occur infrequently but carry much information; CHI-square statistics (CHI) considers only whether a word appears, not how often it occurs; Information Gain (IG) considers only a word's contribution to the corpus as a whole and does not relate it to individual categories; and Mutual Information (MI) tends to select low-frequency words.
In the standard firefly algorithm, the search strategy depends on a control parameter α, usually taken as a constant, that controls the step length of each position update. If the parameter is too large, the algorithm converges with difficulty and the number of computations increases markedly; if it is too small, the algorithm has poor global search capability and converges to a local optimum. A feasibility test of the firefly algorithm in the field of text classification shows that during the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position. At that point the individuals are already very close to the optimum, and at the next position update the original step length is likely to overshoot it, so the fireflies swing back and forth around the optimal position. This lowers the efficiency of the search for the optimal solution and degrades the algorithm's convergence precision and speed.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. The proposed text classification method based on the improved firefly algorithm and K nearest neighbor remedies the firefly algorithm's tendency to fall into local optima and its slow convergence when searching for the optimal text feature subset, thereby obtaining a more accurate subset and improving text classification accuracy. The technical scheme of the invention is as follows:
a text classification method based on an improved firefly algorithm and K nearest neighbor comprises the following steps:
step 1: acquiring texts and dividing them into a training set and a test set, preprocessing both sets, including word segmentation and stop-word removal, calculating the information gain of each word and ranking the words by it, and retaining the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initializing the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max. In the standard firefly algorithm, the search strategy depends on the control parameter α, usually taken as a constant, which controls the step length of each position update. However, in the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position; the individuals are then already very close to the optimum, and at the next position update the original step length is likely to overshoot it, causing back-and-forth oscillation, so that the search for the optimal solution is inefficient and the convergence precision and speed of the algorithm suffer. To avoid this, a formula for dynamically updating α is proposed as follows:
α_t = α_0 · (T_max − t) / T_max
where α_t denotes the step length at the t-th position update, α_0 the initialized α value, and T_max the maximum number of iterations. Thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly.
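The dynamic step-factor schedule described above can be sketched in a few lines. The patent's exact formula is rendered as an image in the source, so the linear decay below is an assumed reconstruction consistent with the stated behavior (large α early, small α late):

```python
def step_factor(alpha0: float, t: int, t_max: int) -> float:
    """Step factor for the t-th update: assumed linear decay alpha0 * (T_max - t) / T_max."""
    return alpha0 * (t_max - t) / t_max

# alpha shrinks from alpha0 toward 0 as the iterations progress
print([round(step_factor(0.5, t, 50), 3) for t in (0, 25, 50)])  # [0.5, 0.25, 0.0]
```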
step 3: the low-brightness fireflies move toward the high-brightness fireflies, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly after the position update is calculated; low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. If the iteration count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise the search continues.
step 4: performing classification with a KNN classifier using the obtained optimal text feature subset.
Further, the text preprocessing of step 1 specifically includes: 6000 articles from six categories of the Fudan University corpus are selected as the data set, with each category divided into 800 training documents and 200 test documents. The training and test sets are segmented into words with the jieba toolkit, stop words are removed from the text sets according to the HIT (Harbin Institute of Technology) stopword list, the information gain of each word is calculated, and the words are ranked from largest to smallest by the obtained value.
Further, the information gain calculation formula of step 1 is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j) log P(C_j) + P(t) Σ_{j=1}^{m} P(C_j|t) log P(C_j|t) + P(t̄) Σ_{j=1}^{m} P(C_j|t̄) log P(C_j|t̄)

where IG(t) denotes the information gain value of feature t; E(C) is the entropy of the text set when feature t is not considered; E(C|t) is the entropy of the text set when feature t is considered; P(C_j) is the probability that a document of class C_j appears in the corpus; P(t) is the probability that a document containing feature t appears; P(C_j|t) is the probability that a document containing feature t belongs to class C_j; P(t̄) is the probability that a document not containing feature t appears; P(C_j|t̄) is the probability that a document not containing feature t belongs to class C_j; m denotes the number of categories, and j indexes a category.
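As a concrete illustration of the information-gain criterion above, a minimal computation over a toy labelled corpus might look as follows (the documents and labels are invented for the example; `math.log2` is used, though any logarithm base gives the same ranking):

```python
import math

def information_gain(docs, labels, term):
    """IG(t) = E(C) - E(C|t): entropy drop from splitting on presence of `term`.
    docs: list of token sets; labels: parallel list of class labels."""
    classes = set(labels)

    def entropy(subset_labels):
        total = len(subset_labels)
        e = 0.0
        for c in classes:
            p = subset_labels.count(c) / total if total else 0.0
            if p > 0:
                e -= p * math.log2(p)
        return e

    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    p_t = len(with_t) / len(docs)
    return entropy(labels) - (p_t * entropy(with_t) + (1 - p_t) * entropy(without_t))

docs = [{"ball", "game"}, {"ball", "score"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]
print(round(information_gain(docs, labels, "ball"), 3))  # 1.0
```

Here "ball" perfectly separates the two classes, so its gain equals the full class entropy; a word that appears in only one document of one class, such as "game", scores lower.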
Further, the text feature preselection set serves as the input of the improved firefly algorithm; firefly positions are initialized randomly, and the fitness of all fireflies in the population is calculated. Specifically: the vector space model (VSM) is adopted for text representation, so a document can be regarded as a vector in an n-dimensional space and each document is treated as a firefly. At the initial moment the fireflies are randomly distributed over the whole search space; a dimly glowing firefly moves toward a brightly glowing one, positions are updated through these moves, and the optimal position is eventually found, completing the optimization. In the firefly algorithm, a firefly's attractive force is determined by its own brightness and its attractiveness. Brightness is related to the firefly's position: the better the position, the brighter the firefly. Attractiveness is proportional to brightness, i.e. the brighter a firefly, the stronger its attraction; when two fireflies are equally bright, they move randomly. The brightness and attractiveness of fireflies mimic the way light propagating through a medium is absorbed and attenuated, so both decrease as the distance between fireflies increases. The firefly algorithm follows the following three assumptions:
1) all fireflies are unisex, and any firefly may attract any other;
2) attractiveness is determined only by brightness and distance: a high-brightness firefly attracts the surrounding low-brightness fireflies, but the attraction decreases as distance increases, and the firefly with the highest brightness moves randomly;
3) a firefly's brightness is obtained by evaluating the fitness function.
Further, the firefly algorithm is initially used for solving a continuous optimization problem, and the selection of the text features belongs to a combinatorial optimization problem, so that in order to use the firefly algorithm for searching for an optimal feature subset, a Sigmoid function is introduced to convert a position update formula, and the Sigmoid function is defined as follows:
P_ij = 1 / (1 + e^{−θ_ij})
where P_ij is the probability value for the j-th dimension of the i-th firefly's vector and θ_ij sets the abscissa of the function; θ_ij is calculated as follows:
θ_ij = β_0 e^{−γ r_ij²} (x_kj − x_ij) + α (rand − 1/2)
where β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension of high-brightness firefly k; x_ij the j-th dimension of low-brightness firefly i; α a random-step parameter; and rand a random number uniformly distributed on [0,1].
Binary coding is used, with 0 and 1 indicating whether a feature is selected; each firefly represents a feature subset whose length equals the total number of features, and a candidate optimal solution has the form

x_i = (x_i1, x_i2, …, x_in)

where x_id denotes the value of the d-th dimension of the i-th firefly, x_id ∈ {0,1}, i = 1, 2, …, N, and N is the population size, i.e. the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n. If a position holds the value 0, the feature at the corresponding position is not selected; conversely, if it holds the value 1, that feature is selected.
Further, the initial population is randomly generated from a series of binary numbers, and in the iteration of the algorithm, the rule for updating the ith firefly position is as follows:
x_ij^{t+1} = 1 if rand < P_ij, otherwise 0
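A sketch of this binary position update, with the Sigmoid mapping and the θ_ij movement term reconstructed from the definitions above. The exact formulas appear as images in the source, so the constants and the α · (rand − 1/2) perturbation should be treated as assumptions:

```python
import math
import random

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def update_position(x_i, x_k, beta0, gamma, alpha, rng):
    """Move low-brightness firefly x_i toward high-brightness firefly x_k, per dimension."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))  # squared distance r_ij^2
    new = []
    for xi, xk in zip(x_i, x_k):
        theta = beta0 * math.exp(-gamma * r2) * (xk - xi) + alpha * (rng.random() - 0.5)
        new.append(1 if rng.random() < sigmoid(theta) else 0)  # rand < P_ij
    return new

rng = random.Random(0)
moved = update_position([0, 0, 0, 0], [1, 1, 1, 1], beta0=1.0, gamma=1.0, alpha=0.5, rng=rng)
print(len(moved))  # 4
```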
further, in the improved firefly algorithm, the step size factor α is adjusted;
a formula for dynamically updating α is presented as follows:
α_t = α_0 · (T_max − t) / T_max

where α_0 denotes the initialized α value, α_t the value of α at the t-th iteration, T_max the maximum number of iterations, and t the current iteration.
Further, taking the reduction of the feature-subset dimensionality as a secondary optimization objective, a new fitness function is introduced as follows:
fit(x(i)) = ω · P + (1 − ω) · (1 − |x(i)| / n)

where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| is the modulus of the feature-subset vector x(i), i.e. the number of selected features, and n is the total number of features.
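A minimal sketch of this fitness function under the stated reading: the accuracy P is weighted by ω, and the remaining (1 − ω) weight rewards smaller subsets, with |x(i)| counted as the number of selected features:

```python
def fitness(accuracy, subset, omega=0.95):
    """fit = omega * P + (1 - omega) * (1 - |x(i)| / n) for a binary subset vector."""
    n = len(subset)
    return omega * accuracy + (1 - omega) * (1 - sum(subset) / n)

# same accuracy, fewer selected features -> higher fitness
print(fitness(0.9, [1, 0, 0, 0]) > fitness(0.9, [1, 1, 1, 1]))  # True
```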
Further, step 4 applies the obtained feature subset to a KNN classifier for text classification. The KNN procedure is: compute the distance between the new sample point and every sample point in the training set, select the K closest sample points, and assign the new point to the class most represented among those K points. The distance may be the Euclidean distance or the cosine similarity; when cosine similarity is used, the formula is as follows:
sim(d_i, d_j) = Σ_{k=1}^{M} w_ik · w_jk / ( sqrt(Σ_{k=1}^{M} w_ik²) · sqrt(Σ_{k=1}^{M} w_jk²) )
where d_i denotes the feature vector of the new sample point i and d_j the feature vector of sample point j in the training set; w_ik is the weight of the k-th dimension of text i and w_jk the weight of the k-th dimension of text j; M is the dimensionality of the feature vectors;
the formula of degree of membership is as follows:
μ(D_i, C_m) = Σ_{j ∈ KNN} sim(D_i, d_j) · δ(d_j, C_m)
where δ(d, C_m) indicates whether document d belongs to class C_m: if it does, δ(d, C_m) is 1, otherwise it is 0.
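The cosine similarity and the membership vote can be sketched together. The weighted vote below, which sums similarity over the K nearest neighbours per class, is one common reading of the membership formula above; the toy vectors and labels are invented for illustration:

```python
import math

def cosine_sim(a, b):
    """sim(d_i, d_j) = sum_k w_ik * w_jk / (||d_i|| * ||d_j||)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def knn_classify(new_vec, train, k):
    """train: list of (vector, label). Each class scores the summed similarity
    of its members among the K nearest neighbours (sim * delta)."""
    nearest = sorted(((cosine_sim(new_vec, v), c) for v, c in train), reverse=True)[:k]
    score = {}
    for s, c in nearest:
        score[c] = score.get(c, 0.0) + s
    return max(score, key=score.get)

train = [([1.0, 0.0], "sport"), ([0.9, 0.1], "sport"), ([0.0, 1.0], "politics")]
print(knn_classify([1.0, 0.05], train, k=2))  # sport
```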
Further, the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity of the text to be tested and all texts in the training set, sequencing, and selecting K most similar neighbor texts;
3) calculating the membership degree of the text to be detected in each category, and judging the text to be detected as the category with the maximum membership degree;
the classification effect is represented by precision (P), recall (R) and F1The values to evaluate:
P = TP / (TP + FP)

R = TP / (TP + FN)

F1 = 2 · P · R / (P + R)
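The three evaluation measures reduce to a few lines; this sketch uses the confusion-matrix counts TP, FP, FN as defined in the text just below:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_value(p, r):
    return 2 * p * r / (p + r)

# e.g. 80 true positives, 20 false positives, 20 false negatives
p, r = precision(80, 20), recall(80, 20)
print(p, r, round(f1_value(p, r), 3))  # 0.8 0.8 0.8
```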
wherein TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
The invention has the following advantages and beneficial effects:
the method combines information gain and an improved firefly algorithm to construct a new text feature selection model, firstly, the information gain is utilized to sequence all features, then, stronger optimizing capability of the firefly algorithm is utilized to find out more representative feature subsets on feature sets which are sequenced at the front, aiming at the defects that the firefly algorithm is easy to fall into local optimization, complex calculation, slow convergence and the like, step size factors α in the algorithm are adjusted, the overall searching capability of the algorithm is ensured, the local searching capability is also ensured, a new fitness function is introduced, the number of the features is properly reduced on the basis of improving the accuracy of the feature subsets, finally, the model is used for text feature selection, and the obtained feature subsets are used on a KNN classifier to improve the accuracy of text classification.
Drawings
FIG. 1 is a schematic flow chart of the preferred embodiment of the present invention
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, the input text set is first segmented with the jieba toolkit, and stop words are then removed from the text set according to the HIT (Harbin Institute of Technology) stopword list. Next, the information gain of each word is calculated, the words are ranked from largest to smallest by the obtained value, and the top-ranked features are retained to obtain the text feature preselection set. The information gain calculation formula is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j) log P(C_j) + P(t) Σ_{j=1}^{m} P(C_j|t) log P(C_j|t) + P(t̄) Σ_{j=1}^{m} P(C_j|t̄) log P(C_j|t̄)

where IG(t) denotes the information gain value of feature t; E(C) is the entropy of the text set when feature t is not considered; E(C|t) is the entropy of the text set when feature t is considered; P(C_j) is the probability that a document of class C_j appears in the corpus; P(t) is the probability that a document containing feature t appears; P(C_j|t) is the probability that a document containing feature t belongs to class C_j; P(t̄) is the probability that a document not containing feature t appears; P(C_j|t̄) is the probability that a document not containing feature t belongs to class C_j; m denotes the number of categories.
The firefly algorithm is then given by the following mathematical description:
definition 1 light emission luminance of firefly:
E = E_0 · e^{−γ r²}

where E_0 denotes the brightness of the brightest firefly (the brightness at r = 0), γ is the light absorption coefficient, and r is the distance between fireflies.
Definition 2 attraction of fireflies:
β(r) = β_0 · e^{−γ r²}

where β_0 is the attractiveness at distance r = 0, i.e. the maximum attractiveness.
Define 3 firefly location update:
x_i^{t+1} = x_i^t + β_0 e^{−γ r_ij²} (x_j^t − x_i^t) + α (rand − 1/2)

where α is a random-step parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:
r_ij = ‖x_i − x_j‖ = sqrt( Σ_{d=1}^{D} (x_id − x_jd)² )

where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
The firefly algorithm was originally proposed to solve the continuous optimization problem, while the selection of text features belongs to the combinatorial optimization problem. Therefore, in order to use the firefly algorithm for searching the optimal feature subset, a Sigmoid function is introduced to convert a position updating formula. The Sigmoid function is defined as follows:
P_ij = 1 / (1 + e^{−θ_ij})

where P_ij is the probability value for the j-th dimension of the i-th firefly's vector and θ_ij sets the abscissa of the function; θ_ij is calculated as follows:

θ_ij = β_0 e^{−γ r_ij²} (x_kj − x_ij) + α (rand − 1/2)

where β_0 denotes the maximum attractiveness; γ the light absorption coefficient; r_ij the distance between firefly i and firefly j; x_kj the j-th dimension of high-brightness firefly k; x_ij the j-th dimension of low-brightness firefly i; α a random-step parameter; and rand a random number uniformly distributed on [0,1].
The invention adopts binary coding, with 0 and 1 representing whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features. A candidate optimal solution has the form

x_i = (x_i1, x_i2, …, x_in)

where x_id denotes the value of the d-th dimension of the i-th firefly, x_id ∈ {0,1}, i = 1, 2, …, N, and N is the population size, i.e. the number of fireflies. Given a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n. If a position holds the value 0, the feature at the corresponding position is not selected; conversely, if it holds the value 1, that feature is selected.
The initial population is randomly generated as a series of binary strings. For example, when F = (f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8) is the initial feature set and the initial population size N is set to 3, the fireflies can be expressed as:

x_1 = (0, 1, 0, 1, 1, 0, 1, 0)
x_2 = (1, 1, 0, 0, 1, 0, 1, 0)
x_3 = (0, 0, 1, 1, 0, 1, 0, 1)

where x_1 represents the feature subset (f_2, f_4, f_5, f_7), x_2 the subset (f_1, f_2, f_5, f_7), and x_3 the subset (f_3, f_4, f_6, f_8). In the iteration of the algorithm, the rule for updating the i-th firefly's position is:

x_ij^{t+1} = 1 if rand < P_ij, otherwise 0
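The binary encoding in this example can be checked with a small helper that maps a firefly vector back to the feature subset it encodes (`decode` is a name introduced here purely for illustration):

```python
def decode(firefly, features):
    """Return the feature subset selected by a binary firefly vector."""
    return tuple(f for bit, f in zip(firefly, features) if bit == 1)

F = ("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8")
print(decode((0, 1, 0, 1, 1, 0, 1, 0), F))  # ('f2', 'f4', 'f5', 'f7')
```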
Initialize the algorithm parameters: population size N = 50, maximum number of iterations T_max = 50, light absorption coefficient γ = 1, initial step factor α_0 = 0.5, maximum attractiveness β_0 = 1, and fitness-function constant ω = 0.95.
Then, calculating the fitness value of the firefly according to the fitness function, wherein the calculation formula is as follows:
fit(x(i)) = ω · P + (1 − ω) · (1 − |x(i)| / n)

where ω is a constant slightly less than 1, P is the classification accuracy, |x(i)| is the modulus of the feature-subset vector x(i), i.e. the number of selected features, and n is the total number of features.
Next, the low-brightness fireflies move toward the high-brightness ones, and positions are updated according to each firefly's modified per-dimension data. The fitness of each firefly after the position update is calculated; low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1. The position update formula is as follows:
x_i^{t+1} = x_i^t + β_0 e^{−γ r_ij²} (x_j^t − x_i^t) + α (rand − 1/2)

where α is a random-step parameter and rand is a random number uniformly distributed on [0,1]; r_ij denotes the distance between firefly i and firefly j, given by:

r_ij = ‖x_i − x_j‖ = sqrt( Σ_{d=1}^{D} (x_id − x_jd)² )

where D denotes the data dimensionality and x_id the d-th data component of the i-th firefly.
The algorithm achieves optimization through brightness differences: fireflies move toward brighter individuals. After the iteration has progressed to a certain degree, all fireflies gather near the optimal position. The individuals are then already very close to the optimum, and at the next iteration's position update the original step length is likely to overshoot it, causing the fireflies to swing back and forth. The step factor is therefore updated dynamically:
α_t = α_0 · (T_max − t) / T_max

where α_t denotes the value of α at the t-th iteration, α_0 the initialized α value, T_max the maximum number of iterations, and t the current iteration. Thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly.
And finally, when the iteration times reach the set value, outputting the obtained optimal feature subset by the algorithm, and using the optimal feature subset for text classification.
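Putting the steps above together, an end-to-end sketch of the improved binary firefly search might look as follows. This is a simplified toy, not the patent's exact implementation: the seed is fixed, the fitness function is supplied by the caller, and brightness is recomputed once per iteration rather than after every individual move:

```python
import math
import random

def improved_firefly_search(n_features, fitness_of, pop=6, t_max=30,
                            alpha0=0.5, beta0=1.0, gamma=1.0, seed=0):
    """Search for a good binary feature subset; fitness_of maps a 0/1 list to [0, 1]."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop)]
    best, best_fit = None, -1.0
    for t in range(t_max):
        alpha = alpha0 * (t_max - t) / t_max          # dynamic step factor
        fits = [fitness_of(x) for x in swarm]         # brightness of each firefly
        for i in range(pop):
            for k in range(pop):
                if fits[k] > fits[i]:                  # i moves toward brighter k
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[k]))
                    for j in range(n_features):
                        theta = (beta0 * math.exp(-gamma * r2)
                                 * (swarm[k][j] - swarm[i][j])
                                 + alpha * (rng.random() - 0.5))
                        p = 1.0 / (1.0 + math.exp(-theta))
                        swarm[i][j] = 1 if rng.random() < p else 0
        for x in swarm:                                # track the best subset seen
            f = fitness_of(x)
            if f > best_fit:
                best, best_fit = list(x), f
    return best, best_fit

# toy objective: recover a known 'good' feature mask
target = [1, 0, 1, 0, 1, 0, 1, 0]
best, bf = improved_firefly_search(8, lambda x: sum(a == b for a, b in zip(x, target)) / 8)
print(len(best))  # 8
```

In a real run the fitness callback would train and score the KNN classifier on the subset, which is the expensive part; the swarm mechanics themselves are cheap.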
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A text classification method based on an improved firefly algorithm and K nearest neighbor is characterized by comprising the following steps:
step 1: acquiring texts and dividing them into a training set and a test set, preprocessing both sets, including word segmentation and stop-word removal, calculating the information gain of each word and ranking the words by it, and retaining the features ranked above a set cutoff n to obtain a text feature preselection set;
step 2: initializing the population size N, the step factor α_0, the light absorption coefficient γ, and the maximum number of iterations T_max. The search strategy of the firefly algorithm depends on a control parameter α, usually taken as a constant, which controls the step length of each position update. However, in the solving process, namely after a certain number of iterations, all fireflies gather near the optimal position; the firefly individuals are then very close to the optimum, and at the next position update the original step length is likely to overshoot it, causing back-and-forth oscillation, so that the search for the optimal solution is inefficient and the convergence precision and speed of the algorithm suffer. To avoid this, a formula for dynamically updating α is proposed as follows:
α_t = α_0 · (T_max − t) / T_max
where α_t denotes the step length at the t-th position update, α_0 the initialized α value, T_max the maximum number of iterations, and t the current iteration; thus α is relatively large in the early stage of the algorithm, giving a large search range that avoids falling into local optima, while in the later stage α is relatively small, enabling good local search so that the global optimum is found quickly;
step 3: the low-brightness fireflies move toward the high-brightness fireflies, and positions are updated according to each firefly's modified per-dimension data; the fitness of each firefly after the position update is calculated, low-fitness fireflies are discarded, the firefly with the highest current fitness is recorded, and the iteration count is incremented by 1; if the iteration count reaches the maximum, the algorithm ends the search and outputs the optimal text feature subset; otherwise the search continues;
and step 4: classifying the obtained optimal text feature subset with a KNN classifier.
2. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 1, wherein the text preprocessing of step 1 specifically comprises: selecting 6000 articles of six categories from the Fudan University corpus as the data set, with each category divided into 800 training documents and 200 test documents; segmenting the training and test sets into words with the jieba toolkit; removing stop words from the training and test sets according to the Harbin Institute of Technology stop word list; then calculating the information gain of each word and sorting the words from largest to smallest by the obtained value.
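The preprocessing steps (word segmentation, stop-word removal) can be sketched as follows; a plain whitespace split stands in for the jieba segmenter so the example stays self-contained, and the tiny stop-word list is illustrative rather than the actual Harbin Institute of Technology list:

```python
def preprocess(doc: str, stopwords: set) -> list:
    """Tokenize a document and drop stop words.

    In the patented method jieba performs Chinese word segmentation; a
    whitespace split is used here so the sketch runs without dependencies.
    """
    tokens = doc.split()  # stand-in for jieba.lcut(doc)
    return [t for t in tokens if t not in stopwords]

stop = {"the", "a", "of"}
print(preprocess("the firefly algorithm selects a subset of features", stop))
# ['firefly', 'algorithm', 'selects', 'subset', 'features']
```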
3. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 1 or 2, wherein the information gain calculation formula of the step 1 is as follows:
IG(t) = E(C) − E(C|t) = −Σ_{j=1}^{m} P(C_j)·log P(C_j) + P(t)·Σ_{j=1}^{m} P(C_j|t)·log P(C_j|t) + P(t̄)·Σ_{j=1}^{m} P(C_j|t̄)·log P(C_j|t̄)
in the formula, IG(t) represents the information gain value of the feature t; E(C) represents the entropy of the text set without considering the feature t; E(C|t) represents the entropy of the text set when the feature t is considered; P(C_j) represents the probability that a document of class C_j occurs in the corpus; P(t) represents the probability that a document containing the feature t occurs; P(C_j|t) represents the probability that a document containing the feature t belongs to class C_j;
P(t̄)
representing the probability of occurrence of a document without the feature t;
P(C_j|t̄)
represents the probability that a document not containing the feature t belongs to class C_j; m represents the number of categories, and j denotes a particular category.
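A minimal sketch of the information-gain computation of step 1, using natural logarithms (the claim does not fix the log base, which only rescales the ranking); the toy documents and labels are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """E(C) = -sum_j P(C_j) * log P(C_j) over the class labels."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """IG(t) = E(C) - E(C|t): how much knowing feature t reduces class entropy."""
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if feature in d]
    without_t = [l for d, l in zip(docs, labels) if feature not in d]
    # E(C|t) weights the entropy of each partition by its probability.
    e_cond = sum((len(part) / n) * entropy(part) for part in (with_t, without_t) if part)
    return entropy(labels) - e_cond

docs = [{"ball", "game"}, {"ball"}, {"vote"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
# "ball" perfectly separates the two classes, so IG("ball") = E(C) = ln 2.
print(round(information_gain(docs, labels, "ball"), 4))  # 0.6931
```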
4. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor according to claim 3, wherein the step of taking a preselected set of text features as the input of the improved firefly algorithm, randomly initializing the firefly positions and calculating the fitness of all fireflies in the population comprises: representing the texts with the vector space model (VSM), so that a document can be regarded as a vector in an n-dimensional space and each document is regarded as a firefly. At the initial moment the fireflies are randomly distributed over the whole search space; a firefly with weak luminescence moves toward one with strong luminescence, positions are updated through this movement, and the optimal position is finally found to complete the optimization. In the firefly algorithm, the attraction strength of a firefly is determined by its own brightness and its attractiveness. The brightness is related to the firefly's position: the better the position, the brighter the firefly. The attractiveness is proportional to the brightness, that is, the greater the brightness, the stronger the attraction; when two fireflies are equally bright, a firefly moves randomly. The brightness and attractiveness of a firefly mimic the way light propagating in a medium is absorbed and fades, so both decrease as the distance between fireflies increases. The firefly algorithm follows the following three assumptions:
1) all fireflies are unisex, so any firefly can attract any other;
2) the attraction degree is determined only by the brightness and the distance, the high-brightness firefly attracts the surrounding low-brightness firefly, but the attraction degree decreases with the increase of the distance, and the firefly with the highest brightness moves randomly;
3) the luminance of the firefly is obtained by calculating a fitness function.
5. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 4, wherein the firefly algorithm was originally designed for continuous optimization problems, while the selection of text features is a combinatorial optimization problem; therefore, in order to use the firefly algorithm to search for the optimal feature subset, a Sigmoid function is introduced to convert the position update formula, the Sigmoid function being defined as follows:
P_ij = 1 / (1 + e^(−θ_ij))
P_ij is the probability value for the j-th dimension of the i-th firefly, and θ_ij is the abscissa of the function, where θ_ij is calculated as follows:
θ_ij = β_0 · e^(−γ·r_ij²) · (x_kj − x_ij) + α · (rand − 1/2)
in the formula, β_0 represents the maximum attractiveness; γ represents the light absorption coefficient; r_ij represents the distance between firefly i and firefly j; x_kj represents the j-th dimension vector value of the high-brightness firefly k; x_ij represents the j-th dimension vector value of the low-brightness firefly i; α represents the random-step parameter; and rand is a random number uniformly distributed on [0,1];
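A sketch of the Sigmoid-based binary move of claim 5, under the assumption that θ_ij is the standard firefly increment β_0·e^(−γ·r_ij²)·(x_kj − x_ij) + α·(rand − 1/2) and that each bit is then set to 1 with probability P_ij:

```python
import math
import random

def binary_update(x_i, x_k, beta0=1.0, gamma=1.0, alpha=0.5, rng=None):
    """Move low-brightness firefly i toward high-brightness firefly k.

    Each dimension's continuous increment theta_ij is squashed through the
    Sigmoid P_ij = 1 / (1 + exp(-theta_ij)) and sampled into a 0/1 bit.
    """
    rng = rng or random.Random()
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_k))  # squared distance r_ij^2
    new_pos = []
    for xij, xkj in zip(x_i, x_k):
        theta = beta0 * math.exp(-gamma * r2) * (xkj - xij) + alpha * (rng.random() - 0.5)
        p = 1.0 / (1.0 + math.exp(-theta))  # Sigmoid conversion to a probability
        new_pos.append(1 if rng.random() < p else 0)
    return new_pos

moved = binary_update([0, 1, 0, 1], [1, 1, 0, 0], rng=random.Random(42))
print(all(bit in (0, 1) for bit in moved))  # True
```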
Binary coding is adopted, with 0 and 1 indicating whether a feature is selected. Each firefly represents a feature subset whose length equals the total number of features, and the optimal solution is denoted as
X_i = (x_i1, x_i2, …, x_in)
x_id represents the value of the d-th dimension of the i-th firefly, where x_ij ∈ {0,1}, i = 1,2,…,N, and N is the size of the population, i.e. the number of fireflies. Assuming a feature set F = (f_1, f_2, f_3, …, f_n), a firefly is represented as a binary vector of length n, where
x_ij = 1 if feature f_j is selected, and x_ij = 0 otherwise.
If the value at a certain position is 0, the feature at the corresponding position is not selected; conversely, if the value at a certain position is 1, the feature at that position is selected.
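For illustration, a binary firefly vector can be decoded back into its feature subset (the feature names here are hypothetical):

```python
def decode_subset(firefly, features):
    """Keep feature f_j exactly when the j-th bit x_ij equals 1."""
    return [f for bit, f in zip(firefly, features) if bit == 1]

print(decode_subset([1, 0, 1, 0], ["w1", "w2", "w3", "w4"]))  # ['w1', 'w3']
```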
6. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5, wherein the initial population is randomly generated as a series of binary strings, and during the iterations of the algorithm the rule for updating the position of the i-th firefly is as follows:
x_ij(t+1) = 1 if rand < P_ij; otherwise x_ij(t+1) = 0
7. A method for classifying texts based on an improved firefly algorithm and K nearest neighbors as claimed in claim 5 or 6, wherein in the improved firefly algorithm the step factor α is adjusted dynamically;
a formula for dynamically updating α is presented as follows:
α_t = α_0 · (1 − t / T_max)
where α_0 denotes the initially set α value, α_t denotes the value of α at the t-th iteration, T_max denotes the maximum number of iterations, and t denotes the t-th iteration.
8. The method for classifying texts based on the firefly algorithm and the K nearest neighbor as claimed in claim 7, wherein reducing the dimension of the feature subset is taken as a secondary optimization objective, and a new fitness function is introduced as follows:
Fitness = ω·P + (1 − ω)·(1 − |X(i)| / n)
where ω is a constant slightly less than 1, P is the classification accuracy, |X(i)| represents the modulus (the number of selected features) of the feature subset vector X(i), and n is the total number of features.
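A sketch of the fitness function, assuming accuracy and subset size are combined linearly as described; ω = 0.95 is an illustrative value for the "constant slightly less than 1":

```python
def fitness(accuracy, subset, n, omega=0.95):
    """Fitness = omega * P + (1 - omega) * (1 - |X(i)| / n).

    Rewards high classification accuracy P first, and among equally accurate
    solutions prefers the smaller feature subset (the secondary objective).
    """
    selected = sum(subset)  # |X(i)|: number of 1-bits in the binary vector
    return omega * accuracy + (1 - omega) * (1 - selected / n)

# Same accuracy, smaller subset -> higher fitness.
print(round(fitness(0.9, [1, 0, 0, 0], 4), 4))  # 0.8925
```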
9. The method for text classification based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 8, wherein step 4 uses the obtained feature subset with a KNN classifier for text classification, and the process of the KNN algorithm is as follows: calculate the distance between the new sample point and all sample points in the training set, select the K closest sample points, and assign the new sample point to the class to which most of those K points belong. The distance can be the Euclidean distance or the cosine similarity; when cosine similarity is used, the formula is as follows:
sim(d_i, d_j) = Σ_{k=1}^{M} w_ik·w_jk / ( √(Σ_{k=1}^{M} w_ik²) · √(Σ_{k=1}^{M} w_jk²) )
in the formula, d_i represents the feature vector of the new sample point i, and d_j represents the feature vector of sample point j in the training set; w_ik is the k-th dimension weight of text i, and w_jk is the k-th dimension weight of text j; M is the dimension of the feature vector;
the formula of degree of membership is as follows:
μ(D, C_m) = Σ_{i=1}^{K} sim(D, D_i)·δ(D_i, C_m)
where δ(D_i, C_m) indicates whether document D_i belongs to class C_m: if D_i belongs to class C_m, then δ(D_i, C_m) is 1; otherwise it is 0.
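The cosine-similarity KNN decision described in claims 9 and 10 can be sketched as follows, summing similarities per class as the membership degree; the toy vectors and labels are illustrative:

```python
import math

def cosine_sim(a, b):
    """sim(d_i, d_j): dot product of the weight vectors over their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_classify(query, train, labels, k=3):
    """Rank training texts by cosine similarity, keep the K nearest, and
    accumulate sim(D, D_i) per class; the class with the largest membership
    degree wins."""
    ranked = sorted(zip(train, labels), key=lambda p: cosine_sim(query, p[0]), reverse=True)
    membership = {}
    for vec, lab in ranked[:k]:
        membership[lab] = membership.get(lab, 0.0) + cosine_sim(query, vec)
    return max(membership, key=membership.get)

train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = ["sport", "sport", "politics"]
print(knn_classify([1.0, 0.05], train, labels, k=3))  # sport
```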
10. The method for classifying texts based on the improved firefly algorithm and the K nearest neighbor as claimed in claim 9, wherein the specific steps of the KNN text classification include:
1) performing word segmentation on all texts, removing stop words, extracting characteristic words, and vectorizing the texts;
2) calculating the similarity between the text to be tested and all texts in the training set, sorting the results, and selecting the K most similar neighbor texts;
3) calculating the membership degree of the text to be tested in each category, and judging the text to be tested as the category with the maximum membership degree;
the classification effect is evaluated by the precision (P), the recall (R) and the F1 value:
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2·P·R / (P + R)
wherein, TP represents a true positive example, TN represents a true negative example, FP represents a false positive example, and FN represents a false negative example.
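The three evaluation measures follow directly from the confusion-matrix counts:

```python
def evaluate(tp, fp, fn):
    """Precision P = TP/(TP+FP), recall R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(tuple(round(x, 3) for x in evaluate(80, 20, 20)))  # (0.8, 0.8, 0.8)
```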
CN201910605245.XA 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor Active CN110909158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910605245.XA CN110909158B (en) 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor

Publications (2)

Publication Number Publication Date
CN110909158A true CN110909158A (en) 2020-03-24
CN110909158B CN110909158B (en) 2022-10-18

Family

ID=69814440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910605245.XA Active CN110909158B (en) 2019-07-05 2019-07-05 Text classification method based on improved firefly algorithm and K nearest neighbor

Country Status (1)

Country Link
CN (1) CN110909158B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000116A (en) * 2020-07-24 2020-11-27 西北工业大学 Heading angle control method of autonomous underwater vehicle based on improved firefly PID method
CN112446774A (en) * 2020-10-30 2021-03-05 杭州衡泰软件有限公司 Financial statement quality early warning method
CN113345420A (en) * 2021-06-07 2021-09-03 河海大学 Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation
CN114822823A (en) * 2022-05-11 2022-07-29 云南升玥信息技术有限公司 Tumor fine classification system based on cloud computing and artificial intelligence fusion multi-dimensional medical data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366362A (en) * 2013-04-17 2013-10-23 昆明理工大学 Glowworm optimization algorithm-based ore zone image segmentation method
AU2018100796A4 (en) * 2018-06-14 2018-07-19 Macau University Of Science And Technology A genetic feature identifying system and a search method for identifying features of genetic information
CN108388666A (en) * 2018-03-16 2018-08-10 重庆邮电大学 A kind of database multi-list Connection inquiring optimization method based on glowworm swarm algorithm
CN108876029A (en) * 2018-06-11 2018-11-23 南京航空航天大学 A kind of passenger flow forecasting based on the adaptive chaos firefly of double populations
CN109657147A (en) * 2018-12-21 2019-04-19 岭南师范学院 Microblogging abnormal user detection method based on firefly and weighting extreme learning machine
CN109711636A (en) * 2019-01-09 2019-05-03 南京工业大学 A kind of river level prediction technique promoting tree-model based on chaos firefly and gradient

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
AKSHI KUMAR et al.: "A Filter-Wrapper based Feature Selection for Optimized Website Quality Prediction", 2019 Amity International Conference on Artificial Intelligence (AICAI) *
G. VENKATA HARI PRASAD et al.: "Performance analysis of feature selection methods for feature extracted PCG signals", 2015 13th International Conference on Electromagnetic Interference and Compatibility (INCEMIC) *
LONG ZHANG et al.: "Optimal feature selection using distance-based discrete firefly algorithm with mutual information criterion", Neural Computing and Applications *
ZUO, Zhongliang et al.: "An Improved Firefly Algorithm", Microelectronics & Computer *
WANG, Aifang et al.: "A Relevance Feedback Method for Image Retrieval Based on Firefly and SVM", Journal of Chinese Computer Systems *
CHEN, Dong: "Research on Text Clustering Based on Firefly Algorithm", China Master's Theses Full-text Database (Information Science and Technology) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant