CN109711373A - A big data feature selection method based on an improved bat algorithm - Google Patents


Info

Publication number
CN109711373A
Authority
CN
China
Prior art keywords
bat
formula
algorithm
individual
population
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811642556.5A
Other languages
Chinese (zh)
Inventor
李佳琪
赵志峰
李荣鹏
张宏纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU

Abstract

The invention discloses a big data feature selection method based on an improved bat algorithm. Feature selection transforms samples from a high-dimensional space into a lower-dimensional space by mapping or transformation, and then reduces the dimension further by removing redundant and irrelevant features. The aim is to obtain a feature subset that is as small as possible without significantly reducing classification accuracy or affecting the class distribution. After analyzing the strengths and weaknesses of classical feature selection methods, the invention uses a swarm intelligence algorithm to optimize feature selection. The bat algorithm offers parallelism, strong robustness and fast convergence, but it easily falls into local optima; to address this, the invention introduces a subpopulation division mechanism based on the K-means algorithm and a binary differential mutation mechanism. The improved algorithm strengthens mutual learning and efficient information transfer among populations, increases the diversity and search capability of individuals, and avoids premature convergence. The improved bat algorithm is finally used to optimize feature selection and achieves excellent results.

Description

A big data feature selection method based on an improved bat algorithm
Technical field
The present invention relates to a bat algorithm and a feature selection method, and belongs to the field of artificial intelligence and machine learning.
Background art
With the rapid rise in the overall national level of informatization, information technology for smart livelihood services is widely applied, the information resources of such services have grown significantly, and the capacity to safeguard public cultural information has risen markedly. Smart livelihood services have entered an information age of comprehensive coverage, multi-level advancement and professional development. The supporting technology of a smart livelihood service system comprises the technical means, service platforms and management methods that, backed by new information technology, transform conventional livelihood services and comprehensively improve service quality. It is the technical vehicle that lets government-provided smart livelihood services keep meeting public demand while becoming more economical, efficient and beneficial; it is the infrastructure that better guarantees the public's cultural rights; and it is an important means of raising the "cultural power" of smart livelihood services.
With the wide dissemination of multimedia content in network environments, audio and video content of many types, granularities and formats is disrupting traditional broadcast media channels and causing confusion and difficulty in storing, managing, querying, analyzing and mining massive data. Under these circumstances, technology for integrating and fusing data from multiple application ports under three-screen convergence has become an urgent research task in providing technical support for a comprehensive smart livelihood service platform based on three-screen convergence.
Research on integrating and fusing data from multiple application ports under three-screen convergence focuses on efficient classification and integration of heterogeneous data at the syntactic and semantic levels and during big data exchange. The main research topics include the extraction, transformation, integration and final fusion of data sources in a distributed storage environment. The goal is to establish a relatively static, unified environment for data integration, data management and data analysis. Big data reduction is a method that, when analyzing massive data, does not use the raw data set directly but instead uses a subset of it as the object of analysis in order to obtain an approximate analytical result. Feature selection is one such big data reduction method.
In practical applications the number of features is often large. Some features may be irrelevant and some may depend on one another, which easily leads to long training times, overly complex models and the curse of dimensionality. Feature selection removes irrelevant or redundant features, which reduces the number of features, improves learning efficiency and model accuracy, and shortens running time. In addition, selecting the truly relevant features simplifies the model and makes it easier for researchers to understand how the data are generated. Many algorithms are currently used for feature selection. Some rank feature importance with an evaluation function and measure the quality of a feature subset by analyzing the features inside it; common evaluation criteria are based on information gain, distance or correlation. Other algorithms classify the sample set with the selected feature subset and use classification accuracy as the measure of subset quality. However, these algorithms evaluate the quality of individual features in isolation and do not consider how different feature combinations behave or how features influence one another.
Swarm intelligence algorithms have achieved good results on feature selection problems. One earlier study combined a genetic algorithm (GA) with correlation-based feature selection (CFS) to select features for the automatic classification of Chinese web pages. Its drawbacks are that the operations are cumbersome, the mutation mechanism reduces stability, the amount of computation grows and training takes longer. N. Cleetus et al. used an improved PSO algorithm to select an optimized feature combination for intrusion detection. Compared with GA, PSO has no complicated crossover and mutation operations; it adapts using only individual experience and population characteristics, so its rules are simpler and it converges faster. However, it easily falls into local optima, which leads to low convergence accuracy and difficulty in converging. T. Mehmood et al. used ant colony optimization (ACO) for feature selection together with a support vector machine (SVM) for traffic classification in network anomaly detection, and also obtained good experimental results. Nevertheless, these algorithms still have many parameters, converge slowly and are complex to implement, so further improvement is needed.
The bat algorithm (BA) is a heuristic search algorithm for finding a global optimum, proposed by Xin-She Yang in 2010. It is inspired by the behavior of bats in nature; its main idea is to simulate the echolocation that bats use when hunting. The bats in a population fly randomly, perceive their surroundings by echolocation and search for target prey. The position of a bat is a solution of the optimization problem. For feature selection problems, an improved variant of the bat algorithm, the binary bat algorithm, is usually used. The present invention regards feature selection as a process in which the individuals of a population iteratively move and search for a target. In general, a fitness function measures the quality of a solution. The bat algorithm has a simple model, strong robustness and a high degree of parallelism. It has attracted wide attention from scholars since it was proposed, its theory and applications have advanced considerably in recent years, and researchers commonly apply it in natural science and engineering practice, for example in engineering optimization, pattern recognition, K-means clustering, feature selection and data mining.
The bat algorithm has the strengths of swarm intelligence algorithms, such as a wide global search range and a short convergence time, but it also shares some of their common weaknesses, such as a tendency toward premature convergence and toward getting trapped in local optima. The reason is that each bat is influenced only by the globally best individual and can hardly exchange information efficiently with its neighbors. In addition, the algorithm itself has no mutation mechanism, so the positions of the individuals in the population lack diversity.
Summary of the invention
The technical problem to be solved by the present invention is to provide a big data feature selection method based on an improved bat algorithm that strengthens mutual learning and efficient information transfer among populations, increases the diversity and search capability of individuals, and avoids premature convergence.
The inventive concept is to select a better feature combination for smart livelihood big data under three-screen convergence by improving an intelligent optimization algorithm, the bat algorithm. To address the bat algorithm's tendency to fall into local optima, the invention introduces a subpopulation division mechanism based on the K-means algorithm and a binary differential mutation mechanism. The improved algorithm strengthens mutual learning and efficient information transfer among populations, increases the diversity and search capability of individuals, and avoids premature convergence.
To solve this technical problem, the present invention adopts the following technical solution.
The big data feature selection method based on an improved bat algorithm comprises the following steps:
(1) Initialize the parameters of the bat algorithm: the number of bats N, the maximum pulse loudness A_0, the maximum pulse rate R_0, the search pulse frequency range [f_min, f_max], the loudness attenuation coefficient α, the enhancement coefficient γ of the search pulse rate, and the maximum number of iterations N_t.
Randomly initialize the bat positions as follows to generate N candidate feature combinations:
For the i-th bat, with spatial position x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) and velocity v_i = (v_{i,1}, v_{i,2}, ..., v_{i,d}), the spatial position is abstracted as a binary string in a d-dimensional space. Each such binary string is one candidate feature combination, where d is the number of candidate features. A bit with value 1 indicates that the feature at that position is selected; a bit with value 0 indicates that it is not selected;
(2) Compute the fitness value f(x_i) of each bat with the fitness function of formula (1), and find the position G_0 of the current best bat among all bats:
F=0.6 × R+0.4 × e-F×A (1)
In formula (1), R, F and A denote the recall, F-score and accuracy, respectively, of the classification that takes the feature combination selected in the current iteration as input;
Update the inertia coefficient and the self-learning factor for the current iteration according to formulas (2) and (3);
In formulas (2) and (3), W_t is the inertia coefficient of each bat at iteration t, C_t is the self-learning factor of each bat at iteration t, W_max and W_min are the maximum and minimum values of the inertia coefficient, C_max and C_min are the maximum and minimum values of the self-learning factor, and N_t is the maximum number of iterations;
Compute the mutation probability P_t, which controls whether a bat performs the mutation operation, according to formula (4):
In formula (4), N_t is the maximum number of iterations and t is the iteration number;
Compute the contraction factor F_t, which takes a value between 0 and 1, according to formula (5):
In formula (5), F_max and F_min are the maximum and minimum values of the contraction factor, and N_t is the maximum number of iterations;
(3) Use the K-means algorithm to divide the bat population into subpopulations according to the distances between bats;
(4) Update the internal variables of every bat in each subpopulation, namely the search pulse frequency f_i, the flight velocity v_i and the spatial position x_i, according to formulas (6)-(9). Within each subpopulation, individuals that are not the local best update their velocity with formula (7), and the local best individual updates its velocity with formula (8):
f_i = f_min + (f_max - f_min)·β (6)
v_i^t = W_t·v_i^{t-1} + (x_i^{t-1} - M_n^{t-1})·f_i + (x_i^{t-1} - P_i^{t-1})·C_t (7)
v_i^t = W_t·v_i^{t-1} + (x_i^{t-1} - G^{t-1})·f_i + (x_i^{t-1} - P_i^{t-1})·C_t (8)
In formula (6), f_max and f_min are the maximum and minimum pulse frequencies, and β is a uniformly distributed random variable with β ∈ [0,1]. In formulas (7) and (8), v_i^t and v_i^{t-1} are the flight velocities of bat i at times t and t-1; x_i^t and x_i^{t-1} are the positions of bat i at times t and t-1; M_n^{t-1} is the position of the local best individual of subpopulation n at time t-1; G^{t-1} is the global best bat position of the whole population at time t-1; P_i^{t-1} is the historical best position retained by bat i at time t-1; W_t and C_t are the inertia coefficient and self-learning factor of the bat at iteration t. In formula (9), δ is a random variable with δ ∈ [0,1], and S is the sigmoid function;
(5) Generate a random number rand1. If rand1 > r_i, apply a random perturbation to the current best bat position of the population according to formula (10) to obtain a new position x_new for the bat, then go to step (6); if rand1 ≤ r_i, go to step (7). Here r_i is the pulse rate of the i-th bat in the current iteration:
x_new = x_old + δ·A_t^* (10)
where x_old is the original position of the bat, δ is a random number between -1 and 1, and A_t^* is the mean loudness of all bats in iteration t;
(6) Generate a random number rand2. If rand2 < A_i and f(x_i) < f(x_new), replace the bat's original position with the new position x_new, and update the bat's loudness A_i^t and pulse rate r_i^t at the current iteration t according to formulas (11) and (12); otherwise, go to step (7). Here f(x_i) is the fitness value of the bat's original position and f(x_new) is the fitness value of its new position;
A_i^t = α·A_i^{t-1} (11)
where A_i^{t-1} is the loudness of bat i at time t-1 and α is a constant with α ∈ (0,1); r_i^0 is the initial pulse rate of bat i, and γ is a constant with γ > 0;
(7) Generate a random number rand3 for each bat, and apply the mutation of formula (13) to the current velocity of every bat that satisfies rand3 < P_t;
Here r1, r2 and r5 are individuals selected at random from the same subpopulation as the target bat, and r3 and r4 are bats selected at random from different subpopulations. In formula (13), "+" denotes the logical XOR operation and a second operator denotes the logical OR operation. rand is a random number generated between 0 and 1. The conditional notation indicates that the operation in brackets is executed only if rand < F_t holds;
(8) Sort the fitness values of all bats and take the position of the bat with the highest fitness value as the current candidate feature combination. Check whether the current candidate feature combination meets the preset optimization condition. If it does, take it as the selected feature combination; if not, return to step (2).
Compared with the prior art, the advantages of the present invention are as follows:
The improved bat algorithm controls each individual through an inertia coefficient and a learning factor that change over time, so that the individual's search and optimization capabilities adapt as the iterations proceed. With the subpopulation division mechanism, learning from and moving toward the best position take place both within and between subpopulations, which ensures that optimization information passes between individuals while avoiding excessive influence from a local best individual. The binary differential mutation increases the diversity of individual positions and injects new vitality into a population that tends to homogenize during iteration. Compared with common models, the improved bat algorithm converges faster and optimizes better. The features selected with this algorithm contribute substantially to the subsequent classification decision and improve classification accuracy and performance.
Brief description of the drawings
Fig. 1 is a schematic diagram of the subpopulation division mechanism;
Fig. 2 compares the performance of different swarm intelligence algorithms;
Fig. 3 compares algorithm performance for different numbers of individuals and iterations;
Fig. 4 compares the performance of different population division methods.
Detailed description of the embodiments
The present invention is a feature selection method for smart livelihood big data under three-screen convergence, based on an improved bat algorithm. It uses an improved swarm intelligence algorithm, the bat algorithm, to select better features. A candidate feature combination is treated as the position of an individual in the bat algorithm, and feature selection is treated as the process in which the bats of a population iteratively move and search for a target; the global best position finally found is the selected feature set. The original bat algorithm is improved as follows. A subpopulation division mechanism based on the K-means algorithm is introduced, which strengthens efficient learning between neighboring individuals inside a subpopulation and the transfer of optimization information between subpopulations. A binary differential mutation mechanism is introduced. In addition, a linearly time-varying inertia factor and self-learning coefficient are added to the velocity update formula, and a mutation probability and a contraction coefficient are added to the mutation, so that the search capability of each bat adapts to the iteration number, premature trapping in local optima is avoided and convergence is accelerated. Feature selection with this improved bat algorithm shortens the time needed to screen features, and the selected features are better suited to the subsequent classification and give better performance.
Specifically, the present invention comprises the following steps:
(1) Initialize the parameters of the bat algorithm: the number of bats N, the maximum pulse loudness A_0, the maximum pulse rate R_0, the search pulse frequency range [f_min, f_max], the loudness attenuation coefficient α, the enhancement coefficient γ of the search pulse rate, and the maximum number of iterations N_t.
Randomly initialize the bat positions as follows to generate N candidate feature combinations:
For the i-th bat, with spatial position x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) and velocity v_i = (v_{i,1}, v_{i,2}, ..., v_{i,d}), the spatial position is abstracted as a binary string in a d-dimensional space. Each such binary string is one candidate feature combination, where d is the number of candidate features. A bit with value 1 indicates that the feature at that position is selected; a bit with value 0 indicates that it is not selected.
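The binary encoding described in step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `init_population` and `selected_features`, the zero initial velocity and the example sizes are all assumptions made for the sketch.

```python
import random

def init_population(n_bats, n_features, seed=None):
    """Create n_bats random binary positions over n_features candidate features."""
    rng = random.Random(seed)
    # Each position is a list of 0/1 bits: 1 = feature selected, 0 = not selected.
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_bats)]
    # Assumption: velocities start at zero; the source does not state the initial value.
    velocities = [[0.0] * n_features for _ in range(n_bats)]
    return positions, velocities

def selected_features(position):
    """Return the indices of the features whose bit is 1."""
    return [j for j, bit in enumerate(position) if bit == 1]

positions, velocities = init_population(n_bats=5, n_features=8, seed=0)
for i, pos in enumerate(positions):
    print(i, "".join(map(str, pos)), selected_features(pos))
```

Each printed row shows one bat's bit string and the feature indices that bit string would pass on to the classifier.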
(2) Compute the fitness value f(x_i) of each bat with the fitness function of formula (1), and find the position G_0 of the current best bat among all bats:
F=0.6 × R+0.4 × e-F×A (1)
In formula (1), R, F and A denote the recall, F-score and accuracy, respectively, of the classification that takes the feature combination selected in the current iteration as input. Because the selected feature combination strongly affects the final classification result, the fitness function defined here is determined by the evaluation metrics of the classifier trained on the selected features.
Update the inertia coefficient and the self-learning factor for the current iteration according to formulas (2) and (3);
In formulas (2) and (3), W_t is the inertia coefficient of each bat at iteration t, C_t is the self-learning factor of each bat at iteration t, W_max and W_min are the maximum and minimum values of the inertia coefficient, C_max and C_min are the maximum and minimum values of the self-learning factor, and N_t is the maximum number of iterations.
The invention improves the optimization capability of the algorithm by introducing an inertia weight W_t that decreases linearly as the iteration count grows. In early iterations a bat has a larger inertia coefficient and a higher speed, and therefore a stronger global search capability. In later iterations the smaller inertia weight favors more accurate local search, which speeds up convergence. The parameter C_t lets a bat learn from its own historical best position and improves the velocity update. This dynamically adapted parameter represents how strongly the historical best position influences the current velocity. At the start, a bat with a large C_t flies near its own current position and has good local exploration capability. C_t then decreases gradually, the bat's position comes to be influenced mainly by the global best individual, and its exploitation capability improves.
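Formulas (2) and (3) are not reproduced in this text, so the sketch below shows only one schedule that matches the description: a value that falls linearly from its maximum to its minimum over N_t iterations. The functional form, the helper name `linear_decay` and the numeric bounds are assumptions chosen for illustration.

```python
def linear_decay(t, n_iter, v_max, v_min):
    """Value that falls linearly from v_max at t = 0 to v_min at t = n_iter."""
    return v_max - (v_max - v_min) * t / n_iter

# Hypothetical bounds; the source does not give numeric values.
W_MAX, W_MIN = 0.9, 0.4   # inertia coefficient range
C_MAX, C_MIN = 2.0, 0.5   # self-learning factor range
N_ITER = 100              # maximum number of iterations N_t

for t in (0, 50, 100):
    w_t = linear_decay(t, N_ITER, W_MAX, W_MIN)
    c_t = linear_decay(t, N_ITER, C_MAX, C_MIN)
    print(t, round(w_t, 3), round(c_t, 3))
```

A linear schedule is the simplest choice that gives strong global search early and finer local search late; other monotone schedules would fit the text equally well.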
Compute the mutation probability P_t, which controls whether a bat performs the mutation operation, according to formula (4):
In formula (4), N_t is the maximum number of iterations and t is the iteration number. In early iterations each bat has a small mutation probability, so it can make full use of its search capability to explore a large space. As the iteration count increases, bats become more likely to mutate, which breaks the constraint of local optima and avoids premature convergence.
Compute the contraction factor F_t, which takes a value between 0 and 1, according to formula (5):
In formula (5), F_max and F_min are the maximum and minimum values of the contraction factor, and N_t is the maximum number of iterations. The contraction factor controls the mutation operation by controlling the influence of the difference vectors. A larger F helps maintain the diversity of the individuals in the population, whereas a smaller F gives individuals better local search capability.
(3) Use the K-means algorithm to divide the bat population into subpopulations according to the distances between bats.
Specifically, in each iteration the K-means algorithm divides the population into a fixed number of subpopulations according to the distances between individuals. The algorithm operates on two levels. At the first level, the individuals inside each subpopulation learn from and move toward only their current local best individual; an individual's velocity change is influenced only by its own historical best position and the local best solution. At the second level, the local best individual of each subpopulation likewise learns from and moves toward the global best individual, so that global information spreads among the subpopulations. In this way the bats move toward better individuals in a layered manner and iteratively obtain better positions, without being excessively influenced by any single individual. After each iteration all bats are regrouped into subpopulations, until the algorithm terminates.
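The subpopulation division can be sketched with a plain Lloyd-style K-means over the binary position vectors. This is an illustrative sketch under assumptions: the squared Euclidean distance, the number of refinement rounds, and the names `kmeans_subpopulations` and `local_best` are not taken from the source, which only states that bats are grouped by distance.

```python
import random

def kmeans_subpopulations(positions, k, n_rounds=10, seed=None):
    """Assign each bat to one of k subpopulations; returns one label per bat."""
    rng = random.Random(seed)
    # Start from k distinct bats as initial centroids.
    centroids = [[float(b) for b in p] for p in rng.sample(positions, k)]
    labels = [0] * len(positions)
    for _ in range(n_rounds):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(positions):
            dists = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [positions[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def local_best(fitness, labels):
    """Map each subpopulation label to the index of its fittest bat."""
    best = {}
    for i, lab in enumerate(labels):
        if lab not in best or fitness[i] > fitness[best[lab]]:
            best[lab] = i
    return best

bats = [[0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1]]
labels = kmeans_subpopulations(bats, k=2, seed=1)
print(labels, local_best([0.2, 0.4, 0.9, 0.5, 0.7], labels))
```

The `local_best` helper returns, for each subpopulation, the bat that ordinary members would follow at the first level; those local bests in turn follow the global best at the second level.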
(4) Update the internal variables of every bat in each subpopulation, namely the search pulse frequency f_i, the flight velocity v_i and the spatial position x_i, according to formulas (6)-(9). Within each subpopulation, individuals that are not the local best update their velocity with formula (7), and the local best individual updates its velocity with formula (8):
f_i = f_min + (f_max - f_min)·β (6)
v_i^t = W_t·v_i^{t-1} + (x_i^{t-1} - M_n^{t-1})·f_i + (x_i^{t-1} - P_i^{t-1})·C_t (7)
v_i^t = W_t·v_i^{t-1} + (x_i^{t-1} - G^{t-1})·f_i + (x_i^{t-1} - P_i^{t-1})·C_t (8)
In formula (6), f_max and f_min are the maximum and minimum pulse frequencies, and β is a uniformly distributed random variable with β ∈ [0,1]. In formulas (7) and (8), v_i^t and v_i^{t-1} are the flight velocities of bat i at times t and t-1; x_i^t and x_i^{t-1} are the positions of bat i at times t and t-1; M_n^{t-1} is the position of the local best individual of subpopulation n at time t-1; G^{t-1} is the global best bat position of the whole population at time t-1; P_i^{t-1} is the historical best position retained by bat i at time t-1; W_t and C_t are the inertia coefficient and self-learning factor of the bat at iteration t. In formula (9), δ is a random variable with δ ∈ [0,1], and S is the sigmoid function.
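The update of step (4) can be sketched for a single bat as below. Formulas (6)-(8) are followed as written; formula (9) is not reproduced in this text, so the binarization rule used here (set a bit to 1 when S(v) exceeds a uniform random δ) is the common binary bat algorithm convention and is an assumption. The function name `update_bat` and the clamping inside `sigmoid` are also illustrative.

```python
import math
import random

def sigmoid(v):
    """Sigmoid S(v); the input is clamped to avoid overflow in exp."""
    v = max(-50.0, min(50.0, v))
    return 1.0 / (1.0 + math.exp(-v))

def update_bat(x, v, p_best, guide, w_t, c_t, f_min, f_max, rng):
    """One frequency, velocity and position update for a single bat.

    x, p_best, guide are 0/1 lists; v is a list of floats.
    guide is M_n (formula 7) for ordinary bats or G (formula 8) for a local best.
    """
    f_i = f_min + (f_max - f_min) * rng.random()              # formula (6)
    new_v = [w_t * vj + (xj - gj) * f_i + (xj - pj) * c_t     # formula (7) or (8)
             for xj, vj, pj, gj in zip(x, v, p_best, guide)]
    # Assumed form of formula (9): bit is 1 when S(v) exceeds a random delta.
    new_x = [1 if sigmoid(vj) > rng.random() else 0 for vj in new_v]
    return new_x, new_v, f_i

rng = random.Random(42)
x, v = [1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0]
new_x, new_v, f_i = update_bat(x, v, p_best=[1, 1, 0, 0], guide=[0, 1, 1, 0],
                               w_t=0.9, c_t=1.5, f_min=0.0, f_max=2.0, rng=rng)
print(new_x, [round(val, 3) for val in new_v], round(f_i, 3))
```

Passing the subpopulation's local best as `guide` gives formula (7), and passing the global best gives formula (8), so one function covers both levels of the hierarchy.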
(5) Generate a random number rand1. If rand1 > r_i, apply a random perturbation to the current best bat position of the population according to formula (10) to obtain a new position x_new for the bat, then go to step (6); if rand1 ≤ r_i, go to step (7). Here r_i is the pulse rate of the i-th bat in the current iteration:
x_new = x_old + δ·A_t^* (10)
where x_old is the original position of the bat, δ is a random number between -1 and 1, and A_t^* is the mean loudness of all bats in iteration t.
(6) Generate a random number rand2. If rand2 < A_i and f(x_i) < f(x_new), replace the bat's original position with the new position x_new, and update the bat's loudness A_i^t and pulse rate r_i^t at the current iteration t according to formulas (11) and (12); otherwise, go to step (7). Here f(x_i) is the fitness value of the bat's original position and f(x_new) is the fitness value of its new position.
A_i^t = α·A_i^{t-1} (11)
where A_i^{t-1} is the loudness of bat i at time t-1 and α is a constant with α ∈ (0,1); r_i^0 is the initial pulse rate of bat i, and γ is a constant with γ > 0. As a bat approaches its target, A_i is reduced and r_i is increased in this way.
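The loudness and pulse-rate updates can be sketched as follows. Formula (11) is taken from the text. Formula (12) is not reproduced here, so the sketch assumes the form used in the standard bat algorithm, r_i^t = r_i^0·(1 - exp(-γ·t)), which matches the symbols r_i^0 and γ defined above; the function names and parameter values are illustrative.

```python
import math

def update_loudness(a_prev, alpha):
    """Formula (11): loudness decays geometrically, A_i^t = alpha * A_i^(t-1)."""
    return alpha * a_prev

def update_pulse_rate(r0, gamma, t):
    """Assumed form of formula (12): r_i^t = r_i^0 * (1 - exp(-gamma * t))."""
    return r0 * (1.0 - math.exp(-gamma * t))

# Hypothetical parameter values for illustration.
alpha, gamma, r0 = 0.9, 0.9, 0.5
loudness = 1.0
for t in range(1, 4):
    loudness = update_loudness(loudness, alpha)
    rate = update_pulse_rate(r0, gamma, t)
    print(t, round(loudness, 4), round(rate, 4))
```

Under this form the loudness shrinks toward 0 while the pulse rate rises toward r_i^0, so local perturbation in step (5) becomes rarer as a bat settles on a good position.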
(7) Generate a random number rand3 for each bat, and apply the mutation of formula (13) to the current velocity of every bat that satisfies rand3 < P_t.
Here r1, r2 and r5 are individuals selected at random from the same subpopulation as the target bat, and r3 and r4 are bats selected at random from different subpopulations. Because the position and velocity of every bat in the space are represented as binary strings, the mutation can be implemented with logical operations. In formula (13), "+" denotes the logical XOR operation and a second operator denotes the logical OR operation. rand is a random number generated between 0 and 1. The conditional notation indicates that the operation in brackets is executed only if rand < F_t holds.
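Formula (13) itself is not reproduced in this text, so the sketch below is only one plausible way to combine the ingredients the description names: XOR differences within and across subpopulations, an OR to merge them, and a per-bit gate that fires when rand < F_t. The exact arrangement, and the function name `binary_de_mutation`, are assumptions and may differ from the patent's formula.

```python
import random

def binary_de_mutation(r1, r2, r3, r4, r5, f_t, rng):
    """Hypothetical binary differential mutation on 0/1 lists.

    r1, r2, r5: individuals from the target bat's own subpopulation.
    r3, r4: individuals from other subpopulations.
    f_t: contraction factor in [0, 1] gating the difference term per bit.
    """
    mutant = []
    for b1, b2, b3, b4, b5 in zip(r1, r2, r3, r4, r5):
        # XOR marks bits where two individuals differ; OR merges both differences.
        diff = (b1 ^ b2) | (b3 ^ b4)
        # The difference is applied only when rand < F_t, otherwise the base bit is kept.
        gate = diff if rng.random() < f_t else 0
        mutant.append(b5 ^ gate)
    return mutant

rng = random.Random(7)
print(binary_de_mutation([1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0],
                         [0, 0, 1, 1], [1, 1, 1, 1], f_t=0.5, rng=rng))
```

With F_t near 0 the mutant stays close to the base individual r5, and with F_t near 1 every differing bit is flipped, which mirrors the trade-off between local search and diversity described for the contraction factor.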
(8) Sort the fitness values of all bats and take the position of the bat with the highest fitness value as the current candidate feature combination. Check whether the current candidate feature combination meets the preset optimization condition. If it does, take it as the selected feature combination; if not, return to step (2).
The invention further introduces the mutation mechanism of differential evolution, which increases population diversity and improves the ability of bats to escape local optima. Individuals are selected at random from the population, and the differences between them are used to perturb the target. The perturbation draws on differences between individuals both within the same subpopulation and across different subpopulations. The advantage of this approach is that it preserves the merit of an individual's own position and avoids the performance loss caused by unnecessary mutation, while also introducing the differences between subpopulations and driving the evolution of the whole population.
The technical solution of the present invention is further illustrated below with a specific embodiment. This embodiment uses the KDD CUP 1999 network intrusion detection data set, takes the selected network traffic features as input, and uses the random forest (RF) algorithm as the classifier to classify network traffic; the quality of the feature selection is verified through the quality of the classification. The details of the data set are as follows:
Table 1. Experimental data details
Several improved models of the bat algorithm already exist. A. C. Enache et al. proposed using Lévy flights to increase the randomness of candidate solutions. T. Kanungo et al. proposed generating an increment from the Euclidean distance between the current candidate solution and a better solution and randomly generating a new solution, to prevent the algorithm from falling into a local optimum. The improved algorithm proposed here is compared with these existing improved bat algorithms and with common swarm intelligence algorithms, all applied to feature selection. The classification accuracy and error rate are shown in Table 2. As the table shows, the present invention achieves a classification accuracy of 96.03% and an error rate of 1.18%, better than all other methods compared, which indicates that the proposed improved algorithm screens features more effectively, raises classification accuracy and lowers the error rate.
Table 2: Performance comparison of different feature selection algorithms
Algorithm                           Accuracy (%)   Error rate (%)
ACO                                 94.25          4.78
PSO                                 94.52          3.99
BA                                  94.93          3.68
Algorithm proposed by A.C. Enache   95.35          2.74
Algorithm proposed by T. Kanungo    95.64          2.06
Proposed improved BA                96.03          1.18
In addition, the present invention also verifies the improvement in optimization performance and convergence speed during the search for the optimal feature combination. As shown in Figure 2, the improved BA algorithm converges at around 40 iterations, far earlier than the other algorithms, which reduces the time cost. Meanwhile, judging from the height of the curves, the fitness of the optimal solution obtained at convergence is also higher than that of the other algorithms. The simulation results show that introducing sub-population division and the binary differential mutation mechanism makes it easier for the algorithm to escape local optima and obtain a better feature combination, and that the linearly time-varying parameters enhance the dynamic search ability of individuals, meeting the needs of different search stages.
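A linearly time-varying parameter of the kind mentioned above can be sketched as a simple interpolation between two bounds over the maximum number of iterations. The function name `linear_schedule` is hypothetical, and which bound each parameter starts from (for example, whether the inertia coefficient decreases from its maximum to its minimum) is an assumption of this sketch, not something the text above specifies.

```python
def linear_schedule(start, end, t, n_t):
    """Linearly interpolate from `start` (at t = 0) to `end` (at t = n_t).

    `t` is clamped to [0, n_t] so that out-of-range iteration counts
    do not push the parameter past its bounds.
    """
    t = min(max(t, 0), n_t)
    return start + (end - start) * t / n_t


if __name__ == "__main__":
    # Assumed example: an inertia coefficient shrinking from 1.0 to 0.0,
    # favouring exploration early and exploitation late.
    for t in (0, 25, 50, 100):
        print(t, linear_schedule(1.0, 0.0, t, 100))
```

Passing the bounds in the opposite order gives an increasing schedule, so the same helper covers parameters that grow over time.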
Since the population size and the number of search iterations are two very important parameters in solving an optimization problem, Figure 3 shows the influence of different parameter values on the optimization problem. As the figure shows, when the number of individuals in the population is fixed, the classification accuracy rises as the number of iterations increases. It can be inferred that, within a certain number of iterations, the population keeps evolving and finding better solutions, but eventually finds an approximately optimal solution and converges after a certain number of rounds. When the number of iterations is fixed, a larger population containing more individuals performs better than a smaller one. This is because individuals in a larger population differ more from one another and can exchange information and interact effectively, so the population can search a wider range and avoid converging to a local optimum.
In the improved algorithm, an important optimization is the introduction of the sub-population division mechanism. There are many ways to divide a population. Some methods simply assign individuals to different clusters at random, while others cluster according to various evaluation indices. The present invention therefore examines the influence of different division methods on the BA algorithm; the final convergence results are shown in Figure 4. Compared with the other curves, the curve for the K-means clustering algorithm converges first, with a steeper slope, and reaches the highest fitness at convergence. This shows that dividing sub-populations with the K-means clustering algorithm achieves better performance, higher fitness and faster convergence, because neighbouring individuals are grouped into the same sub-population based on distance. On the one hand, the whole population moves toward the current best position through learning inside each sub-population and knowledge sharing between sub-populations. On the other hand, each individual is influenced only by the current local optimum of its own sub-population and moves slowly, which largely shields it from interference.
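The distance-based division into sub-populations can be sketched with a small K-means routine in plain Python. This is an illustration only: the farthest-point initialisation, the squared Euclidean distance and the name `kmeans_partition` are choices made for this sketch, and the text above does not say how the centroids are initialised or which distance is used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_partition(points, k, max_iter=50):
    """Assign each point (a bat position) to one of k sub-populations.

    Initialisation (an assumption of this sketch): the first point is the
    first centroid, and each further centroid is the point farthest from
    the centroids chosen so far, which keeps the result deterministic.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    bats = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0],
            [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1]]
    print(kmeans_partition(bats, 2))
```

Each label identifies the sub-population of the corresponding bat; the best individual of each group can then serve as that group's local guide.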

Claims (1)

1. A big data feature selection method based on an improved bat algorithm, characterized by comprising the following steps:
(1) Initialize the relevant parameters of the bat algorithm, which include: the number of individuals in the bat population N, the maximum pulse volume A_0, the maximum pulse rate R_0, the search pulse frequency range [f_min, f_max], the attenuation coefficient α of the volume, the enhancement coefficient γ of the search frequency, and the maximum number of iterations N_t;
Randomly initialize the positions of the bats, generating N candidate feature combinations, by the following method:
For the i-th bat, according to its spatial position x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) and velocity v_i = (v_{i,1}, v_{i,2}, ..., v_{i,d}), the spatial position of the bat is abstracted into a binary string in a d-dimensional space; this binary string is one candidate feature combination, where d is the number of candidate features. A value of 1 at a position of the binary string indicates that the feature at that position is selected, and a value of 0 indicates that it is not selected;
(2) Calculate the fitness value f(x_i) of each bat according to the fitness function of formula (1), and find the position G_0 of the current best bat among all bats:
f = 0.6 × R + 0.4 × e^{-F×A} (1)
In formula (1), R, F and A denote respectively the recall, F-score and accuracy of the classification that takes the feature combination selected in the current iteration as input;
Update the inertia coefficient and the self-learning factor for the current iteration according to formula (2) and formula (3);
In formula (2) and formula (3), W_t denotes the inertia coefficient of each bat at iteration t, C_t denotes the self-learning factor of each bat at iteration t, W_max and W_min are the maximum and minimum values of the inertia coefficient, C_max and C_min are the maximum and minimum values of the self-learning factor, and N_t is the maximum number of iterations;
Calculate, according to formula (4), the mutation probability P_t that controls whether a bat executes the mutation operation:
In formula (4), N_t is the maximum number of iterations and t denotes the iteration number;
Calculate, according to formula (5), the randomly generated contraction factor F_t, whose value lies between 0 and 1:
In formula (5), F_max and F_min are the maximum and minimum values of the contraction factor, and N_t is the maximum number of iterations;
(3) Divide the bat population into sub-populations according to the distances between bats, using the K-means algorithm;
(4) In each sub-population, update the internal variables of each bat, namely the search pulse frequency f_i, flying velocity v_i and spatial position x_i, according to formulas (6)-(9); within each sub-population, the flying velocities of the non-locally-optimal individuals and of the locally optimal individual are updated according to formulas (7) and (8) respectively:
f_i = f_min + (f_max - f_min) · β (6)
v_i^t = W_t · v_i^{t-1} + (x_i^{t-1} - M_n^{t-1}) · f_i + (x_i^{t-1} - P_i^{t-1}) · C_t (7)
v_i^t = W_t · v_i^{t-1} + (x_i^{t-1} - G^{t-1}) · f_i + (x_i^{t-1} - P_i^{t-1}) · C_t (8)
In formula (6), f_max and f_min are respectively the maximum and minimum values of the pulse frequency, and β is a uniformly distributed random variable with β ∈ [0,1]. In formulas (7) and (8), v_i^t and v_i^{t-1} denote the flying velocity of bat individual i at times t and t-1; x_i^t and x_i^{t-1} denote the position of bat individual i at times t and t-1; M_n^{t-1} is the position of the locally best individual of sub-population n at time t-1; G^{t-1} is the position of the globally best bat of the whole population at time t-1; P_i^{t-1} is the historical best position retained by bat i at time t-1; W_t and C_t denote the inertia coefficient and self-learning factor of the bats at iteration t. In formula (9), δ is a random variable with δ ∈ [0,1], and S is the Sigmoid function;
(5) Generate a random number rand1. If rand1 > r_i, apply a random perturbation to the position of the current best bat of the bat population according to formula (10) to obtain the new position x_new of the bat, then execute step (6); if rand1 ≤ r_i, execute step (7). Here r_i is the pulse rate of the i-th bat in the current iteration:
x_new = x_old + δ · A_*^t (10)
where x_old is the original position of the bat, δ is a random number between -1 and 1, and A_*^t is the average loudness of all bats in iteration t;
(6) Generate a random number rand2. If rand2 < A_i and f(x_i) < f(x_new), replace the original position of the bat with the new position x_new, and update the bat's volume A_i^t and pulse rate r_i^t at the current iteration t according to formulas (11) and (12); otherwise, execute step (7). Here f(x_i) denotes the fitness value of the bat's original position and f(x_new) denotes the fitness value of its new position;
A_i^t = α · A_i^{t-1} (11)
r_i^t = r_i^0 · [1 - e^{-γt}] (12)
where A_i^{t-1} denotes the volume of bat individual i at time t-1, and α is a constant with α ∈ (0,1); r_i^0 denotes the initial pulse rate of bat individual i, and γ is a constant with γ > 0;
(7) Generate a random number rand3 for each bat, and apply the mutation of formula (13) to the current velocity of every bat satisfying rand3 < P_t;
where r1, r2 and r5 are individuals randomly selected from the same sub-population as the target bat, and r3 and r4 are bats randomly selected from different sub-populations; "+" denotes the logical XOR operation, and a second operator denotes the logical OR operation. rand is a random number generated between 0 and 1. The conditional operator indicates that the operation in brackets is executed if the condition rand < F_t is satisfied;
(8) Sort the fitness values of all bats and take the position of the bat with the highest fitness value as the current candidate feature combination; judge whether the current candidate feature combination satisfies the preset optimality condition; if so, take the current candidate feature combination as the selected feature combination; if not, return to step (2).
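The update rules of formulas (6), (7)/(8), (11) and (12) can be sketched in plain Python as follows. The function and parameter names are hypothetical. Formulas (7) and (8) differ only in the guide position (the sub-population's best individual or the global best), so one function with a `guide` argument covers both. The position update of formula (9) and the schedules of formulas (2)-(5) are not covered by this sketch.

```python
import math


def pulse_frequency(f_min, f_max, beta):
    """Formula (6): frequency drawn between f_min and f_max, beta in [0, 1]."""
    return f_min + (f_max - f_min) * beta


def update_velocity(v_prev, x_prev, guide, p_best, w_t, c_t, f_i):
    """Formulas (7)/(8), applied per dimension.

    `guide` is the sub-population best position for formula (7) or the
    global best position for formula (8); `p_best` is the bat's own
    historical best position.
    """
    return [
        w_t * v + (x - g) * f_i + (x - p) * c_t
        for v, x, g, p in zip(v_prev, x_prev, guide, p_best)
    ]


def update_loudness(a_prev, alpha):
    """Formula (11): the volume decays by the constant factor alpha in (0, 1)."""
    return alpha * a_prev


def update_pulse_rate(r_initial, gamma, t):
    """Formula (12): the pulse rate rises from 0 toward r_initial as t grows."""
    return r_initial * (1.0 - math.exp(-gamma * t))


if __name__ == "__main__":
    f_i = pulse_frequency(0.0, 2.0, 0.5)
    print(update_velocity([0.0, 1.0], [1, 0], [0, 0], [1, 1], 0.5, 2.0, f_i))
    print(update_loudness(1.0, 0.9), update_pulse_rate(0.5, 0.9, 10))
```

Because the loudness shrinks and the pulse rate grows over the iterations, the acceptance test in step (6) becomes stricter and the local perturbation in step (5) is triggered less often as the search proceeds.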
CN201811642556.5A 2018-12-29 2018-12-29 A kind of big data feature selection approach based on improvement bat algorithm Pending CN109711373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811642556.5A CN109711373A (en) 2018-12-29 2018-12-29 A kind of big data feature selection approach based on improvement bat algorithm

Publications (1)

Publication Number Publication Date
CN109711373A true CN109711373A (en) 2019-05-03

Family

ID=66259616

Country Status (1)

Country Link
CN (1) CN109711373A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503