CN102254040A - SVM (Support Vector Machine)-based Web partitioning method - Google Patents

SVM (Support Vector Machine)-based Web partitioning method

Info

Publication number
CN102254040A
CN102254040A CN2011102321925A CN201110232192A
Authority
CN
China
Prior art keywords
web
svm
libsvm
result
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102321925A
Other languages
Chinese (zh)
Inventor
张伟哲
张宏莉
何慧
邸文晨
魏一帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN2011102321925A priority Critical patent/CN102254040A/en
Publication of CN102254040A publication Critical patent/CN102254040A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an SVM (Support Vector Machine)-based Web partitioning method, which comprises the steps of: partitioning all websites into N groups; taking K = 1, 2, 3, ..., N and, for each value of K, selecting the website samples in groups 1 to K−1 and K+1 to N and initializing the LibSVM training; training LibSVM; storing the trained SVM model; selecting the website samples in the Kth group to perform a Web partitioning test; and saving the Web partitioning test result. In the method provided by the invention, the generalization ability of the SVM is strong, so fault tolerance and classification remain good when noisy data are processed: although the accuracy rate of the coordinates established by the network coordinate system is only approximately 80 percent, the SVM tolerates the resulting errors. The SVM can also solve the nonlinear classification problem, and because the number of SVM classes is fixed, the extreme case of a website that no crawler fetches is avoided; using a classification algorithm thus overcomes the uncertainty in the number of partition sets inherent in clustering algorithms.

Description

A Web partitioning method based on Support Vector Machines
Technical field
The present invention relates to a Web partitioning method based on Support Vector Machines and belongs to the technical field of Web partitioning.
Background art
First, the random Chainsaw Web partitioning algorithm balances the load of the overall system, but it completely ignores the positional factors of the nodes, so the total network distance overhead is excessive.
Second, the cluster-based HONet Web partitioning system suits the situation where intra-class distances are small and inter-class distances are large; otherwise the partition result depends on the sampling order, and the first partition set generally ends up with too many samples.
Finally, the IWAP algorithm, an iterative self-organizing Web partition strategy, cannot guarantee that every partition set contains crawler nodes, or enough of them; the Web sites inside a partition set without crawlers then have no corresponding crawler to fetch them, and scheduling fails.
Summary of the invention
The purpose of the invention is to provide a Support Vector Machine-based Web partitioning method that reduces the total network distance of an information collection system, reduces the load the crawler system places on the network, and improves the crawler's response speed and fetch speed, thereby better exploiting the performance of distributed crawlers on a wide-area network.
The objective of the invention is achieved through the following technical solution:
A Web partitioning method based on Support Vector Machines comprises the steps of:
(1) divide all Web sites into N groups;
(2) take K = 1, 2, 3, ..., N; for each value of K, select the Web site samples of groups 1 to K−1 and K+1 to N and initialize the LibSVM training;
(3) run the LibSVM training;
(4) store the trained SVM model;
(5) select the Kth group of Web site samples and run the Web partitioning test;
(6) save the Web partitioning test result;
(7) if K ≤ N, repeat steps (2) to (6);
(8) if the partition result is below the expected result_sat, repeat steps (1) to (7); otherwise terminate the program: the Web partitioning is finished and the partition result is obtained.
From the technical solution provided above it can be seen that the main advantages of using a Support Vector Machine (SVM) for Web partitioning in the present invention are:
First, the SVM generalizes strongly, so it tolerates faults and classifies well when handling noisier data. The accuracy rate of the coordinates established by the network coordinate system is only about 80%, so the SVM's strong generalization ability gives good tolerance to the errors in the coordinate calculation.
Second, the SVM can solve nonlinear classification problems by mapping from a low-dimensional to a high-dimensional space in which the classification interfaces become separable. This method must allocate a large number of Web sites to different crawlers, so there must be multiple classification interfaces, and the SVM satisfies this demand.
Third, the number of SVM classes is fixed, which avoids the extreme case of a website that no crawler fetches; using a classification algorithm overcomes the uncertainty in the number of partition sets inherent in clustering algorithms.
Description of drawings
Fig. 1 is the flowchart of the SVM-based Web partitioning algorithm;
Fig. 2 is the flowchart of applying the Support Vector Machine in the Web partitioning system;
Fig. 3 plots coordinate prediction time against iteration count;
Fig. 4 plots coordinate accuracy against iteration count;
Fig. 5 plots Web partition accuracy against iteration count;
Fig. 6 is the set radius comparison result;
Fig. 7 is the Je value comparison result;
Fig. 8 is the cumulative total network distance comparison chart.
Embodiment
This embodiment provides a Web partitioning method based on Support Vector Machines; as shown in Fig. 1, the steps are:
(1) divide all Web sites into N groups;
(2) take K = 1, 2, 3, ..., N; for each value of K, select the Web site samples of groups 1 to K−1 and K+1 to N and initialize the LibSVM training;
(3) run the LibSVM training;
(4) store the trained SVM model;
(5) select the Kth group of Web site samples and run the Web partitioning test;
(6) save the Web partitioning test result;
(7) if K ≤ N, repeat steps (2) to (6);
(8) if the partition result is below the expected result_sat, repeat steps (1) to (7); otherwise terminate the program: the Web partitioning is finished and the partition result is obtained.
The overall procedure of the SVM-based Web partitioning algorithm follows the eight steps above. (The original procedure listing appears only as an image in the patent.)
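As a concrete illustration of steps (1)-(8), below is a minimal sketch of the N-fold train-and-test loop using the official LibSVM Python interface; the round-robin grouping, the helper and variable names, and the parameter string are illustrative assumptions, not the patent's original listing (the parameter values C = 5.0, γ = 2.0 are the ones reported later in section 4.4.1).

```python
# Minimal sketch of the N-fold LibSVM train/test loop (steps (1)-(8)).
# Assumes: samples = list of feature dicts {dim: value} with 1-based dims,
#          labels  = list of crawler ids, one per website sample.
from libsvm.svmutil import svm_problem, svm_parameter, svm_train, svm_predict

def partition_websites(samples, labels, n_groups):
    # Step (1): divide all websites into N groups (round-robin, an assumption).
    groups = [list(range(g, len(samples), n_groups)) for g in range(n_groups)]
    results = []
    for k in range(n_groups):                                  # step (2): K = 1..N
        train_idx = [i for g in range(n_groups) if g != k for i in groups[g]]
        test_idx = groups[k]
        prob = svm_problem([labels[i] for i in train_idx],
                           [samples[i] for i in train_idx])
        # C-SVC with RBF kernel; C and gamma as chosen in section 4.4.1.
        param = svm_parameter('-s 0 -t 2 -c 5.0 -g 2.0 -q')
        model = svm_train(prob, param)                         # steps (3)-(4)
        pred, acc, _ = svm_predict([labels[i] for i in test_idx],
                                   [samples[i] for i in test_idx],
                                   model, '-q')                # steps (5)-(6)
        results.append((test_idx, pred, acc[0]))               # acc[0]: accuracy %
    return results
```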
The concrete method is as follows:
(1) How the Support Vector Machine is used in the Web partitioning system
The general idea of using a Support Vector Machine for Web partitioning in the present invention is as follows. The 12-dimensional network coordinates that Levaldi (see the published paper "Levaldi: An Improved Network Distance Prediction Algorithm Based on Network Coordinate System") establishes for the crawler nodes of the distributed information collection system and for the Web sites are used as the feature vectors. For the Support Vector Machine itself, the LibSVM software package is used: a simple, easy-to-use package for fast and effective SVM pattern recognition and regression developed by associate professor Lin Chih-Jen of National Taiwan University and others. Leave-one-out (LOO) cross validation is used in the SVM, and the Web sites are classified, covering the SVM's own feature extraction, feature selection, and parameter tuning process.
(2) Concrete steps of using the Support Vector Machine for Web partitioning
The implementation steps can be described as follows:
(1) Select a small number of sample values and label them (see the sketch after this list). The labeling basis: denote the crawler nodes C, N in total, and the website nodes W, M in total; using the network distance values computed by the Levaldi algorithm above, a small number of samples (this embodiment adopts the leave-one-out method and fixes (9/10)M samples) are classified by nearest network distance.
(2) Tune and set the LibSVM parameters.
(3) Use the leave-one-out mode to classify the remaining (1/10)M website nodes.
(4) Collect the test classification results and compute the classification-effect statistics.
(5) Re-select the original samples from the remaining samples, repeat steps (1) to (4), and average the results of the 10 classification rounds.
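A minimal sketch of the nearest-distance labeling of step (1), assuming the Levaldi network coordinates are already available as plain coordinate lists; all names here are hypothetical:

```python
import math

def predicted_distance(x, y):
    # Euclidean distance in the network coordinate space (cf. formula (8) below).
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def label_by_nearest_crawler(site_coords, crawler_coords):
    # Assign each website sample the index of its nearest crawler node;
    # that index serves as the class label for the LibSVM training.
    labels = []
    for site in site_coords:
        dists = [predicted_distance(site, c) for c in crawler_coords]
        labels.append(dists.index(min(dists)))
    return labels
```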
Observe the classification effect, adjust the LibSVM parameters, and prepare the next round of partitioning, until the partition result is satisfactory. The experimental design and experimental comparisons follow.
(3) Experimental comparison and evaluation
(1) The clustering criterion function Je
A good Web partition must produce a partition result with good clusters, and these clusters need the following characteristics: high proximity within a cluster and low proximity between clusters.
The clustering criterion function (sum-of-squared-error criterion) Je can be used to measure the quality of partition results that share the same number of partition sets.
The value of Je is defined as follows:
Je = Σ_{i=1}^{k} Σ_{Y∈B_i} |Y − n_i|²    (1)
where Y is any sample in partition set B_i and n_i is the centroid of B_i.
The smaller Je is, the more reasonable the clustering and the higher the cluster quality.
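A short sketch of formula (1); the representation of the partition sets as NumPy arrays of sample coordinates is an assumption:

```python
import numpy as np

def je_value(partitions):
    # partitions: list of 2-D arrays, one per partition set B_i (rows = samples Y).
    je = 0.0
    for b in partitions:
        centroid = b.mean(axis=0)          # n_i, the centroid of B_i
        je += ((b - centroid) ** 2).sum()  # adds |Y - n_i|^2 over all Y in B_i
    return je
```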
(2) Cumulative crawler network distance overhead Dist_total in the information collection system
Since the very purpose of the network distance prediction system is to direct network allocation through prediction and reduce the total distance overhead in system scheduling, the total network distance overhead during partitioning must be one of the performance evaluation criteria.
Denote the crawler node set C, with N nodes, and the website node set W, with M nodes; denote the set of websites allocated to crawler node i as D_i, and the network distance between node i and node j as Network_distance(i, j). Then:
Dist_total = Σ_{i=1}^{N} Σ_{j∈D_i} Network_distance(i, j)    (2)
For an information collection system, the smaller the Dist_total value, the smaller the total network distance overhead, the lighter the load the crawler nodes place on the network, and the better the partitioning effect.
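Formula (2) translates directly into code; `assignment` (the sets D_i keyed by crawler id) and `network_distance` are hypothetical names:

```python
def dist_total(assignment, network_distance):
    # assignment: dict crawler_id -> iterable of website ids (the sets D_i)
    # network_distance: callable (i, j) -> distance between nodes i and j
    return sum(network_distance(i, j)
               for i, sites in assignment.items()
               for j in sites)
```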
(4) Web partitioning experiment content
Tuning the Support Vector Machine is an iterative process that converges toward an optimum; parameter tuning and feature vector extraction are its key and most important links and require continual trial and optimization. Applying the Support Vector Machine principle to Web partitioning is in particular a brand-new attempt, and many aspects need extensive experiments and conclusions. This part sets forth the LibSVM experimental procedure for Web partitioning based on network distance prediction.
4.4.1 LibSVM parameter tuning
When classifying with an SVM, the problem type the classifier is to solve and the corresponding kernel function must be determined first. The problem type is easy to determine: Web partitioning based on network distance prediction is a multi-class classification problem, so the multi-class option is selected. The corresponding problem to solve is C_SVC, the multi-class recognition problem, which solves
min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^{l} ξ_i    (3)
where l is the number of training samples, w is the normal vector of the classification hyperplane, C is a penalty factor, and ξ_i are the slack variables used for the linearly non-separable case.
There are many kinds of kernel function. Compared with the other kinds, the radial basis kernel function (RBF) can map samples from a low-dimensional space into a much higher-dimensional one, which helps handle samples that are not linearly separable in the low-dimensional space; at the same time, the data complexity of the samples after the RBF mapping is relatively low. Compared with the radial basis function, the value of the polynomial kernel may tend to infinity, and the Sigmoid kernel fails under some parameters. The present invention therefore selects the radial basis kernel function as the classifier's kernel.
The expression of the radial basis function RBF is
K(x, y) = exp(−γ||x − y||²)    (4)
Several parameters must be determined when classifying with the SVM; the most important are C and γ. C, used in the C_SVC problem, is the penalty factor for misclassified samples: the larger it is, the stronger the penalty, which amounts to placing more confidence in the data and assuming the data noise is small.
In the present invention the parameters are determined by grid search. The selection range of C is {0.005, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0}, and the selection range of γ is {2^-5, 2^-4, ..., 2^4, 2^5}. To determine the parameters, two groups of samples were selected at random from all network node samples, one for training the classifier and one for testing it, with 80% of the samples used for training. For each pair of C and γ values the classifier is trained on the training group and its classification accuracy is recorded on the test group. Considering both the accuracy of the classifiers trained under the different parameters and the number of support vectors obtained after training, the combination {C = 5.0, γ = 2.0} is adopted to build the classifier. The concrete parameter settings of the SVM classifier are as shown in the following table:
Table 1 Main LibSVM parameters
(The parameter table appears only as an image in the original.)
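A sketch of the grid search described above, with one deliberate substitution: where the patent evaluates each (C, γ) pair on a random train/test split, this sketch uses LibSVM's built-in n-fold cross-validation (the `-v` option, which makes `svm_train` return an accuracy instead of a model). The C and γ ranges, C_SVC, and the RBF kernel follow the text:

```python
import itertools
from libsvm.svmutil import svm_train

C_RANGE = [0.005, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0]
GAMMA_RANGE = [2.0 ** p for p in range(-5, 6)]   # 2^-5 .. 2^5

def grid_search(y_train, x_train):
    best = (None, None, -1.0)
    for c, g in itertools.product(C_RANGE, GAMMA_RANGE):
        # '-v 5': 5-fold cross-validation accuracy for C-SVC with RBF kernel
        acc = svm_train(y_train, x_train, f'-s 0 -t 2 -c {c} -g {g} -v 5 -q')
        if acc > best[2]:
            best = (c, g, acc)
    return best  # the text reports {C = 5.0, gamma = 2.0} as the adopted pair
```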
4.4.2 Feature extraction and feature selection
Research on classifier design methods is certainly important, but determining a suitable feature space is another very important, even more crucial, problem in designing a pattern recognition system.
If the selected feature space makes similar objects distribute compactly, that is, the samples of each class occupy regions of the feature space that are separated from one another, it provides a good basis for classifier design. Conversely, if samples of different classes are mixed together in the feature space, even the best design method cannot improve the classifier's accuracy. This section concerns how to construct a feature space, that is, how to describe and analyze the things to be recognized.
In general, optimizing the initial feature space amounts to dimensionality reduction: the initial feature space has a high dimension, and it is converted into a space of lower dimension; the optimized feature space should better support the subsequent classification computation. So-called optimization requires both reducing the feature dimension and improving the classifier's performance.
There are two basic methods: (a) feature selection (deleting some features); (b) feature combination and optimization (a mapping), in which each new feature is a function of the original features.
To be more precise, suppose there is a D-dimensional feature vector space Y = {y_1, y_2, ..., y_D}. Feature selection then means deleting some feature descriptors from the original D-dimensional feature space to obtain a simplified feature space in which a sample is described by a d-dimensional feature vector X = {x_1, x_2, ..., x_d}, d < D. Since X is a subset of Y, each component x_i must find its corresponding descriptor in the original feature set, x_i = y_j. Some of the most effective features are picked out of the feature set so as to reduce the dimension of the feature space.
Y: {y_1, y_2, ..., y_D} → X: {x_1, x_2, ..., x_d}    (5)
x_i ∈ Y, i = 1, 2, ..., d; d < D
The original feature set Y contains D features; the target feature set X contains d features.
Feature extraction, by contrast, finds a mapping A: Y → X that lowers the dimension of the new sample's feature description below the original dimension, where each component x_i is a function of the original feature vector components:
x_i = x_i(y_1, y_2, ..., y_D)    (6)
These are therefore two different basic dimensionality-reduction methods. In practice they can be combined: for example, optimal combination first and then selection of a subset, or the reverse.
4.4.3 LibSVM feature extraction and feature selection
For feature extraction, the 8-dimensional space coordinates computed by Levaldi serve as the initial features; the feature parameter Iter_levaldi is mainly tuned so that the precision of the predicted space coordinates changes with it, making the LibSVM feature vectors more accurate and of higher quality.
For feature selection, the sequential forward selection algorithm (SFS) is mainly used to screen the existing 12-dimensional feature space vectors and select the optimal feature vector combination.
4.4.4 Tuning and determining Iter_Levaldi
In the Levaldi algorithm an important parameter is the iteration count Iter_Levaldi, which to a certain extent determines the precision of the network coordinates. Generally, increasing the iteration count raises the network coordinate precision and gives Web partitioning based on network distance prediction a better foundation. But as Iter_Levaldi rises, the convergence time of the network coordinate prediction grows with it; in particular, once the iteration count exceeds ten thousand, the convergence time exceeds 0.5 seconds per website, which defeats the purpose of predicting distances through the network coordinate system. A trade-off between accuracy and efficiency therefore needs experimental analysis, and the most reasonable Iter_Levaldi is chosen on the basis of the experiments.
The Iter_Levaldi experimental design
To compare the performance of Levaldi and the Web partition results of the Levaldi-based LibSVM under various Iter_Levaldi values, the following parameters are designed:
(1) Iteration count Iter_Levaldi: the input parameter of the Levaldi program; the larger its value, the longer the convergence time and the higher the coordinate accuracy.
(2) Convergence time: the time Levaldi needs to finish computing the coordinates of all nodes.
(3) Coordinate relative accuracy: the relative error value is used to evaluate the network distance prediction effect:
relative error = |predicted distance − actual distance| / actual distance    (7)
The relative error is 0 when the predicted distance equals the actual distance; the smaller the relative error value, the higher the prediction accuracy and the better the effect.
The predicted distance is the Euclidean distance between any two points in the coordinate space S. The Euclidean distance between any two points x, y in an N-dimensional space is:
d(x, y) = ||x − y|| = sqrt( Σ_{i=1}^{N} (x_i − y_i)² )    (8)
where x_i (i ∈ {1, 2, ..., N}) is the i-th coordinate of x, and y_i (i ∈ {1, 2, ..., N}) is the i-th coordinate of y.
The actual distance is the distance obtained by measurement. In this experiment, the actual distance is the RTT distance from a landmark node to any website node (including the ordinary crawler nodes and all Web site nodes).
Since the network distance prediction algorithm only has measured distances from landmark nodes to website nodes, and no actual distances between arbitrary pairs of ordinary nodes, the relative errors cover only landmark nodes against all ordinary nodes.
For a more intuitive picture, this paper uses the accuracy rate to evaluate the network distance prediction effect:
accuracy rate = 1 − relative error    (9)
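The two scalar measures above reduce to a one-liner; `predicted` and `measured_rtt` are hypothetical argument names:

```python
def accuracy_rate(predicted, measured_rtt):
    # relative error = |predicted - measured| / measured  (formula (7));
    # accuracy rate  = 1 - relative error                 (formula (9))
    return 1.0 - abs(predicted - measured_rtt) / measured_rtt
```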
(4) Web partition accuracy (Partition accuracy): in Web partitioning, if its time-overhead problems are set aside, the TOP-K algorithm is optimal in partition accuracy, so the comparison with the TOP-K result can serve as the criterion of Web partition accuracy.
The evaluation is most exact when K = 1. Denote T_Web = {all partitioned websites}, A_Web = {websites whose LibSVM partition agrees with the TOP-K partition}, and B_Web = W/A = {websites whose LibSVM partition disagrees with the TOP-K partition}.
Then:
Partition accuracy = |A_Web| / |T_Web|    (10)
|B_Web| / |T_Web| = 1 − Partition accuracy    (11)
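A sketch of the partition-accuracy measure of formula (10), assuming both partitions are given as website-to-crawler mappings (a hypothetical representation):

```python
def partition_accuracy(libsvm_assign, topk_assign):
    # Fraction of websites whose LibSVM partition agrees with the TOP-K
    # baseline: |A_Web| / |T_Web|.
    agree = sum(1 for site, crawler in libsvm_assign.items()
                if topk_assign.get(site) == crawler)
    return agree / len(libsvm_assign)
```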
The experiments compare the three topology graphs under the different Iter_Levaldi values.
The Iter_Levaldi experimental results
Fig. 3 shows the convergence time of coordinate prediction as the iteration count rises from 200 to 10000; the horizontal-stripe bars show the trend for the One-level Waxman topology model, the diagonal bars the Top-Down Waxman-Waxman model, and the grid bars the Top-Down Waxman-Barabasi & Albert model.
As Fig. 3 shows, as the iteration count rises from 200 to 10000, the coordinate prediction time grows progressively with the iteration count. From the curve distribution (the figure plots the log of the coordinate prediction time) and actual computation: raising the count from 200 to 500 increases the average time consumed by 48.5%; from 500 to 1000, by 98.5%; from 1000 to 2000, by 21.8%; from 2000 to 5000 it decreases by 18.3%; from 5000 to 8000 it increases by 206.2%; and from 8000 to 10000 by 179.6%. The growth rate peaks at 8000, is slowest from 2000 to 5000, and the 500-to-1000 range comes next.
Fig. 4 and Fig. 5 show how the coordinate prediction accuracy and the coordinate-prediction-based Web partition accuracy change as the iteration count grows; the solid curve shows the trend for the One-level Waxman topology model, the dotted curve the Top-Down Waxman-Waxman model, and the dash-dot curve the Top-Down Waxman-Barabasi & Albert model.
The experimental results in Fig. 4 and Fig. 5 show that across the three experiment groups, when Iter_levaldi is 200 the coordinate relative accuracy is low, only about 60%, so the partition accuracy of LibSVM is also low, only 52%, 51%, and 53%, which cannot satisfy high-quality Web partitioning.
When Iter_levaldi is raised to 500, all three experiment groups improve in coordinate relative accuracy, all above 70%, and the Web partition accuracies of LibSVM reach 78%, 72%, and 65% respectively, still not high enough; meanwhile the total convergence time of coordinate prediction reaches about 15 seconds, i.e. 0.015 seconds per node on average, acceptable in speed but still unacceptable in partition accuracy.
When Iter_levaldi reaches 1000, all three experiment groups improve markedly in coordinate relative accuracy, averaging 85% with a minimum of 81%, which is acceptable; the convergence speed, below 0.01 seconds per node, is also acceptable. The Web partition accuracy averages 87% with a minimum of 84%, which is acceptable, though the LibSVM parameters still need further tuning.
When Iter_levaldi reaches 2000 to 5000, the coordinate precision and the Web partition accuracy keep rising, but the average convergence time exceeds 0.15 seconds per node; although the Web partition accuracy has improved, convergence is too slow, losing the advantage of fast coordinate computation over slow direct measurement. This cannot satisfy high-quality Web partitioning and is therefore unacceptable.
Iter_levaldi values of 8000 and 10000 serve to verify experimentally the accuracy of Levaldi and of LibSVM: when Iter_levaldi is 10000 the Web partition accuracy exceeds 90% but the convergence time is intolerable, which verifies from an experimental viewpoint that the accuracy and feasibility of Web partitioning through network distance prediction can be guaranteed.
The comparisons above show that choosing Iter_Levaldi around 1000 guarantees an accuracy rate of 88% while remaining acceptable in convergence speed, satisfying high-quality Web partitioning. Finer-grained experiments further show that taking Iter_levaldi = 1200 gives ideal results: the convergence times for the three topologies are 36.1422, 42.7573, and 49.2692 seconds respectively, and the Web partition accuracies are 90.2%, 86.8%, and 87.3%.
4.4.5 SFS selection of Coord_Dim
In the Levaldi algorithm another important parameter is the coordinate dimension Coord_Dim, which also determines the precision of the network coordinates to a certain extent. Generally, increasing the coordinate dimension raises the network coordinate precision and gives Web partitioning based on network distance prediction a better foundation.
But once the coordinate dimension has been raised past a certain point, it not only slows the coordinate prediction process: the added dimensions bring no further prediction precision and may instead, because there are too many of them, inflict the "curse of dimensionality" on the Support Vector Machine, slowing its convergence and possibly even misclassifying samples. For the Web partitioning system, which has 1000 sample points, 8 coordinate dimensions in the Levaldi part guarantee accurate coordinates; but for LibSVM, whether some of those dimensions can be cut while still guaranteeing coordinate accuracy and Web partitioning quality, and at the same time speeding up the convergence of LibSVM training, needs experimental analysis. On the basis of the experiments, the most reasonable choice of the coordinate dimensions used for LibSVM is made, saving LibSVM training time and improving LibSVM partitioning precision.
The Coord_Dim experimental design
In selecting the coordinate dimensions used by LibSVM, we adopt the sequential forward selection algorithm, a feature selection algorithm with a heuristic search strategy, to choose the dimensions of Coord_Dim.
Sequential forward selection (SFS): this method is also called the set-growing method. It is a bottom-up search. The needed feature set is first initialized as an empty set, and one feature is added to it each time until the final feature set is reached. The process can be described as follows: let the set of all features be Q; given the current feature set X_di with d_i features, compute the criterion function J_j = J(X_di ∪ {P_j}) for each unselected feature P_j, select the feature that maximizes J_j, and add it to the set X_di. In effect, each step of the algorithm adds to the current set the one feature that maximizes the feature selection criterion. The algorithm considers the optimal feature subset found when the best addition degrades the feature set's performance or the maximum allowed number of features is reached.
The computational load of this algorithm is relatively small, but it does not fully consider the statistical correlation between the features, so a search from this angle suits only the few feature sets that satisfy specific conditions. For example, what its first step selects must be the single feature that maximizes the criterion function, and every later step selects the feature that best supplements the previous feature set. In practice, the best feature set very likely does not contain the feature with the largest individual contribution (criterion function value) but consists only of features whose individual contributions are quite ordinary, and this can happen at every step of the algorithm. In the present method, however, the space coordinate dimensions are relatively independent of one another, so this situation does not occur.
In LibSVM, the set Q is the 12-dimensional coordinates that Levaldi produces; the SFS algorithm is applied to these 12 coordinate dimensions to select the optimal feature dimensions.
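A minimal sketch of SFS as described above; `evaluate` stands for the criterion function J (for instance, the cross-validated partition accuracy obtained on a candidate dimension subset) and is an assumed callable:

```python
def sfs(features, evaluate):
    # features: candidate feature indices (e.g. the 12 coordinate dimensions)
    # evaluate: callable(subset) -> criterion value J(subset)
    selected, remaining = [], list(features)
    best_score = float('-inf')
    while remaining:
        score, feat = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:   # stop when no addition improves the criterion
            break
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score
```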
The Coord_Dim experimental results
Table 2 shows, for the three experiment groups, the final number of feature dimensions after feature selection, the finally selected set of feature dimensions, and the training time before feature selection versus the training time after it.
Table 3 shows, for the three experiment groups, the Web partition accuracy after feature selection and the Web partition accuracy without it.
Table 2 Feature dimensions determined after SFS
Table 3 Web partition accuracy comparison after SFS
(Both tables appear only as images in the original.)
Tables 2 and 3 show that the number of feature dimensions obtained by sequential forward selection is lower than the original scheme's coordinate dimension in every case: it is reduced by 33% under two of the topology models and by 25% under the Top-down Waxman-Barabasi model. The SVM training times shrink by 9.83%, 12.69%, and 7.68% respectively compared with the original feature space, while the loss in Web partitioning precision stays within ±0.5%; partitioning speed improves, partitioning time is saved, and high-quality Web partitioning is preserved.
(5) Web partitioning experimental results
To verify the performance of the LibSVM Web partitioning algorithm, experiments were designed to compare the LibSVM partition result from two aspects: first, measuring how tightly knit the partition sets are, using the set radius and the clustering criterion function Je (elaborated below); second, applying the partition result to the scheduling system of a distributed information collection system and measuring partition quality by the total network distance overhead in the system.
The experimental topology models are One-level Waxman, Top-down Waxman-Waxman, and Top-down Waxman-Barabasi & Albert; the experimental partitioning algorithms are TopK, Chainsaw, HONet, Binning, and IWAP. The five non-optimal partitionings serve as the comparison in the set radius and Je value phases; in the final total network distance comparison, the TOPK optimal scheduling is added to sharpen the comparison.
In each topology graph the different partitioning algorithms are compared experimentally, and the final conclusions are drawn.
4.5.1 Set radius and mean Je comparison results
Fig. 6 shows intuitively the set radius comparison of the five partitioning algorithms in the three network topology models; the horizontal-stripe bars show the trend for the One-level Waxman topology model, the diagonal bars the Top-Down Waxman-Waxman model, and the grid bars the Top-Down Waxman-Barabasi & Albert model.
As Fig. 6 shows, LibSVM's advantage in set radius is fairly clear: in the three topology models the set radius of LibSVM is respectively 19.1%, 32.4%, and 42.6% of Chainsaw's; 31.8%, 38.5%, and 40.6% of Binning's; 52.2%, 69.7%, and 58.7% of HONet's; and 64.5%, 56.8%, and 66.8% of IWAP's. LibSVM's clear advantage in set radius is attributed to two causes: first, the number of partition sets produced by the LibSVM algorithm is fixed, so the distance-based algorithms' phenomenon of a class without crawler nodes forcing crawlers to be allocated across regions does not arise; second, fixing the class count (at the number of crawler nodes) yields many classes, so the relative radius inevitably shrinks. In short, LibSVM holds a definite advantage in set radius.
Fig. 7 shows intuitively the Je value comparison of the five partitioning algorithms in the three network topology models; the horizontal-stripe bars show the trend for the One-level Waxman topology model, the diagonal bars the Top-Down Waxman-Waxman model, and the grid bars the Top-Down Waxman-Barabasi & Albert model.
As Fig. 7 shows, in all three topology models the Je value of the Chainsaw partitioning algorithm is markedly higher and outside the comparable range. The overall performance of the LibSVM algorithm remains strong: in the three topology models the Je value of LibSVM is respectively 39.16%, 67.25%, and 72.06% of Binning's, and 32.52%, 56.44%, and 47.84% of HONet's; against IWAP it is 86.18%, 109.01%, and 113.02%. LibSVM thus holds a definite advantage over the algorithms other than IWAP; compared with IWAP, LibSVM is slightly worse in the Top-down Waxman-Waxman and Top-down Waxman-Barabasi & Albert models, which relates both to IWAP, as a clustering algorithm, being better at controlling the Je value and to the type of topology model. Future work will further improve LibSVM to drive its Je value further down and sharpen its performance.
4.5.2 Cumulative total network distance comparison for the information collection system
Table 4 shows, under the three different topology models, the data comparison of the crawler system's total network distance in the distributed information collection system.
Fig. 8 shows intuitively the comparison of the crawler system's total network distance in the distributed information collection system under the three different topology models; the horizontal-stripe bars show the trend for the One-level Waxman topology model, the diagonal bars the Top-Down Waxman-Waxman model, and the grid bars the Top-Down Waxman-Barabasi & Albert model.
Table 4 Total crawler network distance overhead in the information collection system
As Table 4 and Fig. 8 show, the overall total network distance of LibSVM is slightly above that of the TOP-1 partitioning algorithm, slightly below those of the TOP-3 and IWAP partitioning algorithms, and well below those of Chainsaw, Binning, and HONet. In the three topology models the total network distance of LibSVM is respectively 116.5%, 110.2%, and 104.3% of TOP-1's, consuming on average 10.3% more total network distance than the TOP-1 algorithm; 98.7%, 109.6%, and 79.4% of TOP-3's, consuming on average 5.1% less than the TOP-3 algorithm; 79.6%, 72.7%, and 64.8% of Binning's, consuming on average 26.4% less than the Binning algorithm; 67.5%, 76.1%, and 70.3% of HONet's, consuming on average 28.7% less than the HONet algorithm; and 90.6%, 89.2%, and 91.3% of IWAP's, consuming on average 9.6% less than the IWAP algorithm. As stated earlier, the TOP-1 and TOP-3 algorithms would each require measured computation in a real system, which defeats the purpose of distance prediction, so they serve only as references; the Chainsaw algorithm performs too poorly to need comparison. In the cumulative total network distance comparison with the remaining algorithms, LibSVM beats them all, with a minimum margin of 9.6%: the total network distance of the information collection system drops markedly, the network load drops accordingly, and the crawler response speed and download rate improve.

Claims (1)

1. A Web partitioning method based on Support Vector Machines, characterized in that:
(1) all Web sites are divided into N groups;
(2) K = 1, 2, 3, ..., N is taken; for each value of K, the Web site samples of groups 1 to K−1 and K+1 to N are selected and the LibSVM training is initialized;
(3) the LibSVM training is run;
(4) the trained SVM model is stored;
(5) the Kth group of Web site samples is selected and the Web partitioning test is run;
(6) the Web partitioning test result is saved;
(7) if K ≤ N, steps (2) to (6) are repeated;
(8) if the partition result is below the expected result_sat, steps (1) to (7) are repeated; otherwise the program terminates: the Web partitioning is finished and the partition result is obtained.
CN2011102321925A 2011-08-15 2011-08-15 SVM (Support Vector Machine)-based Web partitioning method Pending CN102254040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102321925A CN102254040A (en) 2011-08-15 2011-08-15 SVM (Support Vector Machine)-based Web partitioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102321925A CN102254040A (en) 2011-08-15 2011-08-15 SVM (Support Vector Machine)-based Web partitioning method

Publications (1)

Publication Number Publication Date
CN102254040A true CN102254040A (en) 2011-11-23

Family

ID=44981304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102321925A Pending CN102254040A (en) 2011-08-15 2011-08-15 SVM (Support Vector Machine)-based Web partitioning method

Country Status (1)

Country Link
CN (1) CN102254040A (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AIXIN SUN et al.: "Web Classification Using Support Vector Machine", WIDM '02 *
许笑 et al.: "Research on Agent Coordination and Web Partitioning in Wide-Area-Network Distributed Crawlers", High Technology Letters *
魏一帆: "Research on Web Partitioning Technology for Distributed Information Collection Systems", China Master's Theses Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902706B (en) * 2014-03-31 2017-05-03 东华大学 Method for classifying and predicting big data on basis of SVM (support vector machine)
CN103970271A (en) * 2014-04-04 2014-08-06 浙江大学 Daily activity identifying method with exercising and physiology sensing data fused
CN104573720A (en) * 2014-12-31 2015-04-29 北京工业大学 Distributed training method for kernel classifiers in wireless sensor network
CN104573720B (en) * 2014-12-31 2018-01-12 北京工业大学 A kind of distributed training method of wireless sensor network center grader
CN113822432A (en) * 2021-04-06 2021-12-21 京东科技控股股份有限公司 Sample data processing method and device, electronic equipment and storage medium
CN113822432B (en) * 2021-04-06 2024-02-06 京东科技控股股份有限公司 Sample data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109496322B (en) Credit evaluation method and device and gradient progressive decision tree parameter adjusting method and device
CN113574327B (en) Method and system for controlling an environment by selecting a control setting
CN114721833B (en) Intelligent cloud coordination method and device based on platform service type
CN107203789B (en) Distribution model establishing method, distribution method and related device
CN107305637B (en) Data clustering method and device based on K-Means algorithm
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN105740424A (en) Spark platform based high efficiency text classification method
JP2013519152A (en) Text classification method and system
CN108280472A (en) A kind of density peak clustering method optimized based on local density and cluster centre
CN109376995A (en) Financial data methods of marking, device, computer equipment and storage medium
US11403550B2 (en) Classifier
CN109005130A (en) network resource allocation scheduling method and device
CN109300041A (en) Typical karst ecosystem recommended method, electronic device and readable storage medium storing program for executing
CN112684700A (en) Multi-target searching and trapping control method and system for swarm robots
CN106202092A (en) The method and system that data process
CN102254040A (en) SVM (Support Vector Machine)-based Web partitioning method
CN109389140A (en) The method and system of quick searching cluster centre based on Spark
US20220383036A1 (en) Clustering data using neural networks based on normalized cuts
CN103971136A (en) Large-scale data-oriented parallel structured support vector machine classification method
CN103218419B (en) Web tab clustering method and system
CN106326188B (en) Task dividing system and its method based on backward learning radius particle group optimizing
CN113988558B (en) Power grid dynamic security assessment method based on blind area identification and electric coordinate system expansion
CN108364030B (en) A kind of multi-categorizer model building method based on three layers of dynamic particles group's algorithm
Yin et al. Finding the informative and concise set through approximate skyline queries
CN114417095A (en) Data set partitioning method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111123