CN101526960B - support vector data description shell algorithm - Google Patents

Publication number
CN101526960B
Authority
CN
China
Prior art keywords
model
data
support vector
algorithm
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100824848A
Other languages
Chinese (zh)
Other versions
CN101526960A (en
Inventor
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN2009100824848A priority Critical patent/CN101526960B/en
Publication of CN101526960A publication Critical patent/CN101526960A/en
Application granted granted Critical
Publication of CN101526960B publication Critical patent/CN101526960B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a support vector data description shell algorithm. It addresses the problems and characteristics of integrating and combining data analysis models in a distributed environment and builds on the support vector data description (SVDD) algorithm. While the peripheral contour of the data is described, the system parameters are controlled so that potential support vectors inside the contour are retained and data that have no effect on future model integration and combination are deleted. The result is a hyperspherical shell of potential support vectors, of a certain thickness, that describes the characteristics of the data. The number of potential support vectors is balanced against model accuracy, so that the model is expressed accurately with as little data as possible and the overall risk of future model integration and combination is reduced. The algorithm applies to models that are integrated and combined by nodes in a distributed environment and to models discovered by analysis at different nodes.

Description

Support vector data description shell algorithm
Technical field
The present invention relates to data mining technology, and in particular to a support vector data description shell algorithm. It concerns data stream analysis and mining and support vector machine technology in a distributed environment, and focuses on the problem of integrating and combining models in a distributed environment.
Background art
At present, traditional data analysis follows a batch pattern: data are collected, stored on some medium, and then analyzed. With the rapid development of the information society, massive data can no longer be stored in full and analyzed by reading them more than once. A stream made up of a potentially endless set of data that arrives in time order, at an unpredictable rate, continuously and in large volume is called a data stream. Several data streams from different data sources form multiple data streams. In a computer network environment, multiple data streams that enter a distributed system through the network transmission medium are called distributed data streams.
Such distributed data streams are widespread in nature and in social life. For example, bank transaction data form a data stream model in a distributed environment. A bank has a typical organizational hierarchy: head office, branches, sub-branches, and transaction equipment (such as ATMs and POS machines), as shown in Fig. 1. A head office is established in each province or municipality; each city administers several towns, each of which has a branch. (For simplicity, the institutions below the branch level are not drawn in the figure.) Branch offices provide customers with service terminals such as ATMs. Here the data stream is generated and collected continuously and flows from the bottom up. The data collected by terminal devices are aggregated level by level to the sub-branch, the branch, and the head office, where they can be further consolidated. Data analysts can analyze the data at each level.
As another example, consider the architecture of anti-spam filtering, shown in Fig. 2. Because the e-mail network is distributed, a spammer can pose as an ordinary client and send spam from different places through different mail servers. Anti-spam filter software therefore needs to be deployed on the mail servers to screen out spam. The same structure also applies to other network security problems (such as intrusion detection to prevent malicious behavior) and to distributed computing problems: many data analysis nodes distributed around the world mine collaboratively, which can be used to discover models in fields such as astronomy, seismology, and meteorology.
The most prominent characteristic of a data stream is that it is potentially unbounded: over a relatively long period (for example several years or more than a decade), data arrive continuously, at high density and high speed. For such data, and for distributed data streams in particular, batch storage followed by analysis is inadequate. Its main drawbacks are that the required storage space cannot be estimated and that the data cannot be analyzed and processed in real time. The characteristics of data streams dictate that the corresponding analysis algorithms must be incremental (also called online). Moreover, in a distributed environment it is impractical to gather the distributed data streams in one place for analysis, so the models extracted by analysis must themselves be capable of being integrated and combined incrementally.
The support vector machine (SVM) algorithm is a pattern recognition algorithm based on statistical learning theory that was formally proposed in the 1990s. It has many advantages, including a simple model, strong generalization ability, and little need for prior knowledge, and it has a mature theoretical foundation and a broad range of applications. It requires no prior knowledge of the data distribution; by balancing model complexity against dependence on part of the data, it simplifies the model as far as possible so as to obtain the best generalization and prediction ability. It arrives at the optimal feasible solution by solving a quadratic programming problem and, for a given data set, yields a set of support vectors. The number of support vectors is far smaller than the number of input data, yet they describe the model accurately and thus summarize the data at a high level. The classical SVM, however, is a batch algorithm: all input data must be supplied to the system at once.
Existing incremental SVM algorithms for a single data stream maintain a data window that caches the most recent portion of the data. The contents of the window are updated continuously according to specific rules, which cover adding new data and deleting useless data. The system model is updated continuously at the same time. The main approaches are block-wise methods, error-driven methods, and methods based on analysis of the computation process.
The block-wise method divides the input data stream into blocks according to some rule. Once enough data have been collected to form a block, a batch SVM algorithm analyzes it and obtains the model of that block. The block model is then merged with the model obtained from earlier analysis, and the batch SVM algorithm is run again to obtain the overall model. The main drawbacks are that enough input data must accumulate to form a block and that every block requires two computations, which increases the computational load and slows the response time.
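The block-wise procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular prior-art system: the names `blockwise_update`, `Trainer`, and `extremes_1d` are hypothetical, and `extremes_1d` is only a toy stand-in for a batch SVM trainer that keeps the two extreme points of one-dimensional data.

```python
from typing import Callable, Iterable, List, Sequence

Point = Sequence[float]
# Hypothetical batch trainer: takes a data set, returns its support vectors.
Trainer = Callable[[List[Point]], List[Point]]


def blockwise_update(stream: Iterable[Point], block_size: int, train: Trainer) -> List[Point]:
    """Block-wise incremental learning: two batch runs for every full block."""
    model: List[Point] = []   # support vectors of the overall model so far
    block: List[Point] = []   # data accumulated for the current block
    for x in stream:
        block.append(x)
        if len(block) < block_size:
            continue                        # wait until the block is full
        block_model = train(block)          # first run: model of this block
        model = train(model + block_model)  # second run: merge with earlier model
        block = []
    return model                            # data in a partial block are still pending


def extremes_1d(data: List[Point]) -> List[Point]:
    """Toy stand-in for a batch SVM: keep the smallest and largest 1-D points."""
    if not data:
        return []
    lo, hi = min(data), max(data)
    return [lo] if lo == hi else [lo, hi]
```

With a block size of 3 and seven input points, the seventh point stays in the unfinished block and does not reach the model, which illustrates the accumulation delay named above as a drawback.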
The error-driven method keeps the model and counts the errors it makes. When the error statistics meet specified conditions, a batch SVM algorithm analyzes the current data set and obtains a new model. The main drawback is that the model may perform very poorly while errors accumulate, and the data cannot each be analyzed and processed in real time.
The computation-process analysis method analyzes the computation process of the SVM and, at the level of the computational method, overcomes the limitation that a traditional SVM algorithm must be trained on a whole set of data. For each input datum, this kind of algorithm adjusts the current model according to basic stability rules so that it accommodates the new datum, thereby obtaining an updated model. The basic principle is as follows: in the system formed by the original model and the new datum, once the system parameters have been adjusted slightly to a certain extent, the system undergoes a change of state. The algorithm computes the amount of parameter adjustment needed to reach this change of state, while the other parts of the system remain unchanged. Both the new system and the original system satisfy the basic stability rules. In this way each input datum is processed incrementally and efficiently.
The support vector data description (SVDD) algorithm is an outlier detection algorithm derived from the SVM algorithm. Through training, the system obtains a model of the peripheral contour of the data, called the description of the data; the core of determining this data description is still solving for the support vectors. Data inside the contour of the data description are normal data, and data outside it are outliers. Because the SVDD algorithm does not rely on prior knowledge of the data distribution, it can obtain a data description even when the distribution is unknown. With this algorithm the data are described as a whole, which helps remove noise. Because the algorithm is based on the SVM algorithm, the incremental SVM algorithms above can all be applied to it.
However, all of the above algorithms target a single data stream and ignore the characteristics of distributed systems. Although they extract a model (including support vectors) that summarizes the data stream, this information alone is not enough to represent the overall system. In an SVM algorithm, the parameter that balances model complexity against data dependence determines how many data points are tolerated even though they do not fit the model. In other words, there is often a small fraction of the data that does not influence the model. Consequently, when models from different data sources are integrated and combined, considering only the support vectors of the two models is likely to cause important information that represents the model of a given data source to be ignored, which biases the overall model. Although such data are not support vectors, they may become support vectors during future integration and combination; they are potential support vectors. On the other hand, in a distributed environment it is impractical for each data source to pass all of the data in its window, including the model, to other nodes. Furthermore, in some applications (banking, for example), data security considerations rule out transmitting large amounts of local data over the network.
Summary of the invention
The purpose of the invention is to avoid the above shortcomings of the prior art by providing a support vector data description shell algorithm. In view of the reasons above, and targeting the characteristics of model integration and combination in a distributed environment, the support vector data description shell (Support Vector Data Description-Shell) algorithm is proposed on the basis of the SVDD algorithm. While describing the peripheral contour of the data, the algorithm retains as many potential support vectors inside the contour as possible, to counteract bias in future integration and combination.
The technical means are as follows:
Support vector data description shell (SVDD-S) algorithm:
Let the input data stream be the sequence $\{(x_i, y_i) \mid i \in \mathbb{N}\}$, where $\mathbb{N}$ is the set of natural numbers, as in Fig. 3.
In the figure, $(x_i, y_i)$ is a data item in the data stream; $x_i \in \mathbb{R}^d$ is the value vector (feature vector) of $d$ attributes, where $\mathbb{R}$ is the set of real numbers.
Figure G2009100824848D00041
In particular, when $i$ takes only finitely many values, the algorithm also applies to batch data that are not data streams.
The goal of the algorithm is to describe the peripheral contour of the data and, while retaining the potential support vectors inside the contour, to remove useless interior data.
According to the SVDD algorithm, the data lie in the space $\mathbb{R}^d$. After mapping by the function $\Phi(x)$ associated with the kernel into some space $Z$, the peripheral contour of the data can be expressed as a hypersphere with center $\phi' = \Phi(\phi)$ and radius $R$. The support vectors $SV_R$ can be obtained from the following optimization problem:

$$
\begin{aligned}
\min\ & \Omega(R, \phi, \xi) = R^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t. } & \|x_i - \phi\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N
\end{aligned} \tag{7}
$$

where $\phi$ is the preimage in $\mathbb{R}^d$ of the hypersphere center $\phi'$; $R$ is the radius of the hypersphere; $\|\cdot\|$ is the $L_2$ norm defined on the space $Z$, with kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$; $C$ is the preset parameter that balances model complexity against data dependence and controls the number of outliers that can be tolerated; $\xi_i$ are slack variables; and $N$ is the number of sample data in the training set.
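To make the roles of $R$, $C$, and $\xi_i$ in problem (7) concrete, the sketch below evaluates the objective for a given candidate center and radius. It is an illustration only and rests on a simplifying assumption that is not in the source: $\Phi$ is taken to be the identity, so distances are ordinary Euclidean distances in $\mathbb{R}^d$. The function names are hypothetical, and the code does not solve the quadratic program.

```python
from typing import List, Sequence, Tuple


def sq_dist(x: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance (assumes Phi is the identity map)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def svdd_objective(data: List[Sequence[float]], center: Sequence[float],
                   radius: float, C: float) -> Tuple[float, List[float]]:
    """Value of R^2 + C * sum(xi_i) in (7) for a fixed center and radius.

    Each slack xi_i is the smallest value satisfying the constraints:
    xi_i = max(0, ||x_i - center||^2 - R^2).
    """
    r2 = radius ** 2
    slacks = [max(0.0, sq_dist(x, center) - r2) for x in data]
    return r2 + C * sum(slacks), slacks
```

A larger `C` makes each point outside the sphere more expensive, so the minimizer tolerates fewer outliers; a smaller `C` favors a smaller radius at the cost of more slack.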
On the other hand, inside the hypersphere, two concentric hyperspherical surfaces form a hypersphere shell of thickness $\delta$; its two-dimensional view is shown in Fig. 4.
The region between the inner and outer hyperspheres of the shell is the target region whose data need to be retained; the remaining regions contain data that can be discarded. The region outside the shell is determined by the SVDD algorithm, and the inner cavity of the shell and the shell itself are determined by the following model:
$$
\begin{aligned}
\min\ & \Omega(R, \varepsilon, \phi, \xi, \zeta) = \kappa R^2 + \varepsilon + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i \\
\text{s.t. } & \|x_i - \phi\|^2 \le R^2 + \xi_i, \\
& \|x_i - \phi\|^2 \ge R^2 - \varepsilon - \zeta_i, \\
& 0 \le \varepsilon \le R^2, \\
& \xi_i \ge 0, \quad \zeta_i \ge 0, \quad i = 1, \dots, N
\end{aligned} \tag{8}
$$

where $\phi$ is the preimage in $\mathbb{R}^d$ of the common center $\phi'$ of the inner and outer hyperspheres; $R$ is the radius of the outer hypersphere and $r$ is the radius of the inner hypersphere; $\|\cdot\|$ is the $L_2$ norm defined on the space $Z$, with kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. Let the difference between the outer and inner radii be $\delta$; then $\varepsilon = R^2 - r^2$ is a monotonically increasing function of $\delta$ and reflects the thickness of the hypersphere shell. $\kappa$ is the balance parameter between the inner and outer hyperspheres. $C_R$ is the preset parameter that balances model complexity against data dependence for the outer hypersphere and controls the number of outliers that can be tolerated; $\xi_i$ are the slack variables of the outer hypersphere. $C_r$ is the corresponding preset parameter for the inner hypersphere and controls the number of non-potential support vectors that may be removed; $\zeta_i$ are the slack variables of the inner hypersphere. $N$ is the number of sample data in the training set.
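The monotonic relation between $\varepsilon$ and $\delta$ can be checked directly. The short derivation below assumes that $\delta = R - r$ with $0 \le \delta \le R$ and that $R$ is held fixed; both are readings of the text rather than statements taken from it.

$$
\varepsilon = R^2 - r^2 = R^2 - (R - \delta)^2 = 2R\delta - \delta^2,
\qquad
\frac{d\varepsilon}{d\delta} = 2(R - \delta) \ge 0 \quad \text{for } 0 \le \delta \le R .
$$

At the endpoints, $\delta = 0$ gives $\varepsilon = 0$ (a shell of zero thickness) and $\delta = R$ gives $\varepsilon = R^2$ (no inner cavity), which matches the constraint $0 \le \varepsilon \le R^2$ in (8).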
Solving the above problem with quadratic programming methods and their optimized variants yields two groups of sample data that lie on the inner and outer hyperspherical surfaces of the shell, namely the support vector points. These two groups satisfy $\|x_i - \phi\|^2 = R^2$ and $\|x_i - \phi\|^2 = r^2$ respectively. The center, expressed as a linear combination of sample data, is obtained at the same time. For each input datum $x_k$ (including the sample data):
When $\|x_k - \phi\|^2 > R^2$, the datum is an outlier and can be discarded without affecting system performance. The number of outliers allowed can be reduced by raising $C_R$, to prevent too many normal points from being mistaken for outliers.
When $\|x_k - \phi\|^2 < r^2$, the datum is an interior point and can be discarded, because for the family of support vector algorithms this does not affect system performance within a controlled range. The number of interior points allowed can be increased by lowering $C_r$ so as to shrink the system model, although this also raises the risk of error in model integration and combination.
When $r^2 \le \|x_k - \phi\|^2 \le R^2$, the datum is a support vector point or a potential support vector point. These data become part of the model.
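The three cases above can be sketched in code. Because the center is obtained as a linear combination of sample data, the squared distance in $Z$ can be computed from kernel values alone. The sketch makes assumptions that are not in the source: a Gaussian kernel of the form $\exp(-\|x - y\|^2 / (2\sigma^2))$, combination coefficients named `alphas`, and hypothetical function names throughout.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def gaussian_kernel(x: Point, y: Point, sigma: float) -> float:
    """Assumed kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def sq_dist_to_center(x: Point, samples: List[Point],
                      alphas: List[float], sigma: float) -> float:
    """||Phi(x) - phi'||^2 where phi' = sum_i alpha_i * Phi(x_i), via kernel values."""
    k_xx = gaussian_kernel(x, x, sigma)
    cross = sum(a * gaussian_kernel(s, x, sigma) for a, s in zip(alphas, samples))
    center_norm = sum(ai * aj * gaussian_kernel(si, sj, sigma)
                      for ai, si in zip(alphas, samples)
                      for aj, sj in zip(alphas, samples))
    return k_xx - 2.0 * cross + center_norm


def classify(d2: float, R: float, r: float) -> str:
    """Apply the three cases to a squared distance d2."""
    if d2 > R ** 2:
        return "outlier"    # outside the outer hypersphere: discard
    if d2 < r ** 2:
        return "interior"   # inside the inner hypersphere: discard
    return "shell"          # support vector or potential support vector: keep
```

Points exactly on either surface fall in the "shell" case, in line with the non-strict inequalities of the third case.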
Solving the above problem yields the support vectors

$$SV_S = SV_R \cup SV_r \cup SV_p \tag{9}$$

together with the corresponding $\phi$, $R$, $r$ and $\varepsilon$. Here $SV_R$ are the support vectors of the outer hypersphere, $SV_r$ are the support vectors of the inner hypersphere, and $SV_p$ are the potential support vectors. Substituting $\varepsilon = R^2 - r^2$ shows that the objective function in formula (8) is equivalent to the following objective function:

$$\min\ \Omega'(R, \varepsilon, \phi, \xi, \zeta) = (\kappa + 1) R^2 - r^2 + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i \tag{10}$$

Formulas (8) and (10) differ at most by a constant, and the constraints remain unchanged. In the optimization problem, objective functions (8) and (10) yield equivalent solutions.
Let

$$\cos\theta = \frac{r}{R} \tag{11}$$

With $R$ fixed, $\cos\theta \in [0, 1]$ is a monotonically increasing function of $r$ (so $\theta$ itself decreases monotonically as $r$ grows) and reflects the risk incurred. When $\cos\theta = 0$, $r = 0$: no data inside the hypersphere shell are removed, and the risk to potential support vectors is zero. When $\cos\theta = 1$, $r = R$: only the support vectors $SV_R$ found by the SVDD algorithm are retained, and the risk to potential support vectors reaches its maximum.
From the above analysis, $(\phi, R, \theta)$ describes the characteristics and performance of the solution obtained. This makes it possible to compare models directly during future model integration and combination.
In summary, the model sought is expressed as

$$(\kappa, C_R, C_r;\ \sigma;\ SV_S;\ \phi, R, \theta) \tag{12}$$

where $(\kappa, C_R, C_r)$ are the values set (as constants) before the model is solved, $\sigma$ is the parameter of the kernel function, $SV_S$ are the support vectors obtained from the above model, and $(\phi, R, \theta)$ are the characteristics and performance of $SV_S$.
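One way to picture the model tuple is as a small immutable record that a node can package and send. The sketch below is illustrative only: the class name `ShellModel`, the helper `make_model`, the field names, and the choice to store the center as a coordinate tuple are all assumptions, not part of the source.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class ShellModel:
    """Hypothetical container for (kappa, C_R, C_r; sigma; SV_S; phi, R, theta)."""
    kappa: float
    C_R: float
    C_r: float
    sigma: float
    support_vectors: Tuple[Vector, ...]   # SV_S: outer, inner, and potential SVs
    center: Vector                        # phi, stored here as coordinates
    R: float                              # outer radius
    theta: float                          # risk angle, cos(theta) = r / R

    @property
    def inner_radius(self) -> float:
        """Recover r from R and theta."""
        return self.R * math.cos(self.theta)


def make_model(kappa: float, C_R: float, C_r: float, sigma: float,
               support_vectors: Sequence[Sequence[float]],
               center: Sequence[float], R: float, r: float) -> ShellModel:
    """Build the record from solver output, deriving theta from r and R."""
    if R <= 0.0 or not 0.0 <= r <= R:
        raise ValueError("expected 0 <= r <= R with R > 0")
    theta = math.acos(r / R)
    svs = tuple(tuple(v) for v in support_vectors)
    return ShellModel(kappa, C_R, C_r, sigma, svs, tuple(center), R, theta)
```

Storing $\theta$ rather than $r$ follows the tuple as written; two models with different radii can then be compared on the same $[0, \pi/2]$ scale.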
The distributed environment to which the support vector data description shell (SVDD-S) algorithm applies can be either a tree structure with a superior-subordinate hierarchy or a structure without such a hierarchy, as shown in Fig. 5 (corresponding to Fig. 1) and Fig. 6 (corresponding to Fig. 2) respectively. In the figures, data streams are shown as bottom-up arrows and model transfers as double-headed arrows. Model integration and combination are carried out at the data center and at each analysis node.
● A sensor is a terminal device with only very weak computing capability. It is mainly responsible for collecting data from the physical world; each sensor forms one data stream and continuously sends it to an analysis node.
● An analysis node is capable of analyzing multiple data streams. It carries out the analysis computation on multiple data streams, applies the support vector data description shell algorithm to discover a model, and acts according to the rules of the distributed system. In particular, when an analysis node corresponds to only one sensor, it analyzes a single data stream.
■ The algorithm executed at an analysis node comprises the following steps:
1. Obtain the preset (constant) model parameters $(\kappa, C_R, C_r)$ and the kernel function parameter $\sigma$.
2. Use batch or incremental quadratic programming methods and their optimized variants to solve the optimization problem of formula (8) or formula (10), obtaining model (12).
3. Package and transmit model (12) according to the rules of the distributed system.
1) If the receiving node allows incremental cooperation, send directly the incremental result newly obtained by this node's incremental algorithm;
2) If the receiving node allows only batch cooperation, send the models together when the receiving node requests them or at a specified time;
3) If the model needs to be kept so that this node can carry out autocorrelation analysis in the future, save the model.
4. Return to step 1 and check whether the model parameters have changed; if so, update them.
■ In a structure without a superior-subordinate hierarchy, after receiving the models discovered by other analysis nodes, an analysis node integrates and combines its local model with the received models and from them discovers the global model of its region.
■ In a tree structure, analysis nodes also exchange information and models with each other and pass the discovered models to the data center.
● In a tree structure, after receiving the models discovered by the analysis nodes, the data center node integrates and combines the models discovered by its subordinate nodes, discovers from them the global model of the region under its jurisdiction, and then acts according to the rules of the distributed system.
● While applying models and integrating and combining them, a data center node or analysis node tests their performance. When the performance fails to meet the specified requirements, the node starts the model parameter adjustment process and, following the structure of the distributed system, negotiates and adjusts the model parameters $(\kappa, C_R, C_r)$ of each node.
● Further integration proceeds in the same way.
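The four-step cycle executed at an analysis node can be sketched as below. Everything here is a hypothetical illustration: the classes `Receiver` and `AnalysisNode`, the `get_params` and `solve` callables, and the in-memory `inbox` that stands in for network transmission are assumptions. The solver is passed in as a black box, since the source does not prescribe a particular quadratic programming routine.

```python
from typing import Any, Callable, Dict, List, Optional

Params = Dict[str, float]
Solver = Callable[[List[Any], Params], Any]   # (data window, parameters) -> model


class Receiver:
    """Hypothetical receiving node; `incremental` is the cooperation mode it allows."""

    def __init__(self, incremental: bool) -> None:
        self.incremental = incremental
        self.inbox: List[Any] = []


class AnalysisNode:
    def __init__(self, get_params: Callable[[], Params], solve: Solver,
                 receiver: Receiver, keep_history: bool = False) -> None:
        self.get_params = get_params
        self.solve = solve
        self.receiver = receiver
        self.keep_history = keep_history
        self.params: Optional[Params] = None
        self.outbox: List[Any] = []    # models held for a batch-mode receiver
        self.history: List[Any] = []   # models kept for later local analysis

    def step(self, window: List[Any]) -> Any:
        latest = self.get_params()                # steps 1 and 4: read or refresh parameters
        if latest != self.params:
            self.params = dict(latest)
        model = self.solve(window, self.params)   # step 2: solve for the model
        if self.receiver.incremental:
            self.receiver.inbox.append(model)     # step 3, case 1): send at once
        else:
            self.outbox.append(model)             # step 3, case 2): hold for later
        if self.keep_history:
            self.history.append(model)            # step 3, case 3): save locally
        return model

    def flush(self) -> None:
        """Batch mode: send all held models together (on request or at a set time)."""
        self.receiver.inbox.extend(self.outbox)
        self.outbox.clear()
```

In this sketch a batch-mode receiver sees nothing until `flush` is called, which mirrors the rule that models are sent together on request or at a specified time.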
The support vector data description shell (SVDD-S) algorithm targets the characteristics of model integration and combination in a distributed environment. While describing the peripheral contour of the data, it retains as many potential support vectors inside the contour as possible, which reduces the risk of future integration and combination. By controlling the preset (constant) system parameters $(\kappa, C_R, C_r)$ and the kernel function parameter $\sigma$, it balances the number of potential support vectors against model accuracy, so that the model is expressed accurately with as little data as possible and the overall risk of model integration and combination is reduced.
The object of the invention can be achieved through the following measures:
A support vector data description shell algorithm: while describing the peripheral contour of the data, by controlling the system parameters, it retains the potential support vectors within a region of a certain thickness extending inward from the contour, removes the data that have no effect on future model integration and combination, and balances the number of potential support vectors against model accuracy.
On an analysis node in a distributed environment, the raw data collected by subordinate sensors are analyzed, and the other support vector algorithms needed locally carry out their work while the support vector data description shell algorithm maintains a model. These other support vector algorithms may be based on or depend on the support vector data description shell algorithm. The discovered model is then sent to the specified node, and update information for other models sent by other nodes is received.
The algorithm is formulated as

$$
\begin{aligned}
\min\ & \Omega(R, \varepsilon, \phi, \xi, \zeta) = \kappa R^2 + \varepsilon + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i \\
\Leftrightarrow \min\ & \Omega'(R, \varepsilon, \phi, \xi, \zeta) = (\kappa + 1) R^2 - r^2 + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i \\
\text{s.t. } & \|x_i - \phi\|^2 \le R^2 + \xi_i, \\
& \|x_i - \phi\|^2 \ge R^2 - \varepsilon - \zeta_i, \\
& 0 \le \varepsilon \le R^2, \\
& \xi_i \ge 0, \quad \zeta_i \ge 0, \quad i = 1, \dots, N
\end{aligned} \tag{13}
$$

where $\phi$ is the preimage in $\mathbb{R}^d$ of the common center $\phi'$ of the inner and outer hyperspheres; $R$ is the radius of the outer hypersphere and $r$ is the radius of the inner hypersphere. Taking

$$\varepsilon = R^2 - r^2 \tag{14}$$

reflects the thickness of the hypersphere shell. $\|\cdot\|$ is the $L_2$ norm defined on the space $Z$, with kernel function

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \tag{15}$$

$\kappa$ is the balance parameter between the inner and outer hyperspheres. $C_R$ is the preset parameter that balances model complexity against data dependence for the outer hypersphere and controls the number of outliers that can be tolerated; $\xi_i$ are the slack variables of the outer hypersphere. $C_r$ is the corresponding preset parameter for the inner hypersphere and controls the number of non-potential support vectors that may be removed; $\zeta_i$ are the slack variables of the inner hypersphere. $N$ is the number of sample data in the training set.
Quadratic programming methods and their optimized variants are used to obtain the model

$$(\kappa, C_R, C_r;\ \sigma;\ SV_S;\ \phi, R, \theta) \tag{16}$$

where $(\kappa, C_R, C_r)$ are the values set (as constants) before the model is solved, $\sigma$ is the parameter of the kernel function,

$$SV_S = SV_R \cup SV_r \cup SV_p \tag{17}$$

$SV_R$ are the support vectors of the outer hypersphere, $SV_r$ are the support vectors of the inner hypersphere, $SV_p$ are the potential support vectors, $(\phi, R, \theta)$ are the characteristics and performance of $SV_S$, and

$$\theta = \arccos\frac{r}{R} \tag{18}$$

is the risk of the potential support vectors contained in the data description shell.
The following steps are carried out at an analysis node:
1) Obtain the preset (constant) model parameters $(\kappa, C_R, C_r)$ and the kernel function parameter $\sigma$;
2) Use batch or incremental quadratic programming methods and their optimized variants to solve the optimization problem of formula (13), obtaining model (16);
3) Package and transmit model (16) according to the rules of the distributed system;
[1] If the receiving node allows incremental cooperation, send directly the incremental result newly obtained by this node's incremental algorithm;
[2] If the receiving node allows only batch cooperation, send the models together when the receiving node requests them or at a specified time;
[3] If the model needs to be kept so that this node can carry out autocorrelation analysis in the future, save the model;
4) Return to step 1) and check whether the model parameters have changed; if so, update them.
Compared with the prior art, the invention has the following advantages. While describing the peripheral contour of the data, the algorithm retains as many potential support vectors inside the contour as possible, to counteract bias in future integration and combination. By controlling the system parameters, it balances the number of potential support vectors against the amount of data discarded, so that the model is expressed accurately with as little data as possible and the risk of model integration and combination is reduced.
Description of drawings
Fig. 1 is a schematic diagram of the distributed structure of bank transaction data;
Fig. 2 is a schematic diagram of the network structure of anti-spam filtering;
Fig. 3 is a schematic diagram of a single data stream;
Fig. 4 is a two-dimensional schematic diagram of the hypersphere shell;
Fig. 5 is a schematic diagram of a distributed structure with a superior-subordinate hierarchy to which the support vector data description shell algorithm applies;
Fig. 6 is a schematic diagram of a distributed structure without a superior-subordinate hierarchy to which the support vector data description shell algorithm applies.
Specific embodiment
A concrete application example of the algorithm
In a distributed environment such as Fig. 5 (corresponding to Fig. 1, for example bank transaction data) or Fig. 6 (corresponding to Fig. 2, for example meteorological and seismological stations distributed around the world):
● A sensor is a physical detection device deployed in each region (site). Examples are a bank's ATMs and POS machines, and a weather station's thermometers, hygrometers, and satellite signal receivers.
● An analysis node is a general-purpose computing device deployed at each ordinary site. Examples are the server of an ordinary bank outlet, a mail server, and a meteorological, seismological, or satellite station. These computing devices do not need the very high computing power of large equipment such as a data center; they can be chips embedded in hardware or even software installed on an ordinary computer. Their main task is to gather the raw data samples collected by the sensors and analyze them. A particular analysis node A works according to the following process:
1. The subordinate sensors add input data (for example stock market transaction data or e-mail) to the data window of the analysis node one item at a time. In the training phase, A learns from the input training data and obtains a model (for example, which trading behavior is malicious, or which mail is spam). This learning continues incrementally (updating the model) in the later working phase. In the working phase, A makes judgments according to the current model and keeps the model unchanged. Input data that need not be kept in the data window can be removed.
2. The other algorithms needed locally carry out their work while the support vector data description shell algorithm maintains a model. These other algorithms may be based on or depend on the support vector data description shell algorithm. For example, on a server that detects spam, a program implementing this algorithm must be deployed in addition to the program that identifies spam.
3. Following the support vector data description shell algorithm, A sends its local model to the designated receiving node and receives parameter update instructions or models sent by the specified nodes. For example, in a banking environment, the analysis node of a sub-branch sends the model of its own branch to the data center of the superior head office and receives the instructions sent by the data center.
4. A integrates and combines models according to its role in the distributed system. If A has subordinate nodes, or if the structure has no superior-subordinate hierarchy, A needs to integrate and combine the incoming model data according to this algorithm.
● A data center (in a tree structure such as Fig. 5) is large-scale computing equipment deployed in an important department. Examples are the mainframe of a bank's head office and a meteorological, seismological, or satellite data processing center. No data center appears in a structure such as Fig. 6, which greatly reduces the cost of deployment. Its main task is to collect models from subordinate analysis nodes and to integrate and combine them into the global model of its region. When necessary, it sends this global model on to its superior.

Claims (3)

1. A support vector data description shell algorithm based on a distributed system architecture, characterized in that: the applicable distributed environment may be a tree structure with superior-subordinate relationships, or a structure without superior-subordinate relationships; model integration is carried out at the data center and at each analysis node; the data center is a special kind of analysis node in the tree structure;
An analysis node is capable of analyzing multiple data streams; it performs the analysis computation on multiple data streams, applies the support vector data description shell algorithm to discover a model, and acts according to the rules of the distributed system; in particular, when an analysis node corresponds to only one sensor, it analyzes a single data stream;
The functions of an analysis node are as follows:
A. In a structure without superior-subordinate relationships, an analysis node, after receiving the models discovered by other analysis nodes, integrates the local model with the received models and derives from them the global model of its region;
B. In a tree structure, analysis nodes also exchange information and models with one another, and pass the models they discover to the data center;
The following steps are carried out at each analysis node:
1) Obtain the preset constant parameters $(\kappa, C_R, C_r)$ of the model and the parameter $\sigma$ of the kernel function;
2) Solve the optimization problem of formula (1) using a batch or incremental quadratic programming method, or an optimized variant thereof, to obtain model (4);
3) Package and transmit model (4) according to the rules of the distributed system;
[1] If the receiving node allows incremental collaboration, directly send the incremental result newly obtained by this node's incremental algorithm;
[2] If the receiving node allows only batch collaboration, send the model in its entirety when the receiving node requests it, or at a specified time;
[3] If the model needs to be kept for future autocorrelation analysis, save it for this node's later use;
4) Return to step 1) and check whether the model parameters have changed; if so, update them;
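The transmission rules in step 3) can be illustrated with a minimal Python sketch. It is not part of the claim: the `Receiver` class, the `dispatch` function, and the `requested`, `scheduled`, and `archive` arguments are hypothetical names introduced only for illustration, and the model and increment are treated as opaque values.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Receiver:
    """Hypothetical stand-in for a receiving node (not defined in the claim)."""
    allows_incremental: bool
    inbox: List[Tuple[str, Any]] = field(default_factory=list)


def dispatch(model: Any, increment: Any, receiver: Receiver,
             requested: bool = False, scheduled: bool = False,
             archive: Optional[List[Any]] = None) -> Optional[Tuple[str, Any]]:
    """Apply rules [1]-[3] of step 3) and return what was sent, if anything."""
    sent = None
    if receiver.allows_incremental:
        # [1] incremental collaboration: forward only the new incremental result
        sent = ("increment", increment)
    elif requested or scheduled:
        # [2] batch-only collaboration: send the whole model on request
        #     or at the specified time
        sent = ("model", model)
    if sent is not None:
        receiver.inbox.append(sent)
    if archive is not None:
        # [3] keep a local copy for later autocorrelation analysis
        archive.append(model)
    return sent
```

A batch-only receiver therefore gets nothing until it asks for the model or the scheduled time arrives, whereas an incremental receiver is sent each new increment immediately.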
The functions of the data center are as follows:
A. In a tree structure, the data center node, after receiving the models discovered by the analysis nodes, integrates the models discovered by its subordinate nodes, derives from them the global model of the region under its jurisdiction, and then acts according to the rules of the distributed system;
B. While using and integrating models, a data center node or analysis node tests their performance; when the performance fails to meet the specified requirements, the node starts the model parameter adjustment process, negotiating and adjusting the model parameters $(\kappa, C_R, C_r)$ of each node according to the structure of the distributed system;
C. Further integration proceeds by analogy;
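The formulation solved by this algorithm is

$$
\begin{aligned}
& \min\ \Omega(R, \varepsilon, \phi, \xi, \zeta) = \kappa R^2 + \varepsilon + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i \\
\Leftrightarrow\ & \min\ \Omega'(R, \varepsilon, \phi, \xi, \zeta) = (\kappa + 1) R^2 - r^2 + C_R \sum_{i=1}^{N} \xi_i + C_r \sum_{i=1}^{N} \zeta_i
\end{aligned}
\tag{1}
$$

$$
\text{subject to}\quad
\begin{cases}
\|x_i - \phi\|^2 \le R^2 + \xi_i \\
\|x_i - \phi\|^2 \ge R^2 - \varepsilon - \zeta_i \\
0 \le \varepsilon \le R^2 \\
\xi_i \ge 0,\ \zeta_i \ge 0,\quad i = 1, \ldots, N
\end{cases}
$$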
where $\phi$ is the preimage, in the $d$-dimensional space in which the data lie, of the common center $\phi'$ of the inner and outer hyperspheres; $R$ is the radius of the outer hypersphere and $r$ is the radius of the inner hypersphere. Taking

$$
\varepsilon = R^2 - r^2 \tag{2}
$$

$\varepsilon$ reflects the thickness of the hyperspherical shell; $\|\cdot\|$ is the $L_2$-norm defined on the space $Z$, which satisfies the kernel function

$$
K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \tag{3}
$$

$\kappa$ is the balance parameter between the inner and outer hyperspheres; $C_R$ is the fixed parameter of the outer hypersphere that balances model complexity against data dependency, and controls the number of outliers that can be tolerated; $\xi_i$ are the slack variables of the outer hypersphere; $C_r$ is the fixed parameter of the inner hypersphere that balances model complexity against data dependency, and controls the number of non-potential support vectors that may be removed; $\zeta_i$ are the slack variables of the inner hypersphere; $N$ is the number of samples in the training set;
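To make the roles of the two slack variables concrete, the following Python sketch evaluates the objective of formula (1) for a fixed center and fixed radii. It is an illustration under stated assumptions, not the solver of the claim: it measures plain Euclidean distance in input space (that is, it assumes the identity feature map rather than a general kernel), and the function names `shell_slacks` and `shell_objective` are hypothetical.

```python
import numpy as np


def shell_slacks(X, center, R, r):
    """Smallest slacks that satisfy the constraints of (1) for fixed center, R, r.

    Assumes Euclidean distance in input space (identity feature map).
    """
    d2 = np.sum((np.asarray(X, dtype=float) - center) ** 2, axis=1)
    eps = R ** 2 - r ** 2                       # shell thickness, formula (2)
    xi = np.maximum(0.0, d2 - R ** 2)           # > 0 only outside the outer sphere
    zeta = np.maximum(0.0, (R ** 2 - eps) - d2) # > 0 only inside the inner sphere
    return eps, xi, zeta


def shell_objective(R, eps, xi, zeta, kappa, C_R, C_r):
    """Value of Omega in formula (1)."""
    return kappa * R ** 2 + eps + C_R * xi.sum() + C_r * zeta.sum()


# Three points: one inside the inner sphere, one in the shell, one outlier.
X = np.array([[0.0, 0.0], [1.5, 0.0], [0.0, 2.5]])
eps, xi, zeta = shell_slacks(X, center=np.zeros(2), R=2.0, r=1.0)
value = shell_objective(2.0, eps, xi, zeta, kappa=1.0, C_R=0.5, C_r=0.1)
```

In this example only the point beyond the outer sphere has a nonzero $\xi_i$ and only the point inside the inner sphere has a nonzero $\zeta_i$; the point lying in the shell incurs no penalty, which matches its role as a retained potential support vector.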
The quadratic programming method and its optimized variants are used to obtain the model

$$
(\kappa, C_R, C_r;\ \sigma;\ SV_S;\ \phi, R, \theta) \tag{4}
$$

where $(\kappa, C_R, C_r)$ are the constants set before the model is solved, $\sigma$ is the parameter of the kernel function,

$$
SV_S = SV_R \cup SV_r \cup SV_p \tag{5}
$$

$SV_R$ are the support vectors corresponding to the outer hypersphere, $SV_r$ are the support vectors corresponding to the inner hypersphere, $SV_p$ are the potential support vectors, and $(\phi, R, \theta)$ are the characteristics and performance of $SV_S$, with

$$
\theta = \arccos \frac{r}{R} \tag{6}
$$

being the risk of the potential support vectors contained in the data description shell.
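As a small worked illustration of formulas (2) and (6), the sketch below computes $\theta$ from the two radii and recovers the shell thickness from $R$ and $\theta$, using $r = R\cos\theta$ and hence $\varepsilon = R^2 - r^2 = R^2\sin^2\theta$. The function names are hypothetical and the input checks are an added assumption, not something the claim specifies.

```python
import math


def shell_angle(R: float, r: float) -> float:
    """theta = arccos(r / R), formula (6); requires 0 <= r <= R and R > 0."""
    if R <= 0 or not (0 <= r <= R):
        raise ValueError("expected 0 <= r <= R with R > 0")
    return math.acos(r / R)


def thickness_from_angle(R: float, theta: float) -> float:
    """eps = R^2 - r^2 with r = R * cos(theta), i.e. eps = R^2 * sin(theta)^2."""
    return (R * math.sin(theta)) ** 2


theta = shell_angle(R=2.0, r=1.0)          # arccos(0.5) = pi / 3
eps = thickness_from_angle(2.0, theta)     # 2^2 - 1^2 = 3
```

A shell of zero thickness ($r = R$) gives $\theta = 0$, and $\theta$ grows toward $\pi/2$ as the inner sphere shrinks, so a single angle summarizes how much of the outer sphere's interior the shell retains.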
2. The support vector data description shell algorithm based on a distributed system architecture as claimed in claim 1, characterized in that: while the outer contour of the data is described, the system parameters are controlled so that the potential support vectors within a region of a certain thickness extending inward from the contour are retained, data that have no influence on future model integration are removed, and the number of potential support vectors is balanced against model accuracy.
3. The support vector data description shell algorithm based on a distributed system architecture as claimed in claim 1, characterized in that: on an analysis node in the distributed environment, the raw data collected by subordinate sensors are analyzed; whatever other support vector algorithms are needed locally are run, while the support vector data description shell algorithm is also applied to maintain a model, where the other support vector algorithms may be based on, or depend on, the support vector data description shell algorithm; the discovered model is then sent to the designated node, and the update information on other models sent by other nodes is received.
CN2009100824848A 2009-04-21 2009-04-21 support vector data description shell algorithm Expired - Fee Related CN101526960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100824848A CN101526960B (en) 2009-04-21 2009-04-21 support vector data description shell algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100824848A CN101526960B (en) 2009-04-21 2009-04-21 support vector data description shell algorithm

Publications (2)

Publication Number Publication Date
CN101526960A CN101526960A (en) 2009-09-09
CN101526960B true CN101526960B (en) 2012-02-08

Family

ID=41094826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100824848A Expired - Fee Related CN101526960B (en) 2009-04-21 2009-04-21 support vector data description shell algorithm

Country Status (1)

Country Link
CN (1) CN101526960B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103078856B (en) * 2012-12-29 2015-04-22 大连环宇移动科技有限公司 Method for detecting and filtering application layer DDoS (Distributed Denial of Service) attack on basis of access marking
CN103544633A (en) * 2013-10-09 2014-01-29 五邑大学 SVDD (support vector data description) algorithm based user interest identification method
CN107025205B (en) * 2016-01-30 2021-06-22 华为技术有限公司 Method and equipment for training model in distributed system
CN107247968A (en) * 2017-07-24 2017-10-13 东北林业大学 Based on logistics equipment method for detecting abnormality under nuclear entropy constituent analysis imbalance data
CN113158183A (en) * 2021-01-13 2021-07-23 青岛大学 Method, system, medium, equipment and application for detecting malicious behavior of mobile terminal
CN115953399B (en) * 2023-03-13 2023-08-04 常州微亿智造科技有限公司 Industrial part structural defect detection method based on contour features and SVDD

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1687428A (en) * 2005-03-24 2005-10-26 上海交通大学 Method of soft predicting state variables of biofermentation process based on supporting vector machine
CN101216436A (en) * 2008-01-03 2008-07-09 东华大学 Fabric flaw automatic detection method based on Support Vector data description theory
CN101339553A (en) * 2008-01-14 2009-01-07 浙江大学 Approximate quick clustering and index method for mass data

Also Published As

Publication number Publication date
CN101526960A (en) 2009-09-09


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120208

Termination date: 20130421