CN103678512A - Data stream merge sorting method under dynamic data environment - Google Patents


Info

Publication number
CN103678512A
CN103678512A (application CN201310608553.0A)
Authority
CN
China
Prior art keywords
data
model
data stream
module
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310608553.0A
Other languages
Chinese (zh)
Inventor
姚远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Nationalities University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Nationalities University filed Critical Dalian Nationalities University
Priority to CN201310608553.0A priority Critical patent/CN103678512A/en
Publication of CN103678512A publication Critical patent/CN103678512A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of intelligent information processing and discloses a data stream hybrid classification method under a dynamic data environment. The method builds a data stream classification model using ensemble learning and a mixed-model framework, so that it can meet the requirements imposed by the massiveness, real-time arrival and dynamic change of data streams, and improves the accuracy of data stream classification. The ensemble learning model draws on ensemble learning theory and classifies with multiple classifiers, improving both the classification performance and the ability to adapt to the dynamic nature of the stream. In addition, the classification results are clustered with a clustering method, which effectively exploits the internal relationships among the results, helps to improve classification accuracy and shortens the time spent on classification.

Description

Data stream hybrid classification method under a dynamic data environment
Technical field
The present invention relates to the field of intelligent information processing, and in particular to a data stream hybrid classification method under a dynamic data environment, applicable to network intrusion detection, network security monitoring, sensor data monitoring, power supply monitoring and similar applications.
Background technology
With the development of the Internet of Things and the arrival of the "big data" era, traditional data mining methods face new challenges, among which the change in data form is the most important and fundamental. Traditional data mainly take a static form: their volume is finite, they can be stored, and they remain essentially unchanged. Traditional data mining algorithms are therefore usually designed under the assumption that the data are static, and the design effort focuses on the algorithm itself rather than on adapting to the form of the data.
In recent years, however, as informatization has deepened, a brand-new data form, the data stream, has gradually become the mainstream. Unlike static data, a data stream has three essential characteristics: massiveness, real-time arrival and dynamic change. Simply reusing traditional data mining methods therefore often fails to produce satisfactory results, or fails altogether. For this very reason, research on data stream mining has become a new research hotspot.
For the data stream classification problem, the key issue is to design a classification method that adapts to the characteristics of data streams (massiveness, real-time arrival and dynamic change). Specifically, compared with traditional classification methods, the massiveness of a data stream requires the method to train and classify without storing historical data; the real-time requirement means that, besides classification accuracy, the classification time must be optimized and compressed so that the whole classification process finishes, as far as possible, before new stream data arrive, which places new demands on the operational efficiency of the model; and the dynamic change of the stream requires the classification model to have a degree of extensibility and self-adaptation so that it can follow changes in the stream. For these reasons, designing a classification model that fully satisfies all three characteristics of data streams has long been a goal of the academic community, while most of the classification methods proposed so far satisfy only one or two of these characteristics and meet the classification requirements only to a limited extent.
At present, no classification method, domestic or international, fully adapts to the characteristics of data streams; a data stream hybrid classification method under a dynamic data environment is therefore urgently needed.
Summary of the invention
The object of the invention is to solve the above problems of the prior art by providing a data stream hybrid classification method under a dynamic data environment that can satisfy the massiveness, real-time and dynamic-change characteristics of data streams and meet the classification requirements.
To achieve the above object, the technical solution adopted by the invention is a data stream hybrid classification method under a dynamic data environment, comprising the following steps:
Step 1: a dynamic data stream collection module (102) collects data in chronological order from a massive real-time data stream (101).
Step 2: a data stream division module (103) reads the data collected in Step 1 and divides the data stream according to the temporal order of the samples. The data blocks produced by the data stream division module (103) contain three kinds of data sets: a training set, a validation set and a test set, each containing N data samples; N is a fixed value set in advance by the user.
Step 3: the three static data sets obtained from the data stream division module (103), namely the training set, test set and validation set, are input to a data initialization module (104), which normalizes them.
Step 4: the normalized training set output by the data initialization module (104) is input to an integrated classifier module (105), which trains on the training set and builds the integrated classifier.
Step 5: a parameter optimization module (106) performs parameter optimization on the integrated classifier model built in Step 4.
Step 6: the validation set processed by the data initialization module (104) is input to the integrated classifier optimized in Step 5; the resulting class labels form a label data set L.
Step 7: the data set L is input to a clustering module (107), and the clustering model it uses is trained on L.
Step 8: the test set data produced by the data initialization module (104) are input to the constructed hybrid classification model, completing the data stream classification process.
In Step 2, the division of the data stream by the data stream division module (103) comprises the following steps:
Step 2.1: first, the massive real-time data stream is converted to static form using a sliding-window technique; the window slides by n samples at a time, and each resulting static subset also contains n samples;
Step 2.2: the subsets obtained in Step 2.1 are shuffled by random sampling to produce three data sets, namely a training set, a test set and a validation set, where the training set and test set each have size 4n.
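The following Python sketch (not part of the original disclosure) illustrates Steps 2.1–2.2: cutting the stream into fixed-size static subsets with a sliding window and then drawing the data sets from whole subsets by random sampling. Function and variable names such as make_static_subsets are illustrative assumptions.

```python
# Illustrative sketch, not the patent's reference implementation.
import random
from typing import List, Sequence, Tuple

def make_static_subsets(stream: Sequence, window_size: int) -> List[list]:
    """Cut the time-ordered stream into consecutive blocks of window_size samples."""
    return [list(stream[i:i + window_size])
            for i in range(0, len(stream) - window_size + 1, window_size)]

def split_subsets(subsets: List[list], n_train: int, n_valid: int,
                  seed: int = 0) -> Tuple[list, list, list]:
    """Randomly assign whole subsets to training/validation; the rest form the test set."""
    rng = random.Random(seed)
    order = list(range(len(subsets)))
    rng.shuffle(order)
    train = [s for i in order[:n_train] for s in subsets[i]]
    valid = [s for i in order[n_train:n_train + n_valid] for s in subsets[i]]
    test = [s for i in order[n_train + n_valid:] for s in subsets[i]]
    return train, valid, test
```

With a window size of 10 and the 690-sample data set of Example 1 below, make_static_subsets yields the 69 subsets mentioned there.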
In Step 3, the data initialization module (104) normalizes the data with the MapMinMax method, comprising the following steps:
Step 3.1: for the obtained training set, test set and validation set, the values of each attribute are scanned to find the minimum and maximum value of every attribute;
Step 3.2: each attribute of the data sets is normalized; the normalization formula is:
y = (ymax − ymin) × (x_i − min(x_i)) / (max(x_i) − min(x_i)) + ymin
where x_i denotes the i-th attribute value of the current sample, min(x_i) and max(x_i) denote the minimum and maximum of the i-th attribute respectively, and ymax and ymin denote the upper and lower bounds of the normalized range; for normalization to the interval [0, 1], ymax is 1 and ymin is 0.
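A minimal sketch of the MapMinMax normalization just described, assuming per-attribute (per-column) scaling with NumPy; the function name map_min_max is an illustrative assumption.

```python
import numpy as np

def map_min_max(X: np.ndarray, ymin: float = 0.0, ymax: float = 1.0) -> np.ndarray:
    """Rescale every column (attribute) of X into [ymin, ymax]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (ymax - ymin) * (X - col_min) / span + ymin

# Worked value from Example 1 below: attribute range [50, 100], value 66 -> 0.32
print(map_min_max(np.array([[50.0], [66.0], [100.0]]))[1, 0])
```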
In Step 4, the integrated classifier module (105) uses support vector machine (SVM) models as base classifiers to classify the data stream and to build the integrated classifier, comprising the following steps:
Step 4.1: two kinds of SVM models are used as base classification models, namely the C-SVM and ν (nu)-SVM models;
Step 4.2: three kernel functions are combined with the above two SVM models, giving six different SVM classification models; the kernels used are the linear kernel, the Gaussian radial basis function (RBF) kernel and the sigmoid kernel;
Step 4.3: the resulting ensemble learning model is trained.
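One possible realization of the six base classifiers of Steps 4.1–4.2, assuming scikit-learn's SVC (C-SVM) and NuSVC (ν-SVM); the patent does not prescribe a particular library, and the parameter values shown are placeholders.

```python
from sklearn.svm import SVC, NuSVC

def build_base_classifiers(C: float = 1.0, nu: float = 0.5, gamma="scale") -> dict:
    """Six SVM models: {C-SVM, nu-SVM} x {linear, RBF, sigmoid} kernels (Model1..Model6)."""
    models = {}
    for kernel in ("linear", "rbf", "sigmoid"):
        models[f"C-SVM/{kernel}"] = SVC(C=C, kernel=kernel, gamma=gamma)
        models[f"nu-SVM/{kernel}"] = NuSVC(nu=nu, kernel=kernel, gamma=gamma)
    return models
```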
In Step 5, the parameter optimization module (106) performs parameter optimization on the constructed integrated classifier; the optimization method used is the particle swarm optimization (PSO) algorithm, and the optimization process comprises the following steps:
Step 5.1: first, the parameters c and g used in the classification model built from C-SVM with the Gaussian RBF kernel are extracted;
Step 5.2: the validation set normalized by the data initialization module (104) is input to this model, and the PSO algorithm is used to optimize the parameters; the fitness function used during optimization follows an m-fold cross-validation scheme and can be expressed as:
fitness = (1/m) × Σ_{i=1..m} ( l_i^t / l_i )
where the parameter m is the number of sample subsets drawn from the validation set, l_i denotes the number of samples in the i-th subset, and l_i^t denotes the number of correctly classified samples in that subset;
Step 5.3: the optimized parameters c and g are placed back into the model and used as internal model parameters.
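A sketch of one reading of the m-fold cross-validation fitness of Step 5.2 (average accuracy of a candidate C-SVM/RBF model with parameters c and g over m validation subsets); this interpretation, and names such as fitness and subsets, are assumptions rather than the patent's definition.

```python
import numpy as np
from sklearn.svm import SVC

def fitness(c: float, g: float, subsets) -> float:
    """subsets: iterable of (X_train, y_train, X_val, y_val) drawn from the validation set."""
    scores = []
    for X_train, y_train, X_val, y_val in subsets:
        model = SVC(C=c, kernel="rbf", gamma=g).fit(X_train, y_train)
        scores.append(np.mean(model.predict(X_val) == y_val))
    return float(np.mean(scores))  # (1/m) * sum(l_i^t / l_i)
```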
In Step 7, the clustering module (107) clusters the classification results produced by the integrated classifier, i.e. the data set L, to obtain the final classification result; the clustering method used is the self-organizing map (SOM), comprising the following steps:
Step 7.1: the SOM model is trained first, giving a trained SOM model;
Step 7.2: the test set is input to the constructed ensemble classification model, which outputs the class label data set corresponding to the test set;
Step 7.3: the class label data set is input to the trained SOM model; the model computes the distance between the input sample and each node and finds the activated node; the computation is as follows:
d_j = || x − w_j || = sqrt( Σ_i (x_i − w_ij)² ), the activated node being the node j with the smallest distance d_j
where x denotes the input sample and w_j denotes the weight vector of node j of the SOM model;
Step 7.4: Steps 7.2 and 7.3 are repeated until all data have been classified.
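A sketch of the winning-node (activated node) computation of Step 7.3, using the Euclidean distance reconstructed above; the array shapes and names are assumptions.

```python
import numpy as np

def winning_node(x: np.ndarray, weights: np.ndarray) -> int:
    """weights has shape (num_nodes, dim); returns the index of the closest node."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

# Worked numbers from Example 1 below: distance sqrt(2.71), roughly 1.65
x = np.array([1, 0, 1, 1, 1, 0], dtype=float)
w = np.array([[0.1, 0.5, 0.3, 0.4, 0.2, 0.4]])
print(np.linalg.norm(w - x, axis=1))
```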
The test set used in Step 2.2 is the set of subsets outside the validation set and the training set; its size equals the sliding-window size n, and the parameter n is set manually in advance.
The training method for the ensemble learning model used in Step 4.3 comprises the following sub-steps:
Step 4.3.1: the training set is first divided into six data subsets using an equal-split method;
Step 4.3.2: the divided subsets are input respectively into the six classifiers of the ensemble learning model for training.
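A sketch of Steps 4.3.1–4.3.2 under the assumption that the equal split is done by shuffling the training indices and cutting them into six equal parts (the embodiments below use random sampling); build_base_classifiers refers to the earlier illustrative sketch.

```python
import numpy as np

def train_ensemble(models: dict, X: np.ndarray, y: np.ndarray, seed: int = 0) -> dict:
    """Split the training data into len(models) equal subsets and train one model on each."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), len(models))
    for idx, model in zip(parts, models.values()):
        model.fit(X[idx], y[idx])
    return models
```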
The PSO optimization method used in Step 5.2 comprises the following sub-steps:
Step 5.2.1: the variables to be optimized are first assigned random values;
Step 5.2.2: the two quantities v[] and present[] are then updated repeatedly during the optimization process; the update rule is as follows,
v[] = v[] + c1 × rand() × (pbest[] − present[]) + c2 × rand() × (gbest[] − present[])
present[] = present[] + v[]
where v[] denotes the search velocity of the PSO algorithm, present[] denotes the position and direction of the current solution in the solution space, pbest[] and gbest[] denote the individual and global best positions found so far, rand() denotes a random function returning values in the range (0, 1), and the variables c1 and c2 denote the learning factors;
Step 5.2.3: the above steps are repeated until the fitness function in Step 5.2 is satisfied.
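A sketch of the PSO update of Step 5.2.2 for the two scalar parameters c and g, following the standard velocity/position rule reconstructed above; the default learning factors are taken from Example 1 below and are otherwise arbitrary.

```python
import random

def pso_step(present, velocity, pbest, gbest, c1: float = 0.3, c2: float = 0.4):
    """One PSO iteration over lists of parameter values; returns updated positions and velocities."""
    new_present, new_velocity = [], []
    for x, v, pb, gb in zip(present, velocity, pbest, gbest):
        v = v + c1 * random.random() * (pb - x) + c2 * random.random() * (gb - x)
        new_velocity.append(v)
        new_present.append(x + v)
    return new_present, new_velocity
```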
The training process of the SOM clustering model used in Step 7.1 comprises the following steps:
Step 7.1.1: the validation data set is first input into the ensemble learning classification model, yielding the class label data set L corresponding to the validation data;
Step 7.1.2: the SOM model is trained on the resulting class label data set.
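A bare-bones sketch of training the SOM on the label data set L (Steps 7.1.1–7.1.2). It uses a plain competitive-learning update without a neighbourhood function, so it is a simplification; the grid size, learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def train_som(L: np.ndarray, num_nodes: int = 4, lr: float = 0.5,
              epochs: int = 20, seed: int = 0) -> np.ndarray:
    """L has shape (num_samples, 6): one row of class labels per sample."""
    rng = np.random.default_rng(seed)
    weights = rng.random((num_nodes, L.shape[1]))
    for _ in range(epochs):
        for x in L:
            j = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # winning node
            weights[j] += lr * (x - weights[j])  # pull the winner toward the sample
    return weights
```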
The data stream (101) includes data from network intrusion detection, network security monitoring, sensor data monitoring, power supply monitoring and similar sources.
The beneficial effects of the invention are as follows. The invention builds the data stream classification model with ensemble learning and a mixed-model framework, so that it can adapt to the three characteristics of data streams, namely massiveness, real-time arrival and dynamic change, and improves the accuracy of data stream classification. The ensemble learning model draws on ensemble learning theory and classifies with multiple classifiers, improving both the classification performance and the ability to adapt to the dynamic nature of the stream. In addition, the clustering method aggregates the classification results and effectively exploits the internal relationships among them, which helps to improve classification accuracy and reduces the time spent on classification.
Brief description of the drawings
Fig. 1 is a flow block diagram of the data stream hybrid classification method under a dynamic data environment according to the invention.
Fig. 2 shows an embodiment of building the classifier with ensemble learning according to the invention.
Fig. 3 is a flowchart of converting a data set into a label set according to the invention.
Reference signs: 101 - data stream, 102 - data stream collection module, 103 - data stream division module, 104 - data initialization module, 105 - integrated classifier module, 106 - parameter optimization module, 107 - clustering module.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and embodiments.
With reference to Fig. 1, the framework of the data stream hybrid classification method under a dynamic data environment comprises a data stream (101), a data stream collection module (102), a data stream division module (103), a data initialization module (104), an integrated classifier module (105), a parameter optimization module (106) and a clustering module (107).
The data stream collection module (102) obtains streaming data from the data stream (101) in chronological order. The data stream (101) includes data streams of any type known to those of ordinary skill in the art, in particular network intrusion detection data streams, network security monitoring data streams, sensor monitoring data streams and power supply data streams. Because the stream is produced in real time and in huge volume, the data cannot be kept in physical storage and are deleted as soon as they have been used.
The data stream division module (103) obtains streaming samples from the data stream collection module (102) and, according to a manually preset sliding-window capacity, divides the stream by the temporal order of the samples into a number of static data subsets. These subsets contain the same number of samples and do not overlap. The subset size used by the data stream division module (103) is specified in advance by the user, and the raw data sets handled by the data initialization module (104), the integrated classifier module (105), the parameter optimization module (106) and the clustering module (107) are obtained from the division result of the data stream division module (103).
The data blocks produced by the data stream division module (103) are input into the data initialization module (104), which initializes them as follows: first, the original data blocks are normalized with an existing normalization method; then, by random sampling, two new data sets are obtained, namely a training set and a validation set. The training set is used to train the ensemble classification model, and the validation set is used to train the clustering model.
The training set obtained by the data initialization module (104) is input into the integrated classifier module (105), and the integrated classifier is trained. The integrated classifier module (105) uses support vector machine models as base classifiers and builds six different classification models by combining two SVM types (C-SVM and ν-SVM) with three kernel functions (linear, Gaussian RBF and sigmoid), training them on the training set.
The classifiers built by the integrated classifier module (105) are parameter-optimized by the parameter optimization module (106). First, the parameters c and g used in the classification model built from C-SVM with the Gaussian RBF kernel are extracted; then the validation data set normalized by the data initialization module (104) is input to this model and the PSO algorithm is used to optimize the parameters, where the fitness function is
fitness = (1/m) × Σ_{i=1..m} ( l_i^t / l_i )
in which the parameter m is the number of sample subsets drawn from the validation set, l_i denotes the number of samples in the i-th subset, and l_i^t denotes the number of correctly classified samples in that subset; finally, the optimized parameters c and g are placed back into the model and used as internal model parameters.
The validation data set is input into the optimized ensemble classifier module (105) to obtain the label data set, and the resulting label data set is then used to train the clustering module (107), completing the construction of the hybrid classification model.
The new data produced by the data stream (101) serve as test data: the data stream collection module (102) converts them to static form, the data stream division module (103) divides them into static data sets, and the data initialization module (104) normalizes the test sets; the processed data are input into the hybrid classification model built in the preceding steps, and the final classification result is obtained.
With reference to Fig. 2, an embodiment of building the classifier with ensemble learning is described. The multi-classifier is built on the idea of ensemble learning, using several support vector machine models to construct the ensemble classification model. Two SVM types (C-SVM and ν-SVM) are used and combined with three kernel functions (linear, Gaussian RBF and sigmoid) to construct six different classification models, which are integrated into the overall ensemble classification model.
With reference to Fig. 3, the flowchart of converting a data set into a label set is described. After the training data pass through the constructed ensemble classification model, each classifier gives a classification result, i.e. a class label. The ensemble classification model of the invention contains six classifiers, so each input sample receives six class labels after passing through the model. These labels correspond to the input data and are related to them through the ensemble classification model. The resulting class label data set serves as the input data of the clustering model, providing data support for the subsequent work.
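A sketch of the conversion shown in Fig. 3: every sample is passed through the six trained base classifiers and mapped to a six-dimensional label vector that becomes the input of the clustering model; the helper name to_label_set is an assumption.

```python
import numpy as np

def to_label_set(models: dict, X: np.ndarray) -> np.ndarray:
    """Returns an array of shape (len(X), len(models)): one class label per base classifier."""
    return np.column_stack([model.predict(X) for model in models.values()])
```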
Example 1
A data stream hybrid sorting process under dynamic data environment, specifically comprises the following steps:
Step 1: the Australian personal credit data set is used as the data stream; the dynamic data stream collection module (102) collects the data from the stream in chronological order. The data set contains 690 samples with 15 attributes, of which the first 14 are data attributes and the 15th is the class attribute; 55.5% of the samples carry class label "1" and 44.5% carry class label "0".
Step 2: the data stream division module (103) reads the Australian personal credit data stream and divides it according to the temporal order of the data;
The division of the data stream by the data stream division module (103) comprises the following steps:
Step 2.1: the data stream is first converted to static form with the sliding-window method, dividing the stream in chronological order; the window size is set to 10, giving 69 data subsets, of which the first 30 are taken out.
Step 2.2: the above subsets are sampled randomly to obtain the training set and the validation set, each containing 120 samples; the remaining 39 data subsets serve as the test set for the subsequent tests.
Step 3: the three static data sets obtained from the data stream division module (103), namely the training set, test set and validation set, are input into the data initialization module (104), which normalizes them;
The data initialization module (104) normalizes the data with the MapMinMax method, comprising the following steps:
Step 3.1: the attribute values are mapped to the interval [0, 1]. Referring to the MapMinMax formula, suppose the maximum and minimum of the 1st attribute are 100 and 50 respectively, ymax is 1, ymin is 0, and consider a sample whose attribute value is 66;
Step 3.2: the above attribute is normalized; after normalization it becomes
(1 − 0) × (66 − 50) / (100 − 50) + 0 = 0.32.
Step 4: the normalized training set from the data initialization module (104) is input into the integrated classifier module (105), which trains on it and builds the integrated classifier model.
Step 4.1: the support vector machine (SVM) model is adopted as the base classifier.
Step 4.2: six classification models are built with different classifiers and kernel functions, namely C-SVM with the linear kernel (Model1), C-SVM with the Gaussian RBF kernel (Model2), C-SVM with the sigmoid kernel (Model3), ν-SVM with the linear kernel (Model4), ν-SVM with the Gaussian RBF kernel (Model5) and ν-SVM with the sigmoid kernel (Model6).
Step 4.3: the resulting ensemble learning model is trained; the training method comprises the following sub-steps:
Step 4.3.1: the training set is divided by random sampling into six subsets (X1, X2, ..., X6) of equal size; the size can be set manually in advance, for example 100.
Step 4.3.2: each subset is input into the corresponding one of the six classifiers for training, completing the training process.
Step 5: the parameter optimization module (106) performs parameter optimization on Model2 of the integrated classifier model built in Step 4; the optimization method used is the particle swarm optimization (PSO) algorithm, and the optimization process comprises the following steps:
Step 5.1: the parameters c and g used in the classification model built from C-SVM with the Gaussian RBF kernel are first extracted;
Step 5.2: the normalized validation set is input into the ensemble classification model, and the PSO algorithm is used to optimize the parameters;
The PSO optimization process comprises the following sub-steps:
Step 5.2.1: the parameters c and g are assigned random values, say c = 0.5 and g = 0.7, which are then substituted into the model for classification;
Step 5.2.3: the fitness value is computed and checked against the requirement. Suppose the numbers of correctly classified samples are 60 for Model1, 80 for Model2, 30 for Model3, 50 for Model4, 78 for Model5 and 88 for Model6, with 100 samples in total:
fitness = (60 + 80 + 30 + 50 + 78 + 88) / (6 × 100) ≈ 64.3%
If the pre-set fitness threshold is 50%, the parameters meet the requirement and the optimization process ends.
If the pre-set fitness threshold is 80%, the parameters do not meet the requirement and the PSO algorithm is used to update them. Suppose the search velocity v[] is 0.6 and the learning factors c1 and c2 are 0.3 and 0.4 respectively; for parameter c, the current value present[] is 0.5, the random value rand() is 0.1, the individual best pbest[] is 0.5 and the global best gbest[] is 0.6:
v[] = 0.6 + 0.3 × 0.1 × (0.5 − 0.5) + 0.4 × 0.1 × (0.6 − 0.5) = 0.604; present[] = 0.5 + 0.604 = 1.104
The new value of parameter c is therefore 1.104; the update of parameter g is analogous and is not repeated here.
The above process is repeated until the fitness requirement is met, completing the optimization.
Step 6: the validation data set is input into the optimized ensemble classification model to obtain the label data set L;
Step 7: the label data set L is used to train the self-organizing map (SOM) clustering model; the training process is as follows:
Step 7.1: the SOM model is initialized with random values;
Step 7.2: suppose a vector in the class label data set is L_i = {l_1, l_2, ..., l_6} and the edge weights are w_j = {w_1j, w_2j, ..., w_6j};
Step 7.3: the activated node is computed. Suppose the current sample vector is {1, 0, 1, 1, 1, 0} and the edge weights are {0.1, 0.5, 0.3, 0.4, 0.2, 0.4}:
d = sqrt((1 − 0.1)² + (0 − 0.5)² + (1 − 0.3)² + (1 − 0.4)² + (1 − 0.2)² + (0 − 0.4)²) = sqrt(2.71) ≈ 1.65
The activated node is associated with the class of the input sample vector, completing the training of the SOM model.
Step 8: the test set data produced by the data initialization module (104) are input into the constructed hybrid classification model, completing the data stream classification process. The test set is classified as follows:
First step: the test set is input into the integrated classifier model, and the class label given by each sub-classifier is collected, yielding the class label data set L.
Second step: the class label data set L is input into the SOM model to find the activated node.
Third step: the class of the activated node is taken as the class of the data, completing the classification process.
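A quick numerical check of the worked values in Example 1, using the formulas reconstructed earlier; all input numbers are taken from the text above.

```python
correct = [60, 80, 30, 50, 78, 88]                              # correctly classified per model
fitness = sum(correct) / (6 * 100)                              # ~0.643, i.e. about 64.3 %
v_new = 0.6 + 0.3 * 0.1 * (0.5 - 0.5) + 0.4 * 0.1 * (0.6 - 0.5) # = 0.604
c_new = 0.5 + v_new                                             # = 1.104, as stated above
print(fitness, v_new, c_new)
```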
Example 2
A data stream hybrid sorting process under dynamic data environment, specifically comprises the following steps:
Step 1: the German personal credit data set is used as the data stream; the dynamic data stream collection module (102) collects the data from the stream in chronological order. The data set contains 1000 samples with 20 attributes, of which the first 19 are data attributes and the 20th is the class attribute; 70% of the samples carry class label "1" and 30% carry class label "0".
Step 2: the data stream division module (103) reads the German personal credit data stream and divides it according to the temporal order of the data;
The division of the data stream by the data stream division module (103) comprises the following steps:
Step 2.1: the data stream is first converted to static form with the sliding-window method, dividing the stream in chronological order; the window size is set to 10, giving 100 data subsets, of which the first 40 are taken out.
Step 2.2: the above subsets are sampled randomly to obtain the training set and the validation set, each containing 400 samples; the remaining 60 data subsets serve as the test set for the subsequent tests.
Step 3: the three static data sets obtained from the data stream division module (103), namely the training set, test set and validation set, are input into the data initialization module (104), which normalizes them;
The data initialization module (104) normalizes the data with the MapMinMax method, comprising the following steps:
Step 3.1: the attribute values are mapped to the interval [0, 1]. Referring to the MapMinMax formula, suppose the maximum and minimum of the 1st attribute are 350 and 120 respectively, ymax is 1, ymin is 0, and consider a sample whose attribute value is 136;
Step 3.2: the above attribute is normalized; after normalization it becomes (1 − 0) × (136 − 120) / (350 − 120) + 0 ≈ 0.07.
Step 4: the normalized training set from the data initialization module (104) is input into the integrated classifier module (105), which trains on it and builds the integrated classifier model.
Step 4.1: the support vector machine (SVM) model is adopted as the base classifier.
Step 4.2: six classification models are built with different classifiers and kernel functions, namely C-SVM with the linear kernel (Model1), C-SVM with the Gaussian RBF kernel (Model2), C-SVM with the sigmoid kernel (Model3), ν-SVM with the linear kernel (Model4), ν-SVM with the Gaussian RBF kernel (Model5) and ν-SVM with the sigmoid kernel (Model6).
Step 4.3: the resulting ensemble learning model is trained; the training method comprises the following sub-steps:
Step 4.3.1: the training set is divided by random sampling into six subsets (X1, X2, ..., X6) of equal size; the size can be set manually in advance, for example 300.
Step 4.3.2: each subset is input into the corresponding one of the six classifiers for training, completing the training process.
Step 5: the parameter optimization module (106) performs parameter optimization on Model2 of the integrated classifier model built in Step 4; the optimization method used is the particle swarm optimization (PSO) algorithm, and the optimization process comprises the following steps:
Step 5.1: the parameters c and g used in the classification model built from C-SVM with the Gaussian RBF kernel are first extracted;
Step 5.2: the normalized validation set is input into the ensemble classification model, and the PSO algorithm is used to optimize the parameters;
The PSO optimization process comprises the following sub-steps:
Step 5.2.1: the parameters c and g are assigned random values, say c = 12 and g = 15, which are then substituted into the model for classification;
Step 5.2.3: the fitness value is computed and checked against the requirement. Suppose the numbers of correctly classified samples are 100 for Model1, 200 for Model2, 250 for Model3, 247 for Model4, 232 for Model5 and 189 for Model6, with 300 samples in total:
fitness = (100 + 200 + 250 + 247 + 232 + 189) / (6 × 300) ≈ 67.7%
If the pre-set fitness threshold is 50%, the parameters meet the requirement and the optimization process ends.
If the pre-set fitness threshold is 90%, the parameters do not meet the requirement and the PSO algorithm is used to update them. Suppose the search velocity v[] is 0.45 and the learning factors c1 and c2 are 0.2 and 0.3 respectively; for parameter c, the current value present[] is 12, the random value rand() is 0.1, the individual best pbest[] is 12 and the global best gbest[] is 15:
v[] = 0.45 + 0.2 × 0.1 × (12 − 12) + 0.3 × 0.1 × (15 − 12) = 0.54; present[] = 12 + 0.54 = 12.54
The new value of parameter c is therefore 12.54; the update of parameter g is analogous and is not repeated here.
The above process is repeated until the fitness requirement is met, completing the optimization.
Step 6: the validation data set is input into the optimized ensemble classification model to obtain the label data set L.
Step 7: the label data set L is used to train the self-organizing map (SOM) clustering model; the training process is as follows:
Step 7.1: the SOM model is initialized with random values;
Step 7.2: suppose a vector in the class label data set is L_i = {l_1, l_2, ..., l_6} and the edge weights are w_j = {w_1j, w_2j, ..., w_6j};
Step 7.3: the activated node is computed. Suppose the current sample vector is {1, 1, 0, 0, 1, 0} and the edge weights are {0.7, 0.5, 0.8, 0.2, 0.6, 0.9}:
d = sqrt((1 − 0.7)² + (1 − 0.5)² + (0 − 0.8)² + (0 − 0.2)² + (1 − 0.6)² + (0 − 0.9)²) = sqrt(1.99) ≈ 1.41
The activated node is associated with the class of the input sample vector, completing the training of the SOM model.
Step 8: the test set data produced by the data initialization module (104) are input into the constructed hybrid classification model, completing the data stream classification process. The test set is classified as follows:
First step: the test set is input into the integrated classifier model, and the class label given by each sub-classifier is collected, yielding the class label data set L.
Second step: the class label data set L is input into the SOM model to find the activated node.
Third step: the class of the activated node is taken as the class of the data, completing the classification process.
The above is a further detailed description of the invention in connection with preferred technical solutions, and the specific implementation of the invention shall not be regarded as limited to these descriptions. For persons of ordinary skill in the technical field of the invention, simple deductions and substitutions may be made without departing from the concept of the invention, and all such should be considered to fall within the protection scope of the invention.

Claims (10)

1. A data stream hybrid classification method under a dynamic data environment, comprising the following steps:
Step 1: a dynamic data stream collection module (102) collects data in chronological order from a massive real-time data stream (101);
Step 2: a data stream division module (103) reads the data stream data of Step 1 and divides the data stream according to the temporal order of the data; the data blocks produced by the data stream division module (103) contain three kinds of data sets, namely a training set, a validation set and a test set, each containing N data samples, where N is a fixed value set in advance by the user;
Step 3: the three static data sets obtained from the data stream division module (103), namely the training set, test set and validation set, are input to a data initialization module (104), which normalizes them;
Step 4: the normalized training set output by the data initialization module (104) is input to an integrated classifier module (105), which trains on the training set and builds an integrated classifier model;
Step 5: a parameter optimization module (106) performs parameter optimization on the integrated classifier model of Step 4;
Step 6: the validation set processed by the data initialization module (104) is input to the integrated classifier optimized in Step 5, and the resulting class labels form a data set L;
Step 7: the data set L is input to a clustering module (107), and the clustering model it uses is trained;
Step 8: the test set data produced by the data initialization module (104) are input to the constructed hybrid classification model, completing the data stream classification process.
2. The data stream hybrid classification method under a dynamic data environment according to claim 1, characterized in that, in Step 2, the division of the data stream by the data stream division module (103) comprises the following steps:
Step 2.1: first, the massive real-time data stream is converted to static form with a sliding-window technique, the window sliding by n samples at a time and each static subset also containing n samples;
Step 2.2: the subsets obtained in Step 2.1 are shuffled by random sampling to obtain three data sets, namely a training set, a test set and a validation set, the training set and test set each having size 4n.
3. The data stream hybrid classification method under a dynamic data environment according to claim 1, characterized in that, in Step 3, the data initialization module (104) normalizes the data with the MapMinMax method, comprising the following steps:
Step 3.1: for the obtained training set, test set and validation set, the values of each attribute are scanned to find the minimum and maximum value of every attribute;
Step 3.2: each attribute of the data sets is normalized; the normalization formula is:
y = (ymax − ymin) × (x_i − min(x_i)) / (max(x_i) − min(x_i)) + ymin
where x_i denotes the i-th attribute value of the current sample, min(x_i) and max(x_i) denote the minimum and maximum of the i-th attribute respectively, and ymax and ymin denote the upper and lower bounds of the normalized range; for normalization to the interval [0, 1], ymax is 1 and ymin is 0.
4. The data stream hybrid classification method under a dynamic data environment according to claim 1, characterized in that, in Step 4, the integrated classifier module (105) uses support vector machine models as base classification models to classify the data stream and builds the integrated classifier, comprising the following steps:
Step 4.1: two kinds of support vector machine models are used as base classification models, namely the C-SVM and ν (nu)-SVM models;
Step 4.2: three kernel functions are combined with the above two support vector machine models, giving six different SVM classification models, the kernel functions used being the linear kernel, the Gaussian radial basis function kernel and the sigmoid kernel;
Step 4.3: the resulting ensemble learning model is trained.
5. The data stream hybrid classification method under a dynamic data environment according to claim 1, characterized in that, in Step 5, the parameter optimization module (106) performs parameter optimization on the constructed integrated classifier, the optimization method used being the particle swarm optimization algorithm, the optimization process comprising the following steps:
Step 5.1: the parameters c and g used in the classification model built from C-SVM with the Gaussian radial basis function kernel are first extracted;
Step 5.2: the validation data set normalized by the data initialization module (104) is input to this model, and the PSO algorithm is used to optimize the parameters, the fitness function in the optimization process following an m-fold cross-validation scheme expressed as:
fitness = (1/m) × Σ_{i=1..m} ( l_i^t / l_i )
where the parameter m is the number of sample subsets drawn from the validation set, l_i denotes the number of samples in the i-th subset, and l_i^t denotes the number of correctly classified samples in that subset;
Step 5.3: the optimized parameters c and g are placed into the model and used as internal model parameters.
6. The data stream hybrid classification method under a dynamic data environment according to claim 1, characterized in that, in Step 7, the clustering module (107) clusters the classification results provided by the integrated classifier, i.e. the data set L, to obtain the final classification result, the clustering method used being the self-organizing map, comprising the following steps:
Step 7.1: the SOM model is trained first, giving a trained SOM model;
Step 7.2: the test set is input into the constructed ensemble classification model to obtain the class label data set corresponding to the test set;
Step 7.3: the class label data set is input into the trained SOM model, which computes the distance between the input sample and each node to find the activated node; the computation is as follows:
d_j = || x − w_j || = sqrt( Σ_i (x_i − w_ij)² ), the activated node being the node j with the smallest distance d_j,
where x denotes the input sample and w_j denotes the weight vector of node j of the SOM model;
Step 7.4: Steps 7.2 and 7.3 are repeated until all data have been classified.
7. The data stream hybrid classification method under a dynamic data environment according to claim 2, characterized in that the test set used in Step 2.2 is the set outside the validation set and the training set, its size being equal to the sliding-window size n, the parameter n being set manually in advance.
8. The data stream hybrid classification method under a dynamic data environment according to claim 4, characterized in that the training method of the ensemble learning model used in Step 4.3 comprises the following sub-steps:
Step 4.3.1: the training set is first divided into six data subsets using an equal-split method;
Step 4.3.2: the divided subsets are input respectively into the six classifiers of the ensemble learning model for training.
9. The data stream hybrid classification method under a dynamic data environment according to claim 5, characterized in that the PSO optimization method used in Step 5.2 comprises the following sub-steps:
Step 5.2.1: the variables to be optimized are first assigned random values;
Step 5.2.2: the two quantities v[] and present[] are then updated repeatedly during the optimization process, the update rule being:
v[] = v[] + c1 × rand() × (pbest[] − present[]) + c2 × rand() × (gbest[] − present[]), present[] = present[] + v[],
where v[] denotes the search velocity of the PSO algorithm, present[] denotes the position and direction of the current solution in the solution space, pbest[] and gbest[] denote the individual and global best positions, rand() denotes a random function returning values in the range (0, 1), and the variables c1 and c2 denote the learning factors;
Step 5.2.3: the above steps are repeated until the fitness function in Step 5.2 is satisfied.
10. The data stream hybrid classification method under a dynamic data environment according to claim 6, characterized in that the training process of the SOM clustering model used in Step 7.1 comprises the following steps:
Step 7.1.1: the validation data set is first input into the ensemble learning classification model to obtain the class label data set L corresponding to the validation data;
Step 7.1.2: the SOM model is trained with the resulting class label data set.
CN201310608553.0A 2013-12-26 2013-12-26 Data stream merge sorting method under dynamic data environment Pending CN103678512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310608553.0A CN103678512A (en) 2013-12-26 2013-12-26 Data stream merge sorting method under dynamic data environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310608553.0A CN103678512A (en) 2013-12-26 2013-12-26 Data stream merge sorting method under dynamic data environment

Publications (1)

Publication Number Publication Date
CN103678512A true CN103678512A (en) 2014-03-26

Family

ID=50316057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310608553.0A Pending CN103678512A (en) 2013-12-26 2013-12-26 Data stream merge sorting method under dynamic data environment

Country Status (1)

Country Link
CN (1) CN103678512A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106100885A (en) * 2016-06-23 2016-11-09 浪潮电子信息产业股份有限公司 A kind of network security warning system and design
WO2018054342A1 (en) * 2016-09-22 2018-03-29 华为技术有限公司 Method and system for classifying network data stream
CN107948147A (en) * 2017-08-31 2018-04-20 上海财经大学 Network connection data sorting technique
CN107958327A (en) * 2017-11-21 2018-04-24 国网四川省电力公司凉山供电公司 A kind of project process Risk Forecast Method based on factorial analysis and SOM networks
CN109510811A (en) * 2018-07-23 2019-03-22 中国科学院计算机网络信息中心 Intrusion detection method, device and storage medium based on data packet
CN109541639A (en) * 2018-12-25 2019-03-29 天津珞雍空间信息研究院有限公司 A kind of inversion boundary layer height method based on particle cluster
CN109543746A (en) * 2018-11-20 2019-03-29 河海大学 A kind of sensor network Events Fusion and decision-making technique based on node reliability
CN110796200A (en) * 2019-10-30 2020-02-14 深圳前海微众银行股份有限公司 Data classification method, terminal, device and storage medium
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN113033683A (en) * 2021-03-31 2021-06-25 中南大学 Industrial system working condition monitoring method and system based on static and dynamic joint analysis
CN113114266A (en) * 2021-04-30 2021-07-13 上海智大电子有限公司 Real-time data simplifying and compressing method for comprehensive monitoring system
CN113205184A (en) * 2021-04-28 2021-08-03 清华大学 Invariant learning method and device based on heterogeneous hybrid data
CN114615207A (en) * 2022-03-10 2022-06-10 四川三思德科技有限公司 Method and device for oriented processing of data before plug flow

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060004754A1 (en) * 2004-06-30 2006-01-05 International Business Machines Corporation Methods and apparatus for dynamic classification of data in evolving data stream
CN103020288A (en) * 2012-12-28 2013-04-03 大连理工大学 Method for classifying data streams under dynamic data environment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060004754A1 (en) * 2004-06-30 2006-01-05 International Business Machines Corporation Methods and apparatus for dynamic classification of data in evolving data stream
CN103020288A (en) * 2012-12-28 2013-04-03 大连理工大学 Method for classifying data streams under dynamic data environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姚远: "Research on Classification Methods for Massive Dynamic Data Streams" (海量动态数据流分类方法研究), China Doctoral Dissertations Full-text Database (中国博士学位论文全文数据库) *
郑伟平: "Pattern Classification and Its Applications in Dynamic Data Environments" (动态数据环境下的模式分类及应用), Journal of Computer Applications (计算机应用) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106100885A (en) * 2016-06-23 2016-11-09 浪潮电子信息产业股份有限公司 A kind of network security warning system and design
US10999175B2 (en) 2016-09-22 2021-05-04 Huawei Technologies Co., Ltd. Network data flow classification method and system
WO2018054342A1 (en) * 2016-09-22 2018-03-29 华为技术有限公司 Method and system for classifying network data stream
CN107948147A (en) * 2017-08-31 2018-04-20 上海财经大学 Network connection data sorting technique
CN107948147B (en) * 2017-08-31 2020-01-17 上海财经大学 Network connection data classification method
CN107958327A (en) * 2017-11-21 2018-04-24 国网四川省电力公司凉山供电公司 A kind of project process Risk Forecast Method based on factorial analysis and SOM networks
CN109510811A (en) * 2018-07-23 2019-03-22 中国科学院计算机网络信息中心 Intrusion detection method, device and storage medium based on data packet
CN109543746A (en) * 2018-11-20 2019-03-29 河海大学 A kind of sensor network Events Fusion and decision-making technique based on node reliability
CN109541639A (en) * 2018-12-25 2019-03-29 天津珞雍空间信息研究院有限公司 A kind of inversion boundary layer height method based on particle cluster
CN110796200A (en) * 2019-10-30 2020-02-14 深圳前海微众银行股份有限公司 Data classification method, terminal, device and storage medium
CN111464529A (en) * 2020-03-31 2020-07-28 山西大学 Network intrusion detection method and system based on cluster integration
CN113033683A (en) * 2021-03-31 2021-06-25 中南大学 Industrial system working condition monitoring method and system based on static and dynamic joint analysis
CN113205184A (en) * 2021-04-28 2021-08-03 清华大学 Invariant learning method and device based on heterogeneous hybrid data
CN113114266A (en) * 2021-04-30 2021-07-13 上海智大电子有限公司 Real-time data simplifying and compressing method for comprehensive monitoring system
CN114615207A (en) * 2022-03-10 2022-06-10 四川三思德科技有限公司 Method and device for oriented processing of data before plug flow
CN114615207B (en) * 2022-03-10 2022-11-25 四川三思德科技有限公司 Method and device for oriented processing of data before plug flow

Similar Documents

Publication Publication Date Title
CN103678512A (en) Data stream merge sorting method under dynamic data environment
CN107846392B (en) Intrusion detection algorithm based on improved collaborative training-ADBN
Bhattacharya et al. From smart to deep: Robust activity recognition on smartwatches using deep learning
CN110472817A (en) A kind of XGBoost of combination deep neural network integrates credit evaluation system and its method
CN104063472B (en) KNN text classifying method for optimizing training sample set
CN104866829A (en) Cross-age face verify method based on characteristic learning
CN105868775A (en) Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN106973057A (en) A kind of sorting technique suitable for intrusion detection
Huang et al. A graph neural network-based node classification model on class-imbalanced graph data
CN104391860A (en) Content type detection method and device
CN103679012A (en) Clustering method and device of portable execute (PE) files
CN109446986A (en) A kind of validity feature extraction and wood recognition method towards trees laser point cloud
CN110225055A (en) A kind of network flow abnormal detecting method and system based on KNN semi-supervised learning model
CN106095101A (en) Human bodys' response method based on power-saving mechanism and client
CN104933445A (en) Mass image classification method based on distributed K-means
CN104156729B (en) A kind of classroom demographic method
CN107944460A (en) One kind is applied to class imbalance sorting technique in bioinformatics
CN107562722A (en) Internet public feelings monitoring analysis system based on big data
CN110288028A (en) ECG detecting method, system, equipment and computer readable storage medium
CN103473556A (en) Hierarchical support vector machine classifying method based on rejection subspace
CN104268507A (en) Manual alphabet identification method based on RGB-D image
CN102103691A (en) Identification method for analyzing face based on principal component
CN106650558A (en) Facial recognition method and device
CN104966075A (en) Face recognition method and system based on two-dimensional discriminant features
CN103077405A (en) Bayes classification method based on Fisher discriminant analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140326