CN103577602A - Secondary clustering method and system - Google Patents


Info

Publication number
CN103577602A
Authority
CN
China
Prior art keywords
data
reference point
cluster
data stream
density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310581217.1A
Other languages
Chinese (zh)
Inventor
侯德龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201310581217.1A
Publication of CN103577602A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24561 Intermediate data storage techniques for performance improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a secondary clustering method applied in the technical field of data mining. The method comprises the following steps: partitioning a data stream into blocks and reading in the data blocks; clustering each block with the DBSCAN algorithm to obtain density cluster reference points; clustering the obtained density cluster reference points with the k-means algorithm; and storing the k-means reference points obtained by the k-means clustering in a layered structure. The method effectively improves the clustering quality, clustering precision and efficiency for data streams that are irregularly distributed and contain noise.

Description

Secondary clustering method and system
Technical field
The invention belongs to the technical field of data stream mining, and in particular relates to a secondary clustering method and system.
Background art
In recent years, with the development of hardware technology, more and more applications produce data streams. Unlike traditional static data stored on disk, a data stream is a new class of data object: it is unbounded, continuous, ordered, fast-changing and massive. Typical data streams include monitoring data from network and road-traffic monitoring systems, call records of telecommunications departments, monitoring data returned by sensors, stock price data from stock exchanges, and environmental temperature monitoring data. These characteristics mean that, when a data stream is processed, the data can be scanned only once or twice and only a small amount of data can be stored temporarily. Many mature data mining, data analysis and data query techniques therefore become inapplicable to data streams, and new solutions are needed.
The data stream problem has accordingly attracted the attention of researchers since it first appeared, and many research results have been produced; data streams have been studied from many aspects, such as management, querying, analysis and mining algorithms. Data stream mining is a new problem in the data mining field, and many mining algorithms need to be adapted for data streams. Data stream clustering, as an important research direction in data stream mining, likewise faces great challenges and has attracted wide attention from researchers; many related research results have appeared and have been applied in practice.
Traditional clustering is built on the database operation model: a traditional database can store all the data and supports complex query operations. Under the database model, traditional methods can therefore read the data repeatedly and perform operations such as random access in order to cluster the stored data. In a data stream environment, however, these operations are infeasible, and the characteristics of data streams make it difficult (or even impossible) to apply traditional clustering algorithms directly to data stream clustering.
Compared with traditional clustering methods, a data stream clustering algorithm should therefore have the following characteristics:
First, it uses limited memory and storage space. A data stream is continuous and unbounded, and its total data volume far exceeds the space (main memory) available to the clustering algorithm, so storing the whole stream is infeasible and indeed impossible. A data stream clustering algorithm cannot store every data object that needs processing; it can only summarize the data or selectively discard data to keep the space used limited and reasonable.
Second, it processes the data incrementally with a linear scan, or in a single scan. For the very large volumes of data in a data stream, a linear scan is the only effective way to read the data, and random reads are computationally quite expensive. Moreover, even multiple linear scans of the data in a stream are costly, because the data are usually stored on external devices with slow read speeds. Furthermore, in many data stream environments the data change very quickly and need not be stored: they must be processed as they are produced and are then discarded. A data stream clustering algorithm should therefore scan the data only once, or at least process them incrementally with linear scans.
Third, it processes data records in real time. The data in a stream change very quickly, and the requirement on response speed is high. The procedure that a data stream clustering algorithm uses to process data records must therefore be fast, so that no record that needs processing is missed.
However, most known data stream clustering algorithms are suited to data with a specific distribution and are rather sensitive to noise. Data streams in practical applications are mostly irregularly distributed and contain noise, which makes it difficult for existing data stream clustering algorithms to achieve satisfactory clustering quality.
Summary of the invention
The invention provides a secondary clustering method and system to solve the above problem.
The invention provides a secondary clustering method, comprising the following steps:
partitioning a data stream into blocks and reading in the data blocks;
clustering with the DBSCAN algorithm to obtain density cluster reference points;
clustering the obtained density cluster reference points with the k-means algorithm, and storing the k-means reference points obtained by the k-means clustering in a layered structure.
The invention also provides a secondary clustering system, comprising: a block reading module, a density cluster reference point acquisition module and a k-means reference point acquisition module; the block reading module is connected to the k-means reference point acquisition module through the density cluster reference point acquisition module;
the block reading module is configured to partition a data stream into blocks and read in the data blocks;
the density cluster reference point acquisition module is configured to cluster with the DBSCAN algorithm to obtain density cluster reference points;
the k-means reference point acquisition module is configured to cluster the obtained density cluster reference points with the k-means algorithm and to store the k-means reference points obtained by the k-means clustering in a layered structure.
The invention proposes a secondary clustering method: the data stream is first partitioned into blocks, and DBSCAN clustering of each block produces an intermediate clustering result, the set of density cluster reference points; these density cluster reference points are then clustered with k-means, and the cluster reference points obtained from each clustering are stored in a layered structure. This effectively improves the clustering quality, clustering precision and efficiency for data streams that are irregularly distributed and contain noise.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the invention and form part of this application; the illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is the processing flowchart of Embodiment 1 of the invention;
Fig. 2 is the system implementation diagram of Embodiment 2 of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
Fig. 1 is the processing flowchart of Embodiment 1 of the invention, which comprises the following steps:
Step 101: partition the data stream into blocks and read in the data blocks;
Step 102: cluster with the DBSCAN algorithm to obtain density cluster reference points;
Step 103: cluster the obtained density cluster reference points with the k-means algorithm, and store the k-means reference points obtained by the k-means clustering in a layered structure.
The process of partitioning the data stream into blocks and reading in the data blocks is as follows: within a sliding window, the data stream is processed block by block in a loop and the final clustering result is obtained; if the data have not all been processed, the next data block is read in, until the whole data stream has been processed.
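The block-reading loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name read_blocks and the block-size parameter m are introduced here for illustration, and the sketch lets the last block be shorter than m when the stream ends, a case the description does not specify.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence

Point = Sequence[float]


def read_blocks(stream: Iterable[Point], m: int) -> Iterator[List[Point]]:
    """Yield consecutive blocks of up to m points; each point is read exactly once."""
    if m <= 0:
        raise ValueError("m must be positive")
    it = iter(stream)
    while True:
        block = list(islice(it, m))  # single linear scan, no random access
        if not block:                # stream exhausted
            return
        yield block
```

Because the function is a generator, only one block is held in memory at a time, which matches the limited-memory requirement stated in the background section.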
The density cluster reference point is defined as follows:
Assume that the data in the data stream arrive in the form of blocks X_1, X_2, ..., X_n, that each block can be processed in memory, and that every data block contains the same number of data points;
Definition 1: density cluster reference point. The data block X_t arriving at time t is clustered with a density-based clustering algorithm, generating k_t (k_t = 1, 2, ...) clusters whose mean points are c_1, ..., c_i, ..., c_{k_t}. The data block is then represented by k_t two-tuples of the form (c_i, n_i), where n_i is the number of data points in X_t that belong to c_i; r_d(c_i, n_i) is called a density cluster reference point of the data stream.
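As an illustration of Definition 1, the following Python sketch computes the density cluster reference points (c_i, n_i) of one block. It is a minimal sketch under stated assumptions, not the implementation used in the patent: the function names are hypothetical, distances are Euclidean, DBSCAN is the textbook O(m^2) version, and noise points are simply discarded.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point = Sequence[float]
RefPoint = Tuple[Tuple[float, ...], int]  # (c_i, n_i)


def dbscan_labels(block: List[Point], eps: float, min_pts: int) -> List[Optional[int]]:
    """Plain O(m^2) DBSCAN: one cluster id per point, -1 for noise."""
    n = len(block)
    # A point's neighbourhood includes the point itself.
    neighbours = [[j for j in range(n) if math.dist(block[i], block[j]) <= eps]
                  for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1                    # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise becomes a border point
            if labels[j] is not None:
                continue                      # already assigned, do not expand
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # j is a core point: expand
    return labels


def density_reference_points(block: List[Point], eps: float, min_pts: int) -> List[RefPoint]:
    """Return one (mean point, member count) tuple per DBSCAN cluster."""
    members: Dict[int, List[Point]] = {}
    for point, label in zip(block, dbscan_labels(block, eps, min_pts)):
        if label != -1:                       # drop outliers
            members.setdefault(label, []).append(point)
    refs = []
    for label in sorted(members):
        pts = members[label]
        mean = tuple(sum(col) / len(pts) for col in zip(*pts))
        refs.append((mean, len(pts)))
    return refs
```

A block with two dense groups and one isolated point would yield two reference points; the isolated point contributes to neither mean nor count.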
The k-means reference point is defined as follows:
Assume that the data in the data stream arrive in the form of blocks X_1, X_2, ..., X_n, that each block can be processed in memory, and that every data block contains the same number of data points;
Definition 2: k-means reference point. Clustering m density cluster reference points with k-means generates 2k clusters whose mean points are c_1, ..., c_i, ..., c_{2k}. The data are then represented by 2k two-tuples of the form (c_i, n_i), where n_i is the number of data points that belong to c_i; r_k(c_i, n_i) is called a k-means reference point of the data stream.
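Definition 2 can be illustrated with a small k-means routine that takes reference points as input and returns new reference points. The sketch below is only one possible reading: weighting each reference point by its count n_i, seeding the centres with the first n_clusters reference points, and using Euclidean distance are all assumptions made here, and the function names are hypothetical. For the intermediate layers the description uses n_clusters = 2k.

```python
import math
from typing import List, Tuple

RefPoint = Tuple[Tuple[float, ...], int]  # (c_i, n_i)


def _assign(refs: List[RefPoint], centres: List[Tuple[float, ...]]) -> List[List[RefPoint]]:
    """Group each reference point with its nearest centre."""
    groups: List[List[RefPoint]] = [[] for _ in centres]
    for c, n in refs:
        nearest = min(range(len(centres)), key=lambda j: math.dist(c, centres[j]))
        groups[nearest].append((c, n))
    return groups


def _weighted_mean(group: List[RefPoint]) -> Tuple[float, ...]:
    """Mean of the centres in a group, each weighted by its count n_i."""
    total = sum(n for _, n in group)
    dims = len(group[0][0])
    return tuple(sum(n * c[d] for c, n in group) / total for d in range(dims))


def kmeans_reference_points(refs: List[RefPoint], n_clusters: int,
                            max_iter: int = 100) -> List[RefPoint]:
    """Cluster reference points; return one (mean point, total count) per cluster."""
    if not 0 < n_clusters <= len(refs):
        raise ValueError("n_clusters must be between 1 and len(refs)")
    centres = [c for c, _ in refs[:n_clusters]]   # assumption: deterministic seeding
    for _ in range(max_iter):
        groups = _assign(refs, centres)
        # An empty cluster keeps its previous centre.
        updated = [_weighted_mean(g) if g else old for g, old in zip(groups, centres)]
        if updated == centres:                    # converged
            break
        centres = updated
    groups = _assign(refs, centres)
    return [(_weighted_mean(g), sum(n for _, n in g)) for g in groups if g]
```

The total of the returned counts equals the total of the input counts, so the number of original data points represented is preserved across layers.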
The invention proposes a secondary clustering method for data streams (Twice Clustering Streaming Algorithm, TCLUSA) to improve clustering precision.
TCLUSA is based on the idea of partitioning. It clusters each block of data with DBSCAN to remove outliers and takes the mean point of each cluster as its representative point, which gives the local clustering result (density cluster reference points); it then clusters the representative points with k-means to obtain the final result (k-means reference points).
Description of TCLUSA:
Assume that the data in the data stream arrive in the form of blocks X_1, X_2, ..., X_n, that each block can be processed in memory, and that every data block contains the same number of data points.
Definition 1: density cluster reference point. The data block X_t arriving at time t is clustered with a density-based clustering algorithm, generating k_t (k_t = 1, 2, ...) clusters whose mean points are c_1, ..., c_i, ..., c_{k_t}. The data block is then represented by k_t two-tuples of the form (c_i, n_i), where n_i is the number of data points in X_t that belong to c_i; r_d(c_i, n_i) is called a density cluster reference point of the data stream.
Definition 2: k-means reference point. Clustering m density cluster reference points with k-means generates 2k clusters whose mean points are c_1, ..., c_i, ..., c_{2k}. The data are then represented by 2k two-tuples of the form (c_i, n_i), where n_i is the number of data points that belong to c_i; r_k(c_i, n_i) is called a k-means reference point of the data stream.
In Definition 1 and Definition 2, n_i can be interpreted as the weight of the reference point.
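One way to read the weight interpretation: when several reference points are merged, weighting each mean point c_i by n_i gives the same result as averaging the original data points they summarize. The short Python sketch below shows this; the function name merge_reference_points is hypothetical and the merge rule is an assumption consistent with, but not stated in, the description.

```python
from typing import List, Tuple

RefPoint = Tuple[Tuple[float, ...], int]  # (c_i, n_i)


def merge_reference_points(refs: List[RefPoint]) -> RefPoint:
    """Merge reference points into one, weighting each mean point by its count."""
    total = sum(n for _, n in refs)
    dims = len(refs[0][0])
    centre = tuple(sum(n * c[d] for c, n in refs) / total for d in range(dims))
    return centre, total


# Two clusters summarized as (mean, count): 2 points around (1, 0), 4 around (5, 6).
merged = merge_reference_points([((1.0, 0.0), 2), ((5.0, 6.0), 4)])
```

Here the merged mean is pulled towards (5, 6) because that reference point stands for twice as many data points.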
Following the idea of partitioning, the TCLUSA algorithm processes the data stream in blocks and stores the cluster reference points obtained from each clustering in a layered structure, with each layer holding up to m reference points. This allows the algorithm to cluster the data stream within limited memory. Every m data points of the stream, in chronological order, form one data block.
The algorithm clusters each data block with DBSCAN and computes the density cluster reference points, which gives the local clustering result; it then clusters the density cluster reference points with k-means until the final result (k-means reference points) is obtained.
The specific implementation of the above principle is described in detail below:
The pseudocode is as follows:
Procedure TCLUSA
/* Function: accumulate the data stream for a period of time in a sliding window, then cluster the data stream with the TCLUSA algorithm and divide the data into several cluster blocks according to the result.
Input: number of records per data block oncepattern, number of clusters k, neighbourhood radius eps, minimum number of points minPts that the neighbourhood of a core point must contain.
Output: k clusters
*/
(1) do while (the data stream has not ended)
(2) i = 1; // i is the layer being processed
(3) read a data block;
(4) cluster this data block with the DBSCAN algorithm;
(5) compute the density cluster reference points and store them in layer i;
(6) if (number of density cluster reference points == m) // m is the maximum number of intermediate density cluster reference points stored
(7) f = 0; // f is a binary variable: f = 1 means that no layer is full,
// f = 0 means that a layer has been filled with data.
(8) do while (f == 0)
(9) cluster the cluster reference points of layer i with the k-means algorithm;
(10) i++;
(11) store the 2k k-means reference points obtained in layer i;
(12) if (number of cluster reference points in layer i != m)
(13) f = 1;
(14) end if
(15) end // end while
(16) end if
(17) end // end while
(18) cluster all stored density cluster reference points with the k-means algorithm to generate the final k clusters;
Step (1) processes the data stream block by block in a loop within the sliding window and obtains the final clustering result: if the data have not all been processed, the next data block is read in, until the whole data stream has been processed. Steps (3)-(5) cluster the data block read in with the DBSCAN algorithm, compute the density cluster reference points and save them in layer i.
Steps (6)-(8) check whether layer i is full of density cluster reference points. If it is, the k-means algorithm is applied layer by layer to the density cluster reference points and each result is saved in the next layer up, until no layer is full. Steps (9)-(11) cluster the reference points of layer i with k-means, obtaining 2k k-means reference points that are saved in layer i+1. Step (12) checks whether that upper layer is still not full; if so, the cascade ends. When the data stream ends, step (18) clusters all the stored density cluster reference points with k-means and outputs the resulting k clusters (k-means reference points).
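The layered storage of steps (5)-(15) can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: the class name LayeredReferenceStore and the compress callable are hypothetical, layers are indexed from 0 instead of 1, and the sketch assumes that compress (for example, k-means producing 2k reference points) always returns fewer than m points, since otherwise a full layer could never be emptied.

```python
from typing import Callable, List, Tuple

RefPoint = Tuple[Tuple[float, ...], int]              # (c_i, n_i)
Compress = Callable[[List[RefPoint]], List[RefPoint]]


class LayeredReferenceStore:
    """Hypothetical container for the layered reference-point structure."""

    def __init__(self, m: int, compress: Compress) -> None:
        if m < 2:
            raise ValueError("m must be at least 2")
        self.m = m                      # capacity of each layer
        self.compress = compress        # e.g. k-means returning 2k reference points
        self.layers: List[List[RefPoint]] = [[]]

    def add(self, refs: List[RefPoint]) -> None:
        """Store new reference points in the lowest layer, then cascade full layers upward."""
        self.layers[0].extend(refs)
        i = 0
        while i < len(self.layers):
            while len(self.layers[i]) >= self.m:      # layer i is full
                full = self.layers[i][:self.m]
                self.layers[i] = self.layers[i][self.m:]
                summary = self.compress(full)
                if len(summary) >= self.m:
                    raise ValueError("compress must return fewer than m points")
                if i + 1 == len(self.layers):
                    self.layers.append([])            # open a new upper layer
                self.layers[i + 1].extend(summary)
            i += 1

    def all_points(self) -> List[RefPoint]:
        """All stored reference points, the input to the final clustering of step (18)."""
        return [ref for layer in self.layers for ref in layer]
```

Each layer holds fewer than m reference points after every call to add, so memory grows with the number of layers rather than with the length of the stream.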
The algorithm is analysed in detail below:
Let m be the number of data points in a block, k the number of clusters, c the number of iterations, and t the number of data blocks needed to obtain m density cluster mean reference points. The time cost of the algorithm consists of two parts. First, the data stream is partitioned so that every m consecutive data points form a data block, and the m data points of each block are clustered with DBSCAN to generate density cluster mean reference points; the cost of this part is O(m^2) per block. Second, the m density cluster mean reference points are clustered with k-means, at a cost of O(mkc). The total time cost is therefore O(m^2 t + mkc).
Because the number of data points m in a block is generally small, the total time cost is small. After m density cluster mean reference points have been generated, they are clustered with k-means, which is simple and efficient: its computational complexity is O(mkc), where usually k << m and c < m. This part is also executed fewer times than the computation of the density cluster mean reference points, so its time cost is smaller. As for space cost, because the algorithm uses a layered structure and stores the data as mean-point information, it can cluster the data stream within limited memory.
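To make the relative size of the two terms concrete, the dominant-term count m^2 t + mkc can be evaluated for sample values. In the sketch below, m = 100 and k = 5 follow the experimental settings given later in the description, while c = 10 and t = 20 are purely illustrative assumptions; the function name tclusa_cost is hypothetical and constant factors are ignored.

```python
def tclusa_cost(m: int, k: int, c: int, t: int) -> int:
    """Dominant-term operation count m^2 * t + m * k * c from the analysis."""
    dbscan_part = m * m * t   # t blocks, O(m^2) DBSCAN on each
    kmeans_part = m * k * c   # one k-means pass over m reference points
    return dbscan_part + kmeans_part


# Assumed values: c = 10 iterations, t = 20 blocks (illustrative only).
total = tclusa_cost(m=100, k=5, c=10, t=20)
```

With these values the DBSCAN term is 200,000 and the k-means term is 5,000, which illustrates why the first clustering stage dominates the running time.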
As for clustering quality, the TCLUSA algorithm uses DBSCAN for the first-layer clustering, which pre-processes data that are irregularly distributed and contain a large amount of noise, so that the density cluster reference points generated have higher precision. The partitioning algorithm k-means is then applied to the density cluster reference points, so all the cluster reference point information is retained during clustering and the clustering result better reflects the overall distribution of the data.
A concrete experiment is described below to illustrate the technical solution of the invention in detail:
The hardware and software configuration used to implement the proposed data stream clustering algorithm TCLUSA is as follows: a Pentium 4 2.80 GHz CPU, 256 MB of DDR II memory and an 80 GB hard disk; the operating system is Windows XP and the development language is C++. Clustering quality is measured with the Scat Comp coefficient [38] (the smaller the Scat Comp coefficient, the better the clustering quality). The experiments use the network intrusion detection data set KDD CUP 99 as the data source. This data set consists of 494020 local area network connection records collected over 2 weeks at MIT Lincoln Laboratory, and it is irregularly distributed and contains noise. Each record in the data set has 42 attributes and corresponds to the normal pattern or to a particular intrusion pattern. 120000 of these records are used to simulate a data stream for clustering; the non-numeric attributes are removed and only the 34 numeric attributes are used.
The data stream clustering algorithm TCLUSA is compared with the k-means-based data stream clustering algorithm of document [39] (called the KSCDC algorithm here). The algorithm forms one block from every 100 data records; k is set to 5, minPts to 7 and eps to 0.5. The KSCDC and TCLUSA algorithms are each used to cluster data sets of 20,000 to 100,000 records, and the experimental results are shown in Table 1. Table 1 shows that TCLUSA obtains better clustering quality than KSCDC under all the parameter settings tested. Because TCLUSA uses DBSCAN in the first-layer clustering, which takes longer than k-means, its time performance is slightly worse than that of KSCDC; in environments where clustering quality matters more, however, this loss of time performance is within an acceptable range.
Table 1: Comparison of the time performance and clustering quality of algorithms KSCDC and TCLUSA (the table is an image in the original publication and is not reproduced here)
Fig. 2 is the system implementation diagram of Embodiment 2 of the invention. The system comprises: a block reading module 201, a density cluster reference point acquisition module 202 and a k-means reference point acquisition module 203; the block reading module 201 is connected to the k-means reference point acquisition module 203 through the density cluster reference point acquisition module 202;
the block reading module 201 is configured to partition a data stream into blocks and read in the data blocks;
the density cluster reference point acquisition module 202 is configured to cluster with the DBSCAN algorithm to obtain density cluster reference points;
the k-means reference point acquisition module 203 is configured to cluster the obtained density cluster reference points with the k-means algorithm and to store the k-means reference points obtained by the k-means clustering in a layered structure.
The process by which the block reading module 201 partitions the data stream into blocks and reads in the data blocks is as follows: within a sliding window, the data stream is processed block by block in a loop and the final clustering result is obtained; if the data have not all been processed, the next data block is read in, until the whole data stream has been processed.
The invention proposes a secondary clustering method: the data stream is first partitioned into blocks, and DBSCAN clustering of each block produces an intermediate clustering result, the set of density cluster reference points; these density cluster reference points are then clustered with k-means, and the cluster reference points obtained from each clustering (k-means reference points) are stored in a layered structure. This effectively improves the clustering quality, clustering precision and efficiency for data streams that are irregularly distributed and contain noise.
The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (6)

1. A secondary clustering method, characterized by comprising the following steps:
partitioning a data stream into blocks and reading in the data blocks;
clustering with the DBSCAN algorithm to obtain density cluster reference points;
clustering the obtained density cluster reference points with the k-means algorithm, and storing the k-means reference points obtained by the k-means clustering in a layered structure.
2. The method according to claim 1, characterized in that:
the process of partitioning the data stream into blocks and reading in the data blocks is: within a sliding window, the data stream is processed block by block in a loop and the final clustering result is obtained; if the data have not all been processed, the next data block is read in, until the whole data stream has been processed.
3. The method according to claim 1, characterized in that the density cluster reference point is defined as follows:
assume that the data in the data stream arrive in the form of blocks X_1, X_2, ..., X_n, that each block can be processed in memory, and that every data block contains the same number of data points;
Definition 1: density cluster reference point. The data block X_t arriving at time t is clustered with a density-based clustering algorithm, generating k_t (k_t = 1, 2, ...) clusters whose mean points are c_1, ..., c_i, ..., c_{k_t}. The data block is then represented by k_t two-tuples of the form (c_i, n_i), where n_i is the number of data points in X_t that belong to c_i; r_d(c_i, n_i) is called a density cluster reference point of the data stream.
4. The method according to claim 1, characterized in that the k-means reference point is defined as follows:
assume that the data in the data stream arrive in the form of blocks X_1, X_2, ..., X_n, that each block can be processed in memory, and that every data block contains the same number of data points;
Definition 2: k-means reference point. Clustering m density cluster reference points with k-means generates 2k clusters whose mean points are c_1, ..., c_i, ..., c_{2k}. The data are then represented by 2k two-tuples of the form (c_i, n_i), where n_i is the number of data points that belong to c_i; r_k(c_i, n_i) is called a k-means reference point of the data stream.
5. A secondary clustering system, characterized by comprising:
a block reading module, a density cluster reference point acquisition module and a k-means reference point acquisition module; the block reading module is connected to the k-means reference point acquisition module through the density cluster reference point acquisition module;
the block reading module is configured to partition a data stream into blocks and read in the data blocks;
the density cluster reference point acquisition module is configured to cluster with the DBSCAN algorithm to obtain density cluster reference points;
the k-means reference point acquisition module is configured to cluster the obtained density cluster reference points with the k-means algorithm and to store the k-means reference points obtained by the k-means clustering in a layered structure.
6. The system according to claim 5, characterized in that
the process by which the block reading module partitions the data stream into blocks and reads in the data blocks is: within a sliding window, the data stream is processed block by block in a loop and the final clustering result is obtained; if the data have not all been processed, the next data block is read in, until the whole data stream has been processed.
CN201310581217.1A 2013-11-18 2013-11-18 Secondary clustering method and system Pending CN103577602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310581217.1A CN103577602A (en) 2013-11-18 2013-11-18 Secondary clustering method and system

Publications (1)

Publication Number Publication Date
CN103577602A true CN103577602A (en) 2014-02-12

Family

ID=50049378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310581217.1A Pending CN103577602A (en) 2013-11-18 2013-11-18 Secondary clustering method and system

Country Status (1)

Country Link
CN (1) CN103577602A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202335A (en) * 2016-06-28 2016-12-07 银江股份有限公司 A kind of big Data Cleaning Method of traffic based on cloud computing framework
CN107392220A (en) * 2017-05-31 2017-11-24 阿里巴巴集团控股有限公司 The clustering method and device of data flow
CN107657277A (en) * 2017-09-22 2018-02-02 上海斐讯数据通信技术有限公司 A kind of human body unusual checking based on big data and decision method and system
CN108520023A (en) * 2018-03-22 2018-09-11 合肥佳讯科技有限公司 A kind of identification of thunderstorm core and method for tracing based on Hybrid Clustering Algorithm
CN109344171A (en) * 2018-12-21 2019-02-15 中国计量大学 A kind of nonlinear system characteristic variable conspicuousness mining method based on Data Stream Processing
CN110287244A (en) * 2019-07-03 2019-09-27 武汉中海庭数据技术有限公司 It is a kind of based on the traffic lights localization method repeatedly clustered
CN110298558A (en) * 2019-06-11 2019-10-01 欧拉信息服务有限公司 Vehicle resources dispositions method and device
CN110367969A (en) * 2019-07-05 2019-10-25 复旦大学 A kind of improved electrocardiosignal K-Means Cluster
CN111179592A (en) * 2019-12-31 2020-05-19 合肥工业大学 Urban traffic prediction method and system based on spatio-temporal data flow fusion analysis
CN111832791A (en) * 2019-11-27 2020-10-27 北京中交兴路信息科技有限公司 Gas station prediction method based on machine learning logistic regression
CN111860554A (en) * 2019-04-28 2020-10-30 杭州海康威视数字技术股份有限公司 Risk monitoring method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060015215A1 (en) * 2004-07-15 2006-01-19 Howard Michael D System and method for automated search by distributed elements
CN101505314A (en) * 2008-12-29 2009-08-12 成都市华为赛门铁克科技有限公司 P2P data stream recognition method, apparatus and system
CN101853291A (en) * 2010-05-24 2010-10-06 合肥工业大学 Data flow based car fault diagnosis method
CN101989289A (en) * 2009-08-06 2011-03-23 富士通株式会社 Data clustering method and device
CN102289478A (en) * 2011-08-01 2011-12-21 江苏广播电视大学 System and method for recommending video on demand based on fuzzy clustering

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060015215A1 (en) * 2004-07-15 2006-01-19 Howard Michael D System and method for automated search by distributed elements
CN101505314A (en) * 2008-12-29 2009-08-12 成都市华为赛门铁克科技有限公司 P2P data stream recognition method, apparatus and system
CN101989289A (en) * 2009-08-06 2011-03-23 富士通株式会社 Data clustering method and device
CN101853291A (en) * 2010-05-24 2010-10-06 合肥工业大学 Data flow based car fault diagnosis method
CN102289478A (en) * 2011-08-01 2011-12-21 江苏广播电视大学 System and method for recommending video on demand based on fuzzy clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
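Hu Xuegang (胡学钢) et al.: "An effective secondary clustering algorithm for data streams", Journal of Southwest Jiaotong University (《西南交通大学学报》), vol. 44, no. 4, 31 August 2009 (2009-08-31) *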

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202335B (en) * 2016-06-28 2019-06-28 银江股份有限公司 Traffic big data cleaning method based on cloud computing framework
CN106202335A (en) * 2016-06-28 2016-12-07 银江股份有限公司 Traffic big data cleaning method based on cloud computing framework
CN107392220A (en) * 2017-05-31 2017-11-24 阿里巴巴集团控股有限公司 Data stream clustering method and device
WO2018219284A1 (en) * 2017-05-31 2018-12-06 阿里巴巴集团控股有限公司 Method and apparatus for clustering data stream
US11226993B2 (en) 2017-05-31 2022-01-18 Advanced New Technologies Co., Ltd. Method and apparatus for clustering data stream
CN107392220B (en) * 2017-05-31 2020-05-05 创新先进技术有限公司 Data stream clustering method and device
CN107657277A (en) * 2017-09-22 2018-02-02 上海斐讯数据通信技术有限公司 Human body abnormal behavior detection and judgment method and system based on big data
CN107657277B (en) * 2017-09-22 2022-02-01 金言 Human body abnormal behavior detection and judgment method and system based on big data
CN108520023A (en) * 2018-03-22 2018-09-11 合肥佳讯科技有限公司 Thunderstorm kernel identification and tracking method based on hybrid clustering algorithm
CN108520023B (en) * 2018-03-22 2021-07-20 合肥佳讯科技有限公司 Thunderstorm kernel identification and tracking method based on hybrid clustering algorithm
CN109344171A (en) * 2018-12-21 2019-02-15 中国计量大学 Method for mining the significance of nonlinear system characteristic variables based on data stream processing
CN111860554A (en) * 2019-04-28 2020-10-30 杭州海康威视数字技术股份有限公司 Risk monitoring method and device, storage medium and electronic equipment
CN110298558A (en) * 2019-06-11 2019-10-01 欧拉信息服务有限公司 Vehicle resource deployment method and device
CN110287244A (en) * 2019-07-03 2019-09-27 武汉中海庭数据技术有限公司 Traffic light localization method based on repeated clustering
CN110367969A (en) * 2019-07-05 2019-10-25 复旦大学 Improved K-Means clustering method for ECG signals
CN111832791A (en) * 2019-11-27 2020-10-27 北京中交兴路信息科技有限公司 Gas station prediction method based on machine learning logistic regression
CN111179592B (en) * 2019-12-31 2021-06-11 合肥工业大学 Urban traffic prediction method and system based on spatio-temporal data flow fusion analysis
CN111179592A (en) * 2019-12-31 2020-05-19 合肥工业大学 Urban traffic prediction method and system based on spatio-temporal data flow fusion analysis

Similar Documents

Publication Publication Date Title
CN103577602A (en) Secondary clustering method and system
Pahins et al. Hashedcubes: Simple, low memory, real-time visual exploration of big data
CN103345514B (en) Streaming data processing method under big data environment
CN106202569A (en) Cleaning method based on large data volumes
CN107066476A (en) Real-time recommendation method based on article similarity
CN105808358B (en) Data-dependence thread grouping and mapping method for many-core systems
CN103118132B (en) Distributed cache system and method for spatio-temporal data
CN106372190A (en) Method and device for querying OLAP (on-line analytical processing) in real time
CN106708989A (en) Spatial time sequence data stream application-based Skyline query method
JP2019204472A (en) Method for reading a plurality of small files of 2 MB or smaller from HDFS having a data merge module and an HBase cache module on the basis of Hadoop
CN103916478B (en) The method and apparatus that streaming based on distributed system builds data side
CN107103068A (en) Method and device for updating a service cache
WO2015062540A9 (en) Driving amount model event-based storage and index methods and system
CN104036029A (en) Big data consistency comparison method and system
Zhong et al. VegaIndexer: A distributed composite index scheme for big spatio-temporal sensor data on cloud
CN102012946A (en) High-efficiency safety monitoring video/image data storage method
Xia et al. SW-BiLSTM: a Spark-based weighted BiLSTM model for traffic flow forecasting
Jin et al. Association rules redundancy processing algorithm based on hypergraph in data mining
Xia et al. DAPR-tree: a distributed spatial data indexing scheme with data access patterns to support Digital Earth initiatives
CN107426315A (en) Improved method for the distributed cache system Memcached based on a BP neural network
CN103118102A (en) System and method for counting and controlling spatial data access laws under cloud computing environment
Demir et al. Clustering spatial networks for aggregate query processing: A hypergraph approach
Gaurav et al. An outline on big data and big data analytics
CN105956816A (en) Cargo transportation information intelligent processing method
Qin et al. Towards a smart, internet-scale cache service for data intensive scientific applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140212