CN112286989A - Big data clustering mining method and platform - Google Patents

Big data clustering mining method and platform

Info

Publication number
CN112286989A
Authority
CN
China
Prior art keywords
data
clustering center
clustering
center
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011169745.2A
Other languages
Chinese (zh)
Inventor
陈宝
计春雷
李建敦
郝元峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN202011169745.2A
Publication of CN112286989A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big data clustering mining method comprises the following steps: obtaining a data set; initializing the clustering centers with an ant colony algorithm and selecting the initial clustering centers; calculating the distance from the data to the initial clustering centers and classifying the data according to a maximum-minimum distance method; and checking whether the clustering centers have changed and, if so, updating the clustering centers and executing again the step of calculating the distance from the data to the clustering centers. An intelligent big data clustering mining platform is further built using this method.

Description

Big data clustering mining method and platform
Technical Field
The invention belongs to the technical field of big data, and in particular relates to a big data clustering mining method within the field of cluster analysis.
Background
In most cases, the cluster mining algorithms in current use are still based on the K-means and fuzzy C-means algorithms. However, the parallel efficiency of these algorithms deteriorates as the number of iterations increases, and under practical conditions the mining quality for massive data cannot be guaranteed. The reasons are as follows.
First, for the K-means algorithm the value of K is difficult to determine, the algorithm is sensitive to noise and outliers, and it easily falls into a local optimum, so that clustering is inaccurate; for the fuzzy C-means algorithm the number of clusters is difficult to determine, the algorithm is sensitive to its initial values, convergence is slow, and it also easily falls into a local optimum, so that clustering is inaccurate.
Second, as the number of iterations of the K-means and fuzzy C-means algorithms increases, parallel execution efficiency deteriorates, complexity is high, computing resources are consumed, and data mining quality is difficult to guarantee.
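The local-optimum problem described above can be illustrated with a small, self-contained sketch. It is not taken from the patent: the four sample points, the two sets of initial centers, and the helper names `kmeans` and `sse` are assumptions chosen only to show that plain K-means can converge to different results depending on where it starts.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means; returns the final centers and clusters."""
    centers = [tuple(map(float, c)) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[k]
            for k, members in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, clusters

def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, c) ** 2 for c, members in zip(centers, clusters) for p in members)

# Four corners of a 4 x 1 rectangle (illustrative data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, [(0, 0), (4, 0)])  # starts near the left/right split
bad = kmeans(points, [(2, 0), (2, 1)])   # starts on the top/bottom split
print(sse(*good), sse(*bad))             # prints 1.0 16.0
```

Both runs stop because no center moves, yet the second run is stuck in a much worse partition, which is the kind of behaviour a better initialization step is meant to avoid.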
Disclosure of Invention
In one embodiment of the invention, a big data clustering mining method comprises the following steps:
obtaining a data set;
initializing the clustering centers with an ant colony algorithm and selecting the initial clustering centers;
calculating the distance from the data to the initial clustering centers, and classifying the data according to a maximum-minimum distance method;
checking whether the clustering centers have changed and, if so, updating the clustering centers and executing again the step of calculating the distance from the data to the clustering centers.
The invention provides an improved clustering mining method that initializes the clustering centers with an ant colony algorithm and updates them with a density-based maximum-minimum distance method, thereby improving clustering precision and computational efficiency.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
Fig. 1 is a flow chart of clustering center updating in a big data clustering mining method according to one embodiment of the present invention.
Detailed Description
According to one or more embodiments, the big data clustering mining method implements big data clustering mining with an ant colony algorithm combined with a maximum-minimum distance method, which compensates for the uneven data distribution that the ant colony algorithm produces under practical conditions; an intelligent big data clustering mining platform is also built to compensate for insufficient clustering efficiency under practical conditions. The method is roughly divided into three steps: first, initializing the clustering centers, including selecting them; second, updating the clustering centers, continuously optimizing them in combination with the maximum-minimum distance method so as to achieve accurate clustering; and third, building an intelligent big data clustering mining platform to improve clustering mining efficiency.
According to one or more embodiments, to avoid fluctuating clustering results when the initial clustering centers are selected at random, this embodiment adds an ant colony algorithm: when the clustering centers are initialized, the data in the data set are regarded as ants searching for food, and the clustering process is regarded as the process of the ants finding food sources, which makes the clustering centers more accurate. Given a data set $Q = \{q_i \mid q_i = (q_{i1}, q_{i2}, \ldots, q_{in}),\ i = 1, 2, \ldots, m\}$, where n and m are constants, the specific calculation formulas are as follows:
[Formula (1): image BDA0002746916880000031, not reproduced in the text]
[Formula (2): image BDA0002746916880000032, not reproduced in the text]
In formula (1):
A and B are positive constants;
κ is the pheromone residual intensity;
t is time;
$\lambda_{ij}(t)$ is the amount of pheromone between data i and data j at time t.
In formula (2):
$C_j$ is the merged data set;
n is the number of data in the data set.
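Because formulas (1) and (2) are not reproduced in this text, the following is only a hedged sketch of the kind of computation the symbol definitions suggest. It uses the textbook ant colony pheromone update, in which a fraction κ of the old pheromone is retained and new deposits are added, and it takes the center of a merged set to be the mean of its members. Both choices, and the names `update_pheromone`, `merged_center` and `deposits`, are assumptions and are not the patent's actual formulas.

```python
def update_pheromone(pheromone, deposits, kappa):
    """Assumed textbook update: lambda_ij(t+1) = kappa * lambda_ij(t) + deposit_ij.

    pheromone and deposits are square lists of lists indexed by data i and data j;
    kappa plays the role of the pheromone residual intensity.
    """
    size = len(pheromone)
    return [
        [kappa * pheromone[i][j] + deposits[i][j] for j in range(size)]
        for i in range(size)
    ]

def merged_center(merged_set):
    """Assumed center of a merged data set C_j: the coordinate-wise mean."""
    count = len(merged_set)
    return tuple(sum(axis) / count for axis in zip(*merged_set))

# Two data items with pheromone between them (illustrative values).
pheromone = [[0.0, 1.0], [1.0, 0.0]]
deposits = [[0.0, 0.5], [0.5, 0.0]]
print(update_pheromone(pheromone, deposits, kappa=0.5))
print(merged_center([(0, 0), (2, 4)]))
```

In a full ant colony clustering scheme the pheromone values would typically drive the probability of merging data i with data j; that step depends on the missing formula (1) and is therefore left out.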
To address the uneven data distribution that arises when the clustering centers are initialized by the ant colony algorithm, the algorithm updates the clustering centers with the maximum-minimum distance method. The distance between every two data is calculated and recorded in a matrix to obtain $\bar{I}$ between the two data; according to the Density($q_i$) principle, isolated data are removed from the set $C_j$ to obtain the updated clustering center. The formula is as follows:
[Formula (3): image BDA0002746916880000041, not reproduced in the text]
where
[Formula image BDA0002746916880000042, not reproduced in the text]
Density($q_i$) is the density of $q_i$.
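The density formula itself is not reproduced in this text, so the sketch below substitutes a common stand-in: the density of a point is taken as the number of other points within a fixed radius, and points below a minimum density are treated as isolated. It also shows the classic maximum-minimum distance rule, which repeatedly picks the point whose distance to its nearest already-chosen center is largest. The parameters `radius` and `min_density` and all function names are hypothetical.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances, recorded in a matrix."""
    return [[math.dist(p, q) for q in points] for p in points]

def density(dist_row, radius):
    """Assumed density: neighbours within radius, excluding the point itself."""
    return sum(1 for d in dist_row if d <= radius) - 1

def remove_isolated(points, radius, min_density):
    """Drop points whose assumed density falls below min_density."""
    matrix = distance_matrix(points)
    return [p for p, row in zip(points, matrix) if density(row, radius) >= min_density]

def max_min_centers(points, k):
    """Classic max-min rule: each new center maximizes its distance
    to the nearest center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
kept = remove_isolated(points, radius=2, min_density=1)
print(kept)                       # (50, 50) has no neighbour and is removed
print(max_min_centers(kept, 2))   # [(0, 0), (10, 11)]
```

Removing isolated points before applying the max-min rule matters because that rule, on its own, is drawn toward outliers: without the filter, (50, 50) would be chosen as the second center.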
After the clustering center has been updated, the distances from the other data samples in the set $C_j$ to the new clustering center are calculated (see Fig. 1). To improve the efficiency of the clustering computation, an intelligent big data mining platform is built on Hadoop: the Map task calculates the distance from each data item to the clustering centers, and the Reduce task updates the clustering centers in each round.
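The division of labour between Map and Reduce can be sketched in plain Python without any Hadoop dependency. This is only an in-memory imitation of one round under stated assumptions: `map_assign`, `reduce_update` and `one_round` are hypothetical names, the shuffle phase is replaced by a dictionary, and the new center is assumed to be the mean of the points assigned to it.

```python
import math
from collections import defaultdict

def map_assign(point, centers):
    """Map side: compute distances to every center, emit (nearest index, point)."""
    nearest = min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))
    return nearest, point

def reduce_update(points):
    """Reduce side: new center for one key, assumed to be the mean of its points."""
    count = len(points)
    return tuple(sum(axis) / count for axis in zip(*points))

def one_round(data, centers):
    """One Map/Reduce round; the dict stands in for Hadoop's shuffle phase."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_assign(point, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position.
    return [
        reduce_update(groups[k]) if k in groups else centers[k]
        for k in range(len(centers))
    ]

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(one_round(data, [(1, 1), (9, 1)]))  # [(0.0, 1.0), (10.0, 1.0)]
```

The map step touches each data item independently, which is what makes it easy to distribute; only the small per-cluster aggregates need to be brought together in the reduce step.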
Fig. 1 is a flow chart combining the ant colony algorithm with the maximum-minimum distance algorithm. According to the flow chart, the clustering centers are initialized with the ant colony algorithm and the initial clustering centers are selected; the distance from the data to the initial clustering centers is calculated and the data are classified according to that distance; whether the clustering centers have changed is then checked. If they have changed, the clustering centers are updated and the next round of calculating the distance from the data to the clustering centers begins; if they have not changed, the process ends.
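The control flow of Fig. 1 can be sketched as a loop with a pluggable initializer. The patent's ant colony initialization is not reproduced here; `init_centers` is a hypothetical callback that stands in for it, and the tolerance `tol` used to decide whether a center "changed" is likewise an assumption.

```python
import math

def assign(data, centers):
    """Classify each data item by its nearest clustering center."""
    clusters = [[] for _ in centers]
    for p in data:
        nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        clusters[nearest].append(p)
    return clusters

def update(clusters, old_centers):
    """Recompute each center as the mean of its members (empty clusters keep theirs)."""
    return [
        tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
        for members, old in zip(clusters, old_centers)
    ]

def run(data, init_centers, tol=1e-9, max_rounds=100):
    """Initialize, then repeat assign/update until no center moves more than tol."""
    centers = init_centers(data)  # stand-in for the ant colony initialization
    clusters, rounds = [], 0
    for rounds in range(1, max_rounds + 1):
        clusters = assign(data, centers)
        new_centers = update(clusters, centers)
        changed = any(math.dist(a, b) > tol for a, b in zip(centers, new_centers))
        centers = new_centers
        if not changed:  # centers unchanged: the process ends
            break
    return centers, clusters, rounds

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters, rounds = run(data, lambda d: [d[0], d[2]])
print(centers, rounds)  # [(0.0, 1.0), (10.0, 1.0)] 2
```

Passing the initializer as a function keeps the loop independent of how the initial centers are produced, so a random, max-min, or ant colony initializer could be swapped in without touching the rest.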
The improved algorithm of the invention uses the ant colony algorithm to compute the initial clustering centers from the initial data set, calculates the distance between every two data on that basis, and combines the ant colony algorithm with a density-based maximum-minimum distance method to update the clustering centers, thereby clustering the data and eliminating isolated samples. Building an intelligent big data clustering mining platform further improves the efficiency of the clustering computation. The improved clustering mining technique mitigates both the excessive randomness of initial clustering center selection and the low operating efficiency of the algorithm during clustering.
If the integrated unit is implemented as a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied as a software product stored in a storage medium and including instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A big data clustering mining method, characterized by comprising the following steps:
obtaining a data set;
initializing the clustering centers with an ant colony algorithm and selecting the initial clustering centers;
calculating the distance from the data to the initial clustering centers, and classifying the data according to a maximum-minimum distance method;
checking whether the clustering centers have changed and, if so, updating the clustering centers and executing again the step of calculating the distance from the data to the clustering centers.
2. The big data cluster mining method according to claim 1, characterized in that an intelligent big data cluster mining platform is further built by adopting the big data cluster mining method.
3. The big data clustering mining method according to claim 1, characterized in that, when the clustering centers are initialized, the data in the data set are regarded as ants searching for food and the clustering process is regarded as the process of the ants finding food sources, which makes the clustering centers more accurate; given a data set $Q = \{q_i \mid q_i = (q_{i1}, q_{i2}, \ldots, q_{in}),\ i = 1, 2, \ldots, m\}$, where n and m are constants, the specific calculation formulas are as follows:
[Formula (1): image FDA0002746916870000011, not reproduced in the text]
[Formula (2): image FDA0002746916870000012, not reproduced in the text]
In formula (1):
A and B are positive constants;
κ is the pheromone residual intensity;
t is time;
$\lambda_{ij}(t)$ is the amount of pheromone between data i and data j at time t.
In formula (2):
$C_j$ is the merged data set;
n is the number of data in the data set.
4. The big data clustering mining method according to claim 3, characterized in that updating the clustering centers by the maximum-minimum distance method comprises:
calculating the distance between every two data and recording it in a matrix to obtain $\bar{I}$ between the two data; according to the Density($q_i$) principle, removing isolated data from the set $C_j$ to obtain the updated clustering center, the formula being as follows:
[Formula (3): image FDA0002746916870000021, not reproduced in the text]
where
[Formula image FDA0002746916870000022, not reproduced in the text]
Density($q_i$) is the density of $q_i$;
after the clustering center has been updated, calculating the distances from the other data samples in the set $C_j$ to the new clustering center.
5. The big data clustering mining method according to claim 2, characterized in that an intelligent big data mining platform is built on Hadoop, wherein the Map task calculates the distance from each data item to the clustering centers and the Reduce task updates the clustering centers in each round.
6. A big data clustering mining platform, characterized in that the platform comprises a server, the server having a memory; and
a processor coupled to the memory and configured to execute instructions stored in the memory, the processor performing the following operations:
obtaining a data set;
initializing the clustering centers with an ant colony algorithm and selecting the initial clustering centers;
calculating the distance from the data to the initial clustering centers, and classifying the data according to a maximum-minimum distance method;
checking whether the clustering centers have changed and, if so, updating the clustering centers and executing again the step of calculating the distance from the data to the clustering centers.
7. A storage medium on which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 5.
CN202011169745.2A 2020-10-28 2020-10-28 Big data clustering mining method and platform Pending CN112286989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169745.2A CN112286989A (en) 2020-10-28 2020-10-28 Big data clustering mining method and platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011169745.2A CN112286989A (en) 2020-10-28 2020-10-28 Big data clustering mining method and platform

Publications (1)

Publication Number Publication Date
CN112286989A (en) 2021-01-29

Family

ID=74373595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169745.2A Pending CN112286989A (en) 2020-10-28 2020-10-28 Big data clustering mining method and platform

Country Status (1)

Country Link
CN (1) CN112286989A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838863A (en) * 2014-03-14 2014-06-04 内蒙古科技大学 Big-data clustering algorithm based on cloud computing platform
CN104850629A (en) * 2015-05-21 2015-08-19 杭州天宽科技有限公司 Analysis method of massive intelligent electricity-consumption data based on improved k-means algorithm
CN109509196A (en) * 2018-12-24 2019-03-22 广东工业大学 A kind of lingual diagnosis image partition method of the fuzzy clustering based on improved ant group algorithm
CN110909792A (en) * 2019-11-21 2020-03-24 安徽大学 Clustering analysis method based on improved K-means algorithm and new clustering effectiveness index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210129