CN112286989A - Big data clustering mining method and platform

Info

Publication number: CN112286989A
Application number: CN202011169745.2A
Authority: CN
Inventors: 陈宝; 计春雷; 李建敦; 郝元峰
Original assignee: Shanghai Dianji University
Current assignee: Shanghai Dianji University
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2021-01-29

Abstract

A big data clustering mining method comprises the following steps: obtaining a data set; initializing a clustering center with an ant colony algorithm and selecting the initial clustering center; calculating the distance from the data to the initial clustering center and classifying the data according to a maximum-minimum distance method; and checking whether the clustering center has changed and, if so, updating the clustering center and repeating the step of calculating the distance from the data to the clustering center. An intelligent big data clustering mining platform is further built using the big data clustering mining method.

Description

Big data clustering mining method and platform

Technical Field

The invention belongs to the technical field of big data, and in particular relates to a big data clustering mining method within cluster analysis.

Background

The cluster mining algorithms in use today are still, in most cases, based on the K-means and fuzzy C-means algorithms. However, the parallel efficiency of these algorithms degrades as the number of iterations increases, and under practical conditions the mining quality for massive data cannot be guaranteed. The reasons are as follows:

First, with the K-means algorithm the value of K is difficult to determine, the algorithm is sensitive to noise and outliers, and it easily falls into a local optimum, so that clustering is inaccurate; with the fuzzy C-means algorithm the number of cluster categories is difficult to determine, the algorithm is sensitive to the initial value, it converges slowly, and it easily falls into a local optimum, so that clustering is inaccurate.

Second, as the number of iterations of the K-means and fuzzy C-means algorithms increases, their parallel execution efficiency degrades, their complexity is high, they consume considerable computing resources, and data mining quality is difficult to guarantee.

Disclosure of Invention

In one embodiment of the invention, a big data clustering mining method comprises the following steps: obtaining a data set;

initializing a clustering center by adopting an ant colony algorithm, and selecting the initial clustering center;

calculating the distance from the data to the initial clustering center, and classifying the data according to a maximum and minimum distance method;

and checking whether the clustering center changes or not, updating the clustering center if the clustering center changes, and executing the step of calculating the distance from the data to the clustering center again.

The invention provides an improved clustering mining method that initializes the clustering center with an ant colony algorithm and updates the clustering center with a density-based maximum-minimum distance method, thereby improving clustering precision and computational efficiency.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

Fig. 1 is a flow chart of cluster center updating in a big data cluster mining method according to one embodiment of the present invention.

Detailed Description

According to one or more embodiments, the big data clustering mining method implements big data clustering mining with an ant colony algorithm and combines it with a maximum-minimum distance method, which compensates for the uneven data distribution that the ant colony algorithm produces under practical conditions; an intelligent big data clustering mining platform is also built to compensate for insufficient clustering efficiency under practical conditions. The method is roughly divided into three steps: first, initializing the clustering center, including selecting the clustering center; second, updating the clustering center, which is continuously optimized in combination with the maximum-minimum distance method so that clustering is accurate; and third, building an intelligent big data clustering mining platform to improve clustering mining efficiency.

According to one or more embodiments, to avoid the fluctuation in clustering results that occurs when the initial clustering center is selected at random, this embodiment introduces an ant colony algorithm: when the clustering center is initialized, the data in the data set are regarded as ants searching for food, and the clustering process is regarded as the ants' search for a food source, which makes the clustering center more accurate. Given a data set Q = {q_i | q_i = (q_i1, q_i2, …, q_in), i = 1, 2, …, m}, where n and m are constants, the specific calculation formulas are as follows:

A and B in formula (1) are positive constants;

κ is the residual pheromone intensity;

t is time;

λ_ij(t) is the amount of pheromone between data i and data j at time t;

C_j in formula (2) is the merged data set;

n is the number of data items in the data set.
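Formulas (1) and (2) are not reproduced in this text, so the following Python sketch does not implement them. It only illustrates the role of κ as a residual (persistence) factor, using the textbook ant colony update in which the pheromone at time t + 1 equals κ times the pheromone at time t plus a newly deposited amount. The function name `update_pheromone` and the `deposit` matrix are hypothetical and introduced purely for illustration.

```python
def update_pheromone(pheromone, deposit, kappa):
    """Textbook update (an assumption, not formula (1)):
    new[i][j] = kappa * pheromone[i][j] + deposit[i][j]."""
    return [
        [kappa * old + new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(pheromone, deposit)
    ]


# Two data items; pheromone[i][j] plays the role of lambda_ij(t).
pheromone = [[0.0, 1.0], [1.0, 0.0]]
deposit = [[0.0, 0.5], [0.5, 0.0]]  # hypothetical newly deposited pheromone
updated = update_pheromone(pheromone, deposit, kappa=0.5)
```

With κ = 0.5, half of the existing pheromone persists between steps, so pairs of data that stop receiving deposits gradually lose their attraction.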

To address the uneven data distribution that results from initializing the cluster center with the ant colony algorithm, the algorithm updates the cluster center with the maximum-minimum distance method. The distance between any two data items is calculated and recorded in a matrix to obtain Ī between the two data items; according to the Density(q_i) principle, isolated data are removed from the set C_j to obtain the updated clustering center, with the formula as follows:

where Density(q_i) is the density of q_i;
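The density formula itself is not reproduced in this text. As a minimal sketch under stated assumptions, the Python code below treats Ī as the mean of all pairwise distances within one cluster and takes the density of a point to be the number of other points lying within Ī of it; points whose density falls below a threshold are treated as isolated and removed before the center is recomputed as the mean of the remaining points. These definitions, the `min_density` parameter, and all function names are illustrative assumptions rather than the patented formula.

```python
import math


def pairwise_distances(points):
    """Distance matrix: entry [i][j] is the Euclidean distance between points i and j."""
    return [[math.dist(p, q) for q in points] for p in points]


def mean_pairwise_distance(dist):
    """Average distance over all unordered pairs (assumed meaning of Ī)."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def density(dist, i, radius):
    """Assumed density: how many other points lie within `radius` of point i."""
    return sum(1 for j, d in enumerate(dist[i]) if j != i and d <= radius)


def update_center(points, min_density=1):
    """Drop isolated points from one cluster, then average the remaining ones."""
    if len(points) < 2:
        kept = list(points)
    else:
        dist = pairwise_distances(points)
        radius = mean_pairwise_distance(dist)
        kept = [p for i, p in enumerate(points)
                if density(dist, i, radius) >= min_density]
        if not kept:  # every point looked isolated: fall back to the full set
            kept = list(points)
    center = tuple(sum(coords) / len(kept) for coords in zip(*kept))
    return center, kept


# Four points on a unit square plus one far-away outlier.
cluster_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
center, kept = update_center(cluster_points)
```

In this example the outlier (10, 10) has no neighbor within the mean pairwise distance, so it is removed and the center is computed from the four square corners only.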

After the cluster center has been updated, the distance from each of the other data samples in the set C_j to the new cluster center is calculated (see fig. 1). To improve the efficiency of the clustering computation, an intelligent big data mining platform is built on Hadoop, in which the Map task calculates the distance from each data item to the cluster center and the Reduce task updates the cluster center in each round.
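The division of work between the Map and Reduce tasks can be illustrated with a small in-process Python simulation. This is a sketch only: it does not use the Hadoop API, the names `map_point`, `reduce_cluster`, and `run_round` are hypothetical, and the Reduce step uses a plain coordinate-wise mean as a stand-in for the density-based center update.

```python
import math
from collections import defaultdict


def map_point(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)),
                  key=lambda k: math.dist(point, centers[k]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: the new center is the coordinate-wise mean of the points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def run_round(data, centers):
    """One Map/Reduce round: group points by nearest center, then recompute centers."""
    groups = defaultdict(list)  # plays the role of the shuffle phase
    for point in data:
        key, value = map_point(point, centers)
        groups[key].append(value)
    # A center that attracted no points is kept unchanged.
    return [reduce_cluster(groups[k]) if k in groups else centers[k]
            for k in range(len(centers))]


data = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers = [(1, 1), (9, 9)]
new_centers = run_round(data, centers)
```

Because each Map call depends only on one data item and the current centers, the distance calculations can be distributed across workers, while each Reduce call needs only the points assigned to a single center.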

Fig. 1 is a flow chart combining the ant colony algorithm with the maximum-minimum distance algorithm. According to the flow chart, the cluster center is initialized with the ant colony algorithm and the initial cluster center is selected; the distance from the data to the initial cluster center is calculated and the data are classified according to that distance; the cluster center is then checked for change. If it has changed, the cluster center is updated and the next round of distance calculation begins; if it has not changed, the process ends.
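The control flow of fig. 1 can be sketched in Python as a loop that stops once the cluster centers no longer change. The sketch makes two simplifying assumptions: the initial centers are passed in as an argument (standing in for the ant colony initialization), and each center is updated as the plain mean of its assigned points (standing in for the density-based maximum-minimum distance update). The function names and the `max_rounds` safeguard are hypothetical.

```python
import math


def assign(data, centers):
    """Classify each data item by its nearest center."""
    clusters = [[] for _ in centers]
    for point in data:
        k = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
        clusters[k].append(point)
    return clusters


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def cluster(data, initial_centers, max_rounds=100):
    """Repeat assignment and center update until the centers stop changing."""
    centers = [tuple(c) for c in initial_centers]
    for round_no in range(1, max_rounds + 1):
        clusters = assign(data, centers)
        new_centers = [mean(pts) if pts else centers[k]
                       for k, pts in enumerate(clusters)]
        if new_centers == centers:  # no change: the process ends
            return centers, clusters, round_no
        centers = new_centers       # changed: go to the next round
    return centers, assign(data, centers), max_rounds


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, clusters, rounds = cluster(data, [(0.0, 0.0), (10.0, 10.0)])
```

Here the first round moves both centers and the second round leaves them unchanged, at which point the loop ends.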

The improved algorithm of the invention computes the initial clustering center from the initial data set with an ant colony algorithm, calculates the distance between any two data items on that basis, and updates the clustering center by combining the ant colony algorithm with a density-based maximum-minimum distance method, so that the data are clustered and isolated samples are eliminated. Building an intelligent big data clustering mining platform further improves the efficiency of the clustering computation. The improved clustering mining technique mitigates both the excessive randomness of initial clustering center selection and the low running efficiency of the algorithm during clustering.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A big data cluster mining method, characterized by comprising the following steps: obtaining a data set,

2. The big data cluster mining method according to claim 1, characterized in that an intelligent big data cluster mining platform is further built by adopting the big data cluster mining method.

3. The big data cluster mining method according to claim 1, wherein, when the cluster center is initialized, the data in the data set are regarded as ants searching for food and the clustering process is regarded as the ants' search for a food source, so that the cluster center is more accurate; given a data set Q = {q_i | q_i = (q_i1, q_i2, …, q_in), i = 1, 2, …, m}, where n and m are constants, the specific calculation formulas are as follows:

A and B in formula (1) are positive constants;

κ is the residual pheromone intensity;

t is time;

λ_ij(t) is the amount of pheromone between data i and data j at time t;

C_j in formula (2) is the merged data set;

n is the number of data items in the data set.

4. The big data cluster mining method according to claim 3, wherein updating the cluster center by the maximum-minimum distance method comprises:

calculating the distance between any two data items and recording it in a matrix to obtain Ī between the two data items, and, according to the Density(q_i) principle, removing isolated data from the set C_j to obtain the updated clustering center, with the formula as follows:

where Density(q_i) is the density of q_i;

after the cluster center has been updated, calculating the distance from each of the other data samples in the set C_j to the new cluster center.

5. The big data cluster mining method according to claim 2, wherein an intelligent big data mining platform is built on Hadoop, the Map task calculates the distance from each data item to the cluster center, and the Reduce task updates the cluster center in each round.

6. A big data clustering mining platform, characterized in that the platform comprises a server having a memory; and

a processor coupled to the memory and configured to execute instructions stored in the memory, the processor being operable to:

obtain a data set,

7. A storage medium on which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 5.