WO2015180340A1 - Data mining method and apparatus - Google Patents

Data mining method and apparatus (一种数据挖掘方法及装置)

Info

Publication number
WO2015180340A1
WO2015180340A1 (PCT application PCT/CN2014/087630)
Authority
WO
WIPO (PCT)
Prior art keywords
data
execution
physical resources
input
input data
Prior art date
Application number
PCT/CN2014/087630
Other languages
English (en)
French (fr)
Inventor
谭卫国
汪芳山
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP14893347.6A (published as EP3121735A4)
Publication of WO2015180340A1
Priority to US15/337,508 (published as US10606867B2)

Classifications

    • G06F16/2471 Distributed queries
    • G06F16/285 Clustering or classification
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06N20/00 Machine learning

Definitions

  • the embodiments of the present invention relate to data processing technologies, and in particular, to a data mining method and apparatus.
  • Data mining refers to the non-trivial process of revealing hidden, previously unknown and potentially valuable information from the large amounts of data in a database. Drawing mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases and visualization technology, it analyzes enterprise data in a highly automated way, makes inductive inferences and uncovers latent patterns, helping decision makers adjust market strategies, reduce risk and make sound decisions.
  • However, with the arrival of the big-data era the sources of data to be mined have become much broader, so the number of samples and/or feature columns in a data set can be extremely large. In the prior art, if too many feature columns are selected during feature column selection in step (2), step (3) may run short of resources such as memory, causing the data mining process to fail.
  • the embodiment of the invention provides a data mining method and device to overcome the failure of data mining process execution caused by insufficient physical resources in the data mining process.
  • an embodiment of the present invention provides a data mining method, where the method is applied to a distributed system, where the distributed system includes at least one node, and the method includes:
  • The method includes: determining a plurality of execution steps of a data mining process; obtaining a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process; determining the nodes that execute the execution steps, where the nodes provide physical resources for the execution steps; determining, according to the correspondence and the physical resources owned by the node executing a given step, the maximum amount of the input data that the node executing each step can process; determining, from the maximum amounts of input data the nodes executing the individual steps can process, the maximum amount of input data the distributed system can process; and processing the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process.
  • an embodiment of the present invention provides a data mining apparatus, where the apparatus includes: a transceiver, a processor, and a memory;
  • the transceiver is configured to receive an original data set, and send the extracted input data to be processed to each node for processing;
  • the memory is configured to store an original data set; and
  • the processor is configured to determine a plurality of execution steps of a data mining process; obtain a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process; determine the nodes that execute the execution steps, where the nodes provide physical resources for the execution steps; determine, according to the correspondence and the physical resources owned by the node executing a given step, the maximum amount of the input data that the node executing each step can process; determine, from those per-step maxima, the maximum amount of input data the distributed system can process; and process the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process.
  • By jointly evaluating the characteristics of the data mining process and the relationship between the process and the physical resources owned by the network nodes of the distributed network system, the embodiments of the invention derive the maximum amount of data that running the data mining process in that distributed system can support, so the input data is accurately and effectively bounded and normal operation of the system is guaranteed.
  • FIG. 1 is a flowchart of Embodiment 1 of a data mining method according to the present invention;
  • FIG. 2 is a schematic diagram of a data mining process used as an example in the present invention;
  • FIG. 3 is a flowchart of Embodiment 2 of a data mining method according to the present invention;
  • FIG. 4 is a flowchart of Embodiment 3 of a data mining method according to the present invention;
  • FIG. 5 is a structural diagram of Embodiment 1 of a data mining apparatus according to the present invention.
  • FIG. 1 is a flowchart of Embodiment 1 of a data mining method according to the present invention.
  • the execution body of this embodiment may be a general data mining device, and the data mining device may be implemented by general software and/or hardware.
  • the data mining method of this embodiment is applied to a distributed architecture that includes at least one node, which may be an ordinary PC, a virtual machine on a server in a cloud architecture, or any other computing resource that can be used in the distributed architecture.
  • the method in this embodiment may include:
  • Step 101 Determine multiple execution steps of the data mining process.
  • The plurality of execution steps of the data mining process may be determined by the data mining device by parsing the data mining process, or retrieved by the data mining device from a storage device that stores the execution steps of the data mining process.
  • When parsing the data mining process, the division into steps may follow the different algorithmic principles used in different stages of the process; or it may follow the staged processing results obtained during the process; or it may follow the logical steps of the data mining process, which are usually defined when the process is designed and are strongly tied to the processing stages.
  • The above parsing approaches are a simple enumeration of what the present invention can cover and do not limit its scope.
  • Step 102 Obtain a correspondence between physical resources required by the execution steps in the running process and physical resources occupied by input data of the data mining process.
  • the corresponding relationship is a ratio parameter between a physical resource required by each execution step in the running process and a physical resource occupied by the input data of the data mining process.
  • Step 103 Determine a node that performs each of the execution steps, and the node provides physical resources for each execution step.
  • the relationship of nodes providing physical resources for each execution step includes: the same node provides physical resources for multiple execution steps; multiple nodes collectively provide physical resources for one execution step; multiple nodes provide physical resources for multiple execution steps, etc. .
  • Preferably, the data mining device obtains in advance the situation of all nodes or of the available nodes in the distributed system, for example which nodes are idle, which nodes can be used in combination, and even the history of execution steps previously run on the nodes.
  • Normally the operation of each node is managed by a management device in the distributed system, and the data mining device can obtain the distribution and the capability attributes of each node directly from that management device.
  • Step 104 Determine, according to the correspondence relationship and a physical resource owned by a node for performing a corresponding execution step, a maximum data amount of the input data that can be processed by a node that executes each step.
  • Since the correspondence was obtained in step 102 and the nodes providing physical resources were determined in step 103, for each execution step the maximum amount of data that this single step allows the data mining process to take as input is calculated from the physical resources owned by the corresponding node(s).
  • Step 105 Determine a maximum data amount of input data that the distributed system can process according to a maximum data amount of input data that can be processed by the node that performs each step.
  • Step 104 yields, for the individual execution steps, several maximum amounts of data that each step allows the data mining process to take as input; the maximum amount of data the distributed system can accept is the minimum of these per-step maxima.
  • The principle is similar to the "shortest board" (bottleneck) principle: the distributed system can run normally only if the amount of input data is smaller than the minimum of the maximum amounts of data that the individual execution steps can process.
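  • As a minimal illustration of this bottleneck rule (a Python sketch; the per-step figures below are purely illustrative, not values taken from the embodiments), the system-wide limit is simply the smallest of the per-step maxima:

        # Hypothetical maximum input (in GB) that each execution step could
        # handle on the nodes assigned to it.
        per_step_max = {"step 1": 6.3, "step 2": 6.7, "step 3": 13.3, "step 4": 8.9}

        # "Shortest board" principle: the system accepts input only up to the
        # smallest per-step maximum.
        system_max = min(per_step_max.values())
        print(system_max)  # 6.3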
  • Step 106 Process the data to be mined according to the data mining process according to the maximum data amount of the input data that can be processed by the distributed system.
  • the embodiment of the present invention comprehensively evaluates the characteristics of the data mining process (including: the execution steps included in the data mining process, and the relationship between the execution steps and the nodes running the execution steps) and the physical resources owned by the network nodes themselves in the distributed network system. The relationship is obtained, and the maximum amount of data that can be supported by running the data mining process in the distributed network system is obtained, and the input data is accurately and effectively defined to ensure the normal operation of the system.
  • Those skilled in the art will understand that the given data mining process can be any well-known data mining process; what the present invention does is analyze that process and combine it with the physical resources owned by each node of the distributed system, so that the input data is bounded and optimized accordingly. The data mining process itself is not specially restricted in this embodiment.
  • The explanation of step 101 disclosed methods for determining the plurality of execution steps of a data mining process; the determination is described in detail below with reference to a concrete data mining process.
  • FIG. 2 is a schematic diagram of a data mining process according to an example of the present invention.
  • The data mining process of this embodiment is only illustrative; for other data mining processes the execution steps can likewise be obtained by applying the method of the present invention on the basis of this disclosure.
  • the data mining process includes the following steps:
  • Step 1: feature column selection. This step selects feature columns from the input data; the subsequent process runs only on the selected feature columns, and the remaining feature columns no longer take part in the analysis of later steps. Those skilled in the art will understand that among the feature columns there is one target column, which is required to be the data column most relevant to the problem the data mining is meant to solve.
  • The feature selection here is an operational step of the exemplified data mining process, whose purpose is more efficient execution, whereas the feature column extraction involved in the method of the present invention is motivated by the limit on the maximum amount of data the distributed system allows as input; the two have different purposes and meanings. In an optional scheme, when the execution steps of the data mining process include feature column selection, that selection may be merged into the feature column selection of the data mining method provided by the present invention, for example by merging step 1 into step 404 and handling them as a single step.
  • Table 1 is an example of input data for the data mining process. When the problem to be solved is identifying users inclined to leave the network, the target column is preferably the "off-network" data column.
  • Step 2: normalize the data in the selected feature columns. This step normalizes the feature values in each feature column to the range 0-1; for example, the age value range is 0-100, so dividing every age value by 100 gives the normalized data of that feature column.
  • Step 3: fill missing values in the selected feature columns with a median. If the value of a sample in a feature column is empty in the input data, the empty position is filled with the median so as not to affect the subsequent process; for example, if the age of a user sample is empty, it is filled with 50, the median of 0 and 100.
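  • The two preprocessing steps can be sketched in a few lines of Python (pandas is an assumption of this illustration, not something the patent prescribes; the column values are toy numbers in the spirit of Table 1):

        import pandas as pd

        # Toy "age" feature column; None marks a missing value.
        age = pd.Series([35, 26, None, 41], dtype="float64")

        # Step 2: normalize the 0-100 age range into [0, 1] by dividing by 100.
        age_norm = age / 100.0

        # Step 3: fill missing values with the midpoint of the range
        # (50 on the original scale, i.e. 0.5 after normalization).
        age_norm = age_norm.fillna(0.5)
        print(age_norm.tolist())  # [0.35, 0.26, 0.5, 0.41]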
  • Step 4: data partitioning. This step splits the data produced by steps 1-3 in half: one half becomes the input data of step 5 and the other half becomes the input data of step 6.
  • Step 5: k-Nearest Neighbor (KNN) model learning. KNN model learning is performed with the half of the data rows from the step-4 partition as input. Executing step 5 outputs a KNN model, which is also the main output of the entire data mining process.
  • Step 6 Evaluation of the KNN model. This step takes the KNN model outputted in step 5 as an input, and performs KNN model evaluation on the data obtained by the partitioning of step 4. Step 6 obtains parameters such as accuracy and recall rate of the KNN model.
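  • Steps 4 to 6 can be sketched as follows (a hedged Python illustration using scikit-learn, which the patent does not name; the data is randomly generated): the preprocessed rows are split in half, a KNN model is learned on one half and evaluated on the other, yielding accuracy and recall.

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.metrics import accuracy_score, recall_score

        rng = np.random.default_rng(0)
        X = rng.random((200, 5))             # normalized feature columns
        y = (X[:, 0] > 0.5).astype(int)      # toy "off-network" target column

        # Step 4: data partitioning - half for learning, half for evaluation.
        half = len(X) // 2
        X_train, y_train = X[:half], y[:half]
        X_eval, y_eval = X[half:], y[half:]

        # Step 5: KNN model learning.
        model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

        # Step 6: KNN model evaluation - accuracy and recall.
        pred = model.predict(X_eval)
        print(accuracy_score(y_eval, pred), recall_score(y_eval, pred))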
  • Preferably, the data mining process shown in FIG. 2 is parsed along the above execution steps, yielding the multiple execution steps of the process shown in Table 2.
  • the analysis of the data mining process in this embodiment is a specific manner of determining a plurality of execution steps of the data mining process.
  • the six execution steps are obtained in a relatively simple parsing manner, and the six execution steps obtained by the parsing are referred to as the first set of execution steps in the subsequent embodiments.
  • the first set of execution steps of the data mining process may be selected in another manner, for example, the correspondence between the data mining process and the corresponding first set of execution steps is directly recorded in the distributed system.
  • the embodiment of the present invention provides an optimization processing method for the first set of execution steps, in addition to directly performing the first set of execution steps to perform the subsequent steps 102-106.
  • The process data generated in the plurality of execution steps is analyzed, which specifically includes: determining that the number of execution steps taking a given piece of process data as input data is one, and that the input data of that one execution step contains no process data other than this process data; the execution step that generates the process data and the execution step that takes it as input are then merged into one optimized execution step.
  • the process data is specifically represented by input data or output data of specific steps in Table 2.
  • The condition for judging that two execution steps can be merged also applies to merging two or more execution steps in a serial relationship. For ease of description, the embodiments of the present invention call the process data between at least two execution steps that are merged "temporary data"; for example, after steps 1, 2 and 3 are merged, the process data between step 1 and step 2 can be called temporary data, and so can the process data between step 2 and step 3.
  • the set of execution steps consisting of the original execution steps and the optimized execution steps in the first set of execution steps are also referred to as the second set of execution steps, as shown in Table 3 below.
  • By analyzing the association between process data and execution steps in the data mining process, the inventors merge the execution step that generates the temporary data with the execution step that takes the temporary data as input, so that the space occupied by the temporary data is not counted into the space occupied by the execution step, which improves node utilization. After this optimization of the execution steps, the physical resources of the nodes can be used more effectively and larger input data can be processed.
  • With reference to Table 2, the process data generated by all execution steps other than KNN model learning and KNN model evaluation satisfies the above merging requirement. Execution steps 1, 2 and 3 in Table 2 operate on single-row samples, and the rows are mutually independent, so for each row of data the operations of steps 1-3 can be performed in sequence and only one piece of process data is output, instead of each execution step outputting its own temporary data. Because the output data of step 4 is not consumed entirely by execution step 5 or by execution step 6, the process data generated by step 4 cannot be deleted as soon as it has been used as input of step 5 or of step 6; it can only be deleted after both step 5 and step 6 have finished using it. Therefore step 4 cannot be merged with step 5 or step 6 into a single execution step.
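  • The merging condition can be sketched in Python as follows (the step graph mirrors the FIG. 2 example; identifiers such as "d1" are hypothetical names for the process data): a producer is merged with a consumer only if the producer's output has exactly one consuming step and that consumer takes no other process data as input.

        # Each step lists the data it consumes and the single datum it produces.
        steps = {
            "s1": {"in": ["raw"], "out": "d1"},
            "s2": {"in": ["d1"], "out": "d2"},
            "s3": {"in": ["d2"], "out": "d3"},
            "s4": {"in": ["d3"], "out": "d4"},
            "s5": {"in": ["d4"], "out": "model"},
            "s6": {"in": ["d4", "model"], "out": "report"},
        }
        process_data = {s["out"] for s in steps.values()}

        def can_merge(producer, consumer):
            """Producer's output has exactly one consumer, and that consumer
            uses no other process data besides it."""
            out = steps[producer]["out"]
            consumers = [n for n, s in steps.items() if out in s["in"]]
            other = [d for d in steps[consumer]["in"] if d != out and d in process_data]
            return consumers == [consumer] and not other

        print(can_merge("s1", "s2"), can_merge("s2", "s3"), can_merge("s4", "s5"))
        # True True False: steps 1-3 collapse into one optimized step, while
        # step 4 cannot be merged with step 5 (its output also feeds step 6).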
  • From the above analysis, a further basis for merging can be summarized: obtain the plurality of execution steps of the data mining process, analyze the process data they generate, and when two or more consecutive execution steps each process a single sample at a time, merge those execution steps.
  • For an original data set, the amount of data is jointly determined by the number of samples and the number of feature columns each sample contains. Therefore, when the merging of execution steps is judged from the characteristics of the process data between steps (i.e. the number of execution steps taking the process data as input is one, and the input data of that step contains no other process data), it is preferable to keep the input data within the determined maximum amount of data by extracting feature columns; when the merging is judged from the fact that the unit processed by a step is a single sample, it is preferable to keep the input data within the determined maximum amount of data by controlling the total number of samples.
  • In a specific implementation, regardless of how the plurality of execution steps was obtained, what enters step 102 is either the first set of execution steps or the second set of execution steps; the implementation principle of the subsequent steps 102 to 104 is the same in both cases. In this embodiment, for ease of explanation, the optimized second set of execution steps shown in Table 3 is taken as the example and described in detail.
  • the nodes in the distributed architecture in this embodiment may specifically be computers, servers, virtual machines, and the like.
  • the physical resources of this embodiment may specifically be a processor core, a hard disk, a memory, or the like.
  • the details can be as shown in Table 4. Table 4 shows that the distributed system includes two nodes. In the specific implementation process, the number of nodes in the distributed system in this embodiment may be specified according to a specific environment, and no specific limitation is imposed herein.
  • In step 102, the correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process is obtained. Specifically, for each of the plurality of execution steps of the data mining process, the ratio between the physical resources jointly occupied by the input data and the output data of that execution step while it runs and the physical resources occupied by the input data of the data mining process is determined; this ratio is a concrete expression of the correspondence.
  • In a specific implementation, memory is the resource most likely to become the bottleneck for the number of feature columns that can be selected, so the following analysis uses memory as the example. Based on experience, the estimation of other cluster resources such as CPU and hard disk, and their influence on the number of feature columns, can be considered on top of the memory example or analyzed separately in a similar way, and is not expanded upon here.
  • the above ratio parameter may be preset according to an empirical value, or may be calculated by an execution step in an instant.
  • To describe the subsequent process more clearly, this embodiment lists in Table 5 an example of the proportional relationship, in each execution step, between the memory Ti occupied by the input data of the data mining process and the memory To occupied by the output data; the specific meanings of T1 to T4 are shown in Table 3.
  • The output data represented by T4 is the model evaluation result, which generally consists of a few indicators such as accuracy and recall, so the memory it occupies is negligible and is not mentioned below. Note that Table 5 gives the Ti-to-To resource proportions of the optimized second set of execution steps of Table 3.
  • Those skilled in the art will understand that, in a normal data mining process, the input data to be processed is set to M. From this, the ratio between the physical resources each execution step requires while running and the physical resources occupied by the input data of the data mining process is obtained, as shown in Table 6.
  • The physical resources required by an execution step while running include the physical resources occupied by the input data of the execution step and the physical resources occupied by its output data, so (input data of the step + output data of the step)/M is the ratio between the physical resources required by the step and the physical resources occupied by the input data of the data mining process.
  • When an execution step is an optimized execution step and the physical resources occupied by the process data generated inside the optimized step are greater than those of the optimized step's input data and/or output data, obtaining the correspondence between the physical resources required by the execution steps while running and the physical resources occupied by the input data of the data mining process specifically includes:
  • taking, out of the three ratios of (a) the physical resources occupied by the process data, (b) the physical resources occupied by the input data of the optimized step and (c) the physical resources occupied by the output data of the optimized step, each to the physical resources occupied by the input data of the data mining process, the two larger ratios, and summing them to obtain the ratio parameter between the physical resources the optimized execution step requires while running and the physical resources occupied by the input data of the data mining process. For example, if the temporary data generated between step 1 and step 2 is 1.2M in size, the ratio of the physical resources required to run execution step 1 to the physical resources occupied by the input data of the data mining process is (1.2M + M) : M = 2.2 : 1.
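  • A minimal Python sketch of this rule (the figures follow the 1.2M example above and Table 5's output ratio for T1; they are illustrative only):

        def optimized_step_ratio(temp_ratio, input_ratio, output_ratio):
            """Ratio parameter of an optimized step: the sum of the two larger
            of the three ratios (temporary data, step input, step output), each
            already expressed relative to the flow's input data M."""
            return sum(sorted([temp_ratio, input_ratio, output_ratio])[-2:])

        # Temporary data of 1.2M, step input of 1.0M, step output of 0.9M:
        print(optimized_step_ratio(1.2, 1.0, 0.9))  # 2.2, i.e. (1.2M + M) : M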
  • In step 103, the nodes that execute the execution steps are determined; these nodes provide the physical resources for the execution steps.
  • Because the embodiments of the invention apply to a distributed architecture, the physical resources needed by execution steps 1 to 4 of the second set in Table 3 can be stored and processed in a distributed manner on the nodes of the architecture; the physical resources owned by the nodes provided for an execution step therefore determine the maximum amount of data that the execution step can process.
  • step 104 the maximum data amount of the input data that can be processed by the node performing the respective steps is determined according to the correspondence relationship and the physical resources owned by the node performing the corresponding execution step.
  • a distributed architecture includes two nodes, and the physical resources owned by the two nodes are respectively recorded as M1 and M2.
  • step 105 the maximum amount of data that the distributed system can process is determined based on the maximum amount of data that can be processed by the node performing the various steps.
  • From the constraints that simultaneously satisfy the inequalities of the individual execution steps, M < A can be solved, where A is the maximum amount of input data that the data mining process can process.
  • In step 106, the data to be mined is processed according to the data mining process on the basis of the maximum amount of data. Specifically, the input data to be processed that is extracted from the original data set is less than or equal to the maximum amount of data.
  • Preferably, several sizes of input data to be processed, such as a saturation level, a normal level and an optimum level, can be set at, for example, 80%, 60% and 50% of the maximum amount of data respectively, so that the user can choose suitable input data to be processed according to the level.
  • In the specific application environment of this embodiment shown in FIG. 2, the input data to be processed is used as the input of the data mining process, the data mining process is executed to obtain a data mining model, and the effect of the data mining model is then verified and evaluated.
  • In the data mining method provided by the embodiments of the present invention, the maximum amount of input data that can be processed is determined from the ratio between the physical resources required by each execution step of the data mining process while running and the physical resources occupied by the input data of the data mining process, together with the physical resources owned by each node of the distributed system that provides physical resources for the process; the input data to be processed is then extracted from the original data set according to that maximum amount. For data mining over big data, this makes it possible to determine, under the constraint of limited physical resources, the maximum amount of data the data mining process can handle, ensuring that the mining task is completed effectively.
  • FIG. 3 is a flowchart of Embodiment 2 of a data mining method according to the present invention. This embodiment elaborates steps 103-104 on the basis of Embodiment 1 of FIG. 1. In the specific implementation, the data mining process of FIG. 2 and the optimized second set of execution steps of Table 3 are still used as the example, and the physical resources each execution step requires while running are provided jointly by multiple nodes of the distributed architecture.
  • Step 301 Determine a node that provides physical resources for each execution step.
  • Step 302 sequentially traverse each execution step to obtain a maximum value of input data that can be processed by the node that executes each execution step.
  • Specifically, the one or more nodes that provide physical resources for the execution step being traversed are determined, and from the physical resources those nodes own and the ratio parameter of the execution step, the maximum amount of data that the execution step can take as input on those nodes is calculated.
  • Step 303 after traversing each execution step, calculating a maximum value of the plurality of input data obtained by each execution step in the traversal process, and taking a minimum value of the maximum values of the plurality of input data as the distributed system The maximum amount of data that can be processed for input data.
  • Step 302 is explained by taking execution step 1 as an example. Specifically, the ratio between the physical resources needed by the step while running and the physical resources occupied by the input data to be processed is (M + 0.9M)/M = 1.9, and the physical resources of nodes M1 and M2 are 4 GB and 8 GB respectively, so the maximum amount of input data that execution step 1 can process is (4 + 8)/1.9 = 6.32 GB.
  • The implementation for execution steps 2 and 3 is similar. Table 7 below, combined with the physical resources owned by each node in the distributed system, is used to determine the maximum amount of input data that running the data mining process in the distributed system can handle.
  • For execution step 1: M + 0.9M < M1 + M2; for execution step 2: 0.9M + 0.45M + 0.45M < M1 + M2; for execution step 3: 0.45M + 0.45M < M1 + M2; for execution step 4: 0.45M + 2 × 0.45M < M1 + M2, and in addition the physical resources needed by the step's input data must be smaller than the physical resources of each individual node in the distributed architecture, i.e. 0.45M < M1 and 0.45M < M2, where 0.45M corresponds to the input data of T4. Determining the range of M from the expressions in Table 7 and substituting M1 = 4 GB and M2 = 8 GB, the smallest of these bounds is M < 6.31 GB, i.e. the maximum amount of input data that can be processed is 6.31 GB.
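  • The Embodiment 2 figures can be reproduced with the short Python sketch below (node sizes from Table 4; the per-step multiples of M follow the inequalities just listed; this is an illustration, not part of the patent):

        M1, M2 = 4.0, 8.0          # node memories in GB (Table 4)
        total = M1 + M2

        # Memory needed by each optimized step, as a multiple of the flow's
        # input M (from the Ti/To ratios of Table 5).
        step_ratios = {
            "step 1": 1.0 + 0.9,
            "step 2": 0.9 + 0.45 + 0.45,
            "step 3": 0.45 + 0.45,
            "step 4": 0.45 + 2 * 0.45,   # 0.45M input plus a 0.45M model per node
        }

        bounds = [total / r for r in step_ratios.values()]
        # Step 4 additionally requires its 0.45M share to fit on every node.
        bounds.append(min(M1, M2) / 0.45)

        print(round(min(bounds), 2))     # ~6.32; the embodiment states the bound as 6.31 GB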
  • the data mining method provided by the embodiment can be used for data mining of big data, and can determine the maximum amount of data that can be processed by the data mining process under the constraint of limited physical resources, thereby ensuring effective completion of tasks.
  • FIG. 4 is a flowchart of Embodiment 3 of a data mining method according to the present invention. The embodiment is implemented on the basis of the embodiments of FIG. 1 and FIG. 3, and specifically includes the following steps:
  • Step 401 Analyze the data mining process to obtain a plurality of execution steps of executing the data mining process.
  • Step 402 Obtain a ratio between physical resources required by the execution steps in the running process and physical resources occupied by input data of the data mining process.
  • Step 403 Determine, according to the ratio and physical resources owned by each node in the distributed system, a maximum amount of data that can be processed by the data mining process in the distributed system.
  • Step 404 Determine, according to the maximum amount of data, a maximum number K of data columns selected from the original data set, where K is an integer.
  • Step 405 Select K data columns from the original data set, where the K data columns include K-1 feature columns and one target column.
  • Step 406 Extract, according to the K data columns, input data to be processed from the original data set.
  • Step 407 Perform data mining processing by using the input data of the preparation process as an input of the data mining process.
  • step 401-403 in this embodiment are similar to the steps 101-105, and are not described herein again.
  • the specific implementation process of step 403 in this embodiment may be performed according to the embodiment in FIG. 3, and details are not described herein again.
  • In step 404, the maximum number K of data columns to select from the original data set is determined from the maximum amount of data; specifically, K is determined from the maximum amount of data, the number of rows of the original data set and the physical resources occupied by one data column. For example, following the result of Embodiment 2, M < 6.31 GB is known; if the original data set has ten million rows and each data column value occupies 8 bytes of memory, the maximum number of columns is K = 6.31×10^9/(10^7 × 8) = 78, i.e. at most 78 data columns can be selected from the original data set without the computation on any node exceeding the maximum available memory.
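  • The column budget of this example works out as follows (a trivial Python check; the 8 bytes per value is the assumption stated above):

        max_data_bytes = 6.31e9      # maximum input the flow can handle (~6.31 GB)
        rows = 10_000_000            # rows in the original data set
        bytes_per_column = 8         # memory taken by one value of one data column

        K = int(max_data_bytes // (rows * bytes_per_column))
        print(K)                     # 78 data columns at most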
  • K data columns are selected from the original data set, the K data columns including K-1 feature columns and one target column.
  • In this embodiment step 405 has two possible implementations; both of them also include:
  • obtaining a first correlation coefficient between any two feature columns in the original data set, where the first correlation coefficient is greater than or equal to 0 and less than or equal to 1, and the correlation between any two feature columns is proportional to the value of the first correlation coefficient;
  • obtaining a second correlation coefficient between any feature column and the target column in the original data set, where the second correlation coefficient is greater than or equal to 0 and less than or equal to 1, and the correlation between a feature column and the target column is proportional to the value of the second correlation coefficient.
  • Specifically, for ease of explanation, a small data set (Table 8) is used here to show how the first and second correlation coefficients are calculated. In Table 8 the "off-network" column is the target column (1 means the user has left the network, 0 means the user has not), and the remaining columns are feature columns.
  • Each feature column can be regarded as a vector.
  • the dimension of the vector is the number of samples of the original data set, and the value is the value of the feature column for each sample of the original data set.
  • the target column can also be thought of as a vector, and the value is the value of the target column for each sample.
  • The correlation coefficient of the vectors a = <a1, a2, ..., an> and b = <b1, b2, ..., bn> can be computed as their cosine similarity: Corr(a, b) = (a1·b1 + a2·b2 + ... + an·bn) / (sqrt(a1² + ... + an²) × sqrt(b1² + ... + bn²)), referred to as formula (1). The correlation coefficient ranges over [0, 1]; the closer to 1, the higher the correlation of the vectors, and the closer to 0, the lower.
  • By formula (1), the correlation coefficient between age and time on the network in Table 8 is (35×10 + 26×1 + 41×15) / (sqrt(35² + 26² + 41²) × sqrt(10² + 1² + 15²)) = 991 / (sqrt(3582) × sqrt(326)) ≈ 0.92.
  • the correlation coefficient between each feature column and the target column can also be calculated in the same way.
  • To improve performance, matrix multiplication can be used to calculate the pairwise correlation coefficients between feature columns and between each feature column and the target column all at once.
  • For example, the data of Table 8 can be represented as a matrix A whose columns are the feature columns and the target column; computing the matrix product AT·A (AT being the transpose of A) then gives all the required dot products at once. The off-diagonal elements of AT·A are the pairwise dot products, i.e. the numerator of formula (1), and the diagonal elements are the squared norms of the columns, whose square roots form the denominator of formula (1), so the correlation coefficients between feature columns, or between a feature column and the target column, can be read off directly. For instance, the first correlation coefficient between the first and second feature columns is (AT·A)12 / sqrt((AT·A)11 · (AT·A)22), and the second correlation coefficient between the second feature column and the target column is obtained in the same way from the corresponding entries.
  • The resulting correlation coefficient matrix is symmetric with diagonal elements equal to 1, so only its upper-triangular (or lower-triangular) part needs to be computed; the resulting first and second correlation coefficients can be as shown in Table 9.
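  • A minimal numpy sketch of this computation over the Table 8 data (numpy is an assumption of the illustration; the printed values are computed from the toy data rather than quoted from the patent):

        import numpy as np

        # Columns of Table 8: age, time on network, online sessions, "off-network" target.
        A = np.array([[35, 10, 25, 0],
                      [26,  1, 40, 1],
                      [41, 15,  3, 0]], dtype=float)

        G = A.T @ A                          # off-diagonal: pairwise dot products
        norms = np.sqrt(np.diag(G))          # column norms from the diagonal
        corr = G / np.outer(norms, norms)    # cosine similarity of formula (1)

        print(round(corr[0, 1], 2))          # age vs. time on network: ~0.92
        print(np.round(corr[:3, 3], 2))      # each feature column vs. the target column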
  • After the first and second correlation coefficients are obtained, K data columns are selected from the original data set. The first possible implementation is as follows:
  • according to the first correlation coefficients between any two feature columns in the original data set, cluster the feature columns of the original data set to obtain P clusters; according to the second correlation coefficients between the feature columns and the target column, determine in each of the P clusters the feature column most correlated with the target column, obtaining P feature columns; and select the K data columns from the original data set according to the P feature columns and the target column.
  • the clustering algorithm corresponding to the clustering calculation in this embodiment includes any one of the following:
  • K-Means clustering algorithm hierarchical clustering algorithm, density clustering algorithm.
  • Different clustering algorithms give different values of P; for the k-means clustering algorithm, P = K-1. The input of the k-means clustering algorithm is the pairwise distance between feature columns, where the distance between feature columns F1 and F2 can be defined as Dist(F1, F2) = 1 - Corr(F1, F2), so that the more correlated two feature columns are, the smaller their distance.
  • Given the pairwise distances, the feature columns are clustered as follows: a) randomly select P = K-1 feature columns from the original data set as cluster-center vectors; b) for every feature column F, compare the distance from F to each of the K-1 cluster-center vectors and assign F to the nearest one, so that every feature column belongs to the cluster of one of the K-1 center vectors; c) for each cluster, compute the mean vector of the vectors of all feature columns in the cluster, and take the feature column closest to that mean vector as the new cluster-center vector, yielding K-1 new cluster-center vectors; d) compare the distance between the old and new cluster-center vectors; if it is smaller than a preset threshold, the clustering ends, otherwise steps a) to d) are repeated.
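  • A hedged Python sketch of this column clustering (a k-medoid-style rendering of steps a) to d); the function and variable names are illustrative and the data is random stand-in data):

        import numpy as np

        def cluster_feature_columns(X, p, iters=20, seed=0):
            """Group the columns of X into p clusters using
            Dist(F1, F2) = 1 - Corr(F1, F2), re-centering each cluster on the
            column nearest to its mean vector."""
            rng = np.random.default_rng(seed)
            norm = X / np.linalg.norm(X, axis=0)        # unit-norm columns
            centers = rng.choice(X.shape[1], size=p, replace=False)
            for _ in range(iters):
                dist = 1.0 - norm.T @ norm[:, centers]  # columns x centers
                assign = dist.argmin(axis=1)
                new_centers = centers.copy()
                for c in range(p):
                    members = np.where(assign == c)[0]
                    if len(members) == 0:
                        continue
                    mean_vec = norm[:, members].mean(axis=1)
                    # member column closest to the cluster's mean vector
                    new_centers[c] = members[np.argmax(mean_vec @ norm[:, members])]
                if np.array_equal(new_centers, centers):
                    break
                centers = new_centers
            return assign

        # Usage with stand-in data: 20 samples, 10 feature columns, 4 clusters.
        X = np.random.default_rng(1).random((20, 10))
        print(cluster_feature_columns(X, p=4))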
  • the result of feature column clustering can ensure that the correlation between feature columns is high in the same cluster, and the correlation of feature columns between different clusters is low.
  • Next, one feature column is selected directly from each cluster, namely the feature column that has the highest second correlation coefficient with the target column within that cluster.
  • For example, when the number of clusters is 4 and there are 10 feature columns, the relationship between the feature columns in each cluster and their second correlation coefficients with the target column is as shown in Table 10.
  • one feature column having the highest correlation coefficient with the target column is selected from each cluster, for example, the second correlation coefficient between the feature column 4 and the target column in cluster 3 is 0.9 is the largest. Then, the feature column 4 is selected from the cluster 3, and finally, the feature columns 1, 5, 4, and 8 are selected.
  • feature columns 1, 5, 4, 8 and the target column are the K data columns selected from the original data set.
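  • A small Python sketch of this per-cluster selection (the 0.9 figure for feature column 4 follows the text above; the other correlations are hypothetical fillers, since Table 10 itself is only reproduced as an image):

        # (cluster id, feature column id, second correlation coefficient with the target)
        rows = [(1, 1, 0.5), (1, 2, 0.3), (2, 5, 0.7), (2, 6, 0.2), (2, 7, 0.1),
                (3, 3, 0.4), (3, 4, 0.9), (4, 8, 0.6), (4, 9, 0.35), (4, 10, 0.15)]

        best = {}
        for cluster, column, corr in rows:
            if cluster not in best or corr > best[cluster][1]:
                best[cluster] = (column, corr)

        selected = sorted(col for col, _ in best.values())
        print(selected)  # [1, 4, 5, 8]; with the target column, the K selected data columns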
  • K data columns are selected from the original data set by the clustering algorithm, which not only meets the maximum data amount, but also satisfies the reliability of the data column.
  • In the second possible implementation, for the hierarchical clustering algorithm and the density clustering algorithm P is not equal to K-1. When P is greater than K-1, the feature column most correlated with the target column is determined in each cluster according to the second correlation coefficients between the feature columns and the target column in the original data set, giving P feature columns; then, in descending order of correlation, the K-1 feature columns with the largest second correlation coefficients with the target column are selected from these P feature columns, and the K-1 feature columns together with the target column are the K data columns selected from the original data set. If P is not greater than K-1, the P feature columns are the selected feature columns.
  • For example, with K = 4 in this embodiment, K-1 = 3 feature columns need to be selected; P = 4, i.e. the clustering calculation of this implementation yields 4 clusters.
  • The numbers of feature columns in the clusters and their second correlation coefficients with the target column can be as shown in Table 9. From each of the 4 clusters, the feature column with the highest second correlation coefficient with the target column is selected; for example, the second correlation coefficient 0.9 between feature column 4 and the target column is the largest in cluster 3, so feature column 4 is selected from cluster 3. The selection result of this implementation can be as shown in Table 11: sorting the selected columns by their second correlation coefficient with the target column in descending order (0.9, 0.7, 0.6, 0.5, i.e. feature columns 4, 5, 8, 1) and keeping K-1 = 3 of them selects feature columns 4, 5 and 8, which together with the target column are the K data columns selected from the original data set.
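  • A short Python sketch of this second selection rule (the per-cluster candidates and their correlations follow Table 11):

        K = 4
        # Best feature column of each cluster and its correlation with the target.
        candidates = {1: 0.5, 5: 0.7, 4: 0.9, 8: 0.6}

        if len(candidates) > K - 1:
            selected = sorted(candidates, key=candidates.get, reverse=True)[:K - 1]
        else:
            selected = list(candidates)
        print(selected)  # [4, 5, 8]; together with the target column, K data columns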
  • K data columns are selected from the original data set by the clustering algorithm, which not only meets the maximum data amount, but also satisfies the reliability of the data.
  • FIG. 5 is a structural diagram of a device according to Embodiment 1 of a data mining device according to the present invention.
  • the embodiment of the invention further provides a data mining device 50, the device comprising: a transceiver 501, a processor 503 and a memory 502, wherein:
  • the transceiver 501 is configured to receive the original data set, and send the extracted input data of the preparation process to each node for processing.
  • the memory 502 is configured to store an original data set.
  • the processor 503 is configured to determine a plurality of execution steps of the data mining process; obtain the correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process; determine the nodes that execute the execution steps, where the nodes provide physical resources for the execution steps; determine, according to the correspondence and the physical resources owned by the node executing a given step, the maximum amount of the input data that the node executing each step can process; determine, from those per-step maxima, the maximum amount of input data the distributed system can process; and process the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process.
  • Preferably, the processor 503 is further configured to: acquire the plurality of execution steps of the data mining process and analyze the process data generated in them; and when the number of execution steps taking a given piece of process data as input data is one and the input data of that execution step contains no process data other than this process data, merge the execution step that generates the process data and the execution step that takes it as input into one optimized execution step.
  • Preferably, when the correspondence is expressed as a ratio parameter, the processor is further configured to determine, for each of the plurality of execution steps, the ratio between the physical resources jointly occupied by the step's input data and output data while it runs and the physical resources occupied by the input data of the data mining process.
  • Preferably, when the execution step is an optimized execution step and the physical resources occupied by the temporary data generated inside it are greater than those of its input and/or output data, the processor 503 is further configured to: take, out of the three ratios of the physical resources occupied by the temporary data, by the input data of the optimized step and by the output data of the optimized step, each to the physical resources occupied by the input data of the data mining process, the two larger ratios, and sum them to obtain the ratio parameter between the physical resources the optimized execution step requires while running and the physical resources occupied by the input data of the data mining process.
  • Preferably, the processor 503 is further configured to: screen the one or more maximum amounts of input data that the individual nodes can allow, and take the smallest of the one or more maximum amounts as the maximum amount of input data that running the data mining process in the distributed system can process.
  • the processor 503 is further configured to: determine, according to the maximum amount of data, a maximum number K of data columns selected from data to be mined, the K being an integer; selecting from the data to be mined K data columns, the K data columns including K-1 feature columns and one target column.
  • Preferably, the processor 503 is further configured to perform a clustering calculation on the feature columns of the data to be mined to obtain P clusters, and to screen K data columns out of the P clusters according to the correlation between the feature columns and the target column.
  • Specifically, the data mining apparatus 50 can be used to implement the methods of Embodiments 1 to 3 above, and the preferred features of this embodiment correspond one-to-one to the specific implementations described in the respective method embodiments, so they are not repeated here.
  • A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments, and the storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.

Abstract

Embodiments of the present invention provide a data mining method and apparatus. The method is applied to a distributed system that includes at least one node, and includes: determining a plurality of execution steps of a data mining process; obtaining a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process; determining the nodes that execute the execution steps, where the nodes provide physical resources for the execution steps; determining, from the maximum amounts of input data that the nodes executing the individual steps can process, the maximum amount of input data the distributed system can process; and processing the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process. The input data is thereby accurately and effectively bounded, ensuring normal operation of the system.

Description

Data mining method and apparatus
This application claims priority to Chinese Patent Application No. 201410239140.4, filed with the Chinese Patent Office on May 30, 2014 and entitled "Data mining method and apparatus", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
Embodiments of the present invention relate to data processing technologies, and in particular to a data mining method and apparatus.
BACKGROUND
Data mining (DM) refers to the non-trivial process of revealing hidden, previously unknown and potentially valuable information from the large amounts of data in a database. It draws mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases and visualization technology to analyze enterprise data in a highly automated way, make inductive inferences and uncover latent patterns, helping decision makers adjust market strategies, reduce risk and make sound decisions.
However, with the arrival of the big-data era the sources of the objects of data mining have become ever broader, so the number of samples and/or the number of feature columns in a data set can be extremely large. In the prior art, after feature column selection in step (2), if the number of selected feature columns is too large, step (3) runs into problems such as insufficient memory and other resource shortages, causing the data mining process to fail.
SUMMARY
Embodiments of the present invention provide a data mining method and apparatus, to overcome the failure of the data mining process caused by insufficient physical resources during data mining.
In one aspect, an embodiment of the present invention provides a data mining method. The method is applied to a distributed system that includes at least one node, and the method includes:
determining a plurality of execution steps of a data mining process; obtaining a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process; determining nodes that execute the execution steps, the nodes being nodes that provide physical resources for the execution steps; determining, according to the correspondence and the physical resources owned by the node executing a given step, the maximum amount of the input data that the node executing each step can process; determining, from the maximum amounts of input data the nodes executing the individual steps can process, the maximum amount of input data the distributed system can process; and processing the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process.
In another aspect, an embodiment of the present invention provides a data mining apparatus, the apparatus including a transceiver, a processor and a memory;
the transceiver is configured to receive an original data set and to send the extracted input data to be processed to the nodes for processing; the memory is configured to store the original data set; and the processor is configured to determine a plurality of execution steps of a data mining process, obtain a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process, determine nodes that execute the execution steps, the nodes being nodes that provide physical resources for the execution steps, determine, according to the correspondence and the physical resources owned by the node executing a given step, the maximum amount of the input data that the node executing each step can process, determine, from the maximum amounts of input data the nodes executing the individual steps can process, the maximum amount of input data the distributed system can process, and process the data to be mined according to the data mining process within the maximum amount of input data the distributed system can process.
By jointly evaluating the characteristics of the data mining process and the relationship between the data mining process and the physical resources owned by the network nodes of the distributed network system, the embodiments of the present invention derive the maximum amount of data that running the data mining process in the distributed network system can support, so the input data is accurately and effectively bounded and normal operation of the system is guaranteed.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below.
FIG. 1 is a flowchart of Embodiment 1 of a data mining method according to the present invention;
FIG. 2 is a schematic diagram of a data mining process used as an example in the present invention;
FIG. 3 is a flowchart of Embodiment 2 of a data mining method according to the present invention;
FIG. 4 is a flowchart of Embodiment 3 of a data mining method according to the present invention;
FIG. 5 is a structural diagram of Embodiment 1 of a data mining apparatus according to the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a flowchart of Embodiment 1 of a data mining method according to the present invention. The execution body of this embodiment may be a general-purpose data mining device, which may be implemented by general-purpose software and/or hardware. The data mining method of this embodiment is applied to a distributed architecture that includes at least one node; a node may be an ordinary PC, a virtual machine on a server in a cloud architecture, or any other computing resource that can be used in the distributed architecture. As shown in FIG. 1, the method of this embodiment may include:
Step 101: determine a plurality of execution steps of a data mining process.
The plurality of execution steps may be determined by the data mining device by parsing the data mining process, or retrieved by the data mining device from a storage device that stores the execution steps of the data mining process.
When parsing the data mining process, the division into steps may follow the different algorithmic principles used in different stages of the process; or it may be based on the staged processing results obtained during the process; or it may follow the logical steps of the data mining process, which are usually defined when the process is designed and are strongly tied to the processing stages. These parsing approaches are a simple enumeration of what the present invention can cover and do not limit its scope.
Step 102: obtain a correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process.
The correspondence is preferably a ratio parameter between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process.
Step 103: determine the nodes that execute the execution steps, the nodes providing physical resources for the execution steps.
The relationships between nodes and the execution steps they provide physical resources for include: one node providing physical resources for several execution steps; several nodes jointly providing physical resources for one execution step; several nodes providing physical resources for several execution steps; and so on.
Preferably, in this step the data mining device obtains in advance the situation of all nodes or of the available nodes in the distributed system, for example which nodes are idle, which nodes can be used in combination, and even the history of execution steps previously run on the nodes. Normally the operation of each node is managed by a management device in the distributed system, and the data mining device can obtain the distribution and capability attributes of the nodes directly from that management device.
Step 104: determine, according to the correspondence and the physical resources owned by the node executing a given execution step, the maximum amount of the input data that the node executing each step can process.
The correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process was obtained in step 102, so after the nodes providing physical resources for the execution steps are determined in step 103, the maximum amount of data that each single execution step allows the data mining process to take as input is calculated from the physical resources owned by the corresponding node(s).
Step 105: determine, from the maximum amounts of input data that the nodes executing the individual steps can process, the maximum amount of input data the distributed system can process.
Step 104 yields, for the individual execution steps, several maximum amounts of data that each step allows the data mining process to take as input; the maximum amount of data the distributed system can accept is the minimum of these maxima. The principle is similar to the "shortest board" principle: the distributed system can run normally only if the amount of input data is smaller than the minimum of the maximum amounts of data the individual execution steps can process.
Step 106: process the data to be mined according to the data mining process, within the maximum amount of input data the distributed system can process.
By jointly evaluating the characteristics of the data mining process (including the execution steps it contains and the relationship between each execution step and the node running it) and the physical resources owned by the network nodes of the distributed network system, this embodiment of the present invention derives the maximum amount of data that running the data mining process in the distributed network system can support, so the input data is accurately and effectively bounded and normal operation of the system is guaranteed.
Those skilled in the art will understand that the given data mining process can be any well-known data mining process; what the present invention does is analyze the process and combine it with the physical resources owned by each node of the distributed system, so as to bound and optimize the input data accordingly. The data mining process itself is not specially restricted in this embodiment.
The explanation of step 101 disclosed methods for determining the plurality of execution steps of a data mining process. The determination is described in detail below with reference to a concrete data mining process.
Referring to FIG. 2, FIG. 2 is a schematic diagram of a data mining process used as an example in the present invention. The data mining process of this embodiment is only illustrative; for other data mining processes the execution steps can likewise be obtained by applying the method of the present invention on the basis of this disclosure. As shown in FIG. 2, the data mining process includes the following execution steps:
Step ①: feature column selection. This step selects feature columns from the input data; the subsequent process runs only on the selected feature columns, and the remaining feature columns no longer take part in the analysis of later steps. Those skilled in the art will understand that among the feature columns there is one target column, which is required to be the data column most relevant to the problem the data mining is meant to solve.
The feature selection here is an operational step of the exemplified data mining process, whose purpose is more efficient execution, whereas the feature column extraction involved in the method of the present invention is motivated by the limit on the maximum amount of data the distributed system allows as input; the two have different purposes and meanings. In an optional scheme, when the execution steps of the data mining process include feature column selection, that selection may be merged into the feature column selection of the data mining method provided by the present invention, for example by merging step ① into step 404 and handling them as one step.
Table 1 is an example of input data for the data mining process.
Table 1
User ID  Age  Time on network  Online sessions  SMS sent  Call duration  ……  Off-network
1 35 10 25 10 300 …… 0
2 26 1 40 25 80 …… 1
3 41 15 3 2 180 …… 0
……              
When the problem to be solved is identifying users inclined to leave the network, the target column is preferably the "off-network" data column.
Step ②: normalize the data in the selected feature columns. This step normalizes the feature values in each feature column to the range 0-1; for example, the original age range is 0-100, so dividing every age value by 100 gives the normalized data of that feature column.
Step ③: fill missing values in the selected feature columns with a median. If the value of a sample in a feature column is empty in the input data, the empty position is filled with the median so as not to affect the subsequent process; for example, if the age of a user sample is empty, it is filled with 50, the median of 0 and 100.
Step ④: data partitioning. Of the data processed by steps ①②③, one half is used as the input data of step ⑤ and the other half as the input data of step ⑥.
Step ⑤: k-Nearest Neighbor (KNN) model learning. KNN model learning is performed with the half of the data rows partitioned in step ④ as input. Executing step ⑤ outputs a KNN model, which is also the main output of the entire data mining process.
Step ⑥: KNN model evaluation. This step takes the KNN model output by step ⑤ as input and performs KNN model evaluation on the data obtained from the partition of step ④. Step ⑥ yields parameters of the KNN model such as accuracy and recall.
Preferably, the data mining process shown in FIG. 2 is parsed according to the above execution steps, yielding the multiple execution steps for executing the data mining process, as shown in Table 2.
Table 2
[Table 2 (shown as an image in the original): the first set of execution steps obtained by parsing the flow of FIG. 2, with the input and output data of each step]
As can be seen from Table 2, parsing the data mining process is, in this embodiment, the specific way of determining the plurality of execution steps for executing the data mining process. The six execution steps are obtained with a relatively simple parsing approach, and in the subsequent embodiments these six execution steps are called the first set of execution steps. In practice the first set of execution steps may be obtained in other ways, for example by directly recording in the distributed system the correspondence between the data mining process and its first set of execution steps.
Besides directly using the first set of execution steps for the subsequent steps 102-106, the embodiments of the present invention also provide an optimization method for the first set of execution steps.
The process data generated in the plurality of execution steps (the first set of execution steps) is analyzed. Specifically, when it is determined that the number of execution steps taking a given piece of process data as input data is one, and that the input data of that execution step contains no process data other than this process data, the execution step that generates the process data and the execution step that takes it as input are merged into one optimized execution step. The process data corresponds to the input data or output data of the individual steps in Table 2. The condition for judging that two execution steps can be merged also applies to merging two or more execution steps in a serial relationship. For ease of description, the embodiments of the present invention call the process data between at least two execution steps that are merged "temporary data"; for example, after steps ①②③ are merged, the process data between step ① and step ② can be called temporary data, and so can the process data between step ② and step ③.
The set of execution steps consisting of the original execution steps of the first set and the optimized execution steps is also called the second set of execution steps, as shown in Table 3 below.
Table 3
[Table 3 (shown as an image in the original): the optimized second set of execution steps T1-T4, with the input and output data of each step]
By analyzing the association between process data and execution steps in the data mining process, the inventors merge the execution step that generates the temporary data with the execution step that takes the temporary data as input, so that the space occupied by the temporary data is not counted into the space occupied by the execution step, which improves node utilization. After this optimization of the execution steps, the physical resources of the nodes can be used more effectively and larger input data can be processed.
With reference to the execution steps of Table 2, apart from KNN model learning and KNN model evaluation, the process data generated by the other execution steps satisfies the above requirement for merging execution steps. The optimization principle of the execution steps is analyzed as follows:
Execution steps ①②③ in Table 2 all operate on single-row samples, and the rows are mutually independent, so for each row of data the operations of ①②③ can be performed in sequence, outputting only one piece of process data instead of each execution step outputting its own temporary data. Because the output data of execution step ④ is not consumed entirely by execution step ⑤ or execution step ⑥, the process data generated by step ④ cannot be deleted as soon as it has been used as input of step ⑤ or step ⑥, but only after both step ⑤ and step ⑥ have finished using it. Therefore execution step ④ cannot be merged with execution step ⑤ or execution step ⑥ into one execution step.
From the above analysis, the basis for merging execution steps can be summarized from another angle, specifically:
obtain the plurality of execution steps of the data mining process and analyze the process data they generate; when two or more consecutive execution steps each process a single sample at a time, merge those two or more execution steps.
For an original data set, the amount of data is jointly determined by the number of samples and the number of feature columns each sample contains. Therefore, when the merging of execution steps is judged from the characteristics of the process data between steps (i.e. the number of execution steps taking the process data as input is one, and the input data of that step contains no other process data), it is preferable to keep the input data within the determined maximum amount of data by extracting feature columns; when the merging is judged from the fact that the unit processed by a step is a single sample, it is preferable to keep the input data within the determined maximum amount of data by controlling the total number of samples.
Those skilled in the art will understand that when the execution steps of a data mining process generate no temporary data, there is no need to merge several execution steps into one; if the execution steps in Table 2 generated no temporary data, the plurality of execution steps used for processing in step 102 would be those of Table 2.
In a specific implementation, regardless of how the plurality of execution steps was obtained, what enters step 102 is either the first set or the second set of execution steps, and the implementation principle of the subsequent steps 102 to 104 is the same in both cases. In this embodiment, for ease of explanation, the optimized second set of execution steps shown in Table 3 is taken as the example and described in detail.
A node of the distributed architecture in this embodiment may specifically be a computer, a server, a virtual machine, and so on. The physical resources in this embodiment may specifically be processor cores, hard disk, memory, and so on, as shown in Table 4. Table 4 shows a distributed system with two nodes; in a specific implementation the number of nodes in the distributed system of this embodiment can be chosen according to the specific environment and is not specifically limited here.
Table 4
  Processor cores  Hard disk  Memory
Node 1  8 cores  500 GB  4 GB
Node 2  12 cores  1 TB  8 GB
In step 102, the correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process is obtained. Specifically, for each of the plurality of execution steps of the data mining process, the ratio between the physical resources jointly occupied by the input data and the output data of that execution step while it runs and the physical resources occupied by the input data of the data mining process is determined. The ratio is one concrete expression of the correspondence.
In a specific implementation, memory is the resource most likely to become the bottleneck for the number of feature columns that can be selected, so the following analysis uses memory as the example; based on experience, the estimation of other cluster resources such as CPU and hard disk and their influence on the number of selected feature columns can be considered further on top of the memory example or separately in a similar way, and is not expanded upon here.
The above ratio parameter may be preset from empirical values, or may be computed on the fly by running the execution step. To describe the subsequent process more clearly, this embodiment lists in Table 5 an example of the proportional relationship, in each execution step, between the memory Ti occupied by the input data of the data mining process and the memory To occupied by the output data; the specific meanings of T1 to T4 are shown in Table 3.
Table 5
  To/Ti
T1 0.9
T2 0.45
T3 0.45
T4 0
The output data represented by T4 is the model evaluation result, which generally consists of a few indicators such as accuracy and recall, so the memory it occupies is negligible and is not mentioned below. Note that Table 5 gives the Ti-to-To resource proportions of the optimized second set of execution steps corresponding to Table 3.
Those skilled in the art will understand that, in a normal data mining process, the input data to be processed is set to M. From this, the ratio between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process is obtained, as shown in Table 6.
Table 6
[Table 6 (shown as an image in the original): for each execution step, the ratio of the physical resources it requires while running to the physical resources M occupied by the input data of the data mining process]
The physical resources required by an execution step while running include the physical resources occupied by the step's input data and those occupied by the step's output data, where (input data of the step + output data of the step)/M is the ratio between the physical resources the step requires while running and the physical resources occupied by the input data of the data mining process.
When the execution step is an optimized execution step, and the physical resources occupied by the process data generated inside the optimized step are greater than those of the optimized step's input data and/or output data, obtaining the correspondence between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process specifically includes:
taking, out of the three ratios of the physical resources occupied by the process data, by the input data of the optimized step and by the output data of the optimized step, each to the physical resources occupied by the input data of the data mining process, the two larger ratios, and summing them to obtain the ratio parameter between the physical resources the optimized execution step requires while running and the physical resources occupied by the input data of the data mining process. For example, if the temporary data generated between step ① and step ② is 1.2M in size, the ratio of the physical resources required to run execution step 1 to the physical resources occupied by the input data of the data mining process is (1.2M + M) : M = 2.2 : 1.
In step 103, the nodes that execute the execution steps are determined; these nodes provide the physical resources for the execution steps.
Because the embodiments of the present invention apply to a distributed architecture, the physical resources needed by execution steps 1 to 4 of the second set in Table 3 can be stored and processed in a distributed manner on the nodes of the architecture, so the physical resources owned by the nodes provided for an execution step determine the maximum amount of data that each execution step can process.
In step 104, the maximum amount of the input data that the node executing each step can process is determined according to the correspondence and the physical resources owned by the node executing the corresponding execution step.
For example, the distributed architecture includes two nodes, whose physical resources are denoted M1 and M2 respectively.
For the data of Table 5: when the physical resources needed by execution step 1 can be shared between the two nodes, M + 0.9M < M1 + M2 must hold; when the KNN model in execution step 4 has to be stored on every node, i.e. the physical resources needed by execution step 4 cannot be shared between the two nodes, then 0.45M + 2×0.45M < M1 + M2, 0.45M < M1 and 0.45M < M2 must hold simultaneously; execution steps 2 and 3 are similar, giving the corresponding inequalities 0.9M + 0.45M + 0.45M < M1 + M2 and 0.45M + 0.45M < M1 + M2.
In step 105, the maximum amount of input data the distributed system can process is determined from the maximum amounts of input data the nodes executing the individual steps can process.
From the constraints that simultaneously satisfy the inequalities of the individual execution steps, M < A can be solved, where A is the maximum amount of input data the data mining process can process.
In step 106, the data to be mined is processed according to the data mining process on the basis of the maximum amount of data. Specifically, the input data to be processed that is extracted from the original data set is less than or equal to the maximum amount of data. Preferably, several sizes of input data to be processed, such as a saturation level, a normal level and an optimum level, can be set at, for example, 80%, 60% and 50% of the maximum amount of data, so that the user can choose suitable input data to be processed according to the level.
In the specific application environment of this embodiment shown in FIG. 2, specifically, the input data to be processed is used as the input of the data mining process, the data mining process is executed to obtain a data mining model, and the effect of the data mining model is verified and evaluated.
In the data mining method provided by this embodiment of the present invention, the maximum amount of input data that can be processed is determined from the ratio between the physical resources required by each execution step of the data mining process while running and the physical resources occupied by the input data of the data mining process, together with the physical resources owned by each node of the distributed system that provides physical resources for the data mining process; the input data to be processed is then extracted from the original data set according to that maximum amount. For data mining over big data, this makes it possible to determine, under the constraint of limited physical resources, the maximum amount of data the data mining process can handle, ensuring that the data mining task is completed effectively.
FIG. 3 is a flowchart of Embodiment 2 of a data mining method according to the present invention. This embodiment elaborates steps 103-104 on the basis of Embodiment 1 of FIG. 1. In the specific implementation, the data mining process of FIG. 2 and the optimized second set of execution steps of Table 3 continue to be used as the example, and the physical resources an execution step requires while running are provided jointly by multiple nodes of the distributed architecture.
Step 301: determine the nodes that provide physical resources for each execution step.
Step 302: traverse the execution steps in turn to obtain the maximum amount of input data that the nodes executing each execution step can process.
The one or more nodes providing physical resources for the execution step being traversed are determined, and from the physical resources those nodes own and the ratio parameter of the execution step, the maximum amount of data the execution step can take as input on those nodes is calculated.
Step 303: after all execution steps have been traversed, compute the maximum input amounts obtained for the individual execution steps during the traversal, and take the minimum of these maxima as the maximum amount of input data the distributed system can process.
Step 302 is explained taking execution step 1 as an example. Specifically, the ratio between the physical resources the step requires while running and the physical resources occupied by the input data to be processed is (M + 0.9M)/M = 1.9, and the physical resources of nodes M1 and M2 are 4 GB and 8 GB respectively, so the maximum amount of input data that execution step 1 can process is (4 + 8)/1.9 = 6.32 GB. The implementation for execution steps 2 and 3 is similar. Table 7 below, combined with the physical resources owned by each node of the distributed system, is used to determine the maximum amount of input data that running the data mining process in the distributed system can handle.
Table 7
[Table 7 (shown as an image in the original): the inequality constraint of each execution step on M, given node resources M1 and M2]
For execution step 1: M + 0.9M < M1 + M2; for execution step 2: 0.9M + 0.45M + 0.45M < M1 + M2; for execution step 3: 0.45M + 0.45M < M1 + M2; for execution step 4: 0.45M + 2×0.45M < M1 + M2, and at the same time the physical resources needed by the step's input data must be smaller than the physical resources of each node of the distributed architecture, i.e. 0.45M < M1 and 0.45M < M2, where 0.45M corresponds to the input data of T4. The range of M is determined from the expressions in Table 7; assuming M1 = 4 GB and M2 = 8 GB and substituting into the expressions of Table 7, the smallest of the resulting bounds is M < 6.31 GB, i.e. the maximum amount of input data that can be processed is 6.31 GB.
For data mining over big data, the data mining method provided by this embodiment can determine, under the constraint of limited physical resources, the maximum amount of data the data mining process can handle, ensuring that the task is completed effectively.
FIG. 4 is a flowchart of Embodiment 3 of a data mining method according to the present invention. This embodiment is implemented on the basis of the embodiments of FIG. 1 and FIG. 3 and specifically includes the following steps:
Step 401: parse the data mining process to obtain a plurality of execution steps for executing the data mining process.
Step 402: obtain the ratio between the physical resources required by each execution step while running and the physical resources occupied by the input data of the data mining process.
Step 403: determine, from the ratio and the physical resources owned by each node of the distributed system, the maximum amount of input data that running the data mining process in the distributed system can handle.
Step 404: determine, from the maximum amount of data, the maximum number K of data columns to select from the original data set, K being an integer.
Step 405: select K data columns from the original data set, the K data columns including K-1 feature columns and one target column.
Step 406: extract, according to the K data columns, the input data to be processed from the original data set.
Step 407: use the input data to be processed as the input of the data mining process and perform the data mining processing.
Steps 401-403 of this embodiment are similar to steps 101-105 and are not repeated here. The specific implementation of step 403 of this embodiment may follow the embodiment of FIG. 3 and is likewise not repeated.
For step 404, the maximum number K of data columns to select from the original data set is determined from the maximum amount of data. Specifically, K is determined from the maximum amount of data, the number of rows of the original data set and the physical resources occupied by one data column. For example, following the result of Embodiment 2, M < 6.31 GB is known; the original data set has ten million rows, and assuming each data column value occupies 8 bytes of memory, the maximum number of columns is K = 6.31×10^9/(10^7 × 8) = 78, i.e. at most 78 data columns can be selected from the original data set to keep the computation on each node within the maximum available memory.
In step 405, K data columns are selected from the original data set, the K data columns including K-1 feature columns and one target column.
In this embodiment step 405 has two possible implementations; both of them also include:
obtaining a first correlation coefficient between any two feature columns in the original data set, where the first correlation coefficient is greater than or equal to 0 and less than or equal to 1, and the correlation between any two feature columns is proportional to the value of the first correlation coefficient;
obtaining a second correlation coefficient between any feature column and the target column in the original data set, where the second correlation coefficient is greater than or equal to 0 and less than or equal to 1, and the correlation between a feature column and the target column is proportional to the value of the second correlation coefficient.
Specifically, for ease of explanation, a small data set is used here as an example to show how the first and second correlation coefficients are computed; the small data set is shown in Table 8.
Table 8
Age  Time on network  Online sessions  Off-network
35 10 25 0
26 1 40 1
41 15 3 0
The "off-network" column is the target column (1 means the user has left the network, 0 means not), and the remaining columns are feature columns.
Each feature column can be viewed as a vector whose dimension is the number of samples of the original data set and whose entries are the values of that feature column for each sample. The target column can likewise be viewed as a vector whose entries are the values of the target column for each sample.
The correlation coefficient of the vectors a = <a1, a2, ..., an> and b = <b1, b2, ..., bn> can be computed as their cosine similarity, with the following formula:
Corr(a, b) = (a1·b1 + a2·b2 + ... + an·bn) / (sqrt(a1² + a2² + ... + an²) × sqrt(b1² + b2² + ... + bn²))    (1)
The correlation coefficient ranges over [0, 1]; the closer it is to 1, the higher the correlation of the vectors, and the closer to 0, the lower.
By formula (1), the correlation coefficient between age and time on the network is:
(35×10 + 26×1 + 41×15) / (sqrt(35² + 26² + 41²) × sqrt(10² + 1² + 15²)) = 991 / (sqrt(3582) × sqrt(326)) ≈ 0.92
The correlation coefficient between each feature column and the target column ("off-network") can be computed in the same way.
To improve performance, matrix multiplication can be used to compute the pairwise correlation coefficients between feature columns and between the feature columns and the target column.
For example, the data of Table 8 can be represented as the matrix
A =
  35  10  25  0
  26   1  40  1
  41  15   3  0
Matrix multiplication AT·A is then computed, where AT is the transpose of A, giving
AT·A =
  3582   991  2038   26
   991   326   335    1
  2038   335  2234   40
    26     1    40    1
The off-diagonal elements of the matrix are the pairwise dot products of the columns, i.e. the numerator part of formula (1), a1·b1 + a2·b2 + ... + an·bn, while the diagonal elements are the squared norms of the columns, whose square roots give the denominator part of formula (1), sqrt(a1² + ... + an²) × sqrt(b1² + ... + bn²).
From AT·A, the correlation coefficients between feature columns, or between a feature column and the target column, can be obtained directly. For example, the first correlation coefficient between the first and second feature columns is 991 / (sqrt(3582) × sqrt(326)) ≈ 0.92, and the second correlation coefficient between the second feature column and the target column is 1 / (sqrt(326) × sqrt(1)) ≈ 0.06.
The correlation coefficient matrix obtained in this way is symmetric and its diagonal elements are 1, so only the upper-triangular (or lower-triangular) part needs to be computed; the resulting first and second correlation coefficients can be as shown in Table 9.
Table 9
[Table 9 (shown as an image in the original): the upper-triangular matrix of first correlation coefficients between feature columns and second correlation coefficients between each feature column and the target column, computed from AT·A]
After the first and second correlation coefficients are obtained, K data columns are selected from the original data set. The specific implementations are as follows:
The first possible implementation:
According to the first correlation coefficients between any two feature columns in the original data set, cluster the feature columns of the original data set to obtain P clusters; according to the second correlation coefficients between the feature columns and the target column, determine in each of the P clusters the feature column most correlated with the target column, obtaining P feature columns; and select K data columns from the original data set according to the P feature columns and the target column.
In a specific implementation, the clustering algorithm used for the clustering calculation in this embodiment may be any one of the following:
the K-Means clustering algorithm, a hierarchical clustering algorithm, or a density clustering algorithm.
Different clustering algorithms give different values of P. For the k-means clustering algorithm, P = k-1. The input of the k-means clustering algorithm is the pairwise distances between feature columns, where the distance between feature columns F1 and F2 can be defined as Dist(F1, F2) = 1 - Corr(F1, F2), so that the more correlated two feature columns are, the smaller their distance.
Given the pairwise distances between feature columns, the procedure for clustering the feature columns is as follows:
a) Randomly select P feature columns from the original data set as cluster-center vectors, where P = k-1.
b) For each feature column F, compare the distances from F to the K-1 cluster-center vectors and assign F to the nearest cluster-center vector, so that every feature column is assigned to the cluster of one of the K-1 cluster-center vectors.
c) For each cluster, compute the mean vector of the vectors of all feature columns in the cluster, then take the feature column closest to that mean vector as the new cluster-center vector, yielding K-1 new cluster-center vectors.
d) Compare the distance between the old and new cluster-center vectors; if it is smaller than a preset threshold, the clustering ends; otherwise repeat steps a) to d).
The result of clustering the feature columns guarantees that within a cluster the correlation between feature columns is high, while between different clusters the correlation of feature columns is low.
Next, from each cluster one feature column is selected, namely the one whose second correlation coefficient with the target column is highest within that cluster. For example, when the number of clusters is 4 and there are 10 feature columns, the numbers of feature columns in the clusters and the second correlation coefficients between the feature columns and the target column are as shown in Table 10.
Table 10
[Table 10 (shown as an image in the original): for each of the 4 clusters, its feature columns and their second correlation coefficients with the target column]
From each of the 4 clusters, the one feature column with the highest second correlation coefficient with the target column is selected; for example, the second correlation coefficient 0.9 between feature column 4 and the target column is the largest in cluster 3, so feature column 4 is selected from cluster 3. In the end, feature columns 1, 5, 4 and 8 are selected.
Therefore feature columns 1, 5, 4 and 8 together with the target column are the K data columns selected from the original data set. Those skilled in the art will understand that this embodiment uses the k-means clustering algorithm, but in a specific implementation other clustering algorithms are also possible as long as they can satisfy P = K-1; this embodiment places no particular restriction on the clustering algorithm.
By selecting K data columns from the original data set with a clustering algorithm, this embodiment not only complies with the maximum amount of data but also ensures the reliability of the data columns.
The second possible implementation:
For the hierarchical clustering algorithm and the density clustering algorithm, P is not equal to K-1. When P is greater than K-1, the feature column most correlated with the target column is determined in each cluster according to the second correlation coefficients between the feature columns and the target column in the original data set, giving P feature columns; then, according to the correlations between the P feature columns and the target column, and in descending order of correlation, the K-1 feature columns with the largest second correlation coefficients with the target column are selected, and these K-1 feature columns together with the target column are the K data columns selected from the original data set. If P is not greater than K-1, the P feature columns are the selected feature columns.
For example, with K = 4 in this embodiment, K-1 = 3 feature columns need to be selected. P = 4, i.e. the clustering calculation of this second implementation yields 4 clusters. The numbers of feature columns in the clusters and the second correlation coefficients between the feature columns and the target column can be as shown in Table 9.
From each of the 4 clusters, the one feature column with the highest second correlation coefficient with the target column is selected; for example, the second correlation coefficient 0.9 between feature column 4 and the target column is the largest in cluster 3, so feature column 4 is selected from cluster 3. The selection result of this implementation can be as shown in Table 11.
Table 11

Feature column ID   Second correlation coefficient
1                   0.5
5                   0.7
4                   0.9
8                   0.6
Sorted in descending order of the second correlation coefficient between the feature column and the target column, the coefficients are 0.9, 0.7, 0.6 and 0.5, corresponding to feature columns 4, 5, 8 and 1 respectively. Because K-1 = 3, feature columns 4, 5 and 8 are selected; in the end, the target column together with feature columns 4, 5 and 8 are the K data columns selected from the original data set.
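A minimal sketch of this second implementation, assuming the per-cluster winners are given as (feature column ID, second correlation coefficient) pairs (names are illustrative):

```python
def select_top_columns(per_cluster_best, k):
    """Keep at most K-1 feature columns, in descending order of their second
    correlation coefficient with the target column."""
    ranked = sorted(per_cluster_best, key=lambda item: item[1], reverse=True)
    return [col_id for col_id, _ in ranked[:max(k - 1, 0)]]

# Per-cluster winners of this example: feature columns 1, 5, 4 and 8
winners = [(1, 0.5), (5, 0.7), (4, 0.9), (8, 0.6)]
print(select_top_columns(winners, k=4))  # -> [4, 5, 8]
```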
In this embodiment, the K data columns are selected from the original data set by means of a clustering algorithm, which not only respects the maximum data volume but also preserves the reliability of the data.
FIG. 5 is a structural diagram of Embodiment 1 of the data mining apparatus of the present invention. An embodiment of the present invention further provides a data mining apparatus 50, which includes a transceiver 501, a processor 503 and a memory 502, where:
the transceiver 501 is configured to receive the original data set and to send the extracted input data to be processed to the nodes for processing;
the memory 502 is configured to store the original data set; and
the processor 503 is configured to determine multiple execution steps of a data mining flow; obtain the correspondence between the physical resources required by each execution step at run time and the physical resources occupied by the input data of the data mining flow; determine the nodes that execute the execution steps, where the nodes provide physical resources for the execution steps; determine, according to the correspondence and the physical resources owned by the nodes used to execute the corresponding execution steps, the maximum volume of the input data that the nodes executing the steps can process; determine, according to the maximum volume of input data that the nodes executing the steps can process, the maximum volume of input data that the distributed system can process; and process the data to be mined according to the data mining flow based on the maximum volume of input data that the distributed system can process.
Preferably, the processor 503 is further configured to: obtain the multiple execution steps for executing the data mining flow and analyze the process data generated in the multiple execution steps; and, when it is determined that exactly one execution step uses the process data as input data and the input data of that execution step contains no process data other than this process data, merge the execution step that generates the process data and the execution step that takes the process data as input into one optimized execution step.
Preferably, when the correspondence is embodied as a ratio parameter, the processor is further configured to: for each of the multiple execution steps, determine the ratio between the physical resources jointly occupied by the input data and output data of the execution step at run time and the physical resources occupied by the input data of the data mining flow.
Preferably, when an execution step is an optimized execution step and the physical resources occupied by the temporary data generated inside the optimized execution step are greater than those occupied by the input and/or output data of the optimized execution step, the processor 503 is further configured to: take the three ratios of the physical resources occupied by the temporary data, by the input data of the optimized execution step and by the output data of the optimized execution step, each to the physical resources occupied by the input data of the data mining flow, and sum the two larger of these three ratios to obtain the ratio parameter between the physical resources required by the optimized execution step at run time and the physical resources occupied by the input data of the data mining flow.
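A one-line sketch of this rule, with illustrative ratio values:

```python
def optimized_step_ratio(temp_ratio, input_ratio, output_ratio):
    """Sum of the two largest of the three footprint ratios."""
    return sum(sorted([temp_ratio, input_ratio, output_ratio])[-2:])

print(optimized_step_ratio(0.9, 0.45, 0.45))  # -> 1.35
```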
Preferably, the processor 503 is further configured to: screen the one or more maximum volumes of input data that each node can allow, and take the smallest of the one or more maximum volumes as the maximum volume of input data that can be processed when the data mining flow is run in the distributed system.
Preferably, the processor 503 is further configured to: determine, according to the maximum data volume, the maximum number K of data columns to be selected from the data to be mined, where K is an integer; and select K data columns from the data to be mined, where the K data columns include K-1 feature columns and one target column.
Preferably, the processor 503 is further configured to: perform clustering on the feature columns of the data to be mined to obtain P clusters; and select K data columns from the P clusters according to the correlation between the feature columns and the target column.
Specifically, the data mining apparatus 50 may be used to implement the methods of Embodiments 1 to 3 above; the preferred features in this embodiment correspond one-to-one to the specific implementations involved in the method embodiments and are therefore not described again here.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some or all of their technical features, without departing from the scope of the technical solutions of the embodiments of the present invention.

Claims (17)

  1. A data mining method, wherein the method is applied to a distributed system, the distributed system comprises at least one node, and the method comprises:
    determining multiple execution steps of a data mining flow;
    obtaining a correspondence between physical resources required by each execution step at run time and physical resources occupied by input data of the data mining flow;
    determining nodes that execute the execution steps, wherein the nodes provide physical resources for the execution steps;
    determining, according to the correspondence and the physical resources owned by the nodes used to execute the corresponding execution steps, a maximum volume of the input data that the nodes executing the steps can process;
    determining, according to the maximum volume of input data that the nodes executing the steps can process, a maximum volume of input data that the distributed system can process; and
    processing, according to the maximum volume of input data that the distributed system can process, data to be mined in accordance with the data mining flow.
  2. The method according to claim 1, wherein the determining multiple execution steps of a data mining flow specifically comprises:
    obtaining multiple execution steps for executing the data mining flow, and analyzing process data generated in the multiple execution steps;
    when it is determined that exactly one execution step uses the process data as input data and the input data of that execution step contains no process data other than the process data,
    merging the execution step that generates the process data and the execution step that uses the process data as input into one optimized execution step.
  3. The method according to claim 1 or 2, wherein the determining multiple execution steps of a data mining flow specifically comprises:
    obtaining multiple execution steps for executing the data mining flow, and analyzing process data generated in the multiple execution steps;
    when two or more consecutive execution steps are determined whose unit of processing each time is a single sample,
    merging the two or more execution steps.
  4. The method according to any one of claims 1 to 3, wherein, when the correspondence is embodied as a ratio parameter, the obtaining a correspondence between physical resources required by each execution step at run time and physical resources occupied by input data of the data mining flow specifically comprises:
    for each of the multiple execution steps, determining a ratio between the physical resources jointly occupied by the input data and output data of the execution step at run time and the physical resources occupied by the input data of the data mining flow.
  5. The method according to any one of claims 1 to 4, wherein, when an execution step is an optimized execution step and the physical resources occupied by process data generated inside the optimized execution step are greater than those occupied by the input data and/or output data of the optimized execution step, the obtaining a correspondence between physical resources required by each execution step at run time and physical resources occupied by input data of the data mining flow specifically comprises:
    determining the three ratios of the physical resources occupied by the process data, by the input data of the optimized execution step and by the output data of the optimized execution step, each to the physical resources occupied by the input data of the data mining flow, and summing the two larger of the three ratios to obtain the ratio parameter between the physical resources required by the optimized execution step at run time and the physical resources occupied by the input data of the data mining flow.
  6. The method according to any one of claims 1 to 5, wherein the determining, according to the maximum volume of input data that the nodes executing the steps can process, a maximum volume of input data that the distributed system can process further comprises:
    screening one or more maximum volumes of input data that each node can allow, and taking the smallest of the one or more maximum volumes as the maximum volume of input data that can be processed when the data mining flow is run in the distributed system.
  7. The method according to any one of claims 1 to 6, wherein the processing, according to the maximum data volume, data to be mined in accordance with the data mining flow specifically comprises:
    determining, according to the maximum data volume, a maximum number K of data columns to be selected from the data to be mined, wherein K is an integer; and
    selecting K data columns from the data to be mined, wherein the K data columns comprise K-1 feature columns and one target column.
  8. The method according to claim 7, wherein the selecting K data columns from the data to be mined comprises:
    performing clustering on the feature columns of the data to be mined to obtain P clusters, wherein P is an integer; and
    selecting the K data columns from the P clusters according to the correlation between the feature columns and the target column.
  9. The method according to claim 8, wherein the clustering algorithm corresponding to the clustering comprises any one of the following:
    a K-Means clustering algorithm, a hierarchical clustering algorithm, or a density-based clustering algorithm.
  10. The method according to any one of claims 1 to 9, wherein the physical resources comprise at least one of memory resources, hard disk resources, and processor core resources.
  11. A data mining apparatus, wherein the apparatus is applied to a distributed system, the distributed system comprises at least one node, and the apparatus comprises a transceiver, a processor and a memory, wherein:
    the transceiver is configured to receive an original data set and send extracted input data to be processed to the nodes for processing;
    the memory is configured to store the original data set; and
    the processor is configured to determine multiple execution steps of a data mining flow; obtain a correspondence between physical resources required by each execution step at run time and physical resources occupied by input data of the data mining flow; determine nodes that execute the execution steps, wherein the nodes provide physical resources for the execution steps; determine, according to the correspondence and the physical resources owned by the nodes used to execute the corresponding execution steps, a maximum volume of the input data that the nodes executing the steps can process; determine, according to the maximum volume of input data that the nodes executing the steps can process, a maximum volume of input data that the distributed system can process; and process, according to the maximum volume of input data that the distributed system can process, data to be mined in accordance with the data mining flow.
  12. The apparatus according to claim 11, wherein the processor is further configured to:
    obtain multiple execution steps for executing the data mining flow and analyze process data generated in the multiple execution steps; when it is determined that exactly one execution step uses the process data as input data and the input data of that execution step contains no process data other than the process data, merge the execution step that generates the process data and the execution step that uses the process data as input into one optimized execution step.
  13. The apparatus according to claim 11 or 12, wherein, when the correspondence is embodied as a ratio parameter, the processor is further configured to:
    for each of the multiple execution steps, determine a ratio between the physical resources jointly occupied by the input data and output data of the execution step at run time and the physical resources occupied by the input data of the data mining flow.
  14. The apparatus according to any one of claims 11 to 13, wherein, when an execution step is an optimized execution step and the physical resources occupied by temporary data generated inside the optimized execution step are greater than those occupied by the input and/or output data of the optimized execution step, the processor is further configured to:
    determine the three ratios of the physical resources occupied by the temporary data, by the input data of the optimized execution step and by the output data of the optimized execution step, each to the physical resources occupied by the input data of the data mining flow, and sum the two larger of the three ratios to obtain the ratio parameter between the physical resources required by the optimized execution step at run time and the physical resources occupied by the input data of the data mining flow.
  15. The apparatus according to any one of claims 11 to 14, wherein the processor is further configured to:
    screen one or more maximum volumes of input data that each node can allow, and take the smallest of the one or more maximum volumes as the maximum volume of input data that can be processed when the data mining flow is run in the distributed system.
  16. The apparatus according to any one of claims 11 to 15, wherein the processor is further configured to:
    determine, according to the maximum data volume, a maximum number K of data columns to be selected from the data to be mined, wherein K is an integer; and
    select K data columns from the data to be mined, wherein the K data columns comprise K-1 feature columns and one target column.
  17. The apparatus according to claim 16, wherein the processor is further configured to:
    perform clustering on the feature columns of the data to be mined to obtain P clusters, wherein P is an integer; and
    select the K data columns from the P clusters according to the correlation between the feature columns and the target column.
PCT/CN2014/087630 2014-05-30 2014-09-28 一种数据挖掘方法及装置 WO2015180340A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP14893347.6A EP3121735A4 (en) 2014-05-30 2014-09-28 Data mining method and device
US15/337,508 US10606867B2 (en) 2014-05-30 2016-10-28 Data mining method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410239140.4A CN105205052B (zh) 2014-05-30 2014-05-30 一种数据挖掘方法及装置
CN201410239140.4 2014-05-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/337,508 Continuation US10606867B2 (en) 2014-05-30 2016-10-28 Data mining method and apparatus

Publications (1)

Publication Number Publication Date
WO2015180340A1 true WO2015180340A1 (zh) 2015-12-03

Family

ID=54698001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/087630 WO2015180340A1 (zh) 2014-05-30 2014-09-28 一种数据挖掘方法及装置

Country Status (4)

Country Link
US (1) US10606867B2 (zh)
EP (1) EP3121735A4 (zh)
CN (1) CN105205052B (zh)
WO (1) WO2015180340A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257955A (zh) * 2020-11-06 2021-01-22 开普云信息科技股份有限公司 一种基于聚类算法的共享单车优化调配方法、控制装置、电子设备及其存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229663B (zh) * 2016-03-25 2022-05-27 阿里巴巴集团控股有限公司 数据处理方法和装置以及数据表处理方法和装置
US10657145B2 (en) 2017-12-18 2020-05-19 International Business Machines Corporation Clustering facets on a two-dimensional facet cube for text mining
CN108664605B (zh) * 2018-05-09 2021-03-09 北京三快在线科技有限公司 一种模型评估方法及系统
US11069447B2 (en) * 2018-09-29 2021-07-20 Intego Group, LLC Systems and methods for topology-based clinical data mining
CN110427341A (zh) * 2019-06-11 2019-11-08 福建奇点时空数字科技有限公司 一种基于路径排序的知识图谱实体关系挖掘方法
US20210133556A1 (en) * 2019-10-31 2021-05-06 International Business Machines Corporation Feature-separated neural network processing of tabular data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7516152B2 (en) * 2005-07-05 2009-04-07 International Business Machines Corporation System and method for generating and selecting data mining models for data mining applications
CN101799809A (zh) * 2009-02-10 2010-08-11 中国移动通信集团公司 数据挖掘方法和数据挖掘系统
CN102096602A (zh) * 2009-12-15 2011-06-15 中国移动通信集团公司 一种任务调度方法及其系统和设备
CN102693317A (zh) * 2012-05-29 2012-09-26 华为软件技术有限公司 数据挖掘流程生成方法及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2159269C (en) * 1995-09-27 2000-11-21 Chaitanya K. Baru Method and apparatus for achieving uniform data distribution in a parallel database system
US6032146A (en) 1997-10-21 2000-02-29 International Business Machines Corporation Dimension reduction for data mining application
US6862623B1 (en) * 2000-04-14 2005-03-01 Microsoft Corporation Capacity planning for server resources
US7472107B2 (en) * 2003-06-23 2008-12-30 Microsoft Corporation Integrating horizontal partitioning into physical database design
US7493406B2 (en) * 2006-06-13 2009-02-17 International Business Machines Corporation Maximal flow scheduling for a stream processing system
US9495427B2 (en) * 2010-06-04 2016-11-15 Yale University Processing of data using a database system in communication with a data processing framework
CN102903114A (zh) 2012-10-09 2013-01-30 河海大学 一种基于改进型层次聚类的高光谱遥感数据降维方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7516152B2 (en) * 2005-07-05 2009-04-07 International Business Machines Corporation System and method for generating and selecting data mining models for data mining applications
CN101799809A (zh) * 2009-02-10 2010-08-11 中国移动通信集团公司 数据挖掘方法和数据挖掘系统
CN102096602A (zh) * 2009-12-15 2011-06-15 中国移动通信集团公司 一种任务调度方法及其系统和设备
CN102693317A (zh) * 2012-05-29 2012-09-26 华为软件技术有限公司 数据挖掘流程生成方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3121735A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257955A (zh) * 2020-11-06 2021-01-22 开普云信息科技股份有限公司 一种基于聚类算法的共享单车优化调配方法、控制装置、电子设备及其存储介质

Also Published As

Publication number Publication date
CN105205052B (zh) 2019-01-25
EP3121735A1 (en) 2017-01-25
US10606867B2 (en) 2020-03-31
US20170046422A1 (en) 2017-02-16
CN105205052A (zh) 2015-12-30
EP3121735A4 (en) 2017-04-19

Similar Documents

Publication Publication Date Title
WO2015180340A1 (zh) 一种数据挖掘方法及装置
CN108492201B (zh) 一种基于社区结构的社交网络影响力最大化方法
US9449115B2 (en) Method, controller, program and data storage system for performing reconciliation processing
US10540354B2 (en) Discovering representative composite CI patterns in an it system
WO2017084362A1 (zh) 模型生成方法、推荐方法及对应装置、设备和存储介质
CN107430611B (zh) 过滤数据沿袭图
Khurana et al. Storing and analyzing historical graph data at scale
CN104077723B (zh) 一种社交网络推荐系统及方法
CN107251021B (zh) 过滤数据沿袭图
Hamann et al. Structure-preserving sparsification methods for social networks
WO2022116689A1 (zh) 图数据处理方法、装置、计算机设备和存储介质
CN110619231B (zh) 一种基于MapReduce的差分可辨性k原型聚类方法
JP2016100005A (ja) リコンサイル方法、プロセッサ及び記憶媒体
CN104137095A (zh) 用于演进分析的系统
CN106033425A (zh) 数据处理设备和数据处理方法
JP6382284B2 (ja) ベクトル推定に基づくグラフ分割を伴う、コンピューティング装置のデータフロープログラミング
CN106610977B (zh) 一种数据聚类方法和装置
US20210035025A1 (en) Systems and methods for optimizing machine learning models by summarizing list characteristics based on multi-dimensional feature vectors
Bulysheva et al. Segmentation modeling algorithm: a novel algorithm in data mining
CN114723014A (zh) 张量切分模式的确定方法、装置、计算机设备及介质
Kumar et al. Scalable performance tuning of hadoop MapReduce: A noisy gradient approach
KR102153161B1 (ko) 확률 그래프 기반의 서열 데이터 연관성 학습 방법 및 시스템
de Oliveira et al. Scalable fast evolutionary k-means clustering
WO2023093689A1 (zh) 一种计算图优化方法、装置及设备
EP2541409A1 (en) Parallelization of large scale data clustering analytics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14893347

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2014893347

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014893347

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE