CN107038244A - Data mining method and device, computer-readable storage medium, and storage controller - Google Patents
Data mining method and device, computer-readable storage medium, and storage controller
- Publication number
- CN107038244A CN107038244A CN201710273242.1A CN201710273242A CN107038244A CN 107038244 A CN107038244 A CN 107038244A CN 201710273242 A CN201710273242 A CN 201710273242A CN 107038244 A CN107038244 A CN 107038244A
- Authority
- CN
- China
- Prior art keywords
- data
- training
- data node
- subunit
- classification model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/24569—Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a data mining method and device, a computer-readable storage medium, and a storage controller. The data mining method includes: storing raw data on the distributed file system HDFS and assigning it to at least one data node; performing dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m; dividing the low-dimensional data on each data node into training data and test data according to a preset ratio; training on the training data on each data node to obtain a multilayer perceptron classification model; and predicting on the test data using the multilayer perceptron classification model, determining the prediction accuracy of the model and thereby accomplishing the data mining. The method can reduce the cost of data mining and improve its efficiency.
Description
Technical field
The present invention relates to the field of data analysis and mining technology, and in particular to a data mining method and device, a computer-readable storage medium, and a storage controller.
Background
With the rapid development of information technology and the increasingly wide adoption of the Internet, and in particular with the arrival of the cloud-computing and big-data era, data on the Internet has grown exponentially, and the Internet has become the most important source of information. Internet data, however, is voluminous, high-dimensional, complex, and irregular, and contains a large amount of noise. Faced with such vast and complex information, quickly organizing, managing, using, and mining valuable information is a considerable challenge.
Data mining, also known as knowledge discovery in databases, refers to extracting implicit, previously unknown, non-trivial, and potentially valuable information or patterns from large amounts of incomplete, noisy, and fuzzy data.
Traditional data mining methods are generally applicable only to low-dimensional, small data sets. For high-dimensional, massive data, the ever-growing demands on running time and computing resources make traditional data mining methods too costly and too inefficient.
Summary of the invention
Embodiments of the present invention provide a data mining method and device, a computer-readable storage medium, and a storage controller, which can reduce the cost of data mining and improve its efficiency.
In a first aspect, an embodiment of the invention provides a data mining method, including:
storing raw data on the distributed file system HDFS and assigning it to at least one data node;
performing dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
dividing the low-dimensional data on each data node into training data and test data according to a preset ratio;
training on the training data on each data node to obtain a multilayer perceptron classification model;
predicting on the test data using the multilayer perceptron classification model, determining the prediction accuracy of the model, and thereby accomplishing the data mining.
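The per-node split by a preset ratio can be sketched as follows. Python is used here purely as a self-contained illustration (the embodiment itself uses Scala on Spark), and `split_by_ratio` and the toy records are hypothetical names, not part of the patent:

```python
import random

def split_by_ratio(records, train_ratio=0.8, seed=42):
    """Shuffle records and split them into training and test sets
    according to a preset ratio, as in the method's third step."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Each data node would apply the same split to its own local partition.
partition = [{"features": [i, i * 2], "label": i % 2} for i in range(100)]
train, test = split_by_ratio(partition, train_ratio=0.8)
print(len(train), len(test))  # 80 20
```

Shuffling before the cut avoids a biased split when the raw data arrives in label order; the fixed seed is only for reproducibility of the sketch.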
Preferably, after the low-dimensional data on each data node is divided into training data and test data, the method further includes:
the training data on each data node forming a training data set, and the test data forming a test data set;
and before the training data is trained on each data node to obtain the multilayer perceptron classification model, the method further includes:
the general-purpose parallel framework Spark reading the training data set on each data node from HDFS;
Spark converting each training data set that is read into a resilient distributed dataset (RDD) object;
Spark storing each RDD object in memory.
Training on the training data on each data node to obtain the multilayer perceptron classification model then includes:
Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
Preferably, Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model, includes:
decomposing the training execution flow into multiple working stages by pipelining;
assigning each working stage to a data node;
executing each working stage on the data node, to obtain the multilayer perceptron classification model.
Preferably, training on the training data on each data node to obtain the multilayer perceptron classification model includes:
setting a training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
and performing the following training steps on the training data:
S1: propagate the training data from the input layer through the hidden layers to the output layer, obtaining each layer's output values;
S2: compute the training error δ at the output layer;
S3: compute the training error δ of each hidden layer, from the output layer back toward the input layer;
S4: compute and store the correction Δωij of each weight, where γ is the learning rate of the momentum term;
S5: correct the weights: ωij(t+1) = ωij(t) + Δωij;
S6: judge whether the current training iteration t has converged; if so, end the training; otherwise take iteration t+1 as the current iteration and return to S1.
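The patent's formulas for S1–S4 appear as images in the original publication and are not reproduced here. The sketch below therefore uses the standard sigmoid back-propagation rules with a momentum term — δ = (y − o)·o·(1 − o) at the output, hidden errors propagated back through the weights, and Δωij(t) = η·δj·oi + γ·Δωij(t−1) — as one plausible reading of S1–S6. Python, the deterministic initialization, and the toy AND task are illustrative assumptions only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mlp(data, hidden=2, eta=0.3, gamma=0.5, epochs=2000):
    """Minimal 2-h-1 sigmoid MLP trained by back-propagation plus a
    momentum term, following steps S1-S6 (deterministic small init in
    place of the patent's small random weights)."""
    n_in = len(data[0][0])
    # w1[j][i]: hidden unit j <- input i (last i is the bias input)
    w1 = [[0.1 * (j + 1) * (i + 1) for i in range(n_in + 1)] for j in range(hidden)]
    w2 = [0.1 * (j + 1) for j in range(hidden + 1)]     # output <- hidden (+bias)
    dw1 = [[0.0] * (n_in + 1) for _ in range(hidden)]   # stored corrections (S4)
    dw2 = [0.0] * (hidden + 1)
    for _ in range(epochs):
        for x, y in data:
            xi = list(x) + [1.0]
            # S1: forward pass, input -> hidden -> output
            h = [sigmoid(sum(w * v for w, v in zip(w1[j], xi))) for j in range(hidden)]
            hb = h + [1.0]
            o = sigmoid(sum(w * v for w, v in zip(w2, hb)))
            # S2: output-layer training error
            d_out = (y - o) * o * (1 - o)
            # S3: hidden-layer errors, propagated back through w2
            d_hid = [h[j] * (1 - h[j]) * d_out * w2[j] for j in range(hidden)]
            # S4 + S5: momentum correction, then weight update
            for j in range(hidden + 1):
                dw2[j] = eta * d_out * hb[j] + gamma * dw2[j]
                w2[j] += dw2[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    dw1[j][i] = eta * d_hid[j] * xi[i] + gamma * dw1[j][i]
                    w1[j][i] += dw1[j][i]
        # S6 (convergence test) is reduced here to a fixed epoch budget
    def predict(x):
        xi = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(w1[j], xi))) for j in range(hidden)] + [1.0]
        return sigmoid(sum(w * v for w, v in zip(w2, h)))
    return predict

# Learn logical AND as a toy stand-in for the real training data
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
predict = train_mlp(data)
```

The momentum term γ·Δωij(t−1) reuses the stored correction from the previous step, which is why S4 explicitly requires the corrections to be saved.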
In a second aspect, an embodiment of the invention provides a data mining device, including: an allocation unit, a dimension-reduction unit, a division unit, a training unit, and a mining unit, wherein
the allocation unit is configured to store raw data on the distributed file system HDFS and assign it to at least one data node;
the dimension-reduction unit is configured to perform dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit is configured to divide the low-dimensional data on each data node into training data and test data according to a preset ratio;
the training unit is configured to train on the training data on each data node to obtain a multilayer perceptron classification model;
and the mining unit is configured to predict on the test data using the multilayer perceptron classification model, determine the prediction accuracy of the model, and thereby accomplish the data mining.
Preferably, the data mining device further includes an aggregation unit and the general-purpose parallel framework Spark, wherein
the aggregation unit is configured to form the training data on each data node into a training data set and the test data into a test data set;
Spark is configured to read the training data set on each data node from HDFS, convert each training data set that is read into a resilient distributed dataset (RDD) object, and store each RDD object in memory;
and the training unit is specifically configured to have Spark assign the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
Preferably, the training unit includes a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein
the decomposition subunit is configured to decompose the training execution flow into multiple working stages by pipelining;
the distribution subunit is configured to assign each working stage to a data node;
and the acquisition subunit is configured to execute each working stage on the data node, to obtain the multilayer perceptron classification model.
Preferably, the training unit includes a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction-amount subunit, a revision subunit, and a judgment subunit, wherein
the preset subunit is configured to set the training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to propagate the training data from the input layer through the hidden layers to the output layer, obtaining each layer's output values;
the first error subunit is configured to compute the training error δ at the output layer;
the second error subunit is configured to compute the training error δ of each hidden layer, from the output layer back toward the input layer;
the correction-amount subunit is configured to compute and store the correction Δωij of each weight, where γ is the learning rate of the momentum term;
the revision subunit is configured to correct the weights: ωij(t+1) = ωij(t) + Δωij;
and the judgment subunit is configured to judge whether the current training iteration t has converged; if so, the training ends; otherwise iteration t+1 is taken as the current iteration and the output subunit is triggered again.
In a third aspect, an embodiment of the invention provides a computer-readable storage medium, which includes execution instructions; when a processor of a storage controller executes the instructions, the storage controller performs any of the data mining methods of the first aspect.
In a fourth aspect, an embodiment of the invention provides a storage controller, including a processor, a memory, and a bus;
the processor and the memory are connected by the bus;
and the memory stores execution instructions that, when the storage controller runs, the processor executes, causing the storage controller to perform any of the data mining methods of the first aspect.
Embodiments of the present invention provide a data mining method and device, a computer-readable storage medium, and a storage controller. By storing raw data on the distributed file system HDFS and assigning it to at least one data node, and owing to the nature of the distributed computing framework, a massive computing task is distributed across individual data nodes. This divide-and-conquer strategy effectively reduces the computational load and complexity on any single server and improves computational efficiency. Performing dimension reduction on the raw data reduces complexity, which not only further improves computational efficiency but also improves the accuracy of the predictions. Fully exploiting the multilayer perceptron classification algorithm's efficient iteration over data, its simplicity and practicality, and its ease of implementation both reduces cost and improves computational efficiency.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show merely some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data mining method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a multilayer perceptron topology provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of error back-propagation provided by an embodiment of the present invention;
Fig. 4 is a flowchart of another data mining method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data mining device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a data mining method, which may include the following steps:
Step 101: store raw data on the distributed file system HDFS and assign it to at least one data node.
Step 102: perform dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m.
Step 103: according to a preset ratio, divide the low-dimensional data on each data node into training data and test data.
Step 104: train on the training data on each data node to obtain a multilayer perceptron classification model.
Step 105: predict on the test data using the multilayer perceptron classification model, determine the prediction accuracy of the model, and thereby accomplish the data mining.
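The dimension-reduction step (M-dimensional to m-dimensional vectors) leaves the concrete technique open. As one simple, hypothetical stand-in, the sketch below keeps only the m highest-variance of the M features; PCA or another method could equally well be used. `reduce_dimensions` and the toy data are illustrative names, not from the patent:

```python
def reduce_dimensions(vectors, m):
    """Keep only the m highest-variance features of M-dimensional
    vectors -- a simple stand-in for the unspecified reduction step."""
    M = len(vectors[0])
    n = len(vectors)
    means = [sum(v[i] for v in vectors) / n for i in range(M)]
    variances = [sum((v[i] - means[i]) ** 2 for v in vectors) / n for i in range(M)]
    keep = sorted(range(M), key=lambda i: variances[i], reverse=True)[:m]
    keep.sort()  # preserve the original feature order
    return [[v[i] for i in keep] for v in vectors]

# Features 0 and 2 are constant, so only features 1 and 3 survive
raw = [[1.0, float(x), 2.0, 2.0 * x] for x in range(4)]
low = reduce_dimensions(raw, m=2)
print(low[1])  # [1.0, 2.0]
```

Discarding near-constant features is the cheapest way to shrink the feature vector; it reduces the per-node computational load exactly as the patent's dimension-reduction step intends, at the cost of ignoring correlations that PCA would capture.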
In the above embodiment, by storing the raw data on the distributed file system HDFS and assigning it to at least one data node, and owing to the nature of the distributed computing framework, the massive computing task is distributed across individual data nodes. This divide-and-conquer strategy effectively reduces the computational load and complexity on any single server and improves computational efficiency. Performing dimension reduction on the raw data reduces complexity, further improving computational efficiency and also improving prediction accuracy. Fully exploiting the multilayer perceptron classification algorithm's efficient data iteration, simplicity, practicality, and ease of implementation both reduces cost and improves computational efficiency.
In order to improve the reliability of the data mining, in an embodiment of the invention, after the low-dimensional data on each data node is divided into training data and test data, the method further includes:
the training data on each data node forming a training data set, and the test data forming a test data set;
and before the training data is trained on each data node to obtain the multilayer perceptron classification model, the method further includes:
the general-purpose parallel framework Spark reading the training data set on each data node from HDFS;
Spark converting each training data set that is read into a resilient distributed dataset (RDD) object;
Spark storing each RDD object in memory.
Training on the training data on each data node to obtain the multilayer perceptron classification model then includes:
Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
In this embodiment, Spark is a Hadoop-MapReduce-like general-purpose parallel computing framework open-sourced by the UC Berkeley AMP Lab. The distributed computation Spark performs is based on the MapReduce model and retains the advantages of Hadoop MapReduce; unlike MapReduce, however, a job's intermediate outputs and results can be kept in memory, so HDFS no longer needs to be read and written between steps. For the same number of iterations, Spark can therefore be up to 100 times faster than Hadoop, and access from memory is about 10 times faster than access from disk. Spark is thus better suited to running more complex algorithms and can effectively improve computational efficiency. An RDD is a read-only, fault-tolerant, partitionable in-memory distributed data set; it can be cached partly or wholly in the cluster's memory and reused in subsequent computations. Fully using Spark's memory management, computation optimization, and fault-tolerance mechanisms therefore not only improves the computational efficiency of the data mining but also improves the reliability of the results.
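The efficiency argument above rests on reusing an in-memory data set across iterations instead of re-reading storage each time. The toy class below (a deliberate simplification, not Spark's actual API) counts how often the backing store is touched with and without a `persist`-style cache:

```python
class Dataset:
    """Toy analogue of an RDD: lazily materialised, optionally cached
    in memory so repeated iterations skip the (simulated) HDFS read."""
    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.reads = 0          # how many times backing storage was hit

    def persist(self):
        self._cache = self._loader()
        self.reads += 1
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache  # served from memory, no storage read
        self.reads += 1
        return self._loader()

uncached = Dataset(lambda: list(range(5)))
cached = Dataset(lambda: list(range(5))).persist()
for _ in range(10):             # ten "iterations" of a mining algorithm
    uncached.collect()
    cached.collect()
print(uncached.reads, cached.reads)  # 10 1
```

This is the mechanism behind the claimed iteration speed-up: the cached data set pays the storage cost once, the uncached one pays it on every iteration.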
In order to improve resource utilization, in an embodiment of the invention, Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model, includes:
decomposing the training execution flow into multiple working stages by pipelining;
assigning each working stage to a data node;
executing each working stage on the data node, to obtain the multilayer perceptron classification model.
In this embodiment, each working stage can further be decomposed into a number of subtasks equal to the number of RDD objects, and Spark's resource scheduler assigns each subtask to a data node. Fully using the characteristics of Spark RDDs, the data set of each data node is subdivided into smaller data units, and the computing tasks are scheduled according to the available system resources: they are executed in parallel, or in order as system resources become ready, which can significantly improve resource utilization.
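The stage-to-subtask decomposition can be sketched with a thread pool standing in for Spark's resource scheduler (an analogy only; Spark schedules tasks across JVM executors, not Python threads). One subtask is created per partition, mirroring one task per RDD partition:

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(partitions, task):
    """Decompose one working stage into one subtask per data partition
    and let a scheduler -- here a thread pool -- run them as resources
    allow, collecting the per-partition results in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(task, partitions))

# 48 partitions, matching the embodiment's 48 data nodes; each subtask
# computes a partial sum that a later stage could combine.
partitions = [list(range(i, i + 10)) for i in range(0, 480, 10)]
partials = run_stage(partitions, sum)
print(len(partials), sum(partials))  # 48 114960
```

Capping the worker count models the "order of execution once resources are ready" behaviour: with 48 subtasks and 4 workers, subtasks queue until a slot frees up.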
In order to improve the accuracy of the model, in an embodiment of the invention, training on the training data on each data node to obtain the multilayer perceptron classification model includes:
setting a training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
and performing the following training steps on the training data:
S1: propagate the training data from the input layer through the hidden layers to the output layer, obtaining each layer's output values;
S2: compute the training error δ at the output layer;
S3: compute the training error δ of each hidden layer, from the output layer back toward the input layer;
S4: compute and store the correction Δωij of each weight, where γ is the learning rate of the momentum term;
S5: correct the weights: ωij(t+1) = ωij(t) + Δωij;
S6: judge whether the current training iteration t has converged; if so, end the training; otherwise take iteration t+1 as the current iteration and return to S1.
In this embodiment, referring to Fig. 2, a multilayer perceptron topology is established, consisting of an input layer, at least one hidden layer, and an output layer; the number of hidden layers is set by the user, as is the number of nodes in each layer of the figure. Referring to Fig. 3, during its iterations the multilayer perceptron classification algorithm corrects the error by back-propagation using gradient descent, which accelerates convergence, improves the signal-to-noise ratio, reduces noise in the data, and greatly improves the accuracy of the model.
To illustrate the technical solution of the present invention more clearly, the data mining method provided by the present invention is discussed in detail below with reference to Fig. 4.
In the following embodiment, a Hadoop distributed system with the Spark in-memory computing framework is used. The big-data cluster consists of 1 client server, 48 data-node servers, and 5 other auxiliary servers, 54 servers in total. Each data-node server is configured as follows: 2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz, 96 GB DDR3 ECC memory, 12 2 TB SATA disks, 2 10-gigabit network interfaces, and a 64-bit CentOS 7.2 Linux operating system. The software stack is Apache Hadoop 2.7.3 and Spark 2.1.0, and the programming language is Scala.
The specific implementation steps are as follows:
Step 401: store the raw data on HDFS and assign it to the 48 data nodes.
Step 402: perform dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m.
In this step, the values of M and m depend on the complexity of the raw data.
Step 403: according to a preset ratio, divide the low-dimensional data on the 48 data nodes into training data and test data, forming 48 training data sets and 48 test data sets.
Step 404: convert the 48 training data sets into 48 RDD objects on the Spark platform.
Step 405: decompose the training execution flow into multiple working stages by pipelining.
Step 406: assign each working stage to a data node.
Step 407: execute each working stage on the data node, obtaining the multilayer perceptron classification model.
Step 408: predict on the test data sets using the multilayer perceptron classification model, determine the prediction accuracy of the model, and thereby accomplish the data mining.
In this embodiment, the multilayer perceptron classification algorithm is applied within a distributed computing framework, taking full advantage of the algorithm while combining it with the characteristics of the framework, so that the massive computing task is distributed evenly over the data nodes. This divide-and-conquer strategy effectively reduces the computational load and complexity on any single server and greatly improves computational efficiency. During its iterations, the multilayer perceptron classification algorithm corrects the error by gradient descent, accelerating convergence, improving the signal-to-noise ratio, reducing noise in the data, and greatly improving model accuracy. The training and test data sets are stored on the distributed file system, applying the divide-and-conquer strategy to the whole of the big data, so that the mining computation runs in parallel, the algorithmic complexity is simplified, and the computation speed is improved. Running the multilayer perceptron classification algorithm on Spark makes full use of Spark's memory management, computation optimization, and fault-tolerance mechanisms, improving not only the running efficiency of the mining algorithm but also its reliability.
As shown in Fig. 5, an embodiment of the present invention provides a data mining device, which may include: an allocation unit 501, a dimension-reduction unit 502, a division unit 503, a training unit 504, and a mining unit 505, wherein
the allocation unit 501 is configured to store raw data on the distributed file system HDFS and assign it to at least one data node;
the dimension-reduction unit 502 is configured to perform dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit 503 is configured to divide the low-dimensional data on each data node into training data and test data according to a preset ratio;
the training unit 504 is configured to train on the training data on each data node to obtain a multilayer perceptron classification model;
and the mining unit 505 is configured to predict on the test data using the multilayer perceptron classification model, determine the prediction accuracy of the model, and thereby accomplish the data mining.
In order to improve the reliability of the data mining, in an embodiment of the invention, the data mining device may further include an aggregation unit and the general-purpose parallel framework Spark, wherein
the aggregation unit is configured to form the training data on each data node into a training data set and the test data into a test data set;
Spark is configured to read the training data set on each data node from HDFS, convert each training data set that is read into a resilient distributed dataset (RDD) object, and store each RDD object in memory;
and the training unit is specifically configured to have Spark assign the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
In order to improve resource utilization, in an embodiment of the invention, the training unit includes a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein
the decomposition subunit is configured to decompose the training execution flow into multiple working stages by pipelining;
the distribution subunit is configured to assign each working stage to a data node;
and the acquisition subunit is configured to execute each working stage on the data node, to obtain the multilayer perceptron classification model.
In order to improve the accuracy of the model, in an embodiment of the invention, the training unit includes a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction-amount subunit, a revision subunit, and a judgment subunit, wherein
the preset subunit is configured to set the training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to propagate the training data from the input layer through the hidden layers to the output layer, obtaining each layer's output values;
the first error subunit is configured to compute the training error δ at the output layer;
the second error subunit is configured to compute the training error δ of each hidden layer, from the output layer back toward the input layer;
the correction-amount subunit is configured to compute and store the correction Δωij of each weight, where γ is the learning rate of the momentum term;
the revision subunit is configured to correct the weights: ωij(t+1) = ωij(t) + Δωij;
and the judgment subunit is configured to judge whether the current training iteration t has converged; if so, the training ends; otherwise iteration t+1 is taken as the current iteration and the output subunit is triggered again.
For matters such as the information exchange between the units of the above device and their execution processes, since they are based on the same concept as the method embodiments of the present invention, reference can be made to the description of those method embodiments; details are not repeated here.
An embodiment of the present invention provides a computer-readable storage medium, which may include execution instructions; when a processor of a storage controller executes the instructions, the storage controller performs the data mining method of any of the above embodiments.
An embodiment of the present invention provides a storage controller, which may include a processor, a memory, and a bus;
the processor and the memory are connected by the bus;
and the memory stores execution instructions that, when the storage controller runs, the processor executes, causing the storage controller to perform the data mining method of any of the above embodiments.
In summary, the embodiments of the present invention have at least the following beneficial effects:
1. In an embodiment of the present invention, by storing the raw data on the distributed file system HDFS and assigning it to at least one data node, and owing to the nature of the distributed computing framework, the massive computing task is distributed across individual data nodes. This divide-and-conquer strategy effectively reduces the computational load and complexity on any single server and improves computational efficiency. Dimension reduction of the raw data further reduces complexity, improving both computational efficiency and the accuracy of the predictions. Fully exploiting the multilayer perceptron classification algorithm's efficient data iteration, simplicity, practicality, and ease of implementation both reduces cost and improves computational efficiency.
2. In an embodiment of the present invention, the multilayer perceptron classification algorithm is applied within a distributed computing framework, taking full advantage of the algorithm while combining it with the characteristics of the framework, so that the massive computing task is distributed evenly over the data nodes; the divide-and-conquer strategy effectively reduces the computational load and complexity on any single server and greatly improves computational efficiency.
3. In embodiments of the present invention, during its iterations the multilayer perceptron classification algorithm corrects the error by gradient descent, which accelerates convergence, improves the signal-to-noise ratio, reduces noise in the data, and greatly improves model accuracy.
4. In embodiments of the present invention, the training and test data sets are stored on the distributed file system, and the divide-and-conquer strategy is applied to the whole of the big data, so that the mining computation runs in parallel, the algorithmic complexity is simplified, and the computation speed is improved.
5. In embodiments of the present invention, running the multilayer perceptron classification algorithm on Spark makes full use of Spark's memory management, computation optimization, and fault-tolerance mechanisms, improving not only the running efficiency of the mining algorithm but also its reliability.
It should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
One of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate the technical solutions of the present invention and not to limit its scope. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A data mining method, characterized in that the method comprises:
storing raw data on the distributed file system HDFS and distributing it to at least one data node;
performing dimension reduction on the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
on each data node, dividing the low-dimensional data into training data and test data according to a preset ratio;
training the training data on each data node to obtain a multilayer perceptron classification model;
predicting the test data with the multilayer perceptron classification model, determining the prediction accuracy of the model, and thereby realizing data mining.
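Outside the claim language, the first two steps of claim 1 — reducing M-dimensional raw data to m dimensions and splitting it at a preset ratio — can be sketched as follows. This is a minimal illustration assuming PCA (via SVD) as the dimension-reduction technique and NumPy for the arithmetic; the claim does not fix either choice, and the names `reduce_dims` and `split_by_ratio` are hypothetical.

```python
import numpy as np

def reduce_dims(X, m):
    """Project the M-dimensional rows of X onto their top-m principal
    components (PCA via SVD; one common dimension-reduction choice)."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T                         # (n, m) low-dimensional data

def split_by_ratio(X, y, train_ratio=0.8, seed=0):
    """Shuffle once, then cut into training and test sets at the preset
    ratio, as claim 1 describes for each data node."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * train_ratio)
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

# Toy run: 100 samples with M = 10 features reduced to m = 3.
X = np.random.default_rng(1).normal(size=(100, 10))
y = (X[:, 0] > 0).astype(int)
X_low = reduce_dims(X, 3)
Xtr, ytr, Xte, yte = split_by_ratio(X_low, y, train_ratio=0.8)
print(X_low.shape, Xtr.shape, Xte.shape)         # (100, 3) (80, 3) (20, 3)
```

In a distributed deployment each data node would run this split on its own shard of the low-dimensional data; the sketch shows only the single-node arithmetic.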
2. The data mining method according to claim 1, characterized in that after the low-dimensional data is divided into training data and test data on each data node, the method further comprises:
composing the training data on each data node into a training data set, and composing the test data into a test data set;
before the training data is trained on each data node to obtain the multilayer perceptron classification model, the method further comprises:
the general-purpose parallel framework Spark platform reading the training data set on each data node from HDFS;
the Spark platform converting each training data set read into a resilient distributed dataset (RDD) object;
the Spark platform storing each RDD object in memory;
and training the training data on each data node to obtain the multilayer perceptron classification model comprises:
the Spark platform distributing the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
3. The data mining method according to claim 2, characterized in that the Spark platform distributing the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model, comprises:
decomposing the training execution flow into multiple working stages by a pipelining technique;
distributing each working stage to the data nodes;
executing each working stage on the data nodes to obtain the multilayer perceptron classification model.
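A toy stand-in for the stage decomposition of claim 3: the training flow is broken into working stages, and each stage's task runs over every data partition before the next stage begins. Real Spark pipelining is far more involved; this sketch only illustrates the stage-by-stage, partition-parallel execution order, with `run_pipeline` and the thread pool as illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(partitions, stages):
    """Apply each working stage to every partition in parallel; the
    results of one stage feed the next, loosely mimicking how a
    pipelined job executes stage by stage on the data nodes."""
    current = partitions
    for stage in stages:
        with ThreadPoolExecutor() as pool:
            current = list(pool.map(stage, current))
    return current

# Hypothetical two-stage flow over data spread across 3 "nodes":
partitions = [[1, 2], [3, 4], [5, 6]]
stages = [lambda p: [x * 2 for x in p],   # stage 1: per-partition transform
          lambda p: sum(p)]               # stage 2: per-partition reduce
print(run_pipeline(partitions, stages))   # [6, 14, 22]
```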
4. The data mining method according to any one of claims 1 to 3, characterized in that training the training data on each data node to obtain the multilayer perceptron classification model specifically comprises:
setting a training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
performing the following training on the training data:
S1: computing the training data, passing from the input layer through the hidden layer to the output layer to obtain each layer's output value;
S2: computing the training error δ for the output layer;
S3: computing the training error δ of the hidden layer, from the output layer back to the input layer;
S4: computing and saving the correction Δωij of each weight value, where γ is the learning rate of the momentum term;
S5: correcting the weight values: ωij(t+1) = ωij(t) + Δωij;
S6: judging whether the current training datum t has converged; if so, ending the training; otherwise, taking training datum t+1 as the current training datum and performing S1.
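Steps S1–S6 can be illustrated with a small NumPy network. The patent's exact error and correction formulas appear only as images in the published text and are not reproduced here; the code below uses the textbook sigmoid backpropagation rules with a momentum term γ, which matches the structure of S1–S6 but is an assumption, not the claimed formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR training set, with a constant 1 appended as a bias input.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(scale=0.5, size=(3, 4))   # ω(0): small random numbers
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden layer plus its bias unit
dW1p, dW2p = np.zeros_like(W1), np.zeros_like(W2)
eta, gamma = 0.5, 0.5                     # learning rate, momentum coefficient

errs = []
for t in range(10000):
    h = np.hstack([sigmoid(X @ W1), np.ones((4, 1))])     # S1: forward pass
    out = sigmoid(h @ W2)                                 #     to the output
    d_out = (y - out) * out * (1 - out)                   # S2: output error δ
    d_h = (d_out @ W2[:4].T) * h[:, :4] * (1 - h[:, :4])  # S3: hidden error δ
    dW2 = eta * h.T @ d_out + gamma * dW2p                # S4: Δω + momentum
    dW1 = eta * X.T @ d_h + gamma * dW1p
    W2 += dW2; W1 += dW1                                  # S5: ω(t+1)=ω(t)+Δω
    dW2p, dW1p = dW2, dW1
    errs.append(float(np.mean((y - out) ** 2)))
    if errs[-1] < 1e-3:                                   # S6: convergence test
        break

print(len(errs), round(errs[-1], 4))
```

On this toy XOR task the loop usually drives the mean squared error down, though convergence is not guaranteed for every initialization; the point of the sketch is the S1–S6 control flow, not the particular constants.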
5. A data mining device, characterized in that the device comprises an allocation unit, a dimension reduction unit, a division unit, a training unit, and a mining unit, wherein:
the allocation unit is configured to store raw data on the distributed file system HDFS and distribute it to at least one data node;
the dimension reduction unit is configured to perform dimension reduction on the raw data, which has M-dimensional feature vectors, forming low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit is configured to divide, on each data node, the low-dimensional data into training data and test data according to a preset ratio;
the training unit is configured to train the training data on each data node to obtain a multilayer perceptron classification model;
the mining unit is configured to predict the test data with the multilayer perceptron classification model, determine the prediction accuracy of the model, and thereby realize data mining.
6. The data mining device according to claim 5, characterized in that the device further comprises an aggregation unit and the general-purpose parallel framework Spark platform, wherein:
the aggregation unit is configured to compose the training data on each data node into a training data set and compose the test data into a test data set;
the Spark platform is configured to read the training data set on each data node from HDFS, convert each training data set read into a resilient distributed dataset (RDD) object, and store each RDD object in memory;
the training unit is specifically configured to obtain the multilayer perceptron classification model by having the Spark platform distribute the RDD objects to the data nodes for training.
7. The data mining device according to claim 6, characterized in that the training unit comprises a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein:
the decomposition subunit is configured to decompose the training execution flow into multiple working stages by a pipelining technique;
the distribution subunit is configured to distribute each working stage to the data nodes;
the acquisition subunit is configured to execute each working stage on the data nodes to obtain the multilayer perceptron classification model.
8. The data mining device according to any one of claims 5 to 7, characterized in that the training unit comprises a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction subunit, a revision subunit, and a judgment subunit, wherein:
the preset subunit is configured to set a training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to compute the training data, passing from the input layer through the hidden layer to the output layer to obtain each layer's output value;
the first error subunit is configured to compute the training error δ for the output layer;
the second error subunit is configured to compute the training error δ of the hidden layer, from the output layer back to the input layer;
the correction subunit is configured to compute and save the correction Δωij of each weight value, where γ is the learning rate of the momentum term;
the revision subunit is configured to correct the weight values: ωij(t+1) = ωij(t) + Δωij;
the judgment subunit is configured to judge whether the current training datum t has converged; if so, the training ends; otherwise, training datum t+1 is taken as the current training datum and the output subunit is triggered.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises execute instructions; when a processor of a storage controller executes the instructions, the storage controller performs the data mining method of any one of claims 1 to 4.
10. A storage controller, characterized in that the storage controller comprises a processor, a memory, and a bus; the processor and the memory are connected by the bus. The memory stores execute instructions, and when the storage controller runs, the processor executes the instructions stored in the memory, so that the storage controller performs the data mining method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710273242.1A CN107038244A (en) | 2017-04-24 | 2017-04-24 | A kind of data digging method and device, a kind of computer-readable recording medium and storage control |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107038244A true CN107038244A (en) | 2017-08-11 |
Family
ID=59536742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710273242.1A Pending CN107038244A (en) | 2017-04-24 | 2017-04-24 | A kind of data digging method and device, a kind of computer-readable recording medium and storage control |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107038244A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140096936A (en) * | 2013-01-29 | 2014-08-06 | (주)소만사 | System and Method for Big Data Processing of DLP System |
CN104899561A (en) * | 2015-05-27 | 2015-09-09 | 华南理工大学 | Parallelized human body behavior identification method |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | Spark platform based high efficiency text classification method |
CN106250461A (en) * | 2016-07-28 | 2016-12-21 | 北京北信源软件股份有限公司 | A kind of algorithm utilizing gradient lifting decision tree to carry out data mining based on Spark framework |
Non-Patent Citations (1)
Title |
---|
Wang Zhicang: "Research on Multilayer Perceptron Learning Algorithms", China Excellent Master's Theses Electronic Journals Network * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108268638A (en) * | 2018-01-18 | 2018-07-10 | 浙江工业大学 | A kind of generation confrontation network distribution type implementation method based on Spark frames |
CN113641497A (en) * | 2021-08-03 | 2021-11-12 | 北京三易思创科技有限公司 | Method for realizing distributed high-concurrency data summarization based on dimension reduction and segmentation technology |
CN116882522A (en) * | 2023-09-07 | 2023-10-13 | 湖南视觉伟业智能科技有限公司 | Distributed space-time mining method and system |
CN116882522B (en) * | 2023-09-07 | 2023-11-28 | 湖南视觉伟业智能科技有限公司 | Distributed space-time mining method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3179415B1 (en) | Systems and methods for a multi-core optimized recurrent neural network | |
EP4036803A1 (en) | Neural network model processing method and apparatus, computer device, and storage medium | |
JP7087079B2 (en) | Robust gradient weight compression scheme for deep learning applications | |
US20160092794A1 (en) | General framework for cross-validation of machine learning algorithms using sql on distributed systems | |
US20170330078A1 (en) | Method and system for automated model building | |
US20230236888A1 (en) | Memory allocation method, related device, and computer-readable storage medium | |
US11366806B2 (en) | Automated feature generation for machine learning application | |
CN113220450B (en) | Load prediction method, resource scheduling method and device for cloud-side multi-data center | |
CN116644804B (en) | Distributed training system, neural network model training method, device and medium | |
CN115437795B (en) | Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception | |
CN107038244A (en) | A kind of data digging method and device, a kind of computer-readable recording medium and storage control | |
CN105205052A (en) | Method and device for mining data | |
Jankov et al. | Declarative recursive computation on an RDBMS: or, why you should use a database for distributed machine learning | |
CN111966495A (en) | Data processing method and device | |
CN115860081A (en) | Core particle algorithm scheduling method and system, electronic equipment and storage medium | |
CN109117475A (en) | A kind of method and relevant device of text rewriting | |
KR20210115863A (en) | Method and appartus of parallel processing for neural network model | |
CN116569177A (en) | Weight-based modulation in neural networks | |
CN106648891A (en) | MapReduce model-based task execution method and apparatus | |
CN113128771B (en) | Expensive function optimization method and device for parallel differential evolution algorithm | |
US20230009237A1 (en) | Multi-dimensional data labeling | |
Reeves et al. | Propositional proof skeletons | |
Heye | Scaling deep learning without increasing batchsize | |
Fiosina et al. | Distributed nonparametric and semiparametric regression on SPARK for big data forecasting | |
Saldanha et al. | Determining the probability distribution of execution times |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170811 |