CN107038244A - Data mining method and device, readable storage medium, and storage controller - Google Patents

Data mining method and device, readable storage medium, and storage controller

Info

Publication number
CN107038244A
CN107038244A
Authority
CN
China
Prior art keywords
data
training
data node
subunit
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710273242.1A
Other languages
Chinese (zh)
Inventor
高洪涛
胡建斌
白志凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing VRV Software Corp Ltd
Original Assignee
Beijing VRV Software Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing VRV Software Corp Ltd filed Critical Beijing VRV Software Corp Ltd
Priority to CN201710273242.1A priority Critical patent/CN107038244A/en
Publication of CN107038244A publication Critical patent/CN107038244A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses a data mining method and device, a readable storage medium, and a storage controller. The data mining method includes: storing raw data on the distributed file system HDFS and distributing it to at least one data node; applying dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m; dividing, according to a preset ratio, the low-dimensional data on each data node into training data and test data; training on the training data on each data node to obtain a multilayer perceptron classification model; and predicting the test data with the multilayer perceptron classification model, determining the model's prediction accuracy, and thereby accomplishing the data mining. The method reduces the cost of data mining and improves its efficiency.

Description

Data mining method and device, readable storage medium, and storage controller
Technical field
The present invention relates to the field of data analysis and mining, and in particular to a data mining method and device, a readable storage medium, and a storage controller.
Background art
With the rapid development of information technology and the ever wider use of the Internet, and especially with the arrival of the era of cloud computing and big data, the data on the Internet grows exponentially and the Internet has become the most important source of information. Internet data, however, is voluminous, high-dimensional, complex and irregular, and contains a large amount of noise. Faced with such huge and complex information, quickly organizing, managing, using, and mining valuable information is a serious challenge.
Data mining, also known as knowledge discovery in databases, refers to extracting implicit, previously unknown, non-trivial information or patterns of potential value from large amounts of incomplete, noisy, and fuzzy data.
Traditional data mining methods are generally suited only to small, low-dimensional data sets. For massive, high-dimensional big data, the running time and the ever-increasing demand for computing resources make traditional data mining methods too costly and too inefficient.
Summary of the invention
The embodiments of the invention provide a data mining method and device, a readable storage medium, and a storage controller, which can reduce the cost of data mining and improve its efficiency.
In a first aspect, an embodiment of the invention provides a data mining method, the method including:
storing raw data on the distributed file system HDFS and distributing it to at least one data node;
applying dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
dividing, according to a preset ratio, the low-dimensional data on each data node into training data and test data;
training on the training data on each data node to obtain a multilayer perceptron classification model;
predicting the test data with the multilayer perceptron classification model, determining the model's prediction accuracy, and thereby accomplishing the data mining.
Preferably, after the low-dimensional data on each data node is divided into training data and test data, the method further includes:
the training data on each data node forming a training data set, and the test data forming a test data set;
before the training data on each data node is trained to obtain the multilayer perceptron classification model, the method further includes:
the general-purpose parallel computing framework Spark reading the training data set on each data node from HDFS;
Spark converting each training data set that is read into a Resilient Distributed Dataset (RDD) object;
Spark storing each RDD object in memory;
and training the training data on each data node to obtain the multilayer perceptron classification model includes:
Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
Preferably, Spark assigning the RDD objects to the data nodes for training to obtain the multilayer perceptron classification model includes:
decomposing the training execution flow into a plurality of working stages by pipelining;
assigning each working stage to a data node;
executing each working stage on the data nodes to obtain the multilayer perceptron classification model.
Preferably, training the training data on each data node to obtain the multilayer perceptron classification model includes:
setting the training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
performing the following training steps on the training data:
S1: computing each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
S2: computing the training error δ at the output layer;
S3: computing the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
S4: computing and storing the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
S5: updating the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
S6: determining whether the training at the current value of t has converged; if so, terminating the training; otherwise, setting t to t+1 and returning to S1.
In a second aspect, an embodiment of the invention provides a data mining device, the device including an allocation unit, a dimensionality reduction unit, a division unit, a training unit, and a mining unit, wherein:
the allocation unit is configured to store raw data on the distributed file system HDFS and distribute it to at least one data node;
the dimensionality reduction unit is configured to apply dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit is configured to divide, according to a preset ratio, the low-dimensional data on each data node into training data and test data;
the training unit is configured to train on the training data on each data node to obtain a multilayer perceptron classification model;
the mining unit is configured to predict the test data with the multilayer perceptron classification model, determine the model's prediction accuracy, and thereby accomplish the data mining.
Preferably, the data mining device further includes an aggregation unit and the general-purpose parallel computing framework Spark, wherein:
the aggregation unit is configured to form the training data on each data node into a training data set and the test data into a test data set;
Spark is configured to read the training data set on each data node from HDFS, convert each training data set that is read into a Resilient Distributed Dataset (RDD) object, and store each RDD object in memory;
the training unit is specifically configured to obtain the multilayer perceptron classification model by having Spark assign the RDD objects to the data nodes for training.
Preferably, the training unit includes a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein:
the decomposition subunit is configured to decompose the training execution flow into a plurality of working stages by pipelining;
the distribution subunit is configured to assign each working stage to a data node;
the acquisition subunit is configured to execute each working stage on the data nodes to obtain the multilayer perceptron classification model.
Preferably, the training unit includes a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction subunit, an update subunit, and a judgment subunit, wherein:
the preset subunit is configured to set the training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to compute each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
the first error subunit is configured to compute the training error δ at the output layer;
the second error subunit is configured to compute the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
the correction subunit is configured to compute and store the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
the update subunit is configured to update the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
the judgment subunit is configured to determine whether the training at the current value of t has converged; if so, terminate the training; otherwise, set t to t+1 and trigger the output subunit.
In a third aspect, an embodiment of the invention provides a readable storage medium that includes executable instructions; when a processor of a storage controller executes the instructions, the storage controller performs any of the data mining methods of the first aspect.
In a fourth aspect, an embodiment of the invention provides a storage controller that includes a processor, a memory, and a bus;
the processor and the memory are connected by the bus;
the memory stores executable instructions, and when the storage controller runs, the processor executes the instructions stored in the memory so that the storage controller performs any of the data mining methods of the first aspect.
The embodiments of the invention provide a data mining method and device, a readable storage medium, and a storage controller. Raw data is stored on the distributed file system HDFS and distributed to at least one data node. Because of the nature of the distributed computing framework, the massive computing task is spread over the individual data nodes; this divide-and-conquer strategy effectively reduces the computational load and complexity of a single server and improves computational efficiency. Applying dimensionality reduction to the raw data reduces complexity, which further improves computational efficiency and also improves prediction accuracy. The method takes full advantage of the multilayer perceptron classification algorithm, whose iterative processing of data is efficient, simple, practical, and easy to implement, thereby reducing cost while improving computational efficiency.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data mining method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of a multilayer perceptron topology provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of error backpropagation provided by an embodiment of the invention;
Fig. 4 is a flowchart of another data mining method provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of a data mining device provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the invention provides a data mining method that may include the following steps:
Step 101: store raw data on the distributed file system HDFS and distribute it to at least one data node.
Step 102: apply dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m.
Step 103: according to a preset ratio, divide the low-dimensional data on each data node into training data and test data.
Step 104: train on the training data on each data node to obtain a multilayer perceptron classification model.
Step 105: predict the test data with the multilayer perceptron classification model, determine the model's prediction accuracy, and thereby accomplish the data mining.
In the above embodiment, storing the raw data on the distributed file system HDFS and distributing it to at least one data node exploits the nature of the distributed computing framework: the massive computing task is spread over the individual data nodes, and this divide-and-conquer strategy effectively reduces the computational load and complexity of a single server and improves computational efficiency. Applying dimensionality reduction to the raw data reduces complexity, which further improves computational efficiency and also improves prediction accuracy. The method takes full advantage of the multilayer perceptron classification algorithm, whose iterative processing of data is efficient, simple, practical, and easy to implement, thereby reducing cost while improving computational efficiency.
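As a concrete illustration, steps 101 to 105 can be expressed with Spark ML primitives roughly as in the following Scala sketch. The embodiment does not specify an input format, column names, layer sizes, or split ratio, so the values used here (LibSVM input, m = 50, an 80/20 split, a single 25-node hidden layer) are illustrative assumptions, and whether the embodiment relies on Spark ML's built-in MultilayerPerceptronClassifier or on a custom multilayer perceptron is not stated.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    object MlpDataMiningJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("MlpDataMining").getOrCreate()

        // Step 101: the raw data already resides on HDFS, replicated across the data nodes.
        // LibSVM format is assumed: each row carries a label and an M-dimensional feature vector.
        val raw = spark.read.format("libsvm").load("hdfs:///data/raw_dataset")

        // Step 102: reduce the M-dimensional features to m dimensions (m = 50 is an assumed value).
        val pcaModel = new PCA()
          .setInputCol("features").setOutputCol("pcaFeatures").setK(50)
          .fit(raw)
        val reduced = pcaModel.transform(raw)

        // Step 103: split into training and test data by a preset ratio (80/20 assumed).
        val Array(train, test) = reduced.randomSplit(Array(0.8, 0.2), seed = 42L)

        // Step 104: train a multilayer perceptron classifier.
        // Layer sizes are assumptions: 50 inputs, one hidden layer of 25 nodes, 2 output classes.
        val mlp = new MultilayerPerceptronClassifier()
          .setFeaturesCol("pcaFeatures")
          .setLabelCol("label")
          .setLayers(Array(50, 25, 2))
          .setMaxIter(100)
        val model = mlp.fit(train)

        // Step 105: predict the test data and measure the prediction accuracy.
        val predictions = model.transform(test)
        val accuracy = new MulticlassClassificationEvaluator()
          .setMetricName("accuracy")
          .evaluate(predictions)
        println(s"Prediction accuracy: $accuracy")

        spark.stop()
      }
    }

In such a sketch, Spark distributes the fit and transform work over the partitions of the underlying RDDs on the data nodes, which is the divide-and-conquer behaviour described above.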
In order to improve the reliability of data mining, in one embodiment of the invention, after the low-dimensional data on each data node is divided into training data and test data, the method further includes:
the training data on each data node forming a training data set, and the test data forming a test data set;
before the training data on each data node is trained to obtain the multilayer perceptron classification model, the method further includes:
the general-purpose parallel computing framework Spark reading the training data set on each data node from HDFS;
Spark converting each training data set that is read into a Resilient Distributed Dataset (RDD) object;
Spark storing each RDD object in memory;
and training the training data on each data node to obtain the multilayer perceptron classification model includes:
Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
In this embodiment, Spark is a Hadoop-MapReduce-like general-purpose parallel computing framework open-sourced by UC Berkeley AMP Lab. Spark implements distributed computation in the style of MapReduce and retains the advantages of Hadoop MapReduce; unlike MapReduce, however, intermediate job outputs and results can be kept in memory, so they no longer need to be written to and read from HDFS. For the same number of iterations, Spark can be up to 100 times faster than Hadoop, and about 10 times faster when accessing data on disk. Spark is therefore better suited to running more complex algorithms and can effectively improve computational efficiency. An RDD is a read-only, fault-tolerant, partitionable distributed data set whose partitions can be wholly or partially cached in cluster memory for reuse in subsequent computations. Making full use of Spark's memory management, computation optimization, and fault-tolerance mechanisms therefore improves both the computational efficiency of the data mining and the reliability of the results.
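The in-memory reuse described above can be made concrete with a short Scala fragment, as one might run in spark-shell; the HDFS path and the comma-separated record format are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("RddCacheSketch").getOrCreate()

    // Read the training set from HDFS and parse each line into a numeric feature array.
    val trainingRdd = spark.sparkContext
      .textFile("hdfs:///data/training_set")
      .map(_.split(",").map(_.toDouble))
      .persist(StorageLevel.MEMORY_ONLY)   // keep the RDD partitions in cluster memory

    // The first action materializes and caches the partitions; later passes of an
    // iterative algorithm reuse the in-memory copy instead of re-reading HDFS.
    val firstPass  = trainingRdd.count()
    val secondPass = trainingRdd.count()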
In order to improve resource utilization, in one embodiment of the invention, Spark assigning the RDD objects to the data nodes for training to obtain the multilayer perceptron classification model includes:
decomposing the training execution flow into a plurality of working stages by pipelining;
assigning each working stage to a data node;
executing each working stage on the data nodes to obtain the multilayer perceptron classification model.
In this embodiment, each working stage can be further decomposed, according to the number of RDD objects, into the same number of subtasks, and each subtask is assigned to a data node by Spark's resource scheduler. Making full use of the characteristics of Spark RDDs, the data set on each data node is subdivided into smaller data units, and the computing tasks are scheduled optimally according to the system resources, either executed in parallel or executed in order once system resources become available, which can significantly improve resource utilization.
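Whether the pipelining referred to here is Spark's internal DAG stage scheduling or the Spark ML Pipeline API is not specified. As one possible reading, the Scala sketch below chains the dimensionality-reduction and training steps of the earlier sketch into an ordered sequence of stages, which Spark then plans as stages and tasks over the RDD partitions on the data nodes; the stage choices and parameters are assumptions.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // Each stage is a self-contained unit of work; Spark's scheduler breaks the resulting
    // job into stages and tasks and assigns them to executors on the data nodes.
    val pcaStage = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(50)
    val mlpStage = new MultilayerPerceptronClassifier()
      .setFeaturesCol("pcaFeatures")
      .setLabelCol("label")
      .setLayers(Array(50, 25, 2))
      .setMaxIter(100)

    val pipeline = new Pipeline().setStages(Array(pcaStage, mlpStage))
    // `raw` is assumed to be the DataFrame of raw feature vectors from the earlier sketch.
    val pipelineModel = pipeline.fit(raw)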
In order to improve the accuracy of the model, in one embodiment of the invention, training the training data on each data node to obtain the multilayer perceptron classification model includes:
setting the training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
performing the following training steps on the training data:
S1: computing each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
S2: computing the training error δ at the output layer;
S3: computing the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
S4: computing and storing the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
S5: updating the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
S6: determining whether the training at the current value of t has converged; if so, terminating the training; otherwise, setting t to t+1 and returning to S1.
In this embodiment, referring to Fig. 2, a multilayer perceptron topology is built consisting of an input layer, at least one hidden layer, and an output layer; the number of hidden layers is set by the user, and the number of nodes in each layer can likewise be set by the user. Referring to Fig. 3, during the iterative process the multilayer perceptron classification algorithm corrects the error by error backpropagation using gradient descent, which accelerates convergence, improves the signal-to-noise ratio, reduces noise in the data, and greatly improves model accuracy.
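The concrete formulas for S1 to S6 appear only in the drawings of the original filing. Under the conventional assumption of a sigmoid activation and a momentum term with learning rate γ, a standard backpropagation formulation consistent with the quantities named above (layer outputs, output-layer and hidden-layer errors δ, weight corrections Δω_ij) would be as follows; this is a reconstruction, not the patent's own equations.

    % Forward pass (S1): output of node j from the outputs o_i of the previous layer
    o_j = f\Big(\sum_i \omega_{ij}\, o_i\Big), \qquad f(x) = \frac{1}{1 + e^{-x}}
    % Output-layer error (S2), with d_j the target value of output node j
    \delta_j = o_j\,(1 - o_j)\,(d_j - o_j)
    % Hidden-layer error (S3), propagated back from the layer above
    \delta_j = o_j\,(1 - o_j) \sum_k \delta_k\, \omega_{jk}
    % Weight correction with momentum (S4): \eta is the learning rate, \gamma the momentum learning rate
    \Delta\omega_{ij}(t) = \eta\, \delta_j\, o_i + \gamma\, \Delta\omega_{ij}(t-1)
    % Weight update (S5)
    \omega_{ij}(t+1) = \omega_{ij}(t) + \Delta\omega_{ij}(t)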
To illustrate the technical solution of the present invention more clearly, the data mining method provided by the invention is described in detail below with reference to Fig. 4.
In the following embodiment, a Hadoop distributed system with the Spark in-memory computing framework is used. The big-data cluster consists of 1 client server, 48 data-node servers, and 5 other auxiliary servers, 54 servers in total. Each data-node server is configured as follows: two Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz processors, 96 GB of DDR3 ECC memory, twelve 2 TB SATA disks, two 10-gigabit network interfaces, and the 64-bit CentOS 7.2 Linux operating system. The software stack is Apache Hadoop 2.7.3 and Spark 2.1.0, and the programming language is Scala.
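The embodiment does not state how Spark was configured on this cluster. Purely as an illustration, one executor per data-node server could be requested in Scala as below, with core and memory numbers chosen to leave headroom on the 12-core, 96 GB machines described above; all three values are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("MlpDataMining")
      .config("spark.executor.instances", "48")   // one executor per data node (assumption)
      .config("spark.executor.cores", "10")       // leave cores free for the HDFS DataNode and OS
      .config("spark.executor.memory", "64g")     // leave memory headroom on a 96 GB server
      .getOrCreate()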
The specific implementation steps are as follows:
Step 401: store the raw data on HDFS and distribute it to the 48 data nodes.
Step 402: apply dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m.
In this step, the values of M and m depend on the complexity of the raw data (a common heuristic for choosing m is sketched after these steps).
Step 403: according to a preset ratio, divide the low-dimensional data on the 48 data nodes into training data and test data, forming 48 training data sets and 48 test data sets.
Step 404: convert the 48 training data sets into 48 RDD objects with Spark.
Step 405: decompose the training execution flow into a plurality of working stages by pipelining.
Step 406: assign each working stage to a data node.
Step 407: execute each working stage on the data nodes to obtain the multilayer perceptron classification model.
Step 408: predict the test data sets with the multilayer perceptron classification model, determine the model's prediction accuracy, and thereby accomplish the data mining.
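As noted under step 402, the choice of m depends on the data. One common heuristic, shown here only as an assumption and not as part of the embodiment, is to keep the smallest number of principal components whose cumulative explained variance reaches a threshold (95% below). Here `raw` is again the assumed DataFrame of M-dimensional feature vectors, and the probe value of k must not exceed M.

    import org.apache.spark.ml.feature.PCA

    // Fit a PCA probe with a generous k (k <= M), then inspect the explained variance.
    val probe = new PCA()
      .setInputCol("features").setOutputCol("pcaFeatures").setK(100)
      .fit(raw)

    // Cumulative share of variance covered by the first 1, 2, ... components.
    val cumulative = probe.explainedVariance.toArray.scanLeft(0.0)(_ + _).drop(1)
    // Smallest m reaching the threshold (assumes it is reached within the probed components).
    val m = cumulative.indexWhere(_ >= 0.95) + 1
    println(s"Keeping m = $m of the original M dimensions")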
In this embodiment, the multilayer perceptron classification algorithm is applied within a distributed computing framework, taking full advantage of the algorithm while exploiting the characteristics of the distributed framework, so that the massive computing task is evenly distributed over the data nodes. This divide-and-conquer strategy effectively reduces the computational load and complexity of a single server and greatly improves computational efficiency. During iteration, the multilayer perceptron classification algorithm corrects the error with gradient descent, which accelerates convergence, improves the signal-to-noise ratio, reduces noise in the data, and greatly improves model accuracy. The training data sets and test data sets are stored in the distributed file system, and applying the divide-and-conquer strategy to the whole body of big data allows the mining computation to run in parallel, simplifying algorithmic complexity and increasing computation speed. Running the multilayer perceptron classification algorithm on Spark makes full use of Spark's memory management, computation optimization, and fault-tolerance mechanisms, which not only increases the running efficiency of the mining algorithm but also improves its reliability.
As shown in Fig. 5, an embodiment of the invention provides a data mining device, which may include an allocation unit 501, a dimensionality reduction unit 502, a division unit 503, a training unit 504, and a mining unit 505, wherein:
the allocation unit 501 is configured to store raw data on the distributed file system HDFS and distribute it to at least one data node;
the dimensionality reduction unit 502 is configured to apply dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit 503 is configured to divide, according to a preset ratio, the low-dimensional data on each data node into training data and test data;
the training unit 504 is configured to train on the training data on each data node to obtain a multilayer perceptron classification model;
the mining unit 505 is configured to predict the test data with the multilayer perceptron classification model, determine the model's prediction accuracy, and thereby accomplish the data mining.
In order to improve the reliability of data mining, in one embodiment of the invention, the data mining device may further include an aggregation unit and the general-purpose parallel computing framework Spark, wherein:
the aggregation unit is configured to form the training data on each data node into a training data set and the test data into a test data set;
Spark is configured to read the training data set on each data node from HDFS, convert each training data set that is read into a Resilient Distributed Dataset (RDD) object, and store each RDD object in memory;
the training unit is specifically configured to obtain the multilayer perceptron classification model by having Spark assign the RDD objects to the data nodes for training.
In order to improve resource utilization, in one embodiment of the invention, the training unit includes a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein:
the decomposition subunit is configured to decompose the training execution flow into a plurality of working stages by pipelining;
the distribution subunit is configured to assign each working stage to a data node;
the acquisition subunit is configured to execute each working stage on the data nodes to obtain the multilayer perceptron classification model.
In order to improve the accuracy of the model, in one embodiment of the invention, the training unit includes a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction subunit, an update subunit, and a judgment subunit, wherein:
the preset subunit is configured to set the training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to compute each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
the first error subunit is configured to compute the training error δ at the output layer;
the second error subunit is configured to compute the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
the correction subunit is configured to compute and store the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
the update subunit is configured to update the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
the judgment subunit is configured to determine whether the training at the current value of t has converged; if so, terminate the training; otherwise, set t to t+1 and trigger the output subunit.
Because the information exchange between the units of the above device and its execution process are based on the same concept as the method embodiments of the invention, the specific details can be found in the description of the method embodiments and are not repeated here.
An embodiment of the invention provides a readable storage medium that may include executable instructions; when a processor of a storage controller executes the instructions, the storage controller performs the data mining method of any of the above embodiments.
An embodiment of the invention provides a storage controller that may include a processor, a memory, and a bus;
the processor and the memory are connected by the bus;
the memory stores executable instructions, and when the storage controller runs, the processor executes the instructions stored in the memory so that the storage controller performs the data mining method of any of the above embodiments.
In summary, the embodiments of the present invention have at least the following beneficial effects:
1. In an embodiment of the invention, raw data is stored on the distributed file system HDFS and distributed to at least one data node. Because of the nature of the distributed computing framework, the massive computing task is spread over the individual data nodes; this divide-and-conquer strategy effectively reduces the computational load and complexity of a single server and improves computational efficiency. Applying dimensionality reduction to the raw data reduces complexity, which further improves computational efficiency and also improves prediction accuracy. The method takes full advantage of the multilayer perceptron classification algorithm, whose iterative processing of data is efficient, simple, practical, and easy to implement, thereby reducing cost while improving computational efficiency.
2. In an embodiment of the invention, the multilayer perceptron classification algorithm is applied within a distributed computing framework, taking full advantage of the algorithm while exploiting the characteristics of the distributed framework, so that the massive computing task is evenly distributed over the data nodes; this divide-and-conquer strategy effectively reduces the computational load and complexity of a single server and greatly improves computational efficiency.
3. In the embodiments of the invention, during iteration the multilayer perceptron classification algorithm corrects the error with gradient descent, which accelerates convergence, improves the signal-to-noise ratio, reduces noise in the data, and greatly improves model accuracy.
4. In the embodiments of the invention, the training data sets and test data sets are stored in the distributed file system, and applying the divide-and-conquer strategy to the whole body of big data allows the mining computation to run in parallel, simplifying algorithmic complexity and increasing computation speed.
5. In the embodiments of the invention, running the multilayer perceptron classification algorithm on Spark makes full use of Spark's memory management, computation optimization, and fault-tolerance mechanisms, which not only increases the running efficiency of the mining algorithm but also improves its reliability.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that the above are only preferred embodiments of the present invention, intended merely to illustrate its technical solution and not to limit its protection scope. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A data mining method, characterized in that the method comprises:
storing raw data on the distributed file system HDFS and distributing it to at least one data node;
applying dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
dividing, according to a preset ratio, the low-dimensional data on each data node into training data and test data;
training on the training data on each data node to obtain a multilayer perceptron classification model;
predicting the test data with the multilayer perceptron classification model, determining the model's prediction accuracy, and thereby accomplishing the data mining.
2. The data mining method according to claim 1, characterized in that, after the low-dimensional data on each data node is divided into training data and test data, the method further comprises:
forming the training data on each data node into a training data set and the test data into a test data set;
before the training data on each data node is trained to obtain the multilayer perceptron classification model, the method further comprises:
the general-purpose parallel computing framework Spark reading the training data set on each data node from HDFS;
Spark converting each training data set that is read into a Resilient Distributed Dataset (RDD) object;
Spark storing each RDD object in memory;
and training the training data on each data node to obtain the multilayer perceptron classification model comprises:
Spark assigning the RDD objects to the data nodes for training, to obtain the multilayer perceptron classification model.
3. The data mining method according to claim 2, characterized in that Spark assigning the RDD objects to the data nodes for training to obtain the multilayer perceptron classification model comprises:
decomposing the training execution flow into a plurality of working stages by pipelining;
assigning each working stage to a data node;
executing each working stage on the data nodes to obtain the multilayer perceptron classification model.
4. The data mining method according to any one of claims 1 to 3, characterized in that training the training data on each data node to obtain the multilayer perceptron classification model comprises:
setting the training parameter t and initializing the weights ω(0), where t = 0 and ω(0) is a small random number;
performing the following training steps on the training data:
S1: computing each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
S2: computing the training error δ at the output layer;
S3: computing the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
S4: computing and storing the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
S5: updating the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
S6: determining whether the training at the current value of t has converged; if so, terminating the training; otherwise, setting t to t+1 and returning to S1.
5. A data mining device, characterized in that the device comprises an allocation unit, a dimensionality reduction unit, a division unit, a training unit, and a mining unit, wherein:
the allocation unit is configured to store raw data on the distributed file system HDFS and distribute it to at least one data node;
the dimensionality reduction unit is configured to apply dimensionality reduction to the raw data, which has M-dimensional feature vectors, to form low-dimensional data with m-dimensional feature vectors, where M > m;
the division unit is configured to divide, according to a preset ratio, the low-dimensional data on each data node into training data and test data;
the training unit is configured to train on the training data on each data node to obtain a multilayer perceptron classification model;
the mining unit is configured to predict the test data with the multilayer perceptron classification model, determine the model's prediction accuracy, and thereby accomplish the data mining.
6. The data mining device according to claim 5, characterized in that the device further comprises an aggregation unit and the general-purpose parallel computing framework Spark, wherein:
the aggregation unit is configured to form the training data on each data node into a training data set and the test data into a test data set;
Spark is configured to read the training data set on each data node from HDFS, convert each training data set that is read into a Resilient Distributed Dataset (RDD) object, and store each RDD object in memory;
the training unit is specifically configured to obtain the multilayer perceptron classification model by having Spark assign the RDD objects to the data nodes for training.
7. The data mining device according to claim 6, characterized in that the training unit comprises a decomposition subunit, a distribution subunit, and an acquisition subunit, wherein:
the decomposition subunit is configured to decompose the training execution flow into a plurality of working stages by pipelining;
the distribution subunit is configured to assign each working stage to a data node;
the acquisition subunit is configured to execute each working stage on the data nodes to obtain the multilayer perceptron classification model.
8. The data mining device according to any one of claims 5 to 7, characterized in that the training unit comprises a preset subunit, an output subunit, a first error subunit, a second error subunit, a correction subunit, an update subunit, and a judgment subunit, wherein:
the preset subunit is configured to set the training parameter t and initialize the weights ω(0), where t = 0 and ω(0) is a small random number;
the output subunit is configured to compute each layer's output value for the training data, propagating forward from the input layer through the hidden layers to the output layer;
the first error subunit is configured to compute the training error δ at the output layer;
the second error subunit is configured to compute the training error δ of the hidden layers, propagating backward from the output layer to the input layer;
the correction subunit is configured to compute and store the correction Δω_ij of each weight, where γ is the learning rate of the momentum term;
the update subunit is configured to update the weights: ω_ij(t+1) = ω_ij(t) + Δω_ij;
the judgment subunit is configured to determine whether the training at the current value of t has converged; if so, terminate the training; otherwise, set t to t+1 and trigger the output subunit.
9. A readable storage medium, characterized in that the readable storage medium comprises executable instructions; when a processor of a storage controller executes the instructions, the storage controller performs the data mining method of any one of claims 1 to 4.
10. A storage controller, characterized in that the storage controller comprises a processor, a memory, and a bus;
the processor and the memory are connected by the bus;
the memory stores executable instructions, and when the storage controller runs, the processor executes the instructions stored in the memory so that the storage controller performs the data mining method of any one of claims 1 to 4.
CN201710273242.1A 2017-04-24 2017-04-24 Data mining method and device, readable storage medium, and storage controller Pending CN107038244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710273242.1A CN107038244A (en) 2017-04-24 2017-04-24 Data mining method and device, readable storage medium, and storage controller

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710273242.1A CN107038244A (en) 2017-04-24 2017-04-24 Data mining method and device, readable storage medium, and storage controller

Publications (1)

Publication Number Publication Date
CN107038244A true CN107038244A (en) 2017-08-11

Family

ID=59536742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710273242.1A Pending CN107038244A (en) Data mining method and device, readable storage medium, and storage controller

Country Status (1)

Country Link
CN (1) CN107038244A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268638A (en) * 2018-01-18 2018-07-10 浙江工业大学 Distributed implementation method of generative adversarial networks based on the Spark framework
CN113641497A (en) * 2021-08-03 2021-11-12 北京三易思创科技有限公司 Method for realizing distributed high-concurrency data summarization based on dimension reduction and segmentation technology
CN116882522A (en) * 2023-09-07 2023-10-13 湖南视觉伟业智能科技有限公司 Distributed space-time mining method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140096936A (en) * 2013-01-29 2014-08-06 (주)소만사 System and Method for Big Data Processing of DLP System
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN106250461A (en) * 2016-07-28 2016-12-21 北京北信源软件股份有限公司 Algorithm for data mining using gradient boosting decision trees based on the Spark framework

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140096936A (en) * 2013-01-29 2014-08-06 (주)소만사 System and Method for Big Data Processing of DLP System
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN106250461A (en) * 2016-07-28 2016-12-21 北京北信源软件股份有限公司 Algorithm for data mining using gradient boosting decision trees based on the Spark framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王之仓: "Research on multilayer perceptron learning algorithms", China Master's Theses Electronic Journals database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268638A (en) * 2018-01-18 2018-07-10 浙江工业大学 Distributed implementation method of generative adversarial networks based on the Spark framework
CN113641497A (en) * 2021-08-03 2021-11-12 北京三易思创科技有限公司 Method for realizing distributed high-concurrency data summarization based on dimension reduction and segmentation technology
CN116882522A (en) * 2023-09-07 2023-10-13 湖南视觉伟业智能科技有限公司 Distributed space-time mining method and system
CN116882522B (en) * 2023-09-07 2023-11-28 湖南视觉伟业智能科技有限公司 Distributed space-time mining method and system

Similar Documents

Publication Publication Date Title
US20210049512A1 (en) Explainers for machine learning classifiers
JP7087079B2 (en) Robust gradient weight compression scheme for deep learning applications
Liu et al. A speculative approach to spatial‐temporal efficiency with multi‐objective optimization in a heterogeneous cloud environment
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
EP3179415A1 (en) Systems and methods for a multi-core optimized recurrent neural network
US20170061279A1 (en) Updating an artificial neural network using flexible fixed point representation
CN108170529A (en) A kind of cloud data center load predicting method based on shot and long term memory network
US20160092794A1 (en) General framework for cross-validation of machine learning algorithms using sql on distributed systems
Ashish et al. Parallel bat algorithm-based clustering using mapreduce
US20170330078A1 (en) Method and system for automated model building
US20230236888A1 (en) Memory allocation method, related device, and computer-readable storage medium
CN115437795B (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN110069502A (en) Data balancing partition method and computer storage medium based on Spark framework
CN113220450B (en) Load prediction method, resource scheduling method and device for cloud-side multi-data center
CN106777006A Classification algorithm based on parallel hyper-networks under Spark
US11295236B2 (en) Machine learning in heterogeneous processing systems
CN107038244A (en) Data mining method and device, readable storage medium, and storage controller
CN105205052A (en) Method and device for mining data
CN115860081A (en) Core particle algorithm scheduling method and system, electronic equipment and storage medium
CN109117475A Method and related device for text rewriting
Fan et al. An evaluation model and benchmark for parallel computing frameworks
CN116644804B (en) Distributed training system, neural network model training method, device and medium
CN115544033B (en) Method, device, equipment and medium for updating check repeat vector library and checking repeat data
KR20210115863A (en) Method and appartus of parallel processing for neural network model
CN106648891A (en) MapReduce model-based task execution method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170811

RJ01 Rejection of invention patent application after publication