CN106874482A

CN106874482A - Device and method for graphical data preprocessing based on big data technology

Info

Publication number: CN106874482A
Application number: CN201710090025.9A
Authority: CN
Inventors: 刘涛; 杨立涛; 王庆刚; 丛兴滋; 李书明
Original assignee: Shandong Luneng Software Technology Co Ltd
Current assignee: Shandong Luneng Software Technology Co Ltd
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2017-06-20

Abstract

A device and method for graphical data preprocessing based on big data technology. The device includes a data acquisition device, an equipment monitoring device, a distributed storage, a Spark in-memory computing engine, a computing unit, an ETL processing unit, and a data preprocessing unit. The data acquisition device is connected to the equipment monitoring device, the equipment monitoring device is connected to the distributed storage, and the distributed storage is connected to the data preprocessing unit, which comprises the Spark in-memory computing engine, the computing unit, and the ETL processing unit. The device can process massive heterogeneous data quickly, efficiently, and in a timely manner, while ensuring that the equipment runs safely, stably, and efficiently.

Description

Device and method for graphical data preprocessing based on big data technology

Technical field

The present invention relates to the field of equipment monitoring and analysis applications, and in particular to a device and method for graphical data preprocessing based on big data technology.

Background art

With the rapid development of the smart grid, the power system has begun to move toward the era of the energy internet and "big data". The large volume of operating data in the power industry is increasingly characterized by large scale, many types, and high value, and the contradiction between lagging data analysis and processing capability and rapid data growth will become more prominent. As data volumes and data types keep increasing, problems have also emerged such as performance bottlenecks in data analysis, a lack of advanced methods for data analysis and mining, and the continued lack of effective use of unstructured data. These problems restrict the development of power industry informatization from digitalization to intelligence. The key big data technologies of the energy internet era cover many aspects, including data acquisition, transmission, storage, quality management, fusion and sharing, and deep mining.

The collection and analysis of historical operating data and the immediate analysis of real-time or near-real-time data are important parts of informatization in the power industry. They require a complete, stable big data analysis solution that fits actual business scenarios and provides stable, reliable underlying data support for real-time analytical business scenarios such as equipment fault early warning.

In recent years, with the rapid development of IT technologies such as cloud computing, big data, machine learning, and data mining, distributed storage and high-performance computing have achieved key breakthroughs in both theoretical research and engineering practice, and a number of big data processing and application solutions, represented by Hadoop, have emerged in the industry.

Hadoop is an extensible framework that supports reliable distributed processing of big data. The core of its design consists of HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. HDFS is a distributed file system characterized by low cost, high reliability, and high throughput. MapReduce is a programming model and software framework that greatly simplifies the processing of large-scale data. Spark is a distributed big data processing tool that does not itself provide data storage; it can run on Hadoop's HDFS or on other distributed file systems. Spark was designed to address the inefficiency caused by Hadoop MapReduce repeatedly reading from and writing to the file system. By building resilient distributed datasets (RDDs), it keeps data resident in memory and implements an in-memory MapReduce architecture, which compensates for the shortcomings of MapReduce in certain application scenarios.

General-purpose open-source components such as Hadoop and Spark have certain limitations in functional completeness and operational stability, and some commercial big data platforms derived from Hadoop deviate from the actual needs of power business scenarios. The business needs of the power industry therefore have to be analyzed and studied in depth. The integration of heterogeneous data sources is a practical problem frequently encountered in enterprise informatization. With the sharp increase in data volume, and in unstructured data in particular, traditional data warehouse technology and data extraction tools struggle with data preprocessing and cannot meet the performance requirements of processing massive heterogeneous data and messy, low-quality data. Building a graphical data preprocessing device and method based on big data technology is therefore of far-reaching significance and considerable value.

Smart grid big data is complex in structure and diverse in type. In addition to traditional structured data, it contains a large amount of semi-structured and unstructured data, such as the voice data of the 95598 customer service center system and the video and image data of equipment online monitoring systems. The sampling frequency and life cycle of these data also differ, ranging from the microsecond level through the minute and hour levels up to the annual level. The massive and varied data resources of today's grid companies provide good conditions for deeper data analysis. How to improve data processing performance, fully mine the value of data, and manage data as an asset so that data becomes a key enterprise asset is a problem that currently needs to be solved.

In view of this, there is an urgent need for a solution that provides unified representation, flexible acquisition, centralized storage, effective assessment, fast processing, and secure sharing of massive multi-source electric power big data, and research on a metadata-based management system for multi-source heterogeneous big data is pressing.

The distributed computing capability of big data technology can be used to address the heterogeneous data integration problem. An ETL processing unit is built on the Spark in-memory computing engine, and the data extraction, data transformation, and data loading logic is converted into computing units that support parameter configuration and dynamic combination. Together with a graphical flow configuration tool, this allows the data preprocessing process to be flexibly customized, which not only solves the performance problem of heterogeneous data preprocessing but also effectively improves the reusability and flexibility of data preprocessing programs.

It is therefore necessary to build a device and method for graphical data preprocessing based on big data technology that solves the performance problems of massive heterogeneous data integration which traditional ETL tools cannot handle properly, and that improves the reusability, flexibility, and execution efficiency of data preprocessing programs.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a graphical data preprocessing device and method based on big data technology that can process massive heterogeneous data quickly, efficiently, and in a timely manner, while ensuring that the equipment runs safely, stably, and efficiently.

The invention provides a device for graphical data preprocessing based on big data technology, including a data acquisition device, an equipment monitoring device, a distributed storage, a Spark in-memory computing engine, a computing unit, an ETL processing unit, and a data preprocessing unit, wherein the data acquisition device is connected to the equipment monitoring device, the equipment monitoring device is connected to the distributed storage, the distributed storage is connected to the data preprocessing unit, and the data preprocessing unit includes the Spark in-memory computing engine, the computing unit, and the ETL processing unit;

The data acquisition device is used to obtain heterogeneous equipment information data in real time or near real time, and to transmit the collected heterogeneous equipment information data to the equipment monitoring device;

The equipment monitoring device is used to collect and store the heterogeneous equipment information data, to push it to the distributed storage, and to output the equipment monitoring data, as a data stream, to the data preprocessing process;

The distributed storage, also called the time-series data storage, is used to store the massive real-time heterogeneous equipment data and the equipment data that has passed through the data preprocessing unit.

The Spark in-memory computing engine is used to compute on the data by calling the logic rules of the computing unit, and to output the computed data to the distributed storage;

The computing unit is used to drive the scheduling rule engine to call and receive the data stored in the distributed storage, to process the called and received data according to pre-programmed processing logic, and to train a data mining model;

The computing unit includes a plurality of sub-computing units, which are graphically and dynamically configured according to actual business needs and dynamically orchestrated into jobs. Each sub-computing unit exists independently and can be extended and evolved independently according to the experience of industry experts. A distributed streaming computing engine computes on the called and received data, outputs the results in real time, and writes the data to the distributed data storage;

The ETL processing unit is used to form jobs based on the dynamic orchestration of computing units. It is built on the Spark in-memory computing engine and converts the data extraction, data transformation, and data loading logic into a form that supports graphical parameter configuration and dynamic combination;

The data preprocessing unit is used to preprocess the heterogeneous equipment information data by extraction, transformation, and loading according to the ETL processing unit. It can also standardize data formats, remove abnormal data, correct errors, and remove duplicate data; combine the data from multiple data sources into unified storage; and convert the data, by means of smoothing, aggregation, data generalization, and/or normalization, into a form suitable for data mining.

Preferably, the data acquisition device is a data acquisition sensor installed on the monitored equipment;

Preferably, the data acquisition device is an infrared detector or a temperature detector in the area where the monitored equipment is installed;

Preferably, the device further includes a manual input device connected to the equipment monitoring device, used to enter monitored equipment data when isolation measures are in place because of safety requirements or when data access is not supported.

Preferably, the data preprocessing unit is further used to call and receive the new time-series data that the equipment monitoring device pushes into the distributed storage, and to repeat the training process on the new time-series data so as to update the data mining model.

Preferably, the manual input device is a notebook computer, a tablet computer, or a mobile phone.

Preferably, the computing units related to the data preprocessing unit include, but are not limited to, one or more of an invalid value filtering unit, a missing value supplementing unit, a data column selection unit, a data column conversion unit, a data column appending unit, and a data set merging unit. These can be combined with one another according to the specific business and support extension. Specifically:

Invalid value filtering unit: a rule engine is used to allow free configuration of combined condition judgment rules; invalid records are removed, and the data that meets the requirements is retained and passed to the next processing step;

Missing value supplementing unit: computation functions are used to allow free configuration of the missing value calculation logic; the logic for filling in missing values can be customized in a specific computing job, and the data for which the fill-in calculation has been completed is passed to the next processing step;

Data column selection unit: the original data set contains n fields, of which m fields (m <= n) are freely selected and passed to the next processing step;

Data column conversion unit: the name or data format of certain columns of the original data set is changed, and the converted data is passed to the next processing step;

Data column appending unit: the original data set contains n fields, and m fields are freely appended; the name, data type, and data value of each new field can be customized, and the data with the appended columns is passed to the next processing step;

Data set merging unit: an aggregation node for multiple data sets that supports SQL queries; the resulting data set is passed to the next processing step.

The present invention also provides a processing method for the device for graphical data preprocessing based on big data technology, comprising the following steps:

(1) Initialization: the initial parameters of the data acquisition device are set, and according to these parameters the data acquisition device is controlled to sample 10 times per hour over a sampling time of 7 days; the data sampled over the 7 days is averaged to obtain A;

(2) Under the same initial parameter conditions, data is collected in real time. Every 4 consecutively collected data points are taken as one group [B C D E], the 4 data points being denoted B, C, D, and E respectively, and the difference score M is calculated for each of them using the formula, wherein:

A' in the formula is one of the values B, C, D, E;

(3) If the difference scores are within the threshold range, the collected data group is considered valid, and B, C, D, and E are averaged to obtain M. Let P' be the real-time measured value of the data acquisition device; then:

A. If [condition not reproduced in the source text], the performance of the data acquisition device is stable, and the method proceeds to step (4);

B. If [condition not reproduced in the source text], the performance of the data acquisition device is unstable, and the method returns to step (1);
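
Steps (1) to (3) can be sketched as follows. This is a minimal illustration, not the patented calculation: the difference-score formula and the stability conditions are not reproduced in this text, so the scoring rule is passed in as a parameter, and `relative_deviation` is a purely hypothetical stand-in. Treating "within the threshold range" as `score <= threshold` is likewise an assumption.

```python
from statistics import mean
from typing import Callable, Iterable, Iterator, List, Optional

SAMPLES_PER_HOUR = 10  # from step (1)
BASELINE_DAYS = 7      # from step (1)
GROUP_SIZE = 4         # from step (2): B, C, D, E


def expected_baseline_count() -> int:
    """Number of samples that are averaged into the baseline A."""
    return SAMPLES_PER_HOUR * 24 * BASELINE_DAYS


def groups_of_four(readings: Iterable[float]) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping groups [B, C, D, E]."""
    group: List[float] = []
    for value in readings:
        group.append(value)
        if len(group) == GROUP_SIZE:
            yield group
            group = []


def validate_group(
    group: List[float],
    baseline: float,
    score: Callable[[float, float], float],
    threshold: float,
) -> Optional[float]:
    """Return the group average if every score is within the threshold, else None."""
    if all(score(value, baseline) <= threshold for value in group):
        return mean(group)
    return None


def relative_deviation(value: float, baseline: float) -> float:
    """HYPOTHETICAL scoring rule; the source formula is not available."""
    return abs(value - baseline) / abs(baseline)


# Small stand-in for the 7-day baseline average A.
baseline_a = mean([50.0, 52.0, 48.0, 50.0])
readings = [50.5, 49.5, 51.0, 49.0, 50.0, 75.0, 50.0, 50.0]
results = [
    validate_group(g, baseline_a, relative_deviation, 0.1)
    for g in groups_of_four(readings)
]
```

With these sample values the first group is accepted and averaged, while the second group is rejected because one reading deviates strongly from the baseline. A real deployment would substitute the actual formula and thresholds.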

(4) Equipment information monitoring data is obtained in real time or near real time, and the collected equipment information monitoring data is transmitted to the equipment monitoring device, which either pushes it to the distributed storage (mainly ledger data, historical data, and massive heterogeneous data) or outputs the equipment monitoring data as a stream to the data preprocessing process;

(5) In batch access mode, the routine ledger data and historical data in the distributed storage are obtained automatically according to a predefined scheduling plan; the data preprocessing unit extracts, transforms, and loads the massive heterogeneous equipment data according to the preprocessing rules, and the preprocessed data is output to the distributed storage for storage;

(6) In streaming access mode, the massive heterogeneous equipment data in the distributed storage is obtained through a predefined system driver; the data preprocessing unit extracts, transforms, and loads the data under the preprocessing rules, and the preprocessed data is output to the distributed storage for storage;

(7) During data preprocessing, the computing engine drives the scheduling rule engine to call and receive the data stored in the distributed storage, the called and received data is processed according to pre-programmed processing logic, a data mining model is trained, and the data processed by the ETL processing unit is written back to the distributed storage.
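
The difference between the batch access of step (5) and the streaming access of step (6) can be illustrated with a small sketch in which both modes apply the same preprocessing rule. The record layout, the `preprocess` rule, and the in-memory list standing in for the distributed storage are all assumptions made for illustration; the patent's device uses Spark and a distributed store instead.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def preprocess(record: Record) -> Record:
    """Stand-in preprocessing rule: trim the device id, cast the value to float."""
    return {
        "device": str(record["device"]).strip(),
        "value": float(record["value"]),
    }


def run_batch(stored: List[Record]) -> List[Record]:
    """Batch access: process everything currently held in storage in one pass."""
    return [preprocess(record) for record in stored]


def run_streaming(source: Iterable[Record]) -> Iterator[Record]:
    """Streaming access: process each record as soon as it arrives."""
    for record in source:
        yield preprocess(record)


# A plain list stands in for the distributed storage (assumption).
storage: List[Record] = []
storage.extend(run_batch([{"device": " T1 ", "value": "3.5"}]))
for result in run_streaming(iter([{"device": "T2", "value": 4}])):
    storage.append(result)
```

Keeping the rule in one function and varying only how records are fed to it mirrors the way the method reuses the same preprocessing rules for scheduled and streamed data.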

The device and method for graphical data preprocessing based on big data technology of the present invention can achieve the following:

1) With a stable, reliable, and efficient open-source distributed storage system and parallel computing service as the core, the data preprocessing process is handed over to distributed computing units for execution, which reduces the complexity of data processing and improves the access throughput of time-series data;

2) The ETL processing unit is built on the Spark in-memory computing engine, and the data extraction, data transformation, and data loading logic is converted into computing units that support parameter configuration and dynamic combination. Together with a graphical flow configuration tool, this allows the data preprocessing process to be flexibly customized, which not only solves the performance problem of heterogeneous data preprocessing but also effectively improves the reusability and flexibility of data preprocessing programs;

3) For the reliability of system data, an averaged-data confirmation scheme is designed, which makes the equipment monitoring data more reliable and stable, reduces the workload of the device, extends its service life, and makes its performance more stable.

4) The optimized way of evaluating the performance of the data acquisition device makes the data more reliable.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the device for graphical data preprocessing based on big data.

Detailed description of the embodiments

Specific implementations of the present invention are described in detail below. It must be pointed out that the following embodiments only serve to further illustrate the present invention and cannot be construed as limiting its scope of protection. Non-essential modifications and adaptations made to the present invention by those skilled in the art on the basis of the above summary still fall within the scope of protection of the present invention.

The invention provides a device and method for graphical data preprocessing based on big data technology. As shown in Fig. 1, the device includes a data acquisition device 1, an equipment monitoring device 2, a distributed storage 3, a Spark in-memory computing engine 4, a computing unit 5, an ETL processing unit 6, and a data preprocessing unit 7, wherein the equipment monitoring device 2 is connected to the data acquisition device 1 and to the distributed storage 3, the distributed storage 3 is connected to the data preprocessing unit 7, the data preprocessing unit 7 includes the ETL processing unit 6, and the ETL processing unit 6 includes the Spark in-memory computing engine 4 and the computing unit 5;

The data acquisition device is used to obtain equipment monitoring data in real time or near real time and to transmit the collected equipment monitoring data to the equipment monitoring device. The data acquisition device is an information acquisition sensor installed on the monitored equipment; it can also be a sensor such as a camera or a temperature detector in the area where the monitored equipment is installed. The equipment monitoring device can store the equipment information monitoring data in real time and output it to the distributed storage either by pushing or by streaming output.

The equipment monitoring device is used to collect the equipment information data and to deliver the equipment monitoring data to the distributed storage, either by pushing or by streaming output.

The distributed storage, also called the time-series data storage, is used to store the ledger data, historical data, indicator data, and massive heterogeneous data that is pushed by the equipment monitoring device or that has passed through the data preprocessing unit.

The Spark in-memory computing engine is the driver of data computation. It computes on the data by calling the logic rules of the computing unit and outputs the computed data to the distributed storage.

The computing unit, also called an operator, is used to drive the scheduling rule engine to call and receive the data stored in the distributed storage; it can process the called and received data according to pre-programmed processing logic and train a data mining model. The data extraction, data transformation, and data loading logic is converted into computing units that support parameter configuration and dynamic combination. In the graphical flow configuration tool, the data preprocessing process is orchestrated according to actual business needs and flexibly configured by dragging operators. The computing unit includes a plurality of sub-computing units, which are graphically and dynamically configured according to actual business needs and dynamically orchestrated into jobs. Each sub-computing unit exists independently and can be extended and evolved independently according to the experience of industry experts. A distributed streaming computing engine computes on the called and received data, outputs the results in real time, and writes the data to the distributed data storage;

A computing unit is also a component of a computing job. A computing job defines the topology and execution logic of the computing tasks (also called job nodes) and is similar to a workflow. It can be defined in the graphical flow configuration tool provided by the system: by dragging job nodes, the user freely combines and configures them to form a job task. From the point of view of the computing engine, each job node corresponds to a computing unit (Compute Unit), and the program logic of a computing unit is called an operator (Transformation). The system provides a visual modeling tool with a rich set of preset data processing and data display operators, and it publishes an operator development specification to support secondary development for actual business scenarios.
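
The idea of a job as a topology of job nodes can be sketched as a small dependency graph that is executed in topological order. The node names, the two example operators, and the `run_job` helper are hypothetical; Python's standard `graphlib.TopologicalSorter` (Python 3.9+) is used here only as a convenient stand-in for the engine's scheduler.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Operator = Callable[[Rows], Rows]
# A job maps each node name to (operator, names of upstream nodes).
Job = Dict[str, Tuple[Operator, List[str]]]


def drop_negative(rows: Rows) -> Rows:
    """Hypothetical operator: keep rows whose value is not negative."""
    return [row for row in rows if row["value"] >= 0]


def add_unit(rows: Rows) -> Rows:
    """Hypothetical operator: append a constant 'unit' column."""
    return [{**row, "unit": "C"} for row in rows]


def run_job(job: Job, source_rows: Rows) -> Dict[str, Rows]:
    """Run every node after its upstream nodes and collect each node's output."""
    graph = {name: deps for name, (_, deps) in job.items()}
    outputs: Dict[str, Rows] = {}
    for name in TopologicalSorter(graph).static_order():
        operator, deps = job[name]
        if deps:
            # Nodes with several upstream nodes receive the concatenated rows.
            inputs = [row for dep in deps for row in outputs[dep]]
        else:
            inputs = source_rows
        outputs[name] = operator(inputs)
    return outputs


# The declaration order does not matter; the dependencies fix the run order.
job: Job = {
    "append": (add_unit, ["filter"]),
    "filter": (drop_negative, []),
}
result = run_job(job, [{"value": 1}, {"value": -2}])
```

Describing the job as data rather than as code is what would let a graphical tool build it by dragging and connecting nodes.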

ETL is responsible for extracting data from scattered, heterogeneous data sources, such as relational data and unstructured data files, into a temporary intermediate layer, then cleaning, transforming, and integrating it, and finally loading it into a data warehouse or data mart, where it serves data mining as data for decision support. The ETL processing unit integrates most of the functions of an ETL tool. Its main task is to form jobs from dynamically orchestrated computing units. An ETL visual processing framework is built on the Spark in-memory computing engine, and the data extraction, data transformation, and data loading logic is converted into computing units that support graphical parameter configuration and dynamic combination, so that the ETL data processing procedure is displayed more intuitively.
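
The extract, transform, and load stages described above can be sketched with two heterogeneous sources that are mapped onto one schema. The field names (`device_id`, `dev`, `temperature`, `temp`), the CSV and JSON-lines inputs, and the list standing in for the data warehouse are assumptions chosen for illustration.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, List


def extract_csv(text: str) -> Iterator[Dict[str, str]]:
    """Extract: read records from CSV text with a header row."""
    yield from csv.DictReader(io.StringIO(text))


def extract_json_lines(text: str) -> Iterator[dict]:
    """Extract: read one JSON object per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def transform(record: dict) -> dict:
    """Transform: map both source layouts onto one schema (names are hypothetical)."""
    device = record["device_id"] if "device_id" in record else record["dev"]
    value = record["temperature"] if "temperature" in record else record["temp"]
    return {"device_id": str(device), "temperature": float(value)}


def load(records: Iterable[dict], warehouse: List[dict]) -> None:
    """Load: append the unified records to the target store."""
    warehouse.extend(records)


csv_text = "device_id,temperature\nT1,36.5\n"
json_text = '{"dev": "T2", "temp": 40}\n'

warehouse: List[dict] = []
load((transform(r) for r in extract_csv(csv_text)), warehouse)
load((transform(r) for r in extract_json_lines(json_text)), warehouse)
```

Each stage is a separate function, so a source with a new format only needs a new extractor and, if its field names differ, an extended mapping in `transform`.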

The data preprocessing unit is used to preprocess the heterogeneous equipment information data by extraction, transformation, and loading according to the ETL processing unit. It can also standardize data formats, remove abnormal data, correct errors, and remove duplicate data; combine the data from multiple data sources into unified storage; and convert the data, by means such as smoothing, aggregation, data generalization, and normalization, into a form suitable for data mining. Before it is stored, the incoming massive heterogeneous data can undergo the necessary preprocessing, in which operations such as data extraction, data transformation, and data loading are carried out using pre-configured preprocessing rules. Equipment data (or other data) enters the data preprocessing program as a data stream, through timed scheduling, or through manual import, and the results are output to a specified location according to the configuration of the specific processing job. The preprocessing logic is implemented in a configurable and visual manner. Each configurable logic unit is called an operator. According to actual business needs, data preprocessing operators are dragged in the graphical tool and dynamically orchestrated to form a preprocessing procedure, and the relevant parameters of each operator are configured. The whole preprocessing logic is called a job, and the parallelization of a job is achieved through the parallelization of its operators.
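
Three of the cleaning steps named above (removing duplicate data, normalization, and smoothing) can be sketched in plain Python. The source does not say which normalization or smoothing technique is used; min-max scaling and a simple moving average are chosen here only as common examples, and all function names are hypothetical.

```python
from typing import List, Tuple


def remove_duplicates(records: List[Tuple]) -> List[Tuple]:
    """Drop repeated records while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def moving_average(values: List[float], window: int) -> List[float]:
    """Smooth a series by averaging each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


deduplicated = remove_duplicates([("T1", 1), ("T1", 1), ("T2", 2)])
normalized = min_max_normalize([10.0, 20.0, 30.0])
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], 2)
```

In the device described here, such steps would run as operators on the distributed engine rather than on in-memory lists.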

The operators related to the data preprocessing unit include, but are not limited to, invalid value filtering, missing value supplementing, data column selection, data column conversion, data column appending, and data set merging. They can be combined with one another according to the specific business and support extension.

Invalid value filtering: a rule engine is used to allow free configuration of combined condition judgment rules; invalid records can be removed, and the data that meets the requirements is retained and passed to the next processing step.

Missing value supplementing: computation functions are used to allow free configuration of the missing value calculation logic; the logic for filling in missing values can be customized in a specific computing job, and the data for which the fill-in calculation has been completed is passed to the next processing step.

Data column selection: the original data set contains n fields, of which m fields (m <= n) can be freely selected and passed to the next processing step.

Data column conversion: the name or data format of certain columns of the original data set is changed (example: converting a numeric type to a string type), and the converted data is passed to the next processing step.

Data column appending: the original data set contains n fields, and m fields can be freely appended; the name, data type, and data value of each new field can be customized (example: appending a "creation time" field to a data set), and the data with the appended columns is passed to the next processing step.

Data set merging: an aggregation node for multiple data sets that supports SQL queries; the resulting data set is passed to the next processing step.
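
The six operators listed above can be sketched as small functions over lists of dictionaries. This is an illustrative sketch only: the function names and signatures are hypothetical, and an in-memory SQLite database stands in for the SQL support of the merging node, which in the described device would be provided by the Spark-based engine.

```python
import sqlite3
from typing import Callable, Dict, List

Rows = List[dict]


def filter_invalid(rows: Rows, is_valid: Callable[[dict], bool]) -> Rows:
    """Invalid value filtering: keep only rows that satisfy the rule."""
    return [row for row in rows if is_valid(row)]


def fill_missing(rows: Rows, field: str, compute: Callable[[dict], object]) -> Rows:
    """Missing value supplementing: compute a value where the field is None."""
    return [
        {**row, field: compute(row)} if row.get(field) is None else row
        for row in rows
    ]


def select_columns(rows: Rows, fields: List[str]) -> Rows:
    """Data column selection: keep m of the n fields."""
    return [{field: row[field] for field in fields} for row in rows]


def convert_column(rows: Rows, field: str, new_name: str, cast: Callable) -> Rows:
    """Data column conversion: rename a column and change its data type."""
    converted = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != field}
        new_row[new_name] = cast(row[field])
        converted.append(new_row)
    return converted


def append_column(rows: Rows, name: str, value_of: Callable[[dict], object]) -> Rows:
    """Data column appending: add a new field with a computed value."""
    return [{**row, name: value_of(row)} for row in rows]


def merge_with_sql(tables: Dict[str, Rows], query: str) -> List[tuple]:
    """Data set merging: load named, non-empty data sets and run a SQL query.

    Table and column names are interpolated directly, so they must come
    from trusted configuration, never from user input.
    """
    connection = sqlite3.connect(":memory:")
    try:
        for table, rows in tables.items():
            columns = list(rows[0].keys())
            column_sql = ", ".join(f'"{column}"' for column in columns)
            marks = ", ".join("?" for _ in columns)
            connection.execute(f'CREATE TABLE "{table}" ({column_sql})')
            connection.executemany(
                f'INSERT INTO "{table}" VALUES ({marks})',
                [tuple(row[column] for column in columns) for row in rows],
            )
        return connection.execute(query).fetchall()
    finally:
        connection.close()
```

A possible chain, with made-up sample rows, passes the output of each operator to the next processing step:

```python
rows = [{"id": 1, "temp": 20.0}, {"id": 2, "temp": None}, {"id": 3, "temp": -999.0}]
rows = filter_invalid(rows, lambda r: r["temp"] is None or r["temp"] > -100)
rows = fill_missing(rows, "temp", lambda r: 0.0)
rows = convert_column(rows, "temp", "temp_text", str)
rows = append_column(rows, "source", lambda r: "sensor")
```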

The present invention also provides a processing method for the device for graphical data preprocessing based on big data technology, comprising the following steps in sequence:

A' in the formula is one of the values B, C, D, E;

A. If [condition not reproduced in the source text], the performance of the data acquisition device is stable, and the method proceeds to step (4);

B. If [condition not reproduced in the source text], the performance of the data acquisition device is unstable, and the method returns to step (1);

The device and method for graphical data preprocessing based on big data technology of the present invention are implemented through the cooperation of software and hardware devices, but are not limited to this; under certain conditions they can also be implemented entirely in software.

Although illustrative embodiments of the present invention have been described for purposes of explanation, those skilled in the art will understand that various modifications, additions, and substitutions in form and detail can be made without departing from the scope and spirit of the invention as disclosed in the appended claims, that all such changes fall within the scope of protection of the appended claims, and that the parts of the claimed product and the steps of the claimed method can be combined in any form. The description of the embodiments disclosed herein is therefore intended to describe the present invention and not to limit its scope. Accordingly, the scope of the present invention is not limited by the above embodiments but is defined by the claims or their equivalents.

Claims

1. A device for graphical data preprocessing based on big data technology, characterized in that it includes a data acquisition device, an equipment monitoring device, a distributed storage, a Spark in-memory computing engine, a computing unit, an ETL processing unit, and a data preprocessing unit, wherein the data acquisition device is connected to the equipment monitoring device, the equipment monitoring device is connected to the distributed storage, the distributed storage is connected to the data preprocessing unit, and the data preprocessing unit includes the Spark in-memory computing engine, the computing unit, and the ETL processing unit;

the data acquisition device is used to obtain heterogeneous equipment information data in real time or near real time, and to transmit the collected heterogeneous equipment information data to the equipment monitoring device;

the distributed storage, also called the time-series data storage, is used to store the massive real-time heterogeneous equipment data and the equipment data that has passed through the data preprocessing unit;

the Spark in-memory computing engine is used to compute on the data by calling the logic rules of the computing unit, and to output the computed data to the distributed storage;

the computing unit is used to drive the scheduling rule engine to call and receive the data stored in the distributed storage, to process the called and received data according to pre-programmed processing logic, and to train a data mining model;

the computing unit includes a plurality of sub-computing units, which are graphically and dynamically configured according to actual business needs and dynamically orchestrated into jobs; each sub-computing unit exists independently and can be extended and evolved independently according to the experience of industry experts; a distributed streaming computing engine computes on the called and received data, outputs the results in real time, and writes the data to the distributed data storage;

the ETL processing unit is used to form jobs based on the dynamic orchestration of computing units, is built on the Spark in-memory computing engine, and converts the data extraction, data transformation, and data loading logic into a form that supports graphical parameter configuration and dynamic combination;

the data preprocessing unit is used to preprocess the heterogeneous equipment information data by extraction, transformation, and loading according to the ETL processing unit; it can also standardize data formats, remove abnormal data, correct errors, and remove duplicate data; combine the data from multiple data sources into unified storage; and convert the data, by means of smoothing, aggregation, data generalization, and/or normalization, into a form suitable for data mining.

2. The device as claimed in claim 1, characterized in that the data acquisition device is a data acquisition sensor installed on the monitored equipment.

3. The device as claimed in claim 1, characterized in that the data acquisition device is an infrared detector or a temperature detector in the area where the monitored equipment is installed.

4. The device as claimed in claim 1, characterized in that it further includes a manual input device connected to the equipment monitoring device, used to enter monitored equipment data when isolation measures are in place because of safety requirements or when data access is not supported.

5. The device as claimed in claim 1, characterized in that the data preprocessing unit is further used to call and receive the new time-series data that the equipment monitoring device pushes into the distributed storage, and to repeat the training process on the new time-series data so as to update the data mining model.

6. The device as claimed in claim 1, characterized in that the manual input device is a notebook computer, a tablet computer, or a mobile phone.

7. The device as claimed in claim 1, characterized in that the computing units related to the data preprocessing unit include, but are not limited to, one or more of an invalid value filtering unit, a missing value supplementing unit, a data column selection unit, a data column conversion unit, a data column appending unit, and a data set merging unit, which can be combined with one another according to the specific business and support extension, specifically:

invalid value filtering unit: a rule engine is used to allow free configuration of combined condition judgment rules; invalid records are removed, and the data that meets the requirements is retained and passed to the next processing step;

missing value supplementing unit: computation functions are used to allow free configuration of the missing value calculation logic; the logic for filling in missing values can be customized in a specific computing job, and the data for which the fill-in calculation has been completed is passed to the next processing step;

data column selection unit: the original data set contains n fields, of which m fields (m <= n) are freely selected and passed to the next processing step;

data column conversion unit: the name or data format of certain columns of the original data set is changed, and the converted data is passed to the next processing step;

data set merging unit: an aggregation node for multiple data sets that supports SQL queries; the resulting data set is passed to the next processing step.

8. A processing method using the device for graphical data preprocessing based on big data technology as claimed in any one of claims 1-6, characterized in that it comprises the following steps in sequence:

(1) initialization: the initial parameters of the data acquisition device are set, and according to these parameters the data acquisition device is controlled to sample 10 times per hour over a sampling time of 7 days; the data sampled over the 7 days is averaged to obtain A;

(2) under the same initial parameter conditions, data is collected in real time; every 4 consecutively collected data points are taken as one group [B C D E], the 4 data points being denoted B, C, D, and E respectively, and the difference score M is calculated for each of them using the formula, wherein:

A' in the formula is one of the values B, C, D, E;

(3) if the difference scores are within the threshold range, the collected data group is considered valid, and B, C, D, and E are averaged to obtain M; letting P' be the real-time measured value of the data acquisition device, then:

A. if [condition not reproduced in the source text], the performance of the data acquisition device is stable, and the method proceeds to step (4);

B. if [condition not reproduced in the source text], the performance of the data acquisition device is unstable, and the method returns to step (1);

(4) equipment information monitoring data is obtained in real time or near real time, and the collected equipment information monitoring data is transmitted to the equipment monitoring device, which either pushes it to the distributed storage (mainly ledger data, historical data, and massive heterogeneous data) or outputs the equipment monitoring data as a stream to the data preprocessing process;

(5) in batch access mode, the routine ledger data and historical data in the distributed storage are obtained automatically according to a predefined scheduling plan; the data preprocessing unit extracts, transforms, and loads the massive heterogeneous equipment data according to the preprocessing rules, and the preprocessed data is output to the distributed storage for storage;

(6) in streaming access mode, the massive heterogeneous equipment data in the distributed storage is obtained through a predefined system driver; the data preprocessing unit extracts, transforms, and loads the data under the preprocessing rules, and the preprocessed data is output to the distributed storage for storage;

(7) during data preprocessing, the computing engine drives the scheduling rule engine to call and receive the data stored in the distributed storage, the called and received data is processed according to pre-programmed processing logic, a data mining model is trained, and the data processed by the ETL processing unit is written back to the distributed storage.