CN107193967A - Multi-source heterogeneous industry-domain big data full-link processing solution - Google Patents

Multi-source heterogeneous industry-domain big data full-link processing solution Download PDF

Info

Publication number
CN107193967A
CN107193967A CN201710376130.9A CN201710376130A CN107193967A
Authority
CN
China
Prior art keywords
data
layer
storage
analysis
task
Prior art date
Application number
CN201710376130.9A
Other languages
Chinese (zh)
Inventor
张莹
罗永洪
杨志帆
史慧珂
宋珂慧
袁晓洁
Original Assignee
南开大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南开大学
Priority to CN201710376130.9A priority Critical patent/CN107193967A/en
Publication of CN107193967A publication Critical patent/CN107193967A/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/904Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing

Abstract

A multi-source heterogeneous industry-domain big data full-link processing solution. Based on an understanding of the volume, variety, velocity, and value characteristics of industry-domain big data, and driven by the application requirements of industry-domain management decision-making, the invention designs and implements a full-link processing solution for multi-source heterogeneous industry-domain big data. The solution first summarizes the industry-domain big data processing workflow, then proposes a three-layer storage architecture for industry-domain big data, further proposes a multi-level, multi-dimensional data analysis and knowledge discovery scheme oriented to management decision-making, and finally builds a full-link data processing platform for industry-domain big data.

Description

Multi-source heterogeneous industry-domain big data full-link processing solution
Technical field
The invention belongs to the technical field of big data processing, and specifically relates to a full-link processing solution for multi-source heterogeneous industry-domain big data.
Background art
With the deep application of information technology in every industry, industry-domain big data is generated and accumulated at high speed: for example, transaction data in the financial sector, network data in telecommunications, traffic-flow data in transportation, and student-behavior data in education. Industry-domain big data exhibits the classic 4V characteristics of big data: Volume, Variety, Velocity, and Value.
These data are huge in scale and diverse in kind, yet rich in latent value. If knowledge can be discovered from such massive, complex data, it can provide intelligent support for activities in every industry and profoundly influence human decision-making and the operation of the social economy. Driven by big data, all industries have come to realize the urgent need for powerful big data management and processing platforms, effective data processing techniques, and intelligent data analysis algorithms, to support real-time statistics, data analysis, and complex data mining. However, producing value from big data through collection, integration, storage, and analysis is a complex process, and the characteristics of industry-domain big data pose many challenges and difficulties for its management and processing:
1) Facing industry-domain big data that is distributed across systems, diverse in form, and heterogeneous in structure, how can a general-purpose industry-domain big data processing workflow be devised?
2) Facing huge data volumes and complex data types, how can an efficient data storage architecture be designed to meet real-time business analysis needs?
3) Facing industry-domain decision requirements, how can multi-angle data analysis models be built to truly mine the value contained in big data?
4) How can a full-link multi-source heterogeneous industry-domain big data processing platform be built that puts the processing workflow, the efficient storage architecture, and the data analysis models into practice?
In summary, with the arrival of the information explosion era, the research value of big data is enormous. With the rapid development of data management, data mining, cloud computing, and related research, many effective techniques for data integration, data fusion, data storage, distributed computing, and data analysis have emerged one after another, providing the theoretical foundation and technical support for research on multi-source heterogeneous industry-domain big data processing. In this context, studying multi-source heterogeneous data integration, multi-layer storage architectures, and multi-level multi-dimensional data analysis oriented to management decision-making, and building a full-link multi-source heterogeneous industry-domain big data processing platform, has important research significance and application value for the characteristics of industry-domain big data.
Summary of the invention
The invention aims to solve the problems of how to manage huge, heterogeneous industry-domain big data and how to obtain knowledge from it. Based on an understanding of the volume, variety, velocity, and value characteristics of industry-domain big data, and according to the application requirements of industry-domain management decision-making, the key technologies of multi-source heterogeneous domain big data processing, including data cleansing and fusion, data storage, and deep analysis, are studied in depth, and a full-link processing solution for multi-source heterogeneous industry-domain big data is designed and implemented.
The detailed steps of the multi-source heterogeneous industry-domain big data full-link processing solution provided by the invention are as follows:
1. Summarize the industry-domain big data processing workflow
Existing big data processing platforms are surveyed, and on the basis of the traditional basic big data processing flow, combined with the visualizability and veracity characteristics of industry-domain big data and with domain application requirements, the most basic workflow for industry-domain big data processing is explored;
The overall industry-domain big data processing workflow is defined as follows: assisted by various collection devices, instruments, and systems, data is extracted and integrated from multiple heterogeneous industry-domain data sources and stored in a unified way according to its characteristics; then, combined with the multi-angle requirements of industry applications, the stored data is analyzed and mined with data analysis techniques to obtain management decision knowledge, and the results are presented to users through visual analysis. The workflow can be divided into a data extraction and integration module, a data storage module, a data analysis module, and a visual analysis module.
(1) Data extraction and integration module
The data sources of industry-domain big data are diverse, for example data collected in real time by devices and data produced by operational systems with different purposes. To process big data, the required data must first be extracted from the different sources. Because data comes from many sources and in many structures, and inconsistencies may exist between data, multi-source heterogeneous data must be integrated after extraction. This includes traditional data cleansing methods such as removing dirty data and converting data types, and, for the veracity characteristic specific to domain big data, a data fusion function that resolves data conflicts, so as to guarantee the completeness, consistency, and correctness of the data and improve data quality.
(2) Data storage module
In the big data processing procedure, various kinds of data need unified storage management according to certain standards. In the industry-domain big data processing workflow there exist the raw data; the intermediate data produced by cleansing, fusion, and transformation; and the result data used for data analysis and visual presentation. Their data types, data scales, and data uses differ. The invention therefore builds a three-layer unified storage architecture for industry-domain data, to meet the storage needs of data of different purposes and scales at different levels.
(3) Data analysis module
Data analysis is the core of the whole industry-domain big data processing workflow and the embodiment of the value characteristic of big data. The knowledge value density in big data is low and the knowledge coverage is broad, so the data must be analyzed from all directions in certain ways. Traditional data statistics and data mining algorithms cannot process massive data in time under the big data background. Distributed computing frameworks played a key role in the first decade of the twenty-first century, but performance bottlenecks have also appeared over time. The invention builds a multi-level multi-dimensional data analysis model on the new-generation distributed computing framework Spark, to meet the data analysis needs of the big data era.
(4) Visual analysis module
Industry-domain big data processing must not only complete data analysis tasks but also present the analysis results intuitively. Visual analysis targets the visualizability characteristic of industry-domain big data: through visual analysis technology, textual data is converted into intuitive, easy-to-operate charts, giving decision-makers a clearer impression. The invention builds a visual analysis module based on a combination of HTML5 and ECharts.js, providing data selection, data column selection, automatic code generation, and editable code, to meet the visual analysis needs of the industry-domain big data background.
2. Design the three-layer data storage model
After the raw industry-domain data is obtained, it must still be cleansed, fused, and mined; industry-domain big data analysis therefore needs multi-level storage, to support the storage, management, and computation of result data, departmental decision data, basic analysis data, and raw data.
Definition 1: the indices to be considered when choosing a data storage model are defined as follows:
1. Query cost: the time spent querying a unit of data in the data storage model;
2. Insert cost: the time spent inserting a unit of data into the data storage model;
3. Delete cost: the time spent deleting a unit of data from the data storage model;
4. Compression ratio: the ratio of the data size after compression to the size before compression;
When choosing the storage model for each layer, each index influences each layer's storage model to a different degree; by analyzing the influence of these indices on each layer, suitable data storage facilities are chosen to build the three-layer data storage model.
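The selection procedure above can be sketched as a weighted scoring over the four indices of Definition 1. This is a minimal illustration, not the patent's method: the candidate engines match those chosen later in the embodiment, but the numeric index values and weights are invented assumptions.

```python
# Hypothetical sketch: choosing a storage engine per layer by weighting the
# four indices of Definition 1 (query cost, insert cost, delete cost,
# compression ratio). All numbers below are illustrative assumptions.

# Lower is better for every index, so the best candidate minimizes the
# weighted sum. Each layer weights the indices differently.
CANDIDATES = {
    # engine: (query_cost, insert_cost, delete_cost, compression_ratio)
    "HBase":   (0.6, 0.3, 0.4, 0.3),
    "MongoDB": (0.4, 0.4, 0.4, 0.6),
    "MySQL":   (0.2, 0.6, 0.5, 0.8),
}

def choose_engine(weights):
    """weights: (w_query, w_insert, w_delete, w_compression)."""
    def score(metrics):
        return sum(w * m for w, m in zip(weights, metrics))
    return min(CANDIDATES, key=lambda name: score(CANDIDATES[name]))

# Source layer: huge write-once volume -> compression and insert cost dominate.
source_layer = choose_engine((0.1, 0.4, 0.1, 0.4))
# Result layer: small, frequently queried -> query cost dominates.
result_layer = choose_engine((0.7, 0.1, 0.1, 0.1))
print(source_layer, result_layer)  # HBase MySQL
```

With these assumed weights the procedure reproduces the per-layer choices made in the embodiment (HBase for the source layer, MySQL for the result layer).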
2.1 Design the first layer: the source data layer storage model
To preserve domain data as completely as possible, the invention proposes the source data layer storage model, which stores the source data obtained from the many external data sources in the industry domain. The data volume in the source layer storage model is very large. Most of it is structured data, such as common pipeline records; this kind of data is usually stored as tables in databases such as SQL Server and MySQL, and is easy to query and clear in structure. Under the industry-domain big data background, the data also includes unstructured and semi-structured data, such as the log files generated by web systems or the picture files in management systems. Such data is usually stored in text or XML file formats; its degree of structure is low and it is hard to mine.
Source-layer data is mostly preserved as historical data and is not used for daily data analysis access. Since the source layer is the start node of the whole storage model, any modification or loss of its data would cause irreversible consequences. It therefore has a long life cycle, low access frequency, low data value density, and data that cannot be reproduced once lost.
2.2 Design the second layer: the intermediate data layer storage model
The intermediate data layer storage model stores the tables for analysis produced from the source layer storage model through one or more rounds of data integration. These tables have gone through data cleansing and data fusion and one or more ETL processes; their knowledge density is higher and their data formats more unified, and they give strong support to daily data analysis and data mining. The intermediate layer classifies the tables produced from the source layer by data integration as basic data layer tables, and the downstream tables produced from basic data layer tables by ETL as data mart layer tables. Basic data layer tables are staging tables produced from the source layer by data integration, supporting daily data analysis; data mart layer tables, produced from basic data layer tables by ETL, go deep into a specific subdomain of the field and support analysis in a specific direction.
Through the interaction of basic data layer tables and data mart layer tables, the intermediate layer reduces the dependence on the source layer storage model: data is extracted once and used many times, so daily data analysis concentrates on the intermediate layer. On demand, the intermediate layer extracts the latest data from the source layer daily or hourly and completes the layer-by-layer output from the upstream root nodes (basic data layer tables) to the downstream leaf nodes (data mart layer tables), cycling to obtain the latest data and refresh the data used for mining and analysis, yielding more timely results.
Intermediate-layer data is reproducible, is accessed frequently, has complex upstream-downstream relationships, and has a definite life cycle. Because intermediate-layer data is derived from the source layer, if data is lost the data integration process can simply be re-executed to recover it, so its loss is reproducible. Intermediate-layer tables are accessed very frequently, because all downstream tables of a table need to read it, and data mining analysis may read it as well. The tables often have complex upstream-downstream relationships: in the daily table generation phase, a downstream table must wait for its upstream tables to finish. The life cycle of intermediate-layer data is determined by the needs of data analysis: if the analysis needs to read the previous 7 days of data every day, the table's life cycle can be set to 7-8 days, and the data is deleted when the life cycle ends. This meets the needs of data analysis while reducing storage consumption.
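The life-cycle rule above can be sketched as a simple retention policy over daily table partitions. Partition naming and the 8-day window are illustrative assumptions taken from the 7-8-day example in the text.

```python
# Minimal sketch of the life-cycle policy described above: daily partitions of
# an intermediate-layer table are dropped once they age past the retention
# window. The 8-day window follows the 7-8-day example in the text.
from datetime import date, timedelta

def expire_partitions(partitions, today, retention_days=8):
    """partitions: dict mapping partition date -> rows. Returns the kept dict."""
    cutoff = today - timedelta(days=retention_days)
    return {d: rows for d, rows in partitions.items() if d > cutoff}

today = date(2017, 5, 25)
parts = {today - timedelta(days=n): [] for n in range(12)}  # 12 daily partitions
kept = expire_partitions(parts, today)
print(len(kept))  # 8: only the most recent 8 days survive
```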
2.3 Design the third layer: the result presentation layer storage model
Intermediate-layer data is still big data, often at the GB or TB level. The results of data mining and statistical analysis, in contrast, usually have a highly aggregated knowledge character: their content is mostly higher-order statistical results or data summarized per class or cluster. These data are often very small, at the KB or MB level.
The result presentation layer storage model stores the result tables produced from the above (the intermediate data layer storage model) by statistical analysis and data mining analysis. These results are coarse-grained and knowledge-dense, and support routine analysis and decision-making in the domain.
The data of the result presentation layer is frequently accessed and small in volume. It is often referenced in many places, for example to generate result display tables, or real-time and pseudo-real-time charts such as line charts. Every chart drawing and display constitutes a data access, so both its data magnitude and its access frequency differ very significantly from those of the intermediate analysis layer.
3. Propose the multi-level multi-dimensional data analysis and knowledge discovery scheme oriented to management decision-making
The invention describes a multi-level multi-dimensional analysis model and gives its formal definition, and in addition designs a distributed decision knowledge discovery method.
3.1 Build the multi-level multi-dimensional analysis model;
Definition 2: the multi-level multi-dimensional analysis model can be defined in the form of a four-tuple, Dimension = (Subject, Time, Attributes, Rules), whose elements are:
1. Subject element (Subject): an individual, a group, or the whole. An individual is a specific thing, in most cases a specific user; a group is a set of things that often share some common traits; the whole is the complete set of all things;
2. Time element (Time): a granularity of year, month, day, hour, or minute; statistical analysis can be carried out at different time granularities;
3. Attribute element (Attributes): single-attribute behavior analysis or multi-attribute behavior analysis. Single-attribute analysis examines the values of one attribute, while multi-attribute analysis focuses on the relationships among multiple attributes and on the influence of several attributes acting together on things;
4. Rule element (Rules): the rules applied to the attribute and time elements; these rules can be statistical analysis rules or data mining algorithms;
In practical applications, according to the different granularities of the subject and attribute elements, the model is divided into six levels: individual single-attribute analysis, individual multi-attribute analysis, group single-attribute analysis, group multi-attribute analysis, whole single-attribute analysis, and whole multi-attribute analysis. The four variables vary in similar ways, each with a roll-up and a drill-down operation: rolling up enlarges the granularity to examine more condensed information, while drilling down shrinks the granularity to focus on more specific knowledge. For example, the subject dimension can roll up from an individual to different groups and eventually to the whole, the attribute element can change from a single attribute to multiple attributes, and the rules can be chosen to suit different attribute combinations.
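The four-tuple and its roll-up operation can be sketched as follows. This is an illustrative rendering, not the patent's code; the level names follow the six levels listed above, and the example attribute and rule are invented.

```python
# Illustrative sketch of the four-tuple analysis model
# Dimension = (Subject, Time, Attributes, Rules), with a roll-up on the
# subject dimension. Level names follow the six levels described above.
from dataclasses import dataclass
from typing import Callable, Sequence

SUBJECT_LEVELS = ["individual", "group", "whole"]

@dataclass
class Dimension:
    subject: str                  # "individual" | "group" | "whole"
    time: str                     # "year" | "month" | "day" | "hour" | "minute"
    attributes: Sequence[str]     # one attribute -> single-attribute analysis
    rule: Callable[[Sequence[float]], float]  # statistical rule or mining algo

    def roll_up_subject(self):
        """Roll up: enlarge the subject granularity by one level."""
        i = SUBJECT_LEVELS.index(self.subject)
        return Dimension(SUBJECT_LEVELS[min(i + 1, 2)], self.time,
                         self.attributes, self.rule)

    def level(self):
        kind = "single-attribute" if len(self.attributes) == 1 else "multi-attribute"
        return f"{self.subject} {kind} analysis"

d = Dimension("individual", "day", ["traffic"], rule=lambda xs: sum(xs) / len(xs))
print(d.level())                    # individual single-attribute analysis
print(d.roll_up_subject().level())  # group single-attribute analysis
```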
3.2 Design the Spark-based distributed decision knowledge discovery method. Distributed data mining algorithms are designed so that they can interact with the distributed computing platform Spark;
Spark is a general-purpose parallel framework open-sourced by the UC Berkeley AMP Lab, a new distributed computing framework proposed to address the shortcomings of Hadoop. Spark makes better use of memory: intermediate Map results are no longer written back to disk but kept in memory for subsequent operations, a mode that avoids a large amount of IO cost, so that in most cases Spark is much faster than Hadoop. Spark stores files as resilient distributed datasets (RDDs); an RDD is Spark's abstraction of a distributed file, a partitioned collection of records. Through a simple API, RDDs support a large number of operations, such as map, sort, and count, which satisfy data processing needs. Spark also supports stream processing through Spark Streaming.
The invention designs 14 distributed data mining algorithms covering five algorithm fields: collaborative filtering, association rules, dimensionality reduction, classification and regression, and cluster analysis. User-defined algorithms are also supported: a user can upload packaged code as a jar file and pass parameters to the backend through the algorithm invocation page.
4. Build the industry-domain big data processing and analysis platform
The platform integrates typical ETL, cleansing, and integration technology, supports the three-layer data storage and the multi-level multi-dimensional data analysis proposed by the invention, and provides a data visualization analysis method and a task flow management scheme.
4.1 Implement data integration
The data integration module mainly performs three functions: importing data from the data sources into the source layer; importing data from the source layer, through ETL and cleansing and fusion, into the basic data layer tables of the intermediate data layer storage model; and importing data from the basic data layer tables of the intermediate layer into its data mart layer tables through ETL.
4.2 Implement the data storage model; according to the three-layer data storage model designed in step 2, data storage tools are chosen to build the three-level data storage architecture;
Considering HBase's data compression effect, the invention chooses HBase as the data storage of the source layer storage model; considering MongoDB's outstanding performance on massive data access, MongoDB is chosen as the data storage of the intermediate data layer storage model; considering MySQL's advantage under frequent small-data-volume access, MySQL is chosen as the data storage of the result presentation layer storage model.
4.3 Implement data analysis; the data analysis module is built on the Spark distributed computing framework;
As an emerging distributed computing framework, Spark's memory-based computing mode gives it a significant performance improvement over the traditional distributed computing framework Hadoop. The data analysis module is divided into two parts: data statistical analysis and data mining analysis.
Data statistical analysis is based on Spark SQL. Through the Java language it connects to the intermediate data layer storage model in the three-level storage structure, converts user requests from the graphical language into Spark SQL, and fulfils basic data statistics demands such as summation, averaging, variance, and counting data entries.
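The statistics step, compiling a graphical selection into a SQL aggregate, can be sketched as follows. The patent runs Spark SQL from Java against the intermediate layer; this hedged illustration uses the standard-library sqlite3 module instead, and the table and column names are invented.

```python
# Hedged sketch of the statistics step: a (table, column, operation) selection
# made in the graphical interface is compiled to a SQL aggregate query.
# Illustrated with sqlite3 rather than Spark SQL; names are invented.
import sqlite3

AGGREGATES = {"sum": "SUM", "avg": "AVG", "count": "COUNT"}

def build_query(table, column, op):
    """Compile a graphical selection into a SQL aggregate statement."""
    return f"SELECT {AGGREGATES[op]}({column}) FROM {table}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (student_id INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO traffic VALUES (?, ?)",
                 [(1, 100), (1, 300), (2, 200)])
(total,) = conn.execute(build_query("traffic", "bytes", "sum")).fetchone()
print(total)  # 600
```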
Data mining analysis integrates 14 distributed data mining algorithms, including the GBDT decision-tree algorithm and the FISM association-rule algorithm, covering five algorithm fields: collaborative filtering, association rules, dimensionality reduction, classification and regression, and cluster analysis. Data mining analysis also supports user-defined algorithms: a user can upload packaged code as a jar file and pass parameters to the backend through the algorithm invocation page.
The algorithm architecture is shown in Fig. 2.
4.4 Data visualization analysis; HTML5 and ECharts technology are used for data visualization analysis;
The visual analysis module is implemented on ECharts. ECharts is the JavaScript chart library provided by Baidu; it runs smoothly on PCs and mobile devices, is compatible with most current browsers, and its bottom layer relies on ZRender, a lightweight Canvas class library. On top of the integrated ECharts functions, this visual analysis module provides a data input interface that automatically synchronizes user-selected data into ECharts charts, lowering the threshold for using visual analysis, while retaining a user coding module that gives users room to play more freely. The main workflow is shown in Fig. 3.
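The automatic chart generation idea can be sketched as building an ECharts option object from the user-selected rows. The option keys follow the public ECharts option format; the data values and function name are invented for illustration.

```python
# Illustrative sketch of "automatic code generation": user-selected
# result-layer rows are turned into an ECharts option object (built here as a
# Python dict and serialized to JSON for the page). Data values are invented.
import json

def to_echarts_line_option(title, rows):
    """rows: list of (x_label, y_value) pairs selected by the user."""
    return {
        "title": {"text": title},
        "xAxis": {"type": "category", "data": [x for x, _ in rows]},
        "yAxis": {"type": "value"},
        "series": [{"type": "line", "data": [y for _, y in rows]}],
    }

rows = [("08:00", 12), ("09:00", 30), ("10:00", 25)]
option = to_echarts_line_option("Hourly traffic", rows)
print(json.dumps(option, indent=2))
```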
4.5 Task flow management; parallel task flow scheduling uses an upstream-downstream mechanism and a producer-consumer model based on a task queue;
The industry-domain big data analysis platform supports task flow management; Fig. 4 gives the platform's task flow management flowchart. Following it, users can manage existing tasks, including starting a task immediately, scheduling a task to start later, and deleting a task. Meanwhile, users can create new tasks and define a new task's execution time. They can specify a preceding task: the new task must execute after the preceding task completes; in general the preceding task is the upstream task, so downstream tasks can execute according to the outcome of upstream tasks. They can also set the task type: task types include data extraction, data cleansing and fusion, and data statistics and mining, and for different task types the user can customize the task's steps and must configure the corresponding parameters. Finally the task is saved, to be executed at the time the user set.
The platform supports producer-consumer parallel task flow scheduling based on a task queue, as shown in Fig. 5. The task queue is the class that maintains the task flow; according to a task's execution time and whether its upstream tasks have completed, the queue decides whether the current task can be consumed. The producer (Producer) is responsible for inserting the tasks added by users into the task queue with multiple threads. Consumers fetch runnable tasks from the task queue with multiple threads and hand them to downstream execution nodes to run.
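The scheduling model above can be sketched with the standard-library queue and threading modules: a task is only consumed once its upstream task has completed, otherwise it is requeued. This is a minimal single-consumer sketch of the readiness check, not the platform's multi-threaded scheduler; task names are invented.

```python
# Minimal sketch of producer-consumer task scheduling with an upstream check:
# the consumer only "executes" a task whose upstream task has completed, and
# requeues it otherwise. Task names are invented for the example.
import queue
import threading

done = set()          # names of completed tasks (read/written by one consumer)
q = queue.Queue()

def consumer():
    while True:
        name, upstream = q.get()
        if upstream is None or upstream in done:
            done.add(name)            # stand-in for real task execution
        else:
            q.put((name, upstream))   # upstream not finished: requeue
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()

# Producer: enqueue the downstream task first to force at least one requeue.
q.put(("clean_logs", "extract_logs"))
q.put(("extract_logs", None))
q.join()   # blocks until every task (including requeued copies) is processed
print(sorted(done))  # ['clean_logs', 'extract_logs']
```

`Queue.join()` works here because every `put` of a requeued copy is balanced by a later `task_done`, so the unfinished-task count only reaches zero once both tasks have actually run.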
Advantages and positive effects of the invention
The invention proposes a multi-layer storage architecture for industry-domain big data, meeting the demand for raw data, basic data, departmental decision data, and final result data during industry-domain management decision-making. The invention also designs the multi-level multi-dimensional analysis model in the four-tuple form <Subject, Time, Attributes, Rules>, providing theoretical support for bottom-up, comprehensive data analysis.
The full-link multi-source heterogeneous industry-domain big data processing solution built by the invention can discover knowledge from massive heterogeneous data of complex patterns, provide powerful data support for management decisions in fields such as society, politics, economy, and culture, change how these fields operate internally, and improve their operating efficiency.
Brief description of the drawings
Fig. 1 is the running diagram of the multi-source heterogeneous industry-domain big data full-link processing solution;
Fig. 2 is the algorithm architecture diagram of the data analysis part of the big data platform;
Fig. 3 is the main workflow of visual analysis;
Fig. 4 is the task flow management flowchart;
Fig. 5 is the task flow scheduling based on the producer-consumer model;
Fig. 6 is the batch-insert performance comparison between MongoDB and HBase;
Fig. 7 is the multi-level multi-dimensional analysis model;
Fig. 8 is an evolution example of the multi-level multi-dimensional analysis model;
Fig. 9 is the student online traffic model;
Fig. 10 is the 24-hour trend of an individual's Internet traffic;
Fig. 11 is the distribution of students' online preferences;
Fig. 12 is the distribution map of Shanghai Jiao Tong University personnel at 11 o'clock;
Fig. 13 is the data storage architecture diagram;
Fig. 14 is the visual analysis data selection schematic;
Fig. 15 is the IDP2 platform visual analysis schematic;
Fig. 16 is the IDP2 platform task flow diagram.
Embodiment
The specific implementation of the invention is divided into four implementation phases; the detailed process of each phase follows.
Step 1: summarize the industry-domain big data processing workflow
Big data comes in many types and from many sources, such as Internet of Things big data, social network big data, Internet big data, industry-domain big data, and multimedia big data. Their application demands and data types differ, but the most basic big data processing flow is largely the same. On the basis of the traditional basic big data processing flow, combined with the visualizability and veracity characteristics of industry-domain big data and with domain application requirements, the invention explores the most basic workflow for industry-domain big data processing, as shown in Fig. 1.
The overall industry-domain big data processing workflow can be defined as follows: assisted by various collection devices, instruments, and systems, data is extracted and integrated from multiple heterogeneous industry-domain data sources and stored in a unified way according to its characteristics; then, combined with the multi-angle requirements of industry applications, the stored data is analyzed and mined with data analysis techniques to obtain management decision knowledge, and the results are presented to users through visual analysis. The above workflow can be divided into data extraction and integration, data storage, data analysis, and visual analysis.
1. Data extraction and integration
Data extraction and integration divides into ETL, data cleansing, and data fusion.
1) ETL
ETL (Extraction, Transformation and Loading) refers to data extraction, transformation, and loading. The core of ETL is data transformation: converting existing data into target data, which is then applied to data mining and data analysis. Data extraction is the process of obtaining the existing data from the data sources, and data loading is the process of storing the target data. ETL thus completes, through extraction, transformation, and loading, the data handling process that turns the existing data of the data sources into processed, preserved target data.
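The three ETL steps just described can be sketched with an in-memory "source" and "target". The field names and the byte-to-megabyte conversion are invented for illustration.

```python
# Hedged sketch of the extract-transform-load steps, using in-memory data.
def extract(source_rows):
    """Extraction: obtain the existing rows from the data source."""
    return list(source_rows)

def transform(rows):
    """Transformation (the core of ETL): convert rows to the target schema."""
    return [{"student": r["id"], "mb": r["bytes"] / 1_000_000} for r in rows]

def load(rows, target):
    """Loading: store the target data."""
    target.extend(rows)
    return target

source = [{"id": 1, "bytes": 2_000_000}, {"id": 2, "bytes": 500_000}]
warehouse = load(transform(extract(source)), [])
print(warehouse[0])  # {'student': 1, 'mb': 2.0}
```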
2) Data cleansing
Multi-source heterogeneous data are stored in different data sources in different formats, and this separation fragments the information: the information embodied in any single source, or a few sources, is incomplete and one-sided, which during analysis can lead to wrong conclusions or broken chains of evidence. The purpose of data cleansing is to find the errors, missing values, redundancy and anomalies present in the data and to correct them, so as to improve data quality.
In the era of multi-source heterogeneous data, the indicators for measuring data quality include consistency, correctness and completeness. In practice, data entered at different times may become inconsistent; data may be missing because they are hard to obtain, were entered incorrectly, or were lost in transfer, which harms correctness; and various direct or indirect causes lead to data errors. To improve the quality of the data sources and ensure that subsequent data analysis and data mining are meaningful, cleansing multi-source heterogeneous data is essential.
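A minimal sketch of the cleansing checks named above — redundancy, missing values, and out-of-range anomalies; the field name and thresholds are illustrative assumptions, not part of the invention.

```python
def clean(records, field, low, high):
    """Drop duplicate records, records missing `field`, and out-of-range anomalies."""
    seen, result = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:             # redundancy: exact duplicate already kept
            continue
        seen.add(key)
        v = r.get(field)
        if v is None:               # missing value
            continue
        if not (low <= v <= high):  # anomaly: outside the plausible range
            continue
        result.append(r)
    return result

raw = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 10.0},
       {"id": 2, "amount": None}, {"id": 3, "amount": 1e9}]
print(clean(raw, "amount", 0, 10000))  # [{'id': 1, 'amount': 10.0}]
```

In practice each check would be a configurable cleansing step rather than one hard-coded function; the sketch only makes the four error classes concrete.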
3) Data fusion
Data fusion is an emerging problem arising from multi-source heterogeneous data processing. It mainly addresses the accuracy of data values across multiple data sources: using data mining methods, the more truthful and credible value is selected from the conflicting data sources and filled into the fusion result — a task also known as conflict resolution or truth discovery. The predecessor of data fusion is expected-value filling, and deeper study of data fusion provides faster and more accurate filling schemes. The ultimate goal of data fusion is knowledge fusion: by weighing the data of multiple sources against each other, the most accurate and truthful data are obtained, and what is fused across the data sets is not only the data but also the knowledge they contain.
2. Data storage
The industry-domain big data processing flow involves raw data, intermediate data produced by cleansing, fusion and transformation, and result data produced by data analysis for visual presentation. Their data types, data scales and data uses are not all the same, so a multi-layer data storage model should be designed to hold all kinds of data in the industry-domain big data processing flow.
3. Data analysis
Data analysis is the core of the whole industry-domain big data processing flow and the embodiment of big data's value characteristic. Knowledge density in big data is low and knowledge coverage is broad, so the data must be analysed from all directions in a systematic way. Under the big data background, traditional statistics and data mining algorithms cannot process massive data in time. Distributed computing frameworks have played a key role over the past decade, but performance bottlenecks have appeared over time. The present invention builds a multi-level, multi-dimensional data analysis model on the new-generation distributed computing framework Spark to meet the needs of data analysis under the big data background.
4. Visual analysis
Industry-domain big data processing must not only complete data analysis tasks but also present the analysis results intuitively. Visual analysis targets the visual characteristic of industry-domain big data: through visualization techniques, textual data are converted into intuitive, easy-to-operate charts, giving decision-makers a clearer impression. The present invention adopts a visual analysis technique combining HTML5 and ECharts.js and builds a visual analysis module offering data-source selection, data-column selection, automatic code generation and code editing, to meet visualization needs under the industry-domain big data background.
Step 2: design of the three-layer data storage model
According to the selection criteria for each storage level in the three-layer data storage model, the present invention chooses two representatives of the emerging distributed NoSQL databases — the document store MongoDB and the column store HBase — as candidates, tests how well these databases fit the characteristics of source-layer and intermediate-layer data, and selects the best-suited database storage engine as the storage scheme for each layer. When choosing the storage model, query cost, insertion cost, deletion cost and/or compression efficiency must each be considered; the precise definitions are given in Definition 1 of the Summary.
The experiments compare the performance of the two systems using a combination of real data and random data. The real data come from the trade, net_traffic and weather tables of the Shanghai Jiao Tong University EMC data set, with 7,915,289, 12,736,407 and 79,980 records respectively. Their structures are shown in Table (1), Table (2) and Table (3):
Table (1) trade table
Table (2) net_traffic table
Table (3) weather table
For the actual usage scenarios of the system, the experiments designed the following test cases:
1) Storage cost test: insert the weather and net_traffic tables and record the storage space occupied;
2) Batch insertion test: record the elapsed time after every 10,000 insertions;
3) Indexed single-point query test: perform single-point queries on a single-attribute index;
In the storage cost test, the source files are the 5 MB weather table and the 2.7 GB net_traffic table. The results are shown in Table (4): the 5 MB weather table occupies 28 MB under MongoDB and 48 MB under HBase, while the 2.7 GB net_traffic table occupies 4.7 GB and 7.2 GB respectively. However, HBase, as a column store, has powerful data compression: after snappy compression of the column families, the weather and net_traffic tables under HBase occupy only 9.1 MB and 2.4 GB, compression ratios of 18.8% and 30.2% respectively.
Table (4) storage cost test result
Figure 6 shows the batch-insertion performance comparison between MongoDB and HBase. The abscissa is the number of records inserted (in units of 10,000) and the ordinate is the time (in milliseconds) spent inserting the latest 10,000 records; the higher curve is MongoDB and the lower curve is HBase. Overall, HBase is clearly better than MongoDB in average insertion time during batch insertion, and MongoDB exhibits a time-cost peak after every certain amount of inserted data, which is related to the storage strategy of MongoDB's sharding mechanism.
The above test results show that, before compression, both HBase and MongoDB occupy more storage than plain text, and that HBase compresses better. MongoDB is the more read-efficient database engine, while HBase is the more write-efficient one.
Based on the above analysis, MongoDB is the more balanced engine across insert, delete, update and query operations; it provides secondary indexes and is better suited to access patterns where reads outnumber writes. The present invention therefore finally chooses MongoDB as the storage engine of the intermediate data layer. HBase, with its outstanding write performance and compression ratio, demonstrates its advantage in distributed storage and is better suited to storing large data volumes; the present invention therefore finally chooses HBase as the storage engine of the source data layer. Because the result presentation layer holds very little data, the relational database MySQL is chosen for it.
Step 3: the proposed multi-level, multi-dimensional data analysis and knowledge discovery scheme oriented to management decision-making
3.1 Multi-level, multi-dimensional analysis model
In the domain big data era, data contain a great deal of knowledge. How to discover and analyse the knowledge present in big data, so that it better serves the data owner, has become an important problem. The present invention proposes a multi-level, multi-dimensional analysis model that helps users analyse the knowledge contained in data more clearly and more comprehensively.
The present invention defines the multi-level, multi-dimensional analysis model as a four-tuple, Dimension = (Subject, Time, Attributes, Rules). Its precise definition is given in Definition 2.
According to this four-tuple, in practical applications the model can be divided into six levels by the granularity of the subject element and the attribute element: individual single-attribute analysis, individual multi-attribute analysis, group single-attribute analysis, group multi-attribute analysis, whole-population single-attribute analysis and whole-population multi-attribute analysis.
The model contains four groups of variables: subject, time, attribute and rule. The four groups vary in similar ways, each supporting roll-up and drill-down operations: rolling up enlarges the granularity to examine more condensed information, while drilling down shrinks the granularity to focus on more specific knowledge. For example, the subject dimension can rise from an individual to different groups and eventually to the whole population; the attribute dimension can grow from a single attribute to many attributes; and rules can be chosen to suit different attribute combinations. In the example of Figure 7, a consumption data set with five columns is chosen: consumer, consumption time, consumption destination, consumption amount and destination code. With the subject set to an individual, the time rule set to hours, and no rule chosen for the moment, the figure shows the progression from individual single-attribute analysis to individual multi-attribute trend analysis. The corresponding analysis is the individual's daily consumption-destination analysis; choosing the statistics-and-averaging rule yields the daily consumption destinations, and likewise the daily consumption amount and the average daily destination categories can be obtained. After individual single-attribute analysis is completed, individual two-attribute combined analysis follows, as shown in the figure: per-destination average daily personal spending, destination-to-category mapping with parsing of personal consumption habits, and per-category personal spending-habit analysis. Finally, a comprehensive model of personal consumption habits can be built.
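To make the roll-up operation concrete, the following sketch aggregates a toy consumption data set at two subject granularities — per individual, then per group. The column names echo the five columns above, but the records and the helper name are invented for illustration.

```python
from collections import defaultdict

def roll_up(records, subject_key, value_key):
    """One roll-up step of the model: aggregate value_key per subject_key."""
    totals = defaultdict(float)
    for r in records:
        totals[r[subject_key]] += r[value_key]
    return dict(totals)

data = [
    {"consumer": "s1", "group": "PhD",    "amount": 8.0},
    {"consumer": "s1", "group": "PhD",    "amount": 4.0},
    {"consumer": "s2", "group": "Master", "amount": 5.0},
]
# Individual level, then rolled up to the coarser group level.
print(roll_up(data, "consumer", "amount"))  # {'s1': 12.0, 's2': 5.0}
print(roll_up(data, "group", "amount"))     # {'PhD': 12.0, 'Master': 5.0}
```

Drilling down is simply the reverse choice of key; swapping the summation for another function corresponds to choosing a different rule element.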
After personal consumption-habit modelling is completed, analysis can move from individuals to groups — for example, how the consumption habits of postgraduate students differ from those of other people. As shown in Figure 8, from individual single-attribute analysis one can proceed to group single-attribute analyses, such as how the number of postgraduates varies over time, postgraduates' per-capita consumption totals, and postgraduates' consumption-destination categories, and then analyse the consumption habits of the postgraduate group as a whole. The concrete analysis type is determined by the rule: the analyses above chose statistical analysis models, but the rule element also supports data mining algorithms — for example, association rules between different consumption destinations on campus can reveal which kinds of consumption tend to occur together, locating students' demands more accurately and serving student life better.
A group of multi-level, multi-dimensional analysis examples will now be used to illustrate the validity of the model.
The examples validate the model on real data from the consumption-information table, the user profile table and the network-data table of the Shanghai Jiao Tong University EMC data set, with 7,915,289, 8,000 and 12,736,407 records respectively. Their structures are shown in Table (1), Table (5) and Table (2):
Table (5) user profile table
According to the multi-level, multi-dimensional analysis model, the present invention starts from the individual single-attribute level, analyses students' personal traffic habits, and confirms outliers by comparing personal habits with those of the population. In the actual analysis, taking the traffic median as the reference, 24-hour trends of individual surfing traffic were analysed; the model is shown in Figure 9. The results, shown in Figure 10, reveal that one PhD student and one master's student differ markedly from ordinary users, leading to the conclusion that the master's student's online time and traffic are unusually high and deserve attention.
Starting from individual single attributes, individual multi-attribute analysis and group single-attribute analysis were then carried out in turn. At the individual multi-attribute level, a user portrait is built for each user, mining the influence of gender, age and enrolment time on online and consumption behaviour. At the group single-attribute level, the present invention analysed hobbies and obtained the online preferences of male and female students: as shown in Figures 11.a and 11.b, the technical inclination of male students and the consumption inclination of female students differ significantly.
The multi-level, multi-dimensional analysis model provides guidance for domain big data analysis. As the preceding section illustrates, data analysis starts from individual single attributes, parsing the knowledge present in a single attribute and its relation to the individual and to time. After single-attribute analysis of one or more individuals is completed, the changes brought by adding attributes (point to line) and adding subjects (line to plane) can be analysed; following the principle of controlling variables, individual multi-attribute analysis and group single-attribute analysis come first. Individual multi-attribute analysis is responsible for user portraits, while group single-attribute analysis judges the influence of each attribute on the current group. Likewise, after individual and group analyses are completed, rolling the subject up to a coarser whole, or rolling a single attribute up to coarser multi-attribute combinations, can further reveal more macroscopic knowledge.
Once the subject element and attribute element are determined, the time element can be rolled up and drilled down, adjusting granularity to perform time-series analysis. After the time element is determined, the choice of the rule element matters most: subject, time and attribute determine which affair is analysed, while the rule determines how it is analysed. Rules include statistical methods, such as summation and averaging, and data mining methods, such as association rules and clustering. Rules have strict requirements on data format, in particular on whether the data are continuous; given the data formats of the chosen subject, time and attributes, most rules can easily be excluded, and the few remaining rules reduce the difficulty of data analysis.
The multi-level, multi-dimensional analysis model is an analysis with a definite direction, deepening step by step from points to lines and planes — a bottom-up analysis model. It is an exhaustive search with a definite direction and a pruning strategy: a directed exhaustion keeps the data analyst from getting lost among massive data and complicated attributes, and completing the model yields a more comprehensive analysis, avoiding the careless omission of important analysis angles.
3.2 Distributed knowledge discovery scheme for decision-making
The present invention designs 14 distributed data mining algorithms covering five algorithm fields: collaborative filtering, association rules, dimensionality reduction, classification and regression, and cluster analysis. User-defined algorithms are also supported: users can upload packaged code as a jar file and pass parameters to the back end through the algorithm invocation page.
Step 4: building the industry-domain big data processing and analysis platform
On the basis of the data integration method, the present invention proposes the data storage model and data analysis model above and builds the Industry Data Processing Platform (IDP2). IDP2 is based on the Spark computing framework; it integrates typical ETL, cleansing and integration techniques, supports the three-layer storage and multi-level multi-dimensional data analysis proposed by the present invention, and provides a data visualization analysis method and a task-flow management scheme.
4.1 Data integration implementation
The data integration module completes data extraction, data cleansing, data fusion, data transformation and data loading. Considering the three-layer storage structure — source data layer, intermediate data layer and result presentation layer — the module mainly implements importing data from external sources into the source data layer, importing data from the source data layer into the basic data sub-layer of the intermediate data layer through ETL and cleansing-fusion, and importing data from the basic data sub-layer into the data mart sub-layer of the intermediate data layer through ETL.
Tools in the ETL field are relatively mature. The IDP2 platform captures user requests through a friendly WEB-based user interface, stores the user's extraction, transformation and loading demands in the database's data flow as XML configuration files, and then passes the configuration file content into the open-source ETL tool Kettle by calling its API. Kettle integrates numerous JDBC-based database connection modes; it generates the corresponding database operation statements from the configuration parameters and completes the extraction, transformation and loading of the data. A concrete invocation command is as follows:
"C:\Program Files\Java\jdk1.7.0_51\bin\java.exe" "-Xmx512m" "-XX:MaxPermSize=256m" "-Djava.library.path=libswt\win64" "-DKETTLE_HOME=" "-DKETTLE_REPOSITORY=" "-DKETTLE_USER=" "-DKETTLE_PASSWORD=" "-DKETTLE_PLUGIN_PACKAGES=" "-DKETTLE_LOG_SIZE_LIMIT=" "-DKETTLE_JNDI_ROOT=" -jar launcher\pentaho-application-launcher-5.3.0.0-213.jar -lib ..\libswt\win64 -main org.pentaho.di.pan.Pan /file C:\\kettle\\orderf2c299c6-908f-47a0-8da5-86369a5c92d4.xml
Kettle integrates most data integration functions; the commonly used ones are as follows:
Data extraction: input modes including table input and file input; the concrete input modes cover traditional relational databases such as MySQL, MSSQL and Oracle, emerging NoSQL databases such as MongoDB and HBase, semi-structured data such as XML, and unstructured data such as text files.
Data transformation: many transformation methods such as table joins, field selection and record-set merging, covering most data transformation demands. Data extracted directly from external data sources are not subjected to transformation at this stage.
Data loading: storing the persisted data into the three-layer storage. The three-layer storage is transparent to the user, and data loading automatically classifies data by their source: data extracted from external data sources go into the source data layer, while data from the source data layer and from the basic data sub-layer of the intermediate data layer go into the data mart tables of the intermediate data layer. Data loading never loads data into the result presentation layer.
Data cleansing: routine cleansing methods including missing-value filling, noise smoothing, useless-attribute deletion, logic-error checking, data standardization, data normalization and data discretization.
The functional requirement of the IDP2 data integration module is thus mainly to load data from external sources into the source data layer, to load source-layer data into the basic data sub-layer of the intermediate data layer through ETL and cleansing-fusion, and to transform and load the basic data sub-layer of the intermediate data layer into its data mart sub-layer through ETL and cleansing-fusion.
As an open-source ETL tool, Kettle does not integrate the data fusion algorithms necessary for multi-source heterogeneous data processing. The IDP2 platform therefore encapsulates the Kettle tool and adds clustering-based data fusion algorithms. The platform integrates the CRM algorithm, which sums the residual-weighted differences between the data values and the estimated truth and adjusts the data-source weights step by step; the algorithm is proven to converge, and the truth computed under the final weights is returned as the fused value. The addition of data fusion algorithms effectively assesses the validity of multi-source data: the source data are processed by the fusion algorithm, and the value closest to the truth is chosen as the final result, ensuring the correctness of the data sources.
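The residual-weighted iteration attributed to the CRM algorithm is not fully specified in the text; the following is a hedged sketch of a generic truth-discovery loop in the same spirit — the truth is re-estimated as a weighted mean, and each source is re-weighted by the inverse of its residual. All numeric details (the damping constant, the iteration count, the example values) are assumptions for illustration.

```python
def discover_truth(observations, iters=20):
    """observations: {source: value} conflicting claims about one quantity.
    Iteratively re-estimate the truth as the weighted mean of the claims and
    re-weight each source inversely to its residual from the current truth."""
    sources = list(observations)
    weights = {s: 1.0 for s in sources}
    truth = sum(observations.values()) / len(sources)
    for _ in range(iters):
        total = sum(weights.values())
        truth = sum(weights[s] * observations[s] for s in sources) / total
        for s in sources:
            residual = abs(observations[s] - truth)
            weights[s] = 1.0 / (residual + 1e-6)  # small constant avoids division by zero
    return truth

# Two agreeing sources pull the estimate toward ~10.2, away from the outlier 100.
print(round(discover_truth({"A": 10.0, "B": 10.2, "C": 100.0}), 1))  # 10.2
```

A plain average of the three claims would give 40.1; the weighting scheme is what resolves the conflict in favour of the mutually consistent sources, which is the essence of conflict resolution described in the data fusion subsection.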
4.2 Data storage implementation
The present invention proposes a three-layer data storage model based on the source data layer, the intermediate data layer and the result presentation layer. To make the underlying storage model transparent to upper-layer applications, the invention introduces the concept of a data channel.
The data channel is the sole channel for data transfer in the system. The DataChannel class in the IDP2 system maintains data transfer, and the APIs of all data accesses are maintained by DataChannel. The data channel constrains the path of data transfer: by maintaining and optimizing DataChannel, data transfer can be better standardized, transfer efficiency improved, transfer delay reduced, and the efficient and stable access of multi-source heterogeneous data guaranteed.
The data channel makes the underlying storage transparent to user requests. It maps the user's data requests, according to the user's operations, onto the different data storage management services — source-layer data management, intermediate-layer data management and result-presentation-layer data management — which in turn map the data to be read or written onto the data servers. With the data channel, the user need not know whether the data reside in the source data layer, the intermediate data layer or the result presentation layer: the user supplies only a database name and a table name, and the system automatically matches the storage level and obtains the data of the corresponding level through the corresponding API. The data storage organization is shown in Figure 13.
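A hedged sketch of the routing idea behind DataChannel: a registry maps (database, table) to a storage tier, so callers never name the engine directly. The class shape, method names and tier-to-engine table below are assumptions for illustration, not the platform's actual API.

```python
class DataChannel:
    """Route requests by (db, table) to the matching storage tier's engine."""
    TIER_ENGINE = {"source": "HBase", "intermediate": "MongoDB", "result": "MySQL"}

    def __init__(self):
        self.registry = {}  # (db, table) -> tier name

    def register(self, db, table, tier):
        self.registry[(db, table)] = tier

    def engine_for(self, db, table):
        """Auto-match the storage level for a database/table pair."""
        tier = self.registry[(db, table)]
        return self.TIER_ENGINE[tier]

ch = DataChannel()
ch.register("campus", "net_traffic", "source")
ch.register("campus", "daily_spend", "intermediate")
print(ch.engine_for("campus", "net_traffic"))  # HBase
print(ch.engine_for("campus", "daily_spend"))  # MongoDB
```

In the real system each tier entry would wrap a live connection rather than a string, but the lookup structure is what makes the three-layer storage transparent to callers.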
On the basis of the data channel and according to the experimental results, and considering HBase's data compression performance, the present invention chooses HBase as the source-layer data storage; considering MongoDB's outstanding performance on bulk data access, MongoDB is chosen as the intermediate-layer data storage; and considering MySQL's advantage for frequently accessed small data volumes, MySQL is chosen as the result-presentation-layer data storage.
4.3 Data analysis implementation
As an emerging distributed computing framework, Spark's in-memory computing model gives it a significant performance improvement over the traditional distributed computing framework Hadoop. IDP2 builds its data analysis module on the Spark framework; Table (6) shows IDP2's Spark cluster configuration with 1 master node and 3 worker nodes.
Table (6) Spark cluster configuration
The data analysis module is divided into two parts: data statistical analysis and data mining analysis.
Data statistical analysis is based on SparkSQL: it connects in Java to the intermediate data layer of the three-layer storage, converts user requests submitted through the graphical interface into SparkSQL statements, and completes basic statistical demands on the data such as summation, averaging, variance and record counting.
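The translation from a graphical statistics request to an SQL statement can be sketched as follows — here in Python rather than the platform's Java, with an invented request format; the generated string is what an engine such as SparkSQL would then execute.

```python
AGGS = {"sum": "SUM", "avg": "AVG", "variance": "VAR_SAMP", "count": "COUNT"}

def to_sql(request):
    """Turn a graphical statistics request (a dict) into an SQL statement."""
    agg = AGGS[request["op"]]
    if "group_by" in request:
        return (f'SELECT {request["group_by"]}, {agg}({request["column"]}) '
                f'FROM {request["table"]} GROUP BY {request["group_by"]}')
    return f'SELECT {agg}({request["column"]}) FROM {request["table"]}'

req = {"op": "avg", "column": "amount", "table": "daily_spend", "group_by": "consumer"}
print(to_sql(req))
# SELECT consumer, AVG(amount) FROM daily_spend GROUP BY consumer
```

A production translator would also validate identifiers against the table schema before interpolating them; the sketch shows only the request-to-statement mapping.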
Data mining analysis integrates 14 distributed data mining algorithms, including the GBDT decision-tree algorithm and the FISM association-rule algorithm, covering five algorithm fields: collaborative filtering, association rules, dimensionality reduction, classification and regression, and cluster analysis. Data mining analysis also supports user-defined algorithms: users can upload packaged code as a jar file and pass parameters to the back end through the algorithm invocation page. The functional framework of the IDP2 data analysis part is shown in Figure 2.
Data analysis provides format conversion of input data, supporting conversion between numeric and discrete data. This behaviour is discouraged within data analysis, however; data conversion work should be completed in ETL.
Data analysis also provides a thorough result conversion function: for all the algorithms above, a conversion into the visual analysis format is provided, and analysis results are stored in the result presentation layer, giving good support to subsequent visual analysis.
4.4 Data visualization analysis
As a domain big data integrated storage and analysis platform, IDP2 supports not only the data integration function and the distributed data analysis function but also provides a page-side visual analysis module based on HTML5. The visual analysis module requires no coding from the user: simple data-source selection, chart selection and data-column selection are enough to generate intuitive, lively, interactive and highly customizable charts.
The visual analysis workflow is as follows. First, the user selects a data source as needed: either a data file or the execution result of an upstream data analysis and mining task. Then the user selects the concrete data content as the chart's data source and the chart type to generate, producing the visual analysis result for the selected source. The user can save the generated visualization as needed.
The visual analysis module is implemented on ECharts, the JavaScript chart library provided by Baidu, which runs smoothly on PCs and mobile devices, is compatible with most current browsers, and relies underneath on the lightweight Canvas library ZRender. On top of the integrated ECharts functions, this module provides a data input interface and automatically synchronizes the user-selected data into ECharts charts, lowering the threshold for using visual analysis while retaining a user coding module that gives users room to play freely. The main workflow of visual analysis is shown in Figure 3.
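The automatic chart generation step can be pictured as building an ECharts option object from the selected columns. The sketch below emits the option as JSON from Python (the platform itself works in HTML5/JavaScript), and the helper name and sample data are invented for illustration.

```python
import json

def echarts_option(title, categories, values, chart_type="bar"):
    """Build a minimal ECharts option object from selected data columns."""
    return {
        "title": {"text": title},
        "xAxis": {"type": "category", "data": categories},
        "yAxis": {"type": "value"},
        "series": [{"type": chart_type, "data": values}],
    }

opt = echarts_option("Daily spend", ["Mon", "Tue"], [12.5, 3.0])
print(json.dumps(opt, indent=2))
```

On the page side, the emitted object would simply be passed to `setOption` on a chart instance; the code-editing function described below exposes exactly this generated object to the user for modification.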
To improve the user experience, the IDP2 visual analysis module makes the following optimizations:
Intuitiveness optimization: when the user chooses a data source and data columns, the visual analysis module provides a data preview; after choosing the table type, the user can intuitively see the preview and directly select the needed columns, and the back end automatically generates the chart from the selected columns. The data display is optimized so that an excess of columns or rows does not cause overflow. As shown in Figure 14, clicking the first and third columns indicates that these two columns are selected.
Code editing function: the visual analysis module both generates charts automatically and supports user-defined code. The code-editing function gives the user a greater degree of freedom: the user can inspect the automatically generated code, edit it according to the syntax, and finally execute it with the "RUN" button; the actual effect is shown in Figure 15.
4.5 Key task-flow techniques
4.5.1 Task-flow management module
The IDP2 platform supports task-flow management; Figure 4 gives the IDP2 task-flow management flow chart. Following this flow, the user can manage existing tasks, including starting a task immediately, scheduling a task, and deleting a task. Meanwhile, the user can create new tasks and customize a new task's execution time; a predecessor task can be specified, meaning the new task must execute after the predecessor completes — the predecessor is generally an upstream task, so downstream tasks can be executed according to the upstream task's results. The user also sets the task type — data extraction, data cleansing and fusion, or data statistics and mining; for different task types the user can customize the task's steps and must configure the corresponding parameters. Finally the task is saved, to be executed at the time the user set.
The key to the task flow is the upstream–downstream mechanism, which turns tasks from isolated jobs into a streaming mechanism whose input comes from upstream and whose output is accessed downstream. The task flow is a directed acyclic graph, and IDP2 completes the execution of the tasks in a flow according to the task-flow scheduling technique. Figure 16 shows a task flow containing 9 tasks, whose root node is task 1; all other tasks wait until task 1 completes. Task 1 has 3 child nodes, which are woken when it finishes; if a child node has reached its execution time, it is taken away and executed by a consumer in the task-flow scheduling mechanism. Task 8 must wait for both task 6 and task 7 to complete. At task creation time, the platform checks whether the task flow is well-formed and rejects task configurations that would cause deadlock.
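The well-formedness check mentioned above amounts to verifying that the predecessor relation stays acyclic. A sketch using Kahn's topological sort (function name and dependency encoding assumed):

```python
from collections import deque

def is_valid_flow(deps):
    """deps: {task: [predecessor tasks]}. True iff the flow is a DAG."""
    indeg = {t: len(p) for t, p in deps.items()}
    children = {t: [] for t in deps}
    for t, preds in deps.items():
        for p in preds:
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)  # roots: no predecessors
    done = 0
    while ready:
        t = ready.popleft()
        done += 1
        for c in children[t]:       # completing t unblocks its children
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return done == len(deps)        # a cycle leaves some tasks never ready

# Shape of the Figure 16 example: task 1 is the root, task 8 waits on 6 and 7.
flow = {1: [], 2: [1], 3: [1], 4: [1], 6: [2], 7: [3], 8: [6, 7]}
print(is_valid_flow(flow))              # True
print(is_valid_flow({1: [2], 2: [1]})) # False (cycle -> configuration rejected)
```

The same traversal order — roots first, children woken as predecessors complete — is exactly the execution order the scheduler described in the next subsection follows.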
4.5.2 task flow dispatching techniques
IDP2 platforms support the producer consumer parallel task stream scheduling of task based access control queue, as shown in Figure 5.Wherein appoint Business queue is the class of maintenance task stream, and whether task queue can perform completion according to task execution time and upstream task judges to work as Whether preceding task can be by consumer spending.The task that the producer (Producer) is responsible for adding user is inserted with multithreading Enter in task queue.Consumer (Consumer) then obtains the allowing to perform of the task with multithreading from task queue, and hands over Node, which is performed, to downstream performs task.
The producer is multithreaded. When a user submits a task, the IDP2 platform creates a Producer task to hold it; if the task queue is locked by another thread at that moment, the producer sleeps until it is woken by the task queue.
The task queue maintains the list of all pending tasks, sorted in ascending order of execution time, so the task closest to execution comes first. At fixed intervals the queue checks whether a task has reached its execution time and, if so, whether its predecessor tasks have completed. Because the queue is time-ordered, the traversal stops as soon as it meets a task whose execution time has not yet arrived. When a task's execution time has arrived and its predecessors have completed, the queue wakes a consumer node; a consumer node that currently has no task consumes the task and calls an execution node to run it.
The consumer is multithreaded: multiple consumer threads are started when the platform starts. When a task becomes runnable, the task queue wakes these threads, and an idle consumer thread consumes the task. The consumer is responsible only for fetching tasks and handing them to downstream execution nodes; it does not execute tasks itself.
Through this producer-consumer parallel task-flow scheduling based on a task queue, IDP2 guarantees stable execution of task flows.
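The scheduling mechanism of this subsection can be sketched in a few dozen lines. The following is an illustrative reconstruction, not IDP2's disclosed implementation: producer threads push tasks into a time-sorted pending list guarded by a condition variable, and consumer threads fetch a task only once its execution time has arrived and all of its upstream tasks are complete.

```python
import heapq
import threading
import time

class TaskQueue:
    """Time-sorted pending-task list with producer/consumer hand-off."""

    def __init__(self):
        self._heap = []                      # (exec_time, task_id) pairs
        self._done = set()                   # ids of completed tasks
        self._upstream = {}                  # task_id -> predecessor ids
        self._cond = threading.Condition()   # locks the queue; wakes consumers

    def produce(self, task_id, exec_time, upstream=()):
        # Producer: blocks (sleeps) while another thread holds the queue lock.
        with self._cond:
            self._upstream[task_id] = list(upstream)
            heapq.heappush(self._heap, (exec_time, task_id))
            self._cond.notify()              # wake an idle consumer

    def consume(self, timeout=None):
        """Consumer: return a runnable task id, or None on timeout."""
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            while True:
                now = time.time()
                # Traverse in time order; stop at the first task not yet due.
                for entry in sorted(self._heap):
                    exec_time, task_id = entry
                    if exec_time > now:
                        break
                    if all(p in self._done for p in self._upstream[task_id]):
                        self._heap.remove(entry)
                        heapq.heapify(self._heap)
                        return task_id
                remaining = (None if deadline is None
                             else deadline - time.monotonic())
                if remaining is not None and remaining <= 0:
                    return None
                self._cond.wait(timeout=remaining)

    def mark_done(self, task_id):
        # Execution node reports completion; downstream tasks may now run.
        with self._cond:
            self._done.add(task_id)
            self._cond.notify_all()
```

A real consumer, as described above, would hand the returned task id to a downstream execution node rather than run the task itself; `mark_done` stands in for the execution node's completion callback.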

Claims (6)

1. A multi-source heterogeneous industry-domain big-data full-link processing solution, the detailed steps of which are as follows:
Step 1: summarize the industry-domain big-data processing flow.
Existing big-data processing platforms are surveyed; on the basis of the conventional basic big-data processing flow, and combining the visual and authenticity characteristics of industry-domain big data with domain application demands, the most basic processing flow for industry-domain big data is explored;
Step 2: design the three-layer data storage model.
Definition 1: the indicators for choosing a data storage model are defined as follows:
(1) query cost: the time spent querying a unit of data in a data storage model;
(2) insertion cost: the time spent inserting a unit of data into a data storage model;
(3) deletion cost: the time spent deleting a unit of data from a data storage model;
(4) compression ratio: the ratio of the data size after compression to the data size before compression;
When choosing the storage model for each layer, each indicator influences each layer to a different degree; by analyzing the magnitude of these influences, suitable data storage facilities are chosen to build the three-layer data storage model;
Step 2.1: design the first-layer, data-source-layer storage model; the data source layer stores the data extracted from the numerous external data sources;
Step 2.2: design the second-layer, intermediate-data-layer storage model; the intermediate data layer stores the tables involved in routine industry-domain use and is divided into base-data-layer tables and data-mart-layer tables; in the intermediate-data-layer storage model, the tables produced from the data-source-layer storage model by data integration are called base-data-layer tables, which contain broader content and serve general demands; the tables produced from base-data-layer tables by ETL (Extraction, Transformation and Loading), i.e. data extraction, transformation and loading, are called data-mart-layer tables, which usually concern a specific field and serve a specific department's business;
Step 2.3: design the third-layer, result-presentation-layer storage model; the result presentation layer stores the result tables produced from the intermediate data layer by statistical analysis or data-mining analysis;
Step 3: propose the multi-level, multi-dimensional data analysis and knowledge discovery scheme for management decision-making.
Step 3.1: build the multi-level multi-dimensional analysis model;
Definition 2: the multi-level multi-dimensional analysis model is defined as a four-tuple, Dimension = (Subject, Time, Attributes, Rules), whose elements are:
(1) the subject element (Subject): an individual, a group, or the whole; an individual is a specific thing, in most cases a specific user; a group is a set of things that usually share some common traits; the whole is the complete set of all things;
(2) the time element (Time): a granularity of year, month, day, hour or minute; statistical analysis can be carried out at different time granularities;
(3) the attribute element (Attributes): single-attribute behavior analysis or multi-attribute behavior analysis; single-attribute analysis examines the values of one attribute, whereas multi-attribute analysis focuses on the relations among multiple attributes and their joint influence on things;
(4) the rule element (Rules): the rules applied to the attribute and time elements; these rules are statistical-analysis rules or data-mining algorithms;
Step 3.2: design the Spark-based distributed decision knowledge-discovery method; distributed data-mining algorithms are designed so that they can interact with the distributed computing platform Spark;
Step 4: build the industry-domain big-data processing and analysis platform.
Step 4.1: implement data integration; the data integration module imports data from the data sources into the data-source-layer storage model, imports data from the data-source-layer storage model into the base-data-layer tables of the intermediate-data-layer storage model through ETL and cleansing/fusion, and imports data from the base-data-layer tables of the intermediate data layer into the data-mart-layer tables of the intermediate data layer through ETL;
Step 4.2: implement the data storage model; according to the three-layer data storage model designed in step 2, data storage tools are chosen to build the three-layer data storage architecture;
Step 4.3: implement data analysis; the data analysis module is built on a Spark-based distributed computing framework;
Step 4.4: implement data visualization analysis, using HTML5 and ECharts;
Step 4.5: implement task-flow management, using the upstream-downstream mechanism and producer-consumer-model parallel task-flow scheduling based on a task queue.
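The four-tuple of Definition 2 can be sketched as a small data structure. The class and field names below are illustrative assumptions; the claim specifies only the four elements of the tuple and the kinds of values they take:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Dimension:
    """Definition 2: the multi-level multi-dimensional analysis four-tuple."""
    subject: str                # "individual", "group", or "whole"
    time_granularity: str       # "year", "month", "day", "hour", or "minute"
    attributes: Sequence[str]   # one attribute, or several for multi-attribute analysis
    rule: Callable              # a statistical-analysis rule or mining algorithm

# Example: daily mean of a single (hypothetical) attribute over a group.
daily_mean = Dimension(
    subject="group",
    time_granularity="day",
    attributes=["power_consumption"],
    rule=lambda values: sum(values) / len(values),
)
```

An analysis task then instantiates one such tuple per dimension and applies `rule` to the attribute values aggregated at the chosen subject and time granularity.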
2. The method according to claim 1, characterized in that the industry-domain big-data processing flow described in step 1 is as follows:
The processing flow of industry-domain big data is defined as: with the assistance of different collection devices, instruments and systems, data are extracted from multiple heterogeneous industry-domain data sources and integrated; the data are then stored uniformly according to their characteristics; combining multi-angle domain application demands, the stored data are analyzed and mined using data-analysis techniques to obtain management-decision knowledge, and the results are presented to users through visual analysis methods. The flow is divided into a data extraction and integration module, a data storage module, a data analysis module, and a visual analysis module; the visual analysis module addresses the visual characteristics of industry-domain big data and is the module that distinguishes this solution from general big-data processing.
3. The method according to claim 1, characterized in that the indicator selection for the data-source-layer storage model described in step 2 is as follows:
Considering that data are inserted only once and that most data access occurs in the latter two layers of the storage model, compression ratio is the main indicator for the data-source-layer storage model. Data in the source layer are queried during the ETL stage, but such queries are infrequent, so query cost is low and is a secondary factor when choosing the data-source-layer storage model.
4. The method according to claim 1, characterized in that the indicator selection for the intermediate-data-layer storage model described in step 2 is as follows:
Since the intermediate-data-layer storage model is queried frequently, query cost is the main indicator for choosing it; since the intermediate layer generates the newest day's tables from the data-source-layer storage model every day, insertion cost is the secondary factor.
5. The method according to claim 1, characterized in that the data integration implementation described in step 4.1 is as follows:
Data integration comprises data extraction, data transformation, data loading, and data cleansing; the specific functions are as follows:
(1) data extraction: includes table input and file input modes; the supported inputs include the traditional relational databases MySQL, MSSQL and Oracle, the emerging NoSQL databases MongoDB and HBase, semi-structured XML data, and unstructured text documents;
(2) data transformation: includes table join, field selection, record-set merging and many other transformation methods, covering most transformation needs; newly extracted data may not be transformed; transformation is allowed only after the extracted data have been stored in the data-source-layer storage model;
(3) data loading: the persisted data are stored into the three-layer data storage model; the three layers are transparent to the user, and loading automatically routes data to a layer according to its source: newly extracted data are loaded into the data-source-layer storage model, data from the data-source-layer storage model are loaded into the base-data-layer tables of the intermediate data layer, data from the base-data-layer tables are loaded into the data-mart-layer tables of the intermediate data layer, and loading never writes data into the result-presentation-layer storage model;
(4) data cleansing: includes the routine cleaning methods of missing-value filling, noise smoothing, useless-attribute deletion, logic-error checking, data standardization, data normalization, and data discretization.
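The routine cleaning methods listed in (4) can be illustrated with a minimal sketch on plain Python lists. The function names and the mean-imputation and min-max choices are illustrative assumptions; the claim names the cleaning steps without prescribing an implementation:

```python
def fill_missing(values, fill=None):
    """Missing-value filling: replace None with the mean of present values,
    or with an explicit fill value if one is given."""
    present = [v for v in values if v is not None]
    replacement = sum(present) / len(present) if fill is None else fill
    return [replacement if v is None else v for v in values]

def min_max_normalize(values):
    """Data normalization: rescale values into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins):
    """Data discretization: map each value to the index of an equal-width bin."""
    scaled = min_max_normalize(values)
    return [min(int(s * bins), bins - 1) for s in scaled]
```

Noise smoothing, useless-attribute deletion, logic-error checking, and standardization would follow the same per-column pattern.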
6. The method according to claim 1, characterized in that the producer-consumer-model parallel task-flow scheduling described in step 4.5 is as follows:
The task queue is the class that maintains the task flow; based on each task's execution time and on whether its upstream tasks have completed, the queue decides whether the current task may be consumed; the producer (Producer) uses multiple threads to insert the tasks added by users into the task queue; the consumer (Consumer) uses multiple threads to fetch runnable tasks from the queue and hand them to downstream execution nodes for execution.
CN201710376130.9A 2017-05-25 2017-05-25 A kind of multi-source heterogeneous industry field big data handles full link solution CN107193967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710376130.9A CN107193967A (en) 2017-05-25 2017-05-25 A kind of multi-source heterogeneous industry field big data handles full link solution


Publications (1)

Publication Number Publication Date
CN107193967A true CN107193967A (en) 2017-09-22

Family

ID=59874871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710376130.9A CN107193967A (en) 2017-05-25 2017-05-25 A kind of multi-source heterogeneous industry field big data handles full link solution

Country Status (1)

Country Link
CN (1) CN107193967A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918600A (en) * 2017-11-15 2018-04-17 泰康保险集团股份有限公司 report development system and method, storage medium and electronic equipment
CN108196797A (en) * 2018-01-26 2018-06-22 江苏财会职业学院 A kind of data processing system based on cloud computing
CN108449375A (en) * 2018-01-30 2018-08-24 上海天旦网络科技发展有限公司 The system and method for network interconnection data grabber distribution
CN108595190A (en) * 2018-04-23 2018-09-28 平安科技(深圳)有限公司 Report tool building method, device, computer installation and storage medium
CN108595574A (en) * 2018-04-16 2018-09-28 上海达梦数据库有限公司 Connection method, device, equipment and the storage medium of data-base cluster
CN108845921A (en) * 2018-05-30 2018-11-20 郑州云海信息技术有限公司 A kind of method, apparatus, readable storage medium storing program for executing and the computer equipment of test database performance
CN109117321A (en) * 2018-07-27 2019-01-01 山东师范大学 A kind of full link application moving method of cloud platform based on label figure
CN109189839A (en) * 2018-07-20 2019-01-11 广微数据科技(苏州)有限公司 Multilayer business model based on big data platform
CN109271384A (en) * 2018-09-06 2019-01-25 语联网(武汉)信息技术有限公司 Large database concept and its method for building up, the device and electronic equipment of interpreter's behavior
CN109408586A (en) * 2018-09-03 2019-03-01 中新网络信息安全股份有限公司 A kind of polynary isomeric data fusion method of distribution
CN109587125A (en) * 2018-11-23 2019-04-05 南方电网科学研究院有限责任公司 A kind of network security big data analysis method, system and relevant apparatus
CN109739922A (en) * 2019-01-10 2019-05-10 江苏徐工信息技术股份有限公司 A kind of industrial data intelligent analysis system
CN110008306A (en) * 2019-04-04 2019-07-12 北京易华录信息技术股份有限公司 A kind of data relationship analysis method, device and data service system
CN110209506A (en) * 2019-05-09 2019-09-06 上海联影医疗科技有限公司 Data processing system, method, computer equipment and readable storage medium storing program for executing
CN110309118A (en) * 2018-03-06 2019-10-08 北京询达数据科技有限公司 A kind of design method of depth network data excavation robot
CN110309467A (en) * 2018-03-25 2019-10-08 北京询达数据科技有限公司 A kind of design method of Full-automatic deep Web Mining machine
CN110321377A (en) * 2019-04-25 2019-10-11 北京科技大学 A kind of multi-source heterogeneous data true value determines method and device
CN110515990A (en) * 2019-07-23 2019-11-29 华信永道(北京)科技股份有限公司 Data query methods of exhibiting and inquiry display systems
CN111125052A (en) * 2019-10-25 2020-05-08 北京华如科技股份有限公司 Big data intelligent modeling system and method based on dynamic metadata
CN112100525A (en) * 2020-11-02 2020-12-18 中国人民解放军国防科技大学 Multi-source heterogeneous aerospace information resource storage method, retrieval method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265726A1 (en) * 2011-04-18 2012-10-18 Infosys Limited Automated data warehouse migration
CN104573071A (en) * 2015-01-26 2015-04-29 湖南大学 Intelligent school situation analysis system and method based on megadata technology
CN106339509A (en) * 2016-10-26 2017-01-18 国网山东省电力公司临沂供电公司 Power grid operation data sharing system based on large data technology




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170922