CN105956015A - Service platform integration method based on big data - Google Patents
- Publication number
- CN105956015A CN105956015A CN201610254729.0A CN201610254729A CN105956015A CN 105956015 A CN105956015 A CN 105956015A CN 201610254729 A CN201610254729 A CN 201610254729A CN 105956015 A CN105956015 A CN 105956015A
- Authority
- CN
- China
- Prior art keywords
- data
- stored
- hbase
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Abstract
The invention discloses a service platform integration method based on big data. The method comprises the following steps: (1) collecting multi-source heterogeneous data; (2) integrating the collected multi-source heterogeneous data and storing the integrated data in an HBase database; (3) using Hive to perform ETL processing on the integrated data stored in the HBase database, writing the processed data back to HBase, then cleaning the data stored in HBase to obtain clean data and storing the clean data in HBase; (4) performing modeling analysis on the clean data based on Hadoop technology and storing the analysis results in a Hive database; (5) establishing a data exchange and sharing service bus using a service-oriented architecture (SOA), building a data exchange architecture on the service bus, and pushing the analysis results stored in the Hive database to the databases of business application systems through the data exchange architecture. The method effectively reduces communication and time costs and improves the effective utilization rate of data.
Description
Technical field
The present invention relates to the field of big data technology, and in particular to a service platform integration method based on big data.
Background technology
Big data is the commanding height of next-generation information technology in the information industry. Smart city construction has been put on the domestic agenda, and smart cities contain massive amounts of data, serving government affairs, enterprises, and citizens as a new generation of application technology. However, existing big data integration in smart cities still cannot serve government affairs and the public in all aspects of urban life, mainly because of the following limitations:
(1) For the complex computational models of big data, current work mostly performs only attribute and regularity analysis on multi-source data; there is still no complete system of application methods.
(2) Structured data is scarce while unstructured data is abundant, and there are as yet no advanced techniques or means for processing unstructured and semi-structured data.
(3) The methods for describing the complexity and uncertainty of big data, and for system modeling of big data, are not yet mature.
(4) Current big data mining largely remains at the stage of extracting coarse knowledge in a single pass; no refined secondary mining methods have been developed to provide knowledge-driven decision support for decision makers.
Summary of the invention
To overcome the above disadvantages, the object of the present invention is to provide a service platform integration method based on big data. The invention is a method for collecting, integrating, storing, cleaning, modeling, analyzing, and applying urban multi-source heterogeneous data; through this method a bottom-up data processing pipeline is formed. Compared with conventional data processing approaches, the integrated multiple data sources increase the effective utilization rate of the data and effectively reduce communication and time costs.
To achieve the above object, the invention provides a service platform integration method based on big data, comprising the following steps:
Step 1: collect multi-source heterogeneous data;
Step 2: integrate the collected multi-source heterogeneous data and store the integrated data in an HBase database;
Step 3: use Hive to perform ETL processing on the integrated data stored in the HBase database and write the result back to HBase; clean the data stored in HBase to obtain clean data, and store the clean data in the HBase database;
Step 4: perform modeling analysis on the clean data stored in HBase based on Hadoop technology, and store the analysis results in a Hive database;
Step 5: establish a data exchange and sharing service bus using a service-oriented architecture (SOA), then build a data exchange architecture on the service bus, and push the analysis results stored in the Hive database to the databases of business application systems through the data exchange architecture, so that the analysis results can be applied in the corresponding business systems.
Preferably, the collection of multi-source heterogeneous data in step 1 comprises the following steps:
Step 1.1: configure multiple distributed data sources;
Step 1.2: encapsulate the multiple distributed data sources into data components;
Step 1.3: read the encapsulated data components and convert them into global objects;
Step 1.4: combine the data components converted into global objects to realize a unified component platform for accessing multi-source heterogeneous data;
Step 1.5: collect the multi-source heterogeneous data through the component platform and transmit them to the data center, completing the collection of the multi-source heterogeneous data.
Preferably, cleaning the data stored in the HBase database in step 3 to obtain clean data comprises the following steps:
Step 3.1: perform duplicate checking on the data stored in the HBase database;
Step 3.2: perform interpolation on the missing data remaining after duplicate checking;
Step 3.3: perform cluster analysis on the filled data and identify data points lying at the cluster edges; set a valid range according to each data type, exclude out-of-range values, obtain the clean data, and store them in the HBase database.
Preferably, the modeling analysis of the clean data stored in the HBase database based on Hadoop technology in step 4 includes:
performing cluster analysis on the clean data stored in the HBase database based on Hadoop technology and storing the clustered data in the Hive database for later use, with the following detailed process:
(1) create an initialization set by randomly selecting k objects from the clean data stored in the HBase database and taking these objects as cluster centers; (2) compute the distance between each remaining clean data point in the HBase database and each cluster center; (3) assign each remaining clean data point to the nearest cluster center; (4) whenever a data object joins or leaves a cluster, automatically recompute the mean of that cluster, and reassign the data point if the minimum-distance condition is not satisfied; (5) repeat the above steps until the cluster centers no longer change, and record the result; (6) store the result in the Hive database.
Performing collaborative recommendation analysis on the clean data stored in the HBase database based on Hadoop technology, and storing the analyzed data in the Hive database for later use, with the following detailed process:
(1) obtain the clean data stored in the HBase database and convert them into a dataset in the format required for analysis;
(2) divide the dataset into a training dataset and a test dataset;
(3) train a recommendation model with the training dataset;
(4) assess the precision of the recommendation model with the test dataset;
(5) when the precision of the recommendation model meets the requirement, make recommendations and output the result; otherwise retrain the model and re-evaluate it until data meeting the requirement are obtained;
(6) store the output result in the Hive database.
Performing classification analysis on the clean data stored in the HBase database based on Hadoop technology, storing the classified data in the Hive database, and attaching different labels to different data for later use, with the following detailed process:
(1) obtain the clean data stored in the HBase database and convert them into a dataset in the format required for analysis;
(2) assign feature attributes to the dataset, divide the dataset into multiple items to be classified according to the feature attributes, classify a portion of the items, and form a training sample set;
(3) count the frequency of each class in the training sample set and estimate the conditional probability of each feature attribute for each class, obtaining a classifier;
(4) use the classifier to classify the data that need classification and output the result;
(5) save the result in the Hive database.
Preferably, the duplicate checking of the data stored in the HBase database in step 3.1 comprises the following steps:
Step 3.1.1: query the data stored in the HBase database for duplicates and filter out records in which all fields are identical; retain one record and remove the other fully duplicated records;
Step 3.1.2: query for duplicates on key fields and filter out records whose key fields are identical; compare the completeness of the duplicated records, retain the record with the more complete field data, and remove the remaining duplicates.
Preferably, the interpolation of the missing data after duplicate checking in step 3.2 comprises the following steps:
Step 3.2.1: for regularly missing and unimportant data, delete the records with missing values; for regularly missing but relatively important data, compute weighted values from the complete data and fill them in; for irregularly missing data, process each case according to the type of the missing data;
Step 3.2.2: fill missing values with the mean of the same attribute, or with the value of highest occurrence probability for that attribute; for randomly missing data of different attributes, first generate candidate interpolation values for each missing value, perform statistical analysis on the complete data formed by the candidate interpolation values, evaluate the analysis results, and form the final interpolation value with which the missing value is filled.
Preferably, encapsulating the multiple distributed data sources into data components in step 1.2 comprises the following steps:
Step 1.2.1: prepare component objects using the database table structures;
Step 1.2.2: query the database to obtain the list of tables;
Step 1.2.3: for each table in the list, query the database fields and the field data structure of the table;
Step 1.2.4: read each data table as an object according to its table structure, and set the data field attributes as the basic attribute information of the table object;
Step 1.2.5: encapsulate the object table into a component that can be queried by attribute fields.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention is a method for collecting, integrating, storing, cleaning, modeling, analyzing, and applying urban multi-source heterogeneous data; through this method a bottom-up data processing pipeline is formed, and the sources and processing history of the data are made traceable and auditable. Compared with conventional data processing approaches, the integrated multiple data sources increase the effective utilization rate of the data and effectively reduce communication and time costs.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 shows the Sqoop-based ETL module.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments.
As shown in Fig. 1 and Fig. 2, the invention provides a service platform integration method based on big data, comprising the following steps:
Step 1: collect multi-source heterogeneous data; the collection process comprises the following steps:
Step 1.1: configure multiple distributed data sources; configure front-end processors and manual-entry terminal devices in the departments from which data are collected (including land, environmental protection, water conservancy, meteorology, forestry, work safety supervision, quality supervision, etc.), and configure front-end processors, application servers, database servers, web servers, system monitoring terminals, operation terminals, and other equipment at the data center to handle the business processing of the collected data;
Step 1.2: encapsulate the multiple distributed data sources into data components, comprising the following steps:
Step 1.2.1: prepare component objects using the database table structures;
Step 1.2.2: query the database to obtain the list of tables;
Step 1.2.3: for each table in the list, query the database fields and the field data structure of the table;
Step 1.2.4: read each data table as an object according to its table structure, and set the data field attributes as the basic attribute information of the table object;
Step 1.2.5: encapsulate the object table into a component that can be queried by attribute fields;
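The table-to-component encapsulation of steps 1.2.1 to 1.2.5 can be sketched in Python. This is a minimal illustration using SQLite as a stand-in data source; the `TableComponent` class, the table name, and the fields are hypothetical and not part of the patent.

```python
import sqlite3

class TableComponent:
    """Illustrative wrapper: exposes one database table as a
    component that can be queried by attribute (field) values."""

    def __init__(self, conn, table):
        self.conn, self.table = conn, table
        # Steps 1.2.3/1.2.4: query the field structure and keep it as
        # the component's basic attribute information.
        cur = conn.execute(f"PRAGMA table_info({table})")
        self.fields = [row[1] for row in cur.fetchall()]

    def query(self, **attrs):
        # Step 1.2.5: allow lookups by any attribute field.
        sql = f"SELECT * FROM {self.table}"
        if attrs:
            sql += " WHERE " + " AND ".join(f"{k} = ?" for k in attrs)
        return self.conn.execute(sql, tuple(attrs.values())).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (station TEXT, temp REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?)",
                 [("s1", 21.5), ("s2", 19.0)])
comp = TableComponent(conn, "weather")
print(comp.fields)             # field structure read from the table
print(comp.query(station="s1"))
```

Combining several such components behind one interface is what steps 1.3 and 1.4 call the unified access component platform.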
Step 1.3: read the encapsulated data components and convert them into global objects;
Step 1.4: combine the data components converted into global objects to realize a unified component platform for accessing multi-source heterogeneous data;
Step 1.5: collect the multi-source heterogeneous data through the component platform and transmit them to the data center, completing the collection of the multi-source heterogeneous data; the concrete steps are:
Step 1.5.1: set the data acquisition mode, including acquisition frequency, acquisition nodes, and acquisition range;
Step 1.5.2: read out the data to be collected through the component platform;
Step 1.5.3: transmit the read data to the data center over the network.
Step 2: integrate the collected multi-source heterogeneous data and store the integrated data in the HBase database.
Step 3: use Hive to perform ETL processing on the integrated data stored in the HBase database and write the result back to HBase; clean the data stored in HBase to obtain clean data, and store the clean data in the HBase database. The detailed process is:
As shown in Fig. 2, Hive is used to perform ETL processing on the integrated data stored in the HBase database. The Sqoop-based ETL module first establishes a connection to the data source through JDBC (Java Database Connectivity) and verifies the metadata information of the data source; it then converts the SQL-typed data obtained at the JDBC end into Sqoop records in Java class form and submits them as formatted input to a MapReduce job. Finally, by launching the appropriate numbers of Map and Reduce tasks, the data are written into HDFS: the client node calls the HDFS API, splits the whole file into packets, manages the packets in a data queue while they await processing, then applies to the NameNode for new data blocks and obtains a group of DataNodes to actually store the block replicas. The DataNodes form a pipeline, and each packet is written to the DataNodes in turn; when the last DataNode has finished writing, acknowledgements are returned back along the pipeline, and completion is finally reported to the NameNode.
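The Sqoop-based extraction described above is normally driven from the `sqoop import` command line. The following Python sketch only assembles such a command (it does not run it, since that requires a live cluster); the JDBC URL, table name, and target directory are placeholders, not values from the patent.

```python
# Build (but do not execute) a Sqoop import command that would pull a
# relational table into HDFS over JDBC, mirroring the module above.
def sqoop_import_cmd(jdbc_url, table, target_dir, num_mappers=4):
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC connection to the source
        "--table", table,            # source table to extract
        "--target-dir", target_dir,  # HDFS directory for the output
        "--num-mappers", str(num_mappers),  # parallel Map tasks
    ]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/city", "sensor_readings",
                       "/data/raw/sensor_readings")
print(" ".join(cmd))
```

In a real deployment this command line would be handed to `subprocess.run` or a workflow scheduler on a node with Sqoop installed.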
Cleaning the data stored in the HBase database to obtain clean data comprises the following steps:
Step 3.1: perform duplicate checking on the data stored in the HBase database, comprising:
Step 3.1.1: query the data stored in the HBase database for duplicates and filter out records in which all fields are identical; retain one record and remove the other fully duplicated records;
Step 3.1.2: query for duplicates on key fields and filter out records whose key fields are identical; compare the completeness of the duplicated records, retain the record with the more complete field data, and remove the remaining duplicates.
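Steps 3.1.1 and 3.1.2 can be illustrated on plain Python records; in the described system the same logic would run over HBase via Hive or MapReduce, and the field names below are hypothetical.

```python
def deduplicate(records, key_fields):
    """Step 3.1.1: drop records identical in all fields.
    Step 3.1.2: among records sharing key fields, keep the most complete."""
    seen_full, by_key = set(), {}
    for rec in records:
        full = tuple(sorted(rec.items()))
        if full in seen_full:          # exact duplicate of an earlier record
            continue
        seen_full.add(full)
        key = tuple(rec.get(k) for k in key_fields)
        completeness = sum(v is not None for v in rec.values())
        kept = by_key.get(key)
        if kept is None or completeness > kept[0]:
            by_key[key] = (completeness, rec)   # keep the more complete one
    return [rec for _, rec in by_key.values()]

records = [
    {"id": 1, "name": "a", "city": "X"},
    {"id": 1, "name": "a", "city": "X"},   # full duplicate, removed
    {"id": 2, "name": "b", "city": None},
    {"id": 2, "name": "b", "city": "Y"},   # key duplicate, more complete
]
clean = deduplicate(records, key_fields=["id"])
print(clean)
```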
Step 3.2: perform interpolation on the missing data after duplicate checking, comprising:
Step 3.2.1: for regularly missing and unimportant data, delete the records with missing values; for regularly missing but relatively important data, compute weighted values from the complete data and fill them in; for irregularly missing data, process each case according to the type of the missing data;
Step 3.2.2: fill missing values with the mean of the same attribute, or with the value of highest occurrence probability for that attribute; for randomly missing data of different attributes, first generate candidate interpolation values for each missing value, perform statistical analysis on the complete data formed by the candidate interpolation values, evaluate the analysis results, and form the final interpolation value with which the missing value is filled.
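A minimal sketch of the mean and most-frequent-value filling in step 3.2.2, assuming a single column represented as a Python list with `None` marking missing entries:

```python
from statistics import mean
from collections import Counter

def impute(column):
    """Fill missing numeric values with the column mean, and missing
    categorical values with the most frequent (modal) value."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)                          # mean imputation
    else:
        fill = Counter(present).most_common(1)[0][0]  # modal value
    return [fill if v is None else v for v in column]

print(impute([3.0, None, 5.0]))        # numeric column
print(impute(["A", "B", None, "A"]))   # categorical column
```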
Step 3.3: perform cluster analysis on the filled data and identify data points lying at the cluster edges; set a valid range according to each data type, exclude out-of-range values, obtain the clean data, and store them in the HBase database.
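The valid-range exclusion in step 3.3 might look like the following sketch; the data types and range limits are illustrative placeholders, not values from the patent.

```python
# Per-type valid ranges (illustrative): values outside are excluded
# after cluster analysis has flagged edge points.
VALID_RANGES = {"temperature": (-50.0, 60.0), "humidity": (0.0, 100.0)}

def in_valid_range(record):
    lo, hi = VALID_RANGES[record["type"]]
    return lo <= record["value"] <= hi

data = [
    {"type": "temperature", "value": 21.3},
    {"type": "temperature", "value": 999.0},   # sensor glitch, excluded
    {"type": "humidity", "value": 45.0},
]
clean = [r for r in data if in_valid_range(r)]
print(len(clean))
```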
Step 4: perform modeling analysis on the clean data stored in the HBase database based on Hadoop technology, and store the analysis results in the Hive database, including:
performing cluster analysis based on Hadoop technology: the data stored in the HBase database are clustered into multiple classes such that objects within the same class have high similarity while objects in different classes differ considerably; the clustered data are stored in the Hive database for later use. K-means clustering is a widely used partition-based clustering algorithm, and the concrete process is:
(1) create an initialization set by randomly selecting k objects from the clean data stored in the HBase database and taking these objects as cluster centers; (2) compute the distance between each remaining clean data point in the HBase database and each cluster center; (3) assign each remaining clean data point to the nearest cluster center; (4) whenever a data object joins or leaves a cluster, automatically recompute the mean of that cluster, and reassign the data point if the minimum-distance condition is not satisfied; (5) repeat the above steps until the cluster centers no longer change, and record the result; (6) store the result in the Hive database.
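The k-means procedure in items (1) to (5) can be sketched on one-dimensional points; a production version would run as a distributed job over HBase rather than as in-memory Python, and the sample points are invented.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: random init, nearest-center assignment,
    mean recomputation, stop when centers stabilize."""
    random.seed(seed)
    centers = random.sample(points, k)        # (1) random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # (2)-(3) assign to nearest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]  # (4) recompute cluster means
        if new == centers:                    # (5) centers no longer change
            break
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(sorted(round(c, 2) for c in centers))
```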
Performing collaborative recommendation analysis based on Hadoop technology: according to users' usage habits and data-customized labels, corresponding data are recommended to different user types or objects, and the trained results are stored in the Hive database. Collaborative filtering is a technique widely used in recommender systems; it makes recommendations mainly by considering the similarity between users and between items. Collaborative filtering with ALS-WR (alternating least squares with weighted regularization) is a commonly used recommendation algorithm. Its core idea is to regard all users and items as a two-dimensional table in which cell (i, j) holds the rating of the i-th user for the j-th item; the algorithm then uses the cells that contain data to predict the empty cells. The predicted values are the users' ratings for the items, and recommendations are made by ranking the predicted ratings from high to low. The concrete process is:
(1) obtain the clean data stored in the HBase database and convert them into a dataset in the format required for analysis;
(2) divide the dataset into a training dataset and a test dataset;
(3) train the recommendation model with the training dataset;
(4) assess the precision of the recommendation model with the test dataset;
(5) when the precision of the recommendation model meets the requirement, make recommendations and output the result; otherwise retrain the model and re-evaluate it until data meeting the requirement are obtained;
(6) store the output result in the Hive database.
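As an illustration of ALS on the user-item table described above, here is a toy rank-1 alternating least squares with regularization in pure Python. It shows the fit-and-evaluate idea only, not the distributed ALS-WR implementation on Hadoop, and the rating table is invented.

```python
# Sparse rating table: (user, item) -> rating; empty cells are absent.
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 4.5}

def als(ratings, n_users, n_items, lam=0.05, iters=20):
    """Rank-1 ALS: each user i and item j gets one latent factor;
    fixing one side, the other has a closed-form regularized solution."""
    u, v = [1.0] * n_users, [1.0] * n_items
    for _ in range(iters):
        for i in range(n_users):   # fix v, solve u[i] by least squares
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = lam + sum(v[j] ** 2 for (a, j), _ in ratings.items() if a == i)
            u[i] = num / den
        for j in range(n_items):   # fix u, solve v[j]
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = lam + sum(u[i] ** 2 for (i, b), _ in ratings.items() if b == j)
            v[j] = num / den
    return u, v

u, v = als(ratings, 3, 3)
rmse = (sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
        / len(ratings)) ** 0.5
pred = u[2] * v[0]   # predicted rating for an empty cell (user 2, item 0)
print(round(rmse, 3), round(pred, 2))
```

Step (4)'s precision check corresponds to computing this error on held-out test ratings rather than on the training cells.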
Performing classification based on Hadoop technology: the data collected for training are classified and stored in the Hive database, and different labels are attached to different data for subsequent use. Naive Bayes classification is a commonly used classification algorithm; its core idea is that, for a given item to be classified, the probability of each class conditioned on the item's occurrence is computed, and the item is considered to belong to the class with the largest probability. The concrete process is:
(1) obtain the clean data stored in the HBase database and convert them into a dataset in the format required for analysis;
(2) assign feature attributes to the dataset, divide the dataset into multiple items to be classified according to the feature attributes, classify a portion of the items, and form a training sample set;
(3) count the frequency of each class in the training sample set and estimate the conditional probability of each feature attribute for each class, obtaining the classifier;
(4) use the classifier to classify the data that need classification and output the result;
(5) save the result in the Hive database.
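Steps (2) to (4) of the naive Bayes procedure can be sketched with class priors and per-class feature counts (using add-one smoothing); the traffic-related features and labels below are invented examples, not data from the patent.

```python
from collections import Counter, defaultdict

def train(samples):
    """Step (3): class frequencies plus per-class feature value counts."""
    labels = Counter(label for _, label in samples)
    feat = defaultdict(Counter)    # (label, attribute) -> value counts
    for features, label in samples:
        for attr, val in features.items():
            feat[(label, attr)][val] += 1
    return labels, feat

def classify(labels, feat, features):
    """Step (4): pick the class maximizing prior * smoothed likelihoods."""
    total = sum(labels.values())
    best, best_p = None, -1.0
    for label, n in labels.items():
        p = n / total                      # prior P(class)
        for attr, val in features.items():
            counts = feat[(label, attr)]
            p *= (counts[val] + 1) / (n + len(counts) + 1)  # add-one smoothing
        if p > best_p:
            best, best_p = label, p
    return best

samples = [({"traffic": "heavy", "rain": "yes"}, "congested"),
           ({"traffic": "heavy", "rain": "no"}, "congested"),
           ({"traffic": "light", "rain": "no"}, "clear"),
           ({"traffic": "light", "rain": "yes"}, "clear")]
model = train(samples)
print(classify(*model, {"traffic": "heavy", "rain": "yes"}))
```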
Step 5: establish a data exchange and sharing service bus using a service-oriented architecture (SOA), then build a data exchange architecture on the service bus, and push the analysis results stored in the Hive database to the databases of business application systems through the data exchange architecture, so that the analysis results can be applied in the corresponding business systems.
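The push in step 5 can be sketched as a function that delivers analysis rows into a business-system database. SQLite stands in here for both the Hive output and the business database, and all table and column names are placeholders rather than identifiers from the patent.

```python
import sqlite3

def push_results(results, business_db):
    """Deliver analysis rows into the business application database,
    as the data exchange architecture would over the service bus."""
    business_db.execute(
        "CREATE TABLE IF NOT EXISTS analysis_results (label TEXT, score REAL)")
    business_db.executemany(
        "INSERT INTO analysis_results VALUES (?, ?)", results)
    business_db.commit()

hive_results = [("cluster_1", 0.92), ("cluster_2", 0.81)]  # stand-in rows
biz = sqlite3.connect(":memory:")
push_results(hive_results, biz)
rows = biz.execute("SELECT COUNT(*) FROM analysis_results").fetchone()[0]
print(rows)
```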
Claims (7)
1. A service platform integration method based on big data, characterized by comprising the following steps:
Step 1: collecting multi-source heterogeneous data;
Step 2: integrating the collected multi-source heterogeneous data and storing the integrated data in an HBase database;
Step 3: using Hive to perform ETL processing on the integrated data stored in the HBase database and writing the result back to HBase; cleaning the data stored in HBase to obtain clean data, and storing the clean data in the HBase database;
Step 4: performing modeling analysis on the clean data stored in HBase based on Hadoop technology, and storing the analysis results in a Hive database;
Step 5: establishing a data exchange and sharing service bus using a service-oriented architecture (SOA), then building a data exchange architecture on the service bus, and pushing the analysis results stored in the Hive database to the databases of business application systems through the data exchange architecture, so that they can be applied in the corresponding business systems.
2. The service platform integration method based on big data according to claim 1, characterized in that the collection of multi-source heterogeneous data in step 1 comprises the following steps:
Step 1.1: configuring multiple distributed data sources;
Step 1.2: encapsulating the multiple distributed data sources into data components;
Step 1.3: reading the encapsulated data components and converting them into global objects;
Step 1.4: combining the data components converted into global objects to realize a unified component platform for accessing multi-source heterogeneous data;
Step 1.5: collecting the multi-source heterogeneous data through the component platform and transmitting them to the data center, completing the collection of the multi-source heterogeneous data.
3. The service platform integration method based on big data according to claim 1, characterized in that cleaning the data stored in the HBase database in step 3 to obtain clean data comprises the following steps:
Step 3.1: performing duplicate checking on the data stored in the HBase database;
Step 3.2: performing interpolation on the missing data after duplicate checking;
Step 3.3: performing cluster analysis on the filled data and identifying data points lying at the cluster edges; setting a valid range according to each data type, excluding out-of-range values, obtaining the clean data, and storing them in the HBase database.
4. The service platform integration method based on big data according to claim 1, characterized in that the modeling analysis of the clean data stored in the HBase database based on Hadoop technology in step 4 includes:
performing cluster analysis on the clean data stored in the HBase database based on Hadoop technology and storing the clustered data in the Hive database for later use, with the following detailed process:
(1) creating an initialization set by randomly selecting k objects from the clean data stored in the HBase database and taking these objects as cluster centers; (2) computing the distance between each remaining clean data point in the HBase database and each cluster center; (3) assigning each remaining clean data point to the nearest cluster center; (4) whenever a data object joins or leaves a cluster, automatically recomputing the mean of that cluster, and reassigning the data point if the minimum-distance condition is not satisfied; (5) repeating the above steps until the cluster centers no longer change, and recording the result; (6) storing the result in the Hive database;
performing collaborative recommendation analysis on the clean data stored in the HBase database based on Hadoop technology and storing the analyzed data in the Hive database for later use, with the following detailed process:
(1) obtaining the clean data stored in the HBase database and converting them into a dataset in the format required for analysis;
(2) dividing the dataset into a training dataset and a test dataset;
(3) training the recommendation model with the training dataset;
(4) assessing the precision of the recommendation model with the test dataset;
(5) when the precision of the recommendation model meets the requirement, making recommendations and outputting the result; otherwise retraining and re-evaluating the model until data meeting the requirement are obtained;
(6) storing the output result in the Hive database;
performing classification analysis on the clean data stored in the HBase database based on Hadoop technology, storing the classified data in the Hive database, and attaching different labels to different data for later use, with the following detailed process:
(1) obtaining the clean data stored in the HBase database and converting them into a dataset in the format required for analysis;
(2) assigning feature attributes to the dataset, dividing the dataset into multiple items to be classified according to the feature attributes, classifying a portion of the items, and forming a training sample set;
(3) counting the frequency of each class in the training sample set and estimating the conditional probability of each feature attribute for each class, obtaining the classifier;
(4) using the classifier to classify the data that need classification and outputting the result;
(5) saving the result in the Hive database.
5. The service platform integration method based on big data according to claim 3, characterized in that the duplicate checking of the data stored in the HBase database in step 3.1 comprises the following steps:
Step 3.1.1: querying the data stored in the HBase database for duplicates and filtering out records in which all fields are identical; retaining one record and removing the other fully duplicated records;
Step 3.1.2: querying for duplicates on key fields and filtering out records whose key fields are identical; comparing the completeness of the duplicated records, retaining the record with the more complete field data, and removing the remaining duplicates.
6. The service platform integration method based on big data according to claim 3, characterized in that the interpolation of the missing data after duplicate checking in step 3.2 comprises the following steps:
Step 3.2.1: for regularly missing and unimportant data, deleting the records with missing values; for regularly missing but relatively important data, computing weighted values from the complete data and filling them in; for irregularly missing data, processing each case according to the type of the missing data;
Step 3.2.2: filling missing values with the mean of the same attribute, or with the value of highest occurrence probability for that attribute; for randomly missing data of different attributes, first generating candidate interpolation values for each missing value, performing statistical analysis on the complete data formed by the candidate interpolation values, evaluating the analysis results, and forming the final interpolation value with which the missing value is filled.
The service platform integration method based on big data according to claim 2, characterized in that packaging the multiple distributed data sources into data components in step 1.2 comprises the following steps:
Step 1.2.1: preparing a component object from the database table structure;
Step 1.2.2: querying the database for its list of tables;
Step 1.2.3: querying, for each data table in the list, the database fields and the data structure of each field;
Step 1.2.4: reading each data table into a table object according to its table structure, and setting the data field attributes as the basic attribute information of the table object;
Step 1.2.5: packaging the table object into a component that can be queried by attribute field.
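Steps 1.2.1 through 1.2.5 amount to introspecting a database and wrapping each table as a queryable component. The patent does not name a DBMS, so the sketch below uses SQLite purely for illustration; the class and function names (`TableComponent`, `package_components`) are assumptions.

```python
import sqlite3

class TableComponent:
    """Wraps one database table as a component queryable by attribute field
    (steps 1.2.1, 1.2.3-1.2.5)."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table
        # Step 1.2.3: query the field names and data types of this table.
        self.fields = {row[1]: row[2]
                       for row in conn.execute(f"PRAGMA table_info({table})")}

    def query(self, **attrs):
        # Step 1.2.5: query the component by attribute field.
        where = " AND ".join(f"{k} = ?" for k in attrs)
        sql = f"SELECT * FROM {self.table}" + (f" WHERE {where}" if attrs else "")
        return self.conn.execute(sql, tuple(attrs.values())).fetchall()

def package_components(conn):
    # Step 1.2.2: query the database for its list of tables.
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    # Step 1.2.4: read each table out as a table object keyed by name.
    return {t: TableComponent(conn, t) for t in tables}
```

Each component carries its field attributes as basic attribute information and exposes a uniform attribute-field query interface, which is what lets heterogeneous sources be integrated behind one access pattern.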
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610254729.0A CN105956015A (en) | 2016-04-22 | 2016-04-22 | Service platform integration method based on big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610254729.0A CN105956015A (en) | 2016-04-22 | 2016-04-22 | Service platform integration method based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105956015A true CN105956015A (en) | 2016-09-21 |
Family
ID=56914723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610254729.0A Pending CN105956015A (en) | 2016-04-22 | 2016-04-22 | Service platform integration method based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105956015A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030233369A1 (en) * | 2002-06-17 | 2003-12-18 | Fujitsu Limited | Data classifying device, and active learning method used by data classifying device and active learning program of data classifying device |
CN101546325A (en) * | 2008-12-23 | 2009-09-30 | 重庆邮电大学 | Grid heterogeneous data integrating method based on SOA |
CN102170449A (en) * | 2011-04-28 | 2011-08-31 | 浙江大学 | Web service QoS prediction method based on collaborative filtering |
CN104616180A (en) * | 2015-03-09 | 2015-05-13 | 浪潮集团有限公司 | Method for predicting hot sellers |
CN104932895A (en) * | 2015-06-26 | 2015-09-23 | 南京邮电大学 | Middleware based on SOA (Service-Oriented Architecture) and information publishing method thereof |
CN105184424A (en) * | 2015-10-19 | 2015-12-23 | 国网山东省电力公司菏泽供电公司 | Mapreduced short period load prediction method of multinucleated function learning SVM realizing multi-source heterogeneous data fusion |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446255A (en) * | 2016-10-18 | 2017-02-22 | 安徽天达网络科技有限公司 | Data processing method based on cloud server |
CN106651188A (en) * | 2016-12-27 | 2017-05-10 | 贵州电网有限责任公司贵阳供电局 | Electric transmission and transformation device multi-source state assessment data processing method and application thereof |
CN106649801A (en) * | 2016-12-29 | 2017-05-10 | 广东精规划信息科技股份有限公司 | Time-space relationship analysis system based on multi-source internet-of-things position awareness |
CN106844585A (en) * | 2017-01-10 | 2017-06-13 | 广东精规划信息科技股份有限公司 | A kind of time-space relationship analysis system based on multi-source Internet of Things location aware |
CN106844636A (en) * | 2017-01-21 | 2017-06-13 | 亚信蓝涛(江苏)数据科技有限公司 | A kind of unstructured data processing method based on deep learning |
CN107103050A (en) * | 2017-03-31 | 2017-08-29 | 海通安恒(大连)大数据科技有限公司 | A kind of big data Modeling Platform and method |
CN107291951B (en) * | 2017-07-24 | 2020-09-29 | 北京都在哪智慧城市科技有限公司 | Data processing method, device, storage medium and processor |
CN107291951A (en) * | 2017-07-24 | 2017-10-24 | 北京都在哪智慧城市科技有限公司 | Data processing method, device, storage medium and processor |
CN107590225A (en) * | 2017-09-05 | 2018-01-16 | 江苏电力信息技术有限公司 | A kind of Visualized management system based on distributed data digging algorithm |
CN107465692A (en) * | 2017-09-15 | 2017-12-12 | 湖北省楚天云有限公司 | Unification user identity identifying method, system and storage medium |
CN107465692B (en) * | 2017-09-15 | 2019-12-20 | 湖北省楚天云有限公司 | Unified user identity authentication method, system and storage medium |
CN107656995A (en) * | 2017-09-20 | 2018-02-02 | 温州市鹿城区中津先进科技研究院 | Towards the data management system of big data |
CN107807956A (en) * | 2017-09-30 | 2018-03-16 | 平安科技(深圳)有限公司 | Electronic installation, data processing method and computer-readable recording medium |
CN107909493B (en) * | 2017-12-04 | 2020-07-17 | 泰康保险集团股份有限公司 | Policy information processing method and device, computer equipment and storage medium |
CN107909493A (en) * | 2017-12-04 | 2018-04-13 | 泰康保险集团股份有限公司 | Policy information processing method, device, computer equipment and storage medium |
CN108052574A (en) * | 2017-12-08 | 2018-05-18 | 南京中新赛克科技有限责任公司 | Slave ftp server based on Kafka technologies imports the ETL system and implementation method of mass data |
CN108121508A (en) * | 2017-12-15 | 2018-06-05 | 华中师范大学 | Multi-source heterogeneous data collecting system and processing method based on education big data |
CN108459842A (en) * | 2018-01-29 | 2018-08-28 | 北京奇艺世纪科技有限公司 | A kind of model configuration method, device and electronic equipment |
CN108459842B (en) * | 2018-01-29 | 2021-05-14 | 北京奇艺世纪科技有限公司 | Model configuration method and device and electronic equipment |
CN108520003A (en) * | 2018-03-12 | 2018-09-11 | 新华三大数据技术有限公司 | A kind of storing process scheduling system and method |
CN108595480A (en) * | 2018-03-13 | 2018-09-28 | 广州市优普科技有限公司 | A kind of big data ETL tool systems and application process based on cloud computing |
CN108595480B (en) * | 2018-03-13 | 2022-01-21 | 广州市优普科技有限公司 | Big data ETL tool system based on cloud computing and application method |
CN110377598A (en) * | 2018-04-11 | 2019-10-25 | 西安邮电大学 | A kind of multi-source heterogeneous date storage method based on intelligence manufacture process |
CN110427357A (en) * | 2018-04-28 | 2019-11-08 | 新疆金风科技股份有限公司 | Anemometer tower data processing method and device |
CN110597801B (en) * | 2018-05-23 | 2021-09-17 | 杭州海康威视数字技术股份有限公司 | Database system and establishing method and device thereof |
CN110597801A (en) * | 2018-05-23 | 2019-12-20 | 杭州海康威视数字技术股份有限公司 | Database system and establishing method and device thereof |
WO2019223601A1 (en) * | 2018-05-23 | 2019-11-28 | 杭州海康威视数字技术股份有限公司 | Database system, and establishment method and apparatus therefor |
CN109033174A (en) * | 2018-06-21 | 2018-12-18 | 北京国网信通埃森哲信息技术有限公司 | A kind of power quality data cleaning method and device |
CN109145031A (en) * | 2018-08-20 | 2019-01-04 | 国网安徽省电力有限公司合肥供电公司 | A kind of multi-source data multidimensional reconstructing method of service-oriented market access demand |
CN109033454A (en) * | 2018-08-27 | 2018-12-18 | 广东电网有限责任公司 | Data filling method, apparatus, equipment and storage medium based on attributes similarity |
CN109635026A (en) * | 2018-11-29 | 2019-04-16 | 宝晟(广州)生物信息技术有限公司 | A kind of biological sample bank data distributing nodes sharing method, system and device |
WO2020135048A1 (en) * | 2018-12-29 | 2020-07-02 | 颖投信息科技(上海)有限公司 | Data merging method and apparatus for knowledge graph |
CN109800220A (en) * | 2019-01-29 | 2019-05-24 | 浙江国贸云商企业服务有限公司 | A kind of big data cleaning method, system and relevant apparatus |
CN110059952A (en) * | 2019-04-12 | 2019-07-26 | 中国人民财产保险股份有限公司 | Vehicle insurance methods of risk assessment, device, equipment and storage medium |
CN110347480A (en) * | 2019-06-26 | 2019-10-18 | 联动优势科技有限公司 | The preferred access path method and device of data source containing coincidence data item label |
CN110309152A (en) * | 2019-06-26 | 2019-10-08 | 广州探迹科技有限公司 | A kind of date storage method and device based on HBase |
CN110347480B (en) * | 2019-06-26 | 2021-06-25 | 联动优势科技有限公司 | Data source preferred access path method and device containing coincident data item label |
CN110457300A (en) * | 2019-07-15 | 2019-11-15 | 中国平安人寿保险股份有限公司 | A kind of method for cleaning and device, electronic equipment in common test library |
CN110457300B (en) * | 2019-07-15 | 2024-02-02 | 中国平安人寿保险股份有限公司 | Method and device for cleaning public test library and electronic equipment |
CN111126661B (en) * | 2019-11-21 | 2023-11-24 | 格创东智(深圳)科技有限公司 | Self-help modeling method and system based on data analysis platform |
CN111126661A (en) * | 2019-11-21 | 2020-05-08 | 格创东智(深圳)科技有限公司 | Self-service modeling method and system based on data analysis platform |
CN111200590A (en) * | 2019-12-09 | 2020-05-26 | 杭州安恒信息技术股份有限公司 | Algorithm for checking consistency of multiple period statistical data |
CN111200590B (en) * | 2019-12-09 | 2022-08-19 | 杭州安恒信息技术股份有限公司 | Algorithm for checking consistency of multiple period statistical data |
CN111680082B (en) * | 2020-04-30 | 2023-08-18 | 四川弘智远大科技有限公司 | Government financial data acquisition system and method based on data integration |
CN111680082A (en) * | 2020-04-30 | 2020-09-18 | 四川弘智远大科技有限公司 | Government financial data acquisition system and data acquisition method based on data integration |
WO2022000169A1 (en) * | 2020-06-29 | 2022-01-06 | 深圳大学 | Data analysis method and apparatus spanning data centers, and device and storage medium |
CN112100525B (en) * | 2020-11-02 | 2021-02-12 | 中国人民解放军国防科技大学 | Multi-source heterogeneous aerospace information resource storage method, retrieval method and device |
CN112100525A (en) * | 2020-11-02 | 2020-12-18 | 中国人民解放军国防科技大学 | Multi-source heterogeneous aerospace information resource storage method, retrieval method and device |
CN112506930A (en) * | 2020-12-15 | 2021-03-16 | 北京三维天地科技股份有限公司 | Data insight platform based on machine learning technology |
CN112597225A (en) * | 2020-12-22 | 2021-04-02 | 南京三眼精灵信息技术有限公司 | Data acquisition method and device based on distributed model |
CN112783962B (en) * | 2021-02-01 | 2021-12-28 | 盐城郅联空间科技有限公司 | ETL technology-based time-space big data artificial intelligence analysis method and system |
CN112783962A (en) * | 2021-02-01 | 2021-05-11 | 盐城郅联空间科技有限公司 | ETL technology-based time-space big data artificial intelligence analysis method and system |
CN113360493A (en) * | 2021-07-12 | 2021-09-07 | 兰州领新网络信息科技有限公司 | Innovative entrepreneurship big data service platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956015A (en) | Service platform integration method based on big data | |
CN106709035B (en) | A kind of pretreatment system of electric power multidimensional panoramic view data | |
CN104820670B (en) | A kind of acquisition of power information big data and storage method | |
CN102521386B (en) | Method for grouping space metadata based on cluster storage | |
WO2016101628A1 (en) | Data processing method and device in data modeling | |
CN110502509B (en) | Traffic big data cleaning method based on Hadoop and Spark framework and related device | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
CN102591917B (en) | Data processing method and system and related device | |
CN104809244B (en) | Data digging method and device under a kind of big data environment | |
CN104462222A (en) | Distributed storage method and system for checkpoint vehicle pass data | |
CN105512167A (en) | Multi-business user data managing system based on mixed database and method for same | |
CN104317789A (en) | Method for building passenger social network | |
CN104156403A (en) | Clustering-based big data normal-mode extracting method and system | |
CN104679827A (en) | Big data-based public information association method and mining engine | |
CN106846082B (en) | Travel cold start user product recommendation system and method based on hardware information | |
CN105488211A (en) | Method for determining user group based on feature analysis | |
Scannapieco et al. | Placing big data in official statistics: a big challenge | |
CN102750367A (en) | Big data checking system and method thereof on cloud platform | |
CN105956932A (en) | Distribution and utilization data fusion method and system | |
CN104143006A (en) | Method and device for processing city data | |
CN104615734A (en) | Community management service big data processing system and processing method thereof | |
Karim et al. | Spatiotemporal Aspects of Big Data. | |
CN113254517A (en) | Service providing method based on internet big data | |
CN110597796B (en) | Big data real-time modeling method and system based on full life cycle | |
CN110826845B (en) | Multidimensional combination cost allocation device and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160921 |