CN110175207A

CN110175207A - Scalable big data analysis platform based on Hadoop and Spark

Info

Publication number: CN110175207A
Application number: CN201910463031.3A
Authority: CN
Inventors: 刘昕林; 罗伟峰; 邓巍; 黄萍
Original assignee: Shenzhen Power Supply Co., Ltd.
Current assignee: Shenzhen Power Supply Co., Ltd.
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-08-27

Abstract

The invention discloses a scalable big data analysis platform based on Hadoop and Spark, comprising a plurality of expandable data access modules, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. The expandable data access modules provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module is connected to the expandable data access modules and reads the input data; it comprises an integration unit and a data type extraction unit, wherein the integration unit receives the input data from the expandable data access modules and integrates it into a prefabricated data set. The platform offers faster processing, more accurate prediction, stability, reliability and ease of expansion, and can analyze massive data to help users better extract value from the data.

Description

A scalable big data analysis platform based on Hadoop and Spark

Technical field

The present invention relates to the field of big data analysis, and specifically to a scalable big data analysis platform based on Hadoop and Spark.

Background art

With the rapid development of applications such as the Internet, the mobile Internet and the Internet of Things, the global volume of data is growing explosively. The Digital Universe study published by IDC reports that the total amount of information worldwide doubles every two years, and that the total amount of data created and replicated worldwide in 2011 was 1.8 ZB. IDC expects that over the coming decade (to 2020), the total number of servers owned by IT departments worldwide will be 10 times greater than at present, and the data they manage will be 50 times greater. By 2020, the world is expected to hold 35 ZB of data in total. This rapid growth in data volume indicates that we have entered the era of big data. However, it is not only the scale of data that keeps growing: data types are numerous and real-time processing requirements are high, which greatly increases the complexity of big data processing.

The distributed computing framework Spark is suited to data analysis and mining over massive data. Spark's data structure, the DataFrame, is similar to the dataframe of Python and R: it is a structured data-processing structure with row and column indexes. These characteristics make it convenient to process data accurately. The DataFrame itself carries many APIs for data cleaning, so many complex functions can be implemented with simple calls. However, the analysis platforms currently applied to big data analysis offer only a single scheme, process data inefficiently, and scale poorly.

Summary of the invention

The purpose of the present invention is to provide a scalable big data analysis platform based on Hadoop and Spark that solves the problems raised in the background art above.

To achieve the above purpose, the present invention provides the following technical solution: a scalable big data analysis platform based on Hadoop and Spark, comprising an expandable data access module, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. A plurality of expandable data access modules are provided; they provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module is connected to the plurality of expandable data access modules and reads the input data; it comprises an integration unit and a data type extraction unit. The integration unit receives the input data from the plurality of expandable data access modules and integrates it into a prefabricated data set. The data type extraction unit is connected to the integration unit; it obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.

Preferably, the global data analysis module has a global storage unit. The global data analysis module is connected to the data type extraction unit and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on massive prefabricated data sets, decomposes the results after analysis, and stores them in the global storage unit.

Preferably, the data management module manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol.

Preferably, the stream management module manages the workflows in the platform, supporting creating, deleting, modifying and querying them.

Preferably, the operator management module manages the various Spark operators encapsulated in the platform, which the platform encapsulates graphically; using a classified-catalog management method for Spark DataFrame operators, the operator management module classifies the operators and manages, displays and generates the classified catalog.

Preferably, the operator management module includes a slicing unit, which obtains operator slices at a predetermined period and manages them.

The present invention also provides an analysis method for the above scalable big data analysis platform based on Hadoop and Spark, comprising the following steps:

S1: the expandable data access module provides access to multiple data sources and connects the data sources to the feature extraction module;

S2: the feature extraction module obtains the multiple data sources and integrates and extracts the data;

S3: the global data analysis module performs global analysis on the integrated and extracted data and, after the global analysis, sends the results to the data management module;

S4: the data management module manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol;

S5: the operator management module periodically manages the various Spark operators encapsulated in the platform.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention reduces the workload of manually writing Spark DataFrame scripts. Being based on Hadoop and Spark, it offers faster processing, more accurate prediction, stability and reliability, and ease of expansion, and it can analyze massive data to help users better extract value from the data.

Brief description of the drawings

Fig. 1 is a schematic diagram of the module structure of the present invention;

Fig. 2 is a schematic diagram of the module structure of the feature extraction module of the present invention.

In the figures: 1, expandable data access module; 2, feature extraction module; 21, integration unit; 22, data type extraction unit; 3, global data analysis module; 4, data management module; 5, stream management module; 6, operator management module.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "installed", "connected" and "coupled" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal to two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.

Referring to Figs. 1-2, the present invention provides the following technical solution: a scalable big data analysis platform based on Hadoop and Spark, comprising an expandable data access module 1, a feature extraction module 2, a global data analysis module 3, a data management module 4, a stream management module 5 and an operator management module 6. A plurality of expandable data access modules 1 are provided; they provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module 2 is connected to the plurality of expandable data access modules 1 and reads the input data; it comprises an integration unit 21 and a data type extraction unit 22. The integration unit 21 receives the input data from the plurality of expandable data access modules and integrates it into a prefabricated data set. The data type extraction unit 22 is connected to the integration unit 21; it obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.
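The cooperation of the integration unit and the data type extraction unit can be sketched in plain Python. This is only an illustration under stated assumptions: the patent does not disclose the integration or inference rules, so the functions `integrate`, `infer_type` and `mark_types`, the three type labels, and the rule "try integer, then float, else string" are all hypothetical.

```python
from __future__ import annotations

from typing import Any


def integrate(sources: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Merge rows from several sources into one data set with a shared column list."""
    columns: list[str] = []
    for rows in sources:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Columns that a source does not provide are filled with None.
    return [{name: row.get(name) for name in columns} for rows in sources for row in rows]


def infer_type(values: list[Any]) -> str:
    """Guess a column type from its non-null values (assumed rule, not from the patent)."""
    present = [str(v) for v in values if v is not None]
    if not present:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for text in present:
                cast(text)
            return label
        except ValueError:
            continue
    return "string"


def mark_types(dataset: list[dict[str, Any]]) -> dict[str, str]:
    """Label every column of the integrated data set with its inferred type."""
    columns = list(dataset[0].keys()) if dataset else []
    return {name: infer_type([row[name] for row in dataset]) for name in columns}
```

Here each source is modelled as a list of row dictionaries; in a Spark deployment the same two stages would more likely operate on DataFrames read from the respective storage systems.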

The global data analysis module 3 has a global storage unit. The global data analysis module 3 is connected to the data type extraction unit 22 and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on massive prefabricated data sets, decomposes the results after analysis, and stores them in the global storage unit.
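The difference between batch and online global analysis can be illustrated with a deliberately small sketch. The patent does not name the statistic being computed; a column mean is assumed here purely as a stand-in, and `batch_mean` and `RunningMean` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


def batch_mean(values: list[float]) -> float:
    """Batch analysis: the whole data set is available before computing."""
    return sum(values) / len(values)


@dataclass
class RunningMean:
    """Online analysis: state is kept in memory and updated once per new record."""

    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental update; no earlier record needs to be revisited.
        self.mean += (value - self.mean) / self.count
```

Both paths arrive at the same value; the online variant trades a full pass over the data for a constant-size state that is refreshed as records arrive.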

The data management module 4 manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol.
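The patent does not say which HTTP interface carries the upload. One possibility, assumed here, is Hadoop's WebHDFS REST API, in which a file is created with an HTTP PUT to a `/webhdfs/v1/<path>?op=CREATE` URL. The sketch below only builds that URL; the helper name `webhdfs_create_url`, the host name, the port, the path and the user are illustrative assumptions.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host: str, port: int, hdfs_path: str, user: str,
                       overwrite: bool = False) -> str:
    """Build the WebHDFS URL for creating a file at hdfs_path (assumed interface)."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode({
        "op": "CREATE",
        "overwrite": str(overwrite).lower(),
        "user.name": user,
    })
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"
```

With WebHDFS, the first PUT to the NameNode is answered with a redirect to a DataNode, and the file content is sent in a second PUT to that location, so a real upload needs an HTTP client that handles both requests.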

The stream management module 5 manages the workflows in the platform, supporting creating, deleting, modifying and querying them.
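The four workflow operations can be sketched as a small in-memory registry. The patent does not describe how workflows are represented or stored; modelling a workflow as a named list of step names, and the class name `WorkflowRegistry`, are assumptions made only for illustration.

```python
from __future__ import annotations


class WorkflowRegistry:
    """Hypothetical in-memory store offering create, delete, update and query."""

    def __init__(self) -> None:
        self._workflows: dict[str, list[str]] = {}

    def create(self, name: str, steps: list[str]) -> None:
        if name in self._workflows:
            raise ValueError(f"workflow {name!r} already exists")
        self._workflows[name] = list(steps)

    def delete(self, name: str) -> None:
        del self._workflows[name]  # raises KeyError if the workflow is unknown

    def update(self, name: str, steps: list[str]) -> None:
        if name not in self._workflows:
            raise KeyError(name)
        self._workflows[name] = list(steps)

    def query(self, name: str) -> list[str]:
        # Return a copy so callers cannot change the stored workflow by accident.
        return list(self._workflows[name])

    def names(self) -> list[str]:
        return sorted(self._workflows)
```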

The operator management module 6 manages the various Spark operators encapsulated in the platform, which the platform encapsulates graphically; using a classified-catalog management method for Spark DataFrame operators, the operator management module 6 classifies the operators and manages, displays and generates the classified catalog.
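A classified catalog of operators can be sketched as a mapping from category to operator names. The category names below and the class `OperatorCatalog` are hypothetical; `dropna`, `dropDuplicates` and `groupBy` are used only as familiar examples of Spark DataFrame methods, and the patent does not list which operators the platform wraps.

```python
from __future__ import annotations

from collections import defaultdict


class OperatorCatalog:
    """Hypothetical registry that groups operator names by category."""

    def __init__(self) -> None:
        self._by_category: dict[str, list[str]] = defaultdict(list)

    def register(self, category: str, operator: str) -> None:
        if operator not in self._by_category[category]:
            self._by_category[category].append(operator)

    def catalog(self) -> dict[str, list[str]]:
        """Generate the classified catalog with categories and operators sorted."""
        return {cat: sorted(ops) for cat, ops in sorted(self._by_category.items())}

    def render(self) -> str:
        """Produce a plain-text listing suitable for display."""
        lines: list[str] = []
        for category, operators in self.catalog().items():
            lines.append(category)
            lines.extend(f"  {name}" for name in operators)
        return "\n".join(lines)
```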

The operator management module 6 includes a slicing unit, which obtains operator slices at a predetermined period and manages them.

An analysis method of the scalable big data analysis platform based on Hadoop and Spark comprises the following steps:

S1: the expandable data access module 1 provides access to multiple data sources and connects the data sources to the feature extraction module 2;

S2: the feature extraction module 2 obtains the multiple data sources and integrates and extracts the data;

S3: the global data analysis module 3 performs global analysis on the integrated and extracted data and, after the global analysis, sends the results to the data management module 4;

S4: the data management module 4 manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol;

S5: the operator management module 6 periodically manages the various Spark operators encapsulated in the platform.
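The data path of steps S1 to S4 can be sketched as a single function that chains the stages. This is a simplification under stated assumptions: sources are modelled as lists of row dictionaries, the analysis and the upload are passed in as callables, and the name `run_analysis` is hypothetical. Step S5 is omitted because operator management runs periodically and is not part of the per-data-set path.

```python
from __future__ import annotations

from typing import Any, Callable

Row = dict[str, Any]


def run_analysis(
    sources: list[list[Row]],
    analyse: Callable[[list[Row]], dict[str, Any]],
    upload: Callable[[dict[str, Any]], None],
) -> dict[str, Any]:
    """Chain the stages: access and integrate (S1, S2), analyse (S3), upload (S4)."""
    # S1/S2: rows from every connected source are gathered into one data set.
    dataset = [row for rows in sources for row in rows]
    # S3: global analysis over the integrated data.
    result = analyse(dataset)
    # S4: hand the result to the data management stage for upload.
    upload(result)
    return result
```

Passing the analysis and upload steps as parameters keeps the sketch independent of any particular Spark job or storage interface.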

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims

1. A scalable big data analysis platform based on Hadoop and Spark, characterized in that it comprises an expandable data access module (1), a feature extraction module (2), a global data analysis module (3), a data management module (4), a stream management module (5) and an operator management module (6), wherein a plurality of expandable data access modules (1) are provided, which provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database;

the feature extraction module (2) is connected to the plurality of expandable data access modules (1) and reads the input data; it comprises an integration unit (21) and a data type extraction unit (22); the integration unit (21) receives the input data from the plurality of expandable data access modules and integrates it into a prefabricated data set; the data type extraction unit (22) is connected to the integration unit (21), obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.

2. The scalable big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the global data analysis module (3) has a global storage unit; the global data analysis module (3) is connected to the data type extraction unit (22) and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on massive prefabricated data sets, decomposes the results after analysis, and stores them in the global storage unit.

3. The scalable big data analysis platform based on Hadoop and Spark according to claim 1 or 2, characterized in that the data management module (4) manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol.

4. The scalable big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the stream management module (5) manages the workflows in the platform, supporting creating, deleting, modifying and querying them.

5. The scalable big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the operator management module (6) manages the various Spark operators encapsulated in the platform, which the platform encapsulates graphically; using a classified-catalog management method for Spark DataFrame operators, the operator management module (6) classifies the operators and manages, displays and generates the classified catalog.

6. The scalable big data analysis platform based on Hadoop and Spark according to claim 5, characterized in that the operator management module (6) includes a slicing unit, which obtains operator slices at a predetermined period and manages them.

7. An analysis method of the scalable big data analysis platform based on Hadoop and Spark according to any one of claims 1-6, characterized in that it comprises the following steps:

S1: the expandable data access module (1) provides access to multiple data sources and connects the data sources to the feature extraction module (2);

S2: the feature extraction module (2) obtains the multiple data sources and integrates and extracts the data;

S3: the global data analysis module (3) performs global analysis on the integrated and extracted data and, after the global analysis, sends the results to the data management module (4);

S4: the data management module (4) manages the data in the global storage unit and uploads the data to the distributed file system of the platform via the HTTP protocol;

S5: the operator management module (6) periodically manages the various Spark operators encapsulated in the platform.