CN110175207A - Expandability big data analysis platform based on Hadoop and Spark - Google Patents


Info

Publication number
CN110175207A
Authority
CN
China
Prior art keywords
data
module
analysis
global
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910463031.3A
Other languages
Chinese (zh)
Inventor
刘昕林
罗伟峰
邓巍
黄萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Co ltd
Original Assignee
Shenzhen Power Supply Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-05-30
Filing date
2019-05-30
Publication date
2019-08-27
Application filed by Shenzhen Power Supply Co ltd
Priority to CN201910463031.3A
Publication of CN110175207A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/256 Integrating or interfacing systems involving database management systems in federated or virtual databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an extensible big data analysis platform based on Hadoop and Spark, comprising a plurality of extensible data access modules, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. The extensible data access modules provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module is connected to the extensible data access modules and reads the input data; it comprises an integration unit and a data type extraction unit, wherein the integration unit receives the input data from the extensible data access modules and integrates it into a prefabricated data set. The platform is characterized by faster processing, more accurate prediction, stability, reliability and ease of expansion, and can analyze massive data to help users better obtain value from the data.

Description

An extensible big data analysis platform based on Hadoop and Spark
Technical field
The present invention relates to the field of big data analysis, and in particular to an extensible big data analysis platform based on Hadoop and Spark.
Background
With the rapid development of applications such as the Internet, the mobile Internet and the Internet of Things, the global volume of data is growing explosively. According to the Digital Universe study published by IDC, the total amount of information worldwide doubles every two years, and the total amount of data created and replicated worldwide in 2011 was 1.8 ZB. IDC estimates that by 2020 the total number of servers owned by IT departments worldwide will be 10 times greater than at present, and the data they manage 50 times greater. By 2020, the world is expected to hold 35 ZB of data in total. This rapid growth in data volume indicates that we have entered the era of big data. However, it is not only the scale of data that keeps growing: data types are diverse and the demand for real-time processing is high, which greatly increases the complexity of big data processing.
The distributed computing framework Spark is suited to data analysis and mining on massive data. Spark's data structure, the DataFrame, is similar to the DataFrame of Python and the R language: it is a structured data processing structure with a row index and a column index. Owing to these characteristics, data can be processed conveniently and accurately; the DataFrame itself provides many APIs for data cleaning, so many complex functions can be implemented through simple calls. Analysis platforms currently applied to big data analysis offer only a single scheme, have low data processing efficiency, and scale poorly.
Summary of the invention
The purpose of the present invention is to provide an extensible big data analysis platform based on Hadoop and Spark, so as to solve the problems raised in the background above.
To achieve the above object, the present invention provides the following technical solution: an extensible big data analysis platform based on Hadoop and Spark, comprising extensible data access modules, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. A plurality of extensible data access modules are provided; they provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module is connected to the plurality of extensible data access modules and reads the input data; it comprises an integration unit and a data type extraction unit. The integration unit receives the input data from the extensible data access modules and integrates it into a prefabricated data set. The data type extraction unit is connected to the integration unit; it obtains the prefabricated data set, infers the data type of each column in the set, and labels each column with its data type.
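The idea of several pluggable data access modules feeding one downstream consumer can be sketched as a small registry. This is only an illustration in plain Python, not the platform's implementation: the class `DataAccessRegistry`, the two reader functions and the `_source` field are hypothetical names, and the readers return canned rows instead of contacting a real file system or database.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Reader = Callable[[], Iterable[Row]]


class DataAccessRegistry:
    """Hypothetical registry; each registered reader stands in for one data access module."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def register(self, source_name: str, reader: Reader) -> None:
        # Adding a new data source only requires registering one more reader.
        if source_name in self._readers:
            raise ValueError(f"source already registered: {source_name}")
        self._readers[source_name] = reader

    def read_all(self) -> List[Row]:
        # Tag every row with the source it came from so later stages can trace it.
        rows: List[Row] = []
        for source_name, reader in self._readers.items():
            for row in reader():
                rows.append({**row, "_source": source_name})
        return rows


def read_from_file_system() -> List[Row]:
    # Assumption: canned rows standing in for a distributed file system read.
    return [{"meter_id": "1", "reading": "3.5"}]


def read_from_relational_db() -> List[Row]:
    # Assumption: canned rows standing in for a relational database query.
    return [{"meter_id": "2", "region": "north"}]


registry = DataAccessRegistry()
registry.register("distributed_file_system", read_from_file_system)
registry.register("relational_database", read_from_relational_db)
```

Calling `registry.read_all()` returns the rows of every registered source in registration order; supporting a columnar database would, under this sketch, mean registering a third reader.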
Preferably, the global data analysis module has a global storage unit. The global data analysis module is connected to the data type extraction unit and performs batch global analysis or online real-time global analysis on the prefabricated data set. It performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the results and stores them in the global storage unit.
Preferably, the data management module manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol.
Preferably, the stream management module manages the workflows in the platform, supporting creation, deletion, modification and query.
Preferably, the operator management module manages the various Spark operators encapsulated in the platform, and the platform encapsulates these operators graphically. Using a classified-catalog management method for Spark DataFrame operators, the operator management module classifies the operators, manages and displays them, and generates a classified catalog.
Preferably, the operator management module includes a slice unit, which obtains operator slices at a predetermined period and manages them.
The present invention also provides an analysis method for the above extensible big data analysis platform based on Hadoop and Spark, comprising the following steps:
S1: the extensible data access modules provide access to multiple data sources and connect the data sources to the feature extraction module;
S2: the feature extraction module obtains the multiple data sources and integrates and extracts them;
S3: the global data analysis module performs global analysis on the integrated and extracted data and, after the analysis, sends the results to the data management module;
S4: the data management module manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol;
S5: the operator management module periodically manages the various Spark operators encapsulated in the platform.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention reduces the workload of manually writing Spark DataFrame scripts. Being based on Hadoop and Spark, it offers faster processing and more accurate prediction, is stable, reliable and easy to extend, and can analyze massive data to help users better obtain value from the data.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the present invention;
Fig. 2 is a schematic diagram of the module structure of the feature extraction module of the present invention.
In the figures: 1, extensible data access module; 2, feature extraction module; 21, integration unit; 22, data type extraction unit; 3, global data analysis module; 4, data management module; 5, stream management module; 6, operator management module.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified or limited, the terms "arranged", "installed", "connected" and "coupled" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary, or it may be an internal connection between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
Referring to Figs. 1-2, the present invention provides a technical solution: an extensible big data analysis platform based on Hadoop and Spark, comprising extensible data access modules 1, a feature extraction module 2, a global data analysis module 3, a data management module 4, a stream management module 5 and an operator management module 6. A plurality of extensible data access modules 1 are provided; they provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database. The feature extraction module 2 is connected to the plurality of extensible data access modules 1 and reads the input data; it comprises an integration unit 21 and a data type extraction unit 22. The integration unit 21 receives the input data from the extensible data access modules and integrates it into a prefabricated data set. The data type extraction unit 22 is connected to the integration unit 21; it obtains the prefabricated data set, infers the data type of each column in the set, and labels each column with its data type.
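The two units of the feature extraction module can be illustrated with a minimal sketch in plain Python. It assumes that input rows arrive as dictionaries of strings and that only three type labels (`int`, `float`, `string`) are needed; the function names `integrate`, `infer_type` and `label_columns` are hypothetical, and the patent does not specify the inference rule used here.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def integrate(sources: List[List[Dict[str, str]]]) -> List[Row]:
    """Integration unit sketch: merge all rows and align them on the union of columns."""
    columns: List[str] = []
    for rows in sources:
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    # Columns missing from a row are filled with None.
    return [{c: row.get(c) for c in columns} for rows in sources for row in rows]


def infer_type(values: List[Optional[str]]) -> str:
    """Return the narrowest of 'int', 'float', 'string' that fits every non-empty value."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "string"
    for type_name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def label_columns(dataset: List[Row]) -> Dict[str, str]:
    """Data type extraction unit sketch: label each column with its inferred type."""
    if not dataset:
        return {}
    return {c: infer_type([row[c] for row in dataset]) for c in dataset[0]}


sources = [
    [{"meter_id": "1", "reading": "3.5"}],
    [{"meter_id": "2", "region": "north"}],
]
dataset = integrate(sources)
column_types = label_columns(dataset)
```

With the sample rows above, `meter_id` is labeled `int`, `reading` is labeled `float` (the missing value in the second row is ignored), and `region` is labeled `string`.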
The global data analysis module 3 has a global storage unit. The global data analysis module 3 is connected to the data type extraction unit 22 and performs batch global analysis or online real-time global analysis on the prefabricated data set. It performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the results and stores them in the global storage unit.
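The difference between the batch mode and the online real-time mode can be shown with a deliberately small example. The patent does not say which statistics or algorithms the module computes, so the mean used here is only a stand-in, and `batch_mean` and `OnlineMean` are hypothetical names. The batch version needs the whole data set at once, while the online version keeps a small in-memory state and updates it as each value arrives.

```python
from typing import List


def batch_mean(values: List[float]) -> float:
    """Batch mode: the full data set is available before the analysis starts."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


class OnlineMean:
    """Online mode: update an in-memory running mean one value at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [2.0, 4.0, 6.0]
online = OnlineMean()
for reading in readings:
    online.update(reading)
```

After all three readings have been consumed, the online state agrees with `batch_mean(readings)`, which is the property that lets the same analysis run in either mode.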
The data management module 4 manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol.
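The patent does not name the HTTP interface used for the upload. One plausible candidate, given that the platform is built on Hadoop, is the WebHDFS REST API of HDFS; that choice is an assumption of this sketch, as are the function name `webhdfs_create_request`, the host name, the port and the path. The code only builds the first-step `CREATE` request and does not send anything over the network.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def webhdfs_create_request(namenode: str, hdfs_path: str, user: str,
                           overwrite: bool = False) -> Request:
    """Build (but do not send) a WebHDFS CREATE request for the given absolute path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    # quote() keeps the slashes of the path and escapes other unsafe characters.
    url = f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"
    return Request(url, method="PUT")


# Assumption: example host, port, path and user, chosen only for illustration.
request = webhdfs_create_request(
    "namenode.example:9870", "/platform/results/part-0.csv", "analyst"
)
```

In WebHDFS, file creation is a two-step exchange: the NameNode answers this first `PUT` with a redirect to a DataNode, and the file content is sent in a second `PUT` to the redirected location. A real client would also need error handling and whatever authentication the cluster requires.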
The stream management module 5 manages the workflows in the platform, supporting creation, deletion, modification and query.
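The four workflow operations can be sketched as an in-memory store. The patent does not describe how workflows are represented, so the `Workflow` record (a name plus an ordered list of step names) and the `WorkflowStore` class are hypothetical, and a real platform would persist workflows rather than keep them in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Workflow:
    """Hypothetical workflow record: a name and an ordered list of step names."""
    name: str
    steps: List[str] = field(default_factory=list)


class WorkflowStore:
    """In-memory create / delete / update / query for workflows."""

    def __init__(self) -> None:
        self._items: Dict[str, Workflow] = {}

    def create(self, workflow: Workflow) -> None:
        if workflow.name in self._items:
            raise KeyError(f"workflow already exists: {workflow.name}")
        self._items[workflow.name] = workflow

    def delete(self, name: str) -> None:
        # Raises KeyError if the workflow does not exist.
        del self._items[name]

    def update(self, name: str, steps: List[str]) -> None:
        # Raises KeyError if the workflow does not exist.
        self._items[name].steps = list(steps)

    def query(self, name: str) -> Optional[Workflow]:
        return self._items.get(name)


store = WorkflowStore()
store.create(Workflow("daily_load_report", ["read", "clean"]))
store.update("daily_load_report", ["read", "clean", "aggregate"])
```

Querying `daily_load_report` after the update returns the workflow with three steps; deleting it makes a later query return `None`.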
The operator management module 6 manages the various Spark operators encapsulated in the platform, and the platform encapsulates these operators graphically. Using a classified-catalog management method for Spark DataFrame operators, the operator management module 6 classifies the operators, manages and displays them, and generates a classified catalog.
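A classified catalog of operators can be sketched as a mapping from category to operator names. The operator names below (`dropna`, `fillna`, `dropDuplicates`, `select`, `filter`, `groupBy`, `join`) are existing PySpark DataFrame methods, but the category names and the `OperatorCatalog` class are assumptions made for illustration; the patent does not list its categories.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List


class OperatorCatalog:
    """Hypothetical catalog that groups operator names under categories."""

    def __init__(self) -> None:
        self._by_category: DefaultDict[str, List[str]] = defaultdict(list)

    def register(self, category: str, operator_name: str) -> None:
        # Ignore repeated registrations of the same operator in a category.
        if operator_name not in self._by_category[category]:
            self._by_category[category].append(operator_name)

    def catalog(self) -> Dict[str, List[str]]:
        # Sorted output gives a stable listing for display.
        return {
            category: sorted(names)
            for category, names in sorted(self._by_category.items())
        }


operators = OperatorCatalog()
for category, name in [
    ("cleaning", "dropna"),
    ("cleaning", "fillna"),
    ("cleaning", "dropDuplicates"),
    ("selection", "select"),
    ("selection", "filter"),
    ("aggregation", "groupBy"),
    ("combination", "join"),
]:
    operators.register(category, name)
```

`operators.catalog()` returns the categories in alphabetical order with the operators sorted inside each one, which is the kind of listing a graphical front end could render as a tree.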
The operator management module 6 includes a slice unit, which obtains operator slices at a predetermined period and manages them.
An analysis method for the extensible big data analysis platform based on Hadoop and Spark comprises the following steps:
S1: the extensible data access modules 1 provide access to multiple data sources and connect the data sources to the feature extraction module 2;
S2: the feature extraction module 2 obtains the multiple data sources and integrates and extracts them;
S3: the global data analysis module 3 performs global analysis on the integrated and extracted data and, after the analysis, sends the results to the data management module 4;
S4: the data management module 4 manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol;
S5: the operator management module 6 periodically manages the various Spark operators encapsulated in the platform.
The present invention reduces the workload of manually writing Spark DataFrame scripts. Being based on Hadoop and Spark, it offers faster processing and more accurate prediction, is stable, reliable and easy to extend, and can analyze massive data to help users better obtain value from the data.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims (7)

1. An extensible big data analysis platform based on Hadoop and Spark, characterized by comprising extensible data access modules (1), a feature extraction module (2), a global data analysis module (3), a data management module (4), a stream management module (5) and an operator management module (6), wherein a plurality of the extensible data access modules (1) are provided, which provide access to multiple data sources, input data, and support the data storage modes of a distributed file system, a columnar database and a relational database;
the feature extraction module (2) is connected to the plurality of extensible data access modules (1) and reads the input data, and comprises an integration unit (21) and a data type extraction unit (22); the integration unit (21) receives the input data from the plurality of extensible data access modules and integrates it into a prefabricated data set; the data type extraction unit (22) is connected to the integration unit (21), obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.
2. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the global data analysis module (3) has a global storage unit; the global data analysis module (3) is connected to the data type extraction unit (22) and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the results and stores them in the global storage unit.
3. The extensible big data analysis platform based on Hadoop and Spark according to claim 1 or 2, characterized in that the data management module (4) manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol.
4. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the stream management module (5) manages the workflows in the platform, supporting creation, deletion, modification and query.
5. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the operator management module (6) manages the various Spark operators encapsulated in the platform, and the platform encapsulates these operators graphically; using a classified-catalog management method for Spark DataFrame operators, the operator management module (6) classifies the operators, manages and displays them, and generates a classified catalog.
6. The extensible big data analysis platform based on Hadoop and Spark according to claim 5, characterized in that the operator management module (6) includes a slice unit, which obtains operator slices at a predetermined period and manages them.
7. An analysis method for the extensible big data analysis platform based on Hadoop and Spark according to any one of claims 1-6, characterized by comprising the following steps:
S1: the extensible data access modules (1) provide access to multiple data sources and connect the data sources to the feature extraction module (2);
S2: the feature extraction module (2) obtains the multiple data sources and integrates and extracts them;
S3: the global data analysis module (3) performs global analysis on the integrated and extracted data and, after the analysis, sends the results to the data management module (4);
S4: the data management module (4) manages the data in the global storage unit and uploads the data to the platform's distributed file system over the HTTP protocol;
S5: the operator management module (6) periodically manages the various Spark operators encapsulated in the platform.
CN201910463031.3A (priority date 2019-05-30, filing date 2019-05-30): Expandability big data analysis platform based on Hadoop and Spark. Status: Pending. Published as CN110175207A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910463031.3A CN110175207A (en) 2019-05-30 2019-05-30 Expandability big data analysis platform based on Hadoop and Spark

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910463031.3A CN110175207A (en) 2019-05-30 2019-05-30 Expandability big data analysis platform based on Hadoop and Spark

Publications (1)

Publication Number Publication Date
CN110175207A true CN110175207A (en) 2019-08-27

Family

ID=67696620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910463031.3A Pending CN110175207A (en) 2019-05-30 2019-05-30 Expandability big data analysis platform based on Hadoop and Spark

Country Status (1)

Country Link
CN (1) CN110175207A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808746A (en) * 2016-03-14 2016-07-27 中国科学院计算技术研究所 Relational big data seamless access method and system based on Hadoop system
CN106951497A (en) * 2017-03-15 2017-07-14 深圳市德信软件有限公司 A kind of method and system based on Hadoop framework data analysis diagrammatic representation
CN107220310A (en) * 2017-05-11 2017-09-29 中国联合网络通信集团有限公司 A kind of database data management system, method and device
CN107229976A (en) * 2017-06-08 2017-10-03 郑州云海信息技术有限公司 A kind of distributed machines learning system based on spark
CN107526600A (en) * 2017-09-05 2017-12-29 成都优易数据有限公司 A kind of visual numeric simulation analysis platform and its data cleaning method based on hadoop and spark

Similar Documents

Publication Publication Date Title
US10372705B2 (en) Parallel querying of adjustable resolution geospatial database
Zhang et al. Subject clustering analysis based on ISI category classification
Wang et al. Winter wheat yield prediction using an LSTM model from MODIS LAI products
CN108446293A (en) A method of based on urban multi-source isomeric data structure city portrait
CN106202430A (en) Live platform user interest-degree digging system based on correlation rule and method for digging
Iqbal et al. Drones for flood monitoring, mapping and detection: A bibliometric review
CN106294644A (en) A kind of magnanimity time series data collection and treatment device based on big data technique and method
CN104915535A (en) Biomass population dynamics predictive parsing worldwide general key factor presupposing array platform
US11360970B2 (en) Efficient querying using overview layers of geospatial-temporal data in a data analytics platform
CN108197182A (en) A kind of data atlas analysis system and method
CN110968636A (en) Multi-dimensional big data analysis and processing system for earthquake early warning
Dettki et al. Wireless remote animal monitoring (WRAM)-A new international database e-infrastructure for management and sharing of telemetry sensor data from fish and wildlife
CN113486005A (en) Space science satellite big data organization and query method under heterogeneous structure
Li et al. Antarctic surface ice velocity retrieval from MODIS-based mosaic of Antarctica (MOA)
Liu et al. Review of Land Use Change Detection—A Method Combining Machine Learning and Bibliometric Analysis
Patel et al. Effective motion sensors and deep learning techniques for unmanned ground vehicle (UGV)-based automated pavement layer change detection in road construction
Wöllauer et al. TubeDB: An on-demand processing database system for climate station data
Croce et al. Fixed and mobile low-cost sensing approaches for microclimate monitoring in urban areas: A preliminary study in the city of Bolzano (Italy)
CN115391545A (en) Knowledge graph construction method and device for multi-platform collaborative observation task
Trillo-Montero et al. Design and Development of a Relational Database Management System (RDBMS) with Open Source Tools for the Processing of Data Monitored in a Set of Photovoltaic (PV) Plants
CN110175207A (en) Expandability big data analysis platform based on Hadoop and Spark
Toh et al. Sequential data processing for IMERG satellite rainfall comparison and improvement using LSTM and ADAM optimizer
Bhaduri et al. Distributed Anomaly Detection using Satellite Data From Multiple Modalitie.
KR101545998B1 (en) Method for Management Integration of Runoff-Hydraulic Model Data and System thereof
Liu et al. Study on the prediction of cotton yield within field scale with time series hyperspectral imagery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190827
