CN110175207A - Expandability big data analysis platform based on Hadoop and Spark
- Publication number
- CN110175207A (application number CN201910463031.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- analysis
- global
- spark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
- G06F16/256—Integrating or interfacing systems involving database management systems in federated or virtual databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an extensible big data analysis platform based on Hadoop and Spark, comprising a plurality of extensible data access modules, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. The extensible data access modules provide access to multiple data sources, input the data, and support the data storage modes of distributed file systems, columnar databases and relational databases. The feature extraction module is connected to the extensible data access modules and reads the input data; it comprises an integration unit and a data type extraction unit, wherein the integration unit receives the input data of the extensible data access modules and integrates it into a prefabricated data set. The platform offers faster processing, more accurate prediction, stability, reliability and easy extension, and can analyze massive data to help users better extract value from their data.
Description
Technical field
The present invention relates to the field of big data analysis, and specifically to an extensible big data analysis platform based on Hadoop and Spark.
Background art
With the rapid development of applications such as the Internet, the mobile Internet and the Internet of Things, the global volume of data is growing explosively. According to the Digital Universe study published by IDC, the total amount of information worldwide doubles every two years, and 1.8 ZB of data was created and replicated worldwide in 2011. IDC expects that over the coming decade (to 2020) the total number of servers owned by IT departments worldwide will grow to 10 times the current number, and the data they manage to 50 times the current amount. By 2020 the world is expected to hold 35 ZB of data in total. This surge in data volume indicates that we have entered the era of big data. However, it is not only the scale of data that keeps growing: the variety of data types and the demand for real-time processing have also greatly increased the complexity of big data processing.
The distributed computing framework Spark is suited to data analysis and mining over massive data. Spark's DataFrame data structure is similar to the dataframe of Python and R: it is a structured data-processing structure with row and column indexes. These characteristics make it possible to process data conveniently and accurately. The DataFrame itself carries many APIs for data cleaning, so many complex functions can be implemented with simple calls. However, the analysis platforms currently applied to big data analysis offer only a single scheme, process data inefficiently, and are poorly extensible.
Summary of the invention
The purpose of the present invention is to provide an extensible big data analysis platform based on Hadoop and Spark, so as to solve the problems raised in the background art above.
To achieve this purpose, the present invention provides the following technical solution: an extensible big data analysis platform based on Hadoop and Spark, comprising an extensible data access module, a feature extraction module, a global data analysis module, a data management module, a stream management module and an operator management module. A plurality of extensible data access modules are provided; they provide access to multiple data sources, input the data, and support the data storage modes of distributed file systems, columnar databases and relational databases. The feature extraction module is connected to the plurality of extensible data access modules and reads the input data; it comprises an integration unit and a data type extraction unit. The integration unit receives the input data of the extensible data access modules and integrates it into a prefabricated data set. The data type extraction unit is connected to the integration unit; it obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.
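As an illustration only, the way several pluggable data access modules might feed one integration unit can be sketched in plain Python. Every name below (`DataAccessModule`, `CsvTextSource`, `InMemoryRowsSource`, `integrate`) is hypothetical and not taken from the patent, and the in-memory sources merely stand in for a distributed file system, a columnar database or a relational database.

```python
import csv
import io
from typing import Dict, Iterable, List, Protocol

Row = Dict[str, str]


class DataAccessModule(Protocol):
    """Hypothetical interface: each access module yields rows as dicts."""

    def read(self) -> Iterable[Row]: ...


class CsvTextSource:
    """Stands in for a file on a distributed file system."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Iterable[Row]:
        return csv.DictReader(io.StringIO(self.text))


class InMemoryRowsSource:
    """Stands in for a query result from a relational or columnar store."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def read(self) -> Iterable[Row]:
        return iter(self.rows)


def integrate(modules: List[DataAccessModule]) -> List[Row]:
    """Integration unit: merge rows from every module into one data set.

    Columns missing from a source are filled with an empty string so that
    every row of the prefabricated data set has the same columns.
    """
    rows = [dict(row) for module in modules for row in module.read()]
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{name: row.get(name, "") for name in columns} for row in rows]


prefab = integrate([
    CsvTextSource("id,city\n1,Beijing\n2,Chengdu\n"),
    InMemoryRowsSource([{"id": "3", "temp": "21.5"}]),
])
```

Under this assumed design, adding a new kind of data source only requires a new class with a `read` method; the integration unit itself does not change.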
Preferably, the global data analysis module has a global storage unit. The global data analysis module is connected to the data type extraction unit and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the analyzed data and stores it in the global storage unit.
Preferably, the data management module manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol.
Preferably, the stream management module manages the workflows in the platform, supporting their addition, deletion, modification and query.
Preferably, the operator management module manages the various Spark operators encapsulated in the platform, and the platform wraps these operators graphically; using a Spark DataFrame operator classified-catalogue management method, the operator management module classifies the operators, manages and displays them, and generates a classified catalogue.
Preferably, the operator management module includes a slicing unit, which obtains operator slices at a predetermined period and manages those operator slices.
The present invention also provides an analysis method using the above extensible big data analysis platform based on Hadoop and Spark, comprising the following steps:
S1: the extensible data access modules provide access to multiple data sources and connect the data sources to the feature extraction module;
S2: the feature extraction module obtains the data from the multiple data sources, integrates it, and extracts the data types;
S3: the global data analysis module performs global analysis on the integrated and extracted data and, after the global analysis, sends the data to the data management module;
S4: the data management module manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol;
S5: the operator management module periodically manages the various Spark operators encapsulated in the platform.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention reduces the workload of manually writing and developing Spark DataFrame scripts. Being based on Hadoop and Spark, it offers faster processing and more accurate prediction, is stable, reliable and easy to extend, and can analyze massive data to help users better extract value from their data.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the present invention;
Fig. 2 is a schematic diagram of the module structure of the feature extraction module of the present invention.
In the figures: 1, extensible data access module; 2, feature extraction module; 21, integration unit; 22, data type extraction unit; 3, global data analysis module; 4, data management module; 5, stream management module; 6, operator management module.
Specific embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "arranged", "installed", "connected to" and "connection" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary, or it may be an internal connection between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
Referring to Figs. 1-2, the present invention provides a technical solution: an extensible big data analysis platform based on Hadoop and Spark, comprising an extensible data access module 1, a feature extraction module 2, a global data analysis module 3, a data management module 4, a stream management module 5 and an operator management module 6. A plurality of extensible data access modules 1 are provided; they provide access to multiple data sources, input the data, and support the data storage modes of distributed file systems, columnar databases and relational databases. The feature extraction module 2 is connected to the plurality of extensible data access modules 1 and reads the input data; it comprises an integration unit 21 and a data type extraction unit 22. The integration unit 21 receives the input data of the extensible data access modules and integrates it into a prefabricated data set. The data type extraction unit 22 is connected to the integration unit 21; it obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.
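The patent does not specify how the data type of each column is inferred. A minimal sketch of one common approach, assuming every raw value arrives as a string and that a column takes the narrowest type able to hold all of its values, might look as follows. The function names and the four type labels are illustrative assumptions.

```python
from typing import Dict, List

Row = Dict[str, str]


def infer_value_type(value: str) -> str:
    """Classify one raw string as 'int', 'float', 'string' or 'null'."""
    text = value.strip()
    if text == "":
        return "null"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def infer_column_types(rows: List[Row]) -> Dict[str, str]:
    """Label each column with the narrowest type that fits all its values."""
    widening = {"null": 0, "int": 1, "float": 2, "string": 3}
    labels: Dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            seen = infer_value_type(value)
            current = labels.get(name, "null")
            if widening[seen] > widening[current]:
                labels[name] = seen  # widen the column's label
            else:
                labels.setdefault(name, current)
    return labels


rows = [
    {"id": "1", "temp": "21.5", "city": "Beijing"},
    {"id": "2", "temp": "19", "city": ""},
]
labels = infer_column_types(rows)
```

In this sketch a column that mixes integers and decimals is labelled `float`, and empty values never narrow a label that has already been assigned.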
The global data analysis module 3 has a global storage unit. The global data analysis module 3 is connected to the data type extraction unit 22 and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the analyzed data and stores it in the global storage unit.
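The difference between batch analysis and online real-time analysis can be illustrated with a single statistic. The sketch below is not the patent's method and does not use Spark: it substitutes Welford's algorithm for the online case and a plain two-pass computation for the batch case, and checks that both give the same mean and variance for one hypothetical column.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Online mean/variance (Welford's algorithm), updated one value at a time."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far."""
        return self.m2 / self.count if self.count else 0.0


def batch_stats(values: Iterable[float]) -> RunningStats:
    """Batch mode: the whole column is available in memory at once."""
    data = list(values)
    n = len(data)
    mean = sum(data) / n if n else 0.0
    m2 = sum((v - mean) ** 2 for v in data)
    return RunningStats(count=n, mean=mean, m2=m2)


column = [21.5, 19.0, 23.0, 20.5]

online = RunningStats()
for v in column:  # online mode: values arrive one by one
    online.update(v)

batch = batch_stats(column)
assert math.isclose(online.mean, batch.mean)
assert math.isclose(online.variance, batch.variance)
```

The online version never holds more than three numbers in memory, which is why this style of update suits real-time analysis, while the batch version needs the full column.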
The data management module 4 manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol.
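The patent states only that the upload uses HTTP. One plausible realization on Hadoop, assumed here purely for illustration, is the WebHDFS REST interface. The sketch below builds the first request of a WebHDFS file creation without sending it; the host, port, path and user name are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_webhdfs_create_request(namenode: str, hdfs_path: str, user: str) -> Request:
    """Build (but do not send) the first-step WebHDFS CREATE request.

    WebHDFS file creation is a two-step exchange: this PUT carries no file
    data; the NameNode answers with a redirect to a DataNode, and the file
    content is then sent in a second PUT to that redirect location.
    """
    query = urlencode({"op": "CREATE", "overwrite": "true", "user.name": user})
    url = f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"
    return Request(url, method="PUT")


# Hypothetical host, path and user, for illustration only.
request = build_webhdfs_create_request(
    "namenode.example.com:9870", "/platform/results/part-0.csv", "analyst"
)
```

Keeping request construction separate from sending makes the upload logic easy to inspect and test without a running cluster.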
The stream management module 5 manages the workflows in the platform, supporting their addition, deletion, modification and query.
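The four workflow operations named above can be sketched with a small in-memory store. The `Workflow` and `WorkflowStore` classes are hypothetical; the patent does not describe how workflows are represented or persisted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Workflow:
    name: str
    steps: List[str] = field(default_factory=list)


class WorkflowStore:
    """Hypothetical in-memory store offering add, delete, modify and query."""

    def __init__(self) -> None:
        self._items: Dict[str, Workflow] = {}

    def add(self, workflow: Workflow) -> None:
        if workflow.name in self._items:
            raise KeyError(f"workflow {workflow.name!r} already exists")
        self._items[workflow.name] = workflow

    def delete(self, name: str) -> None:
        del self._items[name]

    def modify(self, name: str, steps: List[str]) -> None:
        self._items[name].steps = list(steps)

    def query(self, name: str) -> Optional[Workflow]:
        return self._items.get(name)


store = WorkflowStore()
store.add(Workflow("daily-report", ["load", "clean"]))
store.modify("daily-report", ["load", "clean", "aggregate"])
found = store.query("daily-report")
store.delete("daily-report")
```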
The operator management module 6 manages the various Spark operators encapsulated in the platform, and the platform wraps these operators graphically; using a Spark DataFrame operator classified-catalogue management method, the operator management module 6 classifies the operators, manages and displays them, and generates a classified catalogue.
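A classified catalogue of operators could be kept as a registry keyed by category. In the sketch below, plain Python functions over lists of dictionaries stand in for Spark DataFrame operators, and the decorator, category names and operator names are all illustrative assumptions rather than details from the patent.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Rows = List[dict]
Operator = Callable[[Rows], Rows]

# category -> operator name -> callable
_CATALOGUE: Dict[str, Dict[str, Operator]] = defaultdict(dict)


def operator(category: str, name: str) -> Callable[[Operator], Operator]:
    """Decorator that files an operator under a category in the catalogue."""

    def register(func: Operator) -> Operator:
        _CATALOGUE[category][name] = func
        return func

    return register


@operator("cleaning", "drop_empty_rows")
def drop_empty_rows(rows: Rows) -> Rows:
    return [r for r in rows if any(v not in ("", None) for v in r.values())]


@operator("cleaning", "strip_strings")
def strip_strings(rows: Rows) -> Rows:
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


@operator("selection", "first_two")
def first_two(rows: Rows) -> Rows:
    return rows[:2]


def classified_catalogue() -> Dict[str, List[str]]:
    """Return category -> sorted operator names, ready for display."""
    return {cat: sorted(ops) for cat, ops in sorted(_CATALOGUE.items())}


catalogue = classified_catalogue()
```

A graphical front end could render `catalogue` as a tree of categories and let users drag operators into a workflow, which is one way to read the graphical encapsulation described above.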
The operator management module 6 includes a slicing unit, which obtains operator slices at a predetermined period and manages those operator slices.
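The patent does not define what an operator slice contains. Under one possible reading, assumed here only for illustration, a slice is a timestamped snapshot of the operators registered at that moment, taken once per period, with older slices discarded. The `SlicingUnit` class and its parameters are hypothetical, and the clock is passed in explicitly so that the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SlicingUnit:
    """Hypothetical reading: a 'slice' is a timestamped snapshot of the
    operator names registered at that moment, taken once per period."""

    period_seconds: float
    max_slices: int = 3
    _last_taken: float = field(default=float("-inf"), init=False)
    slices: List[Tuple[float, Tuple[str, ...]]] = field(
        default_factory=list, init=False
    )

    def tick(self, now: float, operators: Dict[str, object]) -> bool:
        """Take a slice if a full period has elapsed; return True if taken."""
        if now - self._last_taken < self.period_seconds:
            return False
        self.slices.append((now, tuple(sorted(operators))))
        self._last_taken = now
        # Manage slices: keep only the most recent max_slices entries.
        del self.slices[:-self.max_slices]
        return True


unit = SlicingUnit(period_seconds=60.0)
ops: Dict[str, object] = {"filter": None}
taken = [unit.tick(0.0, ops)]
ops["join"] = None
taken.append(unit.tick(30.0, ops))  # too early, no slice
taken.append(unit.tick(60.0, ops))  # one period later, new slice
```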
An analysis method of the extensible big data analysis platform based on Hadoop and Spark comprises the following steps:
S1: the extensible data access modules 1 provide access to multiple data sources and connect the data sources to the feature extraction module 2;
S2: the feature extraction module 2 obtains the data from the multiple data sources, integrates it, and extracts the data types;
S3: the global data analysis module 3 performs global analysis on the integrated and extracted data and, after the global analysis, sends the data to the data management module 4;
S4: the data management module 4 manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol;
S5: the operator management module 6 periodically manages the various Spark operators encapsulated in the platform.
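The ordering of steps S1 to S4 can be sketched as a chain of stages in which each stage consumes the previous stage's output. The stage bodies below are toy stand-ins, not the modules themselves, and step S5 is left out because periodic operator management runs alongside the data flow rather than inside it. All names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[object], object]]


def run_pipeline(stages: List[Stage], payload: object) -> Tuple[object, List[str]]:
    """Run the stages in order, passing each stage's output to the next."""
    trace: List[str] = []
    for label, stage in stages:
        payload = stage(payload)
        trace.append(label)
    return payload, trace


# Toy stand-ins for the modules; each real stage would be far richer.
stages: List[Stage] = [
    ("S1 access", lambda sources: [row for src in sources for row in src]),
    ("S2 integrate+extract", lambda rows: [float(r) for r in rows]),
    ("S3 global analysis", lambda nums: {"count": len(nums), "total": sum(nums)}),
    ("S4 manage+upload", lambda result: {**result, "uploaded": True}),
]

result, trace = run_pipeline(stages, [["1", "2"], ["3.5"]])
```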
The present invention reduces the workload of manually writing and developing Spark DataFrame scripts. Being based on Hadoop and Spark, it offers faster processing and more accurate prediction, is stable, reliable and easy to extend, and can analyze massive data to help users better extract value from their data.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.
Claims (7)
1. An extensible big data analysis platform based on Hadoop and Spark, characterized by comprising an extensible data access module (1), a feature extraction module (2), a global data analysis module (3), a data management module (4), a stream management module (5) and an operator management module (6), wherein a plurality of the extensible data access modules (1) are provided, which provide access to multiple data sources, input the data, and support the data storage modes of distributed file systems, columnar databases and relational databases;
the feature extraction module (2) is connected to the plurality of extensible data access modules (1) and reads the input data, and comprises an integration unit (21) and a data type extraction unit (22); the integration unit (21) receives the input data of the plurality of extensible data access modules and integrates it into a prefabricated data set; the data type extraction unit (22) is connected to the integration unit (21), obtains the prefabricated data set, infers the data type of each column in the prefabricated data set, and labels each column with its data type.
2. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the global data analysis module (3) has a global storage unit; the global data analysis module (3) is connected to the data type extraction unit (22) and performs batch global analysis or online real-time global analysis on the prefabricated data set; it performs iterative computation in memory, carries out global analysis on the massive prefabricated data set, and after the analysis decomposes the analyzed data and stores it in the global storage unit.
3. The extensible big data analysis platform based on Hadoop and Spark according to claim 1 or 2, characterized in that the data management module (4) manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol.
4. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the stream management module (5) manages the workflows in the platform, supporting their addition, deletion, modification and query.
5. The extensible big data analysis platform based on Hadoop and Spark according to claim 1, characterized in that the operator management module (6) manages the various Spark operators encapsulated in the platform, and the platform wraps these operators graphically; using a Spark DataFrame operator classified-catalogue management method, the operator management module (6) classifies the operators, manages and displays them, and generates a classified catalogue.
6. The extensible big data analysis platform based on Hadoop and Spark according to claim 5, characterized in that the operator management module (6) includes a slicing unit, which obtains operator slices at a predetermined period and manages those operator slices.
7. An analysis method of the extensible big data analysis platform based on Hadoop and Spark according to any one of claims 1-6, characterized by comprising the following steps:
S1: the extensible data access modules (1) provide access to multiple data sources and connect the data sources to the feature extraction module (2);
S2: the feature extraction module (2) obtains the data from the multiple data sources, integrates it, and extracts the data types;
S3: the global data analysis module (3) performs global analysis on the integrated and extracted data and, after the global analysis, sends the data to the data management module (4);
S4: the data management module (4) manages the data in the global storage unit and uploads the data to the distributed file system of the platform over the HTTP protocol;
S5: the operator management module (6) periodically manages the various Spark operators encapsulated in the platform.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910463031.3A CN110175207A (en) | 2019-05-30 | 2019-05-30 | Expandability big data analysis platform based on Hadoop and Spark |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910463031.3A CN110175207A (en) | 2019-05-30 | 2019-05-30 | Expandability big data analysis platform based on Hadoop and Spark |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110175207A true CN110175207A (en) | 2019-08-27 |
Family
ID=67696620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910463031.3A Pending CN110175207A (en) | 2019-05-30 | 2019-05-30 | Expandability big data analysis platform based on Hadoop and Spark |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175207A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808746A (en) * | 2016-03-14 | 2016-07-27 | 中国科学院计算技术研究所 | Relational big data seamless access method and system based on Hadoop system |
CN106951497A (en) * | 2017-03-15 | 2017-07-14 | 深圳市德信软件有限公司 | A kind of method and system based on Hadoop framework data analysis diagrammatic representation |
CN107220310A (en) * | 2017-05-11 | 2017-09-29 | 中国联合网络通信集团有限公司 | A kind of database data management system, method and device |
CN107229976A (en) * | 2017-06-08 | 2017-10-03 | 郑州云海信息技术有限公司 | A kind of distributed machines learning system based on spark |
CN107526600A (en) * | 2017-09-05 | 2017-12-29 | 成都优易数据有限公司 | A kind of visual numeric simulation analysis platform and its data cleaning method based on hadoop and spark |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190827