CN113254436A - Hadoop-based data management system and method - Google Patents

Publication number
CN113254436A
Authority
CN
China
Prior art keywords
data
module
sending
management
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110802194.7A
Other languages
Chinese (zh)
Inventor
周文明
花霖
杨欢
姚琪
罗启铭
王宗强
刘小双
覃江威
吴育校
王春洲
张建宇
刘桂芬
陈品宏
陈军
朱瑜鑫
冯建设
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xinrun Fulian Digital Technology Co Ltd
Original Assignee
Shenzhen Xinrun Fulian Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xinrun Fulian Digital Technology Co Ltd
Priority to CN202110802194.7A
Publication of CN113254436A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/287 Visualization; Browsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data management and discloses a Hadoop-based data management system and method. The system comprises a big data platform, a data processing module, a data warehouse, and a data integration module. The big data platform collects and stores the data sent by each device and sends the stored data to the data processing module through a preset management interface; the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse; the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module; the data integration module extracts feature data from the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result. Because the data sent by each device is preprocessed, cleaned, and analyzed in sequence and the analysis result is displayed visually, the efficiency of the various data management tasks can be effectively improved.

Description

Hadoop-based data management system and method
Technical Field
The invention relates to the technical field of data management, and in particular to a Hadoop-based data management system and method.
Background
With the continuous development of communication technology, data, as an interactive medium, has become an increasingly important information resource. The volume of data held by the departments of different enterprises keeps growing, and the data types are becoming more diverse and more complex.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the invention is to provide a Hadoop-based data management system and method, aiming to solve the technical problem that the prior art cannot effectively improve the efficiency of the various data management tasks.
In order to achieve the above object, the present invention provides a Hadoop-based data management system, which comprises a big data platform, a data processing module, a data warehouse, and a data integration module connected in sequence;
the big data platform is used for acquiring data sent by each device, storing the data sent by each device and sending the stored data to the data processing module through a preset management interface;
the data processing module is used for preprocessing the stored data and sending the preprocessed data to the data warehouse;
the data warehouse is used for cleaning the preprocessed data and sending the cleaned data to the data integration module;
the data integration module is used for extracting the characteristic data of the cleaned data, analyzing the characteristic data through a data mart, and visually displaying the analysis result to realize the management of the data.
Optionally, the data processing module includes a data cleaning module, a data conversion module, a data monitoring module and a data calculation module;
the data cleaning module is used for cleaning the stored data and sending the cleaned data to the data conversion module;
the data conversion module is used for converting the cleaned data and uploading the converted data to the data monitoring module;
the data monitoring module is used for monitoring the converted data in real time, and if the data in the monitoring result is not abnormal data, the data in the monitoring result is sent to the data calculation module;
the data calculation module is used for classifying the data in the monitoring result, determining a corresponding calculation strategy according to the classified data, calculating the classified data according to the calculation strategy, and sending the calculation result to the data warehouse.
Optionally, the data calculation module includes a process determination module and a policy selection module;
the process determining module is used for classifying the data in the monitoring result, extracting the process information of the classified data, and sending the classified data and the process information to the strategy selecting module;
and the strategy selection module is used for determining a corresponding calculation strategy according to the process information, calculating the classified data according to the calculation strategy and sending a calculation result to the data warehouse.
Optionally, the data warehouse includes a data checking module, a warehouse data cleaning module, and a data aggregation module;
the data checking module is used for acquiring a preset checking rule, checking the preprocessed data according to the preset checking rule, and sending the preprocessed data to the warehouse data cleaning module when the checking is passed;
the warehouse data cleaning module is used for cleaning the preprocessed data and sending the cleaned data to the data aggregation module;
the data aggregation module is used for aggregating the cleaned data to obtain corresponding aggregated data and sending the aggregated data to the data integration module.
Optionally, the data integration module includes a data analysis module and a report generation module;
the data analysis module is used for extracting characteristic data of the cleaned data, analyzing the characteristic data through a relational data mart when the type of the characteristic data is relational data, and sending an analysis result to the report generation module;
and the report generation module is used for generating a corresponding data report according to the analysis result and displaying the data report in a visual manner in sequence so as to realize the management of data.
Optionally, the hadoop-based data management system further includes an ad hoc analysis module;
the data analysis module is used for analyzing the characteristic data through a multi-dimensional data mart and sending an analysis result to the ad hoc analysis module when the type of the characteristic data is multi-dimensional data;
the ad hoc analysis module is used for performing ad hoc analysis on the data in the analysis result.
Optionally, the hadoop-based data management system further includes a data set management module and a data source management module;
the data set management module is used for extracting header data of the data report, generating a corresponding data set according to the header data, and sending the data set to the data source management module;
the data source management module is used for tracing the source of the data in the data set to obtain source information corresponding to the data in the data set, and if the source information is inconsistent with the historical source information, the historical source information is modified according to the source information to realize the management of the data source of the data.
Optionally, the hadoop-based data management system further includes a data parsing module and a data extraction module;
the data parsing module is used for receiving the stored data sent by the data processing module through a preset integrated interface, parsing the stored data, and sending the parsed data to the data extraction module;
the data extraction module is used for extracting the feature information of the parsed data, determining corresponding quality information and value information according to the feature information, obtaining the top N data items based on the quality information and the value information, and managing the top N data items.
Optionally, the Hadoop-based data management system further includes a data subscription module and a data publishing module;
the data subscription module is used for acquiring stored data sent by a big data platform through a preset service interface, determining a corresponding data type according to the stored data, and sending the data of the data type to the data publishing module;
and the data publishing module is used for publishing the data of the data type.
In addition, in order to achieve the above object, the present invention further provides a Hadoop-based data management method applied to a Hadoop-based data management system, the system comprising a big data platform, a data processing module, a data warehouse, and a data integration module connected in sequence, and the method comprising the following steps:
the big data platform collects data sent by each device, stores the data sent by each device, and sends the stored data to the data processing module through a preset management interface;
the data processing module is used for preprocessing the stored data and sending the preprocessed data to the data warehouse;
the data warehouse is used for cleaning the preprocessed data and sending the cleaned data to the data integration module;
the data integration module extracts the characteristic data of the cleaned data, analyzes the characteristic data through a data mart, and visually displays the analysis result to realize the management of the data.
In the invention, the big data platform collects and stores the data sent by each device and sends the stored data to the data processing module through a preset management interface; the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse; the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module; the data integration module extracts feature data from the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result to realize data management. Because the data sent by each device is preprocessed, cleaned, and analyzed in sequence and the analysis result is displayed visually, the efficiency of the various data management tasks can be effectively improved.
Drawings
FIG. 1 is a block diagram of a first embodiment of the Hadoop-based data management system of the present invention;
FIG. 2 is a block diagram of a second embodiment of the Hadoop-based data management system of the present invention;
FIG. 3 is a block diagram of a third embodiment of the Hadoop-based data management system of the present invention;
FIG. 4 is a block diagram of a fourth embodiment of the Hadoop-based data management system of the present invention;
FIG. 5 is a schematic flow chart of a first embodiment of the Hadoop-based data management method of the present invention;
FIG. 6 is a schematic diagram of a data management system according to an embodiment of the Hadoop-based data management method of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a block diagram illustrating a first embodiment of the Hadoop-based data management system of the present invention. The Hadoop-based data management system 100 comprises: a big data platform 10, a data processing module 20, a data warehouse 30, and a data integration module 40.
In this embodiment, the big data platform 10 in the Hadoop-based data management system 100 collects the data sent by each device, stores it, and sends the stored data to the data processing module 20 through a preset management interface. The data sent by each device includes data issued by enterprise management solutions (SPAs), data issued by supplier relationship management (SRM) software, financial data, workflow data, market decision data, and so on. After the data of each device is received, it is stored in a local database. The preset management interface is the connection interface between the big data platform 10 and the data processing module 20 and is mainly used for data transmission between the two. It may be a big data management interface or another management interface; this embodiment does not limit it, and a big data management interface is taken as the example.
In this embodiment, the data processing module 20 preprocesses the stored data and sends the preprocessed data to the data warehouse 30. Preprocessing refers to performing preset processing on the stored data, specifically cleaning, converting, uploading, calculating, and so on. The data processing module may be an Extract-Transform-Load (ETL) tool, that is, a tool that extracts data from a source, transforms it, and loads it into a destination.
Further, in order to effectively improve the efficiency of processing data, the data processing module 20 includes a data cleaning module, a data conversion module, a data monitoring module, and a data calculation module;
In this embodiment, the data cleaning module cleans the stored data and sends the cleaned data to the data conversion module. Data cleaning refers to correcting recognizable errors in the data, including checking data consistency and handling invalid values and missing values. After cleaning is completed, the cleaned data is sent to the data conversion module.
In this embodiment, the data conversion module converts the cleaned data and uploads the converted data to the data monitoring module. After the cleaned data is received, it is converted; what is converted includes the format, the data type, and so on. After the conversion is completed, the converted data is sent to the data monitoring module.
In this embodiment, the data monitoring module monitors the converted data in real time and, if the monitoring result contains no abnormal data, sends the data in the monitoring result to the data calculation module. Abnormal data is data that differs greatly from the other data; for example, if normal data is 8 bits long, a data item that is 16 bits long is abnormal. After the converted data is received, it is monitored in real time. When the monitoring result contains no abnormal data, the data is sent to the data calculation module; when it does contain abnormal data, the abnormal data is fed back to the big data platform 10, which deletes the abnormal data from the local database after receiving it.
In this embodiment, the data calculation module classifies the data in the monitoring result, determines a corresponding calculation strategy according to the classified data, calculates the classified data according to the calculation strategy, and sends the calculation result to the data warehouse 30. A calculation strategy is a strategy for calculating data and is either a distributed calculation strategy or a parallel calculation strategy: the distributed calculation strategy calculates the data sequentially, and the parallel calculation strategy calculates the data simultaneously.
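The cleaning step described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the record layout (`device_id`, `value`, `unit`) and the specific validity rules are assumptions made for the example.

```python
from typing import Optional

REQUIRED_FIELDS = ("device_id", "value", "unit")  # hypothetical schema


def clean_record(record: dict) -> Optional[dict]:
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    # Missing values: reject records that lack any required field.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return None
    # Invalid values: the measurement must be parseable as a number.
    try:
        value = float(record["value"])
    except (TypeError, ValueError):
        return None
    # Consistency: normalize spelling so that equal units compare equal.
    return {
        "device_id": str(record["device_id"]).strip(),
        "value": value,
        "unit": str(record["unit"]).strip().lower(),
    }


def clean(records: list) -> list:
    """Clean every record and drop those that could not be repaired."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]
```

Records that fail a rule are dropped here; a real deployment might instead repair or quarantine them.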
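As a hedged illustration of converting "a format, a data type and the like", the sketch below turns a numeric string into a float and a slash-separated timestamp into ISO 8601. The field names and the input timestamp format are assumptions for the example; the source does not specify them.

```python
from datetime import datetime


def convert_record(record: dict) -> dict:
    """Convert field types and formats; the field names are hypothetical."""
    converted = dict(record)
    # Data type conversion: numeric strings become floats.
    converted["value"] = float(record["value"])
    # Format conversion: "YYYY/MM/DD HH:MM:SS" becomes ISO 8601.
    parsed = datetime.strptime(record["timestamp"], "%Y/%m/%d %H:%M:%S")
    converted["timestamp"] = parsed.isoformat()
    return converted
```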
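The length-based example in the text (8-bit data is normal, 16-bit data is abnormal) can be sketched as below. Representing each data item as a bit string is an assumption made for the illustration; the source does not say how lengths are measured.

```python
EXPECTED_BITS = 8  # from the example in the text: normal data is 8 bits long


def monitor(values: list) -> tuple:
    """Split bit strings into (normal, abnormal) lists by their length."""
    normal, abnormal = [], []
    for bits in values:
        target = normal if len(bits) == EXPECTED_BITS else abnormal
        target.append(bits)
    return normal, abnormal
```

In the flow described above, the `normal` list would go on to the data calculation module and the `abnormal` list would be fed back to the big data platform for deletion.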
Furthermore, in order to effectively improve the accuracy of the calculated data, the data calculation module comprises a process determination module and a strategy selection module;
In this embodiment, the process determination module classifies the data in the monitoring result, extracts the process information of the classified data, and sends the classified data and the process information to the strategy selection module. The process information describes how the data runs within a program and is either multi-process information or single-process information. After the monitoring result sent by the data calculation module is received, the data in it is classified and the corresponding process information is determined from the classified data; for example, after classification, type A data is single-process data and type B data is multi-process data.
In this embodiment, the strategy selection module determines a corresponding calculation strategy according to the process information, calculates the classified data according to the calculation strategy, and sends the calculation result to the data warehouse 30. The calculation strategy corresponding to single-process information is the distributed calculation strategy, and the calculation strategy corresponding to multi-process information is the parallel calculation strategy. After the calculation strategy is determined from the process information, the classified data is calculated with it; for example, when the calculation strategy is the distributed calculation strategy, the classified data is calculated according to the distributed calculation strategy to obtain the corresponding calculation result.
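The mapping from process information to calculation strategy can be sketched as a lookup table. Following the text's own definitions, the "distributed" strategy is modeled as sequential calculation and the "parallel" strategy as simultaneous calculation. The use of a thread pool, the labels `"single"` and `"multi"`, and all function names are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_sequential(func, items):
    """'Distributed' strategy as the text defines it: items one after another."""
    return [func(item) for item in items]


def compute_parallel(func, items):
    """'Parallel' strategy: items are calculated simultaneously (thread pool)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))  # map() preserves input order


# Process information selects the strategy, as described in the embodiment.
STRATEGIES = {"single": compute_sequential, "multi": compute_parallel}


def select_and_compute(process_info: str, func, items):
    """Pick the strategy for the given process information and apply it."""
    strategy = STRATEGIES[process_info]
    return strategy(func, items)
```

Keeping the mapping in a dictionary means a further strategy could be added without changing `select_and_compute`.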
In this embodiment, the data warehouse 30 cleans the preprocessed data, and sends the cleaned data to the data integration module, wherein the data is cleaned according to preset data hierarchical information, the preset data hierarchical information refers to different hierarchical information in the data warehouse, for example, a data source layer, a data model layer, and a data application layer, and the preprocessed data is cleaned again according to the preset data hierarchical information to ensure the accuracy of the data.
Further, in order to effectively improve the accuracy of the calculated data, the data warehouse 30 includes a data checking module, a warehouse data cleaning module, and a data aggregation module.
In this embodiment, the data checking module obtains a preset checking rule, checks the preprocessed data according to the preset checking rule, and sends the preprocessed data to the warehouse data cleaning module when the check passes. The preset checking rule is the rule applied when the data is checked. After the preprocessed data is obtained, it is checked against the preset checking rule so that the preprocessed data is consistent. The data checking module corresponds to the source-aligned data layer of the warehouse.
In this embodiment, the warehouse data cleaning module cleans the preprocessed data and sends the cleaned data to the data aggregation module. Data cleaning refers to correcting recognizable errors in the data. The stored data has already been cleaned once by the data cleaning module, and the warehouse data cleaning module cleans it a second time, which effectively improves the accuracy of the processed data.
In this embodiment, the data aggregation module aggregates the cleaned data to obtain corresponding aggregated data and sends the aggregated data to the data integration module 40. Aggregation refers to gathering and consolidating the cleaned data; the aggregated data is a data set. After aggregation is completed, the aggregated data is sent to the data integration module 40.
In this embodiment, the data integration module 40 extracts feature data from the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result to realize data management. Feature data is data that uniquely identifies the cleaned data. A data mart is a data store that serves the needs of a specific department or user and is organized multidimensionally, including dimension definitions, indicator calculations, dimension hierarchies, and so on. Different feature data corresponds to different data marts; for example, feature data A corresponds to a relational data mart, and feature data B corresponds to a multidimensional data mart.
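Checking data against a preset rule set can be sketched as named predicates applied to each record. The two rules shown are hypothetical; the source does not list the actual checking rules.

```python
# Hypothetical preset checking rules: each maps a record to True (pass) or False.
PRESET_RULES = {
    "has_device_id": lambda r: bool(r.get("device_id")),
    "value_is_number": lambda r: isinstance(r.get("value"), (int, float))
    and not isinstance(r.get("value"), bool),
}


def failed_rules(record: dict, rules: dict = PRESET_RULES) -> list:
    """Return the names of the rules the record fails, in rule order."""
    return [name for name, rule in rules.items() if not rule(record)]


def check(records: list) -> list:
    """Keep only the records that pass every preset rule."""
    return [r for r in records if not failed_rules(r)]
```

Returning the names of the failed rules, rather than a bare pass or fail, makes it easier to report why a record was rejected.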
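One way to read "gathering and converging" the cleaned data is a group-by with summary statistics. The sketch below assumes records keyed by a hypothetical `device_id` field with a numeric `value`; the source does not specify the grouping key or the statistics.

```python
from collections import defaultdict


def aggregate(records: list) -> dict:
    """Group values by device and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["device_id"]].append(record["value"])
    return {
        device: {
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for device, values in groups.items()
    }
```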
Further, in order to effectively improve the efficiency of analyzing data, the data integration module 40 includes a data analysis module and a report generation module.
In this embodiment, the data analysis module extracts feature data from the cleaned data and, when the type of the feature data is relational data, analyzes the feature data through a relational data mart and sends the analysis result to the report generation module. The relational data mart is the data mart in which relational data is analyzed. Each data mart is specific to one kind of data, that is, relational data can only be analyzed through the relational data mart, and the analyzed data is sent to the report generation module.
In this embodiment, the report generation module generates a corresponding data report according to the analysis result and displays the data reports visually in sequence to realize data management. A data report is a table generated from the data, and the reports are displayed in the order in which they were generated.
In this embodiment, when the type of the feature data is multidimensional data, the data analysis module analyzes the feature data through a multidimensional data mart and sends the analysis result to the ad hoc analysis module. Multidimensional data is data with multiple dimensions, and the multidimensional data mart is the data mart in which multidimensional data is analyzed.
In this embodiment, the ad hoc analysis module performs ad hoc analysis on the data in the analysis result, wherein the ad hoc analysis refers to analyzing the data in the analysis result according to the requirement of the user.
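The routing described in this embodiment, in which relational feature data goes to the relational data mart and then to report generation, while multidimensional feature data goes to the multidimensional data mart and then to ad hoc analysis, can be sketched as a lookup. The string labels are hypothetical identifiers chosen for the example.

```python
# Feature-data type -> (data mart, downstream module), per the embodiment.
ROUTES = {
    "relational": ("relational_data_mart", "report_generation_module"),
    "multidimensional": ("multidimensional_data_mart", "ad_hoc_analysis_module"),
}


def route(feature_type: str) -> tuple:
    """Return the data mart and downstream module for a feature-data type."""
    try:
        return ROUTES[feature_type]
    except KeyError:
        raise ValueError(f"unsupported feature data type: {feature_type!r}") from None
```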
In this embodiment, the big data platform collects and stores the data sent by each device and sends the stored data to the data processing module through a preset management interface; the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse; the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module; the data integration module extracts feature data from the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result to realize data management. Because the data sent by each device is preprocessed, cleaned, and analyzed in sequence and the analysis result is displayed visually, the efficiency of the various data management tasks can be effectively improved.
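The four components "connected in sequence" can be pictured as a chain of stages, each consuming the previous stage's output. The sketch below is only a structural illustration: the stage bodies are trivial stand-ins invented for the example and do not reflect what the real modules compute.

```python
from functools import reduce


def run_pipeline(data, stages):
    """Pass data through each stage in order; each stage maps a list to a list."""
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical stand-ins for the four components, in the order given in the text.
def big_data_platform(records):
    return list(records)  # collect and store


def data_processing_module(records):
    return [r for r in records if r is not None]  # preprocess


def data_warehouse(records):
    return sorted(set(records))  # clean (here: de-duplicate and order)


def data_integration_module(records):
    return [f"value={r}" for r in records]  # analyze and present


STAGES = [
    big_data_platform,
    data_processing_module,
    data_warehouse,
    data_integration_module,
]
```

Calling `run_pipeline(raw_records, STAGES)` then mirrors the flow from collection to visual display.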
Referring to fig. 2, fig. 2 is a block diagram illustrating a second embodiment of a hadoop-based data management system according to the present invention, and the second embodiment of the hadoop-based data management system according to the present invention is proposed based on the embodiment illustrated in fig. 1.
In this embodiment, the hadoop-based data management system 100 further includes a data set management module 501 and a data source management module 502, where the data set management module 501 extracts header data of the data report, generates a corresponding data set according to the header data, and sends the data set to the data source management module 502, where the header information refers to data information of a first row or a first column in the data report, and the data corresponding to the header can be queried through the header information, and different rows and columns in the data report are combined to obtain a corresponding data set, for example, the header data includes A, B and C, and the data set is a set of data A, B and C.
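Building a data set from a report's header can be sketched as follows, assuming the report is a non-empty list of rows whose first row is the header (the text also allows a first-column header, which this sketch does not cover).

```python
def build_data_set(report: list) -> dict:
    """Map each header name to the column of values beneath it.

    Assumes a non-empty report whose first row is the header.
    """
    header, *rows = report
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}
```

For a report with header `A`, `B`, `C`, the result holds the data for A, B, and C, matching the example in the text.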
In this embodiment, a data source management module traces a source of data in a data set to obtain source information corresponding to the data in the data set, and if the source information is inconsistent with historical source information, the historical source information is modified according to the source information to achieve management of a data source of the data, where the source information refers to source information of the data, such as data a source M device and data B source N device, and when the current source information is inconsistent with the historical source information, it indicates that the data in the data set is sent by other devices, and at this time, the historical source information is modified through the source information to obtain actual source information of the data, and the historical source information refers to source information before the data, such as the source before the data is M device, and the source at this time is N device.
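The comparison of current and historical source information can be sketched as a dictionary reconciliation. Representing source information as a mapping from data item to device name is an assumption for the example.

```python
def reconcile_sources(current: dict, history: dict) -> list:
    """Update history in place where the current source differs.

    Returns the names of the data items whose recorded source changed.
    """
    changed = []
    for item, source in current.items():
        if history.get(item) != source:
            history[item] = source  # historical source is overwritten
            changed.append(item)
    return changed
```

With history recording data A as coming from device M and the current trace showing device N, the call updates the history for A and reports it as changed.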
In this embodiment, the data set management module extracts header data of the data report, generates a corresponding data set according to the header data, and sends the data set to the data source management module; the data source management module traces the source of the data in the data set to obtain source information corresponding to the data in the data set, and if the source information is inconsistent with the historical source information, the historical source information is modified according to the source information; the corresponding data set is generated by extracting the header information of the data report, and the historical source information is modified according to the source information of the data in the data set, so that the efficiency of tracing the source data source can be effectively improved.
Referring to fig. 3, fig. 3 is a block diagram illustrating a second embodiment of a hadoop-based data management system according to the present invention, and a third embodiment of the hadoop-based data management system according to the present invention is provided based on the embodiment illustrated in fig. 1.
In this embodiment, the hadoop-based data management system 100 further includes a data parsing module 503 and a data extracting module 504, where the data parsing module 503 receives the stored data sent by the data processing module through a preset integrated interface, parses the stored data, and sends the parsed data to the data extracting module 504, where after the stored data is received, the stored data is parsed to obtain structural information of the data, the structural information includes characteristic information and non-characteristic information, the preset integrated interface refers to an integrated interface where the data processing module 30 is connected to the data parsing module 503, and the preset integrated interface may be a big data integrated interface or other integrated interfaces, which is not limited in this embodiment, and is described by taking the big data integrated interface as an example.
In this embodiment, the data extraction module 504 extracts the feature information of the parsed data, determines corresponding quality information and value information according to the feature information, obtains the top N pieces of data based on the quality information and the value information, and manages those N pieces of data. The quality information indicates how correct the data is, and the value information indicates the value the data provides for data mining and data analysis. The data is ranked comprehensively according to the quality information and the value information, and the top-ranked pieces are managed, for example through data quality management and data asset management. N may be any positive integer, which is not limited in this embodiment, and N = 5 is taken as an example.
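As an illustration of the ranking step, the following sketch selects the top N records by a combined score. The specification does not say how quality and value are combined, so the weighted sum, the `[0, 1]` score ranges, and the names `Record`, `top_n`, and `quality_weight` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    name: str
    quality: float  # assumed score in [0, 1]: how correct the data is
    value: float    # assumed score in [0, 1]: usefulness for mining and analysis


def top_n(records: List[Record], n: int = 5, quality_weight: float = 0.5) -> List[Record]:
    """Rank records by a weighted sum of quality and value and keep the first n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")

    def score(r: Record) -> float:
        return quality_weight * r.quality + (1.0 - quality_weight) * r.value

    return sorted(records, key=score, reverse=True)[:n]


# Eight hypothetical records; with equal weights the score grows with the index.
records = [Record(f"d{i}", quality=i / 10, value=1 - i / 20) for i in range(8)]
selected = top_n(records, n=5)
```

With N = 5, as in the example embodiment, only the five highest-scoring records would be passed on to data quality management and data asset management.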
In this embodiment, the data parsing module receives the stored data sent by the data processing module through the preset integrated interface, parses the stored data, and sends the parsed data to the data extraction module. The data extraction module extracts the feature information of the parsed data, determines corresponding quality information and value information according to the feature information, obtains the top N pieces of data based on the quality information and the value information, and manages them. Because management is concentrated on the N pieces of data that rank highest by quality and value, the data management efficiency can be effectively improved.
Referring to fig. 4, fig. 4 is a block diagram illustrating a fourth embodiment of the hadoop-based data management system according to the present invention; the fourth embodiment is provided based on the embodiment illustrated in fig. 1.
In this embodiment, the hadoop-based data management system 100 further includes a data tagging module 505 and a data publishing module 506. The data tagging module 505 obtains the stored data sent by the big data platform 10 through a preset service interface, determines a corresponding data category according to the stored data, and sends the data of that category to the data publishing module 506. The data category is the category to which the stored data belongs and is determined according to user requirement information, so the data of that category is the data users view most. The preset service interface is the interface that connects the big data platform 10 to the data tagging module 505; it may be a big data service interface or another service interface, which is not limited in this embodiment, and the big data service interface is taken as an example.
In this embodiment, the data publishing module 506 publishes the data of the data category: after receiving the data in that category, it publishes the data so that users or other parties can view it.
In this embodiment, the data tagging module obtains the stored data sent by the big data platform through the preset service interface, determines the corresponding data category according to the stored data, and sends the data of that category to the data publishing module, which publishes it. Because the data category is determined from the stored data and the data in that category is published, the user experience can be effectively improved.
Referring to fig. 5, the present invention further provides a hadoop-based data management method based on the hadoop-based data management system described above, and fig. 5 is a schematic flow chart of a first embodiment of the method. The hadoop-based data management system includes a big data platform, a data processing module, a data warehouse, and a data integration module, which are connected in sequence.
The hadoop-based data management method comprises the following steps:
Step S10, the big data platform collects data sent by each device, stores the data sent by each device, and sends the stored data to the data processing module through a preset management interface.
It should be understood that the data sent by each device includes data issued by System Applications and Products (SAP) software, data issued by Supplier Relationship Management (SRM) software, financial data, workflow data, market decision data, and the like. After the data from each device is received, it is stored in a local database. The preset management interface is the connection interface between the big data platform and the data processing module and is mainly used for data transmission between the two; it may be a big data management interface or another management interface, which is not limited in this embodiment, and the big data management interface is taken as an example.
Step S20, the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse.
It can be understood that preprocessing refers to applying preset processing to the stored data, specifically cleaning, converting, uploading, calculating, and the like. The data processing module may be an Extract-Transform-Load (ETL) tool, that is, a tool that extracts data from a source, transforms it, and loads it into a destination.
Furthermore, in order to effectively improve the efficiency of processing data, the data processing module comprises a data cleaning module, a data conversion module, a data monitoring module and a data calculation module.
In this embodiment, the data cleaning module cleans the stored data and sends the cleaned data to the data conversion module. Data cleaning refers to correcting recognizable errors in the data, including checking data consistency and handling invalid values and missing values.
In this embodiment, the data conversion module converts the cleaned data and uploads the converted data to the data monitoring module. After receiving the cleaned data, it converts aspects such as the format and the data type and, once the conversion is complete, sends the converted data to the data monitoring module.
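The cleaning and conversion steps can be sketched as follows. This is only an illustration under stated assumptions: records are dictionaries of text fields, "missing" means `None` or an empty string, and the only conversion shown is turning a hypothetical `amount` field from text into a number. None of these names or rules appear in the specification.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def clean(rows: List[Row], required: List[str]) -> List[Row]:
    """Strip whitespace and drop rows whose required fields are missing or blank."""
    cleaned = []
    for row in rows:
        stripped = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        if all(stripped.get(f) not in (None, "") for f in required):
            cleaned.append(stripped)
    return cleaned


def convert(rows: List[Row]) -> List[Dict[str, object]]:
    """Convert the hypothetical 'amount' field from text to float (a data-type conversion)."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


raw = [
    {"id": " A1 ", "amount": "12.50"},
    {"id": "A2", "amount": None},  # missing value: removed by clean()
    {"id": "", "amount": "3"},     # blank id: removed by clean()
]
converted = convert(clean(raw, required=["id", "amount"]))
```

A value that cannot be parsed as a number would raise `ValueError` in `convert`; a fuller implementation would have to decide whether such rows are invalid values to be handled during cleaning.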
In this embodiment, the data monitoring module monitors the converted data in real time and, if the data in the monitoring result is not abnormal data, sends the data in the monitoring result to the data calculation module. Abnormal data is data that differs greatly from the other data; for example, if normal data has a length of 8 bits, data with a length of 16 bits is abnormal data. After the converted data is received, it is monitored in real time. When no abnormal data exists in the monitoring result, the data in the monitoring result is sent to the data calculation module; when abnormal data exists, the abnormal data is fed back to the big data platform, which deletes the abnormal data from the local database after receiving it.
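The length-based example above (8 bits normal, 16 bits abnormal) can be expressed as a small check. The sketch assumes each value arrives as a bit string and that the expected length is known in advance; `EXPECTED_LENGTH` and `monitor` are hypothetical names, and real monitoring would likely use richer criteria than length alone.

```python
from typing import List, Tuple

EXPECTED_LENGTH = 8  # assumed normal length, following the 8-versus-16 example


def monitor(values: List[str], expected_length: int = EXPECTED_LENGTH) -> Tuple[List[str], List[str]]:
    """Split values into (normal, abnormal) by comparing each length to the expected one."""
    normal = [v for v in values if len(v) == expected_length]
    abnormal = [v for v in values if len(v) != expected_length]
    return normal, abnormal


normal, abnormal = monitor(["10101010", "1111000011110000", "00001111"])
# Normal values would go on to the data calculation module;
# abnormal values would be fed back to the big data platform for deletion.
```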
In this embodiment, the data calculation module classifies the data in the monitoring result, determines a corresponding calculation strategy according to the classified data, calculates the classified data according to the calculation strategy, and sends the calculation result to the data warehouse. The calculation strategy is the strategy used to calculate the data and is either a distributed calculation strategy or a parallel calculation strategy: the distributed calculation strategy calculates the data sequentially, and the parallel calculation strategy calculates the data simultaneously.
Furthermore, in order to effectively improve the accuracy of the calculated data, the data calculation module comprises a process determination module and a strategy selection module.
In this embodiment, the process determination module classifies the data in the monitoring result, extracts the process information of the classified data, and sends the classified data and the process information to the strategy selection module. The process information describes how the data runs in a program and is either multi-process information or single-process information. After the monitoring result is received, the data in it is classified and the corresponding process information is determined for each class; for example, after classification, class A data is data with single-process information and class B data is data with multi-process information.
In this embodiment, the strategy selection module determines a corresponding calculation strategy according to the process information, calculates the classified data according to that strategy, and sends the calculation result to the data warehouse. The calculation strategy corresponding to single-process information is the distributed calculation strategy, and the calculation strategy corresponding to multi-process information is the parallel calculation strategy. For example, when the calculation strategy is the distributed calculation strategy, the classified data is calculated according to the distributed calculation strategy to obtain the corresponding calculation result.
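The mapping from process information to calculation strategy can be sketched as a lookup table. Following the definitions in this description, the strategy for single-process data computes items one after another and the strategy for multi-process data computes them at the same time. The use of a thread pool, the constants, and the function names are assumptions of this sketch, not the specification's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

SINGLE_PROCESS = "single"
MULTI_PROCESS = "multi"


def run_sequentially(func: Callable[[int], int], items: List[int]) -> List[int]:
    """Strategy for single-process data: compute the items one after another."""
    return [func(x) for x in items]


def run_in_parallel(func: Callable[[int], int], items: List[int]) -> List[int]:
    """Strategy for multi-process data: compute the items concurrently."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))  # map() preserves the input order


STRATEGIES = {SINGLE_PROCESS: run_sequentially, MULTI_PROCESS: run_in_parallel}


def calculate(
    classified: Dict[str, List[int]],
    process_info: Dict[str, str],
    func: Callable[[int], int],
) -> Dict[str, List[int]]:
    """Pick a strategy per class from its process information and apply it."""
    results = {}
    for category, items in classified.items():
        strategy = STRATEGIES[process_info[category]]
        results[category] = strategy(func, items)
    return results


classified = {"A": [1, 2, 3], "B": [4, 5, 6]}
process_info = {"A": SINGLE_PROCESS, "B": MULTI_PROCESS}
results = calculate(classified, process_info, lambda x: x * x)
```

Keeping the strategies in a dictionary means a further strategy could be added without changing `calculate`, which is one plausible reason to separate process determination from strategy selection.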
Step S30, the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module.
It should be understood that the data is cleaned according to preset data hierarchy information, which refers to the different layers of the data warehouse, such as a data source layer, a data model layer and a data application layer. The preprocessed data is cleaned again according to the preset data hierarchy information to ensure the accuracy of the data.
Further, in order to effectively improve the accuracy of the calculated data, the data warehouse comprises a data checking module, a warehouse data cleaning module and a data aggregation module.
In this embodiment, the data checking module obtains a preset checking rule, checks the preprocessed data according to the rule, and sends the preprocessed data to the warehouse data cleaning module when the check passes. The preset checking rule is the rule applied when checking data; checking the preprocessed data against it ensures that the preprocessed data is consistent. The data checking module refers to a data pasting layer.
In this embodiment, the warehouse data cleaning module cleans the preprocessed data and sends the cleaned data to the data aggregation module. Data cleaning refers to correcting recognizable errors in the data. Because the stored data has already been cleaned once by the data cleaning module, this second cleaning pass effectively improves the accuracy of the processed data.
In this embodiment, the data aggregation module aggregates the cleaned data to obtain corresponding aggregated data and sends the aggregated data to the data integration module. Aggregation refers to gathering and converging the cleaned data; the aggregated data is a data set, which is sent to the data integration module once aggregation is complete.
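The three warehouse modules can be pictured as a check, a second cleaning pass, and an aggregation applied in that order. In this sketch the checking rules are plain predicates, the cleaning pass removes exact duplicate rows, and aggregation sums a numeric field per key; all three choices, and the names used, are assumptions, since the specification leaves the concrete rules open.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]
Rule = Callable[[Row], bool]


def check(rows: List[Row], rules: List[Rule]) -> bool:
    """The check passes only if every row satisfies every preset rule."""
    return all(rule(row) for row in rows for rule in rules)


def clean(rows: List[Row]) -> List[Row]:
    """Second cleaning pass: drop exact duplicate rows, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def aggregate(rows: List[Row], key: str, field: str) -> Dict[str, float]:
    """Gather rows by key and sum the given numeric field."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[str(row[key])] += float(row[field])
    return dict(totals)


# Hypothetical preset checking rules and preprocessed rows.
rules = [lambda r: "device" in r, lambda r: float(r["reading"]) >= 0]
rows = [
    {"device": "m1", "reading": 2.0},
    {"device": "m1", "reading": 2.0},  # duplicate: removed by clean()
    {"device": "m2", "reading": 3.5},
    {"device": "m1", "reading": 1.0},
]
aggregated = aggregate(clean(rows), "device", "reading") if check(rows, rules) else {}
```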
Step S40, the data integration module extracts the feature data of the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result to realize the management of the data.
It can be understood that the feature data is data that uniquely identifies the cleaned data. A data mart is a data store that meets the needs of a specific department or user and is stored in a multidimensional manner, including dimension definitions, index calculations, dimension hierarchies, and the like. Different feature data corresponds to different data marts; for example, feature data A corresponds to a relational data mart, and feature data B corresponds to a multidimensional data mart.
Further, in order to effectively improve the efficiency of analyzing data, the data integration module comprises a data analysis module and a report generation module.
In this embodiment, the data analysis module extracts the feature data of the cleaned data and, when the type of the feature data is relational data, analyzes the feature data through a relational data mart and sends the analysis result to the report generation module. The relational data mart is the data mart in which relational data is analyzed. Each data mart is specific to its type of data, that is, relational data can only be analyzed through the relational data mart.
In this embodiment, the report generation module generates a corresponding data report according to the analysis result and displays the data reports visually in sequence to realize data management. A data report is a table generated from the data, and the reports are displayed in the order in which they were generated.
In this embodiment, when the type of the feature data is multidimensional data, the data analysis module analyzes the feature data through a multidimensional data mart and sends the analysis result to the ad hoc analysis module. Multidimensional data is data with multiple dimensions, and the multidimensional data mart is the data mart in which multidimensional data is analyzed.
In this embodiment, the ad hoc analysis module performs ad hoc analysis on the data in the analysis result, wherein the ad hoc analysis refers to analyzing the data in the analysis result according to the requirement of the user.
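The routing described in the preceding paragraphs (relational data to the relational data mart and then the report generation module, multidimensional data to the multidimensional data mart and then the ad hoc analysis module) can be sketched as a simple dispatch. The two "mart" functions below are stand-ins that only total values; their names, the type labels, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

RELATIONAL = "relational"
MULTIDIMENSIONAL = "multidimensional"


def relational_mart(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Stand-in analysis: column totals over flat rows."""
    totals: Dict[str, float] = {}
    for row in rows:
        for column, value in row.items():
            totals[column] = totals.get(column, 0.0) + value
    return totals


def multidimensional_mart(cells: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Stand-in analysis: roll a (region, month) cube up to totals per region."""
    by_region: Dict[str, float] = {}
    for (region, _month), value in cells.items():
        by_region[region] = by_region.get(region, 0.0) + value
    return by_region


def route(kind: str, feature_data):
    """Return (target module, analysis result) for the given feature data type."""
    if kind == RELATIONAL:
        return "report_generation", relational_mart(feature_data)
    if kind == MULTIDIMENSIONAL:
        return "ad_hoc_analysis", multidimensional_mart(feature_data)
    raise ValueError(f"unsupported feature data type: {kind}")


target_a, result_a = route(RELATIONAL, [{"sales": 10.0}, {"sales": 5.0}])
target_b, result_b = route(
    MULTIDIMENSIONAL,
    {("east", "jan"): 2.0, ("east", "feb"): 3.0, ("west", "jan"): 1.0},
)
```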
In this embodiment, referring to fig. 6, fig. 6 is a schematic diagram of a data management system of an embodiment of the hadoop-based data management method of the present invention, which specifically includes a data processing module, a data warehouse and a data integration module. The data processing module collects, and regularly extracts in batches, the data sent by the SAP software, the SRM software and a data collection platform; the data includes index-type data, device data, and the like. The data processing module converts the data and loads the converted data into the data warehouse. The data warehouse stores the converted data, performs standardized conversion on the stored data, and determines the type of the converted data. When the converted data is relational data, it is analyzed through a relational data mart and the analyzed data is displayed in a report; when the converted data is multidimensional data, it is analyzed through a multidimensional data mart and ad hoc analysis is performed on the analyzed data.
In this embodiment, the big data platform collects the data sent by each device, stores it, and sends the stored data to the data processing module through a preset management interface; the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse; the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module; and the data integration module extracts the feature data of the cleaned data, analyzes the feature data through a data mart, and visually displays the analysis result to realize data management. Because the data sent by each device is preprocessed, cleaned and analyzed in sequence and the analysis result is displayed visually, the efficiency of data management can be effectively improved.
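The "connected in sequence" arrangement of steps S10 to S40 can be sketched as a list of stage functions applied one after another. The body of each stage here is a deliberately trivial placeholder (store, drop missing values, drop negative values, sort for display); these rules and all names are hypothetical and stand in for the much richer processing described above.

```python
from typing import Callable, Dict, List

Records = List[Dict[str, object]]
Stage = Callable[[Records], Records]


def big_data_platform(records: Records) -> Records:
    """S10: store what the devices sent (here: copy the list)."""
    return list(records)


def data_processing(records: Records) -> Records:
    """S20: preprocess (here: drop records without a value)."""
    return [r for r in records if r.get("value") is not None]


def data_warehouse(records: Records) -> Records:
    """S30: clean again (here: drop negative values, a hypothetical rule)."""
    return [r for r in records if r["value"] >= 0]


def data_integration(records: Records) -> Records:
    """S40: prepare for display (here: sort by value, largest first)."""
    return sorted(records, key=lambda r: r["value"], reverse=True)


STAGES: List[Stage] = [big_data_platform, data_processing, data_warehouse, data_integration]


def run(records: Records) -> Records:
    """Pass the records through every stage in order."""
    for stage in STAGES:
        records = stage(records)
    return records


output = run([
    {"device": "a", "value": 3},
    {"device": "b", "value": None},
    {"device": "c", "value": -1},
    {"device": "d", "value": 7},
])
```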
For other embodiments or implementations of the hadoop-based data management system according to the present invention, reference may be made to the method embodiments described above; they are not repeated here.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (e.g., read-only memory (ROM)/random access memory (RAM), magnetic disk, or optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description covers only preferred embodiments of the present invention and is not intended to limit its scope. Any equivalent structure or equivalent process derived from the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims (10)

1. A hadoop-based data management system, comprising a big data platform, a data processing module, a data warehouse and a data integration module which are connected in sequence;
the big data platform is used for acquiring data sent by each device, storing the data sent by each device and sending the stored data to the data processing module through a preset management interface;
the data processing module is used for preprocessing the stored data and sending the preprocessed data to the data warehouse;
the data warehouse is used for cleaning the preprocessed data and sending the cleaned data to the data integration module;
the data integration module is used for extracting the characteristic data of the cleaned data, analyzing the characteristic data through a data mart, and visually displaying the analysis result to realize the management of the data.
2. The hadoop-based data management system as recited in claim 1 wherein the data processing module comprises a data cleaning module, a data conversion module, a data monitoring module, and a data calculation module;
the data cleaning module is used for cleaning the stored data and sending the cleaned data to the data conversion module;
the data conversion module is used for converting the cleaned data and uploading the converted data to the data monitoring module;
the data monitoring module is used for monitoring the converted data in real time, and if the data in the monitoring result is not abnormal data, the data in the monitoring result is sent to the data calculation module;
the data calculation module is used for classifying the data in the monitoring result, determining a corresponding calculation strategy according to the classified data, calculating the classified data according to the calculation strategy, and sending the calculation result to the data warehouse.
3. The hadoop-based data management system as recited in claim 2 wherein the data computation module comprises a process determination module and a policy selection module;
the process determining module is used for classifying the data in the monitoring result, extracting the process information of the classified data, and sending the classified data and the process information to the strategy selecting module;
and the strategy selection module is used for determining a corresponding calculation strategy according to the process information, calculating the classified data according to the calculation strategy and sending a calculation result to the data warehouse.
4. The hadoop-based data management system as recited in claim 1 wherein the data warehouse comprises a data check module, a warehouse data cleaning module, and a data aggregation module;
the data checking module is used for acquiring a preset checking rule, checking the preprocessed data according to the preset checking rule, and sending the preprocessed data to the warehouse data cleaning module when the checking is passed;
the warehouse data cleaning module is used for cleaning the preprocessed data and sending the cleaned data to the data aggregation module;
and the data aggregation module is used for aggregating the cleaned data to obtain corresponding aggregated data and sending the aggregated data to the data integration module.
5. The hadoop-based data management system as claimed in claim 1, wherein the data integration module comprises a data analysis module and a report generation module;
the data analysis module is used for extracting characteristic data of the cleaned data, analyzing the characteristic data through a relational data mart when the type of the characteristic data is relational data, and sending an analysis result to the report generation module;
and the report generation module is used for generating a corresponding data report according to the analysis result and displaying the data report in a visual manner in sequence so as to realize the management of data.
6. The hadoop-based data management system according to claim 5, further comprising an ad hoc analysis module;
the data analysis module is used for analyzing the characteristic data through a multi-dimensional data mart and sending an analysis result to the ad hoc analysis module when the type of the characteristic data is multi-dimensional data;
and the ad hoc analysis module is used for performing ad hoc analysis on the data in the analysis result.
7. The hadoop based data management system as recited in claim 5, wherein the hadoop based data management system further comprises a data set management module and a data source management module;
the data set management module is used for extracting header data of the data report, generating a corresponding data set according to the header data, and sending the data set to the data source management module;
the data source management module is used for tracing the source of the data in the data set to obtain source information corresponding to the data in the data set, and if the source information is inconsistent with the historical source information, the historical source information is modified according to the source information to realize the management of the data source of the data.
8. The hadoop based data management system according to any of claims 1 to 7, further comprising a data parsing module and a data extraction module;
the data processing module is also used for sending the stored data to the data parsing module;
the data parsing module is used for receiving the stored data sent by the data processing module through a preset integrated interface, parsing the stored data and sending the parsed data to the data extraction module;
the data extraction module is used for extracting the feature information of the parsed data, determining corresponding quality information and value information according to the feature information of the parsed data, obtaining the top N pieces of data based on the quality information and the value information, and managing the top N pieces of data.
9. The hadoop based data management system as recited in claim 1 further comprising a data tagging module and a data publishing module;
the big data platform is also used for sending the stored data to the data tagging module;
the data tagging module is used for acquiring the stored data sent by the big data platform through a preset service interface, determining a corresponding data category according to the stored data, and sending the data of the data category to the data publishing module;
and the data publishing module is used for publishing the data of the data category.
10. A hadoop-based data management method, applied to the hadoop-based data management system according to any one of claims 1 to 9, the system comprising a big data platform, a data processing module, a data warehouse and a data integration module which are connected in sequence, the method comprising the following steps:
the big data platform collects data sent by each device, stores the data sent by each device, and sends the stored data to the data processing module through a preset management interface;
the data processing module preprocesses the stored data and sends the preprocessed data to the data warehouse;
the data warehouse cleans the preprocessed data and sends the cleaned data to the data integration module;
the data integration module extracts the characteristic data of the cleaned data, analyzes the characteristic data through a data mart, and visually displays the analysis result to realize the management of the data.
CN202110802194.7A 2021-07-15 2021-07-15 Hadoop-based data management system and method Pending CN113254436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110802194.7A CN113254436A (en) 2021-07-15 2021-07-15 Hadoop-based data management system and method


Publications (1)

Publication Number Publication Date
CN113254436A true CN113254436A (en) 2021-08-13

Family

ID=77180438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110802194.7A Pending CN113254436A (en) 2021-07-15 2021-07-15 Hadoop-based data management system and method

Country Status (1)

Country Link
CN (1) CN113254436A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559562A (en) * 2013-11-20 2014-02-05 贵州电网公司电力调度控制中心 Power grid intelligent operation system and achieving method thereof
CN109857832A (en) * 2019-01-03 2019-06-07 中国银行股份有限公司 A kind of preprocess method and device of payment data
CN111190965A (en) * 2018-11-15 2020-05-22 北京宸瑞科技股份有限公司 Text data-based ad hoc relationship analysis system and method
CN112035442A (en) * 2020-09-02 2020-12-04 南京星邺汇捷网络科技有限公司 Dynamic CMDB automatic association method based on big data
CN112181959A (en) * 2020-09-15 2021-01-05 山东特检鲁安工程技术服务有限公司 Special equipment multi-source data processing platform and processing method
CN112559488A (en) * 2020-12-09 2021-03-26 中铁第四勘察设计院集团有限公司 Escalator full life cycle data management method and system based on data center station


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448951A (en) * 2021-09-02 2021-09-28 深圳市信润富联数字科技有限公司 Data processing method, device, equipment and computer readable storage medium
CN113448951B (en) * 2021-09-02 2021-12-21 深圳市信润富联数字科技有限公司 Data processing method, device, equipment and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210813
