CN110633271A - Data extraction method from Hbase to mysql based on json - Google Patents
Data extraction method from Hbase to mysql based on json Download PDFInfo
- Publication number
- CN110633271A CN110633271A CN201911137971.XA CN201911137971A CN110633271A CN 110633271 A CN110633271 A CN 110633271A CN 201911137971 A CN201911137971 A CN 201911137971A CN 110633271 A CN110633271 A CN 110633271A
- Authority
- CN
- China
- Prior art keywords
- data
- json
- hbase
- mysql
- extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/0482—Interaction with lists of selectable items, e.g. menus
Abstract
The invention discloses a data extraction method from Hbase to mysql based on json, and relates to the technical field of data processing. The method comprises the steps of configuring HBase data source information, mysql target data source information, an extraction strategy, a cleaning rule, a conversion rule and a field mapping relation on a visual interface; executing a query on a visual interface to generate a json data set; the JCleanprocessor analyzes dirty data and invalid data in the json data set and deletes the dirty data and the invalid data; the JTransformprocessor converts data in the json data set according to different data types; json data converted by the MultiPutHBaseJSON processor is stored in the mysql database. According to the invention, through visual page configuration, the convenient and fast efficacy transfer from the Hbase data source to the mysql target data source is realized.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a json-based data extraction method from Hbase to mysql.
Background
Hbase is a distributed, column-oriented open-source database, which is a database suitable for unstructured data storage. Meanwhile, the Hbase realizes a database system with high reliability, high performance, column storage, scalability and real-time reading and writing. But due to the inherent limitations of a single RowKey, it was determined that it was not possible to efficiently support multi-conditional queries. This results in that Hbase is not so convenient to use when the service scenario is complicated, multi-table association is required and multi-conditional queries are required. Meanwhile, in many business systems, many data are stored in a relational database, such as mysql. Thus, when data synchronization is required, data transfers across data sources can occur.
Referring to fig. 8, a common technical scheme for transferring data from the Hbase database to the mysql database is to export the data in the Hbase to the csv file or txt file, and then import the data in the csv file or txt file to the mysql database. However, this method has the following disadvantages:
a. reading data from Hbase requires command line operation, and has high requirement on professional knowledge of operators;
b. when writing data to mysql, since the amount of data stored in Hbase may be large, there may be various performance problems at the time of introduction.
c. The user experience is poor due to the fact that the visual operation interface is not available.
The invention provides a data extraction method from Hbase to mysql based on json, which is used for solving the following problems:
1) through a visual configuration mode, a user configures Hbase data source information, mysql target data source information, a cleaning rule, a conversion rule, a field mapping relation and data loading which are required to be extracted according to the actual requirement of the user; while the data is transferred in json format to effect the transfer of data from Hbase to mysql library.
2) Due to the existence of column families in Hbase, a plurality of columns exist under one column family; whereas mysql is a relational database with no column families; the technical scheme of the invention also solves the problem of how to take out the required column from the column family of Hbase and store the value of the column into mysql. .
Disclosure of Invention
The invention aims to provide a json-based method for extracting data from Hbase to mysql, which realizes convenient and rapid effect transfer from an Hbase data source to a mysql target data source through visual page configuration.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a data extraction method from Hbase to mysql based on json, which comprises the following steps:
s00: configuring HBase data source information, mysql target data source information, an extraction strategy, a cleaning rule, a conversion rule and a field mapping relation on a visual interface;
the configuration of the HBase data source information further comprises the steps of selecting the HBase data source, the table name and the column family or the column information to be extracted;
s01: executing a query on a visual interface to generate a json data set;
s02: the JCleanprocessor analyzes dirty data and invalid data in the json data set and deletes the dirty data and the invalid data;
s03: the JTransformprocessor converts data in the json data set according to different data types;
s04: json data converted by the MultiPutHBaseJSON processor is stored in the mysql database.
Preferably, S00 specifically includes the following:
reading data by adopting a MultiGetHBase processor according to the specified HBase data source information and the extraction strategy;
the extraction strategy comprises incremental extraction and full extraction; when the increment extraction is executed, extracting data changed after the last task is finished; when the full amount extraction is performed, all data is extracted.
Preferably, the MultiGetHBase processor separates the extracted json data set into json data groups.
Preferably, the method further comprises monitoring the data extraction task by using a JMonitorProcessor processor.
Preferably, the MultiPutHBaseJSON processor writes json data into the mysql database in batches.
The invention has the following beneficial effects:
1. according to the invention, through visual page configuration, the convenient and fast efficacy transfer from the Hbase data source to the mysql target data source is realized;
2. in the data extraction process, the data is transmitted in a json format, so that the problem can be timely positioned through the generated json data and the data can be conveniently modified;
3. the invention reduces the requirement of professional knowledge of operators through interface configuration operation, is convenient to operate and improves the efficiency.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a json-based method for extracting Hbase-mysql data according to the invention;
FIG. 2 is a schematic diagram of a create task interface in the present invention;
FIG. 3 is a schematic diagram of an interface for configuring an Hbase data source, an extraction field and an extraction strategy according to the present invention;
FIG. 4 is a schematic diagram of an interface for configuring mysql data sources and table field information in the present invention;
FIG. 5 is a schematic diagram of an interface for configuring cleaning rules according to the present invention;
FIG. 6 is a schematic interface diagram of the mapping relationship between the Hbase table field and the mysql table field established in the present invention;
FIG. 7 is a schematic diagram of an interface for configuring a loading policy and a time scheduling policy according to the present invention;
FIG. 8 is a diagram illustrating a conventional scheme for transferring data from the Hbase database to the mysql database in the background art of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The implementation method comprises the following steps:
referring to fig. 1, the present invention is a json-based method for extracting data from Hbase to mysql, comprising the following steps:
s00: configuring HBase data source information, mysql target data source information, an extraction strategy, a cleaning rule, a conversion rule and a field mapping relation on a visual interface;
the configuration of the HBase data source information further comprises the steps of selecting the HBase data source, a table name and column family or column information to be extracted; reading data by adopting a MultiGetHBase processor according to the specified HBase data source information and the extraction strategy;
configuring Hbase database information, and connecting the Hbase database information to the Hbase database; since there may be multiple column families in the same rowkey data in the Hbase database, there may be multiple columns under a column family, so there is multidimensional data, but the mysql database stores data in one-dimensional form. At this time, the mapping of data needs to be processed; through a visual configuration mode, data with different dimensions can be definitely specified to be transferred to different storage places, and therefore association of source data and target data is achieved.
The format of the source data is as follows:
ROW COLUMN+CELL
001column=info1:age, timestamp=1536739425819, value=18
001column=info1:name, timestamp=1536542197314, value=tom
001column=info2:chinese, timestamp=1536542217612, value=89
001column=info2:math, timestamp=1536739462180, value=67
002column=info1:age, timestamp=1536739588914, value=34
002column=info1:name, timestamp=1536739081007, value=James
002column=info1:sex, timestamp=1539335801990, value=male
002column=info2:chinese, timestamp=1536739111620, value=78
002 column=info2:math, timestamp=1536739480289, value=76
003column=info1:age, timestamp=1537155958877, value=37
004 column=info1:name, timestamp=1537947449018, value=jack
referring to fig. 2-7, a specific configuration process is shown:
wherein, as shown in fig. 2, firstly, a task name is established; as shown in fig. 3, when the Hbase data source, the extracted field and the extraction strategy are selected, the Hbase data source is automatically synchronized to the right frame after clicking the field in the left frame; as shown in fig. 4, mysql data source and table field information are configured; as shown in fig. 5, a cleaning rule is configured; as shown in fig. 6, the conversion rule is configured, and meanwhile, the mapping relationship between the Hbase table field and the mysql table field is also automatically established; as shown in fig. 7, a loading policy and a time scheduling policy are configured, the loading policy may select full deletion and incremental deletion, and the time scheduling policy may select interval time or cron expression, so as to complete creation and configuration tasks;
s01: executing a query on a visual interface to generate a json data set;
specifically, the extraction strategy comprises incremental extraction and full extraction; when the incremental extraction is selected, the system records the time point of the last extraction completion, and each task only extracts data changed after the last task is completed; when the full extraction is selected, the data in the table are extracted completely. In order to prevent the problem of memory overflow caused by overlarge data volume, the processor can automatically group json data, and finally, a complete json data set is divided into a plurality of small json data groups, so that the transfer is facilitated, and the performance is improved; in order to realize reasonable grouping of json, data in a table needs to be sorted firstly, then extracted in segments, the size of each segment can be manually configured by a user (the default is that 1 ten thousand pieces of data are in a group), and finally the extracted data is converted into json character strings.
S02: the JCleanprocessor analyzes dirty data and invalid data in the json data set and deletes the dirty data and the invalid data;
because of the existence of dirty data and invalid data, the JCleanprocessor automatically analyzes all data, finds out data which does not meet the requirements, and deletes the data from the json result set;
the specific process is as follows: performing cleaning rule check on all data in batch according to the json result set generated by the last processor, for example, if the cleaning rule is to judge whether the age is greater than 18 years, finding age information from the json result set, then comparing the age information with the age of 18 years, if so, retaining the data, and if not, deleting the data from the whole json result set;
s03: the JTransformprocessor converts data in the json data set according to different data types;
s04: saving json data converted by the MultiPutHBaseJSON processor into a mysql database;
specifically, a MultiPutHBaseJSON processor writes json data into a mysql database in batches; the MultiPutHBaseJSON processor writes a plurality of pieces of data into mysql at one time by adopting a batch processing technology, batch submission is achieved, and the writing times of a database are reduced; meanwhile, data exception processing is also carried out, when storage fails, data rollback is achieved, and all error data can be recorded.
The method comprises the following steps of monitoring a data extraction task by adopting a JMonitorProcessor processor; the jmonitor processor completely records the execution condition of the whole extraction task, including the start time and the end time of the task, the amount of extracted data, the amount of cleaned data, the amount of converted data, the amount of warehoused data, the amount of error data and other information.
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (5)
1. A data extraction method from Hbase to mysql based on json is characterized by comprising the following steps:
s00: configuring HBase data source information, mysql target data source information, an extraction strategy, a cleaning rule, a conversion rule and a field mapping relation on a visual interface;
the configuration of the HBase data source information further comprises the steps of selecting the HBase data source, the table name and the column family or the column information to be extracted;
s01: executing a query on a visual interface to generate a json data set;
s02: the JCleanprocessor analyzes dirty data and invalid data in the json data set and deletes the dirty data and the invalid data;
s03: the JTransformprocessor converts data in the json data set according to different data types;
s04: json data converted by the MultiPutHBaseJSON processor is stored in the mysql database.
2. The method for extracting data from hbsase to mysql based on json as claimed in claim 1, wherein S00 specifically includes the following:
reading data by adopting a MultiGetHBase processor according to the specified HBase data source information and the extraction strategy;
the extraction strategy comprises incremental extraction and full extraction; when the increment extraction is executed, extracting data changed after the last task is finished; when the full amount extraction is performed, all data is extracted.
3. The json-based Hbase-to-mysql data extraction method of claim 3, wherein the MultiGetHBase processor separates the extracted json dataset into json data sets.
4. The json-based data extraction method from Hbase to mysql as claimed in claim 1, further comprising monitoring the data extraction task using a JMonitorProcessor processor.
5. The json-based Hbase-to-mysql data extraction method of claim 1, wherein the MultiPutHBaseJSON processor writes json data to the mysql database in batches.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911137971.XA CN110633271A (en) | 2019-11-20 | 2019-11-20 | Data extraction method from Hbase to mysql based on json |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911137971.XA CN110633271A (en) | 2019-11-20 | 2019-11-20 | Data extraction method from Hbase to mysql based on json |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110633271A true CN110633271A (en) | 2019-12-31 |
Family
ID=68979468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911137971.XA Pending CN110633271A (en) | 2019-11-20 | 2019-11-20 | Data extraction method from Hbase to mysql based on json |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110633271A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111694824A (en) * | 2020-05-25 | 2020-09-22 | 智强通达科技(北京)有限公司 | Method for mapping and cleaning oil data chain |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069551A (en) * | 2019-04-25 | 2019-07-30 | 江南大学 | Medical Devices O&M information excavating analysis system and its application method based on Spark |
CN110196876A (en) * | 2019-06-05 | 2019-09-03 | 浪潮软件股份有限公司 | A method of it is isolated tool based on web administration and scheduling Kettle |
-
2019
- 2019-11-20 CN CN201911137971.XA patent/CN110633271A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069551A (en) * | 2019-04-25 | 2019-07-30 | 江南大学 | Medical Devices O&M information excavating analysis system and its application method based on Spark |
CN110196876A (en) * | 2019-06-05 | 2019-09-03 | 浪潮软件股份有限公司 | A method of it is isolated tool based on web administration and scheduling Kettle |
Non-Patent Citations (2)
Title |
---|
丿青辞: "从Hbase导入数据至Mysql", 《博客园:HTTPS://WWW.CNBLOGS.COM/AUTUMNNIGHT/P/10938709.HTML》 * |
崔有文 等: "基于KETTLE的数据集成研究", 《计算机技术与发展》 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111694824A (en) * | 2020-05-25 | 2020-09-22 | 智强通达科技(北京)有限公司 | Method for mapping and cleaning oil data chain |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11023448B2 (en) | Data scrubbing method and apparatus, and computer readable storage medium | |
CN111046034B (en) | Method and system for managing memory data and maintaining data in memory | |
CN102129458B (en) | Method and device for storing relational database | |
US10452491B2 (en) | Scalable log partitioning system | |
US8924373B2 (en) | Query plans with parameter markers in place of object identifiers | |
CN112286941B (en) | Big data synchronization method and device based on Binlog + HBase + Hive | |
US9619492B2 (en) | Data migration | |
CN107301214B (en) | Data migration method and device in HIVE and terminal equipment | |
CN109376196B (en) | Method and device for batch synchronization of redo logs | |
CN103593440B (en) | The reading/writing method and device of journal file | |
CN111324610A (en) | Data synchronization method and device | |
US10671586B2 (en) | Optimal sort key compression and index rebuilding | |
AU2018290753B2 (en) | Systems and methods of creation and deletion of tenants within a database | |
WO2019161645A1 (en) | Shell-based data table extraction method, terminal, device, and storage medium | |
JP2016539449A (en) | Database implementation method | |
US20180150536A1 (en) | Instance-based distributed data recovery method and apparatus | |
CN110245134B (en) | Increment synchronization method applied to search service | |
CN105900093A (en) | Keyvalue database data table updating method and data table updating device | |
CN108121564A (en) | A kind of json data Fast synchronization and the method and system device of control | |
CN101833511A (en) | Data management method, device and system | |
CN110633271A (en) | Data extraction method from Hbase to mysql based on json | |
CN113760902A (en) | Data splitting method, device, equipment, medium and program product | |
CN113094442A (en) | Full data synchronization method, device, equipment and medium | |
CN116010345A (en) | Method, device and equipment for realizing table service scheme of flow batch integrated data lake | |
JP2015130165A (en) | Automated compilation of graph input for hypergraph solver |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191231 |
|
RJ01 | Rejection of invention patent application after publication |