CN116244381A - Data transfer method, device, equipment and storage medium - Google Patents

Data transfer method, device, equipment and storage medium Download PDF

Info

Publication number
CN116244381A
CN116244381A CN202310188996.2A CN202310188996A CN116244381A CN 116244381 A CN116244381 A CN 116244381A CN 202310188996 A CN202310188996 A CN 202310188996A CN 116244381 A CN116244381 A CN 116244381A
Authority
CN
China
Prior art keywords
field
service
data
index table
spliced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310188996.2A
Other languages
Chinese (zh)
Inventor
姚田文
韦超豪
孙琳
陆东东
董彪
许先文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202310188996.2A priority Critical patent/CN116244381A/en
Publication of CN116244381A publication Critical patent/CN116244381A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/214Database migration support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data transfer method, a device, equipment and a storage medium, and relates to the technical field of databases, wherein the method comprises the following steps: reading data to be transferred from a relation service table of a relation database; transferring the data to be transferred to a data table of the distributed database, and storing index information of the data to be transferred in an index table of the distributed database, wherein a row key of the index table comprises: the spliced field with the unique identification function is spliced based on at least one service field in all service fields of the relation service table; the column families of the index table include: at least one status field of the service hash field and the all service field, the service hash field being associated with at least one of the all service field. In the embodiment of the application, the state field is transferred from the row key of the index table to the column group of the index table, so that the phenomenon that data are repeated when a user queries historical data can be avoided, and meanwhile, the data transfer efficiency is improved.

Description

Data transfer method, device, equipment and storage medium
Technical Field
The present invention relates to the field of database technologies, and in particular, to a data transfer method, apparatus, device, and storage medium.
Background
When the traffic data volume (such as the historical flow data of the traffic travel business) is too large, a distributed database is often used to store the data. In practical application, the collected data is stored in a relational database, and then the data is transferred from the relational database to a distributed database.
However, status fields that are constantly changing during actual transactions, such as transaction payment status, clearing status, user subsidy status, customer complaint status, in-out status, etc., may appear. Therefore, in the data transfer process, one transaction record in the relational database corresponds to a plurality of repeated transaction records after being transferred to the distributed database, and the plurality of repeated transaction records only have different information of the status field, so that the phenomenon of data repetition occurs when historical data is queried.
Disclosure of Invention
The embodiment of the application provides a data transfer method, a device, equipment and a storage medium, which are used for solving the problem of data repetition when historical data are inquired.
In one aspect, an embodiment of the present application provides a data transfer method, including:
reading data to be transferred from a relation service table of a relation database;
Transferring the data to be transferred to a data table of a distributed database, and storing index information of the data to be transferred in an index table of the distributed database;
wherein, the row key of the index table includes: the spliced field with the unique identification function is obtained by splicing at least one service field in all service fields of the relation service table; the column families of the index table include: a service hash field and at least one status field of the all service fields, the service hash field being associated with at least one of the all service fields.
In the embodiment of the application, the state field is transferred from the row key of the index table to the column group of the index table, the efficiency of inquiring and filtering the historical data is improved through redundant storage, the phenomenon that the data is repeated when the user inquires the historical data is avoided, and meanwhile the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
Optionally, the column group of the data table includes all service fields of the relational service table, and the row key of the data table includes: a traffic hash field.
Optionally, the storing the index information of the data to be transferred in an index table of the distributed database includes:
splicing the field information of at least one service field associated with the spliced field in the index information according to a preset splicing sequence to obtain the field information of the spliced field;
and adding the field information of the spliced field as the row key content of the index table into the index table to be saved.
In the embodiment of the application, the service fields are spliced according to the preset splicing sequence, and the obtained information of the spliced fields is used as the row key content of the index table, so that the uniqueness of the row keys of the index table can be enhanced.
Optionally, the storing the index information of the data to be transferred in an index table of the distributed database includes:
acquiring field information of at least one service field associated with the spliced field from the index information;
determining the abstract of the field information of the target field positioned at the first bit of the spliced field in the at least one service field by adopting an abstract algorithm;
splicing the abstract of the field information of the target field with the field information of other service fields in the at least one service field according to a preset sequence to obtain the field information of the spliced field;
And adding the field information of the spliced field as the row key content of the index table into the index table to be saved.
In the embodiment of the application, the original plaintext spliced field is replaced by calculating the abstract for the target field of the first bit of the spliced field, so that the row keys of the index table have strong randomness, and the query hotspots are greatly reduced.
Optionally, the method further comprises:
and adding field information of the service field which is respectively associated with the service hash field and the status field in the index information into the index table as column family content of the index table.
In the embodiment of the application, the business hash field and the status field are stored in the column of the index table, so that the repeated problem of query data caused by the status field in a row key can be avoided, and meanwhile, the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
Optionally, before reading the data to be transferred from the relational service table of the relational database, the method further includes:
acquiring at least one service field in the whole service fields;
if the first field of the at least one service field is not a time field, calculating a summary for the first field;
And splicing the abstract with other fields in the at least one service field according to a preset sequence by taking the abstract as a beginning field, obtaining the spliced field, and taking the spliced field as a row key of the index table.
In the embodiment of the application, before transferring data, the row keys of the index table are configured, and the data can be transferred into a required format according to requirements during data transfer. If the first field is not the time field, the abstract can be calculated for the first field, and the phenomenon of hot spot in the distributed data storage caused by the similarity of field information is avoided.
Optionally, the method further comprises:
and if the first field of the at least one service field is a time field, splicing the at least one service field according to a preset sequence to obtain the spliced field, and taking the spliced field as a row key of the index table.
In the embodiment of the application, if the first field of the service field is a time field, the field is directly spliced, so that some services for searching data according to the time range are prevented from being influenced.
Optionally, before reading the data to be transferred from the relational service table of the relational database, the method further includes:
Acquiring at least one service field in the whole service fields;
splicing field information in the at least one service field according to a preset sequence, and carrying out hash processing on a splicing result to obtain the service hash field;
and taking at least one status field in the business hash field and the whole business fields as a column family of the index table.
In the embodiment of the application, before transferring data, the columns of the index table are configured, and the data can be transferred into a required format according to requirements during data transfer.
In one aspect, an embodiment of the present application provides a data transfer apparatus, including:
the reading module reads data to be transferred from a relation service table of the relation database;
the transfer module is used for transferring the data to be transferred to a data table of a distributed database and storing index information of the data to be transferred in the index table of the distributed database;
wherein, the row key of the index table includes: the spliced field with the unique identification function is obtained by splicing at least one service field in all service fields of the relation service table; the column families of the index table include: a service hash field and at least one status field of the all service fields, the service hash field being associated with at least one of the all service fields.
Optionally, the column group of the data table includes all service fields of the relational service table, and the row key of the data table includes: a traffic hash field.
Optionally, the transfer module is specifically configured to:
splicing field information of at least one service field associated with the spliced field in the index information according to a preset sequence to obtain field information of the spliced field;
and adding the field information of the spliced field into the index table as the row key content of the index table.
Optionally, the transfer module is specifically configured to:
acquiring field information of at least one service field associated with the spliced field from the index information;
determining the abstract of the field information of the target field positioned at the first bit of the spliced field in the at least one service field by adopting an abstract algorithm;
splicing the abstract of the field information of the target field with the field information of other service fields in the at least one service field according to a preset sequence to obtain the field information of the spliced field;
and adding the field information of the spliced field into the index table as the row key content of the index table.
Optionally, the transfer module is further configured to:
and adding field information of the service field which is respectively associated with the service hash field and the status field in the index information as column family content of the index table to be stored in the index table.
Optionally, the transfer module is further configured to:
acquiring at least one service field in the whole service fields;
if the first field of the at least one service field is not a time field, calculating a summary for the first field;
and splicing the abstract with other fields in the at least one service field according to a preset sequence by taking the abstract as a beginning field, obtaining the spliced field, and taking the spliced field as a row key of the index table.
Optionally, the transfer module is further configured to:
and if the first field of the at least one service field is a time field, splicing the at least one service field according to a preset sequence to obtain the spliced field, and taking the spliced field as a row key of the index table.
Optionally, the reading module is further configured to:
acquiring at least one service field in the whole service fields;
Splicing field information in the at least one service field according to a preset sequence, and carrying out hash processing on a splicing result to obtain the service hash field;
and taking at least one status field in the business hash field and the whole business fields as a column family of the index table.
In the embodiment of the application, the state field is transferred from the row key of the index table to the column group of the index table, the efficiency of inquiring and filtering the historical data is improved through redundant storage in the index table and the data table, the phenomenon that the data is repeated when the user inquires the historical data is avoided, and meanwhile the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
Drawings
FIG. 1 is a system architecture diagram provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a data transfer method according to an embodiment of the present disclosure;
FIG. 3 is a second flowchart of a data transfer method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for constructing an index table row key according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data transfer overall framework according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a frame structure of a data transfer module according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a data query framework according to an embodiment of the present application;
fig. 8 is a schematic flow diagram of expanding Sqoop according to an embodiment of the present application;
fig. 9 is a second schematic flow chart for expanding Sqoop according to the embodiment of the present application;
fig. 10 is a schematic flow chart III for expanding Sqoop according to an embodiment of the present application;
FIG. 11 is a third flowchart of a data transfer method according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a data transfer device according to an embodiment of the present application;
fig. 13 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, a system architecture diagram applicable to an embodiment of the present application includes at least a terminal device 101 and a server 102.
Wherein the terminal device 101 pre-installs a service application, which may be a client application. The server 102 is a background server of a business application, and the server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content distribution network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligent platform.
In an actual application scenario, the terminal device 101 acquires data of a read card in response to a card reading operation triggered in a service application, and sends the data to the server 102. The server 102 adopts the data transfer method in the embodiment of the present application, firstly stores the read card data in a relational database, and then transfers the data from the relational database to a distributed database. The relational database may be a MySQL database and the distributed database may be an HBase database. When data is transferred from the relational database MySQL to the distributed data HBase, an sqoop data transfer tool is used, and the sqoop is a source opening tool and is mainly used for data transfer between the distributed database and the relational database, so that the data in one relational database can be led into the distributed database.
Based on the system architecture diagram of fig. 1, an embodiment of the present application provides a data transfer method, as shown in fig. 2, where a flow of the method is performed by a computer device, and the computer device may be the terminal device 101 and/or the server 102 shown in fig. 1, and includes the following steps:
step 201, reading data to be transferred from a relation service table of a relation database.
Specifically, the user may generate various data related to the transaction by swiping the card, and store all the data in a relational service table in the relational database, that is, in a service table in the MySQL database. Before transferring the data to be transferred, the data to be transferred in the relation service table is read first.
For example, in the traffic travel service, the MySQL database stores related data of the traffic travel service, and the traffic travel service table records the user id, the transaction list number, the arrival/arrival time, the arrival/arrival state, the trip amount, and the like of each user, and these data are recorded in the traffic travel service table, and when transferring the data in the traffic travel service table, the data to be transferred are first acquired.
Step 202, transferring the data to be transferred to a data table of the distributed database, and storing index information of the data to be transferred in the index table of the distributed database.
Specifically, after the data to be transferred is obtained from the MySQL database, the data to be transferred is transferred to a data table in the distributed database, namely, a data table in the HBase database. And simultaneously, the index information of the data to be transferred is stored in an index table in the HBase database.
Wherein, the row key of the index table includes: the spliced field with the unique identification function is spliced based on at least one service field in all service fields of the relation service table; the column families of the index table include: at least one status field of the service hash field and the all service field, the service hash field being associated with at least one of the all service field.
Specifically, the row key of the index table is from all service fields of the relational service table in the MySQL database, and can be one field or a plurality of fields spliced according to the splicing sequence, but the row key of the index table has a unique identification function. The column family of the index table may have a plurality of columns including 1 index column and a plurality of status columns. The index list is a service hash field, which is obtained by processing at least one service field in all fields except the status field through a hash algorithm, wherein the hash algorithm (hash for short) is also called a digest algorithm, and is used for calculating any group of input data to obtain an output digest with a fixed length. The status column is composed of status fields, which are constantly changing fields, for example, in traffic traveling service, the incoming and outgoing status of the user is constantly changing, and may be an unincoming status, an incoming status, an outgoing status, and the incoming and outgoing status is a status field.
In the embodiment of the application, the state field is transferred from the row key of the index table to the column group of the index table, the efficiency of inquiring and filtering the historical data is improved through redundant storage, the phenomenon that the data is repeated when the user inquires the historical data is avoided, and meanwhile the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
In some embodiments, the column group of the data table includes all of the service fields of the relational service table, and the row key of the data table includes: a traffic hash field.
Specifically, the row key of the data table is an index column in the index table, i.e., a service hash field. The column family of the data table is all the service fields in the relational service table in the MySQL database.
In some embodiments, storing index information of data to be transferred in an index table of a distributed database includes:
splicing field information of at least one service field associated with the spliced field in the index information according to a preset sequence to obtain field information of the spliced field; and adding field information of the spliced field into the index table as row key content of the index table.
Specifically, the index information includes service fields in the index table and a value corresponding to each service field. The row keys of the index table are splicing fields, the splicing fields are spliced by at least one service field according to a preset sequence, and field information of the splicing fields is stored in the index table as the content of the row keys of the index table.
For example, in the traffic travel service, possible service fields include a user id, an operator id, an order generation time, and the like, and each service field has a corresponding value, that is, field information, and after the service fields are spliced according to a preset sequence, the following table 1 shows:
TABLE 1
Figure BDA0004104837800000091
In the embodiment of the application, the service fields are spliced according to the preset splicing sequence, and the obtained information of the spliced fields is used as the row key content of the index table, so that the uniqueness of the row keys of the index table can be ensured.
In some embodiments, the index information of the data to be transferred is stored in an index table of the distributed database, as shown in fig. 3, and further includes the following steps:
step 301, acquiring field information of at least one service field associated with the spliced field from the index information.
Specifically, firstly, service fields associated with splicing fields are acquired in index information, and at least one service field is associated with the splicing fields.
Step 302, determining a summary of field information of a target field located in a first bit of the spliced field in at least one service field by adopting a summary algorithm.
Specifically, a digest algorithm is adopted for the service field located in the first bit in the spliced field, so that a hash field, namely a target field, is obtained.
For example, in a traffic travel business, in the business field: in the user id, the operator id and the order generation time, the user id is used as the first bit of the spliced field, and a digest algorithm is adopted for the user id, so that a target field hash (user id) is obtained.
Step 303, splicing the abstract of the field information of the target field with the field information of other service fields in at least one service field according to a preset sequence to obtain the field information of the spliced field.
For example, in the traffic travel service, field information obtained by using a summarization algorithm for the user id is placed in the first bit of the spliced field, for example, the value of hash (user id) is 84435AD8D2C9EF3DC450660CAC0E0CEF. The service fields except the first bit are spliced in the order of carrier id, order generation time, as shown in table 2 below:
TABLE 2
Figure BDA0004104837800000101
And 304, adding field information of the spliced field as row key content of the index table into the index table.
In the embodiment of the application, the original plaintext spliced field is replaced by calculating the abstract for the target field of the first bit of the spliced field, so that the row keys of the index table have strong randomness, and the query hotspots are greatly reduced.
In some embodiments, field information of the service field associated with each of the service hash field and the status field in the index information is added as column family content of the index table to the index table stored therein.
Specifically, the column group of the index table includes an index column and a plurality of status columns, wherein the field information corresponding to the index column is field information obtained by adopting hash processing, and the field information corresponding to the status column is field information corresponding to the status field.
For example, in a traffic travel service, possible service fields include: the field information of the user id, the operator id, the order generation time, the station entering and exiting state, the complementary state and the like can be used as the field information of an index column after hash processing, and the field information of the station entering and exiting state and the complementary state can be used as the field information corresponding to a state column.
In the embodiment of the application, the business hash field and the state field are stored in the column of the index table, so that the problem that the state field is repeated due to the row key is avoided, meanwhile, the problem of low query efficiency of directly querying the data table according to the state field is avoided, and the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
In some embodiments, before transferring the data, as shown in fig. 4, the following steps are further included:
Step 401, at least one service field in all service fields is obtained, and a first field in the at least one service field is determined based on a preset splicing sequence.
Specifically, before transferring the data to be transferred, the data table and the index table of the HBase are configured so that the data to be transferred is transferred according to the requirement. At least one service field in all service fields is acquired from the MySQL database, and the first field in the service fields is determined according to a preset splicing sequence.
Step 402, if the first field is not a time field, a summary is calculated for the first field.
Specifically, after determining the first field of the service field, it is determined whether the first field is a time field, and if not, the summary is calculated for the first field.
For example, the preset splicing sequence of the service fields is as follows: user id, operator id, order generation time, the first field is user id, and if not the time field, the abstract, i.e. the hash (user id), needs to be calculated for the user id.
Step 403, splicing the abstract with other fields in at least one service field according to a preset splicing sequence by taking the abstract as a beginning field, obtaining a spliced field, and taking the spliced field as a row key of an index table.
Specifically, after calculating the abstract of the first field, the abstract of the first field is used as a head field, and then other service fields are spliced according to a preset splicing sequence, so that a spliced field is obtained, and the spliced field is used as a row key of an index table.
For example, the first field is user id, a summary is calculated for the user id to obtain a hash (user id), then the hash (user id) is spliced with operator id and order generation time to obtain a spliced field hash (user id) _operator id_order generation time, and the spliced field is used as a row key of an index table.
In the embodiment of the application, before transferring data, the row keys of the index table are configured, and the data can be transferred into a required format according to requirements during data transfer. If the first field is not the time field, the abstract can be calculated for the first field, and the phenomenon of hot spot in data storage caused by similarity of field information is avoided.
In some embodiments, if the first field of the at least one service field is a time field, splicing the at least one service field according to a splicing order to obtain a spliced field, and using the spliced field as a row key of the index table.
Specifically, if the first field is a time field, the time field and other service fields are directly spliced to obtain a spliced field as a row key of the index table without calculating the abstract for the first field. Because the data index may be performed according to time if the first field is a time field, if the summary is calculated for the first field of the time field, the data query and the filtering cannot be performed according to the time range, and further the data index cannot be performed according to time.
For example, if the first field is the order generation time, the order generation time is directly spliced with other service fields, such as order generation time_user id_operator id, without calculating an abstract for the order generation time, and the order generation time is used as a row key of the index table.
In the embodiment of the application, if the first field of the service field is a time field, the field is directly spliced, so that some services for searching data according to time are prevented from being influenced.
In some embodiments, before reading the data to be transferred from the relational service table of the relational database, the method further comprises:
at least one service field in all service fields is obtained, field information in the at least one service field is spliced according to a preset splicing sequence, hash processing is carried out on a splicing result to obtain a service hash field, and the service hash field and at least one state field in all service fields are used as column groups of an index table.
Specifically, the column family of the index table contains an index column and a status column. At least one of all the service fields is acquired from the MySQL database, the acquired all the service fields do not contain the status field, the acquired service fields are spliced according to a preset splicing sequence, and then hash processing is carried out, and the service fields are used as index columns. The status field is a status column, and finally the index column and the status column are used as column groups of the index table.
For example, in the traffic travel service, the service field which can be used as an index column has a user id and an operator id, the user id and the operator id are spliced according to a preset sequence, then hash processing is performed to obtain a hash (user id_operator id), the hash is used as an index column of an index table, the state column of the index table can be the order generation time, and then the hash (user id_operator id) and the order generation time together form a column group of the index table.
In the embodiment of the application, before transferring data, the columns of the index table are configured, and the data can be transferred into a required format according to requirements during data transfer.
The system in the embodiment of the application mainly comprises 4 modules, such as a scheduling framework based on python, data transfer processing based on customization of an open source sqoop component, HBase inquiry, data transfer monitoring and the like, and further comprises other modules, such as a configuration module and the like. And transferring the data stored in the MySQL database to the HBase database for storage according to a customized data format through various scheduling modes, and observing the transfer condition through a data transfer visualization module. The overall framework for data transfer is shown in fig. 5.
The configuration module comprises a configuration file, and is used for configuring query parameters, scheduling parameters and the like, such as a service field needing to be configured for query, or parameters such as time for initiating data transfer, whether the data transfer is successful or not, and the like.
The scheduling framework is a python-based scheduling framework that can provide an interface, such as an API scheduling interface, to an external system for scheduling data transfer tools sqoop to transfer data at a time, etc.
The data transfer module transfers the data in a relational database such as MySQL database to the HBase database through a data transfer tool sqoop.
Specifically, as shown in fig. 6, after the framework of the data transfer module obtains data from the outside through the API call interface, the configuration analysis module configures scheduling information and the like, and if the whole scheduling framework is finished, the data transfer task is finished; if the whole dispatching framework is not finished, judging whether the data transfer task is a timing task, a primary task or a monitoring task. If the task is timed, judging whether the preset time is reached, and if so, starting data transfer; if the predetermined time is not reached, a timed task is returned, which is typically daily newly added data. If the task is a disposable task, the data transfer is directly started, and the disposable task generally transfers the historical data once. The monitoring task provides a data interface for other services to schedule, if other service systems want to transfer data according to own time, the monitoring task is required to be sent through the interface provided by the monitoring task, if the monitoring task is the monitoring task, whether monitoring information is received is judged, and if the monitoring task is the monitoring task, data transfer is started. After the data transfer is finished, whether the service system is notified or not, and the service system is notified of the completion of the data transfer task. If yes, a timing task can be set to inform the service system of the completion condition of the data transfer task at regular time; if not, continuing to carry out the next data transfer task.
The frame of data query is shown in fig. 7, after the HBase data table and the index table are built, the service system can query the HBase database through remote procedure call (RPC for short), index data through the index table, when the data is indexed, a query filter and a paging filter are set according to query conditions, firstly, data table row keys meeting the conditions are filtered out from the HBase index table, then the HBase data table is queried according to the row keys, finally, the stream data format required by the service is generated and returned, and the specific data query frame is shown in fig. 7. Query application 1 and query application 2 are backup to each other.
The remote procedure call acquires a request from a client, and after the HBase query module monitors the external request, the HBase query module judges whether the request is effective or not, if not, the HBase query module continues to monitor the external request; if the data index information is effective, a query filter is generated, required data index information is searched in an HBase index table, and if the required data index information is searched, required complete data is obtained from the HBase data table; if the needed data index information is not found, returning, and continuing to monitor the external request. After the query filter is generated, the required data can also be obtained directly from the data table. After the required data is acquired, the external request is continuously monitored.
The data transfer monitoring module is used for monitoring whether data transfer is successful or whether a process hanging phenomenon occurs in the HBase database, and the HBase snapshot is used for backing up the data in the HBase database once and can be backed up to a hard disk or a USB flash disk, so that data loss caused by faults in machine storage is prevented. In the state management high availability module, when the data transfer process hangs up due to a fault, the backup data transfer process is pulled up to continue transferring, and when the conditions of hanging up of a data transfer main node, network problems, data transfer failure and the like occur, the state management high availability mechanism can continue working on another backup machine.
Before transferring the data to be transferred from the MySQL database to the HBase database, firstly, partitioning the data to be transferred, and dividing the data to be transferred into a plurality of areas for parallel transfer, so that the transfer efficiency of the data to be transferred can be improved. And meanwhile, the transferring tool is required to be expanded, so that data to be transferred are transferred according to the method in the embodiment of the application. The transfer tool used in the embodiment of the application is a Sqoop, which is a tool for opening a source, and can be used for guiding data in a relational database into a distributed database or guiding data in the distributed database into the relational database.
The method comprises the steps of expanding the Sqoop, firstly, adding a batch of parameters for controlling different execution logics in the data transfer process, such as adding expansion parameters to BaseSqoopTool class, sqoopOption class and HBaseIndortJob class, transmitting the parameters into a MapReduce task through a Configuration object, secondly, initializing the parameters according to the original parameters and the expansion parameters of the Sqoop, then realizing the PutTransformer abstract class of a class inheritance framework through a custom NewPutTransformer, and completing the custom expansion logics in the getMutationCommand interface and the mutionRecordInHBase interface of the NewPutTransformer implementation class, wherein the two interfaces are respectively used for realizing the generation logics of row keys and column families of an HBase index table and a data table. Fig. 8 is a flow chart of expanding Sqoop.
The flow of implementing the key generation logic by the getMuationCommand interface in the main flow of FIG. 8 is shown in FIG. 9, and the key generation logic is obtained from the extension parameters first, and whether the key generation logic is an HBase index table or a data table is judged, if the key generation logic is an HBase index table, all the fields of the index table are obtained, and then whether the first field is a time field is judged, if the key generation logic is a time field, the key generation logic is directly spliced according to a preset splicing sequence; if the time field is not the time field, calculating the abstract of the first field, taking the abstract as the first field, and splicing with other service fields according to a preset splicing sequence. If the data table is the HBase data table, acquiring service fields except the status field of the data table, splicing according to a preset splicing sequence, generating splicing fields, calculating the abstract for the splicing fields, and taking the abstract as a row key of the data table.
The flow of implementing column group and column generation logic by the mutationRecordInHBase interface in the main flow of FIG. 8 is shown in FIG. 10, firstly, the expansion parameters are acquired, whether the expansion parameters are index tables or data tables in the HBase is judged, if the expansion parameters are the index tables of the HBase, whether a state column exists is judged, and if the state column does not exist, all service fields are spliced according to the splicing sequence; if the state column exists, all the state fields are acquired, all the component fields of the index column are acquired, all the service fields are spliced according to the splicing sequence, the abstract is calculated for the spliced fields, and the abstract and the state column are used as column groups of the index table. If the HBase data table is, firstly, obtaining the service field of each column in the data table, and then carrying out the following steps for each column: the column name of the current column is obtained (the column name is a name configured for the column in advance, the column name may include a plurality of fields), all the fields in the current column are serialized and stored as Json format, the Json format is a field format required by the service, and in some embodiments, the Json format may be stored as other formats according to the service requirement), and the Json format is used as a column of the HBase data table.
In the HBase data storage, the HBase is divided into a plurality of storage machines Region servers, data can be randomly stored on different machines, and the hbmaster is a management machine and is divided into a Master (Master) and a Backup (Backup Master) for managing the whole HBase data storage, managing data writing, inquiring and the like. The Zookeeper can be used for storing management data, and a user can also interact with the Region Server through the Zookeeper. Taking traffic travel service data as an example, the data transfer performance and query performance conditions of the scheme and other schemes are compared, as shown in table 3:
TABLE 3 Table 3
Figure BDA0004104837800000161
For better explanation of the embodiments of the present application, a flowchart of a data transfer method provided in the embodiments of the present application is described below in connection with a specific implementation scenario, where the flow of the method may be performed by the server 102 shown in fig. 1, and includes the following steps, as shown in fig. 11:
step 1101, configuring data transfer parameters according to service requirements.
Step 1102, determining the first field in the service fields according to the configured splicing order.
Step 1103, judging whether the first field is a time field, if so, executing step 1105; if the first field is not a time field, step 1104 is performed.
Step 1104, calculate the abstract for the first field, and take the abstract as the first field.
Step 1105, splicing the service fields according to the splicing order and other service fields.
Step 1106, using the spliced field as a row key of the index table.
Step 1107, calculating the abstract after splicing the field information in the service fields according to the configured splicing sequence, and obtaining the hash field.
Step 1108, using the hash field and the status field as column groups of the index table.
Step 1109, the scheduling framework actively schedules or receives the API scheduling request, dynamically generates a scheduling command according to the configuration information, and initiates a data transfer task for the data to be transferred according to the row key and column group requirements of the index table.
Step 1110, the data transfer task obtains the data to be transferred, and partitions the data to be transferred in the MySQL database according to the data to be transferred.
And 1111, sqoop transfers the partitioned data to be transferred in MySQL to HBase in parallel.
In the embodiment of the application, the state field is transferred from the row key of the index table to the column group of the index table, the efficiency of inquiring and filtering the historical data is improved through redundant storage, the phenomenon that the data is repeated when the user inquires the historical data is avoided, and meanwhile the data transfer efficiency is improved. By calculating the abstract for the first field of the row key non-time field, the hot spot problem during data query is greatly reduced.
Based on the same technical concept, the embodiment of the present application provides a data transfer apparatus 1200, as shown in fig. 12, including:
the reading module 1201 reads data to be transferred from a relational service table of a relational database;
a transferring module 1202 for transferring the data to be transferred to a data table of a distributed database and storing index information of the data to be transferred in an index table of the distributed database;
wherein, the row key of the index table includes: the spliced field with the unique identification function is obtained by splicing at least one service field in all service fields of the relation service table; the column families of the index table include: a service hash field and at least one status field of the all service fields, the service hash field being associated with at least one of the all service fields.
Optionally, the column group of the data table includes all service fields of the relational service table, and the row key of the data table includes: a traffic hash field.
Optionally, the transferring module 1202 is specifically configured to:
splicing field information of at least one service field associated with the spliced field in the index information according to a preset sequence to obtain field information of the spliced field;
and adding the field information of the spliced field into the index table as the row key content of the index table.
Optionally, the transferring module 1202 is specifically configured to:
acquiring field information of at least one service field associated with the spliced field from the index information;
determining the abstract of the field information of the target field positioned at the first bit of the spliced field in the at least one service field by adopting an abstract algorithm;
splicing the abstract of the field information of the target field with the field information of other service fields in the at least one service field according to a preset sequence to obtain the field information of the spliced field;
and adding the field information of the spliced field into the index table as the row key content of the index table.
Optionally, the transferring module 1202 is further configured to:
and adding field information of the service field which is respectively associated with the service hash field and the status field in the index information as column family content of the index table to be stored in the index table.
Optionally, the transferring module 1202 is further configured to:
acquiring at least one service field in the whole service fields;
if the first field of the at least one service field is not a time field, calculating a summary for the first field;
and splicing the abstract with other fields in the at least one service field according to a preset sequence by taking the abstract as a beginning field, obtaining the spliced field, and taking the spliced field as a row key of the index table.
Optionally, the transferring module 1202 is further configured to:
and if the first field of the at least one service field is a time field, splicing the at least one service field according to a preset sequence to obtain the spliced field, and taking the spliced field as a row key of the index table.
Optionally, the reading module 1201 is further configured to:
acquiring at least one service field in the whole service fields;
Splicing field information in the at least one service field according to a preset sequence, and carrying out hash processing on a splicing result to obtain the service hash field;
and taking at least one status field in the business hash field and the whole business fields as a column family of the index table.
Based on the same technical concept, the embodiments of the present application provide a computer device, which may be a server, as shown in fig. 13, including at least one processor 1301 and a memory 1302 connected to the at least one processor, where specific connection media between the processor 1301 and the memory 1302 are not limited in the embodiments of the present application, and fig. 13 is an example of connection between the processor 1301 and the memory 1302 through a bus. The buses may be divided into address buses, data buses, control buses, etc.
In the embodiment of the present application, the memory 1302 stores instructions executable by the at least one processor 1301, and the at least one processor 1301 can perform the steps included in the above data transfer method by executing the instructions stored in the memory 1302.
Wherein processor 1301 is the control center of the computer device, various interfaces and lines may be utilized to connect the various portions of the computer device, and image processing may occur by executing or executing instructions stored in memory 1302 and invoking data stored in memory 1302. In the alternative, processor 1301 may include one or more processing units, and processor 1301 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1301. In some embodiments, processor 1301 and memory 1302 may be implemented on the same chip, and in some embodiments they may be implemented separately on separate chips.
Processor 1301 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, that may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The memory 1302, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory 1302 may include at least one type of storage medium, which may include, for example, flash Memory, hard disk, multimedia card, card Memory, random access Memory (Random Access Memory, RAM), static random access Memory (Static Random Access Memory, SRAM), programmable Read-Only Memory (Programmable Read Only Memory, PROM), read-Only Memory (ROM), charged erasable programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory), magnetic Memory, magnetic disk, optical disk, and the like. Memory 1302 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 1302 in the present embodiment may also be circuitry or any other device capable of implementing a memory function for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which when run on the computer device, causes the computer device to perform the steps of the above-described data transfer method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (11)

1. A method of transferring data, comprising:
reading data to be transferred from a relation service table of a relation database;
transferring the data to be transferred to a data table of a distributed database, and storing index information of the data to be transferred in an index table of the distributed database;
wherein, the row key of the index table includes: the spliced field with the unique identification function is obtained by splicing at least one service field in all service fields of the relation service table; the column families of the index table include: a service hash field and at least one status field of the all service fields, the service hash field being associated with at least one of the all service fields.
2. The method of claim 1, wherein a column group of the data table includes all of the service fields of the relational service table, and wherein a row key of the data table includes: a traffic hash field.
3. The method of claim 1, wherein storing the index information of the data to be transferred in an index table of the distributed database comprises:
Splicing field information of at least one service field associated with the spliced field in the index information according to a splicing sequence to obtain field information of the spliced field;
and adding the field information of the spliced field as the row key content of the index table into the index table to be saved.
4. The method of claim 1, wherein storing the index information of the data to be transferred in an index table of the distributed database comprises:
acquiring field information of at least one service field associated with the spliced field from the index information;
determining the abstract of the field information of the target field positioned at the first bit of the spliced field in the at least one service field by adopting an abstract algorithm;
splicing the abstract of the field information of the target field with the field information of other service fields in the at least one service field according to the splicing sequence to obtain the field information of the spliced field;
and adding the field information of the spliced field into the index table as the row key content of the index table.
5. The method of claim 3 or 4, further comprising:
And adding field information of the service field which is respectively associated with the service hash field and the status field in the index information as column family content of the index table to be stored in the index table.
6. The method of claim 1, wherein prior to reading the data to be transferred from the relational service table of the relational database, further comprising:
acquiring at least one service field in all service fields, and determining a first field in the at least one service field based on a preset splicing sequence;
if the first field is not a time field, calculating a summary for the first field;
and splicing the abstract with other fields in the at least one service field according to a preset splicing sequence by taking the abstract as a beginning field, obtaining the spliced field, and taking the spliced field as a row key of the index table.
7. The method as recited in claim 6, further comprising:
and if the first field is a time field, splicing the at least one service field according to the splicing sequence to obtain the spliced field, and taking the spliced field as a row key of the index table.
8. The method of claim 6, wherein prior to reading the data to be transferred from the relational service table of the relational database, further comprising:
acquiring at least one service field in the whole service fields;
splicing field information in the at least one service field according to a preset splicing sequence, and carrying out hash processing on a splicing result to obtain the service hash field;
and taking at least one status field in the business hash field and the whole business fields as a column family of the index table.
9. A data transfer device, comprising:
the reading module reads data to be transferred from a relation service table of the relation database;
the transfer module is used for transferring the data to be transferred to a data table of a distributed database and storing index information of the data to be transferred in the index table of the distributed database;
wherein, the row key of the index table includes: the spliced field with the unique identification function is obtained by splicing at least one service field in all service fields of the relation service table; the column families of the index table include: a service hash field and at least one status field of the all service fields, the service hash field being associated with at least one of the all service fields.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1-8 when the program is executed.
11. A computer readable storage medium, characterized in that it stores a computer program executable by a computer device, which program, when run on the computer device, causes the computer device to perform the steps of the method according to any one of claims 1-8.
CN202310188996.2A 2023-02-28 2023-02-28 Data transfer method, device, equipment and storage medium Pending CN116244381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310188996.2A CN116244381A (en) 2023-02-28 2023-02-28 Data transfer method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310188996.2A CN116244381A (en) 2023-02-28 2023-02-28 Data transfer method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116244381A true CN116244381A (en) 2023-06-09

Family

ID=86625799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310188996.2A Pending CN116244381A (en) 2023-02-28 2023-02-28 Data transfer method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116244381A (en)

Similar Documents

Publication Publication Date Title
US10949447B2 (en) Blockchain-based data synchronizing and data block parsing method and device
CN107515878B (en) Data index management method and device
CN107464151B (en) Order data processing method and device for high-concurrency service
CN105049268A (en) Distributed computing resource allocation system and task processing method
CN111258978B (en) Data storage method
CN112506870B (en) Data warehouse increment updating method and device and computer equipment
CN103106585A (en) Real-time duplication eliminating method and device of product information
CN110968603A (en) Data access method and device
CN111552701A (en) Method for determining data consistency in distributed cluster and distributed data system
CN108399175B (en) Data storage and query method and device
CN106815318B (en) Clustering method and system for time sequence database
US11120513B2 (en) Capital chain information traceability method, system, server and readable storage medium
CN116775712A (en) Method, device, electronic equipment, distributed system and storage medium for inquiring linked list
CN116244381A (en) Data transfer method, device, equipment and storage medium
CN108829709A (en) Distributed database management method, apparatus, storage medium and processor
CN113849286A (en) Account checking data importing method, system, equipment and computer readable storage medium
CN110363515B (en) Rights and interests card account information inquiry method, system, server and readable storage medium
CN112416980A (en) Data service processing method, device and equipment
CN111221847A (en) Monitoring data storage method and device and computer readable storage medium
CN112804335B (en) Data processing method, data processing device, computer readable storage medium and processor
CN109254870A (en) The method and apparatus of data backup
CN111797062B (en) Data processing method, device and distributed database system
CN115168366B (en) Data processing method, data processing device, electronic equipment and storage medium
CN112860694B (en) Service data processing method, device and equipment
CN108805741A (en) A kind of fusion method of power quality data, apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination