WO2016169237A1 - Data processing method and device

Info

Publication number: WO2016169237A1
Application number: PCT/CN2015/092759
Authority: WO
Inventors: 韩烨
Original assignee: 中兴通讯股份有限公司
Priority date: 2015-04-23
Filing date: 2015-10-23
Publication date: 2016-10-27
Also published as: CN106156209A

Abstract

Provided are a data processing method and device. The method comprises: receiving a data importing instruction for instructing a data import to a database; splitting the data according to the data importing instruction; and importing the split data to different storage spaces in the database in blocks. The present invention addresses the existing problem in the related art of low data importing efficiency, thus improving data importing efficiency.

Description

Data processing method and device

Technical field

The present invention relates to the field of communications, and in particular to a data processing method and apparatus.

Background

With the development of technology, databases play an increasingly important role in people's lives. In today's information society, managing and using resources fully and effectively is a prerequisite for scientific research and management decision-making. Database technology is a core component of information systems such as management information systems, office automation systems, and decision support systems, and is an important technical means for scientific research and management decision-making.

Traditional database systems generally ensure database integrity by means of high-end equipment, such as minicomputers or high-end storage, or increase database processing capacity by adding central processing units (CPUs). However, this centralized database architecture is increasingly unsuited to processing massive databases: data import efficiency is low, and the cost is high.

No effective solution has yet been proposed for the problem of low data import efficiency in the related art.

Summary of the invention

The present invention provides a data processing method and apparatus to solve at least the problem of low data import efficiency existing in the related art.

According to one aspect of the present invention, a data processing method is provided, including: receiving a data import instruction for instructing that data be imported into a database; splitting the data according to the data import instruction; and importing the split data in blocks into different storage spaces in the database.

Optionally, splitting the data according to the data import instruction includes: determining, according to the data import instruction, a table structure of a table and data distribution information of the data on the table; identifying each data row field in the data according to the table structure, the data distribution information, and descriptor information of the data carried in the data import instruction; and splitting the data according to each identified data row field.

Optionally, splitting the data according to the data import instruction includes: determining whether the data satisfies a splitting rule; if the determination result is yes, splitting the data; and if the determination result is no, correcting the data and then splitting the corrected data.

Optionally, importing the split data in blocks into different storage spaces in the database includes: downloading the split data; and importing the downloaded split data in blocks into the different storage spaces in the database.

Optionally, after the split data is imported in blocks into the different storage spaces in the database, the method further includes: deleting the downloaded split data.

Optionally, after the split data is imported in blocks into the different storage spaces in the database, the method further includes: summarizing the import result of importing the split data; and feeding back the import result.

According to another aspect of the present invention, a data processing apparatus is provided, including: a receiving module configured to receive a data import instruction for instructing that data be imported into a database; a processing module configured to split the data according to the data import instruction; and an import module configured to import the split data in blocks into different storage spaces in the database.

Optionally, the processing module includes: a determining unit configured to determine, according to the data import instruction, a table structure of a table and data distribution information of the data on the table; an identifying unit configured to identify each data row field in the data according to the table structure, the data distribution information, and descriptor information of the data carried in the data import instruction; and a first processing unit configured to split the data according to each identified data row field.

Optionally, the processing module includes: a determining unit configured to determine whether the data satisfies a splitting rule; a second processing unit configured to split the data when the determination result of the determining unit is yes; a correcting unit configured to correct the data when the determination result of the determining unit is no; and a third processing unit configured to split the corrected data.

Optionally, the import module includes: a downloading unit configured to download the split data; and an importing unit configured to import the downloaded split data in blocks into different storage spaces in the database.

Optionally, the apparatus further includes: a deleting module configured to delete the downloaded split data.

Optionally, the apparatus further includes: a summary module configured to summarize the import result of importing the split data; and a feedback module configured to feed back the import result.

According to the present invention, a data import instruction for instructing that data be imported into a database is received; the data is split according to the data import instruction; and the split data is imported in blocks into different storage spaces in the database. This solves the problem of low data import efficiency in the related art and improves data import efficiency.

Brief description of the drawings

The drawings described herein are intended to provide a further understanding of the invention and form a part of it. In the drawings:

FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 3 is a first structural block diagram of the processing module 24 in the data processing apparatus according to an embodiment of the present invention;

FIG. 4 is a second structural block diagram of the processing module 24 in the data processing apparatus according to an embodiment of the present invention;

FIG. 5 is a structural block diagram of the import module 26 in the data processing apparatus according to an embodiment of the present invention;

FIG. 6 is a first preferred structural block diagram of the data processing apparatus according to an embodiment of the present invention;

FIG. 7 is a second preferred structural block diagram of the data processing apparatus according to an embodiment of the present invention;

FIG. 8 is a structural block diagram of an import system according to an embodiment of the present invention;

FIG. 9 is a flowchart of data import processing according to an embodiment of the present invention.

Detailed description

The invention will be described in detail below with reference to the drawings in conjunction with the embodiments. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict.

It is to be understood that the terms "first", "second", and the like in the specification and claims of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.

A data processing method is provided in this embodiment. FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:

Step S102, receiving a data import instruction for instructing to import data into the database;

Step S104, performing splitting processing on the data according to the data importing instruction;

Step S106, importing the split data in blocks into different storage spaces in the database.

Through the above steps, when data is imported into the database, the data is first split, and the split data is then imported in blocks into different storage spaces of the database. Because the blocks can be imported in parallel, import efficiency improves. This solves the problem of low data import efficiency in the related art. The database may be a distributed database system, in which a highly available and highly scalable cluster is built from ordinary inexpensive equipment so that the database system no longer depends on large-scale equipment. A good distributed database architecture can readily achieve high availability and scale out. Importing and exporting large volumes of data is a key technology in distributed databases.
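Steps S102 to S106 can be illustrated with a minimal sketch. It is not the implementation described in this application: the storage spaces are modeled as in-memory lists, the round-robin split is an assumed placeholder for the real splitting policy, and the function names are hypothetical. At least one storage space is assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def split_data(rows, num_spaces):
    """S104: split rows into one block per storage space.

    Round-robin assignment is an assumption made for illustration only.
    """
    blocks = [[] for _ in range(num_spaces)]
    for i, row in enumerate(rows):
        blocks[i % num_spaces].append(row)
    return blocks


def import_block(storage_space, block):
    """S106 for one block: append the block to its storage space (a list here)."""
    storage_space.extend(block)
    return len(block)


def handle_import_instruction(rows, storage_spaces):
    """S102 to S106: take the data named by the instruction, split it,
    then import the blocks into their storage spaces in parallel."""
    blocks = split_data(rows, len(storage_spaces))
    with ThreadPoolExecutor(max_workers=len(storage_spaces)) as pool:
        # Each block targets a different storage space, so the imports
        # do not contend with each other.
        return list(pool.map(import_block, storage_spaces, blocks))


spaces = [[], [], []]
counts = handle_import_instruction(list(range(10)), spaces)
```

Each worker writes to its own storage space, which is what allows the block imports to proceed independently of one another.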

The data may be split in a number of ways. In an optional embodiment, the data may be split according to the destination table. In this case, splitting the data according to the data import instruction includes: determining, according to the data import instruction, the table structure of the table and the data distribution information of the data on the table; identifying each data row field in the data according to the table structure, the data distribution information, and the descriptor information of the data carried in the data import instruction; and splitting the data according to each identified data row field. After the data import instruction is received, its validity may first be checked, and the table structure information and distribution policy information of the destination table are then obtained. The data file is then read and, according to the table structure information and the distribution policy, split into multiple small files, each corresponding to one of the underlying databases (that is, the storage spaces described above). These files are transferred to a specified directory of each destination underlying database, after which the cluster management module instructs each underlying database to import its corresponding file.
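The table-driven split can be sketched as follows. The column list, the distribution key, the group names, and the use of a CRC32 hash as the distribution policy are all assumptions made for illustration; the application does not specify how the distribution information maps rows to underlying databases.

```python
import csv
import io
import zlib
from collections import defaultdict

# Hypothetical metadata standing in for the table structure and
# data distribution information obtained for the destination table.
TABLE_COLUMNS = ["id", "name", "city"]
DISTRIBUTION_KEY = "id"
DB_GROUPS = ["dbgroup_0", "dbgroup_1"]


def target_group(key_value, groups):
    """Map a distribution-key value to a group with a stable hash (assumed policy)."""
    return groups[zlib.crc32(key_value.encode("utf-8")) % len(groups)]


def split_file(text, delimiter=","):
    """Identify the fields of each row, then bucket rows by destination group.

    The delimiter plays the role of the data file descriptor information.
    """
    key_index = TABLE_COLUMNS.index(DISTRIBUTION_KEY)
    buckets = defaultdict(list)
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        buckets[target_group(fields[key_index], DB_GROUPS)].append(fields)
    return dict(buckets)


sample = "1,alice,paris\n2,bob,rome\n1,alice2,oslo\n"
buckets = split_file(sample)
```

Each bucket would then be written out as one small file for its underlying database. A stable hash matters here because rows with the same key must always land in the same group.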

The data can be split only if it satisfies a predetermined splitting rule, but in some cases it does not. In such cases the data needs to be corrected so that it satisfies the splitting rule. In an optional embodiment, splitting the data according to the data import instruction includes: determining whether the data satisfies the splitting rule; if the determination result is yes, splitting the data; and if the determination result is no, correcting the data and then splitting the corrected data. The data may be corrected in a number of ways: manually, by an administrator; automatically, without manual intervention, by the module that performs the splitting or by another module, according to predetermined correction rules; or by a combination of manual and automatic correction. In this way it is known promptly whether the data to be imported into the database can be split, which further improves splitting efficiency. Moreover, if the data to be imported contains errors, the erroneous rows can be extracted to ensure the correctness of the imported data.
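The check, correct, and split sequence can be sketched at row level as follows. The rule (field count and type conversion) and the correction (dropping empty trailing fields left by a trailing delimiter) are hypothetical examples; the application leaves both the splitting rule and the correction rules open.

```python
def satisfies_rule(fields, column_types):
    """Assumed rule: the row has one field per column and each field
    converts to its column type."""
    if len(fields) != len(column_types):
        return False
    for value, col_type in zip(fields, column_types):
        try:
            col_type(value)
        except ValueError:
            return False
    return True


def correct(fields, column_types):
    """Hypothetical automatic correction: drop empty trailing fields."""
    fixed = list(fields)
    while len(fixed) > len(column_types) and fixed[-1] == "":
        fixed.pop()
    return fixed


def prepare_rows(rows, column_types):
    """Return (rows ready to split, rows set aside as errors)."""
    good, errors = [], []
    for fields in rows:
        if not satisfies_rule(fields, column_types):
            fields = correct(fields, column_types)
        if satisfies_rule(fields, column_types):
            good.append(fields)
        else:
            errors.append(fields)  # would be written to an error file
    return good, errors


rows = [["1", "alice"], ["2", "bob", ""], ["x", "carol"]]
good, errors = prepare_rows(rows, (int, str))
```

Rows that still fail after correction are kept apart rather than dropped silently, so they can be reported alongside the import result.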

In an optional embodiment, when the split data is imported in blocks into different storage spaces in the database, the split data may first be downloaded, and the downloaded split data is then imported in blocks into the different storage spaces in the database.

After the data has been imported into the database, the downloaded data need not be retained. In an optional embodiment, after the split data is imported in blocks into different storage spaces in the database, the method further includes: deleting the downloaded split data. This clears the garbage data files and frees the space they occupy, allowing the database to store more data.
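The download, import, and delete sequence can be sketched as follows. A local file copy stands in for the download, and `import_fn` is a hypothetical callback representing the actual import into a storage space; neither is taken from the application.

```python
import shutil
import tempfile
from pathlib import Path


def download_import_delete(source_file, import_fn):
    """Copy a split file to a working directory, import it, then delete the copy.

    The copy stands in for the download step; the cleanup runs even if
    the import raises, so no downloaded file is left behind.
    """
    work_dir = Path(tempfile.mkdtemp(prefix="import_"))
    local_copy = work_dir / Path(source_file).name
    try:
        shutil.copyfile(source_file, local_copy)     # download the split data
        return import_fn(local_copy)                 # import into the storage space
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)  # delete the downloaded data
```

Placing the deletion in a `finally` clause is a design choice of this sketch: it also removes the downloaded copy when the import fails. The application itself only describes deletion after a successful import.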

After the split data is imported in blocks into different storage spaces in the database, the import result may also be fed back. In an optional embodiment, the method further includes: summarizing the import result of importing the split data; and feeding back the import result. This allows the user to see the import result clearly.
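Summarizing the per-storage-space results into a single report might look like the sketch below. The `ImportResult` fields and the shape of the summary are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImportResult:
    """Hypothetical outcome reported for one storage space."""
    storage_space: str
    rows_imported: int
    ok: bool
    message: str = ""


def summarize(results):
    """Collapse per-storage-space results into one report for the requester."""
    failed = [r.storage_space for r in results if not r.ok]
    return {
        "success": not failed,
        "total_rows": sum(r.rows_imported for r in results if r.ok),
        "failed_spaces": failed,
    }


report = summarize([
    ImportResult("dbgroup_0", 120, True),
    ImportResult("dbgroup_1", 0, False, "file not found"),
])
```

The report is what would be fed back to the party that issued the import instruction.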

From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present invention.

This embodiment also provides a data processing apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 2 is a structural block diagram of a data processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a receiving module 22, a processing module 24, and an import module 26. The apparatus is described below.

The receiving module 22 is configured to receive a data import instruction for instructing that data be imported into a database. The processing module 24 is connected to the receiving module 22 and is configured to split the data according to the data import instruction. The import module 26 is connected to the processing module 24 and is configured to import the split data in blocks into different storage spaces in the database.

FIG. 3 is a first structural block diagram of the processing module 24 in the data processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the processing module 24 includes a determining unit 32, an identifying unit 34, and a first processing unit 36. The processing module 24 is described below.

The determining unit 32 is configured to determine, according to the data import instruction, the table structure of the table and the data distribution information of the data on the table. The identifying unit 34 is connected to the determining unit 32 and is configured to identify each data row field in the data according to the table structure, the data distribution information, and the descriptor information of the data carried in the data import instruction. The first processing unit 36 is connected to the identifying unit 34 and is configured to split the data according to each identified data row field.

FIG. 4 is a second structural block diagram of the processing module 24 in the data processing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the processing module 24 includes a determining unit 42, a second processing unit 44, a correcting unit 46, and a third processing unit 48. The processing module 24 is described below.

The determining unit 42 is configured to determine whether the data satisfies the splitting rule. The second processing unit 44 is connected to the determining unit 42 and is configured to split the data when the determination result of the determining unit 42 is yes. The correcting unit 46 is connected to the determining unit 42 and is configured to correct the data when the determination result of the determining unit 42 is no. The third processing unit 48 is connected to the correcting unit 46 and is configured to split the corrected data.

FIG. 5 is a structural block diagram of the import module 26 in the data processing apparatus according to an embodiment of the present invention. As shown in FIG. 5, the import module 26 includes a download unit 52 and an import unit 54. The import module 26 is described below.

The download unit 52 is configured to download the split data. The import unit 54 is connected to the download unit 52 and is configured to import the downloaded split data in blocks into different storage spaces in the database.

FIG. 6 is a first preferred structural block diagram of the data processing apparatus according to an embodiment of the present invention. As shown in FIG. 6, the apparatus includes a deletion module 62 in addition to all the modules of the apparatus described above. The apparatus is described below.

The deletion module 62 is connected to the import module 26 and is configured to delete the downloaded split data.

FIG. 7 is a second preferred structural block diagram of the data processing apparatus according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes a summary module 72 and a feedback module 74 in addition to all the modules of the apparatus described above. The apparatus is described below.

The summary module 72 is connected to the import module 26 and is configured to summarize the import result of importing the split data. The feedback module 74 is connected to the summary module 72 and is configured to feed back the import result.

The invention will be further described below in conjunction with specific embodiments.

As can be seen from the foregoing, the existing solutions in the related art all operate on a traditional single database and impose no requirements on table efficiency or on the system architecture. The solution in the embodiment of the present invention is based on a distributed database system, satisfies the atomicity, consistency, isolation, and durability (ACID) properties of the database, and can be executed concurrently. Import and export are performed using shell scripts, which offers good real-time performance, portability, and feasibility, as well as a good user experience, and represents a significant innovation over the existing technology.

FIG. 8 is a structural block diagram of an import system according to an embodiment of the present invention. As shown in FIG. 8, the system includes: a data import client module 82 (this module may be located between an external system and the data import server module, or within the external system; it is not shown in FIG. 8); a data import server module 84 (corresponding to the download server 84 in FIG. 8, and equivalent to the receiving module 22, the processing module 24, and the import module 26 described above); a metadata center module 86 (corresponding to the metadata server 86 in FIG. 8); a cluster management center module 88 (corresponding to the cluster manager 88 in FIG. 8, and equivalent to the summary module 72 and the feedback module 74 described above); a database agent module 810 (equivalent to the deletion module 62 described above); and a database module 812. Each module is described below.

The data import client module 82 (LoadClient) is user-facing; the user initiates import and export commands through this module.

The data import server module 84 (LoadServer) is configured to accept the import and export commands sent by the client, split and merge the data files according to the data distribution policy, and interact with other modules to coordinate the entire import and export process.

The metadata center module 86 is arranged to store and manage all metadata information for the entire distributed database system.

The cluster management center module 88 is mainly responsible for monitoring, managing, and maintaining various database clusters (DBClusters).

The database agent module 810 is a database node management monitoring module. It is responsible for real-time monitoring of the running status of the DB nodes under its jurisdiction, and periodically collects running statistics.

Database module 812 is the underlying module that holds all data.

The core algorithm that can be implemented with the import system shown in FIG. 8 is as follows.

For the import process:

The data import server module 84 queries the metadata center module 86 for the metadata information of the table according to the cluster ID, the database name, and the table name, in order to obtain the table structure definition and the data distribution information;

The data import server module 84 uses the obtained information (together with the data file descriptor information) to identify each data row field in the data file (datafilename) and splits the data file;

The data import server module 84 requests the cluster management center module 88 to notify each database agent module 810 to download the split files for the DBGroup it manages;

After every database agent module 810 has downloaded successfully, the data import server module 84 requests the cluster management center module 88 to notify each database agent module 810 to execute the actual LOAD DATA INFILE command;

After the LOAD DATA INFILE command has executed successfully, the data import server module 84 requests the cluster management center module 88 to notify each database agent module 810 to delete the garbage data files (here, the garbage data files may be the downloaded data that has already been loaded);

The data import server module 84 summarizes the results and notifies the data import client module 82.
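For the LOAD DATA INFILE step above, each agent needs a statement naming its downloaded split file and the destination table. The helper below is a hypothetical sketch that only builds the statement text using standard MySQL `LOAD DATA` clauses; the field and line terminators are assumed defaults, since the application does not state which options are used.

```python
def build_load_statement(file_path, table, field_sep=",", line_sep="\\n"):
    """Build a MySQL LOAD DATA INFILE statement for one split file.

    The terminators are assumptions; in practice they would come from
    the data file descriptor information.
    """
    # Escape backslashes and single quotes for the SQL string literal.
    path = file_path.replace("\\", "\\\\").replace("'", "\\'")
    # Escape backticks in the identifier by doubling them.
    name = table.replace("`", "``")
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE `{name}` "
        f"FIELDS TERMINATED BY '{field_sep}' "
        f"LINES TERMINATED BY '{line_sep}'"
    )


statement = build_load_statement("/tmp/part_0.csv", "orders")
```

Because every agent loads a different file into its own database, the statements can be executed concurrently across the cluster.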

FIG. 9 is a flowchart of data import processing according to an embodiment of the present invention. As shown in FIG. 9, the flow includes the following steps:

Step S902, the data import client module 82 sends an import data request to the data import server module 84;

Step S904, the data import server module 84 sends a database metadata query request to the metadata center module 86 according to the cluster ID, the database name, and the table name; the request is used to query the metadata information of the table;

Step S906, the metadata center module 86 returns the table structure definition and the data distribution information, including the type and length of each field of the table, the distribution key, and the DBGroups over which the data is distributed;

Step S908, the data import server module 84 uses the information returned by the metadata center module 86 (together with the data file descriptor information) to identify each data row field in the data file (datafilename) and splits the data file. If erroneous data is found during splitting, for example data whose type does not match the table definition, the erroneous rows are extracted and written to an error file;

Step S910, the data import server module 84 requests the cluster management center module 88 to notify each database agent module 810 to download the split files for the DBGroup it manages;

Step S912, the cluster management center module 88 notifies each database agent module 810 to download the split files for the DBGroup it manages;

Step S914, each database agent module 810 uses an FTP service to connect to the data import server module 84 and download the corresponding split file; after downloading the file successfully, each database agent module 810 returns a response to the cluster management center module 88;

Step S916, the cluster management center module 88 summarizes the download results;

Step S918, after receiving success responses from all the database agent modules 810, the cluster management center module 88 returns a success response to the data import server module 84;

Step S920, after the download has succeeded, the data import server module 84 requests the cluster management center module 88 to notify each database agent module 810 to execute the actual LOAD DATA INFILE command;

Step S922, the cluster management center module 88 notifies each database agent module 810 to execute the actual LOAD DATA INFILE command;

Step S924, each database agent module 810 connects to the database module it manages and executes the actual LOAD DATA INFILE command; after executing the command successfully, each database agent module 810 returns a success response to the cluster management center module 88;

Step S926, after receiving success responses from all the database agent modules 810, the cluster management center module 88 returns a success response to the data import server module 84; after the LOAD DATA INFILE command has executed successfully, the data import server module 84 again requests the cluster management center module 88 to notify each database agent module 810 to delete the garbage data files; the data import server module 84 then summarizes the results and notifies the data import client module 82.
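The flow above proceeds in phases (download, load, delete), and the cluster management center reports success for a phase only after every agent has responded successfully. A minimal sketch of that coordination follows. The `Agent` class, the command names, and the stop-at-first-failed-phase behavior are assumptions; the application does not describe what happens when an agent fails.

```python
from concurrent.futures import ThreadPoolExecutor


class Agent:
    """Hypothetical stand-in for a database agent; records commands it receives."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on
        self.log = []

    def handle(self, command):
        self.log.append(command)
        return command != self.fail_on  # True means a success response


def broadcast(agents, command):
    """Notify every agent in parallel; succeed only if all agents succeed."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        responses = list(pool.map(lambda agent: agent.handle(command), agents))
    return all(responses)


def run_import(agents):
    """Run the phases in order, stopping at the first phase any agent fails."""
    for phase in ("download", "load", "delete"):
        if not broadcast(agents, phase):
            return f"failed during {phase}"
    return "success"


healthy = [Agent("agent_0"), Agent("agent_1")]
outcome = run_import(healthy)
```

Waiting for all agents before starting the next phase keeps the underlying databases in step: no agent begins loading until every split file has been downloaded.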

The solution in the above embodiment is based on a distributed database system and can import all data types supported by the MySQL database; it can of course also support other types of data. Applying the solution in the embodiment of the present invention to a distributed database system can increase concurrency by a factor of 2 to 3, balance the load, and ensure the correctness of imported and exported data, and the system is robust.

It should be noted that each of the above modules may be implemented by software or hardware. In the latter case, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are respectively located in multiple processors.

Embodiments of the present invention also provide a storage medium. Optionally, in the embodiment, the foregoing storage medium may be configured to store program code for performing the following steps:

S1, receiving a data import instruction for instructing to import data into the database;

S2, splitting the data according to the data import instruction;

S3, importing the split data in blocks into different storage spaces in the database.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above can be implemented by a general-purpose computing device, and can be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that given herein; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

The above description is only the preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes can be made to it. Any modification, equivalent substitution, or improvement made within the spirit and scope of the present invention is intended to be included within the scope of the present invention.

Industrial applicability

As described above, the data processing method and apparatus provided by the embodiments of the present invention have the following beneficial effects: the problem of low data import efficiency existing in the related art is solved, and the effect of improving data import efficiency is achieved.

Claims

1. A data processing method, comprising:

receiving a data import instruction for instructing that data be imported into a database;

splitting the data according to the data import instruction; and

importing the split data in blocks into different storage spaces in the database.

2. The method of claim 1, wherein splitting the data according to the data import instruction comprises:

determining, according to the data import instruction, a table structure of a table and data distribution information of the data on the table;

identifying each data row field in the data according to the table structure, the data distribution information, and descriptor information of the data carried in the data import instruction; and

splitting the data according to each identified data row field.

3. The method of claim 1, wherein splitting the data according to the data import instruction comprises:

determining whether the data satisfies a splitting rule;

when the determination result is yes, splitting the data; and

when the determination result is no, correcting the data, and splitting the corrected data.

4. The method of claim 1, wherein importing the split data in blocks into different storage spaces in the database comprises:

downloading the split data; and

importing the downloaded split data in blocks into different storage spaces in the database.

5. The method of claim 4, wherein after the split data is imported in blocks into different storage spaces in the database, the method further comprises:

deleting the downloaded split data.

6. The method of claim 1, wherein after the split data is imported in blocks into different storage spaces in the database, the method further comprises:

summarizing the import result of importing the split data; and

feeding back the import result.

7. A data processing apparatus, comprising:

a receiving module, configured to receive a data import instruction for instructing that data be imported into a database;

a processing module, configured to split the data according to the data import instruction; and

an import module, configured to import the split data in blocks into different storage spaces in the database.

8. The apparatus of claim 7, wherein the processing module comprises:

a determining unit, configured to determine, according to the data import instruction, a table structure of a table and data distribution information of the data on the table;

an identifying unit, configured to identify each data row field in the data according to the table structure, the data distribution information, and descriptor information of the data carried in the data import instruction; and

a first processing unit, configured to split the data according to each identified data row field.

9. The apparatus of claim 7, wherein the processing module comprises:

a determining unit, configured to determine whether the data satisfies a splitting rule;

a second processing unit, configured to split the data when the determination result of the determining unit is yes;

a correcting unit, configured to correct the data when the determination result of the determining unit is no; and

a third processing unit, configured to split the corrected data.

10. The apparatus of claim 7, wherein the import module comprises:

a download unit, configured to download the split data; and

an import unit, configured to import the downloaded split data in blocks into different storage spaces in the database.

11. The apparatus of claim 10, further comprising:

a deletion module, configured to delete the downloaded split data.

12. The apparatus of claim 7, further comprising:

a summary module, configured to summarize the import result of importing the split data; and

a feedback module, configured to feed back the import result.