WO2012048555A1 - Method and device for importing data into database

Info

Publication number: WO2012048555A1
Application number: PCT/CN2011/072076
Authority: WO
Inventors: 胡丽蓉; 刘永平
Original assignee: 中兴通讯股份有限公司
Priority date: 2010-10-13
Filing date: 2011-03-23
Publication date: 2012-04-19
Also published as: CN101980187A

Abstract

A method and device for importing data into a database are disclosed. The method comprises the following steps: taking one or more data records from a data file (S202); verifying the retrieved data records (S204); and importing the successfully verified data records into the database (S206). The solution addresses the low accuracy of prior-art methods for importing data into a database and improves the accuracy of the import.

Description

TECHNICAL FIELD

The present invention relates to the field of databases, and in particular to a method and apparatus for importing data into a database.

BACKGROUND

Currently, real-time software systems in many fields generate massive amounts of backup data for subsequent statistics and verification. For example, in the telecom industry, the traffic system, the billing system, and the authentication system all generate large numbers of offline bills. These bills are imported into a database on a daily or monthly basis for reconciliation, statistics, and reporting, for users to check their bills, or for subsequent processing such as data mining. Importing this data efficiently, accurately, and flexibly therefore has significant practical value. However, the inventors have found that the data import methods of the prior art have the following problems:

1) The ordinary one-by-one import method is inefficient. For example, users in the telecommunications industry are typically counted in the hundreds of millions, so the volume of their billing data is very large. Importing it record by record inevitably takes a long time, and an import that needs several days to complete cannot meet application requirements.

2) The bulk import method is inflexible and heavily constrained. Mainstream databases currently provide a bulk import function, but using it directly raises two problems: the file format must satisfy the restrictions imposed by the database, and one or a few errors in a file may prevent the entire file from being imported.

3) Both the one-by-one and the bulk import methods have low accuracy. When the database fails, the import system fails, or some data records are abnormal, the accuracy of the import cannot be guaranteed; that is, duplicated and omitted data cannot be avoided, and an inaccurate import greatly reduces the usability of the imported data.

SUMMARY OF THE INVENTION

The main object of the present invention is to provide a data import method and apparatus that solve at least one of the above problems.

To achieve the above object, according to one aspect of the present invention, a data import method is provided, comprising: taking one or more data records from a data file; verifying the retrieved data records; and importing the successfully verified data records into the database.

The step of verifying the retrieved data records includes: determining whether the fields in each retrieved data record satisfy a preset format; if so, the data record is successfully verified; if not, the data record is saved in an error record file.

The step of importing the successfully verified data records into the database includes: assigning a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and importing the successfully verified data records to which serial numbers have been assigned into the database.

The step of importing the successfully verified data records into the database includes: importing the successfully verified data records into the database in batch mode; if the import of the current batch fails, saving the data records of the current batch, together with the serial number corresponding to each of them, in an import failure record file; and re-importing the data records saved in the import failure record file into the database one at a time, and, if that import fails, saving the data record that failed to be imported in the error record file.
The step of importing the successfully verified data records into the database includes: determining whether the data table currently used in the database satisfies a predetermined rule; if not, using the currently used data table to store the successfully verified data records; if so, using another idle data table in the database to store the successfully verified data records.

The step of using another idle data table in the database to store the successfully verified data records includes: determining whether the currently used data table is the last of a preset plurality of data tables for storing data records; if it is the last, using the first of the preset plurality of data tables to store the successfully verified data records; if it is not the last, using the data table that follows the currently used data table among the preset plurality of data tables to store the successfully verified data records.

The predetermined rule includes at least one of the following: the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time.

To achieve the above object, according to another aspect of the present invention, a data import apparatus is provided, comprising: a reading unit configured to take one or more data records from a data file; a verification unit configured to verify the retrieved data records; and an import unit configured to import the successfully verified data records into the database.
The import unit includes: an allocation module configured to assign a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and an import module configured to import the successfully verified data records to which serial numbers have been assigned into the database.

The apparatus further includes: a storage unit configured to, after the retrieved data records are verified, save the data records that failed verification in the error record file, and to save the data records that failed to be imported in the import failure record file. The import unit is further configured to re-import the data records saved in the import failure record file into the database one at a time.

The import unit further includes: a determining module configured to determine whether the data table currently used in the database satisfies a predetermined rule, wherein the predetermined rule includes at least one of the following: the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time; and a table changing module configured to use the currently used data table to store the successfully verified data records when the predetermined rule is not satisfied, and to use another idle data table in the database to store the successfully verified data records when the predetermined rule is satisfied.

Through the invention, data records are verified at import time, and the batch import mode is combined with the single-record import mode, which improves the accuracy of the import. In addition, the invention assigns a unique serial number to each data record at import time, thereby avoiding duplicate and omitted imports. Further, the target data table can be switched automatically during the import, which prevents an excessive amount of data in a single table from degrading the efficiency of queries or secondary processing.
Other features and advantages of the invention will be set forth in the description which follows. The objects and other advantages of the present invention may be realized and obtained by the structures specified in the written description, the claims, and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are provided to illustrate the invention. In the drawings: FIG. 1 is a schematic diagram of the position of a data import system in an application according to an embodiment of the present invention; FIG. 2 is a preferred flowchart of a data import method according to an embodiment of the present invention; FIG. 3 is another preferred flowchart of the data import method according to an embodiment of the present invention; FIG. 4 is a schematic diagram of a preferred structure of the data import apparatus according to an embodiment of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other provided they do not conflict.

FIG. 1 is a schematic diagram of the position of a data import system in an application according to an embodiment of the present invention. The data import system imports a large amount of offline data into data tables. The application system can query data directly from these tables, or it can define a custom database task that performs secondary processing on the data in the tables and imports the result into other tables or databases for query. Applications can migrate data from the import tables to other tables or databases as needed, and then use a DB (database) task to delete the data from the import tables. For example, in order to improve query efficiency, an application can migrate the data to another location.

Embodiment 1

FIG. 2 is a preferred flowchart of a data import method according to an embodiment of the present invention, which includes the following steps:

S202. Take one or more data records from the data file.

S204. Verify the retrieved data records.

S206. Import the successfully verified data records into the database.

Through the invention, data records are verified at import time, thereby improving the accuracy of the import.

Preferably, the step of verifying the retrieved data records comprises: determining whether the fields in each retrieved data record satisfy a preset format; if so, the data record is successfully verified; if not, the data record is saved in the error record file (Error file). For example, it is determined whether the time information field in the data record conforms to a predetermined format, such as year-month-day.

Preferably, the step of importing the successfully verified data records into the database comprises: assigning a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and importing the successfully verified data records to which serial numbers have been assigned into the database. In the preferred embodiment, assigning a unique serial number to each data record avoids duplicate and omitted imports and further improves the accuracy of the import.

Preferably, the step of importing the successfully verified data records into the database includes: importing the successfully verified data records into the database in batch mode; if the import of the current batch fails, saving the data records of the current batch, together with the serial number corresponding to each of them, in the import failure record file (Fail file); and re-importing the data records saved in the import failure record file into the database one at a time, and, if that import fails, saving the data record that failed to be imported in the error record file. In the preferred embodiment, the batch import mode ensures import efficiency, and the records that failed the batch import are additionally imported into the database, which effectively avoids the defect of correct original data records being omitted because of a system abnormality or the like, and further improves the accuracy of the import.

Preferably, the step of importing the successfully verified data records into the database further includes: determining whether the data table currently used in the database satisfies a predetermined rule; if not, using the currently used data table to store the successfully verified data records; if so, using another idle data table in the database to store the successfully verified data records. In the preferred embodiment, data records are stored in a plurality of data tables, which prevents an excessive amount of data in a single table from degrading the efficiency of queries or secondary processing.
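The batch import with single-record fallback described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes SQLite as the target database, a hypothetical two-column table layout (`sn`, `payload`), and plain Python lists standing in for the Fail file and the Error file.

```python
import sqlite3

def import_batch(conn, table, batch, fail_file):
    """Try to import a batch of (sn, payload) rows in one transaction.

    On failure the whole batch is rolled back and parked in fail_file,
    each record together with its serial number. `table` is assumed to be
    a trusted, program-defined name (it is interpolated into the SQL).
    """
    sql = f"INSERT INTO {table} (sn, payload) VALUES (?, ?)"
    try:
        with conn:  # commits on success, rolls back the batch on error
            conn.executemany(sql, batch)
        return True
    except sqlite3.Error:
        fail_file.extend(batch)
        return False

def retry_failed(conn, table, fail_file, error_file):
    """Re-import parked records one at a time; rows that still fail go to error_file."""
    sql = f"INSERT INTO {table} (sn, payload) VALUES (?, ?)"
    while fail_file:
        row = fail_file.pop(0)
        try:
            with conn:
                conn.execute(sql, row)
        except sqlite3.Error:
            error_file.append(row)
```

With this split, one bad row costs only a retry of its own batch: the good rows of that batch are recovered by `retry_failed`, and only the genuinely unimportable row ends up in the error list.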
Preferably, the step of using another idle data table in the database to store the successfully verified data records comprises: determining whether the currently used data table is the last of a preset plurality of data tables for storing data records; if it is the last, using the first of the preset plurality of data tables to store the successfully verified data records; if it is not the last, using the data table that follows the currently used data table among the preset plurality of data tables to store the successfully verified data records. In the preferred embodiment, storage space is effectively saved by recycling the data tables.

Preferably, the predetermined rule includes at least one of the following: the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time.

Embodiment 2

For convenience of description, the following notation is used in this embodiment: data file F (File), file record R (Record), data table T (Table), and serial number SN (Serial Number). The file that temporarily holds records that failed the batch import is the Fail file, and the data is stored in M data tables with the same table structure. In this embodiment, the data import process includes the following steps:

Step S1: Fetch n records from the data file F, assign a serial number to each record, and check the validity of the fields to be imported. Records that fail verification are saved to the specified Error file for future reference; records that pass verification are packed into a bulk data packet or data file.
Step S2: Import the data prepared in the previous step into the database in batch mode; the target data table is Tn.

Step S3: If the batch import fails, save the records of that batch to the Fail file (the SN corresponding to each record must be saved at the same time).

Step S4: Save the position up to which F has been processed. If the file has not been fully processed, return to step S1 and continue the import.

Step S5: The batch import of F is complete.

Step S6: Import the failed data records saved in the Fail file into the database one by one with insert statements.

Step S7: If the amount of data in Tn reaches the set value, change the target data table to Tn+1. If Tn is already the last table, the import starts again from T1.

Step S8: Process the next data file.

Preferably, with regard to data table handling, since an excessive amount of data in one data table degrades the efficiency of application queries, the preferred embodiment uses M (M>1) data tables in rotation to share the massive data. In addition, for the tables to be recycled, they must be cleaned up periodically to ensure that table Tn+1 is already empty when the import moves from table Tn to table Tn+1. How the contents of a table are cleaned up is determined by the specific application, for example by synchronizing the required information to another query table and indexing that table for applications to query.

The import system spends most of its time in the normal batch-mode import process. In this process, the original records are preprocessed, erroneous records are eliminated, and the fields required by the application are selected for batch import. At an appropriate time (set according to the specific application), the system turns to the Fail file to import the remaining data records. At this point, the data records in the Fail file that failed the batch import are inserted into the database one by one.
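The table switch of Step S7 and the recycling of the M tables can be sketched as a ring index. This is an illustrative sketch only: the function names are hypothetical, tables T1 to TM are mapped to zero-based indices 0 to M-1, and the cleanup that empties the next table before it is reused is left to the application, as the text describes.

```python
def next_table_index(current, table_count):
    """Return the index of the table after `current`, wrapping from the last table to the first."""
    return (current + 1) % table_count

def choose_table(current, rows_in_current, max_rows, table_count):
    """Stay on the current table until it reaches max_rows, then rotate to the next one."""
    if rows_in_current >= max_rows:
        return next_table_index(current, table_count)
    return current
```

For example, with three tables, `choose_table(2, 10, 10, 3)` moves from the last table back to index 0, which corresponds to wrapping from TM to T1.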
The embodiment of the present invention introduces a serial number field and prevents duplicate records by controlling the serial number, which has an obvious advantage in import rate over the usual approach of creating a primary key on the data table. Combining the two import modes ensures import efficiency while also safeguarding accuracy and preventing omissions. Because the progress of the file currently being imported is maintained, the import can resume after a system abnormality without manual processing of the massive data files; the high degree of automation reduces the burden on maintenance personnel. Of course, to ensure the accuracy of the data, when records that failed the batch import are inserted one at a time in insert mode, it is necessary to first determine whether the serial number of each record already exists in the table, which reduces efficiency to a certain extent. However, because the probability of a batch failure is small, relatively few records are imported in insert mode, so the effect on overall import performance is small.

Embodiment 3

FIG. 3 is another preferred flowchart of a data import method according to an embodiment of the present invention, which includes the following steps:

S302. Obtain the data file F from the inbound directory.

S304. Read the data record from the data file F.

S306. Preprocess the data records: organize the fields to be imported, verify them, and assign serial numbers.

S308. If preprocessing succeeds, assemble the records into a batch data block; if preprocessing fails, save the record to the Error file.

S310. Import the successfully preprocessed data records into table Tn in batch mode; preferably, the data records are imported in batch using BCP (Bulk Copy Program).

S312. If the import fails, save the data records of that batch to the Fail file.

S314. Update the inbound status file.

S316. If the data file F has not been fully processed, jump to S304 and continue reading data records; if the data file F has been fully processed, go to S318.

S318. If the Fail file contains data records that failed to be imported, use the insert method to add those data records to the data table Tn.

S320. If the amount of data in the data table Tn reaches the set value, switch to the next table for data import.

S322. Processing of F is complete, and the import status is updated.

In general, the data import method according to an embodiment of the present invention checks the original records in the data file and eliminates erroneous data rows to improve the import success rate; assigns a unique serial number to each record (this field does not need to be indexed) to ensure that data is not imported twice; imports the correct records in batch mode and, if a batch fails to be imported, saves it to a file and then re-imports it using the insert method; and switches the target data table. The method preprocesses the original records, can be used to import a variety of data files, and adapts to a variety of target databases. The batch mode and the insert mode are combined so that their advantages complement each other, meeting the requirements of fast import, real-time query, and data accuracy at the same time. Its advantages are:

1) High efficiency. The offline data files generated by the system generally contain no errors, so the batch import has a high success rate and overall import efficiency is high. The import status of the current file is saved promptly, so that when the import system restarts after a failure, the import can continue from where it stopped rather than from the beginning, which maximizes import efficiency. The target data table can be switched automatically, which prevents an excessive amount of data in a single table from degrading the efficiency of queries or secondary processing.

2) Accuracy. Each record is assigned a unique serial number to ensure that records are not imported twice; records that fail the batch import are imported in insert mode, so that correct records are not omitted.

3) Universality. Data files in multiple formats and multiple target databases are supported.

Embodiment 4

FIG. 4 is a schematic diagram of a preferred structure of a data import apparatus according to an embodiment of the present invention, comprising: a reading unit 402 configured to take one or more data records from a data file; a verification unit 404 connected to the reading unit 402 and configured to verify the retrieved data records; and an import unit 406 connected to the verification unit 404 and configured to import the successfully verified data records into the database.

Through the invention, data records are verified at import time, thereby improving the accuracy of the import.

Preferably, the step in which the verification unit 404 verifies the retrieved data records includes: determining whether the fields in each retrieved data record satisfy a preset format; if so, the data record is successfully verified; if not, the data record is saved in the error record file. For example, it is determined whether the time information field in the data record conforms to a predetermined format, such as year-month-day.
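The format check performed by the verification unit could look like the following sketch. It is only an illustration under stated assumptions: records are modelled as dictionaries, the field name `time` and the format string `%Y-%m-%d` are hypothetical choices for the year-month-day example, and the returned `failed` list stands in for the error record file.

```python
from datetime import datetime

def verify_record(record, time_field="time", time_format="%Y-%m-%d"):
    """Return True if the record's time field parses in the preset format."""
    value = record.get(time_field)
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, time_format)
    except ValueError:
        return False
    return True

def split_verified(records):
    """Separate records into (verified, failed); failed ones would go to the error record file."""
    verified, failed = [], []
    for record in records:
        (verified if verify_record(record) else failed).append(record)
    return verified, failed
```

Only the `verified` list is passed on to the import step, so a malformed record never reaches the bulk import and cannot cause a whole batch to be rejected.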
Preferably, the import unit 406 includes: an allocation module connected to the verification unit 404 and configured to assign a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and an import module connected to the allocation module and configured to import the successfully verified data records to which serial numbers have been assigned into the database. In the preferred embodiment, assigning a unique serial number to each data record avoids duplicate and omitted imports and further improves the accuracy of the import.

Preferably, the data import apparatus of the embodiment of the present invention further includes: a storage unit 408 connected to the verification unit 404 and configured to, after the retrieved data records are verified, save the data records that failed verification in the error record file, and to save the data records that failed to be imported in the import failure record file. In this case, the import unit 406 is further configured to re-import the data records saved in the import failure record file into the database one at a time.

Preferably, the step in which the import unit 406 imports the successfully verified data records into the database comprises: importing the successfully verified data records into the database in batch mode; if the import of the current batch fails, saving the data records of the current batch, together with the serial number corresponding to each of them, in the import failure record file (Fail file); and re-importing the data records saved in the import failure record file into the database one at a time, and, if that import fails, saving the data record that failed to be imported in the error record file. In the preferred embodiment, the batch import mode ensures import efficiency, and the records that failed the batch import are additionally imported into the database, which effectively avoids the defect of correct original data records being omitted because of a system abnormality or the like, and further improves the accuracy of the import.

Preferably, in the process in which the import unit 406 re-imports the data records saved in the import failure record file into the database one at a time, the verification unit 404 verifies the data records saved in the import failure record file; if the verification succeeds, the import unit 406 imports the successfully verified data records from the import failure record file into the database; if the verification fails, the data records from the import failure record file that failed verification are saved in the error record file for subsequent reference. In the preferred embodiment, the data records in the import failure record file are additionally imported into the database, which effectively avoids the defect of correct original data records being omitted because of a system abnormality, and further improves the accuracy of the import.
Preferably, in each of the above preferred embodiments, the import unit 406 further includes: a determining module configured to determine whether the data table currently used in the database satisfies a predetermined rule, wherein the predetermined rule includes at least one of the following: the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time; and a table changing module connected to the determining module and configured to use the currently used data table to store the successfully verified data records when the predetermined rule is not satisfied, and to use another idle data table in the database to store the successfully verified data records when the predetermined rule is satisfied. In the preferred embodiment, data records are stored in a plurality of data tables, which prevents an excessive amount of data in a single table from degrading the efficiency of queries or secondary processing.

Preferably, when the import unit 406 uses another idle data table in the database to store the successfully verified data records, it determines whether the currently used data table is the last of a preset plurality of data tables for storing data records; if so, it uses the first of the preset plurality of data tables to store the successfully verified data records; if not, it uses the data table that follows the currently used data table among the preset plurality of data tables to store the successfully verified data records. In the preferred embodiment, storage space is effectively saved by recycling the data tables.
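The predetermined rule evaluated by the determining module can be sketched as a single predicate. This is a hedged illustration: the function name, the parameter names, and the use of seconds since the epoch for timestamps are all assumptions, and the text allows either condition to be used alone.

```python
import time

def table_needs_switch(row_count, in_use_since, max_rows, max_age_seconds, now=None):
    """Return True if the current table exceeds the row threshold or has been in use too long.

    in_use_since and now are timestamps in seconds; now defaults to the current time.
    """
    now = time.time() if now is None else now
    too_many_rows = row_count > max_rows
    too_old = (now - in_use_since) > max_age_seconds
    return too_many_rows or too_old
```

When the predicate returns True, the table changing module would move on to another idle table; otherwise the import continues into the current one.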
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order. Those skilled in the art should understand that the above modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

The above is only the preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the scope of the present invention are intended to be included within the scope of protection of the present invention.

Claims

1. A data import method, comprising:

taking one or more data records from a data file;

verifying the retrieved data records; and

importing the successfully verified data records into a database.

2. The method according to claim 1, wherein the step of verifying the retrieved data records comprises:

determining whether the fields in each retrieved data record satisfy a preset format; if so, the data record is successfully verified; if not, the data record is saved in an error record file.

3. The method according to claim 1, wherein the step of importing the successfully verified data records into the database comprises:

assigning a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and

importing the successfully verified data records to which serial numbers have been assigned into the database.

4. The method according to claim 1, wherein the step of importing the successfully verified data records into the database comprises:

importing the successfully verified data records into the database in batch mode; if the import of the current batch fails, saving the data records of the current batch, together with the serial number corresponding to each of them, in an import failure record file; and re-importing the data records saved in the import failure record file into the database one at a time, and, if that import fails, saving the data record that failed to be imported in the error record file.

5. The method according to any one of claims 1 to 4, wherein the step of importing the successfully verified data records into the database comprises:

determining whether the data table currently used in the database satisfies a predetermined rule;

if not, using the currently used data table to store the successfully verified data records;

if so, using another idle data table in the database to store the successfully verified data records.

6. The method according to claim 5, wherein the step of using another idle data table in the database to store the successfully verified data records comprises:

determining whether the currently used data table is the last of a preset plurality of data tables for storing data records;

if the currently used data table is the last of the preset plurality of data tables, using the first of the preset plurality of data tables to store the successfully verified data records;

if the currently used data table is not the last of the preset plurality of data tables, using the data table that follows the currently used data table among the preset plurality of data tables to store the successfully verified data records.

7. The method according to claim 5, wherein the predetermined rule comprises at least one of the following:

the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time.

8. A data import apparatus, comprising:

a reading unit configured to take one or more data records from a data file;

a verification unit configured to verify the retrieved data records; and

an import unit configured to import the successfully verified data records into a database.

9. The apparatus according to claim 8, wherein the import unit comprises:

an allocation module configured to assign a serial number to each successfully verified data record, wherein the serial number corresponding to each successfully verified data record is unique in the database; and

an import module configured to import the successfully verified data records to which serial numbers have been assigned into the database.

10. The apparatus according to claim 8, further comprising:

a storage unit configured to, after the retrieved data records are verified, save the data records that failed verification in an error record file, and to save the data records that failed to be imported in an import failure record file;

wherein the import unit is further configured to re-import the data records saved in the import failure record file into the database one at a time.

11. The apparatus according to any one of claims 8 to 10, wherein the import unit further comprises:

a determining module configured to determine whether the data table currently used in the database satisfies a predetermined rule, wherein the predetermined rule includes at least one of the following: the amount of data stored in the currently used data table exceeds a predetermined threshold; the currently used data table has been in use for longer than a predetermined length of time; and

a table changing module configured to use the currently used data table to store the successfully verified data records when the predetermined rule is not satisfied, and to use another idle data table in the database to store the successfully verified data records when the predetermined rule is satisfied.