WO2013066010A1 - Method for pre-loading in memory and method for parallel processing for high-volume batch processing

Info

Publication number: WO2013066010A1
Application number: PCT/KR2012/008913
Authority: WO
Inventors: 채조욱; 정호철; 박수용; 김경희; 박진철; 최종건; 곽송해; 황민정
Original assignee: 에스케이씨앤씨 주식회사
Priority date: 2011-10-31
Filing date: 2012-10-29
Publication date: 2013-05-10
Also published as: KR20130047431A

Abstract

The present invention relates to a method for batch processing high volumes of information based on a relational database. The method queries and processes second comparative data from a second database and pre-loads it into a memory; queries, processes, and extracts first comparative data for a specific identifier from a first database; retrieves the second comparative data for the specific identifier from the memory; cross-checks the first comparative data against the second comparative data for the specific identifier and outputs the cross-checking result; and then frees memory by deleting from the memory the first and second comparative data for the specific identifier that have been cross-checked.

Description

Method for pre-loading in memory and method for parallel processing for high-volume batch processing

The present invention relates to a method of batch processing a large amount of information, and more particularly, to an in-memory pre-loading method and a parallel processing method for high-volume batch processing that select and process a large amount of data at a time based on a relational database.

Conventionally, file-based systems for batch processing large amounts of information are built on sequential access method (SAM) processing to cross-check large amounts of information in batches.

When large-scale work is performed with such a file-based system, disk I/O contention among the tasks degrades overall system performance.

In addition, a large number of unnecessary intermediate work files are generated during cross-check processing, consuming disk and central processing unit (CPU) resources.

In addition, maintainability is low because the cross-check logic and the input/output logic are not easily separated. At run time, large queries are generated for each record read from the file, which hinders performance, and additional solutions are required for tasks such as file sorting, splitting, and merging.

The present invention has been made in view of the above-described prior art, and its object is to provide a method, based on a relational database storing large amounts of information, for selecting and batch processing a large amount of data at a time.

In addition, the present invention provides an in-memory pre-loading method and a parallel processing method for high-volume batch processing in which the first and second comparison information required for the cross-check operation is divided into sections and loaded into memory in bulk.

In addition, the present invention provides an in-memory pre-loading method and a parallel processing method for high-volume batch processing in which the input data is divided into sections based on a key value, the work is distributed in parallel, and the distributed input data is processed sequentially, section by section, within each parallel job.

According to an embodiment of the present invention, an in-memory pre-loading method and a parallel processing method for batch processing a large amount of information include: querying and processing second comparison information from a second database and loading it into a memory; querying, processing, and extracting first comparison information of a specific identifier from a first database, and retrieving the second comparison information of the specific identifier from the memory; cross-checking the first comparison information and the second comparison information of the specific identifier and outputting a cross-check result; and, after the cross-check, deleting from the memory the first comparison information and the second comparison information of the specific identifier on which the cross-check was performed.

The loading of the second comparison information may include loading classification code information into the memory in advance, dividing the second comparison information of the second database into one or more sections, and loading the second comparison information of each divided section into the memory as a hash table.

In addition, before the first comparison information of the specific identifier is loaded, various related information, such as the information center, the tax calendar, classification codes, and corporation/closed-business information, is pre-loaded into the memory in the form of a hash table.

The memory may be implemented as a heap memory.

In addition, the specific identifier is identification information that can distinguish each entity, such as a business operator ID.

The cross-check result outputting step may include outputting a first comparison information mismatch file, a second comparison information mismatch file, a first comparison information update file, and a second comparison information update file.

The method may further include inserting the first comparison information mismatch file and the second comparison information mismatch file into a database.

The method may further include updating the first and second databases based on the first comparison information update file and the second comparison information update file.

In addition, when the cross-check is performed, sections are divided based on the specific identifier and the same task is distributed in parallel, and within each parallel task the distributed input data is divided into sections and processed sequentially.

As described above, the present invention builds high-volume batch processing on a relational database so that data is selected and processed in large quantities at a time, thereby reducing the number of database calls.

In addition, because the present invention divides the information required for the cross-check operation into sections and loads it into memory, batch processing can be performed without generating intermediate files, and the processing can also be provided as an online service.

In addition, the present invention requires no file sorting, merging, or splitting, so resource contention need not be considered. Accordingly, the present invention can improve performance, and business logic can be cleanly separated from SQL statements, improving maintainability.

In addition, when batch processing a large amount of data, the present invention divides the input data into sections based on its key value and distributes the tasks in parallel, and the distributed input data is processed sequentially, section by section, within each parallel task. This can improve throughput.

FIG. 1 is a block diagram showing a high-volume information batch processing system according to the present invention.

FIG. 2 is a flowchart showing a high-volume information batch processing method according to the present invention.

FIG. 3 is a diagram illustrating a process of batch processing a large amount of information using parallel processing according to the present invention.

FIG. 4 illustrates the structure of the hash processing logic and the hash table according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

As shown in FIG. 1, the high-volume information batch processing system according to the present invention includes a first database (hereinafter, first DB) 10 storing a large amount of first comparison information, a second database (hereinafter, second DB) 20 storing a large amount of second comparison information, and a cross-check processing device 30 that cross-checks the first comparison information against the second comparison information.

The first DB 10 and the second DB 20 are implemented based on a relational database (RDB) that expresses data (first and second comparison information) in a simple table form.

When the high-volume information batch processing system according to the present invention is applied to tax-processing work at a government office, the first comparison information may correspond to report information and include a business operator ID and a tax amount due. The second comparison information may correspond to payment information, includes a business operator ID, an account number, and a payment amount, and is provided by a bank. The first comparison information of the first DB 10 is provided directly by the taxpayer through a computer made available by the government office (for example, a tax authority such as the National Tax Service) or by a subordinate office such as a tax office.
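The two record types in the tax example can be pictured as follows. This is only an illustrative sketch: the class and field names (`ReportRecord`, `PaymentRecord`, `tax_due`, `amount_paid`) and the account number are assumptions, not names taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRecord:
    """First comparison information: what the taxpayer reported (hypothetical fields)."""
    business_id: str
    tax_due: int


@dataclass(frozen=True)
class PaymentRecord:
    """Second comparison information: what the bank says was paid (hypothetical fields)."""
    business_id: str
    account_number: str
    amount_paid: int


report = ReportRecord("1102401444", 50_000)
payment = PaymentRecord("1102401444", "000-00-000000", 50_000)  # made-up account number

# Both records carry the business operator ID, which serves as the cross-check key.
same_entity = report.business_id == payment.business_id
```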

The cross-check processing device 30 includes a memory 31 that temporarily stores the first and second comparison information used for the cross-check, and a processor 32 that cross-checks the first and second comparison information stored in the memory 31.

In the memory 31, the first comparison information and the second comparison information extracted from the first DB 10 and the second DB 20 are loaded in the form of a hash table. The memory 31 also includes a heap memory 31a, in which small code-type data resides. FIG. 4 illustrates the structure of the hash processing logic and the hash table according to the present invention.

The heap memory 31a is pre-loaded with various related information needed to query or process the first comparison information. When the system according to the present invention is applied to tax-processing work at a government office, related information such as classification codes, payment information, the information center, the tax calendar, and corporation/closed-business information is pre-loaded into the heap memory 31a. The corporation/closed-business information is reloaded section by section.

The processor 32 queries and processes the second comparison information from the second DB 20 and loads it into the heap memory 31a. The processor 32 also queries, processes, and extracts the first comparison information of a specific identifier from the first DB 10, and searches the second comparison information loaded in the heap memory 31a for the second comparison information of that identifier. The processor 32 then cross-checks the extracted first comparison information against the retrieved second comparison information to determine the state (e.g., payment status) of the corresponding entity. The state of the entity includes normal payment, overpayment, nonpayment, mispayment, and the like.

In addition, the processor 32 outputs four kinds of cross-check results. The processor 32 inserts the cross-check results into a first/second comparison information mismatch DB, or updates the first and second DBs to reflect them. The cross-check results include a first comparison information mismatch file, a second comparison information mismatch file, a first comparison information update file, and a second comparison information update file. The cross-check results may also serve as supporting evidence and as a means of confirming the outcome of the high-volume information processing.

After cross-checking the extracted first comparison information against the retrieved second comparison information, the processor 32 deletes the completed first comparison information from the memory 31 and removes the retrieved second comparison information from the heap memory 31a, thereby freeing storage space in the memory 31.

FIG. 2 is a flowchart illustrating a method of batch processing a large amount of information according to the present invention.

According to FIG. 2, the processor 32 queries and processes the second comparison information from the second DB 20 and loads it into the heap memory 31a of the memory 31 (S101). The processor 32 does not load all of the second comparison information stored in the second DB 20 into the memory 31 at once; it divides the information into one or more sections and loads it into the memory 31 section by section. The second comparison information is loaded into the heap memory 31a in the form of a hash table. For example, when the second DB 20 stores 220,000 items of second comparison information, the 220,000 items are divided into ten sections and loaded into the memory 31 one section at a time.
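Section-wise loading (S101) might look like the following minimal sketch. It assumes an in-memory SQLite table as a stand-in for the second DB, a hypothetical `payment` table layout, and contiguous key ranges as the section boundaries; the patent does not specify how sections are delimited.

```python
import sqlite3
from collections import defaultdict


def section_bounds(sorted_ids, n_sections):
    """Split sorted IDs into at most n_sections contiguous (low, high) key ranges."""
    if not sorted_ids:
        return []
    size = -(-len(sorted_ids) // n_sections)  # ceiling division
    return [(sorted_ids[i], sorted_ids[min(i + size, len(sorted_ids)) - 1])
            for i in range(0, len(sorted_ids), size)]


def load_section(conn, low, high):
    """Load one section of payment rows into a hash table keyed by business ID."""
    table = defaultdict(list)
    rows = conn.execute(
        "SELECT business_id, account_number, amount_paid FROM payment "
        "WHERE business_id BETWEEN ? AND ?", (low, high))
    for business_id, account, amount in rows:
        table[business_id].append((account, amount))
    return table


conn = sqlite3.connect(":memory:")  # stand-in for the second DB (assumption)
conn.execute("CREATE TABLE payment (business_id TEXT, account_number TEXT, amount_paid INTEGER)")
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)",
                 [(f"{i:010d}", f"acct-{i}", 100 * i) for i in range(1, 21)])

ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT business_id FROM payment ORDER BY business_id")]
bounds = section_bounds(ids, 4)

section_sizes = []
for low, high in bounds:
    table = load_section(conn, low, high)  # only one section is held in memory at a time
    section_sizes.append(len(table))
```

Keeping only the current section's hash table alive is what bounds the memory footprint; the number of sections trades memory use against the number of load queries.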

Before querying the second comparison information, the processor 32 loads predetermined classification information (e.g., classification code information) into the heap memory 31a in advance and sets a memory limit for buffering.

The processor 32 queries, processes, and extracts the first comparison information of the specific identifier from the first DB 10 (S102). The specific identifier is identification information that can distinguish each entity, such as a business operator ID. For example, the processor 32 extracts from the first DB 10 the first comparison information whose business operator ID is '1102401444' and generates a cross-check input file (e.g., a hash table).

In addition, before extracting the first comparison information of the specific identifier from the first DB 10, the processor 32 pre-loads various related information (e.g., the information center, the tax calendar, classification codes, and corporation/closed-business information) into the memory 31 as a hash table.

The processor 32 searches the heap memory 31a for the second comparison information of the specific identifier (S103). For example, the processor 32 searches the second comparison information stored in the heap memory 31a for entries whose business operator ID is '1102401444'.

The processor 32 cross-checks the first comparison information of the specific identifier extracted from the first DB 10 against the second comparison information of the specific identifier retrieved from the heap memory 31a (S104). The processor 32 performs the cross-check in a multi-loop manner driven by the first comparison information, checking the second comparison information one item at a time for each item of first comparison information.
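A minimal sketch of the multi-loop cross-check (S104) follows. The classification rules are assumptions made for illustration, in particular the reading of "mispayment" as a payment smaller than the amount due; the patent names the four states but does not define them. All business operator IDs other than '1102401444' are made up.

```python
def cross_check(reports, payments_by_id):
    """Classify each reported business ID by comparing tax due with payments in memory."""
    results = []
    for business_id, tax_due in reports:                # outer loop: first comparison information
        payments = payments_by_id.get(business_id, [])  # hash lookup in the pre-loaded table
        total_paid = 0
        for _account, amount in payments:               # inner loop: second comparison information
            total_paid += amount
        if not payments:
            status = "nonpayment"
        elif total_paid == tax_due:
            status = "normal payment"
        elif total_paid > tax_due:
            status = "overpayment"
        else:
            status = "mispayment"  # assumed meaning: paid, but less than the amount due
        results.append((business_id, status))
    return results


payments_by_id = {
    "1102401444": [("acct-1", 30_000), ("acct-2", 20_000)],
    "2203502555": [("acct-3", 70_000)],
    "3304603666": [("acct-4", 10_000)],
}
reports = [("1102401444", 50_000), ("2203502555", 50_000),
           ("3304603666", 50_000), ("4405704777", 50_000)]
statuses = dict(cross_check(reports, payments_by_id))
```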

The processor 32 generates the cross-check result as files and outputs them (S105).

The cross-check result is output as a first comparison information mismatch file, a second comparison information mismatch file, a first comparison information update file, and a second comparison information update file.
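Fanning results out into four files (S105) can be sketched as below. The file names, the CSV format, and the row layout are all hypothetical; which result belongs in which file is decided by the caller here, since the patent does not spell out the routing rules.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical file names for the four kinds of output.
FILES = {
    "first_mismatch": "first_mismatch.csv",
    "second_mismatch": "second_mismatch.csv",
    "first_update": "first_update.csv",
    "second_update": "second_update.csv",
}


def write_results(rows, out_dir):
    """Write (kind, business_id, detail) rows to the file matching each kind."""
    out_dir = Path(out_dir)
    counts = {kind: 0 for kind in FILES}
    handles = {kind: open(out_dir / name, "w", newline="") for kind, name in FILES.items()}
    try:
        writers = {kind: csv.writer(handle) for kind, handle in handles.items()}
        for kind, business_id, detail in rows:
            writers[kind].writerow([business_id, detail])
            counts[kind] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts


with tempfile.TemporaryDirectory() as tmp:
    counts = write_results([
        ("first_update", "1102401444", "normal payment"),
        ("second_update", "1102401444", "normal payment"),
        ("first_mismatch", "4405704777", "nonpayment"),
    ], tmp)
    written = sorted(path.name for path in Path(tmp).iterdir())
```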

After the cross-check, the processor 32 frees memory by deleting from the memory 31 the first comparison information and the second comparison information of the specific identifier on which the cross-check was performed (S106). For example, the processor 32 removes from the memory 31 the first comparison information and the second comparison information whose business operator ID is '1102401444'.
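Releasing finished entries (S106) can be sketched with a plain dictionary. The helper name `release` is hypothetical, and treating the entries left over at the end as unmatched payments is an assumption about how the remaining data might be used, not something the patent states.

```python
def release(table, business_id):
    """Drop a finished business ID from the in-memory table; True if it was present."""
    return table.pop(business_id, None) is not None


payments_by_id = {
    "1102401444": [("acct-1", 50_000)],
    "2203502555": [("acct-2", 10_000)],  # made-up ID with no matching report
}
reported_ids = ["1102401444"]  # IDs present in the first comparison information

for business_id in reported_ids:
    _payments = payments_by_id.get(business_id, [])  # the cross-check would use these
    release(payments_by_id, business_id)             # free the entry once it is done

# Entries never claimed by any report remain in the table (assumed to be unmatched payments).
unmatched = sorted(payments_by_id)
```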

The processor 32 then performs additional logic to maximize performance, as shown in FIG. After performing the cross-check, the processor 32 deletes the completed hash table record. A hash table lookup is approximately O(1), much faster than the O(log n) lookup of a DBMS. Deleting completed records from the hash table may improve the lookup speed further, up to O(1/2).
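The gap between the two lookup costs can be illustrated with a small sketch. A binary search over a sorted list stands in for an index lookup here, which is an assumption made for illustration; a real DBMS index behaves differently in detail, but the logarithmic growth in steps is the same idea.

```python
def binary_search_steps(sorted_keys, target):
    """Return (index, number of probes) for a binary search; index is -1 if absent."""
    lo, hi, steps = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


keys = [f"{i:010d}" for i in range(220_000)]  # 220,000 keys, as in the example above
index, steps = binary_search_steps(keys, "0000219999")

table = {key: position for position, key in enumerate(keys)}
hit = table["0000219999"]  # a single hash lookup on average, regardless of table size
```

For 220,000 keys the binary search needs at most 18 probes, while the hash lookup cost stays roughly constant as the table grows.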

In the above-described embodiment, the second comparison information is divided into a plurality of sections, and the cross-check is performed separately on the second comparison information of each section. However, the cross-check operations may also be performed in parallel.

As shown in FIG. 3, when the number of distributed tasks has been determined by the task scheduler, the processor 32 performs a preliminary task for parallel processing. The processor 32 then processes the same task in parallel by creating as many threads as the determined number of distributed tasks.

For example, the first comparison information of the specific identifier is queried and extracted, and the extracted first comparison information is cross-checked in parallel against the second comparison information of each of the plurality of sections, with one cross-check task per section. That is, because the first comparison information of the specific identifier is cross-checked in parallel against the second comparison information of all ten sections, the cross-check of a large amount of information for one entity is performed against the second comparison information as a whole.

Alternatively, the cross-checks of the first comparison information of each of a plurality of identifiers against the second comparison information of a specific section are processed in parallel. For example, the first comparison information for ten different business operator IDs is extracted, and the first comparison information of each extracted business operator ID is cross-checked against the second comparison information of a specific section simultaneously, in parallel. In other words, the cross-check of a large amount of information is processed simultaneously for a plurality of entities, and this cross-check is repeated for the second comparison information of each section.

When the cross-check tasks are completed, the processor 32 runs, in parallel, a plurality of cross-check result insertion task threads that insert into a DB the first/second comparison information mismatch information output by each cross-check task thread. After the cross-check results are inserted, the cross-check result update tasks are executed in parallel. When the cross-check result update tasks are completed, the processor 32 performs a cross-check result report task.

As described above, the present invention may be carried out by dividing sections based on a specific identifier, distributing the same task in parallel, and then, within each parallel task, dividing the distributed input data into sections and processing it sequentially.
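The divide-then-distribute scheme can be sketched with a thread pool. This is a simplified illustration under stated assumptions: the record layout `(business_id, tax_due, amount_paid)`, the helper names, and the fixed count of four tasks are hypothetical, and the per-section check is a placeholder for the full cross-check.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(records, n_tasks):
    """Sort records by key and cut them into at most n_tasks contiguous sections."""
    ordered = sorted(records)
    size = max(1, -(-len(ordered) // n_tasks))  # ceiling division, at least 1
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def check_section(section):
    """Process one section sequentially (placeholder for the real cross-check)."""
    return [(business_id, "normal" if due == paid else "mismatch")
            for business_id, due, paid in section]


# Hypothetical input: every third record has a payment that differs from the amount due.
records = [(f"{i:010d}", 100, 100 if i % 3 else 90) for i in range(12)]
sections = split_into_sections(records, 4)

with ThreadPoolExecutor(max_workers=4) as pool:  # one thread per distributed task
    results = [row for part in pool.map(check_section, sections) for row in part]

mismatches = sum(1 for _business_id, status in results if status == "mismatch")
```

Because sections are cut on the sorted key, a given business operator ID lands in exactly one task, so the tasks need no coordination. In CPython, threads mainly help when the work waits on database I/O; CPU-bound checks would need a process pool instead.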

Claims

1. A method for batch processing large amounts of information based on a relational database, the method comprising:

querying and processing second comparison information from a second database and pre-loading it into a memory;

querying, processing, and extracting first comparison information of a specific identifier from a first database, and retrieving the second comparison information of the specific identifier from the memory;

cross-checking the first comparison information and the second comparison information of the specific identifier and outputting a cross-check result; and

after the cross-check is performed, deleting from the memory the first comparison information and the second comparison information of the specific identifier on which the cross-check was performed.

2. The method of claim 1, wherein the loading of the second comparison information comprises:

loading predetermined classification code information into the memory in advance; and

dividing the second comparison information of the second database into one or more sections and loading the second comparison information of each divided section into the memory as a hash table.

3. The method of claim 1, wherein, before the first comparison information of the specific identifier is loaded, various related information necessary for querying or processing the information is pre-loaded into the memory in the form of a hash table.

4. The method of claim 1, wherein the memory is implemented as a heap memory.

5. The method of claim 1, wherein the specific identifier is identification information that can distinguish each entity, such as a business operator ID.

6. The method of claim 1, wherein the outputting of the cross-check result comprises outputting a first comparison information mismatch file, a second comparison information mismatch file, a first comparison information update file, and a second comparison information update file.

7. The method of claim 6, further comprising inserting the first comparison information mismatch file and the second comparison information mismatch file into a database.

8. The method of claim 6, further comprising updating the first and second databases based on the first comparison information update file and the second comparison information update file.

9. The method of claim 1, wherein, when the cross-check is performed, sections are divided based on the specific identifier and the same job is distributed in parallel, and within each parallel job the distributed input data is divided into sections and processed sequentially.