KR101269428B1

KR101269428B1 - System and method for data distribution

Info

Publication number: KR101269428B1
Application number: KR1020120083209A
Authority: KR
Inventors: 김태홍; 최성필; 정창후; 엄정호; 정성재; 정한민
Original assignee: 한국과학기술정보연구원
Priority date: 2012-07-30
Filing date: 2012-07-30
Publication date: 2013-05-30

Abstract

The present invention relates to a data distribution system and method. The system comprises a plurality of data nodes that store data, and a management node that analyzes input data to identify its type pattern, determines a data node to store the data based on state information of the data nodes for which that type pattern is set, and distributes the data to the determined data node.

Description

System and Method for data distribution

The present invention relates to a data distribution system and method, and more particularly, to a data distribution system and method that analyze input data to identify its type pattern, determine a data node to store the data based on state information of the data nodes for which that type pattern is set, and distribute and store the data accordingly.

As the Internet has developed, Internet users generate and circulate a large amount of data every day, and many companies, especially search engine companies and web portals, now collect and accumulate as much data as possible. Extracting meaningful information from that data as quickly as possible has become a competitive advantage for these companies.

As a result, many companies are investigating large-scale distributed management and distributed workload processing technology by building large clusters at low cost.

In other words, the value of large-scale data that is difficult to process on a conventional single-machine system has come to the fore, and distributed parallel systems have been introduced and used in various fields as an alternative for processing it.

However, in a distributed parallel system that stores and processes data across multiple nodes, a drop in the processing speed of the entire system is inevitable because of the load caused by network IO and by the number of join operations between nodes while a single task is processed. This has been an inherent obstacle to processing large amounts of data at high speed.

The present invention has been made to solve the above problems, and an object of the present invention is to provide a data distribution system and method that can reduce the response time of the overall system by minimizing the network IO time and the join operations between the nodes of a distributed parallel system.

Another object of the present invention is to provide a data distribution system and method capable of improving query processing speed by distributing and storing data in a data node, and generating a data replica to ensure fault tolerance.

It is still another object of the present invention to provide a data distribution system and method capable of minimizing network IO between data nodes to speed up an entire task.

According to an aspect of the present invention for achieving the above objects, there is provided a data distribution system including a plurality of data nodes that store data, and a management node that analyzes input data to identify a type pattern, determines a data node to store the data based on state information of the data nodes for which the type pattern is set, and distributes the data.

The state information of the data node may include overlapping storage information, the number of data nodes, a storage capacity of each data node, and a type pattern.

The management node may allocate the data to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not yet been distributed. It may then create a replica on a neighboring data node, repeating the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

According to another aspect of the present invention, there is provided a management node including: a data node information database that stores information about connected data nodes; a data analyzer that analyzes input data to identify a type pattern and a capacity; a data node selector that searches the data node information database to identify the data nodes for which the type pattern is set and selects a data node to store the data based on state information of the identified data nodes; and a data distributor that distributes the data to the selected data nodes.

The data node information database may store at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

The data node selector may select, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data. When no such data node exists, it may divide the data into pieces of a predetermined size and select, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data.

The data node selector may allocate the data to a single data node when the data includes a plurality of type patterns, or to an empty data node when the data includes a type pattern that has not yet been distributed. It may then create a replica on a neighboring data node, repeating the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

The management node may further include an updater configured to check state information of each data node in real time and update state information of each data node stored in the data node information database.

According to another aspect of the present invention, there is provided a data distribution method in which a management node distributes and stores data among a plurality of data nodes, the method including: analyzing input data to identify a type pattern and a capacity; searching a data node information database to identify the data nodes for which the identified type pattern is set; selecting a data node to store the data based on state information of the identified data nodes; and distributing the data to the selected data nodes.

The selecting of a data node to store the data based on the state information of the identified data nodes may include selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data, and, when no such data node exists, dividing the data into pieces of a predetermined size and selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data.

The selecting of a data node to store the data based on the state information of the identified data nodes may include allocating the data to a single data node when the data includes a plurality of type patterns, or to an empty data node when the data includes a type pattern that has not yet been distributed, creating a replica on a neighboring data node according to preset overlapping storage information, and repeating the replica creation until the replica setting is satisfied so that the data is redundantly distributed to neighboring data nodes.

According to the present invention, network IO time and join operations between nodes of a distributed parallel system can be minimized to reduce the response time of the entire system.

In addition, by distributing and storing data in data nodes, query processing speed can be improved, and data replicas can be created to ensure fault tolerance.

In addition, network IO between data nodes can be minimized to speed up the overall task.

FIG. 1 illustrates a data distribution system according to the present invention.
FIG. 2 is a block diagram schematically showing the configuration of a management node according to the present invention.
FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

The foregoing and other objects, features, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a diagram illustrating a data distribution system according to the present invention.

Referring to FIG. 1, the data distribution system includes a plurality of data nodes 200a, 200b, ..., 200n (hereinafter referred to as 200) that store data, and a management node 100 that distributes and stores data to each data node 200.

Each data node 200 is preset with a type pattern of data to be stored according to a preset distribution rule. Therefore, the data node 200 stores data corresponding to the type pattern set to the data node 200.

The management node 100 analyzes the input data to identify its type pattern, determines the data node 200 that will store the data based on state information of the data nodes for which that same type pattern is set, and distributes the data to it. Here, the state information of the data nodes may include overlapping storage information, the number of data nodes, and the storage capacity and type pattern of each data node.

When the input data includes a plurality of type patterns, the management node 100 allocates the data to a single data node; when the data includes a type pattern that has not yet been distributed, it allocates the data to an empty data node. The management node 100 then creates a replica on a neighboring data node according to the preset overlapping storage information, and repeats the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

The management node 100 is described in detail below with reference to FIG. 2.

FIG. 2 is a block diagram schematically illustrating the configuration of a management node according to the present invention.

Referring to FIG. 2, the management node 100 includes a data analyzer 110, a data node selector 120, a data node information database 130, a distribution rule database 140, and a data distributor 150.

The data node information database 130 stores state information of each data node. That is, the data node information database 130 stores overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node. Here, the overlapping storage information may refer to the number of copies of the data that are stored. For example, when the overlapping storage information is set to 3, the management node 100 stores the input data redundantly in three data nodes.

The distribution rule database 140 stores a predetermined distribution rule for each service. The distribution rule may be a rule for storing associated data in an independent node when distributed parallel data nodes are used.

The distribution rule stored in the distribution rule database 140 is obtained by analyzing the query sets included in the query list of each service to extract patterns and generating a query-unit pattern set configuration file based on those patterns. This query-unit pattern set configuration file serves as the distribution rule of the service.

The data analyzer 110 analyzes the input data to check the type pattern and the capacity. That is, the data analyzer 110 may read the input data in line units and check the type pattern. In addition, the data analyzer 110 may store the data in a buffer having a predetermined size, and then read the data stored in the buffer to check the type pattern. In this case, the size of the buffer can be arbitrarily changed.
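The two reading modes described for the data analyzer 110 can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not say how a type pattern is recognized, so the `classify_line` helper (which treats the text before the first tab as the type pattern), the function names, and the default buffer size are all hypothetical.

```python
import io

def classify_line(line):
    """Hypothetical classifier: the text before the first tab is the type pattern."""
    return line.split("\t", 1)[0]

def analyze_by_line(stream, classify=classify_line):
    """Read the input line by line and yield (type_pattern, size_in_bytes)."""
    for line in stream:
        line = line.rstrip("\n")
        if line:
            yield classify(line), len(line.encode("utf-8"))

def analyze_by_buffer(stream, buffer_size=64, classify=classify_line):
    """Read the input in chunks of buffer_size characters, then analyze
    each complete line held in the buffer."""
    pending = ""
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last newline is complete; keep the tail.
        *complete, pending = pending.split("\n")
        for line in complete:
            if line:
                yield classify(line), len(line.encode("utf-8"))
    if pending:  # last line without a trailing newline
        yield classify(pending), len(pending.encode("utf-8"))

sample = "p1\tabc\np2\tdefg\n"
by_line = list(analyze_by_line(io.StringIO(sample)))
by_buffer = list(analyze_by_buffer(io.StringIO(sample), buffer_size=4))
```

Both modes yield the same pairs for the same input; the buffer size only changes how much is read at once, which matches the statement that it can be changed freely.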

The data node selector 120 searches the data node information database 130 to identify the data nodes for which the type pattern identified by the data analyzer 110 is set, and selects a data node to store the data based on the state information of the identified data nodes. In doing so, the data node selector 120 obtains the data nodes for which the same type pattern as that of the data is set, and determines whether any of the obtained data nodes has a storage capacity equal to or greater than the capacity of the data. If such a data node exists, the data node selector 120 selects it from among the obtained data nodes.

If no data node has a storage capacity equal to or greater than the capacity of the data, the data node selector 120 divides the data into pieces of a predetermined size and selects, from among the obtained data nodes, data nodes whose storage capacity is equal to or greater than the capacity of the divided data.
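The capacity check and the fallback to splitting can be illustrated with a short Python sketch. The `NodeState` record, the `select_nodes` function, and the first-fit choice among candidate nodes are assumptions made for this example; the patent only states that nodes with sufficient capacity are selected and does not specify how to choose among them.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    name: str
    type_patterns: frozenset  # type patterns preset on this node
    free_capacity: int        # remaining storage, in bytes

def select_nodes(nodes, pattern, data_size, chunk_size):
    """Return a placement plan as a list of (node_name, bytes_to_store)."""
    # Only nodes preset with the same type pattern are candidates.
    candidates = [n for n in nodes if pattern in n.type_patterns]

    # Case 1: some candidate can hold the data as a whole.
    for node in candidates:
        if node.free_capacity >= data_size:
            return [(node.name, data_size)]

    # Case 2: split the data into pieces of a predetermined size and
    # place each piece on the first candidate that still has room
    # (first-fit is an assumption of this sketch).
    remaining = {n.name: n.free_capacity for n in candidates}
    plan = []
    offset = 0
    while offset < data_size:
        size = min(chunk_size, data_size - offset)
        target = next((name for name, free in remaining.items() if free >= size), None)
        if target is None:
            raise ValueError("no candidate node can hold a piece of the data")
        remaining[target] -= size
        plan.append((target, size))
        offset += size
    return plan

nodes = [
    NodeState("n1", frozenset({"p1"}), 60),
    NodeState("n2", frozenset({"p1"}), 50),
    NodeState("n3", frozenset({"p2"}), 500),
]
whole = select_nodes(nodes, "p1", 40, chunk_size=30)
split = select_nodes(nodes, "p1", 100, chunk_size=30)
```

In this example `whole` places the 40-byte item on a single node, while the 100-byte item fits on no single candidate and is divided into 30-byte pieces spread over `n1` and `n2`.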

In addition, to reduce the number of joins, the data node selector 120 may allocate the data to a single data node when the data includes a plurality of type patterns, and may allocate the data to an empty data node when the data includes a type pattern that has not yet been distributed.

In addition, when overlapping storage information is set in the data node information database 130, the data node selector 120 creates a replica on an adjacent data node according to the overlapping storage information and repeats the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to adjacent data nodes. That is, the management node 100 stores the same data in a plurality of data nodes in order to prevent data loss and service interruption when a data node fails. To this end, the management node 100 may set the overlapping storage information and redundantly store the input data accordingly.

The data distributor 150 distributes and stores the data to the data nodes selected by the data node selector 120.

Although not shown in the drawings, the management node 100 may further include an updater that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database 130.

The management node 100 configured as described above analyzes the input data and distributes it to one or more data nodes. Because the data is stored in query units, the number of unnecessary joins is reduced and the associated IO cost is eliminated, enabling fast query processing. The system also retains the advantage of a distributed parallel system, namely guaranteed fault tolerance, by redundantly storing data according to the user's settings.

FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

Referring to FIG. 3, when data is input (S302), the management node analyzes the input data and checks its type pattern and capacity (S304). That is, the management node analyzes the input data in line units or in units of a predetermined size to check the type pattern and the capacity.

After performing step S304, the management node searches the data node information database and identifies the data nodes for which the identified type pattern is set (S306). That is, the management node searches the data node information database and identifies the data nodes for which the same type pattern as that of the data is set.

After performing S306, the management node selects a data node to store the data based on the confirmed state information of the data nodes (S308).

In this case, the management node compares the capacity of the data with the storage capacity of the identified data nodes, and selects data nodes whose storage capacity is equal to or greater than the capacity of the data. The management node may then allocate the data to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not yet been distributed.

When overlapping storage information is set in the data node information database, the management node creates a replica on a neighboring data node according to the overlapping storage information and repeats the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

After performing the step S308, the management node distributes and stores data to the selected data nodes (S310).

The management node determines whether all input data has been stored (S312), and if the storage is not completed, repeats from step S302.
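The loop from S302 to S312 can be pictured with a minimal Python sketch. The function name `distribute_all` and the dictionary-based inputs are illustrative assumptions, not part of the disclosure. For brevity the sketch stores each item whole on the first matching node with enough free capacity and leaves out the splitting and replica steps.

```python
def distribute_all(items, node_patterns, node_capacity):
    """items: iterable of (type_pattern, size) pairs, already analyzed.
    node_patterns: node name -> set of type patterns preset on that node.
    node_capacity: node name -> free capacity (updated in place).
    Returns node name -> list of items stored there."""
    stored = {name: [] for name in node_patterns}
    for pattern, size in items:                       # S302 / S304
        # S306: nodes preset with the same type pattern
        candidates = [n for n, pats in node_patterns.items() if pattern in pats]
        # S308: pick a candidate with enough free capacity
        target = next((n for n in candidates if node_capacity[n] >= size), None)
        if target is None:
            raise ValueError(f"no data node can store pattern {pattern!r}")
        stored[target].append((pattern, size))        # S310
        node_capacity[target] -= size
    return stored                                     # S312: all input stored

patterns = {"n1": {"p1"}, "n2": {"p1"}, "n3": {"p2"}}
capacity = {"n1": 20, "n2": 100, "n3": 50}
result = distribute_all([("p1", 10), ("p2", 5), ("p1", 15)], patterns, capacity)
```

Here the third item no longer fits on `n1` after the first one is stored, so it goes to `n2`, the next node preset with the same type pattern.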

Hereinafter, a method of distributing and storing data including a plurality of type patterns in a data node will be described as an example.

For example, suppose that analyzing the type patterns of the input data gives the following result:

ID #1 => Typepattern #1,2
ID #2 = (ID #1 + ID #3 + Typepattern #5,6) = Typepattern #1,2 + Typepattern #8,9 + Typepattern #5,6
ID #3 => Typepattern #8,9
ID #4 => Typepattern #10,11
ID #5 => Typepattern #12,13,14
ID #6 => Typepattern #15
ID #7 => Typepattern #16

The case in which there are five data nodes and the replica setting is 3 is described using Table 1.

Table 1

Step | Node 1 | Node 2 | Node 3 | Node 4 | Node 5
① Assign groups containing multiple patterns to one node to reduce the number of joins | 1,2,8,9,5,6 | 10,11 | 12,13,14 | - | -
② Assign undistributed patterns to empty nodes | 1,2,8,9,5,6 | 10,11 | 12,13,14 | 15 | 16
③ Create replicas on neighbor nodes | 1,2,8,9,5,6 / 10,11 | 10,11 / 12,13,14 | 12,13,14 / 15 | 15 / 16 | 16 / 1,2,8,9,5,6
④ Repeat creation until the replica setting is satisfied | 1,2,8,9,5,6 / 10,11 / 15 | 10,11 / 12,13,14 / 16 | 12,13,14 / 15 / 10,11 | 15 / 16 / 1,2,8,9,5,6 | 16 / 1,2,8,9,5,6 / 12,13,14

Referring to Table 1, to reduce the number of joins, the management node distributes data that includes a plurality of type patterns to a single data node. That is, Typepattern #1,2,8,9,5,6 of ID #2 is distributed to data node 1, Typepattern #10,11 of ID #4 to data node 2, and Typepattern #12,13,14 of ID #5 to data node 3.

In addition, the management node distributes data that includes a type pattern not yet distributed to an empty data node.

That is, the management node distributes the undistributed Typepattern #15 of ID #6 to data node 4 and Typepattern #16 of ID #7 to data node 5.
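Steps ① and ② of Table 1 can be reproduced with a small Python sketch. The grouping rule used here, merging IDs whose type patterns overlap into one group, is an assumption inferred from the example (ID #2 contains the patterns of ID #1 and ID #3); the function names are hypothetical.

```python
def group_patterns(id_patterns):
    """Merge the pattern lists of IDs that share a type pattern,
    keeping first-seen order of groups and of patterns."""
    groups = []
    for patterns in id_patterns.values():
        hits = [i for i, g in enumerate(groups) if set(g) & set(patterns)]
        if not hits:
            groups.append(list(patterns))
            continue
        first = hits[0]
        merged = list(groups[first])
        for i in hits[1:]:
            merged += [p for p in groups[i] if p not in merged]
        merged += [p for p in patterns if p not in merged]
        groups[first] = merged
        for i in reversed(hits[1:]):  # drop the groups folded into `first`
            del groups[i]
    return groups

def assign_primary_nodes(id_patterns, node_count):
    """Step 1: multi-pattern groups get one node each.
    Step 2: remaining single patterns go to the empty nodes."""
    groups = group_patterns(id_patterns)
    multi = [g for g in groups if len(g) > 1]
    single = [g for g in groups if len(g) == 1]
    ordered = multi + single
    if len(ordered) > node_count:
        raise ValueError("more pattern groups than data nodes")
    return {node: group for node, group in enumerate(ordered, start=1)}

example = {
    "ID1": [1, 2],
    "ID2": [1, 2, 8, 9, 5, 6],
    "ID3": [8, 9],
    "ID4": [10, 11],
    "ID5": [12, 13, 14],
    "ID6": [15],
    "ID7": [16],
}
primary = assign_primary_nodes(example, node_count=5)
# primary == {1: [1, 2, 8, 9, 5, 6], 2: [10, 11], 3: [12, 13, 14], 4: [15], 5: [16]}
```

The resulting placement matches row ② of Table 1.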

In addition, since the replica setting is 3, the management node creates a replica on a neighboring data node and repeats the replica creation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

That is, the management node redundantly stores Typepattern #10,11 on data node 1, Typepattern #12,13,14 on data node 2, Typepattern #15 on data node 3, Typepattern #16 on data node 4, and Typepattern #1,2,8,9,5,6 on data node 5.

Then, the management node redundantly stores Typepattern #15 on data node 1, Typepattern #16 on data node 2, Typepattern #10,11 on data node 3, Typepattern #1,2,8,9,5,6 on data node 4, and Typepattern #12,13,14 on data node 5.
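The replica passes can be sketched in Python as below. The neighbor rule is an assumption: in pass k, each group is copied to the node k positions to the left, wrapping around. This reproduces row ③ of Table 1 exactly; for the second pass the table uses a different neighbor order, so the sketch's final layout differs from row ④ while still meeting the same replica setting. The helper `satisfies_replica_setting` is likewise hypothetical and is used to check both layouts.

```python
from collections import Counter

def replicate_to_neighbors(primary, replica_count):
    """primary[i] is the group stored on node i+1.
    Returns one list of groups per node after replication."""
    n = len(primary)
    if replica_count > n:
        raise ValueError("replica setting exceeds the number of data nodes")
    layout = [[group] for group in primary]
    for k in range(1, replica_count):        # one pass per extra copy
        for i, group in enumerate(primary):
            layout[(i - k) % n].append(group)  # k-th neighbor to the left
    return layout

def satisfies_replica_setting(layout, replica_count):
    """True if every group is on exactly replica_count distinct nodes."""
    no_duplicates = all(len(node) == len(set(node)) for node in layout)
    copies = Counter(group for node in layout for group in set(node))
    return no_duplicates and all(c == replica_count for c in copies.values())

G = "1,2,8,9,5,6"
primary = [G, "10,11", "12,13,14", "15", "16"]

after_first_pass = replicate_to_neighbors(primary, 2)   # row 3 of Table 1
final_layout = replicate_to_neighbors(primary, 3)

# Row 4 of Table 1, transcribed from the document.
table_1_final = [
    [G, "10,11", "15"],
    ["10,11", "12,13,14", "16"],
    ["12,13,14", "15", "10,11"],
    ["15", "16", G],
    ["16", G, "12,13,14"],
]
```

Both `final_layout` and `table_1_final` keep three copies of every pattern group on three different nodes, so either placement tolerates the loss of a data node in the way the description requires.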

The method for data distribution can be written programmatically, and the codes and code segments constituting the program can be easily inferred by a programmer in the art.

Thus, those skilled in the art will appreciate that the present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. It is therefore to be understood that the embodiments described above are to be considered in all respects only as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.

100: management node
110: data analyzer
120: data node selector
130: data node information database
140: distribution rule database
150: data distributor
200: data node

Claims

A data distribution system comprising:
a plurality of data nodes for storing data; and
a management node that analyzes input data to identify a type pattern, determines a data node to store the data based on state information of data nodes for which the same type pattern as the identified type pattern is set, and distributes the data to the determined data node,
wherein the management node allocates the data to a single data node when the data includes a plurality of type patterns, allocates the data to an empty data node when the data includes a type pattern that has not been distributed,
creates a replica on an adjacent data node according to preset overlapping storage information, and repeats the replica creation until the replica setting is satisfied so as to redundantly distribute the data to adjacent data nodes.

The data distribution system of claim 1,
wherein the state information of the data nodes includes overlapping storage information, the number of data nodes, and the storage capacity and type pattern of each data node.

delete

A management node comprising:
a data node information database in which information about connected data nodes is stored;
a data analyzer that analyzes input data to identify a type pattern and a capacity;
a data node selector that searches the data node information database to identify data nodes for which the same type pattern as the identified type pattern is set, and selects a data node to store the data based on state information of the identified data nodes; and
a data distributor that distributes the data to the selected data nodes,
wherein the data node selector allocates the data to a single data node when the data includes a plurality of type patterns, allocates the data to an empty data node when the data includes a type pattern that has not been distributed,
creates a replica on a neighboring data node according to preset overlapping storage information, and repeats the replica creation until the replica setting is satisfied so as to redundantly distribute the data to neighboring data nodes.

The management node of claim 4,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

The management node of claim 4,
wherein the data node selector selects, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data,
and, when no data node has a storage capacity greater than or equal to the capacity of the data, divides the data into pieces of a predetermined size and selects, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data.

delete

The management node of claim 4,
further comprising an updater that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database.

A method for a management node to distribute data to a plurality of data nodes, the method comprising:
(a) analyzing input data to identify a type pattern and a capacity;
(b) searching a data node information database to identify data nodes for which the same type pattern as the identified type pattern is set; and
(c) selecting a data node to store the data based on state information of the identified data nodes and distributing the data to the selected data nodes,
wherein, in step (c), the data is allocated to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not been distributed,
and a replica is created on a neighboring data node according to preset overlapping storage information, the replica creation being repeated until the replica setting is satisfied so that the data is redundantly distributed to neighboring data nodes.

The data distribution method of claim 9,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

The data distribution method of claim 9,
wherein step (c) comprises:
selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data, or, when no data node has a storage capacity greater than or equal to the capacity of the data, dividing the data into pieces of a predetermined size and selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data; and
distributing the data to the selected data nodes.

delete