KR101269428B1 - System and method for data distribution - Google Patents

Publication number
KR101269428B1
Authority
KR
South Korea
Prior art keywords
data
node
nodes
data node
capacity
Prior art date
Application number
KR1020120083209A
Other languages
Korean (ko)
Inventor
김태홍
최성필
정창후
엄정호
정성재
정한민
Original Assignee
한국과학기술정보연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국과학기술정보연구원
Priority to KR1020120083209A
Application granted
Publication of KR101269428B1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data distribution system and method. The system comprises a plurality of data nodes that store data, and a management node that analyzes input data to identify its type pattern, determines a data node to store the data based on state information of the data nodes in which that type pattern is set, and distributes the data.

Description

System and Method for data distribution

The present invention relates to a data distribution system and method, and more particularly, to a data distribution system and method that analyze input data to identify its type pattern, determine a data node to store the data based on state information of the data nodes in which that type pattern is set, and distribute and store the data accordingly.

As the Internet has developed, Internet users generate and distribute a great deal of data every day, and many companies, especially search engine companies and web portals, now collect and accumulate as much data as they can. Extracting meaningful information from that data as quickly as possible has become a competitive advantage for companies.

As a result, many companies are building large clusters at low cost and researching large-scale distributed data management and distributed job processing technology.

In other words, the value of large data sets that are difficult to process on an existing single-machine system has come to the fore, and distributed parallel systems have been introduced and used in various fields as an alternative for processing them.

However, in a distributed parallel system that stores and processes data across multiple nodes, the load caused by network I/O and by the number of join operations between nodes while processing a single task inevitably lowers the processing speed of the entire system. This is an inherent obstacle to processing large amounts of data at high speed.

The present invention has been made to solve the above problems, and an object of the present invention is to provide a data distribution system and method that can reduce the response time of the overall system by minimizing network I/O time and join operations between the nodes of a distributed parallel system.

Another object of the present invention is to provide a data distribution system and method capable of improving query processing speed by distributing and storing data in a data node, and generating a data replica to ensure fault tolerance.

It is still another object of the present invention to provide a data distribution system and method capable of minimizing network I/O between data nodes to reduce the execution time of an entire task.

According to an aspect of the present invention to achieve the above objects, there is provided a data distribution system including a plurality of data nodes that store data, and a management node that analyzes input data to identify a type pattern, determines a data node to store the data based on state information of the data nodes in which the type pattern is set, and distributes the data.

The state information of the data node may include overlapping storage information, the number of data nodes, a storage capacity of each data node, and a type pattern.

The management node allocates the data to a single data node when the data includes a plurality of type patterns, and allocates the data to an empty data node when the data includes a type pattern that has not been distributed. The management node may then generate a replica in a neighboring data node and repeat the replica generation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

According to another aspect of the present invention, there is provided a management node including: a data node information database that stores information about connected data nodes; a data analyzer that analyzes input data to identify a type pattern and a capacity; a data node selector that searches the data node information database to identify data nodes in which the identified type pattern is set and selects a data node to store the data based on state information of the identified data nodes; and a data distributor that distributes the data to the selected data nodes.

The data node information database may store at least one of overlapping storage information, the number of data nodes, a pattern type of each data node, and a storage capacity.

The data node selector may select, from among the identified data nodes, data nodes whose storage capacity is greater than or equal to the capacity of the data. When no such data node exists, the data node selector may divide the data into pieces of a predetermined size and select, from among the identified data nodes, data nodes whose storage capacity is greater than or equal to the capacity of the divided data.

The data node selector may allocate the data to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not been distributed. A replica may then be generated in a neighboring data node, and the replica generation may be repeated until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

The management node may further include an updater configured to check state information of each data node in real time and update state information of each data node stored in the data node information database.

According to another aspect of the present invention, there is provided a data distribution method by which a management node distributes and stores data among a plurality of data nodes, the method including: analyzing input data to identify a type pattern and a capacity; searching a provided data node information database to identify data nodes in which the identified type pattern is set; selecting a data node to store the data based on state information of the identified data nodes; and distributing the data to the selected data nodes.

The data node information database may store at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

The selecting of the data node to store the data based on the state information of the identified data nodes may include selecting, from among the identified data nodes, data nodes whose storage capacity is greater than or equal to the capacity of the data. When no such data node exists, the data may be divided into pieces of a predetermined size, and data nodes whose storage capacity is greater than or equal to the capacity of the divided data may be selected from among the identified data nodes.

The selecting of the data node to store the data based on the state information of the identified data nodes may also include allocating the data to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not been distributed. In doing so, a replica may be generated in a neighboring data node according to preset overlapping storage information, and the replica generation may be repeated until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

According to the present invention, network I/O time and join operations between the nodes of a distributed parallel system can be minimized, which shortens the response time of the entire system.

In addition, by distributing and storing data in data nodes, query processing speed can be improved, and data replicas can be created to ensure fault tolerance.

In addition, network IO between data nodes can be minimized to speed up the overall task.

FIG. 1 illustrates a data distribution system according to the present invention.
FIG. 2 is a block diagram schematically showing the configuration of a management node according to the present invention.
FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

The foregoing and other objects, features, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a diagram illustrating a data distribution system according to the present invention.

Referring to FIG. 1, the data distribution system includes a plurality of data nodes 200a, 200b, ..., 200n (hereinafter, 200) that store data, and a management node 100 that distributes and stores data to each data node 200.

Each data node 200 is preset with a type pattern of data to be stored according to a preset distribution rule. Therefore, the data node 200 stores data corresponding to the type pattern set to the data node 200.

The management node 100 analyzes the input data to identify its type pattern, determines the data node 200 to store the data based on state information of the data nodes in which that same type pattern is set, and distributes the data accordingly. Here, the state information of the data node may include overlapping storage information, the number of data nodes, and the storage capacity and type pattern of each data node.

When the input data includes a plurality of type patterns, the management node 100 allocates the data to a single data node; when the data includes a type pattern that has not been distributed, it allocates the data to an empty data node. The management node 100 then generates a replica in a neighboring data node according to the preset overlapping storage information and repeats the replica generation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

The management node 100 described above is explained in detail with reference to FIG. 2.

FIG. 2 is a block diagram schematically illustrating the configuration of a management node according to the present invention.

Referring to FIG. 2, the management node 100 includes a data analyzer 110, a data node selector 120, a data node information database 130, a distribution rule database 140, and a data distributor 150.

The data node information database 130 stores state information of each data node. That is, the data node information database 130 stores overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node. Here, the overlapping storage information refers to the number of copies of the data to be stored. For example, when the overlapping storage information is set to three, the management node 100 stores the input data redundantly on three data nodes.

The distribution rule database 140 stores a predetermined distribution rule for each service. The distribution rule may be a rule for storing association data in an independent node when using distributed parallel based data nodes.

The distribution rule stored in the distribution rule database 140 is produced as follows: a pattern is extracted by analyzing the query sets included in the query list of each service, a pattern set configuration file is generated per query unit based on that pattern, and the per-query pattern set configuration file then serves as the distribution rule of the service.

The data analyzer 110 analyzes the input data to check the type pattern and the capacity. That is, the data analyzer 110 may read the input data in line units and check the type pattern. In addition, the data analyzer 110 may store the data in a buffer having a predetermined size, and then read the data stored in the buffer to check the type pattern. In this case, the size of the buffer can be arbitrarily changed.
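The two reading modes described for the data analyzer 110 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the description does not say how a type pattern is recognized, so the `classify` callable and the tab-separated `first_field` rule are hypothetical, as are the function names and the sample input.

```python
import io
from collections import Counter


def first_field(line):
    """Hypothetical classifier: treat the text before the first tab as the type pattern."""
    return line.split("\t", 1)[0]


def analyze_lines(stream, classify):
    """Read the input in line units and total the bytes seen per type pattern."""
    sizes = Counter()
    for line in stream:
        sizes[classify(line)] += len(line.encode("utf-8"))
    return dict(sizes)


def analyze_buffered(stream, classify, buffer_size=4096):
    """Read the input through a buffer of adjustable size, then classify complete lines."""
    sizes = Counter()
    tail = ""  # partial line carried over from the previous buffer
    while True:
        buf = stream.read(buffer_size)
        if not buf:
            break
        *complete, tail = (tail + buf).split("\n")
        for line in complete:
            line += "\n"  # restore the newline removed by split()
            sizes[classify(line)] += len(line.encode("utf-8"))
    if tail:  # last line without a trailing newline
        sizes[classify(tail)] += len(tail.encode("utf-8"))
    return dict(sizes)


sample = "p1\tx\np2\tyy\np1\tz\n"
by_line = analyze_lines(io.StringIO(sample), first_field)
by_buffer = analyze_buffered(io.StringIO(sample), first_field, buffer_size=4)
```

Both functions return a mapping from type pattern to capacity in bytes, which is the pair of values the data node selector 120 needs. The buffered variant carries partial lines across buffer boundaries, so changing `buffer_size` should not change the result.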

The data node selector 120 searches the data node information database 130 to identify the data nodes in which the type pattern identified by the data analyzer 110 is set, and selects a data node to store the data based on the state information of the identified data nodes. Specifically, the data node selector 120 obtains the data nodes in which the same type pattern as that of the data is set, and determines whether any of them has a storage capacity equal to or greater than the capacity of the data. If so, the data node selector 120 selects such a data node from among the obtained data nodes.

If no data node has a storage capacity equal to or greater than the capacity of the data, the data node selector 120 divides the data into pieces of a predetermined size and selects, from among the obtained data nodes, data nodes whose storage capacity is equal to or greater than the capacity of the divided data.
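The capacity check and the fallback to dividing the data can be sketched as below. This is a minimal sketch under assumptions: the `DataNode` fields, the `select_nodes` name, and the choice of the first node that fits are hypothetical, since the description does not state how ties between eligible nodes are broken.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str
    type_patterns: frozenset  # type patterns set on this node
    free_capacity: int        # remaining storage, in bytes


def select_nodes(nodes, pattern, data_size, chunk_size):
    """Return a list of (node name, bytes) assignments for one piece of data."""
    # Only nodes on which the same type pattern is set are candidates.
    candidates = [n for n in nodes if pattern in n.type_patterns]
    for node in candidates:
        if node.free_capacity >= data_size:
            return [(node.name, data_size)]  # the whole data fits on one node
    # No single node is large enough: divide the data into chunk_size pieces
    # and place each piece on a candidate that can still hold it.
    remaining = {n.name: n.free_capacity for n in candidates}
    plan = []
    for offset in range(0, data_size, chunk_size):
        piece = min(chunk_size, data_size - offset)
        target = next((n for n in candidates if remaining[n.name] >= piece), None)
        if target is None:
            raise ValueError("no data node can hold the divided data")
        remaining[target.name] -= piece
        plan.append((target.name, piece))
    return plan


nodes = [
    DataNode("A", frozenset({1}), 100),
    DataNode("B", frozenset({1}), 60),
    DataNode("C", frozenset({2}), 500),
]
whole = select_nodes(nodes, pattern=1, data_size=80, chunk_size=50)
split = select_nodes(nodes, pattern=1, data_size=150, chunk_size=50)
```

In this example the 80-byte input fits on node A as a whole, while the 150-byte input fits on no single candidate and is divided into three 50-byte pieces spread over A and B. Node C is never considered because a different type pattern is set on it.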

In addition, to reduce the number of joins, the data node selector 120 may allocate the data to a single data node when the data includes a plurality of type patterns, and may allocate the data to an empty data node when the data includes a type pattern that has not been distributed.

In addition, when overlapping storage information is set in the data node information database 130, the data node selector 120 may generate a replica in an adjacent data node according to the overlapping storage information and repeat the replica generation until the replica setting is satisfied, so that the data is redundantly distributed to adjacent data nodes. That is, the management node 100 stores the same data on a plurality of data nodes in order to prevent data loss and service interruption when a data node fails. To this end, the management node 100 may set the overlapping storage information and store the input data redundantly according to that setting.

The data distributor 150 distributes and stores the data to the data nodes selected by the data node selector 120.

Although not shown in the drawing, the management node 100 may further include an updater (not shown) that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database 130.

The management node 100 configured as described above analyzes the input data and distributes it to one or more data nodes. Because data is stored in units of queries, the number of unnecessary joins is reduced and the associated I/O cost is eliminated, enabling fast query processing. The advantage of a distributed parallel system, guaranteed fault tolerance, is also retained by storing data redundantly according to user settings.

FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

Referring to FIG. 3, when data is input (S302), the management node analyzes the input data and checks a type pattern and capacity (S304). That is, the management node analyzes the input data in a line unit or a predetermined size unit to check the type pattern and the capacity.

After performing the step S304, the management node searches the provided data node information database and checks the data nodes in which the identified type pattern is stored (S306). That is, the management node searches the data node information database and identifies data nodes in which the same type pattern as that of the data is set.

After performing S306, the management node selects a data node to store the data based on the confirmed state information of the data nodes (S308).

In this case, the management node compares the capacity of the data with the storage capacity of the identified data nodes, and selects data nodes whose storage capacity is greater than or equal to the capacity of the data. The management node may then allocate the data to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not been distributed.

When overlapping storage information is set in the data node information database, the management node may generate a replica in a neighboring data node according to the overlapping storage information and repeat the replica generation until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

After performing the step S308, the management node distributes and stores data to the selected data nodes (S310).

The management node determines whether all input data has been stored (S312), and if the storage is not completed, repeats from step S302.
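The loop of steps S302 to S312 can be sketched as follows. This is a simplified sketch under assumptions, not the method as claimed: the input is taken to be already classified into hypothetical `(type_pattern, payload)` pairs, `node_db` is a plain dictionary standing in for the data node information database, and replica creation and data division are left out.

```python
def distribute_all(records, node_db):
    """Assign each record to a data node and return the (node, payload) placements.

    records: iterable of (type_pattern, payload) pairs (hypothetical input shape).
    node_db: dict mapping node name -> {"patterns": set, "free": int}.
    """
    placement = []
    for pattern, payload in records:              # S302: data is input
        size = len(payload)                       # S304: type pattern and capacity
        candidates = [name for name, info in node_db.items()
                      if pattern in info["patterns"]]          # S306: matching nodes
        fitting = [name for name in candidates
                   if node_db[name]["free"] >= size]           # S308: capacity check
        if not fitting:
            raise RuntimeError("no data node can store this record")
        target = fitting[0]                       # assumption: first fit wins
        node_db[target]["free"] -= size           # keep the stored state current
        placement.append((target, payload))       # S310: distribute and store
    return placement                              # S312: all input has been stored


node_db = {
    "n1": {"patterns": {"A"}, "free": 5},
    "n2": {"patterns": {"A", "B"}, "free": 10},
}
records = [("A", b"abcd"), ("A", b"xyz"), ("B", b"12")]
result = distribute_all(records, node_db)
```

Decrementing the free capacity after each placement plays the role of the updater described earlier, so that later records see the current state of each node.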

Hereinafter, a method of distributing and storing data including a plurality of type patterns in a data node will be described as an example.

For example, suppose that analyzing the type patterns of the input data gives the following result:

ID #1 => Typepattern #1,2

ID #2 = (ID #1 + ID #3 + Typepattern #5,6) = Typepattern #1,2 + Typepattern #8,9 + Typepattern #5,6

ID #3 => Typepattern #8,9

ID #4 => Typepattern #10,11

ID #5 => Typepattern #12,13,14

ID #6 => Typepattern #15

ID #7 => Typepattern #16

The case where there are 5 data nodes and the replica setting is 3 will be described using Table 1.

Table 1

Step | Node 1 | Node 2 | Node 3 | Node 4 | Node 5
① Assign groups containing multiple patterns to one node to shorten the number of joins | 1,2,8,9,5,6 | 10,11 | 12,13,14 | - | -
② Assign undistributed patterns to empty nodes | 1,2,8,9,5,6 | 10,11 | 12,13,14 | 15 | 16
③ Create replicas on neighbor nodes | 1,2,8,9,5,6; 10,11 | 10,11; 12,13,14 | 12,13,14; 15 | 15; 16 | 16; 1,2,8,9,5,6
④ Repeat creation until the replica setting is satisfied | 1,2,8,9,5,6; 10,11; 15 | 10,11; 12,13,14; 16 | 12,13,14; 15; 10,11 | 15; 16; 1,2,8,9,5,6 | 16; 1,2,8,9,5,6; 12,13,14

Within a cell, semicolons separate the pattern groups stored on that node.

Referring to Table 1, to reduce the number of joins, the management node distributes data that includes a plurality of type patterns to a single data node. That is, Typepattern #1,2,8,9,5,6 of ID #2 is distributed to data node 1, Typepattern #10,11 of ID #4 to data node 2, and Typepattern #12,13,14 of ID #5 to data node 3.

In addition, the management node distributes the data to the empty data node in the case of data including the undistributed type pattern.

That is, the management node distributes the undistributed Typepattern #15 of ID #6 to data node 4 and Typepattern #16 of ID #7 to data node 5.

In addition, since the replica is set to 3, the management node generates a replica in the neighboring data node, and repeats the replica generation until the replica configuration is satisfied, and distributes the replica to the neighboring data node.

That is, the management node redundantly stores Typepattern #10,11 on data node 1, Typepattern #12,13,14 on data node 2, Typepattern #15 on data node 3, Typepattern #16 on data node 4, and Typepattern #1,2,8,9,5,6 on data node 5.

Then, the management node redundantly stores Typepattern #15 on data node 1, Typepattern #16 on data node 2, Typepattern #10,11 on data node 3, Typepattern #1,2,8,9,5,6 on data node 4, and Typepattern #12,13,14 on data node 5.
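The worked example can be approximated in code as follows. This is a sketch under assumptions rather than the patented algorithm: the function names are hypothetical, groups are formed by merging IDs whose pattern sets overlap, and every replica round uses a simple ring rule in which copy r of the group whose primary is node i goes to node (i - r) mod N. That rule gives the same layout as steps ① to ③ of Table 1, but for step ④ it produces a different placement from the one in the table, which does not follow a single fixed offset. Patterns inside a group are also sorted here, so the first group prints as 1,2,5,6,8,9.

```python
def merge_overlapping(id_patterns):
    """Merge IDs whose pattern sets share at least one pattern into one group."""
    groups = []
    for patterns in id_patterns.values():
        merged = set(patterns)
        untouched = []
        for group in groups:
            if group & merged:
                merged |= group  # an overlapping group is absorbed
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


def place(id_patterns, num_nodes, replicas):
    """Return, for each node, the list of pattern groups it stores."""
    groups = merge_overlapping(id_patterns)
    # Steps 1 and 2: multi-pattern groups first, then single patterns
    # into the remaining empty nodes (one primary group per node).
    ordered = [g for g in groups if len(g) > 1] + [g for g in groups if len(g) == 1]
    if len(ordered) > num_nodes or replicas > num_nodes:
        raise ValueError("sketch assumes one node per group and replicas <= nodes")
    nodes = [[] for _ in range(num_nodes)]
    for i, group in enumerate(ordered):
        nodes[i].append(sorted(group))
    # Steps 3 and 4: repeat replica creation until `replicas` copies exist.
    for r in range(1, replicas):
        for i, group in enumerate(ordered):
            nodes[(i - r) % num_nodes].append(sorted(group))
    return nodes


ID_PATTERNS = {
    1: {1, 2},
    2: {1, 2, 8, 9, 5, 6},
    3: {8, 9},
    4: {10, 11},
    5: {12, 13, 14},
    6: {15},
    7: {16},
}
two_copies = place(ID_PATTERNS, num_nodes=5, replicas=2)
three_copies = place(ID_PATTERNS, num_nodes=5, replicas=3)
```

With `replicas=3`, every group ends up on three different nodes, which is the property the replica setting is meant to guarantee. Any placement rule with that property, including the one in Table 1, would serve the fault-tolerance purpose equally well.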

The method for data distribution can be written programmatically, and the codes and code segments constituting the program can be easily inferred by a programmer in the art.

Thus, those skilled in the art will appreciate that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. It is therefore to be understood that the embodiments described above are to be considered in all respects only as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as being included within the scope of the present invention.

100: management node 110: data analysis unit
120: data node selection unit 130: data node information DB
140: distribution rule DB 150: data distribution unit
200: data node

Claims (12)

1. A data distribution system comprising:
a plurality of data nodes for storing data; and
a management node that analyzes input data to identify a type pattern, determines a data node to store the data based on state information of data nodes in which the same type pattern as the identified type pattern is set, and distributes the data to the determined data node,
wherein the management node allocates the data to a single data node when the data includes a plurality of type patterns, and allocates the data to an empty data node when the data includes a type pattern that has not been distributed,
and wherein the management node generates a replica in an adjacent data node according to preset overlapping storage information and repeats the replica generation until the replica setting is satisfied, so that the data is redundantly distributed to adjacent data nodes.

2. The data distribution system of claim 1,
wherein the state information of the data node includes overlapping storage information, the number of data nodes, and the storage capacity and type pattern of each data node.

3. (deleted)

4. A management node comprising:
a data node information database in which information about connected data nodes is stored;
a data analyzer that analyzes input data to identify a type pattern and a capacity;
a data node selector that searches the data node information database to identify data nodes in which the same type pattern as the identified type pattern is set, and selects a data node to store the data based on state information of the identified data nodes; and
a data distributor that distributes the data to the selected data nodes,
wherein the data node selector allocates the data to a single data node when the data includes a plurality of type patterns, and allocates the data to an empty data node when the data includes a type pattern that has not been distributed,
and wherein a replica is generated in a neighboring data node according to preset overlapping storage information, and the replica generation is repeated until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

5. The management node of claim 4,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

6. The management node of claim 4,
wherein the data node selector selects, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data,
and, when no data node has a storage capacity greater than or equal to the capacity of the data, divides the data into pieces of a predetermined size and selects, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data.

7. (deleted)

8. The management node of claim 4,
further comprising an updater that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database.

9. A method by which a management node distributes data to a plurality of data nodes, the method comprising:
(a) analyzing input data to identify a type pattern and a capacity;
(b) searching a provided data node information database to identify data nodes in which the same type pattern as the identified type pattern is set; and
(c) selecting a data node to store the data based on state information of the identified data nodes and distributing the data to the selected data nodes,
wherein, in step (c), the data is allocated to a single data node when the data includes a plurality of type patterns, and to an empty data node when the data includes a type pattern that has not been distributed,
and wherein a replica is generated in a neighboring data node according to preset overlapping storage information, and the replica generation is repeated until the replica setting is satisfied, so that the data is redundantly distributed to neighboring data nodes.

10. The method of claim 9,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storage capacity of each data node.

11. The method of claim 9,
wherein step (c) comprises:
selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the data, or, when no data node has a storage capacity greater than or equal to the capacity of the data, dividing the data into pieces of a predetermined size and selecting, from among the identified data nodes, data nodes having a storage capacity greater than or equal to the capacity of the divided data; and
distributing the data to the selected data nodes.

12. (deleted)
KR1020120083209A 2012-07-30 2012-07-30 System and method for data distribution KR101269428B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020120083209A KR101269428B1 (en) 2012-07-30 2012-07-30 System and method for data distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020120083209A KR101269428B1 (en) 2012-07-30 2012-07-30 System and method for data distribution

Publications (1)

Publication Number Publication Date
KR101269428B1 (en) 2013-05-30

Family

ID=48667188

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020120083209A KR101269428B1 (en) 2012-07-30 2012-07-30 System and method for data distribution

Country Status (1)

Country Link
KR (1) KR101269428B1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170040995A (en) * 2015-10-06 2017-04-14 삼성전자주식회사 Method and apparatus for analyzing interaction network
US9934325B2 (en) 2014-10-20 2018-04-03 Korean Institute Of Science And Technology Information Method and apparatus for distributing graph data in distributed computing environment
KR101927658B1 (en) * 2018-05-16 2019-03-12 양동국 A System of Water Treatment Management Using PLC Data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004252663A (en) 2003-02-19 2004-09-09 Toshiba Corp Storage system, sharing range deciding method and program
JP2012123544A (en) 2010-12-07 2012-06-28 Nippon Hoso Kyokai <Nhk> Load distribution device and program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9934325B2 (en) 2014-10-20 2018-04-03 Korean Institute Of Science And Technology Information Method and apparatus for distributing graph data in distributed computing environment
KR20170040995A (en) * 2015-10-06 2017-04-14 삼성전자주식회사 Method and apparatus for analyzing interaction network
KR102183089B1 (en) * 2015-10-06 2020-11-25 삼성전자주식회사 Method and apparatus for analyzing interaction network
KR101927658B1 (en) * 2018-05-16 2019-03-12 양동국 A System of Water Treatment Management Using PLC Data

Similar Documents

Publication Publication Date Title
US10002148B2 (en) Memory-aware joins based in a database cluster
CN104598376B (en) The layering automatization test system and method for a kind of data-driven
US8140625B2 (en) Method for operating a fixed prefix peer to peer network
CN112163048A (en) Method and device for realizing OLAP analysis based on ClickHouse
CN110032549B (en) Partition splitting method, partition splitting device, electronic equipment and readable storage medium
CN103678609A (en) Large data inquiring method based on distribution relation-object mapping processing
CN102932415A (en) Method and device for storing mirror image document
CN104423960A (en) Continuous project integration method and continuous project integration system
CN105683940A (en) Processing a data flow graph of a hybrid flow
CN107239468B (en) Task node management method and device
Wang et al. BENU: Distributed subgraph enumeration with backtracking-based framework
CN103902544A (en) Data processing method and system
CN104871153A (en) System and method for flexible distributed massively parallel processing (mpp) database
KR101269428B1 (en) System and method for data distribution
US10452685B2 (en) Method and apparatus for replicating data
CN105045917A (en) Example-based distributed data recovery method and device
CN103971036A (en) Page field access control system and method
CN105556474A (en) Managing memory and storage space for a data operation
CN108062314B (en) Dynamic sub-table data processing method and device
CN102385588A (en) Method and system for improving performance of data parallel insertion
CN101673374A (en) Bill processing method and device
CN102207935A (en) Method and system for establishing index
Lwin et al. Non-redundant dynamic fragment allocation with horizontal partition in Distributed Database System
CN107239568A (en) Distributed index implementation method and device
CN111858739A (en) Mapreduce-based data aggregation method and system

Legal Events

Date Code Title Description
A201 Request for examination
A302 Request for accelerated examination
E902 Notification of reason for refusal
AMND Amendment
E601 Decision to refuse application
AMND Amendment
X701 Decision to grant (after re-examination)
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20160406

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20170327

Year of fee payment: 5

FPAY Annual fee payment

Payment date: 20181030

Year of fee payment: 6