WO2016199955A1 - Chord distributed hash table-based MapReduce system and method - Google Patents
Chord distributed hash table-based MapReduce system and method
- Publication number
- WO2016199955A1 (PCT/KR2015/005851)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- server
- data
- hash
- memory cache
- file
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Definitions
- The present invention relates to a distributed file system and, more particularly, to a chord distributed hash table-based MapReduce system and method that manage data in a double-layered ring structure consisting of a file system layer and an in-memory cache layer, both based on a chord distributed hash table, predict the probability distribution of data access requests, and then adjust the hash key range of the chord distributed hash table of the in-memory cache layer and schedule tasks based on the predicted probability distribution, thereby enabling load balancing and increasing the cache hit rate.
- Cloud computing refers to the concept of delegating the storage and processing of data to a cloud, which is a collection of computers, rather than entrusting it to individual computers. Cloud computing has been adopted in many fields, and its biggest area of use is the big data field.
- Big data refers to data of many terabytes to petabytes or more in size; since such data cannot be processed by a single computer, cloud computing is regarded as the base platform for big data processing.
- A representative example is Apache's Hadoop, a Java-based platform for multi-computer cloud environments.
- Hadoop includes the Hadoop Distributed File System (HDFS), which splits incoming data and stores it across servers, and the distributed data is processed by MapReduce, a framework developed for high-speed parallel processing of large data in a cluster environment.
- However, the Hadoop distributed file system used in Hadoop is a centrally managed file system: a central manager maintains the directory and keeps track of which data resides on each server, which causes poor performance when a large amount of data must be managed.
- In addition, the Hadoop distributed file system splits a file without considering its contents at all and distributes the pieces across several servers, so related data may not be co-located at all.
- Accordingly, the present invention manages data in a dual-layer ring structure consisting of a file system layer and an in-memory cache layer, both based on a chord distributed hash table, predicts the probability distribution of data access requests from the users' data access frequency, and then, based on the predicted distribution, adjusts the hash key range of the chord distributed hash table of the in-memory cache layer and schedules tasks, thereby providing a chord distributed hash table-based MapReduce system and method capable of load balancing and an increased cache hit rate.
- To this end, the present invention provides a chord distributed hash table-based MapReduce system comprising: a plurality of servers, each including a file system and an in-memory cache that store data based on the chord distributed hash table; and a task scheduler that manages the data stored in the file systems and in-memory caches in a dual-layer ring structure and, when a data access request for a specific file is received from the outside, allocates MapReduce tasks to the servers storing the requested file among the plurality of servers and outputs the result of the tasks in response to the data access request.
- When receiving the data access request, the task scheduler extracts a hash key from the name of the file, checks the hash key range assigned to the in-memory cache of each server to find the server that stores the file, receives metadata about the file from the found server, and allocates the MapReduce tasks to the servers in which the file is stored.
- The task scheduler may receive, as the metadata, the data block structure of the file and information on the servers in which the data blocks are distributed and stored, and may allocate the MapReduce tasks to the servers in which each data block is stored.
- The in-memory cache stores hash keys corresponding to data using the chord distributed hash table; each in-memory cache stores the hash keys that fall within the hash key range allocated to it, together with the data corresponding to those hash keys.
- The task scheduler may dynamically change the hash key range of the in-memory cache of each server according to the frequency of data access requests to that server.
- The task scheduler may store, in the file system, intermediate calculation results generated while processing the MapReduce task for each data block of the file.
- The intermediate calculation results may be generated to have different hash keys according to the data blocks, and may be distributed to different servers.
- the intermediate calculation result may be stored in an intermediate result reuse cache area of the in-memory cache.
- The system may further comprise a resource manager that works with the task scheduler and manages the addition, removal, and recovery of servers as well as the upload of files.
- The present invention also provides a method for performing MapReduce tasks in a chord distributed hash table-based MapReduce system including a plurality of servers having a file system and an in-memory cache and a task scheduler that allocates MapReduce tasks to the servers.
- The file system and the in-memory cache may store the data based on a chord distributed hash table.
- The in-memory cache stores hash keys corresponding to data using the chord distributed hash table; it is allocated a predetermined hash key range and stores the hash keys falling within that range together with the data corresponding to those hash keys.
- The hash key range may be dynamically changed for each server according to the frequency of data access requests to that server.
- The MapReduce tasks may be processed on the servers in which the respective data blocks are stored, and intermediate calculation results generated while processing the MapReduce tasks may be stored in the file system.
- The intermediate calculation results may be generated to have different hash keys according to the data blocks, and may be distributed to different servers.
- the intermediate calculation result may be stored in an intermediate result reuse cache area of the in-memory cache.
- According to the present invention, data is managed in a dual-layer ring structure consisting of a chord distributed hash table-based file system layer and an in-memory cache layer, and the probability distribution of data access requests is predicted from the users' data access frequency; based on the predicted distribution, the hash key range of the chord distributed hash table of the in-memory cache layer is adjusted and tasks are scheduled, which enables load balancing and increases the cache hit rate.
- In addition, a chord distributed file system is used instead of a centrally controlled distributed file system; in the chord distributed file system each server manages a chord routing table and can access remote files directly rather than through centrally managed metadata, so scalability is guaranteed.
- Furthermore, an in-memory cache that actively exploits the distributed memory environment indexes key-value data using the chord distributed hash table, and the cache hit rate can be increased by storing in it not only input data but also the intermediate calculation results produced by map tasks.
- Since the indexing of the in-memory cache is managed independently of the chord distributed hash table that manages the file system, the hash key ranges can be adjusted flexibly according to the frequency of data requests so that data accesses are distributed evenly across the servers.
- By applying a locality-aware fair scheduling algorithm, the task scheduler checks, based on the hash key ranges it has distributed, which data is stored in the in-memory cache of each server and schedules tasks so that the cached data is reused; when data accesses are biased toward particular servers, the hash key ranges are adjusted so that accesses are distributed evenly across all servers.
- FIG. 1 is an operational conceptual diagram of a chord distributed file system according to an embodiment of the present invention.
- FIG. 2 is a conceptual diagram illustrating data management using a dual-layer ring structure of a chord distributed hash table-based file system and an in-memory cache according to an embodiment of the present invention.
- FIGS. 3 to 5 are diagrams illustrating data access probability distribution graphs according to an embodiment of the present invention.
- FIG. 6 is an exemplary diagram of a cumulative probability distribution graph according to an embodiment of the present invention.
- FIG. 7 is a conceptual diagram illustrating a MapReduce task performing operation in a chord distributed hash table-based MapReduce system according to an embodiment of the present invention.
- In the present invention, a chord distributed file system is used instead of a conventional centralized distributed file system such as Hadoop.
- Each server manages a chord routing table and can directly access remote files without relying on centrally managed metadata, thereby ensuring scalability.
- FIG. 1 illustrates the concept of the file system applied to a chord distributed hash table (DHT) based MapReduce system according to an embodiment of the present invention.
- In a conventional system, a central directory stores all information about the data of every node, whereas in the chord distributed file system of the present invention each node is implemented to hold information only about its neighbor nodes.
- As shown in FIG. 1, each node may hold information about its immediately adjacent neighbor node and about the nodes reached by repeatedly doubling the distance from it along the ring (indicated by arrows).
- For example, the node 100 with hash key 0 knows information about the node 102 with hash key 1 (its immediate neighbor), the node 104 with hash key 2, the node 106 with hash key 4, and the node 114 with hash key 8.
- When the node 100 with hash key 0 receives a user request message for data corresponding to, for example, hash key 7, it does not know the node 112 with hash key 7; it only knows the nodes connected to it by arrows in FIG. 1, namely the node 102 with hash key 1, the node 104 with hash key 2, the node 106 with hash key 4, and the node 114 with hash key 8, so the request must be forwarded through one of these nodes.
- Since the hash key 8 exceeds 7 among the hash keys known to it, the node 100 with hash key 0 forwards the user request message to the node 106 with hash key 4, the largest hash key it knows that does not exceed 7.
- The node 106 with hash key 4 knows, through the arrows connected to it, the node 108 with hash key 5, the node 110 with hash key 6, the node 114 with hash key 8, and the node with hash key 12.
- Since the hash keys 8 and 12 are greater than 7, the node 106 with hash key 4 forwards the user request message to the node 110 with hash key 6, the largest hash key smaller than 7 among those it knows.
- The node 110 with hash key 6, upon receiving the user request message, again searches among the nodes it knows for a node to forward the message to; since it knows the node 112 that stores hash key 7, which corresponds to the request, it transmits the user request message to the node 112.
- In this way, the data corresponding to hash key 7 can be read through the node 100 with hash key 0, which first received the user request message.
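- The routing procedure described above can be illustrated with a short Python sketch. The 6-bit identifier space, the concrete node set, and the helper names below are illustrative assumptions; only the forwarding rule (forward to the largest known hash key that does not pass the target) is taken from the description.

```python
# Minimal sketch of chord-style lookup on an assumed 6-bit key space (0..63).
# Node IDs and helper names are illustrative; the routing rule follows the text.

M = 6                       # identifier bits -> ring size 2**M = 64
RING = 2 ** M
NODES = sorted([0, 1, 2, 4, 5, 6, 7, 8, 12])    # example nodes on the ring

def successor(key):
    """First node whose hash key is >= key (wrapping around the ring)."""
    for n in NODES:
        if n >= key % RING:
            return n
    return NODES[0]

def finger_table(node):
    """Node 'node' knows the successors of node + 2**i for i = 0..M-1."""
    return [successor(node + 2 ** i) for i in range(M)]

def in_interval(x, lo, hi):
    """True if x lies in the open interval (lo, hi) on the ring."""
    lo, hi, x = lo % RING, hi % RING, x % RING
    return lo < x < hi if lo < hi else (x > lo or x < hi)

def lookup(start, key):
    """Route a request for 'key' from 'start'; returns the path of nodes visited."""
    path, current = [start], start
    while successor(key) != current:
        next_hop = current
        for f in finger_table(current):
            # forward to the largest known hash key that does not pass the target
            if in_interval(f, current, key):
                next_hop = f
        if next_hop == current:            # no closer finger known
            next_hop = successor(key)      # final hop to the responsible node
        path.append(next_hop)
        current = next_hop
    return path

print(lookup(0, 7))   # [0, 4, 6, 7] -- node 0 -> 4 -> 6 -> node holding hash key 7
```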
- FIG. 2 illustrates the concept of managing data in a dual-layer ring structure of a chord distributed hash table-based file system and an in-memory cache according to an embodiment of the present invention.
- In the present invention, each server manages files using only information about its peers; this scheme is applied not only to the file system but also to the in-memory cache, a cache memory interworking with the file system, so that files are managed in a dual-layer ring structure.
- As shown in FIG. 2, the distributed system is composed of the file systems 202, 206, 210, 214, 218, and 222 and the in-memory caches 200, 204, 208, 212, 216, and 220 managed by the respective servers.
- The file systems and in-memory caches of the plurality of servers are connected in a dual-layer ring structure to process data access requests from users.
- Here, the file system may refer to a mass storage device such as a hard disk, and the in-memory cache may refer to a cache memory or the like, but they are not limited thereto.
- A hash key range is allocated to each server's file system and in-memory cache, and information about these ranges is managed by the task scheduler described later so that a data access request message for a specific file can be handled.
- Each server is assigned the range of hash keys it is responsible for storing, and this range is set for both the file system and the in-memory cache.
- For example, the file system 202 of server A may be set to a hash key range of 56 to 5, and its in-memory cache 200 to a hash key range of 57 to 5.
- The file system 206 of server B may have a hash key range of 5 to 15, and its in-memory cache 204 a range of 5 to 11.
- The file system 210 of server C may have a hash key range of 15 to 26, and its in-memory cache 208 a range of 11 to 18.
- The file system 214 of server D may have a hash key range of 26 to 39, and its in-memory cache 212 a range of 18 to 39.
- The file system 218 of server E may have a hash key range of 39 to 47, and its in-memory cache 216 a range of 39 to 48.
- The file system 222 of server F may have a hash key range of 47 to 56, and its in-memory cache 220 a range of 48 to 57.
- The hash key range can be changed based on the number of data access requests that users make to each server; for example, for server B the hash key range of the file system 206 is 5 to 15, while the hash key range of the in-memory cache 204 is set to 5 to 11.
- The hash key range of the in-memory cache, rather than that of the file system, is changed because moving data between the file systems of different servers incurs a large overhead, whereas moving data between the in-memory caches of different servers is relatively cheap; moreover, if the data is not found in the in-memory cache, it can still be found in the file system, so a cache miss is not a serious problem.
- Accordingly, the hash key range allocated to the in-memory cache is adjusted based on the number of user data access requests, which improves data lookup efficiency.
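- The dual-layer range assignment described above can be illustrated with the following Python sketch, which uses the example ranges of FIG. 2; the range tables and function names are assumptions made for illustration, not the disclosed implementation.

```python
# Sketch: separate hash key range tables for the file-system layer and the
# in-memory-cache layer (ranges taken from the FIG. 2 example, 64-key ring).
# Ranges are half-open [start, end) on the ring; names are illustrative.

RING = 64

FILE_SYSTEM_RANGES = {          # server -> (start, end)
    "A": (56, 5), "B": (5, 15), "C": (15, 26),
    "D": (26, 39), "E": (39, 47), "F": (47, 56),
}

CACHE_RANGES = {                # cache ranges may differ from file-system ranges
    "A": (57, 5), "B": (5, 11), "C": (11, 18),
    "D": (18, 39), "E": (39, 48), "F": (48, 57),
}

def owner(key, ranges):
    """Return the server whose [start, end) range on the ring contains key."""
    key %= RING
    for server, (start, end) in ranges.items():
        if (start < end and start <= key < end) or \
           (start >= end and (key >= start or key < end)):
            return server
    raise ValueError("ranges do not cover the ring")

key = 56
print(owner(key, FILE_SYSTEM_RANGES))  # 'A'  -- file-system layer owner
print(owner(key, CACHE_RANGES))        # 'F'  -- cache layer owner differs
```

- Note that under these example ranges hash key 56 resolves to server F at the cache layer but to server A at the file-system layer, which is exactly the situation leading to the cache miss discussed with FIG. 7 below.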
- FIG. 5 illustrates a graph in which the past histogram of FIG. 3 and the most recent histogram of FIG. 4, each showing the distribution of accesses over hash keys, are combined using a moving-average formula; the graph of FIG. 5 represents the probability distribution with which data accesses occur.
- a graph of the cumulative probability distribution as shown in FIG. 6 may be calculated using the probability distribution as shown in FIG. 5.
- On the cumulative probability distribution, the interval of each server can be set so that every server covers the same amount of probability mass; in this case the hash key ranges of servers holding frequently accessed data (server 2 and server 4) are narrowed, and the hash key ranges of servers holding less frequently accessed data (server 1, server 3, and server 5) are widened. The hash key range allocated to the in-memory cache of each server is then changed to the range determined in this way according to each server's data accesses.
- In this way, the hash key range of the in-memory cache can be changed dynamically according to the probability distribution of data accesses per server, independently of the hash key range allocated to the file system, so the load can be balanced and overall performance increased.
- For example, since the in-memory cache 212 of server D does not receive many data accesses, it is assigned a wide hash key range to be in charge of.
- In contrast, the in-memory cache 204 of server B receives relatively many data accesses, so it is given a narrow hash key range and manages a range smaller than that of its file system.
- By reducing the hash key range allocated to the in-memory cache 204 of server B, part of the load that server B was responsible for is shifted to server C, so the load is balanced between the servers and performance is increased.
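- A minimal sketch of this rebalancing is shown below: the predicted distribution is the moving average of the past and recent histograms, the cumulative distribution is cut into intervals of equal probability mass, and the cut points become the new cache-layer hash key boundaries. The moving-average weight, the synthetic histograms, and the function names are assumptions for illustration.

```python
RING, N_SERVERS = 64, 6
ALPHA = 0.5   # assumed moving-average weight between the past and recent histograms

def predicted_distribution(past_hist, recent_hist, alpha=ALPHA):
    """Combine the past and most recent access histograms with a moving average
    and normalise the result into a probability distribution over hash keys."""
    mixed = [alpha * p + (1 - alpha) * r for p, r in zip(past_hist, recent_hist)]
    total = sum(mixed)
    return [m / total for m in mixed]

def equal_probability_boundaries(prob, n_servers=N_SERVERS):
    """Walk the cumulative distribution and cut it into n_servers intervals of
    equal probability mass; the cut points become the new cache-layer boundaries."""
    boundaries, cumulative, target = [], 0.0, 1.0 / n_servers
    for key, p in enumerate(prob):
        cumulative += p
        while len(boundaries) < n_servers - 1 and cumulative >= target * (len(boundaries) + 1):
            boundaries.append(key)
    return boundaries

# toy example: hash keys 20-29 are accessed far more often than the rest
past = [1.0] * RING
recent = [20.0 if 20 <= k < 30 else 1.0 for k in range(RING)]
print(equal_probability_boundaries(predicted_distribution(past, recent)))
# servers covering the hot keys receive narrow ranges; the others receive wide ranges
```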
- FIG. 7 illustrates the concept of performing a MapReduce task in a chord distributed hash table-based MapReduce system according to an embodiment of the present invention.
- The chord distributed hash table-based MapReduce system includes a plurality of servers whose file systems and in-memory caches are connected in a dual-layer ring structure, together with a task scheduler 702 and a resource manager 704.
- The task scheduler 702 manages data in the dual-layer ring structure consisting of the chord distributed hash table-based file system layer and the in-memory cache layer, and assigns MapReduce tasks to each server.
- The resource manager 704 manages the addition, removal, and recovery of servers, and also manages the upload of files.
- each server accesses the data blocks of the distributed hash table file system in a distributed manner.
- The centrally operated task scheduler 702 and resource manager 704 are implemented with minimal functionality to preserve the scalability of the system.
- The task scheduler 702 determines the hash key of each data block using a hash function, and determines from that hash key which server will store the data block.
- Each server can store and manage information about up to m other servers in its distributed hash table.
- The value of m can be selected by the system administrator and, when S is the total number of servers, m must satisfy 2^m - 1 > S.
- Since the cluster configuration of the present invention is relatively static, storing information about a large number of servers in the hash table does not significantly affect scalability, and it has the effect of improving data access performance.
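- Given the condition above (2^m - 1 > S), the smallest usable m can be computed as follows; the function name is an assumption made for illustration.

```python
def min_finger_entries(num_servers: int) -> int:
    """Smallest m with 2**m - 1 > num_servers, i.e. the number of entries a
    server's hash table needs so that the condition above holds."""
    m = 1
    while 2 ** m - 1 <= num_servers:
        m += 1
    return m

print(min_finger_entries(6))     # 3,  since 2**3 - 1 = 7 > 6
print(min_finger_entries(1000))  # 10, since 2**10 - 1 = 1023 > 1000
```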
- The distributed hash table, which holds the server information, allows files to be accessed directly based on the key range of each file and provides excellent scalability compared to conventional centralized systems such as Hadoop.
- The hash table stored in each server does not hold the metadata of all files but only the hash key ranges that the servers are in charge of, so the hash table is very small and incurs little overhead.
- When a server receives a data access request, it checks whether the hash key of the requested data is within its own range; if it is, it provides access to the data from its file directory, and if not, the request is passed on to the appropriate server.
- The distributed hash table is updated with the information of other servers only when a server is added or removed, and server failures are detected by periodically exchanging heartbeat messages. If a server fails, the resource manager 704 recovers the lost file blocks from the replicated blocks on other servers.
- Each file block is replicated using k different hash functions and stored on different servers, and the value of k can be adjusted by the system administrator.
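- A sketch of this replication scheme is given below; deriving the k hash functions by salting SHA-1 is an assumption for illustration, since the description only states that k different hash functions are used and that the replicas land on different servers.

```python
import hashlib

RING = 64
K = 3   # replication factor, adjustable by the administrator

def hash_key(block_id: str, salt: int) -> int:
    """k-th hash function, derived here by salting SHA-1 (an assumed construction)."""
    digest = hashlib.sha1(f"{salt}:{block_id}".encode()).hexdigest()
    return int(digest, 16) % RING

def replica_keys(block_id: str, k: int = K) -> list:
    """Hash keys under which the k replicas of a block are placed; each key
    usually falls into a different server's range on the ring."""
    return [hash_key(block_id, salt) for salt in range(k)]

print(replica_keys("file-38/block-0"))   # e.g. three distinct ring positions
```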
- The cache hierarchy of the in-memory caches 200, 204, 208, 212, 216, and 220 can be divided into two partitions, dCache and iCache, where dCache is the partition in which input data blocks are stored and iCache is the partition in which intermediate calculation results are stored.
- The load imbalance problem can be alleviated by allowing access to the cache of another server even when the input data blocks are not held locally.
- The present invention may configure and manage metadata such as an application ID, an input data block, and application parameters to identify each intermediate calculation result. When a new job arrives, this metadata is searched to determine whether existing intermediate calculation results can be reused; if reusable metadata is found, the stored intermediate calculation results are processed directly in the reduce phase without re-running the map phase.
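- The reuse check can be sketched as a lookup keyed on the metadata fields named above (application ID, input data block, application parameters); the dictionary-based store and the function names are illustrative assumptions.

```python
from typing import Any, Optional

# iCache sketch: intermediate results indexed by (application id, input block, parameters).
# The key fields come from the description; the dict-based store is an assumption.
_icache = {}

def _key(app_id: str, input_block: str, params: dict) -> tuple:
    return (app_id, input_block, tuple(sorted(params.items())))

def store_intermediate(app_id: str, input_block: str, params: dict, result: Any) -> None:
    _icache[_key(app_id, input_block, params)] = result

def lookup_intermediate(app_id: str, input_block: str, params: dict) -> Optional[Any]:
    """If a matching intermediate result exists, it can be fed straight to the
    reduce phase, skipping the map phase for this block."""
    return _icache.get(_key(app_id, input_block, params))

store_intermediate("wordcount", "block-5", {"case_sensitive": False}, {"the": 12, "map": 3})
print(lookup_intermediate("wordcount", "block-5", {"case_sensitive": False}))  # reusable
print(lookup_intermediate("wordcount", "block-5", {"case_sensitive": True}))   # None -> re-map
```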
- Whether an input data block is present in dCache is checked first, and the input data is accessed from the cache whenever it is available; if the input data block is not available in dCache, the input data is accessed through the distributed hash table file system layer.
- The centralized task scheduler 702 manages the hash key ranges of the in-memory caches, dividing the hash key space and allocating a region to each server.
- The hash key range allocated to each server is dynamically adjusted, and input data that receives a large number of access requests is stored on multiple servers.
- Through locality-aware fair scheduling, the task scheduler allows such input data to be replicated to multiple servers and processed on those servers, which ensures load balancing while maximizing cache utilization across successive jobs.
- As shown in FIG. 7, the task scheduler 702 allocates the hash key ranges of the in-memory caches 200, 204, 208, 212, 216, and 220 of the servers and also manages information on the hash key ranges of the file systems 202, 206, 210, 214, 218, and 222.
- A user's data access request for a specific file may be received through the data access application 700.
- The task scheduler 702 allocates MapReduce tasks to the servers that store the file requested by the user among the plurality of servers, and outputs the result of the MapReduce tasks in response to the data access request.
- The task scheduler 702 first obtains the hash key corresponding to the requested data by applying a hash function to the name of the file for which the user requested access.
- Assuming, for example, that the hash key of the file is 38, the task scheduler 702 refers to the hash key range information allocated to the file systems, finds that hash keys between 26 and 39 are stored on server D, and accesses the file system 214 of server D to obtain information about the file corresponding to hash key 38 (S1).
- The file corresponding to hash key 38 may be stored as a single file, or it may be divided into several blocks distributed across multiple servers; in either case, server D has all the information about the file corresponding to hash key 38.
- Server D then provides the metadata of the file to the task scheduler 702 (S2).
- Using the metadata, the task scheduler 702 finds that the file corresponding to hash key 38 is divided into two data blocks, with hash keys 5 and 56, distributed to other servers, and checks which servers' in-memory caches cover those hash keys.
- Using the hash key range information table 703, which records the hash key range assigned to the in-memory cache of each server, the task scheduler 702 checks which in-memory caches correspond to hash keys 5 and 56.
- The in-memory cache 204 of server B covers hash key 5, and the in-memory cache 220 of server F covers hash key 56.
- The task scheduler 702 thus knows that the file for which data access is requested is divided into two data blocks distributed to different servers, and that two mappers must be executed, one on server B and one on server F; map task scheduling is performed accordingly (S3).
- Servers B and F, on which the data blocks of the requested file are stored, receive the map task schedule information from the task scheduler 702, execute the map function, and generate intermediate calculation results.
- If the data for hash keys 5 and 56 are present in the in-memory cache 204 of server B and the in-memory cache 220 of server F, they are read directly from the corresponding in-memory caches; if they are not present there, the data are read from the file systems.
- For example, the data with hash key 56 belongs to the in-memory cache 220 of server F at the cache layer but to the file system 202 of server A at the file system layer, so a cache miss may occur.
- In that case, the task scheduler 702 has the data with hash key 56 read from the file system 202 of server A (S4).
- The data with hash key 5, on the other hand, is read from the file system 206 of server B, which stores it; since the hash key ranges allocated to the in-memory cache 204 and the file system 206 of server B are similar, a miss for this data is served from the file system of the same server.
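- The read path of steps S3-S4, trying the in-memory cache of the cache-layer owner first and falling back to the file-system owner on a miss, can be sketched as follows; the server classes and example contents are assumptions chosen to mirror hash keys 5 and 56 above.

```python
# Sketch of the dCache read path with file-system fallback (step S4).
# Server layout mirrors the FIG. 2 / FIG. 7 example; contents are illustrative.

class Server:
    def __init__(self, name):
        self.name = name
        self.dcache = {}       # in-memory cache partition for input data blocks
        self.filesystem = {}   # chord DHT file-system layer

SERVERS = {n: Server(n) for n in "ABCDEF"}
SERVERS["B"].dcache[5] = b"block with hash key 5"
SERVERS["B"].filesystem[5] = b"block with hash key 5"
SERVERS["A"].filesystem[56] = b"block with hash key 56"   # not cached anywhere yet

def read_block(key, cache_owner, fs_owner):
    """Try the cache-layer owner first; on a miss, read from the file-system owner."""
    cached = SERVERS[cache_owner].dcache.get(key)
    if cached is not None:
        return cached, "cache hit on " + cache_owner
    return SERVERS[fs_owner].filesystem[key], "cache miss, read from file system of " + fs_owner

print(read_block(5, cache_owner="B", fs_owner="B"))    # hit in B's dCache
print(read_block(56, cache_owner="F", fs_owner="A"))   # miss in F, fall back to A (S4)
```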
- When the map tasks are executed, intermediate calculation results are generated, and these results can have different hash keys depending on the data block.
- The task scheduler 702 checks which hash keys are produced for the intermediate calculation results during the map task and stores each result in the in-memory cache of the server whose hash key range covers it. For example, if the hash keys of the two output values map to the range managed by the in-memory cache 216 of server E and the range managed by the in-memory cache 208 of server C, respectively, the task scheduler 702 stores the result values in the in-memory cache 216 of server E and the in-memory cache 208 of server C (S5).
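- Step S5 can be sketched as grouping the intermediate results by the server whose cache-layer range covers their hash keys; the hash function and range table below are illustrative assumptions consistent with the FIG. 2 example.

```python
import hashlib

RING = 64
CACHE_RANGES = {"A": (57, 5), "B": (5, 11), "C": (11, 18),
                "D": (18, 39), "E": (39, 48), "F": (48, 57)}

def ring_hash(key: str) -> int:
    """Map an intermediate output key onto the 64-position ring (assumed hashing)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def cache_owner(hash_key: int) -> str:
    """Server whose cache-layer [start, end) range covers the hash key."""
    for server, (start, end) in CACHE_RANGES.items():
        if (start < end and start <= hash_key < end) or \
           (start >= end and (hash_key >= start or hash_key < end)):
            return server
    raise ValueError("uncovered hash key")

def place_intermediate_results(results: dict) -> dict:
    """Group map outputs by the server whose cache range covers their hash key (S5);
    the reduce task for each group then runs on that same server (S6)."""
    placement = {}
    for out_key, value in results.items():
        placement.setdefault(cache_owner(ring_hash(out_key)), {})[out_key] = value
    return placement

print(place_intermediate_results({"alpha": 3, "beta": 7}))
# e.g. {'E': {...}, 'C': {...}} depending on where each output key hashes
```

- Grouping the intermediate results this way is what allows the subsequent reduce tasks to run on the servers that already cache their inputs.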
- Since the intermediate calculation results of the map tasks are stored on server E and server C, the task scheduler 702 informs those servers that a reduce task must be executed on each of them; reduce task scheduling is thus performed (S6).
- Server E and server C execute the reduce function to generate the output file containing the final result of the data access request, and provide it to the user.
- As described above, in the chord distributed hash table-based MapReduce system, data is managed in a dual-layer ring structure consisting of a chord distributed hash table-based file system layer and an in-memory cache layer, and the probability distribution of users' data access requests is predicted; load balancing becomes possible by adjusting the hash key range of the chord distributed hash table of the in-memory cache layer and scheduling tasks based on the predicted distribution, and the cache hit rate can be increased.
Claims (16)
- 1. A chord distributed hash table-based MapReduce system comprising: a plurality of servers, each having a file system and an in-memory cache that store data based on a chord distributed hash table; and a task scheduler that manages the data stored in the file systems and in-memory caches in a dual-layer ring structure and, upon receiving a data access request for a specific file from the outside, allocates MapReduce tasks to the servers storing the requested file among the plurality of servers and outputs the result of the MapReduce tasks in response to the data access request.
- 2. The system of claim 1, wherein, upon receiving the data access request, the task scheduler extracts a hash key from the name of the file, checks the hash key range allocated to the in-memory cache of each server to find the server storing the file, receives metadata about the file from the found server, and allocates the MapReduce tasks to the servers in which the file is stored.
- 3. The system of claim 2, wherein the task scheduler receives, as the metadata, the data block structure of the file and information on the servers in which each data block is distributed and stored, and allocates the MapReduce tasks to the servers in which each data block is stored.
- 4. The system of claim 1, wherein the in-memory cache stores hash keys corresponding to data using the chord distributed hash table, storing the hash keys falling within the hash key range allocated to the in-memory cache together with the data corresponding to those hash keys.
- 5. The system of claim 4, wherein the task scheduler dynamically changes the hash key range of the in-memory cache of each server according to the frequency of data access requests to that server.
- 6. The system of claim 1, wherein the task scheduler stores, in the file system, intermediate calculation results generated while processing the MapReduce task for each data block of the file.
- 7. The system of claim 6, wherein the intermediate calculation results are generated to have different hash keys according to the data blocks and are distributed to different servers.
- 8. The system of claim 6, wherein the intermediate calculation results are stored in an intermediate result reuse cache area of the in-memory cache.
- 9. The system of claim 1, further comprising a resource manager that works with the task scheduler and manages the addition, removal, and recovery of servers and the upload of files.
- 10. A method for performing MapReduce tasks in a chord distributed hash table-based MapReduce system including a plurality of servers each having a file system and an in-memory cache and a task scheduler that allocates MapReduce tasks to the servers, the method comprising: managing, by the task scheduler, the data stored in the file systems and in-memory caches in a dual-layer ring structure; receiving a data access request for a specific file from the outside; extracting a hash key for the file and finding the server whose file system stores the file; receiving, from the found server, the data block structure of the file and information on the servers in which each data block is distributed and stored as metadata; allocating MapReduce tasks to the servers in which each data block is stored; and outputting the result of the MapReduce tasks in response to the request.
- 11. The method of claim 10, wherein the file system and the in-memory cache store the data based on a chord distributed hash table.
- 12. The method of claim 10, wherein the in-memory cache stores hash keys corresponding to data using the chord distributed hash table, is allocated a predetermined hash key range, and stores the hash keys falling within that range together with the data corresponding to those hash keys.
- 13. The method of claim 12, wherein the hash key range is dynamically changed for each server according to the frequency of data access requests to that server.
- 14. The method of claim 10, wherein the MapReduce tasks are processed on the servers in which the respective data blocks are stored, and intermediate calculation results generated while processing the MapReduce tasks are stored in the file system.
- 15. The method of claim 14, wherein the intermediate calculation results are generated to have different hash keys according to the data blocks and are distributed to different servers.
- 16. The method of claim 14, wherein the intermediate calculation results are stored in an intermediate result reuse cache area of the in-memory cache.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/516,878 US10394782B2 (en) | 2015-06-10 | 2015-06-10 | Chord distributed hash table-based map-reduce system and method |
PCT/KR2015/005851 WO2016199955A1 (ko) | 2015-06-10 | 2015-06-10 | Chord distributed hash table-based MapReduce system and method |
KR1020187003102A KR101928529B1 (ko) | 2015-06-10 | 2015-06-10 | Chord distributed hash table-based MapReduce system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2015/005851 WO2016199955A1 (ko) | 2015-06-10 | 2015-06-10 | Chord distributed hash table-based MapReduce system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016199955A1 (ko) | 2016-12-15 |
Family
ID=57503874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2015/005851 WO2016199955A1 (ko) | Chord distributed hash table-based MapReduce system and method | | |
Country Status (3)
Country | Link |
---|---|
US (1) | US10394782B2 (ko) |
KR (1) | KR101928529B1 (ko) |
WO (1) | WO2016199955A1 (ko) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10831552B1 (en) * | 2017-08-15 | 2020-11-10 | Roblox Corporation | Using map-reduce to increase processing efficiency of small files |
US10318176B2 (en) * | 2017-09-06 | 2019-06-11 | Western Digital Technologies | Real-time, self-learning automated object classification and storage tier assignment |
US10686816B1 (en) * | 2017-09-28 | 2020-06-16 | NortonLifeLock Inc. | Insider threat detection under user-resource bi-partite graphs |
US11080251B1 (en) | 2017-10-23 | 2021-08-03 | Comodo Security Solutions, Inc. | Optimization of memory usage while creating hash table |
CN109241298B (zh) * | 2018-09-06 | 2020-09-15 | 绍兴无相智能科技有限公司 | Semantic data storage scheduling method |
CN109753593A (zh) * | 2018-12-29 | 2019-05-14 | 广州极飞科技有限公司 | Spraying operation task scheduling method and unmanned aerial vehicle |
CN113407620B (zh) * | 2020-03-17 | 2023-04-21 | 北京信息科技大学 | Data block placement method and system based on a heterogeneous Hadoop cluster environment |
US11467834B2 (en) * | 2020-04-01 | 2022-10-11 | Samsung Electronics Co., Ltd. | In-memory computing with cache coherent protocol |
KR102500278B1 (ko) | 2020-10-30 | 2023-02-16 | 충남대학교 산학협력단 | MapReduce-based data conversion system and method for storing large volumes of LOD |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707136B2 (en) * | 2006-03-31 | 2010-04-27 | Amazon Technologies, Inc. | System and method for providing high availability data |
US8249638B2 (en) * | 2008-03-31 | 2012-08-21 | Hong Kong Applied Science and Technology Research Institute Company Limited | Device and method for participating in a peer-to-peer network |
US9069761B2 (en) * | 2012-05-25 | 2015-06-30 | Cisco Technology, Inc. | Service-aware distributed hash table routing |
US9934147B1 (en) * | 2015-06-26 | 2018-04-03 | Emc Corporation | Content-aware storage tiering techniques within a job scheduling system |
-
2015
- 2015-06-10 US US15/516,878 patent/US10394782B2/en active Active
- 2015-06-10 WO PCT/KR2015/005851 patent/WO2016199955A1/ko active Application Filing
- 2015-06-10 KR KR1020187003102A patent/KR101928529B1/ko active IP Right Grant
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080120314A1 (en) * | 2006-11-16 | 2008-05-22 | Yahoo! Inc. | Map-reduce with merge to process multiple relational datasets |
EP2634997A1 (en) * | 2008-05-23 | 2013-09-04 | Telefonaktiebolaget L M Ericsson AB (Publ) | Maintaining distributed hash tables in an overlay network |
US20120278323A1 (en) * | 2011-04-29 | 2012-11-01 | Biswapesh Chattopadhyay | Joining Tables in a Mapreduce Procedure |
KR20140119090A (ko) * | 2012-02-03 | 2014-10-08 | Microsoft Corporation | Dynamic load balancing in a scalable environment |
KR20140096936A (ko) * | 2013-01-29 | 2014-08-06 | (주)소만사 | Big data processing system and method for a DLP system |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463514A (zh) * | 2017-08-16 | 2017-12-12 | 郑州云海信息技术有限公司 | Data storage method and device |
CN107463514B (zh) * | 2017-08-16 | 2021-06-29 | 郑州云海信息技术有限公司 | Data storage method and device |
CN109885397A (zh) * | 2019-01-15 | 2019-06-14 | 长安大学 | Delay-optimized load task migration algorithm in an edge computing environment |
CN110392109A (zh) * | 2019-07-23 | 2019-10-29 | 浪潮软件集团有限公司 | Task scheduling method and system based on CMSP process orchestration |
CN110392109B (zh) * | 2019-07-23 | 2021-09-07 | 浪潮软件股份有限公司 | Task scheduling method and system based on CMSP process orchestration |
Also Published As
Publication number | Publication date |
---|---|
US10394782B2 (en) | 2019-08-27 |
US20170344546A1 (en) | 2017-11-30 |
KR101928529B1 (ko) | 2018-12-13 |
KR20180028461A (ko) | 2018-03-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15895023; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 15516878; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 20187003102; Country of ref document: KR; Kind code of ref document: A |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15895023; Country of ref document: EP; Kind code of ref document: A1 |