WO2021187194A1

WO2021187194A1 - Distributed processing system, control method for distributed processing system, and control device for distributed processing system

Info

Publication number: WO2021187194A1
Application number: PCT/JP2021/008961
Authority: WO
Inventors: 真路小川
Original assignee: 日本電気株式会社
Priority date: 2020-03-17
Filing date: 2021-03-08
Publication date: 2021-09-23
Also published as: JPWO2021187194A1; JP7435735B2

Abstract

Provided is a distributed processing system including: a master node that refers to the number of block transfers, stored in entire block transfer information, in the distributed processing system, selects a block to be copied, and transmits a request for creating a copy of the block; a first worker node that receives the request for creating a copy of the block from the master node, and transmits the block; and a second worker node that receives the block and stores a copy of the block.

Description

Distributed processing system, distributed processing system control method, and distributed processing system control device

The embodiments described later relate to, for example, a distributed processing system, a distributed processing system control method, and a distributed processing system control device.

The following disclosure is an example and does not limit the scope of the present application or the present invention.

Distributed processing systems consisting of two or more computers are in general use. A distributed processing system can, for example, perform distributed parallel processing, in which a requested job is deployed to a plurality of nodes and executed in parallel. The parallel distributed storage used in a distributed processing system stores data in units called blocks, which are obtained by dividing one file into pieces of a certain size (for example, 128 MB). Parallel distributed storage may consist of two or more nodes or computers. Parallel distributed storage can, for example, replicate each block so that three copies exist and distribute the copies across multiple nodes.

If the block to be processed by a requested job is stored on the node that executes the job, the block can be read from the storage of that node. If the block to be processed is not stored on that node, a transfer of the block may be requested from another node that holds it.

In this way, network transfer of blocks can occur when the processing target block does not exist on the node that executes the job. If the network transfer of blocks occurs frequently, the performance of the entire distributed processing system may deteriorate due to the increase in network load.

For example, a large number of users may access the distributed processing system at the same time and a large number of jobs may be executed. In such a situation, many jobs may be executed on a node other than the node where the block to be processed is stored. In such a case, network transfer of the block to be processed may occur frequently. Therefore, the amount of data transfer may increase in the entire cluster constituting the distributed processing system.

Of the patent documents described below, for example, Patent Document 4 may be related to a "distributed system". However, the overall architecture of Patent Document 4 cannot eliminate the increase in network load as described above.

In Patent Document 4, the blocks of the movement source and the movement destination may be exchanged (Fig. 4, paragraphs 0034 to 0036, etc.). However, even if the blocks are simply replaced, if a large number of accesses occur to a certain block at the same time, the access can be concentrated on a specific node that stores the block. In Patent Document 4, a large number of access requests may cause a bottleneck or a delay in response.

Patent Document 4 also describes an example of Apache Hadoop replication. However, Patent Document 4 merely replaces the blocks, as shown in step 47 of FIG. Patent Document 4 may refer to Apache Hadoop's replication policy at the time of this exchange. That is, Patent Document 4 merely discloses that the blocks are exchanged. In Patent Document 4, when a large amount of access occurs to a certain block by simultaneous connection, there is a possibility that the access is concentrated on a specific node having the block. In fact, Patent Document 4 does not disclose or suggest when to perform replication.

The above disclosure is an example and does not limit the scope of the present application or the present invention.

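Patent Document 1: Japanese Unexamined Patent Publication No. 2017-191387
Patent Document 2: Japanese Unexamined Patent Publication No. 2017-016404
Patent Document 3: Japanese Translation of PCT International Application Publication No. 2015-532997
Patent Document 4: Japanese Unexamined Patent Publication No. 2014-186364
Patent Document 5: Japanese Unexamined Patent Publication No. 2004-126716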

The following is an example, and does not limit the scope of the present application or the present invention.

In general, in a distributed processing system, it is preferable to suppress the network transfer of blocks as much as possible and suppress the network load. Thereby, in the distributed processing system, the performance of the entire system can be improved.

The above points do not limit the scope of the present application or the present invention.

The following description is an example and does not limit the scope of the present application or the present invention.

In the embodiment described later, for example, a distributed processing system can be disclosed. This distributed processing system may include: a master node that refers to the number of block transfers in the distributed processing system stored in the total block transfer information, selects the block to be replicated, and sends a request to create a duplicate of the block; a first worker node that receives the request to create a duplicate of the block from the master node and sends the block; and a second worker node that receives the block and stores a duplicate of the block.

In the embodiment described later, for example, a control method of a distributed processing system can be disclosed. This control method is a control method of a distributed processing system including a master node and a worker node, and may include selecting a block to be replicated by referring to the number of block transfers in the distributed processing system stored in the total block transfer information, and storing a duplicate of the block.

In the embodiment described later, for example, a control device in a distributed processing system can be disclosed. This control device may refer to the number of block transfers in the distributed processing system stored in the total block transfer information, select the block to be duplicated, and send a request to create a duplicate of the block to the worker node.

According to the embodiment described later, for example, in a distributed processing system, the network transfer of blocks can be suppressed and the network load can be suppressed. However, this does not limit the scope of the present application or the present invention.

FIG. 1 is a schematic view showing a first embodiment of a distributed processing system.
FIG. 2 is a timing chart showing the timing at which processing is executed.
FIG. 3 is a diagram showing an example of the total block transfer information.
FIG. 4 is a diagram showing an example of the total block read information.
FIG. 5 is a diagram for explaining the process of acquiring block transfer information.
FIG. 6 is a diagram for explaining the process of acquiring block transfer information.
FIG. 7 is a diagram for explaining the process of acquiring block read information.
FIG. 8 is a diagram for explaining the process of acquiring block read information.
FIG. 9 is a diagram showing the process of collecting block transfer information and block read information.
FIG. 10 is a diagram showing the process of collecting block transfer information and block read information.
FIG. 11 is a schematic diagram showing the processing of the block allocation determination unit.
FIG. 12 is a flowchart explaining the process of FIG. 11 in detail.
FIG. 13 is a diagram showing an example of the process of step E1 of FIG. 12.
FIG. 14 is a diagram showing an example of the block position information.
FIG. 15 is a diagram showing an example of a distributed processing system.
FIG. 16 is a diagram explaining parallel distributed processing of a job.
FIG. 17 is a diagram explaining parallel distributed processing of a job.
FIG. 18 is a schematic diagram showing the processing when a client requests the distributed processing system to write a file.
FIG. 19 is a schematic diagram of the execution of a job in a distributed processing system.
FIG. 20 is a diagram showing an example of the configuration of a computer.
FIG. 21 is a diagram showing a control device of a distributed processing system.
FIG. 22 is a flowchart showing the processing flow of the control device of a distributed processing system.

Hereinafter, some embodiments will be described with reference to the drawings. The following disclosure is exemplary. The present application and the present invention should not be construed as being limited to the following disclosures.

"Example of distributed processing system"
An example of a distributed processing system is Apache Hadoop. A distributed processing system can be a system that has both parallel distributed processing capabilities and a parallel distributed storage area. FIG. 15 is a diagram showing an example of a distributed processing system. The distributed processing system or cluster server system shown in FIG. 15 may be composed of a master node and at least one worker node.

The master node is a control device or computer that manages the resources of the entire cluster server system. The master node includes a cluster resource management unit and a cluster data management unit. The master node has a master node storage for storing block position information.

A worker node is a computer connected to a master node via a network. Worker nodes execute jobs and store data. The worker node includes a job execution management unit, a data management unit, and a data communication unit. The worker node has worker node storage. In this distributed processing system / cluster server system, data is distributed and stored in a plurality of worker nodes (1 to n, n is a natural number of 1 or more). This distributed processing system / cluster server system performs parallel distributed processing of jobs.

The client is a computer connected to this distributed processing system / cluster server system via a network. For example, the cluster server system of FIG. 15 executes a job in response to a request from a client.

Figures 16 and 17 are diagrams illustrating the operation of parallel distributed processing of jobs. The client sends a job execution request to the master node (Fig. 16, "1. Request job execution"). The master node receives the request from the client. The master node selects a worker node that has sufficient resources to start the job management process in order to execute the job corresponding to the received request. A job can include multiple jobs that can be run in parallel. The master node determines the jobs to assign to the selected worker node. The master node requests the selected worker node to start the job management process (Fig. 16, "2. Assign the start of the job management process").

The job execution management unit of the worker node that received the request starts the job management process (Fig. 16, "3. Job management process start"). The started job management process acquires, from the master node, the resources for executing the job assigned by the master node (Fig. 17, "4. Request job execution resource", "5. Get job execution resource"). In other words, in FIGS. 16 and 17, the worker node that executes the job is selected, the job is assigned to the selected worker node, and the job is executed in parallel.

Here, parallel distributed storage in a distributed processing system will be described. In parallel distributed storage, one file is divided into pieces of data of a predetermined size (for example, 128 MB). Each piece of divided data is called a block. In parallel distributed storage, blocks are replicated (e.g., three replicas) and stored on worker nodes.

FIG. 18 is a schematic diagram showing the processing when the client requests the distributed processing system to write a file. The client requests the master node to write the file to the cluster server system. At this time, the client notifies the master node of the size of the file to be written. The cluster data management unit of the master node receives the request from the client (Fig. 18, "1. Request to write to"). The cluster data management unit calculates the number of blocks that result when the file to be written is divided into 128 MB blocks. If the size of the file is 384 MB, the file is divided into 3 blocks. The cluster data management unit randomly selects three worker nodes. For example, the cluster data management unit selects worker nodes 1, 2, and 3.
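
The block-count calculation and the random selection of write destinations described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the embodiment: the function names `count_blocks` and `select_write_destinations` are hypothetical, the pool of five worker nodes is an assumption, and the 128 MB block size is the example value from the text.

```python
import math
import random

BLOCK_SIZE_MB = 128  # example block size taken from the text


def count_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of blocks a file of the given size is divided into."""
    return math.ceil(file_size_mb / block_size_mb)


def select_write_destinations(worker_ids, num_blocks, rng=random):
    """Randomly pick one distinct write-destination worker node per block.

    Hypothetical helper; assumes there are at least `num_blocks` worker nodes.
    """
    return rng.sample(list(worker_ids), num_blocks)


num_blocks = count_blocks(384)  # a 384 MB file
destinations = select_write_destinations(range(1, 6), num_blocks)
```

With a 384 MB file, `count_blocks` yields 3, so three distinct worker nodes are drawn from the assumed pool.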

The cluster data management unit sends the information of the worker node to be written to the client. The client receives the information of the worker node to be written from the cluster data management unit (Fig. 18, "2. Notify the write destination").

The client requests the data management units of the write destination worker nodes (1, 2, and 3) to write the divided file (blocks) (Fig. 18, "3. Request to write the divided block"). A worker node that receives the write request writes the block to its own worker node storage. The worker node also requests other worker nodes to write duplicates of the block. For example, worker node 1 requests worker node 2 to write a duplicate. The number of duplicates may be a predetermined number. The write destination of a duplicate may also be predetermined. In this way, the cluster server system creates a predetermined number of duplicates of each block and stores them in a plurality of worker nodes in a distributed manner.

FIG. 19 is a schematic diagram of job execution in the distributed processing system / cluster server system. Jobs running in the distributed processing system refer to the blocks stored in the distributed processing system. If the block referenced by the job is stored on the worker node where the job is running, the block is read from the worker node storage of that worker node. On the other hand, if the block referenced by the job is not stored on the worker node on which the job is executed, the block is transferred from a worker node that stores it. FIG. 19 shows this process. For example, if worker node 2, on which the job is executed, does not have block B referenced by the job, worker node 2 requests worker node 3 to send block B. Worker node 3 sends block B, and worker node 2 executes the job.
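
The choice between a local read and a network transfer can be sketched as follows. This is an illustrative assumption, not the embodiment's implementation: the function `fetch_block` and the `storages` layout (worker node id mapped to a dictionary of block data) are hypothetical.

```python
def fetch_block(block_id, local_node, storages):
    """Return (data, source_node) for the block referenced by a job.

    `storages` maps a worker node id to {block_id: data} (hypothetical layout).
    The block is read locally when possible; otherwise it is transferred
    from another worker node that stores it.
    """
    local = storages[local_node]
    if block_id in local:
        return local[block_id], local_node  # local read, no network transfer
    for node, blocks in storages.items():
        if node != local_node and block_id in blocks:
            return blocks[block_id], node  # network transfer from the holder
    raise KeyError(f"block {block_id!r} is not stored on any worker node")


# Worker node 2 runs the job but only worker node 3 stores block B.
storages = {2: {"A": b"a-data"}, 3: {"B": b"b-data"}}
data, source = fetch_block("B", local_node=2, storages=storages)
```

Every call that falls through to the second branch corresponds to one network transfer, which is the event the embodiment described later counts.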

"First embodiment"
FIG. 1 is a schematic view showing a first embodiment of a distributed processing system. As shown in FIG. 1, this embodiment includes a client 1 and a cluster server system 2 (distributed processing system) to which the client 1 connects. In the following description, the part that overlaps with the above example of the distributed processing system is omitted.

The cluster server system 2 of FIG. 1 may be composed of a plurality of server machines. As shown in FIG. 20, each server machine may be a computer including a CPU 201, a memory 202, a storage 203 (HDD (hard disk drive) or SSD (solid state drive)), a transmitter/receiver 204, and the like. An operating system (OS) is installed on each server machine. Each server machine may be either a physical machine or a virtual machine. The server machines may be a plurality of computers arranged in the same housing, or may be arranged, in whole or in part, in different housings. The server machines are connected to each other via a network, and all or part of them may be located at a remote location.

The server machines of the cluster server system 2 include the master node 21 and the worker node group 22. The master node 21 manages the resources of the entire cluster server system 2. Each of the worker nodes 1 to n (n is a natural number) of the worker node group 22 executes jobs and/or stores data (blocks). The master node 21 may be composed of a plurality of master nodes. The worker node group 22 is composed of a plurality of worker nodes 221 to 22n, where "n" indicates that the number of worker nodes is n.

The master node 21 has a cluster resource management unit 211, a cluster data management unit 212, a block information collection unit 213 for the entire cluster, a block allocation determination unit 214, and a master node storage 215. These components are implemented as computer programs and may be executed by the CPU (central processing unit) of the master node 21.

The cluster resource management unit 211 manages the resources of the server machines of the entire cluster server system 2 in order to allocate jobs to worker nodes 1 to n at the time of job execution. The cluster data management unit 212 manages the storage location of each block and the state of each worker node, including the free space of its worker node storage.

The block information collection unit 213 of the entire cluster collects and saves the block transfer information 22151 and the block read information 22152, which will be described later, from the worker node group 22. The block allocation determination unit 214 determines a block for which the number of duplicates needs to be increased or decreased, and gives a replication instruction to the corresponding worker node.

The master node storage 215 is, for example, a storage device such as an HDD or SSD. The master node storage 215 has block position information 2151, total block transfer information 2152, and total block read information 2153.

The block position information 2151 indicates, for each file stored in the cluster server system 2, in which worker node each of the blocks constituting the file is stored. The total block transfer information 2152 records the number of times a block transfer has occurred between worker nodes in the entire cluster server system 2. The total block read information 2153 records the number of times each block has been read in the entire cluster server system 2.

Next, worker nodes 221 to 22n will be described. For example, the worker node 221 has a job execution management unit 2211, a data management unit 2212, a data communication unit 2213, a block information collection unit 2214, and a worker node storage 2215. The job execution management unit 2211 assigns and executes jobs. The data management unit 2212 reads and writes blocks. The data communication unit 2213 transfers the block to another node. The block information collection unit 2214 will be described later.

Worker node storage 2215 of worker node 221 stores blocks. Further, the worker node storage 2215 stores the block transfer information 22151 and the block read information 22152. The block transfer information 22151 is information that records the number of times the block has been transferred from the worker node 1 to another worker node. The block read information 22152 is information that records the number of times each block is read.

The block information collection unit 2214 collects block transfer information 22151 and block read information 22152. The block information collection unit 2214 transmits the block transfer information 22151 and the block read information 22152 to the block information collection unit 213 of the entire cluster of the master node 21. This transmission may be performed, for example, at predetermined time intervals. Alternatively, this transmission may be performed in response to an instruction from the administrator of the cluster server system (cluster server management system) 2.

Worker nodes 222 to 22n have the same configuration as worker node 221. Further, the cluster server system 2 may be configured so that the worker node storages 2215 of the worker nodes 221 to 22n can be logically treated as one file system from the viewpoint of the client. As a result, distributed storage can be realized. Each worker node storage may have one or more HDDs or SSDs. Each worker node storage may be configured to access network storage or storage area networks.

Hereinafter, the processing flow of the cluster server system 2 in this embodiment will be described. Here, it is assumed that the data (block) is already stored in all or a part of the worker nodes of the worker node group 22.

FIG. 2 is a timing chart showing the timing at which the processing described later is executed. The block allocation determination unit 214 of the master node 21 may execute the block replication number increase/decrease process (block allocation determination) at a timing specified by the user. This timing may be, for example, midnight every 24 hours. Alternatively, the block allocation determination unit 214 may execute the block replication number increase/decrease process (block allocation determination) when a user's instruction is input. Alternatively, the block allocation determination unit 214 may execute the block replication number increase/decrease process (block allocation determination) when it detects that the usage rate of the worker node storage 2215 of at least one worker node has reached a predetermined threshold value. This predetermined threshold may be, for example, 90%. Alternatively, the user may specify this predetermined threshold.
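
The three trigger conditions above can be combined as in the following sketch. It is only an illustration under stated assumptions: the function name `should_run_allocation` and its parameters are hypothetical, usage rates are expressed as fractions between 0 and 1, and 0.90 is the example threshold from the text.

```python
def should_run_allocation(storage_usage_rates, threshold=0.90,
                          user_requested=False, scheduled_time_reached=False):
    """Return True if any trigger for the block allocation determination holds.

    `storage_usage_rates` maps a worker node id to the usage rate (0.0 to 1.0)
    of its worker node storage (hypothetical layout).
    """
    storage_trigger = any(rate >= threshold for rate in storage_usage_rates.values())
    return user_requested or scheduled_time_reached or storage_trigger


usage = {221: 0.50, 222: 0.92, 223: 0.40}
run_now = should_run_allocation(usage)  # worker node 222 has reached 90%
```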

Further, the block information collection unit 213 of the entire cluster of the master node 21 may reset the values stored in the total block transfer information 2152 and the total block read information 2153 to zero when the block replication number increase/decrease process (block allocation determination) is executed. This reset process may be executed every time the block replication number increase/decrease process (block allocation determination) is executed, or once every predetermined number of executions. Alternatively, this reset process may be executed when an instruction from the user is input.

The number of block transfers is counted at each worker node from the time the block replication number increase/decrease process (block allocation determination) is executed until the next time it is executed, and the total block transfer information 2152 is updated. Likewise, the number of block reads is counted at each worker node over the same period, and the total block read information 2153 is updated.

FIG. 3 is a diagram showing an example of the total block transfer information 2152. The total block transfer information 2152 records the number of times a block transfer has occurred between worker nodes in the entire cluster server system 2. A, B, C, ... in FIG. 3 may be block identification information. For example, FIG. 3 shows that block B was transmitted 12 times from worker node 223 ("# 3" in FIG. 3) to other worker nodes, and that block D was transmitted twice from worker node 225 ("# 5" in FIG. 3) to other worker nodes.

FIG. 4 is a diagram showing an example of the total block read information 2153. The total block read information 2153 records the number of times each block has been read in the entire cluster server system 2. FIG. 4 shows, for example, that block B was read 20 times across the cluster server system 2.

Next, FIGS. 5 and 6 are diagrams for explaining the process of acquiring the block transfer information 22151 stored in the worker node storage 2215 in the worker nodes 221 to 22n.
The block transfer information 22151 records the number of times each block has been transferred to another worker node. The block transfer information 22151 may be, for example, a table in which block information, such as identification information of a block, is associated with the number of transfers. For example, suppose a job is running on worker node 222. If the worker node 222 does not store the block B to be processed by this job, the data communication unit of the worker node 223, which has the block B, transmits the block B to the data communication unit of the worker node 222. At this time, the data communication unit of the worker node 223 notifies the block information collection unit of the worker node 223 that the block B has been transmitted. The block information collection unit of the worker node 223 recognizes or confirms that the block has been transmitted from the data communication unit of the worker node 223 (step A1 in FIG. 6). The block information collection unit of the worker node 223 then accesses the block transfer information in the worker node storage and updates it to increment the transfer count of the transmitted block B (step A2 in FIG. 6). The same processing is performed on the other worker nodes.

Next, FIGS. 7 and 8 are diagrams for explaining the process of acquiring the block read information 22152 stored in the worker node storage 2215 in the worker nodes 221 to 22n. The block read information records the number of times each block has been read. The block read information may be, for example, a table in which block information, such as an identifier of a block, is associated with the number of times the block has been read. For example, when the data management unit in the worker node 223 reads the block B from the worker node storage, the data management unit notifies the block information collection unit that the block B has been read. The block information collection unit recognizes or confirms the block information of the read block B (step B1 in FIG. 8). The block information collection unit then accesses the block read information in the worker node storage of the worker node 223 and updates it to increment the read count of the block B (step B2 in FIG. 8). The same processing is performed on the other worker nodes.
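
The per-worker counting of steps A1 to A2 and B1 to B2 can be sketched as two counters keyed by block identifier. The class `BlockInfoCollector` and its method names are hypothetical stand-ins for the block information collection unit; the text only states that the information may be a table associating block information with a count.

```python
from collections import Counter


class BlockInfoCollector:
    """Hypothetical stand-in for the block information collection unit of one worker node."""

    def __init__(self):
        self.transfer_counts = Counter()  # block transfer information: block id -> transfers sent
        self.read_counts = Counter()      # block read information: block id -> reads

    def on_block_transferred(self, block_id):
        """Steps A1 to A2: record that this node sent the block to another worker node."""
        self.transfer_counts[block_id] += 1

    def on_block_read(self, block_id):
        """Steps B1 to B2: record that the block was read from this node's storage."""
        self.read_counts[block_id] += 1


collector_223 = BlockInfoCollector()
collector_223.on_block_read("B")         # data management unit reads block B
collector_223.on_block_transferred("B")  # data communication unit sends block B to worker node 222
```

A `Counter` returns zero for blocks that were never transferred or read, which matches a table that simply has no entry for them.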

FIGS. 9 and 10 are diagrams showing a process in which the master node 21 collects the block transfer information and the block read information of the worker nodes 221 to 22n. In each of the worker nodes 221 to 22n, the block information collection unit transmits the block transfer information and the block read information stored in the worker node storage to the block information collection unit 213 of the entire cluster of the master node 21 (step C1 in FIG. 10). This transmission process may be executed at a predetermined date and time, for example, every day at midnight. Alternatively, this transmission process may be executed when an instruction from the user or the administrator is input to the cluster server system 2. The block information collection unit 213 of the entire cluster stores the block transfer information received from each of the worker nodes 221 to 22n in the total block transfer information 2152 of the master node storage 215 (step C2 in FIG. 10). The block information collection unit 213 of the entire cluster likewise stores the block read information received from each of the worker nodes 221 to 22n in the total block read information 2153 of the master node storage 215 (step C2 in FIG. 10).

The timing at which the worker nodes 221 to 22n transmit the block transfer information and the block read information to the master node 21 may be set by the user or the administrator. Further, after these are transmitted, the values of the block transfer information and the block read information stored in the worker node storage of the worker nodes 221 to 22n may be reset to zero. Alternatively, the worker nodes 221 to 22n may reset the block transfer information and the block read information stored in the worker node storage to zero when the block transfer information and the block read information are transmitted a predetermined number of times.
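
The collection in steps C1 and C2, together with the optional reset on the worker side, can be sketched as follows. The function `collect_block_info` and the report layout are hypothetical. The transfer counts (block B sent 12 times from worker node 223, block D sent twice from worker node 225) and the cluster-wide read count of 20 for block B follow the examples given for FIGS. 3 and 4, but the split of those 20 reads between the two nodes is an assumption made for illustration.

```python
from collections import Counter


def collect_block_info(worker_reports, reset_after_send=True):
    """Merge per-worker counters into cluster-wide totals (steps C1 and C2).

    `worker_reports` maps a worker node id to a dict with Counter entries
    "transfers" and "reads" (hypothetical layout).
    """
    total_transfers = {}     # block id -> {worker node id: transfer count}
    total_reads = Counter()  # block id -> read count over the whole cluster
    for node_id, report in worker_reports.items():
        for block_id, count in report["transfers"].items():
            total_transfers.setdefault(block_id, {})[node_id] = count
        total_reads.update(report["reads"])
        if reset_after_send:
            report["transfers"].clear()  # optional reset to zero after sending
            report["reads"].clear()
    return total_transfers, total_reads


reports = {
    223: {"transfers": Counter({"B": 12}), "reads": Counter({"B": 15})},
    225: {"transfers": Counter({"D": 2}), "reads": Counter({"B": 5})},
}
total_transfers, total_reads = collect_block_info(reports)
```

Keeping the transfer totals per source node, rather than as a single sum per block, preserves the information needed to aggregate per block and per worker node later.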

Hereinafter, the processing of the block allocation determination unit 214 of the master node 21 will be described with reference to FIGS. 11, 12, 13, and 14. FIG. 11 is a schematic view showing the processing of the block allocation determination unit 214. The reference numerals of the components in FIG. 11 are the same as those of the corresponding components in FIG. 1. In FIG. 11, some reference numerals are omitted to make the drawing easier to read.

In FIG. 11, the master node storage 215 of the master node 21 stores the total block read information 2153 and the total block transfer information 2152. The block allocation determination unit 214 determines the blocks to be duplicated and the blocks to be deleted by referring to the total block read information 2153 and the total block transfer information 2152 (step S1101 in FIG. 11). In this example, the block allocation determination unit 214 decides to generate a duplicate of the block B and to delete a duplicate of the block A. The block allocation determination unit 214 transmits a replication instruction and/or a deletion instruction to the data management unit of the relevant worker node. In FIG. 11, the block allocation determination unit 214 transmits a replication instruction to the data management unit of the worker node 223 (step S1102 in FIG. 11) and a deletion instruction to the data management unit of the worker node 22n (step S1103 in FIG. 11).

In FIG. 11, for example, the block allocation determination unit 214 determines that block B is a replication target (step S1101 in FIG. 11). Worker node 223 is a worker node in which block B is stored. Worker node 222 is a worker node in which block B is not stored. Following the instruction from the master node 21, the data management unit of the worker node 223 reads the block B from the worker node storage. The data management unit of the worker node 223 transmits the read block B to the data management unit of the worker node 222 and requests that it be stored, so that the number of duplicates of the block B increases by one. The data management unit of the worker node 222 receives the block B from the data management unit of the worker node 223, and stores a copy of the block B in the worker node storage.

On the other hand, block A has, for example, four duplicates because it was determined to be frequently used in the past. The block allocation determination unit 214 of the master node 21 determines, for example, that the frequency of use of the block A has decreased. In this case, the block allocation determination unit 214 of the master node 21 determines that the number of duplicates of the block A should be reduced by one. In FIG. 11, when a deletion instruction is received from the block allocation determination unit 214 of the master node 21 (step S1103 in FIG. 11), the data management unit of the worker node 22n deletes the block A from the worker node storage (step S1105 in FIG. 11).
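
The effect of the replication and deletion instructions on the worker node storages can be sketched as follows. The helper names and the dictionary-based storage layout are hypothetical, and the initial placement of blocks A and B is an assumption chosen so that block A starts with four duplicates and block B with one.

```python
def replicate_block(block_id, source_node, target_node, storages):
    """Copy a block from the source worker's storage to the target worker's storage."""
    storages[target_node][block_id] = storages[source_node][block_id]


def delete_block(block_id, node, storages):
    """Remove one duplicate of a block from the given worker's storage."""
    del storages[node][block_id]


def replica_count(block_id, storages):
    """Count how many worker node storages currently hold the block."""
    return sum(block_id in blocks for blocks in storages.values())


# Hypothetical placement: block A on four worker nodes, block B only on worker node 223.
storages = {
    "221": {"A": b"a"},
    "222": {"A": b"a"},
    "223": {"A": b"a", "B": b"b"},
    "22n": {"A": b"a"},
}
replicate_block("B", source_node="223", target_node="222", storages=storages)
delete_block("A", node="22n", storages=storages)
```

After both instructions, block B is held by two worker nodes and block A by three, so the frequently transferred block gains a duplicate while the less used block gives one up.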

FIG. 12 is a flowchart illustrating the process of FIG. 11 in detail. FIG. 12 shows the processing flow of the block allocation determination unit 214. First, the process of increasing the number of duplicate blocks will be described (steps E1 to E6 in FIG. 12).

The block allocation determination unit 214 of the master node 21 reads the total block transfer information 2152 stored in the master node storage 215. The block allocation determination unit 214 aggregates the total block transfer information 2152. In this aggregation process, the block allocation determination unit 214 may calculate, for each block, the maximum of the transfer counts recorded for the individual worker nodes in the total block transfer information 2152. Alternatively, the block allocation determination unit 214 may calculate the average value for each block. The block allocation determination unit 214 selects, as replication targets, blocks whose aggregated value is equal to or greater than a threshold value (hereinafter referred to as a replication threshold value) (step E1 in FIG. 12). The user or administrator of the master node 21 may preset this replication threshold. Instead of using the replication threshold, the block allocation determination unit 214 may, for example, select a predetermined number of blocks with the highest transfer counts as replication targets.

FIG. 13 is a diagram showing an example of the process of step E1 in FIG. 12. In this example, the block allocation determination unit 214 aggregates the entire block transfer information 2152 and obtains the aggregation result (for each block, the maximum value across the worker nodes). In FIG. 13, for example, the aggregation result of block A is “6”. This indicates that the maximum number of transfers of block A is six. Similarly, the aggregation result of block C is “4”, which indicates that the maximum number of transfers of block C is four. The entire block transfer information 2152 may also store the block number. The block number may be a unique number in the cluster server system 2. For example, assume that the block number of block A is "345". The entire block transfer information 2152 may store "345" in association with "A". Based on this block number, the block allocation determination unit 214 may associate the entry with the information on the block stored in the block position information 2151 of FIG. 14.

In the example of FIG. 13, the replication threshold is set to 10. The block allocation determination unit 214 selects blocks B and E, whose aggregated values are equal to or greater than the replication threshold of 10, as the replication target blocks.
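As a non-limiting illustration, the aggregation and selection of step E1 may be sketched as follows. The table layout, the function names, and every number other than the maxima of blocks A (6) and C (4) and the threshold of 10 are assumptions made for this sketch and do not appear in FIG. 13.

```python
from typing import Dict, List

# Hypothetical layout: block name -> {worker node id -> number of transfers}.
# Taken from the description: the maximum for A is 6, the maximum for C is 4,
# and B and E reach the replication threshold of 10.
# Every other number is invented for illustration.
TRANSFERS: Dict[str, Dict[int, int]] = {
    "A": {221: 6, 222: 2, 223: 0},
    "B": {221: 12, 222: 3, 223: 1},
    "C": {221: 4, 222: 4, 223: 1},
    "D": {221: 0, 222: 5, 223: 2},
    "E": {221: 1, 222: 10, 223: 7},
}


def aggregate_max(transfers: Dict[str, Dict[int, int]]) -> Dict[str, int]:
    """For each block, take the largest per-worker-node transfer count."""
    return {block: max(counts.values()) for block, counts in transfers.items()}


def select_by_threshold(aggregated: Dict[str, int], threshold: int) -> List[str]:
    """Blocks whose aggregated value is equal to or greater than the threshold."""
    return sorted(b for b, value in aggregated.items() if value >= threshold)


def select_top_n(aggregated: Dict[str, int], n: int) -> List[str]:
    """Alternative to the threshold: the n blocks with the highest counts."""
    ranked = sorted(aggregated.items(), key=lambda item: (-item[1], item[0]))
    return [block for block, _ in ranked[:n]]
```

With the hypothetical table above, `select_by_threshold(aggregate_max(TRANSFERS), 10)` selects blocks B and E, which matches the example in the text. The average could be substituted for `max` to obtain the alternative aggregation.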

Refer again to FIG. 12. The block allocation determination unit 214 takes the blocks selected for duplication one at a time (step E2 in FIG. 12). The block allocation determination unit 214 refers to the block position information 2151, which indicates the worker nodes on which the selected block is located, and calculates the current number of duplicates of the selected block (replication target block) (step E3 in FIG. 12).

Here, FIG. 14 is a diagram showing an example of block position information 2151. In FIG. 14, "/user/user1/employee.txt" is a file name including a file path. This file path may be the path seen by the user (client) in the cluster server system 2. FIG. 14 shows that this file is composed of three blocks. Let each block name be block A, block B, and block C. The block A may correspond to the block number 345 which is the identification information of the block. In this example, for the sake of understanding, the block position information includes the block name. A block number may be used instead of the block name to identify the block. The block number may be a unique number in the cluster server system 2. Alternatively, in order to identify the block, the path of the file including the file name and the block number may be used in combination.

FIG. 14 shows that block A and its replicas are stored in worker nodes 221, 222, 223, and 224. In FIG. 14, worker node 221 is represented as "1" for simplicity. The same applies to the other worker nodes. Similarly, FIG. 14 shows that block B and its replicas are stored on worker nodes 221, 223, and 224. By referring to the block position information in FIG. 14, the block allocation determination unit 214 can see that, for example, a total of four copies of block A are stored. That is, the block allocation determination unit 214 calculates the current number of duplicates of block A as 4. The current number of duplicates can be calculated for the other blocks in the same way (step E3 in FIG. 12).
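As a non-limiting illustration, the count of step E3 may be sketched as follows. The nested dictionary layout and the function name are assumptions made for this sketch. Only the locations of blocks A and B are stated in the text, so block C is omitted.

```python
from typing import Dict, List

# Hypothetical in-memory form of the block position information:
# file path -> {block name -> worker nodes that store the block or a replica}.
BLOCK_POSITIONS: Dict[str, Dict[str, List[int]]] = {
    "/user/user1/employee.txt": {
        "A": [221, 222, 223, 224],
        "B": [221, 223, 224],
    },
}


def current_duplicates(positions: Dict[str, Dict[str, List[int]]],
                       path: str, block: str) -> int:
    """Count the distinct worker nodes holding the block.

    Because at most one copy of a block is kept per node, the number of
    nodes equals the number of copies (the block and its duplicates).
    """
    return len(set(positions[path][block]))
```

For the data above, the function returns 4 for block A and 3 for block B.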

Refer again to FIG. 12. For example, if the current number of duplicates of the selected block (replication target block) calculated from the block position information 2151 falls outside the range that is larger than the default value of the number of duplicates and not more than the total number of nodes, the block allocation determination unit 214 need not create a duplicate of the replication target block (step E4 in FIG. 12). For example, if the selected block (replication target block) is already stored in all worker nodes, duplicating it further would be wasteful. This determination avoids such useless duplication. As a result, the worker node storage can be used more efficiently across the entire cluster server system 2. The default value of the number of duplicates may be preset by the user or the administrator. The default value of the number of duplicates may be, for example, half the number of worker nodes. The default value of the number of duplicates may also be, for example, 0 (zero). In this case, duplicates can always be created until the number of duplicates equals the number of worker nodes.

Further, in step E4 of FIG. 12, for example, the block allocation determination unit 214 may also refrain from creating a duplicate of the replication target block when the current number of duplicates of the selected block (replication target block) calculated from the block position information 2151 is larger than the default value of the number of duplicates (default number of duplicates < current number of duplicates).
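As a non-limiting illustration, the range check of step E4 may be sketched as follows. The function and parameter names are assumptions. The upper bound is treated as strict here, following the wording of the third aspect ("less than the number of all worker nodes"). The description's wording "the total number of nodes or less" could also be read as inclusive, in which case no free node would remain when the two numbers are equal.

```python
def should_replicate(current: int, default: int, total_nodes: int) -> bool:
    """Return True when another duplicate of the block may be created.

    current     -- current number of copies, from the block position information
    default     -- preset default number of duplicates
    total_nodes -- number of worker nodes in the cluster
    """
    # Assumption of this sketch: strict upper bound, so that at least one
    # worker node that does not yet hold the block is still available.
    return default < current < total_nodes
```

With a default of 0, the check passes for any block that has at least one copy until every worker node holds it.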

When the current number of duplicates meets the condition of step E4 in FIG. 12 (step E4 in FIG. 12, "Yes"), the block allocation determination unit 214 selects all the worker nodes that do not store the replication target block. Only one duplicate of a given block can be stored per node. In this process, the block allocation determination unit 214 may refer to the block position information. The block allocation determination unit 214 inquires of each selected worker node whether its worker node storage has sufficient free space for storing the selected block (replication target block). Based on the result of this inquiry, the block allocation determination unit 214 treats the worker nodes having sufficient free space for storing the selected block (replication target block) as candidate worker nodes for storing a copy of the selected block.

The block allocation determination unit 214 randomly selects one worker node from the candidate worker nodes. For example, the block allocation determination unit 214 may create a list of candidate worker nodes and select one worker node using a predetermined random number. Alternatively, instead of a random number, the current time may be acquired in microseconds, and a predetermined number of its trailing digits may be used. Alternatively, the block allocation determination unit 214 may select the worker node having the largest free space in the worker node storage. The selected worker node is set as the storage destination worker node.
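As a non-limiting illustration, the selection of the storage destination worker node may be sketched as follows. The function names, the free-space table, and the byte units are assumptions made for this sketch. In a real system the free space would be obtained by querying each worker node.

```python
import random
from typing import Dict, Iterable, List, Optional, Set


def candidate_nodes(all_nodes: Iterable[int], holders: Set[int],
                    free_space: Dict[int, int], block_size: int) -> List[int]:
    """Worker nodes that do not hold the block and have room to store it."""
    return [n for n in all_nodes
            if n not in holders and free_space.get(n, 0) >= block_size]


def pick_random(candidates: List[int],
                rng: Optional[random.Random] = None) -> int:
    """Pick one candidate at random (raises IndexError if there is none)."""
    if rng is None:
        rng = random.Random()
    return rng.choice(candidates)


def pick_largest_free(candidates: List[int], free_space: Dict[int, int]) -> int:
    """Alternative: pick the candidate with the largest free space."""
    return max(candidates, key=lambda n: free_space[n])
```

For block B, held by worker nodes 221, 223, and 224, only worker nodes 222 and 225 can become candidates, and either one is then chosen according to the policy in use.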

The block allocation determination unit 214 refers to the block position information and sets one of the worker nodes in which the selected block (replication target block) is already stored as the source worker node. When there are a plurality of worker nodes in which the selected block (replication target block) is already stored, the block allocation determination unit 214 may arbitrarily select one of them. For example, the block allocation determination unit 214 may select the worker node listed first in the example of FIG. 14. Alternatively, the block allocation determination unit 214 may transmit a signal such as a ping to each worker node and select the node whose reply is received earliest.

The block allocation determination unit 214 requests the source worker node to send the selected block (replication target block) to the storage destination worker node. The block allocation determination unit 214 requests the storage destination worker node to store the selected block (replication target block). The source worker node sends the selected block (replication target block) to the storage destination worker node. The storage destination worker node receives the block (replication target block) selected from the source worker node, and stores the selected block (replication target block) (step E5 in FIG. 12). The block allocation determination unit 214 repeats the above processing for all the blocks selected in step E2 of FIG. 12 (step E6 of FIG. 12).
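As a non-limiting illustration, step E5 may be sketched as an in-memory simulation as follows. The dictionaries standing in for the worker node storage, the function name, and the choice of the first listed holder as the source are assumptions. Updating the position table after the copy is also an assumption of this sketch and is not stated in the description.

```python
from typing import Dict, List


def replicate_block(block: str, destination: int,
                    positions: Dict[str, List[int]],
                    storage: Dict[int, Dict[str, bytes]]) -> int:
    """Copy `block` from a source worker node to `destination`.

    positions -- block name -> worker nodes holding the block
    storage   -- worker node id -> {block name -> block contents}
    Returns the id of the source worker node that was used.
    """
    holders = positions[block]
    if destination in holders:
        # Only one copy of a block is kept per node.
        raise ValueError("destination already stores this block")
    source = holders[0]  # assumption: use the first listed holder as source
    # The source "sends" the block and the destination "stores" it.
    storage.setdefault(destination, {})[block] = storage[source][block]
    holders.append(destination)  # assumption: record the new location
    return source
```

In an actual cluster the two dictionary operations would be replaced by the requests that the master node sends to the source worker node and the storage destination worker node.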

Next, the process of reducing the number of duplicate blocks will be described with reference to FIG. 12 (steps E7 to E10 in FIG. 12).

The block allocation determination unit 214 selects the target blocks for which the number of duplicates is to be reduced. The block allocation determination unit 214 refers to the block position information 2151 and extracts blocks having a larger number of duplicates than the default value. FIG. 14 shows an example of block position information 2151. For this extraction, the block allocation determination unit 214 refers to the block position information 2151 and counts, for each block, the number of worker nodes (nodes) in which the block and its duplicates are stored. As described in the process of increasing the number of duplicate blocks (steps E1 to E6 in FIG. 12), only one duplicate of a block can be created per node. Therefore, counting the number of nodes yields the total number of copies of the block, that is, the block and its duplicates.

The block allocation determination unit 214 compares, for each block, the number of copies (the block and its duplicates) with the default value. When the number is greater than the default value, the block allocation determination unit 214 extracts the corresponding block. This default value may be a number preset by the user. If the usage rate of the worker node storage of the entire cluster server system 2 is larger than a predetermined threshold value, such as 90%, when the block allocation determination unit 214 starts the process of reducing the number of duplicate blocks, 1 may be subtracted from the default value.
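As a non-limiting illustration, the adjustment of the default value may be sketched as follows, reading the passage as lowering the default value by one when the cluster-wide storage usage exceeds the threshold. The function and parameter names, and the floor of zero, are assumptions made for this sketch.

```python
def effective_default(default_replicas: int, used_bytes: int,
                      capacity_bytes: int,
                      usage_threshold: float = 0.9) -> int:
    """Default number of duplicates to compare against when deleting.

    When the usage rate of the worker node storage across the cluster is
    larger than the threshold (90% in the example), the default is lowered
    by one so that more blocks qualify for deletion.
    """
    usage = used_bytes / capacity_bytes
    if usage > usage_threshold:
        # Assumption of this sketch: never go below zero.
        return max(default_replicas - 1, 0)
    return default_replicas
```

For example, with a default of 3 and a usage rate of 95%, blocks with three or more copies become deletion candidates instead of only those with four or more.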

A plurality of blocks may be extracted. The block allocation determination unit 214 may exclude in advance the replication target blocks selected in step E1 of FIG. 12 from the deletion targets (step E7 of FIG. 12). The block allocation determination unit 214 refers to the block read information 2153 of the entire cluster and acquires the number of times each extracted block has been read (step E8 in FIG. 12). The block allocation determination unit 214 determines, for each of the extracted blocks, whether or not the number of reads is equal to or less than a threshold value (hereinafter referred to as the READ threshold) (step E9 in FIG. 12).

For blocks whose read count is less than or equal to the READ threshold value, the block allocation determination unit 214 refers to the block position information and randomly selects a worker node in which such a block is stored. In this process, for example, the block allocation determination unit 214 may create a list of candidate worker nodes and select one worker node using a predetermined random number. Alternatively, instead of a random number, the current time may be acquired in microseconds, and a number having a predetermined number of digits at the end may be used. Alternatively, the block allocation determination unit 214 may select the worker node having the smallest free space in the worker node storage.

The block allocation determination unit 214 requests the selected worker node to delete the block to be deleted. The selected worker node receives the request from the block allocation determination unit 214 and deletes the corresponding block (step E10 in FIG. 12). When there are a plurality of extracted blocks, the block allocation determination unit 214 executes this process for each extracted block.
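As a non-limiting illustration, steps E7 to E10 may be sketched as follows. The function names, table layouts, and sample values are assumptions made for this sketch.

```python
import random
from typing import Dict, List, Optional, Set


def select_deletion_targets(positions: Dict[str, List[int]],
                            read_counts: Dict[str, int],
                            default_replicas: int,
                            read_threshold: int,
                            replication_targets: Set[str]) -> List[str]:
    """Blocks with more copies than the default and few enough reads."""
    targets = []
    for block, nodes in positions.items():
        if block in replication_targets:         # E7: exclude replication targets
            continue
        if len(set(nodes)) <= default_replicas:  # E7: need more copies than default
            continue
        if read_counts.get(block, 0) <= read_threshold:  # E8 and E9
            targets.append(block)
    return targets


def pick_node_to_delete_from(nodes: List[int],
                             free_space: Optional[Dict[int, int]] = None,
                             rng: Optional[random.Random] = None) -> int:
    """E10: choose the worker node whose copy is deleted.

    If free_space is given, the node with the smallest free space is used;
    otherwise a node is picked at random.
    """
    if free_space is not None:
        return min(nodes, key=lambda n: free_space[n])
    if rng is None:
        rng = random.Random()
    return rng.choice(nodes)
```

For each block returned by `select_deletion_targets`, the master node would then send a deletion request to the worker node returned by `pick_node_to_delete_from`.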

The READ threshold may be specified by the user. By the above-mentioned processing using the READ threshold value, the cluster server system 2 can determine the frequency of use of the block based on the number of reads. Cluster server system 2 may delete blocks that are determined to be underutilized. As a result, the storage of the entire system can be effectively used. On the other hand, the cluster server system 2 can determine that the blocks above the READ threshold are frequently used. Cluster server system 2 can maintain the number of replicas without deleting such blocks. As a result, the cluster server system 2 can read frequently accessed files at high speed.

The order of the process of increasing the number of duplicate blocks and the process of decreasing the number of duplicate blocks may be reversed. The process of increasing the number of duplicates of a block and the process of decreasing the number of duplicates of a block may be executed separately and independently.

FIG. 20 shows an example of a computer configuration. The above-mentioned node, worker node, master node, and the like may have a configuration that is the same as or similar to that in FIG. 20. The configuration of FIG. 20 may include a CPU (central processing unit), memory, storage, and a transmitter/receiver. The storage can be an HDD (hard disk drive) or an SSD (solid state drive). The storage may include storage on the network or a storage area network. Multiple computers may share storage. The computer of FIG. 20 may further include an optical drive or a magnetic recording device. The transmitter/receiver may be a 10BASE-T or 1000BASE-T interface, a wireless communication device, an optical communication device, or the like. In FIG. 20, the computer may be a general-purpose computer. The computer may include an OS (operating system). The computer may be composed of a plurality of computers. The computer may include a plurality of execution environments on one computer. The above embodiments may be implemented by software.

FIG. 21 is a diagram showing a control device (master node 31) of the distributed processing system according to the present embodiment. The master node 31 includes a selection means 311 for selecting a block to be replicated, and a request means 312 for sending a request to create a duplicate of the block to the worker node.

FIG. 22 is a flowchart showing the processing flow of the control device (master node 31) of the distributed processing system of FIG. 21. The selection means 311 selects the block to be duplicated (step E11 in FIG. 22). Request means 312 sends a replication request to the worker node (step E12 in FIG. 22).

"Other aspects"
In addition to the above embodiments, for example, the following aspects may be present.

The first aspect may be a distributed processing system including: a master node that refers to the number of block transfers in the distributed processing system stored in the total block transfer information, selects the block to be replicated, and sends a request to create a duplicate of the block; a first worker node that receives the request to create a duplicate of the block from the master node and sends the block; and a second worker node that receives the block and stores the duplicate of the block.

The second aspect may be the above-mentioned distributed processing system in which the master node refers to the number of worker nodes in which each block is stored, obtained from the block position information, and determines whether or not to send a request to create a duplicate of the block.

The third aspect may be the above-mentioned distributed processing system in which the master node sends a request to create a duplicate of the block when the number of duplicates of the block to be replicated is more than a predetermined number and less than the number of all worker nodes.

The fourth aspect is the above-mentioned distributed processing system, which may be a distributed processing system in which the master node selects a worker node having free resources for creating a duplicate of a block as a second worker node.

The fifth aspect is the above-mentioned distributed processing system, which can be a distributed processing system in which the master node randomly selects a second worker node from among the worker nodes that do not store the block to be replicated.

The sixth aspect may be the above-mentioned distributed processing system in which the master node determines the block to be deleted by referring to the block position information indicating the worker nodes in which each block is stored, selects a third worker node that stores the block to be deleted, and sends a deletion request to the third worker node, and in which the third worker node receives the deletion request and deletes the block to be deleted.

The seventh aspect may be the above-mentioned distributed processing system in which the master node refers to the number of storage destination worker nodes of each block, obtained from the block position information, and sends a deletion request to the third worker node when the number of storage destination worker nodes is larger than a predetermined number.

The eighth aspect is the above-mentioned distributed processing system, which can be a distributed processing system in which the master node randomly selects a third worker node from the worker nodes that store the blocks to be deleted.

The ninth aspect may be a control method for a distributed processing system including a master node and a worker node, the control method including a step of selecting the block to be duplicated by referring to the number of block transfers in the distributed processing system stored in the total block transfer information, and a step of storing a duplicate of the block.

The tenth aspect may be a control device in a distributed processing system, the control device referring to the number of block transfers in the distributed processing system stored in the total block transfer information, selecting the block to be duplicated, and sending a request to create a duplicate of the block to a worker node.

According to the above embodiment, for example, it is possible to suppress network transfer of blocks and improve the performance of the entire system by reducing the network load. In the above embodiment, the number of duplicates of frequently processed blocks can be increased, and the duplicates can be stored in the worker nodes in a distributed manner. Therefore, the number of worker nodes having such blocks can be increased across the system as a whole. This can reduce the frequency of block transfers from other worker nodes. In addition, the probability that the processing target block exists on the worker node on which the job is executed can be increased. Therefore, the job execution can be completed in a shorter time.

Further, according to the above embodiment, the duplicated block can be automatically deleted when the frequency of use of the block whose number of duplicates has already been increased decreases. Therefore, according to the above embodiment, it may be possible to prevent the storage resource from becoming tight in the entire cluster.

The above disclosure is an example. Therefore, the present application and the present invention should not be construed as being limited to the disclosure of the above embodiments and the like.

This application claims priority based on Japanese Patent Application No. 2020-046542 filed on March 17, 2020, and incorporates all of its disclosures here.

The present invention may be applied to a distributed processing system, a control method for a distributed processing system, and a control device for a distributed processing system.

211 Cluster resource management unit
212 Cluster data management unit
213 Block information collection unit
214 Block allocation determination unit
215 Master node storage
221 Worker node
222 Worker node
223 Worker node
224 Worker node
225 Worker node
2151 Block position information
2152 Overall block transfer information
2153 Overall block read information
2211 Job execution management unit
2212 Data management unit
2213 Data communication unit
2214 Block information collection unit
2215 Worker node storage
22151 Block transfer information
22152 Block read information

Claims

1. A distributed processing system comprising:
a master node that refers to the number of block transfers in the distributed processing system stored in the total block transfer information, selects the block to be duplicated, and sends a request to create a duplicate of the block;
a first worker node that receives the request to create a duplicate of the block from the master node and transmits the block; and
a second worker node that receives the block and stores a copy of the block.
2. The distributed processing system according to claim 1, wherein the master node refers to the number of worker nodes in which each block is stored, obtained from the block position information, and determines whether or not to send a request for creating a duplicate of the block.
3. The distributed processing system according to claim 1 or 2, wherein the master node sends a request to create a duplicate of the block when the number of duplicates of the block to be replicated is more than a predetermined number and less than the number of all worker nodes.
4. The distributed processing system according to any one of claims 1 to 3, wherein the master node selects a worker node having a free resource for creating a duplicate of the block as the second worker node.
5. The distributed processing system according to any one of claims 1 to 4, wherein the master node randomly selects the second worker node from the worker nodes that do not store the block to be replicated.
6. The distributed processing system according to any one of claims 1 to 5, wherein
the master node determines a block to be deleted by referring to the block position information indicating the worker nodes in which each block is stored, selects a third worker node that stores the block to be deleted, and sends a deletion request to the third worker node, and
the third worker node receives the deletion request and deletes the block to be deleted.
7. The distributed processing system according to claim 6, wherein the master node refers to the number of storage destination worker nodes of each block, obtained from the block position information, and sends the deletion request to the third worker node when the number of storage destination worker nodes is larger than a predetermined number.
8. The distributed processing system according to claim 6 or 7, wherein the master node randomly selects the third worker node from the worker nodes that store the block to be deleted.
9. A control method for a distributed processing system including a master node and a worker node, the control method comprising:
selecting the block to be duplicated by referring to the number of block transfers in the distributed processing system stored in the total block transfer information; and
storing a copy of the block.
10. A control device for a distributed processing system, comprising:
selection means for selecting the block to be duplicated; and
request means for sending a request to create a duplicate of the block to a worker node.