WO2014016951A1 - Information processing device - Google Patents

Information processing device

Info

Publication number
WO2014016951A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
information processing
processing apparatus
memory
storage device
Prior art date
Application number
PCT/JP2012/069078
Other languages
French (fr)
Japanese (ja)
Inventor
地尋 吉村
由子 長坂
秀貴 青木
Original Assignee
株式会社日立製作所
Priority date
Filing date
Publication date
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Priority to PCT/JP2012/069078 priority Critical patent/WO2014016951A1/en
Priority to JP2014526682A priority patent/JP5847313B2/en
Publication of WO2014016951A1 publication Critical patent/WO2014016951A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0223: User address space allocation, e.g. contiguous or non contiguous base addressing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14: Handling requests for interconnection or transfer
    • G06F13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10: Address translation
    • G06F12/1081: Address translation for peripheral access to main memory, e.g. direct memory access [DMA]

Definitions

  • the present invention relates to an information processing apparatus that performs processing by transferring data to a storage device by DMA transfer, and particularly relates to an information processing apparatus characterized by switching and executing a plurality of threads.
  • the computer is composed of a storage device that stores data and a central processing unit (CPU) that reads and processes data from the storage device.
  • CPU central processing unit
  • The faster the storage device, the higher its price (unit price) per bit.
  • The higher the speed, the lower the number of bits per unit area or unit volume (recording density). Therefore, a high-speed but expensive, small-capacity storage device is placed near the CPU and holds the data needed most immediately. Data that does not fit there is placed in a slow but inexpensive, large-capacity storage device, and data is exchanged between the two storage devices as needed.
  • Because storage devices involve a trade-off between speed and cost, or between speed and capacity, the so-called storage hierarchy concept, in which multiple storage devices with different properties are used hierarchically, has been widely adopted in the computer world.
  • the CPU processes the data stored in the register.
  • the CPU searches the cache and, if stored in the cache, reads the data from the cache into the register and performs processing.
  • the CPU reads data from the main memory into the cache. If there is no data to be processed in the main memory, the data is read from the storage into the main memory.
  • While data is being read, the CPU cannot carry out the processing it should perform, so it sits idle and the CPU utilization rate decreases. The same problem occurs not only when reading data but also when writing data.
  • DMA Direct Memory Access
  • Since data transfer can be performed without CPU intervention, the CPU should in principle be able to perform other processing during the idle time of the data transfer.
  • A method is also used in which the programmer schedules the timing of DMA transfers and the processing performed by the CPU, by identifying in advance the points in the program where data will be needed and explicitly embedding DMA transfer instructions.
  • this method causes a problem that program tuning becomes complicated.
  • Multithreading has been used as a technology to deal with this.
  • a unit of processing that can be performed concurrently is defined as a thread, and when execution of a certain thread stops, another executable thread is executed.
  • the thread execution stops not only when the processing of the thread is completed, but also when the data required by the thread is read. That is, when considered in conjunction with the previous data reading operation, when a certain thread is executed and necessary data starts to be read, another thread is executed during that time. In this way, the CPU utilization rate can be increased.
  • As prior art relating to such DMA transfer and multithreading, there are the techniques disclosed in Patent Document 1 and Patent Document 2. In both, DMA transfer is performed between an on-chip memory (also referred to as local memory) on the CPU and the main memory (also referred to as global memory).
  • on-chip memory also referred to as local memory
  • main memory also referred to as global memory
  • DRAM has a higher unit price than the HDDs, flash memory, and phase change memory used for storage, so configuring the main memory from a large amount of DRAM increases cost. DRAM is also inferior in recording density, so the device becomes huge compared with an HDD or flash memory of the same capacity. Therefore, the inventors of the present application attempted to solve the read-speed problem while providing the necessary capacity by performing DMA transfer between the main memory and a storage area to which data is saved from the main memory.
  • thread scheduling is often based on a simple FIFO that is executed by taking an executable thread from a queue. For this reason, uniform execution time of each thread enables efficient scheduling. Therefore, when dividing a process into threads, it is desirable to divide the processing so that the load is equal.
  • In those techniques, DMA transfer is performed between the on-chip memory on the CPU and the main memory, and the transferred data is limited to data that the CPU will process immediately, so they are not techniques for solving the above-described variation in data size.
  • An object of the present invention is to realize an efficient DMA transfer even in a situation where the data transfer amount varies.
  • The information processing apparatus includes a multi-thread processor, a first storage device, a second storage device that performs DMA transfer with the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on the physical address space.
  • The above-mentioned problem is solved by securing, on the physical address space, a memory space corresponding to the capacity that the thread scheduled for execution requires for DMA transfer, moving the thread to execution, and releasing the secured memory space after the processing of the thread is completed.
  • efficient DMA transfer can be realized even in a situation where the amount of data transfer varies, and the processing of the information processing apparatus can be speeded up.
  • In the following, an information processing system 100 will be described in which a nonvolatile memory such as flash memory or phase change memory is used to provide applications with a memory larger than one composed of DRAM, and in which the slowness that is a disadvantage of the nonvolatile memory is compensated for by multithreading.
  • FIG. 1 is a diagram illustrating an example of a configuration of an information processing system 100 according to the present embodiment.
  • the information processing system 100 has at least one node 110.
  • the node 110 is an information processing apparatus, for example, a server apparatus.
  • the example of FIG. 1 shows an example of a four-node configuration of nodes 0 to 3 (reference numeral 110).
  • the nodes are connected by an inter-node network 120.
  • the information processing system 100 may further include an NVM subsystem interconnect 130 that connects non-volatile memory (NVM) subsystems described later.
  • NVM non-volatile memory
  • FIG. 2 is a diagram illustrating an example of the configuration of the node 110 that is an information processing apparatus.
  • the node 110 includes processors 210 and 220, DIMMs 230 and 240, an I / O hub 250, a NIC 260, a disk controller 270, an HDD 280, an SSD 290, and an NVM subsystem 300.
  • the main memory is composed of DIMM 230 and DIMM 240 which are storage devices.
  • the DIMMs 230 and 240 are composed of DRAM which is a volatile memory.
  • each node 110 has at least one processor, and the node 110 in FIG. 2 is an example of a two-processor configuration of the processors 210 and 220.
  • each of the processors 210 and 220 may be a multi-core processor.
  • In this embodiment, each processor has two cores, so the node 110 as a whole has four cores.
  • each core may support simultaneous multithreading (SMT).
  • SMT simultaneous multithreading
  • In this embodiment, each core supports two-way SMT, so each processor can process four threads simultaneously. That is, each processor is a multi-thread processor.
  • threads that can be processed simultaneously as hardware are referred to as hardware threads.
  • the I / O hub 250 provides an interface for connecting various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300.
  • the I / O hub 250 is connected to the processors 210 and 220 via a system bus provided by each processor. For the connection, for example, a bus such as HyperTransport is used.
  • the I / O hub 250 is connected to various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300 by a peripheral bus for connecting peripheral devices such as PCI Express.
  • In this embodiment, the I / O hub 250, the NIC 260, the disk controller 270, and the NVM subsystem 300 are described as being connected by PCI Express, but the present invention can also be implemented with other interconnect means.
  • The capacity of the main memory of a computer is determined by the capacity of its DIMMs.
  • Data that does not fit in the main memory is stored in the HDD or SSD as storage.
  • the storage is connected via a disk controller, and an interface such as SAS (Serial Attached SCSI) or SATA (Serial Advanced Technology Attachment) is used in terms of hardware.
  • SAS Serial Attached SCSI
  • SATA Serial Advanced Technology Attachment
  • the interface seen from the software is a file system.
  • the application reads / writes the file, and the device driver of the operating system controls the disk controller via the file system to read / write the HDD and SSD. For this reason, reading and writing cannot be performed without going through a plurality of hierarchies, resulting in a large overhead.
  • The information processing system 100 includes an NVM subsystem 300 in order to read and write a nonvolatile memory that has a larger capacity than the DIMMs and is faster than HDD or SSD storage.
  • NVM non-volatile memory
  • FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem 300.
  • the NVM subsystem 300 includes a hybrid memory controller 310, a nonvolatile memory (NVM) 320 that is a storage device, and a volatile memory 330 that is a storage device.
  • the NVM 320 is a nonvolatile memory such as a flash memory or a phase change memory. Further, the volatile memory 330 is a DRAM, and a DIMM can be used.
  • the hybrid memory controller 310 is connected to the NVM 320, the volatile memory 330, and the I / O hub 250.
  • The hybrid memory controller 310 DMA-transfers the data stored in the NVM 320 to the DIMM 230 or 240, which is the main memory, in response to a request from software operating on the processor 210 or 220.
  • the hybrid memory controller 310 plays a role of DMA-transferring data stored in the DIMM 230 or 240 that is the main memory to the NVM 320.
  • the volatile memory 330 is used as a buffer during DMA transfer.
  • the hybrid memory controller 310 of each node can be connected by the NVM subsystem interconnect 130 as described above. This connection enables access to data stored in the NVM subsystem 300 of another node.
  • the hybrid memory controller 310 has a memory-mapped register (MMR: Memory Mapped Register) 311.
  • MMR Memory Mapped Register
  • the MMR 311 is a register used by software operating on the processors 210 and 220 to instruct the hybrid memory controller 310 to perform DMA transfer.
  • With PCI Express, it is possible to map the registers of peripheral devices into the same memory space as the main memory. Therefore, software can access the MMR 311 with the load/store instructions of the processors 210 and 220 in the same manner as when reading and writing the main memory.
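  • As a minimal sketch (not taken from this patent), once such an MMR region has been mapped into a process, it can be read and written with ordinary load/store accesses from C; the device path, register offsets, and command encoding below are purely illustrative assumptions.

```c
/* Hypothetical sketch of accessing a memory-mapped register (MMR) region
 * over PCI Express. Paths, offsets, and the command encoding are assumed
 * for illustration only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMR_SIZE        4096u
#define REG_DMA_CMD     0x00   /* hypothetical: DMA command register        */
#define REG_DMA_STATUS  0x08   /* hypothetical: DMA completion status flags */

int main(void)
{
    /* Map the PCI BAR that exposes the hybrid memory controller's MMR
     * (the sysfs path is an assumption). */
    int fd = open("/sys/bus/pci/devices/0000:00:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *mmr =
        mmap(NULL, MMR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmr == MAP_FAILED) { perror("mmap"); return 1; }

    /* An ordinary store instruction writes the register ... */
    mmr[REG_DMA_CMD / sizeof(uint64_t)] = 0x1;   /* e.g. "start DMA" */

    /* ... and an ordinary load reads it back, just like main memory. */
    uint64_t status = mmr[REG_DMA_STATUS / sizeof(uint64_t)];
    printf("status = 0x%llx\n", (unsigned long long)status);

    munmap((void *)mmr, MMR_SIZE);
    close(fd);
    return 0;
}
```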
  • On the node 110, an operating system that supports virtual memory operates.
  • the node 110 is composed of a plurality of cores as described above, but has a symmetric multiprocessing (SMP) configuration in which all the cores share a single main memory. Therefore, a single operating system operates on the node 110.
  • SMP symmetric multiprocessing
  • an embodiment will be described on the premise of a single system image in which one operating system is operated on the node 110.
  • the operating system operating on each node 110 allocates a physical address space to the main memory configured by the DIMMs 230 and 240 of each node 110 and provides a virtual address space on the physical address space.
  • Using the address translation mechanism of the processor's MMU (Memory Management Unit), the virtual address space for applications (user space) and the virtual address space in which the operating system kernel operates (kernel space) are separated to ensure system security and robustness.
  • the user space has an independent virtual address space for each unit of process.
  • a thread in an environment having such a concept of a process takes a form dependent on the process. That is, each process has one or more threads, and each thread operates by sharing the virtual address space of the parent process.
  • When a single operating system manages multiple cores, the operating system must abstract the cores in some way and assign them to processes so that applications are provided with an environment in which multiple cores can be used. For this purpose, the concept of threads is used. As shown in FIG. 4, the operating system provides kernel level threads to each process.
  • The number of hardware threads (M) is limited with respect to the number of threads to be executed (N). When N > M, the number of threads required by the application cannot be secured by hardware threads alone, and relying on context switching in the kernel lowers the efficiency of the entire system. Therefore, there is a need for a method for making a large number of threads available (increasing N) while avoiding context switching in the kernel.
  • As shown in FIG. 5, the information processing system 100 is characterized in that a master control thread 550 and user level threads are provided in each of the processes 410 and 420 of FIG. 4.
  • the process 410 has an inter-thread shared resource 510 and at least two or more kernel level threads allocated from the kernel.
  • a master control thread 550 is fixedly assigned to one of the plurality of kernel level threads.
  • user level threads required by the application are allocated to kernel level threads in a time division manner.
  • The master control thread 550 is a thread that always occupies one kernel level thread and continues to run for the lifetime of the process 410; it performs context switching of user level threads, scheduling, and management of resources shared between threads. Unlike kernel level threads, the master control thread 550 performs these operations within the process 410, so context switching can be realized at higher speed without switching to the kernel space. That is, the master control thread 550 can speed up context switching while a large number of threads are used.
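  • As an illustration only (this is not the patent's implementation), a user-level context switch performed entirely inside one process can be sketched with the POSIX ucontext API as follows; the stack size and control flow are assumptions.

```c
/* Minimal sketch of user-level context switching inside a single process.
 * The master context hands the CPU to a user-level thread and gets it back
 * when that thread yields, without a kernel-level thread switch. */
#include <stdio.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t master_ctx, worker_ctx;
static char worker_stack[STACK_SIZE];

static void worker(void)
{
    printf("user-level thread: running\n");
    /* Yield back to the master control context. */
    swapcontext(&worker_ctx, &master_ctx);
    printf("user-level thread: resumed, finishing\n");
}

int main(void)
{
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp   = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link          = &master_ctx;  /* return here when worker ends */
    makecontext(&worker_ctx, worker, 0);

    printf("master: dispatching worker\n");
    swapcontext(&master_ctx, &worker_ctx);      /* run worker until it yields   */
    printf("master: worker yielded, resuming it\n");
    swapcontext(&master_ctx, &worker_ctx);      /* resume worker to completion  */
    printf("master: done\n");
    return 0;
}
```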
  • threads operate by sharing process resources.
  • a model is used in which each thread has a unique stack and other areas are shared with other threads.
  • a PIN area resource allocation described later is further performed for a user level thread.
  • FIG. 6 is an explanatory diagram showing the correspondence between the physical address space 610 included in the node 110 and the virtual address space 620 included in the process 410 operating on the node 110.
  • the area arranged in the physical address space 610 basically corresponds one-to-one with the physical components of the node 110.
  • the DRAM area 611 corresponds to the DIMMs 230 and 240.
  • An MMIO (Memory Mapped Input / Output) area 612 is an area where the MMR 311 described above is arranged. All of these areas are managed in units called pages. In general, one page is 4 KB in size.
  • the virtual address space 620 of the process 410 can be roughly divided into a text area 621, a data area 622, a mmio area 623, and a stack area 624.
  • the process 410 of this embodiment further has a PIN area pool 516.
  • Each area in the virtual address space 620 will be described below by associating the internal structure of the process 410 shown in FIG. 5 with the virtual address space 620 of the process 410 shown in FIG.
  • the process 410 has various resources shared between user level threads as an inter-thread shared resource 510.
  • the inter-thread shared resource 510 includes a program code 511, a global variable 512, a heap area 513, thread management information 514, NVM transfer request management information 515, and a PIN area pool 516.
  • the program code 511 is an instruction sequence of a program to be executed by a thread, and is arranged in the text area 621 on the virtual address space 620.
  • the global variable 512 is a variable that is commonly used by any subroutine or thread that operates in the process 410, and is arranged in the data area 622.
  • the heap area 513 is a resource pool when the program dynamically secures memory, and is arranged in the data area 622.
  • The thread management information 514 stores the information needed to manage each thread. It is used mainly by the master control thread 550, but because it must also be accessible from the user level threads, it has the same properties as the global variable 512 and is placed in the data area 622.
  • the NVM transfer request management information 515 is information for managing DMA transfer described later, and is placed in the data area 622 for the same reason as the thread management information 514.
  • The stack area is an area that holds the stacks used for local variables and for passing subroutine parameters, and is allocated to each thread as described later.
  • the correspondence between the physical address space 610 and the virtual address space 620 is indicated by a broken line.
  • The physical address space 610 and the virtual address space 620 are mapped to each other on a page basis, but there can be pages that exist in the virtual address space 620 and yet have no corresponding page in the physical address space 610.
  • When such a page is accessed, the MMU generates a page fault exception, and the operating system reads the saved page from the HDD or SSD and pages it in.
  • In other words, the information processing system 100, which adopts virtual memory, has the property that a memory area (page) that exists from the viewpoint of a process does not necessarily exist in physical memory. The effect of this property on DMA transfer will be described later.
  • FIG. 7 shows the relationship between the PIN area pool 516 of the virtual address space 620 and the thread of the stack area 624.
  • Both the PIN area pool 516 and the stack area 624 are divided for each thread to be used.
  • The stack area 624 always contains a stack for every thread (the master control thread and the user level threads) of the process 410, whereas the PIN area pool 516 contains areas for only some of the threads, according to its size. This is because the stack area 624 can secure as much area as the virtual address space allows by using the virtual memory mechanism, whereas the PIN area pool 516, for the reason described later, is prepared only in an amount for which pages corresponding to the DRAM area 611 of the physical address space 610 can be secured.
  • the user level multithreading means that a plurality of threads (user level threads) are operated while being switched in the process 410. Since processing necessary for thread switching is completed in the process 410, the processing is faster than kernel level thread switching. On the other hand, in user level multithreading, thread management is also performed in the process 410. In the information processing system 100 of this embodiment, the master control thread 550 plays a role of thread management.
  • FIG. 8 is a diagram showing a user level thread by the master control thread 550 and a queue for managing DMA transfer. These queues are stored in the memory as thread management information 514 and NVM transfer request management information 515.
  • the thread management information 514 includes a READY queue 810, an IOWAIT queue 811, an NVMWAIT queue 812, and a FIN queue 813.
  • the entry enqueued in each queue is the thread management cell 900 shown in FIG.
  • The thread management cell 900 consists of a Valid flag 901, a thread ID 902, a thread state 903, a save context 904, a save stack pointer 905, a save program counter 906, a buffer request flag 907, a buffer request size 908, a buffer allocation flag 909, and a buffer area head address 910.
  • the Valid flag 901 is a flag indicating whether or not the thread management cell 900 is valid.
  • the thread ID 902 is an identifier for uniquely identifying a thread, and is used to realize an operation that is a feature of the present invention in which DMA transfer and thread scheduling described later are linked.
  • the thread state 903 is information for indicating what state the thread is currently in. The thread state will be described in detail later.
  • The save context 904, the save stack pointer 905, and the save program counter 906 are information used for executing a thread; they are saved from the registers of the processors 210 and 220 into the thread management cell 900 when the thread is stopped.
  • the buffer request flag 907, the buffer request size 908, the buffer allocation flag 909, and the buffer area head address 910 are used for DMA transfer described later, and will be described in detail later.
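  • For concreteness, the thread management cell 900 could be rendered in C roughly as below; the field widths, the queue link, and the state encoding are assumptions made for this sketch, not definitions from the patent.

```c
/* Hypothetical C layout of the thread management cell 900 of FIG. 9.
 * Field widths and the state values are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    THREAD_READY,    /* executable, queued in the READY queue        */
    THREAD_RUN,      /* currently running on a kernel-level thread   */
    THREAD_IOWAIT,   /* waiting for a system call to complete        */
    THREAD_NVMWAIT,  /* waiting for an NVM DMA transfer to complete  */
    THREAD_FIN       /* finished or aborted; cell awaiting reuse     */
} thread_state_t;

struct thread_management_cell {               /* 900 */
    bool            valid;                    /* 901: cell is in use             */
    uint64_t        thread_id;                /* 902: unique thread identifier   */
    thread_state_t  thread_state;             /* 903 */
    uint64_t        saved_context[32];        /* 904: saved general registers    */
    void           *saved_stack_pointer;      /* 905 */
    void           *saved_program_counter;    /* 906 */
    bool            buffer_request_flag;      /* 907: thread requests a PIN area */
    size_t          buffer_request_size;      /* 908: requested PIN area size    */
    bool            buffer_alloc_flag;        /* 909: PIN area has been granted  */
    void           *buffer_area_head;         /* 910: head address of the area   */
    struct thread_management_cell *next;      /* queue link (assumption)         */
};
```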
  • In the READY queue 810, the thread management cells 900 of executable user level threads are enqueued.
  • the master control thread 550 dequeues the thread management cell 900 from the READY queue 810 when a kernel level thread included in the process 410 is free or when another user level thread is stopped.
  • context switching is performed using the save context 904, save stack pointer 905, and save program counter 906 included in the dequeued thread management cell 900, and the execution of the thread is started.
  • Threads that are neither executing (RUN) nor being ready (READY) are in some waiting state.
  • the master control thread 550 uses the IOWAIT queue 811 and the NVMWAIT queue 812 to manage the waiting state.
  • having the NVMWAIT queue 812 is a feature of the information processing system 100 of the present embodiment.
  • the IOWAIT queue 811 is a queue in which threads waiting for the completion of I / O requested by the system call to the operating system are stored.
  • When a user level thread makes an I/O request that uses an operating system function, such as access to a file, it issues a system call to inform the operating system. If the thread that issued the system call then has no processing to perform until the system call completes, that is, if its processing resumes only after the system call finishes, the master control thread 550 saves the thread that issued the system call in the IOWAIT queue 811. When execution of the system call is completed, the master control thread 550 moves the saved thread to the READY queue 810.
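  • The queue handling implied above can be sketched as plain singly linked FIFOs over the hypothetical cell structure shown earlier; the helper names and the linear unlink are illustrative only.

```c
/* Illustrative FIFO queues of thread management cells, plus the move of a
 * thread from the IOWAIT queue to the READY queue when its system call
 * completes. Builds on the hypothetical struct sketched above. */
#include <stddef.h>

struct tmc_queue {
    struct thread_management_cell *head;
};

static void enqueue(struct tmc_queue *q, struct thread_management_cell *c)
{
    c->next = NULL;
    struct thread_management_cell **pp = &q->head;
    while (*pp)                      /* append at the tail */
        pp = &(*pp)->next;
    *pp = c;
}

static struct thread_management_cell *dequeue(struct tmc_queue *q)
{
    struct thread_management_cell *c = q->head;
    if (c) { q->head = c->next; c->next = NULL; }
    return c;
}

static void remove_cell(struct tmc_queue *q, struct thread_management_cell *c)
{
    struct thread_management_cell **pp = &q->head;
    while (*pp && *pp != c)
        pp = &(*pp)->next;
    if (*pp) { *pp = c->next; c->next = NULL; }
}

/* Master control thread: the system call a saved thread was waiting on has
 * completed, so make the thread executable again. */
static void on_syscall_complete(struct tmc_queue *iowait_q,
                                struct tmc_queue *ready_q,
                                struct thread_management_cell *c)
{
    remove_cell(iowait_q, c);
    c->thread_state = THREAD_READY;
    enqueue(ready_q, c);
}
```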
  • the NVMWAIT queue 812 is a queue in which threads waiting for completion of DMA transfer between the main memory and the nonvolatile memory (NVM) 320 are stored.
  • NVM nonvolatile memory
  • a thread corresponding to each of a large number of vertices is executed for large-scale graph processing.
  • the information processing system 100 stores data in the non-volatile memory (NVM) 320 and brings it from the NVM 320 to the main memory by DMA transfer when executing a thread.
  • the main memory is composed of DRAM
  • the NVM 320 is composed of flash memory and phase change memory, so that a large capacity can be realized at a lower cost than DRAM.
  • While the requested DMA transfer is in progress, the master control thread 550 saves the thread in the NVMWAIT queue 812.
  • When the DMA transfer is completed, the master control thread 550 moves the thread saved in the NVMWAIT queue 812 to the READY queue 810.
  • the FIN queue 813 is a queue for collecting the thread management cells 900 that have become unnecessary after execution is completed.
  • A thread management cell 900 exists for each thread. Therefore, when a large number of threads are repeatedly created and destroyed, as in dynamic large-scale graph processing, generating a thread management cell 900 by allocating memory from the heap area 513 each time, or returning the area to the heap area 513 each time, incurs a large overhead. Therefore, the master control thread 550 collects used thread management cells 900 in the FIN queue 813 and reuses them from the FIN queue 813 as necessary.
  • FIG. 10 summarizes the various states of the threads described so far as a state transition diagram.
  • the master control thread 550 enqueues the thread management cell 900 of the new thread into the READY queue 810.
  • When a resource for executing a thread becomes free, for example when an already executing user level thread is stopped and moved to the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812, the master control thread 550 dequeues the thread management cell 900 at the head of the READY queue 810 and starts executing that thread. The thread is then in the executing (RUN) state.
  • The master control thread 550 suspends the execution of the thread and enqueues it in the READY queue 810.
  • When the executing thread issues a system call, for example for file access, and waits for completion of the system call, the master control thread 550 temporarily suspends execution of the thread and enqueues it in the IOWAIT queue 811.
  • In (6) of FIG. 10, when the master control thread 550 detects completion of the system call that the thread enqueued in the IOWAIT queue 811 in (4) has been waiting for, it moves the thread from the IOWAIT queue 811 to the READY queue 810.
  • (7) of FIG. 10 is the corresponding operation for DMA transfer: when the DMA transfer that a thread enqueued in the NVMWAIT queue 812 has been waiting for actually completes, the thread is moved from the NVMWAIT queue 812 to the READY queue 810.
  • A thread whose awaited system call or DMA transfer has completed becomes executable again, and then waits to be scheduled and have its execution started.
  • a method for detecting completion of DMA transfer will be described later.
  • When the execution of a thread is completed or aborted, the master control thread 550 moves the thread management cell 900 of the completed or aborted thread to the FIN queue 813 (corresponding to (8) and (9)). Note that the abort of a thread can also occur while the thread is in the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812; in these cases as well, the master control thread 550 moves the thread management cell 900 of the aborted thread to the FIN queue 813. Furthermore, when the execution of a thread is completed or aborted, the master control thread 550 cancels the allocation of the PIN area (described later) that was allocated to the thread whose processing has finished, that is, it releases the memory space allocated as the PIN area.
  • The thread management cells 900 in the FIN queue 813 are released by the master control thread 550 as appropriate when the heap area 513 runs short. When a new thread is created (operation (1) in FIG. 10), thread management cells 900 from the FIN queue 813 are reused preferentially; if there are still not enough, the master control thread 550 allocates an area for the thread management cell 900 from the heap area 513.
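  • A minimal sketch of this reuse policy, using the hypothetical structures and queue helpers above, might look as follows; the fallback to malloc stands in for allocation from the heap area 513.

```c
/* Illustrative reuse of thread management cells via the FIN queue:
 * prefer a recycled cell, fall back to heap allocation, and release
 * recycled cells when memory runs short. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static struct tmc_queue fin_q;   /* cells of finished or aborted threads */

static struct thread_management_cell *alloc_tmc(void)
{
    /* Reuse a cell from the FIN queue if one is available ... */
    struct thread_management_cell *c = dequeue(&fin_q);
    if (c == NULL) {
        /* ... otherwise allocate a fresh cell from the heap area. */
        c = malloc(sizeof(*c));
        if (c == NULL)
            return NULL;
    }
    memset(c, 0, sizeof(*c));
    c->valid = true;
    return c;
}

/* When the heap area runs short, accumulated FIN-queue cells can be freed. */
static void trim_fin_queue(void)
{
    struct thread_management_cell *c;
    while ((c = dequeue(&fin_q)) != NULL)
        free(c);
}
```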
  • the NVM transfer request management information 515 includes a REQ queue 820, a WAIT queue 821, a COMPLETE queue 822, and a DISPOSE queue 823.
  • the entry enqueued in each queue is the NVM transfer request management cell 1100 shown in FIG.
  • The NVM transfer request management cell 1100 includes a Valid flag 1101, a request source thread ID 1102, a thread management cell pointer 1103, a transfer direction 1104, a transfer state 1105, a transfer source address 1106, a transfer data length 1107, and a transfer destination address 1108.
  • the Valid flag 1101 is a flag indicating whether or not the NVM transfer request management cell 1100 is valid.
  • the request source thread ID 1102 stores the thread ID of the thread that generated the NVM transfer request.
  • the thread management cell pointer 1103 is a pointer to the thread management cell 900 that manages the thread that has generated the NVM transfer request. That is, the thread ID 902 stored in the thread management cell 900 obtained by tracing this pointer is the same as the request source thread ID 1102.
  • Transfer direction 1104 is information for specifying the direction of NVM transfer, and specifies load or store.
  • the load is the direction close to the CPU, that is, the direction in which data is read from the NVM 320 to the main memory (or it can be said that the data stored in the NVM 320 is written to the main memory).
  • Store is the direction away from the CPU, that is, the direction in which data is transferred from the main memory to the NVM 320 (in other words, the data stored in the main memory is written to the NVM 320).
  • the transfer state 1105 indicates what state the NVM transfer is currently in. Details regarding the transfer status will be described later.
  • the transfer source address 1106 is an address that is a transfer source of DMA transfer performed by NVM transfer.
  • When the transfer direction 1104 is load (transfer from the NVM 320 to the main memory), the transfer source address 1106 is an identifier used within the NVM 320.
  • an address space dedicated to the NVM 320 which is different from the address space (physical address space, virtual address space) of the main memory, may be used.
  • The address space of the main memory is limited by the amount of DRAM assumed to be mountable in a computer at the time. For example, even in today's processors with a 64-bit architecture, considering the amount of DRAM that is realistic in terms of cost, often only an address space of about 48 bits is actually implemented.
  • When the transfer direction 1104 is store (transfer from the main memory to the NVM 320), the transfer source address 1106 lies in the address space of the main memory and, for the reason described later, is specified by an address in the physical address space.
  • the transfer data length 1107 designates the transfer length of DMA transfer performed by NVM transfer.
  • the transfer destination address 1108 is an address that is a transfer destination of a DMA transfer performed by NVM transfer. Similar to the transfer source address 1106, an identifier used in the NVM 320 or a physical address is specified according to the transfer direction 1104.
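  • For concreteness, the NVM transfer request management cell 1100 could be sketched in C as below; as with the earlier struct, the field widths and the direction/state encodings are assumptions for illustration.

```c
/* Hypothetical C layout of the NVM transfer request management cell 1100
 * of FIG. 11. Field widths and enumerations are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    XFER_LOAD,        /* NVM 320 -> main memory */
    XFER_STORE        /* main memory -> NVM 320 */
} transfer_direction_t;

typedef enum {
    XFER_REQUESTED,   /* waiting in the REQ queue                  */
    XFER_IN_PROGRESS, /* command written to the MMR; in WAIT queue */
    XFER_COMPLETE,    /* completion detected; in COMPLETE queue    */
    XFER_DISPOSED     /* recycled via the DISPOSE queue            */
} transfer_state_t;

struct nvm_transfer_request_cell {              /* 1100 */
    bool                  valid;                /* 1101 */
    uint64_t              requester_thread_id;  /* 1102 */
    struct thread_management_cell *tmc;         /* 1103: requester's cell */
    transfer_direction_t  direction;            /* 1104: load or store    */
    transfer_state_t      state;                /* 1105 */
    uint64_t              src_addr;             /* 1106: NVM identifier or physical address */
    size_t                length;               /* 1107: transfer length  */
    uint64_t              dst_addr;             /* 1108: physical address or NVM identifier */
    struct nvm_transfer_request_cell *next;     /* queue link (assumption) */
};
```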
  • When a thread requires a DMA transfer between the main memory and the NVM 320, the master control thread 550 generates an NVM transfer request management cell 1100 and enqueues the generated cell 1100 into the REQ queue 820. To start the DMA transfer, a command requesting the start of the transfer must be written to the MMR 311.
  • a plurality of sets of MMRs 311 may be prepared in the NVM subsystem 300 to support multiple DMA transfers in which a plurality of DMA transfers are advanced simultaneously in order to further increase the processing speed.
  • When the NVM subsystem 300 has a free DMA transfer slot, the master control thread 550 starts the DMA transfer by dequeuing an NVM transfer request management cell 1100 from the REQ queue 820 and writing a command to the MMR 311. The master control thread 550 then enqueues the NVM transfer request management cell 1100 whose DMA transfer is in progress into the WAIT queue 821 and waits for completion of the DMA transfer.
  • the completion of the DMA transfer can be notified by an interrupt from the NVM subsystem 300 to the processors 210 and 220.
  • the master control thread 550 can also know the completion of the DMA transfer by polling the MMR 311. However, when an interrupt is used, the interrupt handler of the operating system receives the interrupt, which requires switching to the kernel space and has a large overhead. Therefore, in this embodiment, a flag indicating that the DMA transfer is completed is provided in the MMR 311 and the master control thread 550 polls this flag.
  • In this embodiment, the master control thread 550 polls the MMR 311 when it has no other processing to perform, that is, at intervals that do not interfere with the user level threads or with scheduling, and the completion of a plurality of DMA transfers can be detected by a single poll.
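  • The completion polling described above could be sketched as follows; the flag register offset, per-channel bit layout, and channel count are purely illustrative, and `mmr` is assumed to have been mapped as in the earlier MMIO sketch.

```c
/* Illustrative poll of a DMA-completion flag register in the MMR.
 * Register offset and bit layout are hypothetical. */
#include <stdint.h>

#define REG_DMA_DONE_BITS  0x10   /* hypothetical: bit i set = channel i done */
#define NUM_DMA_CHANNELS   4

extern volatile uint64_t *mmr;    /* mapped MMR region (see earlier sketch) */

/* Called by the master control thread between its other duties; a single
 * register read can report completion of several DMA transfers at once. */
static uint64_t poll_dma_completions(void)
{
    uint64_t done = mmr[REG_DMA_DONE_BITS / sizeof(uint64_t)];

    for (int ch = 0; ch < NUM_DMA_CHANNELS; ch++) {
        if (done & (1ull << ch)) {
            /* Move the matching NVM transfer request management cell from
             * the WAIT queue to the COMPLETE queue (omitted here). */
        }
    }
    return done;
}
```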
  • the NVM transfer request management cell 1100 in which the completion of DMA transfer is detected is dequeued from the WAIT queue 821 by the master control thread 550 and enqueued in the COMPLETE queue 822.
  • the DMA transfer completion order is not necessarily the order enqueued in the WAIT queue 821, and access to the WAIT queue 821 is not necessarily a FIFO.
  • the NVM transfer request management cell 1100 enqueued in the COMPLETE queue 822 is unnecessary information because the DMA transfer has already been completed.
  • A feature of this embodiment is that the thread management cells 900 stored in the NVMWAIT queue 812 are moved from the NVMWAIT queue 812 to the READY queue 810 based on the NVM transfer request management cells 1100 stored in the COMPLETE queue 822. In other words, changing the state of the thread that issued a DMA transfer request from the DMA-transfer-completion-waiting state to the executable state according to the completion status of the DMA transfer is the characteristic operation that links DMA transfer and multithreading in this embodiment.
  • the master control thread 550 periodically monitors the COMPLETE queue 822. If the NVM transfer request management cell 1100 is present in the COMPLETE queue 822, the master control thread 550 dequeues it, and uses the request source thread ID 1102 as a key, in the NVMWAIT queue 812. The corresponding thread management cell 900 is searched from among them. When the master control thread 550 finds the corresponding thread management cell 900, the master control thread 550 enqueues the thread management cell 900 into the READY queue 810 and enqueues the NVM transfer request management cell 1100 into the DISPOSE queue 823. The DISPOSE queue 823 is used for reusing the used NVM transfer request management cell 1100, similarly to the FIN queue 813.
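  • Put together, the COMPLETE-queue processing described above might look like the following sketch, which reuses the hypothetical structures and queue helpers from the earlier examples; the lookup by requester thread ID is a simple linear search here.

```c
/* Illustrative processing of the COMPLETE queue: for each finished DMA
 * transfer, wake the requesting thread (NVMWAIT -> READY) and recycle the
 * request cell into the DISPOSE queue for reuse. */
static struct tmc_queue nvmwait_q, ready_q;

struct req_queue { struct nvm_transfer_request_cell *head; };
static struct req_queue complete_q, dispose_q;

static struct nvm_transfer_request_cell *req_dequeue(struct req_queue *q)
{
    struct nvm_transfer_request_cell *r = q->head;
    if (r) { q->head = r->next; r->next = NULL; }
    return r;
}

static void req_enqueue(struct req_queue *q, struct nvm_transfer_request_cell *r)
{
    r->next = NULL;
    struct nvm_transfer_request_cell **pp = &q->head;
    while (*pp)
        pp = &(*pp)->next;
    *pp = r;
}

static void process_complete_queue(void)
{
    struct nvm_transfer_request_cell *r;
    while ((r = req_dequeue(&complete_q)) != NULL) {
        /* Find the waiting thread by its requester thread ID. */
        struct thread_management_cell *c = nvmwait_q.head;
        while (c && c->thread_id != r->requester_thread_id)
            c = c->next;
        if (c) {
            remove_cell(&nvmwait_q, c);
            c->thread_state = THREAD_READY;
            enqueue(&ready_q, c);
        }
        r->state = XFER_DISPOSED;
        req_enqueue(&dispose_q, r);   /* keep the cell for reuse, like FIN */
    }
}
```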
  • FIG. 13 is an explanatory diagram showing how the threads in the process 410 are started and stopped by the cooperation of user level multithreading and DMA transfer described so far.
  • the cooperation between the thread life cycle and the DMA transfer will be described with reference to FIG.
  • Process 410 has kernel level threads 1-1 to 1-5 assigned from the operating system.
  • the process 410 can be executed by assigning an arbitrary thread to the kernel level threads 1-1 to 1-5.
  • the master control thread 550 is fixedly allocated on the kernel level thread 1-1.
  • the process 410 executes the master control thread 550 with the kernel level thread 1-1.
  • When performing large-scale graph processing, the master control thread 550 performs processing by setting up a thread corresponding to each vertex of the graph.
  • In FIG. 13, user level threads A, B, C, D, E, F, and G correspond to these vertices, and the master control thread 550 first assigns user level threads A, D, F, and G to the kernel level threads 1-2 to 1-5, respectively, and starts them.
  • In (1) of FIG. 13, the user level thread A requests the master control thread 550 to perform a DMA transfer from the NVM 320 and declares that it has no other processing it can perform until the DMA transfer is completed.
  • The master control thread 550 generates the corresponding NVM transfer request management cell 1100 and enqueues it into the REQ queue 820, suspends execution of the user level thread A, saves its context into the thread management cell 900, and then enqueues that cell into the NVMWAIT queue 812.
  • The master control thread 550 then dequeues one thread management cell 900 from the READY queue 810 and starts executing that thread on the kernel level thread 1-2 (in FIG. 13, the master control thread 550 has started execution of the user level thread B).
  • In (2), the user level thread D running on the kernel level thread 1-3 issues a DMA transfer request to the master control thread 550, as in (1).
  • This time, however, the master control thread 550 continues execution without pausing the user level thread D. This is the case, for example, where data that the user level thread will require is DMA-transferred in advance while other processing is performed; that is, the data on the NVM 320 is prefetched by DMA transfer.
  • Thereafter, the processing of the user level threads B and D is temporarily stopped, and execution of the next threads waiting in the READY queue 810 is started.
  • the user level thread B is scheduled to be executed again after the execution of the user level thread C is completed, and the user level thread D waits for completion of the previously requested DMA transfer.
  • execution of user level threads C and E is started. Since the completion of the DMA transfer requested in (1) is notified in (3) during execution of the user level thread C, the user level thread A is enqueued in the READY queue 810.
  • the user level thread A is scheduled after the execution of the user level thread C is completed (or suspended), and starts operating with the kernel thread 1-2.
  • On the kernel level thread 1-4, the user level thread F voluntarily returns its resources and stops in (4). Thereafter, a blank period with no thread to process occurs for a while on the kernel level thread 1-4, after which the user level thread B is scheduled again. As described above, the user level thread B was to be rescheduled after the completion of the user level thread C; however, since the kernel level thread it originally used has been taken by the user level thread A, it now operates on the kernel level thread 1-4. Since the virtual address space is shared among the threads in the process, the kernel level threads 1-1 to 1-5 can be used interchangeably, just as processors can in an SMP computer configuration.
  • FIG. 14 is a conceptual diagram showing the relationship between the thread context (thread management information 514) and the DMA transfer of data in this embodiment.
  • Data to be processed by a thread is DMA-transferred from the NVM 320 to the PIN area pool 516 in the main memory, and the context necessary for executing the thread is loaded from where it is saved in the main memory into each hardware thread (kernel level threads are assigned one-to-one to hardware threads).
  • FIG. 15 is a flowchart showing the scheduling operation of the master control thread 550 in this embodiment.
  • the master control thread 550 starts scheduling.
  • the trigger for starting the scheduling is when a free space is detected in the kernel level thread.
  • the master control thread 550 determines whether the thread management cell 900 can be dequeued from the READY queue 810. If it cannot be dequeued, scheduling ends at that point, and the master control thread 550 waits for the next opportunity for scheduling. If it can be dequeued, in step S1503, the master control thread 550 dequeues the thread management cell 900 from the READY queue.
  • In step S1504, the master control thread 550 determines whether or not a PIN area can be allocated from the PIN area pool 516.
  • the information processing system 100 is characterized by determining whether or not a PIN area can be allocated in the thread scheduling shown in FIG.
  • In this embodiment, DMA transfer is performed for each thread. Since DMA transfer is performed without the intervention of the processor or the operating system, the transfer is basically performed within the physical address space. That is, since the virtual address space is managed by the operating system, a DMA transfer cannot use virtual addresses. Therefore, when performing DMA transfer, it is essential that the area of main memory involved in the transfer is present in physical memory and located in the physical address space. Accordingly, a process that performs DMA transfer allocates a part of its virtual address space to the DRAM area of the physical address space in a fixed manner, so that page-out and page-in due to virtual memory do not occur.
  • The area fixedly allocated in this way to physically present memory is the PIN area, and in this embodiment it is prepared as the PIN area pool 516 of the process. Note that the user can set the size of the PIN area pool 516 in advance.
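  • As a side note not taken from the patent, on a POSIX system a fixed pool of this kind could be reserved and kept resident roughly as follows; the pool size is illustrative, and handing the pinned pages' physical addresses to the DMA engine, which a real implementation needs, is not shown.

```c
/* Minimal sketch of reserving a PIN-area pool and locking it so that its
 * pages stay resident in physical memory, a prerequisite for DMA. */
#include <stdio.h>
#include <sys/mman.h>

#define PIN_POOL_SIZE (64ul << 20)   /* e.g. 64 MiB; user-configurable */

static void *reserve_pin_pool(void)
{
    void *pool = mmap(NULL, PIN_POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return NULL; }

    /* Lock the pages so the operating system will not page them out. */
    if (mlock(pool, PIN_POOL_SIZE) != 0) {
        perror("mlock");
        munmap(pool, PIN_POOL_SIZE);
        return NULL;
    }
    return pool;
}
```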
  • It would also be possible to divide the PIN area pool 516 prepared for the process equally among the threads, but this works well only when the amount of data each thread requires is known in advance; if the sizes of the PIN areas requested by the threads are not uniform, the pool cannot be used effectively.
  • Therefore, in this embodiment, a PIN area pool 516 is prepared for the process, and the PIN area of the process is managed as a resource pool. The size of the PIN area that a thread requires is reported to the master control thread 550 at thread activation time, using the buffer request flag 907 and the buffer request size 908 of the thread management cell, and the thread starts execution after receiving a PIN area allocation from the master control thread 550.
  • In step S1504, the remaining amount of the current PIN area pool 516 is compared with the amount of area requested by the thread to be activated, to determine whether the thread can be activated. If executing threads have already been assigned parts of the PIN area pool 516, the remaining amount is reduced accordingly. If it is determined that the remaining amount of the PIN area pool 516 is insufficient, then in step S1506 the master control thread 550 puts the thread at the end of the READY queue 810 and gives priority to executing other threads. On the other hand, if the necessary area can be secured from the PIN area pool 516, then in step S1505 the master control thread 550 assigns the PIN area to the thread that made the request and switches to that thread. The master control thread 550 notifies the requesting thread of the assigned area by writing the PIN area assignment into the buffer allocation flag 909 and the buffer area head address 910 of the thread management cell 900.
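  • The decision flow of FIG. 15 can be summarized in the following sketch, built on the hypothetical structures above; the pool accounting and the assumed helpers pin_pool_alloc and dispatch are placeholders.

```c
/* Illustrative sketch of the scheduling flow of FIG. 15 (S1501 to S1506):
 * take a READY thread, check the PIN area pool, and either dispatch the
 * thread or push it back to the tail of the READY queue. */
#include <stdbool.h>
#include <stddef.h>

static size_t pin_pool_remaining;                         /* bytes still free */
extern void  *pin_pool_alloc(size_t size);                /* assumed helper   */
extern void   dispatch(struct thread_management_cell *c); /* assumed helper   */

static void schedule_once(void)                           /* S1501 */
{
    struct thread_management_cell *c = dequeue(&ready_q); /* S1502, S1503 */
    if (c == NULL)
        return;                 /* nothing runnable; wait for the next chance */

    if (c->buffer_request_flag &&
        c->buffer_request_size > pin_pool_remaining) {    /* S1504 */
        enqueue(&ready_q, c);   /* S1506: defer, let other threads run first  */
        return;
    }

    if (c->buffer_request_flag) {                         /* S1505 */
        c->buffer_area_head  = pin_pool_alloc(c->buffer_request_size);
        c->buffer_alloc_flag = true;
        pin_pool_remaining  -= c->buffer_request_size;
    }
    c->thread_state = THREAD_RUN;
    dispatch(c);                /* context switch to the selected thread */
}
```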
  • the entity of the thread management cell 900 is arranged in the NVM 320, and is DMA-transferred to the main memory and used as necessary. Further, the master control thread 550 DMA-transfers the thread management cell 900 on the NVM 320 in advance before it becomes necessary. That is, prefetch for the thread management cell 900 is performed. As shown in FIG. 16, the thread context is DMA-transferred between the main memory and the NVM, and saved or called.
  • For example, the READY queue 810 may be monitored and the cells prefetched in the order of the queue.
  • a DMA controller that holds the head address of the READY queue 810, reads the READY queue 810 in parallel with the processor, and performs prefetching may be separately prepared.
  • 100 Information processing system
  • 110 Node
  • 120 Inter-node network
  • 130 NVM subsystem interconnect
  • 210, 220 Processor
  • 230, 240 DIMM
  • 250 I / O hub
  • 260 NIC
  • 270 Disk controller
  • 280 HDD
  • 290 SSD
  • 300 NVM subsystem
  • 310 Hybrid memory controller
  • 320 NVM (non-volatile memory)
  • 330 Volatile memory
  • 311 MMR (Memory Mapped Register)
  • 511 Program code
  • 512 Global variable
  • 513 Heap area
  • 514 Thread management information
  • 515 NVM transfer request management information
  • 516 PIN area pool.

Abstract

This information processing device is provided with a multi-thread processor, a first storage device, a second storage device for performing DMA transfers between itself and the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on that physical address space. Memory space is secured in the physical address space according to the capacity required for DMA transfer by a thread that is scheduled for execution, the thread is moved to execution, and the secured memory space is released after processing of the thread has finished, thereby achieving efficient DMA transfer even in a state in which the data transfer amount fluctuates.

Description

Information processing device
 The present invention relates to an information processing apparatus that performs processing by transferring data to a storage device by DMA transfer, and particularly to an information processing apparatus characterized by switching and executing a plurality of threads.
 A computer is composed of a storage device that stores data and a central processing unit (CPU) that reads data from the storage device and processes it. In general, the faster a storage device is, the higher its price per bit (unit price), and the faster it is, the lower its number of bits per unit area or unit volume (recording density). Therefore, a high-speed but expensive, small-capacity storage device is placed near the CPU and holds the data needed most immediately, while data that does not fit there is placed in a slow but inexpensive, large-capacity storage device, and data is exchanged between the two storage devices as needed. Because storage devices thus involve a trade-off between speed and cost, or between speed and capacity, the so-called storage hierarchy concept, in which multiple storage devices with different properties are used hierarchically, has been widely adopted in the computer world.
 This trend remains unchanged today; given the types of storage elements available to computer vendors, computers today are composed mainly of a storage hierarchy with four levels: registers, cache memory, main memory, and storage. The main storage elements used at each level are flip-flops for registers, SRAM for cache memory, DRAM for main memory, and HDDs for storage, and the need to divide the hierarchy into levels arises from the speed, cost, and capacity of each storage element.
 Next, the storage hierarchy described above will be explained from another angle. The CPU processes data stored in registers. If the data to be processed is not in a register, the CPU searches the cache and, if the data is stored there, reads it from the cache into a register and then processes it. If the data to be processed is not in the cache either, the CPU reads it from the main memory into the cache, and if it is not in the main memory, the data is read from storage into the main memory. Thus, when the data is not in a level close to the CPU, it is read from a more distant level, and the penalty grows. While the data is being read, the CPU cannot carry out the processing it should perform, so it sits idle and CPU utilization decreases. The same problem occurs not only when reading data but also when writing it.
 Here, if DMA (Direct Memory Access) transfer is used for data transfer, the transfer can be performed without CPU intervention, so in principle the CPU should be able to perform other processing during the idle time of the transfer. For example, a method is also used in which the programmer schedules the timing of DMA transfers and the processing performed by the CPU, by identifying in advance the points in the program where data will be needed and explicitly embedding DMA transfer instructions. However, this method makes program tuning cumbersome.
 Multithreading has been used as a technique to deal with this. In multithreading, a unit of processing that can be performed concurrently is defined as a thread, and when execution of one thread stops, another executable thread is executed. A thread's execution stops not only when its processing is completed but also while the data it needs is being read. That is, combined with the data-reading behavior described above, when a running thread starts reading the data it needs, another thread is executed in the meantime. In this way, CPU utilization can be increased.
 As prior art relating to such DMA transfer and multithreading, there are the techniques disclosed in Patent Document 1 and Patent Document 2. In both, DMA transfer is performed between an on-chip memory (also referred to as local memory) on the CPU and the main memory (also referred to as global memory).
JP 2005-129001 A
JP 2002-163239 A
 ところで、近年、インターネットや各種端末の普及で、大量のデータを容易に取得することが出来るようになってきている。このような大量のデータは、旧来のデータベース管理システムなどで取り扱うことが難しく、ビッグデータという標語の元に種々の技術が開発されている。あらゆるモノがインターネットに接続されるIoT(Internet of Things)の時代には、モノで発生したあらゆるイベントに関するデータがインターネットに送信される。つまり、大量のPOE(Point of Event)データがインターネット上に送信される。このような世界では、インターネットからPOEデータを収集し、モノとモノ、ヒトとヒト、ないしは、モノとヒトがどのような関係にあるのかを分析し、それに基づいて適切なサービスを提供したり、将来を予測したりするなどの利用がなされていくことが期待される。そのためには、コンピュータが大量のデータを高速に処理することができなければならない。 Incidentally, in recent years, with the spread of the Internet and various terminals, it has become possible to easily acquire a large amount of data. Such a large amount of data is difficult to handle with a conventional database management system or the like, and various technologies have been developed under the slogan of big data. In the IoT (Internet of Things) era where every thing is connected to the Internet, data related to every event that occurs in the thing is transmitted to the Internet. That is, a large amount of POE (Point of Event) data is transmitted on the Internet. In such a world, we collect POE data from the Internet, analyze the relationship between things and things, people and people, or things and people, and provide appropriate services based on them. It is expected to be used for predicting the future. For this purpose, the computer must be able to process a large amount of data at high speed.
 コンピュータが大量のデータを高速に処理するためには、処理すべきデータがメインメモリ上に載ることが望ましい。ストレージ上にデータが置かれていると、その読み出しに時間を要する。特に、様々なモノやヒトの関係性を分析しようとする場合、様々なモノやヒトのデータを読み出さなければならない。その都度ストレージへのアクセスが必要になってしまうと、ストレージの読み出しの遅さがネックとなる。しかし、前述したようなビッグデータに対して、それに見合う容量のメインメモリを実現するために大量のDRAMを並べると、様々な問題を引き起こす。 In order for a computer to process a large amount of data at high speed, it is desirable that the data to be processed be placed on the main memory. When data is placed on the storage, it takes time to read it. In particular, when trying to analyze the relationship between various things and people, the data of various things and people must be read out. If access to the storage becomes necessary each time, the slow read of the storage becomes a bottleneck. However, when a large amount of DRAMs are arranged in order to realize a main memory having a capacity corresponding to the big data as described above, various problems are caused.
 DRAMはストレージに用いるHDDやフラッシュメモリ、相変化メモリなどと比較して単価が高いため、大量のDRAMでメインメモリを構成するとコスト増を引き起こす。また、DRAMは記録密度でも劣ることから、同容量のHDDやフラッシュメモリと比較して装置が巨大になってしまう。そこで、本願発明者らは、メインメモリとメインメモリからデータを退避させておく記憶領域との間でDMA転送を行うことで、必要な容量を用意しつつ、読み出し速度の問題を解決することを試みた。 DRAM has a higher unit price than HDDs used for storage, flash memory, phase change memory, etc., and configuring a main memory with a large amount of DRAM causes an increase in cost. Also, since the DRAM is inferior in recording density, the device becomes huge compared to the HDD and flash memory of the same capacity. Therefore, the inventors of the present application solve the problem of reading speed while preparing the necessary capacity by performing DMA transfer between the main memory and the storage area in which data is saved from the main memory. Tried.
 ここで、スレッドのスケジューリングは、実行可能なスレッドをキューから取りだして行う単純なFIFOに基づくことが多い。そのため、各スレッドの実行時間は均一なほうが効率的にスケジューリングすることができるので、処理をスレッドに分割するときには負荷が均等になるように分割することが望ましい。 Here, thread scheduling is often based on a simple FIFO that is executed by taking an executable thread from a queue. For this reason, uniform execution time of each thread enables efficient scheduling. Therefore, when dividing a process into threads, it is desirable to divide the processing so that the load is equal.
 しかし、コンピュータが普及し、アプリケーションが多様化する中で、必ずしも均等な分割が出来ないアプリケーションもあり問題となる。たとえば、社会科学系の問題を扱うとき、グラフ処理が行われる。グラフは、頂点の集合と、頂点間を結ぶ辺の集合で構成される。社会科学系の問題では、関係性が扱われることが多い。例えば会社間の関係は会社を表現する頂点と、関係を表現する辺で示される。このようなグラフを複数のスレッドで処理するために分割しようとすると、頂点毎にスレッドを割当てて分割する形態が自然である。ところが、頂点毎に分割したときに、各頂点が繋がっている辺の数はばらつきがある。そして、各頂点の処理に要する時間は、各頂点が関係を持っている頂点の数、すなわち、辺の数に比例する。そのため、頂点毎の分割ではスレッド間に処理量のばらつきが生じてしまい、退避させておいたデータをメインメモリにDMA転送する際に、DMA転送されるデータの大きさがばらついてしまう。 However, with the spread of computers and the diversification of applications, there are some applications that cannot always be divided equally. For example, graph processing is performed when dealing with social science problems. The graph is composed of a set of vertices and a set of edges connecting the vertices. Social science issues often deal with relationships. For example, the relationship between companies is indicated by a vertex representing the company and an edge representing the relationship. When such a graph is divided to be processed by a plurality of threads, it is natural that a thread is allocated and divided for each vertex. However, when dividing each vertex, the number of sides connected to each vertex varies. The time required for processing each vertex is proportional to the number of vertices with which each vertex is related, that is, the number of sides. Therefore, in the division for each vertex, the processing amount varies between threads, and when the saved data is DMA-transferred to the main memory, the size of the DMA-transferred data varies.
Moreover, the graphs that appear in social-science problems have a property called the scale-free characteristic, which makes this variation even more pronounced. The number of edges connected to a vertex is called the degree of that vertex. The scale-free characteristic means that the degree distribution follows a power law: a very small number of vertices have an extremely large degree, while the great majority of vertices have a small degree. Applied to the processing-amount variation described above, this means that when a sociological graph is processed, a small number of threads with a very large amount of processing and a large number of threads with a small amount of processing must be handled.
In the techniques disclosed in Patent Document 1 and Patent Document 2, DMA transfer is performed between an on-chip memory on the CPU and the main memory, and the data transferred by DMA is limited to the data the CPU is about to process; these are therefore not techniques that solve the above-described variation in data size.
An object of the present invention is to realize efficient DMA transfer even in a situation where the data transfer amount varies.
The information processing apparatus of the present invention includes a multithreaded processor, a first storage device, a second storage device that performs DMA transfers with the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on that physical address space. It solves the above problem by reserving, in the physical address space, a memory space corresponding to the capacity that a thread scheduled for execution requires for DMA transfer, executing the thread, and releasing the reserved memory space after the thread's processing has finished.
The present invention realizes efficient DMA transfer even in a situation where the data transfer amount varies, and consequently speeds up the processing of the information processing apparatus.
FIG. 1 is a diagram illustrating an example of the configuration of the information processing system of the present invention.
FIG. 2 is a diagram illustrating an example of the configuration of the information processing apparatus of the present invention.
FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem of the present invention.
FIG. 4 is a diagram for explaining an example of the relationship between threads and processors.
FIG. 5 is a diagram for explaining an example of the internal structure of a process of the present invention.
FIG. 6 is a diagram for explaining an example of the correspondence between the physical address space of the information processing apparatus of the present invention and the virtual address space of a process of the present invention.
FIG. 7 is a diagram for explaining the details of an example of the stack area and the PIN area pool.
FIG. 8 is a diagram for explaining the queues used by the master control thread to manage threads and DMA transfers.
FIG. 9 is a diagram for explaining an example of the configuration of a thread management cell.
FIG. 10 is a diagram illustrating an example of thread state transitions.
FIG. 11 is a diagram for explaining an example of the configuration of an NVM transfer request management cell.
FIG. 12 is a diagram illustrating an example of the state transitions of an NVM transfer request.
FIG. 13 is a diagram for explaining the cooperation between user-level multithreading and DMA transfer.
FIG. 14 is a conceptual diagram for explaining the relationship between user-level multithreading and DMA transfer.
FIG. 15 is a flowchart for explaining an example of the scheduling operation of the master control thread.
FIG. 16 is a conceptual diagram for explaining the relationship between user-level multithreading and DMA transfer.
Embodiments of the present invention will be described below with reference to the drawings. In all the drawings for describing the embodiments, the same members are, as a rule, given the same reference numerals, and repeated description thereof is omitted.
This embodiment describes an information processing system 100 that uses a nonvolatile memory such as flash memory or phase-change memory to provide applications with a memory larger than one built from DRAM, while solving the slowness that is the drawback of nonvolatile memory by means of multithreading.
FIG. 1 is a diagram illustrating an example of the configuration of the information processing system 100 of this embodiment. The information processing system 100 has at least one node 110. A node 110 is an information processing apparatus, for example a server apparatus. The example of FIG. 1 shows a four-node configuration of nodes 0 to 3 (reference numeral 110). The nodes are connected by an inter-node network 120. In addition to the inter-node network 120, the information processing system 100 may further include an NVM subsystem interconnect 130 that connects the non-volatile memory (NVM) subsystems described later.
FIG. 2 is a diagram illustrating an example of the configuration of a node 110, which is an information processing apparatus. The node 110 includes processors 210 and 220, DIMMs 230 and 240, an I/O hub 250, a NIC 260, a disk controller 270, an HDD 280, an SSD 290, and an NVM subsystem 300. The main memory is composed of the DIMMs 230 and 240, which are storage devices. The DIMMs 230 and 240 are composed of DRAM, a volatile memory. Each node 110 needs only one processor at minimum; the node 110 of FIG. 2 is an example of a two-processor configuration with the processors 210 and 220. The processors 210 and 220 may each be multi-core processors. In the example of FIG. 2, each processor has two cores, so the node 110 as a whole is a four-core node. Furthermore, each core may support simultaneous multithreading (SMT). In the example of FIG. 2, each core supports two-way SMT, so each processor can process four threads simultaneously; that is, each processor is a multithreaded processor. Hereinafter, the threads that the hardware can process simultaneously are referred to as hardware threads.
The I/O hub 250 provides interfaces for connecting various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300. The I/O hub 250 is connected to the processors 210 and 220 by the system bus that each processor provides; a bus such as HyperTransport, for example, is used for this connection. On the other hand, the I/O hub 250 is connected to devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300 by a peripheral bus for attaching peripheral devices, such as PCI Express. In this embodiment, the I/O hub 250 is described as being connected to the NIC 260, the disk controller 270, and the NVM subsystem 300 by PCI Express, but the present invention can also be implemented with other interconnect means.
Conventionally, the main memory of a computer has been determined by the capacity of its DIMMs, and data that does not fit in the main memory is stored in the HDD or SSD serving as storage. The storage is connected via a disk controller, and in hardware terms an interface such as SAS (Serial Attached SCSI) or SATA (Serial Advanced Technology Attachment) is used. The interface seen from software is the file system: an application reads and writes files, and via the file system the operating system's device driver controls the disk controller to read and write the HDD or SSD. Reads and writes therefore cannot be performed without passing through multiple layers, and the overhead is large.
In contrast, the information processing system 100 of this embodiment includes the NVM subsystem 300 in order to read and write a nonvolatile memory that is larger than DIMMs at higher speed than HDD or SSD storage. When reads and writes faster than the storage are required, the data is read from the storage into the NVM subsystem 300 in advance, thereby realizing high-speed reads and writes. Non-volatile memory is abbreviated as NVM below.
FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem 300. The NVM subsystem 300 includes a hybrid memory controller 310, a nonvolatile memory (NVM) 320 that is a storage device, and a volatile memory 330 that is a storage device. The NVM 320 is a nonvolatile memory such as flash memory or phase-change memory. The volatile memory 330 is DRAM, and DIMMs can be reused for it. The hybrid memory controller 310 is connected to the NVM 320, the volatile memory 330, and the I/O hub 250. In response to requests from software running on the processor 210 or 220, the hybrid memory controller 310 DMA-transfers data stored in the NVM 320 to the main-memory DIMM 230 or 240, and also DMA-transfers data stored in the main-memory DIMM 230 or 240 to the NVM 320. The volatile memory 330 is used as a buffer during DMA transfer. As described above, the hybrid memory controllers 310 of the nodes can also be connected by the NVM subsystem interconnect 130, which makes it possible to access data stored in the NVM subsystems 300 of other nodes.
The hybrid memory controller 310 has a memory-mapped register (MMR) 311. The MMR 311 is a register through which software running on the processors 210 and 220 instructs the hybrid memory controller 310 to perform DMA transfers. With PCI Express, the registers of a peripheral device can be mapped into the same memory space as the main memory, so software can access the MMR 311 with the load and store instructions of the processors 210 and 220, just as it reads and writes the main memory.
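As an illustration of what driving such a memory-mapped register can look like from user space, the following C sketch maps a PCI BAR and reads a register through ordinary load instructions. The sysfs path, the 4 KB window size, and the register offsets are assumptions made for illustration only; the document does not disclose the actual register map of the MMR 311.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical offsets inside the MMR 311 window; the real layout is not
       disclosed in this document. */
    #define MMR_CMD    0x00u   /* command register (write to start a DMA)    */
    #define MMR_STATUS 0x08u   /* status register (completion flag, assumed) */

    int main(void)
    {
        /* Map the PCI BAR that exposes the hybrid memory controller's MMR.
           On Linux each PCI function exposes its BARs as resourceN files;
           the device address used here is a placeholder. */
        int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        void *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Ordinary load/store instructions now reach the device registers. */
        volatile uint64_t *mmr = (volatile uint64_t *)base;
        uint64_t status = mmr[MMR_STATUS / sizeof(uint64_t)];
        printf("status = 0x%016llx\n", (unsigned long long)status);

        munmap(base, 4096);
        close(fd);
        return 0;
    }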
An operating system that supports virtual memory runs on the node 110. The node 110 is composed of a plurality of cores as described above, but it is a symmetric multiprocessing (SMP) configuration in which all cores share a single main memory, so a single operating system runs on the node 110. The embodiment is described below on the premise of a single system image in which one operating system runs on each node 110. The operating system running on each node 110 allocates a physical address space to the main memory composed of the DIMMs 230 and 240 of that node 110 and provides a virtual address space on that physical address space.
FIG. 4 is an explanatory diagram showing the relationship between the plurality of cores of a node 110 and the resources the operating system provides to the user. Since each node 110 has a configuration of two processors per node, two cores per processor, and two-way SMT per core, each node as a whole has 2 × 2 × 2 = 8 hardware threads as resources.
An operating system that supports virtual memory uses the address translation mechanism of the processor's MMU (Memory Management Unit) to separate the virtual address space for applications (user space) from the virtual address space in which the operating system kernel runs (kernel space), thereby ensuring the security and robustness of the system. In user space, each unit called a process has its own independent virtual address space. In general, a thread in an environment with such a concept of processes takes a form subordinate to a process: each process has one or more threads, and each thread runs sharing the virtual address space of its parent process.
Also, when a single operating system manages a plurality of cores, the operating system must abstract the cores in some way and assign them to processes, providing an environment in which applications can use the plurality of cores. The concept of a thread is used for this purpose as well. As shown in FIG. 4, the operating system provides kernel-level threads to each process.
In general, when threads are executed, the number of hardware threads (M) is limited relative to the number of threads to be executed (N). When N is less than or equal to M, the two can be put in one-to-one correspondence, but when N is greater than M, switching is required. This switching is context switching. To perform context switching with kernel-level threads, however, the virtual address space must be switched from user space to kernel space and the context switch processed inside the kernel, so the large overhead of context switching becomes a problem.
Therefore, in the information processing system 100 of this embodiment, kernel-level threads and hardware threads are used in one-to-one correspondence, as shown in FIG. 4; that is, N = M. As it stands, however, N is constrained by M, so the number of threads an application requires cannot be secured. When the number of threads is small, there is also less room to hide the DMA transfers described later, and the efficiency of the system as a whole falls. A method is therefore needed to make a large number of threads available (to increase N) while avoiding context switching in the kernel.
To address this, the information processing system 100 of this embodiment is characterized by providing each of the processes 410 and 420 of FIG. 4 with a master control thread 550 and user-level threads, as shown in FIG. 5.
As shown in FIG. 5, in the information processing system 100 of this embodiment the process 410 has inter-thread shared resources 510 and at least two kernel-level threads allocated from the kernel. The master control thread 550 is fixedly assigned to one of the plurality of kernel-level threads, and the user-level threads required by the application are assigned to the kernel-level threads in a time-shared manner.
The master control thread 550 is a thread that continuously occupies one kernel-level thread and keeps running for as long as the process 410 exists; it performs context switching of the user-level threads, scheduling, management of the inter-thread shared resources, and so on. Unlike kernel-level threads, the master control thread 550 performs these operations inside the process 410, so context switching can be realized at higher speed without causing a switch into kernel space. In other words, the master control thread 550 makes it possible to use a large number of threads while speeding up context switching.
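User-space context switching of the kind the master control thread 550 performs can be pictured with the POSIX ucontext interface, which saves and restores register state without entering the kernel. This is only a sketch under the assumption that a ucontext-style mechanism is acceptable; the document does not name a specific switching primitive.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define STACK_SIZE (64 * 1024)

    static ucontext_t master_ctx, worker_ctx;

    static void worker(void)
    {
        puts("user-level thread: running");
        /* Hand control back to the master control thread entirely in user
           space; no switch into kernel space takes place. */
        swapcontext(&worker_ctx, &master_ctx);
    }

    int main(void)
    {
        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp   = malloc(STACK_SIZE);  /* per-thread stack */
        worker_ctx.uc_stack.ss_size = STACK_SIZE;
        worker_ctx.uc_link          = &master_ctx;
        makecontext(&worker_ctx, worker, 0);

        puts("master control thread: dispatching a user-level thread");
        swapcontext(&master_ctx, &worker_ctx);   /* user-space context switch */
        puts("master control thread: control returned");
        free(worker_ctx.uc_stack.ss_sp);
        return 0;
    }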
As described above, threads run sharing the resources of their process. The model used is that each thread has its own stack while the other areas are shared with the other threads. In the information processing system 100 of this embodiment, in order to perform the DMA transfers described later, resources of the PIN area described later are additionally allocated to the user-level threads.
FIG. 6 is an explanatory diagram showing the correspondence between the physical address space 610 of the node 110 and the virtual address space 620 of a process 410 running on the node 110. The areas arranged in the physical address space 610 basically correspond one-to-one with the physical components of the node 110. The DRAM area 611 corresponds to the DIMMs 230 and 240, and the MMIO (Memory Mapped Input/Output) area 612 is the area in which the MMR 311 described above is placed. All of these areas are managed in units called pages; in general, one page is 4 KB in size.
The virtual address space 620 of the process 410 can be broadly divided into a text area 621, a data area 622, an mmio area 623, and a stack area 624. The process 410 of this embodiment additionally has a PIN area pool 516.
Each area in the virtual address space 620 is described below while relating the internal structure of the process 410 shown in FIG. 5 to the virtual address space 620 of the process 410 shown in FIG. 6.
The process 410 has various resources shared among the user-level threads as the inter-thread shared resources 510. The inter-thread shared resources 510 include the program code 511, the global variables 512, the heap area 513, the thread management information 514, the NVM transfer request management information 515, and the PIN area pool 516. The program code 511 is the instruction sequence of the program the threads execute, and it is placed in the text area 621 of the virtual address space 620. The global variables 512 are variables used in common by any subroutine or thread operating within the process 410, and they are placed in the data area 622. The heap area 513 is the resource pool from which the program dynamically allocates memory, and it is placed in the data area 622. The thread management information 514, described in detail later, stores the information needed per thread in order to manage the threads; it is mainly used by the master control thread 550, but because it must also be accessible from the user-level threads it has the same nature as the global variables 512 and is placed in the data area 622. The NVM transfer request management information 515 is information for managing the DMA transfers described later and, for the same reason as the thread management information 514, is placed in the data area 622. The stack area is an area for preparing the stacks used for local variables and for passing subroutine parameters, and it is apportioned to each thread as described later.
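To make the placement of these regions concrete, the short C program below prints one address from each of the regions discussed (text, data, heap, stack). It is only an illustration of the general layout; the actual addresses depend on the platform and on address space layout randomization.

    #include <stdio.h>
    #include <stdlib.h>

    int global_counter = 42;                 /* placed in the data region     */

    static void show_regions(void)
    {
        int local = 0;                       /* placed on the thread's stack  */
        void *dynamic = malloc(64);          /* allocated from the heap       */

        printf("text  (code)   : %p\n", (void *)show_regions);
        printf("data  (global) : %p\n", (void *)&global_counter);
        printf("heap           : %p\n", dynamic);
        printf("stack          : %p\n", (void *)&local);
        free(dynamic);
    }

    int main(void) { show_regions(); return 0; }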
In FIG. 6, the correspondence between the physical address space 610 and the virtual address space 620 is indicated by broken lines. As shown in FIG. 6, the physical address space 610 and the virtual address space 620 are mapped page by page, but note that there are pages in the virtual address space 620 for which no corresponding page exists in the physical address space 610. This is the mechanism that realizes so-called virtual memory: even if a page exists in the virtual address space 620, it is not necessarily resident in DRAM and may have been paged out to the HDD or SSD. When such a page is accessed, the MMU raises a page-fault exception, and the operating system reads the evacuated page back from the HDD or SSD and pages it in. Thus, an information processing system 100 that adopts virtual memory has the characteristic that a memory area (page) that exists from the process's point of view does not necessarily exist in physical memory. The effect of this characteristic on DMA transfer is described later.
FIG. 7 shows the relationship between the threads and the PIN area pool 516 and stack area 624 of the virtual address space 620. Both the PIN area pool 516 and the stack area 624 are divided per thread in use. However, whereas the stack area 624 always contains a stack area for every thread the process 410 has (the master control thread and the user-level threads), the PIN area pool 516 has areas for only some of the threads, according to its size. This is because the stack area 624 can reserve as much area as the virtual address space allows by exploiting the virtual-memory mechanism, whereas the PIN area pool 516, for a reason described later, is provided only to the extent that pages corresponding to the DRAM area 611 of the physical address space 610 can be secured.
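As a loose analogy for such a PIN area pool, the C sketch below reserves a region and pins its pages with mlock so that they stay resident in physical DRAM. This is not the mechanism the document describes: it only assumes a POSIX-like environment, the pool size is arbitrary, and pinning memory for an actual DMA engine is normally the job of a device driver; mlock merely prevents the pages from being paged out.

    #include <stdio.h>
    #include <sys/mman.h>

    #define PIN_POOL_BYTES (1u << 20)   /* 1 MiB pool; size chosen arbitrarily */

    int main(void)
    {
        void *pool = mmap(NULL, PIN_POOL_BYTES, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED) { perror("mmap"); return 1; }

        /* Pin the pages so each keeps a physical frame; may require a raised
           RLIMIT_MEMLOCK or appropriate privileges. */
        if (mlock(pool, PIN_POOL_BYTES) != 0) { perror("mlock"); return 1; }

        printf("pinned a %u-byte pool at %p\n", PIN_POOL_BYTES, pool);

        munlock(pool, PIN_POOL_BYTES);
        munmap(pool, PIN_POOL_BYTES);
        return 0;
    }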
The mechanism that links the user-level multithreading of the information processing system 100 with DMA transfer is described below. Linking user-level multithreading with DMA transfer makes it possible, for example, to speed up the kind of large-scale graph processing described in the background art.
Here, user-level multithreading means running a plurality of threads (user-level threads) within the process 410 while switching among them. Because the processing needed for thread switching is completed inside the process 410, it is faster than switching kernel-level threads. On the other hand, with user-level multithreading, thread management is also performed inside the process 410. In the information processing system 100 of this embodiment, the master control thread 550 takes on the role of thread management.
FIG. 8 shows the queues used by the master control thread 550 to manage the user-level threads and the DMA transfers. These queues are placed in memory as the thread management information 514 and the NVM transfer request management information 515.
The thread management information 514 includes a READY queue 810, an IOWAIT queue 811, an NVMWAIT queue 812, and a FIN queue 813. The entries enqueued on each queue are thread management cells 900, shown in FIG. 9. A thread management cell 900 consists of a Valid flag 901, a thread ID 902, a thread state 903, a saved context 904, a saved stack pointer 905, a saved program counter 906, a buffer request flag 907, a buffer request size 908, a buffer allocation flag 909, and a buffer area head address 910.
The Valid flag 901 is a flag indicating whether the thread management cell 900 is valid. The thread ID 902 is an identifier that uniquely identifies a thread, and it is used to realize the characteristic operation of the present invention of linking DMA transfer and thread scheduling, described later. The thread state 903 is information indicating what state the thread is currently in; the thread states are described in detail later.
The saved context 904, the saved stack pointer 905, and the saved program counter 906 are information used to execute the thread, saved from the registers of the processors 210 and 220 into the thread management cell 900 when the thread is stopped. The buffer request flag 907, the buffer request size 908, the buffer allocation flag 909, and the buffer area head address 910 are used for the DMA transfers described later and are explained in detail below.
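Rendered as a C structure, the thread management cell 900 might look as follows. The field types, the register-file layout, and the intrusive next pointer for queue linkage are assumptions; only the field names follow FIG. 9.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Thread states, named after the queues of FIG. 8 / FIG. 10. */
    typedef enum { T_READY, T_RUN, T_IOWAIT, T_NVMWAIT, T_FIN } thread_state_t;

    /* Saved general-purpose registers; the real set depends on the processor. */
    typedef struct { uint64_t gpr[32]; } saved_context_t;

    typedef struct thread_mgmt_cell {
        bool            valid;            /* Valid flag 901                    */
        uint64_t        thread_id;        /* thread ID 902                     */
        thread_state_t  state;            /* thread state 903                  */
        saved_context_t context;          /* saved context 904                 */
        void           *stack_pointer;    /* saved stack pointer 905           */
        void           *program_counter;  /* saved program counter 906         */
        bool            buffer_requested; /* buffer request flag 907           */
        size_t          buffer_req_size;  /* buffer request size 908           */
        bool            buffer_allocated; /* buffer allocation flag 909        */
        void           *buffer_base;      /* buffer area head address 910      */
        struct thread_mgmt_cell *next;    /* queue linkage (assumed intrusive) */
    } thread_mgmt_cell_t;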
On the READY queue 810, the thread management cells 900 of executable user-level threads are enqueued. The master control thread 550 dequeues a thread management cell 900 from the READY queue 810 when a kernel-level thread of the process 410 is free or when another user-level thread stops. It then performs a context switch using the saved context 904, the saved stack pointer 905, and the saved program counter 906 contained in the dequeued thread management cell 900, and starts executing the thread.
A thread that is neither executing (RUN) nor executable (READY) is in some waiting state. To manage those waiting states, the master control thread 550 uses the IOWAIT queue 811 and the NVMWAIT queue 812. Having the NVMWAIT queue 812 in particular is a feature of the information processing system 100 of this embodiment.
The IOWAIT queue 811 is a queue that holds threads waiting for the completion of I/O requested from the operating system via a system call. When a user-level thread makes an I/O request that uses an operating-system function, such as access to a file, it issues a system call to convey the request to the operating system. If the thread that issued the system call has no processing to perform until the system call completes, that is, if it will resume processing only after the system call completes, the master control thread 550 parks the thread that issued the system call on the IOWAIT queue 811. When execution of the system call completes, the master control thread 550 moves the parked thread to the READY queue 810.
The NVMWAIT queue 812 is a queue that holds threads waiting for the completion of a DMA transfer between the main memory and the nonvolatile memory (NVM) 320. In the use cases assumed for the information processing system 100 of this embodiment, threads corresponding to each of an enormous number of vertices are executed, for example for large-scale graph processing. Because the number of threads is then huge, it is difficult to keep the data needed by all threads in main memory at once. The information processing system 100 therefore stores the data in the nonvolatile memory (NVM) 320 and brings it from the NVM 320 into the main memory by DMA transfer when a thread is executed. Whereas the main memory is composed of DRAM, the NVM 320 is composed of flash memory or phase-change memory, so a large capacity is realized at lower cost than with DRAM. However, this DMA transfer also takes time, so if the thread that requested the data by DMA transfer has no other work it can perform in the meantime, the master control thread 550 parks that thread on the NVMWAIT queue 812. When the DMA transfer completes, the master control thread 550 moves the thread parked on the NVMWAIT queue 812 to the READY queue 810.
The FIN queue 813 is a queue for collecting thread management cells 900 that are no longer needed because execution has completed. A thread management cell 900 exists for each thread. Therefore, when large numbers of threads are repeatedly created and destroyed, as in dynamic large-scale graph processing, allocating a thread management cell 900 from the heap area 513 each time and returning the area to the heap area 513 each time would incur a large overhead. The master control thread 550 therefore collects used thread management cells 900 on the FIN queue 813 and reuses thread management cells 900 from the FIN queue 813 as needed.
FIG. 10 summarizes the various thread states described so far as a state transition diagram. When a new thread is created in (1) of FIG. 10, the master control thread 550 enqueues the thread management cell 900 of the new thread on the READY queue 810. In (2) of FIG. 10, when a resource for executing threads becomes free, that is, when a user-level thread that had been executing stops (its thread is parked on the READY queue 810, the IOWAIT queue 811, the NVMWAIT queue 812, or the FIN queue 813), or when a kernel-level thread has no user-level thread assigned and is idle, the master control thread 550 dequeues the thread management cell 900 at the head of the READY queue 810 and starts executing it. That thread is then executing (RUN).
In (3) of FIG. 10, when an executing thread voluntarily yields its resources, the master control thread 550 suspends execution of that thread and enqueues it on the READY queue 810. In (4) of FIG. 10, when an executing thread issues a system call, for example for file access, and begins waiting for the system call to complete, the master control thread 550 suspends execution of that thread and enqueues it on the IOWAIT queue 811.
In (5) of FIG. 10, when an executing thread begins waiting for the completion of a DMA transfer between the NVM 320 and the main memory, the master control thread 550 suspends execution of that thread and enqueues it on the NVMWAIT queue 812.
In (6) of FIG. 10, when the master control thread 550 detects the completion of a system call whose completion had been awaited since the thread was enqueued on the IOWAIT queue 811 in (4), it moves that thread from the IOWAIT queue 811 to the READY queue 810. Likewise, (7) of FIG. 10 is, as with (6), the operation for the case where the awaited event actually completes: when the master control thread 550 detects the completion of a DMA transfer whose completion had been awaited since the thread was enqueued on the NVMWAIT queue 812 in (5), it moves that thread from the NVMWAIT queue 812 to the READY queue 810. Through (6) and (7), a thread that was waiting for a system call or a DMA transfer and whose system call or DMA transfer has completed becomes executable again and waits to be scheduled and started next. The method for detecting completion of a DMA transfer is described later.
When execution of a thread completes, or when execution of a thread is aborted partway through, the master control thread 550 moves the thread management cell 900 of the completed or aborted thread to the FIN queue 813 (corresponding to (8) and (9) in FIG. 10). Note that aborting a thread can also occur while the thread is on the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812; in these cases too, the master control thread 550 moves the thread management cell 900 of the aborted thread to the FIN queue 813. Also, when execution of a thread completes or is aborted and a PIN area described later had been allocated to the thread whose processing has thus ended, the master control thread 550 cancels the allocation, that is, it releases the memory space that had been allocated as the PIN area.
The thread management cells 900 on the FIN queue 813 are released by the master control thread 550 as appropriate when the heap area 513 runs short. When a new thread is created (operation (1) of FIG. 10), thread management cells 900 on the FIN queue 813 are used preferentially. If thread management cells 900 are still insufficient, the master control thread 550 allocates an area for a thread management cell 900 from the heap area 513.
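The FIN queue 813 thus acts as a free list. A minimal C sketch of that allocation policy, with a stand-in cell type and hypothetical helper names, is shown below.

    #include <stdlib.h>
    #include <string.h>

    /* Minimal stand-in for the thread management cell 900; other fields omitted. */
    typedef struct cell { struct cell *next; } cell_t;

    static cell_t *fin_head;   /* head of the FIN queue 813, used as a free list */

    /* Prefer recycling a used cell from the FIN queue; fall back to the heap
       only when the FIN queue is empty, avoiding allocator overhead when
       threads are created and destroyed at a high rate. */
    static cell_t *alloc_thread_cell(void)
    {
        cell_t *c = fin_head;
        if (c != NULL)
            fin_head = c->next;
        else
            c = malloc(sizeof *c);
        if (c != NULL)
            memset(c, 0, sizeof *c);   /* present a cleared cell to the caller */
        return c;
    }

    static void retire_thread_cell(cell_t *c)
    {
        c->next  = fin_head;           /* push back onto the FIN queue for reuse */
        fin_head = c;
    }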
The NVM transfer request management information 515 consists of a REQ queue 820, a WAIT queue 821, a COMPLETE queue 822, and a DISPOSE queue 823. The entries enqueued on each queue are NVM transfer request management cells 1100, shown in FIG. 11. An NVM transfer request management cell 1100 consists of a Valid flag 1101, a requesting thread ID 1102, a thread management cell pointer 1103, a transfer direction 1104, a transfer state 1105, a transfer source address 1106, a transfer data length 1107, and a transfer destination address 1108.
The Valid flag 1101 is a flag indicating whether the NVM transfer request management cell 1100 is valid. The requesting thread ID 1102 stores the thread ID of the thread that generated the NVM transfer request. The thread management cell pointer 1103 is a pointer to the thread management cell 900 that manages the thread that generated the NVM transfer request; in other words, the thread ID 902 stored in the thread management cell 900 reached through this pointer is identical to the requesting thread ID 1102.
The transfer direction 1104 is information specifying the direction of the NVM transfer, either load or store. A load is a transfer toward the side close to the CPU, that is, reading data from the NVM 320 into the main memory (equivalently, writing data stored in the NVM 320 into the main memory). A store is a transfer toward the side far from the CPU, that is, reading data out of the main memory into the NVM 320 (equivalently, writing data stored in the main memory into the NVM 320). The transfer state 1105 indicates what state the NVM transfer is currently in; details of the transfer states are described later.
The transfer source address 1106 is the address that is the source of the DMA transfer performed by the NVM transfer. When the transfer direction 1104 is load (transfer from the NVM 320 to the main memory), the transfer source address 1106 is an identifier used by the NVM 320. For this identifier, an address space dedicated to the NVM 320, different from the address spaces of the main memory (the physical address space and the virtual address space), may be used. In general, the address space of the main memory is constrained by the amount of DRAM that a computer of the era is assumed to be able to mount; for example, even today's 64-bit processor architectures often implement only a space of about 48 bits, in view of the amount of DRAM that is realistically affordable. It is therefore difficult to map a large-capacity NVM into the address space of the main memory. When the transfer direction 1104 is store (transfer from the main memory to the NVM 320), the transfer source address 1106 is in the address space of the main memory and, for a reason described later, is specified in particular as an address in the physical address space.
The transfer data length 1107 specifies the transfer length of the DMA transfer performed by the NVM transfer. The transfer destination address 1108 is the address that is the destination of the DMA transfer performed by the NVM transfer. Like the transfer source address 1106, depending on the transfer direction 1104 it is specified either as an identifier used by the NVM 320 or as a physical address.
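The NVM transfer request management cell 1100 can likewise be sketched as a C structure. The field names follow FIG. 11; the types, the enumerations, and the next pointer for queue linkage are assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Load reads the NVM 320 into main memory; store writes main memory back. */
    typedef enum { XFER_LOAD, XFER_STORE } xfer_dir_t;

    /* Transfer states matching the REQ / WAIT / COMPLETE / DISPOSE queues. */
    typedef enum { X_REQUESTED, X_IN_FLIGHT, X_COMPLETE, X_DISPOSED } xfer_state_t;

    struct thread_mgmt_cell;                     /* the cell of FIG. 9 */

    typedef struct nvm_req_cell {
        bool                      valid;         /* Valid flag 1101              */
        uint64_t                  requester_tid; /* requesting thread ID 1102    */
        struct thread_mgmt_cell  *thread_cell;   /* thread mgmt cell ptr 1103    */
        xfer_dir_t                direction;     /* transfer direction 1104      */
        xfer_state_t              state;         /* transfer state 1105          */
        uint64_t                  src_addr;      /* source: NVM identifier or
                                                    physical address 1106        */
        size_t                    length;        /* transfer data length 1107    */
        uint64_t                  dst_addr;      /* transfer destination 1108    */
        struct nvm_req_cell      *next;          /* queue linkage (assumed)      */
    } nvm_req_cell_t;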
The state transitions of an NVM transfer request are described below by comparing the state transition diagram of FIG. 12 with the queue configuration of FIG. 8.
First, (1) the access request from a thread and (2) the DMA transfer request to the NVM subsystem in FIG. 12 are described. When a thread requires a DMA transfer between the main memory and the NVM 320, the master control thread 550 generates an NVM transfer request management cell 1100 and enqueues the generated cell 1100 on the REQ queue 820. To start a DMA transfer, a command requesting the start of the DMA transfer must be written to the MMR 311. Here, multiple sets of the MMR 311 can be provided in the NVM subsystem 300 to support multiplexed DMA transfer, in which a plurality of DMA transfers proceed concurrently for further speed-up. When the NVM subsystem 300 has a free DMA transfer slot, the master control thread 550 dequeues an NVM transfer request management cell 1100 from the REQ queue 820 and writes a command to the MMR 311, thereby starting the DMA transfer. The master control thread 550 then enqueues the NVM transfer request management cell 1100 whose DMA transfer is now in progress on the WAIT queue 821 and waits for the DMA transfer to complete.
Next, (3) the DMA transfer completion notification from the NVM subsystem, (4) the access completion notification to the thread, and (5) the release and reuse of NVM access management cells in FIG. 12 are described. Completion of a DMA transfer could be notified by an interrupt from the NVM subsystem 300 to the processor 210 or 220, and the master control thread 550 could also learn of the completion by polling the MMR 311. When an interrupt is used, however, it is received by an interrupt handler belonging to the operating system, which requires a switch into kernel space and incurs large overhead. In this embodiment, therefore, the MMR 311 is provided with a flag indicating that a DMA transfer has completed, and the master control thread 550 polls this flag. Furthermore, especially when a plurality of DMA transfers are performed concurrently, even detecting their completions individually becomes a source of overhead. In the information processing system 100 of this embodiment, therefore, the master control thread 550 polls the MMR 311 at intervals that do not interfere with the user-level threads or with scheduling, in other words when the master control thread 550 has no other work to do, and detects the completion of a plurality of DMA transfers with a single poll.
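The interaction with the MMR 311 described above can be sketched as two small C helpers: one that starts a transfer by writing a descriptor into the register set, and one non-blocking poll of a completion flag. The register indices, the command encoding, and the flag bit are all assumptions; the document states only that a command written to the MMR 311 starts a DMA transfer and that a flag in the MMR 311 indicates completion.

    #include <stdint.h>

    /* Assumed register layout of one MMR 311 set, indexed as 64-bit words. */
    enum { MMR_SRC = 0, MMR_DST = 1, MMR_LEN = 2, MMR_CMD = 3, MMR_STAT = 4 };
    #define CMD_START_LOAD 0x1u   /* NVM -> main memory (assumed encoding) */
    #define STAT_DONE      0x1u   /* completion flag (assumed bit)         */

    /* Kick one DMA transfer by writing its descriptor into the MMR set. */
    static void mmr_start_load(volatile uint64_t *mmr, uint64_t nvm_src,
                               uint64_t phys_dst, uint64_t len)
    {
        mmr[MMR_SRC] = nvm_src;    /* identifier in the NVM's own address space */
        mmr[MMR_DST] = phys_dst;   /* physical address inside the PIN area      */
        mmr[MMR_LEN] = len;
        mmr[MMR_CMD] = CMD_START_LOAD;   /* this store starts the transfer      */
    }

    /* Non-blocking completion check, suitable for the master control thread
       to call only when it has nothing else to do (polling, not interrupts). */
    static int mmr_load_done(const volatile uint64_t *mmr)
    {
        return (mmr[MMR_STAT] & STAT_DONE) != 0;
    }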
An NVM transfer request management cell 1100 whose DMA transfer has been detected as complete is dequeued from the WAIT queue 821 by the master control thread 550 and enqueued on the COMPLETE queue 822. Note that the order in which DMA transfers complete is not necessarily the order in which they were enqueued on the WAIT queue 821, so access to the WAIT queue 821 is not necessarily FIFO.
From the standpoint of managing DMA transfers, an NVM transfer request management cell 1100 enqueued on the COMPLETE queue 822 is information that is no longer needed, since its DMA transfer has already completed. In the information processing system 100 of this embodiment, however, the NVM transfer request management cells 1100 stored on the COMPLETE queue 822 are used as the basis for moving thread management cells 900 from the NVMWAIT queue 812 to the READY queue 810. That is, transitioning the thread that caused a DMA transfer request from the DMA-transfer-completion-waiting state to the executable state according to the completion status of that DMA transfer is the characteristic operation of this embodiment: the cooperation between DMA transfer and multithreading.
The master control thread 550 periodically monitors the COMPLETE queue 822; if an NVM transfer request management cell 1100 is present on the COMPLETE queue 822, it dequeues the cell and, using the requesting thread ID 1102 as the key, searches the NVMWAIT queue 812 for the corresponding thread management cell 900. When the master control thread 550 finds the corresponding thread management cell 900, it enqueues that thread management cell 900 on the READY queue 810 and enqueues the NVM transfer request management cell 1100 on the DISPOSE queue 823. The DISPOSE queue 823, like the FIN queue 813, is for reusing used NVM transfer request management cells 1100.
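The hand-off from the COMPLETE queue 822 to the READY queue 810 amounts to a keyed search and a move between lists. The self-contained C sketch below shows that logic with pared-down stand-ins for the two cell types; the names and list handling are assumptions, and the READY insertion is done at the head for brevity where a real implementation would append at the tail to keep FIFO order.

    #include <stddef.h>
    #include <stdint.h>

    /* Pared-down stand-ins for the cells of FIG. 9 and FIG. 11. */
    typedef struct tcell { uint64_t thread_id;           struct tcell *next; } tcell_t;
    typedef struct rcell { uint64_t requester_thread_id; struct rcell *next; } rcell_t;

    /* Move every thread whose DMA transfer has completed from the NVMWAIT
       list to the READY list, and retire the request cells to DISPOSE. */
    static void drain_complete_queue(rcell_t **complete, rcell_t **dispose,
                                     tcell_t **nvmwait, tcell_t **ready)
    {
        while (*complete != NULL) {
            rcell_t *req = *complete;        /* dequeue from COMPLETE           */
            *complete = req->next;

            tcell_t **pp = nvmwait;          /* search NVMWAIT, key = thread ID */
            while (*pp != NULL && (*pp)->thread_id != req->requester_thread_id)
                pp = &(*pp)->next;

            if (*pp != NULL) {
                tcell_t *t = *pp;
                *pp = t->next;               /* unlink from NVMWAIT             */
                t->next = *ready;            /* enqueue on READY (head push)    */
                *ready  = t;
            }

            req->next = *dispose;            /* recycle the request cell        */
            *dispose  = req;
        }
    }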
FIG. 13 is an explanatory diagram showing how the threads in the process 410 are started and stopped under the cooperation between user-level multithreading and DMA transfer described so far. The life cycle of the threads and their cooperation with DMA transfer are described below along FIG. 13.
The process 410 has kernel-level threads 1-1 to 1-5 allocated from the operating system and can assign any thread to the kernel-level threads 1-1 to 1-5 for execution. In the information processing system 100 of this embodiment, the master control thread 550 is fixedly assigned to the kernel-level thread 1-1; when the process 410 starts, it runs the master control thread 550 on the kernel-level thread 1-1.
For example, when performing large-scale graph processing, the master control thread 550 creates a thread corresponding to each vertex of the graph and performs the processing. In the example of FIG. 13, the user-level threads A, B, C, D, E, F, and G correspond to these, and the master control thread 550 first starts the user-level threads A, D, F, and G on the kernel-level threads 1-2 to 1-5, respectively.
Following the execution of the threads in time order, in (1) the user-level thread A requests the master control thread 550 to perform a DMA transfer from the NVM 320 and also declares that it has no other work it can do until that DMA transfer finishes. The master control thread 550 generates the corresponding NVM transfer request management cell 1100 and enqueues it on the REQ queue 820, suspends execution of the user-level thread A, saves its context into its thread management cell 900, and enqueues it on the NVMWAIT queue 812. The kernel-level thread 1-2 then has no thread to execute, so the master control thread 550 dequeues one thread management cell 900 from the READY queue 810 and runs that thread on the kernel-level thread 1-2 (in FIG. 13, the master control thread 550 has started execution of the user-level thread B).
Next, the user-level thread D running on the kernel-level thread 1-3 issues a DMA transfer request to the master control thread 550 in (2), as in (1). In the case of (2), however, the user-level thread D declares to the master control thread 550 that it has other work it can do without waiting for the DMA transfer to complete, so the master control thread 550 lets the user-level thread D continue executing without suspending it. This is the case, for example, when a user-level thread DMA-transfers data it will need in advance and then performs other processing; in effect, the data on the NVM 320 is prefetched by the DMA transfer.
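The two ways a user-level thread can issue a transfer request, blocking as in (1) and prefetching as in (2), can be illustrated with a hypothetical request interface. None of the names below appear in the document, and the function body is a stub that merely prints what the real master control thread would do (enqueue an NVM transfer request management cell and, for a blocking request, park the caller on the NVMWAIT queue).

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { NVM_LOAD, NVM_STORE } nvm_dir_t;   /* hypothetical */

    /* Stub of a hypothetical request call into the master control thread. */
    static void nvm_request(nvm_dir_t dir, uint64_t nvm_id, void *buf,
                            size_t len, bool block)
    {
        printf("request: dir=%d id=%llu buf=%p len=%zu %s\n",
               (int)dir, (unsigned long long)nvm_id, buf, len,
               block ? "(suspend until the DMA transfer completes)"
                     : "(prefetch; keep running)");
    }

    static char buf_a[4096], buf_d[4096];

    int main(void)
    {
        /* Like thread A at (1) in FIG. 13: nothing else to do, so ask to be
           suspended onto the NVMWAIT queue until the transfer completes. */
        nvm_request(NVM_LOAD, 1001, buf_a, sizeof buf_a, true);

        /* Like thread D at (2) in FIG. 13: other work remains, so issue the
           transfer as a prefetch and continue executing. */
        nvm_request(NVM_LOAD, 1002, buf_d, sizeof buf_d, false);
        return 0;
    }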
 Thereafter, user-level threads B and D are each suspended in turn, and execution of the next threads queued in the READY queue 810 begins. In the example of FIG. 13, user-level thread B is scheduled to run again after user-level thread C finishes, and user-level thread D enters a wait for completion of the DMA transfer it requested earlier. Next, in FIG. 13, execution of user-level threads C and E begins. During the execution of user-level thread C, completion of the DMA transfer requested at (1) is notified at (3), so user-level thread A is enqueued in the READY queue 810. User-level thread A is then scheduled after user-level thread C finishes (or is suspended) and starts running on kernel-level thread 1-2.
 On kernel-level thread 1-4, user-level thread F voluntarily returns its resources and stops at (4). Kernel-level thread 1-4 then has no thread to process for a while, after which user-level thread B is scheduled again. As described above, user-level thread B was to be rescheduled after user-level thread C completed, but the kernel-level thread it had originally been running on was taken over by user-level thread A, so it now runs on kernel-level thread 1-4. Because the virtual address space is shared among the threads within a process, the kernel-level threads 1-1 to 1-5 can be used interchangeably, much like the processors of an SMP configuration.
 FIG. 14 is a conceptual diagram showing the relationship between thread contexts (thread management information 514) and the DMA transfer of data in this embodiment. The data a thread is to process is transferred by DMA from the NVM 320 to main memory via the PIN area pool 516, while the context needed to run the thread is loaded from the main memory area where it was saved into one of the hardware threads (kernel-level threads correspond one-to-one to hardware threads).
 FIG. 15 is a flowchart showing the scheduling operation of the master control thread 550 in this embodiment. In step S1501, the master control thread 550 starts scheduling; scheduling is triggered when an idle kernel-level thread is detected. Next, in step S1502, the master control thread 550 determines whether a thread management cell 900 can be dequeued from the READY queue 810. If it cannot, scheduling ends at that point and the master control thread 550 waits for the next scheduling trigger. If it can, then in step S1503 the master control thread 550 dequeues the thread management cell 900 from the READY queue.
 Then, in step S1504, the master control thread 550 determines whether a PIN area can be allocated from the PIN area pool 516. A characteristic feature of the information processing system 100 of this embodiment is that the thread scheduling shown in FIG. 15 includes this check of whether a PIN area can be allocated.
 In the information processing system 100 of this embodiment, a DMA transfer is performed for each thread. Because DMA transfers take place without the intervention of the processor or the operating system, they are essentially performed within the physical address space: the virtual address space is managed by the operating system, so a DMA transfer cannot use virtual addresses. Consequently, when a DMA transfer is performed, the target region of main memory must reside in physical memory and be mapped in the physical address space. A process that performs DMA transfers therefore fixes part of its virtual address space to a DRAM region of the physical address space so that page-out and page-in by the virtual memory system do not occur. Such a region, fixed to physically present memory, is a PIN area, and in this embodiment it is provided to the process as the PIN area pool 516. The size of the PIN area pool 516 can be set by the user in advance.
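 On a POSIX system, the paging behaviour required of a PIN area can be approximated with mlock(), as in the sketch below. This is only an analogy for keeping pages resident: obtaining the physical addresses that the DMA engine ultimately uses is platform-specific and outside the scope of these calls, the 1 MiB pool size is an arbitrary stand-in for the user-configurable setting, and the call may require a raised RLIMIT_MEMLOCK.

```c
/* Minimal sketch of reserving a pinned buffer pool with POSIX calls.
 * mlock() prevents the pages from being paged out, which is the property the
 * text requires for DMA targets. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t pool_size = 1024 * 1024;          /* placeholder for the user-set PIN pool size */

    void *pin_pool = malloc(pool_size);
    if (!pin_pool) return 1;

    if (mlock(pin_pool, pool_size) != 0) {   /* keep the pool resident in DRAM */
        perror("mlock");                     /* may fail if RLIMIT_MEMLOCK is too low */
        free(pin_pool);
        return 1;
    }

    /* ... DMA transfers may now target addresses inside pin_pool ... */

    munlock(pin_pool, pool_size);
    free(pin_pool);
    return 0;
}
```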
 One possible approach would be to divide the PIN area pool 516 prepared for the process equally among the threads, but this works well only if the amount of data each thread needs is known in advance and the PIN area sizes the threads request are roughly equal.
 In the information processing system 100 of this embodiment, therefore, the PIN area pool 516 is prepared for the process, and the process's PIN area is managed as a resource pool. When a thread starts, it declares the size of the PIN area it needs to the master control thread 550 via the buffer request flag 907 and buffer request size 908 of its thread management cell, receives a PIN area allocation from the master control thread 550, and then moves into execution.
 In step S1504, therefore, the current remaining capacity of the PIN area pool 516 is compared with the amount of area requested by the thread about to be started, to decide whether the thread can be started; if running threads have already been granted parts of the PIN area pool 516, the remaining capacity is reduced accordingly. If it is determined that the remaining capacity of the PIN area pool 516 is insufficient, then in step S1506 the master control thread 550 places the thread at the tail of the READY queue 810 and gives priority to executing other threads. If, on the other hand, the required area can be secured from the PIN area pool 516, then in step S1505 the master control thread 550 allocates a PIN area to the requesting thread and performs the thread switch. The master control thread 550 notifies the requesting thread of the allocated area by writing to the buffer allocation flag 909 and buffer area start address 910 of its thread management cell 900.
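 Steps S1502 to S1506 can be sketched roughly as follows, with the PIN area pool reduced to a bump allocator and the release of regions by finished threads omitted; the field names only loosely mirror the buffer request size 908, buffer allocation flag 909, and buffer area start address 910, and nothing here is taken verbatim from the specification.

```c
/* Rough sketch of S1502-S1506: take a candidate from READY, check the PIN pool,
 * and either grant a region and run the thread or requeue the candidate. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct tcell {
    struct tcell *next;
    int           thread_id;
    size_t        buf_request_size;   /* size the thread declared (cf. field 908) */
    bool          buf_allocated;      /* allocation granted       (cf. field 909) */
    uintptr_t     buf_start_addr;     /* start of granted region  (cf. field 910) */
} tcell_t;

typedef struct { tcell_t *head, *tail; } tqueue_t;

typedef struct {
    uintptr_t base;
    size_t    total;   /* user-configured pool size              */
    size_t    used;    /* bytes granted to threads still running */
} pin_pool_t;

static void enqueue_tail(tqueue_t *q, tcell_t *c)
{
    c->next = NULL;
    if (q->tail) q->tail->next = c; else q->head = c;
    q->tail = c;
}

static tcell_t *dequeue(tqueue_t *q)
{
    tcell_t *c = q->head;
    if (c) { q->head = c->next; if (!q->head) q->tail = NULL; }
    return c;
}

static void schedule_once(tqueue_t *ready, pin_pool_t *pool)
{
    tcell_t *cand = dequeue(ready);                    /* S1502 / S1503 */
    if (!cand)
        return;                                        /* nothing runnable: wait for next trigger */

    if (pool->total - pool->used < cand->buf_request_size) {
        enqueue_tail(ready, cand);                     /* S1506: pool too small right now */
        return;
    }

    cand->buf_start_addr = pool->base + pool->used;    /* S1505: grant a PIN region */
    cand->buf_allocated  = true;
    pool->used          += cand->buf_request_size;

    printf("thread %d runs with a %zu-byte PIN region\n",
           cand->thread_id, cand->buf_request_size);   /* stand-in for the thread switch */
}

int main(void)
{
    pin_pool_t pool  = { .base = 0x100000, .total = 1 << 20, .used = 0 };
    tqueue_t   ready = { NULL, NULL };
    tcell_t    t1 = { .thread_id = 1, .buf_request_size = 512 * 1024 };
    tcell_t    t2 = { .thread_id = 2, .buf_request_size = 768 * 1024 };

    enqueue_tail(&ready, &t1);
    enqueue_tail(&ready, &t2);
    schedule_once(&ready, &pool);   /* t1 gets its region                    */
    schedule_once(&ready, &pool);   /* t2 is requeued: only 512 KiB remains  */
    return 0;
}
```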
 By securing memory space in the physical address space of main memory according to the capacity each scheduled thread needs for its DMA transfer, the system can flexibly accommodate variation in the PIN area sizes requested by individual threads. This in turn enables efficient DMA transfers and speeds up the processing of the information processing system 100.
 This embodiment describes an example of an information processing system that can execute threads even when, for instance in still larger-scale graph processing, the number of threads is so large that their thread management cells 900 no longer fit in main memory.
 When the number of vertices becomes enormous, as in large-scale graph processing, the number of threads also becomes enormous, and it becomes impossible even to keep the thread management cells 900 of all threads in main memory. In this embodiment, therefore, as shown in FIG. 16, the thread management cells 900 themselves are placed in the NVM 320 and are DMA-transferred to main memory for use as needed. Furthermore, the master control thread 550 DMA-transfers a thread management cell 900 from the NVM 320 in advance, before it is actually needed; in other words, the thread management cells 900 are prefetched. As shown in FIG. 16, thread contexts are DMA-transferred between main memory and the NVM, being saved or restored as required.
 Since thread management cells 900 are basically needed in the order in which they were enqueued in the READY queue 810, it suffices to monitor the READY queue 810 and prefetch in that order. Alternatively, a separate DMA controller may be provided that holds the head address of the READY queue 810 and reads the READY queue 810 in parallel with the processor to perform the prefetching.
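 A prefetcher of this kind might look like the following sketch, which walks a fixed window of upcoming READY entries and issues an NVM-to-DRAM read for any cell not yet resident; issue_nvm_read stands in for whatever programs the hybrid memory controller 310 and, like the other names here, is an assumption rather than part of the specification.

```c
/* Sketch of prefetching thread management cells in READY-queue order. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct entry {
    struct entry *next;
    int           thread_id;
    size_t        nvm_offset;   /* where the full cell lives in the NVM */
    bool          resident;     /* already copied into main memory?     */
} entry_t;

static void issue_nvm_read(size_t nvm_offset, int thread_id)
{
    /* stub: would enqueue a DMA descriptor for the hybrid memory controller */
    printf("prefetch cell of thread %d from NVM offset %zu\n", thread_id, nvm_offset);
}

/* Walk the first `depth` READY entries and prefetch the ones still in NVM. */
static void prefetch_ready_cells(entry_t *ready_head, int depth)
{
    for (entry_t *e = ready_head; e && depth > 0; e = e->next, depth--) {
        if (!e->resident) {
            issue_nvm_read(e->nvm_offset, e->thread_id);
            e->resident = true;   /* optimistically mark as in flight */
        }
    }
}

int main(void)
{
    entry_t c = { NULL, 3, 8192, false };
    entry_t b = { &c,   2, 4096, true  };
    entry_t a = { &b,   1, 0,    false };

    prefetch_ready_cells(&a, 2);   /* prefetches thread 1; thread 2 is already resident */
    return 0;
}
```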
 The invention made by the present inventors has been described above in detail on the basis of embodiments, but the present invention is not limited to those embodiments, and it goes without saying that various modifications are possible without departing from the gist of the invention.
100: information processing system, 110: node, 120: inter-node network, 130: NVM subsystem interconnect, 210, 220: processor, 230, 240: DIMM, 250: I/O hub, 260: NIC, 270: disk controller, 280: HDD, 290: SSD, 300: NVM subsystem, 310: hybrid memory controller, 320: NVM (non-volatile memory), 330: volatile memory, 311: MMR (memory-mapped register), 410, 420: process, 510: inter-thread shared resources, 511: program code, 512: global variables, 513: heap area, 514: thread management information, 515: NVM transfer request management information, 516: PIN area pool.

Claims (11)

  1.  An information processing apparatus comprising:
     a multithreaded processor;
     a first storage device;
     a second storage device that performs DMA transfers with the first storage device; and
     an operating system that allocates a physical address space to the first storage device and provides a virtual address space on the physical address space,
     wherein a memory space is secured according to the capacity that a thread scheduled for execution requires for the DMA transfer in the physical address space,
     the thread is moved into execution, and
     the secured memory space is released after processing of the thread is completed.
  2.  The information processing apparatus according to claim 1, wherein
     an area of the physical address space to be used for the DMA transfer is set in advance, and
     the memory space is secured within the preset area, excluding the areas already secured for threads being executed.
  3.  The information processing apparatus according to claim 1, wherein
     a thread that has been moved into execution and is waiting for completion of a DMA transfer is suspended.
  4.  The information processing apparatus according to claim 3, wherein
     the context of a suspended thread is stored in the first storage device.
  5.  The information processing apparatus according to claim 3, wherein
     the context of a suspended thread is stored in the second storage device.
  6.  The information processing apparatus according to claim 1, wherein
     the first storage device is a main storage device.
  7.  The information processing apparatus according to claim 1, wherein
     the first storage device comprises a volatile memory, and
     the second storage device comprises a nonvolatile memory.
  8.  The information processing apparatus according to claim 7, further comprising a hard disk drive.
  9.  The information processing apparatus according to claim 1, wherein
     the multithreaded processor has a plurality of cores.
  10.  The information processing apparatus according to claim 1, wherein
     the volatile memory is a DRAM, and
     the nonvolatile memory is a flash memory.
  11.  The information processing apparatus according to claim 1, wherein
     the volatile memory is a DRAM, and
     the nonvolatile memory is a phase change memory.
PCT/JP2012/069078 2012-07-27 2012-07-27 Information processing device WO2014016951A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device
JP2014526682A JP5847313B2 (en) 2012-07-27 2012-07-27 Information processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device

Publications (1)

Publication Number Publication Date
WO2014016951A1 (en) 2014-01-30

Family

ID=49996784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device

Country Status (2)

Country Link
JP (1) JP5847313B2 (en)
WO (1) WO2014016951A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8706923B2 (en) * 2010-09-14 2014-04-22 Texas Instruments Incorported Methods and systems for direct memory access (DMA) in-flight status

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005190293A (en) * 2003-12-26 2005-07-14 Toshiba Corp Information processor, information processing program
JP2008021290A (en) * 2006-06-15 2008-01-31 Hitachi Ulsi Systems Co Ltd Storage device, storage controller, and information processing apparatus
JP2010152527A (en) * 2008-12-24 2010-07-08 Sony Computer Entertainment Inc Method and apparatus for providing user level dma and memory access management

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021210123A1 (en) * 2020-04-16 2021-10-21 日本電信電話株式会社 Scheduling method, scheduler, gpu cluster system, and program
JP7385156B2 (en) 2020-04-16 2023-11-22 日本電信電話株式会社 Scheduling method, scheduler, GPU cluster system and program

Also Published As

Publication number Publication date
JP5847313B2 (en) 2016-01-20
JPWO2014016951A1 (en) 2016-07-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12881885

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014526682

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12881885

Country of ref document: EP

Kind code of ref document: A1