WO2014016951A1 - Information processing device - Google Patents

Information processing device

Info

Publication number
WO2014016951A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
information processing
processing apparatus
memory
storage device
Prior art date
Application number
PCT/JP2012/069078
Other languages
French (fr)
Japanese (ja)
Inventor
地尋 吉村
由子 長坂
秀貴 青木
Original Assignee
株式会社日立製作所
Priority date
Filing date
Publication date
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Priority to PCT/JP2012/069078 priority Critical patent/WO2014016951A1/en
Priority to JP2014526682A priority patent/JP5847313B2/en
Publication of WO2014016951A1 publication Critical patent/WO2014016951A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0223: User address space allocation, e.g. contiguous or non contiguous base addressing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14: Handling requests for interconnection or transfer
    • G06F13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10: Address translation
    • G06F12/1081: Address translation for peripheral access to main memory, e.g. direct memory access [DMA]

Definitions

  • the present invention relates to an information processing apparatus that performs processing by transferring data to a storage device by DMA transfer, and particularly relates to an information processing apparatus characterized by switching and executing a plurality of threads.
  • the computer is composed of a storage device that stores data and a central processing unit (CPU) that reads and processes data from the storage device.
  • CPU central processing unit
  • The faster the storage device, the higher its price (unit price) per bit.
  • The higher the speed, the lower the number of bits per unit area or unit volume (recording density). Therefore, a high-speed but expensive, small-capacity storage device is placed near the CPU and holds the data needed most immediately. Data that does not fit there is placed in a slow but inexpensive, large-capacity storage device, and data is exchanged between the two storage devices as needed.
  • Because storage devices involve a trade-off between speed and cost, or between speed and capacity, the so-called storage hierarchy concept, in which multiple storage devices with different properties are used hierarchically, has been widely adopted in the computer world.
  • the CPU processes the data stored in the register.
  • the CPU searches the cache and, if stored in the cache, reads the data from the cache into the register and performs processing.
  • the CPU reads data from the main memory into the cache. If there is no data to be processed in the main memory, the data is read from the storage into the main memory.
  • While data is being read, the CPU cannot carry out the processing it should perform, so it sits idle and the CPU utilization rate decreases. The same problem occurs not only when reading data but also when writing data.
  • DMA Direct Memory Access
  • Since data transfer can be performed without CPU intervention, the CPU should in principle be able to perform other processing during the idle time of the data transfer.
  • A method is also used in which the programmer schedules the timing of DMA transfers and the processing performed by the CPU, by identifying in advance the points in the program where data will be needed and explicitly embedding DMA transfer instructions.
  • this method causes a problem that program tuning becomes complicated.
  • Multithreading has been used as a technology to deal with this.
  • a unit of processing that can be performed concurrently is defined as a thread, and when execution of a certain thread stops, another executable thread is executed.
  • the thread execution stops not only when the processing of the thread is completed, but also when the data required by the thread is read. That is, when considered in conjunction with the previous data reading operation, when a certain thread is executed and necessary data starts to be read, another thread is executed during that time. In this way, the CPU utilization rate can be increased.
  • As prior art relating to such DMA transfer and multithreading, there are the techniques disclosed in Patent Document 1 and Patent Document 2. In both, DMA transfer is performed between an on-chip memory (also referred to as local memory) on the CPU and the main memory (also referred to as global memory).
  • on-chip memory also referred to as local memory
  • main memory also referred to as global memory
  • DRAM has a higher unit price than the HDDs, flash memory, and phase change memory used for storage, so configuring the main memory from a large amount of DRAM increases cost. DRAM is also inferior in recording density, so the device becomes huge compared with an HDD or flash memory of the same capacity. Therefore, the inventors of the present application attempted to solve the read-speed problem while providing the necessary capacity by performing DMA transfer between the main memory and a storage area to which data is saved from the main memory.
  • thread scheduling is often based on a simple FIFO that is executed by taking an executable thread from a queue. For this reason, uniform execution time of each thread enables efficient scheduling. Therefore, when dividing a process into threads, it is desirable to divide the processing so that the load is equal.
  • In those techniques, DMA transfer is performed between the on-chip memory on the CPU and the main memory, and the transferred data is limited to data that the CPU will process immediately, so they are not techniques for solving the above-described variation in data size.
  • An object of the present invention is to realize an efficient DMA transfer even in a situation where the data transfer amount varies.
  • The information processing apparatus includes a multi-thread processor, a first storage device, a second storage device that performs DMA transfer with the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on the physical address space.
  • The above-mentioned problem is solved by securing, on the physical address space, a memory space corresponding to the capacity that the thread scheduled for execution requires for DMA transfer, moving the thread to execution, and releasing the secured memory space after the processing of the thread is completed.
  • efficient DMA transfer can be realized even in a situation where the amount of data transfer varies, and the processing of the information processing apparatus can be speeded up.
  • In the following, an information processing system 100 will be described in which a nonvolatile memory such as flash memory or phase change memory is used to provide applications with a memory larger than one composed of DRAM, and in which the slowness that is a disadvantage of the nonvolatile memory is compensated for by multithreading.
  • FIG. 1 is a diagram illustrating an example of a configuration of an information processing system 100 according to the present embodiment.
  • the information processing system 100 has at least one node 110.
  • the node 110 is an information processing apparatus, for example, a server apparatus.
  • the example of FIG. 1 shows an example of a four-node configuration of nodes 0 to 3 (reference numeral 110).
  • the nodes are connected by an inter-node network 120.
  • the information processing system 100 may further include an NVM subsystem interconnect 130 that connects non-volatile memory (NVM) subsystems described later.
  • NVM non-volatile memory
  • FIG. 2 is a diagram illustrating an example of the configuration of the node 110 that is an information processing apparatus.
  • the node 110 includes processors 210 and 220, DIMMs 230 and 240, an I / O hub 250, a NIC 260, a disk controller 270, an HDD 280, an SSD 290, and an NVM subsystem 300.
  • the main memory is composed of DIMM 230 and DIMM 240 which are storage devices.
  • the DIMMs 230 and 240 are composed of DRAM which is a volatile memory.
  • each node 110 has at least one processor, and the node 110 in FIG. 2 is an example of a two-processor configuration of the processors 210 and 220.
  • each of the processors 210 and 220 may be a multi-core processor.
  • In this embodiment, each processor has two cores, so the node 110 as a whole has four cores.
  • each core may support simultaneous multithreading (SMT).
  • SMT simultaneous multithreading
  • In this embodiment, each core supports two-way SMT, so each processor can process four threads simultaneously. That is, each processor is a multi-thread processor.
  • threads that can be processed simultaneously as hardware are referred to as hardware threads.
  • the I / O hub 250 provides an interface for connecting various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300.
  • the I / O hub 250 is connected to the processors 210 and 220 via a system bus provided by each processor. For the connection, for example, a bus such as HyperTransport is used.
  • the I / O hub 250 is connected to various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300 by a peripheral bus for connecting peripheral devices such as PCI Express.
  • In this embodiment, the I / O hub 250, the NIC 260, the disk controller 270, and the NVM subsystem 300 are described as being connected by PCI Express, but the present invention can also be implemented with other interconnect means.
  • The capacity of the main memory of a computer is determined by the capacity of its DIMMs.
  • Data that does not fit in the main memory is stored in the HDD or SSD as storage.
  • the storage is connected via a disk controller, and an interface such as SAS (Serial Attached SCSI) or SATA (Serial Advanced Technology Attachment) is used in terms of hardware.
  • SAS Serial Attached SCSI
  • SATA Serial Advanced Technology Attachment
  • the interface seen from the software is a file system.
  • the application reads / writes the file, and the device driver of the operating system controls the disk controller via the file system to read / write the HDD and SSD. For this reason, reading and writing cannot be performed without going through a plurality of hierarchies, resulting in a large overhead.
  • The information processing system 100 includes an NVM subsystem 300 in order to read and write a nonvolatile memory that has a larger capacity than the DIMMs and is faster than HDD or SSD storage.
  • NVM non-volatile memory
  • FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem 300.
  • the NVM subsystem 300 includes a hybrid memory controller 310, a nonvolatile memory (NVM) 320 that is a storage device, and a volatile memory 330 that is a storage device.
  • the NVM 320 is a nonvolatile memory such as a flash memory or a phase change memory. Further, the volatile memory 330 is a DRAM, and a DIMM can be used.
  • the hybrid memory controller 310 is connected to the NVM 320, the volatile memory 330, and the I / O hub 250.
  • The hybrid memory controller 310 DMA-transfers the data stored in the NVM 320 to the DIMM 230 or 240, which is the main memory, in response to a request from software operating on the processor 210 or 220.
  • the hybrid memory controller 310 plays a role of DMA-transferring data stored in the DIMM 230 or 240 that is the main memory to the NVM 320.
  • the volatile memory 330 is used as a buffer during DMA transfer.
  • the hybrid memory controller 310 of each node can be connected by the NVM subsystem interconnect 130 as described above. This connection enables access to data stored in the NVM subsystem 300 of another node.
  • the hybrid memory controller 310 has a memory-mapped register (MMR: Memory Mapped Register) 311.
  • MMR Memory Mapped Register
  • the MMR 311 is a register used by software operating on the processors 210 and 220 to instruct the hybrid memory controller 310 to perform DMA transfer.
  • With PCI Express, it is possible to map the registers of peripheral devices into the same memory space as the main memory. Therefore, software can access the MMR 311 with the load/store instructions of the processors 210 and 220 in the same manner as when reading and writing the main memory.
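  • As a minimal sketch (not taken from this patent), once such an MMR region has been mapped into a process, it can be read and written with ordinary load/store accesses from C; the device path, register offsets, and command encoding below are purely illustrative assumptions.

```c
/* Hypothetical sketch of accessing a memory-mapped register (MMR) region
 * over PCI Express. Paths, offsets, and the command encoding are assumed
 * for illustration only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMR_SIZE        4096u
#define REG_DMA_CMD     0x00   /* hypothetical: DMA command register        */
#define REG_DMA_STATUS  0x08   /* hypothetical: DMA completion status flags */

int main(void)
{
    /* Map the PCI BAR that exposes the hybrid memory controller's MMR
     * (the sysfs path is an assumption). */
    int fd = open("/sys/bus/pci/devices/0000:00:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *mmr =
        mmap(NULL, MMR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mmr == MAP_FAILED) { perror("mmap"); return 1; }

    /* An ordinary store instruction writes the register ... */
    mmr[REG_DMA_CMD / sizeof(uint64_t)] = 0x1;   /* e.g. "start DMA" */

    /* ... and an ordinary load reads it back, just like main memory. */
    uint64_t status = mmr[REG_DMA_STATUS / sizeof(uint64_t)];
    printf("status = 0x%llx\n", (unsigned long long)status);

    munmap((void *)mmr, MMR_SIZE);
    close(fd);
    return 0;
}
```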
  • On the node 110, an operating system that supports virtual memory operates.
  • the node 110 is composed of a plurality of cores as described above, but has a symmetric multiprocessing (SMP) configuration in which all the cores share a single main memory. Therefore, a single operating system operates on the node 110.
  • SMP symmetric multiprocessing
  • an embodiment will be described on the premise of a single system image in which one operating system is operated on the node 110.
  • the operating system operating on each node 110 allocates a physical address space to the main memory configured by the DIMMs 230 and 240 of each node 110 and provides a virtual address space on the physical address space.
  • Using the address translation mechanism of the processor's MMU (Memory Management Unit), the virtual address space for applications (user space) and the virtual address space in which the operating system kernel operates (kernel space) are separated to ensure system security and robustness.
  • the user space has an independent virtual address space for each unit of process.
  • a thread in an environment having such a concept of a process takes a form dependent on the process. That is, each process has one or more threads, and each thread operates by sharing the virtual address space of the parent process.
  • When a single operating system manages multiple cores, the operating system must abstract the cores in some way and assign them to processes so that applications are provided with an environment in which multiple cores can be used. For this purpose, the concept of threads is used. As shown in FIG. 4, the operating system provides kernel level threads to each process.
  • The number of hardware threads (M) is limited with respect to the number of threads to be executed (N). When N > M, the number of threads required by the application cannot be secured by hardware threads alone, and relying on context switching in the kernel lowers the efficiency of the entire system. Therefore, there is a need for a method for making a large number of threads available (increasing N) while avoiding context switching in the kernel.
  • As shown in FIG. 5, the information processing system 100 is characterized in that a master control thread 550 and user level threads are provided in each of the processes 410 and 420 of FIG. 4.
  • the process 410 has an inter-thread shared resource 510 and at least two or more kernel level threads allocated from the kernel.
  • a master control thread 550 is fixedly assigned to one of the plurality of kernel level threads.
  • user level threads required by the application are allocated to kernel level threads in a time division manner.
  • The master control thread 550 is a thread that always occupies one kernel level thread and continues to run for the lifetime of the process 410; it performs context switching of user level threads, scheduling, and management of resources shared between threads. Unlike kernel level threads, the master control thread 550 performs these operations within the process 410, so context switching can be realized at higher speed without switching to the kernel space. That is, the master control thread 550 can speed up context switching while a large number of threads are used.
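  • As an illustration only (this is not the patent's implementation), a user-level context switch performed entirely inside one process can be sketched with the POSIX ucontext API as follows; the stack size and control flow are assumptions.

```c
/* Minimal sketch of user-level context switching inside a single process.
 * The master context hands the CPU to a user-level thread and gets it back
 * when that thread yields, without a kernel-level thread switch. */
#include <stdio.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t master_ctx, worker_ctx;
static char worker_stack[STACK_SIZE];

static void worker(void)
{
    printf("user-level thread: running\n");
    /* Yield back to the master control context. */
    swapcontext(&worker_ctx, &master_ctx);
    printf("user-level thread: resumed, finishing\n");
}

int main(void)
{
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp   = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link          = &master_ctx;  /* return here when worker ends */
    makecontext(&worker_ctx, worker, 0);

    printf("master: dispatching worker\n");
    swapcontext(&master_ctx, &worker_ctx);      /* run worker until it yields   */
    printf("master: worker yielded, resuming it\n");
    swapcontext(&master_ctx, &worker_ctx);      /* resume worker to completion  */
    printf("master: done\n");
    return 0;
}
```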
  • threads operate by sharing process resources.
  • a model is used in which each thread has a unique stack and other areas are shared with other threads.
  • a PIN area resource allocation described later is further performed for a user level thread.
  • FIG. 6 is an explanatory diagram showing the correspondence between the physical address space 610 included in the node 110 and the virtual address space 620 included in the process 410 operating on the node 110.
  • the area arranged in the physical address space 610 basically corresponds one-to-one with the physical components of the node 110.
  • the DRAM area 611 corresponds to the DIMMs 230 and 240.
  • An MMIO (Memory Mapped Input / Output) area 612 is an area where the MMR 311 described above is arranged. All of these areas are managed in units called pages. In general, one page is 4 KB in size.
  • the virtual address space 620 of the process 410 can be roughly divided into a text area 621, a data area 622, a mmio area 623, and a stack area 624.
  • the process 410 of this embodiment further has a PIN area pool 516.
  • Each area in the virtual address space 620 will be described below by associating the internal structure of the process 410 shown in FIG. 5 with the virtual address space 620 of the process 410 shown in FIG.
  • the process 410 has various resources shared between user level threads as an inter-thread shared resource 510.
  • the inter-thread shared resource 510 includes a program code 511, a global variable 512, a heap area 513, thread management information 514, NVM transfer request management information 515, and a PIN area pool 516.
  • the program code 511 is an instruction sequence of a program to be executed by a thread, and is arranged in the text area 621 on the virtual address space 620.
  • the global variable 512 is a variable that is commonly used by any subroutine or thread that operates in the process 410, and is arranged in the data area 622.
  • the heap area 513 is a resource pool when the program dynamically secures memory, and is arranged in the data area 622.
  • The thread management information 514 stores the information needed to manage each thread. It is used mainly by the master control thread 550, but because it must also be accessible from the user level threads, it has the same properties as the global variable 512 and is placed in the data area 622.
  • the NVM transfer request management information 515 is information for managing DMA transfer described later, and is placed in the data area 622 for the same reason as the thread management information 514.
  • The stack area is an area that holds the stacks used for local variables and for passing subroutine parameters, and is allocated to each thread as described later.
  • the correspondence between the physical address space 610 and the virtual address space 620 is indicated by a broken line.
  • The physical address space 610 and the virtual address space 620 are mapped to each other on a page basis, but there can be pages that exist in the virtual address space 620 and yet have no corresponding page in the physical address space 610.
  • When such a page is accessed, the MMU generates a page fault exception, and the operating system reads the saved page from the HDD or SSD and pages it in.
  • In other words, the information processing system 100, which adopts virtual memory, has the property that a memory area (page) that exists from the viewpoint of a process does not necessarily exist in physical memory. The effect of this property on DMA transfer will be described later.
  • FIG. 7 shows the relationship between the PIN area pool 516 of the virtual address space 620 and the thread of the stack area 624.
  • Both the PIN area pool 516 and the stack area 624 are divided for each thread to be used.
  • The stack area 624 always contains a stack for every thread (the master control thread and the user level threads) of the process 410, whereas the PIN area pool 516 contains areas for only some of the threads, according to its size. This is because the stack area 624 can secure as much area as the virtual address space allows by using the virtual memory mechanism, whereas the PIN area pool 516, for the reason described later, is prepared only in an amount for which pages corresponding to the DRAM area 611 of the physical address space 610 can be secured.
  • the user level multithreading means that a plurality of threads (user level threads) are operated while being switched in the process 410. Since processing necessary for thread switching is completed in the process 410, the processing is faster than kernel level thread switching. On the other hand, in user level multithreading, thread management is also performed in the process 410. In the information processing system 100 of this embodiment, the master control thread 550 plays a role of thread management.
  • FIG. 8 is a diagram showing a user level thread by the master control thread 550 and a queue for managing DMA transfer. These queues are stored in the memory as thread management information 514 and NVM transfer request management information 515.
  • the thread management information 514 includes a READY queue 810, an IOWAIT queue 811, an NVMWAIT queue 812, and a FIN queue 813.
  • the entry enqueued in each queue is the thread management cell 900 shown in FIG.
  • The thread management cell 900 consists of a Valid flag 901, a thread ID 902, a thread state 903, a save context 904, a save stack pointer 905, a save program counter 906, a buffer request flag 907, a buffer request size 908, a buffer allocation flag 909, and a buffer area head address 910.
  • the Valid flag 901 is a flag indicating whether or not the thread management cell 900 is valid.
  • the thread ID 902 is an identifier for uniquely identifying a thread, and is used to realize an operation that is a feature of the present invention in which DMA transfer and thread scheduling described later are linked.
  • the thread state 903 is information for indicating what state the thread is currently in. The thread state will be described in detail later.
  • The save context 904, the save stack pointer 905, and the save program counter 906 are information used for executing a thread; they are saved from the registers of the processors 210 and 220 into the thread management cell 900 when the thread is stopped.
  • the buffer request flag 907, the buffer request size 908, the buffer allocation flag 909, and the buffer area head address 910 are used for DMA transfer described later, and will be described in detail later.
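  • For concreteness, the thread management cell 900 could be rendered in C roughly as below; the field widths, the queue link, and the state encoding are assumptions made for this sketch, not definitions from the patent.

```c
/* Hypothetical C layout of the thread management cell 900 of FIG. 9.
 * Field widths and the state values are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    THREAD_READY,    /* executable, queued in the READY queue        */
    THREAD_RUN,      /* currently running on a kernel-level thread   */
    THREAD_IOWAIT,   /* waiting for a system call to complete        */
    THREAD_NVMWAIT,  /* waiting for an NVM DMA transfer to complete  */
    THREAD_FIN       /* finished or aborted; cell awaiting reuse     */
} thread_state_t;

struct thread_management_cell {               /* 900 */
    bool            valid;                    /* 901: cell is in use             */
    uint64_t        thread_id;                /* 902: unique thread identifier   */
    thread_state_t  thread_state;             /* 903 */
    uint64_t        saved_context[32];        /* 904: saved general registers    */
    void           *saved_stack_pointer;      /* 905 */
    void           *saved_program_counter;    /* 906 */
    bool            buffer_request_flag;      /* 907: thread requests a PIN area */
    size_t          buffer_request_size;      /* 908: requested PIN area size    */
    bool            buffer_alloc_flag;        /* 909: PIN area has been granted  */
    void           *buffer_area_head;         /* 910: head address of the area   */
    struct thread_management_cell *next;      /* queue link (assumption)         */
};
```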
  • In the READY queue 810, the thread management cells 900 of executable user level threads are enqueued.
  • the master control thread 550 dequeues the thread management cell 900 from the READY queue 810 when a kernel level thread included in the process 410 is free or when another user level thread is stopped.
  • context switching is performed using the save context 904, save stack pointer 905, and save program counter 906 included in the dequeued thread management cell 900, and the execution of the thread is started.
  • Threads that are neither executing (RUN) nor being ready (READY) are in some waiting state.
  • the master control thread 550 uses the IOWAIT queue 811 and the NVMWAIT queue 812 to manage the waiting state.
  • having the NVMWAIT queue 812 is a feature of the information processing system 100 of the present embodiment.
  • the IOWAIT queue 811 is a queue in which threads waiting for the completion of I / O requested by the system call to the operating system are stored.
  • When a user level thread makes an I/O request that uses an operating system function, such as access to a file, it issues a system call to inform the operating system. If the thread that issued the system call then has no processing to perform until the system call completes, that is, if its processing resumes only after the system call finishes, the master control thread 550 saves the thread that issued the system call in the IOWAIT queue 811. When execution of the system call is completed, the master control thread 550 moves the saved thread to the READY queue 810.
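  • The queue handling implied above can be sketched as plain singly linked FIFOs over the hypothetical cell structure shown earlier; the helper names and the linear unlink are illustrative only.

```c
/* Illustrative FIFO queues of thread management cells, plus the move of a
 * thread from the IOWAIT queue to the READY queue when its system call
 * completes. Builds on the hypothetical struct sketched above. */
#include <stddef.h>

struct tmc_queue {
    struct thread_management_cell *head;
};

static void enqueue(struct tmc_queue *q, struct thread_management_cell *c)
{
    c->next = NULL;
    struct thread_management_cell **pp = &q->head;
    while (*pp)                      /* append at the tail */
        pp = &(*pp)->next;
    *pp = c;
}

static struct thread_management_cell *dequeue(struct tmc_queue *q)
{
    struct thread_management_cell *c = q->head;
    if (c) { q->head = c->next; c->next = NULL; }
    return c;
}

static void remove_cell(struct tmc_queue *q, struct thread_management_cell *c)
{
    struct thread_management_cell **pp = &q->head;
    while (*pp && *pp != c)
        pp = &(*pp)->next;
    if (*pp) { *pp = c->next; c->next = NULL; }
}

/* Master control thread: the system call a saved thread was waiting on has
 * completed, so make the thread executable again. */
static void on_syscall_complete(struct tmc_queue *iowait_q,
                                struct tmc_queue *ready_q,
                                struct thread_management_cell *c)
{
    remove_cell(iowait_q, c);
    c->thread_state = THREAD_READY;
    enqueue(ready_q, c);
}
```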
  • the NVMWAIT queue 812 is a queue in which threads waiting for completion of DMA transfer between the main memory and the nonvolatile memory (NVM) 320 are stored.
  • NVM nonvolatile memory
  • a thread corresponding to each of a large number of vertices is executed for large-scale graph processing.
  • the information processing system 100 stores data in the non-volatile memory (NVM) 320 and brings it from the NVM 320 to the main memory by DMA transfer when executing a thread.
  • the main memory is composed of DRAM
  • the NVM 320 is composed of flash memory and phase change memory, so that a large capacity can be realized at a lower cost than DRAM.
  • While the requested DMA transfer is in progress, the master control thread 550 saves the thread in the NVMWAIT queue 812.
  • When the DMA transfer is completed, the master control thread 550 moves the thread saved in the NVMWAIT queue 812 to the READY queue 810.
  • the FIN queue 813 is a queue for collecting the thread management cells 900 that have become unnecessary after execution is completed.
  • A thread management cell 900 exists for each thread. Therefore, when a large number of threads are repeatedly created and destroyed, as in dynamic large-scale graph processing, generating a thread management cell 900 by allocating memory from the heap area 513 each time, or returning the area to the heap area 513 each time, incurs a large overhead. Therefore, the master control thread 550 collects used thread management cells 900 in the FIN queue 813 and reuses them from the FIN queue 813 as necessary.
  • FIG. 10 summarizes the various states of the threads described so far as a state transition diagram.
  • the master control thread 550 enqueues the thread management cell 900 of the new thread into the READY queue 810.
  • When a resource for executing a thread becomes free, for example when an already executing user level thread is stopped and moved to the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812, the master control thread 550 dequeues the thread management cell 900 at the head of the READY queue 810 and starts executing that thread. The thread is then in the executing (RUN) state.
  • The master control thread 550 suspends the execution of the thread and enqueues it in the READY queue 810.
  • When the executing thread issues a system call, for example for file access, and waits for completion of the system call, the master control thread 550 temporarily suspends execution of the thread and enqueues it in the IOWAIT queue 811.
  • In (6) of FIG. 10, when the master control thread 550 detects completion of the system call that the thread enqueued in the IOWAIT queue 811 in (4) has been waiting for, it moves the thread from the IOWAIT queue 811 to the READY queue 810.
  • (7) of FIG. 10 is the corresponding operation for DMA transfer: when the DMA transfer that a thread enqueued in the NVMWAIT queue 812 has been waiting for actually completes, the thread is moved from the NVMWAIT queue 812 to the READY queue 810.
  • A thread whose awaited system call or DMA transfer has completed becomes executable again, and then waits to be scheduled and have its execution started.
  • a method for detecting completion of DMA transfer will be described later.
  • When the execution of a thread is completed or aborted, the master control thread 550 moves the thread management cell 900 of the completed or aborted thread to the FIN queue 813 (corresponding to (8) and (9)). Note that the abort of a thread can also occur while the thread is in the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812; in these cases as well, the master control thread 550 moves the thread management cell 900 of the aborted thread to the FIN queue 813. Furthermore, when the execution of a thread is completed or aborted, the master control thread 550 cancels the allocation of the PIN area (described later) that was allocated to the thread whose processing has finished, that is, it releases the memory space allocated as the PIN area.
  • The thread management cells 900 in the FIN queue 813 are released by the master control thread 550 as appropriate when the heap area 513 runs short. When a new thread is created (operation (1) in FIG. 10), thread management cells 900 from the FIN queue 813 are reused preferentially; if there are still not enough, the master control thread 550 allocates an area for the thread management cell 900 from the heap area 513.
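  • A minimal sketch of this reuse policy, using the hypothetical structures and queue helpers above, might look as follows; the fallback to malloc stands in for allocation from the heap area 513.

```c
/* Illustrative reuse of thread management cells via the FIN queue:
 * prefer a recycled cell, fall back to heap allocation, and release
 * recycled cells when memory runs short. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static struct tmc_queue fin_q;   /* cells of finished or aborted threads */

static struct thread_management_cell *alloc_tmc(void)
{
    /* Reuse a cell from the FIN queue if one is available ... */
    struct thread_management_cell *c = dequeue(&fin_q);
    if (c == NULL) {
        /* ... otherwise allocate a fresh cell from the heap area. */
        c = malloc(sizeof(*c));
        if (c == NULL)
            return NULL;
    }
    memset(c, 0, sizeof(*c));
    c->valid = true;
    return c;
}

/* When the heap area runs short, accumulated FIN-queue cells can be freed. */
static void trim_fin_queue(void)
{
    struct thread_management_cell *c;
    while ((c = dequeue(&fin_q)) != NULL)
        free(c);
}
```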
  • the NVM transfer request management information 515 includes a REQ queue 820, a WAIT queue 821, a COMPLETE queue 822, and a DISPOSE queue 823.
  • the entry enqueued in each queue is the NVM transfer request management cell 1100 shown in FIG.
  • The NVM transfer request management cell 1100 includes a Valid flag 1101, a request source thread ID 1102, a thread management cell pointer 1103, a transfer direction 1104, a transfer state 1105, a transfer source address 1106, a transfer data length 1107, and a transfer destination address 1108.
  • the Valid flag 1101 is a flag indicating whether or not the NVM transfer request management cell 1100 is valid.
  • the request source thread ID 1102 stores the thread ID of the thread that generated the NVM transfer request.
  • the thread management cell pointer 1103 is a pointer to the thread management cell 900 that manages the thread that has generated the NVM transfer request. That is, the thread ID 902 stored in the thread management cell 900 obtained by tracing this pointer is the same as the request source thread ID 1102.
  • Transfer direction 1104 is information for specifying the direction of NVM transfer, and specifies load or store.
  • the load is the direction close to the CPU, that is, the direction in which data is read from the NVM 320 to the main memory (or it can be said that the data stored in the NVM 320 is written to the main memory).
  • Store is the direction away from the CPU, that is, the direction in which data is transferred from the main memory to the NVM 320 (in other words, the data stored in the main memory is written to the NVM 320).
  • the transfer state 1105 indicates what state the NVM transfer is currently in. Details regarding the transfer status will be described later.
  • the transfer source address 1106 is an address that is a transfer source of DMA transfer performed by NVM transfer.
  • When the transfer direction 1104 is load (transfer from the NVM 320 to the main memory), the transfer source address 1106 is an identifier used within the NVM 320.
  • an address space dedicated to the NVM 320 which is different from the address space (physical address space, virtual address space) of the main memory, may be used.
  • The address space of the main memory is limited by the amount of DRAM assumed to be mountable in a computer at the time. For example, even in today's processors with a 64-bit architecture, considering the amount of DRAM that is realistic in terms of cost, often only an address space of about 48 bits is actually implemented.
  • When the transfer direction 1104 is store (transfer from the main memory to the NVM 320), the transfer source address 1106 lies in the address space of the main memory and, for the reason described later, is specified by an address in the physical address space.
  • the transfer data length 1107 designates the transfer length of DMA transfer performed by NVM transfer.
  • the transfer destination address 1108 is an address that is a transfer destination of a DMA transfer performed by NVM transfer. Similar to the transfer source address 1106, an identifier used in the NVM 320 or a physical address is specified according to the transfer direction 1104.
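  • For concreteness, the NVM transfer request management cell 1100 could be sketched in C as below; as with the earlier struct, the field widths and the direction/state encodings are assumptions for illustration.

```c
/* Hypothetical C layout of the NVM transfer request management cell 1100
 * of FIG. 11. Field widths and enumerations are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    XFER_LOAD,        /* NVM 320 -> main memory */
    XFER_STORE        /* main memory -> NVM 320 */
} transfer_direction_t;

typedef enum {
    XFER_REQUESTED,   /* waiting in the REQ queue                  */
    XFER_IN_PROGRESS, /* command written to the MMR; in WAIT queue */
    XFER_COMPLETE,    /* completion detected; in COMPLETE queue    */
    XFER_DISPOSED     /* recycled via the DISPOSE queue            */
} transfer_state_t;

struct nvm_transfer_request_cell {              /* 1100 */
    bool                  valid;                /* 1101 */
    uint64_t              requester_thread_id;  /* 1102 */
    struct thread_management_cell *tmc;         /* 1103: requester's cell */
    transfer_direction_t  direction;            /* 1104: load or store    */
    transfer_state_t      state;                /* 1105 */
    uint64_t              src_addr;             /* 1106: NVM identifier or physical address */
    size_t                length;               /* 1107: transfer length  */
    uint64_t              dst_addr;             /* 1108: physical address or NVM identifier */
    struct nvm_transfer_request_cell *next;     /* queue link (assumption) */
};
```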
  • When a thread requires a DMA transfer between the main memory and the NVM 320, the master control thread 550 generates an NVM transfer request management cell 1100 and enqueues the generated cell 1100 into the REQ queue 820. To start the DMA transfer, a command requesting the start of the transfer must be written to the MMR 311.
  • a plurality of sets of MMRs 311 may be prepared in the NVM subsystem 300 to support multiple DMA transfers in which a plurality of DMA transfers are advanced simultaneously in order to further increase the processing speed.
  • When the NVM subsystem 300 has a free DMA transfer slot, the master control thread 550 starts the DMA transfer by dequeuing an NVM transfer request management cell 1100 from the REQ queue 820 and writing a command to the MMR 311. The master control thread 550 then enqueues the NVM transfer request management cell 1100 whose DMA transfer is in progress into the WAIT queue 821 and waits for completion of the DMA transfer.
  • the completion of the DMA transfer can be notified by an interrupt from the NVM subsystem 300 to the processors 210 and 220.
  • the master control thread 550 can also know the completion of the DMA transfer by polling the MMR 311. However, when an interrupt is used, the interrupt handler of the operating system receives the interrupt, which requires switching to the kernel space and has a large overhead. Therefore, in this embodiment, a flag indicating that the DMA transfer is completed is provided in the MMR 311 and the master control thread 550 polls this flag.
  • In this embodiment, the master control thread 550 polls the MMR 311 when it has no other processing to perform, that is, at intervals that do not interfere with the user level threads or with scheduling, and the completion of a plurality of DMA transfers can be detected by a single poll.
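  • The completion polling described above could be sketched as follows; the flag register offset, per-channel bit layout, and channel count are purely illustrative, and `mmr` is assumed to have been mapped as in the earlier MMIO sketch.

```c
/* Illustrative poll of a DMA-completion flag register in the MMR.
 * Register offset and bit layout are hypothetical. */
#include <stdint.h>

#define REG_DMA_DONE_BITS  0x10   /* hypothetical: bit i set = channel i done */
#define NUM_DMA_CHANNELS   4

extern volatile uint64_t *mmr;    /* mapped MMR region (see earlier sketch) */

/* Called by the master control thread between its other duties; a single
 * register read can report completion of several DMA transfers at once. */
static uint64_t poll_dma_completions(void)
{
    uint64_t done = mmr[REG_DMA_DONE_BITS / sizeof(uint64_t)];

    for (int ch = 0; ch < NUM_DMA_CHANNELS; ch++) {
        if (done & (1ull << ch)) {
            /* Move the matching NVM transfer request management cell from
             * the WAIT queue to the COMPLETE queue (omitted here). */
        }
    }
    return done;
}
```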
  • the NVM transfer request management cell 1100 in which the completion of DMA transfer is detected is dequeued from the WAIT queue 821 by the master control thread 550 and enqueued in the COMPLETE queue 822.
  • the DMA transfer completion order is not necessarily the order enqueued in the WAIT queue 821, and access to the WAIT queue 821 is not necessarily a FIFO.
  • the NVM transfer request management cell 1100 enqueued in the COMPLETE queue 822 is unnecessary information because the DMA transfer has already been completed.
  • A feature of this embodiment is that the thread management cells 900 stored in the NVMWAIT queue 812 are moved from the NVMWAIT queue 812 to the READY queue 810 based on the NVM transfer request management cells 1100 stored in the COMPLETE queue 822. In other words, changing the state of the thread that issued a DMA transfer request from the DMA-transfer-completion-waiting state to the executable state according to the completion status of the DMA transfer is the characteristic operation that links DMA transfer and multithreading in this embodiment.
  • the master control thread 550 periodically monitors the COMPLETE queue 822. If the NVM transfer request management cell 1100 is present in the COMPLETE queue 822, the master control thread 550 dequeues it, and uses the request source thread ID 1102 as a key, in the NVMWAIT queue 812. The corresponding thread management cell 900 is searched from among them. When the master control thread 550 finds the corresponding thread management cell 900, the master control thread 550 enqueues the thread management cell 900 into the READY queue 810 and enqueues the NVM transfer request management cell 1100 into the DISPOSE queue 823. The DISPOSE queue 823 is used for reusing the used NVM transfer request management cell 1100, similarly to the FIN queue 813.
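  • Put together, the COMPLETE-queue processing described above might look like the following sketch, which reuses the hypothetical structures and queue helpers from the earlier examples; the lookup by requester thread ID is a simple linear search here.

```c
/* Illustrative processing of the COMPLETE queue: for each finished DMA
 * transfer, wake the requesting thread (NVMWAIT -> READY) and recycle the
 * request cell into the DISPOSE queue for reuse. */
static struct tmc_queue nvmwait_q, ready_q;

struct req_queue { struct nvm_transfer_request_cell *head; };
static struct req_queue complete_q, dispose_q;

static struct nvm_transfer_request_cell *req_dequeue(struct req_queue *q)
{
    struct nvm_transfer_request_cell *r = q->head;
    if (r) { q->head = r->next; r->next = NULL; }
    return r;
}

static void req_enqueue(struct req_queue *q, struct nvm_transfer_request_cell *r)
{
    r->next = NULL;
    struct nvm_transfer_request_cell **pp = &q->head;
    while (*pp)
        pp = &(*pp)->next;
    *pp = r;
}

static void process_complete_queue(void)
{
    struct nvm_transfer_request_cell *r;
    while ((r = req_dequeue(&complete_q)) != NULL) {
        /* Find the waiting thread by its requester thread ID. */
        struct thread_management_cell *c = nvmwait_q.head;
        while (c && c->thread_id != r->requester_thread_id)
            c = c->next;
        if (c) {
            remove_cell(&nvmwait_q, c);
            c->thread_state = THREAD_READY;
            enqueue(&ready_q, c);
        }
        r->state = XFER_DISPOSED;
        req_enqueue(&dispose_q, r);   /* keep the cell for reuse, like FIN */
    }
}
```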
  • FIG. 13 is an explanatory diagram showing how the threads in the process 410 are started and stopped by the cooperation of user level multithreading and DMA transfer described so far.
  • the cooperation between the thread life cycle and the DMA transfer will be described with reference to FIG.
  • Process 410 has kernel level threads 1-1 to 1-5 assigned from the operating system.
  • the process 410 can be executed by assigning an arbitrary thread to the kernel level threads 1-1 to 1-5.
  • the master control thread 550 is fixedly allocated on the kernel level thread 1-1.
  • the process 410 executes the master control thread 550 with the kernel level thread 1-1.
  • When performing large-scale graph processing, the master control thread 550 performs processing by setting up a thread corresponding to each vertex of the graph.
  • In FIG. 13, user level threads A, B, C, D, E, F, and G correspond to these vertices, and the master control thread 550 first assigns user level threads A, D, F, and G to the kernel level threads 1-2 to 1-5, respectively, and starts them.
  • In (1) of FIG. 13, the user level thread A requests the master control thread 550 to perform a DMA transfer from the NVM 320 and declares that it has no other processing it can perform until the DMA transfer is completed.
  • The master control thread 550 generates the corresponding NVM transfer request management cell 1100 and enqueues it into the REQ queue 820, suspends execution of the user level thread A, saves its context into the thread management cell 900, and then enqueues that cell into the NVMWAIT queue 812.
  • The master control thread 550 then dequeues one thread management cell 900 from the READY queue 810 and starts executing that thread on the kernel level thread 1-2 (in FIG. 13, the master control thread 550 has started execution of the user level thread B).
  • In (2), the user level thread D running on the kernel level thread 1-3 issues a DMA transfer request to the master control thread 550, as in (1).
  • This time, however, the master control thread 550 continues execution without pausing the user level thread D. This is the case, for example, where data that the user level thread will require is DMA-transferred in advance while other processing is performed; that is, the data on the NVM 320 is prefetched by DMA transfer.
  • Thereafter, the processing of the user level threads B and D is temporarily stopped, and execution of the next threads waiting in the READY queue 810 is started.
  • the user level thread B is scheduled to be executed again after the execution of the user level thread C is completed, and the user level thread D waits for completion of the previously requested DMA transfer.
  • execution of user level threads C and E is started. Since the completion of the DMA transfer requested in (1) is notified in (3) during execution of the user level thread C, the user level thread A is enqueued in the READY queue 810.
  • the user level thread A is scheduled after the execution of the user level thread C is completed (or suspended), and starts operating with the kernel thread 1-2.
  • On the kernel level thread 1-4, the user level thread F voluntarily returns its resources and stops in (4). Thereafter, a blank period with no thread to process occurs for a while on the kernel level thread 1-4, after which the user level thread B is scheduled again. As described above, the user level thread B was to be rescheduled after the completion of the user level thread C; however, since the kernel level thread it originally used has been taken by the user level thread A, it now operates on the kernel level thread 1-4. Since the virtual address space is shared among the threads in the process, the kernel level threads 1-1 to 1-5 can be used interchangeably, just as processors can in an SMP computer configuration.
  • FIG. 14 is a conceptual diagram showing the relationship between the thread context (thread management information 514) and the DMA transfer of data in this embodiment.
  • Data to be processed by a thread is DMA-transferred from the NVM 320 to the PIN area pool 516 in the main memory, and the context necessary for executing the thread is loaded from where it is saved in the main memory into each hardware thread (kernel level threads are assigned one-to-one to hardware threads).
  • FIG. 15 is a flowchart showing the scheduling operation of the master control thread 550 in this embodiment.
  • the master control thread 550 starts scheduling.
  • the trigger for starting the scheduling is when a free space is detected in the kernel level thread.
  • the master control thread 550 determines whether the thread management cell 900 can be dequeued from the READY queue 810. If it cannot be dequeued, scheduling ends at that point, and the master control thread 550 waits for the next opportunity for scheduling. If it can be dequeued, in step S1503, the master control thread 550 dequeues the thread management cell 900 from the READY queue.
  • In step S1504, the master control thread 550 determines whether or not a PIN area can be allocated from the PIN area pool 516.
  • the information processing system 100 is characterized by determining whether or not a PIN area can be allocated in the thread scheduling shown in FIG.
  • In this embodiment, DMA transfer is performed for each thread. Since DMA transfer is performed without the intervention of the processor or the operating system, the transfer is basically performed within the physical address space. That is, since the virtual address space is managed by the operating system, a DMA transfer cannot use virtual addresses. Therefore, when performing DMA transfer, it is essential that the area of main memory involved in the transfer is present in physical memory and located in the physical address space. Accordingly, a process that performs DMA transfer allocates a part of its virtual address space to the DRAM area of the physical address space in a fixed manner, so that page-out and page-in due to virtual memory do not occur.
  • The area fixedly allocated in this way to physically present memory is the PIN area, and in this embodiment it is prepared as the PIN area pool 516 of the process. Note that the user can set the size of the PIN area pool 516 in advance.
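  • As a side note not taken from the patent, on a POSIX system a fixed pool of this kind could be reserved and kept resident roughly as follows; the pool size is illustrative, and handing the pinned pages' physical addresses to the DMA engine, which a real implementation needs, is not shown.

```c
/* Minimal sketch of reserving a PIN-area pool and locking it so that its
 * pages stay resident in physical memory, a prerequisite for DMA. */
#include <stdio.h>
#include <sys/mman.h>

#define PIN_POOL_SIZE (64ul << 20)   /* e.g. 64 MiB; user-configurable */

static void *reserve_pin_pool(void)
{
    void *pool = mmap(NULL, PIN_POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return NULL; }

    /* Lock the pages so the operating system will not page them out. */
    if (mlock(pool, PIN_POOL_SIZE) != 0) {
        perror("mlock");
        munmap(pool, PIN_POOL_SIZE);
        return NULL;
    }
    return pool;
}
```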
  • It would also be possible to divide the PIN area pool 516 prepared for the process equally among the threads, but this works well only when the amount of data each thread requires is known in advance; if the sizes of the PIN areas requested by the threads are not uniform, the pool cannot be used effectively.
  • Therefore, in this embodiment, a PIN area pool 516 is prepared for the process, and the PIN area of the process is managed as a resource pool. The size of the PIN area that a thread requires is reported to the master control thread 550 at thread activation time, using the buffer request flag 907 and the buffer request size 908 of the thread management cell, and the thread starts execution after receiving a PIN area allocation from the master control thread 550.
  • In step S1504, the remaining amount of the current PIN area pool 516 is compared with the amount of area requested by the thread to be activated, to determine whether the thread can be activated. If executing threads have already been assigned parts of the PIN area pool 516, the remaining amount is reduced accordingly. If it is determined that the remaining amount of the PIN area pool 516 is insufficient, then in step S1506 the master control thread 550 puts the thread at the end of the READY queue 810 and gives priority to executing other threads. On the other hand, if the necessary area can be secured from the PIN area pool 516, then in step S1505 the master control thread 550 assigns the PIN area to the thread that made the request and switches to that thread. The master control thread 550 notifies the requesting thread of the assigned area by writing the PIN area assignment into the buffer allocation flag 909 and the buffer area head address 910 of the thread management cell 900.
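  • The decision flow of FIG. 15 can be summarized in the following sketch, built on the hypothetical structures above; the pool accounting and the assumed helpers pin_pool_alloc and dispatch are placeholders.

```c
/* Illustrative sketch of the scheduling flow of FIG. 15 (S1501 to S1506):
 * take a READY thread, check the PIN area pool, and either dispatch the
 * thread or push it back to the tail of the READY queue. */
#include <stdbool.h>
#include <stddef.h>

static size_t pin_pool_remaining;                         /* bytes still free */
extern void  *pin_pool_alloc(size_t size);                /* assumed helper   */
extern void   dispatch(struct thread_management_cell *c); /* assumed helper   */

static void schedule_once(void)                           /* S1501 */
{
    struct thread_management_cell *c = dequeue(&ready_q); /* S1502, S1503 */
    if (c == NULL)
        return;                 /* nothing runnable; wait for the next chance */

    if (c->buffer_request_flag &&
        c->buffer_request_size > pin_pool_remaining) {    /* S1504 */
        enqueue(&ready_q, c);   /* S1506: defer, let other threads run first  */
        return;
    }

    if (c->buffer_request_flag) {                         /* S1505 */
        c->buffer_area_head  = pin_pool_alloc(c->buffer_request_size);
        c->buffer_alloc_flag = true;
        pin_pool_remaining  -= c->buffer_request_size;
    }
    c->thread_state = THREAD_RUN;
    dispatch(c);                /* context switch to the selected thread */
}
```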
  • the entity of the thread management cell 900 is arranged in the NVM 320, and is DMA-transferred to the main memory and used as necessary. Further, the master control thread 550 DMA-transfers the thread management cell 900 on the NVM 320 in advance before it becomes necessary. That is, prefetch for the thread management cell 900 is performed. As shown in FIG. 16, the thread context is DMA-transferred between the main memory and the NVM, and saved or called.
  • For example, the READY queue 810 may be monitored and the cells prefetched in the order of the queue.
  • a DMA controller that holds the head address of the READY queue 810, reads the READY queue 810 in parallel with the processor, and performs prefetching may be separately prepared.
  • 100 Information processing system
  • 110 Node
  • 120 Inter-node network
  • 130 NVM subsystem interconnect
  • 210, 220 Processor
  • 230, 240 DIMM
  • 250 I / O hub
  • 260 NIC
  • 270 Disk controller
  • 280 HDD
  • 290 SSD
  • 300 NVM subsystem
  • 310 Hybrid memory controller
  • 320 NVM (non-volatile memory)
  • 330 Volatile memory
  • 311 MMR (Memory Mapped Register)
  • 511 Program code
  • 512 Global variable
  • 513 Heap area
  • 514 Thread management information
  • 515 NVM transfer request management information
  • 516 PIN area pool.

Abstract

This information processing device is provided with a multi-thread processor, a first storage device, a second storage device for performing DMA transfers between itself and the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on that physical address space. Memory space is secured in the physical address space according to the capacity required for DMA transfer by a thread that is scheduled for execution, the thread is moved to execution, and the secured memory space is released after processing of the thread has finished, thereby achieving efficient DMA transfer even in a state in which the data transfer amount fluctuates.

Description

Information processing device
 The present invention relates to an information processing apparatus that performs processing by transferring data to a storage device by DMA transfer, and particularly to an information processing apparatus characterized by switching and executing a plurality of threads.
 A computer is composed of a storage device that stores data and a central processing unit (CPU) that reads data from the storage device and processes it. In general, the faster a storage device is, the higher its price per bit (unit price), and the faster it is, the lower its number of bits per unit area or unit volume (recording density). Therefore, a high-speed but expensive, small-capacity storage device is placed near the CPU and holds the data needed most immediately, while data that does not fit there is placed in a slow but inexpensive, large-capacity storage device, and data is exchanged between the two storage devices as needed. Because storage devices thus involve a trade-off between speed and cost, or between speed and capacity, the so-called storage hierarchy concept, in which multiple storage devices with different properties are used hierarchically, has been widely adopted in the computer world.
 This trend remains unchanged today; given the types of storage elements available to computer vendors, computers today are composed mainly of a storage hierarchy with four levels: registers, cache memory, main memory, and storage. The main storage elements used at each level are flip-flops for registers, SRAM for cache memory, DRAM for main memory, and HDDs for storage, and the need to divide the hierarchy into levels arises from the speed, cost, and capacity of each storage element.
 Next, the storage hierarchy described above will be explained from another angle. The CPU processes data stored in registers. If the data to be processed is not in a register, the CPU searches the cache and, if the data is stored there, reads it from the cache into a register and then processes it. If the data to be processed is not in the cache either, the CPU reads it from the main memory into the cache, and if it is not in the main memory, the data is read from storage into the main memory. Thus, when the data is not in a level close to the CPU, it is read from a more distant level, and the penalty grows. While the data is being read, the CPU cannot carry out the processing it should perform, so it sits idle and CPU utilization decreases. The same problem occurs not only when reading data but also when writing it.
 Here, if DMA (Direct Memory Access) transfer is used for data transfer, the transfer can be performed without CPU intervention, so in principle the CPU should be able to perform other processing during the idle time of the transfer. For example, a method is also used in which the programmer schedules the timing of DMA transfers and the processing performed by the CPU, by identifying in advance the points in the program where data will be needed and explicitly embedding DMA transfer instructions. However, this method makes program tuning cumbersome.
 Multithreading has been used as a technique to deal with this. In multithreading, a unit of processing that can be performed concurrently is defined as a thread, and when execution of one thread stops, another executable thread is executed. A thread's execution stops not only when its processing is completed but also while the data it needs is being read. That is, combined with the data-reading behavior described above, when a running thread starts reading the data it needs, another thread is executed in the meantime. In this way, CPU utilization can be increased.
 As prior art relating to such DMA transfer and multithreading, there are the techniques disclosed in Patent Document 1 and Patent Document 2. In both, DMA transfer is performed between an on-chip memory (also referred to as local memory) on the CPU and the main memory (also referred to as global memory).
JP 2005-129001 A
JP 2002-163239 A
 ところで、近年、インターネットや各種端末の普及で、大量のデータを容易に取得することが出来るようになってきている。このような大量のデータは、旧来のデータベース管理システムなどで取り扱うことが難しく、ビッグデータという標語の元に種々の技術が開発されている。あらゆるモノがインターネットに接続されるIoT(Internet of Things)の時代には、モノで発生したあらゆるイベントに関するデータがインターネットに送信される。つまり、大量のPOE(Point of Event)データがインターネット上に送信される。このような世界では、インターネットからPOEデータを収集し、モノとモノ、ヒトとヒト、ないしは、モノとヒトがどのような関係にあるのかを分析し、それに基づいて適切なサービスを提供したり、将来を予測したりするなどの利用がなされていくことが期待される。そのためには、コンピュータが大量のデータを高速に処理することができなければならない。 Incidentally, in recent years, with the spread of the Internet and various terminals, it has become possible to easily acquire a large amount of data. Such a large amount of data is difficult to handle with a conventional database management system or the like, and various technologies have been developed under the slogan of big data. In the IoT (Internet of Things) era where every thing is connected to the Internet, data related to every event that occurs in the thing is transmitted to the Internet. That is, a large amount of POE (Point of Event) data is transmitted on the Internet. In such a world, we collect POE data from the Internet, analyze the relationship between things and things, people and people, or things and people, and provide appropriate services based on them. It is expected to be used for predicting the future. For this purpose, the computer must be able to process a large amount of data at high speed.
 コンピュータが大量のデータを高速に処理するためには、処理すべきデータがメインメモリ上に載ることが望ましい。ストレージ上にデータが置かれていると、その読み出しに時間を要する。特に、様々なモノやヒトの関係性を分析しようとする場合、様々なモノやヒトのデータを読み出さなければならない。その都度ストレージへのアクセスが必要になってしまうと、ストレージの読み出しの遅さがネックとなる。しかし、前述したようなビッグデータに対して、それに見合う容量のメインメモリを実現するために大量のDRAMを並べると、様々な問題を引き起こす。 In order for a computer to process a large amount of data at high speed, it is desirable that the data to be processed be placed on the main memory. When data is placed on the storage, it takes time to read it. In particular, when trying to analyze the relationship between various things and people, the data of various things and people must be read out. If access to the storage becomes necessary each time, the slow read of the storage becomes a bottleneck. However, when a large amount of DRAMs are arranged in order to realize a main memory having a capacity corresponding to the big data as described above, various problems are caused.
 DRAMはストレージに用いるHDDやフラッシュメモリ、相変化メモリなどと比較して単価が高いため、大量のDRAMでメインメモリを構成するとコスト増を引き起こす。また、DRAMは記録密度でも劣ることから、同容量のHDDやフラッシュメモリと比較して装置が巨大になってしまう。そこで、本願発明者らは、メインメモリとメインメモリからデータを退避させておく記憶領域との間でDMA転送を行うことで、必要な容量を用意しつつ、読み出し速度の問題を解決することを試みた。 DRAM has a higher unit price than HDDs used for storage, flash memory, phase change memory, etc., and configuring a main memory with a large amount of DRAM causes an increase in cost. Also, since the DRAM is inferior in recording density, the device becomes huge compared to the HDD and flash memory of the same capacity. Therefore, the inventors of the present application solve the problem of reading speed while preparing the necessary capacity by performing DMA transfer between the main memory and the storage area in which data is saved from the main memory. Tried.
 ここで、スレッドのスケジューリングは、実行可能なスレッドをキューから取りだして行う単純なFIFOに基づくことが多い。そのため、各スレッドの実行時間は均一なほうが効率的にスケジューリングすることができるので、処理をスレッドに分割するときには負荷が均等になるように分割することが望ましい。 Here, thread scheduling is often based on a simple FIFO that is executed by taking an executable thread from a queue. For this reason, uniform execution time of each thread enables efficient scheduling. Therefore, when dividing a process into threads, it is desirable to divide the processing so that the load is equal.
 しかし、コンピュータが普及し、アプリケーションが多様化する中で、必ずしも均等な分割が出来ないアプリケーションもあり問題となる。たとえば、社会科学系の問題を扱うとき、グラフ処理が行われる。グラフは、頂点の集合と、頂点間を結ぶ辺の集合で構成される。社会科学系の問題では、関係性が扱われることが多い。例えば会社間の関係は会社を表現する頂点と、関係を表現する辺で示される。このようなグラフを複数のスレッドで処理するために分割しようとすると、頂点毎にスレッドを割当てて分割する形態が自然である。ところが、頂点毎に分割したときに、各頂点が繋がっている辺の数はばらつきがある。そして、各頂点の処理に要する時間は、各頂点が関係を持っている頂点の数、すなわち、辺の数に比例する。そのため、頂点毎の分割ではスレッド間に処理量のばらつきが生じてしまい、退避させておいたデータをメインメモリにDMA転送する際に、DMA転送されるデータの大きさがばらついてしまう。 However, with the spread of computers and the diversification of applications, there are some applications that cannot always be divided equally. For example, graph processing is performed when dealing with social science problems. The graph is composed of a set of vertices and a set of edges connecting the vertices. Social science issues often deal with relationships. For example, the relationship between companies is indicated by a vertex representing the company and an edge representing the relationship. When such a graph is divided to be processed by a plurality of threads, it is natural that a thread is allocated and divided for each vertex. However, when dividing each vertex, the number of sides connected to each vertex varies. The time required for processing each vertex is proportional to the number of vertices with which each vertex is related, that is, the number of sides. Therefore, in the division for each vertex, the processing amount varies between threads, and when the saved data is DMA-transferred to the main memory, the size of the DMA-transferred data varies.
Moreover, the graphs that appear in social-science problems have a property called the scale-free characteristic, which makes this variation even more pronounced. The number of edges connected to a vertex is called the degree of that vertex. The scale-free characteristic means that the degree distribution follows a power law: a very small number of vertices have an extremely large degree, while the great majority of vertices have a small degree. Applied to the processing-amount variation described above, this means that when a sociological graph is processed, a small number of threads with a very large amount of processing and a large number of threads with a small amount of processing must be handled.
In the techniques disclosed in Patent Document 1 and Patent Document 2, DMA transfer is performed between an on-chip memory on the CPU and the main memory, and the data transferred by DMA is limited to the data the CPU is about to process; these are therefore not techniques that solve the above-described variation in data size.
An object of the present invention is to realize efficient DMA transfer even in a situation where the data transfer amount varies.
The information processing apparatus of the present invention includes a multithreaded processor, a first storage device, a second storage device that performs DMA transfers with the first storage device, and an operating system that allocates a physical address space to the first storage device and provides a virtual address space on that physical address space. It solves the above problem by reserving, in the physical address space, a memory space corresponding to the capacity that a thread scheduled for execution requires for DMA transfer, executing the thread, and releasing the reserved memory space after the thread's processing has finished.
The present invention realizes efficient DMA transfer even in a situation where the data transfer amount varies, and consequently speeds up the processing of the information processing apparatus.
FIG. 1 is a diagram illustrating an example of the configuration of the information processing system of the present invention.
FIG. 2 is a diagram illustrating an example of the configuration of the information processing apparatus of the present invention.
FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem of the present invention.
FIG. 4 is a diagram for explaining an example of the relationship between threads and processors.
FIG. 5 is a diagram for explaining an example of the internal structure of a process of the present invention.
FIG. 6 is a diagram for explaining an example of the correspondence between the physical address space of the information processing apparatus of the present invention and the virtual address space of a process of the present invention.
FIG. 7 is a diagram for explaining the details of an example of the stack area and the PIN area pool.
FIG. 8 is a diagram for explaining the queues used by the master control thread to manage threads and DMA transfers.
FIG. 9 is a diagram for explaining an example of the configuration of a thread management cell.
FIG. 10 is a diagram illustrating an example of thread state transitions.
FIG. 11 is a diagram for explaining an example of the configuration of an NVM transfer request management cell.
FIG. 12 is a diagram illustrating an example of the state transitions of an NVM transfer request.
FIG. 13 is a diagram for explaining the cooperation between user-level multithreading and DMA transfer.
FIG. 14 is a conceptual diagram for explaining the relationship between user-level multithreading and DMA transfer.
FIG. 15 is a flowchart for explaining an example of the scheduling operation of the master control thread.
FIG. 16 is a conceptual diagram for explaining the relationship between user-level multithreading and DMA transfer.
Embodiments of the present invention will be described below with reference to the drawings. In all the drawings for describing the embodiments, the same members are, as a rule, given the same reference numerals, and repeated description thereof is omitted.
This embodiment describes an information processing system 100 that uses a nonvolatile memory such as flash memory or phase-change memory to provide applications with a memory larger than one built from DRAM, while solving the slowness that is the drawback of nonvolatile memory by means of multithreading.
FIG. 1 is a diagram illustrating an example of the configuration of the information processing system 100 of this embodiment. The information processing system 100 has at least one node 110. A node 110 is an information processing apparatus, for example a server apparatus. The example of FIG. 1 shows a four-node configuration of nodes 0 to 3 (reference numeral 110). The nodes are connected by an inter-node network 120. In addition to the inter-node network 120, the information processing system 100 may further include an NVM subsystem interconnect 130 that connects the non-volatile memory (NVM) subsystems described later.
FIG. 2 is a diagram illustrating an example of the configuration of a node 110, which is an information processing apparatus. The node 110 includes processors 210 and 220, DIMMs 230 and 240, an I/O hub 250, a NIC 260, a disk controller 270, an HDD 280, an SSD 290, and an NVM subsystem 300. The main memory is composed of the DIMMs 230 and 240, which are storage devices. The DIMMs 230 and 240 are composed of DRAM, a volatile memory. Each node 110 needs only one processor at minimum; the node 110 of FIG. 2 is an example of a two-processor configuration with the processors 210 and 220. The processors 210 and 220 may each be multi-core processors. In the example of FIG. 2, each processor has two cores, so the node 110 as a whole is a four-core node. Furthermore, each core may support simultaneous multithreading (SMT). In the example of FIG. 2, each core supports two-way SMT, so each processor can process four threads simultaneously; that is, each processor is a multithreaded processor. Hereinafter, the threads that the hardware can process simultaneously are referred to as hardware threads.
The I/O hub 250 provides interfaces for connecting various devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300. The I/O hub 250 is connected to the processors 210 and 220 by the system bus that each processor provides; a bus such as HyperTransport, for example, is used for this connection. On the other hand, the I/O hub 250 is connected to devices such as the NIC 260, the disk controller 270, and the NVM subsystem 300 by a peripheral bus for attaching peripheral devices, such as PCI Express. In this embodiment, the I/O hub 250 is described as being connected to the NIC 260, the disk controller 270, and the NVM subsystem 300 by PCI Express, but the present invention can also be implemented with other interconnect means.
Conventionally, the main memory of a computer has been determined by the capacity of its DIMMs, and data that does not fit in the main memory is stored in the HDD or SSD serving as storage. The storage is connected via a disk controller, and in hardware terms an interface such as SAS (Serial Attached SCSI) or SATA (Serial Advanced Technology Attachment) is used. The interface seen from software is the file system: an application reads and writes files, and via the file system the operating system's device driver controls the disk controller to read and write the HDD or SSD. Reads and writes therefore cannot be performed without passing through multiple layers, and the overhead is large.
In contrast, the information processing system 100 of this embodiment includes the NVM subsystem 300 in order to read and write a nonvolatile memory that is larger than DIMMs at higher speed than HDD or SSD storage. When reads and writes faster than the storage are required, the data is read from the storage into the NVM subsystem 300 in advance, thereby realizing high-speed reads and writes. Non-volatile memory is abbreviated as NVM below.
FIG. 3 is a diagram illustrating an example of the configuration of the NVM subsystem 300. The NVM subsystem 300 includes a hybrid memory controller 310, a nonvolatile memory (NVM) 320 that is a storage device, and a volatile memory 330 that is a storage device. The NVM 320 is a nonvolatile memory such as flash memory or phase-change memory. The volatile memory 330 is DRAM, and DIMMs can be reused for it. The hybrid memory controller 310 is connected to the NVM 320, the volatile memory 330, and the I/O hub 250. In response to requests from software running on the processor 210 or 220, the hybrid memory controller 310 DMA-transfers data stored in the NVM 320 to the main-memory DIMM 230 or 240, and also DMA-transfers data stored in the main-memory DIMM 230 or 240 to the NVM 320. The volatile memory 330 is used as a buffer during DMA transfer. As described above, the hybrid memory controllers 310 of the nodes can also be connected by the NVM subsystem interconnect 130, which makes it possible to access data stored in the NVM subsystems 300 of other nodes.
The hybrid memory controller 310 has a memory-mapped register (MMR) 311. The MMR 311 is a register through which software running on the processors 210 and 220 instructs the hybrid memory controller 310 to perform DMA transfers. With PCI Express, the registers of a peripheral device can be mapped into the same memory space as the main memory, so software can access the MMR 311 with the load and store instructions of the processors 210 and 220, just as it reads and writes the main memory.
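As an illustration of what driving such a memory-mapped register can look like from user space, the following C sketch maps a PCI BAR and reads a register through ordinary load instructions. The sysfs path, the 4 KB window size, and the register offsets are assumptions made for illustration only; the document does not disclose the actual register map of the MMR 311.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical offsets inside the MMR 311 window; the real layout is not
       disclosed in this document. */
    #define MMR_CMD    0x00u   /* command register (write to start a DMA)    */
    #define MMR_STATUS 0x08u   /* status register (completion flag, assumed) */

    int main(void)
    {
        /* Map the PCI BAR that exposes the hybrid memory controller's MMR.
           On Linux each PCI function exposes its BARs as resourceN files;
           the device address used here is a placeholder. */
        int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        void *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Ordinary load/store instructions now reach the device registers. */
        volatile uint64_t *mmr = (volatile uint64_t *)base;
        uint64_t status = mmr[MMR_STATUS / sizeof(uint64_t)];
        printf("status = 0x%016llx\n", (unsigned long long)status);

        munmap(base, 4096);
        close(fd);
        return 0;
    }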
An operating system that supports virtual memory runs on the node 110. The node 110 is composed of a plurality of cores as described above, but it is a symmetric multiprocessing (SMP) configuration in which all cores share a single main memory, so a single operating system runs on the node 110. The embodiment is described below on the premise of a single system image in which one operating system runs on each node 110. The operating system running on each node 110 allocates a physical address space to the main memory composed of the DIMMs 230 and 240 of that node 110 and provides a virtual address space on that physical address space.
FIG. 4 is an explanatory diagram showing the relationship between the plurality of cores of a node 110 and the resources the operating system provides to the user. Since each node 110 has a configuration of two processors per node, two cores per processor, and two-way SMT per core, each node as a whole has 2 × 2 × 2 = 8 hardware threads as resources.
An operating system that supports virtual memory uses the address translation mechanism of the processor's MMU (Memory Management Unit) to separate the virtual address space for applications (user space) from the virtual address space in which the operating system kernel runs (kernel space), thereby ensuring the security and robustness of the system. In user space, each unit called a process has its own independent virtual address space. In general, a thread in an environment with such a concept of processes takes a form subordinate to a process: each process has one or more threads, and each thread runs sharing the virtual address space of its parent process.
Also, when a single operating system manages a plurality of cores, the operating system must abstract the cores in some way and assign them to processes, providing an environment in which applications can use the plurality of cores. The concept of a thread is used for this purpose as well. As shown in FIG. 4, the operating system provides kernel-level threads to each process.
In general, when threads are executed, the number of hardware threads (M) is limited relative to the number of threads to be executed (N). When N is less than or equal to M, the two can be put in one-to-one correspondence, but when N is greater than M, switching is required. This switching is context switching. To perform context switching with kernel-level threads, however, the virtual address space must be switched from user space to kernel space and the context switch processed inside the kernel, so the large overhead of context switching becomes a problem.
Therefore, in the information processing system 100 of this embodiment, kernel-level threads and hardware threads are used in one-to-one correspondence, as shown in FIG. 4; that is, N = M. As it stands, however, N is constrained by M, so the number of threads an application requires cannot be secured. When the number of threads is small, there is also less room to hide the DMA transfers described later, and the efficiency of the system as a whole falls. A method is therefore needed to make a large number of threads available (to increase N) while avoiding context switching in the kernel.
To address this, the information processing system 100 of this embodiment is characterized by providing each of the processes 410 and 420 of FIG. 4 with a master control thread 550 and user-level threads, as shown in FIG. 5.
As shown in FIG. 5, in the information processing system 100 of this embodiment the process 410 has inter-thread shared resources 510 and at least two kernel-level threads allocated from the kernel. The master control thread 550 is fixedly assigned to one of the plurality of kernel-level threads, and the user-level threads required by the application are assigned to the kernel-level threads in a time-shared manner.
The master control thread 550 is a thread that continuously occupies one kernel-level thread and keeps running for as long as the process 410 exists; it performs context switching of the user-level threads, scheduling, management of the inter-thread shared resources, and so on. Unlike kernel-level threads, the master control thread 550 performs these operations inside the process 410, so context switching can be realized at higher speed without causing a switch into kernel space. In other words, the master control thread 550 makes it possible to use a large number of threads while speeding up context switching.
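User-space context switching of the kind the master control thread 550 performs can be pictured with the POSIX ucontext interface, which saves and restores register state without entering the kernel. This is only a sketch under the assumption that a ucontext-style mechanism is acceptable; the document does not name a specific switching primitive.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define STACK_SIZE (64 * 1024)

    static ucontext_t master_ctx, worker_ctx;

    static void worker(void)
    {
        puts("user-level thread: running");
        /* Hand control back to the master control thread entirely in user
           space; no switch into kernel space takes place. */
        swapcontext(&worker_ctx, &master_ctx);
    }

    int main(void)
    {
        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp   = malloc(STACK_SIZE);  /* per-thread stack */
        worker_ctx.uc_stack.ss_size = STACK_SIZE;
        worker_ctx.uc_link          = &master_ctx;
        makecontext(&worker_ctx, worker, 0);

        puts("master control thread: dispatching a user-level thread");
        swapcontext(&master_ctx, &worker_ctx);   /* user-space context switch */
        puts("master control thread: control returned");
        free(worker_ctx.uc_stack.ss_sp);
        return 0;
    }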
As described above, threads run sharing the resources of their process. The model used is that each thread has its own stack while the other areas are shared with the other threads. In the information processing system 100 of this embodiment, in order to perform the DMA transfers described later, resources of the PIN area described later are additionally allocated to the user-level threads.
FIG. 6 is an explanatory diagram showing the correspondence between the physical address space 610 of the node 110 and the virtual address space 620 of a process 410 running on the node 110. The areas arranged in the physical address space 610 basically correspond one-to-one with the physical components of the node 110. The DRAM area 611 corresponds to the DIMMs 230 and 240, and the MMIO (Memory Mapped Input/Output) area 612 is the area in which the MMR 311 described above is placed. All of these areas are managed in units called pages; in general, one page is 4 KB in size.
The virtual address space 620 of the process 410 can be broadly divided into a text area 621, a data area 622, an mmio area 623, and a stack area 624. The process 410 of this embodiment additionally has a PIN area pool 516.
Each area in the virtual address space 620 is described below while relating the internal structure of the process 410 shown in FIG. 5 to the virtual address space 620 of the process 410 shown in FIG. 6.
The process 410 has various resources shared among the user-level threads as the inter-thread shared resources 510. The inter-thread shared resources 510 include the program code 511, the global variables 512, the heap area 513, the thread management information 514, the NVM transfer request management information 515, and the PIN area pool 516. The program code 511 is the instruction sequence of the program the threads execute, and it is placed in the text area 621 of the virtual address space 620. The global variables 512 are variables used in common by any subroutine or thread operating within the process 410, and they are placed in the data area 622. The heap area 513 is the resource pool from which the program dynamically allocates memory, and it is placed in the data area 622. The thread management information 514, described in detail later, stores the information needed per thread in order to manage the threads; it is mainly used by the master control thread 550, but because it must also be accessible from the user-level threads it has the same nature as the global variables 512 and is placed in the data area 622. The NVM transfer request management information 515 is information for managing the DMA transfers described later and, for the same reason as the thread management information 514, is placed in the data area 622. The stack area is an area for preparing the stacks used for local variables and for passing subroutine parameters, and it is apportioned to each thread as described later.
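To make the placement of these regions concrete, the short C program below prints one address from each of the regions discussed (text, data, heap, stack). It is only an illustration of the general layout; the actual addresses depend on the platform and on address space layout randomization.

    #include <stdio.h>
    #include <stdlib.h>

    int global_counter = 42;                 /* placed in the data region     */

    static void show_regions(void)
    {
        int local = 0;                       /* placed on the thread's stack  */
        void *dynamic = malloc(64);          /* allocated from the heap       */

        printf("text  (code)   : %p\n", (void *)show_regions);
        printf("data  (global) : %p\n", (void *)&global_counter);
        printf("heap           : %p\n", dynamic);
        printf("stack          : %p\n", (void *)&local);
        free(dynamic);
    }

    int main(void) { show_regions(); return 0; }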
In FIG. 6, the correspondence between the physical address space 610 and the virtual address space 620 is indicated by broken lines. As shown in FIG. 6, the physical address space 610 and the virtual address space 620 are mapped page by page, but note that there are pages in the virtual address space 620 for which no corresponding page exists in the physical address space 610. This is the mechanism that realizes so-called virtual memory: even if a page exists in the virtual address space 620, it is not necessarily resident in DRAM and may have been paged out to the HDD or SSD. When such a page is accessed, the MMU raises a page-fault exception, and the operating system reads the evacuated page back from the HDD or SSD and pages it in. Thus, an information processing system 100 that adopts virtual memory has the characteristic that a memory area (page) that exists from the process's point of view does not necessarily exist in physical memory. The effect of this characteristic on DMA transfer is described later.
FIG. 7 shows the relationship between the threads and the PIN area pool 516 and stack area 624 of the virtual address space 620. Both the PIN area pool 516 and the stack area 624 are divided per thread in use. However, whereas the stack area 624 always contains a stack area for every thread the process 410 has (the master control thread and the user-level threads), the PIN area pool 516 has areas for only some of the threads, according to its size. This is because the stack area 624 can reserve as much area as the virtual address space allows by exploiting the virtual-memory mechanism, whereas the PIN area pool 516, for a reason described later, is provided only to the extent that pages corresponding to the DRAM area 611 of the physical address space 610 can be secured.
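As a loose analogy for such a PIN area pool, the C sketch below reserves a region and pins its pages with mlock so that they stay resident in physical DRAM. This is not the mechanism the document describes: it only assumes a POSIX-like environment, the pool size is arbitrary, and pinning memory for an actual DMA engine is normally the job of a device driver; mlock merely prevents the pages from being paged out.

    #include <stdio.h>
    #include <sys/mman.h>

    #define PIN_POOL_BYTES (1u << 20)   /* 1 MiB pool; size chosen arbitrarily */

    int main(void)
    {
        void *pool = mmap(NULL, PIN_POOL_BYTES, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED) { perror("mmap"); return 1; }

        /* Pin the pages so each keeps a physical frame; may require a raised
           RLIMIT_MEMLOCK or appropriate privileges. */
        if (mlock(pool, PIN_POOL_BYTES) != 0) { perror("mlock"); return 1; }

        printf("pinned a %u-byte pool at %p\n", PIN_POOL_BYTES, pool);

        munlock(pool, PIN_POOL_BYTES);
        munmap(pool, PIN_POOL_BYTES);
        return 0;
    }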
The mechanism that links the user-level multithreading of the information processing system 100 with DMA transfer is described below. Linking user-level multithreading with DMA transfer makes it possible, for example, to speed up the kind of large-scale graph processing described in the background art.
Here, user-level multithreading means running a plurality of threads (user-level threads) within the process 410 while switching among them. Because the processing needed for thread switching is completed inside the process 410, it is faster than switching kernel-level threads. On the other hand, with user-level multithreading, thread management is also performed inside the process 410. In the information processing system 100 of this embodiment, the master control thread 550 takes on the role of thread management.
FIG. 8 shows the queues used by the master control thread 550 to manage the user-level threads and the DMA transfers. These queues are placed in memory as the thread management information 514 and the NVM transfer request management information 515.
The thread management information 514 includes a READY queue 810, an IOWAIT queue 811, an NVMWAIT queue 812, and a FIN queue 813. The entries enqueued on each queue are thread management cells 900, shown in FIG. 9. A thread management cell 900 consists of a Valid flag 901, a thread ID 902, a thread state 903, a saved context 904, a saved stack pointer 905, a saved program counter 906, a buffer request flag 907, a buffer request size 908, a buffer allocation flag 909, and a buffer area head address 910.
The Valid flag 901 is a flag indicating whether the thread management cell 900 is valid. The thread ID 902 is an identifier that uniquely identifies a thread, and it is used to realize the characteristic operation of the present invention of linking DMA transfer and thread scheduling, described later. The thread state 903 is information indicating what state the thread is currently in; the thread states are described in detail later.
The saved context 904, the saved stack pointer 905, and the saved program counter 906 are information used to execute the thread, saved from the registers of the processors 210 and 220 into the thread management cell 900 when the thread is stopped. The buffer request flag 907, the buffer request size 908, the buffer allocation flag 909, and the buffer area head address 910 are used for the DMA transfers described later and are explained in detail below.
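Rendered as a C structure, the thread management cell 900 might look as follows. The field types, the register-file layout, and the intrusive next pointer for queue linkage are assumptions; only the field names follow FIG. 9.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Thread states, named after the queues of FIG. 8 / FIG. 10. */
    typedef enum { T_READY, T_RUN, T_IOWAIT, T_NVMWAIT, T_FIN } thread_state_t;

    /* Saved general-purpose registers; the real set depends on the processor. */
    typedef struct { uint64_t gpr[32]; } saved_context_t;

    typedef struct thread_mgmt_cell {
        bool            valid;            /* Valid flag 901                    */
        uint64_t        thread_id;        /* thread ID 902                     */
        thread_state_t  state;            /* thread state 903                  */
        saved_context_t context;          /* saved context 904                 */
        void           *stack_pointer;    /* saved stack pointer 905           */
        void           *program_counter;  /* saved program counter 906         */
        bool            buffer_requested; /* buffer request flag 907           */
        size_t          buffer_req_size;  /* buffer request size 908           */
        bool            buffer_allocated; /* buffer allocation flag 909        */
        void           *buffer_base;      /* buffer area head address 910      */
        struct thread_mgmt_cell *next;    /* queue linkage (assumed intrusive) */
    } thread_mgmt_cell_t;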
On the READY queue 810, the thread management cells 900 of executable user-level threads are enqueued. The master control thread 550 dequeues a thread management cell 900 from the READY queue 810 when a kernel-level thread of the process 410 is free or when another user-level thread stops. It then performs a context switch using the saved context 904, the saved stack pointer 905, and the saved program counter 906 contained in the dequeued thread management cell 900, and starts executing the thread.
A thread that is neither executing (RUN) nor executable (READY) is in some waiting state. To manage those waiting states, the master control thread 550 uses the IOWAIT queue 811 and the NVMWAIT queue 812. Having the NVMWAIT queue 812 in particular is a feature of the information processing system 100 of this embodiment.
The IOWAIT queue 811 is a queue that holds threads waiting for the completion of I/O requested from the operating system via a system call. When a user-level thread makes an I/O request that uses an operating-system function, such as access to a file, it issues a system call to convey the request to the operating system. If the thread that issued the system call has no processing to perform until the system call completes, that is, if it will resume processing only after the system call completes, the master control thread 550 parks the thread that issued the system call on the IOWAIT queue 811. When execution of the system call completes, the master control thread 550 moves the parked thread to the READY queue 810.
The NVMWAIT queue 812 is a queue that holds threads waiting for the completion of a DMA transfer between the main memory and the nonvolatile memory (NVM) 320. In the use cases assumed for the information processing system 100 of this embodiment, threads corresponding to each of an enormous number of vertices are executed, for example for large-scale graph processing. Because the number of threads is then huge, it is difficult to keep the data needed by all threads in main memory at once. The information processing system 100 therefore stores the data in the nonvolatile memory (NVM) 320 and brings it from the NVM 320 into the main memory by DMA transfer when a thread is executed. Whereas the main memory is composed of DRAM, the NVM 320 is composed of flash memory or phase-change memory, so a large capacity is realized at lower cost than with DRAM. However, this DMA transfer also takes time, so if the thread that requested the data by DMA transfer has no other work it can perform in the meantime, the master control thread 550 parks that thread on the NVMWAIT queue 812. When the DMA transfer completes, the master control thread 550 moves the thread parked on the NVMWAIT queue 812 to the READY queue 810.
The FIN queue 813 is a queue for collecting thread management cells 900 that are no longer needed because execution has completed. A thread management cell 900 exists for each thread. Therefore, when large numbers of threads are repeatedly created and destroyed, as in dynamic large-scale graph processing, allocating a thread management cell 900 from the heap area 513 each time and returning the area to the heap area 513 each time would incur a large overhead. The master control thread 550 therefore collects used thread management cells 900 on the FIN queue 813 and reuses thread management cells 900 from the FIN queue 813 as needed.
FIG. 10 summarizes the various thread states described so far as a state transition diagram. When a new thread is created in (1) of FIG. 10, the master control thread 550 enqueues the thread management cell 900 of the new thread on the READY queue 810. In (2) of FIG. 10, when a resource for executing threads becomes free, that is, when a user-level thread that had been executing stops (its thread is parked on the READY queue 810, the IOWAIT queue 811, the NVMWAIT queue 812, or the FIN queue 813), or when a kernel-level thread has no user-level thread assigned and is idle, the master control thread 550 dequeues the thread management cell 900 at the head of the READY queue 810 and starts executing it. That thread is then executing (RUN).
In (3) of FIG. 10, when an executing thread voluntarily yields its resources, the master control thread 550 suspends execution of that thread and enqueues it on the READY queue 810. In (4) of FIG. 10, when an executing thread issues a system call, for example for file access, and begins waiting for the system call to complete, the master control thread 550 suspends execution of that thread and enqueues it on the IOWAIT queue 811.
In (5) of FIG. 10, when an executing thread begins waiting for the completion of a DMA transfer between the NVM 320 and the main memory, the master control thread 550 suspends execution of that thread and enqueues it on the NVMWAIT queue 812.
In (6) of FIG. 10, when the master control thread 550 detects the completion of a system call whose completion had been awaited since the thread was enqueued on the IOWAIT queue 811 in (4), it moves that thread from the IOWAIT queue 811 to the READY queue 810. Likewise, (7) of FIG. 10 is, as with (6), the operation for the case where the awaited event actually completes: when the master control thread 550 detects the completion of a DMA transfer whose completion had been awaited since the thread was enqueued on the NVMWAIT queue 812 in (5), it moves that thread from the NVMWAIT queue 812 to the READY queue 810. Through (6) and (7), a thread that was waiting for a system call or a DMA transfer and whose system call or DMA transfer has completed becomes executable again and waits to be scheduled and started next. The method for detecting completion of a DMA transfer is described later.
When execution of a thread completes, or when execution of a thread is aborted partway through, the master control thread 550 moves the thread management cell 900 of the completed or aborted thread to the FIN queue 813 (corresponding to (8) and (9) in FIG. 10). Note that aborting a thread can also occur while the thread is on the READY queue 810, the IOWAIT queue 811, or the NVMWAIT queue 812; in these cases too, the master control thread 550 moves the thread management cell 900 of the aborted thread to the FIN queue 813. Also, when execution of a thread completes or is aborted and a PIN area described later had been allocated to the thread whose processing has thus ended, the master control thread 550 cancels the allocation, that is, it releases the memory space that had been allocated as the PIN area.
The thread management cells 900 on the FIN queue 813 are released by the master control thread 550 as appropriate when the heap area 513 runs short. When a new thread is created (operation (1) of FIG. 10), thread management cells 900 on the FIN queue 813 are used preferentially. If thread management cells 900 are still insufficient, the master control thread 550 allocates an area for a thread management cell 900 from the heap area 513.
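The FIN queue 813 thus acts as a free list. A minimal C sketch of that allocation policy, with a stand-in cell type and hypothetical helper names, is shown below.

    #include <stdlib.h>
    #include <string.h>

    /* Minimal stand-in for the thread management cell 900; other fields omitted. */
    typedef struct cell { struct cell *next; } cell_t;

    static cell_t *fin_head;   /* head of the FIN queue 813, used as a free list */

    /* Prefer recycling a used cell from the FIN queue; fall back to the heap
       only when the FIN queue is empty, avoiding allocator overhead when
       threads are created and destroyed at a high rate. */
    static cell_t *alloc_thread_cell(void)
    {
        cell_t *c = fin_head;
        if (c != NULL)
            fin_head = c->next;
        else
            c = malloc(sizeof *c);
        if (c != NULL)
            memset(c, 0, sizeof *c);   /* present a cleared cell to the caller */
        return c;
    }

    static void retire_thread_cell(cell_t *c)
    {
        c->next  = fin_head;           /* push back onto the FIN queue for reuse */
        fin_head = c;
    }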
The NVM transfer request management information 515 consists of a REQ queue 820, a WAIT queue 821, a COMPLETE queue 822, and a DISPOSE queue 823. The entries enqueued on each queue are NVM transfer request management cells 1100, shown in FIG. 11. An NVM transfer request management cell 1100 consists of a Valid flag 1101, a requesting thread ID 1102, a thread management cell pointer 1103, a transfer direction 1104, a transfer state 1105, a transfer source address 1106, a transfer data length 1107, and a transfer destination address 1108.
The Valid flag 1101 is a flag indicating whether the NVM transfer request management cell 1100 is valid. The requesting thread ID 1102 stores the thread ID of the thread that generated the NVM transfer request. The thread management cell pointer 1103 is a pointer to the thread management cell 900 that manages the thread that generated the NVM transfer request; in other words, the thread ID 902 stored in the thread management cell 900 reached through this pointer is identical to the requesting thread ID 1102.
The transfer direction 1104 is information specifying the direction of the NVM transfer, either load or store. A load is a transfer toward the side close to the CPU, that is, reading data from the NVM 320 into the main memory (equivalently, writing data stored in the NVM 320 into the main memory). A store is a transfer toward the side far from the CPU, that is, reading data out of the main memory into the NVM 320 (equivalently, writing data stored in the main memory into the NVM 320). The transfer state 1105 indicates what state the NVM transfer is currently in; details of the transfer states are described later.
The transfer source address 1106 is the address that is the source of the DMA transfer performed by the NVM transfer. When the transfer direction 1104 is load (transfer from the NVM 320 to the main memory), the transfer source address 1106 is an identifier used by the NVM 320. For this identifier, an address space dedicated to the NVM 320, different from the address spaces of the main memory (the physical address space and the virtual address space), may be used. In general, the address space of the main memory is constrained by the amount of DRAM that a computer of the era is assumed to be able to mount; for example, even today's 64-bit processor architectures often implement only a space of about 48 bits, in view of the amount of DRAM that is realistically affordable. It is therefore difficult to map a large-capacity NVM into the address space of the main memory. When the transfer direction 1104 is store (transfer from the main memory to the NVM 320), the transfer source address 1106 is in the address space of the main memory and, for a reason described later, is specified in particular as an address in the physical address space.
The transfer data length 1107 specifies the transfer length of the DMA transfer performed by the NVM transfer. The transfer destination address 1108 is the address that is the destination of the DMA transfer performed by the NVM transfer. Like the transfer source address 1106, depending on the transfer direction 1104 it is specified either as an identifier used by the NVM 320 or as a physical address.
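The NVM transfer request management cell 1100 can likewise be sketched as a C structure. The field names follow FIG. 11; the types, the enumerations, and the next pointer for queue linkage are assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Load reads the NVM 320 into main memory; store writes main memory back. */
    typedef enum { XFER_LOAD, XFER_STORE } xfer_dir_t;

    /* Transfer states matching the REQ / WAIT / COMPLETE / DISPOSE queues. */
    typedef enum { X_REQUESTED, X_IN_FLIGHT, X_COMPLETE, X_DISPOSED } xfer_state_t;

    struct thread_mgmt_cell;                     /* the cell of FIG. 9 */

    typedef struct nvm_req_cell {
        bool                      valid;         /* Valid flag 1101              */
        uint64_t                  requester_tid; /* requesting thread ID 1102    */
        struct thread_mgmt_cell  *thread_cell;   /* thread mgmt cell ptr 1103    */
        xfer_dir_t                direction;     /* transfer direction 1104      */
        xfer_state_t              state;         /* transfer state 1105          */
        uint64_t                  src_addr;      /* source: NVM identifier or
                                                    physical address 1106        */
        size_t                    length;        /* transfer data length 1107    */
        uint64_t                  dst_addr;      /* transfer destination 1108    */
        struct nvm_req_cell      *next;          /* queue linkage (assumed)      */
    } nvm_req_cell_t;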
The state transitions of an NVM transfer request are described below by comparing the state transition diagram of FIG. 12 with the queue configuration of FIG. 8.
First, (1) the access request from a thread and (2) the DMA transfer request to the NVM subsystem in FIG. 12 are described. When a thread requires a DMA transfer between the main memory and the NVM 320, the master control thread 550 generates an NVM transfer request management cell 1100 and enqueues the generated cell 1100 on the REQ queue 820. To start a DMA transfer, a command requesting the start of the DMA transfer must be written to the MMR 311. Here, multiple sets of the MMR 311 can be provided in the NVM subsystem 300 to support multiplexed DMA transfer, in which a plurality of DMA transfers proceed concurrently for further speed-up. When the NVM subsystem 300 has a free DMA transfer slot, the master control thread 550 dequeues an NVM transfer request management cell 1100 from the REQ queue 820 and writes a command to the MMR 311, thereby starting the DMA transfer. The master control thread 550 then enqueues the NVM transfer request management cell 1100 whose DMA transfer is now in progress on the WAIT queue 821 and waits for the DMA transfer to complete.
Next, (3) the DMA transfer completion notification from the NVM subsystem, (4) the access completion notification to the thread, and (5) the release and reuse of NVM access management cells in FIG. 12 are described. Completion of a DMA transfer could be notified by an interrupt from the NVM subsystem 300 to the processor 210 or 220, and the master control thread 550 could also learn of the completion by polling the MMR 311. When an interrupt is used, however, it is received by an interrupt handler belonging to the operating system, which requires a switch into kernel space and incurs large overhead. In this embodiment, therefore, the MMR 311 is provided with a flag indicating that a DMA transfer has completed, and the master control thread 550 polls this flag. Furthermore, especially when a plurality of DMA transfers are performed concurrently, even detecting their completions individually becomes a source of overhead. In the information processing system 100 of this embodiment, therefore, the master control thread 550 polls the MMR 311 at intervals that do not interfere with the user-level threads or with scheduling, in other words when the master control thread 550 has no other work to do, and detects the completion of a plurality of DMA transfers with a single poll.
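The interaction with the MMR 311 described above can be sketched as two small C helpers: one that starts a transfer by writing a descriptor into the register set, and one non-blocking poll of a completion flag. The register indices, the command encoding, and the flag bit are all assumptions; the document states only that a command written to the MMR 311 starts a DMA transfer and that a flag in the MMR 311 indicates completion.

    #include <stdint.h>

    /* Assumed register layout of one MMR 311 set, indexed as 64-bit words. */
    enum { MMR_SRC = 0, MMR_DST = 1, MMR_LEN = 2, MMR_CMD = 3, MMR_STAT = 4 };
    #define CMD_START_LOAD 0x1u   /* NVM -> main memory (assumed encoding) */
    #define STAT_DONE      0x1u   /* completion flag (assumed bit)         */

    /* Kick one DMA transfer by writing its descriptor into the MMR set. */
    static void mmr_start_load(volatile uint64_t *mmr, uint64_t nvm_src,
                               uint64_t phys_dst, uint64_t len)
    {
        mmr[MMR_SRC] = nvm_src;    /* identifier in the NVM's own address space */
        mmr[MMR_DST] = phys_dst;   /* physical address inside the PIN area      */
        mmr[MMR_LEN] = len;
        mmr[MMR_CMD] = CMD_START_LOAD;   /* this store starts the transfer      */
    }

    /* Non-blocking completion check, suitable for the master control thread
       to call only when it has nothing else to do (polling, not interrupts). */
    static int mmr_load_done(const volatile uint64_t *mmr)
    {
        return (mmr[MMR_STAT] & STAT_DONE) != 0;
    }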
An NVM transfer request management cell 1100 whose DMA transfer has been detected as complete is dequeued from the WAIT queue 821 by the master control thread 550 and enqueued on the COMPLETE queue 822. Note that the order in which DMA transfers complete is not necessarily the order in which they were enqueued on the WAIT queue 821, so access to the WAIT queue 821 is not necessarily FIFO.
From the standpoint of managing DMA transfers, an NVM transfer request management cell 1100 enqueued on the COMPLETE queue 822 is information that is no longer needed, since its DMA transfer has already completed. In the information processing system 100 of this embodiment, however, the NVM transfer request management cells 1100 stored on the COMPLETE queue 822 are used as the basis for moving thread management cells 900 from the NVMWAIT queue 812 to the READY queue 810. That is, transitioning the thread that caused a DMA transfer request from the DMA-transfer-completion-waiting state to the executable state according to the completion status of that DMA transfer is the characteristic operation of this embodiment: the cooperation between DMA transfer and multithreading.
The master control thread 550 periodically monitors the COMPLETE queue 822; if an NVM transfer request management cell 1100 is present on the COMPLETE queue 822, it dequeues the cell and, using the requesting thread ID 1102 as the key, searches the NVMWAIT queue 812 for the corresponding thread management cell 900. When the master control thread 550 finds the corresponding thread management cell 900, it enqueues that thread management cell 900 on the READY queue 810 and enqueues the NVM transfer request management cell 1100 on the DISPOSE queue 823. The DISPOSE queue 823, like the FIN queue 813, is for reusing used NVM transfer request management cells 1100.
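The hand-off from the COMPLETE queue 822 to the READY queue 810 amounts to a keyed search and a move between lists. The self-contained C sketch below shows that logic with pared-down stand-ins for the two cell types; the names and list handling are assumptions, and the READY insertion is done at the head for brevity where a real implementation would append at the tail to keep FIFO order.

    #include <stddef.h>
    #include <stdint.h>

    /* Pared-down stand-ins for the cells of FIG. 9 and FIG. 11. */
    typedef struct tcell { uint64_t thread_id;           struct tcell *next; } tcell_t;
    typedef struct rcell { uint64_t requester_thread_id; struct rcell *next; } rcell_t;

    /* Move every thread whose DMA transfer has completed from the NVMWAIT
       list to the READY list, and retire the request cells to DISPOSE. */
    static void drain_complete_queue(rcell_t **complete, rcell_t **dispose,
                                     tcell_t **nvmwait, tcell_t **ready)
    {
        while (*complete != NULL) {
            rcell_t *req = *complete;        /* dequeue from COMPLETE           */
            *complete = req->next;

            tcell_t **pp = nvmwait;          /* search NVMWAIT, key = thread ID */
            while (*pp != NULL && (*pp)->thread_id != req->requester_thread_id)
                pp = &(*pp)->next;

            if (*pp != NULL) {
                tcell_t *t = *pp;
                *pp = t->next;               /* unlink from NVMWAIT             */
                t->next = *ready;            /* enqueue on READY (head push)    */
                *ready  = t;
            }

            req->next = *dispose;            /* recycle the request cell        */
            *dispose  = req;
        }
    }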
FIG. 13 is an explanatory diagram showing how the threads in the process 410 are started and stopped under the cooperation between user-level multithreading and DMA transfer described so far. The life cycle of the threads and their cooperation with DMA transfer are described below along FIG. 13.
The process 410 has kernel-level threads 1-1 to 1-5 allocated from the operating system and can assign any thread to the kernel-level threads 1-1 to 1-5 for execution. In the information processing system 100 of this embodiment, the master control thread 550 is fixedly assigned to the kernel-level thread 1-1; when the process 410 starts, it runs the master control thread 550 on the kernel-level thread 1-1.
For example, when performing large-scale graph processing, the master control thread 550 creates a thread corresponding to each vertex of the graph and performs the processing. In the example of FIG. 13, the user-level threads A, B, C, D, E, F, and G correspond to these, and the master control thread 550 first starts the user-level threads A, D, F, and G on the kernel-level threads 1-2 to 1-5, respectively.
Following the execution of the threads in time order, in (1) the user-level thread A requests the master control thread 550 to perform a DMA transfer from the NVM 320 and also declares that it has no other work it can do until that DMA transfer finishes. The master control thread 550 generates the corresponding NVM transfer request management cell 1100 and enqueues it on the REQ queue 820, suspends execution of the user-level thread A, saves its context into its thread management cell 900, and enqueues it on the NVMWAIT queue 812. The kernel-level thread 1-2 then has no thread to execute, so the master control thread 550 dequeues one thread management cell 900 from the READY queue 810 and runs that thread on the kernel-level thread 1-2 (in FIG. 13, the master control thread 550 has started execution of the user-level thread B).
Next, the user-level thread D running on the kernel-level thread 1-3 issues a DMA transfer request to the master control thread 550 in (2), as in (1). In the case of (2), however, the user-level thread D declares to the master control thread 550 that it has other work it can do without waiting for the DMA transfer to complete, so the master control thread 550 lets the user-level thread D continue executing without suspending it. This is the case, for example, when a user-level thread DMA-transfers data it will need in advance and then performs other processing; in effect, the data on the NVM 320 is prefetched by the DMA transfer.
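The two ways a user-level thread can issue a transfer request, blocking as in (1) and prefetching as in (2), can be illustrated with a hypothetical request interface. None of the names below appear in the document, and the function body is a stub that merely prints what the real master control thread would do (enqueue an NVM transfer request management cell and, for a blocking request, park the caller on the NVMWAIT queue).

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { NVM_LOAD, NVM_STORE } nvm_dir_t;   /* hypothetical */

    /* Stub of a hypothetical request call into the master control thread. */
    static void nvm_request(nvm_dir_t dir, uint64_t nvm_id, void *buf,
                            size_t len, bool block)
    {
        printf("request: dir=%d id=%llu buf=%p len=%zu %s\n",
               (int)dir, (unsigned long long)nvm_id, buf, len,
               block ? "(suspend until the DMA transfer completes)"
                     : "(prefetch; keep running)");
    }

    static char buf_a[4096], buf_d[4096];

    int main(void)
    {
        /* Like thread A at (1) in FIG. 13: nothing else to do, so ask to be
           suspended onto the NVMWAIT queue until the transfer completes. */
        nvm_request(NVM_LOAD, 1001, buf_a, sizeof buf_a, true);

        /* Like thread D at (2) in FIG. 13: other work remains, so issue the
           transfer as a prefetch and continue executing. */
        nvm_request(NVM_LOAD, 1002, buf_d, sizeof buf_d, false);
        return 0;
    }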
 Thereafter, user-level threads B and D are each suspended in turn, and execution of the next threads queued in the READY queue 810 begins. In the example of FIG. 13, user-level thread B is scheduled to run again after user-level thread C finishes, and user-level thread D enters a wait for completion of the DMA transfer it requested earlier. Next, in FIG. 13, execution of user-level threads C and E begins. During the execution of user-level thread C, completion of the DMA transfer requested at (1) is notified at (3), so user-level thread A is enqueued in the READY queue 810. User-level thread A is then scheduled after user-level thread C finishes (or is suspended) and starts running on kernel-level thread 1-2.
 On kernel-level thread 1-4, user-level thread F voluntarily returns its resources and stops at (4). Kernel-level thread 1-4 then has no thread to process for a while, after which user-level thread B is scheduled again. As described above, user-level thread B was to be rescheduled after user-level thread C completed, but the kernel-level thread it had originally been running on was taken over by user-level thread A, so it now runs on kernel-level thread 1-4. Because the virtual address space is shared among the threads within a process, the kernel-level threads 1-1 to 1-5 can be used interchangeably, much like the processors of an SMP configuration.
 FIG. 14 is a conceptual diagram showing the relationship between thread contexts (thread management information 514) and the DMA transfer of data in this embodiment. The data a thread is to process is transferred by DMA from the NVM 320 to main memory via the PIN area pool 516, while the context needed to run the thread is loaded from the main memory area where it was saved into one of the hardware threads (kernel-level threads correspond one-to-one to hardware threads).
 FIG. 15 is a flowchart showing the scheduling operation of the master control thread 550 in this embodiment. In step S1501, the master control thread 550 starts scheduling; scheduling is triggered when an idle kernel-level thread is detected. Next, in step S1502, the master control thread 550 determines whether a thread management cell 900 can be dequeued from the READY queue 810. If it cannot, scheduling ends at that point and the master control thread 550 waits for the next scheduling trigger. If it can, then in step S1503 the master control thread 550 dequeues the thread management cell 900 from the READY queue.
 Then, in step S1504, the master control thread 550 determines whether a PIN area can be allocated from the PIN area pool 516. A characteristic feature of the information processing system 100 of this embodiment is that the thread scheduling shown in FIG. 15 includes this check of whether a PIN area can be allocated.
 In the information processing system 100 of this embodiment, a DMA transfer is performed for each thread. Because DMA transfers take place without the intervention of the processor or the operating system, they are essentially performed within the physical address space: the virtual address space is managed by the operating system, so a DMA transfer cannot use virtual addresses. Consequently, when a DMA transfer is performed, the target region of main memory must reside in physical memory and be mapped in the physical address space. A process that performs DMA transfers therefore fixes part of its virtual address space to a DRAM region of the physical address space so that page-out and page-in by the virtual memory system do not occur. Such a region, fixed to physically present memory, is a PIN area, and in this embodiment it is provided to the process as the PIN area pool 516. The size of the PIN area pool 516 can be set by the user in advance.
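 On a POSIX system, the paging behaviour required of a PIN area can be approximated with mlock(), as in the sketch below. This is only an analogy for keeping pages resident: obtaining the physical addresses that the DMA engine ultimately uses is platform-specific and outside the scope of these calls, the 1 MiB pool size is an arbitrary stand-in for the user-configurable setting, and the call may require a raised RLIMIT_MEMLOCK.

```c
/* Minimal sketch of reserving a pinned buffer pool with POSIX calls.
 * mlock() prevents the pages from being paged out, which is the property the
 * text requires for DMA targets. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t pool_size = 1024 * 1024;          /* placeholder for the user-set PIN pool size */

    void *pin_pool = malloc(pool_size);
    if (!pin_pool) return 1;

    if (mlock(pin_pool, pool_size) != 0) {   /* keep the pool resident in DRAM */
        perror("mlock");                     /* may fail if RLIMIT_MEMLOCK is too low */
        free(pin_pool);
        return 1;
    }

    /* ... DMA transfers may now target addresses inside pin_pool ... */

    munlock(pin_pool, pool_size);
    free(pin_pool);
    return 0;
}
```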
 One possible approach would be to divide the PIN area pool 516 prepared for the process equally among the threads, but this works well only if the amount of data each thread needs is known in advance and the PIN area sizes the threads request are roughly equal.
 In the information processing system 100 of this embodiment, therefore, the PIN area pool 516 is prepared for the process, and the process's PIN area is managed as a resource pool. When a thread starts, it declares the size of the PIN area it needs to the master control thread 550 via the buffer request flag 907 and buffer request size 908 of its thread management cell, receives a PIN area allocation from the master control thread 550, and then moves into execution.
 In step S1504, therefore, the current remaining capacity of the PIN area pool 516 is compared with the amount of area requested by the thread about to be started, to decide whether the thread can be started; if running threads have already been granted parts of the PIN area pool 516, the remaining capacity is reduced accordingly. If it is determined that the remaining capacity of the PIN area pool 516 is insufficient, then in step S1506 the master control thread 550 places the thread at the tail of the READY queue 810 and gives priority to executing other threads. If, on the other hand, the required area can be secured from the PIN area pool 516, then in step S1505 the master control thread 550 allocates a PIN area to the requesting thread and performs the thread switch. The master control thread 550 notifies the requesting thread of the allocated area by writing to the buffer allocation flag 909 and buffer area start address 910 of its thread management cell 900.
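 Steps S1502 to S1506 can be sketched roughly as follows, with the PIN area pool reduced to a bump allocator and the release of regions by finished threads omitted; the field names only loosely mirror the buffer request size 908, buffer allocation flag 909, and buffer area start address 910, and nothing here is taken verbatim from the specification.

```c
/* Rough sketch of S1502-S1506: take a candidate from READY, check the PIN pool,
 * and either grant a region and run the thread or requeue the candidate. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct tcell {
    struct tcell *next;
    int           thread_id;
    size_t        buf_request_size;   /* size the thread declared (cf. field 908) */
    bool          buf_allocated;      /* allocation granted       (cf. field 909) */
    uintptr_t     buf_start_addr;     /* start of granted region  (cf. field 910) */
} tcell_t;

typedef struct { tcell_t *head, *tail; } tqueue_t;

typedef struct {
    uintptr_t base;
    size_t    total;   /* user-configured pool size              */
    size_t    used;    /* bytes granted to threads still running */
} pin_pool_t;

static void enqueue_tail(tqueue_t *q, tcell_t *c)
{
    c->next = NULL;
    if (q->tail) q->tail->next = c; else q->head = c;
    q->tail = c;
}

static tcell_t *dequeue(tqueue_t *q)
{
    tcell_t *c = q->head;
    if (c) { q->head = c->next; if (!q->head) q->tail = NULL; }
    return c;
}

static void schedule_once(tqueue_t *ready, pin_pool_t *pool)
{
    tcell_t *cand = dequeue(ready);                    /* S1502 / S1503 */
    if (!cand)
        return;                                        /* nothing runnable: wait for next trigger */

    if (pool->total - pool->used < cand->buf_request_size) {
        enqueue_tail(ready, cand);                     /* S1506: pool too small right now */
        return;
    }

    cand->buf_start_addr = pool->base + pool->used;    /* S1505: grant a PIN region */
    cand->buf_allocated  = true;
    pool->used          += cand->buf_request_size;

    printf("thread %d runs with a %zu-byte PIN region\n",
           cand->thread_id, cand->buf_request_size);   /* stand-in for the thread switch */
}

int main(void)
{
    pin_pool_t pool  = { .base = 0x100000, .total = 1 << 20, .used = 0 };
    tqueue_t   ready = { NULL, NULL };
    tcell_t    t1 = { .thread_id = 1, .buf_request_size = 512 * 1024 };
    tcell_t    t2 = { .thread_id = 2, .buf_request_size = 768 * 1024 };

    enqueue_tail(&ready, &t1);
    enqueue_tail(&ready, &t2);
    schedule_once(&ready, &pool);   /* t1 gets its region                    */
    schedule_once(&ready, &pool);   /* t2 is requeued: only 512 KiB remains  */
    return 0;
}
```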
 By securing memory space in the physical address space of main memory according to the capacity each scheduled thread needs for its DMA transfer, the system can flexibly accommodate variation in the PIN area sizes requested by individual threads. This in turn enables efficient DMA transfers and speeds up the processing of the information processing system 100.
 This embodiment describes an example of an information processing system that can execute threads even when, for instance in still larger-scale graph processing, the number of threads is so large that their thread management cells 900 no longer fit in main memory.
 When the number of vertices becomes enormous, as in large-scale graph processing, the number of threads also becomes enormous, and it becomes impossible even to keep the thread management cells 900 of all threads in main memory. In this embodiment, therefore, as shown in FIG. 16, the thread management cells 900 themselves are placed in the NVM 320 and are DMA-transferred to main memory for use as needed. Furthermore, the master control thread 550 DMA-transfers a thread management cell 900 from the NVM 320 in advance, before it is actually needed; in other words, the thread management cells 900 are prefetched. As shown in FIG. 16, thread contexts are DMA-transferred between main memory and the NVM, being saved or restored as required.
 Since thread management cells 900 are basically needed in the order in which they were enqueued in the READY queue 810, it suffices to monitor the READY queue 810 and prefetch in that order. Alternatively, a separate DMA controller may be provided that holds the head address of the READY queue 810 and reads the READY queue 810 in parallel with the processor to perform the prefetching.
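 A prefetcher of this kind might look like the following sketch, which walks a fixed window of upcoming READY entries and issues an NVM-to-DRAM read for any cell not yet resident; issue_nvm_read stands in for whatever programs the hybrid memory controller 310 and, like the other names here, is an assumption rather than part of the specification.

```c
/* Sketch of prefetching thread management cells in READY-queue order. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct entry {
    struct entry *next;
    int           thread_id;
    size_t        nvm_offset;   /* where the full cell lives in the NVM */
    bool          resident;     /* already copied into main memory?     */
} entry_t;

static void issue_nvm_read(size_t nvm_offset, int thread_id)
{
    /* stub: would enqueue a DMA descriptor for the hybrid memory controller */
    printf("prefetch cell of thread %d from NVM offset %zu\n", thread_id, nvm_offset);
}

/* Walk the first `depth` READY entries and prefetch the ones still in NVM. */
static void prefetch_ready_cells(entry_t *ready_head, int depth)
{
    for (entry_t *e = ready_head; e && depth > 0; e = e->next, depth--) {
        if (!e->resident) {
            issue_nvm_read(e->nvm_offset, e->thread_id);
            e->resident = true;   /* optimistically mark as in flight */
        }
    }
}

int main(void)
{
    entry_t c = { NULL, 3, 8192, false };
    entry_t b = { &c,   2, 4096, true  };
    entry_t a = { &b,   1, 0,    false };

    prefetch_ready_cells(&a, 2);   /* prefetches thread 1; thread 2 is already resident */
    return 0;
}
```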
 The invention made by the present inventors has been described above in detail on the basis of embodiments, but the present invention is not limited to those embodiments, and it goes without saying that various modifications are possible without departing from the gist of the invention.
100: information processing system, 110: node, 120: inter-node network, 130: NVM subsystem interconnect, 210, 220: processor, 230, 240: DIMM, 250: I/O hub, 260: NIC, 270: disk controller, 280: HDD, 290: SSD, 300: NVM subsystem, 310: hybrid memory controller, 320: NVM (non-volatile memory), 330: volatile memory, 311: MMR (memory-mapped register), 410, 420: process, 510: inter-thread shared resources, 511: program code, 512: global variables, 513: heap area, 514: thread management information, 515: NVM transfer request management information, 516: PIN area pool.

Claims (11)

  1.  An information processing apparatus comprising:
     a multithreaded processor;
     a first storage device;
     a second storage device that performs DMA transfers with the first storage device; and
     an operating system that allocates a physical address space to the first storage device and provides a virtual address space on the physical address space,
     wherein a memory space is secured according to the capacity that a thread scheduled for execution requires for the DMA transfer in the physical address space,
     the thread is moved into execution, and
     the secured memory space is released after processing of the thread is completed.
  2.  The information processing apparatus according to claim 1, wherein
     an area of the physical address space to be used for the DMA transfer is set in advance, and
     the memory space is secured within the preset area, excluding the areas already secured for threads being executed.
  3.  The information processing apparatus according to claim 1, wherein
     a thread that has been moved into execution and is waiting for completion of a DMA transfer is suspended.
  4.  The information processing apparatus according to claim 3, wherein
     the context of a suspended thread is stored in the first storage device.
  5.  The information processing apparatus according to claim 3, wherein
     the context of a suspended thread is stored in the second storage device.
  6.  The information processing apparatus according to claim 1, wherein
     the first storage device is a main storage device.
  7.  The information processing apparatus according to claim 1, wherein
     the first storage device comprises a volatile memory, and
     the second storage device comprises a nonvolatile memory.
  8.  The information processing apparatus according to claim 7, further comprising a hard disk drive.
  9.  The information processing apparatus according to claim 1, wherein
     the multithreaded processor has a plurality of cores.
  10.  The information processing apparatus according to claim 1, wherein
     the volatile memory is a DRAM, and
     the nonvolatile memory is a flash memory.
  11.  The information processing apparatus according to claim 1, wherein
     the volatile memory is a DRAM, and
     the nonvolatile memory is a phase change memory.
PCT/JP2012/069078 2012-07-27 2012-07-27 Information processing device WO2014016951A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device
JP2014526682A JP5847313B2 (en) 2012-07-27 2012-07-27 Information processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device

Publications (1)

Publication Number Publication Date
WO2014016951A1 (en) 2014-01-30

Family

ID=49996784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/069078 WO2014016951A1 (en) 2012-07-27 2012-07-27 Information processing device

Country Status (2)

Country Link
JP (1) JP5847313B2 (en)
WO (1) WO2014016951A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8706923B2 (en) * 2010-09-14 2014-04-22 Texas Instruments Incorported Methods and systems for direct memory access (DMA) in-flight status

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005190293A (en) * 2003-12-26 2005-07-14 Toshiba Corp Information processor, information processing program
JP2008021290A (en) * 2006-06-15 2008-01-31 Hitachi Ulsi Systems Co Ltd Storage device, storage controller, and information processing apparatus
JP2010152527A (en) * 2008-12-24 2010-07-08 Sony Computer Entertainment Inc Method and apparatus for providing user level dma and memory access management

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021210123A1 (en) * 2020-04-16 2021-10-21 日本電信電話株式会社 Scheduling method, scheduler, gpu cluster system, and program
JP7385156B2 (en) 2020-04-16 2023-11-22 日本電信電話株式会社 Scheduling method, scheduler, GPU cluster system and program

Also Published As

Publication number Publication date
JP5847313B2 (en) 2016-01-20
JPWO2014016951A1 (en) 2016-07-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12881885

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014526682

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12881885

Country of ref document: EP

Kind code of ref document: A1