WO2019037093A1 - Spark distributed computing data processing method and system - Google Patents

Spark distributed computing data processing method and system

Info

Publication number
WO2019037093A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage area
memory storage
eviction
data
space
Prior art date
Application number
PCT/CN2017/099083
Other languages
French (fr)
Chinese (zh)
Inventor
毛睿
陆敏华
陆克中
朱金彬
隋秀峰
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学 filed Critical 深圳大学
Priority to PCT/CN2017/099083 priority Critical patent/WO2019037093A1/en
Publication of WO2019037093A1 publication Critical patent/WO2019037093A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • the present invention relates to the field of computers, and in particular, to a Spark distributed computing data processing method and system.
  • Spark, a general-purpose, fast, large-scale data processing engine, has become a popular computing framework for big data applications, and performs especially well in iterative workloads such as graph computing and machine learning.
  • As data sets keep growing, insufficient space means that some partition data cannot be cached in memory, or that data already cached in memory must be migrated to disk, which degrades Spark performance.
  • To address this, Spark introduced a unified memory management model: when the caching task for partition data cannot obtain enough storage-area space, the model actively migrates data already cached in the storage area to disk or drops it outright.
  • The unified memory management model offers some flexibility: by migrating or dropping cached data, it eases the tension between Spark's need to cache large data sets and the shortage of storage-area space.
  • However, because cached intermediate data is dropped or migrated to disk, using that data again requires either re-running the corresponding computation or reading it back from disk.
  • The Spark unified memory management model therefore causes some Spark tasks to be recomputed or to read from disk, which seriously harms Spark performance.
  • the main purpose of the present invention is to provide a Spark distributed computing data processing method and system, which aims to solve the technical problem of repeated Spark task calculation or disk reading in the Spark unified memory management model in the prior art.
  • A first aspect of the present invention provides a data processing method for a Spark distributed computing system, the method comprising:
  • when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, if the request for space in Spark's memory storage area fails, sending the eviction logic unit a command to evict the evictable cached data of the memory storage area;
  • calculating the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, setting a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
  • reading and releasing the evictable cached data in the memory storage area, migrating that data to the migration address, modifying the persistence level of that data, and feeding back an eviction success signal and eviction information.
  • A second aspect of the present invention further provides a Spark distributed computing data processing system, the system comprising:
  • an application storage module, configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict the cached data of the memory storage area if the request for space in Spark's memory storage area fails;
  • an address calculation module, configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, set a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
  • a data migration module, configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify the persistence level of that data, and feed back an eviction success signal and eviction information.
  • By introducing an SSD and an HDD to build a hybrid storage system, and by designing an eviction logic unit and a cache data migration unit, partition data is migrated flexibly to the SSD or the HDD according to its access heat, instead of cached intermediate data being migrated straight to disk or dropped.
  • This eases the tension between the large storage-area demand of Spark partition data caching and the shortage of memory space.
  • When partition data is later used, the high read and write speed of the hybrid storage system, together with the storage of partition data in separate tiers according to heat, allows partition data of different access heat to be read quickly from the hybrid storage system, which improves Spark performance.
  • FIG. 1 is a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a refinement step of step 101 of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of a refinement step of step 102 of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart of a refinement step in step 304 of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of the refinement steps for data migration in step 103 of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 6 is a schematic flowchart of the refinement steps for modifying the data persistence level in step 103 of a Spark distributed computing data processing method according to an embodiment of the present invention
  • FIG. 7 is a schematic diagram of functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a refinement function module of an application storage module 601 of a Spark distributed computing data processing system according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram of the refinement function modules of the address calculation module 602 of the Spark distributed computing data processing system according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the refinement function modules of the data migration module 603 of the Spark distributed computing data processing system according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention, where the processing method includes:
  • The migration address in the hybrid storage system based on SSD and HDD is set according to the access heat of the evictable cached data in the memory storage area.
  • A hybrid storage system is built by introducing an SSD and an HDD, and an eviction logic unit and a cache data migration unit are designed, so that partition data is migrated flexibly to the SSD or the HDD according to its access heat, instead of cached intermediate data being migrated straight to disk or dropped.
  • This eases the tension between the large storage-area demand of Spark partition data caching and the shortage of memory space.
  • When partition data is later used, the high read and write speed of the hybrid storage system, together with the storage of partition data in separate tiers according to heat, allows partition data of different access heat to be read quickly from the hybrid storage system, which improves Spark performance.
  • FIG. 2 is a schematic flowchart of the refinement steps of step S101 of the Spark distributed computing data processing method according to an embodiment of the present invention, where the refinement steps include:
  • The Spark execution engine schedules subtasks through the task scheduler. While a subtask runs, it performs a storage task on the RDD partition data that the user has marked for caching and requests space from the Spark memory storage area. If the request succeeds, the RDD partition data is stored directly.
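The request step described above can be sketched as follows. This is a minimal illustration, not Spark's implementation: `MemoryStore`, `try_reserve`, and `store_partition` are hypothetical names, and sizes are plain byte counts.

```python
from dataclasses import dataclass


@dataclass
class MemoryStore:
    """Toy stand-in for the Spark memory storage area (sizes in bytes)."""
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used

    def try_reserve(self, size: int) -> bool:
        """Reserve space if it fits and report whether the request succeeded."""
        if size <= self.free():
            self.used += size
            return True
        return False


def store_partition(store: MemoryStore, partition_size: int) -> str:
    """Return 'stored' when the space request succeeds, or 'evict' to signal
    that the eviction logic unit must be asked to free space first."""
    if store.try_reserve(partition_size):
        return "stored"
    return "evict"
```

In this sketch a failed request leaves the store untouched, so the caller can hand the required size to the eviction logic unit and retry afterwards.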
  • FIG. 3 is a schematic flowchart of the refinement steps of step S102 of the Spark distributed computing data processing method according to an embodiment of the present invention, where the refinement steps include:
  • The eviction logic unit receives the eviction command and, because the memory storage area has too little space for the RDD partition data storage task, sends the memory storage area a request to evict memory storage space.
  • The memory storage area determines whether it has evictable space and reports the result to the eviction logic unit.
  • The evictable space is calculated with the least recently used (LRU) policy, which selects data for eviction according to the historical access records of the data in the memory storage area.
  • The core idea of LRU is that data accessed recently is more likely to be accessed again in the future; the size of the evictable space in the memory storage area is determined from this access probability.
  • The space available after eviction is compared with the space that the storage task needs to occupy.
  • If the space is insufficient, the migration task for the evictable cached data in the memory storage area is terminated, and a signal is fed back indicating that evicting cached data from the memory storage area failed.
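The LRU-based space calculation can be illustrated with a short sketch. The function name `plan_lru_eviction` and the block identifiers are hypothetical, and the cache is assumed to be an `OrderedDict` kept in least-recently-used-first order (for example by calling `move_to_end` on every access).

```python
from collections import OrderedDict
from typing import List, Optional


def plan_lru_eviction(cached: "OrderedDict[str, int]",
                      free_space: int,
                      needed: int) -> Optional[List[str]]:
    """Pick blocks to evict, least recently used first.

    cached maps block id to size, ordered from least to most recently used.
    Returns the ids to evict, or None when even evicting everything would
    not free enough space (the failure case that ends the migration task).
    """
    victims: List[str] = []
    reclaimed = free_space
    for block_id, size in cached.items():
        if reclaimed >= needed:
            break
        victims.append(block_id)
        reclaimed += size
    return victims if reclaimed >= needed else None
```

Returning `None` corresponds to the eviction failure signal; an empty list means the request already fits without evicting anything.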
  • FIG. 4 is a schematic flowchart of the refinement steps of step S304 of the Spark distributed computing data processing method according to an embodiment of the present invention, where the refinement steps include:
  • The first preset heat value range covers evictable cached data in the memory storage area whose access heat is high; the specific range can be set freely by the user.
  • The first preset heat value is greater than the second preset heat value.
  • The second preset heat value range covers evictable cached data in the memory storage area whose access heat is low; the specific range can be set freely by the user.
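The heat-based choice of migration target can be sketched as a small function. The name `choose_tier`, the inclusive `(low, high)` ranges, and the numeric heat scale are assumptions for illustration; the text only states that both ranges are user-configurable and that the first lies above the second. The text does not say what happens to a heat value outside both ranges, so this sketch returns `None` in that case.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds, chosen by the user


def choose_tier(heat: float, ssd_range: Range, hdd_range: Range) -> Optional[str]:
    """Map the access heat of an evictable block to a storage tier.

    ssd_range plays the role of the first (high) preset heat value range,
    hdd_range the role of the second (low) preset heat value range.
    """
    if ssd_range[0] <= heat <= ssd_range[1]:
        return "SSD"
    if hdd_range[0] <= heat <= hdd_range[1]:
        return "HDD"
    return None  # outside both ranges: behaviour not specified in the text
```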
  • FIG. 5 is a schematic flowchart of the refinement steps for data migration in step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention.
  • The refinement steps include:
  • The cache data migration unit receives the migration information and the migration command for the evictable cached data in the memory storage area, and stores that data on the SSD or the HDD according to the migration information.
  • After receiving the migration information and the migration command, the cache data migration unit first reads the specified cached data from the memory storage area and releases the corresponding memory space, then migrates the data to the SSD or the HDD according to the migration address.
  • The migration information for the evictable data includes the address of the evictable cached data in the memory storage area, the size of the space it occupies, and the migration address.
  • A migration completion signal for the evictable cached data is sent to the eviction logic unit.
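A minimal sketch of the read, release, and migrate sequence follows. It assumes the memory storage area is a plain `dict` of byte strings and that the migration address is a directory on the chosen device; `MigrationInfo` and `migrate_block` are hypothetical names that mirror the three fields of the migration information listed above.

```python
import os
from dataclasses import dataclass


@dataclass
class MigrationInfo:
    block_id: str    # stands in for the address of the cached data in memory
    size: int        # space the cached data occupies, in bytes
    target_dir: str  # migration address: a directory on the SSD or the HDD


def migrate_block(memory: dict, info: MigrationInfo) -> str:
    """Read the block, release its memory, and write it to the target tier.

    Returns the path of the migrated file, which serves here as the
    migration completion signal.
    """
    data = memory.pop(info.block_id)  # read and release in one step
    path = os.path.join(info.target_dir, info.block_id)
    with open(path, "wb") as f:
        f.write(data)
    return path
```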
  • FIG. 6 is a schematic flowchart of the refinement steps for modifying the data persistence level in step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention.
  • The refinement steps include:
  • If the migration address of the evictable cached data in the memory storage area is the SSD, the persistence level of that data is changed to SSD_ONLY.
  • Once the modification is complete, an eviction success signal and the migration information for the evicted data are fed back, so that the RDD partition data can enter the memory storage area and complete the storage task.
  • FIG. 7 is a schematic diagram of functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention.
  • The functional modules include:
  • The application storage module 601 is configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict the cached data of the memory storage area if the request for space in the Spark memory storage area fails.
  • The address calculation module 602 is configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, set a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area.
  • The data migration module 603 is configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify the persistence level of that data, and feed back an eviction success signal and eviction information.
  • FIG. 8 is a schematic diagram of the refinement function modules of the application storage module 601 of the Spark distributed computing data processing system according to an embodiment of the present disclosure, where the refinement function modules include:
  • The first application module 6011 is configured to calculate the size of the memory storage space that the storage task for the RDD partition data needs, request that space from the Spark memory storage area, and compare the requested size with the unoccupied space of the memory storage area.
  • The first feedback module 6012 is configured to, if the memory storage space that the storage task needs is larger than the unoccupied space of the memory storage area, so that the request for space from the Spark memory storage area fails, send the eviction logic unit the command to evict the evictable cached data of the memory storage area together with the size of the memory storage space that the storage task needs.
  • FIG. 9 is a schematic diagram of the refinement function modules of the address calculation module 602 of the Spark distributed computing data processing system according to an embodiment of the present disclosure, where the refinement function modules include:
  • The second application module 6021 is configured so that the eviction logic unit receives the eviction command and, because the memory storage area has too little space for the RDD partition data storage task, sends the memory storage area a request to evict memory storage space; if the request succeeds, the size of the evictable space in the memory storage area is calculated according to the least recently used (LRU) policy.
  • The migration address module 6022 is configured to, if the unoccupied space of the memory storage area after eviction is greater than or equal to the space that the RDD partition data storage task needs, set the migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area, and send the migration information and the migration command for that data to the cache data migration unit.
  • The second feedback module 6023 is configured to, if the unoccupied space of the memory storage area after eviction is smaller than the space that the RDD partition data storage task needs, terminate the migration task for the evictable cached data and feed back a signal indicating that evicting cached data from the memory storage area failed.
  • The SSD migration address module 6024 is configured to, if the access heat of the evictable cached data in the memory storage area is within the first preset heat value range, read the SSD address and set it as the migration address.
  • The HDD migration address module 6025 is configured to, if the access heat of the evictable cached data in the memory storage area is within the second preset heat value range, read the HDD address and set it as the migration address.
  • FIG. 10 is a schematic diagram of the refinement function modules of the data migration module 603 of the Spark distributed computing data processing system according to an embodiment of the present invention.
  • The refinement function modules include:
  • The third feedback module 6031 is configured to send the eviction logic unit a migration completion signal for the evictable cached data in the memory storage area.
  • The SSD persistence level module 6032 is configured to, if the migration address of the evictable cached data in the memory storage area is the SSD, change the persistence level of that data to SSD_ONLY.
  • The HDD persistence level module 6033 is configured to, if the migration address of the evictable cached data in the memory storage area is the HDD, change the persistence level of that data to HDD_ONLY.
  • The fourth feedback module 6034 is configured to feed back the eviction success signal and the migration information for the evicted data, so that the RDD partition data can enter the memory storage area and complete the storage task.
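The persistence level update performed by modules 6032 and 6033 can be sketched as a lookup from migration target to level. `SSD_ONLY` and `HDD_ONLY` are the level names used in this document and are modelled here as a plain Python enum rather than as Spark storage levels; `MEMORY_ONLY` as the starting level and the helper name `level_after_migration` are assumptions for illustration.

```python
from enum import Enum


class PersistenceLevel(Enum):
    MEMORY_ONLY = "MEMORY_ONLY"  # assumed level of a block before eviction
    SSD_ONLY = "SSD_ONLY"        # level named in the text for SSD targets
    HDD_ONLY = "HDD_ONLY"        # level named in the text for HDD targets


def level_after_migration(tier: str) -> PersistenceLevel:
    """Return the persistence level a block gets once migrated to `tier`."""
    if tier == "SSD":
        return PersistenceLevel.SSD_ONLY
    if tier == "HDD":
        return PersistenceLevel.HDD_ONLY
    raise ValueError(f"unknown migration target: {tier!r}")
```

Recording the level alongside the block lets a later read know which tier to consult without probing both devices.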
  • the disclosed methods and systems may be implemented in other manners.
  • the system embodiments described above are merely illustrative.
  • the division of modules is only a logical function division.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or module, and may be electrical, mechanical or otherwise.
  • the modules described as separate components may or may not be physically separate.
  • the components displayed as modules may or may not be physical modules, that is, may be located in one place, or may be distributed to multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • An integrated module if implemented as a software functional module and sold or used as a standalone product, can be stored in a computer readable storage medium.
  • The technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present invention.
  • The foregoing storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to the field of computers and provides a Spark distributed computing data processing method. The method comprises: scheduling a sub-task by means of a task scheduler, executing an RDD partition data storage task, and requesting space in a storage area (S101); calculating the size of the evictable space in the storage area, and setting a migration address in a hybrid storage system according to the access heat of the partition data (S102); and reading the cached data in the specified storage area, releasing the corresponding memory space, migrating the partition data to the specified address, modifying the persistence level of the migrated data, and feeding back an eviction success signal and information about the evicted space (S103). Also provided is a Spark distributed computing system. By introducing the hybrid storage system and designing an eviction logic unit and a cache data migration unit, data is migrated to an SSD or an HDD according to the access heat of the partition data, rather than being migrated directly to a magnetic disk or deleted, so that the pressure of memory space shortage is effectively reduced and Spark performance is improved.

Description

Spark distributed computing data processing method and system
Technical field
The present invention relates to the field of computers, and in particular to a Spark distributed computing data processing method and system.
Background art
As science and technology advance, the demands placed on large-scale data processing keep rising. Big data applications depend heavily on memory, and ample memory is the prerequisite and guarantee for computing on big data quickly.
Spark, a general-purpose, fast, large-scale data processing engine, has become a popular computing framework for big data applications, and performs especially well in iterative workloads such as graph computing and machine learning. As data sets keep growing, insufficient space means that some partition data cannot be cached in memory, or that data already cached in memory must be migrated to disk, which degrades Spark performance. To address this, Spark introduced a unified memory management model: when the caching task for partition data cannot obtain enough storage-area space, the model actively migrates data already cached in the storage area to disk or drops it outright. The unified memory management model offers some flexibility: by migrating or dropping cached data, it eases the tension between Spark's need to cache large data sets and the shortage of storage-area space.
However, because cached intermediate data is dropped or migrated to disk, using that data again requires either re-running the corresponding computation or reading it back from disk. The Spark unified memory management model therefore causes some Spark tasks to be recomputed or to read from disk, which seriously harms Spark performance.
Technical problem
The main purpose of the present invention is to provide a Spark distributed computing data processing method and system, aiming to solve the technical problem in the prior art that the Spark unified memory management model causes some Spark tasks to be recomputed or to read from disk.
Technical solution
To achieve the above purpose, a first aspect of the present invention provides a data processing method for a Spark distributed computing system, the method comprising:
when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, if the request for space in Spark's memory storage area fails, sending the eviction logic unit a command to evict the evictable cached data of the memory storage area;
calculating the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, setting a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
reading and releasing the evictable cached data in the memory storage area, migrating that data to the migration address, modifying the persistence level of that data, and feeding back an eviction success signal and eviction information.
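The three steps of the method can be tied together in one short sketch. It is an illustration under simplifying assumptions, not the patented implementation: the memory storage area and the two tiers are plain dictionaries, block sizes are byte-string lengths, eviction order is least recently used first, and a single hypothetical `hot_threshold` replaces the two user-defined heat ranges.

```python
from collections import OrderedDict
from typing import Dict


def cache_partition(memory: "OrderedDict[str, bytes]", capacity: int,
                    heat: Dict[str, float], block_id: str, data: bytes,
                    ssd: Dict[str, bytes], hdd: Dict[str, bytes],
                    hot_threshold: float) -> bool:
    """Try to cache `data`; evict older blocks to the SSD or HDD if needed.

    `memory` is ordered from least to most recently used.
    Returns True when the partition ends up cached in memory.
    """
    free = capacity - sum(len(v) for v in memory.values())
    need = len(data)
    if need > free:                       # step 1: the space request fails
        victims, reclaimed = [], free     # step 2: size the evictable space
        for victim_id, victim in memory.items():
            if reclaimed >= need:
                break
            victims.append(victim_id)
            reclaimed += len(victim)
        if reclaimed < need:
            return False                  # eviction failure signal
        for victim_id in victims:         # step 3: release and migrate
            victim = memory.pop(victim_id)
            tier = ssd if heat.get(victim_id, 0.0) >= hot_threshold else hdd
            tier[victim_id] = victim
    memory[block_id] = data
    return True
```

Victims are collected before any block is removed, so a request that cannot be satisfied leaves the memory storage area unchanged.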
To achieve the above purpose, a second aspect of the present invention further provides a Spark distributed computing data processing system, the system comprising:
an application storage module, configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict the cached data of the memory storage area if the request for space in Spark's memory storage area fails;
an address calculation module, configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, set a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
a data migration module, configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify the persistence level of that data, and feed back an eviction success signal and eviction information.
Beneficial effects
A hybrid storage system is built from SSD and HDD, and an eviction logic unit and a cache data migration unit are provided. Partition data is migrated to the SSD or the HDD according to its access heat, rather than migrating cached intermediate data directly to disk or discarding cached data. This relieves the pressure caused by the large storage-area demand of cached Spark partition data and the shortage of memory space. When the partition data is later used, the high read/write speed of the hybrid storage system and the separate storage of partition data by heat allow partition data of different access heat to be read quickly from the hybrid storage system, improving Spark performance.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of the refinement of step S101 of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 3 is a schematic flowchart of the refinement of step S102 of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the refinement of step S304 of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 5 is a schematic flowchart of the refinement of the data migration step in step S103 of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 6 is a schematic flowchart of the refinement of the step of modifying the data persistence level in step S103 of the Spark distributed computing data processing method in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the functional modules of the Spark distributed computing data processing system in an embodiment of the present invention;
FIG. 8 is a schematic diagram of the refinement functional modules of the storage request module 601 of the Spark distributed computing data processing system in an embodiment of the present invention;
FIG. 9 is a schematic diagram of the refinement functional modules of the address calculation module 602 of the Spark distributed computing data processing system in an embodiment of the present invention;
FIG. 10 is a schematic diagram of the refinement functional modules of the data migration module 603 of the Spark distributed computing data processing system in an embodiment of the present invention.
Embodiments of the invention
To make the purpose, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of the Spark distributed computing data processing method in an embodiment of the present invention. The method includes:
S101. When a storage task is performed on partition data of a resilient distributed dataset (RDD) that the user has marked for caching, if the request for space in Spark's memory storage area fails, a command to evict the cached data in the memory storage area is sent to the eviction logic unit.
S102. The size of the evictable space in the memory storage area is calculated. If the space available after eviction meets the storage task's requirement for memory storage area space, a migration address in the hybrid storage system based on SSD and HDD is set according to the access heat of the evictable cached data in the memory storage area.
S103. The evictable cached data in the memory storage area is read and released, the evictable cached data is migrated to the migration address, the persistence level of the evictable cached data is modified, and an eviction success signal and eviction information are fed back.
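Steps S101 to S103 can be illustrated with a minimal Python sketch. It is not the patented implementation and uses no real Spark API: the `CachedBlock` and `MemoryStore` classes, the single `hot_threshold` that stands in for the preset heat ranges, and the choice to release the coldest blocks first are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CachedBlock:
    """One cached RDD partition in the memory storage area (hypothetical model)."""
    block_id: str
    size: int              # space occupied in the memory storage area
    heat: int              # access heat, assumed here to be an access count
    evictable: bool = True


@dataclass
class MemoryStore:
    """Hypothetical stand-in for Spark's memory storage area."""
    capacity: int
    blocks: Dict[str, CachedBlock] = field(default_factory=dict)

    def free_space(self) -> int:
        return self.capacity - sum(b.size for b in self.blocks.values())


def store_partition(store: MemoryStore, new_block: CachedBlock,
                    hot_threshold: int) -> Tuple[bool, List[Tuple[str, str]]]:
    """Try to cache new_block; return (success, [(evicted block id, target tier)])."""
    if new_block.size <= store.free_space():
        store.blocks[new_block.block_id] = new_block
        return True, []

    # S101: the space request failed, so eviction is requested.
    candidates = sorted((b for b in store.blocks.values() if b.evictable),
                        key=lambda b: b.heat)

    # S102: check that the free space after eviction would satisfy the task.
    evictable_space = sum(b.size for b in candidates)
    if store.free_space() + evictable_space < new_block.size:
        return False, []

    # S103: release the coldest blocks first until the new block fits,
    # recording the tier each released block is migrated to.
    migrated: List[Tuple[str, str]] = []
    for victim in candidates:
        if new_block.size <= store.free_space():
            break
        tier = "SSD" if victim.heat >= hot_threshold else "HDD"
        del store.blocks[victim.block_id]
        migrated.append((victim.block_id, tier))

    store.blocks[new_block.block_id] = new_block
    return True, migrated
```

In this sketch a failed eviction leaves the memory store untouched, which mirrors the text's separation between checking the evictable space and actually migrating data.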
In this embodiment of the present invention, a hybrid storage system is built from SSD and HDD, and an eviction logic unit and a cache data migration unit are provided. Partition data is migrated to the SSD or the HDD according to its access heat, rather than migrating cached intermediate data directly to disk or discarding cached data. This relieves the pressure caused by the large storage-area demand of cached Spark partition data and the shortage of memory space. When the partition data is later used, the high read/write speed of the hybrid storage system and the separate storage of partition data by heat allow partition data of different access heat to be read quickly from the hybrid storage system, improving Spark performance.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the refinement of step S101 of the Spark distributed computing data processing method in an embodiment of the present invention. The refinement includes:
S201. The size of the memory storage area space that the storage task on the RDD partition data will occupy is calculated, space is requested from Spark's memory storage area, and the size required by the storage task is compared with the unoccupied space of the memory storage area;
Specifically, the Spark execution engine schedules subtasks through the task scheduler. In the subtask runtime space, a storage task is performed on the RDD partition data that the user has marked for caching, and a request for space is then made to Spark's memory storage area. If the request succeeds, the RDD partition data is stored directly.
S202. If the size of the memory storage area space required by the storage task is greater than the unoccupied space of the memory storage area, the request for space in Spark's memory storage area fails. A command to evict the evictable cached data in the memory storage area is then sent to the eviction logic unit, together with the size of the memory storage area space that the storage task requires.
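The comparison in S201 and the message sent in S202 can be sketched as follows. This is an illustrative sketch only: the `EvictionCommand` message and the list used as the eviction logic unit's inbox are hypothetical and do not correspond to any Spark class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvictionCommand:
    """Message sent to the eviction logic unit in S202 (hypothetical shape)."""
    required_space: int   # space the storage task needs in the memory storage area


def request_storage(required_space: int, free_space: int,
                    eviction_inbox: List[EvictionCommand]) -> bool:
    """S201: compare the required size with the unoccupied space.

    Returns True when the request succeeds. On failure (S202) an eviction
    command carrying the required size is appended to eviction_inbox.
    """
    if required_space <= free_space:
        return True
    eviction_inbox.append(EvictionCommand(required_space))
    return False
```

Following the wording of S202, the request fails only when the required size is strictly greater than the unoccupied space, so an exact fit succeeds.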
Referring to FIG. 3, FIG. 3 is a schematic flowchart of the refinement of step S102 of the Spark distributed computing data processing method in an embodiment of the present invention. The refinement includes:
S301. The eviction logic unit receives the eviction command and sends the memory storage area a request to evict memory storage area space, because the storage space needed to perform the storage task on the RDD partition data is insufficient;
Further, after receiving the request from the eviction logic unit, the memory storage area determines whether it has evictable space and feeds the result back to the eviction logic unit.
S302. If the request succeeds, the size of the evictable space in the memory storage area is calculated according to the least recently used (LRU) policy;
Here, the LRU policy evicts data according to the historical access heat records of the data in the memory storage area. Its core idea is that data accessed recently is more likely to be accessed again in the future; the size of the evictable space in the memory storage area is determined from this access likelihood.
S303. If the size of the evictable space in the memory storage area is greater than or equal to the space required to perform the storage task on the RDD partition data, step S304 is performed.
S304. A migration address in the hybrid storage system based on SSD and HDD is set according to the access heat of the evictable cached data in the memory storage area, and the migration information and the migration command for the evictable cached data are sent to the cache data migration unit.
S305. If the size of the evictable space in the memory storage area is smaller than the space required to perform the storage task on the RDD partition data, step S306 is performed.
S306. The migration task for the evictable cached data in the memory storage area is terminated, and a signal indicating that eviction of the evictable cached data failed is fed back.
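The LRU bookkeeping of S302 and the branch between S303/S304 and S305/S306 can be sketched with Python's `collections.OrderedDict`. The `LruMemoryStore` class is hypothetical, every block is treated as evictable for simplicity, and the sufficiency test uses the unoccupied space after eviction (free space plus evictable space), which is one reading of the text and is stated here as an assumption.

```python
from collections import OrderedDict
from typing import List, Tuple


class LruMemoryStore:
    """Tracks cached blocks in least-recently-used order (oldest first)."""

    def __init__(self) -> None:
        self._blocks: "OrderedDict[str, int]" = OrderedDict()  # block id -> size

    def put(self, block_id: str, size: int) -> None:
        """Cache a block; a newly stored block counts as most recently used."""
        self._blocks[block_id] = size
        self._blocks.move_to_end(block_id)

    def touch(self, block_id: str) -> None:
        """Record an access: the block becomes the most recently used."""
        self._blocks.move_to_end(block_id)

    def plan_eviction(self, required: int, free: int) -> Tuple[bool, List[str]]:
        """Return (ok, victims) for a storage task needing `required` space."""
        evictable = sum(self._blocks.values())        # S302: evictable space
        if free + evictable < required:               # S305: not enough space
            return False, []                          # S306: failure signal
        victims: List[str] = []
        reclaimed = free
        for block_id, size in self._blocks.items():   # least recently used first
            if reclaimed >= required:
                break
            victims.append(block_id)
            reclaimed += size
        return True, victims                          # S303: proceed to S304
```

`plan_eviction` only selects victims; choosing each victim's migration address and moving the data are left to the later steps.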
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the refinement of step S304 of the Spark distributed computing data processing method in an embodiment of the present invention. The refinement includes:
S3041. The access heat of the evictable cached data in the memory storage area is determined.
S3042. If the access heat of the evictable cached data falls within a first preset heat value range, an SSD address is read and set as the migration address;
Here, the first preset heat value range corresponds to evictable cached data with relatively high access heat; the specific range can be set freely by the user;
In particular, the first preset heat values are greater than the second preset heat values.
S3043. If the access heat of the evictable cached data falls within a second preset heat value range, an HDD address is read and set as the migration address;
Here, the second preset heat value range corresponds to evictable cached data with relatively low access heat; the specific range can be set freely by the user.
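The selection in S3041 to S3043 amounts to a range lookup, sketched below. The directory constants `SSD_DIR` and `HDD_DIR` and the representation of each preset range as an inclusive `(low, high)` pair are assumptions for illustration; the text leaves both the ranges and the address format to the user.

```python
from typing import Tuple

# Hypothetical mount points for the two tiers of the hybrid storage system.
SSD_DIR = "/mnt/ssd/spark-evicted"
HDD_DIR = "/mnt/hdd/spark-evicted"


def choose_migration_address(heat: int,
                             hot_range: Tuple[int, int],
                             cold_range: Tuple[int, int]) -> str:
    """Return the migration address for one evictable cached block."""
    hot_low, hot_high = hot_range
    cold_low, cold_high = cold_range
    # The first (hot) preset values must be greater than the second (cold) ones.
    if cold_high >= hot_low:
        raise ValueError("the first (hot) range must lie above the second (cold) range")
    if hot_low <= heat <= hot_high:       # S3042: high access heat goes to the SSD
        return SSD_DIR
    if cold_low <= heat <= cold_high:     # S3043: low access heat goes to the HDD
        return HDD_DIR
    raise ValueError(f"heat {heat} is outside both preset ranges")
```

Raising an error for a heat value outside both ranges is a design choice of this sketch; the text does not say how such a value should be handled.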
Referring to FIG. 5, FIG. 5 is a schematic flowchart of the refinement of the data migration step in step S103 of the Spark distributed computing data processing method in an embodiment of the present invention. The refinement includes:
S401. After receiving the migration information and the migration command for the evictable cached data in the memory storage area, the cache data migration unit stores the evictable data to the SSD or HDD according to the migration information;
Further, after receiving the migration information and the migration command, the cache data migration unit first reads the cached data in the specified memory storage area and releases the corresponding memory space, and then stores that data to the SSD or HDD at the migration address;
Here, the migration information for the evictable data specifically includes: the address of the evictable cached data in the memory storage area, the size of the space the evictable cached data occupies, and the migration address.
S402. A signal indicating that migration of the evictable cached data is complete is sent to the eviction logic unit.
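The read, release, store, and notify sequence of S401 and S402 can be sketched with plain files. Everything here is a simplification under stated assumptions: a dictionary stands in for the memory storage area, a block id stands in for the memory address, the migration address is a directory on the SSD or HDD, and `notify` is a hypothetical callback to the eviction logic unit.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class MigrationInfo:
    """Migration information passed to the cache data migration unit."""
    block_id: str     # stands in for the address of the evictable cached data
    size: int         # size of the space the evictable cached data occupies
    target_dir: str   # migration address on the SSD or HDD


def migrate_block(memory: Dict[str, bytes], info: MigrationInfo,
                  notify: Callable[[str], None]) -> str:
    """Move one cached block out of memory and return the path it was written to."""
    # S401: read the cached data and release its memory space.
    data = memory.pop(info.block_id)
    # S401: store the data at the migration address.
    os.makedirs(info.target_dir, exist_ok=True)
    path = os.path.join(info.target_dir, info.block_id)
    with open(path, "wb") as f:
        f.write(data)
    # S402: report completion to the eviction logic unit.
    notify(info.block_id)
    return path
```

A production version would need to handle a failed write before discarding the in-memory copy; the sketch follows the order given in the text and omits that error handling.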
Referring to FIG. 6, FIG. 6 is a schematic flowchart of the refinement of the step of modifying the data persistence level in step S103 of the Spark distributed computing data processing method in an embodiment of the present invention. The refinement includes:
S501. The type of the migration address of the evictable cached data in the memory storage area is determined.
S502. If the migration address of the evictable cached data is on the SSD, the persistence level of the evictable cached data is changed to SSD_ONLY.
S503. If the migration address of the evictable cached data is on the HDD, the persistence level of the evictable cached data is changed to HDD_ONLY.
S504. Once the modification is complete, an eviction success signal for the evictable cached data and the migration information for the evictable data are fed back, so that the RDD partition data can enter the memory storage area and the storage task is completed.
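The persistence level update in S501 to S503 can be sketched with a plain enumeration. `SSD_ONLY` and `HDD_ONLY` are the level names used in the text; they are modelled here as a standalone `PersistenceLevel` enum, which is an assumption and is not Spark's own `StorageLevel` class. The `MEMORY_ONLY` starting level and the `"SSD"`/`"HDD"` tier strings are likewise illustrative.

```python
from enum import Enum
from typing import Dict


class PersistenceLevel(Enum):
    """Hypothetical persistence levels; SSD_ONLY and HDD_ONLY come from the text."""
    MEMORY_ONLY = "MEMORY_ONLY"
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


def update_persistence_level(levels: Dict[str, PersistenceLevel],
                             block_id: str, target_tier: str) -> PersistenceLevel:
    """Set the level of an evicted block from the tier of its migration address."""
    if target_tier == "SSD":          # S502
        level = PersistenceLevel.SSD_ONLY
    elif target_tier == "HDD":        # S503
        level = PersistenceLevel.HDD_ONLY
    else:
        raise ValueError(f"unknown migration tier: {target_tier!r}")
    levels[block_id] = level
    return level
```

Recording the tier in the persistence level is what lets a later read go straight to the SSD or HDD copy instead of looking for the block in memory.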
Referring to FIG. 7, FIG. 7 is a schematic diagram of the functional modules of the Spark distributed computing data processing system in an embodiment of the present invention. The system includes:
a storage request module 601, configured to, when a storage task is performed on partition data of a resilient distributed dataset (RDD) that the user has marked for caching, send to the eviction logic unit a command to evict the cached data in the memory storage area if the request for space in Spark's memory storage area fails;
an address calculation module 602, configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction meets the storage task's requirement for memory storage area space, set a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
a data migration module 603, configured to read and release the evictable cached data in the memory storage area, migrate the evictable cached data to the migration address, modify the persistence level of the evictable cached data, and feed back an eviction success signal and eviction information.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the refinement functional modules of the storage request module 601 of the Spark distributed computing data processing system in an embodiment of the present invention. The refinement functional modules include:
a first request module 6011, configured to calculate the size of the memory storage area space that the storage task on the RDD partition data will occupy, request space from Spark's memory storage area, and compare the required size with the unoccupied space of the memory storage area;
a first feedback module 6012, configured to, if the size of the memory storage area space required by the storage task is greater than the unoccupied space of the memory storage area so that the request for space in Spark's memory storage area fails, send to the eviction logic unit a command to evict the evictable cached data in the memory storage area, together with the size of the memory storage area space that the storage task requires.
Referring to FIG. 9, FIG. 9 is a schematic diagram of the refinement functional modules of the address calculation module 602 of the Spark distributed computing data processing system in an embodiment of the present invention. The refinement functional modules include:
a second request module 6021, configured so that the eviction logic unit, on receiving the eviction command, sends the memory storage area a request to evict space because the storage space needed to perform the storage task on the RDD partition data is insufficient, and, if the request succeeds, the size of the evictable space in the memory storage area is calculated according to the least recently used (LRU) policy;
a migration address setting module 6022, configured to, if the unoccupied space of the memory storage area after eviction is greater than or equal to the space required to perform the storage task on the RDD partition data, set a migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area, and send the migration information and the migration command for the evictable cached data to the cache data migration unit;
a second feedback module 6023, configured to, if the unoccupied space of the memory storage area after eviction is smaller than the space required to perform the storage task on the RDD partition data, terminate the migration task for the evictable cached data and feed back a signal indicating that eviction of the evictable cached data failed;
an SSD migration address module 6024, configured to, if the access heat of the evictable cached data falls within the first preset heat value range, read an SSD address and set it as the migration address;
an HDD migration address module 6025, configured to, if the access heat of the evictable cached data falls within the second preset heat value range, read an HDD address and set it as the migration address.
Referring to FIG. 10, FIG. 10 is a schematic diagram of the refined functional modules of the data migration module 603 of the Spark distributed computing data processing system in an embodiment of the present invention. The refined functional modules include:
The third feedback module 6031 is configured to send, to the eviction logic unit, a signal indicating that migration of the evictable cached data in the memory storage area is complete;
The SSD persistence level module 6032 is configured to change the persistence level of the evictable cached data in the memory storage area to SSD_ONLY if the migration address of that data is the SSD;
The HDD persistence level module 6033 is configured to change the persistence level of the evictable cached data in the memory storage area to HDD_ONLY if the migration address of that data is the HDD;
The fourth feedback module 6034 is configured to feed back a signal indicating that the evictable cached data in the memory storage area was evicted successfully, together with the migration information of the evictable data, so that the RDD partition data can enter the memory storage area and the storage task can complete.
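The persistence level update and the final feedback described for modules 6032 to 6034 can be sketched as follows. This is an illustrative sketch only: the `EvictedBlock` record, the `finish_eviction` function, and the shape of the feedback dictionary are hypothetical, and the starting level `MEMORY_ONLY` is an assumption. Only the level names `SSD_ONLY` and `HDD_ONLY` come from the source.

```python
from dataclasses import dataclass


@dataclass
class EvictedBlock:
    """Hypothetical record for one block of evictable cached data."""
    block_id: str
    size_bytes: int
    migration_target: str                    # "SSD" or "HDD"
    persistence_level: str = "MEMORY_ONLY"   # assumed level before eviction


def finish_eviction(block: EvictedBlock) -> dict:
    """Update the persistence level after migration and build the feedback."""
    levels = {"SSD": "SSD_ONLY", "HDD": "HDD_ONLY"}  # level names from the source
    if block.migration_target not in levels:
        raise ValueError(f"unknown migration target: {block.migration_target!r}")
    block.persistence_level = levels[block.migration_target]
    # Eviction success signal plus migration information, so the caller can
    # admit the new RDD partition data into the memory storage area.
    return {
        "evicted": True,
        "block_id": block.block_id,
        "size_bytes": block.size_bytes,
        "migration_target": block.migration_target,
        "persistence_level": block.persistence_level,
    }
```

Raising an error for an unknown target is a design choice of this sketch; the source does not say what happens when the migration address is neither SSD nor HDD.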
In this embodiment of the present invention, a hybrid storage system is built from an SSD and an HDD, and an eviction logic unit and a cache data migration unit are provided. Partition data is migrated flexibly to the SSD or the HDD according to its access heat, rather than migrating cached intermediate data directly to disk or discarding cached data. This effectively relieves the pressure that arises when the large storage demand of caching Spark partition data meets insufficient memory space. In addition, when partition data is later requested, the high-speed read/write performance of the hybrid storage system, together with the separate storage of partition data by access heat, allows partition data of different access heat to be read quickly from the hybrid storage system, which improves Spark performance.
In the several embodiments provided in this application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple modules or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or of another form.
Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module.
If the integrated module is implemented as a software functional module and sold or used as a standalone product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or concurrently. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
The foregoing describes the Spark distributed computing data processing method and system provided by the present invention. Those skilled in the art may, following the idea of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A Spark distributed computing data processing method, characterized in that the method comprises:
    when performing a storage task on partition data of a resilient distributed dataset (RDD) that a user has marked for caching, if a request for space in Spark's memory storage area fails, sending to an eviction logic unit a command to evict evictable cached data from the memory storage area;
    calculating the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for space in the memory storage area, setting a migration address in a hybrid storage system based on a solid-state drive (SSD) and a hard disk drive (HDD) according to the access heat of the evictable cached data in the memory storage area;
    reading and releasing the evictable cached data in the memory storage area, migrating the evictable cached data to the migration address, modifying the persistence level of the evictable cached data, and feeding back an eviction success signal and eviction information.
  2. The method according to claim 1, characterized in that, if the request for space in Spark's memory storage area fails, sending to the eviction logic unit the command to evict evictable cached data from the memory storage area specifically comprises:
    calculating the amount of memory storage area space that the storage task on the RDD partition data will occupy, requesting space from Spark's memory storage area, and comparing the amount of space the storage task will occupy with the unoccupied space of the memory storage area; if the amount of space the storage task will occupy is greater than the unoccupied space of the memory storage area, the request for space in Spark's memory storage area fails, and the command to evict evictable cached data from the memory storage area is sent to the eviction logic unit together with the amount of memory storage area space that the storage task needs.
  3. The method according to claim 1, characterized in that calculating the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for space in the memory storage area, setting the migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data specifically comprises:
    the eviction logic unit receiving the eviction command and sending to the memory storage area a request to evict space because the storage space needed by the storage task on the RDD partition data is insufficient; if the request succeeds, calculating the size of the evictable space in the memory storage area according to a least recently used (LRU) policy;
    if the size of the evictable space in the memory storage area is greater than or equal to the amount of space that the storage task on the RDD partition data needs, setting the migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area, and sending migration information for the evictable cached data and a migration command for the evictable cached data to a cache data migration unit;
    if the size of the evictable space in the memory storage area is less than the amount of space that the storage task on the RDD partition data needs, terminating the migration task for the evictable cached data in the memory storage area and feeding back a signal indicating that eviction of the evictable cached data failed.
  4. The method according to claim 3, characterized in that setting the migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area specifically comprises:
    if the access heat of the evictable cached data in the memory storage area is within a first preset heat value range, reading the SSD address and setting the SSD address that was read as the migration address;
    if the access heat of the evictable cached data in the memory storage area is within a second preset heat value range, reading the HDD address and setting the HDD address that was read as the migration address;
    wherein the first preset heat value is greater than the second preset heat value.
  5. The method according to claim 1, characterized in that reading and releasing the evictable cached data in the memory storage area and migrating the evictable cached data to the migration address specifically comprises:
    after the cache data migration unit receives the migration information for the evictable cached data and the migration command for the evictable cached data, storing the evictable data of the memory storage area to the SSD or the HDD according to the migration information, and sending to the eviction logic unit a signal indicating that migration of the evictable cached data is complete;
    wherein the migration information for the evictable data of the memory storage area specifically comprises: the address of the evictable cached data in the memory storage area, the size of the space occupied by the evictable cached data, and the migration address.
  6. The method according to claim 1, characterized in that modifying the persistence level of the evictable cached data in the memory storage area and feeding back the eviction success signal and the eviction information specifically comprises:
    if the migration address of the evictable cached data in the memory storage area is the SSD, changing the persistence level of the evictable cached data to SSD_ONLY;
    if the migration address of the evictable cached data in the memory storage area is the HDD, changing the persistence level of the evictable cached data to HDD_ONLY;
    once the modification is complete, feeding back a signal indicating that the evictable cached data was evicted successfully, together with the migration information of the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.
  7. A Spark distributed computing data processing system, characterized in that the system comprises:
    a storage request module, configured to, when performing a storage task on partition data of a resilient distributed dataset (RDD) that a user has marked for caching, send to an eviction logic unit a command to evict evictable cached data from Spark's memory storage area if a request for space in the memory storage area fails;
    an address calculation module, configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for space in the memory storage area, set a migration address in a hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area;
    a data migration module, configured to read and release the evictable cached data in the memory storage area, migrate the evictable cached data to the migration address, modify the persistence level of the evictable cached data, and feed back an eviction success signal and eviction information.
  8. The system according to claim 7, characterized in that the storage request module comprises:
    a first request module, configured to calculate the amount of memory storage area space that the storage task on the RDD partition data will occupy, request space from Spark's memory storage area, and compare that amount with the unoccupied space of the memory storage area;
    a first feedback module, configured such that, if the amount of memory storage area space the storage task will occupy is greater than the unoccupied space of the memory storage area, the request for space in Spark's memory storage area fails, and the command to evict evictable cached data from the memory storage area is sent to the eviction logic unit together with the amount of memory storage area space that the storage task needs.
  9. The system according to claim 7, characterized in that the address calculation module comprises:
    a second request module, configured such that the eviction logic unit receives the eviction command and sends to the memory storage area a request to evict space because the storage space needed by the storage task on the RDD partition data is insufficient, and, if the request succeeds, the size of the evictable space in the memory storage area is calculated according to a least recently used (LRU) policy;
    a migration address setting module, configured to, if the unoccupied space of the memory storage area after eviction is greater than or equal to the amount of space that the storage task on the RDD partition data needs, set the migration address in the hybrid storage system based on SSD and HDD according to the access heat of the evictable cached data in the memory storage area, and send migration information for the evictable cached data and a migration command for the evictable cached data to a cache data migration unit;
    a second feedback module, configured to, if the unoccupied space of the memory storage area after eviction is less than the amount of space that the storage task on the RDD partition data needs, terminate the migration task for the evictable cached data in the memory storage area and feed back a signal indicating that eviction of the evictable cached data failed;
    an SSD migration address module, configured to read the SSD address and set the SSD address that was read as the migration address if the access heat of the evictable cached data in the memory storage area is within a first preset heat value range;
    an HDD migration address module, configured to read the HDD address and set the HDD address that was read as the migration address if the access heat of the evictable cached data in the memory storage area is within a second preset heat value range.
  10. The system according to claim 7, characterized in that the data migration module comprises:
    a data migration module, configured such that, after the cache data migration unit receives the migration information for the evictable cached data and the migration command for the evictable cached data, the evictable data of the memory storage area is stored to the SSD or the HDD according to the migration information;
    a third feedback module, configured to send to the eviction logic unit a signal indicating that migration of the evictable cached data in the memory storage area is complete;
    an SSD persistence level module, configured to change the persistence level of the evictable cached data in the memory storage area to SSD_ONLY if the migration address of that data is the SSD;
    an HDD persistence level module, configured to change the persistence level of the evictable cached data in the memory storage area to HDD_ONLY if the migration address of that data is the HDD;
    a fourth feedback module, configured to feed back a signal indicating that the evictable cached data in the memory storage area was evicted successfully, together with the migration information of the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.
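The space check and LRU-based eviction planning recited in the claims can be sketched as follows. This is a hedged illustration, not the claimed implementation: the function `plan_eviction`, the use of an `OrderedDict` as the LRU list, and the block identifiers are hypothetical. The sketch also assumes that the test for success is whether unoccupied space plus reclaimed space covers the request, which is one reading of "space available after eviction".

```python
from collections import OrderedDict
from typing import List, Optional


def plan_eviction(
    cached_blocks: "OrderedDict[str, int]",
    free_bytes: int,
    required_bytes: int,
) -> Optional[List[str]]:
    """Decide which cached blocks to evict so a storage task can proceed.

    cached_blocks maps block id -> size in bytes, ordered from least to most
    recently used. Returns [] when the space request succeeds without
    eviction, a list of block ids to evict when eviction can satisfy the
    task, or None when even evicting everything is not enough (the
    eviction-failed case).
    """
    if required_bytes <= free_bytes:
        return []  # request for space succeeds, nothing to evict
    victims: List[str] = []
    available = free_bytes
    for block_id, size in cached_blocks.items():  # least recently used first
        victims.append(block_id)
        available += size
        if available >= required_bytes:
            return victims
    return None  # evictable space is too small: terminate and report failure
```

A short usage example with made-up sizes:

```python
cache = OrderedDict([("rdd_0_0", 40), ("rdd_0_1", 30), ("rdd_1_0", 50)])
plan_eviction(cache, free_bytes=20, required_bytes=80)   # evicts the two oldest blocks
plan_eviction(cache, free_bytes=0, required_bytes=500)   # None: eviction cannot help
```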
PCT/CN2017/099083 2017-08-25 2017-08-25 Spark distributed computing data processing method and system WO2019037093A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/099083 WO2019037093A1 (en) 2017-08-25 2017-08-25 Spark distributed computing data processing method and system

Publications (1)

Publication Number Publication Date
WO2019037093A1 true WO2019037093A1 (en) 2019-02-28

Family

ID=65438348

Country Status (1)

Country Link
WO (1) WO2019037093A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101907978A (en) * 2010-07-27 2010-12-08 浙江大学 Mixed storage system and storage method based on solid state disk and magnetic hard disk
US20110191556A1 (en) * 2010-02-01 2011-08-04 International Business Machines Corporation Optimization of data migration between storage mediums
CN102831088A (en) * 2012-07-27 2012-12-19 国家超级计算深圳中心(深圳云计算中心) Data migration method and device based on mixing memory
CN103186350A (en) * 2011-12-31 2013-07-03 北京快网科技有限公司 Hybrid storage system and hot spot data block migration method
CN103631730A (en) * 2013-11-01 2014-03-12 深圳清华大学研究院 Caching optimizing method of internal storage calculation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU , KEZHONG ET AL.: "Design of RDD Persistence Method in Spark for SSDs", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, vol. 54, no. 6, 30 June 2017 (2017-06-30), pages 1382, XP055578521 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947778A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 A kind of Spark storage method and system
CN115145841A (en) * 2022-07-18 2022-10-04 河南大学 Method for reducing memory contention applied to Spark computing platform
CN115145841B (en) * 2022-07-18 2023-05-12 河南大学 Method for reducing memory contention applied to Spark computing platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17922440

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17922440

Country of ref document: EP

Kind code of ref document: A1