WO2022218218A1 - Data processing method, apparatus, reduction server and mapping server - Google Patents
Data processing method, apparatus, reduction server and mapping server
- Publication number
- WO2022218218A1 (PCT/CN2022/085771; CN2022085771W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- memory
- server
- mapping
- storage area
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0608—Saving storage space on storage systems
- G06F3/064—Management of blocks
- G06F3/0643—Management of files
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/134—Distributed indices
- G06F16/137—Hash-based
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
Definitions
- the present application relates to the field of computer technologies, and in particular, to a data processing method, device, reduction server, and mapping server.
- to process and analyze massive data, a distributed high-concurrency computing framework is usually used: the data to be processed is divided into several data blocks, and computations are performed concurrently on different computing nodes. Since the entire data processing process may be divided into several steps, when the input data of one step comes from the operation results of multiple computing nodes in the previous step, a large amount of data must be transmitted between computing nodes. However, because the memory capacity of a single computing node is limited and the network between computing nodes has a large transmission delay and a small bandwidth, the data transmission efficiency between computing nodes is low.
- an embodiment of the present application provides a data processing method; the method is applied to a reduction server in a distributed processing system, and the distributed processing system includes a plurality of mapping servers and a plurality of reduction servers,
- where the memory of the plurality of mapping servers and the memory of the plurality of reduction servers constitute a global memory.
- the method includes: obtaining metadata of first data to be read from a preset first storage area; determining, according to the metadata, a first address of the first data in the global memory; and reading the first data from the global memory according to the first address, where the first data includes a target data block among a plurality of data blocks of second data, and the second data includes the processing result of the input data by the corresponding mapping server.
- in this way, the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines the first address of the first data in the global memory according to the metadata and reads the first data from the global memory according to the first address. When the reduction server reads input data (the first data) including target data blocks from the processing results of the multiple mapping servers, the target data blocks therefore do not need to be copied and transmitted; they are read directly from the global memory using memory semantics, so that the processing of the shuffle stage is not limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, and the processing efficiency and performance of the shuffle stage are improved, thereby improving the processing efficiency of the distributed processing system.
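- as a minimal sketch of this read path (not the claimed implementation), the following Python example models the reduce-side flow: obtain the metadata from the first storage area, resolve the first address, and read the target data blocks from global memory; the `BlockMetadata` layout, the `GlobalMemory` class and all names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class BlockMetadata:
    label: int            # preset label identifying the target reduce partition
    global_address: int   # first address of the data block in global memory
    length: int           # block size in bytes

class GlobalMemory:
    """Toy stand-in for the memory interconnected across servers."""
    def __init__(self, size: int):
        self.buf = bytearray(size)

    def read(self, address: int, length: int) -> bytes:
        return bytes(self.buf[address:address + length])

def shuffle_read(first_storage_area: list[BlockMetadata],
                 global_memory: GlobalMemory,
                 my_label: int) -> list[bytes]:
    """Reduce-side flow: obtain metadata, resolve each block's first address,
    and read the target data blocks directly from global memory."""
    blocks = []
    for meta in first_storage_area:
        if meta.label != my_label:          # keep only blocks destined for this reducer
            continue
        blocks.append(global_memory.read(meta.global_address, meta.length))
    return blocks
```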
- the reading of the first data from the global memory according to the first address includes: when the first address is outside the access range of the reduction server, mapping the first address to a second address, where the second address is within the access range of the reduction server; and reading the first data from the global memory according to the second address.
- in this way, the reduction server can read first data located at a remote end from the global memory.
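- the address mapping itself is not specified further here; the sketch below assumes a hypothetical page-granular window table purely to illustrate mapping a first address that is outside the reduction server's access range onto a second address inside it.

```python
def map_to_local_address(first_address: int,
                         access_range: range,
                         window_table: dict[int, int],
                         page_size: int = 1 << 21) -> int:
    """Map a global-memory address lying outside this server's access range
    onto a second address inside it.  The page-granular window_table
    (global page number -> locally mapped page number) is an assumed
    mechanism; the method above only states that such a mapping happens."""
    if first_address in access_range:
        return first_address                # already locally accessible
    page, offset = divmod(first_address, page_size)
    return window_table[page] * page_size + offset
```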
- the method further includes: after the reduction server is connected to the distributed processing system, the reduction server performs registration through a preset registration instruction, so that the memory of the reduction server is added to the global memory.
- the memory of the reduction server added to the distributed processing system can be managed uniformly, thereby realizing the management of the global memory.
- an embodiment of the present application provides a data processing method, and the method is applied to a mapping server in a distributed processing system.
- the distributed processing system includes multiple mapping servers and multiple reduction servers.
- the memory of the plurality of mapping servers and the memory of the plurality of reduction servers constitute a global memory, and the method includes: processing input data to obtain second data; according to preset labels, dividing the second data into a plurality of data blocks; storing the plurality of data blocks in a second storage area, and the second storage area is located in the global memory.
- in this way, the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffle stage. This not only avoids slow disk reads and writes, but also keeps the processing of the shuffle stage from being limited by the memory capacity of the mapping server, thereby improving the processing efficiency and performance of the shuffle stage.
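- the three steps of the map-side method (process, divide by label, store in the second storage area) can be sketched as follows; the `label_of` function and the dict standing in for the second storage area are assumptions for illustration only.

```python
from typing import Callable

def map_side_shuffle_write(input_data: list[str],
                           label_of: Callable[[str], int],
                           second_storage: dict[int, list[bytes]]) -> None:
    """Map-side flow: process the input data into second data, divide it by
    preset label, and store the pieces in the second storage area (a plain
    dict standing in for the area located in global memory)."""
    # step 1: process the input data to obtain the second data
    second_data = [record.strip().lower() for record in input_data]
    # steps 2 and 3: divide by label and store each piece per label
    for record in second_data:
        label = label_of(record)
        second_storage.setdefault(label, []).append(record.encode())
```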
- the storing of the plurality of data blocks in the second storage area includes: when the data in the plurality of data blocks needs to be sorted, dividing the second storage area into multiple sub-areas according to a preset second size; storing the multiple data blocks in the multiple sub-areas in the order of the sub-areas; and, while the multiple data blocks are being stored in the multiple sub-areas in sequence, sorting the data in all sub-areas that have already been stored by updating an ordered index linked list, where the ordered index linked list performs sorting by linking the position indexes of the data in the linked list.
- in this way, data writing and sorting are performed in an asynchronous pipeline manner, and an ordered index linked list is used during sorting, which not only allows sorting while writing (direct sorting during the write) but also eliminates the separate
- data-copy step during sorting, reducing memory usage and thereby improving the efficiency of shuffle write in the shuffle stage.
- writing and sorting are also combined into one step, reducing processing steps and improving the processing efficiency of the shuffle stage.
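- a rough illustration of sorting while writing: records are appended to the sub-areas in arrival order and only an index of their positions is kept ordered, so no separate copy of the data is made during sorting; the sketch approximates the ordered index linked list with a sorted array of position indexes for brevity.

```python
import bisect

class OrderedIndexWriter:
    """Sort-while-writing sketch: records are appended to the sub-areas in
    arrival order; only an index of positions is kept ordered, so the data
    itself is never copied during sorting."""
    def __init__(self):
        self.storage = []        # stands in for the sub-areas of the second storage area
        self.sorted_keys = []    # keys kept in order
        self.sorted_pos = []     # position indexes, ordered in step with the keys

    def write(self, key, value) -> None:
        pos = len(self.storage)
        self.storage.append((key, value))              # write once, no copy
        i = bisect.bisect_left(self.sorted_keys, key)  # update the ordered index
        self.sorted_keys.insert(i, key)
        self.sorted_pos.insert(i, pos)

    def iter_sorted(self):
        for pos in self.sorted_pos:
            yield self.storage[pos]
```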
- a first operation process runs on the mapping server, and the first operation process includes at least one first operator for processing the input data;
- the method further includes: in the initialization stage of the first operation process, applying to the global memory for second storage areas according to the number of processor cores of the mapping server, so that each processor core corresponds to one second storage area, where at least one first operator runs on each processor core.
- in this way, second storage areas can be applied for in the global memory so that each processor core corresponds to one second
- storage area, where at least one first operator runs on each processor core. The at least one operator running on the same processor core can thus be regarded as one shuffle writer, and storage space is allocated in the global memory for that shuffle writer, so that data with the same label in the processing results of the operators running on the same processor core is stored in the same area of the global memory. This realizes processor-core-based aggregation of data, reduces data scattering, and thereby improves the efficiency of data reading.
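- a sketch of this initialization step, assuming a hypothetical `allocate_from_global_memory` allocator exposed by the global-memory management layer:

```python
import os
from typing import Callable

def init_second_storage_areas(allocate_from_global_memory: Callable[[int], object],
                              area_size: int) -> dict[int, object]:
    """Initialization-stage sketch: apply to global memory for one second
    storage area per processor core, so all first operators running on the
    same core share one shuffle-write area."""
    num_cores = os.cpu_count() or 1
    return {core: allocate_from_global_memory(area_size) for core in range(num_cores)}
```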
- the dividing of the second data into a plurality of data blocks according to the preset labels includes: dividing the second data into a plurality of data blocks by hashing according to the preset labels.
- the second data is divided into a plurality of data blocks by hashing, so that sorting is not required before the second data is divided into blocks, so that the processing efficiency of the second data blocks can be improved.
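- for example, hash-based division might look like the following sketch, where the number of labels and the use of Python's built-in `hash` are illustrative choices:

```python
def divide_by_label_hash(second_data: list[tuple[str, int]],
                         num_labels: int) -> dict[int, list[tuple[str, int]]]:
    """Hash-based division: each key is hashed to a preset label, so the
    second data is split into blocks without any up-front sorting."""
    blocks: dict[int, list[tuple[str, int]]] = {i: [] for i in range(num_labels)}
    for key, value in second_data:
        blocks[hash(key) % num_labels].append((key, value))
    return blocks
```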
- the storing of the plurality of data blocks in the second storage area includes: determining a third address of the second storage area; when the third address is outside the access range of the mapping server, mapping the third address to a fourth address, where the fourth address is within the access range of the mapping server; and storing the plurality of data blocks in the second storage area according to the fourth address.
- the mapping server can access the remote second storage area.
- the method further includes: determining metadata of the multiple data blocks; storing the metadata of the multiple data blocks in a preset first storage area.
- the reduction server can obtain the metadata of the data to be read from the first storage area, such as tags, storage address, etc.
- the method further includes: after the mapping server is connected to the distributed processing system, the mapping server performs registration through a preset registration instruction, so that the memory of the mapping server is added to the global memory.
- the memory of the mapping server added to the distributed processing system can be managed uniformly, thereby realizing the management of the global memory.
- the method also includes:
- when a first memory satisfies a first condition, determine first target data from the data stored in the first memory, and store the first target data in an external storage area, where the first condition is that
- the used space of the first memory is greater than or equal to a first threshold, or the ratio of the used space of the first memory to the total space of the first memory is greater than or equal to a second threshold, and the first memory is the global memory or part of the global memory.
- in this way, part of the data in the first memory can be moved to the external storage area to free up space for data to be stored, avoiding the situation in which the first memory cannot hold a large amount of data and the application consequently fails to run or runs inefficiently.
- the method also includes:
- when the first memory satisfies a second condition, second target data is determined from the data stored in the external storage area, and the second target data is stored in the first memory, where the second condition is that
- the used space of the first memory is less than or equal to a third threshold, or the ratio of the used space of the first memory to the total space of the first memory is less than or equal to a fourth threshold.
- in this way, through the management of the first memory by the memory management device, when the available storage space of the first memory becomes larger, the data stored in the external storage area can be retrieved to the first memory, so that a reduction server that needs to read
- that part of the data can read it from the global memory instead of the external storage area, which improves the data reading efficiency.
- the external storage area includes at least one of the following: a hard disk drive (HDD), a solid state disk (SSD).
- the external storage area includes HDD and/or SSD, which can store data persistently.
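- the following sketch illustrates the two threshold conditions with in-memory dicts standing in for the first memory and the external storage area; the threshold ratios and the victim-selection policy are assumptions, not the claimed policy:

```python
def manage_first_memory(first_memory: dict[str, bytes],
                        external_storage: dict[str, bytes],
                        total_bytes: int,
                        high_ratio: float = 0.9,
                        low_ratio: float = 0.5) -> None:
    """Move data out to the external storage area (HDD/SSD) when the first
    memory is too full, and retrieve it when enough space has been freed."""
    used = sum(len(v) for v in first_memory.values())
    ratio = used / total_bytes
    if ratio >= high_ratio and first_memory:
        key = next(iter(first_memory))           # victim choice is illustrative
        external_storage[key] = first_memory.pop(key)
    elif ratio <= low_ratio and external_storage:
        key = next(iter(external_storage))
        first_memory[key] = external_storage.pop(key)
```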
- embodiments of the present application provide a reduction server; the reduction server is applied to a distributed processing system, the distributed processing system includes multiple mapping servers and multiple reduction servers, and the memory
- of the multiple mapping servers and the memory of the multiple reduction servers constitute a global memory. The reduction server includes: a metadata reading module, configured to obtain, from a preset first storage area, metadata of first data to be read; an address determination module, configured to determine a first address of the first data in the global memory according to the metadata; and a data reading module, configured to read the first data from the global memory according to the first address, where the first data includes a target data block among a plurality of data blocks of second data, and the second data includes the processing result of the input data by the corresponding mapping server.
- in this way, the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines the first address of the first data in the global memory according to the metadata and reads the first data from the global memory according to the first address. When the reduction server reads input data (the first data) including target data blocks from the processing results of the multiple mapping servers, the target data blocks therefore do not need to be copied and transmitted; they are read directly from the global memory using memory semantics, so that the processing of the shuffle stage is not limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, and the processing efficiency and performance of the shuffle stage are improved, thereby improving the processing efficiency of the distributed processing system.
- the data reading module is configured to: when the first address is outside the access range of the reduction server, map the first address to a second address, where the second address is within the access range of the reduction server; and read the first data from the global memory according to the second address.
- the reduction server can read the first data located at the remote end from the global memory.
- the reduction server further includes: a first registration module, configured to register the reduction server through a preset registration instruction after the reduction server is connected to the distributed processing system, so that the memory of the reduction server is added to the global memory.
- the memory of the reduction server added to the distributed processing system can be managed uniformly, thereby realizing the management of the global memory.
- embodiments of the present application further provide a mapping server; the mapping server is applied to a distributed processing system, the distributed processing system includes multiple mapping servers and multiple reduction servers, and the memory of the multiple mapping
- servers and the memory of the multiple reduction servers constitute a global memory.
- the mapping server includes: a data processing module, configured to process input data to obtain second data; a data division module, configured to divide the second data into multiple data blocks according to preset labels; and a data storage module, configured to store the multiple data blocks in a second storage area, where the second storage area is located in the global memory.
- in this way, the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffle stage. This not only avoids slow disk reads and writes, but also keeps the processing of the shuffle stage from being limited by the memory capacity of the mapping server, thereby improving the processing efficiency and performance of the shuffle stage.
- the data storage module is configured to: when the data in the multiple data blocks needs to be sorted, divide the second storage area into multiple sub-areas according to a preset second size; store the multiple data blocks in the multiple sub-areas in the order of the sub-areas; and, while the multiple data blocks are being stored in the multiple sub-areas in sequence, sort the data in all sub-areas that have already been stored by updating an ordered index linked list, where the ordered index linked list performs sorting by linking the position indexes of the data.
- in this way, data writing and sorting are performed in an asynchronous pipeline manner, and an ordered index linked list is used during sorting, which not only allows sorting while writing (direct sorting during the write) but also eliminates the separate
- data-copy step during sorting, reducing memory usage and thereby improving the efficiency of shuffle write in the shuffle stage.
- writing and sorting are also combined into one step, reducing processing steps and improving the processing efficiency of the shuffle stage.
- the mapping server further includes: an initialization module, configured to apply to the global memory for second storage areas in the initialization stage of a first operation process, according to the number of processor cores of the mapping server, so that each processor core corresponds to one second storage area, where the first operation process runs on the mapping server and is used to process the input data, each processor core runs at least one first operator, and the first operator is used to process the input data.
- in this way, second storage areas can be applied for in the global memory so that each processor core corresponds to one second storage
- area, where at least one first operator runs on each processor core. The at least one operator running on the same processor core can thus be regarded as one shuffle writer, and storage space is allocated in the global memory for that shuffle writer, so that data with the same label in the processing results of the operators running on the same processor core is stored in the same area of the global memory. This realizes processor-core-based aggregation of data, reduces data scattering, and thereby improves the efficiency of data reading.
- the data division module is configured to: divide the second data into multiple data blocks by hashing according to the preset labels.
- the second data is divided into a plurality of data blocks by hashing, so that sorting is not required before the second data is divided into blocks, so that the processing efficiency of the second data blocks can be improved.
- the data storage module is configured to: determine a third address of the second storage area; when the third address is outside the access range of the mapping server, map the third address to a fourth address, where the fourth address is within the access range of the mapping server; and store the multiple data blocks in the second storage area according to the fourth address.
- the mapping server can access the remote second storage area.
- the mapping server further includes: a metadata determination module, configured to determine metadata of the multiple data blocks; and a metadata storage module, configured to store the metadata of the multiple data blocks in a preset first storage area.
- the reduction server can obtain the metadata of the data to be read from the first storage area, such as tags, storage address, etc.
- the mapping server further includes: a second registration module, configured to register the mapping server through a preset registration instruction after the mapping server is connected to the distributed processing system, so that the memory of the mapping server is added to the global memory.
- the memory of the mapping server added to the distributed processing system can be managed uniformly, thereby realizing the management of the global memory.
- the mapping server also includes:
- a memory management device, configured to determine first target data from the data stored in a first memory when the first memory satisfies a first condition, and store the first target data in an external storage area, where the first condition
- is that the used space of the first memory is greater than or equal to a first threshold, or the ratio of the used space of the first memory to the total space of the first memory is greater than or equal to a second threshold, and the first
- memory is the global memory or a portion of the global memory.
- in this way, part of the data in the first memory can be moved to the external storage area to free up space for data to be stored, avoiding the situation in which the first memory cannot hold a large amount of data and the application consequently fails to run or runs inefficiently.
- the memory management device is also used for:
- when the first memory satisfies a second condition, second target data is determined from the data stored in the external storage area, and the second target data is stored in the first memory, where the second condition is that
- the used space of the first memory is less than or equal to a third threshold, or the ratio of the used space of the first memory to the total space of the first memory is less than or equal to a fourth threshold.
- in this way, through the management of the first memory by the memory management device, when the available storage space of the first memory becomes larger, the data stored in the external storage area can be retrieved to the first memory, so that a reduction server that needs to read
- that part of the data can read it from the global memory instead of the external storage area, which improves the data reading efficiency.
- the external storage area includes at least one of the following: an HDD, an SSD.
- the external storage area includes HDD and/or SSD, which can store data persistently.
- embodiments of the present application provide a data processing apparatus, including a processor and a memory for storing instructions executable by the processor, where the processor is configured to, when executing the instructions, implement the data processing method of the above first aspect or of one or more possible implementations of the first aspect, or the data processing method of the above second aspect or of one or more possible implementations of the second aspect.
- in this way, the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffle stage. This not only avoids slow disk reads and writes, but also keeps the processing of the shuffle stage from being limited by the memory capacity of the mapping server, thereby improving the processing efficiency and performance of the shuffle stage.
- likewise, the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines the first address of the first data in the global memory according to the metadata and reads the first data from the global memory according to the first address. When the reduction server reads input data (the first data) including target data blocks from the processing results of the multiple mapping servers, the target data blocks therefore do not need to be copied and transmitted; they are read directly from the global memory using memory semantics, which not only keeps the processing of the shuffle stage from being limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, but also improves the processing efficiency and performance of the shuffle stage, thereby improving the processing efficiency of the distributed processing system.
- embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the data processing method of the above first aspect or of one or more possible implementations of the first aspect, or of the above second aspect or of one or more possible implementations of the second aspect, is implemented.
- in this way, the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffle stage. This not only avoids slow disk reads and writes, but also keeps the processing of the shuffle stage from being limited by the memory capacity of the mapping server, thereby improving the processing efficiency and performance of the shuffle stage.
- likewise, the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines the first address of the first data in the global memory according to the metadata and reads the first data from the global memory according to the first address. When the reduction server reads input data (the first data) including target data blocks from the processing results of the multiple mapping servers, the target data blocks therefore do not need to be copied and transmitted; they are read directly from the global memory using memory semantics, which not only keeps the processing of the shuffle stage from being limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, but also improves the processing efficiency and performance of the shuffle stage, thereby improving the processing efficiency of the distributed processing system.
- embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in an electronic
- device, the processor in the electronic device executes the data processing method of the first aspect or of one or more possible implementations of the first aspect, or of the second aspect or of one or more possible implementations of the second aspect.
- in this way, the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffle stage. This not only avoids slow disk reads and writes, but also keeps the processing of the shuffle stage from being limited by the memory capacity of the mapping server, thereby improving the processing efficiency and performance of the shuffle stage.
- likewise, the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines the first address of the first data in the global memory according to the metadata and reads the first data from the global memory according to the first address. When the reduction server reads input data (the first data) including target data blocks from the processing results of the multiple mapping servers, the target data blocks therefore do not need to be copied and transmitted; they are read directly from the global memory using memory semantics, which not only keeps the processing of the shuffle stage from being limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, but also improves the processing efficiency and performance of the shuffle stage, thereby improving the processing efficiency of the distributed processing system.
- FIG. 1 shows a schematic diagram of a map-reduce framework
- FIG. 2 shows a schematic diagram of a processing procedure in a shuffling stage
- FIG. 3 shows a schematic diagram of a distributed processing system according to an embodiment of the present application
- FIG. 4 shows a flowchart of a data processing method according to an embodiment of the present application
- FIG. 5 shows a schematic diagram of sorting during data writing in a data processing method according to an embodiment of the present application
- FIG. 6 shows a flowchart of a data processing method according to an embodiment of the present application
- FIG. 7 shows a flowchart of managing global memory by a memory management apparatus according to an embodiment of the present application
- FIG. 8 shows a schematic diagram of a software architecture of a data processing method according to an embodiment of the present application.
- FIG. 9 shows a schematic diagram of initialization of a computing process of a mapping server according to an embodiment of the present application.
- FIG. 10 shows a schematic diagram of a processing procedure of a data processing method according to an embodiment of the present application
- FIG. 11 shows a block diagram of a reduction server according to an embodiment of the present application.
- FIG. 12 shows a block diagram of a mapping server according to an embodiment of the present application.
- the processing and analysis of massive data usually adopt distributed high-concurrency computing frameworks, such as the Hadoop MapReduce (Hadoop MR) framework, Spark, and the like.
- distributed high-concurrency computing framework performs concurrent operations on the data to be processed through multiple computing nodes.
- Figure 1 shows a schematic diagram of a map-reduce framework.
- the data processing process of the map-reduce framework includes a map stage and a reduce stage, where the output of the operators in the map stage is converted into the input of the operators in the reduce stage;
- this conversion process is called the shuffle stage.
- the shuffle stage may include operations such as saving, chunking, copying/pulling, merging and sorting that turn the output data of the operators in the map stage into the input data of the reduce stage.
- the shuffling stage distributes the calculation results of the previous stage to the physical nodes used for calculating or storing the results in the next stage through shuffling.
- data 110 to be processed is stored on a Hadoop distributed file system (hadoop distributed file system, HDFS) 100 .
- in the map stage, data segmentation can be performed first: the to-be-processed data 110 is divided into 4 data blocks, data block A1, data block A2, data block A3 and data block A4; the 4 data blocks are then input into four mapping computing nodes (map1, map2, map3 and map4, which perform the same processing): data block A1 is input into map1, data block A2 into map2, data block A3 into map3, and data
- block A4 into map4, and the corresponding processing results are obtained; according to the labels 121, 122 and 123 corresponding to the three reduction computing nodes in the reduce stage, the processing result of each mapping computing node is divided into 3 data blocks and stored, completing the processing of the map stage.
- each reduction computing node copies/pulls data from the corresponding mapping computing nodes and processes it: the first reduction computing node (including sort1 and reduce1) pulls the data blocks labeled 121 from each mapping computing node, sorts the data in those blocks through sort1, and inputs the sorted data into reduce1 for processing to obtain the result data block B1; the second reduction computing node (including sort2 and reduce2) pulls the data blocks labeled 122 from each mapping computing
- node, sorts them through sort2, and inputs the sorted data into reduce2 to obtain the result data block B2; the third reduction computing node (including sort3 and reduce3) pulls the data blocks labeled 123 from each mapping computing node, sorts the data through sort3, and inputs the sorted data into reduce3 for processing to obtain the result data block B3.
- Data block B1, data block B2 and data block B3 are then stored on HDFS 130.
- the mapping computing nodes and the reduction computing nodes are connected through a network. Limited by memory capacity, the processing results of each mapping computing node need to be stored on the local disk, and a reduction computing node needs to read data from the disk of the corresponding mapping computing node and transmit it over the network to the local node for processing.
- FIG. 2 shows a schematic diagram of the processing procedure of a shuffling stage.
- the data to be processed is a data block 201 and a data block 202
- the data block 201 and the data block 202 are respectively input to the mapping computing node 211 and the mapping computing node 212 for processing.
- the mapping computing node 211 processes the data block 201 through operator 1 to obtain a processing result, and then performs shuffle write: the processing result of operator 1 is partitioned and stored in memory 1, and after memory 1 is full, a spill operation is performed to store the data in memory 1 on disk 1; this is repeated until the processing of the data block 201 is completed.
- the metadata describing the disk file information where the processing result of operator 1 is located is stored in the map output management (MapOutTracker) unit 221 .
- the mapping computing node 212 processes the data block 202 through operator 2 to obtain a processing result, and then performs shuffle write: the processing result of operator 2 is partitioned and stored in memory 2, and after memory 2 is full, a spill operation is performed to store the data in memory 2 on disk 2; this is repeated until the processing of the data block 202 is completed.
- metadata describing the information of the disk file where the processing result of operator 2 is located is also stored in the map output management unit 221 .
- operator 1 and operator 2 perform the same processing.
- when the reduction computing node 231 is running, it first obtains the metadata of the data to be read from the map output management unit 221 and, according to the metadata, performs shuffle read: it reads the corresponding data over the network from disk 1 of the mapping computing node 211 and disk 2 of the mapping computing node 212, and then processes the data through operator 3 to obtain an output result 241.
- in the above process, the intermediate data that needs to be transmitted (that is, the processing results of the mapping computing nodes) usually undergoes serialization, sorting, compression, writing to disk, network transmission, decompression, deserialization and other processing.
- the transmission control protocol (TCP) and the Internet Protocol (IP) are used.
- the data needs to be copied twice through the TCP/IP protocol stack: the central processing unit (CPU) of the mapping computing node copies the application-layer data across states (from user mode to kernel mode) into the TCP kernel send
- buffer, and the data is then sent to the reduction computing node through the network adapter (i.e., the network card); the CPU of the reduction computing node likewise copies the data received through the network adapter from the TCP kernel receive buffer across states (from kernel mode to user mode) up to the application layer.
- this double copy of data in the TCP/IP protocol stack consumes a lot of CPU time and results in a high absolute transmission delay (usually on the order of 10 ms), which in turn affects the data transmission efficiency.
- in one related approach, a compromise is made by caching some frequently used temporary data in memory. Although this accelerates some specific scenarios, the intermediate data of the shuffle stage still needs to be stored on disk, the acceleration provided by memory is limited, and the problem of limited memory capacity is not solved.
- in another related approach, remote direct memory access (Remote Direct Memory Access, RDMA) technology is used in the data transmission process of the shuffle stage.
- the data of the mapping computing node can directly reach the application layer of the reducing computing node through the RDMA network card, which eliminates the secondary copy of the data in the TCP/IP protocol stack, and reduces the time overhead and CPU usage.
- however, the data still needs to be copied across nodes, which increases memory usage (the memory occupied is twice the size of the intermediate data); the intermediate data of the shuffle stage is still stored in the form of files, and the data transmission is based on high-overhead input/output
- (IO) semantics, that is, there is file-system call overhead. Therefore, compared with memory access, data transmission between computing nodes through RDMA still has a relatively high absolute transmission delay.
- the present application provides a data processing method.
- the data processing method of the embodiments of the present application is applied to a distributed processing system. Based on the global memory formed by interconnecting the memory of multiple computing nodes in the distributed processing system, data reading and writing in the shuffle stage can be implemented through memory operations, so that intermediate data does not need to be copied and transmitted in the shuffle stage. The processing of the shuffle stage is therefore not limited by factors such as the memory capacity of a computing node, the physical bandwidth of the transmission network, or the transmission delay, and
- data in the shuffle stage can be read and written based on efficient memory semantics, improving the processing efficiency of the shuffle stage and thereby improving the efficiency with which the distributed processing system processes massive data.
- the distributed processing system may include a distributed system such as a server cluster, a data center, and the like for processing massive data. This application does not limit the specific type of the distributed processing system.
- the distributed processing system may include multiple mapping servers and multiple reduction servers.
- the plurality of map servers and the plurality of reduction servers are used for data processing.
- the distributed processing system may include at least one shuffling stage.
- the input data of each reduction server comes from the output data of multiple mapping servers.
- Multiple mapping servers can be regarded as the front end
- multiple reduction servers can be regarded as the back end
- the output data of the front end can be regarded as the input data of the back end.
- the memory of the multiple mapping servers and the memory of the multiple reduction servers in the distributed processing system can be connected through memory interconnection methods such as a system bus, a peripheral component interconnect express (PCIe) bus, a Gen-Z bus, or RDMA, so that the memory of the multiple mapping servers and the memory of the multiple reduction servers constitute a global memory.
- the mapping server can be registered through a preset registration instruction, so that the memory of the mapping server is added to the global memory.
- the preset registration instruction is the register instruction
- after the mapping server is connected to the distributed processing system and its memory is interconnected through the system bus with the memory of the other servers (including mapping servers and reduction servers) in the distributed processing system, the mapping
- server may send a register instruction (i.e., the preset registration instruction) to the system bus and perform registration on the system bus, so that the memory of the mapping server is added to the global memory.
- the system bus can also send confirmation instructions such as registration completion and registration success to the mapping server, so that the mapping server obtains the right to access the global memory.
- the reduction server may also be registered through a preset registration instruction, so that the memory of the reduction server is added to the global memory.
- the preset registration instruction is the register instruction
- after the reduction server is connected to the distributed processing system and its memory is interconnected through the system bus with the memory of the other servers (including mapping servers and reduction servers) in the distributed processing system, the
- reduction server may send a register instruction (i.e., the preset registration instruction) to the system bus and perform registration on the system bus, so that the memory of the reduction server is added to the global memory.
- the system bus can also send confirmation instructions such as registration completion and registration success to the reduction server, so that the reduction server obtains the permission to access the global memory.
- the memory of the reduction server and the mapping server added to the distributed processing system can be managed in a unified manner, thereby realizing the unified management of the global memory.
- an address mapping relationship between the global memory and the memory of each mapping server and each reduction server can also be established, so that address mapping is performed when data is read and written.
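- a toy sketch of this registration and address-mapping bookkeeping (the `SystemBus` class and its bump-allocation of global base addresses are assumptions for illustration, not the claimed mechanism):

```python
class SystemBus:
    """Toy registry standing in for the system bus that manages global memory."""
    def __init__(self):
        self.global_memory_map = {}   # server id -> (global base address, size)
        self.next_base = 0

    def register(self, server_id: str, memory_size: int) -> int:
        """Handle a 'register' instruction: add the server's memory to the
        global memory and return its global base as the success confirmation."""
        base = self.next_base
        self.global_memory_map[server_id] = (base, memory_size)
        self.next_base += memory_size
        return base

bus = SystemBus()
map_base = bus.register("mapping-server-1", 64 * 2**30)
red_base = bus.register("reduction-server-1", 64 * 2**30)
# address mapping: a server's local offset plus its global base gives the global address
```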
- both the memory of the mapping server and the memory of the reduction server are multi-level memory.
- the multi-level memory may include at least two of the following: double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM, also referred to as DDR), dynamic random access memory (dynamic random access memory, DRAM), Optane memory (also referred to as AEP), or other memories accessed in a memory manner.
- multi-level memory may be constructed according to the read/write speed of the memory.
- the read and write speed of DDR is faster than that of Optane memory, and the multi-level memory can be set to "DDR + Optane memory".
- that is, DDR is used first, and the Optane memory is used after the DDR is used up.
- the multi-level memory can also be set to "DRAM+Optane memory".
- the global memory formed by interconnecting the memory of multiple mapping servers and the memory of multiple reduction servers is also a multi-level memory.
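- a tiering sketch under the assumption that allocation simply falls through from the faster tier to the slower one when the faster tier is exhausted:

```python
def allocate_from_multilevel_memory(tiers: list[dict], size: int) -> str:
    """'DDR + Optane memory' tiering sketch: try the faster tier first and
    fall through to the slower tier only when the faster one is exhausted."""
    for tier in tiers:                      # tiers ordered fastest first
        if tier["free"] >= size:
            tier["free"] -= size
            return tier["name"]
    raise MemoryError("no tier has enough free space")

tiers = [{"name": "DDR", "free": 8 * 2**30}, {"name": "Optane", "free": 64 * 2**30}]
print(allocate_from_multilevel_memory(tiers, 2**30))   # -> "DDR"
```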
- FIG. 3 shows a schematic diagram of an application scenario of a data processing method according to an embodiment of the present application.
- the data processing method is applied to a distributed processing system, and the distributed processing system includes two mapping servers (respectively, a mapping server 311 and a mapping server 321 ) and a reduction server 331 .
- the multi-level memory 314 (DDR+AEP+other memory) of the mapping server 311, the multi-level memory 324 (DDR+AEP+other memory) of the mapping server 321, and the multi-level memory 334 (DDR+AEP+other memory) of the reduction server 331
- are connected through the system bus 314 to form a global memory; that is, the global memory 340 in FIG. 3 includes the multi-level memory 314, the multi-level memory 324 and the multi-level memory 334.
- the distributed processing system processes data block 301 and data block 302 .
- the data block 301 is input to the mapping server 311, and the mapping server 311 executes a map task 315: the data block 301 is processed by the operator 312 to obtain a processing result, shuffle write 313 is then performed, and the processing
- result is written into the global memory 340 through memory operation instructions; similarly, the data block 302 is input to the mapping server 321, and the mapping server 321 executes a map task 325: the data block 302 is processed by the operator 322 to obtain a processing result, shuffle write 323 is then performed, and the processing result is written into the global memory 340 through memory operation instructions.
- the operator 312 is the first operator for processing the input data (data block 301 ), and the operator 322 is the first operator for processing the input data (data block 302 ).
- the reduction server 331 executes a reduce task (reduce task) 335: it first performs shuffle read 333 to read data from the global memory 340 through a memory operation instruction, and then uses the operator 332 to process the read data to obtain an output result 341.
- the operator 332 is a second operator running on the reduction server and processing the processing result of the mapping server.
- The above uses two mapping servers and one reduction server as an example to illustrate the distributed processing system and the global memory.
- a distributed processing system may include multiple mapping servers and multiple reduction servers, and the present application does not limit the number of mapping servers and the number of reduction servers in the distributed processing system.
- FIG. 4 shows a flowchart of a data processing method according to an embodiment of the present application. As shown in Figure 4, the method is applied to a mapping server in a distributed processing system, and the method includes:
- Step S401: Process the input data to obtain second data.
- the distributed processing system may include multiple mapping servers.
- the distributed processing system can divide the massive data to be processed according to the number of mapping servers.
- the split function can be used to split the massive data to be processed into multiple data blocks to be processed; then one or more data blocks to be processed are used as the input data of each mapping server.
- For example, if the number of mapping servers is 4 and the number of data blocks to be processed is also 4, each mapping server is allocated 1 data block to be processed as input data; if the number of data blocks to be processed is 8, each mapping server is allocated 2 data blocks to be processed as input data.
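- A sketch of the split-and-assign step under the assumption of a simple round-robin allocation of the data blocks to be processed; the InputSplitter class is illustrative and not the split function of any particular framework.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative split of the data to be processed into blocks and round-robin
// assignment of the blocks to mapping servers as their input data.
public class InputSplitter {
    public static List<List<String>> assign(List<String> blocks, int numMapServers) {
        List<List<String>> perServer = new ArrayList<>();
        for (int i = 0; i < numMapServers; i++) {
            perServer.add(new ArrayList<>());
        }
        for (int i = 0; i < blocks.size(); i++) {
            perServer.get(i % numMapServers).add(blocks.get(i)); // e.g. 8 blocks, 4 servers -> 2 each
        }
        return perServer;
    }

    public static void main(String[] args) {
        List<String> blocks = List.of("b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7");
        System.out.println(assign(blocks, 4)); // [[b0, b4], [b1, b5], [b2, b6], [b3, b7]]
    }
}
```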
- After receiving the input data, the mapping server can perform format conversion, data screening or calculation on the input data through the first operator to obtain the second data. That is to say, the second data is the processing result of the input data by the mapping server.
- the massive data to be processed is the population file of country X.
- the distributed processing system needs to analyze and count the population by province.
- the massive data to be processed can be divided into multiple data blocks to be processed, and the data blocks are used as the input data of the mapping servers in the distributed processing system; each mapping server can extract preset key population information from the input data, such as name, date of birth, household registration location and residence, to obtain the second data.
- Step S402: Divide the second data into a plurality of data blocks according to the preset label.
- the preset tags may be preset according to usage scenarios and keywords of massive data to be processed.
- the massive data to be processed is the population file of country X, and the distributed processing system needs to analyze and count the population by province.
- the province where the household registration is located can be used as the preset label, and the number of preset labels is the same as the total number of provinces and cities of country X.
- the number of reduction servers in the distributed processing system may also be considered.
- the number of preset tags may be determined first according to the number of reduction servers, and then the preset tags may be corresponding to the keywords of the massive data to be processed.
- the preset label can also be set in other ways, and the present application does not limit the setting method and setting basis of the preset label.
- the second data may be divided into a plurality of data blocks by searching, matching, hashing, etc. according to a preset tag.
- For example, assuming that the preset label is the province where the household registration is located and the number of preset labels is 10, the second data can be divided into 10 data blocks according to the province where the household registration is located.
- If some preset labels do not appear in the second data, the number of data blocks obtained by dividing the second data may be less than 10.
- when the second data is divided into multiple data blocks, data may be selected from the second data in a hash manner to divide the second data into multiple data blocks. In this way, sorting is not required before the second data is divided into blocks, so the processing efficiency of dividing the second data can be improved.
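- A sketch of dividing the second data into data blocks in a hash manner according to the preset label; the HashPartitioner class and the use of the record key as the label are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative hash-based division of the second data (key/value records) into
// data blocks keyed by the preset label; no sorting is needed before dividing.
public class HashPartitioner {
    public static Map<Integer, List<Map.Entry<String, String>>> divide(
            List<Map.Entry<String, String>> records, int numLabels) {
        Map<Integer, List<Map.Entry<String, String>>> blocks = new HashMap<>();
        for (Map.Entry<String, String> record : records) {
            // The preset label (e.g. province of household registration) is the record key;
            // hashing selects one of numLabels data blocks.
            int block = Math.floorMod(record.getKey().hashCode(), numLabels);
            blocks.computeIfAbsent(block, b -> new ArrayList<>()).add(record);
        }
        return blocks; // fewer than numLabels blocks may appear if some labels are absent
    }

    public static void main(String[] args) {
        var records = List.of(Map.entry("Guangdong", "rec1"), Map.entry("Sichuan", "rec2"));
        System.out.println(divide(records, 10).keySet());
    }
}
```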
- Step S403: Store the multiple data blocks in the second storage area, where the second storage area is located in the global memory.
- a memory allocation instruction, such as an allocate(size) instruction, may be used to apply to the global memory for storage space according to the size of the second data, where size represents the size of the data; after the application is successful, the applied storage space is used as the second storage area for storing the second data.
- In this way, the second storage area can be dynamically applied for according to the size of the second data, thereby saving memory space and improving memory utilization.
- the mapping server may also pre-allocate the second storage area for the second data in the global memory according to the preset first size, instead of dynamically applying for storage space according to actual needs during operation. In this way, the second storage area can be allocated for the second data in advance, the number of times of dynamically applying for storage space during operation can be reduced, and the processing efficiency can be improved.
- Since the second storage area is located in the global memory, it may be located locally (in the physical memory of the mapping server) or remotely (in the physical memory of other servers).
- the third address of the second storage area can be determined, and it can be determined whether the third address is within the access range of the mapping server. If the third address is within the access range of the mapping server, address mapping is not required, and the mapping server can directly store multiple data blocks in the second storage area through a data write instruction, such as a store instruction.
- If the third address is outside the access range of the mapping server, address mapping needs to be performed.
- The third address can be mapped to a fourth address located in the access range of the mapping server through an address mapping instruction, such as the map instruction; then, according to the fourth address, the multiple data blocks are stored in the second storage area.
- In this way, the mapping server can access the remote second storage area.
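- A sketch of this write path; the GlobalMemoryOps interface is a hypothetical wrapper around the allocate, map and store memory operation instructions mentioned above, and the in-memory implementation in main exists only to make the example runnable.

```java
// Illustrative write path of the mapping server: apply for the second storage area
// in the global memory, map its address into the local access range if it is remote,
// and then store the data blocks with a plain write.
public class ShuffleWritePath {

    // Hypothetical wrappers around the allocate / map / store memory instructions.
    interface GlobalMemoryOps {
        long allocate(long size);                 // returns the third address of the second storage area
        boolean isLocallyAccessible(long address);
        long map(long address);                   // returns the fourth address inside the access range
        void store(long address, byte[] data);
    }

    public static void writeBlocks(GlobalMemoryOps mem, byte[][] dataBlocks, long totalSize) {
        long thirdAddress = mem.allocate(totalSize);
        long writeAddress = mem.isLocallyAccessible(thirdAddress)
                ? thirdAddress                    // local: no address mapping needed
                : mem.map(thirdAddress);          // remote: map to the fourth address first
        long offset = 0;
        for (byte[] block : dataBlocks) {
            mem.store(writeAddress + offset, block);
            offset += block.length;
        }
    }

    public static void main(String[] args) {
        byte[] backing = new byte[1 << 20];       // toy local backing store for the sketch
        GlobalMemoryOps mem = new GlobalMemoryOps() {
            private long next = 0;
            public long allocate(long size) { long a = next; next += size; return a; }
            public boolean isLocallyAccessible(long address) { return true; }
            public long map(long address) { return address; }
            public void store(long address, byte[] data) {
                System.arraycopy(data, 0, backing, (int) address, data.length);
            }
        };
        writeBlocks(mem, new byte[][] {{1, 2}, {3, 4}}, 4);
    }
}
```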
- the first operator processing the input data in the mapping server corresponds to the second storage area, that is, each first operator corresponds to a second storage area.
- the second storage area can be divided into multiple sub-areas according to the preset second size, and the multiple data blocks can be stored in the sub-areas in the order of the sub-areas.
- the data in all sub-regions that have been stored can be sorted by updating the ordered index linked list.
- the ordered index linked list achieves sorting by linking position indexes of the data.
- This sort-on-write approach can be thought of as an asynchronous pipeline approach. In this way, while writing data in multiple data blocks into the second storage area, the written data can be sorted, so as to realize direct sorting during writing, that is, sorting while writing.
- the processing procedure of sorting during data writing will be exemplarily described below with reference to FIG. 5 .
- FIG. 5 is a schematic diagram illustrating sorting of data during writing in a data processing method according to an embodiment of the present application.
- the second storage area (located in the global memory) 550 corresponding to the first operator (ie, the mapping task) 551 of the mapping server can be divided into 10 sub-areas or memory slices (slice), sub-regions 560 to 569, respectively.
- the sub-areas 560-564 have been stored, that is, the first operator 551 has completed the shuffle write to the sub-areas 560-564, and the data stored in the sub-areas 560-564 has been sorted; sub-area 565 has been stored, but the data on it has not yet been sorted; sub-areas 566-569 are unused blank areas.
- the first operator 551 continues to perform shuffle write, and can select the first blank sub-area 566 from the second storage area 550 in the order of the sub-areas, execute data writing in an exclusive manner, and at the same time create a position index for each piece of data (or record) written, by means of a position array or the like. After the sub-area 566 is full, the first operator 551 may select the next blank sub-area 567 to continue data writing, and notify (for example, through a message) the sorting thread (sorter) 570 that the sub-area 566 has been written.
- the sorting thread 570 can merge and sort the data on the sub-area 565 with the sorted data on the sub-areas 560-564 through the ordered index linked list, so that the data stored on the sub-areas 560-565 is globally ordered.
- the sorting thread 570 may sequentially read the data according to the position index of the sub-area 565, merge and sort the read data with the already sorted data (that is, the data on the sub-areas 560-564) by means of bucket sorting or the like, and update the ordered index linked list to obtain the sorted result.
- an ordered index linked list is used during sorting, so that no copying of data occurs during the sorting process.
- After the sorting thread 570 completes sorting the sub-area 565 and receives the notification that the writing of the sub-area 566 is completed, it merges and sorts, in a similar manner, the data stored on the sub-area 566 with the sorted data on the sub-areas 560-565.
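- A compact model of the sort-while-write pipeline of FIG. 5, assuming integer records: the writer fills sub-areas and notifies an asynchronous sorter, which merges each filled sub-area into a global order by updating an index of record positions rather than copying the records; class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Compact model of sort-while-write: the writer fills sub-areas (slices) of the second
// storage area and notifies an asynchronous sorter, which merges each filled slice into
// a global order by updating an index of record positions; the records themselves are
// never copied during sorting.
public class SortWhileWrite {
    static final int SLICE_RECORDS = 4;

    final List<int[]> slices = new CopyOnWriteArrayList<>();   // written sub-areas
    final List<int[]> orderedIndex = new ArrayList<>();        // {sliceId, offset}, kept sorted by value
    final BlockingQueue<Integer> filledSlices = new LinkedBlockingQueue<>();

    void write(int[] records) throws InterruptedException {
        for (int start = 0; start < records.length; start += SLICE_RECORDS) {
            slices.add(Arrays.copyOfRange(records, start, Math.min(start + SLICE_RECORDS, records.length)));
            filledSlices.put(slices.size() - 1);               // notify the sorter: this slice is full
        }
        filledSlices.put(-1);                                  // end-of-write marker
    }

    void sortLoop() throws InterruptedException {
        for (int sliceId; (sliceId = filledSlices.take()) != -1; ) {
            int[] slice = slices.get(sliceId);
            for (int offset = 0; offset < slice.length; offset++) {
                insertOrdered(sliceId, offset, slice[offset]); // merge slice into the ordered index
            }
        }
    }

    void insertOrdered(int sliceId, int offset, int value) {
        int pos = 0;
        while (pos < orderedIndex.size() && valueAt(orderedIndex.get(pos)) <= value) pos++;
        orderedIndex.add(pos, new int[] {sliceId, offset});
    }

    int valueAt(int[] ref) { return slices.get(ref[0])[ref[1]]; }

    public static void main(String[] args) throws InterruptedException {
        SortWhileWrite s = new SortWhileWrite();
        Thread sorter = new Thread(() -> { try { s.sortLoop(); } catch (InterruptedException ignored) {} });
        sorter.start();
        s.write(new int[] {7, 3, 9, 1, 8, 2, 6, 4});
        sorter.join();
        for (int[] ref : s.orderedIndex) System.out.print(s.valueAt(ref) + " ");  // 1 2 3 4 6 7 8 9
    }
}
```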
- a native sorter can be used to perform out-of-heap bucket sorting. Since in-heap merge sorting running on the Java virtual machine (JVM) has problems such as slow execution speed, disk IO caused by memory overflow, and an inefficient sorting algorithm, using a native sorter to perform bucket sorting outside the heap can effectively improve sorting efficiency.
- JVM Java virtual machine
- Executing data writing and sorting through an asynchronous pipeline (pipeline) and using the ordered index linked list during sorting not only allows writing and sorting to proceed together, realizing direct sorting while writing, but also removes the data copy step required when sorting separately, reducing the memory footprint and improving the processing efficiency of shuffle writing in the shuffling stage.
- writing and sorting can also be combined into one step, reducing processing steps and improving the processing efficiency of the shuffling stage.
- the mapping server in the distributed processing system can process the input data to obtain the second data, divide the second data into multiple data blocks according to the preset labels, and then store the multiple data blocks in the second storage area located in the global memory, so that the processing result of the mapping server (that is, the second data) is stored in the global memory during the shuffling stage. This not only avoids slow disk reads and writes, but also ensures that the processing in the shuffling stage is not limited by the memory capacity of the mapping server, thereby improving the processing efficiency and processing performance of the shuffling stage.
- the method may further include: determining metadata of the multiple data blocks; storing the metadata of the multiple data blocks in a preset first storage area.
- the metadata may include attribute information of multiple data blocks.
- the attribute information of each data block includes the storage address of the data block in the global memory.
- the attribute information of each data block may further include at least one of the label and the size of the data block.
- Those skilled in the art can set the specific content of the metadata according to the actual situation, which is not limited in this application.
- the first storage area may be located in global memory, or may be located in other memory accessible by multiple mapping servers and multiple reduction servers. The present application does not limit the specific location of the first storage area.
- the reduction server can obtain the metadata of the data to be read, such as tags, storage addresses, etc., from the first storage area.
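- A sketch of what such a metadata record and the lookup by target label might look like; the BlockMeta record and the ShuffleMetadata class are illustrative stand-ins for the first storage area.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative metadata record and first-storage-area lookup: each data block is
// described by its label, its storage address in the global memory and its size,
// and the reduction server queries the records by target label.
public class ShuffleMetadata {
    public record BlockMeta(String label, long address, long size) {}

    // A minimal stand-in for the preset first storage area.
    private final ConcurrentMap<String, List<BlockMeta>> byLabel = new ConcurrentHashMap<>();

    public void put(BlockMeta meta) {
        byLabel.computeIfAbsent(meta.label(), l -> new CopyOnWriteArrayList<>()).add(meta);
    }

    public List<BlockMeta> lookup(String targetLabel) {
        return byLabel.getOrDefault(targetLabel, List.of());
    }

    public static void main(String[] args) {
        ShuffleMetadata firstStorageArea = new ShuffleMetadata();
        firstStorageArea.put(new BlockMeta("key1", 0x1000L, 4096));
        firstStorageArea.put(new BlockMeta("key1", 0x9000L, 2048));
        System.out.println(firstStorageArea.lookup("key1")); // both blocks labelled key1
    }
}
```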
- FIG. 6 shows a flowchart of a data processing method according to an embodiment of the present application. As shown in Figure 6, the method is applied to a reduction server in a distributed processing system, and the method includes:
- Step S601: Obtain metadata of the first data to be read from a preset first storage area.
- the second data includes the processing result of the input data by a mapping server located at the front end of the distributed processing system; for any reduction server at the back end, the first data to be read may include a target data block among the plurality of data blocks of the second data. That is, the first data is the target data block, among the plurality of data blocks of the second data, that is to be processed by the reduction server.
- the second data is divided into multiple data blocks and stored in the global memory, and the metadata of the multiple data blocks is stored in the first storage area.
- the reduction server may obtain the metadata of the target data block included in the first data from the first storage area according to the target label corresponding to the to-be-processed first data in the preset label.
- metadata of each target data block may be acquired from the first storage area, respectively.
- Step S602: Determine the first address of the first data in the global memory according to the metadata of the first data.
- the first address (ie, the storage address) of the first data in the global memory can be obtained from the metadata.
- the first addresses of the target data blocks in the global memory may be determined from the metadata of each target data block.
- Step S603: Read the first data from the global memory according to the first address.
- Since the first address is located in the global memory, it may be located locally (in the physical memory of the reduction server) or remotely (in the physical memory of other servers).
- When the first data is read from the global memory, it can be determined whether the first address is within the access range of the reduction server. If the first address is within the access range of the reduction server, address mapping is not required, and the reduction server can directly read the first data from the global memory through a data read instruction, such as a load instruction.
- If the first address is outside the access range of the reduction server, the first address can be mapped to a second address located within the access range of the reduction server through an address mapping instruction, such as a map instruction; then, according to the second address, the first data is read from the global memory through a memory instruction, such as a load instruction.
- an address mapping instruction such as a map instruction
- a memory instruction such as a load instruction
- the reduction server can read the first data located at the remote end from the global memory.
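- A sketch of this read path; the GlobalMemoryOps interface is a hypothetical wrapper around the map and load memory operation instructions mentioned above.

```java
// Illustrative read path of the reduction server: look up the first address from the
// metadata, map it into the local access range if it is remote, and load the data.
public class ShuffleReadPath {

    // Hypothetical wrappers around the map / load memory instructions.
    interface GlobalMemoryOps {
        boolean isLocallyAccessible(long address);
        long map(long address);                 // returns the second address inside the access range
        byte[] load(long address, long size);
    }

    public static byte[] readBlock(GlobalMemoryOps mem, long firstAddress, long size) {
        long readAddress = mem.isLocallyAccessible(firstAddress)
                ? firstAddress                  // local: read directly
                : mem.map(firstAddress);        // remote: address mapping first
        return mem.load(readAddress, size);
    }
}
```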
- the reduction server in the distributed processing system can obtain the metadata of the first data to be read from the first storage area, where the first data includes a target data block among the multiple data blocks of the second data, and the second data includes the processing result of the input data by the corresponding mapping server; the reduction server then determines, according to the metadata, the first address of the first data in the global memory, and reads the first data from the global memory according to the first address. In this way, when the reduction server reads its input data (the first data), which includes target data blocks, from the processing results of the multiple mapping servers, the target data blocks do not need to be copied and transmitted over the network but can be read directly through memory operations, so the processing in the shuffling stage is not limited by factors such as the memory capacity of the computing nodes, the physical bandwidth of the transmission network, and the transmission delay, which improves the processing efficiency and processing performance of the shuffling stage and thus the processing efficiency of the distributed processing system.
- the global memory is not infinite. If multiple mapping servers execute the method shown in FIG. 4 in parallel, they are likely to store a large amount of data in the global memory; because the storage space of the global memory is limited, it may not be able to hold all of this data, causing applications to fail to run normally or to run inefficiently.
- the present application provides a memory management apparatus, which can manage the global memory: when the storage space of the global memory is insufficient, part of the data in the global memory is stored to an external storage area to free up space for the data to be stored; when the available storage space of the global memory becomes larger, the data stored in the external storage area is retrieved back into the global memory, so that a reduction server that needs to read this part of the data can read it from the global memory instead of from the external storage area, improving the efficiency of data reading.
- the above-mentioned external storage area includes, but is not limited to, at least one of the following: HDD, SSD, or other types of hard disks.
- the external storage area may also be HDFS shared by multiple mapping servers and multiple reduction servers, or the like.
- the above-mentioned memory management apparatus may be a software module deployed on a single mapping server.
- the memory management apparatus may also be implemented by hardware of the mapping server; for example, a controller in the mapping server implements the functions of the memory management apparatus.
- the memory management apparatus may manage the entire space of the global memory, or may only manage the memory space of the global memory allocated to the mapping server where the memory management apparatus is located.
- the memory management apparatus may also be a separate device, or may be a software module or the like deployed on a separate device. At this time, the memory management device can manage the entire space of the global memory.
- the specific process of managing the global memory by the memory management device includes:
- S701: the memory management apparatus acquires the space already used by the first memory.
- the first memory is the entire memory space of the global memory
- the first memory is the memory space allocated by the global memory to the mapping server where the memory management device is located, that is, a partial memory of the global memory.
- S702: the memory management apparatus determines whether the first memory satisfies the first condition; when it is determined that the first memory satisfies the first condition, S703 is executed, and when it is determined that the first memory does not satisfy the first condition, S704 is executed.
- S703: the memory management apparatus determines the first target data from the data stored in the first memory, and stores the first target data in an external storage area.
- the first condition is that the space already used by the first memory is greater than or equal to a first threshold; for example, the first threshold can be 80M.
- alternatively, the first condition is that the ratio of the space already used by the first memory to the total space of the first memory is greater than or equal to a second threshold; for example, the second threshold can be 80%.
- the first threshold and the second threshold can be set according to the actual situation, and are not specifically limited here.
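- A sketch of the first-condition check; the thresholds in main use the example values mentioned above (interpreting 80M as 80 MiB, which is an assumption), and the FirstMemoryMonitor class is illustrative.

```java
// Illustrative check of the first condition that triggers moving data out of the
// first memory: absolute used space >= first threshold, or used/total ratio >= second threshold.
public class FirstMemoryMonitor {
    private final long firstThresholdBytes;
    private final double secondThresholdRatio;

    public FirstMemoryMonitor(long firstThresholdBytes, double secondThresholdRatio) {
        this.firstThresholdBytes = firstThresholdBytes;
        this.secondThresholdRatio = secondThresholdRatio;
    }

    public boolean firstConditionMet(long usedBytes, long totalBytes) {
        return usedBytes >= firstThresholdBytes
                || (double) usedBytes / totalBytes >= secondThresholdRatio;
    }

    public static void main(String[] args) {
        // Example thresholds from the description: 80M used, or 80% of the total space.
        FirstMemoryMonitor monitor = new FirstMemoryMonitor(80L << 20, 0.80);
        System.out.println(monitor.firstConditionMet(70L << 20, 100L << 20)); // false
        System.out.println(monitor.firstConditionMet(85L << 20, 100L << 20)); // true
    }
}
```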
- the memory management apparatus may determine the first target data from the data stored in the first memory according to the priority of the data stored in the first memory. Specifically, the first target data may be the lower-priority part of the data stored in the first memory.
- that is, the memory management apparatus moves data out of the first memory according to the priority of the data: data with low priority is moved out of the first memory first, and data with high priority is moved out later.
- Mode 1: the priority is represented by the ID of the sub-area to which the data belongs.
- when the mapping server applies to the first memory for a second storage area, it may apply for a second storage area including a preset number of sub-areas according to the number of reduction servers, and each sub-area corresponds to a unique reduction server; that is, the above-mentioned preset number is the same as the number of reduction servers, and each sub-area corresponds to a unique identification (ID).
- ID unique identification
- for example, a sub-area with a small ID may correspond to a reduction server that needs to start its data reading task first, and a sub-area with a large ID corresponds to a reduction server that starts its data reading task later;
- alternatively, a sub-area with a large ID may correspond to a reduction server that needs to start its data reading task first, and a sub-area with a small ID corresponds to a reduction server that starts its data reading task later.
- when the mapping server stores data in the second storage area, it can store the data corresponding to different reduction servers into the sub-area corresponding to each reduction server according to the sub-area ID, and each reduction server can subsequently read the corresponding data from its corresponding sub-area according to the sub-area ID.
- For example, a reduction server with an identifier of 001 corresponds to the sub-area with an identifier of 1, and that reduction server can read data from the sub-area with an identifier of 1;
- a reduction server with an identifier of 002 corresponds to the sub-area with an identifier of 2, and that reduction server can read data from the sub-area with an identifier of 2.
- the priority of the data stored in the first memory is reflected by the sub-area ID to which the data belongs.
- when the sub-area with a small ID corresponds to the reduction server that needs to start the data reading task first and the sub-area with a large ID corresponds to the reduction server that starts the data reading task later, the priority of the data in the sub-area with the smaller ID can be higher than the priority of the data in the sub-area with the larger ID;
- when the sub-area with a large ID corresponds to the reduction server that needs to start the data reading task first and the sub-area with a small ID corresponds to the reduction server that starts the data reading task later, the priority of the data in the sub-area with the larger ID can be higher than the priority of the data in the sub-area with the smaller ID.
- In the former case, the first target data determined by the memory management apparatus from the first memory can be the data in the sub-areas whose IDs are greater than a first preset ID, and the first preset ID can be set according to the actual situation. For example, if the first memory includes sub-areas with IDs 1 to 10 and the first preset ID is 8, the first target data includes the data in the sub-area with ID 9 and the data in the sub-area with ID 10.
- In the latter case, the first target data determined by the memory management apparatus from the first memory can be the data in the sub-areas whose IDs are smaller than a second preset ID, and the second preset ID can be set according to the actual situation. For example, if the first memory includes sub-areas with IDs 1 to 10 and the second preset ID is 3, the first target data includes the data in the sub-area with ID 1 and the data in the sub-area with ID 2.
- Mode 2: the priority is embodied by the order in which data is stored into the first memory.
- the priority of the data may be that data stored into the first memory earlier has a lower priority than data stored into the first memory later, or that data stored into the first memory earlier has a higher priority than data stored into the first memory later.
- Taking the former case as an example, the first target data may be a preset number of pieces of data that were stored into the first memory earliest, and the preset number may be set according to the actual situation. For example, if the first memory stores 100 pieces of data and the preset number is 10, the first target data is the 10 pieces of data that were first stored into the first memory among the above 100 pieces of data.
- Mode 3: the priority is reflected by the amount of data.
- the priority of data may be that the priority of data with a large amount of data is higher than the priority of data with a small amount of data, or the priority of data with a large amount of data is lower than the priority of data with a small amount of data.
- the first target data is the data whose data volume is less than or equal to the preset data volume in the data stored in the first memory, and the preset data volume can be set according to the actual situation.
- For example, if the first memory stores 100 pieces of data and the preset data volume is 10KB, the first target data includes the data whose data volume is less than or equal to 10KB among the above 100 pieces of data.
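- A sketch of how the first target data might be selected under the three modes; the Entry record and the selection methods are illustrative, and each method shows only one of the two symmetric options described for its mode.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative selection of the first target data to move to the external storage area:
// the lowest-priority entries are chosen first, where priority can come from the sub-area
// ID (Mode 1), the order of storage into the first memory (Mode 2) or the data amount (Mode 3).
public class EvictionSelector {
    public record Entry(int subAreaId, long storeSequence, long sizeBytes) {}

    public static List<Entry> selectByMode1(List<Entry> entries, int firstPresetId) {
        // Mode 1 example: data in sub-areas with IDs greater than the first preset ID is lower priority.
        return entries.stream().filter(e -> e.subAreaId() > firstPresetId).collect(Collectors.toList());
    }

    public static List<Entry> selectByMode2(List<Entry> entries, int presetCount) {
        // Mode 2 example: the data stored into the first memory earliest is lower priority.
        return entries.stream()
                .sorted(Comparator.comparingLong(Entry::storeSequence))
                .limit(presetCount)
                .collect(Collectors.toList());
    }

    public static List<Entry> selectByMode3(List<Entry> entries, long presetSizeBytes) {
        // Mode 3 example: data whose amount is at most the preset amount is lower priority.
        return entries.stream().filter(e -> e.sizeBytes() <= presetSizeBytes).collect(Collectors.toList());
    }
}
```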
- S704: the memory management apparatus determines whether the first memory satisfies the second condition; when it is determined that the first memory satisfies the second condition, S705 is executed, and when it is determined that the first memory does not satisfy the second condition, S701 is executed.
- S705: the memory management apparatus determines the second target data from the data stored in the external storage area, and stores the second target data in the first memory.
- the second condition is that the space already used by the first memory is less than or equal to a third threshold; for example, the third threshold can be 70M.
- alternatively, the second condition is that the ratio of the space already used by the first memory to the total space of the first memory is less than or equal to a fourth threshold; for example, the fourth threshold can be 70%.
- the third threshold and the fourth threshold can be set according to the actual situation, and are not specifically limited here.
- the third threshold may be smaller than the first threshold
- the fourth threshold may be smaller than the second threshold
- optionally, the third threshold may be equal to the first threshold
- the fourth threshold may be equal to the second threshold
- the memory management apparatus may perform S703 or S705.
- the memory management apparatus may also store the data in the external storage area back to the first memory according to the priority of the data, that is, high-priority data is moved out of the external storage area first and low-priority data is moved out later; the process by which the memory management apparatus determines the second target data from the data stored in the external storage area is similar to the above process of determining the first target data from the data stored in the first memory, and reference may be made to the above related description, which will not be repeated here.
- the memory management apparatus can monitor the condition of the first memory in the process of storing the first target data in the external storage area; when it is determined that the first memory satisfies a third condition, the operation of storing the first target data in the external storage area is stopped.
- the third condition is that the space already used by the first memory is equal to a fifth threshold, or that the ratio of the space already used by the first memory to the total space of the first memory is equal to a sixth threshold, where the fifth threshold is smaller than the first threshold and the sixth threshold is smaller than the second threshold; the fifth threshold and the sixth threshold can be set according to actual conditions and are not specifically limited here.
- when storing the first target data in the external storage area, the memory management apparatus may further determine the metadata of the first target data and the metadata of the data remaining in the first memory, and update the metadata of the first target data and the metadata of the remaining data to the preset first storage area, where the metadata of the first target data may include attribute information of the first target data.
- the attribute information of the first target data includes the storage address of the data in the external storage area and, optionally, at least one of the label and the size of the data; the metadata of the remaining data in the first memory may include attribute information of the remaining data, where the attribute information of the remaining data includes the storage address of the data in the first memory and, optionally, at least one of the label and the size of the data.
- those skilled in the art can set the specific content of the metadata of the first target data and the metadata of the remaining data according to the actual situation, which is not limited in this application.
- In this way, the reduction server can obtain the metadata of the data to be read, such as labels, sizes and storage addresses, from the first storage area.
- the first data to be read by the reduction server in step S601 may all be located in the global memory, may all be located in the external storage area, or may be partly located in the global memory and partly in the external storage area.
- When the first data is all located in the global memory, the process of the reduction server reading the first data from the global memory may refer to the method flow shown in FIG. 6.
- When the first data is located in the external storage area, the address of the first data may be located locally or remotely.
- If the address is local, the reduction server can directly read the first data locally. If the address is at the remote end, the reduction server can send a data read request including the address of the first data to the memory management apparatus to request the first data; after receiving the read request, the memory management apparatus can find the first data in the remote external storage area, store the first data from the remote external storage area into the global memory, and return the address of the first data in the global memory to the reduction server, so that the reduction server can read the first data from the global memory.
- When the first data is partly located in the global memory and partly in the external storage area, the metadata of the first data obtained by the reduction server from the preset first storage area includes the metadata of the first part of the data and the metadata of the second part of the data, where the first part of the data is the data of the first data located in the global memory, and the second part of the data is the data of the first data located in the external storage area.
- The address of the first part of the data in the global memory can be obtained from the metadata of the first part of the data, and the address of the second part of the data in the external storage area can be obtained from the metadata of the second part of the data; then the first part of the data is read from the global memory, and the second part of the data is read from the external storage area.
- the process of the reduction server reading the first part of the data from the global memory is similar to the process of the reduction server reading the first data from the global memory described above, and the process of the reduction server reading the second part of the data from the external storage area is similar to the process of the reduction server reading the first data from the external storage area described above; reference may be made to the above related descriptions.
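- A sketch of reading first data that is split between the global memory and the external storage area; the PartMeta record and the Storage interface are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative read of first data that is split between the global memory and the
// external storage area: the metadata of each part tells the reduction server where
// that part lives, and the two kinds of reads are combined into the final result.
public class MixedRead {
    public record PartMeta(boolean inGlobalMemory, long address, long size) {}

    interface Storage {
        byte[] loadFromGlobalMemory(long address, long size);
        byte[] readFromExternalStorage(long address, long size);
    }

    public static List<byte[]> readFirstData(Storage storage, List<PartMeta> metadata) {
        List<byte[]> parts = new ArrayList<>();
        for (PartMeta meta : metadata) {
            parts.add(meta.inGlobalMemory()
                    ? storage.loadFromGlobalMemory(meta.address(), meta.size())      // first part of the data
                    : storage.readFromExternalStorage(meta.address(), meta.size())); // second part of the data
        }
        return parts;
    }
}
```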
- the operation of the memory management apparatus storing the first target data from the first memory to the external storage area and the operation of the mapping server storing data to the first memory can be performed in parallel; the operation of the memory management apparatus storing the first target data from the first memory to the external storage area and the operation of the reduction server reading data from the global memory can be performed in parallel; the operation of the memory management apparatus storing the second target data from the external storage area to the first memory and the operation of the mapping server storing data to the first memory can be performed in parallel; and the operation of the memory management apparatus storing the second target data from the external storage area to the first memory and the operation of the reduction server reading data from the global memory can also be performed in parallel, so as to improve the efficiency of data processing.
- the memory management apparatus may store the first target data from the first memory to the external storage area in an asynchronous or synchronous transmission manner, or store the second target data from the external storage area to the first memory.
- FIG. 8 shows a schematic diagram of a software architecture of a data processing method according to an embodiment of the present application.
- the data processing method can be applied to the shuffling stage of the distributed processing system.
- As shown in FIG. 8, the data processing method can be implemented through the shuffle management components 811 and 821, the shuffle writer (shuffle writer) 812, the shuffle reader (shuffle reader) 822, the data management component 830 and the global memory 840.
- the shuffle management component can provide external shuffle function interfaces (such as a global memory interface related to reading and writing, or an interface related to other operations on the global memory, such as an interface for storing the first target data from the first memory to the external storage area), which can be used after the upper-layer software registers with it.
- the shuffle management component can be seamlessly compatible with multiple open source software in the form of plug-ins.
- the shuffle management component 811 is deployed on the mapping server 810 in the distributed processing system, and the shuffle writer (shuffle writer) 812, such as the mapping task maptask, can write the data to be written (for example, the processing result of the maptask) into the global memory 840 through the functional interface provided by the shuffle management component 811, using operation instructions based on memory semantics.
- the shuffle management component 821 is deployed on the reduction server 820 in the distributed processing system, and the shuffle reader (shuffle reader) 822, such as the reduction task reducetask, can read the input data of the reducetask from the global memory 840 through the functional interface provided by the shuffle management component 821, using operation instructions based on memory semantics.
- the data management component 830 interacts with the shuffling management component 811 deployed on the mapping server 810 and the shuffling management component 821 deployed on the reduction server 820 in the distributed processing system.
- the data management component 830 manages intermediate/temporary data and provides metadata services. For example, the data management component 830 may provide a metadata writing service for the shuffle management component 811 deployed on the mapping server 810, and may provide a metadata reading service for the shuffle management component 821 deployed on the reduction server 820.
- Shuffle writer 812 is used to perform memory operations related to writing data to global memory 840 .
- For example, it can execute memory application (such as the allocate(size) instruction), address mapping (such as the map instruction), unmapping (such as the unmap instruction), memory release (such as the release instruction) and data writing (such as the store instruction).
- Shuffle reader 822 is used to perform memory operations related to reading data from global memory 840 .
- For example, it can execute memory application (such as the allocate(size) instruction), address mapping (such as the map instruction), unmapping (such as the unmap instruction), memory release (such as the release instruction) and data reading (such as the load instruction).
- All of the above are operations based on memory semantics, as well as memory copies associated with the global memory.
- The above uses one mapping server and one reduction server as an example to illustrate the software architecture of the data processing method of the embodiment of the present application.
- the software architecture shown in FIG. 8 can be used for multiple mapping servers and reduction servers in the distributed processing system.
- the above-mentioned software architecture can be implemented by multiple programming languages such as Java, C++, Python, etc., which is not limited in this application.
- the above components can all run on the Java virtual machine (Java Virtual Machine, JVM).
- the first computing process is used to perform data processing tasks on the mapping server, and may include multiple mapping task threads.
- the mapping server may include at least one first operator that processes the input data; in the initialization stage of the first operation process, the mapping server may apply to the global memory for second storage areas according to the preset first size and the number of processor cores of the mapping server, so that each processor core corresponds to one second storage area, wherein each processor core runs at least one first operator.
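- A sketch of this initialization step; the GlobalMemoryOps interface and the per-core map of second storage areas are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative initialization step: the first operation process applies to the global
// memory for one second storage area per processor core, sized by the preset first size,
// so each shuffle writer (the operators on one core) has its own area.
public class PerCoreInit {
    interface GlobalMemoryOps {
        long allocate(long size); // hypothetical wrapper around the allocate(size) instruction
    }

    public static Map<Integer, Long> allocatePerCore(GlobalMemoryOps mem, int processorCores, long presetFirstSize) {
        Map<Integer, Long> secondStorageAreaByCore = new HashMap<>();
        for (int core = 0; core < processorCores; core++) {
            secondStorageAreaByCore.put(core, mem.allocate(presetFirstSize));
        }
        return secondStorageAreaByCore;
    }
}
```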
- FIG. 9 shows a schematic diagram of initialization of a first operation process of a mapping server according to an embodiment of the present application. As shown in FIG. 9 , the first operation process 910 and the first operation process 920 run on different mapping servers.
- the processor of each mapping server includes 2 cores (CPU core).
- the first operation process 910 includes four operators, an operator 911 , an operator 912 , an operator 913 , and an operator 914 .
- the operator 911 , the operator 912 , the operator 913 and the operator 914 are all first operators (ie maptask) running on the map server and used for processing the input data.
- the operator 911 and the operator 912 run concurrently on one core of the processor, and the operator 913 and the operator 914 run concurrently on another core of the processor.
- the first operation process 920 includes four operators, an operator 921 , an operator 922 , an operator 923 , and an operator 924 .
- the operator 921 , the operator 922 , the operator 923 and the operator 924 are all first operators (ie maptask) running on the map server and used for processing the input data.
- the operator 921 and the operator 922 run concurrently on one core of the processor, and the operator 923 and the operator 924 run concurrently on another core of the processor.
- In the initialization stage, a storage space (for storing the processing result of the mapping server) can be applied for from the global memory 930 as the second storage area, so that each processor core corresponds to one second storage area. At least one first operator may run on each processor core of the mapping server.
- The at least one operator running on the same processor core can be regarded as a shuffle writer (shuffle writer), that is, each processor core corresponds to one shuffle writer; according to the preset first size, storage space is applied for in the global memory 930 for each shuffle writer as the second storage area corresponding to each processor core (or each shuffle writer).
- a shuffle writer shuffle writer
- the first operation process 910 regards the operator 911 and the operator 912 running on the same processor core as the shuffle writer writer 915, and regards the operator 913 and the operator 914, which run on the same processor core, as the shuffle writer writer 916.
- the second storage area applied by the first operation process 910 for the writer 915 in the global memory 930 includes 9 caches, which are 3 caches A, 3 caches B, and 3 caches C respectively.
- cache A is used to store the data labeled key1 in the processing results of operator 911 and operator 912, that is, to aggregate the data labeled key1 in the processing results of operator 911 and operator 912;
- cache B is used to store the data labeled key2 in the processing results of operator 911 and operator 912, that is, to aggregate the data labeled key2 in the processing results of operator 911 and operator 912;
- cache C is used to store the data labeled key3 in the processing results of operator 911 and operator 912, that is, to aggregate the data labeled key3 in the processing results of operator 911 and operator 912.
- the second storage area applied by the first operation process 910 for the writer 916 in the global memory 930 includes 9 caches, which are 3 caches D, 3 caches E and 3 caches F respectively.
- cache D is used to store the data labeled key1 in the processing results of operator 913 and operator 914, that is, to aggregate the data labeled key1 in the processing results of operator 913 and operator 914;
- cache E is used to store the data labeled key2 in the processing results of operator 913 and operator 914, that is, to aggregate the data labeled key2 in the processing results of operator 913 and operator 914;
- cache F is used to store the data labeled key3 in the processing results of operator 913 and operator 914, that is, to aggregate the data labeled key3 in the processing results of operator 913 and operator 914.
- the first operation process 920 regards the operator 921 and the operator 922 running on the same processor core as the shuffle writer writer 925, and regards the operator 923 and the operator 924, which run on the same processor core, as the shuffle writer writer 926.
- the second storage area applied by the first operation process 920 for the writer 925 in the global memory 930 includes 9 caches, which are 3 caches G, 3 caches H and 3 caches J respectively.
- cache G is used to store the data labeled key1 in the processing results of operator 921 and operator 922, that is, to aggregate the data labeled key1 in the processing results of operator 921 and operator 922;
- cache H is used to store the data labeled key2 in the processing results of operator 921 and operator 922, that is, to aggregate the data labeled key2 in the processing results of operator 921 and operator 922;
- cache J is used to store the data labeled key3 in the processing results of operator 921 and operator 922, that is, to aggregate the data labeled key3 in the processing results of operator 921 and operator 922.
- the second storage area applied by the first operation process 920 for the writer 926 in the global memory 930 includes 9 caches, which are 3 caches K, 3 caches L and 3 caches M respectively.
- cache K is used to store the data labeled key1 in the processing results of operator 923 and operator 924, that is, to aggregate the data labeled key1 in the processing results of operator 923 and operator 924;
- cache L is used to store the data labeled key2 in the processing results of operator 923 and operator 924, that is, to aggregate the data labeled key2 in the processing results of operator 923 and operator 924;
- cache M is used to store the data labeled key3 in the processing results of operator 923 and operator 924, that is, to aggregate the data labeled key3 in the processing results of operator 923 and operator 924.
- the reduce task (reduce task) 940, the reduce task 950 and the reduce task 960 running on the reduction server each read data from the global memory 930.
- the reduce task (reduce task) 940 reads the data labeled key1 from the global memory 930, that is, reads the data from the cache A, the cache D, the cache G, and the cache K respectively;
- the reduce task (reduce task) 950 reads the data from the The data labeled key2 is read from the global memory 930, that is, data is read from cache B, cache E, cache H, and cache L respectively;
- the reduce task (reduce task) 960 reads the data labeled key3 from the global memory 930 , that is, read data from cache C, cache F, cache J, and cache M respectively.
- The above describes, by way of example, the initialization of the first operation process of the mapping server in the embodiment of the present application; other mapping servers in the distributed processing system are initialized in a similar fashion.
- In this way, a second storage area can be applied for from the global memory so that each processor core corresponds to one second storage area, where at least one first operator runs on each processor core; the at least one operator running on the same processor core (for example, operator 911 and operator 912) can be regarded as a shuffle writer, and storage space is allocated for that shuffle writer in the global memory, so that data with the same label in the processing results of the at least one operator running on the same processor core is stored in the same area of the global memory. This realizes data aggregation based on processor cores, reduces data scattering, and improves data reading efficiency.
- FIG. 10 shows a schematic diagram of a processing procedure of a data processing method according to an embodiment of the present application.
- the distributed processing system includes a mapping server 1010 and a reduction server 1020 .
- the multi-level memory 1012 of the mapping server 1010 includes DRAM+AEP
- the multi-level memory 1022 of the reduction server 1020 also includes DRAM+AEP.
- the first operation process 1015 on the mapping server 1010 is used to process the input data.
- the first operation process 1015 may include a plurality of threads (threads corresponding to the first operator) for executing map tasks, that is, maptask threads.
- the maptask thread, as the shuffle writer, can use the functional interface provided by the shuffle management component 1011 to write the data to be written into the global memory.
- In the initialization stage, the first operation process 1015 can apply for storage space (also referred to as memory space, cache space, etc.) from the global memory according to the preset first size, using the method shown in FIG. 9, as the second storage area.
- a storage space also referred to as memory space, cache space, etc.
- multiple maptask threads in the first computing process 1015 can process the input data to obtain second data (ie, the processing result of the mapping server 1010 ), for example, multiple ⁇ key, value> records.
- the first operation process 1015 can determine whether the storage space requested in the initialization stage is sufficient according to the size of the second data.
- If the storage space is insufficient, the first computing process 1015 can also dynamically apply for storage space from the global memory through the global memory interface, and map the newly applied storage space into its access range, so that the storage space can be accessed by the first computing process 1015.
- the maptask thread can divide the second data into multiple data blocks by hashing according to the preset label and, as a shuffle writer, use the functional interface provided by the shuffle management component 1011 to store the multiple data blocks in the applied storage space (that is, the second storage area) through memory operation instructions.
- While writing the data, the sorting thread 1013 can be used in an asynchronous pipeline manner (refer to FIG. 5) to perform sorting while writing. Assuming that the multiple data blocks are stored on the DRAM in the multi-level memory 1012 of the mapping server 1010, after the multiple data blocks are stored, the first operation process 1015 can send the metadata of the stored data blocks to the data management component 1030, so that the data management component 1030 stores the metadata.
- the second operation process 1025 on the reduction server 1020 is used to read the first data (that is, the target data blocks among the multiple data blocks of the second data) from the processing results (that is, the second data) of the mapping server 1010 and other mapping servers (not shown in the figure), and to process the read first data.
- the second operation process 1025 may include a plurality of threads (threads corresponding to the second operator) for executing reduction tasks, ie, reducetask threads.
- the second computing process 1025 can register with the shuffle management component 1021 deployed on the reduction server 1020; as the shuffle reader, the reducetask thread can use memory operation instructions, through the functional interface provided by the shuffle management component 1021, to read data from the global memory.
- each reducetask thread in the second computing process 1025 can obtain the metadata of the first data from the data management component 1030, determine the storage address of the first data according to the metadata, and map the storage address into the access range of the second operation process 1025.
- the second computing process 1025 can apply for corresponding memory locally and then, according to the mapped storage address, directly fetch the data stored on the DRAM in the multi-level memory 1012 of the mapping server 1010 into the local memory of the reduction server 1020 for processing, through a memory data read instruction (for example, the load instruction).
- the second computing process 1025 may also perform a gather memory copy asynchronously, and hard-copy data scattered in different remote memories to the local memory at one time for subsequent processing.
- the first computing process 1015 may also monitor the condition of the first memory:
- when the first memory satisfies the first condition, the first target data can be determined from the data stored in the first memory, and the first target data can be stored in the external storage area through the functional interface provided by the shuffle management component 1011.
- after storing the first target data in the external storage area, the first operation process 1015 may send, to the data management component 1030, the metadata of the first target data and the metadata of the remaining data stored in the first memory other than the first target data, so that the data management component 1030 stores the metadata of the data stored in the external storage area and in the first memory.
- when the first memory satisfies the second condition, the first operation process 1015 may determine the second target data from the data stored in the external storage area and store the second target data in the first memory through the functional interface provided by the shuffle management component 1011. After storing the second target data in the first memory, the first operation process 1015 may send, to the data management component 1030, the metadata of the second target data and the metadata of the remaining data stored in the external storage area other than the second target data, so that the data management component 1030 stores the metadata of the data stored in the external storage area and in the first memory.
- The above uses one mapping server and one reduction server as an example to illustrate the processing procedure of the data processing method of the embodiment of the present application. It should be understood that the distributed processing system may include multiple mapping servers and multiple reduction servers, and their processing processes are similar and will not be repeated here.
- When a read/write instruction (such as a load/store instruction) is executed directly on remote memory in the global memory, there is still a relatively large overhead compared with reading and writing local memory.
- Therefore, based on the sorted result of the intermediate data in the shuffle stage, which can be obtained in advance, the data can be prefetched by constructing a memory address list or the like, so as to improve the read and write efficiency of the remote memory.
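- A sketch of such prefetching based on a memory address list; the GlobalMemoryOps interface and the address-list format are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative prefetch: using the sorted order of the intermediate data, the addresses
// that will be read from remote memory are collected into an address list and fetched
// ahead of time, so later reads hit local memory instead of paying the remote-access cost.
public class RemotePrefetcher {
    interface GlobalMemoryOps {
        boolean isLocallyAccessible(long address);
        byte[] load(long address, long size);
    }

    public static List<byte[]> prefetch(GlobalMemoryOps mem, List<long[]> sortedAddressList) {
        List<byte[]> prefetched = new ArrayList<>();
        for (long[] entry : sortedAddressList) {              // entry = {address, size}, in read order
            if (!mem.isLocallyAccessible(entry[0])) {
                prefetched.add(mem.load(entry[0], entry[1])); // pull remote data into local memory early
            }
        }
        return prefetched;
    }
}
```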
- the data processing method described in the embodiments of the present application is applied to a distributed processing system, and can realize data reading and writing in the shuffling stage through memory operations based on a global memory formed by interconnecting memory of multiple computing nodes in the distributed processing system, It can not only make full use of massive memory, but also remove redundant data processing links in the old software architecture, which greatly improves the processing performance of the shuffle stage.
- the data processing method described in the embodiments of the present application redefines the software architecture of the shuffle stage based on the new hardware topology of memory interconnection, so that the storage of intermediate data in the shuffle stage and the reading and writing between computing nodes are completed with efficient memory operations. This reduces the processing flow of the shuffle stage and further reduces the bottleneck effect of the shuffle stage in big data processing.
- FIG. 11 shows a block diagram of a reduction server according to an embodiment of the present application.
- the reduction server is applied to a distributed processing system, and the distributed processing system includes a plurality of mapping servers and a plurality of reduction servers, and the memory of the plurality of mapping servers and the memory of the plurality of reduction servers constitute a global memory .
- the reduction server includes:
- the metadata reading module 1110 is used to obtain the metadata of the first data to be read from the preset first storage area; for the specific implementation of the function of the metadata reading module 1110, reference may be made to step S601, which will not be repeated here.
- the address determination module 1120 is configured to determine the first address of the first data in the global memory according to the metadata; the specific implementation of the function of the address determination module 1120 may refer to step S602, which will not be repeated here.
- a data reading module 1130 configured to read the first data from the global memory according to the first address
- the first data includes a target data block in a plurality of data blocks of the second data
- the second data includes the processing result of the input data by the corresponding mapping server.
- the data reading module 1130 is configured to map the first address as a second address, where the second address is located within the access range of the reduction server; according to the second address, the first data is read from the global memory.
- the reduction server further includes: a first registration module, configured to register the reduction server through a preset registration instruction after the reduction server is connected to the distributed processing system, so that the memory of the reduction server is added to the global memory.
- FIG. 12 shows a block diagram of a mapping server according to an embodiment of the present application.
- the mapping server is applied to a distributed processing system.
- the distributed processing system includes a plurality of mapping servers and a plurality of reduction servers, and the memory of the plurality of mapping servers and the memory of the plurality of reduction servers constitute a global memory.
- the mapping server includes:
- the data processing module 1210 is used for processing the input data to obtain the second data; for the specific implementation of the functions of the data processing module 1210, reference may be made to step S401, which will not be repeated here.
- the data dividing module 1220 is configured to divide the second data into a plurality of data blocks according to the preset label; for the specific implementation of the functions of the data dividing module 1220, reference may be made to step S402, which will not be repeated here.
- the data storage module 1230 is used to store the plurality of data blocks in the second storage area, where the second storage area is located in the global memory. For the specific implementation of the functions of the data storage module 1230, reference may be made to step S403, which is not repeated here.
- the data storage module 1230 is configured to: in the case that the data in the multiple data blocks needs to be sorted, divide the second storage area into a plurality of sub-regions according to a preset second size; store the multiple data blocks into the plurality of sub-regions in the order of the sub-regions; and, while the multiple data blocks are being stored into the plurality of sub-regions in sequence, sort the data in all sub-regions whose storage has been completed by updating an ordered index linked list, where the ordered index linked list performs the sorting by linking the position indexes of the data.
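- A minimal sketch of this write-time sorting is given below, assuming fixed-size slices and an in-process sorter rather than the asynchronous sorter thread described later; `SliceWriter` and its record format are assumptions made only for illustration. Only position indexes are reordered, so no record payload is copied while sorting:

```python
import bisect

class SliceWriter:
    """Illustrative write-time sorting over sub-regions of the second storage area."""

    def __init__(self, slice_capacity):
        self.slice_capacity = slice_capacity
        self.slices = [[]]                # sub-regions, written in order
        self.sorted_index = []            # ordered index of (key, slice_id, offset) entries

    def write(self, key, value):
        current = self.slices[-1]
        current.append((key, value))      # append-only write into the current sub-region
        if len(current) == self.slice_capacity:
            self._merge_slice(len(self.slices) - 1)   # sub-region full: fold its indexes in
            self.slices.append([])                    # continue in the next blank sub-region

    def _merge_slice(self, slice_id):
        # Merge only position indexes into the ordered index, keeping all
        # completed sub-regions globally sorted without moving the records.
        for offset, (key, _value) in enumerate(self.slices[slice_id]):
            bisect.insort(self.sorted_index, (key, slice_id, offset))

    def sorted_records(self):
        # Replay the ordered index to visit records of completed sub-regions in key order.
        for key, slice_id, offset in self.sorted_index:
            yield self.slices[slice_id][offset]
```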
- the mapping server further includes: an initialization module, configured to, in the initialization phase of a first computing process, apply to the global memory for the second storage areas according to the number of processor cores of the mapping server, so that each processor core corresponds to one second storage area, where the first computing process runs on the mapping server and is used to process the input data, at least one first operator runs on each processor core, and the first operator is used to process the input data.
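- A possible shape of this initialization step is sketched below; `allocate` stands in for a global-memory allocation call (e.g. an allocate(size)-style instruction) and `region_size` for the preset first size, both assumptions made for the example:

```python
import os

def init_second_storage_areas(allocate, region_size, core_count=None):
    """One second storage area per processor core (i.e. per shuffle writer)."""
    cores = core_count or os.cpu_count() or 1
    # The operators running on a given core share the storage area allocated for that core.
    return {core: allocate(region_size) for core in range(cores)}
```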
- the data dividing module 1220 is configured to: divide the second data into multiple data blocks by hashing according to a preset label.
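- For illustration only, the hash-based division could be as simple as the following sketch, in which `records` is assumed to be an iterable of (tag, value) pairs and the number of blocks follows the preset labels:

```python
def partition_by_tag(records, num_blocks):
    """Divide the second data into data blocks by hashing the preset label."""
    blocks = [[] for _ in range(num_blocks)]
    for tag, value in records:
        blocks[hash(tag) % num_blocks].append((tag, value))
    return blocks
```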
- the data storage module 1230 is configured to: determine a third address of the second storage area; in the case that the third address is outside the access range of the mapping server, map the third address to a fourth address, where the fourth address is within the access range of the mapping server; and store the plurality of data blocks in the second storage area according to the fourth address.
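- The map-side store path mirrors the read path shown earlier; the sketch below is again an assumption-laden illustration, with `segments` simulating the global memory and `map_address` standing in for the address-mapping instruction:

```python
def shuffle_write(blocks, third_address, access_range, segments, map_address):
    """Store the data blocks in the second storage area via memory operations."""
    start, end = access_range
    addr = third_address
    if not (start <= addr < end):         # third address outside the mapping server's range
        addr = map_address(addr)          # mapped fourth address inside the range
    for offset, block in enumerate(blocks):
        segments[addr + offset] = block   # stand-in for a memory store instruction
    return addr
```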
- the mapping server further includes: a metadata determination module, configured to determine the metadata of the multiple data blocks; and a metadata storage module, configured to store the metadata of the multiple data blocks in a preset first storage area.
- the mapping server further includes: a second registration module, configured to, after the mapping server is connected to the distributed processing system, register the mapping server through a preset registration instruction, so that the memory of the mapping server is added to the global memory.
- the mapping server further includes: a memory management apparatus, configured to, when a first memory satisfies a first condition, determine first target data from the data stored in the first memory and store the first target data in an external storage area, where the first condition is that the space already used by the first memory is greater than or equal to a first threshold, or that the ratio of the space already used by the first memory to the total space of the first memory is greater than or equal to a second threshold, and the first memory is the global memory or a part of the global memory.
- the above-mentioned memory management apparatus is further configured to, when the first memory satisfies a second condition, determine second target data from the data stored in the external storage area and store the second target data in the first memory, where the second condition is that the space already used by the first memory is less than or equal to a third threshold, or that the ratio of the space already used by the first memory to the total space of the first memory is less than or equal to a fourth threshold.
- the external storage area includes, but is not limited to, at least one of the following: HDD and SSD.
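- As a sketch only, and with all thresholds, ratios and callbacks being assumed values rather than ones fixed by this application, the policy of the memory management apparatus can be written as:

```python
def manage_first_memory(used, total, spill, fetch_back,
                        first_threshold=None, second_ratio=0.8,
                        third_threshold=None, fourth_ratio=0.7):
    """One evaluation of the first/second conditions for the first memory."""
    first_condition = (first_threshold is not None and used >= first_threshold) \
        or used / total >= second_ratio
    second_condition = (third_threshold is not None and used <= third_threshold) \
        or used / total <= fourth_ratio

    if first_condition:
        spill()        # move first target data (e.g. lowest-priority blocks) to the HDD/SSD
    elif second_condition:
        fetch_back()   # bring second target data back into the first memory
```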
- An embodiment of the present application provides a data processing apparatus, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method when executing the instructions.
- Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the above method.
- Embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
- a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
- the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the foregoing.
- Computer readable program instructions or code described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
- the computer program instructions used to perform the operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
- in some embodiments, electronic circuits, such as programmable logic circuits, Field-Programmable Gate Arrays (FPGA), or Programmable Logic Arrays (PLA), are personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
- These computer readable program instructions may be provided to a processor of a general purpose computer, a special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
- These computer readable program instructions may also be stored in a computer readable storage medium; these instructions cause a computer, a programmable data processing apparatus and/or other devices to operate in a specific manner, so that the computer readable medium on which the instructions are stored comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
- Computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, so that a series of operational steps are performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
- each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in hardware that performs the corresponding functions or actions (for example, a circuit or an ASIC (Application Specific Integrated Circuit)), or can be implemented by a combination of hardware and software, such as firmware.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data processing method, applied to a reduction server in a distributed processing system; the distributed processing system includes a plurality of mapping servers and a plurality of reduction servers, and the memory of the plurality of mapping servers and the memory of the plurality of reduction servers constitute a global memory. The method includes: obtaining, from a preset first storage area, metadata of first data to be read; determining, according to the metadata of the first data, a first address of the first data in the global memory; and reading the first data from the global memory according to the first address. The embodiments of the present application can read, in a memory fashion, the data stored in the global memory during the shuffle stage, thereby improving the processing efficiency of the shuffle stage.
Description
本申请要求于2021年4月14日提交中国专利局、申请号为202110401463.9、发明名称为“一种基于全局大内存系统的shuffle方法”的中国专利申请的优先权,以及要求于2021年6月08日提交中国专利局、申请号为202110638812.9、发明名称为“一种数据处理的方法、装置和系统”的中国专利申请的优先权,以及要求于2021年7月19日提交中国专利局、申请号为202110812926.0、发明名称为“数据处理方法、装置、归约服务器及映射服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及计算机技术领域,尤其涉及一种数据处理方法、装置、归约服务器及映射服务器。
近年来,以大数据、物联网、人工智能、第五代移动通信技术(5th generation mobile networks,5G)为核心特征的数字化浪潮正席卷全球,由此产生了海量数据。
在相关技术中,对海量数据进行处理时,通常采用分布式高并发计算框架,将待处理数据划分为若干数据块,通过不同计算节点并发进行运算。由于整个数据处理过程可能分为若干步骤,在一个步骤的输入数据来源于前一个步骤的多个计算节点的运算结果的情况下,必然涉及到大量数据在计算节点间的传输。而受单个计算节点的内存容量有限、计算节点间网络传输时延大、带宽小等因素的影响,计算节点间的数据传输效率较低。
发明内容
有鉴于此,提出了一种数据处理技术方案。
第一方面,本申请的实施例提供了一种数据处理方法,所述方法应用于分布式处理系统中的归约服务器,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述方法包括:从预设的第一存储区域,获取待读取的第一数据的元数据;根据所述元数据,确定所述第一数据在所述全局内存中的第一地址;根据所述第一地址,从所述全局内存中读取所述第一数据,其中,所述第一数据包括第二数据的多个数据块中的目标数据块,所述第二数据包括相应的映射服务器对输入数据的处理结果。
根据本申请的实施例,分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址,并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
根据第一方面,在所述数据处理方法的第一种可能的实现方式中,所述根据所述第一地址,从所述全局内存中读取所述第一数据,包括:在所述第一地址位于所述归约服务器的访问范围之外的情况下,将所述第一地址映射为第二地址,所述第二地址位于所述归约服务器的访问范围内;根据所述第二地址,从所述全局内存中读取所述第一数据。
在本实施例中,通过在第一地址位于归约服务器的访问范围之外的情况下进行地址映射,使得归约服务器可以从全局内存中读取位于远端的第一数据。
根据第一方面或第一方面的第一种可能的实现方式,在所述数据处理方法的第二种可能的实现方式中,所述方法还包括:在所述归约服务器连接到所述分布式处理系统后,所述归约服务器通过预设的注册指令进行注册,以使所述归约服务器的内存加入所述全局内存。
在本实施例中,通过归约服务器的注册,能够对加入分布式处理系统的归约服务器的内存进行统一管理,从而实现对全局内存的管理。
第二方面,本申请的实施例提供了一种数据处理方法,所述方法应用于分布式处理系统中的映射服务器,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述方法包括:对输入数据进行处理,得到第二数据;根据预设标签,将所述第二数据划分为多个数据块;将所述多个数据块存储到第二存储区域,所述第二存储区域位于所述全局内存中。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
根据第二方面,在所述数据处理方法的第一种可能的实现方式中,所述将所述多个数据块存储到第二存储区域,包括:在需要对多个数据块中的数据进行排序的情况下,根据预设的第二尺寸,将第二存储区域划分为多个子区域;按照子区域的顺序,将所述多个数据块存储到所述多个子区域中;在将所述多个数据块依次存储到所述多个子区域期间,通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序,所述有序索引链表通过链表链接数据的位置索引的方式进行排序。
在本实施例中,通过异步流水线(pipeline)的方式执行数据写入与排序,并在排序时使用有序索引链表,不仅能够边写边排序,实现在写入时直接排序,而且能够去除单独排序时的数据拷贝环节,减少内存占用,从而可以提高洗牌阶段中洗牌写入的处理效率。此外,通过这种方式,还可以将写入及排序合并为一个步骤,减少处理步骤,提高洗牌阶段的处理效率。
根据第二方面,在所述数据处理方法的第二种可能的实现方式中,所述映射服务器包括对所述输入数据进行处理的至少一个第一算子,所述方法通过所述映射服务器上的第一运算进程实现,所述方法还包括:在所述第一运算进程的初始化阶段,根据所述映射服务器的处理器核的数量,向所述全局内存申请所述第二存储区域,以使每个处理器核对应一个第二存储区域,其中,所述每个处理器核上运行至少一个第一算子。
在本实施例中,能够在映射服务器上的第一运算进程的初始化阶段,根据映射服务器的处理器核的数量,向全局内存申请第二存储区域,以使每个处理器核对应一个第二存储区域,其中,每个处理器核上运行至少一个第一算子,从而可以将运行在同一处理器核上的至少一个算子,看作一个洗牌写入者,并在全局内存中为该洗牌写入者分配存储空间,使得运行在 同一个处理器核上的至少一个算子的处理结果中标签相同的数据存储在全局内存的同一区域,实现基于处理器核的数据汇聚,减少数据分散,进而提高数据读取效率。
根据第二方面,在所述数据处理方法的第三种可能的实现方式中,所述根据预设标签,将所述第二数据划分为多个数据块,包括:根据预设标签,通过哈希方式,将所述第二数据划分为多个数据块。
在本实施例中,通过哈希方式将第二数据划分为多个数据块,能够在对第二数据分块前无需进行排序,从而可提高第二数据分块的处理效率。
根据第二方面,在所述数据处理方法的第四种可能的实现方式中,所述将所述多个数据块存储到第二存储区域,包括:确定第二存储区域的第三地址;在所述第三地址位于所述映射服务器的访问范围之外的情况下,将所述第三地址映射为第四地址,所述第四地址位于所述映射服务器的访问范围内;根据所述第四地址,将所述多个数据块存储到所述第二存储区域。
在本实施例中,通过在第三地址位于映射服务器的访问范围之外的情况下进行地址映射,能够实现映射服务器对位于远端的第二存储区域的访问。
根据第二方面或第二方面的第一种可能的实现方式至第二方面的第四种可能的实现方式中的任一种,在所述数据处理方法的第五种可能的实现方式中,所述方法还包括:确定所述多个数据块的元数据;将所述多个数据块的元数据存储到预设的第一存储区域。
在本实施例中,通过确定多个数据块的元数据,并将该元数据存储到第一存储区域,使得归约服务器能够从第一存储区域获取待读取数据的元数据,例如标签、存储地址等。
根据第二方面或第二方面的第一种可能的实现方式至第二方面的第五种可能的实现方式中的任一种,在所述数据处理方法的第六种可能的实现方式中,所述方法还包括:在所述映射服务器连接到所述分布式处理系统后,所述映射服务器通过预设的注册指令进行注册,以使所述映射服务器的内存加入所述全局内存。
在本实施例中,通过映射服务器的注册,能够对加入分布式处理系统的映射服务器的内存进行统一管理,从而实现对全局内存的管理。
根据第二方面或第二方面的第一种可能的实现方式至第二方面的第六种可能的实现方式中的任一种,在所述数据处理方法的第七种可能的实现方式中,所述方法还包括:
当第一内存满足第一条件时,从所述第一内存存储的数据中确定第一目标数据,将所述第一目标数据存储至外部存储区域,所述第一条件为所述第一内存已经使用的空间大于或等于第一阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值大于或等于第二阈值,所述第一内存为所述全局内存或者所述全局内存的部分内存。
在本实施例中,通过内存管理装置对第一内存的管理,能够在第一内存的存储空间不够时,将第一内存中的部分数据存储至外部存储区域,以腾出空间存储待存储数据,避免第一内存无法存储大量数据,导致应用出现无法正常运行或者运行效率低的情况。
根据第二方面或第二方面的第一种可能的实现方式至第二方面的第七种可能的实现方式中的任一种,在所述数据处理方法的第八种可能的实现方式中,所述方法还包括:
当所述第一内存满足第二条件时,从所述外部存储区域存储的数据中确定第二目标数据,将所述第二目标数据存储至所述第一内存,所述第二条件为所述第一内存已经使用的空间小于或等于第三阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值小于或等于第四阈值。
在本实施例中,通过内存管理装置对第一内存的管理,能够在第一内存的可用存储空间 变大时,将存储至外部存储区域的数据取回第一内存,以便需要读取该部分数据的规约服务器可以从全局内存中读取到对应数据,而不是从外部存储区域读取,提高数据读取效率。
根据第二方面或第二方面的第七种可能的实现方式或第二方面的第八种可能的实现方式,在所述数据处理方法的第九种可能的实现方式中,所述外部存储区域包括以下至少一种:硬盘驱动器(hard disk drive,HDD)、固态硬盘(solid state disk,SSD)。
在本实施例中,外部存储区域包括HDD和/或SSD,能够持久化存储数据。
第三方面,本申请的实施例提供了一种归约服务器,所述归约服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述归约服务器包括:元数据读取模块,用于从预设的第一存储区域,获取待读取的第一数据的元数据;地址确定模块,用于根据所述元数据,确定所述第一数据在所述全局内存中的第一地址;数据读取模块,用于根据所述第一地址,从所述全局内存中读取所述第一数据,其中,所述第一数据包括第二数据的多个数据块中的目标数据块,所述第二数据包括相应的映射服务器对输入数据的处理结果。
根据本申请的实施例,分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址,并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
根据第三方面,在所述归约服务器的第一种可能的实现方式中,所述数据读取模块,被配置为:在所述第一地址位于所述归约服务器的访问范围之外的情况下,将所述第一地址映射为第二地址,所述第二地址位于所述归约服务器的访问范围内;根据所述第二地址,从所述全局内存中读取所述第一数据。
在本实施例中,通过在第一地址位于归约服务器的访问范围之外的情况下进行地址映射,使得归约服务器可以从全局内存中读取位于远端的第一数据。
根据第三方面或第三方面的第一种可能的实现方式,在所述归约服务器的第二种可能的实现方式中,所述归约服务器还包括:第一注册模块,用于在所述归约服务器连接到所述分布式处理系统后,所述归约服务器通过预设的注册指令进行注册,以使所述归约服务器的内存加入所述全局内存。
在本实施例中,通过归约服务器的注册,能够对加入分布式处理系统的归约服务器的内存进行统一管理,从而实现对全局内存的管理。
第四方面,本申请的实施例提供了一种映射服务器,所述映射服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述映射服务器包括:数据处理模块,用于对输入数据进行处理,得到第二数据;数据划分模块,用于根据预设标签,将所述第二数据划分为多个数据块;数据存储模块,用于将所述多个数据块存储到第二存储区域,所述第二存储区域位于所述全局内存中。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得 到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
根据第四方面,在所述映射服务器的第一种可能的实现方式中,所述数据存储模块,被配置为:在需要对多个数据块中的数据进行排序的情况下,根据预设的第二尺寸,将第二存储区域划分为多个子区域;按照子区域的顺序,将所述多个数据块存储到所述多个子区域中;在将所述多个数据块依次存储到所述多个子区域期间,通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序,所述有序索引链表通过链表链接数据的位置索引的方式进行排序。
在本实施例中,通过异步流水线(pipeline)的方式执行数据写入与排序,并在排序时使用有序索引链表,不仅能够边写边排序,实现在写入时直接排序,而且能够去除单独排序时的数据拷贝环节,减少内存占用,从而可以提高洗牌阶段中洗牌写入的处理效率。此外,通过这种方式,还可以将写入及排序合并为一个步骤,减少处理步骤,提高洗牌阶段的处理效率。
根据第四方面,在所述映射服务器的第二种可能的实现方式中,所述映射服务器还包括:初始化模块,用于在第一运算进程的初始化阶段,根据所述映射服务器的处理器核的数量,向所述全局内存申请所述第二存储区域,以使每个处理器核对应一个第二存储区域,其中,所述第一运算进程运行在所述映射服务器上,用于对所述输入数据进行处理,所述每个处理器核上运行至少一个第一算子,所述第一算子用于对所述输入数据进行处理。
在本实施例中,能够在映射服务器的第一运算进程的初始化阶段,根据映射服务器的处理器核的数量,向全局内存申请第二存储区域,以使每个处理器核对应一个第二存储区域,其中,每个处理器核上运行至少一个第一算子,从而可以将运行在同一处理器核上的至少一个算子,看作一个洗牌写入者,并在全局内存中为该洗牌写入者分配存储空间,使得运行在同一个处理器核上的至少一个算子的处理结果中标签相同的数据存储在全局内存的同一区域,实现基于处理器核的数据汇聚,减少数据分散,进而提高数据读取效率。
根据第四方面,在所述映射服务器的第三种可能的实现方式中,所述数据划分模块,被配置为:根据预设标签,通过哈希方式,将所述第二数据划分为多个数据块。
在本实施例中,通过哈希方式将第二数据划分为多个数据块,能够在对第二数据分块前无需进行排序,从而可提高第二数据分块的处理效率。
根据第四方面,在所述映射服务器的第四种可能的实现方式中,所述数据存储模块,被配置为:确定第二存储区域的第三地址;在所述第三地址位于所述映射服务器的访问范围之外的情况下,将所述第三地址映射为第四地址,所述第四地址位于所述映射服务器的访问范围内;根据所述第四地址,将所述多个数据块存储到所述第二存储区域。
在本实施例中,通过在第三地址位于映射服务器的访问范围之外的情况下进行地址映射,能够实现映射服务器对位于远端的第二存储区域的访问。
根据第四方面或第四方面的第一种可能的实现方式至第四方面的第四种可能的实现方式中的任一种,在所述映射服务器的第五种可能的实现方式中,所述映射服务器还包括:元数据确定模块,用于确定所述多个数据块的元数据;元数据存储模块,用于将所述多个数据块的元数据存储到预设的第一存储区域。
在本实施例中,通过确定多个数据块的元数据,并将该元数据存储到第一存储区域,使 得归约服务器能够从第一存储区域获取待读取数据的元数据,例如标签、存储地址等。
根据第四方面或第四方面的第一种可能的实现方式至第四方面的第五种可能的实现方式中的任一种,在所述映射服务器的第六种可能的实现方式中,所述映射服务器还包括:第二注册模块,用于在所述映射服务器连接到所述分布式处理系统后,所述映射服务器通过预设的注册指令进行注册,以使所述映射服务器的内存加入所述全局内存。
在本实施例中,通过映射服务器的注册,能够对加入分布式处理系统的映射服务器的内存进行统一管理,从而实现对全局内存的管理。
根据第四方面或第四方面的第一种可能的实现方式至第四方面的第六种可能的实现方式中的任一种,在所述数据处理方法的第七种可能的实现方式中,所述映射服务器还包括:
内存管理装置,用于当第一内存满足第一条件时,从所述第一内存存储的数据中确定第一目标数据,将所述第一目标数据存储至外部存储区域,所述第一条件为所述第一内存已经使用的空间大于或等于第一阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值大于或等于第二阈值,所述第一内存为所述全局内存或者所述全局内存的部分。
在本实施例中,通过内存管理装置对第一内存的管理,能够在第一内存的存储空间不够时,将第一内存中的部分数据存储至外部存储区域,以腾出空间存储待存储数据,避免第一内存无法存储大量数据,导致应用出现无法正常运行或者运行效率低的情况。
根据第四方面或第四方面的第一种可能的实现方式至第四方面的第七种可能的实现方式中的任一种,在所述数据处理方法的第八种可能的实现方式中,所述内存管理装置,还用于:
当所述第一内存满足第二条件时,从所述外部存储区域存储的数据中确定第二目标数据,将所述第二目标数据存储至所述第一内存,所述第二条件为所述第一内存已经使用的空间小于或等于第三阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值小于或等于第四阈值。
在本实施例中,通过内存管理装置对第一内存的管理,能够在第一内存的可用存储空间变大时,将存储至外部存储区域的数据取回第一内存,以便需要读取该部分数据的规约服务器可以从全局内存中读取到对应数据,而不是从外部存储区域读取,提高数据读取效率。
根据第四方面或第四方面的第七种可能的实现方式或第四方面的第八种可能的实现方式,在所述数据处理方法的第九种可能的实现方式中,所述外部存储区域包括以下至少一种:HDD、SSD。
在本实施例中,外部存储区域包括HDD和/或SSD,能够持久化存储数据。
第五方面,本申请的实施例提供了一种数据处理装置,包括处理器及用于存储处理器可执行指令的存储器,其中,所述处理器被配置为执行所述指令时实现上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的数据处理方法,或者实现上述第二方面或者第二方面的多种可能的实现方式中的一种或几种的数据处理方法。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址, 并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
第六方面,本申请的实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的数据处理方法,或者实现上述第二方面或者第二方面的多种可能的实现方式中的一种或几种的数据处理方法。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址,并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
第七方面,本申请的实施例提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的数据处理方法,或者执行上述第二方面或者第二方面的多种可能的实现方式中的一种或几种的数据处理方法。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址,并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
本申请的这些和其他方面在以下(多个)实施例的描述中会更加简明易懂。
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本申请的示例性实施例、特征和方面,并且用于解释本申请的原理。
图1示出一种映射归约框架的示意图;
图2示出一种洗牌阶段的处理过程的示意图;
图3示出根据本申请一实施例的分布式处理系统的示意图;
图4示出根据本申请一实施例的数据处理方法的流程图;
图5示出根据本申请一实施例的数据处理方法中数据写入时排序的示意图;
图6示出根据本申请一实施例的数据处理方法的流程图;
图7示出根据本申请一实施例的内存管理装置对全局内存进行管理的流程图;
图8示出根据本申请一实施例的数据处理方法的软件架构的示意图;
图9示出根据本申请一实施例的映射服务器的运算进程的初始化示意图;
图10示出根据本申请一实施例的数据处理方法的处理过程的示意图;
图11示出根据本申请一实施例的归约服务器的框图;
图12示出根据本申请一实施例的映射服务器的框图。
以下将参考附图详细说明本申请的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。
目前,对海量数据的处理及分析通常采用分布式高并发计算框架,例如海杜普映射归约(hadoop map-reduce,HadoopMR)框架、Spark等。分布式高并发计算框架通过多个计算节点对待处理数据进行并发运算。
图1示出一种映射归约框架的示意图。如图1所示,映射归约框架的数据处理过程包括映射(map)阶段及归约(reduce)阶段,其中,可将映射阶段中算子的输出到归约阶段中算子的输入的这段处理过程,称为洗牌(shuffle)阶段。洗牌阶段可包括映射阶段中算子的输出数据的保存、分块、归约阶段中输入数据的拷贝/拉取、合并、排序等。
可以认为,洗牌阶段将上一个阶段的计算结果通过洗牌分散到下一阶段用于计算或存储结果的物理节点上。
参考图1,待处理数据110存储在海杜普分布式文件系统(hadoop distributed file system,HDFS)100上。对待处理数据110进行处理时,可首先进行数据切分,将待处理数据110切分成数据块A1、数据块A2、数据块A3及数据块A4这4个数据块;然后将4个数据块输入4个映射计算节点(分别为map1、map2、map3及map4,执行相同的处理)进行处理:将数据块A1输入map1处理,将数据块A2输入map2处理,将数据块A3输入map3处理,将数 据块A4输入map4处理,得到相应的处理结果;根据与归约阶段的3个归约计算节点相对应的标签121、122、123,将各个映射计算节点的处理结果划分为3个数据块并进行存储,完成映射阶段的处理。
在映射阶段完成后,进入归约阶段。各个归约计算节点从相应的映射计算节点拷贝/拉取数据并进行处理:第1个归约计算节点(包括sort1和reduce1)从各个映射计算节点拉取标签为121的数据块,在获取数据块后,通过sort1对数据块中的数据进行排序,并将排序后的数据输入reduce1进行处理,得到处理结果数据块B1;第2个归约计算节点(包括sort2和reduce2)从各个映射计算节点拉取标签为122的数据块,在获取数据块后,通过sort2对数据块中的数据进行排序,并将排序后的数据输入reduce2进行处理,得到处理结果数据块B2;第3个归约计算节点(包括sort3和reduce3)从各个映射计算节点拉取标签为123的数据块,并在获取数据块后,通过sort3对数据块中的数据进行排序,将排序后的数据输入reduce3进行处理,得到处理结果数据块B3。然后将数据块B1、数据块B2及数据块B3存储在HDFS 130上。
在图1所示的映射归约框架中,映射计算节点与归约计算节点之间通过网络连接。受限于内存容量,各个映射计算节点的处理结果需要存储在本地磁盘,归约计算节点需要从相应的映射计算节点的磁盘中读取数据,再通过网络传输到本地后进行处理。
图2示出一种洗牌阶段的处理过程的示意图。如图2所示,待处理数据为数据块201及数据块202,分别将数据块201及数据块202输入映射计算节点211及映射计算节点212进行处理。映射计算节点211通过算子1对数据块201进行处理,得到处理结果,然后进行洗牌写入(shuffle write),将算子1的处理结果进行分区并存储到内存1,在内存1存满之后,执行溢出(spill)操作,将内存1中的数据存储在磁盘1上,重复该过程,直到数据块201处理完成。同时,在进行洗牌写入时,将描述算子1的处理结果所在磁盘文件信息的元数据存储到映射输出管理(MapOutTracker)单元221。
映射计算节点212通过算子2对数据块202进行处理,得到处理结果,然后进行洗牌写入(shuffle write),将算子2的处理结果进行分区并存储到内存2,在内存2存满之后,执行溢出(spill)操作,将内存2中的数据存储在磁盘2上,重复该过程,直到数据块202处理完成。同时,在进行洗牌写入时,将描述算子2的处理结果所在磁盘文件信息的元数据也存储到映射输出管理单元221。其中,算子1与算子2执行相同的处理。
归约计算节点231运行时,首先从映射输出管理单元221处获取待读取数据的元数据,并根据该元数据,进行洗牌读取(shuffle read),通过网络分别从映射计算节点211的磁盘1及映射计算节点212的磁盘2读取相应数据,然后通过算子3对数据进行处理,得到输出结果241。
由上述示例可知,在现有映射归约框架的洗牌阶段,各个映射计算节点的处理结果存储在磁盘,归约计算节点需要从多个映射计算节点的磁盘中读取数据,并通过网络传输到本地后进行处理。在该过程中,缓慢的磁盘读写严重影响数据传输效率,且短时间内大量数据在计算节点间通过网络传输,网络的物理带宽及传输绝对时延也对数据传输效率有较大影响,进而影响洗牌阶段的处理效率。
在一些技术方案中,为了缓解内存容量压力及网络传输压力,在洗牌阶段(shuffle阶段),通常会对需要传输的中间数据(即映射计算节点的处理结果)进行序列化、排序、压缩、下硬盘、网络传输、解压缩、反序列化等处理。该方式虽然可以部分缓解内存容量压力及网络传输压力,但同时也带来了冗余开销,而且也不能从根本上解决内存容量受限及数据传输效 率较低的问题,也就不能有效提高洗牌阶段的处理效率。
此外,通过网络传输数据时,即在归约计算节点通过网络从映射计算节点拷贝/拉取数据时,使用传输控制协议(transmission control protocol,TCP)及网际互连协议(Internet Protocol,IP)。传输过程中,数据需要经过TCP/IP协议栈的二次拷贝:映射计算节点的处理器(central processing unit,CPU)跨态(从用户态到内核态)将应用层的数据拷贝到TCP内核发送缓存区,通过网络适配器(即网卡)发送给归约计算节点;归约计算节点的CPU也跨态(从内核态到应用态)将通过网络适配器(即网卡)接收的数据从TCP内核接收缓存区拷贝至应用层。数据在TCP/IP协议栈的二次拷贝,会耗费大量CPU时间,导致传输绝对时延较高(通常在10ms级别),进而影响数据传输效率。
在一些技术方案中,受限于计算节点的内存容量,使用了在内存中缓存部分高频使用的临时数据的折中方法,虽然该方法对某些特定场景有加速作用,但是洗牌阶段的中间数据仍然需要保存在磁盘,内存加速受限,而且也不能解决内存容量受限的问题。
在一些技术方案中,为了提高网络传输效率,在洗牌阶段的数据传输过程中使用了远程直接数据存取(Remote Direct Memory Access,RDMA)技术。映射计算节点的数据可以通过RDMA网卡直接到达归约计算节点的应用层,去除了数据在TCP/IP协议栈的二次拷贝,减少了时间开销和CPU占用率。然而,该方法中数据需要跨节点拷贝,内存占用增加(内存占用为中间数据的两倍),且洗牌阶段的中间数据仍以文件形式存储,数据传输基于开销较大的输入/输出(input/output,IO)语义,即存在文件系统调用开销,因此,与内存访问相比,计算节点间通过RDMA进行数据传输,仍然存在相对较高的传输绝对时延。
为了解决上述技术问题,本申请提供了一种数据处理方法。本申请实施例的数据处理方法应用于分布式处理系统,能够基于分布式处理系统中多个计算节点的内存互联构成的全局内存,通过内存操作实现洗牌阶段的数据读写,从而在洗牌阶段无需对中间数据进行拷贝传输,使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够基于高效的内存语义,以内存方式对洗牌阶段的数据进行读写,提高洗牌阶段的处理效率,进而提高分布式处理系统对海量数据的处理效率。
在一种可能的实现方式中,所述分布式处理系统可包括服务器集群、数据中心等用于处理海量数据的分布式系统。本申请对分布式处理系统的具体类型不作限制。
在一种可能的实现方式中,所述分布式处理系统可包括多个映射服务器及多个归约服务器。所述多个映射服务器及所述多个归约服务器用于数据处理。
在一种可能的实现方式中,所述分布式处理系统可包括至少一个洗牌阶段。在任一洗牌阶段,各个归约服务器的输入数据来源于多个映射服务器的输出数据,可将多个映射服务器看作前端,将多个归约服务器看作后端,前端的输出数据作为后端的输入数据。
在一种可能的实现方式中,分布式处理系统中的多个映射服务器的内存及多个归约服务器的内存可以通过系统总线、高速串行计算机扩展(peripheral component interconnect express,PCIE)总线、GEN-Z总线、RDMA等内存互联方式进行连接,以使得多个映射服务器的内存及多个归约服务器的内存构成全局内存。分布式处理系统中的各个映射服务器及各个归约服务器可以以内存方式(使用内存操作指令)访问全局内存。本申请对内存互联的具体方式不作限制。
在一种可能的实现方式中,在映射服务器连接到分布式处理系统后,映射服务器可通过预设的注册指令进行注册,以使该映射服务器的内存加入全局内存。例如,预设的注册指令为register指令,映射服务器连接到分布式处理系统,且其内存通过系统总线与分布式处理系 统中其他服务器(包括映射服务器及归约服务器)的内存互联后,该映射服务器可向系统总线发送register指令(即预设的注册指令),在系统总线上进行注册,以使该映射服务器的内存加入全局内存。系统总线还可向该映射服务器发送注册完成、注册成功等确认指令,以使该映射服务器获得访问全局内存的权限。
在一种可能的实现方式中,在归约服务器连接到分布式处理系统后,归约服务器也可通过预设的注册指令进行注册,以使该归约服务器的内存加入全局内存。例如,预设的注册指令为register指令,归约服务器连接到分布式处理系统,且其内存通过系统总线与分布式处理系统中其他服务器(包括映射服务器及归约服务器)的内存互联后,该归约服务器可向系统总线发送register指令(即预设的注册指令),在系统总线上进行注册,以使该归约服务器的内存加入全局内存。系统总线还可向该归约服务器发送注册完成、注册成功等确认指令,以使该归约服务器获得访问全局内存的权限。
通过这种方式,能够对加入分布式处理系统的归约服务器及映射服务器的内存进行统一管理,从而实现对全局内存的统一管理。
在一种可能的实现方式中,构建全局内存后,还可建立全局内存与各个映射服务器及各个归约服务器的内存的地址映射关系,以便数据读写时进行地址映射。
在一种可能的实现方式中,映射服务器的内存及归约服务器的内存均为多级内存。多级内存可包括双倍速率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,DDR SDRAM,也可简称DDR)、动态随机存取存储器(dynamic random access memory,DRAM)、傲腾内存(optane memory)或者其他以内存方式访问的存储器中的至少两种。其中,傲腾内存也可称为AEP(Apache Pass)内存。
在一种可能的实现方式中,可根据内存的读写速度,构建多级内存。读写速度越快,内存级别越高。例如,DDR的读写速度比傲腾内存的读写速度快,可将多级内存设置为“DDR+傲腾内存”,在该多级内存的使用过程中,优先使用DDR,DDR存满后,使用傲腾内存。类似地,也可将多级内存设置为“DRAM+傲腾内存”。本领域技术人员可根据实际情况对多级内存进行设置,本申请对此不作限制。
在映射服务器的内存及归约服务器的内存均为多级内存的情况下,通过多个映射服务器的内存与多个归约服务器的内存互联构成的全局内存,也为多级内存。
图3示出根据本申请一实施例的数据处理方法的应用场景的示意图。如图3所示,数据处理方法应用于分布式处理系统,该分布式处理系统包括2个映射服务器(分别为映射服务器311及映射服务器321)和1个归约服务器331。其中,映射服务器311的多级内存314(DDR+AEP+其他内存)、映射服务器321的多级内存324(DDR+AEP+其他内存)、归约服务器331的多级内存334(DDR+AEP+其他内存)通过系统总线314进行连接,构成全局内存。也就是说,图3中的全局内存340包括多级内存314、多级内存324及多级内存334。
参考图3,该分布式处理系统对数据块301及数据块302进行处理。数据块301输入映射服务器311,映射服务器311执行映射任务(maptask)315:通过算子312对数据块301进行处理,得到处理结果,之后执行洗牌写入313,通过内存操作指令,将该处理结果写入全局内存340;类似地,数据块302输入映射服务器321,映射服务器的321执行映射任务(maptask)325:通过算子322对数据块302进行处理,得到处理结果,之后执行洗牌写入323,通过内存操作指令,将该处理结果写入全局内存340。其中,算子312为对输入数据(数据块301)进行处理的第一算子,算子322为对输入数据(数据块302)进行处理的第一算子。
映射服务器311及映射服务器321处理完成后,归约服务器331执行归约任务(reducetask) 335:首先执行洗牌读取333,通过内存操作指令,从全局内存340中读取数据,然后通过算子332对读取的数据进行处理,得到输出结果341。其中,算子332为运行在归约服务器上的、对映射服务器的处理结果进行处理的第二算子。
需要说明的是,以上仅以2个映射服务器及1个归约服务器作为示例,对分布式处理系统及全局内存进行了示例性说明。本领域技术人员应当理解,分布式处理系统可包括多个映射服务器及多个归约服务器,本申请对分布式处理系统中映射服务器的数量及归约服务器的数量不作限制。
图4示出根据本申请一实施例的数据处理方法的流程图。如图4所示,所述方法应用于分布式处理系统中的映射服务器,所述方法包括:
步骤S401,对输入数据进行处理,得到第二数据。
在一种可能的实现方式中,分布式处理系统可包括多个映射服务器。对于待处理的海量数据,分布式处理系统可根据映射服务器的数量,对待处理的海量数据进行分割,例如,可通过分割函数split(),将待处理的海量数据切分为多个待处理数据块;然后将一个或多个待处理数据块作为映射服务器的输入数据。
例如,映射服务器的数量为4,待处理数据块的数量也为4,则为每个映射服务器分配1个待处理数据块作为输入数据;若映射服务器的数量为4,待处理数据块的数量为8,则为每个映射服务器分配2个待处理数据块作为输入数据。
映射服务器接收到输入数据后,可通过第一算子,对输入数据进行格式转换、数据筛选或计算等处理,得到第二数据。也就是说,第二数据为映射服务器对输入数据的处理结果。
例如,待处理的海量数据为X国的人口档案,分布式处理系统需要按省对人口进行分析及统计,可将待处理的海量数据划分为多个待处理数据块,并将多个待处理数据块作为分布式处理系统中映射服务器的输入数据;映射服务器可从输入数据中提取预设的人口关键信息,例如姓名、出生日期、户籍所在地、居住地等信息,得到第二数据。
步骤S402,根据预设标签,将第二数据划分为多个数据块。
在一种可能的实现方式中,预设标签可根据使用场景及待处理的海量数据的关键词进行预先设置。例如,待处理的海量数据为X国的人口档案,分布式处理系统需要按省对人口进行分析及统计,该场景下,可将户籍所在省作为预设标签,预设标签的数量与X国的省市总数量一致。
在一种可能的实现方式中,确定预设标签时,还可考虑分布式处理系统中归约服务器的数量。例如,可首先根据归约服务器的数量,确定预设标签的数量,然后再将预设标签与待处理的海量数据的关键词相对应。
需要说明的是,还可通过其他方式对预设标签进行设置,本申请对预设标签的设置方式及设置依据均不作限制。
在一种可能的实现方式中,得到第二数据后,可在步骤S402中,根据预设标签,通过查找、匹配、哈希等方式,将第二数据划分为多个数据块。例如,假设预设标签为户籍所在省,预设标签的数量为10个,可根据户籍所在省,将第二数据划分为10个数据块。可选的,在第二数据中不包括某个或某些省的数据的情况下,将第二数据划分后得到的数据块的数量会小于10。
在一种可能的实现方式中,将第二数据划分多个数据块时,可通过哈希(hash)方式,从第二数据中选取数据,将第二数据划分为多个数据块。通过这种方式,在对第二数据分块前无需进行排序,从而可提高第二数据分块的处理效率。
步骤S403,将多个数据块存储到第二存储区域,第二存储区域位于全局内存中。
在一种可能的实现方式中,在映射服务器将多个数据块存储到第二存储区域前,可根据第二数据的尺寸,通过内存分配指令,例如allocate(size)指令,其中size表示第二数据的尺寸,向全局内存申请存储空间,申请成功后,将申请到的存储空间作为存储第二数据的第二存储区域。通过这种方式,能够根据第二数据的尺寸动态申请第二存储空间,从而能够节省内存空间,提高内存使用率。
在一种可能的实现方式中,映射服务器也可根据预设的第一尺寸,在全局内存中为第二数据预先分配第二存储区域。在第二数据的尺寸大于第二存储区域的情况下,即第二存储区域的空间不够的情况下,映射服务器再根据实际需要,动态申请存储空间。通过这种方式,能够预先为第二数据分配第二存储空间,减少运行中动态申请存储空间的次数,从而能够提高处理效率。
由于第二存储区域位于全局内存中,那么第二存储区域可能位于本地(映射服务器的物理内存中),也可能位于远端(其他服务器的物理内存中)。在将多个数据块存储到第二存储区域时,可确定第二存储区域的第三地址,并判断第三地址是否位于映射服务器的访问范围内。若第三地址位于映射服务器的访问范围内,则无需进行地址映射,映射服务器可直接通过写数据指令,例如store指令,将多个数据块存储到第二存储区域。
若第三地址位于映射服务器的访问范围之外,则需要进行地址映射。可根据预设的地址映射关系,通过地址映射指令,例如map指令,将第三地址映射为位于映射服务器的访问范围内的第四地址;然后根据第四地址,将多个数据块存储到第二存储区域。
通过在第三地址位于映射服务器的访问范围之外的情况下进行地址映射,能够实现映射服务器对位于远端的第二存储区域的访问。
在一种可能的实现方式中,映射服务器中对输入数据进行处理的第一算子与第二存储区域对应,即每个第一算子对应一个第二存储区域。在需要对多个数据块中的数据进行排序的情况下,可根据预设的第二尺寸,将第二存储区域划分为多个子区域,并按照子区域的顺序,将多个数据块存储到多个子区域中;在将多个数据块依次存储到多个子区域期间,可通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序。其中,有序索引链表通过链表链接数据的位置索引的方式进行排序。
可将这种写入时排序的方式看作异步流水线(pipeline)方式。通过这种方式,能够在将多个数据块中的数据写入第二存储区域的同时,对写入的数据进行排序,实现写入时直接排序,即边写边排序。下面将结合图5对数据写入时排序的处理过程进行示例性说明。
图5示出根据本申请一实施例的数据处理方法中数据写入时排序的示意图。如图5所示,可根据预设的第二尺寸,将与映射服务器的第一算子(即映射任务)551对应第二存储区域(位于全局内存中)550划分为10个子区域或内存切片(slice),分别为子区域560至569。其中,子区域560-564已存储完成,即第一算子551已完成对子区域560-564的洗牌写入(shuffle write),且子区域560-564上存储的数据已排序;子区域565已存储完成,但其上存储的数据并未进行排序;子区域566-569为未使用的空白区域。
第一算子551继续执行洗牌写入(shuffle write),可从第二存储区域550中按照子区域的顺序,选取第一个空白的子区域566,以独占方式执行数据写入,并同时通过位置数组等方式,为写入的每条数据(或记录)建立位置索引。子区域566写满之后,第一算子551可选取下一个空白的子区域567继续执行数据写入,并通知(例如通过消息等方式)排序线程(sorter)570子区域566写入完成。
在第一算子551对子区域566进行写入的同时,排序线程570可通过有序索引链表,对子区域565上的数据以及子区域560-564上已排序的数据,进行归并排序,以使子区域560-565上存储的数据实现整体排序。可选地,排序线程570可根据子区域565的位置索引依次读取数据,然后通过桶排序等方式,对读取的数据和已排序的数据(即子区域560-564上的数据)进行归并排序,并更新有序索引链表,得到排序结果。其中,排序时使用有序索引链表,能够使得排序过程中不发生数据的拷贝。
排序线程570对子区域565排序完成后,在接收到子区域566写入完成后的通知后,以类似的方式,对子区域566上存储的数据以及子区域560-565上已排序的数据进行归并排序。
可选的,在映射服务器运行在Java平台时,在排序时,可通过native sorter进行堆外的桶排序。由于运行在Java虚拟机(java virtual machine,JVM)上的堆内归并排序存在执行速度慢、内存溢出会引发磁盘IO、排序算法效率低等问题,通过native sorter进行堆外的桶排序,能够有效提高排序效率。
通过异步流水线(pipeline)的方式执行数据写入与排序,并在排序时使用有序索引链表,不仅能够边写边排序,实现在写入时直接排序,而且能够去除单独排序时的数据拷贝环节,减少内存占用,从而可以提高洗牌阶段中洗牌写入的处理效率。此外,通过这种方式,还可以将写入及排序合并为一个步骤,减少处理步骤,提高洗牌阶段的处理效率。
根据本申请的实施例,分布式处理系统中的映射服务器,能够对输入数据进行处理,得到第二数据,并根据预设标签,将第二数据划分为多个数据块,然后将多个数据块存储到位于全局内存中第二存储区域,从而可以在洗牌阶段,将映射服务器的处理结果(即第二数据)存储在全局内存中,不仅能够避免缓慢的磁盘读写,而且能够使得洗牌阶段的处理过程不受映射服务器的内存容量的限制,进而提高洗牌阶段的处理效率及处理性能。
在一种可能的实现方式中,在步骤S403之后,所述方法还可包括:确定所述多个数据块的元数据;将所述多个数据块的元数据存储到预设的第一存储区域。
其中,元数据可包括多个数据块的属性信息。每个数据块的属性信息包括该数据块在全局内存中的存储地址,可选的,每个数据块的属性信息还可包括该数据块的标签、尺寸(即大小)等中的至少一种。本领域技术人员可根据实际对元数据的具体内容进行设置,本申请对此不作限制。
在将多个数据块存储到第二存储区域之后,可确定多个数据块的元数据,并将该元数据存储到预设的第一存储区域。第一存储区域可位于全局内存中,也可位于多个映射服务器及多个归约服务器均可访问的其他内存中。本申请对第一存储区域的具体位置不作限制。
通过确定多个数据块的元数据,并将该元数据存储到第一存储区域,使得归约服务器能够从第一存储区域获取待读取数据的元数据,例如标签、存储地址等。
图6示出根据本申请一实施例的数据处理方法的流程图。如图6所示,所述方法应用于分布式处理系统中的归约服务器,所述方法包括:
步骤S601,从预设的第一存储区域,获取待读取的第一数据的元数据。
在一种可能的实现方式中,第二数据包括分布式处理系统中位于前端的映射服务器对输入数据的处理结果,而作为后端的任一归约服务器,其待读取的第一数据可包括第二数据的多个数据块中的目标数据块。也就是说,第一数据是第二数据的多个数据块中由该归约服务器处理的目标数据块。
在映射阶段,根据预设标签,第二数据被划分为多个数据块存储在全局内存中,多个数据块的元数据存储在第一存储区域。在归约阶段,归约服务器可根据预设标签中与待处理的 第一数据对应的目标标签,从第一存储区域中获取第一数据中包括的目标数据块的元数据。
在第一数据中包括的目标数据块为多个的情况下,可从第一存储区域,分别获取各个目标数据块的元数据。
步骤S602,根据第一数据的元数据,确定第一数据在全局内存中的第一地址。
获取第一数据的元数据后,可从该元数据中,得到第一数据在全局内存中的第一地址(即存储地址)。在第一数据包括的目标数据块为多个的情况下,可分别从各个目标数据块的元数据中,确定其在全局内存中的第一地址。
步骤S603,根据第一地址,从全局内存中读取第一数据。
由于第一地址位于全局内存中,那么第一地址可能位于本地(归约服务器的物理内存中),也可能位于远端(其他服务器的物理内存中)。在从全局内存中读取第一数据时,可判断第一地址是否位于归约服务器的访问范围内。若第一地址位于归约服务器的访问范围内,则无需进行地址映射,归约服务器可直接通过读数据指令,例如load指令,从全局内存中读取第一数据。
若第一地址位于归约服务器的访问范围之外,则需要进行地址映射。可根据预设的地址映射关系,通过地址映射指令,例如map指令,将第一地址映射为位于归约服务器的访问范围内的第二地址;然后根据第二地址,通过内存指令,例如load指令,从全局内存中读取第一数据。
通过在第一地址位于归约服务器的访问范围之外的情况下进行地址映射,使得归约服务器可以从全局内存中读取位于远端的第一数据。
根据本申请的实施例,分布式处理系统中的归约服务器,能够从第一存储区域,获取待读取的第一数据的元数据,其中,第一数据包括第二数据的多个数据块中的目标数据块,第二数据包括相应的映射服务器对输入数据的处理结果,然后根据该元数据,确定第一数据在全局内存中的第一地址,并根据第一地址,从全局内存中读取第一数据,从而能够在归约服务器从多个映射服务器的处理结果中读取包括目标数据块的输入数据(第一数据)时,无需对目标数据块进行拷贝传输,而是以内存方式对存储在全局内存中的目标数据块直接进行读取,不仅使得洗牌阶段的处理过程不受计算节点的内存容量、传输网络的物理带宽、传输时延等因素的限制,而且能够提高洗牌阶段的处理效率及处理性能,进而提高分布式处理系统的处理效率。
可以理解,在具体实现中,全局内存并不是无限大,若多个映射服务器并行执行图4所示的方法,那么多个映射服务器很可能要存储大量的数据到全局内存,而全局内存的存储空间有限,很可能会出现全局内存无法存储大量数据,导致应用出现无法正常运行或者运行效率低的情况。
为了避免出现上述情况,本申请提供一种内存管理装置,该装置可以对全局内存进行管理,在全局内存的存储空间不够时,将全局内存中的部分数据存储至外部存储区域,以腾出空间存储待存储数据;在全局内存的可用存储空间变大时,再将存储至外部存储区域的数据取回全局内存,以便需要读取该部分数据的规约服务器可以从全局内存中读取到对应数据,而不是从外部存储区域读取,提高数据读取效率。
在本申请具体的实施例中,上述外部存储区域包括但不限于以下至少一种:HDD、SSD或者其他类型的硬盘。可选地,外部存储区域还可以是多个映射服务器以及多个规约服务器共享的HDFS等。
在本申请具体的实施例中,上述内存管理装置可以为部署在单个映射服务器上的一个软 件模块,可选地,内存管理装置也可以由映射服务器的硬件实现,例如,由映射服务器中一个处理器实现内存管理装置的功能。此时,内存管理装置可以管理全局内存的全部空间,也可以只管理全局内存分配给该内存管理装置所在的映射服务器的内存空间。
可选地,内存管理装置也可以为一个单独的设备,也可以是部署在一个单独的设备上的软件模块等。此时,内存管理装置可以管理全局内存的全部空间。
参见图7,内存管理装置对全局内存进行管理的具体过程包括:
S701、内存管理装置获取第一内存已经使用的空间。
当内存管理装置用于管理全局内存的全部内存空间时,第一内存为全局内存的全部内存空间,当内存管理装置用于管理全局内存分配给该内存管理装置所在的映射服务器上的内存空间时,第一内存为全局内存分配给该内存管理装置所在的映射服务器的内存空间,即全局内存的部分内存。
S702、内存管理装置确定第一内存是否满足第一条件,当确定第一内存满足第一条件的情况下,执行S703,当确定第一内存不满足第一条件的情况下,执行S704。
S703、内存管理装置从第一内存存储的数据中确定第一目标数据,将第一目标数据存储至外部存储区域。
其中,第一条件为第一内存已经使用的空间大于或等于第一阈值,例如,假设第一内存的总空间为100M,则第一阈值可以为80M,或者,第一条件为第一内存已经使用的空间与第一内存的总空间的比值大于或等于第二阈值,例如,假设第一内存的总空间为100M,则第二阈值可以为80%,第一阈值、第二阈值可以根据实际情况进行设置,此处不做具体限定。
在本申请具体的实施例中,内存管理装置可以根据第一内存存储的数据的优先级从第一内存存储的数据中确定第一目标数据,具体地,第一目标数据可以为第一内存存储的数据中优先级低的部分数据。
内存管理装置在将第一目标数据存储至外部存储区域时,可以根据数据的优先级进行存储,即低优先级的数据先出第一内存,高优先级的数据后出第一内存。
第一内存存储的数据的优先级可以通过以下方式中任意一种体现:
方式1、由数据所属的子区域ID体现。
具体地,在映射服务器向第一内存申请第二存储区域时,可以根据规约服务器的数量,向第一内存申请包括预设数量的子区域的第二存储区域,每个子区域对应一个唯一的规约服务器,即上述预设数量与规约服务器的数量相同,且每个子区域对应一个唯一的标识(identification,ID),可选地,ID小的子区域可以对应需要先启动数据读取任务的规约服务器,ID大的子区域对应后启动数据读取任务的规约服务器,或者,ID大的子区域对应需要先启动数据读取任务的规约服务器,ID小的子区域对应后启动数据读取任务的规约服务器。
映射服务器在向第二存储区域存储数据时,可以根据子区域ID将与不同规约服务器对应的数据存储到与每个规约服务器对应的子区域,后续规约服务器可以根据子区域ID从相应的子区域中读取对应的数据。例如,标识为001的规约服务器对应标识为1的子区域,该规约服务器可以从标识为1的子区域中读取数据,标识为002的规约服务器对应标识为2的子区域,该规约服务器可以从标识为2的子区域中读取数据。
在映射服务器根据子区域ID将数据存储到对应的子区域的情况下,第一内存存储的数据的优先级由数据所属的子区域ID体现,具体地,在ID小的子区域对应需要先启动数据读取任务的规约服务器,ID大的子区域对应后启动数据读取任务的规约服务器时,第一内存存储的数据的优先级可以为ID小的子区域中的数据的优先级高于ID大的子区域中的数据的优先 级,在ID大的子区域对应需要先启动数据读取任务的规约服务器,ID小的子区域对应后启动数据读取任务的规约服务器,第一内存存储的数据的优先级可以为ID大的子区域中的数据的优先级高于ID小的子区域中的数据的优先级。
在第一内存存储的数据的优先级为ID小的子区域中的数据的优先级高于ID大的子区域中的数据的优先级的情况下,内存管理装置从第一内存中确定的第一目标数据可以为ID大于第一预设ID的子区域中的数据,第一预设ID可以根据实际情况进行设置,例如,假设第一内存包括ID为1至ID为10的子区域,第一预设ID为8,则第一目标数据包括ID为9的子区域中的数据和ID为10的子区域中的数据。
在第一内存存储的数据的优先级为ID大的子区域中的数据的优先级高于ID小的子区域中的数据的优先级的情况下,内存管理装置从第一内存中确定的第一目标数据可以为ID小于第二预设ID的子区域中的数据,第二预设ID可以根据实际情况进行设置,例如,假设第一内存包括ID为1至ID为10的子区域,第二预设ID为3,则第一目标数据包括ID为1的子区域中的数据和ID为2的子区域中的数据。
方式2、由数据存储到第一内存的先后顺序体现。
具体地,数据的优先级可以为先存储到第一内存的数据的优先级低于后存储到第一内存的数据的优先级,或者,为先存储到第一内存的数据的优先级高于后存储到第一内存的数据的优先级。
以上述第一种情况为例,第一目标数据可以为第一内存存储的数据中先存储的预设数量的数据,预设数量可以根据实际情况进行设置,例如,假设第一内存先后存储了100个数据,预设数量为10,则第一目标数据为上述100个数据中先存储到第一内存中的10个数据。
方式3、由数据的数据量大小体现。
具体地,数据的优先级可以为数据量大的数据的优先级高于数据量小的数据的优先级,或者,为数据量大的数据的优先级低于数据量小的数据的优先级。
以上述第一种情况为例,第一目标数据为第一内存存储的数据中数据量小于或等于预设数据量的数据,预设数据量可以根据实际情况进行设置,例如,假设第一内存存储了100个数据,预设数据量为10KB,则第一目标数据包括上述100个数据中数据量小于或等于10KB的数据。
需要说明的是,上述所列举的几种体现数据的优先级方式仅仅是作为一种示例,不应视为具体限定。
S704、内存管理装置确定第一内存是否满足第二条件,当确定第一内存满足第二条件时,执行S705,当确定第一内存不满足第二条件时,执行S701。
S705、内存管理装置从外部存储区域存储的数据中确定第二目标数据,将第二目标数据存储至第一内存。
其中,第二条件为第一内存已经使用的空间小于或等于第三阈值,例如,假设第一内存的总空间为100M,则第三阈值可以为70M,或者,第二条件为第一内存已经使用的空间与第一内存的总空间的比值小于或等于第四阈值,例如,假设第一内存的总空间为100M,则第四阈值可以为70%,第三阈值、第四阈值可以根据实际情况进行设置,此处不作具体限定。
第三阈值可小于第一阈值,第四阈值可小于第二阈值,可选地,第三阈值可等于第一阈值,第四阈值可等于第二阈值。
在第三阈值等于第一阈值或者第四阈值等于第二阈值的情况下,内存管理装置可执行S703或者S705。
在本申请具体的实施例中,与内存管理装置根据数据的优先级将第一内存中的数据存储在外部存储区域对应,内存管理装置也可以根据数据的优先级将外部存储区域中的数据存储至第一内存,即高优先级的数据先出外部存储区域,低优先级的数据后出外部存储区域,内存管理装置从外部存储区域存储的数据中确定第二目标数据的过程与上文所述内存管理装置从第一内存存储的数据中确定第一目标数据的过程相类似,可以参考上文相关描述,此处不再展开赘述。
可以理解,内存管理装置在将第一目标数据存储至外部存储区域的过程中,第一内存的可用存储空间会逐渐变大,为了避免过多的数据被存储至外部存储区域,第一内存中剩余的数据过少,导致规约服务器读取数据时,需要从外部存储区域进行读取,因此,内存管理装置在将第一目标数据存储至外部存储区域的过程中,可以监测第一内存的情况,在确定第一内存满足第三条件时,停止进行将第一目标数据存储至外部存储区域的操作。其中,第三条件为第一内存已经使用的空间等于第五阈值,或者,为第一内存已经使用的空间与第一内存的总空间的比值等于第六阈值,第五阈值小于第一阈值,第六阈值小于第二阈值,第五阈值、第六阈值可以根据实际情况进行设置,此处不作具体限定。
根据本申请的实施例可以看出,通过内存管理装置对第一内存进行管理,可以解决第一内存的空间有限时导致应用无法正常运行或者运行效率低的问题。
在本申请具体的实施例中,内存管理装置在将第一目标数据存储至外部存储区域时,还可以确定第一目标数据的元数据,以及确定第一内存中剩余的数据的元数据,并将第一目标数据的元数据以及第一内存中剩余的数据的元数据更新至预设的第一存储区域,其中,第一目标数据的元数据可以包括第一目标数据的属性信息。第一目标数据的属性信息包括数据在外部存储区域中的存储地址,可选地,还可以包括数据的标签、尺寸等中的至少一种,第一内存中剩余的数据的元数据可以包括剩余的数据的属性信息,剩余数据的属性信息包括数据在第一内存中的存储地址,可选地,还可以包括数据的标签、尺寸等中的至少一种,本领域技术人员可根据实际情况对第一目标数据的元数据以及剩余数据的元数据的具体内容进行设置,本申请对此不作限制。
通过确定第一目标数据的元数据以及第一内存中剩余的数据的元数据,并将上述元数据存储到第一存储区域,使得规约服务器能够从第一存储区域获取待读取数据的元数据,例如标签、尺寸、存储地址等。
可以理解,通过内存管理装置对第一内存的管理,步骤S601中规约服务器待读取的第一数据可能如步骤S601所述全部位于全局内存,也可能全部位于外部存储区域,还可能部分位于全局内存,部分位于外部存储区域。
在第一数据全部位于全局内存的情况下,规约服务器从全局内存中读取第一数据的过程可以参考图6所示方法流程。
在第一数据全部位于外部存储区域的情况下,那么第一数据的地址可能位于本地,也可能位于远端。在从外部存储区域读取第一数据时,可判断第一数据的地址是否位于本地,若第一数据的地址位于本地,则规约服务器可以直接从本地读取第一数据,若第一数据的地址位于远端,则规约服务器可以向内存管理装置发送包括第一数据的地址的读数据请求,请求读取第一数据,内存管理装置在接收到读数据请求后,可以根据第一数据的地址从远端的外部存储区域中查找到第一数据,然后将第一数据从远端的外部存储区域存储到全局内存,并将第一数据在全局内存中的地址返回至规约服务器,使得规约服务器可以从全局内存中读取第一数据。
在第一数据中的部分数据位于全局内存以及部分数据位于外部存储区域的情况下,规约服务器从预设的第一存储区域,获取的第一数据的元数据包括第一部分数据的元数据和第二部分数据的元数据,其中,第一部分数据表示第一数据中位于全局内存的数据,第二部分数据表示第一数据中位于外部存储区域的数据。
在获取第一部分数据的元数据和第二部分数据的元数据后,可从第一部分数据的元数据中,得到第一部分数据在全局内存的地址,从第二部分数据的元数据中,得到第二部分数据在外部存储区域的地址,然后从全局内存中读取第一部分数据,从外部存储区域读取第二部分数据。规约服务器从全局内存读取第一部分数据的过程与上文所述规约服务器从全局内存中读取第一数据的过程相类似,规约服务器从外部存储区域读取第二部分数据的过程与上文所述规约服务器从外部存储区域读取第一数据的过程相类似,可以参考上文相关描述。
在本申请具体的实施例中,内存管理装置将第一目标数据从第一内存存储至外部存储区域的操作和映射服务器向第一内存存储数据的操作可以并行执行,内存管理装置将第一目标数据从第一内存存储至外部存储区域的操作和规约服务器从全局内存中读取数据的操作可以并行执行,内存管理装置将第二目标数据从外部存储区域存储至第一内存的操作和映射服务器向第一内存存储数据的操作可以并行执行,内存管理装置将第二目标数据从外部存储区域存储至第一内存的操作和规约服务器从全局内存中读取数据的操作也可以并行执行,以提高数据处理的效率。
在具体实现中,内存管理装置可以采用异步或者同步的传输方式将第一目标数据从第一内存存储至外部存储区域,或者,将第二目标数据从外部存储区域存储至第一内存。
图8示出根据本申请一实施例的数据处理方法的软件架构的示意图。如图8所示,该数据处理方法可应用于分布式处理系统的洗牌阶段,可通过洗牌管理组件(shuffle manager)811和812、洗牌写入者(shuffle writer)812、洗牌读取者(shuffle writer)822、数据管理组件(shuffle)830及全局内存840实现。
其中,洗牌管理组件可提供对外的shuffle功能接口(例如与读写相关的全局内存接口,或与全局内存的其他操作相关的接口,例如将第一目标数据从第一内存存储至外部存储区域的接口),上层软件注册后可使用。洗牌管理组件可以通过插件(plugin)的形式,与多个开源软件无缝兼容。
参考图8,洗牌管理组件811部署在分布式处理系统中的映射服务器810上,洗牌写入者(shuffle writer)812,例如映射任务maptask,可通过洗牌管理组件811提供的功能接口,使用基于内存语义的操作指令,将待写入的数据(例如maptask的处理结果)写入全局内存840。
洗牌管理组件821部署在分布式处理系统中的归约服务器820上,洗牌读取者(shuffle reader)822,例如归约任务reducetask,可通过洗牌管理组件821提供的功能接口,使用基于内存语义的操作指令,从全局内存840中读取reducetask的输入数据。
数据管理组件830与分布式处理系统中映射服务器810上部署的洗牌管理组件811以及归约服务器820上部署的洗牌管理组件821均存在交互。数据管理组件830对中间数据/临时数据进行管理,并提供元数据服务。例如,数据管理组件830可为部署在映射服务器810上的洗牌管理组件811提供元数据写入服务;数据管理组件830可为部署在归约服务器820上的洗牌管理组件821提供元数据读取服务。
洗牌写入者812用于执行与向全局内存840写入数据相关的内存操作。例如,可执行内存申请(例如allocate(size)指令)、地址映射(例如map指令)、解除映射(例如unmap指 令)、内存释放(例如release指令)、写数据(例如store指令)等基于内存语义的操作指令。
洗牌读取者822用于执行与从全局内存840读取数据相关的内存操作。例如,可执行内存申请(例如allocate(size)指令)、地址映射(例如map指令)、解除映射(例如unmap指令)、内存释放(例如release指令)、读数据(例如load指令)等基于内存语义的操作指令,以及与全局内存拷贝相关的内存语义。
需要说明的是,以上仅以1个映射服务器及1个归约服务器作为示例,对本申请的实施例的数据处理方法的软件架构进行了示例性说明。分布式处理系统中的多个映射服务器及归约服务器均可使用图8所示的软件架构。
在一种可能的实现方式中,上述软件架构可通过Java、C++、Python等多种编程语言实现,本申请对此不作限制。可选的,在上述软件架构通过Java实现时,上述组件(包括洗牌管理组件811和821、洗牌写入者812、洗牌读取者822、数据管理组件830)均可运行在Java虚拟机(Java Virtual Machine,JVM)上。
在一种可能的实现方式中,在通过上述软件架构实现本申请实施例的数据处理方法时,可在映射服务器的第一运算进程的初始化阶段,可根据预设的第一尺寸,向全局内存申请第二存储空间,用于存储映射服务器的处理结果,即存储shuffle阶段的中间数据。其中,第一运算进程用于执行映射服务器上的数据处理任务,可包括多个映射任务线程。
可选的,映射服务器可包括对输入数据进行处理的至少一个第一算子,在第一运算进程的初始化阶段,可根据预设的第一尺寸及映射服务器的处理器核的数量,向全局内存申请第二存储区域,以使每个处理器核对应一个第二存储区域,其中,每个处理器核上运行至少一个第一算子。
图9示出根据本申请一实施例的映射服务器的第一运算进程的初始化示意图。如图9所示,第一运算进程910及第一运算进程920运行在不同的映射服务器上。每个映射服务器的处理器包括2个内核(CPU core)。
其中,第一运算进程910包括算子911、算子912、算子913及算子914这4个算子。算子911、算子912、算子913及算子914均为运行在映射服务器上的、用于对输入数据进行处理的第一算子(即maptask)。其中,算子911及算子912以并发方式运行在处理器的一个内核上,算子913及算子914以并发方式运行在处理器的另一个内核上。
第一运算进程920包括算子921、算子922、算子923及算子924这4个算子。算子921、算子922、算子923及算子924均为运行在映射服务器上的、用于对输入数据进行处理的第一算子(即maptask)。其中,算子921及算子922以并发方式运行在处理器的一个内核上,算子923及算子924以并发方式运行在处理器的另一个内核上。
第一运算进程910及第一运算进程920初始化时,可根据预设的第一尺寸及映射服务器的处理器核的数量,向全局内存930申请存储空间(用于存储映射服务器的处理结果)作为第二存储区域,以使每个处理器核对应一个第二存储区域。映射服务器的每个处理器核上可运行至少一个第一算子。
也就是说,可将运行在同一处理器核上至少一个的算子,看作一个洗牌写入者(shuffle writer),即每个处理器核对应一个洗牌写入者,并根据预设的第一尺寸,在全局内存930中分别为各个洗牌写入者申请存储空间,作为与各个处理器核(或各个洗牌写入者)对应的第二存储区域。
参考图9,第一运算进程910将运行在同一处理器核上的算子911及算子912,看作洗牌写入者writer915,将运行在同一处理器核上的算子913及算子914,看作洗牌写入者writer916。
第一运算进程910为writer915在全局内存930中申请的第二存储区域包括9个缓存,分别为3个缓存A、3个缓存B及3个缓存C。其中,缓存A用于存储算子911及算子912的处理结果中标签为key1的数据,即对算子911及算子912的处理结果中标签为key1的数据进行汇聚;缓存B用于存储算子911及算子912的处理结果中标签为key2的数据,即对算子911及算子912的处理结果中标签为key2的数据进行汇聚;缓存C用于存储算子911及算子912的处理结果中标签为key3的数据,即对算子911及算子912的处理结果中标签为key3的数据进行汇聚。
第一运算进程910为writer916在全局内存930中申请的第二存储区域包括9个缓存,分别为3个缓存D、3个缓存E及3个缓存F。其中,缓存D用于存储算子913及算子914的处理结果中标签为key1的数据,即对算子913及算子914的处理结果中标签为key1的数据进行汇聚;缓存E用于存储算子913及算子914的处理结果中标签为key2的数据,即对算子913及算子914的处理结果中标签为key2的数据进行汇聚;缓存F用于存储算子913及算子914的处理结果中标签为key3的数据,即对算子913及算子914的处理结果中标签为key3的数据进行汇聚。
类似地,与处理器核相对应,第一运算进程920将运行在同一处理器核上的算子921及算子922,看作洗牌写入者writer925,将运行在同一处理器核上的算子923及算子924,看作洗牌写入者writer926。
第一运算进程920为writer925在全局内存930中申请的第二存储区域包括9个缓存,分别为3个缓存G、3个缓存H及3个缓存J。其中,缓存G用于存储算子921及算子922的处理结果中标签为key1的数据,即对算子921及算子922的处理结果中标签为key1的数据进行汇聚;缓存H用于存储算子921及算子922的处理结果中标签为key2的数据,即对算子921及算子922的处理结果中标签为key3的数据进行汇聚;缓存J用于存储算子921及算子922的处理结果中标签为key3的数据,即对算子921及算子922的处理结果中标签为key3的数据进行汇聚。
第一运算进程920为writer926在全局内存930中申请的第二存储区域包括9个缓存,分别为3个缓存K、3个缓存L及3个缓存M。其中,缓存K用于存储算子923及算子924的处理结果中标签为key1的数据,即对算子923及算子924的处理结果中标签为key1的数据进行汇聚;缓存L用于存储算子923及算子924的处理结果中标签为key2的数据,即对算子923及算子924的处理结果中标签为key2的数据进行汇聚;缓存M用于存储算子923及算子924的处理结果中标签为key3的数据,即对算子923及算子924的处理结果中标签为key3的数据进行汇聚。
在第一运算进程910及第一运算进程920处理完成后,运行在归约服务器上的归约任务(reducetask)940、归约任务(reducetask)950及归约任务(reducetask)960,分别从全局内存930中读取数据。
具体地,归约任务(reducetask)940从全局内存930中读取标签为key1的数据,即分别从缓存A、缓存D、缓存G、缓存K中读取数据;归约任务(reducetask)950从全局内存930中读取标签为key2的数据,即分别从缓存B、缓存E、缓存H、缓存L中读取数据;归约任务(reducetask)960从全局内存930中读取标签为key3的数据,即分别从缓存C、缓存F、缓存J、缓存M中读取数据。
需要说明的是,以上仅以2个映射服务器作为示例,对本申请的实施例的映射服务器的第一运算进程的初始化进行了示例性说明。分布式处理系统中的其他映射服务器也通过类似 的方式进行初始化。
通过这种方式,能够在映射服务器上的第一运算进程的初始化阶段,根据映射服务器的处理器核的数量,向全局内存申请第二存储区域,以使每个处理器核对应一个第二存储区域,其中,每个处理器核上运行至少一个第一算子,从而可以将运行在同一处理器核上的至少一个算子(例如算子911及算子912),看作一个洗牌写入者,并在全局内存中为该洗牌写入者分配存储空间,使得运行在同一个处理器核上的至少一个算子的处理结果中标签相同的数据存储在全局内存的同一区域,实现基于处理器核的数据汇聚,减少数据分散,进而提高数据读取效率。
图10示出根据本申请一实施例的数据处理方法的处理过程的示意图。如图10所示,分布式处理系统包括映射服务器1010及归约服务器1020。映射服务器1010的多级内存1012包括DRAM+AEP,归约服务器1020的多级内存1022也包括DRAM+AEP,多级内存1012及多级内存1022通过系统总线1040连接。映射服务器1010及归约服务器1020通过预设的注册命令注册后,多级内存1012及多级内存1022构成全局内存。
映射服务器1010上的第一运算进程1015用于对输入数据进行处理。第一运算进程1015可包括多个用于执行映射任务的线程(与第一算子对应的线程),即maptask线程。第一运算进程1015可向部署在映射服务器1010上的洗牌管理组件1011注册后,作为洗牌写入者的maptask线程,可通过洗牌管理组件1011提供的功能接口,将待写入的数据写入全局内存。
在第一运算进程1015的初始化阶段,第一运算进程1015可根据预设的第一尺寸,使用图9所示的方式,向全局内存申请存储空间(也可称为内存空间、缓存空间等),作为第二存储区域。初始化完成后,第一运算进程1015中的多个maptask线程可对输入数据进行处理,得到第二数据(即映射服务器1010的处理结果),例如,多条<键,值>记录。第一运算进程1015可根据第二数据的尺寸,可判断在初始化阶段申请的存储空间是否够用。在初始化阶段申请的存储空间不够的情况下,第一运算进程1015还可通过全局内存接口,向全局内存动态申请存储空间,并将新申请的存储空间映射到的访问范围内,以使存储空间可以被第一运算进程1015访问。
得到第二数据后,maptask线程可根据预设标签,通过哈希方式,将第二数据划分为多个数据块,并作为洗牌写入者,通过洗牌管理组件1011提供的功能接口,使用内存操作指令,将多个数据块存储在申请好的存储空间(即第二存储区域)中。
如果需要对写入的数据进行排序,则可通过异步流水线(pipeline)方式(参考图5),利用排序线程1013,在写入数据的同时进行排序。假设多个数据块存储在映射服务器1010的多级内存1012中的DRAM上,在多个数据块存储完成后,第一运算进程1015可向数据管理组件1030发送存储的多个数据块的元数据,以使数据管理组件1030对元数据进行存储。
归约服务器1020上的第二运算进程1025用于从映射服务器1010及其他映射服务器(图中未示出)的处理结果(即第二数据)中,读取第一数据(第二数据的多个数据块中的目标数据块),并对读取的第一数据进行处理。第二运算进程1025可包括多个用于执行归约任务的线程(与第二算子对应的线程),即reducetask线程。第二运算进程1025可向部署在映射服务器1020上的洗牌管理组件1021注册后,作为洗牌读取者的reducetask线程,可通过洗牌管理组件1021提供的功能接口,使用内存操作指令,从全局内存中读取数据。
在读取待处理的第一数据时,第二运算进程1025中的各个reducetask线程可从数据管理组件1030获取第一数据的元数据,根据该元数据,确定第一数据的存储地址,并将该存储地址映射到第二运算进程1025的访问范围内。第二运算进程1025可在本地申请相应的内存, 然后根据映射后的存储地址,直接通过内存读数据命令(例如load命令),将存储在映射服务器1010的多级内存1012中的DRAM上数据读取到归约服务器1020的本地内存中进行处理。
在一种可能的实现方式中,第二运算进程1025也可异步执行内存拷贝(gather memory copy),把分散存储在不同远端内存的数据一次性硬拷贝到本地内存中,以便进行后续处理。
在一种可能的实现方式中,第一运算进程1015还可监测第一内存的情况:
如果确定第一内存满足第一条件,则可从第一内存存储的数据中确定第一目标数据,并通过洗牌管理组件1011提供的功能接口,将第一目标数据存储至外部存储区域。在第一运算进程1015将第一目标数据存储至外部存储区域后,第一运算进程1015可向数据管理组件1030发送第一目标数据的元数据以及第一内存存储的数据中除第一目标数据之外的剩余数据的元数据,以使数据管理组件1030对外部存储区域以及第一内存存储的数据的元数据进行存储。
如果第一运算进程1015确定第一内存满足第二条件,第一运算进程1015可从外部存储区域存储的数据中确定第二目标数据,并通过洗牌管理组件1011提供的功能接口,将第二目标数据存储至第一内存。在第一运算进程1015将第二目标数据存储至第一内存后,第一运算进程1015可向数据管理组件1030发送第二目标数据的元数据以及外部存储区域存储的数据中除第二目标数据之外的剩余数据的元数据,以使数据管理组件1030对外部存储区域以及第一内存存储的数据的元数据进行存储。
需要说明的是,以上仅以1个映射服务器及1个归约服务器作为示例,对本申请的实施例的数据处理方法的处理过程进行了示例性说明。应当理解,分布式处理系统可包括多个映射服务器及多个归约服务器,其处理过程与此类似,此处不再赘述。
在一种可能的实现方式中,对全局内存中的远端内存直接执行读写命令(例如load/store命令)时,与读写本地内存相比,仍然存在较大开销。基于shuffle阶段中间数据的排序结果可以预先获取,可通过构建内存地址列表(memory address list)等方式,对数据进行预存取,从而可以提高远端内存的读写效率。
本申请的实施例所述的数据处理方法,应用于分布式处理系统,能够基于分布式处理系统中多个计算节点的内存互联构成的全局内存,通过内存操作实现洗牌阶段的数据读写,不仅能够充分利用海量内存,还能够去除旧的软件架构中冗余的数据处理环节,极大地提升了shuffle阶段的处理性能。
本申请的实施例所述的数据处理方法,基于内存互联的新硬件拓扑结构,重新定义了shuffle阶段的软件架构,使得shuffle阶段中间数据的存储、计算节点间的读写都以高效的内存操作进行,减少了shuffle阶段的处理流程,使得shuffle阶段在大数据处理中的瓶颈效应进一步减轻。
图11示出根据本申请一实施例的归约服务器的框图。该归约服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存。
如图11所示,该归约服务器包括:
元数据读取模块1110,用于从预设的第一存储区域,获取待读取的第一数据的元数据;元数据读取模块1110的功能的具体实现可参考步骤S601,这里不再赘述。
地址确定模块1120,用于根据所述元数据,确定所述第一数据在所述全局内存中的第一地址;地址确定模块1120的功能的具体实现可参考步骤S602,这里不再赘述。
数据读取模块1130,用于根据所述第一地址,从所述全局内存中读取所述第一数据,
其中,所述第一数据包括第二数据的多个数据块中的目标数据块,所述第二数据包括相应的映射服务器对输入数据的处理结果。数据读取模块1130的功能的具体实现可参考步骤S603,这里不再赘述。
在一种可能的实现方式中,所述数据读取模块1130,被配置为:在所述第一地址位于所述归约服务器的访问范围之外的情况下,将所述第一地址映射为第二地址,所述第二地址位于所述归约服务器的访问范围内;根据所述第二地址,从所述全局内存中读取所述第一数据。
在一种可能的实现方式中,所述归约服务器还包括:第一注册模块,用于在所述归约服务器连接到所述分布式处理系统后,所述归约服务器通过预设的注册指令进行注册,以使所述归约服务器的内存加入所述全局内存。
图12示出根据本申请一实施例的映射服务器的框图。该映射服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存。
如图12所示,该映射服务器包括:
数据处理模块1210,用于对输入数据进行处理,得到第二数据;数据处理模块1210的功能的具体实现可参考步骤S4.1,这里不再赘述。
数据划分模块1220,用于根据预设标签,将所述第二数据划分为多个数据块;数据划分模块1220的功能的具体实现可参考步骤S402,这里不再赘述。
数据存储模块1230,用于将所述多个数据块存储到第二存储区域,所述第二存储区域位于所述全局内存中,数据存储模块1230的功能的具体实现可参考步骤S403,这里不再赘述。
在一种可能的实现方式中,所述数据存储模块1230,被配置为:在需要对多个数据块中的数据进行排序的情况下,根据预设的第二尺寸,将第二存储区域划分为多个子区域;按照子区域的顺序,将所述多个数据块存储到所述多个子区域中;在将所述多个数据块依次存储到所述多个子区域期间,通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序,所述有序索引链表通过链表链接数据的位置索引的方式进行排序。
在一种可能的实现方式中,所述映射服务器还包括:初始化模块,用于在第一运算进程的初始化阶段,根据所述映射服务器的处理器核的数量,向所述全局内存申请所述第二存储区域,以使每个处理器核对应一个第二存储区域,其中,所述第一运算进程运行在所述映射服务器上,用于对所述输入数据进行处理,所述每个处理器核上运行至少一个第一算子,所述第一算子用于对所述输入数据进行处理。
在一种可能的实现方式中,所述数据划分模块1220,被配置为:根据预设标签,通过哈希方式,将所述第二数据划分为多个数据块。
在一种可能的实现方式中,所述数据存储模块1230,被配置为:确定第二存储区域的第三地址;在所述第三地址位于所述映射服务器的访问范围之外的情况下,将所述第三地址映射为第四地址,所述第四地址位于所述映射服务器的访问范围内;根据所述第四地址,将所述多个数据块存储到所述第二存储区域。
在一种可能的实现方式中,所述映射服务器还包括:元数据确定模块,用于确定所述多个数据块的元数据;元数据存储模块,用于将所述多个数据块的元数据存储到预设的第一存储区域。
在一种可能的实现方式中,所述映射服务器还包括:第二注册模块,用于在所述映射服务器连接到所述分布式处理系统后,所述映射服务器通过预设的注册指令进行注册,以使所述映射服务器的内存加入所述全局内存。
在一种可能的实现方式中,所述映射服务器还包括:内存管理装置,用于在所述第一内存满足第一条件时,从所述第一内存存储的数据中确定第一目标数据,将所述第一目标数据存储至外部存储区域,所述第一条件为所述第一内存已经使用的空间大于或等于第一阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值大于或等于第二阈值,所述第一内存为所述全局内存或者所述全局内存的部分内存。
在一种可能的实现方式中,上述内存管理装置,还用于在所述第一内存满足第二条件时,从所述外部存储区域存储的数据中确定第二目标数据,将所述第二目标数据存储至所述第一内存,所述第二条件为所述第一内存已经使用的空间小于或等于第三阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值小于或等于第四阈值。
在一种可能的实现方式中,所述外部存储区域包括但不限于以下至少一种:HDD、SSD。
本申请的实施例提供了一种数据处理装置,包括:处理器以及用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现上述方法。
本申请的实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述方法。
本申请的实施例提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备的处理器中运行时,所述电子设备中的处理器执行上述方法。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(Random Access Memory,RAM)、只读存储器(Read Only Memory,ROM)、可擦式可编程只读存储器(Electrically Programmable Read-Only-Memory,EPROM或闪存)、静态随机存取存储器(Static Random-Access Memory,SRAM)、便携式压缩盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、数字多功能盘(Digital Video Disc,DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。
这里所描述的计算机可读程序指令或代码可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本申请操作的计算机程序指令可以是汇编指令、指令集架构(Instruction Set Architecture,ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(Local Area Network,LAN)或广域网(Wide Area Network,WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。 在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或可编程逻辑阵列(Programmable Logic Array,PLA),该电子电路可以执行计算机可读程序指令,从而实现本申请的各个方面。
这里参照根据本申请实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本申请的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本申请的多个实施例的装置、系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。
也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行相应的功能或动作的硬件(例如电路或ASIC(Application Specific Integrated Circuit,专用集成电路))来实现,或者可以用硬件和软件的组合,如固件等来实现。
尽管在此结合各实施例对本发明进行了描述,然而,在实施所要求保护的本发明过程中,本领域技术人员通过查看所述附图、公开内容、以及所附权利要求书,可理解并实现所述公开实施例的其它变化。在权利要求中,“包括”(comprising)一词不排除其他组成部分或步骤,“一”或“一个”不排除多个的情况。单个处理器或其它单元可以实现权利要求中列举的若干项功能。相互不同的从属权利要求中记载了某些措施,但这并不表示这些措施不能组合起来产生良好的效果。
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。
Claims (29)
- 一种数据处理方法,其特征在于,所述方法应用于分布式处理系统中的归约服务器,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述方法包括:从预设的第一存储区域,获取待读取的第一数据的元数据;根据所述元数据,确定所述第一数据在所述全局内存中的第一地址;根据所述第一地址,从所述全局内存中读取所述第一数据,其中,所述第一数据包括第二数据的多个数据块中的目标数据块,所述第二数据包括相应的映射服务器对输入数据的处理结果。
- 根据权利要求1所述的方法,其特征在于,所述根据所述第一地址,从所述全局内存中读取所述第一数据,包括:在所述第一地址位于所述归约服务器的访问范围之外的情况下,将所述第一地址映射为第二地址,所述第二地址位于所述归约服务器的访问范围内;根据所述第二地址,从所述全局内存中读取所述第一数据。
- 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:在所述归约服务器连接到所述分布式处理系统后,所述归约服务器通过预设的注册指令进行注册,以使所述归约服务器的内存加入所述全局内存。
- 一种数据处理方法,其特征在于,所述方法应用于分布式处理系统中的映射服务器,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述方法包括:对输入数据进行处理,得到第二数据;根据预设标签,将所述第二数据划分为多个数据块;将所述多个数据块存储到第二存储区域,所述第二存储区域位于所述全局内存中。
- 根据权利要求4所述的方法,其特征在于,所述将所述多个数据块存储到第二存储区域,包括:在需要对多个数据块中的数据进行排序的情况下,根据预设的第二尺寸,将第二存储区域划分为多个子区域;按照子区域的顺序,将所述多个数据块存储到所述多个子区域中;在将所述多个数据块依次存储到所述多个子区域期间,通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序,所述有序索引链表通过链表链接数据的位置索引的方式进行排序。
- 根据权利要求4中所述的方法,其特征在于,所述映射服务器包括对所述输入数据进行处理的至少一个第一算子,所述方法通过所述映射服务器上的第一运算进程实现,所述方法还包括:在所述第一运算进程的初始化阶段,根据所述映射服务器的处理器核的数量,向所述全局内存申请所述第二存储区域,以使每个处理器核对应一个第二存储区域,其中,所述每个处理器核上运行至少一个第一算子。
- 根据权利要求4所述的方法,其特征在于,所述根据预设标签,将所述第二数据划分为多个数据块,包括:根据预设标签,通过哈希方式,将所述第二数据划分为多个数据块。
- 根据权利要求4所述的方法,其特征在于,所述将所述多个数据块存储到第二存储区域,包括:确定第二存储区域的第三地址;在所述第三地址位于所述映射服务器的访问范围之外的情况下,将所述第三地址映射为第四地址,所述第四地址位于所述映射服务器的访问范围内;根据所述第四地址,将所述多个数据块存储到所述第二存储区域。
- 根据权利要求4-8中任一项所述的方法,其特征在于,所述方法还包括:确定所述多个数据块的元数据;将所述多个数据块的元数据存储到预设的第一存储区域。
- 根据权利要求4-9中任一项所述的方法,其特征在于,所述方法还包括:在所述映射服务器连接到所述分布式处理系统后,所述映射服务器通过预设的注册指令进行注册,以使所述映射服务器的内存加入所述全局内存。
- 根据权利要求4-10任一项所述的方法,其特征在于,所述方法还包括:当第一内存满足第一条件时,从所述第一内存存储的数据中确定第一目标数据,将所述第一目标数据存储至外部存储区域,所述第一条件为所述第一内存已经使用的空间大于或等于第一阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值大于或等于第二阈值,所述第一内存为所述全局内存或者所述全局内存的部分内存。
- 根据权利要求4-11任一所述的方法,其特征在于,所述方法还包括:当所述第一内存满足第二条件时,从所述外部存储区域存储的数据中确定第二目标数据,将所述第二目标数据存储至所述第一内存,所述第二条件为所述第一内存已经使用的空间小于或等于第三阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值小于或等于第四阈值。
- 根据权利要求11或12所述的方法,其特征在于,所述外部存储区域包括以下至少一种:硬盘驱动器HDD、固态硬盘SSD。
- 一种归约服务器,其特征在于,所述归约服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述归约服务器包括:元数据读取模块,用于从预设的第一存储区域,获取待读取的第一数据的元数据;地址确定模块,用于根据所述元数据,确定所述第一数据在所述全局内存中的第一地址;数据读取模块,用于根据所述第一地址,从所述全局内存中读取所述第一数据,其中,所述第一数据包括第二数据的多个数据块中的目标数据块,所述第二数据包括相应的映射服务器对输入数据的处理结果。
- 根据权利要求14所述的归约服务器,其特征在于,所述数据读取模块,被配置为:在所述第一地址位于所述归约服务器的访问范围之外的情况下,将所述第一地址映射为第二地址,所述第二地址位于所述归约服务器的访问范围内;根据所述第二地址,从所述全局内存中读取所述第一数据。
- 根据权利要求14或15所述的归约服务器,其特征在于,所述归约服务器还包括:第一注册模块,用于在所述归约服务器连接到所述分布式处理系统后,所述归约服务器通过预设的注册指令进行注册,以使所述归约服务器的内存加入所述全局内存。
- 一种映射服务器,其特征在于,所述映射服务器应用于分布式处理系统,所述分布式处理系统包括多个映射服务器及多个归约服务器,所述多个映射服务器的内存及所述多个归约服务器的内存构成全局内存,所述映射服务器包括:数据处理模块,用于对输入数据进行处理,得到第二数据;数据划分模块,用于根据预设标签,将所述第二数据划分为多个数据块;数据存储模块,用于将所述多个数据块存储到第二存储区域,所述第二存储区域位于所述全局内存中。
- 根据权利要求17所述的映射服务器,其特征在于,所述数据存储模块,被配置为:在需要对多个数据块中的数据进行排序的情况下,根据预设的第二尺寸,将第二存储区域划分为多个子区域;按照子区域的顺序,将所述多个数据块存储到所述多个子区域中;在将所述多个数据块依次存储到所述多个子区域期间,通过更新有序索引链表,对存储完成的所有子区域中的数据进行排序,所述有序索引链表通过链表链接数据的位置索引的方式进行排序。
- 根据权利要求17所述的映射服务器,其特征在于,所述映射服务器还包括:初始化模块,用于在第一运算进程的初始化阶段,根据所述映射服务器的处理器核的数量,向所述全局内存申请所述第二存储区域,以使每个处理器核对应一个第二存储区域,其中,所述第一运算进程运行在所述映射服务器上,用于对所述输入数据进行处理,所述每个处理器核上运行至少一个第一算子,所述第一算子用于对所述输入数据进行处理。
- 根据权利要求17所述的映射服务器,其特征在于,所述数据划分模块,被配置为:根据预设标签,通过哈希方式,将所述第二数据划分为多个数据块。
- 根据权利要求17所述的映射服务器,其特征在于,所述数据存储模块,被配置为:确定第二存储区域的第三地址;在所述第三地址位于所述映射服务器的访问范围之外的情况下,将所述第三地址映射为第四地址,所述第四地址位于所述映射服务器的访问范围内;根据所述第四地址,将所述多个数据块存储到所述第二存储区域。
- 根据权利要求17-21中任一项所述的映射服务器,其特征在于,所述映射服务器还包括:元数据确定模块,用于确定所述多个数据块的元数据;元数据存储模块,用于将所述多个数据块的元数据存储到预设的第一存储区域。
- 根据权利要求17-22中任一项所述的映射服务器,其特征在于,所述映射服务器还包括:第二注册模块,用于在所述映射服务器连接到所述分布式处理系统后,所述映射服务器通过预设的注册指令进行注册,以使所述映射服务器的内存加入所述全局内存。
- 根据权利要求17-23任一项所述的映射服务器,其特征在于,所述映射服务器还包括:内存管理装置,用于当第一内存满足第一条件时,从所述第一内存存储的数据中确定第一目标数据,将所述第一目标数据存储至外部存储区域,所述第一条件为所述第一内存已经使用的空间大于或等于第一阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值大于或等于第二阈值,所述第一内存为所述全局内存或者所述全局内存的部分。
- 根据权利要求17-24任一所述的映射服务器,其特征在于,所述内存管理装置,还用 于:当所述第一内存满足第二条件时,从所述外部存储区域存储的数据中确定第二目标数据,将所述第二目标数据存储至所述第一内存,所述第二条件为所述第一内存已经使用的空间小于或等于第三阈值,或者,为所述第一内存已经使用的空间与所述第一内存的总空间的比值小于或等于第四阈值。
- 根据权利要求24或25所述的映射服务器,其特征在于,所述外部存储区域包括以下至少一种:HDD、SSD。
- 一种数据处理装置,其特征在于,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现权利要求1-3中任意一项所述的方法,或者实现权利要求4-13中任意一项所述的方法。
- 一种非易失性计算机可读存储介质,其上存储有计算机程序指令,其特征在于,所述计算机程序指令被处理器执行时实现权利要求1-3中任意一项所述的方法,或者,实现权利要求4-13中任意一项所述的方法。
- 一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行权利要求1-3中任意一项所述的方法,或者,执行权利要求4-13中任意一项所述的方法。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22787444.3A EP4318257A4 (en) | 2021-04-14 | 2022-04-08 | DATA PROCESSING METHOD AND APPARATUS, REDUCTION SERVER, AND MAPPING SERVER |
US18/485,847 US20240036728A1 (en) | 2021-04-14 | 2023-10-12 | Method and apparatus for processing data, reduction server, and mapping server |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110401463 | 2021-04-14 | ||
CN202110401463.9 | 2021-04-14 | ||
CN202110638812.9 | 2021-06-08 | ||
CN202110638812 | 2021-06-08 | ||
CN202110812926.0 | 2021-07-19 | ||
CN202110812926.0A CN115203133A (zh) | 2021-04-14 | 2021-07-19 | 数据处理方法、装置、归约服务器及映射服务器 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/485,847 Continuation US20240036728A1 (en) | 2021-04-14 | 2023-10-12 | Method and apparatus for processing data, reduction server, and mapping server |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022218218A1 true WO2022218218A1 (zh) | 2022-10-20 |
Family
ID=83574150
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/085771 WO2022218218A1 (zh) | 2021-04-14 | 2022-04-08 | 数据处理方法、装置、归约服务器及映射服务器 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240036728A1 (zh) |
EP (1) | EP4318257A4 (zh) |
CN (1) | CN115203133A (zh) |
WO (1) | WO2022218218A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11755445B2 (en) * | 2021-02-17 | 2023-09-12 | Microsoft Technology Licensing, Llc | Distributed virtual data tank for cross service quota management |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140059552A1 (en) * | 2012-08-24 | 2014-02-27 | International Business Machines Corporation | Transparent efficiency for in-memory execution of map reduce job sequences |
US11467967B2 (en) * | 2018-08-25 | 2022-10-11 | Panzura, Llc | Managing a distributed cache in a cloud-based distributed computing environment |
US10977189B2 (en) * | 2019-09-06 | 2021-04-13 | Seagate Technology Llc | Reducing forward mapping table size using hashing |
- 2021
  - 2021-07-19 CN CN202110812926.0A patent/CN115203133A/zh active Pending
- 2022
  - 2022-04-08 EP EP22787444.3A patent/EP4318257A4/en active Pending
  - 2022-04-08 WO PCT/CN2022/085771 patent/WO2022218218A1/zh active Application Filing
- 2023
  - 2023-10-12 US US18/485,847 patent/US20240036728A1/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130332612A1 (en) * | 2010-03-31 | 2013-12-12 | International Business Machines Corporation | Transmission of map/reduce data in a data center |
US20140358977A1 (en) * | 2013-06-03 | 2014-12-04 | Zettaset, Inc. | Management of Intermediate Data Spills during the Shuffle Phase of a Map-Reduce Job |
CN103714009A * | 2013-12-20 | 2014-04-09 | Huazhong University of Science and Technology | MapReduce implementation method based on unified memory management on a GPU |
CN108027801A * | 2015-12-31 | 2018-05-11 | Huawei Technologies Co., Ltd. | Data processing method, apparatus and system |
CN106371919A * | 2016-08-24 | 2017-02-01 | Shanghai Jiao Tong University | Shuffle data caching method based on the map-reduce computing model |
CN110362780A * | 2019-07-17 | 2019-10-22 | Beihang University | Big-data tensor canonical decomposition computing method based on the Sunway many-core processor |
Non-Patent Citations (1)
Title |
---|
See also references of EP4318257A4 |
Also Published As
Publication number | Publication date |
---|---|
EP4318257A1 (en) | 2024-02-07 |
US20240036728A1 (en) | 2024-02-01 |
CN115203133A (zh) | 2022-10-18 |
EP4318257A4 (en) | 2024-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11971861B2 (en) | Providing scalable and concurrent file systems | |
US11258796B2 (en) | Data processing unit with key value store | |
US9665533B2 (en) | Blob pools, selectors, and command set implemented within a memory appliance for accessing memory | |
US8261020B2 (en) | Cache enumeration and indexing | |
US9304815B1 (en) | Dynamic replica failure detection and healing | |
US20150127880A1 (en) | Efficient implementations for mapreduce systems | |
US8955087B2 (en) | Method and system for transferring replicated information from source storage to destination storage | |
US8171064B2 (en) | Methods and systems for concurrently reading direct and indirect data blocks | |
US11093143B2 (en) | Methods and systems for managing key-value solid state drives (KV SSDS) | |
US11226778B2 (en) | Method, apparatus and computer program product for managing metadata migration | |
US20070094529A1 (en) | Method and apparatus for increasing throughput in a storage server | |
CN115129621B (zh) | Memory management method, device, medium, and memory management module | |
US20240036728A1 (en) | Method and apparatus for processing data, reduction server, and mapping server | |
WO2024021470A1 (zh) | Cross-region data scheduling method, apparatus, device, and storage medium | |
US10031859B2 (en) | Pulse counters | |
CN111930684A (zh) | HDFS-based small file processing method, apparatus, device, and storage medium | |
US11971902B1 (en) | Data retrieval latency management system | |
US11507611B2 (en) | Personalizing unstructured data according to user permissions | |
US20240320201A1 (en) | Method and system for performing high-performance writes of data to columnar storage | |
CN118860258A (zh) | Data processing method, apparatus, computing device, system, and readable storage medium | |
US20040158622A1 (en) | Auto-sizing channel | |
CN117472870A (zh) | Data object storage method, apparatus, device, and storage medium | |
TW202219786A (zh) | Interactive continuous in-device key-value transaction processing system and method | |
CN117880288A (zh) | Data balancing method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22787444; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | Wipo information: entry into national phase | Ref document number: 2022787444; Country of ref document: EP |
 | ENP | Entry into the national phase | Ref document number: 2022787444; Country of ref document: EP; Effective date: 20231023 |
 | NENP | Non-entry into the national phase | Ref country code: DE |