CN109388615A - Task processing method and system based on Spark - Google Patents

Info

Publication number
CN109388615A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811142575.1A
Other languages
Chinese (zh)
Other versions
CN109388615B (en)
Inventor
王海波
魏枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Smartq Beijing Mdt Infotech Ltd
Original Assignee
Yunnan Smartq Beijing Mdt Infotech Ltd

Abstract

The invention discloses a Spark-based task processing method and system. The task processing method includes a Map stage, in which: the data to be processed and the number of ReduceTasks are obtained; each MapTask creates one bucket; each MapTask splits the data to be processed within its given range according to the number of ReduceTasks and stores the results it generates in the corresponding bucket; and the data in the buckets is written to the local file system in turn, completing the MapTask. The data in each bucket forms one task file, and each task file contains multiple FileSegments generated for the different ReduceTasks. When the data is written to the local file system, the disk I/O load and the memory buffer overhead are greatly reduced, which improves the stability of reads and writes during Spark shuffle.

Description

Task processing method and system based on Spark
Technical field
The present invention relates to the technical field of data processing, and in particular to a Spark-based task processing method and system.
Background art
With the arrival of the big data era, the demands placed on data processing and mining keep rising. Spark, a distributed framework for big data processing and mining, is widely used in the big data field. It processes and mines data distributed across cluster servers, which involves a large number of data aggregation, merging, and sorting operations. This process is called shuffle, and it plays an important role in the stability of Spark's distributed data processing.
In a MapReduce job in Spark, each Map task (MapTask) generates a corresponding number of storage regions (buckets) according to the number of downstream tasks, and stores in them the key-value pairs obtained by splitting the original data block. The data stored in the buckets is then continuously written to the local file system, forming file slices (FileSegments). In the Reduce stage, each Reduce task (ReduceTask) pulls the data of the partition it is to process from the output of the MapTasks on each node and completes operations such as aggregation and sorting.
However, suppose there are M MapTasks and each MapTask generates R FileSegments; M*R files are then generated. If M and R are both large, for example M=1024 and R=1024, there are 1M (1,048,576) FileSegments, which is undoubtedly a heavy burden on the file system. During shuffle, even when the data volume is small, a very large number of FileSegments seriously degrades disk I/O performance because of the random writes to the file system.
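The arithmetic in the preceding paragraph can be checked with a short sketch. The function name and dictionary keys below are illustrative assumptions, not part of the patent; the second count (one file per MapTask) is included only for comparison with the layout this document goes on to describe.

```python
def shuffle_file_counts(num_map_tasks: int, num_reduce_tasks: int) -> dict:
    """Count shuffle output files for M MapTasks and R ReduceTasks."""
    return {
        # Existing layout: one FileSegment file per (MapTask, ReduceTask) pair.
        "per_reduce_files": num_map_tasks * num_reduce_tasks,
        # Consolidated layout: one task file per MapTask.
        "per_map_files": num_map_tasks,
    }


counts = shuffle_file_counts(1024, 1024)
print(counts["per_reduce_files"])  # 1048576
print(counts["per_map_files"])     # 1024
```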
Summary of the invention
The object of the present invention is to provide a Spark-based task processing method and system that effectively solve the technical problem of degraded I/O performance that can arise during the existing Spark shuffle.
The technical solution provided by the invention is as follows:
A Spark-based task processing method includes a Map stage, the Map stage comprising:
obtaining the data to be processed and the number of ReduceTasks;
creating one bucket for each MapTask;
each MapTask splitting the data to be processed within a given range according to the number of ReduceTasks, and storing the results it generates in the corresponding bucket;
writing the data in the buckets to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks.
Further preferably, the task processing method also includes a Reduce stage, the Reduce stage comprising:
each ReduceTask pulling data from the corresponding task files according to the data range it needs to process, and processing that data.
Further preferably, each MapTask also generates an index file corresponding to its task file;
in the task file, the FileSegments are arranged in ReduceTask order;
the index file stores the start address range of the FileSegment corresponding to each ReduceTask;
in the Reduce stage, each ReduceTask pulls the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
Further preferably, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments;
in the Reduce stage, each ReduceTask pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and the identification information of the MapTasks it depends on, and processes that data.
Further preferably, the step of writing the data in the buckets to the local file system in turn comprises:
writing the data in the buckets to a memory buffer in turn;
spilling the data in the memory buffer to the local disk, and recording the storage location information;
in the Reduce stage, after the location information of the stored MapTask results is obtained, reading the data in the corresponding task file by local reading and/or remote reading, and processing it.
The present invention also provides a Spark-based task processing system that includes a Mapper end, the Mapper end comprising:
a task acquisition module, which obtains the data to be processed and the number of Reduce tasks;
multiple Map task processing modules connected with the task acquisition module, wherein each Map task processing module creates one storage region in advance and, after splitting the data to be processed within a given range according to the number of Reduce tasks, stores the results it generates in the corresponding storage region;
a data writing module connected with the Map task processing modules, which writes the data in the storage regions to the local file system in turn to complete the Map tasks, wherein the data in each storage region forms one task file, and the task file contains multiple file slices generated for the different Reduce tasks.
Further preferably, the task processing system also includes a Reducer end connected with the Mapper end, the Reducer end comprising:
multiple Reduce task processing modules, which pull data from the corresponding task files according to the data range each needs to process, and process that data.
Further preferably, the data writing module also generates an index file corresponding to each task file;
in the task file, the file slices are arranged in Reduce task order;
the index file stores the start address range of the file slice corresponding to each Reduce task;
the Reduce task processing module pulls the data it needs to process from the corresponding task file according to the start address range of the file slice.
Further preferably, the index file also includes the identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
the Reduce task processing module pulls the data of the corresponding file slice in the corresponding task file according to its own identification information and the identification information of the Map tasks it depends on, and processes that data.
Further preferably, the data writing module writes the data in the storage regions to a memory buffer in turn, spills the data in the memory buffer to the local disk, and records the storage location information;
the Reduce task processing module, after obtaining the storage location information of the results produced by the Map task processing modules, reads the data in the corresponding task file by local reading and/or remote reading, and processes it.
In the Spark-based task processing method and system provided by the invention, a MapTask no longer generates a separate file slice (FileSegment) file for each of its downstream Reduce tasks. Instead, each MapTask generates only one task file. Each task file contains the file slices generated for the individual Reduce tasks, arranged in Reduce task order. When the data is written to the local file system, the disk I/O load and the memory buffer overhead are greatly reduced, which improves the stability of reads and writes during Spark shuffle. In addition, the data range (start address range) corresponding to each Reduce task is maintained in an additional index file. Each index file stores the start address range of the file slice corresponding to each ReduceTask, so that the Reduce stage can locate the corresponding data quickly and simply.
Brief description of the drawings
The above characteristics, technical features, advantages, and their implementation are further described below in a clear and easily understood manner through preferred embodiments with reference to the drawings.
Fig. 1 is a flow diagram of one embodiment of the Spark-based task processing method of the present invention;
Fig. 2 is a flow diagram of another embodiment of the Spark-based task processing method of the present invention;
Fig. 3 is a schematic diagram of the MapTask-ReduceTask structure established during shuffle in one example of the present invention;
Fig. 4 is a schematic diagram of one embodiment of the Spark-based task processing system of the present invention;
Fig. 5 is a schematic diagram of another embodiment of the Spark-based task processing system of the present invention;
Description of reference numerals:
100 - task processing system, 110 - Mapper end, 111 - task acquisition module, 112 - Map task processing module, 113 - data writing module, 120 - Reducer end, 121 - Reduce task processing module.
Detailed description of the embodiments
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, specific embodiments of the present invention are described below with reference to the drawings. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings, and other embodiments, from these drawings without creative effort.
To keep the drawings simple, each figure only schematically shows the parts relevant to the present invention; the figures do not represent the actual structure of a product. In addition, to keep the drawings simple and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" means not only "exactly one" but can also mean "more than one".
When the output of the Map stage is to be used by the Reduce stage, that output must be distributed to the individual Reducers; this process is the shuffle. Because shuffle involves reads and writes on the local file system, its performance directly affects the running efficiency of the entire application. In an existing MapReduce job, each MapTask generates a corresponding number of buckets according to the number of downstream tasks and stores in them the key-value pairs obtained by splitting the original data block. The data stored in the buckets is then continuously written to the local file system, forming FileSegments. In this process, each MapTask generates one FileSegment for each ReduceTask, holding the data that the specific ReduceTask needs to process. Suppose there are M MapTasks and each MapTask generates R FileSegments; M*R FileSegments are then generated. If M and R are very large, generating and reading these FileSegments during shuffle produces a large amount of random I/O, which seriously degrades the I/O performance of the system and is inefficient.
To solve the above technical problem, the present invention provides a new Spark-based task processing method for improving read/write stability during Spark shuffle. Specifically, the Spark-based task processing method includes a Map stage which, as shown in Fig. 1, comprises: S1, obtaining the data to be processed and the number of ReduceTasks; S2, each MapTask creating one bucket; S3, each MapTask splitting the data to be processed within a given range according to the number of ReduceTasks and storing the results it generates in the corresponding bucket; S4, writing the data in the buckets to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks.
In the present embodiment, the data to be processed is specifically the intermediate data generated by the Mapper (stage) in Spark, and it is split according to the number of ReduceTasks, which depends on the number of partitions of the RDD (Resilient Distributed Dataset) in the current stage. Each MapTask places the data obtained by splitting for each ReduceTask into the bucket it created, according to a preset partition algorithm. During storage, the data corresponding to each ReduceTask is ordered by ReduceTask, which simplifies the subsequent read operation. When the data in the buckets is written to the local file system, the data in each bucket forms one task file. The task file contains the FileSegments generated for the different ReduceTasks, with one FileSegment per ReduceTask. The partition algorithm can be configured according to the actual situation, for example storing by the hash of the key. In one example, suppose there are M MapTasks and each MapTask generates 1 task file. Although each task file contains R FileSegments, only M*1 task files are produced overall. Compared with the M*R FileSegment files of the prior art, this greatly reduces the impact of random I/O on performance and substantially improves write efficiency.
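The Map-side steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the CRC32-based partitioner, the tab-separated record encoding, and the names `partition_for` and `run_map_task` are all hypothetical choices made for the example.

```python
import os
import tempfile
import zlib
from typing import Iterable, List, Tuple


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Assumed hash partitioner: stable CRC32 of the key modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def run_map_task(records: Iterable[Tuple[str, str]],
                 num_reduce_tasks: int,
                 task_file_path: str) -> List[int]:
    """Write one task file for this MapTask; return each segment's byte length."""
    # One bucket per MapTask, grouped internally by target ReduceTask.
    bucket: List[List[bytes]] = [[] for _ in range(num_reduce_tasks)]
    for key, value in records:
        line = f"{key}\t{value}\n".encode("utf-8")
        bucket[partition_for(key, num_reduce_tasks)].append(line)

    segment_lengths = []
    with open(task_file_path, "wb") as task_file:
        # Segments are laid out in ReduceTask order: 0, 1, ..., R-1.
        for reduce_id in range(num_reduce_tasks):
            start = task_file.tell()
            task_file.writelines(bucket[reduce_id])
            segment_lengths.append(task_file.tell() - start)
    return segment_lengths


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "map_0.data")
    lengths = run_map_task([("a", "1"), ("b", "2"), ("a", "3")], 4, path)
    total_size = os.path.getsize(path)

# One length per ReduceTask, and the segments together fill the whole file.
print(len(lengths), sum(lengths) == total_size)
```

Whatever the number of ReduceTasks, the MapTask opens a single output file; the segment lengths returned here are what an index would need to record.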
Writing the data in the buckets to the local file system comprises: writing the data in the buckets to a memory buffer in turn and, when the data in the memory buffer reaches a certain amount, spilling the data in the memory buffer to the local disk. In one example, the memory buffer is 100 MB (megabytes) in size; when the data stored in it reaches 80 MB, the data starts to be spilled to the local disk. In the existing process, the memory buffer size required is C*R*100 KB (kilobytes), where C is the number of cores running in the Spark cluster and R is the number of subsequent ReduceTasks. If the number of ReduceTasks is very large, the memory occupied is a considerable overhead. In the present embodiment, because the number of files generated by each MapTask is reduced, memory buffer usage when writing the bucket data to the memory buffer is greatly reduced.
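A small sketch may help make the buffering and spill behaviour concrete. Everything here is an assumption for illustration: the `SpillBuffer` class, its 80 percent default threshold (mirroring the 100 MB / 80 MB example, scaled down to bytes), and the `legacy_buffer_kb` helper that simply evaluates the C*R*100 KB expression from the text.

```python
import os
import tempfile
from typing import List, Tuple


def legacy_buffer_kb(cores: int, num_reduce_tasks: int, kb_per_writer: int = 100) -> int:
    """Evaluate C * R * 100 KB, the buffer size the text gives for the existing scheme."""
    return cores * num_reduce_tasks * kb_per_writer


class SpillBuffer:
    """In-memory buffer that spills to a local file once a threshold is reached."""

    def __init__(self, spill_path: str, capacity: int, threshold_percent: int = 80):
        self.spill_path = spill_path
        self.spill_at = capacity * threshold_percent // 100
        self.chunks: List[bytes] = []
        self.buffered = 0
        self.spills: List[Tuple[int, int]] = []  # recorded (offset, length) per spill

    def write(self, data: bytes) -> None:
        self.chunks.append(data)
        self.buffered += len(data)
        if self.buffered >= self.spill_at:
            self.spill()

    def spill(self) -> None:
        if not self.chunks:
            return
        # The spill starts at the current end of the file (0 if it does not exist yet).
        offset = os.path.getsize(self.spill_path) if os.path.exists(self.spill_path) else 0
        with open(self.spill_path, "ab") as out:
            out.write(b"".join(self.chunks))
        self.spills.append((offset, self.buffered))
        self.chunks, self.buffered = [], 0


with tempfile.TemporaryDirectory() as tmp_dir:
    buf = SpillBuffer(os.path.join(tmp_dir, "spill.data"), capacity=100)
    for _ in range(9):
        buf.write(b"x" * 10)  # 90 bytes in total; the eighth write triggers a spill
    buf.spill()               # flush the remainder when the MapTask finishes
    spill_log = list(buf.spills)

print(spill_log)                   # [(0, 80), (80, 10)]
print(legacy_buffer_kb(16, 1024))  # 1638400
```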
The present embodiment is obtained by improving the above embodiment. In the present embodiment, the Spark-based task processing method includes a Map stage and a Reduce stage. The Map stage comprises: S1, obtaining the data to be processed and the number of ReduceTasks; S2, each MapTask creating one bucket; S3, each MapTask splitting the data to be processed within a given range according to the number of ReduceTasks and storing the results it generates in the corresponding bucket; S4, writing the data in the buckets to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks. The Reduce stage comprises: S5, each ReduceTask pulling data from the corresponding task files according to the data range it needs to process and processing that data, as shown in Fig. 2.
In the present embodiment, each MapTask places the data obtained by splitting for each ReduceTask into the bucket it created, according to a preset partition algorithm. During storage, the data corresponding to each ReduceTask is ordered by ReduceTask. When the data in the buckets is written to the local file system, the data in each bucket forms one task file, which contains the FileSegments generated for the different ReduceTasks, with one FileSegment per ReduceTask. When the Reducer starts, each ReduceTask pulls data from the corresponding task files according to the data range it needs to process, and processes that data.
The present embodiment is obtained by improving the above embodiment. In the present embodiment, each MapTask generates one task file and one index file corresponding to that task file. Specifically, each task file contains the FileSegments generated for the different ReduceTasks, arranged in ReduceTask order, and the index file stores the start address range of the FileSegment corresponding to each ReduceTask. In the Reduce stage, each ReduceTask therefore pulls the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
In the present embodiment, each MapTask generates only one task file. So that the FileSegment corresponding to a ReduceTask can be pulled accurately from the task file in the Reduce stage, one index file is maintained for each task file, and the index file stores the start address range of each FileSegment. That is, if n FileSegments are merged into the task file in ReduceTask order, the index file stores n start address ranges, in one-to-one correspondence with the positions of the n FileSegments. In the Reduce stage, each ReduceTask reads the corresponding data from the task file according to the start address in the index file. In one example, as shown in Fig. 3, there are 4 ReduceTasks; during shuffle, each MapTask generates 4 FileSegments according to the number of ReduceTasks and places them in one task file, and one index file (index) is maintained for each task file.
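One possible encoding of such an index file is sketched below. The patent does not specify a byte layout, so the format here is an assumption: R+1 big-endian 8-byte offsets, where entries i and i+1 bound the FileSegment of ReduceTask i. The function names and file names are likewise hypothetical.

```python
import os
import struct
import tempfile
from typing import List


def write_task_and_index(segments: List[bytes], data_path: str, index_path: str) -> None:
    """Concatenate segments in ReduceTask order and record their boundaries."""
    offset = 0
    with open(data_path, "wb") as data, open(index_path, "wb") as index:
        index.write(struct.pack(">q", offset))      # start of segment 0
        for segment in segments:
            data.write(segment)
            offset += len(segment)
            index.write(struct.pack(">q", offset))  # end of this segment, start of the next


def read_segment(reduce_id: int, data_path: str, index_path: str) -> bytes:
    """Look up one ReduceTask's address range in the index, then read only that range."""
    with open(index_path, "rb") as index:
        index.seek(reduce_id * 8)
        start, end = struct.unpack(">qq", index.read(16))
    with open(data_path, "rb") as data:
        data.seek(start)
        return data.read(end - start)


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "map_0.data")
    index_path = os.path.join(tmp_dir, "map_0.index")
    # Four ReduceTasks, as in the Fig. 3 example; segment 1 happens to be empty.
    write_task_and_index([b"r0-data", b"", b"r2", b"r3-data!"], data_path, index_path)
    index_size = os.path.getsize(index_path)
    segment_1 = read_segment(1, data_path, index_path)
    segment_2 = read_segment(2, data_path, index_path)

print(index_size, segment_1, segment_2)  # 40 b'' b'r2'
```

Because the segments are stored in ReduceTask order, a reader needs only two adjacent index entries and a single seek to fetch its data, regardless of how many ReduceTasks share the task file.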
Furthermore, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments. In the Reduce stage, each ReduceTask pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and the identification information of the MapTasks it depends on, and processes that data.
Specifically, the MapTask places the data obtained by splitting for each ReduceTask into the bucket and writes it to the local file system. When the Reducer starts, each ReduceTask obtains the corresponding bucket from a remote or local block manager, according to the identification information (task number) of its own task and the identification information (task number) of the MapTasks it depends on, and uses it as the input of the Reducer. The mapping between ReduceTasks and FileSegments is specifically the mapping between each ReduceTask and the storage address of its FileSegment; the ReduceTask pulls the corresponding data from the corresponding FileSegment in the task file according to the start address range in the index file.
The present embodiment is obtained by improving the above embodiment. In the present embodiment, after a MapTask has executed, the execution status of the MapTask and the address of the generated task file are encapsulated in a mapStatus object, and MapoutputTrackerWorker is called to send a message to the MapOutputTrackerMaster in the Driver, notifying the MapOutputTrackerMaster of the location information of each task file on disk (including the IP address of the computer where the file is stored, the address at which the file is stored on that computer, the MapTask that generated the file, and so on). When a ReduceTask needs to read data in a task file, it obtains the location information of the task file by requesting it from the MapOutputTrackerMaster on the Driver side. The ReduceTask can then go to the node where the data resides (the node holding the FileSegment), by local reading and/or remote reading, and pull the data according to the start address range in the index file.
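The location-tracking flow above can be illustrated with a toy model. This sketch does not use Spark's actual classes or signatures; `MapOutputLocation`, `ToyOutputTracker`, and `read_mode` are hypothetical stand-ins, and the host addresses and paths are made-up example values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MapOutputLocation:
    """Location record reported after a MapTask finishes (hypothetical shape)."""
    map_id: int  # the MapTask that generated the task file
    host: str    # IP address of the node storing the task file
    path: str    # where the task file is stored on that node


class ToyOutputTracker:
    """Driver-side registry; a toy stand-in for the tracker role described in the text."""

    def __init__(self) -> None:
        self._locations: Dict[int, MapOutputLocation] = {}

    def register(self, location: MapOutputLocation) -> None:
        self._locations[location.map_id] = location

    def lookup(self, map_id: int) -> MapOutputLocation:
        return self._locations[map_id]


def read_mode(location: MapOutputLocation, reducer_host: str) -> str:
    """Choose local reading when the task file is on the reducer's own node."""
    return "local" if location.host == reducer_host else "remote"


tracker = ToyOutputTracker()
tracker.register(MapOutputLocation(map_id=0, host="10.0.0.1", path="/tmp/shuffle/map_0.data"))
tracker.register(MapOutputLocation(map_id=1, host="10.0.0.2", path="/tmp/shuffle/map_1.data"))

# A ReduceTask running on 10.0.0.1 asks where each MapTask's output lives.
modes = [read_mode(tracker.lookup(map_id), "10.0.0.1") for map_id in (0, 1)]
print(modes)  # ['local', 'remote']
```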
The present invention also provides a Spark-based task processing system 100 that includes a Mapper end 110. As shown in Fig. 4, the Mapper end 110 includes a task acquisition module 111, multiple Map task processing modules 112 (Map task processing module 1, ..., Map task processing module n in the figure, used to process MapTasks), and a data writing module 113, wherein the multiple Map task processing modules 112 are each connected with the task acquisition module 111 and the data writing module 113. The task acquisition module 111 obtains the data to be processed and the number of Reduce tasks. Each Map task processing module 112 splits the data to be processed within a given range according to the number of ReduceTasks and then stores the results it generates in the corresponding bucket. The data writing module 113 writes the data in the buckets to the local file system in turn to complete the MapTasks, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks.
In the present embodiment, the data to be processed is specifically the intermediate data generated by the Mapper (stage) in Spark, and it is split according to the number of ReduceTasks. Each Map task processing module 112 places the data obtained by splitting for each ReduceTask into the bucket created in advance, according to a preset partition algorithm. During storage, the data writing module 113 orders the data corresponding to each ReduceTask by ReduceTask, which simplifies the subsequent read operation. When the data in the buckets is written to the local file system, the data in each bucket forms one task file, which contains the FileSegments generated for the different ReduceTasks, with one FileSegment per ReduceTask. The partition algorithm can be configured according to the actual situation, for example storing by the hash of the key. Compared with the M*R FileSegment files of the prior art, this greatly reduces the impact of random I/O on performance and substantially improves write efficiency.
Writing the data in the buckets to the local file system comprises: the data writing module 113 writing the data in the buckets to a memory buffer in turn and, when the data in the memory buffer reaches a certain amount, spilling the data in the memory buffer to the local disk. In the existing process, the memory buffer size required is C*R*100 KB (kilobytes), where C is the number of cores running in the Spark cluster and R is the number of subsequent ReduceTasks. If the number of ReduceTasks is very large, the memory occupied is a considerable overhead. In the present embodiment, because the number of files generated by each MapTask is reduced, memory buffer usage when writing the bucket data to the memory buffer is greatly reduced.
The present embodiment is obtained by improving the above embodiment. In the present embodiment, the task processing system 100 includes a Mapper end 110 and a Reducer end 120, as shown in Fig. 5. The Mapper end 110 includes a task acquisition module 111, multiple Map task processing modules 112, and a data writing module 113. The task acquisition module 111 obtains the data to be processed and the number of Reduce tasks. Each Map task processing module 112 splits the data to be processed within a given range according to the number of ReduceTasks and then stores the results it generates in the corresponding bucket. The data writing module 113 writes the data in the buckets to the local file system in turn to complete the MapTasks, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks. The Reducer end 120 includes multiple Reduce task processing modules 121 (Reduce task processing module 1, ..., Reduce task processing module n in the figure, used to process ReduceTasks), which pull data from the corresponding task files according to the data range each needs to process, and process that data.
In the present embodiment, each Map task processing module 112 places the data obtained by splitting for each ReduceTask into the bucket created in advance. During storage, the data writing module 113 orders the data corresponding to each ReduceTask by ReduceTask. When the data in the buckets is written to the local file system, the data in each bucket forms one task file, which contains the FileSegments generated for the different ReduceTasks, with one FileSegment per ReduceTask. When the Reducer starts, each Reduce task processing module 121 pulls data from the corresponding task files according to the data range it needs to process, and processes that data.
The present embodiment is obtained by improving the above embodiment. In the present embodiment, the data writing module 113 generates, from the data produced by each Map task processing module 112, one task file and one index file corresponding to that task file. Specifically, each task file contains the FileSegments generated for the different ReduceTasks, arranged in ReduceTask order, and the index file stores the start address range of the FileSegment corresponding to each ReduceTask. In the Reduce stage, each ReduceTask therefore pulls the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
In the present embodiment, only one task file is generated for each Map task processing module 112. So that the FileSegment corresponding to a ReduceTask can be pulled accurately from the task file in the Reduce stage, one index file is maintained for each task file, and the index file stores the start address range of each FileSegment. That is, if n FileSegments are merged into the task file in ReduceTask order, the index file stores n start address ranges, in one-to-one correspondence with the positions of the n FileSegments. In the Reduce stage, each Reduce task processing module 121 reads the corresponding data from the task file according to the start address in the index file.
Furthermore, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments. In the Reduce stage, each Reduce task processing module 121 pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and the identification information of the MapTasks it depends on, and processes that data.
Specifically, the Map task processing module 112 places the data obtained by splitting for each ReduceTask into the bucket, and the data is written to the local file system. When the Reducer starts, each Reduce task processing module 121 obtains the corresponding bucket from a remote or local block manager, according to the identification information (task number) of its own task and the identification information (task number) of the MapTasks it depends on, and uses it as the input of the Reducer. The mapping between ReduceTasks and FileSegments is specifically the mapping between each ReduceTask and the storage address of its FileSegment; the Reduce task processing module 121 pulls the corresponding data from the corresponding FileSegment in the task file according to the start address range in the index file.
The present embodiment is obtained by improving the above embodiment. In the present embodiment, after a MapTask has executed, the data writing module 113 encapsulates the execution status of the MapTask and the address of the generated task file in a mapStatus object, and calls MapoutputTrackerWorker to send a message to the MapOutputTrackerMaster in the Driver, notifying the MapOutputTrackerMaster of the location information of each task file on disk (including the IP address of the computer where the file is stored, the address at which the file is stored on that computer, the MapTask that generated the file, and so on). When a Reduce task processing module 121 needs to read data in a task file, it obtains the location information of the task file by requesting it from the MapOutputTrackerMaster on the Driver side. The Reduce task processing module 121 can then go to the node where the data resides (the node holding the FileSegment), by local reading and/or remote reading, and pull the data according to the start address range in the index file.
It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. a kind of task processing method based on Spark, which is characterized in that it include the Map stage in the task processing method, Include: in the Map stage
Obtain the quantity of pending data and Reduce task;
One storage region of each Map task creation;
Map task carries out cutting according to the pending data of the quantity of Reduce task to given range, and by each self-generating As a result it is stored in corresponding storage region;
Local file system successively is written into the data in storage region, completes Map task, wherein in each storage region Data form an assignment file, include being cut in the assignment file according to multiple files that different Reduce tasks generates Piece.
2. The task processing method according to claim 1, characterized in that the task processing method further comprises a Reduce stage, the Reduce stage comprising:
each Reduce task pulling data from the corresponding assignment file according to the data range it needs to process, and processing the data.
3. The task processing method according to claim 2, characterized in that each Map task also generates an index file corresponding to the assignment file;
in the assignment file, the file slices are arranged in the order of the Reduce tasks;
the index file stores the initial address range of the file slice corresponding to each Reduce task;
in the Reduce stage, each Reduce task pulls the data it needs to process from the corresponding assignment file according to the initial address range of its file slice.
4. The task processing method according to claim 3, characterized in that the index file further contains the identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
in the Reduce stage, each Reduce task pulls the data of the corresponding file slice from the corresponding assignment file, according to its own identification information and the identification information of the Map tasks it depends on, and processes the data.
5. The task processing method according to any one of claims 1-4, characterized in that the step of writing the data in the storage regions to the local file system in sequence comprises:
writing the data in the storage region to a memory buffer in sequence;
spilling the data in the memory buffer to the local disk, and recording the storage location information;
in the Reduce stage, after the location information of the stored Map task results is obtained, reading the data in the corresponding assignment file by local reading and/or remote reading, and processing the data.
6. A Spark-based task processing system, characterized in that the task processing system comprises a Mapper end, the Mapper end comprising:
a task acquisition module, for obtaining the pending data and the number of Reduce tasks;
multiple Map task processing modules, connected to the task acquisition module, wherein each Map task processing module pre-creates one storage region and, after partitioning the pending data in its given range according to the number of Reduce tasks, stores the results it generates in the corresponding storage region;
a data writing module, connected to the Map task processing modules, for writing the data in the storage regions to the local file system in sequence to complete the Map task, wherein the data in each storage region form one assignment file, and the assignment file contains multiple file slices generated according to the different Reduce tasks.
7. The task processing system according to claim 1, characterized in that the task processing system further comprises a Reducer end connected to the Mapper end, the Reducer end comprising:
multiple Reduce task processing modules, for pulling data from the corresponding assignment file according to the data range to be processed, and processing the data.
8. The task processing system according to claim 7, characterized in that the data writing module also generates an index file corresponding to the assignment file;
in the assignment file, the file slices are arranged in the order of the Reduce tasks;
the index file stores the initial address range of the file slice corresponding to each Reduce task;
the Reduce task processing module pulls the data to be processed from the corresponding assignment file according to the initial address range of the file slice.
9. The task processing system according to claim 8, characterized in that the index file further contains the identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
the Reduce task processing module pulls the data of the corresponding file slice from the corresponding assignment file, according to its own identification information and the identification information of the Map tasks it depends on, and processes the data.
10. The task processing system according to any one of claims 6-9, characterized in that the data writing module writes the data in the storage region to a memory buffer in sequence, spills the data in the memory buffer to the local disk, and records the storage location information;
the Reduce task processing module, after obtaining the location information of the stored processing results of the Map task processing modules, reads the data in the corresponding assignment file by local reading and/or remote reading, and processes the data.
CN201811142575.1A 2018-09-28 2018-09-28 Spark-based task processing method and system Active CN109388615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142575.1A CN109388615B (en) 2018-09-28 2018-09-28 Spark-based task processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811142575.1A CN109388615B (en) 2018-09-28 2018-09-28 Spark-based task processing method and system

Publications (2)

Publication Number Publication Date
CN109388615A (en) 2019-02-26
CN109388615B (en) 2022-04-01

Family

ID=65418272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142575.1A Active CN109388615B (en) 2018-09-28 2018-09-28 Spark-based task processing method and system

Country Status (1)

Country Link
CN (1) CN109388615B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955732A (en) * 2019-12-16 2020-04-03 湖南大学 Method and system for realizing partition load balance in Spark environment
WO2023005366A1 (en) * 2021-07-28 2023-02-02 华为云计算技术有限公司 Computing method and apparatus, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327128A (en) * 2013-07-23 2013-09-25 百度在线网络技术(北京)有限公司 Intermediate data transmission method and system for MapReduce
CN105955819A (en) * 2016-04-18 2016-09-21 中国科学院计算技术研究所 Data transmission method and system based on Hadoop
US9460147B1 (en) * 2015-06-12 2016-10-04 International Business Machines Corporation Partition-based index management in hadoop-like data stores

Also Published As

Publication number Publication date
CN109388615B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN109684333B (en) Data storage and cutting method, equipment and storage medium
CN107038206B (en) LSM tree establishing method, LSM tree data reading method and server
US10705735B2 (en) Method and device for managing hash table, and computer program product
CN104881466B (en) The processing of data fragmentation and the delet method of garbage files and device
US20180027061A1 (en) Method and apparatus for elastically scaling virtual machine cluster
JP5427640B2 (en) Decision tree generation apparatus, decision tree generation method, and program
JP2013541083A (en) System and method for scalable reference management in a storage system based on deduplication
US20180150536A1 (en) Instance-based distributed data recovery method and apparatus
CN111723073B (en) Data storage processing method, device, processing system and storage medium
CN106407224A (en) Method and device for file compaction in KV (Key-Value)-Store system
CN110399096B (en) Method, device and equipment for deleting metadata cache of distributed file system again
CN103150149A (en) Method and device for processing redo data of database
CN113806300B (en) Data storage method, system, device, equipment and storage medium
CN105589908A (en) Association rule computing method for transaction set
CN109388615A (en) Task processing method and system based on Spark
CN105988995B (en) A method of based on HFile batch load data
CN113254445A (en) Real-time data storage method and device, computer equipment and storage medium
CN105068875A (en) Intelligence data processing method and apparatus
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN113609090A (en) Data storage method and device, computer readable storage medium and electronic equipment
CN108121807B (en) Method for realizing multi-dimensional Index structure OBF-Index in Hadoop environment
JP2019121333A (en) Data dynamic migration method and data dynamic migration device
CN111061719B (en) Data collection method, device, equipment and storage medium
CN110059075A (en) A kind of method, apparatus of database migration, equipment and computer-readable medium
CN113703678A (en) Method, device, equipment and medium for re-splitting index of storage bucket

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant