CN109388615A - Task processing method and system based on Spark - Google Patents
- Publication number
- CN109388615A (CN201811142575.1A)
- Authority
- CN
- China
- Prior art keywords
- task
- file
- data
- reduce
- task processing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a Spark-based task processing method and system. The task processing method includes a Map stage, in which: pending data and the number of ReduceTasks are obtained; each MapTask creates one bucket; each MapTask splits the pending data in its given range according to the number of ReduceTasks and stores the results it generates in its corresponding bucket; and the data in the bucket is written to the local file system in turn, completing the MapTask. The data in each bucket forms one task file, which contains multiple FileSegments generated for the different ReduceTasks. Writing to the local file system in this way greatly reduces the disk I/O load and the memory-buffer overhead, which improves the stability of Spark shuffle reads and writes.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a Spark-based task processing method and system.
Background art
With the arrival of the big-data era, the demands placed on data processing and mining keep rising. Spark, a distributed framework for big-data processing and mining, is widely used in the big-data field. It processes and mines data distributed across cluster servers, which involves a large amount of data aggregation, merging and sorting across servers. This process is called shuffle, and it plays an important role in the stability of Spark's distributed data processing.
In a MapReduce job in Spark, each Map task (MapTask) creates as many storage regions (buckets) as there are downstream tasks, and stores in them the key-value pairs obtained by splitting the original data block. The data stored in the buckets is then continuously written to the local file system for local storage, forming file slices (FileSegments). In the Reduce stage, each Reduce task (ReduceTask) pulls the data of the partition it is to process from the output of the MapTasks on each node and completes operations such as aggregation and sorting.
However, if there are M MapTasks and each MapTask generates R FileSegments, M*R files are produced. When M and R are both large, for example M = 1024 and R = 1024, there are about 1M (1,048,576) FileSegments, which is undoubtedly a heavy burden on the file system. During shuffle, even when the data volume is small, a very large number of FileSegments still seriously degrades disk I/O performance because of random writes to the file system.
Summary of the invention
The object of the present invention is to provide a Spark-based task processing method and system that effectively solve the technical problem of I/O performance degradation that can occur during the existing Spark shuffle.
The technical solution provided by the invention is as follows:
A Spark-based task processing method, including a Map stage, the Map stage including:
obtaining pending data and the number of ReduceTasks;
each MapTask creating one bucket;
each MapTask splitting the pending data in its given range according to the number of ReduceTasks, and storing the results it generates in its corresponding bucket;
writing the data in the bucket to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file, and the task file contains multiple FileSegments generated for the different ReduceTasks.
Further preferably, the task processing method also includes a Reduce stage, the Reduce stage including:
each ReduceTask pulling data from the corresponding task files according to the data range it needs to process, and processing that data.
Further preferably, each MapTask also generates an index file corresponding to its task file;
in the task file, the FileSegments are arranged in ReduceTask order;
the index file stores the start address range of the FileSegment corresponding to each ReduceTask;
in the Reduce stage, each ReduceTask pulls the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
Further preferably, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments;
in the Reduce stage, each ReduceTask pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and that of the MapTasks it depends on, and processes that data.
Further preferably, the step of writing the data in the bucket to the local file system in turn includes:
writing the data in the bucket to a memory buffer in turn;
spilling the data in the memory buffer to the local disk, and recording the storage location information;
in the Reduce stage, after the location information of the stored MapTask results is obtained, the data in the corresponding task file is read by local reading and/or remote reading and then processed.
The present invention also provides a Spark-based task processing system, including a Mapper side, the Mapper side including:
a task acquisition module, which obtains pending data and the number of Reduce tasks;
multiple Map task processing modules connected to the task acquisition module, each of which creates one storage region in advance, splits the pending data in its given range according to the number of Reduce tasks, and then stores the results it generates in its corresponding storage region;
a data writing module connected to the Map task processing modules, which writes the data in the storage region to the local file system in turn to complete the Map task, wherein the data in each storage region forms one task file, and the task file contains multiple file slices generated for the different Reduce tasks.
Further preferably, the task processing system also includes a Reducer side connected to the Mapper side, the Reducer side including:
multiple Reduce task processing modules, which pull data from the corresponding task files according to the data range to be processed, and process that data.
Further preferably, the data writing module also generates an index file corresponding to each task file;
in the task file, the file slices are arranged in Reduce task order;
the index file stores the start address range of the file slice corresponding to each Reduce task;
the Reduce task processing module pulls the data to be processed from the corresponding task file according to the start address range of the file slice.
Further preferably, the index file also includes the identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
the Reduce task processing module pulls the data of the corresponding file slice in the corresponding task file according to its own identification information and that of the Map tasks it depends on, and processes that data.
Further preferably, the data writing module writes the data in the storage region to a memory buffer in turn, spills the data in the memory buffer to the local disk, and records the storage location information;
the Reduce task processing module, after obtaining the location information of the results stored by the Map task processing module, reads the data in the corresponding task file by local reading and/or remote reading and processes it.
In the Spark-based task processing method and system provided by the invention, a MapTask no longer generates one separate slice (FileSegment) per downstream Reduce task. Instead, each MapTask generates only one task file, which contains the slices generated for each Reduce task, arranged in Reduce task order. When this file is written to the local file system, the disk I/O load and the memory-buffer overhead are greatly reduced, which improves the stability of Spark shuffle reads and writes. In addition, the data range (start address range) corresponding to each Reduce task is maintained in an additional index file. Each index file stores the start address range of the file slice for each ReduceTask, so that the Reduce stage can look up the corresponding data quickly and simply.
Brief description of the drawings
The preferred embodiments are described below in a clear and easily understood manner with reference to the drawings, to further explain the above characteristics, technical features, advantages and implementations.
Fig. 1 is a flow diagram of one embodiment of the Spark-based task processing method of the present invention;
Fig. 2 is a flow diagram of another embodiment of the Spark-based task processing method of the present invention;
Fig. 3 is a schematic diagram of the MapTask-ReduceTask structure established during shuffle in one example of the present invention;
Fig. 4 is a schematic diagram of one embodiment of the Spark-based task processing system of the present invention;
Fig. 5 is a schematic diagram of another embodiment of the Spark-based task processing system of the present invention.
Description of reference numerals:
100 - task processing system, 110 - Mapper side, 111 - task acquisition module, 112 - Map task processing module, 113 - data writing module, 120 - Reducer side, 121 - Reduce task processing module.
Detailed description of the embodiments
To explain the embodiments of the invention and the technical solutions of the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings and other embodiments from them without creative effort.
To keep the figures simple, each figure shows only the parts relevant to the invention schematically; they do not represent the actual structure of a product. In addition, to keep the figures easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" does not only mean "exactly one"; it can also mean "more than one".
When the output of the Map stage is to be used by the Reduce stage, it must be distributed to the individual Reducers; this process is the shuffle. Because the shuffle involves reads and writes on the local file system, shuffle performance directly affects the efficiency of the whole application. In the existing MapReduce job, each MapTask creates as many buckets as there are downstream tasks and stores in them the key-value pairs obtained by splitting the original data block; the data stored in the buckets is then continuously written to the local file system for local storage, forming FileSegments. In this process each MapTask generates one FileSegment per ReduceTask, holding the data that particular ReduceTask must process. If there are M MapTasks and each generates R FileSegments, M*R FileSegments are produced. When M and R are very large, generating and reading these FileSegments during the shuffle causes a large amount of random I/O, which seriously reduces the system's I/O performance and efficiency.
To solve the above technical problem, the present invention provides a new Spark-based task processing method for improving read/write stability during the Spark shuffle. Specifically, the method includes a Map stage which, as shown in Fig. 1, includes: S1, obtaining pending data and the number of ReduceTasks; S2, each MapTask creating one bucket; S3, each MapTask splitting the pending data in its given range according to the number of ReduceTasks and storing the results it generates in its corresponding bucket; S4, writing the data in the bucket to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file and the task file contains multiple FileSegments generated for the different ReduceTasks.
In this embodiment, the pending data is specifically the intermediate data generated by a Mapper (stage) in Spark, and it is split according to the number of ReduceTasks, which depends on the number of partitions of the RDD (Resilient Distributed Dataset) in the current stage. Each MapTask places the data obtained by splitting for each ReduceTask into the bucket it created, according to a preset partitioning algorithm. During storage, the data for each ReduceTask is ordered by ReduceTask, which simplifies subsequent reads. When the data in the buckets is written to the local file system, the data in each bucket forms one task file; the task file contains the FileSegments generated for the different ReduceTasks, one FileSegment per ReduceTask. The partitioning algorithm can be configured as required, for example storing by the hash of the key. In one example with M MapTasks, each MapTask generates one task file. Although each task file contains R FileSegments, only M*1 task files are produced overall. Compared with the M*R FileSegment files of the prior art, this greatly reduces the performance impact of random I/O and substantially improves write efficiency.
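The Map-stage steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation or Spark's own code: the names `partition_for` and `run_map_task`, the tab-separated record format, and the use of CRC32 as the key hash are all hypothetical choices made for the example.

```python
import os
import tempfile
import zlib
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # a (key, value) pair


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Choose the ReduceTask for a key by hashing it (one possible partition rule).

    CRC32 is used instead of the built-in hash() so results are stable across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def run_map_task(records: Iterable[Record], num_reduce_tasks: int,
                 task_file_path: str) -> List[int]:
    """Write one task file for this MapTask; return the byte length of each segment."""
    # One bucket per MapTask; inside it, records are grouped by target ReduceTask.
    bucket: Dict[int, List[Record]] = {r: [] for r in range(num_reduce_tasks)}
    for key, value in records:
        bucket[partition_for(key, num_reduce_tasks)].append((key, value))

    # Write the whole bucket as ONE file, with segments laid out in ReduceTask order.
    segment_lengths: List[int] = []
    with open(task_file_path, "wb") as out:
        for reduce_id in range(num_reduce_tasks):
            start = out.tell()
            for key, value in bucket[reduce_id]:
                out.write(f"{key}\t{value}\n".encode("utf-8"))
            segment_lengths.append(out.tell() - start)
    return segment_lengths


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "map_0.data")  # hypothetical file name
    lengths = run_map_task([("a", "1"), ("b", "2"), ("a", "3")], 4, path)
    print("segments:", len(lengths), "bytes on disk:", os.path.getsize(path))
```

Whatever the value of R, the MapTask opens and writes a single file; the per-ReduceTask boundaries survive only as the returned segment lengths, which is the information an index would need to record.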
Writing the data in the bucket to the local file system includes: writing the data in the bucket to a memory buffer in turn and, when the data in the memory buffer reaches a certain amount, spilling it to the local disk. In one example, the memory buffer is 100 MB in size, and spilling to the local disk starts when the stored data reaches 80 MB. The memory buffer required in this process is C*R*100 KB (kilobytes), where C is the number of cores running in the Spark cluster and R is the number of subsequent ReduceTasks; when the number of ReduceTasks is very large, the memory occupied is a considerable overhead. In this embodiment, because fewer files are generated by each MapTask, memory-buffer usage while writing bucket data to the buffer is greatly reduced.
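The buffer-size formula and the spill behaviour can be illustrated with a small Python sketch. The formula C*R*100 KB and the 100 MB / 80 MB figures come from the text; everything else (the names `required_buffer_bytes` and `SpillBuffer`, and the use of a list as a stand-in for the local disk) is an assumption made for the example.

```python
from typing import List


def required_buffer_bytes(cores: int, num_reduce_tasks: int,
                          per_writer_kb: int = 100) -> int:
    """Buffer memory given by the formula in the text: C * R * 100 KB."""
    return cores * num_reduce_tasks * per_writer_kb * 1024


class SpillBuffer:
    """Toy memory buffer that spills to a sink once a fill threshold is reached."""

    def __init__(self, capacity_bytes: int, spill_percent: int, sink: List[bytes]):
        self.threshold = capacity_bytes * spill_percent // 100
        self.sink = sink            # a list standing in for the local disk
        self.chunks: List[bytes] = []
        self.used = 0
        self.spill_count = 0

    def write(self, data: bytes) -> None:
        self.chunks.append(data)
        self.used += len(data)
        if self.used >= self.threshold:
            self.spill()

    def spill(self) -> None:
        """Move everything currently buffered to the sink and reset the buffer."""
        if self.chunks:
            self.sink.append(b"".join(self.chunks))
            self.chunks.clear()
            self.used = 0
            self.spill_count += 1


disk: List[bytes] = []
# Scaled-down stand-in for the 100 MB buffer with an 80 MB spill threshold.
buffer = SpillBuffer(capacity_bytes=100, spill_percent=80, sink=disk)
for _ in range(10):
    buffer.write(b"x" * 20)
buffer.spill()  # final flush of whatever is still buffered
print("spills:", buffer.spill_count, "bytes spilled:", sum(len(c) for c in disk))
print("C=16, R=1024 ->", required_buffer_bytes(16, 1024) // (1024 * 1024), "MiB")
```

With 16 cores and 1024 ReduceTasks the formula gives 1600 MiB of buffer space, which shows why the text calls the overhead considerable when R is large.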
This embodiment is an improvement on the one above. Here the Spark-based task processing method includes a Map stage and a Reduce stage. The Map stage includes: S1, obtaining pending data and the number of ReduceTasks; S2, each MapTask creating one bucket; S3, each MapTask splitting the pending data in its given range according to the number of ReduceTasks and storing the results it generates in its corresponding bucket; S4, writing the data in the bucket to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file and the task file contains multiple FileSegments generated for the different ReduceTasks. The Reduce stage includes: S5, each ReduceTask pulling data from the corresponding task files according to the data range it needs to process, and processing that data, as shown in Fig. 2.
In this embodiment, each MapTask places the data obtained by splitting for each ReduceTask into the bucket it created, according to the preset partitioning algorithm; during storage, the data for each ReduceTask is ordered by ReduceTask. When the data in the buckets is written to the local file system, the data in each bucket forms one task file containing the FileSegments generated for the different ReduceTasks, one FileSegment per ReduceTask. When the Reducer starts, each ReduceTask pulls data from the corresponding task files according to the data range it needs to process, and processes that data.
This embodiment is an improvement on the one above. Here each MapTask generates one task file and one index file corresponding to that task file. Specifically, each task file contains the FileSegments generated for the different ReduceTasks, arranged in ReduceTask order, and the index file stores the start address range of the FileSegment corresponding to each ReduceTask. In the Reduce stage, each ReduceTask can thus pull the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
In this embodiment, because each MapTask generates only one task file, an index file is maintained for each task file so that the FileSegment corresponding to a ReduceTask can be pulled accurately from the task file in the Reduce stage. The index file stores the start address range of each FileSegment: if n FileSegments are merged into the task file in ReduceTask order, the index file stores n start address ranges, in one-to-one correspondence with the positions of the n FileSegments. In the Reduce stage, each ReduceTask reads the corresponding data from the task file according to the start address in the index file. In one example, shown in Fig. 3, there are 4 ReduceTasks; during the shuffle, each MapTask generates 4 FileSegments according to the number of ReduceTasks and places them in one task file, and each task file maintains one index file (index).
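One way to realise such an index is sketched below in Python. The text only says that the index stores a start address range per FileSegment; encoding it as R+1 cumulative 8-byte offsets is an assumption made for this example, and the function names are hypothetical.

```python
import struct
from typing import List, Tuple


def build_index(segment_lengths: List[int]) -> bytes:
    """Encode R+1 cumulative offsets; segment i spans offsets[i] to offsets[i+1]."""
    offsets = [0]
    for length in segment_lengths:
        offsets.append(offsets[-1] + length)
    # Big-endian signed 64-bit integers, one per offset.
    return struct.pack(f">{len(offsets)}q", *offsets)


def segment_range(index: bytes, reduce_id: int) -> Tuple[int, int]:
    """Return the (start, end) byte range of the segment for one ReduceTask."""
    start, end = struct.unpack_from(">2q", index, reduce_id * 8)
    return start, end


def read_segment(task_file: bytes, index: bytes, reduce_id: int) -> bytes:
    """Slice one ReduceTask's FileSegment out of the task file contents."""
    start, end = segment_range(index, reduce_id)
    return task_file[start:end]


# Four segments, mirroring the 4-ReduceTask example of Fig. 3.
task_file = b"aaabbccccd"
index = build_index([3, 2, 4, 1])
for reduce_id in range(4):
    print(reduce_id, segment_range(index, reduce_id),
          read_segment(task_file, index, reduce_id))
```

Because the segments are stored in ReduceTask order, the position of an entry in the index is enough to identify its ReduceTask, and a reader needs only one seek per task file to reach its data.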
Furthermore, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments. In the Reduce stage, each ReduceTask pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and that of the MapTasks it depends on, and processes that data.
Specifically, after a MapTask has placed the data obtained by splitting for each ReduceTask into its bucket and written it to the local file system, then when the Reducer starts, a ReduceTask uses its own identification information (task number) and the identification information (task number) of the MapTasks it depends on to obtain the corresponding bucket from a remote or local block manager as the Reducer's input. The mapping between ReduceTasks and FileSegments is specifically a mapping between ReduceTasks and FileSegment storage addresses; a ReduceTask pulls the corresponding data from the corresponding FileSegment in the task file according to the start address range in the index file.
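The lookup by ReduceTask number and MapTask number can be sketched as follows. This is a toy model only: a dictionary stands in for the block manager, each entry holds the task file contents together with its per-ReduceTask address ranges, and the name `pull_for_reduce` is hypothetical.

```python
from typing import Dict, List, Tuple

# map_id -> (task file contents, (start, end) ranges indexed by reduce_id)
MapOutput = Tuple[bytes, List[Tuple[int, int]]]


def pull_for_reduce(outputs: Dict[int, MapOutput], map_ids: List[int],
                    reduce_id: int) -> List[bytes]:
    """Collect this ReduceTask's segment from every MapTask it depends on."""
    pulled = []
    for map_id in map_ids:
        task_file, ranges = outputs[map_id]
        start, end = ranges[reduce_id]   # mapping from ReduceTask to storage address
        pulled.append(task_file[start:end])
    return pulled


outputs = {
    0: (b"aabbb", [(0, 2), (2, 5)]),   # MapTask 0: two segments
    1: (b"cdd", [(0, 1), (1, 3)]),     # MapTask 1: two segments
}
print(pull_for_reduce(outputs, [0, 1], reduce_id=1))
```

ReduceTask 1 reads exactly one contiguous range from each of the two task files, rather than opening a separate file per (MapTask, ReduceTask) pair.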
This embodiment is an improvement on the one above. Here, after a MapTask has been executed, its execution status and the address of the generated task file are encapsulated in a mapStatus object, and MapoutputTrackerWorker is called to send a message to the MapOutputTrackerMaster in the Driver, notifying the MapOutputTrackerMaster of the location information of each task file on disk (including the IP address of the computer storing the file, the location of the file on that computer, the MapTask that generated the file, and so on). When a ReduceTask needs to read data in a task file, it obtains the task file's location information by querying the MapOutputTrackerMaster on the Driver side. The ReduceTask can then go to the node where the data is located (the FileSegment node), by local reading and/or remote reading, and pull the data according to the start address range in the index file.
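The register-then-query pattern described above can be modelled with a small Python sketch. It is deliberately a toy: `ToyOutputTracker` and `TaskFileLocation` are hypothetical names and do not reproduce the API of Spark's MapOutputTrackerMaster, and the IP addresses and paths are made-up example values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TaskFileLocation:
    host: str    # IP address of the machine storing the task file
    path: str    # where the file is stored on that machine
    map_id: int  # MapTask that generated the file


class ToyOutputTracker:
    """Stand-in for the driver-side registry; NOT Spark's real tracker API."""

    def __init__(self) -> None:
        self._locations: Dict[int, TaskFileLocation] = {}

    def register(self, location: TaskFileLocation) -> None:
        """Called after a MapTask finishes, to record where its task file lives."""
        self._locations[location.map_id] = location

    def locate(self, map_id: int) -> TaskFileLocation:
        """Called by a ReduceTask before it reads a task file."""
        return self._locations[map_id]


def read_mode(location: TaskFileLocation, local_host: str) -> str:
    """Decide between a local read and a remote read for one task file."""
    return "local" if location.host == local_host else "remote"


tracker = ToyOutputTracker()
tracker.register(TaskFileLocation("10.0.0.1", "/tmp/map_0.data", 0))
tracker.register(TaskFileLocation("10.0.0.2", "/tmp/map_1.data", 1))
for map_id in (0, 1):
    loc = tracker.locate(map_id)
    print(map_id, loc.host, read_mode(loc, local_host="10.0.0.1"))
```

A ReduceTask running on 10.0.0.1 would read MapTask 0's file locally and fetch MapTask 1's file remotely, in both cases using the index to pull only its own address range.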
The present invention also provides a Spark-based task processing system 100 including a Mapper side 110. As shown in Fig. 4, the Mapper side 110 includes a task acquisition module 111, multiple Map task processing modules 112 (Map task processing module 1, ..., Map task processing module n in the figure, which handle MapTasks) and a data writing module 113, the Map task processing modules 112 being connected to both the task acquisition module 111 and the data writing module 113. The task acquisition module 111 obtains the pending data and the number of Reduce tasks. Each Map task processing module 112 splits the pending data in its given range according to the number of ReduceTasks and then stores the results it generates in its corresponding bucket. The data writing module 113 writes the data in the bucket to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file and the task file contains multiple FileSegments generated for the different ReduceTasks.
In this embodiment, the pending data is specifically the intermediate data generated by a Mapper (stage) in Spark, and it is split according to the number of ReduceTasks. Each Map task processing module 112 places the data obtained by splitting for each ReduceTask into the bucket created in advance, according to a preset partitioning algorithm. During storage, the data writing module 113 orders the data for each ReduceTask by ReduceTask, which simplifies subsequent reads. When the data in the buckets is written to the local file system, the data in each bucket forms one task file containing the FileSegments generated for the different ReduceTasks, one FileSegment per ReduceTask. The partitioning algorithm can be configured as required, for example storing by the hash of the key. Compared with the M*R FileSegment files of the prior art, this greatly reduces the performance impact of random I/O and substantially improves write efficiency.
Writing the data in the bucket to the local file system includes: the data writing module 113 writes the data in the bucket to a memory buffer in turn and, when the data in the memory buffer reaches a certain amount, spills it to the local disk. The memory buffer required in this process is C*R*100 KB (kilobytes), where C is the number of cores running in the Spark cluster and R is the number of subsequent ReduceTasks; when the number of ReduceTasks is very large, the memory occupied is a considerable overhead. In this embodiment, because fewer files are generated by each MapTask, memory-buffer usage while writing bucket data to the buffer is greatly reduced.
This embodiment is an improvement on the one above. Here the task processing system 100 includes a Mapper side 110 and a Reducer side 120. As shown in Fig. 5, the Mapper side 110 includes a task acquisition module 111, multiple Map task processing modules 112 and a data writing module 113. The task acquisition module 111 obtains the pending data and the number of Reduce tasks. Each Map task processing module 112 splits the pending data in its given range according to the number of ReduceTasks and then stores the results it generates in its corresponding bucket. The data writing module 113 writes the data in the bucket to the local file system in turn to complete the MapTask, wherein the data in each bucket forms one task file and the task file contains multiple FileSegments generated for the different ReduceTasks. The Reducer side 120 includes multiple Reduce task processing modules 121 (Reduce task processing module 1, ..., Reduce task processing module n in the figure, which handle ReduceTasks), which pull data from the corresponding task files according to the data range to be processed, and process that data.
In this embodiment, each Map task processing module places the data obtained by splitting for each ReduceTask into the bucket created in advance; during storage, the data writing module 113 orders the data for each ReduceTask by ReduceTask. When the data in the buckets is written to the local file system, the data in each bucket forms one task file containing the FileSegments generated for the different ReduceTasks, one FileSegment per ReduceTask. When the Reducer starts, each Reduce task processing module 121 pulls data from the corresponding task files according to the data range it needs to process, and processes that data.
This embodiment is an improvement on the one above. Here the data writing module 113 generates, for the data produced by each Map task processing module 112, one task file and one index file corresponding to that task file. Specifically, each task file contains the FileSegments generated for the different ReduceTasks, arranged in ReduceTask order, and the index file stores the start address range of the FileSegment corresponding to each ReduceTask. In the Reduce stage, each ReduceTask can thus pull the data it needs to process from the corresponding task file according to the start address range of its FileSegment.
In this embodiment, because each Map task processing module 112 generates only one task file, an index file is maintained for each task file so that the FileSegment corresponding to a ReduceTask can be pulled accurately from the task file in the Reduce stage. The index file stores the start address range of each FileSegment: if n FileSegments are merged into the task file in ReduceTask order, the index file stores n start address ranges, in one-to-one correspondence with the positions of the n FileSegments. In the Reduce stage, the Reduce task processing module 121 reads the corresponding data from the task file according to the start address in the index file.
Furthermore, the index file also includes the identification information of the MapTask, the identification information of the ReduceTasks, and the mapping between ReduceTasks and FileSegments. In the Reduce stage, the Reduce task processing module 121 pulls the data of the corresponding FileSegment in the corresponding task file according to its own identification information and that of the MapTasks it depends on, and processes that data.
Specifically, after the Map task processing module 112 has placed the data obtained by splitting for each ReduceTask into its bucket and the data has been written to the local file system, then when the Reducer starts, the Reduce task processing module 121 uses the identification information (task number) of its own task and the identification information (task number) of the MapTasks it depends on to obtain the corresponding bucket from a remote or local block manager as the Reducer's input. The mapping between ReduceTasks and FileSegments is specifically a mapping between ReduceTasks and FileSegment storage addresses; the Reduce task processing module 121 pulls the corresponding data from the corresponding FileSegment in the task file according to the start address range in the index file.
This embodiment is an improvement on the one above. Here, after a MapTask has been executed, the data writing module 113 encapsulates its execution status and the address of the generated task file in a mapStatus object and calls MapoutputTrackerWorker to send a message to the MapOutputTrackerMaster in the Driver, notifying the MapOutputTrackerMaster of the location information of each task file on disk (including the IP address of the computer storing the file, the location of the file on that computer, the MapTask that generated the file, and so on). When the Reduce task processing module 121 needs to read data in a task file, it obtains the task file's location information by querying the MapOutputTrackerMaster on the Driver side. The Reduce task processing module 121 can then go to the node where the data is located (the FileSegment node), by local reading and/or remote reading, and pull the data according to the start address range in the index file.
It should be noted that the above embodiments can be freely combined as required. The above are only preferred embodiments of the invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the invention.
Claims (10)
1. A Spark-based task processing method, characterized in that the task processing method includes a Map stage, the Map stage including:
obtaining pending data and the number of Reduce tasks;
each Map task creating one storage region;
each Map task splitting the pending data in its given range according to the number of Reduce tasks, and storing the results it generates in its corresponding storage region;
writing the data in the storage region to the local file system in turn to complete the Map task, wherein the data in each storage region forms one task file, and the task file contains multiple file slices generated for the different Reduce tasks.
2. The task processing method according to claim 1, characterized in that the task processing method further includes a Reduce stage, the Reduce stage including:
each Reduce task pulling data from the corresponding task files according to the data range it needs to process, and processing that data.
3. The task processing method according to claim 2, characterized in that each Map task also generates an index file corresponding to its task file;
in the task file, the file slices are arranged in Reduce task order;
the index file stores the start address range of the file slice corresponding to each Reduce task;
in the Reduce stage, each Reduce task pulls the data it needs to process from the corresponding task file according to the start address range of its file slice.
4. The task processing method according to claim 3, characterized in that the index file further includes the identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
in the Reduce stage, each Reduce task pulls the data of the corresponding file slice in the corresponding task file according to its own identification information and that of the Map tasks it depends on, and processes that data.
5. The task processing method according to any one of claims 1-4, characterized in that the step of writing the
data in the storage regions to the local file system in sequence comprises:
writing the data in the storage region to a memory buffer in sequence;
spilling the data in the memory buffer to the local disk, and recording the storage location information;
in the Reduce stage, after the location information of the stored Map task results is obtained, reading and
processing the data in the corresponding task file by local reading and/or remote reading.
6. A Spark-based task processing system, characterized in that the task processing system includes a Mapper end,
and the Mapper end comprises:
a task acquisition module, which obtains the data to be processed and the number of Reduce tasks;
multiple Map task processing modules connected to the task acquisition module, wherein each Map task processing
module creates one storage region in advance and, after splitting the data to be processed within its given range
according to the number of Reduce tasks, stores the results it generates in its corresponding storage region;
a data write module connected to the Map task processing modules, configured to write the data in the storage
regions to the local file system in sequence to complete the Map task, wherein the data in each storage region
forms one task file, and the task file contains multiple file slices generated for the different Reduce tasks.
7. The task processing system according to claim 6, characterized in that the task processing system further
includes a Reducer end connected to the Mapper end, and the Reducer end comprises:
multiple Reduce task processing modules, configured to pull data from the corresponding task file according to
the data range to be processed, and to process the data.
8. The task processing system according to claim 7, characterized in that the data write module also generates an
index file corresponding to the task file;
in the task file, the file slices are arranged in the order of the Reduce tasks;
the index file stores the start address range of the file slice corresponding to each Reduce task;
the Reduce task processing module pulls the data it needs to process from the corresponding task file according
to the start address range of the file slice.
9. The task processing system according to claim 8, characterized in that the index file further includes the
identification information of the Map task, the identification information of the Reduce tasks, and the mapping between Reduce tasks and file slices;
the Reduce task processing module pulls and processes the data of the corresponding file slice in the
corresponding task file according to its own identification information and the identification information of the Map tasks it depends on.
10. The task processing system according to any one of claims 6-9, characterized in that the data write module
writes the data in the storage region to a memory buffer in sequence, spills the data in the memory buffer to the
local disk, and records the storage location information;
the Reduce task processing module, after obtaining the location information of the stored processing results of
the Map task processing module, reads and processes the data in the corresponding task file by local reading and/or remote reading.
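The Map-side write and Reduce-side read recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not Spark's implementation and not the applicant's code. The function names (`partition`, `run_map_task`, `read_slice`), the file names `map_<id>.data` and `map_<id>.index`, the CRC32 partitioner, the tab-separated record format, and the 8-byte big-endian offsets are all assumptions made only for illustration. The memory buffer and spill step of claims 5 and 10 is not modeled.

```python
import os
import struct
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Assumed partitioner: a stable CRC32 hash of the key modulo the Reduce task count.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(map_id, records, num_reducers, out_dir):
    """Write one task file and one index file for a single Map task."""
    # One storage region per Map task; records are grouped by target Reduce task.
    # Assumed record format: "key<TAB>value<NEWLINE>", so keys and values must not
    # contain tabs or newlines.
    region = defaultdict(list)
    for key, value in records:
        line = f"{key}\t{value}\n".encode("utf-8")
        region[partition(key, num_reducers)].append(line)

    data_path = os.path.join(out_dir, f"map_{map_id}.data")
    index_path = os.path.join(out_dir, f"map_{map_id}.index")

    # File slices are laid out in Reduce-task order, so the slice for Reduce
    # task r occupies bytes offsets[r] .. offsets[r + 1] of the task file.
    offsets = [0]
    with open(data_path, "wb") as data_file:
        for reduce_id in range(num_reducers):
            for line in region.get(reduce_id, []):
                data_file.write(line)
            offsets.append(data_file.tell())

    # The index file holds num_reducers + 1 offsets as 8-byte big-endian integers.
    with open(index_path, "wb") as index_file:
        index_file.write(struct.pack(f">{len(offsets)}q", *offsets))
    return data_path, index_path


def read_slice(data_path, index_path, reduce_id):
    """Pull only the file slice that belongs to one Reduce task."""
    with open(index_path, "rb") as index_file:
        index_file.seek(8 * reduce_id)
        start, end = struct.unpack(">2q", index_file.read(16))
    with open(data_path, "rb") as data_file:
        data_file.seek(start)
        payload = data_file.read(end - start)
    return [tuple(line.split("\t")) for line in payload.decode("utf-8").splitlines()]
```

Under these assumptions, each Map task leaves exactly two files on disk regardless of the number of Reduce tasks, and a Reduce task needs only its own identifier and the index file to locate its slice.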
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811142575.1A CN109388615B (en) | 2018-09-28 | 2018-09-28 | Spark-based task processing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811142575.1A CN109388615B (en) | 2018-09-28 | 2018-09-28 | Spark-based task processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109388615A (en) | 2019-02-26 |
CN109388615B (en) | 2022-04-01 |
Family
ID=65418272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811142575.1A Active CN109388615B (en) | 2018-09-28 | 2018-09-28 | Spark-based task processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109388615B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110955732A (en) * | 2019-12-16 | 2020-04-03 | 湖南大学 | Method and system for realizing partition load balance in Spark environment |
WO2023005366A1 (en) * | 2021-07-28 | 2023-02-02 | 华为云计算技术有限公司 | Computing method and apparatus, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103327128A (en) * | 2013-07-23 | 2013-09-25 | 百度在线网络技术(北京)有限公司 | Intermediate data transmission method and system for MapReduce |
CN105955819A (en) * | 2016-04-18 | 2016-09-21 | 中国科学院计算技术研究所 | Data transmission method and system based on Hadoop |
US9460147B1 (en) * | 2015-06-12 | 2016-10-04 | International Business Machines Corporation | Partition-based index management in hadoop-like data stores |
Also Published As
Publication number | Publication date |
---|---|
CN109388615B (en) | 2022-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109684333B (en) | Data storage and cutting method, equipment and storage medium | |
CN107038206B (en) | LSM tree establishing method, LSM tree data reading method and server | |
US10705735B2 (en) | Method and device for managing hash table, and computer program product | |
CN104881466B (en) | The processing of data fragmentation and the deletion method of garbage files and device | |
US20180027061A1 (en) | Method and apparatus for elastically scaling virtual machine cluster | |
JP5427640B2 (en) | Decision tree generation apparatus, decision tree generation method, and program | |
JP2013541083A (en) | System and method for scalable reference management in a storage system based on deduplication | |
US20180150536A1 (en) | Instance-based distributed data recovery method and apparatus | |
CN111723073B (en) | Data storage processing method, device, processing system and storage medium | |
CN106407224A (en) | Method and device for file compaction in KV (Key-Value)-Store system | |
CN110399096B (en) | Method, device and equipment for deleting metadata cache of distributed file system again | |
CN103150149A (en) | Method and device for processing redo data of database | |
CN113806300B (en) | Data storage method, system, device, equipment and storage medium | |
CN105589908A (en) | Association rule computing method for transaction set | |
CN109388615A (en) | Task processing method and system based on Spark | |
CN105988995B (en) | A method of based on HFile batch load data | |
CN113254445A (en) | Real-time data storage method and device, computer equipment and storage medium | |
CN105068875A (en) | Intelligence data processing method and apparatus | |
CN112860412B (en) | Service data processing method and device, electronic equipment and storage medium | |
CN113609090A (en) | Data storage method and device, computer readable storage medium and electronic equipment | |
CN108121807B (en) | Method for realizing multi-dimensional Index structure OBF-Index in Hadoop environment | |
JP2019121333A (en) | Data dynamic migration method and data dynamic migration device | |
CN111061719B (en) | Data collection method, device, equipment and storage medium | |
CN110059075A (en) | A kind of method, apparatus of database migration, equipment and computer-readable medium | |
CN113703678A (en) | Method, device, equipment and medium for re-splitting index of storage bucket |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||