CN109309726A - Document generating method and system based on mass data - Google Patents

Document generating method and system based on mass data

Info

Publication number
CN109309726A
CN109309726A
Authority
CN
China
Prior art keywords
node
task
data
calculate node
calculate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811250926.0A
Other languages
Chinese (zh)
Inventor
安栋
王斌
宋先优
郭锦玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811250926.0A priority Critical patent/CN109309726A/en
Publication of CN109309726A publication Critical patent/CN109309726A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The present invention provides a document generating method and system based on mass data. A client sends a first request message to a first management node, carrying the storage paths of N data blocks and the task type corresponding to each data block, the task types including CPU-intensive tasks and I/O-intensive tasks. The first management node obtains each compute node's capacity for processing the two types of task and distributes one subtask to each of N compute nodes; each compute node reads the data in its data block according to the task type of the data block and processes the data. The client generates the file corresponding to the data according to the data processing results of the N compute nodes. The file is thus generated by parallel processing of the mass data across multiple compute nodes in a spark cluster, and the management node in the spark cluster distributes each data block, according to its task type, to a compute node that is strong at processing that type of task, improving the speed of data processing while achieving load balancing.

Description

Document generating method and system based on mass data
Technical field
The invention belongs to the field of computer technology, and more particularly relates to a document generating method and system based on mass data.
Background technique
With the rapid development of computer technology and Internet technology, the Internet penetration rate and the number of Internet users rise year by year. The dual stimuli of a constantly growing user base and rapidly increasing data processing volume bring new challenges to Internet applications.
For example, data interaction between fund systems is largely carried out in the form of text files. With the growth of the number of users, the data files that need to be generated each day can currently exceed 30 GB, and conventional methods take several hours to generate such files, seriously affecting business efficiency. Moreover, as the data volume grows, the demands on system performance become higher and higher. Therefore, how to improve the speed of file generation in the face of mass data is a present challenge.
Summary of the invention
In view of this, embodiments of the present invention provide a document generating method and system based on mass data, to solve the problem in the prior art that file generation based on mass data is slow.
A first aspect of the embodiments of the present invention provides a document generating method based on mass data. The method is applied to a computing engine spark cluster that includes a first management node and multiple compute nodes, and comprises:
The client sends a first request message to the first management node, the first request message being used to request that data to be processed be processed to generate a file, the data being composed of N data blocks. The first request message carries the storage path information of each of the N data blocks and the task type corresponding to each data block; the task types include central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, and N is a positive integer greater than or equal to 2;
The first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks;
According to each compute node's capacity for processing CPU-intensive tasks, its capacity for processing I/O-intensive tasks, and the task types corresponding to the N data blocks, the first management node distributes one subtask to each of N compute nodes, each subtask being used to process one data block and carrying the path information of that data block, so that the compute node reads the data in the data block according to the path information of the data block in the received subtask and processes the data;
The client generates the file corresponding to the data according to the data processing results of the N compute nodes.
A second aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
The client sends a first request message to the first management node, the first request message being used to request that data to be processed be processed to generate a file, the data being composed of N data blocks. The first request message carries the storage path information of each of the N data blocks and the task type corresponding to each data block; the task types include central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, and N is a positive integer greater than or equal to 2;
The first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks;
According to each compute node's capacity for processing CPU-intensive tasks, its capacity for processing I/O-intensive tasks, and the task types corresponding to the N data blocks, the first management node distributes one subtask to each of N compute nodes, each subtask being used to process one data block and carrying the path information of that data block, so that the compute node reads the data in the data block according to the path information of the data block in the received subtask and processes the data;
The client generates the file corresponding to the data according to the data processing results of the N compute nodes.
A third aspect of the embodiments of the present invention provides a file generating system based on mass data. The system includes a client and a computing engine spark cluster, the spark cluster including a first management node and multiple compute nodes, and the system is used to perform the following:
The client sends a first request message to the first management node, the first request message being used to request that data to be processed be processed to generate a file, the data being composed of N data blocks. The first request message carries the storage path information of each of the N data blocks and the task type corresponding to each data block; the task types include central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, and N is a positive integer greater than or equal to 2;
The first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks;
According to each compute node's capacity for processing CPU-intensive tasks, its capacity for processing I/O-intensive tasks, and the task types corresponding to the N data blocks, the first management node distributes one subtask to each of N compute nodes, each subtask being used to process one data block and carrying the path information of that data block, so that the compute node reads the data in the data block according to the path information of the data block in the received subtask and processes the data;
The client generates the file corresponding to the data according to the data processing results of the N compute nodes.
The present invention provides a document generating method and system based on mass data, in which a file is generated by parallel processing of mass data across multiple compute nodes in a spark cluster, and the management node in the spark cluster distributes each data block, according to the task type corresponding to the data block, to a compute node that is strong at processing that type of task, improving the speed of data processing while achieving load balancing.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.
Fig. 1 is a kind of flow diagram of the document generating method based on mass data provided in an embodiment of the present invention;
Fig. 2 is the flow diagram of another document generating method based on mass data provided in an embodiment of the present invention;
Fig. 3 is the flow diagram of another document generating method based on mass data provided in an embodiment of the present invention;
Fig. 4 is the flow diagram of another document generating method based on mass data provided in an embodiment of the present invention;
Fig. 5 is an architecture diagram of a file generating system based on mass data provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of a terminal device in a file generating system based on mass data provided by an embodiment of the present invention.
Specific embodiment
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of the present invention. However, it will be clear to those skilled in the art that the present invention may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary details do not obscure the description of the invention.
In order to illustrate the technical solutions of the present invention, specific embodiments are described below.
An embodiment of the present invention provides a document generating method based on mass data. With reference to Fig. 1, the method comprises:
S101, client send the first request message to first management node, and first request message is for asking It asks and data to be processed are subjected to processing generate file, the data are by N number of data chunk at taking in first request message Task type corresponding to store path information and each data block with each data block in N number of data block, described Service type includes central processor CPU intensive task and input and output I/O intensive task.
Wherein, N is the positive integer more than or equal to 2.
In embodiments of the present invention, CPU (Central Processing Unit, central processing unit) intensive task is Refer in task implementation procedure, needs to be the task of a large amount of calculating and logic judgment, I/O (input/output, input/output) Intensive task refers to the completing to calculate the time that required time depends on I/O operation of the task, to the more demanding of equipment I/O.
Before this step, the method further includes the client writing the mass data from which the file is to be generated into the distributed file system HDFS, as follows:
S1011: the client sends a second request message to a second management node, requesting that the data to be processed be written; the second request message carries the size information of each of the N data blocks.
The HDFS distributed storage system follows a master-slave model, including one management node and multiple storage nodes; the management node is called the NameNode, and the storage nodes are called DataNodes. In the embodiments of the present invention, the management node of HDFS is called the second management node.
For example, if the size of the data to be processed is 30 GB, the second request message sent by the client to the second management node contains the size information of each of the N data blocks, and the 30 GB of data to be processed is composed of these N data blocks.
S1012: the second management node allocates a storage node for each data block according to the size of that data block, and sends a response message to the client; the response message carries the path information of the storage node corresponding to each data block.
The second management node allocates one storage node for each data block according to the data storage situation of all the storage nodes it manages. After the management node completes the storage node allocation, it sends a response message to the client, the response message carrying the path information of the storage node corresponding to each data block. For example, suppose the data blocks are data block 1 to data block 3; the second management node allocates data block 1 to storage node 1 for storage, data block 2 to storage node 2, and data block 3 to storage node 3. The response message then carries the correspondence between data block 1 and the path of storage node 1, between data block 2 and the path of storage node 2, and between data block 3 and the path of storage node 3.
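The allocation in S1012 can be sketched as follows. The patent does not specify the selection criterion, so this sketch assumes the second management node simply picks the managed storage node with the most free space; the function name `allocate_blocks` and its field names are illustrative, not from the patent.

```python
# Hypothetical sketch of S1012: for each data block, pick the storage node
# with the most remaining free space among the nodes the second management
# node manages, and record the block-to-node-path mapping for the response.

def allocate_blocks(blocks, storage_nodes):
    """blocks: {block_id: size}; storage_nodes: {node_path: free_space}.
    Returns {block_id: node_path} to carry in the response message."""
    free = dict(storage_nodes)
    mapping = {}
    # place the largest blocks first so they are less likely to be stranded
    for block_id, size in sorted(blocks.items(), key=lambda kv: -kv[1]):
        node = max(free, key=free.get)  # node with the most free space
        if free[node] < size:
            raise RuntimeError(f"no storage node can hold block {block_id}")
        mapping[block_id] = node
        free[node] -= size
    return mapping

paths = allocate_blocks({1: 10, 2: 10, 3: 10},
                        {"/node1": 12, "/node2": 12, "/node3": 12})
```

With three equal-sized blocks and three equally free nodes, each block lands on its own node, matching the data block 1 / storage node 1 example above.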
S1013: the client stores the N data blocks into the storage nodes of HDFS according to the path information carried in the response message.
After the storage of the data to be processed is completed, the client sends the first request message to the first management node of the spark cluster, requesting that the spark cluster process the data and generate the file.
In the embodiments of the present invention, the client divides the data to be processed into N data blocks according to the business process and stores the N data blocks respectively in N storage nodes in the HDFS system. The client judges the task type corresponding to each data block, the task types including CPU-intensive tasks and I/O-intensive tasks, and when sending the first request message to the first management node of the spark cluster, carries the task type of each data block in the first request message.
S102: the first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks.
The embodiments of the present invention provide three methods for obtaining each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks.
With reference to Fig. 2, the first method by which the first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks includes:
S1021: the first management node successively reads the running log of each compute node.
The historical task processing information of each compute node is saved in that compute node's running log.
S1022: for any compute node, the first management node obtains, from the running log of that compute node, the average time T1 the compute node takes to process a CPU-intensive task of unit data volume and the average time T2 it takes to process an I/O-intensive task of unit data volume.
Optionally, the unit data volume is 1 GB. For example, from the running log of a compute node, the average time T1 the compute node has historically taken to execute a CPU-intensive task of unit data volume, and the average time it has taken to process an I/O-intensive task of unit data volume, can be determined. The shorter the average processing time, the stronger the processing capacity.
S1023: the first management node obtains the compute node's capacity for processing CPU-intensive tasks according to the T1 value corresponding to that compute node and the average time T1' taken by all compute nodes in the spark cluster to process a CPU-intensive task of unit data volume.
After the first management node obtains the average time each compute node in the cluster takes to process a CPU-intensive task of unit data volume, it obtains the average time T1' across all compute nodes. If a compute node's historical average time for processing a CPU-intensive task of unit data volume is T1, then in the embodiments of the present invention the ratio of T1 to T1' represents that compute node's capacity for processing CPU-intensive tasks.
S1024: the first management node obtains the compute node's capacity for processing I/O-intensive tasks according to the T2 value corresponding to that compute node and the average time T2' taken by all compute nodes in the spark cluster to process an I/O-intensive task of unit data volume.
In the embodiments of the present invention, if a compute node's historical average time for processing an I/O-intensive task of unit data volume is T2, and the average time for all compute nodes in the cluster to process an I/O-intensive task of unit data volume is T2', then the ratio of T2 to T2' represents that compute node's capacity for processing I/O-intensive tasks.
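The T/T' ratios of S1022–S1024 can be sketched as follows. Since T is a processing time, a ratio below 1 means the node is faster than the cluster average, i.e. stronger for that task type; the function name and input format are hypothetical.

```python
# Hypothetical sketch of S1022-S1024: per-node average times per unit data
# volume (e.g. per 1 GB) are read from running logs, and each node's capacity
# is expressed as the ratio of its time T to the cluster-wide average T'.

def capacity_ratios(avg_times):
    """avg_times: {node: T}, average time per unit data volume for one task
    type. Returns {node: T / T_prime}, T_prime being the cluster average."""
    t_prime = sum(avg_times.values()) / len(avg_times)
    return {node: t / t_prime for node, t in avg_times.items()}

cpu_ratio = capacity_ratios({"node1": 8.0, "node2": 12.0, "node3": 10.0})
# node1's ratio is below 1: faster than the cluster average for CPU tasks
```

The same function would be applied twice, once to the T1 log values (CPU-intensive) and once to the T2 values (I/O-intensive).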
With reference to Fig. 3, the second method by which the first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks includes:
S1025: when the spark cluster starts, for any compute node, the first management node instructs the compute node to process a CPU-intensive task of a preset data volume and an I/O-intensive task of the preset data volume, respectively.
S1026: the first management node obtains the time T3 the compute node takes to process the CPU-intensive task of the preset data volume and the time T4 it takes to process the I/O-intensive task of the preset data volume.
S1027: the first management node obtains the compute node's capacity for processing CPU-intensive tasks according to the T3 value corresponding to that compute node and the average time T3' taken by all compute nodes in the spark cluster to process the CPU-intensive task of the preset data volume.
S1028: the first management node obtains the compute node's capacity for processing I/O-intensive tasks according to the T4 value corresponding to that compute node and the average time T4' taken by all compute nodes in the spark cluster to process the I/O-intensive task of the preset data volume.
The implementation of this embodiment is similar to the method of the embodiment corresponding to Fig. 2, except that the first management node does not obtain each compute node's historical capacity for CPU-intensive or I/O-intensive tasks by inquiring its running log; instead, when the cluster starts, it instructs each compute node to run identical CPU-intensive and I/O-intensive tasks, thereby obtaining the time each compute node takes to process the CPU-intensive task and the time it takes to process the I/O-intensive task.
In the embodiments of the present invention, if the time a compute node takes to process the CPU-intensive task of the preset data volume specified by the first management node is T3, and the average time for all compute nodes in the cluster to process that CPU-intensive task is T3', then the ratio of T3 to T3' represents that compute node's capacity for processing CPU-intensive tasks. Likewise, if the time the compute node takes to process the I/O-intensive task of the preset data volume specified by the first management node is T4, and the average time for all compute nodes in the cluster to process that I/O-intensive task is T4', then the ratio of T4 to T4' represents that compute node's capacity for processing I/O-intensive tasks.
With reference to Fig. 4, the third method by which the first management node successively obtains each compute node's capacity for processing CPU-intensive tasks and capacity for processing I/O-intensive tasks includes:
S1029: the first management node successively obtains the CPU frequency, memory capacity, network bandwidth and maximum disk read/write speed of each compute node.
S10210: the first management node calculates the average CPU frequency, average memory capacity, average network bandwidth and average maximum disk read/write speed of all compute nodes in the spark cluster.
S10211: for any compute node, the first management node obtains the compute node's capacity for processing CPU-intensive tasks according to the CPU frequency and memory capacity of that compute node and the average CPU frequency and average memory capacity of all compute nodes in the spark cluster.
S10212: the first management node obtains the compute node's capacity for processing I/O-intensive tasks according to the network bandwidth and maximum disk read/write speed of that compute node and the average network bandwidth and average maximum disk read/write speed of all compute nodes in the spark cluster.
In the embodiments of the present invention, for any node, the ratio of the node's CPU frequency to the average CPU frequency of all compute nodes in the cluster, plus the ratio of the node's memory capacity to the average memory capacity of all compute nodes, is taken as the compute node's capacity for processing CPU-intensive tasks; and the ratio of the node's network bandwidth to the average network bandwidth of all compute nodes in the cluster, plus the ratio of the node's maximum disk read/write speed to the average maximum disk read/write speed of all compute nodes in the cluster, is taken as the node's capacity for processing I/O-intensive tasks.
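The hardware-based capacities of the third method can be sketched directly from the ratios just described; all names and the sample node specifications below are illustrative.

```python
# Hypothetical sketch of S1029-S10212: CPU-intensive capacity is the
# CPU-frequency ratio plus the memory ratio; I/O-intensive capacity is the
# bandwidth ratio plus the max-disk-speed ratio, each ratio taken against
# the cluster average. Here larger values mean stronger nodes.

def hardware_capacities(nodes):
    """nodes: {name: (cpu_ghz, mem_gb, bandwidth, disk_speed)}.
    Returns {name: (cpu_capacity, io_capacity)}."""
    n = len(nodes)
    avg = [sum(spec[i] for spec in nodes.values()) / n for i in range(4)]
    return {
        name: (spec[0] / avg[0] + spec[1] / avg[1],   # CPU-intensive capacity
               spec[2] / avg[2] + spec[3] / avg[3])   # I/O-intensive capacity
        for name, spec in nodes.items()
    }

caps = hardware_capacities({
    "node1": (3.0, 64, 10, 500),
    "node2": (2.0, 32, 10, 300),
})
```

Note the orientation differs from the time-ratio methods: here a larger sum means a stronger node, so a scheduler combining the methods would have to normalize accordingly.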
Further, in the embodiments shown in Fig. 2 to Fig. 4, the first management node also needs to consider the running stability of each compute node. Specifically, the first management node successively reads the running log of each compute node. For any compute node, the first management node obtains, from the running log of that compute node, the number of times the compute node has failed when running CPU-intensive tasks and the number of times it has failed when running I/O-intensive tasks. If the number of CPU-intensive task failures of the compute node is higher than a preset number, the first management node forbids the compute node from running CPU-intensive tasks; if the number of I/O-intensive task failures of the compute node is higher than the preset number, the first management node forbids the compute node from running I/O-intensive tasks.
For any compute node, the number of historical failures when running CPU-intensive tasks represents the stability of the compute node when running CPU-intensive tasks, and the number of failures when running I/O-intensive tasks represents the stability of the compute node when running I/O-intensive tasks.
If the number of failures of the compute node when running CPU-intensive tasks is higher than the preset number, this indicates that the compute node's stability when running CPU-intensive tasks is poor; even if the compute node's capacity for running CPU-intensive tasks, as calculated by the methods described in Fig. 2 to Fig. 4, is strong, the first management node will still forbid the compute node from running CPU-intensive tasks when distributing CPU-intensive tasks.
Similarly, if the number of failures of the compute node when running I/O-intensive tasks is higher than the preset number, this indicates that the compute node's stability when running I/O-intensive tasks is poor; even if the compute node's capacity for running I/O-intensive tasks, as calculated by the methods described in Fig. 2 to Fig. 4, is strong, the first management node will still forbid the compute node from running I/O-intensive tasks when distributing I/O-intensive tasks.
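The stability check can be sketched as a simple eligibility filter applied before task distribution; the threshold value and all names are assumptions, since the patent leaves the preset number unspecified.

```python
# Hypothetical sketch of the stability rule: a node whose historical failure
# count for a task type exceeds a preset threshold is excluded from receiving
# tasks of that type, regardless of its computed capacity.

PRESET_FAILURES = 3  # illustrative threshold, not specified in the patent

def eligible_nodes(nodes, failures, task_type):
    """nodes: iterable of node names; failures: {(node, task_type): count}.
    Returns the nodes still allowed to run `task_type` tasks."""
    return [n for n in nodes
            if failures.get((n, task_type), 0) <= PRESET_FAILURES]

ok = eligible_nodes(["node1", "node2"],
                    {("node1", "cpu"): 5, ("node2", "cpu"): 1}, "cpu")
# node1 exceeds the threshold for CPU-intensive tasks and is excluded
```

A node filtered out for one task type can still appear in the eligible list for the other, matching the per-type prohibition described above.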
S103: according to each compute node's capacity for processing CPU-intensive tasks, its capacity for processing I/O-intensive tasks, and the task types corresponding to the N data blocks, the first management node distributes one subtask to each of N compute nodes, each subtask being used to process one data block and carrying the path information of that data block.
Specifically, the first request message sent by the client to the first management node also carries the coding information of the N data blocks; the coding information indicates the order of the N data blocks in the file to be generated. The first management node generates N subtasks according to the first request message and sorts the N subtasks according to the coding information of the N data blocks in the first request message. The first management node receives in real time the heartbeat messages sent by idle compute nodes. According to the sorting result of the N subtasks, the first management node successively distributes the N subtasks to N compute nodes among the idle compute nodes; for any subtask, when that subtask is allocated, it is distributed, according to the task type of the subtask, to the compute node among the currently idle nodes with the strongest capacity for processing that task type.
For example, suppose the mass data to be processed is divided into 48 data blocks according to the business process and, in the order used to generate the file, the data blocks are numbered data block 1 to data block 48. According to the number of each data block, the first management node divides the file generation task corresponding to the first request message into 48 subtasks, each subtask used to process the data of one data block, and then places the 48 subtasks into a task queue: the subtask corresponding to data block 1 ranks first in the queue, followed in order by the subtask corresponding to data block 2, and so on, up to the subtask corresponding to data block 48.
In the embodiments of the present invention, for ease of description, the subtask used to process the data of data block 1 is called task1, the subtask used to process the data of data block 2 is called task2, and so on; the subtask used to process the data of data block 48 is called task48.
In the spark cluster, when a compute node is idle, it sends a heartbeat message to the first management node, and the first management node judges from the heartbeat message sent by the compute node that the compute node is idle.
The first management node allocates task1 to task48 in the queue in turn. It first allocates task1: according to the task type of task1, it finds, among all currently idle nodes, the compute node with the strongest capacity for processing tasks of that type and distributes task1 to that compute node; it then allocates task2 to task48 in turn according to the same rule.
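The queue-based allocation of S103 can be sketched as follows. This sketch assumes capacity is expressed as a time ratio (as in S1023/S1024), so the strongest idle node is the one with the smallest ratio, and an assigned node is simply removed from the idle set until its next heartbeat; all names are illustrative.

```python
# Hypothetical sketch of S103: subtasks are taken from the ordered queue and
# each is given to the currently idle node with the strongest capacity for
# its task type (smallest time ratio).

from collections import deque

def schedule(subtasks, idle_nodes, capacity):
    """subtasks: ordered [(task_name, task_type)]; idle_nodes: set of names;
    capacity: {(node, task_type): time_ratio}. Returns {task_name: node}."""
    queue, idle, assignment = deque(subtasks), set(idle_nodes), {}
    while queue and idle:
        name, ttype = queue.popleft()
        # strongest node for this task type = smallest time ratio
        best = min(idle, key=lambda n: capacity[(n, ttype)])
        assignment[name] = best
        idle.remove(best)  # busy until its next heartbeat re-adds it
    return assignment

plan = schedule([("task1", "cpu"), ("task2", "io")],
                {"node1", "node2"},
                {("node1", "cpu"): 0.8, ("node2", "cpu"): 1.2,
                 ("node1", "io"): 1.1, ("node2", "io"): 0.9})
```

Here task1, a CPU-intensive task, goes to node1 (ratio 0.8), which leaves node2, the stronger I/O node, free for task2 — one way the per-type capacities yield load balancing.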
S104: each compute node reads the data in its data block according to the path information of the data block carried in the received subtask, and processes the data.
Optionally, to reduce the memory footprint of a compute node, for a given data block the compute node divides the data block into n batches according to the order of the data in the block and processes the batches one at a time, n being a positive integer greater than or equal to 2. Each time the compute node finishes processing a batch, it generates a subfile whose filename contains the number information of the data block and the batch number information of the batch. The compute node then writes the subfile to the preset storage node in HDFS.
For example, task1 is assigned to compute node 1; compute node 1 reads data block 1 according to the storage path carried in task1 and processes the data in data block 1 in batches. If data block 1 contains one million records, the compute node processes task1 in 10 batches, each batch producing one subfile, the subfiles being named task1-1, task1-2, and so on through task1-10.
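A minimal sketch of this batching step, under stated assumptions: an in-memory list of rows stands in for the records of a data block, subfiles are returned in memory where the real system would write them to HDFS, and the helper name `process_block` is hypothetical.

```python
# Hypothetical sketch of the batch processing in S104.
def process_block(task_no, rows, n_batches):
    per_batch = -(-len(rows) // n_batches)   # ceiling division: rows per batch
    subfiles = {}
    for j in range(n_batches):
        batch = rows[j * per_batch:(j + 1) * per_batch]
        if batch:
            # the filename carries the block number and the batch number
            subfiles[f"task{task_no}-{j + 1}"] = batch
    return subfiles

out = process_block(1, list(range(100)), 10)
print(len(out))  # 10 subfiles, named task1-1 through task1-10
```

Only one batch is materialized at a time, which is the point of the optional step: the node never needs the whole data block in memory at once.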
S105: the client generates the file corresponding to the data according to the data processing results of the N compute nodes.
After all compute nodes have finished processing their respective subtasks, the storage node merges all subfiles in order according to their filenames, generates the file corresponding to the data, and downloads the file to local storage space.
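The merge order matters because the subfile names carry numeric suffixes: compared as plain strings, "task1-10" would sort before "task1-2". A hedged sketch of the name-ordered merge, assuming the `task<K>-<J>` naming convention from the example above:

```python
# Sketch of the name-ordered merge in S105: the task number and batch
# number are parsed out of each filename and compared numerically.
import re

def merge_order(names):
    def key(name):
        task_no, batch_no = re.fullmatch(r"task(\d+)-(\d+)", name).groups()
        return (int(task_no), int(batch_no))
    return sorted(names, key=key)

print(merge_order(["task2-1", "task1-10", "task1-2"]))
# ['task1-2', 'task1-10', 'task2-1']
```

Concatenating the subfiles in this order reproduces the original block order, and hence the intended file.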
The present invention provides a file generation method based on mass data: the file is generated by processing the mass data in parallel on multiple compute nodes in a spark cluster, and the management node in the spark cluster assigns each data block, according to the task type corresponding to the data block, to the compute node that is strong at processing that type of task, which improves the speed of data processing while achieving load balancing.
With reference to Fig. 5, an embodiment of the present invention further provides a file generation system based on mass data. The system includes a client 51 and a computing engine spark cluster, the spark cluster including a first management node 52 and multiple compute nodes 53. The system is configured such that:
the client 51 sends a first request message to the first management node 52, the first request message requesting that data to be processed be processed to generate a file, the data being composed of N data blocks; the first request message carries storage path information for each of the N data blocks and the task type corresponding to each data block, the task types including central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, N being a positive integer greater than or equal to 2;
the first management node 52 obtains, for each compute node 53 in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks;
the first management node 52 distributes one subtask to each of N compute nodes 53 according to each compute node's processing capacities for CPU-intensive and I/O-intensive tasks and the task types corresponding to the N data blocks, each subtask being used for processing one data block and carrying the path information of that data block, so that the compute node 53 reads the data in the data block according to the path information of the data block carried in the received subtask and processes the data;
the client 51 generates the file corresponding to the data according to the data processing results of the N compute nodes 53.
Further, the system also includes a distributed file system HDFS, the HDFS including a second management node 54 and multiple storage nodes 55. Before the client 51 sends the first request message to the first management node 52, the system is further configured such that:
the client 51 sends a second request message to the second management node 54 to request that the data to be processed be written, the second request message carrying the size information of each of the N data blocks;
the second management node allocates, for each data block, a storage node 55 according to the size of the data block, and sends a response message to the client 51, the response message carrying the path information of the storage node 55 corresponding to each data block;
the client 51 stores the N data blocks to the storage nodes 55 of HDFS according to the path information carried in the response message.
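A hypothetical sketch of the second management node's allocation step. The patent only says a storage node is chosen according to the size of the data block, so the greedy most-free-space policy below is an illustrative assumption:

```python
# Hypothetical sketch of block placement by the second management node.
def allocate_blocks(block_sizes, free_space):
    """block_sizes: block id -> bytes; free_space: storage node -> bytes free."""
    placement = {}
    for blk, size in block_sizes.items():
        node = max(free_space, key=free_space.get)   # most free space first
        if free_space[node] < size:
            raise RuntimeError(f"no storage node can hold block {blk}")
        placement[blk] = node        # path info returned to the client
        free_space[node] -= size
    return placement

print(allocate_blocks({"b1": 40, "b2": 30}, {"s1": 100, "s2": 50}))
# {'b1': 's1', 'b2': 's1'}
```

The returned placement corresponds to the path information carried in the response message, which the client then uses to write each block.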
Further, the first management node 52 obtaining, for each compute node 53 in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks includes:
the first management node 52 reads the running log of each compute node 53 in turn;
for any compute node 53, the first management node 52 obtains, from the running log of the compute node 53, the average time T1 taken by the compute node 53 to process a CPU-intensive task of unit data quantity and the average time T2 taken to process an I/O-intensive task of unit data quantity;
the first management node 52 obtains the processing capacity of the compute node 53 for CPU-intensive tasks according to the T1 value of the compute node 53 and the average time T1' taken by all compute nodes 53 in the spark cluster to process a CPU-intensive task of unit data quantity;
the first management node 52 obtains the processing capacity of the compute node 53 for I/O-intensive tasks according to the T2 value of the compute node 53 and the average time T2' taken by all compute nodes 53 in the spark cluster to process an I/O-intensive task of unit data quantity.
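One plausible reading of this estimate, sketched below, is a relative score: the cluster-average time per unit of data divided by the node's own time, so a faster-than-average node scores above 1. The exact formula is an assumption; the patent only says the capacity is obtained according to T1 and T1'.

```python
# Illustrative reading of the log-based capacity estimate.
def capacity(node_time, cluster_avg_time):
    return cluster_avg_time / node_time

t1 = {"n1": 2.0, "n2": 4.0}              # per-node T1 from the running logs
t1_avg = sum(t1.values()) / len(t1)      # T1' = 3.0
cpu_scores = {n: capacity(t, t1_avg) for n, t in t1.items()}
print(cpu_scores)  # {'n1': 1.5, 'n2': 0.75}: n1 is faster than average
```

The same formula applied to T2 and T2' would give the I/O-intensive score.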
In another embodiment, the first management node 52 obtaining, for each compute node 53 in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks includes:
when the spark cluster starts, for any compute node 53, the first management node 52 instructs the compute node 53 to separately process a CPU-intensive task of a preset data amount and an I/O-intensive task of a preset data amount;
the first management node 52 obtains the time T3 taken by the compute node 53 to process the CPU-intensive task of the preset data amount and the time T4 taken to process the I/O-intensive task of the preset data amount;
the first management node 52 obtains the processing capacity of the compute node 53 for CPU-intensive tasks according to the T3 value of the compute node 53 and the average time T3' taken by all compute nodes 53 in the spark cluster to process the CPU-intensive task of the preset data amount;
the first management node 52 obtains the processing capacity of the compute node 53 for I/O-intensive tasks according to the T4 value of the compute node 53 and the average time T4' taken by all compute nodes 53 in the spark cluster to process the I/O-intensive task of the preset data amount.
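A sketch of the startup benchmark under the same relative-score assumption (cluster average divided by own time). Here `cpu_job` is a stand-in workload, since the patent does not specify the preset task, and both "nodes" are simulated on one machine:

```python
# Hypothetical sketch of the startup benchmark for T3 / T3'.
import time

def bench(job, *args):
    start = time.perf_counter()
    job(*args)
    return time.perf_counter() - start

def cpu_job(n):
    return sum(i * i for i in range(n))   # stand-in CPU-intensive task

t3 = {node: bench(cpu_job, 100_000) for node in ("n1", "n2")}
t3_avg = sum(t3.values()) / len(t3)       # T3'
cpu_capacity = {n: t3_avg / t for n, t in t3.items()}
print(cpu_capacity)
```

An I/O-intensive preset task (e.g. reading and writing a fixed amount of data) would yield T4 and T4' the same way.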
In another embodiment, the first management node 52 obtaining, for each compute node 53 in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks includes:
the first management node 52 obtains, for each compute node 53 in turn, its CPU frequency, memory capacity, network bandwidth and maximum disk read/write speed;
the first management node 52 calculates the average CPU frequency, average memory capacity, average network bandwidth and average maximum disk read/write speed of all compute nodes 53 in the spark cluster;
for any compute node 53, the first management node 52 obtains the processing capacity of the compute node 53 for CPU-intensive tasks according to the CPU frequency and memory capacity of the compute node 53 and the average CPU frequency and average memory capacity of all compute nodes 53 in the spark cluster;
the first management node 52 obtains the processing capacity of the compute node 53 for I/O-intensive tasks according to the network bandwidth and maximum disk read/write speed of the compute node 53 and the average network bandwidth and average maximum disk read/write speed of all compute nodes 53 in the spark cluster.
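The patent does not fix a formula for combining the hardware specs, so the sketch below makes an illustrative choice: normalize each spec against the cluster average, then average the two specs relevant to each task type with equal weight.

```python
# Illustrative combination of hardware specs into per-type capacities.
# Equal weighting of the two specs is an assumption, not the patent's rule.
def hw_capacity(nodes):
    specs = ("freq", "mem", "bw", "disk")
    avg = {s: sum(n[s] for n in nodes.values()) / len(nodes) for s in specs}
    return {name: {
        "cpu": (n["freq"] / avg["freq"] + n["mem"] / avg["mem"]) / 2,
        "io":  (n["bw"] / avg["bw"] + n["disk"] / avg["disk"]) / 2,
    } for name, n in nodes.items()}

cluster = {"n1": {"freq": 3.0, "mem": 32, "bw": 10, "disk": 500},
           "n2": {"freq": 2.0, "mem": 16, "bw": 10, "disk": 250}}
caps = hw_capacity(cluster)
print(caps)  # n1 scores above 1 on both axes, n2 below
```

Unlike the log-based and benchmark-based estimates, this variant needs no task history, so it works before the cluster has processed anything.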
Further, the system is further configured such that:
the first management node 52 reads the running log of each compute node 53 in turn;
for any compute node 53, the first management node 52 obtains, from the running log of the compute node 53, the number of times the compute node 53 has failed to run CPU-intensive tasks and the number of times it has failed to run I/O-intensive tasks;
if the number of CPU-intensive task failures of the compute node 53 is higher than a preset number, the first management node 52 forbids the compute node 53 from running CPU-intensive tasks;
if the number of I/O-intensive task failures of the compute node 53 is higher than a preset number, the first management node 52 forbids the compute node 53 from running I/O-intensive tasks.
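The failure guard can be sketched as a per-type blacklist. The helper name and the preset threshold of 3 are hypothetical:

```python
# Hypothetical per-type blacklist: a node whose failure count for a task
# type exceeds the preset threshold no longer receives tasks of that type.
def allowed_types(failure_counts, threshold=3):
    """failure_counts: node -> {"cpu": failures, "io": failures}."""
    return {node: {t for t, c in counts.items() if c <= threshold}
            for node, counts in failure_counts.items()}

fails = {"n1": {"cpu": 5, "io": 0}, "n2": {"cpu": 1, "io": 2}}
print(allowed_types(fails))  # n1 may now run only I/O tasks
```

A scheduler would consult this map before the strongest-idle-node selection, skipping nodes that are forbidden for the subtask's type.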
Further, the first request message also carries encoding information for the N data blocks, the encoding information indicating the order of the N data blocks in the file to be generated. In this case, the first management node 52 distributing one subtask to each of the N compute nodes 53 according to each compute node's processing capacities for CPU-intensive and I/O-intensive tasks and the task types corresponding to the N data blocks includes:
the first management node 52 generates N subtasks according to the first request message and sorts the N subtasks according to the encoding information of the N data blocks carried in the first request message;
the first management node 52 receives, in real time, heartbeat messages sent by idle compute nodes 53;
the first management node 52 distributes, according to the sorting result of the N subtasks, the N subtasks in turn to N of the idle compute nodes 53, where for any subtask, when allocating the subtask, the first management node assigns it, according to its task type, to the currently idle compute node 53 with the strongest processing capacity for that task type.
Further, the compute node 53 reading the data in the data block according to the path information of the data block carried in the received subtask and processing the data includes:
the compute node 53 divides the data block into n batches according to the order of the data in the data block and processes the batches, n being a positive integer greater than or equal to 2;
each time the compute node 53 finishes processing a batch, it generates a subfile whose filename contains the number information of the data block and the batch number information of the batch;
the compute node 53 writes the subfile to the preset storage node in HDFS;
and the client 51 generating the file corresponding to the data according to the data processing results of the N compute nodes 53 includes:
the storage node merges all subfiles in order according to their filenames, generates the file corresponding to the data, and downloads the file to local storage space.
The present invention provides a file generation system based on mass data: the file is generated by processing the mass data in parallel on multiple compute nodes in a spark cluster, and the management node in the spark cluster assigns each data block, according to the task type corresponding to the data block, to the compute node that is strong at processing that type of task, which improves the speed of data processing while achieving load balancing.
Fig. 6 is a schematic diagram of a terminal device in the mass-data-based file generation system provided by an embodiment of the present invention. As shown in Fig. 6, the terminal device 6 of this embodiment includes a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and runnable on the processor 60, such as a file generation program based on mass data. When executing the computer program 62, the processor 60 implements the steps of each of the above embodiments of the mass-data-based file generation method, such as steps 101 to 105 shown in Fig. 1.
Illustratively, the computer program 62 may be divided into one or more modules/units, the one or more modules/units being stored in the memory 61 and executed by the processor 60 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments describing the execution of the computer program 62 in the terminal device 6.
The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that Fig. 6 is only an example of the terminal device 6 and does not limit the terminal device 6, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the terminal device may also include input/output devices, network access devices, buses, and the like.
The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the mass-data-based file generation method described in any of the above embodiments.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above embodiments are merely illustrative of the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced, and such modifications or replacements, to the extent that they do not depart the essence of the corresponding technical solution from the spirit and scope of the technical solutions of the various embodiments of the present invention, shall all be included within the protection scope of the present invention.

Claims (10)

1. A file generation method based on mass data, characterized in that the method is applied to a computing engine spark cluster, the spark cluster including a first management node and multiple compute nodes, the method comprising:
a client sending a first request message to the first management node, the first request message requesting that data to be processed be processed to generate a file, the data being composed of N data blocks; the first request message carries storage path information for each of the N data blocks and a task type corresponding to each data block, the task types including central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, N being a positive integer greater than or equal to 2;
the first management node obtaining, for each compute node in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks;
the first management node distributing one subtask to each of N compute nodes according to each compute node's processing capacities for CPU-intensive and I/O-intensive tasks and the task types corresponding to the N data blocks, each subtask being used for processing one data block and carrying the path information of that data block, so that the compute node reads the data in the data block according to the path information of the data block carried in the received subtask and processes the data;
the client generating the file corresponding to the data according to the data processing results of the N compute nodes.
2. The file generation method according to claim 1, characterized in that the method is also applied to a distributed file system HDFS, the HDFS including a second management node and multiple storage nodes, and before the client sends the first request message to the first management node, the method further comprises:
the client sending a second request message to the second management node to request that the data to be processed be written, the second request message carrying the size information of each of the N data blocks;
the second management node allocating, for each data block, a storage node according to the size of the data block, and sending a response message to the client, the response message carrying the path information of the storage node corresponding to each data block;
the client storing the N data blocks to the storage nodes of HDFS according to the path information carried in the response message.
3. The file generation method according to claim 1, characterized in that the first management node obtaining, for each compute node in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks comprises:
the first management node reading the running log of each compute node in turn;
for any compute node, the first management node obtaining, from the running log of the compute node, the average time T1 taken by the compute node to process a CPU-intensive task of unit data quantity and the average time T2 taken to process an I/O-intensive task of unit data quantity;
the first management node obtaining the processing capacity of the compute node for CPU-intensive tasks according to the T1 value of the compute node and the average time T1' taken by all compute nodes in the spark cluster to process a CPU-intensive task of unit data quantity;
the first management node obtaining the processing capacity of the compute node for I/O-intensive tasks according to the T2 value of the compute node and the average time T2' taken by all compute nodes in the spark cluster to process an I/O-intensive task of unit data quantity.
4. The file generation method according to claim 1, characterized in that the first management node obtaining, for each compute node in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks comprises:
when the spark cluster starts, for any compute node, the first management node instructing the compute node to separately process a CPU-intensive task of a preset data amount and an I/O-intensive task of a preset data amount;
the first management node obtaining the time T3 taken by the compute node to process the CPU-intensive task of the preset data amount and the time T4 taken to process the I/O-intensive task of the preset data amount;
the first management node obtaining the processing capacity of the compute node for CPU-intensive tasks according to the T3 value of the compute node and the average time T3' taken by all compute nodes in the spark cluster to process the CPU-intensive task of the preset data amount;
the first management node obtaining the processing capacity of the compute node for I/O-intensive tasks according to the T4 value of the compute node and the average time T4' taken by all compute nodes in the spark cluster to process the I/O-intensive task of the preset data amount.
5. The file generation method according to claim 1, characterized in that the first management node obtaining, for each compute node in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks comprises:
the first management node obtaining, for each compute node in turn, its CPU frequency, memory capacity, network bandwidth and maximum disk read/write speed;
the first management node calculating the average CPU frequency, average memory capacity, average network bandwidth and average maximum disk read/write speed of all compute nodes in the spark cluster;
for any compute node, the first management node obtaining the processing capacity of the compute node for CPU-intensive tasks according to the CPU frequency and memory capacity of the compute node and the average CPU frequency and average memory capacity of all compute nodes in the spark cluster;
the first management node obtaining the processing capacity of the compute node for I/O-intensive tasks according to the network bandwidth and maximum disk read/write speed of the compute node and the average network bandwidth and average maximum disk read/write speed of all compute nodes in the spark cluster.
6. The file generation method according to any one of claims 3 to 5, characterized in that the method further comprises:
the first management node reading the running log of each compute node in turn;
for any compute node, the first management node obtaining, from the running log of the compute node, the number of times the compute node has failed to run CPU-intensive tasks and the number of times it has failed to run I/O-intensive tasks;
if the number of CPU-intensive task failures of the compute node is higher than a preset number, the first management node forbidding the compute node from running CPU-intensive tasks;
if the number of I/O-intensive task failures of the compute node is higher than a preset number, the first management node forbidding the compute node from running I/O-intensive tasks.
7. The file generation method according to claim 6, characterized in that the first request message further carries encoding information for the N data blocks, the encoding information indicating the order of the N data blocks in the file to be generated, and the first management node distributing one subtask to each of the N compute nodes according to each compute node's processing capacities for CPU-intensive and I/O-intensive tasks and the task types corresponding to the N data blocks comprises:
the first management node generating N subtasks according to the first request message and sorting the N subtasks according to the encoding information of the N data blocks carried in the first request message;
the first management node receiving, in real time, heartbeat messages sent by idle compute nodes;
the first management node distributing, according to the sorting result of the N subtasks, the N subtasks in turn to N of the idle compute nodes, where for any subtask, when allocating the subtask, the first management node assigns it, according to its task type, to the currently idle compute node with the strongest processing capacity for that task type.
8. The file generation method according to claim 7, characterized in that the compute node reading the data in the data block according to the path information of the data block carried in the received subtask and processing the data comprises:
the compute node dividing the data block into n batches according to the order of the data in the data block and processing the batches, n being a positive integer greater than or equal to 2;
each time the compute node finishes processing a batch, generating a subfile whose filename contains the number information of the data block and the batch number information of the batch;
the compute node writing the subfile to the preset storage node in HDFS;
and the client generating the file corresponding to the data according to the data processing results of the N compute nodes comprises:
the storage node merging all subfiles in order according to their filenames, generating the file corresponding to the data, and downloading the file to local storage space.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.
10. A file generation system based on mass data, characterized in that the file generation system includes a client and a computing engine spark cluster, the spark cluster including a first management node and multiple compute nodes, the system being configured such that:
the client sends a first request message to the first management node, the first request message requesting that data to be processed be processed to generate a file, the data being composed of N data blocks; the first request message carries storage path information for each of the N data blocks and a task type corresponding to each data block, the task types including central processing unit (CPU) intensive tasks and input/output (I/O) intensive tasks, N being a positive integer greater than or equal to 2;
the first management node obtains, for each compute node in turn, its processing capacity for CPU-intensive tasks and its processing capacity for I/O-intensive tasks;
the first management node distributes one subtask to each of N compute nodes according to each compute node's processing capacities for CPU-intensive and I/O-intensive tasks and the task types corresponding to the N data blocks, each subtask being used for processing one data block and carrying the path information of that data block, so that the compute node reads the data in the data block according to the path information of the data block carried in the received subtask and processes the data;
the client generates the file corresponding to the data according to the data processing results of the N compute nodes.
CN201811250926.0A 2018-10-25 2018-10-25 Document generating method and system based on mass data Pending CN109309726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811250926.0A CN109309726A (en) 2018-10-25 2018-10-25 Document generating method and system based on mass data

Publications (1)

Publication Number Publication Date
CN109309726A true CN109309726A (en) 2019-02-05

Family

ID=65221965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811250926.0A Pending CN109309726A (en) 2018-10-25 2018-10-25 Document generating method and system based on mass data

Country Status (1)

Country Link
CN (1) CN109309726A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110187971A (en) * 2019-05-30 2019-08-30 口碑(上海)信息技术有限公司 Service request processing method and device
CN110995725A (en) * 2019-12-11 2020-04-10 北京明略软件系统有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111580979A (en) * 2020-05-14 2020-08-25 哈尔滨工业大学(深圳) Data processing method, device and system based on atmospheric radiation transmission model
CN112565321A (en) * 2019-09-26 2021-03-26 杭州海康威视数字技术股份有限公司 Data stream pushing method, device and system
CN112579297A (en) * 2020-12-25 2021-03-30 中国农业银行股份有限公司 Data processing method and device
CN112579351A (en) * 2020-11-16 2021-03-30 麒麟软件有限公司 Cloud hard disk backup system
CN112817728A (en) * 2021-02-20 2021-05-18 咪咕音乐有限公司 Task scheduling method, network device and storage medium
CN112988360A (en) * 2021-05-10 2021-06-18 杭州绿城信息技术有限公司 Task distribution system based on big data analysis
WO2021218619A1 (en) * 2020-04-30 2021-11-04 华为技术有限公司 Task allocation method and apparatus, and task processing system
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
CN113709298A (en) * 2020-05-20 2021-11-26 华为技术有限公司 Multi-terminal task allocation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100125847A1 (en) * 2008-11-17 2010-05-20 Fujitsu Limited Job managing device, job managing method and job managing program
US20120110047A1 (en) * 2010-11-15 2012-05-03 International Business Machines Corporation Reducing the Response Time of Flexible Highly Data Parallel Tasks
CN103500123A (en) * 2013-10-12 2014-01-08 浙江大学 Parallel computation dispatch method in heterogeneous environment
CN104598298A (en) * 2015-02-04 2015-05-06 上海交通大学 Virtual machine dispatching algorithm based on task load and current work property of virtual machine
CN104657221A (en) * 2015-03-12 2015-05-27 广东石油化工学院 Multi-queue peak-staggering scheduling model and method based on task classification in cloud computing
CN107832153A (en) * 2017-11-14 2018-03-23 北京科技大学 Hadoop cluster resource adaptive allocation method

Similar Documents

Publication Publication Date Title
CN109309726A (en) Document generating method and system based on mass data
CN110520853B (en) Queue management for direct memory access
Yan et al. Blogel: A block-centric framework for distributed computation on real-world graphs
Abad et al. Package-aware scheduling of faas functions
CN110262901B (en) Data processing method and data processing system
CN109947565B (en) Method and apparatus for distributing computing tasks
CN108924187B (en) Task processing method and device based on machine learning and terminal equipment
CN109033001A (en) Method and apparatus for distributing GPU
CN111949394A (en) Method, system and storage medium for sharing computing power resource
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
CN105471985A (en) Load balance method, cloud platform computing method and cloud platform
CN109408229A (en) Scheduling method and device
CN103176849A (en) Virtual machine clustering deployment method based on resource classification
CN113946431B (en) Resource scheduling method, system, medium and computing device
Zegrari et al. Resource allocation with efficient load balancing in cloud environment
CN109614227A (en) Task resource concocting method, device, electronic equipment and computer-readable medium
Wang et al. Phase-reconfigurable shuffle optimization for Hadoop MapReduce
Ke et al. Aggregation on the fly: Reducing traffic for big data in the cloud
US8028291B2 (en) Method and computer program product for job selection and resource allocation of a massively parallel processor
CN114896068A (en) Resource allocation method, resource allocation device, electronic device, and storage medium
Al-kahtani et al. An efficient distributed algorithm for big data processing
CN105426255A (en) Network I/O (input/output) cost evaluation based ReduceTask data locality scheduling method for Hadoop big data platform
US8543722B2 (en) Message passing with queues and channels
Nguyen et al. Resource management for elastic publish subscribe systems: A performance modeling-based approach
CN109544347A (en) Tail difference balancing method, computer readable storage medium and tail difference balancing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2019-02-05