CN107066328A - The construction method of large-scale data processing platform - Google Patents

The construction method of large-scale data processing platform Download PDF

Info

Publication number
CN107066328A
CN107066328A (application CN201710357465.6A)
Authority
CN
China
Prior art keywords
data
memory node
storage
data processing
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710357465.6A
Other languages
Chinese (zh)
Inventor
赖真霖
文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sixiang Lianchuang Technology Co Ltd
Original Assignee
Chengdu Sixiang Lianchuang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sixiang Lianchuang Technology Co Ltd filed Critical Chengdu Sixiang Lianchuang Technology Co Ltd
Priority to CN201710357465.6A priority Critical patent/CN107066328A/en
Publication of CN107066328A publication Critical patent/CN107066328A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing a large-scale data processing platform. The method includes: adding multiple pre-processing load nodes to the MapReduce model; tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations; and optimizing load balancing of cloud storage resources using state transfer and dynamic programming. The invention organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.

Description

The construction method of large-scale data processing platform
Technical field
The present invention relates to data computing, and more particularly to a method for constructing a large-scale data processing platform.
Background technology
Cloud computing technology offers distributed computation, ultra-large scale, virtualization, high reliability, elastic scalability, and on-demand service, and can provide more efficient analysis and better computing capability for big-data processing. Processing the hundreds of millions of small files involved in big data, such as web pages and mails, requires storage support from a distributed storage system and directory system. With the growing demand for processing large numbers of small text files, heterogeneous data sources abound across different information systems; data lacks a unified standardization method; and in some fields, large numbers of small text files are difficult to analyze effectively and to store and retrieve efficiently.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a method for constructing a large-scale data processing platform, including:
adding multiple pre-processing load nodes to the MapReduce model;
tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations;
optimizing load balancing of cloud storage resources using state transfer and dynamic programming.
Preferably, the task executed by these load nodes is a subtask of the task distributed by the master node before the Map tasks run; it pre-processes the user's constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request. After task partitioning, the multi-valued mapping relations are read from the XML file. When a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily. On completion, the customized key-value pairs are collected again; once the constraints of the data processing have been fully applied, the Map and Reduce phases of MapReduce scheduling are started.
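The constraint pre-processing above can be sketched as follows. This is a minimal illustration, not the patented implementation: the XML schema (`<constraints>`/`<mapping>` elements) and all key names are hypothetical stand-ins for the dynamically generated constraint file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical constraint file: each <mapping> groups several input keys
# under one output key (the many-to-one relation described in the text).
CONSTRAINT_XML = """
<constraints>
  <mapping target="region_a"><source>k1</source><source>k2</source></mapping>
  <mapping target="region_b"><source>k3</source></mapping>
</constraints>
"""

def load_mappings(xml_text):
    """Read the many-to-one key mapping from the XML constraint description."""
    root = ET.fromstring(xml_text)
    table = {}
    for m in root.iter("mapping"):
        for s in m.iter("source"):
            table[s.text] = m.get("target")
    return table

def premap(records, mapping):
    """Pre-processing load node: rewrite keys before the real Map phase runs."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[mapping.get(key, key)].append(value)
    return dict(grouped)

result = premap([("k1", 1), ("k2", 2), ("k3", 3)], load_mappings(CONSTRAINT_XML))
```

A real deployment would run `premap` on the dedicated load nodes and hand the grouped key-value pairs to the normal Map phase.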
Preferably, optimizing load balancing of cloud storage resources with the dynamic programming mechanism further includes:
Let Cdata = {1, 2, …, m} denote the set of all stored data blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node in the cloud storage platform receives a group of storage resources. The cloud storage resource allocation problem is then expressed as solving for the maximum of Σ_{i=1}^{n} L(u(i), i).
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage-efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set.
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages. The state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i.
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a fixed step; compute the maximum storage efficiency V(x(i), i) of assigning the remaining resources x(i) to the n−i storage nodes after the i-th, and record the related data in the data set NoteData[i] = {x(i), u(i), V(x(i), i)}.
(4) When i = n, distribute data according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n.
Use the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, …, 2, 1,
with V(x(n), n) = L(x(n), n).
Derive the optimal value of each stage; during assignment, determine the boundary value of the decision variable u(i) from the load capacity c_i of the storage node at each stage.
(5) Recursively compute the optimal decision sequence NoteData(u(1), u(2), …, u(n)). If Σ_{i=1}^{n} u(i) ≠ m, i.e. the data resources are not fully assigned, repeat the recursion, taking the second-best value of each stage in turn, until Σ_{i=1}^{n} u(i) = m.
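The five-step allocation above amounts to a standard finite-horizon dynamic program. A minimal sketch in Python, assuming integer data groups, a caller-supplied efficiency function L(u, i), and per-node capacity bounds; the consistent-hashing initialization and the second-best-value fallback of step (5) are omitted:

```python
from functools import lru_cache

def allocate(m, n, L, caps):
    """
    Dynamic-programming allocation of m data groups over n storage nodes.
    L(u, i) is the storage-efficiency gain of giving u groups to node i;
    caps[i] bounds u(i) by that node's load capacity c_i.
    Returns (best total efficiency, allocation u(1..n)).
    """
    @lru_cache(maxsize=None)
    def V(x, i):
        # x = data remaining when entering stage i (state variable)
        if i == n - 1:                 # last node takes what its capacity allows
            u = min(x, caps[i])
            return L(u, i), (u,)
        best, best_seq = float("-inf"), ()
        for u in range(0, min(x, caps[i]) + 1):
            tail_val, tail_seq = V(x - u, i + 1)   # state transfer x(i+1) = x(i) - u(i)
            if L(u, i) + tail_val > best:
                best, best_seq = L(u, i) + tail_val, (u,) + tail_seq
        return best, best_seq

    return V(m, 0)
```

For example, with 4 data groups, 2 nodes of capacity 3, and an efficiency function that favors the second node, the recursion allocates 1 group to node 1 and 3 to node 2.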
Compared with the prior art, the present invention has the following advantage:
it organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.
Brief description of the drawings
Fig. 1 is a flow chart of the method for constructing a large-scale data processing platform according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with drawings illustrating the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a method for constructing a large-scale data processing platform. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.
The present invention builds an index structure through document classification and merges small files into large files based on weight similarity for processing in a cloud computing environment. When classifying small text files, the k-nearest-neighbor (kNN) classification process is described with MapReduce; feature-vector comparison is added to kNN, and feature vectors sharing the same feature words are re-ordered and recombined. For the complex processing and content-mapping relations during document retrieval, the MapReduce model is improved with XML and multi-valued processing. Data content, coordinates, operations, and other information are tagged with XML to support complex data processing. Data content generally carries mapping relations; through XML tagging and multi-valued processing in the Map phase, the data-processing operations are realized.
First, documents are pre-classified by format. Sorted text documents are then classified with the improved kNN classification method based on MapReduce and feature-vector reduction. Small text files of the same class are then merged to generate large files. The small text files are written into the large file in chronological order; the name, replica, and position information of the large file are written to the namenode, and the content is written to the datanode.
The comparison of traditional feature vectors is added to the kNN algorithm: first find the identical feature words of two original feature vectors and their weights, recombine the two feature vectors in the order of the shared feature words, and then compute the similarity between the two vectors from the weight vectors corresponding to those feature words.
The method is described as follows. All texts in the training set are pre-processed to generate feature vectors in key-value-pair form.
Step 1. Normalize the feature vector T of the input text and the feature-vector set ET of the training samples, and compute the feature words shared by T and ET.
Step 2. Extract the shared feature words and their weights to form new vectors NT and NET.
Step 3. Use MapReduce to compute similarity: the similarity sim(t, x) between the unary vectors formed by the weights of the two feature vectors.
Step 4. MapReduce sorts the computed similarity results of the texts.
Step 5. Take the k texts with the highest similarity and accumulate the similarity per category over these k texts.
Step 6. Take the maximum similarity S_i and its corresponding category C_i.
Step 7. If S_i exceeds a predefined similarity threshold, the text is identified as belonging to class C_i.
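Steps 1 through 7 can be illustrated compactly. The sketch below assumes feature vectors stored as word-to-weight dictionaries; the single-process loop stands in for the MapReduce sorting and aggregation of steps 3 to 5.

```python
import math
from collections import Counter

def shared_similarity(t, e):
    """Steps 1-3: cosine similarity computed over the shared feature words only."""
    common = set(t) & set(e)
    if not common:
        return 0.0
    nt = [t[w] for w in sorted(common)]   # re-ordered so shared words align (NT)
    ne = [e[w] for w in sorted(common)]   # (NET)
    dot = sum(a * b for a, b in zip(nt, ne))
    return dot / (math.sqrt(sum(a * a for a in nt)) *
                  math.sqrt(sum(b * b for b in ne)))

def knn_classify(query, train, k=3):
    """Steps 4-7: rank training texts by similarity and vote among the top k."""
    scored = sorted(((shared_similarity(query, vec), label) for vec, label in train),
                    reverse=True)[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim            # accumulate similarity per category
    return votes.most_common(1)[0][0]  # category with maximum accumulated similarity
```

The threshold check of step 7 would compare the winning accumulated similarity against a predefined value before accepting the label.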
For the index structure, a k-d tree partitioning the data set forms the trunk of the whole tree structure. If the k-d tree is empty, the point becomes the root node directly. Otherwise, compare the point's value with the root node's value in the corresponding dimension and descend into the left or right subtree: if the point is smaller than the root's value in that dimension, search in the left subtree until some node's left or right subtree is empty, then insert the point as its leaf node; if the point is larger than the root's value in that dimension, insert into the right subtree. Then, a locality-sensitive hashing structure is mounted on the leaf nodes of the k-d tree; that is, the remaining points are placed into locality-sensitive hashes. The data set X is converted into binary strings in the space; parameters r > 0 and c > 1 are selected in advance, k hash functions are chosen at random, and the data points are stored in the corresponding hash tables using these hash functions.
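The hashing layer mounted on the leaf nodes can be sketched with bit-sampling LSH over the binary strings mentioned above. The k-d-tree trunk and the parameters r and c are omitted, and the hash-function family shown here (randomly chosen bit positions) is one common choice, not necessarily the one the patent intends:

```python
import random

def build_lsh(points, k=4, tables=3, seed=0):
    """
    Bit-sampling LSH over equal-length binary strings: each table hashes a
    point by k randomly chosen bit positions, so nearby strings tend to
    collide in at least one table.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    funcs = [rng.sample(range(dim), k) for _ in range(tables)]  # k random bits per table
    index = [{} for _ in range(tables)]
    for p in points:
        for t, bits in enumerate(funcs):
            key = tuple(p[b] for b in bits)
            index[t].setdefault(key, []).append(p)
    return funcs, index

def query_lsh(q, funcs, index):
    """Collect candidates from every table whose bucket matches q's key."""
    cands = set()
    for t, bits in enumerate(funcs):
        cands.update(index[t].get(tuple(q[b] for b in bits), []))
    return cands
```

A query only inspects the buckets its own key falls into, which is what makes the structure cheaper than a linear scan over all points.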
Based on the above file indexing rules, small text files are merged. Suppose there are files A_1, A_2, …, A_n, where A_i = a_i1, a_i2, …, a_ik, …, and a_ik is the k-th character of the file name. The concrete steps are:
Step 1. For each input character string A_i (i = 1, 2, …, n), find the extension separator a_ik = '.' and intercept all characters after it. Count the number of files of this class in the block, denoted m_ij. Compute in turn the number of such files contained in each block on the same node, obtaining the sequence m_i1, m_i2, …, m_in; then m_i = Σ m_ij (j = 0, 1, …, n) represents the class of extensions contained on this node.
Step 2. Count the number M of all small text files stored on this node, obtaining the weight of each small-text-file class in the classification.
Step 3. Compute the proportion m_i / M of each file type and sort in descending order. The resulting extension list is maintained on the datanode.
Step 4. Count the roots of the m_i on this node to form a root-node list. Each extension has one root-node list; this list is maintained on the datanode.
Step 5. According to the Reduce task where the block is to be placed, obtain the extension of this block.
Step 6. Read the root node of the block to be placed, set up the root-node list, and sort the roots according to the maximum-weight-similarity principle.
Step 7. Select the first-ranked root in this block.
Step 8. Find the node in the cluster where this extension has the largest proportion and search it for this root; if found, place the block there.
Step 9. Exclude this node from the candidate list and check whether the list is empty. If not empty, go to step 8.
Step 10. Exclude this root from the root list and check whether the root list is empty. If not empty, go to step 7; if empty, store the block at random on a node holding this extension.
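Steps 1-3 and 7-10 reduce to computing per-node extension proportions and preferring the node where the block's extension is most common. A simplified sketch; the root-node lists and weight-similarity ordering of steps 4-6 are omitted:

```python
from collections import Counter

def extension_profile(filenames):
    """Steps 1-3: count files per extension on a node and sort by share m_i / M."""
    counts = Counter(name.rsplit(".", 1)[-1] for name in filenames if "." in name)
    total = sum(counts.values())
    return sorted(((c / total, ext) for ext, c in counts.items()), reverse=True)

def pick_node(block_ext, nodes):
    """Steps 7-10 (simplified): choose the node where this extension's share
    is largest; nodes that do not hold the extension score zero."""
    best, best_share = None, -1.0
    for node, files in nodes.items():
        share = {e: s for s, e in extension_profile(files)}.get(block_ext, 0.0)
        if share > best_share:
            best, best_share = node, share
    return best
```

Placing `.log` blocks on the node already dominated by `.log` files keeps files of one class clustered, which is the point of the merging rules above.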
For the complex processing and content-mapping relations during file retrieval, multiple pre-processing load nodes are added to the original MapReduce model. The task these load nodes execute is a subtask of the task distributed by the master node before the Map tasks run; it pre-processes the user's constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request. After task partitioning, the multi-valued mapping relations are read from the XML file. When a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily. On completion, the customized key-value pairs are collected again; once the constraint relations of the data processing have been handled, the Map and Reduce phases of MapReduce scheduling are started.
Further, to realize cloud storage load balancing, let Cdata = {1, 2, …, m} denote the set of all stored data blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node in the cloud storage platform receives a group of storage resources. The cloud storage resource allocation problem is expressed as solving for the maximum of Σ_{i=1}^{n} L(u(i), i).
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage-efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set.
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages. The state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i.
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a fixed step; compute the maximum storage efficiency V(x(i), i) of assigning the remaining resources x(i) to the n−i storage nodes after the i-th, and record the related data in the data set NoteData[i] = {x(i), u(i), V(x(i), i)}.
(4) When i = n, distribute data according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n.
Use the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, …, 2, 1,
with V(x(n), n) = L(x(n), n).
Derive the optimal value of each stage; during assignment, determine the boundary value of the decision variable u(i) from the load capacity c_i of the storage node at each stage.
(5) Recursively compute the optimal decision sequence NoteData(u(1), u(2), …, u(n)). If Σ_{i=1}^{n} u(i) ≠ m, i.e. the data resources are not fully assigned, repeat the recursion, taking the second-best value of each stage in turn, until Σ_{i=1}^{n} u(i) = m.
Based on the above improved MapReduce framework, under a highly concurrent retrieval environment, the present invention designs a shared retrieval architecture with two levels of sharing: the first level uses a public sample management mechanism to realize shared sampling, reducing redundant I/O overhead; the second level abstracts the sharing of online-aggregation computation into a special ACQ optimization problem. The invention merges multiple retrieval jobs at the subtask level; that is, task-level merging is performed according to the correlation of the subtasks of each retrieval operation, and the merged shared tasks are sent to the compute nodes for further processing. The flow of the Hadoop-based shared retrieval framework may include: the retrieval collector collects a group of retrieval requests and performs task-level merging by analyzing the Map subtasks of each retrieval operation, forming a series of shared Map tasks; the shared Map tasks are assigned to the compute nodes for processing, including collecting sample data from HDFS and computing the associated statistics; the Reduce tasks complete approximate estimation and precision judgment from the statistics, returning if the user's accuracy requirement is met and otherwise repeating the above operations.
Given two retrievals Q_1 and Q_2 with corresponding Map subtask sets M_1 = {M_1,1, M_1,2, …, M_1,m} and M_2 = {M_2,1, M_2,2, …, M_2,n}, the sharing rules of the invention are: if two Map subtasks M_i,1 ∈ M_1 and M_j,2 ∈ M_2 have the same input data, i.e. the same data block B_i = B_j, the two subtasks are merged into a shared Map task, thereby merging two independent I/O pipelines and completing sampling sharing through a single unified access to block B_i; if, besides the same data block, the two subtasks also have the same retrieval predicate and the same aggregate type (including SUM, COUNT, AVG), the shared Map task also merges the statistics computation of the two, completing statistics sharing by computing and multiplexing the intermediate statistics; if two Map subtasks have no identical input data, i.e. B_i ≠ B_j, they cannot be merged into a shared Map task.
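The first sharing rule, fusing Map subtasks that read the same data block into one shared task with a single I/O pipeline, can be sketched as a grouping by block ID; the predicate and aggregate-type matching needed for statistics sharing is omitted:

```python
from collections import defaultdict

def merge_map_tasks(queries):
    """
    Merge the Map subtasks of several queries: subtasks reading the same
    input block are fused into one shared Map task (shared sampling), so
    each distinct block is read through a single I/O pipeline.
    queries: {query_id: [block_id, ...]}
    returns: {block_id: [query_ids sharing that block's read]}
    """
    shared = defaultdict(list)
    for qid, blocks in queries.items():
        for b in blocks:
            shared[b].append(qid)
    return dict(shared)

# Q1 and Q2 both read B2, so its scan is performed once and shared.
merged = merge_map_tasks({"Q1": ["B1", "B2"], "Q2": ["B2", "B3"]})
```

Blocks mapped to more than one query ID are exactly the B_i = B_j cases above; singleton entries correspond to subtasks that cannot be merged.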
For the above sharing modes and sharing rules, the invention adopts the following sharing strategy: for each data block B_i, a unified I/O pipeline is built for sample collection, and the obtained random samples are stored in an in-memory sample buffer to support subsequent shared sampling. For first-level sharing, according to the sample demand of each merged Map task in each round of accuracy estimation, sample sets of the corresponding sizes are read from the buffer and distributed to the Map tasks that satisfy the sampling-sharing condition, completing the computation. If statistics sharing is needed within a shared Map task, second-level sharing obtains the respective sample sets from the first-level results and performs grouped computation of the intermediate statistics according to the underlying Map-task sharing groups; each sharing group obtains its own statistics by multiplexing the intermediate statistics, completing the computation.
The grouped computation of the statistics can be completed in two phases: a division phase and an adjustment phase. Given an input sample set k = {k_1i, k_2i, …, k_ni}, sorted in ascending order, the division phase determines an initial sharing grouping scheme with a greedy strategy; the task of the adjustment phase is to make local adjustments to the Map tasks in adjacent sharing groups.
The division phase uses the variance of a group of sample sizes as the measure of their difference, and separates differing sample sizes by splitting the sharing groups with larger variance. First, the overall sharing overhead of the current grouping scheme is computed and denoted c_min. Next, the sharing group with the maximum variance is chosen from the scheme as the candidate for splitting and divided into two new sharing groups at the mean of the sample sizes in the group. Then, the overall sharing overhead of the newly produced grouping scheme is computed and denoted c_cur. If c_cur ≤ c_min, the new grouping scheme is retained and the splitting flow is repeated; otherwise the former grouping scheme is returned.
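A minimal sketch of the division phase. The patent does not define the overhead function, so the sum of within-group variances is used here as an assumed stand-in for the overall sharing overhead, and a split is kept only while it lowers that cost:

```python
import statistics

def split_groups(samples, max_groups=4):
    """
    Greedy division phase: repeatedly split the group with the largest
    variance at its mean, while the split keeps lowering the cost
    (sum of within-group variances, an assumed overhead measure).
    """
    groups = [sorted(samples)]

    def cost(gs):
        return sum(statistics.pvariance(g) for g in gs if len(g) > 1)

    while len(groups) < max_groups:
        g = max(groups, key=lambda g: statistics.pvariance(g) if len(g) > 1 else 0)
        if len(g) < 2:
            break
        mean = statistics.fmean(g)
        low = [x for x in g if x <= mean]      # split at the group mean
        high = [x for x in g if x > mean]
        if not low or not high:
            break
        candidate = [x for x in groups if x is not g] + [low, high]
        if cost(candidate) >= cost(groups):
            break                              # splitting no longer helps
        groups = candidate
    return groups
```

On a bimodal input the first split already separates the two clusters, after which no further split reduces the cost.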
In the adjustment phase, define sg_r, the i-th sharing group of the grouping scheme, as the move-out group of the sample size, and sg_l, the (i−1)-th sharing group, as the move-in group. The sample sizes smaller than the mean sample size within the group form the initial candidate migration set cand. The elements of cand are further judged by priority, and the better sample sizes are chosen for migration. For each element cand[j], count eg_r, the number of remaining sample sizes in sg_r that share a common boundary with it, and eg_l, the number of sample sizes in sg_l that share a common boundary with it. Define two variables CE_r and CE_l that sort the eg_r and eg_l corresponding to cand[j], ascending in CE_r and descending in CE_l. For any cand[j], its index positions rInd in CE_r and lInd in CE_l serve as normalized priority parameters, and weight coefficients w_in and w_out are introduced to adjust the influence of eg_r and eg_l on the priority. The migration priority, considering the influence of eg_r and eg_l, is computed as:
Rank = w_in · rInd + w_out · lInd
where the weight coefficients satisfy w_in + w_out = 1. The migration priority of each candidate is obtained, and the sample size with the highest priority is migrated between adjacent sharing groups to obtain a new grouping scheme; by computing and comparing the sharing cost it can be judged whether the migration is effective, until the sharing cost is no longer reduced, and the final sharing grouping scheme is returned.
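The priority computation can be sketched directly from the Rank formula. The boundary counts eg_r and eg_l are taken as given inputs, and the candidate with the smallest Rank value is treated here as having the highest migration priority; that direction is an assumption, since the patent only states that the highest-priority sample size is chosen:

```python
def migration_priority(cand, boundary_out, boundary_in, w_in=0.5, w_out=0.5):
    """
    Adjustment phase: Rank = w_in * rInd + w_out * lInd, where rInd is the
    candidate's position in CE_r (boundary counts in the move-out group
    sg_r, ascending) and lInd its position in CE_l (boundary counts in the
    move-in group sg_l, descending). Returns the candidate chosen to migrate.
    """
    ce_r = sorted(cand, key=lambda j: boundary_out[j])                 # ascending
    ce_l = sorted(cand, key=lambda j: boundary_in[j], reverse=True)    # descending
    ranks = {j: w_in * ce_r.index(j) + w_out * ce_l.index(j) for j in cand}
    return min(cand, key=lambda j: ranks[j])   # smallest Rank = highest priority
```

Intuitively this favors a sample size with few shared boundaries in its current group and many in the neighboring group, so moving it improves both groups.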
Given more than one table retrieval, the Map function processes the corresponding Map tasks or sharing groups separately according to the different sharing demands, reads the input data, and computes statistics over the sample sets; the statistics computed in each round serve as the input data of the Reduce function. First, the Map function loads global variables to support subsequent statistics computation, and reads the sampling-shared Map task set and the statistics-sharing groups from the variables. Next, each arriving key-value pair is first cached in the public sample buffer and read out according to the different sharing demands. For sampling sharing, once enough samples have been accumulated in the buffer, each required sample size is obtained and the retrieval-type pairs in the variables are updated; statistics are then computed, and key-value pairs are formed with the current retrieval ID as the key and the statistic together with the current Map task ID as the value, serving as the input data of the subsequent Reduce function.
In summary, the present invention proposes a method for constructing a large-scale data processing platform that organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention can be implemented with a general-purpose computing system: they can be concentrated on a single computing system or distributed over a network composed of multiple computing systems, and optionally implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are only intended to exemplify or explain the principles of the invention and are not to be construed as limiting it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall be included in the scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications falling within the scope and boundary of the claims, or the equivalents of such scope and boundary.

Claims (3)

1. A method for constructing a large-scale data processing platform, characterized by including:
adding multiple pre-processing load nodes to the MapReduce model;
tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations;
optimizing load balancing of cloud storage resources using state transfer and dynamic programming.
2. The method according to claim 1, characterized in that the task executed by the load nodes is a subtask of the task distributed by the master node before the Map tasks run, which pre-processes the user's constraint relations; the user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request; after task partitioning, the multi-valued mapping relations are read from the XML file; when a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily; on completion, the customized key-value pairs are collected again, the constraint relations of the data processing are handled, and then the Map and Reduce phases of MapReduce scheduling are started.
3. according to the method described in claim 1, it is characterised in that described that Dynamic Programming mechanism pair is used in cloud storage resource Load balance is optimized, and is further comprised:
The set of all data storage blocks in cloud storage is represented with Cdata={ 1,2 ... m };, k ∈ Cdata represent that kth group is deposited Data are stored up, m is total group of number of data storage in the cloud storage that need to be distributed;Remember i-th of memory node acquisition group in cloud storage platform The storage efficiency of storage resource be L (u (i), i);Cloud storage resources configuration optimization problem is expressed as to solveMost Big value;
(1) in initialization procedure, the data in CData are hashed into Distribution Strategy according to uniformity, are divided into m group data, storage section The virtual storage efficiency value e and load capacity c for turning to n memory node, initializing memory node of point;Stage Counting device i is set;
(2) according to the memory node number of virtualization, this resource allocation process is divided into n stage;Determine state variable x (i+ 1) remaining data after 1 to i memory node of distribution, are represented;
(3) x (i) travels through its interval [u (i) with certain step-lengthmin,u(i)max], surplus resources x (i) is distributed in calculating The maximum storage efficiency V (x (i), i), while related data record is existed of n-i memory node after i-th of memory node Data acquisition system NoteData [i] x (i), u (i), V (x (i), i) } in;
(4) as i=n, data distribution, u (n) are carried out according to the load capacity c and storage efficiency e of n-th of memory node<=cn
Using the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i) ∈ U(x(i))} { L(u(i), i) + V(x(i+1), i+1) }, i = n−1, n−2, …, 1,
with the boundary condition V(x(n), n) = L(x(n), n),
the optimal value of each stage is derived; during allocation, the load capacity c_i of the storage node at each stage determines the boundary value of the decision variable u(i);
(5) The recursive calculation yields the optimal decision sequence NoteData(u(1), u(2), …, u(n)); if ∑_{i=1}^{n} u(i) ≠ m, i.e., the data resources are not fully assigned, the recursion is repeated, taking the second-best value of each stage in turn, until ∑_{i=1}^{n} u(i) = m.
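The staged recursion of claim 3 can be sketched with a memoized dynamic program. This is a simplified sketch under assumptions the claim leaves open: data groups are treated as integers, the efficiency function L(u, i) and the capacities in the usage line are toy examples, and stages are indexed 0…n−1 rather than 1…n.

```python
# Minimal sketch of the claim-3 allocation: distribute m data groups over n
# storage nodes, maximizing total storage efficiency sum_i L(u_i, i) subject to
# the load capacity u_i <= capacities[i]. It uses the state equation
# x(i+1) = x(i) - u(i) and the recursion
# V(x, i) = max_{u in U(x)} { L(u, i) + V(x - u, i + 1) } from the claim.
from functools import lru_cache

def allocate(m, capacities, L):
    """Return (optimal total efficiency, decision sequence (u_0, ..., u_{n-1}))."""
    n = len(capacities)

    @lru_cache(maxsize=None)
    def V(x, i):
        if i == n - 1:                        # final stage: bounded by capacity c_n
            u = min(x, capacities[i])
            return L(u, i), (u,)
        best, best_seq = float("-inf"), ()
        for u in range(0, min(x, capacities[i]) + 1):   # u(i) in U(x(i))
            tail, seq = V(x - u, i + 1)                 # x(i+1) = x(i) - u(i)
            if L(u, i) + tail > best:
                best, best_seq = L(u, i) + tail, (u,) + seq
        return best, best_seq

    return V(m, 0)

# Toy efficiency with diminishing returns; earlier nodes slightly preferred.
value, plan = allocate(6, [4, 4, 4], lambda u, i: u * (10 - i) - u * u)
print(value, plan)
```

The memoization plays the role of the NoteData records: each (x(i), i) state and its best value are computed once and reused when the recursion revisits that state.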
CN201710357465.6A 2017-05-19 2017-05-19 The construction method of large-scale data processing platform Withdrawn CN107066328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710357465.6A CN107066328A (en) 2017-05-19 2017-05-19 The construction method of large-scale data processing platform

Publications (1)

Publication Number Publication Date
CN107066328A true CN107066328A (en) 2017-08-18

Family

ID=59609463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710357465.6A Withdrawn CN107066328A (en) 2017-05-19 2017-05-19 The construction method of large-scale data processing platform

Country Status (1)

Country Link
CN (1) CN107066328A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104065685A (en) * 2013-03-22 2014-09-24 中国银联股份有限公司 Data migration method in cloud computing environment-oriented layered storage system
CN105069524A (en) * 2015-07-29 2015-11-18 中国西电电气股份有限公司 Planned scheduling optimization method based on large data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ren Chongguang (任崇广): "Research on Cloud Computing and Its Key Technologies for Massive Data Processing", China Doctoral Dissertations Full-text Database *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111211993A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Incremental persistence method and device for streaming computation
CN111211993B (en) * 2018-11-21 2023-08-11 百度在线网络技术(北京)有限公司 Incremental persistence method, device and storage medium for stream computation
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
CN110704515B (en) * 2019-12-11 2020-06-02 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
CN111506621A (en) * 2020-03-31 2020-08-07 新华三大数据技术有限公司 Data statistical method and device
CN115617279A (en) * 2022-12-13 2023-01-17 北京中电德瑞电子科技有限公司 Distributed cloud data processing method and device and storage medium
CN115617279B (en) * 2022-12-13 2023-03-31 北京中电德瑞电子科技有限公司 Distributed cloud data processing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170818