CN107066328A - The construction method of large-scale data processing platform - Google Patents

The construction method of large-scale data processing platform Download PDF

Info

Publication number
CN107066328A
CN107066328A (application CN201710357465.6A)
Authority
CN
China
Prior art keywords
data
memory node
storage
data processing
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710357465.6A
Other languages
Chinese (zh)
Inventor
赖真霖
文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sixiang Lianchuang Technology Co Ltd
Original Assignee
Chengdu Sixiang Lianchuang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sixiang Lianchuang Technology Co Ltd filed Critical Chengdu Sixiang Lianchuang Technology Co Ltd
Priority to CN201710357465.6A priority Critical patent/CN107066328A/en
Publication of CN107066328A publication Critical patent/CN107066328A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing a large-scale data processing platform. The method includes: adding multiple pre-processing load nodes to the MapReduce model; tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations; and optimizing load balancing of cloud storage resources using state transfer and dynamic programming. The invention organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.

Description

The construction method of large-scale data processing platform
Technical field
The present invention relates to data computing, and more particularly to a method for constructing a large-scale data processing platform.
Background technology
Cloud computing technology offers distributed computation, ultra-large scale, virtualization, high reliability, elastic scalability, and on-demand service, and can provide more efficient analysis and better computing capability for big-data processing. Processing the hundreds of millions of small files involved in big data, such as web pages and mails, requires storage support from a distributed storage system and directory system. With the growing demand for processing large numbers of small text files, heterogeneous data sources abound across different information systems; data lacks a unified standardization method; and in some fields, large numbers of small text files are difficult to analyze effectively and to store and retrieve efficiently.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a method for constructing a large-scale data processing platform, including:
adding multiple pre-processing load nodes to the MapReduce model;
tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations;
optimizing load balancing of cloud storage resources using state transfer and dynamic programming.
Preferably, the task executed by these load nodes is a subtask of the task distributed by the master node before the Map tasks run; it pre-processes the user's constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request. After task partitioning, the multi-valued mapping relations are read from the XML file. When a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily. On completion, the customized key-value pairs are collected again; once the constraints of the data processing have been fully applied, the Map and Reduce phases of MapReduce scheduling are started.
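The constraint pre-processing above can be sketched as follows. This is a minimal illustration, not the patented implementation: the XML schema (`<constraints>`/`<mapping>` elements) and all key names are hypothetical stand-ins for the dynamically generated constraint file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical constraint file: each <mapping> groups several input keys
# under one output key (the many-to-one relation described in the text).
CONSTRAINT_XML = """
<constraints>
  <mapping target="region_a"><source>k1</source><source>k2</source></mapping>
  <mapping target="region_b"><source>k3</source></mapping>
</constraints>
"""

def load_mappings(xml_text):
    """Read the many-to-one key mapping from the XML constraint description."""
    root = ET.fromstring(xml_text)
    table = {}
    for m in root.iter("mapping"):
        for s in m.iter("source"):
            table[s.text] = m.get("target")
    return table

def premap(records, mapping):
    """Pre-processing load node: rewrite keys before the real Map phase runs."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[mapping.get(key, key)].append(value)
    return dict(grouped)

result = premap([("k1", 1), ("k2", 2), ("k3", 3)], load_mappings(CONSTRAINT_XML))
```

A real deployment would run `premap` on the dedicated load nodes and hand the grouped key-value pairs to the normal Map phase.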
Preferably, optimizing load balancing of cloud storage resources with the dynamic programming mechanism further includes:
Let Cdata = {1, 2, …, m} denote the set of all stored data blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node in the cloud storage platform receives a group of storage resources. The cloud storage resource allocation problem is then expressed as solving for the maximum of Σ_{i=1}^{n} L(u(i), i).
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage-efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set.
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages. The state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i.
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a fixed step; compute the maximum storage efficiency V(x(i), i) of assigning the remaining resources x(i) to the n−i storage nodes after the i-th, and record the related data in the data set NoteData[i] = {x(i), u(i), V(x(i), i)}.
(4) When i = n, distribute data according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n.
Use the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, …, 2, 1,
with V(x(n), n) = L(x(n), n).
Derive the optimal value of each stage; during assignment, determine the boundary value of the decision variable u(i) from the load capacity c_i of the storage node at each stage.
(5) Recursively compute the optimal decision sequence NoteData(u(1), u(2), …, u(n)). If Σ_{i=1}^{n} u(i) ≠ m, i.e. the data resources are not fully assigned, repeat the recursion, taking the second-best value of each stage in turn, until Σ_{i=1}^{n} u(i) = m.
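The five-step allocation above amounts to a standard finite-horizon dynamic program. A minimal sketch in Python, assuming integer data groups, a caller-supplied efficiency function L(u, i), and per-node capacity bounds; the consistent-hashing initialization and the second-best-value fallback of step (5) are omitted:

```python
from functools import lru_cache

def allocate(m, n, L, caps):
    """
    Dynamic-programming allocation of m data groups over n storage nodes.
    L(u, i) is the storage-efficiency gain of giving u groups to node i;
    caps[i] bounds u(i) by that node's load capacity c_i.
    Returns (best total efficiency, allocation u(1..n)).
    """
    @lru_cache(maxsize=None)
    def V(x, i):
        # x = data remaining when entering stage i (state variable)
        if i == n - 1:                 # last node takes what its capacity allows
            u = min(x, caps[i])
            return L(u, i), (u,)
        best, best_seq = float("-inf"), ()
        for u in range(0, min(x, caps[i]) + 1):
            tail_val, tail_seq = V(x - u, i + 1)   # state transfer x(i+1) = x(i) - u(i)
            if L(u, i) + tail_val > best:
                best, best_seq = L(u, i) + tail_val, (u,) + tail_seq
        return best, best_seq

    return V(m, 0)
```

For example, with 4 data groups, 2 nodes of capacity 3, and an efficiency function that favors the second node, the recursion allocates 1 group to node 1 and 3 to node 2.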
Compared with the prior art, the present invention has the following advantage:
it organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.
Brief description of the drawings
Fig. 1 is a flow chart of the method for constructing a large-scale data processing platform according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with drawings illustrating the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a method for constructing a large-scale data processing platform. Fig. 1 is a flow chart of the method according to an embodiment of the present invention.
The present invention builds an index structure through document classification and merges small files into large files based on weight similarity for processing in a cloud computing environment. When classifying small text files, the k-nearest-neighbor (kNN) classification process is described with MapReduce; feature-vector comparison is added to kNN, and feature vectors sharing the same feature words are re-ordered and recombined. For the complex processing and content-mapping relations during document retrieval, the MapReduce model is improved with XML and multi-valued processing. Data content, coordinates, operations, and other information are tagged with XML to support complex data processing. Data content generally carries mapping relations; through XML tagging and multi-valued processing in the Map phase, the data-processing operations are realized.
First, documents are pre-classified by format. Sorted text documents are then classified with the improved kNN classification method based on MapReduce and feature-vector reduction. Small text files of the same class are then merged to generate large files. The small text files are written into the large file in chronological order; the name, replica, and position information of the large file are written to the namenode, and the content is written to the datanode.
The comparison of traditional feature vectors is added to the kNN algorithm: first find the identical feature words of two original feature vectors and their weights, recombine the two feature vectors in the order of the shared feature words, and then compute the similarity between the two vectors from the weight vectors corresponding to those feature words.
The method is described as follows. All texts in the training set are pre-processed to generate feature vectors in key-value-pair form.
Step 1. Normalize the feature vector T of the input text and the feature-vector set ET of the training samples, and compute the feature words shared by T and ET.
Step 2. Extract the shared feature words and their weights to form new vectors NT and NET.
Step 3. Use MapReduce to compute similarity: the similarity sim(t, x) between the unary vectors formed by the weights of the two feature vectors.
Step 4. MapReduce sorts the computed similarity results of the texts.
Step 5. Take the k texts with the highest similarity and accumulate the similarity per category over these k texts.
Step 6. Take the maximum similarity S_i and its corresponding category C_i.
Step 7. If S_i exceeds a predefined similarity threshold, the text is identified as belonging to class C_i.
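Steps 1 through 7 can be illustrated compactly. The sketch below assumes feature vectors stored as word-to-weight dictionaries; the single-process loop stands in for the MapReduce sorting and aggregation of steps 3 to 5.

```python
import math
from collections import Counter

def shared_similarity(t, e):
    """Steps 1-3: cosine similarity computed over the shared feature words only."""
    common = set(t) & set(e)
    if not common:
        return 0.0
    nt = [t[w] for w in sorted(common)]   # re-ordered so shared words align (NT)
    ne = [e[w] for w in sorted(common)]   # (NET)
    dot = sum(a * b for a, b in zip(nt, ne))
    return dot / (math.sqrt(sum(a * a for a in nt)) *
                  math.sqrt(sum(b * b for b in ne)))

def knn_classify(query, train, k=3):
    """Steps 4-7: rank training texts by similarity and vote among the top k."""
    scored = sorted(((shared_similarity(query, vec), label) for vec, label in train),
                    reverse=True)[:k]
    votes = Counter()
    for sim, label in scored:
        votes[label] += sim            # accumulate similarity per category
    return votes.most_common(1)[0][0]  # category with maximum accumulated similarity
```

The threshold check of step 7 would compare the winning accumulated similarity against a predefined value before accepting the label.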
For the index structure, a k-d tree partitioning the data set forms the trunk of the whole tree structure. If the k-d tree is empty, the point becomes the root node directly. Otherwise, compare the point's value with the root node's value in the corresponding dimension and descend into the left or right subtree: if the point is smaller than the root's value in that dimension, search in the left subtree until some node's left or right subtree is empty, then insert the point as its leaf node; if the point is larger than the root's value in that dimension, insert into the right subtree. Then, a locality-sensitive hashing structure is mounted on the leaf nodes of the k-d tree; that is, the remaining points are placed into locality-sensitive hashes. The data set X is converted into binary strings in the space; parameters r > 0 and c > 1 are selected in advance, k hash functions are chosen at random, and the data points are stored in the corresponding hash tables using these hash functions.
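The hashing layer mounted on the leaf nodes can be sketched with bit-sampling LSH over the binary strings mentioned above. The k-d-tree trunk and the parameters r and c are omitted, and the hash-function family shown here (randomly chosen bit positions) is one common choice, not necessarily the one the patent intends:

```python
import random

def build_lsh(points, k=4, tables=3, seed=0):
    """
    Bit-sampling LSH over equal-length binary strings: each table hashes a
    point by k randomly chosen bit positions, so nearby strings tend to
    collide in at least one table.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    funcs = [rng.sample(range(dim), k) for _ in range(tables)]  # k random bits per table
    index = [{} for _ in range(tables)]
    for p in points:
        for t, bits in enumerate(funcs):
            key = tuple(p[b] for b in bits)
            index[t].setdefault(key, []).append(p)
    return funcs, index

def query_lsh(q, funcs, index):
    """Collect candidates from every table whose bucket matches q's key."""
    cands = set()
    for t, bits in enumerate(funcs):
        cands.update(index[t].get(tuple(q[b] for b in bits), []))
    return cands
```

A query only inspects the buckets its own key falls into, which is what makes the structure cheaper than a linear scan over all points.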
Based on the above file indexing rules, small text files are merged. Suppose there are files A_1, A_2, …, A_n, where A_i = a_i1, a_i2, …, a_ik, …, and a_ik is the k-th character of the file name. The concrete steps are:
Step 1. For each input character string A_i (i = 1, 2, …, n), find the extension separator a_ik = '.' and intercept all characters after it. Count the number of files of this class in the block, denoted m_ij. Compute in turn the number of such files contained in each block on the same node, obtaining the sequence m_i1, m_i2, …, m_in; then m_i = Σ m_ij (j = 0, 1, …, n) represents the class of extensions contained on this node.
Step 2. Count the number M of all small text files stored on this node, obtaining the weight of each small-text-file class in the classification.
Step 3. Compute the proportion m_i / M of each file type and sort in descending order. The resulting extension list is maintained on the datanode.
Step 4. Count the roots of the m_i on this node to form a root-node list. Each extension has one root-node list; this list is maintained on the datanode.
Step 5. According to the Reduce task where the block is to be placed, obtain the extension of this block.
Step 6. Read the root node of the block to be placed, set up the root-node list, and sort the roots according to the maximum-weight-similarity principle.
Step 7. Select the first-ranked root in this block.
Step 8. Find the node in the cluster where this extension has the largest proportion and search it for this root; if found, place the block there.
Step 9. Exclude this node from the candidate list and check whether the list is empty. If not empty, go to step 8.
Step 10. Exclude this root from the root list and check whether the root list is empty. If not empty, go to step 7; if empty, store the block at random on a node holding this extension.
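Steps 1-3 and 7-10 reduce to computing per-node extension proportions and preferring the node where the block's extension is most common. A simplified sketch; the root-node lists and weight-similarity ordering of steps 4-6 are omitted:

```python
from collections import Counter

def extension_profile(filenames):
    """Steps 1-3: count files per extension on a node and sort by share m_i / M."""
    counts = Counter(name.rsplit(".", 1)[-1] for name in filenames if "." in name)
    total = sum(counts.values())
    return sorted(((c / total, ext) for ext, c in counts.items()), reverse=True)

def pick_node(block_ext, nodes):
    """Steps 7-10 (simplified): choose the node where this extension's share
    is largest; nodes that do not hold the extension score zero."""
    best, best_share = None, -1.0
    for node, files in nodes.items():
        share = {e: s for s, e in extension_profile(files)}.get(block_ext, 0.0)
        if share > best_share:
            best, best_share = node, share
    return best
```

Placing `.log` blocks on the node already dominated by `.log` files keeps files of one class clustered, which is the point of the merging rules above.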
For the complex processing and content-mapping relations during file retrieval, multiple pre-processing load nodes are added to the original MapReduce model. The task these load nodes execute is a subtask of the task distributed by the master node before the Map tasks run; it pre-processes the user's constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request. After task partitioning, the multi-valued mapping relations are read from the XML file. When a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily. On completion, the customized key-value pairs are collected again; once the constraint relations of the data processing have been handled, the Map and Reduce phases of MapReduce scheduling are started.
Further, to realize cloud storage load balancing, let Cdata = {1, 2, …, m} denote the set of all stored data blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node in the cloud storage platform receives a group of storage resources. The cloud storage resource allocation problem is expressed as solving for the maximum of Σ_{i=1}^{n} L(u(i), i).
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage-efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set.
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages. The state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i.
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a fixed step; compute the maximum storage efficiency V(x(i), i) of assigning the remaining resources x(i) to the n−i storage nodes after the i-th, and record the related data in the data set NoteData[i] = {x(i), u(i), V(x(i), i)}.
(4) When i = n, distribute data according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n.
Use the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, …, 2, 1,
with V(x(n), n) = L(x(n), n).
Derive the optimal value of each stage; during assignment, determine the boundary value of the decision variable u(i) from the load capacity c_i of the storage node at each stage.
(5) Recursively compute the optimal decision sequence NoteData(u(1), u(2), …, u(n)). If Σ_{i=1}^{n} u(i) ≠ m, i.e. the data resources are not fully assigned, repeat the recursion, taking the second-best value of each stage in turn, until Σ_{i=1}^{n} u(i) = m.
Based on the above improved MapReduce framework, under a highly concurrent retrieval environment, the present invention designs a shared retrieval architecture with two levels of sharing: the first level uses a public sample management mechanism to realize shared sampling, reducing redundant I/O overhead; the second level abstracts the sharing of online-aggregation computation into a special ACQ optimization problem. The invention merges multiple retrieval jobs at the subtask level; that is, task-level merging is performed according to the correlation of the subtasks of each retrieval operation, and the merged shared tasks are sent to the compute nodes for further processing. The flow of the Hadoop-based shared retrieval framework may include: the retrieval collector collects a group of retrieval requests and performs task-level merging by analyzing the Map subtasks of each retrieval operation, forming a series of shared Map tasks; the shared Map tasks are assigned to the compute nodes for processing, including collecting sample data from HDFS and computing the associated statistics; the Reduce tasks complete approximate estimation and precision judgment from the statistics, returning if the user's accuracy requirement is met and otherwise repeating the above operations.
Given two retrievals Q_1 and Q_2 with corresponding Map subtask sets M_1 = {M_1,1, M_1,2, …, M_1,m} and M_2 = {M_2,1, M_2,2, …, M_2,n}, the sharing rules of the invention are: if two Map subtasks M_i,1 ∈ M_1 and M_j,2 ∈ M_2 have the same input data, i.e. the same data block B_i = B_j, the two subtasks are merged into a shared Map task, thereby merging two independent I/O pipelines and completing sampling sharing through a single unified access to block B_i; if, besides the same data block, the two subtasks also have the same retrieval predicate and the same aggregate type (including SUM, COUNT, AVG), the shared Map task also merges the statistics computation of the two, completing statistics sharing by computing and multiplexing the intermediate statistics; if two Map subtasks have no identical input data, i.e. B_i ≠ B_j, they cannot be merged into a shared Map task.
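The first sharing rule, fusing Map subtasks that read the same data block into one shared task with a single I/O pipeline, can be sketched as a grouping by block ID; the predicate and aggregate-type matching needed for statistics sharing is omitted:

```python
from collections import defaultdict

def merge_map_tasks(queries):
    """
    Merge the Map subtasks of several queries: subtasks reading the same
    input block are fused into one shared Map task (shared sampling), so
    each distinct block is read through a single I/O pipeline.
    queries: {query_id: [block_id, ...]}
    returns: {block_id: [query_ids sharing that block's read]}
    """
    shared = defaultdict(list)
    for qid, blocks in queries.items():
        for b in blocks:
            shared[b].append(qid)
    return dict(shared)

# Q1 and Q2 both read B2, so its scan is performed once and shared.
merged = merge_map_tasks({"Q1": ["B1", "B2"], "Q2": ["B2", "B3"]})
```

Blocks mapped to more than one query ID are exactly the B_i = B_j cases above; singleton entries correspond to subtasks that cannot be merged.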
For the above sharing modes and sharing rules, the invention adopts the following sharing strategy: for each data block B_i, a unified I/O pipeline is built for sample collection, and the obtained random samples are stored in an in-memory sample buffer to support subsequent shared sampling. For first-level sharing, according to the sample demand of each merged Map task in each round of accuracy estimation, sample sets of the corresponding sizes are read from the buffer and distributed to the Map tasks that satisfy the sampling-sharing condition, completing the computation. If statistics sharing is needed within a shared Map task, second-level sharing obtains the respective sample sets from the first-level results and performs grouped computation of the intermediate statistics according to the underlying Map-task sharing groups; each sharing group obtains its own statistics by multiplexing the intermediate statistics, completing the computation.
The grouped computation of the statistics can be completed in two phases: a division phase and an adjustment phase. Given an input sample set k = {k_1i, k_2i, …, k_ni}, sorted in ascending order, the division phase determines an initial sharing grouping scheme with a greedy strategy; the task of the adjustment phase is to make local adjustments to the Map tasks in adjacent sharing groups.
The division phase uses the variance of a group of sample sizes as the measure of their difference, and separates differing sample sizes by splitting the sharing groups with larger variance. First, the overall sharing overhead of the current grouping scheme is computed and denoted c_min. Next, the sharing group with the maximum variance is chosen from the scheme as the candidate for splitting and divided into two new sharing groups at the mean of the sample sizes in the group. Then, the overall sharing overhead of the newly produced grouping scheme is computed and denoted c_cur. If c_cur ≤ c_min, the new grouping scheme is retained and the splitting flow is repeated; otherwise the former grouping scheme is returned.
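A minimal sketch of the division phase. The patent does not define the overhead function, so the sum of within-group variances is used here as an assumed stand-in for the overall sharing overhead, and a split is kept only while it lowers that cost:

```python
import statistics

def split_groups(samples, max_groups=4):
    """
    Greedy division phase: repeatedly split the group with the largest
    variance at its mean, while the split keeps lowering the cost
    (sum of within-group variances, an assumed overhead measure).
    """
    groups = [sorted(samples)]

    def cost(gs):
        return sum(statistics.pvariance(g) for g in gs if len(g) > 1)

    while len(groups) < max_groups:
        g = max(groups, key=lambda g: statistics.pvariance(g) if len(g) > 1 else 0)
        if len(g) < 2:
            break
        mean = statistics.fmean(g)
        low = [x for x in g if x <= mean]      # split at the group mean
        high = [x for x in g if x > mean]
        if not low or not high:
            break
        candidate = [x for x in groups if x is not g] + [low, high]
        if cost(candidate) >= cost(groups):
            break                              # splitting no longer helps
        groups = candidate
    return groups
```

On a bimodal input the first split already separates the two clusters, after which no further split reduces the cost.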
In the adjustment phase, define sg_r, the i-th sharing group of the grouping scheme, as the move-out group of the sample size, and sg_l, the (i−1)-th sharing group, as the move-in group. The sample sizes smaller than the mean sample size within the group form the initial candidate migration set cand. The elements of cand are further judged by priority, and the better sample sizes are chosen for migration. For each element cand[j], count eg_r, the number of remaining sample sizes in sg_r that share a common boundary with it, and eg_l, the number of sample sizes in sg_l that share a common boundary with it. Define two variables CE_r and CE_l that sort the eg_r and eg_l corresponding to cand[j], ascending in CE_r and descending in CE_l. For any cand[j], its index positions rInd in CE_r and lInd in CE_l serve as normalized priority parameters, and weight coefficients w_in and w_out are introduced to adjust the influence of eg_r and eg_l on the priority. The migration priority, considering the influence of eg_r and eg_l, is computed as:
Rank = w_in · rInd + w_out · lInd
where the weight coefficients satisfy w_in + w_out = 1. The migration priority of each candidate is obtained, and the sample size with the highest priority is migrated between adjacent sharing groups to obtain a new grouping scheme; by computing and comparing the sharing cost it can be judged whether the migration is effective, until the sharing cost is no longer reduced, and the final sharing grouping scheme is returned.
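The priority computation can be sketched directly from the Rank formula. The boundary counts eg_r and eg_l are taken as given inputs, and the candidate with the smallest Rank value is treated here as having the highest migration priority; that direction is an assumption, since the patent only states that the highest-priority sample size is chosen:

```python
def migration_priority(cand, boundary_out, boundary_in, w_in=0.5, w_out=0.5):
    """
    Adjustment phase: Rank = w_in * rInd + w_out * lInd, where rInd is the
    candidate's position in CE_r (boundary counts in the move-out group
    sg_r, ascending) and lInd its position in CE_l (boundary counts in the
    move-in group sg_l, descending). Returns the candidate chosen to migrate.
    """
    ce_r = sorted(cand, key=lambda j: boundary_out[j])                 # ascending
    ce_l = sorted(cand, key=lambda j: boundary_in[j], reverse=True)    # descending
    ranks = {j: w_in * ce_r.index(j) + w_out * ce_l.index(j) for j in cand}
    return min(cand, key=lambda j: ranks[j])   # smallest Rank = highest priority
```

Intuitively this favors a sample size with few shared boundaries in its current group and many in the neighboring group, so moving it improves both groups.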
Given more than one table retrieval, the Map function processes the corresponding Map tasks or sharing groups separately according to the different sharing demands, reads the input data, and computes statistics over the sample sets; the statistics computed in each round serve as the input data of the Reduce function. First, the Map function loads global variables to support subsequent statistics computation, and reads the sampling-shared Map task set and the statistics-sharing groups from the variables. Next, each arriving key-value pair is first cached in the public sample buffer and read out according to the different sharing demands. For sampling sharing, once enough samples have been accumulated in the buffer, each required sample size is obtained and the retrieval-type pairs in the variables are updated; statistics are then computed, and key-value pairs are formed with the current retrieval ID as the key and the statistic together with the current Map task ID as the value, serving as the input data of the subsequent Reduce function.
In summary, the present invention proposes a method for constructing a large-scale data processing platform that organizes many kinds of small files from heterogeneous sources into a unified standard form on an improved distributed processing framework, enabling efficient storage, analysis, and retrieval.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention can be implemented with a general-purpose computing system: they can be concentrated on a single computing system or distributed over a network composed of multiple computing systems, and optionally implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by a computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are only intended to exemplify or explain the principles of the invention and are not to be construed as limiting it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall be included in the scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications falling within the scope and boundary of the claims, or the equivalents of such scope and boundary.

Claims (3)

1. A method for constructing a large-scale data processing platform, characterized by including:
adding multiple pre-processing load nodes to the MapReduce model;
tagging data information with XML and handling multi-valued mappings during the Map phase to carry out the data-processing operations;
optimizing load balancing of cloud storage resources using state transfer and dynamic programming.
2. The method according to claim 1, characterized in that the task executed by the load nodes is a subtask of the task distributed by the master node before the Map tasks run, which pre-processes the user's constraint relations; the user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraints from the task request; after task partitioning, the multi-valued mapping relations are read from the XML file; when a single map task starts, it analyzes the input file and produces many-to-one key-value pairs, which the user may operate on arbitrarily; on completion, the customized key-value pairs are collected again, the constraint relations of the data processing are handled, and then the Map and Reduce phases of MapReduce scheduling are started.
3. according to the method described in claim 1, it is characterised in that described that Dynamic Programming mechanism pair is used in cloud storage resource Load balance is optimized, and is further comprised:
The set of all data storage blocks in cloud storage is represented with Cdata={ 1,2 ... m };, k ∈ Cdata represent that kth group is deposited Data are stored up, m is total group of number of data storage in the cloud storage that need to be distributed;Remember i-th of memory node acquisition group in cloud storage platform The storage efficiency of storage resource be L (u (i), i);Cloud storage resources configuration optimization problem is expressed as to solveMost Big value;
(1) in initialization procedure, the data in CData are hashed into Distribution Strategy according to uniformity, are divided into m group data, storage section The virtual storage efficiency value e and load capacity c for turning to n memory node, initializing memory node of point;Stage Counting device i is set;
(2) according to the memory node number of virtualization, this resource allocation process is divided into n stage;Determine state variable x (i+ 1) remaining data after 1 to i memory node of distribution, are represented;
(3) x (i) travels through its interval [u (i) with certain step-lengthmin,u(i)max], surplus resources x (i) is distributed in calculating The maximum storage efficiency V (x (i), i), while related data record is existed of n-i memory node after i-th of memory node Data acquisition system NoteData [i] x (i), u (i), V (x (i), i) } in;
(4) as i=n, data distribution, u (n) are carried out according to the load capacity c and storage efficiency e of n-th of memory node<=cn
Using the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i) ∈ U(x(i))} { L(u(i), i) + V(x(i+1), i+1) }, i = n−1, n−2, …, 1,
with the boundary condition V(x(n), n) = L(x(n), n),
the optimal value of each stage is derived; during allocation, the load capacity c_i of the storage node at each stage determines the boundary value of the decision variable u(i);
(5) The recursive calculation yields the optimal decision sequence NoteData(u(1), u(2), …, u(n)); if ∑_{i=1}^{n} u(i) ≠ m, i.e., the data resources are not fully assigned, the recursion is repeated, taking the second-best value of each stage in turn, until ∑_{i=1}^{n} u(i) = m.
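The staged recursion of claim 3 can be sketched with a memoized dynamic program. This is a simplified sketch under assumptions the claim leaves open: data groups are treated as integers, the efficiency function L(u, i) and the capacities in the usage line are toy examples, and stages are indexed 0…n−1 rather than 1…n.

```python
# Minimal sketch of the claim-3 allocation: distribute m data groups over n
# storage nodes, maximizing total storage efficiency sum_i L(u_i, i) subject to
# the load capacity u_i <= capacities[i]. It uses the state equation
# x(i+1) = x(i) - u(i) and the recursion
# V(x, i) = max_{u in U(x)} { L(u, i) + V(x - u, i + 1) } from the claim.
from functools import lru_cache

def allocate(m, capacities, L):
    """Return (optimal total efficiency, decision sequence (u_0, ..., u_{n-1}))."""
    n = len(capacities)

    @lru_cache(maxsize=None)
    def V(x, i):
        if i == n - 1:                        # final stage: bounded by capacity c_n
            u = min(x, capacities[i])
            return L(u, i), (u,)
        best, best_seq = float("-inf"), ()
        for u in range(0, min(x, capacities[i]) + 1):   # u(i) in U(x(i))
            tail, seq = V(x - u, i + 1)                 # x(i+1) = x(i) - u(i)
            if L(u, i) + tail > best:
                best, best_seq = L(u, i) + tail, (u,) + seq
        return best, best_seq

    return V(m, 0)

# Toy efficiency with diminishing returns; earlier nodes slightly preferred.
value, plan = allocate(6, [4, 4, 4], lambda u, i: u * (10 - i) - u * u)
print(value, plan)
```

The memoization plays the role of the NoteData records: each (x(i), i) state and its best value are computed once and reused when the recursion revisits that state.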
CN201710357465.6A 2017-05-19 2017-05-19 The construction method of large-scale data processing platform Withdrawn CN107066328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710357465.6A CN107066328A (en) 2017-05-19 2017-05-19 The construction method of large-scale data processing platform

Publications (1)

Publication Number Publication Date
CN107066328A true CN107066328A (en) 2017-08-18

Family

ID=59609463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710357465.6A Withdrawn CN107066328A (en) 2017-05-19 2017-05-19 The construction method of large-scale data processing platform

Country Status (1)

Country Link
CN (1) CN107066328A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104065685A (en) * 2013-03-22 2014-09-24 中国银联股份有限公司 Data migration method in cloud computing environment-oriented layered storage system
CN105069524A (en) * 2015-07-29 2015-11-18 中国西电电气股份有限公司 Planned scheduling optimization method based on large data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ren Chongguang (任崇广): "Research on Cloud Computing and Its Key Technologies for Massive Data Processing", China Doctoral Dissertations Full-text Database *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111211993A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Incremental persistence method and device for streaming computation
CN111211993B (en) * 2018-11-21 2023-08-11 百度在线网络技术(北京)有限公司 Incremental persistence method, device and storage medium for stream computation
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
CN110704515B (en) * 2019-12-11 2020-06-02 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model
CN111506621A (en) * 2020-03-31 2020-08-07 新华三大数据技术有限公司 Data statistical method and device
CN115617279A (en) * 2022-12-13 2023-01-17 北京中电德瑞电子科技有限公司 Distributed cloud data processing method and device and storage medium
CN115617279B (en) * 2022-12-13 2023-03-31 北京中电德瑞电子科技有限公司 Distributed cloud data processing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170818