CN107066328A - The construction method of large-scale data processing platform - Google Patents
- Publication number
- CN107066328A (application CN201710357465.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- memory node
- storage
- data processing
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1001—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a construction method for a large-scale data processing platform. The method includes: adding multiple preprocessing load nodes to the MapReduce model; tagging data information with XML and performing multi-value processing during the Map phase, thereby realizing data-processing operations; and applying state-transfer and dynamic-programming mechanisms to optimize load balancing of cloud storage resources. Based on an improved distributed processing framework, the invention organizes many kinds of small files from heterogeneous sources into a unified standard form, facilitating efficient storage, analysis and retrieval.
Description
Technical field
The present invention relates to data computing, and in particular to a construction method for a large-scale data processing platform.
Background technology
Cloud computing offers distributed computation, very large scale, virtualization, high reliability, high elasticity, scalability, and on-demand service, and can provide more efficient analysis and better computing capability for big data processing. Big data workloads involve hundreds of millions of small files, such as web pages and e-mails, which require distributed storage systems and directory systems for storage support. With the growing demand for processing large numbers of small text files, heterogeneous data sources abound in different information systems; data lack a unified standardization method; and in some fields large numbers of small text files are difficult to analyze effectively and to store and retrieve efficiently.
The content of the invention
To solve the above problems of the prior art, the present invention proposes a construction method for a large-scale data processing platform, including:
adding multiple preprocessing load nodes to the MapReduce model;
tagging data information with XML and performing multi-value processing during the Map phase, thereby realizing data-processing operations;
applying state-transfer and dynamic-programming mechanisms to optimize load balancing of cloud storage resources.
Preferably, the tasks executed by these load nodes are subtasks distributed by the master node before the Map tasks are executed, preprocessing the user-defined constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraint relations according to the task request. After task partitioning, the multi-valued mapping relations in the XML file are read; when a single Map task starts, it analyzes the input file and produces many-to-one key-value pairs, on which the user may operate arbitrarily. The customized key-value pairs are then collected, the constraints of the data processing are fully resolved, and the Map and Reduce phases of MapReduce scheduling are started again.
Preferably, said optimizing load balancing in cloud storage resources using a dynamic programming mechanism further includes:
Let Cdata = {1, 2, ..., m} denote the set of all data storage blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed in the cloud storage. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node of the cloud storage platform receives a group of storage resources. The cloud storage resource configuration optimization problem is expressed as solving the maximum of Σ_{i=1}^{n} L(u(i), i);
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set;
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages; the state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i;
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a certain step size; the maximum storage efficiency V(x(i), i) obtained by distributing the remaining resources x(i) over the i-th and the subsequent n−i storage nodes is computed, and the related data are recorded in the data set NoteData[i] = {x(i), u(i), V(x(i), i)};
(4) When i = n, data are distributed according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n;
Using the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, n−2, ..., 1,
with V(x(n), n) = L(x(n), n),
the optimal value of each stage is derived; during allocation, the boundary value of the decision variable u(i) is determined according to the load capacity c_i of the storage node at each stage;
(5) Recursive computation yields the optimal decision sequence NoteData(u(1), u(2), ..., u(n)). If Σ_{i=1}^{n} u(i) < m, i.e. the data resources are not fully assigned, the recursion is repeated, taking the second-best value at each stage in turn, until all data resources are assigned.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a construction method for a large-scale data processing platform that, based on an improved distributed processing framework, organizes many kinds of small files from heterogeneous sources into a unified standard form, facilitating efficient storage, analysis and retrieval.
Brief description of the drawings
Fig. 1 is a flow chart of the construction method of a large-scale data processing platform according to an embodiment of the present invention.
Embodiment
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawings illustrating the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may also be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a construction method for a large-scale data processing platform. Fig. 1 is a flow chart of the construction method according to an embodiment of the invention.
The present invention builds an index structure through document classification and merges small files into large files based on weight similarity, for processing in a cloud computing environment. When classifying small text files, the k-nearest-neighbor (kNN) classification process is expressed in MapReduce; feature-vector comparison is added to kNN, and feature vectors sharing identical feature words are sequentially recombined. For the complex processing and content mapping relations during document retrieval, the MapReduce model is improved based on XML and multi-values. Information such as data content, coordinates and operations is tagged with XML for complex data processing. The content of data generally has mapping relations; through XML tagging and multi-value processing during the Map phase, data-processing operations are realized.
First, documents are pre-classified by format. Sorted text documents are then classified by the improved kNN classification method based on MapReduce and feature-vector reduction. Small text files of the same class are then merged to generate large files. The small text files are written into the large file in chronological order; the name, replica and positional information of the large file are written to the namenode, and its content is written to the datanode.
The traditional feature-vector comparison method is added to the kNN algorithm: first, the identical words and their weights shared by the two original feature vectors are found; two feature vectors containing only the identical feature words are recombined in the order of the shared feature words; then the similarity between the two feature vectors is computed using the weight vectors corresponding to the feature words.
The method is described as follows: all texts in the training set are preprocessed, generating feature vectors in key-value-pair form;
Step 1. Normalize the feature vector T of the input text and the feature-vector set ET of the training samples, and compute the identical feature words in T and ET;
Step 2. Extract the identical feature words and their corresponding weights to form new vectors NT and NET;
Step 3. Apply MapReduce to perform the similarity computation: compute the similarity sim(t, x) between the unary vectors composed of the weights of the two feature vectors;
Step 4. MapReduce sorts the computed similarity results of the texts;
Step 5. Take the k texts with the highest similarity and accumulate the similarity per category of these k texts;
Step 6. Take the maximum similarity S_i and the corresponding category C_i;
Step 7. If S_i exceeds a predefined similarity threshold, the text is identified as belonging to class C_i.
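The steps above can be sketched as a minimal single-machine illustration (not the MapReduce-distributed form). `reduced_similarity` corresponds to steps 1-3, using a cosine measure over the shared feature words, and `knn_classify` to steps 4-7; the function names, the dict-based feature-vector representation, and the cosine choice for sim(t, x) are illustrative assumptions, not specified by the text.

```python
from collections import Counter
from math import sqrt

def reduced_similarity(t, x):
    """Steps 1-3: keep only the feature words shared by both vectors
    (dicts mapping word -> weight), recombine them in the same order,
    and compute a cosine similarity over the reduced weight vectors."""
    shared = sorted(set(t) & set(x))
    if not shared:
        return 0.0
    nt = [t[w] for w in shared]   # reduced vector NT
    net = [x[w] for w in shared]  # reduced vector NET
    dot = sum(a * b for a, b in zip(nt, net))
    norm = sqrt(sum(a * a for a in nt)) * sqrt(sum(b * b for b in net))
    return dot / norm

def knn_classify(t, training, k, threshold):
    """Steps 4-7: sort training samples by similarity, accumulate the
    top-k similarities per category, and accept the best category only
    if its accumulated similarity exceeds the threshold."""
    ranked = sorted(((reduced_similarity(t, x), label) for x, label in training),
                    reverse=True)[:k]
    scores = Counter()
    for sim, label in ranked:
        scores[label] += sim
    label, best = scores.most_common(1)[0]
    return label if best > threshold else None
```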
For the index structure, a K-dimensional (KD) tree partitioning the data set forms the trunk of the whole tree structure. If the KD tree is empty, the point directly becomes the root node. Otherwise, the point's value in the discriminating dimension is compared with that of the root, and the operation proceeds into the left or right subtree accordingly: if the point is smaller than the root's value in that dimension, the search enters the left subtree, continuing until some node's left or right subtree is empty, where the point is inserted as a leaf node; if the point is larger than the root's value in that dimension, insertion proceeds in the right subtree. Then, a locality-sensitive hashing (LSH) structure is loaded on the leaf nodes of the KD tree, i.e. the remaining points are placed into the LSH. The data set X is converted into binary strings in the space; parameters r > 0 and c > 1 are selected in advance, k hash functions are chosen at random, and the data points are stored into the corresponding hash tables using these hash functions.
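A minimal sketch of the two-layer index described above, assuming a standard KD-tree insertion and a random-hyperplane LSH for the binary-string conversion; the class and function names, and the hyperplane-sign hashing, are assumptions not specified in the text.

```python
class KDNode:
    def __init__(self, point, dim):
        self.point = point
        self.dim = dim                 # discriminating dimension at this node
        self.left = None
        self.right = None

def kd_insert(root, point, k):
    """Insert a point into a k-dimensional tree: descend left when the
    point is smaller than the node in its discriminating dimension,
    right otherwise, until an empty child is found."""
    if root is None:
        return KDNode(point, 0)
    node = root
    while True:
        side = 'left' if point[node.dim] < node.point[node.dim] else 'right'
        child = getattr(node, side)
        if child is None:
            setattr(node, side, KDNode(point, (node.dim + 1) % k))
            return root
        node = child

def lsh_signature(point, planes):
    """Convert a point into a binary string via random-hyperplane hashing:
    one bit per plane, set by the sign of the dot product."""
    return ''.join('1' if sum(p * w for p, w in zip(point, plane)) >= 0 else '0'
                   for plane in planes)
```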
Based on the above file indexing rules, small text files are merged. Suppose there are multiple files A_1, A_2, ..., A_n, where A_i = a_{i1}, a_{i2}, ..., a_{ik}, ... and a_{ik} is the k-th character of the file name. The concrete steps are:
Step 1. For each input string A_i (i = 1, 2, ..., n), find the separator a_{ik} = '.' and intercept all characters after a_{ik} (the extension). Count the number of files of this class in the block, denoted m_{ij}. The number of such files contained in each block on the same node is computed in turn, yielding the sequence m_{i1}, m_{i2}, ..., m_{in}; m_i = Σ_j m_{ij} (j = 0, 1, ..., n) represents the class of extensions contained in this node.
Step 2. Count the number M of all small text files stored on this node, obtaining the weights set during the classification of the small text files.
Step 3. Compute the proportion m_i / M of each file type and sort in descending order. The resulting extension list is maintained in the datanode.
Step 4. Count the m_i on this node per root node to form a root-node list. Each extension has one root-node list; this list is maintained in the datanode.
Step 5. According to the Reduce task where the block to be placed resides, obtain the extension of this block.
Step 6. Read the root node of the block to be placed. Set the root-node list, and sort the roots according to the maximum weight-similarity principle.
Step 7. Select the root ranked first in this block.
Step 8. Find the node in the cluster with the largest proportion of the extension. Search for this root there; if it exists, place the block.
Step 9. Exclude this node from the candidate list, then judge whether the list is empty. If not empty, go to step 8.
Step 10. Exclude this root from the root list, and judge whether the root list is empty. If not empty, go to step 7; if empty, store the block at random on a node having this extension.
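Steps 1-3 of the merge procedure above can be sketched as follows; the function name and the use of the last '.' as the extension separator are illustrative assumptions.

```python
from collections import Counter

def extension_proportions(filenames):
    """Steps 1-3: take the characters after the last '.' as each file's
    extension, count files per extension (m_i), and return extensions
    sorted by their proportion m_i / M in descending order."""
    counts = Counter(name.rsplit('.', 1)[1] for name in filenames if '.' in name)
    total = sum(counts.values())   # M: all small text files on this node
    return sorted(((ext, n / total) for ext, n in counts.items()),
                  key=lambda item: item[1], reverse=True)
```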
For the complex processing and content mapping relations during file retrieval, multiple preprocessing load nodes are added to the original MapReduce model. The tasks executed by these load nodes are subtasks distributed by the master node before the Map tasks are executed, preprocessing the user-defined constraint relations. The user submits a processing request carrying constraint relations to the master node; the master node dynamically generates an XML file describing the constraint relations according to the task request. After task partitioning, the multi-valued mapping relations in the XML file are read; when a single Map task starts, it analyzes the input file and produces many-to-one key-value pairs, on which the user may operate arbitrarily. The customized key-value pairs are then collected, the constraint relations of the data processing are fully resolved, and the Map and Reduce phases of MapReduce scheduling are started again.
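A minimal sketch of the preprocessing performed by such a load node, assuming a hypothetical constraint-XML layout (the `<constraints>`/`<map>`/`<value>` element names are invented for illustration): the multi-valued mappings are read from the dynamically generated XML, and many-to-one key-value pairs are emitted before the Map phase.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical layout of the dynamically generated constraint XML.
CONSTRAINT_XML = """
<constraints>
  <map key="color"><value>red</value><value>blue</value></map>
  <map key="size"><value>large</value></map>
</constraints>
"""

def load_multivalue_mappings(xml_text):
    """Read the many-to-one (multi-valued) mapping relations from the
    constraint XML before the Map tasks start."""
    mappings = defaultdict(list)
    for m in ET.fromstring(xml_text).iter('map'):
        for v in m.iter('value'):
            mappings[m.get('key')].append(v.text)
    return dict(mappings)

def preprocess(records, mappings):
    """Emit many-to-one key-value pairs: every input record listed under
    a key's value set is grouped under that key."""
    pairs = []
    for key, values in mappings.items():
        for rec in records:
            if rec in values:
                pairs.append((key, rec))
    return pairs
```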
To further realize cloud storage load balancing, let Cdata = {1, 2, ..., m} denote the set of all data storage blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed in the cloud storage. Let L(u(i), i) denote the storage efficiency obtained when the i-th storage node of the cloud storage platform receives a group of storage resources. The cloud storage resource optimization problem is expressed as solving the maximum of Σ_{i=1}^{n} L(u(i), i).
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set.
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages. The state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i;
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a certain step size; the maximum storage efficiency V(x(i), i) obtained by distributing the remaining resources x(i) over the i-th and the subsequent n−i storage nodes is computed, and the related data are recorded in the data set NoteData[i] = {x(i), u(i), V(x(i), i)}.
(4) When i = n, data are distributed according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n.
Using the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, n−2, ..., 1,
with V(x(n), n) = L(x(n), n),
the optimal value of each stage is derived; during allocation, the boundary value of the decision variable u(i) is determined according to the load capacity c_i of the storage node at each stage.
(5) Recursive computation yields the optimal decision sequence NoteData(u(1), u(2), ..., u(n)). If Σ_{i=1}^{n} u(i) < m, i.e. the data resources are not fully assigned, the recursion is repeated, taking the second-best value at each stage in turn, until all data resources are assigned.
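The recursion in steps (1)-(5) can be sketched as follows, under simplifying assumptions: `L` is a hypothetical storage-efficiency function, the load capacities c_i and the second-best backtracking of step (5) are omitted, and data groups are taken as integers.

```python
def allocate(m, n, L):
    """Distribute m data groups over storage nodes 1..n so that the total
    storage efficiency sum of L(u(i), i) is maximal, using
    V(x, i) = max_u { L(u, i) + V(x - u, i + 1) } with V(x, n) = L(x, n)."""
    memo = {}

    def V(x, i):
        if i == n:
            return L(x, i), x          # last node takes all remaining data
        if (x, i) not in memo:
            memo[(x, i)] = max(((L(u, i) + V(x - u, i + 1)[0], u)
                                for u in range(x + 1)), key=lambda t: t[0])
        return memo[(x, i)]

    plan, x = [], m
    for i in range(1, n + 1):          # read off the optimal decision sequence
        _, u = V(x, i)
        plan.append(u)
        x -= u
    return plan
```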
Based on the improved MapReduce framework above, in a multi-retrieval concurrent environment the present invention sets up a shared retrieval architecture using two levels of sharing: the first level realizes shared sampling using a public sample management mechanism, reducing redundant I/O overhead; the second level abstracts the computation sharing of online aggregation into a dedicated ACQ optimization problem. The invention merges multiple retrieval jobs at the subtask level, i.e. task-level merging is realized according to the correlation of the retrieval subtasks, and the shared merged tasks are sent to each compute node for further processing. The flow of the Hadoop-based shared retrieval framework may include: a retrieval collector collects a group of retrieval requests and, by analyzing the Map subtasks of each retrieval job, performs task-level merge operations, forming a series of shared Map tasks; the shared Map tasks are assigned to the compute nodes for corresponding processing, including collecting sample data from HDFS and computing the relevant statistics; Reduce tasks complete approximate estimation and precision judgment according to the statistics information; the result is returned if the user's accuracy requirement is met, otherwise the above operations are repeated.
Given two retrievals Q_1 and Q_2 with corresponding Map subtask sets M_1 = {M_{1,1}, M_{1,2}, ..., M_{1,m}} and M_2 = {M_{2,1}, M_{2,2}, ..., M_{2,n}}, the sharing mechanism of the invention is: if two Map subtasks M_{i,1} ∈ M_1 and M_{j,2} ∈ M_2 have the same input data block B_i = B_j, the two Map subtasks are merged into a shared Map task, merging the two independent I/O pipelines so that unified access to block B_i completes the sampling sharing; if, in addition to the identical block, the two Map subtasks also have the same retrieval predicate and aggregate-type statements (including SUM, COUNT, AVG), the shared Map task also merges the statistics computations of the two Map tasks, completing the statistics sharing by computing and reusing the intermediate statistics; if the two Map subtasks have no identical input data (B_i ≠ B_j), they cannot be merged into a shared Map task.
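The three sharing cases can be sketched as a merge-decision function; the dict-based task representation and its field names are illustrative assumptions.

```python
def sharing_mode(task1, task2):
    """Classify how two Map subtasks can be merged: no sharing when their
    input blocks differ; statistics sharing when block, predicate and
    aggregate type all match; otherwise sampling sharing only."""
    if task1['block'] != task2['block']:
        return None                    # B_i != B_j: cannot merge
    if (task1['predicate'] == task2['predicate']
            and task1['agg'] == task2['agg']):
        return 'statistics-sharing'    # reuse intermediate statistics too
    return 'sampling-sharing'          # merge the two I/O pipelines only
```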
For the above sharing modes and sharing mechanism, the present invention uses the following sharing strategy: for each data block B_i, a unified I/O pipeline is built for sample collection, and the obtained random samples are stored in a sample buffer in memory, providing data support for subsequent shared sampling. For the first-level sharing, according to the sample demand of each merged Map task in each round of accuracy estimation, sample sets of the appropriate size are read from the buffer and distributed to the Map tasks satisfying the sampling-sharing condition, completing their computation. If statistics sharing is needed in a shared Map task, the second-level sharing obtains the respective sample sets from the first-level sharing results, performs grouped computation of intermediate statistics according to the sharing groups of the underlying Map tasks, and each sharing group obtains its own statistics by reusing the intermediate statistics, thus completing the computation.
The grouped computation of the statistics can be completed in two phases: a division phase and an adjustment phase. Given a set of samples k = {k_{1i}, k_{2i}, ..., k_{ni}}, the sample set k is sorted in ascending order; the division phase determines the initial sharing grouping scheme using a greedy strategy, and the task of the adjustment phase is to perform local adjustment of the Map tasks in adjacent sharing groups.
The division phase uses the variance of a group of sample sizes as the standard for measuring their dissimilarity, separating dissimilar sample sizes by splitting the sharing groups with larger variance. First, the overall sharing cost of the current grouping scheme is computed and denoted c_min; next, the sharing group with the maximum variance is chosen from the grouping scheme as the candidate for splitting, and it is split into two new sharing groups at the mean of its sample sizes; then, the overall sharing cost of the newly produced grouping scheme is computed and denoted c_cur. If c_cur ≤ c_min, the new grouping scheme is retained and the above division flow is repeated; otherwise, the former grouping scheme is returned.
In the adjustment phase, the i-th sharing group sg_r of the grouping scheme is defined as the move-out group of sample sizes, and the (i−1)-th sharing group sg_l as the move-in group. The sample sizes smaller than the mean sample size within the group form the initial candidate migration set cand. The elements of cand are then judged by priority, and the better sample sizes are chosen for migration. For each element cand[j], the number eg_r of the remaining sample sizes in sg_r that share a common boundary with it, and the number eg_l of all sample sizes in sg_l that share a common boundary with it, are counted. Two variables CE_r and CE_l sort the eg_r and eg_l values corresponding to the cand[j]: ascending order is used in CE_r and descending order in CE_l. For any cand[j], its index positions rInd in CE_r and lInd in CE_l serve as normalized priority parameters, and weight coefficients w_in and w_out are introduced to adjust the influence of eg_r and eg_l on the priority. The migration priority of a sample size, taking the influence of eg_r and eg_l into account, is computed as:

rank = w_in · rInd + w_out · lInd

where the weight coefficients satisfy w_in + w_out = 1. The corresponding migration priority is obtained for each candidate, and the sample size with the highest priority is migrated between the adjacent sharing groups to obtain a new grouping scheme. Whether the migration is effective is judged by computing and comparing the sharing cost; this continues until the sharing cost no longer decreases, and the final sharing grouping scheme is returned.
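The priority formula rank = w_in · rInd + w_out · lInd can be sketched as follows; the list-based inputs and the default weights are illustrative assumptions.

```python
def migration_priorities(cand, eg_r, eg_l, w_in=0.5, w_out=0.5):
    """rank = w_in * rInd + w_out * lInd, where rInd is the candidate's
    position with eg_r sorted ascending (CE_r) and lInd its position
    with eg_l sorted descending (CE_l); w_in + w_out must equal 1."""
    assert abs(w_in + w_out - 1.0) < 1e-9
    ce_r = sorted(range(len(cand)), key=lambda j: eg_r[j])                 # ascending
    ce_l = sorted(range(len(cand)), key=lambda j: eg_l[j], reverse=True)   # descending
    return {cand[j]: w_in * ce_r.index(j) + w_out * ce_l.index(j)
            for j in range(len(cand))}
```

The sample size with the highest rank would then be migrated to the adjacent group, as described above.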
Given multi-table retrieval, the Map function processes the corresponding Map tasks or sharing groups according to their different sharing demands, reading the input data and performing statistics computation on the sample sets, and takes each round's computed statistics as the input data of the Reduce function. First, the Map function loads global variables to support subsequent statistics computation, and reads the sampling-shared Map task set and the statistics-sharing groups from the variables. Second, each arriving key-value pair is first cached by the public sample buffer and read out according to the different sharing demands. For sampling sharing, once enough samples are saved, each required sample size is obtained and the retrieval-type pairs in the variables are updated; statistics computation is then performed, and the computed result is formed into key-value pairs with the current retrieval ID as the key and the statistic together with the current Map task ID as the group key, serving as the input data of the subsequent Reduce function.
In summary, the present invention proposes a construction method for a large-scale data processing platform that, based on an improved distributed processing framework, organizes many kinds of small files from heterogeneous sources into a unified standard form, facilitating efficient storage, analysis and retrieval.
Obviously, those skilled in the art should understand that the above modules or steps of the invention can be realized with a general-purpose computing system; they can be concentrated on a single computing system, or distributed over a network composed of multiple computing systems; alternatively, they can be realized with program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the invention is not restricted to any specific combination of hardware and software.
It should be appreciated that the above embodiments of the invention serve only to exemplify or explain the principles of the invention, and are not to be construed as limiting the invention. Therefore, any modification, equivalent substitution, improvement, etc. made without departing from the spirit and scope of the invention shall be included in the scope of protection of the invention. In addition, the appended claims are intended to cover all changes and modifications falling within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.
Claims (3)
1. A construction method of a large-scale data processing platform, characterized by including:
adding multiple preprocessing load nodes to the MapReduce model;
tagging data information with XML and performing multi-value processing during the Map phase, thereby realizing data-processing operations;
applying state-transfer and dynamic-programming mechanisms to optimize load balancing of cloud storage resources.
2. The method according to claim 1, characterized in that the tasks executed by said load nodes are subtasks distributed by the master node before the Map tasks are executed, preprocessing the user-defined constraint relations; the user submits a processing request carrying constraint relations to the master node, and the master node dynamically generates an XML file describing the constraint relations according to the task request; after task partitioning, the multi-valued mapping relations in the XML file are read; when a single Map task starts, it analyzes the input file and produces many-to-one key-value pairs, on which the user may operate arbitrarily; the customized key-value pairs are then collected, the constraint relations of the data processing are fully resolved, and the Map and Reduce phases of MapReduce scheduling are started again.
3. The method according to claim 1, characterized in that said optimizing load balancing in cloud storage resources using a dynamic programming mechanism further includes:
Let Cdata = {1, 2, ..., m} denote the set of all data storage blocks in cloud storage; k ∈ Cdata denotes the k-th group of stored data, and m is the total number of data groups to be distributed in the cloud storage; let L(u(i), i) denote the storage efficiency obtained when the i-th storage node of the cloud storage platform receives a group of storage resources; the cloud storage resource configuration optimization problem is expressed as solving the maximum of Σ_{i=1}^{n} L(u(i), i);
(1) During initialization, the data in Cdata are divided into m groups according to a consistent-hashing distribution strategy; the storage nodes are virtualized into n storage nodes; the storage efficiency value e and load capacity c of each storage node are initialized; and a stage counter i is set;
(2) According to the number of virtualized storage nodes, the resource allocation process is divided into n stages; the state variable x(i+1) denotes the data remaining after allocation to storage nodes 1 through i;
(3) x(i) traverses its interval [u(i)_min, u(i)_max] with a certain step size; the maximum storage efficiency V(x(i), i) obtained by distributing the remaining resources x(i) over the i-th and the subsequent n−i storage nodes is computed, and the related data are recorded in the data set NoteData[i] = {x(i), u(i), V(x(i), i)};
(4) When i = n, data are distributed according to the load capacity c and storage efficiency e of the n-th storage node, with u(n) ≤ c_n;
Using the state transfer equation x(i+1) = x(i) − u(i)
and the dynamic programming equation V(x(i), i) = max_{u(i)∈U(x(i))} {L(u(i), i) + V(x(i+1), i+1)}, i = n−1, n−2, ..., 1,
with V(x(n), n) = L(x(n), n),
the optimal value of each stage is derived; during allocation, the boundary value of the decision variable u(i) is determined according to the load capacity c_i of the storage node at each stage;
(5) Recursive computation yields the optimal decision sequence NoteData(u(1), u(2), ..., u(n)); if Σ_{i=1}^{n} u(i) < m, i.e. the data resources are not fully assigned, the recursion is repeated, taking the second-best value at each stage in turn, until all data resources are assigned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710357465.6A CN107066328A (en) | 2017-05-19 | 2017-05-19 | The construction method of large-scale data processing platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107066328A true CN107066328A (en) | 2017-08-18 |
Family
ID=59609463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710357465.6A Withdrawn CN107066328A (en) | 2017-05-19 | 2017-05-19 | The construction method of large-scale data processing platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107066328A (en) |
Worldwide Applications
- 2017-05-19: CN CN201710357465.6A patent/CN107066328A/en — not_active, Withdrawn
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104065685A (en) * | 2013-03-22 | 2014-09-24 | 中国银联股份有限公司 | Data migration method in cloud computing environment-oriented layered storage system |
CN105069524A (en) * | 2015-07-29 | 2015-11-18 | 中国西电电气股份有限公司 | Planned scheduling optimization method based on large data analysis |
Non-Patent Citations (1)
Title |
---|
Ren Chongguang (任崇广): "Research on Cloud Computing and Its Key Technologies for Massive Data Processing" (面向海量数据处理领域的云计算及其关键技术研究), China Excellent Doctoral Dissertations Full-text Database (《中国优秀博士学位论文全文数据库》) *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111211993A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Incremental persistence method and device for streaming computation |
CN111211993B (en) * | 2018-11-21 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | Incremental persistence method, device and storage medium for stream computation |
CN110704515A (en) * | 2019-12-11 | 2020-01-17 | 四川新网银行股份有限公司 | Two-stage online sampling method based on MapReduce model |
CN110704515B (en) * | 2019-12-11 | 2020-06-02 | 四川新网银行股份有限公司 | Two-stage online sampling method based on MapReduce model |
CN111506621A (en) * | 2020-03-31 | 2020-08-07 | 新华三大数据技术有限公司 | Data statistical method and device |
CN115617279A (en) * | 2022-12-13 | 2023-01-17 | 北京中电德瑞电子科技有限公司 | Distributed cloud data processing method and device and storage medium |
CN115617279B (en) * | 2022-12-13 | 2023-03-31 | 北京中电德瑞电子科技有限公司 | Distributed cloud data processing method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | A speculative approach to spatial‐temporal efficiency with multi‐objective optimization in a heterogeneous cloud environment | |
CN109872036B (en) | Task allocation method and device based on classification algorithm and computer equipment | |
CN113064879B (en) | Database parameter adjusting method and device and computer readable storage medium | |
US11423082B2 (en) | Methods and apparatus for subgraph matching in big data analysis | |
US10621493B2 (en) | Multiple record linkage algorithm selector | |
CN110390345B (en) | Cloud platform-based big data cluster self-adaptive resource scheduling method | |
CN107292186A (en) | A kind of model training method and device based on random forest | |
CN108320171A (en) | Hot item prediction technique, system and device | |
CN107066328A (en) | The construction method of large-scale data processing platform | |
CN107193940A (en) | Big data method for optimization analysis | |
CN107908536B (en) | Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
CN106934410A (en) | The sorting technique and system of data | |
CN111259933B (en) | High-dimensional characteristic data classification method and system based on distributed parallel decision tree | |
CN107622326A (en) | User's classification, available resources Forecasting Methodology, device and equipment | |
US7890705B2 (en) | Shared-memory multiprocessor system and information processing method | |
CN113032367A (en) | Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system | |
CN108052535A (en) | The parallel fast matching method of visual signature and system based on multi processor platform | |
CN107103095A (en) | Method for computing data based on high performance network framework | |
US8667008B2 (en) | Search request control apparatus and search request control method | |
CN114254762A (en) | Interpretable machine learning model construction method and device and computer equipment | |
CN116680090B (en) | Edge computing network management method and platform based on big data | |
US7647592B2 (en) | Methods and systems for assigning objects to processing units | |
CN115510331B (en) | Shared resource matching method based on idle amount aggregation | |
CN116243869A (en) | Data processing method and device and electronic equipment |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170818 |