CN106095940A - A task-load-based data migration method - Google Patents
A task-load-based data migration method
- Publication number
- CN106095940A CN106095940A CN201610415905.4A CN201610415905A CN106095940A CN 106095940 A CN106095940 A CN 106095940A CN 201610415905 A CN201610415905 A CN 201610415905A CN 106095940 A CN106095940 A CN 106095940A
- Authority
- CN
- China
- Prior art keywords
- data
- task
- node
- data migration
- load
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a task-load-based data migration method, implemented as follows: data are extracted from a relational database and converted in format; data migration is then performed using the MapReduce framework, the concrete migration process being: data are first imported from the data source, and corresponding job classes are generated and packaged into a jar file, which is passed to the Hadoop platform for storage and processing. Compared with the prior art, this task-load-based data migration method removes the assumptions about each node's task processing capability, CPU speed, and the number of allocated map tasks; by optimizing task scheduling it avoids, as far as possible, overloading or underloading of individual nodes, and can improve data migration efficiency. The method is practical.
Description
Technical field
The present invention relates to the field of data migration technology, and in particular to a practical, task-load-based data migration method.
Background art
With the continuous development of information technology, the amount of data produced by human society grows at an astonishing speed every year, and society has entered a brand-new information age. The accompanying problem of data storage has gradually become a hot topic. In recent years, with the development of cloud computing, big data, and related technologies, the landscape of the information industry has been changing. In practical enterprise applications, data storage must meet round-the-clock high-availability requirements, but as the scale of network storage systems grows, the number of system failures also increases sharply. In addition, although storage systems keep growing and storage devices keep multiplying, resource utilization within storage systems remains at a very low level. On the other hand, as the amount of stored data increases, systems must offer high scalability; large-scale storage systems are often composed of highly heterogeneous storage devices from different manufacturers, which makes storage-system management quite complex and makes future system expansion a big problem. Under current technical conditions, adopting cloud storage can effectively solve the above problems.
Under normal circumstances, an enterprise information system comprises multiple different business systems, and each business system includes its own online operation system, archiving system, and backup system. For cost reasons, an enterprise may migrate the data of its online business platform to a back-end cloud storage platform. However, the data migration process is very complex, and many problems need to be solved. Among these, we mainly study one: when relational data are migrated to a big data platform, the migration efficiency of the data leaves much room for improvement. We improve data migration efficiency by optimizing the task scheduling mechanism.
Generally, if the user does not configure Hadoop specially, FIFO scheduling is used during task processing; its operating principle is shown in Figure 1. When Hadoop runs a job with the FIFO scheduler, the following assumptions hold:
In the distributed cluster, every node has the same task processing capability.
While a node processes tasks, its computing speed and capability remain constant and are not affected by other factors.
When the system processes identical tasks, the numbers of tasks accepted by map and reduce in MapReduce and the amounts of computation allocated are all equal.
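The FIFO behavior described above can be pictured as a plain job queue that ignores node differences entirely. The following is a minimal illustrative sketch, not Hadoop's actual JobQueueTaskScheduler; all class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of FIFO scheduling: jobs leave the queue strictly in
// submission order, with no regard for per-node load differences.
public class FifoSketch {
    private final Queue<String> jobQueue = new ArrayDeque<>();

    public void submit(String job) { jobQueue.add(job); }

    // Assigns the next queued job to whichever node asks, in FIFO order;
    // the nodeId is deliberately ignored, mirroring the assumptions above.
    public String assignNext(String nodeId) {
        return jobQueue.poll(); // null when the queue is empty
    }

    public static void main(String[] args) {
        FifoSketch s = new FifoSketch();
        s.submit("job-1"); s.submit("job-2"); s.submit("job-3");
        List<String> order = new ArrayList<>();
        order.add(s.assignNext("fastNode"));
        order.add(s.assignNext("slowNode")); // the slow node still gets the next job
        order.add(s.assignNext("fastNode"));
        System.out.println(order); // [job-1, job-2, job-3]
    }
}
```

The point of the sketch is exactly what the next paragraph criticizes: assignment order never consults node capability or load.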
The above conditions can only be assumed under ideal circumstances; in actual operation and use, the three assumptions are basically impossible to satisfy. First, because the servers at different nodes are heterogeneous, their task processing capabilities can hardly be identical; for example, when processing compute-intensive tasks the CPU slows down, and when processing data-intensive tasks the disk access speed also declines. Therefore, in actual use, the instability of various factors adds many uncertain factors to the execution of the overall data migration.
On this basis, a task-load-based data migration method is now provided, which improves data migration efficiency by optimizing the task scheduling mechanism.
Summary of the invention
The technical task of the present invention is to address the above shortcomings and provide a practical, task-load-based data migration method.
A task-load-based data migration method is implemented as follows:
Data are extracted from a relational database and converted in format; data migration is then performed. The migration uses the MapReduce framework, and the concrete process is: data are first imported from the data source, and corresponding job classes are generated and packaged into a jar file, which is passed to the Hadoop platform for storage and processing.
Before the data migration operation is performed, the relevant basic information must first be configured. This information includes the location from which the data are moved, the address of the data migration destination, the number of map tasks used during migration, and the basic server configuration of each node of the existing distributed cluster. After the basic configuration is completed, the data migration operation starts.
The detailed data migration process is:
Step 1: set the parameters, parse the preset task execution information, and set the data source and the data output path;
Step 2: obtain data from the preset data source;
Step 3: convert the data format, i.e., convert the data from their original format into a format that the big data platform can store;
Step 4: divide and distribute the data, dispatching them to each node in the cluster;
Step 5: finally, write the data into the corresponding output path.
Step 1 is implemented as follows:
First, the preset system information about task execution is parsed, i.e., the basic information of the data to be migrated and the data information involved in the migration are set. This preset system information includes whether the data are backed up, the acquisition path of the data, the output path of the data, the original format of the data, the output format of the data, and the Mapper and Reducer classes that divide and compute the data;
Then, the data source is set, i.e., the current storage location of the data to be migrated is parsed, in preparation for the data migration operation;
Finally, the output path of the data is set, i.e., the location where the output data will be saved after format conversion.
The output path of the data uses an HBase table.
The detailed process of obtaining data from the preset data source in step 2 is: by means of Java programming, JDBC is used to obtain data from the preset relational database; the result set output after acquisition is a Java object named ResultSet.
The format conversion in step 3 refers to converting the ResultSet object into key/value pairs (Key/Value) through data format conversion.
The process of dividing the data and distributing the related data in step 4 is: MapReduce is started; the division and computation of the data are performed through MapReduce; the tasks are then allocated by Map and dispatched to each node in the cluster; finally the computation is done by Reduce, and the final result set is written to the destination address. Once all the data have been written to the big data platform, the goal of migrating the relational database data to the big data platform is achieved.
Step 4 is implemented as follows:
During data division and distribution, the tasks are assigned by the map of MapReduce to each node for processing; after division, the data are distributed to the TaskTracker of each node, and the data are then written into the user-preset HBase table for storage.
The detailed process of allocating tasks by Map and dispatching them to each node in the cluster is: when a TaskTracker with a free Slot is detected, the system checks its I/O resource occupancy; if this node's I/O resource occupancy is the lowest among all nodes and its task load entropy meets the preset threshold, the JobTracker automatically assigns the task to this TaskTracker; otherwise, if the node's I/O resources are heavily occupied, or its task load entropy is large, or many tasks are waiting to be processed, the task is not assigned to this node.
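The assignment rule just described can be sketched as a small selection function. This is an illustrative model under stated assumptions — the patent's real logic lives in a rewritten AssignTask method inside the JobTracker, and the names Node, chooseNode, ioOccupancy, and loadEntropy here are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

// Sketch of the rule above: pick the TaskTracker that has a free Slot,
// whose task load entropy meets the threshold, and whose I/O occupancy
// is the lowest among the qualifying nodes.
public class LoadAwareAssign {
    public static class Node {
        final String id; final int freeSlots;
        final double ioOccupancy; final double loadEntropy;
        public Node(String id, int freeSlots, double io, double entropy) {
            this.id = id; this.freeSlots = freeSlots;
            this.ioOccupancy = io; this.loadEntropy = entropy;
        }
    }

    // Returns the chosen node id, or empty if no node qualifies.
    public static Optional<String> chooseNode(Node[] nodes, double entropyThreshold) {
        return Arrays.stream(nodes)
                .filter(n -> n.freeSlots > 0)
                .filter(n -> n.loadEntropy <= entropyThreshold)
                .min(Comparator.comparingDouble(n -> n.ioOccupancy))
                .map(n -> n.id);
    }

    public static void main(String[] args) {
        Node[] cluster = {
            new Node("tt1", 2, 0.30, 0.4),
            new Node("tt2", 1, 0.10, 0.5),  // lowest I/O among qualifiers -> chosen
            new Node("tt3", 0, 0.05, 0.2),  // no free Slot, skipped
            new Node("tt4", 3, 0.08, 0.9)   // entropy above threshold, skipped
        };
        System.out.println(chooseNode(cluster, 0.6).orElse("none")); // tt2
    }
}
```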
The above task allocation is completed by a task scheduler, whose specific operating process is:
First, the corresponding field in the system file mapred-site.xml is set to the task scheduler class org.apache.hadoop.mapred.TaskScheduler, so that it can later be used in task scheduling;
Then the task-load-based task scheduler is implemented by designing and writing the AssignTask method in the JobTracker. The TaskTrackerStatus field of this method contains the information submitted by the TaskTracker in the heartbeat message (Heartbeat), including the maximum number of Map Slots, the maximum number of Reduce Slots, the maximum virtual memory, the physical memory, the amount of remaining free disk space, the execution state of each task, the disk access state, the number of remaining Slots, the disk access speed, and the real-time state of the CPU;
During task allocation, the JobTracker uses the information sent by the TaskTracker through the heartbeat message (Heartbeat) to decide which node to assign the task to; the main parameters are the Slot usage, the disk access state, and the task blocking and waiting state.
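The mapred-site.xml setting mentioned above might look like the fragment below. This is a sketch under assumptions: in classic MapReduce (MRv1) the relevant property is mapred.jobtracker.taskScheduler, whose value must name a subclass of org.apache.hadoop.mapred.TaskScheduler; the class name com.example.LoadAwareTaskScheduler is hypothetical.

```xml
<!-- mapred-site.xml: point the JobTracker at a custom scheduler class. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>com.example.LoadAwareTaskScheduler</value>
</property>
```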
Based on the above task scheduler, the concrete task scheduling situations are:
If all the Slots in a node's TaskTracker are occupied, the task scheduler refuses to assign new tasks to this TaskTracker;
If a TaskTracker still has free Slots, the disk occupancy, task waiting state, and so on must be monitored and judged:
a) If the current node is heavily loaded, i.e., the task load balance evaluation value p_i computed by the model is higher than the mean value p̄, no new tasks are assigned to this node;
b) If the current node is lightly loaded, i.e., the task load balance evaluation value p_i computed by the model is lower than the mean value p̄, this node can accept new tasks.
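The a)/b) comparison against the mean can be sketched as follows. The values and the name loadValues are illustrative; the patent's p_i comes from its ABC-based load-balance evaluation model, which is not reproduced here.

```java
import java.util.Arrays;

// Sketch of the rule above: a node may accept a new task only when its
// load evaluation value p_i is strictly below the cluster mean.
public class LoadBalanceRule {
    public static double mean(double[] p) {
        return Arrays.stream(p).average().orElse(0.0);
    }

    // true  -> case b): node i is lightly loaded, may accept new tasks
    // false -> case a): node i is at or above the mean, gets no new tasks
    public static boolean canAccept(double[] p, int i) {
        return p[i] < mean(p);
    }

    public static void main(String[] args) {
        double[] loadValues = {0.9, 0.4, 0.5, 0.2};   // hypothetical p_i per node
        System.out.println(mean(loadValues));          // 0.5
        System.out.println(canAccept(loadValues, 0));  // false (0.9 > 0.5)
        System.out.println(canAccept(loadValues, 3));  // true  (0.2 < 0.5)
    }
}
```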
Step 5 writes the data into HBase: the TaskTracker of each node computes and processes the tasks and outputs the data, writing the key/value pairs that are read into the HBase table. The data migration process ends when all Map tasks in the system have finished; otherwise, if the data migration has not finished, the JobTracker continues to assign tasks to each TaskTracker for processing.
The task-load-based data migration method of the present invention has the following advantages:
The proposed task-load-based data migration method uses the artificial bee colony (ABC) algorithm from swarm intelligence to optimize the task scheduling mechanism. The assumptions of Hadoop's default FIFO scheduling are all made under ideal conditions, but in actual operation and use they are basically impossible to satisfy; thus, in actual use, the instability of various factors adds many uncertain factors to the execution of the overall data migration. The optimized task scheduling mechanism relaxes, to a certain extent, the assumptions about each node's task processing capability, CPU speed, and the number of allocated map tasks. The optimization of task scheduling improves data migration efficiency: the original scheduling mechanism may leave the task load of some nodes in the distributed cluster heavier or lighter than others, which affects data read/write speed and thus, to a certain extent, the efficiency of the data migration. Through the optimization of task scheduling, we avoid, as far as possible, overloading or underloading of individual nodes, so that data migration efficiency can be improved. The method is practical and easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the FIFO scheduler principle.
Figure 2 illustrates the Hadoop task scheduling process optimized with the ABC algorithm.
Figure 3 is a flow chart of the data migration process.
Figure 4 shows the task-load-based data migration flow.
Figure 5 is a diagram of data volume versus migration time.
Figure 6 shows the corresponding map quantity and efficiency.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Traditional relational databases such as Oracle and MySQL have long occupied a dominant position in the database field. However, for a relational database, when the stored data reach a certain scale, the system easily runs into concurrency problems such as deadlock, causing read/write performance to degrade and affecting user operations such as querying, deleting, and inserting data. Facing the storage, read/write, and query workloads of massive data, building a big data platform with high reliability and good scalability is therefore very urgent for industrial enterprises. Many problems then arise: how to migrate relational database data to the big data platform more efficiently and with less resource consumption; how a distributed storage system should store data so that reads become faster; which data need to be migrated to the big data platform; how to query and mine the data conveniently, quickly, and efficiently after migration to the big data platform; and how to process frequently used data so that read and usage efficiency becomes higher.
As shown in Figures 2 to 4, for the need to migrate online data to a big data platform, a data migration method based on a task scheduling mechanism is proposed. To experimentally analyze the method, we implemented it with the Hadoop framework and compared it with Hadoop's default FIFO task scheduling mechanism to verify its effectiveness.
The present invention provides a task-load-based data migration method. In the present invention, the data migration process adopts the MapReduce framework: data are first imported from the data source, and corresponding job classes are generated and packaged into a jar file, which is passed to the Hadoop platform for storage and processing. This process effectively exploits the parallelism of MapReduce while using a task-load-based task scheduling strategy to optimize the data migration process and improve migration efficiency. The operating principle of the task-load-based scheduler is to allocate and schedule the individual tasks in the cluster according to the task load value p_i of each node; this effectively reduces task waiting time and avoids task blocking, thereby improving task processing efficiency and reducing the time consumed by data migration.
In this description, we use the MapReduce framework within Hadoop to complete the allocation and evaluation of tasks. Before performing the data migration operation, the relevant basic information must first be configured, mainly including: the location from which the data are moved (called the data source), the address of the data migration destination, the number of map tasks used during migration, and the basic server configuration of each node of the existing distributed cluster. After the basic configuration is completed, the data migration operation starts. With reference to Figures 4 and 6, we briefly describe the data migration process:
(1) Setting the parameters.
In this application, the system parameters must first be configured, mainly comprising the following information:
1. Parse the preset system information about task execution: set the basic information of the data to be migrated and the data information involved in the migration, e.g., whether the data are backed up, the acquisition path of the data, the output path of the data, the original format of the data, the output format of the data, and the Mapper and Reducer classes that divide and compute the data.
2. Set the data source: parse the current storage location of the data to be migrated, in preparation for the data migration operation.
3. Set the output path of the data: the location where the output data will be saved after format conversion; on the big data platform this is the distributed file system, HBase, or Hive — in this description we use an HBase table.
(2) Obtaining data from the preset data source.
By means of Java programming, we use JDBC to obtain data from the preset relational database; the result set output after acquisition is a Java object named ResultSet.
(3) Converting the data from their original format into a format the big data platform can store.
Through data format conversion, the ResultSet object is turned into key/value pairs (Key/Value).
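The row-to-key/value conversion can be sketched as below. A live JDBC ResultSet needs a database connection, so this sketch represents one row as a String array to stay self-contained; the choice of the first column as the Key and a tab-joined remainder as the Value is our assumption — the patent does not fix the exact key/value layout.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.StringJoiner;

// Sketch of step (3): turning one relational row into a Key/Value pair
// suitable for an HBase row.  A real implementation would iterate a JDBC
// ResultSet (rs.next(), rs.getString(i)); here a row is just a String[].
public class RowToKeyValue {
    // Assumption: column 0 holds the primary key (used as the row key);
    // the remaining columns are joined with '\t' as the value.
    public static Map.Entry<String, String> toKeyValue(String[] row) {
        StringJoiner value = new StringJoiner("\t");
        for (int i = 1; i < row.length; i++) value.add(row[i]);
        return new SimpleEntry<>(row[0], value.toString());
    }

    public static void main(String[] args) {
        String[] row = {"1001", "Alice", "2016-06-12"};
        Map.Entry<String, String> kv = toKeyValue(row);
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```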
(4) Start MapReduce, use MapReduce to divide and compute the data, use Map to allocate the tasks and dispatch them to each node in the cluster, use Reduce to compute, and write the final result set to the destination address. Once all the data have been written to the big data platform, the goal of migrating the relational database data to the big data platform is achieved. The detailed process is as follows:
After the data migration operation is started, the map of MapReduce first distributes the divided data to the TaskTracker of each node, and the data are then written into the user-preset HBase table for storage. In the configuration before execution, the number of Map tasks needs to be set; the data splitting before migration is affected by the map quantity, which to some degree affects the execution of each task. When a TaskTracker with a free Slot is detected, the system checks its I/O resource occupancy; if the resource occupancy is low and the node's task load entropy meets the preset threshold, the JobTracker automatically assigns the task to this TaskTracker. Otherwise, if the node's I/O resources are heavily occupied, or its task load entropy is large, or many tasks are waiting to be processed, then even if the TaskTracker has free Slots, the task scheduler avoids assigning tasks to this node, to prevent blocking that would reduce migration efficiency. Thus, using the task-load-based scheduler, the system can effectively monitor the I/O occupancy, CPU utilization, and task waiting/blocking state of each TaskTracker, and schedule accordingly, improving the overall performance of the migration system.
(5) Writing the data into HBase.
The TaskTracker of each node computes and processes the tasks and outputs the data, writing the key/value pairs that are read into the HBase table. The data migration process ends when all Map tasks in the system have finished; otherwise, if the data migration has not finished, the JobTracker continues to assign tasks to each TaskTracker for processing.
According to the ABC-algorithm-based task scheduling designed herein, a scheduler based on the task dispatch situation is written. The corresponding field in the system file mapred-site.xml is set to the task scheduler class org.apache.hadoop.mapred.TaskScheduler, so that it can later be used in task scheduling. Implementing the task-load-based task scheduler mainly involves designing and writing the AssignTask method in the JobTracker. The TaskTrackerStatus field of this method contains the information submitted by the TaskTracker in the Heartbeat (heartbeat message), including the maximum number of Map Slots, the maximum number of Reduce Slots, the maximum virtual memory, the physical memory, the amount of remaining free disk space, the execution state of each task, and the disk access state. Among these data, the factors with the greatest influence on task processing are the number of remaining Slots, the disk access speed, and the real-time state of the CPU. During task allocation, the JobTracker uses the information sent by the TaskTracker through the Heartbeat (heartbeat message) to decide which node to assign the task to; the main parameters are the Slot usage, the disk access state, and the task blocking and waiting state. During data migration, the task dispatch situations can be divided into the following cases:
1. If all the Slots in a node's TaskTracker are occupied, the task scheduler avoids, as far as possible, assigning new tasks to this TaskTracker.
2. If a TaskTracker still has free Slots, the disk occupancy, task waiting state, and so on must be monitored and judged:
a) If the current node is heavily loaded, i.e., the task load balance evaluation value p_i computed by the model is higher than the mean value p̄, no new tasks are assigned to this node.
b) If the current node is lightly loaded, i.e., the task load balance evaluation value p_i computed by the model is lower than the mean value p̄, this node can accept new tasks.
A concrete example of the data test and result analysis of the data migration method with task-load-optimized scheduling is as follows:
According to the presets, we ran data tests of the task-load-scheduled data migration and, under the same conditions, ran tests of the same data volumes with Hadoop's default FIFO scheduler, then analyzed and compared the two sets of results.
In the experiments, we tested seven data volumes; each group of data was run through six experiments, and the mean of the six experiments was used as the comparison value, to ensure the accuracy of the data.
The data test results based on the system default FIFO scheduler are shown in the table below:
Table 1: FIFO scheduler
A data migration performance test was then run on the design of the ABC-algorithm-based Hadoop task scheduler; the results are shown in Table 2:
Table 2: ABC-algorithm-based Hadoop task scheduler
The average times of the experiments in Tables 1 and 2 are compared in Figure 5.
According to Tables 1 and 2, we first perform a simple analysis of the test data; to ensure the reliability and accuracy of the data, we analyze only the means of the test data. When the data volume is 31,250, data migration with the system default scheduler takes somewhat less time than the latter; when the data volume is 62,500 and 125,000, although the six tests differ from one another, with large gaps in individual cases, the averaged elapsed times are close; when the data volume reaches 250,000 and 500,000, and up to 2,000,000, the ABC-algorithm-based task scheduler consumes less time than the FIFO scheduler, and as the data volume keeps increasing, the time differences between the two are 2, 1, 5, and 11 in turn. According to Figure 5, it can also be clearly seen that when the data volume is above 1,000,000, the red line in the figure is clearly above the blue line. We can therefore summarize: as the data volume gradually increases, under equal conditions, the efficiency of the ABC-algorithm-based task scheduler becomes higher and higher, the time consumed becomes less and less, and the data migration efficiency becomes higher and higher.
Besides migrating different data volumes, we also tested different numbers of map tasks allocated during data migration. Hadoop provides the parameter mapred.map.tasks, which can be used to set the number of maps, so we can control the map quantity through this parameter. However, setting the number of maps this way is not always effective, mainly because mapred.map.tasks is only a reference value for the number of maps in Hadoop; the final number of maps also depends on other factors.
Suppose the number of Maps enabled is M, the efficiency metric is the time T (s), and the data processing efficiency is P (records/s). To ensure the reliability and accuracy of the data, we test repeatedly with the same data table, with a different map quantity set in each test; the data volume in the table is known to be 2,000,000.
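The metric just defined can be sketched as below: with a fixed table of N = 2,000,000 records migrated in T seconds, the processing efficiency is P = N / T records per second. The sample times are hypothetical, not the values from Table 3.

```java
// Sketch of the efficiency metric P = N / T defined above, computed for
// several map quantities M.  The times are illustrative placeholders.
public class MigrationEfficiency {
    static final long RECORDS = 2_000_000L; // N, records in the test table

    public static double efficiency(double seconds) {
        return RECORDS / seconds; // P, records per second
    }

    public static void main(String[] args) {
        int[]    maps  = {1, 2, 4, 6, 8};           // M, maps enabled
        double[] times = {400, 250, 160, 180, 200}; // T in seconds (illustrative)
        for (int i = 0; i < maps.length; i++) {
            System.out.printf("M=%d  T=%.0fs  P=%.0f records/s%n",
                    maps[i], times[i], efficiency(times[i]));
        }
        // The illustrative shape mirrors the analysis below: P rises from
        // M=1 to M=4, then falls once M exceeds the useful parallelism.
    }
}
```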
During the data tests, we started 1, 2, 4, 6, and 8 map functions respectively, and for each map quantity migrated the same data table from the MySQL database into HBase. The test data are shown in Table 3 below:
Table 3: Data migration time and efficiency for different map quantities
From the data above, we obtain the graph of map quantity versus data migration efficiency, as in Figure 6.
According to Table 3 and Figure 6, we analyze the data obtained from the first map-quantity test. When the map quantity is 1, the elapsed time is the longest; this is because the map stage is responsible for splitting the input file, and with so few maps the time consumed is necessarily large. As the map quantity increases from 2 to 4, the efficiency gradually improves and the elapsed time becomes shorter and shorter. When the map quantity is 6, although the map quantity has increased, the data migration efficiency has instead dropped; the main reason is that when the map function is called, because the task situation of each node differs and task processing changes in real time, maps are scheduled and used across nodes, which increases the consumption of system resources — the calls of the map function occupy part of the system resources, so task processing efficiency falls instead of rising.
From the test data we can conclude:
(1) The data volume and the data migration time are not in a linear relationship; when the data volume is multiplied, the migration time does not increase by the corresponding multiple, i.e., the data migration time is not proportional to the migrated data volume. We analyze this result as follows: the main reason is that while MapReduce starts, a Job is also started, and Job preparation consumes some time and occupies part of the system resources. Therefore, for both Hadoop's default scheduler and the ABC-algorithm-based task scheduling process designed herein, the time consumed by data migration does not grow linearly with the migrated data volume.
(2) From the experimental data we can see that when the data volume is relatively small, the two schedulers consume roughly the same time to process the same tasks, but as the data volume keeps increasing, without considering the influence of other factors, the performance of the task-load-scheduled scheduler is better than that of the FIFO scheduler. It also shows that when tasks are relatively intensive, because the task processing capabilities of different node servers differ, the task scheduler preferentially dispatches tasks to nodes with lower task load values for processing, reducing task waiting time and avoiding blocking, so that data migration efficiency is improved.
(3) In task processing, the closer the configured map number is to the number of nodes, the higher the data migration efficiency; the reason is that with a reasonable quantity, the access speed of the storage system and the bandwidth of the cluster network can be used effectively.
Of course, in practical applications, more maps are not always better: when there are too many maps, the map calls in the system also occupy system and network resources, so the resources available to the data migration decrease and the migration efficiency drops.
From the above conclusions, when a cluster is processing tasks, both the optimization of task scheduling and the correct configuration of system parameters affect the execution efficiency of the tasks.
The present invention, based on the MapReduce framework, optimizes and improves the task scheduler simply by assigning tasks, according to the task capability of different nodes, to the TaskTrackers with lower real-time task load for execution. The performance improvement is more obvious for task-intensive jobs, but for less task-intensive jobs the effect is not obvious, and the scheduler occupies part of the system resources in use, affecting system performance. In future research, factors such as the CPU utilization of each node, the data throughput, and memory heterogeneity can be used to further improve the migration efficiency of task-intensive and data-intensive data migration jobs.
The above embodiments are only specific examples of the present invention, and the scope of patent protection of the present invention includes but is not limited to them. Any data migration method based on task load that conforms to the claims of the present invention, and any appropriate change or substitution made to it by a person of ordinary skill in the relevant technical field, shall fall within the scope of patent protection of the present invention.
Claims (10)
1. A data migration method based on task load, characterized in that it is implemented as follows: data are extracted from a relational database and converted in format; data migration is then performed using the MapReduce framework, the concrete process being: first import the data from the data source, generate the corresponding job class and package it into a jar file, then pass it to the Hadoop platform for storage and processing.
2. The data migration method based on task load according to claim 1, characterized in that before the data migration operation is executed, the relevant basic information must first be configured, including: the location from which the data are moved, the address of the data migration destination, the number of maps used in the migration process, and the basic configuration of each node server in the existing distributed cluster; after the basic configuration is completed, the data migration operation begins.
3. The data migration method based on task load according to claim 1, characterized in that the detailed migration process is:
Step 1: set the parameters, parse the preset information for task execution, and set the data source and the data output path;
Step 2: obtain the data from the preset data source;
Step 3: convert the data format, i.e. convert the data from their original format into a format storable on the big data platform;
Step 4: divide and distribute the data, delivering them to each node in the cluster;
Step 5: finally write the data into the corresponding output path.
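The five steps of claim 3 can be sketched as a minimal runnable simulation. This is an illustration only, with no Hadoop or JDBC dependency: the record shape (dicts with an `"id"` field), the round-robin division, and the dict-of-dicts output store are all invented stand-ins for the data source, the cluster nodes, and the HBase output path.

```python
# Minimal sketch of the claim 3 flow: obtain rows, convert them to
# key/value pairs, divide them across nodes, and write each shard to its
# output. All names are illustrative assumptions.

def migrate(rows, node_count, output):
    # Steps 1-2: 'rows' plays the role of the data obtained from the
    # preset relational data source with fixed parameters.
    # Step 3: convert each row into a key/value pair storable on the
    # target platform (key = row id, value = serialized row).
    kv_pairs = [(row["id"], str(row)) for row in rows]
    # Step 4: divide the pairs and distribute them across cluster nodes
    # (simple round-robin division here).
    shards = [kv_pairs[i::node_count] for i in range(node_count)]
    # Step 5: each node writes its shard into the corresponding output.
    for node, shard in enumerate(shards):
        for key, value in shard:
            output.setdefault(node, {})[key] = value
    return output
```

Calling `migrate([{"id": 1}, {"id": 2}, {"id": 3}], 2, {})` distributes keys 1 and 3 to the first node and key 2 to the second.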
4. The data migration method based on task load according to claim 3, characterized in that Step 1 is implemented as follows:
First, parse the system preset information about task execution, i.e. set the basic information of the data to be migrated and of the data in the migration process; this preset information includes whether the data are backed up, the acquisition path of the data, the output path of the data, the original format of the data, the output format of the data, and the Mapper and Reducer classes that divide and compute the data;
Then set the data source, i.e. parse the current storage location of the data to be migrated, in preparation for the data migration operation;
Finally set the output path of the data, i.e. the location where the output data will be saved after data format conversion; this output path uses an HBase table.
5. The data migration method based on task load according to claim 3, characterized in that the detailed process of obtaining data from the preset data source in Step 2 is: through Java programming, JDBC is used to obtain the data from the preset relational database; the result set output after acquisition is a Java object named ResultSet. Correspondingly, the data format conversion in Step 3 refers to converting the ResultSet object into key-value (Key/Value) pairs.
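The claim describes this step in Java/JDBC terms; the following is a hedged stand-in sketch in Python using the standard library's `sqlite3` module in place of a JDBC source. The cursor plays the role of the ResultSet, and the `(key, record)` pairs correspond to the Key/Value form of claim 5; the table and column names are invented for illustration.

```python
# Illustrative sketch of the Step 2 / Step 3 conversion: rows fetched
# from a relational source are turned into Key/Value pairs. sqlite3 is a
# hypothetical stand-in for the JDBC data source of the claim.
import sqlite3

def fetch_as_key_value(conn, table, key_column):
    """Read every row of 'table' and yield (key, row-dict) pairs."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    for row in cur:                        # cursor ~ ResultSet
        record = dict(zip(columns, row))
        yield record[key_column], record   # Key/Value form
```

For example, a table with one row `(1, 'a')` yields the single pair `(1, {"id": 1, "name": "a"})`.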
6. The data migration method based on task load according to claim 3, characterized in that the process of dividing the data and distributing the related data in Step 4 is: MapReduce is started and performs the division and computation of the data; the tasks are then allocated by Map and delivered to each node in the cluster; finally the computation is performed by Reduce and the final result set is written to the destination address. When all data have been written to the big data platform, the migration of the relational database data to the big data platform is accomplished.
7. The data migration method based on task load according to claim 6, characterized in that Step 4 is implemented as follows:
During data division and distribution, the tasks are assigned by the map of MapReduce to each node for processing; after division, the data are distributed to the TaskTracker of each node, and the data are then written into the HBase table preset by the user for storage;
The detailed process of allocating tasks by Map and delivering them to each node in the cluster is: when a TaskTracker with an idle Slot is detected, the system checks its I/O resource occupancy; if the I/O occupancy of this node is the lowest among all nodes and its task load entropy meets the preset threshold, the JobTracker automatically assigns the task to this TaskTracker; otherwise, if the I/O resources of this node are heavily occupied, or its task load entropy is high and many tasks are waiting, no task is assigned to this node.
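The assignment rule of claim 7 can be sketched as follows. This is an assumption-laden illustration: the node fields, the threshold value, and the reading that "meets the threshold" means at or below it are all invented here; the claim itself does not fix these details.

```python
# Hedged sketch of the claim 7 rule: among TaskTrackers reporting an idle
# Slot, assign the task only to a node whose I/O occupancy is the minimum
# over all nodes AND whose task load entropy meets the (assumed) preset
# threshold. Field names and the threshold are illustrative.

def pick_tracker(nodes, entropy_threshold=0.5):
    """Return the name of the node to assign the next task to, or None."""
    idle = [n for n in nodes if n["free_slots"] > 0]
    if not idle:
        return None
    min_io = min(n["io_occupancy"] for n in nodes)
    for n in idle:
        # "Meets the threshold" is taken here to mean at or below it.
        if n["io_occupancy"] == min_io and n["load_entropy"] <= entropy_threshold:
            return n["name"]
    return None  # heavy I/O or high entropy: leave the task waiting
```

A node with the minimum I/O occupancy but a high load entropy is skipped, matching the "otherwise" branch of the claim.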
8. The data migration method based on task load according to claim 7, characterized in that the above task allocation is performed by a task scheduler, whose specific operation is:
First, the corresponding field in the system file mapred-site.xml is set to the task scheduler class org.apache.hadoop.mapred.TaskScheduler, so that it can be used in subsequent task scheduling;
Then, to realize the task-load-based task scheduler, the AssignTask method in the JobTracker is designed and written; the TaskTrackerStatus field of this method includes the relevant information submitted by the TaskTracker in the heartbeat message (Heartbeat), namely: the maximum number of Map Slots, the maximum number of Reduce Slots, the virtual memory maximum, the physical memory, the size of the remaining free disk space, the execution state of each task, the disk access state, the number of remaining Slots, the disk access speed, and the real-time status of the CPU;
During task allocation, the JobTracker uses the information sent by the TaskTracker in the heartbeat message (Heartbeat) to decide which node the task should be assigned to; the main parameters are the Slot usage, the disk access state, and the task blocking and waiting states.
9. The data migration method based on task load according to claim 8, characterized in that, based on the above task scheduler, the concrete task scheduling behaviour is:
If all Slots in the TaskTracker of a node are already occupied, the task scheduler refuses to assign new tasks to that TaskTracker;
If there are still idle Slots on the TaskTracker, the disk occupancy, the task waiting situation, and so on are monitored and judged:
a) if the load value p_i of the current node, calculated by the task load balancing evaluation model, is higher than the average value, no new task is assigned to this node;
b) if the load value p_i of the current node, calculated by the task load balancing evaluation model, is below the average value, this node can accept new tasks.
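The claim 9 decision rule can be sketched directly. This is a minimal illustration under stated assumptions: the load values p_i are taken as given inputs, since the claim does not specify how the load-balancing evaluation model computes them, and the function name and signature are invented.

```python
# Sketch of the claim 9 rule: a node with free Slots accepts a new task
# only when its load value p_i is below the average load over all nodes.
# How p_i is computed by the evaluation model is not modeled here.

def can_accept(node_load, all_loads, free_slots):
    if free_slots == 0:           # all Slots occupied: always refuse
        return False
    mean = sum(all_loads) / len(all_loads)
    return node_load < mean       # rule (b): below-average load accepts
```

A node at or above the average load keeps its idle Slots unused until the cluster rebalances, which is what spares the overloaded nodes in the experiments above.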
10. The data migration method based on task load according to claim 3, characterized in that the detailed process of writing the data into HBase in Step 5 is: the TaskTracker of each node performs the computation task and outputs the data, writing the key-value pairs it reads into the HBase table; the data migration process ends when all Map tasks in the system have finished; otherwise, if the data migration has not yet finished, tasks continue to be assigned to each TaskTracker for processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610415905.4A CN106095940A (en) | 2016-06-14 | 2016-06-14 | A kind of data migration method of task based access control load |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106095940A true CN106095940A (en) | 2016-11-09 |
Family
ID=57845413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610415905.4A Pending CN106095940A (en) | 2016-06-14 | 2016-06-14 | A kind of data migration method of task based access control load |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095940A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104065685A (en) * | 2013-03-22 | 2014-09-24 | 中国银联股份有限公司 | Data migration method in cloud computing environment-oriented layered storage system |
CN105205117A (en) * | 2015-09-09 | 2015-12-30 | 郑州悉知信息科技股份有限公司 | Data table migrating method and device |
Non-Patent Citations (2)
Title |
---|
GENG YUSHUI et al.: "Cloud Data Migration Method Based on ABC Algorithm", International Journal of Database Theory and Application * |
LYU Mingyu: "Design and Implementation of a Data Mining and Data Migration System under the Hadoop Architecture", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106534359A (en) * | 2016-12-13 | 2017-03-22 | 中科院成都信息技术股份有限公司 | Storage load balancing method based on storage entropy |
CN106534359B (en) * | 2016-12-13 | 2019-05-14 | 中科院成都信息技术股份有限公司 | A kind of storage load-balancing method based on storage entropy |
CN106844051A (en) * | 2017-01-19 | 2017-06-13 | 河海大学 | The loading commissions migration algorithm of optimised power consumption in a kind of edge calculations environment |
CN106937092A (en) * | 2017-04-11 | 2017-07-07 | 北京邮电大学 | Video data moving method and device in a kind of Distributed Computing Platform |
CN107609061A (en) * | 2017-08-28 | 2018-01-19 | 武汉奇米网络科技有限公司 | A kind of method and apparatus of data syn-chronization |
WO2019219010A1 (en) * | 2018-05-14 | 2019-11-21 | 杭州海康威视数字技术股份有限公司 | Data migration method and device and computer readable storage medium |
CN109325016A (en) * | 2018-09-12 | 2019-02-12 | 杭州朗和科技有限公司 | Data migration method, device, medium and electronic equipment |
CN109783472A (en) * | 2018-12-14 | 2019-05-21 | 深圳壹账通智能科技有限公司 | Moving method, device, computer equipment and the storage medium of table data |
CN109857528A (en) * | 2019-01-10 | 2019-06-07 | 北京三快在线科技有限公司 | Speed adjustment method, device, storage medium and the mobile terminal of Data Migration |
CN111984395A (en) * | 2019-05-22 | 2020-11-24 | 中移(苏州)软件技术有限公司 | Data migration method and system, and computer readable storage medium |
CN111984395B (en) * | 2019-05-22 | 2022-12-13 | 中移(苏州)软件技术有限公司 | Data migration method, system and computer readable storage medium |
WO2020238858A1 (en) * | 2019-05-30 | 2020-12-03 | 深圳前海微众银行股份有限公司 | Data migration method and apparatus, and computer-readable storage medium |
CN110515724A (en) * | 2019-08-13 | 2019-11-29 | 新华三大数据技术有限公司 | Resource allocation method, device, monitor and machine readable storage medium |
CN110515724B (en) * | 2019-08-13 | 2022-05-10 | 新华三大数据技术有限公司 | Resource allocation method, device, monitor and machine-readable storage medium |
CN110532247A (en) * | 2019-08-28 | 2019-12-03 | 北京皮尔布莱尼软件有限公司 | Data migration method and data mover system |
CN114327253A (en) * | 2021-10-18 | 2022-04-12 | 杭州逗酷软件科技有限公司 | Data migration method and device, electronic equipment and storage medium |
CN114327253B (en) * | 2021-10-18 | 2024-05-28 | 杭州逗酷软件科技有限公司 | Data migration method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106095940A (en) | A kind of data migration method of task based access control load | |
Dong et al. | Greedy scheduling of tasks with time constraints for energy-efficient cloud-computing data centers | |
Gautam et al. | A survey on job scheduling algorithms in big data processing | |
WO2016078008A1 (en) | Method and apparatus for scheduling data flow task | |
Eskandari et al. | T3-scheduler: A topology and traffic aware two-level scheduler for stream processing systems in a heterogeneous cluster | |
CN101957863A (en) | Data parallel processing method, device and system | |
Arfat et al. | Big data for smart infrastructure design: Opportunities and challenges | |
Senthilkumar et al. | A survey on job scheduling in big data | |
CN107515784A (en) | A kind of method and apparatus of computing resource in a distributed system | |
Lin et al. | Performance evaluation of job schedulers on Hadoop YARN | |
Battré et al. | Detecting bottlenecks in parallel dag-based data flow programs | |
Son et al. | Timeline scheduling for out-of-core ray batching | |
CN112000657A (en) | Data management method, device, server and storage medium | |
Kastrinakis et al. | Video2Flink: real-time video partitioning in Apache Flink and the cloud | |
Chen et al. | Pisces: optimizing multi-job application execution in mapreduce | |
He et al. | Real-time scheduling in mapreduce clusters | |
US8650571B2 (en) | Scheduling data analysis operations in a computer system | |
US20230161620A1 (en) | Pull mode and push mode combined resource management and job scheduling method and system, and medium | |
Mortazavi-Dehkordi et al. | Efficient resource scheduling for the analysis of Big Data streams | |
Lasluisa et al. | In-situ feature-based objects tracking for data-intensive scientific and enterprise analytics workflows | |
Zohrati et al. | Flexible approach to schedule tasks in cloud‐computing environments | |
Su et al. | Node capability aware resource provisioning in a heterogeneous cloud | |
Fu et al. | Load Balancing Algorithms for Hadoop Cluster in Unbalanced Environment | |
Miranda et al. | Dynamic communication-aware scheduling with uncertainty of workflow applications in clouds | |
Wang et al. | A genetic algorithm based efficient static load distribution strategy for handling large-scale workloads on sustainable computing systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20161109 |