CN106095940A - A task-load-based data migration method - Google Patents

A task-load-based data migration method Download PDF

Info

Publication number
CN106095940A
CN106095940A
Authority
CN
China
Prior art keywords
data
task
node
data migration
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610415905.4A
Other languages
Chinese (zh)
Inventor
耿玉水 (Geng Yushui)
孙涛 (Sun Tao)
袁家恒 (Yuan Jiaheng)
姜雪松 (Jiang Xuesong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN201610415905.4A priority Critical patent/CN106095940A/en
Publication of CN106095940A publication Critical patent/CN106095940A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/214Database migration support

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a task-load-based data migration method, implemented as follows: data are extracted from a relational database and converted in format; the data migration is then performed using the MapReduce framework. The concrete migration process is: first import data from the data source, generate the corresponding job class, package it into a jar file, and pass it to the Hadoop platform for storage and processing. Compared with the prior art, this task-load-based data migration method removes the assumptions about each node's task processing capability, CPU speed, and allotted number of map tasks; by optimizing task scheduling it avoids, as far as possible, overloaded or underloaded nodes, thereby improving data migration efficiency. The method is practical.

Description

A task-load-based data migration method
Technical field
The present invention relates to the field of data migration technology, and in particular to a practical, task-load-based data migration method.
Background technology
With the continuous development of information technology, the volume of data produced by human society grows at an astonishing rate every year, and society has entered a brand-new information age. The accompanying problem of data storage has gradually become a hot topic. In recent years, with the development of cloud computing, big data, and related technologies, the landscape of the information industry has also gradually changed. In enterprise applications, data storage must meet round-the-clock high-availability requirements, yet as network storage systems grow in scale, the number of system failures also rises sharply. In addition, although storage systems keep growing and storage devices keep multiplying, resource utilization in storage systems remains at a very low level. On the other hand, as the amount of stored data increases, systems must offer high scalability; large-scale storage systems are often composed of highly heterogeneous storage devices of different types supplied by different manufacturers, which makes managing the storage system more complex and makes future system expansion a serious difficulty. Under current technical conditions, adopting cloud storage can effectively solve the above problems.
Under normal circumstances, an enterprise's information system comprises multiple different business systems, and each business system includes its own online operation system, archiving system, and backup system. For cost reasons, an enterprise's storage system may migrate the data of the online business platform to a back-end cloud storage platform. The data migration process, however, is very complex, and many problems must be solved. Among the problems of data migration, we mainly studied one: when relational data are migrated to a big data platform, the migration efficiency of the data leaves much room for improvement. We improve data migration efficiency by optimizing the task scheduling mechanism.
Generally, if the user does not configure Hadoop specially, FIFO scheduling is used for task scheduling; its operating principle is shown in Figure 1. When Hadoop runs jobs with the FIFO scheduler, it makes the following assumptions:
1. In the distributed cluster, every node has the same task processing capability.
2. While processing tasks, each node's computing speed and capability remain constant and are unaffected by other factors.
3. When the system processes identical tasks, the number of tasks accepted by map and reduce in MapReduce and the amount of task computation allotted to them are all equal.
All of the above hold only under ideal conditions; in actual operation and use, the three assumptions are essentially impossible to satisfy. First, because of the heterogeneity of the servers at different nodes, their task processing capabilities can hardly be identical: for example, when processing compute-intensive tasks the CPU slows down, and when processing data-intensive tasks the disk read speed also declines. Therefore, in actual use, the instability of various factors adds many unstable elements to the execution of the overall data migration.
On this basis, a task-load-based data migration method is now provided; it improves data migration efficiency by optimizing the task scheduling mechanism.
Summary of the invention
The technical task of the present invention, in view of the above shortcomings, is to provide a practical, task-load-based data migration method.
A task-load-based data migration method is implemented as follows:
Data are extracted from a relational database and converted in format; the data migration is then performed. The migration process uses the MapReduce framework; concretely: first import data from the data source, generate the corresponding job class, package it into a jar file, and pass it to the Hadoop platform for storage and processing.
Before performing the data migration operation, the relevant basic information must first be configured. This basic information includes the location from which the data are moved out, the address of the data migration destination, the number of map tasks used in the migration, and the basic server configuration of each node of the existing distributed cluster. After completing the basic configuration, the data migration operation begins.
The detailed data migration process is:
Step 1: set the parameters, parse the preset information for task execution, and set the data source and the data output path;
Step 2: obtain the data from the preset data source;
Step 3: convert the data format, i.e. convert the data from their original format into a format the big data platform can store;
Step 4: divide and distribute the data, issuing them to each node in the cluster;
Step 5: finally, write the data to the corresponding output path.
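The five steps above can be illustrated with a minimal, dependency-free Java sketch. All names here (`MigrationSketch`, `toKeyValue`, `partition`, `write`) are our own illustration, not the patent's implementation: rows stand in for the relational source, round-robin partitioning stands in for MapReduce's input splits, and a `TreeMap` stands in for the HBase output table.

```java
import java.util.*;

// A minimal, dependency-free sketch of the five-step migration flow.
// Illustrative only: the patent uses MapReduce/HBase, simulated here.
class MigrationSketch {

    // Step 3: convert a row into a Key/Value pair, mirroring the
    // ResultSet -> Key/Value conversion (first column assumed to be the key).
    static Map.Entry<String, String> toKeyValue(String[] row) {
        String key = row[0];
        String value = String.join("\t", Arrays.copyOfRange(row, 1, row.length));
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Step 4: divide the rows among `nodes` partitions (round-robin here;
    // the real framework divides the data by input splits).
    static List<List<Map.Entry<String, String>>> partition(List<String[]> rows, int nodes) {
        List<List<Map.Entry<String, String>>> parts = new ArrayList<>();
        for (int i = 0; i < nodes; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < rows.size(); i++)
            parts.get(i % nodes).add(toKeyValue(rows.get(i)));
        return parts;
    }

    // Step 5: each "node" writes its pairs into the shared output table.
    static Map<String, String> write(List<List<Map.Entry<String, String>>> parts) {
        Map<String, String> table = new TreeMap<>();
        for (List<Map.Entry<String, String>> part : parts)
            for (Map.Entry<String, String> e : part) table.put(e.getKey(), e.getValue());
        return table;
    }
}
```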
The concrete implementation of Step 1 is:
First parse the system preset information about task execution, i.e. set the basic information of the data to be migrated and the data information in the migration. This preset information includes whether the data are backed up, the acquisition path of the data, the output path of the data, the original format of the data, the output format of the data, and the Mapper class and Reducer class that divide and compute over the data;
Then set the data source, i.e. parse the current storage location of the data to be migrated, in preparation for the migration operation;
Finally set the output path of the data, i.e. the location where the output data will be saved after format conversion.
The output path of the data uses an HBase table.
The detailed process of obtaining data from the preset data source in Step 2 is: by means of Java programming, use JDBC to obtain the data source in the preset relational database; the result set output after acquisition is a Java object named ResultSet.
Converting the data format in Step 3 means converting the ResultSet object, through data format conversion, into key-value-pair (Key/Value) form.
The process of dividing the data and issuing the related data in Step 4 is: start MapReduce, divide and compute the data through MapReduce, then allocate the tasks through Map and issue them to each node in the cluster, finally compute through Reduce and write the final result set to the destination address. After all the data have been completely written to the big data platform, the goal of migrating relational database data to the big data platform is achieved.
The concrete implementation of Step 4 is:
In the process of dividing the data and issuing the related data, the tasks are assigned to each node for processing by the map of MapReduce; after the data are divided, they are distributed to the TaskTracker of each node, and the data are then written to the user-preset HBase table for storage.
The detailed process of allocating tasks through Map and issuing them to each node in the cluster is: when a TaskTracker with a free Slot is detected, the system checks its I/O resource occupancy; if this node's I/O occupancy is the lowest among all nodes and its task load entropy meets the preset threshold, the JobTracker automatically distributes the task to this TaskTracker. Otherwise, if the node's I/O resources are found to be heavily occupied, its task load entropy is large, or it has many tasks waiting to be processed, the task is not assigned to this node.
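The assignment check just described can be sketched as a single predicate. The patent gives no formula for "task load entropy", so the names, the meaning of the threshold, and the "lower entropy is better" reading below are our assumptions for illustration only.

```java
// Sketch of the slot / I-O / load-entropy assignment check described above.
// All names and the threshold semantics are illustrative assumptions;
// the patent does not define the entropy formula.
class AssignCheck {

    // Assign to node `i` only if it has a free Slot, its I/O occupancy is
    // the minimum among all nodes, and its load entropy meets the threshold.
    static boolean shouldAssign(int freeSlots, double[] ioUsage, int i,
                                double loadEntropy, double threshold) {
        if (freeSlots <= 0) return false;          // all Slots occupied
        for (double u : ioUsage)
            if (u < ioUsage[i]) return false;      // not the minimum I/O occupancy
        return loadEntropy <= threshold;           // entropy within the preset bound
    }
}
```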
The above task allocation is completed by the task scheduler, whose concrete operating process is:
First set the corresponding field in the system file mapred-site.xml to the task scheduler class org.apache.hadoop.mapred.TaskScheduler, so that it can be used in the subsequent task scheduling;
Then implement the task-load-based task scheduler by designing and writing the AssignTask method in the JobTracker. The TaskTrackerStatus field of this method contains the related information submitted by the TaskTracker in the heartbeat message (Heartbeat), including the maximum number of Map Slots, the maximum number of Reduce Slots, the virtual memory maximum, the physical memory, the amount of remaining free disk space, the execution state of each task, the disk access state, the number of remaining Slots, the disk access speed, and the real-time state of the CPU;
While allocating tasks, the JobTracker uses the information obtained from the TaskTracker via the heartbeat message to decide which node to assign the task to; the main parameters are Slot usage, disk access state, and task blocking and waiting state.
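For reference, in Hadoop 1.x the scheduler is plugged in via the `mapred.jobtracker.taskScheduler` property; a custom scheduler would be a subclass of the abstract `org.apache.hadoop.mapred.TaskScheduler` class named in the text. A sketch of what the mapred-site.xml entry might look like (the class name in the value is a hypothetical placeholder for the patent's own scheduler subclass):

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <!-- Hypothetical subclass of org.apache.hadoop.mapred.TaskScheduler
       implementing the task-load-based AssignTask logic described above -->
  <value>org.example.TaskLoadScheduler</value>
</property>
```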
Based on the above task scheduler, the concrete task scheduling situations are:
(1) If all the Slots in a node's TaskTracker are occupied, the task scheduler refuses to distribute new tasks to that TaskTracker;
(2) If there are still free Slots on the TaskTracker, disk occupancy, task waiting, and so on must be monitored and judged:
a) If the current node is heavily loaded, i.e. its task load balance value p_i computed by the evaluation model is higher than the mean value p̄, new tasks are not distributed to this node;
b) If the current node is lightly loaded, i.e. its task load balance value p_i computed by the evaluation model is lower than the mean value p̄, this node can accept new tasks.
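The p_i-versus-mean rule above can be written directly. The patent does not define how p_i itself is computed by the evaluation model, so the values passed in below are assumed inputs and the class/method names are illustrative.

```java
// Sketch of the load-balance rule: a node accepts new tasks only when its
// task load value p_i is below the mean over all nodes. The computation of
// p_i itself is not specified by the patent; values here are assumed inputs.
class LoadBalanceRule {

    static double mean(double[] p) {
        double s = 0;
        for (double v : p) s += v;
        return s / p.length;
    }

    // true = node i may accept a new task (its p_i is below the cluster mean)
    static boolean acceptsNewTask(double[] p, int i) {
        return p[i] < mean(p);
    }
}
```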
Step 5 writes the data to HBase: the TaskTracker of each node computes and processes the tasks and outputs the data, writing the key-value-pair data it reads into the HBase table. The data migration process ends when all the Map tasks in the system have finished; otherwise, if the data migration has not yet ended, tasks continue to be assigned to each TaskTracker for processing.
The task-load-based data migration method of the present invention has the following advantages:
The proposed task-load-based data migration method uses the artificial bee colony (ABC) algorithm from swarm intelligence to optimize the task scheduling mechanism. The assumptions behind Hadoop's default FIFO scheduling hold only under ideal conditions and are essentially impossible to satisfy in actual operation and use; thus, in actual use, the instability of various factors adds many unstable elements to the execution of the overall data migration. The optimized task scheduling mechanism removes, to a certain extent, the assumptions about each node's task processing capability, CPU speed, and allotted number of map tasks. Optimizing the task scheduling mechanism also improves data migration efficiency: the original scheduling mechanism may leave the task load of some nodes in the distributed cluster too heavy or too light, which affects the read/write speed of the data and thereby, to a certain extent, the efficiency of the migration. Through the optimization of task scheduling we avoid, as far as possible, overloaded or underloaded nodes, thereby improving data migration efficiency. The method is practical and easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the FIFO scheduler principle.
Figure 2 illustrates the ABC-algorithm-optimized Hadoop task scheduling process.
Figure 3 is a flow chart of the data migration process.
Figure 4 is the task-load-based data migration flow.
Figure 5 is a schematic diagram of data volume versus migration time.
Figure 6 shows the corresponding map counts and efficiencies.
Detailed description of the invention
The invention is further described below in conjunction with the accompanying drawings and specific embodiments.
Traditional relational databases such as Oracle and MySQL dominated the database field for a very long time. For a relational database, however, once the stored data reach a certain scale, concurrency problems such as deadlocks easily appear, degrading read/write performance and affecting users' query, delete, and insert operations on the data. Facing the storage, read/write, and query workloads of massive data, building a highly reliable, well-scalable big data platform is therefore very urgent for enterprises. Many problems must still be faced: how to migrate relational database data to the big data platform more efficiently and with less resource consumption; how a distributed storage system should store data so that data reads become faster; which data need to be migrated to the big data platform; how data query, data mining, and so on can be convenient, fast, and efficient after migration; and how more frequently used data should be handled so that data reading and utilization are more efficient.
As shown in Figures 2-4, to meet the demand of migrating online data to a big data platform, a data migration method based on the task scheduling mechanism is proposed. To analyze the method experimentally, we implemented it with the Hadoop framework and verified its effectiveness by comparing it with Hadoop's default FIFO task scheduling mechanism.
The present invention provides a task-load-based data migration method. In the present invention, the data migration process adopts the MapReduce framework: first import data from the data source, generate the corresponding job class, package it into a jar file, and pass it to the Hadoop platform for storage and processing. This process makes effective use of MapReduce's parallelism, while the task scheduling strategy based on task load values optimizes the migration process and improves migration efficiency. The task-load-based task scheduler operates by allocating and scheduling the individual tasks in the cluster according to each node's task load value p_i; this effectively reduces task waiting time and avoids task blocking, thereby improving task processing efficiency and reducing the time consumed by the data migration.
In this paper we use the MapReduce framework within Hadoop to complete task distribution and computation. Before performing the data migration operation, the relevant basic information must first be configured, consisting mainly of the location from which the data are moved out (called the data source), the address of the data migration destination, the number of map tasks used in the migration, and the basic server configuration of each node of the existing distributed cluster. After completing the basic configuration, the data migration operation begins. Following Figures 4 and 6, the data migration process is briefly described:
(1) Set the parameters.
In the application, the system parameters must first be configured, mainly comprising the following information:
1. Parse the system preset information about task execution: set the basic information of the data to be migrated and the data information in the migration, e.g. whether the data are backed up, the acquisition path of the data, the output path, the original format and output format of the data, and the Mapper and Reducer classes that divide and compute over the data.
2. Set the data source: parse the current storage location of the data to be migrated, in preparation for the migration operation.
3. Set the output path of the data: the location where the output data are saved after format conversion; on the big data platform this can be the distributed file system, HBase, or Hive — in this paper we use an HBase table.
(2) Obtain data from the preset data source.
By means of Java programming, we use JDBC to obtain the data source in the preset relational database; the result set output after acquisition is a Java object named ResultSet.
(3) Convert the data from their original format into a format the big data platform can store.
Through data format conversion, the ResultSet object becomes key-value-pair (Key/Value) form.
(4) Start MapReduce; use MapReduce to divide and compute the data, use Map to allocate the tasks and issue them to each node in the cluster, use Reduce to compute, and write the final result set to the destination address. After all the data have been completely written to the big data platform, the goal of migrating relational database data to the big data platform is achieved. The detailed process is as follows:
After the data migration operation starts, the map of MapReduce first distributes the divided data to the TaskTracker of each node, and the data are then written to the user-preset HBase table for storage. In the pre-execution configuration the number of Maps must be set, and the data splitting before migration is affected by the map count, which to a certain extent influences the execution of each task. When a TaskTracker with a free Slot is detected, the system checks its I/O resource occupancy; if its resource occupancy is low and the node's task load entropy meets the preset threshold, the JobTracker automatically assigns the task to this TaskTracker. Otherwise, if the node's I/O resources are heavily occupied, its task load entropy is large, or it has many waiting tasks, then even if the TaskTracker has a free Slot, the task scheduler avoids assigning tasks to this node so as to prevent blocking, which would affect migration efficiency. Thus, using the task-load-based scheduler, the system can effectively monitor and identify each TaskTracker's I/O occupancy, CPU utilization, and task waiting/blocking, optimize the scheduling accordingly, and improve the overall performance of the migration system.
(5) Write the data to HBase.
The TaskTracker of each node computes and processes the tasks and outputs the data, writing the key-value-pair data it reads into the HBase table. The data migration process ends when all the Map tasks in the system have finished; otherwise, if the data migration has not yet ended, tasks continue to be assigned to each TaskTracker for processing.
According to the ABC-algorithm-based task scheduling designed in this paper, a scheduler based on the task dispatch situation is written. The corresponding field in the system file mapred-site.xml is set to the task scheduler class org.apache.hadoop.mapred.TaskScheduler so that it can be used in subsequent task scheduling. Implementing the task-load-based task scheduler mainly means designing and writing the AssignTask method in the JobTracker; the TaskTrackerStatus field of this method contains the related information submitted by the TaskTracker in the heartbeat message (Heartbeat), including the maximum number of Map Slots, the maximum number of Reduce Slots, the virtual memory maximum, the physical memory, the amount of remaining free disk space, the execution state of each task, and the disk access state. Among these data, the factors with the greatest influence on task processing are the number of remaining Slots, the disk access speed, and the real-time state of the CPU.
While allocating tasks, the JobTracker uses the information obtained from the TaskTracker via the heartbeat message (Heartbeat) to decide which node to assign a task to; the main parameters are Slot usage, disk access state, and task blocking and waiting state. In the data migration process, the task dispatch situations can be divided into the following cases:
1. If all the Slots in a node's TaskTracker are occupied, the task scheduler avoids distributing new tasks to that TaskTracker as far as possible.
2. If there are still free Slots on the TaskTracker, disk occupancy, task waiting, and so on must be monitored and judged:
a) If the current node is heavily loaded, i.e. its task load balance value p_i computed by the evaluation model is higher than the mean value p̄, new tasks are not distributed to this node.
b) If the current node is lightly loaded, i.e. its task load balance value p_i computed by the evaluation model is lower than the mean value p̄, this node can accept new tasks.
A concrete example of data testing and result analysis for the scheduling-optimized, task-load-based data migration method follows:
According to the presets, we test the data migration with task-load-based scheduling, and under the same conditions we run the same data volumes through Hadoop's default FIFO scheduler, then compare and analyze the final results of both.
In the experiments we test seven data volumes; each group of data is run through six experiments, and the mean value of the six experiments is used as the comparison value, which ensures the accuracy of the data.
The data test results with the system default FIFO scheduler are shown in the table below:
Table 1: FIFO scheduler
The designed ABC-algorithm-based Hadoop task scheduler is subjected to the same data migration performance test; the results are shown in Table 2:
Table 2: ABC-algorithm-based Hadoop task scheduler
The mean times of the experiments in Tables 1 and 2 are compared, as shown in Figure 5.
From Tables 1 and 2 we first analyze the test data simply; to ensure reliability and accuracy we analyze only the mean values of the test data. When the data volume is 31,250, the system default scheduler consumes slightly less time for the migration than the latter; when the data volume is 62,500 or 125,000, although the six tests differ from one another, and in individual cases the gap is large, the mean elapsed times are close; when the data volume reaches 250,000 and 500,000, and up to 2,000,000, the ABC-algorithm-based task scheduler consumes less time than the FIFO scheduler, and as the data volume keeps increasing, the time differences between the two are 2, 1, 5, and 11 in turn. Further, from Figure 5 it can be clearly seen that when the data volume exceeds 1,000,000 the red line in the figure is clearly above the blue line. We can therefore summarize: as the data volume gradually increases, under equal conditions the efficiency of the ABC-algorithm-based task scheduler becomes higher and higher, the time consumed becomes smaller and smaller, and the data migration efficiency becomes higher and higher.
Besides running data migration with different data volumes, we also test different numbers of map tasks distributed in the migration process. Hadoop provides the parameter mapred.map.tasks, which can be used to set the number of maps, so we can control the map count through this parameter. However, setting the number of maps this way is not always effective, mainly because mapred.map.tasks is only a reference value for the number of maps in Hadoop; the final number of maps also depends on other factors.
Suppose the number of Maps enabled is M, the efficiency metric is the time T (s), and the data processing efficiency is P (records/s). To ensure reliability and accuracy, we test the same data table repeatedly, with a different map count in each test; the data volume in the table is known to be 2,000,000.
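The metric just defined, P = N / T, is computed directly from the data volume and the measured time. The sketch below shows the computation and how the most efficient map-count setting would be picked from a list of measured times; the sample times in the usage example are illustrative placeholders, not the values of Table 3.

```java
// Computing the processing efficiency P = N / T (records per second) for
// the metric defined above, and picking the most efficient run. The times
// fed in are assumed inputs, not the patent's measured data.
class Efficiency {

    static double efficiency(long records, double seconds) {
        return records / seconds;
    }

    // Index of the run (e.g. one per tested map count) with the highest P.
    static int bestIndex(double[] timesSeconds, long records) {
        int best = 0;
        for (int i = 1; i < timesSeconds.length; i++)
            if (efficiency(records, timesSeconds[i]) > efficiency(records, timesSeconds[best]))
                best = i;
        return best;
    }
}
```

For a fixed data volume, the run with the smallest T has the highest P, so `bestIndex` simply locates the fastest tested map count.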
During the data tests, we start 1, 2, 4, 6, and 8 map functions respectively and, for each map count, migrate the same data table from the MySQL database into HBase. The test data are shown in Table 3 below:
Table 3: Data migration times and efficiencies for different map counts
From the above data we obtain the relation between the number of Maps and data migration efficiency, as shown in Figure 6.
From Table 3 and Figure 6 we analyze the data obtained in the map-count tests. When the map count is 1, the time taken is the longest: since the map stage is responsible for splitting the input file, too small a map count necessarily takes longer. As the map count increases from 2 to 4, efficiency gradually improves and the time taken keeps shrinking. When the map count is 6, although the map count increased, the data migration efficiency dropped instead; the main reason is that when the map function is called, because the task situation of each node differs and task processing changes in real time, maps are scheduled back and forth among the nodes, which increases the consumption of system resources — the map function calls occupy part of the system resources, so task processing efficiency falls rather than rises.
From the test data we can conclude:
(1) Data volume and data migration time are not in a linear relationship: when the data volume grows by a multiple, the migration time does not grow by the corresponding multiple, i.e. the data migration time is not proportional to the migrated data volume. We analyze this as follows: the main reason is that while MapReduce starts, the Job also starts, and Job preparation consumes part of the time and occupies part of the system resources. Thus, for both Hadoop's default scheduler and the ABC-algorithm-based task scheduling designed in this paper, the time consumed by data migration does not grow linearly with the migrated data volume.
(2) The experimental data show that when the data volume is relatively small, the two schedulers consume roughly the same time on the same task; but as the data volume keeps growing, without considering the influence of other factors, the scheduler based on task-load scheduling performs better than the FIFO scheduler. It also shows that when tasks are relatively dense, since the task processing capability of different node servers differs, the task scheduler preferentially dispatches tasks to nodes with lower task load values, reducing task waiting time and avoiding blocking, so that data migration efficiency is improved.
(3) In task processing, the closer the configured number of maps is to the number of nodes, the higher the data migration efficiency; the reason is that with a reasonable map count, the access speed of the storage system and the cluster network bandwidth can be used effectively.
However, in practical applications, more map tasks are not always better: when the number of map tasks is too large, their invocation in the system also occupies system resources and network resources, so the resources available for data migration shrink and migration efficiency drops.
From the above conclusions, when a cluster is processing tasks, both the optimization of task scheduling and the correct configuration of system parameters have an impact on task execution efficiency.
Based on the MapReduce framework, the present invention optimizes and improves the task scheduler: it simply assigns tasks, according to the differing task capacities of the nodes, to the TaskTracker whose real-time task load is lower. The performance gain is more obvious for jobs with dense tasks; for less task-intensive jobs the effect is less pronounced, and the scheduler itself occupies a part of the system resources during use, which affects system performance. In future research, factors such as per-node CPU utilization, data throughput, and memory heterogeneity can be taken into account to further improve the migration efficiency of task-intensive and data-intensive data migration jobs.
The above embodiment is only a concrete case of the present invention; the patent protection scope of the present invention includes but is not limited to this embodiment. Any task-load-based data migration method conforming to the claims of the present invention, and any suitable change or replacement made to it by a person of ordinary skill in the relevant technical field, shall fall within the patent protection scope of the present invention.

Claims (10)

1. A task-load-based data migration method, characterized in that its implementation process is: extract data from a relational database and convert the data format; then execute the data migration, which uses the MapReduce framework; the concrete migration process is: first import the data from the data source, generate the corresponding job class and package it into a jar file, then pass it to the Hadoop platform for storage and processing.
2. The task-load-based data migration method according to claim 1, characterized in that before the data migration job is executed, the relevant basic information must first be configured; this information includes the location from which the data are moved out, the address of the data migration destination, the number of map tasks used in the migration process, and the basic configuration of the servers at each node of the existing distributed cluster; after the basic configuration is completed, the data migration job starts.
3. The task-load-based data migration method according to claim 1, characterized in that the detailed migration process is:
Step 1: set parameters, parse the preset information of task execution, and set the data source and the data output path;
Step 2: obtain data from the preset data source;
Step 3: convert the data format, converting the data from its original format to a format storable by the big data platform;
Step 4: divide and distribute the data, issuing it to each node in the cluster;
Step 5: finally write the data to the corresponding output path.
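As a rough illustration, the five steps above can be simulated end-to-end in plain Java. This is a minimal sketch, not the patent's implementation: the class MigrationSketch and its methods are hypothetical names, the "data source" is an in-memory table, and the "cluster" is a list of maps standing in for the nodes.

```java
import java.util.*;

/** Minimal sketch of the claimed five-step migration flow (illustrative names). */
public class MigrationSketch {

    /** Step 2: obtain rows from the preset data source (simulated in memory). */
    public static List<String[]> fetchRows() {
        return Arrays.asList(new String[]{"1", "alice"}, new String[]{"2", "bob"});
    }

    /** Step 3: convert each row into a Key/Value pair (id -> field). */
    public static Map<String, String> toKeyValue(List<String[]> rows) {
        Map<String, String> kv = new LinkedHashMap<>();
        for (String[] row : rows) kv.put(row[0], row[1]);
        return kv;
    }

    /** Step 4: divide the pairs among the cluster nodes (round-robin). */
    public static List<Map<String, String>> partition(Map<String, String> kv, int nodes) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int n = 0; n < nodes; n++) parts.add(new LinkedHashMap<>());
        int i = 0;
        for (Map.Entry<String, String> e : kv.entrySet())
            parts.get(i++ % nodes).put(e.getKey(), e.getValue());
        return parts;
    }

    /** Step 5: write every partition to the output path (simulated sink). */
    public static Map<String, String> writeOut(List<Map<String, String>> parts) {
        Map<String, String> sink = new TreeMap<>();
        for (Map<String, String> p : parts) sink.putAll(p);
        return sink;
    }

    public static void main(String[] args) {
        Map<String, String> out = writeOut(partition(toKeyValue(fetchRows()), 2));
        System.out.println(out); // {1=alice, 2=bob}
    }
}
```

In the real method the sink would be an HBase table and the partitions would be map tasks; the sketch only shows that the five steps compose into a single pipeline.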
4. The task-load-based data migration method according to claim 3, characterized in that step 1 is implemented as follows:
First parse the system preset information about task execution, i.e. set the basic information of the data to be migrated and the data information in the data migration; this system preset information includes whether the data are backed up, the acquisition path of the data, the output path of the data, the original format of the data, the output format of the data, and the Mapper class and Reducer class that divide and compute the data;
Then set the data source, i.e. parse the current storage location of the data to be migrated, in preparation for the data migration job;
Finally set the output path of the data, i.e. the location where the output data will be saved after data format conversion; this data output path uses an HBase table.
5. The task-load-based data migration method according to claim 3, characterized in that the detailed process of obtaining data from the preset data source in step 2 is: by means of Java programming, JDBC is used to obtain the data source in the preset relational database; the result set output after acquisition is a Java object named ResultSet. Correspondingly, the format conversion in step 3 means converting the ResultSet object, through data format conversion, into the form of key-value (Key/Value) pairs.
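The ResultSet-to-key/value conversion described in this claim can be sketched with the standard JDBC interfaces. The helper below is an illustration under stated assumptions: a real migration would obtain the ResultSet from the database via JDBC, whereas here a dynamic proxy fakes one so the sketch runs without a database, and the key/value layout (first column as key, remaining columns joined with commas as value) is our own choice, not something the claim fixes.

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.*;

/** Sketch: turn a JDBC ResultSet into Key/Value pairs, as claim 5 describes. */
public class ResultSetToKV {

    /** Key = first column, Value = remaining columns joined with ','. */
    public static Map<String, String> toKeyValue(ResultSet rs) {
        try {
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();
            Map<String, String> kv = new LinkedHashMap<>();
            while (rs.next()) {
                StringBuilder value = new StringBuilder();
                for (int c = 2; c <= cols; c++) {
                    if (c > 2) value.append(',');
                    value.append(rs.getString(c));
                }
                kv.put(rs.getString(1), value.toString());
            }
            return kv;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** In-memory stand-in for a driver-backed ResultSet, built with a dynamic
     *  proxy so the sketch runs without a database; only next(), getString(int)
     *  and getMetaData() are implemented. */
    public static ResultSet fakeResultSet(String[][] rows) {
        int[] cursor = {-1};
        ResultSetMetaData md = (ResultSetMetaData) Proxy.newProxyInstance(
            ResultSetToKV.class.getClassLoader(),
            new Class<?>[]{ResultSetMetaData.class},
            (p, m, a) -> m.getName().equals("getColumnCount") ? rows[0].length : null);
        return (ResultSet) Proxy.newProxyInstance(
            ResultSetToKV.class.getClassLoader(),
            new Class<?>[]{ResultSet.class},
            (p, m, a) -> {
                switch (m.getName()) {
                    case "next":        return ++cursor[0] < rows.length;
                    case "getString":   return rows[cursor[0]][(int) a[0] - 1];
                    case "getMetaData": return md;
                    default:            return null;
                }
            });
    }

    public static void main(String[] args) {
        ResultSet rs = fakeResultSet(new String[][]{{"1", "alice", "30"}, {"2", "bob", "25"}});
        System.out.println(toKeyValue(rs)); // {1=alice,30, 2=bob,25}
    }
}
```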
6. The task-load-based data migration method according to claim 3, characterized in that the process of dividing the data and issuing the related data in step 4 is: start MapReduce, which performs the division and computation of the data; then Map allocates the tasks and issues them to each node in the cluster; finally Reduce performs the computation and writes the final result set to the destination address. When all data have been written to the big data platform, the migration of the relational database data to the big data platform is accomplished.
7. The task-load-based data migration method according to claim 6, characterized in that step 4 is implemented as follows:
During the division and issuing of the data, the map of MapReduce assigns the tasks to each node for processing; after division the data are distributed to the TaskTracker of each node, and the data are then written to the HBase table preset by the user for storage;
The detailed process by which Map allocates the tasks and issues them to each node in the cluster is: when a TaskTracker with a free Slot is detected, the system checks its I/O resource occupancy; if this node's I/O occupancy is the lowest among all nodes, and this node's task load entropy meets the preset threshold, the JobTracker automatically assigns the task to this TaskTracker; otherwise, if the node's I/O resources are found to be heavily occupied or its task load entropy value is relatively large, with many tasks waiting to be processed, the task is not assigned to this node.
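Stripped of Hadoop internals, the node-selection rule in this claim reduces to a three-way test: free Slot, minimum I/O occupancy, and acceptable load entropy. A minimal sketch under stated assumptions (the Node class and its fields are illustrative, and "entropy meets the threshold" is read here as "entropy no greater than the threshold"):

```java
import java.util.*;

/** Sketch of the claimed node-selection rule (illustrative names and fields). */
public class NodePicker {

    public static final class Node {
        final String name; final int freeSlots; final double ioOccupancy; final double loadEntropy;
        public Node(String name, int freeSlots, double ioOccupancy, double loadEntropy) {
            this.name = name; this.freeSlots = freeSlots;
            this.ioOccupancy = ioOccupancy; this.loadEntropy = loadEntropy;
        }
    }

    /** Returns the chosen node's name, or null if no node currently qualifies. */
    public static String pick(List<Node> nodes, double entropyThreshold) {
        double minIo = nodes.stream().mapToDouble(n -> n.ioOccupancy).min().orElse(Double.MAX_VALUE);
        for (Node n : nodes)
            if (n.freeSlots > 0 && n.ioOccupancy == minIo && n.loadEntropy <= entropyThreshold)
                return n.name;  // the JobTracker would assign the task here
        return null;            // every node is busy or overloaded: the task waits
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
            new Node("node1", 0, 0.2, 0.3),   // no free Slot
            new Node("node2", 2, 0.1, 0.3),   // free Slot, lowest I/O, entropy ok
            new Node("node3", 1, 0.5, 0.9));  // I/O too heavily occupied
        System.out.println(pick(cluster, 0.5)); // node2
    }
}
```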
8. The task-load-based data migration method according to claim 7, characterized in that the above task allocation is completed by a task scheduler, whose concrete operation process is:
First set the corresponding field in the system file mapred-site.xml to the task scheduler class org.apache.hadoop.mapred.TaskScheduler, so that it can be used in the subsequent task scheduling;
Then implement the task-load-based task scheduler by designing and writing the AssignTask method in the JobTracker; the TaskTrackerStatus field of this method contains the relevant information submitted by the TaskTracker in the heartbeat message Heartbeat, which includes the maximum number of MapSlots, the maximum number of ReduceSlots, the virtual memory maximum, the physical memory, the size of the remaining free disk space, the execution state of each task, the disk access status, the number of remaining Slots, the disk access speed, and the real-time status of the CPU;
During task allocation, the JobTracker uses the information sent by the TaskTracker in the heartbeat message Heartbeat to judge which node to assign the task to; the main parameters are the Slot usage, the disk access state, and the task blocking and waiting state.
9. The task-load-based data migration method according to claim 8, characterized in that, based on the above task scheduler, the concrete task scheduling is:
If all Slots in a node's TaskTracker are occupied, the task scheduler refuses to assign new tasks to that TaskTracker;
If there are still free Slots on the TaskTracker, a monitoring judgment must be made on the disk occupancy, the task waiting situation, and so on:
(a) if, according to the current node's load situation, the p_i value computed by the task load balance evaluation model is higher than the average value, no new task is assigned to this node;
(b) if, according to the current node's load situation, the p_i value computed by the task load balance evaluation model is below the average value, this node can accept new tasks.
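Cases (a) and (b) above amount to comparing a node's load value p_i with the cluster average. A minimal sketch, taking the p_i values as given inputs since this claim does not define the evaluation model that produces them (the class and method names are illustrative):

```java
import java.util.*;

/** Sketch of the accept/reject rule in claim 9: a node accepts a new task
 *  only while it still has a free Slot and its load value p_i is below the
 *  cluster-average load value. */
public class LoadGate {

    /** true iff node i may accept a new task. */
    public static boolean accepts(double[] p, int i, boolean hasFreeSlot) {
        if (!hasFreeSlot) return false;          // all Slots occupied: refuse
        double avg = Arrays.stream(p).average().orElse(0.0);
        return p[i] < avg;                       // below-average load: accept
    }

    public static void main(String[] args) {
        double[] p = {0.9, 0.4, 0.5};            // per-node load values p_i (avg 0.6)
        System.out.println(accepts(p, 0, true)); // false: 0.9 is above average
        System.out.println(accepts(p, 1, true)); // true:  0.4 is below average
        System.out.println(accepts(p, 1, false));// false: no free Slot
    }
}
```

This is the whole decision: the heavier machinery in claim 8 (heartbeats, TaskTrackerStatus) exists only to deliver fresh inputs for this comparison.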
10. The task-load-based data migration method according to claim 3, characterized in that the detailed process of writing the data to HBase in step 5 is: the TaskTracker of each node performs the computation, processes the task, and outputs the data, writing the key-value pairs it reads into the HBase table; the data migration process ends when all Map tasks in the system have finished; otherwise, if the data migration has not yet finished, tasks continue to be assigned to each TaskTracker for processing.
CN201610415905.4A 2016-06-14 2016-06-14 A task-load-based data migration method (Pending, CN106095940A)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610415905.4A 2016-06-14 2016-06-14 A task-load-based data migration method


Publications (1)

Publication Number Publication Date
CN106095940A 2016-11-09

Family

ID=57845413




Non-Patent Citations (2)

GENG YUSHUI et al.: "Cloud Data Migration Method Based on ABC Algorithm", International Journal of Database Theory and Application
LÜ Mingyu: "Design and Implementation of a Data Mining and Data Migration System under the Hadoop Architecture", China Master's Theses Full-text Database, Information Science and Technology



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2016-11-09)