CN114048186A - Data migration method and system based on mass data - Google Patents

Info

Publication number
CN114048186A
CN114048186A
Authority
CN
China
Prior art keywords
data
hotspot
edge
data warehouse
distributed data
Prior art date
Legal status
Withdrawn
Application number
CN202111209912.6A
Other languages
Chinese (zh)
Inventor
彭亮 (Peng Liang)
Current Assignee
Guizhou Anhe Shengda Enterprise Management Co ltd
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202111209912.6A
Publication of CN114048186A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/214 Database migration support
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention discloses a data migration method based on mass data, which comprises the following steps: the central cloud predicts a hotspot event and outputs the corresponding hotspot data set; the central cloud imports the hotspot data set into an HDFS storage engine; the HDFS storage engine migrates the hotspot data set into a distributed data warehouse; and the distributed data warehouse distributes the hotspot data set to an edge cloud cluster, so that the edge cloud cluster can provide the hotspot data set to user equipment.

Description

Data migration method and system based on mass data
Technical Field
The invention belongs to the technical field of information, and particularly relates to a data migration method and system based on mass data.
Background
Big data refers to data sets whose scale far exceeds the capabilities of traditional database software tools (such as MySQL, Oracle, and PostgreSQL) for acquisition, storage, and analysis. It has four characteristics: massive scale, rapid circulation, diverse types, and low value density. It is a massive, high-growth, diversified information asset that yields strong decision-making power, insight, and process-optimization capability only when handled with new processing modes.
In recent years, research on big-data storage, migration, and reading has entered a period of rapid development, but many technical problems remain. In the course of this research, the inventor found that the instantaneous traffic surge caused by a sudden hotspot event can lead to short-term congestion in big-data reading and migration, and that an effective early-warning mechanism to mitigate this phenomenon is currently lacking.
Disclosure of Invention
The invention provides a data migration method and system based on mass data, which solve the prior-art problem of data reading and transmission congestion caused by hotspot events in mass-data scenarios, effectively reduce congestion, and improve the user experience.
In order to achieve the above object, the present invention provides a mass data based data migration method, which is applied to a data migration system, wherein the system includes a central cloud, an HDFS storage engine, a distributed data warehouse, and an edge cloud cluster, and the method includes:
In a first period:
Step 1: the central cloud predicts a hotspot event and outputs the corresponding hotspot data set;
Step 2: the central cloud imports the hotspot data set into an HDFS storage engine;
Step 3: the HDFS storage engine migrates the hotspot data set into the distributed data warehouse;
Step 4: the distributed data warehouse distributes the hotspot data set to the edge cloud cluster, so that the edge cloud cluster provides the hotspot data set to user equipment.
In a second period:
steps 1 to 3 are repeated so that the distributed data warehouse distributes the hotspot data set of the second period to the edge cloud cluster, where the hotspot data set of the second period differs from that of the first period.
Optionally, if the distributed data warehouse stores the hotspot data set in advance, the migrating the hotspot data set to the distributed data warehouse by the HDFS storage engine includes:
the HDFS storage engine acquires the ID and the priority parameter of the hotspot data group;
and the HDFS storage engine adjusts the priority parameter of the hotspot data set and imports the adjusted priority parameter and the ID of the hotspot data set into the distributed data warehouse through an instruction, so that the distributed data warehouse looks up the hotspot data set by its ID and updates its priority parameter.
Optionally, if the distributed data warehouse is divided into a range partition and a hash partition, the importing the adjusted priority parameter and the ID of the hotspot data group into the distributed data warehouse through an instruction includes:
writing the ID of the hotspot data group into a range partition of the distributed data warehouse, and writing the adjusted priority parameter into a hash partition of the distributed data warehouse.
Optionally, if the distributed data warehouse does not store the hotspot data set in advance, the migration of the hotspot data set to the distributed data warehouse by the HDFS storage engine includes:
the distributed data warehouse is provided with a monitoring program, and the monitoring program is used for monitoring a file record table of the HDFS storage engine;
after monitoring that the hotspot data group is recorded in the file record table, the distributed data warehouse sends a data acquisition request to the HDFS storage engine;
and after receiving the data acquisition request, the HDFS storage engine packages the hot spot data group into a message queue and migrates the message queue to the distributed data warehouse.
Optionally, the migrating the message queue to the distributed data warehouse includes:
dividing the message queue into a plurality of sectors through a kafka connect tool, writing the plurality of sectors into the distributed data warehouse in parallel, and merging in the distributed data warehouse.
Optionally, the edge cloud cluster includes a plurality of edge clouds and an edge cloud manager, and after the edge cloud cluster provides the hotspot data set for the user equipment, the method further includes:
the edge cloud manager predicts KPI performance indexes of the edge clouds, and sets a load balancing strategy after predicting that the KPI performance index of a first edge cloud exceeds a first preset threshold;
the edge cloud manager sends the storage content of the first edge cloud to a temporary partition of the distributed data warehouse based on the load balancing strategy, wherein the storage content of the first edge cloud does not include the hotspot data group;
the distributed data warehouse migrates the storage content in the temporary partition to one or more edge clouds nearest to the first edge cloud, wherein a KPI performance index of the one or more edge clouds is lower than a second preset threshold, and the second preset threshold is smaller than the first preset threshold.
Optionally, the predicting, by the edge cloud manager, KPI performance indicators of the plurality of edge clouds includes:
predicting KPI performance indicators for the plurality of edge clouds using a Hidden Markov Model (HMM).
Optionally, the predicting the hotspot event by the central cloud, and outputting a corresponding hotspot data set, includes:
the central cloud predicts the hotspot events using a time series model ARIMA.
Optionally, the edge cloud cluster includes a plurality of edge clouds and a plurality of edge nodes, where one edge cloud corresponds to the plurality of edge nodes, and the providing, by the edge cloud cluster, the hotspot data set for the user equipment includes:
the plurality of edge clouds respectively distribute the hotspot data sets to the corresponding plurality of edge nodes;
and the plurality of edge nodes send the hotspot data groups to the user equipment.
The embodiment of the present invention further provides a system, which includes a memory and a processor, where the memory stores computer-executable instructions, and the processor implements the method when running the computer-executable instructions on the memory.
The method and the system of the embodiment of the invention have the following advantages:
In the embodiment of the invention, the hotspot event of a period is predicted to obtain the corresponding hotspot data set, this data is migrated in advance, and it is written to the edge cloud ahead of time, so that when user equipment requests the hotspot event, the edge cloud can respond quickly. This improves network transmission efficiency and greatly reduces data congestion caused by sudden hotspot events. In addition, because of the positioning of big-data storage, the HDFS storage engine suits write-once-read-many scenarios and does not support random modification of stored data; since hotspot events differ from period to period, so do their hotspot data sets, which greatly increases the need for random modification of data. The original HDFS storage engine alone is therefore unsuitable for this scenario, which is why the distributed data warehouse is introduced.
Drawings
FIG. 1 is a block diagram of a data migration system based on mass data in one embodiment;
FIG. 2 is a flow diagram of a method for mass data based data migration in one embodiment;
FIG. 3 is a logical view of inventory data migration in one embodiment;
FIG. 4 is a logical representation of incremental data migration in one embodiment;
FIG. 5 is a logic diagram of edge cloud data distribution in one embodiment;
FIG. 6 is a logic diagram of edge cloud load balancing in one embodiment;
FIG. 7 is a diagram illustrating the hardware components of the system in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a network architecture diagram according to an embodiment of the present invention, and as shown in fig. 1, the embodiment of the present invention includes a central cloud 11, an HDFS storage engine 12, a distributed data warehouse 13, an edge cloud cluster 14, and a user device 15, where the central cloud 11 is located at a core layer of a network, and is configured to acquire mass data, process and analyze the mass data, and control storage, migration, and distribution of various types of data.
The HDFS storage engine 12 (HDFS, the Hadoop Distributed File System) is the mainstream storage engine of the big-data framework Hadoop. It is an offline analysis tool suited to high-throughput, write-once-read-many scenarios, and it does not support random modification of files.
The distributed data warehouse 13 is distinguished from HDFS in that it supports random modification of files, so it is well suited to scenarios where data must be updated and modified at random. A data warehouse is a strategic collection that provides data support of all types for decision making at every level of an enterprise. Its main function is to systematically analyze and organize the large volumes of data accumulated over years by the information system's online transaction processing (OLTP), using the storage structures specific to data-warehouse theory, so as to facilitate analysis methods such as online analytical processing (OLAP) and data mining, and in turn to support the construction of decision support systems (DSS) and executive information systems (EIS), helping decision makers quickly and effectively extract valuable information from massive data and build business intelligence (BI).
The edge cloud cluster 14 is a cloud server cluster located at the edge layer, closer to users, and is a lightweight, miniaturized cloud service cluster. The edge cloud cluster 14 may further include an edge cloud manager 141 (not shown in the figure) and edge nodes 142 (not shown in the figure). The edge cloud manager is a functional device that acts both as an edge cloud server and as the device controlling routine operation and maintenance of the edge cloud, such as load balancing and data migration. An edge node sits one level below an edge cloud and is even closer to users: edge clouds are generally deployed at the city and district level, while edge nodes are deployed at the county and township level.
The user equipment 15 covers all devices that can be networked and have intelligent analysis and processing capabilities, such as mobile phones, tablet computers, PCs, VR and AR devices, industrial computers, smart cars, and Internet-of-Things devices. Such a device can exchange data with the edge cloud cluster over existing wireless communication protocols to acquire a variety of data.
To achieve the above object, as shown in fig. 2, the present invention provides a method for data migration based on mass data, which is applied to the network architecture shown in fig. 1, and the method includes the following steps:
during the first period T1, the following steps 1-4 are performed:
step 1: the central cloud predicts the hot event and outputs a corresponding hot data group;
the hot events are a group of events with short time period and high concurrence, such as major news, major topics and the like, and are characterized in that the search volume is high in a short time period, and keywords are concentrated. The hot spot events are classified into two types, one type is a fixed hot spot event, such as a high-concurrency strong-association hot spot event which occurs in fixed time of the type of "mid-autumn" or "moon cake", and the other type is non-fixed, such as sudden news, therefore, for the embodiment of the invention, an event sequence prediction model can be adopted for the fixed hot spot event to carry out hot spot data prediction.
Time series analysis accounts for the fact that data points collected over time may have an internal structure (such as autocorrelation, trend, or seasonal variation) that should be considered, and time series prediction uses a regression model to predict future values from previously observed ones. In the embodiment of the present invention, a time series model such as ARIMA (autoregressive integrated moving average) or VAR (vector autoregression) may be used to predict a hotspot event and obtain the hotspot data set corresponding to it. The hotspot data set is the data file of the hotspot event's content; specifically, it may be a binary data file, a video file, an audio file, and so on. For example, if the hotspot event is "Mid-Autumn Festival", the hotspot data set may include Mid-Autumn gala videos, moon-cake pictures, and the like; that is, the hotspot event and the hotspot data set are strongly associated. The ARIMA and VAR models are prior art and are not described here.
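As a rough illustration of step 1, the sketch below replaces a full ARIMA fit with a toy AR(1) least-squares forecast in plain Python; the function names, the threshold, and the sample search-volume series are all invented for this example. An event whose forecast search volume exceeds the threshold is flagged as a hotspot.

```python
# Toy stand-in for the hotspot-prediction step (step 1).
# A real system would fit ARIMA/VAR; here a hand-rolled AR(1)
# least-squares fit illustrates the forecast-then-threshold flow.

def ar1_forecast(series):
    """Fit x[t] = a*x[t-1] + b by least squares and forecast one step ahead."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var if var else 0.0
    b = my - a * mx
    return a * series[-1] + b

def predict_hotspot(event_volumes, threshold):
    """Return the events whose forecast search volume exceeds the threshold."""
    return [name for name, vol in event_volumes.items()
            if ar1_forecast(vol) > threshold]

volumes = {
    "mid-autumn": [10, 20, 40, 80, 160],   # rapidly growing searches
    "weather":    [30, 31, 29, 30, 30],    # flat baseline, no hotspot
}
print(predict_hotspot(volumes, threshold=100))
```

In a production system the forecast would come from a library implementation (e.g. statsmodels' ARIMA) rather than a hand-rolled AR(1); only the flow is shown here.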
Step 2: the central cloud imports the hotspot data set into an HDFS storage engine;
HDFS is one of the most prominent distributed storage systems used in Hadoop applications. An HDFS cluster consists mainly of one NameNode and many DataNodes: the NameNode manages the file system's metadata, while the DataNodes store the actual data. Hadoop (including HDFS) is well suited to distributed storage and computation on commodity hardware because it is fault tolerant and highly scalable. The Map-Reduce framework, known for its simplicity and availability in large distributed system applications, has been integrated into Hadoop.
After acquiring the hotspot data set, the central cloud imports it into the HDFS storage engine. However, because of the positioning of big-data storage, HDFS is suited to write-once-read-many scenarios and does not support random modification of files; random updates and deletions of data during migration are therefore not permitted.
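The write-once constraint can be made concrete with a toy append-only store; the class and its behaviour are illustrative, not the actual HDFS API. New files and reads succeed, but an in-place rewrite is rejected, which is exactly what makes per-period priority updates awkward on HDFS alone.

```python
# Toy append-only store mirroring the write-once-read-many constraint
# described above (illustrative only, not the HDFS client API).
class AppendOnlyStore:
    def __init__(self):
        self._files = {}

    def put(self, name, data):
        if name in self._files:
            # in-place modification is not supported, as in HDFS
            raise PermissionError("random modification not supported")
        self._files[name] = data

    def get(self, name):
        return self._files[name]

store = AppendOnlyStore()
store.put("hot-42", b"v1")      # first write succeeds
try:
    store.put("hot-42", b"v2")  # attempted in-place update
except PermissionError as e:
    print(e)
```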
And step 3: the HDFS storage engine migrates the hotspot data set into the distributed data warehouse;
for the above reasons, the embodiment of the present invention provides a distributed data warehouse, which can support batch storage and random read-write in a short time. Data storage, migration and distribution are performed using a distributed data warehouse to "replace" the HDFS.
For example, the original hotspot data group is only a data group with a lower priority in the data warehouse, and at this time, the priority of the hotspot data group needs to be increased, so that the time of migration of the hotspot data group is preferentially ensured in the transmission or migration process.
Specifically, if the distributed data warehouse stores the hotspot data group in advance, step 3 is actually inventory migration, and the method includes:
the HDFS storage engine acquires the ID and the priority parameter of the hotspot data group;
and the HDFS storage engine adjusts the priority parameter of the hot spot data group, and imports the adjusted priority parameter and the ID of the hot spot data group into the distributed data warehouse through an instruction, so that the distributed data warehouse searches and updates the priority parameter of the hot spot data group based on the ID and the priority parameter of the hot spot data group.
Specifically, the distributed data warehouse may be divided into range partitions and hash partitions. Range partitioning is the most widely applied partitioning mode; it uses a range of column values as the partition condition and has the advantages of easy horizontal expansion and high sequential-read throughput. The principle is that a record is stored in the range partition its column value falls into, so the base columns and range bounds must be specified at creation time. If the range of some records cannot be predicted in advance, a maxvalue partition can be created to hold every record outside the specified ranges, and multiple columns are supported as partition columns. Each partition stores records whose column values are less than the partition's bound, and for every partition except the first, the minimum value equals the bound of the preceding partition.
A hash partition applies a hash algorithm to the key, so the data can be distributed uniformly and the write speed is high. The consistent-hashing idea behind it assigns each node in the system a token in the range 0 to 2³² − 1; these tokens form a hash ring. When a read or write needs to locate a node, the hash value of the key is computed and the first token node encountered clockwise is chosen. Compared with node redundancy, this approach has the advantage that adding or deleting a node affects only its neighbors on the hash ring and has no effect on other nodes.
Therefore, as shown in fig. 3, based on the advantages of different partitions, the HDFS may write the ID of the hot data group into the range partition of the distributed data warehouse (the range partition has a fast reading speed), write the adjusted priority parameter into the hash partition of the distributed data warehouse (the hash partition has a fast writing speed), and finally merge the two.
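A minimal sketch of this stock-migration path, assuming an invented in-memory warehouse with a few range partitions for IDs and hash partitions for priorities; the bounds, partition counts, and the use of Python's built-in hash are all illustrative choices.

```python
# Sketch of the stock-migration path (step 3, data already in the
# warehouse): the hotspot group's ID goes to a range partition and
# the adjusted priority to a hash partition; the two are then merged.
import bisect

RANGE_BOUNDS = [1000, 2000, 3000]   # upper bound per range partition
N_HASH = 4                          # number of hash partitions

range_parts = [[] for _ in RANGE_BOUNDS]
hash_parts = [{} for _ in range(N_HASH)]

def write_id(group_id):
    """Route an ID to the range partition whose bound it falls under.
    IDs are assumed to fall under the last bound."""
    idx = bisect.bisect_left(RANGE_BOUNDS, group_id)
    range_parts[idx].append(group_id)

def write_priority(group_id, priority):
    """Route (id, priority) to a hash partition for fast parallel writes."""
    hash_parts[hash(group_id) % N_HASH][group_id] = priority

def merged_view():
    """Join the two partition families back into id -> priority."""
    ids = [g for part in range_parts for g in part]
    return {g: next(p[g] for p in hash_parts if g in p) for g in ids}

write_id(1501)
write_priority(1501, priority=9)    # hotspot group's priority raised
print(merged_view())
```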
If the distributed data warehouse does not store the hotspot data set in advance, step 3 is actually incremental migration, and the method includes:
a monitoring program is set in the distributed data warehouse and is used for monitoring a file record table of the HDFS storage engine; for example, a BinLog file of MySQL may be monitored, and when the BinLog file is changed, the monitoring program may know whether the data is modified or updated.
When it detects that the hotspot data set has been recorded in the file record table, the distributed data warehouse sends a data acquisition request to the HDFS storage engine; that is, whenever data is modified or updated, the distributed data warehouse synchronizes the data from the storage engine, so modifications and updates are reflected in the data warehouse in real time.
And after receiving the data acquisition request, the HDFS storage engine packages the hot spot data group into a message queue and migrates the message queue to the distributed data warehouse.
In the embodiment of the invention, a message queue tool is adopted to serve as a buffer pool of data so as to reduce the peak value brought by data synchronization.
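The incremental path above (monitor the record table, request the data, hand it over as a queue) can be sketched as follows; the record-table list, the polling function, and the byte payload are stand-ins for a real binlog and RPC.

```python
# Sketch of the incremental path: a monitor watches the engine's file
# record table; when a new hotspot group appears, it is packaged into
# a message queue (here a deque) and migrated into the warehouse.
from collections import deque

file_record_table = []      # stands in for e.g. a MySQL BinLog
warehouse = {}

def record(engine_store, group_id, payload):
    """Write into the storage engine; the change shows up in the table."""
    engine_store[group_id] = payload
    file_record_table.append(group_id)

def monitor_and_sync(engine_store, seen):
    """Poll the record table; fetch and migrate anything not yet synced."""
    queue = deque()
    for gid in file_record_table:
        if gid not in seen:
            queue.append((gid, engine_store[gid]))  # package into queue
            seen.add(gid)
    while queue:                                    # drain into warehouse
        gid, payload = queue.popleft()
        warehouse[gid] = payload
    return warehouse

engine = {}
record(engine, "hot-42", b"video-bytes")
print(monitor_and_sync(engine, seen=set()))
```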
The step of migrating the message queue to the distributed data warehouse may specifically be:
The message queue is divided into a plurality of sectors by a Kafka Connect tool; the sectors are written into the distributed data warehouse in parallel and merged there. Kafka Connect is a tool for reliably transmitting data between Kafka and other systems.
As shown in fig. 4, when incremental data needs to be synchronized, the distributed data warehouse observes the change through a listener, imports the incremental data into its own database in parallel through a synchronization mechanism, and finally merges it. This process fully exploits the distributed data warehouse's high parallel write speed: different sectors of data are transmitted to the database in parallel and finally combined and stored in one container space or class.
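The split-write-merge step can be sketched like this; the thread pool stands in for Kafka Connect sink tasks running in parallel, and the sector count is an arbitrary choice.

```python
# Sketch of the queue-to-warehouse step: the message queue is split
# into "sectors", written in parallel, and merged afterwards.
from concurrent.futures import ThreadPoolExecutor

def split_into_sectors(messages, n_sectors):
    """Round-robin the queue into n_sectors sub-queues."""
    return [messages[i::n_sectors] for i in range(n_sectors)]

def write_sector(sector):
    # in a real deployment this would be one Kafka Connect sink task
    return list(sector)

def migrate(messages, n_sectors=3):
    sectors = split_into_sectors(messages, n_sectors)
    with ThreadPoolExecutor(max_workers=n_sectors) as pool:
        written = list(pool.map(write_sector, sectors))  # parallel writes
    merged = [m for sector in written for m in sector]   # final merge
    return sorted(merged)       # restore global order after the merge

print(migrate(list(range(10))))
```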
And 4, step 4: the distributed data warehouse distributes the hotspot data set to the edge cloud cluster so that the edge cloud cluster provides the hotspot data set for user equipment;
after receiving the hotspot data set, the distributed data warehouse distributes the hotspot data set to each edge cloud server, so that when a user requests the hotspot data set in the first period, the edge cloud cluster can quickly respond to the user requirement and send the hotspot data set corresponding to a hotspot event (namely, content corresponding to the hotspot event) to each User Equipment (UE).
In this embodiment of the present invention, an edge cloud cluster includes a plurality of edge clouds and a plurality of edge nodes, where one edge cloud corresponds to a plurality of edge nodes, and the edge cloud cluster provides a hot data group for a user equipment, which may specifically be:
the plurality of edge clouds respectively distribute the hotspot data sets to a plurality of corresponding edge nodes;
and the plurality of edge nodes send the hotspot data set to the user equipment. As shown in fig. 5, a distributed data warehouse-an edge cloud-an edge node may form a tree structure, and the distributed data warehouse serves as a root node, so that a hotspot data set is distributed to the edge cloud, the edge cloud is distributed to the edge node, and finally the hotspot data set is sent to the UE, thereby forming an effective data distribution method.
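The tree-shaped fan-out (warehouse as root, edge clouds as interior nodes, edge nodes as leaves) can be sketched as follows; the topology and names are invented for the example.

```python
# Sketch of the fan-out in step 4: each edge cloud forwards the hotspot
# data set to its edge nodes, which hold the copy that UEs will fetch.
topology = {
    "edge-cloud-A": ["node-A1", "node-A2"],
    "edge-cloud-B": ["node-B1"],
}

def distribute(hot_group, topology):
    """Warehouse -> edge clouds -> edge nodes; returns per-node copies."""
    node_store = {}
    for cloud, nodes in topology.items():   # warehouse to edge clouds
        for node in nodes:                  # edge cloud to edge nodes
            node_store[node] = hot_group
    return node_store

print(distribute("hot-42", topology))
```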
In the second period,
steps 1 to 3 are repeated so that the distributed data warehouse distributes the hotspot data set of the second period to the edge cloud cluster, where the hotspot data set of the second period differs from that of the first period. In the second period, a new hotspot event is predicted by the central cloud; its data is acquired and imported into HDFS, from HDFS into the distributed data warehouse, and from the distributed data warehouse to the edge cloud cluster, completing one cycle. The third and fourth periods, and so on, proceed likewise.
In addition, improving data transmission efficiency requires load balancing that accounts for the load on each edge cloud. As shown in fig. 6, in the embodiment of the present invention, the edge cloud manager predicts the KPI (Key Performance Indicator) values of the multiple edge clouds, and sets a load balancing policy after predicting that the KPI of a first edge cloud exceeds a first preset threshold.
KPI performance indicators are a general term for a series of indicator parameters, including but not limited to resource occupancy, transmission efficiency, and storage space. In cloud computing operation and maintenance, KPIs are generally used to indicate whether a server's performance is good or poor; if any indicator exceeds its limit, for example storage usage passing an upper bound of 85%, the cloud server risks degraded storage response, and a timely warning is needed so that storage usage can be reduced.
KPI performance indicators can also be predicted. For example, the KPIs of multiple edge clouds may be predicted with a hidden Markov model (HMM), a powerful algorithm for time series. An HMM trains its parameters with an EM algorithm that maximizes the likelihood of the historical training data. For example, unsupervised learning can be applied to time series analysis, using temporal signatures to determine and predict KPI anomalies. This technique belongs to the prior art and is not described in detail here.
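The patent proposes an HMM for this step; purely to show the shape of the computation, the toy below uses a plain two-state Markov chain over normal/overloaded with invented transition probabilities, flagging clouds whose predicted overload probability crosses a threshold.

```python
# Toy stand-in for the KPI-prediction step: a two-state Markov chain
# (not a full HMM) over {normal, overloaded} edge-cloud states.
NORMAL, OVERLOADED = 0, 1
T = [
    [0.9, 0.1],   # P(next state | current = normal)
    [0.3, 0.7],   # P(next state | current = overloaded)
]

def p_overloaded_next(current_state):
    """Predicted probability the cloud is overloaded in the next step."""
    return T[current_state][OVERLOADED]

def flag_clouds(states, threshold=0.5):
    """Return clouds whose predicted overload probability exceeds threshold."""
    return [c for c, s in states.items() if p_overloaded_next(s) > threshold]

states = {"edge-cloud-A": OVERLOADED, "edge-cloud-B": NORMAL}
print(flag_clouds(states))
```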
The method comprises the steps that an edge cloud manager sends storage content of a first edge cloud to a temporary partition of a distributed data warehouse based on a load balancing strategy, wherein the storage content of the first edge cloud does not comprise a hotspot data group;
the idea of the load balancing strategy is to equally divide resources, maximize efficiency improvement, and if a KPI anomaly occurs in a first edge cloud (any one of the edge clouds), it is necessary to migrate non-important data (i.e., other data of a non-hotspot data group) in the current first edge cloud to reduce resource occupancy rate thereof, thereby achieving load balancing. Therefore, the edge cloud manager needs to migrate data in the first edge cloud, and in the migration process, a temporary partition can be created for temporarily transferring the data by taking the distributed data warehouse as a transfer station, and the temporary partition can be destroyed after the data transfer is finished.
And the distributed data warehouse migrates the storage content in the temporary partition to one or more edge clouds closest to the first edge cloud, wherein the KPI performance index of the one or more edge clouds is lower than a second preset threshold, and the second preset threshold is smaller than the first preset threshold.
The data is migrated to target edge clouds whose KPI indicators will not exceed the limit and remain below a lower threshold; therefore a second preset threshold can be set manually (for example, storage usage not exceeding 60%), and an edge cloud satisfying this condition has a low load and is suitable for receiving migrated redundant data backups.
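Putting the two thresholds together, here is a minimal sketch of the rebalancing flow; all KPI values, the thresholds, and the round-robin placement are illustrative, and the "nearest cloud" distance selection is omitted. Non-hotspot content of the overloaded cloud is staged in a temporary partition, then placed on clouds below the second threshold.

```python
# Sketch of the load-balancing flow: non-hotspot content of the
# overloaded cloud is staged in a temporary warehouse partition,
# then moved to clouds whose KPI is below the second threshold.
FIRST_THRESHOLD, SECOND_THRESHOLD = 0.85, 0.60

def rebalance(clouds, overloaded, hot_ids):
    # stage everything except the hotspot data set in a temp partition
    temp_partition = [d for d in clouds[overloaded]["data"]
                      if d not in hot_ids]
    clouds[overloaded]["data"] = [d for d in clouds[overloaded]["data"]
                                  if d in hot_ids]
    # candidate targets: other clouds under the second threshold
    targets = [c for c, info in clouds.items()
               if c != overloaded and info["kpi"] < SECOND_THRESHOLD]
    for i, item in enumerate(temp_partition):   # round-robin placement
        clouds[targets[i % len(targets)]]["data"].append(item)
    return clouds   # the temp partition is destroyed after the transfer

clouds = {
    "A": {"kpi": 0.90, "data": ["hot-42", "old-1", "old-2"]},  # over limit
    "B": {"kpi": 0.40, "data": []},                            # lightly loaded
}
print(rebalance(clouds, overloaded="A", hot_ids={"hot-42"}))
```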
The method of the embodiment of the invention has the following advantages:
In the embodiment of the invention, the hotspot event of a period is predicted to obtain the corresponding hotspot data set, this data is migrated in advance, and it is written to the edge cloud ahead of time, so that when user equipment requests the hotspot event, the edge cloud can respond quickly. This improves network transmission efficiency and greatly reduces data congestion caused by sudden hotspot events. In addition, because of the positioning of big-data storage, the HDFS storage engine suits write-once-read-many scenarios and does not support random modification of stored data; since hotspot events differ from period to period, so do their hotspot data sets, which greatly increases the need for random modification of data. The original HDFS storage engine alone is therefore unsuitable for this scenario, which is why the distributed data warehouse is introduced.
An embodiment of the present invention further provides a system comprising a memory and a processor, where the memory stores computer-executable instructions and the processor implements the foregoing method when executing those instructions.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon computer-executable instructions for performing the method in the foregoing embodiments.
FIG. 7 is a diagram illustrating the hardware components of the system in one embodiment. It will be appreciated that FIG. 7 shows only a simplified design of the system. In practical applications, the system may also include other necessary elements, including but not limited to any number of input/output systems, processors, controllers, and memories; any system that can implement the big data management method of the embodiments of the present application falls within the protection scope of the present application.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input system is for inputting data and/or signals and the output system is for outputting data and/or signals. The output system and the input system may be separate devices or may be an integral device.
The processor may include one or more processors, for example one or more central processing units (CPUs); a CPU may be single-core or multi-core. The processor may also include one or more special-purpose processors, such as GPUs or FPGAs, for accelerated processing.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable system. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).
The above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A data migration method based on mass data is characterized by being applied to a data migration system, wherein the system comprises a central cloud, an HDFS storage engine, a distributed data warehouse and an edge cloud cluster, and the method comprises the following steps:
in a first period:
Step 1: the central cloud predicts a hotspot event and outputs a corresponding hotspot data set;
Step 2: the central cloud imports the hotspot data set into an HDFS storage engine;
Step 3: the HDFS storage engine migrates the hotspot data set into the distributed data warehouse;
Step 4: the distributed data warehouse distributes the hotspot data set to the edge cloud cluster so that the edge cloud cluster provides the hotspot data set for user equipment;
in a second period:
repeating steps 1 to 3 so that the distributed data warehouse distributes the hotspot data set of the second period to the edge cloud cluster, wherein the hotspot data set of the second period is different from the hotspot data set of the first period.
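The per-period flow of claim 1 can be sketched as below; every class and method name is an illustrative assumption, not the claimed implementation:

```python
# Illustrative sketch of the claim-1 pipeline; all class/method names are assumed.

class CentralCloud:
    def predict_hotspot(self, period):
        # Step 1: predict the hotspot event and output its data set
        # (a stand-in for the ARIMA-based prediction of claim 8).
        return {"period": period, "items": [f"video-{period}"]}

class HDFSEngine:
    def __init__(self):
        self.files = []
    def ingest(self, data_set):            # Step 2: central cloud -> HDFS
        self.files.append(data_set)
        return data_set

class DataWarehouse:
    def __init__(self):
        self.store = {}
    def migrate_in(self, data_set):        # Step 3: HDFS -> distributed warehouse
        self.store[data_set["period"]] = data_set
    def distribute(self, period, edges):   # Step 4: warehouse -> edge clouds
        for e in edges:
            e.cache = self.store[period]

class Edge:
    cache = None
    def serve(self):                       # the edge cloud answers user equipment
        return self.cache

def run_period(period, central, hdfs, warehouse, edges):
    """One period: predict, ingest, migrate, distribute."""
    data_set = central.predict_hotspot(period)
    warehouse.migrate_in(hdfs.ingest(data_set))
    warehouse.distribute(period, edges)
```

Running `run_period` once per period reproduces the claim's point that consecutive periods push different hotspot data sets to the same edge cloud cluster.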
2. The method of claim 1, wherein if the distributed data warehouse pre-stores the hotspot data set, the HDFS storage engine migrating the hotspot data set to the distributed data warehouse, comprising:
the HDFS storage engine acquires the ID and the priority parameter of the hotspot data group;
The HDFS storage engine adjusts the priority parameter of the hotspot data group and imports the adjusted priority parameter together with the ID of the hotspot data group into the distributed data warehouse via an instruction, so that the distributed data warehouse locates the hotspot data group by its ID and updates its priority parameter.
3. The method of claim 2, wherein the distributed data warehouse is divided into a range partition and a hash partition, and the importing the adjusted priority parameter and the ID of the hotspot data group into the distributed data warehouse via an instruction comprises:
writing the ID of the hotspot data group into a range partition of the distributed data warehouse, and writing the adjusted priority parameter into a hash partition of the distributed data warehouse.
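A minimal sketch of the range/hash split in claims 2-3; the partition bounds, bucket count, and class name are invented for illustration:

```python
# Assumed layout: IDs land in range partitions, priorities in hash partitions.

RANGE_BOUNDS = [1000, 2000, 3000]  # upper bounds of the range partitions (invented)

class PartitionedWarehouse:
    def __init__(self):
        self.range_parts = [set() for _ in range(len(RANGE_BOUNDS) + 1)]
        self.hash_parts = [dict() for _ in range(4)]

    def upsert_priority(self, group_id, priority):
        """Write the group ID to a range partition and its priority to a hash partition."""
        # Range partitioning: the ID goes into the first bucket whose bound exceeds it.
        idx = next((i for i, b in enumerate(RANGE_BOUNDS) if group_id < b),
                   len(RANGE_BOUNDS))
        self.range_parts[idx].add(group_id)
        # Hash partitioning: bucket by hash of the ID, then update in place --
        # the "search and update" path of claim 2.
        self.hash_parts[hash(group_id) % 4][group_id] = priority

    def priority_of(self, group_id):
        return self.hash_parts[hash(group_id) % 4].get(group_id)
```

Splitting the key from the mutable priority value lets the priority be rewritten without touching the range-partitioned ID index, which matches the update-only semantics of claim 2.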
4. The method of claim 1, wherein if the distributed data warehouse does not store the hotspot data set in advance, the migrating the hotspot data set to the distributed data warehouse by the HDFS storage engine comprises:
the distributed data warehouse is provided with a monitoring program, and the monitoring program is used for monitoring a file record table of the HDFS storage engine;
after monitoring that the hotspot data group is recorded in the file record table, the distributed data warehouse sends a data acquisition request to the HDFS storage engine;
and after receiving the data acquisition request, the HDFS storage engine packages the hot spot data group into a message queue and migrates the message queue to the distributed data warehouse.
5. The method of claim 4, wherein said migrating said message queue into said distributed data warehouse comprises:
dividing the message queue into a plurality of partitions through a Kafka Connect tool, writing the plurality of partitions into the distributed data warehouse in parallel, and merging them in the distributed data warehouse.
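The split/parallel-write/merge step of claim 5 can be sketched with the standard library alone; a real deployment would use Kafka Connect, so the round-robin split and the `PartitionSink` class below are stand-ins, not the Kafka API:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

NUM_PARTS = 4

def split(queue, n):
    """Divide a message queue into n partitions (round-robin, order kept per partition)."""
    parts = [[] for _ in range(n)]
    for i, msg in enumerate(queue):
        parts[i % n].append(msg)
    return parts

class PartitionSink:
    """Stand-in for the distributed data warehouse's parallel-ingest side."""
    def __init__(self):
        self._lock = threading.Lock()
        self._staged = {}

    def write_partition(self, idx, part):
        with self._lock:                   # each partition lands independently
            self._staged[idx] = part

    def merge(self):
        """Interleave staged partitions back into one ordered record set."""
        merged = []
        parts = [list(self._staged[i]) for i in sorted(self._staged)]
        while any(parts):
            for p in parts:
                if p:
                    merged.append(p.pop(0))
        return merged

def migrate(queue, sink):
    parts = split(queue, NUM_PARTS)
    with ThreadPoolExecutor(max_workers=NUM_PARTS) as pool:
        for i, part in enumerate(parts):
            pool.submit(sink.write_partition, i, part)
    # The executor's context exit waits for all writes before merging.
    return sink.merge()
```

Because the split is deterministic round-robin, the merge can reconstruct the original queue order, illustrating why the claim merges inside the warehouse after the parallel writes.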
6. The method of claim 1, wherein the edge cloud cluster comprises a plurality of edge clouds and an edge cloud manager, and wherein after the edge cloud cluster provides the hotspot data set to a user device, the method further comprises:
the edge cloud manager predicts KPI performance indexes of the edge clouds, and sets a load balancing strategy after predicting that the KPI performance index of a first edge cloud exceeds a first preset threshold;
the edge cloud manager sends the storage content of the first edge cloud to a temporary partition of the distributed data warehouse based on the load balancing strategy, wherein the storage content of the first edge cloud does not include the hotspot data group;
the distributed data warehouse migrates the storage content in the temporary partition to one or more edge clouds nearest to the first edge cloud, wherein a KPI performance index of the one or more edge clouds is lower than a second preset threshold, and the second preset threshold is smaller than the first preset threshold.
7. The method of claim 6, wherein the edge cloud manager predicts KPI performance indicators for the plurality of edge clouds, comprising:
predicting KPI performance indicators for the plurality of edge clouds using a Hidden Markov Model (HMM).
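A toy version of the HMM-based prediction in claim 7, with invented transition and emission probabilities; a real deployment would fit these from historical KPI traces (for example with a library such as hmmlearn):

```python
# Two-state HMM over bucketed KPI readings; all probabilities are illustrative.

STATES = ["normal", "overloaded"]
START = {"normal": 0.8, "overloaded": 0.2}
TRANS = {"normal": {"normal": 0.9, "overloaded": 0.1},
         "overloaded": {"normal": 0.3, "overloaded": 0.7}}
# Observations: each KPI reading bucketed into a "low" or "high" load symbol.
EMIT = {"normal": {"low": 0.85, "high": 0.15},
        "overloaded": {"low": 0.2, "high": 0.8}}

def viterbi(observations):
    """Most likely hidden-state sequence for a sequence of KPI readings."""
    probs = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    paths = {s: [s] for s in STATES}
    for obs in observations[1:]:
        new_probs, new_paths = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: probs[p] * TRANS[p][s])
            new_probs[s] = probs[prev] * TRANS[prev][s] * EMIT[s][obs]
            new_paths[s] = paths[prev] + [s]
        probs, paths = new_probs, new_paths
    best = max(STATES, key=lambda s: probs[s])
    return paths[best]

def predict_exceeds_threshold(observations):
    """Flag an edge cloud when its inferred current state is 'overloaded'."""
    return viterbi(observations)[-1] == "overloaded"
```

The edge cloud manager would run this per edge cloud and trigger the load-balancing strategy of claim 6 for any cloud whose inferred state crosses the first preset threshold.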
8. The method of claim 1, wherein the central cloud predicts hotspot events and outputs corresponding hotspot data sets, comprising:
the central cloud predicts the hotspot events using a time series model ARIMA.
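Claim 8's ARIMA prediction would typically use a statistics library; as a minimal pure-Python stand-in, an AR(1) one-step forecast (ARIMA(1,0,0), with no differencing or moving-average terms) already shows the shape of the prediction step. The request-count histories below are invented:

```python
def ar1_forecast(series):
    """One-step forecast from an AR(1) model fit by least squares.
    A simplified stand-in for the ARIMA model of claim 8."""
    x = series[:-1]          # lagged values
    y = series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    if var == 0:
        return float(series[-1])         # flat history: forecast the last value
    phi = cov / var                      # autoregressive coefficient
    c = my - phi * mx                    # intercept
    return c + phi * series[-1]

def predict_hotspots(request_counts, top_k=1):
    """Return the topics with the highest forecast request volume."""
    forecasts = {topic: ar1_forecast(hist) for topic, hist in request_counts.items()}
    return sorted(forecasts, key=forecasts.get, reverse=True)[:top_k]
```

The topics that `predict_hotspots` returns would correspond to the hotspot events whose data sets the central cloud pushes into the HDFS storage engine in Step 2.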
9. The method of claim 1, wherein the edge cloud cluster comprises a plurality of edge clouds and a plurality of edge nodes, and wherein if an edge cloud corresponds to a plurality of edge nodes, the edge cloud cluster provides the hotspot data set for a user device, comprising:
the plurality of edge clouds respectively distribute the hotspot data sets to the corresponding plurality of edge nodes;
and the plurality of edge nodes send the hotspot data groups to the user equipment.
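The two-level fan-out of claim 9 can be sketched as follows; the class names are illustrative:

```python
# Sketch of claim 9's fan-out: each edge cloud pushes the hotspot data set to
# all of its edge nodes, so any node can answer user equipment directly.

class EdgeNode:
    def __init__(self):
        self.cache = None
    def send_to_user(self):
        return self.cache

class EdgeCloud:
    def __init__(self, nodes):
        self.nodes = nodes
    def distribute(self, hotspot_set):
        for node in self.nodes:
            node.cache = hotspot_set

def fan_out(clouds, hotspot_set):
    """Warehouse-side entry point: push one hotspot data set to every cloud's nodes."""
    for cloud in clouds:
        cloud.distribute(hotspot_set)
```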
10. A mass data based data migration system comprising a memory having stored thereon computer-executable instructions and a processor that, when executing the computer-executable instructions on the memory, implements the method of any of claims 1 to 9.
CN202111209912.6A 2021-10-18 2021-10-18 Data migration method and system based on mass data Withdrawn CN114048186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111209912.6A CN114048186A (en) 2021-10-18 2021-10-18 Data migration method and system based on mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111209912.6A CN114048186A (en) 2021-10-18 2021-10-18 Data migration method and system based on mass data

Publications (1)

Publication Number Publication Date
CN114048186A true CN114048186A (en) 2022-02-15

Family

ID=80205321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111209912.6A Withdrawn CN114048186A (en) 2021-10-18 2021-10-18 Data migration method and system based on mass data

Country Status (1)

Country Link
CN (1) CN114048186A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022313A (en) * 2022-04-19 2022-09-06 湖南宝马文化传播有限公司 Data migration method and system under cloud architecture
US11792262B1 (en) 2022-07-20 2023-10-17 The Toronto-Dominion Bank System and method for data movement

Similar Documents

Publication Publication Date Title
US10459898B2 (en) Configurable-capacity time-series tables
US9740706B2 (en) Management of intermediate data spills during the shuffle phase of a map-reduce job
US8886781B2 (en) Load balancing in cluster storage systems
JP5765416B2 (en) Distributed storage system and method
US10922316B2 (en) Using computing resources to perform database queries according to a dynamically determined query size
CN109726174A (en) Data archiving method, system, equipment and storage medium
US11137926B1 (en) Systems and methods for automatic storage tiering
CN114048186A (en) Data migration method and system based on mass data
US20170344546A1 (en) Code dispersion hash table-based map-reduce system and method
CN103139302A (en) Real-time copy scheduling method considering load balancing
CN107169009B (en) Data splitting method and device of distributed storage system
CN103631894A (en) Dynamic copy management method based on HDFS
CN112947860B (en) Hierarchical storage and scheduling method for distributed data copies
CN105827678B (en) Communication means and node under a kind of framework based on High Availabitity
CN106130960A (en) Judgement system, load dispatching method and the device of steal-number behavior
Salehian et al. Comparison of spark resource managers and distributed file systems
Irie et al. A novel automated tiered storage architecture for achieving both cost saving and qoe
Fazul et al. Improving data availability in HDFS through replica balancing
CN112000703A (en) Data warehousing processing method and device, computer equipment and storage medium
CN111858656A (en) Static data query method and device based on distributed architecture
CN106549983B (en) Database access method, terminal and server
Tatarnikova et al. Algorithms for placing files in tiered storage using Kohonen map
US11537616B1 (en) Predicting query performance for prioritizing query execution
CN113760822A (en) HDFS-based distributed intelligent campus file management system optimization method and device
CN114153395A (en) Object storage data life cycle management method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220315

Address after: 561000 room 17011, unit 1, building C, Jianbo International Plaza, No. 188, Huangguoshu street, Huaxi street, Xixiu District, Anshun City, Guizhou Province

Applicant after: Guizhou Anhe Shengda Enterprise Management Co.,Ltd.

Address before: 518129 Bantian shangpinya garden, Longgang District, Shenzhen City, Guangdong Province

Applicant before: Peng Liang

WW01 Invention patent application withdrawn after publication

Application publication date: 20220215
