CN112883026A - Data processing method and device

Info

Publication number: CN112883026A
Application number: CN202110121834.8A
Authority: CN
Inventors: 戚永峰
Original assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Current assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Priority date: 2021-01-28
Filing date: 2021-01-28
Publication date: 2021-06-01

Abstract

The invention discloses a data processing method and a data processing device. The method comprises the following steps: acquiring the use frequency of each piece of data in a first distributed file system; determining data whose use frequency is lower than a set frequency as cold data; synchronizing the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server; and, after synchronization is complete, deleting the cold data from the first distributed file system. The invention reduces cost by storing data with low use frequency separately.

Description

Data processing method and device

Technical Field

The present invention relates to the field of data processing, and in particular, to a data processing method and apparatus.

Background

Many companies have their own big data departments, which generate large amounts of data each day, often growing by a few TB to tens of TB. The volume of data keeps growing over time. IT infrastructure costs rise accordingly, and growth in cluster size and data size also degrades system stability and efficiency.

At present, the big data architecture of most enterprises is built on Hadoop, a relatively mature and stable technology. Although this architecture uses PCs that are less expensive than professional data servers, it requires multiple copies of each piece of data; in the most common case, 2 copies are allocated to each piece of data. As the amount of data and traffic increases, IT costs increase as well.

A similar technical scheme compresses, archives and stores cold data using a time-series database, where cold data refers to historical data with relatively low use frequency. However, this approach has the following disadvantages:

1. Archiving requires writing a special application program that calls the development interface (API) of the time-series database to write data, which increases development cost.

2. A new technical component is introduced, which increases the learning and operation and maintenance costs of the big data department.

3. When the archived data needs to be used (for example, queried), the way users are accustomed to working has to change.

4. The archived data is difficult to make compatible with the original management and access applications.

No effective solution has yet been proposed for the problem of high management cost of big data storage in the related art.

Disclosure of Invention

The main object of the invention is to provide a data processing method and a data processing device that solve the problem of high management cost of big data storage.

In order to achieve the above object, according to an aspect of the present invention, there is provided a data processing method including: acquiring the use frequency of each data in the first distributed file system; determining data with the use frequency lower than the set frequency as cold data; synchronizing the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server; after synchronization is complete, the cold data is deleted from the first distributed file system.

Further, prior to synchronizing the cold data into the second distributed file system, the method further comprises: creating an external table in a data warehouse corresponding to the first distributed file system; setting a storage location of the external table to be a directory name of the second distributed file system; and repairing the partition of the table in the data warehouse according to the storage location of the external table.

Further, before obtaining the frequency of use of each data in the first distributed file system, the method further includes: dividing a storage space of the network attached storage into a plurality of volumes; mounting the volumes to at least one preset server; installing the second distributed file system in each preset server, wherein the second distributed file system comprises a management node and a plurality of data nodes; configuring a data storage location in the data node as a mount location for the plurality of volumes.

Further, partitioning the network attached storage space into a plurality of volumes comprises: dividing a preset network storage space into a plurality of volumes with the same size; or dividing the preset network storage space into a plurality of volumes with the size difference not exceeding the preset threshold value.

Further, synchronizing the cold data into a second distributed file system comprises: synchronizing the cold data to the second distributed file system through a distributed copy command provided by the distributed file system.

Further, the acquiring the use frequency of each data in the first distributed file system includes: the use frequency of each data in the first distributed file system is periodically acquired.

Further, the determining the data with the use frequency lower than the set frequency as the cold data includes: determining data with the use frequency lower than the set frequency in the current period as cold data; or, determining the data with the frequency lower than the set frequency in the current period and one or more periods continuous before the current period as cold data.

In order to achieve the above object, according to another aspect of the present invention, there is also provided a data processing apparatus comprising: the acquisition unit is used for acquiring the use frequency of each data in the first distributed file system; a determination unit for determining data having a frequency of use lower than a set frequency as cold data; the synchronization unit is used for synchronizing the cold data to a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server; a deletion unit to delete the cold data from the first distributed file system after synchronization is completed.

In order to achieve the above object, according to another aspect of the present invention, there is also provided a computer-readable storage medium characterized by comprising a stored program, wherein the data processing method of the present invention is executed when the program is executed by a processor.

In order to achieve the above object, according to another aspect of the present invention, there is also provided a processor configured to run a program, wherein the data processing method according to the present invention is executed when the program runs.

The invention acquires the use frequency of each piece of data in a first distributed file system; determines data whose use frequency is lower than a set frequency as cold data; synchronizes the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server; and, after the synchronization is completed, deletes the cold data from the first distributed file system. This solves the problem of high management cost of big data storage and achieves the effect of reducing cost by storing data with low use frequency separately.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow diagram of a data processing method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of the deployment structure of the solution according to the present embodiment;

FIG. 3 is a flowchart of installation, data synchronization and use according to the present embodiment;

FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment of the invention provides a data processing method.

Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

step S102: acquiring the use frequency of each data in the first distributed file system;

step S104: determining data with the use frequency lower than the set frequency as cold data;

step S106: synchronizing the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server;

step S108: after synchronization is complete, the cold data is deleted from the first distributed file system.

This embodiment acquires the use frequency of each piece of data in the first distributed file system; determines data whose use frequency is lower than the set frequency as cold data; synchronizes the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server; and, after the synchronization is completed, deletes the cold data from the first distributed file system. This solves the problem of high management cost of big data storage and achieves the effect of reducing cost by storing data with low use frequency separately.
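Steps S102 to S108 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: two Python dictionaries stand in for the first and second distributed file systems, and the function name `archive_cold_data`, its parameters and the content check performed before deletion are assumptions introduced here.

```python
from typing import Dict, List


def archive_cold_data(
    production: Dict[str, bytes],
    archive: Dict[str, bytes],
    use_frequency: Dict[str, int],
    set_frequency: int,
) -> List[str]:
    """Run steps S102-S108 on two in-memory stand-ins for file systems."""
    # S102: the use frequency of each piece of data is supplied by the caller.
    # S104: data used less often than the set frequency is cold data.
    cold_paths = sorted(
        path for path in production
        if use_frequency.get(path, 0) < set_frequency
    )
    # S106: synchronize the cold data into the second (archive) file system.
    for path in cold_paths:
        archive[path] = production[path]
    # S108: delete from the first file system only after every copy matches.
    if all(archive.get(path) == production[path] for path in cold_paths):
        for path in cold_paths:
            del production[path]
    return cold_paths
```

Deleting only after the copies are confirmed mirrors the ordering of S106 and S108: the source is never removed before the archive holds the data.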

The technical scheme of this embodiment can be applied to the Hadoop Distributed File System (HDFS): the first distributed file system can be a production HDFS, and the second distributed file system can be an archive HDFS. A production HDFS typically holds a large amount of data with low use frequency. This data is identified and archived separately, where a lower number of storage replicas can be set. Because the network attached storage (NAS) on which the second HDFS stores its data costs less than server storage, storing cold data in the NAS reduces cost. The set value of the use frequency can be changed according to the requirements of different application scenarios. After synchronization, the synchronized data is deleted from the first HDFS; synchronizing cold data to the archive HDFS reduces the amount of data stored in the production HDFS and therefore reduces storage overhead and cost.

Optionally, before synchronizing the cold data into the second HDFS, creating an external table in a data repository (HIVE database) corresponding to the first distributed file system, wherein the HIVE database corresponds to the first HDFS; setting a storage location of the external table to a directory name of the second HDFS; and repairing the partition of the HIVE table according to the storage position of the external table.

The directory name of the second HDFS can be a directory structure name, and the HIVE database is the production HIVE database. An external table is created in that database and its storage location is set to the directory name of the second HDFS, so that data in the second HDFS can be read from the environment of the first HDFS. The partitions of the HIVE table are then repaired according to the storage location of the external table, so that the partition information is complete and newly added content is registered in the table promptly.
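The external-table step can be illustrated by generating the two HiveQL statements involved. This is a sketch under stated assumptions: the table name, column list, partition column and archive URI in the example are hypothetical, and the helper `external_table_statements` is introduced here only for illustration. `CREATE EXTERNAL TABLE ... LOCATION` and `MSCK REPAIR TABLE` are standard Hive statements; the latter registers partition directories found under the table location, provided they follow the `key=value` directory naming that Hive expects.

```python
from typing import List


def external_table_statements(
    table: str, columns: str, partition: str, archive_dir: str
) -> List[str]:
    """Build HiveQL for an external table whose data lives on the archive HDFS."""
    create = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({columns}) "
        f"PARTITIONED BY ({partition}) "
        f"LOCATION '{archive_dir}'"
    )
    # Repairing the table adds partitions that exist as directories under
    # LOCATION but are not yet recorded in the metastore.
    repair = f"MSCK REPAIR TABLE {table}"
    return [create, repair]


# Hypothetical names: logs_archive, msg, dt and archive-nn are placeholders.
for statement in external_table_statements(
    "logs_archive", "msg STRING", "dt STRING",
    "hdfs://archive-nn:8020/warehouse/logs",
):
    print(statement)
```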

Optionally, before obtaining the use frequency of each data in the first HDFS, dividing the NAS storage space into a plurality of volumes; mounting a plurality of volumes into at least one preset server; installing a second HDFS service in each preset server, wherein the second HDFS service comprises a management node and a plurality of data nodes; the data storage locations in the data nodes are configured as mount locations for the plurality of volumes.

The NAS storage space is divided into a plurality of volumes. The volumes may differ in size, but are preferably the same or nearly the same size. Each volume is mounted to a preset LINUX server host; one volume may be mounted per host, or several volumes may be mounted to one host. An archive HDFS service is installed on each LINUX server, a management node and a plurality of data nodes are arranged in the archive HDFS service, and the data storage locations in the data nodes are configured as the mount locations of the volumes, which completes the construction of the service.

Optionally, dividing the NAS storage space into a plurality of volumes comprises: dividing a preset network storage space into a plurality of volumes with the same size; or, dividing the preset network storage space into a plurality of volumes with the size difference not exceeding a preset threshold value.

When the volumes are the same size or differ only slightly, every volume can be used to the fullest. This avoids the situation where one volume is used up while another is less than half used, and reduces wasted resources.
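The two division rules can be shown with a short arithmetic sketch. The functions `split_volumes` and `within_threshold`, and the use of whole gigabytes as the unit, are assumptions made for illustration; the source does not specify how the sizes are computed.

```python
from typing import List


def split_volumes(total_gb: int, count: int) -> List[int]:
    """Split total_gb into `count` volumes whose sizes differ by at most 1 GB."""
    if count <= 0:
        raise ValueError("count must be positive")
    base, remainder = divmod(total_gb, count)
    # The first `remainder` volumes each take one extra GB.
    return [base + 1 if i < remainder else base for i in range(count)]


def within_threshold(sizes: List[int], threshold_gb: int) -> bool:
    """Check that the largest size difference does not exceed the threshold."""
    return max(sizes) - min(sizes) <= threshold_gb
```

When the total divides evenly, all volumes come out the same size (the first rule); otherwise the sizes differ by one unit, which satisfies the second rule for any threshold of at least one unit.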

Optionally, synchronizing the cold data into the second HDFS comprises: synchronizing the cold data into the second distributed file system through the distributed copy command of the distributed file system; that is, the cold data is synchronized into the second HDFS with the distcp command of HDFS.

Optionally, the obtaining of the usage frequency of each data in the first distributed file system includes: the use frequency of each data in the first distributed file system is periodically acquired.

The use frequency of each piece of data may be acquired and determined periodically, for example once a week or once a month. In some scenarios the period may be shortened to once a day; the period can be adjusted to suit the specific application scenario.
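A periodic frequency count might look like the following sketch, which uses a weekly period. The access-log format (a list of date and path pairs) and the function name `weekly_use_frequency` are assumptions; the source does not say how accesses are recorded.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def weekly_use_frequency(access_log: Iterable[Tuple[date, str]]) -> Counter:
    """Count accesses per ((ISO year, ISO week), path)."""
    counts: Counter = Counter()
    for day, path in access_log:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        period = (iso[0], iso[1])
        counts[(period, path)] += 1
    return counts
```

Switching to a monthly or daily period only changes how `period` is derived from the date, for example `(day.year, day.month)` for months.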

Optionally, determining the data with the use frequency lower than the set frequency as the cold data includes: determining data with the use frequency lower than the set frequency in the current period as cold data; or, determining the data with the frequency lower than the set frequency in the current period and one or more periods continuous before the current period as cold data.

Besides computing the frequency over a single period, the use frequency of the data can be computed over several consecutive periods taken as a whole. This gives a rolling frequency calculation and makes the determination of cold data more accurate.
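The single-period and multi-period rules can be sketched with one function. The interpretation used here is an assumption: the frequency over several consecutive periods is taken as the average count per period over the window ending at the current period. Another reading, requiring every period in the window to be below the set frequency, would be equally compatible with the text. The name `is_cold` and the handling of insufficient history are also illustrative choices.

```python
from typing import Sequence


def is_cold(period_counts: Sequence[int], set_frequency: float, periods: int = 1) -> bool:
    """Decide whether data is cold from its per-period use counts.

    period_counts is ordered oldest first; the last entry is the current period.
    With periods=1 only the current period is considered.
    """
    if periods < 1 or len(period_counts) < periods:
        return False  # not enough history to decide; keep the data in place
    window = period_counts[-periods:]
    # Rolling frequency: average uses per period across the window.
    return sum(window) / periods < set_frequency
```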

This embodiment provides a simple, easy-to-use method and device for low-cost archiving of cold big data, which reduces the IT cost of the big data department and improves the operating efficiency and stability of the system. Cold data refers to historical data with relatively low use frequency.

Implementation of this embodiment requires hardware plus software coordination.

The technologies involved in this embodiment include HDFS, HIVE, NAS storage, LINUX mounting, HIVE external tables and the like. Experiments show that the scheme can be implemented successfully on LINUX.

On the hardware side, the scheme comprises relatively low-cost NAS storage and several ordinary LINUX server hosts.

On the software side, the scheme comprises an independent HDFS service and a reused HIVE service.

The method comprises the following specific steps:

FIG. 2 is a schematic diagram of the deployment structure of this embodiment. The NAS storage is divided into several volumes of equal size, and these volumes are mounted (mount) to the several ordinary LINUX server hosts mentioned above.

A separate HDFS service is installed on these LINUX server hosts, comprising a namenode (management node) and datanodes (data nodes), and the datanode data storage directories are configured as the directories where the NAS volumes are mounted. There is usually one management node and a plurality of data nodes.
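As an illustration of the datanode configuration, the sketch below renders an `hdfs-site.xml` property entry for `dfs.datanode.data.dir`, the HDFS setting that takes a comma-separated list of local storage directories. The mount points `/mnt/nas/vol1` and `/mnt/nas/vol2` and both helper functions are hypothetical; the source does not give concrete paths.

```python
from typing import List


def datanode_data_dir(mount_points: List[str]) -> str:
    """Join NAS mount points into the comma-separated dfs.datanode.data.dir value."""
    return ",".join(mount_points)


def hdfs_site_property(name: str, value: str) -> str:
    """Render one <property> element in the hdfs-site.xml layout."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


# Hypothetical mount points for two NAS volumes on one host.
value = datanode_data_dir(["/mnt/nas/vol1", "/mnt/nas/vol2"])
print(hdfs_site_property("dfs.datanode.data.dir", value))
```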

Data stored in the production HDFS that needs to be archived (namely, cold data) is synchronized to the archive HDFS using the distcp command that ships with HDFS. After synchronization is complete, the source data on the production HDFS is deleted to release space.
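The copy-then-delete sequence can be sketched as two command invocations, with the delete issued only if the copy succeeds. The function `sync_then_delete`, the injected `run` callable and the URIs in the usage note are assumptions made for illustration. The commands themselves, `hadoop distcp <source> <target>` and `hdfs dfs -rm -r <path>`, are standard Hadoop CLI forms.

```python
from typing import Callable, List


def sync_then_delete(
    source_uri: str,
    target_uri: str,
    run: Callable[[List[str]], int],
) -> bool:
    """Copy source_uri to target_uri with DistCp, then delete the source.

    `run` executes an argument list and returns its exit code.
    Returns True only if both the copy and the delete succeed.
    """
    copy_cmd = ["hadoop", "distcp", source_uri, target_uri]
    if run(copy_cmd) != 0:
        return False  # copy failed: leave the production data untouched
    delete_cmd = ["hdfs", "dfs", "-rm", "-r", source_uri]
    return run(delete_cmd) == 0
```

On a real cluster, `run` could be `lambda cmd: subprocess.run(cmd).returncode`, with hypothetical URIs such as `hdfs://prod-nn:8020/data/x` and `hdfs://archive-nn:8020/data/x`. If the HDFS trash feature is enabled, deleted files move to the trash first, so space is released only once the trash is emptied.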

An external table is created in the production HIVE database, the storage location (location) of the table is set to the directory name on the archive HDFS, and the partitions of the HIVE table are then repaired.

Principle of reducing costs and providing ease of use:

FIG. 3 is a flowchart of installation, data synchronization and use in this embodiment. NAS storage has a lower cost per unit of storage than production environment servers. In addition, because the archive cluster stores cold data, whose use frequency is very low, a lower number of storage replicas can be set, which reduces the cost further.
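The two cost levers, cheaper storage per unit and fewer replicas, multiply together, as the following arithmetic sketch shows. All figures in the example are hypothetical placeholders chosen only to make the calculation concrete; the source gives no prices or replica counts for the archive cluster.

```python
def raw_storage_cost(logical_tb: int, replicas: int, cost_per_tb: int) -> int:
    """Cost of storing logical_tb of data with the given replica count."""
    return logical_tb * replicas * cost_per_tb


# Hypothetical figures: 100 TB of cold data, 3 replicas at 100 cost units per TB
# on production servers versus 2 replicas at 40 cost units per TB on NAS.
production_cost = raw_storage_cost(100, replicas=3, cost_per_tb=100)
archive_cost = raw_storage_cost(100, replicas=2, cost_per_tb=40)
print(production_cost, archive_cost)
```

With these placeholder numbers, the same cold data costs 30000 units on the production cluster and 8000 units on the archive cluster.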

Data archiving is implemented with distcp, the native HDFS synchronization program. Compared with developing a special program that calls a special API, this is easier to use and incurs no extra development or learning cost.

The archived data can be accessed directly in HIVE in the original production environment, through the same access entry and in the same way as before, so using it incurs no additional development or learning cost.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

The embodiment of the invention provides a data processing device, which can be used for executing the data processing method of the embodiment of the invention.

Fig. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus including:

an obtaining unit 10, configured to obtain the use frequency of each data in the first distributed file system;

a determination unit 20 for determining data having a frequency of use lower than a set frequency as cold data;

a synchronization unit 30, configured to synchronize the cold data to a second distributed file system, where the second distributed file system is disposed in a network attached storage preset in a second server, and the first distributed file system is disposed in a preset server;

and the deleting unit 40 is used for deleting the cold data from the first distributed file system after the synchronization is completed.

In this embodiment, the obtaining unit 10 obtains the use frequency of each data in the first distributed file system; the determination unit 20 determines data having a use frequency lower than a set frequency as cold data; the synchronization unit 30 synchronizes the cold data to a second distributed file system, where the second distributed file system is disposed in a network attached storage preset in a second server, and the first distributed file system is disposed in a preset server; and the deleting unit 40 deletes the cold data from the first distributed file system after the synchronization is completed. This solves the problem of high management cost of big data storage and achieves the effect of reducing cost by separately storing data with low use frequency.

The data processing device comprises a processor and a memory, the acquisition unit, the determination unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the cost reduction from separately storing data with low use frequency is achieved by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the data processing method when executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the data processing method is executed when the program runs.

The embodiment of the invention provides equipment, which comprises at least one processor, at least one memory and a bus, wherein the memory and the bus are connected with the processor; the processor and the memory complete mutual communication through a bus; the processor is used for calling the program instructions in the memory so as to execute the data processing method. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring the use frequency of each data in the first HDFS; determining data with the use frequency lower than the set frequency as cold data; synchronizing the cold data into the second HDFS; and, after the synchronization is completed, deleting the cold data from the first HDFS.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A data processing method, comprising:

acquiring the use frequency of each data in the first distributed file system;

determining data with the use frequency lower than the set frequency as cold data;

synchronizing the cold data into a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server;

after synchronization is complete, the cold data is deleted from the first distributed file system.

2. The method of claim 1, wherein prior to synchronizing the cold data into a second distributed file system, the method further comprises:

creating an external table in a data warehouse corresponding to the first distributed file system;

setting a storage location of the external table to be a directory name of the second distributed file system;

repairing the partition of the table in the data warehouse according to the storage location of the external table.

3. The method of claim 1, wherein prior to obtaining the frequency of use of the respective data in the first distributed file system, the method further comprises:

dividing a storage space of the network attached storage into a plurality of volumes;

mounting the volumes to at least one preset server;

installing the second distributed file system in each preset server, wherein the second distributed file system comprises a management node and a plurality of data nodes;

configuring a data storage location in the data node as a mount location for the plurality of volumes.

4. The method of claim 3, wherein partitioning the network attached storage space into a plurality of volumes comprises:

dividing a preset network storage space into a plurality of volumes with the same size; or dividing the preset network storage space into a plurality of volumes with the size difference not exceeding the preset threshold value.

5. The method of claim 1, wherein synchronizing the cold data into a second distributed file system comprises:

synchronizing the cold data to the second distributed file system through a distributed copy command provided by the distributed file system.

6. The method according to claim 1, wherein the obtaining of the usage frequency of each data in the first distributed file system comprises:

the use frequency of each data in the first distributed file system is periodically acquired.

7. The method of claim 6, wherein determining the data with the usage frequency lower than the set frequency as cold data comprises:

determining data with the use frequency lower than the set frequency in the current period as cold data;

or, determining the data with the frequency lower than the set frequency in the current period and one or more periods continuous before the current period as cold data.

8. A data processing apparatus, comprising:

the acquisition unit is used for acquiring the use frequency of each data in the first distributed file system;

a determination unit for determining data having a frequency of use lower than a set frequency as cold data;

the synchronization unit is used for synchronizing the cold data to a second distributed file system, wherein the second distributed file system is arranged in a network attached storage preset in a second server, and the first distributed file system is arranged in a preset server;

a deletion unit to delete the cold data from the first distributed file system after synchronization is completed.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the data processing method of any one of claims 1 to 7 is performed when the program is executed by a processor.

10. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the data processing method according to any one of claims 1 to 7 when running.