CN109101514A - Data import method and device

Info

Publication number: CN109101514A
Application number: CN201710476399.4A
Authority: CN
Inventors: 汤卫群
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-06-21
Filing date: 2017-06-21
Publication date: 2018-12-28

Abstract

The invention discloses a data import method and device. Before data from a second cluster is imported into a first cluster, the first cluster first deletes, from its currently stored historical data, the first type of data, that is, data that does not need to be accumulated by day. It then stores the data to be imported obtained from the second cluster. Because the first type of data does not need to be accumulated by day and may not change over a period of time, deleting it does not affect subsequent data processing. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.

Description

Data import method and device

Technical field

The present invention relates to the field of computer technology, and more particularly to a data import method and device.

Background

Hadoop is a distributed system infrastructure that lets users develop distributed programs without knowing the underlying details of the distribution, making full use of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It also provides high-throughput access to application data, which makes it suitable for applications with very large data sets.

In practice, a big-data Hadoop cluster requires substantial resources, for example tens or even hundreds of servers. A test environment, however, often has far fewer servers, possibly only a few. To test and develop applications in the test environment, the data generated in the actual production cluster must be imported into the test environment. If only the few servers of the test environment are used to compute and store the production cluster's data, resources run short.

Summary of the invention

In view of the above problem, the present invention provides a data import method and device to solve the technical problem of insufficient resources when a test cluster is used to compute and store data.

In a first aspect, the application provides a data import method applied to a first cluster, the method comprising:

searching the historical data currently stored in the first cluster for a first type of data, namely data that does not need to be accumulated by day;

deleting the first type of data from the first cluster;

obtaining data to be imported from a second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

storing the data to be imported in the first cluster.

Optionally, deleting the first type of data from the first cluster comprises:

deleting the first type of data in the first cluster according to a first predetermined period.

Optionally, the method further comprises:

deleting all data stored in the first cluster whose time difference from the current time exceeds a first preset duration.

Optionally, obtaining the data to be imported from the second cluster comprises:

obtaining the data to be imported in the second cluster according to a second predetermined period.

Optionally, obtaining the data to be imported from the second cluster comprises:

receiving the data to be imported that the second cluster sends according to the second predetermined period.

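In a second aspect, the application provides a data import device applied to a first cluster, the device comprising:

a searching unit, configured to search the historical data currently stored in the first cluster for a first type of data, namely data that does not need to be accumulated by day;

a first deletion unit, configured to delete the first type of data from the first cluster;

an acquiring unit, configured to obtain data to be imported from a second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

a storage unit, configured to store the data to be imported in the first cluster.

Optionally, the first deletion unit is specifically configured to:

delete the first type of data in the first cluster according to a first predetermined period.

Optionally, the device further comprises:

a second deletion unit, configured to delete all data stored in the first cluster whose time difference from the current time exceeds a first preset duration.

Optionally, the acquiring unit is specifically configured to:

obtain the data to be imported in the second cluster according to a second predetermined period.

Optionally, the acquiring unit is specifically configured to:

receive the data to be imported that the second cluster sends according to the second predetermined period.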

In the data import method provided by the embodiments of the application, the resources of the first cluster (for example, a test cluster) are fewer than the resources of the second cluster (for example, an actual production cluster), and data in the second cluster needs to be imported into the first cluster. Because the first cluster has few resources, the volume of files it can store is also small. Therefore, before the data of the second cluster is imported, the first cluster first deletes, from its currently stored historical data, the first type of data, which does not need to be accumulated by day, and then stores the data to be imported obtained from the second cluster. Because the first type of data does not need to be accumulated by day and may not change over a period of time, deleting it does not affect subsequent data processing. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flow diagram of a data import method according to an embodiment of the application;

Fig. 2 shows a flow diagram of another data import method according to an embodiment of the application;

Fig. 3 shows a flow diagram of yet another data import method according to an embodiment of the application;

Fig. 4 shows a block diagram of a data import device according to an embodiment of the application;

Fig. 5 shows a block diagram of another data import device according to an embodiment of the application.

Detailed description

The resources of the first cluster are fewer than the resources of the second cluster; for example, the first cluster is a test cluster and the second cluster is an actual production cluster. To test and develop applications in the test environment, the data generated in the production cluster needs to be imported into the test cluster. Because the data volume in the production cluster is large and the test cluster has few resources, the test cluster cannot compute and store the data imported from the production cluster. To solve this technical problem, the application provides a data import method in which, before the data of the second cluster is imported, the first cluster first deletes, from its currently stored historical data, the first type of data, which does not need to be accumulated by day, and then stores the data to be imported obtained from the second cluster. Because the first type of data does not need to be accumulated by day and may not change over a period of time, deleting it does not affect subsequent data processing. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

Fig. 1 shows a flow diagram of a data import method according to an embodiment of the application. The method is applied to the first cluster and is used to import data in the second cluster into the first cluster, where the resources of the first cluster are fewer than the resources of the second cluster. For example, the first cluster is a test cluster and the second cluster is a production cluster. As shown in Fig. 1, the method may include the following steps:

S110, search the historical data currently stored in the first cluster for the first type of data, which does not need to be accumulated by day.

The first cluster checks whether its currently stored historical data contains the first type of data. The first type of data is data that does not need to be accumulated by day, that is, data that is independent of the date. For example, the cumulative number of installations of an application (APP) is date-independent data.

If the first cluster stores the first type of data, that data is found.

S120, delete the first type of data from the first cluster.

The first type of data does not need to be accumulated by day and may not change over a period of time. The data about to be imported from the second cluster is therefore likely to contain the first type of data currently stored in the first cluster. If that data is not deleted from the first cluster, the first cluster will hold duplicate data once the data from the second cluster has been imported. Duplicate data also occupies storage space, which makes the first cluster's shortage of storage resources worse. Deleting the first type of data in the first cluster therefore saves storage space in the first cluster.

S130, obtain the data to be imported from the second cluster, where the resources of the first cluster are fewer than the resources of the second cluster.

The second cluster is a production cluster. The data generated in the production cluster needs to be imported into the first cluster so that applications can be tested and developed in the first cluster.

S140, store the data to be imported in the first cluster.

The first cluster stores the data to be imported from the second cluster in its own storage.

In the data import method provided by this embodiment, before the data of the second cluster is imported into the first cluster, the first cluster first deletes, from its currently stored historical data, the first type of data, which does not need to be accumulated by day, and then stores the data to be imported obtained from the second cluster. Because the first type of data does not need to be accumulated by day and may not change over a period of time, deleting it does not affect subsequent data processing. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.
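Steps S110 to S140 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Record` and `Cluster` classes, the `accumulate_by_day` flag used to mark the first type of data, and the sample records are all assumptions made for illustration, and a real cluster would hold this data in HDFS rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    key: str                 # hypothetical record identifier
    value: int
    accumulate_by_day: bool  # False marks the first type of data (date-independent)


@dataclass
class Cluster:
    records: list[Record] = field(default_factory=list)


def import_data(first: Cluster, second: Cluster) -> None:
    # S110: search the first cluster's history for the first type of data.
    first_type = [r for r in first.records if not r.accumulate_by_day]
    # S120: delete the data that was found.
    first.records = [r for r in first.records if r not in first_type]
    # S130: obtain the data to be imported from the second cluster.
    to_import = list(second.records)
    # S140: store the data to be imported in the first cluster.
    first.records.extend(to_import)


test_cluster = Cluster([
    Record("installs_total", 100, accumulate_by_day=False),
    Record("visits_2017-06-20", 7, accumulate_by_day=True),
])
prod_cluster = Cluster([
    Record("installs_total", 120, accumulate_by_day=False),
    Record("visits_2017-06-21", 9, accumulate_by_day=True),
])

import_data(test_cluster, prod_cluster)
print([r.key for r in test_cluster.records])
# ['visits_2017-06-20', 'installs_total', 'visits_2017-06-21']
```

In this sketch the stale cumulative-install record is removed before the import, so the first cluster ends up with a single, current copy of it, while the per-day records are kept.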

Fig. 2 shows a flow diagram of another data import method according to an embodiment of the application. As shown in Fig. 2, on the basis of the embodiment shown in Fig. 1, the method further includes the following steps:

S210, search for all data stored in the first cluster whose time difference from the current time exceeds the first preset duration.

The first cluster has few resources, and the volume of files it can store is correspondingly small, so it cannot retain the data imported from the second cluster for all time periods.

The first cluster determines whether its currently stored historical data contains data for which the time difference between the import time and the current time exceeds the first preset duration. If it does, S220 is executed; if it does not, no data has been stored in the first cluster for too long.

The first preset duration can be determined according to the storage space of the first cluster: if the storage space is small, the first preset duration can be set shorter; if the storage space is large, it can be set longer.

In one possible implementation of the application, the first preset duration can be set to 7 days.

S220, delete all the data found.

The first cluster deletes all currently stored historical data that is older than the first preset duration. For example, if the first preset duration is 7 days, all data stored in the first cluster for more than 7 days can be deleted, so that the first cluster stores only the data imported in the last 7 days.

In another embodiment of the application, S210 and S220 may be executed before S110; the application does not limit this.

In the data import method provided by this embodiment, the first cluster can periodically delete imported data that is older than the first preset duration, which reduces the storage space that junk data occupies in the first cluster and improves the resource utilization of the first cluster.
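The age-based cleanup of S210 and S220 can be sketched as follows. This is only an illustration under stated assumptions: the `delete_expired` function, the mapping from batch names to import times, and the sample dates are hypothetical, and the 7-day default mirrors the example duration given above.

```python
from datetime import datetime, timedelta


def delete_expired(
    imported_at: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # first preset duration (example value)
) -> dict[str, datetime]:
    """S210/S220: drop every entry whose age (now - import time) exceeds max_age."""
    return {name: ts for name, ts in imported_at.items() if now - ts <= max_age}


now = datetime(2017, 6, 21)
stored = {
    "batch_06-10": datetime(2017, 6, 10),  # 11 days old: deleted
    "batch_06-14": datetime(2017, 6, 14),  # exactly 7 days old: kept
    "batch_06-20": datetime(2017, 6, 20),  # 1 day old: kept
}

kept = delete_expired(stored, now)
print(sorted(kept))  # ['batch_06-14', 'batch_06-20']
```

Passing `now` explicitly rather than reading the system clock keeps the sketch deterministic. Data whose age equals the preset duration is kept, because the text only calls for deleting data whose time difference exceeds it.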

Fig. 3 shows a flow chart of yet another data import method according to an embodiment of the application. The method includes the following steps:

S310, search the historical data currently stored in the first cluster for the first type of data, which does not need to be accumulated by day.

The first cluster may search its currently stored historical data for the first type of data periodically, or it may search for the first type of data each time it has received data imported from the second cluster.

S320, delete the first type of data in the first cluster according to the first predetermined period.

The first cluster periodically deletes the first type of data that it stores. The first predetermined period can be determined according to the storage space of the first cluster: the length of the period is positively correlated with the storage space, that is, the larger the storage space, the longer the period can be, and the smaller the storage space, the shorter the period can be.
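The text states only that the first predetermined period grows with the storage space; it gives no formula. One way to picture such a rule is the sketch below, in which the linear scaling, the function name `first_period_hours`, and every constant are assumptions chosen purely for illustration.

```python
def first_period_hours(
    capacity_gb: float,
    hours_per_100_gb: float = 6.0,  # assumed scaling factor
    min_hours: float = 1.0,         # assumed lower bound
    max_hours: float = 168.0,       # assumed upper bound (one week)
) -> float:
    """Return a deletion period that increases with storage capacity, clamped to a range."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    hours = capacity_gb / 100.0 * hours_per_100_gb
    return max(min_hours, min(max_hours, hours))


for capacity in (50, 500, 10_000):
    print(capacity, first_period_hours(capacity))
# 50 3.0
# 500 30.0
# 10000 168.0
```

Any monotonically non-decreasing mapping would satisfy the positive correlation described above; the clamp simply prevents the period from becoming impractically short or long.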

S330, obtain the data to be imported in the second cluster according to the second predetermined period.

In one possible implementation of the application, the first cluster periodically and actively obtains the data to be imported from the second cluster; this period is the second predetermined period. The second predetermined period can be determined according to when data is generated in the production cluster. For example, the data is obtained as the data to be imported only after all the production-cluster data required for testing has been generated.

In another possible implementation of the application, the first cluster passively receives the data to be imported that the second cluster obtains according to the second predetermined period. That is, the second cluster obtains, according to the second predetermined period, the data that needs to be imported into the first cluster and sends it to the first cluster, and the first cluster stores the data directly on receiving it.
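The two implementations of S330 differ in which cluster initiates the transfer. The sketch below contrasts them with in-memory stand-ins; the `SecondCluster` and `FirstCluster` classes and their method names (`fetch`, `subscribe`, `tick`, `pull_from`, `receive`) are hypothetical, and the timer that would fire once per second predetermined period is replaced by explicit calls.

```python
from typing import Callable

Batch = list[str]


class SecondCluster:
    """Stand-in for the production cluster (hypothetical interface)."""

    def __init__(self) -> None:
        self._ready: Batch = []
        self._subscribers: list[Callable[[Batch], None]] = []

    def produce(self, items: Batch) -> None:
        self._ready.extend(items)

    def fetch(self) -> Batch:
        """Hand over everything generated so far and clear the buffer."""
        batch, self._ready = self._ready, []
        return batch

    def subscribe(self, callback: Callable[[Batch], None]) -> None:
        self._subscribers.append(callback)

    def tick(self) -> None:
        """Push mode: the second cluster sends its data once per period."""
        batch = self.fetch()
        for callback in self._subscribers:
            callback(batch)


class FirstCluster:
    """Stand-in for the test cluster (hypothetical interface)."""

    def __init__(self) -> None:
        self.stored: Batch = []

    def pull_from(self, source: SecondCluster) -> None:
        """Pull mode: the first cluster actively obtains the data once per period."""
        self.stored.extend(source.fetch())

    def receive(self, batch: Batch) -> None:
        """Push mode: store whatever the second cluster sends."""
        self.stored.extend(batch)


prod_cluster = SecondCluster()
test_cluster = FirstCluster()

prod_cluster.produce(["a", "b"])
test_cluster.pull_from(prod_cluster)       # active acquisition

prod_cluster.subscribe(test_cluster.receive)
prod_cluster.produce(["c"])
prod_cluster.tick()                        # passive reception

print(test_cluster.stored)  # ['a', 'b', 'c']
```

In pull mode the schedule lives in the first cluster; in push mode it lives in the second cluster and the first cluster only needs a receiving entry point.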

S340, store the data to be imported in the first cluster.

After obtaining the data to be imported from the second cluster, the first cluster stores it in its own storage space.

In the data import method provided by this embodiment, the first cluster first deletes, from its currently stored historical data, the first type of data, which does not need to be accumulated by day, and then stores the data to be imported obtained from the second cluster. The first type of data in the historical data stored by the first cluster is deleted periodically. Because this data does not need to be accumulated by day and may not change over a period of time, deleting it does not affect subsequent data processing. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.

Corresponding to the above method embodiments, the application also provides data import device embodiments.

Fig. 4 shows a block diagram of a data import device according to an embodiment of the application. The device is applied to the first cluster. As shown in Fig. 4, the device may include a searching unit 110, a first deletion unit 120, an acquiring unit 130, and a storage unit 140.

The searching unit 110 is configured to search the historical data currently stored in the first cluster for the first type of data, which does not need to be accumulated by day.

The first type of data is data that does not need to be accumulated by day, that is, data that is independent of the date. For example, the cumulative number of installations of an application (APP) is date-independent data. If the first cluster stores the first type of data, that data is found.

The first deletion unit 120 is configured to delete the first type of data from the first cluster.

In one embodiment of the application, the first deletion unit deletes the first type of data in the first cluster according to the first predetermined period.

The first predetermined period can be determined according to the storage space of the first cluster: the length of the period is positively correlated with the storage space, that is, the larger the storage space, the longer the period can be, and the smaller the storage space, the shorter the period can be.

The acquiring unit 130 is configured to obtain the data to be imported from the second cluster, where the resources of the first cluster are fewer than the resources of the second cluster.

In one embodiment of the application, the first cluster actively obtains the data to be imported from the second cluster according to the second predetermined period.

In another embodiment of the application, the first cluster passively receives the data to be imported that the second cluster obtains according to the second predetermined period. That is, the second cluster obtains, according to the second predetermined period, the data that needs to be imported into the first cluster and sends it to the first cluster, and the first cluster stores the data directly on receiving it.

The second predetermined period can be determined according to when data is generated in the production cluster. For example, the data is obtained as the data to be imported only after all the production-cluster data required for testing has been generated.

The storage unit 140 is configured to store the data to be imported in the first cluster.

With the data import device provided by this embodiment, before the data of the second cluster is imported into the first cluster, the first cluster first deletes, from its currently stored historical data, the first type of data, which does not need to be accumulated by day, and then stores the data to be imported obtained from the second cluster. The first type of data does not need to be accumulated by day and may not change over a period of time. Moreover, the data subsequently imported from the second cluster may contain this data again. Deleting the first type of data from the first cluster therefore prevents the import from the second cluster from producing duplicates of that data, and saves storage space in the first cluster.

Fig. 5 shows a block diagram of another data import device according to an embodiment of the application. On the basis of the embodiment shown in Fig. 4, the device further includes:

a second deletion unit 210, configured to delete all data stored in the first cluster whose time difference from the current time exceeds the first preset duration.

The first preset duration can be determined according to the storage space of the first cluster: if the storage space is small, the first preset duration can be set shorter; if the storage space is large, it can be set longer. For example, the first preset duration can be set to 7 days, in which case the first cluster stores only the data imported in the last 7 days.

With the data import device provided by this embodiment, the first cluster can periodically delete imported data that is older than the first preset duration, which reduces the storage space that junk data occupies in the first cluster and improves the resource utilization of the first cluster.
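The relationship between the devices of Fig. 4 and Fig. 5 can be pictured as a composition of small units, with the second deletion unit as an optional extra. The sketch below is an assumption-laden illustration only: the `DataImportDevice` class, its field names, the dictionary row layout, and the order in which the units run are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Rows = list[dict]  # hypothetical row layout: {"key": str, "by_day": bool, "age_days": int}


@dataclass
class DataImportDevice:
    search: Callable[[Rows], Rows]                          # searching unit 110
    acquire: Callable[[], Rows]                             # acquiring unit 130
    second_delete: Optional[Callable[[Rows], Rows]] = None  # second deletion unit 210 (Fig. 5 only)
    store: Rows = field(default_factory=list)               # data held by the first cluster

    def run(self) -> None:
        if self.second_delete is not None:
            self.store = self.second_delete(self.store)
        found = self.search(self.store)
        self.store = [r for r in self.store if r not in found]  # first deletion unit 120
        self.store.extend(self.acquire())                        # storage unit 140


device = DataImportDevice(
    search=lambda rows: [r for r in rows if not r["by_day"]],
    acquire=lambda: [{"key": "installs_total", "by_day": False, "age_days": 0}],
    second_delete=lambda rows: [r for r in rows if r["age_days"] <= 7],
    store=[
        {"key": "installs_total", "by_day": False, "age_days": 1},
        {"key": "visits_old", "by_day": True, "age_days": 9},
        {"key": "visits_new", "by_day": True, "age_days": 2},
    ],
)

device.run()
print([r["key"] for r in device.store])  # ['visits_new', 'installs_total']
```

Leaving `second_delete` as `None` gives the four-unit device of Fig. 4; supplying it gives the five-unit device of Fig. 5 without changing the other units.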

The data import device includes a processor and a memory. The searching unit, first deletion unit, acquiring unit, storage unit, second deletion unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided. The storage space required by the data imported into the first cluster is reduced by adjusting kernel parameters.

The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

An embodiment of the invention provides a storage medium on which a program is stored; when the program is executed by a processor, the data import method is implemented.

An embodiment of the invention provides a processor configured to run a program, wherein the data import method is executed when the program runs.

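An embodiment of the invention provides a device comprising a processor, a memory, and a program stored in the memory and runnable on the processor. When the processor executes the program, the following steps are performed:

searching the historical data currently stored in the first cluster for the first type of data, which does not need to be accumulated by day;

deleting the first type of data from the first cluster;

obtaining data to be imported from the second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

storing the data to be imported in the first cluster.

In one embodiment of the application, deleting the first type of data from the first cluster comprises:

deleting the first type of data in the first cluster according to the first predetermined period.

In another embodiment of the application, the method further comprises:

deleting all data stored in the first cluster whose time difference from the current time exceeds the first preset duration.

In another embodiment of the application, obtaining the data to be imported from the second cluster comprises:

obtaining the data to be imported in the second cluster according to the second predetermined period, or receiving the data to be imported that the second cluster sends according to the second predetermined period.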

The device here may be a server, a PC, a PAD, a mobile phone, and so on.

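The application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:

searching the historical data currently stored in the first cluster for the first type of data, which does not need to be accumulated by day;

deleting the first type of data from the first cluster;

obtaining data to be imported from the second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

storing the data to be imported in the first cluster.

In another embodiment of the application, the method further comprises:

deleting all data stored in the first cluster whose time difference from the current time exceeds the first preset duration.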

Those skilled in the art should understand that embodiments of the application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.

The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The above are only embodiments of the application and are not intended to limit it. Those skilled in the art may make various modifications and changes to the application. Any modification, equivalent replacement, improvement, and so on made within the spirit and principles of the application shall fall within the scope of its claims.

Claims

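1. A data import method, applied to a first cluster, characterized in that the method comprises:

searching the historical data currently stored in the first cluster for a first type of data, namely data that does not need to be accumulated by day;

deleting the first type of data from the first cluster;

obtaining data to be imported from a second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

storing the data to be imported in the first cluster.

2. The method according to claim 1, characterized in that deleting the first type of data from the first cluster comprises:

deleting the first type of data in the first cluster according to a first predetermined period.

3. The method according to claim 1, characterized in that the method further comprises:

deleting all data stored in the first cluster whose time difference from the current time exceeds a first preset duration.

4. The method according to any one of claims 1-3, characterized in that obtaining the data to be imported from the second cluster comprises:

obtaining the data to be imported in the second cluster according to a second predetermined period.

5. The method according to any one of claims 1-3, characterized in that obtaining the data to be imported from the second cluster comprises:

receiving the data to be imported that the second cluster sends according to a second predetermined period.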

6. A data import device, applied to a first cluster, characterized in that the device comprises:

a searching unit, configured to search the historical data currently stored in the first cluster for a first type of data, namely data that does not need to be accumulated by day;

a first deletion unit, configured to delete the first type of data from the first cluster;

an acquiring unit, configured to obtain data to be imported from a second cluster, wherein the resources of the first cluster are fewer than the resources of the second cluster;

a storage unit, configured to store the data to be imported in the first cluster.

7. The device according to claim 6, characterized in that the first deletion unit is specifically configured to:

delete the first type of data in the first cluster according to a first predetermined period.

8. The device according to claim 6, characterized in that the device further comprises:

a second deletion unit, configured to delete all data stored in the first cluster whose time difference from the current time exceeds a first preset duration.

9. The device according to any one of claims 6-8, characterized in that the acquiring unit is specifically configured to:

obtain the data to be imported in the second cluster according to a second predetermined period.

10. The device according to any one of claims 6-8, characterized in that the acquiring unit is specifically configured to:

receive the data to be imported that the second cluster sends according to a second predetermined period.