CN112365244A

CN112365244A - Data life cycle management method and device

Info

Publication number: CN112365244A
Application number: CN202011359475.1A
Authority: CN
Inventors: 周统汉; 覃娆; 孙朝辉; 崖飞虎
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-11-27
Filing date: 2020-11-27
Publication date: 2021-02-12
Anticipated expiration: 2040-11-27
Also published as: CN112365244B

Abstract

The embodiments of the application provide a data life cycle management method and device. The method comprises the following steps: a first device calculates the data active duration of a data table based on the access condition of the data table; when the creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table; and the first device performs data life cycle management on the data table according to the target processing strategy. In this way, manual intervention is reduced, the processing efficiency of data life cycle management is improved, the safety of data archiving or data cleaning is ensured, and the cost pressure of enterprise storage is reduced.

Description

Data life cycle management method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a data life cycle management method and device.

Background

In the field of financial technology (Fintech), the size of data has expanded dramatically with the rapid development of business. Usually, a large number of data tables such as temporary tables and invalid intermediate tables are stored in a database/data cluster, and the data tables are stored in a high-end storage device in a centralized manner, which occupies a large amount of storage resources and calculation resources, and thus, the investment cost caused by capacity expansion is continuously increased. In order to reasonably configure storage resources and improve the use efficiency of system resources, data lifecycle management needs to be performed.

For any data table, the performance, availability and storage requirements of the data differ from stage to stage over its life cycle. Generally, in the initial stage after the data is created, the data is used frequently, and high-speed storage is required to ensure high availability. In the middle stage, the importance of the data gradually decreases and it is used less often, so the data should be stored in tiers that provide appropriate availability and storage space, thereby reducing the management cost and resource overhead of the data. In the late stage, most of the data is no longer used; it is cleaned and then either archived, so that it remains available when occasionally needed, or deleted.

However, developers often formulate the processing strategy for data life cycle management before the data goes online, based on their own understanding of the data: they register a retention period for the data and then archive or destroy the data according to that period. The retention period is therefore prone to subjective evaluation errors, which creates a risk of mistaken deletion or redundant storage in data life cycle management.

Disclosure of Invention

The embodiment of the application provides a data life cycle management method and device, which can accurately recommend the data activity duration of a data table, automatically recommend specific data life cycle management, intelligently realize the data life cycle management process of the data table, ensure the safety performance of data and reduce the cost pressure of enterprise data storage.

In a first aspect, an embodiment of the present application provides a data lifecycle management method.

The method comprises the following steps: a first device calculates the data active duration of a data table based on the access condition of the data table; when the creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table; and the first device performs data life cycle management on the data table according to the target processing strategy.

With the method provided by the first aspect, the first device can calculate the data active duration of the data table from the access condition of the data table, so that the data active duration is recommended accurately. When the creation duration of the data table is longer than its data active duration, the first device may determine that data life cycle management needs to be performed on the data table. The first device can determine a target processing strategy based on the type of the data table, and thus accurately evaluate whether the data table needs data archiving or data cleaning. The first device can therefore perform data life cycle management on the data table automatically according to the target processing strategy, so that manual intervention is reduced, the processing efficiency of data life cycle management is improved, the safety of data archiving or data cleaning is ensured, and the cost pressure of enterprise storage is reduced.

In one possible design, the first device calculates the data activity duration of a data table based on the access condition of the data table, and the method includes:

the first device obtains the creation time of the data table, and the creation time and the latest access time of each partition table in the data table; when the creation duration of the data table is longer than a preset duration, the first device determines the differences between the creation duration of each partition table and its latest access time as a difference sequence; the first device performs density clustering on the difference sequence to obtain a plurality of core points, wherein each core point represents the access duration of partition tables of the same type in the data table; the first device calculates the maximum boundary point corresponding to the maximum core point among the plurality of core points, wherein the maximum boundary point represents the maximum access duration of all types of partition tables in the data table; and the first device determines the maximum boundary point as the data active duration.

In this way, the first device can perform density clustering on the difference sequence formed from the difference between the creation time and the latest access time of each partition table in the data table. The clustering eliminates noise points caused by users occasionally and temporarily accessing some data, and yields a plurality of core points that follow the pattern of access frequency from the initial stage through the middle stage to the late stage of the data, where each core point represents the access duration of partition tables of the same type in the data table. The first device selects the maximum core point from the plurality of core points. Because the maximum boundary point corresponding to the maximum core point represents the maximum access duration of all types of partition tables in the data table, it can be determined as the data active duration of the data table. This ensures the safety of the data, avoids erroneous data life cycle management operations on the data table, and achieves accurate recommendation of the data active duration.
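One possible reading of the last two steps of this design can be sketched in Python. The sketch assumes that density clustering has already produced clusters of per-partition differences with noise removed. Representing each core point by the median of its cluster, and taking the maximum boundary point as the largest value in the chosen cluster, are assumptions made for illustration only; this application does not define them that way. The function name `data_active_duration` is hypothetical.

```python
from statistics import median


def data_active_duration(clusters):
    """Pick a data active duration from clustered partition differences.

    clusters: list of non-empty lists of differences (for example, in days),
    one list per cluster, with noise points already removed.
    """
    if not clusters:
        raise ValueError("at least one cluster is required")
    # Assumption: a cluster's core point is summarized by its median.
    largest_cluster = max(clusters, key=median)
    # Assumption: the maximum boundary point is the largest member
    # of the cluster that has the largest core point.
    return max(largest_cluster)
```

Choosing the largest value of the largest cluster errs toward keeping data longer, which matches the stated goal of avoiding mistaken deletion.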

In one possible design, the step of performing, by the first device, density clustering on the difference sequence to obtain a plurality of core points includes:

step 1, the first device inputs a difference sequence L, a preset neighborhood radius Eps, and a preset minimum number of neighborhood points MinPts required for a given point to be a core point into a preset algorithm module, to output a result sequence R and a reachable distance rd of each partition table difference sample;

step 2, outputting clustering results C corresponding to the plurality of core points based on the result sequence R.

In one possible design, the first device inputting the difference sequence L, the preset neighborhood radius Eps, and the preset minimum number of neighborhood points MinPts required for a given point to be a core point into the preset algorithm module, to output the result sequence R and the reachable distance rd of each partition table difference sample, includes:

step 1.1: selecting a sample point that is a core point and is not in the result sequence R, finding all sample points that are directly density-reachable from it, putting each of those sample points that is not already in the result sequence R into an ordered sequence Q, and sorting Q in ascending order of reachable distance rd, wherein each sample point is the active duration of a partition table;

step 1.2: if the ordered sequence Q is empty, executing the step 1.1, if the ordered sequence Q is not empty, taking a first sample point m from the ordered sequence Q, and storing the taken sample point m into a result sequence R;

step 1.3: iterating step 1.2 until the algorithm finishes, and outputting the result sequence R and the reachable distance rd of each partition table difference sample.

In one possible design, outputting clustering results C corresponding to a plurality of core points based on the result sequence R includes:

step 2.1: sequentially taking out sample points from the result sequence R;

step 2.2: if the core distance of the sample point is greater than the given neighborhood radius Eps, determining that the sample point is a noise point and ignoring it; otherwise, determining that the sample point belongs to a new cluster; then returning to step 2.1;

step 2.3: when the traversal of the result sequence R is finished, outputting the clustering result C corresponding to the plurality of core points.
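Steps 1.1 to 2.3 resemble the ordering and extraction phases of OPTICS. The following Python sketch illustrates that idea for one-dimensional difference samples; it is a simplified variant under stated assumptions, not the implementation of this application. The names `optics_1d` and `extract_clusters` are hypothetical, a point counts as its own neighbor, and noise is handled by never admitting unreachable non-core points into R instead of testing the core distance during extraction.

```python
import math


def optics_1d(values, eps, min_pts):
    """Return (order, reach): the result sequence R as indices into values,
    and the reachable distance rd of each sample (math.inf if undefined)."""
    n = len(values)

    def neighbors(i):
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    def core_distance(i, nbrs):
        # Distance to the min_pts-th nearest neighbor, or None if not a core point.
        if len(nbrs) < min_pts:
            return None
        return sorted(abs(values[i] - values[j]) for j in nbrs)[min_pts - 1]

    reach = [math.inf] * n
    processed = [False] * n
    order = []

    def update(i, nbrs, core_dist, seeds):
        # Put unprocessed directly density-reachable points into Q (seeds).
        for j in nbrs:
            if processed[j]:
                continue
            new_rd = max(core_dist, abs(values[i] - values[j]))
            if new_rd < reach[j]:
                reach[j] = new_rd
                seeds.add(j)

    for start in range(n):
        if processed[start]:
            continue
        nbrs = neighbors(start)
        cd = core_distance(start, nbrs)
        if cd is None:
            continue  # step 1.1 only starts from core points
        processed[start] = True
        order.append(start)
        seeds = set()
        update(start, nbrs, cd, seeds)
        while seeds:
            # Step 1.2: take the sample with the smallest reachable distance.
            m = min(seeds, key=lambda j: (reach[j], j))
            seeds.remove(m)
            processed[m] = True
            order.append(m)
            m_nbrs = neighbors(m)
            m_cd = core_distance(m, m_nbrs)
            if m_cd is not None:
                update(m, m_nbrs, m_cd, seeds)
    return order, reach


def extract_clusters(values, order, reach, eps):
    """Walk R in order; a sample whose rd exceeds eps opens a new cluster."""
    clusters = []
    for i in order:
        if reach[i] > eps:
            clusters.append([values[i]])
        else:
            clusters[-1].append(values[i])
    return clusters
```

In this variant every sample in R either starts a cluster (its reachable distance is undefined) or was reached from a core point within Eps, so an isolated sample such as a single very late access never enters R and is treated as noise.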

In one possible design, the method further includes: and when the creation duration of the data table is less than or equal to the preset duration, the first equipment stops performing data life cycle management on the data table. Therefore, the phenomenon of data deletion by mistake or long-time storage is avoided, and the validity of data life cycle management is improved.

In one possible design, the first device determines the target processing policy based on a type of the data table, including:

the first device identifies basic data information of the data table, wherein the basic data information comprises: whether the structure of the data table is consistent with that of its first-level upstream data table, the access condition of the data table within a preset duration, the same-database out-degree and in-degree of the data table, the different-database out-degree and in-degree of the data table, and the other out-degree and in-degree of the data table; the first device determines the type of the data table based on the association between the basic data information and the type of the data table; and the first device determines a target processing strategy based on the association between the type of the data table and the processing strategy of the data table.

Therefore, the first device is configured with the association relationship between the basic data information and the type of the data table and the association relationship between the type of the data table and the processing strategy of the data table in advance, so that the first device can determine the target processing strategy based on the collected basic data information of the data table. Therefore, specific recommendation of data life cycle management is given by effectively combining specific conditions of the data table.
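The pre-configured association between table type and processing strategy can be pictured as a simple lookup. In the Python sketch below, the table types come from this application, but which strategy each type maps to is a purely hypothetical assignment chosen for illustration; the names `TableType`, `Policy`, `POLICY_BY_TYPE` and `target_policy` are likewise assumptions.

```python
from enum import Enum


class TableType(Enum):
    DIRECT_SOURCE = "direct source table"
    SECONDARY_SOURCE = "secondary source table"
    INTERMEDIATE = "intermediate table"
    RESULT = "result table"
    TEMPORARY = "temporary table"
    OTHER = "other table"


class Policy(Enum):
    ARCHIVE = "data archiving"
    CLEAN = "data cleaning"


# Hypothetical association: the actual mapping is configured on the first
# device in advance and is not specified here.
POLICY_BY_TYPE = {
    TableType.DIRECT_SOURCE: Policy.ARCHIVE,
    TableType.SECONDARY_SOURCE: Policy.ARCHIVE,
    TableType.INTERMEDIATE: Policy.CLEAN,
    TableType.RESULT: Policy.ARCHIVE,
    TableType.TEMPORARY: Policy.CLEAN,
    TableType.OTHER: Policy.ARCHIVE,
}


def target_policy(table_type):
    """Look up the target processing strategy for a data table type."""
    return POLICY_BY_TYPE[table_type]
```

Keeping the association in data rather than in branching code would let an operator adjust the strategy for a type without changing the decision logic.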

In one possible design, the types of data tables include: direct source table, secondary source table, intermediate table, result table, temporary table, and other tables.

In one possible design, the first device performs data lifecycle management on the data table according to a target processing policy, including:

the first device sends a first request to a second device, wherein the first request is used for requesting to trigger the target processing strategy; the first device receives a first response from the second device; and when the first device determines that the first response indicates that the developer has approved the target processing strategy as feasible, the first device performs data life cycle management on the data table according to the target processing strategy.

Therefore, the first device further determines the data activity duration of the data table according to the actual management requirements of enterprises or regulations on the data and by combining the approval management process of developers, so that the safety of the data is ensured, and the integrity of the life cycle management of the data is ensured.

In one possible design, the method further includes: when determining that the first response indicates that the developer has not approved the target processing strategy, the first device updates the data active duration to the data active duration carried in the first response; when the creation duration of the data table is longer than the updated data active duration, the first device determines the target processing strategy based on the type of the data table; and when determining that the first response indicates that the developer has approved the target processing strategy as feasible, the first device performs data life cycle management on the data table according to the target processing strategy.

Therefore, the data activity duration of the data table is effectively corrected, the data safety is ensured, and the integrity of data life cycle management is ensured.
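The approval and correction flow described above can be sketched as a small loop in Python. This is an illustrative sketch only: the function name `manage_with_approval`, the callables passed in, the `(approved, corrected_active_days)` shape of the first response, and the `max_rounds` limit are all assumptions that do not come from this application.

```python
def manage_with_approval(creation_days, active_days, table_type,
                         request_approval, choose_policy, apply_policy,
                         max_rounds=3):
    """Run the request/response cycle with the second device.

    request_approval(policy, active_days) stands in for the first request and
    first response; it returns (approved, corrected_active_days).
    Returns the applied policy, or None if nothing was applied.
    """
    for _ in range(max_rounds):  # assumption: bound the number of rounds
        if creation_days <= active_days:
            return None  # the table is still within its active duration
        policy = choose_policy(table_type)
        approved, corrected_active_days = request_approval(policy, active_days)
        if approved:
            apply_policy(policy)
            return policy
        # Not approved: adopt the duration carried in the response and retry.
        active_days = corrected_active_days
    return None
```

A rejected request therefore does not end the process; the corrected data active duration is used for the next comparison against the creation duration.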

In one possible design, the first device and the second device are the same device or different devices.

In a second aspect, an embodiment of the present application provides a data lifecycle management apparatus, which is applied to a first device.

The apparatus may include:

the calculation module is used for calculating the data active duration of a data table based on the access condition of the data table;

the determining module is used for determining a target processing strategy based on the type of the data table when the judging module determines that the creation duration of the data table is longer than the active duration of the data;

and the management module is used for carrying out data life cycle management on the data table according to the target processing strategy.

In one possible design, the calculation module is specifically configured to obtain a creation time of the data table, and a creation time and a latest access time of each partition table in the data table; when the creating time length of the data table is longer than the preset time length, determining the difference between the creating time length of each partition table and the latest access time as a difference sequence; performing density clustering processing on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access duration of the partition table of the same type in the data table; calculating a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table; and determining the maximum boundary point as the data active duration.

In a possible design, the management module is further configured to stop performing data life cycle management on the data table when the determination module determines that the creation duration of the data table is less than or equal to a preset duration.

In one possible design, the determining module is specifically configured to identify basic data information of the data table, where the basic data information includes: whether the structure of the data table is consistent with that of its first-level upstream data table, the access condition of the data table within a preset duration, the same-database out-degree and in-degree of the data table, the different-database out-degree and in-degree of the data table, and the other out-degree and in-degree of the data table; determine the type of the data table based on the association between the basic data information and the type of the data table; and determine a target processing strategy based on the association between the type of the data table and the processing strategy of the data table.

In one possible design, the management module is specifically configured to send a first request to the second device, where the first request is used to request triggering of the target processing strategy; receive a first response from the second device; and when the judging module determines that the first response indicates that the developer has approved the target processing strategy as feasible, perform data life cycle management on the data table according to the target processing strategy.

In one possible design, the apparatus further includes an updating module, which is configured to update the data active duration to the data active duration carried in the first response when the judging module determines that the first response indicates that the developer has not approved the target processing strategy. The determining module then determines the target processing strategy based on the type of the data table when the judging module determines that the creation duration of the data table is longer than the updated data active duration, and the management module performs data life cycle management on the data table according to the target processing strategy when the judging module determines that the first response indicates that the developer has approved the target processing strategy as feasible.

The beneficial effects of the data lifecycle management apparatus provided in the second aspect and in each possible design of the second aspect may refer to the beneficial effects brought by each possible implementation manner of the first aspect, and are not described herein again.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke program instructions in the memory to cause the electronic device to perform the method of data lifecycle management in the first aspect and in any one of the possible designs of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer storage medium, which includes computer instructions that, when executed on an electronic device, cause the electronic device to perform the method for data lifecycle management in the first aspect and any one of the possible designs of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to execute the data lifecycle management method of the first aspect and any one of the possible designs of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes: a processor; the electronic device performs the data lifecycle management method of the first aspect and any one of the possible designs of the first aspect when the processor executes the computer instructions stored in the memory.

Drawings

FIG. 1 is a flow chart illustrating a conventional data lifecycle management method;

FIG. 2 is a schematic flowchart illustrating a data lifecycle management method according to an embodiment of the present application;

FIG. 3A is a schematic flowchart of a data lifecycle management method according to an embodiment of the present application;

FIG. 3B is a schematic flow chart illustrating a process of obtaining data active duration according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating a data lifecycle management method according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating a data lifecycle management method according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data lifecycle management apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data lifecycle management apparatus according to an embodiment of the present application.

Detailed Description

First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

1. Data lifecycle management: a data management model that covers the full life cycle of data, including its generation, use, migration, cleaning, and destruction. Data life cycle management effectively controls the data scale of a production system and improves data access efficiency, which improves the overall efficiency of system operation and helps enterprises obtain the maximum value at the lowest cost in each stage of the data life cycle.

2. Data active duration: data access heat generally follows the rule that access frequency gradually decreases as time passes. The access heat keeps decreasing until it stabilizes after a certain length of time; that length of time is the data active duration.

An existing data life cycle management method is described first. Referring to fig. 1, fig. 1 is a flow chart illustrating a conventional data lifecycle management method. As shown in fig. 1, in the existing data lifecycle management method, a developer sets a processing policy corresponding to data lifecycle management based on data requirements and an understanding of the data. The processing policy generally includes: a data processing means (e.g., permanent storage, or archiving/cleaning after a period of storage) and a data retention period (e.g., 30/60/90 days).

In the online stage of the data, the processor may determine whether the actual data retention period exceeds the data retention period set in the processing policy. If the actual data retention period exceeds the data retention period set in the processing strategy, the processor triggers the processing flow of data archiving or data cleaning. After the process is approved, operation and maintenance personnel perform data archiving and data cleaning, namely entering a data archiving stage or a data destruction stage.

The archiving phase of the data may include online archiving and offline archiving, among others. In online archiving of data, data is stored in an archive cluster of high-end storage devices and supports online queries by users. In offline archiving of data, the data is stored in a low-end storage device and does not support online queries by a user.

However, in the existing data lifecycle management method exemplarily shown in fig. 1, one of the most critical parts is to determine when data should be archived or cleaned, and the data retention period often depends on the data knowledge of the developer, and there are several problems as follows:

1. Low accuracy: the retention period of data is set manually, so subjective evaluation errors exist, and the set value can easily differ greatly from the actual value.

2. Data security risk or increased equipment cost: data is an enterprise asset, so cleaning and archiving generally must be carried out with great care, both to avoid production problems and data security risks caused by mistaken deletion and to avoid increased equipment cost caused by storing data past its useful life.

3. Low processing efficiency: in the existing scheme, a developer sets the data retention period before the data goes online, and the corresponding processing strategy needs to be reviewed repeatedly, which implicitly increases the complexity and management cost of data life cycle management and makes processing inefficient.

In order to solve the foregoing problems, embodiments of the present application provide a data lifecycle management method, an apparatus, a device, and a computer storage medium. The execution subject of the data lifecycle management method in the embodiments of the present application is a first device (e.g., a server) applied in the field of financial technology. The first device can use the access condition of a data table to accurately calculate the data active duration of the data table. When the creation duration of the data table is longer than the data active duration, the first device may determine that data lifecycle management needs to be performed on the data table at this time. The first device can determine a target processing strategy based on the type of the data table, and thus accurately evaluate whether the data table needs data archiving or data cleaning. The first device can therefore perform data life cycle management on the data table automatically according to the target processing strategy, so that manual intervention is reduced, the processing efficiency of data life cycle management is improved, the safety of data archiving or data cleaning is ensured, and the cost pressure of enterprise storage is reduced.

Referring to fig. 2, fig. 2 is a flowchart illustrating a data lifecycle management method according to an embodiment of the present application.

As shown in fig. 2, with a first device as an execution subject, the data lifecycle management method according to the embodiment of the present application may include:

S101, the first device calculates the data active duration of a data table based on the access condition of the data table. The access condition of a data table represents how users actually access the data table. The data active duration of the data table is obtained by taking into account both the access frequency of the data from the initial stage through the middle stage to the late stage, and the fact that users occasionally need to access some data temporarily. Therefore, the first device can calculate the data active duration of a data table based on the access condition of that data table.

The access condition of one data table may include, but is not limited to: the description information, creation time, creation duration, access time and access frequency of the data table, and the description information, creation time, creation duration, access time, access frequency, storage address and the like of each partition table in the data table.

In addition, as will be understood by those skilled in the art, the density clustering algorithm performs clustering based on how dense the data set is in the spatial distribution, assuming that the clustering structure can be determined by how close the sample distribution is. It can be seen that the data active duration of the data table is matched with the density clustering algorithm. Therefore, the first device can adopt a density clustering algorithm to calculate the data activity duration of one data table based on the access condition of the one data table.

The embodiment of the present application does not limit the specific implementation manner of the density clustering algorithm. In some embodiments, the density clustering algorithm comprises: the OPTICS (ordering points to identify the clustering structure) density clustering algorithm or the DBSCAN (density-based spatial clustering of applications with noise) density clustering algorithm.

S102, when the creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table.

S103, the first device manages the data life cycle of the data table according to the target processing strategy.

Based on the data active duration determined in step S101, the first device may determine whether the creation duration of the data table is greater than the data active duration. When the creation duration of the data table is longer than the data activity duration, the first device may determine that data lifecycle management needs to be performed on the data table at this time. Data lifecycle management may include data archiving or data cleansing, among other things.

Because the first device stores in advance the correspondence between the type of the data table and the processing strategy for data lifecycle management, the first device can determine the target processing strategy based on the type of the data table, that is, determine whether data archiving or data cleaning is to be performed on the data table. Thus, the first device may perform data lifecycle management on the data table according to the target processing policy.

The embodiment of the present application does not limit the specific implementation manner of the type of the data table. In some embodiments, the types of data tables include: direct source table, secondary source table, intermediate table, result table, temporary table, and other tables.

According to the data life cycle management method provided by the embodiment of the application, the first device can calculate the data active duration of the data table from the access condition of the data table, so that the data active duration is recommended accurately. When the creation duration of the data table is longer than its data active duration, the first device may determine that data lifecycle management needs to be performed on the data table. The first device can determine a target processing strategy based on the type of the data table, and thus accurately evaluate whether the data table needs data archiving or data cleaning. The first device can therefore perform data life cycle management on the data table automatically according to the target processing strategy, so that manual intervention is reduced, the processing efficiency of data life cycle management is improved, the safety of data archiving or data cleaning is ensured, and the cost pressure of enterprise storage is reduced.

Next, with reference to fig. 3A, a possible implementation manner of calculating, by the first device, the data active duration of the data table based on the access condition of the data table in step S101 is described.

Referring to fig. 3A, fig. 3A is a schematic flow chart illustrating a data lifecycle management method according to an embodiment of the present application.

As shown in fig. 3A, a data lifecycle management method according to an embodiment of the present application may include:

S201, the first device obtains the creation time of the data table, and the creation time and the latest access time of each partition table in the data table.

One data table typically exists in the form of a partition table, and thus, one data table may include a plurality of partition tables. For example, the data table is a Hive table in a Hadoop cluster, and the partition table in the data table is a partition table classified according to a date directory in the Hive table.

When the data table is built, the metadata information of the data table, such as its description information, creation time, and access frequency, and the metadata information of each partition table, such as its description information, creation time, access frequency, and storage address, are recorded.

Therefore, the first device obtains the creation time of the data table, and the creation time and the latest access time of each partition table in the data table, from the module that stores this information. The creation duration is the difference between the current time and the creation time.

S202, the first device judges whether the creation time length of the data table is larger than a preset time length.

Based on the creation time length of the data table obtained in step S201, the first device may determine whether the creation time length of the data table is greater than a preset time length to determine whether the data in the data table is in an active period.

The embodiment of the present application does not limit the specific value of the preset duration. In general, the preset time period may be set based on the business situation, such as 90 days.

If yes, the first device may determine that the data in the data table is not in an active period, and perform step S203-step S206; if not, the first device may determine that the data in the data table is in an active period, and perform step S207.
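The check in steps S201 and S202 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the example dates, and the use of 90 days as the preset duration (the example value mentioned above) are assumptions.

```python
from datetime import datetime, timedelta

# Assumption: 90 days, the example value given in the text; set per business situation.
PRESET_DURATION = timedelta(days=90)


def creation_duration(created_at: datetime, now: datetime) -> timedelta:
    """Creation duration = current time minus creation time (S201)."""
    return now - created_at


def past_active_period(created_at: datetime, now: datetime,
                       preset: timedelta = PRESET_DURATION) -> bool:
    """S202: True means continue with S203-S206, False means go to S207."""
    return creation_duration(created_at, now) > preset


# Hypothetical example tables.
now = datetime(2021, 1, 1)
old_table = datetime(2020, 6, 1)   # created 214 days before `now`
new_table = datetime(2020, 12, 1)  # created 31 days before `now`
```

With these example values, `past_active_period(old_table, now)` is true and `past_active_period(new_table, now)` is false.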

S203, the first device determines the differences between the latest access time and the creation duration of each partition table as a difference sequence.

When the creation duration of the data table is longer than the preset duration, based on the creation duration and the latest access time of each partition table in the data table acquired in step S201, the first device may calculate, for each partition table, the difference between its latest access time and its creation duration, and determine the differences corresponding to all partition tables as a difference sequence.

For example, for any one partition table in the data table, the creation duration of the partition table is $T_{ci}$ and the latest access time of the partition table is $T_{ei}$, where i is a positive integer greater than or equal to 1 and less than or equal to n, and n is the total number of partition tables of the data table.

Then, the difference sequence is $L = \{(T_{e1} - T_{c1}), (T_{e2} - T_{c2}), \ldots, (T_{en} - T_{cn})\}$.
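As a minimal sketch of how such a difference sequence could be built, the following assumes that both quantities of each partition table are available as timestamps (creation time and latest access time), so that each difference is the time from creation to last access, expressed in days. The function name and the sample partitions are hypothetical.

```python
from datetime import datetime


def difference_sequence(partitions):
    """partitions: list of (created_at, last_access_at) pairs, one per partition table.

    Returns the sequence L of differences (last access minus creation) in days.
    """
    return [(last - created).total_seconds() / 86400.0
            for created, last in partitions]


# Hypothetical partitions of one data table.
partitions = [
    (datetime(2020, 1, 1), datetime(2020, 1, 4)),      # last accessed 3 days after creation
    (datetime(2020, 1, 2), datetime(2020, 1, 2, 12)),  # last accessed 12 hours after creation
]
L = difference_sequence(partitions)
```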

S204, the first device performs density clustering processing on the difference sequence to obtain a plurality of core points, and each core point is used for representing the access duration of the partition tables of the same type in the data table.

The difference value sequence not only can represent the access frequency of the data from the initial stage of the establishment, the middle stage of the establishment to the later stage of the establishment, but also can represent the phenomenon that a user occasionally needs to access some data temporarily. Therefore, the first device can perform density clustering processing on the difference sequence based on a density clustering algorithm, and eliminate a plurality of noise points which may exist to obtain a plurality of core points. Wherein, one core point can be used for representing one access time length of the partition table of the same type in the data table. A noise point may be used to indicate when a user temporarily accesses certain data sporadically.

S205, the first device calculates a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table.

S206, the first device determines the maximum boundary point as the data active duration.

Since one core point may represent one access duration of the partition tables of the same type in the data table, the first device may take the largest core point of the plurality of core points, i.e., calculate the maximum access duration for all types of partition tables in the data table. Although the maximum core point may represent the maximum access duration of the data table, in consideration of the security of data management, the first device may determine the maximum boundary point corresponding to the maximum core point as the data active duration of the data table, so as to ensure the security of data and avoid a misoperation of data lifecycle management on the data table.

Based on steps S204 to S206, and with reference to fig. 3B, taking an OPTICS density clustering algorithm as an example, the first device may perform the following steps to obtain the data active duration of the data table:

Step 1: the first device inputs the difference sequence L, a preset neighborhood radius Eps, and a preset minimum neighborhood point number MinPts (the minimum number of points in the neighborhood of a given point for that point to become a core object) into an algorithm module of the first device. The algorithm module then outputs the result sequence R and the reachable distance rd of each partition table difference sample according to the following algorithm flow.

The core object, i.e., the core point, can be understood as follows: draw a circle centered on a sample point with the neighborhood radius Eps as its radius. If the number of other sample points falling in the circle is greater than or equal to the minimum neighborhood point number MinPts, the sample point is a core object, and the sample points falling in the circle are boundary points of that sample point. Otherwise, the sample point is not a core point.
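For one-dimensional difference samples, the "circle" is simply an interval of half-width Eps around the sample. A minimal sketch of the core object test described above, with hypothetical function names and sample values:

```python
def neighbors(samples, i, eps):
    """Indices of the other samples within distance eps of samples[i]."""
    return [j for j, v in enumerate(samples)
            if j != i and abs(v - samples[i]) <= eps]


def is_core(samples, i, eps, min_pts):
    """A sample is a core object if at least min_pts other samples lie within eps."""
    return len(neighbors(samples, i, eps)) >= min_pts


# Hypothetical difference samples in days; 30.0 is an isolated value.
samples = [1.0, 1.5, 2.0, 30.0]
```

With `eps=1.0` and `min_pts=2`, the samples 1.0, 1.5 and 2.0 are core objects, while 30.0 has no neighbors and is not.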

Step 1.1: the algorithm module selects a sample point that is not in the result sequence R and is a core object, and finds all directly density-reachable sample points of that sample point (i.e., the set N of neighborhood object points of the core object c in fig. 3B). If a directly density-reachable sample point is not present in the result sequence R, the algorithm module places it in the ordered sequence Q, which is sorted by the reachable distance rd from small to large.

The difference sequence L consists of the difference samples of the partition tables, where one sample point is the active duration of one partition table, that is, the difference between the latest access time of the partition table and the creation duration of the partition table.

Step 1.2: if the ordered sequence Q is empty, the algorithm module executes step 1.1. If the ordered sequence Q is not empty, the algorithm module takes the first sample point m from the ordered sequence Q and stores the taken sample point m into the result sequence R.

Step 1.2.1: the algorithm module determines whether the sample point m is a core object. If not, the algorithm module executes step 1.2. If so, the algorithm module finds the set N of all directly density-reachable points of the sample point m (i.e., the set N of neighborhood object points of the core object m in fig. 3B).

Step 1.2.2: the algorithm module determines whether the sample points in the set N already exist in the result sequence R. If so, the algorithm module does not process. If not, the algorithm module executes step 1.2.3 or step 1.2.4.

Step 1.2.3: if the direct density reachable sample points already exist in the ordered sequence Q, and the new reachable distance is less than the old reachable distance at this time, the algorithm module replaces the old reachable distance with the new reachable distance and reorders the ordered sequence Q.

Step 1.2.4: if the directly density-reachable sample point does not exist in the ordered sequence Q, the algorithm module inserts the point and reorders the ordered sequence Q.

Step 1.3: the algorithm module iterates step 1.2 until the algorithm is finished, and outputs the result sequence R and the reachable distance rd of each partition table difference sample.

The core distance cd is the minimum radius that makes a sample point a core object, that is, the distance between the sample point and its MinPts-th nearest point. The reachable distance rd is defined for the neighbor points $t_1, t_2, t_3, \ldots, t_n$ of a sample point t: if the distance from such a point to the point t is greater than the core distance of t, the reachable distance is the actual distance from that point to the point t; if the distance from such a point to the point t is less than the core distance, the reachable distance is the core distance of the point t.
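Steps 1.1 to 1.3, together with the two distance definitions, can be sketched for one-dimensional samples as below. This is an illustrative textbook-style OPTICS ordering, not the embodiment's own code: the names are hypothetical, a heap with lazy deletion stands in for "reorder the ordered sequence Q", and, unlike step 1.1, samples that are never reached from a core object are still appended to R with an infinite reachable distance so that they can later be treated as noise.

```python
import heapq
import math


def core_distance(samples, i, eps, min_pts):
    """Distance to the min_pts-th nearest other sample, or inf if i is not a core object."""
    dists = sorted(abs(v - samples[i]) for j, v in enumerate(samples) if j != i)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return math.inf
    return dists[min_pts - 1]


def optics_order(samples, eps, min_pts):
    """Return (R, rd): the visiting order of sample indices and rd per index."""
    n = len(samples)
    rd = [math.inf] * n
    done = [False] * n
    order = []  # result sequence R

    def update(center, queue):
        # Steps 1.2.2-1.2.4: push neighbors of a core object with their reachable distance.
        cd = core_distance(samples, center, eps, min_pts)
        for j in range(n):
            d = abs(samples[j] - samples[center])
            if done[j] or j == center or d > eps:
                continue
            new_rd = max(cd, d)  # reachable distance as defined above
            if new_rd < rd[j]:
                rd[j] = new_rd
                heapq.heappush(queue, (new_rd, j))  # older entries become stale

    for start in range(n):
        if done[start]:
            continue
        done[start] = True
        order.append(start)
        if core_distance(samples, start, eps, min_pts) == math.inf:
            continue  # not a core object: nothing to expand
        queue = []  # ordered sequence Q
        update(start, queue)
        while queue:
            d, m = heapq.heappop(queue)
            if done[m] or d > rd[m]:
                continue  # stale queue entry
            done[m] = True
            order.append(m)
            if core_distance(samples, m, eps, min_pts) != math.inf:
                update(m, queue)
    return order, rd


# Hypothetical difference samples in days.
samples = [1.0, 1.5, 2.0, 30.0, 10.0, 10.4, 10.9]
R, rd = optics_order(samples, eps=1.0, min_pts=2)
```

In this example the isolated sample 30.0 keeps an infinite reachable distance, which is what later allows it to be discarded as a noise point.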

Step 2: the algorithm module outputs a final clustering result C based on the result sequence R obtained in step 1.

Step 2.1: the algorithm module takes sample points in order from the result sequence R.

If the sample point's reachable distance is not greater than the given neighborhood radius Eps, the algorithm module determines that the sample point belongs to the current category. Otherwise, the algorithm module executes step 2.2.

Step 2.2: if the core distance of the sample point is greater than the given neighborhood radius Eps, the algorithm module determines that the sample point is a noise point and may ignore the sample point. Otherwise, the algorithm module determines that the sample point belongs to a new cluster and jumps to step 2.1.

Step 2.3: the algorithm module finishes traversing the result sequence R, and the algorithm ends.
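Steps 2.1 to 2.3 amount to a single pass over the result sequence R. The sketch below is illustrative only: the function name is hypothetical, and the example values of R, rd and cd are hand-written for seven hypothetical samples rather than produced by the embodiment.

```python
import math


def extract_clusters(order, rd, cd, eps):
    """Walk the result sequence R and group sample indices into clusters."""
    clusters = []
    current = None
    for p in order:                      # step 2.1: take sample points in order
        if rd[p] <= eps:
            if current is None:          # defensive: cannot occur in a valid ordering
                current = []
                clusters.append(current)
            current.append(p)            # belongs to the current category
        elif cd[p] > eps:
            continue                     # step 2.2: noise point, ignored
        else:
            current = [p]                # step 2.2: start of a new cluster
            clusters.append(current)
    return clusters                      # step 2.3: traversal finished


# Hand-written example inputs (index 3 is an isolated sample).
inf = math.inf
R = [0, 1, 2, 3, 4, 5, 6]
rd = [inf, 1.0, 0.5, inf, inf, 0.9, 0.5]
cd = [1.0, 0.5, 1.0, inf, 0.9, 0.5, 0.9]
C = extract_clusters(R, rd, cd, eps=1.0)
```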

Thus, the algorithm module obtains the clustering results $C_i$, where i is a positive integer greater than or equal to 1 and less than or equal to n, and n is the total number of types of partition tables in the data table. Each clustering result corresponds to a plurality of access durations (including core points and boundary points) of the partition tables of the same type.

Through the steps, a plurality of access durations of each partition table in the data table can be accurately analyzed between the creation duration and the latest access time of each partition table, and the content to be deleted of the final data table is determined according to the access durations of each partition table, so that the data table can be accurately and effectively managed subsequently.

Step 3: in the clustering result C, the algorithm module takes the category $C_x$ with the maximum value among the categories $C_1, C_2, C_3, \ldots, C_n$. In order to ensure the safety of data archiving or data cleaning, the algorithm module takes the maximum sample value in $C_x$ (i.e., the maximum boundary point) as the recommended value of the data active duration.
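Step 3 can be sketched as below, under the assumption that "the category with the maximum value" means the cluster containing the largest clustered sample. The function name and example data are hypothetical; noise points are excluded simply because they belong to no cluster.

```python
def recommend_active_duration(samples, clusters):
    """Return the largest sample of the cluster with the largest values, or None."""
    if not clusters:
        return None
    values = [[samples[i] for i in cluster] for cluster in clusters]
    top_cluster = max(values, key=max)  # the category C_x
    return max(top_cluster)             # the maximum boundary point


# Hypothetical samples (days); index 3 is noise and appears in no cluster.
samples = [1.0, 1.5, 2.0, 30.0, 10.0, 10.4, 10.9]
clusters = [[0, 1, 2], [4, 5, 6]]
recommended = recommend_active_duration(samples, clusters)
```

Here the recommended data active duration is 10.9 days; the occasional access at 30.0 days does not inflate it.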

It should be noted that the first device may also use a DBSCAN density clustering algorithm instead of the OPTICS density clustering algorithm, provided that the neighborhood radius Eps and the minimum neighborhood point number MinPts (the minimum number of points in the neighborhood of a given point for that point to become a core object) that the first device inputs into its algorithm module are appropriate.
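For comparison, a compact DBSCAN over one-dimensional samples might look like the following. It is a sketch under the same core object convention used above (at least MinPts other samples within Eps); the function name, labels, and data are illustrative assumptions.

```python
def dbscan_1d(samples, eps, min_pts):
    """Return one label per sample: a cluster id >= 0, or -1 for noise."""
    n = len(samples)
    labels = [None] * n

    def region(i):
        return [j for j in range(n)
                if j != i and abs(samples[j] - samples[i]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisionally noise
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # former noise becomes a boundary point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = region(j)
            if len(more) >= min_pts:     # only core objects expand the cluster
                queue.extend(more)
    return labels


# Hypothetical difference samples in days.
samples = [1.0, 1.5, 2.0, 30.0, 10.0, 10.4, 10.9]
labels = dbscan_1d(samples, eps=1.0, min_pts=2)
```

Unlike OPTICS, DBSCAN produces the clusters directly for one fixed Eps, which is why the choice of Eps and MinPts matters more here.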

In addition, the first device can also adopt clustering algorithms based on other ideas to calculate the data active duration of the data table. For example, the first device may use a k-means clustering algorithm or a variant thereof to calculate k centroid points, perform denoising processing on the k centroid points, and then take the maximum boundary value among the k centroid points as the recommended value of the data active duration, so as to obtain an accurate data active duration and ensure data security.

S207, the first device stops performing data life cycle management on the data table.

When the creation time length of the data table is less than or equal to the preset time length, the first device may determine that the data in the data table is in an active period. Thus, the first device may stop data lifecycle management for the data table, i.e., without regard to data archiving or data cleansing operations.

In the embodiment of the application, the first device may perform density clustering on a difference sequence formed by the differences between the creation time and the latest access time of each partition table in the data table, remove the noise points caused by users occasionally and temporarily accessing some data, and obtain a plurality of core points that reflect the pattern of data access frequency from the initial stage, through the middle stage, to the later stage after creation, wherein each core point is used for representing the access duration of the partition tables of the same type in the data table. The first device selects the maximum core point from the plurality of core points, calculates the maximum boundary point corresponding to the maximum core point, namely the maximum access duration of all types of partition tables in the data table, and determines this maximum boundary point as the data active duration of the data table, so that data security is ensured, misoperation of data life cycle management on the data table is avoided, and accurate recommendation of the data active duration is realized.

Next, with reference to fig. 4, a possible implementation manner of determining the target processing policy by the first device based on the type of the data table in step S102 is described.

Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a data lifecycle management method according to an embodiment of the present application.

As shown in fig. 4, the data lifecycle management method according to the embodiment of the present application may include:

S301, the first device identifies basic data information of the data table.

Wherein the basic data information includes: whether the structure of the data table is consistent with that of its first-level upstream data table, the access condition of the data table within a preset duration, the same-library out-degree and in-degree of the data table, the cross-library out-degree and in-degree of the data table, and the other out-degree and in-degree of the data table.

The first device may identify the basic data information of the data table based on the lineage data of the data table, such as the cluster, library name, and table name.

For example, the first device may compute triplets via Spark GraphX. Specifically, the first device may analyze the dependencies between tables and the cross-library dependencies according to information such as the start node, the edge, and the end node of each triplet, so as to obtain the basic data information of the data table. A node represents an upstream or downstream table. An edge represents a hive/sqoop/mask task. The out-degree is the number of downstream tables of a table. The in-degree is the number of upstream tables of a table.
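The degree statistics can be illustrated without Spark. The sketch below is plain Python, not GraphX: it assumes each triplet is a (start table, task, end table) tuple, that tables are named `library.table`, and that same-library versus cross-library is decided by comparing the library prefix. All names and example tables are hypothetical.

```python
from collections import defaultdict


def degree_stats(triplets):
    """Count same-library and cross-library out-degree and in-degree per table."""
    stats = defaultdict(lambda: {"same_out": 0, "cross_out": 0,
                                 "same_in": 0, "cross_in": 0})
    for src, _task, dst in triplets:
        same_library = src.split(".")[0] == dst.split(".")[0]
        # The start node gains a downstream table, the end node an upstream table.
        stats[src]["same_out" if same_library else "cross_out"] += 1
        stats[dst]["same_in" if same_library else "cross_in"] += 1
    return dict(stats)


# Hypothetical lineage triplets: (upstream table, task type, downstream table).
triplets = [
    ("ods.orders", "hive", "dw.orders_mid"),
    ("dw.orders_mid", "hive", "dw.orders_result"),
    ("dw.orders_mid", "sqoop", "app.orders_export"),
]
stats = degree_stats(triplets)
```

In this example `dw.orders_mid` has one upstream table in another library and two downstream tables, one in the same library and one in a different library.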

S302, the first device determines the type of the data table based on the incidence relation between the basic data information and the type of the data table.

Since the first device is configured with the association relationship between the basic data information and the type of the data table in advance, the first device can determine the type of the data table based on the basic data information of the data table.

The embodiment of the present application does not limit the specific implementation manner of the association relationship, configured in the first device, between the basic data information and the type of the data table. In some embodiments, the first device may construct the association relationship between the basic data information and the type of the data table based on whether the structure of the data table is consistent with that of its first-level upstream table, whether the data table has been accessed in the last 30/90/180 days, the same-library out-degree and in-degree, the cross-library out-degree and in-degree, the other out-degree and in-degree, and the like, as shown in table 1.

TABLE 1

Wherein, direct source table: a table synchronized directly from an online source database (e.g., MySQL) into the data table (Hive table). Secondary source table: a table unloaded from a direct source table. Intermediate table: a data table dedicated to storing intermediate calculation results. Result table: a result table supporting business applications.

S303, the first device determines a target processing strategy based on the association relationship between the type of the data table and the processing strategy of the data table.

Since the first device is configured with the association relationship between the type of the data table and the processing policy of the data table in advance, the first device can determine the target processing policy based on the type of the data table.

The embodiment of the present application does not limit the specific implementation manner of the association relationship, configured in the first device, between the type of the data table and the processing policy of the data table. In some embodiments, the processing strategies corresponding to the direct source table, the secondary source table, the intermediate table, the result table, the temporary table, and other tables are shown in table 2.

TABLE 2

Wherein, "permanent": the table was built less than half a year ago, or has access records within the last half year, and the whole table is permanently kept online. "day 0": the table has had no access record for more than half a year, and the whole table is archived.
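The two labels explained above can be expressed as a small rule. This sketch covers only the "permanent" and "day 0" definitions, not the contents of table 2; the function name, the approximation of half a year as 183 days, and the example dates are assumptions.

```python
from datetime import datetime, timedelta

# Assumption: "half a year" approximated as 183 days.
HALF_YEAR = timedelta(days=183)


def retention_label(created_at, last_access_at, now):
    """Return "permanent" (keep the whole table online) or "day 0" (archive it)."""
    if now - created_at < HALF_YEAR:
        return "permanent"   # built less than half a year ago
    if last_access_at is not None and now - last_access_at <= HALF_YEAR:
        return "permanent"   # accessed within the last half year
    return "day 0"           # no access for more than half a year


now = datetime(2021, 1, 1)
recent_table = retention_label(datetime(2020, 10, 1), None, now)
used_table = retention_label(datetime(2019, 1, 1), datetime(2020, 11, 1), now)
idle_table = retention_label(datetime(2019, 1, 1), datetime(2020, 1, 1), now)
```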

In the embodiment of the application, the first device is preconfigured with the association between the basic data information and the type of the data table and the association between the type of the data table and the processing policy of the data table, so that the first device can determine the target processing policy based on the acquired basic data information of the data table. Therefore, specific recommendation of data life cycle management is given by effectively combining specific conditions of the data table.

Based on step S101, the first device may calculate when data lifecycle management for the data table should be initiated. Based on step S102, the first device may determine whether the target processing policy for the data table is data archiving or data cleaning. In order to ensure that data archiving or data cleaning is reasonable and compliant, data lifecycle management requires strict process management and control by developers.

Next, with reference to fig. 5, a specific implementation process of the data lifecycle management method participated by the developer is described.

Referring to fig. 5, fig. 5 is a flowchart illustrating a data lifecycle management method according to an embodiment of the present application.

As shown in fig. 5, a data lifecycle management method according to an embodiment of the present application may include:

S401, the first device calculates the data active duration of a data table based on the access condition of the data table.

S402, when the creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table.

S401 and S402 are similar to the implementation manners of S101 and S102 in the embodiment of fig. 2, and are not described herein again in this embodiment of the application.

S403, the first device sends a first request to the second device, wherein the first request is used for requesting to trigger a target processing strategy.

The first device can send a first request for triggering the target processing strategy to the second device, so that a developer can timely know that data archiving or data cleaning needs to be carried out on the data table at present. The present application does not limit specific implementation manners of the first request and the second device.

In some embodiments, the first device and the second device are the same device, for example, both the first device and the second device are a server, and the developer may obtain the first request in the form of a prompt message, a web interface, or the like through the server.

In other embodiments, the first device and the second device are different devices, for example, the first device is a server, the second device is a terminal device, the server sends the first request to the terminal device, and the developer can obtain the first request in the form of a short message, a reminding interface of an application program, a web interface, and the like through the terminal device.

S404, the first device receives a first response from the second device.

S405, the first device judges whether the first response indicates that the developer approves the target processing strategy to be feasible.

The developer may approve the feasibility of the target processing strategy through the second device. And the second equipment carries the approval result in the first response and sends the approval result to the first equipment. Thus, the first device may determine whether the first response indicates that the developer approved the target processing policy as viable.

If yes, the first device may perform step S406; if not, the first device may perform step S407.

S406, the first device performs data life cycle management on the data table according to the target processing strategy.

When the first response indicates that the developer approves the target processing strategy as feasible, the first device can perform data life cycle management on the data table according to the target processing strategy.

S407, the first device updates the data activity duration to the data activity duration carried in the first response, and executes the steps S402-S405.

When it is determined that the first response indicates that the developer does not approve the target processing policy, the first response may carry an updated data active duration, for example, a duration input by the developer or a default duration. Thus, the first device may update the data active duration of the data table to the updated data active duration. The first device then executes steps S402-S405 again until it determines that the developer approves the target processing policy, after which the first device may perform data lifecycle management on the data table according to the target processing policy.
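The approval loop of steps S402 to S407 can be sketched as follows. The callables stand in for the policy selection of S402, the request and response exchange with the second device, and the actual archiving or cleaning; they, the `max_rounds` safety limit, and the example values are assumptions made for illustration.

```python
def approval_loop(creation_days, active_days, choose_policy, ask_developer,
                  apply_policy, max_rounds=10):
    """Repeat S402-S405 until the developer approves; return the applied policy or None."""
    for _ in range(max_rounds):
        if creation_days <= active_days:
            return None                   # S402 condition not met: table still active
        policy = choose_policy()          # S402: target processing policy
        approved, new_active_days = ask_developer(policy, active_days)  # S403-S405
        if approved:
            apply_policy(policy)          # S406: perform data lifecycle management
            return policy
        active_days = new_active_days     # S407: use the duration from the response
    return None


# Hypothetical run: the first response rejects and carries 120 days, the second approves.
applied = []
replies = iter([(False, 120.0), (True, None)])
result = approval_loop(
    creation_days=200.0,
    active_days=95.0,
    choose_policy=lambda: "archive",
    ask_developer=lambda policy, days: next(replies),
    apply_policy=applied.append,
)
```

If the duration carried in the response exceeds the creation duration of the table, the condition of S402 no longer holds and no policy is applied in that round.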

In the embodiment of the application, the first device further determines or effectively corrects the data activity duration of the data table according to the actual management requirements of enterprises or regulations on the data and by combining the approval management process of developers, so that the safety of the data is ensured, and the integrity of the life cycle management of the data is ensured.

Exemplarily, the embodiment of the present application further provides a data lifecycle management apparatus.

Referring to fig. 6 to 7, fig. 6 is a schematic structural diagram of a data lifecycle management apparatus according to an embodiment of the present application.

The data lifecycle management apparatus according to the embodiment of the present application may be disposed in a server, and may implement the operations corresponding to the first device in the above data lifecycle management method embodiments. As shown in fig. 6, the apparatus may include:

the calculation module 101 is configured to calculate a data active duration of a data table based on an access condition of the data table;

a determining module 102, configured to determine a target processing policy based on a type of the data table when the determining module 103 determines that the creation duration of the data table is longer than the active duration of the data;

and the management module 104 is used for performing data life cycle management on the data table according to the target processing strategy.

In some embodiments, the calculation module 101 is specifically configured to obtain a creation time of a data table, and a creation time and a latest access time of each partition table in the data table; when the creating time length of the data table is longer than the preset time length, determining the difference between the creating time length of each partition table and the latest access time as a difference sequence; performing density clustering processing on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access duration of the partition table of the same type in the data table; calculating a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table; and determining the maximum boundary point as the data active duration.

In some embodiments, the calculation module 101 is specifically configured to perform: step 1, inputting the difference sequence L, the preset neighborhood radius Eps, and the preset minimum neighborhood point number MinPts required for a given point to become a core point in its neighborhood into a preset algorithm module, so as to output the result sequence R and the reachable distance rd of each partition table difference sample; and step 2, outputting the clustering result C corresponding to the plurality of core points based on the result sequence R.

In some embodiments, the calculation module 101 is specifically configured to perform the following steps. Step 1.1: selecting a sample point which is not in the result sequence R and is a core point, finding all directly density-reachable sample points of the sample point, if the sample point does not exist in the result sequence R, putting the sample point into an ordered sequence Q, and sorting the sample points from small to large according to the reachable distance rd, wherein a sample point is the active duration of one partition table; step 1.2: if the ordered sequence Q is empty, executing step 1.1, and if the ordered sequence Q is not empty, taking the first sample point m from the ordered sequence Q and storing the taken sample point m into the result sequence R; step 1.3: iterating step 1.2 until the algorithm is finished, and outputting the result sequence R and the reachable distance rd of each partition table difference sample.

In some embodiments, the calculation module 101 is specifically configured to perform the following steps 2.1: sequentially taking out sample points from the result sequence R; step 2.2: and if the core distance of the sample point is greater than the given neighborhood radius Eps, determining the sample point as a noise point, and neglecting the sample point, otherwise, determining that the sample point belongs to a new cluster, and jumping to the step 2.1. Step 2.3: and finishing traversing the result sequence R to output a clustering result C corresponding to the plurality of core points.

In some embodiments, the management module 104 is further configured to stop performing data lifecycle management on the data table when the determining module 103 determines that the creation duration of the data table is less than or equal to the preset duration.

In some embodiments, the determining module 102 is specifically configured to identify basic data information of the data table, where the basic data information includes: whether the structure of the data table is consistent with that of its first-level upstream data table, the access condition of the data table within a preset duration, the same-library out-degree and in-degree of the data table, the cross-library out-degree and in-degree of the data table, and the other out-degree and in-degree of the data table; determine the type of the data table based on the association relationship between the basic data information and the type of the data table; and determine a target processing strategy based on the association relationship between the type of the data table and the processing strategy of the data table.

In some embodiments, the types of data tables include: direct source table, secondary source table, intermediate table, result table, temporary table, and other tables.

In some embodiments, the management module 104 is specifically configured to send a first request to the second device, where the first request is used to request to trigger the target processing policy; receiving a first response from the second device; and when the judging module 103 determines that the first response indicates that the developer approves the target processing strategy is feasible, performing data life cycle management on the data table according to the target processing strategy.

As shown in fig. 7, the data lifecycle management apparatus, based on the apparatus structure shown in fig. 6, may further include:

an updating module 105, configured to update the data active duration to the data active duration carried in the first response when the determining module 103 determines that the first response indicates that the developer does not approve the target processing policy as feasible, and to trigger the following steps again: the determining module 102 determines the target processing policy based on the type of the data table when the determining module 103 determines that the creation duration of the data table is greater than the data active duration; and the management module 104 performs data life cycle management on the data table according to the target processing policy when the determining module 103 determines that the first response indicates that the developer approves the target processing policy as feasible.

In some embodiments, the first device and the second device are the same device or different devices.

In the embodiment of the present application, the data lifecycle management apparatus may be divided into functional modules according to the above method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of the modules in the embodiments of the present application is schematic and is only a division by logical function; other division manners are possible in actual implementation.

The data lifecycle management apparatus according to the embodiment of the present application may be configured to execute the technical solution of the first device in the aforementioned data lifecycle management method, and the implementation principle and the technical effect are similar, where operations for implementing each module may further refer to the relevant description of the method embodiment, and are not described herein again. The modules herein may also be replaced with components or circuits.

Illustratively, an embodiment of the present application further provides an electronic device, including: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke program instructions in the memory to cause the electronic device to perform the data lifecycle management method of the previous embodiments.

Illustratively, the present application further provides a computer storage medium, which includes computer instructions, when the computer instructions are run on an electronic device, the electronic device is caused to execute the data lifecycle management method in the foregoing embodiments.

Illustratively, the embodiments of the present application further provide a computer program product, which when running on a computer, causes the computer to execute the data lifecycle management method in the foregoing embodiments.

Illustratively, an embodiment of the present application provides a chip system, which includes: a processor; when the processor executes the computer instructions stored in the memory, the electronic device performs the data lifecycle management method of the previous embodiments.

In the above-described embodiments, all or part of the functions may be implemented by software, hardware, or a combination of software and hardware. When implemented in software, the functions may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.

Claims

1. A method for data lifecycle management, the method comprising:

a first device calculates the data active duration of a data table based on the access condition of the data table;

when the creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table;

and the first device performs data life cycle management on the data table according to the target processing strategy.

2. The method of claim 1, wherein the first device calculating the data active duration of a data table based on the access condition of the data table comprises:

the first device obtains the creation time of the data table, the creation time and the latest access time of each partition table in the data table;

when the creation duration of the data table is longer than a preset duration, the first device determines the difference between the creation time and the latest access time of each partition table, and takes the differences as a difference sequence;

the first device carries out density clustering processing on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access duration of the partition table of the same type in the data table;

the first device calculates a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table;

the first device determines the maximum boundary point as the data active duration.
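Illustratively, the calculation recited in claim 2 may be sketched as follows. This is a minimal Python sketch, not the claimed algorithm module: it swaps in a plain one-dimensional density test (a sample is a core point when at least `min_pts` samples, itself included, lie within `eps` of it) in place of the ordered result sequence R and reachable distance rd of claims 3 to 5. It also assumes that the maximum boundary point is the largest sample lying within `eps` of the largest core point. The function names and the use of days as the unit are likewise assumptions.

```python
from datetime import datetime


def difference_sequence(partitions):
    """Days between each partition table's creation time and latest access time.

    `partitions` is a list of (creation_time, latest_access_time) datetime pairs.
    """
    return [(last_access - created).days for created, last_access in partitions]


def core_points(diffs, eps, min_pts):
    """Samples having at least min_pts samples (itself included) within eps."""
    return [
        d for d in diffs
        if sum(1 for other in diffs if abs(other - d) <= eps) >= min_pts
    ]


def active_duration(diffs, eps, min_pts):
    """Largest sample within eps of the largest core point, or None if no core point exists."""
    cores = core_points(diffs, eps, min_pts)
    if not cores:
        return None
    largest_core = max(cores)
    # Assumed reading of "maximum boundary point": the farthest sample that is
    # still directly density-reachable from the largest core point.
    return max(d for d in diffs if abs(d - largest_core) <= eps)
```

For the sample sequence `[1, 2, 2, 3, 30, 31, 32, 34, 200]` with `eps=2` and `min_pts=3`, the largest core point is 32, the isolated value 200 is treated as noise, and the sketch returns 34 as the data active duration.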

3. The method according to claim 2, wherein the step of performing a density clustering process on the difference sequence by the first device to obtain a plurality of core points comprises:

4. The method of claim 3, wherein the first device inputs the difference sequence L, the predetermined neighborhood radius Eps and the predetermined minimum neighborhood point number MinPts of the given point as a core point in the neighborhood into a predetermined algorithm module to output the result sequence R and the reachable distance rd of each partition table difference sample, comprising:

5. The method according to claim 3, wherein outputting clustering results C corresponding to a plurality of core points based on the result sequence R comprises:

step 2.1: sequentially taking out sample points from the result sequence R;

6. The method of claim 2, further comprising:

and when the creation duration of the data table is less than or equal to the preset duration, the first device stops performing data life cycle management on the data table.

7. The method of claim 1, wherein the first device determining a target processing strategy based on the type of the data table comprises:

the first device identifies basic data information of the data table, wherein the basic data information comprises: whether the structures of the data table and a first-level upstream data table are consistent or not, the access condition of the data table in a preset time length, the same-base output degree and the same-base input degree of the data table, different-base output degrees and different-base input degrees of the data table, and other output degrees and other input degrees of the data table;

the first device determines the type of the data table based on the association relationship between the basic data information and the type of the data table;

the first device determines the target processing strategy based on the association relationship between the type of the data table and the processing strategy of the data table.
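Illustratively, the two-stage lookup recited in claim 7 (basic data information to table type, then table type to processing strategy) may be sketched as follows. This is a minimal Python sketch: every rule in `TYPE_RULES` and every entry in `STRATEGY_BY_TYPE` is a placeholder chosen for illustration and is not the association relationship defined in this application. The `BasicInfo` fields are also simplified assumptions, with the same-base, different-base, and other degrees collapsed into a single in-degree and out-degree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicInfo:
    """Simplified, hypothetical view of the basic data information."""
    same_structure_as_upstream: bool  # structure matches the first-level upstream table
    accessed_recently: bool           # accessed within the preset duration
    in_degree: int                    # assumed total in-degree of the table
    out_degree: int                   # assumed total out-degree of the table


# Placeholder rules, evaluated in order; the first matching rule decides the type.
TYPE_RULES = [
    (lambda i: i.in_degree == 0 and i.out_degree > 0, "direct source table"),
    (lambda i: i.same_structure_as_upstream, "secondary source table"),
    (lambda i: i.in_degree > 0 and i.out_degree > 0, "intermediate table"),
    (lambda i: i.in_degree > 0 and i.accessed_recently, "result table"),
    (lambda i: not i.accessed_recently, "temporary table"),
]

# Placeholder mapping from table type to processing strategy.
STRATEGY_BY_TYPE = {
    "direct source table": "archive",
    "secondary source table": "archive",
    "intermediate table": "clean",
    "result table": "archive",
    "temporary table": "clean",
    "other table": "archive",
}


def classify(info: BasicInfo) -> str:
    """Return the first matching table type, or "other table" if no rule matches."""
    for matches, table_type in TYPE_RULES:
        if matches(info):
            return table_type
    return "other table"


def target_strategy(info: BasicInfo) -> str:
    """Look up the processing strategy associated with the table's type."""
    return STRATEGY_BY_TYPE[classify(info)]
```

Keeping both association relationships as data rather than branching logic means either one can be revised without touching the lookup functions.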

8. The method of claim 1, wherein the type of the data table comprises: direct source table, secondary source table, intermediate table, result table, temporary table, and other tables.

9. The method of any of claims 1-8, wherein the first device performing data lifecycle management on the data table according to the target processing strategy comprises:

the first device sends a first request to a second device, wherein the first request is used for requesting to trigger the target processing strategy;

the first device receiving a first response from the second device;

and when the first device determines that the first response indicates that the developer considers the target processing strategy feasible, the first device performs data life cycle management on the data table according to the target processing strategy.

10. The method of claim 9, further comprising:

when determining that the first response indicates that the developer considers the target processing strategy infeasible, the first device updates the data active duration to the data active duration carried in the first response, and again performs the steps of: determining a target processing strategy based on the type of the data table when the creation duration of the data table is longer than the data active duration; and performing data life cycle management on the data table according to the target processing strategy when determining that the first response indicates that the developer considers the target processing strategy feasible.

11. An apparatus for data lifecycle management, the apparatus comprising:

a determining module, configured to determine a target processing strategy based on the type of the data table when the judging module determines that the creation duration of the data table is longer than the data active duration;

12. An electronic device, comprising: a memory and a processor;

the memory is configured to store program instructions;

the processor is configured to invoke program instructions in the memory to cause the electronic device to perform the data lifecycle management method of any of claims 1-10.

13. A computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the data lifecycle management method of any of claims 1-10.

14. A computer program product, which, when run on a computer, causes the computer to perform the data lifecycle management method of any of claims 1-10.