CN113064930B

CN113064930B - Cold and hot data identification method and device of data warehouse and electronic equipment

Info

Publication number: CN113064930B
Application number: CN202011603968.5A
Authority: CN
Inventors: 邓娟; 刘晓斌; 董宇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guizhou Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guizhou Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2023-04-28
Anticipated expiration: 2040-12-29
Also published as: CN113064930A

Abstract

The application discloses a method and a device for identifying cold and hot data in a data warehouse, and an electronic device, which aim to at least solve the problems of low identification efficiency and low accuracy of cold and hot data identification methods in the prior art. The method comprises the following steps: acquiring a historical access record of a service data table in a data warehouse in a specified statistical period; determining a processing period of the service data table and access frequency information in the specified statistical period based on the historical access record in the specified statistical period; determining a cold and hot data demarcation time point of the service data table in the specified statistical period based on a pre-established cold and hot data identification model, the processing period of the service data table, the access frequency information in the specified statistical period, and the specified statistical period; and identifying cold data and hot data in the service data table based on the cold and hot data demarcation time point.

Description

Cold and hot data identification method and device of data warehouse and electronic equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying cold and hot data in a data warehouse, and an electronic device.

Background

A data warehouse (DW) is a collection of data stores created for enterprise analytical reporting and decision support, and is used to filter and integrate diverse business data. A data warehouse typically holds a large volume of business data of many types. Effectively identifying the cold and hot attributes of the business data in the data warehouse, so that data with different attributes can be separated, stored, backed up, or destroyed, is therefore particularly important for data warehouse management.

At present, the cold and hot attributes of service data in a data warehouse are identified mainly by manually counting the accesses to each piece of service data one by one and distinguishing cold data from hot data according to the counts. For example, service data that is accessed frequently is treated as hot data, and other service data is treated as cold data. Such methods depend mainly on the experience of service personnel, so the accuracy of the identification result cannot be guaranteed. In addition, as the service scale of the data warehouse grows by orders of magnitude, the volume of service data in the data warehouse also increases rapidly, and counting the accesses manually one by one reduces identification efficiency.

Disclosure of Invention

The embodiment of the application provides a cold and hot data identification method and device of a data warehouse and electronic equipment, and aims to at least solve the problems of low identification efficiency and low accuracy of the cold and hot data identification method in the prior art.

To solve the above technical problems, the embodiments of the present application adopt the following technical solutions:

In a first aspect, an embodiment of the present application provides a method for identifying cold and hot data in a data warehouse, including:

acquiring a historical access record of a business data table in a data warehouse in a specified statistical period;

determining a processing period of the service data table and access frequency information in the appointed statistical period based on the historical access record in the appointed statistical period;

determining a cold and hot data demarcation time point of the service data table in the appointed statistical period based on a pre-established cold and hot data identification model, a processing period of the service data table, access frequency information in the appointed statistical period and the appointed statistical period;

and identifying cold data and hot data in the service data table based on the cold and hot data demarcation time point.

Optionally, the access frequency information includes a total access frequency and an access frequency within each sub-period of the specified statistical period;

determining a cold and hot data demarcation time point of the service data table in the specified statistical period based on a pre-established cold and hot data identification model, the processing period of the service data table, the access frequency information in the specified statistical period, and the specified statistical period includes:

inputting the processing period, the specified statistical period and the total access frequency in the specified statistical period of the service data table into the cold and hot data identification model to obtain the cold and hot data proportion of the service data table, wherein the cold and hot data identification model is obtained by training a sample data table by taking the processing period, the total access frequency in the statistical period and the statistical period of the sample data table as training samples and the cold and hot data proportion of the sample data table as labels based on a preset first classification algorithm;

determining an access frequency threshold value matched with the service data table based on the cold-hot data proportion and the total access frequency of the service data table in the appointed statistical period;

and determining a cold and hot data demarcation time point of the service data table in the appointed statistical period based on the access frequency threshold value and the access frequency of the service data table in each sub-period of the appointed statistical period.

Optionally, determining the hot and cold data demarcation time point of the service data table in the specified statistical period based on the access frequency threshold and the access frequency of the service data table in each sub-period of the specified statistical period includes:

based on the total access frequency in the appointed statistical period, respectively determining the accumulated access frequency corresponding to each sub-period, wherein the accumulated access frequency corresponding to the sub-period refers to the accumulated access frequency from the starting time point of the appointed statistical period to the ending time point of the sub-period;

selecting a corresponding sub-period of which the accumulated access frequency reaches the access frequency threshold value from the sub-periods of the specified statistical period;

and determining the selected sub-period as a cold and hot data demarcation time point of the service data table in the appointed statistical period.

Selecting a sub-period of which the corresponding accumulated access frequency reaches the access frequency threshold value from all sub-periods of the appointed statistical period as a candidate sub-period;

determining, for each sub-period adjacent to the candidate sub-period, the difference between its accumulated access frequency and the accumulated access frequency corresponding to the candidate sub-period, and the difference between its accumulated access frequency and the access frequency threshold;

selecting, from the sub-periods adjacent to the candidate sub-period, the sub-period whose accumulated access frequency differs least from that of the candidate sub-period and whose difference from the access frequency threshold is within a set range;

and determining a cold and hot data demarcation time point of the service data table in the appointed statistical period based on the selected sub-period.
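As a minimal sketch of the basic selection described above (accumulating the per-sub-period access frequencies from the start of the specified statistical period and taking the first sub-period whose running total reaches the access frequency threshold), the following Python may help. The function name and the sample monthly figures are hypothetical, and the refinement that compares adjacent sub-periods is not covered.

```python
from itertools import accumulate
from typing import Optional, Sequence


def find_demarcation_index(sub_period_freqs: Sequence[float],
                           threshold: float) -> Optional[int]:
    """Return the index of the first sub-period whose accumulated access
    frequency (counted from the start of the statistical period) reaches
    the threshold, or None if the threshold is never reached."""
    for index, cumulative in enumerate(accumulate(sub_period_freqs)):
        if cumulative >= threshold:
            return index
    return None


# Hypothetical monthly access counts for one table over a 12-month period.
monthly_freqs = [40, 60, 80, 100, 120, 130, 140, 150, 160, 130, 140, 130]
demarcation = find_demarcation_index(monthly_freqs, threshold=276)
```

With these illustrative figures the running totals are 40, 100, 180, 280, and so on, so the fourth sub-period (index 3) is the first to reach a threshold of 276.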

Optionally, the history access record includes a point in time of a write operation performed on the service data table within the specified statistical period;

determining a processing period of the service data table based on the historical access record in the specified statistical period, including:

based on the time points of the write operations executed on the service data table in the specified statistical period, determining the average writing period between the last write operation executed on the service data table and each of the other write operations;

determining an average writing period of the service data table based on the average writing periods between the last write operation and each of the other write operations;

and determining the processing period of the service data table based on a preset corresponding relation between the average writing period and the processing period of the data table and the average writing period of the service data table.

Optionally, before determining the processing period of the service data table, the method further includes:

acquiring a historical access record and a processing period of a sample data table in a statistical period;

determining an average writing period of the sample data table based on the history access record in the statistical period;

and training based on a preset second classification algorithm by taking the average writing period of the sample data table as a training sample and the processing period of the sample data table as a label of the training sample so as to obtain the preset corresponding relation.

Optionally, before determining the cold and hot data demarcation time point of the service data table in the specified statistical period based on the pre-established cold and hot data identification model and the processing period of the service data table, the specified statistical period and the access frequency information in the specified statistical period, the method further comprises:

acquiring a historical access record and a cold and hot data proportion of a sample data table in a statistical period;

determining a processing period of the sample data table and an accumulated access frequency in the statistical period based on the historical access record;

and training based on the first classification algorithm by taking the processing period of the sample data table, the accumulated access frequency in the statistical period and the statistical period as training samples and taking the cold and hot data proportion of the sample data table in the statistical period as a label of the training samples so as to obtain the cold and hot data identification model.

In a second aspect, an embodiment of the present application provides a cold and hot data identification apparatus for a data warehouse, including:

the first acquisition module is used for acquiring a historical access record of a business data table in the data warehouse in a specified statistical period;

the first determining module is used for determining the processing period of the service data table and the access frequency information in the appointed statistic period based on the historical access record;

the second determining module is used for determining cold and hot data demarcation time points of the service data table in the appointed statistical period based on a pre-established cold and hot data identification model, a processing period of the service data table, access frequency information in the appointed statistical period and the appointed statistical period, wherein the cold and hot data identification model is obtained by training the processing period of the sample data table, the access frequency information in the statistical period and the statistical period based on a first classification algorithm;

and the identification module is used for identifying the cold data and the hot data in the service data table based on the cold and hot data demarcation time point.

the second determining module is specifically configured to:

Optionally, the second determining module is specifically configured to:

the first determining module is specifically configured to:

Optionally, the apparatus further comprises:

the second acquisition module is used for acquiring a historical access record and a processing period of the sample data table in a statistical period;

a third determining module, configured to determine an average writing period of the sample data table based on the history access record in the statistical period;

the first training module is used for training based on a preset second classification algorithm by taking the average writing period of the sample data table as a training sample and the processing period of the sample data table as a label of the training sample so as to obtain the preset corresponding relation.

Optionally, the apparatus further comprises:

the third acquisition module is used for acquiring a historical access record and a cold and hot data proportion of the sample data table in a statistical period;

a fourth determining module, configured to determine, based on the history access record, a processing period of the sample data table and an accumulated access frequency in the statistical period;

the second training module is used for training based on the first classification algorithm by taking the processing period of the sample data table, the accumulated access frequency in the statistical period and the statistical period as training samples and taking the cold and hot data proportion of the sample data table in the statistical period as the label of the training samples so as to obtain the cold and hot data identification model.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first aspect.

At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:

The historical access records of a service data table in a data warehouse in a specified statistical period are analyzed to determine the processing period of the service data table and the access frequency information in the specified statistical period. The cold and hot data demarcation time point of the service data table in the specified statistical period is then determined based on a pre-established cold and hot data identification model, the determined processing period, the access frequency information in the specified statistical period, and the specified statistical period. Finally, the cold data and the hot data in the service data table are identified based on the cold and hot data demarcation time point. The whole process requires no manual participation, so the cold and hot data are identified automatically, with higher accuracy and efficiency than the manual counting and identification used in the prior art. In addition, because the cold and hot data are identified per service data table, based on the cold and hot data demarcation time point of that table, the efficiency is higher than that of counting and identifying the service data in the data warehouse one by one.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a flowchart of a method for identifying cold and hot data in a data warehouse according to an embodiment of the present application;

FIG. 2 is a flowchart of another method for identifying cold and hot data in a data warehouse according to an embodiment of the present application;

FIG. 3 is a schematic diagram of access frequency distribution of a service data table in a data warehouse in a specified statistical period according to an embodiment of the present application;

FIG. 4 is a flowchart of a training method of a cold and hot data identification model according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a cold and hot data identification device of a data warehouse according to an embodiment of the present application.

Detailed Description

To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Example 1

Referring to FIG. 1, an embodiment of the present application provides a method for identifying cold and hot data in a data warehouse. The method may be performed by an electronic device, for example a server. As shown in FIG. 1, the method comprises the following steps:

S12, acquiring a historical access record of a service data table in the data warehouse in a specified statistical period.

The service data in the data warehouse is distributed across different service data tables according to the service information of the corresponding service.

The specified statistical period may be any historical period of time prior to the current time. In practical applications, the specified statistical period may be a particular day before the current time, such as October 1 or Monday; a certain period of time before the current time, such as yesterday morning; a certain month before the current time, such as January or February; or a span of years before the current time, such as 2017 to 2019.

It should be noted that, the granularity of the specified statistical period may be set in a customized manner according to the actual service requirement, which is not specifically limited in the embodiment of the present application. For example, for a service data table corresponding to a service with a higher timeliness requirement, the specified statistical period may be one month; for the service data table corresponding to the service with low timeliness requirement, the specified statistical period can be one year, and the like.

Access logs are usually generated whenever the data warehouse is accessed, and the historical access records of a service data table in the specified statistical period can be obtained by classifying and consolidating these logs. Access operations on the data warehouse may include, but are not limited to, writing data, deleting data, and modifying data. The historical access record may include, but is not limited to, the type and the time point of each access operation performed on the service data table.
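The classification and consolidation of access logs can be sketched as follows. This is only an illustration under stated assumptions: the `AccessLogEntry` record, its field names, and the `collect_history` helper are hypothetical, and the statistical period is treated as the half-open interval from `start` (inclusive) to `end` (exclusive).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AccessLogEntry:
    table: str           # name of the service data table that was accessed
    operation: str       # type of access, e.g. "write", "delete", "modify"
    timestamp: datetime  # time point of the access operation


def collect_history(entries: Iterable[AccessLogEntry],
                    start: datetime,
                    end: datetime) -> Dict[str, List[AccessLogEntry]]:
    """Group log entries by table, keeping only those whose timestamp
    satisfies start <= timestamp < end, sorted by time within each table."""
    history: Dict[str, List[AccessLogEntry]] = defaultdict(list)
    for entry in entries:
        if start <= entry.timestamp < end:
            history[entry.table].append(entry)
    for records in history.values():
        records.sort(key=lambda e: e.timestamp)
    return dict(history)
```

Each value of the returned dictionary then plays the role of the historical access record of one service data table in the specified statistical period.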

S14, determining the processing period of the service data table and the access frequency information in the specified statistical period based on the historical access record in the specified statistical period.

By counting and analyzing the historical access records of the service data table in the specified statistical period, the processing period of the service data table and the access frequency information in the specified statistical period can be determined.

In the embodiment of the present application, the processing period of the service data table characterizes the time pattern with which the service data in the service data table is processed. This time pattern is generally consistent with the data writing operations on the service data table. For example, if data is generally written to the service data table in units of days, the processing period of the service data table is also generally in units of days; if data is written in units of months, the processing period is also typically in units of months, and so on. Based on this, the processing period of the service data table can be determined from the time points of the write operations performed on the service data table within the specified statistical period.

Specifically, in an optional solution, determining the processing period of the service data table may include the following steps:

First, the average writing period between the last write operation performed on the service data table and each of the other write operations may be determined based on the time points of the write operations performed on the service data table within the specified statistical period.

More specifically, the average writing period between the last write operation and each of the other write operations can be determined by equation (1).

Where T_i represents the average writing period between the last write operation and the i-th write operation; A_n represents the time point of the last write operation; A_i represents the time point of the i-th write operation; and n represents the number of write operations performed on the service data table within the specified statistical period.
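Equation (1) itself is not reproduced in this text. One form consistent with the symbol definitions above, given here as an assumption rather than as the original formula, divides the time elapsed between the i-th write and the last write by the number of write intervals between them:

```latex
T_i = \frac{A_n - A_i}{n - i}, \qquad i = 1, 2, \ldots, n-1 \tag{1}
```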

Then, based on the average writing period between the last writing operation and other writing operations, the average writing period of the service data table is determined.

More specifically, the average writing period of the service data table can be determined by the following formula (2).

Where TableCycle represents the average writing period of the service data table.
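Equation (2) is likewise not reproduced in this text. Since TableCycle is described as being determined from the average writing periods T_i between the last write operation and each of the other n - 1 write operations, one natural reading, stated here as an assumption, is their arithmetic mean:

```latex
\mathrm{TableCycle} = \frac{1}{n-1} \sum_{i=1}^{n-1} T_i \tag{2}
```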

And finally, determining the processing period of the service data table based on the preset corresponding relation between the average writing period and the processing period of the data table and the average writing period of the service data table.

For example, table 1 shows an example of a preset correspondence between the average write period and the processing period of the data table.

TABLE 1

Average writing period of data table (TableCycle)	Processing period
1 hour < TableCycle < 12 hours	Hour
12 hours < TableCycle < 24 hours	Day
24 hours < TableCycle < 360 hours	Month
360 hours < TableCycle < 4320 hours	Year
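The lookup in Table 1 can be sketched in Python as below. Two assumptions are made and should be read as such: the average writing period is computed as the mean of T_i = (A_n - A_i) / (n - i) over the first n - 1 writes, which is one reading of the symbol definitions rather than a confirmed formula, and the range boundaries are taken as read from Table 1. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def average_write_period_hours(write_times: List[datetime]) -> float:
    """Assumed reading: T_i = (A_n - A_i) / (n - i), averaged over i < n."""
    times = sorted(write_times)
    n = len(times)
    if n < 2:
        raise ValueError("at least two write operations are required")
    last = times[-1]
    periods = []
    for index, t in enumerate(times[:-1]):
        intervals = n - 1 - index  # equals n - i for the 1-based i-th write
        elapsed_hours = (last - t).total_seconds() / 3600.0
        periods.append(elapsed_hours / intervals)
    return sum(periods) / len(periods)


def processing_period(table_cycle_hours: float) -> Optional[str]:
    """Map an average writing period to a processing period per Table 1.
    Values outside the listed open ranges (including the exact boundaries)
    are left unmapped and yield None."""
    if 1 < table_cycle_hours < 12:
        return "hour"
    if 12 < table_cycle_hours < 24:
        return "day"
    if 24 < table_cycle_hours < 360:
        return "month"
    if 360 < table_cycle_hours < 4320:
        return "year"
    return None


# Hypothetical table that is written once every 6 hours.
base = datetime(2021, 1, 1)
writes = [base + timedelta(hours=6 * k) for k in range(5)]
cycle = average_write_period_hours(writes)
```

Because Table 1 uses strict inequalities, a real deployment would need to decide how boundary values such as exactly 12 or 24 hours are handled; the sketch simply leaves them unmapped.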

It should be noted that, because the service attributes of different services are different, the average writing period of the corresponding service data table is also different, so that the preset corresponding relationship between the average writing period and the processing period of the data table can be self-defined according to the actual service requirement. For example, for the processing of service data tables in a data warehouse in the telecommunications industry, the timeliness requirement for service data is not high, so the processing period of a service data table with a writing period of less than 12 hours can be determined as hours, the processing period of a service data table with a writing period of 12-24 hours can be determined as days, and so on. As another example, for the financial industry, securities industry, etc., since the timeliness requirement of the service data is high, the processing period of the service data table with the writing period of 1-2 hours can be determined as hours, the processing period of the service data table with the writing period of more than 2 hours can be determined as days, etc.

To make the determined processing period of the service data table more accurate, and thereby further improve the accuracy of the subsequent identification of cold and hot data in the service data table, in the embodiment of the present application the correspondence between the average writing period and the processing period of a data table may also be obtained by learning from the average writing periods and processing periods of a large number of sample data tables. Specifically, before determining the processing period of the service data table in step S14, the method for identifying cold and hot data in the data warehouse according to the embodiment of the present application may further include: first, acquiring a historical access record and a processing period of a sample data table in a statistical period; next, determining an average writing period of the sample data table based on the historical access record in the statistical period; and finally, training based on a preset second classification algorithm, with the average writing period of the sample data table as a training sample and the processing period of the sample data table as the label of the training sample, to obtain the correspondence between the average writing period and the processing period of a data table.

The sample data table may be a service data table in the data warehouse whose processing period is known. The preset second classification algorithm may include, for example, one or a combination of the following algorithms: a decision tree algorithm, a Bayesian algorithm, an artificial neural network algorithm, a k-nearest neighbor (KNN) algorithm, and the like.

It should be noted that determining the average writing period of the sample data table based on the historical access record of the sample data table is substantially the same as determining the average writing period of the service data table in step S14 above. For details, refer to the description of determining the average writing period of the service data table in step S14, which is not repeated here.
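As an illustration of learning the correspondence from labeled sample tables, the sketch below uses a one-nearest-neighbor rule, one of the algorithm families listed above. Comparing writing periods on a logarithmic scale is a design choice of this sketch, not something stated in the text, and the function name and sample values are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def train_period_classifier(
        samples: List[Tuple[float, str]]) -> Callable[[float], str]:
    """Build a 1-nearest-neighbor predictor from (average writing period
    in hours, processing period label) pairs."""
    if not samples:
        raise ValueError("at least one labeled sample is required")
    # Assumption of this sketch: distances are measured on log10(hours).
    training = [(math.log10(hours), label) for hours, label in samples]

    def predict(table_cycle_hours: float) -> str:
        x = math.log10(table_cycle_hours)
        _, label = min(training, key=lambda item: abs(item[0] - x))
        return label

    return predict


# Hypothetical sample data tables with known processing periods.
samples = [(2.0, "hour"), (6.0, "hour"), (20.0, "day"),
           (24.5, "day"), (200.0, "month"), (700.0, "year")]
predict = train_period_classifier(samples)
```

The returned `predict` function plays the role of the preset correspondence: given the average writing period of a service data table, it yields a processing period label.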

The access frequency information of the service data table in the specified statistical period is used for representing the access frequency of the service data table in the specified period. The access frequency information of the service data table in the specified statistical period may specifically include, but is not limited to, a total access frequency of the service data table in the specified statistical period and an access frequency in each sub-period of the specified statistical period. The granularity of the sub-period may be set by user-defining according to the granularity of the specified statistical period and the actual service requirement, which is not particularly limited in the embodiment of the present application. For example, if the specified statistical period is one year, the sub-period may be one month; if the specified statistical period is one month, the sub-period may be one day, and so on.

Specifically, the total access frequency of the service data table in the specified statistical period can be determined based on the number of accesses to the service data table in the specified statistical period and the duration of the specified statistical period; the access frequency of the service data table in the sub-period can be determined based on the number of accesses to the service data table in the sub-period and the duration of the sub-period.
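A minimal sketch of this calculation follows, assuming that a frequency is expressed as accesses per hour and that sub-periods have a fixed length (the final sub-period is shortened if the statistical period is not a whole multiple of it). The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def access_frequencies(access_times: List[datetime],
                       start: datetime,
                       end: datetime,
                       sub_period: timedelta) -> Tuple[float, List[float]]:
    """Return (total frequency, per-sub-period frequencies) in accesses
    per hour for the statistical period start <= t < end."""
    if end <= start or sub_period <= timedelta(0):
        raise ValueError("the period and sub-period must have positive length")
    in_period = [t for t in access_times if start <= t < end]
    total_hours = (end - start).total_seconds() / 3600.0
    total_frequency = len(in_period) / total_hours

    sub_frequencies = []
    sub_start = start
    while sub_start < end:
        sub_end = min(sub_start + sub_period, end)
        count = sum(1 for t in in_period if sub_start <= t < sub_end)
        hours = (sub_end - sub_start).total_seconds() / 3600.0
        sub_frequencies.append(count / hours)
        sub_start = sub_end
    return total_frequency, sub_frequencies
```

For a statistical period of one month with daily sub-periods, for instance, the second return value holds one frequency per day.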

S16, determining a cold and hot data demarcation time point of the service data table in a specified statistical period based on a pre-established cold and hot data identification model and a processing period of the service data table, access frequency information in the specified statistical period and the specified statistical period.

In the embodiment of the application, the cold and hot data identification model is a pre-established model for identifying cold and hot data. Optionally, the cold and hot data identification model may be obtained by training, based on a preset first classification algorithm, with the processing period of a sample data table, the access frequency in a statistical period, and the statistical period as training samples, and with the cold and hot data proportion of the sample data table as the label. The first classification algorithm may include, for example, but is not limited to, one or a combination of the following: a decision tree algorithm, a Bayesian algorithm, an artificial neural network algorithm, a k-nearest neighbor (KNN) algorithm, and the like.

Correspondingly, the determined processing period of the service data table, the access frequency in the appointed statistical period and the appointed statistical period can be input into the cold and hot data identification model to obtain the cold and hot data proportion of the service data table, and the cold and hot demarcation time point of the service data table in the appointed statistical period can be further determined based on the cold and hot data proportion. The cold and hot data demarcation time point refers to a time point for dividing cold data and hot data.
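The prediction step can be sketched with a small k-nearest-neighbor vote, one of the algorithm families named for the first classification algorithm. This is an illustration under assumptions: the numeric encoding of the processing period, the use of unscaled features, the choice of k, and all sample values are hypothetical, and the cold and hot data proportion is represented by the cold-data share.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical numeric codes for the processing period categories.
PERIOD_CODES = {"hour": 0.0, "day": 1.0, "month": 2.0, "year": 3.0}

# (processing period, statistical period in days, total access frequency)
Sample = Tuple[str, float, float]


def _features(sample: Sample) -> Tuple[float, float, float]:
    period, stat_days, total_freq = sample
    return (PERIOD_CODES[period], stat_days, total_freq)


def predict_cold_ratio(training: List[Tuple[Sample, float]],
                       query: Sample,
                       k: int = 3) -> float:
    """Return the cold-data proportion chosen by majority vote among the
    k labeled samples closest to the query (squared Euclidean distance
    on unscaled features, kept simple for illustration)."""
    q = _features(query)

    def distance(sample: Sample) -> float:
        return sum((a - b) ** 2 for a, b in zip(_features(sample), q))

    nearest = sorted(training, key=lambda item: distance(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled sample data tables.
training = [
    (("day", 365.0, 1000.0), 0.2),
    (("day", 365.0, 1200.0), 0.2),
    (("day", 365.0, 1500.0), 0.2),
    (("month", 365.0, 50.0), 0.6),
    (("month", 365.0, 80.0), 0.6),
]
ratio = predict_cold_ratio(training, ("day", 365.0, 1380.0))
```

In practice the features would normally be scaled before distances are compared, since the access frequency otherwise dominates the other two inputs.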

It should be noted that the process of training the cold and hot data identification model is described in detail in the embodiment shown in FIG. 4 below and is not elaborated here.

And S18, identifying cold data and hot data in the service data table based on the cold and hot data demarcation time point.

After the cold and hot data demarcation time point of the service data table is determined, the service data in the service data table can be divided according to the demarcation time point to obtain the cold data and the hot data. For example, if the specified statistical period is January 2017 to December 2019 and the cold and hot data demarcation time point determined by the above steps is October 2017, then, in the service data table, the service data generated from January 2017 to October 2017 can be determined as cold data, and the service data generated from November 2017 to December 2019 can be determined as hot data.

According to the cold and hot data identification method for a data warehouse provided by the embodiment of the present application, the historical access records of a service data table in the data warehouse in a specified statistical period are analyzed to determine the processing period of the service data table and the access frequency information in the specified statistical period. The cold and hot data demarcation time point of the service data table in the specified statistical period is then determined based on the pre-established cold and hot data identification model, the determined processing period, the access frequency information in the specified statistical period, and the specified statistical period. Finally, the cold data and the hot data in the service data table are identified based on the cold and hot data demarcation time point. The whole process requires no manual participation, so the cold and hot data are identified automatically, with higher accuracy and efficiency than the manual counting and identification used in the prior art. In addition, because the cold and hot data are identified per service data table, based on the cold and hot data demarcation time point of that table, the efficiency is higher than that of counting and identifying the service data in the data warehouse one by one.
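Dividing a table at the demarcation time point can be sketched as follows. The row representation (an identifier paired with its generation time) and the function name are hypothetical, and the demarcation is passed as the exclusive end of the cold range, so a demarcation of October 2017 corresponds to November 1, 2017.

```python
from datetime import datetime
from typing import Iterable, List, Tuple


def split_cold_hot(rows: Iterable[Tuple[str, datetime]],
                   cold_end: datetime) -> Tuple[List[str], List[str]]:
    """Rows generated before cold_end are cold; all later rows are hot."""
    cold: List[str] = []
    hot: List[str] = []
    for row_id, generated_at in rows:
        (cold if generated_at < cold_end else hot).append(row_id)
    return cold, hot


# Hypothetical rows of one service data table.
rows = [
    ("r1", datetime(2017, 3, 5)),
    ("r2", datetime(2017, 10, 31)),
    ("r3", datetime(2017, 11, 1)),
    ("r4", datetime(2019, 12, 31)),
]
cold_rows, hot_rows = split_cold_hot(rows, cold_end=datetime(2017, 11, 1))
```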

To help those skilled in the art better understand the technical solutions provided by the embodiments of the present application, these solutions are described in detail below.

For the step S16, in an alternative solution, as shown in fig. 2, the step S16 may include:

S161, inputting the processing period of the service data table, the total access frequency in the specified statistical period, and the specified statistical period into a pre-established cold and hot data identification model to obtain the cold-hot data proportion of the service data table.

S162, determining an access frequency threshold matched with the service data table based on the cold-hot data proportion of the service data table and the total access frequency in a specified statistical period.

Because the service data stored by the different service data tables are different, the access frequencies of the service data in the different service data tables are different, and based on the difference, different access frequency thresholds can be determined for the different service data tables. In an alternative scheme, the product of the cold-hot data proportion of the service data table and the total access frequency of the service data table in a specified statistical period can be used as the access frequency threshold value matched with the service data table.

For example, if the cold-hot data proportion of a certain service data table is 20% (i.e., the cold data accounts for 20%) and the total access frequency of the service data table in the specified statistical period is 1380 times, the access frequency threshold of the service data table may be determined to be 1380 × 20% = 276 times.
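The threshold calculation in this example can be written as a short sketch. The function name `access_frequency_threshold` is hypothetical, and the proportion is assumed to be expressed as a fraction between 0 and 1.

```python
def access_frequency_threshold(cold_ratio, total_access_count):
    """Return the access frequency threshold matched with a service data table.

    cold_ratio: cold-hot data proportion as a fraction (0.20 means 20% cold).
    total_access_count: total access frequency in the specified statistical period.
    """
    if not 0.0 <= cold_ratio <= 1.0:
        raise ValueError("cold_ratio must be between 0 and 1")
    return cold_ratio * total_access_count


# Values from the example above: 20% cold data and 1380 total accesses.
threshold = access_frequency_threshold(0.20, 1380)
```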

S163, determining a cold and hot data demarcation time point of the service data table in the specified statistical period based on the access frequency threshold and the access frequency of the service data table in each sub-period of the specified statistical period.

In general, hot data in a data warehouse is accessed frequently, while cold data is rarely accessed. In an alternative scheme, therefore, the cumulative access frequency corresponding to each sub-period can be determined based on the access frequency information of the service data table in the specified statistical period, where the cumulative access frequency corresponding to a sub-period is the access frequency accumulated from the starting time point of the specified statistical period to the ending time point of that sub-period. A sub-period whose cumulative access frequency reaches the access frequency threshold can then be selected from the specified statistical period and determined as the cold and hot data demarcation time point of the service data table in the specified statistical period.

For example, given the access frequency information of the service data table in the specified statistical period and the matched access frequency threshold described above, step S163 can determine that the sub-period in which the cumulative access frequency of the service data table reaches the access frequency threshold is October 2017, so this sub-period can be used as the cold and hot data demarcation time point of the service data table in the specified statistical period (i.e., January 2017 to December 2019). It can then be determined that the service data in the table written between January 2017 and October 2017 is cold data, and the service data written between November 2017 and December 2019 is hot data.

It can be appreciated that, by the above scheme, based on the rule that hot data is usually accessed frequently and cold data is accessed less frequently, the access frequency threshold matched with the service data table is determined based on the cold-hot data proportion of the service data table and the total access frequency in a specified statistical period, and then the cold-hot data demarcation time point of the service data table is determined based on the access frequency threshold and the access frequency of the service data table in each sub-period, which is simple to implement and high in efficiency.
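The cumulative-frequency search of step S163 can be sketched as follows. The sub-period labels and per-sub-period counts are hypothetical values chosen only so that the total is 1380 and the threshold of 276 is first reached in October 2017; the function name is likewise an assumption.

```python
from itertools import accumulate


def demarcation_sub_period(sub_periods, counts, threshold):
    """Return the first sub-period whose cumulative access count reaches threshold.

    sub_periods: labels in chronological order.
    counts: access frequency in each sub-period, in the same order.
    Returns None if the threshold is never reached.
    """
    for label, cumulative in zip(sub_periods, accumulate(counts)):
        if cumulative >= threshold:
            return label
    return None


# Hypothetical monthly access counts (sum = 1380).
sub_periods = ["2017-08", "2017-09", "2017-10", "2017-11", "2017-12"]
counts = [90, 100, 95, 400, 695]
boundary = demarcation_sub_period(sub_periods, counts, 276)
```

Here the cumulative counts are 90, 190, 285, 685, and 1380, so the threshold is first reached in the third sub-period.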

To determine the cold and hot data demarcation time point more accurately, another alternative scheme allows the cold-hot data proportion to float up and down, since in actual identification the proportion obtained from the cold and hot data identification model may deviate from the actual cold-hot data proportion. Specifically, after a sub-period whose cumulative access frequency reaches the access frequency threshold is selected from the specified statistical period, the selected sub-period is used as a candidate sub-period. Next, for each sub-period adjacent to the candidate sub-period, two differences are determined: the difference between its cumulative access frequency and the cumulative access frequency of the candidate sub-period, and the difference between its cumulative access frequency and the access frequency threshold. Then, from the sub-periods adjacent to the candidate sub-period, the sub-period is selected whose cumulative access frequency differs least from that of the candidate sub-period and whose difference from the access frequency threshold is within a preset range. Finally, the cold and hot data demarcation time point of the service data table in the specified statistical period is determined based on the selected sub-period.

More specifically, an array $a[0]$ to $a[n]$ can be generated based on the access frequency of the service data table in each sub-period and the time order of the sub-periods, where each element of the array stores the access frequency of the service data table in the corresponding sub-period. Then, the ratio $p$ of the cumulative access frequency $s_i$ corresponding to the candidate sub-period to the total access frequency $S$ in the specified statistical period is calculated, i.e., $p = s_i / S$, where $s_i = a[0] + a[1] + a[2] + \dots + a[i]$, $S = a[0] + a[1] + a[2] + \dots + a[n]$, and $i < n$. Further, assuming that the error of the cold-hot data proportion is $c\%$, if $p > (50\% - c\%)$, the subscript $i-1$ can be determined as the right limit, i.e., $i = r$; if $p > (50\% + c\%)$, the subscript $i-1$ can be determined as the left limit, i.e., $i = l$. The adjacent elements among elements $r$ to $l$ of the array are then compared, and the subscript (denoted $k$) corresponding to the largest difference between adjacent elements is determined. Finally, $k$ is calculated by the following formula (3), and the sub-period corresponding to the subscript $k+1$ is taken as the cold and hot data demarcation time point of the service data table in the specified statistical period.

$$a[k] - a[k+1] = \max\bigl(a[r] - a[r+1],\; a[r+1] - a[r+2],\; \dots,\; a[l-1] - a[l]\bigr) \tag{3}$$

It can be understood that by the above scheme, when determining the cold and hot data demarcation time point, errors between the cold and hot data proportion and the actual proportion of the service data table obtained based on the cold and hot data identification model are considered, and the obtained cold and hot data demarcation time point is more accurate, so that the cold and hot data identification result obtained based on the cold and hot data demarcation time point is also more accurate and reliable.
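The search expressed by formula (3) can be sketched as below. The sketch takes the limits r and l as given inputs with r < l, which is an assumption; it does not attempt to reproduce how the limits are derived from the error c%. The function name and the sample array are hypothetical.

```python
def refine_demarcation(a, r, l):
    """Return k + 1, where k in [r, l - 1] maximises a[k] - a[k + 1].

    a: access frequency per sub-period, in chronological order.
    r, l: inclusive index limits of the search window (assumed r < l).
    """
    if not (0 <= r < l < len(a)):
        raise ValueError("expected 0 <= r < l < len(a)")
    # max() returns the first index on ties, i.e. the earliest largest difference.
    k = max(range(r, l), key=lambda j: a[j] - a[j + 1])
    return k + 1


# Hypothetical per-sub-period access counts.
a = [90, 100, 95, 40, 45, 400]
boundary_index = refine_demarcation(a, 1, 4)
```

Within the window, the adjacent differences are 5, 55, and -5, so k is 2 and the sub-period at index 3 is returned.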

For the cold and hot data identification model in step S161, the embodiment of the application further includes a training method for the cold and hot data identification model.

The cold and hot data identification model is trained in advance based on the processing periods, the access frequency information in the statistical periods, and the statistical periods of a large number of sample data tables acquired from the data warehouse. The model does not need to be retrained each time cold and hot data identification is performed on a service data table in the data warehouse. Alternatively, the model can be updated periodically based on the processing periods, the access frequency information in the statistical periods, and the statistical periods of a large number of sample data tables newly acquired from the data warehouse, thereby improving the accuracy and reliability of the model. The sample data tables may be service data tables whose cold and hot data attributes are known.

Specifically, as shown in fig. 4, the training method for the cold and hot data identification model may include the following steps:

S42, acquiring a historical access record and a cold-hot data proportion of the sample data table in a statistical period.

S44, determining the processing period of the sample data table and the total access frequency in the statistical period based on the acquired historical access record.

It should be noted that, based on the history access record of the sample data table in the statistical period, the processing period of the sample data table is determined similarly to the specific embodiment of determining the processing period of the service data table in the step S14, and the description of the step S14 may be referred to specifically, and will not be repeated here.

Similarly, the specific embodiment of determining the total access frequency of the sample data table in the statistical period based on the historical access record of the sample data table in the statistical period is similar to the specific embodiment of determining the total access frequency of the service data table in the specified statistical period in the above step S14, and the specific reference may be made to the description of the above step S14, which is not repeated here.

In addition, different sample data tables can have different statistical periods, which can be customized according to the service corresponding to each sample data table. To further improve the accuracy and reliability of the trained cold and hot data identification model, the sample data tables can be data tables corresponding to the same service as the service data table.

In practical applications, when the cold and hot data identification model is trained based on the training samples, the sample data may be represented in a table structure that includes field attribute names and field descriptions. Further, to distinguish between different sample data, a field describing the name of the sample data table may be added to the table structure. For example, Table 2 shows a table structure of a training sample.

Table 2

Field attribute name	Field description
Table name	Name of the data table (e.g., table1, table2, table3, etc.)
Processing cycle	Type of processing cycle of the data table (e.g., day, month, year, hour, etc.)
Statistical period	Statistical period of the access frequency (e.g., 1 year, 2 years, 3 years, etc.)
Access frequency	Access frequency of the data table within the statistical period
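The table structure above could be represented in code as a simple record type. This is only an illustrative sketch: the class name `TrainingSample`, the field names, and the choice of years as the unit of the statistical period are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    """One training sample, mirroring the four fields of the table structure."""

    table_name: str                # name of the data table, e.g. "table1"
    processing_cycle: str          # e.g. "hour", "day", "month", "year"
    statistical_period_years: int  # statistical period, assumed here to be in years
    access_frequency: int          # access frequency within the statistical period


# Hypothetical sample record.
sample = TrainingSample(
    table_name="table1",
    processing_cycle="month",
    statistical_period_years=3,
    access_frequency=1380,
)
```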

S46, training is performed based on a preset first classification algorithm by taking the processing period of the sample data table, the total access frequency in the statistical period and the statistical period as training samples and the cold-hot data proportion of the sample data table in the statistical period as a label of the training samples, so as to obtain a cold-hot data identification model.

The first classification algorithm in the embodiments of the present application may, for example, include one or a combination of the following algorithms: a decision tree algorithm, a Bayesian algorithm, an artificial neural network algorithm, a k-nearest neighbor (KNN) algorithm, and the like. The first classification algorithm is not specifically limited in the embodiments of the present application.

It can be understood that a certain association exists among the processing period, the access frequency, and the statistical period of a data table and its cold-hot data proportion. By taking the processing periods, access frequencies, and statistical periods of a large number of sample data tables as training samples and the cold-hot data proportions of the sample data tables as labels, and training with a corresponding classification algorithm, the resulting cold and hot data identification model can accurately capture the relationship between these features and the cold-hot data proportion for different service data tables. The cold-hot data proportion of a service data table can therefore be obtained effectively and accurately based on the model, the cold and hot data demarcation time point can in turn be extracted effectively and accurately, and an accurate identification result for the cold and hot data in the service data table can be obtained.
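As one illustration of steps S42 to S46, the sketch below uses a plain k-nearest neighbor classifier, one of the algorithms named above, written with the standard library only. The numeric encoding of the processing cycle, the logarithmic scaling of the access frequency, the value of k, and all sample values are assumptions made for the example; the application does not specify them.

```python
import math
from collections import Counter

# Assumed numeric encoding of the processing cycle type.
CYCLE_CODES = {"hour": 0, "day": 1, "month": 2, "year": 3}


def encode(cycle, period_years, total_access):
    """Turn one data table's attributes into a numeric feature tuple."""
    # log10 keeps large access counts from dominating the distance (a design choice).
    return (CYCLE_CODES[cycle], period_years, math.log10(total_access + 1))


def knn_predict(train, query, k=3):
    """Predict a cold-hot data proportion label by majority vote of k neighbours.

    train: list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical sample data tables labelled with known cold-hot proportions.
train = [
    (encode("day", 3, 1200), 0.2),
    (encode("day", 3, 1500), 0.2),
    (encode("day", 2, 1400), 0.2),
    (encode("month", 3, 90), 0.5),
    (encode("month", 2, 60), 0.5),
    (encode("year", 3, 12), 0.5),
]
ratio = knn_predict(train, encode("day", 3, 1380))
```

The predicted label can then serve as the cold-hot data proportion from which the access frequency threshold of step S162 is computed.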

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 5, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The buses may be classified as address buses, data buses, control buses, and so on. For ease of illustration, only one bidirectional arrow is shown in fig. 5, but this does not mean that there is only one bus or only one type of bus.

The memory is used to store programs. Specifically, a program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the cold and hot data identification device of the data warehouse on a logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:

The method executed by the cold and hot data identification device of the data warehouse disclosed in the embodiment shown in fig. 1 of the present application can be applied to, or implemented by, a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

The electronic device may also execute the method of fig. 1 and implement the functions of the cold and hot data identification device of the data warehouse in the embodiments shown in fig. 1 to 4, which are not described herein again.

Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flow is not limited to each logic unit, but may be hardware or a logic device.

The present embodiments also provide a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to:

Fig. 6 is a schematic structural diagram of a cold and hot data identification device of a data warehouse according to an embodiment of the present application. Referring to fig. 6, in one software implementation, the cold and hot data identification device 600 of a data warehouse may include:

a first obtaining module 610, configured to obtain a historical access record of a service data table in a data warehouse in a specified statistical period;

a first determining module 620, configured to determine, based on the historical access record, a processing period of the service data table and access frequency information in the specified statistical period;

a second determining module 630, configured to determine a hot and cold data demarcation time point of the service data table in the specified statistical period based on a pre-established hot and cold data identification model and a processing period of the service data table, access frequency information in the specified statistical period, and the specified statistical period, where the hot and cold data identification model is obtained by training the processing period of the sample data table, the access frequency information in the statistical period, and the statistical period based on a first classification algorithm;

an identification module 640, configured to identify the cold data and the hot data in the service data table based on the cold and hot data demarcation time point.

the second determining module 630 is specifically configured to:

Optionally, the second determining module 630 is specifically configured to:

the first determining module 620 is specifically configured to:

Optionally, the apparatus further comprises:

In summary, the foregoing description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and for relevant parts reference may be made to the description of the method embodiments.

Claims

1. A method for identifying cold and hot data in a data warehouse, comprising:

identifying cold data and hot data in the service data table based on the cold and hot data demarcation time point;

wherein the access frequency information includes a total access frequency and access frequencies within each sub-period of the specified statistical period;

2. The method of claim 1, wherein determining a cold and hot data demarcation point in time for the business data table within the specified statistical period based on the access frequency threshold and the access frequency of the business data table within each sub-period of the specified statistical period comprises:

3. The method of claim 1, wherein determining a cold and hot data demarcation point in time for the business data table within the specified statistical period based on the access frequency threshold and the access frequency of the business data table within each sub-period of the specified statistical period comprises:

4. The method of claim 1, wherein the history access record includes a point in time of a write operation performed on the traffic data table within the specified statistical period;

5. The method of claim 4, wherein prior to determining the processing period of the traffic data table, the method further comprises:

6. The method of claim 1, wherein prior to determining a cold-hot data demarcation point in time for the business data table within the specified statistical period based on a pre-established cold-hot data identification model and processing period of the business data table, the specified statistical period, and access frequency information within the specified statistical period, the method further comprises:

7. A cold and hot data identification apparatus for a data warehouse, comprising:

a first determining module, configured to determine, based on the historical access record, a processing period of the service data table and access frequency information in the specified statistical period, where the access frequency information includes a total access frequency and access frequencies in each sub-period of the specified statistical period;

the second determining module is configured to input a processing period of the service data table, the specified statistical period and a total access frequency in the specified statistical period into the cold and hot data identification model to obtain a cold and hot data proportion of the service data table, where the cold and hot data identification model is obtained by training a sample data table with the processing period, the total access frequency in the statistical period and the statistical period of the sample data table as training samples, with the cold and hot data proportion of the sample data table as a label, and based on a preset first classification algorithm; determining an access frequency threshold value matched with the service data table based on the cold-hot data proportion and the total access frequency of the service data table in the appointed statistical period; determining a cold and hot data demarcation time point of the service data table in the appointed statistical period based on the access frequency threshold value and the access frequency of the service data table in each sub-period of the appointed statistical period;

8. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.

9. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 6.