CN113064930A - Cold and hot data identification method and device of data warehouse and electronic equipment

Info

Publication number: CN113064930A
Application number: CN202011603968.5A
Authority: CN
Inventors: 邓娟; 刘晓斌; 董宇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guizhou Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guizhou Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-07-02
Anticipated expiration: 2040-12-29
Also published as: CN113064930B

Abstract

The application discloses a cold and hot data identification method and device for a data warehouse and electronic equipment, and aims to at least solve the problems of low identification efficiency and low accuracy of the cold and hot data identification method in the prior art. The method comprises the following steps: acquiring historical access records of a business data table in a data warehouse in a specified statistical period; determining a processing period of the business data table and access frequency information in the specified statistical period based on historical access records in the specified statistical period; determining a cold and hot data boundary time point of the business data table in the specified statistical period based on a pre-established cold and hot data identification model and a processing period of the business data table, access frequency information in the specified statistical period and the specified statistical period; and identifying cold data and hot data in the business data table based on the cold and hot data demarcation time point.

Description

Cold and hot data identification method and device of data warehouse and electronic equipment

Technical Field

The application relates to the technical field of computers, in particular to a cold and hot data identification method and device for a data warehouse and electronic equipment.

Background

A data warehouse (DW) is a data storage collection created to support enterprise analytical reporting and decision-making; it screens and integrates diverse business data. A data warehouse typically holds a large volume of business data of many types. Effectively identifying the cold and hot attributes of the business data in the data warehouse, so that data with different attributes can be stored separately, backed up, or destroyed, is therefore particularly important for managing the data warehouse.

At present, the cold and hot attributes of service data in a data warehouse are mainly identified by manually counting the accesses to each item of service data one by one and distinguishing the cold and hot attributes according to the statistical result. For example, if a certain item of service data is frequently accessed, it can be treated as hot data; otherwise, it can be treated as cold data. The prior-art identification method therefore depends mainly on the experience of business personnel, so the accuracy of the identification result cannot be ensured. Moreover, as the scale of the services connected to the data warehouse keeps growing, the volume of service data in the data warehouse also increases rapidly, and manually counting the accesses to each item of service data one by one lowers the identification efficiency.

Disclosure of Invention

The embodiment of the application provides a cold and hot data identification method and device for a data warehouse and electronic equipment, and aims to at least solve the problems of low identification efficiency and low accuracy of the cold and hot data identification method in the prior art.

In order to solve the technical problem, the embodiment of the application adopts the following technical scheme:

In a first aspect, an embodiment of the present application provides a method for identifying cold and hot data of a data warehouse, including:

acquiring historical access records of a business data table in a data warehouse in a specified statistical period;

determining a processing period of the business data table and access frequency information in the specified statistical period based on historical access records in the specified statistical period;

determining a cold and hot data boundary time point of the business data table in the specified statistical period based on a pre-established cold and hot data identification model and a processing period of the business data table, access frequency information in the specified statistical period and the specified statistical period;

and identifying cold data and hot data in the business data table based on the cold and hot data demarcation time point.

Optionally, the access frequency information includes a total access frequency and access frequencies within respective sub-periods of the specified statistical period;

determining a cold and hot data boundary time point of the business data table in the specified statistical period based on a pre-established cold and hot data identification model and a processing period of the business data table, access frequency information in the specified statistical period and the specified statistical period includes:

inputting the processing period of the business data table, the specified statistical period and the total access frequency in the specified statistical period into the cold and hot data recognition model to obtain the cold and hot data proportion of the business data table, wherein the cold and hot data recognition model is obtained by taking the processing period of a sample data table, the total access frequency in the statistical period and the statistical period as training samples, taking the cold and hot data proportion of the sample data table as a label and training the sample data table based on a preset first classification algorithm;

determining an access frequency threshold value matched with the service data table based on the cold and hot data proportion and the total access frequency of the service data table in the specified statistical period;

and determining the cold and hot data boundary time point of the business data table in the specified statistical period based on the access frequency threshold and the access frequency of the business data table in each sub-period of the specified statistical period.

Optionally, determining a cold-hot data boundary time point of the service data table in the specified statistical period based on the access frequency threshold and the access frequency of the service data table in each sub-period of the specified statistical period includes:

respectively determining the accumulated access frequency corresponding to each sub-period based on the total access frequency in the specified statistical period, wherein the accumulated access frequency corresponding to the sub-period refers to the accumulated access frequency from the starting time point of the specified statistical period to the ending time point of the sub-period;

selecting a corresponding sub-period with the accumulated access frequency reaching the access frequency threshold value from each sub-period of the specified statistical period;

and determining the selected sub-period as a cold and hot data boundary time point of the service data table in the specified statistical period.

selecting a sub-period with the corresponding accumulated access frequency reaching the access frequency threshold value from each sub-period of the specified statistical period as a candidate sub-period;

determining a difference value between the accumulated access frequency corresponding to each sub-period adjacent to the candidate sub-period and the accumulated access frequency corresponding to the candidate sub-period and a difference value between the accumulated access frequency and the access frequency threshold;

selecting a sub-period which has the smallest difference between the accumulated access frequencies corresponding to the candidate sub-period and is within a set range from each sub-period adjacent to the candidate sub-period;

and determining the cold and hot data boundary time point of the business data table in the specified statistical period based on the selected sub-period.

Optionally, the historical access record includes a time point of a write operation performed on the service data table within the specified statistical period;

determining a processing period of the business data table based on the historical access records in the specified statistical period includes:

respectively determining average write cycles between the last write operation executed on the business data table and other write operations based on the time points of the write operations executed on the business data table in the specified statistical period;

determining an average writing period of the service data table based on the average writing periods between the last writing operation and other writing operations;

and determining the processing period of the business data table based on the preset corresponding relation between the average writing period and the processing period of the data table and the average writing period of the business data table.

Optionally, before determining the processing period of the service data table, the method further includes:

acquiring a historical access record and a processing period of a sample data table in a statistical period;

determining an average writing period of the sample data table based on the historical access records in the statistical period;

and training based on a preset second classification algorithm by taking the average writing period of the sample data table as a training sample and taking the processing period of the sample data table as a label of the training sample to obtain the preset corresponding relation.

Optionally, before determining a cold and hot data boundary time point of the service data table in the specified statistical period based on a pre-established cold and hot data identification model and a processing period of the service data table, the specified statistical period and access frequency information in the specified statistical period, the method further includes:

acquiring historical access records and cold-hot data proportion of a sample data table in a statistical period;

determining a processing period of the sample data table and an accumulated access frequency in the statistical period based on the historical access record;

and taking the processing period of the sample data table, the accumulated access frequency in the statistical period and the statistical period as training samples, taking the proportion of the cold data and the hot data of the sample data table in the statistical period as labels of the training samples, and training based on the first classification algorithm to obtain the cold data and the hot data recognition model.

In a second aspect, an embodiment of the present application provides a cold and hot data identification apparatus for a data warehouse, including:

the first acquisition module is used for acquiring historical access records of a business data table in a data warehouse in a specified statistical period;

the first determining module is used for determining the processing period of the business data table and the access frequency information in the specified statistical period based on the historical access record;

the second determination module is used for determining a cold and hot data boundary time point of the business data table in the specified statistical period based on a pre-established cold and hot data identification model and the processing period of the business data table, the access frequency information in the specified statistical period and the specified statistical period, wherein the cold and hot data identification model is obtained by training the processing period of the sample data table, the access frequency information in the statistical period and the statistical period based on a first classification algorithm;

and the identification module is used for identifying cold data and hot data in the service data table based on the cold and hot data boundary time point.

the second determining module is specifically configured to:

Optionally, the second determining module is specifically configured to:

the first determining module is specifically configured to:

Optionally, the apparatus further comprises:

the second acquisition module is used for acquiring the historical access record and the processing period of the sample data table in the statistical period;

a third determining module, configured to determine an average write period of the sample data table based on the historical access records in the statistical period;

and the first training module is used for training based on a preset second classification algorithm by taking the average writing period of the sample data table as a training sample and taking the processing period of the sample data table as a label of the training sample so as to obtain the preset corresponding relation.

Optionally, the apparatus further comprises:

the third acquisition module is used for acquiring the historical access record and the cold and hot data proportion of the sample data table in the statistical period;

a fourth determining module, configured to determine, based on the historical access record, a processing period of the sample data table and an accumulated access frequency in the statistical period;

and the second training module is used for training based on the first classification algorithm by taking the processing period of the sample data table, the accumulated access frequency in the statistical period and the statistical period as training samples and taking the proportion of the cold data and the hot data of the sample data table in the statistical period as labels of the training samples to obtain the cold data and hot data identification model.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first aspect.

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

By analyzing the historical access records of a business data table in a data warehouse within a specified statistical period, the processing period of the business data table and its access frequency information within the specified statistical period are determined. The cold and hot data boundary time point of the business data table within the specified statistical period is then determined based on a pre-established cold and hot data identification model, the determined processing period, the access frequency information within the specified statistical period, and the specified statistical period. Finally, the cold data and hot data in the business data table are identified based on the cold and hot data boundary time point. The whole process requires no manual participation and identifies cold and hot data automatically, so it is more accurate and more efficient than the prior-art approach of manually counting and identifying cold and hot data. In addition, because cold and hot data are identified per business data table, based on that table's cold and hot data boundary time point, the method is more efficient than counting and identifying the business data in the data warehouse item by item.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flowchart of a method for identifying cold and hot data of a data warehouse according to an embodiment of the present application;

FIG. 2 is a flow chart of another method for identifying cold and hot data of a data warehouse according to an embodiment of the present application;

FIG. 3 is a schematic view of access frequency distribution of a business data table in a data warehouse within a specified statistical period according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for training a hot and cold data recognition model according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a cold and hot data identification device of a data warehouse according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Example 1

Referring to fig. 1, an embodiment of the present application provides a method for identifying cold and hot data of a data warehouse, where the method is performed by an electronic device, for example, the electronic device may be a server. As shown in fig. 1, the method comprises the steps of:

S12, acquiring historical access records of the business data table in the data warehouse in a specified statistical period.

The service data in the data warehouse is distributed across different service data tables according to the service to which the data belongs.

The specified statistical period may be any historical period of time prior to the current time. In practical applications, the specified statistical period may be a particular day or week before the current time, such as October 1 or the last week; alternatively, it may be a part of a day before the current time, such as yesterday morning; or it may be a particular month before the current time, such as January or February; or it may span several years before the current time, such as 2017 to 2019.

It should be noted that the granularity of the specified statistical period may be set in a user-defined manner according to actual service requirements, and this is not specifically limited in the embodiment of the present application. For example, for a service data table corresponding to a service with a high requirement on timeliness, the specified statistical period may be one month; and for the service data table corresponding to the service with low timeliness requirement, the specified statistical period can be one year, and the like.

Access logs are usually generated in the process of accessing the data warehouse. By classifying and integrating the access logs, the historical access records of a business data table in the data warehouse within a specified statistical period can be obtained. Access operations on the data warehouse may include, for example, but are not limited to: writing data, deleting data, modifying data, and the like. The historical access records may include, but are not limited to, the type and time point of each access operation performed on the service data table.

S14, determining the processing period of the business data table and the access frequency information in the specified statistical period based on the historical access records in the specified statistical period.

By counting and analyzing the historical access records of the service data table within the specified statistical period, the processing period of the service data table and its access frequency information within the specified statistical period can be determined.

In the embodiment of the application, the processing period of the service data table represents the time pattern with which the service data in the table is processed. This time pattern is usually consistent with the data writing operations on the table. For example, if data is usually written to the service data table on a daily basis, the processing period of the table is usually also daily; if data is usually written on a monthly basis, the processing period is usually also monthly; and so on. Based on this, the processing period of the service data table can be determined from the time points of the write operations performed on the table within the specified statistical period.

Specifically, in an alternative scheme, determining the processing period of the service data table may include the following steps:

First, the average write period between the last write operation performed on the service data table and each of the other write operations may be determined based on the time points of the write operations performed on the service data table within the specified statistical period.

More specifically, the average write period between the last write operation and the other write operations described above can be determined by the following formula (1).

where T_i represents the average write period between the last write operation and the i-th write operation; A_n denotes the time point of the last write operation; A_i denotes the time point of the i-th write operation; and n denotes the number of write operations performed on the service data table within the specified statistical period.

Then, the average writing period of the service data table is determined based on the average writing period between the last writing operation and each of the other writing operations.

More specifically, the average write period of the traffic data table may be determined by the following equation (2).

where TableCycle represents the average write period of the service data table.
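The equation images for formulas (1) and (2) are not reproduced in this text. As a rough sketch only, one reading consistent with the symbol definitions is T_i = (A_n - A_i) / (n - i) for formula (1) and TableCycle equal to the mean of T_1 to T_(n-1) for formula (2). Both forms are assumptions, as are the function name `average_write_period_hours` and the sample timestamps.

```python
from datetime import datetime


def average_write_period_hours(write_times):
    """Return TableCycle in hours under the ASSUMED forms of formulas (1) and (2)."""
    times = sorted(write_times)
    n = len(times)
    if n < 2:
        raise ValueError("at least two write operations are required")
    last = times[-1]  # A_n, the time point of the last write operation
    # Assumed formula (1): T_i = (A_n - A_i) / (n - i), for i = 1 .. n-1
    periods = [
        (last - times[i - 1]).total_seconds() / 3600.0 / (n - i)
        for i in range(1, n)
    ]
    # Assumed formula (2): TableCycle is the mean of the T_i values
    return sum(periods) / len(periods)


# Hypothetical table written once a day at 02:00 on four consecutive days
writes = [datetime(2020, 10, day, 2, 0) for day in (1, 2, 3, 4)]
table_cycle = average_write_period_hours(writes)
print(table_cycle)  # 24.0
```

With perfectly regular daily writes every T_i equals 24 hours, so the average write period is also 24 hours.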

And finally, determining the processing period of the business data table based on the preset corresponding relation between the average writing period and the processing period of the data table and the average writing period of the business data table.

For example, Table 1 shows an example of a preset correspondence between the average write period and the processing period of a data table.

Table 1

Average write period of data table	Processing period
1 hour < TableCycle < 12 hours	Hour
12 hours < TableCycle < 24 hours	Day
24 hours < TableCycle < 360 hours	Month
360 hours < TableCycle < 4320 hours	Year

It should be noted that, because different services have different service attributes, the average write periods of their service data tables also differ, so the preset correspondence between the average write period and the processing period of a data table may be set by the user according to actual service requirements. For example, in a data warehouse of the telecommunication industry, the timeliness requirement on service data is not high, so the processing period of a service data table with a write period of less than 12 hours can be determined as hour, the processing period of a service data table with a write period of 12 to 24 hours can be determined as day, and so on. As another example, in the financial industry, the securities industry, and the like, the timeliness requirement on business data is high, so the processing period of a business data table with a write period of 1 to 2 hours can be determined as hour, the processing period of a business data table with a write period of more than 2 hours can be determined as day, and so on.
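A preset correspondence of this kind can be held as a small lookup table. The sketch below uses the hour bounds from Table 1. The table leaves the exact boundary values (12, 24, and 360 hours) unassigned, so treating each upper bound as inclusive is an assumption, and the names `PROCESSING_PERIOD_RULES` and `processing_period` are hypothetical.

```python
# (lower bound, upper bound, processing period), bounds in hours from Table 1.
# ASSUMPTION: each range is read as lower < TableCycle <= upper.
PROCESSING_PERIOD_RULES = [
    (1, 12, "hour"),
    (12, 24, "day"),
    (24, 360, "month"),
    (360, 4320, "year"),
]


def processing_period(table_cycle_hours, rules=PROCESSING_PERIOD_RULES):
    """Map an average write period (hours) to a processing period label."""
    for lower, upper, label in rules:
        if lower < table_cycle_hours <= upper:
            return label
    return None  # outside every configured range


print(processing_period(6))    # hour
print(processing_period(24))   # day
print(processing_period(100))  # month
```

Passing a different `rules` list models an industry-specific correspondence such as the finance example above.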

In order to make the processing period of the obtained service data table more accurate and further improve the accuracy of the subsequent identification result of the hot and cold data in the service data table, in the embodiment of the application, the corresponding relationship between the average writing period and the processing period of the data table can be obtained by learning and training the average writing period and the processing period of a large number of sample data tables. Specifically, before determining the processing period of the business data table through step S14, the method for identifying cold and hot data of a data warehouse according to the embodiment of the present application may further include: firstly, acquiring a historical access record and a processing period of a sample data table in a statistical period; then, based on the historical access record in the statistical period, determining the average writing period of the sample data table; and finally, taking the average writing period of the sample data table as a training sample, taking the processing period of the sample data table as a label of the training sample, and training based on a preset second classification algorithm to obtain the corresponding relation between the average writing period and the processing period of the data table.

The sample data table may be a business data table of a known processing period in the data warehouse. The preset second classification algorithm may for example comprise a combination of one or more of the following algorithms: decision tree algorithms, bayesian algorithms, artificial neural network algorithms, k-nearest neighbor (KNN) algorithms, etc.
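As a minimal illustration of learning the correspondence from sample data tables, the sketch below applies a plain k-nearest-neighbor vote to a single feature, the average write period in hours. It is not the patent's implementation: the function name `knn_processing_period`, the value k = 3, and all sample values are invented for illustration.

```python
from collections import Counter


def knn_processing_period(samples, table_cycle_hours, k=3):
    """Predict a processing period by majority vote among the k nearest samples.

    samples: list of (average_write_period_hours, processing_period_label).
    """
    nearest = sorted(samples, key=lambda s: abs(s[0] - table_cycle_hours))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical sample data tables with known processing periods
samples = [
    (2, "hour"), (6, "hour"), (10, "hour"),
    (20, "day"), (23, "day"), (24, "day"),
    (200, "month"), (300, "month"), (340, "month"),
]

print(knn_processing_period(samples, 22))  # day
print(knn_processing_period(samples, 5))   # hour
```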

It should be noted that the specific implementation of determining the average writing period of the sample data table based on the historical access record of the sample data table is substantially the same as the specific implementation of determining the average writing period of the service data table in the step S14, and for details, refer to the implementation process of determining the average writing period of the service data table in the step S14, and are not described herein again.

The access frequency information of the service data table in the specified statistical period represents how frequently the service data table is accessed within that period. It may specifically include, but is not limited to, the total access frequency of the service data table in the specified statistical period and the access frequency in each sub-period of the specified statistical period. The granularity of the sub-period may be set by a user according to the granularity of the specified statistical period and the actual service requirement, which is not specifically limited in the embodiment of the present application. For example, if the specified statistical period is one year, the sub-period may be one month; if the specified statistical period is one month, the sub-period may be one day, and so on.

Specifically, the total access frequency of the service data table in the specified statistical period may be determined based on the number of accesses to the service data table in the specified statistical period and the duration of the specified statistical period; the access frequency of the service data table in the sub-period can be determined based on the number of times of accessing the service data table in the sub-period and the duration of the sub-period.
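The calculation described above can be sketched as a count divided by a duration. The text does not fix a unit, so expressing the frequency as accesses per day is an assumption, as are the function name `access_frequency` and the sample timestamps.

```python
from datetime import datetime


def access_frequency(access_times, start, end):
    """Accesses per day within the half-open interval [start, end)."""
    count = sum(1 for t in access_times if start <= t < end)
    days = (end - start).total_seconds() / 86400.0
    return count / days


# Hypothetical access time points for one service data table
accesses = [
    datetime(2020, 10, 1, 9), datetime(2020, 10, 1, 15),
    datetime(2020, 10, 2, 9), datetime(2020, 10, 4, 9),
]
period_start, period_end = datetime(2020, 10, 1), datetime(2020, 10, 5)

# Total access frequency over the four-day statistical period
total_frequency = access_frequency(accesses, period_start, period_end)

# Access frequency in each one-day sub-period
sub_periods = [(datetime(2020, 10, d), datetime(2020, 10, d + 1)) for d in range(1, 5)]
sub_frequencies = [access_frequency(accesses, s, e) for s, e in sub_periods]

print(total_frequency)  # 1.0
print(sub_frequencies)  # [2.0, 1.0, 0.0, 1.0]
```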

S16, determining the cold and hot data boundary time point of the business data table in the specified statistical period based on the pre-established cold and hot data identification model and the processing period of the business data table, the access frequency information in the specified statistical period and the specified statistical period.

In the embodiment of the application, the cold and hot data identification model is a pre-established model for identifying cold and hot data. Optionally, the cold and hot data identification model may be obtained by training, based on a preset first classification algorithm, with the processing period of a sample data table, its access frequency within a statistical period, and the statistical period as training samples, and with the cold and hot data proportion of the sample data table as the label. The first classification algorithm may, for example, include, but is not limited to, a combination of one or more of the following algorithms: decision tree algorithms, Bayesian algorithms, artificial neural network algorithms, k-nearest neighbor (KNN) algorithms, etc.

Accordingly, the determined processing period of the business data table, the access frequency in the specified statistical period, and the specified statistical period can be input into the cold and hot data identification model to obtain the cold and hot data proportion of the business data table, and the cold and hot data boundary time point of the business data table in the specified statistical period can then be determined based on the cold and hot data proportion. The cold and hot data boundary time point refers to the time point that divides cold data from hot data.

It should be noted that the process of training the obtained cold and hot data recognition model will be described in detail in the embodiment shown in fig. 4 below, and will not be further described here.

S18, identifying cold data and hot data in the business data table based on the cold and hot data demarcation time point.

After the cold and hot data demarcation time point of the business data table is determined, the business data in the business data table can be divided according to that time point to obtain cold data and hot data. For example, if the specified statistical period is from January 2017 to December 2019 and the cold and hot data demarcation time point determined by the above steps is October 2017, the service data in the service data table generated from January 2017 to October 2017 can be determined as cold data, and the service data generated from November 2017 to December 2019 can be determined as hot data.
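The split in this example can be sketched as a simple partition on generation time. The row layout (a generation date plus a payload), the function name `split_cold_hot`, and the choice of the last day of October 2017 as the cut-off date are assumptions made for illustration.

```python
from datetime import date


def split_cold_hot(rows, boundary_end):
    """Partition rows into (cold, hot) by generation date.

    rows: list of (generated_on, payload); rows generated on or before
    boundary_end are treated as cold, later rows as hot.
    """
    cold = [row for row in rows if row[0] <= boundary_end]
    hot = [row for row in rows if row[0] > boundary_end]
    return cold, hot


# Hypothetical rows in a service data table covering 2017-01 to 2019-12
rows = [
    (date(2017, 3, 15), "a"),
    (date(2017, 10, 31), "b"),
    (date(2017, 11, 1), "c"),
    (date(2019, 12, 1), "d"),
]

cold, hot = split_cold_hot(rows, date(2017, 10, 31))
print([payload for _, payload in cold])  # ['a', 'b']
print([payload for _, payload in hot])   # ['c', 'd']
```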

According to the cold and hot data identification method for a data warehouse provided by this embodiment, the historical access records of a business data table in the data warehouse within a specified statistical period are analyzed to determine the processing period of the business data table and its access frequency information within the specified statistical period. The cold and hot data boundary time point of the business data table within the specified statistical period is then determined based on a pre-established cold and hot data identification model, the determined processing period, the determined access frequency information within the specified statistical period, and the specified statistical period. Finally, the cold data and hot data in the business data table are identified based on the cold and hot data boundary time point. The whole process requires no manual participation and identifies cold and hot data automatically, so it is more accurate and more efficient than the prior-art approach of manually counting and identifying cold and hot data. In addition, because cold and hot data are identified per business data table, based on that table's cold and hot data boundary time point, the method is more efficient than counting and identifying the business data in the data warehouse item by item.

In order to make those skilled in the art understand the technical solutions provided in the embodiments of the present application, the following detailed descriptions are provided for the technical solutions provided in the embodiments of the present application.

As for the step S16, in an alternative scheme, as shown in fig. 2, the step S16 may include:

S161, inputting the processing period of the business data table, the access frequency in the specified statistical period and the specified statistical period into a pre-established cold and hot data identification model to obtain the proportion of the cold and hot data of the business data table.

S162, determining an access frequency threshold matched with the service data table based on the cold and hot data proportion of the service data table and the total access frequency in a specified statistical period.

Because different service data tables store different service data, the access frequency of the service data differs from table to table, and a different access frequency threshold can therefore be determined for each service data table. In an alternative scheme, the access frequency threshold can be obtained as the product of the cold and hot data proportion of the service data table and the total access frequency of the service data table in the specified statistical period.

For example, if the cold and hot data proportion of a certain service data table is 20% (i.e., the cold data accounts for 20%), and the total access frequency of the service data table in the specified statistical period is 1380 times, the access frequency threshold of the service data table can be determined to be 1380 × 20% = 276 times.
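The arithmetic of step S162 can be written as a short Python sketch. The function name `access_frequency_threshold` is hypothetical, and rounding to the nearest integer is an assumption, since the text does not state how fractional products are handled.

```python
def access_frequency_threshold(cold_ratio: float, total_accesses: int) -> int:
    """Threshold = cold/hot data proportion x total access frequency.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(cold_ratio * total_accesses)

# Values taken from the example in the text: 20% and 1380 accesses.
threshold = access_frequency_threshold(0.20, 1380)
print(threshold)  # 276
```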

S163, determining the cold and hot data boundary time point of the business data table in the specified statistical period based on the access frequency threshold and the access frequency of the business data table in each sub-period of the specified statistical period.

In an alternative scheme, the accumulated access frequency corresponding to each sub-period may be determined based on the total access frequency of the service data table in a specified statistical period, where the accumulated access frequency corresponding to a sub-period is an accumulated access frequency from a starting time point of the specified statistical period to an ending time point of the sub-period; then, a sub-period in which the corresponding accumulated access frequency reaches the access frequency threshold may be selected from the specified statistical period, and the sub-period may be determined as a cold and hot data boundary time point of the service data table in the specified statistical period.

For example, taking the access frequency information of the service data table in the specified statistical period shown in fig. 3 and a matching access frequency threshold of 276 times, step S163 can determine that the sub-period in which the accumulated access frequency of the service data table reaches the access frequency threshold is October 2017. This sub-period can therefore be used as the cold and hot data boundary time point of the service data table in the specified statistical period (i.e., January 2017 to December 2019). Further, it can be determined that the service data in the service data table written between January 2017 and October 2017 is cold data, and the service data written between November 2017 and December 2019 is hot data.
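The accumulation in step S163 can be sketched in Python as follows. The function name `find_boundary` and the monthly counts are hypothetical; the actual values of fig. 3 are not reproduced here, so the numbers below are purely illustrative.

```python
from itertools import accumulate

def find_boundary(sub_periods, frequencies, threshold):
    """Return the first sub-period whose accumulated access frequency
    (counted from the start of the statistical period) reaches the threshold,
    or None if the threshold is never reached."""
    for label, cumulative in zip(sub_periods, accumulate(frequencies)):
        if cumulative >= threshold:
            return label
    return None

# Illustrative data only (not the values of fig. 3).
months = ["2017-01", "2017-02", "2017-03", "2017-04"]
counts = [10, 20, 30, 40]            # accumulated: 10, 30, 60, 100
boundary = find_boundary(months, counts, threshold=55)
```

In this toy input the accumulated frequency first reaches 55 in the third sub-period, so `boundary` is "2017-03".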

It can be understood that, by the above scheme, based on the rule that hot data is frequently accessed and cold data is less accessed, an access frequency threshold matched with the service data table is determined based on the proportion of the hot data and the cold data in the service data table and the total access frequency in a specified statistical period, and then the hot data and cold data boundary time point of the service data table is determined based on the access frequency threshold and the access frequency of the service data table in each sub-period, which is simple to implement and high in efficiency.

In the actual recognition process, a certain error may exist between the cold and hot data proportion obtained from the cold and hot data recognition model and the actual cold and hot data proportion. In order to determine the cold and hot data boundary time point more accurately, another optional scheme therefore takes the up-and-down fluctuation of the cold and hot data proportion into account. Specifically, after the sub-period in which the corresponding accumulated access frequency reaches the access frequency threshold is selected from the specified statistical period, the selected sub-period is used as a candidate sub-period. Next, for each sub-period adjacent to the candidate sub-period, two differences are determined: the difference between its accumulated access frequency and the accumulated access frequency of the candidate sub-period, and the difference between its accumulated access frequency and the access frequency threshold. Then, from the sub-periods adjacent to the candidate sub-period, the sub-period is selected whose difference from the accumulated access frequency of the candidate sub-period is the smallest and whose difference from the access frequency threshold is within a preset range. Finally, the cold and hot data boundary time point of the service data table in the specified statistical period is determined based on the selected sub-period.

More specifically, an array a[0] to a[n] may be generated from the access frequency of the service data table in each sub-period, ordered by the time sequence of the sub-periods, where each element stores the access frequency of the service data table in the corresponding sub-period. Then, the proportion p of the accumulated access frequency s_i corresponding to the candidate sub-period in the total access frequency S within the specified statistical period is calculated, i.e., p = s_i/S, where s_i = a[0]+a[1]+a[2]+...+a[i], S = a[0]+a[1]+a[2]+...+a[n], and i < n. Further, assuming that the error of the cold and hot data proportion is c%, if p > (50% - c%), the subscript i-1 can be determined as the right limit, i.e., i = r; if p > (50% + c%), the subscript i-1 can be determined as the left limit, i.e., i = l. The differences between adjacent elements among a[r] to a[l] are then examined, and the subscript (denoted as k) corresponding to the largest difference is determined by the following formula (3). Finally, the sub-period corresponding to the subscript k+1 is taken as the cold and hot data demarcation time point of the business data table in the specified statistical period.

a[k]-a[k+1] = Max((a[r]-a[r+1]), (a[r+1]-a[r+2]), ..., (a[l-1]-a[l]))    (3)

By means of the scheme, when the cold and hot data boundary time point is determined, the error between the cold and hot data proportion and the actual proportion of the business data table obtained based on the cold and hot data identification model is considered, and the obtained cold and hot data boundary time point is more accurate, so that the cold and hot data identification result obtained based on the cold and hot data boundary time point is more accurate and reliable.
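One possible reading of the error-band refinement and formula (3) is sketched below in Python. The text is ambiguous about whether the limits are the subscript i or i-1, so this sketch assumes that r is the first subscript whose cumulative share exceeds (ratio - error) and l is the first subscript whose share exceeds (ratio + error). The function name `refine_boundary`, the fallback when the band is empty, and the sample array are all hypothetical.

```python
from itertools import accumulate

def refine_boundary(a, ratio, err):
    """Return the subscript k+1 chosen by formula (3) inside the error band.

    a: per-sub-period access frequencies a[0..n], in time order.
    ratio: cold/hot data proportion from the model (50% in the text's example).
    err: assumed error c of that proportion, as a fraction.
    """
    total = sum(a)
    if total == 0:
        raise ValueError("total access frequency must be positive")
    shares = [s / total for s in accumulate(a)]   # p for every subscript i
    n = len(a) - 1
    # Assumption: limits are the first subscripts crossing each band edge.
    r = next((i for i, p in enumerate(shares) if p > ratio - err), n)
    l = next((i for i, p in enumerate(shares) if p > ratio + err), n)
    if r >= l:
        return r                                  # empty band: assumed fallback
    # Formula (3): largest drop a[k] - a[k+1] for r <= k < l.
    k = max(range(r, l), key=lambda j: a[j] - a[j + 1])
    return k + 1

# Illustrative frequencies only; cumulative shares are 0.1, 0.2, 0.5, 0.55, 0.8, 1.0.
index = refine_boundary([10, 10, 30, 5, 25, 20], ratio=0.5, err=0.1)
```

For this toy array the band spans subscripts 2 to 4, the largest drop between adjacent elements is a[2]-a[3], and the sketch returns subscript 3.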

For the cold and hot data recognition model in the step S161, the embodiment of the present application further includes a training method for the cold and hot data recognition model.

It should be noted that the cold and hot data recognition model is trained in advance based on the processing periods, the access frequency information within the statistical periods, and the statistical periods of a large number of sample data tables acquired from the data warehouse, so the model does not need to be retrained each time cold and hot data recognition is performed on a business data table in the data warehouse. Alternatively, the model can be updated periodically based on the processing periods, access frequency information, and statistical periods of a large number of sample data tables newly acquired from the data warehouse, so as to improve the accuracy and reliability of the model. The sample data tables may be service data tables whose cold and hot data attributes are known.

Specifically, as shown in fig. 4, the method for training the cold-hot data recognition model may include the following steps:

S42, acquiring the historical access record and the cold and hot data proportion of the sample data table in the statistical period.

S44, determining the processing period of the sample data table and the total access frequency in the statistical period based on the acquired historical access records.

It should be noted that, the determining of the processing period of the sample data table based on the historical access record of the sample data table in the statistical period is similar to the determining of the processing period of the service data table in the step S14, and specific reference may be made to the description of the step S14, which is not described herein again.

Similarly, a specific implementation of determining the total access frequency of the sample data table in the statistical period based on the historical access record of the sample data table in the statistical period is similar to the specific implementation of determining the total access frequency of the service data table in the specified statistical period in step S14, and specific reference may be made to the description of step S14, which is not repeated herein.

In addition, different sample data tables can have different statistical periods, and the setting can be customized according to the service corresponding to the sample data tables. In order to further improve the accuracy and reliability of the hot and cold data identification model obtained by training, the sample data table may adopt a data table corresponding to the same service as the service data table.

It should be further noted that, in practical applications, when the cold and hot data recognition model is trained based on training samples, the sample data may be represented in a table structure that includes field attribute names and field descriptions. Further, in order to distinguish different sample data, a field describing the name of the sample data table may be added to the table structure. For example, Table 2 shows a table structure of a training sample.

Table 2

Field attribute name	Field description
Table name	Name of the data table (e.g., table1, table2, table3, etc.)
Processing period	Type of processing period of the data table (e.g., day, month, year, hour, etc.)
Statistical period	Statistical period of the access frequency (e.g., 1 year, 2 years, 3 years, etc.)
Access frequency	Access frequency of the data table during the statistical period

S46, taking the processing period, the total access frequency in the statistical period and the statistical period of the sample data table as training samples, taking the cold-hot data proportion of the sample data table in the statistical period as labels of the training samples, and training based on a preset first classification algorithm to obtain a cold-hot data recognition model.

The first classification algorithm in the embodiments of the present application may for example comprise a combination of one or more of the following algorithms: a decision tree algorithm, a bayesian algorithm, an artificial neural network algorithm, a k-nearest neighbor algorithm (KNN), etc., and the first classification algorithm is not specifically limited in the embodiment of the present application.
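As a rough illustration of step S46, the following Python sketch uses a k-nearest neighbor vote, which is only one of the algorithms listed above. The `Sample` fields mirror Table 2, but the distance function, the value of k, and all sample values are assumptions made for this sketch and do not come from the application.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    table_name: str
    processing_period: str     # e.g. "day", "month"
    statistical_years: int     # statistical period, in years
    total_accesses: int        # total access frequency in the period
    cold_ratio: float          # label: cold/hot data proportion

def distance(x: Sample, y: Sample) -> float:
    """Hypothetical mixed distance: categorical mismatch + numeric gaps."""
    period_term = 0.0 if x.processing_period == y.processing_period else 1.0
    years_term = abs(x.statistical_years - y.statistical_years)
    largest = max(x.total_accesses, y.total_accesses, 1)
    freq_term = abs(x.total_accesses - y.total_accesses) / largest
    return period_term + years_term + freq_term

def predict_cold_ratio(train, query, k=3):
    """Majority vote over the labels of the k nearest training samples."""
    nearest = sorted(train, key=lambda s: distance(s, query))[:k]
    votes = Counter(s.cold_ratio for s in nearest)
    return votes.most_common(1)[0][0]

# Made-up training samples, for illustration only.
train = [
    Sample("t1", "day", 3, 1380, 0.2),
    Sample("t2", "day", 3, 1500, 0.2),
    Sample("t3", "month", 1, 200, 0.5),
    Sample("t4", "month", 1, 240, 0.5),
]
query = Sample("q", "day", 3, 1400, 0.0)   # label field is unused for queries
predicted = predict_cold_ratio(train, query)
```

Here the two daily-processed tables are the closest neighbors of the query, so the vote yields their label. In practice an established library implementation of any of the listed algorithms could be substituted for this hand-written vote.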

It can be understood that a certain correlation exists among the processing period, the access frequency, the statistical period, and the cold and hot data proportion of a data table. Therefore, the processing periods, access frequencies, and statistical periods of a large number of sample data tables are used as training samples, the cold and hot data proportions of the sample data tables are used as labels, and a corresponding classification algorithm is used for training to obtain the cold and hot data identification model. The resulting model can capture the relationship between the processing period, access frequency, and statistical period of different business data tables and their cold and hot data proportions. The cold and hot data proportion of a business data table can thus be obtained effectively and accurately from the model, and the cold and hot data boundary time point can in turn be determined, yielding an accurate identification result for the cold and hot data in the business data table.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 5, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.

The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form a cold and hot data identification device of the data warehouse on a logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:

The method executed by the cold and hot data identification device of the data warehouse according to the embodiment shown in fig. 1 of the present application may be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads information in the memory and completes the steps of the above method in combination with its hardware.

The electronic device may also execute the method of fig. 1, and implement the functions of the cold and hot data identification apparatus of the data warehouse in the embodiments shown in fig. 1 to fig. 4, which are not described herein again in this embodiment of the present application.

Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.

Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which when executed by a portable electronic device including a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and are specifically configured to:

Fig. 6 is a schematic structural diagram of a cold and hot data identification device of a data warehouse according to an embodiment of the present application. Referring to fig. 6, in one software implementation, the cold and hot data identification device 600 of the data warehouse may include:

a first obtaining module 610, configured to obtain a historical access record of a service data table in a data warehouse in a specified statistical period;

a first determining module 620, configured to determine, based on the historical access record, a processing period of the service data table and access frequency information in the specified statistical period;

a second determining module 630, configured to determine a cold and hot data boundary time point of the service data table in the specified statistics period based on a pre-established cold and hot data identification model and a processing period of the service data table, access frequency information in the specified statistics period, and the specified statistics period, where the cold and hot data identification model is obtained by training the processing period of the sample data table, the access frequency information in the statistics period, and the statistics period based on a first classification algorithm;

an identifying module 640, configured to identify cold data and hot data in the service data table based on the cold and hot data boundary time point.

the second determining module 630 is specifically configured to:

Optionally, the second determining module 630 is specifically configured to:

the first determining module 620 is specifically configured to:

Optionally, the apparatus further comprises:

In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims

1. A cold and hot data identification method for a data warehouse is characterized by comprising the following steps:

2. The method of claim 1, wherein the access frequency information includes a total access frequency and access frequencies within respective sub-periods of the specified statistical period;

3. The method of claim 2, wherein determining the cold and hot data demarcation time point of the business data table in the specified statistical period based on the access frequency threshold and the access frequency of the business data table in each sub-period of the specified statistical period comprises:

4. The method of claim 2, wherein determining the cold and hot data demarcation time point of the business data table in the specified statistical period based on the access frequency threshold and the access frequency of the business data table in each sub-period of the specified statistical period comprises:

5. The method of claim 1, wherein the historical access record comprises a time point of a write operation performed on the business data table within the specified statistical period;

6. The method of claim 5, wherein before determining the processing period of the business data table, the method further comprises:

7. The method of claim 2, wherein before determining the hot and cold data boundary time point of the service data table in the specified statistical period based on the pre-established hot and cold data recognition model and the processing period of the service data table, the specified statistical period and the access frequency information in the specified statistical period, the method further comprises:

8. A cold and hot data recognition device of a data warehouse, comprising:

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 7.

10. A computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-7.