CN114169401A - Data processing and prediction model training method and device - Google Patents
- Publication number
- CN114169401A (application number CN202111350521.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- time
- data
- type
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0647—Migration mechanisms
Abstract
The embodiments of the present application provide a data processing and prediction model training method and device. The method comprises the following steps: determining feature information according to target data and access records of the target data; inputting the feature information into a prediction model to obtain time information of future accesses to the target data, wherein the prediction model is trained on training samples, each training sample comprises a sample feature, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of the data corresponding to the sample feature exist before and after a random time in a sample sampling period; and performing cold and hot data identification on the target data according to the time information. The sample time label serves as the label of the training sample for training the prediction model, the time interval until the next access of the target data serves as the prediction result, and cold and hot data identification is performed on the target data according to that interval, so the accuracy of cold and hot data identification can be effectively improved.
Description
Technical Field
The present application relates to the field of computers, and in particular, to a method and apparatus for data processing and predictive model training.
Background
With the rapid growth of data processing demands, data storage costs have increased greatly. During data storage, data often exhibits a distinct cold-hot characteristic: data in some areas is accessed relatively frequently, while data in other areas is rarely accessed. If a large amount of cold data occupies high-performance devices, storage resources are wasted.
In the prior art, different types of storage media and storage modes are used to store cold and hot data separately. Before cold and hot data are separated, the mixed data must be accurately identified. One class of methods identifies cold and hot data by establishing identification rules, for example rules such as LRU, LFU, LIRS, or Exponential Decay. Another class is based on machine learning, using historical access characteristics of the data to predict whether the data will be accessed within a future period of time. However, the accuracy of the identification results obtained by these methods is low. Therefore, a solution that improves the accuracy of cold and hot data identification is needed.
Disclosure of Invention
In order to solve or improve the problems in the prior art, embodiments of the present application provide a data processing and prediction model training method and apparatus.
In a first aspect, in one embodiment of the present application, a data processing method is provided. The method comprises the following steps:
determining characteristic information according to target data and the access record of the target data;
inputting the characteristic information into a prediction model to obtain time information of the target data to be accessed in the future; the prediction model is obtained by training a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
and performing cold and hot data identification on the target data according to the time information.
In a second aspect, in one embodiment of the present application, a predictive model training method is provided. The method comprises the following steps:
constructing a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
inputting the training samples into a prediction model to obtain a prediction result;
optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type;
wherein the prediction model is used to identify cold and hot data.
In a third aspect, in an embodiment of the present application, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform a data processing method according to the first aspect or a predictive model training method according to the second aspect.
In a fourth aspect, in one embodiment of the present application, there is provided an electronic device comprising a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory, so as to implement a data processing method according to the first aspect or a predictive model training method according to the second aspect.
According to the technical solutions provided by the embodiments of the present application, the feature information of the target data is input into a pre-trained prediction model, which outputs the time information of the target data being accessed in the future, that is, the time difference between the future access time and the current time. So that the prediction model can obtain this future access time accurately, each constructed training sample comprises a sample feature, a sample time label and a sample type, where the sample time label and the sample type are determined by whether access records of the data corresponding to the sample feature exist before and after a random time in the sample sampling period. A prediction model trained on such samples can accurately identify the target data as cold or hot. In this scheme, the sample time label serves as the training label, the time interval until the next access of the target data serves as the prediction result, and cold and hot data identification is performed according to that interval, which effectively improves the accuracy of cold and hot identification of the target data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for constructing a training sample according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of sampling period splitting according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training sample constructed based on an observation window according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for training a prediction model according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a method for optimizing parameters of a prediction model according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a sample type and correspondence matching process provided in the embodiment of the present application;
FIG. 8 is a diagram illustrating hot and cold data identification provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating a predictive model training method according to an embodiment of the present disclosure;
FIG. 10 is a block diagram illustrating an exemplary hot and cold data recognition system according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a prediction model training apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
During data storage, data often exhibits a distinct cold-hot characteristic: data in some areas is accessed relatively frequently, while data in other areas is rarely accessed, or consecutive accesses occur at long intervals (for example, 3 days, one week, one month, or half a year). If a large amount of cold data occupies high-performance devices, storage resources are wasted. In the prior art, different types of storage media and storage modes are used to store cold and hot data separately; before the cold and hot data are separated, the mixed data must be accurately identified. Although some schemes employ machine learning, using historical data access characteristics to predict whether data will be accessed within a future period, the richness of the samples obtained when training the machine learning model is limited by the size of the sampled observation window (for example, a window that can observe data within 1 day or 1 week). Usually a single sample is generated per limited observation window, data outside the window cannot be used as training samples, and the machine learning model then cannot predict future accesses well. Therefore, there is a need for a cold and hot data identification solution that is not limited by the size of the observation window and the number of features.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
In some of the flows described in the specification, claims, and above-described figures of the present invention, a number of operations are included that occur in a particular order, which operations may be performed out of order or in parallel as they occur herein. The sequence numbers of the operations, e.g., 101, 102, etc., are used merely to distinguish between the various operations, and do not represent any order of execution per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application. The method comprises the following steps:
101: and determining characteristic information according to the target data and the access record of the target data.
102: inputting the characteristic information into a prediction model to obtain time information of the target data to be accessed in the future; the prediction model is obtained by training a training sample, the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period.
103: and performing cold and hot data identification on the target data according to the time information.
The target data is the data to be identified as cold or hot. The feature information includes access features of the target data, data related to the target data, and database-level semantic information (e.g., domain_name, data_size, SQL template). In addition, the content included in the feature information may be increased or decreased as necessary.
The prediction model may be, for example, a survival analysis model, trained in advance on training samples. The time information may be the time interval between the time point at which the target data will next be accessed and the current (or a specified) time point. The longer this interval, the lower the access frequency of the target data and the more likely it is to be classified as cold data; conversely, the shorter the interval, the higher the access frequency and the more likely the target data is to be classified as hot data. Therefore, when identifying cold and hot data, the target data is not directly classified as cold or hot; instead, the time interval between the future access time and the current time serves as the basis for distinguishing cold from hot data.
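The interval-thresholding step described above can be sketched as follows; the 72-hour cold threshold and the function name are illustrative assumptions, not values from the patent.

```python
# Hedged sketch: classifying data as hot or cold from the predicted
# next-access interval. A longer predicted gap until the next access
# means a lower access frequency, hence more likely cold data.
# The threshold and names are illustrative assumptions.

def classify_by_interval(predicted_interval_hours, cold_threshold_hours=72.0):
    """Return 'cold' when the predicted next-access gap exceeds the threshold."""
    return "cold" if predicted_interval_hours >= cold_threshold_hours else "hot"

print(classify_by_interval(2.0))    # short gap: frequently accessed
print(classify_by_interval(240.0))  # long gap: rarely accessed
```

In practice the threshold would be tuned to the storage tiers involved; the point is that classification is derived from the predicted interval, not predicted directly.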
The training samples are generated from the access records and related information of multiple target data over historical time. Specifically, a training sample may be represented as (x_i, y_i, e_i), where x_i denotes the sample feature (a covariate, such as an access feature), y_i denotes the sample time label, and e_i denotes the sample type. It should be noted that the sample time label and the sample type are determined by whether access records of the data corresponding to the sample feature exist before and after a random time in the sample sampling period; this is illustrated in the sample-construction embodiments below. The sample types referred to here are the deletion (censored) type and the non-deletion (non-censored) type.
The time length of the sample sampling period is not limited, and sampling may be performed over a long period (e.g., one year or half a year). In actual sampling, however, the time range of the training samples input to the prediction model is limited by the observation window, so sampling is performed by splitting the period in time, as explained in the following embodiments.
In the training samples, the sample time label is used as the label for training the prediction model. Thus, the prediction model can output time information for the target data as the prediction result.
The scheme for constructing the training samples will be specifically described below.
Fig. 2 is a schematic flowchart of a method for constructing a training sample according to an embodiment of the present application. As can be seen from fig. 2, the construction of the training sample specifically includes the following steps:
201: and acquiring an access record of the sample data in a sampling period.
202: and setting a random moment in the sampling time period to split the sampling time period to obtain a feature extraction time period and an observation window time period.
203: and generating sample characteristics based on the access records of the sample data set in the characteristic extraction time period.
204: and searching whether at least one access record aiming at the sample data exists after the random time.
205: and when at least one access record aiming at the sample data exists, determining the sample time label according to the random moment and the at least one access record, and setting the sample type as a non-deletion type.
206: and when no access record aiming at the sample data exists, determining the sample time label according to the random time and the termination time of the observation window, and setting the sample type as a deletion type.
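Steps 204-206 above can be sketched as a single labeling function, assuming numeric access timestamps; the function and variable names are illustrative, not from the patent.

```python
# Hedged sketch of steps 204-206: given a data item's access times, the
# observation-window end te, and a random time t inside the window, emit
# the (sample time label, sample type) pair. "censored" corresponds to
# the deletion type in the patent text; "non-censored" to the
# non-deletion type. Names are assumptions.

def label_sample(access_times, t, te):
    later = [a for a in access_times if t < a <= te]
    if later:                        # step 205: a next access is observed
        return min(later) - t, "non-censored"
    return te - t, "censored"        # step 206: nothing observed before te

# Accesses T0..T5 at these times; the window ends at te=90, so the
# access at 95 (T5) falls outside the window and cannot be observed.
times = [10, 25, 40, 55, 70, 95]
print(label_sample(times, 30, 90))   # t between T1 and T2
print(label_sample(times, 75, 90))   # t between T4 and T5: censored
```

Note that the censored sample still carries information: the label te - t is a lower bound on the true next-access interval, which is exactly what survival analysis exploits.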
As described above, the time range of the sampling period is relatively wide, and a long historical period can be used for sampling; in practice, sampling may be performed over only part of the period. In the technical solution of the present application, the access frequencies of different cold and hot data are not identical, so when the access frequency of certain target data is relatively low, the obtained access records may be dispersed and the access frequency may be irregular (for example, three consecutive access intervals of one month, two months, and three months). To make full use of the data, the sampling period is split at a random time (the pivot time in fig. 3) into a feature extraction period (the history phase in fig. 3) and an observation window period. In practical application there is a large amount of historical access data, and each inserted random time can cut across the access records of multiple sample data. For ease of understanding, the following embodiments use a single target data as an example.
Fig. 3 is a schematic diagram of sampling period splitting according to an embodiment of the present application. As can be seen from fig. 3:
In the feature extraction period, access features of the sample data (for example, access intervals as dynamic features) are combined with static features such as the size of the file to which the data belongs and the table name to form the feature information; the longer the period, the richer the access features that can be extracted.
The observation window corresponding to the observation window period is used to mark the sample time label and the sample type during training. The longer the window, the less censored (deletion) data there is; conversely, the higher the proportion of censored data.
Therefore, both the feature extraction period and the observation window period must be long enough to ensure the quality of the generated training samples. In addition, in both periods, sample data at the very beginning and end of the period is in an unstable state. For example, the initial 20% and final 20% of the window length are excluded: random times are generated and divided only within the middle 20%-80% of the trace period, and the data sets corresponding to all random times are then merged into the full data set.
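The middle-of-trace pivot generation described above can be sketched as follows; the 20%-80% defaults mirror the text, while the function and parameter names are assumptions.

```python
import random

# Hedged sketch of pivot (random-time) generation: pivots are drawn only
# from the middle 20%-80% of the trace period, avoiding the unstable
# initial and end stages of the period. Names are illustrative.

def draw_pivots(trace_start, trace_end, n, lo=0.2, hi=0.8):
    span = trace_end - trace_start
    return [trace_start + span * random.uniform(lo, hi) for _ in range(n)]

pivots = draw_pivots(0.0, 100.0, n=5)   # 5 random times in [20, 80]
```

Each pivot splits the period into a feature extraction phase (before the pivot) and an observation window (after it), and every pivot yields one training sample per data item.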
Specifically, fig. 4 is a schematic diagram of constructing training samples based on an observation window according to an embodiment of the present application. The access records in the sampling period in fig. 4 occur at times T0, T1, T2, T3, T4, and T5; that is, the sample data is accessed 6 times in the sampling period, at times T0-T5. The observation window starts at time Ts and ends at time Te. As can be seen from fig. 4, T5 is not within the observation window; in other words, the access event at T5 is not observable. Although T5 is not observed, an access event for the sample data still occurs at a future time outside the observation window. A random time may then be set uniformly or randomly within the observation window. For the same sample data, the more random times are set within the observation window, the more training samples can be obtained.
For example, the random time t may be set between T0 and T1, between T1 and T2, between T2 and T3, between T3 and T4, or between T4 and T5. Since Te is smaller than T5 and larger than T4, if t (the current observed time) is set between T4 and T5, the access event occurring at T5 cannot be observed within the observation window.
In practical application, the random time is adjusted; and searching whether at least one access record aiming at the sample data exists after the adjusted random moment so as to determine a sample time label and a sample type according to a searching result. Specifically, the method comprises the following steps:
after the random time is set, whether at least one access record for the sample data can be found after the random time is further judged. For example,
if the random time T is set between T0 and T1, then 4 visits can be observed in the observation window, resulting in training sample a 1.
If the random time T is set between T1 and T2, 3 visits can be observed in the observation window, resulting in training sample a 2.
If the random time T is set between T2 and T3, then 2 visits can be observed in the observation window, resulting in training sample A3.
If the random time T is set between T3 and T4, then 1 visit record can be observed in the observation window, resulting in training sample a 4.
At this time, the sample types corresponding to the samples a1, a2, A3, a4 may be set as non-deletion types, respectively.
If the random time T is set between T4 and T5, 0 visit record can be observed in the observation window, and the visit record at the time of T5 cannot be observed, so that the training sample a5 is obtained. At this time, the sample type corresponding to the training sample a5 may be set as the deleted type.
Therefore, by controlling the number of the generated random moments, training sets of training samples with different scales can be obtained, and generally, the larger the scale of the training set is, the better the algorithm index of the prediction model is, and the more accurate the prediction result is.
The manner of determining the sample time label corresponding to step 205 is illustrated below. Specifically: if the last access time in the observation window is later than the random time, the first time difference between the random time and the earliest access time after the random time is marked as the sample time label, and the sample type is marked as the non-deletion type.
In practical application, after the random time is set, the sample time label is calculated from the random time and used as the label of the training sample. Since an access event is successfully observed after the random time within the observation window, that is, an access record exists, no censoring occurs, so the sample type of the training sample is set to the non-deletion type. The corresponding sample time label is the first time difference between the random time and the earliest access time after the random time. Continuing the example above, the sample time label of training sample A1 is Y1 = T1-t, that of A2 is Y2 = T2-t, that of A3 is Y3 = T3-t, and that of A4 is Y4 = T4-t.
The manner of determining the sample time label corresponding to step 206 is illustrated below. Specifically: if the last access time in the observation window period is earlier than the random time, the second time difference between the window end time and the random time is marked as the sample time label, and the sample type is marked as the deletion type.
In practical application, after the random time is set, the sample time label is calculated from the random time and used as the label of the training sample. Since no access event can be observed after the random time within the observation window, that is, no access record exists, censoring occurs, so the sample type of the training sample is set to the deletion type. The corresponding sample time label is the second time difference between the window end time and the random time. Continuing the example above, the sample time label of training sample A5 is Y5 = Te-t.
Based on the scheme of this embodiment, when training samples are constructed, unobserved access records can be introduced into the training samples in the form of a first time difference with the non-deletion type, or a second time difference with the deletion type. Although the observation window has a limited length, samples are not lost when data is censored (some access records of the data are not observed). After the prediction model is trained on such comprehensive training samples, it can better predict the time information of the next access after the current observation time, improving the prediction effect of the survival analysis model on future accesses. The scheme can be applied in a database or a cluster to identify cold and hot data on each node and store the data according to the identification results. When the trained model is used for prediction, the cold or hot category of the data is judged according to the time information of the next access output by the prediction model.
After the training samples are obtained by the above embodiments, the prediction model may be trained. The following will specifically exemplify the embodiments.
Fig. 5 is a schematic flowchart of a training method of a prediction model according to an embodiment of the present disclosure. As can be seen from fig. 5, the method specifically comprises the following steps:
501: and constructing a training sample. 502: and inputting the training samples into a prediction model to obtain a prediction result. 503: optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type; wherein the prediction model is used to identify cold and hot data.
As described above, each training sample comprises (x_i, y_i, e_i), where x_i represents the sample characteristic (covariate, such as an access characteristic), y_i represents the sample time label, and e_i represents the sample type. In the process of constructing training samples, different random times can be set based on the same sample data to obtain a group of training samples; further, based on a plurality of sample data, a training sample set can be obtained. A prediction model (e.g., a survival analysis model) is trained using the training sample set, with y_i serving as the training label. It is easy to understand that although a relatively comprehensive set of training samples has been obtained, the prediction model still needs to be continuously optimized based on these samples during training. The specific optimization process is as follows:
fig. 6 is a schematic flowchart of a parameter optimization method of a prediction model according to an embodiment of the present disclosure. As can be seen from fig. 6, the method specifically comprises the following steps:
601: and determining the corresponding relation between the prediction result and the sample time label.
602: and optimizing parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
Here, the prediction result is the time information, i.e., the time difference between the next visit and the current time. The correspondence between the prediction result and the sample time tag is specifically one of the following: the prediction result is earlier (smaller) than the sample time tag, or the prediction result is later (larger) than the sample time tag. The sample types include the non-deletion type and the deletion type. The specific matching-result determination process is exemplified in the following embodiment with reference to fig. 7.
As described above, the training samples include the sample type in addition to the sample time tag. Step 602 will be described in conjunction with the accompanying drawings. Fig. 7 is a schematic diagram of a sample type and correspondence matching process provided in the embodiment of the present application. As can be seen in fig. 7:
701: and if the sample type is the non-deletion type and the corresponding relation is that the time information corresponding to the prediction result is smaller than the sample time label, determining that the sample type is matched with the corresponding relation.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the non-deletion type). The predicted result is tx, and comparison shows that tx is smaller than ty1. In other words, the prediction is that an access event to the target data will occur at the time difference tx after the current time t. Since tx is smaller than ty1, the event can be observed and no deleted data is generated, which is consistent with the sample type (non-deletion type). That is, the sample type matches the correspondence.
702: and if the sample type is the deletion type and the corresponding relation is that the time information corresponding to the prediction result is greater than the sample time label, determining that the sample type is matched with the corresponding relation.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the deletion type). The predicted result is tx, and comparison shows that tx is greater than ty2. In other words, the prediction is that an access event to the target data will occur at the time difference tx after the current time t. Since tx is greater than ty2, the event cannot be observed and deleted data is generated, which is consistent with the sample type (deletion type). That is, the sample type matches the correspondence.
703: and if the sample type is the non-deletion type and the corresponding relation is that the time information corresponding to the prediction result is greater than the sample time label, determining that the sample type is not matched with the corresponding relation.
For example, let training sample Y6 be (x_1, y_1, e_1), with y_1 = ty1 and e_1 = 0 (indicating the non-deletion type). The predicted result is tx, and comparison shows that tx is greater than ty1. In other words, the prediction is that an access event to the target data will occur at the time difference tx after the current time t. Since tx is greater than ty1, the event cannot be observed and deleted data is generated, which is inconsistent with the sample type (non-deletion type). That is, the sample type does not match the correspondence.
704: and if the sample type is the deletion type and the corresponding relation is that the time information corresponding to the prediction result is smaller than the sample time label, determining that the sample type is not matched with the corresponding relation.
For example, let training sample Y7 be (x_2, y_2, e_2), with y_2 = ty2 and e_2 = 1 (indicating the deletion type). The predicted result is tx, and comparison shows that tx is smaller than ty2. In other words, the prediction is that an access event to the target data will occur at the time difference tx after the current time t. Since tx is smaller than ty2, the event can be observed and no deleted data is generated, which is inconsistent with the sample type (deletion type). That is, the sample type does not match the correspondence.
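The four cases 701 to 704 above can be summarized in a small sketch (a hypothetical helper, not part of the patent; the sample type is encoded as 0 for the non-deletion type and 1 for the deletion type, as in the examples):

```python
def matches(sample_type: int, predicted_time: float, sample_time_label: float) -> bool:
    """Return True when the sample type matches the correspondence between
    the prediction result and the sample time label (cases 701-704).
    sample_type: 0 = non-deletion type, 1 = deletion type."""
    if sample_type == 0:
        # Non-deletion type: the event was observed, so a matching prediction
        # must fall before the labelled access time (case 701).
        return predicted_time < sample_time_label
    # Deletion type: the event was not observed within the window, so a
    # matching prediction must fall after the labelled time difference (case 702).
    return predicted_time > sample_time_label
```

During training, mismatching cases (703 and 704) indicate prediction errors that the parameter optimization should penalize.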
As an alternative embodiment of optimizing the prediction model, the quality of the prediction model may be measured by the model prediction result when the optimization is performed:
Due to the existence of deleted data, the c-index is generally adopted to measure the effect of the prediction model. The c-index refers to the proportion of pairs, among all useful sample pairs, whose predicted ordering is consistent with the actual ordering.
The calculation steps are as follows:
1. Pair all the training samples to obtain sample pairs. For example, n samples generate n × (n-1) / 2 sample pairs;
2. If the sample with the smaller sample time in a pair is of the deletion type (i.e., its sample data is deleted data), or both samples in the pair are deleted data, the pair is regarded as an invalid pair; the remaining pairs are the useful pairs.
3. Count the number of consistent pairs among the useful pairs, i.e., pairs in which the individual with the longer actual sample time also receives the longer predicted time.
The c-index is then the number of consistent pairs divided by the number of useful pairs. The c-index ranges between 0 and 1; the closer it is to 1, the stronger the model's ability to distinguish cold and hot data. During training optimization, the model is trained continuously on the training samples so that its c-index approaches 1.
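The three-step calculation above can be sketched as follows (a minimal illustration; the function and variable names are our own, and ties in sample time are ignored for simplicity):

```python
from itertools import combinations

def c_index(samples):
    """samples: list of (actual_time, predicted_time, sample_type) tuples,
    where sample_type 1 marks the deleted (censored) type and 0 the
    non-deletion type. Returns the proportion of consistent useful pairs."""
    useful = consistent = 0
    for a, b in combinations(samples, 2):
        if a[0] > b[0]:
            a, b = b, a          # make a the sample with the smaller actual time
        if a[2] == 1:
            continue             # invalid pair: the earlier sample (or both) is censored
        useful += 1
        if a[1] < b[1]:          # longer actual time also predicted longer
            consistent += 1
    return consistent / useful if useful else 0.0
```

For instance, three non-censored samples whose predictions preserve the actual ordering yield a c-index of 1.0.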
After the time information is output by using the prediction model, the cold and hot data of the target data are further identified according to the time information. As will be described in detail below.
Fig. 8 is a schematic diagram of hot and cold data identification provided in the embodiment of the present application. As can be seen from fig. 8, the method specifically comprises the following steps:
801: and acquiring the time information respectively corresponding to the target data.
802: and performing cold and hot data identification on the target data according to the size of the time information.
As shown in step 802, when comparing according to the size of the time information, a sorting manner may be adopted. For example, 4 pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes and tx4 = 40 minutes. The sorting can be performed in order from large to small or from small to large. Assume the order obtained by sorting from small to large is tx1, tx2, tx3, tx4. The first 50% or the first 25% may then be marked as hot data and the remainder as cold data. The specific ratio can be set according to the practical situation (such as the size of the hot-data storage space); this is only an example and does not limit the technical solution of the present application.
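The sorting manner above can be illustrated with a minimal sketch (the function name and the 50% default are our own choices, not from the patent):

```python
def split_by_ranking(time_info, hot_fraction=0.5):
    """Sort target data by predicted time information (ascending, i.e. the
    data accessed soonest first) and mark the leading fraction as hot."""
    ranked = sorted(time_info, key=time_info.get)
    cut = int(len(ranked) * hot_fraction)
    return ranked[:cut], ranked[cut:]  # (hot, cold)

hot, cold = split_by_ranking({"tx1": 10, "tx2": 20, "tx3": 30, "tx4": 40})
```

With the 4-item example, the first half (tx1, tx2) becomes hot data and the rest cold data; passing 0.25 keeps only tx1 hot.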
In addition to the sorting manner, a threshold comparison manner may also be used. The method is specifically as follows:
the target data with the time information larger than a first time threshold value is marked as cold data, and the target data is stored to a first storage medium supporting low access performance.
And marking the target data with the time information less than or equal to the first time threshold value as hot data, and storing the target data to a second storage medium supporting high access performance.
For example, 4 pieces of time information are obtained: tx1 = 10 minutes, tx2 = 20 minutes, tx3 = 30 minutes and tx4 = 40 minutes. Assuming the first time threshold is 25 minutes, since tx3 and tx4 are both greater than 25 minutes, their corresponding target data are cold data; since tx1 and tx2 are both less than 25 minutes, their corresponding target data are hot data.
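The threshold manner can likewise be sketched (a hypothetical helper reproducing the 25-minute example above; names are ours):

```python
def classify_by_threshold(time_info, first_time_threshold):
    """Mark data whose time information exceeds the first time threshold as
    cold, and the rest as hot; returns sorted identifier lists."""
    cold = sorted(k for k, t in time_info.items() if t > first_time_threshold)
    hot = sorted(k for k, t in time_info.items() if t <= first_time_threshold)
    return hot, cold

hot, cold = classify_by_threshold({"tx1": 10, "tx2": 20, "tx3": 30, "tx4": 40}, 25)
```

Unlike the ranking manner, the threshold manner does not depend on the other items in the batch, so a single piece of data can be classified in isolation.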
In practical application, the storage state of the cold and hot data can be adjusted by utilizing the time information output by the prediction model. Specifically, storage timing is started for the cold data. The remaining time is determined based on the difference between the time information corresponding to the cold data and the storage timing. When the remaining time is less than a second time threshold, the cold data is migrated from the first storage medium to the second storage medium.
For example, after the prediction model outputs the time information (for example, 24 hours), the target data is recognized as cold data and stored into a first storage medium (such as an HDD medium). Storage timing is started for the cold data; as time progresses the storage timing accumulates to 23 hours, at which point the difference between the time information and the storage timing (i.e., the remaining time) is only 1 hour. With the second time threshold set to 1 hour, once the remaining time drops below 1 hour (for example, to 59 minutes), the remaining time is less than the second time threshold, indicating that the target data corresponding to the cold data is about to be accessed. To increase the access speed to the target data, the cold data may be migrated from the first storage medium into a second storage medium (such as an SSD medium) used for storing hot data.
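The remaining-time check above can be sketched as follows (a hypothetical helper; the names and the hour units are our own):

```python
def should_migrate(time_info_hours: float, storage_timing_hours: float,
                   second_threshold_hours: float = 1.0) -> bool:
    """Remaining time = predicted time information - accumulated storage timing.
    Migrate the cold data back to the hot-data medium once the remaining
    time falls below the second time threshold."""
    remaining = time_info_hours - storage_timing_hours
    return remaining < second_threshold_hours
```

In the 24-hour example, a timer reading of 23 hours 1 minute leaves 59 minutes remaining, which is below the 1-hour threshold and triggers migration.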
Based on the same idea, the embodiment of the application further provides a prediction model training method. Fig. 9 is a schematic flowchart of a predictive model training method according to an embodiment of the present application. As can be seen from fig. 9, the method specifically includes the following steps:
901: and constructing a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period.
902: and inputting the training samples into a prediction model to obtain a prediction result.
903: optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type; wherein the prediction model is used to identify cold and hot data.
As shown in step 903, the process of optimizing the parameters in the prediction model is as follows: determining the corresponding relation between the prediction result and the sample time label; and optimizing parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
The construction process of the training sample in step 901 includes the following steps:
obtaining an access record of sample data in a sampling period;
setting a random moment in the sampling time period to split the sampling time period to obtain a feature extraction time period and an observation window time period;
generating sample characteristics based on access records of the sample data in the feature extraction time period;
searching whether at least one access record aiming at the sample data exists after the random time;
when at least one access record aiming at the sample data exists, determining the sample time label according to the random moment and the at least one access record, and setting the sample type as a non-deletion type;
and when no access record aiming at the sample data exists after the random time, determining the sample time label according to the random time and the termination time of the observation window, and setting the sample type as the deletion type.
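The two labelling branches above can be sketched as follows (a hypothetical helper; time values are in arbitrary units, and 0/1 encode the non-deletion/deletion types):

```python
def build_label(access_times, random_time, window_end_time):
    """Derive (sample_time_label, sample_type) for one random time.
    sample_type: 0 = non-deletion type, 1 = deletion type."""
    later = [t for t in access_times if t > random_time]
    if later:
        # First time difference: random time to the earliest later access.
        return min(later) - random_time, 0
    # Second time difference: random time to the observation-window end.
    return window_end_time - random_time, 1
```

Re-running this helper with several random times over the same access records yields the group of training samples described above.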
Specifically, reference may be made to each embodiment corresponding to fig. 1 to 8, and details are not repeated here.
For ease of understanding, the overall process of cold and hot data identification will be specifically illustrated below, taking a survival analysis model as the prediction model. Fig. 10 is a schematic diagram illustrating the architecture of a cold and hot data recognition system according to an embodiment of the present application. As can be seen from fig. 10, the system includes a survival analysis server and an application server. The survival analysis server includes an Object Storage Service (OSS), a Relational Database Service (RDS), and a survival analysis model. A lightweight survival analysis model can be constructed based on algorithms such as Cox and RSF; such a model is simpler than a neural-network-based survival analysis algorithm and has a higher model index (c-index). Using the historical data of the application end as sample data, training samples are generated according to the embodiments corresponding to fig. 1 to 9, the survival analysis model is trained, and the trained model is optimized. Afterwards, the target data to be identified can be recognized, and cold and hot classified storage is carried out according to the nodes in the shared storage.
There are two important functions in survival analysis. One is the survival function, i.e. the probability that the event has not occurred before time t: S(t) = Pr(T > t). The other is the hazard (risk) function, which describes the instantaneous probability of the event occurring at time t given survival up to t: h(t) = lim_{Δt→0} Pr(t ≤ T < t + Δt | T ≥ t) / Δt.
Survival-analysis-based models all work by fitting the risk function. Taking the Cox model as an example, the Cox model assumes that the log-hazard varies linearly with the covariates (i.e., the features), i.e.
h(t, X) = h0(t) · exp(β1X1 + β2X2 + ... + βkXk)
where X = (X1, X2, X3, ..., Xk) are the k risk factors that affect the survival time t.
Then, the partial likelihood function (partial likelihood) of the Cox risk model is established; taking the logarithm of both sides and setting the partial derivative with respect to β to zero yields the maximum likelihood estimate of β.
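For illustration, the log partial likelihood can be computed directly (a minimal pure-Python sketch; the function and variable names are our own, and note that in the standard survival-analysis convention the event indicator is 1 when the event is observed, i.e. the non-deletion type here):

```python
from math import exp, log

def log_partial_likelihood(beta, xs, times, observed):
    """Log of the Cox partial likelihood:
    sum over observed events i of  beta.x_i - log( sum_{j: t_j >= t_i} exp(beta.x_j) ).
    observed[i] is 1 when the access event of sample i was observed
    (non-deletion type); deletion-type (censored) samples contribute
    only to the risk sets."""
    def dot(b, x):
        return sum(bi * xi for bi, xi in zip(b, x))

    total = 0.0
    for i in range(len(times)):
        if not observed[i]:
            continue
        risk = sum(exp(dot(beta, xs[j]))
                   for j in range(len(times)) if times[j] >= times[i])
        total += dot(beta, xs[i]) - log(risk)
    return total
```

Maximizing this quantity over β (for example by gradient ascent) gives the maximum likelihood estimate described above; with β = 0 every event term reduces to minus the log of its risk-set size.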
We translate the prediction of cold and hot data into a regression problem, i.e. predicting the time-to-event of the next access of the data. Because the observation window of the historical information is limited, this produces a large amount of deleted (censored) data: the next access of part of the samples is not observed within the observation window, but this does not mean those samples will never be accessed after the window. Conventional machine learning cannot make use of such samples. Survival analysis is naturally suitable for processing deleted data, so a limited-observation-window problem such as data cold/hot identification is very well suited to survival-analysis modeling, and a better effect is obtained by fitting the CHF (cumulative hazard function).
Based on the same idea, the embodiment of the application further provides a data processing device. Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus includes:
the determining module 1101 is configured to determine feature information according to the target data and the access record of the target data.
An input module 1102, configured to input the feature information into a prediction model, so as to obtain time information of future access of the target data; the prediction model is obtained by training a training sample, the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period.
An identifying module 1103, configured to perform cold and hot data identification on the target data according to the time information.
Optionally, a training module 1104 is further included for constructing training samples; inputting the training samples into a prediction model to obtain a prediction result; and optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type; wherein the prediction model is used to identify cold and hot data.
Optionally, the training module 1104 is further configured to determine a corresponding relationship between the prediction result and the sample time tag;
and optimizing parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
Optionally, the training module 1104 is further configured to determine that the sample type is matched with the corresponding relationship if the sample type is a non-deletion type and the corresponding relationship is that the time information corresponding to the prediction result is smaller than the sample time tag;
and if the sample type is the deletion type and the corresponding relation is that the time information corresponding to the prediction result is greater than the sample time label, determine that the sample type is matched with the corresponding relation.
Optionally, a sample construction module 1105 is further included for: obtaining an access record of sample data in a sampling period; setting a random moment in the sampling time period to split the sampling time period to obtain a feature extraction time period and an observation window time period; generating sample characteristics based on access records of the sample data in the feature extraction time period; searching whether at least one access record aiming at the sample data exists after the random time; when at least one access record aiming at the sample data exists, determining the sample time label according to the random moment and the at least one access record, and setting the sample type as the non-deletion type; and when no access record aiming at the sample data exists after the random time, determining the sample time label according to the random time and the termination time of the observation window, and setting the sample type as the deletion type.
Optionally, a sample construction module 1105 is further included for: adjusting the random time;
and searching whether at least one access record aiming at the sample data exists after the adjusted random moment so as to determine a sample time label and a sample type according to a searching result.
Optionally, a sample construction module 1105 is further included for: if the last access time in the observation window is later than the random time, marking the first time difference between the random time and the nearest access time after the random time as the sample time label, and marking the sample type as the non-deletion type.
Optionally, a sample construction module 1105 is further included for: if the last access time in the observation window period is earlier than the random time, marking the second time difference between the window end time and the random time as the sample time label, and marking the sample type as the deletion type.
Optionally, the system further includes an identification module 1103, configured to obtain the time information corresponding to the target data respectively; and performing cold and hot data identification on the target data according to the size of the time information.
Optionally, the identifying module 1103 is further configured to mark the target data with the time information greater than a first time threshold as cold data, and store the target data to a first storage medium supporting low access performance. And marking the target data with the time information less than or equal to the first time threshold value as hot data, and storing the target data to a second storage medium supporting high access performance.
Optionally, the system further comprises a migration module, configured to start storage timing for the cold data; determine the remaining time based on the difference between the time information corresponding to the cold data and the storage timing; and migrate the cold data from the first storage medium to the second storage medium when the remaining time is less than a second time threshold.
In one embodiment of the present application, there is provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the data processing method as described in fig. 1 to 8. Reference may be made in particular to the above-mentioned examples.
An embodiment of the application also provides an electronic device. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 1201, a processor 1202 and a communication component 1203; wherein,
the memory 1201 is used for storing programs;
the processor 1202, coupled with the memory, is configured to execute the program stored in the memory to:
determining characteristic information according to target data and the access record of the target data;
inputting the characteristic information into a prediction model to obtain time information of the target data to be accessed in the future; the prediction model is obtained by training a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
and performing cold and hot data identification on the target data according to the time information.
The memory 1201 described above may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Further, the processor 1202 in this embodiment may specifically be: and the programmable exchange processing chip is provided with a data copying engine and can copy the received data.
When the processor 1202 executes the program in the memory, other functions may be implemented in addition to the above functions, which may be specifically referred to in the description of the foregoing embodiments. Further, as shown in fig. 12, the electronic apparatus further includes: power components 1204, and the like.
Based on the same idea, the embodiment of the application further provides a prediction model training device. Fig. 13 is a schematic structural diagram of a prediction model training apparatus according to an embodiment of the present application. The prediction model training apparatus includes:
the sample construction module 1301 is configured to construct a training sample, where the training sample includes a sample feature, a sample time tag, and a sample type, and the sample time tag and the sample type are determined by whether there is an access record of data corresponding to the sample feature before and after a random time within a sample sampling period.
An input module 1302, configured to input the training samples into a prediction model to obtain a prediction result.
An optimizing module 1303, configured to optimize parameters in the prediction model according to the prediction result, the sample time tag, and the sample type; wherein the prediction model is used to identify cold and hot data.
Optionally, the optimizing module 1303 is further configured to determine a corresponding relationship between the prediction result and the sample time tag; and optimizing parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
Optionally, the sample constructing module 1301 is further configured to obtain an access record of sample data in a sampling time period; set a random moment in the sampling time period to split the sampling time period to obtain a feature extraction time period and an observation window time period; generate sample characteristics based on access records of the sample data in the feature extraction time period; search whether at least one access record aiming at the sample data exists after the random time; when at least one access record aiming at the sample data exists, determine the sample time label according to the random moment and the at least one access record, and set the sample type as the non-deletion type; and when no access record aiming at the sample data exists after the random time, determine the sample time label according to the random time and the termination time of the observation window, and set the sample type as the deletion type.
In one embodiment of the present application, a non-transitory machine-readable storage medium having executable code stored thereon is provided, which when executed by a processor of an electronic device, causes the processor to perform a predictive model training method as described in fig. 9. Reference may be made in particular to the above-mentioned examples.
An embodiment of the application also provides an electronic device. The electronic device is a standby node electronic device in a computing unit. Fig. 14 is a schematic structural diagram of another electronic device provided in the embodiment of the present application. The electronic device comprises a memory 1401, a processor 1402 and a communication component 1403; wherein,
the memory 1401 for storing a program;
the processor 1402, coupled to the memory, is configured to execute the program stored in the memory to:
constructing a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
inputting the training samples into a prediction model to obtain a prediction result;
optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type;
wherein the prediction model is used to identify cold and hot data.
The memory 1401 described above may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Further, the processor 1402 in this embodiment may specifically be: and the programmable exchange processing chip is provided with a data copying engine and can copy the received data.
The processor 1402, when executing the program in the memory, may also implement other functions in addition to the above functions, which may be referred to in the foregoing description of the embodiments. Further, as shown in fig. 14, the electronic apparatus further includes: power supply component 1404, and the like.
Based on the above embodiment, the feature information of the target data is input into a pre-trained prediction model, and the prediction model can obtain the time information of the target data to be accessed in the future, i.e. the time difference between the time to be accessed and the current time. In order to enable the prediction model to accurately obtain the future visit time information, the constructed training sample comprises a sample characteristic, a sample time label and a sample type, wherein the sample time label and the sample type are determined by whether visit records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period. The prediction model trained by the training samples can accurately identify cold and hot data of the target data. By the scheme, the sample time label is used as the label of the training sample, the prediction model is trained, the time interval of the next visit of the target data based on the prediction model is used as the prediction result, and cold and hot data recognition of the target data is realized according to the time interval, so that the accuracy of cold and hot recognition of the target data can be effectively improved.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, which is not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (14)
1. A method of data processing, the method comprising:
determining characteristic information according to target data and the access record of the target data;
inputting the characteristic information into a prediction model to obtain time information of the target data to be accessed in the future; the prediction model is obtained by training a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
and performing cold and hot data identification on the target data according to the time information.
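A minimal sketch of the inference path in claim 1, assuming the trained model is available as a callable that maps a feature vector to a predicted time-to-next-access; the function names and the threshold are hypothetical, not part of the claims.

```python
def identify_hot_cold(feature_vector, model_predict, threshold):
    """Classify target data as hot or cold from the predicted access interval.

    model_predict: callable returning the predicted time until the next
    access (the "time information" of claim 1). Data predicted to be
    accessed again soon is hot; data with a long predicted interval is cold.
    """
    interval = model_predict(feature_vector)
    label = "cold" if interval > threshold else "hot"
    return label, interval
```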
2. The method of claim 1, wherein the predictive model is trained by:
constructing a training sample;
inputting the training samples into a prediction model to obtain a prediction result;
optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type;
wherein the prediction model is used to identify cold and hot data.
3. The method of claim 2, wherein optimizing parameters in the predictive model based on the prediction, sample time labels, and sample types comprises:
determining the corresponding relation between the prediction result and the sample time label;
and optimizing parameters in the prediction model according to the matching result of the sample type and the corresponding relation.
4. The method of claim 3, wherein the matching of the sample type to the corresponding relation comprises:
if the sample type is a non-deletion type and the corresponding relation is that the time information corresponding to the prediction result is smaller than the sample time label, determining that the sample type matches the corresponding relation;
and if the sample type is a deletion type and the corresponding relation is that the time information corresponding to the prediction result is greater than the sample time label, determining that the sample type matches the corresponding relation.
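The matching rule of claims 3-4 can be written directly as a predicate. This is a literal transcription of the claim text, with hypothetical type strings:

```python
def prediction_matches(sample_type, predicted_time, time_label):
    """Check a prediction against a (possibly censored) sample label.

    Per claim 4: a non-deletion (uncensored) sample matches when the
    predicted time is smaller than the sample time label; a deletion
    (censored) sample, whose true next access lies beyond the observation
    window, matches when the predicted time is greater than the label.
    """
    if sample_type == "non-deletion":
        return predicted_time < time_label
    if sample_type == "deletion":
        return predicted_time > time_label
    raise ValueError(f"unknown sample type: {sample_type!r}")
```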
5. The method of claim 2, wherein the constructing a training sample comprises:
obtaining an access record of sample data within a sampling period;
setting a random time within the sampling period to split the sampling period into a feature extraction time period and an observation window time period;
generating sample features based on access records of the sample data within the feature extraction time period;
searching for whether at least one access record for the sample data exists after the random time;
when at least one access record for the sample data exists, determining the sample time label according to the random time and the at least one access record, and setting the sample type as a non-deletion type;
and when no access record for the sample data exists after the random time, determining the sample time label according to the random time and the termination time of the observation window, and setting the sample type as a deletion type.
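The construction steps of claim 5 can be sketched end to end as follows. The two features shown (access count and recency before the split) are illustrative placeholders only; the claims do not fix a concrete feature set.

```python
import random

def build_training_sample(access_times, period_start, period_end, rng=random):
    """Build one training sample per claim 5.

    A random time splits the sampling period into a feature extraction
    period (before) and an observation window (after). Features come from
    accesses before the split; the label and type come from whether any
    access follows it.
    """
    t_random = rng.uniform(period_start, period_end)
    before = [t for t in sorted(access_times) if t <= t_random]
    after = [t for t in sorted(access_times) if t > t_random]
    features = {
        "access_count": len(before),
        "time_since_last": t_random - (before[-1] if before else period_start),
    }
    if after:  # non-deletion (uncensored): the next access was observed
        return features, after[0] - t_random, "non-deletion"
    # deletion (censored): no further access within the observation window
    return features, period_end - t_random, "deletion"
```

Re-invoking this with a fresh random time, as in claim 6, yields additional samples from the same access record.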
6. The method of claim 5, further comprising:
adjusting the random time;
and searching for whether at least one access record for the sample data exists after the adjusted random time, so as to determine a sample time label and a sample type according to the search result.
7. The method of claim 5, wherein the determining the sample time label according to the random time and the at least one access record comprises:
if the last access time within the observation window is later than the random time, marking a first time difference between the random time and the nearest access time after the random time as the sample time label, and marking the sample type as a non-deletion type.
8. The method of claim 5, wherein the determining the sample time label according to the random time and the termination time of the observation window comprises:
if the last access time within the observation window is earlier than the random time, marking a second time difference between the termination time of the observation window and the random time as the sample time label, and marking the sample type as a deletion type.
9. The method of claim 1, wherein the performing hot and cold data recognition on the target data according to the time information comprises:
acquiring the time information corresponding to each item of the target data;
and performing cold and hot data identification on the target data according to the magnitude of the time information.
10. The method of claim 1 or 9, further comprising:
marking target data whose time information is greater than a first time threshold as cold data, and storing the cold data in a first storage medium with lower access performance;
and marking target data whose time information is less than or equal to the first time threshold as hot data, and storing the hot data in a second storage medium with higher access performance.
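Claim 10's tiering decision reduces to a threshold test. The medium names and the threshold below are hypothetical placeholders for whatever the deployment uses (e.g. an archive tier versus an SSD tier):

```python
def route_by_prediction(predicted_interval, first_threshold):
    """Map the predicted time-to-next-access to a storage tier per claim 10."""
    if predicted_interval > first_threshold:
        # long predicted idle time: cold data goes to the slower first medium
        return "cold", "first_medium"
    # imminent re-access: hot data stays on the faster second medium
    return "hot", "second_medium"
```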
11. The method of claim 10, further comprising:
timing the storage duration of the cold data;
determining a remaining time according to the difference between the time information corresponding to the cold data and the timed storage duration;
migrating the cold data from the first storage medium to the second storage medium when the remaining time is less than a second time threshold.
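Claim 11's migration check compares the time the cold data has already spent in slow storage against its predicted access interval; the names and time units below are hypothetical.

```python
def should_migrate(stored_elapsed, predicted_interval, second_threshold):
    """Return True when cold data should move back to fast storage.

    remaining = predicted time-to-next-access minus time already stored;
    when it drops below the second threshold, the next access is imminent
    and the data is migrated ahead of it (claim 11).
    """
    remaining = predicted_interval - stored_elapsed
    return remaining < second_threshold
```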
12. A predictive model training method, comprising:
constructing a training sample, wherein the training sample comprises a sample characteristic, a sample time label and a sample type, and the sample time label and the sample type are determined by whether access records of data corresponding to the sample characteristic exist before and after a random time in a sample sampling period;
inputting the training samples into a prediction model to obtain a prediction result;
optimizing parameters in the prediction model according to the prediction result, the sample time label and the sample type;
wherein the prediction model is used to identify cold and hot data.
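The claims leave the optimization objective open beyond the matching rule of claims 3-4. One common choice for such right-censored targets, shown here as an assumption rather than as the patent's own loss, penalizes censored (deletion) samples only when the model predicts a re-access earlier than the censoring time:

```python
def censored_loss(predicted_time, time_label, sample_type):
    """Per-sample training loss sketch for right-censored access intervals.

    Non-deletion samples: the next access was observed at time_label, so an
    ordinary squared error applies. Deletion samples: the true next access
    is only known to be later than time_label, so only under-prediction is
    penalized (a hinge-style squared term).
    """
    if sample_type == "non-deletion":
        return (predicted_time - time_label) ** 2
    return max(0.0, time_label - predicted_time) ** 2
```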
13. A non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-11, or the method of claim 12.
14. An electronic device comprising a memory and a processor; wherein,
the memory is used for storing programs;
the processor, coupled with the memory, configured to execute the program stored in the memory, so as to implement the method of any one of the preceding claims 1 to 11, or the method of claim 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111350521.6A CN114169401A (en) | 2021-11-15 | 2021-11-15 | Data processing and prediction model training method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114169401A true CN114169401A (en) | 2022-03-11 |
Family
ID=80479132
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111350521.6A Pending CN114169401A (en) | 2021-11-15 | 2021-11-15 | Data processing and prediction model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114169401A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115102779A (en) * | 2022-07-13 | 2022-09-23 | 中国电信股份有限公司 | Prediction model training and access request decision method, device and medium |
CN115102779B (en) * | 2022-07-13 | 2023-11-07 | 中国电信股份有限公司 | Prediction model training and access request decision method, device and medium |
CN116662638A (en) * | 2022-09-06 | 2023-08-29 | 荣耀终端有限公司 | Data acquisition method and related device |
CN116662638B (en) * | 2022-09-06 | 2024-04-12 | 荣耀终端有限公司 | Data acquisition method and related device |
CN115494959A (en) * | 2022-11-15 | 2022-12-20 | 四川易景智能终端有限公司 | Multifunctional intelligent helmet and management platform thereof |
CN117076523A (en) * | 2023-10-13 | 2023-11-17 | 北京云成金融信息服务有限公司 | Local data time sequence storage method |
CN117076523B (en) * | 2023-10-13 | 2024-02-09 | 华能资本服务有限公司 | Local data time sequence storage method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114169401A (en) | Data processing and prediction model training method and device | |
CN111881983B (en) | Data processing method and device based on classification model, electronic equipment and medium | |
US20100205168A1 (en) | Thread-Based Incremental Web Forum Crawling | |
CN111930848B (en) | Data partition storage method, device and system | |
CN102236674B (en) | Method and device for updating index page | |
CN111258593B (en) | Application program prediction model building method and device, storage medium and terminal | |
Yin et al. | Temporal dynamics of user interests in tagging systems | |
CN110706015A (en) | Advertisement click rate prediction oriented feature selection method | |
CN113704599A (en) | Marketing conversion user prediction method and device and computer equipment | |
CN114706894A (en) | Information processing method, apparatus, device, storage medium, and program product | |
CN110737432A (en) | script aided design method and device based on root list | |
CN117370272A (en) | File management method, device, equipment and storage medium based on file heat | |
CN114625805B (en) | Return test configuration method, device, equipment and medium | |
WO2023048807A1 (en) | Hierarchical representation learning of user interest | |
CN109189696A (en) | A kind of photo classification device training method, SSD caching system and caching method | |
CN111539208B (en) | Sentence processing method and device, electronic device and readable storage medium | |
CN114072788B (en) | Method and system for random sampling from search engine | |
CN114090535A (en) | Model training method, data storage method and device and electronic equipment | |
Karami et al. | Maintaining accurate web usage models using updates from activity diagrams | |
CN115221307A (en) | Article identification method and device, computer equipment and storage medium | |
CN112582080A (en) | Internet of things equipment state monitoring method and system | |
CN111488496A (en) | Sliding window based Tango tree construction method and system | |
CN113553320B (en) | Data quality monitoring method and device | |
CN117609175B (en) | Configurable industrial control file acquisition and analysis method and system | |
CN117971913B (en) | System for collecting feedback instruction by using database information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40070320; Country of ref document: HK |