CN108470071B - Data processing method and device

Info

Publication number: CN108470071B
Application number: CN201810271496.4A
Authority: CN
Inventors: 杨帆; 金宝宝; 张成松
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2022-02-18
Anticipated expiration: 2038-03-29
Also published as: CN108470071A

Abstract

In the data processing method and device, after data to be processed is obtained, feature data of the data to be processed is extracted, wherein the feature data comprises at least one of: a feature of the data to be processed in a time dimension or a feature in a data access frequency dimension. The feature data is then processed by a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained data processing model determines the life cycle state from at least one of the features in the time dimension or the data access frequency dimension, the determination specifically uses the rule of correspondence between those features of big data and the life cycle state of big data.

Description

Data processing method and device

Technical Field

The invention belongs to the field of machine learning, and particularly relates to a data processing method and device.

Background

In a production environment, data whose life cycle has ended, such as tables and log files, often needs to be deleted to release storage space and keep the system running normally, which requires determining whether the life cycle of the data has ended.

Two methods are currently used to determine whether the life cycle of data has ended. The first presets a threshold T and compares the total survival time t of the data, measured from its generation time, with T: data with t > T is regarded as data whose life cycle has ended and is deleted; otherwise the life cycle is regarded as not ended. The second relies on manual review: data judged to be no longer useful is regarded as data whose life cycle has ended and is deleted.

Both methods have defects. In the first, the threshold T is difficult to determine, so the accuracy of judging whether the data life cycle has ended is low, which in turn harms the deletion operation: if T is set too large, much useless data cannot be deleted; if T is set too small, useful data is often deleted along with the useless data. The second method incurs high labor cost because the data must be examined and judged manually, item by item.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a data processing method and apparatus, which are used to overcome the above problems in the prior art, so that the determined data lifecycle status has higher accuracy under the premise of low labor cost.

Therefore, the invention discloses the following technical scheme:

a method of data processing, comprising:

acquiring data to be processed;

extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;

and processing the characteristic data by using a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.

Preferably, in the above method, the extracting feature data of the predetermined feature of the data to be processed includes:

extracting at least one of: the survival duration of the data to be processed from its creation time point to the current time, the duration from its last access time point to the current time, or its data access frequency in at least one most recent predetermined time period.

Preferably, in the method, the processing the feature data by using a pre-trained data processing model to obtain a processing result includes:

obtaining the data type of the data to be processed;

determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different;

and inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed.

Preferably, the determining the life cycle state of the data to be processed includes:

obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, wherein the first confidence information and the second confidence information are included in the processing result;

and determining the life cycle state of the data to be processed based on the first confidence coefficient information and the second confidence coefficient information.

In the above method, preferably, the training process of the data processing model includes:

obtaining a plurality of training samples;

extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: the features of the training sample in the time dimension or the data access frequency dimension;

marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample;

establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result;

and training a data processing model by utilizing the corresponding relation data of each training sample based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.

The above method, preferably, the obtaining a plurality of training samples includes:

randomly selecting a batch of data from a production environment, and using the randomly selected data as a training sample.

A data processing apparatus comprising:

an acquisition unit for acquiring data to be processed;

an extraction unit for extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;

and the processing unit is used for processing the characteristic data by utilizing a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.

The above apparatus, preferably, the extraction unit is specifically configured to:

The above apparatus, preferably, the processing unit is specifically configured to:

obtaining the data type of the data to be processed;

The above apparatus, preferably, further comprises:

a pre-processing unit to:

obtaining a plurality of training samples;

According to the above scheme, after data to be processed is obtained, feature data of the data to be processed is extracted, wherein the feature data comprises at least one of: a feature of the data to be processed in a time dimension or a feature in a data access frequency dimension. The feature data is then processed by a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained data processing model determines the life cycle state from at least one of the features in the time dimension or the data access frequency dimension, the determination specifically uses the rule of correspondence between those features of big data and the life cycle state of big data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below are merely embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flowchart of a data processing method according to the first embodiment of the present application;

FIG. 2 is an overall framework diagram of training a data processing model and performing data life cycle prediction based on the trained model according to the second embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation process of a training data processing model provided in the second embodiment of the present application;

FIG. 4 is a flowchart of a data processing method provided in the third embodiment of the present application;

FIG. 5 is a flowchart of a data processing method according to a fourth embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data processing apparatus according to a fifth embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data processing apparatus according to a sixth embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application provides a data processing method and device to solve two problems of the prior art: the low accuracy of the data life cycle state determined by setting a threshold T, and the high labor cost of manual review. The determined data life cycle state thus has higher accuracy at low labor cost. The scheme of the present application is explained below through a plurality of embodiments.

FIG. 1 is a flowchart of a first embodiment of the data processing method provided in the present application. The method is applicable to various terminal devices such as a smart phone, a tablet computer (PAD), a personal digital assistant (PDA), a notebook, a desktop, or a kiosk, and may also be applicable to various general-purpose or special-purpose servers. As shown in FIG. 1, the data processing method includes the following processing steps:

Step 101, obtaining data to be processed.

The data to be processed may be various types of data generated or created in an actual production environment, such as but not limited to data table data or log data of a log file, etc.

For such data, the life cycle state often needs to be known for a specific purpose. For example, to delete data whose life cycle has ended (also called failed data) and release storage space, it is necessary to know whether the life cycle of the data has ended; to store data in a classified manner according to access frequency, it is necessary to know whether the data is in a frequently accessed, rarely accessed, or failed state. In view of this, the objective of the present application is to determine the life cycle state of data with low labor cost and high accuracy.

Determining the data life cycle state depends on how life cycle states are divided; different division modes yield different candidate states when the data life cycle state is predicted.

For example, if data is divided according to whether it is invalid, the life cycle state can be one of two states: life cycle not ended and life cycle ended. If data is divided according to access frequency, the life cycle state can be one of states such as frequently accessed, rarely accessed, and failed.
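As a minimal sketch, the two division modes described above could be represented as enumerations of candidate states. The names `BinaryState`, `AccessState`, and `candidate_states`, and the scheme keys, are illustrative assumptions and do not come from the original text.

```python
from enum import Enum


class BinaryState(Enum):
    """Division by whether the data has become invalid."""
    NOT_ENDED = "life cycle not ended"
    ENDED = "life cycle ended"


class AccessState(Enum):
    """Division by how frequently the data is accessed."""
    FREQUENT = "frequently accessed"
    RARE = "rarely accessed"
    FAILED = "failed"


def candidate_states(scheme: str) -> list:
    """Return the candidate states for a division scheme (hypothetical keys)."""
    schemes = {"validity": BinaryState, "access_frequency": AccessState}
    return list(schemes[scheme])
```

Whichever division mode is chosen fixes the set of labels used during training and the set of states the model can predict.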

Step 102, extracting feature data of predetermined features of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in the time dimension or the data access frequency dimension.

The characteristics of the data to be processed in the time dimension may include, but are not limited to: the survival duration of the data to be processed from its creation time point to the current time, and the duration from its last access time point to the current time.

Characteristics of the data to be processed in the data access frequency dimension may include, but are not limited to: the data access frequency of the data to be processed in at least one predetermined time period, for example, the access frequency of the data to be processed in the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year or the last 5 years, etc.
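The time-dimension and access-frequency-dimension features described above can be sketched as follows. This is only an illustration under stated assumptions: the `DataRecord` structure, the feature names, and the use of access counts per window as the measure of access frequency are all hypothetical, and the window lengths are picked from the granularities listed in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataRecord:
    """Hypothetical metadata for one data table or log file."""
    created_at: datetime
    access_times: list  # datetimes of recorded accesses, in any order


# Assumed windows (in days), following the granularities listed in the text.
WINDOWS_DAYS = {"1d": 1, "1w": 7, "15d": 15, "1m": 30, "1y": 365}


def extract_features(record: DataRecord, now: datetime) -> dict:
    """Extract time-dimension and access-frequency-dimension features."""
    # If the data was never accessed, fall back to its creation time.
    last_access = max(record.access_times, default=record.created_at)
    features = {
        # Survival duration from the creation time point to now.
        "survival_days": (now - record.created_at).total_seconds() / 86400,
        # Duration from the last access time point to now.
        "idle_days": (now - last_access).total_seconds() / 86400,
    }
    # Number of accesses inside each most recent time window.
    for name, days in WINDOWS_DAYS.items():
        start = now - timedelta(days=days)
        features[f"access_count_{name}"] = sum(
            1 for t in record.access_times if start <= t <= now
        )
    return features
```

The resulting dictionary can be flattened into a fixed-order feature vector before it is passed to a model.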

For different types of data to be processed, either the same features or different features may be extracted, which is not limited in the present application.

Specifically, for different types of data, such as a data table and a log file, the features to be extracted may be set to be the same for both, namely the features in the time dimension and/or the data access frequency dimension.

Alternatively, different features to be extracted may be set for different types of data. For example, for the data table, the features to be extracted may be set as features of the time dimension, the data access frequency dimension, and the value dimension; for the log file, they may be set as features of the time dimension, the data access frequency dimension, and the like.

The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset value levels: high, medium, and low. For example, during data feature extraction, for a data table of intermediate data, the importance of intermediate data in an actual production environment is often low and enterprise personnel or users generally do not pay much attention to it, so the feature value of the value feature may be extracted as "low". For a data table of result data, the importance of result data in an actual production environment is often high and it is generally a key concern of enterprise personnel or users, so the feature value of the value feature may be extracted as "high".
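One possible way to express the per-type feature sets and the value-level feature is sketched below. The type keys, the role names `intermediate` and `result`, the numeric encoding of the levels, and the default of "medium" for other tables are assumptions made for illustration only.

```python
# Assumed feature dimensions per data type, following the example in the text.
FEATURES_BY_TYPE = {
    "data_table": ("time", "access_frequency", "value"),
    "log_file": ("time", "access_frequency"),
}

# Assumed ordinal encoding of the three preset value levels.
VALUE_LEVELS = {"low": 0, "medium": 1, "high": 2}


def value_feature(table_role: str) -> int:
    """Map a data table's role to an encoded value level (hypothetical roles)."""
    # Intermediate tables are rated "low" and result tables "high";
    # any other role defaults to "medium" in this sketch.
    level = {"intermediate": "low", "result": "high"}.get(table_role, "medium")
    return VALUE_LEVELS[level]
```

An ordinal encoding keeps the ordering low < medium < high visible to the model; a one-hot encoding would be an equally reasonable choice.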

Step 103, processing the characteristic data by using a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.

The data processing model is a model trained by using large-batch data of types such as a data table or a log file in advance, and can describe the corresponding relation rule between the data characteristics (such as the characteristics of time dimension and/or access frequency dimension) of the data and the life cycle state of the data.

In view of this, in this step the feature data of the data to be processed may be provided to the data processing model as its input. Based on the learned rule of correspondence between data features and data life cycle states, the model outputs processing result data indicating the life cycle state of the data to be processed. The life cycle state can then be determined from this processing result data, for example by determining whether the life cycle of the data to be processed has ended, or whether the data is in a frequently accessed, rarely accessed, or failed state.

The data processing method of the present application needs to be developed on the basis of the pre-training of the data processing model, that is, as shown in fig. 2, a data processing model needs to be pre-trained by using training samples, and on this basis, the life cycle state prediction of data can be performed on the basis of the data processing model.

The second embodiment provides an implementation process for training the data processing model. FIG. 3 is a flowchart of this training process, which includes:

Step 301, obtaining a plurality of training samples.

The training samples may be, but are not limited to, batch data randomly selected from an actual production environment, such as batch log file data, or batch spreadsheet data, etc.
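Random selection of a batch from the production environment could look like the following sketch. The function name, the representation of production data as a list of identifiers, and the optional seed are assumptions for illustration.

```python
import random


def select_training_samples(production_items: list, batch_size: int, seed=None) -> list:
    """Randomly pick a batch of items (e.g. table or log file identifiers)."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    batch_size = min(batch_size, len(production_items))
    # Sampling without replacement, so no item appears twice in the batch.
    return rng.sample(production_items, batch_size)
```

Sampling without replacement avoids duplicate training samples; the seed is only there so that a labeling run can be repeated on the same batch.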

The data type of the training samples is determined by the function required of the data processing model to be trained: if the model is required to predict the life cycle state of data table data, the training samples are data of the data table type; if the model is required to predict the life cycle state of log data, the training samples are data of the log type.

Step 302, extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: the features of the training sample in the time dimension or the data access frequency dimension.

The features of the training samples in the time dimension may include, but are not limited to: the survival duration of the training sample from its creation time point to the current time, and the duration from its last access time point to the current time.

The features of the training samples in the data access frequency dimension may include, but are not limited to: the frequency of data access of the training samples during at least one predetermined time period, such as the frequency of access of the training samples during the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year, the last 5 years, etc.

For training samples of different types, either the same features or different features may be extracted, which is not limited in the present application.

Specifically, for different types of training samples such as a data table and a log file, the features to be extracted may be set to be the same for both, namely the features in the time dimension and/or the data access frequency dimension.

Or, for different types of training samples such as a data table and a log file, different features to be extracted may be set for the training samples, for example, for the data table, the features to be extracted may be set as features of a time dimension, features of a data access frequency dimension, and features of a value dimension, and for the log file, the features to be extracted may be set as features of the time dimension, features of the data access frequency dimension, and the like.

The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset value levels: high, medium, and low. For example, when data feature extraction is performed on a training sample, for a data table of intermediate data, the importance of intermediate data in an actual production environment is often low and enterprise personnel or users generally do not pay much attention to it, so the feature value of the value feature of the data table can be extracted as "low". For a data table of result data, the importance of result data in an actual production environment is often high and it is generally a key concern of enterprise personnel or users, so the feature value of the value feature of the data table can be extracted as "high".

Step 303, marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample.

Specifically, the life cycle state of each training sample can be labeled by means of manual labeling and the like.

The life cycle state labeling result of a training sample depends on how life cycle states are divided. For example, if data is divided, according to whether it is invalid, into the two states of life cycle ended and life cycle not ended, the labeling result of a training sample is one of these two states; if the life cycle is divided, according to access frequency, into the three states of frequently accessed, rarely accessed, and failed, the labeling result is one of these three states.

It should be noted that the process of labeling the life cycle state of the training samples in this step and the process of extracting the features of the training samples in the previous step are not limited to the order given in this embodiment. In practical application, feature extraction may be performed first and state labeling afterwards, state labeling may be performed first and feature extraction afterwards, or the two processes may be performed at the same time, which is not limited in this embodiment.

Step 304, establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result.

On the basis of extracting the feature data of the training samples and marking the life cycle state of the training samples, the corresponding relation between the feature data of each training sample and the life cycle state marking result can be established, so that support is provided for training of the data processing model.
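The correspondence data of step 304 can be sketched as a list of (feature data, label) pairs. Keying both inputs by a sample identifier is an assumption of this sketch, as are the function and parameter names; it lets feature extraction and labeling run in either order or in parallel before they are joined.

```python
def build_correspondence(features_by_id: dict, labels_by_id: dict) -> list:
    """Pair each sample's feature vector with its labeled life cycle state."""
    pairs = []
    for sample_id, features in features_by_id.items():
        if sample_id not in labels_by_id:
            continue  # skip samples that have not been labeled yet
        pairs.append((features, labels_by_id[sample_id]))
    return pairs
```

For example, `build_correspondence({"t1": [0.9, 0.0]}, {"t1": 1})` yields a single pair whose label 1 could stand for "life cycle ended" under a binary division.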

Step 305, based on a predetermined machine learning algorithm, training a data processing model by using the corresponding relation data of each training sample, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.

The Machine learning algorithm may be, but is not limited to, any one of a random forest, a KNN (k-nearest neighbor classification algorithm), a logistic regression, an SVM (Support Vector Machine), and the like, and in a specific application, a technician may select an appropriate Machine learning algorithm based on actual model characteristic requirements.

After the correspondence between the feature data of each training sample and its life cycle state labeling result is established, model parameters of the selected machine learning algorithm, such as the feature weights and the learning rate of the initial model, can be initialized. The initialized model is then trained using the correspondence data, so that it continuously learns the rule of correspondence between the data features of the training samples and their life cycle states and continuously adjusts and optimizes the weight of each feature. The result is a data processing model capable of outputting a corresponding life cycle state prediction result based on input feature data.
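To make the idea of initializing feature weights and a learning rate and then repeatedly adjusting the weights concrete, the following is a minimal pure-Python sketch using logistic regression, one of the algorithms listed above. It is not the patent's implementation: the function names, the zero initialization, the default learning rate and epoch count, the binary labels (1 for life cycle ended, 0 for not ended), and the assumption that features are already scaled to roughly [0, 1] are all choices made for illustration.

```python
import math


def train_logistic(pairs, learning_rate=0.1, epochs=500):
    """Fit feature weights by stochastic gradient descent.

    `pairs` is a list of (feature_vector, label); label 1 = life cycle ended.
    """
    n_features = len(pairs[0][0])
    weights = [0.0] * n_features  # initial feature weights
    bias = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "ended"
            error = p - y
            # Adjust each feature weight against the prediction error.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict_ended_probability(model, x):
    """Return the model's confidence that the life cycle has ended."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```

With feature vectors of the form [scaled idle time, scaled recent access count], a model trained on samples where long-idle data is labeled as ended would be expected to learn a positive weight for idle time and a negative weight for access count.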

The data type of the data that can be processed by the trained data processing model generally needs to be consistent with the data type of the training sample, for example, if the data type of the training sample is a data table type, the trained data processing model can process the data of the data table, that is, the life cycle state prediction result of the data table can be output according to the data characteristics of the input data of the data table; if the data type of the training sample is the log type, the trained data processing model can process the data of the log type, namely, the life cycle state prediction result of the log data can be output according to the data characteristics of the input log data.

That is, the data processing model trained based on training samples of different data types is generally only applicable to life cycle state prediction of data of the same type as the training sample data.

In order to provide users with a life cycle state prediction function suitable for different types of data, the trained data processing models suited to different data types can be integrated as sub-models into a general model. The general model is applicable to life cycle state prediction for different types of data at the same time. In practical application, a general interface located above the sub-models can be provided for the general model (for example, a data input and state prediction interface designed for each type of data), so that users are offered a life cycle state prediction function that applies to different types of data simultaneously. The general interface at least identifies the data type of the data to be predicted and calls the corresponding sub-model, according to that data type, to predict the data life cycle state.

In this embodiment, based on a predetermined machine learning algorithm, a data processing model is trained using training samples of a given data type, providing a life cycle state prediction function for data of that type. Data processing models suited to different data types are integrated as sub-models, and a general interface located above the sub-models is provided for the resulting general model, providing a general life cycle state prediction function for various types of data.

FIG. 4 is a flowchart of a third embodiment of the data processing method provided by the present application. In this embodiment, the data processing model is preferably a model that can perform life cycle state prediction on data of different data types (e.g., data table data and log data) at the same time. Specifically, the data processing model includes more than one sub-processing model; different sub-processing models correspond to different data types and are used for life cycle state prediction on data of those types, and the data feature types and/or feature weights corresponding to different sub-processing models are different.

As shown in fig. 4, in this embodiment, the step 103 (processing the feature data by using a pre-trained data processing model to obtain a processing result) may be implemented by the following processing procedures:

Step 1031, obtaining the data type of the data to be processed.

The data to be processed may be, but is not limited to, data table data or log data.

The data type of the data to be processed may be identified, but not limited to, based on a data format of the data to be processed, a document type of a document in which the data to be processed is located (for example, there is a large difference between the data format and the document type of the data table and the log data), or pre-labeled type information, and the like.

Step 1032, determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data feature types and/or feature weights corresponding to different sub-processing models are different.

After the data type of the data to be processed is obtained, a sub-processing model matched with the data to be processed can be determined from a plurality of sub-processing models included in the data processing model according to the data type.

For example, if the data type of the data to be processed is data table data, a first sub-processing model capable of performing life cycle state prediction on the data table data may be determined from a plurality of sub-processing models included in the data processing model; if the data type of the data to be processed is log data, a second sub-processing model capable of predicting the life cycle state of the log data can be determined from a plurality of sub-processing models included in the data processing model.

Step 1033, inputting the feature data of the data to be processed into the sub-processing model corresponding to the data type, and obtaining a processing result of the data to be processed.

After determining the sub-processing model corresponding to the data type of the data to be processed, the feature data of the data to be processed may be input into the sub-processing model, and finally, the sub-processing model outputs a life cycle state prediction result corresponding to the feature data of the data to be processed (such as a feature of a time dimension, a feature of a data access frequency dimension, and/or a feature of a value dimension) based on a rule of a correspondence relationship between the data feature and the data life cycle state.
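Steps 1031 to 1033 can be sketched as a small dispatcher that sits above the sub-processing models. Everything named here is an assumption for illustration: the `GeneralModel` class, the `identify_data_type` helper, the suffix-to-type mapping, and the convention that a sub-processing model is a callable returning one confidence per state.

```python
from typing import Callable, Dict, List, Optional

# A sub-processing model maps a feature vector to per-state confidences.
SubModel = Callable[[List[float]], Dict[str, float]]

# Assumed mapping from file suffix to data type (illustrative only).
SUFFIX_TO_TYPE = {".log": "log_file", ".csv": "data_table"}


def identify_data_type(name: str, labeled_type: Optional[str] = None) -> str:
    """Step 1031: use a pre-labeled type if present, else the file suffix."""
    if labeled_type is not None:
        return labeled_type
    for suffix, data_type in SUFFIX_TO_TYPE.items():
        if name.endswith(suffix):
            return data_type
    raise ValueError(f"cannot identify data type of {name!r}")


class GeneralModel:
    """Upper-layer interface over one sub-processing model per data type."""

    def __init__(self) -> None:
        self._sub_models: Dict[str, SubModel] = {}

    def register(self, data_type: str, sub_model: SubModel) -> None:
        """Integrate a trained model as the sub-model for one data type."""
        self._sub_models[data_type] = sub_model

    def predict(self, name: str, features: List[float],
                labeled_type: Optional[str] = None) -> Dict[str, float]:
        data_type = identify_data_type(name, labeled_type)  # step 1031
        sub_model = self._sub_models[data_type]  # step 1032
        return sub_model(features)  # step 1033
```

Because each data type has its own sub-model, the feature vector passed to `predict` is expected to follow the feature set that sub-model was trained with.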

The output life cycle state prediction result is different according to different data life cycle state division modes.

For example, if the data life cycle state is divided, according to whether the data is invalid, into the two states of life cycle ended and life cycle not ended, the life cycle state prediction result includes confidence information corresponding to the life cycle having ended and confidence information corresponding to the life cycle not having ended. If the data life cycle state is divided, according to access frequency, into the three states of frequently accessed, rarely accessed, and failed, the life cycle state prediction result includes three pieces of confidence information corresponding to these three states respectively.
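One way a model could produce one confidence per state is sketched below with a k-nearest-neighbor classifier, one of the algorithms the text lists. Taking each confidence as the share of votes among the k nearest training samples is an assumption of this sketch, as are the state labels and the function name.

```python
import math
from collections import Counter

# Assumed labels for the three-state division by access frequency.
STATES = ("frequent", "rare", "failed")


def knn_confidences(train_pairs, x, k=3):
    """Return one confidence per state as the vote share among k neighbors.

    `train_pairs` is a list of (feature_vector, state_label).
    """
    # Sort training samples by Euclidean distance to x and keep the k nearest.
    nearest = sorted(train_pairs, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return {state: votes.get(state, 0) / len(nearest) for state in STATES}
```

The confidences sum to 1, so they can be read as the probability or percentage values that a later state determination step compares.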

The present embodiment provides a universal lifecycle state prediction function for different types of data by using a data processing model that can be adapted to different data types. In addition, in the embodiment, the life cycle state prediction is performed on the data to be processed by utilizing the corresponding relation rule of the characteristics of the big data in the time dimension, the data access frequency dimension and/or the value dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.

Referring to fig. 5, a flowchart of a fourth embodiment of a data processing method provided by the present application is shown. This embodiment takes as an example the case in which the life cycle state of data is divided, according to whether the data has become invalid, into two states (life cycle ended and life cycle not ended), and describes an implementation process for determining the life cycle state of the data to be processed according to the processing result of the data processing model. As shown in fig. 5, the life cycle state of the data to be processed may be determined through the following processing:

Step 501, obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, which are included in the processing result.

Under the condition that the life cycle state of the data is divided into two states of end life cycle and non-end life cycle according to whether the data is invalid or not, generally, the data processing model can output the following result data after the life cycle state of the data to be processed is predicted: the first confidence information that the life cycle of the data to be processed is finished, and the second confidence information that the life cycle of the data to be processed is not finished.

The first confidence information and the second confidence information may be, but are not limited to, information such as a probability value or a percentage value that indicates how likely it is that the data to be processed belongs to the corresponding state.

Step 502, determining the life cycle state of the data to be processed based on the first confidence information and the second confidence information.

On the basis of obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, the life cycle state of the data to be processed can be determined based on the value of the first confidence information and the second confidence information and by combining a preset state determination strategy.

Illustratively, the state determination policy may be: the life cycle state of the data to be processed is a state corresponding to the confidence coefficient with a larger value. For example, assuming that the confidence of the to-be-processed data corresponding to the state in which the life cycle has not ended is 70%, and the confidence corresponding to the state in which the life cycle has ended is 30%, it can be determined that the life cycle state of the to-be-processed data is that the life cycle has not ended.
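A minimal sketch of this first policy is given below. The function name and the state labels are hypothetical, and resolving a tie in favor of "not ended" is an assumption made here, since the text does not say what happens when both confidences are equal.

```python
def pick_most_confident(first_confidence: float, second_confidence: float) -> str:
    """Return the state whose confidence is larger.

    first_confidence:  confidence that the life cycle has ended.
    second_confidence: confidence that the life cycle has not ended.
    A tie is resolved as "not_ended" (an assumption, not from the source).
    """
    return "ended" if first_confidence > second_confidence else "not_ended"


# The example from the text: 30% ended versus 70% not ended.
state = pick_most_confident(0.30, 0.70)
```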

The state determination policy may also be: if the larger of the two confidences is higher than a predetermined threshold, the life cycle state of the data to be processed is the state corresponding to that confidence; if the larger confidence is lower than the predetermined threshold, the state of the data to be processed is not recognized.

For example, assume that the predetermined threshold is 70%. If the confidence that the life cycle of the data to be processed has not ended is 45% and the confidence that it has ended is 55%, the state of the data to be processed is not recognized, because the larger confidence (55%) is lower than the predetermined threshold (70%). If instead the confidence that the life cycle has not ended is 10% and the confidence that it has ended is 90%, the larger confidence (90%) is higher than the predetermined threshold (70%), so the life cycle state of the data to be processed is the state corresponding to that confidence, namely that the life cycle has ended.
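The threshold-based policy can be sketched as follows. The function name, the state labels, and the use of `None` for the unrecognized case are hypothetical. The text does not say how a confidence exactly equal to the threshold is handled; this sketch treats it as recognized, which is an assumption.

```python
from typing import Optional


def pick_with_threshold(
    first_confidence: float,
    second_confidence: float,
    threshold: float = 0.70,
) -> Optional[str]:
    """Return the more confident state, or None when the state is not recognized.

    first_confidence:  confidence that the life cycle has ended.
    second_confidence: confidence that the life cycle has not ended.
    """
    larger = max(first_confidence, second_confidence)
    if larger < threshold:
        return None  # neither state is confident enough
    return "ended" if first_confidence > second_confidence else "not_ended"


# The two cases from the text, with a 70% threshold.
unrecognized = pick_with_threshold(0.55, 0.45)
recognized = pick_with_threshold(0.90, 0.10)
```

Returning a distinct "not recognized" value lets a caller route such data to another check instead of deleting or keeping it by default.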

In a specific implementation, the state determination policy may be chosen based on actual requirements (for example, whether unrecognized cases are allowed), and is not limited to the two policies provided by this embodiment.

In the embodiment, when the life cycle state of the data to be processed is determined, the data processing model specifically predicts the life cycle state by using the corresponding relation rule of the characteristics of the big data in each dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.

Referring to fig. 6, a schematic structural diagram of a fifth embodiment of a data processing apparatus provided in the present application is shown. The apparatus may be applied to various terminal devices such as a smart phone, a tablet computer, a personal digital assistant, a notebook, a desktop, or a kiosk, or may also be applied to various general or special servers. As shown in fig. 6, the data processing apparatus includes:

An obtaining unit 601, configured to obtain data to be processed.

An extracting unit 602, configured to extract feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in the time dimension or the data access frequency dimension.

The processing unit 603 is configured to process the feature data by using a pre-trained data processing model to obtain a processing result, so as to determine a life cycle state of the data to be processed.

Specifically, the feature data of the data to be processed may be provided to the data processing model as its input. The data processing model outputs processing result data capable of indicating the life cycle state of the data to be processed, based on the learned rule of correspondence between data features and data life cycle states. The life cycle state of the data to be processed may then be determined from the processing result data, for example by determining whether the life cycle of the data to be processed has ended, or whether the data to be processed is in a state of frequent access, little access, or failure.
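The cooperation of the three units can be illustrated with the sketch below. Everything beyond the unit numbers is an assumption made for illustration: the `DataRecord` fields, the feature names, the day-index time representation, and in particular `stub_model`, which is a hand-written placeholder rule and not a trained data processing model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Features = Dict[str, float]
Model = Callable[[Features], Dict[str, float]]


@dataclass
class DataRecord:
    name: str
    created_day: int      # day index on which the data was generated
    last_access_day: int  # day index of the most recent access
    access_count: int     # total number of accesses so far


class DataProcessingApparatus:
    def __init__(self, model: Model, catalog: Dict[str, DataRecord], today: int):
        self.model = model
        self.catalog = catalog
        self.today = today

    def obtain(self, name: str) -> DataRecord:
        """Obtaining unit 601: fetch the data to be processed."""
        return self.catalog[name]

    def extract(self, record: DataRecord) -> Features:
        """Extracting unit 602: time and access frequency features."""
        age_days = max(self.today - record.created_day, 1)
        return {
            "age_days": float(age_days),
            "idle_days": float(self.today - record.last_access_day),
            "accesses_per_day": record.access_count / age_days,
        }

    def process(self, features: Features) -> str:
        """Processing unit 603: run the model and keep the most confident state."""
        result = self.model(features)
        return max(result, key=result.get)

    def run(self, name: str) -> str:
        return self.process(self.extract(self.obtain(name)))


def stub_model(features: Features) -> Dict[str, float]:
    # Placeholder rule standing in for a trained model: long-idle data is
    # treated as more likely to have reached the end of its life cycle.
    ended = min(features["idle_days"] / 100.0, 1.0)
    return {"ended": ended, "not_ended": 1.0 - ended}


catalog = {
    "old_log": DataRecord("old_log", created_day=0, last_access_day=10, access_count=5),
    "hot_table": DataRecord("hot_table", created_day=0, last_access_day=95, access_count=500),
}
apparatus = DataProcessingApparatus(stub_model, catalog, today=100)
```

Passing the model in as a callable keeps the units independent of how the model was trained, so a real pre-trained model could replace the stub without changing the apparatus.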

According to the scheme, the data processing device extracts the feature data of the data to be processed after the data to be processed is obtained, wherein the feature data comprises at least one of features of the data to be processed in a time dimension or a data access frequency dimension; on the basis, the feature data are processed by utilizing a pre-trained data processing model to determine the life cycle state of the data to be processed. According to the data processing method and device, the pre-trained data processing model is used for determining the life cycle state of the data to be processed based on at least one of the characteristics of the data to be processed in the time dimension or the data access frequency dimension, and therefore when the life cycle state of the data to be processed is determined, the rule of the corresponding relation between the at least one of the characteristics of the big data in the time dimension or the data access frequency dimension and the life cycle state of the big data is specifically used.

The data processing apparatus of the present application needs to be used on the basis of the data processing model that is trained in advance, that is, as shown in fig. 2, a data processing model needs to be trained in advance by using training samples, and on this basis, the life cycle state prediction of data can be performed on the basis of the data processing model.

In view of this, referring to fig. 7, a schematic structural diagram of a sixth embodiment of a data processing apparatus provided in the present application is shown, in this embodiment, the data processing apparatus further includes:

a preprocessing unit 604 for:

obtaining a plurality of training samples; extracting feature data of the predetermined features of each training sample, where the feature data of a training sample includes at least one of the features of the training sample in the time dimension or the data access frequency dimension; labeling the data life cycle state of each training sample to obtain a data life cycle state labeling result for each training sample; establishing a correspondence between the feature data of each training sample and its data life cycle state labeling result to obtain correspondence data for each training sample; and training a data processing model with the correspondence data of the training samples based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on input feature data.

The life cycle state labeling result of a training sample is determined by how the life cycle states are divided. For example, if the data is divided, according to whether it has become invalid, into two states (life cycle ended and life cycle not ended), the labeling result of the training sample is one of those two states; if the life cycle is divided, according to the access frequency, into three states (frequent access, little access, and failure), the labeling result of the training sample is one of those three states.
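The training flow performed by the preprocessing unit can be sketched as below. The application only refers to "a preset machine learning algorithm" without naming one, so the nearest-centroid classifier used here is purely a stand-in chosen for brevity. The feature layout (age in days, accesses per day), the sample values, and the way distances are turned into confidences are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Correspondence data: (feature vector, labeled life cycle state).
Sample = Tuple[Sequence[float], str]


def train(samples: List[Sample]) -> Dict[str, List[float]]:
    """Learn one centroid (mean feature vector) per labeled state."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model: Dict[str, List[float]], features: Sequence[float]) -> Dict[str, float]:
    """Turn distances to the centroids into confidences that sum to 1."""
    distances = {label: math.dist(features, centroid) for label, centroid in model.items()}
    nearest = min(distances.values())
    # Subtracting the smallest distance keeps exp() from underflowing to 0 for all states.
    scores = {label: math.exp(-(d - nearest)) for label, d in distances.items()}
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical labeled samples: [age in days, accesses per day].
training_samples: List[Sample] = [
    ([400.0, 0.0], "ended"),
    ([380.0, 0.1], "ended"),
    ([10.0, 5.0], "not_ended"),
    ([20.0, 4.0], "not_ended"),
]
model = train(training_samples)
prediction = predict(model, [395.0, 0.0])
```

In this sketch the features are used unscaled, so the age feature dominates the distance; a practical implementation would likely normalize the features first.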

In the seventh embodiment of the present application, the data processing model is preferably a model that can be simultaneously applied to perform life cycle state prediction on data of different data types (e.g., data table data, log data), and specifically, the data processing model includes more than one sub-processing model, different sub-processing models correspond to different data types and are used for performing life cycle state prediction on data of different data types, and data feature types and/or feature weights corresponding to different sub-processing models are different.

On this basis, the processing unit 603 processes the feature data by using a pre-trained data processing model to obtain a processing result, which specifically includes:

obtaining the data type of the data to be processed; determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different; and inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed.
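The selection of a sub-processing model by data type can be sketched as follows. The data type keys, the feature names, the weights, and the toy weighted-sum scoring are hypothetical; they only illustrate that different sub-processing models may use different feature types and different feature weights.

```python
from typing import Callable, Dict

Features = Dict[str, float]
SubModel = Callable[[Features], Dict[str, float]]


def make_weighted_model(weights: Features) -> SubModel:
    """Build a toy sub-model whose 'ended' confidence is a clamped weighted sum."""
    def model(features: Features) -> Dict[str, float]:
        score = sum(weight * features[name] for name, weight in weights.items())
        ended = min(max(score, 0.0), 1.0)
        return {"ended": ended, "not_ended": 1.0 - ended}
    return model


# One sub-processing model per data type, each with its own features and weights.
SUB_MODELS: Dict[str, SubModel] = {
    "table": make_weighted_model({"idle_days": 0.01}),
    "log": make_weighted_model({"age_days": 0.02, "idle_days": 0.01}),
}


def process(data_type: str, features: Features) -> Dict[str, float]:
    """Pick the sub-model for the data type and run it on the feature data."""
    try:
        sub_model = SUB_MODELS[data_type]
    except KeyError:
        raise ValueError(f"no sub-processing model for data type {data_type!r}") from None
    return sub_model(features)


features = {"age_days": 30.0, "idle_days": 20.0}
table_result = process("table", features)
log_result = process("log", features)
```

The same feature data yields different confidences for the two data types, which reflects the different feature sets and weights of the sub-processing models.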

In the eighth embodiment of the present application, taking as an example the case in which the life cycle state of the data is divided into two states (life cycle ended and life cycle not ended), the implementation process by which the processing unit 603 determines the life cycle state of the data to be processed according to the processing result of the data processing model is described. The processing unit 603 may determine the life cycle state of the data to be processed by:

obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, wherein the first confidence information and the second confidence information are included in the processing result; and determining the life cycle state of the data to be processed based on the first confidence information and the second confidence information.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

For convenience of description, the above system or apparatus is described as being divided by function into various modules or units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A data processing method, comprising:

acquiring data to be processed;

carrying out life cycle state recognition processing on the characteristic data by utilizing a rule of a corresponding relation between at least one of characteristics of big data in a time dimension or a data access frequency dimension, which is described by a pre-trained data processing model, and a life cycle state of the big data to obtain a processing result of the data to be processed, and determining the life cycle state of the data to be processed based on the processing result;

the determining the life cycle state of the data to be processed based on the processing result comprises: under the condition that the life cycle state of the data to be processed is divided into two states of the life cycle being ended and the life cycle not being ended, obtaining first confidence information that the life cycle of the data to be processed is ended and second confidence information that the life cycle of the data to be processed is not ended; and determining the life cycle state of the data to be processed based on the first confidence information and the second confidence information;

wherein the data type of the training sample of the data processing model at least comprises the data type of the data to be processed.

2. The method according to claim 1, wherein the extracting feature data of the predetermined feature of the data to be processed comprises:

3. The method according to claim 1, wherein the performing the lifecycle state identification processing on the feature data to obtain the processing result of the data to be processed comprises:

obtaining the data type of the data to be processed;

inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed, so as to obtain the life cycle state of the data to be processed based on the processing result.

4. The method of claim 1, wherein the training process of the data processing model comprises:

obtaining a plurality of training samples;

5. The method of claim 4, wherein obtaining the plurality of training samples comprises:

6. A data processing apparatus, comprising:

the processing unit is used for carrying out life cycle state recognition processing on the characteristic data by utilizing the rule of the corresponding relation between at least one of the characteristics of the big data in the time dimension or the data access frequency dimension, which is described by a pre-trained data processing model, and the life cycle state of the big data to obtain the processing result of the data to be processed, and determining the life cycle state of the data to be processed based on the processing result;

when determining the life cycle state of the data to be processed based on the processing result, the processing unit is specifically configured to: under the condition that the life cycle state of the data to be processed is divided into two states of the life cycle being ended and the life cycle not being ended, obtain first confidence information that the life cycle of the data to be processed is ended and second confidence information that the life cycle of the data to be processed is not ended; and determine the life cycle state of the data to be processed based on the first confidence information and the second confidence information;

7. The apparatus according to claim 6, wherein the extraction unit is specifically configured to:

8. The apparatus according to claim 6, wherein the processing unit is specifically configured to:

obtaining the data type of the data to be processed;

9. The apparatus of claim 6, further comprising:

a pre-processing unit to:

obtaining a plurality of training samples;