CN108470071B - Data processing method and device - Google Patents

Info

Publication number
CN108470071B
Authority
CN
China
Prior art keywords
data
life cycle
processed
processing
cycle state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810271496.4A
Other languages
Chinese (zh)
Other versions
CN108470071A (en
Inventor
杨帆
金宝宝
张成松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201810271496.4A
Publication of CN108470071A
Application granted
Publication of CN108470071B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24564Applying rules; Deductive queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

According to the data processing method and device, after data to be processed is obtained, feature data of the data to be processed is extracted, where the feature data comprises at least one of: features of the data to be processed in a time dimension, or features in a data access frequency dimension. The feature data is then processed with a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained model determines the life cycle state from features in the time dimension and/or the data access frequency dimension, the determination specifically relies on the learned correspondence between those features of big data and the life cycle state of the big data.

Description

Data processing method and device
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a data processing method and device.
Background
In a production environment, data whose life cycle has ended, such as tables and log files, often needs to be deleted to release storage space and keep the system running normally, which in turn requires determining whether the life cycle of the data has ended.
Two methods currently exist for determining whether the life cycle of data has ended. The first presets a threshold T and compares the total survival time t of the data, measured from its generation time, with T: data with t > T is treated as data whose life cycle has ended and is deleted; otherwise the life cycle is considered not to have ended. The second is manual review: data that a reviewer judges to be no longer useful is treated as data whose life cycle has ended and is deleted.
Both methods have defects. In the first, the threshold T is difficult to choose, so the judgment of whether a life cycle has ended has low accuracy, which adversely affects deletion: if T is set too large, much useless data is never deleted; if T is set too small, useful data is often deleted along with the useless data. The second method has a high labor cost, because every item of data must be examined and judged manually.
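The threshold trade-off described above can be illustrated with a minimal sketch. The item names, dates, and the two threshold values are hypothetical and chosen only to show both failure modes; they do not come from the patent.

```python
from datetime import datetime, timedelta

def is_expired_by_threshold(created_at, now, threshold):
    """Prior-art rule: the life cycle is treated as ended when survival time t > T."""
    survival = now - created_at
    return survival > threshold

# Hypothetical items: one useless but young, one useful but old.
now = datetime(2018, 3, 29)
items = {
    "tmp_join_table": datetime(2018, 3, 1),   # 28 days old, no longer used
    "monthly_report": datetime(2017, 12, 1),  # 118 days old, still in use
}

for threshold_days in (14, 180):
    threshold = timedelta(days=threshold_days)
    expired = [name for name, created in items.items()
               if is_expired_by_threshold(created, now, threshold)]
    print(f"T = {threshold_days} days -> deleted: {expired}")
```

With the small threshold both items are deleted, including the useful one; with the large threshold neither is deleted, including the useless one.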
Disclosure of Invention
In view of the above, an object of the present invention is to provide a data processing method and apparatus, which are used to overcome the above problems in the prior art, so that the determined data lifecycle status has higher accuracy under the premise of low labor cost.
Therefore, the invention discloses the following technical scheme:
a method of data processing, comprising:
acquiring data to be processed;
extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;
and processing the characteristic data by using a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.
Preferably, in the above method, extracting feature data of the predetermined feature of the data to be processed includes:
extracting at least one of: the survival time of the data to be processed from its creation time point to the current time, the time elapsed from its last access time point to the current time, or its data access frequency in at least one most recent predetermined time period.
Preferably, in the method, the processing the feature data by using a pre-trained data processing model to obtain a processing result includes:
obtaining the data type of the data to be processed;
determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different;
and inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed.
Preferably, the determining the life cycle state of the data to be processed includes:
obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, wherein the first confidence information and the second confidence information are included in the processing result;
and determining the life cycle state of the data to be processed based on the first confidence coefficient information and the second confidence coefficient information.
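The text does not specify how the two confidence values are combined. One possible reading, sketched below, compares them directly and optionally requires a safety margin before declaring the life cycle ended. The function name, the state strings, and the `margin` parameter are assumptions made for illustration.

```python
def decide_lifecycle_state(conf_ended, conf_not_ended, margin=0.0):
    """Return "ended" only if the 'ended' confidence exceeds the 'not ended'
    confidence by more than `margin`; otherwise return "not_ended".

    `margin` is a hypothetical safeguard against deleting useful data when
    the two confidences are close; it is not part of the source text.
    """
    if conf_ended - conf_not_ended > margin:
        return "ended"
    return "not_ended"

print(decide_lifecycle_state(0.8, 0.2))               # clear case
print(decide_lifecycle_state(0.55, 0.45, margin=0.2)) # close call, kept
```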
In the above method, preferably, the training process of the data processing model includes:
obtaining a plurality of training samples;
extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: features of the training sample in the time dimension or the data access frequency dimension;
marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample;
establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result;
and training a data processing model by utilizing the corresponding relation data of each training sample based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.
In the above method, preferably, obtaining a plurality of training samples includes:
randomly selecting a batch of data from a production environment and using the randomly selected data as training samples.
A data processing apparatus comprising:
an acquisition unit for acquiring data to be processed;
an extraction unit for extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;
and the processing unit is used for processing the characteristic data by utilizing a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.
In the above apparatus, preferably, the extraction unit is specifically configured for:
extracting at least one of: the survival time of the data to be processed from its creation time point to the current time, the time elapsed from its last access time point to the current time, or its data access frequency in at least one most recent predetermined time period.
In the above apparatus, preferably, the processing unit is specifically configured for:
obtaining the data type of the data to be processed;
determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different;
and inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed.
The above apparatus, preferably, further comprises:
a pre-processing unit to:
obtaining a plurality of training samples;
extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: features of the training sample in the time dimension or the data access frequency dimension;
marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample;
establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result;
and training a data processing model by utilizing the corresponding relation data of each training sample based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.
According to the above scheme, after data to be processed is obtained, feature data of the data to be processed is extracted, where the feature data comprises at least one of: features of the data to be processed in a time dimension, or features in a data access frequency dimension. The feature data is then processed with a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained model determines the life cycle state from features in the time dimension and/or the data access frequency dimension, the determination specifically relies on the learned correspondence between those features of big data and the life cycle state of the big data.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is an overall framework diagram of training a data processing model and performing data life cycle prediction based on the trained model according to the second embodiment of the present application;
FIG. 3 is a schematic diagram of an implementation process of a training data processing model provided in the second embodiment of the present application;
FIG. 4 is a flowchart of a data processing method provided in the third embodiment of the present application;
FIG. 5 is a flowchart of a data processing method according to a fourth embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data processing apparatus according to a fifth embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data processing apparatus according to a sixth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application provides a data processing method and device to solve two problems of the prior art: the low accuracy of life cycle states determined by setting a threshold T, and the excessive labor cost of manual review. The determined data life cycle state thereby has higher accuracy at low labor cost. The scheme of the application is explained below through several embodiments.
FIG. 1 is a flowchart of a first embodiment of the data processing method provided in the present application. The method is applicable to various terminal devices such as a smartphone, a tablet computer (PAD), a personal digital assistant (PDA), a notebook, a desktop, or a kiosk, and may also be applied to various general-purpose or special-purpose servers. As shown in FIG. 1, the data processing method includes the following processing steps:
Step 101, obtaining data to be processed.
The data to be processed may be various types of data generated or created in an actual production environment, such as but not limited to data table data or log data of a log file, etc.
For such data, the life cycle state often needs to be known for a specific purpose. For example, to delete data whose life cycle has ended (also called invalid data) and release storage space, it is necessary to know whether the life cycle of the data has ended; to store data in tiers according to how frequently it is accessed, it is necessary to know whether the data is in a frequently accessed, rarely accessed, or invalid state. For this situation, the objective of the present application is to determine the life cycle state of data with low labor cost and high accuracy.
Determining the data life cycle state depends on how life cycle states are divided; different division schemes yield different candidate states for prediction.
For example, if the division is made according to whether the data is invalid, the life cycle state of the data can be one of two states: life cycle not ended and life cycle ended. If the division is made according to how frequently the data is accessed, the life cycle state of the data can be one of states such as frequently accessed, rarely accessed, and invalid.
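The two division schemes mentioned above can be written down as enumerations. This is only a sketch of one way to represent the candidate states; the class names and the `CANDIDATE_STATES` mapping are hypothetical.

```python
from enum import Enum

class ValidityState(Enum):
    """Division by whether the data is invalid."""
    NOT_ENDED = "life cycle not ended"
    ENDED = "life cycle ended"

class AccessState(Enum):
    """Division by how frequently the data is accessed."""
    FREQUENT = "frequently accessed"
    RARE = "rarely accessed"
    INVALID = "invalid"

# The division scheme chosen at labeling time fixes the candidate states
# that a trained model can later predict.
CANDIDATE_STATES = {
    "by_validity": list(ValidityState),
    "by_access_frequency": list(AccessState),
}

for scheme, states in CANDIDATE_STATES.items():
    print(scheme, [s.value for s in states])
```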
Step 102, extracting feature data of predetermined features of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in the time dimension or the data access frequency dimension.
The characteristics of the data to be processed in the time dimension may include, but are not limited to: the survival time of the data to be processed from its creation time point to the current time, and the time elapsed from its last access time point to the current time.
Characteristics of the data to be processed in the data access frequency dimension may include, but are not limited to: the data access frequency of the data to be processed in at least one predetermined time period, for example, the access frequency of the data to be processed in the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year or the last 5 years, etc.
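The time-dimension and access-frequency features described above can be sketched as follows. The function name, the feature names, the choice of window lengths, and the fallback to the creation time for never-accessed data are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def extract_features(created_at, access_times, now, windows_days=(1, 7, 30)):
    """Build time-dimension and access-frequency features for one data item."""
    # Assumption: data that was never accessed counts as last touched at creation.
    last_access = max(access_times) if access_times else created_at
    features = {
        "survival_days": (now - created_at).total_seconds() / 86400,
        "days_since_last_access": (now - last_access).total_seconds() / 86400,
    }
    # One access count per most-recent window (last 1 day, last 7 days, ...).
    for days in windows_days:
        start = now - timedelta(days=days)
        features[f"accesses_last_{days}d"] = sum(
            1 for t in access_times if start < t <= now
        )
    return features

now = datetime(2018, 3, 29, 12, 0)
accesses = [
    datetime(2018, 3, 29, 9, 0),
    datetime(2018, 3, 25, 9, 0),
    datetime(2018, 3, 10, 9, 0),
]
features = extract_features(datetime(2018, 1, 1, 12, 0), accesses, now)
print(features)
```

Counting accesses over several windows at once gives the model both a short-term and a long-term view of how the data is used.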
For different types of data to be processed, either the same features or different features may be extracted, which is not limited in the present application.
Specifically, for different types of data such as a data table and a log file, the same features may be set for extraction, namely the features in the time dimension and/or the data access frequency dimension.
Alternatively, different features may be set for different types of data. For example, for a data table, the features to be extracted may be set as features of the time dimension, the data access frequency dimension, and a value dimension; for a log file, the features to be extracted may be set as features of the time dimension and the data access frequency dimension.
The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset levels: high, medium, and low. For instance, a data table of intermediate data is often of low importance in an actual production environment, and enterprise personnel or users generally pay little attention to it, so the value feature of such a table may be extracted as "low". A data table of result data is often of high importance in an actual production environment and is generally a key object of attention, so the value feature of such a table may be extracted as "high".
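A per-type feature set with an encoded value level might look like the sketch below. The feature names, the type keys "table" and "log", and the numeric encoding of the three value levels are hypothetical.

```python
# Hypothetical ordinal encoding of the three preset value levels.
VALUE_LEVELS = {"low": 0, "medium": 1, "high": 2}

# Hypothetical feature sets: tables carry a value-dimension feature, logs do not.
FEATURES_BY_TYPE = {
    "table": ("survival_days", "days_since_last_access",
              "accesses_last_30d", "value_level"),
    "log": ("survival_days", "days_since_last_access", "accesses_last_30d"),
}

def to_vector(data_type, raw):
    """Select the features configured for `data_type` and return them as floats."""
    vector = []
    for name in FEATURES_BY_TYPE[data_type]:
        value = raw[name]
        if name == "value_level":
            value = VALUE_LEVELS[value]
        vector.append(float(value))
    return vector

raw = {"survival_days": 40, "days_since_last_access": 12,
       "accesses_last_30d": 0, "value_level": "low"}
print(to_vector("table", raw))
print(to_vector("log", raw))
```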
Step 103, processing the characteristic data by using a pre-trained data processing model to obtain a processing result so as to determine the life cycle state of the data to be processed.
The data processing model is a model trained by using large-batch data of types such as a data table or a log file in advance, and can describe the corresponding relation rule between the data characteristics (such as the characteristics of time dimension and/or access frequency dimension) of the data and the life cycle state of the data.
Accordingly, in this step, the feature data of the data to be processed may be provided to the data processing model as its input. Based on the learned correspondence between data features and data life cycle states, the model outputs processing result data that indicates the life cycle state of the data to be processed. The life cycle state can then be determined from this processing result data, for example whether the life cycle of the data to be processed has ended, or whether the data is in a frequently accessed, rarely accessed, or invalid state.
According to the above scheme, after data to be processed is obtained, feature data of the data to be processed is extracted, where the feature data comprises at least one of: features of the data to be processed in a time dimension, or features in a data access frequency dimension. The feature data is then processed with a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained model determines the life cycle state from features in the time dimension and/or the data access frequency dimension, the determination specifically relies on the learned correspondence between those features of big data and the life cycle state of the big data.
The data processing method of the present application is carried out on the basis of a pre-trained data processing model. That is, as shown in FIG. 2, a data processing model is first trained with training samples, and life cycle state prediction is then performed on data with that model.
The second embodiment provides an implementation process for training the data processing model. FIG. 3 is a flowchart of the second embodiment, the training process of the data processing model provided by the present application, where the training process includes:
Step 301, obtaining a plurality of training samples.
The training samples may be, but are not limited to, batch data randomly selected from an actual production environment, such as batch log file data, or batch spreadsheet data, etc.
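Random selection of a training batch can be sketched with the standard library. The catalog of item names and the helper name are hypothetical; in practice the catalog would come from whatever inventory of tables or log files the production environment exposes.

```python
import random

def sample_training_data(catalog, k, seed=None):
    """Randomly pick up to k distinct items from the catalog as training samples."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(catalog, min(k, len(catalog)))

# Hypothetical catalog of log files found in a production environment.
catalog = ["app-01.log", "app-02.log", "audit.log",
           "gc.log", "access.log", "batch.log"]
picked = sample_training_data(catalog, 3, seed=0)
print(picked)
```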
The data type of the training samples depends on the function required of the data processing model to be trained: if the model is required to predict the life cycle state of data table data, the training samples are data of the data table type; if the model is required to predict the life cycle state of log data, the training samples are data of the log type.
Step 302, extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: features of the training sample in the time dimension or the data access frequency dimension.
The features of the training samples in the time dimension may include, but are not limited to: the survival time of the training sample from its creation time point to the current time, and the time elapsed from its last access time point to the current time.
The features of the training samples in the data access frequency dimension may include, but are not limited to: the frequency of data access of the training samples during at least one predetermined time period, such as the frequency of access of the training samples during the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year, the last 5 years, etc.
For training samples of different types, either the same features or different features may be extracted, which is not limited in the present application.
Specifically, for different types of training samples such as a data table and a log file, the same features may be set for extraction, namely the features in the time dimension and/or the data access frequency dimension.
Alternatively, different features may be set for different types of training samples. For example, for a data table, the features to be extracted may be set as features of the time dimension, the data access frequency dimension, and a value dimension; for a log file, the features to be extracted may be set as features of the time dimension and the data access frequency dimension.
The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset levels: high, medium, and low. For instance, when features are extracted from a training sample, a data table of intermediate data is often of low importance in an actual production environment, and enterprise personnel or users generally pay little attention to it, so the value feature of that data table can be extracted as "low". A data table of result data is often of high importance in an actual production environment and is generally a key object of attention, so the value feature of that data table can be extracted as "high".
Step 303, marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample.
Specifically, the life cycle state of each training sample can be labeled by means of manual labeling and the like.
The life cycle state labeling result of a training sample depends on how life cycle states are divided. For example, if the states are divided into life cycle ended and life cycle not ended according to whether the data is invalid, the labeling result of the training sample is one of those two states; if the states are divided into frequently accessed, rarely accessed, and invalid according to how frequently the data is accessed, the labeling result of the training sample is one of those three states.
It should be noted that the labeling of the life cycle state of the training samples in this step and the feature extraction in the previous step are not limited to the order given in this embodiment. In practical application, feature extraction may be executed first and state labeling second, state labeling may be executed first and feature extraction second, or the two processes may be executed at the same time, which is not limited in this embodiment.
Step 304, establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result.
On the basis of extracting the feature data of the training samples and marking the life cycle state of the training samples, the corresponding relation between the feature data of each training sample and the life cycle state marking result can be established, so that support is provided for training of the data processing model.
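The correspondence data can be represented as a list of (feature data, label) pairs keyed by a shared sample identifier. The sketch below assumes features and labels were produced independently, in either order, and skips samples that have not been labeled yet; the identifiers and values are hypothetical.

```python
def build_correspondence(features_by_id, labels_by_id):
    """Pair each sample's feature vector with its labeled life cycle state.

    Samples without a label are skipped rather than guessed.
    """
    pairs = []
    for sample_id, features in features_by_id.items():
        if sample_id in labels_by_id:
            pairs.append((features, labels_by_id[sample_id]))
    return pairs

# Hypothetical feature vectors: [days since last access, accesses in last 30 days].
features_by_id = {"t1": [120.0, 0.0], "t2": [3.0, 40.0], "t3": [60.0, 1.0]}
labels_by_id = {"t1": "ended", "t2": "not_ended"}  # "t3" not labeled yet

pairs = build_correspondence(features_by_id, labels_by_id)
print(pairs)
```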
Step 305, based on a predetermined machine learning algorithm, training a data processing model by using the corresponding relation data of each training sample, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.
The Machine learning algorithm may be, but is not limited to, any one of a random forest, a KNN (k-nearest neighbor classification algorithm), a logistic regression, an SVM (Support Vector Machine), and the like, and in a specific application, a technician may select an appropriate Machine learning algorithm based on actual model characteristic requirements.
After the correspondence between the feature data of each training sample and its life cycle state labeling result is established, model parameters such as the feature weights and the learning rate of an initial model can be initialized for the selected machine learning algorithm. The initial model is then trained with the correspondence data of each training sample, so that the data processing model continuously learns the correspondence between the data features of the training samples and their life cycle states and continuously adjusts and optimizes the weight of each feature. The result is a data processing model that can output a life cycle state prediction result for the input feature data.
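As one illustration of initializing feature weights and a learning rate and then adjusting the weights from labeled samples, the sketch below trains a logistic regression by stochastic gradient descent in plain Python. Logistic regression is only one of the algorithms the text lists; the feature scaling, sample values, learning rate, and epoch count are assumptions, not values from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, learning_rate=0.1, epochs=2000):
    """samples: list of (feature_vector, label); label 1 = ended, 0 = not ended."""
    weights = [0.0] * len(samples[0][0])  # initialized feature weights
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the score
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict_ended_confidence(model, x):
    """Confidence (0..1) that the life cycle of the item has ended."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical scaled features: [days_since_last_access / 365, accesses_last_30d / 100]
samples = [
    ([0.90, 0.00], 1), ([0.70, 0.01], 1), ([0.80, 0.00], 1),
    ([0.02, 0.60], 0), ([0.05, 0.30], 0), ([0.01, 0.90], 0),
]
model = train_logistic(samples)
print(predict_ended_confidence(model, [0.85, 0.00]))  # long idle, never read
print(predict_ended_confidence(model, [0.03, 0.50]))  # recently and often read
```

The confidence that the life cycle has not ended is one minus the returned value, so a single trained model yields both confidences for an item.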
The data type of the data that can be processed by the trained data processing model generally needs to be consistent with the data type of the training sample, for example, if the data type of the training sample is a data table type, the trained data processing model can process the data of the data table, that is, the life cycle state prediction result of the data table can be output according to the data characteristics of the input data of the data table; if the data type of the training sample is the log type, the trained data processing model can process the data of the log type, namely, the life cycle state prediction result of the log data can be output according to the data characteristics of the input log data.
That is, the data processing model trained based on training samples of different data types is generally only applicable to life cycle state prediction of data of the same type as the training sample data.
To provide users with a life cycle state prediction function suitable for different types of data, the trained data processing models for the different data types can be integrated as sub-models into a general model, which can then predict the life cycle state of different types of data. In practical application, a general interface located above the sub-models can be provided for the general model (for example, a data input and state prediction interface for each type of data). The general interface at least identifies the data type of the data to be predicted and calls the corresponding sub-model to predict the data life cycle state according to that data type.
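A general interface above per-type sub-models can be sketched as a small dispatcher. The class name and method names are hypothetical, and the two sub-models shown are hand-written stand-in rules used only to make the example runnable; in the scheme described above they would be trained models.

```python
class LifecyclePredictor:
    """General interface that routes each request to the sub-model for its data type."""

    def __init__(self):
        self._sub_models = {}

    def register(self, data_type, sub_model):
        self._sub_models[data_type] = sub_model

    def predict(self, data_type, features):
        if data_type not in self._sub_models:
            raise ValueError(f"no sub-model registered for data type {data_type!r}")
        return self._sub_models[data_type](features)

# Stand-in sub-models (NOT trained): each uses its own feature set.
def table_model(f):
    idle = f["days_since_last_access"] > 90
    return "ended" if idle and f["value_level"] == "low" else "not_ended"

def log_model(f):
    return "ended" if f["days_since_last_access"] > 30 else "not_ended"

predictor = LifecyclePredictor()
predictor.register("table", table_model)
predictor.register("log", log_model)

print(predictor.predict("table", {"days_since_last_access": 200, "value_level": "low"}))
print(predictor.predict("log", {"days_since_last_access": 5}))
```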
In the embodiment, based on a preset machine learning algorithm, a data model is trained by using training samples of corresponding data types, so that a life cycle state prediction function is provided for the data of the corresponding data types; data processing models respectively suitable for different data types are integrated as submodels, and a general interface positioned at the upper layer of each submodel is provided for the general model obtained by integration, so that a general life cycle state prediction function is provided for various types of data.
Referring to fig. 4, a flowchart of a third embodiment of the data processing method provided by the present application is shown. In this embodiment, the data processing model is preferably a model that can be used to perform life cycle state prediction on data of different data types (e.g., data table data, log data) at the same time. Specifically, the data processing model includes more than one sub-processing model; different sub-processing models correspond to different data types and are used for performing life cycle state prediction on data of those data types, and the data feature types and/or feature weights corresponding to different sub-processing models are different.
As shown in fig. 4, in this embodiment, the step 103 (processing the feature data by using a pre-trained data processing model to obtain a processing result) may be implemented by the following processing procedures:
Step 1031, obtaining the data type of the data to be processed.
The data to be processed may be, but is not limited to, data table data or log data.
The data type of the data to be processed may be identified based on, but not limited to, the data format of the data to be processed, the document type of the document in which the data to be processed is located (for example, data tables and log data differ considerably in both data format and document type), or pre-labeled type information.
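As a rough illustration of such type identification, the sketch below prefers pre-labeled type information and otherwise falls back to the document type, approximated here by the file suffix. The suffix table, the type names and the function name are assumptions made for the example only.

```python
from pathlib import Path

# Hypothetical mapping from document type (file suffix) to data type.
SUFFIX_TO_TYPE = {
    ".csv": "data_table",
    ".parquet": "data_table",
    ".log": "log",
}


def identify_data_type(path, labeled_type=None):
    """Return the data type of the data to be processed.

    Pre-labeled type information wins when present; otherwise the
    document type is inferred from the file suffix.
    """
    if labeled_type is not None:
        return labeled_type
    suffix = Path(path).suffix.lower()
    return SUFFIX_TO_TYPE.get(suffix, "unknown")


print(identify_data_type("warehouse/orders.csv"))
print(identify_data_type("server/app.log"))
print(identify_data_type("dump.bin", labeled_type="log"))
```

Identification by data format (for example, inspecting the content itself) is not shown here and would replace or supplement the suffix lookup.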
Step 1032, determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data feature types and/or feature weights corresponding to different sub-processing models are different.
After the data type of the data to be processed is obtained, a sub-processing model matched with the data to be processed can be determined from a plurality of sub-processing models included in the data processing model according to the data type.
For example, if the data type of the data to be processed is data table data, a first sub-processing model capable of performing life cycle state prediction on the data table data may be determined from a plurality of sub-processing models included in the data processing model; if the data type of the data to be processed is log data, a second sub-processing model capable of predicting the life cycle state of the log data can be determined from a plurality of sub-processing models included in the data processing model.
Step 1033, inputting the feature data of the data to be processed into the sub-processing model corresponding to the data type, and obtaining a processing result of the data to be processed.
After determining the sub-processing model corresponding to the data type of the data to be processed, the feature data of the data to be processed may be input into the sub-processing model, and finally, the sub-processing model outputs a life cycle state prediction result corresponding to the feature data of the data to be processed (such as a feature of a time dimension, a feature of a data access frequency dimension, and/or a feature of a value dimension) based on a rule of a correspondence relationship between the data feature and the data life cycle state.
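Steps 1031 to 1033 amount to a lookup followed by a call. The sketch below shows one possible arrangement, assuming each sub-processing model is a callable that maps feature data to per-state confidences; the class name, the stand-in sub-models and their fixed outputs are hypothetical.

```python
from typing import Callable, Dict, List

Confidences = Dict[str, float]
SubModel = Callable[[List[float]], Confidences]


class DataProcessingModel:
    """General model holding one sub-processing model per data type."""

    def __init__(self) -> None:
        self._sub_models: Dict[str, SubModel] = {}

    def register(self, data_type: str, sub_model: SubModel) -> None:
        self._sub_models[data_type] = sub_model

    def process(self, data_type: str, features: List[float]) -> Confidences:
        # Step 1032: determine the sub-processing model for the data type.
        try:
            sub_model = self._sub_models[data_type]
        except KeyError:
            raise ValueError(
                f"no sub-processing model for data type {data_type!r}"
            ) from None
        # Step 1033: feed the feature data to that sub-processing model.
        return sub_model(features)


# Stand-in sub-models returning fixed confidences, for illustration only.
def table_sub_model(features: List[float]) -> Confidences:
    return {"ended": 0.2, "not_ended": 0.8}


def log_sub_model(features: List[float]) -> Confidences:
    return {"ended": 0.9, "not_ended": 0.1}


model = DataProcessingModel()
model.register("data_table", table_sub_model)
model.register("log", log_sub_model)
print(model.process("log", [400.0, 0.0]))
```

Keeping the sub-models behind a single `process` entry point means a new data type can be supported by registering another sub-model without changing the callers.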
The output life cycle state prediction result is different according to different data life cycle state division modes.
For example, if the data life cycle state is divided, according to whether the data has failed, into two states, namely life cycle ended and life cycle not ended, the life cycle state prediction result includes confidence information corresponding to the life cycle having ended and confidence information corresponding to the life cycle not having ended; if the data life cycle state is divided, according to access frequency, into three states, namely frequent access, little access and failure, the life cycle state prediction result includes three pieces of confidence information corresponding respectively to those three states.
The present embodiment provides a universal lifecycle state prediction function for different types of data by using a data processing model that can be adapted to different data types. In addition, in the embodiment, the life cycle state prediction is performed on the data to be processed by utilizing the corresponding relation rule of the characteristics of the big data in the time dimension, the data access frequency dimension and/or the value dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.
Referring to fig. 5, a flowchart of a fourth embodiment of the data processing method provided by the present application is shown. Taking as an example the case where the life cycle state of data is divided, according to whether the data has failed, into two states, namely life cycle ended and life cycle not ended, this embodiment describes an implementation process for determining the life cycle state of the data to be processed according to the processing result of the data processing model. As shown in fig. 5, the life cycle state of the data to be processed may be determined through the following processing:
Step 501, obtaining first confidence information that the life cycle of the data to be processed has ended and second confidence information that the life cycle of the data to be processed has not ended, both of which are included in the processing result.
Under the condition that the life cycle state of the data is divided into two states of end life cycle and non-end life cycle according to whether the data is invalid or not, generally, the data processing model can output the following result data after the life cycle state of the data to be processed is predicted: the first confidence information that the life cycle of the data to be processed is finished, and the second confidence information that the life cycle of the data to be processed is not finished.
The first confidence information and the second confidence information may be, but not limited to, information such as a probability value or a percentage value that can indicate a degree of likelihood that the data to be processed belongs to the corresponding state.
Step 502, determining the life cycle state of the data to be processed based on the first confidence information and the second confidence information.
On the basis of obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, the life cycle state of the data to be processed can be determined based on the value of the first confidence information and the second confidence information and by combining a preset state determination strategy.
Illustratively, the state determination policy may be: the life cycle state of the data to be processed is a state corresponding to the confidence coefficient with a larger value. For example, assuming that the confidence of the to-be-processed data corresponding to the state in which the life cycle has not ended is 70%, and the confidence corresponding to the state in which the life cycle has ended is 30%, it can be determined that the life cycle state of the to-be-processed data is that the life cycle has not ended.
The state determination policy may also be: if the larger of the two confidence values is higher than a preset threshold, the life cycle state of the data to be processed is the state corresponding to that larger confidence value; and if the larger confidence value is lower than the preset threshold, the state of the data to be processed is not recognized.
For example, assume that the preset threshold is 70%. If the confidence that the life cycle of the data to be processed has not ended is 45% and the confidence that it has ended is 55%, the state of the data to be processed is not recognized, because the larger confidence value (55%) is lower than the preset threshold (70%). If the confidence that the life cycle has not ended is 10% and the confidence that it has ended is 90%, the larger confidence value (90%) is higher than the preset threshold (70%), so the life cycle state of the data to be processed is the state corresponding to the larger confidence value (90%), that is, the life cycle has ended.
In specific implementation, the state determination policy may be chosen based on actual requirements (for example, whether unrecognized cases are allowed), and is not limited to the two policies provided by this embodiment.
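The two state determination policies can be sketched as follows, using the figures from the examples in this embodiment. The function names and state keys are hypothetical, and treating a confidence exactly equal to the threshold as recognized is an assumption, since the text only covers the higher and lower cases.

```python
from typing import Dict, Optional


def state_by_larger_confidence(confidences: Dict[str, float]) -> str:
    # Policy 1: pick the state whose confidence value is larger.
    return max(confidences, key=confidences.get)


def state_by_threshold(
    confidences: Dict[str, float], threshold: float = 0.70
) -> Optional[str]:
    # Policy 2: accept the larger confidence only if it reaches the
    # preset threshold; otherwise the state is left unrecognized (None).
    # Assumption: a value exactly equal to the threshold is accepted.
    state = max(confidences, key=confidences.get)
    if confidences[state] < threshold:
        return None
    return state


print(state_by_larger_confidence({"not_ended": 0.70, "ended": 0.30}))
print(state_by_threshold({"not_ended": 0.45, "ended": 0.55}))
print(state_by_threshold({"not_ended": 0.10, "ended": 0.90}))
```

The same functions apply unchanged to a three-state division, since they only compare the confidence values supplied.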
In the embodiment, when the life cycle state of the data to be processed is determined, the data processing model specifically predicts the life cycle state by using the corresponding relation rule of the characteristics of the big data in each dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.
Referring to fig. 6, a schematic structural diagram of a fifth embodiment of a data processing apparatus provided in the present application is shown. The apparatus may be applied to various terminal devices such as a smart phone, a tablet computer, a personal digital assistant, a notebook computer, a desktop computer or an all-in-one machine, or may be applied to various general-purpose or special-purpose servers. As shown in fig. 6, the data processing apparatus includes:
an obtaining unit 601, configured to obtain data to be processed.
The data to be processed may be various types of data generated or created in an actual production environment, such as but not limited to data table data or log data of a log file, etc.
For such data, there is often a need to know its life cycle state for a specific purpose. For example, in order to delete data whose life cycle has ended (also called failed data) and release storage space, it is necessary to know whether the life cycle of the data has ended; in order to store data in a classified manner according to how frequently it is accessed, it is necessary to know whether the data is in a frequently accessed state, a rarely accessed state, or a failed state. In view of this, the objective of the present application is to determine the life cycle state of data with low labor cost and high accuracy.
Determining the data life cycle state depends on how the life cycle state is divided; different division modes lead to different candidate states when the data life cycle state is predicted.
For example, when the division is made according to whether the data has failed, the life cycle state of the data can be divided into two states, namely life cycle not ended and life cycle ended; when the division is made according to the data access frequency, the life cycle state of the data can be divided into states such as frequent access, little access and failure.
An extracting unit 602, configured to extract feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in the time dimension or the data access frequency dimension.
The characteristics of the data to be processed in the time dimension may include, but are not limited to: the length of time the data to be processed has survived from its creation time point to the present, and the length of time from its last access time point to the present.
Characteristics of the data to be processed in the data access frequency dimension may include, but are not limited to: the data access frequency of the data to be processed in at least one predetermined time period, for example, the access frequency of the data to be processed in the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year or the last 5 years, etc.
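A minimal sketch of extracting these two groups of features is given below. It assumes that the creation time and an access history are available as timestamps, measures access frequency as the number of accesses inside each window, and treats data that was never accessed as last accessed at creation; the function name, the feature names and the chosen windows are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical time granularities, in days, for the access frequency features.
WINDOWS_DAYS = (1, 7, 15, 30, 365)


def extract_features(created_at, access_times, now):
    """Extract time-dimension and access-frequency-dimension features."""
    # Assumption: never-accessed data counts as last accessed at creation.
    last_access = max(access_times) if access_times else created_at
    features = {
        # Time dimension: survival time and time since last access, in days.
        "survival_days": (now - created_at).total_seconds() / 86400,
        "idle_days": (now - last_access).total_seconds() / 86400,
    }
    # Access frequency dimension: access count within each recent window.
    for days in WINDOWS_DAYS:
        start = now - timedelta(days=days)
        features[f"accesses_last_{days}d"] = sum(
            1 for t in access_times if start <= t <= now
        )
    return features


now = datetime(2024, 1, 31)
created = datetime(2024, 1, 1)
accesses = [datetime(2024, 1, 30, 12), datetime(2024, 1, 20), datetime(2024, 1, 2)]
print(extract_features(created, accesses, now))
```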
For different types of data to be processed, when data feature extraction is performed on the data to be processed, the same data feature extraction may be performed on the data to be processed, or different data feature extraction may also be performed on the data to be processed, which is not limited in the present application.
Specifically, for different types of data, such as a data table and a log file, it may be set that the features to be extracted are both the features in the time dimension and/or the data access frequency dimension.
Alternatively, different features to be extracted may be set for different types of data. For example, for a data table, the features to be extracted may be set as features of the time dimension, the data access frequency dimension and the value dimension, while for a log file, the features to be extracted may be set as features of the time dimension, the data access frequency dimension, and the like.
The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset value levels: high, medium and low. For instance, when data feature extraction is performed on a data table of intermediate data, since the importance of intermediate data in an actual production environment is often low and enterprise personnel or users generally do not pay much attention to it, the feature value of the value feature may be extracted as "low"; for a data table of result data, since the importance of result data in an actual production environment is often high and it is generally a key object of attention for enterprise personnel or users, the feature value of the value feature may be extracted as "high".
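The per-type feature sets and the value-level feature can be expressed as simple lookup tables, as in the sketch below. The table role names, the numeric encoding of the levels and the "medium" default for roles not mentioned in the text are assumptions made for illustration.

```python
# Numeric encoding of the three preset value levels (assumed encoding).
VALUE_LEVELS = {"low": 0, "medium": 1, "high": 2}

# Value level by the role of a data table, following the examples in the
# text: intermediate data is "low", result data is "high".
TABLE_ROLE_TO_VALUE = {"intermediate": "low", "result": "high"}

# Feature dimensions to extract for each data type.
FEATURES_BY_TYPE = {
    "data_table": ("time", "access_frequency", "value"),
    "log": ("time", "access_frequency"),
}


def value_feature(table_role):
    """Return (value level, numeric code) for a data table role."""
    # Assumption: roles not covered by the text default to "medium".
    level = TABLE_ROLE_TO_VALUE.get(table_role, "medium")
    return level, VALUE_LEVELS[level]


print(value_feature("intermediate"))
print(value_feature("result"))
print(FEATURES_BY_TYPE["log"])
```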
The processing unit 603 is configured to process the feature data by using a pre-trained data processing model to obtain a processing result, so as to determine a life cycle state of the data to be processed.
The data processing model is a model trained by using large-batch data of types such as a data table or a log file in advance, and can describe the corresponding relation rule between the data characteristics (such as the characteristics of time dimension and/or access frequency dimension) of the data and the life cycle state of the data.
In view of this, feature data of the data to be processed may be provided to the data processing model as an input of the data processing model, the data processing model outputs processing result data capable of indicating a life cycle state of the data to be processed based on a rule of correspondence between learned data features and data life cycle states, and the life cycle state of the data to be processed may be determined based on the processing result data of the data processing model, such as determining whether the life cycle of the data to be processed has ended, or determining whether the data to be processed is in a state of frequent access, few access, or failure, and so on.
According to the above scheme, after obtaining the data to be processed, the data processing apparatus extracts feature data of the data to be processed, the feature data including at least one of: features of the data to be processed in the time dimension or in the data access frequency dimension; on this basis, the feature data is processed by using a pre-trained data processing model to determine the life cycle state of the data to be processed. Because the pre-trained data processing model determines the life cycle state based on at least one of the features of the data to be processed in the time dimension or the data access frequency dimension, the determination specifically relies on the rule, learned from big data, that relates those features to the life cycle state of the data.
The data processing apparatus of the present application needs to be used on the basis of the data processing model that is trained in advance, that is, as shown in fig. 2, a data processing model needs to be trained in advance by using training samples, and on this basis, the life cycle state prediction of data can be performed on the basis of the data processing model.
In view of this, referring to fig. 7, a schematic structural diagram of a sixth embodiment of a data processing apparatus provided in the present application is shown, in this embodiment, the data processing apparatus further includes:
a preprocessing unit 604 for:
obtaining a plurality of training samples; extracting feature data of the predetermined features of each training sample, the feature data of a training sample including at least one of: features of the training sample in the time dimension or in the data access frequency dimension; labeling the data life cycle state of each training sample to obtain a data life cycle state labeling result of each training sample; establishing a correspondence between the feature data of each training sample and its data life cycle state labeling result to obtain correspondence data between the feature data of each training sample and the data life cycle state labeling result; and training a data processing model, based on a preset machine learning algorithm, by using the correspondence data of each training sample, so that the data processing model can output a corresponding data life cycle state prediction result based on input feature data.
The training samples may be, but are not limited to, batch data randomly selected from an actual production environment, such as batch log file data, or batch spreadsheet data, etc.
The data type of the training sample is determined according to the function required by the data processing model to be trained, the training sample is data of a data table type if the data processing model to be trained is required to have a life cycle state prediction function of data of a data table, and the training sample is data of a log type if the data processing model to be trained is required to have a life cycle state prediction function of log data.
The features of the training samples in the time dimension may include, but are not limited to: the length of time a training sample has survived from its creation time point to the present, and the length of time from its last access time point to the present.
The features of the training samples in the data access frequency dimension may include, but are not limited to: the frequency of data access of the training samples during at least one predetermined time period, such as the frequency of access of the training samples during the time granularity of the last 1 day, the last 1 week, the last 15 days, the last 1 month, the last 1 year, the last 5 years, etc.
For training samples of different types, when data feature extraction is performed on the training samples, the same data feature extraction may be performed on the training samples, or different data feature extraction may be performed on the training samples, which is not limited in the present application.
Specifically, for different types of training samples such as a data table and a log file, it may be set that the features to be extracted are both the features in the time dimension and/or the data access frequency dimension.
Or, for different types of training samples such as a data table and a log file, different features to be extracted may be set for the training samples, for example, for the data table, the features to be extracted may be set as features of a time dimension, features of a data access frequency dimension, and features of a value dimension, and for the log file, the features to be extracted may be set as features of the time dimension, features of the data access frequency dimension, and the like.
The feature of the value dimension may be, but is not limited to, a value level of the data, for example one of three preset value levels: high, medium and low. For instance, when data feature extraction is performed on a training sample that is a data table of intermediate data, since the importance of intermediate data in an actual production environment is often low and enterprise personnel or users generally do not pay much attention to it, the feature value of the value feature of the data table can be extracted as "low"; for a data table of result data, since the importance of result data in an actual production environment is often high and it is generally a key object of attention for enterprise personnel or users, the feature value of the value feature of the data table can be extracted as "high".
Specifically, the life cycle state of each training sample can be labeled by means of manual labeling and the like.
The life cycle state labeling result of the training sample is determined according to the dividing mode of the life cycle state, for example, if the data is divided into two states of the end of the life cycle and the non-end of the life cycle according to whether the data is invalid or not, the life cycle state labeling result of the training sample is one of the two states, and if the life cycle is divided into three states of frequent visit, few visit and invalid according to the frequency degree of visit, the life cycle state labeling result of the training sample is one of the three states.
It should be noted that the processing for labeling the life cycle state of the training samples and the processing for extracting the features of the training sample data are not limited to the execution order given in this embodiment. In practical application, the feature extraction may be executed first and the state labeling afterwards, the state labeling may be executed first and the feature extraction afterwards, or the two may be executed at the same time, which is not limited in this embodiment.
On the basis of extracting the feature data of the training samples and marking the life cycle state of the training samples, the corresponding relation between the feature data of each training sample and the life cycle state marking result can be established, so that support is provided for training of the data processing model.
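Establishing the correspondence can be as simple as pairing each sample's feature data with its label. The sketch below assumes both are keyed by a sample identifier and that a two-state division is in use; the names, the state keys and the choice to skip unlabeled samples are hypothetical.

```python
# Candidate states under the assumed two-state division.
VALID_STATES = {"ended", "not_ended"}


def build_correspondence(features_by_id, labels_by_id):
    """Pair the feature data of each training sample with its label."""
    pairs = []
    for sample_id, features in features_by_id.items():
        label = labels_by_id.get(sample_id)
        if label is None:
            continue  # assumption: unlabeled samples are left out of training
        if label not in VALID_STATES:
            raise ValueError(f"unknown life cycle state {label!r} for {sample_id!r}")
        pairs.append((features, label))
    return pairs


features_by_id = {"t1": [400.0, 0.0], "t2": [3.0, 25.0], "t3": [90.0, 1.0]}
labels_by_id = {"t1": "ended", "t2": "not_ended"}
print(build_correspondence(features_by_id, labels_by_id))
```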
The machine learning algorithm may be, but is not limited to, any one of random forest, KNN (k-nearest neighbor classification algorithm), logistic regression, SVM (support vector machine), and the like. In a specific application, a technician may select an appropriate machine learning algorithm based on the actual requirements on model characteristics.
After the correspondence between the feature data of each training sample and its life cycle state labeling result is established, the model parameters of the selected machine learning algorithm, such as the weight of each feature and the learning rate, can be initialized. On this basis, the initialized model can be trained by using the correspondence between the feature data of each training sample and the life cycle state labeling result, so that the model continuously learns the rule relating the data features of the training samples to their life cycle states and continuously adjusts/optimizes the weight of each feature, finally yielding a data processing model capable of outputting a corresponding life cycle state prediction result based on input feature data.
The data type of the data that can be processed by the trained data processing model generally needs to be consistent with the data type of the training sample, for example, if the data type of the training sample is a data table type, the trained data processing model can process the data of the data table, that is, the life cycle state prediction result of the data table can be output according to the data characteristics of the input data of the data table; if the data type of the training sample is the log type, the trained data processing model can process the data of the log type, namely, the life cycle state prediction result of the log data can be output according to the data characteristics of the input log data.
That is, the data processing model trained based on training samples of different data types is generally only applicable to life cycle state prediction of data of the same type as the training sample data.
In order to provide users with a life cycle state prediction function suitable for different types of data, the trained data processing models respectively suitable for different data types can be integrated as sub-models, and a general model is obtained once the integration of the sub-models is completed. The general model is suitable for predicting the life cycle state of different types of data at the same time. In practical application, a general interface located at the upper layer of the sub-models can be provided for the general model (for example, a data input and state prediction interface oriented to each type of data may be designed), so that the user is provided with a life cycle state prediction function that is applicable to different types of data at the same time. The general interface at least has the functions of identifying/judging the data type of the data to be predicted and calling the corresponding sub-model, according to that data type, to predict the data life cycle state.
In this embodiment, based on a preset machine learning algorithm, a data processing model is trained by using training samples of a corresponding data type, so that a life cycle state prediction function is provided for data of that data type; data processing models respectively suitable for different data types are integrated as sub-models, and a general interface located at the upper layer of the sub-models is provided for the general model obtained by integration, so that a general life cycle state prediction function is provided for various types of data.
In the seventh embodiment of the present application, the data processing model is preferably a model that can be simultaneously applied to perform life cycle state prediction on data of different data types (e.g., data table data, log data), and specifically, the data processing model includes more than one sub-processing model, different sub-processing models correspond to different data types and are used for performing life cycle state prediction on data of different data types, and data feature types and/or feature weights corresponding to different sub-processing models are different.
On this basis, the processing unit 603 processes the feature data by using a pre-trained data processing model to obtain a processing result, which specifically includes:
obtaining the data type of the data to be processed; determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different; and inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed.
The data to be processed may be, but is not limited to, data table data or log data.
The data type of the data to be processed may be identified based on, but not limited to, the data format of the data to be processed, the document type of the document in which the data to be processed is located (for example, data tables and log data differ considerably in both data format and document type), or pre-labeled type information.
After the data type of the data to be processed is obtained, a sub-processing model matched with the data to be processed can be determined from a plurality of sub-processing models included in the data processing model according to the data type.
For example, if the data type of the data to be processed is data table data, a first sub-processing model capable of performing life cycle state prediction on the data table data may be determined from a plurality of sub-processing models included in the data processing model; if the data type of the data to be processed is log data, a second sub-processing model capable of predicting the life cycle state of the log data can be determined from a plurality of sub-processing models included in the data processing model.
After determining the sub-processing model corresponding to the data type of the data to be processed, the feature data of the data to be processed may be input into the sub-processing model, and finally, the sub-processing model outputs a life cycle state prediction result corresponding to the feature data of the data to be processed (such as a feature of a time dimension, a feature of a data access frequency dimension, and/or a feature of a value dimension) based on a rule of a correspondence relationship between the data feature and the data life cycle state.
The output life cycle state prediction result is different according to different data life cycle state division modes.
For example, if the data life cycle state is divided, according to whether the data has failed, into two states, namely life cycle ended and life cycle not ended, the life cycle state prediction result includes confidence information corresponding to the life cycle having ended and confidence information corresponding to the life cycle not having ended; if the data life cycle state is divided, according to access frequency, into three states, namely frequent access, little access and failure, the life cycle state prediction result includes three pieces of confidence information corresponding respectively to those three states.
The present embodiment provides a universal lifecycle state prediction function for different types of data by using a data processing model that can be adapted to different data types. In addition, in the embodiment, the life cycle state prediction is performed on the data to be processed by utilizing the corresponding relation rule of the characteristics of the big data in the time dimension, the data access frequency dimension and/or the value dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.
In the eighth embodiment of the present application, taking as an example the case where the life cycle state of the data is divided into two states, namely life cycle ended and life cycle not ended, the process by which the processing unit 603 determines the life cycle state of the data to be processed according to the processing result of the data processing model is described. The processing unit 603 may determine the life cycle state of the data to be processed by:
obtaining first confidence information that the life cycle of the data to be processed is finished and second confidence information that the life cycle of the data to be processed is not finished, wherein the first confidence information and the second confidence information are included in the processing result; and determining the life cycle state of the data to be processed based on the first confidence coefficient information and the second confidence coefficient information.
When the life cycle state of data is divided into two states according to whether the data has become invalid, namely life cycle ended and life cycle not ended, the data processing model generally outputs the following result data after predicting the life cycle state of the data to be processed: first confidence information that the life cycle of the data to be processed has ended, and second confidence information that the life cycle of the data to be processed has not ended.
The first confidence information and the second confidence information may be, but are not limited to, a probability value, a percentage value, or other information that indicates how likely the data to be processed is to belong to the corresponding state.
Once the first confidence information and the second confidence information have been obtained, the life cycle state of the data to be processed can be determined from the values of the two pieces of confidence information in combination with a preset state determination policy.
Illustratively, the state determination policy may be: the life cycle state of the data to be processed is a state corresponding to the confidence coefficient with a larger value. For example, assuming that the confidence of the to-be-processed data corresponding to the state in which the life cycle has not ended is 70%, and the confidence corresponding to the state in which the life cycle has ended is 30%, it can be determined that the life cycle state of the to-be-processed data is that the life cycle has not ended.
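The larger-confidence policy described above can be sketched as follows. The function name and the state labels are hypothetical, and the handling of a tie (equal confidences) is an assumption, since the text does not address that case.

```python
def state_by_larger_confidence(first_confidence: float,
                               second_confidence: float) -> str:
    """Return the state whose confidence is larger.

    first_confidence:  confidence that the life cycle has ended
    second_confidence: confidence that the life cycle has not ended
    Assumption: a tie is resolved in favour of "not_ended".
    """
    if first_confidence > second_confidence:
        return "ended"
    return "not_ended"


# The example from the text: 30% ended, 70% not ended.
print(state_by_larger_confidence(0.30, 0.70))  # not_ended
```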
The state determination policy may also be: if the larger of the two confidence values is higher than a predetermined threshold, the life cycle state of the data to be processed is the state corresponding to that larger confidence; if the larger confidence is lower than the predetermined threshold, the state of the data to be processed is not recognized.
For example, assume that the predetermined threshold is 70%. If the confidence corresponding to the state in which the life cycle has not ended is 45% and the confidence corresponding to the state in which the life cycle has ended is 55%, the state of the data to be processed is not recognized, because the larger confidence (55%) is lower than the predetermined threshold (70%). If the confidence corresponding to the not-ended state is 10% and the confidence corresponding to the ended state is 90%, the larger confidence (90%) is higher than the predetermined threshold (70%), so the life cycle state of the data to be processed is the state corresponding to the larger confidence (90%), namely that the life cycle has ended.
In a specific implementation, the state determination policy may be chosen based on actual requirements (for example, whether unrecognized results are allowed) and is not limited to the two policies provided by this embodiment.
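A minimal sketch of this threshold-based policy follows. The function name, the state labels, and the use of `None` for "not recognized" are hypothetical. Treating a larger confidence exactly equal to the threshold as recognized is also an assumption, because the text only covers the strictly higher and strictly lower cases.

```python
from typing import Optional


def state_by_threshold(first_confidence: float,
                       second_confidence: float,
                       threshold: float = 0.70) -> Optional[str]:
    """Return the state with the larger confidence, or None if it is below the threshold.

    first_confidence:  confidence that the life cycle has ended
    second_confidence: confidence that the life cycle has not ended
    """
    larger = max(first_confidence, second_confidence)
    if larger < threshold:
        return None  # state of the data to be processed is not recognized
    return "ended" if first_confidence >= second_confidence else "not_ended"


# The two examples from the text, with a 70% threshold.
print(state_by_threshold(0.55, 0.45))  # None
print(state_by_threshold(0.90, 0.10))  # ended
```

Returning an explicit "not recognized" value lets a caller route such data to another process, such as manual review, instead of forcing a low-confidence decision.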
In the embodiment, when the life cycle state of the data to be processed is determined, the data processing model specifically predicts the life cycle state by using the corresponding relation rule of the characteristics of the big data in each dimension and the life cycle state of the big data, so that compared with the prior art, the labor cost is reduced, and the determined life cycle state of the data has higher accuracy.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system or apparatus is described as being divided by function into various modules or units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A data processing method, comprising:
acquiring data to be processed;
extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;
carrying out life cycle state recognition processing on the characteristic data by utilizing a rule of a corresponding relation between at least one of characteristics of big data in a time dimension or a data access frequency dimension, which is described by a pre-trained data processing model, and a life cycle state of the big data to obtain a processing result of the data to be processed, and determining the life cycle state of the data to be processed based on the processing result;
the determining the life cycle state of the data to be processed based on the processing result comprises: in the case where the life cycle state of the data to be processed is divided into two states, life cycle ended and life cycle not ended, obtaining first confidence information that the life cycle of the data to be processed has ended and second confidence information that the life cycle of the data to be processed has not ended; and determining the life cycle state of the data to be processed based on the first confidence information and the second confidence information;
wherein the data type of the training sample of the data processing model at least comprises the data type of the data to be processed.
2. The method according to claim 1, wherein the extracting feature data of the predetermined feature of the data to be processed comprises:
extracting at least one of: the survival duration of the data to be processed from its creation time point to the current time, the duration from its last access time point to the current time, or the data access frequency within at least one most recent preset time period.
3. The method according to claim 1, wherein the performing the lifecycle state identification processing on the feature data to obtain the processing result of the data to be processed comprises:
obtaining the data type of the data to be processed;
determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different;
inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed, so as to obtain the life cycle state of the data to be processed based on the processing result.
4. The method of claim 1, wherein the training process of the data processing model comprises:
obtaining a plurality of training samples;
extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: characteristics of the training sample in the time dimension or the data access frequency dimension;
marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample;
establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result;
and training a data processing model by utilizing the corresponding relation data of each training sample based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.
5. The method of claim 4, wherein obtaining the plurality of training samples comprises:
randomly selecting a batch of data from a production environment, and using the randomly selected data as a training sample.
6. A data processing apparatus, comprising:
an acquisition unit for acquiring data to be processed;
an extraction unit for extracting feature data of a predetermined feature of the data to be processed; the characteristic data includes at least one of: the characteristics of the data to be processed in a time dimension or a data access frequency dimension;
a processing unit for carrying out life cycle state recognition processing on the characteristic data by utilizing the rule, described by a pre-trained data processing model, of the corresponding relation between at least one of the characteristics of big data in the time dimension or the data access frequency dimension and the life cycle state of the big data, to obtain the processing result of the data to be processed, and for determining the life cycle state of the data to be processed based on the processing result;
when determining the life cycle state of the data to be processed based on the processing result, the processing unit is specifically configured to: in the case where the life cycle state of the data to be processed is divided into two states, life cycle ended and life cycle not ended, obtain first confidence information that the life cycle of the data to be processed has ended and second confidence information that the life cycle of the data to be processed has not ended; and determine the life cycle state of the data to be processed based on the first confidence information and the second confidence information;
wherein the data type of the training sample of the data processing model at least comprises the data type of the data to be processed.
7. The apparatus according to claim 6, wherein the extraction unit is specifically configured to:
extracting at least one of: the survival duration of the data to be processed from its creation time point to the current time, the duration from its last access time point to the current time, or the data access frequency within at least one most recent preset time period.
8. The apparatus according to claim 6, wherein the processing unit is specifically configured to:
obtaining the data type of the data to be processed;
determining a sub-processing model corresponding to the data type from the data processing model; the data processing model comprises more than one sub-processing model, different sub-processing models correspond to different data types, and the data characteristic types and/or characteristic weights corresponding to different sub-processing models are different;
inputting the characteristic data of the data to be processed into the sub-processing model corresponding to the data type to obtain a processing result of the data to be processed, so as to obtain the life cycle state of the data to be processed based on the processing result.
9. The apparatus of claim 6, further comprising:
a pre-processing unit to:
obtaining a plurality of training samples;
extracting feature data of the predetermined features of each training sample; the feature data of the training samples comprises at least one of: characteristics of the training sample in the time dimension or the data access frequency dimension;
marking the data life cycle state of each training sample to obtain a data life cycle state marking result of each training sample;
establishing a corresponding relation between the feature data of each training sample and the data life cycle state labeling result to obtain corresponding relation data of the feature data of each training sample and the data life cycle state labeling result;
and training a data processing model by utilizing the corresponding relation data of each training sample based on a preset machine learning algorithm, so that the data processing model can output a corresponding data life cycle state prediction result based on the input feature data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810271496.4A CN108470071B (en) 2018-03-29 2018-03-29 Data processing method and device

Publications (2)

Publication Number Publication Date
CN108470071A CN108470071A (en) 2018-08-31
CN108470071B true CN108470071B (en) 2022-02-18

Family

ID=63262326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810271496.4A Active CN108470071B (en) 2018-03-29 2018-03-29 Data processing method and device

Country Status (1)

Country Link
CN (1) CN108470071B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969249A (en) * 2018-09-29 2020-04-07 北京国双科技有限公司 Production well yield prediction model establishing method, production well yield prediction method and related device
CN110032750A (en) * 2018-12-18 2019-07-19 阿里巴巴集团控股有限公司 A kind of model construction, data life period prediction technique, device and equipment
CN110309127B (en) * 2019-07-02 2021-07-16 联想(北京)有限公司 Data processing method and device and electronic equipment
CN110688385A (en) * 2019-09-29 2020-01-14 联想(北京)有限公司 Data processing method and electronic equipment
CN112784501A (en) * 2021-03-23 2021-05-11 中国核电工程有限公司 Modeling system and method for residual life prediction model of equipment and prediction system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156983A (en) * 2011-03-31 2011-08-17 上海交通大学 Pattern recognition and target tracking based method for detecting abnormal pedestrian positions
CN105590157A (en) * 2014-12-25 2016-05-18 中国银联股份有限公司 Data management based on data lifecycle management template
CN105915555A (en) * 2016-06-29 2016-08-31 北京奇虎科技有限公司 Method and system for detecting network anomalous behavior
CN107527070A (en) * 2017-08-25 2017-12-29 江苏赛睿信息科技股份有限公司 Recognition methods, storage medium and the server of dimension data and achievement data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8032863B2 (en) * 2004-11-18 2011-10-04 Parasoft Corporation System and method for global group reporting
JP5075761B2 (en) * 2008-05-14 2012-11-21 株式会社日立製作所 Storage device using flash memory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《基于文件的数据分级存储的研究与实现》;刘晓然;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140215;全文 *

Also Published As

Publication number Publication date
CN108470071A (en) 2018-08-31

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant