CN113609031A - Data cleaning model construction method, data cleaning method, related equipment and medium

Info

Publication number: CN113609031A
Application number: CN202110774567.4A
Authority: CN
Inventors: 付玉鑫
Original assignee: Shenzhen Chenbei Technology Co Ltd
Current assignee: Shenzhen Chenbei Technology Co Ltd
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2021-11-05
Anticipated expiration: 2041-07-08
Also published as: CN113609031B

Abstract

The application provides a data cleaning model construction method, a data cleaning method, related equipment and media, wherein the data cleaning model construction method comprises the following steps: acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, and the sample information of each cache data comprises a plurality of characteristic information and a cleaning identifier; determining the judgment accuracy under a target condition based on the characteristic information of the plurality of cache data under the first cache characteristic and the cleaning identification corresponding to the plurality of cache data, wherein the target condition is the condition of judging the cleaning category corresponding to the plurality of cache data by taking the first cache characteristic as a judgment standard; and constructing a cache data cleaning model according to the judgment accuracy under a plurality of conditions, wherein the plurality of conditions are conditions for judging the cleaning types corresponding to the plurality of cache data by respectively taking a plurality of cache characteristics as judgment standards. The method and the device can realize accurate judgment on whether the cache data needs to be cleaned.

Description

Data cleaning model construction method, data cleaning method, related equipment and medium

Technical Field

The present application relates to the field of cache data processing, and in particular, to a data cleaning model construction method, a data cleaning method, related devices, and media.

Background

Cache data consists of temporary files that application clients (such as WeChat or a browser) generate from user behavior and store while the user is using them, so that the client can respond quickly the next time the user uses it.

When too much cache data accumulates, it occupies a large share of the cache space, and the terminal on which the application client is installed may lag. At present, most terminals handle this by directly clearing all cache data in the cache space. Although this resolves the lag, frequently used cache data must then be cached again, so the client cannot respond quickly to the user, which is inconvenient for the user.

Disclosure of Invention

The application provides a data cleaning model construction method, a data cleaning method, related equipment and media, and aims to solve the technical problem of inconvenience in use caused by cleaning of all cache data in the prior art.

In a first aspect, a data cleaning model construction method is provided, which includes the following steps:

acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a cleaning identifier and a plurality of feature information, the feature information is respectively feature information corresponding to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category corresponding to each cache data;

determining the judgment accuracy under a target situation based on the feature information of the plurality of cache data under a first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, wherein the first cache feature is any one of the plurality of cache features, and the target situation is a situation of judging the cleaning type corresponding to each of the plurality of cache data by taking the first cache feature as a judgment standard;

and constructing a cache data cleaning model according to the judgment accuracy under a plurality of situations, wherein the plurality of situations are situations of judging the cleaning categories corresponding to the plurality of cache data by taking the plurality of cache features as judgment standards respectively, the cache data cleaning model is a decision tree for judging the cleaning category of cache data by taking the plurality of cache features as judgment standards in sequence, a cache feature with high judgment accuracy is a parent node of a cache feature with low judgment accuracy in the decision tree, the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, and the first branch is the branch whose judgment result is no cleaning.

In this technical scheme, the cleaning category of the cache data in the training sample set is judged by taking each of a plurality of preset cache features as the judgment standard, the judgment accuracy obtained with each cache feature is determined, and a decision tree that judges the cleaning category of cache data by taking the plurality of cache features as judgment standards in sequence is constructed from these judgment accuracies. The order in which the cache features are applied is thus determined in a simple manner, and the operation speed is high. In the decision tree, a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy, and the cache feature with low judgment accuracy is connected to the no-cleaning branch of the cache feature with high judgment accuracy. This is equivalent to a decision strategy in which the cache features with high judgment accuracy are used first to determine whether the cache data needs to be cleaned, and the cache features with low judgment accuracy are used afterwards. Whether cache data needs to be cleaned can therefore be judged accurately, cache data useful to the user can be retained, and the cache data is cleaned in a more fine-grained way than when all cache data is cleared.
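As an illustrative sketch only (it is not taken from the application), the decision strategy described above can be modeled as an ordered chain of feature tests: the features are sorted by descending judgment accuracy, and each test either returns a cleaning result or passes the cache data down its no-cleaning branch to the next feature. The feature names, accuracy values, and rule functions below are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-feature test: returns True when that feature alone
# judges that the cache data should be cleaned.
Rule = Callable[[Dict[str, float]], bool]


def build_chain(accuracy: Dict[str, float], rules: Dict[str, Rule]) -> List[Tuple[str, Rule]]:
    """Order the features so that higher judgment accuracy sits nearer the root."""
    ordered = sorted(accuracy, key=lambda name: accuracy[name], reverse=True)
    return [(name, rules[name]) for name in ordered]


def judge(chain: List[Tuple[str, Rule]], item: Dict[str, float]) -> str:
    """Walk the chain: a cleaning result stops; no cleaning falls through to the child node."""
    for _name, rule in chain:
        if rule(item):
            return "clean"
    return "no cleaning"


# Assumed accuracies and thresholds, for illustration only.
accuracy = {"size_mb": 0.31, "retention_days": 0.12, "use_count": 0.45}
rules: Dict[str, Rule] = {
    "size_mb": lambda d: d["size_mb"] >= 200,
    "retention_days": lambda d: d["retention_days"] >= 14,
    "use_count": lambda d: d["use_count"] <= 5,
}

chain = build_chain(accuracy, rules)
print([name for name, _ in chain])
print(judge(chain, {"size_mb": 20, "retention_days": 7, "use_count": 7}))
```

In this sketch the most accurate feature (`use_count`) is tested first, and a less accurate feature is consulted only when every feature above it has answered no cleaning.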

With reference to the first aspect, in a possible implementation manner, the determining, based on feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data, the determination accuracy in the target situation includes: determining a first information entropy related to a cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determining conditional entropy of the first cache characteristic according to characteristic information of the plurality of cache data under the first cache characteristic and cleaning identifiers corresponding to the plurality of cache data; and calculating the information gain of the first cache characteristic according to the first information entropy and the conditional entropy to indicate the judgment accuracy under the target condition. The accuracy of each cache characteristic serving as a judgment standard in judging the cleaning category of the cache data is measured by using the information gain, the calculation mode is simple, and the speed of constructing a cache data cleaning model is improved.
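The information gain computation described above can be sketched as follows. This is a minimal illustration using the standard definitions of entropy and conditional entropy (base-2 logarithm assumed); the function names and the small data set are hypothetical and are not taken from the application.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def conditional_entropy(feature_values: Sequence[Hashable], cleaning_ids: Sequence[Hashable]) -> float:
    """Entropy of the cleaning identifiers given the feature information."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, cleaning_ids):
        groups[value].append(label)
    total = len(cleaning_ids)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(feature_values: Sequence[Hashable], cleaning_ids: Sequence[Hashable]) -> float:
    """First information entropy minus the conditional entropy of the feature."""
    return entropy(cleaning_ids) - conditional_entropy(feature_values, cleaning_ids)


# Hypothetical training samples: four cache data items.
cleaning_ids = ["clean", "clean", "no cleaning", "no cleaning"]
size_class = ["large", "large", "small", "small"]            # separates the labels perfectly
file_type = ["picture", "document", "picture", "document"]   # carries no information

print(information_gain(size_class, cleaning_ids))
print(information_gain(file_type, cleaning_ids))
```

A feature whose values split the cleaning identifiers cleanly (here `size_class`) receives a larger gain than one that does not (here `file_type`), which is how the gain can stand in for judgment accuracy.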

With reference to the first aspect, in a possible implementation manner, the determining, based on feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data, the determination accuracy in the target situation includes: determining a first information entropy related to a cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determining conditional entropy of the first cache characteristic according to characteristic information of the plurality of cache data under the first cache characteristic and cleaning identifiers corresponding to the plurality of cache data; calculating the information gain of the first cache characteristic according to the first information entropy and the conditional entropy; determining a second information entropy related to the first cache characteristic according to the characteristic information of the plurality of cache data under the first cache characteristic; and calculating an information gain ratio of the first cache characteristic according to the information gain and the second information entropy to indicate the judgment accuracy under the target condition. Using the information gain ratio to measure how accurately each cache feature judges the cleaning category of the cache data gives a more accurate measure of that accuracy.
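A minimal sketch of the information gain ratio follows, assuming the usual definition (information gain divided by the entropy of the feature's own value distribution, which plays the role of the second information entropy). The helper names and sample values are hypothetical, and returning 0.0 for a feature with a single value is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def conditional_entropy(feature_values: Sequence[Hashable], cleaning_ids: Sequence[Hashable]) -> float:
    """Entropy of the cleaning identifiers given the feature information."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, cleaning_ids):
        groups[value].append(label)
    total = len(cleaning_ids)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain_ratio(feature_values: Sequence[Hashable], cleaning_ids: Sequence[Hashable]) -> float:
    """Information gain divided by the entropy of the feature values themselves."""
    gain = entropy(cleaning_ids) - conditional_entropy(feature_values, cleaning_ids)
    second_entropy = entropy(feature_values)
    # Assumption: a feature with only one value carries no information.
    return gain / second_entropy if second_entropy > 0 else 0.0


cleaning_ids = ["clean", "clean", "no cleaning", "no cleaning"]
size_class = ["large", "large", "small", "small"]
item_id = ["a", "b", "c", "d"]  # unique per item: high gain, but penalized by the ratio

print(information_gain_ratio(size_class, cleaning_ids))
print(information_gain_ratio(item_id, cleaning_ids))
```

Both features have the same information gain on this data, but the ratio ranks `size_class` above the many-valued `item_id`, which is the usual motivation for normalizing the gain.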

With reference to the first aspect, in a possible implementation manner, the determining, based on feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data, the determination accuracy in the target situation includes: respectively determining the probability distribution of the cleaning identifiers on various types of feature information under the first cache feature according to the cleaning identifiers corresponding to the plurality of cache data; and determining the minimum Gini index of the first cache characteristic according to the cleaning identifier probability distribution and the characteristic information of the plurality of cache data under the first cache characteristic, so as to indicate the judgment accuracy under the target situation. Using the Gini index to measure how accurately each cache feature judges the cleaning category of the cache data allows that accuracy to be measured more quickly.
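One way to read the minimum Gini index is the CART-style computation sketched below: for each value of the feature, the samples are split into those with that value and the rest, the weighted Gini impurity of the two parts is computed, and the smallest result is kept. This reading is an assumption, as are the function names and sample values.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of a label sequence; an empty sequence counts as pure."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def min_gini_index(feature_values: Sequence[str], cleaning_ids: Sequence[Hashable]) -> Tuple[str, float]:
    """Return the feature value whose binary split gives the lowest weighted Gini index."""
    total = len(cleaning_ids)
    best_value, best_score = "", float("inf")
    for candidate in sorted(set(feature_values)):
        inside = [y for v, y in zip(feature_values, cleaning_ids) if v == candidate]
        outside = [y for v, y in zip(feature_values, cleaning_ids) if v != candidate]
        score = len(inside) / total * gini(inside) + len(outside) / total * gini(outside)
        if score < best_score:
            best_value, best_score = candidate, score
    return best_value, best_score


size_class = ["small", "medium", "large", "large"]
cleaning_ids = ["no cleaning", "no cleaning", "clean", "clean"]
print(min_gini_index(size_class, cleaning_ids))
```

A lower Gini index means a purer split, so under this reading the feature with the smallest minimum Gini index would be treated as the one with the highest judgment accuracy.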

With reference to the first aspect, in a possible implementation manner, the obtaining a training sample set corresponding to a target cache category includes: obtaining a plurality of cache data belonging to the target cache category, and determining feature information of the plurality of cache data under the plurality of cache features respectively; determining a sub-cleaning identifier of first cache data under a first cache characteristic according to characteristic information of the first cache data under the first cache characteristic and a preset identifier processing rule corresponding to the first cache characteristic, so as to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, where the first cache data is any cache data in the plurality of cache data, the first cache characteristic is any cache characteristic in the plurality of cache characteristics, the preset identifier processing rule is a processing rule for judging a cleaning category of the cache data based on the characteristic information, and the sub-cleaning identifier of the first cache data under the first cache characteristic is used for indicating the cleaning category corresponding to the first cache data under the first cache characteristic; and determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data so as to obtain the cleaning identifiers corresponding to the plurality of cache data respectively. By setting the identification processing rule for each cache feature, the cleaning category of the cache data can be measured from multiple dimensions, so that the accurate calibration of the cleaning category of the cache data can be realized.

With reference to the first aspect, in a possible implementation manner, the determining, according to a plurality of sub-cleaning identifiers corresponding to the first cache data, a cleaning identifier corresponding to the first cache data includes: if the number of target sub-cleaning identifiers in a plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determining that the corresponding cleaning identifier of the first cache data is the target sub-cleaning identifier, where the target sub-cleaning identifier is used to indicate one of cleaning or non-cleaning; or determining the sub-cleaning identifier with the largest proportion among the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data. The cleaning identification of the cache data is determined according to the number or the proportion of the sub-cleaning identifications, so that the cleaning category indicated by the cleaning identification can be close to the real situation of the cache data to the greatest extent, which benefits the accuracy of model construction.
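The two ways of combining sub-cleaning identifiers can be sketched as below. The function names are hypothetical. The text does not say which identifier applies when the count does not exceed the preset number, nor how ties are broken, so the `otherwise` argument and the first-seen tie-break are assumptions of this sketch.

```python
from collections import Counter
from typing import List


def label_by_count(sub_ids: List[str], target: str, preset_number: int, otherwise: str) -> str:
    """Use the target identifier when it occurs more than preset_number times."""
    # Assumption: fall back to `otherwise` when the count is not exceeded.
    return target if sub_ids.count(target) > preset_number else otherwise


def label_by_proportion(sub_ids: List[str]) -> str:
    """Use the sub-cleaning identifier with the largest share."""
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(sub_ids).most_common(1)[0][0]


# Hypothetical sub-cleaning identifiers of one cache data item under six cache features.
sub_ids = ["clean", "no cleaning", "clean", "clean", "no cleaning", "clean"]

print(label_by_count(sub_ids, target="clean", preset_number=3, otherwise="no cleaning"))
print(label_by_proportion(sub_ids))
```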

With reference to the first aspect, in a possible implementation manner, the constructing a cache data cleaning model according to the judgment accuracy under a plurality of situations includes: determining, according to the judgment accuracy under the plurality of situations, a second cache feature corresponding to the maximum judgment accuracy; deleting, from the training sample set, the second cache feature, the feature information corresponding to the second cache feature, and second cache data, and returning to the step of determining the judgment accuracy under the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, until the judgment accuracy under each of the plurality of situations is less than a preset accuracy or only one cache feature remains in the sample information, wherein the second cache data is the cache data whose cleaning identifier indicates cleaning when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data; and constructing the cache data cleaning model according to the order in which the plurality of cache features were deleted from the training sample set. Through multiple rounds of iterative judgment, the order in which the cache features are applied can be determined more accurately, which improves the judgment accuracy of the decision tree.
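The iterative ordering can be sketched as a loop that repeatedly picks the most accurate remaining feature, records it, and removes both the feature and the samples that it judges as needing cleaning. Several choices below are assumptions made only to keep the sketch short: judgment accuracy is measured as the plain agreement rate between a per-feature threshold rule and the cleaning identifier (rather than information gain, gain ratio, or Gini index), the last remaining feature is appended as the final node, and all names, rules, and sample values are hypothetical.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Rule = Callable[[object], bool]  # True means this feature judges the data as "clean"


def rule_accuracy(samples: List[Sample], feature: str, rule: Rule) -> float:
    """Share of samples whose cleaning identifier agrees with the feature's judgment."""
    hits = sum(rule(s[feature]) == (s["label"] == "clean") for s in samples)
    return hits / len(samples)


def order_features(samples: List[Sample], rules: Dict[str, Rule], preset_accuracy: float) -> List[str]:
    remaining, data, order = dict(rules), list(samples), []
    while len(remaining) > 1 and data:
        scores = {f: rule_accuracy(data, f, r) for f, r in remaining.items()}
        best = max(scores, key=lambda f: scores[f])
        if scores[best] < preset_accuracy:
            break  # every remaining feature is below the preset accuracy
        order.append(best)
        rule = remaining.pop(best)
        # Drop the samples that the chosen feature already judges as "clean".
        data = [s for s in data if not rule(s[best])]
    if len(remaining) == 1:
        order.extend(remaining)  # assumption: the leftover feature becomes the last node
    return order


rules: Dict[str, Rule] = {
    "size_mb": lambda v: v >= 200,
    "use_count": lambda v: v <= 5,
    "retention_days": lambda v: v >= 14,
}
samples: List[Sample] = [
    {"size_mb": 300, "use_count": 9, "retention_days": 5, "label": "clean"},
    {"size_mb": 250, "use_count": 2, "retention_days": 3, "label": "clean"},
    {"size_mb": 10, "use_count": 1, "retention_days": 30, "label": "clean"},
    {"size_mb": 20, "use_count": 7, "retention_days": 7, "label": "no cleaning"},
    {"size_mb": 50, "use_count": 8, "retention_days": 2, "label": "no cleaning"},
    {"size_mb": 5, "use_count": 3, "retention_days": 1, "label": "no cleaning"},
]
print(order_features(samples, rules, preset_accuracy=0.6))
```

The returned list gives the features from the root of the decision tree downwards. On this data, `retention_days` ties with `use_count` in the first round but becomes the better feature once the samples already handled by `size_mb` are removed, which illustrates why the accuracies are recomputed in each round.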

In a second aspect, a data cleaning method is provided, which includes the following steps:

when detecting that the total amount of cache data corresponding to a target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category by using a cache data cleaning model, wherein the cache data cleaning model is constructed by using the method of the first aspect.
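The trigger condition of the data cleaning method can be sketched as follows. The function names, the dictionary layout of a cache data item, and the stand-in model are hypothetical; a real cache data cleaning model would be the decision tree built by the method of the first aspect.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]


def cleaning_triggered(total_cache: float, cache_space: float,
                       preset_amount: float, preset_ratio: float) -> bool:
    """True when either the absolute amount or the occupancy ratio exceeds its threshold."""
    if total_cache > preset_amount:
        return True
    return cache_space > 0 and total_cache / cache_space > preset_ratio


def clean_if_needed(items: List[Item], model: Callable[[Item], str], cache_space: float,
                    preset_amount: float, preset_ratio: float) -> List[Item]:
    """Return the cache data that remains after an (optional) cleaning pass."""
    total = sum(item["size"] for item in items)
    if not cleaning_triggered(total, cache_space, preset_amount, preset_ratio):
        return items
    return [item for item in items if model(item) != "clean"]


def stand_in_model(item: Item) -> str:
    """Placeholder for the cache data cleaning model (assumed threshold)."""
    return "clean" if item["size"] >= 200 else "no cleaning"


items = [{"size": 300}, {"size": 20}, {"size": 50}]
kept = clean_if_needed(items, stand_in_model, cache_space=1000,
                       preset_amount=500, preset_ratio=0.3)
print([item["size"] for item in kept])
```

Here the total of 370 does not exceed the preset data amount of 500, but the occupancy ratio of 0.37 exceeds the preset ratio threshold of 0.3, so cleaning is triggered and only the item judged as needing cleaning is removed.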

The cache data cleaning model constructed by the method of the first aspect is a decision tree that first uses the cache features with high judgment accuracy to determine whether cache data needs to be cleaned and then uses the cache features with low judgment accuracy. Judging the cleaning category of the cache data with this decision tree therefore determines accurately whether the cache data needs to be cleaned and retains the cache data useful to the user, so that the cache data is cleaned in a more fine-grained way than when all of it is cleared.

In a third aspect, an apparatus for constructing a data cleaning model is provided, including:

an obtaining module, configured to obtain a training sample set corresponding to a target cache category, where the training sample set includes sample information of a plurality of cache data belonging to the target cache category, and the sample information of each cache data includes a cleaning identifier and a plurality of feature information, where the plurality of feature information are feature information corresponding to a plurality of preset cache features, respectively, and the cleaning identifier is used to indicate a cleaning category corresponding to each cache data;

an accuracy determination module, configured to determine, based on feature information of the plurality of cache data under a first cache feature and a corresponding cleaning identifier of the plurality of cache data, a determination accuracy in a target situation, where the first cache feature is any one of the plurality of cache features, and the target situation is a situation in which a cleaning category corresponding to each of the plurality of cache data is determined by using the first cache feature as a determination criterion;

a model building module, configured to construct a cache data cleaning model according to the judgment accuracy under a plurality of situations, where the plurality of situations are situations of judging the cleaning categories corresponding to the plurality of cache data by taking the plurality of cache features as judgment standards respectively, and the cache data cleaning model is a decision tree for judging the cleaning category of cache data by taking the plurality of cache features as judgment standards in sequence, wherein a cache feature with high judgment accuracy is a parent node of a cache feature with low judgment accuracy in the decision tree, the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, and the first branch is the branch whose judgment result is no cleaning.

In a fourth aspect, a data cleansing apparatus is provided, including:

a cleaning module, configured to clean, when it is detected that a total amount of cache data corresponding to a target cache category is greater than a preset data amount, or a ratio of the total amount of cache data corresponding to the target cache category to a total cache space corresponding to the target cache category is greater than a preset ratio threshold, the cache data corresponding to the target cache category through a cache data cleaning model, where the cache data cleaning model is constructed by the method of the first aspect.

In a fifth aspect, there is provided a computer device comprising a memory and one or more processors for executing one or more computer programs stored in the memory, the one or more processors, when executing the one or more computer programs, causing the computer device to implement the data cleansing model building method of the first aspect or the data cleansing method of the second aspect.

A sixth aspect provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the data cleansing model construction method of the first aspect or the data cleansing method of the second aspect.

The application can realize the following beneficial effects: the method can accurately judge whether the cache data needs to be cleaned, retain the useful cache data for the user, and realize the refined cleaning of the cache data.

Drawings

FIG. 1 is a schematic flowchart of a data cleaning model construction method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a decision tree according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of another data cleaning model construction method according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of a data cleaning method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus for constructing a data cleaning model according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data cleaning apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The technical scheme of the embodiment of the application is suitable for a scene of processing the cache, wherein the processing of the cache specifically refers to cleaning of cache data stored in a cache space of the application device, and the application device can be a device for acquiring and responding to user behaviors for realizing interaction with a user, such as a mobile phone, a tablet, a notebook computer and the like. Specifically, the cache data may be a cookie file generated by a user accessing a network using a web browser, a file corresponding to a picture or a document generated by a user previewing the picture, previewing the document, or the like, an installation/patch file related to an application generated by a user updating the application, or the like, and is not limited to the description herein.

Since the cache data is stored in the cache space of the application device as a temporary file, the cache space of the application device is limited, and when the cache data in the cache space is excessive, the running speed of the application device is affected to a certain extent. Therefore, aiming at the cache data stored in the cache space, the application provides a data cleaning model construction method and a data cleaning method, and the cache data cleaning model is constructed and applied to the application equipment to clean the cache data so as to improve the running speed of the application equipment.

In one possible scenario, the data cleaning model construction method and the data cleaning method of the present application may be implemented in the same device; for example, both may be implemented in the application device. In other possible scenarios, the two methods may be implemented in different devices. For example, the data cleaning method is implemented in the application device to clean the cache data, while the data cleaning model construction method is implemented in another device; after the other device has constructed the cache data cleaning model based on the data cleaning model construction method, the cache data cleaning model is migrated to the application device. The migration may mean that, after the other device obtains the cache data cleaning model, the model is stored in the application device manually. Optionally, the migration may also mean that, after the other device obtains the cache data cleaning model, the model is stored in the application device through data interaction; for example, the application device sends a request for the cache data cleaning model to the other device, and the other device sends the cache data cleaning model to the application device in response, so that the model is stored in the application device. The data interaction between the application device and the other device may use wireless communication such as Bluetooth or WiFi, wired communication, or a combination of the two; the present application does not limit the data interaction manner between the two devices. Specifically, the other device may be a personal computer, a server, or the like.

The cache data cleaning model constructed by the data cleaning model construction method is used for cleaning cache data, and useful cache data for users can be reserved, so that the purpose of finely cleaning the cache data in application equipment is achieved. Specifically, the cache data cleaning model may be implemented as an executable file (such as a plug-in, an application client, etc.) that runs automatically and is stored in the application device, so as to automatically clean the cache data in the application device.

The technical solution of the present application is specifically described below.

Referring to fig. 1, fig. 1 is a schematic flow chart of a data cleaning model building method according to an embodiment of the present application, which can be applied to the aforementioned application device or another device; as shown in fig. 1, the method comprises the steps of:

S101, acquiring a training sample set corresponding to a target cache category, where the training sample set includes sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data includes a plurality of feature information and a cleaning identifier, the plurality of feature information respectively corresponds to a plurality of preset cache features, and the cleaning identifier is used to indicate the cleaning category.

Here, the target cache category is a category determined according to user behavior, and one category of user behavior corresponds to one cache category. For example, the target cache category may be the category of cookie files generated when a user accesses a network, or the category of temporary files that are generated when the user uses various application clients and are adapted to those clients. Cache data of different cache categories may have different cache features, where one cache feature measures and represents the cache data in one dimension, so that the cache data can be divided into different classes under that cache feature. For example, the cache feature may be the size of the cache data: by setting m1 data size thresholds, the cache data can be divided into (m1+1) classes, where m1 is a positive integer (m1 ≥ 1). The cache feature may also be the retention time of the cache data in the cache space: by setting m2 retention time lengths, the cache data can be divided into (m2+1) classes, where m2 is a positive integer (m2 ≥ 1). The cache feature may also be the use frequency, the number of uses, the type of the cache data, the time of last use, and the like.
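Dividing cache data into (m+1) classes with m thresholds can be sketched with a binary search over the sorted thresholds. The function name and threshold values are hypothetical, and placing a value that equals a threshold in the upper class is an assumption, since the text does not specify boundary handling.

```python
from bisect import bisect_right
from typing import Sequence


def categorize(value: float, thresholds: Sequence[float]) -> int:
    """Map a feature value to a class index in 0..m, given m thresholds."""
    # Assumption: a value equal to a threshold falls into the upper class.
    return bisect_right(sorted(thresholds), value)


# Two assumed data size thresholds (m1 = 2) give three size classes.
size_thresholds = [50, 200]
print([categorize(size, size_thresholds) for size in (20, 50, 120, 500)])
```

The same helper would apply unchanged to retention time or any other numeric cache feature, with its own list of thresholds.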

In the embodiment of the present application, a plurality of cache features may be preset for the cache data belonging to the target cache category, and the plurality of cache features are combined to distinguish the cleaning category of that cache data, that is, whether the cache data belonging to the target cache category needs to be cleaned. For one piece of cache data belonging to the target cache category, the feature information of the cache data under each cache feature is obtained, which yields a plurality of feature information for the cache data. The plurality of feature information is then considered together to determine whether the cache data needs to be cleaned, and a cleaning identifier is used to represent the result; the cleaning identifier corresponding to the cache data and the plurality of feature information of the cache data form the sample information of the cache data. Specifically, the cleaning identifier may be the text "cleaning" or "no cleaning"; alternatively, it may be 1 or 0, indicating cleaning or no cleaning; alternatively, it may be yes or no, indicating cleaning or no cleaning. The embodiments of the present application do not limit the specific form of the cleaning identifier.

For example, the target cache category is the category of temporary files that are generated by using various application clients and are adapted to those clients, and the cache features preset for the target cache category are the size of the cache data, the retention time of the cache data in the cache space (the length of time from when the cache data was stored in the cache space to the current time), the use frequency, the number of uses (the number of times the cache data has been obtained from the cache space), the type of the cache data, and the time of last use. The size of cache data d1 is 20 megabytes, cache data d1 has been stored in the cache space for 1 week, its use frequency is once per day, its number of uses is 7, it is a picture-type file, and it was last used at 10 a.m. on January 1, 2021. From this feature information it is determined that cache data d1 needs to be cleaned, and (20 megabytes, 1 week, once per day, 7 times, picture-type file, 10 a.m. on January 1, 2021, cleaning) is taken as the sample information of cache data d1.
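The sample information of cache data d1 could be represented as a small record such as the one below. The class name, field names, and units are hypothetical choices for this sketch, and the cleaning identifier is encoded as 1 for cleaning and 0 for no cleaning, which is one of the forms the description allows.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class SampleInfo:
    size_mb: float             # size of the cache data
    retention_days: int        # time already spent in the cache space
    uses_per_day: float        # use frequency
    use_count: int             # number of uses
    data_type: str             # type of the cache data
    last_used: datetime        # time of last use
    cleaning_identifier: int   # 1 = cleaning, 0 = no cleaning

    def features(self) -> Tuple[object, ...]:
        """Feature information only, without the cleaning identifier."""
        return (self.size_mb, self.retention_days, self.uses_per_day,
                self.use_count, self.data_type, self.last_used)


# Sample information of cache data d1 from the example above.
d1 = SampleInfo(size_mb=20, retention_days=7, uses_per_day=1.0, use_count=7,
                data_type="picture", last_used=datetime(2021, 1, 1, 10, 0),
                cleaning_identifier=1)
training_sample_set = [d1]
print(d1.features(), d1.cleaning_identifier)
```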

The training sample set corresponding to the target cache category can be obtained by obtaining a plurality of characteristic information of a plurality of cache data belonging to the target cache category and the cleaning identification corresponding to the plurality of cache data. By presetting a plurality of cache features, the cache data can be measured from a plurality of dimensions, whether the cache data belonging to the target cache category needs to be cleaned or not is distinguished by combining the plurality of cache features, and the accuracy of cache data division can be ensured.

In some possible embodiments, each of the plurality of cache features may be used separately to judge whether cache data belonging to the target cache category needs to be cleaned, and the results obtained for the individual cache features are then combined to determine whether the cache data needs to be cleaned. Step S101 may include the following steps t1 to t3.

Step t1, acquiring a plurality of cache data belonging to the target cache category, and determining feature information of the plurality of cache data under the plurality of cache features respectively.

For the content of determining the characteristic information of the plurality of cache data under the plurality of cache characteristics, reference may be made to the foregoing description.

Step t2, determining the sub-cleaning identifier of the first cache data under the first cache characteristic according to the characteristic information of the first cache data under the first cache characteristic and the preset identifier processing rule corresponding to the first cache characteristic, so as to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data.

Here, the first cache data is any cache data of the plurality of cache data, and the first cache characteristic is any cache characteristic of the plurality of cache characteristics. The preset identification processing rule is used for judging whether the cache data need to be cleaned based on the characteristic information of the cache data under the cache characteristic, namely the preset identification processing rule is a processing rule for judging the cleaning category of the cache data based on the characteristic information, and the preset identification processing rule corresponding to the first cache characteristic is a processing rule for judging the cleaning category of the cache data based on the characteristic information of the cache data under the first cache characteristic.

For example, if the first cache characteristic is a cache data size, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the cache data is smaller than a preset data threshold value or not, and if so, judging that the cache data does not need to be cleaned (namely the cleaning type of the cache data is not cleaned); otherwise, judging that the cache data needs to be cleaned (namely the cleaning category of the cache data is cleaning); illustratively, the preset data threshold may be 200 megabits. For another example, if the first cache characteristic is retention time of the cache data in the cache space, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the retention time of the cache data in the cache space is less than a first preset time length or not, if so, judging that the cache data does not need to be cleaned, otherwise, judging that the cache data needs to be cleaned; illustratively, the first predetermined length of time is two weeks. For another example, if the first cache characteristic is the usage frequency, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the use frequency of the cache data is greater than a preset frequency, if so, judging that the cache data does not need to be cleaned, otherwise, judging that the cache data needs to be cleaned; illustratively, the preset frequency is once every two days. For another example, if the first cache characteristic is the number of times of use, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the use times of the cache data are greater than the preset times or not, if so, judging that the cache data do not need to be cleaned, otherwise, judging that the cache data need to be cleaned; illustratively, the preset number of times is 5. 
For another example, if the first cache characteristic is a cache data type, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the type of the cache data is a preset data type, if so, judging that the cache data does not need to be cleaned, otherwise, judging that the cache data needs to be cleaned; illustratively, the preset data type is a document file. For another example, if the first cache characteristic is the time of last use, the preset identifier processing rule corresponding to the first cache characteristic may be: judging whether the interval between the time the cache data was last used and the current time is less than a second preset time length, if so, judging that the cache data does not need to be cleaned, otherwise, judging that the cache data needs to be cleaned; illustratively, the second preset time length is 7 days. Other cache characteristics and rules are also possible and are not limited to those described here. It should be understood that the cache characteristics and the preset identifier processing rules may be designed according to the specific situation, and the above examples do not limit the present application.
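
The example rules above can be sketched in code as one small predicate per cache feature. This is only an illustrative sketch: the record layout `CacheItem`, its field names, the labels `"clean"` and `"keep"`, and the helper `sub_cleaning_identifiers` are hypothetical names introduced here, and the thresholds are the illustrative values quoted in the examples (200 megabits, two weeks, once every two days, 5 uses, document file, 7 days).

```python
from dataclasses import dataclass

CLEAN, KEEP = "clean", "keep"  # hypothetical labels for the two cleaning categories


@dataclass
class CacheItem:
    """Hypothetical record holding one value per cache feature."""
    size_megabits: float        # cache data size
    retention_days: float       # retention time in the cache space
    uses_per_day: float         # usage frequency
    use_count: int              # number of uses
    data_type: str              # cache data type
    days_since_last_use: float  # interval between last use and now


# One preset identifier processing rule per cache feature.
# Thresholds are the illustrative values from the text, not fixed by the method.
RULES = {
    "size": lambda c: KEEP if c.size_megabits < 200 else CLEAN,
    "retention": lambda c: KEEP if c.retention_days < 14 else CLEAN,
    "frequency": lambda c: KEEP if c.uses_per_day > 0.5 else CLEAN,
    "use_count": lambda c: KEEP if c.use_count > 5 else CLEAN,
    "data_type": lambda c: KEEP if c.data_type == "document" else CLEAN,
    "last_use": lambda c: KEEP if c.days_since_last_use < 7 else CLEAN,
}


def sub_cleaning_identifiers(item):
    """Apply every rule and return one sub-cleaning identifier per cache feature."""
    return {feature: rule(item) for feature, rule in RULES.items()}


item = CacheItem(size_megabits=50, retention_days=30, uses_per_day=1.0,
                 use_count=2, data_type="image", days_since_last_use=3)
print(sub_cleaning_identifiers(item))
```

Keeping the rules in a dictionary keyed by feature name makes it straightforward to add, remove, or retune a rule without touching the rest of the pipeline.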

The sub-cleaning identifier under the first cache characteristic indicates whether the cache data needs to be cleaned when judged by the first cache characteristic alone, that is, it indicates the cleaning category of the cache data under the first cache characteristic. The specific form of the sub-cleaning identifier is the same as that of the cleaning identifier, and reference may be made to the form of the cleaning identifier described above.

In the embodiment of the application, for each cache feature in the plurality of cache features, a preset identifier processing rule is correspondingly set, and according to the feature information of the first cache data under each cache feature and the preset identifier processing rule corresponding to each cache feature, a plurality of sub-cleaning identifiers corresponding to the first cache data can be obtained. By setting the preset identification processing rule for each cache characteristic, the cache data can be subjected to characteristic decomposition, and the cache characteristic and the judgment rule which are most suitable for judging whether the cache data need to be cleaned are found.

Step t3, determining the cleaning identifier of the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data, so as to obtain the cleaning identifiers corresponding to the plurality of cache data.

In a possible implementation manner, if the number of target sub-cleaning identifiers in a plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, it is determined that the cleaning identifier of the first cache data is the target sub-cleaning identifier, and the target sub-cleaning identifier is used for indicating one of cleaning and non-cleaning. The value of the preset number is related to the number of the cache features, for example, the preset number may be greater than or equal to 1/2 of the number of the cache features.

In another possible implementation manner, the sub-cleaning identifier that accounts for the largest proportion among the plurality of sub-cleaning identifiers corresponding to the first cache data may be determined as the cleaning identifier of the first cache data.

By determining the cleaning identifier of each cache data in the manner of the above steps t2-t3, the cleaning identifiers corresponding to the plurality of cache data can be obtained. By setting the identification processing rule for each cache characteristic, the cleaning category of the cache data can be measured from multiple dimensions, and the cleaning identification of the cache data is determined according to the number or the proportion of the sub-cleaning identifications, so that the cleaning category indicated by the cleaning identification can be close to the real condition of the cache data to the maximum extent, and the accurate calibration of the cleaning category of the cache data can be realized.
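
The two ways of combining sub-cleaning identifiers in step t3 can be sketched as follows. The function name `cleaning_identifier` and the labels are hypothetical. The description does not say what happens when no identifier exceeds the preset number or when two identifiers are tied, so returning `None` in the first case and the first-seen identifier in the second are assumptions of this sketch.

```python
from collections import Counter


def cleaning_identifier(sub_ids, preset_number=None):
    """Combine the sub-cleaning identifiers of one cache data into one cleaning identifier.

    With preset_number set, return the identifier whose count is strictly greater
    than preset_number (first implementation), or None if there is none (assumption).
    Without it, return the identifier with the largest proportion (second
    implementation); ties resolve to the identifier seen first (assumption).
    """
    counts = Counter(sub_ids)
    if preset_number is None:
        return counts.most_common(1)[0][0]
    for label, count in counts.items():
        if count > preset_number:
            return label
    return None


votes = ["clean", "keep", "clean", "clean"]
print(cleaning_identifier(votes))                   # clean
print(cleaning_identifier(votes, preset_number=2))  # clean
```

With two cleaning categories and a preset number of at least half the number of cache features, at most one identifier can exceed the threshold, so the first implementation never yields two candidates.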

Optionally, once the sub-cleaning identifier of the first cache data under the first cache characteristic has been obtained through step t2, it may be used to replace the characteristic information of the first cache data under the first cache characteristic. Applying this replacement under every cache characteristic updates all characteristic information of the first cache data to the corresponding sub-cleaning identifiers, and applying it to every cache data updates the characteristic information of each cache data under each cache characteristic in the same way. Replacing the characteristic information with sub-cleaning identifiers discretizes continuous characteristic information into the two values cleaning and not cleaning, so that the cache data is classified under each cache characteristic, which facilitates subsequent quantification and judgment using the cache characteristics.

S102, determining the judgment accuracy under the target condition based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to the plurality of cache data, wherein the first cache feature is any cache feature in the plurality of cache features, and the target condition is the condition of judging the cleaning category corresponding to the plurality of cache data by taking the first cache feature as a judgment standard.

Here, the judgment accuracy in the target case reflects the difference between the cleaning categories determined for the plurality of cache data by taking the first cache feature as the judgment standard and the cleaning identifiers corresponding to the plurality of cache data. The cleaning categories corresponding to the plurality of cache data can be determined according to the judgment rule corresponding to the first cache feature and the feature information of the plurality of cache data under the first cache feature. The higher the judgment accuracy, the smaller this difference, that is, the more accurate the result of judging the cleaning categories of the plurality of cache data by taking the first cache feature as the judgment standard; the lower the judgment accuracy, the larger this difference, that is, the less accurate the result.

The judgment rule corresponding to the first cache characteristic is a rule for judging whether the cache data needs to be cleaned by taking the first cache characteristic as a judgment standard. For example, if the first cache characteristic is the size of the cache data, the judgment rule of the first cache characteristic may be to judge whether the cache data is smaller than a preset data threshold; if the cache data is smaller than the preset data threshold, it is judged that the cache data does not need to be cleaned; otherwise, it is judged that the cache data needs to be cleaned. In some possible implementations, the judgment rule corresponding to the first cache characteristic may be preset, for example, it may be the same rule as the preset identifier processing rule introduced in step S101. In other possible embodiments, the judgment rule corresponding to the first cache characteristic may also be determined in the process of determining the judgment accuracy in the target situation.

Based on the feature information of each cache feature of the plurality of cache data and the cleaning identifier corresponding to each cache data, the judgment accuracy in each situation can be determined, so that the judgment accuracy in a plurality of situations and the judgment rule in a plurality of situations can be obtained.

In a specific implementation, one or more ways may be adopted to measure the accuracy of judging the cleaning category of the cache data by using one cache feature as a judgment standard. For a specific embodiment of measuring the accuracy when each cache feature is used as the determination criterion, reference may be made to the following description.

S103, a cache data cleaning model is constructed according to the judgment accuracy under a plurality of conditions, wherein the plurality of conditions are conditions that a plurality of cache characteristics are respectively used as judgment standards to judge the cleaning types corresponding to the plurality of cache data.

The cache data cleaning model is a decision tree with a plurality of cache features as nodes, a judgment rule of one cache feature is a decision condition of a node corresponding to the one cache feature, according to the judgment accuracy corresponding to each cache feature, the cache feature with high judgment accuracy is used as a parent node of the cache feature with low judgment accuracy, and the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, so that the cache data cleaning model is constructed. The first branch is a branch which is judged that the cache data is not cleaned when the cleaning type of the cache data is judged according to the judgment rule of the cache characteristic with high judgment accuracy. Among the plurality of cache features, the cache feature corresponding to the maximum accuracy is a root node in the decision tree.

For example, suppose the cache characteristics are the size of the cache data, the retention time of the cache data in the cache space, the time of last use, and the number of uses, and the judgment rule of each cache characteristic is preset and consistent with the preset identifier processing rule described above. Suppose further that the judgment accuracies corresponding to the cache characteristics determined in step S102 are, in order from high to low: the time of last use, the size of the cache data, the number of uses, and the retention time of the cache data in the cache space. The constructed decision tree is then as shown in fig. 2: the time of last use is the root node; the size of the cache data is connected to the not-cleaning branch of the time-of-last-use judgment; the number of uses is connected to the not-cleaning branch of the size judgment; and the retention time of the cache data in the cache space is connected to the not-cleaning branch of the number-of-uses judgment.
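
The construction described above amounts to sorting the cache features by judgment accuracy and chaining each one onto the not-cleaning branch of its predecessor. A minimal sketch follows. The names `build_chain` and `judge`, the item fields, and the accuracy scores are hypothetical, the scores are made-up numbers chosen only to reproduce the ordering of the example, and treating every cleaning branch as a leaf is an assumption of this sketch.

```python
def build_chain(accuracy_by_feature):
    """Order features from highest to lowest judgment accuracy; the first is the root."""
    return sorted(accuracy_by_feature, key=accuracy_by_feature.get, reverse=True)


def judge(chain, rules, item):
    """Walk the chain: a 'clean' result stops at once (assumed leaf),
    a 'keep' result moves on to the next, less accurate feature."""
    for feature in chain:
        if rules[feature](item) == "clean":
            return "clean"
    return "keep"


# Made-up accuracy scores, used only to reproduce the ordering in the example.
scores = {"size": 0.30, "retention": 0.05, "last_use": 0.45, "use_count": 0.12}

# Judgment rules with the illustrative thresholds from the earlier examples.
rules = {
    "last_use": lambda c: "keep" if c["days_since_last_use"] < 7 else "clean",
    "size": lambda c: "keep" if c["size"] < 200 else "clean",
    "use_count": lambda c: "keep" if c["use_count"] > 5 else "clean",
    "retention": lambda c: "keep" if c["retention_days"] < 14 else "clean",
}

chain = build_chain(scores)
print(chain)  # ['last_use', 'size', 'use_count', 'retention']

item = {"days_since_last_use": 3, "size": 50, "use_count": 2, "retention_days": 10}
print(judge(chain, rules, item))  # clean (stopped at the use_count node)
```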

In the technical scheme of fig. 1, the cleaning categories of the cache data in the training sample set are judged by taking each of a plurality of preset cache features as the judgment standard, and the judgment accuracy obtained with each cache feature is determined. A decision tree that judges the cleaning category of cache data using the plurality of cache features is then constructed based on these judgment accuracies, so that the order in which the cache features are applied is determined in a simple manner and the operation speed is high. In the decision tree, a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy, and the cache feature with low judgment accuracy is connected to the not-cleaning branch of the cache feature with high judgment accuracy. This is equivalent to a decision strategy in which whether the cache data needs to be cleaned is determined first with the more accurate cache features and then with the less accurate ones. Accurate judgment of whether the cache data needs to be cleaned can thus be realized, cache data useful to the user can be retained, and the cleaning is finer-grained than clearing all cache data indiscriminately.

According to different requirements, the accuracy of judging the cleaning category of the cache data by taking each cache characteristic as a judgment standard can be measured in different modes.

In a possible implementation, the information gain can be used to measure the accuracy of judging the cleaning category of the cache data with a cache feature as the judgment standard. In this case, step S102 may include the following steps a1 to a3:

Step a1, determining a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data.

Here, the first information entropy is used to indicate the uncertainty of the cleaning identifier. The larger the first information entropy is, the higher the uncertainty of the cleaning identifier is, and the larger the amount of information required to determine the cleaning identifier is; the smaller the first information entropy is, the lower the uncertainty of the cleaning identifier is, and the smaller the amount of information required to determine the cleaning identifier is.

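Specifically, the calculation formula of the first information entropy is as follows:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

where H(D) is the first information entropy; k indicates the category of the cleaning identifier corresponding to the cache data, of which there are two (cleaning and not cleaning), so K = 2; |C_k| is the number of cache data whose cleaning identifier belongs to the k-th category; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set). Base 2 is assumed for the logarithm here, as is conventional for information entropy.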

For example, assume that the training sample set is shown in table 1:

| Cache feature 1 | Cache feature 2 | Cache feature 3 | Cache feature 4 | Cleaning identifier |
| --- | --- | --- | --- | --- |
| Feature information 11 (ta) | Feature information 21 | … | … | Cleaning |
| Feature information 12 (ta) | Feature information 22 | … | … | Not cleaning |
| Feature information 13 (ta) | … | … | … | Not cleaning |
| Feature information 14 (tb) | … | … | … | Cleaning |
| Feature information 15 (tb) | … | … | … | Not cleaning |
| Feature information 16 (tc) | … | … | … | Cleaning |
| Feature information 17 (tc) | … | … | … | Cleaning |

TABLE 1

As can be seen from table 1, among the 7 cleaning identifiers, 4 indicate cleaning and 3 indicate not cleaning, so the first information entropy is:

$$H(D) = -\left(\frac{4}{7}\log_2\frac{4}{7} + \frac{3}{7}\log_2\frac{3}{7}\right) \approx 0.985$$
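
The first information entropy for table 1 can be checked numerically with a short sketch. The function name `entropy` and the labels `"clean"` and `"keep"` are hypothetical, and the base-2 logarithm is an assumption, since the description does not state the base.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2, assumed) of a list of cleaning identifiers."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# Cleaning identifier column of table 1: 4 cleaning, 3 not cleaning.
labels = ["clean", "keep", "keep", "clean", "keep", "clean", "clean"]
print(round(entropy(labels), 3))  # 0.985
```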

Step a2, determining the conditional entropy of the first cache characteristic according to the characteristic information of the plurality of cache data under the first cache characteristic and the cleaning identifier corresponding to each of the plurality of cache data.

Here, the conditional entropy of the first cache feature indicates the uncertainty that remains in the cleaning identifier when the feature information under the first cache feature is known. The smaller the conditional entropy of the first cache feature is, the lower the remaining uncertainty is, that is, the more the first cache feature helps to determine the cleaning identifier; the larger the conditional entropy of the first cache feature is, the higher the remaining uncertainty is.

Specifically, the conditional entropy of the first cache feature is calculated as follows:

$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

where H(D|A) is the conditional entropy of the first cache feature; n is the total number of categories of feature information of the plurality of cache data under the first cache feature; i indicates the category of the feature information under the first cache feature; |D_i| is the number of cache data whose feature information under the first cache feature is the i-th category; |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set); k indicates the category of the cleaning identifier corresponding to the cache data (K = 2 categories); and |D_ik| is the number of cache data whose feature information under the first cache feature is the i-th category and whose cleaning identifier is the k-th category. A term with |D_ik| = 0 contributes 0 to the sum.

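For example, assume that the training sample set is shown in table 1, and the first cache feature is cache feature 1. As can be seen from table 1, the feature information of cache feature 1 has three categories in total, namely ta, tb, and tc, so n is 3. Category ta covers 3 cache data (1 cleaning, 2 not cleaning), tb covers 2 (1 cleaning, 1 not cleaning), and tc covers 2 (both cleaning), so the conditional entropy of cache feature 1 is:

$$H(D|A) = \frac{3}{7}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{2}{7}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{7}\cdot 0 \approx 0.679$$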

It should be understood that, when the feature information of each cache data under each cache feature has been updated to the corresponding sub-cleaning identifier in the manner described in step S101 above, the total number of categories of feature information under each cache feature is 2, that is, n is equal to 2.

For example, if the data in table 1 is updated as shown in table 2:

| Cache feature 1 | Cache feature 2 | Cache feature 3 | Cache feature 4 | Cleaning identifier |
| --- | --- | --- | --- | --- |
| Sub-cleaning identifier (cleaning) | Sub-cleaning identifier | … | … | Cleaning |
| Sub-cleaning identifier (cleaning) | Sub-cleaning identifier | … | … | Not cleaning |
| Sub-cleaning identifier (not cleaning) | … | … | … | Not cleaning |
| Sub-cleaning identifier (not cleaning) | … | … | … | Cleaning |
| Sub-cleaning identifier (not cleaning) | … | … | … | Not cleaning |
| Sub-cleaning identifier (not cleaning) | … | … | … | Cleaning |
| Sub-cleaning identifier (not cleaning) | … | … | … | Cleaning |

TABLE 2

In this case, the conditional entropy of the first cache feature is calculated in the same way:

$$H(D|A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

where H(D|A) is the conditional entropy of the first cache feature; n is the total number of categories of the sub-cleaning identifiers of the plurality of cache data under the first cache feature; i indicates the category of the sub-cleaning identifier under the first cache feature; |D_i| is the number of cache data whose sub-cleaning identifier under the first cache feature is the i-th category; |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set); k indicates the category of the cleaning identifier corresponding to the cache data; and |D_ik| is the number of cache data whose sub-cleaning identifier under the first cache feature is the i-th category and whose cleaning identifier is the k-th category.

Step a3, calculating the information gain of the first cache feature according to the first information entropy related to the cleaning identifier and the conditional entropy of the first cache feature, the information gain being used to indicate the judgment accuracy in the target situation.

Specifically, the conditional entropy of the first cache feature is subtracted from the first information entropy to obtain the information gain of the first cache feature. The calculation formula of the information gain is as follows:

$$G(D, A) = H(D) - H(D|A)$$

The information gain is positively correlated with the judgment accuracy in the target situation, that is, the larger the information gain is, the higher the judgment accuracy in the target situation is, and the smaller the information gain is, the lower the judgment accuracy in the target situation is.
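
Steps a1 to a3 can be reproduced on the data of table 1 with the following sketch. The function names and the labels `"clean"` and `"keep"` are hypothetical, and the base-2 logarithm is an assumption. Only cache feature 1 is evaluated, because the other columns of table 1 are elided.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2, assumed) of a list of cleaning identifiers."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """H(D|A): entropy of the labels within each feature category, weighted by category size."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(feature_values, labels):
    """G(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Cache feature 1 and the cleaning identifier column of table 1.
feature_1 = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "keep", "keep", "clean", "keep", "clean", "clean"]

print(round(conditional_entropy(feature_1, labels), 3))  # 0.679
print(round(information_gain(feature_1, labels), 3))     # 0.306
```

Running `information_gain` once per cache feature and sorting the results gives the ordering used to place the features in the decision tree.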

In steps a1 to a3, the information gain is used to measure the accuracy of judging the cleaning category of the cache data with each cache feature as the judgment standard. The calculation is simple, which helps to speed up the construction of the cache data cleaning model.

In another possible implementation, the information gain ratio can be used to measure the accuracy of judging the cleaning category of the cache data with a cache feature as the judgment standard. In this case, step S102 may include the following steps b1 to b5:

Step b1, determining a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data.

Step b2, determining the conditional entropy of the first cache characteristic according to the characteristic information of the plurality of cache data under the first cache characteristic and the cleaning identifier corresponding to each of the plurality of cache data.

Step b3, calculating the information gain of the first cache feature according to the first information entropy related to the cleaning identifier and the conditional entropy of the first cache feature.

Here, for the detailed implementation of steps b1 to b3, reference may be made to the foregoing steps a1 to a3, which will not be described again here.

Step b4, determining a second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature.

Here, the second information entropy is used to indicate the uncertainty of the feature information under the first cache feature. The larger the second information entropy is, the higher the uncertainty of the feature information under the first cache feature is; the smaller the second information entropy is, the lower the uncertainty of the feature information under the first cache feature is.

Specifically, the calculation formula of the second information entropy related to the first cache feature is as follows:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where H_A(D) is the second information entropy related to the first cache feature; n is the total number of categories of feature information of the plurality of cache data under the first cache feature; i indicates the category of the feature information under the first cache feature; |D_i| is the number of cache data whose feature information under the first cache feature is the i-th category; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set).

For example, assume that the training sample set is shown in table 1 and the first cache feature is cache feature 1. As can be seen from table 1, the feature information of cache feature 1 has three categories in total, namely ta, tb, and tc, where the number of ta is 3, the number of tb is 2, and the number of tc is 2. The second information entropy related to cache feature 1 is then:

$$H_A(D) = -\left(\frac{3}{7}\log_2\frac{3}{7} + \frac{2}{7}\log_2\frac{2}{7} + \frac{2}{7}\log_2\frac{2}{7}\right) \approx 1.557$$

Step b5, calculating the information gain ratio of the first cache feature according to the information gain of the first cache feature and the second information entropy related to the first cache feature, the information gain ratio being used to indicate the judgment accuracy in the target situation.

Specifically, the quotient of the information gain and the second information entropy is determined as the information gain ratio of the first cache feature. The calculation formula of the information gain ratio is as follows:

$$G_R(D, A) = \frac{G(D, A)}{H_A(D)}$$

where G_R(D, A) denotes the information gain ratio of the first cache feature, G(D, A) is its information gain, and H_A(D) is the second information entropy related to the first cache feature.

The information gain ratio is positively correlated with the judgment accuracy in the target situation, that is, the larger the information gain ratio is, the higher the judgment accuracy in the target situation is, and the smaller the information gain ratio is, the lower the judgment accuracy in the target situation is.
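
Steps b1 to b5 can be sketched on the data of table 1 as follows. The function names and labels are hypothetical, and the base-2 logarithm is an assumption. The description does not cover a cache feature whose feature information has a single category (second information entropy of 0), so returning 0.0 in that case is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2, assumed) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """G(D, A) = H(D) - H(D|A)."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_values, labels):
    """Information gain divided by the second information entropy H_A(D)."""
    second_entropy = entropy(feature_values)
    if second_entropy == 0:
        return 0.0  # single-category feature; handling is an assumption
    return information_gain(feature_values, labels) / second_entropy


feature_1 = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "keep", "keep", "clean", "keep", "clean", "clean"]

print(round(entropy(feature_1), 3))             # 1.557
print(round(gain_ratio(feature_1, labels), 3))  # 0.197
```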

In steps b1 to b5, the information gain ratio is used to measure the accuracy of judging the cleaning category of the cache data with each cache feature as the judgment standard, so that the accuracy of each cache feature can be measured more accurately.

In yet another possible implementation, the minimum Gini index may be used to measure the accuracy of judging the cleaning category of the cache data with a cache feature as the judgment standard. In this case, step S102 may include the following steps c1 to c2:

Step c1, determining the cleaning identifier probability distribution on each type of feature information under the first cache feature according to the cleaning identifier corresponding to each of the plurality of cache data.

Here, for one type of feature information under the first cache feature, the cleaning identifier probability distribution includes two distributions: the probability distribution of the cleaning identifiers over the cache data whose feature information is of this type, and the probability distribution of the cleaning identifiers over the cache data whose feature information is not of this type.

For example, taking the training sample set shown in table 1, if the first cache feature is cache feature 1, the feature information of cache feature 1 has three categories, namely ta, tb, and tc. In the following, each distribution is written as (probability of cleaning, probability of not cleaning). For category ta, there are 3 cache data whose feature information is ta, of which 1 corresponds to cleaning and 2 to not cleaning, so the cleaning identifier probability distribution corresponding to ta is (1/3, 2/3); there are 4 cache data whose feature information is not ta, of which 3 correspond to cleaning and 1 to not cleaning, so the distribution corresponding to feature information other than ta is (3/4, 1/4). For category tb, there are 2 cache data whose feature information is tb, of which 1 corresponds to cleaning and 1 to not cleaning, so the distribution corresponding to tb is (1/2, 1/2); there are 5 cache data whose feature information is not tb, of which 3 correspond to cleaning and 2 to not cleaning, so the distribution corresponding to feature information other than tb is (3/5, 2/5). For category tc, there are 2 cache data whose feature information is tc, of which 2 correspond to cleaning and 0 to not cleaning, so the distribution corresponding to tc is (1, 0); there are 5 cache data whose feature information is not tc, of which 2 correspond to cleaning and 3 to not cleaning, so the distribution corresponding to feature information other than tc is (2/5, 3/5).

Step c2, determining the minimum Gini index of the first cache feature according to the cleaning identifier probability distribution on each type of feature information under the first cache feature and the feature information of the plurality of cache data under the first cache feature, the minimum Gini index being used to indicate the judgment accuracy in the target situation.

Specifically, according to the cleaning identifier probability distribution on each type of feature information under the first cache feature and the feature information of the plurality of cache data under the first cache feature, the Gini index corresponding to each type of feature information under the first cache feature is calculated to obtain a plurality of Gini indexes, and the smallest of them is determined as the minimum Gini index of the first cache feature.

Specifically, for one type of feature information under the first cache feature, the calculation formula of the Gini index is as follows:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where Gini(D, A) is the Gini index corresponding to this type of feature information; |D_1| is the number of cache data whose feature information under the first cache feature is of this type; |D_2| is the number of cache data whose feature information under the first cache feature is of any other type; Gini(D_1) is the Gini index of the cleaning identifier probability distribution corresponding to this type of feature information; Gini(D_2) is the Gini index of the cleaning identifier probability distribution corresponding to the other types of feature information; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set). For a probability distribution (p_1, ..., p_K), the Gini index takes its standard form:

$$\mathrm{Gini}(p) = 1 - \sum_{k=1}^{K} p_k^2$$

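For example, assume that the training sample set is shown in table 1, and the first cache feature is cache feature 1. The feature information of cache feature 1 has three categories, ta, tb, and tc, so there are 3 Gini indexes, where:

the Gini index corresponding to ta is:

$$\mathrm{Gini}(D, ta) = \frac{3}{7}\left(1 - \left(\frac{1}{3}\right)^2 - \left(\frac{2}{3}\right)^2\right) + \frac{4}{7}\left(1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right) = \frac{17}{42} \approx 0.405$$

the Gini index corresponding to tb is:

$$\mathrm{Gini}(D, tb) = \frac{2}{7}\left(1 - \left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2\right) + \frac{5}{7}\left(1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2\right) = \frac{17}{35} \approx 0.486$$

the Gini index corresponding to tc is:

$$\mathrm{Gini}(D, tc) = \frac{2}{7}\left(1 - 1^2 - 0^2\right) + \frac{5}{7}\left(1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2\right) = \frac{12}{35} \approx 0.343$$

Since the Gini index corresponding to tc is the smallest, it is taken as the minimum Gini index of cache feature 1.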

The minimum Gini index is inversely related to the judgment accuracy in the target situation, that is, the smaller the minimum Gini index is, the higher the judgment accuracy in the target situation is; the larger the minimum Gini index is, the lower the judgment accuracy in the target situation is.
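
Steps c1 to c2 can be sketched on the data of table 1 as follows, using exact fractions so that the results can be compared by hand. The function names and labels are hypothetical, and the Gini index of a distribution is taken in its standard form, 1 minus the sum of squared probabilities.

```python
from collections import Counter
from fractions import Fraction


def gini(labels):
    """Gini index of the cleaning identifier distribution in `labels`."""
    total = len(labels)
    if total == 0:
        return Fraction(0)  # empty side of a split contributes nothing
    return 1 - sum(Fraction(n, total) ** 2 for n in Counter(labels).values())


def gini_per_category(feature_values, labels):
    """For each category, split the data into 'is this category' (D1) and
    'is any other category' (D2) and return the weighted Gini index."""
    total = len(labels)
    result = {}
    for category in dict.fromkeys(feature_values):  # unique categories, in order
        d1 = [y for v, y in zip(feature_values, labels) if v == category]
        d2 = [y for v, y in zip(feature_values, labels) if v != category]
        result[category] = (Fraction(len(d1), total) * gini(d1)
                            + Fraction(len(d2), total) * gini(d2))
    return result


feature_1 = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "keep", "keep", "clean", "keep", "clean", "clean"]

scores = gini_per_category(feature_1, labels)
for category, value in scores.items():
    print(category, value)  # ta 17/42, tb 17/35, tc 12/35

best = min(scores, key=scores.get)
print(best, scores[best])  # tc 12/35 -> minimum Gini index of cache feature 1
```

The category that attains the minimum (`tc` here) is also the natural candidate for a split point when the judgment rule is derived from the Gini index.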

Optionally, in some possible embodiments, the judgment rule that uses the first cache feature as the judgment standard may also be determined according to the minimum Gini index of the first cache feature, where the type of feature information corresponding to the minimum Gini index of the first cache feature is used as the partition point for judging the cleaning category. For example, if the first cache feature is cache feature 1 shown in table 1, tc is taken as the partition point of cache feature 1: cache data whose feature information under cache feature 1 is tc is classified as cleaning, and cache data whose feature information under cache feature 1 is not tc is classified as not cleaning.

In steps c1 to c2, the minimum Gini coefficient is used to measure the accuracy of each cache feature as the judgment standard for the cleaning category of the cache data. Only simple probability calculations are required, so the accuracy of each cache feature can be measured quickly.
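As an illustration of steps c1 to c2, the following minimal Python sketch computes the Gini coefficient for each class of feature information under one cache feature and returns the smallest value together with its division point. The sample values and labels are hypothetical (they are not the data of table 1), the function names `gini` and `min_gini_split` are illustrative, and the cleaning identifier is assumed to be encoded as 1 (cleaning) and 0 (no cleaning).

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of the cleaning-identifier distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def min_gini_split(values, labels):
    """Return (minimum Gini coefficient, division point) for one cache feature.

    For each class of feature information, the samples are divided into
    D1 (value equals the class) and D2 (all other values), and the weighted
    sum |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2) is evaluated.
    """
    n = len(labels)
    best = None
    for cls in sorted(set(values)):
        d1 = [lab for val, lab in zip(values, labels) if val == cls]
        d2 = [lab for val, lab in zip(values, labels) if val != cls]
        g = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
        if best is None or g < best[0]:
            best = (g, cls)
    return best


# Hypothetical feature information under cache feature 1 and cleaning identifiers
values = ["ta", "ta", "tb", "tb", "tc", "tc"]
labels = [1, 0, 1, 0, 1, 1]
print(min_gini_split(values, labels))
```

With this hypothetical data the Gini coefficients for ta and tb are both about 0.417 and the one for tc is about 0.333, so tc is returned as the division point.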

It should be noted that, the determination accuracy corresponding to each cache feature may also be determined by using other manners capable of measuring the accuracy of determining the cleaning category of the cache data by using each cache feature as the determination standard, which is not limited in the present application.

In some possible embodiments, in the process of constructing the cache data cleaning model, in order to obtain a more accurate cache data cleaning model, the parent-child relationship of a plurality of cache features in the decision tree may also be determined in a multiple iteration manner. Referring to fig. 3, fig. 3 is a schematic flow chart of another data cleaning model building method provided in the embodiment of the present application, which can be applied to the aforementioned application device or another device; as shown in fig. 3, the method comprises the steps of:

S201, acquiring a training sample set corresponding to a target cache category, wherein the training sample set corresponding to the target cache category comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a plurality of feature information and a cleaning identifier, the plurality of feature information are respectively feature information corresponding to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category.

S202, determining the judgment accuracy under the target condition based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to the plurality of cache data, wherein the first cache feature is any cache feature in the plurality of cache features, and the target condition is the condition of judging the cleaning category corresponding to the plurality of cache data by taking the first cache feature as a judgment standard.

Here, the detailed implementation manner of steps S201 to S202 can refer to the description of steps S101 to S102, and is not described herein again.

S203, according to the judgment accuracy under a plurality of conditions, determining a second cache feature corresponding to the maximum judgment accuracy, wherein the plurality of conditions are conditions of judging the cleaning categories corresponding to the plurality of cache data by taking the plurality of cache features as judgment standards respectively.

S204, deleting, from the training sample set, the second cache feature, the feature information corresponding to the second cache feature, and second cache data, wherein the second cache data is the cache data judged to be cleaned when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data.

After step S204, return to step S202 if a plurality of cache features remain in the sample information and the judgment accuracy corresponding to each cache feature is greater than or equal to the preset accuracy; if the judgment accuracy corresponding to each cache feature is less than the preset accuracy, or only one cache feature remains in the sample information, execute step S205.

S205, a cache data cleaning model is constructed according to the sequence of the deleted cache features in the training sample set.

Here, the order in which the plurality of cache features are deleted from the training sample set reflects the accuracy of judging the cleaning category of the cache data with each cache feature: the earlier a cache feature is deleted, the higher its accuracy, and the later it is deleted, the lower its accuracy. Therefore, the node relationship of the cache features in the decision tree is determined by the order in which the cache features are deleted from the training sample set. A cache feature deleted earlier is the parent node of a cache feature deleted later, and the cache feature deleted later is connected to the branch of the earlier-deleted cache feature whose judgment result is no cleaning, so that the cache data cleaning model is constructed. The cache feature deleted earliest is the root node of the decision tree.

For example, the plurality of cache features are the size of the cache data, the retention time of the cache data in the cache space, the time of being used last time, and the number of times of use, respectively, and the order of deleting the four features is: the last time of use, the size of the cache data, the number of times of use, and the retention time of the cache data in the cache space, the resulting decision tree is constructed as shown in fig. 2.

In the method embodiment corresponding to fig. 3, in the process of determining the positions of the plurality of cache features in the decision tree, the accuracy of each cache feature as the judgment standard is determined through multiple iterations, and the cache data that the current cache feature can already determine to be cleaned is removed in each iteration. The order in which the cache features are used for judgment can therefore be determined more accurately, which improves the judgment accuracy of the decision tree, that is, of the cache data cleaning model.
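The iteration of steps S202 to S205 can be sketched in Python as follows. This is a simplified illustration rather than the implementation of the application. It assumes that the judgment accuracy is measured as 1 minus the minimum Gini coefficient, that cache data whose feature information equals the division point is the data judged to be cleaned (following the tc example), and that the loop simply stops when no cache feature remains or the best accuracy falls below the preset accuracy. All names and sample data are hypothetical.

```python
from collections import Counter

CLEAN, KEEP = 1, 0  # hypothetical encoding of the cleaning identifier


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def best_split(values, labels):
    """Return (minimum Gini coefficient, division point) for one cache feature."""
    n = len(labels)
    best = None
    for cls in sorted(set(values)):
        d1 = [l for v, l in zip(values, labels) if v == cls]
        d2 = [l for v, l in zip(values, labels) if v != cls]
        g = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
        if best is None or g < best[0]:
            best = (g, cls)
    return best


def build_chain(samples, labels, features, preset_accuracy=0.6):
    """Return [(cache feature, division point), ...] in deletion order."""
    rows = list(zip(samples, labels))
    remaining = list(features)
    chain = []
    while rows and remaining:
        labs = [l for _, l in rows]
        scored = {f: best_split([s[f] for s, _ in rows], labs) for f in remaining}
        feat = min(remaining, key=lambda f: scored[f][0])  # S203: most accurate feature
        g, cls = scored[feat]
        if chain and 1.0 - g < preset_accuracy:  # assumed accuracy measure
            break
        chain.append((feat, cls))
        remaining.remove(feat)  # S204: delete the feature ...
        rows = [(s, l) for s, l in rows if s[feat] != cls]  # ... and the data judged cleaned
    return chain


def judge_with_chain(sample, chain):
    """Walk the chain: later features hang on the no-cleaning branch of earlier ones."""
    return CLEAN if any(sample[f] == cls for f, cls in chain) else KEEP


samples = [
    {"last_used": "old", "size": "small"},
    {"last_used": "old", "size": "large"},
    {"last_used": "old", "size": "small"},
    {"last_used": "recent", "size": "large"},
    {"last_used": "recent", "size": "small"},
    {"last_used": "recent", "size": "small"},
]
labels = [CLEAN, CLEAN, CLEAN, CLEAN, KEEP, KEEP]
chain = build_chain(samples, labels, ["last_used", "size"])
print(chain)
```

For this hypothetical data the feature `last_used` is deleted first and becomes the root node, and `size` is attached to its no-cleaning branch.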

After the cache data cleaning model is constructed and obtained through the method embodiment, whether the cache data needs to be cleaned or not can be judged by using the cache data cleaning model. Referring to fig. 4, fig. 4 is a schematic flow chart of a data cleaning method provided in an embodiment of the present application, where the method can be applied to the aforementioned application device; as shown in fig. 4, the method includes the steps of:

S301, when it is detected that the total amount of the cache data corresponding to the target cache category is larger than a preset data amount, or that the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is larger than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model.

Specifically, cache information of the cache data corresponding to the target cache category under the cache feature may be obtained according to the cache feature corresponding to each node of the cache data cleaning model, and then, based on a decision condition corresponding to each node, whether the cache corresponding to the target cache category needs to be cleaned is determined until it is determined that the cache data needs to be cleaned or until the last node in the cache data cleaning model is traversed.

Take the decision tree shown in fig. 2 as the cache data cleaning model as an example. When cache data corresponding to the target cache category is obtained, the time at which the cache data was last used is determined. If the interval between that time and the current time is not less than a second preset time length, it is determined that the cache data needs to be cleaned. If the interval is less than the second preset time length, the size of the cache data is determined; if the size of the cache data is not less than a preset data threshold, it is determined that the cache data needs to be cleaned. If the size of the cache data is less than the preset data threshold, the number of times the cache data has been used is determined; if the number of uses is not greater than a preset number of times, it is determined that the cache data needs to be cleaned. If the number of uses is greater than the preset number of times, the retention time of the cache data in the cache space is determined; if the retention time is not less than a first preset time length, it is determined that the cache data needs to be cleaned; and if the retention time is less than the first preset time length, it is determined that the cache data does not need to be cleaned.
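The traversal described for the decision tree of fig. 2 can be sketched as a chain of checks. The following Python sketch is illustrative only: the class names, field names and all threshold values are hypothetical, since the application does not specify concrete values for the preset time lengths, the preset data threshold or the preset number of times.

```python
from dataclasses import dataclass

DAY = 24 * 3600


@dataclass
class CacheEntry:
    idle_seconds: float       # time elapsed since the cache data was last used
    size_bytes: int           # size of the cache data
    use_count: int            # number of times the cache data has been used
    retention_seconds: float  # time the cache data has stayed in the cache space


@dataclass
class Thresholds:
    # All values below are hypothetical placeholders.
    second_preset_duration: float = 7 * DAY
    preset_data_threshold: int = 50 * 1024 * 1024
    preset_count: int = 3
    first_preset_duration: float = 30 * DAY


DEFAULT_THRESHOLDS = Thresholds()


def needs_cleaning(entry, t=DEFAULT_THRESHOLDS):
    """Each later check sits on the no-cleaning branch of the previous node."""
    if entry.idle_seconds >= t.second_preset_duration:
        return True  # not used for a long time
    if entry.size_bytes >= t.preset_data_threshold:
        return True  # large cache data
    if entry.use_count <= t.preset_count:
        return True  # rarely used
    # Last node: retention time in the cache space
    return entry.retention_seconds >= t.first_preset_duration


print(needs_cleaning(CacheEntry(1 * DAY, 1024, 10, 2 * DAY)))
```

Because each check returns as soon as cleaning is decided, the most accurate cache feature (the root node) is always consulted first.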

In the embodiment of the method corresponding to fig. 4, when it is detected that the total amount of the cache data corresponding to the target cache category is greater than the preset data amount, or that the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than the preset ratio threshold, the cache data corresponding to the target cache category is cleaned through the cache data cleaning model, and since the cache feature with high determination accuracy is the parent node of the cache feature with low determination accuracy, it is equivalent to preferentially utilize the cache feature with high determination accuracy to determine whether the cache data needs to be cleaned, so that it is possible to accurately determine whether the cache data needs to be cleaned; in addition, each cache data corresponding to the target cache category is judged whether to be cleaned or not, so that the cache data can be cleaned finely.
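The trigger condition of step S301 and the per-item judgment can be sketched as follows. This Python sketch is a minimal illustration: the entry layout (a dictionary with a `size` key), the function names and the numeric values are assumptions, and the cache data cleaning model is represented by any callable that returns `True` when an entry should be cleaned.

```python
def should_trigger(total_size, total_space, preset_amount, preset_ratio):
    """True when either condition of step S301 holds."""
    if total_size > preset_amount:
        return True
    return total_space > 0 and total_size / total_space > preset_ratio


def clean_category(entries, total_space, preset_amount, preset_ratio, model):
    """Return the entries that remain after cleaning one cache category."""
    total_size = sum(entry["size"] for entry in entries)
    if not should_trigger(total_size, total_space, preset_amount, preset_ratio):
        return list(entries)  # nothing is cleaned
    # Each piece of cache data is judged individually by the model.
    return [entry for entry in entries if not model(entry)]


entries = [{"key": "a", "size": 60}, {"key": "b", "size": 50}]
kept = clean_category(entries, total_space=200, preset_amount=100,
                      preset_ratio=0.9, model=lambda e: e["size"] >= 60)
print([entry["key"] for entry in kept])
```

Judging each entry separately, instead of clearing the whole category, is what allows cache data that is still useful to be retained.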

Optionally, after the cache data cleaning model is constructed through the above method embodiment, the cache data cleaning model may also be verified by using a verification sample set. When it is determined according to the verification sample set that the accuracy of the cache data cleaning model is greater than a preset verification accuracy, the cache data cleaning model is determined to be the final cache data cleaning model, and the final cache data cleaning model is used to determine whether the cache data needs to be cleaned.

The verification sample set is a sample set used for verification that is different from the training sample set, and the form of the sample information in the verification sample set is consistent with that of the sample information in the training sample set. Specifically, after the cache data cleaning model is constructed, it may be used to determine the cleaning category of each piece of sample information in the verification sample set; for the manner of doing so, refer to the specific content of the embodiment of fig. 4. After the cleaning category of each piece of sample information is determined, it is compared with the cleaning identifier in that piece of sample information, a first quantity of pieces of sample information whose determined cleaning category is the same as the cleaning category indicated by the cleaning identifier is determined, and the ratio of the first quantity to the total quantity of pieces of sample information in the verification sample set is determined as the accuracy of the cache data cleaning model. By verifying the constructed cache data cleaning model, the accuracy of the cache data cleaning model can be ensured to reach a higher level.
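The verification step amounts to a ratio of matching judgments. A minimal Python sketch follows, in which the model is any callable mapping feature information to a cleaning category; the function names, the sample data and the preset verification accuracy of 0.7 are hypothetical.

```python
def verification_accuracy(model, verification_samples):
    """Ratio of samples whose judged cleaning category matches the cleaning identifier.

    `verification_samples` is a list of (feature_information, cleaning_identifier).
    """
    first_quantity = sum(
        1 for features, identifier in verification_samples
        if model(features) == identifier
    )
    return first_quantity / len(verification_samples)


def is_final_model(model, verification_samples, preset_verification_accuracy):
    """Accept the model only if its accuracy exceeds the preset verification accuracy."""
    return verification_accuracy(model, verification_samples) > preset_verification_accuracy


# Hypothetical model: clean anything idle for 7 days or more
model = lambda features: "clean" if features["idle_days"] >= 7 else "keep"
verification_samples = [
    ({"idle_days": 10}, "clean"),
    ({"idle_days": 1}, "keep"),
    ({"idle_days": 8}, "keep"),
    ({"idle_days": 30}, "clean"),
]
print(verification_accuracy(model, verification_samples))
```

Here three of the four hypothetical samples are judged correctly, so the accuracy is 0.75.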

Optionally, on the basis of the above method embodiments, the training and verification steps may be performed multiple times, either by re-dividing the sample set into a verification sample set and a training sample set, or by forming the training sample set and the verification sample set from new cache data. That is, steps S101 to S103 and the verification step, or steps S201 to S205 and the verification step, are performed multiple times, with the sample sets used for each round of training and verification being different from one another; the finally determined cache data cleaning model is then used to determine whether the cache data needs to be cleaned. Through multiple rounds of training and verification, the accuracy of the cache data cleaning model can be further improved.

The method of the present application is described above, and in order to better carry out the method of the present application, the apparatus of the present application is described next.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a data cleansing model building apparatus according to an embodiment of the present application, where the cache data cleansing building apparatus may be the aforementioned server or terminal, and as shown in fig. 5, the apparatus 40 includes:

an obtaining module 401, configured to obtain a training sample set corresponding to a target cache category, where the training sample set includes sample information of multiple cache data belonging to the target cache category, and the sample information of each cache data includes a cleaning identifier and multiple pieces of feature information, where the multiple pieces of feature information are feature information corresponding to multiple preset cache features respectively, and the cleaning identifier is used to indicate a cleaning category corresponding to each cache data;

an accuracy determining module 402, configured to determine a determination accuracy in a target situation based on feature information of the plurality of cache data under a first cache feature and a corresponding cleaning identifier of the plurality of cache data, where the first cache feature is any one of the plurality of cache features, and the target situation is a situation in which a cleaning category corresponding to each of the plurality of cache data is determined by using the first cache feature as a determination criterion;

a model building module 403, configured to build a cache data cleaning model according to the determination accuracy under multiple conditions, where the multiple conditions are the conditions that the multiple cache features are respectively used as the determination criteria to determine the cleaning categories corresponding to the multiple cache data, and the cache data cleaning model is a decision tree that sequentially uses the multiple cache features as the determination criteria to determine the cleaning categories of the cache data, where a cache feature with high determination accuracy is a parent node of a cache feature with low determination accuracy in the decision tree, and the cache feature with low determination accuracy is connected to a first branch of the cache feature with high determination accuracy, and the first branch is a branch whose determination result is not cleaned.

In one possible design, the accuracy determining module 402 is specifically configured to: determine a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determine the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifiers corresponding to the plurality of cache data; and calculate the information gain of the first cache feature according to the first information entropy and the conditional entropy, to indicate the judgment accuracy under the target condition.

In one possible design, the accuracy determining module 402 is specifically configured to: determine a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determine the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifiers corresponding to the plurality of cache data; calculate the information gain of the first cache feature according to the first information entropy and the conditional entropy; determine a second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature; and calculate the information gain ratio of the first cache feature according to the information gain and the second information entropy, to indicate the judgment accuracy under the target condition.
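The two entropy-based designs above can be sketched together in Python. The sketch uses the standard definitions (information gain as the first information entropy minus the conditional entropy, and information gain ratio as the gain divided by the entropy of the feature information itself). The function names and sample data are hypothetical, and returning 0.0 when the second information entropy is zero is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter


def entropy(items):
    """Information entropy (base 2) of the empirical distribution of `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def conditional_entropy(values, labels):
    """Entropy of the cleaning identifiers given the feature information."""
    n = len(labels)
    total = 0.0
    for cls in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == cls]
        total += len(subset) / n * entropy(subset)
    return total


def information_gain(values, labels):
    # first information entropy minus conditional entropy
    return entropy(labels) - conditional_entropy(values, labels)


def information_gain_ratio(values, labels):
    second_entropy = entropy(values)  # entropy of the feature information itself
    if second_entropy == 0.0:
        return 0.0  # assumption: a feature with a single class carries no information
    return information_gain(values, labels) / second_entropy


# Hypothetical feature information and cleaning identifiers (1 = cleaning)
values = ["ta", "ta", "tb", "tb", "tc", "tc"]
labels = [1, 0, 1, 0, 1, 1]
print(information_gain(values, labels), information_gain_ratio(values, labels))
```

A larger information gain or information gain ratio indicates a higher judgment accuracy for the cache feature, so either value can be used to rank the cache features.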

In one possible design, the accuracy determining module 402 is specifically configured to: determine, according to the cleaning identifiers corresponding to the plurality of cache data, the cleaning-identifier probability distribution for each class of feature information under the first cache feature; and determine the minimum Gini coefficient of the first cache feature according to the cleaning-identifier probability distribution and the feature information of the plurality of cache data under the first cache feature, to indicate the judgment accuracy under the target condition.

In a possible design, the obtaining module 401 is specifically configured to: obtaining a plurality of cache data belonging to the target cache category, and determining feature information of the plurality of cache data under the plurality of cache features respectively; determining a sub-cleaning identifier of first cache data under a first cache characteristic according to characteristic information of the first cache data under the first cache characteristic and a preset identifier processing rule corresponding to the first cache characteristic, so as to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, where the first cache data is any cache data in the plurality of cache data, the first cache characteristic is any cache characteristic in the plurality of cache characteristics, the preset identifier processing rule is a processing rule for judging a cleaning category of the cache data based on the characteristic information, and the sub-cleaning identifier of the first cache data under the first cache characteristic is used for indicating the cleaning category corresponding to the first cache data under the first cache characteristic; and determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data so as to obtain the cleaning identifiers corresponding to the plurality of cache data respectively.

In a possible design, the obtaining module 401 is specifically configured to: if the number of target sub-cleaning identifiers among the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determine that the cleaning identifier corresponding to the first cache data is the target sub-cleaning identifier, where the target sub-cleaning identifier is used to indicate one of cleaning or no cleaning; or determine the sub-cleaning identifier with the largest proportion among the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data.
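The derivation of a cleaning identifier from sub-cleaning identifiers can be sketched as follows. The preset identifier processing rules, their thresholds and the feature names in this Python sketch are hypothetical. In the preset-number variant, returning the opposite identifier when the count does not exceed the preset number is an assumption, since the text only specifies the case in which it does.

```python
from collections import Counter

CLEAN, KEEP = "clean", "keep"

# Hypothetical preset identifier processing rules, one per cache feature
RULES = {
    "size_mb": lambda v: CLEAN if v >= 50 else KEEP,
    "idle_days": lambda v: CLEAN if v >= 7 else KEEP,
    "use_count": lambda v: CLEAN if v <= 3 else KEEP,
}


def sub_cleaning_identifiers(features):
    """One sub-cleaning identifier per cache feature, in the order of RULES."""
    return [RULES[name](features[name]) for name in RULES]


def by_preset_number(sub_ids, target=CLEAN, preset_number=1):
    """Use the target identifier when it occurs more than `preset_number` times."""
    other = KEEP if target == CLEAN else CLEAN
    return target if sub_ids.count(target) > preset_number else other


def by_largest_proportion(sub_ids):
    """Use the sub-cleaning identifier with the largest proportion."""
    return Counter(sub_ids).most_common(1)[0][0]


features = {"size_mb": 80, "idle_days": 10, "use_count": 20}
sub_ids = sub_cleaning_identifiers(features)
print(sub_ids, by_preset_number(sub_ids), by_largest_proportion(sub_ids))
```

With an odd number of cache features the largest-proportion variant never ties for two identifiers; with an even number, a tie-breaking rule would have to be chosen.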

In one possible design, the model building module 403 is specifically configured to: determine a second cache feature corresponding to the maximum judgment accuracy according to the judgment accuracy under the plurality of situations; delete, from the training sample set, the second cache feature, the feature information corresponding to the second cache feature, and second cache data, and return to the step of determining the judgment accuracy under the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, until the judgment accuracy under the plurality of situations is less than the preset accuracy or only one cache feature remains in the sample information, where the second cache data is the cache data judged to be cleaned when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data; and construct a cache data cleaning model according to the order in which the plurality of cache features are deleted from the training sample set.

It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 5, reference may be made to the description of the method embodiment corresponding to fig. 1 to fig. 3, and details are not repeated here.

In the above device, the cleaning category of the cache data in the training sample set is judged by taking each of a plurality of preset cache features as the judgment standard, the judgment accuracy of each cache feature as the judgment standard is determined, and a decision tree that judges the cleaning category of the cache data by taking the plurality of cache features as judgment standards in sequence is constructed based on the judgment accuracy corresponding to the plurality of cache features. The order in which the cache features are used for judgment is thus determined in a relatively simple manner, and the operation speed is high. In the decision tree, a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy, and the cache feature with low judgment accuracy is connected to the no-cleaning branch of the cache feature with high judgment accuracy. This is equivalent to a decision strategy in which the cache feature with high judgment accuracy is used first to determine whether the cache data needs to be cleaned, and the cache feature with low judgment accuracy is used afterwards. Accurate judgment of whether the cache data needs to be cleaned can therefore be realized, cache data useful to the user can be retained, and the cache data can be cleaned more finely than by cleaning all of the cache data.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a data cleansing device according to an embodiment of the present disclosure, where the data cleansing device may be a mobile phone, a computer, or the like as mentioned above; as shown in fig. 6, the apparatus 50 includes:

a cleaning module 501, configured to clean the cache data corresponding to the target cache category through a cache data cleaning model when it is detected that a total amount of the cache data corresponding to the target cache category is greater than a preset data amount, or that a ratio of the total amount of the cache data corresponding to the target cache category to a total cache space corresponding to the target cache category is greater than a preset ratio threshold, where the cache data cleaning model is constructed by the data cleaning model construction method in the foregoing method embodiment.

It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 6, reference may be made to the description of the method embodiment corresponding to fig. 4, and details are not described here again.

In the device, when detecting that the total amount of cache data corresponding to a target cache category is greater than a preset data amount, or the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model, wherein the cache feature with high judgment accuracy is a parent node of the cache feature with low judgment accuracy, which is equivalent to preferentially utilizing the cache feature with high judgment accuracy to judge whether the cache data needs to be cleaned, so that accurate judgment on whether the cache data needs to be cleaned can be realized; in addition, each cache data corresponding to the target cache category is judged whether to be cleaned or not, so that the cached data can be cleaned finely.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present application, where the computer device 60 includes a processor 601 and a memory 602. The processor 601 is connected to the memory 602, for example, the processor 601 may be connected to the memory 602 through a bus.

The processor 601 is configured to enable the computer device 60 to perform the respective functions in the methods of fig. 1-3 or the method of fig. 4. The processor 601 may be a Central Processing Unit (CPU), a Network Processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.

The memory 602 is used for storing program codes and the like. The memory 602 may include volatile memory (VM), such as random access memory (RAM); the memory 602 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 602 may also include a combination of the above kinds of memory.

In some possible cases, the processor 601 may call the program code to perform the following operations:

In other possible cases, the processor 601 may call the program code to perform the following:

and when detecting that the total amount of the cache data corresponding to the target cache category is larger than a preset data amount, or the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is larger than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model, wherein the cache data cleaning model is constructed through the method embodiment.

It should be noted that, the implementation of each operation may also correspond to the corresponding description with reference to the above method embodiment; the processor 601 may also cooperate with other functional hardware to perform other operations in the above-described method embodiments.

Embodiments of the present application also provide a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to execute the method according to the foregoing embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims

1. A data cleaning model construction method is characterized by comprising the following steps:

acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a cleaning identifier and a plurality of feature information, the feature information is respectively corresponding to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category corresponding to each cache data;

determining the judgment accuracy under a target situation based on the feature information of the plurality of cache data under a first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, wherein the first cache feature is any cache feature in the plurality of cache features, and the target situation is a situation of judging the cleaning category corresponding to each of the plurality of cache data by taking the first cache feature as a judgment standard;

constructing a cache data cleaning model according to the judgment accuracy under a plurality of situations, wherein the plurality of situations are situations in which the plurality of cache features are respectively used as judgment standards to judge the cleaning categories corresponding to the plurality of cache data, and the cache data cleaning model is a decision tree which uses the plurality of cache features as judgment standards in sequence to judge the cleaning category of the cache data, wherein a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy in the decision tree, the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, and the first branch is a branch whose judgment result is no cleaning.

2. The method of claim 1, wherein the determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data comprises:

determining a first information entropy related to a cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data;

determining conditional entropy of the first cache characteristic according to characteristic information of the plurality of cache data under the first cache characteristic and cleaning identifiers corresponding to the plurality of cache data;

and calculating the information gain of the first cache characteristic according to the first information entropy and the conditional entropy so as to indicate the judgment accuracy under the target condition.

3. The method of claim 1, wherein the determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data comprises:

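determining a first information entropy related to a cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data;

determining conditional entropy of the first cache characteristic according to characteristic information of the plurality of cache data under the first cache characteristic and cleaning identifiers corresponding to the plurality of cache data;

calculating the information gain of the first cache characteristic according to the first information entropy and the conditional entropy;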

determining the second information entropy related to the first cache characteristic according to the characteristic information of the plurality of cache data under the first cache characteristic;

and calculating an information gain ratio of the first cache characteristic according to the information gain and the second information entropy so as to indicate the judgment accuracy under the target condition.

4. The method of claim 1, wherein the determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the corresponding cleaning identifier of each of the plurality of cache data comprises:

respectively determining the probability distribution of the cleaning identifiers on various types of feature information under the first cache feature according to the cleaning identifiers corresponding to the plurality of cache data;

and determining the minimum Gini index of the first cache characteristic according to the cleaning identifier probability distribution and the characteristic information of the plurality of cache data under the first cache characteristic, so as to indicate the judgment accuracy under the target situation.
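One possible reading of claim 4 is the CART-style binary split, in which each value of the feature divides the samples into "equals this value" and "all other values", and the smallest weighted Gini index over those splits is kept. The sketch below assumes that reading; the function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of cleaning identifiers."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def min_gini_index(feature_values, labels):
    """Return (value, score) for the binary split with the smallest Gini index."""
    total = len(labels)
    best = None
    for value in set(feature_values):
        inside = [l for v, l in zip(feature_values, labels) if v == value]
        outside = [l for v, l in zip(feature_values, labels) if v != value]
        score = len(inside) / total * gini(inside) + len(outside) / total * gini(outside)
        if best is None or score < best[1]:
            best = (value, score)
    return best


# Hypothetical toy sample: 1 = clean, 0 = do not clean.
file_kind = ["img", "img", "log", "tmp"]
cleaning_ids = [0, 0, 1, 1]

best_value, best_score = min_gini_index(file_kind, cleaning_ids)
```

Here a smaller Gini index corresponds to a purer split, so under this reading a lower minimum Gini index indicates a higher judgment accuracy.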

5. The method according to any one of claims 1 to 4, wherein the obtaining of the training sample set corresponding to the target cache category includes:

obtaining a plurality of cache data belonging to the target cache category, and determining feature information of the plurality of cache data under the plurality of cache features respectively;

determining a sub-cleaning identifier of first cache data under a first cache characteristic according to characteristic information of the first cache data under the first cache characteristic and a preset identifier processing rule corresponding to the first cache characteristic to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, wherein the first cache data is any cache data in the plurality of cache data, the first cache characteristic is any cache characteristic in the plurality of cache characteristics, the preset identifier processing rule is a processing rule for judging a cleaning category of the cache data based on the characteristic information, and the sub-cleaning identifier of the first cache data under the first cache characteristic is used for indicating the cleaning category corresponding to the first cache data under the first cache characteristic;

and determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data so as to obtain the cleaning identifiers corresponding to the plurality of cache data respectively.

6. The method of claim 5, wherein the determining the cleaning identifier corresponding to the first cache data according to the plurality of sub cleaning identifiers corresponding to the first cache data comprises:

if the number of target sub-cleaning identifiers in the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determining that the cleaning identifier corresponding to the first cache data is the target sub-cleaning identifier, wherein the target sub-cleaning identifier is used for indicating one of cleaning or non-cleaning; or,

determining the sub-cleaning identifier that accounts for the largest proportion of the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data.
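The labeling procedure of claims 5 and 6 can be sketched as follows. The rule table, thresholds, feature names, and the "clean"/"keep" labels are all hypothetical; the claims only require that each cache feature has a preset identifier processing rule. The behavior when the count does not exceed the preset number is also an assumption, since the claim does not specify it.

```python
from collections import Counter

# Hypothetical preset identifier processing rules, one per cache feature.
RULES = {
    "days_since_access": lambda days: "clean" if days > 30 else "keep",
    "size_mb": lambda size: "clean" if size > 100 else "keep",
    "access_count": lambda count: "clean" if count < 3 else "keep",
}


def sub_identifiers(features):
    """Apply each feature's rule to get one sub-cleaning identifier per feature."""
    return [RULES[name](value) for name, value in features.items()]


def label_by_count(subs, target="clean", preset_number=1):
    """First alternative: use the target label if it occurs more than preset_number times."""
    if subs.count(target) > preset_number:
        return target
    # Assumption: otherwise fall back to the opposite label.
    return "keep" if target == "clean" else "clean"


def label_by_majority(subs):
    """Second alternative: use the sub-cleaning identifier with the largest share."""
    return Counter(subs).most_common(1)[0][0]


cache_item = {"days_since_access": 45, "size_mb": 20, "access_count": 1}
subs = sub_identifiers(cache_item)
label_a = label_by_count(subs)
label_b = label_by_majority(subs)
```

For this item two of the three rules vote for cleaning, so both alternatives assign the "clean" identifier.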

7. The method according to any one of claims 1 to 4, wherein the constructing a cache data cleaning model according to the judgment accuracy in a plurality of situations comprises:

determining a second cache characteristic corresponding to the maximum judgment accuracy according to the judgment accuracy under the multiple situations;

deleting, from the training sample set, the second cache feature, the feature information corresponding to the second cache feature, and second cache data, and returning to the step of determining the judgment accuracy in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, until the judgment accuracy in the plurality of situations is less than the preset accuracy or only one cache feature remains in the sample information, wherein the second cache data is cache data whose cleaning identifier indicates cleaning when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data;

and constructing the cache data cleaning model according to the sequence of the plurality of cache features deleted in the training sample set.
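The iterative selection of claim 7 can be sketched as a loop. This sketch assumes information gain as the judgment accuracy, categorical feature values, and a hypothetical table `clean_values` that states which values of each feature are judged as "clean"; none of these names come from the application.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature):
    """Information gain of one feature over the remaining samples."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row["label"])
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row["label"] for row in rows]) - cond


def build_feature_order(rows, features, clean_values, min_accuracy=0.01):
    """Return the cache features in the order in which they are deleted."""
    remaining, order = list(features), []
    while rows and len(remaining) > 1:
        scores = {f: information_gain(rows, f) for f in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] < min_accuracy:
            break
        order.append(best)
        remaining.remove(best)  # the deleted feature is no longer scored
        # Drop the samples that this feature already judges as "clean".
        rows = [r for r in rows if r[best] not in clean_values[best]]
    return order


# Hypothetical training sample set.
samples = [
    {"stale": "yes", "large": "no", "rare": "no", "label": "clean"},
    {"stale": "yes", "large": "yes", "rare": "no", "label": "clean"},
    {"stale": "no", "large": "yes", "rare": "no", "label": "clean"},
    {"stale": "no", "large": "no", "rare": "yes", "label": "keep"},
    {"stale": "no", "large": "no", "rare": "no", "label": "keep"},
    {"stale": "no", "large": "yes", "rare": "yes", "label": "clean"},
]
clean_values = {"stale": {"yes"}, "large": {"yes"}, "rare": {"yes"}}
order = build_feature_order(samples, ["stale", "large", "rare"], clean_values)
```

On this toy set the feature `large` has the highest gain in the first round and `stale` in the second, after which only one feature remains and the loop stops.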

8. A data cleaning method, comprising:

when detecting that the total amount of cache data corresponding to a target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category by using a cache data cleaning model, wherein the cache data cleaning model is constructed by using the method according to any one of claims 1 to 7.
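The two trigger conditions of claim 8 amount to a simple check. The sketch below is illustrative only; the function name and the default thresholds (500 MB and 80%) are assumptions, not values given in the application.

```python
def should_clean(total_cached_bytes, cache_capacity_bytes,
                 max_bytes=500 * 1024 * 1024, max_ratio=0.8):
    """Return True if either trigger condition for cleaning a cache category holds."""
    # Condition 1: the total amount of cache data exceeds the preset data amount.
    if total_cached_bytes > max_bytes:
        return True
    # Condition 2: the share of the category's cache space in use exceeds the preset ratio.
    if cache_capacity_bytes > 0:
        return total_cached_bytes / cache_capacity_bytes > max_ratio
    return False


MB = 1024 * 1024
over_amount = should_clean(600 * MB, 10_000 * MB)   # absolute amount too large
over_ratio = should_clean(90 * MB, 100 * MB)        # 90% of the space in use
within_limits = should_clean(50 * MB, 100 * MB)     # neither condition holds
```

When the check returns true, the cache data of the target cache category would then be passed to the cache data cleaning model.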

9. An apparatus for constructing a data cleaning model, comprising:

an acquisition module, configured to acquire a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a plurality of feature information and a cleaning identifier, the plurality of feature information respectively corresponds to a plurality of preset cache features, and the cleaning identifier is used for indicating a cleaning category;

an accuracy judgment module, configured to determine the judgment accuracy in a target situation based on the feature information of the plurality of cache data under a first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, wherein the first cache feature is any cache feature in the plurality of cache features, and the target situation is the situation in which the cleaning category corresponding to each of the plurality of cache data is judged by taking the first cache feature as a judgment standard;

a model building module, configured to build a cache data cleaning model according to the judgment accuracy in a plurality of situations, wherein the plurality of situations are situations in which the cleaning categories corresponding to the plurality of cache data are judged by taking the plurality of cache features as judgment standards respectively, and the cache data cleaning model is a decision tree that judges the cleaning category of cache data by taking the plurality of cache features as judgment standards in sequence, wherein a cache feature with a higher judgment accuracy is a parent node of a cache feature with a lower judgment accuracy in the decision tree, the cache feature with the lower judgment accuracy is connected to a first branch of the cache feature with the higher judgment accuracy, and the first branch is the branch whose judgment result is not cleaning.
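The decision tree described in claim 9 has a chain shape: each node tests one cache feature, the "clean" result is a leaf, and the "not clean" branch leads to the next, less accurate feature. A minimal sketch of walking such a chain follows; the feature names, the `clean_values` table, and the "keep" default at the end of the chain are assumptions made for illustration.

```python
def classify(features, feature_order, clean_values):
    """Walk the chain of cache features from highest to lowest judgment accuracy."""
    for name in feature_order:
        if features[name] in clean_values[name]:
            return "clean"   # leaf: this feature judges the data as to be cleaned
        # otherwise follow the "not clean" branch to the next feature
    return "keep"            # assumption: no feature judged the data as to be cleaned


# Hypothetical model: features ordered by decreasing judgment accuracy.
feature_order = ["large", "stale"]
clean_values = {"large": {"yes"}, "stale": {"yes"}}

first = classify({"large": "yes", "stale": "no"}, feature_order, clean_values)
second = classify({"large": "no", "stale": "yes"}, feature_order, clean_values)
third = classify({"large": "no", "stale": "no"}, feature_order, clean_values)
```

Placing the most accurate feature at the root means most cache data is resolved at the first node, and only data judged "not clean" is passed to the less accurate features.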

10. A data cleaning apparatus, comprising:

a cleaning module, configured to clean, when it is detected that a total amount of cache data corresponding to a target cache category is greater than a preset data amount, or that a ratio of the total amount of cache data corresponding to the target cache category to a total cache space corresponding to the target cache category is greater than a preset ratio threshold, the cache data corresponding to the target cache category by using a cache data cleaning model, wherein the cache data cleaning model is constructed by using the method according to any one of claims 1 to 7.

11. A computer device comprising memory and one or more processors to execute one or more computer programs stored in the memory, the one or more processors, when executing the one or more computer programs, causing the computer device to implement the method of any one of claims 1-7 or claim 8.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-7 or claim 8.