CN113609031B

CN113609031B - Data cleaning model construction method, data cleaning method, related equipment and medium

Info

Publication number: CN113609031B
Application number: CN202110774567.4A
Authority: CN
Inventors: 付玉鑫
Original assignee: Shenzhen Chenbei Technology Co Ltd
Current assignee: Shenzhen Chenbei Technology Co Ltd
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2024-03-29
Anticipated expiration: 2041-07-08
Also published as: CN113609031A

Abstract

The application provides a data cleaning model construction method, a data cleaning method, related equipment and media, wherein the data cleaning model construction method comprises the following steps: acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, and the sample information of each cache data comprises a plurality of characteristic information and a cleaning identifier; determining the judgment accuracy under the target situation based on the characteristic information of the plurality of cache data under the first cache characteristic and the cleaning identification corresponding to each of the plurality of cache data, wherein the target situation refers to the situation that the first cache characteristic is taken as a judgment standard to judge the cleaning category corresponding to each of the plurality of cache data; and constructing a cache data cleaning model according to the judging accuracy under a plurality of situations, wherein the plurality of situations are situations in which a plurality of cache characteristics are respectively used as judging standards, and the cleaning categories corresponding to the plurality of cache data are judged. The method and the device can accurately judge whether the cache data need to be cleaned.

Description

Data cleaning model construction method, data cleaning method, related equipment and medium

Technical Field

The present disclosure relates to the field of cache data processing, and in particular, to a data cleaning model construction method, a data cleaning method, and related devices and media.

Background

Cache data are temporary files that application clients (such as WeChat or a browser) generate from user behavior and store while in use, so that the client can respond quickly when the user uses it again.

When too much cache data accumulates, it occupies a large share of the cache space, and the terminal on which the application client is installed may lag. At present, most terminals handle this by directly cleaning all cache data in the cache space. Although this resolves the lag, frequently used cache data must then be cached again, so the client cannot respond quickly to the user, which is inconvenient.

Disclosure of Invention

The application provides a data cleaning model construction method, a data cleaning method, related equipment and media, and aims to solve the technical problem of inconvenient use caused by cleaning all cache data in the prior art.

In a first aspect, a method for constructing a data cleansing model is provided, the method comprising the following steps:

Acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a cleaning identifier and a plurality of characteristic information, the characteristic information is respectively corresponding to a plurality of preset cache characteristics, and the cleaning identifier is used for indicating a cleaning category corresponding to each cache data;

determining the judgment accuracy under a target situation based on the characteristic information of the plurality of cache data under a first cache characteristic and the cleaning identification corresponding to each of the plurality of cache data, wherein the first cache characteristic is any cache characteristic in the plurality of cache characteristics, and the target situation is a situation that the cleaning category corresponding to each of the plurality of cache data is judged by taking the first cache characteristic as a judgment standard;

constructing a cache data cleaning model according to the judgment accuracy under a plurality of situations, wherein the plurality of situations are situations in which the plurality of cache features are respectively used as judgment standards to judge the cleaning categories corresponding to the plurality of cache data, the cache data cleaning model is a decision tree that judges the cleaning category of cache data by using the plurality of cache features in sequence as judgment standards, a cache feature with high judgment accuracy is a parent node of a cache feature with low judgment accuracy in the decision tree, the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, and the first branch is the branch whose judgment result is not cleaning.

In this technical solution, the cleaning categories of the cache data in the training sample set are judged by using each of a plurality of preset cache features as the judgment standard, and the judgment accuracy of each cache feature when used as the judgment standard is determined. A decision tree that judges the cleaning category of cache data by using the plurality of cache features in sequence is then constructed based on the judgment accuracy corresponding to each cache feature. The order in which the cache features are applied is thus determined in a simple manner, and the operation is fast. Because a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy in the decision tree, and the cache feature with low judgment accuracy is connected to the branch of the high-accuracy cache feature whose judgment result is not cleaning, the decision strategy is equivalent to first judging whether cache data needs to be cleaned by using the cache feature with high judgment accuracy, and then, for data not judged to need cleaning, judging again by using the cache feature with low judgment accuracy. Whether cache data needs to be cleaned can therefore be judged accurately, cache data useful to the user can be retained, and, compared with cleaning all cache data, fine-grained cleaning of cache data is achieved.
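The chained structure described above can be sketched as follows. This is only a minimal illustration: the feature names, thresholds, and the labels "clean" and "keep" are hypothetical and do not come from the application.

```python
def classify(item, ordered_rules):
    """Walk the chain of features, highest judgment accuracy first."""
    for feature, judges_clean in ordered_rules:
        if judges_clean(item[feature]):
            return "clean"  # a "clean" verdict ends the walk
        # otherwise follow the "not cleaning" branch to the next feature
    return "keep"  # no feature judged the item as needing cleaning


# Hypothetical ordering and rules (assumptions for illustration only).
ordered_rules = [
    ("days_since_last_use", lambda days: days >= 7),
    ("size_mb", lambda size: size >= 200),
]

old_item = {"days_since_last_use": 30, "size_mb": 5}
big_item = {"days_since_last_use": 1, "size_mb": 500}
hot_item = {"days_since_last_use": 1, "size_mb": 5}
```

Only items that pass every feature along the "not cleaning" branch are retained, which is what lets lower-accuracy features refine the verdict of higher-accuracy ones.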

With reference to the first aspect, in one possible implementation manner, determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data includes: determining a first information entropy related to the cleaning identification according to the cleaning identification corresponding to each of the plurality of cache data; determining the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data; and calculating the information gain of the first cache feature according to the first information entropy and the conditional entropy to indicate the judgment accuracy under the target condition. The accuracy of each cache characteristic as a judgment standard in judging the cleaning type of the cache data is measured by utilizing the information gain, the calculation mode is simple, and the speed of constructing the cache data cleaning model is improved.
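The three quantities named in this implementation (first information entropy, conditional entropy, information gain) can be sketched as below. The sketch assumes the usual Shannon definitions with base-2 logarithms; the function names and the sample labels are illustrative and not taken from the application.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of the cleaning identifiers."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """Entropy of the identifiers within each feature value, weighted by its share."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(feature_values, labels):
    """First information entropy minus the conditional entropy of the feature."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


flags = ["clean", "clean", "keep", "keep"]
separating = ["large", "large", "small", "small"]  # splits the flags perfectly
uninformative = ["a", "b", "a", "b"]               # tells nothing about the flags
```

A feature that separates the cleaning identifiers perfectly attains the full entropy of the identifiers as its gain, while an uninformative feature attains a gain of zero.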

With reference to the first aspect, in one possible implementation manner, determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data includes: determining a first information entropy related to the cleaning identification according to the cleaning identification corresponding to each of the plurality of cache data; determining the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data; calculating to obtain the information gain of the first cache feature according to the first information entropy and the conditional entropy; determining the second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature; and calculating an information gain ratio of the first cache characteristic according to the information gain and the second information entropy so as to indicate the judgment accuracy under the target condition. By measuring the accuracy of each cache feature as a judgment standard when judging the cleaning class of the cache data by using the information gain ratio, the accuracy of each cache feature can be measured more accurately.
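For reference, the quantities in this implementation can be written out using the standard textbook definitions. The notation is an assumption made here for illustration: D is the training sample set, A is the first cache feature, p_k is the share of samples whose cleaning identifier is k, and D_v is the subset of samples whose characteristic information under A is v. The application's own notation may differ.

```latex
\begin{aligned}
H(D) &= -\sum_{k} p_k \log_2 p_k
  && \text{(first information entropy)} \\
H(D \mid A) &= \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)
  && \text{(conditional entropy)} \\
g(D, A) &= H(D) - H(D \mid A)
  && \text{(information gain)} \\
H_A(D) &= -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}
  && \text{(second information entropy)} \\
g_R(D, A) &= \frac{g(D, A)}{H_A(D)}
  && \text{(information gain ratio)}
\end{aligned}
```

Dividing by H_A(D) penalizes features that take many distinct values; the ratio is undefined when A takes a single value, since H_A(D) is then zero.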

With reference to the first aspect, in one possible implementation manner, determining the judgment accuracy under the target situation based on the characteristic information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data includes: determining, according to the cleaning identifiers corresponding to the plurality of cache data, a cleaning identifier probability distribution over the various characteristic information under the first cache feature; and determining the minimum Gini index of the first cache feature according to the cleaning identifier probability distribution and the characteristic information of the plurality of cache data under the first cache feature, to indicate the judgment accuracy under the target situation. Measuring the accuracy of each cache feature as a judgment standard by the minimum Gini index allows the accuracy of each cache feature to be measured more quickly.
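A minimal sketch of such a measure follows. It assumes the standard CART-style definition: for each value of the feature, the samples are split into "equals this value" and "everything else", and the smallest weighted Gini impurity over these splits is taken. This text gives no formula, so the definition and all names are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of cleaning identifiers."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def min_gini_index(feature_values, labels):
    """Smallest weighted Gini impurity over all 'value vs. rest' splits."""
    total = len(labels)
    scores = []
    for v in set(feature_values):
        left = [l for f, l in zip(feature_values, labels) if f == v]
        right = [l for f, l in zip(feature_values, labels) if f != v]
        scores.append(len(left) / total * gini(left) + len(right) / total * gini(right))
    return min(scores)


flags = ["clean", "clean", "keep", "keep"]
separating = ["large", "large", "small", "small"]
uninformative = ["a", "b", "a", "b"]
```

Under this definition a smaller value means purer groups, so a feature with a lower minimum Gini index would correspond to a higher judgment accuracy.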

With reference to the first aspect, in one possible implementation manner, the obtaining a training sample set corresponding to the target cache class includes: acquiring a plurality of cache data belonging to the target cache category, and determining characteristic information of the plurality of cache data under the plurality of cache characteristics respectively; determining a sub-cleaning identifier of the first cache data under the first cache feature according to feature information of the first cache data under the first cache feature and a preset identifier processing rule corresponding to the first cache feature to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, wherein the first cache data is any cache data in the plurality of cache data, the first cache feature is any cache feature in the plurality of cache features, the preset identifier processing rule refers to a processing rule for judging a cleaning type of the cache data based on the feature information, and the sub-cleaning identifier of the first cache data under the first cache feature is used for indicating the cleaning type corresponding to the first cache data under the first cache feature; and determining the cleaning identification corresponding to the first cache data according to the plurality of sub-cleaning identifications corresponding to the first cache data so as to obtain the cleaning identification corresponding to each of the plurality of cache data. By setting the identification processing rule for each cache feature, the cleaning category of the cache data can be measured from multiple dimensions, and therefore accurate calibration of the cleaning category of the cache data can be achieved.

With reference to the first aspect, in one possible implementation manner, determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data includes: if the number of target sub-cleaning identifiers among the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determining that the cleaning identifier corresponding to the first cache data is the target sub-cleaning identifier, wherein the target sub-cleaning identifier indicates one of cleaning or not cleaning; or determining the sub-cleaning identifier that accounts for the largest proportion of the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data. Because the cleaning identifier of the cache data is determined from the number or proportion of the sub-cleaning identifiers, the cleaning category it indicates is as close as possible to the real situation of the cache data, which benefits the accuracy of model construction.
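The two alternatives for combining sub-cleaning identifiers can be sketched as below. The labels "clean" and "keep", the default preset number, and the fallback to the opposite label when the count is not exceeded are assumptions made for illustration; the text does not specify the fallback.

```python
from collections import Counter


def label_by_count(sub_flags, target="clean", other="keep", preset_count=3):
    """Use the target identifier if it occurs more than preset_count times.

    Falling back to `other` is an assumption, not stated in the source.
    """
    return target if sub_flags.count(target) > preset_count else other


def label_by_majority(sub_flags):
    """Use the sub-cleaning identifier with the largest proportion."""
    return Counter(sub_flags).most_common(1)[0][0]


# One sub-cleaning identifier per cache feature (six hypothetical features).
sub_flags = ["clean", "clean", "keep", "clean", "keep", "clean"]
```

With six features and a preset number of 3, the count rule and the majority rule agree on this example; they can differ when the preset number is not half the feature count.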

With reference to the first aspect, in one possible implementation manner, constructing the cache data cleaning model according to the judgment accuracy under the plurality of situations includes: determining, according to the judgment accuracy under the plurality of situations, a second cache feature corresponding to the maximum judgment accuracy; deleting the second cache feature, the characteristic information corresponding to the second cache feature, and second cache data from the training sample set, and returning to the step of determining the judgment accuracy under the target situation based on the characteristic information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, until the judgment accuracy under each of the plurality of situations is less than a preset accuracy or only one cache feature remains in the sample information, wherein the second cache data is the cache data indicated as to be cleaned when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data; and constructing the cache data cleaning model according to the order in which the plurality of cache features are deleted from the training sample set. Through multiple rounds of iterative judgment, the order in which the cache features are applied can be determined more accurately, which improves the accuracy of the decision tree.
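The iterative procedure can be sketched as follows, under several labeled assumptions: information gain stands in for the judgment accuracy, a feature value is taken to "judge clean" when most samples carrying it are flagged clean, and the names, threshold, and toy data are hypothetical. What happens to features left over at the stopping point is not specified here, so the sketch simply returns them separately.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature):
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r["flag"])
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r["flag"] for r in rows]) - cond


def order_features(rows, features, min_gain=0.01):
    """Return (deletion order, features never selected)."""
    rows, remaining, order = list(rows), list(features), []
    while rows and len(remaining) > 1:
        gains = {f: information_gain(rows, f) for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:  # every feature is below the preset accuracy
            break
        order.append(best)
        remaining.remove(best)
        # Assumption: a value "judges clean" if most samples with it are flagged clean.
        votes = defaultdict(Counter)
        for r in rows:
            votes[r[best]][r["flag"]] += 1
        clean_values = {v for v, c in votes.items() if c.most_common(1)[0][0] == "clean"}
        # Drop the samples the selected feature already judges as "clean".
        rows = [r for r in rows if r[best] not in clean_values]
    return order, remaining


raw = [
    ("large", "old", "pic", "clean"), ("large", "new", "doc", "clean"),
    ("large", "new", "doc", "clean"), ("large", "new", "pic", "clean"),
    ("small", "old", "pic", "clean"), ("small", "old", "doc", "clean"),
    ("small", "new", "pic", "keep"), ("small", "new", "doc", "keep"),
    ("small", "new", "pic", "keep"),
]
rows = [dict(zip(("size", "age", "kind", "flag"), t)) for t in raw]
order, leftover = order_features(rows, ["size", "age", "kind"])
```

On this toy data the size feature is selected first, the samples it judges as clean are removed, and the age feature then separates the remaining samples, so the deletion order is size followed by age.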

In a second aspect, a data cleaning method is provided, including the steps of:

when it is detected that the total amount of cache data corresponding to the target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model, wherein the cache data cleaning model is constructed through the method of the first aspect.

Because the cache data cleaning model constructed by the method of the first aspect first judges whether cache data needs to be cleaned by using the cache feature with high judgment accuracy and then judges by using the cache feature with low judgment accuracy, it can accurately judge whether cache data needs to be cleaned. Using this decision tree to judge the cleaning category of cache data therefore retains the cache data that is useful to the user and, compared with cleaning all cache data, achieves fine-grained cleaning of cache data.
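The trigger condition of the second aspect can be sketched as a simple predicate. The parameter names and the example thresholds are assumptions for illustration; the application only speaks of a preset data amount and a preset ratio threshold.

```python
def should_clean(total_cache_bytes, cache_space_bytes, max_bytes, max_ratio):
    """True when either the absolute amount or the occupancy ratio is exceeded."""
    if cache_space_bytes <= 0:
        raise ValueError("cache_space_bytes must be positive")
    over_amount = total_cache_bytes > max_bytes
    over_ratio = total_cache_bytes / cache_space_bytes > max_ratio
    return over_amount or over_ratio


MB = 1024 * 1024
# Hypothetical thresholds: 500 MB absolute, 80% occupancy.
by_amount = should_clean(600 * MB, 2000 * MB, max_bytes=500 * MB, max_ratio=0.8)
by_ratio = should_clean(90 * MB, 100 * MB, max_bytes=500 * MB, max_ratio=0.8)
neither = should_clean(50 * MB, 100 * MB, max_bytes=500 * MB, max_ratio=0.8)
```

Only when the predicate is true would the cache data of the target cache category be passed through the cleaning model.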

In a third aspect, a device for constructing a buffered data cleansing model is provided, including:

an acquisition module, configured to acquire a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a cleaning identifier and a plurality of characteristic information, the plurality of characteristic information respectively corresponds to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category corresponding to each cache data;

The accuracy judging module is used for determining judging accuracy under a target condition based on the characteristic information of the plurality of cache data under a first cache characteristic and the cleaning identification corresponding to each of the plurality of cache data, wherein the first cache characteristic is any cache characteristic in the plurality of cache characteristics, and the target condition is a condition that the first cache characteristic is taken as a judging standard to judge cleaning categories corresponding to each of the plurality of cache data;

the model construction module is used for constructing a cache data cleaning model according to the judging accuracy of a plurality of situations, wherein the situations are the situations that the cache features are used as judging standards to judge cleaning categories corresponding to the cache data respectively, the cache data cleaning model is a decision tree which sequentially uses the cache features as judging standards to judge the cleaning categories of the cache data, the cache features with high judging accuracy are father nodes of the cache features with low judging accuracy in the decision tree, the cache features with low judging accuracy are connected to a first branch of the cache features with high judging accuracy, and the first branch is a branch with a judging result of not cleaning.

In a fourth aspect, a data cleaning apparatus is provided, including:

a cleaning module, configured to clean the cache data corresponding to the target cache category through a cache data cleaning model when it is detected that the total amount of cache data corresponding to the target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, wherein the cache data cleaning model is constructed through the method of the first aspect.

In a fifth aspect, a computer device is provided, comprising a memory and one or more processors configured to execute one or more computer programs stored in the memory, wherein the one or more processors, when executing the one or more computer programs, cause the computer device to implement the data cleaning model construction method of the first aspect or the data cleaning method of the second aspect.

In a sixth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the data cleaning model construction method of the first aspect or the data cleaning method of the second aspect.

The application can realize the following beneficial effects: the method can accurately judge whether the cache data needs to be cleaned or not, retain the cache data which is useful for users, and realize the fine cleaning of the cache data.

Drawings

FIG. 1 is a schematic flow chart of a data cleaning model construction method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a decision tree according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating another method for constructing a data cleansing model according to an embodiment of the present disclosure;

FIG. 4 is a flow chart of a data cleaning method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a device for constructing a data cleansing model according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data cleaning device according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The technical scheme of the embodiment of the application is suitable for a scene of processing the cache, wherein the processing of the cache specifically refers to cleaning of cache data stored in a cache space of application equipment, and the application equipment can be equipment for acquiring and responding to user behaviors to realize interaction with a user, such as a mobile phone, a tablet, a notebook computer and the like. Specifically, the cached data may be a cookie file generated by a user accessing the network using a web browser, a file corresponding to a picture or document generated by a user previewing the picture, previewing a document, or the like, an installation/patch file related to an application generated by a user updating the application, or the like, and is not limited to the description herein.

Because the cache data is stored in the cache space of the application device as a temporary file, the cache space of the application device is limited, and when the cache data in the cache space is excessive, the operation speed of the application device can be influenced to a certain extent. Therefore, for the cache data stored in the cache space, the application provides a data cleaning model construction method and a data cleaning method, and the cache data cleaning model is constructed and applied to the application device to clean the cache data so as to improve the running speed of the application device.

In one possible scenario, the data cleaning model construction method and the data cleaning method of the present application may be implemented in the same device; for example, both may be implemented in the application device. In other possible scenarios, the two methods may be implemented in different devices. For example, the data cleaning method is implemented in the application device to clean the cache data, while the data cleaning model construction method is implemented in another device; after the cache data cleaning model is constructed in the other device, it is migrated to the application device. Migration may mean that, after the other device obtains the cache data cleaning model, the model is stored in the application device manually. Alternatively, migration may mean that, after the other device obtains the cache data cleaning model, the model is stored in the application device through data interaction; for example, the application device may send the other device a request for the cache data cleaning model, and the other device sends the model to the application device in response, so that it is stored in the application device. The application device and the other device may exchange data through wireless communication such as Bluetooth or WiFi, through wired communication, or through a combination of the two; the present application does not limit the manner of data interaction between them. Specifically, the other device may be a PC, a server, or the like.

The cache data cleaning model constructed by the data cleaning model construction method can clean cache data while retaining the cache data that is useful to the user, thereby achieving fine-grained cleaning of the cache data in the application device. Specifically, the cache data cleaning model may be embodied as an automatically running executable file (such as a plug-in or an application client) stored in the application device to automatically clean cache data in the application device.

The following specifically describes the technical scheme of the present application.

Referring to fig. 1, fig. 1 is a schematic flow chart of a data cleaning model construction method according to an embodiment of the present application, where the method may be applied to the aforementioned application device or another device; as shown in fig. 1, the method comprises the steps of:

S101, acquiring a training sample set corresponding to a target cache category, wherein the training sample set corresponding to the target cache category comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a plurality of feature information and a cleaning identifier, the feature information is respectively corresponding to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category.

Here, the target cache category is a category determined according to user behavior, and one type of user behavior corresponds to one cache category. For example, the target cache category may be a category of cookie files generated by a user accessing the network, or may be a category of temporary files generated by a user using various application clients and adapted to various clients. The cache data of different cache categories may have different cache characteristics, wherein one cache characteristic is used to scale and represent the cache data from one dimension to divide the cache data into different categories under one cache characteristic. For example, the cache feature may be cache data size, and by setting m1 data size thresholds, the cache data may be divided into (m1+1) categories, where m1 is greater than or equal to 1 and is a positive integer; the cache characteristic can also be the retention time of the cache data in the cache space, and the cache data can be divided into (m2+1) types by setting m2 retention time lengths, wherein m2 is more than or equal to 1 and is a positive integer; the cache characteristics may also be frequency of use, number of uses, type of cache data, last time used, etc.
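The way m thresholds divide a cache feature into (m+1) categories can be sketched with a binary search over the sorted thresholds. The threshold values and the function name are hypothetical, and placing a value equal to a threshold in the upper category is an assumption the text does not settle.

```python
from bisect import bisect_right


def bucket(value, thresholds):
    """Index (0..m) of the category that `value` falls into, given m thresholds."""
    return bisect_right(sorted(thresholds), value)


# Two hypothetical size thresholds (in megabytes) give three categories.
size_thresholds_mb = [50, 200]
categories = [bucket(v, size_thresholds_mb) for v in (20, 50, 300)]
```

The same helper applies to retention time or any other numeric cache feature by supplying a different list of thresholds.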

In this embodiment of the present application, for the cache data belonging to the target cache class, a plurality of cache features may be preset, and the plurality of cache features are combined to distinguish a cleaning class of the cache data belonging to the target cache class, that is, whether the cache data belonging to the target cache class needs to be cleaned. For one cache data belonging to the target cache category, obtaining the characteristic information of the cache data under each cache characteristic by acquiring the characteristic information of the cache data, so as to obtain a plurality of characteristic information of the cache data; and integrating a plurality of characteristic information of the cache data, determining whether the cache data needs to be cleaned or not, and using a cleaning identifier to indicate whether the cache data needs to be cleaned or not, wherein the cleaning identifier corresponding to the cache data and the characteristic information of the cache data form sample information of the cache data. Specifically, the cleaning mark can be cleaning or not cleaning directly; alternatively, the cleaning flag may be 1 or 0 to indicate cleaning or not cleaning; alternatively, the clean-up identifier may be yes or no to indicate clean-up or no clean-up. The embodiments of the present application are not limited with respect to the specific form of the cleaning mark.

For example, the target cache category is the category of temporary files that are generated when the user uses various application clients and that are adapted to those clients, and the plurality of cache features preset for the target cache category are the size of the cache data, the retention time of the cache data in the cache space (the length of time between the moment the cache data was stored in the cache space and the current time), the use frequency, the number of uses (the number of times the cache data has been acquired from the cache space), the cache data type, and the last time used. Suppose the size of cache data d1 is 20 megabytes, d1 has been stored in the cache space for 1 week, the use frequency of d1 is once per day, the number of uses of d1 is 7, d1 is a picture-type file, and d1 was last used at 10 a.m. on January 1, 2021. If it is determined from this characteristic information that d1 needs to be cleaned, then 20 megabytes, 1 week, once per day, 7 times, picture-type file, 10 a.m. on January 1, 2021, and cleaning are taken as the sample information of cache data d1.
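One possible in-memory representation of the sample information of d1 is sketched below. The class name, field names, and units are assumptions chosen for illustration; the application does not prescribe a data layout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CacheSample:
    """Characteristic information plus the cleaning identifier (hypothetical layout)."""
    size_mb: float
    retention: timedelta
    uses_per_day: float
    use_count: int
    data_type: str
    last_used: datetime
    clean: bool  # cleaning identifier: True means "clean", False means "do not clean"


d1 = CacheSample(
    size_mb=20,
    retention=timedelta(weeks=1),
    uses_per_day=1.0,
    use_count=7,
    data_type="picture",
    last_used=datetime(2021, 1, 1, 10, 0),
    clean=True,
)
```

A training sample set would then be a collection of such records for cache data of the same target cache category.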

The training sample set corresponding to the target cache category can be obtained by acquiring a plurality of characteristic information of each of a plurality of cache data belonging to the target cache category and a cleaning identifier corresponding to each of the plurality of cache data. By presetting a plurality of cache features, the cache data can be measured from a plurality of dimensions, and whether the cache data belonging to the target cache category needs to be cleaned or not can be distinguished by combining the plurality of cache features, so that the accuracy of cache data division can be ensured.

In some possible embodiments, for cache data belonging to the target cache category, each of the plurality of cache features may be used separately to judge whether the cache data needs to be cleaned, and whether the cache data needs to be cleaned is then determined from the results obtained for the individual cache features. Step S101 may include the following steps t1 to t3.

Step t1, obtaining a plurality of cache data belonging to the target cache category, and determining characteristic information of the plurality of cache data under a plurality of cache characteristics respectively.

For how the characteristic information of the plurality of cache data under the plurality of cache features is determined, refer to the description above.

Step t2, determining the sub-cleaning identifier of the first cache data under the first cache feature according to the characteristic information of the first cache data under the first cache feature and a preset identifier processing rule corresponding to the first cache feature, so as to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data.

Here, the first cache data is any one of the plurality of cache data, and the first cache feature is any one of the plurality of cache features. The preset identification processing rule is used for judging whether the cache data needs to be cleaned based on the characteristic information of the cache data under the cache characteristic, namely the preset identification processing rule refers to a processing rule for judging the cleaning type of the cache data based on the characteristic information, and the preset identification processing rule corresponding to the first cache characteristic refers to a processing rule for judging the cleaning type of the cache data based on the characteristic information of the cache data under the first cache characteristic.

For example, if the first cache feature is the size of the cache data, the preset identifier processing rule corresponding to the first cache feature may be: judge whether the size of the cache data is smaller than a preset data threshold; if so, the cache data does not need to be cleaned (that is, its cleaning category is not clean); otherwise, the cache data needs to be cleaned (that is, its cleaning category is clean). The preset data threshold may be, for example, 200 megabytes. For another example, if the first cache feature is the retention time of the cache data in the cache space, the rule may be: judge whether the retention time of the cache data in the cache space is shorter than a first preset duration; if so, the cache data does not need to be cleaned; otherwise, it needs to be cleaned. Illustratively, the first preset duration is two weeks. For another example, if the first cache feature is the frequency of use, the rule may be: judge whether the frequency of use of the cache data is greater than a preset frequency; if so, the cache data does not need to be cleaned; otherwise, it needs to be cleaned. Illustratively, the preset frequency is once every couple of days. For another example, if the first cache feature is the number of times of use, the rule may be: judge whether the number of times the cache data has been used is greater than a preset number of times; if so, the cache data does not need to be cleaned; otherwise, it needs to be cleaned. Illustratively, the preset number of times is 5.
For another example, if the first cache feature is the cache data type, the rule may be: judge whether the type of the cache data is a preset data type; if so, the cache data does not need to be cleaned; otherwise, it needs to be cleaned. Illustratively, the preset data type is a document file. For another example, if the first cache feature is the time of last use, the rule may be: judge whether the interval between the time the cache data was last used and the current time is shorter than a second preset duration; if so, the cache data does not need to be cleaned; otherwise, it needs to be cleaned. Illustratively, the second preset duration is 7 days. Other cache features and rules are also possible and are not limited to those described here. It should be understood that the cache features and the preset identifier processing rules may be designed according to the specific situation, and the above examples do not limit the present application.
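The per-feature rules of step t2 can be sketched in Python as follows. This is only an illustrative sketch: the names `CacheEntry`, `RULES` and `sub_cleaning_identifiers` are hypothetical, the thresholds are the illustrative values given above (200 megabytes, two weeks, 5 uses, document file, 7 days), and the frequency-of-use rule is omitted because its example threshold is not stated precisely.

```python
from dataclasses import dataclass

CLEAN, NOT_CLEAN = "clean", "not clean"

@dataclass
class CacheEntry:
    """Feature information of one cache data item (hypothetical field names)."""
    size_mb: float
    retention_days: float
    use_count: int
    data_type: str
    days_since_last_use: float

# One preset identifier processing rule per cache feature.
# Each rule maps feature information to a sub-cleaning identifier.
RULES = {
    "size": lambda e: NOT_CLEAN if e.size_mb < 200 else CLEAN,
    "retention": lambda e: NOT_CLEAN if e.retention_days < 14 else CLEAN,
    "use_count": lambda e: NOT_CLEAN if e.use_count > 5 else CLEAN,
    "data_type": lambda e: NOT_CLEAN if e.data_type == "document" else CLEAN,
    "last_used": lambda e: NOT_CLEAN if e.days_since_last_use < 7 else CLEAN,
}

def sub_cleaning_identifiers(entry):
    """Apply every rule to one cache data item (step t2)."""
    return {feature: rule(entry) for feature, rule in RULES.items()}

example = CacheEntry(size_mb=50, retention_days=30, use_count=10,
                     data_type="image", days_since_last_use=2)
sub_ids = sub_cleaning_identifiers(example)
print(sub_ids)
```

Keeping the rules in a dictionary keyed by cache feature makes it straightforward to add, remove or retune a rule without touching the rest of the labeling logic.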

The sub-cleaning identifier indicates whether the cache data needs to be cleaned when judged under the first cache feature, that is, it indicates the cleaning category of the cache data under the first cache feature. The sub-cleaning identifier takes the same specific form as the cleaning identifier; reference may be made to the foregoing description of the specific form of the cleaning identifier.

In this embodiment, a preset identifier processing rule is correspondingly set for each of a plurality of cache features, and a plurality of sub-cleaning identifiers corresponding to the first cache data can be obtained according to feature information of the first cache data under each cache feature and the preset identifier processing rule corresponding to each cache feature. By setting a preset identification processing rule for each cache feature, the cache data can be subjected to feature decomposition, and the cache feature and the judgment rule which are most suitable for judging whether the cache data need to be cleaned or not can be found.

Step t3, determining the cleaning identifier of the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data, so as to obtain the cleaning identifiers corresponding to the plurality of cache data.

In a possible implementation manner, if the number of target sub-cleaning identifiers among the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, the cleaning identifier of the first cache data is determined to be the target sub-cleaning identifier, where the target sub-cleaning identifier indicates one of clean or not clean. The value of the preset number is related to the number of cache features; for example, the preset number may be greater than or equal to 1/2 of the number of cache features.

In another possible implementation manner, one sub-cleaning identifier with the largest proportion among the plurality of sub-cleaning identifiers corresponding to the first cache data may be determined as the cleaning identifier of the first cache data.

By determining the cleaning identifier of each cache data in the manner of steps t2 to t3, the cleaning identifiers corresponding to the plurality of cache data can be obtained. Setting an identifier processing rule for each cache feature allows the cleaning category of the cache data to be measured along multiple dimensions, and determining the cleaning identifier according to the number or proportion of the sub-cleaning identifiers brings the cleaning category indicated by the cleaning identifier as close as possible to the real condition of the cache data, so that the cleaning category of the cache data can be labeled accurately.
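The two ways of combining sub-cleaning identifiers in step t3 can be sketched as follows. The function names are hypothetical. In the preset-number variant, returning the opposite category when the target count does not exceed the preset number is an assumption, since the text does not state what happens in that case.

```python
from collections import Counter

CLEAN, NOT_CLEAN = "clean", "not clean"

def by_preset_number(sub_ids, target, preset_number):
    """Return `target` if it occurs more than `preset_number` times.

    Assumption: otherwise the opposite cleaning category is returned.
    """
    other = NOT_CLEAN if target == CLEAN else CLEAN
    return target if sub_ids.count(target) > preset_number else other

def by_largest_proportion(sub_ids):
    """Return the sub-cleaning identifier with the largest share.

    Ties are resolved in favor of the identifier seen first,
    which is how Counter.most_common orders equal counts.
    """
    return Counter(sub_ids).most_common(1)[0][0]

sub_ids = [NOT_CLEAN, CLEAN, NOT_CLEAN, CLEAN, NOT_CLEAN]
label_a = by_preset_number(sub_ids, target=NOT_CLEAN, preset_number=2)
label_b = by_largest_proportion(sub_ids)
print(label_a, label_b)
```

With five cache features and a preset number of 2, both variants label this item as not clean, because three of its five sub-cleaning identifiers say so.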

Optionally, once the sub-cleaning identifier of the first cache data under the first cache feature has been obtained in step t2, the feature information of the first cache data under the first cache feature may be replaced by that sub-cleaning identifier. Doing so for every cache feature updates the feature information of the first cache data under each cache feature to the corresponding sub-cleaning identifier, and doing so for every cache data updates the feature information of each cache data under each cache feature in the same way. Replacing the feature information with sub-cleaning identifiers discretizes continuous feature information into the two classes clean and not clean, which classifies the cache data under each cache feature and facilitates subsequent quantification and judgment using the cache features.

S102, determining judgment accuracy under a target situation based on feature information of a plurality of cache data under a first cache feature and cleaning identifiers corresponding to the cache data, wherein the first cache feature is any cache feature in the cache features, and the target situation is a situation that the first cache feature is taken as a judgment standard to judge cleaning categories corresponding to the cache data.

Here, the judgment accuracy in the target situation reflects the gap between the cleaning categories determined for the plurality of cache data by taking the first cache feature as the judgment standard and the cleaning identifiers corresponding to those cache data. The cleaning category of each cache data may be determined according to the judgment rule corresponding to the first cache feature and the feature information of that cache data under the first cache feature. The higher the judgment accuracy, the smaller this gap, that is, the more accurate the result of judging the cleaning categories of the plurality of cache data by taking the first cache feature as the judgment standard; the lower the judgment accuracy, the larger this gap, that is, the less accurate the result.

The judging rule corresponding to the first cache feature is a rule for judging whether the cache data needs to be cleaned or not by taking the first cache feature as a judging standard. For example, if the first cache feature is the size of the cache data, the determining rule of the first cache feature may be to determine whether the cache data is smaller than a preset data threshold, and if so, determine that the cache data does not need to be cleaned; otherwise, determining that the cache data needs to be cleaned. In some possible implementations, the determination rule corresponding to the first cache feature may be preset, for example, may be the same rule as the preset identification processing rule described in the foregoing step S101. In other possible embodiments, the determination rule corresponding to the first cache feature may also be determined in the process of determining the accuracy of the determination in the target situation.

Based on the characteristic information of each cache characteristic of the plurality of cache data and the cleaning identification corresponding to each cache data, the judgment accuracy of each situation can be determined, so that the judgment accuracy of the plurality of situations and the judgment rules of the plurality of situations can be obtained.

In a specific implementation, one or more ways may be used to measure the accuracy of judging the cleaning category of the cache data when a cache feature is taken as the judgment standard. For specific embodiments of measuring the accuracy when each cache feature is taken as the judgment standard, reference may be made to the following description.

S103, constructing a cache data cleaning model according to the judging accuracy in a plurality of situations, wherein the plurality of situations are situations in which a plurality of cache characteristics are respectively used as judging standards to judge cleaning categories corresponding to the plurality of cache data.

The cache data cleaning model is a decision tree whose nodes are the plurality of cache features, and the judgment rule of a cache feature is the decision condition of the node corresponding to that cache feature. The model is constructed according to the judgment accuracy corresponding to each cache feature: a cache feature with higher judgment accuracy is taken as the parent node of a cache feature with lower judgment accuracy, and the cache feature with lower judgment accuracy is connected to a first branch of the cache feature with higher judgment accuracy. The first branch is the branch on which the cache data is judged as not to be cleaned when its cleaning category is judged according to the judgment rule of the cache feature with higher judgment accuracy. Among the plurality of cache features, the cache feature with the highest judgment accuracy is the root node of the decision tree.

For example, the cache features are the size of the cache data, the retention time of the cache data in the cache space, the time the cache data was last used, and the number of times the cache data has been used, and the judgment rules of the cache features are preset and consistent with the preset identifier processing rules described above. Suppose the judgment accuracies of the cache features determined in step S102 are, from high to low: the time of last use, the size of the cache data, the number of times of use, and the retention time of the cache data in the cache space. Then, as shown in fig. 2, the time of last use is taken as the root node; the size of the cache data is connected to the branch on which the judgment result of the time of last use is not clean; the number of times of use is connected to the branch on which the judgment result of the size of the cache data is not clean; and the retention time of the cache data in the cache space is connected to the branch on which the judgment result of the number of times of use is not clean.
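The construction in this example can be sketched as follows. The sketch rests on assumptions: the accuracy values are invented solely to reproduce the ordering in the example, the thresholds are the illustrative ones given earlier, the names are hypothetical, and it assumes that the branch on which a node judges the cache data as clean ends in a clean leaf while the not-clean branch of the last node ends in a not-clean leaf.

```python
CLEAN, NOT_CLEAN = "clean", "not clean"

def order_features(accuracy):
    """Order cache features from highest to lowest judgment accuracy.

    The first feature is the root; each later feature hangs on the
    not-clean branch of the feature before it.
    """
    return sorted(accuracy, key=accuracy.get, reverse=True)

def classify(entry, ordered_features, rules):
    """Walk the chain; any node that judges 'clean' ends the walk."""
    for feature in ordered_features:
        if rules[feature](entry) == CLEAN:
            return CLEAN
    return NOT_CLEAN

# Hypothetical accuracy scores, chosen only to match the example ordering.
accuracy = {"size_mb": 0.30, "retention_days": 0.05,
            "days_since_last_use": 0.40, "use_count": 0.20}

rules = {
    "days_since_last_use": lambda e: NOT_CLEAN if e["days_since_last_use"] < 7 else CLEAN,
    "size_mb": lambda e: NOT_CLEAN if e["size_mb"] < 200 else CLEAN,
    "use_count": lambda e: NOT_CLEAN if e["use_count"] > 5 else CLEAN,
    "retention_days": lambda e: NOT_CLEAN if e["retention_days"] < 14 else CLEAN,
}

chain = order_features(accuracy)
entry = {"days_since_last_use": 2, "size_mb": 50, "use_count": 10, "retention_days": 3}
print(chain)
print(classify(entry, chain, rules))
```

Because every child sits on the not-clean branch of its parent, the tree degenerates into an ordered chain of checks, which is why a sorted list is enough to represent it here.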

In the technical solution of fig. 1, the cleaning categories of the cache data in the training sample set are judged by taking each of a plurality of preset cache features as the judgment standard, the judgment accuracy for each cache feature is determined, and a decision tree that judges the cleaning category of cache data by taking the plurality of cache features as judgment standards is constructed based on these judgment accuracies. The order in which the cache features are used for judgment is thus determined in a relatively simple manner and at high operation speed. Because, in the decision tree, a cache feature with higher judgment accuracy is the parent node of a cache feature with lower judgment accuracy, and the latter is connected to the branch on which the judgment result of the former is not clean, the decision strategy is equivalent to first judging whether cache data needs to be cleaned using the cache feature with higher judgment accuracy and then using the cache feature with lower judgment accuracy. Whether cache data needs to be cleaned can therefore be judged accurately, so that cache data useful to the user is retained; compared with clearing all cache data, this achieves fine-grained cleaning of the cache data.

According to different requirements, the accuracy of judging the cleaning type of the cache data by taking each cache characteristic as a judging standard can be measured in different modes.

In one possible implementation, the information gain may be used to measure the accuracy of a cache feature when it is taken as the judgment standard for judging the cleaning category of the cache data. In this case, step S102 may include the following steps a1 to a3:

Step a1, determining a first information entropy related to the cleaning identifier according to the cleaning identifiers corresponding to the plurality of cache data.

Here, the first information entropy is used to indicate the uncertainty of the cleaning identity. The larger the first information entropy is, the higher the cleaning identification uncertainty is, and the larger the information quantity required for determining the cleaning identification is; the smaller the first information entropy, the lower the cleaning identifier uncertainty, and the smaller the amount of information required to determine the cleaning identifier.

Specifically, the first information entropy is calculated as follows:

$$H(D) = -\sum_{k=1}^{2} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

where H(D) is the first information entropy; k indexes the category of the cleaning identifier corresponding to the cache data, of which there are two, clean and not clean; |C_k| is the number of cache data whose cleaning identifier belongs to the k-th category; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set).

By way of example, assume that the training sample set is as shown in Table 1:

Cache feature 1	Cache feature 2	Cache feature 3	Cache feature 4	Cleaning identifier
Feature information 11 (ta)	Feature information 21	…	…	Clean
Feature information 12 (ta)	Feature information 22	…	…	Not clean
Feature information 13 (ta)	…	…	…	Not clean
Feature information 14 (tb)	…	…	…	Clean
Feature information 15 (tb)	…	…	…	Not clean
Feature information 16 (tc)	…	…	…	Clean
Feature information 17 (tc)	…	…	…	Clean

Table 1

As can be seen from Table 1, 4 of the cleaning identifiers are clean and 3 are not clean, so the first information entropy is:

$$H(D) = -\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7} \approx 0.985$$
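As a check on this example, the first information entropy can be computed with a short Python sketch. The function name `entropy` is hypothetical, and the base-2 logarithm is an assumption (the usual convention for information entropy); the labels are the cleaning identifier column of Table 1 in row order.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy (base 2, an assumption) of a list of cleaning identifiers."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# Cleaning identifiers of the seven samples in Table 1, in row order.
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]
h_d = entropy(labels)
print(round(h_d, 4))  # 0.9852
```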

Step a2, determining the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifiers corresponding to the plurality of cache data.

Here, the conditional entropy of the first cache feature indicates the uncertainty of the cleaning identifier when the feature information under the first cache feature is known. The smaller the conditional entropy of the first cache feature, the lower the uncertainty that remains in determining the cleaning identifier from the first cache feature; the greater the conditional entropy of the first cache feature, the higher that uncertainty.

Specifically, the conditional entropy of the first cache feature is calculated as follows:

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{2} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

where H(D|A) is the conditional entropy of the first cache feature; n is the total number of categories of feature information of the plurality of cache data under the first cache feature; i indexes the category of the feature information under the first cache feature; |D_i| is the number of cache data whose feature information under the first cache feature belongs to the i-th category; |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set); k indexes the category of the cleaning identifier corresponding to the cache data; and |D_ik| is the number of cache data whose feature information under the first cache feature belongs to the i-th category and whose cleaning identifier belongs to the k-th category.

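For illustration, assume that the training sample set is as shown in Table 1 and that the first cache feature is cache feature 1. As can be seen from Table 1, the feature information of cache feature 1 has three categories, ta, tb and tc, so n = 3, and the conditional entropy of cache feature 1 is:

$$H(D \mid A) = \frac{3}{7}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{2}{7}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + \frac{2}{7}\cdot 0 \approx 0.679$$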
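The conditional entropy for this example can be reproduced with the following Python sketch. The function names are hypothetical and the base-2 logarithm is an assumption; `features` holds the categories of cache feature 1 and `labels` the cleaning identifiers, both taken from Table 1 in row order.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Empirical entropy (base 2, an assumption) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(features, labels):
    """Weighted entropy of the labels within each feature category."""
    groups = defaultdict(list)
    for value, label in zip(features, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

features = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]
h_d_a = conditional_entropy(features, labels)
print(round(h_d_a, 4))  # 0.6793
```

The tc group contributes nothing to the sum because both of its samples carry the same cleaning identifier, so its entropy is zero.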

It should be understood that, where the feature information of each cache data under each cache feature has been updated to the sub-cleaning identifier under that cache feature in the manner described for step S101, the total number of categories of feature information under each cache feature is 2, that is, n equals 2.

For example, if the data in table 1 is updated as shown in table 2:

Cache feature 1	Cache feature 2	Cache feature 3	Cache feature 4	Cleaning identifier
Sub-cleaning identifier (clean)	Sub-cleaning identifier	…	…	Clean
Sub-cleaning identifier (clean)	Sub-cleaning identifier	…	…	Not clean
Sub-cleaning identifier (clean)	…	…	…	Not clean
Sub-cleaning identifier (clean)	…	…	…	Clean
Sub-cleaning identifier (clean)	…	…	…	Not clean
Sub-cleaning identifier (clean)	…	…	…	Clean
Sub-cleaning identifier (clean)	…	…	…	Clean

Table 2

In this case, in the calculation of the conditional entropy, H(D|A) is the conditional entropy of the first cache feature; n is the total number of categories of sub-cleaning identifiers of the plurality of cache data under the first cache feature; i indexes the category of the sub-cleaning identifier under the first cache feature; |D_i| is the number of cache data whose sub-cleaning identifier under the first cache feature belongs to the i-th category; |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set); k indexes the category of the cleaning identifier corresponding to the cache data; and |D_ik| is the number of cache data whose sub-cleaning identifier under the first cache feature belongs to the i-th category and whose cleaning identifier belongs to the k-th category.

Step a3, calculating the information gain of the first cache feature according to the first information entropy related to the cleaning identifier and the conditional entropy of the first cache feature, the information gain being used to indicate the judgment accuracy in the target situation.

Specifically, the conditional entropy of the first cache feature is subtracted from the first information entropy to obtain the information gain of the first cache feature. The information gain is calculated as follows:

$$G(D, A) = H(D) - H(D \mid A)$$

The information gain is positively correlated with the judgment accuracy in the target situation: the larger the information gain, the higher the judgment accuracy in the target situation; the smaller the information gain, the lower the judgment accuracy in the target situation.
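Steps a1 to a3 can be combined into one short Python sketch. The function names are hypothetical and the base-2 logarithm is an assumption; the data are the cache feature 1 column and the cleaning identifier column of Table 1.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Step a1: empirical entropy (base 2, an assumption) of the labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def conditional_entropy(features, labels):
    """Step a2: entropy of the labels given the feature categories."""
    groups = defaultdict(list)
    for value, label in zip(features, labels):
        groups[value].append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def information_gain(features, labels):
    """Step a3: G(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(features, labels)

features = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]
gain = information_gain(features, labels)
print(f"{gain:.3f}")  # 0.306
```

Computing the same quantity for every cache feature and sorting the results in descending order would give the ordering used to place the features in the decision tree.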

In steps a1 to a3, the information gain is used to measure the accuracy of each cache feature when it is taken as the judgment standard for judging the cleaning category of the cache data. The calculation is simple, which improves the speed of constructing the cache data cleaning model.

In another possible embodiment, the information gain ratio may be used to measure the accuracy of a cache feature when it is taken as the judgment standard for judging the cleaning category of the cache data. In this case, step S102 may include the following steps b1 to b5:

Step b1, determining a first information entropy related to the cleaning identifier according to the cleaning identifiers corresponding to the plurality of cache data.

Step b2, determining the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifiers corresponding to the plurality of cache data.

Step b3, calculating the information gain of the first cache feature according to the first information entropy related to the cleaning identifier and the conditional entropy of the first cache feature.

For the specific implementation of the steps b1 to b3, reference may be made to the foregoing steps a1 to a3, which are not repeated here.

Step b4, determining a second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature.

Here, the second information entropy indicates the uncertainty of the feature information under the first cache feature. The larger the second information entropy, the higher the uncertainty of the feature information under the first cache feature; the smaller the second information entropy, the lower the uncertainty of the feature information under the first cache feature.

Specifically, the second information entropy related to the first cache feature is calculated as follows:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where H_A(D) is the second information entropy related to the first cache feature; n is the total number of categories of feature information of the plurality of cache data under the first cache feature; i indexes the category of the feature information under the first cache feature; |D_i| is the number of cache data whose feature information under the first cache feature belongs to the i-th category; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set).

For example, assume that the training sample set is as shown in Table 1 and that the first cache feature is cache feature 1. As can be seen from Table 1, the feature information of cache feature 1 has three categories, ta, tb and tc, with 3 occurrences of ta, 2 of tb and 2 of tc, so the second information entropy related to cache feature 1 is:

$$H_A(D) = -\frac{3}{7}\log_2\frac{3}{7} - \frac{2}{7}\log_2\frac{2}{7} - \frac{2}{7}\log_2\frac{2}{7} \approx 1.557$$

Step b5, calculating the information gain ratio of the first cache feature according to the information gain of the first cache feature and the second information entropy related to the first cache feature, the information gain ratio being used to indicate the judgment accuracy in the target situation.

Specifically, the quotient of the information gain and the second information entropy is determined as the information gain ratio of the first cache feature. The information gain ratio is calculated as follows:

$$G_R(D, A) = \frac{G(D, A)}{H_A(D)}$$

where G_R(D, A) denotes the information gain ratio of the first cache feature. The information gain ratio is positively correlated with the judgment accuracy in the target situation: the larger the information gain ratio, the higher the judgment accuracy in the target situation; the smaller the information gain ratio, the lower the judgment accuracy in the target situation.
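Steps b1 to b5 can be sketched in Python as follows. The function names are hypothetical and the base-2 logarithm is an assumption; the data are again the cache feature 1 column and the cleaning identifier column of Table 1. The sketch does not guard against a second information entropy of zero, which would occur if every sample shared the same feature category.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Empirical entropy (base 2, an assumption) of a list of values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def conditional_entropy(features, labels):
    """Entropy of the labels given the feature categories."""
    groups = defaultdict(list)
    for value, label in zip(features, labels):
        groups[value].append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def gain_ratio(features, labels):
    """Information gain divided by the entropy of the feature itself."""
    gain = entropy(labels) - conditional_entropy(features, labels)  # steps b1-b3
    second_entropy = entropy(features)                              # step b4
    return gain / second_entropy                                    # step b5

features = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]
second_entropy = entropy(features)
ratio = gain_ratio(features, labels)
print(f"{second_entropy:.3f} {ratio:.3f}")  # 1.557 0.197
```

Note that the second information entropy reuses the same entropy routine, applied to the feature column instead of the cleaning identifier column.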

In steps b1 to b5, the information gain ratio is used to measure the accuracy of each cache feature when it is taken as the judgment standard for judging the cleaning category of the cache data, which allows that accuracy to be measured more precisely.

In yet another possible embodiment, the minimum Gini index may be used to measure the accuracy of a cache feature when it is taken as the judgment standard for judging the cleaning category of the cache data. In this case, step S102 may include the following steps c1 to c2:

Step c1, determining, according to the cleaning identifiers corresponding to the plurality of cache data, the cleaning identifier probability distribution for each class of feature information under the first cache feature.

Here, for one class of feature information under the first cache feature, the cleaning identifier probability distribution includes: the probability distribution of the cleaning identifiers of the cache data whose feature information belongs to this class, and the probability distribution of the cleaning identifiers of the cache data whose feature information does not belong to this class.

For illustration, assume that the training sample set is as shown in Table 1 and that the first cache feature is cache feature 1, whose feature information has three categories: ta, tb and tc. As can be seen from Table 1, for the class ta, there are 3 cache data whose feature information is ta, of which 1 is labeled clean and 2 are labeled not clean, so the cleaning identifier probability distribution corresponding to ta is (1/3, 2/3); there are 4 cache data whose feature information is not ta, of which 3 are labeled clean and 1 is labeled not clean, so the cleaning identifier probability distribution corresponding to feature information other than ta is (3/4, 1/4). For the class tb, there are 2 cache data whose feature information is tb, of which 1 is labeled clean and 1 is labeled not clean, so the cleaning identifier probability distribution corresponding to tb is (1/2, 1/2); there are 5 cache data whose feature information is not tb, of which 3 are labeled clean and 2 are labeled not clean, so the cleaning identifier probability distribution corresponding to feature information other than tb is (3/5, 2/5). For the class tc, there are 2 cache data whose feature information is tc, of which 2 are labeled clean and 0 are labeled not clean, so the cleaning identifier probability distribution corresponding to tc is (1, 0); there are 5 cache data whose feature information is not tc, of which 2 are labeled clean and 3 are labeled not clean, so the cleaning identifier probability distribution corresponding to feature information other than tc is (2/5, 3/5).
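The distributions in this example can be checked with the following Python sketch, which uses exact fractions. The function names are hypothetical, each distribution is reported as the pair (share labeled clean, share labeled not clean), and the sketch assumes that both sides of every split contain at least one sample, as is the case for Table 1.

```python
from collections import Counter
from fractions import Fraction

def distribution(labels):
    """Return (P(clean), P(not clean)) for a non-empty list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return (Fraction(counts["clean"], total), Fraction(counts["not clean"], total))

def split_distributions(features, labels, value):
    """Distributions for samples in category `value` and for all the others."""
    inside = [y for f, y in zip(features, labels) if f == value]
    outside = [y for f, y in zip(features, labels) if f != value]
    return distribution(inside), distribution(outside)

features = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]

for value in ("ta", "tb", "tc"):
    inside, outside = split_distributions(features, labels, value)
    print(value, inside, outside)
```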

Step c2, determining the minimum Gini index of the first cache feature according to the cleaning identifier probability distribution for each class of feature information under the first cache feature and the feature information of the plurality of cache data under the first cache feature, the minimum Gini index being used to indicate the judgment accuracy in the target situation.

Specifically, according to the cleaning identifier probability distribution for each class of feature information under the first cache feature and the feature information of the plurality of cache data under the first cache feature, the Gini index corresponding to each class of feature information under the first cache feature is calculated to obtain a plurality of Gini indexes, and the smallest of these Gini indexes is determined as the minimum Gini index of the first cache feature.

Specifically, for one class of feature information under the first cache feature, the Gini index is calculated as follows:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where Gini(D, A) is the Gini index corresponding to this class of feature information; |D_1| is the number of cache data whose feature information under the first cache feature belongs to this class; |D_2| is the number of cache data whose feature information under the first cache feature belongs to any other class; Gini(D_1) is the Gini index of the cleaning identifier probability distribution corresponding to this class of feature information; Gini(D_2) is the Gini index of the cleaning identifier probability distribution corresponding to the other classes of feature information; and |D| is the total number of the plurality of cache data (equal to the number of training samples in the training sample set).

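For illustration, assume that the training sample set is as shown in Table 1 and that the first cache feature is cache feature 1. The feature information of cache feature 1 has three categories, ta, tb and tc, so there are 3 Gini indexes. With the Gini index of a probability distribution (p_1, p_2) computed as 1 - p_1^2 - p_2^2:

The Gini index corresponding to ta is:

$$\frac{3}{7}\left(1 - \left(\frac{1}{3}\right)^2 - \left(\frac{2}{3}\right)^2\right) + \frac{4}{7}\left(1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right) = \frac{17}{42} \approx 0.405$$

The Gini index corresponding to tb is:

$$\frac{2}{7}\left(1 - \left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2\right) + \frac{5}{7}\left(1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2\right) = \frac{17}{35} \approx 0.486$$

The Gini index corresponding to tc is:

$$\frac{2}{7}\left(1 - 1^2 - 0^2\right) + \frac{5}{7}\left(1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2\right) = \frac{12}{35} \approx 0.343$$

Since the Gini index corresponding to tc is the smallest, it is taken as the minimum Gini index of cache feature 1.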
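The three values Gini(D, A) for ta, tb and tc and their minimum can be verified with the following Python sketch, which uses exact fractions. The function names are hypothetical, and the Gini index of a label distribution is taken as one minus the sum of squared class probabilities, the standard definition.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini index of the label distribution: 1 - sum of squared probabilities."""
    total = len(labels)
    if total == 0:
        return Fraction(0)
    return 1 - sum(Fraction(c, total) ** 2 for c in Counter(labels).values())

def gini_index(features, labels, value):
    """Weighted Gini index of the binary split 'is value' versus 'is not value'."""
    d1 = [y for f, y in zip(features, labels) if f == value]
    d2 = [y for f, y in zip(features, labels) if f != value]
    total = len(labels)
    return Fraction(len(d1), total) * gini(d1) + Fraction(len(d2), total) * gini(d2)

def min_gini_index(features, labels):
    """Return (best category, its Gini index, all Gini indexes)."""
    scores = {v: gini_index(features, labels, v) for v in sorted(set(features))}
    best = min(scores, key=scores.get)
    return best, scores[best], scores

features = ["ta", "ta", "ta", "tb", "tb", "tc", "tc"]
labels = ["clean", "not clean", "not clean", "clean", "not clean", "clean", "clean"]
best, best_score, scores = min_gini_index(features, labels)
print(best, best_score)  # tc 12/35
```

The category returned as `best` is also the natural division point when the judgment rule is derived from the minimum Gini index rather than preset.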

The minimum Gini index is inversely related to the judgment accuracy in the target situation: the smaller the minimum Gini index, the higher the judgment accuracy in the target situation; the larger the minimum Gini index, the lower the judgment accuracy in the target situation.

Optionally, in some possible embodiments, the judgment rule used when the first cache feature is taken as the judgment standard may further be determined according to the minimum Gini index of the first cache feature, where the class of feature information corresponding to the minimum Gini index of the first cache feature is taken as the division point for judging the cleaning category. For example, if the first cache feature is cache feature 1 shown in Table 1, tc is taken as the division point of cache feature 1: cache data whose feature information under cache feature 1 belongs to the same class as tc is classified as clean, and cache data whose feature information under cache feature 1 does not belong to the same class as tc is classified as not clean.

In step c 1-step c2, the accuracy of each cache feature can be measured more quickly by using the minimum coefficient of the radix to measure the accuracy of each cache feature as a criterion when judging the cleaning type of the cache data, and only a simple probability operation is needed.

It should be noted that, the determination accuracy corresponding to each cache feature may be determined by other ways capable of measuring the accuracy of determining the cleaning class of the cache data by using each cache feature as the determination standard, which is not limited in this application.

In some possible embodiments, in the process of constructing the cache data cleaning model, in order to obtain a more accurate cache data cleaning model, the parent-child relationship of a plurality of cache features in the decision tree can be determined in a multiple iteration mode. Referring to fig. 3, fig. 3 is a schematic flow chart of another method for constructing a data cleansing model according to an embodiment of the present application, where the method may be applied to the aforementioned application device or another device; as shown in fig. 3, the method comprises the steps of:

S201, obtaining a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a plurality of feature information and a cleaning identifier, the plurality of feature information respectively corresponds to a plurality of preset cache features, and the cleaning identifier is used for indicating the cleaning category corresponding to each cache data.

S202, determining judgment accuracy under a target situation based on feature information of a plurality of cache data under a first cache feature and cleaning identifiers corresponding to the cache data, wherein the first cache feature is any cache feature in the cache features, and the target situation is a situation that the first cache feature is taken as a judgment standard to judge cleaning categories corresponding to the cache data.

Here, the specific implementation manner of step S201 to step S202 may refer to the descriptions of the foregoing steps S101 to S102, and will not be repeated here.

S203, determining a second cache feature corresponding to the maximum judgment accuracy according to the judgment accuracy in a plurality of cases, wherein the plurality of cases are cases in which the plurality of cache features are respectively used as judgment standards to judge the cleaning type corresponding to the plurality of cache data.

S204, deleting the second cache feature, the feature information corresponding to the second cache feature, and second cache data from the training sample set, wherein the second cache data is the cache data judged as cleaning when the second cache feature is taken as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data.

If multiple cache features remain in the sample information and the judgment accuracy corresponding to each cache feature is greater than or equal to the preset accuracy, the method returns to step S202; if the judgment accuracy corresponding to each cache feature is less than the preset accuracy, or only one cache feature remains in the sample information, step S205 is executed.

S205, constructing a cache data cleaning model according to the order in which the plurality of cache features were deleted from the training sample set.

Here, the order in which the plurality of cache features are deleted from the training sample set reflects how accurately each cache feature judges the cleaning category of the cache data: the earlier a cache feature is deleted, the higher its judgment accuracy, and the later it is deleted, the lower its judgment accuracy. Therefore, the deletion order determines the node relation of the plurality of cache features in the decision tree: a cache feature deleted earlier is taken as the parent node of a cache feature deleted later, and the cache feature deleted later is connected to the branch of the earlier-deleted cache feature whose judgment result is no cleaning, thereby constructing the cache data cleaning model. The earliest deleted cache feature is the root node of the decision tree.

For example, suppose the cache features are the size of the cache data, the retention time of the cache data in the cache space, the time the cache data was last used, and the number of times the cache data has been used, and the four features are deleted in the following order: the time the cache data was last used, the size of the cache data, the number of times the cache data has been used, and the retention time of the cache data in the cache space. The decision tree constructed is then as shown in fig. 2.

In the method embodiment corresponding to fig. 3, in the process of determining the positions of the plurality of cache features in the decision tree, the accuracy of each cache feature as a judgment standard is determined through multiple iterations, and in each iteration the cache data that can already be determined to need cleaning through the current cache feature is removed. The judging order of the cache features can therefore be determined more accurately, which improves the accuracy of the decision tree, that is, the judgment accuracy of the cache data cleaning model.
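The iteration of steps S202 to S205 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation: the accuracy of a cache feature is approximated here by the fraction of samples on which a hypothetical per-feature rule agrees with the cleaning identifier (the description instead uses information gain, information gain ratio or the Gini coefficient), the stop test is applied before each deletion, and the last remaining feature is appended as the final node. The names `deletion_order`, `rules` and the feature names are hypothetical.

```python
def deletion_order(samples, rules, min_accuracy):
    """Return cache features in the order they are deleted (root node first).

    samples: list of (feature_info_dict, cleaning_identifier) pairs
    rules:   {feature_name: predicate}, predicate(info) is True for "clean"
    """
    features = dict(rules)
    data = list(samples)
    order = []
    while len(features) > 1 and data:
        # S202: score every remaining feature (assumed measure: agreement rate)
        scores = {
            name: sum(rule(info[name]) == label for info, label in data) / len(data)
            for name, rule in features.items()
        }
        if max(scores.values()) < min_accuracy:
            break
        # S203: feature with the maximum judgment accuracy
        best = max(scores, key=scores.get)
        order.append(best)
        # S204: delete the feature and the cache data it judges as "clean"
        rule = features.pop(best)
        data = [(info, label) for info, label in data if not rule(info[best])]
    if len(features) == 1:
        order.extend(features)  # assumption: the last feature is the final node
    return order


# Hypothetical rules and samples (assumptions for illustration only).
rules = {
    "idle_days": lambda v: v >= 30,
    "size_mb": lambda v: v >= 100,
    "uses": lambda v: v <= 2,
}
samples = [
    ({"idle_days": 40, "size_mb": 10, "uses": 5}, True),
    ({"idle_days": 50, "size_mb": 200, "uses": 1}, True),
    ({"idle_days": 5, "size_mb": 150, "uses": 9}, True),
    ({"idle_days": 3, "size_mb": 20, "uses": 1}, False),
    ({"idle_days": 2, "size_mb": 5, "uses": 8}, False),
    ({"idle_days": 1, "size_mb": 300, "uses": 7}, True),
]
order = deletion_order(samples, rules, min_accuracy=0.5)
```

The returned list corresponds to the chain of nodes from the root downward, each later feature hanging on the no-cleaning branch of the previous one.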

After the cache data cleaning model is constructed by the method embodiment, whether the cache data needs to be cleaned or not can be judged by using the cache data cleaning model. Referring to fig. 4, fig. 4 is a schematic flow chart of a data cleaning method according to an embodiment of the present application, where the method may be applied to the aforementioned application device; as shown in fig. 4, the method comprises the steps of:

S301, when it is detected that the total amount of the cache data corresponding to the target cache category is greater than a preset data amount, or that the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model.

Specifically, according to the cache feature corresponding to each node of the cache data cleaning model, the feature information of the cache data corresponding to the target cache category under that cache feature can be obtained; then, based on the decision condition corresponding to each node, whether the cache data corresponding to the target cache category needs to be cleaned is judged, until the cache data is determined to need cleaning or until the last node in the cache data cleaning model has been traversed.

Taking the decision tree shown in fig. 2 as an example of the cache data cleaning model: when cache data corresponding to the target cache category is obtained, the time the cache data was last used is determined. If the interval between that time and the current time is not less than a second preset time length, it is determined that the cache data needs to be cleaned; if the interval is less than the second preset time length, the size of the cache data is determined. If the size of the cache data is not less than a preset data threshold, it is determined that the cache data needs to be cleaned; if the size is less than the preset data threshold, the number of times the cache data has been used is determined. If the number of uses is not greater than a preset number, it is determined that the cache data needs to be cleaned; if the number of uses is greater than the preset number, the retention time of the cache data in the cache space is determined. If the retention time is not less than a first preset time length, it is determined that the cache data needs to be cleaned; if the retention time is less than the first preset time length, it is determined that the cache data does not need to be cleaned.
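The chain of judgments described for the decision tree of fig. 2 can be written as a short function. The sketch below is illustrative only; the class `CacheEntry`, the function `needs_cleaning` and all parameter names and threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    idle_time: float       # time elapsed since the cache data was last used
    size: int              # size of the cache data
    use_count: int         # number of times the cache data has been used
    retention_time: float  # time the cache data has stayed in the cache space


def needs_cleaning(entry, second_preset_time, size_threshold,
                   preset_count, first_preset_time):
    """Walk the four nodes in order; stop at the first 'clean' result."""
    if entry.idle_time >= second_preset_time:
        return True
    if entry.size >= size_threshold:
        return True
    if entry.use_count <= preset_count:
        return True
    # Last node: the retention time decides the final result.
    return entry.retention_time >= first_preset_time


# Hypothetical thresholds (assumptions for illustration only).
limits = dict(second_preset_time=30, size_threshold=100,
              preset_count=2, first_preset_time=90)
keep = CacheEntry(idle_time=1, size=10, use_count=10, retention_time=5)
stale = CacheEntry(idle_time=40, size=10, use_count=10, retention_time=5)
```

Each `if` corresponds to one node; falling through to the next test corresponds to following the no-cleaning branch of the parent node.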

In the method embodiment corresponding to fig. 4, when it is detected that the total amount of the cache data corresponding to the target cache category is greater than the preset data amount, or that the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, the cache data corresponding to the target cache category is cleaned through the cache data cleaning model. Since a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy, the cache feature with high judgment accuracy is used preferentially to judge whether cache data needs to be cleaned, so whether cache data needs to be cleaned can be judged accurately. In addition, because each piece of cache data corresponding to the target cache category is judged as to whether it needs to be cleaned, fine-grained cleaning of the cache data can be achieved.

Optionally, after the cache data cleaning model is constructed as described above, the cache data cleaning model may be further verified by using a verification sample set. If the accuracy of the cache data cleaning model determined from the verification sample set is greater than a preset verification accuracy, the cache data cleaning model is taken as the final cache data cleaning model, and the final cache data cleaning model is used to determine whether cache data needs to be cleaned.

The sample information in the verification sample set has the same form as the sample information in the training sample set. Specifically, after the cache data cleaning model is constructed, the cleaning category of each sample information in the verification sample set may be determined by using the cache data cleaning model; for the manner of doing so, refer to the foregoing embodiment of fig. 4. After the cleaning category of each sample information is determined, it is compared with the cleaning identifier in that sample information, a first quantity of sample information whose determined cleaning category is the same as the cleaning category indicated by its cleaning identifier is determined, and the ratio of the first quantity to the total quantity of sample information in the verification sample set is determined as the accuracy of the cache data cleaning model. By verifying the constructed cache data cleaning model, the accuracy of the cache data cleaning model can be ensured to reach a higher level.
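The verification step amounts to a simple ratio. A minimal sketch, assuming the model is available as a callable `predict` that returns the cleaning category for one sample's feature information (both `model_accuracy` and `predict` are hypothetical names, and the samples are invented for illustration):

```python
def model_accuracy(predict, verification_samples):
    """Ratio of samples whose predicted cleaning category matches the identifier."""
    first_quantity = sum(
        predict(info) == identifier for info, identifier in verification_samples
    )
    return first_quantity / len(verification_samples)


# Hypothetical model and verification samples (assumptions).
predict = lambda info: info["size"] >= 100
verification_samples = [
    ({"size": 200}, True),
    ({"size": 50}, False),
    ({"size": 150}, True),
    ({"size": 20}, True),   # the model gets this one wrong
]
accuracy = model_accuracy(predict, verification_samples)
accepted = accuracy > 0.7   # 0.7 stands in for the preset verification accuracy
```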

Optionally, on the basis of the above method embodiment, the sample set may be re-divided into a verification sample set and a training sample set, or the training sample set and the verification sample set may be updated with new cache data, and the training and verification steps may be performed multiple times, that is, steps S101-S103 and the verification step, or steps S201-S205 and the verification step, are performed multiple times with different sample sets used for each round of training and verification. The finally determined cache data cleaning model is then used to determine whether cache data needs to be cleaned. Through multiple rounds of training and verification, the accuracy of the cache data cleaning model can be further improved.

The foregoing describes the method of the present application and, in order to better practice the method of the present application, the apparatus of the present application is described next.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a device for constructing a data cleaning model according to an embodiment of the present application, where the device may be the aforementioned server or a terminal. As shown in fig. 5, the device 40 includes:

an obtaining module 401, configured to obtain a training sample set corresponding to a target cache category, where the training sample set includes sample information of a plurality of cache data belonging to the target cache category, and sample information of each cache data includes a cleaning identifier and a plurality of feature information, where the plurality of feature information is feature information corresponding to a preset plurality of cache features, and the cleaning identifier is used to indicate a cleaning category corresponding to each cache data;

an accuracy judging module 402, configured to determine, based on feature information of the plurality of cache data under a first cache feature, and cleaning identifiers corresponding to the plurality of cache data, a judging accuracy under a target situation, where the first cache feature is any cache feature of the plurality of cache features, and the target situation is a situation in which cleaning categories corresponding to the plurality of cache data are judged by using the first cache feature as a judging criterion;

a model building module 403, configured to build a cache data cleaning model according to the judgment accuracy in a plurality of situations, where the plurality of situations are situations in which the plurality of cache features are respectively used as judgment standards to judge the cleaning categories corresponding to the plurality of cache data, the cache data cleaning model is a decision tree that sequentially uses the plurality of cache features as judgment standards to judge the cleaning category of cache data, a cache feature with high judgment accuracy is a parent node of a cache feature with low judgment accuracy in the decision tree, the cache feature with low judgment accuracy is connected to a first branch of the cache feature with high judgment accuracy, and the first branch is the branch whose judgment result is no cleaning.

In one possible design, the accuracy judging module 402 is specifically configured to: determine a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determine the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data; and calculate the information gain of the first cache feature according to the first information entropy and the conditional entropy, to indicate the judgment accuracy in the target situation.

In one possible design, the accuracy judging module 402 is specifically configured to: determine a first information entropy related to the cleaning identifier according to the cleaning identifier corresponding to each of the plurality of cache data; determine the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data; calculate the information gain of the first cache feature according to the first information entropy and the conditional entropy; determine a second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature; and calculate the information gain ratio of the first cache feature according to the information gain and the second information entropy, to indicate the judgment accuracy in the target situation.
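The two entropy-based designs above can be sketched together. The snippet below is a hedged illustration using the standard definitions of entropy, conditional entropy, information gain and information gain ratio; the base-2 logarithm, the convention of returning a ratio of 0 when the second information entropy is 0, and the names `entropy` and `gain_and_ratio` are assumptions.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_and_ratio(values, labels):
    """Information gain and gain ratio of one cache feature.

    values[i]: feature information of cache data i under the feature
    labels[i]: cleaning identifier of cache data i
    """
    n = len(labels)
    first_entropy = entropy(labels)          # entropy of the cleaning identifier
    conditional = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        conditional += len(subset) / n * entropy(subset)
    gain = first_entropy - conditional
    second_entropy = entropy(values)         # entropy of the feature itself
    ratio = gain / second_entropy if second_entropy > 0 else 0.0
    return gain, ratio


# Hypothetical samples: the feature separates the identifiers perfectly.
gain, ratio = gain_and_ratio(["a", "a", "b", "b"], [1, 1, 0, 0])
```

A larger gain (or gain ratio) would indicate a higher judgment accuracy for the feature under this measure.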

In one possible design, the accuracy judging module 402 is specifically configured to: determine, according to the cleaning identifier corresponding to each of the plurality of cache data, the cleaning identifier probability distribution over each class of feature information under the first cache feature; and determine the minimum Gini coefficient of the first cache feature according to the cleaning identifier probability distribution and the feature information of the plurality of cache data under the first cache feature, to indicate the judgment accuracy in the target situation.

In one possible design, the obtaining module 401 is specifically configured to: acquire a plurality of cache data belonging to the target cache category, and determine the feature information of the plurality of cache data under each of the plurality of cache features; determine the sub-cleaning identifier of first cache data under the first cache feature according to the feature information of the first cache data under the first cache feature and a preset identifier processing rule corresponding to the first cache feature, so as to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, wherein the first cache data is any cache data in the plurality of cache data, the first cache feature is any cache feature in the plurality of cache features, the preset identifier processing rule is a processing rule for judging the cleaning category of cache data based on feature information, and the sub-cleaning identifier of the first cache data under the first cache feature is used for indicating the cleaning category corresponding to the first cache data under the first cache feature; and determine the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data, so as to obtain the cleaning identifier corresponding to each of the plurality of cache data.

In one possible design, the obtaining module 401 is specifically configured to: if the number of target sub-cleaning identifiers among the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determine that the cleaning identifier corresponding to the first cache data is the target sub-cleaning identifier, wherein the target sub-cleaning identifier indicates one of cleaning or no cleaning; or determine the sub-cleaning identifier with the largest proportion among the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data.
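The two ways of combining sub-cleaning identifiers can be sketched as follows. Sub-cleaning identifiers are assumed to be booleans (True for cleaning); the function names are hypothetical, and the behaviour when the preset number is not exceeded (falling back to the opposite identifier) is an assumption, since the text does not state it.

```python
from collections import Counter


def identifier_by_count(sub_identifiers, target, preset_number):
    """Use the target identifier if it occurs more than preset_number times.

    Assumption: otherwise the opposite identifier is used.
    """
    if sub_identifiers.count(target) > preset_number:
        return target
    return not target


def identifier_by_proportion(sub_identifiers):
    """Use the sub-cleaning identifier with the largest proportion."""
    return Counter(sub_identifiers).most_common(1)[0][0]


# Hypothetical sub-cleaning identifiers of one cache data under four features.
subs = [True, True, False, True]
by_count = identifier_by_count(subs, target=True, preset_number=2)
by_share = identifier_by_proportion(subs)
```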

In one possible design, the model building module 403 is specifically configured to: determine a second cache feature corresponding to the maximum judgment accuracy according to the judgment accuracy in the plurality of situations; delete the second cache feature, the feature information corresponding to the second cache feature, and second cache data from the training sample set, and return to the step of determining the judgment accuracy in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data, until the judgment accuracy in the plurality of situations is less than the preset accuracy or only one cache feature remains in the sample information, wherein the second cache data is the cache data judged as cleaning when the second cache feature is used as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data; and construct a cache data cleaning model according to the order in which the plurality of cache features were deleted from the training sample set.

It should be noted that, for content not mentioned in the embodiment corresponding to fig. 5, reference may be made to the description of the method embodiments corresponding to fig. 1 to 3, which is not repeated here.

In the above device, the cleaning category of the cache data in the training sample set is judged with each of the preset cache features as the judgment standard, the judgment accuracy of each cache feature as the judgment standard is determined, and a decision tree that judges the cleaning category of cache data with the cache features as judgment standards is constructed based on the judgment accuracy corresponding to each cache feature. The order in which the cache features are used for judgment is thus determined in a relatively simple manner, and the operation speed is high. Because a cache feature with high judgment accuracy is the parent node of a cache feature with low judgment accuracy in the decision tree, and the cache feature with low judgment accuracy is connected to the branch of the cache feature with high judgment accuracy whose judgment result is no cleaning, the decision strategy is equivalent to first using the cache feature with high judgment accuracy to judge whether cache data needs to be cleaned, and then using the cache feature with low judgment accuracy. Whether cache data needs to be cleaned can therefore be judged accurately, so that cache data useful to the user is retained; compared with clearing all cache data, fine-grained cleaning of the cache data is achieved.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a data cleaning device according to an embodiment of the present application, where the data cleaning device may be a mobile phone, a computer, or the like, as mentioned above; as shown in fig. 6, the apparatus 50 includes:

a cleaning module 501, configured to clean, when it is detected that the total amount of cache data corresponding to a target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, the cache data corresponding to the target cache category through a cache data cleaning model, where the cache data cleaning model is constructed by the data cleaning model construction method in the foregoing method embodiment.
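A rough sketch of the behaviour of such a cleaning module is given below, assuming the cache data cleaning model is exposed as a callable `needs_cleaning` and each piece of cache data is represented as a `(size, feature_info)` pair. The function name `clean_if_needed`, the parameter names and the sample values are all hypothetical.

```python
def clean_if_needed(entries, total_space, preset_amount, preset_ratio, needs_cleaning):
    """Return the cache data kept after an optional cleaning pass.

    entries: list of (size, feature_info) pairs for one target cache category
    """
    total = sum(size for size, _ in entries)
    triggered = total > preset_amount or total / total_space > preset_ratio
    if not triggered:
        return list(entries)
    # Judge every piece of cache data individually with the model.
    return [(size, info) for size, info in entries if not needs_cleaning(info)]


# Hypothetical model and cache contents (assumptions).
model = lambda info: info["stale"]
entries = [(60, {"stale": True}), (30, {"stale": False})]
kept = clean_if_needed(entries, total_space=100, preset_amount=1000,
                       preset_ratio=0.8, needs_cleaning=model)
untouched = clean_if_needed(entries, total_space=100, preset_amount=1000,
                            preset_ratio=0.95, needs_cleaning=model)
```

In the first call the ratio 0.9 exceeds the assumed threshold 0.8, so the stale entry is removed; in the second call neither condition holds and nothing is cleaned.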

It should be noted that, for content not mentioned in the embodiment corresponding to fig. 6, reference may be made to the description of the method embodiment corresponding to fig. 4, which is not repeated here.

In the device, when the total amount of the cache data corresponding to the target cache category is detected to be larger than the preset data amount, or the ratio of the total amount of the cache data corresponding to the target cache category to the total space of the caches corresponding to the target cache category is larger than a preset ratio threshold, the cache data corresponding to the target cache category is cleaned through a cache data cleaning model, and as the cache feature with high judgment accuracy is a father node of the cache feature with low judgment accuracy, the cache feature with high judgment accuracy is preferentially utilized to judge whether the cache data needs cleaning, so that the accurate judgment on whether the cache data needs cleaning can be realized; in addition, as whether the cached data corresponding to the target cache type needs to be cleaned or not is judged, the cached data can be cleaned finely.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present application, and the computer device 60 includes a processor 601 and a memory 602. The processor 601 is connected to the memory 602, for example the processor 601 may be connected to the memory 602 by a bus.

The processor 601 is configured to support the computer device 60 in performing the corresponding functions in the methods of fig. 1-3 or the method of fig. 4. The processor 601 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 602 is used for storing program code and the like. The memory 602 may include volatile memory (VM), such as random access memory (RAM); the memory 602 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 602 may also include a combination of the above types of memory.

In some possible cases, the processor 601 may call the program code to:

In other possible cases, the processor 601 may call the program code to:

and when the total amount of the cache data corresponding to the target cache category is detected to be larger than the preset data amount, or the ratio of the total amount of the cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is larger than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model, wherein the cache data cleaning model is constructed through the method embodiment.

It should be noted that, for the implementation of each operation, reference may also be made to the corresponding description in the above method embodiments; the processor 601 may also cooperate with other functional hardware to perform other operations in the method embodiments described above.

The present application also provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of the previous embodiments.

Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent changes made in accordance with the claims of the present application still fall within the scope of the present application.

Claims

1. The data cleaning model construction method is characterized by comprising the following steps of:

acquiring a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a cleaning identifier and a plurality of characteristic information, the plurality of characteristic information is respectively corresponding to a plurality of preset cache characteristics, and the cleaning identifier is used for indicating a cleaning category corresponding to each cache data;

determining the judging accuracy under a target condition based on the characteristic information of the plurality of cache data under a first cache characteristic and the cleaning identification corresponding to each of the plurality of cache data, wherein the first cache characteristic is any cache characteristic in the plurality of cache characteristics, and the target condition is a condition that the cleaning category corresponding to each of the plurality of cache data is judged by taking the first cache characteristic as a judging standard;

according to judging accuracy in a plurality of situations, constructing a cache data cleaning model, wherein the plurality of situations are situations in which the plurality of cache features are respectively used as judging standards to judge cleaning categories corresponding to the plurality of cache data, the cache data cleaning model is a decision tree for judging cleaning categories of the cache data by sequentially using the plurality of cache features as judging standards, the cache features with high judging accuracy are father nodes of the cache features with low judging accuracy in the decision tree, the cache features with low judging accuracy are connected to a first branch of the cache features with high judging accuracy, and the first branch is a branch with a judging result of not cleaning;

the constructing a cache data cleaning model according to the judging accuracy under a plurality of conditions comprises the following steps:

determining a second cache feature corresponding to the maximum judgment accuracy according to the judgment accuracy in the plurality of cases,

deleting the second cache feature, feature information corresponding to the second cache feature and second cache data in the training sample set, returning to execute the step of determining the judgment accuracy under the target condition based on the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data until the judgment accuracy under the plurality of conditions is less than the preset accuracy or only one cache feature in the sample information remains, wherein the second cache data is the cache data judged as cleaning when the second cache feature is taken as a judgment standard to judge the cleaning category corresponding to each of the plurality of cache data,

and constructing the cache data cleaning model according to the deleted sequence of the plurality of cache features in the training sample set.

2. The method according to claim 1, wherein determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data includes:

determining a first information entropy related to the cleaning identification according to the cleaning identification corresponding to each of the plurality of cache data;

determining the conditional entropy of the first cache feature according to the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data;

and calculating the information gain of the first cache feature according to the first information entropy and the conditional entropy so as to indicate the judgment accuracy under the target condition.

3. The method according to claim 1, wherein determining the accuracy of the determination in the target situation based on the feature information of the plurality of cache data under the first cache feature and the cleaning identifier corresponding to each of the plurality of cache data includes:

according to the first information entropy and the conditional entropy, calculating to obtain the information gain of the first cache feature;

determining a second information entropy related to the first cache feature according to the feature information of the plurality of cache data under the first cache feature;

and calculating the information gain ratio of the first cache characteristic according to the information gain and the second information entropy so as to indicate the judgment accuracy under the target condition.
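The information gain ratio of claim 3 divides the information gain by the entropy of the feature's own values, which penalises features that split the data into many small groups. A minimal sketch, assuming base-2 logarithms and hypothetical function names, with a ratio of zero returned when the feature takes a single value:

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (base 2) of a list of hashable items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of a feature divided by the entropy of its values."""
    total = len(labels)
    conditional = 0.0
    for value in set(values):
        subset = [l for v, l in zip(values, labels) if v == value]
        conditional += len(subset) / total * entropy(subset)
    gain = entropy(labels) - conditional
    split_info = entropy(values)  # the "second information entropy"
    return gain / split_info if split_info > 0 else 0.0
```

With four samples labelled clean, clean, keep, keep, a two-valued feature that separates them perfectly scores 1.0, whereas a feature with four distinct values also separates them perfectly but scores only 0.5.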

4. The method according to claim 1, wherein determining the judgment accuracy under the target condition based on the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data comprises:

determining, according to the cleaning identifications corresponding to the plurality of cache data, a probability distribution of the cleaning identification for each kind of feature information under the first cache feature;

and determining the minimum Gini index of the first cache feature according to the probability distribution of the cleaning identification and the feature information of the plurality of cache data under the first cache feature, so as to indicate the judgment accuracy under the target condition.
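Reading the index in claim 4 as the Gini index used in decision-tree learning (an interpretation of the translated term), the computation can be sketched as below. The function names are hypothetical; each sample contributes one feature value and one cleaning label, and a lower index indicates a purer split.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of cleaning labels: 1 - sum of squared label shares."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_index(values, labels):
    """Gini impurity of the labels per feature value, weighted by group size."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())
```

Because a smaller value means a better split, a caller ranking features with this measure would pick the feature with the minimum index rather than the maximum.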

5. The method according to any one of claims 1-4, wherein acquiring the training sample set corresponding to the target cache category comprises:

acquiring a plurality of cache data belonging to the target cache category, and determining characteristic information of the plurality of cache data under the plurality of cache characteristics respectively;

determining a sub-cleaning identifier of the first cache data under the first cache feature according to feature information of the first cache data under the first cache feature and a preset identifier processing rule corresponding to the first cache feature to obtain a plurality of sub-cleaning identifiers corresponding to the first cache data, wherein the first cache data is any cache data in the plurality of cache data, the first cache feature is any cache feature in the plurality of cache features, the preset identifier processing rule refers to a processing rule for judging a cleaning category of the cache data based on the feature information, and the sub-cleaning identifier of the first cache data under the first cache feature is used for indicating the cleaning category corresponding to the first cache data under the first cache feature;

and determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data, so as to obtain the cleaning identifier corresponding to each of the plurality of cache data.

6. The method of claim 5, wherein determining the cleaning identifier corresponding to the first cache data according to the plurality of sub-cleaning identifiers corresponding to the first cache data comprises:

if the number of target sub-cleaning identifiers among the plurality of sub-cleaning identifiers corresponding to the first cache data is greater than a preset number, determining that the cleaning identifier corresponding to the first cache data is the target sub-cleaning identifier, wherein the target sub-cleaning identifier is used for indicating one of cleaning or not cleaning; or,

determining the sub-cleaning identifier that accounts for the largest proportion of the plurality of sub-cleaning identifiers corresponding to the first cache data as the cleaning identifier corresponding to the first cache data.
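The labelling scheme of claims 5 and 6 can be illustrated as follows. Everything concrete in this sketch is hypothetical: the feature names, the thresholds inside `RULES`, and the function names are stand-ins for the preset identifier processing rules. The claim does not say what happens when the preset number is not exceeded, so the threshold variant returns `None` in that case.

```python
from collections import Counter

# Hypothetical identifier processing rules: one rule per cache feature,
# each mapping a feature value to a sub-cleaning identifier.
RULES = {
    "days_since_access": lambda v: "clean" if v > 30 else "keep",
    "size_mb": lambda v: "clean" if v > 100 else "keep",
    "access_count": lambda v: "clean" if v < 3 else "keep",
}

def sub_identifiers(features, rules=RULES):
    """Apply every feature's rule to get one sub-cleaning identifier per feature."""
    return [rule(features[name]) for name, rule in rules.items()]

def by_threshold(subs, target="clean", preset_number=1):
    """First alternative: adopt `target` if it occurs more than `preset_number` times."""
    return target if subs.count(target) > preset_number else None

def by_majority(subs):
    """Second alternative: adopt the sub-identifier with the largest share."""
    return Counter(subs).most_common(1)[0][0]

entry = {"days_since_access": 45, "size_mb": 20, "access_count": 1}
subs = sub_identifiers(entry)
print(subs, by_threshold(subs), by_majority(subs))
```

Both variants turn several per-feature verdicts into a single cleaning identifier, which then serves as the training label for that cache data.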

7. A method according to claim 3, characterized in that the second information entropy is calculated as follows:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

wherein $H_A(D)$ is the second information entropy related to the first cache feature, $n$ is the total number of categories of feature information of the plurality of cache data under the first cache feature, $i$ indicates the category of the feature information under the first cache feature, $|D_i|$ is the number of cache data, among the plurality of cache data, whose feature information under the first cache feature belongs to the $i$-th category, and $|D|$ is the total number of the plurality of cache data.

8. A method of data cleaning comprising:

when it is detected that the total amount of cache data corresponding to a target cache category is greater than a preset data amount, or that the ratio of the total amount of cache data corresponding to the target cache category to the total cache space corresponding to the target cache category is greater than a preset ratio threshold, cleaning the cache data corresponding to the target cache category through a cache data cleaning model, wherein the cache data cleaning model is constructed by the method according to any one of claims 1-7.
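The trigger condition and the subsequent cleaning pass can be sketched as below. This is an assumption-laden illustration: the entry layout, the `CHAIN` of per-feature predicates standing in for the constructed decision tree, and all thresholds are hypothetical, and the total cache space is assumed to be positive.

```python
def should_clean(cached_bytes, capacity_bytes, max_bytes, max_ratio):
    """True when the total amount or the occupancy ratio exceeds its preset threshold."""
    return cached_bytes > max_bytes or cached_bytes / capacity_bytes > max_ratio

def classify(entry, chain):
    """Walk the features in order; the first one that judges 'clean' decides."""
    for name, says_clean in chain:
        if says_clean(entry[name]):
            return "clean"
    return "keep"  # reached the end of the 'not cleaning' branches

def clean_category(entries, capacity_bytes, max_bytes, max_ratio, chain):
    """Return the entries that survive; nothing is removed unless a trigger fires."""
    cached_bytes = sum(e["size_bytes"] for e in entries)
    if not should_clean(cached_bytes, capacity_bytes, max_bytes, max_ratio):
        return list(entries)
    return [e for e in entries if classify(e, chain) == "keep"]

# Hypothetical model: features ordered from most to least accurate.
CHAIN = [
    ("days_since_access", lambda days: days > 30),
    ("size_bytes", lambda size: size > 500),
]
ENTRIES = [
    {"name": "a", "days_since_access": 40, "size_bytes": 100},
    {"name": "b", "days_since_access": 5, "size_bytes": 800},
    {"name": "c", "days_since_access": 2, "size_bytes": 50},
]
print([e["name"] for e in clean_category(ENTRIES, 1000, 2000, 0.8, CHAIN)])
```

Checking the thresholds first keeps the model out of the normal read path, so the classification cost is paid only when the cache category is actually under pressure.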

9. A device for constructing a data cleaning model, comprising:

an acquisition module, configured to acquire a training sample set corresponding to a target cache category, wherein the training sample set comprises sample information of a plurality of cache data belonging to the target cache category, the sample information of each cache data comprises a plurality of feature information and a cleaning identifier, the plurality of feature information respectively corresponds to a plurality of preset cache features, and the cleaning identifier is used for indicating a cleaning category;

an accuracy judging module, configured to determine the judgment accuracy under a target condition based on the feature information of the plurality of cache data under a first cache feature and the cleaning identifiers respectively corresponding to the plurality of cache data, wherein the first cache feature is any one of the plurality of cache features, and the target condition is a condition in which the first cache feature is taken as the judgment standard to judge the cleaning categories respectively corresponding to the plurality of cache data;

a model construction module, configured to construct a cache data cleaning model according to the judgment accuracy under a plurality of conditions, wherein the plurality of conditions are conditions in which the plurality of cache features are respectively taken as the judgment standard to judge the cleaning categories respectively corresponding to the plurality of cache data, the cache data cleaning model is a decision tree that judges the cleaning category of cache data by taking the plurality of cache features as judgment standards in sequence, a cache feature with higher judgment accuracy is the parent node of a cache feature with lower judgment accuracy in the decision tree, the cache feature with lower judgment accuracy is connected to a first branch of the cache feature with higher judgment accuracy, and the first branch is the branch whose judgment result is not cleaning;

wherein the model construction module is specifically configured to: determine, according to the judgment accuracy under the plurality of conditions, a second cache feature corresponding to the maximum judgment accuracy; delete, from the training sample set, the second cache feature, the feature information corresponding to the second cache feature, and second cache data; return to the step of determining the judgment accuracy under the target condition based on the feature information of the plurality of cache data under the first cache feature and the cleaning identification corresponding to each of the plurality of cache data, until the judgment accuracy under each of the plurality of conditions is less than the preset accuracy or only one cache feature remains in the sample information; and construct the cache data cleaning model according to the order in which the plurality of cache features are deleted from the training sample set, wherein the second cache data is the cache data whose cleaning identification indicates cleaning when the second cache feature is taken as the judgment standard to judge the cleaning category corresponding to each of the plurality of cache data.

10. A data cleaning device, comprising:

the cleaning module is configured to clean, when it is detected that the total amount of cache data corresponding to a target cache class is greater than a preset data amount, or the ratio of the total amount of cache data corresponding to the target cache class to the total cache space corresponding to the target cache class is greater than a preset ratio threshold, cache data corresponding to the target cache class through a cache data cleaning model, where the cache data cleaning model is constructed by a method according to any one of claims 1 to 7.

11. A computer device comprising a memory and one or more processors configured to execute one or more computer programs stored in the memory, which when executed, cause the computer device to implement the method of any of claims 1-7 or claim 8.

12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-7 or claim 8.