WO2023155849A1 - Time-decay-based sample deletion method, device thereof, and storage medium - Google Patents

Time-decay-based sample deletion method, device thereof, and storage medium

Info

Publication number
WO2023155849A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
samples
sample
storage space
attribute
Prior art date
Application number
PCT/CN2023/076554
Other languages
English (en)
French (fr)
Inventor
屠要峰
杨洪章
杨仝
高军
郭斌
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023155849A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Definitions

  • the present disclosure relates to but is not limited to the technical field of data processing, and in particular relates to a method for deleting samples based on time decay, its device, and a storage medium.
  • the trained artificial intelligence model implements the functions of the application.
  • the samples may have low value density and a severe imbalance between positive and negative samples. Sample types easily become unbalanced, with too many old samples and too few newly collected samples. As a result, the accuracy of the artificial intelligence model decreases and model aging occurs.
  • a possible solution is to delete obsolete samples using the age of a sample (how long ago it was stored) as the basis for deletion, so as to balance the sample types. One example is log-style cyclic overwriting, in which samples are collected into a fixed storage space.
  • the samples are stored in chronological order, and when the capacity of the storage space is exhausted, newly collected samples are stored by overwriting from the head of the storage space.
  • simply deleting all old samples in chronological order means that the small number of old samples that still have useful value are not retained, which reduces the quality of the samples.
  • Embodiments of the present disclosure provide a method for deleting a sample based on time decay, its device, and a storage medium.
  • an embodiment of the present disclosure provides a method for deleting samples based on time decay, including: acquiring multiple samples; saving the samples to a storage space, wherein the storage space corresponds to a storage attribute, the storage attribute changes with the storage time of the samples stored in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to the same storage attribute decays with the storage time; calculating the heat value of each sample in the storage space that belongs to the target storage attribute; and deleting samples in the storage space according to the heat value and the current preset deletion capacity of the storage space.
  • an embodiment of the present disclosure provides a sample deletion device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein when the processor executes the computer program, it implements the time-decay-based sample deletion method described in the first aspect.
  • an embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the method for deleting samples based on time decay as described in the first aspect.
  • an embodiment of the present disclosure further provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium; the processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the time-decay-based sample deletion method described in the first aspect.
  • FIG. 1 is a flow chart of the steps of a method for deleting samples based on time decay provided by an embodiment of the present disclosure
  • Fig. 2 is a flow chart of steps of a sample deletion method provided by another embodiment of the present disclosure.
  • FIG. 3 is a flow chart of steps for saving samples provided by another embodiment of the present disclosure.
  • Fig. 4 is a flow chart of steps of a sample deletion method provided by another embodiment of the present disclosure.
  • Fig. 5 is a flow chart of the steps of the sample in the scheduling linked list provided by another embodiment of the present disclosure.
  • FIG. 6 is a flow chart of steps of a sample deletion method provided by another embodiment of the present disclosure.
  • Fig. 7 is a flow chart of steps for obtaining a sample heat value provided by another embodiment of the present disclosure.
  • FIG. 8 is a flow chart of steps for changing storage attributes of a storage space provided by another embodiment of the present disclosure.
  • FIG. 9 is a flow chart of steps for determining a preset deletion capacity provided by another embodiment of the present disclosure.
  • Fig. 10 is a schematic diagram of a linked list provided by another embodiment of the present disclosure.
  • Fig. 11 is a schematic diagram of sample deletion provided by another embodiment of the present disclosure.
  • Fig. 12 is a graph of the ratio of the historical heat value to the current heat value provided by another embodiment of the present disclosure.
  • Fig. 13 is a block diagram of a sample deletion device provided by another embodiment of the present disclosure.
  • Fig. 14 is a flowchart of steps of a sample deletion method provided by another embodiment of the present disclosure.
  • Fig. 15 is a schematic structural diagram of a sample deletion device provided by another embodiment of the present disclosure.
  • failures of hard disks with a short service life are mostly caused by wear of the rotating platters, while failures of hard disks with a long service life are mostly caused by exhaustion of the redirection space. If the proportion of old data is too large and the proportion of newly collected data is too small, typical model aging problems occur, and the accuracy of hard disk failure prediction becomes lower and lower. Moreover, the growing number of samples seriously occupies the storage space needed for business data.
  • the acquisition end at the edge often has limitations in storage, computation, transmission, etc., or, for reasons such as personal privacy protection, policies and laws, the data collected at the acquisition end cannot be transmitted to the central node and can only be stored temporarily while waiting for the central node's command to read it, which creates a risk of exhausting the storage capacity. For example, on-board electronics comprise various components, and many types of samples are collected and stored at a high frequency. However, the capacity of on-board storage devices is limited, so sooner or later there is not enough room to keep storing the samples.
  • a common practice is to use only the age of the data as the basis for deletion; for example, when the time elapsed since a file was last executed exceeds a predetermined time, an aging effect is applied to the file;
  • another example is a driving recorder, which uses log-style cyclic overwriting, that is, samples are stored in a fixed storage space in the order in which they were collected, and when the space is exhausted, storage continues by overwriting from the head of the storage space.
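The log-style cyclic overwriting described above can be sketched as follows. This is a minimal illustration only; the class name `CyclicLog` and its fields are hypothetical and do not come from the disclosure.

```python
class CyclicLog:
    """Fixed-capacity store that overwrites from the head once it is full."""

    def __init__(self, capacity: int) -> None:
        self.slots = [None] * capacity
        self.next = 0  # index of the slot that will be written next

    def append(self, sample) -> None:
        # The oldest slot is overwritten regardless of how useful it still is.
        self.slots[self.next] = sample
        self.next = (self.next + 1) % len(self.slots)


log = CyclicLog(3)
for s in ["s1", "s2", "s3", "s4"]:
    log.append(s)
```

After the fourth append, `s1` has been overwritten purely because it is the oldest sample, which is the weakness the disclosure sets out to address.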
  • Embodiments of the present disclosure include a method for deleting samples based on time decay, its device, and a storage medium, wherein the method for deleting samples based on time decay includes: acquiring multiple samples; saving the samples to a storage space, wherein the storage space corresponds to a storage attribute, the storage attribute changes with the storage time of the samples stored in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to the same storage attribute decays with the storage time; calculating the heat value of each sample in the storage space belonging to the target storage attribute; and deleting samples in the storage space according to the heat value and the current preset deletion capacity of the storage space.
  • obsolete samples are deleted according to the heat value of the sample and the preset deletion capacity of the storage space.
  • the technology of this application can retain valuable data in old samples, thereby effectively improving the quality of samples.
  • FIG. 1 is a flow chart of the steps of a time-decay-based sample deletion method provided by an embodiment of the present disclosure.
  • the sample deletion method includes but is not limited to the following steps:
  • Step S110 acquiring multiple samples
  • the time for obtaining samples can be determined according to information such as sample collection frequency, collection volume, and storage capacity of the storage space.
  • This embodiment of the present application does not limit the time threshold or time unit for obtaining samples. For example, a preset number of samples may be obtained every 6 months; the time threshold may also be expressed in minutes, hours or days, and those skilled in the art can adjust it according to actual conditions.
  • Step S120, saving the samples to the storage space, wherein the storage space corresponds to a storage attribute, and the storage attribute changes with the storage time of the samples stored in the storage space.
  • Different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to the same storage attribute decays with the storage time.
  • the storage attribute characterizes the storage stage that the storage space is in, which is determined by the storage time of the samples it stores.
  • the stages may include the collection sample stage, the stock sample stage, the decay sample stage, and the discard sample stage, wherein each stage corresponds to a different storage attribute, and the storage attribute of the storage space changes with the storage time.
  • the storage space for storing newly acquired samples is in the collection sample stage, corresponding to storage attribute A; a storage space whose stored samples have reached the corresponding storage time is in the decay sample stage, corresponding to storage attribute B. After a sample is acquired, it is saved to the storage space.
  • the startup time of the storage space is calculated from the storage time of the sample. As the startup time increases, the storage attribute of the storage space changes accordingly, and different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to the same storage attribute decays with the storage time.
  • Step S130 calculating the heat value of each sample in the storage space attributable to the target storage attribute.
  • the target storage attribute may be the storage attribute corresponding to the inventory sample stage, the decay sample stage, and the discarded sample stage.
  • the heat value is used to represent the use value of the sample.
  • the use value of the sample with a low heat value is low.
  • the heat value of each sample is obtained, and the heat value is used as the basis for deleting the sample, so that the sample with use value can be retained.
  • Step S140, deleting samples in the storage space according to the heat value and the current preset deletion capacity of the storage space.
  • the technical solution of this application takes the storage attribute of the storage space into account in addition to the storage time of the samples, and uses the heat value of each sample as the basis for deletion. This retains valuable old samples, thereby improving the quality of the samples and providing effective training data for the artificial intelligence model.
  • step S140 in the embodiment shown in FIG. 1 includes but is not limited to the following steps:
  • Step S210, according to the heat value of each sample in each storage space and the current preset deletion capacity of each storage space, deleting samples from each storage space correspondingly, so that the total amount of deleted samples is equal to the amount of acquired samples.
  • samples are deleted from each storage space accordingly, so that the total amount of deleted samples is equal to the amount of acquired samples and the total number of samples saved across all storage spaces remains unchanged, which effectively avoids data expansion in the storage spaces and prevents their storage capacity from being exhausted.
  • the sample deletion method further includes but is not limited to the following steps:
  • Step S310 divide the storage space into multiple linked lists according to the heat value, and the linked lists correspond to linked list identifiers;
  • Step S320, saving the sample to the corresponding linked list according to the linked list identifier and the heat value.
  • the storage space is divided into multiple linked lists according to the heat value, each linked list corresponds to a linked list identifier, and the linked list identifier is used to represent the preset heat value range to which the samples in the linked list belong.
  • the samples are saved to the corresponding linked lists, so that a bidirectional index is established between the samples belonging to a preset heat value range and the linked list identifier, which can effectively improve the efficiency of sample deletion; for example, when it is determined that the heat value of a target sample to be deleted belongs to the first heat value range, the first linked list identifier corresponding to the first heat value range can be determined, and the samples in the linked list corresponding to the first linked list identifier are deleted accordingly.
  • the embodiment of the present application does not limit the specific method of dividing the storage space according to the heat value.
  • the most significant bits of the binary heat value can be used to divide the storage space, and these bits are used as the linked list identifier of each resulting linked list; refer to FIG. 10, which is a schematic diagram of a linked list provided by another embodiment of the present disclosure.
  • the storage space can be segmented by taking the two most significant bits of the binary heat value to obtain 11 linked lists. The 11 linked lists start from an array of linked list headers and form linked lists of different levels from bottom to top according to the range of heat values: linked list 00, whose two most significant binary bits are 00, that is, the heat value is not lower than binary 00 (decimal 0), with a heat value range of 0-1; linked list 01, whose two most significant binary bits are 01, that is, the heat value is not lower than binary 01 (decimal 1), with a heat value range of 1-2; linked list 10, whose two most significant binary bits are 10, that is, the heat value is not lower than binary 10 (decimal 2), with a heat value range of 2-3; linked list 11, whose two most significant binary bits are 11, that is, the heat value is not lower than binary 11 (decimal 3), with a heat value range of 3-4; linked list 100, whose two most significant binary bits are 10, that is, the heat value is not lower than binary
  • the samples are saved to the corresponding linked lists according to the linked list identifier and heat value as follows. For example, if the heat value of a sample is 5, its binary representation is 101 and its two most significant bits are 10, so the sample is saved to the linked list whose identifier is 100, whose heat value range is 4 to 6. The storage space is divided into linked lists sorted from low to high by heat value range, but the samples within each linked list are not sorted; the purpose is to sacrifice a small amount of sorting accuracy to avoid expensive sorting overhead.
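The mapping from a heat value to a linked list identifier can be sketched as below, assuming the heat value is a non-negative integer. The function names `linked_list_id` and `heat_range` are hypothetical; the rule itself (keep the two most significant bits and zero the remaining bits) follows the worked example above, in which heat value 5 (binary 101) lands in linked list 100 with range 4 to 6.

```python
def linked_list_id(heat: int) -> int:
    """Keep the two most significant bits of `heat` and zero the lower bits."""
    shift = max(heat.bit_length() - 2, 0)
    return (heat >> shift) << shift


def heat_range(heat: int) -> tuple[int, int]:
    """Return the half-open range [low, high) covered by the list holding `heat`."""
    shift = max(heat.bit_length() - 2, 0)
    low = linked_list_id(heat)
    return low, low + (1 << shift)


for h in (0, 1, 2, 3, 5, 7, 13):
    print(h, format(linked_list_id(h), "b"), heat_range(h))
```

Because the width of each range doubles every two lists, low heat values are separated finely while high heat values share coarse lists, and no per-list sorting is needed.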
  • the samples of the linked list are deleted in order from low to high according to the multi-level linked list until the amount of deleted samples meets the preset deletion capacity of the storage space.
  • step S140 in the embodiment shown in FIG. 1 includes but is not limited to the following steps:
  • Step S410, determining the first heat threshold according to the linked list identifier and the preset deletion capacity
  • Step S420, deleting samples whose heat values are smaller than the first heat threshold.
  • the linked list identifier represents the preset heat value range of the samples in the linked list; the first heat threshold is determined according to the linked list identifiers and the preset deletion capacity, and the first heat threshold is the lowest heat value of the samples to be retained, so that the target linked lists whose heat values are less than the first heat threshold can be determined. When there are multiple target linked lists, the samples of the target linked lists are deleted in order of heat value from low to high until the number of deleted samples satisfies the preset deletion capacity, so that old samples with useful value are retained and the quality of the samples is improved.
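One way to read steps S410 and S420 is sketched below: walk the linked lists from the lowest identifier upward, delete until the preset deletion capacity is met, and report the lowest list that still holds samples as the threshold. The function name, the dictionary layout, and the choice to delete only part of the boundary list are assumptions made for illustration.

```python
from typing import Optional


def delete_lowest_lists(
    lists: dict[int, list[str]], capacity: int
) -> tuple[Optional[int], list[str]]:
    """Delete `capacity` samples, lowest heat range first; return (threshold, deleted)."""
    deleted: list[str] = []
    for list_id in sorted(lists):  # lowest heat value range first
        bucket = lists[list_id]
        take = min(len(bucket), capacity - len(deleted))
        deleted.extend(bucket[:take])  # order inside one list is arbitrary
        del bucket[:take]
        if len(deleted) == capacity:
            break
    # Lowest list that still holds samples: every list below it was emptied.
    threshold = min((i for i, b in lists.items() if b), default=None)
    return threshold, deleted


lists = {0: ["a", "b"], 1: ["c"], 4: ["d", "e"], 8: ["f"]}
threshold, deleted = delete_lowest_lists(lists, capacity=4)
```

No sample is ever compared against another sample; only the list identifiers are ordered, which is what keeps the deletion cheap.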
  • the sample deletion method of the present application also includes but is not limited to the following steps:
  • Step S510, when it is detected that the heat value of a sample in a linked list has changed, moving that sample to a new linked list according to the changed heat value.
  • the heat values of the samples in the linked lists are checked in real time or periodically, and when the heat value of a sample is detected to have changed, the sample is moved to a new linked list according to the changed heat value. This keeps the division of samples accurate, avoids deletion errors, and ensures the quality of the samples.
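Step S510 can be sketched with a two-way index: one map from list identifier to the samples it holds, and one map from sample to its current heat value. The class `HeatLists` and its methods are hypothetical names, and Python sets stand in for the linked lists for brevity.

```python
from collections import defaultdict


def linked_list_id(heat: int) -> int:
    """Keep the two most significant bits of `heat` and zero the lower bits."""
    shift = max(heat.bit_length() - 2, 0)
    return (heat >> shift) << shift


class HeatLists:
    def __init__(self) -> None:
        self.lists = defaultdict(set)  # list ID -> sample keys
        self.heat = {}                 # sample key -> current heat value

    def set_heat(self, key: str, heat: int) -> None:
        """Record a new heat value and move the sample if its list changes."""
        new_id = linked_list_id(heat)
        old = self.heat.get(key)
        if old is not None and linked_list_id(old) != new_id:
            self.lists[linked_list_id(old)].discard(key)
        self.lists[new_id].add(key)
        self.heat[key] = heat


index = HeatLists()
index.set_heat("x", 5)  # lands in list 4 (binary 100)
index.set_heat("x", 9)  # moves to list 8 (binary 1000)
```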
  • step S140 in the embodiment shown in FIG. 1 includes but is not limited to the following steps:
  • Step S610 determine the target sample in the storage space according to the current preset deletion capacity, wherein the heat value of the target sample is smaller than the preset second heat threshold;
  • Step S620 delete the target sample.
  • the second heat threshold of the storage space is determined according to the preset deletion capacity, and the second heat threshold is the lowest heat value of the samples that need to be retained in the current storage space. Target samples amounting to the preset deletion capacity, whose heat values are less than the second heat threshold, are deleted from the storage space, which removes obsolete samples of low value and improves the data quality of the samples.
  • step S130 in the embodiment shown in FIG. 1 includes but is not limited to the following steps:
  • Step S710, obtaining a sample proportion value according to samples with different sample types
  • Step S720, obtaining the heat value of the current sample according to the sample type value, the sample proportion value, the number of accesses to the current sample, and the historical heat value of the current sample.
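Steps S610 and S620 can be sketched for a single storage space without linked lists by selecting the samples with the lowest heat values. The function name and the dictionary layout are assumptions; `heapq.nsmallest` is a standard-library call. When several samples tie at the boundary, a deleted sample may have a heat value equal to the threshold rather than strictly below it.

```python
import heapq
from typing import Optional


def select_targets(
    heat_by_sample: dict[str, float], capacity: int
) -> tuple[Optional[float], list[str]]:
    """Return (second heat threshold, target samples) for one storage space."""
    targets = heapq.nsmallest(capacity, heat_by_sample, key=heat_by_sample.get)
    chosen = set(targets)
    retained = [h for s, h in heat_by_sample.items() if s not in chosen]
    # Threshold = lowest heat value among the samples that are kept.
    return min(retained, default=None), targets


threshold, targets = select_targets({"a": 3, "b": 1, "c": 7, "d": 2}, capacity=2)
```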
  • the embodiment of the present application does not limit the specific method of obtaining the heat value of the current sample according to the sample type value, sample proportion value, visit times of the current sample, and historical heat value of the current sample.
  • the heat value can be calculated according to a formula in which Store_{k-1} is the historical heat value, a is the sample type value, b is the sample proportion value, and Visit_k is the number of accesses to the current sample.
  • the embodiment of the present application does not limit how the historical heat value is processed. The square root of the historical heat value may be taken to weaken the influence of the historical heat value on the current heat value, or the cube root or logarithm of the historical heat value may be taken; refer to FIG. 12.
  • the curve 1210 corresponds to the ratio of the original (that is, unprocessed) historical heat value to the current heat value
  • the curve 1220 corresponds to the ratio of the square root of the historical heat value to the current heat value
  • the curve 1230 corresponds to the ratio of the cube root of the historical heat value to the current heat value
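The damping options for the historical heat value can be compared with a short sketch. Only the damping step is shown; how the damped history is combined with the other terms is not reproduced here. The names `DAMPING` and `damped_history` are hypothetical, and `math.log1p` is an assumption chosen so that a historical heat value of 0 stays defined.

```python
import math

# Candidate ways to weaken the influence of the historical heat value.
DAMPING = {
    "none": lambda h: h,
    "sqrt": math.sqrt,
    "cbrt": lambda h: h ** (1.0 / 3.0),
    "log": math.log1p,  # assumption: log(1 + h) rather than log(h)
}


def damped_history(history: float, method: str = "sqrt") -> float:
    """Return the historical heat value after the chosen damping."""
    return DAMPING[method](history)


for h in (1, 64, 4096):
    print(h, {m: round(damped_history(h, m), 2) for m in DAMPING})
```

The larger the historical heat value, the more strongly the square root and cube root compress it, so a sample that was popular long ago cannot dominate the current heat value indefinitely.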
  • the sample types can include minority type samples and majority type samples. A minority type sample belongs to a sample type whose sample proportion value is less than the preset sample proportion threshold.
  • a majority type sample belongs to a sample type whose sample proportion value is greater than the preset sample proportion threshold.
  • when the detected sample is a minority type sample, a is set to 1, and when the detected sample is a majority type sample, a is set to 0, which gives minority type samples a higher weight. Because samples with low heat values are deleted first, deletion tends to remove majority type samples, so the sample types become more and more balanced.
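The sample proportion value b and the sample type value a can be sketched as follows. The function names and the 0.5 threshold in the example are hypothetical, and treating a proportion exactly equal to the threshold as majority is an assumption, since the text only covers the strictly smaller and strictly greater cases.

```python
from collections import Counter


def sample_proportions(types: list[str]) -> dict[str, float]:
    """Proportion b of each sample type among all samples."""
    counts = Counter(types)
    return {t: c / len(types) for t, c in counts.items()}


def sample_type_value(proportion: float, ratio_threshold: float) -> int:
    """a = 1 for a minority type sample, a = 0 for a majority type sample."""
    return 1 if proportion < ratio_threshold else 0


types = ["healthy"] * 9 + ["failed"]
b = sample_proportions(types)
a = {t: sample_type_value(p, ratio_threshold=0.5) for t, p in b.items()}
```

Here the rare `failed` type gets a = 1 and the common `healthy` type gets a = 0, so the rare samples end up with higher heat values and survive deletion longer.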
  • the embodiment of the present application does not limit the specific update time of the current heat value, and the specific update time of the heat value may be the time of acquiring a new sample, and those skilled in the art may adjust it according to the actual situation.
  • the storage spaces have storage serial numbers, and the storage serial number of a storage space increases with the enabling time of the storage space; there are various types of storage attributes, and different types of storage attributes correspond to different maximum numbers of storage spaces; the sample deletion method of this application also includes but is not limited to the following steps:
  • Step S810, when the number of storage spaces with the same storage attribute is greater than or equal to the corresponding maximum number of storage spaces, changing the storage attributes of the corresponding storage spaces in the order of their storage serial numbers.
  • FIG. 11 is a schematic diagram of sample deletion provided by another embodiment of the present disclosure.
  • there are multiple types of storage attributes, which may include the collection stage storage space, the stock stage storage space, the decay stage storage space, and the discarding stage storage space.
  • there are multiple storage spaces, and each type of storage attribute may have multiple storage spaces; the storage spaces corresponding to each type of storage attribute have storage serial numbers sorted from small to large.
  • the storage serial number increases with the enabling time of the storage space.
  • the storage attributes of the corresponding storage spaces are changed in the order of the storage serial numbers from small to large. For example, when the number of storage spaces in the stock stage is greater than or equal to the corresponding maximum number of storage spaces, the storage attribute of the stock stage storage space with the smallest storage serial number is changed to that of the decay stage.
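The stage changes of step S810 can be sketched with one queue per stage. The names `STAGES` and `enable_space` are hypothetical, and the limits in the example (1 collection bucket, 2 stock buckets, 4 decay buckets) are illustrative. The sketch assumes the check is made when a new storage space is enabled: a stage that is already at its maximum hands its smallest serial number to the next stage.

```python
from collections import deque

STAGES = ["collect", "stock", "decay", "discard"]


def enable_space(areas: dict[str, deque], max_spaces: dict[str, int], serial: int) -> None:
    """Add a newly enabled space to 'collect', cascading older spaces onward."""
    incoming = serial
    for stage in STAGES:
        areas[stage].append(incoming)
        limit = max_spaces.get(stage)  # the discard stage has no limit here
        if limit is None or len(areas[stage]) <= limit:
            return
        incoming = areas[stage].popleft()  # smallest serial number moves on


areas = {stage: deque() for stage in STAGES}
limits = {"collect": 1, "stock": 2, "decay": 4}
for serial in range(1, 6):
    enable_space(areas, limits, serial)
```

After five spaces have been enabled, the newest one is collecting, the two before it are in stock, and the two oldest have moved to the decay stage.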
  • the storage attribute includes a first storage attribute, a second storage attribute, a third storage attribute, and a fourth storage attribute;
  • the first storage attribute is used to indicate that the storage space is in the stage of collecting samples;
  • the second storage attribute is used to represent that the storage space is in the stock sample stage;
  • the third storage attribute is used to represent that the storage space is in the decay sample stage;
  • the fourth storage attribute is used to represent that the storage space is in the discarded sample stage;
  • the sample deletion method also includes but not limited to the following steps:
  • dividing the storage device into a first storage area, a second storage area, a third storage area and a fourth storage area, wherein the first storage area corresponds to the first storage attribute, the second storage area corresponds to the second storage attribute,
  • the third storage area corresponds to the third storage attribute, and the fourth storage area corresponds to the fourth storage attribute;
  • step S120 includes but is not limited to the following steps:
  • after changing the storage space in the first storage area with the longest sample storage time to belong to the second storage area, the sample deletion method also includes but is not limited to the following steps:
  • after changing the storage space in the second storage area with the longest sample storage time to belong to the third storage area, the sample deletion method also includes but is not limited to the following steps:
  • the storage space in the third storage area with the longest storage time for saving samples is changed to belong to the fourth storage area.
  • the number of storage spaces corresponding to different storage areas is different. Since the first storage area collects newly acquired samples, it is necessary to ensure that the first storage area has enough storage space to collect new samples. Therefore, when the number of storage spaces in the first storage area is greater than or equal to the maximum number of storage spaces of the first storage area, the storage space in the first storage area with the longest sample storage time is changed to belong to the second storage area; after that change, when the number of storage spaces in the second storage area is greater than or equal to the maximum number of storage spaces of the second storage area, the storage space in the second storage area with the longest sample storage time is changed to belong to the third storage area; after that change, when the number of storage spaces in the third storage area is greater than or equal to the maximum number of storage spaces of the third storage area, the storage space with the
  • sample deletion method further includes but not limited to the following steps:
  • Step S910, when the storage space belongs to the first storage area, the preset deletion capacity is 0;
  • Step S920, when the storage space belongs to the second storage area, the preset deletion capacity is 0;
  • Step S930 when the storage space belongs to the third storage area, the preset deletion capacity is determined according to the current storage time of the samples stored in the storage space in the third storage area;
  • Step S940 when the storage space belongs to the fourth storage area, the preset deletion capacity is determined according to the amount of storage space belonging to the third storage area.
  • the preset deletion capacity is 0, that is, the samples in the storage space are not deleted.
  • as time passes, more and more samples are stored in the storage space, and it is necessary to delete outdated, low-value samples.
  • the current storage attribute of the storage space is the third storage attribute (corresponding to the third storage area)
  • there are multiple first storage spaces belonging to the third storage area, and the preset deletion capacity of each first storage space is determined according to the order of the current storage time of the samples saved in the first storage spaces in the third storage area, and decays with the storage time;
  • the preset deletion capacity is determined according to the number of storage spaces belonging to the third storage area, so as to ensure that the target samples to be deleted are old enough and to effectively guarantee the quality of the samples.
  • the four first storage spaces are sorted by the current storage time of the samples they store, and according to the sorting result, 1/2^1, 1/2^2, 1/2^3, and 1/2^4 of the capacity is deleted from the corresponding first storage spaces respectively (for example, 1/2^1 of the capacity is deleted from the first storage space that stored samples earliest, and 1/2^4 of the capacity is deleted from the first storage space that stored samples most recently; which specific samples are deleted is determined according to the heat value of each sample in the bucket); the preset deletion capacity corresponding to the fourth storage attribute is 1/2^4 of the storage capacity of the second storage space, wherein the second storage space belongs to the fourth storage attribute; thus, whenever the third storage space, which belongs to the first storage attribute, acquires one storage space's worth of samples, the first storage spaces and the second storage space together delete a total of 1 bucket of samples, that is, the total amount of samples deleted is equal to the number of samples
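The arithmetic of this example can be checked with a short sketch, assuming t decay-stage buckets of equal capacity ordered from earliest-stored to most recently stored. The function names are hypothetical; exact fractions are used so that the sum is not blurred by floating-point rounding.

```python
from fractions import Fraction


def decay_quotas(t: int) -> list[Fraction]:
    """Fraction of one bucket deleted from each decay-stage bucket, earliest first."""
    return [Fraction(1, 2 ** i) for i in range(1, t + 1)]


def discard_quota(t: int) -> Fraction:
    """Fraction of one bucket deleted from the discard-stage storage space."""
    return Fraction(1, 2 ** t)


t = 4
quotas = decay_quotas(t)              # 1/2, 1/4, 1/8, 1/16
total = sum(quotas) + discard_quota(t)
```

The decay-stage quotas sum to 1 - 1/2^t, and the discard-stage quota of 1/2^t supplies the remainder, so exactly one bucket's worth of samples is deleted for each newly collected bucket, for any t.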
  • each storage attribute in the following two examples is represented by "bucket", and the storage capacity of the buckets corresponding to each storage attribute is the same.
  • FIG. 13 is a schematic module diagram of a sample deletion device provided by another embodiment of the present disclosure.
  • the sample deletion device 1300 includes a storage space 1310, a heat value update module 1320, a linked list division module 1330 and a sample deletion module 1340, wherein,
  • the storage space includes storage spaces of four types of storage attributes: storage space 1311 of the collection stage; storage space 1312 of the inventory stage; storage space 1313 of the decay stage; and storage space 1314 of the discarding stage.
  • the functions of each module of the sample deletion device 1300 are described below:
  • the storage capacity of the collection-stage storage space 1311 is 1 bucket, which is used to collect samples; the samples in the collection-stage bucket are not used as training data for the artificial intelligence model.
  • the storage capacity of the stock-stage storage space 1312 is t/2 buckets, where t is commonly 4 or 8.
  • the samples in the stock-stage buckets are mainly used as training data for the artificial intelligence model.
  • the storage capacity of the decay-stage storage space 1313 is t buckets. As time goes by, the samples in the decay-stage buckets become gradually less likely to be used for training and modeling, so they are deleted periodically.
  • no storage capacity is specified for the discard-stage storage space 1314; since only a very small number of its samples may still be accessed, most of them are deleted.
  • the linked list division module 1330 divides the storage space into linked lists by the two most significant bits of the binary heat value and establishes a bidirectional index between each linked list identifier and the samples in that list; when a heat value changes, the sample can be moved quickly to the corresponding target linked list according to the two most significant bits of the new heat value, and during decay the target samples can be deleted quickly through the linked lists.
  • the smaller the heat value, the finer the range covered by each linked list.
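The division by the two most significant bits can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes non-negative integer heat values, and the function name `list_id` and the cap `max_id` (binary 100000, matching the 11-list layout described for FIG. 10) are hypothetical.

```python
def list_id(heat: int, max_id: int = 0b100000) -> int:
    """Return the linked-list identifier for a non-negative integer heat value.

    The identifier keeps the two most significant bits of the heat value and
    zeroes the remaining bits. max_id caps the top list (an assumption).
    """
    if heat < 0:
        raise ValueError("heat must be non-negative")
    if heat < 4:
        return heat  # lists 00, 01, 10 and 11 each cover a single value
    shift = heat.bit_length() - 2          # number of low bits to drop
    return min((heat >> shift) << shift, max_id)


# Heat 5 is binary 101; its top two bits are 10, so it lands in list 100.
for heat in (0, 3, 5, 7, 13, 40):
    print(heat, format(list_id(heat), "b"))
```

Because the identifier only keeps two bits, the lists for low heat values are narrow (width 1) and the lists for high heat values are wide (width 2, 4, 8, ...), which is the sense in which low heat values are divided more finely.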
  • the heat value update module 1320 maintains the heat values of all samples in the stock stage, the decay stage, and the discard-stage storage space 1314. When samples are deleted, the samples with the lowest heat value are deleted first.
  • the heat score Store_k of a sample in the k-th bucket is calculated by the following formula [formula missing in the source]; for the specific parameter explanations, refer to the principle description of the embodiment in FIG. 5, and details are not repeated here.
  • the sample deletion module 1340, when the preset time threshold arrives, deletes 1/2^1, 1/2^2, 1/2^3 ... 1/2^t of a bucket's capacity from the t buckets of the decay stage respectively, and deletes 1/2^t of a bucket's capacity from the discard-stage storage space 1314; the sum of these deleted capacities is exactly the capacity of 1 bucket, so the total number of samples remains unchanged.
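The claim that the deleted capacities add up to exactly one bucket can be checked with exact rational arithmetic. The sketch below is only a verification aid; the function name `deletion_fractions` is hypothetical.

```python
from fractions import Fraction


def deletion_fractions(t: int):
    """Return (per-bucket decay-stage fractions, discard-stage fraction)."""
    decay = [Fraction(1, 2 ** i) for i in range(1, t + 1)]  # 1/2^1 ... 1/2^t
    discard = Fraction(1, 2 ** t)                           # 1/2^t
    return decay, discard


for t in (4, 8):  # the common values of t named in the text
    decay, discard = deletion_fractions(t)
    print(t, sum(decay) + discard)
```

The decay-stage fractions sum to 1 - 1/2^t, and the discard-stage fraction 1/2^t fills the remainder, so the total is 1 bucket for any t.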
  • the relationship between the modules of the sample deletion device shown in FIG. 13 is as follows: at every preset time threshold, the collection-stage bucket is moved to the front of the stock stage, the last bucket of the stock stage is moved to the front of the decay stage, and the last bucket of the decay stage is moved into the discard stage.
  • the heat value update module 1320 calculates the heat values of the samples in the stock stage, the decay stage, and the discard stage, and multiple linked lists are divided according to the heat values; the sample deletion module 1340 deletes samples from the decay-stage storage space and the discard-stage storage space 1314.
  • the method steps of Example 1 and Example 2 are applied to the sample deletion device shown in FIG. 13.
  • FIG. 14 is a flow chart of the steps of a sample deletion method provided by another embodiment of the present disclosure.
  • the sample deletion method includes the following steps:
  • Step S1410: initialize the system, start collecting samples, and save the collected samples to the collection-stage storage space;
  • Step S1420: as time goes by, at every preset time threshold (for example, q months), change the storage attribute of the collection-stage storage space with the earliest storage time to the stock stage, that is, move the collection-stage bucket with the earliest storage time to the front of the stock-stage storage space, and use a new bucket in the collection-stage storage space for data collection;
  • Step S1430: as time goes by, after q*t/2 months, when all t/2 bucket positions of the stock-stage storage space are full, move the bucket with the earliest activation time in the stock-stage storage space to the decay-stage storage space, that is, move the last bucket in the stock-stage storage space to the front of the decay-stage storage space;
  • Step S1440: at every preset time threshold, for example every q months, move the bucket with the earliest activation time in the decay-stage storage space to the discard-stage storage space, that is, move the last bucket in the decay-stage storage space into the discard-stage storage space;
  • Step S1450: determine the first preset deletion capacity of each bucket in the decay-stage storage space, delete samples from the decay-stage storage space according to the heat values of the samples in each bucket and each first preset deletion capacity, and delete samples amounting to the second preset deletion capacity from the discard-stage storage space; for example, the t buckets of the decay-stage storage space delete 1/2^1, 1/2^2, 1/2^3 ... 1/2^t of a bucket's capacity respectively (the first preset deletion capacities), and the samples with the lowest heat value in a bucket are deleted first; as time goes by, after a further q*t months, all t bucket positions of the decay stage are full, and the last bucket of the decay stage is moved into the discard stage.
  • thereafter, every q months, the last bucket of the decay stage is moved into the discard stage, the heat values of all samples in the discard stage are calculated together, and 1/2^t of a bucket's capacity (the second preset deletion capacity) is deleted.
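Steps S1410 to S1450 can be sketched as a small simulation that tracks only volumes (in units of one bucket) rather than individual samples. It is an illustration under stated assumptions, not the patent's implementation: the collection bucket is assumed to fill completely in every period, a bucket in its k-th period in the decay stage deletes 1/2^k of a bucket (the schedule followed by Example 2), and the name `simulate` is hypothetical.

```python
from collections import deque
from fractions import Fraction


def simulate(periods: int, t: int = 4) -> Fraction:
    """Return the total stored volume, in buckets, after `periods` rotations."""
    stock = deque()              # full buckets, newest on the left
    decay = deque()              # [remaining volume, periods spent in decay]
    discard = Fraction(0)        # pooled volume of the discard stage
    collecting = Fraction(1)     # the single collection bucket (assumed full)

    for _ in range(periods):
        # S1420: the collection bucket moves to the front of the stock stage.
        stock.appendleft(collecting)
        collecting = Fraction(1)
        # S1430: the stock stage holds t/2 buckets; the oldest moves to decay.
        if len(stock) > t // 2:
            decay.appendleft([stock.pop(), 0])
        # S1440: the decay stage holds t buckets; the oldest moves to discard.
        if len(decay) > t:
            volume, _age = decay.pop()
            discard += volume
        # S1450: the k-th period in decay deletes 1/2^k of a bucket ...
        for bucket in decay:
            bucket[1] += 1
            bucket[0] -= Fraction(1, 2 ** bucket[1])
        # ... and the discard stage deletes up to 1/2^t of a bucket.
        discard -= min(discard, Fraction(1, 2 ** t))

    return collecting + sum(stock) + sum(b[0] for b in decay) + discard


print(simulate(20, t=4), simulate(50, t=8))
```

Under these assumptions the stored volume stops growing once every stage is full and stays below t/2 + 2 buckets, because each period adds one bucket and removes one bucket in total.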
  • Example 2 is applied to the automotive electronics scenario: a single car collects 1 sample per second, that is, 2,592,000 automotive sensor samples per month, about 1 TB (so the capacity of a bucket is 1 TB), while the storage capacity of the car is only 6 TB.
  • Number of buckets: the collection-stage storage space has 1 bucket; the stock-stage storage space has 4 buckets; the decay-stage storage space has 8 buckets.
  • the data collected in month x is put into bucket Ax; for example, the data collected in month 1 is put into bucket A1.
  • Month 2: the collected automotive sensor samples are put into bucket A2; bucket A1 is put into the stock-stage storage space.
  • Month 3: the collected automotive sensor samples are put into bucket A3; bucket A2 is put into the stock-stage storage space.
  • Month 5: the collected automotive sensor samples are put into bucket A5; bucket A4 is put into the stock-stage storage space. At this point the stock-stage storage space is full.
  • Month 7: the collected automotive sensor samples are put into bucket A7.
  • bucket A6 is put into the stock-stage storage space;
  • bucket A2 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples;
  • bucket A1 should delete at least 1/4 of a bucket's capacity, that is, 648,000 samples.
  • Month 8: the collected automotive sensor samples are put into bucket A8.
  • bucket A7 is put into the stock-stage storage space;
  • bucket A3 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples;
  • bucket A1 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples.
  • Month 9: the collected automotive sensor samples are put into bucket A9.
  • bucket A8 is put into the stock-stage storage space;
  • bucket A4 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples;
  • bucket A2 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples;
  • bucket A1 should delete at least 1/16 of a bucket's capacity, that is, 162,000 samples.
  • Month 12: the collected automotive sensor samples are put into bucket A12.
  • bucket A11 is put into the stock-stage storage space;
  • bucket A7 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples;
  • bucket A6 should delete at least 1/4 of a bucket's capacity, that is, 648,000 samples;
  • bucket A5 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples;
  • bucket A4 should delete at least 1/16 of a bucket's capacity, that is, 162,000 samples;
  • bucket A3 should delete at least 1/32 of a bucket's capacity, that is, 81,000 samples;
  • bucket A2 should delete at least 1/64 of a bucket's capacity, that is, 40,500 samples;
  • bucket A1 should delete at least 1/128 of a bucket's capacity, that is, 20,250 samples.
  • Month 13: the collected automotive sensor samples are put into bucket A13; bucket A12 is put into the stock-stage storage space; bucket A8 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples; bucket A7 should delete at least 1/4 of a bucket's capacity, that is, 648,000 samples; bucket A6 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples; bucket A5 should delete at least 1/16 of a bucket's capacity, that is, 162,000 samples; bucket A4 should delete at least 1/32 of a bucket's capacity, that is, 81,000 samples; bucket A3 should delete at least 1/64 of a bucket's capacity, that is, 40,500 samples; bucket A2 should delete at least 1/128 of a bucket's capacity, that is, 20,250 samples; bucket A1 should delete at least 1/256 of a bucket's capacity, that is, 10,125 samples.
  • Month 14: the collected automotive sensor samples are put into bucket A14; bucket A13 is put into the stock-stage storage space; bucket A9 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples; bucket A8 should delete at least 1/4 of a bucket's capacity, that is, 648,000 samples; bucket A7 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples; bucket A6 should delete at least 1/16 of a bucket's capacity, that is, 162,000 samples; bucket A5 should delete at least 1/32 of a bucket's capacity, that is, 81,000 samples; bucket A4 should delete at least 1/64 of a bucket's capacity, that is, 40,500 samples; bucket A3 should delete at least 1/128 of a bucket's capacity, that is, 20,250 samples; bucket A2 should delete at least 1/256 of a bucket's capacity, that is, 10,125 samples; bucket A1 is put into the discard-stage storage space, and the heat of all samples in the storage space of the decay stage
  • Month 15: the collected automotive sensor samples are put into bucket A15; bucket A14 is put into the stock-stage storage space; bucket A10 is put into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, that is, 1,296,000 samples; bucket A9 should delete at least 1/4 of a bucket's capacity, that is, 648,000 samples; bucket A8 should delete at least 1/8 of a bucket's capacity, that is, 324,000 samples; bucket A7 should delete at least 1/16 of a bucket's capacity, that is, 162,000 samples; bucket A6 should delete at least 1/32 of a bucket's capacity, that is, 81,000 samples; bucket A5 should delete at least 1/64 of a bucket's capacity, that is, 40,500 samples; bucket A4 should delete at least 1/128 of a bucket's capacity, that is, 20,250 samples; bucket A3 should delete at least 1/256 of a bucket's capacity, that is, 10,125 samples; bucket A2 is put into the discard-stage storage space, and all samples in the
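The per-bucket figures in the monthly schedule follow a halving rule, which the sketch below reproduces. It assumes a 30-day month (so that 1 sample per second gives 2,592,000 samples), 4 stock-stage buckets, and 8 decay-stage buckets as stated in Example 2; the names `SAMPLES_PER_BUCKET` and `decay_deletions` are hypothetical.

```python
SAMPLES_PER_BUCKET = 30 * 24 * 60 * 60   # 1 sample/s over an assumed 30-day month
STOCK_BUCKETS = 4
DECAY_BUCKETS = 8


def decay_deletions(month: int) -> dict:
    """Map bucket name -> minimum number of samples to delete in `month`."""
    plan = {}
    # Bucket that enters the decay stage this month (A1 enters in month 6).
    newest = month - STOCK_BUCKETS - 1
    for k in range(1, DECAY_BUCKETS + 1):      # k-th month spent in decay
        bucket = newest - k + 1
        if bucket >= 1:
            plan[f"A{bucket}"] = SAMPLES_PER_BUCKET // 2 ** k
    return plan


for month in (6, 7, 13, 14):
    print(month, decay_deletions(month))
```

From month 14 onward the oldest bucket no longer appears in the plan, because it has spent 8 months in the decay stage and moves to the discard stage.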
  • the technical solution of the present disclosure generates one new bucket of samples every month and deletes a total of one bucket of samples at the same time, so that the data no longer grows; the deleted samples are all old data with low access heat, which effectively avoids the problem of model aging, and sample deletion is implemented on the basis of linked lists, so target samples can be deleted quickly. It is worth noting that in practical applications it may not be possible to reach the ideal state in which the newly collected sample volume exactly equals the deleted sample volume.
  • FIG. 15 is a schematic structural diagram of a sample deletion device provided by another embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a sample deletion device 1500, which includes a memory 1510, a processor 1520, and a computer program stored on the memory 1510 and executable on the processor 1520.
  • the processor 1520 and the memory 1510 may be connected through a bus or in other ways.
  • the non-transitory software programs and instructions required to implement the sample deletion method of the above embodiments are stored in the memory 1510 and, when executed by the processor 1520, perform the sample deletion method of the above embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9 described above.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present disclosure also provides a computer-readable storage medium that stores computer-executable instructions; when the computer-executable instructions are executed by a processor or a controller, for example by the processor 1520 in the above embodiment of the sample deletion device 1500, the processor 1520 performs the time-decay-based sample deletion method of the above embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9 described above.
  • an embodiment of the present disclosure also provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium; the processor of a computer device reads the computer program or computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the time-decay-based sample deletion method of any of the preceding embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9 described above.
  • Embodiments of the present disclosure include a method for deleting samples based on time decay, a device therefor, and a storage medium, wherein the method for deleting samples based on time decay includes: acquiring multiple samples; saving the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time of the samples saved in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time; calculating the heat value of each sample in the storage space belonging to the target storage attribute; and deleting samples in the storage space according to the heat values and the current preset deletion capacity of the storage space.
  • obsolete samples are deleted according to the heat values of the samples and the preset deletion capacity of the storage space.
  • the technical solution of this application can retain the valuable data among old samples, thereby effectively improving the quality of the samples.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analyzing Materials Using Thermal Means (AREA)

Abstract

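A method for deleting samples based on time decay, a device therefor, and a storage medium, wherein the method for deleting samples based on time decay includes: acquiring multiple samples (S110); saving the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time of the samples saved in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time (S120); calculating the heat value of each sample in the storage space belonging to a target storage attribute (S130); and deleting samples in the storage space according to the heat values and the current preset deletion capacity of the storage space (S140).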

Description

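Method for Deleting Samples Based on Time Decay, Device Therefor, and Storage Medium
Cross-Reference to Related Applications
This application is based on and claims priority to Chinese patent application No. 202210153039.1, filed on February 18, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to, but is not limited to, the technical field of data processing, and in particular to a method for deleting samples based on time decay, a device therefor, and a storage medium.
Background
With the arrival of the Internet of Everything, sample collection appears in more and more applications, and, thanks to the continued development and maturing of artificial intelligence algorithms, more and more applications use samples as training data for artificial intelligence models and rely on the trained models to implement their functions. As the number of collected samples grows day by day, the samples seriously encroach on the storage space of business data. In addition, the feature distribution of the samples keeps changing; samples may have low value density and a severe imbalance between positive and negative samples, and sample types easily become unbalanced, with too large a proportion of old samples and too small a proportion of newly collected ones, which lowers the accuracy of the artificial intelligence model and causes model aging. For example, in hard disk failure prediction, the causes of hard disk failure change quietly as the service life of the disks grows: disks with a short service life mostly fail because of wear of the rotating platters, whereas disks with a long service life mostly fail because the reallocation space is exhausted. If the proportion of old data is too large and the proportion of newly collected data is too small, a typical model aging problem arises, and hard disk failure prediction becomes less and less accurate.
One possible solution is to delete old samples on the basis of how long ago they were stored, in order to balance the sample types, for example with log-style circular overwriting: samples are stored in a fixed storage space in the order in which they were collected, and when the capacity of the storage space is used up, newly collected samples are stored by overwriting the head of the storage space. However, directly deleting all old samples purely by time order means that the small portion of old samples that still have value cannot be retained, which lowers the quality of the samples.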
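Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present disclosure provide a method for deleting samples based on time decay, a device therefor, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for deleting samples based on time decay, including: acquiring multiple samples; saving the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time of the samples saved in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time; calculating the heat value of each sample in the storage space belonging to a target storage attribute; and deleting the samples in the storage space according to the heat values and the current preset deletion capacity of the storage space.
In a second aspect, an embodiment of the present disclosure provides a sample deletion device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for deleting samples based on time decay according to the first aspect.
In a third aspect, an embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions for performing the method for deleting samples based on time decay according to the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium; the processor of a computer device reads the computer program or computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method for deleting samples based on time decay according to the first aspect.
Other features and advantages of the present disclosure will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the present disclosure. The objectives and other advantages of the present disclosure can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.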
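Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they serve to explain the technical solution of the present disclosure and do not limit it.
FIG. 1 is a flowchart of the steps of a method for deleting samples based on time decay provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of the steps of a sample deletion method provided by another embodiment of the present disclosure;
FIG. 3 is a flowchart of the steps of saving samples provided by another embodiment of the present disclosure;
FIG. 4 is a flowchart of the steps of a sample deletion method provided by another embodiment of the present disclosure;
FIG. 5 is a flowchart of the steps of scheduling samples in linked lists provided by another embodiment of the present disclosure;
FIG. 6 is a flowchart of the steps of a sample deletion method provided by another embodiment of the present disclosure;
FIG. 7 is a flowchart of the steps of obtaining sample heat values provided by another embodiment of the present disclosure;
FIG. 8 is a flowchart of the steps of changing the storage attribute of a storage space provided by another embodiment of the present disclosure;
FIG. 9 is a flowchart of the steps of determining the preset deletion capacity provided by another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of linked lists provided by another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of sample deletion provided by another embodiment of the present disclosure;
FIG. 12 is a graph of the ratio of the historical heat value to the current heat value provided by another embodiment of the present disclosure;
FIG. 13 is a schematic module diagram of a sample deletion device provided by another embodiment of the present disclosure;
FIG. 14 is a flowchart of the steps of a sample deletion method provided by another embodiment of the present disclosure;
FIG. 15 is a schematic structural diagram of a sample deletion device provided by another embodiment of the present disclosure.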
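Detailed Description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure and are not intended to limit it.
It should be noted that although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that of the device, or in an order different from that of the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
With the arrival of the Internet of Everything, the collection of sample information appears in more and more applications, for example vehicle driving data in the automotive electronics industry, fault information in 5G base stations, system operating metrics of distributed databases, and hard disk failure indicators of storage servers. As the number of collected samples grows day by day, the feature distribution of the samples keeps changing and their value density is low; sample types easily become unbalanced, with too large a proportion of old samples and too small a proportion of newly collected ones, which lowers the accuracy of the artificial intelligence model and causes model aging. For example, in hard disk failure prediction, the causes of hard disk failure change quietly as the service life of the disks grows: disks with a short service life mostly fail because of wear of the rotating platters, whereas disks with a long service life mostly fail because the reallocation space is exhausted. If the proportion of old data is too large and the proportion of newly collected data is too small, a typical model aging problem arises, and hard disk failure prediction becomes less and less accurate. Moreover, the ever-growing samples seriously encroach on the storage space of business data. The collection end at the edge, however, is often limited in storage, computation, and transmission, or, for reasons such as personal privacy protection and policies and laws, cannot transfer the samples to the central node and can only hold them temporarily while waiting for the central node's instruction to read the data, which creates a risk that storage capacity will be exhausted. For example, in-vehicle electronic components are diverse, and the samples they collect and store are of many kinds and high frequency, while the capacity of in-vehicle storage devices is limited, so old samples inevitably cannot all be kept.
A conventional practice is simply to delete on the basis of age, for example applying an aging effect to a file once the time elapsed since the file was last executed exceeds a predetermined time. Another example is a dashboard camera, which uses log-style circular overwriting: samples are stored in a fixed storage space in the order in which they were collected, and when the space is used up, storage overwrites from the head of the storage space. These methods are simple and easy to implement, but they crudely delete all old samples, so the small portion of historical data that still has value cannot be retained.
In view of the above problems, old samples need to be deleted appropriately.
Embodiments of the present disclosure include a method for deleting samples based on time decay, a device therefor, and a storage medium, wherein the method for deleting samples based on time decay includes: acquiring multiple samples; saving the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time of the samples saved in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time; calculating the heat value of each sample in the storage space belonging to a target storage attribute; and deleting samples in the storage space according to the heat values and the current preset deletion capacity of the storage space. According to the solution provided by the embodiments of the present disclosure, old samples are deleted according to the heat values of the samples and the preset deletion capacity of the storage space. Compared with the conventional practice described above, which deletes samples only according to how long ago they were stored, the technical solution of this application can retain the valuable data among old samples, thereby effectively improving the quality of the samples.
The embodiments of the present disclosure are further described below with reference to the accompanying drawings.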
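As shown in FIG. 1, FIG. 1 is a flowchart of the steps of a method for deleting samples based on time decay provided by an embodiment of the present disclosure. The sample deletion method includes, but is not limited to, the following steps:
Step S110: acquire multiple samples;
It should be noted that the time at which samples are acquired may be determined from information such as the sample collection frequency, the collection volume, and the storage capacity of the storage space. The embodiments of this application do not limit the time threshold or the time unit for acquiring samples: a preset number of samples may be acquired every 6 months, or the time threshold may be expressed in minutes, hours, or days, and those skilled in the art can adjust it according to the actual situation.
Step S120: save the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time of the samples saved in the storage space, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time.
It can be understood that the storage attribute represents the storage stage that a storage space is in, based on the storage time of the samples it holds. In one embodiment, the stages may include a sample collection stage, a stock sample stage, a decay sample stage, and a discard sample stage, each corresponding to a different storage attribute, and the storage attribute of a storage space changes with the storage time. For example, a storage space holding newly acquired samples is in the sample collection stage, corresponding to storage attribute A; a storage space whose samples were stored a month ago is in the decay sample stage, corresponding to storage attribute B. After the samples are acquired, they are saved to the storage space. Because the storage attribute of a storage space changes with the storage time of the samples it holds, the activation time of the storage space is counted from the storage time of the samples; as the activation time increases, the storage attribute of the storage space changes accordingly, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time.
It should be noted that the embodiments of this application do not limit the number of storage spaces or their storage capacity, and those skilled in the art can adjust them according to the actual situation.
Step S130: calculate the heat value of each sample in the storage space belonging to the target storage attribute.
It can be understood that, with reference to the description of step S120, the target storage attribute may be the storage attribute corresponding to the stock sample stage, the decay sample stage, or the discard sample stage. The heat value represents the use value of a sample: a sample with a low heat value has low use value. By obtaining the heat value of each sample and using it as the basis for deletion, samples that still have use value can be retained.
Step S140: delete samples in the storage space according to the heat values and the current preset deletion capacity of the storage space.
It can be understood that, because the preset deletion capacity is determined by the storage attribute and changes with the storage time of the samples saved in the storage space, the technical solution of this application considers how long ago samples were stored while also using the storage attribute of the storage space and the heat values of the samples as the basis for deletion. It can therefore retain old samples that still have use value, which improves the quality of the samples and provides effective training data for the artificial intelligence model.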
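In addition, referring to FIG. 2, in one embodiment, step S140 in the embodiment shown in FIG. 1 includes, but is not limited to, the following step:
Step S210: according to the heat value of each sample in each storage space and the current preset deletion capacity of each storage space, delete samples from each storage space accordingly, so that the total amount of samples deleted equals the amount of samples acquired.
It can be understood that samples are deleted from each storage space according to the heat values of the samples it holds, so that the total amount of samples deleted equals the amount of samples acquired. The total amount of samples saved across all storage spaces therefore stays constant, which effectively avoids data growth in the storage space and prevents the storage capacity from being exhausted.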
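In addition, referring to FIG. 3, in one embodiment, after step S130 in the embodiment shown in FIG. 1, the sample deletion method further includes, but is not limited to, the following steps:
Step S310: divide the storage space into multiple linked lists according to heat value, each linked list having a linked list identifier;
Step S320: save each sample to the corresponding linked list according to the linked list identifiers and the heat value.
It can be understood that the storage space is divided into multiple linked lists according to heat value, and each linked list has an identifier that indicates the preset heat value range to which the samples in that list belong. Samples are saved to the corresponding list according to the list identifiers and their heat values, so that a bidirectional index is established between the samples belonging to a preset heat value range and the list identifier, which effectively improves the efficiency of sample deletion. For example, once it is determined that the heat values of the samples to be deleted fall in a first heat value range, the first linked list identifier corresponding to that range can be determined, and the linked list corresponding to that identifier is deleted.
It should be noted that the embodiments of this application do not limit the specific way of dividing the storage space according to heat value. The storage space may be divided by the most significant bits of the binary heat value, with the most significant bits serving as the identifier of each resulting linked list. Referring to FIG. 10, which is a schematic diagram of linked lists provided by another embodiment of the present disclosure, in one embodiment the storage space may be segmented by the two most significant bits of the binary heat value, giving 11 linked lists. The 11 linked lists start from an array of list heads and form lists at different levels, from bottom to top by heat value range: list 00, whose two most significant bits are 00, holds heat values from binary 00 (decimal 0) upward, with heat range 0-1; list 01, whose two most significant bits are 01, holds heat values from binary 01 (decimal 1) upward, with heat range 1-2; list 10, whose two most significant bits are 10, holds heat values from binary 10 (decimal 2) upward, with heat range 2-3; list 11, whose two most significant bits are 11, holds heat values from binary 11 (decimal 3) upward, with heat range 3-4; list 100, whose two most significant bits are 10, holds heat values from binary 100 (decimal 4) upward, with heat range 4-6; list 110, whose two most significant bits are 11, holds heat values from binary 110 (decimal 6) upward, with heat range 6-8; list 1000, whose two most significant bits are 10, holds heat values from binary 1000 (decimal 8) upward, with heat range 8-12; list 1100, whose two most significant bits are 11, holds heat values from binary 1100 (decimal 12) upward, with heat range 12-16; list 10000, whose two most significant bits are 10, holds heat values from binary 10000 (decimal 16) upward, with heat range 16-24; list 11000, whose two most significant bits are 11, holds heat values from binary 11000 (decimal 24) upward, with heat range 24-32; and list 100000, whose two most significant bits are 10, holds heat values from binary 100000 (decimal 32) upward, that is, heat values greater than 32.
It can be understood that a sample is saved to its linked list according to the list identifier and the heat value as follows. For example, a sample with heat value 5 is 101 in binary, and its two most significant bits are 10, so the sample is saved to the list identified as 100, whose heat range is 4 to 6. Because the storage space is divided by heat range into lists ordered from low to high, the samples within each list are not sorted, trading a small amount of ordering precision for avoiding expensive sorting overhead. When a deletion is required, the samples of the lists are deleted list by list from low to high through the multi-level lists until the amount deleted meets the preset deletion capacity of the storage space.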
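In addition, referring to FIG. 4, in one embodiment, step S140 in the embodiment shown in FIG. 1 includes, but is not limited to, the following steps:
Step S410: determine a first heat threshold according to the linked list identifiers and the preset deletion capacity;
Step S420: delete the samples whose heat value is less than the first heat threshold.
It can be understood that, with reference to the description of steps S310 to S320, the storage space is divided into linked lists at different levels in order of heat value, and each list identifier indicates the preset heat value range to which the samples in that list belong. The first heat threshold is determined according to the list identifiers and the preset deletion capacity; it is the lowest heat value of the samples that need to be retained, and it determines the target lists whose heat values are below the threshold. When there are multiple target lists, their samples are deleted list by list from low to high heat value until the amount deleted meets the preset deletion capacity, so that old samples that still have use value are retained and the quality of the samples is improved.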
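Steps S410 and S420 can be sketched as a scan over the lists in ascending order of identifier. This is a minimal illustration that works on list sizes only and deletes whole lists, as the description of the unsorted lists suggests; the function name `plan_deletion` and the list identifiers 0, 1, and 2 are hypothetical, and the sizes are borrowed from the month-6 illustration in Example 2.

```python
from typing import Dict, Optional, Tuple


def plan_deletion(list_sizes: Dict[int, int], quota: int) -> Tuple[int, Optional[int]]:
    """Return (samples deleted, identifier of the first retained list).

    Whole lists are removed from the lowest identifier upward until the
    number of deleted samples reaches the preset deletion capacity `quota`.
    The first retained identifier plays the role of the first heat threshold.
    """
    deleted = 0
    for list_id in sorted(list_sizes):
        if deleted >= quota:
            return deleted, list_id
        deleted += list_sizes[list_id]
    return deleted, None  # every list was consumed


# Quota: half of a 2,592,000-sample bucket.
sizes = {0: 1_295_000, 1: 2_000, 2: 1_000}
print(plan_deletion(sizes, quota=1_296_000))
```

Deleting whole lists can overshoot the quota slightly (here 1,297,000 instead of 1,296,000), which is the small loss of precision accepted in exchange for not sorting the samples inside each list.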
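In addition, referring to FIG. 5, in one embodiment, before step S420 in the embodiment shown in FIG. 4, the sample deletion method of this application further includes, but is not limited to, the following step:
Step S510: when a change in the heat value of a sample in a linked list is detected, move that sample to a new linked list according to the changed heat value.
It can be understood that the heat values of the samples in the linked lists are checked in real time or periodically. When a change in the heat value of a sample is detected, the sample is moved to a new linked list according to its changed heat value, so that samples are classified accurately, mistaken deletions are avoided, and the quality of the samples is preserved.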
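The rescheduling in step S510 relies on an index in both directions: from list identifier to samples and from sample to list identifier. The sketch below illustrates that idea under several assumptions: integer heat values, identifiers formed from the two most significant bits, and Python sets standing in for the linked lists. The class name `HeatIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class HeatIndex:
    """Bidirectional index between list identifiers and sample ids."""

    def __init__(self):
        self.lists = defaultdict(set)  # list identifier -> sample ids
        self.where = {}                # sample id -> list identifier

    @staticmethod
    def list_id(heat: int) -> int:
        # Keep the two most significant bits of the heat value.
        if heat < 4:
            return heat
        shift = heat.bit_length() - 2
        return (heat >> shift) << shift

    def update(self, sample: str, heat: int) -> None:
        """Place `sample` in the list matching its (possibly changed) heat."""
        new = self.list_id(heat)
        old = self.where.get(sample)
        if old == new:
            return                      # still in the right list
        if old is not None:
            self.lists[old].discard(sample)
        self.lists[new].add(sample)
        self.where[sample] = new


index = HeatIndex()
index.update("s1", 5)   # binary 101 -> list 100
index.update("s1", 9)   # binary 1001 -> list 1000
print(format(index.where["s1"], "b"))
```

Because the sample-to-list mapping is kept alongside the lists, a move costs a constant number of dictionary and set operations and never requires scanning a list.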
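In addition, referring to FIG. 6, in one embodiment, step S140 in the embodiment shown in FIG. 1 includes, but is not limited to, the following steps:
Step S610: determine the target samples in the storage space according to the current preset deletion capacity, wherein the heat value of each target sample is less than a preset second heat threshold;
Step S620: delete the target samples.
It can be understood that, after the preset deletion capacity of a storage space is determined, the second heat threshold of that storage space is determined according to the preset deletion capacity. The second heat threshold is the lowest heat value of the samples that need to be retained in the current storage space. Target samples amounting to the preset deletion capacity, each with a heat value less than the second heat threshold, are deleted from the storage space, so that old samples of low value are removed and the data quality of the samples is improved.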
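In addition, referring to FIG. 7, in one embodiment, the multiple samples have different sample types, and different sample types correspond to different sample type values. To obtain the heat values of the multiple samples, step S130 in the embodiment shown in FIG. 1 includes, but is not limited to, the following steps:
Step S710: obtain a sample proportion value from the samples of the different sample types;
Step S720: obtain the heat value of the current sample from the sample type value of the current sample, the sample proportion value, the number of accesses to the current sample, and the historical heat value of the current sample.
It should be noted that the embodiments of this application do not limit the specific way of obtaining the heat value of the current sample from its sample type value, the sample proportion value, its number of accesses, and its historical heat value. The heat value of the current sample may be calculated by the following formula [formula missing in the source], where Store_{k-1} is the historical heat value, a is the sample type value, b is the sample proportion value, and Visit_k is the number of accesses to the current sample.
It should be noted that the embodiments of this application do not limit how the historical heat value is processed. The square root of the historical heat value may be taken to weaken its influence on the current heat value, or its cube root or logarithm may be taken. Referring to FIG. 12, which is a graph of the ratio of the historical heat value to the current heat value provided by another embodiment of the present disclosure, curve 1210 corresponds to the ratio of the original (unprocessed) historical heat value to the current heat value, curve 1220 to the ratio of the square root of the historical heat value to the current heat value, curve 1230 to the ratio of the cube root of the historical heat value to the current heat value, and curve 1240 to the ratio of the logarithm of the historical heat value to the current heat value. It can be seen that curves 1220 and 1230 are closer to 1 than curves 1210 and 1240, that is, taking the square root or the cube root of the historical heat value effectively weakens its influence on the current heat value. Curve 1230, however, weakens the historical heat value too much to reflect its influence on the current heat value, so curve 1220, the square-root approach, weakens the influence of the historical heat value on the current heat value more appropriately.
It should be noted that the embodiments of this application do not limit the specific sample type values. Sample types may include minority-type samples and majority-type samples: a minority type is a sample type whose sample proportion value is less than a preset sample proportion threshold, and a majority type is a sample type whose sample proportion value is greater than the preset sample proportion threshold. In one embodiment, a is set to 1 when the sample type is detected to be a minority type and to 0 when it is detected to be a majority type, which gives minority-type samples a higher weight. Because samples with low heat values are deleted first, deletion tends to remove majority-type samples, and the sample types therefore become increasingly balanced.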
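The assignment of the sample type value a from the sample proportion value b can be sketched as follows. Since the heat formula itself is not reproduced in the text, the sketch stops at a and b. The threshold of 0.5, the treatment of a proportion exactly equal to the threshold as a majority type, the function name `sample_type_values`, and the labels in the usage lines are all assumptions made for illustration.

```python
from collections import Counter


def sample_type_values(labels, threshold=0.5):
    """Return (proportion per type, type value a per type).

    A type whose proportion is below `threshold` is treated as a minority
    type and gets a = 1; every other type gets a = 0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = {label: n / total for label, n in counts.items()}
    type_values = {label: 1 if p < threshold else 0
                   for label, p in proportions.items()}
    return proportions, type_values


# Hypothetical disk-health labels: failures are the rare class.
labels = ["healthy"] * 95 + ["failed"] * 5
proportions, type_values = sample_type_values(labels)
print(proportions, type_values)
```

With a = 1 for the rare class, its samples receive higher heat under any formula in which a contributes positively, so low-heat deletion falls mainly on the majority class.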
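It should be noted that the embodiments of this application do not limit the specific time at which the current heat value is updated. The heat value may be updated when new samples are acquired, and those skilled in the art can adjust this according to the actual situation.
In addition, referring to FIG. 8, in one embodiment, there are multiple storage spaces, each storage space has a storage serial number, and the storage serial numbers increase with the activation time of the storage spaces; there are multiple types of storage attribute, and different types of storage attribute correspond to different maximum numbers of storage spaces. The sample deletion method of this application further includes, but is not limited to, the following step:
Step S810: when the number of storage spaces with the same storage attribute is greater than or equal to the corresponding maximum number of storage spaces, change the storage attribute of the corresponding storage space in the order of the storage serial numbers.
It can be understood that, referring to FIG. 11, which is a schematic diagram of sample deletion provided by another embodiment of the present disclosure, there are multiple types of storage attribute, which may include the collection-stage storage space, the stock-stage storage space, the decay-stage storage space, and the discard-stage storage space. There are multiple storage spaces, and there may be multiple storage spaces of each type of storage attribute. The storage spaces of each type of storage attribute have storage serial numbers ordered from small to large, and the serial numbers increase with the activation time of the storage spaces. When the number of storage spaces with the same type of storage attribute is greater than or equal to the corresponding maximum number of storage spaces, the storage attribute of the corresponding storage space is changed in ascending order of serial number. For example, when the number of stock-stage storage spaces is greater than or equal to the corresponding maximum number of storage spaces, the storage attribute of the storage space with the smallest serial number is changed to the decay-stage storage space.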
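In addition, in one embodiment, the storage attributes include a first storage attribute, a second storage attribute, a third storage attribute, and a fourth storage attribute. The first storage attribute indicates that the storage space is in the sample collection stage; the second storage attribute indicates that the storage space is in the stock sample stage; the third storage attribute indicates that the storage space is in the decay sample stage; and the fourth storage attribute indicates that the storage space is in the discard sample stage.
Before step S120, the sample deletion method further includes, but is not limited to, the following step:
dividing the storage device into a first storage area, a second storage area, a third storage area, and a fourth storage area, wherein the first storage area corresponds to the first storage attribute, the second storage area corresponds to the second storage attribute, the third storage area corresponds to the third storage attribute, and the fourth storage area corresponds to the fourth storage attribute.
Step S120 includes, but is not limited to, the following steps:
saving the samples to a storage space in the first storage area;
when the number of storage spaces in the first storage area is greater than or equal to the maximum number of storage spaces of the first storage area, reassigning the storage space in the first storage area whose samples have been stored longest to the second storage area.
In addition, after the storage space in the first storage area whose samples have been stored longest is reassigned to the second storage area, the sample deletion method further includes, but is not limited to, the following step:
when the number of storage spaces in the second storage area is greater than or equal to the maximum number of storage spaces of the second storage area, reassigning the storage space in the second storage area whose samples have been stored longest to the third storage area.
In addition, after the storage space in the second storage area whose samples have been stored longest is reassigned to the third storage area, the sample deletion method further includes, but is not limited to, the following step:
when the number of storage spaces in the third storage area is greater than or equal to the maximum number of storage spaces of the third storage area, reassigning the storage space in the third storage area whose samples have been stored longest to the fourth storage area.
It can be understood that different storage areas have different numbers of storage spaces. Because the first storage area is used to collect newly acquired samples, it must always have enough storage space to collect new samples. Therefore, when the number of storage spaces in the first storage area reaches its maximum, the storage space whose samples have been stored longest is reassigned to the second storage area; when the number of storage spaces in the second storage area then reaches its maximum, the storage space whose samples have been stored longest is reassigned to the third storage area; and when the number of storage spaces in the third storage area then reaches its maximum, the storage space whose samples have been stored longest is reassigned to the fourth storage area. Changing the storage attribute of storage spaces on the basis of the maximum number of storage spaces of each storage area and the storage time of each storage space effectively ensures that the first storage area has enough storage space to collect new samples and prepares for the deletion of samples.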
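In addition, referring to FIG. 9, in one embodiment, the sample deletion method further includes, but is not limited to, the following steps:
Step S910: when the storage space belongs to the first storage area, the preset deletion capacity is 0;
or,
Step S920: when the storage space belongs to the second storage area, the preset deletion capacity is 0;
or,
Step S930: when the storage space belongs to the third storage area, the preset deletion capacity is determined according to the current storage time of the samples the storage space holds in the third storage area;
or,
Step S940: when the storage space belongs to the fourth storage area, the preset deletion capacity is determined according to the number of storage spaces belonging to the third storage area.
It can be understood that when the current storage attribute of a storage space is the first storage attribute (belonging to the first storage area) or the second storage attribute (belonging to the second storage area), the preset deletion capacity is 0, that is, no samples in the storage space are deleted. As time passes, the storage spaces hold more and more samples, and old samples of low value need to be deleted. When the current storage attribute of a storage space is the third storage attribute (belonging to the third storage area), there are multiple first storage spaces belonging to the third storage area, and the preset deletion capacity of each first storage space is determined by the order of the current storage times of the samples it holds in the third storage area and decays with the storage time. When the current storage attribute of a storage space is the fourth storage attribute (belonging to the fourth storage area), the preset deletion capacity is determined according to the number of storage spaces belonging to the third storage area. This ensures that the samples targeted for deletion are old enough and effectively guarantees the quality of the samples.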
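Steps S910 to S940 amount to a lookup of the preset deletion capacity by storage area. The sketch below expresses the capacity as a fraction of one storage space and assumes the halving schedule used in the worked examples (1/2^k for the k-th period in the third storage area, 1/2^t for the fourth storage area when the third area holds t storage spaces); the steps themselves only say that the capacity is determined from these quantities. The names `Stage` and `preset_deletion_capacity` are hypothetical.

```python
from enum import Enum
from fractions import Fraction


class Stage(Enum):
    COLLECTION = 1   # first storage area
    STOCK = 2        # second storage area
    DECAY = 3        # third storage area
    DISCARD = 4      # fourth storage area


def preset_deletion_capacity(stage: Stage, periods_in_decay: int = 1,
                             decay_spaces: int = 4) -> Fraction:
    """Return the preset deletion capacity as a fraction of one storage space."""
    if stage in (Stage.COLLECTION, Stage.STOCK):
        return Fraction(0)                        # S910, S920: nothing deleted
    if stage is Stage.DECAY:
        if periods_in_decay < 1:
            raise ValueError("periods_in_decay must be at least 1")
        return Fraction(1, 2 ** periods_in_decay) # S930: decays with storage time
    return Fraction(1, 2 ** decay_spaces)         # S940: set by the decay-area size


print(preset_deletion_capacity(Stage.DECAY, periods_in_decay=3))
print(preset_deletion_capacity(Stage.DISCARD, decay_spaces=4))
```

Keeping the result as an exact fraction makes it easy to multiply by a bucket's sample count later without rounding surprises.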
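An example is described below. When there are 4 first storage spaces belonging to the third storage attribute, the 4 first storage spaces are sorted by the current storage time of the samples they hold, and according to the sorting result 1/2^1, 1/2^2, 1/2^3, and 1/2^4 of the capacity is deleted from the corresponding first storage spaces (for example, the first storage space that stored samples earliest deletes 1/2^1 of the capacity, and the one that stored samples latest deletes 1/2^4 of the capacity; which samples are deleted is determined by the heat value of each sample in the bucket). The preset deletion capacity corresponding to the fourth storage attribute is 1/2^4 of the storage capacity of the second storage space, where the second storage space belongs to the fourth storage attribute. Thus, whenever the third storage space belonging to the first storage attribute acquires one storage space's worth of samples, a total of 1 bucket of samples is deleted across all first storage spaces and the second storage space, that is, the total amount of samples deleted equals the amount of samples acquired, so the total amount of samples saved across all storage spaces stays constant and data growth in the storage space is effectively avoided.
In addition, to describe the method for deleting samples based on time decay provided by the present disclosure in more detail, the technical solution of the present disclosure is described below with two examples.
For ease of description, the storage space of each storage attribute in the following two examples is referred to as a "bucket", and the buckets of all storage attributes have the same storage capacity.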
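Referring to FIG. 13, which is a schematic module diagram of a sample deletion device provided by another embodiment of the present disclosure, the sample deletion device 1300 includes a storage space 1310, a heat value update module 1320, a linked list division module 1330, and a sample deletion module 1340, wherein the storage space includes storage spaces of 4 storage attributes: the collection-stage storage space 1311, the stock-stage storage space 1312, the decay-stage storage space 1313, and the discard-stage storage space 1314. The functions of the modules of the sample deletion device 1300 are described below:
The storage capacity of the collection-stage storage space 1311 is 1 bucket, which is used to collect samples; the samples in the collection-stage bucket are not used as training data for the artificial intelligence model.
The storage capacity of the stock-stage storage space 1312 is t/2 buckets, where t is commonly 4 or 8; the samples in the stock-stage buckets are mainly used as training data for the artificial intelligence model.
The storage capacity of the decay-stage storage space 1313 is t buckets. As time goes by, the samples in the decay-stage buckets become gradually less likely to be used for training and modeling, so they are deleted periodically.
No storage capacity is specified for the discard-stage storage space 1314; since only a very small number of its samples may still be accessed, most of them are deleted.
The linked list division module 1330 divides the storage space into linked lists by the two most significant bits of the binary heat value and establishes a bidirectional index between each linked list identifier and the samples in that list. When a heat value changes, the sample can be moved quickly to the corresponding target linked list according to the two most significant bits of the new heat value, and during decay the target samples can be deleted quickly through the linked lists. The smaller the heat value, the finer the range covered by each linked list.
The heat value update module 1320 maintains the heat values of all samples in the stock stage, the decay stage, and the discard-stage storage space 1314. When samples are deleted, the samples with the lowest heat value are deleted first. The heat score Store_k of a sample in the k-th bucket is calculated by the following formula [formula missing in the source]; for the specific parameter explanations, refer to the principle description of the embodiment in FIG. 5, and details are not repeated here.
The sample deletion module 1340, when the preset time threshold arrives, deletes 1/2^1, 1/2^2, 1/2^3 ... 1/2^t of a bucket's capacity from the t buckets of the decay stage respectively, and deletes 1/2^t of a bucket's capacity from the discard-stage storage space 1314. The sum of these deleted capacities is exactly the capacity of 1 bucket, so the total number of samples remains unchanged.
It can be understood that the relationship between the modules of the sample deletion device shown in FIG. 13 is as follows: at every preset time threshold, the collection-stage bucket is moved to the front of the stock stage, the last bucket of the stock stage is moved to the front of the decay stage, and the last bucket of the decay stage is moved into the discard stage. The heat value update module 1320 calculates the heat values of the samples in the stock stage, the decay stage, and the discard stage, and multiple linked lists are divided according to the heat values; the sample deletion module 1340 deletes samples from the decay-stage storage space and the discard-stage storage space 1314.
It should be noted that the method steps of Example 1 and Example 2 are applied to the sample deletion device shown in FIG. 13.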
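Example 1. Referring to FIG. 14, which is a flowchart of the steps of a sample deletion method provided by another embodiment of the present disclosure, the sample deletion method includes the following steps:
Step S1410: initialize the system, start collecting samples, and save the collected samples to the collection-stage storage space;
Step S1420: as time goes by, at every preset time threshold (for example, q months), change the storage attribute of the collection-stage storage space with the earliest storage time to the stock stage, that is, move the collection-stage bucket with the earliest storage time to the front of the stock-stage storage space, and use a new bucket in the collection-stage storage space for data collection;
Step S1430: as time goes by, after q*t/2 months, when all t/2 bucket positions of the stock-stage storage space are full, move the bucket with the earliest activation time in the stock-stage storage space to the decay-stage storage space, that is, move the last bucket in the stock-stage storage space to the front of the decay-stage storage space;
Step S1440: at every preset time threshold, for example every q months, move the bucket with the earliest activation time in the decay-stage storage space to the discard-stage storage space, that is, move the last bucket in the decay-stage storage space into the discard-stage storage space;
Step S1450: determine the first preset deletion capacity of each bucket in the decay-stage storage space, delete samples from the decay-stage storage space according to the heat values of the samples in each bucket and each first preset deletion capacity, and delete samples amounting to the second preset deletion capacity from the discard-stage storage space. For example, the t buckets of the decay-stage storage space delete 1/2^1, 1/2^2, 1/2^3 ... 1/2^t of a bucket's capacity respectively (the first preset deletion capacities), and the samples with the lowest heat value in a bucket are deleted first. As time goes by, after a further q*t months, all t bucket positions of the decay stage are full, and the last bucket of the decay stage is moved into the discard stage. Thereafter, every q months, the last bucket of the decay stage is moved into the discard stage, the heat values of all samples in the discard stage are calculated together, and 1/2^t of a bucket's capacity (the second preset deletion capacity) is deleted.
It should be noted that this example does not limit the specific number of buckets in the decay-stage storage space, that is, it does not limit the value of t, which may be 4 or 8. When t = 4, there are 4 buckets in the decay stage; the 4 buckets are sorted by the time at which each stored its samples, and according to the sorting result 1/2^1, 1/2^2, 1/2^3, and 1/2^4 of a bucket's capacity is deleted in turn (for example, the bucket in the decay stage that stored samples earliest deletes 1/2^1 of a bucket's capacity; which samples are deleted is determined by the heat value of each sample in the bucket). In the discard stage, 1/2^4 of a bucket's capacity is deleted, on the basis of the number of buckets in the decay stage. Thus, for every bucket of samples acquired in the collection stage, a total of 1 bucket of samples is deleted in the decay stage and the discard stage, that is, the total amount of samples deleted equals the amount of samples acquired, so the total amount of samples saved across all storage spaces stays constant and data growth in the storage space is effectively avoided.
It can be understood that the above steps are executed continuously, and the total amount of samples is kept at (t/2)+2 buckets. The present disclosure makes the newly collected sample volume equal to the deleted sample volume, so the data no longer grows; the deleted samples are all old data with low access heat, which effectively avoids the problem of model aging, and sample deletion is implemented on the basis of linked lists, so target samples can be deleted quickly.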
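Example 2 applies to an automotive electronics scenario. A single vehicle collects 1 sample per second, so each month it collects 2592000 automotive sensor samples, about 1 TB (that is, the capacity of a bucket is 1 TB), while the vehicle's storage capacity is only 6 TB. Number of buckets: the acquisition-stage storage space has 1 bucket, the stock-stage storage space has 4 buckets, and the decay-stage storage space has 8 buckets. The data collected in month x is placed in bucket Ax; for example, the data collected in month 1 is placed in bucket A1.
The following describes, month by month, how the storage position of each bucket changes across the storage spaces of each storage attribute and how the samples in the buckets are deleted:
Month 1: the collected automotive sensor samples are placed in bucket A1.
Month 2: the collected automotive sensor samples are placed in bucket A2; bucket A1 is moved into the stock-stage storage space.
Month 3: the collected automotive sensor samples are placed in bucket A3; bucket A2 is moved into the stock-stage storage space.
Month 4: the collected automotive sensor samples are placed in bucket A4; bucket A3 is moved into the stock-stage storage space.
Month 5: the collected automotive sensor samples are placed in bucket A5; bucket A4 is moved into the stock-stage storage space. At this point the stock-stage storage space is full.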
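Month 6: the collected automotive sensor samples are placed in bucket A6; bucket A5 is moved into the stock-stage storage space. Bucket A1 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; 1297000 samples are actually deleted. (As shown in FIG. 9, suppose bucket A1 holds 2592000 samples before deletion and the samples have already been placed into different linked lists by heat value, with 1295000 samples whose heat value is between 0 and 1, 2000 samples whose heat value is between 1 and 2, and 1000 samples whose heat value is between 2 and 4. All samples with a heat value between 0 and 1 and all samples with a heat value between 1 and 2 are deleted.)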
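The month 6 figures show why the number actually deleted (1297000) exceeds the quota (1296000): whole linked lists are removed, coldest first, until the quota is met. A minimal sketch of that selection, where the function name `lists_to_delete` and the list labels are hypothetical:

```python
def lists_to_delete(list_sizes: list[tuple[str, int]], quota: int) -> tuple[list[str], int]:
    """Pick whole linked lists, coldest first, until at least `quota` samples go.

    `list_sizes` must be ordered from the lowest heat range to the highest.
    Returns the chosen list labels and the number of samples removed.
    """
    chosen, removed = [], 0
    for name, size in list_sizes:
        if removed >= quota:
            break
        chosen.append(name)
        removed += size
    return chosen, removed


if __name__ == "__main__":
    # Linked lists named in the month 6 walkthrough, coldest first.
    bucket_lists = [("heat 0-1", 1_295_000), ("heat 1-2", 2_000), ("heat 2-4", 1_000)]
    chosen, removed = lists_to_delete(bucket_lists, quota=2_592_000 // 2)
    print(chosen, removed)  # ['heat 0-1', 'heat 1-2'] 1297000
```

Dropping entire lists avoids scanning or sorting individual samples, at the cost of overshooting the quota by at most one list.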
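Month 7: the collected automotive sensor samples are placed in bucket A7. Bucket A6 is moved into the stock-stage storage space; bucket A2 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A1 should delete at least 1/4 of a bucket's capacity, i.e. 648000 samples.
Month 8: the collected automotive sensor samples are placed in bucket A8. Bucket A7 is moved into the stock-stage storage space; bucket A3 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A2 should delete at least 1/4, i.e. 648000 samples; bucket A1 should delete at least 1/8, i.e. 324000 samples.
Month 9: the collected automotive sensor samples are placed in bucket A9. Bucket A8 is moved into the stock-stage storage space; bucket A4 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A3 should delete at least 1/4, i.e. 648000 samples; bucket A2 should delete at least 1/8, i.e. 324000 samples; bucket A1 should delete at least 1/16, i.e. 162000 samples.
Month 10: the collected automotive sensor samples are placed in bucket A10. Bucket A9 is moved into the stock-stage storage space; bucket A5 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A4 should delete at least 1/4, i.e. 648000 samples; bucket A3 should delete at least 1/8, i.e. 324000 samples; bucket A2 should delete at least 1/16, i.e. 162000 samples; bucket A1 should delete at least 1/32, i.e. 81000 samples.
Month 11: the collected automotive sensor samples are placed in bucket A11. Bucket A10 is moved into the stock-stage storage space; bucket A6 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A5 should delete at least 1/4, i.e. 648000 samples; bucket A4 should delete at least 1/8, i.e. 324000 samples; bucket A3 should delete at least 1/16, i.e. 162000 samples; bucket A2 should delete at least 1/32, i.e. 81000 samples; bucket A1 should delete at least 1/64, i.e. 40500 samples.
Month 12: the collected automotive sensor samples are placed in bucket A12. Bucket A11 is moved into the stock-stage storage space; bucket A7 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A6 should delete at least 1/4, i.e. 648000 samples; bucket A5 should delete at least 1/8, i.e. 324000 samples; bucket A4 should delete at least 1/16, i.e. 162000 samples; bucket A3 should delete at least 1/32, i.e. 81000 samples; bucket A2 should delete at least 1/64, i.e. 40500 samples; bucket A1 should delete at least 1/128, i.e. 20250 samples.
Month 13: the collected automotive sensor samples are placed in bucket A13. Bucket A12 is moved into the stock-stage storage space; bucket A8 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A7 should delete at least 1/4, i.e. 648000 samples; bucket A6 should delete at least 1/8, i.e. 324000 samples; bucket A5 should delete at least 1/16, i.e. 162000 samples; bucket A4 should delete at least 1/32, i.e. 81000 samples; bucket A3 should delete at least 1/64, i.e. 40500 samples; bucket A2 should delete at least 1/128, i.e. 20250 samples; bucket A1 should delete at least 1/256, i.e. 10125 samples.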
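Month 14: the collected automotive sensor samples are placed in bucket A14. Bucket A13 is moved into the stock-stage storage space; bucket A9 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A8 should delete at least 1/4, i.e. 648000 samples; bucket A7 should delete at least 1/8, i.e. 324000 samples; bucket A6 should delete at least 1/16, i.e. 162000 samples; bucket A5 should delete at least 1/32, i.e. 81000 samples; bucket A4 should delete at least 1/64, i.e. 40500 samples; bucket A3 should delete at least 1/128, i.e. 20250 samples; bucket A2 should delete at least 1/256, i.e. 10125 samples. Bucket A1 is moved into the discard-stage storage space, and all samples in the decay-stage storage space whose heat value is between 0 and 1 are deleted.
Month 15: the collected automotive sensor samples are placed in bucket A15. Bucket A14 is moved into the stock-stage storage space; bucket A10 is moved into the decay-stage storage space and should delete at least 1/2 of a bucket's capacity, i.e. 1296000 samples; bucket A9 should delete at least 1/4, i.e. 648000 samples; bucket A8 should delete at least 1/8, i.e. 324000 samples; bucket A7 should delete at least 1/16, i.e. 162000 samples; bucket A6 should delete at least 1/32, i.e. 81000 samples; bucket A5 should delete at least 1/64, i.e. 40500 samples; bucket A4 should delete at least 1/128, i.e. 20250 samples; bucket A3 should delete at least 1/256, i.e. 10125 samples. Bucket A2 is moved into the discard-stage storage space, and all samples in the decay-stage storage space whose heat value is between 0 and 1 are deleted.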
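The month-by-month schedule follows a fixed pattern that can be generated rather than listed. The sketch below assumes 30-day months, 4 stock-stage buckets, and 8 decay-stage buckets as in this example; the function name `monthly_plan` and the dictionary keys are hypothetical.

```python
SAMPLES_PER_BUCKET = 30 * 24 * 3600  # one sample per second for a 30-day month
STOCK_BUCKETS = 4
DECAY_BUCKETS = 8


def monthly_plan(month: int) -> dict:
    """Return the bucket moves and minimum deletions for a 1-based month."""
    plan = {
        "acquire": f"A{month}",
        "to_stock": f"A{month - 1}" if month >= 2 else None,
        "decay_deletions": {},
        "to_discard": None,
    }
    # Index of the bucket that enters the decay stage this month.
    entering = month - 1 - STOCK_BUCKETS
    for k in range(1, DECAY_BUCKETS + 1):
        index = entering - (k - 1)
        if index >= 1:
            # The k-th newest decay bucket deletes at least 1/2^k of a bucket.
            plan["decay_deletions"][f"A{index}"] = SAMPLES_PER_BUCKET // 2 ** k
    leaving = entering - DECAY_BUCKETS
    if leaving >= 1:
        plan["to_discard"] = f"A{leaving}"
    return plan


if __name__ == "__main__":
    for month in (6, 13, 14):
        print(month, monthly_plan(month))
```

Because 2592000 is divisible by 256, every quota in the table is a whole number of samples, down to 10125 for the oldest decay-stage bucket.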
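It can be understood that, with the technical solution of the present disclosure, each month one bucket of new samples is produced and a total of one bucket of samples is deleted at the same time, so the data no longer grows. The deleted samples are all old data with low access heat, which effectively avoids model aging, and because deletion is implemented with linked lists, the target samples can be deleted quickly. It is worth noting that in practice the ideal state, in which the amount of newly collected samples exactly equals the amount of deleted samples, may not be achieved.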
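In addition, referring to FIG. 15, FIG. 15 is a schematic structural diagram of a sample deletion apparatus provided by another embodiment of the present disclosure. An embodiment of the present disclosure further provides a sample deletion apparatus 1500, which includes a memory 1510, a processor 1520, and a computer program stored in the memory 1510 and executable on the processor 1520.
The processor 1520 and the memory 1510 may be connected by a bus or in another manner.
The non-transitory software programs and instructions required to implement the sample deletion method of the above embodiments are stored in the memory 1510 and, when executed by the processor 1520, perform the sample deletion method of the above embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9, as described above.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor or controller, for example by the processor 1520 in the above embodiment of the sample deletion apparatus 1500, the computer-executable instructions cause the processor 1520 to perform the time decay-based sample deletion method of the above embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9, as described above.
In addition, an embodiment of the present disclosure further provides a computer program product including a computer program or computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer program or computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the time decay-based sample deletion method of any of the preceding embodiments, for example, method steps S110 to S140 in FIG. 1, method step S210 in FIG. 2, method steps S310 to S320 in FIG. 3, method steps S410 to S420 in FIG. 4, method step S510 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method step S810 in FIG. 8, and method steps S910 to S940 in FIG. 9, as described above.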
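Embodiments of the present disclosure include a time decay-based sample deletion method, an apparatus therefor, and a storage medium. The time decay-based sample deletion method includes: acquiring a plurality of samples; saving the samples to a storage space, where the storage space has a corresponding storage attribute, the storage attribute changes with the storage time for which the storage space has held the samples, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a given storage attribute decays with the storage time; calculating the heat value of each sample in the storage space belonging to a target storage attribute; and deleting samples from the storage space according to the heat values and the current preset deletion capacity of the storage space. The solution provided by the embodiments of the present disclosure deletes old samples according to both their heat values and the preset deletion capacity of the storage space. Compared with the aforementioned conventional approach, which deletes samples solely according to how long they have been stored, the technical solution of the present application can retain the valuable data among old samples and thereby effectively improve sample quality.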
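Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The foregoing describes implementations of the present disclosure in detail, but the present disclosure is not limited to the above embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the essence of the present disclosure, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present disclosure.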

Claims (14)

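  1. A time decay-based sample deletion method, comprising:
    acquiring a plurality of samples;
    saving the samples to a storage space, wherein the storage space has a corresponding storage attribute, the storage attribute changes with the storage time for which the storage space has held the samples, different storage attributes correspond to different preset deletion capacities, and the preset deletion capacity corresponding to a same storage attribute decays with the storage time;
    calculating a heat value of each of the samples in the storage space belonging to a target storage attribute;
    deleting the samples in the storage space according to the heat values and a current preset deletion capacity of the storage space.
  2. The method according to claim 1, wherein there are a plurality of the storage spaces, and the deleting the samples in the storage space according to the heat values and the current preset deletion capacity of the storage space comprises:
    deleting the samples in each of the storage spaces correspondingly according to the heat value of each of the samples in each of the storage spaces and the current preset deletion capacity of each of the storage spaces, so that a total deleted amount of the samples equals an acquired amount of the samples.
  3. The method according to claim 1, wherein after the calculating the heat value of each of the samples in the storage space belonging to the target storage attribute, the method further comprises:
    partitioning the storage space into a plurality of linked lists according to the heat values, each linked list having a corresponding linked list identifier;
    saving the samples to the corresponding linked lists according to the linked list identifiers and the heat values.
  4. The method according to claim 3, wherein the deleting the samples in the storage space according to the heat values and the current preset deletion capacity of the storage space comprises:
    determining a first heat threshold according to the linked list identifiers and the preset deletion capacity;
    deleting the samples whose heat values are less than the first heat threshold.
  5. The method according to claim 4, wherein before the deleting the samples whose heat values are less than the first heat threshold, the method further comprises:
    when a change in the heat value of a sample in the linked lists is detected, scheduling the sample whose heat value has changed to a new linked list according to the changed heat value.
  6. The method according to claim 1, wherein the deleting the samples in the storage space according to the heat values and the current preset deletion capacity of the storage space comprises:
    determining target samples in the storage space according to the current preset deletion capacity, wherein the heat values of the target samples are less than a preset second heat threshold;
    deleting the target samples.
  7. The method according to claim 1, wherein the plurality of samples have different sample types, different sample types correspond to different sample type values, and the calculating the heat value of each of the samples in the storage space belonging to the target storage attribute comprises:
    obtaining a sample ratio value according to the samples having the different sample types;
    obtaining the heat value of a current sample according to the sample type value of the current sample, the sample ratio value, an access count of the current sample, and a historical heat value of the current sample.
  8. The method according to claim 1, wherein there are a plurality of the storage spaces, each storage space has a storage sequence number, and the storage sequence number of the storage space changes with the order in which the storage spaces are put into use; different storage attributes correspond to different maximum numbers of storage spaces; and the method further comprises:
    when the number of storage spaces having a same storage attribute is greater than or equal to the corresponding maximum number of storage spaces, changing the storage attribute of the corresponding storage space in the order of the storage sequence numbers.
  9. The method according to claim 1, wherein the storage attributes comprise a first storage attribute, a second storage attribute, a third storage attribute, and a fourth storage attribute; the first storage attribute indicates that the storage space is in a sample acquisition stage; the second storage attribute indicates that the storage space is in a stock sample stage; the third storage attribute indicates that the storage space is in a decay sample stage; the fourth storage attribute indicates that the storage space is in a discard sample stage;
    before the saving the samples to the storage space, the method further comprises:
    dividing a storage device into a first storage area, a second storage area, a third storage area, and a fourth storage area, wherein the first storage area corresponds to the first storage attribute, the second storage area corresponds to the second storage attribute, the third storage area corresponds to the third storage attribute, and the fourth storage area corresponds to the fourth storage attribute;
    the saving the samples to the storage space comprises:
    saving the samples to a storage space in the first storage area;
    when the number of storage spaces in the first storage area is greater than or equal to a maximum number of storage spaces of the first storage area, changing the storage space in the first storage area that has held samples for the longest storage time to belong to the second storage area.
  10. The method according to claim 9, wherein after the changing the storage space in the first storage area that has held samples for the longest storage time to belong to the second storage area, the method further comprises:
    when the number of storage spaces in the second storage area is greater than or equal to a maximum number of storage spaces of the second storage area, changing the storage space in the second storage area that has held samples for the longest storage time to belong to the third storage area.
  11. The method according to claim 10, wherein after the changing the storage space in the second storage area that has held samples for the longest storage time to belong to the third storage area, the method further comprises:
    when the number of storage spaces in the third storage area is greater than or equal to a maximum number of storage spaces of the third storage area, changing the storage space in the third storage area that has held samples for the longest storage time to belong to the fourth storage area.
  12. The method according to claim 9, wherein:
    when the storage space belongs to the first storage area, the preset deletion capacity is 0;
    or,
    when the storage space belongs to the second storage area, the preset deletion capacity is 0;
    or,
    when the storage space belongs to the third storage area, the preset deletion capacity is determined according to the current storage time for which the storage space has held the samples in the third storage area;
    or,
    when the current storage attribute of the storage space is the fourth storage attribute, the preset deletion capacity is determined according to the number of storage spaces belonging to the third storage area.
  13. A sample deletion apparatus, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the time decay-based sample deletion method according to any one of claims 1 to 12.
  14. A computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the time decay-based sample deletion method according to any one of claims 1 to 12.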
PCT/CN2023/076554 2022-02-18 2023-02-16 Time decay-based sample deletion method and apparatus therefor, and storage medium WO2023155849A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210153039.1A CN114510474B (zh) 2022-02-18 2022-02-18 Time decay-based sample deletion method and apparatus therefor, and storage medium
CN202210153039.1 2022-02-18

Publications (1)

Publication Number Publication Date
WO2023155849A1 true WO2023155849A1 (zh) 2023-08-24

Family

ID=81552319

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/076554 WO2023155849A1 (zh) 2022-02-18 2023-02-16 Time decay-based sample deletion method and apparatus therefor, and storage medium

Country Status (2)

Country Link
CN (1) CN114510474B (zh)
WO (1) WO2023155849A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
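CN114510474B (zh) * 2022-02-18 2024-06-18 中兴通讯股份有限公司 Time decay-based sample deletion method and apparatus therefor, and storage medium
CN116129227B (zh) * 2023-04-12 2023-09-01 合肥的卢深视科技有限公司 Model training method and apparatus, electronic device, and computer-readable storage medium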

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120330954A1 (en) * 2011-06-27 2012-12-27 Swaminathan Sivasubramanian System And Method For Implementing A Scalable Data Storage Service
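CN110334059A (zh) * 2018-02-11 2019-10-15 北京京东尚科信息技术有限公司 Method and apparatus for processing files
WO2021008024A1 (zh) * 2019-07-12 2021-01-21 平安科技(深圳)有限公司 Data processing method, apparatus, and server
CN113867645A (zh) * 2021-09-30 2021-12-31 苏州浪潮智能科技有限公司 Data migration and data read/write methods and apparatus, computer device, and storage medium
CN114510474A (zh) * 2022-02-18 2022-05-17 中兴通讯股份有限公司 Time decay-based sample deletion method and apparatus therefor, and storage medium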

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2890596C (en) * 2011-11-07 2017-01-03 Nexgen Storage, Inc. Primary data storage system with deduplication
CN105589926A (zh) * 2015-11-27 2016-05-18 深圳市美贝壳科技有限公司 Method for real-time cleaning of cached files on a mobile terminal
CN105573682B (zh) * 2016-02-25 2018-10-30 浪潮(北京)电子信息产业有限公司 SAN storage system and data read/write method thereof
US10921997B2 (en) * 2018-09-07 2021-02-16 Getac Technology Corporation Information capture device and control method thereof

Also Published As

Publication number Publication date
CN114510474A (zh) 2022-05-17
CN114510474B (zh) 2024-06-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23755850

Country of ref document: EP

Kind code of ref document: A1