WO2022088374A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2022088374A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
centroids
data blocks
block
data
Prior art date
Application number
PCT/CN2020/132911
Other languages
French (fr)
Chinese (zh)
Inventor
占志刚
程威
Original Assignee
北京泽石科技有限公司
泽石科技(武汉)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京泽石科技有限公司, 泽石科技(武汉)有限公司
Publication of WO2022088374A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system

Definitions

  • the present disclosure relates to the field of data processing, and in particular, to a data processing method and apparatus.
  • NAND Flash block management includes the use of clean blocks and the recovery of dirty blocks, mainly in terms of wear leveling, including dynamic leveling (garbage collection, etc.) and static leveling. The implementation of the two may need to be adjusted according to requirements.
  • the embodiments of the present disclosure provide a data processing method and apparatus that at least address a technical problem in the related art: existing ways of managing data storage blocks measure only a limited set of block features, which leads to poor consistency and generality of data block processing.
  • a data processing method is provided, including: establishing a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to multiple conditions for screening data blocks; determining the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space; clustering a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids; and selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions.
  • determining the multi-dimensional space corresponding to the multi-dimensional function model and the plurality of centroids of the multi-dimensional space includes: determining the number of centroids according to the number of multi-dimensional indices of the multi-dimensional function model, wherein the number of centroids is one more than the number of multi-dimensional indices; and determining the coordinates of that number of centroids in the multi-dimensional space, wherein each centroid is an endpoint of the range of the plurality of data blocks in the multi-dimensional space and lies on a coordinate axis of the multi-dimensional space.
  • clustering the plurality of data blocks by a clustering algorithm to obtain the plurality of block clusters corresponding to the plurality of centroids includes: determining the coordinates of the plurality of data blocks in the multi-dimensional space; weighting the coordinates of the plurality of data blocks; calculating, from the weighted coordinates, the Euclidean distance between each data block and each of the plurality of centroids; clustering the plurality of data blocks using the magnitude of the Euclidean distance as the clustering condition; and obtaining the plurality of block clusters corresponding to the plurality of centroids.
  • selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions includes: determining, according to the multiple conditions, the target multi-dimensional indices that satisfy them; determining, according to the target multi-dimensional indices, the target centroid whose coordinates correspond to them; and taking the block cluster corresponding to the target centroid as the target block cluster.
  • the method further includes: determining the actual centroid according to the coordinates of the data blocks in the target block cluster; and performing the clustering of subsequent data blocks according to the actual centroid.
  • each condition is a screening condition that meets the requirements of a data processing method; the data processing method includes at least one of the following: performing a write operation on the plurality of data blocks; performing a reclaim operation on the plurality of data blocks.
  • the multiple screening conditions include at least one of the following: the number of valid pages of the data block is 0 or less than a preset number; the wear degree of the data block is less than a preset wear degree; the heat of the data block is 0 or less than a preset heat.
  • a data processing apparatus is provided, comprising: an establishing module configured to establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to multiple conditions for screening data blocks; a determination module configured to determine the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space; a clustering module configured to cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids; and a selection module configured to select, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions.
  • a computer storage medium is also provided, where the computer storage medium includes a stored program, and when the program runs, a device where the computer storage medium is located is controlled to execute any one of the data processing methods described above.
  • a processor is further provided, and the processor is configured to run a program, wherein when the program runs, any one of the data processing methods described above is executed.
  • a multi-dimensional function model is established, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to multiple conditions for screening data blocks; the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space, are determined; according to the plurality of centroids, a plurality of data blocks are clustered by a clustering algorithm to obtain a plurality of block clusters corresponding to the plurality of centroids; and the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions is selected as the target block cluster.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of clustering in a multi-dimensional space according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a storage data block management method according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • a method embodiment of a data processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
  • FIG. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1 , the method includes the following steps:
  • Step S102: establishing a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to multiple conditions for screening data blocks;
  • Step S104: determining the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space;
  • Step S106: clustering a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids;
  • Step S108: selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions.
  • a multi-dimensional function model is established, wherein the multi-dimensional function model includes multi-dimensional indices that respectively correspond to multiple conditions for screening data blocks; the multi-dimensional space corresponding to the model and a plurality of centroids of that space are determined; a plurality of data blocks are clustered by a clustering algorithm to obtain a plurality of block clusters corresponding to the centroids; and the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions is selected as the target block cluster. In this way the data blocks can be screened quickly across multiple dimensions. This improves the efficiency of measuring data blocks and the consistency and generality of data block processing, and thereby addresses the technical problem in the related art that existing management of data storage blocks measures only limited block features, which results in poor consistency and generality of data block processing.
  • the multi-dimensional indices are the conditions or indicators, in multiple dimensions, that are used to screen data blocks. For example, when writing to a clean data block or reclaiming a dirty data block, three factors need to be considered: the number of valid pages of the data block; the wear degree of the data block, that is, its program/erase count; and the heat conversion value of the data block. The difference between the initial programming time of the data block and the current time is M; the larger the difference, the lower the heat. By the principle of locality, data that has just been moved is likely to be written again soon, so blocks with lower heat are usually selected for garbage collection. With N set to a fixed constant, the heat value equals N - M.
  • a block with 0 valid pages, a low degree of wear, and a heat of 0 (an empty block) is preferentially selected.
  • a block with fewer valid pages, less wear, and lower heat is selected.
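The three features named above can be turned into a coordinate triple per block. The following Python sketch is illustrative only: the class name `BlockFeatures`, the field names, the default value of N, the clamp at zero, and the treatment of a never-programmed block as heat 0 are assumptions made for this example, not details stated in the source.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value for the fixed constant N; the source does not give one.
HEAT_CONSTANT_N = 1_000_000


@dataclass
class BlockFeatures:
    valid_pages: int                    # x: number of valid pages
    erase_count: int                    # y: wear degree (program/erase count)
    first_program_time: Optional[int]   # None for a never-programmed block


def heat(block: BlockFeatures, now: int, n: int = HEAT_CONSTANT_N) -> int:
    """Heat value N - M, where M is the age of the block's data."""
    if block.first_program_time is None:
        return 0  # assumption: an empty block has heat 0
    m = now - block.first_program_time
    return max(0, n - m)  # assumption: clamp so the index stays non-negative


def coordinates(block: BlockFeatures, now: int, n: int = HEAT_CONSTANT_N):
    """Return the block as a point (x, y, z) in the three-dimensional space."""
    return (block.valid_pages, block.erase_count, heat(block, now, n))
```

A block programmed long ago yields a large M and therefore a small heat value, which matches the statement that older data is colder.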
  • the multi-dimensional space corresponding to the multi-dimensional function model can be a space built by using the index of each dimension as a coordinate axis. The feature values of each data block on the multi-dimensional indices serve as its coordinate values on the corresponding axes, so that the data blocks are represented as points in the multi-dimensional space.
  • the number of the above-mentioned centroids is one more than the dimension of the space.
  • here x equals the maximum, across the plurality of data blocks, of the index corresponding to the x-axis; similarly, y equals the maximum of the index corresponding to the y-axis, and z equals the maximum of the index corresponding to the z-axis.
  • centroid coordinates in spaces of other dimensions follow by analogy: the centroids can be the coordinate origin of the multi-dimensional space together with, on each coordinate axis, the point at the maximum value taken by the plurality of data blocks; the centroids lie on the coordinate axes of the multi-dimensional space.
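As a minimal sketch of this construction, the function below builds the origin plus one point per axis at the maximum observed value, giving one more centroid than there are dimensions. The function name `initial_centroids` and the tuple representation of points are assumptions for illustration.

```python
def initial_centroids(points):
    """Return the origin plus, for each axis, the point at that axis's maximum.

    `points` is a non-empty list of equal-length coordinate tuples.
    """
    dims = len(points[0])
    maxima = [max(p[d] for p in points) for d in range(dims)]

    centroids = [tuple(0 for _ in range(dims))]  # the coordinate origin
    for d in range(dims):
        on_axis = [0] * dims
        on_axis[d] = maxima[d]                   # lies on axis d only
        centroids.append(tuple(on_axis))
    return centroids
```

For three indices this yields four centroids, which is consistent with the four clusters mentioned elsewhere in the text.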
  • the above-mentioned clustering algorithm may be a K-means clustering algorithm.
  • multiple data blocks can be clustered into data block clusters corresponding to multiple centroids, that is, a set of data blocks.
  • the plurality of data blocks are thus classified, and the data block cluster of a centroid is determined to be the target block cluster according to whether the multi-dimensional indices of that centroid satisfy the multiple conditions for screening data blocks.
  • in this way the data blocks can be screened quickly across multiple dimensions, which improves the efficiency of measuring data blocks and the consistency and generality of data block processing, and thereby addresses the technical problem in the related art of poor consistency and generality of data block processing caused by the limited feature measurement of data blocks in existing management of data storage blocks.
  • determining the multi-dimensional space corresponding to the multi-dimensional function model and the plurality of centroids of the multi-dimensional space includes: determining the number of centroids according to the number of multi-dimensional indices of the multi-dimensional function model, wherein the number of centroids is one more than the number of multi-dimensional indices; and determining the coordinates of that number of centroids in the multi-dimensional space, wherein each centroid is an endpoint of the range of the plurality of data blocks in the multi-dimensional space and lies on a coordinate axis of the multi-dimensional space.
  • the above range endpoints can be the maximum or minimum value of the coordinate range. It mainly depends on the situation of the coordinate system and the distribution of multiple data blocks in the coordinate system. In this embodiment, when writing a clean data block or recycling a dirty data block, the indicators of the data block are all non-negative values, so the above range endpoint may be the maximum value of the coordinate range of the coordinate axis.
  • clustering a plurality of data blocks by a clustering algorithm according to a plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids, includes: determining the coordinates of the plurality of data blocks in the multi-dimensional space; weighting the coordinates of the plurality of data blocks; calculating, from the weighted coordinates, the Euclidean distance between each data block and each of the plurality of centroids; clustering the plurality of data blocks using the magnitude of the Euclidean distance as the clustering condition; and obtaining the plurality of block clusters corresponding to the plurality of centroids.
  • the conventional K-means clustering algorithm usually selects the Euclidean distance from the target point to the centroid as a condition for measuring whether or not to cluster.
  • the Euclidean distance between the weighted coordinates of a data block and the coordinates of a centroid is used as the measure of whether the data block can be clustered into the data block cluster of that centroid. For example, in the three-dimensional space above, after the coordinates (x, y, z) are weighted, the logical coordinates actually used by the algorithm are (ax, by, cz), where a, b, and c are the weights of the corresponding features and a + b + c = 1. By adjusting the feature weights, all data blocks can be reasonably distributed among the sets corresponding to the 4 centroids.
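The weighted distance can be illustrated with a short Python sketch. It assumes that the same per-axis weights are applied to the centroid coordinates as to the block coordinates; the source only states that block coordinates are weighted, so this is one possible reading. The function names are hypothetical.

```python
import math


def weighted_distance(point, centroid, weights):
    """Euclidean distance after scaling each axis by its feature weight."""
    return math.sqrt(
        sum((w * (p - c)) ** 2 for p, c, w in zip(point, centroid, weights))
    )


def nearest_centroid(point, centroids, weights):
    """Index of the centroid closest to `point` under the weighted distance."""
    return min(
        range(len(centroids)),
        key=lambda i: weighted_distance(point, centroids[i], weights),
    )
```

Raising one weight relative to the others makes differences along that axis count for more, which is how the distribution of blocks among the clusters can be tuned.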
  • selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions includes: determining, according to the multiple conditions, the target multi-dimensional indices that satisfy them; determining the target centroid corresponding to the target multi-dimensional indices; and taking the block cluster corresponding to the target centroid as the target block cluster.
  • a block with 0 valid pages, a low degree of wear, and a heat of 0, that is, an empty block, is preferentially selected.
  • the data block cluster corresponding to the centroid (1) is the target data block cluster.
  • the method further includes: determining the actual centroid according to the coordinates of the data blocks in the target block cluster; and performing the clustering of subsequent data blocks according to the actual centroid.
  • the actual centroid is recalculated from the coordinates of the points corresponding to the data blocks in the target data block cluster; it can be calculated as the average of the coordinates of all data blocks in the cluster.
  • for the rest of the life cycle of the solid-state storage device, data block management takes the actual centroid as its starting point.
  • the centroids determined above for the multi-dimensional space corresponding to the multi-dimensional function model can serve as the initial centroids. The initial centroids all lie on the coordinate axes of the multi-dimensional space, whereas the actual centroid may not.
  • the next time data blocks are screened, the actual centroid can be used as the origin of the multi-dimensional coordinate system for that screening. It should be noted that the centroids and the target block cluster can be updated iteratively with each screening of data blocks, so that they remain valid and accurate for the next use.
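Recalculating the actual centroid as a coordinate-wise average can be sketched as follows. The function name `actual_centroid` and the choice to reject an empty cluster are assumptions for illustration.

```python
def actual_centroid(cluster_points):
    """Coordinate-wise mean of the points belonging to one block cluster."""
    if not cluster_points:
        raise ValueError("cannot compute the centroid of an empty cluster")
    dims = len(cluster_points[0])
    count = len(cluster_points)
    return tuple(sum(p[d] for p in cluster_points) / count for d in range(dims))
```

Unlike the initial centroids, the result generally does not lie on a coordinate axis, because it averages over every feature of every member block.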
  • the above-mentioned conditions on the multi-dimensional indices are screening conditions that meet the requirements of a data processing method; the data processing method includes at least one of the following: performing a write operation on the plurality of data blocks; performing a reclaim operation on the plurality of data blocks.
  • the multiple screening conditions include at least one of the following: the number of valid pages of the data block is 0 or less than a preset number; the wear degree of the data block is less than a preset wear degree; the heat of the data block is 0 or less than a preset heat.
  • the purpose of this embodiment is to solve the problem that the existing general algorithms cannot take into account wear leveling and writing efficiency, and provide a general block management method that takes into account multiple target characteristics.
  • the purpose of this embodiment is to divide the samples into several categories (clusters) according to the similarity of comprehensive features, and select a category with better comprehensive features as the target during writing and garbage collection.
  • FIG. 3 is a flowchart of a method for managing storage data blocks according to an embodiment of the present disclosure.
  • x is the number of valid pages of the block
  • y is the wear degree of the block, that is, the number of program and/or erase cycles
  • z is the heat conversion value of the block
  • the difference between the initial programming time of the block and the current time is M; the larger the difference, the lower the heat.
  • All NAND Flash blocks in a solid-state storage device can be regarded as discrete points in a multi-dimensional space, such as a three-dimensional space, and these points have characteristic values such as x, y, and z.
  • the initial centroid is selected as the cluster center.
  • the conventional K-means clustering algorithm usually selects the Euclidean distance from the target point to the centroid as a condition for measuring whether or not to cluster, and the present disclosure uses the weighted Euclidean distance as the measuring condition.
  • the actual algorithm logical coordinates are (ax, by, cz).
  • K-means clustering is performed. After the K-means clustering is completed, 4 NAND Flash block clusters (sets) are obtained.
  • after the initial clusters are obtained, the initial centroids are no longer used. The centroid of each cluster is then recalculated as the average of the coordinates of all elements in the cluster. Once the initial clusters and centroids are obtained, block management for the rest of the life cycle of the solid-state storage device starts from them.
  • the cluster is the set of optimal solutions obtained after weighing all the features together. Selecting blocks from this cluster for writing and garbage collection lets the solid-state storage device reach a relatively good wear-leveling state while also taking read and write performance into account.
  • block elements automatically join or leave the corresponding cluster, and the position of the centroid is adjusted in real time.
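One way to keep a centroid current as blocks join or leave a cluster is to maintain running coordinate sums, so each update costs time proportional to the number of dimensions rather than the cluster size. The source does not say how the real-time adjustment is implemented; the class `RunningCentroid` below is a hypothetical sketch of that idea.

```python
class RunningCentroid:
    """Centroid of a cluster, updated incrementally (illustrative only)."""

    def __init__(self, dims):
        self.count = 0
        self.sums = [0.0] * dims

    def add(self, point):
        """A block joins the cluster."""
        self.count += 1
        for d, value in enumerate(point):
            self.sums[d] += value

    def remove(self, point):
        """A block leaves the cluster; the caller must have added it before."""
        if self.count == 0:
            raise ValueError("cluster is already empty")
        self.count -= 1
        for d, value in enumerate(point):
            self.sums[d] -= value

    def position(self):
        """Current centroid, or None while the cluster is empty."""
        if self.count == 0:
            return None
        return tuple(s / self.count for s in self.sums)
```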
  • the clusters obtained from the coordinates (2), (3), and (4) and their subsequent centroids are collections of blocks with poor feature values, which are all poor solutions for the target operation of the storage device.
  • FIG. 2 is a schematic diagram of clustering in a multi-dimensional space according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 4, according to another aspect of the embodiments of the present disclosure, a data processing apparatus is further provided, including an establishing module 42, a determination module 44, a clustering module 46, and a selection module 48, which are described in detail below.
  • the establishing module 42 is configured to establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to multiple conditions for screening data blocks;
  • the determination module 44 is connected to the establishing module 42 and is configured to determine the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space;
  • the clustering module 46 is connected to the determination module 44 and is configured to cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids;
  • the selection module 48 is configured to select, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the multiple conditions.
  • with this apparatus, the establishing module 42 establishes the multi-dimensional function model, the determination module 44 determines the corresponding multi-dimensional space and its centroids, the clustering module 46 clusters the data blocks around those centroids, and the selection module 48 selects the target block cluster. By clustering around the centroids of the multi-dimensional function model and thereby classifying the data blocks, the data blocks can be screened quickly across multiple dimensions. This improves the efficiency of measuring data blocks and the consistency and generality of data block processing, and thereby addresses the technical problem in the related art that existing management of data storage blocks measures only limited block features, which results in poor consistency and generality of data block processing.
  • a computer storage medium includes a stored program, wherein when the program is executed, a device where the computer storage medium is located is controlled to execute any one of the data processing methods described above.
  • a processor is also provided, and the processor is configured to run a program, wherein when the program runs, any one of the data processing methods described above is executed.
  • the disclosed technical content may be implemented in other manners.
  • the device embodiments described above are only illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
  • the units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes: a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method and apparatus. The method comprises: establishing a multi-dimensional function model, wherein the multi-dimensional function model comprises multi-dimensional indexes, and the multi-dimensional indexes respectively correspond to a plurality of conditions for screening data blocks (S102); determining a multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space (S104); according to the plurality of centroids, clustering a plurality of data blocks by means of a clustering algorithm, so as to obtain a plurality of block clusters corresponding to the plurality of centroids (S106); and selecting a block cluster corresponding to a centroid, of which the multi-dimensional index meets a plurality of conditions, as a target block cluster (S108). By means of the method, the technical problem in the related art of poor consistency and universality of data block processing due to characteristic measurement of data blocks being limited in terms of management manner of data storage blocks is solved.

Description

数据处理方法及装置Data processing method and device
本公开以2020年10月30日递交的、申请号为202011193950.2且名称为“数据处理方法及装置”的中国专利文件为优先权文件,其全部内容通过引用结合在本公开中。The present disclosure takes the Chinese patent document with the application number of 202011193950.2 and the title of “Data Processing Method and Apparatus” filed on October 30, 2020 as a priority document, the entire contents of which are incorporated into the present disclosure by reference.
技术领域technical field
本公开涉及数据处理领域,具体而言,涉及一种数据处理方法及装置。The present disclosure relates to the field of data processing, and in particular, to a data processing method and apparatus.
Background
NAND Flash block management covers the use of clean blocks and the reclamation of dirty blocks. It is mainly reflected in wear leveling, which includes dynamic leveling (garbage collection and the like) and static leveling; the implementation of both may need to be adjusted according to requirements.
Using clean blocks is relatively simple: only the wear degree of the block needs to be considered. For garbage collection of dirty blocks, classic algorithms include the Greedy policy, the Cost-benefit policy, and the Cost-Age-Times (CAT) policy. These algorithms measure only a limited set of block characteristics and are mostly guided by experience, so it is difficult for them to find a global optimum, and they may lack consistency and universality.
No effective solution to the above problem has yet been proposed.
Summary
Embodiments of the present disclosure provide a data processing method and apparatus, to at least solve the technical problem in the related art that data storage block management measures only a limited set of data block characteristics, resulting in poor consistency and universality of data block processing.
According to one aspect of the embodiments of the present disclosure, a data processing method is provided, including: establishing a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to a plurality of conditions for screening data blocks; determining a multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space; clustering a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids; and selecting, as a target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
In some embodiments of the disclosure, determining the multi-dimensional space corresponding to the multi-dimensional function model, and the plurality of centroids of the multi-dimensional space, includes: determining the number of centroids according to the number of multi-dimensional indices of the multi-dimensional function model, wherein the number of centroids is one more than the number of multi-dimensional indices; and determining the coordinates of that number of centroids in the multi-dimensional space, wherein each centroid is a range endpoint of the plurality of data blocks in the multi-dimensional space and lies on a coordinate axis of the multi-dimensional space.
In some embodiments of the disclosure, clustering the plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain the plurality of block clusters corresponding to the plurality of centroids, includes: determining the coordinates of the plurality of data blocks in the multi-dimensional space; weighting the coordinates of the plurality of data blocks; calculating, from the weighted coordinates, the Euclidean distance between each data block and each of the plurality of centroids; clustering the plurality of data blocks using the magnitude of the Euclidean distance as the clustering condition; and obtaining the plurality of block clusters corresponding to the plurality of centroids.
In some embodiments of the disclosure, selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions includes: determining, according to the plurality of conditions, target multi-dimensional indices that satisfy the plurality of conditions; determining, according to the target multi-dimensional indices, a target centroid whose coordinates correspond to the target multi-dimensional indices; and taking the block cluster corresponding to the target centroid as the target block cluster.
In some embodiments of the disclosure, after selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions, the method further includes: determining an actual centroid according to the coordinates of the data blocks in the target block cluster; and performing clustering of subsequent data blocks according to the actual centroid.
In some embodiments of the disclosure, the conditions are screening conditions that meet the requirements of a data processing mode; the data processing mode includes at least one of the following: performing a write operation on the plurality of data blocks; performing a reclamation operation on the plurality of data blocks.
In some embodiments of the disclosure, the plurality of screening conditions include at least one of the following: the number of valid pages of the data block is 0 or is less than a preset number; the wear degree of the data block is less than a preset wear degree; the heat of the data block is 0 or is less than a preset heat.
According to another aspect of the embodiments of the present disclosure, a data processing apparatus is further provided, including: an establishing module, configured to establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to a plurality of conditions for screening data blocks; a determining module, configured to determine a multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space; a clustering module, configured to cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids; and a selecting module, configured to select, as a target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
According to another aspect of the embodiments of the present disclosure, a computer storage medium is further provided. The computer storage medium includes a stored program, wherein, when the program runs, the device on which the computer storage medium resides is controlled to execute any one of the data processing methods described above.
According to another aspect of the embodiments of the present disclosure, a processor is further provided. The processor is configured to run a program, wherein the program, when run, executes any one of the data processing methods described above.
In the embodiments of the present disclosure, a multi-dimensional function model is established, wherein the multi-dimensional function model includes multi-dimensional indices that respectively correspond to a plurality of conditions for screening data blocks; a multi-dimensional space corresponding to the model, and a plurality of centroids of that space, are determined; a plurality of data blocks are clustered by a clustering algorithm according to the centroids, to obtain a plurality of block clusters corresponding to the centroids; and the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions is selected as the target block cluster. By establishing the model, selecting centroids within it for clustering, and classifying the data blocks, the data blocks can be screened quickly across multiple dimensions. This improves the efficiency of evaluating data blocks and the consistency and universality of data block processing, and thereby solves the technical problem in the related art that data storage block management measures only a limited set of data block characteristics, resulting in poor consistency and universality of data block processing.
Brief Description of the Drawings
The accompanying drawings described here provide a further understanding of the present disclosure and form a part of this application. The exemplary embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of clustering in a multi-dimensional space according to an implementation of the present disclosure;
FIG. 3 is a flowchart of a storage data block management method according to an implementation of the present disclosure;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
To help those skilled in the art better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present disclosure, without creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present disclosure are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present disclosure described here can be practiced in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present disclosure, a method embodiment of a data processing method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps:
Step S102: establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to a plurality of conditions for screening data blocks;
Step S104: determine the multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space;
Step S106: cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids;
Step S108: select, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
Through the above steps, a multi-dimensional function model is established, centroids are selected within the model for clustering, and the data blocks are classified, so that the data blocks can be screened quickly across multiple dimensions. This improves the efficiency of evaluating data blocks and the consistency and universality of data block processing, and thereby solves the technical problem in the related art that data storage block management measures only a limited set of data block characteristics, resulting in poor consistency and universality of data block processing.
The multi-dimensional function model may be d = f(x, y, z, ...), where x, y, z, ... are the multi-dimensional indices that affect the value of the function, that is, its independent variables. The multi-dimensional indices may be the conditions or metrics, in multiple dimensions, that apply when a data block is used, that is, the plurality of conditions for screening data blocks. For example, when writing to clean data blocks or reclaiming dirty data blocks, three factors need to be considered: the number of valid pages of the data block; the wear degree of the data block, that is, its program/erase count; and the heat conversion value of the data block. Let M be the difference between the time the data block was first programmed and the current time; the larger the difference, the lower the heat. Because of the locality principle, data that has just been moved may soon be written again, so blocks with lower heat are usually chosen for garbage collection. With N set to a fixed constant, the heat value equals N - M.
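As a minimal sketch of the three indices described above, a block could be represented as a point (x, y, z) as follows. The names `BlockFeatures`, `heat_value`, `make_features`, and the value of `HEAT_CONSTANT_N` are hypothetical and do not appear in the disclosure; clamping the heat at zero is likewise an assumption, made only so that every index stays non-negative.

```python
from dataclasses import dataclass

# Hypothetical value for the fixed constant N; the disclosure does not give one.
HEAT_CONSTANT_N = 1_000_000


@dataclass(frozen=True)
class BlockFeatures:
    valid_pages: int  # x: number of valid pages in the block
    wear: int         # y: wear degree (program/erase count)
    heat: int         # z: heat conversion value, N - M


def heat_value(first_program_time: int, now: int, n: int = HEAT_CONSTANT_N) -> int:
    """Return N - M, where M is the time elapsed since the block was first programmed."""
    m = now - first_program_time
    # Assumption: clamp at 0 so that very old blocks do not get a negative heat.
    return max(n - m, 0)


def make_features(valid_pages: int, erase_count: int,
                  first_program_time: int, now: int) -> BlockFeatures:
    """Build the (x, y, z) feature point for one block."""
    return BlockFeatures(valid_pages, erase_count,
                         heat_value(first_program_time, now))
```

A block programmed long ago has a large M and therefore a small heat value, which is what makes it a preferred garbage-collection candidate under this model.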
In some embodiments of the disclosure, when writing to clean data blocks, blocks with 0 valid pages, low wear, and 0 heat (empty blocks) are preferred. When reclaiming dirty data blocks, blocks with fewer valid pages, lower wear, and lower heat are selected. Both the write process and the reclamation process involve screening a large number of data blocks.
The multi-dimensional space corresponding to the multi-dimensional function may be built by using the index of each dimension as a coordinate axis. The feature value of a data block on each index serves as the block's coordinate on the corresponding axis, so that every data block is represented as a point in the multi-dimensional space.
In the multi-dimensional space, the number of centroids is one more than the number of dimensions. When the space is three-dimensional, there are 4 centroids: (1) (x=0, y=0, z=0); (2) (x=maximum, y=0, z=0); (3) (x=0, y=maximum, z=0); (4) (x=0, y=0, z=maximum). Here, x=maximum means the maximum value, among the plurality of data blocks, of the index on the x-axis; likewise, y=maximum and z=maximum mean the maximum values of the indices on the y-axis and z-axis among the plurality of data blocks.
By extension, the five centroids in a four-dimensional space are: (1) (w=0, x=0, y=0, z=0); (2) (w=maximum, x=0, y=0, z=0); (3) (w=0, x=maximum, y=0, z=0); (4) (w=0, x=0, y=maximum, z=0); (5) (w=0, x=0, y=0, z=maximum). The centroid coordinates of other multi-dimensional spaces follow the same pattern: the centroids are the origin of the multi-dimensional space and, on each coordinate axis, the point at the maximum value of the plurality of data blocks on that axis. The centroids lie on the coordinate axes of the multi-dimensional space.
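The construction of the initial centroids generalizes to any number of dimensions. The following is a small sketch under the assumption that each block is given as a tuple of non-negative feature values; the function name `initial_centroids` is hypothetical.

```python
def initial_centroids(points):
    """Return the origin plus one centroid per axis at that axis's maximum.

    points: non-empty list of equal-length tuples of non-negative values.
    For d dimensions this yields d + 1 centroids, all on the coordinate axes.
    """
    dims = len(points[0])
    # Maximum observed value of each index across all blocks.
    maxima = [max(p[d] for p in points) for d in range(dims)]

    centroids = [tuple(0.0 for _ in range(dims))]  # centroid (1): the origin
    for d in range(dims):
        coords = [0.0] * dims
        coords[d] = float(maxima[d])               # maximum on axis d, 0 elsewhere
        centroids.append(tuple(coords))
    return centroids
```

For three indices this produces the 4 centroids listed above, and for four indices the 5 centroids of the four-dimensional case.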
The clustering algorithm may be the K-means clustering algorithm. With this algorithm, the plurality of data blocks can be clustered into data block clusters, that is, sets of data blocks, corresponding to the plurality of centroids. The data blocks are thereby classified, and the data block cluster whose centroid has multi-dimensional indices satisfying the plurality of screening conditions is determined to be the target block cluster.
By establishing a multi-dimensional function model, selecting centroids within the model for clustering, and classifying the data blocks, the data blocks can be screened quickly across multiple dimensions. This improves the efficiency of evaluating data blocks and the consistency and universality of data block processing, and thereby solves the technical problem in the related art that data storage block management measures only a limited set of data block characteristics, resulting in poor consistency and universality of data block processing.
In some embodiments of the disclosure, determining the multi-dimensional space corresponding to the multi-dimensional function model, and the plurality of centroids of the multi-dimensional space, includes: determining the number of centroids according to the number of multi-dimensional indices of the multi-dimensional function model, wherein the number of centroids is one more than the number of multi-dimensional indices; and determining the coordinates of that number of centroids in the multi-dimensional space, wherein each centroid is a range endpoint of the plurality of data blocks in the multi-dimensional space and lies on a coordinate axis of the multi-dimensional space.
When determining the initial centroids, both the number of centroids and their coordinates must be determined, the coordinates being derived from the maximum values of the plurality of data blocks on each coordinate axis. A range endpoint may be the maximum or the minimum of a coordinate range, depending on the coordinate system and on how the data blocks are distributed in it. In this embodiment, when writing to clean data blocks or reclaiming dirty data blocks, the metrics of the data blocks are all non-negative, so the range endpoint may be the maximum of the coordinate range on each axis.
In some embodiments of the disclosure, clustering the plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain the plurality of block clusters corresponding to the plurality of centroids, includes: determining the coordinates of the plurality of data blocks in the multi-dimensional space; weighting the coordinates of the plurality of data blocks; calculating, from the weighted coordinates, the Euclidean distance between each data block and each of the plurality of centroids; clustering the plurality of data blocks using the magnitude of the Euclidean distance as the clustering condition; and obtaining the plurality of block clusters corresponding to the plurality of centroids.
A conventional K-means clustering algorithm usually uses the Euclidean distance from a target point to a centroid as the criterion for cluster membership. This embodiment instead uses the Euclidean distance between the weighted coordinates of a data block and the centroid coordinates as the criterion for deciding whether the data block belongs to that centroid's cluster. For example, in the three-dimensional space above, the coordinates (x, y, z) become, after weighting, the logical coordinates (ax, by, cz) actually used by the algorithm, where a, b, and c are the weights of the corresponding features and a + b + c = 1. By adjusting the feature weights, all data blocks can be distributed reasonably among the sets corresponding to the 4 centroids.
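The weighted assignment step might be sketched as follows. This is an illustration, not the disclosed implementation: it assumes that the same weights scale both the block coordinates and the centroid coordinates, which is one reading of "weighted Euclidean distance", and the names `weighted_distance` and `assign_clusters` are hypothetical.

```python
import math


def weighted_distance(point, centroid, weights):
    """Euclidean distance after scaling each axis by its feature weight.

    Assumption: the weights (a, b, c, ...) apply to both the block and the
    centroid, so the distance is sqrt(sum((w * (p - c)) ** 2)).
    """
    return math.sqrt(sum((w * (p - c)) ** 2
                         for p, c, w in zip(point, centroid, weights)))


def assign_clusters(points, centroids, weights):
    """Assign every block to the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: weighted_distance(p, centroids[i], weights))
        clusters[nearest].append(p)
    return clusters
```

Raising the weight of one feature stretches its axis, so blocks that differ mainly in that feature are pushed into different clusters; this is the lever the text refers to when it speaks of adjusting the weights to balance the clusters.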
In some embodiments of the disclosure, selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions includes: determining, according to the plurality of conditions, target multi-dimensional indices that satisfy the plurality of conditions; determining, according to the target multi-dimensional indices, a target centroid whose coordinates correspond to the target multi-dimensional indices; and taking the block cluster corresponding to the target centroid as the target block cluster.
For example, when writing to clean data blocks, blocks with 0 valid pages, low wear, and 0 heat, that is, empty blocks, are preferred. When reclaiming dirty data blocks, blocks with fewer valid pages, lower wear, and lower heat are selected. In both cases, centroid (1) (x=0, y=0, z=0) is selected as the target centroid: the blocks in the cluster it represents meet the requirements better than the blocks in the clusters represented by the other centroids. The data block cluster corresponding to centroid (1) is therefore the target data block cluster.
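One possible way to express the selection of the target cluster is sketched below. The disclosure only states that the target centroid's coordinates correspond to the target indices; picking the centroid nearest to a target point is an assumption made for illustration, and `select_target_cluster` is a hypothetical name.

```python
def select_target_cluster(centroids, clusters, target=None):
    """Return (index, cluster) for the centroid closest to the target indices.

    target defaults to the origin, matching centroid (1) in the text, where
    fewer valid pages, lower wear, and lower heat are all preferred.
    """
    dims = len(centroids[0])
    if target is None:
        target = (0,) * dims
    # Assumption: "corresponds to" is approximated by smallest squared distance.
    idx = min(range(len(centroids)),
              key=lambda i: sum((c - t) ** 2
                                for c, t in zip(centroids[i], target)))
    return idx, clusters[idx]
```

With the initial centroids, the origin matches the default target exactly, so the first cluster is returned; after the centroids have been updated, the same call still picks whichever centroid lies closest to the preferred index values.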
In some embodiments of the disclosure, after selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions, the method further includes: determining an actual centroid according to the coordinates of the data blocks in the target block cluster; and performing clustering of subsequent data blocks according to the actual centroid.
The actual centroid is recalculated from the coordinates of the points corresponding to the data blocks in the target data block cluster: averaging the coordinates of all data blocks in the cluster gives the actual centroid of that cluster. Once the target data block cluster and the actual centroid have been obtained, the actual centroid serves as the starting point for data block management throughout the remaining life cycle of the solid-state storage device. In contrast to the actual centroid, the centroids determined together with the multi-dimensional space above are initial centroids. The initial centroids all lie on the coordinate axes of the multi-dimensional space, whereas an actual centroid does not necessarily lie on a coordinate axis. For example, the next time data blocks are screened, the actual centroid may be used as the origin of the multi-dimensional coordinate system. It should be noted that the centroids and the target block cluster may be updated iteratively with each round of data block screening, to keep them valid and accurate for the next use.
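The averaging step that yields the actual centroid can be sketched in a few lines. The function name `actual_centroid` is hypothetical, and raising an error on an empty cluster is an assumption about how that edge case might be handled.

```python
def actual_centroid(cluster):
    """Return the per-axis mean of the block coordinates in one cluster."""
    if not cluster:
        # Assumption: an empty cluster has no meaningful centroid.
        raise ValueError("cannot compute the centroid of an empty cluster")
    dims = len(cluster[0])
    count = len(cluster)
    return tuple(sum(p[d] for p in cluster) / count for d in range(dims))
```

Because the result is an average over real block coordinates, it will generally lie off the coordinate axes, unlike the initial centroids.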
In some embodiments of the disclosure, the conditions are screening conditions that meet the requirements of a data processing mode; the data processing mode includes at least one of the following: performing a write operation on the plurality of data blocks; performing a reclamation operation on the plurality of data blocks.
In some embodiments of the disclosure, the plurality of screening conditions include at least one of the following: the number of valid pages of the data block is 0 or is less than a preset number; the wear degree of the data block is less than a preset wear degree; the heat of the data block is 0 or is less than a preset heat.
It should be noted that the embodiments of the present application further provide an implementation, which is described in detail below.
During normal use of the solid-state storage device, a number of centroids are selected as cluster centers in an unsupervised-learning manner, and all blocks are automatically aggregated into several classes (clusters). These centroid-centered clusters of NAND Flash blocks represent the sets of blocks available for use by the solid-state storage device.
The purpose of this implementation is to solve the problem that existing common algorithms cannot balance wear leveling against write efficiency and similar goals, by providing a general block management method that takes multiple target characteristics into account.
For a sample set D = {x_1, x_2, ..., x_m}, the K-means algorithm minimizes the squared error over the cluster partition C = {C_1, C_2, ..., C_k}:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$
where
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
is the mean vector of cluster C_i. The formula describes how tightly the samples in each cluster gather around the cluster mean vector: the smaller the value of E, the more similar the samples within each cluster.
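To make the squared-error criterion concrete, the value E for a given partition could be computed as in the sketch below. It uses unweighted coordinates, as in the standard K-means formulation; the name `squared_error` is hypothetical, and skipping empty clusters is an assumption.

```python
def squared_error(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # assumption: an empty cluster contributes nothing to E
        dims = len(cluster[0])
        # Mean vector of this cluster.
        mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
        total += sum((p[d] - mean[d]) ** 2
                     for p in cluster for d in range(dims))
    return total
```

Comparing E before and after reassigning blocks gives a simple check on whether an iteration has tightened the clusters.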
本实施方式的目的就是按综合特征相似度将样本分成若干类(簇),写入和垃圾回收时选取综合特征较优的一类作为目标。The purpose of this embodiment is to divide the samples into several categories (clusters) according to the similarity of comprehensive features, and select a category with better comprehensive features as the target during writing and garbage collection.
图3是根据本公开实施方式的存储数据块管理方法的流程图,如图3所示,首先,建立多维函数模型d=f(x,y,z…),其中,x,y,z等为NAND Flash块的离散特征值。如x为块的有效页数量,y为块的磨损度,包括编程次数和/或擦除次数,z为块的热度转换值,块初次编程的时间与当前时间的差值为M,差值越大,热度越小,为避免局部性原理导致刚被搬走的数据很快又被写入,所以通常选用热度较小的块做垃圾回收,设定N为某固定常量,z=N-M。FIG. 3 is a flowchart of a method for managing storage data blocks according to an embodiment of the present disclosure. As shown in FIG. 3 , first, a multi-dimensional function model d=f(x, y, z...) is established, where x, y, z, etc. is the discrete eigenvalue of the NAND Flash block. For example, x is the number of valid pages of the block, y is the wear degree of the block, including the number of programming and/or erasing times, z is the heat conversion value of the block, and the difference between the initial programming time of the block and the current time is M, the difference value The larger the value, the smaller the heat. In order to avoid the locality principle causing the data that has just been moved to be written again quickly, the block with less heat is usually selected for garbage collection, and N is set to a fixed constant, z=N-M.
The principles for using blocks are: (1) for writing, select a block with 0 valid pages, low wear, and a hotness of 0, that is, an empty block; (2) for garbage collection, select a block with few valid pages, low wear, and low hotness.
All NAND Flash blocks in a solid-state storage device can be regarded as points scattered in a multi-dimensional space, for example a three-dimensional space, with each point having feature values such as x, y, and z.
Once the multi-dimensional space is determined, initial centroids are selected as the cluster centers. The number of centroids equals the model dimension plus 1; in the three-dimensional model, the number of centroids is K = 4. The coordinates of the 4 initial centroids are: (1) (x = 0, y = 0, z = 0); (2) (x = maximum, y = 0, z = 0); (3) (x = 0, y = maximum, z = 0); (4) (x = 0, y = 0, z = maximum). These are 4 points on the coordinate axes of the three-dimensional coordinate system (the origin plus one point on each axis), where "maximum" is the largest value of that feature across all blocks.
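The choice of initial centroids described above generalizes directly to any number of dimensions: the origin plus, for each axis, the point at that feature's maximum. A minimal sketch follows; the function name `initial_centroids` and the sample points are illustrative assumptions.

```python
def initial_centroids(points):
    """Return dims + 1 initial centroids: the origin, then one per axis at the feature maximum."""
    dims = len(points[0])
    maxima = [max(p[d] for p in points) for d in range(dims)]
    centroids = [tuple(0 for _ in range(dims))]  # centroid (1): the origin
    for d in range(dims):
        # centroids (2), (3), ...: maximum on axis d, zero on every other axis
        centroids.append(tuple(maxima[d] if i == d else 0 for i in range(dims)))
    return centroids


# Three blocks as (valid pages, wear, hotness) points.
blocks = [(3, 100, 40), (60, 20, 5), (0, 7, 0)]
print(initial_centroids(blocks))
# [(0, 0, 0), (60, 0, 0), (0, 100, 0), (0, 0, 40)]
```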
Conventional K-means clustering usually uses the Euclidean distance from a point to a centroid as the criterion for cluster membership; the present disclosure uses a weighted Euclidean distance instead. In three-dimensional space, after weighting, the coordinates (x, y, z) become the logical coordinates (ax, by, cz) actually used by the algorithm, where a, b, and c are the weights of the corresponding features and a + b + c = 1. By adjusting the feature weights, all blocks can be distributed reasonably among the 4 sets.
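A weighted Euclidean distance of this kind can be sketched as below. One point is an assumption here: the disclosure states that block coordinates are scaled to (ax, by, cz) but does not say whether centroid coordinates are scaled the same way. This sketch scales both, which is equivalent to weighting each coordinate difference. The function name is illustrative.

```python
import math


def weighted_distance(point, centroid, weights):
    """Euclidean distance between point and centroid after scaling each axis by its weight."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("feature weights are expected to sum to 1")
    # Assumption: point and centroid live in the same weighted coordinate system.
    return math.sqrt(sum((w * (p - c)) ** 2 for p, c, w in zip(point, centroid, weights)))


# Example with weights a = 0.5, b = 0.5, c = 0.
print(weighted_distance((3, 4, 0), (0, 0, 0), (0.5, 0.5, 0.0)))  # 2.5
```

Because the features have very different ranges (page counts, erase counts, time differences), the weights also act as a rough normalization; the specific values shown are arbitrary.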
After the initial centroids are determined, the data must be assigned. As described above, K-means clustering is performed using the Euclidean distance from each block's weighted logical coordinates to the initial centroids as the clustering criterion. When the K-means clustering is complete, 4 clusters (sets) of NAND Flash blocks are obtained.
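The assignment step can be sketched as follows: each block point goes to the centroid with the smallest weighted distance. The function name `assign`, the weights, and the sample data are illustrative assumptions, and the distance again assumes that points and centroids are scaled by the same weights.

```python
import math


def assign(points, centroids, weights):
    """Group points by nearest centroid; returns one list of points per centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        dists = [
            math.sqrt(sum((w * (pi - ci)) ** 2 for pi, ci, w in zip(p, c, weights)))
            for c in centroids
        ]
        clusters[dists.index(min(dists))].append(p)  # ties go to the lower-numbered centroid
    return clusters


centroids = [(0, 0, 0), (60, 0, 0), (0, 100, 0), (0, 0, 40)]
blocks = [(1, 2, 1), (58, 3, 2), (2, 95, 1), (0, 1, 38)]
clusters = assign(blocks, centroids, (0.4, 0.3, 0.3))
print(clusters[0])  # blocks nearest the origin, i.e. candidates for the target cluster
```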
Once the initial clusters are obtained, the initial centroids are no longer needed. Each centroid is then recalculated from the coordinates of the elements in its cluster, as the mean of the coordinates of all elements of that cluster. After the initial clusters and centroids are obtained, they serve as the starting point for block management throughout the remaining life cycle of the solid-state storage device.
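Recomputing each centroid as the coordinate mean of its cluster can be sketched as below. Keeping the previous centroid when a cluster is empty is an assumption of this sketch; the disclosure does not address that case. The function name is illustrative.

```python
def recompute_centroids(clusters, previous):
    """Replace each centroid with the coordinate mean of the points in its cluster."""
    updated = []
    for cluster, old in zip(clusters, previous):
        if not cluster:
            updated.append(old)  # assumption: an empty cluster keeps its old centroid
            continue
        n = len(cluster)
        updated.append(tuple(sum(coords) / n for coords in zip(*cluster)))
    return updated


clusters = [[(0, 2, 0), (2, 4, 2)], []]
print(recompute_centroids(clusters, [(0, 0, 0), (60, 0, 0)]))
# [(1.0, 3.0, 1.0), (60, 0, 0)]
```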
The centroid derived from coordinate (1) (x = 0, y = 0, z = 0), together with the cluster formed around it, is the target cluster from which blocks are selected for writing and garbage collection; this cluster is the set of good solutions obtained by combining all the features. Selecting blocks from this cluster for writing and garbage collection allows the solid-state storage device to reach a relatively good wear-leveling state while also preserving read and write performance. When block elements are added to or removed from a cluster, they join or leave the corresponding cluster automatically, in the manner of unsupervised learning, and the position of the centroid is adjusted in real time.
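One possible way to adjust a centroid in real time as blocks join or leave is to keep running coordinate sums per cluster, so that each update costs O(dimensions) rather than a full recomputation. The disclosure does not specify the update mechanism, so the `Cluster` class below is purely a hypothetical sketch.

```python
class Cluster:
    """Tracks a cluster's centroid incrementally through running coordinate sums."""

    def __init__(self, dims):
        self.count = 0
        self.sums = [0.0] * dims

    def add(self, point):
        """A block joins the cluster."""
        self.count += 1
        for i, value in enumerate(point):
            self.sums[i] += value

    def remove(self, point):
        """A block leaves the cluster (the caller must only remove points that were added)."""
        self.count -= 1
        for i, value in enumerate(point):
            self.sums[i] -= value

    @property
    def centroid(self):
        """Current mean of the member points, or None if the cluster is empty."""
        if self.count == 0:
            return None
        return tuple(s / self.count for s in self.sums)


target = Cluster(dims=3)
target.add((0, 2, 0))
target.add((2, 4, 2))
print(target.centroid)  # (1.0, 3.0, 1.0)
target.remove((0, 2, 0))
print(target.centroid)  # (2.0, 4.0, 2.0)
```

A firmware implementation would more likely use integer or fixed-point sums; floating point is used here only for brevity.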
The clusters derived from coordinates (2), (3), and (4) and their subsequent centroids, by contrast, are sets of blocks in which one feature value is poor; they are all inferior solutions for the target operations of the storage device.
FIG. 2 is a schematic diagram of clustering in a multi-dimensional space according to an embodiment of the present disclosure. As shown in FIG. 2, the set of blocks represented by the coordinates of the circles near centroid (1) (x = 0, y = 0, z = 0) is the target block set for writing and garbage collection in the storage device.
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 4, according to another aspect of the embodiments of the present disclosure, a data processing apparatus is also provided, comprising an establishing module 42, a determining module 44, a clustering module 46, and a selecting module 48. The apparatus is described in detail below.
The establishing module 42 is configured to establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices that respectively correspond to a plurality of conditions for screening data blocks. The determining module 44, connected to the establishing module 42, is configured to determine the multi-dimensional space corresponding to the multi-dimensional function model and a plurality of centroids of the multi-dimensional space. The clustering module 46, connected to the determining module 44, is configured to cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids. The selecting module 48, connected to the clustering module 46, is configured to select, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
With the above apparatus, the establishing module 42 establishes a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices that respectively correspond to a plurality of conditions for screening data blocks; the determining module 44 determines the multi-dimensional space corresponding to the multi-dimensional function model and a plurality of centroids of the multi-dimensional space; the clustering module 46 clusters a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids; and the selecting module 48 selects, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions. By establishing the multi-dimensional function model, selecting centroids in it for clustering, and classifying the plurality of data blocks, the data blocks can be screened quickly across multiple dimensions. This achieves the technical effect of improving the efficiency with which data blocks are evaluated and improving the consistency and generality of data block processing, and thereby solves the technical problem in the related art that the management of data storage blocks evaluates only a limited set of block features, which leads to poor consistency and generality in data block processing.
According to another aspect of the embodiments of the present disclosure, a computer storage medium is also provided. The computer storage medium includes a stored program, wherein, when the program runs, a device in which the computer storage medium is located is controlled to execute any one of the data processing methods described above.
According to another aspect of the embodiments of the present disclosure, a processor is also provided. The processor is configured to run a program, wherein, when the program runs, any one of the data processing methods described above is executed.
The serial numbers of the above embodiments of the present disclosure are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present disclosure, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present disclosure. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present disclosure, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present disclosure.

Claims (10)

  1. A data processing method, comprising:
    establishing a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to a plurality of conditions for screening data blocks;
    determining a multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space;
    clustering a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids;
    selecting, as a target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
  2. The method according to claim 1, wherein determining the multi-dimensional space corresponding to the multi-dimensional function model, and the plurality of centroids of the multi-dimensional space, comprises:
    determining the number of the plurality of centroids according to the number of multi-dimensional indices of the multi-dimensional function model, wherein the number of centroids is one more than the number of multi-dimensional indices;
    determining the coordinates, in the multi-dimensional space, of that number of centroids, wherein the centroids are endpoints of the range of the plurality of data blocks in the multi-dimensional space, and the centroids lie on the coordinate axes of the multi-dimensional space.
  3. The method according to claim 2, wherein clustering the plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain the plurality of block clusters corresponding to the plurality of centroids, comprises:
    determining the coordinates of the plurality of data blocks in the multi-dimensional space;
    weighting the coordinates of the plurality of data blocks;
    calculating, according to the weighted coordinates of the plurality of data blocks, the Euclidean distances of the plurality of data blocks relative to the plurality of centroids;
    clustering the plurality of data blocks using the magnitude of the Euclidean distance as the clustering condition, to obtain the plurality of block clusters corresponding to the plurality of centroids.
  4. The method according to claim 3, wherein selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions comprises:
    determining, according to the plurality of conditions, target multi-dimensional indices that satisfy the plurality of conditions;
    determining, according to the target multi-dimensional indices, a target centroid whose coordinates correspond to the target multi-dimensional indices;
    taking the block cluster corresponding to the target centroid as the target block cluster.
  5. The method according to claim 1, further comprising, after selecting, as the target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions:
    determining an actual centroid according to the coordinates of the data blocks in the target block cluster;
    performing subsequent clustering of data blocks according to the actual centroid.
  6. The method according to claim 1, wherein the conditions are screening conditions that meet the requirements of a data processing mode, and the data processing mode includes at least one of the following:
    performing a write operation on the plurality of data blocks;
    performing a reclamation operation on the plurality of data blocks.
  7. The method according to claim 6, wherein the plurality of screening conditions include at least one of the following:
    the number of valid pages of the data block is 0 or is less than a preset number;
    the wear level of the data block is less than a preset wear level;
    the hotness of the data block is 0 or is less than a preset hotness.
  8. A data processing apparatus, comprising:
    an establishing module, configured to establish a multi-dimensional function model, wherein the multi-dimensional function model includes multi-dimensional indices, and the multi-dimensional indices respectively correspond to a plurality of conditions for screening data blocks;
    a determining module, configured to determine a multi-dimensional space corresponding to the multi-dimensional function model, and a plurality of centroids of the multi-dimensional space;
    a clustering module, configured to cluster a plurality of data blocks by a clustering algorithm according to the plurality of centroids, to obtain a plurality of block clusters corresponding to the plurality of centroids;
    a selecting module, configured to select, as a target block cluster, the block cluster corresponding to the centroid whose multi-dimensional indices satisfy the plurality of conditions.
  9. A computer storage medium, comprising a stored program, wherein, when the program runs, a device in which the computer storage medium is located is controlled to execute the data processing method according to any one of claims 1 to 7.
  10. A processor, configured to run a program, wherein, when the program runs, the data processing method according to any one of claims 1 to 7 is executed.
PCT/CN2020/132911 2020-10-30 2020-11-30 Data processing method and apparatus WO2022088374A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011193950.2 2020-10-30
CN202011193950.2A CN112306414A (en) 2020-10-30 2020-10-30 Data processing method and device

Publications (1)

Publication Number Publication Date
WO2022088374A1 true WO2022088374A1 (en) 2022-05-05

Family

ID=74333150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132911 WO2022088374A1 (en) 2020-10-30 2020-11-30 Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN112306414A (en)
WO (1) WO2022088374A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140181370A1 (en) * 2012-12-21 2014-06-26 Lsi Corporation Method to apply fine grain wear leveling and garbage collection
CN108646977A (en) * 2018-03-07 2018-10-12 深圳忆联信息系统有限公司 A kind of method and rubbish recovering method of the cold and hot data judgements of SSD
CN109783020A (en) * 2018-12-28 2019-05-21 西安交通大学 A kind of rubbish recovering method based on SSD-SMR mixing key assignments storage system
CN111026673A (en) * 2019-11-19 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 NAND FLASH garbage recycling dynamic optimization method
CN111090595A (en) * 2019-11-19 2020-05-01 中国航空工业集团公司西安航空计算技术研究所 NAND FLASH garbage recovery balance optimization method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127519A (en) * 2016-06-24 2016-11-16 武汉斗鱼网络科技有限公司 A kind of live platform user divided method based on K Means algorithm and system
CN106709662B (en) * 2016-12-30 2021-07-02 山东鲁能软件技术有限公司 Power equipment operation condition division method
CN109753986A (en) * 2017-11-08 2019-05-14 中移(苏州)软件技术有限公司 A kind of clustering method and device of the index information based on data block
CN111178380B (en) * 2019-11-15 2023-07-04 腾讯科技(深圳)有限公司 Data classification method and device and electronic equipment

Also Published As

Publication number Publication date
CN112306414A (en) 2021-02-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20959533

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20959533

Country of ref document: EP

Kind code of ref document: A1