WO2016070529A1 - Method and apparatus for implementing data deduplication - Google Patents

Method and apparatus for implementing data deduplication

Info

Publication number
WO2016070529A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
deduplication
data deduplication
data block
deduplication table
Prior art date
Application number
PCT/CN2015/073136
Other languages
English (en)
French (fr)
Inventor
鲁飞
刘煌
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016070529A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present application relates to data storage technology, and more particularly to a method and apparatus for implementing data deduplication.
  • The processing of redundant data is called deduplication.
  • Depending on when the data is processed, deduplication is either in-band or out-of-band.
  • In-band deduplication is embedded in the input/output (I/O) path and is also known as real-time deduplication.
  • Real-time deduplication checks for duplicates as data is written to the storage medium and deletes duplicated data immediately. This reduces space usage at the earliest possible moment, but it consumes more resources and may affect write performance.
  • Out-of-band deduplication writes data normally and deduplicates the data on disk at a later time; it is also known as post-processing deduplication.
  • Post-processing deduplication runs after data has been written to disk. Its advantage is that it does not affect write performance, but it requires enough disk space to store all data until deduplication runs during off-peak hours.
  • The data deduplication table that records duplicate-data information is itself very large and cannot be held entirely in memory; memory can only serve as a cache for the table on disk. A lookup by data fingerprint therefore searches both memory and disk, and retrieval of the data deduplication table becomes the main performance bottleneck of a deduplication system.
  • Many optimizations have been proposed for retrieving the data deduplication table, such as hash tables, hierarchical indexes and similar mechanisms, but for a given data storage node the table still resides partly in memory and partly on disk.
  • When a real-time deduplication system searches the on-disk part of the data deduplication table, I/O performance is inevitably affected.
  • the embodiment of the invention provides a method and a device for implementing deduplication, which does not need to perform a complete data deduplication table search, reduces the time consumption for data deduplication, and reduces the impact on I/O performance.
  • an embodiment of the present invention provides a method for implementing data deduplication
  • Optionally, before real-time data deduplication is performed, the method further includes:
  • Optionally, performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy includes:
  • When the system is found to be idle, an independent thread is started to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.
  • Optionally, the method further includes merging the temporary data deduplication table into the data deduplication table once post-processing deduplication is complete; specifically:
  • During post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted and the reference count of that block in the data deduplication table is then modified.
  • an embodiment of the present invention further provides an apparatus for implementing deduplication, including: a writing unit and a temporary data deduplication processing unit;
  • The writing unit is configured to write a data block to disk when, during real-time data deduplication, the information of the data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration;
  • The temporary data deduplication processing unit is configured to establish a temporary data deduplication table according to the writing of the data block, and to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy.
  • The apparatus further includes an obtaining unit and a lookup processing unit, wherein:
  • The obtaining unit is configured to obtain the hash fingerprint of the data block as the keyword (KEY) for deduplication before the writing unit performs real-time data deduplication;
  • The lookup processing unit is configured to determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when the KEY is not recorded there, the data block is stored and the KEY and storage address are added to the data deduplication table; when the KEY is already recorded there, real-time data deduplication is performed.
  • The temporary data deduplication processing unit is configured to establish the temporary data deduplication table according to the writing of the data block;
  • When the system is found to be idle, an independent thread is started to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.
  • The temporary data deduplication processing unit is further configured to merge the temporary data deduplication table into the data deduplication table after post-processing deduplication is complete, including:
  • During post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted and the reference count of that block in the data deduplication table is then modified, so that the temporary data deduplication table that has completed post-processing deduplication is merged into the data deduplication table.
  • An embodiment of the present invention further provides a computer-readable storage medium storing a computer program; the computer program includes program instructions that, when executed by a data deduplication device, cause the device to perform the above method for implementing data deduplication.
  • Compared with the prior art, the technical solution provided by the embodiments of the present invention includes: during real-time data deduplication, when the information of a data block is not found in the data deduplication table in memory, or when, within a preset duration, it is found neither in the data deduplication table in memory nor in the data deduplication table on disk, writing the data block to disk and establishing a temporary data deduplication table according to the writing of the data block; and performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table according to a preset policy.
  • The embodiments of the invention avoid a full on-disk search of the data deduplication table, reduce real-time I/O latency, and improve the efficiency of data deduplication.
  • Optionally, a Bloom filter gives a fast verdict on the data deduplication table, which reduces the number of data blocks that real-time data deduplication has to process.
  • FIG. 1 is a flowchart of a method for implementing deduplication according to an embodiment of the present invention
  • FIG. 2 is a structural block diagram of an apparatus for implementing deduplication according to an embodiment of the present invention
  • FIG. 3 is a flow chart of a method in accordance with another embodiment of the present invention.
  • To present the embodiments of the present invention clearly, the Bloom filter is briefly introduced: it is a highly space-efficient randomized data structure that represents a set very compactly as a bit array and can test whether an element belongs to the set.
  • FIG. 1 is a flowchart of a method for implementing deduplication according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
  • Step 100: During real-time data deduplication, if the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration, the data block is written to disk;
  • Deduplication of a data block may follow the related-art method for deleting duplicate data, which generally includes: calculating the hash value of the data block, and searching the data deduplication table for a match according to the hash value.
  • The search generally consults the data deduplication table in memory first and, on a miss, the data deduplication table on disk. If a duplicate data block is found during the search, the duplicate is removed; otherwise, the data block is written to disk and the data deduplication table is updated.
  • The preset duration is generally a value, chosen from experience by a person skilled in the art, that is longer than the time needed to complete the lookup of a stored data block in the in-memory data deduplication table. In other words, by the time the preset duration elapses, the lookup in the in-memory table is certain to have completed and part of the on-disk table has been searched.
  • Step 101: A temporary data deduplication table is established according to the writing of the data block, and post-processing deduplication is performed on the data blocks recorded in the temporary data deduplication table by using a preset policy.
  • The temporary data deduplication table consists of temporary entries created in the same format and with the same content as the data deduplication table. The entries form an asynchronous deduplication queue and are not written into the related-art data deduplication table.
  • In this step, performing post-processing deduplication on the data blocks recorded in the temporary data deduplication table according to the preset policy includes:
  • When the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within the preset duration, the data block is written directly to disk and a temporary data deduplication table is established for post-processing deduplication. This avoids a full on-disk search of the data deduplication table, reduces real-time I/O latency, and improves the efficiency of data deduplication.
  • Before this, the method of the embodiment of the present invention further includes: obtaining the hash fingerprint of the data block as the keyword (KEY) for deduplication;
  • A Bloom filter determines whether the KEY is recorded in the data deduplication table. When it is not, the data block is stored and the KEY and storage address are added to the data deduplication table; otherwise, real-time data deduplication is performed.
  • The Bloom filter can quickly identify blocks that are not recorded in the data deduplication table. For the rest it cannot give a definite answer: when testing membership, a Bloom filter may mistake an element that is not in the set for one that is (a false positive). Bloom filters are therefore unsuitable for "zero error" applications; where a low error rate is tolerable, they trade a few errors for a large saving in storage space. Data blocks that the filter reports as recorded, or cannot rule out, go through real-time data deduplication.
  • Using the hash fingerprint of the data block as the KEY, a Bloom filter query against the data deduplication table quickly identifies data blocks that have no record in the table. Combining the Bloom filter with real-time data deduplication improves its efficiency and avoids the cost of a full-table lookup in the data deduplication table.
  • Further, once the data blocks that have no record in the data deduplication table have been stored, only the remaining data blocks are handled by the real-time deduplication flow, so the number of stored data blocks it must process drops sharply and the impact on I/O performance is avoided.
  • After post-processing deduplication is complete, the method of the embodiment of the present invention further includes merging the temporary data deduplication table into the data deduplication table. Specifically:
  • During post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted, and information such as the reference count of the duplicate block in the data deduplication table is then modified.
  • The embodiment uses a Bloom filter for a fast verdict on the data deduplication table and handles the data blocks that are absent from it, which greatly reduces the data blocks that real-time data deduplication must process, avoids its impact on I/O performance, and improves the efficiency of data deduplication.
  • Further, after the lookup in the in-memory data deduplication table, or a lookup bounded by the preset duration, data blocks for which no record was found are stored directly on disk and a temporary data deduplication table is established. The data deduplication table is then adjusted and updated according to the preset policy, so the impact of the deduplication process on the system is reduced and deduplication efficiency improves.
  • FIG. 2 is a structural block diagram of an apparatus for implementing deduplication according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a writing unit and a temporary data deduplication processing unit;
  • The writing unit is adapted to write a data block to disk when, during real-time data deduplication, the record information of the data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within the preset duration;
  • The temporary data deduplication processing unit is adapted to establish a temporary data deduplication table according to the writing of the data block, and to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table by using a preset policy.
  • The temporary data deduplication processing unit is specifically adapted to establish a temporary data deduplication table according to the writing of the data block;
  • The temporary data deduplication processing unit is further adapted to merge the temporary data deduplication table into the data deduplication table after post-processing deduplication is complete; specifically:
  • During post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted, and information such as the reference count of that data block in the data deduplication table is then modified, so that the post-processed temporary data deduplication table is merged into the data deduplication table.
  • The apparatus of the embodiment of the invention further includes an obtaining unit and a lookup processing unit, wherein:
  • The obtaining unit is adapted to obtain the hash fingerprint of the data block as the keyword (KEY) for deduplication before the writing unit performs real-time data deduplication;
  • The lookup processing unit is adapted to determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table. When it is not, the data block is stored and the KEY and storage address are added to the data deduplication table; otherwise, real-time data deduplication is performed.
  • After deciding how a stored data block is to be handled, the lookup processing unit and the temporary data deduplication processing unit delete or write it using existing methods. In a specific implementation, a notification (or instruction) is issued for the corresponding stored data block so that the write-to-disk or delete operation is carried out according to the notification.
  • In practice, when data is deduplicated it is first split into data blocks, and deduplication is achieved by looking each data block up in the data deduplication table.
  • If real-time deduplication alone is used, the lookup (also called retrieval) in the data deduplication table takes a long time because the table can be very large. In particular, when the data cannot be found in the table cached in memory and the on-disk data deduplication table has to be searched, the time consumed is very large and I/O performance suffers badly.
  • FIG. 3 is a flowchart of a method according to another embodiment of the present invention. As shown in FIG. 3, the method includes:
  • Step 300: Obtain the hash fingerprint of the data block as the keyword (KEY) for deduplication;
  • Step 301: Determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table. When it is not, the data block is stored and the KEY and storage address are added to the data deduplication table; otherwise, real-time data deduplication is performed.
  • Step 302: During real-time data deduplication, if the record information of a data block is not found in the in-memory data deduplication table, or is still not found within the preset duration after the in-memory table has been searched and the on-disk table has been searched in part, the data block is written to disk;
  • Step 303: Establish a temporary data deduplication table according to the writing of the data block; according to a preset policy, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table;
  • Step 304: When post-processing deduplication completes, merge the post-processed temporary data deduplication table into the data deduplication table.
  • Starting the independent thread according to the preset policy mainly includes: setting a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
  • First, the initial state of the process is "waiting for external wake-up".
  • When the real-time deduplication system establishes the temporary data deduplication table, a wake-up signal for the independent thread is issued through the temporary data deduplication table.
  • Once awake, if the temporary data deduplication table has reached the storage-amount threshold, post-processing deduplication starts immediately. If it has not, a timed wait begins based on the processing-duration threshold set for the temporary data deduplication table, and post-processing deduplication starts when the timer expires. If the system is found to be idle, post-processing deduplication starts immediately.
  • On entering deduplication processing, the timer for the processing-duration threshold is cleared; after deduplication processing completes, the timer returns to its initial state.
  • Optionally, all or part of the steps of the above embodiments may also be implemented with integrated circuits: the steps may each be made into individual integrated circuit modules, or several of the modules or steps may be made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
  • The devices/function modules/functional units in the above embodiments may be implemented by general-purpose computing devices, which may be centralized on a single computing device or distributed over a network of multiple computing devices.
  • When the devices/function modules/functional units in the above embodiments are implemented as software function modules and sold or used as stand-alone products, they can be stored in a computer-readable storage medium.
  • The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
  • With the method and apparatus provided by the embodiments of the present invention, during real-time data deduplication, when the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration, the data block is written to disk and a temporary data deduplication table is established. Post-processing deduplication is then performed on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy, which avoids a full on-disk search of the data deduplication table, reduces real-time I/O latency, and improves the efficiency of data deduplication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

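A method and apparatus for implementing data deduplication, including: during real-time data deduplication, when the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration, writing the stored data block to disk, establishing a temporary data deduplication table according to the writing of the data block, and performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy. Embodiments of the present invention can avoid a full on-disk search of the data deduplication table, reduce real-time I/O latency, and improve the efficiency of data deduplication.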

Description

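Method and apparatus for implementing data deduplication
Technical Field
The present application relates to data storage technology, and in particular to a method and apparatus for implementing data deduplication.
Background
As computerization has advanced, humanity has entered the information age. Computers and the Internet have penetrated every industry, and the volume of information has grown by several orders of magnitude. This mass of data contains a large amount of redundant data. To keep data safe it must be backed up continually, and the backup process generates still more redundant data.
The processing of redundant data is called data deduplication. Depending on when the data is processed, deduplication is either in-band or out-of-band. In-band deduplication is embedded in the input/output (I/O) path and is also called real-time deduplication. Real-time deduplication checks for duplicates as data is written to the storage medium and deletes duplicated data immediately, which reduces space usage at the earliest possible moment, but it consumes more resources and may affect write performance. Out-of-band deduplication writes the data normally and deduplicates the data on disk at some later time; it is also called post-processing deduplication. Post-processing deduplication runs after the data has been written to disk. Its advantage is that it does not affect write performance, but it requires enough disk space to store all the data until deduplication runs during off-peak hours.
Whether in-band or out-of-band, deduplication must first find the duplicate data. Because the amount of data to process can be very large, finding data blocks with identical content is very time-consuming. In the prior art, duplicates are located by looking up the data fingerprint (hash value) of each block's content in a deduplication-information index table called the data deduplication table. However, the data deduplication table that records duplicate-data information is itself very large and cannot be held entirely in memory; memory can only serve as a cache for the table on disk. A lookup by data fingerprint therefore searches both memory and disk, and retrieval of the data deduplication table becomes the main performance bottleneck of a deduplication system. Many optimizations have been proposed for retrieving the data deduplication table, such as hash tables, hierarchical indexes and similar mechanisms, but for a given data storage node the table still resides partly in memory and partly on disk, and when a real-time deduplication system searches the on-disk part, I/O performance is inevitably affected.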
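Summary of the Invention
Embodiments of the present invention provide a method and apparatus for implementing data deduplication that do not require a search of the complete data deduplication table, reducing the time spent on deduplication and its impact on I/O performance.
To solve the above technical problem, an embodiment of the present invention provides a method for implementing data deduplication, including:
during real-time data deduplication, writing a data block to disk when the information of the data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration;
establishing a temporary data deduplication table according to the writing of the data block;
performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy.
Optionally, before real-time data deduplication is performed, the method further includes:
obtaining the hash fingerprint of the data block as the keyword KEY for deduplication;
determining, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when the KEY is not recorded in the data deduplication table, storing the data block and adding the KEY and the storage address to the data deduplication table; when the KEY is already recorded in the data deduplication table, performing real-time data deduplication.
Optionally, performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy includes:
setting a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or,
setting a storage-amount threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the threshold, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or,
when the system is found to be idle, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.
Optionally, the method further includes merging the temporary data deduplication table that has completed post-processing deduplication into the data deduplication table; specifically:
during post-processing deduplication, for a non-duplicate data block written to disk, adding its information in the temporary data deduplication table to the data deduplication table; for a duplicate data block, deleting its information from the temporary data deduplication table and then modifying the reference count of the duplicate data block in the data deduplication table.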
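In another aspect, an embodiment of the present invention further provides an apparatus for implementing data deduplication, including a writing unit and a temporary data deduplication processing unit, wherein:
the writing unit is configured to, during real-time data deduplication, write a data block to disk when the information of the data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration;
the temporary data deduplication processing unit is configured to establish a temporary data deduplication table according to the writing of the data block, and to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table by using a preset policy.
Optionally, the apparatus further includes an obtaining unit and a lookup processing unit, wherein:
the obtaining unit is configured to obtain the hash fingerprint of the data block as the keyword KEY for deduplication before the writing unit performs real-time data deduplication;
the lookup processing unit is configured to determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when the KEY is not recorded in the data deduplication table, to store the data block and add the KEY and the storage address to the data deduplication table; and when the KEY is already recorded in the data deduplication table, to perform real-time data deduplication.
Optionally, the temporary data deduplication processing unit is configured to establish the temporary data deduplication table according to the writing of the data block, and to:
set a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or,
set a storage-amount threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the threshold, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or,
when the system is found to be idle, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.
Optionally, the temporary data deduplication processing unit is further configured to merge the temporary data deduplication table into the data deduplication table after post-processing deduplication is complete, including:
during post-processing deduplication, for a non-duplicate data block written to disk, adding its information in the temporary data deduplication table to the data deduplication table; for a duplicate data block, deleting its information from the temporary data deduplication table and then modifying the reference count of the duplicate data block in the data deduplication table, so that the temporary data deduplication table that has completed post-processing deduplication is merged into the data deduplication table.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program; the computer program includes program instructions that, when executed by a data deduplication device, cause the device to perform the above method for implementing data deduplication.
Compared with the prior art, the technical solution provided by the embodiments of the present invention includes: during real-time data deduplication, when the information of a data block is not found in the data deduplication table in memory, or when, within a preset duration, it is found neither in the data deduplication table in memory nor in the data deduplication table on disk, writing the data block to disk and establishing a temporary data deduplication table according to the writing of the data block; and performing post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table according to a preset policy. The embodiments avoid a full on-disk search of the data deduplication table, reduce real-time I/O latency, and improve the efficiency of data deduplication. Optionally, a Bloom filter gives a fast verdict on the data deduplication table, which reduces the data blocks that real-time data deduplication has to process.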
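Brief Description of the Drawings
The drawings provide a further understanding of the technical solution of the present application and form part of the specification. Together with the embodiments of the present application they serve to explain the technical solution and do not limit it.
FIG. 1 is a flowchart of a method for implementing data deduplication according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of an apparatus for implementing data deduplication according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method according to another embodiment of the present invention.
Preferred Embodiments of the Invention
Embodiments of the present application are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another arbitrarily.
To present the embodiments of the present invention clearly, the Bloom filter is briefly introduced. A Bloom filter is a highly space-efficient randomized data structure that represents a set very compactly as a bit array and can test whether an element belongs to the set.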
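As an illustration of the bit-array idea, the following is a minimal Bloom filter sketch in Python. It is not taken from the patent: the class name `BloomFilter`, the sizes `num_bits` and `num_hashes`, and the use of salted SHA-256 to derive bit positions are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash positions per key."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the patent.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means only "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bloom = BloomFilter()
bloom.add(b"block-a")
```

A key that has been added always tests as possibly present, while a filter with no bits set rejects every key. Other keys may occasionally test as present even though they were never added, which is the false-positive behaviour discussed later in the description.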
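FIG. 1 is a flowchart of a method for implementing data deduplication according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 100: During real-time data deduplication, if the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration, the data block is written to disk;
It should be noted that deduplication of a data block may follow the related-art method for deleting duplicate data, which generally includes: calculating the hash value of the data block; searching the data deduplication table for a match according to the hash value, generally consulting the table in memory first and, if nothing is found, the table on disk; if a duplicate data block is found during the search, removing the duplicate; otherwise, writing the data block to disk and updating the data deduplication table. In this step, the in-memory data deduplication table has already been searched before the on-disk table is searched. The preset duration is generally a value, chosen from experience by a person skilled in the art, that is longer than the time needed to complete the lookup of a stored data block in the in-memory data deduplication table; in other words, by the time the preset duration elapses, the lookup in the in-memory table is certain to have completed and part of the on-disk table has been searched.
Step 101: A temporary data deduplication table is established according to the writing of the data block, and post-processing deduplication is performed on the data blocks recorded in the temporary data deduplication table by using a preset policy.
The temporary data deduplication table consists of temporary entries created in the same format and with the same content as the data deduplication table. The entries form an asynchronous deduplication queue and are not written into the related-art data deduplication table.
In this step, performing post-processing deduplication on the data blocks recorded in the temporary data deduplication table according to the preset policy specifically includes:
setting a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
setting a storage-amount threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the threshold, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
when the system is found to be idle, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table.
When the information of a data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within the preset duration, the data block is written directly to disk and a temporary data deduplication table is established for post-processing deduplication. This avoids a full on-disk search of the data deduplication table, reduces real-time I/O latency, and improves the efficiency of data deduplication.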
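The write path of steps 100 and 101 can be sketched as follows. This is a simplified model under stated assumptions, not the patent's implementation: the function name `realtime_write`, the dictionary layout of table entries (`addr`, `refs`), the list standing in for the disk, and the `budget_s` parameter for the preset duration are all hypothetical.

```python
import time


def realtime_write(key, block, mem_table, disk_entries, storage, temp_table,
                   budget_s=None):
    """Handle one block on the write path; return "dedup" or "written".

    budget_s=None models the first branch (memory miss, write at once);
    a number models the second branch (time-bounded search of the disk table).
    """
    entry = mem_table.get(key)
    if entry is None and budget_s is not None:
        deadline = time.monotonic() + budget_s
        for disk_key, disk_entry in disk_entries:
            if time.monotonic() > deadline:
                break  # preset duration exhausted: stop searching the disk table
            if disk_key == key:
                entry = disk_entry
                break
    if entry is not None:
        entry["refs"] += 1  # duplicate found in time: only bump the reference count
        return "dedup"
    # No record found (or out of time): write now, queue for post-processing.
    storage.append(block)
    temp_table.append({"key": key, "addr": len(storage) - 1})
    return "written"


mem_table = {"k1": {"addr": 0, "refs": 1}}       # in-memory part of the table
disk_entries = [("k3", {"addr": 1, "refs": 1})]  # stand-in for the on-disk part
storage = [b"block-1", b"block-3"]               # stand-in for the disk
temp_table = []

r1 = realtime_write("k1", b"block-1", mem_table, disk_entries, storage, temp_table)
r2 = realtime_write("k2", b"block-2", mem_table, disk_entries, storage, temp_table)
r3 = realtime_write("k3", b"block-3", mem_table, disk_entries, storage, temp_table,
                    budget_s=10.0)
```

The point of the design is that a miss never blocks the write on a complete scan of the on-disk table: the block is stored immediately and the question of whether it was a duplicate is deferred to post-processing through the temporary table.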
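Before this, the method of the embodiment of the present invention further includes: obtaining the hash fingerprint of the data block as the keyword (KEY) for deduplication;
determining, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when it is not recorded in the data deduplication table, storing the data block and adding the KEY and the storage address to the data deduplication table; otherwise, performing real-time data deduplication.
It should be noted that obtaining the hash fingerprint of a data block is an existing method and a customary technique for a person skilled in the art. The Bloom filter can quickly identify blocks that are not recorded in the data deduplication table. For the rest it cannot give a definite answer: when testing whether an element belongs to a set, a Bloom filter may mistake an element that is not in the set for one that is (a false positive). Bloom filters are therefore unsuitable for "zero error" applications. Where a low error rate is tolerable, a Bloom filter trades a very small number of errors for a very large saving in storage space. Data blocks that the filter reports as recorded in the data deduplication table, or for which it cannot be determined whether they are recorded, must go through real-time data deduplication.
Using the hash fingerprint of the data block as a KEY, a Bloom filter query against the data deduplication table quickly identifies data blocks that have no record in the table. Combining the Bloom filter with real-time data deduplication improves its efficiency and avoids the cost of a full-table lookup in the data deduplication table.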
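The pre-check described above can be sketched as a small routing function. Everything named here is an assumption for illustration: `TinyBloom` is a deliberately compact filter that keeps its bits in one Python integer, `route_block` is a hypothetical helper, and SHA-256 stands in for whatever fingerprint function an implementation would use.

```python
import hashlib


class TinyBloom:
    """Compact Bloom filter that keeps its bit array in one Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def route_block(block, bloom, dedup_table, storage):
    key = hashlib.sha256(block).digest()  # hash fingerprint used as KEY
    if not bloom.might_contain(key):
        # Definitely not recorded: store the block and record KEY + address.
        storage.append(block)
        dedup_table[key] = {"addr": len(storage) - 1, "refs": 1}
        bloom.add(key)
        return "stored"
    # Possibly recorded (may be a false positive): hand over to real-time dedup.
    return "realtime_dedup"


bloom, dedup_table, storage = TinyBloom(), {}, []
first = route_block(b"hello", bloom, dedup_table, storage)
second = route_block(b"hello", bloom, dedup_table, storage)
```

Only the "possibly recorded" branch needs a table lookup at all, which is why a false positive costs some extra work but never causes data to be dropped.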
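Further, once the data blocks that have no record in the data deduplication table have been stored, only the remaining data blocks are handled by the real-time deduplication flow, so the number of stored data blocks it must process drops sharply and the impact on I/O performance is avoided.
After post-processing deduplication is complete, the method of the embodiment of the present invention further includes merging the temporary data deduplication table that has completed post-processing deduplication into the data deduplication table. Specifically:
during post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted, and information such as the reference count of the duplicate data block in the data deduplication table is then modified.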
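A minimal sketch of this merge step, assuming the same hypothetical entry layout as before (`key`, `addr`, `refs`). The `free_block` callback that releases the redundant on-disk copy is an assumption added for illustration; the text only states that the temporary entry is deleted and the reference count modified.

```python
def merge_temp_table(temp_table, dedup_table, free_block):
    """Fold the temporary table into the main table after post-processing."""
    for entry in temp_table:
        key, addr = entry["key"], entry["addr"]
        existing = dedup_table.get(key)
        if existing is None:
            # Non-duplicate block: promote its entry into the main table.
            dedup_table[key] = {"addr": addr, "refs": 1}
        else:
            # Duplicate block: keep one copy and bump its reference count.
            existing["refs"] += 1
            free_block(addr)  # assumed hook that releases the redundant copy
    temp_table.clear()  # every temporary entry has now been consumed


dedup_table = {"a": {"addr": 0, "refs": 1}}
temp_table = [{"key": "a", "addr": 5},
              {"key": "b", "addr": 6},
              {"key": "b", "addr": 7}]
freed = []
merge_temp_table(temp_table, dedup_table, freed.append)
```

Note that two temporary entries with the same key are handled correctly: the first is promoted and the second is treated as a duplicate of it.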
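The embodiment of the present invention uses a Bloom filter for a fast verdict on the data deduplication table and handles the data blocks that are absent from it, so the data blocks that real-time data deduplication must process are greatly reduced, the impact of real-time deduplication on I/O performance is avoided, and the efficiency of data deduplication improves. Further, after the lookup in the in-memory data deduplication table, or a lookup bounded by the preset duration, data blocks for which no record was found are stored directly on disk and a temporary data deduplication table is established; the data deduplication table is then adjusted and updated according to the preset policy, so that the impact of the deduplication process on the system is reduced and deduplication efficiency is improved.
FIG. 2 is a structural block diagram of an apparatus for implementing data deduplication according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a writing unit and a temporary data deduplication processing unit, wherein:
the writing unit is adapted to, during real-time data deduplication, write a data block to disk when the record information of the data block is not found in the data deduplication table in memory, or is not found in the data deduplication table on disk within a preset duration;
the temporary data deduplication processing unit is adapted to establish a temporary data deduplication table according to the writing of the data block, and to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table by using a preset policy.
The temporary data deduplication processing unit is specifically adapted to establish the temporary data deduplication table according to the writing of the data block, and to:
set a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, start an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
set a storage-amount threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the threshold, start an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
when the system is found to be idle, start an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table.
The temporary data deduplication processing unit is further adapted to merge the temporary data deduplication table into the data deduplication table after post-processing deduplication is complete; specifically:
during post-processing deduplication, for a non-duplicate data block written to disk, its information in the temporary data deduplication table is added to the data deduplication table; for a duplicate data block, its information in the temporary data deduplication table is deleted, and information such as the reference count of that data block in the data deduplication table is then modified, so that the post-processed temporary data deduplication table is merged into the data deduplication table.
The apparatus of the embodiment of the present invention further includes an obtaining unit and a lookup processing unit, wherein:
the obtaining unit is adapted to obtain the hash fingerprint of the data block as the keyword (KEY) for deduplication before the writing unit performs real-time data deduplication;
the lookup processing unit is adapted to determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when it is not recorded there, to store the data block and add the KEY and the storage address to the data deduplication table; otherwise, to perform real-time data deduplication.
It should be noted that, after deciding how a stored data block is to be handled, the lookup processing unit and the temporary data deduplication processing unit delete or write it using existing methods. In a specific implementation, a notification (or instruction) is issued for the corresponding stored data block so that the write-to-disk or delete operation is carried out according to the notification.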
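To present the invention clearly, it is described in detail below by way of a specific embodiment. The embodiment serves only to illustrate the invention and does not limit its scope of protection.
Embodiment 1
In practice, when data is deduplicated it is first split into data blocks, and deduplication is achieved by looking each data block up in the data deduplication table. If real-time deduplication alone is used, the lookup (or retrieval) of a stored data block in the data deduplication table takes a long time because the table can be very large. In particular, when the data cannot be found in the table cached in memory and the on-disk data deduplication table has to be searched, the time consumed is very large and I/O performance suffers badly.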
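The chunk-then-fingerprint step can be sketched as below. The fixed block size of 4096 bytes and the choice of SHA-256 are assumptions made for the example; the patent specifies neither a block size nor a particular hash function.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """Hash fingerprint of a block, usable as the deduplication KEY."""
    return hashlib.sha256(block).hexdigest()


data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
keys = [fingerprint(block) for block in split_into_blocks(data)]
```

Here the first and third blocks have identical content and therefore the same KEY, so a deduplication table keyed on the fingerprint would store that content once and count two references.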
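FIG. 3 is a flowchart of a method according to another embodiment of the present invention. As shown in FIG. 3, the method includes:
Step 300: Obtain the hash fingerprint of the data block as the keyword (KEY) for deduplication;
Step 301: Determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table. When it is not recorded in the data deduplication table, the data block is stored and the KEY and storage address are added to the data deduplication table; otherwise, real-time data deduplication is performed.
In an experimental test on a local setup, zfs was used as the local file system. With a small amount of data (3.4G) already in the zfs pool (a few records in the ddt data deduplication table), a large number of stored data blocks (new data, 11G) were written, and the write speed with and without the Bloom filter was compared: enabling the Bloom filter improved write efficiency by roughly 14%. The data already written was then written again (re-copying old data), and the write speed with the Bloom filter enabled improved by roughly 18%.
In theory, the more records the ddt data deduplication table holds, the more time a lookup takes, so the benefit of enabling the Bloom filter should become more pronounced. A further test with a larger data volume was therefore run. With 25G of stored data blocks already in the zfs pool, about 45G more was written to the pool; comparing runs with and without the Bloom filter, the write speed of stored data blocks improved by roughly 110% with the Bloom filter enabled, which is a substantial gain.
Step 302: During real-time data deduplication, if the record information of a data block is not found in the in-memory data deduplication table, or is still not found within the preset duration after the in-memory table has been searched and the on-disk data deduplication table has been searched, the data block is written to disk;
Step 303: Establish a temporary data deduplication table according to the writing of the data block; according to a preset policy, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table;
Step 304: When post-processing deduplication completes, merge the post-processed temporary data deduplication table into the data deduplication table.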
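Starting the independent thread according to the preset policy mainly includes: setting a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
setting a storage-amount threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the threshold, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table; or,
when the system is found to be idle, starting an independent thread to perform post-processing deduplication on the data blocks recorded in the temporary data deduplication table.
The independent thread works as follows:
First, the initial state of the process is "waiting for external wake-up".
When the real-time deduplication system establishes the temporary data deduplication table, a wake-up signal for the independent thread is issued through the temporary data deduplication table.
Once awake, if the temporary data deduplication table has reached the storage-amount threshold, post-processing deduplication starts immediately. If it has not, a timed wait begins based on the processing-duration threshold set for the temporary data deduplication table, and post-processing deduplication starts when the timer expires. If the system is found to be idle, post-processing deduplication starts immediately.
On entering deduplication processing, the timer for the processing-duration threshold is cleared; after deduplication processing completes, the timer returns to its initial state.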
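The trigger logic described above can be modelled as a small state machine. This sketch leaves out the actual thread and passes the current time in explicitly so that the behaviour is easy to follow; the class name `PostProcessTrigger`, its method names, and the threshold values in the usage lines are assumptions for illustration only.

```python
class PostProcessTrigger:
    """Decides when the independent post-processing thread should run."""

    def __init__(self, max_entries, max_wait_s):
        self.max_entries = max_entries  # storage-amount threshold
        self.max_wait_s = max_wait_s    # processing-duration threshold
        self.started_at = None          # None = "waiting for external wake-up"

    def wake(self, now):
        # Wake-up signal issued when the temporary table is established.
        if self.started_at is None:
            self.started_at = now

    def should_run(self, now, pending_entries, system_idle):
        if self.started_at is None:
            return False  # still waiting for the wake-up signal
        return (pending_entries >= self.max_entries
                or now - self.started_at >= self.max_wait_s
                or system_idle)

    def begin_run(self):
        # Clear the timer on entering deduplication processing.
        self.started_at = None


trigger = PostProcessTrigger(max_entries=3, max_wait_s=10.0)
before_wake = trigger.should_run(now=0.0, pending_entries=5, system_idle=True)
trigger.wake(now=0.0)
too_early = trigger.should_run(now=1.0, pending_entries=1, system_idle=False)
by_size = trigger.should_run(now=1.0, pending_entries=3, system_idle=False)
by_time = trigger.should_run(now=10.0, pending_entries=1, system_idle=False)
by_idle = trigger.should_run(now=1.0, pending_entries=1, system_idle=True)
trigger.begin_run()
after_run = trigger.should_run(now=100.0, pending_entries=5, system_idle=True)
```

In a real system the same checks would typically sit inside a worker thread that blocks on a condition variable with a timeout; the explicit `now` argument here is simply a way to keep the example deterministic.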
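A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, apparatus or device); when executed, it performs one or a combination of the steps of the method embodiments.
Optionally, all or some of the steps of the above embodiments may also be implemented using integrated circuits: the steps may each be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. The present invention is thus not limited to any particular combination of hardware and software.
The apparatuses, functional modules and functional units in the above embodiments may be implemented using general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of multiple computing devices.
When the apparatuses, functional modules and functional units in the above embodiments are implemented as software functional modules and sold or used as standalone products, they may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be that defined by the claims.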
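Industrial Applicability
With the method and apparatus provided by the embodiments of the present invention, during real-time deduplication, when information for a data block is not found in the in-memory data deduplication table, or is not found in the on-disk data deduplication table within a preset duration, the data block is written to disk and a temporary data deduplication table is created. Post-processing deduplication is then performed, according to a preset policy, on the corresponding data blocks recorded in the temporary data deduplication table. This avoids an exhaustive search of the data deduplication table, reduces real-time I/O latency, and improves deduplication efficiency.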

Claims (9)

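  1. A method for implementing data deduplication, comprising:
    during real-time deduplication, when information for a data block is not found in the in-memory data deduplication table, or when information for the data block is not found in the on-disk data deduplication table within a preset duration, writing the data block to disk;
    creating a temporary data deduplication table on the basis of the writing of the data block; and
    performing, according to a preset policy, post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.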
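  2. The method according to claim 1, further comprising, before real-time deduplication is performed:
    obtaining the hash fingerprint of the data block as the key (KEY) for deleting duplicate data; and
    determining, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when the KEY is not recorded in the data deduplication table, storing the data block and adding the KEY and storage address to the data deduplication table; when the KEY is already recorded in the data deduplication table, performing real-time deduplication.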
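  3. The method according to claim 1 or 2, wherein performing, according to a preset policy, post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table comprises:
    setting a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or
    setting a storage-size threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the storage-size threshold, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or
    when the system is found to be idle, starting an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.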
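  4. The method according to claim 1 or 2, further comprising merging the post-processed temporary data deduplication table into the data deduplication table, which specifically comprises:
    during post-processing deduplication, for a non-duplicate data block that has been written to disk, adding the information for that data block in the temporary data deduplication table to the data deduplication table; for a duplicate data block, deleting the information for that data block in the temporary data deduplication table and then modifying the reference count information for the duplicate data block in the data deduplication table.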
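  5. An apparatus for implementing data deduplication, comprising a write unit and a temporary data deduplication processing unit, wherein:
    the write unit is configured to, during real-time deduplication, write a data block to disk when information for the data block is not found in the in-memory data deduplication table, or when information for the data block is not found in the on-disk data deduplication table within a preset duration; and
    the temporary data deduplication processing unit is configured to create a temporary data deduplication table on the basis of the writing of the data block, and to perform, according to a preset policy, post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.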
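  6. The apparatus according to claim 5, further comprising an acquisition unit and a lookup processing unit, wherein:
    the acquisition unit is configured to obtain, before the write unit performs real-time deduplication, the hash fingerprint of the data block as the key (KEY) for deleting duplicate data; and
    the lookup processing unit is configured to determine, by means of a Bloom filter, whether the KEY is recorded in the data deduplication table; when the KEY is not recorded in the data deduplication table, to store the data block and add the KEY and storage address to the data deduplication table; when the KEY is already recorded in the data deduplication table, to perform real-time deduplication.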
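  7. The apparatus according to claim 5 or 6, wherein the temporary data deduplication processing unit is configured to create the temporary data deduplication table on the basis of the writing of the data block; and
    to set a processing-duration threshold for the temporary data deduplication table and, when the processing duration reaches the threshold, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or
    to set a storage-size threshold for the temporary data deduplication table and, when the temporary data deduplication table reaches the storage-size threshold, start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table; or
    when the system is found to be idle, to start an independent thread to perform post-processing deduplication on the corresponding data blocks recorded in the temporary data deduplication table.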
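  8. The apparatus according to claim 5 or 6, wherein the temporary data deduplication processing unit is further configured to merge the temporary data deduplication table into the data deduplication table after post-processing deduplication is complete, including:
    during post-processing deduplication, for a non-duplicate data block that has been written to disk, adding the information for that data block in the temporary data deduplication table to the data deduplication table; for a duplicate data block, deleting the information for that data block in the temporary data deduplication table and then modifying the reference count information for the duplicate data block in the data deduplication table, so that the post-processed temporary data deduplication table is merged into the data deduplication table.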
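  9. A computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a data deduplication device, enable the device to perform the method of any one of claims 1 to 4.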
PCT/CN2015/073136 2014-11-07 2015-02-15 Method and apparatus for implementing data deduplication WO2016070529A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410623909.2A CN105630834B (zh) 2014-11-07 2014-11-07 Method and apparatus for implementing data deduplication
CN201410623909.2 2014-11-07

Publications (1)

Publication Number Publication Date
WO2016070529A1 true WO2016070529A1 (zh) 2016-05-12

Family

ID=55908460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/073136 WO2016070529A1 (zh) 2014-11-07 2015-02-15 Method and apparatus for implementing data deduplication

Country Status (2)

Country Link
CN (1) CN105630834B (zh)
WO (1) WO2016070529A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301351A (zh) * 2017-06-22 2017-10-27 北京北信源软件股份有限公司 一种扫描与清除网络访问记录的方法与装置
CN114356212A (zh) * 2021-11-23 2022-04-15 阿里巴巴(中国)有限公司 数据处理方法、系统及计算机可读存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10235396B2 (en) * 2016-08-29 2019-03-19 International Business Machines Corporation Workload optimized data deduplication using ghost fingerprints
CN108572789B (zh) * 2017-03-13 2022-01-28 阿里巴巴集团控股有限公司 磁盘存储方法和装置、消息推送方法和装置及电子设备
CN108256003A (zh) * 2017-12-29 2018-07-06 天津南大通用数据技术股份有限公司 一种根据分析数据重复率提高union运算效率的方法
CN108762680A (zh) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 一种控制ddp模块开关的方法及其相关装置
CN113760187B (zh) * 2021-07-29 2023-08-18 苏州浪潮智能科技有限公司 重删io线程生成方法、系统、终端及存储介质
CN113961549A (zh) * 2021-09-22 2022-01-21 李凤杰 基于数据仓库的医疗数据整合方法及系统
WO2023070462A1 (zh) * 2021-10-28 2023-05-04 华为技术有限公司 一种文件去重方法、装置和设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282672A1 (en) * 2012-04-18 2013-10-24 Hitachi Computer Peripherals Co., Ltd. Storage apparatus and storage control method
WO2014063062A1 (en) * 2012-10-18 2014-04-24 Netapp, Inc. Selective deduplication
US20140122818A1 (en) * 2012-10-31 2014-05-01 Hitachi Computer Peripherals Co., Ltd. Storage apparatus and method for controlling storage apparatus
CN103970744A (zh) * 2013-01-25 2014-08-06 华中科技大学 一种可扩展的重复数据检测方法
US20140325147A1 (en) * 2012-03-14 2014-10-30 Netapp, Inc. Deduplication of data blocks on storage devices

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747584B1 (en) * 2006-08-22 2010-06-29 Netapp, Inc. System and method for enabling de-duplication in a storage system architecture
CN102222085B (zh) * 2011-05-17 2012-08-22 华中科技大学 一种基于相似性与局部性结合的重复数据删除方法
CN102810107B (zh) * 2011-06-01 2015-10-07 英业达股份有限公司 重复数据的处理方法
CN102833298A (zh) * 2011-06-17 2012-12-19 英业达集团(天津)电子技术有限公司 分布式的重复数据删除系统及其处理方法
CN102915278A (zh) * 2012-09-19 2013-02-06 浪潮(北京)电子信息产业有限公司 重复数据删除方法
CN104077380B (zh) * 2014-06-26 2017-07-18 深圳信息职业技术学院 一种重复数据删除方法、装置及系统

Also Published As

Publication number Publication date
CN105630834B (zh) 2021-07-20
CN105630834A (zh) 2016-06-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15856588

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15856588

Country of ref document: EP

Kind code of ref document: A1