WO2023147861A1 - Method of tracking sensitive data in a data storage system - Google Patents

Method of tracking sensitive data in a data storage system

Info

Publication number
WO2023147861A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
scan
file
chunk
chunks
Prior art date
Application number
PCT/EP2022/052574
Other languages
French (fr)
Inventor
Amit Margalit
Shahar SALZMAN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/052574 priority Critical patent/WO2023147861A1/en
Publication of WO2023147861A1 publication Critical patent/WO2023147861A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries


Abstract

Provided is a method of tracking sensitive data in a data storage system (102, 202). The method includes receiving an incoming file, dividing the incoming file into one or more data chunks, searching for each data chunk in a deduplication index associated with the data storage system, adding each data chunk that is not found in the deduplication index to the deduplication index, performing a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index, and adding a result of the scan as an SD metadata record associated with each scanned data chunk in the deduplication index.

Description

METHOD OF TRACKING SENSITIVE DATA IN A DATA STORAGE SYSTEM
TECHNICAL FIELD
The disclosure relates to data storage systems and, more particularly, to a method of tracking sensitive data in a data storage system and a data processing module for the data storage system.
BACKGROUND
Data storage systems are widely used to store data in data centers, which are repositories for persistently storing and managing the stored data, including files and bytes. A data storage system also stores information that must be protected against unauthorized access, i.e. sensitive data. One of the most intensive tasks in handling sensitive data is tracking where the data is located. Systems that scan for, detect, and identify sensitive data may use enormous amounts of resources, including central processing unit, memory, network, and input/output resources, to perform the identification of the sensitive data.
Normally, a typical data center generates a huge amount of new data each day, as well as equally large amounts of backup images, copies of existing files, and the like. The new data includes new files and modifications to existing files. Existing solutions enter the new files and the modified files into a queue for re-scanning and then process the queue. As data centers grow, handling more and more users and applications, the amount of data that needs to be rescanned increases. Further, as the world progresses towards more privacy regulation, data processing entities must scan for more types of sensitive data, making the scan queue take longer and longer to process.
Another existing solution is a catalog-based solution that registers all documents in a catalog, scans for the sensitive data, and saves the results in the catalog. The catalog-based solution may provide fast results, but it requires every new file to be fully scanned, and every change in a file may trigger a rescan, resulting in a longer time to process the scan queue. Yet another existing solution performs a general classification of documents; when a request to find the sensitive data arrives, the system would then know which documents need to be scanned, and new documents and changed documents are only registered for later scanning. Further, both the catalog-based solution and the general classification of documents are resource-intensive when the data storage system contains similar files, copies of existing files, and modifications of files. When a file is modified, the file is re-scanned fully even for a small change. When there is a copy of an existing file, or a file containing partial copies of information from other files, the data storage system treats it as a new file and initiates the scan.
Therefore, there arises a need to address the aforementioned technical drawbacks and problems in the data storage systems.
SUMMARY
It is an object of the disclosure to provide a method of tracking sensitive data in a data storage system, and a data processing module for the data storage system, while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further implementations are apparent from the dependent claims, the description, and the figures.
The disclosure provides a method of tracking sensitive data in a data storage system, and a data processing module for the data storage system.
According to a first aspect, there is provided a method of tracking sensitive data in a data storage system. The method includes receiving an incoming file. The method includes dividing the incoming file into one or more data chunks. The method includes searching for each data chunk in a deduplication index associated with the data storage system. The method includes adding each data chunk that is not found in the deduplication index to the deduplication index. The method includes performing a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index. The method includes adding a result of the scan as an SD metadata record associated with each scanned data chunk in the deduplication index. The method enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The method keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The method prioritizes and optimizes a scan queue when there is a change in a file. The method reduces the scanning effort and the catalog-assisted effort. The method handles sensitive data that crosses chunk boundaries by alerting the scanning unit that scanning the entire chunk is not required. The method provides additional optimization using one or more similarity score techniques.
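For illustration only, the first aspect can be pictured with the following minimal Python sketch. The fixed-size chunking, the SHA-256 chunk fingerprints, and the single regular-expression scanner are assumptions made for the example; the disclosure does not prescribe a particular chunking scheme, fingerprint, or scanner.

```python
import hashlib
import re

CHUNK_SIZE = 4096  # assumed fixed-size chunking; content-defined chunking would also work
SD_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")  # assumed example pattern for sensitive data

# Deduplication index: chunk fingerprint -> {"data": chunk bytes, "sd": SD metadata record}
dedup_index = {}

def track_sensitive_data(file_bytes):
    """Receive an incoming file and return the SD metadata record of each of its chunks."""
    # Divide the incoming file into one or more data chunks.
    chunks = [file_bytes[i:i + CHUNK_SIZE] for i in range(0, len(file_bytes), CHUNK_SIZE)]
    records = []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # Search for the data chunk in the deduplication index.
        if fingerprint not in dedup_index:
            # The chunk is unknown: add it to the index and scan it for sensitive data, SD.
            has_sd = SD_PATTERN.search(chunk) is not None
            # Add the result of the scan as an SD metadata record associated with the chunk.
            dedup_index[fingerprint] = {"data": chunk, "sd": {"has_sd": has_sd}}
        records.append(dedup_index[fingerprint]["sd"])
    return records
```

In this sketch, a second call with a copy of the same file finds every chunk in the index and therefore triggers no further scanning.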
Optionally, in the method, performing a scan includes performing a further scan of one or more data chunks adjacent to each scanned data chunk.

Optionally, the method further includes, if the incoming file is a new file, generating an SD file metadata record for the incoming file based on the SD metadata record for each data chunk in the incoming file.

Optionally, the method further includes, if the incoming file is a modified file, updating an SD file metadata record for the incoming file based on the SD metadata record for the scanned chunks.

Optionally, the method further includes updating an SD file metadata record for one or more additional files that also include at least one of the scanned data chunks.

Optionally, the method further includes, if a scanned data chunk is associated with an SD metadata record in the deduplication index which indicates the presence of SD, and a result of the scan for the data chunk indicates no SD, lowering an update priority for updating the SD file metadata for additional files that also include the data chunk.

Optionally, in the method, performing a scan includes adding the data chunks to a scan queue.

Optionally, the method further includes optimizing a scan queue by searching the scan queue for additional data chunks of an additional file which also includes one or more data chunks of the incoming file added to the scan queue, and increasing a queue priority of the additional data chunks in the scan queue. Optionally, the method further includes, in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, increasing a queue priority of any data chunks of the incoming file remaining in the scan queue and increasing a queue priority of the additional data chunks in the scan queue.

Optionally, in the method, adding the result of the scan to the SD metadata record for a data chunk includes adding a boundary indication if the SD crosses a boundary of the data chunk. Optionally, performing a scan on a data chunk associated with a boundary indication may include performing the scan on only a portion of the data chunk.

Optionally, the method further includes, in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, searching the deduplication index for one or more similar data chunks, based on a similarity score, and performing a scan for the presence of SD in each similar data chunk.

Optionally, the method further includes performing a scan for the presence of SD in each data chunk in a file containing one or more of the similar data chunks.
According to a second aspect, there is provided a data processing module for a data storage system. The data processing module includes an input unit, a deduplication unit, and a scanning unit. The input unit is configured to receive an incoming file. The deduplication unit is configured to divide the incoming file into one or more data chunks, search for each data chunk in a deduplication index associated with the data storage system, and add each data chunk that is not found in the deduplication index to the deduplication index. The scanning unit is configured to perform a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index, and add a result of the scan as an SD metadata record associated with each scanned data chunk in the deduplication index.

The data processing module enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The data processing module keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The data processing module prioritizes and optimizes a scan queue when there is a change in a file. The data processing module reduces the scanning effort and the catalog-assisted effort. The data processing module handles sensitive data that crosses chunk boundaries in the data storage system by alerting the scanning unit that scanning the entire chunk is not required. The data processing module provides additional optimization using one or more similarity score techniques.
According to a third aspect, there is provided a data storage system including the data processing module.
The data processing module enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The data processing module keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The data processing module prioritizes and optimizes a scan queue when there is a change in a file. The data processing module reduces the scanning effort and the catalog-assisted effort. The data processing module handles sensitive data that crosses chunk boundaries in the data storage system by alerting the scanning unit that scanning the entire chunk is not required. The data processing module provides additional optimization using one or more similarity score techniques.
According to a fourth aspect, there is provided a computer readable medium including instructions which, when executed by a processor, cause the processor to perform the method.
Therefore, in contradistinction to the prior art, the method, the data processing module, and the data storage system reduce the scanning effort by prioritizing and optimizing the scan queue whenever a file's content is changed, removed, or added. The method, the data processing module, and the data storage system avoid storing the same data more than once in the data storage system, thereby reducing the space required in the data storage system.
These and other aspects of the disclosure will be apparent from the implementations described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a data processing module for a data storage system in accordance with an implementation of the disclosure;
FIG. 2 illustrates a block diagram of a data storage system that includes a data processing module in accordance with an implementation of the disclosure;
FIG. 3 illustrates an exemplary diagram of processing of a new file in a data processing module in accordance with an implementation of the disclosure;
FIG. 4 is a flow diagram that illustrates a method of tracking sensitive data in a data storage system in accordance with an implementation of the disclosure; and
FIG. 5 is an illustration of a computing arrangement (e.g. a data processing module) that is used in accordance with implementations of the disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a method of tracking sensitive data in a data storage system that reduces the scanning effort by prioritizing and optimizing the scan queue whenever a file's content is changed, removed, or added. The disclosure also provides a data processing module for the data storage system, and the data storage system itself, which reduce the scanning effort in the same way.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein.
Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
FIG. 1 is a block diagram of a data processing module 104 for a data storage system 102 in accordance with an implementation of the disclosure. The data processing module 104 includes an input unit 106, a deduplication unit 108, and a scanning unit 110. The input unit 106 is configured to receive an incoming file. The deduplication unit 108 is configured to divide the incoming file into one or more data chunks. The deduplication unit 108 is configured to search for each data chunk in a deduplication index associated with the data storage system 102. The deduplication unit 108 is configured to add each data chunk that is not found in the deduplication index to the deduplication index. The scanning unit 110 is configured to perform a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index. The scanning unit 110 is configured to add a result of the scan as a SD metadata record associated with each scanned data chunk in the deduplication index.
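One possible shape for the records that the deduplication unit 108 and the scanning unit 110 maintain is sketched below in Python; the field names and the dataclass layout are illustrative assumptions, not a data format defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SDMetadataRecord:
    """Result of a sensitive-data scan, stored per chunk and aggregated per file."""
    has_sd: bool = False
    sd_tags: List[str] = field(default_factory=list)  # the kinds of SD that were found
    crosses_boundary: bool = False                     # boundary indication for SD spanning chunks

@dataclass
class ChunkEntry:
    """One entry of the deduplication index."""
    fingerprint: str
    ref_count: int = 0                      # number of files that reference this chunk
    sd: Optional[SDMetadataRecord] = None   # None means the chunk has not been scanned yet

@dataclass
class FileEntry:
    """Catalog entry of one file: its chunk-list and its file-level SD metadata record."""
    chunk_list: List[str] = field(default_factory=list)  # chunk fingerprints, in file order
    sd: Optional[SDMetadataRecord] = None

dedup_index: Dict[str, ChunkEntry] = {}
file_catalog: Dict[str, FileEntry] = {}
```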
The data processing module 104 enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The data processing module 104 keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The data processing module 104 prioritizes and optimizes a scan queue when there is a change in a file. The data processing module 104 reduces the scanning effort and the catalog-assisted effort. The data processing module 104 handles sensitive data that crosses chunk boundaries in the data storage system 102 by alerting the scanning unit 110 that scanning the entire chunk is not required. The data processing module 104 provides additional optimization using one or more similarity score techniques.
When a file's content is changed in the data storage system 102, the new data resides in new chunks of the modified file, and the data processing module 104 is configured to add the new chunks to a scan queue. Optionally, chunks adjacent to the new chunks in the modified file are also added to the rescan queue, to identify cases where sensitive data crosses chunk boundaries. The scanning unit 110 is configured to perform the scan as per the scan queue. When a new file is added to the data storage system 102, the data processing module 104 checks it for known chunks. The new file may be a copy of another file in the data storage system 102 that the scanning unit 110 has already scanned; in that case, the new file consists of known chunks. The known chunks may include sensitive information, i.e. sensitive data, SD. The data processing module 104 is configured to mark the new file if the new file includes the known chunks. Optionally, the data processing module 104 holds the marked files back from scanning in the scanning unit 110, as the marked files are already scanned. If there are no known chunks in the new file and the new file includes new chunks, the data processing module 104 adds the new file to the scan queue; optionally, adjacent chunks are added to the scan queue as well. The scanning unit 110 is configured to perform the scan as per the scan queue. Optionally, when the sensitive information happens to cross a chunk boundary, the data processing module 104 places a special mark on the chunk, indicating that the chunk includes SD that crosses the chunk boundary.
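A minimal sketch of this chunk selection for a modified file is shown below; it assumes chunks are compared by fingerprint and that the optional adjacent-chunk rescan is enabled.

```python
def chunks_to_rescan(old_chunk_list, new_chunk_list):
    """Return the positions in the modified file whose chunks should enter the scan queue.

    New data resides in new chunks; the chunks adjacent to each new chunk are also
    selected so that sensitive data crossing a chunk boundary is not missed.
    """
    known = set(old_chunk_list)
    positions = set()
    for i, fingerprint in enumerate(new_chunk_list):
        if fingerprint not in known:              # a new (changed) chunk
            positions.add(i)
            if i > 0:
                positions.add(i - 1)              # adjacent chunk before the change
            if i + 1 < len(new_chunk_list):
                positions.add(i + 1)              # adjacent chunk after the change
    return sorted(positions)

# Example: only the third chunk changed, so chunks 1, 2 and 3 are rescanned.
# chunks_to_rescan(["a", "b", "c", "d"], ["a", "b", "x", "d"]) -> [1, 2, 3]
```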
When a file is removed from the data storage system 102, the chunks associated with the file may not disappear, as the chunks may be used by other files. The other files may be old backup copies in the data storage system 102. Optionally, the data processing module 104 decrements the reference counts of the chunks, and the chunks are removed when their reference counts reach zero. The data storage system 102 may remove the file along with the scan result information of the file. Optionally, the data storage system 102 is a deduplicated storage system. Optionally, if the removed file includes SD, metadata is to be updated in the data storage system 102.
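The reference-count handling on file removal can be pictured with the following standalone Python sketch, which keeps the index and the catalog as plain dictionaries; the exact bookkeeping is an assumption made for illustration.

```python
def remove_file(file_id, file_catalog, dedup_index):
    """Remove a file: decrement the reference count of its chunks and drop chunks that reach zero.

    The chunks themselves may survive the removal, because other files (for example
    old backup copies) can still reference them.
    """
    entry = file_catalog.pop(file_id, None)       # the file and its scan results are removed
    if entry is None:
        return
    for fingerprint in entry["chunk_list"]:
        chunk = dedup_index[fingerprint]
        chunk["ref_count"] -= 1
        if chunk["ref_count"] == 0:
            # No file references the chunk any more; its data and SD metadata record can go.
            del dedup_index[fingerprint]
```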
The data processing module 104 is configured to prioritize the scan queue whenever the file’s content is changed, removed, or added in the data storage system 102 and optimize a rescanning process by prioritizing the chunks of the modified file, or the new file. Optionally, the data processing module 104 checks the scan queue, and if the scan queue includes other files that use some of the same chunks of the modified file, or the new file, the data processing module 104 is configured to prioritize the chunks in the scan queue. The scanning unit 110 is configured to scan the chunks based on the scan queue. Results of the scanning may be propagated to all files using the same chunks.
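The scan-queue handling can be sketched as a small priority queue; the heapq-based queue, the numeric priorities, and the lazy removal of superseded entries below are implementation assumptions, not requirements of the disclosure.

```python
import heapq
from itertools import count

_tie = count()
scan_queue = []        # heap of (priority, tie_breaker, chunk_id); lower numbers are scanned first
live_priority = {}     # chunk_id -> priority of its current (non-stale) queue entry

def enqueue(chunk_id, priority=100):
    live_priority[chunk_id] = priority
    heapq.heappush(scan_queue, (priority, next(_tie), chunk_id))

def boost(chunk_ids, priority=10):
    """Give queued chunks a higher priority by re-inserting them; stale entries are skipped on pop."""
    for chunk_id in chunk_ids:
        if chunk_id in live_priority and priority < live_priority[chunk_id]:
            enqueue(chunk_id, priority)

def prioritize_shared_chunks(incoming_chunks, file_chunk_lists):
    """Boost queued chunks of other files that share chunks with the modified or new file."""
    incoming = set(incoming_chunks)
    for chunks in file_chunk_lists.values():
        if incoming.intersection(chunks):
            boost(chunks)

def pop_next():
    """Return the next chunk to scan, skipping superseded queue entries."""
    while scan_queue:
        priority, _, chunk_id = heapq.heappop(scan_queue)
        if live_priority.get(chunk_id) == priority:
            del live_priority[chunk_id]
            return chunk_id
    return None
```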
If the scanned chunk in the file includes new SD tags, the rest of the file and the other files using the same chunk may receive a higher priority. The data processing module 104 is configured to prioritize the scan queue based on the scanned chunks. Optionally, the data processing module 104 prioritizes the scan queue when the chunk in the file is new. If SD tags were removed from the scanned chunk compared to its state before the scan in the scanning unit 110, the priority may be lowered. Optionally, the data processing module 104 treats such a result as a potential false positive until the scan of the other files using the chunks is completed. If the scanned chunk does not include any change in SD tags, the scanning unit 110 completes the scan.
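The reaction to a finished chunk scan can be expressed as a small decision function; passing the priority changes in as callables is an assumption that keeps the sketch independent of any particular queue implementation.

```python
def react_to_scan_result(old_tags, new_tags, related_chunks, boost, lower):
    """Adjust scan-queue priorities after one chunk has been scanned.

    related_chunks are the still-queued chunks of the same file and of the other files
    that use the scanned chunk; boost and lower change their queue priority.
    """
    old_tags, new_tags = set(old_tags), set(new_tags)
    if new_tags - old_tags:
        # The chunk gained SD tags: the rest of the file and the other files
        # that use the chunk become more urgent.
        boost(related_chunks)
    elif old_tags - new_tags:
        # SD tags disappeared: treat the old result as a potential false positive and
        # lower the priority of the dependent updates until the other files are rescanned.
        lower(related_chunks)
    # Otherwise there is no change in SD tags, and the scan of this chunk is simply complete.
```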
FIG. 2 illustrates a block diagram of a data storage system 202 that includes a data processing module 204 in accordance with an implementation of the disclosure. The data processing module 204 enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The data processing module 204 keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The data processing module 204 prioritizes and optimizes a scan queue when there is a change in a file. The data processing module 204 reduces the scanning effort and the catalog-assisted effort. The data processing module 204 handles sensitive data that crosses chunk boundaries in the data storage system 202 by alerting a scanning unit that scanning the entire chunk is not required. The data processing module 204 provides additional optimization using one or more similarity score techniques.
Optionally, the scanning in the data processing module 204 is optimized by leveraging deduplication information in its catalog. The catalog may include any of a chunk-list, or a sensitive information scan status and results. Leveraging the deduplication information allows scan results to be propagated to new files containing known chunks, without accessing the deduplication chunk database, which is very large.
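This propagation through the catalog can be sketched as follows; using a per-chunk result dictionary to stand in for the catalog's scan status is an assumption made for the example.

```python
def file_sd_status(chunk_list, chunk_scan_results):
    """Derive a file-level SD record from the catalog alone.

    chunk_list is the file's chunk-list from the catalog, and chunk_scan_results maps a
    chunk fingerprint to its stored per-chunk SD scan result. Known chunks contribute
    their stored result, so a new file built only from known chunks gets its SD status
    without any access to the deduplication chunk database.
    """
    unscanned = [fp for fp in chunk_list if fp not in chunk_scan_results]
    has_sd = any(chunk_scan_results.get(fp, {}).get("has_sd", False) for fp in chunk_list)
    return {"has_sd": has_sd, "unscanned_chunks": unscanned}
```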
Optionally, the data storage system 202 employs algorithms to identify similarity between files or chunks, with fewer central processing unit resources than a full identification in the data storage system 202 would require. When scanning for a new type of sensitive information, the data processing module 204 may rescan all of the available data for instances of the new type of sensitive information. Optionally, the data processing module 204 optimizes this search over existing chunks by quickly identifying a smaller subset of files to be scanned. Each chunk including the new type of sensitive information may be analyzed, and all chunks having the same similarity markers may be given a higher priority for scanning.

FIG. 3 illustrates an exemplary diagram 300 of processing of a new file 302 in a data processing module in accordance with an implementation of the disclosure. A data storage system that has already processed one file may hold tags or labels per chunk and per file, and the exemplary diagram 300 illustrates the receipt of the new file 302. The exemplary diagram 300 includes the new file 302 and a chunk processing and identification 304. The new file 302 includes one or more chunks, including a chunk 7, a chunk 1, and a chunk 4. The chunk 7 is a new chunk and needs to be scanned; the new chunk, i.e. the chunk 7, includes sensitive data, SD. The data processing module is configured to prioritize the new chunk in a scan queue to perform a scan. The chunk processing and identification 304 is configured to process the chunks in the new file against a chunk list and a file list.
FIG. 4 is a flow diagram that illustrates a method of tracking sensitive data in a data storage system in accordance with an implementation of the disclosure. At a step 402, an incoming file is received. At a step 404, the incoming file is divided into one or more data chunks. At a step 406, each data chunk is searched for in a deduplication index associated with the data storage system. At a step 408, each data chunk that is not found in the deduplication index is added to the deduplication index. At a step 410, a scan for the presence of sensitive data, SD, is performed on each data chunk that is not found in the deduplication index. At a step 412, a result of the scan is added as an SD metadata record associated with each scanned data chunk in the deduplication index.
The method enables every file to be broken down into one or more data chunks that can be easily identified in order to avoid storing the same data more than once. The method keeps the results of the sensitive information scan for each chunk, as well as for each file, which saves effort whenever a file's content is changed, removed, or added. The method prioritizes and optimizes a scan queue when there is a change in a file. The method reduces the scanning effort and the catalog-assisted effort. The method handles sensitive data that crosses chunk boundaries by alerting a scanning unit that scanning the entire chunk is not required. The method provides additional optimization using one or more similarity score techniques.
Optionally, in the method, performing a scan includes performing a further scan of one or more data chunks adjacent to each scanned data chunk. Optionally, the method further includes, if the incoming file is a new file, generating an SD file metadata record for the incoming file based on the SD metadata record for each data chunk in the incoming file.

Optionally, the method further includes, if the incoming file is a modified file, updating an SD file metadata record for the incoming file based on the SD metadata record for the scanned chunks.

Optionally, the method further includes updating an SD file metadata record for one or more additional files that also include at least one of the scanned data chunks.

Optionally, the method further includes, if a scanned data chunk is associated with an SD metadata record in the deduplication index which indicates the presence of SD, and a result of the scan for the data chunk indicates no SD, lowering an update priority for updating the SD file metadata for additional files that also include the data chunk.
Optionally, performing a scan includes adding the data chunks to a scan queue.
Optionally, the method further includes optimizing a scan queue by searching the scan queue for additional data chunks of an additional file which also includes one or more data chunks of the incoming file added to the scan queue, and increasing a queue priority of the additional data chunks in the scan queue.
Optionally, the method further includes in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, increasing a queue priority of any data chunks of the incoming file remaining in the scan queue and increasing a queue priority of the additional data chunks in the scan queue.
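One possible shape for such a scan queue is sketched below in Python. The ScanQueue class, its priority values, and the boost_file helper are hypothetical illustrations of the queue-priority behaviour described above, not the claimed implementation.

```python
from dataclasses import dataclass


@dataclass
class ScanItem:
    chunk_fp: str
    file_id: str
    priority: int = 5            # smaller value -> scanned earlier


class ScanQueue:
    def __init__(self):
        self.items = []

    def add(self, chunk_fp, file_id):
        self.items.append(ScanItem(chunk_fp, file_id))

    def boost_file(self, file_id):
        """Raise the priority of every queued chunk belonging to file_id."""
        for item in self.items:
            if item.file_id == file_id:
                item.priority = min(item.priority, 1)

    def pop_next(self):
        """Return the highest-priority chunk still waiting to be scanned."""
        self.items.sort(key=lambda item: item.priority)
        return self.items.pop(0)


# Usage: file "f2" shares a chunk with the incoming file "f1"; once a chunk of
# "f1" is found to contain SD, both files' remaining chunks are boosted.
queue = ScanQueue()
queue.add("fp-a", "f1")
queue.add("fp-b", "f2")
queue.add("fp-c", "f1")
queue.boost_file("f1")
queue.boost_file("f2")
print(queue.pop_next())
```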
Optionally, adding the result of the scan to the SD metadata record for a data chunk includes adding a boundary indication if the SD crosses a boundary of the data chunk. Optionally, performing a scan on a data chunk associated with a boundary indication may include performing the scan on only a portion of the data chunk.
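A minimal sketch of the boundary indication is given below, assuming the SD metadata record stores a boolean flag and that the scanner reports match offsets; the WINDOW size and field names are hypothetical.

```python
WINDOW = 64          # assumed reach of SD past a chunk boundary, in bytes


def record_scan_result(record, matches, chunk_len):
    """Add the scan result, with a boundary indication when SD touches the end
    of the chunk (matches are (start, end) byte offsets from the scanner)."""
    record["sd"] = bool(matches)
    record["sd_at_boundary"] = any(end >= chunk_len for _, end in matches)


def partial_scan_region(record, chunk_len):
    """A chunk whose record carries a boundary indication only needs the
    region next to that boundary rescanned; otherwise scan the whole chunk."""
    if record.get("sd_at_boundary"):
        return max(chunk_len - WINDOW, 0), chunk_len
    return 0, chunk_len


# Usage: one match ends exactly at the chunk boundary, so later scans of this
# chunk may be limited to its tail.
rec = {}
record_scan_result(rec, matches=[(4090, 4096)], chunk_len=4096)
print(rec, partial_scan_region(rec, 4096))
```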
Optionally, the method further includes in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, searching the deduplication index for one or more similar data chunks, based on a similarity score, and performing a scan for the presence of SD in each similar data chunk.
Optionally, the method further includes performing a scan for the presence of SD in each data chunk in a file containing one or more of the similar data chunks.
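A minimal sketch of the similarity-score lookup is given below, assuming each deduplication-index entry keeps a marker set such as the one produced in the earlier sketch; the Jaccard score and the 0.5 threshold are illustrative choices, not part of the disclosure.

```python
def find_similar_chunks(markers, dedup_index, threshold=0.5):
    """Return fingerprints of chunks whose marker overlap (Jaccard score) with
    the SD-bearing chunk meets the threshold, so they can be queued for a scan."""
    similar = []
    for fp, record in dedup_index.items():
        other = record["markers"]
        union = markers | other
        score = len(markers & other) / len(union) if union else 0.0
        if score >= threshold:
            similar.append(fp)
    return similar


# Usage: chunks sharing at least half of their markers with the flagged chunk
# are returned as candidates for the follow-up SD scan.
index = {
    "fp-1": {"markers": frozenset({"a", "b", "c"})},
    "fp-2": {"markers": frozenset({"x", "y", "z"})},
}
print(find_similar_chunks(frozenset({"a", "b", "d"}), index))
```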
In an implementation, there is provided a computer readable medium including instructions which, when executed by a processor, cause the processor to perform the method.
FIG. 5 is an illustration of an exemplary computing arrangement (e.g. a data processing module) 500 in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computing arrangement 500 includes at least one processor 504 that is connected to a bus 502, wherein the bus 502 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computing arrangement 500 also includes a memory 506.
Control logic (software) and data are stored in the memory 506, which may take the form of random-access memory (RAM). In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
The computing arrangement 500 may also include a secondary storage 510. The secondary storage 510 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner. Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 506 and the secondary storage 510. Such computer programs, when executed, enable the computing arrangement 500 to perform various functions as described in the foregoing. The memory 506, the secondary storage 510, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 504, a graphics processor coupled to a communication interface 512, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 504 and a graphics processor, or a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions, etc.).
Furthermore, the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system. For example, the computing arrangement 500 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system.
Furthermore, the computing arrangement 500 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc. Additionally, although not shown, the computing arrangement 500 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 508.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, hardware, or a combination of software and hardware. Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A method of tracking sensitive data in a data storage system (102, 202), the method comprising: receiving an incoming file; dividing the incoming file into a plurality of data chunks; searching for each data chunk in a deduplication index associated with the data storage system (102, 202); adding each data chunk that is not found in the deduplication index to the deduplication index; performing a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index; and adding a result of the scan as a SD metadata record associated with each scanned data chunk in the deduplication index.
2. The method of claim 1, wherein performing a scan includes performing a further scan of one or more data chunks adjacent to each scanned data chunk.
3. The method of claim 1 or claim 2, further comprising, if the incoming file is a new file, generating a SD file metadata record for the incoming file, based on the SD metadata record for each data chunk in the incoming file.
4. The method of claim 1 or claim 2, further comprising, if the incoming file is a modified file, updating a SD file metadata record for the incoming file, based on the SD metadata record for the scanned chunks.
5. The method of any preceding claim, further comprising updating a SD file metadata record for one or more additional files that also include at least one of the scanned data chunks.
6. The method of claim 5, further comprising, if a scanned data chunk is associated with an SD metadata record in the deduplication index which indicates the presence of SD, and a result of the scan for the data chunk indicates no SD, lowering an update priority to update the SD file metadata for additional files that also include the data chunk.
7. The method of any preceding claim, wherein performing a scan includes adding the data chunks to a scan queue.
8. The method of claim 7, further comprising optimising a scan queue by searching the scan queue for additional data chunks of an additional file which also includes one or more data chunks of the incoming file added to the scan queue, and increasing a queue priority of the additional data chunks in the scan queue.
9. The method of claim 8, further comprising, in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, increasing a queue priority of any data chunks of the incoming file remaining in the scan queue and increasing a queue priority of the additional data chunks in the scan queue.
10. The method of any preceding claim, wherein adding the result of the scan to the SD metadata record for a data chunk includes adding a boundary indication if the SD crosses a boundary of the data chunk, wherein performing a scan on a data chunk associated with a boundary indication may include performing the scan on only a portion of the data chunk.
11. The method of claim 8, further comprising, in response to a result of the scan for any of the data chunks of the incoming file indicating the presence of SD, searching the deduplication index for one or more similar data chunks, based on a similarity score, and performing a scan for the presence of SD in each similar data chunk.
12. The method of claim 11, further comprising performing a scan for the presence of SD in each data chunk in a file containing one or more of the similar data chunks.
13. A data processing module (104, 204) for a data storage system (102, 202), the data processing module (104, 204) comprising: an input unit (106) configured to receive an incoming file; a deduplication unit (108) configured to: divide the incoming file into a plurality of data chunks; search for each data chunk in a deduplication index associated with the data storage system (102, 202); and add each data chunk that is not found in the deduplication index to the deduplication index; and a scanning unit (110) configured to: perform a scan for the presence of sensitive data, SD, on each data chunk that is not found in the deduplication index; and add a result of the scan as a SD metadata record associated with each scanned data chunk in the deduplication index.
14. A data storage system (102, 202) comprising the data processing module (104, 204) of claim 13.
15. A computer readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 12.
PCT/EP2022/052574 2022-02-03 2022-02-03 Method of tracking sensitive data in a data storage system WO2023147861A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/052574 WO2023147861A1 (en) 2022-02-03 2022-02-03 Method of tracking sensitive data in a data storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/052574 WO2023147861A1 (en) 2022-02-03 2022-02-03 Method of tracking sensitive data in a data storage system

Publications (1)

Publication Number Publication Date
WO2023147861A1 true WO2023147861A1 (en) 2023-08-10

Family

ID=81325356

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/052574 WO2023147861A1 (en) 2022-02-03 2022-02-03 Method of tracking sensitive data in a data storage system

Country Status (1)

Country Link
WO (1) WO2023147861A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012083262A2 (en) * 2010-12-16 2012-06-21 Microsoft Corporation Extensible pipeline for data deduplication
CN113536325A (en) * 2021-09-14 2021-10-22 杭州振牛信息科技有限公司 Digital information risk monitoring method and device
US20210365587A1 (en) * 2020-05-20 2021-11-25 EMC IP Holding Company LLC Data masking in a microservice architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22704521; Country of ref document: EP; Kind code of ref document: A1)