WO2023160776A2 - Controller and method for finding data to be protected in database - Google Patents

Controller and method for finding data to be protected in database

Info

Publication number
WO2023160776A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
controller
data segment
address
segment
Prior art date
Application number
PCT/EP2022/054513
Other languages
French (fr)
Other versions
WO2023160776A3 (en
Inventor
Avi Kessel
Shay Akirav
Ido SHLOMO
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/054513 priority Critical patent/WO2023160776A2/en
Publication of WO2023160776A2 publication Critical patent/WO2023160776A2/en
Publication of WO2023160776A3 publication Critical patent/WO2023160776A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Definitions

  • the present disclosure relates generally to the field of database security and compliance and, more specifically, to a controller for finding data to be protected in a database, and a method for finding data to be protected in the database.
  • a database is used by different organizations to maintain critical data (or assets), such as related to multiple subjects, customers and potential customers, and the like.
  • data privacy is one of the major issues arising from data handling, data sharing, and data protection.
  • various laws and regulations are used to provide the fundamental restrictions of collecting and processing personally identifiable information such as sensitive data.
  • the sensitive data corresponds to a piece of information that is protected against unwarranted disclosure, such as personal information, financial data, intellectual property information, and the like.
  • sensitive data discovery plays a major role in the implementation of data privacy as it helps to identify the sensitive data from a huge collection of data. In other words, extraction of information about specific sensitive data is complicated, time-consuming, and sometimes requires manual efforts.
  • data owners are not aware of the type of sensitive data maintained in the database.
  • the conventional approaches for scanning the database for sensitive data discovery include the implementation of various patterns based on various regular expressions to check the database with respect to object names and also to perform data sampling tests based on those regular expressions.
  • the existing approach also uses a checksum function for scanning the databases.
  • the sensitive data is identified and protected by applying the security policies; however, the conventional approaches produce many false positives and false negatives, which adversely affects the protection of the sensitive data.
  • the conventional approaches are therefore not well suited to preventing false positives and false negatives.
  • the present disclosure provides a controller for finding data to be protected in a database and a method for finding data to be protected in the database.
  • the present disclosure provides a solution to the existing problem of how to detect sensitive data and reduce the number of false detections.
  • An objective of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in the prior art and provides an improved controller for finding data to be protected in a database and an improved method for finding data to be protected in the database. For example, for sensitive data discovery and classification in the databases.
  • the present disclosure provides a controller for finding data to be protected in a database.
  • the controller is configured to receive data comprising one or more data segments and determine which data segments to protect by applying pattern classification, classifying metadata, and classifying data.
  • the controller is configured to label the data segment to be protected as sensitive and protect the data segment by applying security policies.
  • the controller is configured to classify the data in a data segment based on at least a partial address, names of person, file type classification, face detection, and encrypted and/or random content.
  • finding the sensitive data and then classifying the data segments based on the scoring mechanism enables the controller to find the sensitive data and also to apply security policies on the sensitive data to protect the sensitive data. Further, determining which sensitive data is to be protected by applying pattern classifiers, advanced data classifiers, or metadata classifiers reduces the number of false positive detections and false negative detections, which provides a more accurate classification of the sensitive data.
  • the controller is further configured to classify the data in a data segment based on at least a partial address by utilizing an address parser to tokenize the data as an address, providing address features.
  • the controller is further configured to determine if the number of features is more than a configurable number and if so classify the data segment as comprising an address.
  • the classification of data based on the partial address enables the controller to find the address data segment in the database.
  • the controller is further configured to look up the at least partial address in an address register to validate the address, and if it is found in the address register, the at least partial address is categorized as an address and the data segment is labelled as holding an address entity.
  • the controller can reduce the number of false positives detections and false negatives detections through validation of the address.
  • the controller is further configured to classify the data based on names of persons by searching in a name database for a match for the data.
  • the name database comprises the most used names in a current region and/or language, and if a match is found, the data segment is labelled as holding a name entry.
  • the controller classifies the data based on the name of the person that enables the controller to find the sensitive data that includes the name of the person.
  • the controller is further configured to determine that a field in the data is of a length in a range (MIN to MAX) prior to searching the name database for a match for the field.
  • the range depends on the current region and/or language.
  • the length of the name in the name database enables the controller to reduce false positives detections and false negatives detections.
  • the controller is further configured to classify the data based on file type classification by determining that a file in the data is of a protected type and determining that the data segment holds a protected file entry.
  • the classification of the data based on the file type enables the controller to find the sensitive data that is included in the files.
  • the controller is further configured to classify the data based on face detection by determining that a file in the data segment is of an image type.
  • the controller is further configured for applying face detection utilizing an object detection model to the file and determining if a face is detected and if so determining that the data segment holds a detected face entry.
  • the classification of sensitive data based on the face detection enables the controller to find the sensitive data that includes the images of the person.
  • the controller is further configured to classify the data utilizing the Shannon entropy.
  • the controller is further configured to determine if the data segment comprises a semi-structure, and if so then determine a name of the semi-structure.
  • the controller is further configured to compare the name to name classifiers, and if a match is found, to determine that the data segment holds an entry.
  • if no match is found, the controller further determines if the content of the semi-structure is a final node, and if so, determines that the data segment holds an entry.
  • the determination of data segments that include the semi-structure enables the controller to determine if the data segment holds the sensitive data or not.
  • the controller is further configured to score the determination of the data to be protected by determining a score based on a weight for finding an entry and a fraction of the number of records where an entry was found out of the number of entries scanned.
  • the determination of the data to be protected by the score enables the controller to determine the number of entries scanned.
  • the controller is further configured to protect the data segment to be protected by applying security policies.
  • the controller is further configured to select the security policies to be applied for a data segment, based on the type of entries in the data segment.
  • the protection of data segments by applying the security policies enables the controller to implement security policies for the protection of sensitive data.
  • the controller is further configured to determine if a segment of the data holds a number of entries of a specific class exceeding a configurable number and if so, determine that the segment is to be protected.
  • the controller determines the number of entries of a specific class that further enables the controller to protect that data segment.
  • the present disclosure provides a method for finding data to be protected in a database, wherein the method comprises receiving data comprising one or more data segments.
  • the method further comprises determining which data segments to protect by applying pattern classification, classifying metadata, and classifying data.
  • the method further comprises labelling the data segment to be protected as sensitive and protecting the data segment by applying security policies.
  • the method further comprises classifying the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
  • the method achieves all the advantages and technical effects of the controller of the present disclosure.
  • FIG. 1 is a block diagram that depicts a controller to be used for finding data in a database, in accordance with another embodiment of the present disclosure
  • FIG. 2 depicts a flowchart of finding data to be protected in a database, in accordance with an embodiment of the present disclosure
  • FIG. 3 depicts a flowchart of finding data that includes an address to be protected in a database, in accordance with an embodiment of the present disclosure
  • FIG. 4 depicts a graphical representation that illustrates the entropy of different types of data to further detect the sensitive data, in accordance with an embodiment of the present disclosure
  • FIG. 5 depicts a flowchart of finding a data segment that includes a semi-structure to be protected in a database, in accordance with another embodiment of the present disclosure
  • FIG. 6 depicts a flowchart of finding that data stored in a semi-structure is sensitive data, in accordance with another embodiment of the present disclosure
  • FIG. 7 depicts a flowchart of applying security policies to protect data segments, in accordance with an embodiment of the present disclosure.
  • FIG. 8 depicts a flowchart of a method of finding data to be protected in a database, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the nonunderlined number to the item.
  • the non-underlined number is used to identify a general item at which the arrow is pointing.
  • FIG. 1 is a block diagram that depicts a controller to be used for finding data in a database, in accordance with another embodiment of the present disclosure. With reference to FIG. 1, there is shown a block diagram 100 that includes a controller 102 and a database 104.
  • the controller 102 includes suitable logic, circuitry, interfaces, and/or code that is configured to find a sensitive data in the database 104.
  • Examples of implementation of the controller 102 may include but are not limited to a central data processing device, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or circuitry.
  • the database 104 stores information corresponding to the plurality of data segments that can be further classified as the sensitive data.
  • a communication interface 106 may be used to communicate with the components like the database 104, the controller 102, and the memory 108.
  • the controller 102 is also in communication with a memory 108 that is configured to hold the data received by the controller 102.
  • Examples of implementation of the memory 108 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Dynamic Random-Access Memory (DRAM), Random Access Memory (RAM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), and/or CPU cache memory.
  • the controller 102 is configured to find the data to be protected in the database 104.
  • the controller 102 receives the data including one or more data segments, and further determines which data segment is to be protected in the database 104.
  • the controller 102 is further configured to apply various classifications, such as pattern classification, metadata classification, and data classification to determine, which data segments are to be protected.
  • the controller 102 further labels the data segment to be protected as sensitive and also labels the type of data segment (e.g., name, address, credit card, and the like).
  • the controller 102 further protects the data segment by applying the security policies as further shown and described in detail in FIG. 2. In other words, the controller 102 provides an improved sensitive data discovery and protection system by applying advanced techniques to identify the data as sensitive.
  • the sensitive data corresponds to information that is protected against unwarranted disclosure whose protection is required for legal or ethical reasons such as for issues pertaining to personal privacy, or for proprietary considerations.
  • the pattern classification, the metadata classification, and the data classification are used by the controller 102 to detect more types of sensitive data and also to reduce the number of false detections, such as by using a scoring mechanism.
  • the data segments include a plurality of records
  • the controller 102 is further configured to classify the data in the data segment by classifying the data in each record based on the classification score.
  • the data segments correspond to a segment in a database container, such as a column in a table, a JavaScript Object Notation (JSON) field in a collection of documents, and the like.
  • the controller 102 is further configured to determine a classification score based on the number of records and if a match is found, then the controller 102 classifies the data segment based on the classification score. In other words, the controller 102 classifies the data segments that include the plurality of records and determines the sensitive data, which is protected by the implementation of security policies.
  • the controller 102 is configured to find the sensitive data and then classify the data based on the classification score with the reduced number of false detections, such as reduced number of false positives detection and reduced number of false negatives detection. Further, the controller 102 is configured to apply security policies on the sensitive data to protect the sensitive data.
  • FIG. 2 depicts a flowchart of finding data to be protected in a database, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements from FIG. 1.
  • With reference to FIG. 2, there is shown a flowchart 200 that includes a series of operations 202 to 216.
  • the controller 102 is configured to execute the flowchart 200.
  • the controller 102 is configured to find the data to be protected in the database 104.
  • the controller 102 is configured to detect the sensitive data in the database 104.
  • the controller 102 starts the operation to find the sensitive data to be protected in the database 104.
  • the controller 102 connects to a data source and after that obtains a metadata.
  • the controller 102 is configured to receive the data that includes one or more data segments.
  • the controller 102 receives the data with one or more data segments, such as a person’s name, address, and the like through data sampling. The controller 102 receives the data through the database 104 and then finds the sensitive data from the data, and then protects the sensitive data.
  • the controller 102 is further configured to determine, which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification.
  • the data segment is protected by the pattern classification and through pattern classifiers.
  • the data segment is protected by the data classification and through data classifiers.
  • the data segment with the metadata is protected by the metadata classification and through metadata classifiers.
  • the controller 102 is configured to classify the data in the data segment based on a scoring mechanism such as through operation 214. The scoring mechanism enables the controller 102 to classify the data segments that include the sensitive data and also protects the data segments by the security policies.
  • the data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted (and/or) random content.
  • the controller 102 is configured to label the data segment (e.g., name, address, face detection, and the like) to be protected as sensitive such as through operation 216.
  • the controller 102 is further configured to implement the security policies for the sensitive data.
  • the controller 102 further detects the sensitive data, such as the address (e.g., city, state, zip code, and the like) of a person that is located in the database 104 using the address parser.
  • the address of the person helps the controller 102 to set up the security policies for sensitive data protection.
  • the controller 102 parses the data to find the sensitive data by using regular expressions or natural language processing without limiting the scope of the present disclosure. For example, the controller 102 develops the address parsers to automate the delivery to determine if a specific data segment of the database 104 holds the data segment of the addresses or not.
  • the controller 102 uses the address parser to tokenize the text as an address and provides the output of the address tokens such as street, city, state, country, house number, zip code, and the like.
  • the controller 102 further finds the corresponding data segment (e.g., address column, document, and the like) that holds the address, and also classifies the data segment that holds the sensitive data (i.e., address). Finally, the controller 102 increments a counter of the number of addresses found.
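  • The tokenize-and-count step can be illustrated with a minimal sketch. It assumes a toy regex-based tokenizer standing in for a real address parser; the names `tokenize_address`, `looks_like_address`, `count_addresses`, and `MIN_FEATURES`, as well as the patterns and the threshold value, are hypothetical and not taken from the disclosure.

```python
import re

# Hypothetical, deliberately simple patterns; a real address parser would be
# far more robust. Each pattern yields one address feature (token type).
FEATURE_PATTERNS = {
    "house_number": re.compile(r"^\d+[A-Za-z]?\b"),
    "street": re.compile(r"\b\w+ (Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE),
    "zip_code": re.compile(r"\b\d{5}\b"),
    "state": re.compile(r"\b(NY|CA|TX|WA)\b"),
}

MIN_FEATURES = 2  # assumed value for the configurable number


def tokenize_address(text: str) -> dict:
    """Return the address features (token type -> matched text) found in text."""
    features = {}
    for name, pattern in FEATURE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            features[name] = match.group(0)
    return features


def looks_like_address(text: str, min_features: int = MIN_FEATURES) -> bool:
    """Classify a record as an address if it has more than min_features features."""
    return len(tokenize_address(text)) > min_features


def count_addresses(records: list) -> int:
    """Count the records classified as addresses (the 'addresses found' counter)."""
    return sum(1 for record in records if looks_like_address(record))
```

  • In this sketch a record that only contains a five-digit number yields a single feature and is not counted, which is the kind of false positive the feature threshold is meant to suppress.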
  • the controller 102 is configured to classify the data based on the names of a person by searching in a name database for a match for the data. In other words, as the name of the person is classified as the sensitive data, then the controller 102 is configured to detect the sensitive data in the database 104, such as by searching the most used names.
  • the name database includes the most used names in a current region and/or language and if a match is found, then the data segment is labelled as holding a name entry.
  • the database 104 includes a large amount of data that includes a person’s name. Further, some of the records store the most common names such as common last, middle, and first names.
  • the dictionary used to look up the most common names is different for different regions, locations, cultures, and languages.
  • the controller 102 is further configured to determine that a field in the data is of a length in a range (i.e., MIN to MAX) prior to searching the name database for a match for the field. Moreover, the range depends on the current region and/or language. In other words, the controller 102 checks the length of the text for the common names and then parses till the end of the text to check if the text includes the first name, middle name, or the last name (e.g., features) based on the common length of the text. Finally, if the data segment includes the name, then the number of found names is incremented. Beneficially, the detection of the sensitive data as per the length and the feature of the text enables the controller 102 to find the sensitive data (i.e., name of a person) and further protect the sensitive data.
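  • A minimal sketch of the length check followed by the name lookup is given here. The name set, the region key `"en"`, the (MIN, MAX) values, and the function names are all hypothetical; a real name database would hold the most used names for the current region and/or language.

```python
# Hypothetical per-region configuration; a real deployment would load the most
# used first, middle, and last names for the current region and/or language.
NAME_DATABASES = {
    "en": {"james", "mary", "john", "smith", "johnson"},
}
LENGTH_RANGES = {"en": (2, 20)}  # assumed (MIN, MAX) field lengths


def holds_name(field: str, region: str = "en") -> bool:
    """Check the length range first, then look each token up in the name set."""
    min_len, max_len = LENGTH_RANGES[region]
    text = field.strip()
    if not (min_len <= len(text) <= max_len):
        return False  # skip the lookup for fields that are too short or too long
    names = NAME_DATABASES[region]
    return any(token.lower() in names for token in text.split())


def count_names(records: list, region: str = "en") -> int:
    """Count the records in which a name was found."""
    return sum(1 for record in records if holds_name(record, region))
```

  • Checking the length before the lookup keeps long free-text fields that merely mention a common name from being counted as name entries.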
  • the controller 102 is further configured to classify the data based on file type classification by determining that a file in the data is of a protected type and also to determine that the data segment holds a protected file entry.
  • the controller 102 finds the sensitive data included in the sensitive files located in the database 104 by checking the file type.
  • a document stored in the database 104 includes the sensitive data, such as a portable document file (PDF) file that stores a trade contract or a computer aided design (or CAD) drawing file.
  • the controller 102 first checks whether the data segment of the database 104 is in binary form or encoded (e.g., base 64, hexadecimal, and the like), and if the data segment is not in binary form, then the controller 102 converts the data segment into binary form. After that, the controller 102 obtains the first 1024 bytes of the binary form (or binary content) of the data segment and checks the MIME type of the file, and if the file is of a compressed type, then extracts the first bytes. Finally, if the MIME type of the file found is sensitive, then the number of files with sensitive data is incremented; otherwise, it is not. Beneficially, the determination of data in the protected file enables the controller 102 to find the sensitive data included in the protected file.
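  • The file type check can be sketched as follows, under stated assumptions: only base 64 decoding is handled (hexadecimal decoding and extraction of compressed content are omitted), the MIME type is inferred from a small hypothetical magic-number table, and the set of protected types is an illustrative choice rather than the one used by the disclosure.

```python
import base64
import binascii

# Hypothetical magic-number table: leading bytes -> MIME type.
MAGIC_NUMBERS = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x1f\x8b": "application/gzip",
    b"\x89PNG\r\n\x1a\n": "image/png",
}
SENSITIVE_MIME_TYPES = {"application/pdf"}  # assumed set of protected types
HEADER_SIZE = 1024  # the description inspects the first 1024 bytes


def to_binary(value) -> bytes:
    """Return raw bytes, decoding base 64 text when the value is not yet binary."""
    if isinstance(value, bytes):
        return value
    try:
        return base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return value.encode("utf-8")  # not base 64: treat as plain text


def sniff_mime(value):
    """Return the MIME type inferred from the first bytes, or None if unknown."""
    header = to_binary(value)[:HEADER_SIZE]
    for magic, mime in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return mime
    return None


def count_sensitive_files(records) -> int:
    """Count the records whose inferred MIME type is a protected type."""
    return sum(1 for record in records if sniff_mime(record) in SENSITIVE_MIME_TYPES)
```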
  • the controller 102 is further configured to classify the data based on face detection by determining that a file in the data segment is of an image type. In other words, the image of a person's face is classified as the sensitive data, and the controller 102 is configured to detect the images of faces in the database 104 through various face recognition techniques, such as object recognition techniques, MIME type recognition techniques, and the like. Further, the controller 102 is configured to classify the data based on face detection by applying face detection, by utilizing an object detection model on the file, and by determining if a face is detected or not. Moreover, if a face is detected, then the controller 102 is configured to determine that the data segment holds a detected face.
  • the controller 102 checks whether the data segment of the database 104 is in binary form or encoded (e.g., base 64, hexadecimal, and the like). Moreover, if the data segment is not in the binary form, then the controller 102 converts the data segment to the binary form. After that, the controller 102 obtains the MIME type to determine whether images exist in the data segment. In addition, if an image is found in the data segment, then the controller 102 extracts the full content of the field and downloads the image. In an implementation, the controller 102 converts the image to the format known by the object detection model.
  • the controller 102 applies the face object detection model, and if the person’s face is found, then the number of images that include the person’s face is incremented. The detection of the data that includes the person’s face enables the controller 102 to find the sensitive data.
  • the controller 102 is further configured to classify the data utilizing the Shannon entropy.
  • the controller 102 is configured to find the sensitive data that is included in encrypted (or hash) form in the database 104 through Shannon entropy.
  • the encrypted content includes at least passwords (e.g., user passwords), encryption keys, salt keys, access tokens, session identifiers (e.g., session IDs), payment card numbers, encrypted data, or any other form of random data.
  • the encrypted and random data includes high byte distribution that is measured via the Shannon Entropy.
  • the controller 102 is further configured to classify the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment and by determining if the record is of a length that exceeds a size threshold or not. Moreover, if the length does not exceed the size threshold, then the controller 102 accumulates the data from the other records together with the data of the record and determines the entropy on the accumulated data.
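  • the probability of each byte value in the data is calculated, and the file entropy is then given by the formula below:
  • $\text{File Entropy} = -\sum_{i=0}^{255} p_i \log_2 p_i$, where $p_i$ is the probability of byte value $i$ in the data (its number of occurrences divided by the total number of bytes), and the file entropy is between zero (0) and eight (8).
  • if the value of the file entropy is eight, then the file has the highest byte distribution, and if the value of the file entropy is zero, then the file has the lowest byte distribution.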
  • the high entropy values include compressed data and the encrypted data.
  • the controller 102 further checks the content type (or the data header) to reduce false positive by separating the compressed data from the encrypted data.
  • the controller 102 is also configured to detect the sensitive data with high entropy through different parameters, such as the list of field types, length of the field (e.g., minimum length or maximum length), the number of bytes to accumulate for entropy check, an entropy value, and content header (e.g., compressed or encrypted).
  • the minimum field size is 80 MB
  • the maximum field size is 160 MB
  • the number of bytes is 500 bytes
  • the entropy value is 7.2
  • the check for compressed is false
  • the data is encrypted data that includes the sensitive data, as further shown in FIG. 4.
  • the minimum field size is 2048 MB and the maximum field size is 100 MB
  • the number of bytes is 500 bytes
  • the entropy value is 7.2
  • check for compressed is true
  • the data includes compressed data that includes the sensitive data.
  • the controller 102 is configured to sample the data segment (e.g., document, table), and then check for the data encoded in the data segment.
  • the controller 102 transforms the data to the binary form. After that, the controller 102 identifies the data segment (e.g., column, JSON field, XML, and the like) that includes the encrypted data by checking the field type, text field, the length of data, and the data header (compressed or encrypted). In an example, if the data is compressed, then the controller 102 checks if the data segment (or compressed content) is protected with a password or not. Moreover, if the data segment is protected by the password, then the data is the sensitive data, and if the data segment is not protected by the password, then the data segment does not include the sensitive data.
  • the controller 102 concatenates the occurrences and gets the chunks in size greater than or equal to the threshold (or bytes num).
  • the controller 102 further calculates the entropy of the given bytes, and if the entropy is higher than the threshold value (or entropy value), then the controller 102 adds an occurrence of found incidents. Further, the controller 102 counts the number of occurrences and the number of data segments scanned and passes the number of occurrences to the score calculation to get the confidence level of the sensitive data.
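  • The accumulate-then-measure procedure can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the defaults of 500 bytes and 7.2 are taken from the example parameters, and the content header check that separates compressed from encrypted data is left out.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, ranging from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def count_high_entropy_chunks(records, bytes_num=500, entropy_value=7.2):
    """Concatenate records into chunks of at least bytes_num bytes and count
    the chunks whose entropy exceeds entropy_value.

    Returns (occurrences, chunks_scanned); trailing bytes that never reach
    bytes_num are ignored in this sketch.
    """
    occurrences = 0
    chunks_scanned = 0
    buffer = b""
    for record in records:
        buffer += record
        if len(buffer) >= bytes_num:
            chunks_scanned += 1
            if shannon_entropy(buffer) > entropy_value:
                occurrences += 1
            buffer = b""
    return occurrences, chunks_scanned
```

  • Accumulating short records before measuring matters because the entropy of a few bytes is bounded by the logarithm of their count, so a single short token can never reach a threshold such as 7.2 on its own.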
  • the controller 102 is further configured to score the determination of the data to be protected by determining a score based on the weight for finding an entry and a fraction of the number of records where an entry is found out of the number of entries scanned. In other words, the controller 102 classifies the data segments through the scoring mechanism, which enables the controller 102 to differentiate the findings and potential false positives values.
  • the score can be calculated as $\text{score} = w_{name} \cdot (\text{name found}) + w_{content} \cdot (\text{content found ratio})$.
  • the $w_{name}$ corresponds to the weight for finding the name, and the value of $w_{name}$ lies between “0” and “1”.
  • the value of the name found is “1” if the data segment names are used to find the item.
  • the $w_{content}$ corresponds to the weight for finding the name in the data segment, and the values for $w_{content}$ are between “0” and “1”.
  • the content found ratio corresponds to the number of records where content is found divided by the number of records scanned. In an example, if the score is higher than “1”, then the score is set to “1”.
  • Each classifier has different weights without limiting the scope of the present disclosure. The scoring mechanism balances the alert on everything suspicious and avoids false positives with a score from 1 to 100. The results are ordered, so those with higher confidence are promoted.
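  • As a worked illustration of the scoring, the sketch below combines the name weight and the content weight and caps the result at 1. The function name and the default weights of 0.4 and 0.8 are assumptions chosen for the example; the description only states that each weight lies between 0 and 1 and that each classifier has different weights.

```python
def classification_score(name_found: bool, records_with_content: int,
                         records_scanned: int, w_name: float = 0.4,
                         w_content: float = 0.8) -> float:
    """Weighted sum of the name match and the content found ratio, capped at 1."""
    if records_scanned:
        content_ratio = records_with_content / records_scanned
    else:
        content_ratio = 0.0  # nothing scanned, so no content evidence
    score = w_name * (1.0 if name_found else 0.0) + w_content * content_ratio
    return min(score, 1.0)


def rank_findings(findings: dict) -> list:
    """Order segment names so that those with higher confidence come first."""
    return sorted(findings, key=findings.get, reverse=True)
```

  • With these assumed weights, a segment whose name matches and in which half of the scanned records match scores 0.4 + 0.8 × 0.5 = 0.8, while a segment with a matching name and matches in every record would reach 1.2 and is capped at 1.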
  • the controller 102 is further configured to implement the security policies for the protection of the sensitive data such as through operation 218.
  • the controller 102 is further configured to protect the data segment to be protected by applying security policies.
  • the controller 102 is configured to select the security policies to be applied for the data segment based on the type of entries in the data segment.
  • the security policies include different policies, such as access control, auditing, dynamic and static data masking, and the like.
  • the controller 102 is configured to protect the sensitive data using the security policies based on the type of sensitive data found.
  • the finding of sensitive data by the controller 102 and then the classification of data segments based on the scoring mechanism enables the controller 102 to find the sensitive data and perform security policies on the sensitive data to protect the sensitive data.
  • the determination of which sensitive data is to be protected by applying pattern classifiers or advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
  • FIG. 3 depicts a flowchart of finding data that includes an address to be protected in a database, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is described in conjunction with elements from FIGs. 1 and 2.
  • With reference to FIG. 3, there is shown a flowchart 300 that includes a series of operations 302 to 314.
  • the controller 102 is configured to execute the flowchart 300.
  • the controller 102 starts traversing through the data segments that include the plurality of records.
  • the controller 102 is further configured to classify the data in the data segment based on at least a partial address by utilizing an address parser to tokenize the data as an address, providing address features.
  • the controller 102 parses the addresses in the data segments. After that, the controller 102 determines if the number of features is more than a configurable number and, if so, classifies the data segment as including the address.
  • the controller 102 is configured to check the number of address features for every record and then mark the data segments with a rate mechanism.
  • the controller 102 is configured to determine whether the number of entries of a specific class held by the data segment exceeds the configurable number and, if so, to determine that the data segment is to be protected. For example, at operation 306, if the controller 102 determines that the number of features is greater than the configurable number (or ‘c’), then the data segment is classified as an address that is to be protected through security policies, and the controller 102 moves to operation 308. However, if the number of features is less than the configurable number, then the data segment is not protected, and the controller 102 moves to operation 312, where the data segment is classified as not an address.
  • the data segment is represented as “22 October 2018 Monday Evening 12345”, with three features, such as house number is 22, the road is October 2018 Monday evening and postcode is 12345. Moreover, if the configurable number is four, then the data, such as “22 October 2018 Monday Evening 12345”, is parsed by the controller 102, and the data is not classified as an address, because the number of features in the data is three.
  • the controller 102 is further configured to look up the at least partial address in an address register to validate the address. Moreover, if the at least partial address is found in the address register, then the at least partial address is categorized as an address and the data segment is labelled as holding an address entry.
  • the controller 102 also checks the features on the dictionary.
  • the dictionary corresponds to an address registry from which the controller 102 can check if a specific address or address part exists or not. For example, the address registry checks if a certain street exists in a city or the city exists in the country and the like. Moreover, based on the features found in the dictionary, the controller 102 classifies the data segment as an address and moves to operation 310.
  • However, if the features are not found in the dictionary, then the data segment is classified as not an address, such as through operation 314.
  • the data segment is represented as “22 October 2018 Monday Evening New York 12345”, with five features, such as house_number is 22, the road is October 2018 Monday, the city is evening, the state is New York, and postcode is 12345.
  • if features such as the address, the city, and the road are not found in the dictionary, then the data “22 October 2018 Monday Evening New York 12345” is classified as not an address such as through operation 314.
  • the data segment is represented as “77 Nassau av. Plainview NY 11803”.
  • the controller 102 parses the data and finds five features, with house number 77, road Nassau Avenue, city Plainview, state NY, and pin code 11803, that match the features in the dictionary. Thus, the data segment is termed a valid address. Finally, if the data segment is classified as an address, then at operation 310, the address counter that counts the number of addresses in the data segments is incremented. Beneficially, the data segment’s classification based on the number of features by the scoring mechanism enables the controller 102 to reduce the number of false negatives and false positives.
  • operations 308 and 314 are optional and depend on the availability of the dictionary. Thus, if the dictionary is not available, then the controller 102 executes operation 310 directly after operation 306.
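The decision flow of FIG. 3 can be illustrated with the following Python sketch. It assumes that an address parser has already tokenized a record into a mapping of feature names to values; the function name, the parameter names, and the treatment of a feature count equal to the configurable number as "not an address" are assumptions, since the text only covers the greater-than and less-than cases.

```python
from typing import Callable, Mapping, Optional

def is_address(features: Mapping[str, str],
               configurable_number: int,
               registry_lookup: Optional[Callable[[Mapping[str, str]], bool]] = None) -> bool:
    """Hypothetical sketch of operations 306 to 314 for one parsed record."""
    # Operation 306: require more features than the configurable number.
    if len(features) <= configurable_number:
        return False
    # Operations 308/314 are optional: without a dictionary, accept directly.
    if registry_lookup is None:
        return True
    # Operation 308: validate the parsed features against the address registry.
    return registry_lookup(features)

# Toy registry (assumption): a set of known (city, state) pairs.
KNOWN_PLACES = {("plainview", "ny")}

def toy_lookup(features: Mapping[str, str]) -> bool:
    key = (features.get("city", "").lower(), features.get("state", "").lower())
    return key in KNOWN_PLACES
```

With a configurable number of four, the three-feature example from the text is rejected at the feature count, and the five-feature "Evening, New York" example is rejected only when a registry lookup is supplied.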
  • FIG. 4 depicts a graphical representation that illustrates an entropy of different types of data to further detect the sensitive data, in accordance with an embodiment of the present disclosure.
  • FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3.
  • With reference to FIG. 4, there is shown a graphical representation 400 that includes an X-axis 402 that illustrates a number of bytes and a Y-axis 404 that illustrates the value of entropy.
  • a first line 406, a second line 408, a third line 410, a fourth line 412, a fifth line 414, and a sixth line 416 collectively, illustrate the different types of data and their corresponding entropy.
  • the first line 406 illustrates the data stored in a Portable Document Format (PDF) file.
  • the second line 408 illustrates the data stored in a text file (or txt).
  • the third line 410 illustrates the data stored in hashed form (or random passwords of eight characters hashed).
  • the fourth line 412 illustrates the data stored in the encrypted cc form (or Advanced Encryption Standard (AES) encryption of 16 random digits, each producing 89 bytes, concatenated together).
  • the fifth line 414 illustrates the data stored in the encrypted form (or AES encryption of a large random string).
  • the sixth line 416 illustrates the data stored in the random form.
  • the value of the Shannon entropy reaches a high value of 7.2 with just 500 bytes, such as shown through the third line 410 (i.e., hashed form), the fourth line 412 (i.e., encrypted), and the sixth line 416 (i.e., random form).
  • the data with high Shannon entropy of random, hash, and encrypted data enables the controller 102 to find the sensitive data that can be protected.
  • the controller 102 is further configured to classify the data utilizing the Shannon entropy.
  • the Shannon entropy enables the controller 102 to find the sensitive data in different data types.
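For reference, the Shannon entropy of a byte sequence, measured in bits per byte, can be computed as in the following Python sketch. The function name is hypothetical; the formula is the standard one, H = -Σ p_i log2(p_i) over the relative frequency p_i of each byte value.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    # Counter over bytes counts how often each byte value (0..255) occurs.
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())
```

A sequence of identical bytes yields an entropy of zero, while a sequence in which all 256 byte values occur equally often yields the maximum of eight bits per byte.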
  • FIG. 5 depicts a flowchart of finding a data segment that includes a semi-structure to be protected in a database, in accordance with an embodiment of the present disclosure.
  • FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4.
  • With reference to FIG. 5, there is shown a flowchart 500 that includes a series of operations 502 to 522.
  • the controller 102 is configured to execute the flowchart 500.
  • the controller 102 starts the operation for finding the sensitive data to be protected in the database 104.
  • the controller 102 connects to the data source and after that obtains the metadata.
  • the controller 102 is configured to receive the data that includes one or more data segments. For example, at operation 506, the controller 102 receives the data with the one or more data segments (e.g., a person’s name, address, and the like), and the controller 102 performs the data sampling. After that, at operation 508, the controller 102 is further configured to check whether the data found from the data sampling is a semi-structure (or semi-structured data).
  • the semi-structure data corresponds to a data structure such as key-value stores, JSON, XML, and arrays.
  • the controller 102 implements the operation 522 that includes semi-structure handling.
  • the controller 102 is further configured to determine, which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification. For example, at operation 510, the data segment is protected by the pattern classification through pattern classifiers, and at operation 512, the data segment is protected by the data classification through data classifiers.
  • the controller 102 is configured to classify the data in a data segment based on the scoring mechanism such as through operation 516.
  • the data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
  • the controller 102 is configured to label the sensitive data segment (e.g., name, address, face detection, and the like) to be protected as sensitive such as through operation 518.
  • the controller 102 is further configured to implement the security policies for the protection of the sensitive data such as through operation 520.
  • FIG. 6 depicts a flowchart of finding that data stored in a semi-structure is sensitive data, in accordance with another embodiment of the present disclosure.
  • FIG. 6 is described in conjunction with elements from FIGs. 1, 2, 3, 4, and 5.
  • With reference to FIG. 6, there is shown a flowchart 600 that includes a series of operations 602 to 612.
  • the controller 102 is configured to execute the flowchart 600.
  • the controller 102 is further configured to determine if the data segment includes the semi-structure or not such as through operation 602. Moreover, if the controller 102 determines that the data segment includes the semi-structure, then the controller 102 further determines the name of the semi-structure such as through operation 604. After that, the controller 102 compares the name with the name classifiers such as through operation 610, and if a match is found during the comparison, then the controller 102 determines whether that data segment holds an entry. However, if the match is not found during the comparison, then the controller 102 further determines if the content of the semi-structure is a final node or not such as through operation 606.
  • if the semi-structure is the final node, then the controller 102 determines that the data segment holds the entry, and the controller 102 further checks and counts the content with the content classifiers such as through operation 612. However, if the semi-structure is not the final node, then the controller 102 duplicates each node’s child (i.e., child node) such as through operation 608. After that, the controller 102 determines if the semi-structure includes the name through the operation 604. Beneficially, the determination of the semi-structure enables the controller 102 to find the sensitive data in the data segment.
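The traversal of FIG. 6 can be sketched as a recursive walk over a parsed JSON-like value. The sketch below is illustrative only: the function name, the representation of classifiers as compiled regular expressions, and the choice to stop descending once a name matches are assumptions.

```python
import re
from typing import Any, Dict, Optional, Pattern

def scan_semi_structure(node: Any,
                        name: Optional[str],
                        name_classifiers: Dict[str, Pattern[str]],
                        content_classifiers: Dict[str, Pattern[str]],
                        findings: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical walk: check the node name first, then the content or the children."""
    # Operations 604/610: compare the node name with the name classifiers.
    if name is not None:
        for label, pattern in name_classifiers.items():
            if pattern.search(name):
                findings[label] = findings.get(label, 0) + 1
                return findings
    # Operations 606/608: not a final node, so visit each child node.
    if isinstance(node, dict):
        for child_name, child in node.items():
            scan_semi_structure(child, child_name, name_classifiers,
                                content_classifiers, findings)
    elif isinstance(node, list):
        for child in node:
            # Array items carry no name of their own, so only content is checked.
            scan_semi_structure(child, None, name_classifiers,
                                content_classifiers, findings)
    else:
        # Operation 612: final node, check and count with the content classifiers.
        text = str(node)
        for label, pattern in content_classifiers.items():
            if pattern.search(text):
                findings[label] = findings.get(label, 0) + 1
    return findings
```

Calling the function with `name=None` and an empty `findings` dictionary on the root of a parsed document returns the count of matches per classifier label.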
  • FIG. 7 depicts a flowchart of applying security policies to protect data segments, in accordance with an embodiment of the present disclosure.
  • FIG. 7 is described in conjunction with elements from FIGs. 1, 2, 3, 4, 5, and 6. With reference to FIG. 7, there is shown a flowchart 700 that includes a series of operations 702 to 714.
  • the controller 102 is configured to execute the flowchart 700.
  • the controller 102 is configured to protect the data segment to be protected by applying security policies.
  • the controller 102 is further configured to select the security policies to be applied for the data segment based on the type of data in the data segment (e.g., credit card, person’s name, password and the like).
  • the controller 102 is configured to protect the sensitive data using the security policies (e.g., access control, auditing, dynamic and static data masking, and the like) based on the type of sensitive data found.
  • the controller 102 starts the sensitive data discovery process to find the sensitive data. After the discovery of the sensitive data, the controller 102 provides sensitive data discovery alerts with different labels (e.g., names, addresses, images, and the like), such as through operation 704.
  • the sensitive data discovery alert corresponds to the labels with the type of the sensitive data.
  • each sensitive data discovery alert on the data segment is labelled with a label (e.g., a credit card, address and the like).
  • the controller 102 implements a policy matcher to check the type of security policy to be applied to the corresponding data segment.
  • the payment policy domain is applied for payment card-related data segments.
  • a personal data policy is applied for personal data (e.g., name, address, and the like).
  • the security policies are arranged in different domains, like payment (e.g., PCI compliance), personal data protection, and the like to protect the sensitive data.
  • the policy matcher decides the application of policies through operation 708 by the discovery of alerts with labels and the policies through operation 710.
  • the policy matcher links the labels on the sensitive data alerts with the policy repositories and offers the policies that match the sensitive data found in the database 104.
  • the controller 102 not only finds the sensitive data in the database 104 but also offers the right protection, which depends on the type of the sensitive data found.
  • the correct implementation of security policies on the sensitive data segments provides a comprehensive security solution using the pre-defined security policies.
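The policy matcher of FIG. 7 can be pictured as a lookup from alert labels to policy domains. The following Python sketch is illustrative: the repository contents, the domain and label names, and the function name are all hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical policy repository: each domain lists the labels it covers
# and the security policies it offers.
POLICY_REPOSITORY: Dict[str, Dict[str, List[str]]] = {
    "payment": {
        "labels": ["credit_card"],
        "policies": ["access_control", "auditing", "static_data_masking"],
    },
    "personal_data": {
        "labels": ["name", "address"],
        "policies": ["access_control", "dynamic_data_masking"],
    },
}

def match_policies(alerts: Iterable[Tuple[str, str]],
                   repository: Dict[str, Dict[str, List[str]]] = POLICY_REPOSITORY
                   ) -> Dict[str, List[str]]:
    """Map each alerted data segment to the policies of every domain matching its label."""
    offered: Dict[str, List[str]] = {}
    for segment, label in alerts:
        for entry in repository.values():
            if label in entry["labels"]:
                policies = offered.setdefault(segment, [])
                # Keep the order of the repository and avoid duplicates.
                for policy in entry["policies"]:
                    if policy not in policies:
                        policies.append(policy)
    return offered
```

Segments whose label matches no domain are simply absent from the result, which leaves the decision about unlabelled data to the caller.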
  • FIG. 8 is a flowchart of a method of finding data to be protected in a database, in accordance with an embodiment of the present disclosure.
  • FIG. 8 is described in conjunction with elements from FIGs. 1 to 7.
  • With reference to FIG. 8, there is shown a flowchart of method 800 that includes steps 802 to 808.
  • the controller 102 of the FIG. 1 is configured to execute the method 800.
  • the method 800 of finding the data to be protected in the database 104 is configured to detect the sensitive data in the database 104.
  • the method 800 provides an improved sensitive data discovery and protection system by applying advanced techniques to identify the data as sensitive.
  • the data segments include a plurality of records
  • the controller 102 is further configured to classify the data in the data segment by classifying the data in each data segment based on the classification score.
  • the data segments correspond to a segment in a database container, such as a column in a table, a JSON field in a collection of documents, and the like.
  • the method 800 includes determining a classification score based on the number of records and on whether a match is found, and then classifying the data segment based on the classification score.
  • the method 800 includes classifying the data segments that include the plurality of records and determining the sensitive data, which is protected by the implementation of security policies.
  • the sensitive data corresponds to information that is protected against unwarranted disclosure whose protection is required for legal or ethical reasons such as for issues pertaining to personal privacy, or for proprietary considerations.
  • the method 800 comprises receiving of the data that includes one or more data segments.
  • the method 800 includes receiving the data with one or more data segments such as a person’s name, address, and the like through data sampling.
  • the method 800 includes receiving the data through the database 104 and then finding the sensitive data from the data.
  • the method 800 further comprises determining which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification.
  • the data segment is protected by the pattern classification and through pattern classifiers.
  • the data segment is protected by the data classification and through data classifiers.
  • the data segment with the metadata is protected by the metadata classification and through metadata classifiers.
  • the method 800 includes classifying the data in the data segment based on a scoring mechanism. The scoring mechanism is used to classify the data segments that include the sensitive data and also protects the data segments by the security policies.
  • the data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted (and/or) random content.
  • the method 800 further comprises labelling the data segment (e.g., name, address, face detection, and the like) to be protected as sensitive.
  • the method 800 provides protecting the data segment by applying security policies.
  • the controller 102 detects the sensitive data such as the address (e.g., city, state, zip code, and the like) of a person that is located in the database 104 using the address parser. The address of the person is used to set up the security policies for sensitive data protection.
  • the controller 102 is used to parse the data to find the sensitive data by using regular expressions or natural language processing without limiting the scope of the present disclosure.
  • the method 800 further comprises classifying the data in the data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
  • the method 800 further comprises classifying the data based on the names of a person by searching in a name database for a match for the data.
  • the method 800 includes detecting the sensitive data in the database 104 by searching the most used names.
  • the name database includes the most used names in a current region and/or language and if a match is found, then the data segment is labelled as holding a name entry.
  • the database 104 includes a large amount of data that includes a person’s name.
  • some of the records store the most common names such as common last, middle, and first names. Therefore, it is enough to look into a dictionary with the most common last, middle, and first names to detect that the data holds the name entry or not.
  • the dictionary used to look up the most common names is different for different regions, locations, cultures, and languages.
  • the method 800 further comprises determining that a field in the data is of a length in a range (i.e., MIN to MAX) prior to searching the name database for a match for the field. Moreover, the range depends on the current region and/or language.
  • the method 800 includes checking the length of the text for the common names and then parsing till the end of the text to check if the text includes the first name, middle name, or the last name (e.g., features) based on the common length of the text. Finally, if the data segment includes the name, then the number of found names is incremented.
  • the detection of the sensitive data as per the length and the feature of the text is used to find the sensitive data (i.e., name of a person) and further protect the sensitive data.
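The name check described above can be sketched in Python as follows. The function name, the whitespace tokenization, and the sample dictionary are assumptions for illustration; a real deployment would load a region-specific and language-specific list of common names.

```python
from typing import Iterable, Set

def count_names(records: Iterable[str],
                name_dictionary: Set[str],
                min_len: int,
                max_len: int) -> int:
    """Count records whose length is within [min_len, max_len] and that contain a known name."""
    found = 0
    for text in records:
        # Length filter first, so the dictionary is searched only for plausible fields.
        if not (min_len <= len(text) <= max_len):
            continue
        # Parse to the end of the text, token by token (first, middle, last name).
        tokens = (token.strip(".,").lower() for token in text.split())
        if any(token in name_dictionary for token in tokens):
            found += 1
    return found
```

The returned count can then be divided by the number of records scanned to obtain the content found ratio used by the scoring mechanism.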
  • the method 800 further comprises classifying the data based on file type classification by determining that a file in the data is of a protected type and determining that the data segment holds a protected file entry.
  • the method 800 includes finding the sensitive data included in the sensitive files located in the database 104 by checking the file type.
  • a document stored in the database 104 includes the sensitive data, such as a Portable Document Format (PDF) file that stores a trade contract or an AutoCAD (computer-aided design, CAD) drawing file.
  • the method 800 includes checking, through a base (e.g., base 64, Hexa, and the like), whether the data segment is in the binary form, and if the data segment is not in the binary form, then converting the data segment into the binary form. After that, the method 800 includes obtaining the first 1024 bytes of the binary form (or binary content) of the data segment and checking a mime type of a file, and if the file is of a compressed type, then extracting the first bytes. Finally, if the mime type of the file found is sensitive, then the number of files with sensitive data is incremented. Beneficially, the determination of data in the protected file is used to find the sensitive data included in the protected file.
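A simplified version of this file type check is sketched below in Python using only the standard library. The signature table, the function names, and the order in which encodings are tried are assumptions; the sketch also does not unpack compressed files, and a production system would more likely rely on a dedicated mime detection library.

```python
import base64
from typing import Iterable, Optional, Union

# Hypothetical table of leading byte signatures for protected file types.
SENSITIVE_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"AC10": "image/vnd.dwg",  # common DWG version prefix (assumption)
}

def to_binary(value: Union[str, bytes]) -> bytes:
    """Convert a text field to bytes, trying hexadecimal first and then base 64."""
    if isinstance(value, bytes):
        return value
    try:
        return bytes.fromhex(value)
    except ValueError:
        pass
    try:
        return base64.b64decode(value, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return value.encode("utf-8")

def detect_sensitive_mime(data: bytes) -> Optional[str]:
    """Inspect the first 1024 bytes and return a sensitive mime type, if any."""
    head = data[:1024]
    for signature, mime in SENSITIVE_SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None

def count_sensitive_files(values: Iterable[Union[str, bytes]]) -> int:
    """Count the fields whose decoded content starts with a protected signature."""
    return sum(1 for value in values
               if detect_sensitive_mime(to_binary(value)) is not None)
```

Hexadecimal is tried before base 64 because every hexadecimal digit is also a valid base 64 character, so the reverse order could decode some hexadecimal strings incorrectly.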
  • the method 800 further comprises classifying the data based on face detection by determining that a file (or binary field) in the data segment is of an image type.
  • the method 800 includes detecting the image of faces in the database 104 through various face recognition techniques, such as via object recognition techniques and mime-type recognition techniques, and the like.
  • the method 800 further includes classifying the data based on face detection by applying face detection, by utilizing an object detection model in the file, and by determining if a face is detected and, if so, determining that the data segment holds a detected face.
  • the method 800 includes checking whether the data segment is in a binary form (i.e., through the base of base 64, Hexa, and the like). Moreover, if the data segment is not in the binary form, then the method 800 includes converting the data segment to the binary form. After that, the controller 102 obtains the mime type to know the existence of images in the data segment. In addition, if an image is found in the data segment, then the method 800 includes extracting the full content of the field and downloading the image. In an implementation, the controller 102 converts the image to the format known by the object detection model. Finally, the method 800 includes applying the face object detection model, and if the person’s face is found, then the number of images that include the person’s face is incremented. The detection of the data that includes the person’s face is used to find the sensitive data.
  • the method 800 further comprises classifying the data utilizing the Shannon entropy.
  • the method 800 includes finding the sensitive data that is included in encrypted (or hash) form in the database 104 through Shannon entropy.
  • the encrypted content includes at least passwords (e.g., user passwords), encryption keys, salt keys, access tokens, session identifiers (e.g., session IDs), payment card numbers, encrypted data or any other form of random data.
  • the encrypted and random data includes high byte distribution that is measured via the Shannon Entropy.
  • the method 800 further comprises classifying the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment and determining whether the length of the record exceeds a size threshold. Moreover, if the length does not exceed the size threshold, then the method 800 includes accumulating the data from other records together with the data of the record and determining the entropy of the accumulated data.
  • the file includes the highest bytes distribution (i.e., the data is random data) and if the value of the file entropy is zero, then the file has the lowest byte distribution (i.e., the data is non-random data).
  • data with high entropy values includes compressed data and encrypted data.
  • the method 800 includes checking the content type (or the data header) to reduce false positives values by separating the compressed data from the encrypted data.
  • the method 800 further includes detecting the sensitive data with high entropy through different parameters such as the list of field types (or field types), length of the field (e.g., minimum length or maximum length), the number of bytes to accumulate for entropy check, entropy value, and content header (e.g., compressed or any other form of data).
  • the method 800 further includes sampling the data segment (e.g., document, table) and then checking for the data encoded in the data segment. Further, if the data in the data segment is encoded in base 64 (or Hexa, or any other encoded form), then the method 800 includes transforming the data to the binary form.
  • the method 800 includes identifying the data segment (e.g., column, JSON field, XML, and the like) that includes the encrypted data or the hashed data by checking the field type, text field, the length of data, and the data header (compressed or encrypted). In an example, if the field length is less than the threshold value (or bytes num), then the method 800 includes concatenating the occurrences to obtain chunks of a size greater than or equal to the threshold (or bytes num). The method 800 further includes calculating the entropy of the given bytes, and if the entropy is higher than the threshold value (or entropy value), then adding an occurrence of found incidents. Further, the method 800 includes counting the number of occurrences and the number of data segments scanned and passing the number of occurrences to the score calculation to get the confidence level of the sensitive data.
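The accumulation of short fields into chunks before the entropy check can be sketched as follows. The names `bytes_num` and `entropy_value` mirror the thresholds named in the text; the function names, the handling of fields that already meet the size threshold, and the decision to ignore a final incomplete chunk are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def count_high_entropy(fields: Iterable[bytes],
                       bytes_num: int,
                       entropy_value: float) -> Tuple[int, int]:
    """Return (occurrences above the entropy threshold, chunks checked)."""
    occurrences = 0
    chunks_checked = 0
    pending = bytearray()
    for field in fields:
        if len(field) >= bytes_num:
            # Long enough on its own: evaluate the field directly.
            chunk = bytes(field)
        else:
            # Too short: concatenate with other short fields until the
            # accumulated chunk reaches the size threshold.
            pending.extend(field)
            if len(pending) < bytes_num:
                continue
            chunk = bytes(pending)
            pending.clear()
        chunks_checked += 1
        if shannon_entropy(chunk) > entropy_value:
            occurrences += 1
    # Any leftover bytes shorter than bytes_num are ignored (assumption).
    return occurrences, chunks_checked
```

The pair returned by `count_high_entropy` corresponds to the number of occurrences and the number of chunks scanned, which can then be passed to the score calculation.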
  • the method 800 further includes scoring the determination of the data to be protected by determining a score based on the weight for finding an entry and a fraction of the number of records where an entry is found out of the number of entries scanned.
  • the method 800 includes classifying the data segments through the scoring mechanism, which is used to differentiate the findings from potential false positives.
  • the scoring mechanism balances alerting on everything suspicious against avoiding false positives, with a score from 1 to 100. The results are ordered, so those with higher confidence are promoted.
  • the method 800 further comprises protecting the data segment by applying security policies.
  • the method 800 further comprises protecting the data segment to be protected by applying security policies.
  • the method 800 includes selecting the security policies to be applied for the data segment based on the type of entries in the data segment.
  • the security policies include different policies, such as access control, auditing, dynamic and static data masking, and the like.
  • the method 800 includes protecting the sensitive data using the security policies based on the type of sensitive data found.
  • the finding of sensitive data by the method 800 and then the classification of data segments based on the scoring mechanism is used to find the sensitive data and perform security policies on the sensitive data to protect the sensitive data.
  • the determination of which sensitive data is to be protected by applying pattern classifiers or advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
  • the method 800 further comprises classifying the data in the data segment based on the at least partial address by utilizing an address parser to tokenize the data as an address, providing address features. For example, the method 800 provides the controller 102 that parses the addresses in the data segments. After that, the controller 102 determines if the number of features is more than a configurable number and, if so, classifies the data segment as including the address. In an implementation, the method 800 further includes determining whether the number of entries of a specific class held by the data segment exceeds the configurable number and, if so, determining that the data segment is to be protected.
  • the method 800 includes determining whether the number of features is greater than the configurable number (i.e., c), and if so, the data segment is classified as an address that is to be protected through security policies. However, if the number of features is less than the configurable number, then the data segment is classified as not an address and thus is not protected.
  • the method 800 further comprises performing a look-up of the at least partial address in an address register to validate the address. Moreover, if the at least partial address is found in the address register, then the at least partial address is categorized as an address and the data segment is labelled as holding an address entry.
  • the method 800 includes checking the features in the dictionary and, based on the features found in the dictionary, classifying the data segment as an address. However, if the features are not found in the dictionary, then the data segment is classified as not an address. Finally, if the data segment is classified as an address, then the address counter that counts the number of addresses in the data segments is incremented. Beneficially, the data segment’s classification based on the number of features by the scoring mechanism reduces the number of false negatives and false positives.
  • the method 800 further comprises determining if the data segment includes the semi-structure or not. Moreover, if the data segment includes the semi-structure, then the method 800 includes determining the name of the semi-structure. After that, the controller 102 compares the name with the name classifiers, and if a match is found during the comparison, then the controller 102 determines whether that data segment holds an entry. However, if the match is not found during the comparison, then the controller 102 further determines if the content of the semi-structure is a final node or not. Moreover, if the semi-structure is the final node, then the method 800 includes determining that the data segment holds the entry, and checking (and counting) the content with the content classifiers.
  • However, if the semi-structure is not the final node, then the method 800 includes duplicating each node’s child (i.e., child node) and, after that, determining if the semi-structure includes the name. Beneficially, the determination of the semi-structure is used to find the sensitive data in the data segment.
  • the finding of sensitive data by the method 800 and then the classification of data segments based on the scoring mechanism is used to find the sensitive data and perform security policies on the sensitive data to protect the sensitive data. Further, the determination of which sensitive data is to be protected by applying pattern classifiers or advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
  • steps 802 to 808 are only illustrative, and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.

Abstract

A controller for finding data to be protected in a database is configured to receive data that includes one or more data segments and to determine which data segments to protect by applying pattern classification, metadata classification, and data classification. Further, the controller labels the data segment to be protected as sensitive and protects the data segment by applying security policies. The controller is configured to classify the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content. Finding the sensitive data and then classifying it enables the controller to apply security policies to the sensitive data to protect it. As a result, the controller reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.

Description

CONTROLLER AND METHOD FOR FINDING DATA TO BE PROTECTED IN DATABASE
TECHNICAL FIELD
The present disclosure relates generally to the field of database security and compliance and, more specifically, to a controller for finding data to be protected in a database, and a method for finding data to be protected in the database.
BACKGROUND
Generally, a database is used by different organizations to maintain critical data (or assets), such as data related to multiple subjects, customers and potential customers, and the like. Moreover, data privacy is one of the major issues in data handling, data sharing, and data protection. Conventionally, various laws and regulations provide the fundamental restrictions on collecting and processing personally identifiable information such as sensitive data. Generally, the sensitive data corresponds to a piece of information that is protected against unwarranted disclosure, such as personal information, financial data, intellectual property information, and the like. However, before the sensitive data can be protected, sensitive data discovery plays a major role in the implementation of data privacy, as it helps to identify the sensitive data within a huge collection of data. In other words, extraction of information about specific sensitive data is complicated, time-consuming, and sometimes requires manual effort. Moreover, data owners are sometimes not aware of the type of the sensitive data maintained in the database.
Currently, the conventional approaches for scanning the database for sensitive data discovery include the implementation of various patterns based on regular expressions to check the database with respect to object names and also to perform data sampling tests based on those regular expressions. The existing approach also uses a checksum function for scanning the databases. With the conventional approaches, the sensitive data is identified and protected by applying the security policies, but the conventional approaches produce many false positives and false negatives, which adversely affects the protection of the sensitive data. Moreover, the conventional approaches are not well suited to preventing false positives and false negatives. As a result, there exists a technical problem of how to detect sensitive data and reduce the number of false detections, such as false positives and false negatives.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the existing approaches.
SUMMARY
The present disclosure provides a controller for finding data to be protected in a database and a method for finding data to be protected in the database. The present disclosure provides a solution to the existing problem of how to detect sensitive data and reduce the number of false detections. An objective of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in the prior art and provides an improved controller for finding data to be protected in a database and an improved method for finding data to be protected in the database, for example, for sensitive data discovery and classification in databases.
One or more objectives of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In one aspect, the present disclosure provides a controller for finding data to be protected in a database. The controller is configured to receive data comprising one or more data segments and determine which data segments to protect by applying pattern classification, metadata classification, and data classification. The controller is configured to label the data segment to be protected as sensitive and protect the data segment by applying security policies. The controller is configured to classify the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
Finding the sensitive data and then classifying the data segments based on the scoring mechanism enables the controller to find the sensitive data and also to apply security policies that protect the sensitive data. Further, the determination of which sensitive data is to be protected by applying pattern classifiers, advanced data classifiers, or metadata classifiers reduces the number of false positive and false negative detections, which further provides a more accurate classification of the sensitive data.
In a further implementation form, the controller is further configured to classify the data in a data segment based on at least a partial address by utilizing an address parser to tokenize the data as an address, providing address features. The controller is further configured to determine if the number of features is more than a configurable number and, if so, classify the data segment as comprising an address.
In this implementation, the classification of data based on the partial address enables the controller to find the address data segment in the database.
In a further implementation form, the controller is further configured to look up the at least partial address in an address register to validate the address, and if it is found in the address register, the at least partial address is categorized as an address and the data segment is labelled as holding an address entity.
In this implementation, the controller can reduce the number of false positive and false negative detections through validation of the address.
In a further implementation form, the controller is further configured to classify the data based on names of persons by searching in a name database for a match for the data. The name database comprises the most used names in a current region and/or language, and if a match is found, the data segment is labelled as holding a name entry.
Classifying the data based on names of persons enables the controller to find the sensitive data that includes the name of a person.
In a further implementation, the controller is further configured to determine that a field in the data is of a length in a range (MIN to MAX) prior to searching the name database for a match for the field. The range depends on the current region and/or language.
In this implementation, restricting the search of the name database to fields whose length is in the range enables the controller to reduce false positive and false negative detections.
In a further implementation, the controller is further configured to classify the data based on file type classification by determining that a file in the data is of a protected type and determining that the data segment holds a protected file entry.
In this implementation, the classification of the data based on the file type enables the controller to find the sensitive data that is included in the files.
In a further implementation, the controller is further configured to classify the data based on face detection by determining that a file in the data segment is of an image type. The controller is further configured for applying face detection utilizing an object detection model to the file and determining if a face is detected and if so determining that the data segment holds a detected face entry.
Beneficially, the classification of sensitive data based on the face detection enables the controller to find the sensitive data that includes the images of the person.
In a further implementation, the controller is further configured to classify the data utilizing the Shannon entropy.
The Shannon entropy enables the controller to find the sensitive data in different data types.
In a further implementation, the controller is further configured to determine if the data segment comprises a semi-structure, and if so, to determine a name of the semi-structure. The controller is further configured to compare the name to name classifiers and, if a match is found, to determine that the data segment holds an entry. If no match is found, the controller further determines if the content of the semi-structure is a final node, and if so, determines that the data segment holds an entry.
In this implementation, the determination of data segments that include the semi-structure enables the controller to determine if the data segment holds the sensitive data or not.
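The semi-structure check described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `NAME_CLASSIFIERS` set, the `holds_entry` function, and the `content_classifier` callback are hypothetical names, and the sketch assumes that the content of a final node is handed to a data classifier and that non-final nodes are handled by visiting each child node.

```python
# Hypothetical name classifiers; a real deployment would configure its own list.
NAME_CLASSIFIERS = {"ssn", "email", "address", "password"}


def holds_entry(name, node, content_classifier):
    """Return True if the semi-structure `node`, known by `name`, holds an entry."""
    # Step 1: compare the name of the semi-structure to the name classifiers.
    if name.lower() in NAME_CLASSIFIERS:
        return True
    # Step 2: a final node (neither dict nor list) is classified by its content.
    if not isinstance(node, (dict, list)):
        return content_classifier(node)
    # Step 3: otherwise visit every child node; list items keep the parent's name.
    if isinstance(node, dict):
        children = node.items()
    else:
        children = ((name, child) for child in node)
    return any(holds_entry(n, c, content_classifier) for n, c in children)
```

For instance, a JSON document with a nested field named `email` would be reported as holding an entry on the name match alone, without its content being inspected.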
In a further implementation, the controller is further configured to score the determination of the data to be protected by determining a score based on a weight for finding an entry and a fraction of the number of records where an entry was found out of the number of entries scanned.
In this implementation, the determination of the data to be protected by the score enables the controller to determine the number of entries scanned.
In a further implementation, the controller is further configured to protect the data segment to be protected by applying security policies. The controller is further configured to select the security policies to be applied for a data segment, based on the type of entries in the data segment.
The protection of data segments by applying the security policies enables the controller to implement security policies for the protection of sensitive data.
In a further implementation, the controller is further configured to determine if a segment of the data holds a number of entries of a specific class exceeding a configurable number and if so, determine that the segment is to be protected.
Beneficially, the controller determines the number of entries of a specific class that further enables the controller to protect that data segment.
In another aspect, the present disclosure provides a method for finding data to be protected in a database, wherein the method comprises receiving data comprising one or more data segments. The method further comprises determining which data segments to protect by applying pattern classification, classifying metadata, and classifying data. The method further comprises labelling the data segment to be protected as sensitive and protecting the data segment by applying security policies. The method further comprises classifying the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
The method achieves all the advantages and technical effects of the controller of the present disclosure.
It is to be appreciated that all the aforementioned implementation forms can be combined. It has to be noted that all devices, elements, circuitry, units, and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity that performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features, and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a block diagram that depicts a controller to be used for finding data in a database, in accordance with another embodiment of the present disclosure;
FIG. 2 depicts a flowchart of finding data to be protected in a database, in accordance with an embodiment of the present disclosure;
FIG. 3 depicts a flowchart of finding data that includes an address to be protected in a database, in accordance with an embodiment of the present disclosure;
FIG. 4 depicts a graphical representation that illustrates the entropy of different types of data to further detect the sensitive data, in accordance with an embodiment of the present disclosure;
FIG. 5 depicts a flowchart of finding a data segment that includes a semi-structure to be protected in a database, in accordance with another embodiment of the present disclosure;
FIG. 6 depicts a flowchart of finding that data stored in a semi-structure is sensitive data, in accordance with another embodiment of the present disclosure;
FIG. 7 depicts a flowchart of applying security policies to protect data segments, in accordance with an embodiment of the present disclosure; and
FIG. 8 depicts a flowchart of a method of finding data to be protected in a database, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
FIG. 1 is a block diagram that depicts a controller to be used for finding data in a database, in accordance with another embodiment of the present disclosure. With reference to FIG. 1, there is shown a block diagram 100 that includes a controller 102 and a database 104.
The controller 102 includes suitable logic, circuitry, interfaces, and/or code that is configured to find sensitive data in the database 104. Examples of implementation of the controller 102 may include but are not limited to a central data processing device, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or circuitry.
The database 104 stores information corresponding to the plurality of data segments that can be further classified as the sensitive data. In an example, a communication interface 106 may be used to communicate with the components like the database 104, the controller 102, and the memory 108. In an example, the controller 102 is also in communication with a memory 108 that is configured to hold the data received by the controller 102. Examples of implementation of the memory 108 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Dynamic Random-Access Memory (DRAM), Random Access Memory (RAM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), and/or CPU cache memory.
There is provided the controller 102 that is configured to find the data to be protected in the database 104. The controller 102 receives the data including one or more data segments, and further determines which data segment is to be protected in the database 104. The controller 102 is further configured to apply various classifications, such as pattern classification, metadata classification, and data classification, to determine which data segments are to be protected. The controller 102 further labels the data segment to be protected as sensitive and also labels the type of data segment (e.g., name, address, credit card, and the like). The controller 102 further protects the data segment by applying the security policies as further shown and described in detail in FIG. 2. In other words, the controller 102 provides an improved sensitive data discovery and protection system by applying advanced techniques to identify the data as sensitive. In an example, the sensitive data corresponds to information that is protected against unwarranted disclosure and whose protection is required for legal or ethical reasons, such as for issues pertaining to personal privacy, or for proprietary considerations. Beneficially, as compared to the conventional approach, the pattern classification, the metadata classification, and the data classification are used by the controller 102 to detect more types of sensitive data and also to reduce the number of false detections, such as by using a scoring mechanism.
In an implementation, the data segments include a plurality of records, and the controller 102 is further configured to classify the data in the data segment by classifying the data in each record based on the classification score. In an example, the data segments correspond to a segment in a database container, such as a column in a table, a JavaScript Object Notation (JSON) field in a collection of documents, and the like. Moreover, the controller 102 is further configured to determine a classification score based on the number of records and if a match is found, then the controller 102 classifies the data segment based on the classification score. In other words, the controller 102 classifies the data segments that include the plurality of records and determines the sensitive data, which is protected by the implementation of security policies.
Beneficially, the controller 102 is configured to find the sensitive data and then classify the data based on the classification score with the reduced number of false detections, such as reduced number of false positives detection and reduced number of false negatives detection. Further, the controller 102 is configured to apply security policies on the sensitive data to protect the sensitive data.
FIG. 2 depicts a flowchart of finding data to be protected in a database, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements from FIG. 1. With reference to FIG. 2, there is shown a flowchart 200 that includes a series of operations from 202-to-216. The controller 102 is configured to execute the flowchart 200.
There is provided the controller 102 to find the data to be protected in the database 104. In other words, the controller 102 is configured to detect the sensitive data in the database 104. For example, at operation 202, the controller 102 starts the operation to find the sensitive data to be protected in the database 104. At operation 204, the controller 102 connects to a data source and then obtains metadata. Further, the controller 102 is configured to receive the data that includes one or more data segments. For example, at operation 206, the controller 102 receives, through data sampling, the data with one or more data segments, such as a person’s name, address, and the like. The controller 102 receives the data from the database 104, finds the sensitive data in the received data, and then protects the sensitive data. The controller 102 is further configured to determine which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification. In other words, at operation 208, the data segment to be protected is determined by the pattern classification through pattern classifiers. Moreover, at operation 210, the data segment to be protected is determined by the data classification through data classifiers. Further, at operation 212, the data segment with the metadata is evaluated by the metadata classification through metadata classifiers. In an implementation, the controller 102 is configured to classify the data in the data segment based on a scoring mechanism, such as through operation 214. The scoring mechanism enables the controller 102 to classify the data segments that include the sensitive data, which are then protected by the security policies. In an example, the data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
Moreover, the controller 102 is configured to label the data segment (e.g., name, address, face detection, and the like) to be protected as sensitive such as through operation 216. In addition, at operation 218, the controller 102 is further configured to implement the security policies for the sensitive data.
The controller 102 further detects the sensitive data, such as the address (e.g., city, state, zip code, and the like) of a person that is located in the database 104, using the address parser. The address of the person helps the controller 102 to set up the security policies for sensitive data protection. In an implementation, the controller 102 parses the data to find the sensitive data by using regular expressions or natural language processing without limiting the scope of the present disclosure. For example, the controller 102 employs address parsers to automate the determination of whether a specific data segment of the database 104 holds addresses or not. Similarly, the controller 102 uses the address parser to tokenize the text as an address and provides the output of the address tokens, such as street, city, state, country, house number, zip code, and the like. The controller 102 further finds the corresponding data segment (e.g., address column, document, and the like) that holds the address, and also classifies the data segment that holds the sensitive data (i.e., address). Finally, the controller 102 increments a counter of the number of addresses found.
In an implementation, the controller 102 is configured to classify the data based on the names of persons by searching in a name database for a match for the data. In other words, as the name of a person is classified as sensitive data, the controller 102 is configured to detect the sensitive data in the database 104, such as by searching the most used names. In an implementation, the name database includes the most used names in a current region and/or language, and if a match is found, then the data segment is labelled as holding a name entry. For example, the database 104 includes a large amount of data that includes persons’ names. Further, some of the records store the most common names, such as common last, middle, and first names. Therefore, it is enough to look into a dictionary with the most common last, middle, and first names to detect whether the data holds a name entry or not, as further shown and described in FIG. 3. In an implementation, the dictionary used to look up the most common names is different for different regions, locations, cultures, and languages.
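As an illustration only, the address classification described above may be sketched in Python as follows. The regular-expression tokenizer below is a deliberately small, hypothetical stand-in for a real address parser; the state list, the street suffixes, the five-digit zip code rule, and the names `address_features`, `is_address`, `count_addresses`, and `feature_threshold` are assumptions made for the example and are not taken from the disclosure.

```python
import re

# Assumed sample vocabularies; a real address parser would be far richer.
_STATES = {"CA", "NY", "TX", "WA"}
_STREET_SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road", "blvd"}


def address_features(text):
    """Tokenize text and return the address-like features that were found."""
    features = {}
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        if token.isdigit() and len(token) == 5:
            features.setdefault("zip_code", token)      # assumes 5-digit zip codes
        elif token.isdigit():
            features.setdefault("house_number", token)
        elif token.lower() in _STREET_SUFFIXES:
            features.setdefault("street", token)
        elif token in _STATES:
            features.setdefault("state", token)
    return features


def is_address(text, feature_threshold=2):
    """Classify as an address if more features than the configurable number exist."""
    return len(address_features(text)) > feature_threshold


def count_addresses(records, feature_threshold=2):
    """Counter of the number of addresses found in the sampled records."""
    return sum(1 for record in records if is_address(record, feature_threshold))
```

The counter returned by `count_addresses` is the kind of value that would later be passed to the scoring mechanism together with the number of records scanned.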
In accordance with an embodiment, the controller 102 is further configured to determine that a field in the data is of a length in a range (i.e., MIN to MAX) prior to searching the name database for a match for the field. Moreover, the range depends on the current region and/or language. In other words, the controller 102 checks the length of the text for the common names and then parses till the end of the text to check if the text includes the first name, middle name, or the last name (e.g., features) based on the common length of the text. Finally, if the data segment includes the name, then the number of found names is incremented. Beneficially, the detection of the sensitive data as per the length and the feature of the text enables the controller 102 to find the sensitive data (i.e., name of a person) and further protect the sensitive data.
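A minimal Python sketch of the name classification, under stated assumptions, is given below. The tiny `COMMON_NAMES` set stands in for a regional name database, and the `MIN_LEN` and `MAX_LEN` values are placeholders for the region-dependent range; none of these values come from the disclosure.

```python
# Assumed sample dictionary; in practice it would hold the most used first,
# middle, and last names for the current region and/or language.
COMMON_NAMES = {"john", "mary", "smith", "garcia", "lee"}
MIN_LEN, MAX_LEN = 2, 40  # assumed length range for a name field


def holds_name(field, names=COMMON_NAMES, min_len=MIN_LEN, max_len=MAX_LEN):
    """Check the field length first, then look each word up in the dictionary."""
    if not (min_len <= len(field) <= max_len):
        return False  # skip the dictionary search for out-of-range fields
    return any(part.lower() in names for part in field.split())


def count_names(records):
    """Number of sampled records in which a name was found."""
    return sum(1 for record in records if holds_name(record))
```

Checking the length before the lookup keeps long free-text fields, which may mention a common name incidentally, from being counted as name entries.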
In an implementation, the controller 102 is further configured to classify the data based on file type classification by determining that a file in the data is of a protected type and also to determine that the data segment holds a protected file entry. In other words, the controller 102 finds the sensitive data included in the sensitive files located in the database 104 by checking the file type. For example, a document stored in the database 104 includes the sensitive data, such as a Portable Document Format (PDF) file that stores a trade contract or a computer-aided design (CAD) drawing file. In such a case, the controller 102 firstly checks whether the data segment of the database 104 is in binary form or in an encoded form (e.g., base64, hexadecimal, and the like), and if the data segment is not in binary form, then the controller 102 converts the data segment into binary form. After that, the controller 102 obtains the first 1024 bytes of the binary form (or binary content) of the data segment and checks the MIME type of the file, and if the file is of a compressed type, then extracts the first bytes. Finally, if the MIME type of the file found is sensitive, then the number of files with sensitive data is incremented, otherwise not. Beneficially, the determination of data in the protected file enables the controller 102 to find the sensitive data included in the protected file.
In an implementation, the controller 102 is further configured to classify the data based on face detection by determining that a file in the data segment is of an image type. In other words, as the image of a person’s face is classified as sensitive data, the controller 102 is configured to detect images of faces in the database 104 through various face recognition techniques, such as via object recognition techniques, MIME-type recognition techniques, and the like. Further, the controller 102 is configured to classify the data based on face detection by applying face detection, by utilizing an object detection model on the file, and by determining if a face is detected or not. Moreover, if a face is detected, then the controller 102 is configured to determine that the data segment holds a detected face. Further, in such a case, firstly, the controller 102 checks whether the data segment of the database 104 is in binary form or in an encoded form (e.g., base64, hexadecimal, and the like). Moreover, if the data segment is not in binary form, then the controller 102 converts the data segment to binary form. After that, the controller 102 obtains the MIME type to determine whether images exist in the data segment. In addition, if an image is found in the data segment, then the controller 102 extracts the full content of the field and downloads the image. In an implementation, the controller 102 converts the image to the format known by the object detection model. Finally, the controller 102 applies the face object detection model, and if a person’s face is found, then the number of images that include a person’s face is incremented. The detection of the data that includes the person’s face enables the controller 102 to find the sensitive data.
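The file type check can be sketched in Python as follows, as one possible reading of the steps above. The sketch assumes base64 as the only text encoding, recognizes only a handful of well-known file signatures, and treats PDF as the only sensitive type; the names `MAGIC`, `SENSITIVE_MIME`, `to_binary`, `sniff_mime`, and `count_sensitive_files` are hypothetical. Extraction of compressed files is not covered.

```python
import base64
import binascii

# A few well-known file signatures (leading bytes) and their MIME types.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}
SENSITIVE_MIME = {"application/pdf"}  # assumed configuration of protected types


def to_binary(value):
    """Return bytes as-is; try to decode text as base64, else fall back to UTF-8."""
    if isinstance(value, bytes):
        return value
    try:
        return base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return value.encode("utf-8")


def sniff_mime(value):
    """Inspect only the first 1024 bytes of the binary content."""
    head = to_binary(value)[:1024]
    for signature, mime in MAGIC.items():
        if head.startswith(signature):
            return mime
    return None


def count_sensitive_files(records):
    """Number of sampled records whose detected MIME type is a protected type."""
    return sum(1 for record in records if sniff_mime(record) in SENSITIVE_MIME)
```

Only the leading bytes are examined, so the check stays cheap even when a field stores a large document.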
In an implementation, the controller 102 is further configured to classify the data utilizing the Shannon entropy. In other words, the controller 102 is configured to find the sensitive data that is included in encrypted (or hash) form in the database 104 through Shannon entropy. In an implementation, the encrypted content includes at least passwords (e.g., user passwords), encryption keys, salt keys, access tokens, session identifiers (e.g., session IDs), payment card numbers, encrypted data, or any other form of random data. The encrypted and random data includes high byte distribution that is measured via the Shannon Entropy. The controller 102 is further configured to classify the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment and by determining if the record is of a length that exceeds a size threshold or not. Moreover, if the length does not exceed the size threshold, then the controller 102 accumulates the data from the other records together with the data of the record and determines the entropy on the accumulated data. In an example, the probability of each byte in the data is calculated via below given formulae:
$$P(i) = \frac{\text{byte occurrence}_i}{\text{data length}}$$
where $i$ = 0 to 255, byte occurrence is the number of times byte $i$ appears in the data, and the data length is the data size in bytes. Further, the file entropy is calculated by using the following formula:
$$\text{File Entropy} = -\sum_{i=0}^{255} P(i)\,\log_2 P(i)$$
where the file entropy is between zero (0) and eight (8).
In an implementation, if the value of the file entropy is eight, then the file has the highest byte distribution, and if the value of the file entropy is zero, then the file has the lowest byte distribution. In an example, high entropy values indicate compressed data and encrypted data. The controller 102 further checks the content type (or the data header) to reduce false positives by separating the compressed data from the encrypted data. The controller 102 is also configured to detect the sensitive data with high entropy through different parameters, such as the list of field types, the length of the field (e.g., minimum length or maximum length), the number of bytes to accumulate for the entropy check, an entropy value, and the content header (e.g., compressed or encrypted). For example, if the minimum field size is 80 MB, the maximum field size is 160 MB, the number of bytes is 500 bytes, the entropy value is 7.2, and the check for compressed is false, then the data is encrypted data that includes the sensitive data, as further shown in FIG. 4. Similarly, in another example, if the minimum field size is 2048 MB, the maximum field size is 100 MB, the number of bytes is 500 bytes, the entropy value is 7.2, and the check for compressed is true, then the data includes compressed data that includes the sensitive data. The controller 102 is configured to sample the data segment (e.g., document, table), and then check how the data in the data segment is encoded. Further, if the data in the data segment is in base64 (or hexadecimal, or any other) encoded form, then the controller 102 transforms the data to the binary form. After that, the controller 102 identifies the data segment (e.g., column, JSON field, XML, and the like) that includes the encrypted data by checking the field type, text field, the length of data, and the data header (compressed or encrypted).
In an example, if the data is compressed, then the controller 102 checks if the data segment (or compressed content) is protected with a password or not. Moreover, if the data segment is protected by the password, then the data is the sensitive data, and if the data segment is not protected by the password, then the data segment does not include the sensitive data. In another example, if the field length is less than the threshold value (or bytes num), then the controller 102 concatenates the occurrences and gets the chunks in size greater than or equal to the threshold (or bytes num). The controller 102 further calculates the entropy of the given bytes, and if the entropy is higher than the threshold value (or entropy value), then the controller 102 adds an occurrence of found incidents. Further, the controller 102 counts the number of occurrences and the number of data segments scanned and passes the number of occurrences to the score calculation to get the confidence level of the sensitive data.
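The entropy check with accumulation of short records may be sketched in Python as shown below. This is a sketch under stated assumptions rather than the disclosed implementation: the function names are hypothetical, the defaults of 500 bytes and 7.2 are taken from the examples above, and the content-header check that separates compressed from encrypted data is omitted.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Entropy in bits per byte: 0 for constant data, 8 for a uniform byte spread."""
    if not data:
        return 0.0
    length = len(data)
    return -sum(
        (count / length) * math.log2(count / length)
        for count in Counter(data).values()
    )


def count_high_entropy(records, bytes_num=500, entropy_value=7.2):
    """Concatenate records into chunks of at least bytes_num bytes and count
    the chunks whose entropy is higher than entropy_value."""
    occurrences = 0
    chunk = b""
    for record in records:
        chunk += record
        if len(chunk) >= bytes_num:
            if shannon_entropy(chunk) > entropy_value:
                occurrences += 1
            chunk = b""  # start accumulating the next chunk
    return occurrences
```

Accumulating short records matters because a chunk of fewer than 256 bytes cannot reach an entropy of 8, so short fields would otherwise never cross a high threshold even if their content is random.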
In an implementation, the controller 102 is further configured to score the determination of the data to be protected by determining a score based on the weight for finding an entry and a fraction of the number of records where an entry is found out of the number of entries scanned. In other words, the controller 102 classifies the data segments through the scoring mechanism, which enables the controller 102 to differentiate the findings from potential false positives. In an implementation, the score can be calculated as the sum of w_name × (name found) and w_content × (content found ratio). Here, w_name corresponds to the weight for finding the name, and the value of w_name lies between “0” and “1”. The value of name found is “1” if the data segment names are used to find the item. Moreover, w_content corresponds to the weight for finding the name in the data segment, and the value of w_content lies between “0” and “1”. Further, the content found ratio corresponds to the number of records where content is found divided by the number of records scanned. In an example, if the score is higher than “1”, then the score is set to “1”. Each classifier can have different weights without limiting the scope of the present disclosure. The scoring mechanism balances alerting on everything suspicious against avoiding false positives, with a score from 1 to 100. The results are ordered, so those with higher confidence are promoted.
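As an illustration only, the score calculation can be sketched in Python, assuming that the score is the weighted sum of the name found value and the content found ratio, capped at 1. The function and parameter names are hypothetical.

```python
def classification_score(name_found: bool,
                         records_with_content: int,
                         records_scanned: int,
                         w_name: float,
                         w_content: float) -> float:
    """Weighted sum of the name match and the content found ratio, capped at 1.0.

    Assumption: both weights lie between 0 and 1, as stated in the description.
    """
    if records_scanned <= 0:
        content_ratio = 0.0  # nothing scanned, so no content evidence
    else:
        content_ratio = records_with_content / records_scanned
    score = w_name * (1.0 if name_found else 0.0) + w_content * content_ratio
    return min(score, 1.0)
```

For example, with both weights set to 0.5, a segment whose name matches and in which half of the scanned records match yields 0.5 + 0.25 = 0.75.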
The controller 102 is further configured to implement the security policies for the protection of the sensitive data such as through operation 218. In an implementation, the controller 102 is further configured to protect the data segment to be protected by applying security policies. The controller 102 is configured to select the security policies to be applied for the data segment based on the type of entries in the data segment. In an example, the security policies include different policies, such as access control, auditing, dynamic and static data masking, and the like. Moreover, the controller 102 is configured to protect the sensitive data using the security policies based on the type of sensitive data found. Beneficially, the finding of sensitive data by the controller 102 and then the classification of data segments based on the scoring mechanism enables the controller 102 to find the sensitive data and perform security policies on the sensitive data to protect the sensitive data. Further, the determination of which sensitive data is to be protected by applying pattern classifiers or advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
FIG. 3 depicts a flowchart of finding data that includes an address to be protected in a database, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIGs. 1 and 2. With reference to FIG. 3, there is shown a flowchart 300 that includes a series of operations from 302-to-314. The controller 102 is configured to execute the flowchart 300.
At operation 302, the controller 102 starts traversing through the data segments that include the plurality of records. In an implementation, the controller 102 is further configured to classify the data in the data segment based on at least a partial address by utilizing an address parser to tokenize the data into address features. For example, at operation 304, the controller 102 parses the addresses in the data segments. After that, the controller 102 determines if the number of features is more than a configurable number, and if so, classifies the data segment as including the address. In an example, the controller 102 is configured to check the number of address features for every record and then mark the data segments with a rate mechanism. In an implementation, the controller 102 is configured to determine if the number of entries of a specific class held by the data segment exceeds the configurable number, and if so, to determine that the data segment is to be protected. For example, at operation 306, the controller 102 is configured to determine if the number of features is greater than the configurable number (or ‘c’). If so, the data segment is classified as an address that is to be protected through security policies, and the controller 102 moves to operation 308. However, if the number of features is less than the configurable number, then the data segment is not protected, and the controller 102 moves to operation 312, where the data segment is classified as not an address. In an example, the data segment is represented as “22 October 2018 Monday Evening 12345”, with three features, such that the house number is 22, the road is October 2018 Monday Evening, and the postcode is 12345. Moreover, if the configurable number is four, then the data “22 October 2018 Monday Evening 12345” is parsed by the controller 102 and is not classified as an address, because the number of features in the data is three.
In an implementation, the controller 102 is further configured to look up the at least partial address in an address register to validate the address. Moreover, if the at least partial address is found in the address register, then the at least partial address is categorized as an address, and the data segment is labelled as holding an address entry. Optionally, at operation 308, the controller 102 also checks the features against the dictionary. In an implementation, the dictionary corresponds to an address registry in which the controller 102 can check if a specific address or address part exists or not. For example, the address registry is used to check if a certain street exists in a city, or if the city exists in the country, and the like. Moreover, if the features are found in the dictionary, the controller 102 classifies the data segment as an address and moves to operation 310. However, if the features are not found in the dictionary, then the data segment is classified as not an address, such as through operation 314. In an example, the data segment is represented as “22 October 2018 Monday Evening New York 12345”, with five features, such that the house_number is 22, the road is October 2018 Monday, the city is Evening, the state is New York, and the postcode is 12345. However, if the features, such as the city and the road, are not found in the dictionary, then the data “22 October 2018 Monday Evening New York 12345” is classified as not an address, such as through operation 314. In another example, the data segment is represented as “77 Nassau av. Plainview NY 11803”, and the controller 102 parses the data and finds five features, namely house number 77, road Nassau Avenue, city Plainview, state NY, and postcode 11803, which match the features in the dictionary. Thus, the data segment is termed a valid address.
Finally, if the data segment is classified as an address, then at operation 310, the address counter that counts the number of addresses in the data segments is incremented. Beneficially, the data segment’s classification based on the number of features by scoring mechanism enables the controller 102 to reduce the number of false negatives and false positives. In an example, the operation 308, and the operation 314 (i.e., as represented via dotted line in FIG. 3) are optional, and are depending on the availability of the dictionary. Thus, if the dictionary is not available, then the controller 102 executes the operation 310 directly after the execution of operation 306.
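The decision flow of FIG. 3 can be sketched in Python as follows. This is a simplified illustration that assumes the address parser has already produced a dictionary of features for each record; the function names and the optional registry_lookup callback (standing in for the dictionary or address registry) are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

Features = Dict[str, str]  # e.g. {"house_number": "77", "city": "Plainview"}


def is_address(features: Features,
               c: int,
               registry_lookup: Optional[Callable[[Features], bool]] = None) -> bool:
    """Classify one parsed record following operations 306 to 314."""
    if len(features) <= c:
        return False  # operation 312: not enough features
    if registry_lookup is None:
        return True  # no dictionary available: operations 308 and 314 are skipped
    return registry_lookup(features)  # operation 308: validate against the registry


def count_addresses(parsed_records: Iterable[Features],
                    c: int,
                    registry_lookup: Optional[Callable[[Features], bool]] = None) -> int:
    """Operation 310: count the records classified as addresses."""
    return sum(1 for features in parsed_records if is_address(features, c, registry_lookup))
```

The description does not state how a record with exactly c features is treated; the sketch treats it as not an address, which is an assumption.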
FIG. 4 depicts a graphical representation that illustrates an entropy of different types of data to further detect the sensitive data, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3. With reference to FIG. 4, there is shown a graphical representation 400 that includes an X-axis 402 that illustrates a number of bytes and a Y-axis 404 that illustrates value of entropy.
With reference to the graphical representation 400, a first line 406, a second line 408, a third line 410, a fourth line 412, a fifth line 414, and a sixth line 416 collectively illustrate the different types of data and their corresponding entropy. For example, the first line 406 illustrates the data stored in a portable document format (PDF) file, the second line 408 illustrates the data stored in a text file (or txt), and the third line 410 illustrates the data stored in hashed form (or random passwords of eight characters, hashed). Similarly, the fourth line 412 illustrates the data stored in the encrypted cc form (or Advanced Encryption Standard (AES) encryptions of 16 random digits, each producing 89 bytes, concatenated together). Moreover, the fifth line 414 illustrates the data stored in the encrypted form (or AES encryption of a large random string), and the sixth line 416 illustrates the data stored in the random form.
In an example, the value of the Shannon entropy reaches a value as high as 7.2 with just 500 bytes, such as shown through the third line 410 (i.e., hashed form), the fourth line 412 (i.e., encrypted), and the sixth line 416 (i.e., random form). The high Shannon entropy of random, hashed, and encrypted data enables the controller 102 to find the sensitive data that can be protected. In an implementation, the controller 102 is further configured to classify the data utilizing the Shannon entropy. Beneficially, the Shannon entropy enables the controller 102 to find the sensitive data in different data types.
FIG. 5 depicts a flowchart of finding a data segment that includes a semi-structure to be protected in a database, in accordance with an embodiment of the present disclosure. FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4. With reference to FIG. 5, there is shown a flowchart 500 that includes a series of operations from 502-to-522. The controller 102 is configured to execute the flowchart 500.
At operation 502, the controller 102 starts the operation for finding the sensitive data to be protected in the database 104. At operation 504, the controller 102 connects to the data source and after that obtains the metadata. Further, the controller 102 is configured to receive the data that includes one or more data segments. For example, at operation 506, the controller 102 receives the data with the one or more data segments (e.g., a person’s name, address, and the like), and the controller 102 performs the data sampling. After that, at operation 508, the controller 102 is further configured to check if the data found from the data sampling is a semi-structure (or semi-structured data) or not. In an example, the semi-structured data corresponds to a data structure, such as key-value stores, JSON, XML, and arrays. Moreover, if the data is the semi-structure, then the controller 102 implements the operation 522 that includes semi-structure handling. However, if the data is not the semi-structure, then the controller 102 is further configured to determine which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification. For example, at operation 510, the data segment is protected by the pattern classification through pattern classifiers, and at operation 512, the data segment is protected by the data classification through data classifiers. Similarly, at operation 514, the data segment with the metadata is protected by the metadata classification through metadata classifiers. In an implementation, the controller 102 is configured to classify the data in a data segment based on the scoring mechanism, such as through operation 516. The data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
The controller 102 is configured to label the sensitive data segment (e.g., name, address, face detection, and the like) to be protected as sensitive, such as through operation 518. The controller 102 is further configured to implement the security policies for the protection of the sensitive data, such as through operation 520.
FIG. 6 depicts a flowchart of finding that data stored in a semi-structure is sensitive data, in accordance with another embodiment of the present disclosure. FIG. 6 is described in conjunction with elements from FIGs. 1, 2, 3, 4, and 5. With reference to FIG. 6, there is shown a flowchart 600 that includes a series of operations from 602-to-612. The controller 102 is configured to execute the flowchart 600.
In an implementation, the controller 102 is further configured to determine if the data segment includes the semi-structure or not, such as through operation 602. Moreover, if the controller 102 determines that the data segment includes the semi-structure, then the controller 102 further determines the name of the semi-structure, such as through operation 604. After that, the controller 102 compares the name to the name classifiers, such as through operation 610, and if a match is found during the comparison, then the controller 102 determines if that data segment holds an entry or not. However, if the match is not found during the comparison, then the controller 102 further determines if the content of the semi-structure is a final node or not, such as through operation 606. Moreover, if the semi-structure is the final node, then the controller 102 determines that the data segment holds the entry, and the controller 102 further checks and counts the content with the content classifiers, such as through operation 612. However, if the semi-structure is not the final node, then the controller 102 duplicates each node’s child (i.e., child node), such as through operation 608. After that, the controller 102 determines if the semi-structure includes the name through the operation 604. Beneficially, the determination of the semi-structure enables the controller 102 to find the sensitive data in the data segment.
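The traversal of FIG. 6 can be sketched as a recursive walk over a parsed JSON-like structure. The sketch below is illustrative only: the function names, the shape of the classifiers, and the choice to stop descending once a name matches are assumptions, not details stated in the present disclosure.

```python
import re
from typing import Any, Callable, Dict, Optional


def scan_semi_structure(name: str,
                        node: Any,
                        name_classifiers: Dict[str, str],
                        content_classifier: Callable[[str], Optional[str]],
                        counts: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Count findings per label while walking a nested dict/list structure."""
    if counts is None:
        counts = {}
    label = name_classifiers.get(name.lower())
    if label is not None:  # operation 610: the node name matches a name classifier
        counts[label] = counts.get(label, 0) + 1
        return counts
    if not isinstance(node, (dict, list)):  # operation 606: final node
        label = content_classifier(str(node))  # operation 612: content classifiers
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
        return counts
    # operation 608: descend into each child node, then repeat from operation 604
    children = node.items() if isinstance(node, dict) else [(name, child) for child in node]
    for child_name, child in children:
        scan_semi_structure(str(child_name), child, name_classifiers, content_classifier, counts)
    return counts


def card_like(text: str) -> Optional[str]:
    """Hypothetical content classifier: flags any 16-digit run as a card number."""
    return "card_number" if re.search(r"\b\d{16}\b", text) else None
```

For instance, scanning a document with a field named email_address and a free-text note that contains a 16-digit number would record one finding for each label.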
FIG. 7 depicts a flowchart of applying security policies to protect data segments, in accordance with an embodiment of the present disclosure. FIG. 7 is described in conjunction with elements from FIGs. 1, 2, 3, 4, 5, and 6. With reference to FIG. 7, there is shown a flowchart 700 that includes a series of operations from 702-to-714. The controller 102 is configured to execute the flowchart 700.
In an implementation, the controller 102 is configured to protect the data segment to be protected by applying security policies. The controller 102 is further configured to select the security policies to be applied for the data segment based on the type of data in the data segment (e.g., credit card, person’s name, password and the like). In other words, the controller 102 is configured to protect the sensitive data using the security policies (e.g., access control, auditing, dynamic and static data masking, and the like) based on the type of sensitive data found. At operation 702, the controller 102 starts the sensitive data discovery process to find the sensitive data. After the discovery of the sensitive data, the controller 102 provides sensitive data discovery alerts with different labels (e.g., names, addresses, images, and the like), such as through operation 704. In an example, the sensitive data discovery alert corresponds to the labels with the type of the sensitive data, and each sensitive data discovery alert on the data segment is labelled with a label (e.g., a credit card, address and the like). Further at operation 706, the controller 102 implements a policy matcher to check the type of security policy to be applied to the corresponding data segment. For example, at operation 712, the payment policy domain is applied for payment card-related data segments. Similarly, at operation 714, a personal data policy is applied for personal data (e.g., name, address, and the like), and the like. In an example, the security policies are arranged in different domains, like payment (e.g., PCI compliance), personal data protection, and the like to protect the sensitive data. Further, the policy matcher decides the application of policies through operation 708 by the discovery of alerts with labels and the policies through operation 710. 
The policy matcher links the labels on the sensitive data alerts with the policy repositories and offers the policies that match the sensitive data found in the database 104. Thus, the controller 102 not only finds the sensitive data in the database 104 but also offers the right protection, which depends on the type of the sensitive data found. Beneficially, the correct implementation of security policies on the sensitive data segments provides a comprehensive security solution using the pre-defined security policies.
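A minimal Python sketch of such a policy matcher is given below for illustration. The domain names, the labels assigned to each domain, and the policies listed per domain are hypothetical placeholders; an actual policy repository would be defined by the deployment.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical policy repository: which alert labels belong to which policy domain.
POLICY_DOMAINS: Dict[str, set] = {
    "payment": {"credit_card"},
    "personal_data": {"name", "address", "face"},
}

# Hypothetical policies offered per domain (policy types are named in the description).
DOMAIN_POLICIES: Dict[str, List[str]] = {
    "payment": ["access control", "auditing", "static data masking"],
    "personal_data": ["access control", "dynamic data masking"],
}


def match_policies(alerts: Iterable[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each alerted data segment to the (domain, policy) pairs that match its labels.

    alerts is an iterable of (data_segment, label) pairs.
    """
    offers: Dict[str, List[Tuple[str, str]]] = {}
    for segment, label in alerts:
        segment_offers = offers.setdefault(segment, [])
        for domain, labels in POLICY_DOMAINS.items():
            if label in labels:
                for policy in DOMAIN_POLICIES[domain]:
                    entry = (domain, policy)
                    if entry not in segment_offers:  # avoid duplicate offers
                        segment_offers.append(entry)
    return offers
```

A segment whose label matches no domain is kept in the result with an empty list, so that unmatched alerts remain visible.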
FIG. 8 is a flowchart of a method of finding data to be protected in a database, in accordance with an embodiment of the present disclosure. FIG. 8 is described in conjunction with elements from FIGs. 1 to 7. With reference to FIG. 8, there is shown a flowchart of method 800 that includes steps 802 to 808. The controller 102 of the FIG. 1 is configured to execute the method 800.
There is provided the method 800 of finding the data to be protected in the database 104. In other words, the method 800 is used to detect the sensitive data in the database 104. The method 800 provides an improved sensitive data discovery and protection system by applying advanced techniques to identify the data as sensitive.
In an implementation, the data segments include a plurality of records, and the controller 102 is further configured to classify the data in the data segment by classifying the data in each data segment based on the classification score. In an example, the data segments correspond to a segment in a database container, such as a column in a table, a JSON field in a collection of documents, and the like. Moreover, the method 800 includes determining a classification score based on the number of records and on whether a match is found, and then classifying the data segment based on the classification score. In other words, the method 800 includes classifying the data segments that include the plurality of records and determining the sensitive data, which is protected by the implementation of security policies. The sensitive data corresponds to information that is protected against unwarranted disclosure and whose protection is required for legal or ethical reasons, such as for issues pertaining to personal privacy, or for proprietary considerations.
At step 802, the method 800 comprises receiving of the data that includes one or more data segments. For example, the method 800 includes receiving the data with one or more data segments such as a person’s name, address, and the like through data sampling. The method 800 includes receiving the data through the database 104 and then finding the sensitive data from the data.
At step 804, the method 800 further comprises determining which data segment is to be protected by applying a pattern classification, a metadata classification, and a data classification. In other words, at step 804A, the data segment is protected by the pattern classification and through pattern classifiers. Moreover, at step 804B, the data segment is protected by the data classification and through data classifiers. Further, at step 804C, the data segment with the metadata is protected by the metadata classification and through metadata classifiers. In an implementation, the method 800 includes classifying the data in the data segment based on a scoring mechanism. The scoring mechanism is used to classify the data segments that include the sensitive data and also protects the data segments by the security policies. In an example, the data is classified as the sensitive data based on at least a partial address, names of persons, file type classification, face detection, and encrypted (and/or) random content.
At step 806, the method 800 further comprises labelling the data segment (e.g., name, address, face detection, and the like) to be protected as sensitive. In addition, at step 808, the method 800 provides protecting the data segment by applying security policies. In an example, the controller 102 detects the sensitive data, such as the address (e.g., city, state, zip code, and the like) of a person, that is located in the database 104 using the address parser. The address of the person is used to set up the security policies for sensitive data protection. In an implementation, the controller 102 is used to parse the data to find the sensitive data by using regular expressions or natural language processing without limiting the scope of the present disclosure. The method 800 further comprises classifying the data in the data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
In an implementation, the method 800 further comprises classifying the data based on the names of a person by searching in a name database for a match for the data. In other words, as the name of the person is classified as the sensitive data, then the method 800 includes detecting the sensitive data in the database 104 by searching the most used names. In an implementation, the name database includes the most used names in a current region and/or language and if a match is found, then the data segment is labelled as holding a name entry. For example, the database 104 includes a large amount of data that includes a person’s name. Thus, some of the records store the most common names such as common last, middle, and first names. Therefore, it is enough to look into a dictionary with the most common last, middle, and first names to detect that the data holds the name entry or not. In an implementation, the dictionary used to look up the most common names are different for different regions, locations, cultures, and languages.
In an implementation, the method 800 further comprises determining that a field in the data is of a length in a range (i.e., MIN to MAX) prior to searching the name database for a match for the field. Moreover, the range depends on the current region and/or language. In other words, the method 800 includes checking the length of the text for the common names and then parses till the end of the text to check if the text includes the first name, middle name, or the last name (e.g., features) based on the common length of the text. Finally, if the data segment includes the name, then the number of found names is incremented. Beneficially, the detection of the sensitive data as per the length and the feature of the text is used to find the sensitive data (i.e., name of a person) and further protect the sensitive data.
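The length check followed by the name dictionary look-up can be sketched in Python as below. The function name, the default MIN and MAX values, and the whitespace tokenization are assumptions made for illustration; in practice the range and the dictionary depend on the region and/or language.

```python
from typing import Iterable, Set


def count_name_fields(values: Iterable[str],
                      common_names: Set[str],
                      min_len: int = 2,
                      max_len: int = 20) -> int:
    """Count the fields that hold at least one common first, middle, or last name.

    common_names is expected to contain lower-case names.
    """
    found = 0
    for value in values:
        text = value.strip()
        if not (min_len <= len(text) <= max_len):
            continue  # length outside the range: skip the dictionary search
        # parse until the end of the text, checking each token against the dictionary
        if any(token.lower() in common_names for token in text.split()):
            found += 1
    return found
```

The length filter runs first because it is cheap and discards long free-text fields before any dictionary search is made.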
In an implementation, the method 800 further comprises classifying the data based on file type classification by determining that a file in the data is of a protected type and also determining that the data segment holds a protected file entry. In other words, the method 800 includes finding the sensitive data included in the sensitive files located in the database 104 by checking the file type. For example, a document stored in the database 104 includes the sensitive data, such as a portable document format (PDF) file that stores a trade contract or an Auto computer aided design (CAD) drawing file. In such a case, the method 800 includes checking whether the data segment in the database 104 is in the binary form or in an encoded form (e.g., Base64, hexadecimal, and the like), and if the data segment is not in the binary form, then the method 800 includes converting the data segment into the binary form. After that, the method 800 includes obtaining the first 1024 bytes of the binary form (or binary content) of the data segment and checking the MIME type of the file, and if the file is of a compressed type, then extracting the first bytes. Finally, if the MIME type of the file found is sensitive, then the number of files with sensitive data is incremented, otherwise not. Beneficially, the determination of data in the protected file is used to find the sensitive data included in the protected file.
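As a simplified illustration of the file type check, the following Python sketch converts an encoded field to binary and compares the first 1024 bytes against a few well-known file signatures. The helper names and the short signature list are assumptions; a real deployment would likely rely on a complete MIME detection library, and the extraction of compressed content is not shown.

```python
import base64
import binascii
from typing import Iterable, Optional, Set, Union

# A few well-known file signatures (magic bytes); the list is deliberately short.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def to_binary(field: Union[str, bytes]) -> bytes:
    """Return the field as bytes, decoding hexadecimal or Base64 text when possible."""
    if isinstance(field, bytes):
        return field
    try:
        # Hex is tried first because a hex string can also be valid Base64.
        return bytes.fromhex(field)
    except ValueError:
        pass
    try:
        return base64.b64decode(field, validate=True)
    except (binascii.Error, ValueError):
        return field.encode("utf-8")  # plain text: keep the raw bytes


def detect_mime(field: Union[str, bytes]) -> Optional[str]:
    """Guess the MIME type from the first 1024 bytes of the binary content."""
    head = to_binary(field)[:1024]
    for signature, mime in SIGNATURES:
        if head.startswith(signature):
            return mime
    return None


def count_protected_files(fields: Iterable[Union[str, bytes]], protected_types: Set[str]) -> int:
    """Count the fields whose detected MIME type is in the protected set."""
    return sum(1 for field in fields if detect_mime(field) in protected_types)
```

Short plain-text values can occasionally be valid hexadecimal or Base64 by coincidence; the sketch ignores this ambiguity, which a production scanner would need to address.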
In an implementation, the method 800 further comprises classifying the data based on face detection by determining that a file (or binary field) in the data segment is of an image type. In other words, as the image of a person's face is classified as the sensitive data, the method 800 includes detecting the images of faces in the database 104 through various face recognition techniques, such as object recognition techniques, MIME type recognition techniques, and the like. Further, the method 800 includes classifying the data based on face detection by applying face detection to the file by utilizing an object detection model, and by determining if a face is detected and, if so, determining that the data segment holds a detected face. In such a case, firstly, the method 800 includes checking whether the data segment in the database 104 is in the binary form or in an encoded form (e.g., Base64, hexadecimal, and the like). Moreover, if the data segment is not in the binary form, then the method 800 includes converting the data segment to the binary form. After that, the controller 102 obtains the MIME type to determine the existence of images in the data segment. In addition, if an image is found in the data segment, then the method 800 includes extracting the full content of the field and downloading the image. In an implementation, the controller 102 converts the image to the format known by the object detection model. Finally, the method 800 includes applying the face object detection model, and if the person’s face is found, then the number of images that include the person’s face is incremented. The detection of the data that includes the person’s face is used to find the sensitive data.
In an implementation, the method 800 further comprises classifying the data utilizing the Shannon entropy. In other words, the method 800 includes finding the sensitive data that is included in encrypted (or hashed) form in the database 104 through the Shannon entropy. In an implementation, the encrypted content includes at least passwords (e.g., user passwords), encryption keys, salt keys, access tokens, session identifiers (e.g., session IDs), payment card numbers, encrypted data, or any other form of random data. The encrypted and random data has a high byte distribution that is measured via the Shannon entropy. The method 800 further comprises classifying the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment and determining if the length of the record exceeds a size threshold or not. Moreover, if the length does not exceed the size threshold, then the method 800 includes accumulating the data from the other records together with the data of the record and determining the entropy on the accumulated data.
In an implementation, if the value of the file entropy is eight, then the file has the highest byte distribution (i.e., the data is random data), and if the value of the file entropy is zero, then the file has the lowest byte distribution (i.e., the data is non-random data). In an example, data with high entropy values includes compressed data and encrypted data. In an implementation, the method 800 includes checking the content type (or the data header) to reduce false positives by separating the compressed data from the encrypted data. The method 800 further includes detecting the sensitive data with high entropy through different parameters, such as the list of field types (or field types), the length of the field (e.g., minimum length or maximum length), the number of bytes to accumulate for the entropy check, the entropy value, and the content header (e.g., compressed or any other form of data). The method 800 further includes sampling the data segment (e.g., document, table), and then checking for encoded data in the data segment. Further, if the data in the data segment is in a Base64 (or hexadecimal or any other type of) encoded form, then the method 800 includes transforming the data to the binary form. After that, the method 800 includes identifying the data segment (e.g., column, JSON field, XML, and the like) that includes the encrypted data or the hashed data by checking the field type, the text field, the length of the data, and the data header (compressed or encrypted). In an example, if the field length is less than the threshold value (or bytes num), then the method 800 includes concatenating the occurrences to obtain chunks of a size greater than or equal to the threshold (or bytes num). The method 800 further includes calculating the entropy of the given bytes, and if the entropy is higher than the threshold value (or entropy value), then adding an occurrence of found incidents.
Further, the method 800 includes counting the number of occurrences and the number of data segments scanned and passing the number of occurrences to the score calculation to get the confidence level of the sensitive data.
In an implementation, the method 800 further includes scoring the determination of the data to be protected by determining a score based on the weight for finding an entry and a fraction of the number of records where an entry is found out of the number of entries scanned. In other words, the method 800 includes classifying the data segments through the scoring mechanism, which is used to differentiate the findings from potential false positives. The scoring mechanism balances alerting on everything suspicious against avoiding false positives, with a score from 1 to 100. The results are ordered, so those with higher confidence are promoted.
At step 808, the method 800 further comprises protecting the data segment to be protected by applying security policies. The method 800 includes selecting the security policies to be applied for the data segment based on the type of entries in the data segment. In an example, the security policies include different policies, such as access control, auditing, dynamic and static data masking, and the like. Moreover, the method 800 includes protecting the sensitive data using the security policies based on the type of sensitive data found. Beneficially, the finding of sensitive data by the method 800 and then the classification of data segments based on the scoring mechanism is used to find the sensitive data and apply security policies to the sensitive data to protect the sensitive data. Further, the determination of which sensitive data is to be protected by applying pattern classifiers, advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
In an implementation, the method 800 further comprises classifying the data in the data segment based on the at least partial address by utilizing an address parser to tokenize the data into address features. For example, the controller 102 parses the addresses in the data segments. After that, the controller 102 determines if the number of features is more than a configurable number and, if so, classifies the data segment as including the address. In an implementation, the method 800 further includes determining if the number of entries of a specific class held by the data segment exceeds the configurable number, and if so, determining that the data segment is to be protected. For example, the method 800 includes determining if the number of features is greater than the configurable number (i.e., c), in which case the data segment is classified as an address that is to be protected through security policies. However, if the number of features is less than the configurable number, then the data segment is classified as not an address and is thus not protected.
In an implementation, the method 800 further comprises performing a look-up of the at least partial address in an address register to validate the address. Moreover, if the at least partial address is found in the address register, then the at least partial address is categorized as an address, and the data segment is labelled as holding an address entry. Optionally, the method 800 includes checking the features against the dictionary and, based on the features found in the dictionary, classifying the data segment as an address. However, if the features are not found in the dictionary, then the data segment is classified as not an address. Finally, if the data segment is classified as an address, then the address counter that counts the number of addresses in the data segments is incremented. Beneficially, classifying the data segment based on the number of features by the scoring mechanism reduces the number of false negatives and false positives.
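As a non-limiting illustration, the look-up in the address register and the incrementing of the address counter may be sketched as follows. The register is modelled here as an in-memory set of normalised strings; the register contents, the normalisation step, and the function names are assumptions of this sketch.

```python
from typing import Iterable, Set

# Hypothetical address register holding normalised, known address fragments.
ADDRESS_REGISTER: Set[str] = {"main street", "baker street", "munich"}


def normalise(fragment: str) -> str:
    """Lower-case the fragment and collapse runs of whitespace."""
    return " ".join(fragment.lower().split())


def is_registered_address(partial_address: str,
                          register: Set[str] = ADDRESS_REGISTER) -> bool:
    """Validate an at least partial address by looking it up in the register."""
    return normalise(partial_address) in register


def count_address_entries(candidates: Iterable[str],
                          register: Set[str] = ADDRESS_REGISTER) -> int:
    """Increment an address counter for every candidate found in the register."""
    address_counter = 0
    for candidate in candidates:
        if is_registered_address(candidate, register):
            address_counter += 1
    return address_counter
```

In this sketch, the resulting counter could then be compared against the configurable number to decide whether the data segment is to be protected.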
In an implementation, the method 800 further comprises determining if the data segment includes the semi-structure or not. Moreover, if the data segment includes the semi-structure, then the method 800 includes determining the name of the semi-structure. After that, the controller 102 compares the name to name classifiers and, if a match is found during the comparison, the controller 102 determines that the data segment holds an entry. However, if no match is found during the comparison, then the controller 102 further determines if the content of the semi-structure is a final node or not. Moreover, if the semi-structure is the final node, then the controller 102 determines that the data segment holds the entry, and checks (and counts) the content with the content classifiers. However, if the semi-structure is not the final node, then each child node of the node is taken in turn, and the method 800 again determines if the semi-structure includes the name. Beneficially, the determination of the semi-structure is used to find the sensitive data in the data segment.
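As a non-limiting illustration, the handling of a semi-structure may be sketched as a recursive traversal of a parsed JSON document, which is only one possible form of semi-structure. The name classifiers, the content classifier pattern, and the function name `count_entries` are assumptions of this sketch; the sketch counts an entry at a final node only when a content classifier matches.

```python
import json
import re
from typing import Any

# Hypothetical name classifiers and content classifiers (illustrative only).
NAME_CLASSIFIERS = {"address", "password", "ssn"}
CONTENT_CLASSIFIERS = [re.compile(r"^\d{3}-\d{2}-\d{4}$")]


def count_entries(name: str, node: Any) -> int:
    """Count entries found under a named node of a semi-structure."""
    # Step 1: compare the name of the node to the name classifiers.
    if name.lower() in NAME_CLASSIFIERS:
        return 1
    # Step 2: if the node is not a final node, take each child node in turn.
    if isinstance(node, dict):
        return sum(count_entries(child_name, child) for child_name, child in node.items())
    if isinstance(node, list):
        return sum(count_entries(name, item) for item in node)
    # Step 3: final node, so check its content with the content classifiers.
    text = str(node)
    return 1 if any(pattern.search(text) for pattern in CONTENT_CLASSIFIERS) else 0


document = json.loads(
    '{"user": {"password": "x", "notes": "123-45-6789", "tags": ["a", "b"]}}'
)
entries_found = count_entries("", document)
```

Here the `password` node is matched by name, the `notes` node is matched by content, and the `tags` nodes match neither classifier.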
Beneficially, the method 800 finds the sensitive data, classifies the data segments based on the scoring mechanism, and applies security policies to protect the sensitive data. Further, determining which sensitive data is to be protected by applying pattern classifiers, advanced data classifiers, or metadata classifiers reduces the number of false positives and false negatives and provides a more accurate classification of sensitive data.
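As a non-limiting illustration, a scoring mechanism based on a weight for finding an entry and on the fraction of scanned records in which an entry was found may be sketched as follows. Combining the weight and the fraction by multiplication, the threshold value, and the function names are assumptions of this sketch, since the disclosure only states that the score is based on both quantities.

```python
def classification_score(weight: float, records_with_entry: int,
                         records_scanned: int) -> float:
    """Score a data segment from a classifier weight and a match fraction."""
    if records_scanned <= 0:
        # Nothing was scanned, so there is no evidence of sensitive data.
        return 0.0
    return weight * records_with_entry / records_scanned


def segment_is_sensitive(score: float, threshold: float = 0.5) -> bool:
    """Label the segment as sensitive when the score reaches the threshold."""
    return score >= threshold
```

For example, with a weight of 2.0 and an entry found in 1 of 4 scanned records, the score is 0.5, which reaches the assumed threshold.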
The steps 802 to 808 are only illustrative, and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A controller (102) for finding data to be protected in a database (104), wherein the controller (102) is configured to: receive data comprising one or more data segments; determine which data segments to protect by applying pattern classification, classify metadata, and classify data, label the data segment to be protected as sensitive; and protect the data segment by applying security policies, wherein the controller (102) is configured to classify the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
2. The controller (102) according to claim 1, wherein a data segment comprises a plurality of records, and wherein the controller (102) is further configured to classify the data in a data segment by classifying the data in each record and determine a classification score based on the number of records where a match is found and classify the data segment based on the classification score.
3. The controller (102) according to claim 1 or 2, wherein the controller (102) is further configured to classify the data in a data segment based on the at least partial address by utilizing an address parser to tokenize the data as address providing address features, determine if the number of features is more than a configurable number and if so classify the data segment as comprising an address.
4. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to look-up the at least partial address in an address register to validate the address, and if found in the address register that at least partial address is categorized as an address and the data segment is labelled as holding an address entry.
5. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to classify the data based on names of persons by: searching in a name database for a match for the data, wherein the name database comprises the most used names in a current region and/or language, and if a match is found, the data segment is labelled as holding a name entry.
6. The controller (102) according to claim 5, wherein the number of the most used names in the name database (104) depends on the current regions and/or languages.
7. The controller (102) according to claim 5 or 6, wherein the controller (102) is further configured to determine that a field in the data is of a length in a range (MIN to MAX) prior to searching the name database for a match for the field, wherein the range depends on the current region and/or language.
8. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to classify the data based on file type classification by determining that a file in the data is of a protected type and determining that the data segment holds a protected file entry.
9. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to classify the data based on face detection by: determining that a file in the data segment is of an image type; applying face detection utilizing an object detection model to the file; and determining if a face is detected and if so determining that the data segment holds a detected face entry.
10. The controller (102) according to any preceding claim, wherein the encrypted and/or random content comprises at least one of the following: passwords, encryption keys, salt keys, access tokens, session identifiers, payment card numbers, encrypted data or any other form of random data.
11. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to classify the data utilizing the Shannon entropy.
12. The controller (102) according to claim 11, wherein the controller (102) is further configured to classify the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment, determining if the record is of a length exceeding a size threshold, and if not, accumulating data from another record together with the data of the record and determining the entropy on the accumulated data.
13. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to determine if the data segment comprises a semi-structure, and if so determine a name of the semi-structure, compare the name to name classifiers, and if a match is found, determine that the data segment holds an entry, and if no match is found, determine if the content of the semi-structure is a final node, and if so, determine that the data segment holds an entry.
14. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to score the determination of the data to be protected by determining a score based on a weight for finding an entry and a fraction of the number of records where an entry was found out of the number of entries scanned.
15. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to protect the data segment to be protected by applying security policies, wherein the controller (102) is further configured to select the security policies to be applied for a data segment, based on the type of entries in the data segment.
16. The controller (102) according to any preceding claim, wherein the controller (102) is further configured to determine if a segment of the data holds a number of entries of a specific class exceeding a configurable number and if so, determine that the segment is to be protected.
17. A method (800) for finding data to be protected in a database (104), wherein the method (800) comprises: receiving data comprising one or more data segments; determining which data segments to protect by applying pattern classification, classify metadata, and classify data, labelling the data segment to be protected as sensitive; and protecting the data segment by applying security policies, wherein the method (800) further comprises classifying the data in a data segment based on at least a partial address, names of persons, file type classification, face detection, and encrypted and/or random content.
18. The method (800) according to claim 17, wherein a data segment comprises a plurality of records, and wherein the method (800) further comprises classifying the data in a data segment by classifying the data in each record and determine a classification score based on the number of records where a match is found and classify the data segment based on the classification score.
19. The method (800) according to claim 17 or 18, wherein the method (800) further comprises classifying the data in a data segment based on the at least partial address by utilizing an address parser to tokenize the data as address providing address features, determining if the number of features is more than a configurable number and if so classifying the data segment as comprising an address.
20. The method (800) according to any of claims 17 to 19, wherein the method (800) further comprises performing a look-up of the at least partial address in an address register to validate the address, and if found in the address register that at least partial address is categorized as an address and the data segment is labelled as holding an address entry.
21. The method (800) according to any of claims 17 to 20, wherein the method (800) further comprises classifying the data based on names of persons by: searching in a name database for a match for the data, wherein the name database comprises the most used names in a current region and/or language, and if a match is found, the data segment is labelled as holding a name entry.
22. The method (800) according to claim 21, wherein the number of the most used names in the name database depends on the current regions and/or languages.
23. The method (800) according to claim 21 or 22, wherein the method (800) further comprises determining that a field in the data is of a length in a range (MIN to MAX) prior to searching the name database for a match for the field, wherein the range depends on the current region and/or language.
24. The method (800) according to any of claims 17 to 23, wherein the method (800) further comprises classifying the data based on file type classification by determining that a file in the data is of a protected type and determining that the data segment holds a protected file entry.
25. The method (800) according to any of claims 17 to 24, wherein the method (800) further comprises classifying the data based on face detection by: determining that a file in the data segment is of an image type; applying face detection utilizing an object detection model to the file; and determining if a face is detected and if so determining that the data segment holds a detected face entry.
26. The method (800) according to any of claims 17 to 25, wherein the encrypted and/or random content comprises at least one of the following: passwords, encryption keys, salt keys, access tokens, session identifiers, payment card numbers, encrypted data or any other form of random data.
27. The method (800) according to any of claims 17 to 26, wherein the method (800) further comprises classifying the data utilizing the Shannon entropy.
28. The method (800) according to claim 27, wherein the method (800) further comprises classifying the data utilizing the Shannon entropy by determining an entropy for data in a record in the data segment, determining if the record is of a length exceeding a size threshold, and if not, accumulating data from another record together with the data of the record and determining the entropy on the accumulated data.
29. The method (800) according to any of claims 17 to 28, wherein the method (800) further comprises determining if the data segment comprises a semi-structure, and if so determining a name of the semi-structure, comparing the name to name classifiers, and if a match is found, determining that the data segment holds an entry, and if no match is found, determining if the content of the semi-structure is a final node, and if so, determining that the data segment holds an entry.
30. The method (800) according to any of claims 17 to 29, wherein the method (800) further comprises scoring the determination of the data to be protected by determining a score based on a weight for finding an entry and a fraction of the number of records where an entry was found out of the number of entries scanned.
31. The method (800) according to any of claims 17 to 30, wherein the method (800) further comprises protecting the data segment to be protected by applying security policies, wherein the method (800) further comprises selecting the security policies to be applied for a data segment, based on the type of entries in the data segment.
32. The method (800) according to any of claims 17 to 31, wherein the method (800) further comprises determining if a segment of the data holds a number of entries of a specific class exceeding a configurable number and if so, determining that the segment is to be protected.
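As an informal, non-limiting illustration of the entropy-based classification recited in claims 11, 12, 27 and 28, the Shannon entropy of a record may be determined, and records whose length does not exceed a size threshold may be accumulated with subsequent records before the entropy is determined. Treating the data as bytes, flushing any remaining accumulated data at the end, and the function names are assumptions of this sketch and do not limit the claims.

```python
import math
from collections import Counter
from typing import Iterable, List


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the data in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropies_with_accumulation(records: Iterable[bytes],
                                size_threshold: int) -> List[float]:
    """Compute entropies, accumulating records that are too short on their own."""
    results: List[float] = []
    accumulated = b""
    for record in records:
        accumulated += record
        # Only compute the entropy once the accumulated data exceeds the threshold.
        if len(accumulated) > size_threshold:
            results.append(shannon_entropy(accumulated))
            accumulated = b""
    if accumulated:
        # Assumed behaviour: evaluate any remaining short data at the end.
        results.append(shannon_entropy(accumulated))
    return results
```

High entropy values computed in this way may indicate encrypted and/or random content, such as encryption keys or access tokens, whereas short records evaluated individually would give unreliable estimates.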

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/054513 WO2023160776A2 (en) 2022-02-23 2022-02-23 Controller and method for finding data to be protected in database

Publications (2)

Publication Number Publication Date
WO2023160776A2 true WO2023160776A2 (en) 2023-08-31
WO2023160776A3 WO2023160776A3 (en) 2023-11-16

Family

ID=80933470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/054513 WO2023160776A2 (en) 2022-02-23 2022-02-23 Controller and method for finding data to be protected in database

Country Status (1)

Country Link
WO (1) WO2023160776A2 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810317B2 (en) * 2017-02-13 2020-10-20 Protegrity Corporation Sensitive data classification
US20200334381A1 (en) * 2019-04-16 2020-10-22 3M Innovative Properties Company Systems and methods for natural pseudonymization of text
US20200394327A1 (en) * 2019-06-13 2020-12-17 International Business Machines Corporation Data security compliance for mobile device applications
AU2021201071B2 (en) * 2020-02-19 2022-04-28 Harrison-Ai Pty Ltd Method and system for automated text anonymisation

Also Published As

Publication number Publication date
WO2023160776A3 (en) 2023-11-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22712298

Country of ref document: EP

Kind code of ref document: A2