CN111241133A - Sensitive data identification method, device and equipment and computer storage medium - Google Patents

Publication number
CN111241133A
Authority
CN
China
Prior art keywords
data
target
sensitive data
database
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811445535.4A
Other languages
Chinese (zh)
Inventor
陆艳军
杨翔
赵立农
廖天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Chongqing Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201811445535.4A priority Critical patent/CN111241133A/en
Publication of CN111241133A publication Critical patent/CN111241133A/en
Pending legal-status Critical Current

Abstract

The invention discloses a sensitive data identification method, a sensitive data identification device, sensitive data identification equipment and a computer storage medium. The sensitive data identification method comprises the following steps: acquiring characteristic parameters for positioning target data to be identified and a regular expression for identifying sensitive data in the target data; acquiring a target object containing target data according to the characteristic parameters; and identifying target data in the target object line by line according to the regular expression to determine whether the target object contains sensitive data. According to the embodiment of the invention, sensitive data in a large amount of data can be quickly and accurately identified.

Description

Sensitive data identification method, device and equipment and computer storage medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a sensitive data identification method, a sensitive data identification device, sensitive data identification equipment and a computer storage medium.
Background
The existing sensitive data identification method is mainly based on the matching method of a keyword library and combines a manual identification method to identify sensitive data.
The keyword-library matching method works as follows: matching patterns for sensitive data are defined manually, the data are matched one by one, and any data that satisfies a pattern is classified as sensitive data. The manual identification method works as follows: based on experience, an evaluator designates certain data in a predefined data model, such as a database design model or a file system organization structure, as sensitive information, and then identifies the sensitive data within that sensitive information by data sampling.
Therefore, the existing approach, which combines the keyword-library matching method with the manual identification method, mainly comprises the following steps: first, an evaluator defines the matching patterns for sensitive data; next, the matching direction of the keyword library is determined according to the predefined model; finally, the target is scanned using the matching patterns, and after the scan is completed the evaluator filters the matching results to refine them.
Although the current sensitive data identification method can identify the sensitive data to a certain extent, the following disadvantages still exist:
the degree of automation is insufficient: identifying sensitive data requires the matching results to be filtered manually, resulting in low efficiency;
the identification accuracy is low: the keyword-library matching method relies on pattern matching, so the accuracy of sensitive data identification depends on how the keyword library is built, and accuracy is low when the keyword library is incomplete or built incorrectly;
the identification speed is slow: because manual processing is involved, identification takes a long time when a large amount of data is processed, and manual processing also places high demands on the evaluators.
Disclosure of Invention
The embodiment of the invention provides a sensitive data identification method, a sensitive data identification device, sensitive data identification equipment and a computer storage medium, which can quickly and accurately identify sensitive data in a large amount of data.
In one aspect, an embodiment of the present invention provides a sensitive data identification method, including:
acquiring characteristic parameters for positioning target data to be identified and a regular expression for identifying sensitive data in the target data;
acquiring a target object containing the target data according to the characteristic parameters;
and identifying the target data in the target object line by line according to the regular expression so as to determine whether the target object contains the sensitive data.
Further, the characteristic parameters comprise a storage position parameter of the target object and a sampling range parameter of the target data in the target object.
Further, the storage location parameter of the target object at least comprises a database type for storing the target object, wherein the database type is a Hive database, an Hbase database, a Linux database, a Windows database, an ORACLE database, a MySQL database, or a db2 database.
Further, according to the characteristic parameters, acquiring the target object containing the target data includes:
acquiring a target file in the storage position based on the storage position corresponding to the storage position parameter;
and acquiring target data in the sampling range in the target file according to the sampling range corresponding to the sampling range parameter, and forming the target object containing the target data.
Further, acquiring the target file in the storage location based on the storage location corresponding to the storage location parameter includes:
acquiring a data file with operation authority in the storage position according to the storage position;
and removing the temporary file in the data file and obtaining the target file.
Further, the regular expression comprises a sensitive information parameter for identifying the sensitive data and an identification rule generated according to the sensitive information parameter.
Further, after determining that the target object contains the sensitive data, the method further includes:
acquiring a data position parameter of the sensitive data in the target data and a field parameter of the sensitive data in the data position;
and generating prompt information about the sensitive data according to the data position parameter and the field parameter.
In another aspect, an embodiment of the present invention provides a sensitive data identification apparatus, where the apparatus includes:
the information acquisition unit is configured to acquire characteristic parameters for positioning target data to be identified and regular expressions for identifying sensitive data in the target data;
an object determination unit configured to acquire a target object containing the target data according to the characteristic parameter;
the data identification unit is configured to identify the target data in the target object line by line according to the regular expression so as to determine whether the sensitive data is contained in the target object.
In another aspect, an embodiment of the present invention provides a sensitive data identification device, where the device includes: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements the sensitive data identification method as described above.
In still another aspect, an embodiment of the present invention provides a computer storage medium, where computer program instructions are stored, and when executed by a processor, the computer program instructions implement the sensitive data identification method described above.
According to the sensitive data identification method, the sensitive data identification device, the sensitive data identification equipment and the computer storage medium, the target object containing the target data can be found according to the acquired characteristic parameters of the target data to be identified, the target object is scanned according to the acquired regular expression for identifying the sensitive data, and whether the target object contains the sensitive data or not is determined.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating a sensitive data identification method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a specific method of step S120 in FIG. 1;
FIG. 3 is a flowchart illustrating a specific method of step S121 in FIG. 2;
FIG. 4 is a flowchart illustrating a specific method of step S130 in FIG. 1;
FIG. 5 is a schematic flow chart diagram of a sensitive data identification method according to another embodiment of the present invention;
FIG. 6 is a schematic flow chart diagram of one example of a sensitive data identification method of an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sensitive data identification device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a sensitive data identification device according to another embodiment of the present invention;
FIG. 9 is a schematic hardware structure diagram of a sensitive data identification device according to an embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In order to solve the problem of the prior art, embodiments of the present invention provide a method, an apparatus, a device, and a computer storage medium for identifying sensitive data. The sensitive data identification method provided by the embodiment of the invention is described below firstly.
Fig. 1 is a flowchart illustrating a sensitive data identification method according to an embodiment of the present invention. As shown in fig. 1, the sensitive data identification method according to the embodiment of the present invention includes:
S110, acquiring characteristic parameters for positioning target data to be identified and a regular expression for identifying sensitive data in the target data;
S120, acquiring a target object containing target data according to the characteristic parameters;
S130, identifying the target data in the target object line by line according to the regular expression to determine whether the target object contains sensitive data.
In the embodiment of the invention, the target object containing the target data can be searched according to the acquired characteristic parameters of the target data to be identified, the target object is scanned according to the acquired regular expression for identifying the sensitive data, and whether the target object contains the sensitive data or not is determined.
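The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the in-memory list standing in for a data source, and the placeholder pattern `ID_LIKE` are all hypothetical.

```python
import re
from itertools import islice
from typing import Iterable, List

def acquire_target_object(source: Iterable[str], sample_rows: int) -> List[str]:
    """S120: keep only the rows inside the sampling range (here, the first N rows)."""
    return list(islice(source, sample_rows))

def contains_sensitive_data(target_object: Iterable[str], pattern: re.Pattern) -> bool:
    """S130: scan the target object line by line with the regular expression."""
    return any(pattern.search(line) for line in target_object)

# S110: hypothetical characteristic parameter and regular expression.
SAMPLE_ROWS = 50
ID_LIKE = re.compile(r"\d{11}")  # placeholder pattern, not taken from the patent

lines = ["name,phone", "alice,13812345678", "bob,n/a"]
target = acquire_target_object(lines, SAMPLE_ROWS)
result = contains_sensitive_data(target, ID_LIKE)
```

Keeping acquisition (S120) separate from matching (S130) means the same matcher can be reused regardless of where the target object came from.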
In step S110, the characteristic parameters are some key information that is set in advance and related to the target data to be recognized in order to enable the sensitive data recognition task to be performed automatically, and are used for locating the target data to be recognized. In particular, the characteristic parameters may include at least a storage location parameter of the target object and a sampling range parameter of the target data in the target object.
In order to enable the computer to recognize the feature parameters, the feature parameters may be set as corresponding codes, and the storage position parameters and the sampling range parameters will be described separately as examples.
The storage location parameter identifies the carrier of the target data. Specifically, the storage location parameter of the target object at least includes the type of database storing the target object, where the database is the carrier of the target data, and the database type may be one of a Hive database, an Hbase database, a Linux database, a Windows database, an ORACLE database, a MySQL database, or a db2 database.
Taking the storage position parameters corresponding to the Linux database, the Hive database and the Hbase database as an example, the storage position parameter corresponding to the Linux database may be set to be 1, the storage position parameter corresponding to the Hive database may be set to be 2, and the storage position parameter corresponding to the Hbase database may be set to be 3.
It should be noted that, in other embodiments, the storage location parameter of the target object may further include a specific storage location of the target object in the database, for example, a specific list, directory, or other location, so as to provide location index information of the target data through the storage location parameter of the target object.
The sampling range parameter specifies the range of target data to be collected for identifying sensitive data, such as the full data of a table or the first 50 rows of data. Taking the sampling range parameters corresponding to the Linux database, the Hive database and the Hbase database as an example, the sampling range parameter corresponding to the full-document data of the Linux database can be set to 1-1, that corresponding to the first 50 rows of data of the Linux database can be set to 1-2, that corresponding to the full-table data of the Hive database can be set to 2-1, and that corresponding to the first 50 rows of data of the Hive database can be set to 2-2. It should be noted that sampling in the Hbase database differs from that in the other databases: sampling is performed by the first letter of a line, the middle characters of a line, or a custom character string. When the custom character string is a character string from the beginning to the end of the first letter of a line, the sampling range parameters corresponding to the Hbase database can be set as follows: the sampling range parameter corresponding to the first letter of a line is 3-1, that corresponding to the middle characters of a line is 3-2, and that corresponding to the character string from the beginning to the end of the first letter of a line is 3-3.
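The example codes above can be held in simple lookup tables. The sketch below is illustrative only; the dictionary names and the `decode_sampling_range` helper are hypothetical, while the code values are those given in the example.

```python
from typing import Tuple

# Storage location codes from the example: 1 = Linux, 2 = Hive, 3 = Hbase.
STORAGE_LOCATION_CODES = {1: "Linux", 2: "Hive", 3: "Hbase"}

# Sampling range codes from the example, keyed as "<storage code>-<range code>".
SAMPLING_RANGE_CODES = {
    "1-1": ("Linux", "full-document data"),
    "1-2": ("Linux", "first 50 rows"),
    "2-1": ("Hive", "full-table data"),
    "2-2": ("Hive", "first 50 rows"),
    "3-1": ("Hbase", "first letter of a line"),
    "3-2": ("Hbase", "middle characters of a line"),
    "3-3": ("Hbase", "custom string from beginning to end"),
}

def decode_sampling_range(code: str) -> Tuple[str, str]:
    """Return (database type, sampling scope) for a sampling range code."""
    database, scope = SAMPLING_RANGE_CODES[code]
    # Cross-check: the prefix of the code must agree with the storage location code.
    storage_code = int(code.split("-")[0])
    if STORAGE_LOCATION_CODES[storage_code] != database:
        raise ValueError(f"inconsistent sampling range code: {code}")
    return database, scope
```

For instance, `decode_sampling_range("2-2")` resolves to the Hive database and the first 50 rows.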
In step S110, the regular expression includes a sensitive information parameter for identifying the sensitive data and an identification rule generated from the sensitive information parameter. The sensitive information parameters can be characters and/or numbers, the characters and/or numbers can form regular character strings according to identification rules, the regular character strings are regular expressions, and the regular character strings can be used for filtering common character strings in target data so as to identify the sensitive data according to filtering results.
In the embodiment of the present invention, the sensitive data may be data related to the privacy of the user, such as latitude and longitude, name, bank account number, identification number, phone number (including mobile phone number and fixed phone number), unit name, address, gender, certificate type, and the like. Taking the sensitive data as the mobile phone number as an example, the regular expression can be set according to the meaning of each segment of the mobile phone number, and in general, the meaning of each segment of the mobile phone number is as follows: the first three digits represent the operator, the middle four digits represent the area number, and the last four digits represent the sequence number. Therefore, the regular expression set according to the meaning of each segment of the mobile phone number may be: ^(13[0-9])|(14[5|7])|(15([0-3]|[5-9]))|(18[0,5-9])\\d{8}.
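A short Python sketch of applying such a mobile phone pattern is given below. Two points are assumptions rather than statements from the text: the double backslash in the printed expression is read as string escaping for `\d`, and the prefix alternatives are wrapped in one group with a trailing `$` so that the eight-digit suffix applies to every prefix. In regular expression syntax alternation binds more loosely than concatenation, so without that group the first alternative would match a bare three-digit prefix. The helper names are hypothetical.

```python
import re
from typing import List

# The expression as printed, with "\\d" read as the digit class "\d".
PRINTED = r"^(13[0-9])|(14[5|7])|(15([0-3]|[5-9]))|(18[0,5-9])\d{8}"

# Assumed intent: group the prefixes and anchor the end (11 digits in total).
GROUPED = re.compile(r"^((13[0-9])|(14[5|7])|(15([0-3]|[5-9]))|(18[0,5-9]))\d{8}$")

def is_mobile_number(text: str) -> bool:
    """True if the whole string matches the grouped mobile phone pattern."""
    return GROUPED.match(text) is not None

def find_sensitive_lines(lines: List[str]) -> List[int]:
    """Return 1-based numbers of lines in which any comma-separated field matches."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if any(is_mobile_number(field.strip()) for field in line.split(","))
    ]

sample = ["alice,13812345678", "bob,15412345678", "carol,18012345678"]
sensitive_lines = find_sensitive_lines(sample)
```

The character classes are kept exactly as printed, so `[5|7]` and `[0,5-9]` also accept the literal `|` and `,` characters; a stricter pattern would drop them.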
Fig. 2 shows a flowchart of a specific method of step S120 in fig. 1. As shown in fig. 2, the specific method for acquiring the target object including the target data according to the characteristic parameter in step S120 may include:
S121, acquiring a target file in the storage position based on the storage position corresponding to the storage position parameter;
S122, acquiring target data in the sampling range in the target file according to the sampling range corresponding to the sampling range parameter, and forming a target object containing the target data.
In step S121, according to the storage location corresponding to the storage location parameter, a database type where the target data is located is determined, and a database driver corresponding to the database is started to access the database, so as to enter the database to obtain the target file.
The database driver is an interface which is uniformly established by a database manufacturer and is convenient for accessing the database, and when different types of databases are accessed, specific connection and data access to the database can be realized according to the database driver type and the interface definition provided by the database manufacturer. Specifically, when accessing a database, the database driver type and interface provided by the database manufacturer need to be added to the program to be used, and the corresponding implementation logic is written according to the required mode and format, so as to implement the driving of the database and access the database.
After the target file is determined, a target object containing a plurality of lines of data (target data) to be identified may be obtained according to the sampling range parameter in step S122. The target object may be a real or virtual new file or table composed of a plurality of lines of data (target data) to be identified.
Fig. 3 shows a flowchart of a specific method of step S121 in fig. 2. As shown in fig. 3, the step S121, acquiring the target file in the storage location based on the storage location corresponding to the storage location parameter includes:
S210, acquiring the data files with operation authority in the storage position according to the storage position;
S220, removing the temporary files from the data files to obtain the target files.
In step S210, all data files with operation authority in the storage location can be queried according to the storage location. However, not all of the obtained data files are target files; for example, a temporary file in the storage location does not need sensitive data identification because of its temporary nature. Therefore, in step S220, the data files may be filtered by removing the temporary files, thereby obtaining the target files. This reduces the amount of data to be identified, improves the efficiency of sensitive data identification, and reduces its workload.
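Steps S210 and S220 amount to two successive filters. The sketch below is a hedged illustration: the `(name, has_permission)` input shape is hypothetical, and the `temp` and `zzz` markers are borrowed from the table-name check in step S309 of the example, on the assumption that similar markers could identify temporary files.

```python
from typing import Iterable, List, Tuple

# Assumed markers for temporary files (taken from the table-name check in S309).
TEMP_MARKERS = ("temp", "zzz")

def select_target_files(files: Iterable[Tuple[str, bool]]) -> List[str]:
    """Return the names of files that are operable and not temporary."""
    # S210: keep only the data files for which operation authority exists.
    permitted = [name for name, has_permission in files if has_permission]
    # S220: remove temporary files, leaving the target files.
    return [
        name for name in permitted
        if not any(marker in name.lower() for marker in TEMP_MARKERS)
    ]

candidates = [
    ("users.csv", True),
    ("users_temp.csv", True),
    ("secret.csv", False),
    ("zzz_backup.csv", True),
]
targets = select_target_files(candidates)
```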
Fig. 4 shows a flowchart of a specific method of step S130 in fig. 1. As shown in fig. 4, step S130 may specifically be:
S131, determining whether the data at any data position in the target data matches the regular expression; if any field in the data matches the regular expression, executing step S132, and if no field in the data matches the regular expression, executing step S133;
S132, determining that the data contains sensitive data, storing the data, and then executing step S134;
S133, determining that the data does not contain sensitive data, not storing the data, and then executing step S134;
S134, determining whether all the data in the target data has been matched; if so, finishing the identification of the sensitive data, and if not, returning to step S131.
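The loop formed by steps S131 to S134 can be sketched as follows. The function name and the placeholder pattern `PHONE_LIKE` are hypothetical; each row is modelled as a list of string fields.

```python
import re
from typing import List, Sequence

PHONE_LIKE = re.compile(r"^1\d{10}$")  # placeholder pattern, not the patent's

def scan_rows(rows: Sequence[Sequence[str]], pattern: re.Pattern) -> List[Sequence[str]]:
    """Return the rows in which at least one field matches the pattern."""
    stored = []
    for row in rows:  # S134: repeat until every row has been examined
        # S131: does any field of this row match the regular expression?
        if any(pattern.search(field) for field in row):
            stored.append(row)  # S132: the row contains sensitive data, store it
        # S133: otherwise the row is not stored
    return stored

rows = [["alice", "13812345678"], ["bob", "n/a"], ["13900000000", "carol"]]
matched = scan_rows(rows, PHONE_LIKE)
```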
Fig. 5 is a flowchart illustrating a sensitive data identification method according to another embodiment of the present invention. As shown in fig. 5, after determining that the target object contains sensitive data in step S130, the method further includes:
S140, acquiring a data position parameter of the sensitive data in the target data and a field parameter of the sensitive data in the data position;
S150, generating prompt information about the sensitive data according to the data position parameter and the field parameter.
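As an illustration of steps S140 and S150, the sketch below treats the data position parameter as a row number and the field parameter as a column name; both readings, the message wording, and the function names are assumptions.

```python
import re
from typing import List, Sequence, Tuple

PHONE_LIKE = re.compile(r"^1\d{10}$")  # placeholder pattern, not the patent's

def locate_sensitive_fields(
    columns: Sequence[str], rows: Sequence[Sequence[str]], pattern: re.Pattern = PHONE_LIKE
) -> List[Tuple[int, str]]:
    """S140: return (row number, column name) for every matching field."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for column_name, value in zip(columns, row):
            if pattern.search(value):
                hits.append((row_number, column_name))
    return hits

def build_prompt(row_number: int, column_name: str) -> str:
    """S150: turn a position and a field into a human-readable prompt."""
    return f"Sensitive data exists in row {row_number}, column '{column_name}'"

columns = ["name", "phone"]
rows = [["alice", "13812345678"], ["bob", "n/a"]]
prompts = [build_prompt(r, c) for r, c in locate_sensitive_fields(columns, rows)]
```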
The sensitive data identification method according to the embodiment of the present invention will be described in detail with reference to fig. 6.
Fig. 6 is a flowchart illustrating an example of a sensitive data identification method according to an embodiment of the present invention. As shown in fig. 6, the whole process includes the following steps:
S301, setting a sensitive data identification task, specifically setting information such as the characteristic parameters of the target data to be identified, the task start time and task end time, and the regular expression corresponding to the sensitive data, wherein the characteristic parameters include: the target equipment to be identified, the storage position of the target file in the target equipment, and the sampling range in the target file.
S302, starting the sensitive data identification task.
S303, loading the characteristic parameters of the target data.
S304, loading the regular expression corresponding to the sensitive data.
S305, the sensitive data identification program accesses the database corresponding to the storage position by calling an interface of the database.
S306, establishing a database connection and acquiring the table list; specifically, the tables can be listed by table name.
S307, processing the tables one by one in a loop and determining whether all the tables have been processed; if so, executing step S308, and if not, executing step S309.
S308, ending the task.
S309, acquiring the table name and judging whether the table name contains information such as temp or zzz; if so, marking the table as a temporary table and returning to step S307, and if not, continuing to step S310.
S310, acquiring the column names of the table, acquiring the data (target data) in the corresponding rows of each column according to the sampling range, and forming a target object containing the target data.
S311, performing regular expression matching column by column in a loop and determining whether all columns have been matched; if so, executing step S316, and if not, executing step S312.
S312, matching the current data against the regular expression and determining whether it matches; if not, executing step S313, and if so, executing step S314.
S313, judging whether the current data is the last row of data in the current column; if so, returning to step S311, and otherwise taking the next row of data and executing step S312.
S314, generating prompt information about the sensitive data.
S315, storing the column information, and returning to step S311 to check the next column.
S316, confirming whether the table is a sensitive table; the table is determined to be a sensitive table as long as one column in the table contains sensitive data.
S317, storing the table information, and returning to step S307 to identify the next table.
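The table-by-table and column-by-column loops of steps S306 to S317 can be sketched with an in-memory SQLite database standing in for the real target database. SQLite, the function name, the placeholder pattern, and the sample tables are assumptions made only to keep the example self-contained; the temporary-table markers come from step S309.

```python
import re
import sqlite3
from typing import Dict, List

TEMP_MARKERS = ("temp", "zzz")          # markers named in step S309
PHONE_LIKE = re.compile(r"^1\d{10}$")   # placeholder pattern, not the patent's

def scan_database(conn: sqlite3.Connection, sample_rows: int = 50) -> Dict[str, List[str]]:
    """Return {sensitive table: [sensitive columns]} for all non-temporary tables."""
    sensitive: Dict[str, List[str]] = {}
    # S306: acquire the table list by table name.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:                                   # S307: table loop
        if any(marker in table.lower() for marker in TEMP_MARKERS):
            continue                                       # S309: skip temporary tables
        # S310: column names plus the rows inside the sampling range.
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (sample_rows,))
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
        hits = []
        for index, column in enumerate(columns):           # S311: column loop
            # S312/S313: check the rows of this column until one matches.
            if any(PHONE_LIKE.match(str(row[index])) for row in rows):
                hits.append(column)                        # S314/S315: record the column
        if hits:                                           # S316: one hit makes the table sensitive
            sensitive[table] = hits                        # S317: record the table
    return sensitive

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phone TEXT)")
conn.execute("CREATE TABLE users_temp (phone TEXT)")
conn.execute("CREATE TABLE products (title TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "13812345678"), ("bob", "n/a")])
conn.execute("INSERT INTO users_temp VALUES ('13900000000')")
conn.execute("INSERT INTO products VALUES ('kettle')")
report = scan_database(conn)
```

In this sketch the temporary table is skipped even though it holds a matching value, which mirrors the early return to step S307 in step S309.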
In the embodiment of the present invention, the database type may be one of a Hive database, an Hbase database, a Linux database, a Windows database, an ORACLE database, a MySQL database, or a db2 database. In the following, the database types are divided into four cases, and the sensitive data identification process is described separately.
First case
The first case is described for the Hive database, ORACLE database, MySQL database, or db2 database.
When the database is one of these types, the database driver can be configured through the default configuration.
For example, the configuration information may be spliced into the following form to call a client Application Programming Interface (API) for connection:
url="jdbc:hive2://{host node ip}:{host node port}/{database name}";
name="hdfs";
password="";
The driver manager then connects to the database using the above configuration information.
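Splicing the configuration information into that form is plain string formatting. The Python sketch below only builds the values and does not open a connection; the function name and the sample host, port, and database values are hypothetical.

```python
from typing import Dict

def build_hive_jdbc_config(host_ip: str, host_port: int, database: str,
                           user: str = "hdfs", password: str = "") -> Dict[str, str]:
    """Splice host, port, and database name into the url/name/password form."""
    return {
        "url": f"jdbc:hive2://{host_ip}:{host_port}/{database}",
        "name": user,
        "password": password,
    }

# Example values only; the address is from a documentation-reserved range.
config = build_hive_jdbc_config("192.0.2.10", 10000, "default")
```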
Then, all table names are obtained through database scanning, specifically, all table names can be obtained through the link established in the previous step by calling an interface method.
Finally, sensitive data identification is performed on the target data in a specific table, and prompt information about the sensitive data is obtained. Specifically, the range of rows to be identified (data from row m to row n) may first be defined based on a keyword in the query statement, thereby determining the target data. Next, the column names and the corresponding data are obtained and stored in order in a linked-hash container A, so that their order is preserved. The data are then matched against the regular expressions in sequence, and each successfully matched item is packaged into a hash container and stored in an ordered container B. The resulting ordered container B is then filtered to determine whether any item is negligible sensitive data (for example, if an item is matched as a mobile phone number by the regular expression but that mobile phone number is in a white list, it is regarded as negligible sensitive data). Finally, if any item is determined to be sensitive data, prompt information is generated, such as 'sensitive data exists in the field of a certain row and a certain column', and an interface API is called to return the sensitive data to the server.
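The container-based flow described above can be sketched in Python, where a plain `dict` preserves insertion order and stands in for the ordered container A, and a list of dictionaries stands in for container B. The pattern, the white list contents, and all names are hypothetical.

```python
import re
from typing import Dict, List, Sequence

PHONE_LIKE = re.compile(r"^1\d{10}$")   # placeholder pattern, not the patent's
WHITE_LIST = {"13800000000"}            # hypothetical negligible numbers

def identify(columns: Sequence[str], rows: Sequence[Sequence[str]],
             first_row: int, last_row: int) -> List[Dict[str, object]]:
    """Match rows first_row..last_row (1-based, inclusive) and drop white-listed hits."""
    sampled = rows[first_row - 1:last_row]
    # Container A: column name -> values, kept in column order.
    container_a = {
        column: [row[index] for row in sampled]
        for index, column in enumerate(columns)
    }
    # Container B: ordered list of successfully matched items.
    container_b = []
    for column, values in container_a.items():
        for offset, value in enumerate(values):
            if PHONE_LIKE.match(value):
                container_b.append(
                    {"row": first_row + offset, "column": column, "value": value})
    # Filter out negligible sensitive data found in the white list.
    return [hit for hit in container_b if hit["value"] not in WHITE_LIST]

columns = ["name", "phone"]
rows = [["alice", "13812345678"], ["svc", "13800000000"], ["bob", "n/a"]]
hits = identify(columns, rows, 1, 3)
```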
Second case
The second case is explained with respect to Hbase data.
When the database is of this type, the database driver can be configured by configuring the host node ip and the host node port. For Hbase data, the server on which the probe is deployed also needs the corresponding node information to be configured in its hosts file.
For example, under the Windows operating system, add the following to the C:\Windows\System32\drivers\etc\hosts file:
192.168.186.150 big01
192.168.186.151 big02
192.168.186.159 big03
192.168.186.160 big04
Then, the sampling range parameter of the Hbase data is set. As briefly stated above, the sampling range parameter of the Hbase data differs from that of the other databases. Its specific setting is explained next.
Before the target data is sampled from the Hbase data, an association table needs to be created. The setting of the sampling range parameter of the Hbase data is divided into the following three cases:
a) filtering by the first letter of a line (input box: strings, comma separated, up to three allowed);
b) middle string (input box: strings, comma separated, up to three allowed);
c) range of first-letter strings from beginning to end (only one string may be entered for the beginning and one for the end; both must be entered together, and neither may be empty).
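The input constraints in cases a) to c) can be checked with a small validator. This sketch is an assumption about how such validation might look; the function names and error messages are hypothetical.

```python
from typing import List, Tuple

MAX_STRINGS = 3  # cases a) and b): at most three comma-separated strings

def parse_filter_strings(raw: str) -> List[str]:
    """Split a comma-separated input box value and enforce the three-string limit."""
    items = [part.strip() for part in raw.split(",") if part.strip()]
    if len(items) > MAX_STRINGS:
        raise ValueError(f"at most {MAX_STRINGS} strings are allowed")
    return items

def parse_begin_end(begin: str, end: str) -> Tuple[str, str]:
    """Case c): exactly one beginning string and one ending string, neither empty."""
    begin, end = begin.strip(), end.strip()
    if not begin or not end:
        raise ValueError("beginning and end must both be entered")
    if "," in begin or "," in end:
        raise ValueError("only one string is allowed for each of beginning and end")
    return begin, end
```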
After the sampling range parameter is set, other processing of the background can be continued.
Then, a connection is established through the API of the Hbase database, the database is scanned to obtain all table names, and it is judged whether a certain table C exists; if so, table C is determined to be the target file.
Finally, a scanner for extracting the target data is established according to the value range parameter of the target data area, a target object containing the target data is then acquired, and the same sensitive data identification processing and sensitive data prompt information acquisition processing as in the first case are executed.
Third case
The third case is a description made with respect to the Linux database.
First, before connecting to the Linux database, it is judged whether the identification task exists in the scheduled termination queue; if so, the identification task is cancelled, and if not, the identification task is executed.
When it is determined that the identification task is to be executed, the Linux database is connected and the list of files to be scanned is acquired. Specifically, it may be determined whether a scanning path and a recursion depth have been set. If they have been set, a LINUX command for finding the target files is spliced according to the scanning path, the recursion depth, and the type of file to be scanned, and all file paths meeting the conditions, together with the file names of all target files in those paths, are finally obtained. If they have not been set, all files of the scan type under the default path are taken as target files, and the file names of all target files are acquired.
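The text does not reproduce the LINUX command itself, so the sketch below does not attempt to reconstruct it. Instead it shows, as a stand-in, how a scanning path, a recursion depth, and a set of file types can select target files using Python's `os.walk`. The function name, the reading of recursion depth as the number of directory levels below the scanning path, and the sample directory tree are all assumptions.

```python
import os
import tempfile
from typing import List, Sequence

def list_target_files(scan_path: str, max_depth: int, extensions: Sequence[str]) -> List[str]:
    """Return files under scan_path, at most max_depth levels down, with given extensions."""
    root_depth = scan_path.rstrip(os.sep).count(os.sep)
    found = []
    for current, dirs, files in os.walk(scan_path):
        depth = current.rstrip(os.sep).count(os.sep) - root_depth
        if depth >= max_depth:
            dirs[:] = []  # do not descend any further from this directory
        for name in files:
            if name.endswith(tuple(extensions)):
                found.append(os.path.join(current, name))
    return sorted(found)

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for relative in ("top.txt", os.path.join("a", "mid.txt"),
                     os.path.join("a", "b", "deep.txt"), "skip.bin"):
        open(os.path.join(root, relative), "w").close()
    shallow = [os.path.relpath(p, root) for p in list_target_files(root, 1, [".txt"])]
    everything = [os.path.relpath(p, root) for p in list_target_files(root, 5, [".txt"])]
```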
Then, the contents of all the target files are stored in files/probetemplates, and the same sensitive data identification processing and sensitive data prompt information acquisition processing as in the first case are executed on files/probetemplates. If sensitive data is identified, the sensitive data and the corresponding prompt information are stored in SENSEDATA as a scan success result. Finally, all scan results (both scan success results and scan failure results) are packaged into a scanResult, where each scan failure result includes the reason for the scan failure.
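The packaging of scan success results and scan failure results may be sketched as follows. The dictionary keys, the function name `scan_files`, and the choice of a failed file read as the failure case are assumptions for illustration; the structure of SENSEDATA and scanResult in the embodiment is not specified at this level of detail.

```python
import re


def scan_files(paths, pattern):
    """Scan each file line by line and collect successes and failures."""
    regex = re.compile(pattern)
    successes, failures = [], []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as fh:
                for line_no, line in enumerate(fh, start=1):
                    for match in regex.finditer(line):
                        successes.append(
                            {"file": str(path), "line": line_no, "match": match.group(0)}
                        )
        except OSError as exc:
            # A file that cannot be opened or read is recorded with its reason.
            failures.append({"file": str(path), "reason": str(exc)})
    return {"success": successes, "failure": failures}
```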
It should be noted that, before the scan result is saved, it is determined again whether the identification task exists in the scheduled termination queue; if so, the save is cancelled; otherwise, the interface service API is called and the result is saved.
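The two checks against the scheduled termination queue (before the task is executed and before the result is saved) may be sketched as follows. The queue is modelled as a Python set, and `scan` and `save` are hypothetical callables standing in for the scanning step and the interface service call.

```python
def run_identification_task(task_id, terminated, scan, save):
    """Run a task unless it is scheduled for termination; re-check before saving."""
    if task_id in terminated:
        return "cancelled"  # task was terminated before it started
    result = scan()
    if task_id in terminated:
        return "discarded"  # task was terminated while scanning; do not save
    save(result)
    return "saved"
```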
Fourth case
The fourth case concerns the Windows database.
First, before connecting to the Windows database, it is determined whether the identification task exists in the scheduled termination queue; if so, the identification task is cancelled; otherwise, the identification task is executed.
When it is determined that the identification task is to be executed, the Windows database is connected and the list of files to be scanned is acquired. Specifically, it may be determined whether a scan path and a recursion depth have been set. If they have been set, a command for finding the target files is assembled from the scan path, the recursion depth, and the scan file type, so as to obtain all file paths that satisfy the conditions and the file names of the target files under those paths. If they have not been set, all files under the default path that match the scan type are taken as target files, and the file names of all target files are acquired.
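A depth-limited file listing that does not depend on a platform-specific command may be sketched in Python as follows. This is an alternative illustration rather than the command assembled in the embodiment; the function name `list_target_files` and the convention that depth 0 means the scan path itself are assumptions.

```python
import os


def list_target_files(root, max_depth, extensions):
    """List files under root with the given extensions, descending max_depth levels."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    wanted = tuple("." + ext.lower() for ext in extensions)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # prune: do not descend below the depth limit
        for name in filenames:
            if name.lower().endswith(wanted):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

With `max_depth=0` only files directly in the scan path are returned; each increment admits one further level of subdirectories.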
Next, the contents of all target files are stored in the shared disk paths of PROBE and Windows; the file contents in the PROBE shared disk are then read specifically and stored in files/probeTemplate/.
Then, the same sensitive data identification processing and sensitive data prompt information acquisition processing as in the first case are executed on files/probeTemplate. If sensitive data is identified, the sensitive data and the corresponding prompt information are stored in SENSEDATA as a scan success result. Finally, all scan results (both scan success results and scan failure results) are packaged into a scanResult, where each scan failure result includes the reason for the scan failure.
It should be noted that, before the scan result is saved, it is determined again whether the identification task exists in the scheduled termination queue; if so, the save is cancelled; otherwise, the interface service API is called and the result is saved.
In summary, the sensitive data identification method of the embodiment of the present invention remedies several shortcomings of existing identification methods. Specifically: by setting characteristic parameters, the target data can be extracted automatically, which improves extraction efficiency; by setting a regular expression, sensitive data can be matched more accurately; and because the contents of different types of databases can be identified automatically, the efficiency of sensitive data identification is greatly improved.
Fig. 7 is a schematic structural diagram of a sensitive data identification device according to an embodiment of the present invention. As shown in fig. 7, an embodiment of the present invention provides a sensitive data identification apparatus, including:
an information acquisition unit 410 configured to acquire characteristic parameters for locating target data to be identified and a regular expression for identifying sensitive data in the target data;
an object determination unit 420 configured to acquire a target object containing the target data according to the characteristic parameters;
and a data identification unit 430 configured to identify the target data in the target object line by line according to the regular expression, so as to determine whether the target object contains sensitive data.
In an embodiment of the present invention, the object determination unit 420 is further configured to: acquire a target file in the storage location corresponding to the storage location parameter; and acquire, from the target file, the target data within the sampling range corresponding to the sampling range parameter, thereby forming a target object containing the target data.
The specific method by which the object determination unit 420 acquires the target file in the storage location corresponding to the storage location parameter may include: acquiring, according to the storage location, the data files in the storage location for which operation authority is held; and removing the temporary files from those data files to obtain the target file.
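The two filtering steps (keeping files for which operation authority is held, then removing temporary files) may be sketched as follows. Treating "operation authority" as read permission, and recognising temporary files by the suffixes and prefix shown, are assumptions for illustration; the embodiment does not state how temporary files are recognised.

```python
import os

# Assumed markers of temporary files; adjust to the environment being scanned.
TEMP_SUFFIXES = (".tmp", ".temp", ".swp")
TEMP_PREFIX = "~"


def select_target_files(paths):
    """Keep readable files, then drop those that look like temporary files."""
    readable = [p for p in paths if os.access(p, os.R_OK)]
    return [
        p
        for p in readable
        if not p.lower().endswith(TEMP_SUFFIXES)
        and not os.path.basename(p).startswith(TEMP_PREFIX)
    ]
```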
Fig. 8 is a schematic structural diagram of a sensitive data identification device according to another embodiment of the present invention. As shown in fig. 8, the sensitive data identification apparatus according to the embodiment of the present invention further includes:
an information processing unit 440 configured to obtain a data location parameter of the sensitive data in the target data and a field parameter of the sensitive data in the data location;
and an information generating unit 450 configured to generate prompt information about the sensitive data according to the data location parameter and the field parameter.
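The generation of prompt information from a data location and a field may be sketched as follows. Using the line number as the data location parameter, the matched character span as the position within that line, and a caller-supplied label `field_name` as the field parameter are assumptions for illustration, as is the wording of the prompt.

```python
import re


def build_prompts(lines, pattern, field_name):
    """Return one prompt string per regular-expression match, with its position."""
    regex = re.compile(pattern)
    prompts = []
    for line_no, line in enumerate(lines, start=1):
        for match in regex.finditer(line):
            # Columns are reported 1-based and inclusive.
            prompts.append(
                f"line {line_no}, columns {match.start() + 1}-{match.end()}: "
                f"possible {field_name}"
            )
    return prompts
```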
Fig. 9 is a schematic diagram illustrating a hardware structure of a sensitive data identification device according to an embodiment of the present invention.
The sensitive data identification device may comprise a processor 501 and a memory 502 in which computer program instructions are stored.
Specifically, the processor 501 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present invention.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 502 may include removable or non-removable (or fixed) media, where appropriate. Memory 502 may be internal or external to the sensitive data identification device, where appropriate. In a particular embodiment, memory 502 is non-volatile solid-state memory. In a particular embodiment, memory 502 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement any one of the sensitive data identification methods in the above embodiments.
In one example, the sensitive data identification device may also include a communication interface 503 and a bus 510. As shown in fig. 9, the processor 501, the memory 502, and the communication interface 503 are connected via a bus 510 to complete communication therebetween.
The communication interface 503 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present invention.
Bus 510 comprises hardware, software, or both, coupling the components of the sensitive data identification device to each other. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although specific buses have been described and shown in the embodiments of the invention, any suitable bus or interconnect is contemplated by the invention.
The sensitive data identification device can execute the sensitive data identification method in the embodiments of the present invention, thereby implementing the sensitive data identification method and apparatus described in conjunction with Figs. 1 to 8.
In addition, in combination with the sensitive data identification method in the foregoing embodiments, an embodiment of the present invention may provide a computer storage medium for implementation. The computer storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement any one of the sensitive data identification methods of the above embodiments.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
The above are only specific embodiments of the present invention. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described here again. It should be understood that the scope of the present invention is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions shall fall within the scope of the present invention.

Claims (10)

1. A sensitive data identification method, comprising:
acquiring characteristic parameters for positioning target data to be identified and a regular expression for identifying sensitive data in the target data;
acquiring a target object containing the target data according to the characteristic parameters;
and identifying the target data in the target object line by line according to the regular expression so as to determine whether the target object contains the sensitive data.
2. The sensitive data identification method according to claim 1, wherein the characteristic parameters include a storage location parameter of the target object and a sampling range parameter of the target data in the target object.
3. The sensitive data recognition method of claim 2, wherein the storage location parameter of the target object at least comprises a database type for storing the target object, wherein the database type is Hive database, Hbase database, Linux database, Windows database, ORACLE database, MySQL database, or db2 database.
4. The sensitive data identification method according to claim 2, wherein obtaining the target object containing the target data according to the characteristic parameters comprises:
acquiring a target file in the storage position based on the storage position corresponding to the storage position parameter;
and acquiring target data in the sampling range in the target file according to the sampling range corresponding to the sampling range parameter, and forming the target object containing the target data.
5. The sensitive data identification method according to claim 4, wherein acquiring the target file in the storage location based on the storage location corresponding to the storage location parameter comprises:
acquiring a data file with operation authority in the storage position according to the storage position;
and removing the temporary file in the data file and obtaining the target file.
6. The sensitive data identification method according to claim 1, wherein the regular expression comprises a sensitive information parameter for identifying the sensitive data and an identification rule generated according to the sensitive information parameter.
7. The method for identifying sensitive data according to claim 1, wherein after determining that the sensitive data is contained in the target object, the method further comprises:
acquiring a data position parameter of the sensitive data in the target data and a field parameter of the sensitive data in the data position;
and generating prompt information about the sensitive data according to the data position parameter and the field parameter.
8. An apparatus for identifying sensitive data, the apparatus comprising:
an information acquisition unit configured to acquire characteristic parameters for locating target data to be identified and a regular expression for identifying sensitive data in the target data;
an object determination unit configured to acquire a target object containing the target data according to the characteristic parameters;
and a data identification unit configured to identify the target data in the target object line by line according to the regular expression, so as to determine whether the sensitive data is contained in the target object.
9. A sensitive data identification device, the device comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements the sensitive data identification method of any of claims 1-7.
10. A computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement the sensitive data identification method of any one of claims 1 to 7.
CN201811445535.4A 2018-11-29 2018-11-29 Sensitive data identification method, device and equipment and computer storage medium Pending CN111241133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811445535.4A CN111241133A (en) 2018-11-29 2018-11-29 Sensitive data identification method, device and equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN111241133A true CN111241133A (en) 2020-06-05

Family

ID=70875755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811445535.4A Pending CN111241133A (en) 2018-11-29 2018-11-29 Sensitive data identification method, device and equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111241133A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580092A (en) * 2020-12-07 2021-03-30 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112839077A (en) * 2020-12-29 2021-05-25 北京安华金和科技有限公司 Sensitive data determination method and device
CN113434740A (en) * 2021-06-22 2021-09-24 中国平安人寿保险股份有限公司 Sensitive information monitoring method and device, terminal equipment and storage medium
CN113489704A (en) * 2021-06-29 2021-10-08 平安信托有限责任公司 Sensitive data identification method and device based on flow, electronic equipment and medium
CN113704573A (en) * 2021-08-26 2021-11-26 北京中安星云软件技术有限公司 Database sensitive data scanning method and device
WO2023125336A1 (en) * 2021-12-30 2023-07-06 Huawei Technologies Co., Ltd. Methods and devices for generating sensitive text detectors

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083514A1 (en) * 2005-10-07 2007-04-12 International Business Machines Corporation System and method for protecting sensitive data
CN104794204A (en) * 2015-04-23 2015-07-22 上海新炬网络信息技术有限公司 Database sensitive data automatically-recognizing method
CN105471823A (en) * 2014-09-03 2016-04-06 阿里巴巴集团控股有限公司 Sensitive information processing method, device, server and security determination system
CN106599713A (en) * 2016-11-11 2017-04-26 中国电子科技网络信息安全有限公司 Database masking system and method based on big data
US20170139674A1 (en) * 2015-11-18 2017-05-18 American Express Travel Related Services Company, Inc. Systems and methods for tracking sensitive data in a big data environment
CN108171069A (en) * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Desensitization method, application server and computer readable storage medium
CN108536739A (en) * 2018-03-07 2018-09-14 中国平安人寿保险股份有限公司 The recognition methods of metadata sensitive information field, device, equipment and storage medium
CN108563961A (en) * 2018-04-13 2018-09-21 中国民航信息网络股份有限公司 The recognition methods of data desensitization platform sensitive data, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination