CN115310514A - Method and device for identifying target type data in mass data - Google Patents

Publication number
CN115310514A
Authority
CN
China
Prior art keywords
field
type
data
sample data
preset field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210790536.2A
Other languages
Chinese (zh)
Inventor
付彪
宋荣鑫
黄建庭
黄龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiyu Information Technology Co ltd
Original Assignee
Shanghai Qiyu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiyu Information Technology Co ltd
Priority to CN202210790536.2A
Priority to PCT/CN2022/124515 (WO2024007466A1)
Publication of CN115310514A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method, a device and electronic equipment for identifying target type data in mass data. The method comprises the following steps: respectively extracting n sample data from n fields of a data table according to a sampling rule corresponding to the type of the data table in the data warehouse; inputting the ith sample data to M interfaces of the field identification model to obtain identification results of the ith sample data on M preset field types; determining the probability that the ith field is judged as the M preset field types according to the identification results of the ith sample data on the M preset field types; and identifying a target type field according to the probability. Because the target type field is identified from the probability that the ith field is judged as the M preset field types, the identification result reflects the overall probability across the sampled values of the field rather than a single sampled value. The identification of the target type field is therefore more accurate, the protection of sensitive data is strengthened, and leakage of sensitive data is more effectively avoided.

Description

Method and device for identifying target type data in mass data
Technical Field
The invention relates to the technical field of data security processing, in particular to a method and a device for identifying target type data in mass data, electronic equipment and a computer readable medium.
Background
With the rapid development of the internet, the value and accessibility of data in massive databases such as enterprise data warehouses have greatly improved, which also brings great challenges to data security. It is therefore especially important to accurately identify valuable target type data in the data warehouse. In a data warehouse, data is mainly stored in data tables in units of fields, so the target type fields need to be accurately identified.
Currently, to identify a target type field, an application system acquires a data table through a JDBC database connection pool, determines whether a sampled value of a single field of the data table matches a predetermined target type field, and identifies the field as the target type field if it does. This approach only considers whether a sampled value of a single field matches the predetermined target type field and ignores the ratio of matched field values to the total number of samples. The accuracy of the identification result suffers, which in turn creates a risk of sensitive data leakage.
Disclosure of Invention
In view of the foregoing, the present invention is directed to a method, an apparatus, an electronic device and a computer-readable medium for identifying target type data in mass data, so as to at least partially solve at least one of the above technical problems.
In order to solve the above technical problem, a first aspect of the present invention provides a method for identifying target type data in mass data, where the method includes:
respectively extracting n sample data from n fields of the data table according to a sampling rule corresponding to the type of the data table in the data warehouse;
inputting the ith sample data to M interfaces of the field identification model to obtain identification results of the ith sample data on M preset field types; wherein: correspondingly identifying a preset field type by each interface of the field identification model;
determining the probability that the ith field is judged as M preset field types according to the recognition result of the ith sample data on the M preset field types;
identifying a target type field according to the probability;
wherein: M, n and i are all natural numbers larger than zero, and i is smaller than or equal to n.
According to a preferred embodiment of the present invention, the sample data is in json format; the input of the ith sample data to the M interfaces of the field recognition model to obtain recognition results of the ith sample data on M preset field types comprises the following steps:
disassembling ith sample data according to a json structure to obtain a plurality of primary key values of the ith sample data;
inputting each primary key value into M interfaces of a field recognition model to obtain recognition results of each primary key value on M preset field types;
and determining the identification results of the ith sample data on the M preset field types according to the identification results of each primary key value on the M preset field types.
According to a preferred embodiment of the invention, the probability $q_{ij}$ that the ith field is determined as the jth preset field type is obtained by the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples in the ith sample data whose identification result on the jth preset field type is the first identification result, $N$ is the total number of samples in the ith sample data, and j is a natural number which is greater than zero and less than or equal to M.
According to a preferred embodiment of the present invention, the method further comprises:
configuring a probability threshold value of each preset field type;
judging whether the probability that the target type field is judged to be the jth preset field type is larger than the probability threshold value of the jth preset field type or not;
if yes, marking the type of the target type field according to the jth preset field type;
and carrying out desensitization treatment on the target type field according to the type.
According to a preferred embodiment of the present invention, the data table includes: a partition table and a non-partition table; respectively extracting n sample data from n fields of the data table by adopting a first sampling rule for the non-partition table; and sequentially extracting n sample data from the n fields of each partition according to the partition sequence for the partition table.
According to a preferred embodiment of the invention, the method further comprises:
extracting a target type data table according to the target type field, and configuring a multi-level approval mechanism for the target type data table;
and/or:
desensitizing the target type field.
According to a preferred embodiment of the present invention, the preset field type includes: at least one of a name type, an identity information type, a contact information type, and an account information type.
In order to solve the above technical problem, a second aspect of the present invention provides an apparatus for identifying target type data in mass data, the apparatus comprising:
the sampling module is used for respectively extracting n sample data from n fields of the data table according to the sampling rule corresponding to the type of the data table in the data warehouse;
the first identification module is used for inputting the ith sample data to M interfaces of the field identification model to obtain identification results of the ith sample data on M preset field types; wherein: correspondingly identifying a preset field type by each interface of the field identification model;
the determining module is used for determining the probability that the ith field is determined as the M preset field types according to the recognition result of the ith sample data on the M preset field types;
the second identification module is used for identifying the target type field according to the probability;
wherein: M and n are natural numbers, and i is less than or equal to n.
According to a preferred embodiment of the present invention, the sample data is in json format; the first identification module comprises:
the disassembling module is used for disassembling the ith sample data according to the json structure to obtain a plurality of primary key values of the ith sample data;
the input module is used for inputting each primary key value into M interfaces of the field recognition model to obtain recognition results of each primary key value on M preset field types;
and the sub-determination module is used for determining the recognition results of the ith sample data on the M preset field types according to the recognition results of each primary key value on the M preset field types.
According to a preferred embodiment of the present invention, the determining module obtains the probability $q_{ij}$ that the ith field is determined as the jth preset field type according to the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples in the ith sample data whose identification result on the jth preset field type is the first identification result, $N$ is the total number of samples in the ith sample data, and j is a natural number which is greater than zero and less than or equal to M.
According to a preferred embodiment of the invention, the device further comprises:
the configuration module is used for configuring the probability threshold value of each preset field type;
the judging module is used for judging whether the probability that the target type field is judged to be the jth preset field type is larger than the probability threshold value of the jth preset field type;
the marking module is used for marking the type of the target type field according to the jth preset field type if the probability is larger than the probability threshold value of the jth preset field type;
and the desensitization module is used for performing desensitization treatment on the target type field according to the type.
According to a preferred embodiment of the present invention, the data table comprises: a partition table and a non-partition table; the sampling module is used for respectively extracting n sample data from n fields of the data table by adopting a first sampling rule for the non-partition table; and, for the partition table, sequentially extracting n sample data from the n fields of each partition according to the partition sequence.
According to a preferred embodiment of the present invention, the apparatus further comprises:
the extraction module is used for extracting a target type data table according to the target type field and configuring a multi-level examination and approval mechanism for the target type data table;
and/or:
and the processing module is used for carrying out desensitization processing on the target type field.
According to a preferred embodiment of the present invention, the preset field type includes: at least one of a name type, an identity information type, a contact information type, and an account information type.
To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:
a processor; and
a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.
To solve the above technical problems, a fourth aspect of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the above method.
In the present invention, n sample data are respectively extracted from n fields of a data table according to a sampling rule corresponding to the type of the data table in a data warehouse; the ith sample data is input to M interfaces of the field recognition model to obtain recognition results of the ith sample data on M preset field types; and the probability that the ith field is determined as the M preset field types is then determined from those recognition results. The target type field is thus identified as a whole from the probability that the ith field is determined as the M preset field types, so the identification result reflects the overall probability across the sampled values of the field rather than a single sampled value. The identification of the target type field is more accurate, which effectively avoids the leakage of sensitive data caused by insufficiently accurate target type fields in the related art.
Drawings
In order to make the technical problems solved by the present invention, the technical means adopted and the technical effects obtained more clear, the following will describe in detail the embodiments of the present invention with reference to the accompanying drawings. It should be noted, however, that the drawings described below are only illustrations of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive step.
FIG. 1 is a flowchart illustrating a method for identifying target type data in mass data according to an embodiment of the present invention;
FIG. 2 is a schematic structural framework diagram of an apparatus for identifying target type data in mass data according to an embodiment of the present invention;
FIG. 3 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 4 is a schematic diagram of one embodiment of a computer-readable medium of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The same reference numerals denote the same or similar elements, components, or portions throughout the drawings, and thus, a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
Referring to FIG. 1, FIG. 1 shows a method for identifying target type data in mass data according to the present invention. As shown in FIG. 1, the method includes:
S1, respectively extracting n sample data from n fields of a data table according to a sampling rule corresponding to the type of the data table in a data warehouse;
In this embodiment, target type field identification is mainly performed on mass data in data warehouses and large databases. Taking the data warehouse Hive as an example, a data table in Hive may be a partition table or a non-partition table. A partition table divides an original large table into data directories of different levels for storage, and each data directory corresponds to one partition. For example, the first-level data directory divides the table into a front region, a middle region and a back region, and the second-level data directory further partitions each of these regions.
In this step, it is judged whether the data table is a partition table or a non-partition table. For a non-partition table, n sample data are respectively extracted from the n fields of the data table using a first sampling rule; the first sampling rule may be a preset sampling rule, such as a LIMIT sampling rule. For a partition table, because of business continuity and security planning, the initial data of the table may be plaintext while subsequently added data is ciphertext, and the earlier data is not changed. Data from different periods of the table therefore needs to be sampled. Since the partition order reflects the chronological order of the table data, n sample data can be respectively extracted from the n fields of each partition according to the partition order. The partition order may be determined according to the level of the data directory. For example, several partitions are first extracted from the front, middle and back regions according to the first-level data directory, and then n sample data are respectively extracted from the n fields of the extracted partitions according to the second-level data directory.
Wherein: n is a natural number greater than zero, and i = 1, 2, ..., n.
In the specific extraction process, one sample data is extracted from each field, and the sample data extracted from the ith field is denoted as the ith sample data. The sample data may consist of different field values of the field.
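The two sampling rules described above can be sketched as follows. This is a minimal illustration that assumes rows are held in memory as dictionaries rather than read from Hive; the function names `sample_non_partitioned` and `sample_partitioned`, the `limit` parameter and the partition names are hypothetical and do not come from the patent.

```python
from collections import OrderedDict


def sample_non_partitioned(rows, fields, limit=3):
    """LIMIT-style sampling: take the first `limit` rows and split them by field."""
    head = rows[:limit]
    return {f: [row[f] for row in head] for f in fields}


def sample_partitioned(partitions, fields, limit=3):
    """Visit partitions in order and take up to `limit` rows from each one,
    so that both early (e.g. plaintext) and late (e.g. ciphertext) data are covered."""
    samples = {f: [] for f in fields}
    for _name, rows in partitions.items():  # insertion order = partition order
        for row in rows[:limit]:
            for f in fields:
                samples[f].append(row[f])
    return samples


fields = ["name", "phone"]
# Hypothetical partitions: an older one with plaintext and a newer one with ciphertext.
partitions = OrderedDict([
    ("dt=2020", [{"name": "Zhang San", "phone": "13800000000"}]),
    ("dt=2022", [{"name": "Li Si", "phone": "9f86d081"}]),
])
partitioned_samples = sample_partitioned(partitions, fields, limit=1)
flat_samples = sample_non_partitioned(partitions["dt=2020"], fields)
```

Sampling every partition in order is what lets the later probability calculation see values from different periods of the same field.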
In addition, in the prior art the application system acquires the data table through a JDBC database connection pool. Because the amount of data an application system can read through JDBC is limited, this cannot meet the requirement of identifying target type fields in mass data. Before this step, all data tables of the Hive source database can be obtained through a computing engine suitable for large data sets, so that computation over mass data is supported. Further, the big data engine may determine whether sensitive data has already been identified for each data table, and perform steps S1 to S4 of the present invention only on data tables for which it has not. For convenience of determination, the data tables that have already been identified may be recorded in an identified table, and whether the current data table has already been processed can be determined by querying whether the identified table contains it. The computing engine may be a Spark engine, a Flink engine, or the like.
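The bookkeeping with the identified table can be illustrated with a short sketch. It assumes the identified table is simply a list of table names held in memory; the function name `tables_to_scan` and the table names are hypothetical.

```python
def tables_to_scan(all_tables, identified_tables):
    """Return the tables that have not yet been scanned, preserving order."""
    done = set(identified_tables)
    return [t for t in all_tables if t not in done]


# Hypothetical table names; in practice these would come from the source database.
identified = ["ods.user_info"]
pending = tables_to_scan(["ods.user_info", "ods.order_info"], identified)
# After steps S1 to S4 run on a table, record it so it is skipped next time.
identified.extend(pending)
```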
S2, inputting the ith sample data to M interfaces of the field recognition model to obtain recognition results of the ith sample data on M preset field types;
In this embodiment, the field identification model may include a plurality of interfaces for inputting data, and each interface identifies one preset field type. That is, each interface judges from the input sample data whether a field is of the preset field type corresponding to that interface; if so, it outputs "1", and if not, it outputs "0". For example, the field identification model may be composed of a plurality of machine models, each of which may be regarded as an interface that identifies one preset field type. The field identification model may also be composed of a plurality of identification rules, each of which may be regarded as an interface that identifies one preset field type.
The preset field type is a preset type of target type field to be identified. M is a natural number greater than zero, and its specific value can be determined according to the number of target type field types to be identified. For example, the preset field type may include at least one of a name type, an identity information type, a contact information type, and an account information type. In one example, the field identification model includes four interfaces corresponding to four target type fields, namely a name type (such as a person name or a company name), an identity information type (such as an identity card number or a corporate taxpayer identification number), a contact information type (such as a mobile phone number or an email address), and an account information type (such as a bank card account or a member account).
In one example, the ith field is a name field in the data table, and the extracted ith sample data is: Zhang San, Li Si, Wang Wu, Information Network and Technology. The field identification model includes a name judgment interface for identifying fields of the name type, and a mobile phone number judgment interface for identifying fields of the contact information type. Zhang San, Li Si, Wang Wu, and Information Network and Technology are input into the name judgment interface of the field identification model to obtain the name type identification results of the ith sample data: 1, 1, 1, 0. The same values are input into the mobile phone number judgment interface to obtain the contact information type identification results of the ith sample data: 0, 0, 0, 0.
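A rule-based version of such a model, with one interface per preset field type, might look like the sketch below. The rules are toy assumptions made for illustration only: the surname list, the mobile number pattern (11 digits starting with 1, followed by a digit from 3 to 9) and all function names are hypothetical and are not taken from the patent.

```python
import re

# Hypothetical toy rules; a real deployment would use trained models or richer rules.
KNOWN_SURNAMES = {"Zhang", "Li", "Wang", "Zhao"}
MOBILE_RE = re.compile(r"1[3-9]\d{9}")


def is_name(value):
    """Output 1 if the value looks like a personal name, else 0."""
    parts = value.split()
    return int(len(parts) == 2 and parts[0] in KNOWN_SURNAMES)


def is_mobile(value):
    """Output 1 if the whole value matches the assumed mobile number pattern, else 0."""
    return int(MOBILE_RE.fullmatch(value) is not None)


# One interface per preset field type.
INTERFACES = {"name": is_name, "contact": is_mobile}


def recognize(sample):
    """Run every value of one field's sample through all M interfaces."""
    return {t: [judge(v) for v in sample] for t, judge in INTERFACES.items()}


sample = ["Zhang San", "Li Si", "Wang Wu", "Information Network and Technology"]
results = recognize(sample)
```

Keeping the interfaces in a dictionary keyed by field type means a new preset field type can be added without touching `recognize`.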
In practice, the transmission format of the sample data may be various, such as: XML format, json format, etc., and different transmission formats are processed in different manners in this step. Therefore, before this step, the transmission format of the sample data needs to be determined, and corresponding processing is performed according to the transmission format of the sample data.
The json format is a relatively common data transmission format and is widely used in intermediate data processing. In the existing identification process of target type fields, a json structure is required to be predefined aiming at sample data in a json format, and the requirement for processing data in any json format cannot be met. In this embodiment, if the sample data is in a json format, the sending the ith sample data to a plurality of interfaces of the field identification model, and obtaining the identification results of the ith sample data on M preset field types includes:
S21, disassembling the ith sample data according to a json structure to obtain a plurality of primary key values of the ith sample data;
wherein: a json structure is enclosed in braces "{ }" and consists of zero or more "key: value" pairs separated by English commas. In the disassembling process, the primary key value corresponding to each key inside the braces is obtained. For example, the two primary key values value1 and value2 at json positions a.b.c and a.b.d are extracted level by level. Suppose a field custum exists in the data table table_info, and the 3 extracted pieces of sample data are as follows:
{"a": {"b": {"c": "Zhang San", "d": "Information and Technology"}}}
{"a": {"b": {"c": "Li Si", "d": "Zhao Liu"}}}
{"a": {"b": {"c": "Computer Network", "d": "Network and Media"}}}
Then, after disassembling according to the json structure, the two primary key values at json positions a.b.c and a.b.d of the 3 pieces of sample data are obtained as follows:
a.b.c Zhang San    a.b.d Information and Technology
a.b.c Li Si    a.b.d Zhao Liu
a.b.c Computer Network    a.b.d Network and Media
S22, inputting each primary key value into M interfaces of the field recognition model to obtain recognition results of each primary key value on M preset field types;
For example, the 6 primary key values of the 3 pieces of sample data are sequentially input into the name type judgment interface of the field identification model, and the identification results of the two primary key values of each of the 3 pieces of sample data on the name type are as follows:
a.b.c 1 a.b.d 0
a.b.c 1 a.b.d 1
a.b.c 0 a.b.d 0
S23, determining the recognition results of the ith sample data on the M preset field types according to the recognition results of each primary key value on the M preset field types.
For example, it may be determined whether the sum of the recognition results of all primary key values in each sample data on the preset field type reaches a threshold (for example, is greater than or equal to 1); if so, the recognition result of the corresponding sample data on the preset field type is 1, otherwise it is 0. Accordingly, from the name type recognition results of the two primary key values of the 3 pieces of sample data in step S22, the name type recognition results of the 3 pieces of sample data are obtained as: 1, 1 and 0.
Through the above steps S21 to S23, processing of json data of an arbitrary format is realized.
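Steps S21 to S23 can be sketched in a few lines using only the standard `json` module. The sketch assumes the threshold is inclusive (a sample counts as a hit when at least one of its values is recognized), which is the reading that reproduces the 1, 1, 0 result in the example; `flatten`, `recognize_json_sample` and `toy_name_judge` are hypothetical names, and the judge is a stand-in for the real name interface.

```python
import json


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested JSON object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, obj


def recognize_json_sample(raw, judge, threshold=1):
    """S21-S23 for one sample and one interface: flatten, judge each value,
    then report 1 when the number of hits reaches `threshold`."""
    hits = sum(judge(str(value)) for _path, value in flatten(json.loads(raw)))
    return int(hits >= threshold)


def toy_name_judge(value):
    # Hypothetical stand-in for the name interface of the field identification model.
    return int(value in {"Zhang San", "Li Si", "Zhao Liu"})


samples = [
    '{"a": {"b": {"c": "Zhang San", "d": "Information and Technology"}}}',
    '{"a": {"b": {"c": "Li Si", "d": "Zhao Liu"}}}',
    '{"a": {"b": {"c": "Computer Network", "d": "Network and Media"}}}',
]
paths = [path for path, _ in flatten(json.loads(samples[0]))]
name_results = [recognize_json_sample(s, toy_name_judge) for s in samples]
```

Because `flatten` walks whatever nesting it finds, no json structure has to be defined in advance, which is the point of the disassembly step.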
S3, determining the probability that the ith field is judged as M preset field types according to the recognition result of the ith sample data on the M preset field types;
In one example, the probability of each field on each preset field type is determined. The probability $q_{ij}$ that the ith field is determined as the jth preset field type is obtained by the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples in the ith sample data whose identification result on the jth preset field type is the first identification result, $N$ is the total number of samples in the ith sample data, and j is a natural number which is greater than zero and less than or equal to M. For example, the first identification result means that the output of the jth interface is 1.
For example, a name field exists in the data table table_info, and 3 field values of the name field are extracted: Zhang San, Li Si, and Information Network and Technology. These values are respectively input into the name type judgment interface, the mobile phone number type judgment interface, the identity number judgment interface and the bank card number judgment interface of the field identification model to obtain the following identification results:
Name type identification result: 1, 1, 0
Mobile phone number type identification result: 0, 0, 0
Identity number identification result: 0, 0, 0
Bank card number identification result: 0, 0, 0
According to the formula
$$q_{ij} = \frac{N_{1ij}}{N}$$
the following probabilities are obtained for the name field: the probability of being identified as the name type is 2/3; the probability of being identified as the mobile phone number type is 0/3; the probability of being identified as an identity number is 0/3; and the probability of being identified as a bank card number is 0/3. For convenience of application, the result may be stored in the form of Table 1, where table is the table name, type is the preset field type, field is the table field to be identified, and score is the probability that the field to be identified is determined as the corresponding preset field type.
[Table 1 is an image in the source and is not reproduced here; per the description above, its columns are table, type, field and score.]
Table 1. Probability of the name field on the four preset field types
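The worked example for the name field can be checked with a short sketch of the ratio described above (number of samples flagged by an interface divided by the total number of samples). The function name `field_probabilities` and the type keys are hypothetical; the row layout follows the table, type, field and score columns described for Table 1, and storing the score as a float is an assumption.

```python
from fractions import Fraction


def field_probabilities(results_by_type):
    """q_ij = N_1ij / N for one field: share of sampled values that each interface flagged."""
    probs = {}
    for field_type, outputs in results_by_type.items():
        n_total = len(outputs)
        n_hits = sum(1 for o in outputs if o == 1)
        probs[field_type] = Fraction(n_hits, n_total) if n_total else Fraction(0)
    return probs


# Interface outputs for the three sampled values of the name field.
name_field_results = {
    "name": [1, 1, 0],
    "mobile": [0, 0, 0],
    "id_number": [0, 0, 0],
    "bank_card": [0, 0, 0],
}
scores = field_probabilities(name_field_results)
# Rows in the (table, type, field, score) layout described for Table 1.
rows = [("table_info", t, "name", float(q)) for t, q in scores.items()]
```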
In another example, in order to combine the identification results of a field on all preset field types and improve the accuracy of target type field identification, after the probability $q_{ij}$ that the ith field is determined as the jth preset field type is obtained by
$$q_{ij} = \frac{N_{1ij}}{N}$$
this step may further aggregate the probabilities that the ith field is determined as each preset field type, so as to obtain the probability $P_i$ that the ith field is determined as the M preset field types. For example, $P_i$ is given by:
[Formula for $P_i$: equation image BDA0003730003770000113 in the source, not reproduced]
According to the formula
[equation image BDA0003730003770000114 in the source, not reproduced]
the probability that the name field is determined as the four preset field types is obtained as 15 percent.
S4, identifying a target type field according to the probability;
The identification method of this step corresponds to the method used in step S3 to determine the probability that the ith field is judged as the M preset field types.
For example, in one example, step S3 uses
$$q_{ij} = \frac{N_{1ij}}{N}$$
to determine the probability $q_{ij}$ that the ith field is judged as the jth preset field type. In this step, a probability threshold $Q_j$ may be configured for each preset field type, and it is judged whether the probability $q_{ij}$ that the ith field is judged as the jth preset field type is greater than the probability threshold $Q_j$ of the jth preset field type; if so, the ith field is marked as a target type field.
In another example, step S3 uses
$$q_{ij} = \frac{N_{1ij}}{N}$$
to determine the probability $q_{ij}$ that the ith field is judged as the jth preset field type, and then uses
[Formula image BDA0003730003770000117 not reproduced in this text]
to obtain the probability $P_i$ that the ith field is judged as the M preset field types. A probability threshold may be configured in this step, and when the probability that the ith field is judged as the M preset field types is greater than the probability threshold, the ith field is identified as a target type field.
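The two identification variants of step S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `remark` field, and all threshold and probability values are hypothetical, and the aggregated probability P_i is taken as a precomputed input because its formula is not reproduced in this text.

```python
def targets_by_type_threshold(field_type_probs, type_thresholds):
    """First variant: field i is a target if q_ij > Q_j for some preset type j."""
    return [
        field
        for field, probs in field_type_probs.items()
        if any(q > type_thresholds[t] for t, q in probs.items())
    ]


def targets_by_overall_threshold(field_overall_probs, threshold):
    """Second variant: field i is a target if its aggregated P_i > threshold."""
    return [f for f, p in field_overall_probs.items() if p > threshold]


# Hypothetical per-type thresholds Q_j and per-field probabilities q_ij.
type_thresholds = {"name": 0.6, "mobile_number": 0.7}
field_type_probs = {
    "name": {"name": 2 / 3, "mobile_number": 0.0},
    "remark": {"name": 0.1, "mobile_number": 0.0},
}
per_type_targets = targets_by_type_threshold(field_type_probs, type_thresholds)

# Hypothetical aggregated probabilities P_i, supplied by step S3.
overall_targets = targets_by_overall_threshold({"name": 0.15, "remark": 0.02}, 0.1)
```

Both functions use a strict "greater than" comparison, matching the wording of the step.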
Further, after the target type field is identified, it can be pushed to an application system, which can process the target type field as required. For example, when the field is pushed to a query system, the query system may perform desensitization processing on the target type field before providing the query function. Exemplary desensitization processing may encrypt the data with an encryption algorithm (e.g., an asymmetric encryption algorithm or a hash algorithm) and/or mask the data.
Optionally, the present invention may further identify the type of the target type field by presetting a probability threshold of the field type, and perform corresponding desensitization processing on the target type field according to the type of the target type field. The invention may further comprise:
S51, configuring a probability threshold for each preset field type;
For example: the probability threshold for the name type is 60% and for the mobile phone number type is 70%.
S52, judging whether the probability that the target type field is judged as the jth preset field type is greater than the probability threshold of the jth preset field type;
S53, if so, marking the type of the target type field according to the jth preset field type;
For example: if the probability that the target type field is judged as the name type is 0 and the probability that it is judged as the mobile phone number type is 80%, the type of the target type field is marked as mobile phone number.
And S54, carrying out desensitization treatment on the target type field according to the type.
For example, a desensitization processing mode may be preconfigured for the type of each target type field, where the desensitization processing mode may include an encryption algorithm and/or a masking scheme. Taking masking as an example, the number of masked characters may be configured for each type, such as: name type mask: opening; phone type mask: 1521111 ·; identity card type mask: 3507 × 19100101 × 4 × 3; bank card type mask: 622****12345123456.
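A per-type desensitization dispatch along the lines of step S54 might look like the sketch below. Everything here is an assumption made for illustration: the helper names, the type keys, the choice of SHA-256 as the hash algorithm, and in particular the number of characters kept by each mask, which are not taken from the mask examples in the text.

```python
import hashlib


def mask_middle(value, keep_head, keep_tail):
    """Keep the first keep_head and last keep_tail characters, star out the rest."""
    hidden = len(value) - keep_head - keep_tail
    if hidden <= 0:
        # Value too short for the configured mask: hide all of it.
        return "*" * len(value)
    return value[:keep_head] + "*" * hidden + value[len(value) - keep_tail:]


def sha256_hex(value):
    """Replace the value with its SHA-256 digest (one possible hash algorithm)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical mapping from marked field type to desensitization mode.
DESENSITIZERS = {
    "name": lambda v: mask_middle(v, 1, 0),
    "mobile_number": lambda v: mask_middle(v, 3, 4),
    "id_number": lambda v: mask_middle(v, 4, 4),
    "bank_card_number": sha256_hex,
}


def desensitize(value, field_type):
    """Apply the mode configured for field_type; leave unmarked types unchanged."""
    handler = DESENSITIZERS.get(field_type)
    return handler(value) if handler else value


masked_phone = desensitize("13800001234", "mobile_number")
```

Keeping the modes in a lookup table means a new preset field type only needs a new entry, not a change to the query path.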
In an actual query system, Hive data is queried through big data engines such as Spark, Hive, and Presto and returned to the query system for display, and the query system periodically refreshes the previously stored recognition result set of target type fields. If a user queries a target type field of the corresponding table, the data is desensitized according to the desensitization processing mode corresponding to the type of the target type field, and the processed result is returned to the user for viewing.
After the target type field is identified, it may also be pushed to, for example, a business approval system, and the method may further include:
S501, extracting a target type data table according to the target type field;
Illustratively, the business approval system determines whether a data table is a target type data table according to the target type fields in that data table. In this embodiment, the higher the probability that the target type fields in a data table are determined to be the M preset field types, the more likely the data table is a target type data table. A threshold sensitivity probability may therefore be set, and it is judged whether the probability that all the target type fields in the data table are determined to be the M preset field types is greater than the threshold sensitivity probability; if so, the data table is a target type data table; otherwise, it is not.
S502, configuring a multi-level approval mechanism for the target type data table;
Through the hierarchical approval mechanism, the target type data table is prevented from being seen by users who do not need it, which protects data security.
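The table-level check in step S501 can be sketched as below. The sketch adopts one reading of the text, namely that every target type field in the table must have a probability above the threshold sensitivity probability; the function name, field names, and numeric values are hypothetical.

```python
def is_target_table(target_field_probs, sensitivity_threshold):
    """True if the table has target fields and each P_i exceeds the threshold."""
    if not target_field_probs:
        # A table without any target type field is not a target type table.
        return False
    return all(p > sensitivity_threshold for p in target_field_probs.values())


# Hypothetical tables: field name -> probability P_i from step S3.
customer_table = {"name": 0.8, "phone": 0.9}
order_table = {"name": 0.8, "note": 0.5}

customer_is_target = is_target_table(customer_table, 0.7)
order_is_target = is_target_table(order_table, 0.7)
```

A table flagged this way would then be routed to the multi-level approval mechanism of step S502.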
It should be noted that, for the name type, a data table may contain country names, general proper nouns, and the like; these nouns are highly repetitive and are easily mistaken for names. In order to avoid judging proper nouns such as country names and place names as names, when step S53 marks a field as the name type, the sample data of the field may be obtained from the cached sample data and deduplicated, and the deduplicated count is then divided by the total amount of all sample data in the data table to obtain a first probability; the probability that the field is judged as the name type in step S3 and the first probability are then combined by weighted average to obtain the final probability that the field is judged as the name type. This reduces the interference caused by proper nouns when the name field is identified.
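The name-type correction can be illustrated with the sketch below. It assumes that the first probability is the number of distinct sampled values divided by the total number of samples, and that the two probabilities are combined with a configurable weight; the text does not give the weights, so the default of 0.5 and the function name are hypothetical.

```python
def adjusted_name_probability(samples, q_name, total_samples, weight=0.5):
    """Weighted average of the step S3 probability and the distinct-value ratio."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    # First probability: share of distinct values among all samples.
    first_probability = len(set(samples)) / total_samples
    return weight * q_name + (1.0 - weight) * first_probability


# A column that repeats one country name scores low on distinctness.
repeated = adjusted_name_probability(["China"] * 4, q_name=0.8, total_samples=4)

# A column of four different values keeps most of its name-type probability.
distinct = adjusted_name_probability(["A", "B", "C", "D"], q_name=0.8, total_samples=4)
```

With these inputs the repetitive column drops below the 60% name threshold given earlier in the text, while the column of distinct values stays above it.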
Fig. 2 shows a device for identifying target type data in mass data according to the present invention. As shown in Fig. 2, the device includes:
the sampling module 21 is configured to extract n sample data from n fields of the data table according to a sampling rule corresponding to the type of the data table in the data warehouse;
the first identification module 22 is configured to input the ith sample data to M interfaces of the field identification model, so as to obtain identification results of the ith sample data on M preset field types; wherein: correspondingly identifying a preset field type by each interface of the field identification model;
the determining module 23 is configured to determine, according to the recognition result of the ith sample data on the M preset field types, the probability that the ith field is determined as the M preset field types;
the second identification module 24 is configured to identify a target type field according to a probability that the ith field is determined as M preset field types;
wherein: M, n and i are all natural numbers greater than zero, and i is less than or equal to n.
In one embodiment, the sample data is in json format; the first identification module 22 includes:
the disassembling module is used for disassembling the ith sample data according to the json structure to obtain a plurality of primary key values of the ith sample data;
the input module is used for inputting each primary key value into M interfaces of the field recognition model to obtain recognition results of each primary key value on M preset field types;
and the sub-determination module is used for determining the recognition results of the ith sample data on the M preset field types according to the recognition results of each primary key value on the M preset field types.
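The json handling performed by the disassembling, input, and sub-determination modules can be sketched as follows. The two regular-expression checks are toy stand-ins for interfaces of the field recognition model, and the aggregation rule (a sample counts as a match for a type if any of its first-level values matches) is an assumption, since the text does not state how the per-value results are combined.

```python
import json
import re


def first_level_values(sample_json):
    """Disassemble a json sample into the values of its first-level keys."""
    obj = json.loads(sample_json)
    if not isinstance(obj, dict):
        return [obj]
    return list(obj.values())


# Toy stand-ins for two of the M interfaces; each returns 1 or 0.
def looks_like_mobile(value):
    return int(isinstance(value, str) and re.fullmatch(r"1\d{10}", value) is not None)


def looks_like_id_number(value):
    return int(isinstance(value, str) and re.fullmatch(r"\d{17}[\dXx]", value) is not None)


INTERFACES = {"mobile_number": looks_like_mobile, "id_number": looks_like_id_number}


def recognize_json_sample(sample_json):
    """Assumed rule: a sample matches a type if any first-level value matches."""
    values = first_level_values(sample_json)
    return {t: int(any(check(v) for v in values)) for t, check in INTERFACES.items()}


result = recognize_json_sample('{"phone": "13800001234", "note": "hello"}')
```

Nested objects are passed through as single values here; a fuller version could recurse into them before calling the interfaces.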
The determining module 23 obtains the probability $q_{ij}$ that the ith field is determined as the jth preset field type through the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples of the ith sample data whose recognition result on the jth preset field type is the first recognition result, $N$ is the total number of the ith sample data, and j is a natural number greater than zero and less than or equal to M;
the determining module 23 determines the probability $P_i$ that the ith field is determined as the M preset field types as:
[Formula image BDA0003730003770000142 not reproduced in this text]
further, the apparatus further comprises:
the configuration module is used for configuring the probability threshold value of each preset field type;
the judging module is used for judging whether the probability that the target type field is judged to be the jth preset field type is larger than the probability threshold value of the jth preset field type;
the marking module is used for marking the type of the target type field according to the jth preset field type if the probability is greater than the probability threshold;
and the desensitization module is used for performing desensitization treatment on the target type field according to the type.
In one example, the data table includes: a partition table and a non-partition table. For the non-partition table, the sampling module 21 extracts n sample data from n fields of the data table by using a first sampling rule; for the partition table, it sequentially extracts n sample data from n fields of each partition in partition order.
Further, the apparatus further comprises:
the extraction module is used for extracting a target type data table according to the target type field and configuring a multi-level examination and approval mechanism for the target type data table;
and/or:
and the processing module is used for carrying out desensitization processing on the target type field.
In this embodiment, the preset field types include: at least one of a name type, an identity information type, a contact information type, and an account information type.
Those skilled in the art will appreciate that the modules in the above apparatus embodiment may be distributed in the apparatus as described, or may be located, with corresponding changes, in one or more apparatuses different from the above embodiment. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. The details described in the embodiments of the electronic device of the invention are to be regarded as supplementary for the embodiments of the method or the apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 3 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 3, the electronic apparatus 300 of the exemplary embodiment is represented in the form of a general-purpose data processing apparatus. The components of electronic device 300 may include, but are not limited to: at least one processing unit 310, at least one memory unit 320, a bus 330 connecting different electronic device components (including the memory unit 320 and the processing unit 310), a display unit 340, and the like.
The storage unit 320 stores a computer readable program, which may be a code of a source program or a read-only program. The program may be executed by the processing unit 310 such that the processing unit 310 performs the steps of various embodiments of the present invention. For example, the processing unit 310 may perform the steps shown in fig. 1.
The storage unit 320 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 3201 and/or a cache storage unit 3202, and may further include a read only memory unit (ROM) 3203. The storage unit 320 may also include a program/utility 3204 having a set (at least one) of program modules 3205, such program modules 3205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 330 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 300 may also communicate with one or more external devices 100 (e.g., a keyboard, a display, a network device, a bluetooth device, etc.), enable a user to interact with the electronic device 300 via the external devices 100, and/or enable the electronic device 300 to communicate with one or more other data processing devices (e.g., a router, a modem, etc.). Such communication may occur via input/output (I/O) interfaces 350, and may also occur via a network adapter 360 to one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet. Network adapter 360 may communicate with other modules of electronic device 300 via bus 330. It should be understood that although not shown in FIG. 3, other hardware and/or software modules may be used in electronic device 300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID electronics, tape drives, and data backup storage electronics, among others.
FIG. 4 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in FIG. 4, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely: respectively extracting n sample data from n fields of a data table according to a sampling rule corresponding to the type of the data table in the data warehouse; inputting the ith sample data to M interfaces of the field identification model to obtain identification results of the ith sample data on M preset field types, wherein each interface of the field identification model identifies one corresponding preset field type; determining the probability that the ith field is judged as the M preset field types according to the recognition results of the ith sample data on the M preset field types; and identifying a target type field according to the probability; wherein: M, n and i are natural numbers, and i is less than or equal to n.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described in the present invention may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, C++, or the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).
In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general-purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing detailed description has described certain embodiments of the invention with reference to specific aspects, embodiments and advantages thereof, it should be understood that the invention is not limited to any particular computer, virtual machine, or electronic device, as various general purpose machines may implement the invention. The above are only specific embodiments of the invention and are not intended to limit it; all modifications, changes, and equivalents that come within the spirit and scope of the invention are intended to be covered.

Claims (14)

1. A method for identifying target type data in mass data is characterized by comprising the following steps:
respectively extracting n sample data from n fields of a data table according to a sampling rule corresponding to the type of the data table in the data warehouse;
inputting ith sample data to M interfaces of the field recognition model to obtain recognition results of the ith sample data on M preset field types; wherein: correspondingly identifying a preset field type by each interface of the field identification model;
determining the probability that the ith field is judged as M preset field types according to the recognition results of the ith sample data on the M preset field types;
identifying a target type field according to the probability;
wherein: m, n and i are natural numbers larger than zero, and i is smaller than or equal to n.
2. The method of claim 1, wherein the sample data is in json format; the input of the ith sample data to the M interfaces of the field recognition model to obtain recognition results of the ith sample data on M preset field types comprises the following steps:
disassembling ith sample data according to a json structure to obtain a plurality of primary key values of the ith sample data;
inputting each primary key value into M interfaces of a field recognition model to obtain recognition results of each primary key value on M preset field types;
and determining the recognition results of the ith sample data on the M preset field types according to the recognition results of each primary key value on the M preset field types.
3. The method according to claim 1 or 2, characterized in that the probability $q_{ij}$ that the ith field is determined as the jth preset field type is obtained by the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples of the ith sample data whose recognition result on the jth preset field type is the first recognition result, $N$ is the total number of the ith sample data, and j is a natural number greater than zero and less than or equal to M.
4. The method of claim 3, further comprising:
configuring a probability threshold value of each preset field type;
judging whether the probability that the target type field is judged to be the jth preset field type is larger than the probability threshold value of the jth preset field type or not;
if yes, marking the type of the target type field according to the jth preset field type;
and carrying out desensitization processing on the target type field according to the type.
5. The method of claim 1, wherein the data table comprises: a partition table and a non-partition table; for the non-partition table, n sample data are respectively extracted from n fields of the data table by using a first sampling rule; and for the partition table, n sample data are sequentially extracted from n fields of each partition in partition order.
6. The method of claim 1, further comprising:
extracting a target type data table according to the target type field, and configuring a multi-level approval mechanism for the target type data table;
and/or:
desensitizing the target type field.
7. An apparatus for identifying target type data in mass data, the apparatus comprising:
the sampling module is used for respectively extracting n sample data from n fields of the data table according to a sampling rule corresponding to the type of the data table in the data warehouse;
the first identification module is used for inputting the ith sample data to M interfaces of the field identification model to obtain identification results of the ith sample data on M preset field types; wherein: correspondingly identifying a preset field type by each interface of the field identification model;
the determining module is used for determining the probability that the ith field is determined as the M preset field types according to the recognition result of the ith sample data on the M preset field types;
the second identification module is used for identifying the target type field according to the probability;
wherein: m, n and i are all natural numbers larger than zero, and i is smaller than or equal to n.
8. The apparatus of claim 7, wherein the sample data is in json format; the first identification module comprises:
the disassembling module is used for disassembling the ith sample data according to the json structure to obtain a plurality of primary key values of the ith sample data;
the input module is used for inputting each primary key value into M interfaces of the field recognition model to obtain recognition results of each primary key value on M preset field types;
and the sub-determination module is used for determining the identification results of the ith sample data on the M preset field types according to the identification results of each primary key value on the M preset field types.
9. The apparatus according to claim 7 or 8, wherein the determining module obtains the probability $q_{ij}$ that the ith field is determined as the jth preset field type according to the following formula:
$$q_{ij} = \frac{N_{1ij}}{N}$$
wherein: $N_{1ij}$ is the number of samples of the ith sample data whose recognition result on the jth preset field type is the first recognition result, $N$ is the total number of the ith sample data, and j is a natural number greater than zero and less than or equal to M.
10. The apparatus of claim 9, further comprising:
the configuration module is used for configuring the probability threshold value of each preset field type;
the judging module is used for judging whether the probability that the target type field is judged to be the jth preset field type is larger than the probability threshold value of the jth preset field type;
the marking module is used for marking the type of the target type field according to the jth preset field type if the probability is greater than the probability threshold;
and the desensitization module is used for performing desensitization treatment on the target type field according to the type.
11. The apparatus of claim 7, wherein the data table comprises: a partition table and a non-partition table; for the non-partition table, the sampling module respectively extracts n sample data from n fields of the data table by using a first sampling rule; and for the partition table, the sampling module sequentially extracts n sample data from n fields of each partition in partition order.
12. The apparatus of claim 7, further comprising:
the extraction module is used for extracting a target type data table according to the target type field and configuring a multi-level approval mechanism for the target type data table;
and/or:
and the processing module is used for carrying out desensitization processing on the target type field.
13. An electronic device, comprising:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.
14. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.
CN202210790536.2A 2022-07-05 2022-07-05 Method and device for identifying target type data in mass data Pending CN115310514A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210790536.2A CN115310514A (en) 2022-07-05 2022-07-05 Method and device for identifying target type data in mass data
PCT/CN2022/124515 WO2024007466A1 (en) 2022-07-05 2022-10-11 Method and apparatus for identifying target type data in mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210790536.2A CN115310514A (en) 2022-07-05 2022-07-05 Method and device for identifying target type data in mass data

Publications (1)

Publication Number Publication Date
CN115310514A true CN115310514A (en) 2022-11-08

Family

ID=83855831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210790536.2A Pending CN115310514A (en) 2022-07-05 2022-07-05 Method and device for identifying target type data in mass data

Country Status (2)

Country Link
CN (1) CN115310514A (en)
WO (1) WO2024007466A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874307A (en) * 2024-03-12 2024-04-12 北京全路通信信号研究设计院集团有限公司 Engineering data field identification method and device, electronic equipment and storage medium
CN117874307B (en) * 2024-03-12 2024-06-04 北京全路通信信号研究设计院集团有限公司 Engineering data field identification method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11941135B2 (en) * 2019-08-23 2024-03-26 International Business Machines Corporation Automated sensitive data classification in computerized databases
CN111738358B (en) * 2020-07-24 2020-12-08 支付宝(杭州)信息技术有限公司 Data identification method, device, equipment and readable medium
CN113642030B (en) * 2021-10-14 2022-02-15 广东鸿数科技有限公司 Sensitive data multi-layer identification method
CN114547675A (en) * 2022-01-28 2022-05-27 新华三大数据技术有限公司 Data identification method and device

Also Published As

Publication number Publication date
WO2024007466A1 (en) 2024-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 1118, No.4, Lane 800, Tongpu Road, Putuo District, Shanghai 200062

Applicant after: SHANGHAI QIYU INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 201500 room a1-5962, 58 Fumin Branch Road, Hengsha Township, Chongming District, Shanghai (Shanghai Hengtai Economic Development Zone)

Applicant before: SHANGHAI QIYU INFORMATION TECHNOLOGY Co.,Ltd.

Country or region before: China