CN108776762B - Data desensitization processing method and device - Google Patents

Data desensitization processing method and device

Info

Publication number
CN108776762B
CN108776762B CN201810586230.9A CN201810586230A CN108776762B CN 108776762 B CN108776762 B CN 108776762B CN 201810586230 A CN201810586230 A CN 201810586230A CN 108776762 B CN108776762 B CN 108776762B
Authority
CN
China
Prior art keywords
data
target data
word
desensitization
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810586230.9A
Other languages
Chinese (zh)
Other versions
CN108776762A (en
Inventor
林鸿
欧阳红
袁葆
江再玉
赵加奎
熊根鑫
王宇坤
于喻
宋振世
王奕
郑倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Beijing China Power Information Technology Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Beijing China Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, Beijing China Power Information Technology Co Ltd
Priority to CN201810586230.9A
Publication of CN108776762A
Application granted
Publication of CN108776762B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a data desensitization processing method and device. The method determines the type of target data; calls the corresponding sub-word libraries in a word segmentation reference word library according to the type of the target data and performs word segmentation using the word segmentation method corresponding to that type; and determines a desensitization method for the target data according to its type and length, then applies that method to the sensitive data obtained from the word segmentation. Segmenting the target data yields data with a certain structure, so that desensitization is applied to the parts carrying the main sensitive information and all or most of the sensitive information is masked. This improves the effectiveness of data desensitization, safeguards data assets, protects client information to the maximum extent, and avoids client information leakage through abnormal query, export and similar channels.

Description

Data desensitization processing method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a data desensitization processing method and device.
Background
To meet the requirements of the national network security law on protecting sensitive client information, to safeguard the data assets of power marketing clients and to protect their legal rights and interests, data desensitization is performed on the sensitive information of power marketing clients. The aim is to protect power client information to the maximum extent while meeting normal business requirements, and to avoid leakage of that information through abnormal query, export and similar channels.
At present, the main power marketing data desensitization rules mostly use mask desensitization, which retains part of the information and keeps the length of the information unchanged. The main rules are as follows:
(1) Contact address
Format: not fixed; a character string of indefinite length.
Desensitization rule: retention is stepped by length. For a length of 5 characters or fewer, the 1st character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained; for a length of 10 characters or more, the 4 characters before the last 5 characters are hidden. Each hidden character is replaced with a mask character.
(2) Enterprise-type username
Format: the enterprise-type username is consistent with the business license; it is a company name consisting of several Chinese characters.
Desensitization rule: retention is stepped by length. For a length of 4 characters or fewer, 1 character is retained at each of the head and the tail; for a length of 5 to 6 characters, 2 characters are retained at each of the head and the tail; for an odd length of 7 characters or more, the middle 3 characters are hidden; for an even length of 8 characters or more, the middle 4 characters are hidden. Each hidden character is replaced with a mask character.
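The two existing length-stepped mask rules above can be illustrated with a minimal Python sketch. The function names and the mask character `*` are assumptions made for illustration; for the 6 to 9 character case the sketch assumes that everything except the last 5 characters is hidden.

```python
MASK = "*"  # assumed mask character; the rules only say hidden characters are replaced


def mask_contact_address(s: str) -> str:
    """Existing contact-address rule, stepped by length (hypothetical helper)."""
    n = len(s)
    if n <= 5:
        # keep the 1st character and the last 2 characters
        keep = {0, n - 2, n - 1}
    elif n <= 9:
        # keep only the last 5 characters (assumption: the rest is hidden)
        keep = set(range(n - 5, n))
    else:
        # hide the 4 characters immediately before the last 5
        keep = set(range(n)) - set(range(n - 9, n - 5))
    return "".join(c if i in keep else MASK for i, c in enumerate(s))


def mask_enterprise_name(s: str) -> str:
    """Existing enterprise-type username rule, stepped by length (hypothetical helper)."""
    n = len(s)
    if n <= 4:
        hidden = range(1, n - 1)          # keep 1 character at head and tail
    elif n <= 6:
        hidden = range(2, n - 2)          # keep 2 characters at head and tail
    else:
        width = 3 if n % 2 else 4         # odd length: middle 3; even length: middle 4
        start = (n - width) // 2
        hidden = range(start, start + width)
    hidden = set(hidden)
    return "".join(MASK if i in hidden else c for i, c in enumerate(s))
```

Because both rules look only at character positions, whichever characters happen to fall outside the hidden window survive unchanged, which is the weakness discussed next.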
The main disadvantage of the existing power marketing data desensitization rules is as follows:
After the electricity address and enterprise-type username data are desensitized according to the current rules, the keywords are not masked and remain in the output. For example, under the desensitization rule for enterprise-type usernames, sensitive information may still exist in the desensitized username, part of the keywords is retained, and the desensitization effect is not obvious. For example: Qingdao Huifeng Motor manufacturing company -> Qingdao Huifeng x company Limited; islands two zero two business services, Inc. -> Islands two X A Business, Inc.
Similar problems exist with the desensitization rule for contact addresses, for example: 2-1-101 of the north Jue of the Shanchuan avenue in the middle of Jinan City in Shandong province -> 1-101 of the north Jue of the Shanchuan ave in the middle of Jinan City in Shandong province.
Disclosure of Invention
In view of the above, the invention discloses a processing method and a device for data desensitization, which are used for performing word segmentation on target data by calling a word segmentation reference word bank before data desensitization so as to realize more effective data desensitization.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
A data desensitization processing method, comprising:
determining the type of the target data;
calling the corresponding sub-word libraries in a word segmentation reference word library according to the type of the target data, and performing word segmentation using the word segmentation method corresponding to the type of the target data; and
determining a desensitization method for the target data according to the type of the target data and the length of the target data, and using that desensitization method to desensitize the sensitive data obtained after word segmentation of the target data.
Optionally, the method further includes:
constructing a word segmentation reference word library, wherein the word segmentation reference word library comprises a plurality of sub-word libraries, and each sub-word library comprises one type of sensitive word.
Optionally, when the type of the target data is an electricity address, the calling of the corresponding sub-word libraries in the word segmentation reference word library according to the type of the target data, and the performing of word segmentation using the word segmentation method corresponding to the type of the target data, include:
calling a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative district division set sub-word library, and performing word segmentation on the target data using the maximum forward matching Chinese word segmentation method.
Optionally, when the type of the target data is an enterprise-type username, the calling of the corresponding sub-word libraries in the word segmentation reference word library according to the type of the target data, and the performing of word segmentation using the word segmentation method corresponding to the type of the target data, include:
calling the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library, and performing word segmentation using the bidirectional maximum matching Chinese word segmentation method.
Optionally, before the determining of a desensitization method for the target data according to the type of the target data and the length of the target data, the method further includes:
calculating the accuracy of the word segmentation result of the target data;
judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if so, executing the determining of the desensitization method for the target data according to the type of the target data and the length of the target data;
if not, performing word segmentation on the target data based on a hidden Markov model, and then executing the determining of the desensitization method for the target data according to the type of the target data and the length of the target data.
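The optional accuracy check can be pictured as a small control-flow wrapper. The following Python sketch is illustrative only: the text does not say how the accuracy is computed or how the hidden Markov model is trained, so `dictionary_segment`, `hmm_segment` and `accuracy` are hypothetical callables supplied by the caller, and the toy stand-ins exist only to make the sketch runnable.

```python
from typing import Callable, List


def segment_with_fallback(
    text: str,
    dictionary_segment: Callable[[str], List[str]],
    hmm_segment: Callable[[str], List[str]],
    accuracy: Callable[[str, List[str]], float],
    first_preset_value: float,
) -> List[str]:
    """Keep the dictionary-based result if it is accurate enough, else re-segment."""
    tokens = dictionary_segment(text)
    if accuracy(text, tokens) > first_preset_value:
        return tokens
    # accuracy not above the first preset value: fall back to the HMM segmenter
    return hmm_segment(text)


# Toy stand-ins, for illustration only (not part of the described method).
def toy_dictionary_segment(text: str) -> List[str]:
    return list(text)  # splits into single characters


def toy_hmm_segment(text: str) -> List[str]:
    return [text]  # returns the whole string as one token


def toy_accuracy(text: str, tokens: List[str]) -> float:
    # assumed metric: share of characters covered by multi-character tokens
    covered = sum(len(t) for t in tokens if len(t) > 1)
    return covered / len(text) if text else 1.0
```

In either branch the resulting token list is then handed to the desensitization step.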
Optionally, when the type of the target data is an electricity address, the determining of a desensitization method for the target data according to the type and the length of the target data, and the desensitizing of the sensitive data obtained after word segmentation of the target data using that method, include:
judging whether the length of the target data is larger than a second preset value;
when the length of the target data is larger than the second preset value, determining that the desensitization method of the target data is a first electricity address data desensitization method;
using the first electricity address data desensitization method, extracting the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, the rest being the remaining data;
retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data;
when the length of the target data is not larger than the second preset value, determining that the desensitization method of the target data is a second electricity address data desensitization method;
using the second electricity address data desensitization method, extracting the retained part of the target data according to the length of the target data and a first stepped retention rule, and masking the remaining part of the target data to obtain the desensitized target data.
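A minimal Python sketch of the address branch follows. It assumes that each segmented token already carries a label saying whether it is province, city, district or county data (`"region"`), house number data (`"house_number"`) or anything else (`"other"`); these labels, the function name and the `stepped_rule` callable standing in for the first stepped retention rule are hypothetical, since the text does not fix them.

```python
from typing import Callable, List, Tuple

MASK = "*"  # assumed mask character
Token = Tuple[str, str]  # (text, label); the labels are hypothetical


def desensitize_address(
    tokens: List[Token],
    second_preset_value: int,
    stepped_rule: Callable[[str], str],
) -> str:
    text = "".join(t for t, _ in tokens)
    if len(text) <= second_preset_value:
        # second method: stepped retention by length, supplied by the caller
        return stepped_rule(text)
    # first method: keep region data and the last 5 characters of the house number
    out = []
    for t, label in tokens:
        if label == "region":
            out.append(t)
        elif label == "house_number":
            out.append(MASK * max(len(t) - 5, 0) + t[-5:])
        else:
            out.append(MASK * len(t))
    return "".join(out)
```

With placeholder tokens such as `[("AA", "region"), ("BBB", "region"), ("CCCC", "other"), ("1234567", "house_number")]` and a second preset value of 10, the street-level token is fully masked while the region tokens and the tail of the house number remain readable.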
Optionally, when the type of the target data is an enterprise-type username, the determining of a desensitization method for the target data according to the type and the length of the target data, and the desensitizing of the sensitive data obtained after word segmentation of the target data using that method, include:
judging whether the length of the target data is larger than a third preset value;
when the length of the target data is larger than the third preset value, determining that the desensitization method of the target data is a first enterprise-type username data desensitization method;
using the first enterprise-type username data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, the rest of each being the remaining trade name data and the remaining industry data;
masking the remaining trade name data and the remaining industry data, and retaining the other data of the target data to obtain the desensitized target data;
when the length of the target data is not larger than the third preset value, determining that the desensitization method of the target data is a second enterprise-type username data desensitization method;
using the second enterprise-type username data desensitization method, extracting the retained part of the target data according to the length of the target data and a second stepped retention rule, and masking the remaining part of the target data to obtain the desensitized target data.
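The enterprise-type username branch can be sketched in the same way. The Python below is a sketch under stated assumptions: tokens are assumed to be labelled as `"region"`, `"trade_name"`, `"industry"` or `"organization"` by the segmentation step, and `stepped_rule` is a caller-supplied stand-in for the second stepped retention rule; none of these names come from the text.

```python
from typing import Callable, List, Tuple

MASK = "*"  # assumed mask character
Token = Tuple[str, str]  # (text, label); the labels are hypothetical


def desensitize_enterprise_name(
    tokens: List[Token],
    third_preset_value: int,
    stepped_rule: Callable[[str], str],
) -> str:
    text = "".join(t for t, _ in tokens)
    if len(text) <= third_preset_value:
        # second method: stepped retention by length, supplied by the caller
        return stepped_rule(text)
    out = []
    for t, label in tokens:
        if label == "trade_name":
            # keep only the first character of the trade name
            out.append(t[:1] + MASK * (len(t) - 1))
        elif label == "industry":
            # keep only the last character of the industry part
            out.append(MASK * (len(t) - 1) + t[-1:])
        else:
            # region and organization-form tokens are retained
            out.append(t)
    return "".join(out)
```

Unlike the purely positional rule, the mask here follows token boundaries, so the distinguishing trade name and industry keywords are hidden regardless of where they sit in the string.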
A processing apparatus for data desensitization, comprising:
a type determination unit for determining a type of the target data;
the first word segmentation processing unit is used for calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;
and the desensitization processing unit is used for determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data.
Optionally, the apparatus further comprises:
a word library building unit for constructing a word segmentation reference word library, wherein the word segmentation reference word library comprises a plurality of sub-word libraries, and each sub-word library comprises one type of sensitive word.
Optionally, when the type of the target data is an electricity address, the first word segmentation processing unit is specifically configured to:
call a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative district division set sub-word library, and perform word segmentation on the target data using the maximum forward matching Chinese word segmentation method.
Optionally, when the type of the target data is an enterprise-type username, the first word segmentation processing unit is specifically configured to:
call the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library, and perform word segmentation using the bidirectional maximum matching Chinese word segmentation method.
Optionally, the apparatus further comprises:
a calculation unit for calculating the accuracy of the word segmentation result of the target data;
a judging unit for judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if so, triggering the desensitization processing unit;
if not, triggering a second word segmentation processing unit, which segments the target data based on a hidden Markov model and then triggers the desensitization processing unit.
Optionally, when the type of the target data is an electricity address, the desensitization processing unit includes:
a first judgment subunit for judging whether the length of the target data is greater than a second preset value;
a first determining subunit for determining that the desensitization method of the target data is a first electricity address data desensitization method when the length of the target data is greater than the second preset value;
a first extraction subunit for extracting, using the first electricity address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, the rest being the remaining data;
a first desensitization processing subunit for retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data;
a second determining subunit for determining that the desensitization method of the target data is a second electricity address data desensitization method when the length of the target data is not greater than the second preset value;
a second desensitization processing subunit for extracting, using the second electricity address data desensitization method, the retained part of the target data according to the length of the target data and a first stepped retention rule, and masking the remaining part of the target data to obtain the desensitized target data.
Optionally, when the type of the target data is an enterprise-type username, the desensitization processing unit includes:
a second judgment subunit for judging whether the length of the target data is greater than a third preset value;
a third determining subunit for determining that the desensitization method of the target data is a first enterprise-type username data desensitization method when the length of the target data is greater than the third preset value;
a second extraction subunit for extracting, using the first enterprise-type username data desensitization method, the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, the rest of each being the remaining trade name data and the remaining industry data;
a third desensitization processing subunit for masking the remaining trade name data and the remaining industry data, and retaining the other data of the target data to obtain the desensitized target data;
a fourth determining subunit for determining that the desensitization method of the target data is a second enterprise-type username data desensitization method when the length of the target data is not greater than the third preset value;
a fourth desensitization processing subunit for extracting, using the second enterprise-type username data desensitization method, the retained part of the target data according to the length of the target data and a second stepped retention rule, and masking the remaining part of the target data to obtain the desensitized target data.
Compared with the prior art, the invention has the following beneficial effects:
According to the data desensitization processing method and device, before desensitization the target data is segmented by calling the word segmentation reference word library, yielding data with a certain structure; desensitization is then applied to the parts carrying the main sensitive information, and all or most of the sensitive information is masked, which improves the effectiveness of data desensitization. Calling the sub-word libraries that correspond to the type of the target data, and using the word segmentation method that corresponds to that type, improves word segmentation accuracy. Determining the desensitization method according to the type and length of the target data enables differentiated desensitization of data of different types and lengths, which further improves the effectiveness of data desensitization.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a data desensitization processing method according to an embodiment of the present invention;
FIG. 2 is a diagram of a general address sub-word library disclosed in an embodiment of the present invention;
FIG. 3 is a diagram of a place name sub-word library disclosed in an embodiment of the present invention;
FIG. 4 is a diagram of a cell name sub-word library disclosed in an embodiment of the present invention;
FIG. 5 is a diagram of an administrative district division set sub-word library disclosed in an embodiment of the present invention;
FIG. 6 is a diagram of a regional set sub-word library disclosed in an embodiment of the present invention;
FIG. 7 is a diagram of an industry set sub-word library disclosed in an embodiment of the present invention;
FIG. 8 is a diagram of a company organization set sub-word library disclosed in an embodiment of the present invention;
FIG. 9 is a diagram illustrating a maximum forward matching Chinese word segmentation method according to an embodiment of the present invention;
FIG. 10 is a flowchart of a method for desensitizing processing of electrical address data according to an embodiment of the present invention;
FIG. 11 is a flowchart of a method for desensitizing enterprise-type username data according to an embodiment of the present invention;
FIG. 12 is a flow chart of another method of data desensitization processing according to the disclosed embodiments;
FIG. 13 is a schematic structural diagram of a processing apparatus for data desensitization according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present embodiment discloses a data desensitization processing method, which specifically includes the following steps:
S101: determining the type of the target data;
The target data is the data that requires desensitization; its types may include telephone data, address data, username data, bank account data and the like.
S102: calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;
Word segmentation splits a sequence of Chinese characters into individual words; that is, it recombines a continuous character sequence into a word sequence according to a certain specification.
In order to more accurately perform word segmentation on the target data, the corresponding sub-word libraries in the word segmentation reference word library are called according to the type of the target data to perform word segmentation on the target data.
It should be noted that the processing method for data desensitization further includes:
and constructing a word segmentation reference word bank.
The word segmentation reference word bank comprises a plurality of sub-word banks, and each sub-word bank comprises a type of sensitive word.
FIGS. 2 to 8 respectively show the general address sub-word library, the place name sub-word library, the cell name sub-word library, the administrative district division set sub-word library, the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library in the word segmentation reference word library.
To segment the target data more accurately, the corresponding sub-word libraries in the word segmentation reference word library are called according to the type of the target data, and word segmentation is performed using the method corresponding to that type. For example, when the type of the target data is an electricity address, the general address sub-word library, the place name sub-word library, the cell name sub-word library and the administrative district division set sub-word library are called, and the target data is segmented using the maximum forward matching Chinese word segmentation method. When the type of the target data is an enterprise-type username, the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library are called, and word segmentation is performed using the bidirectional maximum matching Chinese word segmentation method.
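The selection of sub-word libraries and segmentation method by data type amounts to a small lookup table. The Python sketch below shows one possible layout; the type keys, library keys, method names and the `build_vocabulary` helper are all hypothetical identifiers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class SegmentationPlan:
    sub_lexicons: Tuple[str, ...]  # names of the sub-word libraries to call
    method: str                    # name of the segmentation method to use


# Hypothetical keys; the mapping itself follows the two cases described above.
PLANS: Dict[str, SegmentationPlan] = {
    "electricity_address": SegmentationPlan(
        ("general_address", "place_name", "cell_name", "administrative_division"),
        "forward_maximum_matching",
    ),
    "enterprise_username": SegmentationPlan(
        ("region", "industry", "company_organization"),
        "bidirectional_maximum_matching",
    ),
}


def build_vocabulary(data_type: str, reference_lexicon: Dict[str, Set[str]]) -> Set[str]:
    """Merge the sub-word libraries selected for this data type into one word set."""
    vocabulary: Set[str] = set()
    for name in PLANS[data_type].sub_lexicons:
        vocabulary |= reference_lexicon.get(name, set())
    return vocabulary
```

Keeping the sub-word libraries separate and merging only those relevant to the data type keeps words from unrelated libraries out of the matching step.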
As shown in FIG. 9, the maximum forward matching Chinese word segmentation algorithm is used when segmenting electricity address data. The algorithm is as follows:
Several consecutive characters of the target data are matched against the word list from left to right, and a word is cut out when a match is found. There is one complication: to achieve maximum matching, the text cannot be cut at the first match. Suppose the text to be segmented is:
content[ ] = {"flood", "mountain", "street", "double", "river", "society", "district", ……}
Word list: dit[ ] = {"Changsha city", "Kaifu district", "Hongshan street", ……}
(1) Starting from content[1], when content[2] is scanned, "Hongshan" is found to be in the word list dit[ ]. It cannot be cut yet, because it is not known whether the following characters form a longer word (maximum match);
(2) continuing to content[3], the string scanned so far, a proper prefix of "Hongshan street", is found not to be a word in dit[ ]. It still cannot be determined that the previously found "Hongshan" is the longest word, because the scanned string is a prefix of dit[2];
(3) content[4] is scanned, and "Hongshan street" is found to be a word in dit[ ]. Scanning continues;
(4) when content[5] is scanned, the string scanned so far ("Hongshan street" followed by "double") is found to be neither a word in the word list nor a prefix of a word. The longest preceding word, "Hongshan street", can therefore be cut out.
It follows that a maximum-matched word can be cut only once the next scan yields neither a word nor a prefix of a word in the word list. The maximum forward matching algorithm then continues in a loop to segment the remaining text. For example, the address "Changsha city, Kaifu district, Hongshan street, twin river community, Fuyuan West road No. 199, contemporary world city third phase, building 10, unit two, 1706" is finally segmented as:
"Changsha city | Kaifu district | Hongshan street | twin river community | Fuyuan West road | 199 | number | contemporary world city third phase | 10 | building | two | unit | 1706".
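Maximum forward matching can be sketched in a few lines of Python. This sketch uses the common longest-candidate-first formulation (try the longest possible substring at the current position and shorten it until a word is found) rather than the character-by-character prefix scan walked through above; both cut the longest word in the word list that starts at the current position. The function name and the single-character fallback for unknown text are assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Segment text left to right, always cutting the longest word in the vocabulary."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shorten it one character at a time
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # assumed fallback: an unknown single character becomes its own token
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

For instance, with a vocabulary containing both "洪山" and "洪山街道", the text "洪山街道双河社区" is cut at "洪山街道" rather than at the shorter first match "洪山".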
The bidirectional maximum matching Chinese word segmentation method is used when segmenting enterprise-type username data. It first performs maximum forward matching and maximum reverse matching Chinese word segmentation separately, then compares the two results and applies a selection strategy that depends on how they differ, choosing one result to output according to a principle such as: the more coarse-grained words the better, and the fewer non-dictionary words and single-character words the better.
The maximum forward matching Chinese word segmentation algorithm has been described in detail above. The maximum reverse matching Chinese word segmentation algorithm is similar; the difference is the scanning direction, that is, substrings are taken from right to left for matching. The algorithm flow can be described as:
(1) inputting a preprocessed sentence content to be segmented, and initializing an index as content.
(2) obtaining the length of each sub-dictionary in the dictionary database;
(3) obtaining the length of the text to be segmented and comparing it with the maximum length of the longest sub-dictionary in the dictionary database; if that maximum length is larger than the length of the text still to be segmented, taking the length of the remaining character string as the maximum length; otherwise, segmenting with the maximum length;
(4) using binary search to find the sub-dictionary whose length equals the current maximum matching length; if it is found, going to (5); if not, subtracting one from the maximum length and repeating (4);
(5) obtaining the character string SubStr to be segmented and looking it up in the dictionary; if it is found, adding it to List; if not, judging whether the length of SubStr is larger than 1; if so, deleting the last character of SubStr and repeating (5); otherwise, setting a segmentation mark and going to (6);
(6) judging whether Index is larger than 1; if so, going to (3); otherwise, saving List and exiting.
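A compact Python sketch of maximum reverse matching is given below. It is a simplification under stated assumptions: the word list is a single Python set instead of length-partitioned sub-dictionaries searched by bisection, the candidate is shortened from its left end so that it always ends at the current right boundary, and the function name is hypothetical.

```python
from typing import List, Set


def reverse_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Segment text right to left, always cutting the longest word in the vocabulary."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens: List[str] = []
    end = len(text)
    while end > 0:
        # the candidate always ends at `end`; shorten it from the left
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # assumed fallback: an unknown single character becomes its own token
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens
```

On the string "abc" with the word list {"ab", "bc"}, scanning from the right cuts "bc" first, so the reverse result differs from the forward one.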
The bidirectional maximum matching algorithm combines the forward matching algorithm and the reverse matching algorithm. A character string to be segmented is first segmented separately with the maximum forward matching algorithm and the maximum reverse matching algorithm, and the two word segmentation results are compared. When the results in the two directions are identical, either one is returned; when they differ, the result with the smaller number of words is returned; when the numbers of words are equal, the reverse result is returned. The bidirectional maximum matching Chinese word segmentation algorithm comprises the following steps:
(1) inputting the sentence to be segmented;
(2) preprocessing the content, then performing word segmentation with the maximum forward matching algorithm and the maximum reverse matching algorithm respectively, and comparing the word segmentation results; going to (3) if the results are completely the same, and going to (4) if they differ;
(3) selecting either word segmentation result, outputting it, and ending the algorithm;
(4) comparing whether the numbers of words are the same; if so, selecting the reverse word segmentation result, outputting it, and ending the algorithm; otherwise, selecting the word segmentation result with the smaller number of words, outputting it, and ending the algorithm.
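The selection rule in steps (1) to (4) can be illustrated with a short sketch. It assumes a plain set of dictionary words and treats any unmatched single character as a word; the function names and the sample vocabulary are hypothetical, and preprocessing is omitted.

```python
def forward_max_match(text, vocab, max_len):
    words, i = [], 0
    while i < len(text):
        # Try the longest window first, shrinking from the right.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len):
    words, j = [], len(text)
    while j > 0:
        # Try the longest window ending at j, shrinking from the left.
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    if fwd == rev:                 # step (3): identical results
        return fwd
    if len(fwd) == len(rev):       # step (4): same word count, prefer reverse
        return rev
    return fwd if len(fwd) < len(rev) else rev   # fewer words wins
```

With the vocabulary {研究, 研究生, 生命, 的, 起源}, the string 研究生命的起源 is segmented into four words in both directions, so the reverse result 研究 | 生命 | 的 | 起源 is returned.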
S103: and determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data.
Referring to fig. 10, when the type of the target data is a power consumption address, the execution process of S103 is as follows:
S201: judging whether the length of the target data is greater than a second preset value; if yes, executing S202; if no, executing S203;
S202: determining that the desensitization method of the target data is a first electricity address data desensitization method;
S204: using the first electricity address data desensitization method, extracting the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, to obtain the remaining data;
S205: retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data;
S203: determining that the desensitization method of the target data is a second electricity address data desensitization method;
S206: using the second electricity address data desensitization method, extracting the part of the target data to be retained according to the length of the target data and a first stepped retention rule, and masking the rest of the target data to obtain the desensitized target data.
For example, electricity address data shorter than 10 characters is desensitized by the second electricity address data desensitization method, with stepped retention by length: for a length of 5 characters or fewer, the first character and the last 2 characters are retained; for a length of 6 to 9 characters, the last 5 characters are retained.
Electricity address data with a length of 10 characters or more is desensitized by the first electricity address data desensitization method. An electricity address generally comprises province, city, district/county, street/village committee/village, road, residential quarter and house number. The last 5 characters of the house number part are retained, the province, city, district and county are retained, and all other parts are replaced by X. For example:
Shizhong District, Jinan City, Shandong Province, Shanchuan Avenue, North Jujue, Third Road, Qilu Ankang Community 2-1-101 is desensitized to Shizhong District, Jinan City, Shandong Province, followed by the intermediate parts replaced by X and the retained house number characters 1-101.
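The two address rules above can be sketched as follows. This is an illustration under stated assumptions: the word segmentation result is represented as hypothetical `(text, kind)` pairs, the second preset value is taken to be 9 (so that "greater than" means 10 characters or more), and addresses of 3 characters or fewer are returned unchanged because the retained head and tail already cover them, a case the text does not address.

```python
MASK = "X"
KEPT_KINDS = {"province", "city", "district", "county"}


def mask_short_address(address):
    """Second method: stepped retention by length."""
    n = len(address)
    if n <= 3:                      # assumption: nothing is left to mask
        return address
    if n <= 5:                      # keep the first character and the last 2
        return address[0] + MASK * (n - 3) + address[-2:]
    return MASK * (n - 5) + address[-5:]   # 6 to 9 characters: keep the last 5


def mask_long_address(segments):
    """First method: keep region parts and the last 5 house number characters."""
    parts = []
    for text, kind in segments:
        if kind in KEPT_KINDS:
            parts.append(text)
        elif kind == "house_number":
            parts.append(MASK * max(len(text) - 5, 0) + text[-5:])
        else:
            parts.append(MASK * len(text))
    return "".join(parts)


def desensitize_address(segments, second_preset=9):
    address = "".join(text for text, _ in segments)
    if len(address) > second_preset:
        return mask_long_address(segments)
    return mask_short_address(address)
```

The segment kinds (`"province"`, `"house_number"` and so on) stand in for whatever labels the word segmentation step attaches; the text does not name them.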
Referring to fig. 11, when the type of the target data is an enterprise-class username, the execution process of S103 is as follows:
S301: judging whether the length of the target data is greater than a third preset value; if yes, executing S302; otherwise, executing S303;
S302: determining that the desensitization method of the target data is a first enterprise username data desensitization method;
S304: using the first enterprise username data desensitization method, extracting the first character of the word size data and the last character of the industry data from the word segmentation result of the target data, to obtain the remaining data of the word size data and the remaining data of the industry data;
S305: masking the remaining data of the word size data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;
S303: determining that the desensitization method of the target data is a second enterprise username data desensitization method;
S306: using the second enterprise username data desensitization method, extracting the part of the target data to be retained according to the length of the target data and a second stepped retention rule, and masking the rest of the target data to obtain the desensitized target data.
For example, enterprise-class username data with a length of less than 6 characters is desensitized by the second enterprise username data desensitization method, with stepped retention by length: for a length of 4 characters or fewer, 1 character is retained at each of the head and the tail; for a length of 5 to 6 characters, 2 characters are retained at each of the head and the tail.
Enterprise-class username data with a length of 6 characters or more is desensitized by the first enterprise username data desensitization method. An enterprise-class username generally comprises four parts: region, word size (the trade name), industry and company organization form. The region part at the front and the organization part at the end are retained, and the word size and industry parts are masked: the word size part retains its first character and the industry part retains its last character, and all their other characters are replaced by X. For example:
qingdao Huifeng Motor manufacturing company Limited- > Qingdao Hui xi company Limited;
islands two zero two business services, Inc. - > Islands two X A Business, Inc.
With the data desensitization processing method disclosed in this embodiment, the target data is segmented before desensitization by calling a word segmentation reference word bank, which yields data with a definite structure. Desensitization is then applied to the parts that carry the main sensitive information, so that all or most of the sensitive information is masked and the effectiveness of data desensitization is improved. The corresponding sub-word banks in the word segmentation reference word bank are called according to the type of the target data, and word segmentation is performed with the word segmentation method corresponding to that type, which improves word segmentation accuracy. The desensitization method is determined according to the type and the length of the target data, which realizes differentiated desensitization of data of different types and lengths and further improves the effectiveness of data desensitization.
Referring to fig. 12, the present embodiment discloses another data desensitization processing method, which specifically includes the following steps:
S401: determining the type of the target data;
S402: calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;
S403: calculating the accuracy of the word segmentation result of the target data;
S404: judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value; if yes, executing S405; otherwise, executing S406;
S405: determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on the sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;
S406: segmenting the target data based on a hidden Markov model, and then executing S405.
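The control flow of S402 to S406 can be sketched with injected callables. Everything here is hypothetical scaffolding: the text does not say how the accuracy of a word segmentation result is computed, so `accuracy_of` is left as a caller-supplied function, and the per-type segmenters and desensitizers are passed in as dictionaries.

```python
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]
Desensitizer = Callable[[str, List[str]], str]


def process(target: str,
            data_type: str,
            dictionary_segmenters: Dict[str, Segmenter],
            hmm_segmenter: Segmenter,
            accuracy_of: Callable[[str, List[str]], float],
            desensitizers: Dict[str, Desensitizer],
            first_preset: float) -> str:
    # S402: dictionary-based segmentation chosen by data type.
    words = dictionary_segmenters[data_type](target)
    # S403-S404: fall back to the HMM when accuracy is not above the threshold.
    if not accuracy_of(target, words) > first_preset:
        words = hmm_segmenter(target)          # S406
    # S405: type- and length-specific desensitization.
    return desensitizers[data_type](target, words)
```

The fast dictionary path runs first, and the costlier HMM path runs only when the accuracy check fails, which matches the ordering of the steps.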
Chinese word segmentation of the two types of data, enterprise-class usernames and electricity addresses, is performed with a Hidden Markov Model (HMM). When the training corpus is large enough and covers enough domains, the HMM algorithm can achieve high segmentation accuracy. The word segmentation algorithm models Chinese on the basis of manually labeled parts of speech and statistical features; that is, the model parameters are estimated (trained) from observed data (the labeled corpus). In the word segmentation stage, the model computes the probability of each candidate segmentation, and the segmentation with the highest probability is taken as the final result. The HMM is a common sequence labeling model; it handles ambiguity and unknown words well, and its results are better than those of string matching.
The hidden Markov model is a doubly stochastic process: the specific state sequence is not known, only the state transition probabilities are. That is, the state transition process of the model is not observable (hidden), while the stochastic process of observable events is a stochastic function of the hidden state transition process.
The composition of the HMM includes:
the number of states in the model, $N$;
the number of distinct symbols that may be output from each state, $M$;
the state transition probability matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of a transition from state $C_i$ to state $C_j$;
the probability distribution matrix for observing a specific symbol $O_k$ from state $C_j$, $B = \{b_j(k)\}$; the probability of observing a symbol is also called the symbol emission probability;
the probability distribution of the initial state, $\pi = \{\pi_i\}$.
In general, an HMM is denoted as a five-tuple $\mu = (C, O, A, B, \pi)$, where $C$ is the set of states, $O$ is the set of output symbols, and $\pi$, $A$ and $B$ are the probability distribution of the initial state, the state transition probabilities and the symbol emission probabilities, respectively.
Chinese word segmentation uses a corpus to train the HMM. Using the classical character labeling model, the set of four classes of labels is $C = \{B, E, M, S\}$, with the following meanings:
B: beginning of a word
E: end of a word
M: middle of a word
S: single-character word
After the corpus is labeled with the four classes of labels, an HMM can be built by statistical methods, under the assumption that the label of each character is influenced only by the label of the previous character. This yields the state transition matrix A and the symbol emission probability matrix B of the HMM, where:
Figure BDA0001689566350000131
Figure BDA0001689566350000132
In the formulas, $C = \{B, E, M, S\}$, $O$ = {character set}, and Count denotes frequency. When the entries of $B$ are calculated, data sparsity means that many characters do not appear in the training set, which produces entries of probability 0 in $B$. To fix this problem, add-one data smoothing is adopted, that is:
Figure BDA0001689566350000133
We set the initial vector $\pi = \{0.5, 0.0, 0.0, 0.5\}$ for the labels in the order B, E, M, S, because M and E cannot occur at the start of a sentence. At this point the HMM is completely built. Based on this HMM, for an observation sequence, the hidden sequence over $\{B, E, M, S\}$ is obtained using the Viterbi algorithm.
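The count-based training described above can be sketched as follows. This is a minimal illustration under assumptions: the corpus is a list of already segmented sentences, transitions are estimated as relative frequencies, and the add-one smoothing uses the common Laplace form (count + 1) / (state count + vocabulary size), which may differ from the exact expression in the equation image. The function names are hypothetical.

```python
from collections import Counter

STATES = "BEMS"


def tag_word(word):
    """Label one word with B/M/E, or S for a single-character word."""
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"


def train_hmm(segmented_sentences):
    trans, emit = Counter(), Counter()
    state_count, from_count = Counter(), Counter()
    charset = set()
    for words in segmented_sentences:
        chars = "".join(words)
        tags = "".join(tag_word(w) for w in words)
        charset.update(chars)
        for ch, tag in zip(chars, tags):
            emit[tag, ch] += 1
            state_count[tag] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev, cur] += 1
            from_count[prev] += 1

    # Transition matrix A as relative frequencies (0.0 for unseen source states).
    A = {i: {j: (trans[i, j] / from_count[i] if from_count[i] else 0.0)
             for j in STATES} for i in STATES}
    vocab_size = len(charset)

    def emission(state, ch):
        # Add-one smoothing so unseen characters never get probability 0.
        return (emit[state, ch] + 1) / (state_count[state] + vocab_size)

    pi = {"B": 0.5, "E": 0.0, "M": 0.0, "S": 0.5}   # initial vector from the text
    return pi, A, emission
```

Returning the emission probabilities as a function rather than a table keeps characters unseen in training usable at decoding time.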
The Viterbi search algorithm is:
1. Initialization: $\delta_1(i) = \pi_i b_i(O_1)$, $1 \le i \le N$,
Path variable with the highest probability:
Figure BDA0001689566350000134
2. Recursion:
Figure BDA0001689566350000135
3. Recording the backtracking path:
Figure BDA0001689566350000136
4. Termination:
Figure BDA0001689566350000141
Figure BDA0001689566350000142
Obtaining the path (state sequence) by backtracking:
Figure BDA0001689566350000143
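For reference, the steps above correspond to the standard textbook form of the Viterbi algorithm, written here in the notation of this section. This is a reading aid that assumes the equation images follow that standard form; the back-pointer symbol $\psi_t(j)$, the optimal-state symbol $q_t^*$ and the sequence length $T$ are names introduced here rather than taken from the text.

```latex
\begin{aligned}
\text{Initialization:}\quad & \delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0, \qquad 1 \le i \le N \\
\text{Recursion:}\quad & \delta_t(j) = \max_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(O_t), \qquad 2 \le t \le T,\; 1 \le j \le N \\
\text{Back-pointer:}\quad & \psi_t(j) = \operatorname*{arg\,max}_{1 \le i \le N}\bigl[\delta_{t-1}(i)\, a_{ij}\bigr] \\
\text{Termination:}\quad & P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \operatorname*{arg\,max}_{1 \le i \le N} \delta_T(i) \\
\text{Backtracking:}\quad & q_t^* = \psi_{t+1}\bigl(q_{t+1}^*\bigr), \qquad t = T-1, T-2, \dots, 1
\end{aligned}
```

Each of the $T$ time steps evaluates $N$ target states against $N$ predecessor states, which is where the $N^2 T$ operation count comes from.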
The time complexity of the Viterbi algorithm is $O(N^2 T)$. For example, for the address "1706, Unit 2, Building 10, Contemporary World City Phase III, No. 199 Fuyuan West Road, Twin River Community, Hongshan Street, Kaifu District, Changsha City" (labeled character by character in the original Chinese), the output state sequence is as follows:
“BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME”
According to the state sequence, the Chinese word segmentation is as follows:
"BME|BME|BMME|BMME|BMME|BMME|BMMMMME|BME|BME|BMME"
The final Chinese word segmentation result is as follows:
"Changsha City | Kaifu District | Hongshan Street | Twin River Community | Fuyuan West Road | No. 199 | Contemporary World City Phase III | Building 10 | Unit 2 | 1706".
The data desensitization processing method disclosed in this embodiment first segments the target data with the maximum forward matching method or the bidirectional maximum matching Chinese word segmentation method, both of which have low algorithmic complexity, so that the processing speed of word segmentation is ensured. When the accuracy of the word segmentation result is below a threshold, the hidden Markov model, which has higher algorithmic complexity but higher word segmentation accuracy, is used to segment the target data, so that the accuracy of the word segmentation result is ensured.
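The HMM fallback path, Viterbi decoding followed by conversion of the B/E/M/S labels into words, can be sketched as follows. This is a minimal illustration, not the patented implementation: it works in log space to avoid underflow, assumes a non-empty input, and takes the model as a hypothetical triple of an initial-probability dictionary `pi`, a nested transition dictionary `A` and an emission function.

```python
import math

STATES = "BEMS"


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, pi, A, emission):
    """Return the most probable B/E/M/S label string for a non-empty input."""
    delta = [{s: _log(pi[s]) + _log(emission(s, obs[0])) for s in STATES}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in STATES:
            # Best predecessor state for label j at position t.
            best = max(STATES, key=lambda i: delta[t - 1][i] + _log(A[i][j]))
            delta[t][j] = (delta[t - 1][best] + _log(A[best][j])
                           + _log(emission(j, obs[t])))
            psi[t][j] = best
    # Termination and backtracking along the stored back-pointers.
    path = [max(STATES, key=lambda s: delta[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return "".join(reversed(path))


def tags_to_words(text, tags):
    """Cut the text after every E or S label."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):           # trailing characters without a closing label
        words.append(text[start:])
    return words
```

Applied to the 39-label state sequence of the address example, `tags_to_words` produces ten words, one for each `|`-separated group.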
Referring to fig. 13, the present embodiment correspondingly discloses a processing apparatus for data desensitization, which includes:
a type determining unit 501 for determining the type of the target data;
a first word segmentation processing unit 502, configured to call a corresponding sub-word bank in a word segmentation reference word bank according to the type of the target data, and perform word segmentation by using a word segmentation method corresponding to the type of the target data;
a desensitization processing unit 503, configured to determine a desensitization method for the target data according to the type of the target data and the length of the target data, and perform desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method for the target data.
Optionally, the apparatus further comprises:
a word bank building unit, configured to build a word segmentation reference word bank, where the word segmentation reference word bank comprises a plurality of sub-word banks and each sub-word bank comprises one type of sensitive word.
Optionally, when the type of the target data is an electricity address, the first word segmentation processing unit 502 is specifically configured to:
call a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative district division set sub-word library, and perform word segmentation on the target data by adopting the maximum forward matching Chinese word segmentation method.
Optionally, when the type of the target data is an enterprise-type username, the first word segmentation processing unit 502 is specifically configured to:
and calling the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library, and performing word segmentation by adopting a bidirectional maximum matching Chinese word segmentation method.
Optionally, the apparatus further comprises:
a calculation unit, configured to calculate the accuracy of the word segmentation result of the target data;
a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if yes, triggering the desensitization processing unit;
and if not, triggering a second word segmentation processing unit, wherein the second word segmentation processing unit is used for segmenting the target data based on the hidden Markov model and triggering the desensitization processing unit.
Optionally, when the type of the target data is an electricity address, the desensitization processing unit 503 includes:
a first judgment subunit, configured to judge whether the length of the target data is greater than a second preset value;
a first determining subunit, configured to determine that the desensitization method of the target data is a first electricity address data desensitization method when the length of the target data is greater than the second preset value;
a first extraction subunit, configured to extract, by using the first electricity address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, to obtain the remaining data;
a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city, district and county data, and to mask the remaining data of the target data to obtain the desensitized target data;
a second determining subunit, configured to determine that the desensitization method of the target data is a second electricity address data desensitization method when the length of the target data is not greater than the second preset value;
a second desensitization processing subunit, configured to extract, by using the second electricity address data desensitization method, the part of the target data to be retained according to the length of the target data and a first stepped retention rule, and to mask the rest of the target data to obtain the desensitized target data.
Optionally, when the type of the target data is an enterprise-type username, the desensitization processing unit 503 includes:
the second judgment subunit is used for judging whether the length of the target data is greater than a third preset value or not;
the third determining subunit is configured to determine that the desensitization method of the target data is a first enterprise-class username data desensitization method when the length of the target data is greater than the third preset value;
a second extraction subunit, configured to extract, by using the first enterprise-class username data desensitization method, a first word of the word size data and a last word of the industry data from the word segmentation result of the target data, and obtain remaining data of the word size data and remaining data of the industry data;
the third desensitization processing subunit is used for masking the residual data of the word size data and the residual data of the industry data, reserving other data of the target data, and obtaining data after the target data is desensitized;
a fourth determining subunit, configured to determine that the desensitization method of the target data is a second enterprise-class username data desensitization method when the length of the target data is not greater than the third preset value;
and the fourth desensitization processing subunit is configured to extract, by using the second enterprise-class username data desensitization method, the reserved portion of the target data according to the length of the target data and according to a second hierarchical reservation rule, and mask the remaining portion of the target data to obtain data after the target data is desensitized.
With the data desensitization processing device disclosed in this embodiment, the target data is segmented before desensitization by calling the word segmentation reference word bank, which yields data with a definite structure. Desensitization is then applied to the parts that carry the main sensitive information, so that all or most of the sensitive information is masked and the effectiveness of data desensitization is improved. The corresponding sub-word banks in the word segmentation reference word bank are called according to the type of the target data, and word segmentation is performed with the word segmentation method corresponding to that type, which improves word segmentation accuracy. The desensitization method is determined according to the type and the length of the target data, which realizes differentiated desensitization of data of different types and lengths and further improves the effectiveness of data desensitization.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of data desensitization processing, comprising:
determining the type of the target data;
calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;
determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization treatment on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;
wherein, when the type of the target data is an electricity address, the determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data comprises: judging whether the length of the target data is greater than a second preset value; when the length of the target data is greater than the second preset value, determining that the desensitization method of the target data is a first electricity address data desensitization method; extracting, by using the first electricity address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, to obtain the remaining data; retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data; when the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is a second electricity address data desensitization method; and extracting, by using the second electricity address data desensitization method, the part of the target data to be retained according to the length of the target data and a first stepped retention rule, and masking the rest of the target data to obtain the desensitized target data.
2. The method of claim 1, further comprising:
constructing a word segmentation reference word bank, wherein the word segmentation reference word bank comprises a plurality of sub-word banks, and each sub-word bank comprises one type of sensitive word.
3. The method according to claim 1, wherein when the type of the target data is an electricity address, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data comprises:
calling a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative district division set sub-word library, and performing word segmentation on the target data by adopting the maximum forward matching Chinese word segmentation method.
4. The method according to claim 1, wherein when the type of the target data is an enterprise-class username, the calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data, and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data comprises:
and calling the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library, and performing word segmentation by adopting a bidirectional maximum matching Chinese word segmentation method.
5. The method of claim 1, wherein prior to the determining a desensitization method of the target data based on the type of the target data and the length of the target data, the method further comprises:
calculating the accuracy of the word segmentation result of the target data;
judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if yes, executing the desensitization method for the target data according to the type of the target data and the length of the target data;
and if not, performing word segmentation on the target data based on a hidden Markov model, and executing the desensitization method for the target data according to the type of the target data and the length of the target data.
6. The method according to claim 1, wherein when the type of the target data is an enterprise-class username, determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by using the desensitization method of the target data comprises:
judging whether the length of the target data is larger than a third preset value or not;
when the length of the target data is larger than the third preset value, determining that the desensitization method of the target data is a first enterprise username data desensitization method;
extracting, by using the first enterprise username data desensitization method, the first character of the word size data and the last character of the industry data from the word segmentation result of the target data, to obtain the remaining data of the word size data and the remaining data of the industry data;
masking the residual data of the word size data and the residual data of the industry data, and reserving other data of the target data to obtain desensitized data of the target data;
when the length of the target data is not larger than the third preset value, determining that the desensitization method of the target data is a second enterprise username data desensitization method;
and extracting the reserved part of the target data according to the length of the target data and a second hierarchical reservation rule by adopting the second enterprise username data desensitization method, and masking the rest part of the target data to obtain the data after the target data desensitization.
7. A processing apparatus for desensitizing data, comprising:
a type determination unit for determining a type of the target data;
the first word segmentation processing unit is used for calling a corresponding sub-word library in a word segmentation reference word library according to the type of the target data and performing word segmentation by adopting a word segmentation method corresponding to the type of the target data;
the desensitization processing unit is used for determining a desensitization method of the target data according to the type of the target data and the length of the target data, and performing desensitization processing on sensitive data obtained after word segmentation of the target data by adopting the desensitization method of the target data;
wherein, when the type of the target data is an electricity address, the desensitization processing unit includes: a first judgment subunit, configured to judge whether the length of the target data is greater than a second preset value; a first determining subunit, configured to determine that the desensitization method of the target data is a first electricity address data desensitization method when the length of the target data is greater than the second preset value; a first extraction subunit, configured to extract, by using the first electricity address data desensitization method, the last 5 characters of the house number data and the province, city, district and county data from the word segmentation result of the target data, to obtain the remaining data; a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city, district and county data, and to mask the remaining data of the target data to obtain the desensitized target data; a second determining subunit, configured to determine that the desensitization method of the target data is a second electricity address data desensitization method when the length of the target data is not greater than the second preset value; and a second desensitization processing subunit, configured to extract, by using the second electricity address data desensitization method, the part of the target data to be retained according to the length of the target data and a first stepped retention rule, and to mask the rest of the target data to obtain the desensitized target data.
8. The apparatus of claim 7, further comprising:
a word bank building unit, configured to build a word segmentation reference word bank, where the word segmentation reference word bank comprises a plurality of sub-word banks and each sub-word bank comprises one type of sensitive word.
9. The apparatus according to claim 7, wherein when the type of the target data is a power utilization address, the first word segmentation processing unit is specifically configured to:
call a general address sub-word library, a place name sub-word library, a cell name sub-word library and an administrative district division set sub-word library, and perform word segmentation on the target data by adopting the maximum forward matching Chinese word segmentation method.
10. The apparatus according to claim 7, wherein when the type of the target data is an enterprise-class username, the first word segmentation processing unit is specifically configured to:
and calling the regional set sub-word library, the industry set sub-word library and the company organization set sub-word library, and performing word segmentation by adopting a bidirectional maximum matching Chinese word segmentation method.
11. The apparatus of claim 7, further comprising:
the calculation unit is used for calculating the accuracy of the word segmentation result of the target data;
a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value;
if yes, triggering the desensitization processing unit;
and if not, triggering a second word segmentation processing unit, wherein the second word segmentation processing unit is used for segmenting the target data based on the hidden Markov model and triggering the desensitization processing unit.
12. The apparatus according to claim 7, wherein when the type of the target data is an enterprise-class username, the desensitization processing unit comprises:
a second judging subunit, configured to judge whether the length of the target data is greater than a third preset value;
a third determining subunit, configured to determine that the desensitization method of the target data is a first enterprise-class username data desensitization method when the length of the target data is greater than the third preset value;
a second extraction subunit, configured to extract, using the first enterprise-class username data desensitization method, the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, and to obtain the remaining data of the trade name data and the remaining data of the industry data;
a third desensitization processing subunit, configured to mask the remaining data of the trade name data and the remaining data of the industry data while retaining the other data of the target data, so as to obtain the desensitized target data;
a fourth determining subunit, configured to determine that the desensitization method of the target data is a second enterprise-class username data desensitization method when the length of the target data is not greater than the third preset value;
and a fourth desensitization processing subunit, configured to extract, using the second enterprise-class username data desensitization method, the reserved portion of the target data according to the length of the target data and a second hierarchical reservation rule, and to mask the remaining portion of the target data, so as to obtain the desensitized target data.
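The two enterprise-class username branches of claim 12 can be sketched as below. The sketch assumes the word segmentation result arrives as `(token, label)` pairs with hypothetical labels `region`, `trade_name`, `industry` and `organization`. The third preset value of 6 and the short-name rule in `_short_name_rule` are placeholders, because this section does not spell out the second hierarchical reservation rule.

```python
MASK = "*"


def _short_name_rule(text):
    """Placeholder for the second hierarchical reservation rule (assumed, not from the claim)."""
    n = len(text)
    if n <= 2:
        return text[:1] + MASK * (n - 1)          # keep only the first character
    return text[0] + MASK * (n - 2) + text[-1]    # keep the first and last characters


def desensitize_enterprise_name(labeled_tokens, third_preset_value=6):
    """Mask an enterprise name given its labeled word segmentation result."""
    text = "".join(token for token, _ in labeled_tokens)
    if len(text) <= third_preset_value:
        # Second method: reservation depends only on the overall length.
        return _short_name_rule(text)
    # First method: keep the first character of the trade name and the last
    # character of the industry data, mask their remainders, keep everything else.
    parts = []
    for token, label in labeled_tokens:
        if not token:
            continue
        if label == "trade_name":
            parts.append(token[0] + MASK * (len(token) - 1))
        elif label == "industry":
            parts.append(MASK * (len(token) - 1) + token[-1])
        else:
            parts.append(token)
    return "".join(parts)


long_name = [("北京", "region"), ("华电", "trade_name"),
             ("科技", "industry"), ("有限公司", "organization")]
short_name = [("华电", "trade_name"), ("公司", "organization")]

masked_long = desensitize_enterprise_name(long_name)
masked_short = desensitize_enterprise_name(short_name)
print(masked_long, masked_short)
```

Under these assumptions the ten-character name keeps its region and organization form intact, while the four-character name is reduced to its first and last characters.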
CN201810586230.9A 2018-06-08 2018-06-08 Data desensitization processing method and device Active CN108776762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810586230.9A CN108776762B (en) 2018-06-08 2018-06-08 Data desensitization processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810586230.9A CN108776762B (en) 2018-06-08 2018-06-08 Data desensitization processing method and device

Publications (2)

Publication Number Publication Date
CN108776762A (en) 2018-11-09
CN108776762B (en) 2022-01-28

Family

ID=64025970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810586230.9A Active CN108776762B (en) 2018-06-08 2018-06-08 Data desensitization processing method and device

Country Status (1)

Country Link
CN (1) CN108776762B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382457B (en) * 2018-12-28 2023-08-18 神州数码医疗科技股份有限公司 Data risk assessment method and device
CN111767565B (en) * 2019-03-15 2024-04-12 北京京东尚科信息技术有限公司 Data desensitization processing method, processing device and storage medium
CN110610196B (en) * 2019-08-14 2023-04-28 平安科技(深圳)有限公司 Desensitization method, system, computer device and computer readable storage medium
CN110532805B (en) * 2019-09-05 2023-01-24 国网山西省电力公司阳泉供电公司 Data desensitization method and device
CN110750984B (en) * 2019-10-24 2023-11-21 深圳前海微众银行股份有限公司 Command line character string processing method, terminal, device and readable storage medium
CN110851864A (en) * 2019-11-08 2020-02-28 国网浙江省电力有限公司信息通信分公司 Sensitive data automatic identification and processing method and system
CN111143884B (en) * 2019-12-31 2022-07-12 北京懿医云科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN110928931B (en) * 2020-02-17 2020-06-30 深圳市琦迹技术服务有限公司 Sensitive data processing method and device, electronic equipment and storage medium
CN112132238A (en) * 2020-11-23 2020-12-25 支付宝(杭州)信息技术有限公司 Method, device, equipment and readable medium for identifying private data
CN116719907B (en) * 2023-06-26 2024-06-11 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium
CN117272996B (en) * 2023-11-23 2024-02-27 山东网安安全技术有限公司 Data desensitization system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2998903A1 (en) * 2014-09-18 2016-03-23 Kaspersky Lab, ZAO System and method for robust full-drive encryption
CN106909630A (en) * 2017-01-26 2017-06-30 武汉奇米网络科技有限公司 Filtering sensitive words method and system based on dynamic dictionary
CN107145799A (en) * 2017-05-04 2017-09-08 山东浪潮云服务信息科技有限公司 A kind of data desensitization method and device
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN107609418A (en) * 2017-08-31 2018-01-19 深圳市牛鼎丰科技有限公司 Desensitization method, device, storage device and the computer equipment of text data
CN107885876A (en) * 2017-11-29 2018-04-06 北京安华金和科技有限公司 A kind of dynamic desensitization method rewritten based on SQL statement
CN107992771A (en) * 2017-12-20 2018-05-04 北京明朝万达科技股份有限公司 A kind of data desensitization method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750852B (en) * 2015-04-14 2018-03-09 海量云图(北京)数据技术有限公司 The discovery of Chinese address data and sorting technique
CN104731976B (en) * 2015-04-14 2018-03-30 海量云图(北京)数据技术有限公司 The discovery of private data and sorting technique in tables of data

Also Published As

Publication number Publication date
CN108776762A (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN108776762B (en) Data desensitization processing method and device
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN106909611B (en) Hotel automatic matching method based on text information extraction
JP2008243227A (en) Method and apparatus for generating template used in handwritten character recognition
CN109344263A (en) A kind of address matching method
CN109284358B (en) Chinese address noun hierarchical method and device
CN108268440A (en) A kind of unknown word identification method
Li et al. Boundary detection with BERT for span-level emotion cause analysis
Gilbert et al. A probabilistic context-free grammar for melodic reduction
Tsai et al. Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model
CN116414824A (en) Administrative division information identification and standardization processing method, device and storage medium
CN106610937A (en) Information theory-based Chinese automatic word segmentation method
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
Skylaki et al. Named entity recognition in the legal domain using a pointer generator network
Sarikaya et al. Shrinkage based features for slot tagging with conditional random fields.
CN109871536B (en) Place name recognition method and device
Kumar Saha et al. Named entity recognition in Hindi using maximum entropy and transliteration
CN115146635B (en) Address segmentation method based on domain knowledge enhancement
Wu et al. One improved model of named entity recognition by combining BERT and BiLSTM-CNN for domain of Chinese railway construction
CN116821326A (en) Text abstract generation method and device based on self-attention and relative position coding
CN112632526B (en) User password modeling and strength evaluation method based on comprehensive segmentation
Whitelaw et al. Named entity recognition using a character-based probabilistic approach
Lu et al. Learning Chinese word embeddings by discovering inherent semantic relevance in sub-characters
Wang et al. Accurate Braille-Chinese translation towards efficient Chinese input method for blind people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant