CN116484420A

CN116484420A - Text desensitization processing method and device

Info

Publication number: CN116484420A
Application number: CN202310423810.7A
Authority: CN
Inventors: 许江峰; 丰瑾
Original assignee: Postal Savings Bank of China Ltd
Current assignee: Postal Savings Bank of China Ltd
Priority date: 2023-04-19
Filing date: 2023-04-19
Publication date: 2023-07-25

Abstract

The invention discloses a text desensitization processing method and device. The method comprises the following steps: performing word segmentation on a target text to obtain a plurality of words, wherein the target text is a text that needs desensitization processing; matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words; determining an identification code of at least one user in the target text, wherein the identification code is used for uniquely identifying the user; determining a hash value of the identification code of the user, and taking the hash value as a key of the at least one sensitive word; determining a desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word; and replacing the at least one sensitive word with the desensitization word to desensitize the target text. The invention solves the technical problem that data management methods in the related art have low reliability and easily cause data leakage.

Description

Text desensitization processing method and device

Technical Field

The invention relates to the technical field of data security protection, in particular to a text desensitization processing method and device.

Background

Currently, each bank has accumulated a large number of documents containing sensitive information such as bank card numbers, customer names, cell phone numbers, and addresses. At the same time, data security requirements are continually increasing. However, managing and storing such data carries security risks, for example the risk that sensitive data is leaked, and the related art provides no safe and reliable countermeasure.

In view of the problem that the data leakage is easily caused by the low reliability of the data management method in the related art, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a text desensitization processing method and device, which at least solve the technical problems that the data management mode in the related technology is low in reliability and easy to cause data leakage.

According to an aspect of an embodiment of the present invention, there is provided a text desensitization processing method, including: word segmentation processing is carried out on a target text to obtain a plurality of words, wherein the target text is a text needing desensitization processing; matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target text; determining an identification code of at least one user in the target text, wherein the identification code is used for uniquely identifying the user; determining a hash value of the identification code of the user, and taking the hash value as a key of the at least one sensitive word; determining a desensitization word of the at least one sensitive word based on a key of the at least one sensitive word and a length of the at least one sensitive word; and replacing the at least one sensitive word with the desensitization word to desensitize the target text.

Optionally, before the word segmentation processing is performed on the target text, the method further includes: performing a cleaning process on the target text, wherein the cleaning process comprises at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing and special symbol identification processing.

Optionally, word segmentation processing is performed on the target text, including one of the following: word segmentation processing is carried out on the target text by using a bidirectional maximum matching method; word segmentation processing is carried out on the target text by using a hidden Markov algorithm; and performing word segmentation processing on the target text by using a conditional random field mode.

Optionally, matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words includes: performing part-of-speech tagging on each of the plurality of words in a preset mode to obtain a plurality of tagged words, wherein the preset mode comprises one of the following: a hidden Markov algorithm and a conditional random field mode; determining the part of speech of the sensitive word in the target text; determining, among the plurality of tagged words, the partial words having the same part of speech; and matching the partial words with the sensitive information word stock to obtain the at least one sensitive word in the plurality of words.

Optionally, determining the desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word includes: selecting at least one character based on a length of each of the at least one sensitive word; symmetrically encrypting the at least one sensitive word according to the secret key and the at least one character to obtain the at least one symmetrically encrypted sensitive word; determining zero width characters of the hash value; and adding the zero width character to the at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.

Optionally, adding the zero-width character to the key after symmetric encryption to obtain a desensitization word of the at least one sensitive word includes: adding the zero-width character to the key after symmetric encryption to obtain an initial desensitization word of the at least one sensitive word, and adding a watermark to the initial desensitization word to obtain the desensitization word of the at least one sensitive word.

Optionally, the text desensitization processing method further includes: extracting watermark information in the target text after desensitization processing during data tracing to obtain the target text with the watermark removed; and decoding the zero width character from the target text from which the watermark is removed, and recovering to obtain the identification code.

Optionally, the text desensitization processing method further includes: during data tracing, parsing the desensitized at least one sensitive word by using the key; and comparing the parsed at least one sensitive word with the at least one sensitive word before desensitization, and recovering the identification code.

Optionally, replacing the at least one sensitive word with the desensitized word includes: determining the position of the at least one sensitive word in the target text; locating the at least one sensitive word in the target text according to the location; and replacing the at least one sensitive word obtained by positioning by using the desensitization word on the basis of the original text format of the target text.

According to another aspect of the embodiment of the present invention, there is also provided a text desensitization processing apparatus, including: the word segmentation unit is used for carrying out word segmentation processing on a target text to obtain a plurality of words, wherein the target text is a text which needs desensitization processing; the matching unit is used for matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target text; a first determining unit, configured to determine an identification code of at least one user in the target text, where the identification code is used to uniquely identify the user; a second determining unit, configured to determine a hash value of the identification code of the user, and use the hash value as a key of the at least one sensitive word; a third determining unit configured to determine a desensitization word of the at least one sensitive word based on a key of the at least one sensitive word and a length of the at least one sensitive word; and the desensitization unit is used for replacing the at least one sensitive word with the desensitization word so as to desensitize the target text.

Optionally, before word segmentation processing is performed on the target text, the text desensitization processing apparatus further includes: a cleaning unit, configured to perform a cleaning process on the target text, where the cleaning process includes at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing and special symbol identification processing.

Optionally, the word segmentation unit includes one of the following: the first word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a bidirectional maximum matching method; the second word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a hidden Markov algorithm; and the third word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a conditional random field mode.

Optionally, the matching unit includes: the labeling module is used for performing part-of-speech tagging on each word in the plurality of words in a preset mode to obtain a plurality of tagged words, wherein the preset mode comprises one of the following: a hidden Markov algorithm and a conditional random field mode; the first determining module is used for determining the part of speech of the sensitive word in the target text; the second determining module is used for determining, among the plurality of tagged words, the partial words having the same part of speech; and the matching module is used for matching the partial words with the sensitive information word stock to obtain the at least one sensitive word in the plurality of words.

Optionally, the third determining unit includes: a selection module for selecting at least one character based on a length of each of the at least one sensitive word; the encryption module is used for symmetrically encrypting the at least one sensitive word according to the secret key and the at least one character to obtain the at least one symmetrically encrypted sensitive word; a third determining module, configured to determine a zero-width character of the hash value; and the acquisition module is used for adding the zero-width character to the at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.

Optionally, the acquisition module includes: an adding sub-module, configured to add a watermark to the initial desensitization word, after the zero-width character has been added to the key after symmetric encryption to obtain the initial desensitization word of the at least one sensitive word, so as to obtain the desensitization word of the at least one sensitive word.

Optionally, the text desensitization processing apparatus further includes: the extraction module is used for extracting watermark information in the target text after the desensitization processing during data tracing to obtain the target text from which the watermark is removed; and the decoding module is used for decoding the zero width character from the target text from which the watermark is removed, and recovering the zero width character to obtain the identification code.

Optionally, the text desensitization processing apparatus further includes: the analysis module is used for parsing the desensitized at least one sensitive word by using the key during data tracing; and the comparison module is used for comparing the parsed at least one sensitive word with the at least one sensitive word before desensitization, and recovering the identification code.

Optionally, the desensitizing unit comprises: a fourth determining module, configured to determine a location of the at least one sensitive word in the target text; a positioning module for positioning the at least one sensitive word in the target text according to the position; and the replacing module is used for replacing the at least one sensitive word obtained by positioning with the desensitization word on the basis of the original text format of the target text.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium including a stored program, wherein the program performs any one of the above text desensitization processing methods.

According to another aspect of the embodiment of the present invention, there is provided a processor, where the processor is configured to execute a program, and when the program is executed, perform any one of the text desensitization processing methods described above.

In the embodiment of the invention, word segmentation is performed on a target text to obtain a plurality of words, wherein the target text is a text that needs desensitization; the plurality of words are matched with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is a word stock constructed in advance based on a plurality of sample texts and comprising a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the field corresponding to the target text; an identification code of at least one user in the target text is determined, wherein the identification code is used for uniquely identifying the user; a hash value of the identification code of the user is determined and taken as a key of the at least one sensitive word; a desensitization word of the at least one sensitive word is determined based on the key of the at least one sensitive word and the length of the at least one sensitive word; and the at least one sensitive word is replaced with the desensitization word to desensitize the target text. With the technical scheme provided by the embodiment of the invention, the desensitization word is determined according to the hash value of the identification code of the user extracted from the target text and the length of the sensitive word in the target text, and the sensitive word in the target text is desensitized using the desensitization word. This improves data security and solves the technical problem that data management in the related art has low reliability and easily causes data leakage.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a flow chart of a text desensitization processing method according to an embodiment of the present invention;

FIG. 2 is a flow chart of an alternative text desensitization processing method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a text desensitization processing apparatus according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, some of the terms or terminology present in the embodiments of the invention are explained below:

Natural language processing (NLP for short) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it integrates linguistics, computer science, and mathematics. Through natural language processing, a text file can be segmented into words and words of specified types can be extracted.

Data watermarking: a technique for embedding a specific digital signal into a digital product in order to protect its copyright and integrity, prevent copying, or track the product.

In order to further improve the safety and usability of data, the method and the device of the invention perform desensitization processing on text files, which reduces the risk of sensitive data leakage. In addition, the invention can trace a file through the special watermark information formed during desensitization, so that the source of a data leak can be traced effectively, the standardization and compliance level of data use management is improved, and potential safety hazards such as sensitive data leakage are prevented. The text desensitization processing method is described below with reference to specific examples.

In accordance with an embodiment of the present invention, there is provided a method embodiment of a text desensitization processing method, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.

Fig. 1 is a flowchart of a text desensitization processing method according to an embodiment of the present invention, as shown in fig. 1, including the steps of:

step S102, word segmentation processing is carried out on a target text to obtain a plurality of words, wherein the target text is a text which needs desensitization processing.

Optionally, the target text is a text whose sensitive words need to be desensitized.

Step S104, matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target texts.

Optionally, the sensitive information word stock may be constructed in advance from a plurality of sample texts and includes a plurality of reference sensitive words. The sample texts may be texts in the same field as the target text. Searching among sensitive words of the same field improves the accuracy and efficiency of sensitive word matching.

For example, a dictionary for finding sensitive words (i.e., the sensitive information word stock) may be constructed by regular expressions, manual annotation, or the like, and used for identifying sensitive words in a document, such as bank card numbers, customer names, cell phone numbers, and addresses.
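As a minimal sketch of such a dictionary, assuming Python and a word list that has already been produced by segmentation: the two patterns (an 11-digit mobile number and a 16 to 19 digit card number) and the names `SENSITIVE_PATTERNS`, `ANNOTATED_WORDS`, and `find_sensitive_words` are illustrative assumptions and are not taken from the source.

```python
import re

# Hypothetical regular-expression rules; real rules would be tuned to the field.
SENSITIVE_PATTERNS = {
    "mobile_number": re.compile(r"1[3-9]\d{9}"),
    "bank_card_number": re.compile(r"\d{16,19}"),
}

# Hypothetical manually annotated entries, e.g. customer names.
ANNOTATED_WORDS = {"Zhang San"}


def find_sensitive_words(words):
    """Return (index, word, category) for every word that matches the dictionary."""
    hits = []
    for index, word in enumerate(words):
        if word in ANNOTATED_WORDS:
            hits.append((index, word, "annotated"))
            continue
        for category, pattern in SENSITIVE_PATTERNS.items():
            # fullmatch requires the whole word to match, not just a substring.
            if pattern.fullmatch(word):
                hits.append((index, word, category))
                break
    return hits
```

Keeping the word index alongside each hit makes it possible to map a match back to its position in the text later.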

Step S106, determining an identification code of at least one user in the target text, wherein the identification code is used for uniquely identifying the user.

Optionally, the identification code is information extracted from the target text and used for uniquely identifying the user, for example, information such as an identification card number of the user, a code of the user, and the like.

Step S108, determining a hash value of the identification code of the user, and taking the hash value as a key of at least one sensitive word.

In this embodiment, a hash value may be generated according to the identification code of the user, and the generated hash value is used as the key of at least one sensitive word.
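The source does not name a hash function. The following sketch assumes SHA-256 from the Python standard library, and `derive_key` is a hypothetical helper name used only for illustration.

```python
import hashlib


def derive_key(identification_code: str) -> bytes:
    """Hash a user's identification code; the 32-byte digest serves as the key.

    SHA-256 is an assumption; any agreed hash function could be substituted.
    """
    return hashlib.sha256(identification_code.encode("utf-8")).digest()
```

Because the hash is deterministic, the same identification code always yields the same key, which is what later allows a desensitized word to be traced back to a user.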

Step S110, determining a desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word.

In this embodiment, the desensitization word of the at least one sensitive word may be determined according to the key of the at least one sensitive word and the length of the at least one sensitive word, so as to ensure the security of desensitizing the sensitive word.

And step S112, replacing at least one sensitive word with the desensitization word to desensitize the target text.

From the above, in the embodiment of the present invention, the target text may be subjected to word segmentation processing to obtain a plurality of words, where the target text is a text that needs to be subjected to desensitization processing; the plurality of words are matched with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is a word stock constructed in advance based on a plurality of sample texts and comprising a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the field corresponding to the target text; an identification code of at least one user in the target text is determined, wherein the identification code is used for uniquely identifying the user; a hash value of the identification code of the user is determined and taken as a key of the at least one sensitive word; a desensitization word of the at least one sensitive word is determined based on the key of the at least one sensitive word and the length of the at least one sensitive word; and the at least one sensitive word is replaced with the desensitization word to desensitize the target text. In this way, the desensitization word is determined according to the hash value of the identification code of the user extracted from the target text and the length of the sensitive word in the target text, the sensitive word in the target text is desensitized using the desensitization word, and data security is improved.

Therefore, the text desensitization processing method provided by the embodiment of the invention solves the technical problems that the data management mode in the related technology is low in reliability and easy to cause data leakage.

According to the above embodiment of the present invention, before performing word segmentation processing on the target text, the text desensitization processing method further includes: performing a cleaning process on the target text, wherein the cleaning process comprises at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing and special symbol identification processing.

In this embodiment, the target text, i.e., the text to be desensitized, may be imported into a computer for document cleaning, which mainly handles special symbols, spelling error correction, case conversion, punctuation mark conversion, and the like, so as to provide accurate text for text desensitization.
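A minimal sketch of the cleaning step, assuming Python and its standard library only. It covers case conversion, punctuation conversion (full-width forms folded by Unicode NFKC normalization), and removal of control characters as one kind of special symbol; spelling error correction is omitted because it would need a dictionary or model that the source does not specify. The name `clean_text` is hypothetical.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize a document before word segmentation."""
    # NFKC folds full-width letters, digits and punctuation to their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (category Cc) but keep newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Case conversion: lower-case everything so later matching is case-insensitive.
    return text.lower().strip()
```

For example, `clean_text("ＡＢＣ，１２３\x00  Ｘ")` returns `"abc,123 x"`.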

According to the embodiment of the invention, word segmentation processing is performed on the target text, which comprises one of the following steps: word segmentation processing is carried out on the target text by utilizing a bidirectional maximum matching method; word segmentation processing is carried out on the target text by using a hidden Markov algorithm; and performing word segmentation processing on the target text by using a conditional random field mode.

In this embodiment, the cleaned file may be segmented into words. For example, a Chinese word segmentation tool such as jieba may be used, with a rule-based word segmentation method such as the bidirectional maximum matching method, or a machine learning-based method such as HMM or CRF.
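The bidirectional maximum matching method can be sketched from scratch as follows. This is an illustration under assumptions, not the tool the source would use in practice: the vocabulary is hypothetical, and the tie-breaking rule (fewer words, then fewer single-character words, otherwise the backward result) is one common heuristic that the source does not specify.

```python
def forward_max_match(text, vocab, max_len):
    """Greedily take the longest vocabulary word starting at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedily take the longest vocabulary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    """Run both directions and keep the more plausible segmentation."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(len(w) == 1 for w in ws)

    return fwd if singles(fwd) < singles(bwd) else bwd
```

With the hypothetical vocabulary `{"研究", "研究生", "生命", "命", "的", "起源"}`, the forward pass splits "研究生命的起源" into 研究生 / 命 / 的 / 起源, the backward pass into 研究 / 生命 / 的 / 起源, and the backward result is kept because it contains fewer single-character words.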

According to the embodiment of the invention, matching a plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words comprises: performing part-of-speech tagging on each of the plurality of words in a predetermined mode to obtain a plurality of tagged words, wherein the predetermined mode comprises one of the following: a hidden Markov algorithm and a conditional random field mode; determining the part of speech of sensitive words in the target text; determining, among the plurality of tagged words, the partial words having the same part of speech; and matching the partial words with the sensitive information word stock to obtain at least one sensitive word in the plurality of words.

In this embodiment, the plurality of words obtained by segmentation may be part-of-speech tagged, for example using a rule-based or statistics-based approach such as HMM, CRF, or LSTM+CRF. Because the fields related to personal sensitive information are nouns, only nouns need to be tagged. Tagging only nouns means that only nouns have to be located when the sensitive words are later desensitized, which improves text desensitization efficiency.
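The noun-only filtering can be sketched as below, assuming the tagger has already produced `(word, tag)` pairs. The tag names (`n`, `nr`, `ns`, `nt`, `nz`) follow a noun tag convention used by some Chinese taggers and are an assumption here, as are the function name and the sample lexicon.

```python
# Assumed noun tags: common noun, person name, place name, organization, other proper noun.
NOUN_TAGS = ("n", "nr", "ns", "nt", "nz")


def match_sensitive_nouns(tagged_words, lexicon, noun_tags=NOUN_TAGS):
    """Keep only nouns, then match them against the sensitive-word lexicon.

    tagged_words: list of (word, part_of_speech_tag) pairs from any tagger.
    Returns (index, word) pairs so that matches can be located later.
    """
    nouns = [
        (index, word)
        for index, (word, tag) in enumerate(tagged_words)
        if tag in noun_tags
    ]
    return [(index, word) for index, word in nouns if word in lexicon]
```

Filtering by part of speech first means that a lexicon entry that happens to appear as a verb or adjective is never looked up, which is the efficiency gain the paragraph above describes.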

According to the above embodiment of the present invention, determining the desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word may include: selecting at least one character based on a length of each of the at least one sensitive word; symmetrically encrypting at least one sensitive word according to the secret key and the at least one character to obtain at least one symmetrically encrypted sensitive word; determining zero width characters of the hash value; and adding the zero width character to at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.

In this embodiment, at least one character may be selected according to a length of each of the at least one sensitive word, so that the at least one sensitive word may be symmetrically encrypted using the key and the at least one character to obtain the symmetrically encrypted at least one sensitive word; after the zero width character of the hash value is determined, adding the zero width character to the symmetrically encrypted at least one sensitive word to obtain a desensitization word of the at least one sensitive word.
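The source names neither a cipher nor a zero-width encoding, so the following is only a sketch under loudly stated assumptions: the key is a SHA-256 digest, the "selected characters" are taken to be roughly the middle half of the word, the cipher is a toy keyed digit shift driven by an HMAC-SHA256 keystream (a stand-in for a vetted symmetric cipher such as AES, and not secure as written), and the first four bytes of the hash are encoded as zero-width characters U+200B and U+200C. All function names are hypothetical.

```python
import hashlib
import hmac

ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space = bit 0, zero-width non-joiner = bit 1
DIGITS = "0123456789"


def derive_key(identification_code: str) -> bytes:
    return hashlib.sha256(identification_code.encode("utf-8")).digest()


def select_span(length: int) -> range:
    """Choose which characters to encrypt from the word length (assumed: middle half)."""
    if length == 0:
        return range(0)
    count = max(1, length // 2)
    start = (length - count) // 2
    return range(start, start + count)


def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key with HMAC-SHA256 in counter mode."""
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:n]


def shift_digits(word: str, key: bytes, sign: int) -> str:
    """Toy symmetric cipher: sign=+1 encrypts, sign=-1 decrypts the selected digits."""
    stream = keystream(key, len(word))
    chars = list(word)
    for i in select_span(len(word)):
        if chars[i] in DIGITS:  # non-digit characters are left unchanged in this sketch
            chars[i] = str((int(chars[i]) + sign * stream[i]) % 10)
    return "".join(chars)


def to_zero_width(data: bytes) -> str:
    """Encode bytes as an invisible string, one zero-width character per bit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(ZW1 if bit == "1" else ZW0 for bit in bits)


def desensitize_word(word: str, identification_code: str) -> str:
    key = derive_key(identification_code)
    encrypted = shift_digits(word, key, +1)
    # Append an invisible 32-character tag derived from the user's hash.
    return encrypted + to_zero_width(key[:4])
```

For an 11-digit number, `select_span` picks positions 3 to 7, so the first three and last three digits stay readable while the middle five are encrypted, and the visible length of the word is unchanged.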

In an alternative embodiment, adding the zero-width character to the key after symmetric encryption to obtain the desensitization word of the at least one sensitive word includes: adding the zero-width character to the key after symmetric encryption to obtain an initial desensitization word of the at least one sensitive word, and then adding a watermark to the initial desensitization word to obtain the desensitization word of the at least one sensitive word.

In this embodiment, after the zero-width character is added to the symmetrically encrypted key to obtain the initial desensitization word of the at least one sensitive word, a watermark may be added to the initial desensitization word in order to further improve data security, thereby obtaining the desensitization word of the at least one sensitive word.

As an alternative embodiment, the text desensitizing method may further include: extracting watermark information in the target text after desensitization processing during data tracing to obtain the target text with the watermark removed; and decoding the zero width character from the target text from which the watermark is removed, and recovering to obtain the identification code.

In this embodiment, during data tracing, the watermark information is extracted from the desensitized target text to obtain the target text with the watermark removed; the zero-width character can then be decoded from the target text with the watermark removed, and the identification code is recovered.
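A sketch of the zero-width decoding step only, assuming any separate watermark has already been extracted. It further assumes that the invisible tag is the first four bytes of a SHA-256 hash encoded with U+200B for bit 0 and U+200C for bit 1, and that the identification codes were stored persistently so that a decoded tag can be looked up. None of these details are specified in the source, and the function names are hypothetical.

```python
import hashlib

ZW0, ZW1 = "\u200b", "\u200c"  # assumed encoding: bit 0 and bit 1


def to_zero_width(data: bytes) -> str:
    """Encode bytes as zero-width characters (used here to build a test input)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(ZW1 if bit == "1" else ZW0 for bit in bits)


def extract_zero_width(text: str) -> bytes:
    """Collect the zero-width characters in text and decode them back to bytes."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in text if ch in (ZW0, ZW1))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def recover_identification_code(text, known_codes, tag_len=4):
    """Return the stored identification code whose hash prefix matches the tag."""
    tag = extract_zero_width(text)[:tag_len]
    if len(tag) < tag_len:
        return None  # no complete tag embedded in the text
    for code in known_codes:
        if hashlib.sha256(code.encode("utf-8")).digest()[:tag_len] == tag:
            return code
    return None
```

Because a hash cannot be inverted, the identification code is recovered by comparing the decoded tag against the hashes of the stored codes rather than by decoding the code directly.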

As another alternative embodiment, the text desensitization processing method further includes: during data tracing, parsing the desensitized at least one sensitive word by using the key; and comparing the parsed at least one sensitive word with the at least one sensitive word before desensitization, and recovering the identification code.

In this embodiment, the key may be used to parse the desensitized at least one sensitive word during data tracing, and the parsed result is compared with the original sensitive word before desensitization to recover the identification code.
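A sketch of this second tracing route: try the key of every stored user, parse the leaked word with it, and keep the users whose result equals the original sensitive word. The cipher is the same toy keyed digit shift assumed for illustration (HMAC-SHA256 keystream over the middle half of the word, with a SHA-256 key); it stands in for whatever symmetric cipher is actually used, and the leaked word is assumed to have had its zero-width characters stripped already. All names are hypothetical.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def derive_key(identification_code: str) -> bytes:
    return hashlib.sha256(identification_code.encode("utf-8")).digest()


def select_span(length: int) -> range:
    """Assumed selection rule: the middle half of the word."""
    if length == 0:
        return range(0)
    count = max(1, length // 2)
    start = (length - count) // 2
    return range(start, start + count)


def keystream(key: bytes, n: int) -> bytes:
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:n]


def shift_digits(word: str, key: bytes, sign: int) -> str:
    """Toy symmetric cipher: sign=+1 encrypts, sign=-1 decrypts."""
    stream = keystream(key, len(word))
    chars = list(word)
    for i in select_span(len(word)):
        if chars[i] in DIGITS:
            chars[i] = str((int(chars[i]) + sign * stream[i]) % 10)
    return "".join(chars)


def trace_candidates(desensitized: str, original: str, known_codes):
    """Return every stored identification code whose key restores the original word."""
    return [
        code
        for code in known_codes
        if shift_digits(desensitized, derive_key(code), -1) == original
    ]
```

The function returns a list rather than a single code because, with short words, more than one key could in principle restore the same original; checking several sensitive words from the same document narrows the result.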

According to the above embodiment of the present invention, replacing at least one sensitive word with a desensitization word includes: determining the position of at least one sensitive word in the target text; locating at least one sensitive word in the target text according to the position; and replacing the at least one sensitive word obtained by locating with the desensitization word on the basis of the original text format of the target text.

In this embodiment, the location of the at least one sensitive word in the target text may be determined first, and then the at least one sensitive word may be located in the target text according to the determined location, and the at least one sensitive word obtained by the locating may be replaced by a desensitized word based on the original text format of the target text.
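Position-based replacement can be sketched as follows, assuming character offsets into the target text. Everything outside the replaced spans, including line breaks and spacing, is copied through unchanged, which is how the sketch keeps the original text format. `find_spans` and `replace_at_positions` are hypothetical helper names.

```python
def find_spans(text: str, word: str):
    """Return (start, end) character offsets of every occurrence of word in text."""
    if not word:
        return []
    spans, start = [], text.find(word)
    while start != -1:
        spans.append((start, start + len(word)))
        start = text.find(word, start + len(word))
    return spans


def replace_at_positions(text: str, replacements):
    """Apply (start, end, new_word) replacements without touching other text."""
    pieces, cursor = [], 0
    for start, end, new_word in sorted(replacements):
        if start < cursor:
            raise ValueError("overlapping replacement spans")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(new_word)
        cursor = end
    pieces.append(text[cursor:])  # untouched tail
    return "".join(pieces)
```

Collecting all spans first and applying them in one pass avoids the offset drift that would occur if each replacement changed the text length before the next position was used.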

FIG. 2 is a flow chart of an alternative text desensitization processing method according to an embodiment of the invention, as shown in FIG. 2, firstly, text preprocessing can be performed, specifically, a file to be desensitized can be imported into a computer; cleaning files, mainly processing, special symbol, spelling error correction, case-to-case conversion and punctuation mark conversion; the cleaned file is subjected to word segmentation, a Chinese word segmentation tool such as jieba is used, and a rule-based word segmentation method such as a bidirectional maximum matching method or a machine learning-based HMM, CRF and the like is used for word segmentation of the text; the words are tagged with parts of speech, and rules-based or statistical-based methods, such as HMM, CRF, LSTM +crf, may be used to tag the words with parts of speech. Because the fields related to the personal sensitive information are nouns, only nouns need to be marked when parts of speech are marked.

Then, text desensitization is carried out. Specifically, a dictionary of sensitive words is constructed by means of regular expressions, manual labeling, or the like, and is used to identify the sensitive words in a document, such as bank card numbers, customer names, mobile phone numbers, and addresses. The positions of the sensitive words in the document are labeled. A different identification code is generated for each user and stored persistently; the identification code uniquely identifies the user. The labeled sensitive words are then desensitized by data replacement. During replacement, the hash value of the identification code of each user is used as the key, and one or more characters are selected for symmetric encryption according to the length of the sensitive word. In addition, zero-width characters corresponding to the hash value of the identification code of the user are added, the original format of the document is maintained, and the sensitive words are replaced with the desensitization words containing the data watermark.

Finally, data tracing can be performed. Specifically, if data leakage occurs, the watermark information in the document can be extracted and the zero-width characters decoded. Alternatively, the keys of all users can be traversed to parse the desensitized sensitive words, and the parsed results can be compared with the original sensitive words to recover the identification code of the user.
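The key derivation and zero-width watermark described above might look like the following sketch. Several details are assumptions, since the embodiment does not fix them: SHA-256 as the hash function, U+200B and U+200C as the two zero-width code points representing bits 0 and 1, a 4-byte hash prefix as the embedded tag, and all function names.

```python
import hashlib
from typing import Iterable, Optional

ZW_ZERO = "\u200b"  # zero width space, used here for bit 0 (assumption)
ZW_ONE = "\u200c"   # zero width non-joiner, used here for bit 1 (assumption)


def derive_key(identification_code: str) -> bytes:
    """Hash the user's identification code; the digest serves as the key."""
    return hashlib.sha256(identification_code.encode("utf-8")).digest()


def to_zero_width(data: bytes) -> str:
    """Encode bytes as an invisible string of zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(ZW_ONE if bit == "1" else ZW_ZERO for bit in bits)


def from_zero_width(text: str) -> bytes:
    """Collect the zero-width characters in `text` and decode them to bytes."""
    bits = "".join("1" if ch == ZW_ONE else "0" for ch in text if ch in (ZW_ZERO, ZW_ONE))
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def trace_user(leaked_text: str, known_codes: Iterable[str], tag_len: int = 4) -> Optional[str]:
    """Match the embedded hash prefix against the stored identification codes."""
    tag = from_zero_width(leaked_text)[:tag_len]
    for code in known_codes:
        if derive_key(code)[:tag_len] == tag:
            return code
    return None


if __name__ == "__main__":
    key = derive_key("user-0001")
    marked = "138****5678" + to_zero_width(key[:4])  # looks identical when rendered
    print(trace_user(marked, ["user-0002", "user-0001"]))
```

Because a hash is one-way, tracing here works by lookup against the persisted identification codes rather than by inverting the hash, which is consistent with the requirement that the codes be stored.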

According to the text desensitization processing method provided by the embodiment of the invention, text files are desensitized by means of natural language processing, and the identification code of the user serves as a tracing code to enable data tracing, so that the whole flow is smooth, management is more standardized, and data security is improved.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.

According to an embodiment of the present invention, there is also provided a text desensitization processing apparatus for implementing the above text desensitization processing method. FIG. 3 is a schematic diagram of the text desensitization processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the apparatus includes: a word segmentation unit 301, a matching unit 303, a first determining unit 305, a second determining unit 307, a third determining unit 309, and a desensitization unit 311. The apparatus is described below.

The word segmentation unit 301 is configured to perform word segmentation processing on a target text to obtain a plurality of words, where the target text is a text that needs to be subjected to desensitization processing.

The matching unit 303 is configured to match the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, where the sensitive information word stock is a word stock that is pre-constructed based on a plurality of sample texts and includes a plurality of reference sensitive words, and the field corresponding to the plurality of sample texts is the same as the field corresponding to the target text.

The first determining unit 305 is configured to determine an identification code of at least one user in the target text, where the identification code is used to uniquely identify the user.

The second determining unit 307 is configured to determine a hash value of the identification code of the user, and use the hash value as a key of the at least one sensitive word.

The third determining unit 309 is configured to determine a desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word.

The desensitization unit 311 is configured to replace at least one sensitive word with a desensitization word to desensitize the target text.

Here, the word segmentation unit 301, the matching unit 303, the first determining unit 305, the second determining unit 307, the third determining unit 309, and the desensitization unit 311 correspond to steps S102 to S112 in the above embodiment. The six units share the examples and application scenarios of the corresponding steps, but are not limited to what is disclosed in the above embodiment.

As can be seen from the above, in the solution described in the above embodiment of the present invention, the word segmentation unit performs word segmentation on the target text to obtain a plurality of words, where the target text is a text that needs desensitization processing. The matching unit then matches the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, where the sensitive information word stock is a word stock that is constructed in advance based on a plurality of sample texts and includes a plurality of reference sensitive words, and the field corresponding to the plurality of sample texts is the same as the field corresponding to the target text. The first determining unit determines an identification code of at least one user in the target text, where the identification code is used to uniquely identify the user. The second determining unit determines a hash value of the identification code of the user and takes the hash value as a key of the at least one sensitive word. The third determining unit determines a desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word. Finally, the desensitization unit replaces the at least one sensitive word with the desensitization word to desensitize the target text. In this way, the desensitization word is determined from the hash value of the identification code of the user extracted from the target text and the length of the sensitive word in the target text, and is used to desensitize the sensitive word in the target text, which improves data security.

The text desensitization processing apparatus provided by the embodiment of the invention thus solves the technical problem that data management approaches in the related art have low reliability and are prone to data leakage.

In an alternative embodiment, the text desensitization processing apparatus further includes: a cleaning unit configured to perform a cleaning process on the target text before the word segmentation processing is performed on the target text, wherein the cleaning process includes at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing, and special symbol identification processing.
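A cleaning step of this kind could be sketched as below. The specific choices are assumptions made for illustration: Unicode NFKC normalization for punctuation and full-width conversion, lower-casing for case conversion, and removal of control characters as the special-symbol handling. Spelling error correction is omitted because it would need a dictionary or language model that the embodiment does not describe.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize a text before word segmentation (illustrative only)."""
    # NFKC folds full-width letters, digits, and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Case conversion: lower-case everything so matching is case-insensitive.
    text = text.lower()
    # Special symbols: drop control characters, but keep line breaks and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs left over from the steps above.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("ＡＢＣ１２３，　Hello\x00 World "))
```

Line breaks are deliberately preserved so that positions found later can still be mapped back to the original layout.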

In an alternative embodiment, the word segmentation unit comprises one of the following: the first word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a bidirectional maximum matching method; the second word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a hidden Markov algorithm; and the third word segmentation module is used for carrying out word segmentation processing on the target text by utilizing a conditional random field mode.

In an alternative embodiment, the matching unit comprises: a labeling module configured to perform part-of-speech tagging on each of the plurality of words in a preset manner to obtain a plurality of tagged words, wherein the preset manner is one of the following: a hidden Markov algorithm or a conditional random field mode; a first determining module configured to determine the part of speech of the sensitive word in the target text; a second determining module configured to determine, among the plurality of tagged words, the partial words having that same part of speech; and a matching module configured to match the partial words with the sensitive information word stock to obtain the at least one sensitive word in the plurality of words.
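The filtering and matching performed by these modules can be illustrated with a short sketch. It assumes the words have already been tagged by some tagger and uses jieba-style noun tags (`n`, `nr`, `ns`, `nt`, `nz`) as the part of speech of sensitive words; the tag set, the lexicon contents, and the function name are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

# Noun-like tags in the jieba/ICTCLAS style; treated here as an assumption.
NOUN_TAGS: FrozenSet[str] = frozenset({"n", "nr", "ns", "nt", "nz"})


def match_sensitive_words(
    tagged_words: Sequence[Tuple[str, str]],
    lexicon: Set[str],
    noun_tags: FrozenSet[str] = NOUN_TAGS,
) -> List[Tuple[int, str]]:
    """Return (index, word) for tagged nouns that appear in the lexicon."""
    matched = []
    for index, (word, tag) in enumerate(tagged_words):
        # Only words with the sensitive part of speech reach the lexicon lookup.
        if tag in noun_tags and word in lexicon:
            matched.append((index, word))
    return matched


if __name__ == "__main__":
    tagged = [("客户", "n"), ("张三", "nr"), ("住", "v"), ("在", "p"), ("北京", "ns")]
    print(match_sensitive_words(tagged, {"张三", "北京"}))
```

Filtering by part of speech first shrinks the candidate set, so the lexicon lookup only runs on words that could plausibly be sensitive.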

In an alternative embodiment, the third determining unit comprises: a selection module for selecting at least one character based on a length of each of the at least one sensitive word; the encryption module is used for symmetrically encrypting the at least one sensitive word according to the secret key and the at least one character to obtain at least one symmetrically encrypted sensitive word; a third determining module, configured to determine a zero-width character of the hash value; and the acquisition module is used for adding the zero-width character to the at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.
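To make the selection and encryption modules concrete, the sketch below picks the middle characters of a word according to its length and shifts the selected digits with a keystream derived from the key. Everything specific here is an assumption for illustration: the quarter-length rule in `select_positions`, the HMAC-SHA256 keystream, and the restriction to decimal digits. It is a reversible toy transform that preserves length and format, not a vetted cipher and not necessarily the scheme intended by the embodiment.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def select_positions(word: str) -> range:
    """Choose which characters to encrypt; longer words keep longer ends."""
    keep = len(word) // 4
    return range(keep, len(word) - keep)


def keystream(key: bytes, length: int) -> bytes:
    """Expand the key into `length` pseudo-random bytes with HMAC-SHA256."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return stream[:length]


def transform(word: str, key: bytes, decrypt: bool = False) -> str:
    """Shift the selected digits by the keystream; the same key reverses it."""
    positions = select_positions(word)
    sign = -1 if decrypt else 1
    chars = list(word)
    for index, offset in zip(positions, keystream(key, len(positions))):
        if chars[index] in DIGITS:  # non-digit characters are left unchanged
            chars[index] = str((int(chars[index]) + sign * offset) % 10)
    return "".join(chars)


if __name__ == "__main__":
    key = hashlib.sha256("user-0001".encode("utf-8")).digest()
    encrypted = transform("13812345678", key)
    print(encrypted, transform(encrypted, key, decrypt=True))
```

Because the output has the same length and character class as the input, the replaced word fits the original document layout; a production system would more likely use an established format-preserving encryption scheme.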

In an alternative embodiment, the acquisition module includes: an adding submodule configured to, after the zero-width character is added to the symmetrically encrypted key to obtain an initial desensitization word of the at least one sensitive word, add a watermark to the initial desensitization word to obtain the desensitization word of the at least one sensitive word.

In an alternative embodiment, the text desensitizing apparatus further comprises: the extraction module is used for extracting watermark information in the target text after the desensitization processing during data tracing to obtain the target text with the watermark removed; and the decoding module is used for decoding the zero width character from the target text from which the watermark is removed, and recovering to obtain the identification code.

In an alternative embodiment, the text desensitizing apparatus further comprises: an analysis module configured to parse the at least one desensitized sensitive word by using the key during data tracing; and a comparison module configured to compare the at least one desensitized sensitive word with the at least one sensitive word before desensitization, and recover the identification code.
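Tracing by parsing and comparison can be sketched as a loop over the stored identification codes: each candidate key is used to decrypt the leaked word, and the user whose key reproduces the original sensitive word is reported. The stand-in cipher `shift_digits`, the use of SHA-256, and all names are assumptions for the example; any symmetric scheme with a matching decrypt function could be passed in instead.

```python
import hashlib
from typing import Callable, Iterable, Optional


def derive_key(identification_code: str) -> bytes:
    """Hash of the identification code, used as the per-user key."""
    return hashlib.sha256(identification_code.encode("utf-8")).digest()


def shift_digits(word: str, key: bytes, sign: int) -> str:
    """Stand-in reversible transform: shift each digit by one key byte."""
    out = []
    for i, ch in enumerate(word):
        if ch in "0123456789":
            out.append(str((int(ch) + sign * key[i % len(key)]) % 10))
        else:
            out.append(ch)
    return "".join(out)


def trace_by_keys(
    desensitized: str,
    original: str,
    user_codes: Iterable[str],
    decrypt: Callable[[str, bytes], str],
) -> Optional[str]:
    """Try every user's key; return the code whose key recovers the original."""
    for code in user_codes:
        if decrypt(desensitized, derive_key(code)) == original:
            return code
    return None


if __name__ == "__main__":
    codes = ["user-0001", "user-0002", "user-0003"]
    leaked = shift_digits("13812345678", derive_key("user-0002"), +1)
    print(trace_by_keys(leaked, "13812345678", codes, lambda w, k: shift_digits(w, k, -1)))
```

The cost grows linearly with the number of users, so this path is a fallback for cases where the zero-width watermark has been stripped from the leaked text.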

In an alternative embodiment, the desensitizing unit comprises: a fourth determining module, configured to determine a location of at least one sensitive word in the target text; the positioning module is used for positioning at least one sensitive word in the target text according to the position; and the replacing module is used for replacing at least one obtained sensitive word by the desensitization word on the basis of the original text format of the target text.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium including a stored program, wherein the program performs the text desensitization processing method of any one of the above.

Optionally, in the present embodiment, the above-described computer-readable storage medium may be located in any one of a group of computer terminals in a computer network, or in any one of a group of communication devices.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: word segmentation processing is carried out on a target text to obtain a plurality of words, wherein the target text is a text which needs desensitization processing; matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target texts; determining an identification code of at least one user in the target text, wherein the identification code is used for uniquely identifying the user; determining a hash value of an identification code of a user, and taking the hash value as a key of at least one sensitive word; determining a desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word; at least one sensitive word is replaced by a desensitization word to desensitize the target text.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: performing a cleaning process on the target text, wherein the cleaning process comprises at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing, and special symbol identification processing.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: word segmentation processing is carried out on the target text by utilizing a bidirectional maximum matching method; word segmentation processing is carried out on the target text by using a hidden Markov algorithm; and performing word segmentation processing on the target text by using a conditional random field mode.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: performing part-of-speech tagging on each of the plurality of words in a preset manner to obtain a plurality of tagged words, wherein the preset manner is one of the following: a hidden Markov algorithm or a conditional random field mode; determining the part of speech of the sensitive word in the target text; determining, among the plurality of tagged words, the partial words having that same part of speech; and matching the partial words with the sensitive information word stock to obtain at least one sensitive word in the plurality of words.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: selecting at least one character based on a length of each of the at least one sensitive word; symmetrically encrypting at least one sensitive word according to the secret key and the at least one character to obtain at least one symmetrically encrypted sensitive word; determining zero width characters of the hash value; and adding the zero width character to at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: after adding the zero width character to the key after symmetric encryption to obtain an initial desensitization word of at least one sensitive word, adding a watermark to the initial desensitization word to obtain the desensitization word of the at least one sensitive word.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: extracting watermark information in the target text after desensitization processing during data tracing to obtain the target text with the watermark removed; and decoding the zero width character from the target text from which the watermark is removed, and recovering to obtain the identification code.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: parsing the at least one desensitized sensitive word by using the key during data tracing; and comparing the at least one desensitized sensitive word with the at least one sensitive word before desensitization, and recovering the identification code.

Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: determining the position of at least one sensitive word in the target text; locating at least one sensitive word in the target text according to the location; and replacing the at least one obtained sensitive word by using the desensitization word on the basis of the original text format of the target text.

According to another aspect of the embodiment of the present invention, there is also provided a processor, configured to execute a program, where the program executes any one of the text desensitization processing methods described above.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis. For a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division of the units may be a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be implemented through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims

1. A text desensitization processing method, characterized by comprising:

performing word segmentation processing on a target text to obtain a plurality of words, wherein the target text is a text needing desensitization processing;

matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target text;

determining an identification code of at least one user in the target text, wherein the identification code is used for uniquely identifying the user;

determining a hash value of the identification code of the user, and taking the hash value as a key of the at least one sensitive word;

determining a desensitization word of the at least one sensitive word based on a key of the at least one sensitive word and a length of the at least one sensitive word;

and replacing the at least one sensitive word with the desensitization word to desensitize the target text.

2. The text desensitization processing method according to claim 1, further comprising, before performing word segmentation processing on the target text: performing a cleaning process on the target text, wherein the cleaning process comprises at least one of the following operations: spelling error correction processing, character case conversion processing, punctuation mark conversion processing, and special symbol identification processing.

3. The text desensitization processing method according to claim 1, wherein the word segmentation processing is performed on the target text, comprising one of:

performing word segmentation processing on the target text by using a bidirectional maximum matching method;

performing word segmentation processing on the target text by using a hidden Markov algorithm;

and performing word segmentation processing on the target text by using a conditional random field mode.

4. The text desensitization processing method according to claim 1, wherein matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word of the plurality of words comprises:

performing part-of-speech tagging on each of the plurality of words in a preset manner to obtain a plurality of tagged words, wherein the preset manner is one of the following: a hidden Markov algorithm or a conditional random field mode;

determining the part of speech of the sensitive word in the target text;

determining partial words with the same part of speech in the marked multiple words;

and matching the partial words with the sensitive information word stock to obtain the at least one sensitive word in the plurality of words.

5. The text desensitization processing method according to claim 1, wherein determining the desensitization word of the at least one sensitive word based on the key of the at least one sensitive word and the length of the at least one sensitive word comprises:

selecting at least one character based on a length of each of the at least one sensitive word;

symmetrically encrypting the at least one sensitive word according to the secret key and the at least one character to obtain the at least one symmetrically encrypted sensitive word;

determining zero width characters of the hash value;

and adding the zero width character to the at least one sensitive word after symmetrical encryption to obtain a desensitization word of the at least one sensitive word.

6. The text desensitization processing method according to claim 5, wherein adding said zero-width character to said key after symmetric encryption to obtain a desensitization word of said at least one sensitive word, comprising:

adding the zero width character to the key after symmetric encryption to obtain an initial desensitization word of the at least one sensitive word, and adding a watermark to the initial desensitization word to obtain the desensitization word of the at least one sensitive word.

7. The text desensitization processing method according to claim 6, further comprising:

extracting watermark information in the target text after desensitization processing during data tracing to obtain the target text with the watermark removed;

and decoding the zero width character from the target text from which the watermark is removed, and recovering to obtain the identification code.

8. The text desensitization processing method according to claim 6, further comprising:

parsing the at least one desensitized sensitive word by using the key during data tracing;

and comparing the at least one desensitized sensitive word with the at least one sensitive word before desensitization, and recovering the identification code.

9. The text desensitization processing method according to any one of claims 1 to 8, wherein replacing the at least one sensitive word with the desensitization word comprises:

determining the position of the at least one sensitive word in the target text;

locating the at least one sensitive word in the target text according to the position;

and replacing the at least one sensitive word obtained by positioning by using the desensitization word on the basis of the original text format of the target text.

10. A text desensitization processing apparatus, comprising:

the word segmentation unit is used for carrying out word segmentation processing on a target text to obtain a plurality of words, wherein the target text is a text which needs desensitization processing;

the matching unit is used for matching the plurality of words with a sensitive information word stock to obtain at least one sensitive word in the plurality of words, wherein the sensitive information word stock is constructed in advance based on a plurality of sample texts and comprises a word stock of a plurality of reference sensitive words, and the fields corresponding to the plurality of sample texts are the same as the fields corresponding to the target text;

a first determining unit, configured to determine an identification code of at least one user in the target text, where the identification code is used to uniquely identify the user;

a second determining unit, configured to determine a hash value of the identification code of the user, and use the hash value as a key of the at least one sensitive word;

a third determining unit configured to determine a desensitization word of the at least one sensitive word based on a key of the at least one sensitive word and a length of the at least one sensitive word;

and the desensitization unit is used for replacing the at least one sensitive word with the desensitization word so as to desensitize the target text.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored program, wherein the program performs the text desensitization processing method according to any one of claims 1 to 9.

12. A processor for executing a program, wherein the program is operative to perform the text desensitization processing method according to any one of claims 1-9.