CN112001174A - Text desensitization method, apparatus, electronic device and computer-readable storage medium

Publication number: CN112001174A
Authority: CN (China)
Prior art keywords: text, vocabulary, information, desensitization, context
Legal status (an assumption, not a legal conclusion): Pending
Application number: CN202010795184.0A
Other languages: Chinese (zh)
Inventors: 代庆国, 罗英群, 吕令广
Current Assignee (the listed assignees may be inaccurate): ZTE ICT Technologies Co Ltd
Original Assignee: ZTE ICT Technologies Co Ltd
Application filed by: ZTE ICT Technologies Co Ltd
Priority: CN202010795184.0A

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities > G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data > G06F16/33 Querying > G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity > G06F21/60 Protecting data > G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules > G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database > G06F21/6245 Protecting personal data, e.g. for financial or medical purposes > G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/205 Parsing > G06F40/216 Parsing using statistical methods
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text desensitization method, a text desensitization apparatus, an electronic device and a computer-readable storage medium. The text desensitization method comprises: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to a word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information satisfies preset context information. By identifying the context of unstructured text with the hidden Markov model, the method can further screen private words, which improves the precision with which private words are identified, meets the desensitization requirements of different users, and improves the efficiency of identifying and processing private words. It avoids locating private data by regular-expression pattern matching, does not force the user to write any data rules, reduces the user's workload, and prevents the human errors of manual labeling.

Description

Text desensitization method, apparatus, electronic device and computer-readable storage medium
Technical Field
The invention relates to the technical field of electronic devices, and in particular to a text desensitization method, a text desensitization apparatus, an electronic device and a computer-readable storage medium.
Background
In the prior art, to ensure that data can be used safely, private data is generally replaced by means of desensitization. Most existing desensitization methods target structured data such as databases, and identify the data to desensitize with regular patterns, for example by specifying the field names of database tables.
As the protection of industrial data privacy becomes increasingly important, the desensitization approaches used by industry users show the following problems. Most existing data processing methods mainly handle structured data, and most semi-structured data is processed by regular-expression pattern matching to find the key data to desensitize. Most existing sensitive data identification methods rely on rule discovery and manual definition. Rule-discovery methods effectively identify sensitive data that conforms to the defined rules but can miss a large amount of irregular sensitive data, which lowers the accuracy of sensitive data identification. Manual-definition methods, when the data volume is large, increase the burden on the user and reduce the usability of the system.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
To this end, a first aspect of the invention is directed to a method of text desensitization.
A second aspect of the invention is to propose a text desensitizing device.
A third aspect of the invention is directed to an electronic device.
A fourth aspect of the invention is directed to a computer-readable storage medium.
In view of this, according to a first aspect of the present invention, there is provided a text desensitization method, comprising: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to a word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information satisfies preset context information.
In the text desensitization method provided by the invention, the text to be processed is segmented with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts together with their vocabulary positions and semantics. The vocabulary information is input into a preset hidden Markov model (HMM), which determines the context information closest to the vocabulary information, and this context information is compared with the preset, desensitization-related context information. If the context information satisfies the preset context information, the text to be processed contains characters that match a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Identifying the context of unstructured text with the hidden Markov model thus allows private words to be screened further, which improves the precision with which private words are identified, meets the desensitization requirements of different users, and improves the efficiency of identifying and processing private words. It also avoids locating private data by regular-expression pattern matching, does not force the user to write any data rules, reduces the user's workload, and prevents the human errors of manual labeling.
In addition, the text desensitization method of the above technical solution may also have the following additional technical features.
In the above technical solution, further, the step of desensitizing the vocabulary information specifically includes: comparing the vocabulary text in the vocabulary information with a privacy vocabulary; marking the vocabulary text as sensitive data when the vocabulary text conforms to the privacy vocabulary; and desensitizing the sensitive data according to a desensitization rule.
In this technical solution, when the context information of the vocabulary information satisfies the preset context information, the vocabulary text in the vocabulary information is compared with the privacy vocabulary, that is, the private data structures in the key context are searched for. If the vocabulary text conforms to the privacy vocabulary, the vocabulary text is sensitive data and is desensitized according to the desensitization rule. Sensitive data relevant to data security is thereby masked, which effectively safeguards the security, integrity and usability of the sensitive data in the text.
Specifically, the privacy vocabulary may be fields preset by the user according to needs and experience, such as: name, address, date directly associated with the individual, telephone number, email address, account number, transaction amount, IP address, license plate number, and the like.
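The comparison against the privacy vocabulary can be illustrated with a minimal Python sketch. It assumes that each segmented vocabulary item carries a semantic label and that "conforming to the privacy vocabulary" means that label belongs to a user-configured set. The `Token` type, the label values and `mark_sensitive` are hypothetical names introduced for illustration; the source does not specify how the comparison is carried out.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str      # vocabulary text
    position: int  # vocabulary position within the sentence
    semantic: str  # semantic label from segmentation (hypothetical values)


# Hypothetical privacy vocabulary preset by the user.
PRIVACY_VOCABULARY = {"name", "telephone number", "email address"}


def mark_sensitive(tokens, privacy_vocabulary=PRIVACY_VOCABULARY):
    """Return the tokens whose semantic label is in the privacy vocabulary."""
    return [t for t in tokens if t.semantic in privacy_vocabulary]


tokens = [
    Token("contact", 0, "other"),
    Token("Zhang San", 1, "name"),
    Token("13800000000", 2, "telephone number"),
]
sensitive = mark_sensitive(tokens)
print([t.text for t in sensitive])
```

Keeping the privacy vocabulary as plain data lets each user adjust which categories count as sensitive without touching the matching logic.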
In any of the above technical solutions, further, before the step of acquiring the text to be processed, the method further includes: acquiring a target text; performing word segmentation on the target text according to the word segmentation library to obtain first target vocabulary and the vocabulary position and semantics corresponding to the first target vocabulary; determining a context identifier of the target text according to the first target vocabulary, the vocabulary position, the semantics and a context pattern library; and constructing the hidden Markov model according to the context identifier.
In this technical solution, before a text is desensitized, a target text of a specified industry is acquired and segmented to obtain the first target vocabulary and the corresponding vocabulary positions. The contexts of first target vocabulary at different positions and with different semantics are analyzed against the context pattern library of the specified industry, and the target text is labeled with context identifiers accordingly, which establishes the association between contexts and vocabulary. The initial probability distribution matrix, the state transition probability distribution matrix, the observation probability distribution matrix and so on of the hidden Markov model are calculated from the context identifiers, and an unsupervised machine learning method is used to make these matrices converge, thereby constructing the hidden Markov model for the specified industry. The hidden Markov model can then be used to identify the context of the text to be processed, so that sensitive information is desensitized dynamically according to the actual context. This safeguards the security of sensitive data in the text while improving the efficiency of identifying and processing private words and meeting the desensitization requirements of different users.
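As a rough illustration of how the three matrices could be calculated from context-labeled text, the Python sketch below estimates them by smoothed counting. This is only one plausible starting point: the unsupervised convergence step mentioned above is not shown, and the function name `estimate_hmm`, the add-k smoothing and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences, smoothing=1.0):
    """Estimate (states, pi, A, B) by add-k smoothed counting.

    tagged_sentences: list of sentences, each a list of (word, context_id) pairs.
    """
    states = sorted({s for sent in tagged_sentences for _, s in sent})
    words = sorted({w for sent in tagged_sentences for w, _ in sent})
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        if not sent:
            continue
        start[sent[0][1]] += 1                      # context of the first word
        for word, state in sent:
            emit[state][word] += 1                  # word observed in a context
        for (_, s1), (_, s2) in zip(sent, sent[1:]):
            trans[s1][s2] += 1                      # context-to-context transition

    def normalize(counts, keys):
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: (counts[k] + smoothing) / total for k in keys}

    pi = normalize(start, states)                       # initial distribution
    A = {s: normalize(trans[s], states) for s in states}  # state transitions
    B = {s: normalize(emit[s], words) for s in states}    # observations
    return states, pi, A, B


# Toy corpus with hypothetical context identifiers.
corpus = [
    [("name", "title"), ("Zhang San", "title"),
     ("ID", "id_number"), ("110101", "id_number")],
    [("ID", "id_number"), ("220202", "id_number")],
]
states, pi, A, B = estimate_hmm(corpus)
```

Smoothing keeps unseen transitions and words from receiving zero probability, which matters when the labeled target text is small.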
In any of the above technical solutions, further, the step of determining the context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model specifically includes: matching the vocabulary information against the hidden Markov model to obtain the matching degree between the vocabulary information and each context identifier; and, when a matching degree is greater than or equal to a matching degree threshold, taking the corresponding context identifier as the context information.
In this technical solution, after the vocabulary information is input into the hidden Markov model, the model calculates the matching degrees between the vocabulary information and a plurality of context identifiers. A context identifier whose matching degree is greater than or equal to the matching degree threshold is the most probable context of the vocabulary information; it is taken as the context information and matched against the preset context information. Sensitive vocabulary is thereby recognized through its context, which improves the precision with which private words are identified and protects data privacy. The matching degree threshold may be set to the maximum of the calculated matching degrees, or according to the actual requirements of the user.
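The threshold rule can be sketched as follows, assuming the matching degrees are already available as a mapping from context identifier to probability. The function name `select_contexts` and the sample values are hypothetical; when no threshold is supplied, the sketch falls back to the maximum matching degree, as the text allows.

```python
def select_contexts(matching_degrees, threshold=None):
    """Return context identifiers whose matching degree reaches the threshold.

    matching_degrees: dict mapping context identifier -> matching degree.
    threshold: user-supplied value, or None to use the maximum matching degree.
    """
    if not matching_degrees:
        return []
    if threshold is None:
        threshold = max(matching_degrees.values())
    return [c for c, d in matching_degrees.items() if d >= threshold]


degrees = {"title": 0.12, "id_number": 0.71, "address": 0.17}
print(select_contexts(degrees))        # only the best-matching context
print(select_contexts(degrees, 0.15))  # every context at or above 0.15
```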
Specifically, the hidden Markov model is a doubly stochastic process comprising a hidden Markov chain with a certain number of states and a set of observable random functions. It can be described by five elements, namely two state sets and three probability matrices: the hidden states, the observable states, the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix. A hidden state represents a context identifier, such as title/xx or identification number/xx, and the observable states represent the word segmentation result of a text. The initial probability distribution matrix represents the probability that the state of the first word of a sentence is a privacy state; the state transition probability distribution matrix represents the probability of moving from one hidden state to another; and the observation probability distribution matrix represents the probability that a keyword occurs in a given hidden state. The vocabulary information obtained by segmenting the text to be processed is then input into the hidden Markov model, the matching degree (probability) between the vocabulary information and each context identifier is obtained from the initial probability distribution matrix, the state transition probability distribution matrix and the observation probability distribution matrix, and the hidden state sequence with the maximum probability is determined, which is the context information closest to the vocabulary information.
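Finding the hidden state sequence with the maximum probability is commonly done with the Viterbi algorithm; the source does not name the algorithm, so the Python sketch below is an assumption about how this step might be realized. The state names, the probabilities and the `floor` used to avoid taking the logarithm of zero are illustrative values only.

```python
import math


def viterbi(observations, states, pi, A, B, floor=1e-12):
    """Return (most probable hidden state sequence, its log-probability)."""
    if not observations:
        return [], 0.0

    def log(p):
        return math.log(max(p, floor))

    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log(pi[s]) + log(B[s].get(observations[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(A[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(B[s].get(observations[t], 0.0))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ["title", "id_number"]
pi = {"title": 0.6, "id_number": 0.4}
A = {"title": {"title": 0.7, "id_number": 0.3},
     "id_number": {"title": 0.2, "id_number": 0.8}}
B = {"title": {"name": 0.5, "Zhang San": 0.4, "ID": 0.05, "110101": 0.05},
     "id_number": {"name": 0.05, "Zhang San": 0.05, "ID": 0.5, "110101": 0.4}}
path, log_prob = viterbi(["name", "Zhang San", "ID", "110101"], states, pi, A, B)
print(path)  # ['title', 'title', 'id_number', 'id_number']
```

Working in log space avoids numeric underflow on long sentences, at the cost of needing a floor for zero probabilities.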
In any of the above technical solutions, further, after the step of acquiring the target text, the method further includes: performing word segmentation on the target text with a maximum matching algorithm to obtain second target vocabulary; counting the number of occurrences of the second target vocabulary in the target text; and updating the word segmentation library with the second target vocabulary whose number of occurrences is greater than or equal to a preset number.
In this technical solution, the word segmentation library may be a specialized lexicon that already exists in the industry. Because vocabulary changes over time, however, the target text of the specified industry is segmented with the maximum matching algorithm, the occurrences of the second target vocabulary in the target text are counted, and second target vocabulary that occurs at least the preset number of times is added to the word segmentation library. The library is thus improved step by step, which allows texts to be segmented accurately, provides more complete data for constructing the hidden Markov model, ensures that the context of the text to be processed and its sensitive words are identified precisely, and prevents omissions during desensitization. Sensitive information is thereby desensitized dynamically, and the security, integrity and usability of sensitive data in the text are effectively safeguarded.
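A minimal Python sketch of this update step follows, using forward maximum matching. The source does not say which dictionary the maximum matching algorithm runs against, so the `candidates` set, the `max_len` window and the function names are assumptions introduced for illustration.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def update_library(library, target_text, candidates, min_count=2, max_len=4):
    """Return the library extended with candidate words seen at least min_count times."""
    counts = Counter(forward_max_match(target_text, candidates, max_len))
    frequent = {w for w, c in counts.items() if c >= min_count and w in candidates}
    return library | frequent


# Hypothetical candidate words: "ID card", "number", "ID card number".
candidates = {"身份证", "号码", "身份证号码"}
text = "身份证号码是身份证号码"  # "ID card number is ID card number"
library = update_library({"电话"}, text, candidates, min_count=2, max_len=5)
print(sorted(library))
```

Because the longest candidate wins at each position, "身份证号码" is counted as one word rather than as "身份证" plus "号码".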
In any of the above technical solutions, further, the step of acquiring the hidden Markov model specifically includes: determining the text type of the text to be processed; matching the text type against preset text types; and, when the text type matches a preset text type, retrieving the hidden Markov model associated with that preset text type.
In this technical solution, the text type of the text to be processed is determined and matched against the preset text types, and the hidden Markov model is selected according to the preset text type with a high matching degree. Texts of different types can therefore use a type-specific hidden Markov model for context recognition, which meets the desensitization requirements of different industries and widens the range of application of the product.
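Selecting a model by text type might look like the sketch below. The source does not describe how the text type is determined, so the keyword-count classifier, the type names and the model placeholders are all hypothetical.

```python
# Hypothetical preset text types and the keywords that indicate them.
TYPE_KEYWORDS = {
    "medical": {"patient", "diagnosis"},
    "finance": {"account", "transaction"},
}


def detect_text_type(tokens):
    """Return the preset text type with the most keyword hits, or None."""
    scores = {
        text_type: sum(token in keywords for token in tokens)
        for text_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def get_model(tokens, models):
    """Return the model registered for the detected text type, or None."""
    return models.get(detect_text_type(tokens))


# Strings stand in for trained hidden Markov models.
models = {"medical": "hmm_medical", "finance": "hmm_finance"}
print(get_model(["patient", "diagnosis", "account"], models))  # hmm_medical
```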
In any of the above technical solutions, further, the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format-preserving encryption algorithm, and a data encryption algorithm.
In this technical solution, plain text can be processed by data hiding, rounding, offsetting or mapping, or by algorithms such as hashing. For formatted text, random data conforming to the data format is generated and substituted for the original, or the prefix or suffix is kept unchanged while the rest (for example the latter half) is filled with scrambled data, or the data is encrypted with a format-preserving encryption method. A targeted, differentiated strategy can thus be formulated according to the user's requirements for the sensitive data, which preserves the usability of the sensitive data.
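Three of these rules can be sketched in Python: masking, hash-based replacement, and format-conforming randomization with a preserved prefix. The function names, the default prefix and suffix lengths and the salt are illustrative assumptions. The randomization shown is not format-preserving encryption, which requires a keyed, reversible cipher.

```python
import hashlib
import random
import string


def mask(value, keep_prefix=3, keep_suffix=4, fill="*"):
    """Hide the middle of a value while keeping its prefix and suffix."""
    if len(value) <= keep_prefix + keep_suffix:
        return fill * len(value)
    hidden = len(value) - keep_prefix - keep_suffix
    return value[:keep_prefix] + fill * hidden + value[len(value) - keep_suffix:]


def hash_replace(value, salt="demo-salt"):
    """Replace a value with a truncated salted SHA-256 digest (irreversible)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def randomize_keep_format(value, keep_prefix=3, rng=None):
    """Keep the prefix; replace later digits and letters with random ones."""
    rng = rng or random.Random()
    out = []
    for i, ch in enumerate(value):
        if i < keep_prefix:
            out.append(ch)
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(ch)  # separators keep the original format
    return "".join(out)


print(mask("13812345678"))  # 138****5678
scrambled = randomize_keep_format("138-1234", rng=random.Random(0))
```

Which rule fits depends on whether the desensitized value must remain readable, joinable across records, or valid for downstream format checks.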
According to a second aspect of the present invention, there is provided a text desensitization apparatus comprising a memory and a processor, the memory storing a computer program and the processor implementing the text desensitization method of any one of the above when executing the computer program. The text desensitization apparatus therefore has all the benefits of that text desensitization method.
According to a third aspect of the invention, there is provided an electronic device comprising the text desensitization apparatus of any of the above, wherein the text desensitization apparatus, when executing the computer program, performs the following steps: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to a word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information satisfies preset context information.
The electronic device provided by the invention segments the text to be processed with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts together with their vocabulary positions and semantics. The vocabulary information is input into a preset hidden Markov model, which determines the context information closest to the vocabulary information, and this context information is compared with the preset, desensitization-related context information. If the context information satisfies the preset context information, the text to be processed contains characters that match a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Identifying the context of unstructured text with the hidden Markov model thus allows private words to be screened further, which improves the precision with which private words are identified, meets the desensitization requirements of different users, and improves the efficiency of identifying and processing private words. It also avoids locating private data by regular-expression pattern matching, does not force the user to write any data rules, reduces the user's workload, and prevents the human errors of manual labeling.
According to a fourth aspect of the invention, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, carrying out the steps of the text desensitization method of any of the above. The computer-readable storage medium therefore has all the benefits of that text desensitization method.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a text desensitization method flow diagram of one embodiment of the present invention;
FIG. 2 shows a text desensitization method flow diagram of yet another embodiment of the present invention;
FIG. 3 shows a text desensitization method flow diagram of yet another embodiment of the present invention;
FIG. 4 shows a text desensitization method flow diagram of yet another embodiment of the present invention;
FIG. 5 is a flow diagram illustrating a text desensitization method according to an embodiment of the present invention;
FIG. 6 shows a schematic block diagram of a text desensitization apparatus of an embodiment of the present invention;
FIG. 7 shows a schematic block diagram of an electronic device in accordance with a specific embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
Text desensitization methods, text desensitization apparatuses, electronic devices, and computer-readable storage media according to some embodiments of the present invention are described below with reference to fig. 1-7.
Example one
As shown in fig. 1, according to an embodiment of the first aspect of the present invention, there is provided a text desensitization method, the method comprising:
step 102, acquiring a text to be processed and a hidden Markov model;
step 104, performing word segmentation on the text to be processed according to the word segmentation library to obtain vocabulary information;
step 106, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;
step 108, judging whether the context information satisfies the preset context information; if so, proceeding to step 110, and if not, proceeding to step 112;
step 110, desensitizing the vocabulary information;
step 112, performing no desensitization.
In this embodiment, the text to be processed is segmented with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts and their vocabulary positions. The vocabulary information is input into a preset hidden Markov model, which determines the context information closest to the vocabulary information, and this context information is compared with the preset, desensitization-related context information. If the context information satisfies the preset context information, the text to be processed contains characters that match a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Identifying the context of unstructured text with the hidden Markov model thus allows private words to be screened further, which improves the precision with which private words are identified, meets the desensitization requirements of different users, and improves the efficiency of identifying and processing private words. It also avoids locating private data by regular-expression pattern matching, does not force the user to write any data rules, reduces the user's workload, and prevents the human errors of manual labeling.
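The control flow of this embodiment can be sketched in Python with each stage passed in as a callable. The stage implementations used here (whitespace segmentation, a trivial context rule, digit masking) are placeholders chosen only to make the flow runnable; they are assumptions, not the segmentation library or hidden Markov model of the invention.

```python
def desensitize_text(text, segment, infer_context, preset_contexts, desensitize):
    """Run segmentation, context inference and conditional desensitization."""
    tokens = segment(text)            # word segmentation (step 104)
    context = infer_context(tokens)   # context information from the model (step 106)
    if context in preset_contexts:    # context check (step 108)
        return desensitize(tokens)    # desensitize (step 110)
    return tokens                     # no desensitization (step 112)


# Placeholder stages, for illustration only.
def infer_context(tokens):
    return "id_number" if "ID" in tokens else "other"


def mask_digits(tokens):
    return ["*" * len(t) if t.isdigit() else t for t in tokens]


result = desensitize_text("ID 110101", str.split, infer_context,
                          {"id_number"}, mask_digits)
print(result)  # ['ID', '******']
```

Passing the stages as parameters keeps the flow independent of any particular segmenter, model or desensitization rule.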
Example two
As shown in fig. 2, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:
step 202, acquiring a text to be processed and a hidden Markov model;
step 204, performing word segmentation on the text to be processed according to the word segmentation library to obtain vocabulary information;
step 206, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;
step 208, judging whether the context information satisfies the preset context information; if so, proceeding to step 210, and if not, proceeding to step 212;
step 210, judging whether the vocabulary text in the vocabulary information conforms to the privacy vocabulary; if so, proceeding to step 214, and if not, proceeding to step 212;
step 212, performing no desensitization;
step 214, marking the vocabulary text as sensitive data;
step 216, desensitizing the sensitive data according to the desensitization rule.
In this embodiment, when the context information of the vocabulary information satisfies the preset context information, the vocabulary text in the vocabulary information is compared with the privacy vocabulary, that is, the private data structures in the key context are searched for. If the vocabulary text conforms to the privacy vocabulary, the vocabulary text is sensitive data and is desensitized according to the desensitization rule. Sensitive data relevant to data security is thereby masked, which effectively safeguards the security, integrity and usability of the sensitive data in the text.
Specifically, the privacy vocabulary may be fields preset by the user according to needs and experience, such as: name, address, date directly associated with the individual, certificate number, telephone number, email address, account number, transaction amount, IP address, license plate number, and the like.
Further, the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format-preserving encryption algorithm, and a data encryption algorithm. Plain text can be processed by data hiding, rounding, offsetting or mapping, or by algorithms such as hashing. For formatted text, random data conforming to the data format is generated and substituted for the original, or the prefix or suffix is kept unchanged while the rest (for example the latter half) is filled with scrambled data, or the data is encrypted with a format-preserving encryption method. A targeted, differentiated strategy can thus be formulated according to the user's requirements for the sensitive data, which preserves the usability of the sensitive data.
Example three
As shown in fig. 3, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:
step 302, acquiring a target text;
step 304, performing word segmentation on the target text with a maximum matching algorithm to obtain second target vocabulary;
step 306, counting the number of occurrences of the second target vocabulary in the target text;
step 308, updating the word segmentation library with the second target vocabulary whose number of occurrences is greater than or equal to the preset number;
step 310, performing word segmentation on the target text according to the word segmentation library to obtain first target vocabulary and the vocabulary position and semantics corresponding to the first target vocabulary;
step 312, determining a context identifier of the target text according to the first target vocabulary, the vocabulary position, the semantics and the context pattern library;
step 314, constructing a hidden Markov model according to the context identifier;
step 316, acquiring a text to be processed;
step 318, performing word segmentation on the text to be processed according to the word segmentation library to obtain vocabulary information;
step 320, matching the vocabulary information against the hidden Markov model to obtain the matching degree between the vocabulary information and the context identifier;
step 322, judging whether the matching degree is greater than or equal to the matching degree threshold; if so, proceeding to step 324, and otherwise proceeding to step 320;
step 324, taking the context identifier whose matching degree is greater than or equal to the matching degree threshold as the context information;
step 326, judging whether the context information satisfies the preset context information; if so, proceeding to step 328, and if not, proceeding to step 330;
step 328, desensitizing the vocabulary information;
step 330, performing no desensitization.
In this embodiment, before desensitization is performed on a text, a target text of a specified industry is obtained and segmented to obtain a first target vocabulary together with its vocabulary position and semantics. The contexts of first target vocabularies with different positions and different semantics are analyzed according to the context pattern library of the specified industry, and the target text is labeled with context identifiers accordingly, which establishes the association between contexts and vocabulary. The initial probability distribution matrix, the state transition probability distribution matrix, the observation probability distribution matrix, and the like of the hidden Markov model are calculated from the context identifiers and are made to converge by an unsupervised machine learning method, thereby constructing a hidden Markov model for the specified industry. The vocabulary information obtained by segmenting the text to be processed is then input into the hidden Markov model, which calculates the matching degree between the vocabulary information and each context identifier. If a matching degree is greater than or equal to the matching degree threshold, the corresponding context identifier is the most probable context of the vocabulary information; that context identifier is taken as the context information and compared with the preset context information. Sensitive vocabulary is thus recognized by means of its context, which improves the recognition accuracy for private words and protects data privacy.
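The threshold decision described above (steps 320 to 330) can be sketched as follows. This is only an illustration: the matching degrees are assumed to have been computed already by the hidden Markov model, and the names `select_context`, `should_desensitize`, and the example context labels are hypothetical.

```python
from typing import Dict, Optional, Set


def select_context(match_degrees: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the context identifier with the highest matching degree,
    provided that degree reaches the threshold; otherwise return None."""
    candidates = {ctx: deg for ctx, deg in match_degrees.items() if deg >= threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def should_desensitize(
    match_degrees: Dict[str, float],
    threshold: float,
    preset_contexts: Set[str],
) -> bool:
    """Desensitize only when a context passes the threshold and that
    context is one of the preset (sensitive) contexts."""
    context = select_context(match_degrees, threshold)
    return context is not None and context in preset_contexts


if __name__ == "__main__":
    degrees = {"resume": 0.82, "celebration": 0.10}  # illustrative values
    print(select_context(degrees, threshold=0.6))
    print(should_desensitize(degrees, threshold=0.6, preset_contexts={"resume"}))
```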
Specifically, the hidden Markov model is a doubly stochastic process comprising a hidden Markov chain with a certain number of states and a set of observable random functions. It can be represented by five elements, namely two state sets and three probability matrices: the hidden states, the observable states, the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix. The hidden states represent context identifiers, and the observable states represent the word segmentation results of a text. The initial probability distribution matrix represents the probability that the state of the first word of a sentence is a privacy state, the state transition probability distribution matrix represents the probability of transition from one hidden state to another, and the observation probability distribution matrix represents the probability that a keyword occurs in a given hidden state. The vocabulary information obtained by segmenting the text to be processed is input into the hidden Markov model; the matching degree (probability) between the vocabulary information and each context identifier is obtained from the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix, and the hidden state sequence with the maximum probability is determined, that is, the context information closest to the vocabulary information.
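Finding the hidden state sequence with the maximum probability is commonly done with the Viterbi algorithm; the source does not name the decoding algorithm, so its use here is an assumption. The sketch below works in log space, and the three states, the tokens, and all probability values are invented purely for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    floor: float = 1e-12,
) -> Tuple[List[str], float]:
    """Return the most probable hidden state path and its log probability."""

    def lg(p: float) -> float:
        # Floor zero probabilities so that log() is always defined.
        return math.log(max(p, floor))

    best = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Illustrative model: hidden states are context labels, observations are tokens.
STATES = ("FIELD_LABEL", "PRIVATE_VALUE", "PLAIN")
START_P = {"FIELD_LABEL": 0.4, "PRIVATE_VALUE": 0.0, "PLAIN": 0.6}
TRANS_P = {
    "FIELD_LABEL": {"FIELD_LABEL": 0.05, "PRIVATE_VALUE": 0.9, "PLAIN": 0.05},
    "PRIVATE_VALUE": {"FIELD_LABEL": 0.5, "PRIVATE_VALUE": 0.0, "PLAIN": 0.5},
    "PLAIN": {"FIELD_LABEL": 0.15, "PRIVATE_VALUE": 0.05, "PLAIN": 0.8},
}
EMIT_P = {
    "FIELD_LABEL": {"name": 1.0},
    "PRIVATE_VALUE": {"congratulatory": 1.0},
    "PLAIN": {"celebrate": 0.5, "congratulatory": 0.5},
}

if __name__ == "__main__":
    for tokens in (["name", "congratulatory"], ["celebrate", "congratulatory"]):
        path, _ = viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P)
        print(tokens, "->", path)
```

With these invented numbers the same token is decoded as a private value after a field label and as plain text after an ordinary word, which is the kind of context dependence the model is meant to capture.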
The word segmentation library may be a specialized lexicon that already exists in the industry. However, because vocabulary evolves over time, the target text of the specified industry is also segmented with the maximum matching algorithm, the occurrence frequency of each second target vocabulary in the target text is counted, and every second target vocabulary whose occurrence frequency is greater than or equal to the preset frequency is added to the word segmentation library. The word segmentation library is thereby gradually improved, which facilitates accurate segmentation of texts, provides comprehensive data support for constructing the hidden Markov model, ensures the recognition accuracy for the context and the sensitive words of the text to be processed, and prevents omissions in desensitization.
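A minimal sketch of this update loop, assuming forward maximum matching (the source does not say whether the forward or reverse variant is used) against a general-purpose lexicon. The names `forward_max_match`, `update_library`, and `general_lexicon` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry; fall back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def update_library(library: Set[str], tokens: Iterable[str], min_freq: int) -> Set[str]:
    """Add multi-character tokens whose frequency reaches min_freq to the
    segmentation library (mutated in place); return the frequent tokens."""
    counts = Counter(tok for tok in tokens if len(tok) > 1)
    frequent = {word for word, n in counts.items() if n >= min_freq}
    library.update(frequent)
    return frequent


if __name__ == "__main__":
    general_lexicon = {"数据", "脱敏", "数据脱敏", "隐私"}  # assumed seed lexicon
    industry_library: Set[str] = set()
    tokens = forward_max_match("数据脱敏数据脱敏隐私", general_lexicon)
    print(tokens)
    print(update_library(industry_library, tokens, min_freq=2))
```

Here the preset frequency is an absolute count; a relative frequency would work equally well and is a design choice the source leaves open.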
For example, "congratulatory" may express a congratulation in the context of a celebration, but may be a person's name in the context of a résumé, in which case it needs to be masked.
Example four
As shown in fig. 4, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:
step 402, acquiring a text to be processed;
step 404, determining a text type of the text to be processed according to the text to be processed;
step 406, judging whether the text type is matched with a preset text type, if so, entering step 408, and if not, entering step 404;
step 408, calling a hidden Markov model according to a preset text type;
step 410, performing word segmentation on the text to be processed according to the word segmentation library to obtain vocabulary information;
step 412, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;
step 414, determining whether the context information meets the preset context information; if so, entering step 416; if not, entering step 418;
step 416, desensitizing the vocabulary information;
step 418, performing no desensitization.
In this embodiment, the text type of the text to be processed is determined and matched against the preset text types, so that the hidden Markov model corresponding to the matching preset text type is selected. Texts of different types can therefore be given context recognition by a type-specific hidden Markov model, which meets the desensitization requirements of different industries and broadens the application range of the product.
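One way to picture the type-based model selection is a registry keyed by preset text type. The source does not describe how the text type is determined, so the keyword-count classifier below is purely an assumption, as are the names `classify_text_type`, `select_model`, and the placeholder model values.

```python
from typing import Dict, List, Optional

# Hypothetical preset text types with indicative keywords.
KEYWORDS_BY_TYPE: Dict[str, List[str]] = {
    "resume": ["education", "work experience"],
    "medical": ["diagnosis", "patient"],
}

# Placeholder registry: in practice each value would be a trained model object.
MODELS: Dict[str, str] = {"resume": "resume-hmm", "medical": "medical-hmm"}


def classify_text_type(text: str, keywords_by_type: Dict[str, List[str]]) -> Optional[str]:
    """Return the preset type whose keywords occur most often, or None."""
    lowered = text.lower()
    hits = {
        text_type: sum(lowered.count(kw.lower()) for kw in keywords)
        for text_type, keywords in keywords_by_type.items()
    }
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None


def select_model(
    text: str,
    keywords_by_type: Dict[str, List[str]],
    models: Dict[str, str],
) -> Optional[str]:
    """Look up the model registered for the text's type, if any."""
    text_type = classify_text_type(text, keywords_by_type)
    return models.get(text_type) if text_type is not None else None


if __name__ == "__main__":
    print(select_model("Patient diagnosis: influenza", KEYWORDS_BY_TYPE, MODELS))
    print(select_model("Hello world", KEYWORDS_BY_TYPE, MODELS))
```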
Example five
As shown in fig. 5, according to an embodiment of the present invention, a text desensitization method is proposed, including:
step 502, acquiring commonly used industry text data, performing word segmentation with the maximum matching algorithm, and taking the words with higher frequency as the words of an industry lexicon;
step 504, labeling the context state of each word according to the occurrence position and semantics of the vocabulary;
step 506, constructing an initial probability distribution matrix, a state transition probability distribution matrix, and an observation probability distribution matrix according to the context state transition relation corresponding to each word;
step 508, training the hidden Markov model by machine learning until the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix have all converged;
step 510, analyzing the text data to be desensitized, performing word segmentation using the industry lexicon, finding the context with the maximum probability using the hidden Markov model, identifying the words corresponding to the hidden state sequence, and performing desensitization conversion.
In this embodiment, a lexicon for the specific industry is collected; if the industry already has a specialized lexicon, it can be used directly. If not, article texts in the industry are collected, word segmentation is performed using the maximum matching algorithm, and words with a high occurrence frequency are screened statistically and added to the lexicon. Combining the common text information and the lexicon of the industry, a set of states representing different contexts is designed according to the positions and semantics of the words. The initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix of the hidden Markov model are calculated from the labeled context states and are made to converge by an unsupervised machine learning method, yielding the trained hidden Markov model. The text to be processed is then segmented using the industry-specific lexicon, the hidden state sequence with the highest probability is obtained from the segmentation result and the trained model, pattern recognition is performed on that sequence to find the key private data structures in the semantics, and the private data are desensitized. Thus, for unstructured text in a specific industry, private data are not searched for with regular expressions; instead, a word segmentation library and a context pattern library for the industry are used to recognize the industry text and find the private data. Human errors caused by manual labeling are thereby avoided, and the efficiency of recognizing and processing private words is improved.
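Step 506 can be pictured as counting over sequences that have already been labeled with context states. The sketch below covers only this count-based construction of the three matrices; the subsequent unsupervised refinement until convergence (step 508) is not shown, and the function name and toy data are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

LabeledSequence = List[Tuple[str, str]]  # (word, context state) pairs


def normalize(counter: Counter) -> Dict[str, float]:
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def estimate_hmm(sequences: List[LabeledSequence]):
    """Estimate initial, transition, and observation probabilities by
    relative frequency over labeled sequences."""
    start: Counter = Counter()
    trans: Dict[str, Counter] = defaultdict(Counter)
    emit: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        start[seq[0][1]] += 1
        for word, state in seq:
            emit[state][word] += 1
        for (_, prev_state), (_, next_state) in zip(seq, seq[1:]):
            trans[prev_state][next_state] += 1
    return (
        normalize(start),
        {state: normalize(c) for state, c in trans.items()},
        {state: normalize(c) for state, c in emit.items()},
    )


if __name__ == "__main__":
    toy = [
        [("name", "FIELD_LABEL"), ("congratulatory", "PRIVATE_VALUE")],
        [("celebrate", "PLAIN"), ("congratulatory", "PLAIN")],
    ]
    start_p, trans_p, emit_p = estimate_hmm(toy)
    print(start_p)
    print(trans_p)
    print(emit_p)
```

In practice some smoothing would be needed so that unseen words and transitions do not receive zero probability.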
Example six
As shown in fig. 6, according to the embodiment of the second aspect of the present invention, a text desensitization apparatus 600 is proposed, which comprises a memory 602 and a processor 604, the memory 602 stores a computer program, and the processor 604 implements the text desensitization method of the embodiment of the first aspect when executing the computer program. The text desensitization apparatus thus has all the benefits of the text desensitization method of the embodiments of the first aspect.
Example seven
According to an embodiment of the third aspect of the present invention, there is provided an electronic device including the text desensitization apparatus of the embodiment of the second aspect, which, when executing the computer program, is capable of performing the steps of: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to the word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information meets the preset context information.
The electronic device provided by the invention segments the text to be processed against the word segmentation library to obtain vocabulary information, which includes a plurality of vocabulary texts and their corresponding vocabulary positions and semantics. The vocabulary information is input into a preset hidden Markov model for calculation, the context information closest to the vocabulary information is determined, and that context information is compared with the preset context information related to desensitization. If the context information meets the preset context information, the text to be processed contains characters that satisfy the desensitization context rule; the private data structure in the key context is then located and the private data are desensitized, so that the sensitive data are transformed according to the desensitization rule. The context of unstructured text is thus recognized by the hidden Markov model, which further screens private words, improves the recognition accuracy for private words, and meets the desensitization requirements of different users. The efficiency of recognizing and processing private words is effectively improved, private data need not be searched for with regular expressions, users are not forced to write any data rules, the users' workload is reduced, and human errors from manual labeling are avoided.
Example eight
As shown in fig. 7, an industry data collection server 702, an industry thesaurus analysis server 704, a model training server 706, a desensitization data analysis server 708, and a desensitization processing server 710 are deployed in an electronic device 700 according to an embodiment of the present invention.
Specifically, the industry data collection server 702 is started to collect industry text; word segmentation is performed using the maximum matching method, the occurrence frequency of each word is statistically analyzed, and the words with higher frequency are stored. The industry thesaurus analysis server 704 is then started to label the context states. The model training server 706 is started, and the hidden Markov model is trained using an unsupervised machine learning algorithm until the model data converge. The desensitization data analysis server 708 and the desensitization processing server 710 are then started: the text to be processed is read in and segmented, the context with the maximum probability is found by matching against the model data, and the private data are extracted for desensitization.
In this embodiment, a hidden Markov model is established using an industry-specific language library and a machine learning method, the most probable semantics of the unstructured text data to be desensitized are found, pattern recognition is performed, and the recognized private information is replaced. This addresses the problem that traditional desensitization methods for structured data are ineffective for the semi-structured data that are now increasingly common, effectively reduces the technical requirements and workload for operators, and reduces the probability of human error.
Example nine
According to an embodiment of the fourth aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text desensitization method according to an embodiment of the first aspect. The computer readable storage medium thus has all the benefits of the text desensitization method of embodiments of the first aspect.
In the description herein, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance unless explicitly stated or limited otherwise; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of text desensitization, comprising:
acquiring a text to be processed and a hidden Markov model;
performing word segmentation processing on the text to be processed according to a word segmentation library to obtain vocabulary information;
determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;
and desensitizing the vocabulary information based on the context information meeting preset context information.
2. The text desensitization method according to claim 1, wherein said step of desensitizing the vocabulary information comprises:
comparing vocabulary text and private vocabulary in the vocabulary information;
based on the vocabulary text conforming to the private vocabulary, marking the vocabulary text as sensitive data;
and desensitizing the sensitive data according to a desensitizing rule.
3. The text desensitization method according to claim 1, wherein said step of acquiring the hidden Markov model comprises:
acquiring a target text;
performing word segmentation processing on the target text according to the word segmentation library to obtain a first target vocabulary and a vocabulary position and semantics corresponding to the first target vocabulary;
determining a context identifier of the target text according to the first target vocabulary, the vocabulary position, the semantics, and a context pattern library;
and constructing the hidden Markov model according to the context identifier.
4. The text desensitization method according to claim 3, wherein the step of determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model comprises:
matching the vocabulary information with the hidden Markov model to obtain the matching degree of the vocabulary information and the context identifier;
based on the matching degree being greater than or equal to a threshold matching degree, taking the context identifier corresponding to the matching degree greater than or equal to the threshold matching degree as the context information.
5. The text desensitization method according to claim 3, wherein said step of obtaining target text is followed by further comprising:
performing word segmentation processing on the target text by adopting a maximum matching algorithm to obtain a second target vocabulary;
counting the occurrence frequency of the second target vocabulary in the target text;
and updating the word segmentation library according to the second target vocabulary with the occurrence frequency larger than or equal to the preset frequency.
6. The text desensitization method according to any one of claims 1 to 5, wherein said step of acquiring the hidden Markov model specifically comprises:
determining the text type of the text to be processed according to the text to be processed;
matching the text type with a preset text type;
and calling the hidden Markov model according to the preset text type based on the matching of the text type and the preset text type.
7. Text desensitization method according to any of claims 2 to 5,
the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format preserving encryption algorithm, and a data encryption algorithm.
8. A text desensitizing apparatus, comprising: a memory storing a computer program and a processor executing the computer program to perform a text desensitization method according to any of claims 1 to 7.
9. An electronic device, comprising:
the text desensitization device of claim 8, the text desensitization device when executing a computer program operable to perform the steps of:
acquiring a text to be processed and a hidden Markov model;
performing word segmentation processing on the text to be processed according to a word segmentation library to obtain vocabulary information;
determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;
and desensitizing the vocabulary information based on the context information meeting preset context information.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text desensitization method according to any of claims 1 to 7.
CN202010795184.0A 2020-08-10 2020-08-10 Text desensitization method, apparatus, electronic device and computer-readable storage medium Pending CN112001174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795184.0A CN112001174A (en) 2020-08-10 2020-08-10 Text desensitization method, apparatus, electronic device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795184.0A CN112001174A (en) 2020-08-10 2020-08-10 Text desensitization method, apparatus, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN112001174A true CN112001174A (en) 2020-11-27

Family

ID=73462902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795184.0A Pending CN112001174A (en) 2020-08-10 2020-08-10 Text desensitization method, apparatus, electronic device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN112001174A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170680A (en) * 2017-12-29 2018-06-15 厦门市美亚柏科信息股份有限公司 Keyword recognition method, terminal device and storage medium based on Hidden Markov Model
CN109033150A (en) * 2018-06-12 2018-12-18 平安科技(深圳)有限公司 Sensitive word verification method, device, computer equipment and storage medium

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800465A (en) * 2021-02-09 2021-05-14 第四范式(北京)技术有限公司 Method and device for processing text data to be labeled, electronic equipment and medium
CN113033217B (en) * 2021-04-19 2023-09-15 广州欢网科技有限责任公司 Automatic shielding translation method and device for subtitle sensitive information
CN113033217A (en) * 2021-04-19 2021-06-25 广州欢网科技有限责任公司 Method and device for automatically shielding and translating sensitive subtitle information
CN113407989A (en) * 2021-05-26 2021-09-17 天九共享网络科技集团有限公司 Data desensitization method and device, electronic equipment and storage medium
CN113627535A (en) * 2021-08-12 2021-11-09 福建中信网安信息科技有限公司 Data grading classification system and method based on data security and privacy protection
CN115050390B (en) * 2022-08-12 2022-12-06 杭州海康威视数字技术股份有限公司 Voice privacy protection method and device, electronic equipment and storage medium
CN115050390A (en) * 2022-08-12 2022-09-13 杭州海康威视数字技术股份有限公司 Voice privacy protection method and device, electronic equipment and storage medium
CN115470509A (en) * 2022-11-14 2022-12-13 优铸科技(北京)有限公司 Display method, device and medium for workshop billboard
CN116522403A (en) * 2023-07-04 2023-08-01 大白熊大数据科技(常熟)有限公司 Interactive information desensitization method and server for focusing big data privacy security
CN116522403B (en) * 2023-07-04 2023-08-29 大白熊大数据科技(常熟)有限公司 Interactive information desensitization method and server for focusing big data privacy security
CN117272996A (en) * 2023-11-23 2023-12-22 山东网安安全技术有限公司 Data desensitization system
CN117272996B (en) * 2023-11-23 2024-02-27 山东网安安全技术有限公司 Data desensitization system
CN117951747A (en) * 2024-03-26 2024-04-30 成都飞机工业(集团)有限责任公司 Self-adaptive desensitization method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN112001174A (en) Text desensitization method, apparatus, electronic device and computer-readable storage medium
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
CN109446513B (en) Extraction method of events in text based on natural language understanding
US7945525B2 (en) Methods for obtaining improved text similarity measures which replace similar characters with a string pattern representation by using a semantic data tree
CN112926327B (en) Entity identification method, device, equipment and storage medium
US20070050356A1 (en) Query construction for semantic topic indexes derived by non-negative matrix factorization
CN110457405B (en) Database auditing method based on blood relationship
US10860565B2 (en) Database update and analytics system
CN104471568A (en) Learning-based processing of natural language questions
CN112651236B (en) Method and device for extracting text information, computer equipment and storage medium
CN116401464B (en) Professional user portrait construction method, device, equipment and storage medium
CN113990352A (en) User emotion recognition and prediction method, device, equipment and storage medium
CN112036185A (en) Method and device for constructing named entity recognition model based on industrial enterprise
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN112380848B (en) Text generation method, device, equipment and storage medium
CN110941713B (en) Self-optimizing financial information block classification method based on topic model
Kapugama et al. Enhancing Wikipedia search results using text mining
Majumder et al. Event extraction from biomedical text using crf and genetic algorithm
CN114666078B (en) Method and system for detecting SQL injection attack, electronic equipment and storage medium
CN113515587A (en) Object information extraction method and device, computer equipment and storage medium
CN114547321A (en) Knowledge graph-based answer generation method and device and electronic equipment
CN112329478A (en) Method, device and equipment for constructing causal relationship determination model
CN114091456B (en) Intelligent positioning method and system for quotation contents
JP2014112306A (en) Demand sentence extract device, demand content identification model learning device, method and program
CN113094469B (en) Text data analysis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201127
