CN112001174A

CN112001174A - Text desensitization method, apparatus, electronic device and computer-readable storage medium

Info

Publication number: CN112001174A
Application number: CN202010795184.0A
Authority: CN
Inventors: 代庆国; 罗英群; 吕令广
Original assignee: ZTE ICT Technologies Co Ltd
Current assignee: ZTE ICT Technologies Co Ltd
Priority date: 2020-08-10
Filing date: 2020-08-10
Publication date: 2020-11-27

Abstract

The invention provides a text desensitization method, a text desensitization apparatus, an electronic device, and a computer-readable storage medium. The text desensitization method comprises: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to a word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information satisfies preset context information. Because the hidden Markov model identifies the context of unstructured text, private words can be screened further, which improves the accuracy and efficiency of private-word recognition and meets the desensitization requirements of different users. Private data no longer has to be located through regular-expression matching, users are not forced to write any data rules, their workload is reduced, and human errors from manual labeling are avoided.

Description

Text desensitization method, apparatus, electronic device and computer-readable storage medium

Technical Field

The invention relates to the technical field of electronic equipment, in particular to a text desensitization method, a text desensitization device, electronic equipment and a computer readable storage medium.

Background

In the prior art, to keep data safe in use, private data is generally replaced through desensitization. Most existing desensitization methods target structured data, such as databases, and identify the data to be desensitized with fixed patterns, for example by specifying the field names of database tables.

As the protection of industrial data privacy becomes increasingly important, the desensitization approaches used by industrial users show the following problems. Most existing data processing methods target structured data, and most semi-structured data is handled by regular-expression pattern matching to find the key data to desensitize. Most existing sensitive-data identification methods rely on rule discovery and manual definition. Rule-based methods effectively identify sensitive data that conforms to the defined rules but miss a large amount of irregular sensitive data, which lowers identification accuracy. Manual definition, on the other hand, increases the burden on the user when the data volume is large and reduces the usability of the system.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art or the related art.

To this end, a first aspect of the invention is directed to a method of text desensitization.

A second aspect of the invention is to propose a text desensitizing device.

A third aspect of the invention is directed to an electronic device.

A fourth aspect of the invention is directed to a computer-readable storage medium.

In view of this, according to a first aspect of the present invention, there is provided a text desensitization method, comprising: acquiring a text to be processed and a hidden Markov model; performing word segmentation on the text to be processed according to a word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information when the context information satisfies preset context information.

In the text desensitization method provided by the invention, the text to be processed is segmented with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts together with their vocabulary positions and semantics. The vocabulary information is input into a preset hidden Markov model (HMM), which determines the context information closest to the vocabulary information, and this context information is compared with the preset context information associated with desensitization. If the context information satisfies the preset context information, the text to be processed contains content that matches a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Because the hidden Markov model identifies the context of unstructured text, private words can be screened further, which improves the accuracy and efficiency of private-word recognition and meets the desensitization requirements of different users. Private data no longer has to be located through regular-expression matching, users are not forced to write any data rules, their workload is reduced, and human errors from manual labeling are avoided.

In addition, the text desensitization method in the above technical scheme provided by the invention can also have the following additional technical features:

In the above technical solution, further, the step of desensitizing the vocabulary information specifically includes: comparing the vocabulary text in the vocabulary information with the privacy vocabulary; marking the vocabulary text as sensitive data when the vocabulary text conforms to the privacy vocabulary; and desensitizing the sensitive data according to desensitization rules.

In the technical scheme, under the condition that the context information of the vocabulary information meets the preset context information, the vocabulary text and the privacy vocabulary in the vocabulary information are compared, namely the privacy data structure in the key context is searched. And if the vocabulary text conforms to the private vocabulary, the vocabulary is indicated as sensitive data, and desensitization processing is carried out on the sensitive data according to desensitization rules. Therefore, sensitive data related to data security is shielded, and the security, integrity and usability of the sensitive data in the text are effectively guaranteed.

Specifically, the privacy vocabulary may be fields preset by the user according to needs and experience, such as: name, address, date directly associated with the individual, telephone number, email address, account number, transaction amount, IP address, license plate number, and the like.
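The comparison between vocabulary text and privacy vocabulary can be illustrated with a minimal sketch. The text does not specify how the comparison is made; the sketch assumes that segmentation attaches a semantic label to each word and that the privacy vocabulary is a user-preset set of such labels. The `Word` class, its fields, and `mark_sensitive` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Word:
    """One entry of vocabulary information (hypothetical structure)."""
    text: str
    position: int
    semantic: str          # assumed semantic label produced by segmentation
    sensitive: bool = False


def mark_sensitive(words: List[Word], privacy_vocabulary: Set[str]) -> List[Word]:
    """Mark a word as sensitive data when its label is in the privacy vocabulary."""
    for word in words:
        word.sensitive = word.semantic in privacy_vocabulary
    return words


# Privacy vocabulary preset by the user (labels taken from the list above).
privacy_vocabulary = {"name", "telephone number", "email address"}
words = [
    Word("contact", 0, "label"),
    Word("Alice", 1, "name"),
    Word("13812345678", 2, "telephone number"),
]
marked = mark_sensitive(words, privacy_vocabulary)
```

Only the words marked here would be passed on to the desensitization rules; the rest of the text stays unchanged.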

In any of the above technical solutions, further, before the step of obtaining the text to be processed, the method further includes: acquiring a target text; performing word segmentation processing on the target text according to the word segmentation library to obtain a first target word and a word position and a semantic meaning corresponding to the first target word; determining a context identifier of a target text according to a first target vocabulary, a vocabulary position, a semantic and context pattern library; and constructing a hidden Markov model according to the context identifier.

According to the technical scheme, before desensitization processing is carried out on a text, a target text of a specified industry is obtained, word segmentation is carried out on the target text, and a first target word and a word position corresponding to the first target word are obtained. And analyzing the context of the first target vocabulary with different vocabulary positions and different semantics according to the context pattern library of the specified industry, and marking the context identification of the target text according to the context to establish the relevance of the context and the vocabulary. And calculating an initial probability distribution matrix, a state transition probability distribution matrix, an observation probability distribution matrix and the like of the hidden Markov model according to the context identifier, and converging the matrixes by using an unsupervised machine learning method so as to construct the hidden Markov model in the specified industry. The method and the device have the advantages that the context identification is conveniently carried out on the text to be processed according to the hidden Markov model, the dynamic desensitization of sensitive information is realized according to the actual context, the identification processing efficiency of privacy words is effectively improved while the security of sensitive data in the text is guaranteed, and the desensitization requirements of different users are met.
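As a rough illustration of how the three matrices could be initialized from target text that has already been marked with context identifiers, the sketch below uses simple relative-frequency counting. This is an assumption: the text does not give the estimation formula, and the subsequent unsupervised convergence step is not shown. The function name `estimate_hmm` and the nested-dictionary representation are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sequence = List[Tuple[str, str]]  # [(word, context identifier), ...]


def estimate_hmm(sequences: List[Sequence]):
    """Estimate (initial, transition, observation) probabilities by counting."""
    start: Counter = Counter()
    trans: Dict[str, Counter] = defaultdict(Counter)
    emit: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        start[seq[0][1]] += 1                      # state of the first word
        for word, state in seq:
            emit[state][word] += 1                 # word observed in state
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                     # transition s1 -> s2

    def normalize(counter: Counter) -> Dict[str, float]:
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(start)
    A = {state: normalize(c) for state, c in trans.items()}
    B = {state: normalize(c) for state, c in emit.items()}
    return pi, A, B


training = [
    [("name", "title"), ("Alice", "name")],
    [("name", "title"), ("Bob", "name")],
    [("hello", "other")],
]
pi, A, B = estimate_hmm(training)
```

In practice the counts would likely need smoothing so that unseen words and transitions do not receive zero probability; that choice is left open here.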

In any of the above technical solutions, further, the step of determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden markov model specifically includes: matching the vocabulary information with the hidden Markov model to obtain the matching degree of the vocabulary information and the context identifier; and based on the matching degree being greater than or equal to the threshold value of the matching degree, using the context identifier corresponding to the matching degree greater than or equal to the threshold value of the matching degree as the context information.

According to the technical scheme, after vocabulary information is input into a hidden Markov model, the matching degree of the vocabulary information and a plurality of context identifications is calculated and counted through the hidden Markov model. If the matching degree is greater than or equal to the matching degree threshold, the context identifier of the matching degree is the most probable context of the vocabulary information, and the context identifier corresponding to the matching degree greater than or equal to the matching degree threshold is used as the context information to match the preset context information, so that the sensitive vocabulary is recognized by using the context, the recognition precision of the privacy words is improved, and the effect of protecting the data privacy is achieved. The threshold of the matching degree may be set to a maximum value of the obtained multiple matching degrees, or may be set according to an actual requirement of the user.
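The threshold rule can be sketched as follows, assuming the matching degrees are already available as a mapping from context identifier to score. When no threshold is supplied, the sketch falls back to the maximum of the obtained matching degrees, as the paragraph above allows. The function name `select_contexts` is hypothetical.

```python
from typing import Dict, List, Optional


def select_contexts(match_degrees: Dict[str, float],
                    threshold: Optional[float] = None) -> List[str]:
    """Return the context identifiers whose matching degree reaches the threshold."""
    if not match_degrees:
        return []
    if threshold is None:
        # Default: only the best-matching context identifier(s) qualify.
        threshold = max(match_degrees.values())
    return [ctx for ctx, degree in match_degrees.items() if degree >= threshold]


degrees = {"resume": 0.7, "celebration": 0.2}
best_only = select_contexts(degrees)
user_defined = select_contexts(degrees, threshold=0.1)
```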

Specifically, the hidden Markov model is a doubly stochastic process: a hidden Markov chain with a certain number of states, together with an explicit set of random functions. It can be described by five elements, namely 2 state sets and 3 probability matrices: the hidden states, the observable states, the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix. A hidden state represents a context identifier, such as title/xx or identification number/xx, and an observable state represents a word segmentation result of a text. The initial probability distribution matrix represents the probability that the state of the first word of a sentence is a privacy state, the state transition probability distribution matrix represents the probability of transition from one hidden state to another, and the observation probability distribution matrix represents the probability that a keyword occurs in a given hidden state. The vocabulary information obtained by segmenting the text to be processed is then input into the hidden Markov model. Using the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix, the model yields the matching degree (probability) between the vocabulary information and each context identifier and determines the hidden state sequence with the maximum probability, that is, the context information closest to the vocabulary information.
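Finding the hidden state sequence with the maximum probability is commonly done with the Viterbi algorithm. The text does not name the algorithm it uses, so the following is a sketch under that assumption. The toy states (`label`, `value`), the probabilities, and the `floor` parameter used to avoid taking the logarithm of zero are all illustrative.

```python
import math
from typing import Dict, List, Tuple

Matrix = Dict[str, Dict[str, float]]


def viterbi(obs: List[str], states: List[str], pi: Dict[str, float],
            A: Matrix, B: Matrix, floor: float = 1e-12) -> Tuple[List[str], float]:
    """Return the most probable hidden state sequence and its probability."""
    if not obs:
        return [], 0.0

    def lg(p: float) -> float:
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    # Log score of the best path ending in each state after the first word.
    scores = {s: lg(pi.get(s, 0.0)) + lg(B.get(s, {}).get(obs[0], 0.0))
              for s in states}
    back = []  # one dict of back-pointers per later word
    for word in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: scores[p] + lg(A.get(p, {}).get(s, 0.0)))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + lg(A.get(prev, {}).get(s, 0.0))
                             + lg(B.get(s, {}).get(word, 0.0)))
        scores = new_scores
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(scores[last])


states = ["label", "value"]
pi = {"label": 0.8, "value": 0.2}
A = {"label": {"label": 0.1, "value": 0.9}, "value": {"label": 0.5, "value": 0.5}}
B = {"label": {"name": 0.9, "Alice": 0.1}, "value": {"name": 0.1, "Alice": 0.9}}
path, prob = viterbi(["name", "Alice"], states, pi, A, B)
```

Working in log space keeps long sentences from underflowing; the returned probability is converted back only at the end.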

In any of the above technical solutions, further, after the step of obtaining the target text, the method further includes: performing word segmentation processing on the target text by adopting a maximum matching algorithm to obtain a second target word; counting the occurrence frequency of a second target vocabulary in the target text; and updating the word segmentation library according to the second target vocabulary with the occurrence frequency greater than or equal to the preset frequency.

In this technical solution, the word segmentation library may be an existing industry-specific lexicon. To keep up with new and changing words, however, the target text of the specified industry is segmented with a maximum matching algorithm, the occurrence frequency of each second target word in the target text is counted, and the second target words whose occurrence frequency is greater than or equal to the preset frequency are added to the word segmentation library. The library is thus gradually refined, which supports accurate segmentation, provides comprehensive data for constructing the hidden Markov model, ensures recognition accuracy for both the context of the text to be processed and the sensitive words, and prevents omissions during desensitization. Dynamic desensitization of sensitive information is thereby realized, and the security, integrity and usability of sensitive data in the text are effectively guaranteed.
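The segmentation and library update can be sketched as below, assuming the forward variant of maximum matching (the text only says "maximum matching algorithm") and a candidate dictionary against which the target text is matched. The names `forward_max_match`, `update_library`, `max_len`, and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy segmentation: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def update_library(library: Set[str], words: Iterable[str], min_count: int) -> Set[str]:
    """Add words whose occurrence frequency reaches the preset frequency."""
    counts = Counter(words)
    return library | {w for w, c in counts.items() if c >= min_count}


segmented = forward_max_match("abcd", {"ab", "abc", "cd"})
library = update_library({"x"}, ["ab", "ab", "c"], min_count=2)
```

The example shows the greedy nature of the method: "abc" is preferred over "ab" followed by "cd" because the longest match at the current position wins.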

In any of the above technical solutions, further, the step of obtaining the hidden markov model specifically includes: determining the text type of the text to be processed according to the text to be processed; matching the text type with a preset text type; and calling the hidden Markov model according to the preset text type based on the matching of the text type and the preset text type.

In the technical scheme, the text type of the text to be processed is determined, and the text type is matched with the preset text type, so that the hidden Markov model is selected according to the preset text type with high matching degree, the texts with different types can adopt the specific hidden Markov model to perform context recognition, desensitization requirements of different industries are met, and the application range of products is expanded.
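A minimal sketch of selecting a model by text type follows. How the text type is determined is not specified, so the keyword-count classifier and the keyword lists here are purely hypothetical placeholders; the part that reflects the paragraph above is the lookup of a type-specific model keyed by preset text type.

```python
from typing import Dict, List, Optional

# Hypothetical keyword lists for each preset text type.
TYPE_KEYWORDS: Dict[str, List[str]] = {
    "resume": ["education", "experience"],
    "contract": ["party", "clause"],
}


def detect_text_type(words: List[str]) -> Optional[str]:
    """Pick the preset text type with the most keyword hits, if any."""
    scores = {t: sum(w in kws for w in words) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=lambda t: scores[t])
    return best if scores[best] > 0 else None


def get_model(words: List[str], models: Dict[str, object]) -> Optional[object]:
    """Return the model registered for the detected text type, or None."""
    text_type = detect_text_type(words)
    return models.get(text_type) if text_type is not None else None


models = {"resume": "hmm_resume", "contract": "hmm_contract"}  # stand-in objects
chosen = get_model(["work", "experience"], models)
```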

In any of the above embodiments, further, the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format preserving encryption algorithm, and a data encryption algorithm.

In the technical scheme, the plain text can be processed in a data hiding, rounding, shifting and mapping mode or by algorithms such as Hash and the like. For a text with a format, random data meeting the format is generated according to the data format and is replaced by the random data, or the prefix/suffix is kept unchanged, the latter half part is filled by scrambling data, or the data is encrypted by a format-preserving encryption method. Therefore, a targeted and differentiated strategy is formulated according to the requirements of the user on the sensitive data, and the availability of the sensitive data is ensured.
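Three of the plain-text rules mentioned above (hiding with prefix/suffix kept, hashing, and rounding) can be sketched as follows. The parameter defaults, the choice of SHA-256, and the function names are assumptions made for illustration only.

```python
import hashlib


def mask(value: str, keep_prefix: int = 3, keep_suffix: int = 4, fill: str = "*") -> str:
    """Hide the middle of a value while keeping its prefix and suffix."""
    if len(value) <= keep_prefix + keep_suffix:
        return fill * len(value)           # too short: hide everything
    hidden = len(value) - keep_prefix - keep_suffix
    return value[:keep_prefix] + fill * hidden + value[len(value) - keep_suffix:]


def hash_value(value: str, salt: str = "") -> str:
    """Replace a value with a salted SHA-256 digest (one-way mapping)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def round_amount(amount: float, unit: int = 100) -> int:
    """Round an amount to the nearest multiple of `unit`."""
    return int(round(amount / unit) * unit)


masked_phone = mask("13812345678")
digest = hash_value("alice@example.com", salt="s1")
rounded = round_amount(1234.56)
```

Masking keeps the value recognizable to a human reader, hashing keeps equal inputs comparable without revealing them, and rounding keeps amounts usable for coarse statistics.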

According to a second aspect of the present invention, there is provided a text desensitization apparatus, comprising a memory, a processor, the memory storing a computer program, the processor implementing the text desensitization method of any one of the above when executing the computer program. The text desensitization device thus has all the benefits of the text desensitization method of any of the above.

According to a third aspect of the invention, there is provided an electronic device comprising: the text desensitization apparatus of any of the above, the text desensitization apparatus being capable of performing the following steps when executing the computer program: acquiring a text to be processed and a hidden Markov model; performing word segmentation processing on the text to be processed according to the word segmentation library to obtain word information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information based on the context information meeting the preset context information.

The electronic device provided by the invention segments the text to be processed with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts together with their vocabulary positions and semantics. The vocabulary information is input into a preset hidden Markov model, which determines the context information closest to the vocabulary information, and this context information is compared with the preset context information associated with desensitization. If the context information satisfies the preset context information, the text to be processed contains content that matches a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Because the hidden Markov model identifies the context of unstructured text, private words can be screened further, which improves the accuracy and efficiency of private-word recognition and meets the desensitization requirements of different users. Private data no longer has to be located through regular-expression matching, users are not forced to write any data rules, their workload is reduced, and human errors from manual labeling are avoided.

According to a fourth aspect of the invention, a computer-readable storage medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text desensitization method according to any of the preceding claims. The computer readable storage medium thus has all the benefits of any of the text desensitization methods described above.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 shows a text desensitization method flow diagram of one embodiment of the present invention;

FIG. 2 shows a text desensitization method flow diagram of yet another embodiment of the present invention;

FIG. 3 shows a text desensitization method flow diagram of yet another embodiment of the present invention;

FIG. 4 shows a text desensitization method flow diagram of yet another embodiment of the present invention;

FIG. 5 is a flow diagram illustrating a text desensitization method according to an embodiment of the present invention;

FIG. 6 shows a schematic block diagram of a text desensitization apparatus of an embodiment of the present invention;

FIG. 7 shows a schematic block diagram of an electronic device in accordance with a specific embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

Text desensitization methods, text desensitization apparatuses, electronic devices, and computer-readable storage media according to some embodiments of the present invention are described below with reference to FIGS. 1 to 7.

Example one

As shown in FIG. 1, according to an embodiment of the first aspect of the present invention, there is provided a text desensitization method, the method comprising:

step 102, acquiring a text to be processed and a hidden Markov model;

step 104, performing word segmentation processing on the text to be processed according to the word segmentation library to obtain vocabulary information;

step 106, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;

step 108, judging whether the context information meets preset context information, if so, entering step 110, and if not, entering step 112;

step 110, desensitizing the vocabulary information;

step 112, no desensitization is performed.

In this embodiment, the text to be processed is segmented with the word segmentation library to obtain vocabulary information, which comprises a plurality of vocabulary texts and their corresponding vocabulary positions. The vocabulary information is input into a preset hidden Markov model, which determines the context information closest to the vocabulary information, and this context information is compared with the preset context information associated with desensitization. If the context information satisfies the preset context information, the text to be processed contains content that matches a desensitization context rule; the private data structures in that key context are then located and desensitized, so that the sensitive data is transformed according to the desensitization rules. Because the hidden Markov model identifies the context of unstructured text, private words can be screened further, which improves the accuracy and efficiency of private-word recognition and meets the desensitization requirements of different users. Private data no longer has to be located through regular-expression matching, users are not forced to write any data rules, their workload is reduced, and human errors from manual labeling are avoided.
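The control flow of steps 104 to 112 can be summarized in a short sketch. Segmentation, context inference (the hidden Markov model in the embodiment), and the desensitization rule are passed in as functions; the toy stand-ins below are hypothetical and exist only so that the sketch runs end to end.

```python
from typing import Callable, List, Set


def desensitize_text(
    text: str,
    segment: Callable[[str], List[str]],
    infer_context: Callable[[List[str]], str],
    preset_contexts: Set[str],
    desensitize: Callable[[List[str]], List[str]],
) -> List[str]:
    """Segment, infer the context, and desensitize only in a preset context."""
    words = segment(text)                  # step 104
    context = infer_context(words)         # step 106
    if context in preset_contexts:         # step 108
        return desensitize(words)          # step 110
    return words                           # step 112: no desensitization


# Toy stand-ins (hypothetical), not the components of the embodiment.
def toy_segment(text: str) -> List[str]:
    return text.split()


def toy_context(words: List[str]) -> str:
    return "resume" if "name" in words else "other"


def toy_mask(words: List[str]) -> List[str]:
    return ["***" if w.istitle() else w for w in words]


masked = desensitize_text("name Alice engineer", toy_segment, toy_context,
                          {"resume"}, toy_mask)
untouched = desensitize_text("hello Alice", toy_segment, toy_context,
                             {"resume"}, toy_mask)
```

The second call shows the point of the context check: the same word "Alice" is left as-is when the inferred context is not one of the preset desensitization contexts.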

Example two

As shown in FIG. 2, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:

step 202, acquiring a text to be processed and a hidden Markov model;

step 204, performing word segmentation processing on the text to be processed according to the word segmentation library to obtain vocabulary information;

step 206, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;

step 208, judging whether the context information meets preset context information, if so, entering step 210, and if not, entering step 212;

step 210, judging whether the vocabulary text in the vocabulary information accords with the private vocabulary, if so, entering step 214, and if not, entering step 212;

step 212, no desensitization treatment is performed;

step 214, marking the vocabulary text as sensitive data;

step 216, desensitizing the sensitive data according to the desensitization rule.

In this embodiment, in the case that the context information of the vocabulary information satisfies the preset context information, the vocabulary text in the vocabulary information is compared with the privacy vocabulary, that is, the privacy data structure in the key context is searched. And if the vocabulary text conforms to the private vocabulary, the vocabulary is indicated as sensitive data, and desensitization processing is carried out on the sensitive data according to desensitization rules. Therefore, sensitive data related to data security is shielded, and the security, integrity and usability of the sensitive data in the text are effectively guaranteed.

Specifically, the privacy vocabulary may be fields preset by the user according to needs and experience, such as: name, address, date directly associated with the individual, certificate number, telephone number, email address, account number, transaction amount, IP address, license plate number, and the like.

Further, the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format preserving encryption algorithm, and a data encryption algorithm. The plain text can be processed by data hiding, rounding, offsetting and mapping, or by algorithms such as Hash. For a text with a format, random data meeting the format is generated according to the data format and is replaced by the random data, or the prefix/suffix is kept unchanged, the latter half part is filled by scrambling data, or the data is encrypted by a format-preserving encryption method. Therefore, a targeted and differentiated strategy is formulated according to the requirements of the user on the sensitive data, and the availability of the sensitive data is ensured.
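For formatted text, replacement with random data that satisfies the same format can be sketched as below. The sketch assumes the "format" is the per-character class (digit, upper-case letter, lower-case letter, separator). It is a one-way random replacement, not format-preserving encryption, since the original value cannot be recovered. The function name `random_same_format` is hypothetical.

```python
import random
import string


def random_same_format(value: str, rng: random.Random) -> str:
    """Replace digits and ASCII letters with random ones; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isascii() and ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # separators such as "-" or "." keep the layout
    return "".join(out)


rng = random.Random(0)  # seeded only to make the example repeatable
fake_plate = random_same_format("AB-1234", rng)
```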

Example three

As shown in FIG. 3, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:

step 302, acquiring a target text;

step 304, performing word segmentation processing on the target text by adopting a maximum matching algorithm to obtain a second target word;

step 306, counting the occurrence frequency of a second target vocabulary in the target text;

step 308, updating the word segmentation library according to a second target word with the occurrence frequency greater than or equal to the preset frequency;

step 310, performing word segmentation processing on the target text according to the word segmentation library to obtain a first target word and a word position and a semantic meaning corresponding to the first target word;

step 312, determining a context identifier of the target text according to the first target vocabulary, the vocabulary position, the semantic meaning and the context pattern library;

step 314, constructing a hidden Markov model according to the context identifier;

step 316, acquiring a text to be processed;

step 318, performing word segmentation processing on the text to be processed according to the word segmentation library to obtain vocabulary information;

step 320, matching the vocabulary information with the hidden Markov model to obtain the matching degree of the vocabulary information and the context identifier;

step 322, determining whether the matching degree is greater than or equal to the threshold value of the matching degree, if so, entering step 324, otherwise, entering step 320;

step 324, regarding the context identifier corresponding to the matching degree greater than or equal to the threshold value of the matching degree as context information;

step 326, judging whether the context information meets the preset context information, if so, entering step 328, and if not, entering step 330;

step 328, desensitizing the vocabulary information;

step 330, no desensitization is performed.

In the embodiment, before desensitization processing is performed on the text, a target text of a specified industry is obtained, and word segmentation is performed on the target text to obtain a first target word and a word position and a semantic meaning corresponding to the first target word. And analyzing the context of the first target vocabulary with different vocabulary positions and different semantics according to the context pattern library of the specified industry, and marking the context identification of the target text according to the context to establish the relevance of the context and the vocabulary. And calculating an initial probability distribution matrix, a state transition probability distribution matrix, an observation probability distribution matrix and the like of the hidden Markov model according to the context identifier, and converging the matrixes by using an unsupervised machine learning method so as to construct the hidden Markov model in the specified industry. And then, inputting the vocabulary information after the word segmentation of the text to be processed into a hidden Markov model, and calculating the matching degree of the vocabulary information and the context identifier through the hidden Markov model. If the matching degree is greater than or equal to the matching degree threshold, the context identifier of the matching degree is the most probable context of the vocabulary information, and the context identifier corresponding to the matching degree greater than or equal to the matching degree threshold is used as the context information to match the preset context information, so that the sensitive vocabulary is recognized by using the context, the recognition precision of the privacy words is improved, and the effect of protecting the data privacy is achieved.

Specifically, the hidden Markov model is a doubly stochastic process: a hidden Markov chain with a certain number of states, together with an explicit set of random functions. It can be described by five elements, namely 2 state sets and 3 probability matrices: the hidden states, the observable states, the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix. A hidden state represents a context identifier, and an observable state represents a word segmentation result of a text. The initial probability distribution matrix represents the probability that the state of the first word of a sentence is a privacy state, the state transition probability distribution matrix represents the probability of transition from one hidden state to another, and the observation probability distribution matrix represents the probability that a keyword occurs in a given hidden state. The vocabulary information obtained by segmenting the text to be processed is then input into the hidden Markov model. Using the initial probability distribution matrix, the state transition probability distribution matrix, and the observation probability distribution matrix, the model yields the matching degree (probability) between the vocabulary information and each context identifier and determines the hidden state sequence with the maximum probability, that is, the context information closest to the vocabulary information.
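If the matching degree is read as the probability of the observed word sequence under a given model, one standard way to compute it from the three matrices is the forward algorithm. This reading is an assumption, as are the toy states and probabilities; the function name `sequence_likelihood` is hypothetical.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, float]]


def sequence_likelihood(obs: List[str], states: List[str], pi: Dict[str, float],
                        A: Matrix, B: Matrix) -> float:
    """Forward algorithm: total probability of the observed words under the model."""
    if not obs:
        return 1.0
    # alpha[s] = probability of the words so far with the current state being s.
    alpha = {s: pi.get(s, 0.0) * B.get(s, {}).get(obs[0], 0.0) for s in states}
    for word in obs[1:]:
        alpha = {
            s: sum(alpha[p] * A.get(p, {}).get(s, 0.0) for p in states)
            * B.get(s, {}).get(word, 0.0)
            for s in states
        }
    return sum(alpha.values())


states = ["label", "value"]
pi = {"label": 0.8, "value": 0.2}
A = {"label": {"label": 0.1, "value": 0.9}, "value": {"label": 0.5, "value": 0.5}}
B = {"label": {"name": 0.9, "Alice": 0.1}, "value": {"name": 0.1, "Alice": 0.9}}
likelihood = sequence_likelihood(["name", "Alice"], states, pi, A, B)
```

Unlike the maximum-probability state sequence, this value sums over all hidden paths, so it can be compared across several industry-specific models for the same text.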

The word segmentation library may be a specialized word library that already exists in the industry. To keep up with the continual renewal of vocabulary, however, a maximum matching algorithm is used to segment the target text of the specified industry, the occurrence frequency of each second target word in the target text is counted, and the second target words whose occurrence frequency is greater than or equal to the preset frequency are added to the word segmentation library. The word segmentation library is thus gradually improved, which makes accurate segmentation of the text easier, provides comprehensive data support for constructing the hidden Markov model, ensures the recognition precision of the context and sensitive words of the text to be processed, and prevents omissions in desensitization processing.
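A minimal sketch of this update step is given below. It assumes forward maximum matching against an existing lexicon, with unmatched characters emitted as single-character words; the embodiment does not state which variant of maximum matching is used. The function names, the `max_len` parameter and the sample lexicon are hypothetical.

```python
from collections import Counter


def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def update_library(library, words, min_count):
    """Return the library extended with every word seen at least min_count times."""
    counts = Counter(words)
    return library | {w for w, c in counts.items() if c >= min_count}


# Hypothetical general lexicon: "身份" (identity), "身份证" (ID card), "号码" (number).
general_lexicon = {"身份", "身份证", "号码"}
words = max_match("身份证号码身份证", general_lexicon, max_len=3)
industry_library = update_library(set(), words, min_count=2)
```

Here "身份证" occurs twice and is added to the industry library, while "号码" occurs once and is left out.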

For example, "congratulatory" may express a congratulation in the context of a celebration, but may be a person's name in the context of a resume, in which case it needs to be masked.

Example four

As shown in fig. 4, according to one embodiment of the present invention, a text desensitization method is proposed, the method comprising:

step 402, acquiring a text to be processed;

step 404, determining a text type of the text to be processed according to the text to be processed;

step 406, judging whether the text type is matched with a preset text type, if so, entering step 408, and if not, entering step 404;

step 408, calling a hidden Markov model according to a preset text type;

step 410, performing word segmentation processing on the text to be processed according to the word segmentation library to obtain word information;

step 412, determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;

step 414, judging whether the context information meets the preset context information, if so, entering step 416, and if not, entering step 418;

step 416, desensitizing the vocabulary information;

step 418, performing no desensitization.

In this embodiment, the text type of the text to be processed is determined and matched against the preset text types, and the hidden Markov model is selected according to the preset text type with a high matching degree. Texts of different types can therefore be processed with their own hidden Markov model for context recognition, which meets the desensitization requirements of different industries and expands the application range of the product.
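The flow of steps 402 to 418 can be sketched as a single function. Every callable passed in (`classify`, `segment`, `mask`, and the per-type decoders in `models`) is a hypothetical stand-in for a component the embodiment describes only functionally. Returning the text unchanged when no preset text type matches is a simplifying assumption of this sketch.

```python
def desensitize_text(text, classify, models, segment, preset_contexts, mask):
    """Apply the per-type model and mask words whose context is a preset context."""
    text_type = classify(text)            # step 404: determine the text type
    decode = models.get(text_type)        # steps 406-408: pick the model for that type
    if decode is None:
        return text                       # assumption: no matching model, text untouched
    words = segment(text)                 # step 410: word segmentation
    contexts = decode(words)              # step 412: one context label per word
    out = []
    for word, ctx in zip(words, contexts):
        # Steps 414-418: desensitize only words in a preset context.
        out.append(mask(word) if ctx in preset_contexts else word)
    return " ".join(out)


# Toy stand-ins (all hypothetical) for an end-to-end run.
def toy_classify(text):
    return "resume" if "Name:" in text else "unknown"


def toy_resume_decoder(words):
    # Labels the word that directly follows "Name:" as a name context.
    return ["name" if i > 0 and words[i - 1] == "Name:" else "other"
            for i in range(len(words))]


result = desensitize_text(
    "Name: Alice likes tea",
    classify=toy_classify,
    models={"resume": toy_resume_decoder},
    segment=str.split,
    preset_contexts={"name"},
    mask=lambda w: "*" * len(w),
)
```

In a fuller implementation the decoder would be a trained hidden Markov model rather than a positional rule; the rule is used here only to keep the example self-contained.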

Example five

As shown in fig. 5, according to an embodiment of the present invention, a text desensitization method is proposed, including:

step 502, acquiring commonly used industry text data, performing word segmentation with a maximum matching algorithm, and taking the words with higher frequency as entries of the industry language library;

step 504, marking the context state of the word according to the occurrence position and the semantics of the vocabulary;

step 506, constructing an initial probability distribution matrix, a state transition probability distribution matrix and an observation probability distribution matrix according to the context state conversion relation corresponding to each word;

step 508, training the hidden Markov model by using machine learning until the initial probability distribution matrix, the state transition probability distribution matrix and the observation probability distribution matrix are all converged;

step 510, analyzing the text data to be desensitized, performing word segmentation using the industry word library, finding the context with the maximum probability using the hidden Markov model, identifying the words corresponding to the hidden state sequence, and performing desensitization conversion.

In this embodiment, a word library for a specific industry is collected; if the industry already has a specialized thesaurus, it can be used directly. If not, article texts from the industry are collected and segmented with a maximum matching algorithm, and words with a high frequency of occurrence are then selected statistically and added to the word library. Based on the common text information and the word library of the industry, a group of states representing different contexts is designed according to the positions and semantics of the words. The initial probability distribution matrix, the state transition probability distribution matrix and the observation probability distribution matrix of the hidden Markov model are calculated from the labeled context states, and the matrices are iterated to convergence by an unsupervised machine learning method to obtain the trained hidden Markov model. The text to be processed is then segmented using the industry-specific word library, the hidden state sequence with the highest probability is obtained from the segmentation result and the trained model, pattern recognition is performed on this sequence to find the semantically key private data structures, and the private data is desensitized. Thus, for unstructured text in a specific industry, private data is not searched for with regular-expression patterns; instead, a word segmentation library and a context pattern library for that industry are used to recognize the industry text and find the private data. This avoids the human errors of manual labeling and improves the efficiency of recognizing and processing private words.
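Calculating the three matrices from labeled context states (step 506) can be sketched as simple relative-frequency counting. This sketch covers only the initial estimate and does not implement the subsequent iterative training of step 508. The input format, a list of sentences each given as (word, state) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_hmm(labelled_sentences):
    """Estimate (start_p, trans_p, emit_p) by counting over labelled sentences."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in labelled_sentences:
        if not sentence:
            continue
        start[sentence[0][1]] += 1                  # state of the sentence-initial word
        for word, state in sentence:
            emit[state][word] += 1                  # word observed in this state
        for (_, a), (_, b) in zip(sentence, sentence[1:]):
            trans[a][b] += 1                        # transition from state a to state b

    def normalise(counter):
        total = sum(counter.values())
        return {key: value / total for key, value in counter.items()}

    return (normalise(start),
            {s: normalise(c) for s, c in trans.items()},
            {s: normalise(c) for s, c in emit.items()})


# Hypothetical labelled data with two context states.
data = [
    [("name", "private"), ("Zhang", "private"), ("works", "public")],
    [("works", "public"), ("here", "public")],
]
start_p, trans_p, emit_p = estimate_hmm(data)
```

The resulting dictionaries have the same shape as the inputs of a Viterbi-style decoder, and could serve as the starting point for an iterative re-estimation procedure such as Baum-Welch.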

Example six

As shown in fig. 6, according to the embodiment of the second aspect of the present invention, a text desensitization apparatus 600 is proposed, which comprises a memory 602 and a processor 604, the memory 602 stores a computer program, and the processor 604 implements the text desensitization method of the embodiment of the first aspect when executing the computer program. The text desensitization apparatus thus has all the benefits of the text desensitization method of the embodiments of the first aspect.

Example seven

According to an embodiment of the third aspect of the present invention, there is provided an electronic device including the text desensitization apparatus of the embodiment of the second aspect, which, when executing the computer program, is capable of performing the following steps: acquiring a text to be processed and a hidden Markov model; performing word segmentation processing on the text to be processed according to the word segmentation library to obtain vocabulary information; determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model; and desensitizing the vocabulary information based on the context information meeting the preset context information.

Example eight

As shown in fig. 7, an industry data collection server 702, an industry thesaurus analysis server 704, a model training server 706, a desensitization data analysis server 708, and a desensitization processing server 710 are deployed in an electronic device 700 according to an embodiment of the present invention.

Specifically, the industry data collection server 702 is started to collect industry texts, word segmentation is performed using the maximum matching method, the occurrence frequency of each word is counted, and the words with higher frequency are stored. The industry thesaurus analysis server 704 is then started to label the context states. The model training server 706 is started, and the hidden Markov model is trained with an unsupervised machine learning algorithm until the model data converges. The desensitization data analysis server 708 and the desensitization processing server 710 are started; the text to be processed is read in and segmented, the context with the maximum probability is found by matching against the model data, and the private data is extracted for desensitization processing.

In this embodiment, a hidden Markov model is established using a language library for a specific industry and a machine learning method, the most probable semantics of the unstructured text data to be desensitized are found, pattern recognition is performed, and the recognized private information is replaced. This solves the problem that traditional structured-data desensitization methods are ineffective for the semi-structured data that is now increasingly common, effectively reduces the technical requirements on operators and their workload, and reduces the probability of human error.

Example nine

According to an embodiment of the fourth aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text desensitization method according to an embodiment of the first aspect. The computer readable storage medium thus has all the benefits of the text desensitization method of embodiments of the first aspect.

In the description herein, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance unless explicitly stated or limited otherwise; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of text desensitization, comprising:

acquiring a text to be processed and a hidden Markov model;

performing word segmentation processing on the text to be processed according to a word segmentation library to obtain word information;

determining context information corresponding to the vocabulary information according to the vocabulary information and the hidden Markov model;

and desensitizing the vocabulary information based on the context information meeting preset context information.

2. The text desensitization method according to claim 1, wherein said step of desensitizing the textual vocabulary information comprises:

comparing vocabulary text and private vocabulary in the vocabulary information;

based on the vocabulary text conforming to the private vocabulary, marking the vocabulary text as sensitive data;

and desensitizing the sensitive data according to a desensitizing rule.

3. The text desensitization method according to claim 1, wherein said step of obtaining hidden markov models comprises:

acquiring a target text;

performing word segmentation processing on the target text according to the word segmentation library to obtain a first target word and a word position and a semantic meaning corresponding to the first target word;

determining a context identifier of the target text according to the first target vocabulary, the vocabulary position, the semantic and context pattern library;

and constructing the hidden Markov model according to the context identifier.

4. The text desensitization method according to claim 3, wherein the step of determining context information corresponding to the lexical information based on the lexical information and the hidden Markov models comprises:

matching the vocabulary information with the hidden Markov model to obtain the matching degree of the vocabulary information and the context identifier;

based on the matching degree being greater than or equal to a threshold matching degree, taking the context identifier corresponding to the matching degree greater than or equal to the threshold matching degree as the context information.

5. The text desensitization method according to claim 3, wherein said step of obtaining target text is followed by further comprising:

performing word segmentation processing on the target text by adopting a maximum matching algorithm to obtain a second target word;

counting the occurrence frequency of the second target vocabulary in the target text;

and updating the word segmentation library according to the second target vocabulary with the occurrence frequency larger than or equal to the preset frequency.

6. The text desensitization method according to any of claims 1 to 5, wherein said step of obtaining hidden Markov models comprises in particular:

determining the text type of the text to be processed according to the text to be processed;

matching the text type with a preset text type;

and calling the hidden Markov model according to the preset text type based on the matching of the text type and the preset text type.

7. The text desensitization method according to any of claims 2 to 5, wherein

the desensitization rule includes at least one of: a masking algorithm, a morphing algorithm, a replacement algorithm, a format preserving encryption algorithm, and a data encryption algorithm.

8. A text desensitizing apparatus, comprising: a memory storing a computer program and a processor executing the computer program to perform a text desensitization method according to any of claims 1 to 7.

9. An electronic device, comprising:

the text desensitization device of claim 8, the text desensitization device when executing a computer program operable to perform the steps of:

acquiring a text to be processed and a hidden Markov model;

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text desensitization method according to any of claims 1 to 7.