CN117272996A - Data desensitization system - Google Patents

Info

Publication number
CN117272996A
CN117272996A CN202311569787.9A CN202311569787A CN117272996A CN 117272996 A CN117272996 A CN 117272996A CN 202311569787 A CN202311569787 A CN 202311569787A CN 117272996 A CN117272996 A CN 117272996A
Authority
CN
China
Prior art keywords
word segmentation
text
words
dictionary
desensitization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311569787.9A
Other languages
Chinese (zh)
Other versions
CN117272996B (en
Inventor
卢国栋
李静
宋丙华
罗倩倩
王峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Wangan Security Technology Co ltd
Original Assignee
Shandong Wangan Security Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Wangan Security Technology Co ltd
Priority to CN202311569787.9A
Publication of CN117272996A
Application granted
Publication of CN117272996B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of data processing and discloses a data desensitization system comprising an acquisition module and a word segmentation module. The acquisition module acquires the text to be desensitized, which comprises numerical text and non-numerical text. The word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit. The dictionary word segmentation unit segments the non-numerical text with an improved dictionary-based word segmentation algorithm, saves the obtained words to a first word segmentation set, and obtains the text with incomplete word segmentation based on the first word segmentation set. The statistical word segmentation unit segments the text with incomplete word segmentation with a statistics-based word segmentation algorithm and saves the obtained words to a second word segmentation set. The invention maintains word segmentation efficiency, and words that do not exist in the dictionary are segmented by a statistics-based algorithm that has higher time complexity but does not rely on a dictionary, which ensures the success rate of word segmentation.

Description

Data desensitization system
Technical Field
The invention relates to the field of data processing, in particular to a data desensitization system.
Background
Data desensitization refers to transforming sensitive information according to desensitization rules so that sensitive and private data are reliably protected. This allows desensitized real data sets to be used safely in development, testing, and other non-production or outsourced environments. Where customer security data or commercially sensitive data are involved, the real data should be transformed before being provided for testing, without violating system rules.
Data desensitization generally includes two steps: identifying sensitive data and desensitizing the identified sensitive data. To identify sensitive data in a text, the prior art generally uses keyword recognition and regular-expression recognition, and generally segments the text with a single word segmentation algorithm. If a dictionary-based word segmentation algorithm is used directly, the text is likely to contain words that are not in the dictionary, so a considerable amount of text remains unsegmented afterwards; that is, the success rate of word segmentation is low. The success rate here is the ratio of the number of words that have been segmented to the total number of words in the entire text subjected to sensitive data recognition. If a statistics-based word segmentation algorithm is used directly, word segmentation takes too long, because its time complexity is far higher than that of a dictionary-based algorithm.
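As a small illustration of the success rate defined above, the following Python sketch computes the ratio. The function name and the counts are hypothetical and chosen only for the example.

```python
def segmentation_success_rate(segmented_words, total_words):
    """Ratio of segmented words to all words in the text under recognition."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return segmented_words / total_words


# Hypothetical counts: 80 of the 100 words in a text were segmented.
rate = segmentation_success_rate(segmented_words=80, total_words=100)
```

A dictionary-only pass lowers this ratio whenever the text contains words missing from the dictionary.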
Disclosure of Invention
The invention aims to disclose a data desensitization system that addresses the problem of balancing word segmentation efficiency and word segmentation success rate when text is segmented during data desensitization, reducing the time required for word segmentation while ensuring the success rate.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a data desensitization system comprises an acquisition module and a word segmentation module;
the acquisition module is used for acquiring a text to be subjected to desensitization, wherein the text to be subjected to desensitization comprises a numerical text and a non-numerical text;
the word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit;
the dictionary word segmentation unit is used for performing word segmentation on the non-numerical text with an improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set;
the statistical word segmentation unit is used for performing word segmentation on the text with incomplete word segmentation with a statistics-based word segmentation algorithm, and saving the obtained words to a second word segmentation set.
Preferably, the system further comprises a desensitization rule preservation module, which is used for saving preset desensitization processing rules for words of various types.
Preferably, the types of words include address-class words, account-class words, contact-class words, and name-class words.
Preferably, the desensitization processing rule for address-class words or name-class words is as follows:
replacing the address-class words or name-class words with randomly generated Chinese characters;
the desensitization processing rule for account-class words or contact-class words is as follows:
replacing the number strings corresponding to the account-class words or contact-class words with random numbers.
Preferably, the number strings are obtained in the following manner:
taking the numerical text located directly above, directly below, to the left of, or to the right of the account-class word as the number string.
Preferably, the system further comprises a desensitization module, which is used for desensitizing the text to be desensitized based on the desensitization rules, the first word segmentation set, and the second word segmentation set, so as to obtain the desensitized text.
Preferably, performing word segmentation on the non-numerical text with the improved dictionary-based word segmentation algorithm, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set comprises the following steps:
S1, acquiring a dictionary for word matching;
S2, calculating the adaptive sentence length of the non-numerical text;
S3, performing word segmentation on the non-numerical text based on the adaptive sentence length, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set.
Preferably, calculating the adaptive sentence length of the non-numerical text includes:
S21, obtaining the number of lines of the non-numerical text;
S22, calculating the maximum value of the random numbers, the calculation also involving the number of random numbers that need to be generated;
S23, generating the required number of random numbers within the value range set by that maximum value, and saving the obtained random numbers to a random number set;
S24, selecting lines of text from the non-numerical text based on the random numbers in the random number set, and saving the selected lines to a sampled line set;
S25, performing word segmentation on each line of text in the sampled line set with a word segmentation algorithm based on a hidden Markov model, and saving the obtained words to a sampled word set;
S26, counting the words of each length in the sampled word set, and taking the length with the largest count as the adaptive sentence length of the non-numerical text.
Preferably, performing word segmentation on the non-numerical text based on the adaptive sentence length, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set includes:
S31, acquiring the leading characters of the non-numerical text, up to the length of the sentence to be segmented, to form the sentence to be segmented;
the length of the sentence to be segmented is calculated with a remainder (mod) operation from the maximum length of the words in the dictionary and the adaptive sentence length;
S32, performing word segmentation on the sentence to be segmented with a forward maximum matching algorithm or a reverse maximum matching algorithm, and saving the words in the sentence that belong to the dictionary to the first word segmentation set;
S33, identifying the words in the sentence to be segmented that do not belong to the dictionary;
S34, taking each such word, together with a number of characters immediately before it and a number of characters immediately after it, as text in the text with incomplete word segmentation, the numbers of characters being defined in terms of the number of characters contained in the word;
S35, deleting the words in the first word segmentation set and the characters in the text with incomplete word segmentation from the non-numerical text;
S36, judging whether characters remain in the non-numerical text; if so, returning to S31; if not, outputting the first word segmentation set.
Preferably, the system further comprises a database module for storing the text to be desensitized.
Compared with the prior art, which uses a single word segmentation algorithm, the invention combines two different word segmentation algorithms to segment the text to be desensitized. An improved dictionary-based word segmentation algorithm is applied first, yielding a first word segmentation set and the text with incomplete word segmentation; a statistics-based word segmentation algorithm is then applied to the text with incomplete word segmentation, yielding a second word segmentation set. Because the dictionary-based algorithm has lower time complexity, it produces most of the segmentation results first and keeps word segmentation efficient. Words that do not exist in the dictionary are handled by the statistics-based algorithm, which has higher time complexity but does not require a dictionary, so the success rate of word segmentation is ensured.
Drawings
The present disclosure will become more fully understood from the detailed description given herein below and the accompanying drawings, which are given by way of illustration only, and thus are not limiting of the present disclosure, and wherein:
FIG. 1 is a schematic diagram of a data desensitization system of the present invention;
FIG. 2 is a schematic diagram of a non-numeric text word segmentation process using the improved dictionary-based word segmentation algorithm of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
In one embodiment, as shown in fig. 1, the invention provides a data desensitization system, which comprises an acquisition module and a word segmentation module;
the acquisition module is used for acquiring a text to be desensitized, wherein the text to be desensitized comprises numerical text and non-numerical text;
the word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit;
the dictionary word segmentation unit is used for performing word segmentation on the non-numerical text with an improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set;
the statistical word segmentation unit is used for performing word segmentation on the text with incomplete word segmentation with a statistics-based word segmentation algorithm, and saving the obtained words to a second word segmentation set.
Compared with the prior art, which uses a single word segmentation algorithm, the invention combines two different word segmentation algorithms to segment the text to be desensitized. An improved dictionary-based word segmentation algorithm is applied first, yielding a first word segmentation set and the text with incomplete word segmentation; a statistics-based word segmentation algorithm is then applied to the text with incomplete word segmentation, yielding a second word segmentation set. Because the dictionary-based algorithm has lower time complexity, it produces most of the segmentation results first and keeps word segmentation efficient. Words that do not exist in the dictionary are handled by the statistics-based algorithm, which has higher time complexity but does not require a dictionary, so the success rate of word segmentation is ensured.
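The two-stage arrangement described above can be sketched in Python as follows. This is only an illustration of the control flow: the function names, the toy dictionary, and the two stand-in segmenters are assumptions, and the stand-ins work on whitespace-delimited tokens purely to keep the example short.

```python
def hybrid_segment(text, dictionary_segmenter, statistical_segmenter):
    """Run the cheap dictionary pass first, then the statistical pass on the rest."""
    first_set, unsegmented = dictionary_segmenter(text)
    second_set = []
    for fragment in unsegmented:
        second_set.extend(statistical_segmenter(fragment))
    return first_set, second_set


# Toy stand-ins (assumptions, not the algorithms of the source).
TOY_DICTIONARY = {"bank", "account", "address"}


def toy_dictionary_segmenter(text):
    tokens = text.split()
    known = [t for t in tokens if t in TOY_DICTIONARY]
    unknown = [t for t in tokens if t not in TOY_DICTIONARY]
    return known, unknown


def toy_statistical_segmenter(fragment):
    # Placeholder for a statistics-based segmenter such as a hidden Markov model.
    return [fragment]


first_set, second_set = hybrid_segment(
    "bank account zhangwei", toy_dictionary_segmenter, toy_statistical_segmenter
)
```

Only the fragments that the dictionary pass could not match reach the slower second stage, which is where the time saving would come from.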
Preferably, the system further comprises a desensitization rule preservation module, which is used for saving preset desensitization processing rules for words of various types.
Preferably, the desensitization rule preservation module includes a setting unit and a storing unit;
the setting unit is used by staff who have passed face recognition to set the desensitization processing rules for each type of word;
the storing unit is used for storing, in English, the desensitization processing rules set in the setting unit by staff who have passed face recognition.
Preferably, during face recognition the setting unit segments the face image of the staff member through the following steps:
acquiring the set of edge pixels in the face image;
randomly selecting a pixel from the edge pixel set;
acquiring the set of pixels in a window of the given size centered on the selected pixel;
acquiring, within the window, the set of pixels that conform to the elliptical skin color model and are adjacent to pixels that do not conform to the elliptical skin color model;
judging whether the pixels in that set all belong to the edge pixel set, and if not, adding the pixels that do not belong to the edge pixel set to the edge pixel set;
forming a plurality of connected domains from the pixels in the edge pixel set;
taking the connected domain with the largest area as the image of the face area.
In the prior art, when a face image is segmented by edge recognition, the connected domains are generally identified directly once the edge pixels are obtained. During edge recognition, however, some regions have a complex gray value distribution, for example a large gray value variance, and accurate image edges cannot be obtained there. The invention therefore further segments the pixels of these regions with the elliptical skin color model, which improves the accuracy of the obtained image edges, and in turn the accuracy of the image features acquired subsequently and the success rate of face recognition.
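The final step of the segmentation, keeping the connected domain with the largest area, can be illustrated with a short Python sketch. It assumes the candidate region is already available as a binary mask and uses 4-connectivity; the mask, the function name, and the connectivity choice are assumptions, and the skin color model itself is not reproduced here.

```python
from collections import deque


def largest_connected_region(mask):
    """Return the pixel coordinates of the largest 4-connected region of 1s."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


# Hypothetical 3x4 mask with two regions of sizes 3 and 2.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
face_region = largest_connected_region(mask)
```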
Preferably, the setting unit is further configured to acquire image features of the image of the face area, and to perform face recognition on the staff member based on the acquired image features.
Specifically, performing face recognition on the staff member based on the acquired image features includes:
matching the image features of the image of the face area against the pre-stored facial features of the persons who are allowed to set desensitization processing rules; if the matching succeeds, the person has passed face recognition.
Preferably, the types of words include address class words, account class words, contact class words, and name class words.
Specifically, the types of words may also include certificate-class words, such as keywords for certificates like a driving license or a property ownership certificate.
Preferably, the desensitization processing rule of the words of the address class or the words of the name class is as follows:
replacing the words of the address class or the words of the name class by adopting randomly generated Chinese characters;
the desensitization processing rule for account-class words or contact-class words is as follows:
replacing the number strings corresponding to the account-class words or contact-class words with random numbers.
Specifically, a placeholder symbol may also be used to replace the address-class words or name-class words. For example, when the address-class word is district d, city b, province a, the desensitized data keep the words "province", "city" and "district" and replace each of the names a, b and d with the placeholder symbol. As another example, when the name-class word is "b city gas limited company", the city name b is replaced with the placeholder symbol and the rest is kept. The industry information of the company may also be removed as required.
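The placeholder-based replacement of an address can be sketched in Python as follows. The placeholder symbol is not legible in this text, so the asterisk below is an assumption, as are the function name and the sample address; the sketch also assumes that the place names themselves do not contain the characters 省, 市 or 区.

```python
import re

MASK = "*"  # assumed placeholder symbol; the one used in the source is not legible


def mask_address(address):
    """Replace each place name that precedes 省 (province), 市 (city) or 区 (district)."""
    return re.sub(r"[^省市区]+(?=[省市区])", MASK, address)


masked = mask_address("山东省济南市历下区")
```

The administrative suffixes survive, so the desensitized text keeps its structure while the identifying names are hidden.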
Preferably, the number strings are obtained in the following manner:
taking the numerical text located directly above, directly below, to the left of, or to the right of the account-class word as the number string.
For account-class words such as "identity card number", the number string is likely to appear after the word and may also appear before it. In addition, if the text to be desensitized is form-like text, the number string may appear directly below or directly above the account-class word.
Contact-class words such as mailbox and telephone are processed under the same rule as account-class words.
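The four-direction lookup described above can be sketched in Python for form-like text. The sketch assumes the form has already been parsed into a grid of cells; the grid, the function name, and the sample number are hypothetical.

```python
def neighbouring_number_strings(grid, row, col):
    """Collect all-digit cells directly above, below, left and right of (row, col)."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    found = []
    for r, c in candidates:
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            cell = grid[r][c]
            if cell.isdigit():
                found.append(cell)
    return found


# Hypothetical form: the account-class word sits at row 0, column 1.
grid = [
    ["Name", "ID number"],
    ["Zhang", "370102199001011234"],
]
number_strings = neighbouring_number_strings(grid, row=0, col=1)
```

For running text rather than a form, only the left and right neighbours would apply.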
Preferably, the system further comprises a desensitization module, which is used for desensitizing the text to be desensitized based on the desensitization rules, the first word segmentation set, and the second word segmentation set, so as to obtain the desensitized text.
Specifically, the desensitization process randomly selects a word from the first word segmentation set or the second word segmentation set, obtains the corresponding desensitization rule according to the type of the word, and desensitizes the word, or the number string corresponding to the word, based on that rule.
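A minimal Python sketch of this type-to-rule lookup follows. The rule table, the function names, the character range U+4E00 to U+9FA5 used for random Chinese characters, and the sample values are all assumptions made for illustration.

```python
import random


def random_chinese(text, rng):
    """Replace every character with a random CJK ideograph (U+4E00..U+9FA5)."""
    return "".join(chr(rng.randint(0x4E00, 0x9FA5)) for _ in text)


def random_digits(number_string, rng):
    """Replace every digit with a random digit; separators stay unchanged."""
    return "".join(
        str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in number_string
    )


# Hypothetical rule table keyed by word type.
RULES = {
    "address": random_chinese,
    "name": random_chinese,
    "account": random_digits,
    "contact": random_digits,
}


def desensitize(value, word_type, rng):
    """Look up the rule for the word type and apply it to the value."""
    return RULES[word_type](value, rng)


rng = random.Random(0)
fake_name = desensitize("张伟", "name", rng)
fake_phone = desensitize("138-0000-1234", "contact", rng)
```

Keeping the length and the separators unchanged means the desensitized data still look realistic in test environments.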
Preferably, as shown in fig. 2, performing word segmentation on the non-numerical text with the improved dictionary-based word segmentation algorithm, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set includes:
S1, acquiring a dictionary for word matching;
S2, calculating the adaptive sentence length of the non-numerical text;
S3, performing word segmentation on the non-numerical text based on the adaptive sentence length, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set.
Specifically, the dictionary stores the words used for word segmentation; word segmentation of the non-numerical text is achieved by matching character strings of various lengths in the non-numerical text against the words in the dictionary.
Preferably, calculating the adaptive sentence length of the non-numerical text includes:
S21, obtaining the number of lines of the non-numerical text;
S22, calculating the maximum value of the random numbers, the calculation also involving the number of random numbers that need to be generated;
S23, generating the required number of random numbers within the value range set by that maximum value, and saving the obtained random numbers to a random number set;
S24, selecting lines of text from the non-numerical text based on the random numbers in the random number set, and saving the selected lines to a sampled line set;
S25, performing word segmentation on each line of text in the sampled line set with a word segmentation algorithm based on a hidden Markov model, and saving the obtained words to a sampled word set;
S26, counting the words of each length in the sampled word set, and taking the length with the largest count as the adaptive sentence length of the non-numerical text.
In existing dictionary-based word segmentation algorithms, the maximum length of the words in the dictionary is generally taken as the length of the sentence to be segmented. However, the lengths of the most frequent words differ from text to text. If the maximum word length in the dictionary is used and the lengths of most words in the text are far from that maximum, a complete word is easily divided between two sentences when sentences are acquired, so the word cannot be segmented correctly and must be left to the subsequent statistics-based word segmentation algorithm. The amount of text handled by the statistics-based algorithm then becomes excessive, which reduces word segmentation efficiency. The invention solves this problem by calculating an adaptive sentence length, which reflects the distribution of word lengths in the sampled lines of text. A hidden Markov model is not affected by the dictionary during word segmentation and can therefore be used to segment the sampled lines. Through the adaptive sentence length, the result of a statistics-based word segmentation algorithm influences the dictionary-based word segmentation algorithm and reduces the probability that a complete word is divided between two sentences when sentences are acquired.
In addition, because the texts to be desensitized are of many types, it is clearly difficult to record every word that may appear in the dictionary. Sampling analysis of the non-numerical text reveals the length of the most frequently occurring words, so that the length of the sentence to be segmented can be adjusted correctly in the subsequent process, improving word segmentation efficiency.
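Taking the most frequent word length as the adaptive sentence length can be sketched in Python as follows. The sample words stand in for the output of the hidden Markov model segmentation of the sampled lines, and the tie-breaking rule (prefer the shorter length) is an assumption that the source does not state.

```python
from collections import Counter


def adaptive_sentence_length(words):
    """Return the word length that occurs most often among the sampled words."""
    counts = Counter(len(word) for word in words)
    # Assumed tie-break: when two lengths are equally frequent, take the shorter.
    return min(counts, key=lambda length: (-counts[length], length))


# Hypothetical words obtained from the sampled lines.
sampled_words = ["数据", "脱敏", "系统", "的", "分词", "模块", "身份证"]
length = adaptive_sentence_length(sampled_words)
```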
Generating random numbers prevents periodically repeated non-numerical text from distorting the calculation of the adaptive sentence length, which improves the validity of the adaptive sentence length. With periodically repeated text, if samples are taken at a fixed interval and that interval happens to equal the repetition period, the sample actually contains only one type of line, and the adaptive sentence length then cannot accurately represent the lengths of the words in the non-numerical text.
In the present invention, the number of sampled lines is much smaller than the actual number of lines of the non-numerical text.
Specifically, selecting lines of text from the non-numerical text based on the random numbers in the random number set and saving them to the sampled line set comprises:
obtaining a random number s from the random number set;
calculating the line number of the line to be saved to the sampled line set, where the formula relates the line numbers, within the non-numerical text, of the k-th and (k-1)-th lines selected from the non-numerical text;
saving the line of the non-numerical text with the calculated line number to the sampled line set.
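The formulas for the maximum random value and for the line number are not legible in this text, so the following Python sketch is only one possible reading and not the patented computation. It assumes that the maximum random value is the number of lines divided by the number of samples (integer division) and that each selected line number equals the previous one plus a random number s. The names `sample_lines` and `max_gap` and the seed are hypothetical.

```python
import random


def sample_lines(lines, sample_count, rng):
    """Pick sample_count lines separated by random gaps (assumed scheme)."""
    if not 0 < sample_count <= len(lines):
        raise ValueError("sample_count must be between 1 and len(lines)")
    # Assumed bound: keeps the running line number within the text.
    max_gap = len(lines) // sample_count
    gaps = [rng.randint(1, max_gap) for _ in range(sample_count)]
    sampled, row = [], 0
    for s in gaps:
        row += s                        # assumed: row_k = row_(k-1) + s
        sampled.append(lines[row - 1])  # line numbers are 1-based
    return sampled


lines = [f"line {i}" for i in range(1, 101)]
sampled = sample_lines(lines, sample_count=10, rng=random.Random(0))
```

Because the gaps vary, the sample does not lock onto a fixed period in the text, which is the property the preceding paragraphs rely on.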
Preferably, performing word segmentation on the non-numerical text based on the adaptive sentence length, saving the obtained words to the first word segmentation set, and obtaining the text with incomplete word segmentation based on the first word segmentation set includes:
S31, acquiring the leading characters of the non-numerical text, up to the length of the sentence to be segmented, to form the sentence to be segmented;
the length of the sentence to be segmented is calculated with a remainder (mod) operation from the maximum length of the words in the dictionary and the adaptive sentence length;
S32, performing word segmentation on the sentence to be segmented with a forward maximum matching algorithm or a reverse maximum matching algorithm, and saving the words in the sentence that belong to the dictionary to the first word segmentation set;
S33, identifying the words in the sentence to be segmented that do not belong to the dictionary;
S34, taking each such word, together with a number of characters immediately before it and a number of characters immediately after it, as text in the text with incomplete word segmentation, the numbers of characters being defined in terms of the number of characters contained in the word;
S35, deleting the words in the first word segmentation set and the characters in the text with incomplete word segmentation from the non-numerical text;
S36, judging whether characters remain in the non-numerical text; if so, returning to S31; if not, outputting the first word segmentation set.
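The loop S31 to S36 can be sketched in Python with forward maximum matching. This is a simplified illustration: the sentence length is passed in directly rather than computed with the remainder operation, the context characters of S34 are omitted, and the function names, dictionary, and sample text are hypothetical.

```python
def forward_max_match(sentence, dictionary, max_word_len):
    """Return (dictionary words, unmatched character runs) for one sentence."""
    words, unmatched, pending, i = [], [], "", 0
    while i < len(sentence):
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            if sentence[i:i + size] in dictionary:
                if pending:
                    unmatched.append(pending)
                    pending = ""
                words.append(sentence[i:i + size])
                i += size
                break
        else:
            # No dictionary word starts here; collect the character as unmatched.
            pending += sentence[i]
            i += 1
    if pending:
        unmatched.append(pending)
    return words, unmatched


def segment_by_sentences(text, dictionary, sentence_len):
    """Cut the text into sentences and match each against the dictionary."""
    max_word_len = max(len(word) for word in dictionary)
    first_set, unsegmented = [], []
    while text:                                                    # S36
        sentence, text = text[:sentence_len], text[sentence_len:]  # S31, S35
        words, leftovers = forward_max_match(sentence, dictionary, max_word_len)  # S32
        first_set.extend(words)
        unsegmented.extend(leftovers)  # S33-S34, without context characters
    return first_set, unsegmented


dictionary = {"北京市", "的", "账号"}
first_set, unsegmented = segment_by_sentences("北京市张伟的账号", dictionary, sentence_len=5)
```

With `sentence_len=5` the unmatched name 张伟 stays in one piece; with `sentence_len=4` the sentence boundary falls inside it and the two characters are reported separately, which is the kind of split the adaptive sentence length is meant to reduce.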
When calculating the length of the sentence to be segmented, the invention performs a remainder calculation so that the sentence length is an exact integer multiple of the adaptive sentence length. When it is an exact integer multiple, the probability that a complete word is split between two sentences is significantly lower than when it is not, so the probability that each word in the sentence to be segmented can be correctly segmented by the dictionary-based word segmentation algorithm is significantly improved.
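The effect of the integer-multiple property can be checked with a short Python sketch. It assumes, for simplicity, that sentences of a fixed length are cut consecutively from the text and that every word has the same length; the function name and the numbers are hypothetical.

```python
def split_word_count(word_lengths, sentence_length):
    """Count the words that straddle a boundary between consecutive sentences."""
    split, start = 0, 0
    for length in word_lengths:
        end = start + length
        # The word is cut when its first and last characters fall in different sentences.
        if start // sentence_length != (end - 1) // sentence_length:
            split += 1
        start = end
    return split


# Ten two-character words, i.e. a typical word length of 2.
word_lengths = [2] * 10
aligned = split_word_count(word_lengths, sentence_length=4)     # multiple of 2
misaligned = split_word_count(word_lengths, sentence_length=5)  # not a multiple of 2
```

In this toy case no word is cut when the sentence length is a multiple of the word length, while two of the ten words are cut otherwise.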
Preferably, the system further comprises a database module for storing the text to be desensitized.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The data desensitization system is characterized by comprising an acquisition module and a word segmentation module;
the acquisition module is used for acquiring a text to be subjected to desensitization, wherein the text to be subjected to desensitization comprises a numerical text and a non-numerical text;
the word segmentation module comprises a dictionary word segmentation unit and a statistical word segmentation unit;
the dictionary word segmentation unit is used for performing word segmentation on the non-numerical text with an improved dictionary-based word segmentation algorithm, saving the obtained words to a first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set;
the statistical word segmentation unit is used for performing word segmentation on the text with incomplete word segmentation with a statistics-based word segmentation algorithm, and saving the obtained words to a second word segmentation set.
2. The data desensitization system according to claim 1, further comprising a desensitization rule preservation module for preserving preset desensitization processing rules for words of multiple types.
3. A data desensitization system according to claim 2, wherein the types of words include address-like words, account-like words, contact-like words, and name-like words.
4. A data desensitization system according to claim 3, wherein the desensitization processing rule for address-class words or name-class words is:
replacing the address-class words or name-class words with randomly generated Chinese characters;
the desensitization processing rule for account-class words or contact-class words is:
replacing the number strings corresponding to the account-class words or contact-class words with random numbers.
5. The data desensitization system according to claim 4, wherein the number strings are obtained by:
taking the numerical text located directly above, directly below, to the left of, or to the right of the account-class word as the number string.
6. The data desensitization system according to claim 5, further comprising a desensitization module for desensitizing the text to be desensitized based on the desensitization rules, the first word segmentation set, and the second word segmentation set, to obtain the desensitized text.
7. The data desensitization system according to claim 1, wherein performing word segmentation on the non-numerical text by adopting the improved dictionary-based word segmentation algorithm, storing the obtained words into the first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set comprises:
S1, acquiring a dictionary for word matching;
S2, calculating the adaptive sentence length of the non-numerical text;
S3, performing word segmentation on the non-numerical text based on the adaptive sentence length, storing the obtained words into the first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set.
8. A data desensitization system according to claim 7, wherein calculating adaptive sentence lengths for non-numeric text comprises:
s21, obtaining the line number of the non-numerical text
S22, calculating the maximum value of the random number
Representing the number of random numbers that need to be generated;
s23, generatingThe value range is->The obtained random number is saved to the set +.>
S24, based onRandom number in (a) selected from non-numeric text +.>Line text is saved to the Collection->
S25, makingWord segmentation algorithm pair using hidden Markov modelWord segmentation processing is carried out on each line of texts in the set, and the obtained words are saved into a set +.>
S26, obtainingThe number of words of various lengths in the sentence, the length with the largest number is taken as the adaptive sentence length of the non-numerical text +.>
9. The data desensitization system according to claim 8, wherein performing word segmentation on the non-numerical text based on the adaptive sentence length, storing the obtained words into the first word segmentation set, and obtaining text with incomplete word segmentation based on the first word segmentation set comprises:
S31, taking the first characters of the non-numerical text, their number given by a formula not reproduced in this text, to form the sentence to be segmented; in that formula, one symbol represents the maximum length of words in the dictionary and mod represents the remainder operation;
S32, performing word segmentation on the sentence to be segmented by using a forward maximum matching algorithm or a reverse maximum matching algorithm, and storing the words that occur in the sentence to be segmented and belong to the dictionary into the first word segmentation set;
S33, denoting the words in the sentence to be segmented that do not belong to the dictionary by a symbol [not reproduced in this text];
S34, taking a number of characters before and a number of characters after the words identified in S33 as characters of the text with incomplete word segmentation, the counts being defined by symbols not reproduced in this text (one symbol represents the number of characters contained in a quantity that is likewise not reproduced);
S35, deleting the words in the first word segmentation set and the characters in the text with incomplete word segmentation from the non-numerical text;
S36, judging whether characters still remain in the non-numerical text; if so, returning to S31; if not, outputting the first word segmentation set.
10. The data desensitization system according to claim 1, further comprising a database module for storing the text to be desensitized.
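
The steps recited in claims 4, 8 and 9 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every function and parameter name in it is hypothetical. Several assumptions are made: the hidden Markov model segmenter of S25 is stood in for by a caller-supplied `segment` function; the number of sampled lines (whose formula is not reproduced in this text) is passed in as `sample_size`; only forward maximum matching is shown for S32; unmatched characters are simply collected as runs rather than by the before/after rule of S34; and "randomly generated Chinese characters" is taken to mean code points U+4E00 to U+9FA5.

```python
import random
from collections import Counter
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def adaptive_sentence_length(lines: Sequence[str],
                             segment: Callable[[str], Iterable[str]],
                             sample_size: int,
                             rng: random.Random) -> int:
    """Claim 8 sketch: sample lines, segment them, return the most frequent word length.

    `segment` stands in for the HMM segmenter; `sample_size` is an assumed input.
    """
    if not lines:
        return 1  # assumed fallback for empty input
    k = max(1, min(sample_size, len(lines)))
    picked = rng.sample(range(len(lines)), k)  # distinct random line indices
    lengths = Counter(len(w) for i in picked for w in segment(lines[i]) if w)
    return lengths.most_common(1)[0][0] if lengths else 1


def forward_max_match(sentence: str, dictionary: Set[str],
                      max_len: int) -> Tuple[List[str], List[str]]:
    """Claim 9 (S32) sketch: forward maximum matching against a dictionary.

    Returns (dictionary words found, runs of characters matching no word).
    The unmatched runs are what a statistical segmenter would process next.
    """
    matched: List[str] = []
    leftover: List[str] = []
    run = ""  # current run of unmatched characters
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                if run:
                    leftover.append(run)
                    run = ""
                matched.append(piece)
                i += size
                break
        else:
            run += sentence[i]  # no dictionary word starts at position i
            i += 1
    if run:
        leftover.append(run)
    return matched, leftover


def random_digits(number_string: str, rng: random.Random) -> str:
    """Claim 4 sketch: replace a number string with random digits of equal length."""
    return "".join(str(rng.randrange(10)) for _ in number_string)


def random_hanzi(word: str, rng: random.Random) -> str:
    """Claim 4 sketch: replace a name or address word with random Chinese characters."""
    return "".join(chr(rng.randint(0x4E00, 0x9FA5)) for _ in word)
```

For example, calling `forward_max_match("数据脱敏系统很好", {"数据", "脱敏", "系统"}, 2)` places the three dictionary words in the first list and leaves the run "很好" in the second, which corresponds to text that would be handed to the statistical word segmentation unit of claim 1.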
CN202311569787.9A 2023-11-23 2023-11-23 Data desensitization system Active CN117272996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311569787.9A CN117272996B (en) 2023-11-23 2023-11-23 Data desensitization system

Publications (2)

Publication Number Publication Date
CN117272996A true CN117272996A (en) 2023-12-22
CN117272996B CN117272996B (en) 2024-02-27

Family

ID=89220047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311569787.9A Active CN117272996B (en) 2023-11-23 2023-11-23 Data desensitization system

Country Status (1)

Country Link
CN (1) CN117272996B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776762A (en) * 2018-06-08 2018-11-09 北京中电普华信息技术有限公司 A kind of processing method and processing device of data desensitization
CN110210242A (en) * 2019-04-25 2019-09-06 深圳壹账通智能科技有限公司 A kind of method, apparatus, storage medium and the computer equipment of data desensitization
CN110532805A (en) * 2019-09-05 2019-12-03 国网山西省电力公司阳泉供电公司 Data desensitization method and device
CN112001174A (en) * 2020-08-10 2020-11-27 深圳中兴网信科技有限公司 Text desensitization method, apparatus, electronic device and computer-readable storage medium
CN115062338A (en) * 2019-12-31 2022-09-16 北京懿医云科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN115544560A (en) * 2022-09-22 2022-12-30 中国平安财产保险股份有限公司 Desensitization method and device for sensitive information, computer equipment and storage medium
CN115618371A (en) * 2022-07-11 2023-01-17 上海期货信息技术有限公司 Desensitization method and device for non-text data and storage medium
CN115659977A (en) * 2022-10-28 2023-01-31 青岛高重信息科技有限公司 Entity identification method for desensitization Chinese text
CN116049884A (en) * 2023-01-17 2023-05-02 三江学院 Data desensitization method, system and medium based on role access control
CN116484420A (en) * 2023-04-19 2023-07-25 中国邮政储蓄银行股份有限公司 Text desensitization processing method and device

Also Published As

Publication number Publication date
CN117272996B (en) 2024-02-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant