CN114626097A - Desensitization method, desensitization device, electronic apparatus, and storage medium - Google Patents

Publication number
CN114626097A
Authority
CN
China
Prior art keywords
desensitization
data
sensitive
word segment
word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210283373.9A
Other languages
Chinese (zh)
Inventor
肖田雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202210283373.9A
Publication of CN114626097A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application provide a desensitization method, a desensitization device, an electronic device and a storage medium, and belong to the technical field of artificial intelligence. The method comprises: acquiring original data to be desensitized; segmenting the original data according to preset desensitization category labels to obtain word segments to be desensitized corresponding to the desensitization category labels, the word segments to be desensitized comprising a first word segment, a second word segment and a third word segment; performing first sensitive detection and first desensitization processing on the first word segment through a sensitive information detection model to obtain first desensitization data; performing second sensitive detection and second desensitization processing on the second word segment through a regular expression to obtain second desensitization data; performing third sensitive detection and third desensitization processing on the third word segment through a sensitive word dictionary to obtain third desensitization data; and combining the first desensitization data, the second desensitization data and the third desensitization data to obtain target data. The embodiments of the application can improve the accuracy of desensitization.

Description

Desensitization method, desensitization device, electronic device, and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a desensitization method, a desensitization device, electronic equipment and a storage medium.
Background
Some data contain sensitive information such as user privacy and business secrets, or information that does not meet industry requirements, so the relevant sensitive information must be desensitized. At present, a common desensitization method performs part-of-speech tagging on text information and then desensitizes the sensitive entity words identified by the tagging. This method usually needs a large amount of labeled data, high-quality labeled data is costly to obtain in that quantity, and the resulting desensitization accuracy is still low. How to improve desensitization accuracy has therefore become a technical problem to be solved urgently.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a desensitization method, a desensitization apparatus, an electronic device, and a storage medium, which are intended to improve desensitization accuracy.
To achieve the above object, a first aspect of embodiments of the present application proposes a desensitization method, including:
acquiring original data to be desensitized;
segmenting the original data according to preset desensitization category labels to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, and the sensitivity categories of the first word segment, the second word segment and the third word segment are different;
performing first sensitive detection on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and performing first desensitization processing on the first sensitive data to obtain first desensitization data;
performing second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data, and performing second desensitization processing on the second sensitive data to obtain second desensitization data;
performing third sensitive detection on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and performing third desensitization processing on the third sensitive data to obtain third desensitization data;
and performing combined processing on the first desensitization data, the second desensitization data and the third desensitization data to obtain target data.
In some embodiments, the step of performing segmentation processing on the original data according to preset desensitization category labels to obtain to-be-desensitized word segments corresponding to each desensitization category label includes:
performing classification probability calculation on the original data through a preset function and the desensitization class labels to obtain a classification probability value of each desensitization class label;
and carrying out segmentation processing on the original data according to the classification probability value to obtain the word segment to be desensitized.
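The two steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the "preset function" is a softmax over per-label scores, and the label names, the toy scoring rule `score_token` and the whitespace tokenization are all hypothetical.

```python
import math

# Hypothetical desensitization category labels (the source does not list them).
LABELS = ["entity", "pattern", "dictionary"]

def softmax(scores):
    """Turn raw scores into classification probability values that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_token(token):
    """Toy scoring rule, assumed purely for illustration."""
    digit_ratio = sum(c.isdigit() for c in token) / len(token)
    return [
        2.0 if token[:1].isupper() else 0.0,  # capitalised: likely a named entity
        4.0 * digit_ratio,                    # digit-heavy: likely matches a pattern
        1.0,                                  # baseline: fall back to dictionary lookup
    ]

def segment(text):
    """Group tokens under the label with the highest classification probability."""
    segments = {label: [] for label in LABELS}
    for token in text.split():
        probs = softmax(score_token(token))
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        segments[LABELS[best]].append(token)
    return segments
```

Calling `segment("Alice called 13800138000 about refund")` routes `Alice` to the `entity` segment, the number to the `pattern` segment, and the remaining words to the `dictionary` segment.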
In some embodiments, the performing, by a pre-trained sensitive information detection model, a first sensitive detection on the first word segment to obtain first sensitive data, and performing a first desensitization process on the first sensitive data to obtain first desensitization data includes:
inputting the first word segment into the sensitive information detection model, wherein the sensitive information detection model comprises a convolutional layer, a fully connected layer and a decoding layer;
extracting entity features of the first word segment through the convolutional layer to obtain candidate word segment features;
screening the candidate word segment features through part-of-speech category labels preset in the fully connected layer to obtain target word segment features to be desensitized;
decoding the target word segment characteristics through the decoding layer to obtain the first sensitive data;
and carrying out first desensitization processing on the first sensitive data to obtain first desensitization data.
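To make the data flow through the three layers concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the two-feature embedding, the hand-set kernels and weights (a real model would learn them during pre-training), the label set `PER`/`O`, and masking with asterisks as the first desensitization processing.

```python
HONORIFICS = {"Mr.", "Ms.", "Dr."}   # hypothetical
LABELS = ["PER", "O"]                # hypothetical part-of-speech category labels

# Hand-set parameters standing in for trained weights.
KERNELS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # channel 0: current token is capitalised
    [[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]],  # channel 1: previous token is an honorific
]
FC_WEIGHTS = [[1.0, 1.0], [0.0, 0.0]]      # one row per label
FC_BIAS = [-1.5, 0.0]

def embed(token):
    """Toy per-token features: [capitalised flag, honorific flag]."""
    return [1.0 if token[:1].isupper() else 0.0,
            1.0 if token in HONORIFICS else 0.0]

def conv_layer(features):
    """Slide a width-3 window over the zero-padded token sequence."""
    dim = len(features[0])
    pad = [0.0] * dim
    padded = [pad] + features + [pad]
    out = []
    for i in range(len(features)):
        window = padded[i:i + 3]
        out.append([
            sum(kernel[j][d] * window[j][d] for j in range(3) for d in range(dim))
            for kernel in KERNELS
        ])
    return out

def fully_connected(candidates):
    """Score every candidate feature vector against each label."""
    return [[sum(w * c for w, c in zip(row, cand)) + b
             for row, b in zip(FC_WEIGHTS, FC_BIAS)]
            for cand in candidates]

def decode(tokens, scores):
    """Keep the tokens whose best-scoring label is the sensitive one."""
    return [tok for tok, s in zip(tokens, scores)
            if LABELS[s.index(max(s))] == "PER"]

def detect_and_mask(tokens):
    """Return (first sensitive data, first desensitization data)."""
    if not tokens:
        return [], ""
    scores = fully_connected(conv_layer([embed(t) for t in tokens]))
    sensitive = decode(tokens, scores)
    masked = " ".join("*" * len(t) if t in sensitive else t for t in tokens)
    return sensitive, masked
```

With these hand-set weights, a token is labelled `PER` only when it is capitalised and directly follows an honorific, so `detect_and_mask("Mr. Zhang paid Tuesday".split())` flags `Zhang` and leaves `Tuesday` untouched.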
In some embodiments, the performing, by using a preset regular expression, a second sensitive detection on the second word segment to obtain second sensitive data, and performing a second desensitization process on the second sensitive data to obtain second desensitization data includes:
coding the second word segment through a preset coder to obtain a target character string to be desensitized;
performing sensitive detection on the target character string through the regular expression to obtain second sensitive data;
and carrying out second desensitization treatment on the second sensitive data to obtain second desensitization data.
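A minimal sketch of these three steps, using Python's standard `re` and `unicodedata` modules. The choice of Unicode NFKC normalization as the "preset coder", the two patterns (an 11-digit mobile number and an e-mail address) and the masking rules are assumptions made for illustration; the source does not specify them.

```python
import re
import unicodedata

# Hypothetical preset regular expressions.
PHONE = re.compile(r"(?<!\d)(1[3-9]\d)(\d{4})(\d{4})(?!\d)")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def encode(word_segment):
    """Assumed encoder: fold full-width characters into their ASCII forms."""
    return unicodedata.normalize("NFKC", word_segment)

def detect(target):
    """Second sensitive detection: collect every substring matching a pattern."""
    return [m.group(0) for m in PHONE.finditer(target)] + EMAIL.findall(target)

def desensitize(target):
    """Second desensitization: hide the middle digits and replace e-mail addresses."""
    masked = PHONE.sub(lambda m: m.group(1) + "****" + m.group(3), target)
    return EMAIL.sub("[EMAIL]", masked)
```

For example, `desensitize(encode("call １３８００１３８０００ or mail a.b@example.com"))` yields `call 138****8000 or mail [EMAIL]`; the normalization step is what lets the ASCII-oriented pattern catch the full-width digits.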
In some embodiments, the performing, by a preset sensitive word dictionary, third sensitive detection on the third word segment to obtain third sensitive data, and performing third desensitization processing on the third sensitive data to obtain third desensitization data includes:
traversing the sensitive word dictionary, and performing similarity calculation on the third word segment and each reference word segment in the sensitive word dictionary to obtain word segment similarity;
screening the reference word segments according to the word segment similarity to obtain third sensitive data;
and carrying out third desensitization treatment on the third sensitive data to obtain third desensitization data.
In some embodiments, the step of performing a filtering process on the reference word segment according to the word segment similarity to obtain the third sensitive data includes:
comparing the word segment similarity with a preset similarity threshold;
and selecting the reference word segment with the word segment similarity larger than or equal to the similarity threshold value as the third sensitive data.
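The dictionary traversal and threshold comparison could look like the sketch below. It assumes the similarity measure is the ratio returned by `difflib.SequenceMatcher` from the Python standard library; the dictionary contents, the threshold value of 0.8 and the asterisk masking are likewise hypothetical.

```python
from difflib import SequenceMatcher

SENSITIVE_WORDS = ["password", "diagnosis", "salary"]  # hypothetical dictionary
SIMILARITY_THRESHOLD = 0.8                             # hypothetical preset threshold

def similarity(word_segment, reference):
    """Word segment similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    return SequenceMatcher(None, word_segment.lower(), reference.lower()).ratio()

def detect(word_segment):
    """Traverse the dictionary and keep reference words at or above the threshold."""
    return [ref for ref in SENSITIVE_WORDS
            if similarity(word_segment, ref) >= SIMILARITY_THRESHOLD]

def desensitize(word_segment):
    """Mask the word segment if any reference word was selected."""
    return "*" * len(word_segment) if detect(word_segment) else word_segment
```

A similarity threshold, rather than exact matching, lets near-variants such as `passwords` still select the reference word `password`, while unrelated words such as `weather` pass through unchanged.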
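In some embodiments, the step of performing combined processing on the first desensitization data, the second desensitization data and the third desensitization data to obtain target data includes:
splicing the first desensitization data, the second desensitization data and the third desensitization data according to a preset splicing order to obtain initial data;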
and filtering the initial data to obtain the target data.
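A short sketch of this combination step follows. The source does not say what the preset order or the filtering rule is, so both are assumptions here: the order is first, second, third, and the filter only collapses redundant whitespace left over from splicing.

```python
SPLICE_ORDER = ["first", "second", "third"]  # hypothetical preset splicing order

def splice(parts):
    """Join the desensitization data in the preset order to get the initial data."""
    return " ".join(parts[name] for name in SPLICE_ORDER if name in parts)

def filter_initial(initial):
    """Assumed filter rule: drop empty fragments and collapse repeated whitespace."""
    return " ".join(initial.split())

def combine(parts):
    """Splice, then filter, to obtain the target data."""
    return filter_initial(splice(parts))
```

Because `splice` reads the parts by name, the result does not depend on the order in which the three detectors finished.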
To achieve the above object, a second aspect of the embodiments of the present application proposes a desensitizing apparatus, including:
the original data acquisition module is used for acquiring original data to be desensitized;
the segmentation module is used for segmenting the original data according to preset desensitization category labels to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, and the sensitivity categories of the first word segment, the second word segment and the third word segment are different;
the first sensitive detection module is used for carrying out first sensitive detection on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data and carrying out first desensitization processing on the first sensitive data to obtain first desensitization data;
the second sensitive detection module is used for performing second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data, and performing second desensitization processing on the second sensitive data to obtain second desensitization data;
the third sensitive detection module is used for carrying out third sensitive detection on the third word segment through a preset sensitive word dictionary to obtain third sensitive data and carrying out third desensitization processing on the third sensitive data to obtain third desensitization data;
and the combination module is used for carrying out combination processing on the first desensitization data, the second desensitization data and the third desensitization data to obtain target data.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and the storage medium stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.
According to the desensitization method, the desensitization device, the electronic device and the storage medium provided by the application, original data to be desensitized is acquired and segmented according to preset desensitization category labels to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment whose sensitivity categories differ from one another. Because the sensitive information in the original data is classified and segmented according to the preset desensitization category labels, different types of sensitive information can be detected and desensitized in different ways, which makes the desensitization process more targeted. Further, first sensitive detection is performed on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and first desensitization processing is performed on the first sensitive data to obtain first desensitization data; second sensitive detection is performed on the second word segment through a preset regular expression to obtain second sensitive data, and second desensitization processing is performed on the second sensitive data to obtain second desensitization data; third sensitive detection is performed on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and third desensitization processing is performed on the third sensitive data to obtain third desensitization data. Using a sensitive information detection model, a regular expression and a sensitive word dictionary to detect different types of sensitive information effectively improves the comprehensiveness of detection. Meanwhile, the sensitive information detection model, the regular expression and the sensitive word dictionary can be preset according to actual conditions, used directly in different desensitization scenarios, and continuously expanded and refined as desensitization requirements change, which reduces the cost of use and meets personalized requirements. Finally, the first desensitization data, the second desensitization data and the third desensitization data are combined to obtain target data.
Drawings
FIG. 1 is a flowchart of a desensitization method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 in FIG. 1;
FIG. 3 is a flowchart of step S103 in FIG. 1;
FIG. 4 is a flowchart of step S104 in FIG. 1;
FIG. 5 is a flowchart of step S105 in FIG. 1;
FIG. 6 is a flowchart of step S502 in FIG. 5;
FIG. 7 is a flowchart of step S106 in FIG. 1;
FIG. 8 is a schematic structural diagram of a desensitization device provided in an embodiment of the present application;
FIG. 9 is a schematic hardware structure diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are illustrated as being partitioned in a schematic diagram of an apparatus and logical order is illustrated in a flowchart, in some cases, the steps illustrated or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
First, several terms used in the present application are explained:
Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this area includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand and use human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, often called computational linguistics. Natural language processing includes syntactic analysis, semantic analysis, discourse understanding and the like. It is commonly used in machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research and linguistic research related to language computation.
Information Extraction (NER): a text processing technique that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Text data is composed of specific units such as sentences, paragraphs and chapters, and text information is composed of smaller units such as words, phrases, sentences and paragraphs, or combinations of them. Extracting noun phrases, personal names, place names and the like from text data is text information extraction, and the information extracted by this technique may of course be of various types.
Deep Learning (DL): a new research direction in the field of Machine Learning (ML). The concept of deep learning derives from research on artificial neural networks; a multi-layer perceptron with several hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute classes or features, so as to discover distributed feature representations of the data. The motivation for studying deep learning is to build neural networks that simulate the analytical learning of the human brain and imitate its mechanisms to interpret data such as images, sounds and text. Deep learning is a general term for a class of pattern analysis methods and, in terms of specific research content, mainly involves three classes of methods:
(1) Neural network systems based on convolution operations, i.e. Convolutional Neural Networks (CNN).
(2) Autoencoder-type neural networks based on multiple layers of neurons, including both autoencoders and sparse coding, which has received much attention in recent years.
(3) Deep Belief Networks (DBN), which are pre-trained as multi-layer autoencoders and then further optimize the network weights using discriminative information.
BERT (Bidirectional Encoder Representations from Transformers): a language representation model. BERT is built by connecting Transformer encoder blocks and is a typical bidirectional encoding model.
Maximum Entropy Markov Model (MEMM): used to calculate, for a given observation sequence X, the conditional probability distribution over each hidden state sequence Y; that is, it models the transition probability and the emission probability jointly and estimates a conditional probability rather than a co-occurrence probability. Because MEMM is only locally normalized, it tends to fall into local optima.
Conditional random field (CRF) algorithm: a mathematical algorithm that combines the characteristics of the maximum entropy model and the hidden Markov model. It is an undirected graphical model that in recent years has performed well in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition. The conditional random field is a typical discriminative model, and its joint probability can be written as a product of several potential functions; the most common form is the linear-chain conditional random field. If $x = (x_1, x_2, \ldots, x_n)$ denotes the observed input data sequence and $y = (y_1, y_2, \ldots, y_n)$ denotes a state sequence, then, given an input sequence, the linear-chain CRF model defines the joint conditional probability of the state sequence as $p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right\}$ (2-14), with $Z(x) = \sum_{y'} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y'_{i-1}, y'_i, x, i) \right\}$ (2-15), where $Z(x)$ is a probability normalization factor conditioned on the observation sequence $x$, $f_j(y_{i-1}, y_i, x, i)$ is an arbitrary feature function, and $\lambda_j$ is the weight of that feature function.
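As a worked illustration of the linear-chain definition, the sketch below computes p(y | x) by brute force: it scores a state sequence with weighted feature functions fj(yi-1, yi, x, i) and divides by the normalization factor Z(x), obtained by summing over every possible state sequence. The two-state label set, the three feature functions and their weights are hypothetical, and enumeration is only feasible for very short sequences; practical CRF implementations use dynamic programming instead.

```python
import math
from itertools import product

STATES = ["O", "S"]        # hypothetical states: ordinary / sensitive
WEIGHTS = [2.0, 1.0, 0.5]  # hypothetical weight for each feature function

def features(prev, cur, x, i):
    """Feature functions f_j(y_{i-1}, y_i, x, i); prev is None when i == 0."""
    return [
        1.0 if cur == "S" and x[i].isdigit() else 0.0,      # digits look sensitive
        1.0 if cur == "O" and not x[i].isdigit() else 0.0,  # other tokens look ordinary
        1.0 if prev == cur else 0.0,                        # neighbouring states agree
    ]

def score(y, x):
    """Weighted sum of all feature functions over all positions."""
    total = 0.0
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else None
        total += sum(w * f for w, f in zip(WEIGHTS, features(prev, y[i], x, i)))
    return total

def partition(x):
    """Normalization factor Z(x): sum of exp(score) over every state sequence."""
    return sum(math.exp(score(y, x)) for y in product(STATES, repeat=len(x)))

def prob(y, x):
    """Conditional probability p(y | x) of one state sequence."""
    return math.exp(score(y, x)) / partition(x)

def best_sequence(x):
    """State sequence with the highest conditional probability."""
    return max(product(STATES, repeat=len(x)), key=lambda y: prob(y, x))
```

For the observation `["id", "42"]`, the sequence `("O", "S")` receives the highest score, and the probabilities of the four possible sequences sum to 1 because Z(x) normalizes over whole sequences rather than position by position.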
Decoding (decoder): converting the previously generated fixed vector into an output sequence. The input sequence can be text, speech, images or video; the output sequence can be text or images.
Regular Expression (RE): also known as a rule expression. Regular expressions are typically used to retrieve and replace text that conforms to a certain pattern (rule). A regular expression is a logical formula for operating on character strings, which consist of ordinary characters (e.g., the letters a to z) and special characters (called metacharacters): a "rule string" is formed from predefined specific characters and combinations of them, and this "rule string" expresses a filtering logic for character strings. A regular expression is a text pattern that describes one or more strings to be matched when searching text.
Some data contain sensitive information such as user privacy and business secrets, or information that does not meet industry requirements, so the relevant sensitive information must be desensitized. At present, a common desensitization method performs part-of-speech tagging on text information and then desensitizes the sensitive entity words identified by the tagging. This method usually needs a large amount of labeled data, high-quality labeled data is costly to obtain in that quantity, and the resulting desensitization accuracy is still low. How to improve desensitization accuracy has therefore become a technical problem to be solved urgently.
Based on this, embodiments of the present application provide a desensitization method, a desensitization apparatus, an electronic device, and a storage medium, aiming to improve desensitization accuracy.
The desensitization method, the desensitization device, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described in the following embodiments, and the desensitization method in the embodiments of the present application is first described.
The embodiments of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interactive systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The desensitization method provided by the embodiments of the application relates to the technical field of artificial intelligence and can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, a tablet, a notebook computer, a desktop computer, or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (content delivery network), and big data and artificial intelligence platforms; the software may be an application that implements the desensitization method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of the desensitization method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, obtaining original data to be desensitized;
step S102, according to preset desensitization category labels, carrying out segmentation processing on original data to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, and the sensitivity categories of the first word segment, the second word segment and the third word segment are different;
step S103, performing first sensitivity detection on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and performing first desensitization processing on the first sensitive data to obtain first desensitization data;
step S104, performing second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data, and performing second desensitization processing on the second sensitive data to obtain second desensitization data;
step S105, performing third sensitive detection on a third word segment through a preset sensitive word dictionary to obtain third sensitive data, and performing third desensitization processing on the third sensitive data to obtain third desensitization data;
step S106, combining the first desensitization data, the second desensitization data and the third desensitization data to obtain target data.
In steps S101 to S106 of the embodiment of the present application, the original data is segmented according to the preset desensitization category labels to obtain the word segments to be desensitized corresponding to each desensitization category label, where the word segments to be desensitized include a first word segment, a second word segment, and a third word segment whose sensitivity categories differ from one another. Because the sensitive information in the original data is classified and segmented according to the preset desensitization category labels, different types of sensitive information can be detected and desensitized in different ways, which makes the desensitization process more targeted. First sensitive detection is performed on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and first desensitization processing is performed on the first sensitive data to obtain first desensitization data; second sensitive detection is performed on the second word segment through a preset regular expression to obtain second sensitive data, and second desensitization processing is performed on the second sensitive data to obtain second desensitization data; third sensitive detection is performed on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and third desensitization processing is performed on the third sensitive data to obtain third desensitization data. Using the sensitive information detection model, the regular expression, and the sensitive word dictionary to detect different types of sensitive information effectively improves the comprehensiveness of detection. Meanwhile, the sensitive information detection model, the regular expression, and the sensitive word dictionary can be preset according to actual conditions, used directly in different desensitization scenarios, and continuously expanded and refined according to different desensitization requirements, which reduces the cost of use and meets personalized requirements. The method thereby identifies sensitive information accurately and improves desensitization accuracy and desensitization efficiency.
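The overall flow of steps S102 to S106 can be sketched as a small dispatcher. This is only an illustrative outline under stated assumptions: the function name `run_pipeline`, the `segment` callable, the `handlers` mapping, and the space-joined combination are hypothetical and are not specified by the application.

```python
from typing import Callable, Dict, List


def run_pipeline(raw: str,
                 segment: Callable[[str], Dict[str, List[str]]],
                 handlers: Dict[str, Callable[[str], str]]) -> str:
    """Route each word segment to the handler of its category, then combine.

    segment  : hypothetical callable mapping raw text to {category label: segments}
    handlers : hypothetical mapping from category label to a detect-and-mask function
    """
    segments_by_label = segment(raw)                    # corresponds to S102
    desensitized: List[str] = []
    for label, segments in segments_by_label.items():   # corresponds to S103 to S105
        handler = handlers[label]
        desensitized.extend(handler(s) for s in segments)
    # Assumption: the combination of S106 is reduced to a plain space join.
    return " ".join(desensitized)
```

Keeping the three detection routes behind a common callable interface is one way to let each route be extended independently, in line with the extensibility described above.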
In step S101 of some embodiments, a web crawler may be written and, after a data source is set, used to crawl data in a targeted manner, so as to obtain the original data to be desensitized. The original data may also be obtained in other ways, which is not limited here. It should be noted that the original data is mainly text data, and the text data contains sensitive information related to user privacy, such as names, locations, and identification numbers.
Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, step S201 to step S202:
step S201, performing classification probability calculation on original data through a preset function and desensitization class labels to obtain a classification probability value of each desensitization class label;
step S202, segmenting the original data according to the classification probability values to obtain the word segments to be desensitized.
In step S201 of some embodiments, the preset desensitization category labels include a part-of-speech-explicit class (names of people, places, times, and the like), a rule-explicit class (identification numbers, mobile phone numbers, policy numbers, and the like), and a special-case class (forbidden words, special names of people, special place names, and the like). The preset function may be a softmax function or the like; for example, the softmax function creates a probability distribution over the desensitization category labels for the data to be desensitized, yielding the classification probability value with which each text segment of the data belongs to each desensitization category.
In step S202 of some embodiments, each text segment is assigned, according to the classification probability values, to the desensitization category label with the maximum classification probability value, so as to obtain the word segments to be desensitized corresponding to each desensitization category label.
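Steps S201 and S202 can be illustrated with a minimal sketch. The label names in `LABELS` and the raw scores passed in are assumptions for illustration; the application does not specify how the scores fed to the softmax function are produced.

```python
import math

# Hypothetical label names standing in for the three desensitization categories.
LABELS = ["part_of_speech", "rule", "special_case"]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_segments(scored_segments):
    """Group each text segment under its highest-probability label.

    scored_segments: list of (text, raw_scores) pairs; the raw scores are
    assumed to come from an upstream classifier that is not specified here.
    """
    groups = {label: [] for label in LABELS}
    for text, scores in scored_segments:
        probs = softmax(scores)
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        groups[LABELS[best]].append(text)
    return groups
```

Each value of the returned dictionary plays the role of the first, second, or third word segment in the later steps.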
Before step S103 of some embodiments, the method further includes training the sensitive information detection model in advance. The sensitive information detection model is constructed mainly by a statistics-based method (e.g., a hidden Markov model or a conditional random field algorithm) or a deep-learning-based method (e.g., Lattice LSTM or LR-CNN). Through the named entity recognition algorithm (NER algorithm) of the sensitive information detection model, entities with specific meanings in the target text, such as person names, place names, organization names, dates, and proper nouns, can be identified. In the desensitization task of the embodiment of the present application, the vocabulary identified by named entity recognition can then be desensitized by the sensitive information detection model according to the requirements of the desensitization task.
Referring to fig. 3, in some embodiments, step S103 may include, but is not limited to, step S301 to step S305:
step S301, inputting the first word segment into a sensitive information detection model, wherein the sensitive information detection model comprises a convolution layer, a full connection layer and a decoding layer;
step S302, extracting entity characteristics of the first word segment through the convolutional layer to obtain candidate word segment characteristics;
step S303, screening the candidate word segment characteristics through the part-of-speech category labels preset in the full connection layer to obtain target word segment characteristics to be desensitized;
step S304, decoding the characteristics of the target word segment through a decoding layer to obtain first sensitive data;
step S305, performing first desensitization processing on the first sensitive data to obtain first desensitization data.
In step S301 of some embodiments, the first word segment is input into the sensitive information detection model, where the sensitive information detection model may be constructed based on a hidden Markov model, a conditional random field algorithm, or the like, and includes a convolutional layer, a fully-connected layer, and a decoding layer.
In step S302 of some embodiments, entity feature extraction is performed on the first word segment through the convolutional layer, and the entity features in the first word segment are captured to obtain candidate word segment features.
In step S303 in some embodiments, a probability distribution is created for the candidate word segment features through a part-of-speech category tag preset in the fully-connected layer, a part-of-speech category to which each candidate word segment feature belongs is obtained through a probability distribution condition, and candidate word segment features included in the part-of-speech category related to the sensitive information are extracted to obtain target word segment features to be desensitized.
In step S304 of some embodiments, the decoding layer performs decoding processing on the target word segment features, so as to implement mapping processing of the target word segment features from a vector space to a semantic space, and obtain target word segment features (i.e., first sensitive data) in a text form.
In step S305 of some embodiments, a preset mask symbol is obtained, where the mask symbol may be a null (empty) character, a preset pixel value, a punctuation mark, or the like. The first desensitization processing mainly consists of masking the first sensitive data: all or part of the actual content of the first sensitive data is replaced with the mask symbol, so that the field position of the first sensitive data is empty, shows a punctuation mark, or is covered by the preset pixel value, thereby obtaining the first desensitization data.
For example, the first word segment is "Li Hua plans to visit a customer at the First Financial Center on February 12". The sensitive information detection model detects and extracts entity features such as the name, time, and location, and the resulting target word segment features are: Li Hua (person), February 12 (time), and the First Financial Center (location). Mask desensitization processing is therefore performed on these fields with the mask symbol, yielding first desensitization data such as "Li * plans to visit a customer at * on *".
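The masking of detected entities can be sketched as a replacement over character spans. The spans here are computed by hand; in the method they would come from the sensitive information detection model. The function name `mask_spans`, the choice of `*` as the mask symbol, and the English sample sentence are illustrative assumptions.

```python
def mask_spans(text, spans, symbol="*"):
    """Replace each (start, end) character span of text with mask symbols.

    spans are assumed to come from an upstream entity detector; end is exclusive.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = symbol
    return "".join(chars)


# Illustrative sentence; the spans are located by hand for the sketch.
sentence = "Li Hua plans to visit a customer on February 12"
name_start = sentence.index("Hua")
date_start = sentence.index("February 12")
spans = [(name_start, name_start + len("Hua")),
         (date_start, date_start + len("February 12"))]
masked = mask_spans(sentence, spans)
```

Masking one symbol per character preserves the length of the field; a fixed-length mask could be used instead if the length itself is considered sensitive.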
In steps S301 to S305, sensitive vocabulary with an explicit part of speech (for example, names, locations, and times) can be conveniently detected and desensitized through the pre-constructed sensitive information detection model, so that desensitization accuracy and desensitization efficiency are improved.
Referring to fig. 4, in some embodiments, step S104 may include, but is not limited to, step S401 to step S403:
step S401, a preset encoder is used for encoding the second word segment to obtain a target character string to be desensitized;
step S402, carrying out sensitivity detection on the target character string through a regular expression to obtain second sensitive data;
step S403, performing second desensitization processing on the second sensitive data to obtain second desensitization data.
In step S401 in some embodiments, the preset encoder may be a BERT encoder, or may be other encoders, which is not limited. Taking a BERT encoder as an example, the BERT encoder is used for encoding the second word segment, and the second word segment is converted into a vector form from a text form, so as to obtain a target character string to be desensitized.
In step S402 of some embodiments, the preset regular expressions are of various types. Since regular expressions are commonly used to match character strings, different regular expressions can be written according to actual requirements to match different types of target character strings, for example, mailboxes, mobile phone numbers, identity card numbers, and policy numbers. Sensitive detection is then performed on the target character string through the regular expressions to obtain the second sensitive data.
For example, the regular expression for the mailbox may be
(\w)((\w)*)([-+.]\w+)*(@\w+([-.]\w+)*\.\w+([-.]\w+)*);
The regular expression of the mobile phone number can be
(13[0-9]|14[5|7]|15[0-9]|18[0-9])\d{8};
The regular expression of the identification number may be
(^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$).
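As a minimal sketch, the three expressions above can be compiled with Python's `re` module and applied to candidate tokens. The type names and the `detect_type` helper are illustrative assumptions; whole-token matching via `fullmatch` is also an assumption, chosen because the identification-number expression is anchored with `^` and `$`.

```python
import re

# Patterns copied from the text above; the type names are illustrative.
PATTERNS = {
    "email": re.compile(r"(\w)((\w)*)([-+.]\w+)*(@\w+([-.]\w+)*\.\w+([-.]\w+)*)"),
    "mobile": re.compile(r"(13[0-9]|14[5|7]|15[0-9]|18[0-9])\d{8}"),
    "id_number": re.compile(r"(^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$)"),
}


def detect_type(token):
    """Return the first sensitive type whose pattern matches the whole token, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(token):
            return name
    return None
```

Note that inside a character class `|` is a literal character, so `14[5|7]` also accepts `14|`; `14[57]` would express "145 or 147" more precisely, although the pattern is reproduced here as given.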
In step S403 of some embodiments, a preset mask symbol is obtained, where the mask symbol may be a null (empty) character, a preset pixel value, a punctuation mark, or the like. The second desensitization processing mainly consists of masking the second sensitive data: all or part of the actual content of the second sensitive data is replaced with the mask symbol, so that the field position of the second sensitive data is empty, shows a punctuation mark, or is covered by the preset pixel value, thereby obtaining the second desensitization data.
For example, the second word segment is "the customer's mailbox is 123456789@qq.com, the mobile phone number is 13012342224, and the identification number is 400102198001011230". After steps S401 to S403, the second desensitization data is "the customer's mailbox is 1*@qq.com, the mobile phone number is 1*24, and the identification number is 4001*1230", where * stands for the run of mask symbols that replaces the hidden characters.
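The partial masking in this example can be sketched as follows. The numbers of leading and trailing characters kept visible (one and two for the mobile phone number, four and four for the identification number, one for the mailbox) are inferred from the visible characters in the example and are assumptions, as are the helper names.

```python
def partial_mask(value, keep_head, keep_tail, symbol="*"):
    """Mask the middle of value, leaving keep_head and keep_tail characters visible."""
    if keep_head + keep_tail >= len(value):
        return symbol * len(value)  # too short to keep anything: mask everything
    middle = len(value) - keep_head - keep_tail
    return value[:keep_head] + symbol * middle + value[len(value) - keep_tail:]


def mask_email(address, symbol="*"):
    """Mask the local part of an e-mail address except its first character."""
    local, _, domain = address.partition("@")
    return partial_mask(local, 1, 0, symbol) + "@" + domain
```

For instance, `partial_mask("13012342224", 1, 2)` keeps the first digit and the last two digits and replaces the remaining eight with the mask symbol.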
In steps S401 to S403, by constructing various types of regular expressions in advance, sensitive information that follows explicit rules, such as character strings (e.g., identification numbers, mobile phone numbers, and policy numbers), can be detected and desensitized more conveniently, so that desensitization accuracy and desensitization efficiency are improved.
Referring to fig. 5, in some embodiments, step S105 may further include, but is not limited to, step S501 to step S503:
step S501, traversing the sensitive word dictionary, and performing similarity calculation on the third word segment and each reference word segment in the sensitive word dictionary to obtain the similarity of the word segments;
step S502, screening the reference word segments according to the similarity of the word segments to obtain third sensitive data;
step S503, performing a third desensitization process on the third sensitive data to obtain third desensitization data.
In step S501 of some embodiments, the sensitive word dictionary is a set of reference word segments separated from one another by a separator, so that similarity calculation can be performed between each reference word segment and the third word segment. Specifically, the third word segment and the reference word segment may be converted into vector form, and a cosine similarity calculation method is applied to the vectors of the third word segment and the reference word segment to obtain the word segment similarity.
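The similarity calculation of step S501 can be sketched as below. The application does not state how text is converted into vectors; the bag-of-characters count vector used here is a deliberately simple stand-in and an assumption, as are the function names.

```python
import math
from collections import Counter


def char_vector(text):
    """Convert text to a bag-of-characters count vector (an assumed, simple embedding)."""
    return Counter(text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def segment_similarities(segment, dictionary):
    """Traverse the dictionary and score the segment against every reference word segment."""
    vec = char_vector(segment)
    return {ref: cosine_similarity(vec, char_vector(ref)) for ref in dictionary}
```

In practice a learned embedding would likely replace `char_vector`, since character counts ignore word order and meaning.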
In step S502 of some embodiments, when the reference word segments are screened according to the word segment similarity, the word segment similarity is compared with a preset similarity threshold, and the reference word segments meeting the requirement are selected as the third sensitive data according to the magnitude relationship between the word segment similarity and the similarity threshold.
In step S503 of some embodiments, a preset mask symbol is obtained, where the mask symbol may be a null (empty) character, a preset pixel value, a punctuation mark, or the like. The third desensitization processing mainly consists of masking the third sensitive data: all or part of the actual content of the third sensitive data is replaced with the mask symbol, so that the field position of the third sensitive data is empty, shows a punctuation mark, or is covered by the preset pixel value, thereby obtaining the third desensitization data.
For example, the sensitive word dictionary includes the reference word segment "pirated", and the third word segment is "a batch of pirated books was bought recently, and the quality is poor". Similarity calculation is performed on the phrases and characters in the reference word segment and the third word segment, and the similarity between the reference word segment "pirated" and "pirated books" in the third word segment is high. Mask desensitization processing is therefore performed on this field with the mask symbol, yielding third desensitization data such as "a batch of * books was bought recently, and the quality is poor".
In the steps S501 to S503, by constructing the sensitive word dictionary in advance, the specific sensitive words (for example, forbidden words, special names, special place names, etc.) can be detected and desensitized more conveniently, so that desensitization accuracy and desensitization efficiency are improved.
Referring to fig. 6, in some embodiments, step S502 further includes, but is not limited to, step S601 to step S602:
step S601, comparing the similarity of the word segments with a preset similarity threshold;
step S602, selecting a reference word segment with a word segment similarity greater than or equal to the similarity threshold as third sensitive data.
In steps S601 and S602 of some embodiments, the preset similarity threshold may be set according to actual requirements; for example, the similarity threshold may be 0.8. The word segment similarity is then compared with the similarity threshold. A word segment similarity greater than or equal to 0.8 indicates that the current third word segment is similar to the reference word segment and is highly likely to be a sensitive word segment, so the reference word segments whose word segment similarity is greater than or equal to the similarity threshold are selected as the third sensitive data. In this way, vocabulary that is highly likely to be sensitive is desensitized, while vocabulary that is less likely to be sensitive is still detected, and whether it requires desensitization can be determined by secondary screening, which improves desensitization accuracy and desensitization efficiency.
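The threshold comparison of steps S601 and S602 can be sketched as a simple partition. The function name and the idea of returning the below-threshold entries as a second list for secondary screening are illustrative assumptions; the application does not describe how the secondary screening is carried out.

```python
def screen_references(similarities, threshold=0.8):
    """Split reference word segments by the similarity threshold.

    similarities: mapping from reference word segment to its similarity score.
    Returns (selected, deferred): entries at or above the threshold, and the
    remaining entries that could be passed to a secondary screening step.
    """
    selected = [ref for ref, score in similarities.items() if score >= threshold]
    deferred = [ref for ref, score in similarities.items() if score < threshold]
    return selected, deferred
```

With the example threshold of 0.8, a reference word segment scoring exactly 0.8 is selected, matching the "greater than or equal to" condition of step S602.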
Referring to fig. 7, in some embodiments, step S106 may further include, but is not limited to, step S701 to step S702:
step S701, splicing the first desensitization data, the second desensitization data and the third desensitization data according to a preset splicing sequence to obtain initial data;
step S702, filtering the initial data to obtain target data.
In step S701 of some embodiments, the preset splicing order may be, for example, the chronological order in which the first desensitization data, the second desensitization data, and the third desensitization data were acquired. For example, on a database platform, the pieces of first, second, and third desensitization data are labeled according to the chronological order in which they were acquired, so that each piece carries a sequence tag, where the sequence tag may be an Arabic numeral sequence (1, 2, 3, …) or an English alphabet sequence (A, B, C, …). The tagged first, second, and third desensitization data are then spliced in sequence-tag order by using the CONCAT() function to obtain the initial data, which is the data that has undergone detection and desensitization processing by the sensitive information detection model, the regular expression, and the sensitive word dictionary.
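The tag-ordered splicing can be sketched in Python as a sort followed by a join. This is an analogue for illustration only, not the database CONCAT() function named above; the function name `splice` and the `(sequence_tag, text)` pair layout are assumptions.

```python
def splice(tagged_parts, separator=""):
    """Join desensitized parts in ascending order of their sequence tag.

    tagged_parts: list of (sequence_tag, text) pairs; tags may be integers
    (1, 2, 3, ...) or letters (A, B, C, ...), but must be mutually comparable.
    """
    ordered = sorted(tagged_parts, key=lambda pair: pair[0])
    return separator.join(text for _, text in ordered)
```

Sorting on the tag alone keeps the splice independent of the order in which the three detection routes happen to finish.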
In step S702 of some embodiments, in order to further improve desensitization accuracy, after the initial data is obtained, operations such as data cleaning and manual review may be performed on the initial data according to actual requirements, so as to filter or correct abnormal data or irregular data in the initial data, thereby obtaining target data, which is final desensitization data.
It should be noted that, in order to further improve the desensitization effect, the sensitive information detection model, the regular expression and the sensitive word dictionary may be preset according to actual conditions, and may be directly used in different desensitization scenarios, and the sensitive information detection model, the regular expression and the sensitive word dictionary may be continuously expanded and perfected according to different desensitization requirements, so that the use cost may be reduced, and personalized requirements may also be satisfied.
According to the desensitization method of the embodiment of the present application, original data to be desensitized is obtained and segmented according to preset desensitization category labels to obtain the word segments to be desensitized corresponding to each desensitization category label, where the word segments to be desensitized include a first word segment, a second word segment, and a third word segment whose sensitivity categories differ from one another. Because the sensitive information in the original data is classified and segmented according to the preset desensitization category labels, different types of sensitive information can be detected and desensitized in different ways, which makes the desensitization process more targeted. Further, first sensitive detection is performed on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and first desensitization processing is performed on the first sensitive data to obtain first desensitization data; second sensitive detection is performed on the second word segment through a preset regular expression to obtain second sensitive data, and second desensitization processing is performed on the second sensitive data to obtain second desensitization data; third sensitive detection is performed on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and third desensitization processing is performed on the third sensitive data to obtain third desensitization data. Using the sensitive information detection model, the regular expression, and the sensitive word dictionary to detect different types of sensitive information effectively improves the comprehensiveness of detection. Meanwhile, the sensitive information detection model, the regular expression, and the sensitive word dictionary can be preset according to actual conditions, used directly in different desensitization scenarios, and continuously expanded and refined according to different desensitization requirements, which reduces the cost of use and meets personalized requirements. Finally, the first desensitization data, the second desensitization data, and the third desensitization data are combined to obtain the target data.
Referring to fig. 8, an embodiment of the present application further provides a desensitization apparatus, which can implement the desensitization method described above, and the apparatus includes:
an original data acquisition module 801, configured to acquire original data to be desensitized;
a segmentation module 802, configured to perform segmentation processing on original data according to preset desensitization category labels to obtain to-be-desensitized word segments corresponding to each desensitization category label, where the to-be-desensitized word segments include a first word segment, a second word segment, and a third word segment, and the sensitivity categories of the first word segment, the second word segment, and the third word segment are different;
the first sensitive detection module 803 is configured to perform first sensitive detection on the first word segment through a sensitive information detection model trained in advance to obtain first sensitive data, and perform first desensitization processing on the first sensitive data to obtain first desensitization data;
the second sensitive detection module 804 is configured to perform second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data, and perform second desensitization processing on the second sensitive data to obtain second desensitization data;
a third sensitive detection module 805, configured to perform third sensitive detection on a third word segment through a preset sensitive word dictionary to obtain third sensitive data, and perform third desensitization processing on the third sensitive data to obtain third desensitization data;
and the combining module 806 is configured to perform combining processing on the first desensitization data, the second desensitization data, and the third desensitization data to obtain target data.
In some embodiments, the segmentation module 802 includes:
the classification probability calculation unit is used for performing classification probability calculation on the original data through a preset function and desensitization class labels to obtain a classification probability value of each desensitization class label;
and the segmentation processing unit is used for carrying out segmentation processing on the original data according to the classification probability value to obtain a word segment to be desensitized.
In some embodiments, the first sensitive detection module 803 includes:
the input unit is used for inputting the first word segment into a sensitive information detection model, wherein the sensitive information detection model comprises a convolution layer, a full-connection layer and a decoding layer;
the entity characteristic extraction unit is used for extracting entity characteristics of the first word segment through the convolutional layer to obtain candidate word segment characteristics;
the characteristic screening unit is used for screening the candidate word segment characteristics through the part of speech category labels preset in the full connection layer to obtain target word segment characteristics to be desensitized;
the decoding unit is used for decoding the target word segment characteristics through a decoding layer to obtain first sensitive data;
and the first desensitization unit is used for performing first desensitization processing on the first sensitive data to obtain first desensitization data.
In some embodiments, the second sensitive detection module 804 includes:
the coding unit is used for coding the second word segment through a preset coder to obtain a target character string to be desensitized;
the sensitive detection unit is used for carrying out sensitive detection on the target character string through the regular expression to obtain second sensitive data;
and the second desensitization unit is used for performing second desensitization treatment on the second sensitive data to obtain second desensitization data.
In some embodiments, the third sensitive detection module 805 includes:
the similarity calculation unit is used for traversing the sensitive word dictionary and calculating the similarity of the third word segment and each reference word segment in the sensitive word dictionary to obtain the similarity of the word segments;
the word segment screening unit is used for comparing the word segment similarity with a preset similarity threshold value and selecting a reference word segment with the word segment similarity larger than or equal to the similarity threshold value as third sensitive data;
and the third desensitization unit is used for performing third desensitization processing on the third sensitive data to obtain third desensitization data.
In some embodiments, the combining module 806 includes:
the splicing unit is used for splicing the first desensitization data, the second desensitization data and the third desensitization data according to a preset splicing sequence to obtain initial data;
and the filtering unit is used for filtering the initial data to obtain target data.
The specific implementation of the desensitization device is basically the same as the specific implementation of the desensitization method, and the detailed description is omitted here.
An embodiment of the present application further provides an electronic device, where the electronic device includes: a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, where the program, when executed by the processor, implements the desensitization method described above. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present application is implemented by software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute the desensitization method of the embodiments of the present application;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, memory 902, input/output interface 903, and communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 enable communication connections within the device with each other via a bus 905.
Embodiments of the present application further provide a storage medium, which is a computer-readable storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the desensitization method described above.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the desensitization method, the desensitization device, the electronic equipment and the storage medium provided by the embodiments of the present application, original data to be desensitized is obtained, and the original data is segmented according to preset desensitization category labels to obtain the word segments to be desensitized corresponding to each desensitization category label. The word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, whose sensitivity categories differ from one another. In this way, the sensitive information in the original data is classified and segmented according to the preset desensitization category labels, so that different types of sensitive information can be detected and desensitized in different ways, which makes the desensitization process targeted. Further, first sensitive detection is performed on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data, and first desensitization processing is performed on the first sensitive data to obtain first desensitization data; second sensitive detection is performed on the second word segment through a preset regular expression to obtain second sensitive data, and second desensitization processing is performed on the second sensitive data to obtain second desensitization data; third sensitive detection is performed on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and third desensitization processing is performed on the third sensitive data to obtain third desensitization data. Because a sensitive information detection model, a regular expression and a sensitive word dictionary are used to detect different types of sensitive information, the comprehensiveness of detection is effectively improved. Meanwhile, the sensitive information detection model, the regular expression and the sensitive word dictionary can be preset according to actual conditions, can be used directly in different desensitization scenarios, and can be continuously expanded and refined according to different desensitization requirements, which reduces the cost of use and meets personalized requirements. Finally, the first desensitization data, the second desensitization data and the third desensitization data are combined to obtain target data.
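The flow summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the phone-number pattern `PHONE_RE`, the dictionary `SENSITIVE_WORDS`, the masking rule `mask_keep_ends`, and the `detect` callable that stands in for the pre-trained sensitive information detection model are all hypothetical, and the input is assumed to have already been segmented into the three word segments.

```python
import re
from typing import Callable, Dict, List

# Hypothetical preset regular expression: an 11-digit number starting with 1.
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")

# Hypothetical preset sensitive word dictionary.
SENSITIVE_WORDS = {"diabetes", "bankruptcy"}


def mask_keep_ends(text: str, head: int = 3, tail: int = 4, fill: str = "*") -> str:
    """Mask the middle of a string, keeping `head` leading and `tail` trailing characters."""
    if len(text) <= head + tail:
        return fill * len(text)
    return text[:head] + fill * (len(text) - head - tail) + text[-tail:]


def desensitize_with_model(segment: str, detect: Callable[[str], List[str]]) -> str:
    """First branch: `detect` is a stand-in for the pre-trained detection model
    and is assumed to return the sensitive substrings found in the segment."""
    for entity in detect(segment):
        segment = segment.replace(entity, "*" * len(entity))
    return segment


def desensitize_with_regex(segment: str) -> str:
    """Second branch: detect with the preset regular expression, then mask each match."""
    return PHONE_RE.sub(lambda m: mask_keep_ends(m.group()), segment)


def desensitize_with_dictionary(segment: str) -> str:
    """Third branch: detect by exact lookup in the sensitive word dictionary, then mask."""
    for word in SENSITIVE_WORDS:
        segment = segment.replace(word, "*" * len(word))
    return segment


def desensitize(segments: Dict[str, str], detect: Callable[[str], List[str]]) -> str:
    """Run the three branches and combine the results in a fixed (assumed) order."""
    first = desensitize_with_model(segments["first"], detect)
    second = desensitize_with_regex(segments["second"])
    third = desensitize_with_dictionary(segments["third"])
    return " ".join([first, second, third])
```

For example, with the segments `{"first": "Name: Zhang San", "second": "Phone: 13812345678", "third": "History: diabetes"}` and a `detect` stub that returns `["Zhang San"]`, `desensitize` returns `Name: ********* Phone: 138****5678 History: ********`. Keeping each detector behind its own function reflects the point made above that the model, the regular expression and the dictionary can be preset and extended independently.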
The embodiments described in the present application are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on those technical solutions. It is obvious to those skilled in the art that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
It will be understood by those skilled in the art that the embodiments shown in fig. 1-7 are not limiting of the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c may be single or plural.
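As a small illustration of the "at least one of a, b, or c" reading given above, the following Python sketch enumerates every non-empty combination of the listed items. The function name `at_least_one_of` is hypothetical and is used here only for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def at_least_one_of(items: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every non-empty combination of `items`, smallest combinations first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
```

For three items this yields seven combinations, matching the seven cases listed in the paragraph above.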
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of desensitizing, said method comprising:
acquiring original data to be desensitized;
segmenting the original data according to preset desensitization category labels to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, and the sensitivity categories of the first word segment, the second word segment and the third word segment are different;
performing first sensitive detection on the first word segment through a sensitive information detection model trained in advance to obtain first sensitive data, and performing first desensitization processing on the first sensitive data to obtain first desensitization data;
performing second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data, and performing second desensitization processing on the second sensitive data to obtain second desensitization data;
performing third sensitive detection on the third word segment through a preset sensitive word dictionary to obtain third sensitive data, and performing third desensitization processing on the third sensitive data to obtain third desensitization data;
and performing combined processing on the first desensitization data, the second desensitization data and the third desensitization data to obtain target data.
2. The desensitization method according to claim 1, wherein the step of segmenting the original data according to preset desensitization category labels to obtain to-be-desensitized word segments corresponding to each desensitization category label comprises:
performing classification probability calculation on the original data through a preset function and the desensitization class labels to obtain the classification probability value of each desensitization class label;
and carrying out segmentation processing on the original data according to the classification probability value to obtain the word segment to be desensitized.
3. The desensitization method according to claim 1, wherein said step of performing a first sensitivity detection on said first word segment by a pre-trained sensitive information detection model to obtain first sensitive data, and performing a first desensitization process on said first sensitive data to obtain first desensitized data comprises:
inputting the first word segment into the sensitive information detection model, wherein the sensitive information detection model comprises a convolution layer, a full-connection layer and a decoding layer;
extracting entity characteristics of the first word segment through the convolutional layer to obtain candidate word segment characteristics;
screening the candidate word segment characteristics through a part-of-speech category label preset in the full-connection layer to obtain target word segment characteristics to be desensitized;
decoding the target word segment characteristics through the decoding layer to obtain the first sensitive data;
and carrying out first desensitization processing on the first sensitive data to obtain first desensitization data.
4. The desensitization method according to claim 1, wherein the step of performing a second sensitivity detection on the second word segment by using a preset regular expression to obtain second sensitive data, and performing a second desensitization process on the second sensitive data to obtain second desensitization data comprises:
coding the second word segment through a preset coder to obtain a target character string to be desensitized;
performing sensitivity detection on the target character string through the regular expression to obtain second sensitive data;
and carrying out second desensitization treatment on the second sensitive data to obtain second desensitization data.
5. The desensitization method according to claim 1, wherein said step of performing third sensitive detection on said third word segment through a preset sensitive word dictionary to obtain third sensitive data, and performing third desensitization processing on said third sensitive data to obtain third desensitization data comprises:
traversing the sensitive word dictionary, and performing similarity calculation on the third word segment and each reference word segment in the sensitive word dictionary to obtain word segment similarity;
screening the reference word segments according to the word segment similarity to obtain third sensitive data;
and carrying out third desensitization treatment on the third sensitive data to obtain third desensitization data.
6. The desensitization method according to claim 5, wherein said step of screening said reference word segments according to said word segment similarities to obtain said third sensitive data comprises:
comparing the word segment similarity with a preset similarity threshold;
and selecting the reference word segment with the word segment similarity larger than or equal to the similarity threshold value as the third sensitive data.
7. A desensitization method according to any of claims 1 to 6, wherein said step of performing a combined processing of said first desensitization data, said second desensitization data and said third desensitization data to obtain target data comprises:
splicing the first desensitization data, the second desensitization data and the third desensitization data according to a preset splicing sequence to obtain initial data;
and filtering the initial data to obtain the target data.
8. A desensitizing apparatus, said apparatus comprising:
the original data acquisition module is used for acquiring original data to be desensitized;
the segmentation module is used for segmenting the original data according to preset desensitization category labels to obtain word segments to be desensitized corresponding to each desensitization category label, wherein the word segments to be desensitized comprise a first word segment, a second word segment and a third word segment, and the sensitivity categories of the first word segment, the second word segment and the third word segment are different;
the first sensitive detection module is used for carrying out first sensitive detection on the first word segment through a pre-trained sensitive information detection model to obtain first sensitive data and carrying out first desensitization processing on the first sensitive data to obtain first desensitization data;
the second sensitive detection module is used for carrying out second sensitive detection on the second word segment through a preset regular expression to obtain second sensitive data and carrying out second desensitization processing on the second sensitive data to obtain second desensitization data;
the third sensitive detection module is used for carrying out third sensitive detection on the third word segment through a preset sensitive word dictionary to obtain third sensitive data and carrying out third desensitization processing on the third sensitive data to obtain third desensitization data;
and the combination module is used for carrying out combination processing on the first desensitization data, the second desensitization data and the third desensitization data to obtain target data.
9. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, the program when executed by the processor implementing the steps of the desensitization method of any of claims 1 to 7, and a data bus for implementing the connection communication between the processor and the memory.
10. A storage medium, which is a computer-readable storage medium, characterized in that the storage medium stores one or more programs, which are executable by one or more processors, to implement the steps of the desensitization method of any one of claims 1 to 7.
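
The dictionary-based detection recited in claims 5 and 6 (traverse the sensitive word dictionary, compute a similarity for each reference word segment, and keep those at or above a threshold) can be sketched as follows. The claims do not name a similarity measure, so this sketch assumes the ratio computed by `difflib.SequenceMatcher` from the Python standard library; the function names and the default threshold of 0.8 are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def word_similarity(a: str, b: str) -> float:
    """Assumed similarity measure: SequenceMatcher ratio in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def match_sensitive_words(
    segment: str, dictionary: Iterable[str], threshold: float = 0.8
) -> List[str]:
    """Traverse the dictionary and keep each reference word segment whose
    similarity to `segment` is greater than or equal to `threshold`."""
    return [ref for ref in dictionary if word_similarity(segment, ref) >= threshold]
```

With this measure, a misspelled segment such as `"diabetis"` scores 0.875 against the reference word `"diabetes"` and is therefore selected at a threshold of 0.8, while an unrelated reference word such as `"asthma"` is not.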
CN202210283373.9A 2022-03-22 2022-03-22 Desensitization method, desensitization device, electronic apparatus, and storage medium Pending CN114626097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210283373.9A CN114626097A (en) 2022-03-22 2022-03-22 Desensitization method, desensitization device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN114626097A (en) 2022-06-14

Family

ID=81903387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210283373.9A Pending CN114626097A (en) 2022-03-22 2022-03-22 Desensitization method, desensitization device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN114626097A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048682A (en) * 2022-08-15 2022-09-13 河北省农林科学院农业信息与经济研究所 Safe storage method of land circulation information
CN115221884A (en) * 2022-09-15 2022-10-21 北京铀媒科技有限公司 Specific person detection method, system, storage medium and terminal
CN115618371A (en) * 2022-07-11 2023-01-17 上海期货信息技术有限公司 Desensitization method and device for non-text data and storage medium
CN115688184A (en) * 2022-12-26 2023-02-03 平安银行股份有限公司 Log desensitization method and device, electronic equipment and storage medium
CN116719907A (en) * 2023-06-26 2023-09-08 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium
CN117010019A (en) * 2023-08-04 2023-11-07 北京泰策科技有限公司 Data desensitization method and system based on NLP language model

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618371A (en) * 2022-07-11 2023-01-17 上海期货信息技术有限公司 Desensitization method and device for non-text data and storage medium
CN115618371B (en) * 2022-07-11 2023-08-04 上海期货信息技术有限公司 Non-text data desensitization method, device and storage medium
CN115048682A (en) * 2022-08-15 2022-09-13 河北省农林科学院农业信息与经济研究所 Safe storage method of land circulation information
CN115048682B (en) * 2022-08-15 2022-11-01 河北省农林科学院农业信息与经济研究所 Safe storage method for land circulation information
CN115221884A (en) * 2022-09-15 2022-10-21 北京铀媒科技有限公司 Specific person detection method, system, storage medium and terminal
CN115688184A (en) * 2022-12-26 2023-02-03 平安银行股份有限公司 Log desensitization method and device, electronic equipment and storage medium
CN115688184B (en) * 2022-12-26 2023-03-31 平安银行股份有限公司 Log desensitization method and device, electronic equipment and storage medium
CN116719907A (en) * 2023-06-26 2023-09-08 阿波罗智联(北京)科技有限公司 Data processing method, device, equipment and storage medium
CN117010019A (en) * 2023-08-04 2023-11-07 北京泰策科技有限公司 Data desensitization method and system based on NLP language model
CN117010019B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Data desensitization method and system based on NLP language model

Similar Documents

Publication Publication Date Title
CN114626097A (en) Desensitization method, desensitization device, electronic apparatus, and storage medium
CN113792818A (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN113887215A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114358007A (en) Multi-label identification method and device, electronic equipment and storage medium
CN114722069A (en) Language conversion method and device, electronic equipment and storage medium
CN114359810A (en) Video abstract generation method and device, electronic equipment and storage medium
CN114240552A (en) Product recommendation method, device, equipment and medium based on deep clustering algorithm
CN114926039A (en) Risk assessment method, risk assessment device, electronic device, and storage medium
CN114519356A (en) Target word detection method and device, electronic equipment and storage medium
CN114897060A (en) Training method and device of sample classification model, and sample classification method and device
CN114841146A (en) Text abstract generation method and device, electronic equipment and storage medium
CN114637847A (en) Model training method, text classification method and device, equipment and medium
CN114416995A (en) Information recommendation method, device and equipment
CN114613462A (en) Medical data processing method and device, electronic equipment and storage medium
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN114064894A (en) Text processing method and device, electronic equipment and storage medium
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
CN114595357A (en) Video searching method and device, electronic equipment and storage medium
CN114492437B (en) Keyword recognition method and device, electronic equipment and storage medium
CN114090778A (en) Retrieval method and device based on knowledge anchor point, electronic equipment and storage medium
CN115795007A (en) Intelligent question-answering method, intelligent question-answering device, electronic equipment and storage medium
CN114998041A (en) Method and device for training claim settlement prediction model, electronic equipment and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN115204300A (en) Data processing method, device and storage medium for text and table semantic interaction
CN115270746A (en) Question sample generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination