CN114036940A - Sensitive data identification method and device, electronic equipment and storage medium - Google Patents

Sensitive data identification method and device, electronic equipment and storage medium

Info

Publication number
CN114036940A
Authority
CN
China
Prior art keywords
text
processed
unit
feature extraction
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111312004.XA
Other languages
Chinese (zh)
Inventor
张黎
石桂红
余海波
陈广辉
刘维炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Flash It Co ltd
Original Assignee
Flash It Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Flash It Co ltd
Priority to CN202111312004.XA
Publication of CN114036940A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for identifying sensitive data, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a text to be processed; inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network; inputting the spatial features of each unit in the text to be processed into a label prediction model to obtain label information of each unit output by the label prediction model; and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed. In this scheme, the label information of the text is obtained through the feature extraction network and the label prediction model, so that sensitive words are effectively recognized.

Description

Sensitive data identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying sensitive data, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of computer technology in the twenty-first century and the advent of the big data era, the explosion of information volume has brought a number of unavoidable problems, such as the presence of many illegal words, including abusive or politics-related sentences, in text data. How to identify these abusive words or politics-related statements remains an open problem.
Traditional big data security mainly relies on rules, related algorithms, keywords and the like to identify sensitive data in a text. For example, according to the national technical security requirements for shared data, the sensitive data in a text include IP addresses, MAC addresses, IPv6 addresses, mobile phone numbers, bank card numbers, addresses, names and the like. Regular data such as IP addresses are detected with regular expressions, and sensitive data such as bank card numbers or identity card numbers can be detected with an algorithm.
However, ambiguous words and sentences such as addresses and names are not detected well by traditional algorithms. For example, name detection usually involves writing all surnames into a JSON file and checking whether the first character or the first two characters of a word appear in the surname file; if so, the word is taken to be a name. Such detection results are inaccurate, with two main disadvantages: 1. China has a large population and its surnames are widely distributed, and the surnames of ethnic minorities in particular are highly varied, so the JSON file cannot contain all surnames; 2. the names detected in this way are not disambiguated.
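The rule-based baseline described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the card-number algorithm, so the Luhn checksum is assumed here, and the `SURNAMES` set, the sample words and all function names are hypothetical.

```python
import re

# IPv4 pattern: four dot-separated octets, each in the range 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?![\d.])")


def find_ipv4(text):
    """Return every IPv4-looking substring found by the regular expression."""
    return IPV4_RE.findall(text)


def luhn_valid(number):
    """Luhn checksum (an assumption: the source only says 'the algorithm')."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical, deliberately incomplete surname list (stands in for the JSON file).
SURNAMES = {"张", "王", "李", "欧阳"}


def looks_like_name(word):
    """Surname-prefix rule: first character or first two characters match."""
    return word[:1] in SURNAMES or word[:2] in SURNAMES
```

With this sketch, `looks_like_name("张开")` returns `True` although the word is an ordinary verb (no disambiguation), while a name whose surname is missing from the list is not detected at all, which mirrors the two disadvantages listed above.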
Disclosure of Invention
The embodiment of the application provides a sensitive data identification method, which is used for solving the problem that the traditional algorithm cannot accurately identify sensitive data.
The embodiment of the application provides a method for identifying sensitive data, which comprises the following steps:
acquiring a text to be processed;
inputting the text to be processed into a feature extraction network to obtain the spatial feature of each unit in the text to be processed output by the feature extraction network;
inputting the spatial characteristics of each unit in the text to be processed into a label prediction model to obtain label information of each unit output by the label prediction model;
and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.
In an embodiment, after the determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed, the method further includes:
and replacing the sensitive vocabulary in the text to be processed by using the specified characters to obtain desensitization data.
In an embodiment, the inputting a text to be processed into a feature extraction network to obtain a spatial feature of each unit in the text to be processed output by the feature extraction network includes:
performing word segmentation operation on the text to be processed to obtain a plurality of units;
and inputting each unit of the text to be processed into a feature extraction network, and obtaining the spatial feature corresponding to each unit output by the feature extraction network.
In one embodiment, the feature extraction network is obtained by modifying the Inception-v4 network: the softmax layer of the Inception-v4 network is removed and a full convolutional layer is added.
In an embodiment, the inputting the spatial feature of each unit in the text to be processed into a tag prediction model to obtain tag information of each unit output by the tag prediction model includes:
inputting the spatial characteristics of each unit into the trained Bi-GRU model to obtain a prediction label of each unit output by the Bi-GRU model;
and taking the prediction label of each unit output by the Bi-GRU model as the input of the CRF model which is trained, and obtaining the label information of each unit output by the CRF model.
In an embodiment, before the obtaining the text to be processed, the method further includes:
acquiring a training text set;
performing word segmentation processing on each training text in the training text set by adopting a word segmentation tool;
acquiring labeling information of sensitive words and phrases in each training text and labeling information of other words and phrases;
and training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.
In an embodiment, the training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text includes:
taking each training text as the input of the improved Inception-v4 network, taking the output of the improved Inception-v4 network as the input of the Bi-GRU model, taking the output of the Bi-GRU model as the input of the CRF model, and adjusting the parameters of the Inception-v4 network, the Bi-GRU model and the CRF model so that the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, thereby obtaining the feature extraction network from the trained improved Inception-v4 network and the label prediction model from the trained Bi-GRU model and CRF model.
The embodiment of the present application further provides a device for identifying sensitive data, including:
the text acquisition module is used for acquiring a text to be processed;
the feature extraction module is used for inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;
the label information module is used for inputting the spatial characteristics of each unit in the text to be processed into a label prediction model and obtaining the label information of each unit output by the label prediction model;
and the sensitive vocabulary module is used for determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any one of the above-described methods of sensitive data identification.
The embodiment of the application also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program can be executed by a processor to execute any one of the above methods for sensitive data identification.
According to the technical scheme provided by the embodiment of the application, the spatial features of each unit of the text to be processed are extracted through the feature extraction network, the spatial features are then input into the label prediction model to obtain the label information of each unit of the text to be processed, and the sensitive words of the text to be processed are determined according to the label information, so that accurate identification of the sensitive data is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for identifying sensitive data according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an improved Inception-v4 network according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an improved Bi-GRU model provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a GRU unit according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of training a feature extraction network and a label prediction model according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of sensitive data identification according to an embodiment of the present application;
Fig. 8 is a block diagram of an apparatus for sensitive data identification according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
The method for identifying the sensitive data provided by the embodiment of the application can be applied to the following scenario: a text to be processed is acquired by web crawler technology; after the text to be processed is segmented by the Jieba tool, the segmented text is input into a trained improved Inception-v4 network to obtain the spatial features of each unit in the text to be processed; the spatial features of each unit are input into the trained Bi-GRU model and CRF model to obtain the label information of each unit; the sensitive vocabulary of the text to be processed is determined according to the label information and replaced with specific characters, thereby accurately identifying the sensitive data and realizing desensitization of the data.
Fig. 1 shows an electronic device 1 according to an embodiment of the present application, where the electronic device 1 includes: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11 and the instructions are executed by the processor 11. The processor 11 is configured to execute the method for identifying sensitive data provided in the embodiment of the present application.
The processor 11 may be a device comprising a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, may process data for other components in the electronic device 1, and may also control other components in the electronic device 1 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer readable storage medium and executed by processor 11 to implement the sensitive data identification method described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The components and structures of the electronic device 1 shown in fig. 1 are exemplary only, and not limiting, and the electronic device 1 may have other components and structures according to needs.
In an embodiment, the example electronic device 1 for implementing the method for sensitive data identification of the embodiment of the present application may be implemented as a smart device such as a smart phone, a tablet computer, a desktop computer, a notebook computer, and a vehicle-mounted terminal.
Fig. 2 is a schematic flowchart of a method for identifying sensitive data according to an embodiment of the present application. As shown in fig. 2, the method may be performed by the electronic device 1 shown in fig. 1 to realize the sensitive data identification, and the method includes the following steps S210-S240.
Step S210: acquire a text to be processed.
In this step, a text to be processed is obtained. The text to be processed may be data text crawled from websites such as Baidu, Meituan, Taobao and Sina.
Step S220: input the text to be processed into a feature extraction network, and obtain the spatial feature of each unit in the text to be processed output by the feature extraction network.
In this step, word segmentation is first performed on the text to be processed to obtain a plurality of units of the text to be processed; the units are then input into the feature extraction network to obtain the spatial feature corresponding to each unit output by the feature extraction network. The units are the characters or words obtained by word segmentation of the text to be processed.
In one embodiment, the feature extraction network is a modified Inception-v4 network. As shown in Fig. 3, the modified Inception-v4 network is obtained from the original Inception-v4 network by removing its last softmax layer and adding a full convolutional layer. The original Inception-v4 network is mainly used for extracting features of images and classifying the images; in this embodiment, the improved Inception-v4 network is used for extracting spatial features of texts.
Step S230: input the spatial features of each unit in the text to be processed into a label prediction model to obtain the label information of each unit output by the label prediction model.
In this step, the spatial features corresponding to each unit acquired by the feature extraction network are input into a label prediction model, and the label prediction model outputs label information of each unit. The label information includes label information of sensitive words and label information of other words, and the label prediction model includes a trained Bi-GRU model and a trained CRF model.
The specific process of acquiring the label information of each unit is as follows: the spatial features of each unit are input into the trained Bi-GRU model to obtain the prediction label of each unit output by the Bi-GRU model, where the output prediction labels may be O, B-Abuse, I-Abuse and the like; the prediction label of each unit output by the Bi-GRU model is then taken as the input of the trained CRF model to obtain the label information of each unit output by the CRF model.
In one embodiment, the Bi-GRU model has the structure shown in Fig. 4 and includes a plurality of GRU units arranged side by side in two layers stacked one above the other, a fully connected layer connected to the bottom of the GRU units, and a Softmax layer connected to the bottom of the fully connected layer. The fully connected layer and the Softmax layer of the Bi-GRU model are used to obtain the prediction label of each unit, that is, to classify each unit in the text to be processed. The outputs of GRU units stacked one above the other are connected to the same unit, and these stacked GRU units are connected to the same input; the GRU units in the upper layer are connected sequentially from left to right, and the GRU units in the lower layer are connected sequentially from right to left. This connection of the GRU units in the Bi-GRU model realizes the context connection of the text to be processed, both from front to back and from back to front.
As shown in Fig. 5, the GRU unit is controlled by two gating units: an update gate $z_t$ for updating information and a reset gate $r_t$ for resetting information. The module that completes the memory function is mainly the reset gate $r_t$: the information to be forgotten changes as the parameter value of the reset gate $r_t$ changes. Both gates are controlled by a Sigmoid function, and the main calculation formulas, written in the standard GRU form, are as follows:
$$z_t = \sigma(W_z [h_{t-1}, x_t])$$
$$r_t = \sigma(W_r [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
In the formulas, $z_t$, $r_t$, $\tilde{h}_t$ and $h_t$ are respectively the update gate, the reset gate, the information to be forgotten, and the output information at time t, and $W_z$, $W_r$ and $W$ are weight matrices. According to these formulas, the GRU unit can carry context information in the front-to-back direction.
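As a minimal sketch of how such a unit could be computed, the following pure-Python code implements one GRU step in the standard form (update gate, reset gate, candidate state, output) without bias terms, and runs it in both directions as a Bi-GRU would. The function names, the list-based vectors and the zero initial state are assumptions made for illustration, not details taken from the application.

```python
import math


def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-a)) for a in vec]


def matvec(matrix, vec):
    return [sum(w * a for w, a in zip(row, vec)) for row in matrix]


def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step; every weight matrix has shape (hidden, hidden + input)."""
    hx = h_prev + x                                   # concatenation [h_{t-1}, x_t]
    z = sigmoid(matvec(Wz, hx))                       # update gate z_t
    r = sigmoid(matvec(Wr, hx))                       # reset gate r_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W, gated)]  # candidate state
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def bi_gru(xs, params_fwd, params_bwd, hidden):
    """Run one GRU left-to-right and one right-to-left, then concatenate."""
    h, fwd = [0.0] * hidden, []
    for x in xs:
        h = gru_step(h, x, *params_fwd)
        fwd.append(h)
    h, bwd = [0.0] * hidden, []
    for x in reversed(xs):
        h = gru_step(h, x, *params_bwd)
        bwd.append(h)
    bwd.reverse()                                     # realign with input order
    return [f + b for f, b in zip(fwd, bwd)]
```

Each output vector then holds the forward state followed by the backward state for the same unit, which is what a fully connected layer and Softmax layer would consume to produce per-unit label scores.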
In one embodiment, the CRF model is connected after the Bi-GRU model and is used to add constraints to the tags predicted by the Bi-GRU model to ensure that the predicted tags are valid. For example, when the tag information of the first sensitive word in the tags predicted by the Bi-GRU model does not match the set tag, the CRF model may convert the tag information of the first sensitive word into the set tag.
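The effect of such constraints can be illustrated with a simplified decoder. The sketch below is not a trained CRF: it assumes hard BIO constraints (an "I-Abuse" tag may only follow "B-Abuse" or "I-Abuse" and may not open a sequence) in place of learned transition scores, and runs Viterbi decoding over per-unit scores. The tag set comes from the description; the scores and function names are hypothetical.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-Abuse", "I-Abuse"]


def allowed(prev, cur):
    """BIO constraint: I-Abuse may only follow B-Abuse or I-Abuse."""
    if cur == "I-Abuse":
        return prev in ("B-Abuse", "I-Abuse")
    return True


def viterbi(emissions):
    """emissions: one dict (tag -> score) per unit; returns the best valid path."""
    # I-Abuse cannot be the first tag of a sequence.
    score = {t: (NEG_INF if t == "I-Abuse" else emissions[0][t]) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            best_prev, best = None, NEG_INF
            for prev in TAGS:
                if allowed(prev, cur) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[cur] = best + em[cur] if best_prev is not None else NEG_INF
            ptr[cur] = best_prev
        score = new_score
        back.append(ptr)
    last = max(TAGS, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):          # follow back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

If the per-unit argmax would yield the invalid sequence "O, I-Abuse, I-Abuse", this decoder instead returns the best sequence that satisfies the constraints, for example "O, B-Abuse, I-Abuse".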
Step S240: determine the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.
In this step, it is judged whether the tag information of each unit in the text to be processed is the labeling information of a sensitive vocabulary; if so, the text to be processed contains sensitive vocabulary, namely the units carrying that labeling information. The labeling information of the sensitive vocabulary may be "B-Abuse" and "I-Abuse".
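A small sketch of this step, assuming the units and their tags are available as two parallel lists: consecutive units tagged "B-Abuse" and "I-Abuse" are grouped into one sensitive word. The function name, the `sep` parameter and the English sample units are hypothetical.

```python
def extract_sensitive(units, tags, label="Abuse", sep=""):
    """Group units tagged B-<label>/I-<label> into sensitive words."""
    words, current = [], []
    for unit, tag in zip(units, tags):
        if tag == f"B-{label}":
            if current:                      # close the previous word
                words.append(sep.join(current))
            current = [unit]
        elif tag == f"I-{label}" and current:
            current.append(unit)             # continue the open word
        else:
            if current:
                words.append(sep.join(current))
            current = []
    if current:
        words.append(sep.join(current))
    return words


units = ["he", "is", "a", "bad", "word"]
tags = ["O", "O", "O", "B-Abuse", "I-Abuse"]
sensitive = extract_sensitive(units, tags, sep=" ")
```

An empty result means that no unit carries a sensitive label, so the text contains no sensitive vocabulary.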
When a unit in the text to be processed is a sensitive word, the unit is replaced with the designated character, thereby realizing data desensitization. The designated character may be "".
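The replacement can be sketched as below. The designated character is not legible in the text above, so `"*"` is used purely as a hypothetical placeholder; the function name and the choice to mask one character per original character are likewise assumptions.

```python
def desensitize(units, tags, mask_char="*", sep=""):
    """Replace every unit whose tag is not 'O' with repeated mask characters."""
    out = []
    for unit, tag in zip(units, tags):
        if tag != "O":
            out.append(mask_char * len(unit))   # hypothetical: same length as the unit
        else:
            out.append(unit)
    return sep.join(out)


masked = desensitize(
    ["he", "is", "a", "bad", "word"],
    ["O", "O", "O", "B-Abuse", "I-Abuse"],
    sep=" ",
)
```

Keeping the masked span the same length as the original preserves the layout of the text; a fixed-length mask would hide the word length as well, at the cost of changing offsets.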
In an embodiment, as shown in Fig. 6, before step S210 the method further includes steps S610 to S640.
Step S610: acquire a training text set.
In this step, a training text set is obtained for training to obtain a feature extraction network and a label prediction model. Wherein the set of training texts can be crawled from a website.
Step S620: perform word segmentation on each training text in the training text set using a word segmentation tool.
In this step, the obtained training text set is processed: a word segmentation tool, which may be the Jieba tool, is used to segment each training text in the training text set. For example, the training text "I like grass houses on that mountain" becomes "I/like/that/mountain/grass houses" after being segmented by the Jieba tool.
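To make the segmentation step concrete without depending on an external package, the sketch below uses forward maximum matching against a toy dictionary. This is a deliberately simpler technique swapped in for illustration and is not how Jieba works internally; with Jieba installed, a call such as `jieba.lcut(text)` would play the same role. The vocabulary and sample string are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy dictionary segmentation: take the longest known piece at each position."""
    units, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                units.append(piece)
                i += size
                break
    return units


units = forward_max_match("abcdex", {"ab", "cde"})
```

The resulting list of units is what would be passed on to the feature extraction network, one unit per position.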
Step S630: acquire the labeling information of the sensitive vocabulary and the labeling information of other vocabulary in each training text.
In this step, the labeling information of the sensitive vocabulary and of the other vocabulary in each training text is obtained, where the labeling information may be produced by BIO coding. When BIO coding is used, each training text corresponds to an entity; the first sensitive word of each entity may be labeled "B-Abuse", the other sensitive words may be labeled "I-Abuse", and other irrelevant words may be labeled "O". For example, the training text "he is a mixed egg" is labeled "O, O, O, O, B-Abuse, I-Abuse" after BIO coding.
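BIO coding of a training text can be sketched as follows, assuming the annotator supplies the sensitive spans as index ranges over the units. The span representation, the function name and the placeholder units are assumptions; only the tag names come from the description.

```python
def bio_encode(units, sensitive_spans, label="Abuse"):
    """sensitive_spans: list of (start, end) unit indices, end exclusive."""
    tags = ["O"] * len(units)
    for start, end in sensitive_spans:
        tags[start] = f"B-{label}"           # first unit of the sensitive span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining units of the span
    return tags


# Six placeholder units with the last two forming one sensitive span.
tags = bio_encode(["u1", "u2", "u3", "u4", "u5", "u6"], [(4, 6)])
```

For a six-unit text whose last two units are sensitive, this reproduces the tag sequence given in the example above.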
Step S640: train the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.
In this step, a feature extraction network and a label prediction model are constructed, each training text is set as an input parameter x, and the result output for each training text by the feature extraction network and the label prediction model is set as an output parameter f(x). Each training text is fed into the feature extraction network and the label prediction model for iterative training, and the parameters are adjusted until the error between the output parameter f(x) and the labeling information of each vocabulary in the training text x is less than a threshold value, so that a reasonably fitted feature extraction network and label prediction model are obtained. The threshold value can be set according to specific conditions.
In one embodiment, in the training process of the feature extraction network and the label prediction model, each training text is used as the input of the improved Inception-v4 network, the output of the improved Inception-v4 network is used as the input of the Bi-GRU model, and the output of the Bi-GRU model is used as the input of the CRF model. The parameters of the Inception-v4 network, the Bi-GRU model and the CRF model are adjusted until the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, whereby the feature extraction network is obtained from the trained improved Inception-v4 network and the label prediction model is obtained from the trained Bi-GRU model and CRF model.
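The stopping rule described here (iterate and adjust parameters until the error falls below a threshold) can be illustrated with a toy model. The sketch below swaps in a one-feature logistic classifier trained by gradient descent in place of the Inception-v4, Bi-GRU and CRF stack, so only the loop structure carries over; the data, learning rate, threshold and function name are all hypothetical.

```python
import math


def train_until_threshold(xs, ys, threshold=0.1, lr=0.5, max_epochs=1000):
    """Gradient descent on cross-entropy; stops once the loss is below threshold."""
    w, b = 0.0, 0.0
    loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        # Forward pass: predicted probability of the sensitive class.
        ps = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(ps, ys)
        ) / len(xs)
        if loss < threshold:                 # error below threshold: stop training
            return w, b, loss, epoch
        # Backward pass: gradients of the mean cross-entropy.
        gw = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / len(xs)
        gb = sum(p - y for p, y in zip(ps, ys)) / len(xs)
        w -= lr * gw
        b -= lr * gb
    return w, b, loss, max_epochs


w, b, loss, epochs = train_until_threshold([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

A `max_epochs` cap is included because a threshold alone gives no guarantee of termination when the data cannot be fitted that closely.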
Fig. 7 is a schematic structural diagram of sensitive data identification according to an embodiment of the present application. As shown in Fig. 7, the text to be processed is "he is a mixed egg", which becomes "he/is/one/mixed egg" after word segmentation by the Jieba tool. The segmented text is input into the improved Inception-v4 network to obtain the spatial features of each unit of the text. The spatial features of each unit are input into the Bi-GRU model of the label prediction model to obtain the prediction label of each word in "he is a mixed egg", and the prediction label of each word is input into the CRF model to obtain the label information of each unit, namely "he: O/is: O/one: O/mixed egg: B-Abuse I-Abuse". According to the label information, "mixed egg" is a sensitive word and can be replaced with the designated character to obtain the desensitized data.
According to the identification method of the sensitive data, the spatial features of each unit of the text to be processed are extracted through the feature extraction network, the spatial features are then input into the label prediction model to obtain the label information of each unit of the text to be processed, and the sensitive words of the text to be processed are determined according to the label information, so that accurate identification of the sensitive data is realized.
The following are embodiments of the apparatus of the present application that may be used to perform the above-described embodiments of the method for sensitive data identification of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to embodiments of the method for sensitive data identification of the present application.
Fig. 8 is a block diagram illustrating an apparatus for sensitive data identification according to an embodiment of the present application. As shown in Fig. 8, the apparatus includes a text acquisition module 810, a feature extraction module 820, a tag information module 830 and a sensitive vocabulary module 840.
A text obtaining module 810, configured to obtain a text to be processed;
a feature extraction module 820, configured to input the to-be-processed text into a feature extraction network, and obtain a spatial feature of each unit in the to-be-processed text output by the feature extraction network;
a tag information module 830, configured to input the spatial feature of each unit in the text to be processed into a tag prediction model, and obtain tag information of each unit output by the tag prediction model;
and the sensitive vocabulary module 840 is used for determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed.
For the implementation of the functions and actions of each module in the apparatus, see the implementation of the corresponding step in the sensitive data identification method described above; details are not repeated here.
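The four-module structure of Fig. 8 can be sketched in Python as a small class that chains four pluggable callables. This is only an illustrative arrangement under stated assumptions: the class name `SensitiveDataPipeline`, the field names and the toy stand-in functions are hypothetical, and the stand-ins replace the trained feature extraction network and tag prediction model with trivial rules so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Feature = List[float]


@dataclass
class SensitiveDataPipeline:
    # Each field plays the role of one module in Fig. 8 (names are hypothetical).
    acquire_text: Callable[[], str]                                 # module 810
    extract_features: Callable[[str], List[Tuple[str, Feature]]]    # module 820
    predict_labels: Callable[[List[Feature]], List[str]]            # module 830
    select_sensitive: Callable[[List[str], List[str]], List[str]]   # module 840

    def run(self) -> List[str]:
        text = self.acquire_text()
        pairs = self.extract_features(text)          # (unit, feature) per unit
        units = [unit for unit, _ in pairs]
        labels = self.predict_labels([feat for _, feat in pairs])
        return self.select_sensitive(units, labels)


# Toy stand-ins, NOT the trained models: the "feature" is the unit length and
# the "label model" flags any unit longer than three characters.
pipeline = SensitiveDataPipeline(
    acquire_text=lambda: "he is one mixed-egg",
    extract_features=lambda text: [(u, [float(len(u))]) for u in text.split()],
    predict_labels=lambda feats: ["B-Abuse" if f[0] > 3 else "0" for f in feats],
    select_sensitive=lambda units, labels: [u for u, l in zip(units, labels) if l != "0"],
)

result = pipeline.run()
print(result)
```

Keeping each module behind a plain callable mirrors the statement that the modules may be integrated or exist separately: any one of them can be swapped without touching the others.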
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Claims (10)

1. A method of sensitive data identification, comprising:
acquiring a text to be processed;
inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;
inputting the spatial characteristics of each unit in the text to be processed into a label prediction model to obtain label information of each unit output by the label prediction model;
and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.
2. The method according to claim 1, wherein after determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed, the method further comprises:
and replacing the sensitive vocabulary in the text to be processed by using the specified characters to obtain desensitization data.
3. The method according to claim 1, wherein the inputting the text to be processed into a feature extraction network, and obtaining the spatial feature of each unit in the text to be processed output by the feature extraction network, comprises:
performing word segmentation operation on the text to be processed to obtain a plurality of units;
and inputting each unit of the text to be processed into a feature extraction network, and obtaining the spatial feature corresponding to each unit output by the feature extraction network.
4. The method of claim 3, wherein the feature extraction network is obtained by modifying an Inception-v4 network, the modification comprising removing the softmax layer of the Inception-v4 network and adding a full convolutional layer.
5. The method according to claim 1, wherein the inputting the spatial feature of each unit in the text to be processed into a tag prediction model, and obtaining the tag information of each unit output by the tag prediction model, comprises:
inputting the spatial characteristics of each unit into the trained Bi-GRU model to obtain a prediction label of each unit output by the Bi-GRU model;
and taking the prediction label of each unit output by the Bi-GRU model as the input of the CRF model which is trained, and obtaining the label information of each unit output by the CRF model.
6. The method of claim 1, wherein prior to said obtaining text to be processed, the method further comprises:
acquiring a training text set;
performing word segmentation processing on each training text in the training text set by adopting a word segmentation tool;
acquiring labeling information of sensitive words and phrases in each training text and labeling information of other words and phrases;
and training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.
7. The method of claim 6, wherein the training of the feature extraction network and the label prediction model according to the labeled information of the sensitive words and the labeled information of other words in each training text comprises:
taking each training text as the input of the improved Inception-v4 network, taking the output of the improved Inception-v4 network as the input of the Bi-GRU model, taking the output of the Bi-GRU model as the input of the CRF model, and adjusting the parameters of the Inception-v4 network, the Bi-GRU model and the CRF model so that the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, thereby obtaining the feature extraction network from the trained improved Inception-v4 network and obtaining the label prediction model from the trained Bi-GRU model and CRF model.
8. An apparatus for sensitive data recognition, comprising:
the text acquisition module is used for acquiring a text to be processed;
the feature extraction module is used for inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;
the label information module is used for inputting the spatial characteristics of each unit in the text to be processed into a label prediction model and obtaining the label information of each unit output by the label prediction model;
and the sensitive vocabulary module is used for determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of sensitive data identification of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of sensitive data identification of any of claims 1-7.
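For illustration of the Bi-GRU and CRF stage recited in claim 5: the claims do not state how the CRF model turns per-unit predictions into the final label sequence. One common approach for a linear-chain CRF is Viterbi decoding over per-unit scores plus label-to-label transition scores; the Python sketch below assumes that approach. The function name `viterbi`, all score values and the single transition penalty are hypothetical, and the label "0" follows the notation used in the description.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y]     : score of label y at position t (e.g. from a Bi-GRU)
    transitions[(p, y)] : score of moving from label p to label y (default 0.0)
    """
    best = [{y: emissions[0][y] for y in labels}]   # best path score per end label
    back: List[Dict[str, str]] = []                 # back-pointers per position
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):                 # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["0", "B-Abuse", "I-Abuse"]
# Hypothetical per-unit scores for a three-unit text.
emissions = [
    {"0": 2.0, "B-Abuse": 0.1, "I-Abuse": 0.0},
    {"0": 0.5, "B-Abuse": 0.6, "I-Abuse": 1.0},
    {"0": 0.2, "B-Abuse": 0.1, "I-Abuse": 1.5},
]
# Hypothetical constraint: an I- label should not directly follow "0".
transitions = {("0", "I-Abuse"): -10.0}

decoded = viterbi(emissions, transitions, labels)
greedy = [max(labels, key=lambda y: e[y]) for e in emissions]
print(decoded)
print(greedy)
```

In this toy input, picking the best label independently at each position yields an I-Abuse label straight after "0", while decoding with the transition penalty yields a sequence that opens the sensitive span with B-Abuse. This is the kind of sequence-level correction a CRF layer is typically added for.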
CN202111312004.XA 2021-11-08 2021-11-08 Sensitive data identification method and device, electronic equipment and storage medium Pending CN114036940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111312004.XA CN114036940A (en) 2021-11-08 2021-11-08 Sensitive data identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114036940A (en) 2022-02-11

Family

ID=80143151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111312004.XA Pending CN114036940A (en) 2021-11-08 2021-11-08 Sensitive data identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114036940A (en)

Similar Documents

Publication Publication Date Title
CN109522557B (en) Training method and device of text relation extraction model and readable storage medium
CN105354307B (en) Image content identification method and device
CN109471944B (en) Training method and device of text classification model and readable storage medium
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
EP2812883B1 (en) System and method for semantically annotating images
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN106030568B (en) Natural language processing system, natural language processing method and natural language processing program
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112861842A (en) Case text recognition method based on OCR and electronic equipment
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
US11966455B2 (en) Text partitioning method, text classifying method, apparatus, device and storage medium
CN114385812A (en) Relation extraction method and system for text
CN111444906B (en) Image recognition method and related device based on artificial intelligence
CN110929647B (en) Text detection method, device, equipment and storage medium
US20230109073A1 (en) Extraction of genealogy data from obituaries
CN115984886A (en) Table information extraction method, device, equipment and storage medium
CN115952800A (en) Named entity recognition method and device, computer equipment and readable storage medium
US20230177251A1 (en) Method, device, and system for analyzing unstructured document
CN116151258A (en) Text disambiguation method, electronic device and storage medium
CN114036940A (en) Sensitive data identification method and device, electronic equipment and storage medium
CN114090781A (en) Text data-based repulsion event detection method and device
CN114139658A (en) Method for training classification model and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination