CN114036940A - Sensitive data identification method and device, electronic equipment and storage medium

Info

Publication number: CN114036940A
Application number: CN202111312004.XA
Authority: CN
Inventors: 张黎; 石桂红; 余海波; 陈广辉; 刘维炜
Original assignee: Flash It Co ltd
Current assignee: Flash It Co ltd
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-11

Abstract

The application provides a sensitive data identification method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a text to be processed; inputting the text to be processed into a feature extraction network to obtain the spatial feature of each unit in the text to be processed output by the feature extraction network; inputting the spatial feature of each unit in the text to be processed into a label prediction model to obtain the label information of each unit output by the label prediction model; and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed. In this scheme, the label information of the text is obtained through the feature extraction network and the label prediction model, so that sensitive words are recognized effectively.

Description

Sensitive data identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying sensitive data, an electronic device, and a computer-readable storage medium.

Background

With the rapid development of computer technology in the twenty-first century and the arrival of the big data era, the explosive growth of information has brought a number of unavoidable problems, such as the presence of many prohibited words in text data, including abusive or politically sensitive statements. How to identify such abusive words or politically sensitive statements is a problem that remains to be solved.

Traditional big data security mainly relies on rules, dedicated algorithms, keywords and the like to identify sensitive data in a text. For example, according to the technical security requirements for national shared data, the sensitive data in a text includes IP addresses, MAC addresses, IPv6 addresses, mobile phone numbers, bank card numbers, addresses, names and the like. Data with a regular format, such as an IP address, is detected with a regular expression, and other sensitive data, such as a bank card number or an identity card number, can be detected with a dedicated algorithm.
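The rule-based detection described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function names `find_ipv4` and `luhn_valid` are hypothetical, and the Luhn checksum is assumed here as one example of an algorithmic check for bank card numbers, since the text does not name a specific algorithm.

```python
import re

# One IPv4 octet: 0-255 without accepting values such as 256 or 999.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Four dot-separated octets that are not part of a longer number.
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?!\d)")


def find_ipv4(text):
    """Return every substring of text that looks like an IPv4 address."""
    return IPV4_RE.findall(text)


def luhn_valid(number):
    """Luhn checksum (assumed example of an algorithmic card-number check)."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Such checks work well for data with a fixed format, which is why they do not carry over to names or addresses.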

However, ambiguous words and sentences such as addresses and names are not detected well by traditional algorithms. For example, name detection usually involves writing all surnames into a JSON file and checking whether the first character or the first two characters of a word appear in the surname file; if so, the word is taken to be a name. Such detection is inaccurate and has two main disadvantages: 1. China has a large population and its surnames are widely distributed, and the surnames of ethnic minorities in particular are highly varied, so the JSON file cannot contain every surname; 2. the names detected in this way are not disambiguated.
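Both disadvantages can be reproduced with a small sketch of the surname-file approach. The surname set and the function name `looks_like_name` are hypothetical; the set is deliberately incomplete to mirror the first disadvantage.

```python
# Hypothetical, deliberately incomplete surname list (stand-in for the JSON file).
SURNAMES = {"张", "王", "李", "欧阳"}


def looks_like_name(word):
    """Treat a word as a name if its first one or two characters are a listed surname."""
    return word[:2] in SURNAMES or word[:1] in SURNAMES


print(looks_like_name("张黎"))    # a real name with a listed surname
print(looks_like_name("张开"))    # an ordinary verb ("to open") that starts with a surname
print(looks_like_name("石桂红"))  # a real name whose surname is not in the list
```

The second call is a false positive caused by the lack of disambiguation, and the third is a miss caused by the incomplete list.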

Disclosure of Invention

The embodiment of the application provides a sensitive data identification method, which is used for solving the problem that the traditional algorithm cannot accurately identify sensitive data.

The embodiment of the application provides a method for identifying sensitive data, which comprises the following steps:

acquiring a text to be processed;

inputting the text to be processed into a feature extraction network to obtain the spatial feature of each unit in the text to be processed output by the feature extraction network;

inputting the spatial characteristics of each unit in the text to be processed into a label prediction model to obtain label information of each unit output by the label prediction model;

and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.

In an embodiment, after the determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed, the method further includes:

and replacing the sensitive vocabulary in the text to be processed by using the specified characters to obtain desensitization data.

In an embodiment, the inputting a text to be processed into a feature extraction network to obtain a spatial feature of each unit in the text to be processed output by the feature extraction network includes:

performing word segmentation operation on the text to be processed to obtain a plurality of units;

and inputting each unit of the text to be processed into a feature extraction network, and obtaining the spatial feature corresponding to each unit output by the feature extraction network.

In one embodiment, the feature extraction network is obtained by modifying the Inception-v4 network: the softmax layer of the Inception-v4 network is removed and a fully convolutional layer is added.

In an embodiment, the inputting the spatial feature of each unit in the text to be processed into a tag prediction model to obtain tag information of each unit output by the tag prediction model includes:

inputting the spatial characteristics of each unit into the trained Bi-GRU model to obtain a prediction label of each unit output by the Bi-GRU model;

and taking the prediction label of each unit output by the Bi-GRU model as the input of the trained CRF model, and obtaining the label information of each unit output by the CRF model.

In an embodiment, before the obtaining the text to be processed, the method further includes:

acquiring a training text set;

performing word segmentation processing on each training text in the training text set by adopting a word segmentation tool;

acquiring labeling information of sensitive words and phrases in each training text and labeling information of other words and phrases;

and training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.

In an embodiment, the training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text includes:

taking each training text as the input of the improved Inception-v4 network, taking the output of the improved Inception-v4 network as the input of the Bi-GRU model, taking the output of the Bi-GRU model as the input of the CRF model, and adjusting the parameters of the Inception-v4 network, the Bi-GRU model and the CRF model until the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, thereby obtaining the feature extraction network from the trained improved Inception-v4 network and the label prediction model from the trained Bi-GRU model and CRF model.

The embodiment of the present application further provides a device for identifying sensitive data, including:

the text acquisition module is used for acquiring a text to be processed;

the feature extraction module is used for inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;

the label information module is used for inputting the spatial characteristics of each unit in the text to be processed into a label prediction model and obtaining the label information of each unit output by the label prediction model;

and the sensitive vocabulary module is used for determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.

An embodiment of the present application further provides an electronic device, where the electronic device includes:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform any one of the above-described methods of sensitive data identification.

The embodiment of the application also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program can be executed by a processor to execute any one of the above methods for sensitive data identification.

According to the technical scheme provided by the embodiment of the application, the spatial feature of each unit of the text to be processed is extracted through the feature extraction network, the spatial features are then input into the label prediction model to obtain the label information of each unit of the text to be processed, and the sensitive words of the text to be processed are determined according to the label information, so that sensitive data is identified accurately.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a method for identifying sensitive data according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an improved Inception-v4 network according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an improved Bi-GRU model according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a GRU unit according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of training the feature extraction network and the label prediction model according to an embodiment of the present application;

Fig. 7 is a block diagram illustrating sensitive data identification according to an embodiment of the present application;

Fig. 8 is a block diagram of an apparatus for sensitive data identification according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

The method for identifying sensitive data provided by the embodiment of the application can be applied to the following scenario: a text to be processed is acquired by web crawler technology; after the text to be processed is segmented by the Jieba tool, the segmented text is input into a trained improved Inception-v4 network to obtain the spatial feature of each unit in the text to be processed; the spatial feature of each unit is input into the trained Bi-GRU model and CRF model to obtain the label information of each unit; and the sensitive vocabulary of the text to be processed is determined according to the label information and replaced with specific characters, so that the sensitive data is accurately identified and the data is desensitized.

Fig. 1 shows an electronic device 1 according to an embodiment of the present application, where the electronic device 1 includes: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11 and the instructions are executed by the processor 11. The processor 11 is configured to execute the method for identifying sensitive data provided in the embodiment of the present application.

The processor 11 may be a device comprising a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, may process data for other components in the electronic device 1, and may also control other components in the electronic device 1 to perform desired functions.

Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer readable storage medium and executed by processor 11 to implement the sensitive data identification method described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.

The components and structures of the electronic device 1 shown in fig. 1 are exemplary only, and not limiting, and the electronic device 1 may have other components and structures according to needs.

In an embodiment, the example electronic device 1 for implementing the method for sensitive data identification of the embodiment of the present application may be implemented as a smart device such as a smart phone, a tablet computer, a desktop computer, a notebook computer, and a vehicle-mounted terminal.

Fig. 2 is a schematic flowchart of a method for identifying sensitive data according to an embodiment of the present application. As shown in fig. 2, the method may be performed by the electronic device 1 shown in fig. 1 to realize the sensitive data identification, and the method includes the following steps S210-S240.

Step S210: acquiring a text to be processed.

In this step, a text to be processed is obtained. The text to be processed may be a data text crawled from a website, such as Baidu, Mei Tuo, Taobao, Xinlang, and so on.

Step S220: inputting the text to be processed into a feature extraction network, and obtaining the spatial feature of each unit in the text to be processed output by the feature extraction network.

In this step, word segmentation is first performed on the text to be processed to obtain a plurality of units of the text to be processed, and the plurality of units are then input into the feature extraction network to obtain the spatial feature corresponding to each unit output by the feature extraction network, wherein the units are the characters or words obtained by segmenting the text to be processed.
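The segmentation into units can be sketched with a forward maximum matching segmenter. This is a simplified stand-in, not the Jieba tool used in the embodiments; the function name `segment`, the toy vocabulary, and the Chinese sentence (an assumed back-translation of the "he is a mixed egg" example used later in the text) are illustrative assumptions.

```python
def segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary word.

    Falls back to a single character when no vocabulary word matches, so the
    resulting units are characters or words.
    """
    units, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                units.append(piece)
                i += size
                break
    return units


# Toy vocabulary (hypothetical); a real segmenter ships a large dictionary.
print(segment("他是一个混蛋", {"一个", "混蛋"}))
```

With this vocabulary the sentence is split into four units, which corresponds to the "he/is/one/mixed egg" segmentation.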

In one embodiment, the feature extraction network is an improved Inception-v4 network. As shown in fig. 3, the improved Inception-v4 network is obtained from the original Inception-v4 network by removing its last softmax layer and adding a fully convolutional layer. The original Inception-v4 network is mainly used to extract image features and classify images; in one embodiment, the improved Inception-v4 network is used to extract the spatial features of text.
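The idea of replacing the classification head with a convolutional output can be illustrated with a toy, Inception-style block over a sequence. This is only a sketch under strong simplifying assumptions: it is not the Inception-v4 architecture, each unit is represented by a single number, and the names `conv1d_same` and `inception_style_features` as well as all weights are hypothetical.

```python
def conv1d_same(seq, kernel):
    """1-D convolution (cross-correlation) with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(seq):
                acc += w * seq[j]
        out.append(acc)
    return out


def inception_style_features(seq, kernels, mix):
    """Parallel convolution branches followed by a position-wise (1x1) convolution.

    There is no softmax at the end, so the output is one feature vector per unit
    rather than one class distribution for the whole input.
    """
    branches = [conv1d_same(seq, k) for k in kernels]  # one output sequence per branch
    stacked = list(zip(*branches))                     # per-unit concatenation of branch outputs
    return [[sum(w * c for w, c in zip(row, chans)) for row in mix] for chans in stacked]


features = inception_style_features(
    [1.0, 2.0, 3.0],                  # one toy value per unit
    kernels=[[1.0], [1.0, 1.0, 1.0]],  # branch widths 1 and 3
    mix=[[1.0, 0.0], [0.0, 1.0]],      # identity 1x1 convolution
)
print(features)
```

Because every layer is convolutional, the number of output vectors follows the number of input units, which is what a per-unit label predictor needs.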

Step S230: inputting the spatial feature of each unit in the text to be processed into a label prediction model to obtain the label information of each unit output by the label prediction model.

In this step, the spatial features corresponding to each unit acquired by the feature extraction network are input into a label prediction model, and the label prediction model outputs label information of each unit, wherein the label information includes label information of sensitive words and label information of other words, and the label prediction model includes a trained Bi-GRU model and a CRF model.

The specific process of acquiring the label information of each unit is as follows: inputting the spatial characteristics of each unit into the trained Bi-GRU model to obtain the prediction label of each unit output by the Bi-GRU model, wherein the output prediction labels can be O, B-Abuse, I-Abuse and the like; and taking the prediction label of each unit output by the Bi-GRU model as the input of the CRF model after training, and obtaining the label information of each unit output by the CRF model.

In one embodiment, the Bi-GRU model has the structure shown in fig. 4. It includes a plurality of GRU units arranged side by side in two vertically stacked rows, a fully connected layer connected to the bottom of the GRU units, and a Softmax layer connected to the bottom of the fully connected layer. In an embodiment, the fully connected layer and the Softmax layer of the Bi-GRU model are used to obtain the prediction label of each unit, that is, to classify each unit in the text to be processed. Two GRU units stacked vertically receive the same input, and their outputs are connected to the same unit. The GRU units in the upper row are connected in sequence from left to right, and the GRU units in the lower row are connected in sequence from right to left. These connections between the GRU units give the Bi-GRU model the context of the text to be processed in both directions, from front to back and from back to front.

As shown in fig. 5, the GRU unit is controlled by two gate units: an update gate $z_t$ for updating information and a reset gate $r_t$ for resetting information. The module that performs the memory function is mainly the reset gate $r_t$: the information to be forgotten, $\tilde{h}_t$, changes as the parameter value of the reset gate $r_t$ changes. Both gates are controlled by a Sigmoid function, and the main calculation formulas are as follows:

$$z_t = \sigma(W_z [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r [h_{t-1}, x_t])$$

In the formulas, $z_t$, $r_t$, $\tilde{h}_t$ and $h_t$ are respectively the update gate, the reset gate, the information to be forgotten, and the output information at time $t$, and $W_z$, $W_r$ and $W$ are weight matrices. According to these formulas, the GRU unit captures context information in the front-to-back direction.
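The gate formulas and the two-direction connection can be sketched as follows. Only the two gate formulas come from the text; the candidate-state and output formulas used here are the commonly cited GRU formulation and are an assumption, as are the scalar hidden state and the names `gru_step` and `bi_gru`.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(h_prev, x, p):
    """One GRU step with a scalar state; p maps 'z', 'r', 'h' to weights (w_h, w_x)."""
    z = sigmoid(p["z"][0] * h_prev + p["z"][1] * x)  # update gate z_t
    r = sigmoid(p["r"][0] * h_prev + p["r"][1] * x)  # reset gate r_t
    # Assumed standard form: candidate state built from the reset-scaled previous state.
    h_cand = math.tanh(p["h"][0] * (r * h_prev) + p["h"][1] * x)
    # Assumed standard form: output mixes the previous state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


def bi_gru(xs, p_fwd, p_bwd):
    """Run one GRU left to right and another right to left; pair the states per unit."""
    fwd, h = [], 0.0
    for x in xs:
        h = gru_step(h, x, p_fwd)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = gru_step(h, x, p_bwd)
        bwd.append(h)
    bwd.reverse()  # align backward states with the original unit order
    return list(zip(fwd, bwd))
```

Each unit ends up with one state that has seen everything to its left and one that has seen everything to its right, which is the two-direction context described for fig. 4.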

In one embodiment, the CRF model is connected after the Bi-GRU model and is used to add constraints to the labels predicted by the Bi-GRU model, so as to ensure that the predicted labels are valid. For example, when the label information of the first sensitive word among the labels predicted by the Bi-GRU model does not match the set label, the CRF model may convert the label information of the first sensitive word into the set label.
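The effect of such constraints can be sketched with Viterbi decoding over per-unit scores. This is a simplification: a trained CRF learns transition scores, whereas the sketch uses hard rules (an "I-Abuse" label may only follow "B-Abuse" or "I-Abuse"), and the names `allowed` and `viterbi` and the example scores are hypothetical.

```python
TAGS = ["O", "B-Abuse", "I-Abuse"]
NEG_INF = float("-inf")


def allowed(prev, cur):
    """BIO rule: I-Abuse may only continue an entity started by B-Abuse."""
    return cur != "I-Abuse" or prev in ("B-Abuse", "I-Abuse")


def viterbi(emissions):
    """Return the highest-scoring valid tag sequence.

    emissions is a non-empty list with one {tag: score} dict per unit.
    """
    # An entity cannot start with I-Abuse, so that tag is excluded for the first unit.
    score = {t: (NEG_INF if t == "I-Abuse" else emissions[0][t]) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            best, prev = max((score[p], p) for p in TAGS if allowed(p, cur))
            new_score[cur] = best + em[cur]
            ptr[cur] = prev
        score = new_score
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: score[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


scores = [
    {"O": 2.0, "B-Abuse": 1.0, "I-Abuse": 0.0},
    {"O": 0.0, "B-Abuse": 0.5, "I-Abuse": 3.0},
]
print(viterbi(scores))
```

Picking the best label per unit independently would give "O" followed by "I-Abuse", which is not a valid sequence; the constrained decoding converts the first label to "B-Abuse".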

Step S240: determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.

In this step, whether the tag information of each unit in the text to be processed is the labeling information of the sensitive vocabulary is judged, if yes, the text to be processed contains the sensitive vocabulary, and the sensitive vocabulary is the unit containing the labeling information of the sensitive vocabulary. Wherein, the labeling information of the sensitive vocabulary can be 'B-Abuse' and 'I-Abuse'.

When a unit in the text to be processed is a sensitive word, the unit is replaced by the designated character, so that data desensitization is realized. Wherein the designated character may be "".
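Steps of this kind can be sketched as two small functions: one collects the sensitive words from the labels, the other masks them. The function names are hypothetical, and the mask character "*" is an assumption chosen for illustration.

```python
SENSITIVE_TAGS = ("B-Abuse", "I-Abuse")


def sensitive_words(units, tags):
    """Join consecutive B-Abuse / I-Abuse units into sensitive words."""
    words, current = [], []
    for unit, tag in zip(units, tags):
        if tag == "B-Abuse":
            if current:
                words.append("".join(current))
            current = [unit]
        elif tag == "I-Abuse" and current:
            current.append(unit)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def desensitize(units, tags, mask_char="*"):
    """Replace every sensitive unit with mask characters of the same length."""
    return "".join(
        mask_char * len(unit) if tag in SENSITIVE_TAGS else unit
        for unit, tag in zip(units, tags)
    )


units = ["他", "是", "一", "个", "混", "蛋"]
tags = ["O", "O", "O", "O", "B-Abuse", "I-Abuse"]
print(sensitive_words(units, tags))
print(desensitize(units, tags))
```

Masking by length keeps the text the same size as the original, which may or may not be desirable depending on how much the length itself reveals.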

In an embodiment, as shown in fig. 6, before step S210 the method further includes steps S610 to S640.

Step S610: acquiring a training text set.

In this step, a training text set is obtained for training to obtain a feature extraction network and a label prediction model. Wherein the set of training texts can be crawled from a website.

Step S620: performing word segmentation processing on each training text in the training text set by adopting a word segmentation tool.

In this step, a word segmentation tool is used to segment each training text in the obtained training text set, wherein the word segmentation tool may be the Jieba tool. For example, the training text "I like the grass houses on that mountain" becomes "I/like/that/mountain/grass houses" after being segmented by the Jieba tool.

Step S630: acquiring the labeling information of the sensitive words and the labeling information of other words in each training text.

In this step, the labeling information of the sensitive vocabulary and the labeling information of the other vocabulary in each training text are obtained, wherein the labeling information of the vocabulary of each training text can be obtained by BIO coding. When BIO coding is employed, each training text corresponds to an entity; the first sensitive word of each entity may be labeled "B-Abuse", the other sensitive words may be labeled "I-Abuse", and the other, irrelevant words may be labeled "O". For example, the training text "he is a mixed egg" is labeled "O, O, O, O, B-Abuse, I-Abuse" after BIO coding.
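BIO labeling of a training text can be sketched as follows, assuming the annotator supplies the sensitive terms. The function name `bio_labels` is hypothetical, labels are assigned per character, and the Chinese sentence is an assumed back-translation of the "he is a mixed egg" example.

```python
def bio_labels(text, sensitive_terms, label="Abuse"):
    """Label each character of text with O, B-<label> or I-<label>."""
    tags = ["O"] * len(text)
    for term in sensitive_terms:
        start = text.find(term)
        while term and start != -1:
            tags[start] = f"B-{label}"               # first character of the entity
            for i in range(start + 1, start + len(term)):
                tags[i] = f"I-{label}"               # remaining characters of the entity
            start = text.find(term, start + len(term))
    return tags


print(bio_labels("他是一个混蛋", ["混蛋"]))
```

The six characters of the example receive four "O" labels followed by "B-Abuse" and "I-Abuse", in line with the labeling given in the text.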

Step S640: training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.

In this step, a feature extraction network and a label prediction model are constructed, each training text is taken as an input parameter x, and the result output for each training text by the feature extraction network and the label prediction model is taken as an output parameter f(x). Each training text is fed into the feature extraction network and the label prediction model for iterative training, and the parameters are adjusted until the error between the output parameter f(x) and the labeling information of each vocabulary in the training text x is less than a threshold value, so that a reasonably fitted feature extraction network and label prediction model are obtained, wherein the threshold value can be set according to specific conditions.
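The stopping rule, iterating until the error falls below a threshold, can be sketched on a deliberately tiny model. The one-parameter model f(x) = w * x, the mean squared error, the learning rate, and the name `train_until` are all assumptions for illustration and stand in for the joint training of the three networks.

```python
def train_until(xs, ys, threshold=1e-4, lr=0.05, max_steps=10_000):
    """Gradient descent on f(x) = w * x until the mean squared error is below threshold."""
    w = 0.0
    for step in range(max_steps):
        preds = [w * x for x in xs]
        error = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if error < threshold:
            return w, error, step
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad
    raise RuntimeError("error threshold not reached within max_steps")


w, error, steps = train_until([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w, error, steps)
```

A looser threshold stops earlier with a rougher fit, which is why the text leaves the threshold to be set according to the specific conditions.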

In one embodiment, in the training process of the feature extraction network and the label prediction model, each training text is used as the input of the improved Inception-v4 network, the output of the improved Inception-v4 network is used as the input of the Bi-GRU model, and the output of the Bi-GRU model is used as the input of the CRF model. The parameters of the Inception-v4 network, the Bi-GRU model and the CRF model are adjusted until the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, so that the feature extraction network is obtained from the trained improved Inception-v4 network and the label prediction model is obtained from the trained Bi-GRU model and CRF model.

Fig. 7 is a schematic structural diagram of sensitive data identification according to an embodiment of the present application. As shown in fig. 7, the text to be processed is "he is a mixed egg". After word segmentation by the Jieba tool, the text becomes "he/is/one/mixed egg". The segmented text is input into the improved Inception-v4 network to obtain the spatial feature of each unit of the text. The spatial feature of each unit is input into the Bi-GRU model of the label prediction model to obtain the prediction label of each word in "he is a mixed egg", and the prediction label of each word is input into the CRF model to obtain the label information of each unit, namely "he: O / is: O / one: O / mixed egg: B-Abuse I-Abuse". According to the label information, "mixed egg" is a sensitive word, and replacing it with the designated character yields the desensitized data.

According to the identification method of the sensitive data, the spatial feature of each unit of the text to be processed is extracted through the feature extraction network, the spatial features are then input into the label prediction model to obtain the label information of each unit of the text to be processed, and the sensitive words of the text to be processed are determined according to the label information, so that sensitive data is identified accurately.

The following are embodiments of the apparatus of the present application that may be used to perform the above-described embodiments of the method for sensitive data identification of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to embodiments of the method for sensitive data identification of the present application.

Fig. 8 is a block diagram illustrating an apparatus for sensitive data identification according to an embodiment of the present application. As shown in fig. 8, the apparatus includes: the system comprises a text acquisition module 810, a feature extraction module 820, a tag information module 830 and a sensitive vocabulary module 840.

A text obtaining module 810, configured to obtain a text to be processed;

a feature extraction module 820, configured to input the to-be-processed text into a feature extraction network, and obtain a spatial feature of each unit in the to-be-processed text output by the feature extraction network;

a tag information module 830, configured to input the spatial feature of each unit in the text to be processed into a tag prediction model, and obtain tag information of each unit output by the tag prediction model;

and the sensitive vocabulary module 840 is used for determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed.

The implementation process of the functions and actions of each module in the device is specifically described in the implementation process of the corresponding step in the sensitive data identification method, and is not described herein again.
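The division into four modules can be sketched as a small class in which each module is a replaceable callable. The class name `SensitiveDataDevice` and the toy stand-ins passed to it are hypothetical; in the embodiments the feature extraction and label prediction would be the trained networks.

```python
class SensitiveDataDevice:
    """Four cooperating modules, numbered after modules 810-840 in the text."""

    def __init__(self, acquire_text, extract_features, predict_tags):
        self.acquire_text = acquire_text          # text acquisition module (810)
        self.extract_features = extract_features  # feature extraction module (820)
        self.predict_tags = predict_tags          # tag information module (830)

    def sensitive_vocabulary(self, units, tags):
        """Sensitive vocabulary module (840): keep the units whose tag is not 'O'."""
        return [u for u, t in zip(units, tags) if t != "O"]

    def run(self):
        text = self.acquire_text()
        units, features = self.extract_features(text)
        tags = self.predict_tags(features)
        return self.sensitive_vocabulary(units, tags)


# Toy stand-ins: word length plays the role of the feature, and a length of 4
# plays the role of the trained label predictor.
device = SensitiveDataDevice(
    acquire_text=lambda: "he is rude",
    extract_features=lambda text: (text.split(), [len(w) for w in text.split()]),
    predict_tags=lambda feats: ["B-Abuse" if f == 4 else "O" for f in feats],
)
print(device.run())
```

Keeping the modules as separate callables mirrors the statement that the modules may exist separately or be integrated into one part.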

In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Claims

1. A method of sensitive data identification, comprising:

acquiring a text to be processed;

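inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;

inputting the spatial features of each unit in the text to be processed into a label prediction model to obtain label information of each unit output by the label prediction model;

and determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.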

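2. The method according to claim 1, wherein after determining the sensitive vocabulary of the text to be processed according to the tag information of each unit in the text to be processed, the method further comprises:

replacing the sensitive vocabulary in the text to be processed with specified characters to obtain desensitized data.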

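3. The method according to claim 1, wherein the inputting the text to be processed into a feature extraction network, and obtaining the spatial feature of each unit in the text to be processed output by the feature extraction network, comprises:

performing a word segmentation operation on the text to be processed to obtain a plurality of units;

and inputting each unit of the text to be processed into the feature extraction network, and obtaining the spatial feature corresponding to each unit output by the feature extraction network.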

4. The method of claim 3, wherein the feature extraction network is obtained by modifying an Inception-v4 network to remove the softmax layer of the Inception-v4 network and add a fully convolutional layer.

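5. The method according to claim 1, wherein the inputting the spatial feature of each unit in the text to be processed into a tag prediction model, and obtaining the tag information of each unit output by the tag prediction model, comprises:

inputting the spatial feature of each unit into a trained Bi-GRU model to obtain a prediction label of each unit output by the Bi-GRU model;

and taking the prediction label of each unit output by the Bi-GRU model as the input of a trained CRF model, and obtaining the label information of each unit output by the CRF model.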

6. The method of claim 1, wherein prior to said obtaining text to be processed, the method further comprises:

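acquiring a training text set;

performing word segmentation processing on each training text in the training text set by adopting a word segmentation tool;

acquiring labeling information of sensitive words and labeling information of other words in each training text;

and training to obtain the feature extraction network and the label prediction model according to the labeling information of the sensitive words and the labeling information of other words in each training text.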

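7. The method of claim 6, wherein the training of the feature extraction network and the label prediction model according to the labeled information of the sensitive words and the labeled information of other words in each training text comprises:

taking each training text as the input of the improved Inception-v4 network, taking the output of the improved Inception-v4 network as the input of the Bi-GRU model, taking the output of the Bi-GRU model as the input of the CRF model, and adjusting the parameters of the Inception-v4 network, the Bi-GRU model and the CRF model until the error between the output of the CRF model and the labeling information of each vocabulary in the training text is smaller than a threshold value, thereby obtaining the feature extraction network from the trained improved Inception-v4 network and the label prediction model from the trained Bi-GRU model and CRF model.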

8. An apparatus for sensitive data recognition, comprising:

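the text acquisition module is used for acquiring a text to be processed;

the feature extraction module is used for inputting the text to be processed into a feature extraction network to obtain the spatial features of each unit in the text to be processed output by the feature extraction network;

the label information module is used for inputting the spatial features of each unit in the text to be processed into a label prediction model and obtaining the label information of each unit output by the label prediction model;

and the sensitive vocabulary module is used for determining the sensitive vocabulary of the text to be processed according to the label information of each unit in the text to be processed.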

9. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of sensitive data identification of any of claims 1-7.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of sensitive data identification of any of claims 1-7.