CN117391076B - Acquisition method and device of identification model of sensitive data, electronic equipment and medium - Google Patents
- Publication number
- CN117391076B (application CN202311685176.0A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- sensitive
- different
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The application relates to the technical field of data processing and provides a method, an apparatus, an electronic device, and a medium for obtaining an identification model of sensitive data. A sample data set constructed from training samples and corresponding sample labeling information is obtained, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types. Word segmentation is performed on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type. A deep learning model to be trained is then iteratively trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, yielding a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for acquiring an identification model of sensitive data, an electronic device, and a medium.
Background
Enterprises are becoming increasingly conservative in their use of personally sensitive data. At present, when personal sensitive information is needed for data analysis and mining work, such as optimizing blacklist-scanning models or building customer-information matching models for a group's different subsidiary companies, the data can only be exported from the production environment by data-center operators, per request and after approval, and delivered to the user via manually encrypted transfer. After use, the user must destroy the data promptly and provide evidence of destruction, and the company must periodically review the use of sensitive data to ensure that none is leaked.
At present, data desensitization requires spending considerable manpower and material resources to comb through and sort the hundreds of thousands of data tables across hundreds of application systems, determining whether each table involves sensitive information and desensitizing it according to the result, so as to satisfy the security management regulations on sensitive information; the workload involved is large. Moreover, once a data table changes, the application system's SIC must be updated synchronously, and a timely, accurate, omission-free SIC cannot be guaranteed.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a medium for obtaining an identification model of sensitive data, so as to identify sensitive information fields in real time and efficiently realize dynamic desensitization of sensitive data.
In a first aspect, a method for acquiring an identification model of sensitive data is provided, where the method may include:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
performing word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type;
and iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vector corresponding to each type, and transmits the average vectors to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation results to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In one possible implementation, the data fields of the non-sensitive type or of the different sensitive types include an English name field, a Chinese name field, and a data content field.
In one possible implementation, performing word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type includes:
segmenting the data field of any type according to the character sequence of the field to obtain the different sample word segments of the corresponding data field, where the different sample word segments include word segments formed by combining at least two characters in the data field;
and converting the different sample word segments into corresponding text vectors using word2vec.
In one possible implementation, after obtaining the sensitive information identification model, the method further includes:
constructing a virtual desktop, through whose browser or client a user at the PC end sends a data request to a configured distributed sensitive database;
performing field identification on the request data corresponding to the data request to determine the current data fields of the different types;
performing word segmentation on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
and if the recognition result indicates that the current text vector is sensitive data, encrypting the data and then realizing transfer between the virtual desktop and the PC end.
In one possible implementation, acquiring the sample data set constructed from training samples and corresponding sample labeling information includes:
acquiring the minority-class initial training samples in an initial training set; the initial training set comprises minority-class initial training samples, majority-class initial training samples, and corresponding sample labeling information;
interpolating each minority-class initial training sample using a preset interpolation algorithm to obtain interpolated training samples corresponding to each minority-class initial training sample; the sample labeling information of an interpolated training sample is the same as that of the corresponding interpolated minority-class initial training sample;
and constructing the sample data set based on the interpolated training samples, the majority-class initial training samples, and the corresponding sample labeling information.
In a second aspect, an apparatus for acquiring an identification model of sensitive data is provided, where the apparatus may include:
an acquisition unit, configured to acquire a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
a word segmentation unit, configured to perform word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type;
and a training unit, configured to iteratively train a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vector corresponding to each type, and transmits the average vectors to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation results to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In a third aspect, an electronic device is provided, the electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory are in communication with each other via the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of the above first aspects when executing a program stored on a memory.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any of the first aspects.
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application acquires a sample data set constructed from training samples and corresponding sample labeling information, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types; performs word segmentation on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type; and then iteratively trains a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, obtaining a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting its scope; other related drawings may be derived from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a method for obtaining an identification model of sensitive data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a sensitive information identification model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an acquisition device of an identification model of sensitive data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application may be applied to a server or a terminal. The server may be a physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal may be a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet personal computer (PAD) or other user equipment (User Equipment, UE), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing device connected to a wireless modem, a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal), or the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
Fig. 1 is a flow chart of a method for obtaining an identification model of sensitive data according to an embodiment of the present application. As shown in fig. 1, the method may include:
step S110, a training sample and a sample data set constructed by corresponding sample marking information are obtained.
Before this step is performed, a distributed sensitive database of the data laboratory is built, and data tables of customers, transactions, behaviors, products, and the like are synchronized to the data laboratory's desensitization environment in nightly batches every day.
In a specific implementation, a sample data set constructed from training samples and corresponding sample labeling information is obtained. The training samples may include data fields of a non-sensitive type and data fields of different sensitive types. The data fields of the different types may include the English name field, the Chinese name field, and the data content field of the sensitive data.
Step S120, word segmentation processing is carried out on any type of data field, and text vectors of different sample word segmentation of corresponding types are obtained.
According to the character sequence of the field, the data field of any type is segmented to obtain the different sample word segments of the corresponding data field, where the different sample word segments may include word segments formed by combining at least two characters in the data field; the different sample word segments are then converted into corresponding text vectors using word2vec.
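The segment-to-vector step can be sketched as follows. The patent uses a trained word2vec model, which is not reproduced here; the sketch below substitutes a deterministic hash-based embedding as a stand-in, and `EMB_DIM` and the helper names are illustrative assumptions.

```python
import hashlib

EMB_DIM = 8  # assumed embedding width; real word2vec models typically use 100-300

def embed_token(token: str) -> list[float]:
    """Deterministic stand-in for a trained word2vec lookup: hash the
    token into EMB_DIM pseudo-random floats in [-1.0, 1.0]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 127.5 - 1.0 for b in digest[:EMB_DIM]]

def embed_field(tokens: list[str]) -> list[list[float]]:
    """Convert every sample word segment of a field into its text vector."""
    return [embed_token(t) for t in tokens]
```

For example, `embed_field(["card", "num", "id", "num"])` yields one 8-dimensional vector per word segment; identical segments map to identical vectors, as a real word2vec lookup would.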
(1) Field English name segmentation: according to the naming specification, English field names connect abbreviations with underscores ("_"), so the field English name only needs to be split on "_". For example, the segmentation result of "card_num_id_num" is "<card|num|id|num>".
(2) Field Chinese name segmentation: performed with python-jieba. For example, for "product number to which the card belongs", the result after segmentation is "<product number to which the card belongs>".
(3) Field description segmentation: performed with python-jieba. For example, for "bank internal customer number", the result after segmentation is "<bank internal customer number>".
Furthermore, for "product number to which the card belongs", for example, adjacent word segments may be further combined sequentially to obtain "card belongs", "product to which the card belongs", "product number", and the like.
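The English-name splitting and the sequential combination step can be sketched as below; `render` mimics the "<a|b|c>" display used in the examples, and `adjacent_combinations` is a hypothetical reading of "combining word segmentation results sequentially" as adjacent n-grams.

```python
def segment_english_name(name: str) -> list[str]:
    # English field names join abbreviations with "_", so split on "_"
    return name.split("_")

def render(tokens: list[str]) -> str:
    # Mimic the "<a|b|c>" display used in the patent's examples
    return "<" + "|".join(tokens) + ">"

def adjacent_combinations(tokens: list[str], n: int = 2) -> list[str]:
    # Sequentially combine n adjacent word segments (n-gram style)
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Here `render(segment_english_name("card_num_id_num"))` reproduces `"<card|num|id|num>"`; Chinese names and descriptions would instead go through jieba (e.g. `jieba.lcut(...)`) before the same combination step.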
Step S130, a deep learning model to be trained is iteratively trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In order to adapt to prediction in this scenario and improve model accuracy, the model adopts a neural network framework, assigns distinguishing weights to the three input text features (field English name, field Chinese name, and field description), and combines embedding and N-gram techniques to ensure accuracy.
the specific design is as follows: a field English name text, a field Chinese name text and a field description information text, generating word vectors and 2-gram vectors after word segmentation, and respectively forming three groups of vectors as input; performing superposition averaging on the three sets of vectors as a first hidden layer (hidden layer 2) of the model; the generated three text vectors are subjected to linear transformation, different weights are given, and the generated three text vectors serve as a second hidden layer (hidden layer 3) of the model; the hierarchical softmax is used for multi-classification, so that text classification prediction is realized.
Wherein, as shown in fig. 2, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer.
Specifically, the training process of the deep learning model to be trained includes:
the input layer receives a text vector (X11, X12, X13, ..., X1M; or X21, X22, X23, ..., X2N; or X31, X32, X33, ..., X3K) of any sample word segment under any type and transmits it to the first hidden layer, where M, N, and K are all non-zero positive integers.
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vectors (X1, X2, X3) and transmits them to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result (X) to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result (Y) and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In some embodiments, any iteration of the model specifically includes: initializing the model parameters; computing, from the classification result output by the model and the sample labeling information, the gradient of a preset loss function for the current iteration; and dynamically adjusting the model parameters using a preset learning rate and the current gradient to obtain new model parameters, then performing the next iteration until a preset termination condition is reached, at which point the final model parameters serve as the parameters of the trained multi-layer neural network classifier.
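One such gradient iteration can be sketched as plain softmax regression; the toy data, fixed learning rate, and cross-entropy loss below are illustrative assumptions rather than the patent's exact configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(X, y, W):
    """Cross-entropy loss of the classifier and its gradient w.r.t. W."""
    probs = softmax(X @ W)
    n = len(X)
    loss = -np.log(probs[np.arange(n), y]).mean()
    onehot = np.eye(W.shape[1])[y]
    return loss, X.T @ (probs - onehot) / n

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))          # toy text-vector features
y = rng.integers(0, 3, size=20)       # toy sensitivity labels
W = np.zeros((5, 3))                  # initialized model parameters

loss_before, grad = loss_and_grad(X, y, W)
W = W - 0.1 * grad                    # one update with the (here fixed) learning rate
loss_after, _ = loss_and_grad(X, y, W)
```

Repeating the update until the loss condition is met yields the final parameters described above.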
In some embodiments, the objective function of the softmax layer is optimized using a Huffman tree.
In some embodiments, after the sensitive information recognition model is obtained, a virtual desktop can be built, through whose browser or client a user at the PC end sends a data request to the configured distributed sensitive database;
field identification is performed on the request data corresponding to the data request to determine the current data fields of the different types;
word segmentation is performed on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
the current text vectors are input into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
If the recognition result is sensitive data, the current request result data is automatically desensitized according to a preset desensitization method and then displayed for the user to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is realized through file ferrying.
If the recognition result is non-sensitive data, the request result data is displayed directly for the user to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is likewise realized through file ferrying.
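The "preset desensitization method" is not specified in the text; the sketch below uses a hypothetical masking rule (keep the first and last characters, mask the middle) purely to illustrate how flagged fields could be desensitized before display:

```python
def mask_value(value: str) -> str:
    """Hypothetical desensitization rule: keep the first and last
    characters and replace everything in between with '*'."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "*" * (len(value) - 2) + value[-1]

def desensitize_row(row: dict, sensitive_fields: set) -> dict:
    """Mask only the fields the recognition model flagged as sensitive."""
    return {k: mask_value(v) if k in sensitive_fields else v
            for k, v in row.items()}
```

For example, `desensitize_row({"card_num": "1234567", "city": "Beijing"}, {"card_num"})` masks the card number but leaves the city untouched.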
In some embodiments, since the obtained training samples may include samples of the non-sensitive type and samples of different sensitive types, a small number of samples of some type would leave the trained model inaccurate. A preset interpolation algorithm may therefore be used to interpolate each minority-class initial training sample to obtain corresponding interpolated training samples, where the sample labeling information of an interpolated training sample is the same as that of the corresponding interpolated minority-class initial training sample. The sample data set is then constructed based on the interpolated training samples, the majority-class initial training samples, and the corresponding sample labeling information.
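The minority-class interpolation can be sketched SMOTE-style; the linear-interpolation rule, the neighbor-pairing scheme, and the `lam` parameter are assumptions, since the patent does not fix the "preset interpolation algorithm":

```python
import random

def interpolate(sample_a, sample_b, lam=None):
    """SMOTE-style synthetic sample on the segment between two
    minority-class samples; lam in [0, 1] picks the point."""
    lam = random.random() if lam is None else lam
    return [a + lam * (b - a) for a, b in zip(sample_a, sample_b)]

def oversample_minority(samples, label, k=1):
    """Pair each minority sample with its successor (wrapping around)
    and emit k interpolated samples carrying the same label."""
    out = []
    for i, s in enumerate(samples):
        neighbor = samples[(i + 1) % len(samples)]
        out += [(interpolate(s, neighbor), label) for _ in range(k)]
    return out
```

Each synthetic sample inherits the labeling information of the minority samples it was interpolated from, matching the constraint stated above.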
Based on the above examples, in other embodiments, the training samples may be obtained by:
acquiring the Euclidean distances between each minority-class initial training sample and each majority-class initial training sample in the initial training set;
classifying the initial training set based on a preset random forest classifier to obtain the classification accuracy of each initial training sample and the weight factors of the minority-class initial training samples, and determining a sample adjustment coefficient, where the weight factors are determined based on the number of Euclidean distances;
generating, for each minority-class initial training sample, a new minority-class training sample based on the minority-class initial training sample, a second number of the Euclidean distances, and the sample adjustment coefficient;
clustering the minority-class initial training samples, the new minority-class training samples, and the majority-class training samples using a fuzzy clustering algorithm, and determining the cluster centers and cluster radii of the different clusters;
and obtaining the training samples based on the cluster centers and cluster radii of the different clusters, so that the sample data set is constructed from the training samples and the corresponding sample labeling information.
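The cluster centers and radii used in the final steps can be computed as below. Taking the radius as the maximum Euclidean distance from the center is an assumed reading, and a real implementation would obtain the clusters themselves from a fuzzy clustering algorithm such as fuzzy c-means:

```python
import math

def cluster_center_and_radius(points):
    """Center = coordinate-wise mean of the cluster's points;
    radius = distance from the center to the farthest point."""
    center = [sum(coord) / len(points) for coord in zip(*points)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius
```

For a square of points around (1, 1), the center is (1, 1) and the radius is the distance to a corner.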
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application acquires a sample data set constructed from training samples and corresponding sample labeling information, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types; performs word segmentation on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type; and then iteratively trains a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, obtaining a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Corresponding to the above method, an embodiment of the present application further provides an apparatus for acquiring an identification model of sensitive data. As shown in fig. 3, the apparatus includes:
an obtaining unit 310, configured to obtain a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
a word segmentation unit 320, configured to perform word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
a training unit 330, configured to perform iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
The functions of each functional unit of the apparatus for acquiring an identification model of sensitive data provided in the foregoing embodiments can be implemented through the foregoing method steps; therefore, the specific working process and beneficial effects of each unit of the apparatus are not repeated here.
An embodiment of the present application further provides an electronic device. As shown in fig. 4, the electronic device includes a processor 410, a communication interface 420, a memory 430 and a communication bus 440, where the processor 410, the communication interface 420 and the memory 430 communicate with each other through the communication bus 440.
A memory 430 for storing a computer program;
the processor 410 is configured to execute the program stored in the memory 430, and implement the following steps:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
performing word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and performing iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
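The training step above corresponds to a fastText-like classifier as described later in the claims: token vectors are averaged in a first hidden layer, linearly transformed in a second, and classified by a softmax layer. Below is a minimal NumPy sketch of a single forward pass; the shapes and parameter names are illustrative assumptions, not the patented configuration.

```python
import numpy as np

def forward(token_vectors, W, b):
    """Single forward pass of the described architecture: the first
    hidden layer averages the token text vectors, the second applies a
    linear transform (the per-type weight parameters live in W and b),
    and softmax yields class probabilities.
    Shapes: token_vectors (n_tokens, dim); W (dim, n_classes); b (n_classes,)."""
    avg = token_vectors.mean(axis=0)        # superposition average
    logits = avg @ W + b                    # configured linear transform
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()
```

Training would then compare the returned probabilities against the sample labeling information and adjust `W` and `b` until the preset loss condition is met.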
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the figure shows only one thick line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the implementation and beneficial effects of each component of the electronic device can be understood with reference to the steps of the embodiment shown in fig. 1, the specific working process and beneficial effects of the electronic device provided in this embodiment are not repeated here.
In yet another embodiment provided herein, a computer readable storage medium is provided, where instructions are stored, which when executed on a computer, cause the computer to perform the method for obtaining the identification model of sensitive data according to any of the above embodiments.
In a further embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of acquiring an identification model of sensitive data as described in any of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted to embrace the preferred embodiments and all such variations and modifications as fall within the scope of the embodiments herein.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments in the present application fall within the scope of the claims and the equivalents thereof in the embodiments of the present application, such modifications and variations are also intended to be included in the embodiments of the present application.
Claims (8)
1. A method for obtaining an identification model of sensitive data, the method comprising:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
performing word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
performing iterative training on a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information recognition model comprises an input layer, a first hidden layer, a second hidden layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer averages the text vectors of the different sample word segments to obtain the average vector of the corresponding type, and transmits the average vector to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm, and outputs the data transformation result to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the corresponding text vector do not meet a preset loss condition, adjusting each parameter in the deep learning model to be trained, and returning to inputting the text vectors of other sample word segments under any type into the first hidden layer, until the classification result and the sample labeling information corresponding to the corresponding text vector meet the preset loss condition.
2. The method of claim 1, wherein the data fields of different sensitive types include an English name field, a Chinese name field, and a data content field.
3. The method of claim 1, wherein performing word segmentation on any type of data field to obtain text vectors of the different sample word segments of the corresponding type comprises:
segmenting the data field of any type according to the character order of the field to obtain the different sample word segments of the corresponding data field, where the different sample word segments include sample word segments of at least two character combinations in the data field;
converting the different sample word segments into corresponding text vectors using word2vec.
4. The method of claim 1, wherein after obtaining the sensitive information identification model, the method further comprises:
constructing a virtual desktop; meanwhile, a user at the PC end sends a data request to a configured distributed sensitive database through a browser or a client of the virtual desktop;
performing field identification on the request data corresponding to the data request, and determining the current data fields of different types;
performing word segmentation processing on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain a recognition result output by the sensitive information recognition model;
and if the recognition result indicates that a current text vector is sensitive data, encrypting the current text vector before transmitting the data between the virtual desktop and the PC end.
5. The method of claim 1, wherein obtaining a sample dataset constructed of training samples and corresponding sample annotation information comprises:
acquiring each minority-class initial training sample in an initial training set; the initial training set includes minority-class initial training samples, majority-class initial training samples and corresponding sample labeling information;
performing interpolation on each minority-class initial training sample using a preset interpolation algorithm to obtain an interpolation training sample corresponding to each minority-class initial training sample; the sample labeling information of an interpolation training sample is the same as the sample labeling information of the corresponding interpolated minority-class initial training sample;
and constructing the sample data set based on the interpolation training samples, the majority-class initial training samples and the corresponding sample labeling information.
6. An apparatus for acquiring an identification model of sensitive data, the apparatus comprising:
an obtaining unit, configured to obtain a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
a word segmentation unit, configured to perform word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
a training unit, configured to perform iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information recognition model comprises an input layer, a first hidden layer, a second hidden layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer averages the text vectors of the different sample word segments to obtain the average vector of the corresponding type, and transmits the average vector to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm, and outputs the data transformation result to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the corresponding text vector do not meet a preset loss condition, adjusting each parameter in the deep learning model to be trained, and returning to inputting the text vectors of other sample word segments under any type into the first hidden layer, until the classification result and the sample labeling information corresponding to the corresponding text vector meet the preset loss condition.
7. An electronic device, characterized in that the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are in communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any of claims 1-5 when executing the program stored on the memory.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311685176.0A CN117391076B (en) | 2023-12-11 | 2023-12-11 | Acquisition method and device of identification model of sensitive data, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117391076A CN117391076A (en) | 2024-01-12 |
CN117391076B true CN117391076B (en) | 2024-02-27 |
Family
ID=89439555
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311685176.0A Active CN117391076B (en) | 2023-12-11 | 2023-12-11 | Acquisition method and device of identification model of sensitive data, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117391076B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175235A (en) * | 2019-04-23 | 2019-08-27 | 苏宁易购集团股份有限公司 | Intelligence commodity tax sorting code number method and system neural network based |
CN111428273A (en) * | 2020-04-23 | 2020-07-17 | 北京中安星云软件技术有限公司 | Dynamic desensitization method and device based on machine learning |
WO2020215571A1 (en) * | 2019-04-25 | 2020-10-29 | 平安科技(深圳)有限公司 | Sensitive data identification method and device, storage medium, and computer apparatus |
WO2021135446A1 (en) * | 2020-06-19 | 2021-07-08 | 平安科技(深圳)有限公司 | Text classification method and apparatus, computer device and storage medium |
CN114282258A (en) * | 2021-10-28 | 2022-04-05 | 平安银行股份有限公司 | Screen capture data desensitization method and device, computer equipment and storage medium |
CN114491018A (en) * | 2021-12-23 | 2022-05-13 | 天翼云科技有限公司 | Construction method of sensitive information detection model, and sensitive information detection method and device |
CN114595689A (en) * | 2022-02-28 | 2022-06-07 | 深圳依时货拉拉科技有限公司 | Data processing method, data processing device, storage medium and computer equipment |
CN115687980A (en) * | 2022-11-11 | 2023-02-03 | 中国农业银行股份有限公司 | Desensitization classification method of data table, and classification model training method and device |
CN115828901A (en) * | 2022-12-26 | 2023-03-21 | 中国农业银行股份有限公司 | Sensitive information identification method and device, electronic equipment and storage medium |
CN116305257A (en) * | 2023-02-15 | 2023-06-23 | 杭州北山数字科技有限公司 | Privacy information monitoring device and privacy information monitoring method |
CN116975102A (en) * | 2022-04-19 | 2023-10-31 | 中国移动通信集团广东有限公司 | Sensitive data monitoring method, system, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220300B (en) * | 2017-05-05 | 2018-07-20 | 平安科技(深圳)有限公司 | Information mining method, electronic device and readable storage medium storing program for executing |
CN114564971B (en) * | 2022-02-28 | 2023-05-12 | 北京百度网讯科技有限公司 | Training method of deep learning model, text data processing method and device |
- 2023-12-11: Application CN202311685176.0A filed (CN); patent CN117391076B active
Non-Patent Citations (5)
Title |
---|
Hadeer Ahmed, et al., "Automated detection of unstructured context-dependent sensitive information using deep learning", 2021-08-14, pp. 1-11 *
刘志刚; 杜娟; 衣治安, "Application of an improved classification algorithm in harmful information filtering" (一种改进的分类算法在不良信息过滤中的应用), Microcomputer Applications, 2011-02-15 (02), pp. 9-14 *
陈子豪; 谢从华; 时敏; 唐晓娜, "Fast classification of Chinese patents based on the fastText model" (基于fasttext模型的中文专利快速分类), Journal of Changshu Institute of Technology, 2020-09-17 (05), pp. 47-50 *
侯雪亮; 李新; 陈远平, "A short-text classification model based on a mixture of multiple neural networks" (基于多神经网络混合的短文本分类模型), Computer Systems & Applications, 2020-10-13 (10), pp. 9-19 *
刘金, "A sensitive data identification method based on data features" (基于数据特征的敏感数据识别方法), Information & Communications, 2016-02-15 (02), pp. 240-241 *
Also Published As
Publication number | Publication date |
---|---|
CN117391076A (en) | 2024-01-12 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40098515; Country of ref document: HK |