CN117391076B - Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Info

Publication number
CN117391076B
Authority
CN
China
Prior art keywords
sample
data
sensitive
different
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311685176.0A
Other languages
Chinese (zh)
Other versions
CN117391076A (en)
Inventor
翁志鹏
洪建帮
陈春旺
伍思文
罗卓尔
裴雷
陈志�
金鑫
代军堂
丁有韬
王悦
丁征涛
李系能
张方昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank Of East Asia China Co ltd
Original Assignee
Bank Of East Asia China Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank Of East Asia China Co ltd filed Critical Bank Of East Asia China Co ltd
Priority to CN202311685176.0A priority Critical patent/CN117391076B/en
Publication of CN117391076A publication Critical patent/CN117391076A/en
Application granted granted Critical
Publication of CN117391076B publication Critical patent/CN117391076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F18/214: Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/23: Pattern recognition; analysing; clustering techniques
    • G06F18/24: Pattern recognition; analysing; classification techniques
    • G06F21/6245: Security arrangements for protecting computers, programs or data against unauthorised activity; protecting personal data, e.g. for financial or medical purposes

Abstract

The application relates to the technical field of data processing and provides a method, an apparatus, an electronic device and a medium for acquiring an identification model of sensitive data. A sample data set is first constructed from training samples and their corresponding sample labeling information, where the training samples comprise data fields of a non-sensitive type and of different sensitive types. Word segmentation is then performed on each type of data field to obtain text vectors for the different sample word segments of that type. Finally, a deep learning model to be trained is iteratively trained on the text vectors of the sample word segments of the different types, together with the corresponding sample labeling information, to obtain a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively supports dynamic desensitization of sensitive data.

Description

Acquisition method and device of identification model of sensitive data, electronic equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for acquiring an identification model of sensitive data, an electronic device, and a medium.
Background
Enterprises are becoming increasingly conservative in their use of personal sensitive data. At present, if personal sensitive information is needed for data analysis and mining work, such as optimizing blacklist scanning models or building customer information matching models across the different subsidiaries of a group, it can only be exported from the production environment by operators of the data center on a per-request basis after approval, and delivered to the user through manual encrypted transmission. After use, the user must destroy the data in time and provide evidence of destruction, and the company must periodically review the use of sensitive data to ensure that it is not leaked.
At present, data desensitization requires a large amount of manpower and material resources to sort through and classify the hundreds of thousands of data tables of hundreds of application systems, determine whether each table involves sensitive information, and apply desensitization according to the sorting result so as to meet sensitive information security regulations; the workload involved is large. Moreover, once a data table changes, the SIC of the application system needs to be updated synchronously, and timely, accurate and complete updates cannot be guaranteed.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a medium for acquiring an identification model of sensitive data, so as to identify a sensitive information field in real time, and efficiently implement dynamic desensitization of the sensitive data.
In a first aspect, a method for acquiring an identification model of sensitive data is provided, where the method may include:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment of any type and transmits it to the first hidden layer;
the first hidden layer superimposes and averages the text vectors of the different sample word segments to obtain the average vector corresponding to the respective type, and transmits the average vectors to the second hidden layer;
the second hidden layer linearly transforms the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result to the softmax layer; different types are assigned different weight parameters in the configured linear processing algorithm;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter of the deep learning model to be trained is adjusted, and the text vectors of other sample word segments of any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In one possible implementation, the data fields of the non-sensitive type or of the different sensitive types include an English name field, a Chinese name field, and a data content field.
In one possible implementation, performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type includes:
segmenting each type of data field according to the character sequence of the field to obtain the different sample word segments of the corresponding data field, wherein the different sample word segments include word segments formed from combinations of at least two characters in the data field;
converting the different sample word segments into corresponding text vectors using word2vec.
In one possible implementation, after obtaining the sensitive information identification model, the method further includes:
constructing a virtual desktop, through whose browser or client a user at the PC end sends a data request to a configured distributed sensitive database;
performing field identification on the request data corresponding to the data request to determine the current data fields of the different types;
performing word segmentation on each type of current data field to obtain the current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
and if the recognition result indicates that a current text vector is sensitive data, encrypting the corresponding data before it is transferred between the virtual desktop and the PC end.
In one possible implementation, acquiring a sample data set constructed from training samples and corresponding sample labeling information includes:
acquiring each minority-class initial training sample in an initial training set; the initial training set comprises minority-class initial training samples, majority-class initial training samples and corresponding sample labeling information;
interpolating each minority-class initial training sample with a preset interpolation algorithm to obtain the interpolated training sample corresponding to each minority-class initial training sample; the sample labeling information of an interpolated training sample is the same as that of the minority-class initial training sample from which it was interpolated;
and constructing the sample data set based on the interpolated training samples, the majority-class initial training samples and the corresponding sample labeling information.
In a second aspect, an apparatus for acquiring an identification model of sensitive data is provided, where the apparatus may include:
the acquisition unit is used for acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
the word segmentation unit is used for performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and the training unit is used for iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment of any type and transmits it to the first hidden layer;
the first hidden layer superimposes and averages the text vectors of the different sample word segments to obtain the average vector corresponding to the respective type, and transmits the average vectors to the second hidden layer;
the second hidden layer linearly transforms the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result to the softmax layer; different types are assigned different weight parameters in the configured linear processing algorithm;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter of the deep learning model to be trained is adjusted, and the text vectors of other sample word segments of any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In a third aspect, an electronic device is provided, the electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory are in communication with each other via the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of the above first aspects when executing a program stored on a memory.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any of the first aspects.
According to the method for acquiring the identification model of sensitive data provided by the embodiments of the present application, a sample data set constructed from training samples and corresponding sample labeling information is acquired, the training samples comprising data fields of a non-sensitive type and data fields of different sensitive types; word segmentation is performed on each type of data field to obtain text vectors of the different sample word segments of the corresponding type; and a deep learning model to be trained is then iteratively trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively supports dynamic desensitization of sensitive data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for obtaining an identification model of sensitive data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a sensitive information identification model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an acquisition device of an identification model of sensitive data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
The method for acquiring the identification model of sensitive data provided by the embodiments of the present application can be applied to a server or a terminal. The server may be a physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms. The terminal may be a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (PAD) or other user equipment (User Equipment, UE), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing device connected to a wireless modem, a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal), or the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
Fig. 1 is a flow chart of a method for acquiring an identification model of sensitive data according to an embodiment of the present application. As shown in fig. 1, the method may include:
step S110, a training sample and a sample data set constructed by corresponding sample marking information are obtained.
Prior to performing this step, a distributed sensitive database of the data laboratory is built, and data sheets of customers, transactions, behaviors, products, etc. are synchronized to the data laboratory desensitization environment in an evening batch format every day.
In specific implementation, a sensitive training sample and a sensitive data set constructed by corresponding sample marking information are obtained. Wherein the sensitive training samples may include data fields of a non-sensitive type and data fields of a different sensitive type. The different types of data fields may include english name fields, chinese name fields, and data content fields of the sensitive data.
Step S120, word segmentation is performed on each type of data field to obtain text vectors of the different sample word segments of the corresponding type.
Each type of data field is segmented according to the character sequence of the field to obtain the different sample word segments of the corresponding data field, where the different sample word segments may include word segments formed from combinations of at least two characters in the data field; the different sample word segments are then converted into corresponding text vectors using word2vec.
(1) Field English name segmentation: according to the naming specification, a field English name is formed by connecting several abbreviations with "_", so it only needs to be split on "_". For example, "card_num_id_num" is segmented into "<card|num|id|num>".
(2) Field Chinese name segmentation: segmentation is performed with the Python jieba library. For example, "product number to which the card belongs" is split into its component words.
(3) Field description segmentation: segmentation is likewise performed with the Python jieba library. For example, "bank internal customer number" is split into its component words.
Furthermore, for "product number to which the card belongs", adjacent word segments may be combined sequentially to obtain 2-grams such as "card belongs", "product to which the card belongs" and "product number".
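As a concrete illustration of this segmentation and vectorization step, the following Python sketch applies the three rules above and the 2-gram combination, then trains a small word2vec model. It assumes the jieba and gensim libraries; the helper names and the sample field values (including the Chinese strings, which are back-translations of the examples above) are illustrative rather than taken from the patent.

# Minimal sketch of the three segmentation rules above plus 2-gram
# combination and word2vec vectorization. jieba and gensim are assumed
# to be available; helper names and sample field values are illustrative.
import jieba
from gensim.models import Word2Vec

def tokenize_english_name(name):
    # Field English names join abbreviations with "_", so split on "_".
    return [t for t in name.lower().split("_") if t]

def tokenize_chinese_text(text):
    # Field Chinese names and descriptions are segmented with jieba.
    return [t for t in jieba.lcut(text) if t.strip()]

def add_2grams(tokens):
    # Sequentially combine adjacent word segments into 2-grams,
    # as in the "card belongs" / "product number" example above.
    return tokens + [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]

# Illustrative corpus: (English name, Chinese name, description) per field;
# the Chinese strings are back-translations of the examples in the text.
samples = [("card_num_id_num", "卡所属产品编号", "银行内部客户编号")]
corpus = []
for en, zh, desc in samples:
    corpus.append(add_2grams(tokenize_english_name(en)))
    corpus.append(add_2grams(tokenize_chinese_text(zh)))
    corpus.append(add_2grams(tokenize_chinese_text(desc)))

# Train a small word2vec model and look up the text vector of one segment.
w2v = Word2Vec(sentences=corpus, vector_size=64, min_count=1, window=3)
print(w2v.wv["card"].shape)   # (64,), the text vector of the segment "card"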
Step S130, the deep learning model to be trained is iteratively trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain the sensitive information recognition model.
To adapt to this prediction scenario and improve accuracy, the model adopts a neural network framework, assigns distinguishing weights to the three input text features (field English name, field Chinese name and field description), and combines embedding and N-gram techniques to ensure model accuracy.
The specific design is as follows: for the field English name text, the field Chinese name text and the field description text, word vectors and 2-gram vectors are generated after word segmentation, forming three groups of vectors as the input; the three groups of vectors are superimposed and averaged as the first hidden layer of the model (hidden layer 2); the three resulting text vectors are linearly transformed and given different weights as the second hidden layer of the model (hidden layer 3); and hierarchical softmax is used for multi-class classification, realizing text classification prediction.
Wherein, as shown in fig. 2, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer.
Specifically, the training process of the deep learning model to be trained includes:
the input layer receives the text vector of any sample word segment of any type (X_11, X_12, X_13, ..., X_1M; or X_21, X_22, X_23, ..., X_2N; or X_31, X_32, X_33, ..., X_3K) and transmits it to the first hidden layer, where M, N and K are all positive integers;
the first hidden layer superimposes and averages the text vectors of the different sample word segments to obtain the average vectors (X_1, X_2, X_3) corresponding to the respective types, and transmits them to the second hidden layer;
the second hidden layer linearly transforms the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result (X) to the softmax layer; different types are assigned different weight parameters in the configured linear processing algorithm;
the softmax layer classifies the received data transformation results;
if the classification result (Y) and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter of the deep learning model to be trained is adjusted, and the text vectors of other sample word segments of any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In some embodiments, any iteration of the model proceeds as follows: the model parameters are initialized; the preset loss function is computed from the classification result output by the model and the sample labeling information, and the gradient of the current iteration is calculated for this loss function; the model parameters are then dynamically adjusted using a preset learning rate and the current gradient to obtain new model parameters, and the next iteration is performed until a preset termination condition is reached, the finally obtained model parameters being used as the parameters of the trained multi-layer neural network classifier.
In some embodiments, the objective function of the softmax layer is optimized using a Huffman tree (hierarchical softmax).
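The following PyTorch sketch illustrates the layer structure described above: per-feature averaging of token vectors, a per-feature linear transform so that each text feature carries its own weights, and a softmax classifier. It is a sketch under stated assumptions rather than the patent's implementation: the dimensions and class count are arbitrary, an nn.EmbeddingBag learned jointly stands in for the word2vec vectors of the previous step, the three transformed feature vectors are summed because the patent does not specify how they are merged, and plain softmax (inside the cross-entropy loss) replaces the hierarchical softmax / Huffman-tree variant mentioned above.

# Sketch of the fastText-style classifier: average token embeddings per
# feature, apply a per-feature linear layer (different weights per text
# feature), merge, and classify. Assumptions are noted in the lead-in.
import torch
import torch.nn as nn

class SensitiveFieldClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_classes=5):
        super().__init__()
        # mode="mean" averages the token vectors (first hidden layer).
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # One linear branch per text feature (second hidden layer):
        # index 0 = English name, 1 = Chinese name, 2 = description.
        self.branches = nn.ModuleList([nn.Linear(embed_dim, embed_dim) for _ in range(3)])
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids_per_feature):
        averaged = [self.embedding(ids.unsqueeze(0)) for ids in token_ids_per_feature]
        merged = sum(branch(avg) for branch, avg in zip(self.branches, averaged))
        return self.classifier(merged)   # logits; softmax is applied inside the loss

model = SensitiveFieldClassifier(vocab_size=10000)
loss_fn = nn.CrossEntropyLoss()          # log-softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative training step on dummy token ids and a dummy label.
features = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6, 7, 8, 9])]
label = torch.tensor([2])
loss = loss_fn(model(features), label)
loss.backward()
optimizer.step()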
In some embodiments, after the sensitive information recognition model is obtained, a virtual desktop can be built, through whose browser or client a user at the PC end sends a data request to the configured distributed sensitive database;
field identification is performed on the request data corresponding to the data request to determine the current data fields of the different types;
word segmentation is performed on each type of current data field to obtain the current text vectors of the different sample word segments of the corresponding type;
the current text vectors are input into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
if the recognition result is sensitive data, the current request result data is automatically desensitized according to a preset desensitization method and then displayed for the client to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is realized through file ferrying.
If the recognition result is non-sensitive data, the request result data is displayed directly for the client to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is likewise realized through file ferrying.
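As an illustration of this request-time flow, the sketch below classifies each column of a request result with the recognition model and masks the sensitive columns before the virtual desktop renders them; the label set, the mask_value rule and the predict_field_type callable are assumptions made for the example, since the patent only refers to a preset desensitization method.

# Illustrative request-time desensitization: classify each column of the
# result set with the recognition model, then mask sensitive columns
# before the virtual desktop renders them. Labels and masking rule are
# assumptions, not the patent's concrete desensitization method.
SENSITIVE_LABELS = {"id_number", "phone", "bank_card"}   # illustrative label set

def mask_value(value):
    # Simple masking rule: keep the first and last two characters.
    value = str(value)
    if len(value) <= 4:
        return "*" * len(value)
    return value[:2] + "*" * (len(value) - 4) + value[-2:]

def desensitize_result(rows, schema, predict_field_type):
    """schema: one (english_name, chinese_name, description) tuple per column.
    predict_field_type: callable wrapping segmentation + word2vec + the model."""
    sensitive_cols = {i for i, col in enumerate(schema)
                      if predict_field_type(col) in SENSITIVE_LABELS}
    return [[mask_value(v) if i in sensitive_cols else v
             for i, v in enumerate(row)] for row in rows]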
In some embodiments, because the acquired training samples may include samples of the non-sensitive type and samples of different sensitive types, the accuracy of the trained model will be low if the number of samples of a certain type is small. A preset interpolation algorithm may therefore be used to interpolate each minority-class initial training sample to obtain corresponding interpolated training samples; the sample labeling information of an interpolated training sample is the same as that of the minority-class initial training sample from which it was interpolated. The sample data set is then constructed from the interpolated training samples, the majority-class initial training samples and the corresponding sample labeling information, as sketched below.
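The following is a minimal sketch of such an interpolation step in the spirit of SMOTE: each synthetic sample is interpolated between a minority-class sample and one of its nearest minority-class neighbours and inherits the minority-class label. The neighbour count and the linear interpolation rule are assumptions, since the patent only speaks of a preset interpolation algorithm.

# SMOTE-style interpolation sketch for minority-class text vectors.
# The synthetic samples inherit the minority-class label.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def interpolate_minority(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # a random minority sample
        j = rng.choice(idx[i, 1:])           # one of its nearest neighbours
        lam = rng.random()                   # interpolation coefficient in [0, 1]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Usage: augment the minority-class vectors, label them as the minority
# class, and add them to the sample data set alongside the majority class.
X_minority = np.random.rand(20, 64)          # e.g. 20 minority samples, 64-d vectors
X_interpolated = interpolate_minority(X_minority, n_new=30)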
Based on the above examples, in other embodiments, the training samples may be obtained by:
acquiring Euclidean distances between each minority class initial training sample and each majority class initial training sample in an initial training set;
classifying the initial training set based on a preset random forest classifier to obtain the classification accuracy of each initial training sample and the weight factors of the minority-class initial training samples, and determining a sample adjustment coefficient, wherein the weight factors are determined based on the number of Euclidean distances;
generating, for each minority-class initial training sample, a new minority-class training sample based on that minority-class initial training sample, the second number of Euclidean distances and the sample adjustment coefficient;
clustering the minority-class initial training samples, the new minority-class training samples and the majority-class training samples with a fuzzy clustering algorithm, and determining the cluster centers and cluster radii of the different clusters;
and obtaining the training samples based on the cluster centers and cluster radii of the different clusters, so that the sample data set is constructed from the training samples and the corresponding sample labeling information.
According to the method for acquiring the identification model of sensitive data provided by the embodiments of the present application, a sample data set constructed from training samples and corresponding sample labeling information is acquired, the training samples comprising data fields of a non-sensitive type and data fields of different sensitive types; word segmentation is performed on each type of data field to obtain text vectors of the different sample word segments of the corresponding type; and a deep learning model to be trained is then iteratively trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively supports dynamic desensitization of sensitive data.
Corresponding to the method, the embodiment of the application further provides an apparatus for acquiring the identification model of the sensitive data, as shown in fig. 3, where the apparatus for acquiring the identification model of the sensitive data includes:
an acquisition unit 310, configured to acquire a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
a word segmentation unit 320, configured to perform word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and a training unit 330, configured to iteratively train a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
The functions of each functional unit of the device for acquiring the identification model of the sensitive data provided in the foregoing embodiments of the present application may be implemented by the foregoing method steps, so specific working processes and beneficial effects of each unit in the device for acquiring the identification model of the sensitive data provided in the embodiments of the present application are not repeated herein.
The embodiment of the present application further provides an electronic device, as shown in fig. 4, including a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440.
A memory 430 for storing a computer program;
the processor 410 is configured to execute the program stored in the memory 430, and implement the following steps:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
Since the implementation and beneficial effects of each component of the electronic device in the foregoing embodiment can be understood with reference to the steps of the embodiment shown in fig. 1, the specific working process and beneficial effects of the electronic device provided in the embodiments of the present application are not repeated here.
In yet another embodiment provided herein, a computer readable storage medium is provided, where instructions are stored, which when executed on a computer, cause the computer to perform the method for obtaining the identification model of sensitive data according to any of the above embodiments.
In a further embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of acquiring an identification model of sensitive data as described in any of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted to embrace the preferred embodiments and all such variations and modifications as fall within the scope of the embodiments herein.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments in the present application fall within the scope of the claims and the equivalents thereof in the embodiments of the present application, such modifications and variations are also intended to be included in the embodiments of the present application.

Claims (8)

1. A method for obtaining an identification model of sensitive data, the method comprising:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information identification model comprises an input layer, a first hiding layer, a second hiding layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment of any type and transmits it to the first hidden layer;
the first hidden layer superimposes and averages the text vectors of the different sample word segments to obtain the average vector corresponding to the respective type, and transmits the average vectors to the second hidden layer;
the second hidden layer linearly transforms the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result to the softmax layer; different types are assigned different weight parameters in the configured linear processing algorithm;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter of the deep learning model to be trained is adjusted, and the text vectors of other sample word segments of any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
2. The method of claim 1, wherein the data fields of the different sensitive types include an English name field, a Chinese name field, and a data content field.
3. The method of claim 1, wherein performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type comprises:
segmenting each type of data field according to the character sequence of the field to obtain the different sample word segments of the corresponding data field, wherein the different sample word segments include word segments formed from combinations of at least two characters in the data field;
converting the different sample word segments into corresponding text vectors using word2vec.
4. The method of claim 1, wherein after obtaining the sensitive information identification model, the method further comprises:
constructing a virtual desktop, through whose browser or client a user at the PC end sends a data request to a configured distributed sensitive database;
performing field identification on the request data corresponding to the data request to determine the current data fields of the different types;
performing word segmentation on each type of current data field to obtain the current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
and if the recognition result indicates that a current text vector is sensitive data, encrypting the corresponding data before it is transferred between the virtual desktop and the PC end.
5. The method of claim 1, wherein acquiring a sample data set constructed from training samples and corresponding sample labeling information comprises:
acquiring each minority-class initial training sample in an initial training set; the initial training set comprises minority-class initial training samples, majority-class initial training samples and corresponding sample labeling information;
interpolating each minority-class initial training sample with a preset interpolation algorithm to obtain the interpolated training sample corresponding to each minority-class initial training sample; the sample labeling information of an interpolated training sample is the same as that of the minority-class initial training sample from which it was interpolated;
and constructing the sample data set based on the interpolated training samples, the majority-class initial training samples and the corresponding sample labeling information.
6. An apparatus for acquiring an identification model of sensitive data, the apparatus comprising:
the acquisition unit is used for acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
the word segmentation unit is used for performing word segmentation on each type of data field to obtain text vectors of the different sample word segments of the corresponding type;
the training unit is used for iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments of the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information identification model comprises an input layer, a first hiding layer, a second hiding layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment of any type and transmits it to the first hidden layer;
the first hidden layer superimposes and averages the text vectors of the different sample word segments to obtain the average vector corresponding to the respective type, and transmits the average vectors to the second hidden layer;
the second hidden layer linearly transforms the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result to the softmax layer; different types are assigned different weight parameters in the configured linear processing algorithm;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter of the deep learning model to be trained is adjusted, and the text vectors of other sample word segments of any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
7. An electronic device, characterized in that the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are in communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any of claims 1-5 when executing a program stored on a memory.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-5.
CN202311685176.0A 2023-12-11 2023-12-11 Acquisition method and device of identification model of sensitive data, electronic equipment and medium Active CN117391076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311685176.0A CN117391076B (en) 2023-12-11 2023-12-11 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311685176.0A CN117391076B (en) 2023-12-11 2023-12-11 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN117391076A CN117391076A (en) 2024-01-12
CN117391076B true CN117391076B (en) 2024-02-27

Family

ID=89439555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311685176.0A Active CN117391076B (en) 2023-12-11 2023-12-11 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117391076B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175235A (en) * 2019-04-23 2019-08-27 苏宁易购集团股份有限公司 Intelligence commodity tax sorting code number method and system neural network based
CN111428273A (en) * 2020-04-23 2020-07-17 北京中安星云软件技术有限公司 Dynamic desensitization method and device based on machine learning
WO2020215571A1 (en) * 2019-04-25 2020-10-29 平安科技(深圳)有限公司 Sensitive data identification method and device, storage medium, and computer apparatus
WO2021135446A1 (en) * 2020-06-19 2021-07-08 平安科技(深圳)有限公司 Text classification method and apparatus, computer device and storage medium
CN114282258A (en) * 2021-10-28 2022-04-05 平安银行股份有限公司 Screen capture data desensitization method and device, computer equipment and storage medium
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN114595689A (en) * 2022-02-28 2022-06-07 深圳依时货拉拉科技有限公司 Data processing method, data processing device, storage medium and computer equipment
CN115687980A (en) * 2022-11-11 2023-02-03 中国农业银行股份有限公司 Desensitization classification method of data table, and classification model training method and device
CN115828901A (en) * 2022-12-26 2023-03-21 中国农业银行股份有限公司 Sensitive information identification method and device, electronic equipment and storage medium
CN116305257A (en) * 2023-02-15 2023-06-23 杭州北山数字科技有限公司 Privacy information monitoring device and privacy information monitoring method
CN116975102A (en) * 2022-04-19 2023-10-31 中国移动通信集团广东有限公司 Sensitive data monitoring method, system, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220300B (en) * 2017-05-05 2018-07-20 平安科技(深圳)有限公司 Information mining method, electronic device and readable storage medium storing program for executing
CN114564971B (en) * 2022-02-28 2023-05-12 北京百度网讯科技有限公司 Training method of deep learning model, text data processing method and device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175235A (en) * 2019-04-23 2019-08-27 苏宁易购集团股份有限公司 Intelligence commodity tax sorting code number method and system neural network based
WO2020215571A1 (en) * 2019-04-25 2020-10-29 平安科技(深圳)有限公司 Sensitive data identification method and device, storage medium, and computer apparatus
CN111428273A (en) * 2020-04-23 2020-07-17 北京中安星云软件技术有限公司 Dynamic desensitization method and device based on machine learning
WO2021135446A1 (en) * 2020-06-19 2021-07-08 平安科技(深圳)有限公司 Text classification method and apparatus, computer device and storage medium
CN114282258A (en) * 2021-10-28 2022-04-05 平安银行股份有限公司 Screen capture data desensitization method and device, computer equipment and storage medium
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN114595689A (en) * 2022-02-28 2022-06-07 深圳依时货拉拉科技有限公司 Data processing method, data processing device, storage medium and computer equipment
CN116975102A (en) * 2022-04-19 2023-10-31 中国移动通信集团广东有限公司 Sensitive data monitoring method, system, electronic equipment and storage medium
CN115687980A (en) * 2022-11-11 2023-02-03 中国农业银行股份有限公司 Desensitization classification method of data table, and classification model training method and device
CN115828901A (en) * 2022-12-26 2023-03-21 中国农业银行股份有限公司 Sensitive information identification method and device, electronic equipment and storage medium
CN116305257A (en) * 2023-02-15 2023-06-23 杭州北山数字科技有限公司 Privacy information monitoring device and privacy information monitoring method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hadeer Ahmed, et al.; Automated detection of unstructured context-dependent sensitive information using deep learning; 2021-08-14; 1-11 *
Application of an improved classification algorithm in objectionable information filtering; 刘志刚; 杜娟; 衣治安; 微计算机应用 (Microcomputer Applications); 2011-02-15 (02); 9-14 *
Fast classification of Chinese patents based on the fastText model; 陈子豪; 谢从华; 时敏; 唐晓娜; 常熟理工学院学报 (Journal of Changshu Institute of Technology); 2020-09-17 (05); 47-50 *
A short text classification model based on a mixture of multiple neural networks; 侯雪亮; 李新; 陈远平; 计算机系统应用 (Computer Systems & Applications); 2020-10-13 (10); 9-19 *
A sensitive data identification method based on data features; 刘金; 信息通信 (Information & Communications); 2016-02-15 (02); 240-241 *

Also Published As

Publication number Publication date
CN117391076A (en) 2024-01-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code: ref country code: HK; ref legal event code: DE; ref document number: 40098515; country of ref document: HK