CN117391076B - Acquisition method and device of identification model of sensitive data, electronic equipment and medium - Google Patents
- Publication number
- CN117391076B (application CN202311685176.0A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data
- sensitive
- different
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The application relates to the technical field of data processing and provides a method, an apparatus, an electronic device, and a medium for obtaining an identification model of sensitive data. A sample data set constructed from training samples and corresponding sample labeling information is obtained, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types. Word segmentation is performed on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type. A deep learning model to be trained is then iteratively trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, yielding a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for acquiring an identification model of sensitive data, an electronic device, and a medium.
Background
Enterprises are becoming increasingly conservative in their use of personally sensitive data. At present, when personal sensitive information is needed for data analysis and mining work, such as optimizing blacklist-scanning models or building customer-information matching models for a group's different subsidiary companies, the data can only be exported from the production environment by data-center operators, per request and after approval, and delivered to the user via manually encrypted transfer. After use, the user must destroy the data promptly and provide evidence of destruction, and the company must periodically review the use of sensitive data to ensure that none is leaked.
At present, data desensitization requires spending considerable manpower and material resources to comb through and sort the hundreds of thousands of data tables across hundreds of application systems, determining whether each table involves sensitive information and desensitizing it according to the result, so as to satisfy the security management regulations on sensitive information; the workload involved is large. Moreover, once a data table changes, the application system's SIC must be updated synchronously, and a timely, accurate, omission-free SIC cannot be guaranteed.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a medium for obtaining an identification model of sensitive data, so as to identify sensitive information fields in real time and efficiently realize dynamic desensitization of sensitive data.
In a first aspect, a method for acquiring an identification model of sensitive data is provided, where the method may include:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
performing word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type;
and iteratively training a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vector corresponding to each type, and transmits the average vectors to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation results to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In one possible implementation, the data fields of the non-sensitive type or of the different sensitive types include an English name field, a Chinese name field, and a data content field.
In one possible implementation, performing word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type includes:
segmenting the data field of any type according to the character sequence of the field to obtain the different sample word segments of the corresponding data field, where the different sample word segments include word segments formed by combining at least two characters in the data field;
and converting the different sample word segments into corresponding text vectors using word2vec.
In one possible implementation, after obtaining the sensitive information identification model, the method further includes:
constructing a virtual desktop, through whose browser or client a user at the PC end sends a data request to a configured distributed sensitive database;
performing field identification on the request data corresponding to the data request to determine the current data fields of the different types;
performing word segmentation on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
and if the recognition result indicates that the current text vector is sensitive data, encrypting the data and then realizing transfer between the virtual desktop and the PC end.
In one possible implementation, acquiring the sample data set constructed from training samples and corresponding sample labeling information includes:
acquiring the minority-class initial training samples in an initial training set; the initial training set comprises minority-class initial training samples, majority-class initial training samples, and corresponding sample labeling information;
interpolating each minority-class initial training sample using a preset interpolation algorithm to obtain interpolated training samples corresponding to each minority-class initial training sample; the sample labeling information of an interpolated training sample is the same as that of the corresponding interpolated minority-class initial training sample;
and constructing the sample data set based on the interpolated training samples, the majority-class initial training samples, and the corresponding sample labeling information.
In a second aspect, an apparatus for acquiring an identification model of sensitive data is provided, where the apparatus may include:
an acquisition unit, configured to acquire a sample data set constructed from training samples and corresponding sample labeling information; the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types;
a word segmentation unit, configured to perform word segmentation on the data field of any type to obtain text vectors of the different sample word segments of the corresponding type;
and a training unit, configured to iteratively train a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In one possible implementation, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vector corresponding to each type, and transmits the average vectors to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation results to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In a third aspect, an electronic device is provided, the electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory are in communication with each other via the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of the above first aspects when executing a program stored on a memory.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any of the first aspects.
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application acquires a sample data set constructed from training samples and corresponding sample labeling information, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types; performs word segmentation on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type; and then iteratively trains a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, obtaining a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be considered as limiting its scope; other related drawings may be derived from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a method for obtaining an identification model of sensitive data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a sensitive information identification model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an acquisition device of an identification model of sensitive data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application may be applied to a server or a terminal. The server may be a physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal may be a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet personal computer (PAD) or other user equipment (User Equipment, UE), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing device connected to a wireless modem, a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal), or the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
Fig. 1 is a flow chart of a method for obtaining an identification model of sensitive data according to an embodiment of the present application. As shown in fig. 1, the method may include:
step S110, a training sample and a sample data set constructed by corresponding sample marking information are obtained.
Before this step is performed, a distributed sensitive database of the data laboratory is built, and data tables of customers, transactions, behaviors, products, and the like are synchronized to the data laboratory's desensitization environment in nightly batches every day.
In a specific implementation, a sample data set constructed from training samples and corresponding sample labeling information is obtained. The training samples may include data fields of a non-sensitive type and data fields of different sensitive types. The data fields of the different types may include the English name field, the Chinese name field, and the data content field of the sensitive data.
Step S120, word segmentation processing is carried out on any type of data field, and text vectors of different sample word segmentation of corresponding types are obtained.
According to the character sequence of the field, the data field of any type is segmented to obtain the different sample word segments of the corresponding data field, where the different sample word segments may include word segments formed by combining at least two characters in the data field; the different sample word segments are then converted into corresponding text vectors using word2vec.
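The segment-to-vector step can be sketched as follows. The patent uses a trained word2vec model, which is not reproduced here; the sketch below substitutes a deterministic hash-based embedding as a stand-in, and `EMB_DIM` and the helper names are illustrative assumptions.

```python
import hashlib

EMB_DIM = 8  # assumed embedding width; real word2vec models typically use 100-300

def embed_token(token: str) -> list[float]:
    """Deterministic stand-in for a trained word2vec lookup: hash the
    token into EMB_DIM pseudo-random floats in [-1.0, 1.0]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 127.5 - 1.0 for b in digest[:EMB_DIM]]

def embed_field(tokens: list[str]) -> list[list[float]]:
    """Convert every sample word segment of a field into its text vector."""
    return [embed_token(t) for t in tokens]
```

For example, `embed_field(["card", "num", "id", "num"])` yields one 8-dimensional vector per word segment; identical segments map to identical vectors, as a real word2vec lookup would.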
(1) Field English name segmentation: according to the naming specification, English field names connect abbreviations with underscores ("_"), so the field English name only needs to be split on "_". For example, the segmentation result of "card_num_id_num" is "<card|num|id|num>".
(2) Field Chinese name segmentation: performed with python-jieba. For example, for "product number to which the card belongs", the result after segmentation is "<product number to which the card belongs>".
(3) Field description segmentation: performed with python-jieba. For example, for "bank internal customer number", the result after segmentation is "<bank internal customer number>".
Furthermore, for "product number to which the card belongs", for example, adjacent word segments may be further combined sequentially to obtain "card belongs", "product to which the card belongs", "product number", and the like.
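The English-name splitting and the sequential combination step can be sketched as below; `render` mimics the "<a|b|c>" display used in the examples, and `adjacent_combinations` is a hypothetical reading of "combining word segmentation results sequentially" as adjacent n-grams.

```python
def segment_english_name(name: str) -> list[str]:
    # English field names join abbreviations with "_", so split on "_"
    return name.split("_")

def render(tokens: list[str]) -> str:
    # Mimic the "<a|b|c>" display used in the patent's examples
    return "<" + "|".join(tokens) + ">"

def adjacent_combinations(tokens: list[str], n: int = 2) -> list[str]:
    # Sequentially combine n adjacent word segments (n-gram style)
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Here `render(segment_english_name("card_num_id_num"))` reproduces `"<card|num|id|num>"`; Chinese names and descriptions would instead go through jieba (e.g. `jieba.lcut(...)`) before the same combination step.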
Step S130, a deep learning model to be trained is iteratively trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
In order to adapt to prediction in this scenario and improve model accuracy, the model adopts a neural network framework, assigns distinguishing weights to the three input text features (field English name, field Chinese name, and field description), and combines embedding and N-gram techniques to ensure accuracy.
the specific design is as follows: a field English name text, a field Chinese name text and a field description information text, generating word vectors and 2-gram vectors after word segmentation, and respectively forming three groups of vectors as input; performing superposition averaging on the three sets of vectors as a first hidden layer (hidden layer 2) of the model; the generated three text vectors are subjected to linear transformation, different weights are given, and the generated three text vectors serve as a second hidden layer (hidden layer 3) of the model; the hierarchical softmax is used for multi-classification, so that text classification prediction is realized.
Wherein, as shown in fig. 2, the sensitive information identification model includes an input layer, a first hidden layer, a second hidden layer, and a softmax layer.
Specifically, the training process of the deep learning model to be trained includes:
the input layer receives a text vector (X11, X12, X13, ..., X1M; or X21, X22, X23, ..., X2N; or X31, X32, X33, ..., X3K) of any sample word segment under any type and transmits it to the first hidden layer, where M, N, and K are all non-zero positive integers.
the first hidden layer performs superposition averaging on the text vectors of the different sample word segments to obtain the average vectors (X1, X2, X3) and transmits them to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm and outputs the data transformation result (X) to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
and if the classification result (Y) and the sample labeling information corresponding to the text vector do not satisfy a preset loss condition, each parameter in the deep learning model to be trained is adjusted, and the text vectors of other sample word segments under any type are input to the first hidden layer again, until the classification result and the sample labeling information corresponding to the text vector satisfy the preset loss condition.
In some embodiments, any iteration of the model specifically includes: initializing the model parameters; computing, from the classification result output by the model and the sample labeling information, the gradient of a preset loss function for the current iteration; and dynamically adjusting the model parameters using a preset learning rate and the current gradient to obtain new model parameters, then performing the next iteration until a preset termination condition is reached, at which point the final model parameters serve as the parameters of the trained multi-layer neural network classifier.
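One such gradient iteration can be sketched as plain softmax regression; the toy data, fixed learning rate, and cross-entropy loss below are illustrative assumptions rather than the patent's exact configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(X, y, W):
    """Cross-entropy loss of the classifier and its gradient w.r.t. W."""
    probs = softmax(X @ W)
    n = len(X)
    loss = -np.log(probs[np.arange(n), y]).mean()
    onehot = np.eye(W.shape[1])[y]
    return loss, X.T @ (probs - onehot) / n

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))          # toy text-vector features
y = rng.integers(0, 3, size=20)       # toy sensitivity labels
W = np.zeros((5, 3))                  # initialized model parameters

loss_before, grad = loss_and_grad(X, y, W)
W = W - 0.1 * grad                    # one update with the (here fixed) learning rate
loss_after, _ = loss_and_grad(X, y, W)
```

Repeating the update until the loss condition is met yields the final parameters described above.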
In some embodiments, the objective function of the softmax layer is optimized using a Huffman tree.
In some embodiments, after the sensitive information recognition model is obtained, a virtual desktop can be built, through whose browser or client a user at the PC end sends a data request to the configured distributed sensitive database;
field identification is performed on the request data corresponding to the data request to determine the current data fields of the different types;
word segmentation is performed on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
the current text vectors are input into the sensitive information recognition model to obtain the recognition result output by the sensitive information recognition model;
If the recognition result is sensitive data, the current request result data is automatically desensitized according to a preset desensitization method and then displayed for the user to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is realized through file ferrying.
If the recognition result is non-sensitive data, the request result data is displayed directly for the user to browse. If the user needs to download the data, transfer between the virtual desktop and the PC end is likewise realized through file ferrying.
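The "preset desensitization method" is not specified in the text; the sketch below uses a hypothetical masking rule (keep the first and last characters, mask the middle) purely to illustrate how flagged fields could be desensitized before display:

```python
def mask_value(value: str) -> str:
    """Hypothetical desensitization rule: keep the first and last
    characters and replace everything in between with '*'."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "*" * (len(value) - 2) + value[-1]

def desensitize_row(row: dict, sensitive_fields: set) -> dict:
    """Mask only the fields the recognition model flagged as sensitive."""
    return {k: mask_value(v) if k in sensitive_fields else v
            for k, v in row.items()}
```

For example, `desensitize_row({"card_num": "1234567", "city": "Beijing"}, {"card_num"})` masks the card number but leaves the city untouched.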
In some embodiments, since the obtained training samples may include samples of the non-sensitive type and samples of different sensitive types, a small number of samples of some type would leave the trained model inaccurate. A preset interpolation algorithm may therefore be used to interpolate each minority-class initial training sample to obtain corresponding interpolated training samples, where the sample labeling information of an interpolated training sample is the same as that of the corresponding interpolated minority-class initial training sample. The sample data set is then constructed based on the interpolated training samples, the majority-class initial training samples, and the corresponding sample labeling information.
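The minority-class interpolation can be sketched SMOTE-style; the linear-interpolation rule, the neighbor-pairing scheme, and the `lam` parameter are assumptions, since the patent does not fix the "preset interpolation algorithm":

```python
import random

def interpolate(sample_a, sample_b, lam=None):
    """SMOTE-style synthetic sample on the segment between two
    minority-class samples; lam in [0, 1] picks the point."""
    lam = random.random() if lam is None else lam
    return [a + lam * (b - a) for a, b in zip(sample_a, sample_b)]

def oversample_minority(samples, label, k=1):
    """Pair each minority sample with its successor (wrapping around)
    and emit k interpolated samples carrying the same label."""
    out = []
    for i, s in enumerate(samples):
        neighbor = samples[(i + 1) % len(samples)]
        out += [(interpolate(s, neighbor), label) for _ in range(k)]
    return out
```

Each synthetic sample inherits the labeling information of the minority samples it was interpolated from, matching the constraint stated above.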
Based on the above examples, in other embodiments, the training samples may be obtained by:
acquiring the Euclidean distances between each minority-class initial training sample and each majority-class initial training sample in the initial training set;
classifying the initial training set based on a preset random forest classifier to obtain the classification accuracy of each initial training sample and the weight factors of the minority-class initial training samples, and determining a sample adjustment coefficient, where the weight factors are determined based on the number of Euclidean distances;
generating, for each minority-class initial training sample, a new minority-class training sample based on the minority-class initial training sample, a second number of the Euclidean distances, and the sample adjustment coefficient;
clustering the minority-class initial training samples, the new minority-class training samples, and the majority-class training samples using a fuzzy clustering algorithm, and determining the cluster centers and cluster radii of the different clusters;
and obtaining the training samples based on the cluster centers and cluster radii of the different clusters, so that the sample data set is constructed from the training samples and the corresponding sample labeling information.
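The cluster centers and radii used in the final steps can be computed as below. Taking the radius as the maximum Euclidean distance from the center is an assumed reading, and a real implementation would obtain the clusters themselves from a fuzzy clustering algorithm such as fuzzy c-means:

```python
import math

def cluster_center_and_radius(points):
    """Center = coordinate-wise mean of the cluster's points;
    radius = distance from the center to the farthest point."""
    center = [sum(coord) / len(points) for coord in zip(*points)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius
```

For a square of points around (1, 1), the center is (1, 1) and the radius is the distance to a corner.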
The method for obtaining an identification model of sensitive data provided by the embodiments of the present application acquires a sample data set constructed from training samples and corresponding sample labeling information, where the training samples comprise data fields of a non-sensitive type and data fields of different sensitive types; performs word segmentation on the data fields of each type to obtain text vectors of the different sample word segments of the corresponding type; and then iteratively trains a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, obtaining a sensitive information recognition model. The method can identify sensitive information fields in real time and effectively realize dynamic desensitization of sensitive data.
Corresponding to the above method, an embodiment of the present application further provides an apparatus for acquiring an identification model of sensitive data. As shown in fig. 3, the apparatus includes:
an obtaining unit 310, configured to obtain a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
a word segmentation unit 320, configured to perform word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
a training unit 330, configured to perform iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
The functions of each functional unit of the apparatus for acquiring an identification model of sensitive data provided in the foregoing embodiments can be implemented through the foregoing method steps; therefore, the specific working process and beneficial effects of each unit of the apparatus are not repeated here.
An embodiment of the present application further provides an electronic device. As shown in fig. 4, the electronic device includes a processor 410, a communication interface 420, a memory 430 and a communication bus 440, where the processor 410, the communication interface 420 and the memory 430 communicate with each other through the communication bus 440.
A memory 430 for storing a computer program;
the processor 410 is configured to execute the program stored in the memory 430, and implement the following steps:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
performing word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
and performing iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model.
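The training step above corresponds to a fastText-like classifier as described later in the claims: token vectors are averaged in a first hidden layer, linearly transformed in a second, and classified by a softmax layer. Below is a minimal NumPy sketch of a single forward pass; the shapes and parameter names are illustrative assumptions, not the patented configuration.

```python
import numpy as np

def forward(token_vectors, W, b):
    """Single forward pass of the described architecture: the first
    hidden layer averages the token text vectors, the second applies a
    linear transform (the per-type weight parameters live in W and b),
    and softmax yields class probabilities.
    Shapes: token_vectors (n_tokens, dim); W (dim, n_classes); b (n_classes,)."""
    avg = token_vectors.mean(axis=0)        # superposition average
    logits = avg @ W + b                    # configured linear transform
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()
```

Training would then compare the returned probabilities against the sample labeling information and adjust `W` and `b` until the preset loss condition is met.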
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the figure shows only one thick line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the implementation and beneficial effects of each component of the electronic device can be understood with reference to the steps of the embodiment shown in fig. 1, the specific working process and beneficial effects of the electronic device provided in this embodiment are not repeated here.
In yet another embodiment provided herein, a computer readable storage medium is provided, where instructions are stored, which when executed on a computer, cause the computer to perform the method for obtaining the identification model of sensitive data according to any of the above embodiments.
In a further embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of acquiring an identification model of sensitive data as described in any of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted to embrace the preferred embodiments and all such variations and modifications as fall within the scope of the embodiments herein.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments in the present application fall within the scope of the claims and the equivalents thereof in the embodiments of the present application, such modifications and variations are also intended to be included in the embodiments of the present application.
Claims (8)
1. A method for obtaining an identification model of sensitive data, the method comprising:
acquiring a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
performing word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
performing iterative training on a deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information recognition model comprises an input layer, a first hidden layer, a second hidden layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer averages the text vectors of the different sample word segments to obtain the average vector of the corresponding type, and transmits the average vector to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm, and outputs the data transformation result to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the corresponding text vector do not meet a preset loss condition, adjusting each parameter in the deep learning model to be trained, and returning to inputting the text vectors of other sample word segments under any type into the first hidden layer, until the classification result and the sample labeling information corresponding to the corresponding text vector meet the preset loss condition.
2. The method of claim 1, wherein the data fields of different sensitive types include an English name field, a Chinese name field, and a data content field.
3. The method of claim 1, wherein performing word segmentation on any type of data field to obtain text vectors of the different sample word segments of the corresponding type comprises:
segmenting the data field of any type according to the character order of the field to obtain the different sample word segments of the corresponding data field, where the different sample word segments include sample word segments of at least two character combinations in the data field;
converting the different sample word segments into corresponding text vectors using word2vec.
4. The method of claim 1, wherein after obtaining the sensitive information identification model, the method further comprises:
constructing a virtual desktop; meanwhile, a user at the PC end sends a data request to a configured distributed sensitive database through a browser or a client of the virtual desktop;
performing field identification on the request data corresponding to the data request, and determining the current data fields of different types;
performing word segmentation processing on the current data field of any type to obtain current text vectors of the different sample word segments of the corresponding type;
inputting the current text vectors into the sensitive information recognition model to obtain a recognition result output by the sensitive information recognition model;
and if the recognition result indicates that a current text vector is sensitive data, encrypting the current text vector before transmitting the data between the virtual desktop and the PC end.
5. The method of claim 1, wherein obtaining a sample dataset constructed of training samples and corresponding sample annotation information comprises:
acquiring each minority-class initial training sample in an initial training set; the initial training set includes minority-class initial training samples, majority-class initial training samples and corresponding sample labeling information;
performing interpolation on each minority-class initial training sample using a preset interpolation algorithm to obtain an interpolation training sample corresponding to each minority-class initial training sample; the sample labeling information of an interpolation training sample is the same as the sample labeling information of the corresponding interpolated minority-class initial training sample;
and constructing the sample data set based on the interpolation training samples, the majority-class initial training samples and the corresponding sample labeling information.
6. An apparatus for acquiring an identification model of sensitive data, the apparatus comprising:
an obtaining unit, configured to obtain a sample data set constructed from training samples and corresponding sample labeling information; the training samples include non-sensitive data fields and data fields of different sensitive types;
a word segmentation unit, configured to perform word segmentation processing on any type of data field to obtain text vectors of the different sample word segments of the corresponding type;
a training unit, configured to perform iterative training on the deep learning model to be trained based on the text vectors of the different sample word segments under the different types and the corresponding sample labeling information, to obtain a sensitive information recognition model;
the sensitive information recognition model comprises an input layer, a first hidden layer, a second hidden layer and a softmax layer;
the training process of the deep learning model to be trained comprises the following steps:
the input layer receives the text vector of any sample word segment under any type and transmits it to the first hidden layer;
the first hidden layer averages the text vectors of the different sample word segments to obtain the average vector of the corresponding type, and transmits the average vector to the second hidden layer;
the second hidden layer performs a linear transformation on the received average vectors of the various types based on a configured linear processing algorithm, and outputs the data transformation result to the softmax layer; in the configured linear processing algorithm, different types are assigned different weight parameters;
the softmax layer classifies the received data transformation results;
if the classification result and the sample labeling information corresponding to the corresponding text vector do not meet a preset loss condition, adjusting each parameter in the deep learning model to be trained, and returning to inputting the text vectors of other sample word segments under any type into the first hidden layer, until the classification result and the sample labeling information corresponding to the corresponding text vector meet the preset loss condition.
7. An electronic device, characterized in that the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are in communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any of claims 1-5 when executing the program stored on the memory.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311685176.0A CN117391076B (en) | 2023-12-11 | 2023-12-11 | Acquisition method and device of identification model of sensitive data, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117391076A CN117391076A (en) | 2024-01-12 |
CN117391076B true CN117391076B (en) | 2024-02-27 |
Family
ID=89439555
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311685176.0A Active CN117391076B (en) | 2023-12-11 | 2023-12-11 | Acquisition method and device of identification model of sensitive data, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117391076B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175235A (en) * | 2019-04-23 | 2019-08-27 | 苏宁易购集团股份有限公司 | Intelligence commodity tax sorting code number method and system neural network based |
CN111428273A (en) * | 2020-04-23 | 2020-07-17 | 北京中安星云软件技术有限公司 | Dynamic desensitization method and device based on machine learning |
WO2020215571A1 (en) * | 2019-04-25 | 2020-10-29 | 平安科技(深圳)有限公司 | Sensitive data identification method and device, storage medium, and computer apparatus |
WO2021135446A1 (en) * | 2020-06-19 | 2021-07-08 | 平安科技(深圳)有限公司 | Text classification method and apparatus, computer device and storage medium |
CN114282258A (en) * | 2021-10-28 | 2022-04-05 | 平安银行股份有限公司 | Screen capture data desensitization method and device, computer equipment and storage medium |
CN114491018A (en) * | 2021-12-23 | 2022-05-13 | 天翼云科技有限公司 | Construction method of sensitive information detection model, and sensitive information detection method and device |
CN114595689A (en) * | 2022-02-28 | 2022-06-07 | 深圳依时货拉拉科技有限公司 | Data processing method, data processing device, storage medium and computer equipment |
CN115687980A (en) * | 2022-11-11 | 2023-02-03 | 中国农业银行股份有限公司 | Desensitization classification method of data table, and classification model training method and device |
CN115828901A (en) * | 2022-12-26 | 2023-03-21 | 中国农业银行股份有限公司 | Sensitive information identification method and device, electronic equipment and storage medium |
CN116305257A (en) * | 2023-02-15 | 2023-06-23 | 杭州北山数字科技有限公司 | Privacy information monitoring device and privacy information monitoring method |
CN116975102A (en) * | 2022-04-19 | 2023-10-31 | 中国移动通信集团广东有限公司 | Sensitive data monitoring method, system, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220300B (en) * | 2017-05-05 | 2018-07-20 | 平安科技(深圳)有限公司 | Information mining method, electronic device and readable storage medium storing program for executing |
CN114564971B (en) * | 2022-02-28 | 2023-05-12 | 北京百度网讯科技有限公司 | Training method of deep learning model, text data processing method and device |
- 2023-12-11: Application CN202311685176.0A filed (CN); patent CN117391076B active
Non-Patent Citations (5)
Title |
---|
Hadeer Ahmed, et al., "Automated detection of unstructured context-dependent sensitive information using deep learning", 2021-08-14, pp. 1-11 *
刘志刚; 杜娟; 衣治安, "Application of an improved classification algorithm in harmful information filtering" (一种改进的分类算法在不良信息过滤中的应用), Microcomputer Applications, 2011-02-15 (02), pp. 9-14 *
陈子豪; 谢从华; 时敏; 唐晓娜, "Fast classification of Chinese patents based on the fastText model" (基于fasttext模型的中文专利快速分类), Journal of Changshu Institute of Technology, 2020-09-17 (05), pp. 47-50 *
侯雪亮; 李新; 陈远平, "A short-text classification model based on a mixture of multiple neural networks" (基于多神经网络混合的短文本分类模型), Computer Systems & Applications, 2020-10-13 (10), pp. 9-19 *
刘金, "A sensitive data identification method based on data features" (基于数据特征的敏感数据识别方法), Information & Communications, 2016-02-15 (02), pp. 240-241 *
Also Published As
Publication number | Publication date |
---|---|
CN117391076A (en) | 2024-01-12 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40098515; Country of ref document: HK |