CN115687980A - Desensitization classification method of data table, and classification model training method and device - Google Patents


Info

Publication number
CN115687980A
CN115687980A (application CN202211412359.0A)
Authority
CN
China
Prior art keywords
data
model
field information
training
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211412359.0A
Other languages
Chinese (zh)
Inventor
王刚
张效铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China filed Critical Agricultural Bank of China
Priority to CN202211412359.0A priority Critical patent/CN115687980A/en
Publication of CN115687980A publication Critical patent/CN115687980A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a desensitization classification method for a data table, a classification model training method, and a device, relating to the technical field of data processing. The desensitization classification method for the data table comprises the following steps: acquiring table structure information of the data table; acquiring field information from the table structure information; and performing sensitive-type identification on the field information through a data classification model to obtain a predicted sensitive type corresponding to the field information, wherein the data classification model is a neural network model obtained through training. By introducing a neural network model, the accuracy and transferability of sensitive-type identification for structured data are improved.

Description

Desensitization classification method of data table, and classification model training method and device
Technical Field
The application relates to the technical field of data processing, in particular to a desensitization classification method of a data table, a classification model training method and a device.
Background
With the progress of the digital age, the explosive growth of data has catalyzed the rapid development of the digital industry. Effectively controlling the sensitive information in various forms of data while minimizing the loss of data utility can accelerate the circulation of data and promote the consumption and application of digital products.
Data desensitization is a data security technology that protects sensitive data while preserving the original data information to the greatest extent. Data desensitization requires the identification of sensitive data. In the related art, rule-based regular matching can be used: a rule base and pattern strings written from expert experience are matched against the data to be processed in order to identify the sensitive information in the data.
However, the above approach lacks flexibility and transfers poorly between different data; the rule base may even require extensive modification for different desensitization systems or for data from different scenarios.
Disclosure of Invention
The application provides a desensitization classification method for a data table, and a classification model training method and device, which are used to solve the problems that sensitive-information identification approaches lack flexibility and transfer poorly between different data.
In a first aspect, the present application provides a method for desensitization classification of a data table, comprising: acquiring table structure information of a data table; acquiring field information from the table structure information; and performing sensitivity type identification on the field information through a data classification model to obtain a prediction sensitivity type corresponding to the field information, wherein the data classification model is a neural network model obtained through training.
In one possible implementation, the obtaining table structure information of the data table includes: traversing the database through database management statements to obtain table building statements of the data table; and analyzing the table building statement by using the regular expression to obtain table structure information.
In one possible implementation, the data classification model includes a first filter, a second filter, a first BiGRU model, a second BiGRU model, and a radial basis function RBF neural network. Through a data classification model, sensitive type recognition is carried out on the field information to obtain a prediction sensitive type corresponding to the field information, and the method comprises the following steps: coding the field information through a coding model to obtain a word vector of the field information; respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features; inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, extracting the context characteristic of the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and extracting the context characteristic of the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic; and inputting the first context characteristic and the second context characteristic into the RBF neural network, and performing sensitive type identification on the field information in the RBF neural network to obtain a prediction sensitive type.
In one possible implementation, encoding field information to obtain a word vector of the field information includes: determining an initial vector of field information; inputting the initial vector into a continuous bag-of-words model, and predicting the category probability corresponding to the field information based on the initial vector in the continuous bag-of-words model; and determining a word vector according to the category probability.
In a second aspect, the present application provides a classification model training method, including: acquiring a training data set, wherein the training data set comprises field information in table structure information of a data table and type labels corresponding to the field information, and the type labels represent sensitive types to which the field information actually belongs; training a data classification model based on a training data set, wherein the data classification model is used for performing sensitive type recognition on the data table in the desensitization classification method for the data table provided according to the first aspect described above.
In one possible implementation, the data classification model includes a first filter, a second filter, a first BiGRU model, a second BiGRU model, and a RBF neural network, and the data classification model is trained multiple times. A training process of the data classification model, comprising: coding the field information through a coding model to obtain a word vector of the field information; respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features; inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, extracting the context characteristic of the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and extracting the context characteristic of the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic; inputting the first context characteristic and the second context characteristic into an RBF neural network, and carrying out sensitive type identification on field information in the RBF neural network to obtain a prediction sensitive type; and adjusting the model parameters of the coding model and the data classification model according to the difference between the type label and the prediction sensitive type.
In a third aspect, the present application provides a desensitization sorting apparatus for a data sheet, comprising: a structure information acquisition unit for acquiring table structure information of the data table; a field acquisition unit, configured to acquire field information from the table structure information; and the desensitization classification unit is used for performing sensitivity type identification on the field information through a data classification model to obtain a prediction sensitivity type corresponding to the field information, wherein the data classification model is a neural network model obtained through training.
In a fourth aspect, the present application provides a classification model training apparatus, including: the training data acquisition unit is used for acquiring a training data set, the training data set comprises field information in table structure information of a data table and type labels corresponding to the field information, and the type labels represent sensitive types to which the field information actually belongs; a model training unit, configured to train a data classification model according to a training data set, where the data classification model is used to perform sensitivity type identification on the data table in the desensitization classification method for data tables provided according to the first aspect.
In a fifth aspect, the present application provides an electronic device, comprising: at least one processor and memory; the memory stores computer-executable instructions; the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of desensitizing classification of a data table as provided in the first aspect above or to perform the method of classification model training as provided in the second aspect above.
In a sixth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for implementing a method of desensitising classification of a data table as provided in the first aspect above or for implementing a method of training a classification model as provided in the second aspect above when executed by a processor.
In a seventh aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements a method of desensitising classification of a data sheet as described in the first aspect above or implements a method of training a classification model as described in the second aspect above.
According to the desensitization classification method, apparatus, device, and medium for a data table provided by the application, for a data table, which is one form of structured data, field information is obtained from the table structure information of the data table, and sensitive-type identification is performed on the field information through a data classification model to obtain the predicted sensitive type corresponding to the field information. The data classification model is a neural network model obtained through training. The neural network model thus enables sensitive-type identification of the data table, one form of structured data, and automatic management of the database table structure, without relying on a large amount of expert knowledge, offering high flexibility, high transferability, and low labor cost.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram illustrating an application scenario in which embodiments of the present disclosure are applicable;
FIG. 2 is a first flowchart illustrating a desensitization classification method of a data table according to an embodiment of the present disclosure;
FIG. 3 is a second flowchart illustrating a desensitization classification method of a data table according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a classification model training method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a desensitization sorting apparatus of a data table according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a classification model training apparatus provided in the embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The above figures show specific embodiments of the present application, which are described in more detail below. The drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the concepts of the application to those skilled in the art with reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
First, in order to facilitate understanding of the present solution, some terms of the present application are explained:
data desensitization: sensitive data in the database is masked or hidden.
Structured data: refers to data represented by a highly organized and uniform structure. Common structured data is data in database or table form.
Unstructured data: including data in the form of text, images, voice, etc.
In the related art, the sensitive data identification can be performed in two ways:
In the first approach, the data is identified and classified using rule-based regular matching.
Specifically, the rule-based technique mainly uses dictionaries, rule judgments, and regular-expression pattern strings; it requires no prior training data and performs well on the patterns it was written for. However, the first approach has the following disadvantages: 1. the false-recognition rate on fuzzy data is high, and manual intervention is needed for correction, which increases labor cost; 2. writing desensitization rules and pattern strings requires an expert knowledge background, i.e., a large amount of expert experience, so labor cost is high; 3. desensitization rules and pattern strings are weakly associated with the data, the matching mode is rigid and prone to errors when content is ambiguous, so the accuracy and reliability of sensitive-data identification are low; the rule base may even need extensive modification for different desensitization systems or for data from different scenarios, giving poor transferability.
In the second approach, deep learning is used for sensitive-data identification. With the development of deep learning, the ability of neural networks to extract features from unstructured data such as text and images continues to improve, and some neural-network-based algorithms have strong recognition capability, for example image recognition and natural language processing with neural networks. However, applying deep learning directly to sensitive-data identification on structured data consumes a great deal of computing resources and time.
In order to solve the defects, the application provides a desensitization classification method of a data table, a classification model training method and a device. In the method for desensitizing classification of the data table, sensitive type prediction is carried out on the field information in the table structure information of the data table through the data classification model, and a predicted sensitive type corresponding to the field information is obtained. In the related service needing data desensitization, the field information of the corresponding prediction sensitivity type can be desensitized according to the definition of sensitive data.
Therefore, the method introduces deep learning to identify sensitive types in the data table, one form of structured data. It has clear advantages in data-table structure management, data-type classification, desensitization classification, and the like, improving how enterprises and teams manage structured data and enhancing the timeliness and agility of data processing. No rule base or pattern strings need to be written from expert experience, which reduces labor cost; flexibility and transferability are high, making the method suitable for different desensitization systems and different data scenarios. It also avoids the redundancy of traditional regular matching and pattern strings, improves the accuracy of desensitization-data classification, ensures the quality of data desensitization, and safeguards the security of subsequent data processing.
For the implementation principle and technical effect of the apparatus, reference may be made to the description of the method.
Fig. 1 shows a schematic diagram of an application scenario to which an embodiment of the present disclosure is applicable. In the application scenario, the related devices include a model training device for training a data classification model and a desensitization classification device for identifying sensitive types of field information in a data table, where the model training device and the desensitization classification device may be servers or terminals, and fig. 1 takes as an example that the model training device is a first server 101 and the desensitization classification device is a second server 102.
On the first server 101, training of the data classification model is performed. The trained data classification model is deployed to the second server 102, and on the second server 102, sensitive type recognition is performed on the field information in the data table through the data classification model, so that the prediction sensitive type of the field information is obtained.
As shown in fig. 1, the application scenario may further involve a database 103; the first server 101 may obtain training data from the database 103, and/or the second server 102 may obtain the data table to be desensitization-classified from the database 103.
The server may be a centralized server, a distributed server, or a cloud server. The terminal may be a Personal Digital Assistant (PDA) device, a handheld device (e.g., a smart phone or a tablet computer) with a wireless communication function, a computing device (e.g., a Personal Computer (PC)), an in-vehicle device, a wearable device (e.g., a smart watch or a smart band), a smart home device (e.g., a smart speaker, a smart display device), a smart robot, or the like.
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a first flowchart illustrating a desensitization classification method of a data table according to an embodiment of the present application. As shown in fig. 2, the desensitization classification method of the data table provided in this embodiment includes:
s201, obtaining the table structure information of the data table.
The table structure information of the data table is used for reflecting the table structure of the data table, and the table structure information of the data table may include field information, table name and other information related to the structure of the data table.
In this embodiment, the table structure information of one or more data tables may be obtained from the database. Alternatively, a data table input by the user may be received, and the table structure information may be obtained from that data table. When the table structure information is acquired from the database, this embodiment can identify the sensitive types of field information for data tables in the database in batches, improving identification efficiency.
S202, field information is obtained from the table structure information.
The field information may include field names, and a data table may include a plurality of field names.
In this embodiment, after the table structure information is obtained, data analysis and query are performed in the table structure information to obtain field information of the data table.
Optionally, the field information may further include a field type and/or a field length. Therefore, the sensitive type prediction of the field name can be realized under the assistance of the field type and/or the field length, and the accuracy of identifying the sensitive type of the field information in the data table is improved.
And S203, carrying out sensitive type identification on the field information through a data classification model to obtain a prediction sensitive type corresponding to the field information.
The data classification model is a neural network model obtained through training. The training process of the data classification model may refer to the following embodiments, which are not described in this embodiment.
The predicted sensitive type corresponding to the field information represents the data type, in the data desensitization field, that the data classification model predicts for the field information. For example, the predicted sensitive type may be any one of the following: name class, code class, mark class, amount class, and certificate class. Field information of different sensitive types may need desensitization in different desensitization services; for example, desensitization service A desensitizes field information of the name and certificate classes, while desensitization service B desensitizes field information of the certificate and amount classes.
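As a minimal illustration of how a predicted sensitive type might drive downstream desensitization, the sketch below masks only the fields a given service marks as sensitive. The service-to-type mapping and the `mask_value` rule are invented for this example and are not taken from the application.

```python
# Hypothetical sketch: service-specific desensitization driven by predicted
# sensitive types. Mapping and masking rule are illustrative assumptions.
SERVICE_SENSITIVE_TYPES = {
    "service_a": {"name", "certificate"},    # e.g. desensitization service A
    "service_b": {"certificate", "amount"},  # e.g. desensitization service B
}

def mask_value(value: str) -> str:
    """Keep the first character and mask the rest with '*'."""
    if len(value) <= 1:
        return "*"
    return value[0] + "*" * (len(value) - 1)

def desensitize_row(row: dict, field_types: dict, service: str) -> dict:
    """Mask only the fields whose predicted sensitive type the service requires."""
    targets = SERVICE_SENSITIVE_TYPES[service]
    return {
        field: mask_value(str(value)) if field_types.get(field) in targets else value
        for field, value in row.items()
    }
```

For instance, under `service_a` a row's name-class and certificate-class fields would be masked while an amount-class field passes through unchanged.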
In this embodiment, the field information or the preliminarily processed field information may be input into the data classification model, and the feature extraction and type prediction may be performed on the field information in the data classification model based on the extracted feature, so as to obtain a prediction sensitive type corresponding to the field information.
In the embodiment of the application, based on the neural network model, the sensitive type identification of the field information in the database and the data table of one of the structured data is realized, the accuracy of the sensitive type identification is improved, and the accuracy of desensitization processing of desensitization services based on the sensitive type of the field information is further improved. The method does not need desensitization rules and mode strings, reduces labor cost, improves efficiency, has higher flexibility and mobility, and can be suitable for different desensitization systems and different data tables.
In some embodiments, obtaining table structure information for the data table includes: traversing the database through database management statements to obtain table building statements of the data table; and analyzing the table building statement of the data table by using the regular expression to obtain the table structure information of the data table.
The database management statement can be a show statement, and the statement can be used for acquiring a table building statement of a data table from a database and outputting and displaying the table building statement.
In this embodiment, the table names of the data tables in the database may be traversed with show statements to obtain the table-building statements, which may be stored in a table-building-statement log. The table-building statements in the log are then parsed with regular expressions to obtain structure-related data such as table names, field names, field types, and field lengths in the table-building statements, i.e., the table structure information of the data table. The parsing can be implemented with a Python script, where Python is a computer programming language, in which the regular expressions and the program that parses the table-building statements are written. Using database management statements and regular expressions in this way improves the efficiency and accuracy of obtaining table structure information from the database, and hence the efficiency and accuracy of desensitization classification of the data table.
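A simplified sketch of the parsing step described above, assuming a MySQL-style table-building statement; the regular expressions here cover only a toy subset of the syntax, not the full grammar a production script would need.

```python
import re

# Illustrative sketch: recover table structure information (table name, field
# names, types, lengths) from a simplified CREATE TABLE statement.
CREATE_RE = re.compile(r"CREATE\s+TABLE\s+`?(\w+)`?", re.IGNORECASE)
FIELD_RE = re.compile(r"`?(\w+)`?\s+(\w+)(?:\((\d+)\))?")

def parse_create_statement(sql: str) -> dict:
    table = CREATE_RE.search(sql).group(1)
    fields = []
    # column definitions live between the first "(" and the last ")"
    body = sql[sql.index("(") + 1 : sql.rindex(")")]
    for line in body.splitlines():
        line = line.strip().rstrip(",")
        if not line or line.upper().startswith(("PRIMARY", "KEY", "UNIQUE")):
            continue  # skip constraints, keep only field definitions
        m = FIELD_RE.match(line)
        if m:
            name, ftype, length = m.groups()
            fields.append({"name": name, "type": ftype,
                           "length": int(length) if length else None})
    return {"table": table, "fields": fields}
```

The parsed result can then be laid out in table form, matching the embodiment's suggestion to present the table structure information as a table for easier sorting and analysis.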
The table structure information of the data table can be presented in a table form, so that the table structure information of the data table is more regular, and the data can be conveniently sorted and analyzed.
In some embodiments, the data classification model may include a filter, a context feature extraction network, and a classification network. Performing sensitive-type identification on the field information through the data classification model then includes: determining a word vector of the field information; inputting the word vector corresponding to the field information into the filter, and extracting features from the word vector through a plurality of convolution kernels in the filter to obtain the filtering feature, i.e., the output data of the filter; inputting the filtering feature into the context feature extraction network, and extracting the context feature of the filtering feature there to obtain the context feature, i.e., the output data of the context feature network; and inputting the context feature into the classification network, which classifies the field information based on the context feature, i.e., predicts which of a plurality of sensitive types the field information belongs to, finally obtaining the predicted sensitive type corresponding to the field information.
For example, the plurality of sensitive types includes a name class, a code class, a token class, an amount class, a certificate class, and the like.
Further, in the data classification model, the filter includes a first filter and a second filter, and a size of a convolution kernel in the first filter is different from a size of a convolution kernel in the second filter, or a number of convolution kernels of the first filter is different from a number of convolution kernels in the second filter. Thus, the diversity of the filter characteristics is improved by the different first and second filters.
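The dual-filter idea can be sketched in plain NumPy as two groups of 1-D convolution kernels of different widths sliding over the word-vector sequence; the kernel counts, widths, and dimensions below are arbitrary assumptions for illustration.

```python
import numpy as np

def conv1d_features(word_vectors, kernels):
    """word_vectors: (seq_len, dim); kernels: (n_kernels, width, dim).
    Returns (n_kernels, seq_len - width + 1) ReLU feature maps."""
    n_kernels, width, _ = kernels.shape
    seq_len = word_vectors.shape[0]
    out = np.empty((n_kernels, seq_len - width + 1))
    for k in range(n_kernels):
        for t in range(seq_len - width + 1):
            # each kernel takes an elementwise product-and-sum over a window
            out[k, t] = np.sum(word_vectors[t:t + width] * kernels[k])
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))                             # 10 tokens, 8-dim word vectors
first = conv1d_features(x, rng.normal(size=(4, 2, 8)))   # first filter: width-2 kernels
second = conv1d_features(x, rng.normal(size=(4, 3, 8)))  # second filter: width-3 kernels
```

Because the two filters use different kernel widths, the two filtering features capture n-gram patterns at different scales, which is the stated source of feature diversity.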
Further, in the data classification model, the context feature extraction network includes a first Bidirectional Gated Recurrent Unit (BiGRU) model and a second BiGRU model, which have the same structure. The first BiGRU model extracts the context feature of the first filtering feature, and the second BiGRU model extracts the context feature of the second filtering feature, improving the diversity of the context features.
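The attention step that condenses a sequence of BiGRU hidden states into one context feature can be sketched as score, softmax, weighted sum; the scoring vector `w` stands in for a learned parameter, and the tanh scoring form is an assumption, not specified by the application.

```python
import numpy as np

def attention_pool(hidden, w):
    """hidden: (seq_len, 2*hidden_dim) BiGRU outputs; w: (2*hidden_dim,) scorer."""
    scores = np.tanh(hidden) @ w             # one relevance score per timestep
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return weights @ hidden                  # weighted sum -> context feature
```

The result is a convex combination of the hidden states, so timesteps the scorer deems relevant dominate the context feature.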
Further, in the data classification model, the classification Network may employ a Radial Basis Function (RBF) Neural Network. The RBF neural network has the advantages of simple structure, high learning speed, high convergence speed and the like, can improve the training speed of the data classification model, and can also improve the model performance of the data classification model.
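A minimal sketch of an RBF classification head, assuming Gaussian basis functions over learned centers followed by a linear layer and softmax; the centers, width `beta`, and output weights below are random placeholders standing in for trained parameters.

```python
import numpy as np

def rbf_forward(x, centers, beta, w):
    """x: (in_dim,); centers: (n_centers, in_dim); w: (n_centers, n_classes)."""
    dist2 = np.sum((centers - x) ** 2, axis=1)  # squared distances to centers
    phi = np.exp(-beta * dist2)                 # Gaussian RBF activations
    logits = phi @ w                            # linear output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax -> class probabilities

rng = np.random.default_rng(1)
probs = rbf_forward(rng.normal(size=6),
                    rng.normal(size=(8, 6)),    # 8 RBF centers
                    0.5,                        # assumed basis width
                    rng.normal(size=(8, 5)))    # 5 sensitive-type classes
```

The simple structure (one nonlinear distance-based layer plus one linear layer) is what gives the RBF network its cited advantages of fast learning and convergence.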
Referring to fig. 3, fig. 3 is a schematic flowchart of a desensitization classification method of a data table according to an embodiment of the present application. As shown in fig. 3, the desensitization classification method of the data table provided in this embodiment includes:
s301, obtaining the table structure information of the data table.
S302, field information is obtained from the table structure information.
The implementation principle and the technical effect of S301 to S302 can refer to the foregoing embodiments, and are not described again.
And S303, coding the field information through the coding model to obtain a word vector of the field information.
In this embodiment, the field information may be input into the encoding model and encoded there to obtain the word vector of the field information. Alternatively, instead of a separate encoding model, an encoding network layer may be set inside the data classification model to encode the field information. Compared with that alternative, using a separate encoding model can improve the encoding effect to a certain extent.
In one possible implementation, as shown in fig. 3, S303 includes S3031 to S3033:
s3031, determining an initial vector of the field information.
In this implementation, the field information may be preliminarily encoded to obtain an initial vector of the field information; for example, the field information may be preliminarily encoded by one-hot encoding, or by label encoding.
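As an illustration only, the one-hot preliminary encoding mentioned above can be sketched as follows (the function name and token handling are hypothetical, not part of this application):

```python
def one_hot_encode(tokens):
    # Build a vocabulary over the field-information tokens, then map each
    # token to a one-hot vector whose length equals the vocabulary size.
    vocab = sorted(set(tokens))
    index = {t: i for i, t in enumerate(vocab)}
    return [[1.0 if i == index[t] else 0.0 for i in range(len(vocab))]
            for t in tokens]
```

Such one-hot vectors are typically very sparse, which is exactly the defect the CBOW step in S3032 is meant to remedy.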
S3032, the initial vector is input to a Continuous Bag of Words (CBOW) model, and the category probability corresponding to the field information is predicted based on the initial vector in the CBOW model.
In the training process of the data classification model, not only are model parameters of the data classification model adjusted, but also model parameters of the CBOW model are adjusted, so that the CBOW model can predict the class probability corresponding to the field information to a certain extent. The category probability corresponding to the field information may include probabilities that the field information belongs to a plurality of sensitive types, for example, probabilities that the field information belongs to a name class, a code class, a mark class, an amount class, and a certificate class.
In this implementation, because the initial vector of the field information may be too sparse and unrelated to the sensitive type of the field information, the initial vector may be input into the CBOW model, where it undergoes feature processing and the category probability corresponding to the field information is predicted. In this way, by adopting the word2vec word-vector method, the field information is converted into a word vector that a machine can identify, and the converted word vector is also related to the sensitive type of the field information.
During encoding, given a window size of c, the CBOW model may predict the category probability corresponding to the field information based on its context. Specifically, the structure of the CBOW model includes an input layer, a projection layer, and an output layer, where:

The initial vectors input to the input layer may be represented as

v(ω_{i-c}), …, v(ω_{i-1}), v(ω_{i+1}), …, v(ω_{i+c})

All initial vectors may be averaged in the projection layer, and the formula may be expressed as:

x_i = (1 / 2c) · Σ_{-c ≤ j ≤ c, j ≠ 0} v(ω_{i+j})

In the output layer, the type probability corresponding to the field information can be calculated with a Softmax function:

p(ω_i | ω_{i-c}, …, ω_{i-1}, ω_{i+1}, …, ω_{i+c}) = exp(v(ω_i)ᵀ x_i) / Σ_{ω∈V} exp(v(ω)ᵀ x_i)

where ω_i represents the i-th field information, v(ω_i) represents the initial vector corresponding to the i-th field information, V represents the set of initial vectors corresponding to the field information, and p represents the type probability.
And S3033, determining the word vector of the field information according to the class probability corresponding to the field information.
In this implementation manner, after the category probabilities corresponding to the field information are obtained, all the category probabilities corresponding to the field information may be combined for each field information to obtain the word vector of the field information.
Thus, the bag-of-words model is utilized in this implementation to improve the encoding effect on the field information.
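The projection-layer average and Softmax output of the CBOW model described above can be sketched numerically as follows (the matrix names `W_in` and `W_out` are hypothetical stand-ins for the input- and output-layer parameters of the CBOW model, which in practice are learned):

```python
import numpy as np

def cbow_forward(context_vectors, W_in, W_out):
    # Projection layer: average the projected context vectors.
    x = np.mean([W_in @ v for v in context_vectors], axis=0)
    # Output layer: Softmax over the vocabulary gives the type probability.
    scores = W_out @ x
    exp_scores = np.exp(scores - scores.max())  # subtract max for stability
    return exp_scores / exp_scores.sum()
```

The returned vector is the per-category probability from which the word vector of the field information is assembled in S3033.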
S304, respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features.
In this embodiment, the word vectors are input into the first filter and the second filter, respectively. The depth of a filter depends on the number of convolution kernels it contains; in other words, each filter includes a plurality of convolution kernels. In the first filter, a convolution operation is performed on the word vectors through the plurality of convolution kernels to extract local features in the word vectors, obtaining the output data of the first filter, namely the first filtering feature; in the second filter, a convolution operation is performed on the word vectors through the plurality of convolution kernels to extract local features in the word vectors, obtaining the output data of the second filter, namely the second filtering feature.
Wherein the size of the convolution kernel in the first filter is different from the size of the convolution kernel in the second filter; and/or the number of convolution kernels in the first filter is different from the number of convolution kernels in the second filter.
There may be a plurality of first filters, and there may also be a plurality of second filters. When a plurality of first filters are provided, feature extraction is performed on the word vectors sequentially through each filter to obtain a plurality of first filtering features; when a plurality of second filters are provided, feature extraction is performed on the word vectors sequentially through each filter to obtain a plurality of second filtering features.
For example, the data classification model may include 128 first filters and 64 second filters, where the convolution kernel sizes in the first filters are 3, 4, and 5, and the convolution kernel sizes in the second filters are 4, 5, and 6.
S305, inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, extracting the context characteristic of the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and extracting the context characteristic of the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic.
In this embodiment, the first filtering feature output by the first filter is input into the first BiGRU model, and context feature extraction is performed on the first filtering feature in the first BiGRU model using an Attention mechanism; specifically, the attention mechanism refines the weight calculation of the feature extraction to highlight the key information in the first filtering feature, finally obtaining the output data of the first BiGRU model, namely the first context feature. Likewise, the second filtering feature output by the second filter is input into the second BiGRU model, and context feature extraction is performed on the second filtering feature in the second BiGRU model using the attention mechanism, highlighting the key information in the second filtering feature and finally obtaining the output data of the second BiGRU model, namely the second context feature.
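The attention weighting in this step can be sketched as follows (the query vector is a hypothetical learned parameter, and the BiGRU recurrence itself is omitted for brevity):

```python
import numpy as np

def attention_pool(hidden_states, query):
    # Score each hidden state against the query, Softmax-normalize the
    # scores into attention weights, and return the weighted sum --
    # the weights highlight the key information in the sequence.
    scores = hidden_states @ query
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    return weights @ hidden_states
```

Hidden states that score highly against the query dominate the pooled context feature, which is the "highlighting" effect described above.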
S306, inputting the first context characteristic and the second context characteristic into the RBF neural network, and carrying out sensitive type identification on the field information in the RBF neural network to obtain a prediction sensitive type.
In this embodiment, the first context feature and the second context feature are input into the RBF neural network together. In the RBF neural network, based on the first context feature and the second context feature, the probability that the field information belongs to each sensitive type is predicted, that is, the probability distribution of the field information is calculated, obtaining the prediction probability of the field information for each sensitive type. The sensitive type with the highest prediction probability is then determined as the prediction sensitive type corresponding to the field information.
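A minimal sketch of this RBF-style classification step (the class centers, width parameter `gamma`, and label names below are hypothetical; a trained RBF network learns its centers and output weights):

```python
import numpy as np

def rbf_classify(feature, centers, gamma, labels):
    # Radial basis activation: how close the context feature lies to each
    # class center, normalized into a probability over sensitive types;
    # the label with the highest probability is the predicted type.
    activations = np.exp(-gamma * np.sum((centers - feature) ** 2, axis=1))
    probs = activations / activations.sum()
    return labels[int(np.argmax(probs))], probs
```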
In the embodiment of the application, the field information in the data table is subjected to sensitive type identification by using the data classification model comprising the first filter, the second filter, the first BiGRU model, the second BiGRU model and the RBF neural network, so that the accuracy and the reliability of the sensitive type identification are improved. The mode does not need desensitization rules and mode strings, reduces labor cost, improves efficiency, has higher flexibility and mobility, and can be suitable for different desensitization systems and different data tables.
Referring to fig. 4, fig. 4 is a schematic flowchart of a classification model training method provided in the embodiment of the present application.
As shown in fig. 4, the method for training a classification model provided in this embodiment includes:
s401, acquiring a training data set.
The training data set comprises field information in table structure information of the data table and type labels corresponding to the field information, and the type labels represent sensitive types to which the field information actually belongs.
In this embodiment, the training data set may be obtained from a database: the table structure information of the data table may be obtained from the database first, and the field information is then obtained from the table structure information, for which reference may be made to the description of the foregoing embodiments, not repeated here.
In one possible implementation, the process of generating the training data set may include: traversing the database through database management statements to obtain table building statements of the data table; analyzing the table building statement by using a regular expression to obtain table structure information; acquiring field information from the table structure information; acquiring a type label corresponding to the field information; and obtaining a training data set according to the field information and the type label corresponding to the field information. The specific processes of database traversal and table building statement analysis may refer to the foregoing embodiments, and are not described in detail. The type label corresponding to the field information can be obtained through manual input.
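The table-building-statement parsing described above can be sketched with a regular expression as follows (the pattern is a deliberate simplification and, for example, does not handle column types such as `DECIMAL(10,2)` whose arguments contain commas):

```python
import re

def extract_fields(create_sql):
    # Take the parenthesized column list of the CREATE TABLE statement,
    # then keep the first identifier of each comma-separated definition,
    # skipping constraint clauses.
    body = re.search(r"\((.*)\)", create_sql, re.S).group(1)
    fields = []
    for part in body.split(","):
        m = re.match(r"\s*(\w+)\s+\w+", part)
        if m and m.group(1).upper() not in ("PRIMARY", "KEY", "CONSTRAINT", "UNIQUE"):
            fields.append(m.group(1))
    return fields
```

Pairing each extracted field name with a manually entered type label then yields one sample of the training data set.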
S402, training the data classification model according to the training data set.
The data classification model can be used for performing sensitivity type identification on the data table in the desensitization classification method provided in any embodiment.
In this embodiment, supervised training may be performed on the data classification model based on the field information in the training data set and the type labels corresponding to the field information. During supervised training, a model loss value is determined according to the difference between the prediction sensitive type output by the data classification model for the field information and the type label corresponding to the field information, and the parameters of the data classification model are adjusted according to the model loss value. The data classification model may be trained one or more times; when it is trained multiple times, its parameters may be adjusted multiple times according to the above process.
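The model loss value can be illustrated with a cross-entropy sketch (one common way to measure the difference between the predicted sensitive-type probabilities and the type label; the application itself does not fix a particular loss function):

```python
import numpy as np

def model_loss(pred_probs, label_index):
    # Negative log-likelihood of the labeled sensitive type: small when
    # the predicted probability for the true type is close to 1.
    return float(-np.log(pred_probs[label_index]))
```

Parameter adjustment then follows the gradient of such a loss, once per training pass.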
In the embodiment of the application, a data classification model is obtained through training based on the field information in the table structure information of the data table and the type labels corresponding to the field information. Through the data classification model, sensitive type identification of field information in databases and data tables serving as structured data can be realized, and the sensitive types of the field information can be provided for various desensitization services. The data classification model can be applied to different databases, different data tables, and different desensitization systems, and has strong flexibility and mobility.
In some embodiments, the data classification model may include a filter, a context feature extraction network, and a classification network, and one training pass of the data classification model may include: determining a word vector of the field information; inputting the word vector corresponding to the field information into the filter, and performing feature extraction on the word vector in the filter through a plurality of convolution kernels to obtain the filtering feature, namely the output data of the filter; inputting the filtering feature into the context feature extraction network, and performing context feature extraction on the filtering feature in the context feature extraction network to obtain the context feature, namely the output data of the context feature extraction network; inputting the context feature into the classification network, and classifying the field information in the classification network based on the context feature, that is, predicting the sensitive type to which the field information belongs among a plurality of sensitive types, to obtain the prediction sensitive type corresponding to the field information; and adjusting the model parameters of the data classification model according to the difference between the prediction sensitive type corresponding to the field information and the type label corresponding to the field information.
Further, in the data classification model, the filter includes a first filter and a second filter, and a size of a convolution kernel in the first filter is different from a size of a convolution kernel in the second filter, or a number of convolution kernels of the first filter is different from a number of convolution kernels in the second filter. Thus, the diversity of the filter characteristics is improved by the different first and second filters.
Further, in the data classification model, the context feature extraction network includes a first BiGRU model and a second BiGRU model, where the two have the same structure. The first BiGRU model is used for extracting the context features of the first filtering feature, and the second BiGRU model is used for extracting the context features of the second filtering feature, thereby improving the diversity of the context features.
Further, in the data classification model, the classification network may employ an RBF neural network, which has the advantages of a simple structure, fast learning, and fast convergence, and can both speed up the training of the data classification model and improve its performance.
In some embodiments, a training process of the data classification model may include: coding the field information through a coding model to obtain a word vector of the field information; respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features; inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, performing context characteristic extraction on the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and performing context characteristic extraction on the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic; inputting the first context characteristic and the second context characteristic into the RBF neural network, and carrying out sensitive type identification on field information in the RBF neural network to obtain a prediction sensitive type; and adjusting the model parameters of the coding model and the data classification model according to the difference between the type label and the prediction sensitive type.
The execution process of the above steps can refer to the foregoing embodiments and is not described again.
In some embodiments, encoding the field information by an encoding model to obtain a word vector of the field information includes: determining an initial vector of field information; inputting the initial vector into a continuous bag-of-words model, and predicting the category probability corresponding to the field information based on the initial vector in the continuous bag-of-words model; and determining a word vector according to the class probability. Thus, the continuous bag-of-words model is utilized to improve the coding effect.
Further, the objective function of the continuous bag-of-words model can be expressed as:
L = Σ_{i∈V} log p(ω_i | ω_{i-c}, …, ω_{i-1}, ω_{i+1}, …, ω_{i+c})
In the training process, the continuous bag-of-words model is adjusted so that this objective function is maximized. The meanings of the variables in the above formula can be found in the previous embodiment and are not repeated here.
The following are embodiments of the apparatus of the present application that may be used to perform corresponding method embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method corresponding to the present application.
Fig. 5 is a schematic structural diagram of a desensitization sorting apparatus of a data table provided in an embodiment of the present application. As shown in fig. 5, the desensitization sorting apparatus 500 of the data table provided in this embodiment includes:
a structure information acquiring unit 501 for acquiring table structure information of a data table;
a field obtaining unit 502, configured to obtain field information from the table structure information;
and a desensitization classification unit 503, configured to perform sensitivity type identification on the field information through a data classification model, to obtain a predicted sensitivity type corresponding to the field information, where the data classification model is a trained neural network model.
In a possible implementation manner, the structure information obtaining unit 501 is specifically configured to: traversing the database through database management statements to obtain table building statements of the data table; and analyzing the table building statement by using the regular expression to obtain table structure information.
In one possible implementation, the data classification model may include a filter, a context feature extraction network, and a classification network, and the desensitization classification unit 503 is specifically configured to: determine a word vector of the field information; input the word vector corresponding to the field information into the filter, and perform feature extraction on the word vector in the filter through a plurality of convolution kernels to obtain the filtering feature, namely the output data of the filter; input the filtering feature into the context feature extraction network, and perform context feature extraction on the filtering feature in the context feature extraction network to obtain the context feature, namely the output data of the context feature extraction network; and input the context feature into the classification network, and classify the field information in the classification network based on the context feature, that is, predict the sensitive type to which the field information belongs among a plurality of sensitive types, finally obtaining the prediction sensitive type corresponding to the field information.
In one possible implementation, the filter includes a first filter and a second filter, and the size of the convolution kernel in the first filter is different from the size of the convolution kernel in the second filter, or the number of convolution kernels in the first filter is different from the number of convolution kernels in the second filter. Thus, the diversity of the filter characteristics is improved by the different first and second filters.
In one possible implementation, the context feature extraction network includes a first BiGRU model and a second BiGRU model.
In one possible implementation, the classification network may employ an RBF neural network.
In one possible implementation, the data classification model includes a first filter, a second filter, a first BiGRU model, a second BiGRU model, and a radial basis function RBF neural network. The desensitization classification unit 503 is specifically used for: coding the field information through a coding model to obtain a word vector of the field information; respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features; inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, extracting the context characteristic of the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and extracting the context characteristic of the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic; and inputting the first context characteristic and the second context characteristic into the RBF neural network, and performing sensitive type identification on the field information in the RBF neural network to obtain a prediction sensitive type.
In a possible implementation manner, in the process of encoding the field information to obtain a word vector of the field information, the desensitization classification unit 503 is specifically configured to: determining an initial vector of field information; inputting the initial vector into a continuous bag-of-words model, and predicting the category probability corresponding to the field information based on the initial vector in the continuous bag-of-words model; and determining a word vector according to the category probability.
It should be noted that the desensitization classification apparatus for a data table provided in the foregoing embodiments may be used to execute each step in the desensitization classification method for a data table provided in any of the foregoing embodiments; the specific implementation manners and technical effects are similar and are not described herein again.
Fig. 6 is a schematic structural diagram of a classification model training apparatus according to an embodiment of the present application. As shown in fig. 6, the classification model training apparatus 600 provided in this embodiment includes:
a training data obtaining unit 601, configured to obtain a training data set, where the training data set includes field information in table structure information of a data table and a type tag corresponding to the field information, and the type tag indicates a sensitive type to which the field information actually belongs;
the model training unit 602 is configured to train a data classification model according to a training data set, where the data classification model is used for performing sensitive type identification on the data table in the desensitization classification method for the data table provided in the foregoing embodiment.
In a possible implementation manner, the training data obtaining unit 601 is specifically configured to: traversing the database through database management statements to obtain table building statements of the data table; analyzing the table building statement by using a regular expression to obtain table structure information; acquiring field information from the table structure information; acquiring a type label corresponding to the field information; and obtaining a training data set according to the field information and the type label corresponding to the field information.
In a possible implementation manner, the data classification model may include a filter, a context feature extraction network, and a classification network, and in a training process of the data classification model, the model training unit 602 is specifically configured to: determine a word vector of the field information; input the word vector corresponding to the field information into the filter, and perform feature extraction on the word vector in the filter through a plurality of convolution kernels to obtain the filtering feature, namely the output data of the filter; input the filtering feature into the context feature extraction network, and perform context feature extraction on the filtering feature in the context feature extraction network to obtain the context feature, namely the output data of the context feature extraction network; input the context feature into the classification network, and classify the field information in the classification network based on the context feature, that is, predict the sensitive type to which the field information belongs among a plurality of sensitive types, to obtain the prediction sensitive type corresponding to the field information; and adjust the model parameters of the data classification model according to the difference between the prediction sensitive type corresponding to the field information and the type label corresponding to the field information.
In one possible implementation, the filter includes a first filter and a second filter.
In one possible implementation, the context feature extraction network includes a first BiGRU model and a second BiGRU model.
In one possible implementation, the classification network may employ an RBF neural network.
In one possible implementation, the data classification model includes a first filter, a second filter, a first BiGRU model, a second BiGRU model, and a RBF neural network, and the data classification model is trained multiple times. In a training process of the data classification model, the model training unit 602 is specifically configured to: coding the field information through a coding model to obtain a word vector of the field information; respectively inputting the word vectors into a first filter and a second filter, performing feature extraction on the word vectors through a plurality of convolution kernels in the first filter to obtain first filtering features, and performing feature extraction on the word vectors through a plurality of convolution kernels in the second filter to obtain second filtering features; inputting the first filtering characteristic into a first BiGRU model, inputting the second filtering characteristic into a second BiGRU model, extracting the context characteristic of the first filtering characteristic in the first BiGRU model through an attention mechanism to obtain a first context characteristic, and extracting the context characteristic of the second filtering characteristic in the second BiGRU model through the attention mechanism to obtain a second context characteristic; inputting the first context characteristic and the second context characteristic into an RBF neural network, and performing sensitive type identification on field information in the RBF neural network to obtain a prediction sensitive type; and adjusting the model parameters of the coding model and the data classification model according to the difference between the type label and the prediction sensitive type.
In some embodiments, in the process of encoding the field information through the encoding model to obtain the word vector of the field information, the model training unit 602 is specifically configured to: determining an initial vector of field information; inputting the initial vector into a continuous bag-of-words model, and predicting the category probability corresponding to the field information based on the initial vector in the continuous bag-of-words model; and determining a word vector according to the class probability.
It should be noted that the classification model training apparatus provided in each of the above embodiments may be used to perform each step in the classification model training method provided in any of the above embodiments, and the specific implementation manner and the technical effect are similar and will not be described herein again.
The foregoing embodiments of the apparatus provided in this application are merely exemplary, and the module division is only one logic function division, and there may be another division manner in actual implementation. For example, multiple modules may be combined or may be integrated into another system. The coupling of the various modules to each other may be through interfaces that are typically electrical communication interfaces, but mechanical or other forms of interfaces are not excluded. Accordingly, modules illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed in different locations on the same or different devices.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device 700 may include: at least one processor 701 and a memory 702. Fig. 7 takes one processor as an example.
The memory 702 stores a program executed by the processor 701. Specifically, the program may include program code, and the program code includes computer operating instructions.
The memory 702 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 701 is configured to execute the computer program stored in the memory 702 to implement the steps in the desensitization classification method of the data table in the above embodiments of the method.
The processor 701 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
Alternatively, the memory 702 may be separate or integrated with the processor 701. When the memory 702 is a device separate from the processor 701, the electronic device 700 may further include: a bus 703 for connecting the processor 701 and the memory 702. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 702 and the processor 701 are implemented by being integrated on a chip, the memory 702 and the processor 701 may complete communication through an internal interface.
The present application also provides a computer-readable storage medium, which may include: a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like, wherein the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by at least one processor of the electronic device, the electronic device executes the steps of the desensitization classification method or the classification model training method for the data table provided in the foregoing embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, where the computer program is stored in a readable storage medium. The computer program can be read from a readable storage medium by at least one processor of an electronic device, and execution of the computer program by the at least one processor causes the electronic device to implement the steps of the method for desensitizing classification of a data table or the method for training a classification model provided in the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A desensitization classification method for a data table, comprising:
acquiring table structure information of a data table;
acquiring field information from the table structure information;
and performing sensitive type identification on the field information through a data classification model to obtain a prediction sensitive type corresponding to the field information, wherein the data classification model is a neural network model obtained through training.
2. The desensitization classification method for a data table according to claim 1, wherein the acquiring table structure information of a data table comprises:
traversing the database through database management statements to obtain table building statements of the data table;
and analyzing the table building statement by using a regular expression to obtain the table structure information.
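As a minimal sketch of this step (the `SHOW CREATE TABLE` management statement, the column layout, and the field names below are illustrative assumptions, not taken from the patent), the table-building statement returned for each data table can be analyzed with a regular expression to pull out per-field information:

```python
import re

def extract_field_info(create_stmt: str) -> list[dict]:
    """Parse a CREATE TABLE statement with a regular expression and
    return the name / type / comment of every field definition."""
    # Grab the column-definition list between the outermost parentheses.
    body = re.search(r"\((.*)\)", create_stmt, re.S).group(1)
    fields = []
    for line in body.split(",\n"):
        m = re.match(
            r"\s*`?(?P<name>\w+)`?\s+(?P<type>\w+(?:\(\d+(?:,\d+)?\))?)"
            r"(?:.*COMMENT\s+'(?P<comment>[^']*)')?",
            line,
            re.I,
        )
        if m:
            fields.append(m.groupdict())
    return fields

# Hypothetical MySQL-style table-building statement.
stmt = """CREATE TABLE customer (
  `cust_name` varchar(64) COMMENT 'customer name',
  `id_card` char(18) COMMENT 'national ID number',
  `balance` decimal(16,2)
)"""
fields = extract_field_info(stmt)
```

The extracted names and comments are the field information that is then fed to the data classification model.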
3. The desensitization classification method for a data table according to claim 1, wherein the data classification model comprises a first filter, a second filter, a first bidirectional gated recurrent unit (BiGRU) model, a second BiGRU model, and a radial basis function (RBF) neural network;
the identifying the sensitive type of the field information through the data classification model to obtain the prediction sensitive type corresponding to the field information comprises the following steps:
coding the field information through a coding model to obtain a word vector of the field information;
inputting the word vectors into the first filter and the second filter respectively, performing feature extraction on the word vectors in the first filter through a plurality of convolution kernels to obtain first filtering features, and performing feature extraction on the word vectors in the second filter through a plurality of convolution kernels to obtain second filtering features;
inputting the first filtering feature into a first BiGRU model, inputting the second filtering feature into a second BiGRU model, performing context feature extraction on the first filtering feature through an attention mechanism in the first BiGRU model to obtain a first context feature, and performing context feature extraction on the second filtering feature through the attention mechanism in the second BiGRU model to obtain a second context feature;
and inputting the first context characteristic and the second context characteristic into the RBF neural network, and performing sensitive type identification on the field information in the RBF neural network to obtain the prediction sensitive type.
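The two-branch pipeline of claim 3 can be sketched as follows in PyTorch. This is an illustrative reconstruction under stated assumptions — the kernel sizes, hidden sizes, number of RBF centers, and class names are all hypothetical choices, not values given in the patent:

```python
import torch
import torch.nn as nn

class FilterBranch(nn.Module):
    """One filter (multi-kernel Conv1d) followed by a BiGRU with attention."""
    def __init__(self, emb_dim, n_kernels=32, kernel_sizes=(2, 3, 4), hidden=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_kernels, k, padding=k // 2) for k in kernel_sizes
        )
        self.bigru = nn.GRU(n_kernels * len(kernel_sizes), hidden,
                            bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, seq, emb_dim)
        # Feature extraction through several convolution kernels.
        c = torch.cat([torch.relu(conv(x.transpose(1, 2)))[..., :x.size(1)]
                       for conv in self.convs], dim=1)       # (batch, C, seq)
        out, _ = self.bigru(c.transpose(1, 2))               # (batch, seq, 2*hidden)
        # Attention mechanism: weighted sum over the sequence positions.
        w = torch.softmax(self.attn(out), dim=1)
        return (w * out).sum(dim=1)                          # (batch, 2*hidden)

class RBFHead(nn.Module):
    """Radial-basis-function layer followed by a linear classifier."""
    def __init__(self, in_dim, n_centers, n_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_centers))
        self.out = nn.Linear(n_centers, n_classes)

    def forward(self, x):
        d = torch.cdist(x, self.centers) ** 2                # squared distances
        phi = torch.exp(-d / (2 * self.log_sigma.exp() ** 2))
        return self.out(phi)

class DataClassifier(nn.Module):
    def __init__(self, emb_dim=100, n_classes=5):
        super().__init__()
        self.branch1 = FilterBranch(emb_dim)                 # first filter + BiGRU
        self.branch2 = FilterBranch(emb_dim, kernel_sizes=(3, 5))
        self.rbf = RBFHead(2 * 2 * 64, n_centers=16, n_classes=n_classes)

    def forward(self, x):
        # Concatenate the two context features, then classify with the RBF net.
        feats = torch.cat([self.branch1(x), self.branch2(x)], dim=-1)
        return self.rbf(feats)
```

Feeding a batch of word-vector sequences of shape `(batch, seq_len, emb_dim)` yields one logit per candidate sensitive type.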
4. The desensitization classification method for a data table according to claim 3, wherein said encoding the field information to obtain a word vector of the field information comprises:
determining an initial vector of the field information;
inputting the initial vector into a continuous bag-of-words model, and predicting category probability corresponding to the field information based on the initial vector in the continuous bag-of-words model;
and determining the word vector according to the category probability.
5. A classification model training method, comprising:
acquiring a training data set, wherein the training data set comprises field information in table structure information of a data table and a type label corresponding to the field information, and the type label represents a sensitive type to which the field information actually belongs;
training a data classification model based on the training data set, wherein the data classification model is used for performing sensitive type recognition on the data table in a desensitization classification method of the data table according to any one of claims 1-4.
6. The method of classification model training of claim 5, wherein the data classification model comprises a first filter, a second filter, a first BiGRU model, a second BiGRU model, and an RBF neural network, and the training of the data classification model is performed a plurality of times;
the one-time training process of the data classification model comprises the following steps:
coding the field information through a coding model to obtain a word vector of the field information;
inputting the word vectors into the first filter and the second filter respectively, performing feature extraction on the word vectors in the first filter through a plurality of convolution kernels to obtain first filtering features, and performing feature extraction on the word vectors in the second filter through a plurality of convolution kernels to obtain second filtering features;
inputting the first filtering feature into the first BiGRU model, inputting the second filtering feature into the second BiGRU model, performing context feature extraction on the first filtering feature through an attention mechanism in the first BiGRU model to obtain a first context feature, and performing context feature extraction on the second filtering feature through the attention mechanism in the second BiGRU model to obtain a second context feature;
inputting the first context feature and the second context feature into the RBF neural network, and performing sensitive type identification on the field information in the RBF neural network to obtain the prediction sensitive type;
adjusting model parameters of the coding model and the data classification model according to a difference between the type label and the prediction sensitivity type.
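The final adjustment step — updating the coding model and the data classification model jointly from the label/prediction difference — can be sketched as a single training step. The encoder and classifier below are hypothetical stand-ins (any pair of modules with matching shapes would do), and the learning rate and sizes are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the coding model and the data classification
# model; the point is that one optimizer updates both parameter sets.
encoder = nn.Embedding(1000, 100)                  # token id -> word vector
classifier = nn.Sequential(nn.Flatten(), nn.Linear(12 * 100, 5))

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                    # label vs. predicted type

def train_step(token_ids, labels):
    """One joint update: the loss gradient flows back through the
    classifier and the encoder, so both models' parameters are adjusted."""
    logits = classifier(encoder(token_ids))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating `train_step` over batches of labeled field information realizes the "trained a plurality of times" loop described above.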
7. A desensitization classification apparatus for a data table, comprising:
a structure information acquisition unit for acquiring table structure information of the data table;
a field obtaining unit, configured to obtain field information from the table structure information;
and a desensitization classification unit, configured to perform sensitive type identification on the field information through a data classification model to obtain a prediction sensitive type corresponding to the field information, wherein the data classification model is a neural network model obtained through training.
8. A classification model training apparatus, comprising:
a training data acquisition unit, configured to acquire a training data set, where the training data set includes field information in table structure information of a data table and a type tag corresponding to the field information, and the type tag indicates a sensitive type to which the field information actually belongs;
a model training unit for training a data classification model based on the training data set, wherein the data classification model is used for performing sensitive type identification on the data table in a desensitization classification method of the data table according to any one of claims 1-4.
9. An electronic device, comprising: at least one processor and a memory;
the memory stores computer execution instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the desensitization classification method for a data table according to any one of claims 1-4 or the classification model training method according to any one of claims 5-6.
10. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the desensitization classification method for a data table according to any one of claims 1-4 or the classification model training method according to any one of claims 5-6.
CN202211412359.0A 2022-11-11 2022-11-11 Desensitization classification method of data table, and classification model training method and device Pending CN115687980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211412359.0A CN115687980A (en) 2022-11-11 2022-11-11 Desensitization classification method of data table, and classification model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211412359.0A CN115687980A (en) 2022-11-11 2022-11-11 Desensitization classification method of data table, and classification model training method and device

Publications (1)

Publication Number Publication Date
CN115687980A true CN115687980A (en) 2023-02-03

Family

ID=85051802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211412359.0A Pending CN115687980A (en) 2022-11-11 2022-11-11 Desensitization classification method of data table, and classification model training method and device

Country Status (1)

Country Link
CN (1) CN115687980A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859372A (en) * 2023-03-04 2023-03-28 成都安哲斯生物医药科技有限公司 Medical data desensitization method and system
CN115859372B (en) * 2023-03-04 2023-04-25 成都安哲斯生物医药科技有限公司 Medical data desensitization method and system
CN117391076A (en) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium
CN117391076B (en) * 2023-12-11 2024-02-27 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium
CN118520116A (en) * 2024-07-23 2024-08-20 湖州智慧城市研究院有限公司 Sensitive information identification method and device

Similar Documents

Publication Publication Date Title
CN110287961B (en) Chinese word segmentation method, electronic device and readable storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN115687980A (en) Desensitization classification method of data table, and classification model training method and device
CN109905385B (en) Webshell detection method, device and system
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN110858269B (en) Fact description text prediction method and device
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111177367B (en) Case classification method, classification model training method and related products
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN112784066A (en) Information feedback method, device, terminal and storage medium based on knowledge graph
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN113657773A (en) Method and device for testing speech technology, electronic equipment and storage medium
CN117807482A (en) Method, device, equipment and storage medium for classifying customs clearance notes
CN110852082B (en) Synonym determination method and device
CN112966507A (en) Method, device, equipment and storage medium for constructing recognition model and identifying attack
CN116680401A (en) Document processing method, document processing device, apparatus and storage medium
CN117079298A (en) Information extraction method, training method of information extraction system and information extraction system
CN115357720A (en) Multi-task news classification method and device based on BERT
CN113095086B (en) Method and system for predicting source meaning
CN114842982A (en) Knowledge expression method, device and system for medical information system
CN113836297A (en) Training method and device for text emotion analysis model
CN112698977A (en) Server fault positioning method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination