CN112925914A - Data security classification method, system, device and storage medium

Info

Publication number: CN112925914A
Application number: CN202110352515.8A
Authority: CN
Inventors: 范遥新; 吴优; 骆垚; 申思
Original assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-06-08
Anticipated expiration: 2041-03-31
Also published as: CN112925914B

Abstract

The invention provides a data security classification method, system, device and storage medium. The method comprises the following steps: acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data; judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if so, determining the security level of the data to be classified according to the security level of the previous-level data; if not, training a preset classification model sequentially on a first training set and then a second training set to obtain a target classification model, and taking the data to be classified as the input of the target classification model to obtain the security level corresponding to the data to be classified. The method reduces the amount of computation and the computation time needed to perform security classification on a large relational database, and improves the coverage and accuracy of the security classification.

Description

Data security classification method, system, device and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data security classification method, system, device, and storage medium.

Background

Data security classification is of great importance in data security management. The classification of data is a direct expression of how important the data is, and it is the basis on which an organization drafts its internal management policies, implements its technical support systems, and allocates effort reasonably during operation and maintenance. Different levels of security protection can then be applied according to the different data levels in subsequent use.

Traditional data security classification methods rely mainly on keyword dictionaries or regular expressions. In the first approach, specific rules are established for data such as bank card numbers and certificate numbers, and corresponding regular expressions are constructed for matching. In the second approach, data such as names, addresses and WeChat IDs have no definite rule but still follow certain patterns (for example, an address field contains information such as city or street names), so they can be recognized and classified through regular-expression rules and naming standards; some sensitive data can also be recognized based on a keyword dictionary. These methods suffer from a low degree of automation, low coverage and low accuracy. Moreover, a large relational database may contain as many as billions of data fields, so classifying them with the prior-art methods requires a large amount of computation and consumes a large amount of time.

Disclosure of Invention

Aiming at the problems in the prior art, the invention aims to provide a data security classification method, a system, equipment and a storage medium, so that the calculation amount and the calculation time for performing security classification on a large relational database are reduced, and the coverage rate and the accuracy rate of the security classification are improved.

In order to achieve the above object, the present invention provides a data security classification method, comprising the steps of:

S110, acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data;

S120, judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if yes, executing step S130; if not, executing step S140;

S130, determining the security level of the data to be classified according to the security level of the previous-level data;

S140, training a preset classification model sequentially based on a first training set and a second training set to obtain a target classification model;

S150, taking the data to be classified as the input of the target classification model to obtain the security level corresponding to the data to be classified.

Optionally, the method further comprises the step of:

S160, setting different encryption algorithms for the corresponding data according to the characteristics of the different security levels, thereby realizing hierarchical encryption.

Optionally, before step S110, the method further includes the step of:

S100, extracting field information of the data to be classified, and determining, according to the field information and a preset database, whether a security level matching the field information exists; the preset database stores the security levels corresponding to different field information;

if so, acquiring the security level;

if not, go to step S110.

Optionally, the step S120 includes:

if no previous-level data of the data to be classified exists, randomly extracting the data to be classified to form sample data;

constructing a regular expression;

performing regular matching on the sample data based on the regular expression, and judging whether the matching is successful;

if not, executing step S140;

and if so, obtaining a sample category obtained by matching, obtaining the security level of the sample category, and determining the security level of the sample category as the security level of the data to be classified.

Optionally, the method further comprises the step of:

acquiring a modification behavior log of the security level of the obtained data to be classified by service personnel;

acquiring a modification field and a modification track corresponding to a modification behavior according to the modification behavior log;

and adjusting the security level of the associated field of the modified field according to the modified field and the modified track.

Optionally, the step S140 includes:

training a preset classification model based on a first training set to obtain an initial classification model; the first training set is a preset corpus;

constructing a second training set comprising at least three data sources; the first data source consists of security levels obtained by manual labeling; the second data source consists of security levels obtained based on step S100; the third data source consists of security levels obtained based on steps S110 to S130;

and training the initial classification model based on the second training set to obtain a target classification model.

Optionally, the step S130 includes:

if only one upper-level data exists, determining the security level of the upper-level data as the security level of the data to be classified;

and if at least two pieces of upper-level data exist and the security levels of the upper-level data are not all the same, taking the mode of the security levels of all the upper-level data as the security level of the data to be classified.

Optionally, the initial classification model is a text classification model pre-trained based on BERT.

The invention also provides a data security classification system for implementing the above data security classification method, the system comprising:

a blood relationship network acquisition module, used for acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data;

an upper-level data judging module, used for judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if yes, the first judgment module is executed; if not, the model fractional training module is executed;

the first judgment module is used for determining the security level of the data to be classified according to the security level of the previous-level data;

the model fractional training module is used for sequentially training a preset classification model based on the first training set and the second training set respectively to obtain a target classification model;

and the second judgment module is used for taking the data to be classified as the input of the target classification model to obtain the security level corresponding to the data to be classified.

The invention also provides a data security classification device, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of any of the above data security classification methods via execution of the executable instructions.

The present invention also provides a computer-readable storage medium storing a program which, when executed by a processor, performs the steps of any one of the above-described data security classification methods.

Compared with the prior art, the invention has the following advantages and prominent effects:

Aimed at the security classification of large relational databases, the data security classification method, system, device and storage medium provided by the invention combine a blood-relationship-based classification method with a neural network model classification method, thereby reducing the amount of computation and the computation time for classifying the data of a large relational database and improving the coverage and accuracy of the security classification.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.

FIG. 1 is a diagram of a data security classification method according to an embodiment of the present invention;

FIG. 2 is a diagram of a data security classification method according to another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a data security classification system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data security classification device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their repetitive description will be omitted.

As shown in fig. 1, an embodiment of the present invention discloses a data security classification method, which includes the following steps:

S110, acquiring a blood relationship network of the data to be classified, wherein the blood relationship network represents inheritance relationships among different data. Specifically, the database comprises at least one data table, and each data table comprises at least one data field. In this embodiment, all data tables in the database are obtained first, and then all data fields in each data table are obtained; the object of classification is the data field, so the data to be classified is also referred to as a data field. The security level of a data table can then be derived from the security levels of its data fields, and the security level of the database from the security levels of its data tables.

Wherein the security level of the data table may be the highest level of security levels of all data fields it contains. The security level of the database may be the highest level of security of all the data tables it contains.
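The roll-up rule above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that security levels are integers where a larger number means more sensitive; the table and field names are hypothetical.

```python
from typing import Dict

# Assumption: a security level is an integer, and a larger value is more sensitive.
FieldLevels = Dict[str, int]


def table_level(field_levels: FieldLevels) -> int:
    """Security level of a data table: the highest level among its data fields."""
    return max(field_levels.values())


def database_level(tables: Dict[str, FieldLevels]) -> int:
    """Security level of a database: the highest level among its data tables."""
    return max(table_level(fields) for fields in tables.values())


# Hypothetical example data.
tables = {
    "orders": {"order_id": 1, "phone": 3},
    "cities": {"city_name": 1},
}
```

Because the text states that every table has at least one field and every database at least one table, the sketch does not handle empty inputs.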

In this step, the blood relationship network of a data field may be obtained by obtaining the blood relationship network of the data table to which the data field belongs. The blood relationship network of each data table can be computed from the flow of its data fields, that is, the flow of the data to be classified. The blood relationship network thus represents the inheritance relationships among different data tables.

S120, judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if yes, go to step S130; if not, go to step S140. Specifically, for example, if data table A is generated from data table B and data table C through an ETL (Extract-Transform-Load) process, then the upstream data tables of data table A, i.e. its previous-level data tables, are data table B and data table C. Conversely, data table A is a downstream data table of data tables B and C, i.e. their next-level data table.

In this step, it is first determined whether the data table to which the data field corresponding to the data to be classified belongs has a previous-level data table, i.e. an upstream data table. When a previous-level data table exists, it is then determined whether the data field corresponding to the data to be classified originates from a data field in that previous-level data table. Tracing the source of a data field can be realized with the prior art and is not described in detail in this application.
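As an illustration of the upstream lookup in step S120, the following Python sketch stores field-level lineage edges and answers both the table-level and the field-level question. The class name `LineageNetwork` and the representation of a field as a `(table, field)` pair are assumptions made for this sketch; the description does not prescribe a data structure.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Field = Tuple[str, str]  # (table name, field name); a hypothetical representation


class LineageNetwork:
    """Illustrative lineage store: for each field, the set of fields it is derived from."""

    def __init__(self) -> None:
        self._upstream: Dict[Field, Set[Field]] = defaultdict(set)

    def add_edge(self, source: Field, target: Field) -> None:
        """Record that `target` is generated from `source` (for example by an ETL job)."""
        self._upstream[target].add(source)

    def upstream_fields(self, field: Field) -> List[Field]:
        """Previous-level data fields of `field`; an empty list means none exist."""
        return sorted(self._upstream.get(field, set()))

    def upstream_tables(self, table: str) -> List[str]:
        """Previous-level data tables of `table`, derived from the field-level edges."""
        return sorted({src_table
                       for (tbl, _), sources in self._upstream.items() if tbl == table
                       for (src_table, _) in sources})


# Data table A is generated from data tables B and C, as in the example above.
net = LineageNetwork()
net.add_edge(("B", "phone"), ("A", "phone"))
net.add_edge(("C", "city"), ("A", "city"))
```

With this structure, an empty result from `upstream_fields` corresponds to the "if not" branch of step S120.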

S130, determining the security level of the data to be classified according to the security level of the previous-level data. Specifically, step S130 includes:

If only one piece of previous-level data exists, the security level of that previous-level data is determined as the security level of the data to be classified; and if at least two pieces of previous-level data exist and their security levels are not all the same, the mode of the security levels of all the previous-level data is taken as the security level of the data to be classified.

Specifically, for example, for a data field a, if it is found that data field a is derived only from data field B of upstream data table A, that is, there is only one previous-level data field, then data field a inherits the security level of data field B; in other words, the security level of data field a equals that of data field B.

If instead it is found that data field a is derived from a plurality of data fields of upstream data table A, or from a plurality of data fields belonging to a plurality of upstream data tables, that is, there are a plurality of previous-level data fields, then: if the security levels of the previous-level data fields are not all the same, the mode of those security levels is taken as the security level of the data to be classified; if the security levels of all the previous-level data fields are the same, that common security level is taken as the security level of the data to be classified.
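The inheritance rule of step S130 can be sketched in a few lines of Python. Two points are assumptions of this sketch rather than statements of the description: security levels are modelled as integers, and a tie between equally frequent levels is resolved in favour of the higher level (the description does not say how ties are broken).

```python
from collections import Counter
from typing import Optional, Sequence


def inherit_level(upstream_levels: Sequence[int]) -> Optional[int]:
    """Derive a field's security level from its previous-level fields (step S130).

    Returns None when there is no previous-level data, in which case the
    caller would fall through to the model-based branch (step S140).
    """
    if not upstream_levels:
        return None
    counts = Counter(upstream_levels)
    highest_count = max(counts.values())
    # Assumption: when several levels are equally frequent, pick the higher one.
    return max(level for level, n in counts.items() if n == highest_count)
```

A single upstream field and several upstream fields with identical levels are both covered by the same code path, since the mode of a one-valued collection is that value.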

S140, training a preset classification model in sequence based on the first training set and the second training set respectively to obtain a target classification model. Specifically, the step S140 includes:

S141, training a preset classification model based on the first training set to obtain an initial classification model. The first training set is a preset corpus. In this embodiment, the initial classification model is a text classification model pre-trained based on BERT (Bidirectional Encoder Representations from Transformers).

S142, constructing a second training set containing at least three data sources. The first data source consists of security levels obtained by manual labeling. The second data source consists of security levels obtained based on step S100. The third data source consists of security levels obtained based on steps S110 to S130.

Step S100 is: extracting field information of the data to be classified, and determining, according to the field information and a preset database, whether a security level matching the field information exists. The preset database stores the security levels corresponding to different field information; that is, the security level of a data field is obtained by matching against the preset database. The preset database may be set according to the historical experience of those skilled in the art, and the present application is not limited thereto.

S143, training the initial classification model based on the second training set to obtain a target classification model.

In this embodiment, the preset classification model, that is, the text classification model, may be a deep neural network (DNN) model, which is not limited in this application. Before the second round of training, the preset classification model is first trained on a large corpus such as Baidu Encyclopedia. This avoids training a new model from scratch and lets the model learn rich semantic knowledge from a large-scale corpus, so that the subsequent training targeted at data security classification yields higher classification accuracy.

Then, in the second round of training, a second training set composed of three data sources is used, which enriches the diversity of the training data and helps improve the classification accuracy of the trained target classification model.
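The assembly of the second training set from its three data sources might look like the Python sketch below. It covers only the merging of labelled samples, not the model training itself. The sample format (field text paired with an integer level) and the precedence rule, under which manual labels override preset-database matches and those override lineage-inherited labels when the same field appears in several sources, are assumptions of this sketch; the description does not say how conflicts between sources are handled.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, int]  # (field text, security level); a hypothetical format


def build_second_training_set(
    manual: Iterable[Sample],
    preset_db_matches: Iterable[Sample],
    lineage_inherited: Iterable[Sample],
) -> List[Sample]:
    """Merge the three labelled data sources of step S142 into one training set."""
    merged: Dict[str, int] = {}
    # Assumption: sources applied later override earlier ones, so manual labels win.
    for source in (lineage_inherited, preset_db_matches, manual):
        for text, level in source:
            merged[text] = level
    return sorted(merged.items())


# Hypothetical example data.
second_set = build_second_training_set(
    manual=[("id_card", 4)],
    preset_db_matches=[("id_card", 3), ("phone", 3)],
    lineage_inherited=[("city", 1)],
)
```

The resulting list would then be passed to whatever fine-tuning routine is used for step S143.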

S150, taking the data to be classified as the input of the target classification model to obtain the security level corresponding to the data to be classified. That is, the target classification model takes the data field to be classified as input and outputs the security level of that data field.

Another embodiment of the present application discloses another data security classification method. On the basis of the above embodiment, the data security classification method further includes the steps of:

S160, setting different encryption algorithms for the corresponding data according to the characteristics of the different security levels, thereby realizing hierarchical encryption. Specifically, a first database that maps security levels to encryption algorithms may be constructed. After the security levels of all data fields have been determined, the encryption algorithm corresponding to each security level is obtained by matching the security level against the first database, and the corresponding data fields are encrypted with that algorithm. The higher the security level of a data field, the higher the encryption strength of the algorithm used; the lower the security level, the lower the encryption strength. Because different encryption algorithms differ in complexity, this scheme balances encryption efficiency against data security: it provides a suitable encryption algorithm for the security requirement of each data field, reducing encryption cost and improving encryption efficiency while still guaranteeing data security.
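A minimal Python sketch of the level-to-algorithm lookup in step S160 follows. The mapping stands in for the "first database"; the specific levels and algorithm names are hypothetical choices for illustration, since the description does not name any algorithm. The sketch only selects an algorithm per field and does not perform encryption.

```python
from typing import Dict

# Hypothetical "first database": security level -> encryption algorithm name.
# Both the levels and the algorithm choices are assumptions of this sketch.
LEVEL_TO_ALGORITHM: Dict[int, str] = {
    1: "none",
    2: "AES-128-GCM",
    3: "AES-256-GCM",
}


def plan_encryption(field_levels: Dict[str, int]) -> Dict[str, str]:
    """Select, for every data field, the algorithm mapped to its security level."""
    return {field: LEVEL_TO_ALGORITHM[level] for field, level in field_levels.items()}
```

A level missing from the mapping raises `KeyError`, which seems preferable to silently leaving a field unencrypted.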

Another embodiment of the present application discloses another data security classification method, as shown in fig. 2. On the basis of the above embodiment, before step S110, the data security classification method further includes the steps of:

S100, extracting field information of the data to be classified, and judging, according to the field information and a preset database, whether a security level matching the field information exists. The preset database stores the security levels corresponding to different field information. If yes, go to step S170: acquiring the security level corresponding to the field information. If not, go to step S110. The preset database can be generated in advance according to the historical experience of those skilled in the art. This scheme helps improve the coverage and accuracy of the security classification, and addresses the low coverage and accuracy of prior-art security classification methods when applied to the data fields of a large relational database.
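Step S100 amounts to a lookup against the preset database, which can be sketched in Python as follows. The dictionary contents, the use of the field name as the "field information", and the normalisation step are all assumptions of this sketch.

```python
from typing import Dict, Optional

# Hypothetical preset database: field information -> security level.
PRESET_LEVELS: Dict[str, int] = {
    "id_card_no": 4,
    "mobile_phone": 3,
    "city_name": 1,
}


def normalize(field_name: str) -> str:
    """Assumed normalisation: trim whitespace and lower-case the field name."""
    return field_name.strip().lower()


def lookup_preset_level(field_name: str,
                        preset: Dict[str, int] = PRESET_LEVELS) -> Optional[int]:
    """Step S100: return the matching security level, or None to continue with S110."""
    return preset.get(normalize(field_name))
```

A `None` result corresponds to the "if not, go to step S110" branch.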

The above step S120 includes:

if no previous-level data of the data to be classified exists in the blood relationship network, executing step S180: randomly extracting the data to be classified to form sample data, and constructing a regular expression; and

S190: performing regular-expression matching on the sample data based on the regular expression, and judging whether the matching succeeds.

If not, executing step S140;

if yes, executing step S200: obtaining the sample category obtained by matching, obtaining the security level of the sample category, and determining the security level of the sample category as the security level of the data to be classified.

For example, if the sample type obtained by matching is a mobile phone number, the security level of the mobile phone number is used as the security level of the corresponding data field.

The security classification of the data fields is realized by utilizing the regular matching, and the coverage rate and the accuracy rate of the data security classification are also improved.
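Steps S180 to S200 can be sketched in Python with the standard `re` and `random` modules. The patterns, the category levels, the sample size and the rule that a match "succeeds" when at least 90% of the sampled values fit a pattern are all assumptions of this sketch; the description only states that sample data is matched against a regular expression.

```python
import random
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical categories: name -> (pattern, security level).
CATEGORY_PATTERNS: Dict[str, Tuple[re.Pattern, int]] = {
    "mobile_phone": (re.compile(r"1[3-9]\d{9}"), 3),      # assumed 11-digit mobile format
    "email": (re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"), 2),
}


def classify_by_regex(values: List[str],
                      sample_size: int = 100,
                      min_ratio: float = 0.9,
                      seed: Optional[int] = None) -> Optional[int]:
    """Randomly sample a field's values and match them against each pattern.

    Returns the security level of the first category whose pattern fully
    matches at least `min_ratio` of the sample, or None if no category
    matches (the caller would then go to step S140).
    """
    if not values:
        return None
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    for _category, (pattern, level) in CATEGORY_PATTERNS.items():
        hits = sum(1 for value in sample if pattern.fullmatch(value))
        if hits / len(sample) >= min_ratio:
            return level
    return None
```

Sampling keeps the cost independent of the column size, which is in line with the stated aim of reducing computation on large databases.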

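Optionally, the method further includes the steps of: acquiring a modification behavior log recording modifications made by service personnel to the obtained security level of the data to be classified; acquiring, according to the modification behavior log, the modified field and the modification track corresponding to each modification behavior; and adjusting the security level of the fields associated with the modified field according to the modified field and the modification track.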

That is, when the system finds from the modification behavior log that service personnel have modified the security level of a certain data field, such as an identity card field, this indicates that the service personnel consider the security level assigned by the system inappropriate. The system then automatically changes the security levels of the other upstream and downstream fields related to the identity card field to the same new security level as the identity card field. In this way, the manual workload is reduced and the timeliness of security level setting is improved while data security is still ensured.
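The propagation of a manually corrected level to related upstream and downstream fields could be sketched in Python as below. The sketch assumes that "related" means every ancestor and every descendant of the modified field in the lineage network, and that the modified field and its new level have already been extracted from the modification behavior log; the function and field names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def _reachable(start: str, adjacency: Dict[str, Set[str]]) -> Set[str]:
    """All nodes reachable from `start` by following `adjacency` (breadth-first)."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def propagate_level(edges: Iterable[Tuple[str, str]],
                    levels: Dict[str, int],
                    modified_field: str,
                    new_level: int) -> Dict[str, int]:
    """Return a copy of `levels` with the new level applied to the modified
    field and to all of its upstream and downstream fields.

    `edges` are (upstream field, downstream field) pairs.
    """
    downstream: Dict[str, Set[str]] = {}
    upstream: Dict[str, Set[str]] = {}
    for src, dst in edges:
        downstream.setdefault(src, set()).add(dst)
        upstream.setdefault(dst, set()).add(src)
    related = _reachable(modified_field, upstream) | _reachable(modified_field, downstream)
    updated = dict(levels)
    for field in related | {modified_field}:
        updated[field] = new_level
    return updated


# Hypothetical example: the identity card field in a warehouse table is corrected to level 4.
edges = [("src.id_card", "dw.id_card"), ("dw.id_card", "rpt.id_card"), ("src.name", "dw.name")]
levels = {"src.id_card": 3, "dw.id_card": 3, "rpt.id_card": 3, "src.name": 2, "dw.name": 2}
updated = propagate_level(edges, levels, "dw.id_card", 4)
```

Upstream and downstream traversals are kept separate so that unrelated sibling fields, which merely share a common downstream table, are not changed.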

It should be noted that all the above embodiments disclosed in the present application can be combined arbitrarily, and the combined technical solutions are also within the scope of the present application.

As shown in fig. 3, an embodiment of the present invention further discloses a data security classification system 3, which includes:

A blood relationship network obtaining module 31, configured to obtain a blood relationship network of data to be classified, where the blood relationship network represents inheritance relationships among different data.

A previous-level data judgment module 32, configured to judge, according to the blood relationship network, whether the data to be classified has previous-level data; if yes, the first determining module 33 is executed; if not, the model fractional training module 34 is executed.

A first determining module 33, configured to determine the security level of the data to be classified according to the security level of the previous-level data.

A model fractional training module 34, configured to train a preset classification model sequentially based on the first training set and the second training set to obtain a target classification model.

A second determining module 35, configured to use the data to be classified as an input of the target classification model and obtain the security level corresponding to the data to be classified.

It is understood that the data security classification system of the present invention also includes other existing functional modules that support its operation. The data security classification system shown in fig. 3 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the present invention.

The data security classification system in this embodiment is used to implement the above data security classification method, so for the specific implementation steps of the data security classification system, reference may be made to the description of the data security classification method, and details are not described here again.

The embodiment of the invention also discloses data security classification equipment, which comprises a processor and a memory, wherein the memory stores executable instructions of the processor; the processor is configured to perform the steps of the above-described data security classification method via execution of executable instructions. Fig. 4 is a schematic structural diagram of the data security classification device disclosed by the invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 600 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 4, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, etc.

Wherein the storage unit stores program code that can be executed by the processing unit 610 such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention as described in the above-mentioned data security classification method section of the present specification. For example, processing unit 610 may perform the steps as shown in fig. 1.

The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.

The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.

The invention also discloses a computer readable storage medium for storing a program, which when executed implements the steps of the above data security classification method. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned data security classification method of the present description, when the program product is run on the terminal device.

As described above, when the program stored on the computer-readable storage medium of this embodiment is executed, it combines the blood-relationship-based classification method with the neural network model classification method for the security classification of a large relational database, thereby reducing the amount of computation and the computation time for classifying the data of the large relational database and improving the coverage and accuracy of the security classification.

Fig. 5 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 5, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Aimed at the security classification of large relational databases, the data security classification method, system, device and storage medium provided by the embodiments of the invention combine the blood-relationship-based classification method with the neural network model classification method, thereby reducing the amount of computation and the computation time for classifying the data of a large relational database and improving the coverage and accuracy of the security classification.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A data security classification method is characterized by comprising the following steps:

2. The data security classification method of claim 1, characterized in that the method further comprises the steps of:

3. The data security classification method of claim 1, characterized in that before step S110, it further comprises the steps of:

if so, acquiring the security level;

if not, go to step S110.

4. The data security classification method of claim 1, wherein the step S120 comprises:

constructing a regular expression;

if not, executing step S140;

5. The data security classification method of claim 1, characterized in that the method further comprises the steps of:

6. The data security classification method of claim 3, wherein the step S140 comprises:

7. The data security classification method of claim 1, wherein the step S130 comprises:

8. The method of data security classification of claim 6, characterized in that the initial classification model is a text classification model pre-trained based on BERT.

9. A data security classification system for implementing the data security classification method of claim 1, the system comprising:

10. A data security classification device, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the data security classification method of any of claims 1 to 8 via execution of the executable instructions.

11. A computer-readable storage medium storing a program which, when executed by a processor, performs the steps of the data security classification method of any one of claims 1 to 8.