CN112925914B - Data security grading method, system, equipment and storage medium

Info

Publication number: CN112925914B
Application number: CN202110352515.8A
Authority: CN
Inventors: 范遥新; 吴优; 骆垚; 申思
Original assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Network Technology Shanghai Co Ltd
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2024-03-15
Anticipated expiration: 2041-03-31
Also published as: CN112925914A

Abstract

The invention provides a data security grading method, system, device and storage medium. The method comprises the following steps: acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data; judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if yes, determining the security level of the data to be classified according to the security level of the previous-level data; if not, training a preset classification model on a first training set and then on a second training set to obtain a target classification model, and taking the data to be classified as the input of the target classification model to obtain the corresponding security level. The method reduces the amount of computation and the computation time required for security grading of a large-scale relational database, and improves the coverage and accuracy of the security grading.

Description

Data security grading method, system, equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data security classification method, system, device, and storage medium.

Background

Data security classification is crucial in data security governance. Classification makes the importance of data visible, and it is the foundation for drafting an organization's internal management policies, for putting a technical support system into practice, and for allocating effort sensibly during operation and maintenance. In subsequent use, different security protections are applied according to the different data levels.

Traditional data security grading methods are mainly based on keyword dictionaries, regular expressions and similar techniques. The first kind applies to data with explicit rules, such as bank card numbers and ID numbers, for which corresponding regular expressions can be constructed for matching. The second kind applies to data without explicit rules but with some regularity, such as names, addresses and WeChat IDs; for example, a data field containing city or street names can be identified by regular expression rules and naming conventions, and some sensitive data can be identified based on a keyword dictionary. These methods have a low degree of automation, low coverage and low accuracy. In a large relational database, as many as hundreds of millions of data fields must be graded; doing so with prior-art methods is computationally expensive and takes a significant amount of time.

Disclosure of Invention

Aiming at the problems in the prior art, the invention aims to provide a data security classification method, a system, equipment and a storage medium, which reduce the calculation amount and calculation time for security classification of a large-scale relational database and improve the coverage rate and accuracy rate of security classification.

To achieve the above object, the present invention provides a data security classification method, the method comprising the steps of:

S110, acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data;

S120, judging, according to the blood relationship network, whether previous-level data of the data to be classified exists; if yes, executing step S130; if not, executing step S140;

S130, determining the security level of the data to be classified according to the security level of the previous-level data;

S140, training a preset classification model in sequence based on the first training set and the second training set respectively to obtain a target classification model;

S150, taking the data to be classified as the input of the target classification model, and obtaining the security level corresponding to the data to be classified.

Optionally, the method further comprises the steps of:

S160, setting different encryption algorithms for corresponding data according to the characteristics of different security levels, and realizing hierarchical encryption.

Optionally, before step S110, the method further includes the steps of:

S100, extracting field information of data to be classified, and determining whether a security level matched with the field information exists or not according to the field information and a preset database; the preset database stores security levels corresponding to different field information;

if yes, the security level is obtained;

if not, step S110 is performed.

Optionally, the step S120 includes:

if the previous level data of the data to be classified does not exist, randomly extracting the data to be classified to form sample data;

constructing a regular expression;

performing regular matching on the sample data based on the regular expression, and judging whether the matching is successful or not;

if not, executing step S140;

if yes, obtaining a sample class obtained by matching, obtaining the security level of the sample class, and determining the security level of the sample class as the security level of the data to be classified.

Optionally, the method further comprises the steps of:

acquiring a modification behavior log of a business person on the security level of the obtained data to be classified;

acquiring a modification field and a modification track corresponding to the modification behavior according to the modification behavior log;

and adjusting the security level of the associated field of the modification field according to the modification field and the modification track.

Optionally, the step S140 includes:

training a preset classification model based on the first training set to obtain an initial classification model; the first training set is a preset corpus;

constructing a second training set comprising at least three data sources; the first data source consists of security levels obtained by manual labeling; the second data source consists of security levels obtained based on the step S100; the third data source consists of security levels obtained based on the steps S110 to S130;

training the initial classification model based on the second training set to obtain a target classification model.

Optionally, the step S130 includes:

if only one previous level data exists, determining the security level of the previous level data as the security level of the data to be classified;

and if at least two pieces of previous-level data exist and their security levels are not all the same, taking the mode of the security levels of all the previous-level data as the security level of the data to be classified.

Optionally, the initial classification model is a text classification model pre-trained based on BERT.

The invention also provides a data security grading system for realizing the data security grading method, the system comprising:

the blood relationship network acquisition module is used for acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data;

the upper level data judging module is used for judging whether the data to be classified have upper level data or not according to the blood relationship network; if yes, executing a first judging module; if not, executing a model fractional training module;

the first judging module is used for determining the security level of the data to be classified according to the security level of the previous level data;

the model fractional training module is used for training a preset classification model in sequence based on the first training set and the second training set respectively to obtain a target classification model;

and the second judging module is used for taking the data to be classified as the input of the target classification model to obtain the security level corresponding to the data to be classified.

The invention also provides a data security grading device, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of any of the data security ranking methods described above via execution of the executable instructions.

The present invention also provides a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of any one of the data security classification methods described above.

Compared with the prior art, the invention has the following advantages and outstanding effects:

according to the data security classification method, system, equipment and storage medium, the calculation amount and calculation time for classifying the data of the large-scale relational database are reduced by combining the blood relationship classification method and the neural network model classification method aiming at the security classification of the large-scale relational database, and the coverage rate and accuracy of the security classification are improved.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.

FIG. 1 is a schematic diagram of a data security classification method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a data security classification method according to another embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a data security classification system according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a data security classification apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus a repetitive description thereof will be omitted.

As shown in fig. 1, an embodiment of the present invention discloses a data security classification method, which includes the following steps:

S110, acquiring a blood relationship network of data to be classified, wherein the blood relationship network represents inheritance relationships among different data. Specifically, a database includes at least one data table, and each data table has at least one data field. In this embodiment, all the data tables in the database are acquired first, and then all the data fields in each data table are acquired. The object being graded is the data field, so the data to be classified is the data field. The security level of a data table can then be obtained from the security levels of its data fields, and the security level of the database from the security levels of its data tables.

Wherein the security level of the data table may be the highest of the security levels of all the data fields it contains. The security level of a database may be the highest of the security levels of all the data tables it contains.
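The roll-up rule above can be illustrated with a minimal Python sketch. It assumes that security levels are encoded as integers in which a larger number means a more sensitive level; that encoding, the function names and the sample tables are illustrative assumptions and are not taken from the patent.

```python
from typing import Dict


def table_level(field_levels: Dict[str, int]) -> int:
    """Security level of a table: the highest level among its fields."""
    return max(field_levels.values())


def database_level(tables: Dict[str, Dict[str, int]]) -> int:
    """Security level of a database: the highest level among its tables."""
    return max(table_level(fields) for fields in tables.values())


# Hypothetical example data: integer levels, larger meaning more sensitive.
tables = {
    "orders": {"order_id": 1, "phone": 3},
    "users": {"user_id": 1, "id_card_no": 4},
}
```

With this sample data, the "orders" table takes the level of its "phone" field and the database takes the level of the "users" table.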

In this step, the blood relationship network of a data field can be obtained by obtaining the blood relationship network of the data table to which the data field belongs. The blood relationship network of each data table can be computed from the flow by which the data fields, i.e., the data to be classified, are generated. The blood relationship network characterizes inheritance relationships among different data tables.

S120, judging, according to the blood relationship network, whether previous-level data of the data to be classified exists in the blood relationship network; if yes, executing step S130; if not, executing step S140. Specifically, if data table A is generated from data table B and data table C through one ETL (Extract-Transform-Load) flow, the upstream data tables of data table A, i.e., its previous-level data tables, are data table B and data table C. Conversely, data table A is a downstream data table, i.e., a next-level data table, of data table B and data table C.

In this step, it is judged whether the data table to which the data field corresponding to the data to be classified belongs has a previous-level data table, i.e., an upstream data table. When a previous-level data table exists, it is then judged whether the data field is derived from a previous-level data field in that table. Tracing a data field to its source can be implemented with the prior art and is not described in detail here.
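As a rough illustration of steps S110 and S120, the blood relationship network can be held as a mapping from each data field to the upstream fields it is derived from. The representation below (a dictionary keyed by table and field name) and the sample entries are assumptions made for this sketch; the patent leaves the source tracing itself to the prior art.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (table name, field name)

# Hypothetical network: each field maps to the upstream fields it derives from.
lineage: Dict[Field, List[Field]] = {
    ("A", "phone"): [("B", "mobile")],
    ("A", "contact"): [("B", "mobile"), ("C", "email")],
}


def upstream_fields(field: Field, network: Dict[Field, List[Field]]) -> List[Field]:
    """Return the previous-level fields of `field`, or an empty list."""
    return list(network.get(field, []))


def has_upstream(field: Field, network: Dict[Field, List[Field]]) -> bool:
    """Step S120: does previous-level data exist for this field?"""
    return bool(upstream_fields(field, network))
```

A field that is absent from the mapping, such as a field of a source table, has no previous-level data and would proceed to the later steps.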

S130, determining the security level of the data to be classified according to the security level of the previous level data. Specifically, step S130 includes:

if only one previous-level data exists, the security level of the previous-level data is determined as the security level of the data to be classified; and

if at least two previous-level data exist and their security levels are not all the same, the mode of the security levels of all the previous-level data is taken as the security level of the data to be classified.

Specifically, for example, for a data field A, if it is found that data field A is derived only from data field B of an upstream data table, that is, there is only one previous-level data field, then data field A inherits the security level of data field B; in other words, the security level of data field A is equal to the security level of data field B.

If data field A is found to be derived from multiple data fields of one upstream data table, or from multiple data fields of multiple upstream data tables, that is, multiple previous-level data fields exist, then: if the security levels of the previous-level data fields are not all the same, the mode of those security levels is taken as the security level of the data to be classified; if the security levels of all the previous-level data fields are the same, that security level is taken as the security level of the data to be classified.
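The inheritance rule of step S130 can be sketched in Python as follows. Integer-encoded levels and the function name are assumptions. The patent does not say how to choose when several levels are equally frequent, so the tie-break toward the higher level below is an assumption of this sketch only.

```python
from collections import Counter
from typing import List


def inherit_level(upstream_levels: List[int]) -> int:
    """Step S130: derive a field's level from its previous-level fields."""
    if not upstream_levels:
        raise ValueError("no previous-level data to inherit from")
    if len(upstream_levels) == 1:
        # Only one previous-level field: inherit its level directly.
        return upstream_levels[0]
    counts = Counter(upstream_levels)
    top = max(counts.values())
    # Mode of the upstream levels. Assumption (not in the patent):
    # ties between equally frequent levels go to the higher level.
    return max(level for level, n in counts.items() if n == top)
```

When all upstream levels are identical, the mode is simply that level, which matches the last case described above.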

S140, training a preset classification model in sequence based on the first training set and the second training set respectively to obtain a target classification model. Specifically, the step S140 includes:

S141, training a preset classification model based on the first training set to obtain an initial classification model. The first training set is a preset corpus. In this embodiment, the initial classification model is a text classification model pre-trained based on BERT (Bidirectional Encoder Representations from Transformers), a bidirectional encoder representation built on the Transformer model.

S142, constructing a second training set comprising at least three data sources. The first data source consists of security levels obtained by manual annotation. The second data source consists of security levels obtained based on step S100. The third data source consists of security levels obtained based on steps S110 to S130.

The step S100 is as follows: extracting field information of data to be classified, and determining whether a security level matched with the field information exists or not according to the field information and a preset database. The preset database stores security levels corresponding to different field information. The security level of the data field is obtained by matching against the preset database. The preset database may be set according to the historical experience of those skilled in the art, which is not limited in this application.
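A minimal sketch of the lookup in step S100 is given below. The patent does not specify what the field information contains or how the preset database is stored; this sketch assumes the field information is the field name and that the preset database is an in-memory dictionary. The entries and the normalization are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical preset database: normalized field name -> security level.
PRESET_LEVELS: Dict[str, int] = {
    "id_card_no": 4,
    "mobile": 3,
    "city": 1,
}


def lookup_preset_level(field_name: str,
                        preset: Dict[str, int] = PRESET_LEVELS) -> Optional[int]:
    """Step S100: return the matched security level, or None if no match."""
    return preset.get(field_name.strip().lower())
```

A return value of None corresponds to the "if not" branch, in which the method continues with step S110.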

S143, training the initial classification model based on the second training set to obtain a target classification model.

In this embodiment, the preset classification model, i.e., the text classification model, may be a DNN (deep neural network) model, which is not limited in this application. Before the second training, the preset classification model is trained on a large corpus such as Baidu Baike (Baidu Encyclopedia). This avoids training a new model from scratch and lets the model learn rich semantic knowledge from a large-scale corpus, so that data security can then be graded in a targeted manner and the classification accuracy of the model is improved.

Then, in the second training, a second training set formed from three data sources is used, which enriches the data diversity of the second training set and improves the classification accuracy of the resulting target classification model.
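The construction of the second training set in step S142 might look like the following Python sketch, which only merges labeled samples and does not perform the model training itself. The sample format (field text paired with an integer level) and the rule that manual labels override automatically derived ones when the same field appears in several sources are assumptions of this sketch; the patent does not describe how conflicts between sources are handled.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (field text, security level)


def build_second_training_set(manual: List[Sample],
                              preset_matched: List[Sample],
                              lineage_inferred: List[Sample]) -> List[Sample]:
    """Merge the three data sources of step S142 into one labeled set."""
    merged: Dict[str, int] = {}
    # Later sources overwrite earlier ones. Assumption (not in the patent):
    # manual labels are the most trusted, so they are applied last.
    for source in (lineage_inferred, preset_matched, manual):
        for text, level in source:
            merged[text] = level
    return sorted(merged.items())


manual = [("id_card_no", 4)]
preset_matched = [("mobile", 3), ("id_card_no", 3)]
lineage_inferred = [("contact_phone", 3)]
second_training_set = build_second_training_set(manual, preset_matched, lineage_inferred)
```

The resulting list would then be used to continue training the initial classification model obtained in step S141.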

S150, taking the data to be classified as the input of the target classification model, and obtaining the security level corresponding to the data to be classified. That is, the target classification model takes the data field to be classified as input, and the corresponding output is the security level of the data field.

Another embodiment of the present application discloses another data security classification method. The data security grading method further comprises the steps of:

S160, setting different encryption algorithms for corresponding data according to the characteristics of different security levels, so as to realize hierarchical encryption. Specifically, a first database that maps security levels to encryption algorithms may be constructed. After the security level of every data field has been determined, the encryption algorithm corresponding to each security level is obtained by matching the security level against the first database, and the corresponding data fields are encrypted with that algorithm. The higher the security level of a data field, the higher the encryption level of the encryption algorithm used; the lower the security level, the lower the encryption level. Because different encryption algorithms differ in encryption complexity, this balances encryption efficiency against data security: a suitable encryption algorithm is provided according to the security requirements of each data field, which reduces encryption cost and improves encryption efficiency while data security is still guaranteed.
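The mapping described in step S160 can be sketched as a simple lookup. The patent does not name any encryption algorithm, so the three AES key sizes below are purely illustrative choices, and the sketch only selects an algorithm per field without performing any encryption.

```python
from typing import Dict

# Hypothetical "first database": security level -> encryption algorithm.
# The algorithm choices are illustrative and are not specified in the patent.
LEVEL_TO_ALGORITHM: Dict[int, str] = {
    1: "AES-128",
    2: "AES-192",
    3: "AES-256",
}


def plan_encryption(field_levels: Dict[str, int],
                    mapping: Dict[int, str] = LEVEL_TO_ALGORITHM) -> Dict[str, str]:
    """Step S160: pick the encryption algorithm for each graded field."""
    return {field: mapping[level] for field, level in field_levels.items()}


plan = plan_encryption({"city": 1, "mobile": 3})
```

Fields with a higher level are assigned the algorithm with the larger key size, mirroring the rule that a higher security level uses a higher encryption level.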

As shown in fig. 2, another embodiment of the present application discloses another data security ranking method. On the basis of the above embodiment, the data security grading method further includes, before step S110, the steps of:

S100, extracting field information of data to be classified, and judging whether a security level matched with the field information exists or not according to the field information and a preset database. The preset database stores security levels corresponding to different field information. If yes, step S170 is executed: acquiring the security level corresponding to the field information. If not, step S110 is performed. This technical scheme helps to improve the coverage and accuracy of security grading, and avoids the low coverage and accuracy that some prior-art data field grading methods exhibit on large-scale relational databases.

The step S120 includes:

if there is no previous-level data of the data to be classified in the blood relationship network, step S180 is executed: randomly extracting data to be classified to form sample data; constructing a regular expression; and

S190: performing regular matching on the sample data based on the regular expression, and judging whether the matching is successful.

If not, step S140 is executed;

if yes, step S200 is executed: obtaining a sample class obtained by matching, obtaining the security level of the sample class, and determining the security level of the sample class as the security level of the data to be classified.

For example, if the sample class obtained by matching is a mobile phone number, the security level of the mobile phone number is used as the security level of the corresponding data field.

The regular matching is utilized to realize the security classification of the data fields, and the coverage rate and the accuracy rate of the security classification of the data are also improved.
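Steps S180 to S200 can be sketched with Python's standard `re` module. The patterns, the levels attached to them, the sample size and the rule that a match counts as successful when at least 80% of the sampled values match are all assumptions of this sketch; the patent does not state what counts as a successful match.

```python
import random
import re
from typing import List, Optional, Tuple

# Hypothetical sample classes: name -> (pattern, security level).
PATTERNS = {
    "mobile_number": (re.compile(r"1[3-9]\d{9}"), 3),
    "email": (re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"), 2),
}


def classify_by_regex(values: List[str],
                      sample_size: int = 100,
                      min_ratio: float = 0.8,
                      seed: int = 0) -> Optional[Tuple[str, int]]:
    """Sample a field's values and return (sample class, level) on a match."""
    if not values:
        return None
    # Step S180: random extraction of sample data (seeded for repeatability).
    sample = random.Random(seed).sample(values, min(sample_size, len(values)))
    # Step S190: regular matching against each candidate pattern.
    for name, (pattern, level) in PATTERNS.items():
        hits = sum(1 for value in sample if pattern.fullmatch(value))
        # Assumed success criterion: enough of the sample matches.
        if hits / len(sample) >= min_ratio:
            return name, level  # Step S200
    return None  # no match: continue with step S140
```

For a field whose sampled values all look like mobile phone numbers, the function returns the mobile number class and its level, which corresponds to the example above.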

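In another embodiment, the method further includes: acquiring a modification behavior log of a business person on the security level obtained for the data to be classified; acquiring, according to the modification behavior log, the modification field and the modification track corresponding to the modification behavior; and adjusting the security level of the fields associated with the modification field according to the modification field and the modification track. That is, when the system finds from the modification behavior log that a business person has modified the security level of a certain data field, such as an ID card field, this indicates that the security level given by the system was not suitable, and the system automatically modifies the security levels of the other upstream and downstream fields related to the ID card field to the same new security level as that field. In this way, the manual workload is reduced and the timeliness of security level setting is improved while data security is ensured.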
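The propagation of a manual correction to related upstream and downstream fields could be sketched as a graph traversal over the blood relationship network. The patent does not define how far "related" extends; this sketch assumes every field reachable through upstream or downstream links is related, which also reaches sibling fields that share an ancestor. The field names and data layout are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def related_fields(field: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """Fields reachable from `field` via upstream or downstream links."""
    neighbours: Dict[str, Set[str]] = {}
    for child, parents in upstream.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    seen = {field}
    queue = deque([field])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(field)
    return seen


def apply_modification(field: str, new_level: int,
                       upstream: Dict[str, List[str]],
                       levels: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of `levels` with the change applied to related fields."""
    updated = dict(levels)
    updated[field] = new_level
    for other in related_fields(field, upstream):
        updated[other] = new_level
    return updated


upstream = {"dw.id_card": ["ods.id_card"], "rpt.id_card": ["dw.id_card"],
            "rpt.city": ["ods.city"]}
levels = {"ods.id_card": 3, "dw.id_card": 3, "rpt.id_card": 3,
          "ods.city": 1, "rpt.city": 1}
new_levels = apply_modification("dw.id_card", 4, upstream, levels)
```

In this sample, the three ID card fields move to the new level together, while the unrelated city fields keep their original level.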

It should be noted that all the embodiments disclosed in the present application may be combined arbitrarily, and the combined technical solution is also within the protection scope of the present application.
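One possible combination of the embodiments is a cascade in which cheaper checks run first and the classification model is used only as a last resort. The sketch below shows that control flow only; the resolver functions are stand-ins supplied by the caller, and the names, signatures and ordering are assumptions of this sketch rather than the patent's own implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

Resolver = Callable[[str], Optional[int]]


def grade_field(field: str,
                resolvers: Sequence[Tuple[str, Resolver]],
                model: Callable[[str], int]) -> Tuple[int, str]:
    """Try each resolver in order; fall back to the target model (S150)."""
    for step, resolve in resolvers:
        level = resolve(field)
        if level is not None:
            return level, step
    return model(field), "S150"


# Stand-in resolvers for illustration only.
preset = {"id_card_no": 4}         # preset database match (S100/S170)
inherited = {"contact_phone": 3}   # level inherited via lineage (S130)
resolvers = [
    ("S170", preset.get),
    ("S130", inherited.get),
    ("S200", lambda field: None),  # regular matching that finds nothing
]
```

Calling `grade_field("remark", resolvers, model)` with any model function would reach the model only after the three earlier steps return no level, which is how the cascade saves model computation on a large database.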

As shown in fig. 3, an embodiment of the present invention further discloses a data security grading system 3, which includes:

the blood relationship network obtaining module 31 is configured to obtain a blood relationship network of data to be classified, where the blood relationship network characterizes inheritance relationships among different data.

The previous level data judging module 32 is configured to judge whether the to-be-classified data has previous level data according to the blood relationship network; if yes, executing a first judging module; and if not, executing the model fractional training module.

The first determining module 33 is configured to determine a security level of the data to be classified according to the security level of the previous level data.

The model fractional training module 34 is configured to sequentially train a preset classification model based on the first training set and the second training set, respectively, to obtain a target classification model.

And a second determining module 35, configured to take the data to be classified as input of the target classification model, and obtain a security level corresponding to the data to be classified.

It will be appreciated that the data security rating system of the present invention also includes other existing functional modules that support the operation of the data security rating system. The data security ranking system shown in fig. 3 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

The data security classification system in this embodiment is used to implement the above-mentioned method for classifying data security, so for the specific implementation steps of the data security classification system, reference may be made to the above description of the method for classifying data security, which is not repeated here.

The embodiment of the invention also discloses a data security grading device, which comprises a processor and a memory, wherein the memory stores executable instructions of the processor; the processor is configured to perform the steps in the data security ranking method described above via execution of the executable instructions. Fig. 4 is a schematic structural diagram of a data security grading device disclosed in the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 4. The electronic device 600 shown in fig. 4 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 4, the electronic device 600 is embodied in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including memory unit 620 and processing unit 610), a display unit 640, etc.

The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to the various exemplary embodiments of the present invention described in the data security grading method section of this specification. For example, the processing unit 610 may perform the steps shown in FIG. 1.

The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.

The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

Bus 630 may be a local bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.

The invention also discloses a computer readable storage medium for storing a program which when executed implements the steps in the data security classification method described above. In some possible embodiments, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the above description of the data security ranking method, when the program product is run on the terminal device.

As described above, the program of the computer-readable storage medium of this embodiment, when executed, reduces the amount of computation and the computation time for classifying data of a large relational database by combining the blood relationship classification method and the neural network model classification method with respect to the security classification of the large relational database, and improves the coverage rate and accuracy of the security classification.

Fig. 5 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 5, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

The data security grading method, system, equipment and storage medium provided by the embodiments of the invention are directed at the security grading of large-scale relational databases. By combining the blood relationship grading method with the neural network model classification method, they reduce the amount of calculation and the calculation time required to grade the data of a large-scale relational database, and improve the coverage rate and accuracy of the security grading.
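As a non-limiting illustration, the combination of the two methods can be sketched as a cascade in Python: the blood relationship network is consulted first, regular matching on randomly extracted samples is tried next, and the classification model is called only when both fail. The function names, the two regular expressions, the numeric levels, the rule of taking the highest level among the previous-level data, and the `min_ratio` threshold for a successful match are all assumptions made for this sketch and are not specified by the invention.

```python
import random
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical regular expressions and levels; the real rules are not given here.
REGEX_RULES: List[Tuple[re.Pattern, int]] = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), 3),  # e-mail-like values
    (re.compile(r"1[3-9]\d{9}"), 3),               # 11-digit mobile-like values
]


def grade_by_lineage(field: str,
                     parents: Dict[str, List[str]],
                     known_levels: Dict[str, int]) -> Optional[int]:
    """Use the levels of previous-level data, if any of it is already graded.

    Assumption: the highest upstream level is inherited.
    """
    levels = [known_levels[p] for p in parents.get(field, []) if p in known_levels]
    return max(levels) if levels else None


def grade_by_regex(samples: List[str], min_ratio: float = 0.8) -> Optional[int]:
    """Return the level of the first rule matched by enough samples.

    Assumption: matching succeeds when at least min_ratio of samples match.
    """
    if not samples:
        return None
    for pattern, level in REGEX_RULES:
        hits = sum(1 for s in samples if pattern.fullmatch(s))
        if hits / len(samples) >= min_ratio:
            return level
    return None


def grade(field: str,
          values: List[str],
          parents: Dict[str, List[str]],
          known_levels: Dict[str, int],
          model: Callable[[str, List[str]], int],
          sample_size: int = 100,
          seed: int = 0) -> Tuple[int, str]:
    """Grade one field and report which stage produced the level."""
    level = grade_by_lineage(field, parents, known_levels)
    if level is not None:
        return level, "lineage"
    # Randomly extract sample data from the data to be classified.
    rng = random.Random(seed)
    samples = rng.sample(values, min(sample_size, len(values)))
    level = grade_by_regex(samples)
    if level is not None:
        return level, "regex"
    # Fall back to the (far more expensive) classification model.
    return model(field, samples), "model"
```

In this sketch the `model` argument stands in for the target classification model; because it is reached only when the first two stages return nothing, the model is invoked for a fraction of the fields, which is where the reduction in calculation would come from.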

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A data security grading method, comprising the steps of:

if yes, the security level is obtained;

if not, executing step S110;

S120, judging whether previous-level data of the data to be classified exists according to the blood relationship network; if yes, executing step S130; if not, randomly extracting from the data to be classified to form sample data; constructing a regular expression; performing regular matching on the sample data based on the regular expression, and judging whether the matching is successful; if not, executing step S140; if yes, obtaining the sample class obtained by the matching, obtaining the security level of the sample class, and determining the security level of the sample class as the security level of the data to be classified;

S140, training a preset classification model in sequence based on a first training set and a second training set respectively to obtain a target classification model, wherein the first training set is used for training the preset classification model to obtain an initial classification model; the first training set is a preset corpus; constructing the second training set, which comprises at least three data sources; the first data source is data whose security level is obtained by manual labeling; the second data source is data whose security level is obtained based on step S100; the third data source is data whose security level is obtained based on steps S110 to S130; training the initial classification model based on the second training set to obtain the target classification model;
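As a non-limiting illustration of constructing the second training set in step S140, the following Python sketch merges labeled examples from the three data sources into one set. The `Example` record, the source tags, and the rule that a manually labeled example takes precedence over an automatically labeled one for the same text are assumptions made for this sketch; the claim does not state how conflicts between data sources are resolved.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Example:
    text: str    # text describing the data, fed to the classifier (hypothetical)
    level: int   # security level used as the training label
    source: str  # "manual", "s100", or "s110_s130" (hypothetical tags)


# Assumed precedence: a lower number wins when two sources label the same text.
PRIORITY = {"manual": 0, "s100": 1, "s110_s130": 2}


def build_second_training_set(*sources: Iterable[Example]) -> List[Example]:
    """Merge the data sources, keeping one label per distinct text."""
    best: Dict[str, Example] = {}
    for source in sources:
        for ex in source:
            kept = best.get(ex.text)
            if kept is None or PRIORITY[ex.source] < PRIORITY[kept.source]:
                best[ex.text] = ex
    # Sort for a deterministic order before the set is handed to training.
    return sorted(best.values(), key=lambda e: e.text)


manual = [Example("user.email", 3, "manual")]
from_s100 = [Example("user.email", 2, "s100"), Example("order.id", 1, "s100")]
from_s110_s130 = [Example("user.phone", 3, "s110_s130")]
second_training_set = build_second_training_set(manual, from_s100, from_s110_s130)
```

The resulting list would then be used to train the initial classification model, which has already been trained on the preset corpus, to obtain the target classification model.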

2. The data security grading method of claim 1, wherein the method further comprises the steps of:

3. The data security grading method of claim 1, wherein the method further comprises the steps of:

4. The data security grading method according to claim 1, wherein the step S130 comprises:

5. The data security grading method of claim 1, wherein the initial classification model is a text classification model based on pre-trained BERT.

6. A data security grading system for implementing the data security grading method of claim 1, the system comprising:

7. A data security grading apparatus, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the data security grading method of any one of claims 1 to 5 via execution of the executable instructions.

8. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the steps of the data security grading method of any one of claims 1 to 5.