CN114840869A - Data sensitivity identification method and device based on sensitivity identification model - Google Patents

Data sensitivity identification method and device based on sensitivity identification model

Info

Publication number
CN114840869A
CN114840869A CN202110139667.XA CN202110139667A CN114840869A CN 114840869 A CN114840869 A CN 114840869A CN 202110139667 A CN202110139667 A CN 202110139667A CN 114840869 A CN114840869 A CN 114840869A
Authority
CN
China
Prior art keywords
data
sensitivity
identified
metadata
identification
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110139667.XA
Other languages
Chinese (zh)
Inventor
赵文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110139667.XA
Publication of CN114840869A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data sensitivity identification method based on a sensitivity identification model, a training method for the sensitivity identification model, and a corresponding device, equipment and computer-readable storage medium. The sensitivity identification model comprises a feature extraction layer and a sensitivity identification layer, and the data sensitivity identification method comprises the following steps: acquiring metadata of data to be identified, wherein the metadata is used for describing the data to be identified; performing feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain data features of the metadata; and performing, through the sensitivity identification layer, sensitivity identification on the data to be identified based on the data features of the metadata to obtain a sensitivity identification result, wherein the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified. The method and the device can improve the efficiency of data sensitivity identification.

Description

Data sensitivity identification method and device based on sensitivity identification model
Technical Field
The present application relates to artificial intelligence and internet technologies, and in particular, to a data sensitivity identification method based on a sensitivity identification model, a training method for the sensitivity identification model, and a corresponding apparatus, device, and computer-readable storage medium.
Background
In the data asset management of internet enterprises, as business develops and user activity increases, a large amount of valuable data accumulates in database tables or text. Data sensitivity, as a part of the metadata, classifies data according to leakage risk, which makes it easier for developers to use the data and keep it confidential. However, if some valuable data lacks a specific data sensitivity or risk level and is not managed and maintained by developers, the data may be leaked when used, which may have a great impact on the business.
In the related art, data sensitivity is identified manually, that is, a database administrator determines the data sensitivity of the data to be identified according to personal experience. This approach is time-consuming and labor-intensive, and the probability of missing sensitive data is high.
Disclosure of Invention
The embodiment of the application provides a data sensitivity identification method based on a sensitivity identification model, a training method for the sensitivity identification model, and a corresponding device, equipment and computer-readable storage medium, which can improve the identification efficiency of the data sensitivity and reduce the probability of overlooking sensitive data.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data sensitivity identification method based on a sensitivity identification model, wherein the sensitivity identification model comprises a feature extraction layer and a sensitivity identification layer, and the method comprises the following steps:
acquiring metadata of data to be identified, wherein the metadata is used for describing the data to be identified;
performing feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain data features of the metadata;
through the sensitivity identification layer, based on the data characteristics of the metadata, sensitivity identification is carried out on the data to be identified to obtain a sensitivity identification result;
and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
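The steps above can be sketched as a two-stage pipeline. The following is a minimal illustration under stated assumptions, not the implementation of the application: the fixed vocabulary, the hand-set weights, the two sensitivity levels, and the names `extract_features`, `identify_sensitivity` and `identify` are all hypothetical, and a simple word count stands in for a learned feature extraction layer.

```python
from typing import List

# Hypothetical vocabulary and sensitivity levels, assumed for illustration only.
VOCAB: List[str] = ["user", "phone", "id_card", "order", "log"]
LEVELS: List[str] = ["low", "high"]

# Hypothetical hand-set weights: one row per level, one column per vocabulary word.
WEIGHTS: List[List[float]] = [
    [0.1, -1.0, -1.0, 0.5, 1.0],   # scores for "low"
    [0.1, 1.0, 1.0, -0.5, -1.0],   # scores for "high"
]


def extract_features(metadata: str) -> List[float]:
    """Stand-in for the feature extraction layer: count vocabulary words."""
    tokens = metadata.lower().replace(".", " ").split()
    return [float(tokens.count(word)) for word in VOCAB]


def identify_sensitivity(features: List[float]) -> str:
    """Stand-in for the sensitivity identification layer: highest linear score wins."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]
    return LEVELS[scores.index(max(scores))]


def identify(metadata: str) -> str:
    """Metadata in, sensitivity identification result out."""
    return identify_sensitivity(extract_features(metadata))
```

Only the metadata string is passed in; the data to be identified itself is never read, which is the point of working on metadata.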
The embodiment of the application provides a training method of a sensitivity recognition model, wherein the sensitivity recognition model comprises a feature extraction layer and a sensitivity recognition layer, and the method comprises the following steps:
acquiring metadata of a data sample, wherein the data sample carries a sensitivity label which is used for indicating the data sensitivity corresponding to the data sample;
performing feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
performing sensitivity identification on the data sample based on the sample data characteristics through the sensitivity identification layer to obtain a sample sensitivity identification result;
acquiring the difference between the sample sensitivity identification result and a sensitivity label carried by the data sample, and updating the model parameters of the sensitivity identification model based on the difference;
the sensitivity identification model is used for outputting a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified after the metadata of the data to be identified is input to the sensitivity identification model.
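The training steps above (compare the sample result with the label, then update the parameters from the difference) can be sketched as follows. This is an assumption-laden illustration: the application does not specify a loss or optimizer here, so cross-entropy with plain gradient descent is swapped in, the model is reduced to a single linear layer, and the toy samples and the name `train_step` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights: List[List[float]], features: List[float],
               label: int, lr: float = 0.5) -> float:
    """One update: measure the difference from the label, then adjust weights in place.

    Returns the cross-entropy loss computed before the update.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        # Gradient of cross-entropy w.r.t. score k is (probability - one-hot label).
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(features):
            row[j] -= lr * grad * x
    return loss


# Toy samples (assumed): (sample data features, sensitivity label index).
SAMPLES: List[Tuple[List[float], int]] = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
weights: List[List[float]] = [[0.0, 0.0], [0.0, 0.0]]
losses: List[float] = []
for _ in range(20):
    losses.append(sum(train_step(weights, x, y) for x, y in SAMPLES))
```

The per-epoch loss stored in `losses` falls as the weights move toward the labels; in a full model the same difference would also be propagated back into the feature extraction layer.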
The embodiment of the application provides a data sensitivity identification device based on a sensitivity identification model, wherein the sensitivity identification model comprises a feature extraction layer and a sensitivity identification layer, and the device comprises:
the first acquisition module is used for acquiring metadata of data to be identified, wherein the metadata is used for describing the data to be identified;
the first extraction module is used for performing feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain the data features of the metadata;
the first identification module is used for carrying out sensitivity identification on the data to be identified based on the data features of the metadata through the sensitivity identification layer to obtain a sensitivity identification result;
and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
In the foregoing solution, the first obtaining module is further configured to, when the storage form of the data to be identified is a data table, obtain at least one of the following table elements from the data table: the data table name, the table description corresponding to the data to be identified in the data table, and the attribute field corresponding to the data to be identified in the data table;
and determining the acquired table elements as the metadata of the data to be identified.
In the foregoing solution, the first obtaining module is further configured to, when the storage form of the data to be identified is a document, obtain at least one of the following document contents from the document: document title, document abstract, document keyword;
and determining the obtained document content as the metadata of the data to be identified.
In the above scheme, the first extraction module is further configured to perform word segmentation on the metadata of the data to be identified to obtain a plurality of words corresponding to the metadata;
respectively carrying out feature coding on each word to obtain word features corresponding to each word;
and performing characteristic splicing on the word characteristics corresponding to each word to obtain the data characteristics corresponding to the metadata.
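The three operations above (word segmentation, per-word feature coding, and splicing) can be sketched as below. This is only an illustration: the two-dimensional word features, the splitting rule, and the names `segment`, `encode` and `splice` are assumptions, and a real system would use a proper word segmenter and learned word features.

```python
from typing import Dict, List

# Hypothetical 2-dimensional word features, assumed for illustration.
EMBEDDINGS: Dict[str, List[float]] = {
    "user": [1.0, 0.0],
    "phone": [0.0, 1.0],
}
UNKNOWN: List[float] = [0.0, 0.0]  # feature used for words outside the table


def segment(metadata: str) -> List[str]:
    """Split metadata into words on underscores and whitespace."""
    return [w for w in metadata.lower().replace("_", " ").split() if w]


def encode(word: str) -> List[float]:
    """Feature coding: look up the word feature for a single word."""
    return EMBEDDINGS.get(word, UNKNOWN)


def splice(word_features: List[List[float]]) -> List[float]:
    """Feature splicing: concatenate per-word features into one data feature."""
    spliced: List[float] = []
    for feature in word_features:
        spliced.extend(feature)
    return spliced


data_feature = splice([encode(w) for w in segment("user_phone")])
```

For the field name `user_phone`, the resulting data feature is the feature of `user` followed by the feature of `phone`.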
In the above scheme, the first extraction module is further configured to perform bidirectional encoding processing on the word features of each word respectively to obtain a preceding-context encoding feature and a following-context encoding feature corresponding to each word;
respectively performing feature splicing on the preceding-context encoding feature and the following-context encoding feature of each word to obtain corresponding spliced encoding features;
and performing feature splicing on the spliced encoding features corresponding to each word to obtain the data features corresponding to the metadata.
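The bidirectional encoding described above yields, for each word, one encoding feature that depends on the words before it and one that depends on the words after it. The sketch below illustrates only that data flow: a fixed exponential running average is swapped in for a learned bidirectional encoder, and the `decay` parameter and function names are hypothetical.

```python
from typing import List

Vector = List[float]


def run_direction(word_features: List[Vector], decay: float = 0.5) -> List[Vector]:
    """Carry a running state along the sequence: state = decay * state + (1 - decay) * x."""
    state = [0.0] * len(word_features[0])
    states: List[Vector] = []
    for x in word_features:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        states.append(state)
    return states


def bidirectional_encode(word_features: List[Vector]) -> Vector:
    """Encode in both directions, splice per word, then splice across words."""
    if not word_features:
        return []
    forward = run_direction(word_features)               # context from preceding words
    backward = run_direction(word_features[::-1])[::-1]  # context from following words
    data_feature: Vector = []
    for f, b in zip(forward, backward):
        data_feature.extend(f + b)  # spliced encoding feature of one word
    return data_feature


encoded = bidirectional_encode([[1.0], [0.0]])
```

Each word contributes twice its feature width to the final data feature, once per direction.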
In the above scheme, the first identification module is further configured to perform, by using the sensitivity identification layer, classification prediction corresponding to at least two sensitivity levels on the data features of the metadata, so as to obtain a probability that the metadata corresponds to each sensitivity level;
and selecting the sensitivity grade with the highest probability as a sensitivity identification result of the data to be identified.
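The classification prediction above can be illustrated with a softmax over per-level scores followed by selecting the most probable level. The level names and scores below are made up for the example, and softmax is an assumed choice for producing the probabilities.

```python
import math
from typing import Dict


def level_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert per-level scores into probabilities with a softmax."""
    m = max(scores.values())
    exps = {level: math.exp(s - m) for level, s in scores.items()}
    total = sum(exps.values())
    return {level: e / total for level, e in exps.items()}


def pick_level(probabilities: Dict[str, float]) -> str:
    """Select the sensitivity level with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical scores for three hypothetical sensitivity levels.
probs = level_probabilities({"public": 0.2, "internal": 1.0, "confidential": 2.5})
result = pick_level(probs)
```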
In the above scheme, the first extraction module is further configured to, when the metadata includes at least two keywords, respectively perform feature extraction on each keyword through the feature extraction layer to obtain a feature corresponding to each keyword as a data feature of the metadata;
correspondingly, the first extraction module is further configured to match, through the sensitivity recognition layer, the features corresponding to the keywords with the features corresponding to the at least two sensitive words, respectively, so as to obtain corresponding matching degrees;
and selecting the data sensitivity corresponding to the sensitive word with the highest matching degree as a sensitivity identification result of the data to be identified.
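The keyword matching above can be sketched as a nearest-match search between keyword features and sensitive-word features. Cosine similarity is an assumed choice of matching degree (the text does not name a measure), and the sensitive words, their features and their sensitivities are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as the matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical sensitive words: feature vector and associated data sensitivity.
SENSITIVE_WORDS: Dict[str, Tuple[Vector, str]] = {
    "password": ([1.0, 0.0], "high"),
    "nickname": ([0.0, 1.0], "low"),
}


def match_sensitivity(keyword_features: List[Vector]) -> Tuple[str, str, float]:
    """Return (sensitive word, data sensitivity, matching degree) of the best match."""
    best: Tuple[str, str, float] = ("", "", -1.0)
    for kf in keyword_features:
        for word, (wf, sensitivity) in SENSITIVE_WORDS.items():
            degree = cosine(kf, wf)
            if degree > best[2]:
                best = (word, sensitivity, degree)
    return best


best_word, best_sensitivity, best_degree = match_sensitivity([[0.9, 0.1], [0.4, 0.6]])
```

Taking the single best match across all keywords means one strongly sensitive keyword is enough to set the result for the whole data item.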
In the above scheme, the apparatus further comprises:
the processing module is used for establishing an association relationship between the sensitivity identification result and the data to be identified and storing the association relationship;
and the association relationship is used for looking up the data sensitivity corresponding to the data to be identified based on the data to be identified.
In the foregoing solution, the processing module is further configured to store the sensitivity identification result to a target area associated with the data to be identified, where the target area is an area corresponding to the data sensitivity in the storage area corresponding to the metadata.
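One way to picture the storage described above is a metadata record that has a dedicated sensitivity field, keyed by an identifier of the data. The in-memory dictionary, the identifier `db.orders`, and the function names are assumptions for illustration; an actual system would persist this in its metadata store.

```python
from typing import Dict, Optional

# Hypothetical metadata store: one record per data item, keyed by a data identifier.
metadata_store: Dict[str, Dict[str, str]] = {
    "db.orders": {"table_name": "orders", "description": "customer orders"},
}


def store_result(data_id: str, sensitivity: str) -> None:
    """Write the identification result into the sensitivity field of the metadata record."""
    metadata_store.setdefault(data_id, {})["sensitivity"] = sensitivity


def lookup(data_id: str) -> Optional[str]:
    """Find the stored data sensitivity for a data item, or None if not yet identified."""
    return metadata_store.get(data_id, {}).get("sensitivity")


store_result("db.orders", "high")
```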
In the above scheme, the apparatus further comprises:
the return module is used for responding to a data display request aiming at the data to be identified and acquiring the data sensitivity corresponding to the data to be identified;
when the data sensitivity corresponding to the data to be identified reaches a sensitivity threshold, returning shielding indication information corresponding to the data to be identified;
and the shielding indication information is used for indicating to shield and display the data to be identified.
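The display handling above can be sketched as a threshold check followed by masking. The ordering of levels, the threshold value, and the asterisk masking are all assumptions for the example; the text only requires that data reaching the threshold be indicated for shielded display.

```python
from typing import Dict

# Hypothetical ordering of sensitivity levels and a hypothetical threshold.
LEVEL_ORDER: Dict[str, int] = {"low": 0, "medium": 1, "high": 2}
SENSITIVITY_THRESHOLD = "high"


def handle_display_request(value: str, sensitivity: str) -> Dict[str, object]:
    """Return a shielding indication and the text to display for one data item."""
    masked = LEVEL_ORDER[sensitivity] >= LEVEL_ORDER[SENSITIVITY_THRESHOLD]
    shown = "*" * len(value) if masked else value
    return {"mask": masked, "display": shown}
```

For example, `handle_display_request("13800000000", "high")` reports that masking is required, while a `"low"` item is returned unchanged.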
In the above scheme, the apparatus further comprises:
the output module is used for outputting encryption prompt information corresponding to the data to be identified when the sensitivity identification result indicates that the data sensitivity of the data to be identified reaches the target data sensitivity;
and the encryption prompt information is used for prompting the encryption processing of the data to be identified.
The embodiment of the application provides a training device of a sensitivity recognition model, the sensitivity recognition model comprises a feature extraction layer and a sensitivity recognition layer, and the device comprises:
the second acquisition module is used for acquiring metadata of a data sample, wherein the data sample carries a sensitivity label which is used for indicating the data sensitivity corresponding to the data sample;
the second extraction module is used for performing feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
the second identification module is used for carrying out sensitivity identification on the data sample based on the sample data characteristics through the sensitivity identification layer to obtain a sample sensitivity identification result;
the updating module is used for acquiring the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample and updating the model parameter of the sensitivity identification model based on the difference;
the sensitivity identification model is used for outputting a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified after the metadata of the data to be identified is input to the sensitivity identification model.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the data sensitivity identification method based on the sensitivity identification model provided by the embodiment of the application when the executable instructions stored in the memory are executed.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the training method of the sensitivity recognition model provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for identifying data sensitivity based on a sensitivity identification model provided by the embodiment of the application.
The embodiment of the present application further provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for training the sensitivity recognition model provided in the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
the server carries out sensitivity identification on metadata of data to be identified through a sensitivity identification model, specifically obtains the metadata used for describing the data to be identified, and carries out feature extraction on the metadata of the data to be identified through a feature extraction layer of the sensitivity identification model to obtain data features of the metadata; carrying out sensitivity identification on data to be identified through a sensitivity identification layer of a sensitivity identification model based on the data characteristics of the metadata to obtain a sensitivity identification result; therefore, the metadata to be recognized is input into the sensitivity recognition model, so that a sensitivity recognition result for indicating the data sensitivity corresponding to the data to be recognized can be automatically recognized, the recognition efficiency of the data sensitivity can be greatly improved compared with a manual recognition mode, and the probability of missing the sensitive data is reduced.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a data sensitivity recognition system 100 based on a sensitivity recognition model according to an embodiment of the present application;
FIG. 2 is an alternative schematic structural diagram of an electronic device 500 provided in an embodiment of the present application;
FIG. 3 is a schematic flowchart illustrating a data sensitivity recognition method based on a sensitivity recognition model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application;
FIG. 7 is a flowchart illustrating a method for training a sensitivity recognition model according to an embodiment of the present disclosure;
FIG. 8 is a schematic flowchart of a data sensitivity recognition method based on a sensitivity recognition model according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a data sensitivity recognition method based on a sensitivity recognition model according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data sensitivity recognition apparatus based on a sensitivity recognition model according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a training apparatus for a sensitivity recognition model according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments are explained; the following explanations apply to these terms and expressions.
1) Metadata, which is data describing data or structural data providing information about a certain resource (i.e. data to be identified), is mainly used for describing data attribute information of the data to be identified, and is used for supporting functions such as indicating a storage location, history data, resource searching, file recording and the like; metadata can be called an electronic catalog, and for the purpose of cataloguing, the content or features of data must be described and collected to assist in data retrieval.
2) "In response to" indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are executed.
Based on the above explanations of the terms involved in the embodiments of the present application, the system that implements the data sensitivity identification method based on a sensitivity identification model provided in the embodiments of the present application is described next. Referring to fig. 1, fig. 1 is an alternative architecture schematic diagram of a data sensitivity identification system 100 based on a sensitivity identification model provided in the embodiments of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown as examples) are connected to a server 200 through a network 300; the network 300 may be a wide area network, a local area network, or a combination of both, and uses wireless links to implement data transmission.
In practical application, a client, such as a microblog, a web page, or an enterprise application, is provided on the terminal. The client is configured to provide data to be identified that is related to a service or to user behavior, and to send the data to be identified to the server 200. The server 200 may be a single server configured to support various services, a server cluster, or a cloud server; for example, it may be a background server of the client or an information flow platform.
In practical implementation, the server 200 is configured to obtain metadata of data to be identified, where the metadata is used to describe the data to be identified; performing feature extraction on metadata of data to be identified through a feature extraction layer of the sensitivity identification model to obtain data features of the metadata; carrying out sensitivity identification on data to be identified through a sensitivity identification layer of a sensitivity identification model based on the data characteristics of the metadata to obtain a sensitivity identification result; and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
Next, an electronic device implementing the data sensitivity recognition method based on the sensitivity recognition model according to the embodiment of the present application will be described. Referring to fig. 2, fig. 2 is an optional schematic structural diagram of an electronic device 500 provided in this embodiment. In practical application, the electronic device 500 may be a terminal (such as the terminal 400-1 or the terminal 400-2) or the server 200 in fig. 1. Taking the electronic device being the server 200 shown in fig. 1 as an example, the electronic device 500 shown in fig. 2 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the data sensitivity recognition device based on the sensitivity recognition model provided by the embodiments of the present application can be implemented in software, and fig. 2 shows a data sensitivity recognition device 555 based on the sensitivity recognition model stored in a memory 550, which can be software in the form of programs and plug-ins, and includes the following software modules: the first obtaining module 5551, the first extracting module 5552 and the first identifying module 5553, which are logical and thus can be arbitrarily combined or further split according to the implemented functions, and the functions of the respective modules will be described below.
In other embodiments, the data sensitivity recognition device based on the sensitivity recognition model provided in the embodiments of the present application may be implemented in hardware. As an example, the device may be a processor in the form of a hardware decoding processor that is programmed to execute the data sensitivity recognition method based on the sensitivity recognition model provided in the embodiments of the present application. For example, the processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Based on the above description of the data sensitivity recognition system and the electronic device based on the sensitivity recognition model according to the embodiments of the present application, the data sensitivity recognition method based on the sensitivity recognition model according to the embodiments of the present application is described below. In some embodiments, the method can be implemented by a terminal or a server alone, such as the terminal 400-1, the terminal 400-2 or the server 200 in fig. 1, or by a server and a terminal in cooperation, such as the terminal 400-1 and the server 200 in fig. 1. With reference to fig. 1 and fig. 3, fig. 3 is a schematic flowchart of the data sensitivity recognition method based on the sensitivity recognition model according to an embodiment of the present application; the following explanation takes the server 200 in fig. 1 implementing the method as an example.
Step 101: the server acquires metadata of the data to be identified, wherein the metadata is used for describing the data to be identified.
In practical application, the data to be identified may be data related to enterprise business or data related to individual users. It may be acquired from a database or acquired in real time, and it may be stored as a data table or as text such as a Word document or a log. The metadata mainly describes the attributes of the data to be identified. For example, if the data to be identified relates to a shopping business, the metadata can be items such as a shopping account number, an order number, a name, a mobile phone number and a receiving address; if the data to be identified relates to an individual user, the metadata can be items such as a name, an identity card number, a mobile phone number, an email address, a bank card number, a home address and an employer.
In some embodiments, the server may obtain the metadata of the data to be identified as follows: when the data to be identified is stored as a data table, at least one of the following table elements is obtained from the data table: the data table name, the table description corresponding to the data to be identified, and the attribute fields corresponding to the data to be identified; the obtained table elements are then determined as the metadata of the data to be identified.
Here, in practical applications, if the data to be identified is stored as a data table, the data table name, the table description or the attribute fields of the data table are used as the metadata of the data to be identified. For example, elements such as the data table name, the table's Chinese name, the table owner, the field names or the field types of the data table corresponding to the data to be identified are used as metadata.
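As a minimal sketch of the table case described above, the following Python snippet gathers a table name, a table description and attribute fields into one metadata string. The `TableInfo` class, its field names and the sample table are hypothetical and used only for illustration; the source does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    """Hypothetical container for the descriptive elements of a data table."""
    name: str
    description: str = ""
    fields: list = field(default_factory=list)  # attribute field names


def table_metadata(table: TableInfo) -> str:
    """Join whichever table elements are available into one metadata string."""
    parts = [table.name, table.description, *table.fields]
    return " ".join(p for p in parts if p)


# Illustrative table; the names are made up for this example.
orders = TableInfo(
    name="t_shop_order",
    description="shopping orders with receiver contact details",
    fields=["order_no", "receiver_name", "mobile_phone", "receiving_address"],
)
print(table_metadata(orders))
```

Only the descriptive elements are read, never the table rows, which is what keeps the input to the model small.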
In some embodiments, the server may also obtain metadata for the data to be identified by: when the storage form of the data to be identified is a document, at least one of the following document contents is obtained from the document: document title, document abstract, document keyword; and determining the obtained document content as metadata of the data to be identified.
Here, the document may be a Word document or a plain-text (e.g., txt) document. When the data to be identified is stored as a document, the document title, document abstract and document keywords of the data to be identified are used as metadata. When the data to be identified comprises a document title and a document body, the document title can be used as metadata, and key abstract content (namely the document abstract) can be extracted from the document body and also used as metadata. The reason is that the document body may be long in practice; identifying the entire body would impose a heavy computational load and lower the identification efficiency. In general, the core theme of the data to be identified can be summarized in one or a few sentences. Therefore, to capture the core theme effectively and improve identification efficiency, abstract content that represents the core theme can be extracted from the document body as the metadata of the data to be identified.
In some embodiments, the corresponding document keywords can be obtained by performing keyword extraction on the document text, and the document abstract of the data to be identified can be obtained by the following method: sentence extraction is carried out on the document text of the data to be identified, and a plurality of target sentences corresponding to the data to be identified are obtained; determining sentence weights corresponding to the target sentences according to the word weights of the keywords in the target sentences; based on the weight of each sentence, performing descending ordering on the target sentences to obtain corresponding sentence sequences; and starting from the first target sentence in the sentence sequence, selecting target sentences with the target quantity, and taking the target sentences with the target quantity as the document abstract corresponding to the data to be identified.
To determine the sentence weight of each target sentence from the word weights of its keywords, the server can perform the following operations on each target sentence: extract keywords from the target sentence to obtain a plurality of keywords; obtain the term frequency of each keyword in the document body and the inverse document frequency of each keyword; determine the word weight of each keyword based on its term frequency and inverse document frequency; and sum the word weights of the keywords to obtain the sentence weight of the target sentence.
Here, the term frequency is the ratio of the number of occurrences of the keyword in the data to be identified to the total number of words in the data to be identified. The inverse document frequency represents the rarity of the keyword and is the logarithm of the ratio of the total number of data items in the data set to which the data to be identified belongs to the number of data items in that data set that contain the keyword. In other words, both the frequency and the rarity of a keyword are taken into account: the importance of a keyword is proportional to how often it occurs in the data to be identified and inversely related to how many data items in the data set contain it. Finally, the sum of the word weights of the keywords in a target sentence is taken as the sentence weight of that sentence. The larger the sentence weight, the better the corresponding target sentence represents the core theme of the data to be identified.
In this way, subsequent data sensitivity identification is performed on the metadata of the data to be identified. Because the metadata represents the attribute characteristics of the data to be identified while being far smaller than the data itself, identification accuracy can be maintained and identification efficiency improved.
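The sentence-weighting procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the regular-expression tokenizer stands in for real sentence and keyword extraction, every token is treated as a keyword, the word weight is taken as the product of term frequency and inverse document frequency, and the document is assumed to be a member of `corpus` so that every keyword occurs in at least one item. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for real keyword extraction."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(document, corpus, target_count=1):
    """Return the target_count highest-weighted sentences of document.

    corpus is the data set the document belongs to (document included).
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]
    doc_tokens = tokenize(document)
    counts = Counter(doc_tokens)
    corpus_tokens = [set(tokenize(d)) for d in corpus]

    def word_weight(word):
        tf = counts[word] / len(doc_tokens)
        containing = sum(1 for toks in corpus_tokens if word in toks)
        idf = math.log(len(corpus) / containing)  # containing >= 1 by assumption
        return tf * idf

    def sentence_weight(sentence):
        # Sum the weights of the distinct keywords in the sentence.
        return sum(word_weight(w) for w in set(tokenize(sentence)))

    ranked = sorted(sentences, key=sentence_weight, reverse=True)  # descending
    return ranked[:target_count]


doc = ("The table stores bank card numbers. "
       "The table is updated daily. The table is large.")
corpus = [
    doc,
    "The table lists daily weather. The table is large.",
    "The report is updated daily.",
]
print(summarize(doc, corpus, target_count=1))
```

In this toy corpus the words "bank", "card" and "numbers" occur in only one item, so the first sentence receives the largest weight and is selected as the abstract.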
Step 102: the server performs feature extraction on the metadata of the data to be identified through a feature extraction layer to obtain the data features of the metadata.
In some embodiments, referring to fig. 4, fig. 4 is a schematic structural diagram of a sensitivity recognition model provided in the embodiments of the present application. As shown in fig. 4, the sensitivity recognition model includes a feature extraction layer and a sensitivity recognition layer. The metadata of the data to be recognized is input into the sensitivity recognition model; the feature extraction layer performs feature extraction on the metadata to obtain the corresponding data features, and the sensitivity recognition layer performs sensitivity recognition on the data features to obtain a sensitivity recognition result.
In some embodiments, referring to fig. 5, fig. 5 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application, and as shown in fig. 5, a server may perform feature extraction on metadata of data to be recognized to obtain data features of the metadata by:
performing word segmentation processing on metadata of data to be identified to obtain a plurality of words corresponding to the metadata; respectively carrying out feature coding on each word to obtain word features corresponding to each word; and performing feature splicing on the word features corresponding to all the words to obtain the data features corresponding to the metadata.
In actual implementation, the metadata of the data to be recognized, such as a data table name or a table description, is subjected to word segmentation to obtain a plurality of words or characters corresponding to the metadata. In practical application, to make the words or characters easy to retrieve, a unique index value can be set for each word or character, so that the word or character can be looked up by its index value. Feature coding, such as word vector conversion, is then performed on each word or character to obtain the corresponding word feature, namely a word vector. Finally, the word features of all words or characters are spliced to obtain the corresponding data feature, namely a sentence vector.
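The segmentation, indexing, feature coding and splicing steps just described can be illustrated as follows. This is only a sketch: the separator-based `segment` function is a stand-in for a real word segmenter, the word vectors are random rather than learned, and all names (`segment`, `build_index`, `embed`, `DIM`) are hypothetical.

```python
import random
import re


def segment(metadata):
    """Split metadata such as 't_user_bank_card' into lowercase words."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", metadata.lower()) if w]


def build_index(words):
    """Assign each distinct word a unique integer index value."""
    index = {}
    for w in words:
        index.setdefault(w, len(index))
    return index


def embed(words, index, table):
    """Look up each word's vector and splice the vectors into one flat list."""
    features = []
    for w in words:
        features.extend(table[index[w]])
    return features


words = segment("t_user_bank_card user bank card binding records")
index = build_index(words)

rng = random.Random(0)
DIM = 4  # hypothetical word-vector dimension
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in index]

data_features = embed(words, index, table)
print(len(words), len(index), len(data_features))
```

Repeated words share one index value and therefore one word vector, while the spliced data feature keeps one vector per word occurrence.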
In some embodiments, the server may perform feature concatenation on the term features corresponding to the terms to obtain the data features corresponding to the metadata, in the following manner:
respectively carrying out bidirectional coding processing on the word characteristics of each word to obtain the upper coding characteristics and the lower coding characteristics corresponding to each word; respectively performing characteristic splicing on the upper coding characteristics and the lower coding characteristics of each word to obtain corresponding splicing coding characteristics; and performing characteristic splicing on the splicing coding characteristics corresponding to the words to obtain data characteristics corresponding to the metadata.
Here, to take the context of each word into account, after the word vector of each word is obtained, the word vectors are input to a bidirectional encoding layer, such as a Bi-directional Long Short-Term Memory (Bi-LSTM) layer. The Bi-LSTM layer includes two LSTMs: one processes the input sequence in the forward direction and the other processes it in the reverse direction. The upper (preceding-context) coding features of each word are extracted by the forward pass (e.g., from left to right), the lower (following-context) coding features of each word are extracted by the backward pass (e.g., from right to left), and finally the upper and lower coding features are spliced to obtain the spliced coding features of the corresponding word.
Step 103: the server performs sensitivity identification on the data to be identified through a sensitivity identification layer based on the data features of the metadata to obtain a sensitivity identification result.
In some embodiments, the server may perform sensitivity identification on the data to be identified through the sensitivity identification layer based on the data characteristics of the metadata to obtain a sensitivity identification result by:
through a sensitivity identification layer, performing classification prediction corresponding to at least two sensitivity levels on the data characteristics of the metadata to obtain the probability of the metadata corresponding to each sensitivity level; and selecting the sensitivity grade with the maximum probability as a sensitivity identification result of the data to be identified.
The sensitivity identification result indicates the data sensitivity of the data to be identified. The data sensitivity can be expressed in various forms, such as a sensitivity level or a sensitivity value. When it is expressed as a sensitivity value, a larger value indicates more sensitive data. When it is expressed as a sensitivity level, the levels defined for the data sensitivity are, in order: externally disclosed, internally disclosed, generally sensitive, particularly sensitive and highly confidential, corresponding to the natural numbers 1 to 5 respectively. An enterprise can define its data sensitivity level standard by referring to industry standards and the relevant data security regulations of national legislative bodies.
It should be noted that the number of sensitivity levels should both distinguish data sensitivity reasonably and keep it feasible to implement security control measures for the different levels. In general, 4 to 5 levels are reasonable. When 5 levels are chosen, they are, from high to low: 5 (highly confidential), 4 (particularly sensitive), 3 (generally sensitive), 2 (internally disclosed) and 1 (externally disclosed). For a data table, the sensitivity level is defined down to the level of individual fields. For example, the identity card number and mobile phone number fields are classified as level 5, and the name, email address, receiving address and the like are classified as level 4. Alternatively, the data sensitivity of the data to be identified can be characterized by other customized sensitivity levels, for example the following five: top secret, confidential, high sensitivity, medium sensitivity and low sensitivity.
Here, assume that the data to be identified may take the following five sensitivity levels: top secret, confidential, high sensitivity, medium sensitivity and low sensitivity. If classification prediction on the data features of the metadata by the sensitivity identification layer yields the following probabilities for the levels in turn: top secret (90%), confidential (40%), high sensitivity (30%), medium sensitivity (15%) and low sensitivity (10%), then top secret, the level with the highest probability (90%), is selected as the sensitivity identification result of the data to be identified.
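Selecting the level with the maximum probability can be sketched as below. The snippet assumes that the sensitivity identification layer produces one raw score per level and that the scores are normalized with a softmax, so the probabilities sum to 1; the source only states that a probability per level is obtained, so the softmax, the `pick_level` function and the example scores are illustrative assumptions.

```python
import math

LEVELS = [
    "top secret",
    "confidential",
    "high sensitivity",
    "medium sensitivity",
    "low sensitivity",
]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_level(scores, levels=LEVELS):
    """Return the level with the largest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(levels)), key=probs.__getitem__)
    return levels[best], probs[best]


# Hypothetical raw scores for one piece of metadata.
level, prob = pick_level([3.2, 1.1, 0.4, -0.5, -1.0])
print(level, round(prob, 3))
```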
In some embodiments, referring to fig. 6, fig. 6 is a schematic structural diagram of a sensitivity recognition model provided in an embodiment of the present application, and as shown in fig. 6, the server may further perform feature extraction on metadata of data to be recognized through a feature extraction layer in the following manner, so as to obtain data features of the metadata, where the feature extraction includes: when the metadata comprises at least two keywords, respectively extracting the features of the keywords through a feature extraction layer to obtain the features corresponding to the keywords as the data features of the metadata; correspondingly, the server can perform sensitivity identification on the data to be identified through the sensitivity identification layer based on the data characteristics of the metadata in the following way to obtain a sensitivity identification result: respectively matching the characteristics corresponding to the keywords with the characteristics corresponding to at least two sensitive words through a sensitivity identification layer to obtain corresponding matching degrees; and selecting the data sensitivity corresponding to the sensitive word with the highest matching degree as a sensitivity identification result of the data to be identified.
Here, the server prestores the correspondence between sensitive words and data sensitivities. For example, the data sensitivity corresponding to sensitive word 1 is top secret, that of sensitive word 2 is confidential, that of sensitive word 3 is high sensitivity, that of sensitive word 4 is medium sensitivity, and that of sensitive word 5 is low sensitivity. Assume that the keywords in the metadata of the data to be recognized include keyword 1 and keyword 2. The feature extraction layer extracts the features of keyword 1 and keyword 2 respectively. The sensitivity identification layer then matches the features of keyword 1 against the features of the sensitive words (such as sensitive words 1 to 5), obtaining matching degrees of 10%, 20%, 30%, 40% and 80% in turn, and matches the features of keyword 2 against the features of the same sensitive words, obtaining matching degrees of 20%, 10%, 30%, 40% and 60% in turn. The highest matching degree is 80%, for sensitive word 5, so the low sensitivity corresponding to sensitive word 5 is selected as the sensitivity recognition result of the data to be recognized.
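The keyword matching variant can be sketched as follows. The source does not say how the matching degree is computed; cosine similarity between feature vectors is used here purely as an assumption, and the sensitive words, feature vectors and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as an assumed matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_sensitivity(keyword_features, sensitive_words):
    """Return the sensitivity of the best-matching sensitive word.

    sensitive_words maps a word to (feature vector, data sensitivity).
    """
    best_degree, best_sensitivity = -1.0, None
    for kf in keyword_features:
        for sf, sensitivity in sensitive_words.values():
            degree = cosine(kf, sf)
            if degree > best_degree:
                best_degree, best_sensitivity = degree, sensitivity
    return best_sensitivity, best_degree


# Hand-made feature vectors for illustration only.
sensitive_words = {
    "password": ([1.0, 0.0, 0.0], "top secret"),
    "id_card": ([0.0, 1.0, 0.0], "confidential"),
    "page_view": ([0.0, 0.0, 1.0], "low sensitivity"),
}
keywords = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.1]]
print(match_sensitivity(keywords, sensitive_words))
```

The result is driven by the single highest matching degree across all keyword and sensitive-word pairs, mirroring the selection rule in the text.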
In some embodiments, after obtaining the sensitivity identification result, the server may further establish an association relationship between the sensitivity identification result and the data to be identified, and store the association relationship; and the incidence relation is used for searching the data sensitivity corresponding to the data to be identified based on the data to be identified.
In some embodiments, the server may establish the association relationship between the sensitivity identification result and the data to be identified by:
and storing the sensitivity identification result to a target area associated with the data to be identified, wherein the target area is an area corresponding to the data sensitivity in the storage area corresponding to the metadata.
Here, after the server determines the data sensitivity of the data to be identified, it may further write the data sensitivity into the area associated with the data to be identified that is reserved for indicating data sensitivity, such as the "sensitivity level" column of the data's label. The data sensitivity record of the data to be identified is thereby completed, and the data sensitivity is used and maintained by users as part of the metadata.
In some embodiments, after obtaining the sensitivity identification result, the server may further obtain the data sensitivity corresponding to the data to be identified in response to the data display request for the data to be identified; when the sensitivity of the data corresponding to the data to be identified reaches a sensitivity threshold, returning shielding indication information corresponding to the data to be identified; and the shielding indication information is used for indicating the shielding display of the data to be identified.
Here, when the data sensitivity of the data to be identified reaches the sensitivity threshold, the data to be identified is regarded as sensitive, for example top-secret or confidential data. In this case the server returns shielding indication information for the data to be identified to the terminal, so that the user can protect the data at the terminal, for example by shielding the top-secret or confidential data to avoid leakage. In addition, the data to be identified can be displayed selectively: for sensitive information in a table, such as a user's identity card number, a view can be used to shield the field so that it is not shown to others.
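A threshold-based shielding response of the kind described above might look like the following sketch. It assumes field-level sensitivity stored as the numbers 1 to 5, masks any field at or above the threshold, and treats fields with no recorded level as level 5 (a conservative choice made for this example). The function name, the response layout and the sample record are hypothetical.

```python
MASK = "***"


def display_response(record, field_levels, threshold=5):
    """Build a display response in which sensitive fields are shielded."""
    shown, masked = {}, []
    for name, value in record.items():
        # Fields with no recorded level default to 5, so they are shielded.
        if field_levels.get(name, 5) >= threshold:
            shown[name] = MASK
            masked.append(name)
        else:
            shown[name] = value
    # masked_fields plays the role of the shielding indication information.
    return {"data": shown, "masked_fields": masked}


# Made-up record and levels for illustration.
record = {"order_no": "A1001", "name": "Zhang San", "id_number": "ID-0001"}
levels = {"order_no": 2, "name": 4, "id_number": 5}
response = display_response(record, levels, threshold=5)
print(response)
```

Lowering the threshold to 4 would additionally shield the name field, which shows how one stored level per field can serve several display policies.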
In some embodiments, after obtaining the sensitivity identification result, when the sensitivity identification result represents that the data sensitivity of the data to be identified reaches the target data sensitivity, the server may further output an encryption prompt message corresponding to the data to be identified; and the encryption prompting information is used for prompting the encryption processing of the data to be identified.
In this way, when the data sensitivity of the data to be identified reaches a certain degree, the user is prompted to encrypt the data when maintaining or using it, so as to avoid leakage. For example, when the data to be identified needs to be transferred from one database to another, the method provided in the embodiments of the present application automatically identifies its data sensitivity and then applies encryption, obfuscation or similar processing to data that meets a certain sensitivity, which further improves the security of the data being transferred.
Next, training of the sensitivity recognition model will be explained. Referring to fig. 7, fig. 7 is a schematic flowchart of a training method for a sensitivity recognition model provided in an embodiment of the present application, where in some embodiments, the sensitivity recognition model includes a feature extraction layer and a sensitivity recognition layer, and the method includes:
Step 201: the server acquires metadata of a data sample, wherein the data sample carries a sensitivity label, and the sensitivity label is used for indicating the data sensitivity corresponding to the data sample.
Step 202: the server performs feature extraction on the metadata of the data sample through the feature extraction layer to obtain the sample data features of the metadata of the data sample.
Step 203: the server performs sensitivity identification on the data sample through the sensitivity identification layer based on the sample data features to obtain a sample sensitivity identification result.
Step 204: the server acquires the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and updates the model parameters of the sensitivity identification model based on the acquired difference.
In practical implementation, the value of the loss function of the sensitivity identification model can be determined according to the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample; when the value of the loss function reaches a preset threshold value, determining a corresponding error signal based on the value of the loss function of the sensitivity recognition model; the error signal is propagated in the sensitivity recognition model in a reverse direction, and model parameters of each layer of the sensitivity recognition model are updated in the process of propagation.
Backward propagation is explained as follows. A training data sample is fed to the input layer of a neural network model, passes through the hidden layers, and finally reaches the output layer, which outputs a result; this is the forward propagation process of the neural network model. Because the output of the model differs from the actual result, the error between the output and the actual value is calculated and propagated backward from the output layer through the hidden layers until it reaches the input layer. During backward propagation, the values of the model parameters are adjusted according to the error. This process is iterated until convergence.
In this way, the server inputs the metadata of the data to be identified into the sensitivity identification model and automatically obtains a sensitivity identification result indicating the data sensitivity of the data to be identified. Compared with manual identification, this greatly improves the efficiency of data sensitivity identification and reduces the probability of overlooking sensitive data.
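The forward pass, error computation, backward pass and parameter update can be illustrated with a deliberately small stand-in model. The sketch below trains a single linear layer with a softmax output and cross-entropy loss by gradient descent; it is not the Bi-LSTM model of the embodiments, and the toy features, labels, learning rate and stopping tolerance are all assumptions made for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, b, x):
    """Forward propagation: one score per sensitivity class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def train(samples, n_features, n_classes, lr=0.5, epochs=200, tol=1e-3):
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    loss = float("inf")
    for _ in range(epochs):
        loss = 0.0
        for x, label in samples:
            probs = softmax(forward(W, b, x))
            loss += -math.log(probs[label])  # cross-entropy against the label
            # Backward propagation: d(loss)/d(score_k) = p_k - 1[k == label].
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * err * x[j]
                b[k] -= lr * err
        loss /= len(samples)
        if loss < tol:  # stop once the loss is small enough
            break
    return W, b, loss


def predict(W, b, x):
    scores = forward(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features: [mentions password, mentions phone, is a statistic]; labels 0..2.
samples = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]
W, b, final_loss = train(samples, n_features=3, n_classes=3)
print([predict(W, b, x) for x, _ in samples], round(final_loss, 4))
```

In a multi-layer model the same error term is pushed back through each hidden layer by the chain rule, which is the propagation from output layer to input layer described above.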
Next, the data sensitivity identification method based on the sensitivity identification model provided in the embodiments of the present application is described further. Referring to fig. 1 and fig. 8, fig. 8 is a schematic flowchart of the method, described here taking as an example the case in which the terminal and the server 200 in fig. 1 implement it in cooperation. The sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the method includes:
Step 301: the server acquires metadata of a data sample, wherein the data sample carries a sensitivity label, and the sensitivity label is used for indicating the data sensitivity corresponding to the data sample.
Step 302: the server performs feature extraction on the metadata of the data sample through the feature extraction layer to obtain the sample data features of the metadata of the data sample.
Step 303: the server performs sensitivity identification on the data sample through the sensitivity identification layer based on the sample data features to obtain a sample sensitivity identification result.
Step 304: the server acquires the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and updates the model parameters of the sensitivity identification model based on the acquired difference.
The trained sensitivity recognition model is obtained in this way.
Step 305: the terminal sends the user's data to be identified to the server.
Step 306: if the data to be identified is stored as a data table, the server acquires the data table name, the table description or the attribute fields of the data table as metadata.
Step 307: the server performs feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain the data features of the metadata.
Step 308: the server performs sensitivity identification on the data to be identified through the sensitivity identification layer based on the data features of the metadata to obtain a sensitivity identification result.
Step 309: the server stores the sensitivity identification result in a target area associated with the data to be identified.
The target area is the area reserved for data sensitivity within the storage area of the metadata of the data to be identified.
By the method, the data sensitivity of the data to be recognized is recognized through the trained sensitivity recognition model, and the corresponding sensitivity recognition result is stored in the target area related to the data to be recognized, so that the data sensitivity of the data to be recognized becomes a part of metadata, the recognition efficiency of the data sensitivity is greatly improved, and the omission of the data sensitivity of the data to be recognized is avoided.
Next, an exemplary application of the embodiments of the present application in a practical application scenario is described. The data sensitivity identification method based on the sensitivity identification model provided in the embodiments of the present application identifies the data sensitivity of the data to be identified by machine learning. Referring to fig. 9, fig. 9 is a schematic flowchart of the method. As shown in fig. 9, data sensitivity identification comprises training the sensitivity recognition model (the training stage) and identifying the sensitivity of the data to be identified with the trained model (the identification stage), which are described in turn below.
In the training stage, metadata of a data sample is obtained, wherein the data sample carries a sensitivity label, and the sensitivity label is used for indicating the data sensitivity corresponding to the data sample. That is, the metadata of the data samples input to the sensitivity recognition model includes the data table name and the table description. The data sensitivity can be characterized by sensitivity levels, which are divided into five types: top secret, confidential, high sensitivity, medium sensitivity and low sensitivity.
Generally speaking, data related to the security of a user account is top secret, data related to a user's personal information and finances is confidential, user behavior data is high sensitivity, coarse-grained aggregates (roll-ups) of confidential data are medium sensitivity, and ordinary statistical data is low sensitivity.
During training, the data table name and the table description of the data sample are used as sample points, the sensitivity level is used as a sensitivity label, the relation between the data table name and the table description and the sensitivity level is learned by optimizing and training the sensitivity recognition model, and after the training is finished, the model parameters of the sensitivity recognition model are stored.
In the identification stage, the model parameters of the sensitivity identification model saved in the training stage are loaded first. Then the metadata of the data to be identified, namely its data table name and table description, is input into the trained sensitivity identification model, which identifies the data sensitivity of the data to be identified and outputs the sensitivity level indicating that data sensitivity.
Next, the structure of the sensitivity recognition model is described. Referring to fig. 10, fig. 10 is a schematic structural diagram of the sensitivity recognition model provided in the embodiments of the present application. As shown in fig. 10, the sensitivity recognition model includes an input layer, a feature extraction layer and a sensitivity recognition layer, where the feature extraction layer includes an embedding layer, a bidirectional coding layer and a pooling layer. The model is explained below taking as an example its application to identifying the data sensitivity of data to be identified.
1. Input layer
In the input layer, the metadata of the data to be identified, such as the data table name or the table description, is first subjected to word segmentation to obtain a plurality of words or characters corresponding to the metadata. A unique index value is then set for each word or character: if the i-th word is $w_i$, indexing yields a unique integer $I_i = I(w_i)$. Finally, each word or character obtained by segmentation is passed to the feature extraction layer together with its index value.
2. Feature extraction layer
1) Embedding layer
Here, at the embedding layer, the corresponding word or character is first obtained from its index value, and word vector conversion (i.e., feature coding) is then performed on each word or character to obtain the corresponding word vector (i.e., word feature).
Assume the matrix of the embedding layer is $E \in \mathbb{R}^{V \times D}$, where V is the total number of words and D is the dimension of each word vector. To obtain the word vector of the i-th word, its index value $I_i$ is first converted into a one-hot coded vector $O_i$ of length V, whose element at position $I_i$ is 1 and whose other elements are 0. Multiplying the one-hot coded vector by the matrix E yields the word vector $e_i$ of that word. The specific expressions are as follows:
$$O_i \in \{0, 1\}^{V}$$
$$O_i[j] = \begin{cases} 1, & j = I_i \\ 0, & j \neq I_i \end{cases}$$
$$e_i = O_i E$$
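The embedding lookup described above can be checked with a tiny example: multiplying a one-hot vector by the embedding matrix returns exactly the matrix row selected by the index value. The matrix values below are made up, and the helper names `one_hot` and `vec_mat` are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector of length size with a 1 at position index."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def vec_mat(vec, mat):
    """Row vector (length V) times matrix (V x D) gives a vector of length D."""
    return [sum(v * row[d] for v, row in zip(vec, mat)) for d in range(len(mat[0]))]


# Embedding matrix with V = 3 words and D = 2 dimensions (made-up values).
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

e_1 = vec_mat(one_hot(1, 3), E)
print(e_1, e_1 == E[1])
```

Because the product only selects a row, practical implementations usually skip the multiplication and read the row directly by index.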
2) bidirectional coding layer
After the word vector of each word is obtained, the word vectors are input to a bidirectional coding layer, such as a bidirectional Long Short-Term Memory (Bi-LSTM) layer. The Bi-LSTM layer includes two LSTMs: one processes the input sequence in the forward direction and the other processes it in the reverse direction. The layer can therefore take the preceding and following context into account at the same time and fuse the contextual semantics fully.
In actual implementation, bidirectional coding processing can be performed on the word vector of each word to obtain the upper coding features (from the preceding context) and the lower coding features (from the following context) corresponding to each word; the upper coding features and the lower coding features of each word are then spliced to obtain the corresponding splicing coding features. The specific expression is as follows:
$$(C_t^l, h_t^l) = \mathrm{LSTM}^l(h_{t-1}^l, e_t, C_{t-1}^l)$$
$$(C_t^r, h_t^r) = \mathrm{LSTM}^r(h_{t+1}^r, e_t, C_{t+1}^r)$$
$$C_t = [C_t^l ; C_t^r], \quad h_t = [h_t^l ; h_t^r]$$
where l denotes left to right and r denotes right to left; $h_{t-1}^l$ and $h_{t+1}^r$ represent the hidden state of the previous word in the respective direction, $e_t$ represents the word vector of the currently input word, and $C_{t-1}^l$ and $C_{t+1}^r$ represent the cell state of the previous word; $(C_t^l, h_t^l)$ characterizes the upper coding features extracted by the forward process (e.g., left to right), $(C_t^r, h_t^r)$ characterizes the lower coding features extracted by the backward process (e.g., right to left), and $C_t$, $h_t$ characterize the splicing coding features of the corresponding word, obtained by splicing the upper coding features and the lower coding features.
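The bidirectional coding and splicing described above can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes a standard LSTM cell with input, forget, output, and candidate gates, randomly initialised weights, and zero initial states, none of which are specified in the application; all function names are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def make_params(input_dim, hidden_dim, rng):
    """Random weights and zero biases for the gates i, f, o and the candidate g."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim + input_dim)]
                for _ in range(hidden_dim)]
    return {gate: (mat(), [0.0] * hidden_dim) for gate in "ifog"}


def affine(weights, bias, vec):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_cell(params, h_prev, x, c_prev):
    """One LSTM step: (h_prev, x, c_prev) -> (c, h)."""
    hx = h_prev + x  # list concatenation of previous hidden state and current input
    i = [sigmoid(v) for v in affine(*params["i"], hx)]
    f = [sigmoid(v) for v in affine(*params["f"], hx)]
    o = [sigmoid(v) for v in affine(*params["o"], hx)]
    g = [math.tanh(v) for v in affine(*params["g"], hx)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c, h


def run_direction(params, inputs, hidden_dim):
    """Run one LSTM over `inputs` in the given order, collecting (c, h) per step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for x in inputs:
        c, h = lstm_cell(params, h, x, c)
        states.append((c, h))
    return states


def bi_lstm(fwd_params, bwd_params, inputs, hidden_dim):
    """Splice forward (left-to-right) and backward (right-to-left) features per word."""
    left = run_direction(fwd_params, inputs, hidden_dim)
    right = run_direction(bwd_params, inputs[::-1], hidden_dim)[::-1]
    return [(cl + cr, hl + hr) for (cl, hl), (cr, hr) in zip(left, right)]


rng = random.Random(0)
input_dim, hidden_dim = 2, 4
forward_params = make_params(input_dim, hidden_dim, rng)
backward_params = make_params(input_dim, hidden_dim, rng)
word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy word vectors
encoded = bi_lstm(forward_params, backward_params, word_vectors, hidden_dim)
print(len(encoded), len(encoded[0][1]))  # number of words, length of each spliced h_t
```

Reversing the input for the backward pass and then reversing its outputs keeps position t of both directions aligned with the same word before splicing.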
3) Pooling layer
After the splicing coding features corresponding to each word are obtained through the bidirectional coding layer, the pooling layer performs feature splicing on the splicing coding features of all the words to obtain the corresponding sentence vector (i.e., the data feature of the metadata). The specific expression is as follows:
$$z = \mathrm{Pool}\big((C_1, h_1), \ldots, (C_L, h_L)\big)$$
where z represents the sentence vector corresponding to the metadata of the data to be identified, $C_t$, $h_t$ are the splicing coding features corresponding to the current input word, and L represents the total number of input words.
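One way to realise the pooling step is sketched below. The choice of pooling operator is an assumption: the text only states that the per-word features are combined into a single sentence vector over the L input words, so mean pooling and max pooling are shown as two common candidates, with hypothetical function names.

```python
def mean_pool(features: list[list[float]]) -> list[float]:
    """Average each dimension over the L per-word feature vectors."""
    length = len(features)
    return [sum(column) / length for column in zip(*features)]


def max_pool(features: list[list[float]]) -> list[float]:
    """Take the maximum of each dimension over the L per-word feature vectors."""
    return [max(column) for column in zip(*features)]


# Toy per-word features h_1..h_3 (L = 3 words, 2 dimensions each).
h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
z_mean = mean_pool(h)
z_max = max_pool(h)
print(z_mean, z_max)
```

Either operator yields a fixed-length sentence vector regardless of how many words the metadata contains, which is what the downstream fully-connected layers require.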
3. Sensitivity recognition layer
Here, the sensitivity recognition layer is also called a multilayer perceptron (MLP) layer; a multilayer perceptron is composed of multiple fully-connected neural network layers. The data features corresponding to the metadata of the data to be identified pass through the sensitivity recognition layer, which outputs the probability that the data to be identified corresponds to each sensitivity level. Taking a 3-layer fully-connected neural network as an example, the probability that an input text belongs to each sensitivity level may be expressed as follows:
$$a_i = f(W_3 f(W_2 f(W_1 z + b_1) + b_2) + b_3)$$
$$p_i = \frac{\exp(a_i)}{\sum_{j=1}^{A} \exp(a_j)}$$
where f is a nonlinear activation function, z is the sentence vector corresponding to the metadata of the data to be identified obtained by the pooling layer, $W_1$ is the weight of the first fully-connected layer, $W_2$ is the weight of the second fully-connected layer, and $W_3$ is the weight of the third fully-connected layer, all of which are trainable; $b_1$, $b_2$ and $b_3$ are the corresponding trainable bias parameters; $a_i$ characterizes the i-th sensitivity level, A characterizes the number of sensitivity levels, and $p_i$ is the probability that the sensitivity level of the data to be identified is $a_i$.
Then, the sensitivity level with the maximum probability is selected as the sensitivity level corresponding to the data to be identified, the finally determined sensitivity level is output, and the output sensitivity level is supplemented into the metadata of the data to be identified.
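The classification step can be illustrated with the sketch below: three fully-connected layers, a softmax over the outputs, and selection of the most probable level. The use of ReLU for the activation f, the hand-picked weights, and the level names are assumptions made purely for the example.

```python
import math


def relu(vec: list[float]) -> list[float]:
    return [max(0.0, x) for x in vec]


def dense(weights, bias, vec):
    """Fully-connected layer: W v + b."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(scores: list[float]) -> list[float]:
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z, layers, levels):
    """Apply f(W v + b) for each layer, then softmax, then pick the most probable level."""
    out = z
    for weights, bias in layers:
        out = relu(dense(weights, bias, out))
    probs = softmax(out)
    best = max(range(len(probs)), key=probs.__getitem__)
    return levels[best], probs


# Hypothetical weights: two identity layers, then a layer mapping 2 features to 3 levels.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
last = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
levels = ["low", "medium", "high"]  # hypothetical sensitivity levels

level, probs = classify([0.2, 1.5], [identity, identity, last], levels)
print(level, [round(p, 3) for p in probs])
```

The selected level would then be written back into the metadata of the data to be identified.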
It should be noted that the structure of the sensitivity recognition model may be set according to actual situations. For example, the metadata of the data to be identified may be input to the input layer and transmitted through the input layer to the feature extraction layer, so that operations such as word segmentation or indexing are performed on the metadata in the feature extraction layer. The structure of the sensitivity recognition model is not specifically limited in this application.
Once the structure of the sensitivity recognition model has been determined, the model can be trained using stochastic gradient descent so that the model parameters reach an optimum or a local optimum. For example, the metadata of an acquired data sample is transmitted to the feature extraction layer through the input layer, and feature extraction is performed on the metadata of the data sample through the feature extraction layer to obtain the sample data features of the metadata of the data sample; sensitivity identification is performed on the data sample based on the sample data features through the sensitivity identification layer to obtain a sample sensitivity identification result; and the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample is acquired, and the model parameters of the sensitivity identification model are updated based on the acquired difference.
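The training loop described above can be sketched in miniature as follows. To stay short, the sketch trains only a single softmax layer on already-extracted sample features, and it measures the difference between prediction and label with cross-entropy loss; both simplifications, along with the learning rate and the toy samples, are assumptions rather than details from the application.

```python
import math
import random


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, z):
    logits = [sum(w * x for w, x in zip(row, z)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def sgd_step(W, b, z, label, lr):
    """One stochastic gradient descent update; returns the loss before the update."""
    p = predict(W, b, z)
    loss = -math.log(p[label])
    for k in range(len(W)):
        # Gradient of cross-entropy with respect to logit k is p_k - 1[k == label].
        grad = p[k] - (1.0 if k == label else 0.0)
        for j in range(len(z)):
            W[k][j] -= lr * grad * z[j]
        b[k] -= lr * grad
    return loss


# Toy sample features with sensitivity labels (0 and 1 are hypothetical level ids).
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
rng = random.Random(0)

losses = []
for epoch in range(50):
    rng.shuffle(samples)  # visit samples in random order each epoch
    epoch_loss = sum(sgd_step(W, b, z, y, lr=0.5) for z, y in samples)
    losses.append(epoch_loss / len(samples))

print(round(losses[0], 3), round(losses[-1], 3))
```

In the full model, the same gradient signal would be propagated back through the sensitivity identification layer and the feature extraction layer so that all trainable parameters are updated.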
In addition, the sensitivity recognition model provided by the embodiment of the application can be trained based on a traditional machine learning method, such as the fast text classifier FastText, or based on deep learning, such as the general-purpose Bidirectional Encoder Representations from Transformers (BERT) model, the TextCNN model, a Chinese pre-trained RoBERTa model, a Chinese pre-trained ELECTRA model, etc. The fully-connected neural network provided by the embodiment of the application can also be replaced by an attention network, a recurrent neural network, a convolutional neural network, and the like.
Through the method, the metadata to be recognized is input into the sensitivity recognition model, the corresponding sensitivity level is automatically recognized in a machine learning mode, and the recognized sensitivity level is supplemented into the metadata of the data to be recognized.
The following continues with an exemplary structure of the data sensitivity recognition device 555 provided by the embodiment of the present application when implemented as software modules. In some embodiments, as shown in fig. 11, fig. 11 is a schematic structural diagram of the data sensitivity recognition device based on the sensitivity recognition model according to the embodiment of the present application, where the sensitivity recognition model includes a feature extraction layer and a sensitivity recognition layer, and the device includes:
a first obtaining module 5551, configured to obtain metadata of data to be identified, where the metadata is used to describe the data to be identified;
a first extraction module 5552, configured to perform feature extraction on the metadata of the data to be identified through the feature extraction layer, so as to obtain data features of the metadata;
the first identification module 5553 is configured to perform sensitivity identification on the data to be identified through the sensitivity identification layer based on the data characteristics of the metadata to obtain a sensitivity identification result;
and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
In some embodiments, the first obtaining module is further configured to, when the storage form of the data to be identified is a data table, obtain at least one of the following table elements from the data table: the data table name, the table description corresponding to the data to be identified in the data table, and the attribute field corresponding to the data to be identified in the data table;
and determining the acquired table element as the metadata of the data to be identified.
In some embodiments, the first obtaining module is further configured to, when the storage form of the data to be identified is a document, obtain at least one of the following document contents from the document: document title, document abstract, document keyword;
and determining the obtained document content as the metadata of the data to be identified.
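Assembling metadata for the two storage forms above might look like the following sketch. The `TableInfo` and `DocumentInfo` containers and their field names are hypothetical; the application only states which elements may be taken (table name, table description, attribute fields; document title, abstract, keywords).

```python
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    name: str
    description: str = ""
    columns: list[str] = field(default_factory=list)


@dataclass
class DocumentInfo:
    title: str
    abstract: str = ""
    keywords: list[str] = field(default_factory=list)


def table_metadata(table: TableInfo) -> str:
    """Join whichever table elements are present into one metadata string."""
    parts = [table.name, table.description, *table.columns]
    return " ".join(part for part in parts if part)


def document_metadata(doc: DocumentInfo) -> str:
    """Join whichever document contents are present into one metadata string."""
    parts = [doc.title, doc.abstract, *doc.keywords]
    return " ".join(part for part in parts if part)


table = TableInfo("user_profile", "registered user details", ["id_card_no", "phone"])
doc = DocumentInfo("Payroll summary", keywords=["salary", "bank account"])
print(table_metadata(table))
print(document_metadata(doc))
```

Because only the metadata is read, the underlying rows or document body never need to be loaded for identification.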
In some embodiments, the first extraction module is further configured to perform word segmentation on metadata of the data to be identified to obtain a plurality of words corresponding to the metadata;
respectively carrying out feature coding on each word to obtain word features corresponding to each word;
and performing characteristic splicing on the word characteristics corresponding to each word to obtain the data characteristics corresponding to the metadata.
In some embodiments, the first extraction module is further configured to perform bidirectional encoding processing on term features of each term respectively to obtain an upper encoding feature and a lower encoding feature corresponding to each term;
respectively performing characteristic splicing on the upper coding characteristics and the lower coding characteristics of each word to obtain corresponding splicing coding characteristics;
and performing feature splicing on the splicing coding features corresponding to each word to obtain the data features corresponding to the metadata.
In some embodiments, the first identification module is further configured to perform, by the sensitivity identification layer, classification prediction on data features of the metadata corresponding to at least two sensitivity levels to obtain a probability that the metadata corresponds to each sensitivity level;
and selecting the sensitivity grade with the highest probability as a sensitivity identification result of the data to be identified.
In some embodiments, the first extraction module is further configured to, when the metadata includes at least two keywords, respectively perform feature extraction on each keyword through the feature extraction layer to obtain a feature corresponding to each keyword as a data feature of the metadata;
correspondingly, the first extraction module is further configured to match, through the sensitivity recognition layer, the features corresponding to the keywords with the features corresponding to the at least two sensitive words, respectively, so as to obtain corresponding matching degrees;
and selecting the data sensitivity corresponding to the sensitive word with the highest matching degree as a sensitivity identification result of the data to be identified.
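The keyword-matching variant can be sketched as below. Cosine similarity is used as the matching degree, which is an assumption since the application does not name a similarity measure; the feature vectors, sensitive words, and level names are made-up examples.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, used here as a stand-in for the matching degree."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_sensitivity(keyword_features, sensitive_words):
    """Return (sensitive word, sensitivity, matching degree) for the best match.

    keyword_features: keyword -> feature vector
    sensitive_words:  sensitive word -> (feature vector, sensitivity)
    """
    candidates = (
        (word, sensitivity, cosine(kw_vec, sw_vec))
        for kw_vec in keyword_features.values()
        for word, (sw_vec, sensitivity) in sensitive_words.items()
    )
    return max(candidates, key=lambda item: item[2])


keyword_features = {"phone": [0.9, 0.1], "city": [0.3, 0.7]}
sensitive_words = {
    "telephone": ([1.0, 0.0], "high"),
    "region": ([0.0, 1.0], "low"),
}
word, sensitivity, degree = match_sensitivity(keyword_features, sensitive_words)
print(word, sensitivity, round(degree, 3))
```

Here the keyword "phone" lies closest to the sensitive word "telephone", so the data inherits that word's sensitivity.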
In some embodiments, the apparatus further comprises:
the processing module is used for establishing an association relationship between the sensitivity identification result and the data to be identified and storing the association relationship;
and the association relationship is used for searching the data sensitivity corresponding to the data to be identified based on the data to be identified.
In some embodiments, the processing module is further configured to store the sensitivity identification result in a target area associated with the data to be identified, where the target area is an area corresponding to data sensitivity in the storage area corresponding to the metadata.
In some embodiments, the apparatus further comprises:
the return module is used for responding to a data display request aiming at the data to be identified and acquiring the data sensitivity corresponding to the data to be identified;
when the data sensitivity corresponding to the data to be identified reaches a sensitivity threshold, returning shielding indication information corresponding to the data to be identified;
and the shielding indication information is used for indicating that the data to be identified is to be shielded (masked) when displayed.
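The display-request handling can be sketched as follows, assuming the stored association is a simple mapping from a data identifier to its sensitivity level. The level ordering, the threshold, the identifiers, and the asterisk-based masking are all hypothetical choices for the example.

```python
# Hypothetical ordering of sensitivity levels, from least to most sensitive.
SENSITIVITY_ORDER = {"low": 1, "medium": 2, "high": 3}


def handle_display_request(data_id: str, sensitivity_store: dict[str, str],
                           threshold: str = "medium") -> dict:
    """Look up the stored sensitivity and decide whether to return a mask indication."""
    level = sensitivity_store[data_id]
    mask = SENSITIVITY_ORDER[level] >= SENSITIVITY_ORDER[threshold]
    return {"data_id": data_id, "sensitivity": level, "mask": mask}


def render(value: str, mask: bool) -> str:
    """Replace every character with '*' when masking is indicated."""
    return "*" * len(value) if mask else value


# Association between data and its identified sensitivity (made-up identifiers).
store = {"user_profile.phone": "high", "user_profile.city": "low"}

phone_response = handle_display_request("user_profile.phone", store)
city_response = handle_display_request("user_profile.city", store)
print(render("13800000000", phone_response["mask"]))
print(render("Shenzhen", city_response["mask"]))
```

A partial mask that keeps the first and last few characters visible would fit the same interface by changing only `render`.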
In some embodiments, the apparatus further comprises:
the output module is used for outputting the encryption prompt information corresponding to the data to be identified when the sensitivity identification result represents that the data sensitivity of the data to be identified reaches the target data sensitivity;
and the encryption prompt information is used for prompting the encryption processing of the data to be identified.
Continuing to describe the training apparatus of the sensitivity recognition model provided in the embodiment of the present application, referring to fig. 12, fig. 12 is a schematic structural diagram of the training apparatus of the sensitivity recognition model provided in the embodiment of the present application, where the sensitivity recognition model includes a feature extraction layer and a sensitivity recognition layer, and the training apparatus 120 of the sensitivity recognition model includes:
a second obtaining module 121, configured to obtain metadata of a data sample, where the data sample carries a sensitivity label, and the sensitivity label is used to indicate a data sensitivity corresponding to the data sample;
the second extraction module 122 is configured to perform feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
the second identification module 123 is configured to perform sensitivity identification on the data sample based on the sample data characteristics through the sensitivity identification layer to obtain a sample sensitivity identification result;
an updating module 124, configured to obtain a difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and update the model parameter of the sensitivity identification model based on the difference.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method of the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform the method provided by embodiments of the present application.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A data sensitivity identification method based on a sensitivity identification model, which is characterized in that the sensitivity identification model comprises a feature extraction layer and a sensitivity identification layer, and the method comprises the following steps:
acquiring metadata of data to be identified, wherein the metadata is used for describing the data to be identified;
performing feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain data features of the metadata;
through the sensitivity identification layer, based on the data characteristics of the metadata, sensitivity identification is carried out on the data to be identified to obtain a sensitivity identification result;
and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
2. The method of claim 1, wherein the obtaining metadata for the data to be identified comprises:
when the storage form of the data to be identified is a data table, acquiring at least one of the following table elements from the data table: the data table name, the table description corresponding to the data to be identified in the data table, and the attribute field corresponding to the data to be identified in the data table;
and determining the acquired table elements as the metadata of the data to be identified.
3. The method of claim 1, wherein the obtaining metadata for the data to be identified comprises:
when the storage form of the data to be identified is a document, at least one of the following document contents is obtained from the document: document title, document abstract, document keyword;
and determining the obtained document content as the metadata of the data to be identified.
4. The method of claim 1, wherein the performing feature extraction on the metadata of the data to be identified to obtain the data features of the metadata comprises:
performing word segmentation processing on the metadata of the data to be identified to obtain a plurality of words corresponding to the metadata;
respectively carrying out feature coding on each word to obtain word features corresponding to each word;
and performing characteristic splicing on the word characteristics corresponding to each word to obtain the data characteristics corresponding to the metadata.
5. The method of claim 4, wherein the feature concatenation of the term features corresponding to each of the terms to obtain the data features corresponding to the metadata comprises:
respectively carrying out bidirectional coding processing on the word characteristics of each word to obtain the upper coding characteristics and the lower coding characteristics corresponding to each word;
respectively performing characteristic splicing on the upper coding characteristics and the lower coding characteristics of each word to obtain corresponding splicing coding characteristics;
and performing feature splicing on the splicing coding features corresponding to each word to obtain the data features corresponding to the metadata.
6. The method of claim 1, wherein the performing, by the sensitivity identification layer, sensitivity identification on the data to be identified based on the data characteristics of the metadata to obtain a sensitivity identification result comprises:
performing classification prediction corresponding to at least two sensitivity levels on the data characteristics of the metadata through the sensitivity identification layer to obtain the probability of the metadata corresponding to each sensitivity level;
and selecting the sensitivity grade with the highest probability as a sensitivity identification result of the data to be identified.
7. The method of claim 1, wherein the performing, by the feature extraction layer, feature extraction on the metadata of the data to be identified to obtain the data features of the metadata comprises:
when the metadata comprise at least two keywords, respectively performing feature extraction on each keyword through the feature extraction layer to obtain features corresponding to each keyword as data features of the metadata;
the sensitivity identification of the data to be identified based on the data characteristics of the metadata through the sensitivity identification layer to obtain a sensitivity identification result, including:
respectively matching the characteristics corresponding to the keywords with the characteristics corresponding to at least two sensitive words through the sensitivity identification layer to obtain corresponding matching degrees;
and selecting the data sensitivity corresponding to the sensitive word with the highest matching degree as a sensitivity identification result of the data to be identified.
8. The method of claim 1, wherein the method further comprises:
establishing an association relationship between the sensitivity identification result and the data to be identified, and storing the association relationship;
and the association relationship is used for searching the data sensitivity corresponding to the data to be identified based on the data to be identified.
9. The method of claim 8, wherein the establishing the association relationship between the sensitivity recognition result and the data to be recognized comprises:
and storing the sensitivity identification result to a target area associated with the data to be identified, wherein the target area is an area corresponding to the data sensitivity in the storage area corresponding to the metadata.
10. The method of claim 1, wherein the method further comprises:
responding to a data display request aiming at the data to be identified, and acquiring data sensitivity corresponding to the data to be identified;
when the data sensitivity corresponding to the data to be identified reaches a sensitivity threshold, returning shielding indication information corresponding to the data to be identified;
and the shielding indication information is used for indicating that the data to be identified is to be shielded (masked) when displayed.
11. The method of claim 1, wherein the method further comprises:
when the sensitivity identification result represents that the data sensitivity of the data to be identified reaches the target data sensitivity, outputting encryption prompt information corresponding to the data to be identified;
and the encryption prompt information is used for prompting the encryption processing of the data to be identified.
12. A method for training a sensitivity recognition model, wherein the sensitivity recognition model comprises a feature extraction layer and a sensitivity recognition layer, the method comprising:
acquiring metadata of a data sample, wherein the data sample carries a sensitivity label which is used for indicating the data sensitivity corresponding to the data sample;
performing feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
performing sensitivity identification on the data sample based on the sample data characteristics through the sensitivity identification layer to obtain a sample sensitivity identification result;
acquiring the difference between the sample sensitivity identification result and the sensitivity label, and updating the model parameters of the sensitivity identification model based on the difference;
the sensitivity identification model is used for outputting a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified after the metadata of the data to be identified is input to the sensitivity identification model.
13. A data sensitivity recognition apparatus based on a sensitivity recognition model, wherein the sensitivity recognition model includes a feature extraction layer and a sensitivity recognition layer, the apparatus comprising:
the first acquisition module is used for acquiring metadata of data to be identified, wherein the metadata is used for describing the data to be identified;
the first extraction module is used for extracting the characteristics of the metadata of the data to be identified through the characteristic extraction layer to obtain the data characteristics of the metadata;
the first identification module is used for carrying out sensitivity identification on the data to be identified through the sensitivity identification layer based on the data characteristics of the metadata to obtain a sensitivity identification result;
and the sensitivity identification result is used for indicating the data sensitivity corresponding to the data to be identified.
14. An apparatus for training a sensitivity recognition model, wherein the sensitivity recognition model comprises a feature extraction layer and a sensitivity recognition layer, the apparatus comprising:
the second acquisition module is used for acquiring metadata of a data sample, wherein the data sample carries a sensitivity label which is used for indicating the data sensitivity corresponding to the data sample;
the second extraction module is used for performing feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
the second identification module is used for carrying out sensitivity identification on the data sample based on the sample data characteristics through the sensitivity identification layer to obtain a sample sensitivity identification result;
the updating module is used for acquiring the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample and updating the model parameter of the sensitivity identification model based on the difference;
the sensitivity identification model is used for outputting a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified after the metadata of the data to be identified is input to the sensitivity identification model.
15. A computer-readable storage medium having stored thereon executable instructions for, when executed by a processor, implementing the method of any one of claims 1 to 12.
CN202110139667.XA 2021-02-01 2021-02-01 Data sensitivity identification method and device based on sensitivity identification model Pending CN114840869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139667.XA CN114840869A (en) 2021-02-01 2021-02-01 Data sensitivity identification method and device based on sensitivity identification model

Publications (1)

Publication Number Publication Date
CN114840869A true CN114840869A (en) 2022-08-02

Family

ID=82561083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139667.XA Pending CN114840869A (en) 2021-02-01 2021-02-01 Data sensitivity identification method and device based on sensitivity identification model

Country Status (1)

Country Link
CN (1) CN114840869A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906170A (en) * 2022-12-02 2023-04-04 杨磊 Safety protection method and AI system applied to storage cluster
CN116089225A (en) * 2023-04-12 2023-05-09 浙江大学 BiLSTM-based public data acquisition dynamic sensing system and method
CN116090006A (en) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906170A (en) * 2022-12-02 2023-04-04 杨磊 Safety protection method and AI system applied to storage cluster
CN115906170B (en) * 2022-12-02 2023-12-15 北京金安道大数据科技有限公司 Security protection method and AI system applied to storage cluster
CN116090006A (en) * 2023-02-01 2023-05-09 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN116090006B (en) * 2023-02-01 2023-09-08 北京三维天地科技股份有限公司 Sensitive identification method and system based on deep learning
CN116089225A (en) * 2023-04-12 2023-05-09 浙江大学 BiLSTM-based public data acquisition dynamic sensing system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination