CN114840869A - Data sensitivity identification method and device based on sensitivity identification model - Google Patents
- Publication number: CN114840869A
- Application number: CN202110139667.XA
- Authority: CN (China)
- Prior art keywords
- data
- sensitivity
- identified
- metadata
- sensitivity identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Description
Technical Field
The present application relates to artificial intelligence and Internet technologies, and in particular to a data sensitivity identification method based on a sensitivity identification model, a training method for the sensitivity identification model, and a corresponding apparatus, device, and computer-readable storage medium.
Background
In the data asset management of Internet companies, as business grows and user activity increases, a large amount of valuable data accumulates in database tables or text. Data sensitivity, as part of the metadata, classifies data by leakage risk, which makes it easier for developers to use the data and keep it confidential. However, if some valuable data lacks a specific data sensitivity or risk level and is not managed and maintained by developers, that data may be leaked when it is used, which can have a significant impact on the business.
In the related art, data sensitivity is identified manually: a database administrator identifies and determines the data sensitivity of the data to be identified based on personal experience. This approach is time-consuming and labor-intensive, and the probability of missing sensitive data is high.
Summary of the Invention
Embodiments of the present application provide a data sensitivity identification method based on a sensitivity identification model, a training method for the sensitivity identification model, and a corresponding apparatus, device, and computer-readable storage medium, which can improve the efficiency of data sensitivity identification and reduce the probability of missing sensitive data.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a data sensitivity identification method based on a sensitivity identification model, where the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the method includes:
obtaining metadata of data to be identified, where the metadata is used to describe the data to be identified;
performing feature extraction on the metadata of the data to be identified through the feature extraction layer, to obtain data features of the metadata;
performing sensitivity identification on the data to be identified through the sensitivity identification layer, based on the data features of the metadata, to obtain a sensitivity identification result;
where the sensitivity identification result indicates the data sensitivity corresponding to the data to be identified.
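The three steps of the method can be read as a two-stage pipeline. The following is a minimal sketch of that structure only, not the patented model: the function names, the two toy layers, and the "high"/"low" labels are all assumptions introduced for illustration.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def identify_sensitivity(
    metadata: str,
    feature_extraction_layer: Callable[[str], Features],
    sensitivity_identification_layer: Callable[[Features], str],
) -> str:
    """Run metadata through the two layers and return the identification result."""
    features = feature_extraction_layer(metadata)
    return sensitivity_identification_layer(features)


# Hypothetical stand-ins for the two layers; a trained model would replace both.
def toy_extract(metadata: str) -> Features:
    text = metadata.lower()
    # One indicator feature per assumed sensitive term.
    return [float("phone" in text), float("id_card" in text)]


def toy_identify(features: Features) -> str:
    # Any active indicator is treated as highly sensitive in this toy rule.
    return "high" if sum(features) > 0 else "low"
```

Passing the layers in as callables keeps the sketch agnostic about how each layer is built, which matches the way the summary describes the model only by its two roles.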
An embodiment of the present application provides a training method for a sensitivity identification model, where the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the method includes:
obtaining metadata of a data sample, where the data sample carries a sensitivity label, and the sensitivity label indicates the data sensitivity corresponding to the data sample;
performing feature extraction on the metadata of the data sample through the feature extraction layer, to obtain sample data features of the metadata of the data sample;
performing sensitivity identification on the data sample through the sensitivity identification layer, based on the sample data features, to obtain a sample sensitivity identification result;
obtaining the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and updating the model parameters of the sensitivity identification model based on the difference;
where the sensitivity identification model is configured to output, after the metadata of data to be identified is input into the sensitivity identification model, a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified.
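The update step ("obtain the difference, then update the model parameters") can be illustrated with a small example. The sketch below assumes a linear softmax classifier, cross-entropy as the measure of difference, and plain gradient descent; the source does not name a loss function or optimizer, so these choices, along with the learning rate and the toy samples, are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights: List[List[float]], features: List[float],
               label: int, lr: float = 0.5) -> float:
    """Update weights in place for one labeled sample; return the loss before the update."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    # Cross-entropy between the prediction and the sensitivity label.
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        # Gradient of cross-entropy w.r.t. logit k is (p_k - y_k).
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(features):
            row[j] -= lr * grad * x
    return loss


# Hypothetical samples: (sample data features, sensitivity label index).
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
weights = [[0.0, 0.0], [0.0, 0.0]]  # one weight row per sensitivity level
losses = [sum(train_step(weights, x, y) for x, y in samples) for _ in range(20)]
```

The summed loss per pass shrinks as training proceeds, which is the behavior the "difference" in the method is meant to drive.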
An embodiment of the present application provides a data sensitivity identification apparatus based on a sensitivity identification model, where the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the apparatus includes:
a first acquisition module, configured to acquire metadata of data to be identified, where the metadata is used to describe the data to be identified;
a first extraction module, configured to perform feature extraction on the metadata of the data to be identified through the feature extraction layer, to obtain data features of the metadata;
a first identification module, configured to perform sensitivity identification on the data to be identified through the sensitivity identification layer, based on the data features of the metadata, to obtain a sensitivity identification result;
where the sensitivity identification result indicates the data sensitivity corresponding to the data to be identified.
In the above solution, the first acquisition module is further configured to: when the data to be identified is stored as a data table, obtain at least one of the following table elements from the data table: the data table name, the table description corresponding to the data to be identified in the data table, and the attribute field corresponding to the data to be identified in the data table; and
determine the obtained table elements as the metadata of the data to be identified.
In the above solution, the first acquisition module is further configured to: when the data to be identified is stored as a document, obtain at least one of the following document contents from the document: the document title, the document abstract, and the document keywords; and
determine the obtained document content as the metadata of the data to be identified.
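For the document case, the acquisition step might look like the following sketch. The `Document` type and its field names are hypothetical; the source only states that the title, abstract, and keywords may serve as metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical document record; only title, abstract and keywords become metadata."""
    title: str = ""
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    body: str = ""


def document_metadata(doc: Document) -> str:
    """Join whichever of title, abstract and keywords are present; the body is ignored."""
    parts = [doc.title, doc.abstract, " ".join(doc.keywords)]
    return " ".join(p for p in parts if p)
```

Because the solution asks for "at least one" of the three elements, missing fields are simply skipped rather than treated as errors.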
In the above solution, the first extraction module is further configured to: perform word segmentation on the metadata of the data to be identified, to obtain a plurality of words corresponding to the metadata;
perform feature encoding on each of the words, to obtain the word feature corresponding to each word; and
concatenate the word features corresponding to the words, to obtain the data features corresponding to the metadata.
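The segment, encode, and concatenate sequence can be sketched as follows. The source does not specify a segmenter or an encoder, so both are placeholders here: a regular-expression split stands in for word segmentation (Chinese metadata would need a dedicated segmenter), and a hash-derived vector of assumed size `DIM` stands in for a learned word embedding.

```python
import hashlib
import re
from typing import List

DIM = 4  # assumed size of each word feature


def segment(metadata: str) -> List[str]:
    """Split on any run of characters that is not an ASCII letter or digit."""
    return [w for w in re.split(r"[^0-9A-Za-z]+", metadata.lower()) if w]


def encode_word(word: str) -> List[float]:
    """Deterministic placeholder encoding: first DIM bytes of a SHA-256 digest, scaled to [0, 1]."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def extract_features(metadata: str) -> List[float]:
    """Concatenate the word features of all segmented words."""
    features: List[float] = []
    for word in segment(metadata):
        features.extend(encode_word(word))
    return features
```

With this layout, a metadata string of n words yields a data feature of length n × `DIM`.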
In the above solution, the first extraction module is further configured to: perform bidirectional encoding on the word feature of each word, to obtain the preceding-context encoding feature and the following-context encoding feature corresponding to each word;
concatenate the preceding-context encoding feature and the following-context encoding feature of each word, to obtain the corresponding concatenated encoding feature; and
concatenate the concatenated encoding features corresponding to the words, to obtain the data features corresponding to the metadata.
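To make the data flow of bidirectional encoding concrete, the sketch below replaces the (unspecified) recurrent encoder with a running mean computed in each direction. This is a deliberately simple stand-in chosen for illustration, not the encoder used in the source; only the shape of the computation (per-word preceding and following features, concatenated twice) follows the text.

```python
from typing import List

Vector = List[float]


def running_mean(vectors: List[Vector]) -> List[Vector]:
    """For each position i, the mean of vectors[0..i]."""
    out: List[Vector] = []
    total = [0.0] * len(vectors[0])
    for i, vec in enumerate(vectors, start=1):
        total = [t + v for t, v in zip(total, vec)]
        out.append([t / i for t in total])
    return out


def bidirectional_encode(word_features: List[Vector]) -> Vector:
    """Concatenate, word by word, a preceding-context and a following-context feature."""
    if not word_features:
        return []
    preceding = running_mean(word_features)               # left-to-right pass
    following = running_mean(word_features[::-1])[::-1]   # right-to-left pass
    data_features: Vector = []
    for pre, fol in zip(preceding, following):
        data_features.extend(pre + fol)  # per-word concatenated encoding feature
    return data_features
```

For two one-dimensional word features `[[1.0], [3.0]]`, the first word sees only itself on the left and both words on the right, and the second word the reverse.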
In the above solution, the first identification module is further configured to: perform, through the sensitivity identification layer, classification prediction over at least two sensitivity levels on the data features of the metadata, to obtain the probability of the metadata corresponding to each sensitivity level; and
select the sensitivity level with the highest probability as the sensitivity identification result for the data to be identified.
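A minimal sketch of this classification step, assuming a linear layer followed by softmax (the source does not state how the probabilities are produced). The level names and weight values are hypothetical.

```python
import math
from typing import Dict, List


def level_probabilities(features: List[float],
                        weights: Dict[str, List[float]]) -> Dict[str, float]:
    """Softmax over one linear score per sensitivity level."""
    logits = {level: sum(w * x for w, x in zip(ws, features))
              for level, ws in weights.items()}
    m = max(logits.values())
    exps = {level: math.exp(z - m) for level, z in logits.items()}
    total = sum(exps.values())
    return {level: e / total for level, e in exps.items()}


def predict_level(features: List[float], weights: Dict[str, List[float]]) -> str:
    """Return the sensitivity level with the highest probability."""
    probs = level_probabilities(features, weights)
    return max(probs, key=probs.get)
```

Returning the full probability dictionary from `level_probabilities` keeps the per-level scores available, for example for auditing borderline predictions.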
In the above solution, the first extraction module is further configured to: when the metadata includes at least two keywords, perform feature extraction on each of the keywords through the feature extraction layer, and use the obtained feature corresponding to each keyword as the data features of the metadata;
correspondingly, the first extraction module is further configured to: match, through the sensitivity identification layer, the feature corresponding to each keyword against the features corresponding to at least two sensitive words, to obtain the corresponding matching degrees; and
select the data sensitivity corresponding to the sensitive word with the highest matching degree as the sensitivity identification result for the data to be identified.
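The matching variant can be sketched with cosine similarity as the matching degree. The source does not define how the matching degree is computed, so cosine similarity is an assumption, as are the sensitive-word table and its feature vectors.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_sensitivity(
    keyword_features: List[Vector],
    sensitive_words: Dict[str, Tuple[Vector, str]],
) -> Tuple[Optional[str], float]:
    """Return the sensitivity of the best-matching sensitive word and its matching degree."""
    best_sensitivity: Optional[str] = None
    best_score = -1.0
    for kf in keyword_features:
        for _word, (sf, sensitivity) in sensitive_words.items():
            score = cosine(kf, sf)
            if score > best_score:
                best_score, best_sensitivity = score, sensitivity
    return best_sensitivity, best_score
```

Every keyword is compared with every sensitive word, and the single best pair decides the result, which mirrors the "highest matching degree" selection in the text.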
In the above solution, the apparatus further includes:
a processing module, configured to establish an association between the sensitivity identification result and the data to be identified, and to store the association;
where the association is used to look up, based on the data to be identified, the data sensitivity corresponding to the data to be identified.
In the above solution, the processing module is further configured to store the sensitivity identification result in a target area associated with the data to be identified, where the target area is the area for data sensitivity within the storage area corresponding to the metadata.
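Both storage options, a separate association and a sensitivity field inside the metadata record, can be illustrated with plain dictionaries. The class name, the data identifiers, and the `"sensitivity"` key are hypothetical; a real system would likely persist these in a database or metadata catalog.

```python
from typing import Dict, Optional


class SensitivityRegistry:
    """In-memory association between data identifiers and identification results."""

    def __init__(self) -> None:
        self._by_data_id: Dict[str, str] = {}

    def associate(self, data_id: str, sensitivity: str) -> None:
        self._by_data_id[data_id] = sensitivity

    def lookup(self, data_id: str) -> Optional[str]:
        # None means the data has not been identified yet.
        return self._by_data_id.get(data_id)


def store_in_metadata(metadata_record: Dict[str, str], sensitivity: str) -> Dict[str, str]:
    """Return a copy of the metadata record with its sensitivity field filled in."""
    updated = dict(metadata_record)
    updated["sensitivity"] = sensitivity
    return updated
```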
In the above solution, the apparatus further includes:
a return module, configured to obtain, in response to a data display request for the data to be identified, the data sensitivity corresponding to the data to be identified; and
when the data sensitivity corresponding to the data to be identified reaches a sensitivity threshold, return masking indication information corresponding to the data to be identified;
where the masking indication information indicates that the data to be identified is to be displayed in masked form.
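A sketch of the display-request handling, assuming an ordinal scale of three levels and full-length asterisk masking. The level names, the default threshold, and the response shape are assumptions; the source only requires that a masking indication be returned once the threshold is reached.

```python
from typing import Dict, Union

# Assumed ordinal scale; the source does not enumerate the sensitivity levels.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def handle_display_request(value: str, sensitivity: str,
                           threshold: str = "high") -> Dict[str, Union[bool, str]]:
    """Return a masking indication and the text to show for a display request."""
    masked = LEVELS[sensitivity] >= LEVELS[threshold]
    return {"mask": masked, "display": "*" * len(value) if masked else value}
```

For example, `handle_display_request("13800000000", "high")` reports `mask` as true and returns eleven asterisks instead of the number.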
In the above solution, the apparatus further includes:
an output module, configured to output encryption prompt information corresponding to the data to be identified when the sensitivity identification result indicates that the data sensitivity of the data to be identified reaches a target data sensitivity;
where the encryption prompt information prompts that the data to be identified should be encrypted.
An embodiment of the present application provides a training apparatus for a sensitivity identification model, where the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the apparatus includes:
a second acquisition module, configured to acquire metadata of a data sample, where the data sample carries a sensitivity label, and the sensitivity label indicates the data sensitivity corresponding to the data sample;
a second extraction module, configured to perform feature extraction on the metadata of the data sample through the feature extraction layer, to obtain sample data features of the metadata of the data sample;
a second identification module, configured to perform sensitivity identification on the data sample through the sensitivity identification layer, based on the sample data features, to obtain a sample sensitivity identification result;
an update module, configured to obtain the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and to update the model parameters of the sensitivity identification model based on the difference;
where the sensitivity identification model is configured to output, after the metadata of data to be identified is input into the sensitivity identification model, a sensitivity identification result indicating the data sensitivity corresponding to the data to be identified.
An embodiment of the present application provides an electronic device, including:
a memory, configured to store executable instructions; and
a processor, configured to implement, when executing the executable instructions stored in the memory, the data sensitivity identification method based on a sensitivity identification model provided by the embodiments of the present application.
An embodiment of the present application provides an electronic device, including:
a memory, configured to store executable instructions; and
a processor, configured to implement, when executing the executable instructions stored in the memory, the training method for the sensitivity identification model provided by the embodiments of the present application.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the data sensitivity identification method based on a sensitivity identification model provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the training method for the sensitivity identification model provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
The server performs sensitivity identification on the metadata of the data to be identified through the sensitivity identification model. Specifically, it obtains the metadata that describes the data to be identified; performs feature extraction on that metadata through the feature extraction layer of the sensitivity identification model, to obtain the data features of the metadata; and performs sensitivity identification on the data to be identified through the sensitivity identification layer of the model, based on the data features of the metadata, to obtain a sensitivity identification result. In this way, once the metadata to be identified is input into the sensitivity identification model, a sensitivity identification result indicating the data sensitivity of the data to be identified is obtained automatically. Compared with manual identification, this greatly improves the efficiency of data sensitivity identification and reduces the probability of missing sensitive data.
Brief Description of the Drawings
FIG. 1 is an optional schematic architecture diagram of a data sensitivity identification system 100 based on a sensitivity identification model, provided by an embodiment of the present application;
FIG. 2 is an optional schematic structural diagram of an electronic device 500 provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data sensitivity identification method based on a sensitivity identification model, provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a sensitivity identification model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a sensitivity identification model provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a sensitivity identification model provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a training method for a sensitivity identification model, provided by an embodiment of the present application;
FIG. 8 is a schematic flowchart of a data sensitivity identification method based on a sensitivity identification model, provided by an embodiment of the present application;
FIG. 9 is a schematic flowchart of a data sensitivity identification method based on a sensitivity identification model, provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a sensitivity identification model provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data sensitivity identification apparatus based on a sensitivity identification model, provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a training apparatus for a sensitivity identification model, provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
In the following description, "some embodiments" describes a subset of all possible embodiments. It should be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with each other where no conflict arises.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
Before the embodiments of the present application are described in further detail, the nouns and terms involved in the embodiments are explained. The following explanations apply to the nouns and terms involved in the embodiments of the present application.
1) Metadata: data that describes data, or structured data that provides information about a resource (here, the data to be identified). It is mainly used to describe attribute information of the data to be identified and supports functions such as indicating storage location, historical data, resource lookup, and file recording. Metadata can be regarded as an electronic catalog: to serve the purpose of cataloging, it must describe and record the content or characteristics of the data, thereby assisting data retrieval.
2) "In response to": indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be in real time or may have a set delay. Unless otherwise specified, there is no restriction on the order in which multiple operations are performed.
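Based on the above explanation of the nouns and terms involved in the embodiments of the present application, the data sensitivity identification method based on a sensitivity identification model provided by the embodiments is described next. Referring to FIG. 1, FIG. 1 is an optional schematic architecture diagram of the data sensitivity identification system 100 based on a sensitivity identification model, provided by an embodiment of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown as examples) are connected to the server 200 through the network 300. The network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission.
In practical applications, a client such as Weibo, Zhihu, or an enterprise application is installed on the terminal. The client provides business-related data to be identified, or data to be identified that relates to user behavior, and sends the data to be identified to the server 200. The server 200 may be a single, separately configured server that supports various services, a server cluster, or a cloud server; for example, it may be the background server of the client or an information feed platform.
In actual implementation, the server 200 is configured to: obtain metadata of the data to be identified, where the metadata is used to describe the data to be identified; perform feature extraction on the metadata of the data to be identified through the feature extraction layer of the sensitivity identification model, to obtain data features of the metadata; and perform sensitivity identification on the data to be identified through the sensitivity identification layer of the sensitivity identification model, based on the data features of the metadata, to obtain a sensitivity identification result, where the sensitivity identification result indicates the data sensitivity corresponding to the data to be identified.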
Next, an electronic device that implements the data sensitivity identification method based on a sensitivity identification model according to the embodiments of the present application is described. Referring to FIG. 2, FIG. 2 is an optional schematic structural diagram of an electronic device 500 provided by an embodiment of the present application. In practical applications, the electronic device 500 may be a terminal in FIG. 1 (such as terminal 400-1 or terminal 400-2) or the server 200. Taking the case where the electronic device is the server 200 shown in FIG. 1 as an example, the electronic device 500 shown in FIG. 2 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The components of the electronic device 500 are coupled together through a bus system 540, which enables connection and communication between these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all labeled as the bus system 540 in FIG. 2.
The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 530 includes one or more output devices 531 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, and optical disc drives. The memory 550 optionally includes one or more storage devices that are physically remote from the processor 510.
The memory 550 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.
In some embodiments, the memory 550 can store data to support various operations. Examples of such data include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, used to implement various basic services and process hardware-based tasks;
a network communication module 552, configured to reach other computing devices via one or more (wired or wireless) network interfaces 520, where exemplary network interfaces 520 include Bluetooth, WiFi, and Universal Serial Bus (USB);
a presentation module 553, configured to enable the presentation of information (for example, a user interface for operating peripheral devices and displaying content and information) via one or more output devices 531 (for example, display screens and speakers) associated with the user interface 530;
an input processing module 554, configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the data sensitivity identification apparatus based on a sensitivity identification model provided by the embodiments of the present application may be implemented in software. FIG. 2 shows the data sensitivity identification apparatus 555 based on a sensitivity identification model stored in the memory 550, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: a first acquisition module 5551, a first extraction module 5552, and a first identification module 5553. These modules are logical, so they may be arbitrarily combined or further split according to the functions implemented. The function of each module is described below.
In other embodiments, the data sensitivity identification apparatus based on a sensitivity identification model provided by the embodiments of the present application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the data sensitivity identification method based on a sensitivity identification model provided by the embodiments of the present application. For example, the processor in the form of a hardware decoding processor may use one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
Based on the above description of the data sensitivity identification system and the electronic device based on the sensitivity identification model, the data sensitivity identification method based on the sensitivity identification model provided by the embodiments of the present application is described next. In some embodiments, the method may be implemented by a terminal or a server alone, for example by terminal 400-1, terminal 400-2, or server 200 in FIG. 1, or jointly by a server and a terminal, for example by terminal 400-1 and server 200 in FIG. 1. The following description refers to FIG. 1 and FIG. 3, where FIG. 3 is a schematic flowchart of the data sensitivity identification method based on the sensitivity identification model provided by the embodiments of the present application, and takes as an example the case in which server 200 in FIG. 1 implements the method.
Step 101: The server obtains metadata of the data to be identified, where the metadata describes the data to be identified.
In practical applications, the data to be identified may be data related to an enterprise's business or data related to individual users; it may be obtained from a database or acquired in real time; and it may be stored as a data table or in a textual form such as plain text or logs. The metadata mainly describes the attributes of the data to be identified. For example, if the data to be identified relates to a shopping business, the metadata may be items such as shopping account, order number, name, mobile phone number, and delivery address; if the data to be identified relates to individual users, the metadata may be items such as name, ID card number, mobile phone number, email address, bank card number, home address, and employer.
In some embodiments, the server may obtain the metadata of the data to be identified as follows: when the data to be identified is stored as a data table, obtain at least one of the following table elements from the data table: the data table name, the table description corresponding to the data to be identified, and the attribute fields corresponding to the data to be identified; and determine the obtained table elements as the metadata of the data to be identified.
Here, in practical applications, if the data to be identified is stored as a data table, the data table name, table description, or attribute fields of the data table are used as the metadata of the data to be identified. For example, attribute fields such as the data table name, the table's Chinese name, the table owner, field names, or field types in the data table corresponding to the data to be identified are used as metadata.
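As an illustration only, the table-based metadata acquisition described above might be sketched in Python as follows. The `Table` and `Column` classes, the `table_metadata` function, and the sample table are hypothetical names introduced for this sketch; they are not part of the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str   # field name
    type: str   # field type


@dataclass
class Table:
    name: str                    # data table name
    description: str = ""        # table description
    columns: List[Column] = field(default_factory=list)


def table_metadata(table: Table) -> List[str]:
    """Collect table elements (name, description, fields) as metadata strings."""
    items = [table.name]
    if table.description:
        items.append(table.description)
    for col in table.columns:
        items.append(f"{col.name} {col.type}")
    return items


# Hypothetical example table for a shopping business.
orders = Table(
    name="shop_orders",
    description="customer orders for the shopping business",
    columns=[Column("order_id", "BIGINT"), Column("phone", "VARCHAR")],
)
print(table_metadata(orders))
```

Only the descriptive elements of the table are read here; the rows themselves are never touched, which is the point of identifying sensitivity from metadata.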
In some embodiments, the server may also obtain the metadata of the data to be identified as follows: when the data to be identified is stored as a document, obtain at least one of the following from the document: the document title, the document abstract, and the document keywords; and determine the obtained document content as the metadata of the data to be identified.
Here, the document may be a Word document or a plain-text (for example, txt) document. When the data to be identified is stored as a document, its document title, document abstract, and document keywords are used as metadata. When the data to be identified includes a document title and a document body, the document title may be used as metadata, and key summary content (that is, the document abstract) may additionally be extracted from the document body as metadata. The reason is that the document body may be lengthy in practice; identifying every document body in full would impose a heavy computational load and lower identification efficiency. The core topic of the data to be identified can usually be summarized by one or a few of its sentences. Therefore, to capture the core topic effectively while improving identification efficiency, summary content that characterizes the core topic may be extracted from the document body and used as the metadata of the data to be identified.
In some embodiments, the document keywords may be obtained by performing keyword extraction on the document body, and the document abstract of the data to be identified may be obtained as follows: perform sentence extraction on the document body of the data to be identified to obtain multiple target sentences; determine the sentence weight of each target sentence according to the word weights of the keywords in that sentence; sort the target sentences in descending order of sentence weight to obtain a sentence sequence; and, starting from the first target sentence in the sequence, select a target number of target sentences and use them as the document abstract of the data to be identified.
To determine the sentence weight of each target sentence from the word weights of its keywords, the server may perform the following operations on each target sentence: perform keyword extraction on the target sentence to obtain multiple keywords; obtain the term frequency of each keyword in the document body and the inverse document frequency of each keyword; determine the word weight of each keyword based on its term frequency and inverse document frequency; and sum the word weights of the keywords to obtain the sentence weight of the target sentence.
Here, the term frequency is the ratio of the number of times the keyword appears in the data to be identified to the total number of words in the data to be identified. The inverse document frequency characterizes the rarity of the keyword and is expressed as the logarithm of the ratio of the total number of data items in the data set to which the data to be identified belongs to the number of data items in that data set that contain the keyword. In other words, both the frequency and the rarity of a keyword are taken into account: in practice, the importance of a keyword increases with its frequency in the data to be identified and decreases with the number of data items in the data set that contain it. Generally, the more data items contain the keyword, the more generic it is and the less it reflects the distinctive character of the data. Finally, the sum of the word weights of the keywords in a target sentence is taken as the sentence weight of that target sentence. A sentence weight is thus obtained for each target sentence, and the larger the sentence weight, the better the corresponding target sentence represents the core topic of the data to be identified.
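The sentence-weighting procedure above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: every distinct token is treated as a keyword, the word weight is taken as the product of term frequency and inverse document frequency, and sentences are split on simple punctuation. The names `tokenize` and `summarize` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for a real word segmenter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(body, corpus, top_n=1):
    """Rank the sentences of `body` by summed keyword weights; return the top_n."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", body) if s.strip()]
    words = tokenize(body)
    tf = Counter(words)
    doc_sets = [set(tokenize(doc)) for doc in corpus]

    def word_weight(word):
        # Term frequency: occurrences in the body over total words in the body.
        term_freq = tf[word] / len(words)
        # Inverse document frequency: log(total items / items containing the word).
        containing = sum(1 for doc in doc_sets if word in doc)
        idf = math.log(len(doc_sets) / max(1, containing))
        return term_freq * idf  # assumed combination (product)

    def sentence_weight(sentence):
        # Sentence weight: sum of the weights of the distinct keywords it contains.
        return sum(word_weight(w) for w in set(tokenize(sentence)))

    ranked = sorted(sentences, key=sentence_weight, reverse=True)
    return ranked[:top_n]


body = "Passport numbers are stored here. The report is short."
corpus = [body, "The report is short.", "The report is here."]
print(summarize(body, corpus, top_n=1))
```

In this toy corpus the words "the", "report", and "is" occur in every item, so their inverse document frequency is zero and the sentence built from them ranks last.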
In the above manner, subsequent data sensitivity identification is performed on the obtained metadata of the data to be identified. Because the metadata characterizes the attributes of the data to be identified while being much smaller in volume than the data itself, identification accuracy can be ensured and identification efficiency improved.
Step 102: Through the feature extraction layer, perform feature extraction on the metadata of the data to be identified to obtain the data features of the metadata.
In some embodiments, referring to FIG. 4, which is a schematic structural diagram of the sensitivity identification model provided by an embodiment of the present application, the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer. The metadata of the data to be identified is input into the sensitivity identification model; the feature extraction layer performs feature extraction on the metadata to obtain the corresponding data features, and the sensitivity identification layer performs sensitivity identification on the data features to obtain the sensitivity identification result.
In some embodiments, referring to FIG. 5, which is a schematic structural diagram of the sensitivity identification model provided by an embodiment of the present application, the server may perform feature extraction on the metadata of the data to be identified to obtain the data features of the metadata as follows:
Perform word segmentation on the metadata of the data to be identified to obtain multiple words corresponding to the metadata; perform feature encoding on each word to obtain the word feature of each word; and concatenate the word features of the words to obtain the data feature corresponding to the metadata.
Here, in actual implementation, word segmentation is performed on the metadata of the data to be identified, such as the data table name or table description, to obtain multiple words or characters corresponding to the metadata. In practical applications, to make the words or characters easy to look up, a unique index value may be assigned to each word or character, so that the corresponding word or character is retrieved by its index value. Feature encoding, such as word-vector conversion, is then performed on each word or character to obtain the corresponding word feature, that is, a word vector. Finally, the word features of the words or characters are concatenated to obtain the corresponding data feature, that is, a sentence vector.
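The segmentation, indexing, encoding, and concatenation steps above might look like the following sketch. It is illustrative only: splitting on whitespace and underscores stands in for a real word segmenter, the word vectors are random rather than learned, and `build_vocab`, `make_embeddings`, and `encode` are hypothetical names.

```python
import random


def segment(text):
    """Toy word segmentation: split on whitespace and underscores."""
    return text.lower().replace("_", " ").split()


def build_vocab(texts):
    """Assign each distinct word a unique index value."""
    vocab = {}
    for text in texts:
        for token in segment(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embeddings(vocab_size, dim, seed=0):
    """Random word vectors; a trained embedding table would be used in practice."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def encode(text, vocab, embeddings):
    """Map words to index values, look up word vectors, and concatenate them."""
    ids = [vocab[token] for token in segment(text)]
    vectors = [embeddings[i] for i in ids]
    return ids, [x for vec in vectors for x in vec]


vocab = build_vocab(["user_phone number"])
embeddings = make_embeddings(len(vocab), dim=4)
ids, sentence_vector = encode("phone number", vocab, embeddings)
print(ids, len(sentence_vector))
```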
In some embodiments, the server may concatenate the word features of the words to obtain the data feature corresponding to the metadata as follows:
Perform bidirectional encoding on the word feature of each word to obtain the preceding-context encoding feature and the following-context encoding feature of each word; concatenate the preceding-context and following-context encoding features of each word to obtain the corresponding concatenated encoding feature; and concatenate the concatenated encoding features of the words to obtain the data feature corresponding to the metadata.
Here, the context of each word is taken into account. After the word vector of each word is obtained, it is input into a bidirectional encoding layer, such as a bidirectional long short-term memory (Bi-LSTM) layer. The Bi-LSTM layer includes two LSTMs, one processing the input sequence in the forward direction and one in the reverse direction. The forward pass (for example, left to right) extracts the preceding-context encoding feature of each word, and the backward pass (for example, right to left) extracts the following-context encoding feature of each word. Finally, the preceding-context and following-context encoding features are concatenated to obtain the concatenated encoding feature of the corresponding word.
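The data flow of the bidirectional encoding can be illustrated with the sketch below. Note the simplification: a toy `tanh` recurrence stands in for the LSTM cell, so this shows only how the forward and backward states are produced and concatenated, not an actual Bi-LSTM. The names `run_direction` and `bidirectional_encode` and the `decay` parameter are assumptions made for the example.

```python
import math


def run_direction(vectors, decay=0.5):
    """Toy recurrent pass: each state mixes the previous state with the input."""
    state = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        state = [math.tanh(decay * s + x) for s, x in zip(state, vec)]
        states.append(state)
    return states


def bidirectional_encode(vectors):
    """Concatenate forward and backward states per word, then across words."""
    forward = run_direction(vectors)               # left to right: preceding context
    backward = run_direction(vectors[::-1])[::-1]  # right to left: following context
    per_word = [f + b for f, b in zip(forward, backward)]
    return [x for feat in per_word for x in feat]


word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
feature = bidirectional_encode(word_vectors)
print(len(feature))  # 3 words x (2 forward + 2 backward) values
```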
Step 103: Through the sensitivity identification layer, perform sensitivity identification on the data to be identified based on the data features of the metadata to obtain a sensitivity identification result.
In some embodiments, the server may perform this sensitivity identification through the sensitivity identification layer as follows:
Through the sensitivity identification layer, perform classification prediction on the data features of the metadata over at least two sensitivity levels to obtain the probability of the metadata for each sensitivity level; and select the sensitivity level with the highest probability as the sensitivity identification result of the data to be identified.
The sensitivity identification result indicates the data sensitivity of the data to be identified. Data sensitivity can be expressed in several ways, for example as a sensitivity level or a sensitivity value. When a sensitivity value is used, a larger value indicates more sensitive data. When a sensitivity level is used, the custom levels are, in order: public, internally public, generally sensitive, particularly sensitive, and highly confidential, corresponding to the natural numbers 1 to 5. An enterprise may define its own data sensitivity level standard with reference to industry standards and the data security regulations issued by national legislative bodies.
It should be noted that the number of sensitivity levels should both allow data sensitivity to be distinguished reasonably and keep it feasible to apply security controls per level; four to five levels are generally reasonable. When five levels are used, they are, from high to low: 5 (highly confidential), 4 (particularly sensitive), 3 (generally sensitive), 2 (internally public), and 1 (public). For a data table, sensitivity levels are defined at the granularity of individual fields. For example, fields such as ID card number and mobile phone number are level 5, and name, email address, and delivery address are level 4. Alternatively, custom level names alone may be used to express the data sensitivity of the data to be identified; for example, five levels may be defined: top secret, confidential, high sensitivity, medium sensitivity, and low sensitivity.
Here, assume that the data to be identified may take one of the following five sensitivity levels: top secret, confidential, high sensitivity, medium sensitivity, and low sensitivity. If classification prediction on the data features of the metadata through the sensitivity identification layer yields the following probabilities for these levels: top secret (90%), confidential (40%), high sensitivity (30%), medium sensitivity (15%), and low sensitivity (10%), then the level with the highest probability (90%), top secret, is selected as the sensitivity identification result of the data to be identified.
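Selecting the most probable level can be sketched as follows. The use of softmax to turn raw scores into probabilities is an assumption of this sketch (the text only states that a probability is obtained for each level), and the input scores are made up for illustration; `softmax` and `classify` are hypothetical names.

```python
import math

# Level names taken from the example in the text, ordered low to high.
LEVELS = ["low sensitivity", "medium sensitivity", "high sensitivity",
          "confidential", "top secret"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, levels=LEVELS):
    """Return the level with the highest probability, plus all probabilities."""
    probs = softmax(scores)
    best = max(range(len(levels)), key=probs.__getitem__)
    return levels[best], probs


level, probs = classify([0.1, 0.2, 0.5, 1.0, 3.0])  # hypothetical layer outputs
print(level)
```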
In some embodiments, referring to FIG. 6, which is a schematic structural diagram of the sensitivity identification model provided by an embodiment of the present application, the server may also perform feature extraction on the metadata of the data to be identified through the feature extraction layer as follows: when the metadata includes at least two keywords, perform feature extraction on each keyword through the feature extraction layer, and use the feature of each keyword as the data features of the metadata. Correspondingly, the server may perform sensitivity identification on the data to be identified through the sensitivity identification layer as follows: match the feature of each keyword against the features of at least two sensitive words to obtain the corresponding matching degrees; and select the data sensitivity corresponding to the sensitive word with the highest matching degree as the sensitivity identification result of the data to be identified.
Here, the server stores in advance the correspondence between sensitive words and data sensitivities. For example, sensitive word 1 corresponds to top secret, sensitive word 2 to confidential, sensitive word 3 to high sensitivity, sensitive word 4 to low sensitivity, and sensitive word 5 to low sensitivity. Assume that the keywords of the metadata of the data to be identified are keyword 1 and keyword 2. The feature extraction layer extracts a feature for keyword 1 and a feature for keyword 2. The sensitivity identification layer matches the feature of keyword 1 against the features of sensitive words 1 to 5, giving matching degrees of 10%, 20%, 30%, 40%, and 80% respectively, and matches the feature of keyword 2 against the features of sensitive words 1 to 5, giving matching degrees of 20%, 10%, 30%, 40%, and 60% respectively. The highest matching degree, 80%, belongs to sensitive word 5, so its data sensitivity, low sensitivity, is taken as the sensitivity identification result of the data to be identified.
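The keyword-to-sensitive-word matching above can be sketched as follows. Cosine similarity is used as the matching degree purely as an assumption, since the text does not specify the measure; the feature vectors are made up, and `cosine` and `match_sensitivity` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed matching degree)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_sensitivity(keyword_features, sensitive_words):
    """Return the sensitivity of the best-matching sensitive word and its score.

    keyword_features: {keyword: feature vector}
    sensitive_words:  {sensitive word: (feature vector, data sensitivity)}
    """
    score, level = max(
        ((cosine(kf, feat), level)
         for kf in keyword_features.values()
         for feat, level in sensitive_words.values()),
        key=lambda pair: pair[0],
    )
    return level, score


keywords = {"keyword 1": [1.0, 0.0], "keyword 2": [0.6, 0.8]}
sensitive = {
    "sensitive word 1": ([0.0, 1.0], "top secret"),
    "sensitive word 5": ([1.0, 0.1], "low sensitivity"),
}
print(match_sensitivity(keywords, sensitive))
```

As in the worked example in the text, the result is decided by the single highest matching degree across all keyword and sensitive-word pairs, regardless of which keyword produced it.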
In some embodiments, after obtaining the sensitivity identification result, the server may further establish an association between the sensitivity identification result and the data to be identified and store the association, where the association is used to look up the data sensitivity of the data to be identified based on the data to be identified.
In some embodiments, the server may establish the association between the sensitivity identification result and the data to be identified as follows:
Store the sensitivity identification result in a target area associated with the data to be identified, where the target area is the area for data sensitivity within the storage area corresponding to the metadata.
Here, after the server determines the data sensitivity of the data to be identified, it may add that data sensitivity to the area associated with the data to be identified that is used to indicate data sensitivity, for example by filling in the "sensitivity level" column of the data table. The data sensitivity thus becomes part of the metadata, available for users to use and maintain.
In some embodiments, after obtaining the sensitivity identification result, the server may further obtain the data sensitivity of the data to be identified in response to a data display request for that data. When the data sensitivity of the data to be identified reaches a sensitivity threshold, the server returns masking indication information for the data to be identified, where the masking indication information indicates that the data to be identified is to be masked when displayed.
Here, when the data sensitivity of the data to be identified reaches the sensitivity threshold, the data is relatively sensitive, for example confidential or top-secret data. In this case, the server returns the masking indication information for the data to the terminal so that the user can protect the data at the terminal, for example by masking confidential or top-secret data to prevent leakage. In addition, only part of the data to be identified may be displayed. For example, if a table contains sensitive information such as users' ID card numbers that should not be shown to others, a view may be used to hide that field.
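A field-level masking step of the kind described above could be sketched as follows. The per-field levels follow the example given earlier in the description (ID card number at level 5, name at level 4); the function name `render_row`, the mask string, and the default level of 1 for unlabelled fields are assumptions of this sketch.

```python
def render_row(row, field_levels, threshold=5, mask="***"):
    """Replace values whose field sensitivity reaches the threshold with a mask."""
    return {
        name: (mask if field_levels.get(name, 1) >= threshold else value)
        for name, value in row.items()
    }


# Hypothetical record and per-field sensitivity levels.
row = {"name": "Li", "id_number": "110101199001011234", "city": "Shenzhen"}
levels = {"id_number": 5, "name": 4, "city": 1}

print(render_row(row, levels, threshold=5))
```

Lowering `threshold` to 4 would also mask the `name` field, which is one way to apply different display policies to different audiences.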
In some embodiments, after obtaining the sensitivity identification result, when the result indicates that the data sensitivity of the data to be identified reaches a target data sensitivity, the server may further output an encryption prompt for the data to be identified, where the encryption prompt reminds the user to encrypt the data to be identified.
In the above manner, when the data sensitivity of the data to be identified reaches a certain level, the user is prompted to encrypt the data when maintaining or using it, to prevent leakage. For example, when the data to be identified needs to be transferred from one database to another, the method provided by the embodiments of the present application automatically identifies its data sensitivity, and data that meets a certain sensitivity is then encrypted, obfuscated, or similarly processed, further improving the security of the data being transferred.
The training of the sensitivity identification model is described next. Referring to FIG. 7, which is a schematic flowchart of the training method of the sensitivity identification model provided by an embodiment of the present application, in some embodiments the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the method includes:
Step 201: The server obtains metadata of a data sample, where the data sample carries a sensitivity label that indicates the data sensitivity of the data sample.
Step 202: Through the feature extraction layer, perform feature extraction on the metadata of the data sample to obtain the sample data features of the metadata.
Step 203: Through the sensitivity identification layer, perform sensitivity identification on the data sample based on the sample data features to obtain a sample sensitivity identification result.
Step 204: Obtain the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and update the model parameters of the sensitivity identification model based on that difference.
In actual implementation, the value of the loss function of the sensitivity identification model may be determined from the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample. When the value of the loss function reaches a preset threshold, a corresponding error signal is determined from that value; the error signal is propagated backward through the sensitivity identification model, and the model parameters of each layer are updated during the propagation.
Backpropagation is explained here. Training data samples are fed into the input layer of the neural network model, pass through the hidden layers, and reach the output layer, which outputs a result; this is the forward propagation process of the neural network model. Because the output of the model differs from the actual result, the error between the output and the actual value is computed and propagated backward from the output layer through the hidden layers until it reaches the input layer. During backpropagation, the values of the model parameters are adjusted according to the error. This process is iterated until convergence.
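The update rule can be illustrated with the deliberately small sketch below. It trains only a single softmax layer with a cross-entropy loss, so the "backward" step is one gradient computation rather than propagation through several layers, and the choice of loss, learning rate, and threshold value are all assumptions. The parameters are updated only while the loss has reached `loss_threshold`, mirroring the description above. The names `train` and `predict` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scores_for(W, x):
    """Forward pass: one linear score per sensitivity level."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train(samples, num_classes, dim, lr=0.5, epochs=200, loss_threshold=0.01):
    """Gradient-descent training of a single softmax layer (simplified sketch)."""
    W = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, label in samples:
            probs = softmax(scores_for(W, x))
            loss = -math.log(probs[label])  # cross-entropy against the label
            if loss < loss_threshold:
                continue  # loss has not reached the threshold: no update
            for c in range(num_classes):
                # Error signal: gradient of the loss with respect to score c.
                err = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * err * x[j]
    return W


def predict(W, x):
    s = scores_for(W, x)
    return max(range(len(W)), key=s.__getitem__)


# Hypothetical sample features with sensitivity labels 0 and 1.
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = train(samples, num_classes=2, dim=2)
print(predict(W, [1.0, 0.0]), predict(W, [0.0, 1.0]))
```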
In the above manner, the server inputs the metadata of the data to be identified into the sensitivity identification model and automatically obtains a sensitivity identification result indicating the data sensitivity of the data to be identified. Compared with manual identification, this greatly improves the efficiency of data sensitivity identification and reduces the probability of overlooking sensitive data.
The data sensitivity identification method based on the sensitivity identification model provided by the embodiments of the present application is described further below. In some embodiments, with reference to FIG. 1 and FIG. 8, where FIG. 8 is a schematic flowchart of the data sensitivity identification method based on the sensitivity identification model provided by an embodiment of the present application, the description takes as an example the case in which the terminal and server 200 in FIG. 1 jointly implement the method. The sensitivity identification model provided by the embodiments of the present application includes a feature extraction layer and a sensitivity identification layer, and the method includes:
Step 301: The server obtains metadata of a data sample, where the data sample carries a sensitivity label that indicates the data sensitivity of the data sample.
Step 302: The server performs feature extraction on the metadata of the data sample through the feature extraction layer to obtain the sample data features of the metadata.
Step 303: The server performs sensitivity identification on the data sample through the sensitivity identification layer, based on the sample data features, to obtain a sample sensitivity identification result.
Step 304: The server obtains the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and updates the model parameters of the sensitivity identification model based on that difference.
In the above manner, the sensitivity identification model is obtained by training.
Step 305: The terminal transmits the user's data to be identified to the server.
Step 306: If the data to be identified is stored as a data table, the server obtains the data table name, table description, or attribute fields of the data table as metadata.
Step 307: The server performs feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain the data features of the metadata.
Step 308: The server performs sensitivity identification on the data to be identified through the sensitivity identification layer, based on the data features of the metadata, to obtain a sensitivity identification result.
Step 309: The server stores the sensitivity identification result in the target area associated with the data to be identified.
Here, the target area is the area for data sensitivity within the storage area corresponding to the metadata of the data to be identified.
通过上述方式,通过训练好的敏感度识别模型对待识别数据的数据敏感度进行识别,并将相应的敏感度识别结果存储在待识别数据关联的目标区域上,使得待识别数据的数据敏感度成为元数据的一部分,大大提高数据敏感度的识别效率,且避免对待识别数据的数据敏感度的漏查。Through the above method, the data sensitivity of the data to be identified is identified through the trained sensitivity identification model, and the corresponding sensitivity identification result is stored in the target area associated with the data to be identified, so that the data sensitivity of the data to be identified becomes Part of the metadata, which greatly improves the identification efficiency of data sensitivity, and avoids the missed check of the data sensitivity of the data to be identified.
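Steps 306 and 309 can be sketched as follows. This is a minimal illustration, not the implementation described in this application: the `TableMetadata` record, its field names, and the `classify` callback (standing in for the trained model) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TableMetadata:
    """Hypothetical metadata record for one data table."""
    table_name: str
    description: str = ""
    attribute_fields: List[str] = field(default_factory=list)
    # Plays the role of the "target area": the slot inside the metadata
    # that is reserved for the data sensitivity.
    sensitivity: Optional[str] = None


def metadata_text(meta: TableMetadata) -> str:
    """Step 306: gather table name, description and attribute fields."""
    parts = [meta.table_name, meta.description, *meta.attribute_fields]
    return " ".join(p for p in parts if p)


def identify_and_store(meta: TableMetadata,
                       classify: Callable[[str], str]) -> TableMetadata:
    """Steps 307-309: classify the metadata text and store the result."""
    meta.sensitivity = classify(metadata_text(meta))
    return meta


# Toy classifier used only for illustration; a real system would call
# the trained sensitivity identification model here.
def toy_classifier(text: str) -> str:
    return "top secret" if "password" in text else "low sensitivity"


meta = TableMetadata("user_login_password", "account credentials",
                     ["user_id", "pwd_hash"])
result = identify_and_store(meta, toy_classifier)
```

Because the result is written back into the metadata record, later lookups only need the metadata and do not have to run the model again.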
An exemplary application of the embodiments of this application in a practical scenario is described below. The data sensitivity identification method based on a sensitivity identification model provided by the embodiments of this application mainly uses machine learning to identify the data sensitivity of the data to be identified. Referring to FIG. 9, which is a schematic flowchart of the method, the identification of data sensitivity includes two parts: training the sensitivity identification model (the training phase), and identifying the data sensitivity of the data to be identified with the trained model (the identification phase). Each is described in turn below.
In the training phase, metadata of a data sample is obtained, where the data sample carries a sensitivity label indicating the data sensitivity corresponding to the data sample. That is, the metadata of the data sample input to the sensitivity identification model includes: the data table name, the table description corresponding to the data sample in the data table, and the data sensitivity (the sensitivity label) to which the data table (the data sample) belongs. Data sensitivity can be characterized by a sensitivity level, and there are five sensitivity levels: top secret, confidential, high sensitivity, medium sensitivity, and low sensitivity.
Generally, data related to user account security is top secret; user personal information and finance-related data is confidential; user behavior data is high sensitivity; coarse-grained roll-ups of confidential data are medium sensitivity; and ordinary statistical data is low sensitivity.
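The five levels and the typical assignments above can be written down as a small lookup. This is only a sketch: the numeric ordering of the enum and the category keys are assumptions made for illustration and do not appear in this application.

```python
from enum import IntEnum


class SensitivityLevel(IntEnum):
    # Numeric values are an assumption: higher means more sensitive.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CONFIDENTIAL = 4
    TOP_SECRET = 5


# Hypothetical category names mapped to the typical level given in the text.
TYPICAL_LEVEL = {
    "account_security": SensitivityLevel.TOP_SECRET,
    "personal_or_financial": SensitivityLevel.CONFIDENTIAL,
    "user_behavior": SensitivityLevel.HIGH,
    "rollup_of_confidential": SensitivityLevel.MEDIUM,
    "general_statistics": SensitivityLevel.LOW,
}
```

Using an ordered enum makes later threshold checks (for example, "mask everything at or above confidential") a plain comparison.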
During training, the data table name and table description of each data sample serve as the sample point, and the sensitivity level serves as the sensitivity label. The sensitivity identification model is trained by optimization to learn the relationship between the table name and table description on the one hand and the sensitivity level on the other. After training completes, the model parameters of the sensitivity identification model are saved.
In the identification phase, the model parameters saved in the training phase are loaded first. Then the metadata of the data to be identified, that is, its data table name and table description, is input to the trained sensitivity identification model, which identifies the data sensitivity of the data and outputs a sensitivity level indicating that data sensitivity.
Next, the structure of the sensitivity identification model is described. Referring to FIG. 10, which is a schematic structural diagram of the sensitivity identification model provided by the embodiments of this application, the model includes an input layer, a feature extraction layer, and a sensitivity identification layer, where the feature extraction layer includes an embedding layer, a bidirectional encoding layer, and a pooling layer. The model is described below using the identification of the data sensitivity of the data to be identified as the example application.
1. Input layer
In the input layer, the metadata of the data to be identified, such as the data table name or table description, is first segmented to obtain the words or characters corresponding to the metadata. A unique index value is then assigned to each word or character: if the i-th input word is $w_i$, indexing yields a unique integer $I_i = I(w_i)$. Finally, each word or character obtained by segmentation, together with its index value, is passed to the feature extraction layer.
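The segmentation and indexing performed by the input layer can be sketched as below. The splitting rule is an assumption chosen for English-style table names; Chinese table descriptions would need a proper word segmenter, which is not shown. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Split on non-alphanumeric characters; a stand-in for a real segmenter."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def build_index(tokens, vocab=None):
    """Assign each distinct token a unique integer I(w), in first-seen order."""
    vocab = {} if vocab is None else vocab
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize("user_login_record")
vocab = build_index(tokens)
indices = [vocab[t] for t in tokens]  # one integer I_i per word w_i
```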
2. Feature extraction layer
1) Embedding layer
In the embedding layer, the corresponding word or character is first retrieved from its index value, and then each word or character is converted to a word vector (that is, feature-encoded) to obtain the corresponding word vector (that is, the word feature).
Let the embedding matrix be $E \in \mathbb{R}^{V \times D}$, where $V$ is the total number of words and $D$ is the dimension of each word vector. To obtain the word vector of the i-th word, its index value is first converted to a one-hot vector of length $V$ that has a 1 at position $I_i$ and 0 at every other position. Multiplying the one-hot vector by the matrix $E$ yields the word vector $e_i$ of that word. The specific expression is:
$O_i \in \{0, 1\}^{V}, \qquad e_i = O_i^{\top} E \in \mathbb{R}^{D}$
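Multiplying a one-hot vector by the embedding matrix simply selects one row of the matrix, which is why practical implementations use a table lookup instead. The sketch below checks this on a made-up 3 x 2 matrix; the values and helper names are illustrative only.

```python
def one_hot(index, size):
    """Length-`size` vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def vec_mat(vec, mat):
    """Row vector (length V) times matrix (V x D) gives a length-D vector."""
    return [sum(v * row[d] for v, row in zip(vec, mat))
            for d in range(len(mat[0]))]


E = [[0.1, 0.2],   # V = 3 words, D = 2 dimensions (made-up values)
     [0.3, 0.4],
     [0.5, 0.6]]
I_i = 1
e_i = vec_mat(one_hot(I_i, len(E)), E)  # same as the row E[I_i]
```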
2) Bidirectional encoding layer
After the word vector of each word is obtained, it is input to the bidirectional encoding layer, for example a bidirectional long short-term memory (Bi-LSTM) layer. The Bi-LSTM layer contains two LSTMs, one reading the input sequence in the forward direction and one reading it in reverse, so that features from both the preceding and the following context are considered and the contextual semantics are fully fused.
In practice, the word vector of each word is encoded bidirectionally to obtain the preceding-context encoded feature and the following-context encoded feature of that word; the two features are then concatenated to obtain the word's concatenated encoded feature. The specific expression is:
where $l$ denotes left to right and $r$ denotes right to left. The inputs in each direction are the hidden state of the previous word, the current input word, and the cell state of the previous word. The forward process (for example, left to right) extracts the preceding-context encoded feature, the backward process (for example, right to left) extracts the following-context encoded feature vector, and $C_t$, $h_t$ denote the concatenated encoded feature of the word, obtained by concatenating the preceding-context and following-context encoded features.
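The expression itself is not reproduced in this text. One plausible reconstruction, assuming the standard Bi-LSTM formulation and using $e_t$ for the word vector of the current input word (the superscripted symbols are assumptions inferred from the notation described above, not taken from the original formula), is:

```latex
\begin{aligned}
(C_t^{l},\, h_t^{l}) &= \mathrm{LSTM}^{l}\!\left(h_{t-1}^{l},\; e_t,\; C_{t-1}^{l}\right) \\
(C_t^{r},\, h_t^{r}) &= \mathrm{LSTM}^{r}\!\left(h_{t+1}^{r},\; e_t,\; C_{t+1}^{r}\right) \\
C_t &= \left[\,C_t^{l};\, C_t^{r}\,\right], \qquad
h_t = \left[\,h_t^{l};\, h_t^{r}\,\right]
\end{aligned}
```

In the right-to-left direction the "previous" word in processing order is the one at position $t+1$, which is why the second line is indexed that way.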
3) Pooling layer
The bidirectional encoding layer yields the concatenated encoded feature of each word. The pooling layer then splices the concatenated encoded features of all words to obtain the corresponding sentence vector (that is, the data feature of the metadata). The specific expression is:
where $z$ is the sentence vector corresponding to the metadata of the data to be identified, $C_t$, $h_t$ are the concatenated encoded feature of the current input word, and $L$ is the total number of input words.
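The pooling expression is likewise not reproduced here. Since its description involves the total word count $L$, one reasonable reading is mean pooling over the per-word hidden states; the sketch below assumes that reading. Max pooling or plain concatenation would be other possible interpretations.

```python
def mean_pool(features):
    """Average L per-word feature vectors into a single sentence vector z."""
    L = len(features)
    if L == 0:
        raise ValueError("need at least one word feature")
    dim = len(features[0])
    return [sum(h[d] for h in features) / L for d in range(dim)]


# Two made-up per-word features h_1, h_2 of dimension 2.
h = [[1.0, 2.0], [3.0, 6.0]]
z = mean_pool(h)
```

Averaging gives a sentence vector whose size does not depend on how many words the metadata contains, so the fully connected layers that follow can have a fixed input dimension.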
3. Sensitivity identification layer
Here, the sensitivity identification layer is also called the multilayer perceptron (MLP) layer. An MLP consists of several fully connected neural network layers. The data feature corresponding to the metadata of the data to be identified passes through the sensitivity identification layer, which outputs the probability that the data to be identified belongs to each sensitivity level. Taking a three-layer fully connected neural network as an example, the output probability for each sensitivity level can be obtained from the following expression:
$a_i = f(W_3\, f(W_2\, f(W_1 z + b_1) + b_2) + b_3)$
where $f$ is a nonlinear activation function; $z$ is the sentence vector, obtained from the pooling layer, that corresponds to the metadata of the data to be identified; $W_1$, $W_2$, and $W_3$ are the trainable weights of the first, second, and third fully connected layers; $b_1$, $b_2$, and $b_3$ are the corresponding trainable bias parameters; $a_i$ denotes the i-th sensitivity level; $A$ denotes the number of sensitivity levels; and $p_i$ denotes the probability that the sensitivity level of the data to be identified is $a_i$.
Then the sensitivity level with the highest probability is selected as the sensitivity level of the data to be identified, the finally determined sensitivity level is output, and the output sensitivity level is added to the metadata of the data to be identified.
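The forward pass and the final selection can be sketched as follows. The text defines $p_i$ and $A$ but the step that turns the layer outputs into probabilities is not shown; a softmax over the $A$ outputs is assumed here. The choice of `tanh` for $f$ and all weight values are likewise illustrative assumptions.

```python
import math


def dense(W, b, x):
    """One fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(a):
    """Assumed normalisation: p_i = exp(a_i) / sum_j exp(a_j)."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


def predict(z, layers, levels, f=math.tanh):
    """Apply f(W x + b) for each layer, then pick the most probable level."""
    a = z
    for W, b in layers:
        a = [f(v) for v in dense(W, b, a)]
    p = softmax(a)
    best = max(range(len(p)), key=lambda i: p[i])
    return levels[best], p


LEVELS = ["top secret", "confidential", "high", "medium", "low"]
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [
    (identity, [0.0, 0.0]),                       # W1, b1
    (identity, [0.0, 0.0]),                       # W2, b2
    ([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0],
      [-1.0, 0.0], [0.0, -0.5]], [0.0] * 5),      # W3, b3 (A = 5 outputs)
]
level, probs = predict([1.0, -1.0], layers, LEVELS)
```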
It should be noted that the structure of the sensitivity identification model described above can be configured according to actual needs. For example, the metadata of the data to be identified may be input to the input layer and passed through it to the feature extraction layer, so that segmentation or indexing of the metadata is performed in the feature extraction layer, and so on. This application does not specifically limit the structure of the sensitivity identification model.
Once the structure of the sensitivity identification model is laid out, the model can be trained with stochastic gradient descent so that the model parameters reach an optimum or a local optimum. For example, the metadata of the obtained data sample is passed through the input layer to the feature extraction layer; the feature extraction layer performs feature extraction on the metadata of the data sample to obtain sample data features of that metadata; the sensitivity identification layer performs sensitivity identification on the data sample based on the sample data features to obtain a sample sensitivity identification result; the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample is obtained; and the model parameters of the sensitivity identification model are updated based on the obtained difference.
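The update loop can be illustrated on a deliberately simplified model. The sketch below swaps the embedding, Bi-LSTM, and MLP layers for a single softmax layer over hand-made features, and uses cross-entropy as the "difference" between prediction and label; both are assumptions made to keep the example short, and the function names and hyperparameters are hypothetical.

```python
import math
import random


def softmax(a):
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_sgd(samples, n_features, n_classes, lr=0.5, epochs=200, seed=0):
    """samples: list of (feature_vector, label_index) pairs."""
    rng = random.Random(seed)
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                 # "stochastic": one sample at a time
        for x, y in data:
            p = softmax(logits(W, b, x))  # prediction for this sample
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit c: p_c - 1[c == y].
                g = p[c] - (1.0 if c == y else 0.0)
                b[c] -= lr * g
                for j, xj in enumerate(x):
                    W[c][j] -= lr * g * xj
    return W, b


def predict(W, b, x):
    scores = logits(W, b, x)
    return max(range(len(scores)), key=lambda c: scores[c])


# Made-up features: [mentions "password", mentions "stats"]; labels 0 / 1.
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W, b = train_sgd(samples, n_features=2, n_classes=2)
```

After training, the parameters `W` and `b` play the role of the saved model parameters that the identification phase loads.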
In addition, the sensitivity identification model provided by the embodiments of this application may also be trained with traditional machine learning methods, such as the fast text classifier FastText, or with deep learning approaches, such as the general BERT (Bidirectional Encoder Representations from Transformers) model, the TextCNN model, the Chinese pre-trained RoBERTa model, or the Chinese-trained ELECTRA model. The fully connected neural network provided by the embodiments of this application may also be replaced by an attention network, a recurrent neural network, a convolutional neural network, and the like.
In the above manner, the metadata to be identified is input to the sensitivity identification model, the corresponding sensitivity level is identified automatically by machine learning, and the identified sensitivity level is added to the metadata of the data to be identified. Compared with manual identification, this greatly improves the efficiency of identifying data sensitivity and reduces the probability of missing sensitive data.
The following describes an exemplary structure in which the data sensitivity identification apparatus 555 based on a sensitivity identification model, provided by the embodiments of this application, is implemented as software modules. In some embodiments, as shown in FIG. 11, which is a schematic structural diagram of the apparatus, the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the apparatus includes:
a first acquisition module 5551, configured to acquire metadata of the data to be identified, the metadata being used to describe the data to be identified;
a first extraction module 5552, configured to perform feature extraction on the metadata of the data to be identified through the feature extraction layer to obtain data features of the metadata;
a first identification module 5553, configured to perform sensitivity identification on the data to be identified through the sensitivity identification layer, based on the data features of the metadata, to obtain a sensitivity identification result;
where the sensitivity identification result indicates the data sensitivity corresponding to the data to be identified.
In some embodiments, the first acquisition module is further configured to, when the data to be identified is stored as a data table, acquire at least one of the following table elements from the data table: the data table name, the table description in the data table corresponding to the data to be identified, and the attribute fields in the data table corresponding to the data to be identified;
and determine the acquired table elements as the metadata of the data to be identified.
In some embodiments, the first acquisition module is further configured to, when the data to be identified is stored as a document, acquire at least one of the following document contents from the document: the document title, the document abstract, and the document keywords;
and determine the acquired document contents as the metadata of the data to be identified.
In some embodiments, the first extraction module is further configured to segment the metadata of the data to be identified to obtain a plurality of words corresponding to the metadata;
perform feature encoding on each of the words to obtain the word feature corresponding to each word;
and splice the word features corresponding to the words to obtain the data features corresponding to the metadata.
In some embodiments, the first extraction module is further configured to perform bidirectional encoding on the word feature of each word to obtain the preceding-context encoded feature and the following-context encoded feature corresponding to each word;
concatenate the preceding-context encoded feature and the following-context encoded feature of each word to obtain the corresponding concatenated encoded feature;
and splice the concatenated encoded features corresponding to the words to obtain the data features corresponding to the metadata.
In some embodiments, the first identification module is further configured to, through the sensitivity identification layer, perform classification prediction over at least two sensitivity levels on the data features of the metadata to obtain the probability of the metadata for each sensitivity level;
and select the sensitivity level with the highest probability as the sensitivity identification result for the data to be identified.
In some embodiments, the first extraction module is further configured to, when the metadata includes at least two keywords, perform feature extraction on each keyword through the feature extraction layer and use the features corresponding to the keywords as the data features of the metadata;
correspondingly, the first extraction module is further configured to, through the sensitivity identification layer, match the feature corresponding to each keyword against the features corresponding to at least two sensitive words to obtain the corresponding matching degrees;
and select the data sensitivity corresponding to the sensitive word with the highest matching degree as the sensitivity identification result for the data to be identified.
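The keyword-matching variant can be sketched as below. The text does not say how the matching degree is computed; cosine similarity between feature vectors is assumed here, and the sensitive-word table with its vectors and levels is made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def match_sensitivity(keyword_vecs, sensitive_words):
    """sensitive_words maps word -> (feature_vector, sensitivity_level).

    Returns the level of the sensitive word with the highest matching
    degree against any keyword, together with that matching degree.
    """
    best_score, best_level = -1.0, None
    for kv in keyword_vecs:
        for vec, level in sensitive_words.values():
            score = cosine(kv, vec)
            if score > best_score:
                best_score, best_level = score, level
    return best_level, best_score


sensitive_words = {
    "password": ([1.0, 0.0], "top secret"),
    "statistics": ([0.0, 1.0], "low sensitivity"),
}
keyword_vecs = [[0.9, 0.1], [0.2, 0.3]]  # made-up keyword features
level, score = match_sensitivity(keyword_vecs, sensitive_words)
```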
In some embodiments, the apparatus further includes:
a processing module, configured to establish an association between the sensitivity identification result and the data to be identified, and to store the association;
where the association is used to look up, based on the data to be identified, the data sensitivity corresponding to that data.
In some embodiments, the processing module is further configured to store the sensitivity identification result in a target area associated with the data to be identified, the target area being the area reserved for data sensitivity within the storage area corresponding to the metadata.
In some embodiments, the apparatus further includes:
a return module, configured to, in response to a data display request for the data to be identified, obtain the data sensitivity corresponding to the data to be identified;
and, when the data sensitivity corresponding to the data to be identified reaches a sensitivity threshold, return masking indication information corresponding to the data to be identified;
where the masking indication information indicates that the data to be identified is to be displayed in masked form.
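The behaviour of the return module can be sketched as a simple threshold check. The record layout, the numeric sensitivity scale (1 to 5, higher is more sensitive), the default threshold, and the asterisk masking are all assumptions made for illustration.

```python
def handle_display_request(record, threshold=4):
    """record: dict with a 'value' and a numeric 'sensitivity' (1..5).

    Returns masking indication information plus the text to display.
    """
    text = str(record["value"])
    if record["sensitivity"] >= threshold:
        # Sensitivity reaches the threshold: instruct masked display.
        return {"mask": True, "display": "*" * len(text)}
    return {"mask": False, "display": text}


phone = handle_display_request({"value": "13800000000", "sensitivity": 5})
count = handle_display_request({"value": 1024, "sensitivity": 1})
```

Because the sensitivity was stored with the metadata at identification time, this check is a lookup and a comparison and does not need to run the model when data is displayed.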
In some embodiments, the apparatus further includes:
an output module, configured to, when the sensitivity identification result indicates that the data sensitivity of the data to be identified reaches a target data sensitivity, output encryption prompt information corresponding to the data to be identified;
where the encryption prompt information prompts that the data to be identified should be encrypted.
Next, the training apparatus for the sensitivity identification model provided by the embodiments of this application is described. Referring to FIG. 12, which is a schematic structural diagram of the training apparatus, the sensitivity identification model includes a feature extraction layer and a sensitivity identification layer, and the training apparatus 120 for the sensitivity identification model includes:
a second acquisition module 121, configured to acquire metadata of a data sample, where the data sample carries a sensitivity label and the sensitivity label indicates the data sensitivity corresponding to the data sample;
a second extraction module 122, configured to perform feature extraction on the metadata of the data sample through the feature extraction layer to obtain sample data features of the metadata of the data sample;
a second identification module 123, configured to perform sensitivity identification on the data sample through the sensitivity identification layer, based on the sample data features, to obtain a sample sensitivity identification result;
an update module 124, configured to obtain the difference between the sample sensitivity identification result and the sensitivity label carried by the data sample, and to update the model parameters of the sensitivity identification model based on the difference.
The embodiments of this application provide a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method described above in the embodiments of this application.
The embodiments of this application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of this application.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any device that includes one of the above memories or any combination of them.
In some embodiments, the executable instructions may take the form of a program, software, a software module, a script, or code, written in any form of programming language (including compiled or interpreted languages, and declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to a file in a file system. They may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subroutines, or code sections).
As an example, the executable instructions may be deployed to execute on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above are merely embodiments of this application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and scope of this application falls within the scope of protection of this application.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110139667.XA CN114840869B (en) | 2021-02-01 | 2021-02-01 | Data sensitivity identification method and device based on sensitivity identification model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110139667.XA CN114840869B (en) | 2021-02-01 | 2021-02-01 | Data sensitivity identification method and device based on sensitivity identification model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114840869A true CN114840869A (en) | 2022-08-02 |
| CN114840869B CN114840869B (en) | 2025-07-04 |
Family
ID=82561083
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110139667.XA Active CN114840869B (en) | 2021-02-01 | 2021-02-01 | Data sensitivity identification method and device based on sensitivity identification model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114840869B (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107818077A (en) * | 2016-09-13 | 2018-03-20 | 北京金山云网络技术有限公司 | A kind of sensitive content recognition methods and device |
| US10878124B1 (en) * | 2017-12-06 | 2020-12-29 | Dataguise, Inc. | Systems and methods for detecting sensitive information using pattern recognition |
| CN110737770A (en) * | 2018-07-03 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Text data sensitivity identification method and device, electronic equipment and storage medium |
| CN109657243A (en) * | 2018-12-17 | 2019-04-19 | 江苏满运软件科技有限公司 | Sensitive information recognition methods, system, equipment and storage medium |
| CN110222170A (en) * | 2019-04-25 | 2019-09-10 | 平安科技(深圳)有限公司 | A kind of method, apparatus, storage medium and computer equipment identifying sensitive data |
| WO2020215571A1 (en) * | 2019-04-25 | 2020-10-29 | 平安科技(深圳)有限公司 | Sensitive data identification method and device, storage medium, and computer apparatus |
| CN110674414A (en) * | 2019-09-20 | 2020-01-10 | 北京字节跳动网络技术有限公司 | Target information identification method, device, equipment and storage medium |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115906170A (en) * | 2022-12-02 | 2023-04-04 | 杨磊 | Safety protection method and AI system applied to storage cluster |
| CN115906170B (en) * | 2022-12-02 | 2023-12-15 | 北京金安道大数据科技有限公司 | Security protection method and AI system applied to storage cluster |
| CN116090006A (en) * | 2023-02-01 | 2023-05-09 | 北京三维天地科技股份有限公司 | Sensitive identification method and system based on deep learning |
| CN116090006B (en) * | 2023-02-01 | 2023-09-08 | 北京三维天地科技股份有限公司 | Sensitive identification method and system based on deep learning |
| CN116089225A (en) * | 2023-04-12 | 2023-05-09 | 浙江大学 | A dynamic perception system and method for public data collection based on BiLSTM |
| CN117851751A (en) * | 2023-11-30 | 2024-04-09 | 深圳市马博士网络科技有限公司 | Sensitive data identification method and device, electronic equipment and storage medium |
| CN118520116A (en) * | 2024-07-23 | 2024-08-20 | 湖州智慧城市研究院有限公司 | Sensitive information identification method and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114840869B (en) | 2025-07-04 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11599714B2 (en) | Methods and systems for modeling complex taxonomies with natural language understanding | |
| CN111563141B (en) | Method and system for processing an input problem to query a database | |
| US11501080B2 (en) | Sentence phrase generation | |
| CN114840869A (en) | Data sensitivity identification method and device based on sensitivity identification model | |
| CN112307770A (en) | Sensitive information detection method and device, electronic equipment and storage medium | |
| CN115080039B (en) | Front-end code generation method, device, computer equipment, storage medium and product | |
| KR102532216B1 (en) | Method for establishing ESG database with structured ESG data using ESG auxiliary tool and ESG service providing system performing the same | |
| CN116861363A (en) | Multimodal feature processing methods, devices, storage media and electronic equipment | |
| CN113918710B (en) | Text data processing method, device, electronic device and readable storage medium | |
| CN116956818A (en) | Text material processing methods, devices, electronic equipment and storage media | |
| CN117725220A (en) | Method, server and storage medium for document characterization and document retrieval | |
| CN117609418A (en) | Document processing method, device, electronic equipment and storage medium | |
| CN118210874A (en) | Data processing method, device, computer equipment and storage medium | |
| CN117633162A (en) | Machine learning task template generation method, training method, fine adjustment method and equipment | |
| CN119669317A (en) | Information display method and device, electronic device, storage medium and program product | |
| CN112329429B (en) | Text similarity learning method, device, equipment and storage medium | |
| CN114325384A (en) | A crowdsourcing acquisition system and method based on motor fault knowledge | |
| CN119106104A (en) | A content retrieval method, device, equipment, intelligent agent and storage medium | |
| CN117235236B (en) | Dialogue method, dialogue device, computer equipment and storage medium | |
| CN118520976A (en) | Text dialogue generation model training method, text dialogue generation method and equipment | |
| CN118627513A (en) | Text evaluation method and electronic device | |
| CN118396647A (en) | An intelligent question-answering method for e-government environment | |
| CN115168609B (en) | A text matching method, device, computer equipment and storage medium | |
| US11663251B2 (en) | Question answering approach to semantic parsing of mathematical formulas | |
| CN114093447B (en) | Data asset recommendation method, device, computer equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |














