CN113378222A - File encryption method and system based on data content identification

Publication number
CN113378222A
Authority
CN
China
Prior art keywords
file
information
sensitive information
content
encryption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110658517.XA
Other languages
Chinese (zh)
Inventor
秦凯
喻波
王闻馨
王志海
安鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-15
Filing date
2021-06-15
Publication date
2021-09-10
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN202110658517.XA
Publication of CN113378222A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a file encryption method and system based on data content identification. The method comprises: importing an external file to obtain an imported file; identifying and analyzing the content of the imported file, determining whether the file contains relevant sensitive information, and determining the frequency and position at which the sensitive information occurs; classifying and grading the file according to the sensitive information and the frequency and position of its occurrence; encrypting the file according to the classification and grading result to obtain encryption information; storing the file information of the imported file and the encryption information in a database; and returning the encryption result and displaying it in the human-computer interaction layer.

Description

File encryption method and system based on data content identification
Technical Field
The invention relates to the technical field of intelligent document encryption, in particular to a file encryption method and system based on data content identification.
Background
With the spread of networks, network communication has become part of every aspect of modern production and daily life, and the security of communicated information is receiving growing attention. Once a file has been sent over the network it is out of the sender's control: the sender can hardly restrict how the file is used or further distributed, so the file is at risk of being misused or maliciously propagated. To address this, the prior art adds a hidden tag to the file or changes file attributes so that the tag is not lost during transmission and copying, which makes the file traceable in such scenarios. The tag information is closely tied to the confidentiality scope and level of the file. This patent improves on and supplements existing marking technology by addressing some known defects in the marking process.
As shown in FIG. 6, the prior art currently comprises the following:
1. the file encryption system consists of a database module, an encryption module and an association module;
2. the database module stores information on all files that have been encryption-marked, including file metadata and encryption information;
3. the encryption module implements the encryption process and the process of identifying encrypted files;
4. the association module is the main interaction layer and is responsible for passing encryption information to the encryption module.
Publication CN104657677A discloses a file encryption method based on alternate data streams, which writes the data of the file to be encrypted onto an NTFS volume through an alternate data stream and thereby implements file encryption. The method comprises: creating an extended data stream for the file; setting the file security level identification information, including a security marker and a security level, and automatically acquiring the marking time; entering a marking password as a verification password; and calling file I/O to write the security level identification information into the extended data stream of the file, which completes the encryption marking.
The prior art has the following defects:
1. manual marking: encryption marking relies on subjective human judgment, which introduces uncertainty;
2. metadata-based encryption: only structured data is supported, so the requirements of unstructured data cannot be met.
Disclosure of Invention
The invention aims to provide a file encryption method and system based on data content identification that solve the above technical problems in the prior art.
In a first aspect, the invention provides a file encryption method based on data content identification, comprising the following steps:
S1: importing an external file to obtain an imported file;
S2: identifying and analyzing the content of the imported file, determining whether the file contains relevant sensitive information, and determining the frequency and position at which the sensitive information occurs;
S3: classifying and grading the file according to the sensitive information and the frequency and position of its occurrence;
S4: encrypting the file according to the classification and grading result to obtain encryption information;
S5: storing the file information of the imported file and the encryption information in a database;
S6: returning the encryption result and displaying it in the human-computer interaction layer.
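Steps S1 to S6 can be read as a simple processing pipeline. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the term list, the grade labels, the thresholds and the in-memory list standing in for the database are all hypothetical, and plain keyword matching stands in for the full content identification.

```python
import re
from dataclasses import dataclass

# Hypothetical sensitive terms; the patent does not enumerate them.
SENSITIVE_TERMS = ("confidential", "secret")

@dataclass
class MarkingResult:
    file_name: str
    hits: dict          # term -> list of character offsets
    level: str          # hypothetical grade label
    stored: bool = False

def identify(text: str) -> dict:
    """S2: find each sensitive term and record where it occurs."""
    hits = {}
    for term in SENSITIVE_TERMS:
        offsets = [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]
        if offsets:
            hits[term] = offsets
    return hits

def grade(hits: dict) -> str:
    """S3: toy grading rule based only on the total hit count (an assumption)."""
    total = sum(len(offsets) for offsets in hits.values())
    if total == 0:
        return "public"
    return "internal" if total < 3 else "restricted"

def run_pipeline(file_name: str, text: str, database: list) -> MarkingResult:
    """S1 is represented by the caller passing in the imported file's text."""
    hits = identify(text)                                   # S2
    result = MarkingResult(file_name, hits, grade(hits))    # S3 and S4
    database.append(result)                                 # S5: stand-in for a real database
    result.stored = True
    return result                                           # S6: handed back to the interaction layer

db = []
demo = run_pipeline("memo.txt", "Secret plan. This is confidential and secret.", db)
print(demo.level, demo.hits)
```

Keeping each step as its own function mirrors the module split described later in the text, so any one stage could be swapped for a stronger implementation without touching the others.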
Preferably, identifying and analyzing the content of the imported file and determining whether it contains relevant sensitive information comprises format conversion, content extraction and content identification:
format conversion identifies, parses and converts the format type of the imported file to obtain a format-converted file;
content extraction extracts the file content from the format-converted file to obtain the extracted file content;
content identification extracts preset information from the extracted file content and determines whether the file contains relevant sensitive information.
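The format conversion and content extraction stages can be illustrated with a small Python sketch. It assumes format detection by leading bytes and wires up extractors only for plain text and HTML; the function names and the set of formats are hypothetical choices for illustration, and the patent does not say which formats or detection method are used.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def detect_format(raw: bytes) -> str:
    """Guess the format type from the first bytes of the file."""
    head = raw[:512].lstrip().lower()
    if head.startswith(b"%pdf"):
        return "pdf"
    if head.startswith(b"pk\x03\x04"):
        return "zip-container"      # e.g. office documents packaged as ZIP
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "text"

def extract_content(raw: bytes) -> str:
    """Convert the file to plain text so that later stages see one uniform form."""
    fmt = detect_format(raw)
    if fmt == "html":
        collector = _TextCollector()
        collector.feed(raw.decode("utf-8", errors="replace"))
        collector.close()
        # Collapse runs of whitespace left over from the markup.
        return " ".join(" ".join(collector.parts).split())
    if fmt == "text":
        return raw.decode("utf-8", errors="replace")
    raise NotImplementedError(f"no extractor wired up for {fmt!r} in this sketch")

print(extract_content(b"<html><body><p>Salary  list</p></body></html>"))
```

Formats without an extractor raise an error rather than silently returning nothing, so an unsupported file is never mistaken for a file with no sensitive content.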
Preferably, the preset information is extracted from the extracted file content as follows:
keyword identification, regular expression matching, data identifier identification and file fingerprint identification are applied to the extracted file content.
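The four identification techniques named above can be sketched side by side in Python. Everything concrete here is an assumption made for illustration: the keyword set, the e-mail pattern, the reading of "data identifier" as a pattern plus a checksum validator (a Luhn check on 16-digit numbers), and the use of a SHA-256 digest of normalized text as the file fingerprint.

```python
import hashlib
import re

KEYWORDS = {"confidential", "internal only"}              # hypothetical keyword list
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")    # hypothetical regular-expression rule
CARD_RE = re.compile(r"\b\d{16}\b")                       # candidate pattern for a data identifier
KNOWN_FINGERPRINTS = set()                                # digests of documents known to be protected

def luhn_ok(digits: str) -> bool:
    """Checksum validation, so a random 16-digit number is not reported."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def fingerprint(text: str) -> str:
    """Digest of case-folded, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def identify_content(text: str) -> dict:
    lowered = text.lower()
    return {
        "keywords": sorted(k for k in KEYWORDS if k in lowered),
        "regex": EMAIL_RE.findall(text),
        "data_identifiers": [c for c in CARD_RE.findall(text) if luhn_ok(c)],
        "fingerprint_match": fingerprint(text) in KNOWN_FINGERPRINTS,
    }

sample = "Confidential: card 4111111111111111, fake 1234567812345678, mail bob@example.com"
print(identify_content(sample))
```

An exact-digest fingerprint only catches copies that differ in case or spacing; detecting partial copies would need a different fingerprinting scheme, which is outside this sketch.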
Preferably, the frequency at which the sensitive information occurs is determined as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its frequency of occurrence.
Preferably, the position at which the sensitive information occurs is determined as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its position.
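The patent does not name a clustering algorithm. As a stand-in, the Python sketch below uses simple one-dimensional gap clustering: hits whose character offsets lie close together are grouped into one region, which yields both a per-term frequency and a list of positions. The function names and the `max_gap` threshold are hypothetical.

```python
import re
from collections import Counter

def find_hits(text, terms):
    """Return (term, offset) pairs for every occurrence, ordered by offset."""
    hits = []
    for term in terms:
        for m in re.finditer(re.escape(term), text, re.IGNORECASE):
            hits.append((term.lower(), m.start()))
    return sorted(hits, key=lambda h: h[1])

def cluster_by_gap(hits, max_gap=40):
    """Group hits that start within max_gap characters of the previous hit."""
    clusters = []
    for hit in hits:
        if clusters and hit[1] - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters

def summarize(text, terms, max_gap=40):
    hits = find_hits(text, terms)
    clusters = cluster_by_gap(hits, max_gap)
    return {
        # How often each term occurs in the whole file.
        "frequency": dict(Counter(term for term, _ in hits)),
        # One (first_offset, last_offset, hit_count) tuple per dense region.
        "regions": [(c[0][1], c[-1][1], len(c)) for c in clusters],
    }

text = "salary salary" + "." * 100 + "salary"
print(summarize(text, ["salary"]))
```

Reporting regions rather than raw offsets makes it easy to tell a document with one dense sensitive passage from one with isolated mentions scattered throughout.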
Preferably, before the file is classified and graded, the method further comprises:
performing semantic analysis on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression.
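One way to picture "avoiding a hard judgment on a single keyword" is to score the context around a keyword hit instead of the hit itself. The Python sketch below is only an illustration under strong assumptions: it replaces the unspecified machine learning model with a bag-of-words cosine similarity against a hand-written topic profile, and the profile weights and window size are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical topic profile; a real system would learn this from labelled data.
SENSITIVE_PROFILE = Counter(
    {"contract": 3, "pricing": 3, "customer": 2, "salary": 2, "secret": 1}
)

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def context_score(text, keyword, window=5):
    """Best similarity between the profile and the words around any keyword hit."""
    words = tokens(text)
    best = 0.0
    for i, word in enumerate(words):
        if word == keyword:
            context = Counter(words[max(0, i - window): i + window + 1])
            best = max(best, cosine(context, SENSITIVE_PROFILE))
    return best

business = context_score("the secret contract pricing for each customer", "secret")
social = context_score("our secret santa party is on friday", "secret")
print(round(business, 3), round(social, 3))
```

Both sentences contain the keyword "secret", but only the first scores highly, so a downstream grading step can treat the two hits differently.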
Preferably, the file is classified and graded as follows:
the confidentiality level and business classification of the imported file are determined from the frequency of the sensitive information, its position in the file and its specific semantic meaning.
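The grading step combines three signals: frequency, position and semantics. A minimal Python sketch of such a rule follows; the level names, the weights and the thresholds are hypothetical and chosen only to make the combination concrete.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    category: str          # hypothetical business category, e.g. "finance"
    frequency: int         # number of occurrences of the sensitive item
    first_offset: int      # character offset of the first occurrence
    semantic_score: float  # 0..1, as produced by some semantic model

def grade_file(evidence, doc_length):
    """Return (level, business_category) using hypothetical weights and thresholds."""
    if not evidence:
        return ("public", None)

    def score(e):
        # Hits in the first tenth of the document (title, header) count more.
        position_weight = 1.5 if e.first_offset < 0.1 * doc_length else 1.0
        return min(e.frequency, 10) * position_weight * e.semantic_score

    top = max(evidence, key=score)
    s = score(top)
    if s >= 6:
        level = "secret"
    elif s >= 2:
        level = "confidential"
    else:
        level = "internal"
    return (level, top.category)

evidence = [Evidence("finance", 5, 10, 0.9), Evidence("hr", 1, 900, 0.5)]
print(grade_file(evidence, doc_length=1000))
```

The strongest piece of evidence decides both outputs here; a production rule might instead aggregate across categories.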
In a second aspect, the invention provides a file encryption system based on data content identification. The system runs on an operating system and comprises:
a human-computer interaction module, a content identification module, an NLP aided decision module, an encryption (security marking) module and a database module;
the human-computer interaction module is the externally visible part of the system and is used for file import and for querying and setting encryption information;
the content identification module identifies and analyzes the content of the imported file and determines whether the file contains relevant sensitive information;
the NLP aided decision module determines the frequency and position at which the sensitive information occurs, and classifies and grades the file according to the sensitive information and the frequency and position of its occurrence;
the encryption module performs the actual file encryption operation to obtain encryption information;
the database module stores the encryption information, the file information and the metadata on which content identification depends.
Preferably, the NLP aided decision module also performs semantic analysis on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression.
Preferably, encryption information is queried as follows:
(1) the file is imported into the encryption system;
(2) the background system extracts the label information from the imported file and queries the database to obtain the file information and encryption information associated with the file;
(3) the query result is returned;
(4) the encryption information of the file is displayed in the foreground interface.
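The query flow in steps (1) to (4) amounts to reading a label out of the file and using it as a database key. The Python sketch below assumes a hypothetical tag format (`[[LABEL:xxxxxxxx]]` embedded in the file bytes) and a hypothetical `markings` table in SQLite; the patent specifies neither.

```python
import re
import sqlite3

LABEL_RE = re.compile(rb"\[\[LABEL:([0-9a-f]{8})\]\]")   # hypothetical tag format

def extract_label(raw: bytes):
    """Step (2a): pull the embedded label out of the imported file, if any."""
    m = LABEL_RE.search(raw)
    return m.group(1).decode("ascii") if m else None

def query_marking(conn, raw: bytes):
    """Steps (2b) and (3): look the label up and return the stored record."""
    label = extract_label(raw)
    if label is None:
        return None
    row = conn.execute(
        "SELECT file_name, level, category FROM markings WHERE label = ?", (label,)
    ).fetchone()
    if row is None:
        return None
    return {"label": label, "file_name": row[0], "level": row[1], "category": row[2]}

# In-memory database with one hypothetical record, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE markings (label TEXT PRIMARY KEY, file_name TEXT, level TEXT, category TEXT)"
)
conn.execute(
    "INSERT INTO markings VALUES (?, ?, ?, ?)",
    ("deadbeef", "plan.docx", "confidential", "finance"),
)
print(query_marking(conn, b"report body [[LABEL:deadbeef]]"))   # step (4): shown to the user
```

Because the file carries only the label and the details live in the database, the stored record can be updated without rewriting every copy of the file.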
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages:
(1) it is implemented purely in software: the encryption function depends only on a PC application, and file tracing can be performed in combination with a big data analysis system;
(2) it applies artificial intelligence: NLP technology improves the accuracy of data content identification, and data classification and grading support the decision on the encryption marking result.
Drawings
To illustrate the embodiments of the invention or the prior-art solutions more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the file encryption method based on data content identification according to the present invention;
FIG. 2 is a block diagram of the file encryption system based on data content identification employed in the present invention;
FIG. 3 is a block diagram of the structure of the content identification module and the NLP aided decision module used in the present invention;
FIG. 4 is a flow chart of the encryption information query employed in the present invention;
FIG. 5 is a flow chart of an embodiment of the present invention;
FIG. 6 is a block diagram of the prior art architecture.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
As shown in FIG. 1, the first aspect of the application provides a file encryption method based on data content identification, comprising:
S1: importing an external file to obtain an imported file;
S2: identifying and analyzing the content of the imported file and determining whether it contains relevant sensitive information, through format conversion, content extraction and content identification:
format conversion identifies, parses and converts the format type of the imported file to obtain a format-converted file;
content extraction extracts the file content from the format-converted file to obtain the extracted file content;
content identification extracts preset information from the extracted file content using keyword identification, regular expression matching, data identifier identification and file fingerprint identification, and determines whether the file contains relevant sensitive information;
the frequency and position at which the sensitive information occurs are then determined as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its frequency of occurrence;
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its position;
semantic analysis is performed on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression;
S3: classifying and grading the file according to the sensitive information, the frequency and position of its occurrence, and the semantic analysis, as follows:
the confidentiality level and business classification of the imported file are determined from the frequency of the sensitive information, its position in the file and its specific semantic meaning;
S4: encrypting the file according to the classification and grading result to obtain encryption information;
S5: storing the file information of the imported file and the encryption information in a database;
S6: returning the encryption result and displaying it in the human-computer interaction layer.
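Steps S4 and S5 can be sketched as "attach a label to the file, then persist a record keyed by that label". The Python example below is a simplification under stated assumptions: the label format, the record fields and the SQLite schema are hypothetical, and appending a tag to the raw bytes stands in for whatever hidden-tag mechanism a real system would use (appending bytes would not be safe for every file format).

```python
import hashlib
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def mark_file(raw: bytes, file_name: str, level: str, category: str):
    """S4: produce the marked file bytes plus the encryption information record."""
    label = uuid.uuid4().hex[:8]
    info = {
        "label": label,
        "file_name": file_name,
        "level": level,
        "category": category,
        "sha256": hashlib.sha256(raw).hexdigest(),       # digest of the original content
        "marked_at": datetime.now(timezone.utc).isoformat(),
    }
    marked = raw + b"\n[[LABEL:" + label.encode("ascii") + b"]]"   # hypothetical tag format
    return marked, info

def store(conn, info):
    """S5: persist the record so that it can later be found by label."""
    conn.execute("CREATE TABLE IF NOT EXISTS markings (label TEXT PRIMARY KEY, info TEXT)")
    conn.execute("INSERT INTO markings VALUES (?, ?)", (info["label"], json.dumps(info)))
    conn.commit()

conn = sqlite3.connect(":memory:")
marked, info = mark_file(b"quarterly figures", "q3.txt", "confidential", "finance")
store(conn, info)
print(info["label"], info["level"])
```

Storing a digest of the original content alongside the label gives a later tracing step a way to notice that a marked file has been altered.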
As shown in FIG. 2, the second aspect of the invention provides a file encryption system based on data content identification. The system runs on an operating system and comprises:
a human-computer interaction module, a content identification module, an NLP aided decision module, an encryption (security marking) module and a database module;
the human-computer interaction module is the externally visible part of the system and handles file import and the querying and setting of encryption information; as shown in FIG. 4, encryption information is queried as follows:
(1) the file is imported into the encryption system;
(2) the background system extracts the label information from the imported file and queries the database to obtain the file information and encryption information associated with the file;
(3) the query result is returned;
(4) the encryption information of the file is displayed in the foreground interface;
as shown in FIG. 3, the content identification module identifies and analyzes the content of the imported file and determines whether the file contains relevant sensitive information;
this comprises format conversion, content extraction and content identification:
format conversion identifies, parses and converts the format type of the imported file to obtain a format-converted file;
content extraction extracts the file content from the format-converted file to obtain the extracted file content;
content identification extracts preset information from the extracted file content using keyword identification, regular expression matching, data identifier identification and file fingerprint identification, and determines whether the file contains relevant sensitive information;
as shown in FIG. 3, the NLP aided decision module determines the frequency and position at which the sensitive information occurs, as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its frequency of occurrence;
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its position;
semantic analysis is performed on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression;
the file is then classified and graded according to the sensitive information, the frequency and position of its occurrence, and the semantic analysis, as follows:
the confidentiality level and business classification of the imported file are determined from the frequency of the sensitive information, its position in the file and its specific semantic meaning;
the encryption module performs the actual file encryption operation to obtain encryption information;
the database module stores the encryption information, the file information and the metadata on which content identification depends.
Example 2:
As shown in FIG. 5, this embodiment applies the file encryption method and system based on data content identification in the following scenario:
(1) a security-sensitive customer company has set up its own mail system and needs a file encryption marking system to improve the tracing of files sent out by mail;
(2) the solution consists of three subsystems: a mail interception system, a file encryption system and a file tracing system;
(3) the mail interception system intercepts outgoing company mail and extracts the attachment files from it;
(4) the file encryption system discovers sensitive information in the outgoing attachment files and encrypts them;
(5) the encryption process and the encryption information are reported to the tracing system for subsequent file tracking;
(6) after encryption is complete, the mail is sent normally;
(7) when the recipient forwards the encrypted file, the mail interception system intercepts the outgoing mail;
(8) the file encryption system reads the encryption information in the mail attachment;
(9) the encryption information and the sending process are reported to the tracing system for file tracking;
(10) the mail is released and sent to the recipient as normal;
(11) the file tracing system stores the label information and circulation information of all encrypted files, so the full circulation history of each encrypted file can be seen in the tracing system.
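The mail scenario above can be modelled as a single interception hook that either marks an attachment on its first send or reads the existing label on a re-send, and reports to a trace log in both cases. The Python sketch below is a toy model: attachments are plain dictionaries, labels are sequential identifiers, and the trace log is an in-memory dictionary, all of which are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import count

_next_id = count(1)
trace_log = defaultdict(list)   # label -> list of (sender, recipient, event)

def process_outgoing(attachment: dict, sender: str, recipient: str) -> dict:
    """Intercept one outgoing attachment, mark it if needed, and report to the trace log."""
    if "label" not in attachment:
        # First send, steps (4)-(5): discover sensitive information and mark the file.
        attachment = {**attachment, "label": f"F{next(_next_id):04d}"}
        event = "marked"
    else:
        # Re-send, steps (8)-(9): read the existing encryption information instead of re-marking.
        event = "forwarded"
    trace_log[attachment["label"]].append((sender, recipient, event))
    return attachment   # steps (6) and (10): the mail is released for delivery

first = process_outgoing({"name": "bid.docx"}, "alice", "bob")
second = process_outgoing(first, "bob", "carol")
print(trace_log[first["label"]])
```

Because the label survives the first hop, the second interception appends to the same history, which is what lets the tracing system reconstruct the full circulation path of a file.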

Claims (10)

1. A file encryption method based on data content identification, characterized by comprising the following steps:
S1: importing an external file to obtain an imported file;
S2: identifying and analyzing the content of the imported file, determining whether the file contains relevant sensitive information, and determining the frequency and position at which the sensitive information occurs;
S3: classifying and grading the file according to the sensitive information and the frequency and position of its occurrence;
S4: encrypting the file according to the classification and grading result to obtain encryption information;
S5: storing the file information of the imported file and the encryption information in a database;
S6: returning the encryption result and displaying it in the human-computer interaction layer.
2. The file encryption method based on data content identification according to claim 1, wherein identifying and analyzing the content of the imported file and determining whether it contains relevant sensitive information comprises format conversion, content extraction and content identification:
format conversion identifies, parses and converts the format type of the imported file to obtain a format-converted file;
content extraction extracts the file content from the format-converted file to obtain the extracted file content;
content identification extracts preset information from the extracted file content and determines whether the file contains relevant sensitive information.
3. The file encryption method based on data content identification according to claim 2, wherein the preset information is extracted from the extracted file content as follows:
keyword identification, regular expression matching, data identifier identification and file fingerprint identification are applied to the extracted file content.
4. The file encryption method based on data content identification according to claim 3, wherein the frequency at which the sensitive information occurs is determined as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its frequency of occurrence.
5. The file encryption method based on data content identification according to claim 4, wherein the position at which the sensitive information occurs is determined as follows:
the relevant sensitive information found in the file is subjected to cluster analysis using machine learning to obtain its position.
6. The file encryption method based on data content identification according to claim 5, wherein, before the file is classified and graded, the method further comprises:
performing semantic analysis on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression.
7. The file encryption method based on data content identification according to claim 6, wherein the file is classified and graded as follows:
the confidentiality level and business classification of the imported file are determined from the frequency of the sensitive information, its position in the file and its specific semantic meaning.
8. A file encryption system based on data content identification, characterized in that the system runs on an operating system and comprises:
a human-computer interaction module, a content identification module, an NLP aided decision module, an encryption (security marking) module and a database module;
the human-computer interaction module is the externally visible part of the system and is used for file import and for querying and setting encryption information;
the content identification module identifies and analyzes the content of the imported file and determines whether the file contains relevant sensitive information;
the NLP aided decision module determines the frequency and position at which the sensitive information occurs, and classifies and grades the file according to the sensitive information and the frequency and position of its occurrence;
the encryption module performs the actual file encryption operation to obtain encryption information;
the database module stores the encryption information, the file information and the metadata on which content identification depends.
9. The file encryption system based on data content identification according to claim 8, wherein the NLP aided decision module also performs semantic analysis on the extracted file content using machine learning, so that the decision does not rest on a hard match of a single keyword or regular expression.
10. The file encryption system based on data content identification according to claim 8, wherein encryption information is queried as follows:
(1) the file is imported into the encryption system;
(2) the background system extracts the label information from the imported file and queries the database to obtain the file information and encryption information associated with the file;
(3) the query result is returned;
(4) the encryption information of the file is displayed in the foreground interface.
CN202110658517.XA 2021-06-15 2021-06-15 File encryption method and system based on data content identification Pending CN113378222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110658517.XA CN113378222A (en) 2021-06-15 2021-06-15 File encryption method and system based on data content identification

Publications (1)

Publication Number Publication Date
CN113378222A (en) 2021-09-10

Family

ID=77574155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110658517.XA Pending CN113378222A (en) 2021-06-15 2021-06-15 File encryption method and system based on data content identification

Country Status (1)

Country Link
CN (1) CN113378222A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902931A (en) * 2011-07-28 2013-01-30 中国航天科工集团第二研究院七〇六所 File encryption system and file encryption method
RU2015152418A (en) * 2015-12-07 2017-06-13 федеральное государственное казенное военное образовательное учреждение высшего образования "Краснодарское высшее военное училище имени генерала армии С.М. Штеменко" Министерства обороны Российской Федерации Method for automatic classification of confidential formalized documents in electronic document management system
CN106845265A (en) * 2016-12-01 2017-06-13 北京计算机技术及应用研究所 A kind of document security level automatic identifying method
CN106790174A (en) * 2016-12-29 2017-05-31 成都三零盛安信息系统有限公司 Security level identification method and device
CN110166451A (en) * 2019-05-20 2019-08-23 北京计算机技术及应用研究所 A kind of lightweight electronic document transmitting control system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
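张亮; 徐建忠; 罗准辰: "A clustering-based method for optimizing sensitive information detection results" (一种基于聚类的敏感信息检测结果优化方法), 信息安全与通信保密 (Information Security and Communications Privacy), no. 01, 10 January 2016 (2016-01-10), pages 129-131 *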

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756092A (en) * 2023-08-23 2023-09-15 深圳红途科技有限公司 System download file marking method, device, computer equipment and storage medium
CN116756092B (en) * 2023-08-23 2024-01-05 深圳红途科技有限公司 System download file marking method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination