CN112100630A - Identification method for confidential document - Google Patents
- Publication number
- CN112100630A
- Authority
- CN
- China
- Prior art keywords
- confidential document
- confidential
- document
- template
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/158—Segmentation of character regions using character size, text spacings or pitch estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention relates to an identification method for confidential documents, comprising the following steps: first, preprocessing; second, text detection; third, optical character recognition; fourth, extracting keywords from photos and checking whether they show confidential documents; fifth, checking whether a document is confidential by means of OCR templates of confidential documents; sixth, using EXIF information as an aid; seventh, setting a suspicion coefficient and uploading it to a background administrator; eighth, querying the document part; and ninth, improving scanning efficiency. For the detection of confidential documents, the invention not only uses existing OCR technology but also generates multiple sets of templates tailored to the characteristics of confidential documents, thereby improving both the identification rate and the analysis speed.
Description
Technical Field
The invention relates to the field of file identification, in particular to an identification method for a confidential file.
Background
In the past, confidential documents were managed on paper, and each company had a strict management system so that confidentiality work proceeded in an orderly way. As technology developed and electronic documents became widespread, dedicated encrypted USB drives were generally used to store documents securely: a user had to enter a user name and password and could view the documents only after logging in, which largely prevented leaks of electronic documents.
However, as technology has continued to develop, confidentiality work is no longer limited to managing paper and electronic documents. The spread of high-resolution smartphones poses new problems for document confidentiality.
While documents circulate, a person needs only a smartphone to photograph a computer display or a paper document and obtain a high-definition picture of its contents. Some past leaks of internal documents occurred in exactly this way: the documents were photographed with a mobile phone and the pictures were posted on the internet, with damaging consequences.
In view of this, on the one hand, the management of confidential documents should be further improved and employee education strengthened, with employees prohibited from storing confidential documents on their mobile phones in any form. On the other hand, emerging technology should be used proactively to strengthen the monitoring of mobile phone photos and of documents in specified formats.
Disclosure of Invention
The invention aims to provide an identification method for confidential documents that has a high identification rate and good reliability.
The technical solution for achieving this aim is an identification method for confidential documents, comprising the following steps:
step one, preprocessing: image data is first received and straightened so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition;
there are three schemes for identifying images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
step two, text detection: there are two schemes for text detection: 1. detecting text by connected components; 2. detecting text using a grid; the connected-component algorithm is applied first, and the result is then optimized with the grid method;
step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability: the fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents are of equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network performs the recognition;
step four: keywords are extracted from the photo to check whether it shows a confidential document;
step five: whether the document is confidential is checked against OCR templates of confidential documents: a conventional OCR algorithm can find some of the documents, but image processing is a very complex process, so to improve the recognition rate of the software, recognition templates are used in combination with it; the picture is then processed by template matching;
a confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold;
step six, EXIF information as an aid: the geographical location where the picture was taken is obtained by pre-reading the EXIF information of the picture file; analysis is intensified for pictures taken during working hours and near the office area, which further improves the accuracy of scan detection.
Further, in the first step, high-threshold adaptive binarization is preferred.
Further, in the fourth step, whether the document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.
Furthermore, in the fifth step, when the templates are actually set up, templates for confidential documents can be summarized according to the font and the language format of the documents; the matching degree can then be obtained by the following algorithms;
The positive effects of the invention are: (1) For the detection of confidential documents, the invention not only uses existing OCR technology but also generates multiple sets of templates tailored to the characteristics of confidential documents, thereby improving both the identification rate and the analysis speed.
(2) To strengthen the identification of confidential documents, the system adds a geographical-location check: if a picture was taken at the workplace, detection is intensified.
Detailed Description
(Example 1)
The identification method of this embodiment uses existing image OCR technology and document scanning technology to scan the documents and images on a mobile phone, compare them, and verify whether they contain keywords.
Documents and picture files stored on the mobile phone are scanned against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning; the final result is reported, and the user is prompted to deal with any documents and picture files that may contain sensitive words.
The most critical part of this function is the scan detection of picture files. Based on an optimized and improved OCR algorithm, each pixel in the picture is analyzed, with comprehensive features such as document format, font and character color used as auxiliary evidence, and a library of confidential-document templates is set up so that the text contained in the picture is obtained more accurately.
The method for identifying the confidential document in the embodiment specifically comprises the following steps:
Step one, preprocessing: image data is first received and straightened so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition.
Three schemes can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier.
High-threshold adaptive binarization is preferred.
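As a rough illustration of the binarization at the end of step one, the following Python sketch thresholds each pixel against the mean of its local neighborhood. The text does not say how the "high threshold" is defined, so the `window` and `offset` parameters, the function name, and the convention that ink is marked as 1 are all assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, values 0..255


def adaptive_binarize(img: Image, window: int = 15, offset: int = 10) -> Image:
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset.

    `window` and `offset` are hypothetical tuning knobs; a larger offset
    makes the threshold stricter.
    """
    h, w = len(img), len(img[0])
    # Integral image with a zero border: integral[y][x] = sum of img[0:y][0:x].
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            out[y][x] = 1 if img[y][x] < mean - offset else 0
    return out
```

Because the threshold follows the local mean, uneven lighting in a phone photo affects the result less than a single global threshold would.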
Step two, text detection: there are two schemes for text detection: 1. detecting text by connected components; 2. detecting text using a grid.
When text is detected by connected components, a lot of noisy text is picked up, and an additional threshold must be set to filter it out. Semantics are recovered mainly by combining the nearest characters into words. After the text has been formed into lines, whether pieces of text belong to the same line is judged by their height.
Detecting text with a grid avoids much of this noisy text.
Text detection is accomplished by combining the two methods: the connected-component algorithm is applied first, and the result is then optimized with the grid method.
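The connected-component pass of step two can be sketched as follows. This is only an illustration under assumptions: the `min_pixels` noise threshold, the `min_overlap` ratio used to decide whether two boxes share a line, and both function names are hypothetical, and the grid-based refinement is not shown because the text gives no detail about it.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def connected_component_boxes(binary: List[List[int]], min_pixels: int = 2) -> List[Box]:
    """Return bounding boxes of 4-connected ink regions, dropping tiny noise blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] != 1 or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sy, sx)])
            top, left, bottom, right, count = sy, sx, sy, sx, 0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                count += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if count >= min_pixels:  # threshold filter for noisy components
                boxes.append((top, left, bottom, right))
    return boxes


def group_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Put boxes whose vertical extents overlap enough into the same text line."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        for line in lines:
            ref = line[0]
            overlap = min(box[2], ref[2]) - max(box[0], ref[0]) + 1
            shorter = min(box[2] - box[0], ref[2] - ref[0]) + 1
            if overlap / shorter >= min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])
    return [sorted(line, key=lambda b: b[1]) for line in lines]
```

Grouping by vertical overlap is one simple way to realize the "same line is judged according to the height" rule; other height-based criteria would fit the description equally well.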
Step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability. The fonts commonly used in confidential documents serve as manually labeled training samples. Based on the characteristics of confidential documents, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network performs the recognition. Combining the two improves the character recognition rate.
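The width-estimation idea in step three can be illustrated with a small Python sketch. It is one possible reading, not the patented segmentation technique: it assumes a binarized text line (ink = 1), takes the median ink-run width as the approximate character width, and splits wider runs into equal cells. The function names are hypothetical, and the CNN classifier that would consume the cells is not shown.

```python
from statistics import median
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) columns, end exclusive


def column_runs(line: List[List[int]]) -> List[Span]:
    """Return column spans of a binarized text line that contain ink."""
    width = len(line[0])
    has_ink = [any(row[x] for row in line) for x in range(width)]
    runs: List[Span] = []
    start = None
    for x, ink in enumerate(has_ink + [False]):  # sentinel closes a trailing run
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append((start, x))
            start = None
    return runs


def split_equal_width(line: List[List[int]]) -> List[Span]:
    """Estimate the typical character width and split wide runs into equal cells."""
    runs = column_runs(line)
    if not runs:
        return []
    char_width = median(end - start for start, end in runs)
    cells: List[Span] = []
    for start, end in runs:
        n = max(1, round((end - start) / char_width))  # characters merged in this run
        step = (end - start) / n
        for i in range(n):
            cells.append((start + round(i * step), start + round((i + 1) * step)))
    return cells
```

Each returned cell would then be cropped and passed to the character classifier.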
Step four: keywords are extracted from the photo to check whether it shows a confidential document. The check uses predefined keywords such as confidentiality, secrecy, internal matters, salary and planning.
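A minimal sketch of the keyword check in step four follows. The English strings in `KEYWORDS` are stand-ins for the translated terms listed above, and the function name is hypothetical; a real deployment would use the original-language keyword list.

```python
from typing import Iterable, List

# Stand-in English forms of the predefined keywords (assumed for illustration).
KEYWORDS = ("confidential", "secret", "internal", "salary", "planning")


def find_keywords(text: str, keywords: Iterable[str] = KEYWORDS) -> List[str]:
    """Return the predefined keywords that occur in OCR output, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]
```

For example, `find_keywords("INTERNAL memo: salary adjustments")` reports both `internal` and `salary`, so the photo would be flagged for further checks.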
Step five: whether the document is confidential is checked against OCR templates of confidential documents. A conventional OCR algorithm can indeed find some of the documents, but image processing is a very complex process, so to improve the recognition rate of the software, recognition templates are used in combination with it. The picture is then processed by template matching.
A confidentiality marking generally appears at the top of a confidential document. The template is therefore aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again. After all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold.
When the templates are actually set up, templates for confidential documents can be summarized according to the font, language format and other features of the documents. Several algorithms are available for computing the matching degree.
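The sliding comparison described in step five can be sketched as below. The text does not name the matching algorithms, so this example assumes binarized inputs and uses the fraction of agreeing pixels as the matching degree; the function names and the default `threshold` of 0.9 are hypothetical.

```python
from typing import List, Tuple

Binary = List[List[int]]


def match_template(image: Binary, template: Binary) -> Tuple[float, Tuple[int, int]]:
    """Slide the template over the image one pixel at a time.

    Returns the best matching degree (fraction of agreeing pixels) and the
    (row, column) of the best position. Returns -1.0 if the template does
    not fit inside the image.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_score, best_pos = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            agree = sum(
                image[y + dy][x + dx] == template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            score = agree / (th * tw)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


def has_marking(image: Binary, template: Binary, threshold: float = 0.9) -> bool:
    """Compare the best matching degree against a set threshold."""
    score, _ = match_template(image, template)
    return score >= threshold
```

Since the marking is said to sit at the top of the page, restricting the search to the upper rows of the image would be a natural way to cut the cost of the exhaustive scan.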
Step six, EXIF information as an aid: the geographical location where the picture was taken is obtained by pre-reading the EXIF information of the picture file. Analysis is intensified for pictures taken during working hours and near the office area, which further improves the accuracy of scan detection.
The similarity of the picture and the geographical location where it was taken are considered together to judge how likely the picture is to be confidential. Pictures with a high degree of suspicion are directly isolated and deleted and are uploaded to a background administrator. For the others, the user can be reminded to check them personally.
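One way to combine the matching degree with location and time, as step six describes, is sketched below. Everything numeric here is an assumption for illustration: the office coordinates, the 200 m radius, the working hours, the 0.2 and 0.1 bonuses, and the 0.9 isolation cut-off are hypothetical, as are the function names. Reading the EXIF tags from the file is not shown; the sketch starts from GPS values already extracted as degrees, minutes and seconds.

```python
import math
from datetime import datetime
from typing import Tuple


def dms_to_degrees(dms: Tuple[float, float, float], ref: str) -> float:
    """Convert (degrees, minutes, seconds) plus N/S/E/W to signed decimal degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points on a spherical Earth."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))


def suspicion(match_score: float, taken_at: datetime, lat: float, lon: float,
              office: Tuple[float, float] = (31.2304, 121.4737),  # hypothetical office
              radius_m: float = 200.0,
              work_hours: Tuple[int, int] = (9, 18)) -> float:
    """Raise the matching degree for photos taken near the office or in working hours."""
    score = match_score
    if haversine_m(lat, lon, office[0], office[1]) <= radius_m:
        score += 0.2  # assumed location bonus
    if taken_at.weekday() < 5 and work_hours[0] <= taken_at.hour < work_hours[1]:
        score += 0.1  # assumed working-time bonus
    return min(score, 1.0)


def action(score: float, isolate_at: float = 0.9) -> str:
    """High suspicion: isolate and report to the administrator; otherwise remind the user."""
    return "isolate_and_report" if score >= isolate_at else "remind_user"
```

Additive bonuses are only one option; the text says the factors are judged jointly but does not specify how they are weighted.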
It should be understood that the above examples are given only to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Obvious variations or modifications that fall within the spirit of the invention are intended to be covered by the scope of the present invention.
Claims (4)
1. An identification method for a confidential document, characterized by comprising the following steps:
step one, preprocessing: image data is first received and straightened so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition;
there are three schemes for identifying images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
step two, text detection: there are two schemes for text detection: 1. detecting text by connected components; 2. detecting text using a grid; the connected-component algorithm is applied first, and the result is then optimized with the grid method;
step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability: the fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents are of equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network performs the recognition;
step four: keywords are extracted from the photo to check whether it shows a confidential document;
step five: whether the document is confidential is checked against OCR templates of confidential documents: a conventional OCR algorithm can find some of the documents, but image processing is a very complex process, so to improve the recognition rate of the software, recognition templates are used in combination with it; the picture is then processed by template matching;
a confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold;
step six, EXIF information as an aid: the geographical location where the picture was taken is obtained by pre-reading the EXIF information of the picture file; analysis is intensified for pictures taken during working hours and near the office area, which further improves the accuracy of scan detection.
2. The identification method for a confidential document according to claim 1, characterized in that: in the first step, high-threshold adaptive binarization is preferred.
3. The identification method for a confidential document according to claim 1, characterized in that: in the fourth step, whether the document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.
4. The identification method for a confidential document according to claim 1, characterized in that: in the fifth step, when the templates are actually set up, templates for confidential documents can be summarized according to the font and the language format of the documents; the matching degree can then be obtained by the following algorithms;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910528848.4A CN112100630A (en) | 2019-06-18 | 2019-06-18 | Identification method for confidential document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112100630A (en) | 2020-12-18 |
Family
ID=73749386
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910528848.4A Withdrawn CN112100630A (en) | 2019-06-18 | 2019-06-18 | Identification method for confidential document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112100630A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115080704A (en) * | 2022-07-20 | 2022-09-20 | 广州世安信息技术股份有限公司 | Computer file security check method and system based on scoring mechanism |
CN115080704B (en) * | 2022-07-20 | 2022-11-11 | 广州世安信息技术股份有限公司 | Computer file security check method and system based on scoring mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9754164B2 (en) | Systems and methods for classifying objects in digital images captured using mobile devices | |
CN107239713B (en) | Sensitive content data information protection method and system | |
US9626555B2 (en) | Content-based document image classification | |
KR20190123790A (en) | Extract data from electronic documents | |
EP3574449B1 (en) | Structured text and pattern matching for data loss prevention in object-specific image domain | |
CN101957919B (en) | Character recognition method based on image local feature retrieval | |
US20090116755A1 (en) | Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents | |
US20110258170A1 (en) | Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements | |
CN112508011A (en) | OCR (optical character recognition) method and device based on neural network | |
KR102319492B1 (en) | AI Deep learning based senstive information management method and system from images | |
CN112330331A (en) | Identity verification method, device and equipment based on face recognition and storage medium | |
CN108304815B (en) | Data acquisition method, device, server and storage medium | |
CN113111880A (en) | Certificate image correction method and device, electronic equipment and storage medium | |
JP6882362B2 (en) | Systems and methods for identifying images, including identification documents | |
CN113936764A (en) | Method and system for desensitizing sensitive information in medical report sheet photo | |
CN112100630A (en) | Identification method for confidential document | |
CN113076961A (en) | Image feature library updating method, image detection method and device | |
CN116663549A (en) | Digitized management method, system and storage medium based on enterprise files | |
US7532368B2 (en) | Automated processing of paper forms using remotely-stored form content | |
CN112115735A (en) | Identification management method for confidential files | |
CN107016320B (en) | Method for improving image security level identification accuracy rate based on Chinese lexicon | |
CN111046864A (en) | Method and system for automatically extracting five elements of contract scanning piece | |
CN116841424B (en) | Screen capture monitoring method and system | |
US11651093B1 (en) | Automated fraudulent document detection | |
CN111985483B (en) | Method and device for detecting screen shot file picture and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20201218 |