CN112100630A - Identification method for confidential document - Google Patents

Publication number
CN112100630A
Authority
CN
China
Prior art keywords
confidential document
confidential
document
template
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910528848.4A
Other languages
Chinese (zh)
Inventor
冯迪
汤丹
支劲超
顾梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd, and Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN201910528848.4A
Publication of CN112100630A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/158Segmentation of character regions using character size, text spacings or pitch estimation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Abstract

The invention relates to an identification method for confidential documents, comprising the following steps: first, preprocessing; second, text detection; third, optical character recognition; fourth, extracting keywords from the photo and checking whether it shows a classified document; fifth, checking whether the document is confidential by means of an OCR template for confidential documents; sixth, using EXIF information as an aid; seventh, setting a suspicion coefficient and uploading it to a background administrator; eighth, querying the document part; and ninth, improving scanning efficiency. For the detection of confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.

Description

Identification method for confidential document
Technical Field
The invention relates to the field of file identification, in particular to an identification method for a confidential file.
Background
In the past, confidential documents were managed on paper, and each company had a strict management system so that confidentiality work proceeded in an orderly way. As technology developed and electronic documents became widespread, documents were typically stored on dedicated encrypted USB drives: a user must enter a user name and password and can view the documents only after logging in, which largely prevents electronic documents from leaking.
However, as technology has continued to develop, confidentiality work in the new era is no longer limited to managing paper and electronic documents. The spread of smartphones with high-resolution cameras brings new problems for document confidentiality.
While a document is in circulation, anyone carrying a smartphone can easily photograph a computer display or a paper document and obtain a high-definition image of its content. Some internal documents have previously been leaked in exactly this way: they were photographed with a mobile phone and posted on the internet, with harmful consequences.
In view of this, on the one hand the management of confidential documents should be further improved and employee education strengthened, and employees should be prohibited from storing confidential documents on mobile phones in any form. On the other hand, emerging technology should be actively used to strengthen the monitoring of mobile phone photos and of documents in specified formats.
Disclosure of Invention
The invention aims to provide an identification method for confidential documents that has a high recognition rate and good reliability.
The technical scheme for achieving this aim is an identification method for confidential documents, comprising the following steps:
Step one, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition.
Three schemes can be used to identify images: 1. a high-threshold adaptive binarization technique; 2. a convolutional neural network (CNN); 3. a Haar feature classifier.
Step two, text detection: two schemes can accomplish text detection: 1. detecting text through connected components; 2. detecting text using a grid. The result is refined first with the connected-component algorithm and then with the grid method.
Step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used to improve the probability through comparison. Fonts commonly used in confidential documents serve as manually labeled training samples. Because many characters in confidential documents are of equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network method performs the recognition.
Step four: keywords are extracted from the photo to check whether it shows a classified document.
Step five, checking whether the document is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can find some of the documents, but image processing is a very complex process, so to improve the software's recognition rate a recognition template is used together with it, and the picture is then processed by template matching.
A confidentiality marking generally appears at the top of a confidential document. The template is therefore aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a numerical matching degree is obtained, which can be compared against a set threshold.
Step six, EXIF information as an aid: the geographic location where the picture was taken is obtained by reading the picture file's EXIF information in advance. Analysis is intensified for pictures taken around working hours and near the office area, which further improves the accuracy of scan detection.
Further, in step one, the high-threshold adaptive binarization technique is preferred.
Further, in step four, whether the document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary, and planning.
Further, in step five, when templates are actually set up, templates for confidential documents can be summarized according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:
[Formulas: two matching-degree equations, published as images and not available in this text]
The positive effects of the invention are: (1) For the detection of confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.
(2) To strengthen the identification of confidential documents, the system adds a geographic location check: if a photo was taken at the workplace, detection is intensified.
Detailed Description
(example 1)
The identification method of this embodiment uses existing image OCR and document scanning technology to scan the documents and images on a mobile phone and check whether they contain keywords.
Documents and picture files stored on the mobile phone are scanned against predefined keywords such as confidentiality, secrecy, internal matters, salary, and planning; the final result is reported, and the user is prompted to deal with documents and picture files that may contain sensitive words.
The most critical part of this function is the scan detection of picture files. Based on an optimized and improved OCR algorithm, each pixel of the picture is analyzed, with overall characteristics such as document format, font, and character color used as auxiliary evidence, and a template library of confidential documents is set up so that the text contained in the picture is obtained more accurately.
The method for identifying the confidential document in the embodiment specifically comprises the following steps:
Step one, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition.
Three schemes can be used to identify images: 1. a high-threshold adaptive binarization technique; 2. a convolutional neural network (CNN); 3. a Haar feature classifier.
The high-threshold adaptive binarization technique is preferred.
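The binarization at the end of step one can be illustrated with a minimal sketch. The patent does not give its algorithm, so the local-mean rule, the function name `adaptive_binarize`, and the `window` and `offset` parameters below are assumptions chosen only for illustration; a larger `offset` plays the role of a stricter (higher) threshold.

```python
def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Assumed rule (not taken from the patent): a pixel is ink (1) when it
    is darker than the mean of its local window minus `offset`,
    otherwise background (0).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Sum the neighbourhood, clipped at the image border.
            total = count = 0
            for ny in range(max(0, y - half), min(height, y + half + 1)):
                for nx in range(max(0, x - half), min(width, x + half + 1)):
                    total += gray[ny][nx]
                    count += 1
            local_mean = total / count
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


# A single dark pixel on a bright background is kept as ink.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
print(adaptive_binarize(page))
```

Because the threshold follows the local mean, uneven lighting in a phone photo affects the result less than a single global threshold would.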
Step two, text detection: two schemes can accomplish text detection: 1. detecting text through connected components; 2. detecting text using a grid.
Detection through connected components produces a lot of noisy text, so an additional threshold must be set for filtering. Meaning is recovered mainly by combining the nearest characters into words. When text is assembled into lines, height is used to judge whether characters belong to the same line.
Detecting text through the grid avoids much of the noisy text.
Text detection is accomplished by combining the two methods: the result is refined first with the connected-component algorithm and then with the grid method.
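The connected-component part of step two can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names, the 4-connectivity, the `min_area` noise threshold, and the use of vertical centers to decide whether blobs share a line are all hypothetical choices. The grid-based refinement is not shown.

```python
from collections import deque


def connected_components(binary):
    """Return (top, left, bottom, right, area) for each 4-connected ink blob."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right, area))
    return boxes


def filter_noise(boxes, min_area=2):
    """Drop blobs smaller than the (assumed) noise threshold."""
    return [box for box in boxes if box[4] >= min_area]


def group_into_lines(boxes, tolerance=1.0):
    """Group blobs whose vertical centers lie within `tolerance` pixels."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[0] + b[2]) / 2):
        center = (box[0] + box[2]) / 2
        if lines and abs(center - lines[-1]["center"]) <= tolerance:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"center": center, "boxes": [box]})
    # Within each line, order blobs left to right.
    return [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]


binary = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1],  # the lone pixel at the right edge is noise
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
print(group_into_lines(filter_noise(connected_components(binary))))
```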
Step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used to improve the probability through comparison. Fonts commonly used in confidential documents serve as manually labeled training samples. Based on the characteristics of confidential documents, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network method performs the recognition. Combining the two improves the character recognition rate.
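The idea of estimating an approximate character width before recognition can be sketched with a column projection. The patent does not describe its segmentation technique, so splitting at blank columns and taking the median run width are assumptions, as are the names `column_runs` and `estimate_char_width`; the CNN recognition itself is not shown.

```python
from statistics import median


def column_runs(binary_line):
    """Return (start, end) column spans containing ink, split at blank columns."""
    width = len(binary_line[0])
    has_ink = [any(row[x] for row in binary_line) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new run begins
        elif not ink and start is not None:
            runs.append((start, x - 1))    # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width - 1))    # run reaches the right edge
    return runs


def estimate_char_width(runs):
    """Approximate the shared character width as the median run width."""
    return median(end - start + 1 for start, end in runs)


line = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 1, 0],
]
runs = column_runs(line)
print(runs, estimate_char_width(runs))
```

With a width estimate in hand, each span could be cropped and passed to a classifier, which is where the trained network would come in.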
Step four: keywords are extracted from the photo to check whether it shows a classified document. Whether the document is confidential is checked against predefined keywords such as confidentiality, secrecy, internal matters, salary, and planning.
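The keyword check in step four amounts to searching the recognized text for a predefined list. A minimal sketch follows; the English keyword strings stand in for the patent's (presumably Chinese) keywords, and both the list and the function name `find_keywords` are hypothetical.

```python
# Placeholder keyword list; the real list would be defined by the operator.
KEYWORDS = ("confidential", "secret", "internal", "salary", "planning")


def find_keywords(text, keywords=KEYWORDS):
    """Return the predefined keywords that occur in the OCR text."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword in lowered]


hits = find_keywords("INTERNAL memo: salary adjustments for 2019")
print(hits)
```

A non-empty result marks the photo for the closer checks in the later steps.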
Step five, checking whether the document is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can indeed find some of the documents, but image processing is a very complex process, so to improve the software's recognition rate a recognition template is used together with it. The picture is then processed by template matching.
A confidentiality marking generally appears at the top of a confidential document. The template is therefore aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a numerical matching degree is obtained, which can be compared against a set threshold.
When templates are actually set up, templates for confidential documents can be summarized according to the font, language format, and other features of the documents. Several algorithms can then be used to compute the matching degree.
[Formulas: two matching-degree equations, published as images and not available in this text]
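The sliding comparison described in step five can be sketched as follows. The patent's own matching-degree formulas are not available in this text, so the sketch substitutes a simple measure, the fraction of identical pixels between the template and the window; that measure, the function name `match_template`, and the 0.9 threshold are assumptions for illustration only.

```python
def match_template(image, template):
    """Slide `template` over `image` one pixel at a time.

    Both arguments are binarized images (lists of rows of 0/1).
    Returns (best_score, (row, col)), where the score is the fraction
    of pixels that agree at the best position.
    """
    image_h, image_w = len(image), len(image[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    best_score, best_pos = -1.0, None
    for y in range(image_h - tmpl_h + 1):
        for x in range(image_w - tmpl_w + 1):
            same = sum(
                image[y + dy][x + dx] == template[dy][dx]
                for dy in range(tmpl_h)
                for dx in range(tmpl_w)
            )
            score = same / (tmpl_h * tmpl_w)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


image = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
template = [
    [1, 0],
    [1, 1],
]
score, position = match_template(image, template)
THRESHOLD = 0.9  # hypothetical cut-off for "marking found"
print(score, position, score >= THRESHOLD)
```

Since the marking usually sits at the top of the page, restricting the search to the upper rows of the image would reduce the number of positions to compare.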
Step six, EXIF information as an aid: the geographic location where the picture was taken is obtained by reading the picture file's EXIF information in advance. Analysis is intensified for pictures taken around working hours and near the office area, which further improves the accuracy of scan detection.
The picture's similarity and the geographic location where it was taken are judged together to determine how likely the picture is to be confidential. Pictures with a high degree of suspicion are directly isolated and deleted and are uploaded to the background administrator. For the others, the user can be reminded to check them personally.
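How the matching degree and the EXIF data might be combined can be sketched as below. The sketch assumes the capture time and GPS coordinates have already been read from the EXIF block (parsing is not shown). The office coordinates, radius, working hours, boost value, and isolation threshold are all hypothetical placeholders, as are the function names.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def suspicion(match_score, taken_at, lat, lon,
              office=(31.81, 119.97), radius_km=0.5, work_hours=(8, 18)):
    """Raise the score for photos taken at the office during working hours."""
    near_office = distance_km(lat, lon, *office) <= radius_km
    in_hours = work_hours[0] <= taken_at.hour < work_hours[1]
    boost = 0.2 if (near_office and in_hours) else 0.0  # assumed weighting
    return min(1.0, match_score + boost)


def action(score, isolate_at=0.9):
    """High suspicion: isolate and report; otherwise remind the user."""
    return "isolate_and_report" if score >= isolate_at else "remind_user"


at_work = suspicion(0.75, datetime(2019, 6, 18, 10, 0), 31.81, 119.97)
elsewhere = suspicion(0.75, datetime(2019, 6, 18, 10, 0), 40.0, 116.4)
print(action(at_work), action(elsewhere))
```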
It should be understood that the above examples are given only to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications that fall within the spirit of the invention remain within its scope of protection.

Claims (4)

1. An identification method for a confidential document, characterized by comprising the following steps:
step one, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential document; finally, the image is binarized to facilitate recognition;
three schemes can be used to identify images: 1. a high-threshold adaptive binarization technique; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
step two, text detection: two schemes can accomplish text detection: 1. detecting text through connected components; 2. detecting text using a grid; the result is refined first with the connected-component algorithm and then with the grid method;
step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used to improve the probability through comparison; fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents are of equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give an approximate classification, after which the convolutional neural network method performs the recognition;
step four: keywords are extracted from the photo to check whether it shows a classified document;
step five, checking whether the document is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can find some of the documents, but image processing is a very complex process, so to improve the software's recognition rate a recognition template is used together with it; the picture is then processed by template matching;
a confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a numerical matching degree is obtained, which can be compared against a set threshold;
step six, EXIF information as an aid: the geographic location where the picture was taken is obtained by reading the picture file's EXIF information in advance; analysis is intensified for pictures taken around working hours and near the office area, which further improves the accuracy of scan detection.
2. The identification method for a confidential document according to claim 1, characterized in that: in step one, the high-threshold adaptive binarization technique is preferred.
3. The identification method for a confidential document according to claim 1, characterized in that: in step four, whether the document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary, and planning.
4. The identification method for a confidential document according to claim 1, characterized in that: in step five, when templates are actually set up, templates for confidential documents can be summarized according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:
[Formulas: two matching-degree equations, published as images and not available in this text]
CN201910528848.4A 2019-06-18 2019-06-18 Identification method for confidential document Withdrawn CN112100630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910528848.4A CN112100630A (en) 2019-06-18 2019-06-18 Identification method for confidential document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910528848.4A CN112100630A (en) 2019-06-18 2019-06-18 Identification method for confidential document

Publications (1)

Publication Number Publication Date
CN112100630A (en) 2020-12-18

Family

ID=73749386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910528848.4A Withdrawn CN112100630A (en) 2019-06-18 2019-06-18 Identification method for confidential document

Country Status (1)

Country Link
CN (1) CN112100630A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080704A (en) * 2022-07-20 2022-09-20 广州世安信息技术股份有限公司 Computer file security check method and system based on scoring mechanism
CN115080704B (en) * 2022-07-20 2022-11-11 广州世安信息技术股份有限公司 Computer file security check method and system based on scoring mechanism

Similar Documents

Publication Publication Date Title
US9754164B2 (en) Systems and methods for classifying objects in digital images captured using mobile devices
CN107239713B (en) Sensitive content data information protection method and system
US9626555B2 (en) Content-based document image classification
KR20190123790A (en) Extract data from electronic documents
EP3574449B1 (en) Structured text and pattern matching for data loss prevention in object-specific image domain
CN101957919B (en) Character recognition method based on image local feature retrieval
US20090116755A1 (en) Systems and methods for enabling manual classification of unrecognized documents to complete workflow for electronic jobs and to assist machine learning of a recognition system using automatically extracted features of unrecognized documents
US20110258170A1 (en) Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
KR102319492B1 (en) AI Deep learning based senstive information management method and system from images
CN112330331A (en) Identity verification method, device and equipment based on face recognition and storage medium
CN108304815B (en) Data acquisition method, device, server and storage medium
CN113111880A (en) Certificate image correction method and device, electronic equipment and storage medium
JP6882362B2 (en) Systems and methods for identifying images, including identification documents
CN113936764A (en) Method and system for desensitizing sensitive information in medical report sheet photo
CN112100630A (en) Identification method for confidential document
CN113076961A (en) Image feature library updating method, image detection method and device
CN116663549A (en) Digitized management method, system and storage medium based on enterprise files
US7532368B2 (en) Automated processing of paper forms using remotely-stored form content
CN112115735A (en) Identification management method for confidential files
CN107016320B (en) Method for improving image security level identification accuracy rate based on Chinese lexicon
CN111046864A (en) Method and system for automatically extracting five elements of contract scanning piece
CN116841424B (en) Screen capture monitoring method and system
US11651093B1 (en) Automated fraudulent document detection
CN111985483B (en) Method and device for detecting screen shot file picture and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201218
