CN112100630A - Identification method for confidential document

Info

Publication number: CN112100630A
Application number: CN201910528848.4A
Authority: CN
Inventors: 冯迪; 汤丹; 支劲超; 顾梅
Original assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2020-12-18

Abstract

The invention relates to an identification method for confidential documents, comprising the following steps: first, preprocessing; second, text detection; third, optical character recognition; fourth, extracting keywords from photos and checking whether the photos show confidential documents; fifth, checking whether a document is confidential by means of an OCR template of confidential documents; sixth, using EXIF information as an aid; seventh, setting a suspicion coefficient and uploading it to a back-end administrator; eighth, querying the document part; and ninth, improving scanning efficiency. To detect confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving both the recognition rate and the analysis speed for confidential documents.

Description

Identification method for confidential document

Technical Field

The invention relates to the field of document identification, and in particular to an identification method for confidential documents.

Background

In the past, when confidential documents were managed on paper, every company had a strict management system that kept confidentiality work orderly. As technology developed and electronic documents became widespread, dedicated encrypted USB drives came into general use to store documents safely: a user must enter a user name and password and can view the documents only after logging in, which largely prevents electronic documents from being leaked.

However, as technology continues to develop, confidentiality work in the new era is no longer limited to managing paper and electronic documents. The spread of smartphones with high-resolution cameras poses new problems for document confidentiality.

While documents circulate, anyone can use the smartphone they carry to photograph a computer display or a paper document and easily obtain a high-definition picture of the content. Leaks of internal documents have already occurred in this way: documents were photographed with a mobile phone and the pictures were spread on the internet, with harmful consequences.

In view of this situation, on the one hand the management of confidential documents should be further improved and employee education strengthened, and employees should be prohibited from storing confidential documents on their mobile phones in any form. On the other hand, emerging technology should be used to strengthen the monitoring, whether active or passive, of mobile phone photos and of documents in specified formats.

Disclosure of Invention

The aim of the invention is to provide an identification method for confidential documents that has a high recognition rate and good reliability.

The technical solution for achieving this aim is an identification method for confidential documents, comprising the following steps:

step one, preprocessing: image data is received first and the image is straightened so that it is aligned with the horizontal and vertical axes; an algorithm then detects whether the image shows a confidential document; finally, the image is binarized to facilitate recognition;

three approaches can be used to identify the image: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;

step two, text detection: two schemes can accomplish text detection: 1. detecting text through connected components; 2. detecting text using a grid; the connected-component algorithm is applied first, and the grid method is then used to optimize the result;

step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used, through comparison, to improve the recognition probability: the fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents have equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and assign an approximate classification, after which recognition is performed with the convolutional neural network;

step four, extracting keywords from the photos and checking whether the photos show confidential documents;

step five, checking whether a document is confidential by means of an OCR template of confidential documents: a conventional OCR algorithm can find some of the documents; however, image processing is a very complex process, so to improve the recognition rate of the software, recognition templates are used in combination with it; the picture is then processed by template matching;

the top of a confidential document generally carries a confidentiality marking, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and the same operation repeated; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold;

step six, EXIF information as an aid: the geographic location where a picture was taken is obtained by reading the EXIF information of the picture file in advance; pictures taken around working hours and near the office area are analyzed more intensively, which further improves the accuracy of scan detection.

Further, in step one, high-threshold adaptive binarization is preferred.

Further, in step four, whether a document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.

Further, in step five, when the templates are actually set up, templates relating to confidential documents can be compiled according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:

[formulas not reproduced in the source text]

The invention has the following positive effects: (1) To detect confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving both the recognition rate and the analysis speed for confidential documents.

(2) To strengthen the identification of confidential documents, the system adds a geographic location check: if a photo was taken at the workplace, it is examined more closely.

Detailed Description

(Example 1)

The identification method for confidential documents in this embodiment uses existing image OCR technology and document scanning technology to scan and compare the documents and pictures on a mobile phone and to verify whether they contain keywords.

Based on predefined keywords such as confidentiality, secrecy, internal matters, salary and planning, the documents and picture files stored on the mobile phone are scanned, the final result is reported, and the user is prompted to deal with any documents and picture files that may contain sensitive words.

The most critical part of this function is the scan detection of picture files. Using an optimized and improved OCR algorithm, each pixel in the picture is analyzed; overall characteristics such as the document format, the fonts used and the character color serve as auxiliary criteria; and a template library of confidential documents is set up, so that the text contained in the picture is obtained more accurately.

The identification method for confidential documents in this embodiment specifically comprises the following steps:

Step one, preprocessing: image data is received first and the image is straightened so that it is aligned with the horizontal and vertical axes; an algorithm then detects whether the image shows a confidential document; finally, the image is binarized to facilitate recognition.

Three schemes can be used to identify the image: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier.

A high threshold adaptive binarization technique is preferred.
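As an illustration of the adaptive binarization named above, the following is a minimal sketch in pure Python. It assumes a grayscale image given as a list of rows of 0-255 integers and uses a local-mean threshold; the function name `adaptive_binarize` and the parameters `window` and `offset` are hypothetical, and the source does not specify how its "high threshold" is chosen.

```python
def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image (list of rows of 0-255 ints).

    A pixel becomes 1 (ink) when it is darker than the mean of its local
    window minus `offset`; otherwise it becomes 0 (background).
    `window` and `offset` are assumed tuning parameters, not values
    taken from the patent.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            values = [
                gray[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(values) / len(values)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out
```

Because the threshold follows the local mean, a dark stroke stays detectable even when the photo is unevenly lit, which is the usual reason for preferring an adaptive threshold over a single global one. Raising `offset` makes the test stricter and suppresses faint background texture.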

Step two, text detection: two schemes can accomplish text detection: 1. detecting text through connected components; 2. detecting text using a grid.

When text is detected through connected components, a lot of noise is picked up as text, so an additional threshold must be set to filter it out. Semantics are derived mainly by combining the nearest characters into words. After the text has been arranged into lines, height is used to judge whether pieces of text belong to the same line.
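The connected-component scheme described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the input is a binary image (1 = ink), components are 4-connected, noise is filtered by a minimum pixel count, and boxes are grouped into a line when their top and bottom edges agree within a tolerance. The function names and the parameters `min_pixels` and `height_tolerance` are hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Return (top, left, bottom, right, pixel_count) for each 4-connected ink region."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, count = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                count += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right, count))
    return boxes


def filter_noise(boxes, min_pixels=3):
    """Drop components smaller than the (assumed) noise threshold."""
    return [b for b in boxes if b[4] >= min_pixels]


def group_into_lines(boxes, height_tolerance=1):
    """Group boxes with similar top and bottom edges into the same text line."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        for line in lines:
            ref = line[0]
            if (abs(box[0] - ref[0]) <= height_tolerance
                    and abs(box[2] - ref[2]) <= height_tolerance):
                line.append(box)
                break
        else:
            lines.append([box])
    return [sorted(line, key=lambda b: b[1]) for line in lines]
```

Sorting each line by its left edge gives the reading order in which neighbouring characters can later be combined into words.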

Detecting text with a grid avoids much of this noise.

Text detection is accomplished by combining the two methods: the connected-component algorithm is applied first, and the grid method is then used to optimize the result.
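The source does not say how the grid method refines the connected-component result. One possible reading, shown here purely as an assumption, is to divide the image into fixed-size cells, keep the cells with enough ink, and retain only candidate boxes whose centre falls in such a cell. The names `dense_cells` and `refine_boxes` and the parameters `cell` and `min_density` are hypothetical; boxes are given as (top, left, bottom, right).

```python
def dense_cells(binary, cell=4, min_density=0.2):
    """Return the set of (row, col) grid cells whose ink density reaches min_density."""
    height, width = len(binary), len(binary[0])
    cells = set()
    for gy in range(0, height, cell):
        for gx in range(0, width, cell):
            rows = range(gy, min(gy + cell, height))
            cols = range(gx, min(gx + cell, width))
            ink = sum(binary[y][x] for y in rows for x in cols)
            area = len(rows) * len(cols)
            if ink / area >= min_density:
                cells.add((gy // cell, gx // cell))
    return cells


def refine_boxes(boxes, cells, cell=4):
    """Keep boxes (top, left, bottom, right) whose centre lies in a dense cell."""
    kept = []
    for top, left, bottom, right in boxes:
        centre = ((top + bottom) // 2 // cell, (left + right) // 2 // cell)
        if centre in cells:
            kept.append((top, left, bottom, right))
    return kept
```

Under this reading, an isolated speck produces a connected component but sits in a sparse cell, so it is discarded without a separately tuned size threshold.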

Step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used, through comparison, to improve the recognition probability. The fonts commonly used in confidential documents serve as manually labeled training samples. Based on the characteristics of confidential documents, a non-uniform image segmentation technique is used to obtain the approximate width of each character and assign an approximate classification, after which recognition is performed with the convolutional neural network. Combining the two improves the character recognition rate.
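The width-estimation part of this step can be illustrated with a small sketch. It assumes a binarized text line (1 = ink), finds ink runs with a column projection, takes the median run width as the approximate character width, and splits wider runs into equal parts. This is only one way to exploit the equal-width property; the function names are hypothetical, and the CNN classifier itself is not shown.

```python
from statistics import median


def column_segments(binary):
    """Return (start, end) column ranges (end exclusive) that contain ink."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


def split_by_estimated_width(segments):
    """Estimate the character width and split wider runs into equal-width cells."""
    char_width = median(end - start for start, end in segments)
    cells = []
    for start, end in segments:
        # Number of characters that plausibly fit into this run.
        n = max(1, round((end - start) / char_width))
        step = (end - start) / n
        cells.extend(
            (round(start + i * step), round(start + (i + 1) * step))
            for i in range(n)
        )
    return char_width, cells
```

Each resulting cell could then be cropped and passed to the recognizer, so that two touching characters are classified separately rather than as one unknown glyph.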

Step four: keywords are extracted from the photos to check whether the photos show confidential documents. Whether a document is confidential is checked against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning.
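The keyword check on recognized text can be sketched as below. The English keyword list stands in for the predefined keywords of the source, whose original wording is not given here; `KEYWORDS` and `find_keywords` are hypothetical names. Whitespace is removed before matching on the assumption that OCR output may contain stray spaces inside words.

```python
# Placeholder keyword list; the actual predefined keywords are an assumption.
KEYWORDS = ("confidential", "secret", "internal", "salary", "planning")


def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur in OCR text, ignoring case and whitespace."""
    normalized = "".join(text.lower().split())
    return [keyword for keyword in keywords if keyword in normalized]
```

A non-empty result marks the picture or document as one that may contain sensitive words; the number of hits can also feed a later suspicion estimate.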

Step five, checking whether a document is confidential by means of an OCR template of confidential documents: a conventional OCR algorithm can indeed find some of the documents. However, image processing is a very complex process, so to improve the recognition rate of the software, recognition templates are used in combination with it. The picture is then processed by template matching.

The top of a confidential document generally carries a confidentiality marking, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and the same operation repeated. After all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold.

When the templates are actually set up, templates relating to confidential documents can be compiled according to the font, language format and other features of the documents. Several algorithms can then be used to compute the matching degree.
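The sliding comparison described in this step can be sketched as follows. Since the source's matching formulas are not available, the similarity measure used here, the fraction of template pixels that equal the image pixels beneath them, is an assumption chosen for simplicity; `match_score` and `best_match` are hypothetical names, and both inputs are binary images.

```python
def match_score(image, template, top, left):
    """Fraction of template pixels equal to the image pixels under it."""
    th, tw = len(template), len(template[0])
    same = sum(
        1
        for y in range(th)
        for x in range(tw)
        if image[top + y][left + x] == template[y][x]
    )
    return same / (th * tw)


def best_match(image, template):
    """Slide the template one pixel at a time; return (score, top, left) of the best position."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = (0.0, 0, 0)
    for top in range(ih - th + 1):
        for left in range(iw - tw + 1):
            score = match_score(image, template, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best
```

The returned score can be compared against a threshold, for example treating anything at or above 0.9 as a match. Because the marking usually sits at the top of the page, restricting the search to the upper rows of the image would reduce the number of positions to compare.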

Step six, EXIF information as an aid: the geographic location where a picture was taken is obtained by reading the EXIF information of the picture file in advance. Pictures taken around working hours and near the office area are analyzed more intensively, which further improves the accuracy of scan detection.
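The location and time check can be sketched as below, assuming the EXIF fields have already been read into a dictionary. EXIF stores GPS coordinates as degrees, minutes and seconds with a hemisphere reference, and `DateTimeOriginal` as `YYYY:MM:DD HH:MM:SS`. The office coordinates, the 1 km radius, the 08:00-18:00 window and the weekday rule are illustrative assumptions, as are the function names.

```python
import math
from datetime import datetime


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF degrees/minutes/seconds plus hemisphere to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def exif_flags(exif, office, radius_km=1.0, work_hours=(8, 18)):
    """Return (near_office, in_work_hours) for one picture's parsed EXIF data."""
    lat = dms_to_decimal(*exif["GPSLatitude"], exif["GPSLatitudeRef"])
    lon = dms_to_decimal(*exif["GPSLongitude"], exif["GPSLongitudeRef"])
    taken = datetime.strptime(exif["DateTimeOriginal"], "%Y:%m:%d %H:%M:%S")
    near_office = haversine_km(lat, lon, office[0], office[1]) <= radius_km
    # Assumed rule: Monday to Friday within the configured hours.
    in_work_hours = taken.weekday() < 5 and work_hours[0] <= taken.hour < work_hours[1]
    return near_office, in_work_hours
```

The two flags are returned separately so that the caller can decide how strongly each one should intensify the analysis. Pictures without GPS tags would need a fallback, which this sketch does not cover.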

The similarity of the picture and the geographic location where it was taken are used together to judge how likely the picture is to be confidential. Pictures with a high degree of suspicion are directly isolated and deleted, and are uploaded to the back-end administrator. For the others, the user may be reminded to check them personally.
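The joint judgment can be illustrated with a simple weighted score. The weights, the thresholds and the three outcome labels below are assumptions made for this sketch; the source only states that similarity and location are combined and that highly suspicious pictures are isolated and reported while the user is reminded about the others.

```python
def suspicion_score(template_score, keyword_hits, near_office, in_work_hours):
    """Combine the evidence into a value between 0 and 1 (weights are assumed)."""
    score = 0.5 * template_score               # similarity to a confidential template
    score += 0.3 * min(keyword_hits, 3) / 3    # keyword evidence, capped at three hits
    score += 0.1 * near_office + 0.1 * in_work_hours
    return score


def decide(score, high=0.7, low=0.3):
    """Map a suspicion score to an action (thresholds are assumed)."""
    if score >= high:
        return "isolate_and_report"  # isolate or delete, then upload to the administrator
    if score >= low:
        return "remind_user"         # ask the user to check the picture personally
    return "ignore"                  # assumed bucket for pictures with little evidence
```

Keeping the scoring separate from the decision makes it possible to tune the thresholds on the back end without touching the detection steps.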

It should be understood that the above examples are given only to illustrate the invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications that fall within the spirit of the invention remain within its scope of protection.

Claims

1. An identification method for confidential documents, characterized by comprising the following steps:

2. The identification method for confidential documents according to claim 1, characterized in that: in step one, high-threshold adaptive binarization is preferred.

3. The identification method for confidential documents according to claim 1, characterized in that: in step four, whether a document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.

4. The identification method for confidential documents according to claim 1, characterized in that: in step five, when the templates are actually set up, templates relating to confidential documents can be compiled according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:

[formulas not reproduced in the source text]