CN109284756A

CN109284756A - A terminal confidentiality inspection method based on OCR technology

Info

Publication number: CN109284756A
Application number: CN201810865946.2A
Authority: CN
Inventors: 李昌利; 贾乾; 刘翔
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2019-01-29

Abstract

The present invention discloses a terminal confidentiality inspection method based on OCR technology. The steps are: a terminal sends a confidentiality inspection request to a server; the server records the terminal's information, including connection time and IP address, and uses a document analysis module to determine the type of the file to be inspected; files that the document analysis module judges to be images are processed by an image processing module; the image text in the processed image is segmented by an image cutting module; a feature extraction module extracts character features from the segmented text and performs classification and recognition; a text matching module performs keyword matching on the recognized characters, and if a keyword is found, the acquired information is displayed on the terminal interface, otherwise the system exits directly. This method can further improve the performance of the OCR system, reduce the complexity of the confidentiality inspection system, and achieve the goal of inspecting terminals.

Description

A terminal confidentiality inspection method based on OCR technology

Technical field

The invention belongs to the field of computer and information security, and in particular relates to a terminal confidentiality inspection method based on OCR technology.

Background art

In an era of rapid development of Internet technology, people depend on the Internet more and more, the volume of information is growing explosively, and massive amounts of data are produced. Network leak incidents and network theft of secrets also occur from time to time. Most leaks of secrets today are closely related to the Internet, computers and storage media; leaking through these channels has become the main path of disclosure, and shows a tendency to rise year by year. In traditional confidentiality inspection, many departments mostly target files that a computer can process directly, such as Office files and txt files, and are helpless when faced with image files. This cannot meet the demands of current confidentiality inspection: it is inefficient and wastes substantial resources, so the approach is unsatisfactory.

Therefore, designing a confidentiality inspection method that can inspect image files and is easy to operate has definite value and significance, and the present invention arises from this.

Summary of the invention

The purpose of the present invention is to provide a terminal confidentiality inspection method based on OCR technology that can further improve the performance of the OCR system, reduce the complexity of the confidentiality inspection system, and achieve the goal of inspecting terminals.

To achieve the above purpose, the solution of the present invention is:

A terminal confidentiality inspection method based on OCR technology, comprising the following steps:

Step 1, a terminal in the confidentiality inspection network sends a confidentiality inspection request to the server, and the server connects to the terminal after responding to the request;

Step 2, the server records the terminal's information, including connection time and IP address, and stores it in a database; the server uses the document analysis module to determine the type of the file to be inspected;

Step 3, files that the document analysis module judges to be images are processed by the image processing module, which specifically includes:

If the image contains Gaussian noise, weighted denoising is performed using a method that fuses mean filtering with median filtering, to obtain the denoised image;

If the image has a complex background, the character image is binarized to separate the foreground region of the image from the background region;

If the image is tilted, the image is compressed and a selected part of it undergoes a Hough transform, to obtain the corrected image;

Step 4, the image cutting module segments the image text in the processed image;

Step 5, the feature extraction module extracts character features from the segmented text and performs classification and recognition;

Step 6, the text matching module performs keyword matching on the recognized characters; if a keyword is found, the acquired information is displayed on the terminal interface, otherwise the system exits directly.

With the above scheme, the present invention combines B/S and C/S techniques to carry out security and confidentiality inspection of networked terminals. It overcomes the drawback that many departments, when conducting confidentiality inspection, mostly target files a computer can process directly, such as Office files and txt files, and are helpless with image files. On the one hand, it expands the range of inspected objects, supporting many file types and inspection of their content, including picture files, Office files, web page files, compressed archives, mail files and other regular files, as well as OCR inspection of pictures. On the other hand, it improves the efficiency of inspection work and eliminates interfering factors in character recognition. The invention meets the demands of current confidentiality inspection; it generally improves efficiency and saves substantial resources, so the method is worth adopting.

The invention not only overcomes the drawback that most inspection tools can recognize only text and cannot recognize image files, but also simplifies the information administrator's operation of the confidentiality inspection system.

Brief description of the drawings

Fig. 1 is the overall logical architecture diagram of the invention;

Fig. 2 is the overall flow diagram of the invention;

Fig. 3 is a schematic diagram of the document analysis module of the invention;

Fig. 4 is a schematic diagram of the image processing module of the invention;

Fig. 5 is a schematic diagram of the image cutting module of the invention;

Fig. 6 is a schematic diagram of the feature extraction module of the invention;

Fig. 7 is a routine inspection result of the invention;

Fig. 8 is a deep inspection result of the invention.

Detailed description of the embodiments

The technical solution and beneficial effects of the present invention are described in detail below with reference to the drawings.

As shown in Fig. 1 and Fig. 2, the present invention provides a terminal confidentiality inspection method based on OCR technology, comprising the following steps:

Step 1) A terminal in the confidentiality inspection network sends a confidentiality inspection request to the server, and the server connects to the terminal after responding to the request. Confidentiality inspection includes the routine inspection of file content shown in Fig. 7, which covers file names, file content, mail content and picture files. It also supports the deep inspection of file content shown in Fig. 8, in which deleted files and operation records are recovered and inspected.

Step 2) The server records the terminal's information, including connection time and IP address, and stores it in a database. The server uses the document analysis module shown in Fig. 3 to determine the type of the file to be inspected.

If the file is a compressed file, it is decompressed and the file type is determined again for each file it contains; if the file is a non-image file, the content of the document database is parsed; if the file is an image file, it is processed by the image processing module.
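The type dispatch in step 2) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the extension set, the function names `classify` and `walk`, and the restriction to ZIP archives are all assumptions made for the example.

```python
import io
import zipfile

# Hypothetical extension set; the patent does not enumerate image formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".gif"}

def classify(name: str, data: bytes) -> str:
    """Return 'compressed', 'image' or 'document' for a single file."""
    # Note: Office Open XML files are ZIP containers too; a real system
    # would have to recognise them before treating a file as an archive.
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "compressed"
    suffix = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return "image" if suffix in IMAGE_EXTS else "document"

def walk(name: str, data: bytes):
    """Yield (name, kind) pairs, re-judging every member of an archive."""
    kind = classify(name, data)
    if kind != "compressed":
        yield name, kind
        return
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            if member.endswith("/"):  # skip directory entries
                continue
            yield from walk(f"{name}/{member}", archive.read(member))
```

Files reported as `image` would go on to the image processing module, and files reported as `document` to ordinary content parsing.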

Step 3) Files that the document analysis module shown in Fig. 3 judges to be images are processed by the image processing module shown in Fig. 4.

Step 3.1) If the image contains Gaussian noise, the noise is removed by fusing mean filtering with median filtering: the mean filter and the median filter are assigned different weights and a weighted denoising is performed, to obtain the denoised image;

Step 3.1.1) Set a 3 × 3 window and obtain the pixel value at each window position in the image containing Gaussian noise.

Step 3.1.2) Compute the mean and the median of the 3 × 3 window separately.

Step 3.1.3) Assign different weights to the mean and the median obtained above, compute their weighted sum, and set the result as the pixel value at the window centre.

Step 3.1.4) Repeat the above steps to denoise the whole image.
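Steps 3.1.1) to 3.1.4) can be sketched with NumPy as below. The weight `w_mean` and the edge-replicating border handling are assumptions for illustration; the patent does not state the weights it uses.

```python
import numpy as np

def fused_denoise(image: np.ndarray, w_mean: float = 0.5) -> np.ndarray:
    """Blend 3x3 mean and median filters; w_mean is a hypothetical weight."""
    img = image.astype(np.float64)
    padded = np.pad(img, 1, mode="edge")  # assumed border handling
    h, w = img.shape
    # Stack the nine shifted views so axis 0 indexes the 3x3 neighbourhood.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    mean = windows.mean(axis=0)
    median = np.median(windows, axis=0)
    # Weighted fusion becomes the new value of each window centre.
    return w_mean * mean + (1.0 - w_mean) * median
```

A larger `w_mean` smooths Gaussian noise more strongly, while a smaller one keeps more of the median filter's edge preservation.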

Step 3.2) If the image has a complex background, the character image is binarized to separate the foreground region of the image from the background region, which reduces the computational complexity of the system.

Step 3.2.1) Compute the threshold T of the whole image using a global threshold method, and find the cluster centre T₁.

Step 3.2.2) Based on the thresholds computed in step 3.2.1), define the region that is to be thresholded again. With c and d as variables, each pixel of the image is tested as follows:

(1 - c)T ≤ f(x, y) ≤ (1 + c)T

(1 - d)T₁ ≤ f(x, y) ≤ (1 + d)T₁

where f(x, y) is the gray value of the pixel at (x, y), and c and d are preset parameters. If the inequalities above are satisfied, proceed to step 3.2.3); otherwise the pixel is binarized according to the global threshold method.

Step 3.2.3) When the above inequalities are satisfied, local thresholding is performed using an improved Bernsen algorithm.

Step 3.2.4) Repeat the above steps to binarize the whole image.

The image is first binarized by the global threshold method, and the result is then binarized further using the improved Bernsen algorithm; this removes the complex background and yields a better foreground image;
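The two-stage binarization of step 3.2) can be sketched as follows. Several points are assumptions: T and T₁ are passed in by the caller because the text does not say how they are computed, both inequalities are required to hold together, and the classic Bernsen midpoint threshold stands in for the unspecified "improved" variant.

```python
import numpy as np

def hybrid_binarize(img, T, T1, c=0.2, d=0.2, radius=1):
    """Global threshold T everywhere; Bernsen midpoint for pixels near T and T1."""
    f = np.asarray(img, dtype=np.float64)
    out = (f > T).astype(np.uint8)  # global decision for every pixel
    # Pixels inside both bands are treated as ambiguous (assumed reading).
    near = (((1 - c) * T <= f) & (f <= (1 + c) * T)
            & ((1 - d) * T1 <= f) & (f <= (1 + d) * T1))
    padded = np.pad(f, radius, mode="edge")
    size = 2 * radius + 1
    for y, x in zip(*np.nonzero(near)):
        win = padded[y:y + size, x:x + size]
        local_t = (win.max() + win.min()) / 2.0  # classic Bernsen threshold
        out[y, x] = 1 if f[y, x] > local_t else 0
    return out
```

Only the ambiguous pixels pay the cost of a local window, which is consistent with the stated aim of reducing computational complexity.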

Step 3.3) If the image is tilted, the image is compressed and a selected part of it undergoes a Hough transform. The Hough transform converts the problem of detecting a straight line into the problem of counting, in parameter space, the number of curves that intersect at the same point. To do so, ρ and θ are discretized into N intervals, which divides the parameter space into many cells and establishes the parameter-space accumulator.

Step 3.3.1) Compress the image, divide it into blocks, and select part of the character image as the detection target.

Step 3.3.2) Construct a discrete parameter space on the ρ-θ plane, establish the accumulator matrix A(ρ, θ), and initialize each of its elements to 0.

Step 3.3.3) Apply the Hough transform to each non-zero pixel of the binary image; whenever a θ corresponds to a ρ, record the result in the accumulator matrix.

Step 3.3.4) Finally, find the maximum over ρ in the accumulator matrix; the corresponding θ value is the inclination angle of the series of straight lines.
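Steps 3.3.2) to 3.3.4) can be sketched as a small accumulator-based Hough transform. The parametrization ρ = x cos θ + y sin θ and the one-degree angular resolution are assumptions, since the text does not give them. Under this parametrization θ is the angle of the line's normal, so the direction of the text line differs from it by 90°.

```python
import numpy as np

def hough_skew_angle(binary, angles_deg=np.arange(-90, 90)):
    """Return theta (degrees) of the strongest straight line in a binary image."""
    binary = np.asarray(binary)
    ys, xs = np.nonzero(binary)
    thetas = np.deg2rad(angles_deg)
    diag = int(np.ceil(np.hypot(*binary.shape)))
    # Accumulator A(rho, theta), initialised to 0; rho is shifted by diag
    # so that negative values map to valid row indices.
    acc = np.zeros((2 * diag + 1, len(thetas)), dtype=np.int64)
    for j, t in enumerate(thetas):
        rho = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        acc[:, j] += np.bincount(rho, minlength=2 * diag + 1)
    # The cell with the most votes gives the dominant (rho, theta) pair.
    _, best = np.unravel_index(np.argmax(acc), acc.shape)
    return float(angles_deg[best])
```

Running this on a compressed or partial image, as step 3.3.1) describes, reduces the number of non-zero pixels that vote and therefore the cost of filling the accumulator.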

After the tilt angle of the character image has been detected, the image is corrected by rotation; the transformation formula for the rotation correction is as follows:
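The formula itself does not appear in this text. As an assumption, and not necessarily the exact expression used in the patent, a standard planar rotation that maps a pixel at (x, y) to (x', y') by rotating through -θ about the origin, where θ is the detected tilt angle, can be written as:

```latex
\begin{aligned}
x' &= x\cos\theta + y\sin\theta \\
y' &= -x\sin\theta + y\cos\theta
\end{aligned}
```

In practice the rotation is usually taken about the image centre, which adds a translation before and after this transform.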

Processing the tilted image in this way improves the speed of image processing and yields the corrected image;

Step 4) The image text in the processed image is segmented by the image cutting module shown in Fig. 5, which comprises a text localization module and a character segmentation module.

Step 4.1) The text localization module shown in Fig. 5 extracts the edge information of the character image with the Sobel operator, and performs text localization on the extracted edge information using character edge detection and morphological processing. This determines the character regions in the text, which are marked with rectangular boxes.

Step 4.2) For the character regions marked with rectangular boxes, the character segmentation module shown in Fig. 5 uses a projection histogram to count the number of target points in each row or each column of the image space. The results are arranged in the row or column order of the image, and the characters are segmented accordingly;

Step 4.3) For the segmented characters, the probability density distribution of the characters in the horizontal and vertical directions is used to scale the characters proportionally and perform linear normalization, finally giving the segmented text;
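The projection-histogram segmentation of step 4.2) can be sketched as below. The function name and the convention that non-zero pixels are target points are assumptions for the example; normalization to a fixed character size (step 4.3) is not shown.

```python
import numpy as np

def split_by_projection(binary, axis=0):
    """Return (start, end) index pairs of runs whose projection is non-zero.

    axis=0 sums each column, which separates characters within a text line;
    axis=1 sums each row, which separates text lines. end is exclusive.
    """
    profile = np.asarray(binary).astype(bool).sum(axis=axis)
    # Pad with False so that runs touching the border are closed properly.
    filled = np.concatenate(([False], profile > 0, [False]))
    edges = np.flatnonzero(filled[1:] != filled[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]
```

Calling it first with `axis=1` and then with `axis=0` on each resulting strip gives the usual line-then-character segmentation.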

Step 5) The feature extraction module shown in Fig. 6 extracts character features from the segmented text and performs classification and recognition.

Step 5.1) Directional line element features are extracted from the segmented characters. For each segmented character, all black pixels in the character pixel matrix are scanned one by one with a 3 × 3 window, from top to bottom and from left to right;

If the scanned pixel is a black pixel, that is, part of the character pattern, the number of white pixels in the window is counted;

If the number of white pixels is greater than or equal to 2 and less than 8, the current pixel is a contour point of the character;

Otherwise it is judged to be a noise point, and its pixel value is set to 0.
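The contour test of step 5.1) can be sketched as below. The encoding (1 for a black character pixel, 0 for white) and the treatment of positions outside the image as white are assumptions made for the example.

```python
import numpy as np

def contour_points(char: np.ndarray) -> np.ndarray:
    """Keep character pixels with 2 to 7 white neighbours; set the rest to 0.

    Assumes 1 = black (character) pixel and 0 = white (background) pixel.
    """
    ink = (np.asarray(char) > 0).astype(int)
    padded = np.pad(ink, 1, mode="constant")  # outside the image counts as white
    h, w = ink.shape
    # Sum of the 3x3 window minus the centre gives the black neighbour count.
    black_neighbours = sum(padded[dy:dy + h, dx:dx + w]
                           for dy in range(3) for dx in range(3)) - ink
    white_neighbours = 8 - black_neighbours
    keep = (ink == 1) & (white_neighbours >= 2) & (white_neighbours < 8)
    return keep.astype(np.uint8)
```

Read literally, the rule discards both isolated pixels (8 white neighbours) and interior pixels (fewer than 2), so the result is an outline of each stroke.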

Step 5.2) Combining methods based on structural features with methods based on statistical features, an eight-dimensional direction vector of the character's feature vector is extracted from eight directions, so as to compute the directional element features of the black pixels.

Step 5.3) A template matching classifier is selected, and classification and recognition are performed according to the eight-dimensional direction vector of the obtained template feature vector, so as to distinguish different characters.
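Steps 5.2) and 5.3) can be illustrated with a simplified eight-direction feature and a nearest-template classifier. The feature definition below (a normalized count of contour pixels that have a contour neighbour in each of eight directions) and the Euclidean distance are assumptions; the patent does not specify how its direction vector or its matching score is computed.

```python
import numpy as np

# (dy, dx) offsets for eight directions, starting east and going counter-clockwise.
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def direction_features(contour) -> np.ndarray:
    """Return a normalised 8-dimensional direction vector (hypothetical feature)."""
    arr = (np.asarray(contour) > 0).astype(int)
    h, w = arr.shape
    c = np.pad(arr, 1, mode="constant")
    core = c[1:h + 1, 1:w + 1]
    # For each direction, count pixels whose neighbour in that direction is set.
    counts = [np.sum(core * c[1 + dy:1 + dy + h, 1 + dx:1 + dx + w])
              for dy, dx in DIRECTIONS]
    v = np.array(counts, dtype=float)
    total = v.sum()
    return v / total if total else v

def match_template(features, templates):
    """Return the label whose template vector is nearest in Euclidean distance."""
    return min(templates, key=lambda k: np.linalg.norm(features - templates[k]))
```

Here `templates` is a dictionary from character label to a feature vector computed in the same way from reference glyphs.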

The image processing module shown in Fig. 4 in step 3), the image cutting module shown in Fig. 5 in step 4) and the feature extraction module shown in Fig. 6 in step 5) are the core modules of the OCR engine; their main task is to convert the text in an image into editable text.

Step 6) The recognized characters are matched against keywords using the text matching algorithm in the text matching module shown in Fig. 1. The text matching module searches the text for keywords; if a keyword is found, the acquired information is displayed on the terminal interface, otherwise the system exits directly.
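The keyword check of step 6) can be sketched with plain substring search. The patent does not name its text matching algorithm, so the function below is only an illustrative stand-in, and its name and return format are assumptions.

```python
def find_keywords(text: str, keywords: list[str]) -> dict[str, list[int]]:
    """Map each keyword that occurs in text to the offsets where it starts."""
    hits: dict[str, list[int]] = {}
    for kw in keywords:
        if not kw:  # ignore empty keywords
            continue
        start = text.find(kw)
        while start != -1:
            hits.setdefault(kw, []).append(start)
            start = text.find(kw, start + 1)
    return hits
```

An empty result corresponds to the "exit directly" branch; a non-empty one holds the information to be shown on the terminal interface.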

The present invention covers three aspects: first, an architecture combining B/S and C/S is built to handle the communication between the terminals and the server and the functions required on the server side; second, security and confidentiality inspection is developed and implemented for various operating systems, supporting the inspection of image files and the processing of the text information in them; third, the server carries the main management work.

1. User-entered keyword inspection

In C/S mode the server is connected to all clients, and the database is placed on the server side; the C/S structure requires a network environment. When a client issues a connection request, the server responds to it and operates the database to save the related data, after which the client carries out the confidentiality inspection. Meanwhile, the server records the information of each connected client (including connection time and IP address) and stores it in the database.
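The server-side record of connection time and IP address can be sketched with SQLite. The table name, column names and choice of SQLite are assumptions for illustration; the patent does not specify the database or its schema.

```python
import sqlite3
from datetime import datetime, timezone

def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) connections table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS connections ("
        "id INTEGER PRIMARY KEY, ip TEXT NOT NULL, connected_at TEXT NOT NULL)"
    )
    return conn

def record_connection(conn: sqlite3.Connection, ip: str) -> None:
    """Store the client's IP address together with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO connections (ip, connected_at) VALUES (?, ?)", (ip, now)
    )
    conn.commit()
```

The server would call `record_connection` once for each accepted client, before the inspection itself starts.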

2. Running the inspection program

The inspection program run on the terminal supports the routine inspection of the terminal's file content shown in Fig. 7 (mainly covering terminal security and confidentiality measures, USB device usage records, Internet access records, communication devices, user information and so on). Besides the routine inspection result shown in Fig. 7, it also provides security and confidentiality inspection of the content of all types of files (classified information inspection), such as document files (Office, PDF, txt and so on), web page files, compressed archives, mail files and picture files, and it even supports the deep recovery and inspection of file content shown in Fig. 8. Finally, the server displays the inspection result on the terminal.

3. Management of the server

The core of the invention lies in the inspection function of the terminal inspection program and the management function of the server. Image files are preprocessed, and OCR is called to parse the processed image files into editable text information. The text is divided into characters using character localization and character segmentation; features are extracted from the segmented characters and the feature values are classified; finally, the classified characters are matched against the keywords, and the matching information is displayed on the terminal.

The above embodiments only illustrate the technical idea of the present invention and do not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims

1. A terminal confidentiality inspection method based on OCR technology, characterized by comprising the following steps:

Step 1, a terminal in the confidentiality inspection network sends a confidentiality inspection request to the server, and the server connects to the terminal after responding to the request;

Step 2, the server records the terminal's information, including connection time and IP address, and stores it in a database; the server uses the document analysis module to determine the type of the file to be inspected;

Step 3, files that the document analysis module judges to be images are processed by the image processing module, which specifically includes:

If the image contains Gaussian noise, weighted denoising is performed using a method that fuses mean filtering with median filtering, to obtain the denoised image;

If the image has a complex background, the character image is binarized to separate the foreground region of the image from the background region;

If the image is tilted, the image is compressed and a selected part of it undergoes a Hough transform, to obtain the corrected image;

Step 4, the image cutting module segments the image text in the processed image;

Step 5, the feature extraction module extracts character features from the segmented text and performs classification and recognition;

Step 6, the text matching module performs keyword matching on the recognized characters; if a keyword is found, the acquired information is displayed on the terminal interface, otherwise the system exits directly.

2. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: in step 2, if the file is a compressed file, it is decompressed and the file type is determined again for each file it contains; if the file is a non-image file, the content of the document database is parsed; if the file is an image file, it is processed by the image processing module.

3. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: in step 3, if the image contains Gaussian noise, the detailed process of performing weighted denoising with the method that fuses mean filtering with median filtering, to obtain the denoised image, is:

Step 3.1.1, set a 3 × 3 window and obtain the pixel value at each window position in the image containing Gaussian noise;

Step 3.1.2, compute the mean and the median of the 3 × 3 window separately;

Step 3.1.3, assign different weights to the obtained mean and median, compute their weighted sum, and set the result as the pixel value at the window centre;

Step 3.1.4, repeat the above steps to denoise the whole image.

4. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: in step 3, if the image has a complex background, the detailed process of binarizing the character image to separate the foreground region of the image from the background region is:

Step 3.2.1, compute the threshold T of the whole image using a global threshold method, and find the cluster centre T₁;

Step 3.2.2, based on the thresholds computed in step 3.2.1, define the region that is to be thresholded again; with c and d as variables, each pixel of the image is tested as follows:

(1 - c)T ≤ f(x, y) ≤ (1 + c)T

(1 - d)T₁ ≤ f(x, y) ≤ (1 + d)T₁

where c and d are preset parameters; if the inequalities above are satisfied, proceed to step 3.2.3; otherwise the pixel is binarized according to the global threshold method;

Step 3.2.3, when the above inequalities are satisfied, perform local thresholding using an improved Bernsen algorithm;

Step 3.2.4, repeat the above steps to binarize the whole image.

5. The terminal confidentiality inspection method based on OCR technology according to claim 4, characterized in that: in step 3.2.4, the image is first binarized by the global threshold method, and the result is then binarized further using the improved Bernsen algorithm, which removes the complex background and yields a better foreground image.

6. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: in step 3, if the image is tilted, the detailed process of compressing the image and applying a Hough transform to a selected part of it, to obtain the corrected image, is:

Step 3.3.1, compress the image, divide it into blocks, and select part of the character image as the detection target;

Step 3.3.2, construct a discrete parameter space on the ρ-θ plane, establish the accumulator matrix A(ρ, θ), and initialize each of its elements to 0;

Step 3.3.3, apply the Hough transform to each non-zero pixel of the binary image; whenever a θ corresponds to a ρ, record the result in the accumulator matrix;

Step 3.3.4, finally, find the maximum over ρ in the accumulator matrix; the corresponding θ value is the inclination angle of the series of straight lines;

After the tilt angle of the character image has been detected, the image is corrected by rotation; the transformation formula for the rotation correction is as follows:

Processing the tilted image in this way improves the speed of image processing and yields the corrected image.

7. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: in step 4, the image cutting module comprises a text localization module and a character segmentation module, and the detailed process of step 4 is:

Step 4.1, the text localization module extracts the edge information of the character image with the Sobel operator, and performs text localization on the extracted edge information using character edge detection and morphological processing, so as to determine the character regions in the text, which are marked with rectangular boxes;

Step 4.2, for the character regions marked with rectangular boxes, the character segmentation module uses a projection histogram to count the number of target points in each row or each column of the image space; the results are arranged in the row or column order of the image, and the characters are segmented accordingly;

Step 4.3, for the segmented characters, the probability density distribution of the characters in the horizontal and vertical directions is used to scale the characters proportionally and perform linear normalization, finally giving the segmented text.

8. The terminal confidentiality inspection method based on OCR technology according to claim 1, characterized in that: the detailed process of step 5 is:

Step 5.1, directional line element features are extracted from the segmented characters; for each segmented character, all black pixels in the character pixel matrix are scanned one by one with a 3 × 3 window, from top to bottom and from left to right;

If the scanned pixel is a black pixel, that is, part of the character pattern, the number of white pixels in the window is counted;

If the number of white pixels is greater than or equal to 2 and less than 8, the current pixel is a contour point of the character;

Otherwise it is judged to be a noise point, and its pixel value is set to 0;

Step 5.2, combining methods based on structural features with methods based on statistical features, an eight-dimensional direction vector of the character's feature vector is extracted from eight directions, so as to compute the directional element features of the black pixels;

Step 5.3, a template matching classifier is selected, and classification and recognition are performed according to the eight-dimensional direction vector of the obtained template feature vector, so as to distinguish different characters.