CN103996180B - Shredder based on English words feature crushes document restored method - Google Patents


Info

Publication number
CN103996180B
Authority
CN
China
Prior art keywords
binary sequence
starting point
line
english alphabet
lattice
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410185991.5A
Other languages
Chinese (zh)
Other versions
CN103996180A (en)
Inventor
冯钧
陈焕霖
杨艳林
陈丽君
唐志贤
许潇
朱忠华
盛震宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU

Landscapes

  • Character Discrimination (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for restoring shredder-shredded documents based on English text features, belonging to the technical field of image processing. The restoration method comprises four steps: image digitization, image preprocessing, image clustering, and image stitching. Image preprocessing uses MATLAB to import each paper fragment, generates the grayscale matrix corresponding to each fragment, binarizes the grayscale matrix, and generates a binary sequence. Image clustering renders the binary sequence according to English text features and clusters the original images according to the rendering result. Image stitching, according to the clustering result and based on the minimum accumulated edge distance principle, first splices the original images longitudinally and then splices the longitudinally spliced results horizontally. The invention solves the problem of restoring English documents shredded by a shredder, fills a gap in the prior art, and, through clustering, greatly improves stitching efficiency.

Description

Method for restoring shredder-shredded documents based on English text features
Technical field
The present invention relates to a method for restoring shredder-shredded documents based on English text features, and belongs to the field of document restoration within image processing.
Background art
Stitching shredded documents has important applications in fields such as the recovery of judicial evidence, the repair of historical documents, and the acquisition of military intelligence. Traditionally, stitching and restoration have been done by hand. Accuracy is high but efficiency is very low, and when the number of fragments is large, manual stitching can hardly be completed in a short time. With the development of computer technology, people have tried to develop automatic stitching techniques for paper fragments to improve restoration efficiency. A good method should need no manual intervention and should be able to stitch together fragments of the same kind. Such a method obtains the relevant information through scanning and image processing and then processes it by computer, thereby achieving fully automatic or semi-automatic stitching and restoration of the shredded paper.
Summary of the invention
To address the low efficiency of existing shredded-paper stitching methods on English documents, the present invention proposes a method for restoring shredder-shredded documents based on English text features.
To achieve the above object, the present invention adopts the following technical scheme:
In the method for restoring shredder-shredded documents based on English text features, the shredded document is scanned into images, and each image is processed according to steps 1 to 3 as follows:
Step 1: establish the grayscale matrix A_k, where k is the index of the shredded-document fragment. Binarizing A_k according to formula (1) yields the binary matrix B_k, which has m rows and n columns:
$$B_k(i,j)=\begin{cases}0, & A_k(i,j)=255\\ 1, & A_k(i,j)\neq 255\end{cases}\qquad(1)$$
Then compute the sum of the elements in each row of B_k and arrange the results in the vertical direction to obtain the binary sequence Q_k:
$$Q_k(i)=\begin{cases}1, & \sum_{j=1}^{n}B_k(i,j)>0\\ 0, & \sum_{j=1}^{n}B_k(i,j)=0\end{cases}\qquad i=1,2,\dots,m\qquad(2);$$
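A minimal Python sketch of step 1 is given here for illustration. It assumes the grayscale image is available as a list of rows of integer pixel values, with 255 meaning pure white; the function names `binarize` and `row_sequence` and the sample fragment are hypothetical and do not come from the patent, which only states that MATLAB is used.

```python
def binarize(gray):
    """Formula (1): 0 where the pixel is pure white (255), 1 otherwise."""
    return [[0 if px == 255 else 1 for px in row] for row in gray]


def row_sequence(binary):
    """Formula (2): 1 for every row that contains at least one non-white pixel."""
    return [1 if sum(row) > 0 else 0 for row in binary]


# Hypothetical 4-row, 3-column fragment: only rows 1 and 2 contain ink.
gray = [
    [255, 255, 255],
    [255, 0, 255],
    [120, 255, 255],
    [255, 255, 255],
]
B = binarize(gray)
Q = row_sequence(B)
```

In this sketch any pixel that is not exactly 255 counts as ink, which follows formula (1) literally; a scan with noise would probably need a softer threshold, which the text does not discuss.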
Step 2: from the upper width W_u, middle width W_m, and lower width W_d of the English letter typesetting space, determine the position that an English letter occupies in the four-line, three-space grid. The positions are upper-middle, middle, middle-lower, and upper-middle-lower.
The upper-middle and middle-lower positions are distinguished by the pixel distribution of the binary sequence Q_k:
when the pixel sum of the first third of Q_k is less than the pixel sum of the last third, the letter belongs to the upper-middle position,
when the pixel sum of the first third of Q_k is greater than the pixel sum of the last third and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle position,
when the pixel sum of the first third of Q_k is greater than the pixel sum of the last third and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower position;
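The thirds rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be a list of per-row pixel counts over the rows the letter spans (a purely 0/1 run would give equal sums), both of the first two rules are taken to assign the letter to the upper-middle position, and the equal-sums case, which the text does not cover, returns `None`. The function name `split_upper_or_lower` is hypothetical.

```python
def split_upper_or_lower(row_sums):
    """Decide between 'upper-middle' and 'middle-lower' from the first and last thirds."""
    if len(row_sums) < 3:
        raise ValueError("need at least three rows to form thirds")
    third = len(row_sums) // 3
    first = sum(row_sums[:third])   # pixel sum of the first third
    last = sum(row_sums[-third:])   # pixel sum of the last third
    if first < last:
        return "upper-middle"
    if first > last:
        # first / last < 3/2 is tested as 2 * first < 3 * last to avoid dividing by zero.
        return "upper-middle" if 2 * first < 3 * last else "middle-lower"
    return None  # equal sums: not specified in the text (assumption: leave undecided)


thin_top = split_upper_or_lower([1, 1, 1, 5, 5, 5])
slightly_heavier_top = split_upper_or_lower([5, 5, 4, 4, 4, 4])
heavy_top = split_upper_or_lower([6, 6, 3, 3, 1, 1])
```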
Scan the binary sequence Q_k from top to bottom. On reading the first 1 that follows a 0, record the row number f_1 of that 1. Continue scanning from row f_1, and on reading the first 0 after row f_1, record the row number f_2 of that 0. The value of f_2 - f_1 determines the position the English letter occupies in the four-line, three-space grid:
when f_2 - f_1 < W_m - 1: the scanned part of the binary sequence does not form a complete English letter, and scanning of the binary sequence continues downward,
when W_m - 1 ≤ f_2 - f_1 < W_m + 1: the scanned part of the binary sequence forms a complete English letter, and this letter is in the middle position of the grid,
when W_m + 1 < f_2 - f_1 ≤ W_m + W_u + 2: the scanned part of the binary sequence forms a complete English letter, and this letter is in the upper-middle or middle-lower position of the grid,
when W_m + W_u + 2 < f_2 - f_1 ≤ W_m + W_u + W_d + 1: the scanned part of the binary sequence forms a complete English letter, and this letter is in the upper-middle-lower position of the grid;
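The scan and the four height ranges can be sketched in Python as below. The sketch uses 0-based row indices, treats a run that reaches the end of the sequence as ending at `len(q)`, and returns `None` for any height that falls into none of the four ranges (for example f_2 - f_1 = W_m + 1, which the inequalities as written leave uncovered). These choices and the names `first_run` and `classify_height` are assumptions made for illustration.

```python
def first_run(q, start=0):
    """Return (f1, f2): the row of the first 1 that follows a 0, and the row of the next 0."""
    for i in range(max(start, 1), len(q)):
        if q[i - 1] == 0 and q[i] == 1:
            for j in range(i + 1, len(q)):
                if q[j] == 0:
                    return i, j
            return i, len(q)  # assumption: a run that hits the end closes at len(q)
    return None


def classify_height(f1, f2, w_u, w_m, w_d):
    """Map f2 - f1 to a grid position using the four ranges of step 2."""
    d = f2 - f1
    if d < w_m - 1:
        return "incomplete"
    if w_m - 1 <= d < w_m + 1:
        return "middle"
    if w_m + 1 < d <= w_m + w_u + 2:
        return "upper-middle or middle-lower"
    if w_m + w_u + 2 < d <= w_m + w_u + w_d + 1:
        return "upper-middle-lower"
    return None  # height not covered by any of the stated ranges


# Hypothetical sequence: 5 blank rows, a 25-row run of ink, 5 blank rows.
q = [0] * 5 + [1] * 25 + [0] * 5
f1, f2 = first_run(q)
position = classify_height(f1, f2, w_u=13, w_m=25, w_d=13)
```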
Step 3: find the rendering start point in the binary sequence Q_k, render the binary sequence, and then perform clustering:
Step 3-1: determine the start point of the first rendering round from the position the English letter occupies in the four-line, three-space grid:
when the letter is in the upper-middle position, start from the lowest point of the middle zone and move back W_m + W_u from that point to obtain the start point of the first rendering round,
when the letter is in the middle position, start from the topmost point of the middle zone and move back W_u from that point to obtain the start point of the first rendering round,
when the letter is in the middle-lower position, start from the topmost point of the middle zone and move back W_u from that point to obtain the start point of the first rendering round,
when the letter is in the upper-middle-lower position, start from the lowest point of the lower zone and move back W_u + W_m + W_d from that point to obtain the start point of the first rendering round;
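The four rollback rules of step 3-1 amount to a small lookup, sketched here in Python. The sketch assumes that "moving back" means moving toward the top of the sequence (smaller row index) and that the result is clamped at row 0; the caller is expected to pass the reference row named by the rule (lowest point of the middle zone, topmost point of the middle zone, or lowest point of the lower zone). The function name `first_round_start` and the position labels are hypothetical.

```python
def first_round_start(position, reference_row, w_u, w_m, w_d):
    """Return the start row of the first rendering round for a given grid position."""
    rollback = {
        "upper-middle": w_m + w_u,              # from the lowest point of the middle zone
        "middle": w_u,                          # from the topmost point of the middle zone
        "middle-lower": w_u,                    # from the topmost point of the middle zone
        "upper-middle-lower": w_u + w_m + w_d,  # from the lowest point of the lower zone
    }[position]
    # Assumption: never step above the first row of the sequence.
    return max(0, reference_row - rollback)


start_upper = first_round_start("upper-middle", 100, w_u=13, w_m=25, w_d=13)
start_middle = first_round_start("middle", 100, w_u=13, w_m=25, w_d=13)
start_full = first_round_start("upper-middle-lower", 100, w_u=13, w_m=25, w_d=13)
```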
Step 3-2: starting from the start point of the first rendering round, render the binary sequence to obtain a new binary sequence, which comprises step a and step b:
Step a: from the start point of the first rendering round, scan upward and downward. On reaching the first 1 that follows a 0, record the row number f_k of that 1. Taking f_k as the reference, cover with 1s every position of the font space that lies above f_k within a distance of W_b, and likewise cover with 1s every position of the font space that lies below f_k within a distance of W_u + W_m + W_d + W_b, where W_b is the vertical spacing between letters,
Step b: after the first round of rendering is complete, take the next point after the first-round start point in the scanning direction as the new rendering start point and repeat step a. After each round is complete, the next point becomes the new rendering start point:
for upward scanning, take the point above the current round's start point as the new rendering start point; for downward scanning, take the point below the current round's start point as the new rendering start point. Repeat step a until the top and the bottom of the binary sequence Q_k are reached, which generates the new binary sequence Q'_k;
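A single downward rendering round (step a) might look like the following Python sketch. It is a simplified illustration, not the full procedure: it handles only the downward direction, fills an inclusive range of rows clipped to the sequence bounds, and performs one round rather than the repeated rounds of step b. The function name `render_round` and these boundary choices are assumptions.

```python
def render_round(q, start, w_u, w_m, w_d, w_b):
    """Scan downward from `start`; at the first 1 that follows a 0 (row f),
    set rows f - w_b through f + w_u + w_m + w_d + w_b to 1.
    Returns the rendered copy and f (or None if no such row exists)."""
    out = list(q)
    for f in range(max(start, 1), len(q)):
        if q[f - 1] == 0 and q[f] == 1:
            top = max(0, f - w_b)
            bottom = min(len(q) - 1, f + w_u + w_m + w_d + w_b)
            for row in range(top, bottom + 1):
                out[row] = 1
            return out, f
    return out, None


# Hypothetical toy sequence with all four widths set to 1.
q = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
rendered, f = render_round(q, start=0, w_u=1, w_m=1, w_d=1, w_b=1)
```

The upward direction would mirror this loop, and step b would call the round again from the neighbouring start point until both ends of the sequence are reached.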
Step 4: cluster according to the values of the elements of the new binary sequence Q'_k of each image: binary subsequences with exactly identical values are grouped into one class, and binary subsequences with different values are grouped into different classes. "Clustering" here simply means sorting by the element values of the binary sequences. For example, if the result contains four binary sequences, [1,0,1], [1,1,1], [1,1,1], and [1,0,1], they are grouped by element values into two classes: class one, [1,0,1] and [1,0,1]; class two, [1,1,1] and [1,1,1];
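Because step 4 groups sequences only by exact equality, it can be sketched in Python with a dictionary keyed on the sequence itself. The function name `cluster_identical` is hypothetical; the input reproduces the four-sequence example from the text.

```python
from collections import defaultdict


def cluster_identical(sequences):
    """Group fragment indices whose rendered binary sequences are exactly equal."""
    groups = defaultdict(list)
    for index, seq in enumerate(sequences):
        groups[tuple(seq)].append(index)  # tuples are hashable, lists are not
    return list(groups.values())


classes = cluster_identical([[1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]])
```

Here fragments 0 and 3 form one class and fragments 1 and 2 the other, matching the two classes described in the example.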
Step 5: within each class, apply the minimum accumulated edge distance principle:
find the two binary matrices B' and B'' with the minimum accumulated edge distance and splice the corresponding fragments by horizontal matching, until all images in the class have been spliced together; then splice the horizontally matched fragments by longitudinal matching, again according to the minimum accumulated edge distance.
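The text here does not reproduce the expression for the accumulated edge distance, so the following Python sketch rests on an assumption: the distance between a left fragment B' and a right fragment B'' is taken as the sum over all m rows of |B'(i, n) - B''(i, 1)|, that is, the row-by-row difference between the last column of B' and the first column of B''. The function names `edge_distance` and `best_pair` and the sample fragments are hypothetical.

```python
def edge_distance(left, right):
    """Assumed accumulated edge distance: right edge of `left` vs. left edge of `right`."""
    return sum(abs(a[-1] - b[0]) for a, b in zip(left, right))


def best_pair(fragments):
    """Return (distance, i, j) for the ordered pair with the smallest edge distance."""
    best = None
    for i, left in enumerate(fragments):
        for j, right in enumerate(fragments):
            if i == j:
                continue
            d = edge_distance(left, right)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best


# Three hypothetical 3-row, 2-column binary fragments.
fragments = [
    [[0, 1], [0, 0], [0, 1]],
    [[1, 0], [0, 0], [1, 1]],
    [[0, 1], [1, 1], [0, 1]],
]
result = best_pair(fragments)
```

Repeating this search and merging the winning pair each time would give a greedy horizontal splice within one class; the same idea applied to top and bottom rows would serve the longitudinal pass.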
By adopting the above technical scheme, the present invention has the following beneficial effects: it analyzes the typesetting rules of English fonts and fills a gap in the stitching of English-document paper fragments; and for a given set of English-document fragments, clustering by English typesetting features greatly reduces the minimum-edge-distance computation required during horizontal and longitudinal matching, which improves stitching efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the method for restoring shredder-shredded documents based on English text features.
Fig. 2 shows the electronic images obtained by scanning 38 paper fragments.
Fig. 3 shows the process of binarizing the grayscale matrix and converting it to a binary sequence.
Fig. 4 shows the typesetting features of English fonts.
Fig. 5 is a schematic diagram of the rendering process when an English letter occupies the middle of the four-line grid.
Fig. 6 shows the clustering result for the 38 paper fragments.
Fig. 7 shows the horizontal stitching result for the 38 paper fragments.
Fig. 8 shows the final stitching result for the 38 paper fragments.
Detailed description of the invention
The technical scheme of the invention is described in detail below with reference to the accompanying drawings:
The method for restoring shredder-shredded documents based on English text features proposed by the present invention is implemented according to the flow chart shown in Fig. 1. A scanner scans 38 paper document fragments and outputs an electronic image of each fragment, as shown in Fig. 2. MATLAB is used to import each fragment.
Following the conversion process shown in Fig. 3, the grayscale matrix A_k (k = 1, 2, …, 38) corresponding to each fragment is generated. A_k is binarized using formula (1) to obtain the binary matrix B_k (k = 1, 2, …, 38), and the binary sequence Q_k (k = 1, 2, …, 38) is then generated using formula (2);
As shown in Fig. 4 (the hatched area is the space occupied by the font), the upper width W_u = 13, the middle width W_m = 25, and the lower width W_d = 13 of the English letter typesetting space are calculated, together with the spacing between letters W_b = 13 (i.e., the width of the blank space). The binary sequence Q_k (k = 1, 2, …, 38) is scanned from top to bottom. On reading the first 1 that follows a 0, its row number f_1 is recorded; scanning continues downward until the first 0 is found, and its row number f_2 is recorded. The value of f_2 - f_1 then gives a preliminary determination of the position the letter occupies in the four-line, three-space grid:
① f_2 - f_1 < 24: not a complete letter; continue scanning downward;
② 24 ≤ f_2 - f_1 < 26: middle;
③ 26 < f_2 - f_1 ≤ 41: upper-middle or middle-lower;
④ 41 < f_2 - f_1 ≤ 52: upper-middle-lower.
The upper-middle and middle-lower positions are then distinguished by the pixel distribution of the binary sequence Q_k (k = 1, 2, …, 38). If the pixel sum of the first third is less than the pixel sum of the last third, the letter belongs to the upper-middle position. If the pixel sum of the first third is greater than the pixel sum of the last third, a further check is needed: if the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle position; if the ratio is greater than or equal to 3/2, the letter belongs to the middle-lower position;
After the position the English letter occupies in the four-line, three-space grid is determined, the rendering start point is found in Q_k (k = 1, 2, …, 38) according to the following rules:
1. Upper-middle: starting from the lowest point of the middle zone, move back 38 points from that point to obtain the start point;
2. Middle: starting from the topmost point of the middle zone, move back 13 points from that point to obtain the start point;
3. Middle-lower: starting from the topmost point of the middle zone, move back 13 points from that point to obtain the start point;
4. Upper-middle-lower: starting from the lowest point of the lower zone, move back 51 points from that point to obtain the start point.
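The rollback distances in these four rules follow from the general expressions of step 3-1 with the widths measured in this embodiment (W_u = 13, W_m = 25, W_d = 13). The arithmetic is written out below as a check:

```latex
\begin{aligned}
\text{upper-middle:}\quad & W_m + W_u = 25 + 13 = 38\\
\text{middle:}\quad & W_u = 13\\
\text{middle-lower:}\quad & W_u = 13\\
\text{upper-middle-lower:}\quad & W_u + W_m + W_d = 13 + 25 + 13 = 51
\end{aligned}
```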
Once the rendering start point is found, scanning proceeds downward. On reaching the first 1 that follows a 0, its row number f_k in the sequence is recorded. Taking f_k as the reference, every position of the font space above f_k within a distance of 13 is covered with 1s, and every position of the font space below f_k within a distance of 64 is likewise covered with 1s. When one round of "rendering" is complete, the next point is taken as the new rendering start point (if the direction is upward, the point above is taken as the new rendering start point), and the above process is repeated until the top and the bottom of the sequence are reached, which generates the new binary sequence Q'_k (k = 1, 2, …, 38). Fig. 5 is a schematic diagram of the rendering process when an English letter occupies the middle of the four-line grid.
According to the differences among the sequences Q'_k (k = 1, 2, …, 38) of the images, the 38 paper fragments are divided into the two classes shown in Fig. 6. Within each of the two classes, based on the minimum accumulated edge distance principle, the pair of binary matrices with the minimum accumulated edge distance is found among all binary matrices in the class and spliced by horizontal matching, until all images in the class have been spliced together. The horizontal stitching result is shown in Fig. 7;
According to the minimum accumulated edge distance principle, the two classes are then spliced by longitudinal matching. The longitudinal stitching result is shown in Fig. 8.

Claims (4)

1. A method for restoring shredder-shredded documents based on English text features, characterized in that the shredded document is scanned into images and each image is processed according to steps 1 to 3 as follows:
Step 1: establish the grayscale matrix, binarize the grayscale matrix to obtain the binary matrix, compute the sum of the elements in each row of the binary matrix, and arrange the row sums in the vertical direction to obtain the binary sequence;
Step 2: from the upper width W_u, middle width W_m, and lower width W_d of the English letter typesetting space, determine the position that an English letter occupies in the four-line, three-space grid, the positions being upper-middle, middle, middle-lower, and upper-middle-lower;
Step 3: find the rendering start point in the binary sequence, render the binary sequence, and then perform clustering:
Step 3-1: determine the start point of the first rendering round from the position the English letter occupies in the four-line, three-space grid:
when the letter is in the upper-middle position, start from the lowest point of the middle zone and move back W_m + W_u from that point to obtain the start point of the first rendering round,
when the letter is in the middle position, start from the topmost point of the middle zone and move back W_u from that point to obtain the start point of the first rendering round,
when the letter is in the middle-lower position, start from the topmost point of the middle zone and move back W_u from that point to obtain the start point of the first rendering round,
when the letter is in the upper-middle-lower position, start from the lowest point of the lower zone and move back W_u + W_m + W_d from that point to obtain the start point of the first rendering round;
Step 3-2: starting from the start point of the first rendering round, render the binary sequence to obtain a new binary sequence, which comprises step a and step b:
Step a: from the start point of the first rendering round, scan upward and downward; on reaching the first 1 that follows a 0, record the row number f_k of that 1; taking f_k as the reference, cover with 1s every position of the font space that lies above f_k within a distance of W_b, and likewise cover with 1s every position of the font space that lies below f_k within a distance of W_u + W_m + W_d + W_b, where W_b is the vertical spacing between letters,
Step b: after the first round of rendering is complete, take the next point after the first-round start point in the scanning direction as the new rendering start point and repeat step a, traversing all points of the binary sequence as rendering start points to generate the new binary sequence;
Step 4: cluster the new binary sequence of each image;
Step 5: within each class, find, based on the minimum accumulated edge distance principle, the two binary matrices with the minimum accumulated edge distance and splice the corresponding fragments by horizontal matching, until all images in the class have been spliced together; then splice the horizontally matched fragments by longitudinal matching according to the minimum accumulated edge distance.
2. The method for restoring shredder-shredded documents based on English text features according to claim 1, characterized in that the upper-middle and middle-lower positions described in step 2 are distinguished by the pixel distribution of the binary sequence:
when the pixel sum of the first third of the binary sequence is less than the pixel sum of the last third, the letter belongs to the upper-middle position,
when the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle position,
when the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower position.
3. The method for restoring shredder-shredded documents based on English text features according to claim 2, characterized in that step 2 is specifically carried out as follows: scan the binary sequence from top to bottom; on reading the first 1 that follows a 0, record the row number f_1 of that 1; continue scanning from row f_1, and on reading the first 0 after row f_1, record the row number f_2 of that 0; the value of f_2 - f_1 determines the position the English letter occupies in the four-line, three-space grid:
when f_2 - f_1 < W_m - 1: the scanned part of the binary sequence does not form a complete English letter, and scanning of the binary sequence continues downward,
when W_m - 1 ≤ f_2 - f_1 < W_m + 1: the scanned part of the binary sequence forms a complete English letter, and this letter is in the middle position of the grid,
when W_m + 1 < f_2 - f_1 ≤ W_m + W_u + 2: the scanned part of the binary sequence forms a complete English letter, and this letter is in the upper-middle or middle-lower position of the grid,
when W_m + W_u + 2 < f_2 - f_1 ≤ W_m + W_u + W_d + 1: the scanned part of the binary sequence forms a complete English letter, and this letter is in the upper-middle-lower position of the grid.
4. The method for restoring shredder-shredded documents based on English text features according to claim 3, characterized in that, in step 5, the two binary matrices B' and B'' with the minimum accumulated edge distance are found by means of an expression based on the minimum accumulated edge distance principle, where m is the number of rows of the binary matrix and n is the number of columns of the binary matrix.
CN201410185991.5A 2014-05-05 2014-05-05 Shredder based on English words feature crushes document restored method Expired - Fee Related CN103996180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185991.5A CN103996180B (en) 2014-05-05 2014-05-05 Shredder based on English words feature crushes document restored method

Publications (2)

Publication Number Publication Date
CN103996180A CN103996180A (en) 2014-08-20
CN103996180B true CN103996180B (en) 2016-09-07

Family

ID=51310337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185991.5A Expired - Fee Related CN103996180B (en) 2014-05-05 2014-05-05 Shredder based on English words feature crushes document restored method

Country Status (1)

Country Link
CN (1) CN103996180B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952230B (en) * 2017-03-19 2021-02-02 北京工业大学 Transverse and longitudinal shredded slice restoration method based on clustering and ant colony algorithm
CN106952232B (en) * 2017-03-22 2019-01-25 海南职业技术学院 A kind of picture and text fragment restoration methods based on ant group algorithm
CN107229953B (en) * 2017-06-06 2020-12-25 西南石油大学 Broken document splicing method based on DFS and improved center clustering method
CN107180412B (en) * 2017-06-15 2020-10-16 北京工业大学 Horizontal and vertical shredded paper sheet reconstruction method based on horizontal projection and K-means clustering
CN108694414A (en) * 2018-05-11 2018-10-23 哈尔滨工业大学深圳研究生院 Digital evidence obtaining file fragmentation sorting technique based on digital picture conversion and deep learning
CN108921793A (en) * 2018-07-15 2018-11-30 江西理工大学 A scrap of paper based on fuzzy C-means clustering splices restored method
CN109584163B (en) * 2018-12-17 2020-12-08 深圳市华星光电半导体显示技术有限公司 Method for restoring original file of paper scrap

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679671A (en) * 2014-01-12 2014-03-26 王浩 Transverse and vertical sliced shredded paper splicing and recovery algorithm of FFT (Fast Fourier Transform) integrated comprehensive evaluation method
CN103679678A (en) * 2013-12-18 2014-03-26 山东大学 Semi-automatic splicing recovery method for character characteristic rectangular scraps of paper

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160907
