CN103996180B - Method for restoring shredder-crushed documents based on English text features - Google Patents
- Publication number
- CN103996180B (application CN201410185991.5A)
- Authority
- CN
- China
- Prior art keywords
- binary sequence
- starting point
- line
- english alphabet
- lattice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Character Discrimination (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for restoring documents crushed by a shredder, based on English text features, and belongs to the technical field of image processing. The method comprises four steps: image digitization, image preprocessing, image clustering, and image stitching. In the preprocessing step, each paper fragment is imported with MATLAB, a gray matrix is generated for it and binarized, and a binary sequence is generated. In the clustering step, the binary sequence is rendered according to English text features, and the original images are clustered according to the rendering result. In the stitching step, according to the clustering result and based on the minimum accumulated edge distance principle, the original images are stitched longitudinally, and the longitudinally stitched results are then stitched horizontally. The invention solves the problem of restoring English documents crushed by a shredder, which the prior art did not address, and clustering greatly improves stitching efficiency.
Description
Technical field
The present invention relates to a method for restoring shredder-crushed documents based on English text features, and belongs to the field of document restoration within image processing.
Background art
Stitching shredded documents has important applications in fields such as the recovery of judicial evidence, the repair of historical documents, and the acquisition of military intelligence. Traditionally, stitching and restoration are done by hand. Accuracy is high but efficiency is very low, and when the number of fragments is large, manual stitching cannot be completed in a short time. With the development of computer technology, automatic fragment-stitching techniques have been attempted in order to improve restoration efficiency. A good method should need no manual intervention and should be able to stitch together fragments of the same class. Such techniques obtain the relevant information by scanning and image processing and then process it by computer, thereby achieving fully automatic or semi-automatic stitching and restoration of the shredded fragments.
Summary of the invention
The present invention addresses the low efficiency of existing fragment-stitching methods on English documents and proposes a method for restoring shredder-crushed documents based on English text features.
To achieve the above object, the present invention adopts the following technical scheme:
In the method for restoring shredder-crushed documents based on English text features, the shredded document is scanned into images, and each image is processed according to steps 1 to 3 as follows:
Step 1: establish the gray matrix A_k, where k numbers the shredded-document fragments;
binarize the gray matrix A_k to obtain the binary matrix B_k, which has m rows and n columns;
then compute the sum of the elements of each row of B_k, and arrange these row sums in the vertical direction to obtain the binary sequence Q_k;
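Step 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the source's binarization and sequence formulas are not reproduced in this text, so the threshold of 128, the polarity (1 = ink, 0 = background), and the reduction of each row sum to 0 or 1 are all assumptions, and the names `binarize` and `row_sequence` are hypothetical.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map a gray matrix A_k to a binary matrix B_k.

    Assumption: pixels darker than `threshold` are ink (1), others background (0).
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def row_sequence(binary: List[List[int]]) -> List[int]:
    """Collapse B_k into a vertical sequence Q_k, one entry per row.

    Assumption: an entry is 1 when the row sum is nonzero (the row holds ink).
    """
    return [1 if sum(row) > 0 else 0 for row in binary]


# A tiny 4 x 3 gray matrix: rows 1 and 2 contain dark pixels.
gray = [
    [255, 255, 255],
    [255, 10, 255],
    [20, 30, 255],
    [255, 255, 255],
]
B = binarize(gray)
Q = row_sequence(B)
print(Q)  # [0, 1, 1, 0]
```

Collapsing each row to a single value is what lets the later steps reason about text-line geometry along one dimension only.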
Step 2: from the upper width W_u, middle width W_m, and lower width W_d of the typesetting space of an English letter, determine the position the letter occupies in the four-line, three-grid space. The possible positions are upper-middle, middle, middle-lower, and upper-middle-lower.
Upper-middle and middle-lower are distinguished according to the pixel distribution of the binary sequence Q_k:
when the pixel sum of the first third of Q_k is less than the pixel sum of the last third, the letter belongs to the upper-middle;
when the pixel sum of the first third of Q_k is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle;
when the pixel sum of the first third of Q_k is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower;
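The thirds rule above can be written as a small decision function. This is a sketch, not the patent's implementation: it reads both "middle" outcomes of the rule as upper-middle, since the rule only separates upper-middle from middle-lower; it assumes `values` holds the per-row values covering one letter; and it returns "undecided" when the two sums are equal, a case the text does not cover. The function name is hypothetical.

```python
from typing import List


def upper_or_lower(values: List[int]) -> str:
    """Decide between upper-middle and middle-lower from the first and last thirds."""
    third = len(values) // 3
    front = sum(values[:third])                 # pixel sum of the first third
    rear = sum(values[len(values) - third:])    # pixel sum of the last third
    if front < rear:
        return "upper-middle"
    if front > rear:
        # front / rear < 3/2, written without division so rear == 0 is safe
        if 2 * front < 3 * rear:
            return "upper-middle"
        return "middle-lower"
    return "undecided"  # front == rear is not covered by the rule


print(upper_or_lower([1, 1, 5, 5, 5, 5]))  # thin ascender on top: upper-middle
print(upper_or_lower([6, 6, 3, 3, 2, 2]))  # heavy top, thin tail: middle-lower
```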
Scan the binary sequence Q_k from top to bottom. On reading the first 1 that follows a 0, record the row number f_1 of that 1. Continue scanning from row f_1 and, on reading the first 0 after row f_1, record the row number f_2 of that 0. The value f_2 - f_1 determines the position the English letter occupies in the four-line, three-grid space:
When f_2 - f_1 < W_m - 1: the part of the binary sequence scanned so far does not form a complete English letter; continue scanning the binary sequence downward.
When W_m - 1 ≤ f_2 - f_1 < W_m + 1: the part scanned so far forms a complete English letter, and the letter is in the middle of the four-line, three-grid space.
When W_m + 1 < f_2 - f_1 ≤ W_m + W_u + 2: the part scanned so far forms a complete English letter, and the letter is in the upper-middle or the middle-lower of the four-line, three-grid space.
When W_m + W_u + 2 < f_2 - f_1 ≤ W_m + W_u + W_d + 1: the part scanned so far forms a complete English letter, and the letter is in the upper-middle-lower of the four-line, three-grid space;
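The scan and the four threshold cases can be sketched as follows. The inequalities are copied from the text as written, so a height of exactly W_m + 1 falls into none of the cases and is reported as "unclassified" here. The function names and return labels are hypothetical, and the widths in the usage lines (13, 25, 13) are the ones given in the embodiment.

```python
from typing import List, Optional, Tuple


def first_run(q: List[int], start: int = 0) -> Optional[Tuple[int, int]]:
    """Return (f1, f2): f1 is the first 1 preceded by a 0, f2 the first 0 after f1."""
    for i in range(max(start, 1), len(q)):
        if q[i - 1] == 0 and q[i] == 1:
            for j in range(i + 1, len(q)):
                if q[j] == 0:
                    return i, j
            return None  # the run of 1s reaches the end of the sequence
    return None


def classify(height: int, w_u: int, w_m: int, w_d: int) -> str:
    """Map f2 - f1 to a position in the four-line, three-grid space."""
    if height < w_m - 1:
        return "incomplete"
    if w_m - 1 <= height < w_m + 1:
        return "middle"
    if w_m + 1 < height <= w_m + w_u + 2:
        return "upper-middle or middle-lower"
    if w_m + w_u + 2 < height <= w_m + w_u + w_d + 1:
        return "upper-middle-lower"
    return "unclassified"  # not covered by the stated ranges


q = [0] * 3 + [1] * 25 + [0] * 5
run = first_run(q)
f1, f2 = run
print(f1, f2, classify(f2 - f1, 13, 25, 13))  # 3 28 middle
```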
Step 3: find the rendering starting point in the binary sequence Q_k, render the binary sequence, and then perform clustering:
Step 3-1: from the position the English letter occupies in the four-line, three-grid space, determine the starting point of the first round of rendering:
when the letter is in the upper-middle of the four-line, three-grid space, start from the lowest point of the middle part and move back W_m + W_u from that position to obtain the first-round starting point;
when the letter is in the middle of the four-line, three-grid space, start from the topmost point of the middle part and move back W_u from that position to obtain the first-round starting point;
when the letter is in the middle-lower of the four-line, three-grid space, start from the topmost point of the middle part and move back W_u from that position to obtain the first-round starting point;
when the letter is in the upper-middle-lower of the four-line, three-grid space, start from the lowest point of the lower part and move back W_u + W_m + W_d from that position to obtain the first-round starting point;
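The four rollback rules amount to a lookup from position to offset. A minimal sketch follows, assuming that "moving back" means subtracting the offset from a row index and clamping at row 0 (the text does not say how the top of the fragment is handled); `rollback_offset` and `render_start` are hypothetical names.

```python
def rollback_offset(position: str, w_u: int, w_m: int, w_d: int) -> int:
    """Number of rows to move back for each letter position (step 3-1)."""
    offsets = {
        "upper-middle": w_m + w_u,              # from the lowest point of the middle part
        "middle": w_u,                          # from the topmost point of the middle part
        "middle-lower": w_u,                    # from the topmost point of the middle part
        "upper-middle-lower": w_u + w_m + w_d,  # from the lowest point of the lower part
    }
    return offsets[position]


def render_start(position: str, reference_row: int, w_u: int, w_m: int, w_d: int) -> int:
    """First-round starting row; clamping at 0 is an assumption."""
    return max(0, reference_row - rollback_offset(position, w_u, w_m, w_d))


print(rollback_offset("upper-middle", 13, 25, 13))        # 38
print(rollback_offset("upper-middle-lower", 13, 25, 13))  # 51
print(render_start("middle", 100, 13, 25, 13))            # 87
```

With the embodiment's widths the offsets come out as 38, 13, 13, and 51, which matches the point counts listed in the detailed description.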
Step 3-2: starting from the first-round starting point, render the binary sequence to obtain a new binary sequence. This comprises step a and step b:
Step a: starting from the first-round starting point, scan upward and downward respectively. On encountering the first 1 that follows a 0, record the row number f_k of that 1. With f_k as the reference, cover with 1 all of the font space above f_k within a distance W_b of f_k, and likewise cover with 1 all of the font space below f_k within a distance W_u + W_m + W_d + W_b of f_k, where W_b is the vertical spacing between letters.
Step b: after the first round of rendering is complete, take the point next to the first-round starting point in the scanning direction as the new rendering starting point and repeat step a; after each round of rendering, the next point becomes the new starting point.
When scanning upward, take the point above the current round's starting point as the new starting point; when scanning downward, take the point below the current round's starting point as the new starting point. Repeat step a until the top and the bottom of the binary sequence Q_k are reached, which yields the new binary sequence Q'_k;
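A single downward pass of step a can be sketched as below. This is a literal reading of the text under several assumptions: only the downward direction is shown, the row f_k itself is counted as part of the region below, ranges are clipped at the ends of the sequence, and `span` stands for W_u + W_m + W_d + W_b. The upward pass and the iteration of step b are omitted, and `render_down` is a hypothetical name.

```python
from typing import List


def render_down(q: List[int], start: int, w_b: int, span: int) -> List[int]:
    """One downward pass of step a, applied to a copy of q."""
    out = list(q)
    for i in range(max(start, 1), len(out)):
        if out[i - 1] == 0 and out[i] == 1:      # first 1 that follows a 0: row f_k
            for r in range(max(0, i - w_b), i):  # W_b rows above f_k
                out[r] = 1
            for r in range(i, min(len(out), i + span)):  # span rows from f_k downward
                out[r] = 1
            break
    return out


q = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
rendered = render_down(q, start=0, w_b=2, span=3)
print(rendered)  # [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```

With the embodiment's values the two distances would be 13 rows above and 64 rows below; the small numbers above are chosen only to keep the example readable.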
Step 4: cluster according to the values of the elements of each image's new binary sequence Q'_k: binary sequences whose values are exactly identical are grouped into one class, and binary sequences with different values are grouped into different classes. "Clustering" here simply means sorting by differences in the elements of the binary sequences. For example, if the result is four binary sequences, [1,0,1], [1,1,1], [1,1,1], and [1,0,1], they can be grouped by element values into two classes: class one, [1,0,1] and [1,0,1]; class two, [1,1,1] and [1,1,1];
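Because step 4 groups sequences only by exact equality, it reduces to a dictionary keyed on the sequence. The sketch below uses the four-sequence example from the text; the function name `cluster_by_sequence` and the choice to return fragment indices are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_by_sequence(sequences: List[List[int]]) -> Dict[Tuple[int, ...], List[int]]:
    """Group fragment indices whose rendered sequences are exactly identical."""
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, sequence in enumerate(sequences):
        groups[tuple(sequence)].append(index)  # tuples are hashable, lists are not
    return dict(groups)


rendered_sequences = [[1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]]
clusters = cluster_by_sequence(rendered_sequences)
print(clusters)  # {(1, 0, 1): [0, 3], (1, 1, 1): [1, 2]}
```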
Step 5: within each class, based on the minimum accumulated edge distance principle, find the two binary matrices B' and B'' whose accumulated edge distance is smallest, and stitch the shredded documents corresponding to these two matrices together horizontally. Repeat until all images in the class are stitched together. Then, again according to the minimum accumulated edge distance, stitch the horizontally matched documents together longitudinally.
With the above technical scheme, the present invention has the following beneficial effects: it analyzes the typesetting rules of English fonts and fills a gap in the stitching of English document fragments; and, because the given English document fragments are clustered according to English typesetting features, the amount of minimum-edge-distance computation in horizontal and longitudinal matching is greatly reduced, which improves stitching efficiency.
Brief description of the drawings
Fig. 1 is the flow chart of the method for restoring shredder-crushed documents based on English text features.
Fig. 2 shows the electronic images obtained by scanning the 38 paper fragments.
Fig. 3 illustrates the binarization of the gray matrix and its conversion to a binary sequence.
Fig. 4 illustrates the typesetting features of English fonts.
Fig. 5 is a schematic diagram of the rendering process when the English letter occupies the middle of the four-line, three-grid space.
Fig. 6 shows the clustering result for the 38 paper fragments.
Fig. 7 shows the horizontal stitching result for the 38 paper fragments.
Fig. 8 shows the final stitching result for the 38 paper fragments.
Detailed description of the invention
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
The proposed method for restoring shredder-crushed documents based on English text features follows the flow chart shown in Fig. 1. A scanner is used to scan 38 paper document fragments and to output an electronic image of each fragment, as shown in Fig. 2. Each fragment is imported with MATLAB. Following the conversion process shown in Fig. 3, a gray matrix A_k (k = 1, 2, ..., 38) is generated for each fragment; A_k is binarized using formula (1) to obtain the binary matrix B_k (k = 1, 2, ..., 38), and the binary sequence Q_k (k = 1, 2, ..., 38) is then generated using formula (2).
As shown in Fig. 4 (the hatched portion is the space occupied by the font), the typesetting space of an English letter is measured as upper width W_u = 13, middle width W_m = 25, and lower width W_d = 13, with vertical spacing between letters W_b = 13 (the width of the blank space). The binary sequence Q_k (k = 1, 2, ..., 38) is scanned from top to bottom. On reading the first 1 that follows a 0, its row number f_1 is recorded; scanning continues downward until the first 0 is found, and its row number f_2 is recorded. The value f_2 - f_1 then gives a preliminary determination of the position the English letter occupies in the four-line, three-grid space:
① f_2 - f_1 < 24: not yet a letter; continue scanning downward;
② 24 ≤ f_2 - f_1 < 26: middle;
③ 26 < f_2 - f_1 ≤ 41: upper-middle or middle-lower;
④ 41 < f_2 - f_1 ≤ 52: upper-middle-lower.
The distinction between upper-middle and middle-lower is determined from the pixel distribution of the binary sequence Q_k (k = 1, 2, ..., 38). If the pixel sum of the first third is less than the pixel sum of the last third, the letter belongs to the upper-middle. If the pixel sum of the first third is greater than the pixel sum of the last third, a further check is needed: if the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle; if the ratio is greater than or equal to 3/2, the letter belongs to the middle-lower.
Once the position the English letter occupies in the four-line, three-grid space is determined, the rendering starting point is located in Q_k (k = 1, 2, ..., 38) according to the following rules:
① Upper-middle: starting from the lowest point of the middle part, move back 38 points from that position to obtain the starting point;
② Middle: starting from the topmost point of the middle part, move back 13 points from that position to obtain the starting point;
③ Middle-lower: starting from the topmost point of the middle part, move back 13 points from that position to obtain the starting point;
④ Upper-middle-lower: starting from the lowest point of the lower part, move back 51 points from that position to obtain the starting point.
After the rendering starting point is found, scanning proceeds upward and downward. On encountering the first 1 that follows a 0, its row number f_k in the sequence is recorded. With f_k as the reference, the font space above f_k within a distance of 13 is entirely covered with 1, and the font space below f_k within a distance of 64 is likewise entirely covered with 1. When one round of "rendering" is complete, the next point is taken as the new rendering starting point (if the direction is upward, the point above is taken as the new rendering starting point), and the above process is repeated until the top and the bottom of the sequence are reached. A new binary sequence Q'_k (k = 1, 2, ..., 38) is thereby generated. Fig. 5 is a schematic diagram of the rendering process when the English letter occupies the middle of the four-line, three-grid space.
According to the differences among the sequences Q'_k (k = 1, 2, ..., 38) of the images, the 38 paper fragments are divided into the two classes shown in Fig. 6. Within each of the two classes, based on the minimum accumulated edge distance principle, the pair with the minimum accumulated edge distance is found among all binary matrices in the class and stitched together horizontally, until all images in the class are stitched together. The horizontal stitching result is shown in Fig. 7.
According to the minimum accumulated edge distance principle, the two classes are then stitched longitudinally. The longitudinal stitching result is shown in Fig. 8.
Claims (4)
1. A method for restoring shredder-crushed documents based on English text features, characterized in that the shredded document is scanned into images and each image is processed according to steps 1 to 3 as follows:
Step 1: establish a gray matrix, binarize the gray matrix to obtain a binary matrix, then compute the sum of the elements of each row of the binary matrix, and arrange these row sums in the vertical direction to obtain a binary sequence;
Step 2: from the upper width W_u, middle width W_m, and lower width W_d of the typesetting space of an English letter, determine the position the letter occupies in the four-line, three-grid space, the possible positions being upper-middle, middle, middle-lower, and upper-middle-lower;
Step 3: find the rendering starting point in the binary sequence, render the binary sequence, and then perform clustering:
Step 3-1: from the position the English letter occupies in the four-line, three-grid space, determine the starting point of the first round of rendering:
when the letter is in the upper-middle of the four-line, three-grid space, start from the lowest point of the middle part and move back W_m + W_u from that position to obtain the first-round starting point;
when the letter is in the middle of the four-line, three-grid space, start from the topmost point of the middle part and move back W_u from that position to obtain the first-round starting point;
when the letter is in the middle-lower of the four-line, three-grid space, start from the topmost point of the middle part and move back W_u from that position to obtain the first-round starting point;
when the letter is in the upper-middle-lower of the four-line, three-grid space, start from the lowest point of the lower part and move back W_u + W_m + W_d from that position to obtain the first-round starting point;
Step 3-2: starting from the first-round starting point, render the binary sequence to obtain a new binary sequence, comprising step a and step b:
Step a: starting from the first-round starting point, scan upward and downward respectively; on encountering the first 1 that follows a 0, record the row number f_k of that 1; with f_k as the reference, cover with 1 all of the font space above f_k within a distance W_b of f_k, and likewise cover with 1 all of the font space below f_k within a distance W_u + W_m + W_d + W_b of f_k, where W_b is the vertical spacing between letters;
Step b: after the first round of rendering is complete, take the point next to the first-round starting point in the scanning direction as the new rendering starting point and repeat step a, traversing all points of the binary sequence as rendering starting points to generate the new binary sequence;
Step 4: cluster the new binary sequences of the images;
Step 5: within each class, based on the minimum accumulated edge distance principle, find the two binary matrices whose accumulated edge distance is smallest and stitch the corresponding shredded documents together horizontally, until all images in the class are stitched together; then, according to the minimum accumulated edge distance, stitch the horizontally matched documents together longitudinally.
2. The method for restoring shredder-crushed documents based on English text features according to claim 1, characterized in that the upper-middle and middle-lower positions in step 2 are distinguished according to the pixel distribution of the binary sequence:
when the pixel sum of the first third of the binary sequence is less than the pixel sum of the last third, the letter belongs to the upper-middle;
when the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle;
when the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower.
3. The method for restoring shredder-crushed documents based on English text features according to claim 2, characterized in that step 2 is carried out as follows: scan the binary sequence from top to bottom; on reading the first 1 that follows a 0, record the row number f_1 of that 1; continue scanning from row f_1 and, on reading the first 0 after row f_1, record the row number f_2 of that 0; the value f_2 - f_1 determines the position the English letter occupies in the four-line, three-grid space:
when f_2 - f_1 < W_m - 1: the part of the binary sequence scanned so far does not form a complete English letter; continue scanning the binary sequence downward;
when W_m - 1 ≤ f_2 - f_1 < W_m + 1: the part scanned so far forms a complete English letter, and the letter is in the middle of the four-line, three-grid space;
when W_m + 1 < f_2 - f_1 ≤ W_m + W_u + 2: the part scanned so far forms a complete English letter, and the letter is in the upper-middle or the middle-lower of the four-line, three-grid space;
when W_m + W_u + 2 < f_2 - f_1 ≤ W_m + W_u + W_d + 1: the part scanned so far forms a complete English letter, and the letter is in the upper-middle-lower of the four-line, three-grid space.
4. The method for restoring shredder-crushed documents based on English text features according to claim 3, characterized in that, in step 5, based on the minimum accumulated edge distance principle, the two binary matrices B' and B'' with the smallest accumulated edge distance are found by an expression in which m is the number of rows of the binary matrix and n is the number of columns of the binary matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410185991.5A CN103996180B (en) | 2014-05-05 | 2014-05-05 | Method for restoring shredder-crushed documents based on English text features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103996180A CN103996180A (en) | 2014-08-20 |
CN103996180B true CN103996180B (en) | 2016-09-07 |
Family
ID=51310337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410185991.5A Expired - Fee Related CN103996180B (en) | Method for restoring shredder-crushed documents based on English text features | 2014-05-05 | 2014-05-05 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103996180B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106952230B (en) * | 2017-03-19 | 2021-02-02 | 北京工业大学 | Transverse and longitudinal shredded slice restoration method based on clustering and ant colony algorithm |
CN106952232B (en) * | 2017-03-22 | 2019-01-25 | 海南职业技术学院 | A kind of picture and text fragment restoration methods based on ant group algorithm |
CN107229953B (en) * | 2017-06-06 | 2020-12-25 | 西南石油大学 | Broken document splicing method based on DFS and improved center clustering method |
CN107180412B (en) * | 2017-06-15 | 2020-10-16 | 北京工业大学 | Horizontal and vertical shredded paper sheet reconstruction method based on horizontal projection and K-means clustering |
CN108694414A (en) * | 2018-05-11 | 2018-10-23 | 哈尔滨工业大学深圳研究生院 | Digital evidence obtaining file fragmentation sorting technique based on digital picture conversion and deep learning |
CN108921793A (en) * | 2018-07-15 | 2018-11-30 | 江西理工大学 | A scrap of paper based on fuzzy C-means clustering splices restored method |
CN109584163B (en) * | 2018-12-17 | 2020-12-08 | 深圳市华星光电半导体显示技术有限公司 | Method for restoring original file of paper scrap |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103679671A (en) * | 2014-01-12 | 2014-03-26 | 王浩 | Transverse and vertical sliced shredded paper splicing and recovery algorithm of FFT (Fast Fourier Transform) integrated comprehensive evaluation method |
CN103679678A (en) * | 2013-12-18 | 2014-03-26 | 山东大学 | Semi-automatic splicing recovery method for character characteristic rectangular scraps of paper |
Also Published As
Publication number | Publication date |
---|---|
CN103996180A (en) | 2014-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103996180B (en) | Method for restoring shredder-crushed documents based on English text features | |
US8908961B2 (en) | System and methods for arabic text recognition based on effective arabic text feature extraction | |
CN106326854B (en) | A kind of format document paragraph recognition methods | |
CN103700081B (en) | A kind of shredder crushes the restoration methods of English document | |
CN103679678B (en) | A kind of semi-automatic splicing restored method of rectangle character features a scrap of paper | |
EP2395453A2 (en) | Method and system for preprocessing an image for optical character recognition | |
CN105654130A (en) | Recurrent neural network-based complex image character sequence recognition system | |
CN105469027A (en) | Horizontal and vertical line detection and removal for document images | |
CN105678292A (en) | Complex optical text sequence identification system based on convolution and recurrent neural network | |
CN104008384A (en) | Character identification method and character identification apparatus | |
CN104008401A (en) | Method and device for image character recognition | |
CN106682698A (en) | OCR identification method based on template matching | |
CN1916940A (en) | Template optimized character recognition method and system | |
Das et al. | Heuristic based script identification from multilingual text documents | |
RU2625533C1 (en) | Devices and methods, which build the hierarchially ordinary data structure, containing nonparameterized symbols for documents images conversion to electronic documents | |
CN1545067A (en) | A method for compressing digitalized archive file using computer | |
CN105677718B (en) | Character search method and device | |
CN108510442B (en) | Single-side paper scrap splicing and restoring method based on absolute value distance optimization | |
Ajward et al. | Converting printed Sinhala documents to formatted editable text | |
CN106778759A (en) | For the feature image automatic creation system of pictograph identification | |
JPH08320914A (en) | Table recognition method and device | |
CN116778497A (en) | Method and device for identifying hand well number, computer equipment and storage medium | |
CN113901913A (en) | Convolution network for ancient book document image binaryzation | |
Aparna et al. | A complete OCR system development of Tamil magazine documents | |
CN106991082B (en) | Grouping method for multi-page homogeneous document fragments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160907 |