CN103996180B

CN103996180B - Method for restoring shredder-shredded documents based on English letter features

Info

Publication number: CN103996180B
Application number: CN201410185991.5A
Authority: CN
Inventors: 冯钧; 陈焕霖; 杨艳林; 陈丽君; 唐志贤; 许潇; 朱忠华; 盛震宇
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2014-05-05
Filing date: 2014-05-05
Publication date: 2016-09-07
Anticipated expiration: 2034-05-05
Also published as: CN103996180A

Abstract

The invention discloses a method for restoring documents shredded by a paper shredder based on English letter features, and belongs to the technical field of image processing. The restoration method comprises four steps: image digitization, image preprocessing, image clustering, and image stitching. In image preprocessing, MATLAB is used to import each paper fragment, generate the grayscale matrix corresponding to each fragment, binarize the grayscale matrix, and generate a binary sequence. In image clustering, the binary sequence is rendered according to English letter features, and the original images are clustered according to the rendering result. In image stitching, the original images are stitched longitudinally according to the clustering result and the minimum cumulative edge distance principle, and the longitudinally stitched results are then stitched horizontally. The invention solves the problem of restoring English documents shredded by a paper shredder, filling a gap in the prior art, and the clustering step greatly improves stitching efficiency.

Description

Method for restoring shredder-shredded documents based on English letter features

Technical field

The present invention relates to a method for restoring shredder-shredded documents based on English letter features, and belongs to the field of document restoration within image processing.

Background technology

The stitching of shredded documents has important applications in fields such as the recovery of judicial evidence, the repair of historical documents, and the acquisition of military intelligence. Traditionally, stitching and restoration must be done by hand; accuracy is high but efficiency is very low, and when the number of fragments is huge, manual stitching can hardly be completed in a short time. With the development of computer technology, people have tried to develop automatic stitching techniques for paper fragments in order to improve restoration efficiency. A good method should require no manual intervention and should be able to stitch together fragments of the same shredding type. Such a method obtains the relevant information through scanning and image technology and then processes it by computer, thereby achieving fully automatic or semi-automatic stitching and restoration of the fragments.

Summary of the invention

The present invention addresses the low efficiency of English document stitching in existing shredded-paper stitching methods and proposes a method for restoring shredder-shredded documents based on English letter features.

To achieve the above object, the present invention adopts the following technical solution:

In the method for restoring shredder-shredded documents based on English letter features, the shredded document fragments are scanned, and each image is processed according to steps 1 to 3 as follows:

Step 1: establish the grayscale matrix A_k and binarize it:

$$
A_k(i,j) = \begin{cases} 0, & A_k(i,j) = 255 \\ 1, & A_k(i,j) \neq 255 \end{cases} \tag{1}
$$

where k is the number of the shredded fragment.

Binarizing the grayscale matrix A_k yields the binary matrix B_k, which has m rows and n columns.

Then compute the sum of the elements in each row of the binary matrix B_k, and arrange the per-row results in the vertical direction to obtain the binary sequence Q_k:

$$
Q_k(i) = \begin{cases} 1, & \sum_{j=1}^{n} B_k(i,j) > 0 \\ 0, & \sum_{j=1}^{n} B_k(i,j) = 0 \end{cases} \qquad i = 1, 2, \ldots, m \tag{2}
$$
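As a minimal sketch of formulas (1) and (2), the following plain-Python code binarizes a small grayscale matrix and projects it onto a per-row binary sequence. The function names `binarize` and `row_projection` and the toy 4x3 image are illustrative assumptions; the source itself performs this step in MATLAB.

```python
def binarize(gray):
    """Formula (1): white pixels (255) become 0, every other value becomes 1."""
    return [[0 if px == 255 else 1 for px in row] for row in gray]


def row_projection(binary):
    """Formula (2): Q(i) is 1 if row i contains any ink pixel, else 0."""
    return [1 if sum(row) > 0 else 0 for row in binary]


# Toy grayscale fragment (hypothetical data): rows 1 and 2 contain ink.
gray = [
    [255, 255, 255],
    [255,   0, 255],
    [128, 255, 255],
    [255, 255, 255],
]

B = binarize(gray)       # binary matrix B_k
Q = row_projection(B)    # binary sequence Q_k
```

Each fragment is thereby reduced to a single column of 0s and 1s that marks which pixel rows carry text.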

Step 2: using the upper width W_u, the middle width W_m, and the lower width W_d of the English letter typesetting space, determine the position that an English letter occupies in the four-line three-grid; the positions are upper-middle, middle, middle-lower, and upper-middle-lower:

The upper-middle and middle-lower positions are distinguished by the pixel distribution of the binary sequence Q_k:

When the pixel sum of the first third of the binary sequence Q_k is less than the pixel sum of the last third, the letter belongs to the upper-middle,

When the pixel sum of the first third of the binary sequence Q_k is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle,

When the pixel sum of the first third of the binary sequence Q_k is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower;
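The first-third versus last-third rule can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the input `row_sums` is assumed to hold the per-row ink-pixel counts over the letter's row span, the function name is hypothetical, and a tie between the two sums (which the text does not cover) is treated here as a ratio of 1.

```python
def split_upper_or_lower(row_sums):
    """Decide between 'upper-middle' and 'middle-lower' for one letter span.

    row_sums: per-row ink-pixel counts over the letter's rows (assumed input).
    """
    third = len(row_sums) // 3
    if third == 0:
        raise ValueError("need at least three rows")
    front = sum(row_sums[:third])    # pixel sum of the first third
    back = sum(row_sums[-third:])    # pixel sum of the last third

    if front < back:
        return "upper-middle"
    # front >= back: compare the ratio front/back with 3/2 without dividing,
    # so an empty last third (back == 0) falls through to 'middle-lower'.
    if 2 * front < 3 * back:
        return "upper-middle"
    return "middle-lower"
```

The cross-multiplied comparison `2 * front < 3 * back` is equivalent to `front / back < 3/2` whenever `back` is positive and avoids a division by zero.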

Scan the binary sequence Q_k from top to bottom. On reading the first 1 that follows a 0, record the row number f₁ of that 1; continue scanning from row f₁, and on reading the first 0 after row f₁, record the row number f₂ of that 0. The value f₂ - f₁ determines the position that the English letter occupies in the four-line three-grid:

When f₂ - f₁ < W_m - 1: the part of the binary sequence scanned so far does not constitute a complete English letter; continue scanning the binary sequence downward,

When W_m - 1 ≤ f₂ - f₁ < W_m + 1: the part of the binary sequence scanned so far constitutes a complete English letter, and the letter is in the middle of the four-line three-grid,

When W_m + 1 < f₂ - f₁ ≤ W_m + W_u + 2: the part of the binary sequence scanned so far constitutes a complete English letter, and the letter is in the upper-middle or the middle-lower of the four-line three-grid,

When W_m + W_u + 2 < f₂ - f₁ ≤ W_m + W_u + W_d + 1: the part of the binary sequence scanned so far constitutes a complete English letter, and the letter is in the upper-middle-lower of the four-line three-grid;
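The scan for f₁ and f₂ and the four threshold cases can be sketched like this. The function names are hypothetical, rows are 0-indexed here (an assumption; only the difference f₂ - f₁ matters), and values that fall outside all four stated intervals, such as f₂ - f₁ = W_m + 1, are returned as `None` rather than guessed.

```python
def first_run(q):
    """Return (f1, f2): row of the first 1 that follows a 0, and row of the
    first 0 after it. Returns None if no such run exists."""
    f1 = None
    for i in range(1, len(q)):
        if q[i - 1] == 0 and q[i] == 1:
            f1 = i
            break
    if f1 is None:
        return None
    for i in range(f1 + 1, len(q)):
        if q[i] == 0:
            return f1, i
    return None


def classify_span(d, w_u, w_m, w_d):
    """Map d = f2 - f1 onto the four cases of step 2."""
    if d < w_m - 1:
        return "incomplete"
    if d < w_m + 1:                       # w_m - 1 <= d < w_m + 1
        return "middle"
    if w_m + 1 < d <= w_m + w_u + 2:
        return "upper-middle or middle-lower"
    if w_m + w_u + 2 < d <= w_m + w_u + w_d + 1:
        return "upper-middle-lower"
    return None                           # not covered by the stated intervals


# Toy sequence: 3 blank rows, a 25-row run of text, 4 blank rows.
q = [0] * 3 + [1] * 25 + [0] * 4
f1, f2 = first_run(q)
label = classify_span(f2 - f1, w_u=13, w_m=25, w_d=13)
```

When the result is "incomplete", the caller would continue scanning downward from f₂, as the first case prescribes.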

Step 3: find the rendering starting point in the binary sequence Q_k, render the binary sequence, and then perform clustering:

Step 3-1: determine the starting point of the first round of rendering from the position that the English letter occupies in the four-line three-grid:

When the English letter is in the upper-middle of the four-line three-grid, start from the lowest point of the middle and roll back W_m + W_u from this position to obtain the starting point of the first round of rendering,

When the English letter is in the middle of the four-line three-grid, start from the topmost point of the middle and roll back W_u from this position to obtain the starting point of the first round of rendering,

When the English letter is in the middle-lower of the four-line three-grid, start from the topmost point of the middle and roll back W_u from this position to obtain the starting point of the first round of rendering,

When the English letter is in the upper-middle-lower of the four-line three-grid, start from the lowest point of the lower part and roll back W_u + W_m + W_d from this position to obtain the starting point of the first round of rendering;
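The four rollback rules of step 3-1 amount to a lookup table. In the sketch below, "rolling back" is assumed to mean moving toward smaller row indices (upward in the image), and `anchor_row` is assumed to be the reference row each rule names (lowest point of the middle, topmost point of the middle, or lowest point of the lower part); both the function name and these readings are illustrative.

```python
def render_start(position, anchor_row, w_u, w_m, w_d):
    """Return the first-round rendering start row for a classified letter."""
    rollback = {
        "upper-middle": w_m + w_u,              # from lowest point of the middle
        "middle": w_u,                          # from topmost point of the middle
        "middle-lower": w_u,                    # from topmost point of the middle
        "upper-middle-lower": w_u + w_m + w_d,  # from lowest point of the lower part
    }
    return anchor_row - rollback[position]


start_a = render_start("upper-middle", 100, w_u=13, w_m=25, w_d=13)
start_b = render_start("middle", 100, w_u=13, w_m=25, w_d=13)
start_c = render_start("upper-middle-lower", 100, w_u=13, w_m=25, w_d=13)
```

Under this reading, each rule lands on the top line of the four-line grid for the text line in question, which would explain why the rollback distance depends on where the letter sits.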

Step 3-2: starting from the first-round rendering starting point, render the binary sequence to obtain a new binary sequence, specifically comprising step a and step b:

Step a: starting from the first-round rendering starting point, scan upward and downward respectively. On encountering the first 1 that follows a 0, record the row number f_k of that 1. Taking f_k as the reference, cover with 1s the entire font space above f_k within a distance of W_b, and likewise cover with 1s the entire font space below f_k within a distance of W_u + W_m + W_d + W_b, where W_b is the vertical spacing between letters,

Step b: after the first round of rendering is complete, take the point next to the first-round rendering starting point in the scanning direction as the new rendering starting point and repeat step a; after each round is complete, take the next point as the new rendering starting point,

For upward scanning, the point above the current round's rendering starting point is taken as the new rendering starting point; for downward scanning, the point below the current round's rendering starting point is taken as the new rendering starting point. Step a is repeated until the top and the bottom of the binary sequence Q_k are reached, which generates the new binary sequence Q′_k;

Step 4: cluster according to the values of the elements in the new binary sequence Q′_k of each image: binary sequences whose values are exactly identical are grouped into one class, and binary sequences with different values are grouped into different classes. "Clustering" here simply means sorting by the element values of the binary sequences. For example, if four binary sequences result, namely [1,0,1], [1,1,1], [1,1,1], and [1,0,1], they can be grouped by element values into two classes: class one: [1,0,1], [1,0,1]; class two: [1,1,1], [1,1,1];
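Because the clustering in step 4 is exact-match grouping, it can be sketched with a dictionary keyed by the sequence itself. The function name `cluster_identical` is hypothetical; the input reproduces the four-sequence example from the text.

```python
def cluster_identical(sequences):
    """Group indices of sequences whose elements are exactly identical."""
    groups = {}
    for idx, seq in enumerate(sequences):
        # Tuples are hashable, so the whole sequence can serve as the key.
        groups.setdefault(tuple(seq), []).append(idx)
    return list(groups.values())


rendered = [[1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]]
classes = cluster_identical(rendered)
```

Here `classes` holds the fragment indices of each class: fragments 0 and 3 share [1,0,1], and fragments 1 and 2 share [1,1,1].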

Step 5: within each class, apply the minimum cumulative edge distance principle:

Find the two binary matrices B′ and B″ with the minimum cumulative edge distance, and laterally match and stitch the shredded fragments corresponding to these two binary matrices, repeating until all images in the class are stitched together; then longitudinally match and stitch the laterally matched fragments, again according to the minimum cumulative edge distance.
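The lateral stitching loop of step 5 can be sketched as a greedy merge. The exact expression for the cumulative edge distance is not reproduced in this text, so the sketch assumes one plausible definition: the sum over rows of the absolute difference between the rightmost column of the left fragment and the leftmost column of the right fragment. All function names and the toy fragments are hypothetical.

```python
def edge_distance(left, right):
    """Assumed metric: row-wise |right edge of `left` - left edge of `right`|, summed."""
    return sum(abs(a[-1] - b[0]) for a, b in zip(left, right))


def best_pair(mats):
    """Return (distance, i, j) for the ordered pair with the smallest distance."""
    best = None
    for i, a in enumerate(mats):
        for j, b in enumerate(mats):
            if i == j:
                continue
            d = edge_distance(a, b)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best


def stitch_row(mats):
    """Greedily merge the closest pair until one matrix remains."""
    mats = list(mats)
    while len(mats) > 1:
        _, i, j = best_pair(mats)
        merged = [a + b for a, b in zip(mats[i], mats[j])]  # place j to the right of i
        mats = [m for k, m in enumerate(mats) if k not in (i, j)] + [merged]
    return mats[0]


# Toy fragments cut from a 3x6 binary image (hypothetical data).
A = [[1, 1], [0, 0], [1, 0]]
B2 = [[1, 0], [0, 1], [0, 0]]
C = [[0, 0], [1, 0], [0, 1]]
restored = stitch_row([C, A, B2])
```

Restricting the pair search to one class at a time is what reduces the number of edge-distance evaluations, since fragments from different text-line layouts are never compared.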

By adopting the above technical solution, the present invention has the following beneficial effects: it analyzes the typesetting rules of English fonts and fills the gap in stitching fragments of English documents; for given fragments of an English document, clustering by English typesetting features greatly reduces the computation of minimum edge distances during lateral and longitudinal matching, which improves stitching efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of the method for restoring shredder-shredded documents based on English letter features.

Fig. 2 shows the electronic images obtained by scanning the 38 paper fragments.

Fig. 3 illustrates the binarization of the grayscale matrix and its conversion to a binary sequence.

Fig. 4 illustrates the typesetting features of English fonts.

Fig. 5 is a schematic diagram of the rendering process when an English letter occupies the middle of the four-line grid.

Fig. 6 shows the clustering result for the 38 paper fragments.

Fig. 7 shows the horizontal stitching result for the 38 paper fragments.

Fig. 8 shows the final stitching result for the 38 paper fragments.

Detailed description of the invention

The technical solution of the invention is described in detail below with reference to the accompanying drawings:

The method for restoring shredder-shredded documents based on English letter features proposed by the present invention is implemented according to the flowchart shown in Fig. 1. A scanner is used to scan 38 paper document fragments and output the electronic image of each fragment as shown in Fig. 2, and MATLAB is used to import each fragment.

Following the conversion process shown in Fig. 3, the grayscale matrix A_k (k = 1, 2, …, 38) corresponding to each fragment is generated; the grayscale matrix A_k (k = 1, 2, …, 38) is binarized using formula (1) to obtain the binary matrix B_k (k = 1, 2, …, 38), and the binary sequence Q_k (k = 1, 2, …, 38) is then generated using formula (2);

As shown in Fig. 4 (the hatched areas are the space occupied by the font), the upper width W_u = 13, the middle width W_m = 25, and the lower width W_d = 13 of the English letter typesetting space are calculated, together with the vertical spacing between letters W_b = 13 (i.e. the width of the blank space). The binary sequence Q_k (k = 1, 2, …, 38) is scanned from top to bottom. On reading the first 1 that follows a 0, its row number f₁ is recorded; scanning continues downward until the first 0 is found, and its row number f₂ is recorded. The value f₂ - f₁ then gives a preliminary determination of the position that the English letter occupies in the four-line three-grid:

① f₂ - f₁ < 24: not a complete letter; continue scanning downward;

② 24 ≤ f₂ - f₁ < 26: middle;

③ 26 < f₂ - f₁ ≤ 41: upper-middle or middle-lower;

④ 41 < f₂ - f₁ ≤ 52: upper-middle-lower.

The specific division between upper-middle and middle-lower is determined by the pixel distribution of the binary sequence Q_k (k = 1, 2, …, 38): if the pixel sum of the first third is less than the pixel sum of the last third, the letter belongs to the upper-middle; if the pixel sum of the first third is greater than the pixel sum of the last third, a further check is needed: if the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle, and if the ratio is greater than or equal to 3/2, the letter belongs to the middle-lower;

After the position that the English letter occupies in the four-line three-grid has been determined, the rendering starting point is sought in Q_k (k = 1, 2, …, 38) according to the following rules:

1. Upper-middle: start from the lowest point of the middle and roll back 38 points from this position to obtain the starting point;

2. Middle: start from the topmost point of the middle and roll back 13 points from this position to obtain the starting point;

3. Middle-lower: start from the topmost point of the middle and roll back 13 points from this position to obtain the starting point;

4. Upper-middle-lower: start from the lowest point of the lower part and roll back 51 points from this position to obtain the starting point.

Once the rendering starting point is found, scan upward and downward respectively. On encountering the first 1 that follows a 0, record its row number f_k in the sequence. Taking f_k as the reference, cover with 1s the entire font space above f_k within a distance of 13, and likewise cover with 1s the entire font space below f_k within a distance of 64. When a round of "rendering" is complete, take the next point as the new rendering starting point (if the direction is upward, take the previous point as the new rendering starting point) and repeat the above process until the top and the bottom of the sequence are reached, which generates the new binary sequence Q′_k (k = 1, 2, …, 38). Fig. 5 is a schematic diagram of the rendering process when an English letter occupies the middle of the four-line grid.
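As a cross-check, the numeric constants used for the rollback distances and the rendering cover distances in this embodiment appear to follow directly from the widths W_u = 13, W_m = 25, W_d = 13, and W_b = 13 when substituted into the general expressions of steps 3-1 and 3-2:

```latex
\begin{aligned}
W_m + W_u &= 25 + 13 = 38 && \text{(rollback, upper-middle)} \\
W_u &= 13 && \text{(rollback, middle and middle-lower)} \\
W_u + W_m + W_d &= 13 + 25 + 13 = 51 && \text{(rollback, upper-middle-lower)} \\
W_b &= 13 && \text{(cover distance above } f_k\text{)} \\
W_u + W_m + W_d + W_b &= 13 + 25 + 13 + 13 = 64 && \text{(cover distance below } f_k\text{)}
\end{aligned}
```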

According to the differences among the Q′_k (k = 1, 2, …, 38) of the images, the 38 paper fragments are divided into two classes as shown in Fig. 6. Within each of the two classes, following the minimum cumulative edge distance principle, the two binary matrices with the minimum cumulative edge distance are found among all binary matrices in the class and are laterally matched and stitched, until all images in the class are stitched together; the horizontal stitching result is shown in Fig. 7;

According to the minimum cumulative edge distance principle, the two classes are then longitudinally matched and stitched; the longitudinal stitching result is shown in Fig. 8.

Claims

1. A method for restoring shredder-shredded documents based on English letter features, characterized in that: the shredded document fragments are scanned, and each image is processed according to steps 1 to 3 as follows:

Step 1: establish the grayscale matrix, binarize the grayscale matrix to obtain the binary matrix, then compute the sum of the elements in each row of the binary matrix, and arrange the per-row results in the vertical direction to obtain the binary sequence;

Step 3: find the rendering starting point in the binary sequence, render the binary sequence, and then perform clustering:

Step b: after the first round of rendering is complete, take the point next to the first-round rendering starting point in the scanning direction as the new rendering starting point and repeat step a, traversing all points of the binary sequence as rendering starting points to generate the new binary sequence;

Step 4: cluster the new binary sequences of the images;

Step 5: within each class, find the two binary matrices with the minimum cumulative edge distance according to the minimum cumulative edge distance principle, and laterally match and stitch the shredded fragments corresponding to these two binary matrices, until all images in the class are stitched together; then longitudinally match and stitch the laterally matched fragments according to the minimum cumulative edge distance.

2. The method for restoring shredder-shredded documents based on English letter features according to claim 1, characterized in that the upper-middle and middle-lower positions described in step 2 are distinguished by the pixel distribution of the binary sequence:

When the pixel sum of the first third of the binary sequence is less than the pixel sum of the last third, the letter belongs to the upper-middle,

When the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is less than 3/2, the letter belongs to the upper-middle,

When the pixel sum of the first third of the binary sequence is greater than the pixel sum of the last third, and the ratio of the first-third sum to the last-third sum is greater than or equal to 3/2, the letter belongs to the middle-lower.

3. The method for restoring shredder-shredded documents based on English letter features according to claim 2, characterized in that step 2 is specifically: scan the binary sequence from top to bottom; on reading the first 1 that follows a 0, record the row number f₁ of that 1; continue scanning from row f₁, and on reading the first 0 after row f₁, record the row number f₂ of that 0; the value f₂ - f₁ determines the position that the English letter occupies in the four-line three-grid:

When W_m + W_u + 2 < f₂ - f₁ ≤ W_m + W_u + W_d + 1: the part of the binary sequence scanned so far constitutes a complete English letter, and the letter is in the upper-middle-lower of the four-line three-grid.

4. The method for restoring shredder-shredded documents based on English letter features according to claim 3, characterized in that step 5 uses the expression of the minimum cumulative edge distance principle to find the two binary matrices B′ and B″ with the minimum cumulative edge distance, where m is the number of rows of the binary matrix and n is the number of columns of the binary matrix.