CN114821611A - A method for protecting private data in archive table images - Google Patents
- Publication number
- CN114821611A (application CN202210558787.8A)
- Authority
- CN
- China
- Prior art keywords
- line
- image
- row
- lines
- horizontal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/16—Image preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/24—Character recognition characterised by the processing or recognition method
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/42—Document-oriented image-based pattern recognition based on the type of document
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Description
Technical Field
The invention relates to the fields of archive utilization and image processing, and in particular to an algorithm for recognizing and processing images of archive tables.
Background
To respect the original-record nature of archives and the authenticity of their content, archive management departments scan and store the original archives. When issuing archive certificates and other utilization materials, they mask or blur the information in the image that belongs to other people and is unrelated to the archive user.
The current masking method is manual: an operator opens the original archive image in image processing software such as Paint or Photoshop, manually selects the parts that need privacy processing, processes the image, and then saves, prints, signs and stamps it for delivery to the archive user. The manual selection step is cumbersome.
Table recognition in document images has long existed in computer vision. Traditional techniques include image skew correction, image binarization, horizontal and vertical projection to find the horizontal and vertical table lines, and Hough transform voting to obtain line equations followed by line segment tracking. These methods work well when the image background is clean and the table is simple and regular. However, many archive tables have complex headers, are old and darkened, have bent lines, or are covered by handwriting and seals, so it is difficult to obtain the table line-frame information with the methods above.
Another branch of computer vision is machine learning based on deep neural networks. It uses multiple convolutional layers, pooling layers, activation functions, loss functions and similar components to build a learning mechanism that learns by itself from a large number of samples, from low-level image features such as edges to high-level features such as structure. Its drawbacks are that a large number of training samples must be prepared and labeled, and that training and inference require considerable computing power. Collecting and labeling a large number of archive table sample images is a very heavy workload. Even if enough manpower is organized to complete it, the resulting inference model still runs with a noticeable delay on the ordinary office computers of a typical archive utilization office, which makes it inconvenient to use.
Summary of the Invention
The object of the invention is to provide a method for protecting private data in archive table images. It computes quickly and resists interference well: it accurately recognizes old archives and tables with color seals, original correction marks, or very faint table lines. It is also broadly compatible: tables with different aspect ratios, different numbers of rows and columns, and different header styles can all be recognized and processed effectively.
The technical solution of the invention is a method for protecting private data in archive table images, comprising the following steps:
1) First, use the RGB color differences of the color image to fade content such as color seals in the table image;
2) Scale the image to a specific size, design a horizontal edge detection operator that works well at this image size, and obtain the edge strength map of the image by convolution;
3) Use the Hough transform to obtain candidate horizontal and vertical line equations;
4) Track each line according to its line equation, using a circular queue to remember the coordinates of no more than 15 line pixels and judging the bending of the line in real time; excessive bending is judged to be interference, and the coordinates stored in the queue are used to backtrack and search for other possible tracking paths;
5) Sort the horizontal lines that run through the table by Y coordinate, and identify the header from the fact that header row heights differ markedly from those of ordinary rows;
6) Through a graphical interactive interface, obtain from the archive user the coordinates of the row to be used and, together with the table information recognized in the steps above, automatically blur the private parts of the archive image while keeping the part that is to be used.
The detection operator in step 2) is constructed as an edge detection operator matrix H with n rows and m columns. The middle rows of the matrix are negative and the rows at both ends are positive; all elements in a row have the same value; the positive elements of the matrix sum to 1 and the negative elements sum to -1; n is an integer not greater than 7 and m is an integer not greater than 5.
Obtaining the candidate horizontal and vertical line equations with the Hough transform in step 3) comprises the following two sub-steps:
Apply the Hough transform (voting) to the image, taking the parameter space to be two-dimensional: the row coordinate represents the line intercept, the column coordinate represents the line inclination angle, and the height equals the height of the document. The voting threshold is an edge strength of +5: every point that reaches it gets exactly 1 vote, so that weak and strong edges of the same length receive the same number of votes, which makes table horizontal lines of the same length easy to pick out in the voting result. Edge strengths above 0 but below +5 are treated as line artifacts produced by the paper and the scanner;
Obtain the equation parameters of the candidate lines: in the Hough transform parameter space map, find the maximum value, which is taken as the table line width maxValH. Traverse the parameter space map and examine every local maximum that reaches 70% of maxValH, so that broken or partly missing table lines are detected as far as possible.
The line tracking algorithm in step 4), which can follow moderately bent and broken lines and lines with heavy stroke interference, tracks each line under the guidance of its line equation. From the parametric equation of the line, obtain its inclination angle and intercept, the intercept being the Y coordinate of its intersection with the line X = 0. In the edge strength map, start from the point X = 0, Y = intercept and walk to the right along the inclination angle. Take t = 20 as the threshold, and take as the tracking direction whichever of the three points above, level with, and below the current point has the maximum value. If the strength of that maximum point is greater than t, it is a line point; otherwise it is a non-line point. Use a circular queue to record the coordinates of the points tracked so far. If the line length is greater than 7, compute the bend angle of the line formed by the most recent 7 points; if the angle is greater than 3°, treat it as a tracking error and roll back. Record all tracked lines in a set S.
The operation in step 5) of sorting the horizontal lines that run through the table by Y coordinate and identifying the header is as follows. Count votes over the x coordinates of the left and right endpoints of all lines in S; take the left-endpoint x with the most votes as the left table boundary L and the right-endpoint x with the most votes as the right table boundary R. Then traverse all lines in S and keep only those whose left and right endpoints lie near L and R respectively; these are regarded as horizontal lines that run through the table. Sort these lines by y coordinate and treat them as adjacent table horizontal lines. In the upper half of the table, judge from top to bottom: if the heights of two adjacent rows differ by more than 20%, the upper row is regarded as part of the header.
Automatically blurring the private parts of the archive image while keeping the part to be used, as described in step 6), means obtaining the row selected by the archive user, keeping that row and the header sharp, and applying a conventional image blur to the other rows, which gives the final usable image for printing.
The beneficial effects of the invention are: 1. Fast computation: recognition and privacy blurring complete within 1 second on an ordinary office computer. 2. Strong resistance to interference: old archives and tables with color seals, original correction marks, or very faint table lines are all recognized accurately. 3. Good compatibility: tables with different aspect ratios, different numbers of rows and columns, and different header styles can all be recognized and processed effectively.
Description of Drawings
Figure 1 is a color scan with various interference features (names have been masked);
Figure 2 compares ordinary color-to-grayscale conversion with the grayscale method used in this patent;
Figure 3 is the horizontal edge map;
Figure 4 shows the Hough transform parameter space (left) and the horizontal tracking result (right);
Figure 5 shows the retained horizontal lines that run through the table;
Figure 6 is a schematic diagram of the line tracking algorithm: tracking runs from left to right; yellow marks no line, green marks correct tracking, and red marks interfering lines;
Figure 7 is the privacy-protected image generated automatically after the user selects the row to be used in the graphical interactive interface.
Detailed Description
The invention is further described below with reference to an embodiment, which is not to be taken as a basis for limiting the invention.
Embodiment 1: A method for protecting private data in archive table images
Step 1: Size normalization and fading of color seals and signatures
Let the width and height of the original image be W and H, and let the scaling factor be zoom = min(1024/W, 1024/H). Scale the original image by the factor zoom.
In the scaled image, line widths fall within 0.5 to 2 pixels, which makes subsequent detection easier. Split the color image into its R, G and B channels. Taking the element-wise maximum of the image matrices gives Gray = max(R, G, B), the color-faded image, which greatly reduces interference from color seals, red signatures and the like, as the comparison in Figure 2 shows.
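The two formulas of Step 1 can be illustrated with a minimal sketch. It is not the patent's implementation: the function names `zoom_factor` and `fade_color` are hypothetical, the image is assumed to be a list of rows of 8-bit `(r, g, b)` tuples, and the resampling itself is left out.

```python
def zoom_factor(width, height, target=1024):
    """Scaling factor zoom = min(target / W, target / H) from Step 1."""
    return min(target / width, target / height)


def fade_color(rgb_rows):
    """Gray = max(R, G, B) per pixel.

    rgb_rows is assumed to be a list of rows of (r, g, b) tuples.
    """
    return [[max(pixel) for pixel in row] for row in rgb_rows]


# A red seal pixel, a black table-line pixel, and white paper.
row = [(220, 30, 40), (20, 20, 20), (255, 255, 255)]
gray = fade_color([row])
zoom = zoom_factor(2048, 3072)
```

Because a red seal has a high R value, taking the channel maximum pushes it toward the paper brightness, while black ink, which is low in all three channels, stays dark.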
Step 2: Construct a horizontal edge detection operator and perform horizontal edge detection on the image
Construct an edge detection operator matrix H with n rows and m columns; the recommended values of n and m are 5 and 5. The middle rows of the matrix are negative and the rows at both ends are positive; all elements in a row have the same value; the positive elements of the matrix sum to 1 and the negative elements sum to -1. The recommended value of H is:
H = [+0.100, +0.100, +0.100, +0.100, +0.100;
     -0.025, -0.025, -0.025, -0.025, -0.025;
     -0.150, -0.150, -0.150, -0.150, -0.150;
     -0.025, -0.025, -0.025, -0.025, -0.025;
     +0.100, +0.100, +0.100, +0.100, +0.100]
Compute the 2-D convolution of Gray and H to obtain the horizontal edge strength map E = conv2d(Gray, H); the result is shown in Figure 3.
The conv2d here is the same computation as the 2-D convolution used in convolutional neural networks.
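As a rough illustration of Step 2, the sketch below builds the recommended H and applies it with a plain sliding-window sum. The helper name `conv2d_valid` and the "valid" border handling (no padding) are assumptions; the source does not say how image borders are treated. H is symmetric under both flips, so correlation and convolution give the same result here.

```python
H = [
    [+0.100] * 5,
    [-0.025] * 5,
    [-0.150] * 5,
    [-0.025] * 5,
    [+0.100] * 5,
]


def conv2d_valid(gray, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(gray), len(gray[0])
    out = []
    for y in range(rows - kh + 1):
        out_row = []
        for x in range(cols - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * gray[y + j][x + i]
            out_row.append(acc)
        out.append(out_row)
    return out


# 9x9 white page (255) with a one-pixel dark horizontal line on row 4.
gray = [[0 if y == 4 else 255 for _ in range(9)] for y in range(9)]
E = conv2d_valid(gray, H)

# Output row 2 is the window centred on the line: approximately 191.25.
peak = E[2][0]
# A uniform area gives approximately 0 because the weights sum to zero.
flat = conv2d_valid([[255] * 5 for _ in range(5)], H)[0][0]
```

The zero-sum weights are what make the response independent of paper brightness: only a dark band flanked by lighter rows produces a strong positive value.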
Step 3: Extract the table row lines
Private data in archives is generally organized by row, so the tracking of horizontal lines is used as the example here.
Voting. Apply the Hough transform (voting) to the image, taking the parameter space to be two-dimensional: the row coordinate represents the line intercept and the column coordinate represents the line inclination angle (angular resolution 0.1°, angle range ±3°), giving 61 columns in total. The angle range can be enlarged if needed. The height of the parameter space equals the height of the document. The voting threshold is an edge strength of +5: every point that reaches it gets exactly 1 vote, so that weak and strong edges of the same length receive the same number of votes, which makes table horizontal lines of the same length easy to pick out in the voting result. Edge strengths above 0 but below +5 are treated as line artifacts produced by the paper and the scanner.
Obtaining the equation parameters of the candidate lines. In the Hough transform parameter space map (the black part on the left of Figure 4), find the maximum value, which is taken as the table line width maxValH. Traverse the parameter space map and examine every local maximum that reaches 70% of maxValH, so that broken or partly missing table lines are detected as far as possible.
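The voting and peak selection can be sketched as follows. This is an illustration under stated assumptions, not the patent's code: the line model y = b + tan(angle) * x with y pointing down, the function names, and the greedy suppression of nearby intercepts (`radius`) are all assumptions, since the source does not define how a "local maximum" is delimited.

```python
import math

# 61 angle columns: -3.0 to +3.0 degrees in 0.1 degree steps.
ANGLES = [(k - 30) * 0.1 for k in range(61)]


def hough_vote(edge, threshold=5.0):
    """Accumulator with one row per intercept and one column per angle.

    Every pixel whose edge strength reaches the threshold casts exactly
    one vote per angle, regardless of how strong the edge is.
    """
    height, width = len(edge), len(edge[0])
    acc = [[0] * len(ANGLES) for _ in range(height)]
    slopes = [math.tan(math.radians(a)) for a in ANGLES]
    for y in range(height):
        for x in range(width):
            if edge[y][x] < threshold:
                continue
            for k, slope in enumerate(slopes):
                b = round(y - slope * x)  # intercept at X = 0
                if 0 <= b < height:
                    acc[b][k] += 1
    return acc


def candidate_lines(acc, ratio=0.7, radius=2):
    """Return maxValH and the peaks that reach ratio * maxValH.

    Peaks are accepted greedily from the strongest down; a cell is skipped
    if an accepted peak lies within `radius` intercept rows (assumption).
    """
    max_val = max(max(row) for row in acc)
    cells = sorted(
        ((votes, b, k)
         for b, row in enumerate(acc)
         for k, votes in enumerate(row)
         if votes > 0 and votes >= ratio * max_val),
        key=lambda c: (-c[0], abs(c[2] - 30), c[1]),
    )
    kept = []
    for votes, b, k in cells:
        if all(abs(b - kept_b) > radius for _, kept_b, _ in kept):
            kept.append((votes, b, k))
    return max_val, sorted((b, ANGLES[k], votes) for votes, b, k in kept)


# A strong line (strength 50), a faint line (6) and scanner noise (3).
edge = [[0.0] * 100 for _ in range(40)]
for x in range(100):
    edge[10][x] = 50.0
    edge[25][x] = 6.0
    edge[30][x] = 3.0

max_val_h, lines = candidate_lines(hough_vote(edge))
```

In this toy input the faint line receives as many votes as the strong one, which is the point of giving each qualifying pixel a single vote, and the noise row below the threshold receives none.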
Step 4: Track the lines under the guidance of the line equations (see Figure 6). From the parametric equation of the line, obtain its inclination angle and intercept (the Y coordinate of its intersection with the line X = 0). In the edge strength map, start from the point X = 0, Y = intercept and walk to the right along the inclination angle. Take t = 20 as the threshold, and take as the tracking direction whichever of the three points above, level with, and below the current point has the maximum value. If the strength of that maximum point is greater than t, it is a line point; otherwise it is a non-line point.
Use a circular queue (a data structure) to record the coordinates of the points tracked so far. If the line length is greater than 7, compute the bend angle of the line formed by the most recent 7 points. If the angle is greater than 3°, treat it as a tracking error and roll back. Record all tracked lines in a set S.
A circular queue is used because the bending of the tracked line is examined only over the 7 most recently tracked points, so the storage used for the coordinates of older points can be reused.
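The tracking loop might look like the sketch below, which uses `collections.deque` with `maxlen` as the circular queue. Several parts are assumptions because the source does not spell them out: the bend is measured as the deviation of the chord through the last 7 points from the Hough angle, the rollback simply drops the newest point and retries the straight continuation, and the queue length of 15 is taken from the summary's step 4). On integer pixel coordinates this bend measure is strict (a one-pixel step inside a 7-point window already deviates by about 9.5°), so the sketch only shows the mechanism.

```python
import math
from collections import deque


def chord_angle(points):
    """Direction in degrees of the chord from the first to the last point."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def track_line(edge, intercept, angle_deg=0.0, t=20.0,
               window=7, max_bend_deg=3.0, memory=15):
    """Walk right from (0, intercept); return tracked and rejected points."""
    height, width = len(edge), len(edge[0])
    recent = deque(maxlen=memory)  # circular queue: oldest entries are overwritten
    tracked, rejected = [], []
    y = intercept
    for x in range(width):
        # Middle row first, so ties keep the current height.
        rows = [r for r in (y, y - 1, y + 1) if 0 <= r < height]
        best = max(rows, key=lambda r: edge[r][x])
        if edge[best][x] <= t:
            continue  # non-line point (a gap): keep walking at the same height
        recent.append((x, best))
        if len(recent) > window:
            last = list(recent)[-window:]
            if abs(chord_angle(last) - angle_deg) > max_bend_deg:
                # Tracking error: roll back the newest point and retry the
                # straight continuation remembered in the queue (assumption).
                rejected.append(recent.pop())
                best = recent[-1][1]
                if edge[best][x] <= t:
                    continue
                recent.append((x, best))
        tracked.append((x, best))
        y = best
    return tracked, rejected


# A table line on row 10 crossed by a stronger diagonal handwriting stroke.
edge = [[0.0] * 40 for _ in range(21)]
for x in range(40):
    edge[10][x] = 30.0
for i in range(9):
    edge[10 - i][12 + i] = 80.0

tracked, rejected = track_line(edge, intercept=10)
```

Without the bend check, the walker would follow the stronger stroke upward from column 13; with it, the first off-line point is rejected and tracking stays on row 10.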
Step 5: Find the horizontal through-lines of the table and the header
Count votes over the x coordinates of the left and right endpoints of all lines in S; take the left-endpoint x with the most votes as the left table boundary L and the right-endpoint x with the most votes as the right table boundary R.
Then traverse all lines in S and keep only those whose left and right endpoints lie near L and R respectively; these are regarded as horizontal lines that run through the table. The result is shown in Figure 5.
Sort these horizontal lines by y coordinate and treat them as adjacent table horizontal lines. In the upper half of the table, judge from top to bottom: if the heights of two adjacent rows differ by more than 20%, the upper row is regarded as part of the header.
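Step 5 can be sketched with `collections.Counter` for the endpoint voting. The representation of a line as `(x_left, x_right, y)`, the tolerance `tol` for "near", measuring the 20% difference relative to the smaller of the two heights, and treating every row above the last flagged row as header are all assumptions made for this illustration.

```python
from collections import Counter


def through_lines(segments, tol=5):
    """Vote for the table borders and keep lines that span both of them.

    segments: list of (x_left, x_right, y) tuples.
    """
    left = Counter(xl for xl, _, _ in segments).most_common(1)[0][0]
    right = Counter(xr for _, xr, _ in segments).most_common(1)[0][0]
    kept = [s for s in segments
            if abs(s[0] - left) <= tol and abs(s[1] - right) <= tol]
    return left, right, sorted(kept, key=lambda s: s[2])


def header_row_count(ys, ratio=0.2):
    """Number of leading rows regarded as header.

    ys: sorted y coordinates of the through-lines. Only rows in the upper
    half of the table are examined, from top to bottom.
    """
    heights = [b - a for a, b in zip(ys, ys[1:])]
    middle = (ys[0] + ys[-1]) / 2
    count = 0
    for i in range(len(heights) - 1):
        if ys[i + 1] > middle:
            break
        upper, lower = heights[i], heights[i + 1]
        if abs(upper - lower) > ratio * min(upper, lower):
            count = i + 1
    return count


S = [
    (50, 950, 100), (50, 950, 180), (50, 950, 220), (52, 949, 260),
    (50, 950, 300), (50, 950, 340), (50, 950, 380),
    (300, 600, 150),  # short line inside the header
    (50, 400, 240),   # partial line, does not reach the right border
]
L, R, kept = through_lines(S)
header_rows = header_row_count([y for _, _, y in kept])
```

Here the first row is 80 pixels high and the following rows are 40, so only the first row is classified as header.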
Step 6: Apply privacy protection
From the software's interactive interface, obtain the row selected by the archive user. Keep that row and the header sharp and apply a conventional image blur to the other rows, which gives the final usable image for printing. The result is shown in Figure 7.
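A minimal sketch of Step 6 on a grayscale image stored as a list of rows follows. The box blur, its `radius`, and the function names are assumptions: the source only says "conventional image blur". The averaging window is clipped to the band being blurred so that sharp rows are left untouched and do not bleed into blurred ones.

```python
def blur_band(gray, top, bottom, radius=3):
    """Return a copy of gray with rows top..bottom-1 box-blurred."""
    width = len(gray[0])
    out = [row[:] for row in gray]
    for y in range(top, bottom):
        for x in range(width):
            window = [
                gray[yy][xx]
                for yy in range(max(top, y - radius), min(bottom, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = sum(window) // len(window)
    return out


def protect_rows(gray, ys, header_rows, selected_row, radius=3):
    """Blur every table row except the header rows and the selected row.

    ys: sorted y coordinates of the through-lines; row i lies between
    ys[i] and ys[i + 1]. The line pixels themselves are kept sharp.
    """
    out = gray
    for i, (top, bottom) in enumerate(zip(ys, ys[1:])):
        if i < header_rows or i == selected_row:
            continue
        out = blur_band(out, top + 1, bottom, radius)
    return out


# 12x8 synthetic image; lines at y = 0, 4, 8, 11 give three table rows.
gray = [[(x * 37 + y * 91) % 256 for x in range(8)] for y in range(12)]
result = protect_rows(gray, ys=[0, 4, 8, 11], header_rows=1, selected_row=1)
```

In this example row 0 is the header and row 1 is the selected row, so only the pixels strictly between the lines at y = 8 and y = 11 are blurred.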
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210558787.8A CN114821611A (en) | 2022-05-20 | 2022-05-20 | A method for protecting private data in archive table images |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210558787.8A CN114821611A (en) | 2022-05-20 | 2022-05-20 | A method for protecting private data in archive table images |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114821611A (en) | 2022-07-29 |
Family
ID=82516430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210558787.8A Pending CN114821611A (en) | 2022-05-20 | 2022-05-20 | A method for protecting private data in archive table images |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114821611A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5579414A (en) * | 1992-10-19 | 1996-11-26 | Fast; Bruce B. | OCR image preprocessing method for image enhancement of scanned documents by reversing invert text |
US5737442A (en) * | 1995-10-20 | 1998-04-07 | Bcl Computers | Processor based method for extracting tables from printed documents |
US20120008874A1 (en) * | 2009-04-07 | 2012-01-12 | Murata Machinery, Ltd. | Image processing apparatus, image processing method, image processing program, and storage medium |
CN103020609A (en) * | 2012-12-30 | 2013-04-03 | 上海师范大学 | Complex fiber image recognition method |
CN103258198A (en) * | 2013-04-26 | 2013-08-21 | 四川大学 | Extraction method for characters in form document image |
US20150294523A1 (en) * | 2013-08-26 | 2015-10-15 | Vertifi Software, LLC | Document image capturing and processing |
CN109766749A (en) * | 2018-11-27 | 2019-05-17 | 上海眼控科技股份有限公司 | A kind of detection method of the bending table line for financial statement |
CN113516103A (en) * | 2021-08-07 | 2021-10-19 | 山东微明信息技术有限公司 | Table image inclination angle determining method based on support vector machine |
Non-Patent Citations (1)
Title |
---|
李云华;段会川;: "基于Hough变换的图像档案的表格提取与倾斜校正", 信息技术与信息化, no. 06, 15 December 2007 (2007-12-15) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||