CN114821611A - A method for protecting private data in archive table images - Google Patents

Publication number
CN114821611A
Authority
CN
China
Prior art keywords
line
image
row
lines
horizontal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210558787.8A
Other languages
Chinese (zh)
Inventor
李云义
程欣宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202210558787.8A
Publication of CN114821611A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for protecting private data in archive table images. The method uses the RGB color difference of the color image to fade content such as color seals in the table image; scales the image to a specific size and obtains an edge intensity map through a convolution operation; obtains candidate horizontal and vertical line equations using the Hough transform; tracks the lines, backtracking to search for other possible tracking trajectories; and obtains the coordinates of the row to be used from the archive user, then automatically blurs the private parts of the archive image while retaining the part to be used. The method is fast: recognition and privacy blurring complete within 1 second on an ordinary office computer. It is robust to interference: old archives, tables with color seals, tables with original correction marks, and tables with extremely faint lines are all accurately recognized. It is also broadly compatible: tables with different aspect ratios, different numbers of rows and columns, and different header styles are effectively recognized and processed.

Description

A method for protecting private data in archive table images

Technical field

The invention relates to the fields of archive utilization and image processing, and in particular to an algorithm for recognizing and processing archive table images.

Background

To respect the original-record nature of archives and the authenticity of their content, archive management departments scan and store the original archives. When issuing archive certificates and other materials for use, they mask or blur the information in the image that belongs to other people and is unrelated to the archive user.

The current masking method is manual: an operator opens the original archive image in image-processing software such as Paint or Photoshop, manually selects the parts that need privacy processing, processes the image, and then saves, prints, signs and stamps it for delivery to the archive user. The manual selection step is particularly tedious.

Table recognition in document images is a long-established computer vision technique. Traditional image recognition and processing techniques include image skew correction, image binarization, horizontal and vertical projection to find the horizontal and vertical table lines, and Hough-transform voting to obtain line equations followed by line-segment tracking. These methods work well when the image background is clean and the table is simple and regular. However, many archive tables have complex headers, are old and dark, have curved lines, or are covered by handwriting and seals, and these interference factors make it difficult to obtain the table frame with the methods above.

Another branch of computer vision is machine learning based on deep neural networks. It uses stacked convolution layers, pooling layers, activation functions, loss functions and similar components to build a system that learns from a large number of samples, from low-level image features such as edges up to high-level features such as structure. Its disadvantages are that a large number of training samples must be prepared and labeled, and that considerable computing power is needed for training and inference. Collecting and labeling a large number of archive table images before training is a very heavy workload. Even if enough people are organized to complete it, the resulting inference model still runs with noticeable delay on the ordinary office computers of a typical archive utilization office, which makes it inconvenient to use.

Summary of the invention

The purpose of the invention is to provide a method for protecting private data in archive table images. The method is fast and robust to interference: it accurately recognizes old archives, tables with color seals, tables with original correction marks, and tables with extremely faint lines. It is also broadly compatible: tables with different aspect ratios, different numbers of rows and columns, and different header styles can all be effectively recognized and processed.

Technical solution of the invention: a method for protecting private data in archive table images, comprising the following steps:

1) First, use the RGB color difference of the color image to fade content such as color seals in the table image;

2) Scale the image to a specific size, design a horizontal edge detection operator that works well at this image size, and obtain the edge intensity map of the image by convolution;

3) Use the Hough transform to obtain candidate horizontal and vertical line equations;

4) Track lines according to the line equations, using a circular queue to remember the line coordinates of no more than 15 pixels and judging the bending of the line in real time; an excessively bent track is judged to be interference, and the coordinates stored in the queue are used to backtrack and search for other possible tracking trajectories;

5) Sort the horizontal lines that run through the table by Y coordinate, and identify the header from the fact that the header row height differs markedly from that of ordinary rows;

6) Use a graphical interactive interface to obtain from the archive user the coordinates of the row to be used and, combined with the table information recognized in the preceding steps, automatically blur the private parts of the archive image while retaining the part to be used.

The detection operator in step 2) is constructed as an edge detection operator matrix H with n rows and m columns. The middle rows are negative and the rows at both ends are positive; all elements within a row are equal; the positive elements of the matrix sum to 1 and the negative elements sum to -1; n is an integer not greater than 7 and m is an integer not greater than 5.
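The constraints on H can be checked mechanically, which may help when experimenting with other kernel sizes. The sketch below is illustrative only: the function name `satisfies_kernel_constraints`, the tolerance `tol`, and the requirement of at least three rows (so that a middle row exists) are assumptions of this example, not part of the patent text.

```python
def satisfies_kernel_constraints(H, tol=1e-9):
    """Check a kernel (list of rows) against the constraints stated for H."""
    n, m = len(H), len(H[0])
    # n <= 7 and m <= 5 come from the text; n >= 3 is an assumption of this sketch.
    if not (3 <= n <= 7 and 1 <= m <= 5):
        return False
    if any(len(row) != m for row in H):
        return False
    # All elements within a row must be equal.
    if any(abs(v - row[0]) > tol for row in H for v in row):
        return False
    # End rows positive, every row in between negative.
    first, *middle, last = (row[0] for row in H)
    if not (first > 0 and last > 0 and all(v < 0 for v in middle)):
        return False
    # Positive elements sum to 1, negative elements sum to -1.
    positive = sum(v for row in H for v in row if v > 0)
    negative = sum(v for row in H for v in row if v < 0)
    return abs(positive - 1.0) <= tol and abs(negative + 1.0) <= tol


H_example = [[+0.100] * 5, [-0.025] * 5, [-0.150] * 5, [-0.025] * 5, [+0.100] * 5]
print(satisfies_kernel_constraints(H_example))  # True
```

Because the positive and negative parts cancel, a kernel that passes this check gives a response of approximately zero on a uniformly colored region.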

Obtaining the candidate horizontal and vertical line equations with the Hough transform in step 3) comprises the following two steps:

Perform a Hough transform on the image (referred to as voting). The parameter space for voting is two-dimensional: the row coordinate represents the line intercept and the column coordinate represents the line inclination angle; the height of the parameter space equals the document height. The voting threshold is an edge strength of +5: every edge point reaching it gets exactly one vote. This ensures that weak and strong edges of the same length receive the same number of votes, which makes it easy to pick out horizontal table lines of equal length in the voting result. Edge strengths greater than 0 but below +5 are regarded as line artifacts caused by the paper and the scanner;

Obtain the equation parameters of the candidate lines: in the Hough parameter space, find the maximum value maxValH, which is taken as the table line width. Traverse the parameter space and examine every local maximum that reaches 70% of maxValH, so that broken or partly missing table lines are detected as far as possible.

The line-segment tracking algorithm in step 4), which can follow moderately curved lines, broken lines, and lines with heavy stroke interference, tracks each line under the guidance of its line equation. The inclination angle and the intercept are taken from the parametric equation of the line; the intercept is the Y coordinate of the intersection with the line X=0. In the edge intensity map, tracking starts from the point X=0, Y=intercept and walks to the right along the inclination angle. With t=20 as the threshold, the tracking direction is the point with the maximum value among the upper, middle and lower points relative to the current point. If the strength at that point is greater than t, it is regarded as a line point; otherwise it is regarded as a non-line point. A circular queue records the coordinates of the previously tracked points. If the line length is greater than 7, the bending angle of the line formed by the most recent 7 points is calculated; if the angle is greater than 3°, this is regarded as a tracking error and the tracker rolls back. All tracked lines are recorded in the set S.

The operation in step 5) of sorting the horizontal lines that run through the table by Y coordinate and identifying the header is as follows. The x coordinates of the left and right endpoints of all lines in S are counted by voting; the x with the most votes among the left endpoints is taken as the left table boundary L, and the x with the most votes among the right endpoints is taken as the right table boundary R. All lines in S are then traversed, and only the lines whose left and right endpoints lie near L and R respectively are kept; these are regarded as horizontal lines that run through the table. The kept lines are sorted by y coordinate and treated as adjacent horizontal table lines. In the upper half of the table, working from top to bottom, the heights of each pair of adjacent rows are compared; if they differ by more than 20%, the upper row is regarded as part of the header.

Automatically blurring the private parts of the archive image and retaining the part to be used in step 6) means: obtain the row selected by the archive user, keep a clear image of that row and of the header, apply conventional image blurring to the other rows, and print the resulting image.

The beneficial effects of the invention are: 1. It is fast: recognition and privacy blurring complete within 1 second on an ordinary office computer. 2. It is robust to interference: old archives, tables with color seals, tables with original correction marks, and tables with extremely faint lines are all accurately recognized. 3. It is broadly compatible: tables with different aspect ratios, different numbers of rows and columns, and different header styles can all be effectively recognized and processed.

Description of drawings

Figure 1 is a color scan with various interference features (names have been masked);

Figure 2 compares ordinary color-to-grayscale conversion with the graying method used in this patent;

Figure 3 is the horizontal edge map;

Figure 4 shows the Hough transform parameter space (left) and the horizontal tracking result (right);

Figure 5 shows the retained horizontal lines that run through the table;

Figure 6 is a schematic diagram of the line tracking algorithm; tracking runs from left to right, yellow marks no line, green marks correct tracking, and red marks interfering lines;

Figure 7 is the privacy-protected image generated automatically after the user selects the row to be used in the graphical interactive interface.

Detailed description

The invention is further described below with reference to an embodiment, which is not to be taken as limiting the invention.

Embodiment 1: A method for protecting private data in archive table images

Step 1: Unify the size and fade color seals and signatures

Let the width and height of the original image be W and H, and let the scale factor be zoom=min(1024/W,1024/H). Scale the original image by the factor zoom.

In the scaled image the line width stays within 0.5 to 2 pixels, which makes later detection easier. Split the color image into its R, G and B channel images. Taking the element-wise maximum of the image matrices gives Gray=max(R,G,B). Gray is the color-faded image, in which interference from color seals, red signatures and the like is greatly reduced, as the comparison in Figure 2 shows.
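The two formulas of this step can be illustrated with a minimal sketch. It uses plain Python lists so that it runs without third-party packages; the function names `zoom_factor` and `fade_color` and the representation of the image as rows of (r, g, b) tuples are assumptions of this example.

```python
def zoom_factor(width, height, target=1024):
    """zoom = min(1024/W, 1024/H): the longer side ends up at `target` pixels."""
    return min(target / width, target / height)


def fade_color(rgb_image):
    """Gray = max(R, G, B) per pixel; rgb_image is a list of rows of (r, g, b)."""
    return [[max(pixel) for pixel in row] for row in rgb_image]


# A pure red seal pixel (255, 0, 0) becomes 255, as bright as white paper,
# while a black table-line pixel (0, 0, 0) stays 0.
page = [[(255, 255, 255), (255, 0, 0), (0, 0, 0)]]
print(fade_color(page))          # [[255, 255, 0]]
print(zoom_factor(2048, 4096))   # 0.25
```

The example shows why the maximum works for fading: saturated ink has at least one bright channel, so it maps close to the paper brightness, whereas dark ruling lines are dark in all three channels and survive.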

Step 2: Construct a horizontal edge detection operator and perform horizontal edge detection on the image

Construct an edge detection operator matrix H with n rows and m columns; the suggested values are n=5 and m=5. The middle rows are negative and the rows at both ends are positive; all elements within a row are equal; the positive elements of the matrix sum to 1 and the negative elements sum to -1. The suggested value of H is:

H=[+0.100,+0.100,+0.100,+0.100,+0.100;
   -0.025,-0.025,-0.025,-0.025,-0.025;
   -0.150,-0.150,-0.150,-0.150,-0.150;
   -0.025,-0.025,-0.025,-0.025,-0.025;
   +0.100,+0.100,+0.100,+0.100,+0.100;]

Compute the 2-dimensional convolution of Gray with H to obtain the horizontal edge intensity map E=conv2d(Gray,H); the result is shown in Figure 3.

The conv2d here is the same computation as the 2D convolution used in convolutional neural networks.
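A small worked example may make the behavior of H concrete. The sketch below is a dependency-free illustration, not the patented implementation: the function name `conv2d_valid`, the choice of "valid" borders (no padding), and the 7x7 test image are assumptions. A practical implementation would more likely use an array library.

```python
H = [[+0.100] * 5,
     [-0.025] * 5,
     [-0.150] * 5,
     [-0.025] * 5,
     [+0.100] * 5]


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) and return the 'valid' responses.

    The kernel is not flipped, as in CNN-style conv2d. H is unchanged by a
    180-degree rotation, so flipping would give the same result here.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


# 7x7 white page (255) with a one-pixel dark horizontal line (0) on row 3.
gray = [[0 if y == 3 else 255 for _ in range(7)] for y in range(7)]
E = conv2d_valid(gray, H)

# The response is strongest where the kernel is centred on the line
# (output row 1, which corresponds to image row 3).
print([round(row[0], 3) for row in E])
```

On blank paper the positive and negative weights cancel, so the response is approximately zero; a dark line under the negative middle rows leaves the positive end rows dominant and produces a large positive value.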

Step 3: Extract the table row lines

Private data in archives is generally organized by row, so the tracking of horizontal lines is used as the example here.

Voting. Perform a Hough transform (voting) on the image. The parameter space for voting is two-dimensional: the row coordinate represents the line intercept and the column coordinate represents the line inclination angle (angular resolution 0.1°, angle range ±3°), giving 61 columns in total; this range can be widened if necessary. The height of the parameter space equals the document height. The voting threshold is an edge strength of +5: every edge point reaching it gets exactly one vote. This ensures that weak and strong edges of the same length receive the same number of votes, which makes it easy to pick out horizontal table lines of equal length in the voting result. Edge strengths greater than 0 but below +5 are regarded as line artifacts caused by the paper and the scanner.
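The voting described above can be sketched as follows. This is an illustration under stated assumptions: the function name `hough_vote`, the line model y = intercept + x·tan(angle) with y growing downward, and rounding the intercept to the nearest row are choices made for this example and are not specified in the text.

```python
import math


def hough_vote(E, threshold=5.0, max_angle_deg=3.0, step_deg=0.1):
    """Accumulate votes in an (intercept x angle) parameter space.

    Rows of the accumulator are intercepts 0..height-1; columns are angles
    from -max_angle_deg to +max_angle_deg in steps of step_deg (61 columns
    for the default values). Every pixel with E >= threshold casts exactly
    one vote per angle, regardless of how strong the edge is.
    """
    height, width = len(E), len(E[0])
    n_angles = round(2 * max_angle_deg / step_deg) + 1
    slopes = [math.tan(math.radians(-max_angle_deg + k * step_deg))
              for k in range(n_angles)]
    acc = [[0] * n_angles for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if E[y][x] < threshold:
                continue  # treated as paper or scanner artifact
            for k, slope in enumerate(slopes):
                intercept = round(y - x * slope)
                if 0 <= intercept < height:
                    acc[intercept][k] += 1
    return acc
```

Because each qualifying pixel contributes one vote, the vote count of a cell approximates the number of pixels on that line, that is, its length in pixels, independent of how dark the line was printed.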

Obtain the equation parameters of the candidate lines. In the Hough parameter space (the black part on the left of Figure 4), find the maximum value maxValH, which is taken as the table line width. Traverse the parameter space and examine every local maximum that reaches 70% of maxValH, so that broken or partly missing table lines are detected as far as possible.
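Selecting the candidates can be sketched as a scan for local maxima. The function name `candidate_lines` and the definition of a local maximum (not smaller than any of its up to 8 neighbors, so a plateau yields several candidates) are assumptions of this example; the text only states the 70% criterion.

```python
def candidate_lines(acc, ratio=0.7, max_angle_deg=3.0, step_deg=0.1):
    """Return (intercept, angle_deg, votes) for local maxima >= ratio * maxValH."""
    max_val_h = max(max(row) for row in acc)
    rows, cols = len(acc), len(acc[0])
    found = []
    for b in range(rows):
        for k in range(cols):
            votes = acc[b][k]
            if votes == 0 or votes < ratio * max_val_h:
                continue
            neighbours = [acc[i][j]
                          for i in range(max(0, b - 1), min(rows, b + 2))
                          for j in range(max(0, k - 1), min(cols, k + 2))
                          if (i, j) != (b, k)]
            if all(votes >= n for n in neighbours):
                found.append((b, -max_angle_deg + k * step_deg, votes))
    return found


# Tiny accumulator with three angle columns (-0.1, 0.0, +0.1 degrees).
acc_demo = [[0, 0, 0], [0, 10, 0], [0, 0, 0], [0, 8, 0], [0, 0, 0], [0, 3, 0]]
print(candidate_lines(acc_demo, max_angle_deg=0.1))
# [(1, 0.0, 10), (3, 0.0, 8)]
```

In the demo the line with 8 votes is kept because 8 is at least 70% of the maximum 10, which is how a line with gaps can still qualify; the cell with 3 votes is rejected.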

Step 4: Track the lines under the guidance of the line equations (as shown in Figure 6). The inclination angle and the intercept (the Y coordinate of the intersection with the line X=0) are taken from the parametric equation of the line. In the edge intensity map, tracking starts from the point X=0, Y=intercept and walks to the right along the inclination angle. With t=20 as the threshold, the tracking direction is the point with the maximum value among the upper, middle and lower points relative to the current point. If the strength at that point is greater than t, it is regarded as a line point; otherwise it is regarded as a non-line point.

A circular queue (a data structure) records the coordinates of the previously tracked points. If the line length is greater than 7, the bending angle of the line formed by the most recent 7 points is calculated. If the angle is greater than 3°, this is regarded as a tracking error and the tracker rolls back. All tracked lines are recorded in the set S.

A circular queue is used because the bending of the tracked line is examined only over the 7 most recently tracked points, so the storage for coordinates further back can be reused.
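The tracking loop can be sketched as below. Several details are assumptions of this example rather than statements from the text: the bending angle is taken as the angle between the first-to-middle and middle-to-last segments of the 7-point window; the rollback is simplified to discarding the offending point and retrying on the row predicted by the Hough line; ties among the three candidates prefer the current row; and `collections.deque` with `maxlen=15` stands in for the circular queue. The names `bend_angle_deg` and `track_line` are hypothetical.

```python
import math
from collections import deque


def bend_angle_deg(points):
    """Angle between the first and second half of a short run of (x, y) points."""
    first, middle, last = points[0], points[len(points) // 2], points[-1]
    a = math.atan2(middle[1] - first[1], middle[0] - first[0])
    b = math.atan2(last[1] - middle[1], last[0] - middle[0])
    return abs(math.degrees(b - a))


def track_line(E, intercept, angle_deg, t=20.0, window=7,
               max_bend_deg=3.0, memory=15):
    """Walk rightwards from (0, intercept) and return the accepted line points."""
    height, width = len(E), len(E[0])
    slope = math.tan(math.radians(angle_deg))

    def predicted_row(x):
        # Row of the Hough line at column x, clamped to the image.
        return min(max(round(intercept + x * slope), 0), height - 1)

    recent = deque(maxlen=memory)  # circular queue: oldest entries are overwritten
    points = []
    y = predicted_row(0)
    for x in range(width):
        # Candidates: middle, upper and lower pixel (middle first to break ties).
        rows = [r for r in (y, y - 1, y + 1) if 0 <= r < height]
        best = max(rows, key=lambda r: E[r][x])
        if E[best][x] > t:
            recent.append((x, best))
            bent = (len(recent) >= window and
                    bend_angle_deg(list(recent)[-window:]) > max_bend_deg)
            if bent:
                recent.pop()              # roll back the offending point
                best = predicted_row(x)   # retry on the Hough line itself
                if E[best][x] > t:
                    recent.append((x, best))
                    points.append((x, best))
            else:
                points.append((x, best))
            y = best
        else:
            y = predicted_row(x)          # non-line point: keep following the Hough line
    return points
```

With integer pixel coordinates and this definition of the angle, a single one-row step inside the window already gives about 18°, so for visibly tilted lines `max_bend_deg` or the angle definition would need adapting; the sketch is exercised only on horizontal lines.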

Step 5: Find the horizontal lines that run through the table and the header

Count votes for the x coordinates of the left and right endpoints of all lines in S. Take the x with the most votes among the left endpoints as the left table boundary L, and the x with the most votes among the right endpoints as the right table boundary R.

Then traverse all lines in S and keep only those whose left and right endpoints lie near L and R respectively; these are regarded as horizontal lines that run through the table. The result is shown in Figure 5.

Sort these horizontal lines by y coordinate and treat them as adjacent horizontal table lines. In the upper half of the table, working from top to bottom, compare the heights of each pair of adjacent rows; if they differ by more than 20%, the upper row is regarded as part of the header.
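The boundary voting and header test can be sketched as follows. The representation of each tracked line as a tuple (x_left, x_right, y), the tolerance `tol` for "near", and measuring the 20% difference relative to the smaller of the two row heights are assumptions of this example; the function names are hypothetical.

```python
from collections import Counter


def table_bounds(lines):
    """Most frequent left and right endpoint x among lines (x_left, x_right, y)."""
    left = Counter(line[0] for line in lines).most_common(1)[0][0]
    right = Counter(line[1] for line in lines).most_common(1)[0][0]
    return left, right


def through_lines(lines, tol=3):
    """Keep lines whose endpoints lie within `tol` pixels of L and R, sorted by y."""
    L, R = table_bounds(lines)
    kept = [line for line in lines
            if abs(line[0] - L) <= tol and abs(line[1] - R) <= tol]
    return sorted(kept, key=lambda line: line[2])


def header_rows(sorted_lines, ratio=0.20):
    """Indices of rows in the upper half whose height differs >20% from the next row."""
    ys = [line[2] for line in sorted_lines]
    heights = [b - a for a, b in zip(ys, ys[1:])]
    header = []
    for i in range(len(heights) // 2):      # upper half only, top to bottom
        upper, lower = heights[i], heights[i + 1]
        if abs(upper - lower) > ratio * min(upper, lower):
            header.append(i)
    return header


S = [(10, 500, 20), (10, 500, 80), (10, 500, 110), (10, 500, 140),
     (10, 500, 170), (10, 500, 200), (12, 300, 95), (200, 500, 125)]
rows = through_lines(S)
print(table_bounds(S))     # (10, 500)
print(header_rows(rows))   # [0]
```

In the demo the two partial lines are dropped because one endpoint is far from a boundary, and the first row (height 60 against 30 for the rows below) is flagged as header.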

Step 6: Apply privacy protection

From the software's interactive interface, obtain the row selected by the archive user. Keep a clear image of that row and of the header, apply conventional image blurring to the other rows, and print the resulting image. The result is shown in Figure 7.
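The final step can be sketched with a simple box blur standing in for "conventional image blurring". The function names `box_blur_pixel` and `protect_rows`, the blur radius, and the convention that row i lies between the sorted line positions line_ys[i] and line_ys[i+1] are assumptions of this example.

```python
def box_blur_pixel(img, x, y, radius):
    """Mean of the (2*radius+1)^2 neighbourhood, clipped at the image border."""
    h, w = len(img), len(img[0])
    ys = range(max(0, y - radius), min(h, y + radius + 1))
    xs = range(max(0, x - radius), min(w, x + radius + 1))
    total = sum(img[j][i] for j in ys for i in xs)
    return total // (len(ys) * len(xs))


def protect_rows(gray, line_ys, keep_rows, radius=2):
    """Blur every table row except those in `keep_rows` (header plus selected row)."""
    out = [row[:] for row in gray]
    for row_index, (top, bottom) in enumerate(zip(line_ys, line_ys[1:])):
        if row_index in keep_rows:
            continue
        for y in range(top + 1, bottom):      # leave the ruling lines untouched
            for x in range(len(gray[0])):
                out[y][x] = box_blur_pixel(gray, x, y, radius)
    return out


# Two rows between lines at y = 0, 4, 8; one dark "text" pixel in each row.
page = [[255] * 6 for _ in range(9)]
page[2][3] = 0
page[6][3] = 0
result = protect_rows(page, line_ys=[0, 4, 8], keep_rows={0})
print(result[2][3], result[6][3])   # 0 244
```

The kept row is returned unchanged, while the dark pixel in the other row is averaged with its neighbors. A radius this small is only for demonstration; in practice the blur strength would need to be large enough that the text cannot be read.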

Claims (6)

1. A method for protecting private data in archive table images, characterized in that it comprises the following steps: 1) first, using the RGB color difference of the color image to fade content such as color seals in the table image; 2) scaling the image to a specific size, designing a horizontal edge detection operator that works well at this image size, and obtaining the edge intensity map of the image by convolution; 3) using the Hough transform to obtain candidate horizontal and vertical line equations; 4) tracking lines according to the line equations, using a circular queue to remember the line coordinates of no more than 15 pixels and judging the bending of the line in real time, wherein an excessively bent track is judged to be interference and the coordinates stored in the queue are used to backtrack and search for other possible tracking trajectories; 5) sorting the horizontal lines that run through the table by Y coordinate, and identifying the header from the fact that the header row height differs markedly from that of ordinary rows; 6) using a graphical interactive interface to obtain from the archive user the coordinates of the row to be used and, combined with the table information recognized in the preceding steps, automatically blurring the private parts of the archive image while retaining the part to be used.

2. The method for protecting private data in archive table images according to claim 1, characterized in that: the detection operator in step 2) is constructed as an edge detection operator matrix H with n rows and m columns, wherein the middle rows are negative, the rows at both ends are positive, all elements within a row are equal, the positive elements of the matrix sum to 1 and the negative elements sum to -1, n is an integer not greater than 7, and m is an integer not greater than 5.

3. The method for protecting private data in archive table images according to claim 1, characterized in that: obtaining the candidate horizontal and vertical line equations with the Hough transform in step 3) comprises the following two steps: performing a Hough transform on the image (referred to as voting), wherein the parameter space for voting is two-dimensional, the row coordinate represents the line intercept, the column coordinate represents the line inclination angle, and the height of the parameter space equals the document height; the voting threshold is an edge strength of +5, every edge point reaching it getting exactly one vote, which ensures that weak and strong edges of the same length receive the same number of votes and makes it easy to pick out horizontal table lines of equal length in the voting result; edge strengths greater than 0 but below +5 are regarded as line artifacts caused by the paper and the scanner; and obtaining the equation parameters of the candidate lines, wherein the maximum value maxValH in the Hough parameter space is found and taken as the table line width, the parameter space is traversed, and every local maximum that reaches 70% of maxValH is examined, so that broken or partly missing table lines are detected as far as possible.

4. The method for protecting private data in archive table images according to claim 1, characterized in that: the line-segment tracking algorithm in step 4), which can follow moderately curved lines, broken lines, and lines with heavy stroke interference, tracks each line under the guidance of its line equation; the inclination angle and the intercept are taken from the parametric equation of the line, the intercept being the Y coordinate of the intersection with the line X=0; in the edge intensity map, tracking starts from the point X=0, Y=intercept and walks to the right along the inclination angle; with t=20 as the threshold, the tracking direction is the point with the maximum value among the upper, middle and lower points relative to the current point; if the strength at that point is greater than t, it is regarded as a line point, otherwise it is regarded as a non-line point; a circular queue records the coordinates of the previously tracked points; if the line length is greater than 7, the bending angle of the line formed by the most recent 7 points is calculated; if the angle is greater than 3°, this is regarded as a tracking error and the tracker rolls back; all tracked lines are recorded in the set S.

5. The method for protecting private data in archive table images according to claim 1, characterized in that: the operation in step 5) of sorting the horizontal lines that run through the table by Y coordinate and identifying the header is: the x coordinates of the left and right endpoints of all lines in S are counted by voting, the x with the most votes among the left endpoints being taken as the left table boundary L and the x with the most votes among the right endpoints being taken as the right table boundary R; all lines in S are then traversed and only the lines whose left and right endpoints lie near L and R respectively are kept, these being regarded as horizontal lines that run through the table; the kept lines are sorted by y coordinate and treated as adjacent horizontal table lines; in the upper half of the table, working from top to bottom, the heights of each pair of adjacent rows are compared, and if they differ by more than 20%, the upper row is regarded as part of the header.

6. The method for protecting private data in archive table images according to claim 1, characterized in that: automatically blurring the private parts of the archive image and retaining the part to be used in step 6) means obtaining the row selected by the archive user, keeping a clear image of that row and of the header, applying conventional image blurring to the other rows, and printing the resulting image.
CN202210558787.8A 2022-05-20 2022-05-20 A method for protecting private data in archive table images Pending CN114821611A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210558787.8A CN114821611A (en) 2022-05-20 2022-05-20 A method for protecting private data in archive table images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210558787.8A CN114821611A (en) 2022-05-20 2022-05-20 A method for protecting private data in archive table images

Publications (1)

Publication Number Publication Date
CN114821611A true CN114821611A (en) 2022-07-29

Family

ID=82516430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210558787.8A Pending CN114821611A (en) 2022-05-20 2022-05-20 A method for protecting private data in archive table images

Country Status (1)

Country Link
CN (1) CN114821611A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5579414A (en) * 1992-10-19 1996-11-26 Fast; Bruce B. OCR image preprocessing method for image enhancement of scanned documents by reversing invert text
US5737442A (en) * 1995-10-20 1998-04-07 Bcl Computers Processor based method for extracting tables from printed documents
US20120008874A1 (en) * 2009-04-07 2012-01-12 Murata Machinery, Ltd. Image processing apparatus, image processing method, image processing program, and storage medium
CN103020609A (en) * 2012-12-30 2013-04-03 上海师范大学 Complex fiber image recognition method
CN103258198A (en) * 2013-04-26 2013-08-21 四川大学 Extraction method for characters in form document image
US20150294523A1 (en) * 2013-08-26 2015-10-15 Vertifi Software, LLC Document image capturing and processing
CN109766749A (en) * 2018-11-27 2019-05-17 上海眼控科技股份有限公司 A kind of detection method of the bending table line for financial statement
CN113516103A (en) * 2021-08-07 2021-10-19 山东微明信息技术有限公司 Table image inclination angle determining method based on support vector machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李云华; 段会川: "基于Hough变换的图像档案的表格提取与倾斜校正" (Table extraction and skew correction of image archives based on the Hough transform), 信息技术与信息化, no. 06, 15 December 2007 (2007-12-15) *

Similar Documents

Publication Publication Date Title
CN109409355B (en) Novel transformer nameplate identification method and device
CN104408449B (en) Intelligent mobile terminal scene literal processing method
CN109784344A (en) An image non-target filtering method for ground plane identification recognition
CN102831584B (en) Data-driven object image restoring system and method
CN116563237B (en) A hyperspectral image detection method for chicken carcass defects based on deep learning
CN109360179B (en) Image fusion method and device and readable storage medium
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
CN112819748B (en) Training method and device for strip steel surface defect recognition model
CN112446259A (en) Image processing method, device, terminal and computer readable storage medium
CN110503103A (en) A Character Segmentation Method in Text Lines Based on Fully Convolutional Neural Networks
CN113065396A (en) Automated filing processing system and method for scanned archive images based on deep learning
CN111125403B (en) Aided design drawing method and system based on artificial intelligence
CN110390677A (en) A method and system for defect localization based on sliding self-matching
CN110334709A (en) End-to-end multi-task deep learning-based license plate detection method
CN115331245A (en) A table structure recognition method based on image instance segmentation
CN105184225A (en) Multinational paper money image identification method and apparatus
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN111415330A (en) Copper foil appearance defect detection method based on deep learning
CN115082776A (en) Electric energy meter automatic detection system and method based on image recognition
CN114708237A (en) Detection algorithm for hair health condition
CN115797336A (en) Fault detection method and device of photovoltaic module, electronic equipment and storage medium
CN112364863B (en) Character positioning method and system for license document
CN114565927B (en) Table recognition method, device, electronic device and storage medium
Kovanen et al. A layered method for determining manga text bubble reading order
Schüffler et al. Overcoming an annotation hurdle: Digitizing pen annotations from whole slide images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination