CN105938547A - Paper hydrologic yearbook digitalization method - Google Patents
- Publication number
- CN105938547A (application CN201610232680.9A)
- Authority
- CN
- China
- Prior art keywords
- numeric character
- feature
- hydrological
- character block
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical field
The invention relates to a method for digitizing paper hydrological yearbooks and belongs to the interdisciplinary field of computer image processing and hydrology.
Background
Paper hydrological yearbooks record the most basic hydrometric data. These data carry information about the long-term evolution of natural systems and the influence of human activity, and they play an important role in production, scientific research and public services. Because the yearbooks are old, heavily used and often stored in poor conditions, the paper volumes have gradually begun to deteriorate, and any further human or natural damage would cause irreparable loss; rescuing these valuable historical records has become urgent. The most effective way to protect a hydrological yearbook is to scan and process it digitally to form an electronic archive. Prior work has studied yearbook digitization on this basis and proposed intelligent recognition of yearbook data; recognizing the numerals in hydrological data (numeric character recognition) is a central task of that digitization.
Hydrological data are published year by year and presented in a uniform, systematic tabular form. They consist mainly of the commonly needed basic hydrological data measured in the previous year and then rigorously compiled and reviewed. In the tables, the months run across the page and the days of each month run down it; the bottom of the table gives each month's mean flow, maximum flow and minimum flow, together with annual statistics and notes. For this reason, layout analysis is performed and the table lines are extracted before the yearbook numerals are recognized.
The numeric characters in a hydrological yearbook are fairly regular and have few strokes, so extracting feature codes from them is easier than from Chinese characters. However, their shapes vary little and they carry little stroke information, which in some respects makes it harder to extract effective feature vectors. For example, when the ink is heavy, the upper half of a "6" in Song typeface can close into a small loop, making it almost indistinguishable from an "8". Likewise, when the ink is heavy or the type is very small, "1" and "3", or "2" and "7", may yield the same feature vector. In practice, therefore, applying existing techniques to hydrological data gives low accuracy and low efficiency.
Summary of the invention
The technical problem addressed by the invention is to provide a method for digitizing paper hydrological yearbooks that uses a new feature fusion design, effectively improves the recognition rate and maintains working efficiency.
To solve this problem, the invention adopts the following technical solution: a method for digitizing paper hydrological yearbooks, comprising the following steps:
Step 001. According to the layout design of the paper yearbook page, determine the pixel position of the hydrological data table within the page, then go to step 002;
Step 002. Using the pixel position of the hydrological data table in the page, project the table vertically and horizontally, analyze the vertical and horizontal projection profiles separately, and extract the x-coordinate of each vertical line and the y-coordinate of each horizontal line in the table, then go to step 003;
Step 003. Using the table format together with the x-coordinates of the vertical lines and the y-coordinates of the horizontal lines, obtain from the projected image of the table the data image in each numeric cell of the table, then go to step 004; in each data image the numeric characters are white and the background is black;
Step 004. For each data image, segment the numeric characters in it to obtain the numeric character blocks of that data image, and thereby obtain the numeric character blocks of every data image, then go to step 005;
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block; together these form the recognition features of that character. The recognition features of the numeric characters in all blocks of all data images are obtained in this way, then go to step 006;
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels runs downward from the top edge of the block; if so, the block is judged to contain a decimal point; otherwise no further action is taken. After every numeric character block in every data image has been judged, go to step 007;
Step 007. Perform feature fusion on all recognition features of the numeric characters in all data images to form the numeric recognition features corresponding to "0" through "9" in the hydrological data table, then go to step 008;
Step 008. Using the numeric recognition features corresponding to "0" through "9" in the hydrological data table and the recognition features of the numeric character in each block of each data image, obtain through a preset classifier the digit corresponding to each numeric character block, then go to step 009;
Step 009. From the digit or decimal point corresponding to each numeric character block in each data image, assemble the numeric value of the data image in each numeric cell of the table; then, combining these with the attributes defined by the table format, obtain each attribute of the hydrological data table and its corresponding value, and store them.
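The projection analysis of step 002 can be illustrated with a minimal sketch. It assumes a binarized table image given as a list of rows in which 1 marks an ink pixel, and it treats any row or column whose ink count reaches a fixed fraction of the image size as a table line. The function names and the `ratio` parameter are hypothetical and are not taken from the patent.

```python
def line_positions(profile, min_count):
    """Return the centre index of each run where profile >= min_count."""
    positions, start = [], None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i                      # a run of line pixels begins
        elif value < min_count and start is not None:
            positions.append((start + i - 1) // 2)
            start = None
    if start is not None:                  # run reaches the image border
        positions.append((start + len(profile) - 1) // 2)
    return positions


def table_lines(img, ratio=0.8):
    """img: list of rows, 1 = ink pixel.

    Returns (x-coordinates of vertical lines, y-coordinates of horizontal lines).
    ratio is an assumed fraction of the image size a line must cover.
    """
    h, w = len(img), len(img[0])
    col_profile = [sum(img[y][x] for y in range(h)) for x in range(w)]
    row_profile = [sum(row) for row in img]
    xs = line_positions(col_profile, ratio * h)
    ys = line_positions(row_profile, ratio * w)
    return xs, ys
```

Thick lines collapse to a single coordinate because each run of consecutive high-count rows or columns is reduced to its centre.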
As a preferred technical solution of the invention, the following steps follow step 009; after step 009 is completed, go to step 010;
Step 010. For the attributes and corresponding values of the recognized and stored hydrological data table, process the flow values of each month according to steps 010-01 and 010-02 below to obtain a preliminary recognition judgment for each daily flow value of each month, then go to step 011;
Step 010-01. Take the flow value of the first day of the month as the first threshold. For each of the first two days of the month, judge whether the difference between the next day's flow value and that day's flow value is less than the first threshold; if so, that day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This yields the preliminary judgments for the first two days of the month; then go to step 010-02;
Step 010-02. For each day from the third day of the month onward, judge whether the difference between the next day's flow value and that day's flow value is less than the previous day's flow value; if so, that day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This yields the preliminary judgments for each day from the third day of the month onward;
Step 011. Using each value in the recognized and stored hydrological data table and the recognition features of each digit in each value, obtain through a preset trainer, for every digit in every value, the ten recognition probabilities corresponding to "0" through "9", then go to step 012;
Step 012. For each digit in each value of the recognized and stored hydrological data table, take the largest and the second-largest of its ten recognition probabilities and compute their difference. If the difference is less than a preset probability threshold, the digit is judged to be preliminarily misrecognized; otherwise it is judged to be correctly recognized. This yields a preliminary judgment for every digit in every value; then go to step 013;
Step 013. For each preliminarily misrecognized flow value in each month, judge whether it contains a preliminarily misrecognized digit; if so, the flow value is judged to be wrong and an alarm is raised; otherwise the flow value is judged to be correct. This completes the verification of each value in the recognized and stored hydrological data table.
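The day-to-day continuity check of steps 010-01 and 010-02 can be sketched as follows. Two points are assumptions rather than statements from the text: the difference is taken as an absolute value, and the last day of the month, which has no following day, is left unflagged. The function name is hypothetical.

```python
def flag_daily_flows(flows):
    """Return one boolean per day; True = preliminarily misrecognized.

    flows: the recognized daily flow values of one month, in date order.
    """
    flags = [False] * len(flows)
    if not flows:
        return flags
    first_threshold = flows[0]             # step 010-01: first day's value
    for d in range(len(flows) - 1):        # last day has no next-day value (assumption: skip)
        diff = abs(flows[d + 1] - flows[d])  # absolute difference is an assumption
        # Days 1-2 use the first threshold; later days use the previous day's flow.
        limit = first_threshold if d < 2 else flows[d - 1]
        flags[d] = not (diff < limit)
    return flags
```

Under this rule a correct value can also be flagged when the following day's value is wrong, which is why the flag is only preliminary and is confirmed against the digit-level check in step 013.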
As a preferred technical solution of the invention: in step 011, the preset trainer is a support vector machine trainer. Using each value in the recognized and stored hydrological data table and the recognition features of each digit in each value, it yields, for every digit in every value, the ten recognition probabilities corresponding to "0" through "9".
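The margin test of step 012 and the combination rule of step 013 reduce to a few lines once the ten probabilities are available. The sketch below assumes the probabilities arrive as a plain list of ten floats; how the trainer produces them is outside its scope, and the function names are hypothetical.

```python
def low_margin(probs, threshold):
    """Step 012: True if the top two class probabilities are too close."""
    top, second = sorted(probs, reverse=True)[:2]
    return (top - second) < threshold


def value_is_wrong(value_flagged, digit_flags):
    """Step 013: a flagged flow value is confirmed wrong only if it
    also contains at least one preliminarily misrecognized digit."""
    return value_flagged and any(digit_flags)
```

Requiring both signals keeps a value from being reported merely because a neighbouring day was misread.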
As a preferred technical solution of the invention: in step 013, when a preliminarily misrecognized flow value is found to contain a preliminarily misrecognized digit and is therefore judged wrong and an alarm is raised, the position of that digit within the flow value is also analyzed. If the digit lies in the integer part, the flow value is replaced by the mean of the previous day's and the next day's flow values. If the digit lies in the fractional part, the fractional part of the flow value is replaced by the mean of the fractional parts of the previous day's and the next day's flow values.
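The replacement rule above can be written as a short function. This is a sketch that assumes non-negative flow values held as floats; the function and parameter names are hypothetical.

```python
import math


def correct_flow(prev_value, cur_value, next_value, error_in_integer_part):
    """Replace a confirmed-wrong flow value using its two neighbouring days."""
    if error_in_integer_part:
        # Whole value replaced by the mean of the neighbouring days.
        return (prev_value + next_value) / 2
    # Only the fractional part is replaced; the integer part is kept.
    frac = (math.modf(prev_value)[0] + math.modf(next_value)[0]) / 2
    return math.floor(cur_value) + frac
```

Because the values are floats, results should be compared with a tolerance; a decimal type would be a reasonable alternative if exact printed precision matters.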
As a preferred technical solution of the invention, in step 004, segmenting the numeric characters in a data image to obtain its numeric character blocks specifically comprises the following steps:
Step a01. Detect the white pixels inside each numeric character in the data image, together with the white pixels on the characters that lie at the minimum distance from each edge of the data image, then go to step a02;
Step a02. Judge each white pixel obtained in the previous step: if the pixels above, below, to the left and to the right of it are all white, the pixel is judged to be an interior pixel of a numeric character; otherwise it is judged, according to the identifier, to be an edge pixel of a character, and the column number of the pixel column containing it in the data image is recorded. Applying this judgment to every white pixel obtained in the previous step gives the column numbers of the pixel columns containing the edge pixels of each numeric character in the data image; then go to step a03;
Step a03. Using the column numbers of the pixel columns containing the edge pixels of each numeric character, divide the numeric characters in the data image to obtain the numeric character blocks of the data image.
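Steps a01 to a03 can be sketched as follows, assuming a data image given as a list of rows with 1 for white (character) pixels and 0 for black background. The sketch assumes adjacent characters are separated by at least one empty pixel column, so that consecutive edge-pixel columns belong to the same character; touching characters would need extra handling. The function names are hypothetical.

```python
def is_edge(img, y, x):
    """A white pixel is an edge pixel unless all four neighbours are white."""
    h, w = len(img), len(img[0])
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if not (0 <= ny < h and 0 <= nx < w) or img[ny][nx] == 0:
            return True
    return False


def split_characters(img):
    """Return (first_column, last_column) for each character block."""
    cols = sorted({x for y, row in enumerate(img)
                   for x, v in enumerate(row) if v and is_edge(img, y, x)})
    blocks, start, prev = [], None, None
    for x in cols:
        if start is None:
            start = x
        elif x != prev + 1:                # gap between columns: new character
            blocks.append((start, prev))
            start = x
        prev = x
    if start is not None:
        blocks.append((start, prev))
    return blocks
```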
As a preferred technical solution of the invention, in step 005, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the top, bottom, left and right boundaries of the numeric character block and thereby obtain the image of the numeric character body, then go to step b02;
Step b02. Apply centroid normalization to the character body image, divide the normalized image evenly into a preset number of sub-region images, then go to step b03;
Step b03. Compute the proportion of white pixels in each sub-region image of the character body image; together these proportions form the grid feature of the numeric character in the block.
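A minimal sketch of steps b01 and b03 follows. It crops the block to the bounding box of its white pixels and computes the white-pixel proportion in an n-by-n grid. The centroid normalization of step b02 is deliberately omitted, the grid size `n` is an assumed parameter, and the block must contain at least one white pixel.

```python
def grid_feature(block, n=2):
    """block: list of rows, 1 = white character pixel. Returns n*n proportions."""
    # Step b01: bounding box of the character body.
    ys = [y for y, row in enumerate(block) if any(row)]
    xs = [x for x in range(len(block[0])) if any(row[x] for row in block)]
    top, bottom, left, right = ys[0], ys[-1], xs[0], xs[-1]
    body = [row[left:right + 1] for row in block[top:bottom + 1]]
    h, w = len(body), len(body[0])
    # Step b03: proportion of white pixels in each sub-region.
    feats = []
    for gy in range(n):
        for gx in range(n):
            y0, y1 = gy * h // n, (gy + 1) * h // n
            x0, x1 = gx * w // n, (gx + 1) * w // n
            cells = [body[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cells) / len(cells) if cells else 0.0)
    return feats
```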
As a preferred technical solution of the invention, in step 005, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Apply a two-dimensional discrete Fourier transform to the numeric character block, then go to step c02;
Step c02. Apply a centering transformation to the transformed block: divide it evenly into four sub-region images and swap them diagonally to obtain the Fourier image spectrum, then go to step c03;
Step c03. Analyze the Fourier coefficients of the centered Fourier image spectrum and find the region in which the coefficients whose magnitude exceeds a preset amplitude threshold are concentrated; this region forms the large-coefficient region; then go to step c04;
Step c04. From the large-coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them to form the Fourier feature of the numeric character in the block.
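Steps c01 to c04 can be sketched in plain Python with a direct (slow) 2-D DFT, which is adequate for small character blocks. Two simplifications are assumptions: the large-coefficient region is taken to be a fixed square window around the centre of the shifted spectrum rather than being found with an amplitude threshold, and normalization divides the magnitudes by their maximum. The `radius` parameter and function names are hypothetical, and the block must be large enough to contain the window.

```python
import cmath


def dft2(img):
    """Direct two-dimensional discrete Fourier transform (step c01)."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    s += img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
            out[u][v] = s
    return out


def center_shift(spec):
    """Move the zero-frequency term to the centre (step c02).

    For even sizes this equals swapping the four quadrants diagonally.
    """
    h, w = len(spec), len(spec[0])
    return [[spec[(y - h // 2) % h][(x - w // 2) % w] for x in range(w)]
            for y in range(h)]


def fourier_feature(img, radius=1):
    """Normalized magnitudes in a window around the spectrum centre (c03-c04)."""
    spec = center_shift(dft2(img))
    cy, cx = len(spec) // 2, len(spec[0]) // 2
    mags = [abs(spec[y][x])
            for y in range(cy - radius, cy + radius + 1)
            for x in range(cx - radius, cx + radius + 1)]
    peak = max(mags) or 1.0                # avoid dividing by zero on an empty block
    return [m / peak for m in mags]
```

For a non-negative image the zero-frequency term has the largest magnitude, so the centre entry of the feature is 1 after normalization.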
As a preferred technical solution of the invention: in step 005, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block, then go to step d02;
Step d02. Apply invariant moment processing to the contour of the numeric character and extract a preset number of two-dimensional contour invariant moments, which form the contour moment feature of the numeric character in the block.
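The text does not say which invariant moments are used. As an illustration only, the sketch below assumes Hu's invariant moments and computes the first two of them over the contour pixels, where a contour pixel is a white pixel with at least one non-white 4-neighbour. The normalization exponent follows the usual region-moment formula, which is also an assumption. The block must contain at least one white pixel.

```python
def contour_pixels(img):
    """Step d01: white pixels that touch the background or the image border."""
    h, w = len(img), len(img[0])
    pts = []
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            nbrs = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(not (0 <= ny < h and 0 <= nx < w) or not img[ny][nx]
                   for ny, nx in nbrs):
                pts.append((x, y))
    return pts


def hu_first_two(img):
    """Step d02 (assumed variant): first two Hu invariants of the contour."""
    pts = contour_pixels(img)
    m00 = len(pts)
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00

    def eta(p, q):
        # Central moment normalized by m00 ** (1 + (p + q) / 2).
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        return mu / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Both values are unchanged when the character is shifted within the block or rotated by 90 degrees, which is the property that makes such moments useful as recognition features.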
As a preferred technical solution of the invention, step 007 specifically comprises the following steps:
Step e01. By enumerating combinations, pair every two recognition features among all recognition features of the numeric characters in all data images to form all recognition feature combinations, then go to step e02;
Step e02. Let all recognition features of the numeric characters in all data images form the sample set S corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition feature combination, according to the following formula (1):
obtain the feature complementarity index C_{ij,A} of that combination with respect to each standard digit "0" through "9", and thereby the feature complementarity indices of every combination with respect to the standard digits "0" through "9"; then go to step e03. Here S_i and S_j denote the sets of samples in S misclassified by recognition feature F_i and by recognition feature F_j, respectively; E(S) denotes the number of samples in S; E(S_i ∪ S_j) denotes the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) denotes the number of samples in the intersection of S_i and S_j; A ∈ {0, 1, …, 9}; and C_{ij,A} denotes the feature complementarity index, with respect to standard digit A, of the combination formed by F_i and F_j;
Step e03. For each recognition feature combination, according to the following formula (2):
obtain the overall complementarity index TC_k of the combination with respect to the standard digits, then go to step e04. Here k ∈ {1, …, K}, K is the total number of recognition feature combinations, and TC_k is the overall complementarity index of the k-th combination with respect to the standard digits;
Step e04. Sort all recognition feature combinations by overall complementarity index in descending order, take the top two combinations, and perform feature fusion on these two combinations to form the numeric recognition features corresponding to "0" through "9" in the hydrological data table.
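The pairing of step e01, the set counts used in step e02 and the ranking of step e04 can be sketched as follows. Formulas (1) and (2) are not reproduced in this text, so the sketch does not implement them: the scoring function is supplied by the caller, and the function names and dictionary keys are hypothetical.

```python
from itertools import combinations


def set_counts(total, s_i, s_j):
    """Quantities named in step e02 for one pair of features.

    total: E(S); s_i, s_j: sets of sample ids misclassified by each feature.
    """
    return {"E(S)": total,
            "union": len(s_i | s_j),          # E(S_i U S_j)
            "intersection": len(s_i & s_j)}   # E(S_i n S_j)


def top_two_pairs(misclassified, total, score):
    """Rank all feature pairs by a caller-supplied score and keep the best two.

    misclassified: dict mapping feature name -> set of misclassified sample ids.
    score: function of the set_counts dict; stands in for the patent's
           complementarity index, which this sketch does not define.
    """
    pairs = combinations(sorted(misclassified), 2)   # step e01
    ranked = sorted(
        pairs,
        key=lambda p: score(set_counts(total, misclassified[p[0]], misclassified[p[1]])),
        reverse=True)                                # step e04: descending order
    return ranked[:2]
```

Intuitively, two features complement each other when they rarely misclassify the same samples, so any index built from these counts should reward a small intersection.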
As a preferred technical solution of the invention, in step 008 the preset classifier is a support vector machine (SVM) classifier. Using the numeric recognition features corresponding to "0" through "9" in the hydrological data table and the recognition features of the numeric character in each block of each data image, it yields the digit corresponding to each numeric character block in each data image.
Compared with the prior art, the paper hydrological yearbook digitization method and control method of the invention, using the above technical solution, has the following technical effects. Building on single features, the method introduces a fusion of strongly complementary features, which raises the recognition rate. Hydrological processes are shaped by similar seasonal climate factors as well as other random factors and therefore exhibit similarity; that is, flow values are correlated with their context. Exploiting this correlation, the invention also proposes a time-series-based post-recognition error correction mechanism: after the classifier has produced its result, the result is corrected according to a defined criterion. Experiments show that the proposed mechanism effectively improves recognition accuracy while maintaining working efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the paper hydrological yearbook digitization method and control method designed by the invention;
Fig. 2a is a schematic diagram of the horizontal projection of the hydrological data table in the embodiment;
Fig. 2b is a schematic diagram of the vertical projection of the hydrological data table in the embodiment;
Fig. 3 is a schematic diagram of the table formed by the vertical and horizontal lines extracted from the hydrological data table in the embodiment;
Fig. 4 is a schematic diagram of the layout analysis of the hydrological yearbook in the embodiment;
Fig. 5 is a schematic diagram of obtaining the data image in each numeric cell of the hydrological data table in the embodiment;
Fig. 6 is a schematic diagram of the numeric character blocks in a data image obtained in the embodiment.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
In everyday business activity, large numbers of documents and forms are used daily. Tabular documents are also widely used in many fields, and they usually have to be processed by hand: customers must file tax payments, and librarians must collect the data contained in paper forms. With the development of optical character recognition (OCR), people have begun to extract the data in a form from standard form images, which shortens working time and lightens the workload. In business, OCR can improve the quality of work and cut the large amount of time spent processing form documents. In many OCR applications, a form template tells the user which printed target strings appear in the image. These strings cover many kinds of content, such as flow information, text, and mathematical formulas. The table lines themselves obstruct data extraction, so table line detection is an important task in printed form recognition.
In printed hydrological documents, tables are indispensable: they concentrate all of the document's information in one place and convey its meaning to the reader precisely, concisely, and in a standardized form. Inspection of the discharge tables of the major hydrological stations in the hydrological yearbook shows that the layout of these tables follows regular patterns, and these patterns can be used to segment the characters.
A hydrological yearbook is the medium in which hydrological agencies publish their monitoring results: the water bodies of the rivers in a basin are monitored, and the data are processed, compiled, and printed the following year. It contains the compiled results together with summary material presented in charts and any necessary explanatory text, and it forms a systematic, standardized repository of hydrological data.
In 1958, the Hydrology Bureau of the Ministry of Water Resources divided the country's hydrological data into volumes by river basin and water system, and gave the annual data the uniform title "Hydrological Yearbook of the People's Republic of China", published nationally in 10 volumes comprising 94 books. Its characteristics are as follows.
Color features: black characters on a yellow background.
Structural features: the paper is 440 mm wide and 140 mm high, an aspect ratio of 3.14. Digits in the yearbook are about 15 mm wide and 24 mm high, an aspect ratio of 0.625. The characters lie inside the table.
Texture features: the yearbook contains character-like regions, in which the color intensity of the digits shows regular peak-and-valley variation in both the horizontal and vertical directions.
Hydrological yearbook characters are arranged regularly in multiple horizontal rows and have fairly stable structural and texture features. The projection-based top-down layout analysis method exploits this property. In the character regions of the yearbook, edge information is very rich, and detecting and analyzing it with suitable tools separates the hydrological data from the background. Pixel values in the yearbook's character regions fluctuate in a characteristic way, and the frequency of the fluctuation stays within a certain range; these features make it possible to locate the yearbook characters. Because the digit regions of the yearbook are richer in horizontal and vertical features than the non-digit regions, a character localization algorithm based on horizontal and vertical projection is proposed. The transition points of the projections are found, and candidate character regions are determined from the number of transition points and the distances between them.
The page is blank for roughly the first 275 pixels below its top edge. Next come the basin name and hydrological station name, followed by the title "daily mean discharge table". About 30 pixels from this title, the catchment area and the unit of flow are given, and about 20 pixels further on the table begins. Every hydrological yearbook table consists of 11 horizontal lines and 14 vertical lines. The months are labelled between the first two horizontal lines and the days of the month between the first two vertical lines; the areas between each subsequent pair of vertical lines and before the third horizontal line hold the flow values of each month. Between the subsequent horizontal lines are each month's mean flow value, the dated maximum and minimum flow values, the annual statistics, and the notes. The ultimate aim is to recognize the flow values, so the layout of the hydrological data must first be analyzed: the table structure is examined and the table frame lines are extracted so that the flow values of each month can be located.
As shown in Fig. 1, the present invention provides a paper hydrological yearbook digitization method. The hydrological data table on a page of the paper yearbook is first photographed to obtain a table image, which is then preprocessed by binarization, grayscale conversion, denoising, rotation, and color inversion. The following steps are then carried out on the preprocessed table image:
Step 001. Drawing on research into document layout analysis algorithms, this method builds on the two typical layout segmentation approaches (top-down and bottom-up) and combines their advantages by using structural features and texture features together to process the page layouts of the hydrological yearbook. This takes both segmentation accuracy and processing time into account, so the table can be located quickly and accurately. According to the layout design of the paper yearbook page, determine the pixel position of the hydrological data table on the page, then go to step 002.
Step 002. Using the pixel position of the hydrological data table on the paper yearbook page, project the table horizontally and vertically; the horizontal projection is shown in Fig. 2a and the vertical projection in Fig. 2b. Analyze the two projections separately. In Fig. 2a, the 11 black dots mark the horizontal lines of the yearbook table; the hollow dots after the second black dot mark the top and bottom positions of each row of flow values, and the two sides of each subsequent peak mark the top and bottom positions of the rows of flow values for the 1st through the 31st day. In Fig. 2b, the 14 black dots mark the abscissas of the table's 14 vertical lines; between each pair of black dots, that is, between each pair of vertical lines, the two sides of the peak give the left and right coordinates of that month's flow values and are marked with hollow dots. Extract the abscissa of each vertical line and the ordinate of each horizontal line in the table; the result for the practical embodiment is shown in Fig. 3. In each data image of the table the numeric characters are white and the background is black. Fig. 2a and Fig. 2b thus give a rough location for each month's flow values and for the table itself; the final result of the yearbook layout analysis is shown in Fig. 4. Then go to step 003.
Counting the black pixels in each row or column avoids detecting straight line segments directly and places little demand on the connectivity of the table lines, so the approach is robust to interference and generalizes well. The projections reveal useful information such as the position and size of targets in the image, which simplifies the subsequent localization of the yearbook digits.
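The projection analysis of step 002 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes a binary image in which table lines and characters are 1 and the background is 0, and the helper names `projection_profiles` and `find_line_positions`, as well as the `min_fraction` threshold, are hypothetical.

```python
import numpy as np

def projection_profiles(binary):
    """Return (row_sums, col_sums): foreground pixel counts per row and per column."""
    binary = np.asarray(binary)
    return binary.sum(axis=1), binary.sum(axis=0)

def find_line_positions(profile, min_fraction=0.8):
    """Return the centre index of each run where the profile reaches
    at least min_fraction of its maximum (assumed to be a table line)."""
    profile = np.asarray(profile, dtype=float)
    if profile.max() == 0:
        return []
    mask = profile >= min_fraction * profile.max()
    positions, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i                      # a run of line pixels begins
        elif not on and start is not None:
            positions.append((start + i - 1) // 2)
            start = None
    if start is not None:                  # run reaches the end of the profile
        positions.append((start + len(mask) - 1) // 2)
    return positions

# Toy table: two horizontal lines (rows 1 and 7) and two vertical lines (columns 2 and 10).
img = np.zeros((9, 12), dtype=int)
img[1, :] = 1
img[7, :] = 1
img[:, 2] = 1
img[:, 10] = 1
rows, cols = projection_profiles(img)
horizontal_lines = find_line_positions(rows)
vertical_lines = find_line_positions(cols)
```

On a real yearbook page the threshold would need tuning, since long rows of digits also produce peaks in the profiles.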
Step 003. Using the layout of the hydrological data table together with the abscissa of each vertical line and the ordinate of each horizontal line, obtain from the projected image of the table the data image in each numeric cell; the practical embodiment is shown in Fig. 5. Then go to step 004. In each data image of the table the numeric characters are white and the background is black.
In digitizing paper water-level data, the accuracy of the features extracted later can only be guaranteed if the hydrological data images are segmented well and adaptively. Segmentation of the paper water-level data images is the foundation of the whole digitization process. After the digits have been located, the image is still a single unit that includes the blank space between digits. The extracted digit group therefore has to undergo character segmentation, which separates the individual characters from the group.
Step 004. For each data image, segment the numeric characters in the data image to obtain the individual numeric character blocks in it, specifically by the following steps:
Step a01. Detect the white pixels inside each numeric character in the data image, together with the white pixels on the characters that lie at the minimum distance from each edge of the data image, then go to step a02;
Step a02. For each white pixel obtained from the data image in the previous step, check whether the pixels directly above, below, left, and right of it are all white. If so, the pixel is judged to be an interior pixel of a numeric character; otherwise it is flagged as an edge pixel of the character, and the number of the pixel column it occupies in the data image is recorded. Applying this test to every white pixel obtained in the previous step yields the column numbers of the edge pixels of each numeric character in the data image; then go to step a03;
Step a03. Using the column numbers of the edge pixels of each numeric character in the data image, divide the numeric characters in the data image to obtain the individual numeric character blocks.
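Steps a01 to a03 might be sketched as below. This is only an illustration under stated assumptions: the cell image is binary with white characters as 1, pixels outside the image count as non-white, and characters are split wherever the edge-pixel columns are not contiguous (the source does not spell out the splitting rule). The function names are hypothetical.

```python
import numpy as np

def edge_pixel_columns(cell):
    """Columns holding white pixels that have at least one non-white
    4-neighbour (the edge test of step a02)."""
    img = np.asarray(cell, dtype=bool)
    p = np.pad(img, 1, constant_values=False)
    # A pixel is interior when its upper, lower, left and right neighbours are all white.
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    edge = img & ~interior
    return np.flatnonzero(edge.any(axis=0))

def split_characters(cell):
    """Return (first_col, last_col) for each character block, splitting
    wherever consecutive edge columns are separated by a gap (assumed rule)."""
    cols = edge_pixel_columns(cell)
    if cols.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(cols) > 1)
    starts = np.concatenate(([cols[0]], cols[breaks + 1]))
    ends = np.concatenate((cols[breaks], [cols[-1]]))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Two toy "characters" separated by blank columns.
cell = np.zeros((5, 9), dtype=int)
cell[1:4, 1:3] = 1
cell[1:4, 5:8] = 1
blocks = split_characters(cell)
```

Touching or broken characters would defeat this simple gap rule; it is meant only to make the column-based idea concrete.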
Following the above procedure, the numeric character blocks in each data image are obtained; the numeric character blocks obtained from a data image in the practical embodiment are shown in Fig. 6. Then go to step 005.
If the preprocessed data were fed directly to the classifier, the classification would have to handle a large amount of data. The purpose of feature extraction is to analyze the topology of a digit and extract certain structural features from it, so that disturbances such as displacement, size change, and glyph distortion have relatively less effect. In other words, the classifier is given the key information that characterizes the digit, which indirectly increases its fault tolerance, and the amount of data is greatly reduced after feature extraction. Feature extraction plays a key role in recognition and should follow these principles:
(1) the features are easy to extract;
(2) the features discriminate strongly, that is, they differ widely between different digits and as little as possible between instances of the same digit;
(3) the features are highly stable, minimizing the effect of broken or touching strokes.
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature, and the contour moment feature of the numeric character in the block; together these form the recognition features of that numeric character. In this way the recognition features of the numeric character in every numeric character block of every data image are obtained. Then go to step 006.
The grid feature is a set of distribution features that describe the character image as a whole, and it suppresses noise very well. The main idea is to divide the digit's dot matrix into several small local regions and use the dot density of each region as a descriptive feature, that is, the percentage of character pixels in each region is taken as the feature data. The grid feature reflects local statistics of the image and is a relative percentage, whereas local deformation or noise corresponds to individual elements of the dot matrix switching between "0" and "1". So for an image with local deformation or noise, the computed percentages change little compared with the undistorted, noise-free image. In other words, this relative value is insensitive to deformation of local strokes or to isolated noise points in the digit image, and digit recognition based on the grid feature is therefore robust to noise. Each segmented digit here is divided into a 3×3 grid, giving 9 small regions in total.
In step 005 above, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the top, bottom, left, and right boundaries of the numeric character block and from them the image of the numeric character body, then go to step b02.
Step b02. Apply centroid normalization to the numeric character body image, divide the normalized image evenly into a preset number of sub-region images, then go to step b03.
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character body image; together these proportions form the grid feature of the numeric character in the block.
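A minimal sketch of steps b01 to b03 follows, assuming a binary block with white character pixels as 1 and the 3×3 grid mentioned above. The centroid normalization of step b02 is omitted for brevity, and `grid_feature` is a hypothetical name.

```python
import numpy as np

def grid_feature(block, grid=3):
    """Proportion of white pixels in each cell of a grid x grid partition
    of the character's bounding box (row-major order)."""
    img = np.asarray(block) > 0
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    if rows.size == 0:                     # empty block: no character pixels
        return np.zeros(grid * grid)
    # Step b01: crop to the top, bottom, left and right boundaries.
    body = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    feats = []
    # Steps b02/b03: split as evenly as possible, then measure white-pixel density.
    for band in np.array_split(body, grid, axis=0):
        for cell in np.array_split(band, grid, axis=1):
            feats.append(cell.mean() if cell.size else 0.0)
    return np.array(feats)

# Toy block: a two-pixel-wide vertical bar plus one pixel in the top-right corner.
block = np.zeros((6, 6), dtype=int)
block[:, :2] = 1
block[0, 5] = 1
feat = grid_feature(block)
```

Because each entry is a ratio, flipping a single noisy pixel changes one entry only slightly, which is the robustness argument made in the text.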
The Fourier transform is a two-dimensional orthogonal transform widely used in image processing. After the transform, the mean value, that is, the DC term, is proportional to the mean gray value of the image, while the low-frequency components indicate the strength and direction of the edges of objects in the image. A numeric character can generally be represented by a closed contour made up of many line segments, and the discrete quantities obtained by the mapping fully reflect the variation of these closed contours. Fourier coefficients describe the boundary contour of an image well, and their values are independent of the translation, rotation, displacement, and size of similar glyphs. For glyph representation and recognition, these features provide substantial data compression.
In step 005 above, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Apply a two-dimensional discrete Fourier transform to the numeric character block, then go to step c02.
Step c02. Apply a centering transformation to the transformed numeric character block, that is, divide it evenly into four sub-region images and swap them diagonally, to obtain the Fourier image spectrum, then go to step c03.
Step c03. Analyze the Fourier coefficients of the centered Fourier image spectrum and find the region in which the coefficients of the numeric character block that exceed a preset amplitude threshold are concentrated; this forms the large-magnitude Fourier coefficient region. Then go to step c04.
Step c04. From the large-magnitude Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them to form the Fourier feature of the numeric character in the block.
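Steps c01 to c04 can be illustrated with NumPy as below. The sketch assumes that the large-magnitude region of step c03 is a small square window around the centre of the shifted spectrum and that normalization means dividing by the largest magnitude in that window; both are assumptions, since the source gives neither the threshold nor the normalization. `fourier_feature` and `half_width` are hypothetical names.

```python
import numpy as np

def fourier_feature(block, half_width=2):
    """Normalized magnitudes of the (2*half_width+1)^2 coefficients
    around the centre of the shifted 2-D DFT spectrum."""
    img = np.asarray(block, dtype=float)
    size = 2 * half_width + 1
    if img.shape[0] < size or img.shape[1] < size:
        raise ValueError("block is smaller than the coefficient window")
    # Step c01: 2-D DFT. Step c02: fftshift swaps the quadrants diagonally,
    # moving the DC term to index (rows // 2, cols // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    mag = np.abs(spectrum)
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    # Step c03 (assumed form): keep a window centred on the DC term.
    window = mag[cy - half_width:cy + half_width + 1,
                 cx - half_width:cx + half_width + 1]
    # Step c04: normalize the selected coefficients.
    peak = window.max()
    return (window / peak).ravel() if peak > 0 else window.ravel()

# Toy block: a solid rectangle on a black background.
block = np.zeros((8, 8))
block[2:6, 3:5] = 1.0
feat = fourier_feature(block)
```

For a non-negative image the DC magnitude is the largest in the spectrum, so the centre entry of the normalized feature is 1 and the remaining entries are relative to it.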
The invariant moment feature is a statistical feature of an image: a mathematical feature that is invariant to translation, scaling, and rotation.
In step 005 above, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block, then go to step d02.
Step d02. Apply invariant moment processing to the contour of the numeric character in the block and extract a preset number of two-dimensional contour invariant moment features, which form the contour moment feature of the numeric character in the block.
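One possible reading of steps d01 and d02 is sketched below. The source does not say which invariant moments are used; this example assumes the first two of Hu's invariants, computed on contour pixels with the boundary normalization exponent p+q+1. The contour is taken as the white pixels with at least one non-white 4-neighbour. Function names are hypothetical.

```python
import numpy as np

def contour_mask(block):
    """Step d01 (assumed form): white pixels with a non-white 4-neighbour."""
    img = np.asarray(block) > 0
    p = np.pad(img, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return img & ~interior

def hu_first_two(mask):
    """Step d02 (assumed form): first two Hu invariants of the contour pixels."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))
    if m00 == 0:
        return np.zeros(2)
    dx, dy = xs - xs.mean(), ys - ys.mean()   # central coordinates

    def eta(p, q):
        # Normalized central moment; exponent p+q+1 is the boundary variant.
        return (dx ** p * dy ** q).sum() / m00 ** (p + q + 1)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([phi1, phi2])

# Toy L-shaped character, plus a rotated and a shifted copy.
block = np.zeros((7, 7), dtype=int)
block[1:6, 1:3] = 1
block[4:6, 1:6] = 1
f_orig = hu_first_two(contour_mask(block))
f_rot = hu_first_two(contour_mask(np.rot90(block)))
f_shift = hu_first_two(contour_mask(np.pad(block, ((2, 0), (3, 0)))))
```

On a pixel grid the translation and 90-degree rotation invariance hold exactly; scale invariance is only approximate because the contour pixel count does not scale exactly with contour length.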
Step 006. For each numeric character block in each data image, check whether a preset number of black pixels is present running downward from the top edge of the block. If so, the block is judged to contain a decimal point; otherwise no further action is taken. Once every numeric character block in every data image has been checked, go to step 007;
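The decimal-point test of step 006 could look like the sketch below. It interprets the rule as "the block has at least a preset number of all-black rows at its top", which is an assumption about the source's wording; `is_decimal_point` and `min_blank_rows` are hypothetical names.

```python
import numpy as np

def is_decimal_point(block, min_blank_rows):
    """True when the first white pixel appears at least min_blank_rows
    below the top edge of the block (white = character, black = background)."""
    img = np.asarray(block) > 0
    row_has_white = img.any(axis=1)
    if not row_has_white.any():
        return False                       # empty block: treated as not a decimal point
    first_white_row = int(np.argmax(row_has_white))
    return first_white_row >= min_blank_rows

digit = np.ones((10, 4), dtype=int)        # a full-height stroke
dot = np.zeros((10, 4), dtype=int)
dot[8:10, 1:3] = 1                         # a small blob at the bottom of the cell
```

The idea relies on a digit filling the full height of its block while a decimal point sits near the baseline.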
If the recognition features obtained in the above steps are each classified separately with a neural network or a support vector machine classifier, the results are unsatisfactory. The main reason is that it is hard to find a single feature that suits all of the different digits, and earlier methods extract and fuse features for one specific digit recognition application. Each digit has its own characteristics, so correct classification requires a combination of features. Complementarity between features is the key to ensuring that the extracted features achieve a high recognition rate and generalize well, and it is the basis for feature fusion. The problem of measuring feature complementarity must therefore be solved before features are fused.
Step 007. Fuse the recognition features of the numeric characters in all data images to form the numeric recognition features corresponding to "0" through "9" in the hydrological data table, then go to step 008.
Step 007 above specifically comprises the following steps:
Step e01. For all recognition features of the numeric characters in all data images, form every combination of two recognition features to obtain the full set of recognition feature combinations, then go to step e02.
Step e02. Form the sample set S, corresponding to the digits "0" through "9" in the hydrological data table, from all recognition features of the numeric characters in all data images. Then, for each recognition feature combination, according to the following formula (1):
obtain the feature complementarity index C_{ij,A} of this combination with respect to each standard digit "0" through "9", and thereby the indices C_{ij,A} of every combination; then go to step e03. A larger C_{ij,A} means that recognition features F_i and F_j are more strongly complementary with respect to standard digit A; a smaller value means weaker complementarity. Here S_i and S_j denote the subsets of the sample set S misclassified by recognition feature F_i and recognition feature F_j respectively; E(S) is the number of samples in S; E(S_i ∪ S_j) is the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) is the number of samples in the intersection of S_i and S_j; A = {0, 1, …, 9}; and C_{ij,A} is the feature complementarity index, with respect to standard digit A, of the combination formed by F_i and F_j.
Step e03. For each recognition feature combination, according to the following formula (2):
obtain the overall complementarity index TC_k of each combination with respect to the standard digits, then go to step e04. Here k = {1, …, K}, K is the number of recognition feature combinations, and TC_k is the overall complementarity index of the k-th combination with respect to the standard digits.
Step e04. Sort all recognition feature combinations by overall complementarity index in descending order, take the top two combinations, and fuse the features of these two combinations to form the numeric recognition features corresponding to "0" through "9" in the hydrological data table.
In this technical solution, the different features are each used for classification, the recognition results of the individual features are analyzed, the overall complementarity index of each feature combination is computed with the above formulas, and the selected features are then fused through a linear relationship. Experiments show that the coarse grid feature and the Fourier feature recognize the digits of the hydrological yearbook very well and are strongly complementary overall, so the Fourier feature is concatenated after the coarse grid feature. In the experiments, the recognition rate of the proposed fused feature is 3.8981% higher than that of the Fourier feature alone, 1.4033% higher than that of the grid feature, and 83.1956% higher than that of the contour moment.
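The concatenation described above, with the Fourier feature appended after the coarse grid feature, amounts to the following. This is a minimal sketch; the feature lengths (9 and 25) are illustrative assumptions and `fuse_features` is a hypothetical name.

```python
import numpy as np

def fuse_features(grid_feat, fourier_feat):
    """Concatenate the Fourier feature after the grid feature."""
    return np.concatenate([np.ravel(grid_feat), np.ravel(fourier_feat)]).astype(float)

# Illustrative inputs: a 9-element grid feature and a 25-element Fourier feature.
fused = fuse_features(np.full(9, 0.5), np.ones(25))
```

Since both parts are normalized to comparable ranges before fusion, neither dominates purely through scale when the fused vector is passed to the classifier.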
步骤008.根据水文资料表格中分别对应“0”到“9”的数值识别特征,以及各个数据图像中各个数值字符块中数值字符的识别特征,通过支持向量机(SVM)分类器,分别获得各个数据图像中各个数值字符块所对应的数字,然后进入步骤009。Step 008. According to the numerical identification features respectively corresponding to "0" to "9" in the hydrological data table, and the identification features of numerical characters in each numerical character block in each data image, through a support vector machine (SVM) classifier, respectively obtain The number corresponding to each numeric character block in each data image, and then enter step 009.
步骤009.根据各个数据图像中各个数值字符块所对应的数字或小数点,分别构成水文资料表格各个数值单元格中数据图像所对应的数值,再结合水文资料表格版式的各项属性,获得水文资料表格中各项属性,及其所对应的数值,并进行存储;然后进入步骤010。Step 009. According to the number or decimal point corresponding to each numerical character block in each data image, respectively form the numerical value corresponding to the data image in each numerical value cell of the hydrological data table, and then combine various attributes of the hydrological data table layout to obtain the hydrological data The attributes in the table and their corresponding values are stored; then go to step 010.
本文通过分析流量的规律,根据时间序列提出了后期排错机制。通过实验结果可知,水文年鉴的最终识别结果接近99%,错误率相对来说较低,一个流量值由4至5个数字组成,若其中一个数字识别有误,即认为结果有误,这和以往的数据集MNIST,USPS上的识别结果的错误率统计还是稍有不同的。观察识别结果可知,一个流量值一般只有一个数字识别错误,而且每个月份识别错误的流量值在3个以内,这样的话如果我们能通过一定的算法思想找到识别可靠度不高的流量值,也即找到流量值的小数点前的数字的关键位置的识别错误,通过统计每月流量的变化规律,利用平均值法进行纠错,将带来很高的应用效率。This paper analyzes the law of traffic and proposes a post-debugging mechanism based on time series. It can be seen from the experimental results that the final recognition result of the Hydrological Yearbook is close to 99%, and the error rate is relatively low. A flow value is composed of 4 to 5 numbers. If one of the numbers is wrongly recognized, the result is considered to be wrong. This is the same as The error rate statistics of the recognition results on the previous data sets MNIST and USPS are still slightly different. Observing the recognition results, we can see that there is generally only one digital misrecognition error for a flow value, and the number of misidentified flow values per month is less than 3. In this case, if we can find flow values with low recognition reliability through certain algorithmic ideas, we can also That is to find the identification error of the key position of the number before the decimal point of the flow value, and use the average value method to correct the error by counting the change rule of the monthly flow, which will bring high application efficiency.
因为得到流量的本身也是通过仪器测量得到的,本身也存在一定的误差,因此若流量在一定小范围内波动的情况下,也即在流量值的小数点后的数字识别有误的情况下,在不影响流量数据的分析和应用的前提下,我们是可以容忍的。即不认为其识别有误。Because the flow rate itself is also obtained through instrument measurement, there are certain errors in itself. Therefore, if the flow rate fluctuates within a certain small range, that is, when the number after the decimal point of the flow value is incorrectly identified, in the As long as it does not affect the analysis and application of traffic data, we can tolerate it. That is, it is not considered to be misidentified.
Step 010. For the attributes and corresponding values of the recognized and stored hydrological data table, perform steps 010-01 to 010-02 below on the flow values of each month, thereby obtaining a preliminary recognition judgment for each daily flow value of each month. Then proceed to step 011.
Step 010-01. Take the flow value of the first day of the month as the first threshold. Then, for each of the flow values of the first two days of the month, determine whether the difference between the next day's flow value and the current day's flow value is less than the first threshold. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be a preliminary recognition error. This yields a preliminary recognition judgment for the flow values of the first two days of the month. Then proceed to step 010-02.
Step 010-02. For each daily flow value from the third day of the month onward, determine whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be a preliminary recognition error. This yields a preliminary recognition judgment for each daily flow value from the third day of the month onward.
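Steps 010-01 and 010-02 can be sketched in Python as below. The sketch makes two assumptions that the text does not state: the "difference" is taken as an absolute difference, and the last day of the month, which has no next-day value, is left unflagged. The function name `flag_suspect_flows` is hypothetical.

```python
def flag_suspect_flows(flows):
    """Return one boolean per day; True marks a preliminary recognition error.

    `flows` holds the recognized daily flow values of a single month, in order.
    """
    n = len(flows)
    flags = [False] * n
    if n < 2:
        return flags
    first_threshold = flows[0]  # step 010-01: first day's flow is the first threshold
    for day in range(n - 1):    # assumption: the last day has no next day, so it is skipped
        diff = abs(flows[day + 1] - flows[day])  # assumption: absolute difference
        if day < 2:
            threshold = first_threshold  # step 010-01: first two days
        else:
            threshold = flows[day - 1]   # step 010-02: previous day's flow value
        flags[day] = not (diff < threshold)
    return flags


month = [100.0, 102.0, 101.0, 901.0, 103.0, 104.0]
print(flag_suspect_flows(month))
# → [False, False, True, True, False, False]
```

In this example the misread value 901.0 causes both itself and the day before it to be flagged, because the rule compares each day with its successor. The digit-level confidence check in the later steps is what separates the genuinely misrecognized value from its neighbour.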
Step 011. Based on each value in the recognized and stored hydrological data table and on the recognition features of each digit in each value, use the support vector machine trainer to obtain, for each digit of each value, the ten recognition-result probabilities corresponding to "0" through "9". Then proceed to step 012.
Step 012. For each digit of each value in the recognized and stored hydrological data table, obtain the largest and the second-largest of the ten recognition-result probabilities for "0" through "9", and compute the difference between them. Determine whether this difference is less than a preset recognition-result probability threshold, which is set in the range 0.1 to 0.25. If so, the digit is judged to be a preliminary recognition error; otherwise the digit is judged to be correctly recognized. This yields a preliminary recognition judgment for each digit of each value in the table. Then proceed to step 013.
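The margin test of Step 012 can be sketched in Python as follows. The function name `is_digit_suspect` and the default threshold of 0.2 are assumptions for illustration; the text only states that the threshold lies between 0.1 and 0.25. The probability vectors are made-up example inputs, not experimental data.

```python
def is_digit_suspect(probs, margin_threshold=0.2):
    """Return True when the top-two probability margin is below the threshold.

    `probs` is the list of ten class probabilities for "0".."9" from the classifier.
    """
    if len(probs) != 10:
        raise ValueError("expected ten class probabilities")
    top, second = sorted(probs, reverse=True)[:2]
    return (top - second) < margin_threshold


# Hypothetical classifier outputs: an ambiguous 6-versus-9 case and a confident "3".
ambiguous = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.02, 0.02, 0.39]
confident = [0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
print(is_digit_suspect(ambiguous), is_digit_suspect(confident))
# → True False
```

Using the margin between the two most likely classes, rather than the top probability alone, flags exactly those digits where the classifier hesitates between two candidates.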
Step 013. For each flow value in each month that was judged a preliminary recognition error, determine whether it contains a digit that was judged a preliminary recognition error. There are two cases:
If it does, the flow value is judged to be erroneous and an alarm is raised. The position of the suspect digit within the flow value is then analyzed. If the suspect digit lies in the integer part, the flow value is replaced by the average of the flow values of the preceding day and the following day. If the suspect digit lies in the fractional part, the fractional part of the flow value is replaced by the average of the fractional parts of the flow values of the preceding day and the following day.
Otherwise, the flow value is judged to be correct. This completes the verification of each value in the recognized and stored hydrological data table.
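The correction rule of Step 013 can be sketched in Python as below. Several details are assumptions of the sketch rather than statements of the patent: the function name `correct_month` and its list-based inputs are hypothetical, the first and last day of the month are skipped because they lack one neighbour, an integer-part suspect takes precedence when both parts contain suspect digits, and flow values are assumed non-negative.

```python
import math


def correct_month(flows, value_flagged, int_digit_suspect, frac_digit_suspect):
    """Apply the neighbour-average correction to one month of daily flow values.

    value_flagged[d]      : day d was a preliminary error in the time-series check
    int_digit_suspect[d]  : day d has a low-confidence digit before the decimal point
    frac_digit_suspect[d] : day d has a low-confidence digit after the decimal point
    Returns (corrected_flows, alarm_days).
    """
    corrected = list(flows)
    alarms = []
    for day in range(1, len(flows) - 1):  # assumption: both neighbours must exist
        if not value_flagged[day]:
            continue
        prev_v, next_v = flows[day - 1], flows[day + 1]
        if int_digit_suspect[day]:
            # Suspect digit in the integer part: replace the whole value.
            corrected[day] = (prev_v + next_v) / 2
            alarms.append(day)
        elif frac_digit_suspect[day]:
            # Suspect digit in the fractional part: replace only the fraction.
            frac = (prev_v % 1 + next_v % 1) / 2
            corrected[day] = math.floor(flows[day]) + frac
            alarms.append(day)
        # No suspect digit: the flagged value is accepted as correct.
    return corrected, alarms


flows = [100.0, 102.0, 101.0, 901.0, 103.0, 104.0]
flagged = [False, False, True, True, False, False]
int_suspect = [False, False, False, True, False, False]
frac_suspect = [False] * 6
print(correct_month(flows, flagged, int_suspect, frac_suspect))
# → ([100.0, 102.0, 101.0, 102.0, 103.0, 104.0], [3])
```

Day index 2 is flagged by the time-series check but has no low-confidence digit, so it is kept unchanged, while the misread 901.0 is replaced by the average of its neighbours.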
Experimental comparison shows that, in the paper hydrological yearbook digitization method of the present invention, feature fusion achieves a higher recognition rate than any single feature. The Fourier feature alone recognizes the digit 0 well but recognizes 6 and 9 poorly, whereas the coarse grid feature recognizes 0 poorly but recognizes 6 and 9 well; the contour moment feature recognizes 0, 6, and 8 poorly. The three features give broadly consistent results on the other digits. Computing the complementarity index between features shows that the fusion of the Fourier and coarse grid features discriminates well between different digits. Fusing a feature that describes a digit's boundary contour with one that describes its interior describes the whole digit more completely, from the inside out, and is sufficient to represent the digit, so a better recognition result is obtained.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. The present invention is not limited to these embodiments, however, and various changes may be made within the knowledge of a person of ordinary skill in the art without departing from the gist of the present invention.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610232680.9A CN105938547B (en) | 2016-04-14 | 2016-04-14 | A digital method for paper hydrological yearbook |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105938547A (en) | 2016-09-14 |
CN105938547B (en) | 2019-02-12 |
Family
ID=57151427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610232680.9A Expired - Fee Related CN105938547B (en) | 2016-04-14 | 2016-04-14 | A digital method for paper hydrological yearbook |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105938547B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3582734B2 (en) * | 1993-07-14 | 2004-10-27 | 富士通株式会社 | Table vectorizer |
CN103996057A (en) * | 2014-06-12 | 2014-08-20 | 武汉科技大学 | Real-time handwritten digital recognition method based on multi-feature fusion |
CN105184265A (en) * | 2015-09-14 | 2015-12-23 | 哈尔滨工业大学 | Self-learning-based handwritten form numeric character string rapid recognition method |
CN105426834A (en) * | 2015-11-17 | 2016-03-23 | 中国传媒大学 | Projection feature and structure feature based form image detection method |
Non-Patent Citations (2)
Title |
---|
刘昱: "《印刷体表格识别的研究》", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
张世平: "《水文年鉴数据的智能识别》", 《人民珠江》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805076A (en) * | 2018-06-07 | 2018-11-13 | 浙江大学 | The extracting method and system of environmental impact assessment report table word |
CN108805076B (en) * | 2018-06-07 | 2021-01-08 | 浙江大学 | Method and system for extracting table characters of environmental impact evaluation report |
CN109190611A (en) * | 2018-08-14 | 2019-01-11 | 江西师范大学 | Pedigree system makes are compiled in a kind of internet based on crowdsourcing |
CN111060527A (en) * | 2019-12-30 | 2020-04-24 | 歌尔股份有限公司 | Character defect detection method and device |
CN111060527B (en) * | 2019-12-30 | 2021-10-29 | 歌尔股份有限公司 | Character defect detection method and device |
US12002198B2 (en) | 2019-12-30 | 2024-06-04 | Goertek Inc. | Character defect detection method and device |
CN113436117A (en) * | 2021-08-03 | 2021-09-24 | 东莞理工学院 | Hydrology long sequence data extraction method based on image recognition |
CN113436117B (en) * | 2021-08-03 | 2022-11-25 | 东莞理工学院 | A Method of Extracting Hydrological Long Sequence Data Based on Image Recognition |
Also Published As
Publication number | Publication date |
---|---|
CN105938547B (en) | 2019-02-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20190212 |