CN105938547A

CN105938547A - Paper hydrologic yearbook digitalization method

Info

Publication number: CN105938547A
Application number: CN201610232680.9A
Authority: CN
Inventors: 李士进; 陈婉婉; 郑展; 郝立; 蒋亚平; 高祥涛; 胡金龙
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2016-04-14
Filing date: 2016-04-14
Publication date: 2016-09-14
Anticipated expiration: 2036-04-14
Also published as: CN105938547B

Abstract

The invention relates to a method for digitizing paper hydrological yearbooks. Building on single-feature recognition, it proposes a feature fusion method that combines highly complementary features, which raises the recognition rate. Because hydrological processes are shaped by similar seasonal climatic factors and other random factors, they resemble one another; that is, flow values are contextually correlated. Exploiting this correlation, the invention also proposes a post-recognition error correction mechanism based on time series: after the classifier has produced its result, errors are corrected according to a defined criterion. Experiments show that the mechanism effectively improves recognition accuracy while maintaining working efficiency.

Description

A method for digitizing paper hydrological yearbooks

Technical field

The invention relates to a method for digitizing paper hydrological yearbooks, and belongs to the interdisciplinary field of computer image processing and hydrology.

Background

Paper hydrological yearbooks record the most basic hydrological survey data. These data carry information on the long-term evolution of nature and on the impact of human activity, and they play an important role in production, scientific research and public services. Because the yearbooks are old, frequently used and poorly stored, the paper copies have gradually begun to deteriorate, and any further man-made or natural damage would cause irreparable loss. Rescuing these precious historical materials has become an urgent problem. The most effective way to protect the yearbooks is to scan and process them digitally to form electronic archives. The prior art has studied yearbook digitization on this basis and proposed intelligent recognition of yearbook data; recognizing the numbers in hydrological data (that is, digit character recognition) is a key task in digitizing hydrological data.

Hydrological data are published year by year and presented in a unified, scientific tabular form. The content is mainly the commonly needed basic hydrological data measured in the previous year and strictly compiled and reviewed. In the table, the horizontal row lists the months and the vertical column lists the days of each month; the bottom of the table consists of the monthly average flow, maximum flow, minimum flow, annual statistics and notes. Therefore, before recognizing the yearbook digits, the method first analyzes the page layout and extracts the table lines.

The digit characters in hydrological yearbooks are fairly standardized and have few strokes, so extracting features from them is easier than from Chinese characters. However, their shapes vary little and they carry little stroke information, which in a sense makes it harder to extract effective feature vectors. For example, when the ink is heavy, the upper half of a "6" in the Bai Zheng Song typeface sometimes closes into a small circle and becomes almost indistinguishable from an "8". Likewise, when the ink is heavy or the type is too small, "1" and "3", or "2" and "7", may yield the same feature vector. In practice, therefore, recognizing hydrological data with existing techniques suffers from low accuracy and low efficiency.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for digitizing paper hydrological yearbooks that uses a new feature fusion design, effectively improves the recognition rate, and maintains working efficiency.

To solve the above technical problem, the present invention adopts the following technical solution: a method for digitizing paper hydrological yearbooks, comprising the following steps:

Step 001. According to the layout of the paper hydrological yearbook page, determine the pixel position of the hydrological data table within the page, then go to step 002;

Step 002. Using the pixel position of the hydrological data table, project the table vertically and horizontally, analyze the vertical and horizontal projection profiles separately, and extract the abscissa of each vertical line and the ordinate of each horizontal line of the table, then go to step 003;

Step 003. According to the table layout, the abscissas of the vertical lines and the ordinates of the horizontal lines, obtain from the projected image of the table the data image in each numerical cell, then go to step 004; in each data image the numerical characters are white and the background is black;

Step 004. For each data image, segment the numerical characters in it to obtain the numerical character blocks of that data image, and thereby obtain the numerical character blocks of every data image, then go to step 005;

Step 005. For each numerical character block in each data image, extract the grid feature, Fourier feature and contour moment feature of the numerical character in the block, which together serve as the recognition features of that character, and thereby obtain the recognition features of the numerical character in every block of every data image, then go to step 006;

Step 006. For each numerical character block in each data image, determine whether a preset number of black pixels exists counting downward from the top edge of the block; if so, the block is judged to contain a decimal point; otherwise no further operation is performed. After every numerical character block in every data image has been examined, go to step 007;

Step 007. Perform feature fusion on all recognition features of the numerical characters in all data images to form the numerical recognition features corresponding to "0" through "9" in the hydrological data table, then go to step 008;

Step 008. From the numerical recognition features corresponding to "0" through "9" and the recognition features of the numerical character in each block of each data image, obtain through a preset classifier the digit corresponding to each numerical character block, then go to step 009;

Step 009. From the digit or decimal point corresponding to each numerical character block, assemble the value represented by the data image in each numerical cell of the table; then, combining the attributes defined by the table layout, obtain each attribute of the hydrological data table with its corresponding value, and store them.

As a preferred technical solution of the present invention, the method further comprises the following steps after step 009; after step 009 is executed, go to step 010;

Step 010. For the attributes and corresponding values of the recognized and stored hydrological data table, process the flow values of each month according to steps 010-01 and 010-02 below, thereby obtaining a preliminary recognition judgment for each daily flow value of each month, then go to step 011;

Step 010-01. Take the flow value of the first day of the month as the first threshold. Then, for each of the first two days of the month, determine whether the difference between the next day's flow value and that day's flow value is less than the first threshold; if so, that day's flow value is judged to be correctly recognized; otherwise it is judged to be a preliminary recognition error. This yields the preliminary recognition judgments for the first two days of the month; then go to step 010-02;

Step 010-02. For each day of the month from the third day onward, determine whether the difference between the next day's flow value and that day's flow value is less than the previous day's flow value; if so, that day's flow value is judged to be correctly recognized; otherwise it is judged to be a preliminary recognition error. This yields the preliminary recognition judgments for each day from the third day onward;

Step 011. From each value in the recognized and stored hydrological data table and the recognition features of each digit in each value, obtain through a preset trainer, for each digit of each value, the ten recognition probabilities corresponding to "0" through "9", then go to step 012;

Step 012. For each digit of each value in the recognized and stored hydrological data table, take the largest and the second-largest of its ten recognition probabilities, compute their difference, and determine whether the difference is less than a preset probability threshold; if so, the digit is judged to be a preliminary recognition error; otherwise the digit is judged to be correctly recognized. This yields a preliminary recognition judgment for every digit of every value; then go to step 013;

Step 013. For each preliminarily misrecognized flow value in each month, determine whether it contains a preliminarily misrecognized digit; if so, the flow value is judged to be wrong and an alarm is raised; otherwise the flow value is judged to be correct. This completes the verification of each value in the recognized and stored hydrological data table.
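The checks in steps 010 to 013 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of an absolute difference, the choice to skip the last day of the month (which has no next day), and the default threshold of 0.2 are all assumptions made here.

```python
def flag_by_time_series(flows):
    """Steps 010-01 and 010-02: indices of days flagged as preliminary errors."""
    flagged = set()
    if not flows:
        return flagged
    first_threshold = flows[0]  # flow value of the first day of the month
    # Assumption: the last day has no next day, so it is not checked.
    for day in range(len(flows) - 1):
        diff = abs(flows[day + 1] - flows[day])  # assumption: absolute difference
        # First two days use the first threshold; later days use the previous day's value.
        limit = first_threshold if day < 2 else flows[day - 1]
        if not diff < limit:
            flagged.add(day)
    return flagged


def flag_by_margin(digit_probs, margin_threshold):
    """Step 012: True when top-1 minus top-2 probability is below the threshold."""
    ordered = sorted(digit_probs, reverse=True)
    return (ordered[0] - ordered[1]) < margin_threshold


def confirm_errors(flows, probs_per_value, margin_threshold=0.2):
    """Step 013: a flagged flow value is confirmed wrong if any of its digits is flagged."""
    confirmed = []
    for day in sorted(flag_by_time_series(flows)):
        if any(flag_by_margin(p, margin_threshold) for p in probs_per_value[day]):
            confirmed.append(day)
    return confirmed
```

Here `probs_per_value[day]` is assumed to hold one ten-element probability list per digit of that day's value. Requiring both tests to fail before raising an alarm keeps a genuine flood peak, which breaks the time-series rule but is read with high confidence, from being reported as an error.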

As a preferred technical solution of the present invention: in step 011, from each value in the recognized and stored hydrological data table and the recognition features of each digit in each value, a support vector machine trainer is used to obtain, for each digit of each value, the ten recognition probabilities corresponding to "0" through "9".

As a preferred technical solution of the present invention: in step 013, when a preliminarily misrecognized flow value is judged wrong because it contains a preliminarily misrecognized digit and an alarm is raised, the position of that digit within the flow value is also analyzed. If the digit lies in the integer part of the flow value, the flow value is replaced by the average of the flow values of the previous day and the following day. If the digit lies in the decimal part, the decimal part of the flow value is replaced by the average of the decimal parts of the flow values of the previous day and the following day.
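The replacement rule above amounts to a small piece of arithmetic, sketched below. The function name and signature are hypothetical, and the source does not say what to do for the first or last day of a month, so the sketch assumes both neighbors exist.

```python
import math


def correct_value(prev_value, value, next_value, suspect_in_integer_part):
    """Replace a confirmed-wrong flow value using its two neighboring days."""
    if suspect_in_integer_part:
        # Integer part suspect: replace the whole value with the neighbor average.
        return (prev_value + next_value) / 2
    # Decimal part suspect: keep the integer part, average the neighbors' decimals.
    prev_frac = prev_value - math.floor(prev_value)
    next_frac = next_value - math.floor(next_value)
    return math.floor(value) + (prev_frac + next_frac) / 2
```

For instance, a value of 95.0 between neighbors 11.0 and 12.0 with a suspect integer digit becomes 11.5, while 12.9 between 12.25 and 12.75 with a suspect decimal digit becomes 12.5.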

As a preferred technical solution of the present invention, in step 004, segmenting the numerical characters in a data image to obtain its numerical character blocks specifically comprises the following steps:

Step a01. Detect the white pixels inside each numerical character in the data image, together with the white pixels on the numerical characters that lie at the minimum distance from each edge of the data image, then go to step a02;

Step a02. Examine each white pixel obtained from the data image in the previous step: determine whether the pixels above, below, to the left and to the right of it are all white; if so, the pixel is judged to lie inside a numerical character; otherwise the pixel is judged, according to the identifier, to be an edge pixel of the character, and the number of the pixel column in which it lies in the data image is obtained. Examining every white pixel in this way yields the column numbers of the pixel columns containing the edge pixels of each numerical character in the data image; then go to step a03;

Step a03. According to the column numbers of the pixel columns containing the edge pixels of each numerical character, divide the numerical characters in the data image to obtain the numerical character blocks of the data image.
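Steps a01 to a03 can be sketched as below, assuming a binary image given as a list of rows in which 1 is a white character pixel and 0 is black background. The four-neighbor test follows step a02; how the column numbers are turned into blocks is not spelled out in the text, so grouping runs of consecutive columns into one block is an assumption of this sketch, as are the function names.

```python
def edge_columns(image):
    """Column numbers that contain at least one edge pixel (step a02)."""
    height, width = len(image), len(image[0])

    def white(r, c):
        return 0 <= r < height and 0 <= c < width and image[r][c] == 1

    columns = set()
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1:
                continue
            interior = (white(r - 1, c) and white(r + 1, c)
                        and white(r, c - 1) and white(r, c + 1))
            if not interior:
                columns.add(c)
    return sorted(columns)


def split_characters(image):
    """Group consecutive edge columns into (first_col, last_col) character blocks."""
    blocks = []
    for c in edge_columns(image):
        if blocks and c == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], c)  # extend the current block
        else:
            blocks.append((c, c))  # a gap in the columns starts a new block
    return blocks
```

A grouping like this separates characters only where at least one empty column lies between them; touching characters, which heavy ink can produce, would need additional handling that this sketch does not attempt.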

As a preferred technical solution of the present invention, in step 005, extracting the grid feature of the numerical character in each numerical character block of each data image specifically comprises the following steps:

Step b01. Obtain the top, bottom, left and right boundaries of the numerical character block and from them the character body image, then go to step b02;

Step b02. Apply center-of-gravity normalization to the character body image, divide the normalized image evenly into a preset number of sub-region images, then go to step b03;

Step b03. Obtain the proportion of white pixels in each sub-region image of the character body image; together these proportions form the grid feature of the numerical character in the block.
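A minimal sketch of steps b01 and b03 follows, again assuming a binary block with 1 for white character pixels. The center-of-gravity normalization of step b02 is deliberately left out, and the `n` by `n` grid layout and function names are assumptions; the source only says "a preset number" of sub-regions.

```python
def bounding_box(block):
    """Step b01: top, bottom, left, right indices of the white pixels."""
    rows = [r for r, row in enumerate(block) if any(row)]
    cols = [c for c in range(len(block[0])) if any(row[c] for row in block)]
    return rows[0], rows[-1], cols[0], cols[-1]


def grid_feature(block, n=2):
    """Step b03: fraction of white pixels in each cell of an n x n grid."""
    top, bottom, left, right = bounding_box(block)
    body = [row[left:right + 1] for row in block[top:bottom + 1]]
    height, width = len(body), len(body[0])
    feature = []
    for gr in range(n):
        r0, r1 = gr * height // n, (gr + 1) * height // n
        for gc in range(n):
            c0, c1 = gc * width // n, (gc + 1) * width // n
            cells = [body[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(cells) / len(cells) if cells else 0.0)
    return feature
```

Because each entry is a proportion rather than a pixel count, the feature stays comparable between characters whose bounding boxes differ in size.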

As a preferred technical solution of the present invention, in step 005, extracting the Fourier feature of the numerical character in each numerical character block of each data image specifically comprises the following steps:

Step c01. Apply a two-dimensional discrete Fourier transform to the numerical character block, then go to step c02;

Step c02. Apply a centering transform to the transformed block, that is, divide it evenly into four sub-region images and swap them diagonally, obtaining the Fourier image spectrum; then go to step c03;

Step c03. Analyze the Fourier coefficients of the centered Fourier image spectrum and find the region in which the coefficients of the block that exceed a preset amplitude threshold are concentrated; this forms the large-coefficient region; then go to step c04;

Step c04. From the large-coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them to form the Fourier feature of the numerical character in the block.
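Steps c01 to c04 can be illustrated with a naive transform on a small block. This is a sketch under stated assumptions: the large-coefficient region is taken to be a square window around the centered zero-frequency term, coefficient magnitudes are normalized by the window maximum, and the block has even dimensions so that the diagonal swap is exact. None of these choices is specified in the source, and the naive transform is only suitable for small images.

```python
import cmath


def dft2(img):
    """Step c01: naive two-dimensional discrete Fourier transform."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(angle)
            out[u][v] = total
    return out


def center(spec):
    """Step c02: swap the four quadrants diagonally (even dimensions assumed)."""
    h, w = len(spec), len(spec[0])
    return [[spec[(r + h // 2) % h][(c + w // 2) % w] for c in range(w)]
            for r in range(h)]


def fourier_feature(img, half=1):
    """Steps c03-c04: normalized magnitudes in a window around the spectrum center."""
    mag = [[abs(z) for z in row] for row in center(dft2(img))]
    cy, cx = len(mag) // 2, len(mag[0]) // 2
    window = [mag[r][c]
              for r in range(cy - half, cy + half + 1)
              for c in range(cx - half, cx + half + 1)]
    peak = max(window) or 1.0  # avoid dividing by zero on an all-black block
    return [m / peak for m in window]
```

After centering, the zero-frequency term sits in the middle of the spectrum, so for a non-negative image the middle entry of the returned window equals 1 after normalization.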

As a preferred technical solution of the present invention: in step 005, extracting the contour moment feature of the numerical character in each numerical character block of each data image specifically comprises the following steps:

Step d01. Extract the contour of the numerical character in the numerical character block, then go to step d02;

Step d02. Apply invariant-moment processing to the contour of the numerical character in the block and extract a preset number of two-dimensional contour invariant moment features, which form the contour moment feature of the numerical character in the block.
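The source does not say which invariant moments are used. The sketch below assumes the first two of Hu's moment invariants, computed over contour pixels found with a four-neighbor test, and uses the usual region-style normalization; all three choices, and the function names, are illustrative assumptions.

```python
def contour_pixels(block):
    """Step d01: white pixels with at least one non-white 4-neighbor."""
    h, w = len(block), len(block[0])

    def white(r, c):
        return 0 <= r < h and 0 <= c < w and block[r][c] == 1

    return [(r, c) for r in range(h) for c in range(w)
            if block[r][c] == 1
            and not (white(r - 1, c) and white(r + 1, c)
                     and white(r, c - 1) and white(r, c + 1))]


def contour_moment_feature(block):
    """Step d02 (assumed form): first two Hu invariants of the contour pixels."""
    pts = contour_pixels(block)
    n = len(pts)  # zeroth-order moment of the contour
    r_bar = sum(r for r, _ in pts) / n
    c_bar = sum(c for _, c in pts) / n

    def mu(p, q):  # central moment
        return sum((c - c_bar) ** p * (r - r_bar) ** q for r, c in pts)

    def eta(p, q):  # normalized central moment
        return mu(p, q) / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return [phi1, phi2]
```

Because the moments are taken about the contour's own centroid, shifting the character inside its block leaves the feature unchanged, which is the property that makes such moments useful when cell cropping is imprecise.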

As a preferred technical solution of the present invention, step 007 specifically comprises the following steps:

Step e01. By enumerating combinations, pair any two of the recognition features of the numerical characters in all data images to form all recognition feature combinations, then go to step e02;

Step e02. Form, from all recognition features of the numerical characters in all data images, the sample set S corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition feature combination, apply formula (1):

$C_{ij,A} = \frac{E(S_i \cup S_j) - E(S_i \cap S_j)}{E(S)} \qquad (1)$

to obtain the feature complementarity index $C_{ij,A}$ of that combination with respect to each standard digit "0" to "9", and thereby the feature complementarity indices of every recognition feature combination with respect to each standard digit; then go to step e03. Here $S_i$ and $S_j$ denote the sets of samples in S misclassified by recognition feature $F_i$ and by recognition feature $F_j$ respectively; $E(S)$ is the number of samples in S; $E(S_i \cup S_j)$ is the number of samples in the union of $S_i$ and $S_j$; $E(S_i \cap S_j)$ is the number of samples in their intersection; $A = \{0, 1, \ldots, 9\}$; and $C_{ij,A}$ is the feature complementarity index, with respect to standard digit A, of the combination formed by $F_i$ and $F_j$;

Step e03. For each recognition feature combination, apply formula (2):

$TC_k = \frac{\sum_{0,\, i \neq j}^{9} C_{ij}}{A_{10}^{2}} \qquad (2)$

to obtain the overall complementarity index $TC_k$ of each recognition feature combination with respect to the standard digits, then go to step e04; where $k = \{1, \ldots, K\}$, K is the number of recognition feature combinations, and $TC_k$ is the overall complementarity index of the k-th recognition feature combination with respect to the standard digits;

Step e04. Sort all recognition feature combinations by overall complementarity index from largest to smallest, take the top two combinations, and perform feature fusion on these two combinations to form the numerical recognition features corresponding to "0" through "9" in the hydrological data table.
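Formula (1) and the ranking in step e04 can be illustrated with plain sets. The sketch below computes the index for one sample set only and ranks feature pairs by that value; the aggregation of formula (2) is not reproduced. The feature names and sample identifiers are made up for illustration.

```python
from itertools import combinations


def complementarity(mis_i, mis_j, total):
    """Formula (1): (|Si ∪ Sj| - |Si ∩ Sj|) / |S|."""
    return (len(mis_i | mis_j) - len(mis_i & mis_j)) / total


def rank_feature_pairs(misclassified, total):
    """Steps e01 and e04: score every pair of features, highest index first."""
    scores = {
        (a, b): complementarity(misclassified[a], misclassified[b], total)
        for a, b in combinations(sorted(misclassified), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical misclassified sample ids out of 20 samples.
example = {
    "grid": {1, 2, 3, 4},
    "fourier": {3, 4, 5},
    "contour": {10, 11, 12, 13, 14},
}
ranking = rank_feature_pairs(example, total=20)
```

The numerator counts the samples that exactly one of the two features misclassifies, so a pair scores high when the features fail on different samples and low when they fail on the same ones.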

As a preferred technical solution of the present invention, in step 008, from the numerical recognition features corresponding to "0" through "9" in the hydrological data table and the recognition features of the numerical character in each block of each data image, a support vector machine (SVM) classifier is used to obtain the digit corresponding to each numerical character block in each data image.

Compared with the prior art, the paper hydrological yearbook digitization method and control method of the present invention, using the above technical solution, have the following technical effects. Building on single features, the method proposes a fusion of strongly complementary features, which improves the recognition rate. Because hydrological processes are shaped by similar seasonal climatic factors and other random factors, they resemble one another; that is, flow values are contextually correlated. Exploiting this correlation, the invention also proposes a post-recognition error correction mechanism based on time series: after the classifier has produced its result, errors are corrected according to a defined criterion. Experiments show that the proposed mechanism effectively improves recognition accuracy while maintaining working efficiency.

Description of drawings

Fig. 1 is a flowchart of the paper hydrological yearbook digitization method and control method designed by the present invention;

Fig. 2a is a schematic diagram of the horizontal projection of the hydrological data table in the embodiment;

Fig. 2b is a schematic diagram of the vertical projection of the hydrological data table in the embodiment;

Fig. 3 is a schematic diagram of the table formed by the vertical and horizontal lines extracted from the hydrological data table in the embodiment;

Fig. 4 is a schematic diagram of the layout analysis of the hydrological yearbook in the embodiment;

Fig. 5 is a schematic diagram of obtaining the data image in each numerical cell of the hydrological data table in the embodiment;

Fig. 6 is a schematic diagram of the numerical character blocks in a data image obtained in the embodiment.

Detailed description

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Everyday business activity relies on large numbers of documents and forms, and form documents are widely used in every field. They usually have to be processed by hand: customers fill them in to pay taxes, and librarians collect the data contained in paper forms. With the development of optical character recognition (OCR), people have begun to extract the data in a form from an image of a standard form, which shortens working time and lightens the workload. In business, OCR can improve the quality of work and reduce the large amount of time spent on processing form documents. In many OCR applications, a form template tells the user which printed target strings appear in the image. These strings cover many kinds of content, such as flow data, text and mathematical formulas. The table lines themselves get in the way of extracting the data, so table line detection is an important task in printed form recognition.

In printed hydrological documents the table is indispensable: it gathers all the document information in one place and lets the reader understand its meaning accurately, in a form that is both concise and standardized. Consulting the flow tables of the major hydrological stations in the yearbooks shows that their layout follows regular patterns, and these patterns can be used to cut out the characters.

A hydrological yearbook is the medium in which a hydrological agency publishes the results of monitoring the rivers and water bodies of a basin, after processing and compiling them in the following year. It contains the compiled results together with summary material presented in charts and the necessary explanatory text, and is a systematic, standardized store of hydrological data.

In 1958 the Hydrological Bureau of the Ministry of Water Resources divided the country's hydrological data into volumes according to basin and river system, and gave the yearly data the uniform title "Hydrological Yearbook of the People's Republic of China", published nationwide in 10 volumes comprising 94 books. Its characteristics are as follows.

Color: black characters on a yellow background.

Structure: the paper is 440 mm wide and 140 mm high, an aspect ratio of 3.14. The digits in the yearbook are about 15 mm wide and 24 mm high, an aspect ratio of 0.625. The characters lie inside the table.

Texture: the yearbook contains character-like regions, that is, regions where the color intensity of the digits shows regular peaks and valleys in both the horizontal and vertical directions.

The characters in a hydrological yearbook are arranged in multiple regular horizontal rows and have fairly stable structural and texture characteristics. Projection-based top-down layout analysis exploits exactly this property. In the character regions of the yearbook the edge information is very rich, so detecting and analyzing character edges with suitable tools separates the hydrological data from the background. Pixel values in the data regions fluctuate in a characteristic way, and the frequency of the fluctuation stays within a certain range; these properties can be used to locate the yearbook characters. Because digit regions have richer horizontal and vertical features than non-digit regions, a character-location algorithm based on horizontal and vertical projections is proposed: the transition points of the projections are found, and candidate character regions are determined from the number of transition points and the distances between them.

The page is blank for roughly the first 275 pixels below its top edge. This is followed by the basin name and station name of the hydrological yearbook together with the heading "daily mean flow table". About 30 pixels below this heading, the catchment area and the unit of flow are given, and about 20 pixels below that the table begins. Every yearbook table consists of 11 horizontal lines and 14 vertical lines. The months are marked between the first two horizontal lines and the days of the month between the first two vertical lines; the region between each subsequent pair of adjacent vertical lines and above the third horizontal line holds the flow values of one month. Between the remaining horizontal lines are the monthly mean flow, the maximum and minimum flow with their dates, the annual statistics, and the notes. Since the ultimate goal is to recognize the flow values, the layout of the hydrological data must first be analyzed: the table structure is analyzed and the table ruling lines are extracted so that the flow values of each month can be located.

As shown in Figure 1, the present invention provides a method for digitizing a paper hydrological yearbook. First, the hydrological data table on a page of the paper yearbook is photographed to obtain an image of the table, and the image is preprocessed; preprocessing includes binarization, grayscale conversion, denoising, rotation, and color inversion. The following steps are then carried out on the preprocessed table image:

Step 001. Building on research into document layout analysis, this method starts from the two classic layout segmentation approaches (top-down and bottom-up) and combines their advantages, that is, it uses structural and texture features together to process the page layout of the hydrological yearbook. This takes into account both segmentation accuracy and processing time, so the table can be located quickly and accurately. According to the layout design of the paper yearbook page, the pixel position of the hydrological data table within the page is determined, and the method proceeds to step 002.

Step 002. According to the pixel position of the hydrological data table within the paper yearbook page, the table is projected in the vertical and horizontal directions; the horizontal projection is shown in Figure 2a and the vertical projection in Figure 2b. The two projections are analyzed separately. In Figure 2a, the 11 black dots mark the horizontal lines of the yearbook table; the hollow dots after the second black dot mark the upper and lower bounds of each row of flow values, the two sides of each subsequent peak giving the upper and lower bounds of the rows of flow values for the 1st through the 31st day. In Figure 2b, the 14 black dots mark the abscissas of the 14 vertical lines of the table; between every two black dots, that is, between every two vertical lines, the two sides of the peak give the left and right coordinates of that month's flow values and are marked with hollow dots. The abscissa of each vertical line and the ordinate of each horizontal line in the table are extracted; a practical example is shown in Figure 3, in which the numeric characters in each data image of the table are white on a black background. Figures 2a and 2b thus give a rough location of each month's flow values and of the table itself. The final result of the layout analysis is shown in Figure 4. The method then proceeds to step 003.

Counting the black pixels in each row or column avoids detecting straight line segments directly, places little demand on the connectivity of the table lines, and is robust to interference and generalizes well. The projections convey useful information such as the position and size of the targets in the image, which facilitates the subsequent location of the yearbook digits.
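The projection analysis of step 002 can be illustrated with a minimal sketch. It assumes the table image is already binarized into a list of rows of 0/1 values (1 = ink pixel); the function names, the `min_fraction` parameter, and the rule that a ruling line is a run of rows or columns whose ink count reaches a given fraction of the image extent are illustrative assumptions, not details given in this description.

```python
def projection_profiles(img):
    """Return (row_sums, col_sums): ink-pixel counts per row and per column."""
    row_sums = [sum(row) for row in img]
    col_sums = [sum(col) for col in zip(*img)]
    return row_sums, col_sums


def find_lines(profile, min_fraction, length):
    """Return the centre index of each run where profile >= min_fraction * length.

    `length` is the image extent along the line direction (the width for
    horizontal lines, the height for vertical lines). The threshold rule
    is an assumption made for this sketch.
    """
    thresh = min_fraction * length
    lines, start = [], None
    for i, value in enumerate(profile + [0]):  # sentinel closes a trailing run
        if value >= thresh and start is None:
            start = i
        elif value < thresh and start is not None:
            lines.append((start + i - 1) // 2)
            start = None
    return lines
```

Applied to the row profile with the image width as `length`, `find_lines` returns the ordinates of the horizontal lines; applied to the column profile with the image height, it returns the abscissas of the vertical lines. Because only counts are compared, short gaps in a printed line do not prevent its detection as long as enough of the line remains.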

Step 003. According to the format of the hydrological data table and the abscissas of its vertical lines and ordinates of its horizontal lines, the data image in each numeric cell of the table is obtained from the projected image of the table; a practical example is shown in Figure 5. The method then proceeds to step 004. In each data image of the table, the numeric characters are white and the background is black.
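As a rough illustration of step 003, the sketch below cuts one sub-image per cell once the line coordinates are known. It assumes a binary image stored as a list of rows, sorted coordinate lists `xs` (vertical lines) and `ys` (horizontal lines), and a hypothetical `margin` parameter that skips the ruling lines themselves. Locating the individual rows of daily values inside a month column from the projection peaks is not covered here.

```python
def crop_cells(img, xs, ys, margin=1):
    """Return {(row_index, col_index): sub_image} for every table cell.

    A cell is the area strictly between two consecutive horizontal lines
    and two consecutive vertical lines; `margin` pixels are dropped on
    each side so that the ruling lines are excluded.
    """
    cells = {}
    for r, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for c, (left, right) in enumerate(zip(xs, xs[1:])):
            cells[(r, c)] = [
                row[left + margin:right - margin + 1]
                for row in img[top + margin:bottom - margin + 1]
            ]
    return cells
```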

When digitizing paper water-level records, the accuracy of the features extracted later can be guaranteed only if the hydrological data image is segmented well and adaptively. Segmentation of the image is the foundation of the whole digitization process. After the numbers have been located, each number image is still a single unit that includes the gaps between digits, so the extracted number must be segmented into characters, separating each individual character from the whole number.

Step 004. For each data image, character segmentation is performed on the numeric characters in the data image to obtain the individual numeric character blocks of that data image, specifically including the following steps:

Following the above procedure, the numeric character blocks in every data image are obtained; a practical example of the numeric character blocks obtained from a data image is shown in Figure 6. The method then proceeds to step 005.
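The individual sub-steps of step 004 are not reproduced in this text. Purely as an illustration, and not as the method of the invention, one common way to separate the characters of a located number is to split the cell image at blank columns of its vertical projection. The sketch assumes a binary cell image (1 = character pixel) in which neighbouring characters do not touch.

```python
def split_characters(cell):
    """Split a binary cell image into character blocks at blank columns."""
    col_sums = [sum(col) for col in zip(*cell)]
    blocks, start = [], None
    for x, value in enumerate(col_sums + [0]):  # sentinel closes the last block
        if value > 0 and start is None:
            start = x                            # first column of a character
        elif value == 0 and start is not None:
            blocks.append([row[start:x] for row in cell])
            start = None
    return blocks
```

Touching or broken characters would need additional handling that this sketch does not attempt.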

If the preprocessed data were fed directly to the classifier, the amount of data to be processed during classification would be large. The purpose of feature extraction is to start from the topological structure of a digit and extract some of its structural features, so that interference such as displacement, size variation, and glyph distortion is relatively reduced. In other words, the key information that characterizes the digit is supplied to the classifier, which indirectly increases the classifier's fault tolerance, and the amount of data is greatly reduced after feature extraction. Feature extraction plays a key role in recognition and should follow these principles:

(1) The features are easy to extract;

(2) The features have strong discriminating power, that is, a feature should differ widely between different digits and differ as little as possible between instances of the same digit;

(3) The features are highly stable, minimizing the influence of broken or touching strokes.

Step 005. For each numeric character block in each data image, the grid feature, Fourier feature, and contour moment feature of the numeric character in the block are extracted and together serve as the recognition features of that character; the recognition features of the numeric characters in all numeric character blocks of all data images are thus obtained, and the method proceeds to step 006.

The grid feature is a set of features describing the overall distribution of the character image, and it suppresses noise very effectively. The main idea of its extraction is to divide the digit's dot matrix into several small local regions and to use the dot density of each region as a descriptive feature, that is, the percentage of character pixels in each region is taken as the feature value. The grid feature reflects local statistics of the image and is a relative percentage. A local deformation or noise in the image corresponds to swapping the values "0" and "1" of some local elements of the dot matrix, so the computed percentages change little compared with the original image without deformation or noise. The relative value is therefore insensitive to deformation of local strokes and to isolated noise points, and digit recognition based on the grid feature has good noise immunity. In this method, each segmented digit is divided into a 3×3 grid of small regions, 9 in total.

In step 005 above, extracting the grid feature of the numeric character in each numeric character block of each data image specifically includes the following steps:

Step b01. The top, bottom, left, and right boundaries of the numeric character block are obtained, yielding the image of the numeric character body; the method then proceeds to step b02.

Step b02. The image of the numeric character body is normalized with respect to its center of gravity, and the normalized image is divided evenly into a preset number of sub-region images; the method then proceeds to step b03.
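A minimal sketch of the grid feature follows. It assumes a binary character block (1 = character pixel) containing at least one character pixel, crops it to the bounding box as in step b01, and computes the pixel density of each cell of an n×n grid, with n = 3 matching the 3×3 division mentioned above. The center-of-gravity normalization of step b02 is omitted for brevity, and the function names are illustrative.

```python
def bounding_box(block):
    """Crop a binary block to the smallest rectangle containing all ink."""
    rows = [r for r, row in enumerate(block) if any(row)]
    cols = [c for c, col in enumerate(zip(*block)) if any(col)]
    return [row[cols[0]:cols[-1] + 1] for row in block[rows[0]:rows[-1] + 1]]


def grid_feature(block, n=3):
    """Return n*n densities: the fraction of ink pixels in each grid cell."""
    glyph = bounding_box(block)
    h, w = len(glyph), len(glyph[0])
    feats = []
    for i in range(n):
        r0, r1 = i * h // n, (i + 1) * h // n
        for j in range(n):
            c0, c1 = j * w // n, (j + 1) * w // n
            area = (r1 - r0) * (c1 - c0)
            ink = sum(glyph[r][c] for r in range(r0, r1) for c in range(c0, c1))
            feats.append(ink / area if area else 0.0)  # empty cell if glyph < n pixels
    return feats
```

Since each value is a ratio, flipping a single pixel in a cell changes the feature only by one over the cell area, which is the noise tolerance described for the grid feature.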

The Fourier transform is a two-dimensional orthogonal transform widely used in image processing. After the transform, the mean value, i.e. the DC term, is proportional to the mean gray value of the image, and the low-frequency components indicate the strength and direction of object edges in the image. A numeric character can generally be represented by a closed contour made up of many line segments, and the discrete quantities obtained by the mapping adequately reflect the variation of this contour. Fourier coefficients describe the boundary contour of an image well, and their values are independent of the translation, rotation, displacement, and size of similar glyphs. For glyph representation and recognition, these features also provide substantial data compression.

In step 005 above, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically includes the following steps:

Step c01. A two-dimensional discrete Fourier transform is applied to the numeric character block; the method then proceeds to step c02.

Step c02. The transformed numeric character block is then centered: it is divided evenly into four sub-region images, and diagonally opposite sub-regions are exchanged, giving the Fourier image spectrum; the method then proceeds to step c03.

Step c03. The Fourier coefficients of the centered Fourier image spectrum are analyzed to find the region in which the coefficients of the numeric character block that exceed a preset amplitude threshold are concentrated; this region constitutes the large-amplitude Fourier coefficient region. The method then proceeds to step c04.
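Steps c01 to c03 can be sketched as follows using only the standard library. The direct DFT below is quadratic in the number of pixels and is meant for small character blocks; a real implementation would normally use an FFT routine. The function names and the way the large-coefficient positions are returned are illustrative assumptions.

```python
import cmath


def dft2(img):
    """Direct two-dimensional discrete Fourier transform of a small image."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * cmath.pi * (u * y / h + v * x / w)
                    s += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = s
    return out


def center_shift(spec):
    """Move the DC term from (0, 0) to (h // 2, w // 2).

    For even sizes this is exactly the diagonal exchange of the four
    quadrants described in step c02.
    """
    h, w = len(spec), len(spec[0])
    return [[spec[(u - h // 2) % h][(v - w // 2) % w] for v in range(w)]
            for u in range(h)]


def large_coefficients(spec, threshold):
    """Return the positions whose coefficient magnitude exceeds threshold."""
    return [(u, v) for u, row in enumerate(spec)
            for v, c in enumerate(row) if abs(c) > threshold]
```

How the feature vector is finally assembled from the large-coefficient region belongs to the later sub-steps, which this text does not reproduce.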

The invariant moment feature is a statistical feature of an image: a mathematical feature that is invariant to translation, scaling, and rotation.

In step 005 above, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically includes the following steps:

Step d01. The contour of the numeric character in the numeric character block is extracted; the method then proceeds to step d02.

If all the recognition features obtained in the above steps are classified individually with a neural network and with a support vector machine classifier, the classification results are not very satisfactory. The main reason is that it is hard to find a single feature that suits every digit, and earlier methods performed feature extraction and fusion only for specific digit-recognition applications. Each digit has its own characteristics, and correct classification requires a combination of features. The complementarity of the features is the key to ensuring that the extracted features have a high recognition rate and generalization ability, and it is the basis for feature fusion; the problem of measuring feature complementarity must therefore be solved before fusion.

Step 007. Feature fusion is performed on all recognition features of the numeric characters in all data images, forming the numeric recognition features corresponding respectively to "0" through "9" in the hydrological data table; the method then proceeds to step 008.

Step 007 above specifically includes the following steps:

Step e01. By enumerating combinations, every pair of recognition features is formed from all the recognition features of the numeric characters in all data images, giving the full set of recognition feature combinations; the method then proceeds to step e02.

The feature complementarity index C_{ij,A} of this recognition feature combination with respect to each of the standard digits "0" to "9" is obtained, and in the same way the feature complementarity indices C_{ij,A} of every recognition feature combination with respect to each of the standard digits "0" to "9" are obtained; the method then proceeds to step e03. Here, the larger C_{ij,A} is, the stronger the complementarity of recognition features F_i and F_j with respect to the standard digit A, and the smaller it is, the weaker the complementarity. S_i and S_j denote the sets of samples in the sample set S that are misclassified by recognition feature F_i and by recognition feature F_j, respectively; E(S) denotes the number of samples in S; E(S_i ∪ S_j) denotes the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) denotes the number of samples in their intersection; A ∈ {0, 1, …, 9}; and C_{ij,A} denotes the feature complementarity index, with respect to the standard digit A, of the combination formed by recognition features F_i and F_j.
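The formula that combines these quantities into C_{ij,A} is not reproduced in this text, so the sketch below only gathers its inputs for one digit A: the sample count E(S) and the sizes of the union and intersection of the two misclassified sets. The function name and the dictionary keys are illustrative assumptions.

```python
def misclassification_counts(samples, wrong_i, wrong_j):
    """Return E(S), E(Si ∪ Sj) and E(Si ∩ Sj) for one standard digit.

    samples: identifiers of all samples of the digit (the set S)
    wrong_i: identifiers misclassified when using feature F_i (the set S_i)
    wrong_j: identifiers misclassified when using feature F_j (the set S_j)
    """
    s, s_i, s_j = set(samples), set(wrong_i), set(wrong_j)
    return {
        "E(S)": len(s),
        "E(Si|Sj)": len(s_i | s_j),   # misclassified by at least one feature
        "E(Si&Sj)": len(s_i & s_j),   # misclassified by both features
    }
```

Intuitively, a small intersection relative to the union means the two features fail on different samples, which is the situation the description calls strongly complementary.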

The overall complementarity index TC_k of each recognition feature combination with respect to the standard digits is obtained, and the method proceeds to step e04. Here k ∈ {1, …, K}, K is the total number of recognition feature combinations, and TC_k is the overall complementarity index of the k-th combination with respect to the standard digits.

In the above technical solution, the different features are each used for classification, the recognition results of the individual features are analyzed, the overall complementarity index of each feature combination is computed with the above formula, and the selected features are then fused through a linear relationship. Experiments show that the coarse grid feature and the Fourier feature both recognize the digits of the hydrological yearbook data well and are strongly complementary overall, so the Fourier feature is concatenated after the coarse grid feature. In the experiments, the recognition rate of the proposed fused feature is 3.8981% higher than that of the Fourier feature alone, 1.4033% higher than that of the grid feature, and 83.1956% higher than that of the contour moment feature.

Step 008. Using the numeric recognition features corresponding respectively to "0" through "9" in the hydrological data table and the recognition features of the numeric characters in each numeric character block of each data image, a support vector machine (SVM) classifier determines the digit corresponding to each numeric character block in each data image; the method then proceeds to step 009.

Step 009. From the digit or decimal point corresponding to each numeric character block in each data image, the numeric value of the data image in each numeric cell of the table is assembled; combined with the attributes defined by the table format, the attributes of the hydrological data table and their corresponding values are obtained and stored. The method then proceeds to step 010.

By analyzing the behavior of the flow, a later-stage error-correction mechanism based on the time series is proposed. The experimental results show that the final recognition rate for the hydrological yearbook is close to 99%, so the error rate is relatively low. A flow value consists of 4 to 5 digits, and the value is counted as wrong if any one digit is misrecognized, which differs somewhat from the way error rates are computed on earlier data sets such as MNIST and USPS. Inspection of the recognition results shows that a flow value generally has only one misrecognized digit and that no more than 3 flow values per month are misrecognized. Therefore, if flow values of low recognition reliability can be found algorithmically, that is, if recognition errors in the critical digits before the decimal point can be found, then correcting them with the averaging method based on the monthly pattern of flow variation is highly efficient in practice.

The flow values themselves are obtained by instrument measurement and therefore carry some error. Consequently, when the flow fluctuates within a small range, that is, when a digit after the decimal point is misrecognized, the error can be tolerated as long as it does not affect the analysis and use of the flow data; such a value is not regarded as misrecognized.

Step 010. For the attributes and corresponding values of the recognized and stored hydrological data table, steps 010-01 and 010-02 below are executed on the flow values of each month to obtain a preliminary recognition judgment for each daily flow value of each month; the method then proceeds to step 011.

Step 010-01. The flow value of the first day of the month is taken as the first threshold. Then, for each of the first two days of the month, it is judged whether the difference between the next day's flow value and the current day's flow value is smaller than the first threshold. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives preliminary recognition judgments for the flow values of the first two days of the month, and the method proceeds to step 010-02.

Step 010-02. For each daily flow value from the third day of the month onward, it is judged whether the difference between the next day's flow value and the current day's flow value is smaller than the previous day's flow value. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives preliminary recognition judgments for the daily flow values from the third day of the month onward.
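Steps 010-01 and 010-02 can be sketched as a single pass over one month of recognized values. Two points are assumptions of this sketch rather than statements of the description: the "difference" is taken as an absolute value, and the last day of the month, which has no following day, is left unjudged (`None`).

```python
def preliminary_check(flows):
    """Judge each daily flow value of one month; flows[0] is day 1.

    Returns a list with True (accepted), False (preliminarily
    misrecognized) or None (not judged) for every day.
    """
    verdict = [None] * len(flows)
    first_threshold = flows[0]                  # step 010-01: day-1 value
    for d in range(len(flows) - 1):             # the last day has no next day
        diff = abs(flows[d + 1] - flows[d])     # absolute value: an assumption
        # Days 1 and 2 use the first threshold; later days use the
        # previous day's flow value as the limit (step 010-02).
        limit = first_threshold if d < 2 else flows[d - 1]
        verdict[d] = diff < limit
    return verdict
```

Because each day is compared with its successor, an isolated outlier flags both itself and the day before it. A day flagged here is only treated as erroneous in step 013 if it also contains a low-confidence digit, which filters out such side effects.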

Step 011. From the values in the recognized and stored hydrological data table and the recognition features of each digit in each value, the support vector machine trainer yields, for each digit of each value, the ten recognition probabilities corresponding to "0" through "9"; the method then proceeds to step 012.

Step 012. For each digit of each value in the recognized and stored hydrological data table, the largest and the second-largest of the ten recognition probabilities for "0" through "9" are obtained, and their difference is computed. It is judged whether this difference is smaller than a preset recognition-probability threshold in the range 0.1 to 0.25. If so, the digit is judged to be preliminarily misrecognized; otherwise it is judged to be correctly recognized. This gives preliminary recognition judgments for every digit of every value in the table, and the method proceeds to step 013.
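The margin test of step 012 is small enough to show directly. The sketch assumes the ten class probabilities of one digit are already available as a list; the default threshold of 0.2 is simply one value inside the stated range of 0.1 to 0.25, chosen here for illustration.

```python
def is_suspect_digit(probabilities, threshold=0.2):
    """Return True if the two most likely classes are too close to call.

    probabilities: the ten recognition probabilities for "0" to "9"
    threshold: preset margin; 0.2 is an assumed value within 0.1-0.25
    """
    top, second = sorted(probabilities, reverse=True)[:2]
    return (top - second) < threshold
```

A digit recognized with probabilities 0.35 and 0.25 for its two best classes would be flagged, whereas one with 0.90 for the best class would not.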

Step 013. For each preliminarily misrecognized flow value in each month, it is judged whether the value contains a preliminarily misrecognized digit, with the following two cases:

If so, the preliminarily misrecognized flow value is judged to be wrong and an alarm is raised. The position of the preliminarily misrecognized digit within the flow value is then analyzed. If the digit lies in the integer part of the flow value, the flow value is replaced by the average of the flow values of the preceding day and the following day. If the digit lies in the fractional part, the fractional part of the flow value is replaced by the average of the fractional parts of the flow values of the preceding day and the following day;

Otherwise, the preliminarily misrecognized flow value is judged to be correct. The values in the recognized and stored hydrological data table are thereby verified.
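The replacement rule of step 013 can be sketched as below. The function name and argument layout are illustrative, the flow values are assumed to be non-negative, and the handling of the first and last day of a month, which lack one neighbour, is not specified in the description and is not covered.

```python
import math


def correct_flow_value(prev_day, value, next_day, suspect_in_integer_part):
    """Return the corrected flow value for a day confirmed as erroneous.

    prev_day, next_day: flow values of the neighbouring days
    suspect_in_integer_part: True if the low-confidence digit lies
        before the decimal point, False if it lies after it
    """
    if suspect_in_integer_part:
        # Replace the whole value by the mean of its neighbours.
        return (prev_day + next_day) / 2
    # Keep the integer part; replace only the fractional part by the
    # mean of the neighbours' fractional parts.
    frac = (math.modf(prev_day)[0] + math.modf(next_day)[0]) / 2
    return math.floor(value) + frac
```

For example, a value read as 92.5 between neighbours 12.0 and 13.0, with the suspect digit before the decimal point, is replaced by 12.5.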

Experimental comparison shows that, in the paper hydrological yearbook digitization method of the present invention, feature fusion raises the recognition rate over any single feature. The Fourier feature alone recognizes the digit 0 well but recognizes 6 and 9 poorly, whereas the coarse grid feature recognizes 0 poorly but recognizes 6 and 9 well; the contour moment feature recognizes 0, 6, and 8 poorly. The three features give broadly similar results on the other digits. Computing the complementarity indices between the features shows that the fusion of the Fourier and coarse grid features distinguishes different digits very well. Fusing a feature that describes the digit's boundary contour with one that describes its interior describes the whole digit more completely, from the inside out, and suffices to represent a digit, which is why a good recognition result is obtained.

The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the present invention.

Claims

1. A paper hydrological yearbook digitization method, characterized by comprising the following steps:

Step 001. According to the layout of the paper hydrological yearbook page, determine the pixel position of the hydrological data table within the page, then proceed to step 002;

Step 002. According to the pixel position of the hydrological data table in the page, perform vertical and horizontal projections of the hydrological data table, analyze the vertical projection and the horizontal projection respectively, and extract the abscissa of each vertical line and the ordinate of each horizontal line in the hydrological data table, then proceed to step 003;

Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line in the table, obtain from the projection image of the hydrological data table the data image in each numerical cell of the table, then proceed to step 004; wherein, in each data image of the hydrological data table, the numeric characters are white and the background is black;

Step 004. For each data image, perform character segmentation on the numeric characters in the data image to obtain each numeric character block in that data image, thereby obtaining the numeric character blocks in every data image, then proceed to step 005;

Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature, and the contour moment feature of the numeric character in the block, which together serve as the recognition features of that numeric character, thereby obtaining the recognition features of the numeric character in every numeric character block of every data image, then proceed to step 006;

Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exists downward from the top edge of the numeric character block; if so, judge that the block contains a decimal point; otherwise perform no further operation; after this judgment has been completed for every numeric character block in every data image, proceed to step 007;

Step 007. For all recognition features of the numeric characters in all data images, perform feature fusion to form the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, then proceed to step 008;

Step 008. According to the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, and the recognition features of the numeric character in each numeric character block of each data image, obtain through a preset classifier the digit corresponding to each numeric character block in each data image, then proceed to step 009;

Step 009. According to the digit or decimal point corresponding to each numeric character block in each data image, compose the numerical value corresponding to the data image in each numerical cell of the hydrological data table; in combination with the attributes of the hydrological data table format, obtain each attribute in the hydrological data table and its corresponding numerical value, and store them.
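The projection analysis of step 002 can be illustrated with a minimal sketch. It assumes the table region has already been binarized so that ruling-line pixels are 1 and everything else is 0; the function name `projection_line_positions` and the parameter `min_fraction` are hypothetical and are not taken from the patent, which does not state how the projections are thresholded.

```python
def projection_line_positions(image, min_fraction=0.8):
    """Return (vertical_line_xs, horizontal_line_ys) for a binary table image.

    image: list of rows, 1 = line pixel, 0 = background.
    min_fraction: assumed share of a full column (or row) that must be set
    for that column (or row) to count as part of a ruling line.
    """
    height, width = len(image), len(image[0])
    # Vertical projection: number of set pixels in every column.
    col_sums = [sum(row[x] for row in image) for x in range(width)]
    # Horizontal projection: number of set pixels in every row.
    row_sums = [sum(row) for row in image]

    def peaks(profile, full_length):
        # Indices whose projection is close to the full extent of the table.
        hits = [i for i, v in enumerate(profile) if v >= min_fraction * full_length]
        # Merge runs of adjacent indices (thick lines) into one coordinate.
        groups, current = [], []
        for i in hits:
            if current and i != current[-1] + 1:
                groups.append(current)
                current = []
            current.append(i)
        if current:
            groups.append(current)
        return [g[len(g) // 2] for g in groups]

    return peaks(col_sums, height), peaks(row_sums, width)
```

Consecutive pairs of the returned coordinates then delimit the numerical cells that step 003 crops out as data images.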

2. The paper hydrological yearbook digitization method according to claim 1, characterized in that the method further comprises the following steps after step 009, step 010 being entered once step 009 has been executed:

Step 010. For the attributes and their corresponding numerical values recognized and stored from the hydrological data table, perform the following steps 010-01 to 010-02 on the flow values of each month, thereby obtaining a preliminary recognition judgment for each daily flow value of each month, then proceed to step 011;

Step 010-01. Take the flow value of the first day of the month as a first threshold; then, for each of the flow values of the first two days of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold; if so, judge that the current day's flow value is correctly recognized; otherwise judge that the current day's flow value is preliminarily misrecognized; thereby obtaining preliminary recognition judgments for the flow values of the first two days of the month, then proceed to step 010-02;

Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value; if so, judge that the current day's flow value is correctly recognized; otherwise judge that the current day's flow value is preliminarily misrecognized; thereby obtaining preliminary recognition judgments for each daily flow value of the month from the third day onward;

Step 011. According to each numerical value recognized and stored from the hydrological data table, and the recognition features of each digit in each numerical value, obtain through a preset trainer, for each digit in each recognized and stored numerical value, the ten recognition result probabilities corresponding respectively to "0" through "9", then proceed to step 012;

Step 012. For each digit in each numerical value recognized and stored from the hydrological data table, obtain the largest and the second-largest of the ten recognition result probabilities corresponding to "0" through "9", compute the difference between the largest and the second-largest recognition result probability, and judge whether this difference is less than a preset recognition result probability threshold; if so, judge that the digit is preliminarily misrecognized; otherwise judge that the digit is correctly recognized; thereby obtaining a preliminary recognition judgment for each digit in each recognized and stored numerical value, then proceed to step 013;

Step 013. For each preliminarily misrecognized flow value in each month, judge whether the flow value contains a preliminarily misrecognized digit; if so, judge that the preliminarily misrecognized flow value is erroneous and raise an alarm; otherwise judge that the preliminarily misrecognized flow value is correct; thereby completing the check of each numerical value recognized and stored from the hydrological data table.
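The two-stage check of steps 010 to 013 can be sketched as follows. This is a sketch under stated assumptions: the function names are hypothetical, the "difference" between consecutive days is taken as an absolute difference, and the last day of the month (which has no next-day value) is left unflagged, since the claim does not say how it is handled.

```python
def flag_flow_values(flows):
    """Steps 010-01 and 010-02: True marks a preliminarily suspect daily value."""
    flags = [False] * len(flows)
    for day in range(len(flows) - 1):  # assumption: the last day is not checked
        # Days 1 and 2 use the first day's flow as threshold;
        # later days use the previous day's flow.
        threshold = flows[0] if day < 2 else flows[day - 1]
        jump = abs(flows[day + 1] - flows[day])  # assumption: absolute difference
        flags[day] = not (jump < threshold)
    return flags


def flag_digits(probabilities, margin_threshold):
    """Step 012: a digit is suspect when its two best probabilities are too close."""
    flags = []
    for probs in probabilities:  # probs: ten probabilities for "0" through "9"
        best, second = sorted(probs, reverse=True)[:2]
        flags.append(best - second < margin_threshold)
    return flags


def confirm_errors(flow_flags, digit_flags_per_value):
    """Step 013: confirm an error only if a suspect value contains a suspect digit."""
    return [f and any(d) for f, d in zip(flow_flags, digit_flags_per_value)]
```

For a month such as `[10.0, 11.0, 12.0, 95.0, 13.0, 12.5]`, the time-series test flags both the third and the fourth day, because each is adjacent to the large jump; the digit-level probability margin is what separates the genuinely misread value from its correctly read neighbour.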

3. The paper hydrological yearbook digitization method according to claim 2, characterized in that, in step 011, according to each numerical value recognized and stored from the hydrological data table, and the recognition features of each digit in each numerical value, the ten recognition result probabilities corresponding respectively to "0" through "9" are obtained, for each digit in each recognized and stored numerical value, through a support vector machine trainer.

4. The paper hydrological yearbook digitization method according to claim 2, characterized in that, in step 013, when the preliminarily misrecognized flow value is judged to be erroneous because it contains a preliminarily misrecognized digit, and the alarm is raised, the position of the preliminarily misrecognized digit within the preliminarily misrecognized flow value is also analyzed: if the preliminarily misrecognized digit lies in the integer part of the flow value, the flow value is replaced by the mean of the previous day's flow value and the following day's flow value relative to the date of the misrecognized flow value; if the preliminarily misrecognized digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced by the mean of the fractional part of the previous day's flow value and the fractional part of the following day's flow value.
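The replacement rule for a confirmed error can be written as a short sketch. The function name `correct_flow` and its boolean parameter are hypothetical, and flow values are assumed to be non-negative so that the integer part can be taken with `math.floor`.

```python
import math


def correct_flow(previous, current, following, error_in_integer_part):
    """Replace a confirmed misrecognized flow value using its two neighbours."""
    if error_in_integer_part:
        # Integer part is unreliable: replace the whole value by the neighbour mean.
        return (previous + following) / 2
    # Only the fractional part is unreliable: keep the integer part of the
    # current value and average the neighbours' fractional parts.
    mean_fraction = (math.modf(previous)[0] + math.modf(following)[0]) / 2
    return math.floor(current) + mean_fraction
```

With neighbours 12.25 and 12.75, a value read as 12.9 whose suspect digit is after the decimal point becomes 12.5, while a value read as 95.0 between 12.0 and 13.0 with a suspect integer digit is replaced by 12.5 as a whole.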

5. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that, in step 004, performing character segmentation on the numeric characters in the data image to obtain each numeric character block in the data image specifically comprises the following steps:

Step a01. Detect and obtain each white pixel within each numeric character in the data image, together with the white pixels on the numeric characters that lie at the minimum distance from each edge of the data image, then proceed to step a02;

Step a02. For each white pixel obtained from the data image in the previous step, judge whether the pixels at the upper, lower, left, and right positions of that pixel are all white; if so, judge that the pixel is an interior pixel of a numeric character; otherwise judge, according to an identifier, that the pixel is an edge pixel of a character, and obtain the column number of the pixel column in which the pixel lies in the data image; by performing this judgment on each white pixel obtained in the previous step, obtain the column numbers of the pixel columns in which the upper edge pixels of each numeric character in the data image lie, then proceed to step a03;

Step a03. According to the column numbers of the pixel columns in which the upper edge pixels of each numeric character in the data image lie, divide the numeric characters in the data image to obtain each numeric character block in the data image.
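A simplified reading of steps a01 to a03 is sketched below. It assumes a data image given as a list of rows with 1 for white character pixels and 0 for black background, treats pixels outside the image as black, and splits characters wherever the recorded edge-pixel columns are not contiguous. The function name `segment_characters` is hypothetical, and the sketch does not handle touching characters.

```python
def segment_characters(image):
    """Return character blocks as (first_column, last_column) pairs."""
    height, width = len(image), len(image[0])

    def is_white(y, x):
        # Pixels outside the image count as background.
        return 0 <= y < height and 0 <= x < width and image[y][x] == 1

    # Step a02: a white pixel with any non-white 4-neighbour is an edge pixel;
    # record the column number of every edge pixel.
    edge_columns = set()
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if not all(is_white(ny, nx) for ny, nx in neighbours):
                edge_columns.add(x)

    # Step a03: runs of consecutive edge columns form one character block.
    blocks, start, previous = [], None, None
    for x in sorted(edge_columns):
        if start is None:
            start = x
        elif x != previous + 1:
            blocks.append((start, previous))
            start = x
        previous = x
    if start is not None:
        blocks.append((start, previous))
    return blocks
```

Every column that contains a white pixel also contains at least one edge pixel (its topmost white pixel), so the recorded columns cover each character completely and the gaps between runs correspond to the black space between characters.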

6. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:

Step b01. Obtain the upper, lower, left, and right boundaries of the numeric character block, thereby obtaining the numeric character body image, then proceed to step b02;

Step b02. Perform centroid normalization on the numeric character body image, and divide the centroid-normalized numeric character body image evenly into a preset number of sub-region images, then proceed to step b03;

Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character body image; these proportions together constitute the grid feature of the numeric character in the numeric character block.
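Steps b01 and b03 can be sketched as below. The sketch assumes a block with at least one white pixel (1 = white, 0 = black), uses a hypothetical `grid` parameter for the number of sub-regions per side, and deliberately omits the centroid normalization of step b02, whose exact form the claim does not specify.

```python
def grid_feature(block, grid=2):
    """Return the white-pixel ratio of each sub-region of the character body."""
    # Step b01: bounding box of the white pixels gives the character body image.
    rows = [y for y, row in enumerate(block) if any(row)]
    cols = [x for x in range(len(block[0])) if any(row[x] for row in block)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    body = [row[left:right + 1] for row in block[top:bottom + 1]]

    # Step b03: split the body into grid x grid sub-regions (row-major order)
    # and record the share of white pixels in each one.
    h, w = len(body), len(body[0])
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cells = [body[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feature.append(sum(cells) / len(cells) if cells else 0.0)
    return feature
```

The resulting vector has `grid * grid` entries in the range 0 to 1, so characters of different sizes yield features of the same length.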

7. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:

Step c01. Perform a two-dimensional discrete Fourier transform on the numeric character block, then proceed to step c02;

Step c02. Apply a centering transformation to the numeric character block that has undergone the two-dimensional discrete Fourier transform, that is, divide the numeric character block evenly into four sub-region images and exchange them diagonally, to obtain the Fourier image spectrum, then proceed to step c03;

Step c03. Analyze the Fourier coefficients of the centered Fourier image spectrum, and obtain the region in which the Fourier coefficients of the numeric character block that exceed a preset amplitude threshold are concentrated; this region constitutes the significant Fourier coefficient region; then proceed to step c04;

Step c04. From the significant Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them to constitute the Fourier feature of the numeric character in the numeric character block.
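The Fourier feature can be sketched with a naive transform that is adequate for small character blocks. Two points are assumptions rather than the patent's stated procedure: the significant-coefficient region is taken to be a fixed square window around the centred zero-frequency term (instead of being located by an amplitude threshold), and normalization divides by the largest magnitude in that window. All function names are hypothetical.

```python
import cmath


def dft2(block):
    """Step c01: naive two-dimensional discrete Fourier transform."""
    h, w = len(block), len(block[0])
    return [[sum(block[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]


def center_spectrum(spectrum):
    """Step c02: move the zero-frequency term to the middle of the spectrum
    (for even sizes this is the diagonal exchange of the four quadrants)."""
    h, w = len(spectrum), len(spectrum[0])
    return [[spectrum[(u - h // 2) % h][(v - w // 2) % w] for v in range(w)]
            for u in range(h)]


def fourier_feature(block, size=2):
    """Steps c03 and c04: magnitudes in a central window, scaled to at most 1."""
    centered = center_spectrum(dft2(block))
    cy, cx = len(centered) // 2, len(centered[0]) // 2
    first_y, first_x = cy - size // 2, cx - size // 2
    magnitudes = [abs(centered[y][x])
                  for y in range(first_y, first_y + size)
                  for x in range(first_x, first_x + size)]
    peak = max(magnitudes)
    return [m / peak for m in magnitudes] if peak else magnitudes
```

A fixed central window is used here because, for compact shapes such as digits, most spectral energy lies at low frequencies; a thresholded region as described in step c03 would adapt the window to the data instead.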

8. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:

Step d01. Perform contour extraction on the numeric character in the numeric character block, then proceed to step d02;

Step d02. Apply invariant moment processing to the contour of the numeric character in the numeric character block, and extract a preset number of two-dimensional contour invariant moment features to constitute the contour moment feature of the numeric character in the numeric character block.
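The claim does not name the invariant moments it uses. As an illustration only, the sketch below assumes the first two of Hu's moment invariants, computed over the contour pixels with the usual region-style normalization; the contour is taken to be the set of white pixels that have at least one non-white 4-neighbour. The function names are hypothetical.

```python
def contour_pixels(block):
    """Step d01: white pixels (value 1) with at least one non-white 4-neighbour."""
    h, w = len(block), len(block[0])

    def white(y, x):
        return 0 <= y < h and 0 <= x < w and block[y][x] == 1

    return [(x, y) for y in range(h) for x in range(w)
            if block[y][x] == 1 and not all(
                white(ny, nx)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))]


def contour_moment_feature(block):
    """Step d02 (assumed form): first two Hu invariants of the contour points."""
    points = contour_pixels(block)
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    def eta(p, q):
        # Central moment about the contour centroid, normalized by point count.
        mu = sum((x - mean_x) ** p * (y - mean_y) ** q for x, y in points)
        return mu / n ** (1 + (p + q) / 2)

    hu1 = eta(2, 0) + eta(0, 2)
    hu2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return [hu1, hu2]
```

Because the moments are taken about the contour centroid, the feature does not change when the character is shifted inside its block, which is the property that makes it complementary to position-sensitive grid features.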

9. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that step 007 specifically comprises the following steps:

Step e01. By permutation and combination, for all recognition features of the numeric characters in all data images, combine any two recognition features to form all recognition feature combinations, then proceed to step e02;

Step e02. From all recognition features of the numeric characters in all data images, form the sample set $S$ corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition feature combination, according to the following formula (1):

$$C_{ij,A} = \frac{E(S_i \cup S_j) - E(S_i \cap S_j)}{E(S)} \tag{1}$$

obtain the feature complementarity index $C_{ij,A}$ of that recognition feature combination with respect to each of the standard digits "0" through "9", thereby obtaining the feature complementarity indices $C_{ij,A}$ of every recognition feature combination with respect to each of the standard digits "0" through "9"; then proceed to step e03; wherein $S_i$ and $S_j$ denote the subsets of the sample set $S$ that are misclassified by recognition feature $F_i$ and by recognition feature $F_j$ respectively; $E(S)$ denotes the number of samples in the sample set $S$; $E(S_i \cup S_j)$ denotes the number of samples in the union of sample set $S_i$ and sample set $S_j$; $E(S_i \cap S_j)$ denotes the number of samples in the intersection of sample set $S_i$ and sample set $S_j$; $A = \{0, 1, \ldots, 9\}$; and $C_{ij,A}$ denotes the feature complementarity index, with respect to standard digit $A$, of the recognition feature combination formed by recognition feature $F_i$ and recognition feature $F_j$;

Step e03. For each recognition feature combination, according to the following formula (2):

$$TC_k = \frac{\sum_{0,\, i \neq j}^{9} C_{ij}}{A_{10}^{2}} \tag{2}$$

obtain the overall complementarity index $TC_k$ of each recognition feature combination with respect to the standard digits, then proceed to step e04; wherein $k = \{1, \ldots, K\}$, $K$ denotes the number of recognition feature combinations, and $TC_k$ denotes the overall complementarity index of the $k$-th recognition feature combination with respect to the standard digits;

Step e04. Sort all recognition feature combinations by overall complementarity index in descending order, take the two top-ranked recognition feature combinations, and perform feature fusion on these two recognition feature combinations to form the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table.
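Formula (1) is a set computation and can be sketched directly. The sketch assumes that each recognition feature has already been evaluated on the sample set and that the identifiers of the samples it misclassifies are available as a Python set; the function names and feature labels are hypothetical. The ranking helper orders feature pairs by formula (1) only and does not reproduce the averaging of formula (2).

```python
from itertools import combinations


def complementarity(misclassified_i, misclassified_j, total_samples):
    """Formula (1): (|S_i union S_j| - |S_i intersect S_j|) / |S|."""
    union = len(misclassified_i | misclassified_j)
    intersection = len(misclassified_i & misclassified_j)
    return (union - intersection) / total_samples


def rank_feature_pairs(errors_by_feature, total_samples):
    """Score every pair of features and sort from most to least complementary."""
    scores = {
        (a, b): complementarity(errors_by_feature[a], errors_by_feature[b],
                                total_samples)
        for a, b in combinations(sorted(errors_by_feature), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The index is large when two features make mistakes on different samples and small when they fail on the same ones, which is why the highest-ranked combinations are the ones chosen for fusion in step e04.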

10. The paper hydrological yearbook digitization method according to any one of claims 1 to 4, characterized in that, in step 008, according to the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, and the recognition features of the numeric character in each numeric character block of each data image, the digit corresponding to each numeric character block in each data image is obtained through a support vector machine classifier.