CN105938547A - Paper hydrologic yearbook digitalization method - Google Patents


Info

Publication number
CN105938547A
Authority
CN
China
Prior art keywords
numeric character
feature
numerical value
subsequently
character block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610232680.9A
Other languages
Chinese (zh)
Other versions
CN105938547B (en)
Inventor
李士进
陈婉婉
郑展
郝立
蒋亚平
高祥涛
胡金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201610232680.9A
Publication of CN105938547A
Application granted
Publication of CN105938547B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a paper hydrologic yearbook digitalization method. Building on single features, a feature fusion method with strong complementarity is proposed, which improves the recognition rate. Because hydrologic processes are influenced by similar seasonal climatic factors and other random factors, they exhibit similarity; that is, the flow values are contextually correlated. In view of this correlation, a time-series-based post-recognition error correction mechanism is also proposed: after classifier recognition, the results are corrected according to a defined criterion. Experiments show that the mechanism effectively improves recognition accuracy while maintaining working efficiency.

Description

A paper hydrologic yearbook digitalization method
Technical field
The present invention relates to a paper hydrologic yearbook digitalization method and belongs to the interdisciplinary field of computer image processing and hydrology.
Background art
Paper hydrologic yearbooks record the most basic hydrologic survey data. These data contain information on the long-term evolution of nature and on the effects of human activity, and they play an important role in production, scientific research, and public service. Because the yearbooks are old, frequently used, and poorly stored, they have gradually begun to deteriorate. Once they suffer man-made or natural damage, the loss is hard to make up, so rescuing these precious historical records has become an urgent problem. The most effective way to protect a hydrologic yearbook is to scan it and produce electronic files. Prior-art studies of yearbook digitalization have therefore proposed intelligent recognition of yearbook data, in which recognizing the digits in the hydrological data (numeric character recognition) is the key task.
Hydrological data are published year by year and presented in unified, standardized tables and charts. The content is mainly the commonly needed basic hydrologic data that were measured in the previous year and have been strictly reviewed and compiled. In the table, the horizontal direction lists the months and the vertical direction lists the days of each month; the bottom of the table holds the monthly mean flow, maximum flow, minimum flow, annual statistics, and notes. Therefore, before the yearbook digits are recognized, layout analysis is first performed to extract the table lines.
Compared with Chinese characters, the numeric characters in a hydrologic yearbook are more standardized and have fewer strokes, so their features are easier to extract. However, their shapes vary little and they carry very little stroke information, which in a sense makes it harder to extract effective feature vectors. For example, when the ink is heavy, the upper half of a "6" printed in a Song typeface sometimes closes into a small circle, so that it looks almost the same as an "8". Likewise, when the ink is heavy or the font is very small, "1" and "3", or "2" and "7", may yield identical feature vectors. Therefore, in practical applications, prior-art recognition of hydrological data suffers from low accuracy and low efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a paper hydrologic yearbook digitalization method that adopts a new feature fusion design, effectively improves the recognition rate, and ensures working efficiency.
To solve the above technical problem, the present invention adopts the following technical solution: a paper hydrologic yearbook digitalization method comprising the following steps:
Step 001. According to the layout of the paper hydrologic yearbook page, determine the pixel position of the hydrological data table within the page, then proceed to step 002;
Step 002. According to the pixel position of the hydrological data table in the page, perform vertical and horizontal projection on the table, analyze the resulting vertical and horizontal projection profiles separately, and extract the abscissa of each vertical line and the ordinate of each horizontal line in the table, then proceed to step 003;
Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line, obtain from the projection image of the table the data image in each numeric cell, then proceed to step 004; in each data image of the table, the numeric characters are white and the background is black;
Step 004. For each data image, segment the numeric characters in the data image to obtain the numeric character blocks in that data image, and thereby obtain the numeric character blocks in every data image, then proceed to step 005;
Step 005. For each numeric character block in each data image, extract the grid feature, Fourier feature, and contour moment feature of the numeric character in the block, which together serve as the recognition features of that numeric character, and thereby obtain the recognition features of the numeric character in every block of every data image, then proceed to step 006;
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exists running downward from the top edge of the block; if so, judge that the block contains a decimal point; otherwise perform no further operation. After every numeric character block in every data image has been judged, proceed to step 007;
Step 007. Perform feature fusion on all recognition features of the numeric characters in all data images to form the digit recognition features corresponding to "0" through "9" in the hydrological data table, then proceed to step 008;
Step 008. According to the digit recognition features corresponding to "0" through "9" in the hydrological data table and the recognition features of the numeric character in each block of each data image, obtain, through a preset classifier, the digit corresponding to each numeric character block in each data image, then proceed to step 009;
Step 009. According to the digit or decimal point corresponding to each numeric character block in each data image, form the numeric value corresponding to the data image in each numeric cell of the hydrological data table; combined with the attributes defined by the table format, obtain each attribute of the table and its corresponding value, and store them.
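The line extraction of step 002 can be illustrated with a minimal sketch. It assumes a binary table image stored as a list of rows in which 1 marks a ruled-line or character pixel; the function names, the `min_fill` threshold, and the merging of adjacent indices are illustrative assumptions and are not taken from the patent.

```python
def projection_profiles(image):
    """Return (row_sums, col_sums) of foreground pixels for a binary image.

    image: list of equal-length rows, 1 = foreground, 0 = background.
    """
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def find_lines(profile, length, min_fill=0.9):
    """Locate ruled lines in a projection profile.

    An index counts as part of a line when its projection value covers at
    least min_fill of `length` (the extent the line would span). Runs of
    adjacent indices are merged and reported at the run centre.
    """
    hits = [i for i, value in enumerate(profile) if value >= min_fill * length]
    lines, run = [], []
    for i in hits:
        if run and i != run[-1] + 1:
            lines.append(sum(run) // len(run))
            run = []
        run.append(i)
    if run:
        lines.append(sum(run) // len(run))
    return lines


# Toy 7 x 9 table with horizontal lines at rows 0, 3, 6 and vertical lines at columns 0, 4, 8.
image = [[1 if r in (0, 3, 6) or c in (0, 4, 8) else 0 for c in range(9)]
         for r in range(7)]
rows, cols = projection_profiles(image)
horizontal_ys = find_lines(rows, length=9)   # ordinates of the horizontal lines
vertical_xs = find_lines(cols, length=7)     # abscissas of the vertical lines
```

On a real scan the threshold would probably need tuning, since skew and broken rulings reduce the peak heights.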
As a preferred technical solution of the present invention, the method further comprises the following steps after step 009; after step 009 has been executed, proceed to step 010;
Step 010. For the attributes and corresponding values recognized and stored from the hydrological data table, perform steps 010-01 to 010-02 below on the flow values of each month, thereby obtaining a preliminary recognition judgment for the daily flow values of each month, then proceed to step 011;
Step 010-01. Take the flow value of the first day of the month as a first threshold. Then, for each of the first two daily flow values of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold; if so, judge the current day's flow value to be correctly recognized; otherwise judge the current day's flow value to be preliminarily misrecognized. This yields preliminary recognition judgments for the first two daily flow values of the month; then proceed to step 010-02;
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value; if so, judge the current day's flow value to be correctly recognized; otherwise judge the current day's flow value to be preliminarily misrecognized. This yields preliminary recognition judgments for each daily flow value of the month from the third day onward;
Step 011. According to each value recognized and stored from the hydrological data table and the recognition features of each digit in each value, obtain, through a preset trainer, the ten recognition result probabilities, corresponding to "0" through "9", of each digit in each stored value, then proceed to step 012;
Step 012. For each digit in each value recognized and stored from the hydrological data table, obtain the largest and the second-largest of the ten recognition result probabilities corresponding to "0" through "9", compute the difference between them, and judge whether this difference is less than a preset recognition result probability threshold; if so, judge the digit to be preliminarily misrecognized; otherwise judge the digit to be correctly recognized. This yields preliminary recognition judgments for each digit in each stored value; then proceed to step 013;
Step 013. For each preliminarily misrecognized flow value in each month, judge whether the flow value contains a preliminarily misrecognized digit; if so, judge the flow value to be erroneous and raise an alarm; otherwise judge the flow value to be correct. This completes the check of every value recognized and stored from the hydrological data table.
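The checks of steps 010 to 013 can be sketched as follows. This is a minimal illustration under stated assumptions: the difference between consecutive days is taken as an absolute value, the last day of the month (which has no next day) is not checked, and the function names and the `margin` default are hypothetical. The per-digit probability vectors are assumed to come from the trainer of step 011.

```python
def flag_flows(flows):
    """Steps 010-01 and 010-02: indices of daily flows that fail the time-series check."""
    flagged = []
    for d in range(len(flows) - 1):      # assumption: the last day is not checked
        # First two days use the first day's flow as the limit, later days use the previous day's flow.
        limit = flows[0] if d < 2 else flows[d - 1]
        if abs(flows[d + 1] - flows[d]) >= limit:
            flagged.append(d)
    return flagged


def suspect_digits(digit_probs, margin=0.2):
    """Step 012: positions whose top two class probabilities differ by less than margin."""
    suspects = []
    for pos, probs in enumerate(digit_probs):
        top, second = sorted(probs, reverse=True)[:2]
        if top - second < margin:
            suspects.append(pos)
    return suspects


def confirm_errors(flows, probs_per_value, margin=0.2):
    """Step 013: a flagged flow counts as an error only if it also contains a suspect digit."""
    return [d for d in flag_flows(flows)
            if suspect_digits(probs_per_value[d], margin)]


confident = [0.91] + [0.01] * 9
ambiguous = [0.01] * 10
ambiguous[1] = ambiguous[7] = 0.46       # "1" and "7" are nearly tied

flows = [10.0, 11.0, 12.0, 72.0, 13.0, 14.0]
probs = [[confident] * 3 for _ in flows]
probs[3] = [ambiguous, confident, confident]

print(flag_flows(flows))                 # days failing the time-series check
print(confirm_errors(flows, probs))      # days that should raise an alarm
```

In this toy month the jump to 72.0 flags both the day before the jump and the day of the jump; the digit-level confidence of step 012 is what clears the correctly recognized neighbour.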
As a preferred technical solution of the present invention: in step 011, according to each value recognized and stored from the hydrological data table and the recognition features of each digit in each value, the ten recognition result probabilities, corresponding to "0" through "9", of each digit in each stored value are obtained through a support vector machine trainer.
As a preferred technical solution of the present invention: in step 013, when a preliminarily misrecognized flow value is judged to be erroneous because it contains a preliminarily misrecognized digit and an alarm is raised, the position of the misrecognized digit within the flow value is also analyzed. If the misrecognized digit lies in the integer part of the flow value, the flow value is replaced by the mean of the flow values of the previous day and the next day relative to its date. If the misrecognized digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced by the mean of the fractional parts of the flow values of the previous day and the next day.
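The replacement rule above amounts to a short calculation. The following sketch assumes non-negative flow values and that the caller already knows whether the suspect digit lies in the integer or the fractional part; the function name and signature are illustrative.

```python
import math


def correct_flow(prev_flow, flow, next_flow, error_in_integer_part):
    """Replace an erroneous flow value using its two neighbouring days."""
    if error_in_integer_part:
        # Integer part unreliable: replace the whole value by the neighbours' mean.
        return (prev_flow + next_flow) / 2
    # Fractional part unreliable: keep the integer part, average the neighbours' fractions.
    mean_fraction = (math.modf(prev_flow)[0] + math.modf(next_flow)[0]) / 2
    return math.floor(flow) + mean_fraction


print(correct_flow(12.0, 72.0, 13.0, error_in_integer_part=True))
print(correct_flow(12.2, 12.9, 12.4, error_in_integer_part=False))
```

Because the fractions are floating-point values, the second result is only approximately 12.3; rounding to the number of decimals printed in the yearbook would be a natural final step.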
As a preferred technical solution of the present invention, in step 004, segmenting the numeric characters in a data image to obtain the numeric character blocks in that data image specifically comprises the following steps:
Step a01. Detect each white pixel inside each numeric character in the data image, and, for each edge of the data image, the white pixel on the corresponding numeric character that lies at the minimum distance from that edge, then proceed to step a02;
Step a02. For each white pixel obtained from the data image in the previous step, judge whether the pixels immediately above, below, left, and right of it are all white; if so, judge the pixel to be an interior pixel of a numeric character; otherwise mark the pixel with an identifier as an edge pixel of the character and obtain the column number of the pixel column in which it lies in the data image. Judging every white pixel obtained in the previous step in this way yields the column numbers of the pixel columns containing the top-edge pixels of each numeric character in the data image; then proceed to step a03;
Step a03. According to the column numbers of the pixel columns containing the top-edge pixels of each numeric character in the data image, divide the numeric characters in the data image to obtain each numeric character block in the data image.
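One possible reading of steps a01 to a03 is sketched below. It assumes white characters are stored as 1 on a background of 0, that neighbouring characters are separated by at least one empty pixel column, and that a block is described by its first and last column; these choices, and the function names, are assumptions made for illustration.

```python
def edge_columns(image):
    """Columns containing at least one edge pixel.

    An edge pixel is a white pixel with a non-white (or out-of-image)
    pixel directly above, below, left, or right of it.
    """
    height, width = len(image), len(image[0])
    columns = set()
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            interior = all(0 <= ny < height and 0 <= nx < width and image[ny][nx]
                           for ny, nx in neighbours)
            if not interior:
                columns.add(x)
    return sorted(columns)


def split_characters(image):
    """Group consecutive edge columns into (first_column, last_column) character blocks."""
    blocks = []
    for x in edge_columns(image):
        if blocks and x == blocks[-1][1] + 1:
            blocks[-1][1] = x
        else:
            blocks.append([x, x])
    return [tuple(block) for block in blocks]


cell = [[0, 1, 1, 0, 0, 1, 0],
        [0, 1, 1, 0, 0, 1, 0],
        [0, 1, 1, 0, 0, 1, 0]]
print(split_characters(cell))
```

Touching characters would end up in a single block under this sketch and would need an additional splitting rule.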
As a preferred technical solution of the present invention, in step 005, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the upper, lower, left, and right boundaries of the numeric character block, and thereby obtain the numeric character body image, then proceed to step b02;
Step b02. Perform centre-of-gravity normalization on the numeric character body image, and divide the normalized body image evenly into a preset number of sub-region images, then proceed to step b03;
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character body image; these proportions together form the grid feature of the numeric character in the block.
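A minimal sketch of the grid feature follows. It covers the bounding-box crop of step b01 and the per-cell white-pixel proportions of step b03; the centre-of-gravity normalization of step b02 is omitted for brevity, and the `n` x `n` grid size is an assumed parameter. The block is assumed to contain at least one white pixel.

```python
def crop_to_character(block):
    """Step b01: crop a binary block (1 = white character pixel) to its bounding box."""
    rows = [y for y, row in enumerate(block) if any(row)]
    cols = [x for x in range(len(block[0])) if any(row[x] for row in block)]
    return [row[cols[0]:cols[-1] + 1] for row in block[rows[0]:rows[-1] + 1]]


def grid_feature(block, n=2):
    """Step b03: proportion of white pixels in each cell of an n x n grid."""
    body = crop_to_character(block)
    height, width = len(body), len(body[0])
    feature = []
    for gy in range(n):
        y0, y1 = gy * height // n, (gy + 1) * height // n
        for gx in range(n):
            x0, x1 = gx * width // n, (gx + 1) * width // n
            cell = [body[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feature.append(sum(cell) / len(cell) if cell else 0.0)
    return feature


# A ring-shaped character (like "0") padded by one background pixel on each side.
zero = [[0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0]]
print(grid_feature(zero))
```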
As a preferred technical solution of the present invention, in step 005, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Perform a two-dimensional discrete Fourier transform on the numeric character block, then proceed to step c02;
Step c02. Centre the transformed numeric character block: divide it evenly into four sub-region images and exchange them diagonally to obtain the Fourier spectrum image, then proceed to step c03;
Step c03. Analyze the Fourier coefficients of the centred Fourier spectrum image to find the region in which the Fourier coefficients of the block that exceed a preset amplitude threshold are concentrated; this region constitutes the large-amplitude Fourier coefficient region; then proceed to step c04;
Step c04. Extract a preset number of discrete Fourier transform coefficients from the large-amplitude Fourier coefficient region and normalize them to form the Fourier feature of the numeric character in the block.
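Steps c01 to c04 can be sketched with a direct (naive) two-dimensional DFT, which is adequate for small character blocks; a fast Fourier transform routine would normally be used instead. As a simplification, the large-amplitude region of step c03 is replaced here by a fixed window around the spectrum centre, and normalization divides by the largest magnitude in that window. The `radius` parameter and function names are assumptions, and the block must be at least 3 pixels in each dimension for `radius=1`.

```python
import cmath


def dft2(block):
    """Step c01: naive 2-D discrete Fourier transform of a small block."""
    height, width = len(block), len(block[0])
    return [[sum(block[y][x] * cmath.exp(-2j * cmath.pi * (u * y / height + v * x / width))
                 for y in range(height) for x in range(width))
             for v in range(width)]
            for u in range(height)]


def center(spectrum):
    """Step c02: swap diagonal quadrants so the zero-frequency term sits at the centre."""
    height, width = len(spectrum), len(spectrum[0])
    return [[spectrum[(u - height // 2) % height][(v - width // 2) % width]
             for v in range(width)]
            for u in range(height)]


def fourier_feature(block, radius=1):
    """Steps c03 and c04 (simplified): normalized magnitudes in a window around the centre."""
    spectrum = center(dft2(block))
    cy, cx = len(spectrum) // 2, len(spectrum[0]) // 2
    magnitudes = [abs(spectrum[cy + dy][cx + dx])
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    peak = max(magnitudes) or 1.0        # avoid division by zero for an empty block
    return [m / peak for m in magnitudes]


digit = [[0, 1, 1, 0],
         [0, 0, 1, 0],
         [0, 1, 0, 0],
         [0, 1, 1, 1]]
print(fourier_feature(digit))
```

The window is centred on the low-frequency terms because that is where most of the energy of a compact binary shape is expected to lie.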
As a preferred technical solution of the present invention: in step 005, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block, then proceed to step d02;
Step d02. Apply invariant moment processing to the contour of the numeric character in the block, and extract a preset number of two-dimensional contour invariant moment features to form the contour moment feature of the numeric character in the block.
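The contour moment feature can be illustrated with the first two Hu invariant moments computed over the contour pixels. The source does not say which invariant moments or how many are used, so this choice, the 4-neighbour contour definition, and the region-style normalization applied to contour points are all assumptions.

```python
def contour_points(block):
    """Step d01: white pixels with at least one non-white or out-of-image 4-neighbour."""
    height, width = len(block), len(block[0])
    points = []
    for y in range(height):
        for x in range(width):
            if not block[y][x]:
                continue
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(not (0 <= ny < height and 0 <= nx < width and block[ny][nx])
                   for ny, nx in neighbours):
                points.append((x, y))
    return points


def hu_first_two(points):
    """Step d02 (illustrative): first two Hu invariant moments of a point set."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def eta(p, q):
        # Normalized central moment; mu_00 equals n for unit-weight points.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


def contour_moment_feature(block):
    """Contour extraction followed by invariant moments."""
    return hu_first_two(contour_points(block))


square = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
print(contour_moment_feature(square))
```

These two moments are unchanged by translation and by rotation in 90-degree steps on the pixel grid, which is the property that makes them useful for characters that are slightly displaced within their blocks.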
As a preferred technical solution of the present invention, step 007 specifically comprises the following steps:
Step e01. By permutation and combination, combine any two of the recognition features of the numeric characters in all data images to form all recognition feature combinations, then proceed to step e02;
Step e02. From all recognition features of the numeric characters in all data images, form the sample set S corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition feature combination, according to the following formula (1):
$$C_{ij,A} = \frac{E(S_i \cup S_j) - E(S_i \cap S_j)}{E(S)} \tag{1}$$
obtain the feature complementarity index $C_{ij,A}$ of that recognition feature combination relative to each standard digit "0" through "9", and thereby obtain the feature complementarity index $C_{ij,A}$ of every recognition feature combination relative to each standard digit "0" through "9"; then proceed to step e03. Here, $S_i$ and $S_j$ denote the subsets of the sample set S that are misclassified by recognition feature $F_i$ and recognition feature $F_j$ respectively; $E(S)$ denotes the number of samples in S; $E(S_i \cup S_j)$ denotes the number of samples in the union of $S_i$ and $S_j$; $E(S_i \cap S_j)$ denotes the number of samples in the intersection of $S_i$ and $S_j$; $A = \{0, 1, \ldots, 9\}$; and $C_{ij,A}$ denotes the feature complementarity index, relative to standard digit A, of the recognition feature combination formed by $F_i$ and $F_j$;
Step e03. For each recognition feature combination, according to the following formula (2):
$$TC_k = \frac{\sum_{A=0,\; i \neq j}^{9} C_{ij,A}}{10^{2}} \tag{2}$$
obtain the overall complementarity index $TC_k$ of each recognition feature combination relative to the standard digits, then proceed to step e04. Here, $k = \{1, \ldots, K\}$, K denotes the number of recognition feature combinations, and $TC_k$ denotes the overall complementarity index of the k-th recognition feature combination relative to the standard digits;
Step e04. Sort all recognition feature combinations by overall complementarity index in descending order, take the two top-ranked combinations, and perform feature fusion on these two combinations to form the digit recognition features corresponding to "0" through "9" in the hydrological data table.
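The ranking of steps e01 to e04 can be sketched as below. The sketch assumes that, for every feature and every digit A, the set of misclassified sample identifiers is already known from single-feature experiments, and that E(S) is counted per digit; both are assumptions, as are the function and variable names. The constant denominator follows formula (2) as read here and does not affect the ranking.

```python
from itertools import combinations


def complementarity(wrong_i, wrong_j, n_samples):
    """Formula (1): (|S_i union S_j| - |S_i intersect S_j|) / |S| for one digit."""
    return (len(wrong_i | wrong_j) - len(wrong_i & wrong_j)) / n_samples


def rank_feature_pairs(wrong, totals):
    """Steps e01 to e04: score every feature pair and sort by overall index, best first.

    wrong:  {feature_name: {digit: set of misclassified sample ids}}
    totals: {digit: number of samples of that digit}
    """
    scores = {}
    for fi, fj in combinations(sorted(wrong), 2):
        total = sum(complementarity(wrong[fi][a], wrong[fj][a], totals[a])
                    for a in range(10))
        scores[(fi, fj)] = total / 10 ** 2   # constant denominator as in formula (2)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical error sets: only digit 8 is ever misclassified in this toy example.
totals = {a: 10 for a in range(10)}
wrong = {name: {a: set() for a in range(10)} for name in ("grid", "fourier", "moment")}
wrong["grid"][8] = {1, 2}
wrong["fourier"][8] = {3, 4}
wrong["moment"][8] = {1, 2, 3}

for pair, score in rank_feature_pairs(wrong, totals):
    print(pair, score)
```

A pair scores high when the two features fail on different samples, which is the situation in which fusing them is most likely to help.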
As a preferred technical solution of the present invention, in step 008, according to the digit recognition features corresponding to "0" through "9" in the hydrological data table and the recognition features of the numeric character in each block of each data image, the digit corresponding to each numeric character block in each data image is obtained through a support vector machine (SVM) classifier.
Compared with the prior art, the paper hydrologic yearbook digitalization method and control method of the present invention, adopting the above technical solution, have the following technical effects. The method proposes, on the basis of single features, a feature fusion method with stronger complementarity, which improves the recognition rate. Because hydrologic processes are influenced by similar seasonal climatic factors and other random factors, they exhibit similarity; that is, the flow values are contextually correlated. In view of this correlation, the invention also proposes a time-series-based post-recognition error correction mechanism: after classifier recognition, the results are corrected according to a defined criterion. Experiments confirm that the proposed mechanism effectively improves recognition accuracy while ensuring working efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the paper hydrologic yearbook digitalization method and control method designed in the present invention;
Fig. 2a is a schematic diagram of the horizontal projection of the hydrological data table in the embodiment;
Fig. 2b is a schematic diagram of the vertical projection of the hydrological data table in the embodiment;
Fig. 3 is a schematic diagram of the table formed by the vertical and horizontal lines extracted from the hydrological data table in the embodiment;
Fig. 4 is a schematic diagram of the hydrologic yearbook layout analysis in the embodiment;
Fig. 5 is a schematic diagram of the data images obtained from each numeric cell of the hydrological data table in the embodiment;
Fig. 6 is a schematic diagram of the numeric character blocks in the data images obtained in the embodiment.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
In daily business activities, large numbers of documents and forms are used every day. Form documents are used in every field, and people usually have to process them by hand; for example, customers need to pay taxes, and librarians need to collect the data contained in paper forms. With the development of optical character recognition (OCR) technology, people have begun to use standard form images to extract the data in forms, which reduces working time and lightens the workload. In the commercial field, OCR can improve work quality and cut the large amount of time spent processing form documents. In many fields where OCR is used, the acquired form template lets the user know the target printed character strings in the image. These strings cover many kinds of content, such as flow information, text, and mathematical formulae. The presence of the table hinders the extraction of the data, so table line detection is an important task in printed table recognition.
In printed hydrological documents, tables are an indispensable part: they concentrate all of the document's information and let the reader grasp its meaning accurately, in a concise and standardized way. Examining the flow tables of the major hydrometric stations in the hydrologic yearbooks shows that the layout of these tables follows regular rules, and these rules can be used to segment the characters.
A hydrologic yearbook is the carrier in which a hydrological agency publishes, after processing and compilation in the following year, the results of its hydrologic monitoring of the rivers and water bodies in a basin. Its content includes the compiled results and summary materials, presented as charts and tables with the necessary explanatory text; it is a systematic, standardized treasury of hydrological data.
In 1958, the Hydrological Bureau of the Ministry of Water Resources uniformly defined the volume scope of hydrological data nationwide by river basin and water system, and uniformly named the annually published data the "Hydrological Yearbook of the People's Republic of China", divided nationwide into 10 volumes and 94 books. Its characteristics are as follows.
Color characteristic: yellow end surplus.
Structural features: the paper is 440 mm wide and 140 mm high, an aspect ratio of 3.14. The digits in the yearbook are about 15 mm wide and about 24 mm high, an aspect ratio of 0.625. The characters lie inside the table.
Texture features: the yearbook contains character-like regions, in which the gray level of the digits shows regular peak-and-trough variation in the horizontal and vertical directions.
Hydrologic yearbook characters are arranged in regular horizontal rows and have fairly stable structural and texture features. Projection-based top-down layout analysis exploits exactly this property. In the character regions of the yearbook, edge information is very rich; by detecting and analyzing character edge information with a suitable tool, the hydrological data can be separated from the background. The pixel values in yearbook regions fluctuate in a characteristic way, and the frequency of change stays within a certain range; these features can be used to locate the yearbook characters. Because the digit regions of the yearbook have richer horizontal and vertical features than the non-digit regions, a character location algorithm based on horizontal and vertical projection is used: the transition points of the projection are obtained, and candidate character regions are determined from the number of transition points and the distances between them.
A blank margin of about 275 pixels lies below the top edge of the page, followed by the basin name and hydrometric station name of the yearbook together with the words "daily mean flow table". About 30 pixels to the right of these words, the catchment area and the unit of flow are given. The table begins about 20 pixels from this position. The yearbook table consists of 11 horizontal lines and 14 vertical lines. The months are given between the first two horizontal lines, and the days of the month are given between the first two vertical lines; the regions between each subsequent pair of vertical lines and above the third horizontal line hold the flow values of each month. Between the subsequent horizontal lines are given, for each month, the mean flow, the maximum flow and its date, the minimum flow and its date, the annual statistics, and the notes. The final goal is to recognize the flow values, so layout analysis must first be performed on the hydrological data to analyze the table structure and extract the table frame lines, so that the flow values of each specific month can be located.
As it is shown in figure 1, the present invention devises a kind of papery Water Year Book digitizing solution, first have to for papery Water Year Book In the page, hydrological data form is taken pictures, and obtains hydrological data tabular drawing picture, and carries out pretreatment operation, including figure As binaryzation, gray processing, denoising, rotation and inverse process;Then for the hydrological data tabular drawing picture of pretreatment operation, Specifically carry out following steps:
Step 001. Building on further study of document layout analysis algorithms, and on the basis of the classical document layout segmentation algorithms (top-down and bottom-up), the advantages of the two classical approaches are combined: structural features and texture features are used together to process the document layout of the hydrological yearbook. This approach takes into account both the accuracy of segmentation and the time consumed by the analysis, so the table can be located quickly and accurately. According to the layout of the paper hydrological yearbook page, the pixel position of the hydrological data table within the page is determined; then proceed to step 002.
Step 002. According to the pixel position of the hydrological data table in the paper hydrological yearbook page, vertical and horizontal projections of the hydrological data table are computed; the horizontal projection is shown in Fig. 2a and the vertical projection in Fig. 2b. The vertical and horizontal projection profiles of the table are then analyzed separately. In Fig. 2a, the 11 black dots mark the horizontal lines of the yearbook table, and the hollow dots after the second black dot mark the upper and lower positions of each row of flow values; that is, the two sides of each subsequent peak give the upper and lower positions of the rows of flow values for the 1st through the 31st. In Fig. 2b, the 14 black dots mark the abscissas of the 14 vertical lines of the table; between every two black dots, i.e. between every two vertical lines, the two sides of the peak give the left and right coordinates of the flow values of each month and are marked with hollow dots. The abscissa of each vertical line and the ordinate of each horizontal line in the hydrological data table are extracted; a practical example is shown in Fig. 3, in which the numeric characters in each data image of the hydrological data table are white and the background is black. Figs. 2a and 2b therefore allow the flow values of each month and the table position to be located coarsely; the final result of the yearbook layout analysis is shown in Fig. 4. Then proceed to step 003.
Counting the black pixels in each row or column avoids detecting straight line segments directly, places little demand on the connectivity of the table lines, and gives good robustness to interference and good generalization. The method yields useful information such as the position and size of the targets in the image, which simplifies the subsequent location of the yearbook digits.
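As a non-limiting illustration, the projection analysis of step 002 can be sketched in Python as follows. The sketch assumes a binarized image stored as a list of rows in which table-line and character pixels have the value 1; the function names `projection_profiles` and `line_positions` and the `min_fraction` parameter are hypothetical and are not part of the method as described.

```python
from typing import List, Tuple


def projection_profiles(img: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Count foreground pixels (value 1) in every row and every column."""
    row_counts = [sum(row) for row in img]
    col_counts = [sum(col) for col in zip(*img)]
    return row_counts, col_counts


def line_positions(profile: List[int], full_length: int,
                   min_fraction: float = 0.8) -> List[int]:
    """Return the centre index of each run of profile values that covers at
    least min_fraction of full_length (assumed here to be a table line)."""
    threshold = min_fraction * full_length
    centres: List[int] = []
    start = None
    for i, value in enumerate(profile + [0]):  # sentinel closes a trailing run
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            centres.append((start + i - 1) // 2)
            start = None
    return centres


# Toy 7 x 9 table: horizontal lines in rows 0, 3, 6; vertical lines in columns 0, 4, 8.
table = [[1 if (y % 3 == 0 or x % 4 == 0) else 0 for x in range(9)]
         for y in range(7)]
rows, cols = projection_profiles(table)
horizontal = line_positions(rows, full_length=9)  # ordinates of horizontal lines
vertical = line_positions(cols, full_length=7)    # abscissas of vertical lines
```

Because only per-row and per-column counts are used, a table line with small gaps still produces a clear peak, which is the tolerance to broken lines noted above.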
Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line in the table, the data image in each numeric cell of the hydrological data table is obtained from the projection image of the table; a practical example is shown in Fig. 5. Then proceed to step 004. Here, the numeric characters in each data image of the hydrological data table are white and the background is black.
In digitizing a paper hydrological yearbook, the accuracy of the subsequently extracted features can be guaranteed only if the hydrological data image is segmented well. Segmentation of the paper hydrological yearbook image is the foundation of the whole digitization process. After the numbers have been located, each resulting image is still a whole number, including the blank space between its digits. Character segmentation is therefore needed for each extracted number, so that the individual characters are separated from the whole number.
Step 004. For each data image, character segmentation is performed on the numeric characters in the data image to obtain the numeric character blocks in that data image, specifically as follows:
Step a01. Detect every white pixel belonging to the numeric characters in the data image, together with the minimum distance from each edge of the data image to the numeric characters and the corresponding white pixel on the numeric character; then proceed to step a02;
Step a02. For each white pixel obtained from the data image in the previous step, judge whether the pixels immediately above, below, left and right of it are all white pixels. If so, the pixel is judged to be an interior pixel of a numeric character; otherwise, the pixel is marked with an identifier as an edge pixel of a character, and the number of the pixel column in which it lies in the data image is obtained. Judging every white pixel in this way yields the column numbers, within the data image, of the edge pixels on each numeric character; then proceed to step a03;
Step a03. According to the column numbers, within the data image, of the edge pixels on each numeric character, the numeric characters in the data image are divided to obtain the numeric character blocks in the data image.
Based on the above procedure, the numeric character blocks in each data image are obtained; in a practical example, the numeric character blocks obtained from a data image are as shown in Fig. 6. Then proceed to step 005.
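Steps a01 to a03 can be illustrated with the following minimal Python sketch. It assumes a binary data image in which character pixels are 1 (white) and background pixels are 0, and it assumes that neighbouring characters are separated by at least one empty column; touching characters are not handled. The function names are hypothetical.

```python
from typing import List, Tuple


def edge_pixel_columns(img: List[List[int]]) -> List[int]:
    """Columns that contain at least one edge pixel, i.e. a white pixel with
    a background (or out-of-image) pixel above, below, left or right of it."""
    h, w = len(img), len(img[0])
    columns = set()
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            interior = all(
                0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1
                for ny, nx in neighbours
            )
            if not interior:
                columns.add(x)
    return sorted(columns)


def split_into_blocks(img: List[List[int]]) -> List[Tuple[int, int]]:
    """Group consecutive edge-pixel columns into (first_col, last_col) blocks,
    one block per numeric character."""
    blocks: List[Tuple[int, int]] = []
    for col in edge_pixel_columns(img):
        if blocks and col == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], col)
        else:
            blocks.append((col, col))
    return blocks


# Two strokes: a two-column stroke in columns 1-2 and a one-column stroke in column 5.
cell = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
]
blocks = split_into_blocks(cell)
```

Each returned pair gives the first and last column of one numeric character block, which can then be cropped from the data image.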
If the preprocessed data were fed directly to the classifier, the amount of data involved in classification would be large. The purpose of feature extraction is to start from an analysis of the topological structure of the digits and extract certain structural features, so that interference such as displacement, size variation and font distortion is reduced; in other words, the key information reflecting the characteristics of the digits is supplied to the classifier. This indirectly increases the fault tolerance of the classifier, and the amount of data is also greatly reduced after feature extraction. Feature extraction plays a critical role in recognition and should follow these principles:
(1) the features are easy to extract;
(2) the features have strong discriminative power, i.e. they differ markedly between different digits and differ very little between instances of the same digit;
(3) the features are stable, minimizing the influence of broken or touching strokes.
Step 005. For each numeric character block in each data image, the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block are extracted and together form the recognition features of that numeric character; the recognition features of the numeric character in every numeric character block of every data image are thus obtained. Then proceed to step 006.
The grid feature is a group of features that emphasizes the overall distribution of the character image, and it suppresses noise very well. The main idea of its extraction is to divide the dot matrix of the digit into several local regions and to use the density within each region as the descriptor, i.e. the percentage of character pixels in each region is taken as the feature value. The grid feature reflects local statistics of the image and is a relative percentage, whereas local deformation or noise corresponds to swapping the values "0" and "1" of a few elements of the dot matrix. The percentages computed for an image with local deformation or noise therefore differ little from those of the original image without deformation or noise. In other words, these relative values are insensitive to deformation of parts of the strokes and to isolated noise points, so digit recognition based on grid features has good noise resistance. Here, each segmented digit is divided into a 3 × 3 grid of small regions, 9 in total.
In step 005 above, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the upper, lower, left and right boundaries of the numeric character block, and thereby obtain the image of the numeric character itself; then proceed to step b02.
Step b02. Apply centroid normalization to this numeric character image, and divide the centroid-normalized image evenly into a preset number of sub-region images; then proceed to step b03.
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character image; these proportions together form the grid feature of the numeric character in the numeric character block.
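The grid feature of steps b01 to b03 can be sketched in Python as follows. This is only an illustration: it assumes a binary block with white pixels equal to 1, uses the 3 × 3 partition mentioned above as the default, and omits the centroid normalization of step b02 for brevity. The function names are hypothetical.

```python
from typing import List, Tuple


def bounding_box(img: List[List[int]]) -> Tuple[int, int, int, int]:
    """Top, bottom, left and right limits of the white pixels (step b01)."""
    ys = [y for y, row in enumerate(img) if any(row)]
    xs = [x for x, col in enumerate(zip(*img)) if any(col)]
    return min(ys), max(ys), min(xs), max(xs)


def grid_feature(img: List[List[int]], n: int = 3) -> List[float]:
    """Fraction of white pixels in each of the n x n sub-regions (step b03).
    Centroid normalization (step b02) is deliberately left out of this sketch."""
    top, bottom, left, right = bounding_box(img)
    body = [row[left:right + 1] for row in img[top:bottom + 1]]
    h, w = len(body), len(body[0])
    features: List[float] = []
    for gy in range(n):
        y0, y1 = gy * h // n, (gy + 1) * h // n
        for gx in range(n):
            x0, x1 = gx * w // n, (gx + 1) * w // n
            cell = [body[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features


# 8 x 8 block whose character body is a 6 x 6 "corner" shape:
# one white row along the top and one white column along the left.
block = [[0] * 8 for _ in range(8)]
for k in range(1, 7):
    block[1][k] = 1
    block[k][1] = 1
feature = grid_feature(block)
```

Flipping a single pixel in the example changes only one of the nine values, and only by a quarter, which illustrates why the feature is tolerant of isolated noise points.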
The Fourier transform is a two-dimensional orthogonal transform widely used in image processing. After the transform, the mean term, i.e. the DC term, is proportional to the mean gray value of the image, and the low-frequency components indicate the strength and direction of object edges in the image. A digit character can generally be represented by a closed contour made up of many line segments, and a set of discrete values obtained by the transform can adequately reflect the variation of these closed contours. Fourier coefficients describe the boundary contour of an image well, and their values are independent of the translation, rotation, displacement and size of similar glyphs. When glyphs are represented and recognized, these features also provide substantial data compression.
In step 005 above, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Apply a two-dimensional discrete Fourier transform to the numeric character block; then proceed to step c02.
Step c02. Center the transformed numeric character block: divide it evenly into four sub-region images and swap them diagonally to obtain the Fourier spectrum image; then proceed to step c03.
Step c03. Analyze the Fourier coefficients of the centered Fourier spectrum image and find the region in which the coefficients whose amplitude exceeds a preset amplitude threshold are concentrated; this region constitutes the large-amplitude Fourier coefficient region. Then proceed to step c04.
Step c04. Extract a preset number of discrete Fourier transform coefficients from the large-amplitude Fourier coefficient region and normalize them to form the Fourier feature of the numeric character in the numeric character block.
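Steps c01 to c04 can be illustrated with the following Python sketch. To stay dependency-free it uses a direct (slow) two-dimensional DFT, which is adequate only for small blocks. Two simplifications are assumptions of the sketch rather than part of the method: the large-amplitude region of step c03 is taken to be a fixed `k` × `k` window around the centered DC term, and the coefficients are normalized by the DC magnitude. All function names are hypothetical.

```python
import cmath
from typing import List


def dft2(img: List[List[int]]) -> List[List[complex]]:
    """Direct two-dimensional discrete Fourier transform (step c01)."""
    h, w = len(img), len(img[0])
    spec = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(angle)
            spec[u][v] = total
    return spec


def center(spec: List[List[complex]]) -> List[List[complex]]:
    """Swap the four quadrants diagonally so the DC term sits at the centre (step c02)."""
    h, w = len(spec), len(spec[0])
    return [[spec[(u - h // 2) % h][(v - w // 2) % w] for v in range(w)]
            for u in range(h)]


def fourier_feature(img: List[List[int]], k: int = 3) -> List[float]:
    """Magnitudes of the k x k coefficients around the centre, divided by the
    DC magnitude (assumed stand-in for steps c03 and c04)."""
    spec = center(dft2(img))
    cy, cx = len(spec) // 2, len(spec[0]) // 2
    r = k // 2
    mags = [abs(spec[u][v])
            for u in range(cy - r, cy + r + 1)
            for v in range(cx - r, cx + r + 1)]
    dc = abs(spec[cy][cx])
    return [m / dc for m in mags] if dc else mags


flat = fourier_feature([[1] * 4 for _ in range(4)])        # constant block
dot = [[0] * 4 for _ in range(4)]
dot[1][2] = 1
impulse = fourier_feature(dot)                              # single white pixel
```

A constant block has all of its energy in the DC term, whereas a single pixel spreads equal magnitude over every frequency, wherever the pixel is placed; this position independence of the magnitudes is the property relied on above.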
Invariant moments are a statistical feature of an image: mathematical features that are invariant to translation, scaling and rotation.
In step 005 above, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block; then proceed to step d02.
Step d02. Apply invariant-moment processing to the contour of the numeric character in the numeric character block, and extract a preset number of two-dimensional contour invariant moments to form the contour moment feature of the numeric character in the numeric character block.
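As an illustration of steps d01 and d02, the Python sketch below extracts contour pixels and computes the first two of Hu's moment invariants over them. The choice of Hu's invariants, the restriction to two of them, and the area-style normalization exponent are assumptions of the sketch; the description above only specifies "a preset number of two-dimensional contour invariant moments". The function names are hypothetical.

```python
from typing import List, Tuple


def contour_pixels(img: List[List[int]]) -> List[Tuple[int, int]]:
    """White pixels with a background or out-of-image 4-neighbour (step d01)."""
    h, w = len(img), len(img[0])
    points = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(not (0 <= ny < h and 0 <= nx < w) or img[ny][nx] != 1
                   for ny, nx in neighbours):
                points.append((x, y))
    return points


def contour_moment_feature(img: List[List[int]]) -> Tuple[float, float]:
    """First two Hu invariants of the contour point set (assumed form of step d02)."""
    points = contour_pixels(img)
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def mu(p: int, q: int) -> float:      # central moment
        return sum((x - cx) ** p * (y - cy) ** q for x, y in points)

    def eta(p: int, q: int) -> float:     # normalized central moment
        return mu(p, q) / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


def embed(shape: List[List[int]], top: int, left: int, size: int = 8) -> List[List[int]]:
    """Place a small shape inside an empty size x size image."""
    img = [[0] * size for _ in range(size)]
    for y, row in enumerate(shape):
        for x, value in enumerate(row):
            img[top + y][left + x] = value
    return img


rect = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
turned = [list(col) for col in zip(*rect)]            # the same shape rotated by 90 degrees
a = contour_moment_feature(embed(rect, 1, 1))
b = contour_moment_feature(embed(rect, 3, 2))         # translated copy
c = contour_moment_feature(embed(turned, 2, 2))       # rotated copy
```

The three feature pairs agree to within rounding, which illustrates the translation and rotation invariance stated above for this simple shape.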
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exist going downward from the top edge of the numeric character block. If so, the numeric character block is judged to contain a decimal point; otherwise no further operation is performed. After every numeric character block in every data image has been judged, proceed to step 007;
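One possible reading of the decimal point test of step 006 is sketched below in Python. The sketch interprets "a preset number of black pixels going downward from the top edge" as a preset number of completely black rows at the top of the block, which is an assumption; the function name and the `min_blank_rows` parameter are hypothetical.

```python
from typing import List


def is_decimal_point(block: List[List[int]], min_blank_rows: int) -> bool:
    """True if the top min_blank_rows rows contain no white (value 1) pixel,
    i.e. the only ink in the block sits low down, as a decimal point does."""
    if len(block) < min_blank_rows:
        return False
    return all(not any(row) for row in block[:min_blank_rows])


point_block = [[0, 0], [0, 0], [0, 0], [0, 0], [1, 1], [1, 1]]
digit_block = [[1, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
point_found = is_decimal_point(point_block, min_blank_rows=4)
digit_found = is_decimal_point(digit_block, min_blank_rows=4)
```

Blocks flagged in this way are treated as decimal points and need not be passed to the digit classifier.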
If each of the recognition features obtained in the above steps is classified on its own with a neural network or a support vector machine classifier, the classification results are unsatisfactory. This is mainly because it is hard to find a single feature that suits every digit, and earlier methods extracted and fused features by analyzing a specific digit recognition application. Each digit has its own characteristics, so correct classification requires a combination of features. The complementarity of the features is the key to ensuring that the extracted features give a high recognition rate and good generalization, and it is the basis of feature fusion. Therefore, before features are fused, the problem of measuring feature complementarity must be solved.
Step 007. Feature fusion is performed on all recognition features of the numeric characters in all data images to form the digit recognition features corresponding to "0" through "9" in the hydrological data table; then proceed to step 008.
Step 007 above specifically comprises the following steps:
Step e01. By permutation and combination, every pair of recognition features among all the recognition features of the numeric characters in all data images is combined, forming all recognition feature combinations; then proceed to step e02.
Step e02. All recognition features of the numeric characters in all data images form the sample set S corresponding to the digits "0" through "9" in the hydrological data table. Then, for each recognition feature combination, according to the following formula (1):
$$C_{ij,A} = \frac{E(S_i \cup S_j) - E(S_i \cap S_j)}{E(S)} \qquad (1)$$
the feature complementarity index C_{ij,A} of this recognition feature combination with respect to each standard digit "0" to "9" is obtained; in this way the feature complementarity indices C_{ij,A} of every recognition feature combination with respect to the standard digits "0" to "9" are obtained. Then proceed to step e03. Here, the larger C_{ij,A} is, the stronger the complementarity of recognition feature F_i and recognition feature F_j with respect to standard digit A; conversely, the smaller it is, the weaker the complementarity. S_i and S_j denote the sets of samples in S that are misclassified by recognition feature F_i and by recognition feature F_j, respectively; E(S) denotes the number of samples in the sample set S; E(S_i ∪ S_j) denotes the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) denotes the number of samples in the intersection of S_i and S_j; A ∈ {0, 1, ..., 9}; and C_{ij,A} denotes the feature complementarity index, with respect to standard digit A, of the recognition feature combination formed by F_i and F_j.
Step e03. For each recognition feature combination, according to the following formula (2):
$$TC_k = \frac{\sum_{A=0,\; i \neq j}^{9} C_{ij,A}}{10^2} \qquad (2)$$
the overall complementarity index TC_k of each recognition feature combination with respect to the standard digits is obtained; then proceed to step e04. Here, k ∈ {1, ..., K}, K denotes the number of recognition feature combinations, and TC_k denotes the overall complementarity index of the k-th recognition feature combination with respect to the standard digits.
Step e04. All recognition feature combinations are sorted by overall complementarity index in descending order, and the two top-ranked recognition feature combinations are obtained; feature fusion is then performed on these two recognition feature combinations to form the digit recognition features corresponding to "0" through "9" in the hydrological data table.
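The complementarity measure of steps e02 and e03 can be illustrated with the following Python sketch. It assumes that, for each standard digit A, the sets S, S_i and S_j are restricted to the samples of that digit, and it takes the divisor of the overall index as 10 squared, as read from formula (2); both points are assumptions of the sketch. Sample identifiers and function names are hypothetical.

```python
from typing import Dict, Set


def complementarity(mis_i: Set[int], mis_j: Set[int], n_samples: int) -> float:
    """Formula (1): (|S_i union S_j| - |S_i intersect S_j|) / |S| for one digit."""
    return (len(mis_i | mis_j) - len(mis_i & mis_j)) / n_samples


def overall_complementarity(mis_i: Dict[int, Set[int]],
                            mis_j: Dict[int, Set[int]],
                            n_samples: Dict[int, int]) -> float:
    """Overall index of one feature pair: per-digit indices summed over the
    digits 0..9 and divided by 10 squared (divisor assumed from formula (2))."""
    total = sum(complementarity(mis_i[a], mis_j[a], n_samples[a])
                for a in range(10))
    return total / 10 ** 2


# Hypothetical example: for every digit, feature i misclassifies samples {1, 2, 3}
# and feature j misclassifies samples {3, 4}, out of 10 samples per digit.
errors_i = {a: {1, 2, 3} for a in range(10)}
errors_j = {a: {3, 4} for a in range(10)}
counts = {a: 10 for a in range(10)}
c_single = complementarity(errors_i[0], errors_j[0], counts[0])
tc = overall_complementarity(errors_i, errors_j, counts)
```

Two features that fail on largely different samples yield a large index, because few of their errors lie in the intersection; two features that fail on the same samples yield an index near zero.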
In the above technical solution, the different features are each used for classification in the classifier, the recognition results of the individual features are analyzed, the overall complementarity index of each feature combination is computed with the above formulas, and the selected features are then fused using a certain linear relationship. Experiments show that the coarse grid feature and the Fourier feature are both beneficial for recognizing the digits in hydrological yearbook data and that their overall complementarity is strong, so the Fourier feature is concatenated after the coarse grid feature. In the experiments, the recognition rate of the proposed fused feature is 3.8981% higher than that of the Fourier feature alone, 1.4033% higher than that of the grid feature alone, and 83.1956% higher than that of the contour moment feature alone.
Step 008. According to the digit recognition features corresponding to "0" through "9" in the hydrological data table, and the recognition features of the numeric character in each numeric character block of each data image, the digit corresponding to each numeric character block in each data image is obtained by a support vector machine (SVM) classifier; then proceed to step 009.
Step 009. According to the digit or decimal point corresponding to each numeric character block in each data image, the value corresponding to the data image in each numeric cell of the hydrological data table is assembled; combined with the attributes defined by the format of the hydrological data table, each attribute in the hydrological data table and its corresponding value are obtained and stored. Then proceed to step 010.
By analyzing the regularity of the flow data, a post-processing error-checking mechanism based on the time series is proposed. The experimental results show that the final recognition rate for the hydrological yearbook is close to 99%, so the error rate is low. A flow value consists of 4 to 5 characters, and if any one of its digits is misrecognized, the whole value is counted as wrong; this differs slightly from how recognition errors are counted on the common datasets MNIST and USPS. Inspection of the recognition results shows that a misrecognized flow value generally contains only one wrong digit, and that no more than 3 flow values are misrecognized in any month. If a suitable algorithm can find the flow values whose recognition reliability is low, that is, those in which a digit in a key position before the decimal point has been misrecognized, and correct them by the mean-value method using the statistical pattern of monthly flow variation, the method becomes considerably more efficient in application.
The flow values themselves are obtained by instrument measurement and therefore carry a certain error. If a flow value fluctuates only within a small range, that is, if a digit after the decimal point has been misrecognized, the error can be tolerated provided that the analysis and application of the flow data are not affected; such a value is not regarded as misrecognized.
Step 010. For the attributes and corresponding values recognized and stored from the hydrological data table, the following steps 010-01 to 010-02 are performed on the flow values of each month to obtain a preliminary recognition judgment for each daily flow value of each month; then proceed to step 011.
Step 010-01. The flow value of the first day of the month is taken as the first threshold. Then, for each of the flow values of the first two days of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold. If so, the current day's flow value is judged to be correctly recognized; otherwise, the current day's flow value is judged to be preliminarily misrecognized. A preliminary recognition judgment is thus obtained for the flow values of the first two days of the month; then proceed to step 010-02.
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value. If so, the current day's flow value is judged to be correctly recognized; otherwise, the current day's flow value is judged to be preliminarily misrecognized. A preliminary recognition judgment is thus obtained for each daily flow value of the month from the third day onward.
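The time-series check of steps 010-01 and 010-02 can be sketched in Python as follows. Two points are assumptions of the sketch: the difference is taken as an absolute difference, and the last day of the month, which has no following day, is left unflagged. The function name is hypothetical.

```python
from typing import List


def preliminary_flags(flows: List[float]) -> List[bool]:
    """Return one flag per day; True marks a value preliminarily judged
    misrecognized by the rules of steps 010-01 and 010-02."""
    flags = [False] * len(flows)
    for day in range(len(flows) - 1):          # the last day has no next day
        difference = abs(flows[day + 1] - flows[day])
        # First two days: compare with the first day's value (first threshold);
        # later days: compare with the previous day's value.
        threshold = flows[0] if day < 2 else flows[day - 1]
        flags[day] = not (difference < threshold)
    return flags


# Hypothetical month fragment in which 12.0 on the fourth day was misread as 52.0.
month = [10.0, 11.0, 12.0, 52.0, 13.0, 14.0]
flags = preliminary_flags(month)
```

In the example both the misread value and the correct value just before it are flagged, which is why this judgment is only preliminary and is confirmed against the per-digit recognition probabilities in the later steps.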
Step 011. According to each value recognized and stored from the hydrological data table, and the recognition features of each digit in each value, a support vector machine trainer is used to obtain, for each digit in each value recognized and stored from the hydrological data table, the ten recognition result probabilities corresponding to "0" through "9"; then proceed to step 012.
Step 012. For each digit in each value recognized and stored from the hydrological data table, obtain the largest and the second largest of the ten recognition result probabilities corresponding to "0" through "9", and compute the difference between the largest and the second largest recognition result probability. Judge whether this difference is less than a preset recognition result probability threshold of 0.1-0.25. If so, the digit is judged to be preliminarily misrecognized; otherwise, the digit is judged to be correctly recognized. A preliminary recognition judgment is thus obtained for each digit in each value recognized and stored from the hydrological data table; then proceed to step 013.
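The margin test of step 012 amounts to a few lines of Python. In the sketch below the default threshold of 0.2 is simply one value chosen from the 0.1-0.25 range given above, and the probability vectors are hypothetical; the function name is likewise hypothetical.

```python
from typing import Sequence


def low_margin(probabilities: Sequence[float], threshold: float = 0.2) -> bool:
    """True if the gap between the largest and the second largest of the ten
    class probabilities is below the threshold (digit preliminarily suspect)."""
    largest, second = sorted(probabilities, reverse=True)[:2]
    return (largest - second) < threshold


# Probabilities for the classes "0" to "9" of two hypothetical digits.
ambiguous = [0.05] * 8 + [0.35, 0.25]   # classifier hesitates between "8" and "9"
confident = [0.01] * 8 + [0.90, 0.02]   # classifier clearly prefers "8"
ambiguous_flag = low_margin(ambiguous)
confident_flag = low_margin(confident)
```

Only digits with a small margin are marked as preliminarily misrecognized and passed on to step 013.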
Step 013. For each flow value in each month that has been preliminarily judged misrecognized, judge whether the flow value contains a digit that has been preliminarily judged misrecognized; the two cases are as follows:
If so, the preliminarily misrecognized flow value is judged to be wrong and an alarm is raised. At the same time, the position of the preliminarily misrecognized digit within the flow value is analyzed. If the digit lies in the integer part of the flow value, the flow value is replaced with the mean of the flow value of the previous day and the flow value of the following day. If the digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced with the mean of the fractional part of the previous day's flow value and the fractional part of the following day's flow value;
Otherwise, the preliminarily misrecognized flow value is judged to be correct. The check of each value recognized and stored from the hydrological data table is thereby completed.
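The mean-value correction of step 013 can be illustrated with the Python sketch below. It assumes non-negative flow values and that both a previous and a following day are available; how the first and last days of a month are handled is not specified above and is not covered. The function and parameter names are hypothetical.

```python
import math


def correct_flow(previous_day: float, current: float, next_day: float,
                 error_in_integer_part: bool) -> float:
    """Replace a confirmed misrecognized flow value using its two neighbours."""
    if error_in_integer_part:
        # Whole value replaced by the mean of the neighbouring days.
        return (previous_day + next_day) / 2
    # Only the fractional part is replaced by the mean of the neighbours' fractions.
    fraction = (math.modf(previous_day)[0] + math.modf(next_day)[0]) / 2
    return math.floor(current) + fraction


whole_fixed = correct_flow(12.0, 52.0, 13.0, error_in_integer_part=True)
fraction_fixed = correct_flow(12.25, 12.9, 12.75, error_in_integer_part=False)
```

In the first call the misread value 52.0 is replaced by the mean of its neighbours; in the second call only the fractional part of 12.9 is replaced, and the integer part 12 is kept.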
Experimental comparison shows that, in the paper hydrological yearbook digitization method of the present invention, feature fusion improves the recognition rate over any single feature. The Fourier feature alone recognizes digit 0 well but digits 6 and 9 poorly; the coarse grid feature recognizes digit 0 poorly but digits 6 and 9 well; and the contour moment feature recognizes digits 0, 6 and 8 poorly. The three features give broadly consistent results on the other digits. Computing the complementarity index between features shows that the fusion of the Fourier feature and the coarse grid feature distinguishes the different digits well: fusing a description of the boundary contour of the digit with its internal features describes the whole digit more completely and is sufficient to represent it, so good recognition results are obtained.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; within the knowledge possessed by those of ordinary skill in the art, various changes may also be made without departing from the spirit of the present invention.

Claims (10)

1. A paper hydrological yearbook digitization method, characterized by comprising the following steps:
Step 001. According to the layout of the paper hydrological yearbook page, determine the pixel position of the hydrological data table within the paper hydrological yearbook page; then proceed to step 002;
Step 002. According to the pixel position of the hydrological data table in the paper hydrological yearbook page, compute the vertical and horizontal projections of the hydrological data table, analyze the vertical and horizontal projection profiles of the table separately, and extract the abscissa of each vertical line and the ordinate of each horizontal line in the hydrological data table; then proceed to step 003;
Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line in the table, obtain the data image in each numeric cell of the hydrological data table from the projection image of the table; then proceed to step 004; wherein the numeric characters in each data image of the hydrological data table are white and the background is black;
Step 004. For each data image, perform character segmentation on the numeric characters in the data image to obtain the numeric character blocks in that data image, and thereby obtain the numeric character blocks in every data image; then proceed to step 005;
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block, which together form the recognition features of that numeric character, and thereby obtain the recognition features of the numeric character in every numeric character block of every data image; then proceed to step 006;
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exist going downward from the top edge of the numeric character block; if so, the numeric character block is judged to contain a decimal point; otherwise no further operation is performed; after every numeric character block in every data image has been judged, proceed to step 007;
Step 007. Perform feature fusion on all recognition features of the numeric characters in all data images to form the digit recognition features corresponding to "0" through "9" in the hydrological data table; then proceed to step 008;
Step 008. According to the digit recognition features corresponding to "0" through "9" in the hydrological data table, and the recognition features of the numeric character in each numeric character block of each data image, obtain the digit corresponding to each numeric character block in each data image by a preset classifier; then proceed to step 009;
Step 009. According to the digit or decimal point corresponding to each numeric character block in each data image, assemble the value corresponding to the data image in each numeric cell of the hydrological data table; combined with the attributes defined by the format of the hydrological data table, obtain each attribute in the hydrological data table and its corresponding value, and store them.
A kind of papery Water Year Book digitizing solution, it is characterised in that after described step 009 Also comprise the steps, after execution of step 009, enter step 010;
Step 010., for being identified every attribute and corresponding numerical value thereof in storage hydrological data form, is respectively directed to each The flow number of the moon, 010-01 performs to step 010-02 as follows, and then obtains for each moon every respectively Daily flow numerical value tentatively identify judgement, subsequently into step 011;
Step 010-01. Take the first daily flow value of the month as the first threshold; then, for each of the first two daily flow values of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold; if so, judge that the current day's flow value is recognized correctly; otherwise, judge that the current day's flow value is preliminarily misrecognized; the preliminary recognition judgment for each of the first two daily flow values of the month is thereby obtained; then proceed to step 010-02;
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value; if so, judge that the current day's flow value is recognized correctly; otherwise, judge that the current day's flow value is preliminarily misrecognized; the preliminary recognition judgment for each daily flow value of the month from the third day onward is thereby obtained;
Step 011. According to each value recognized and stored in the hydrological data table, and the recognition features of each digit in each value, use a preset trainer to obtain, for each digit in each value recognized and stored in the hydrological data table, the ten recognition-result probabilities corresponding respectively to "0" through "9"; then proceed to step 012;
Step 012. For each digit in each value recognized and stored in the hydrological data table, obtain the largest and the second-largest of the digit's ten recognition-result probabilities for "0" through "9", and compute the difference between the largest and the second-largest probability; judge whether this difference is less than a preset recognition-result probability threshold; if so, judge that the digit is preliminarily misrecognized; otherwise, judge that the digit is recognized correctly; the preliminary recognition judgment for each digit in each value recognized and stored in the hydrological data table is thereby obtained; then proceed to step 013;
Step 013. For each preliminarily misrecognized flow value in each month, judge whether the preliminarily misrecognized flow value contains a preliminarily misrecognized digit; if so, judge that the preliminarily misrecognized flow value is erroneous and raise an alarm; otherwise, judge that the preliminarily misrecognized flow value is correct; the check of each value recognized and stored in the hydrological data table is thereby completed.
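The checks of steps 010-01 to 013 can be sketched in Python as below. This is only an illustrative sketch, not the patented implementation: the function names, the use of the absolute difference between consecutive days, and the data layout (one list of ten probabilities per digit) are all assumptions made for the example.

```python
def flag_suspect_days(flows):
    """Steps 010-01 and 010-02: indices of days whose value is preliminarily suspect."""
    suspects = []
    first_threshold = flows[0]  # first daily flow value of the month
    for day in range(len(flows) - 1):  # the last day has no "next day" to compare with
        diff = abs(flows[day + 1] - flows[day])  # assumption: absolute difference
        # Days 1 and 2 use the first threshold; later days use the previous day's value.
        limit = first_threshold if day < 2 else flows[day - 1]
        if not diff < limit:
            suspects.append(day)
    return suspects


def flag_suspect_digits(digit_probs, margin_threshold):
    """Step 012: positions of digits whose top-two probability gap is below the threshold."""
    suspects = []
    for position, probs in enumerate(digit_probs):
        largest, second = sorted(probs, reverse=True)[:2]
        if largest - second < margin_threshold:
            suspects.append(position)
    return suspects


def confirm_errors(flows, probs_per_value, margin_threshold):
    """Step 013: report a day only when both the value and one of its digits are suspect."""
    alarms = []
    for day in flag_suspect_days(flows):
        if flag_suspect_digits(probs_per_value[day], margin_threshold):
            alarms.append(day)
    return alarms
```

A value is reported only when the sequence check and the classifier-margin check agree, so a genuine flood peak whose digits were all recognized with a clear margin does not raise an alarm.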
A paper hydrologic yearbook digitalization method, characterized in that, in said step 011, according to each value recognized and stored in the hydrological data table, and the recognition features of each digit in each value, a support vector machine trainer is used to obtain, for each digit in each value recognized and stored in the hydrological data table, the ten recognition-result probabilities corresponding respectively to "0" through "9".
A paper hydrologic yearbook digitalization method, characterized in that, in said step 013, when a preliminarily misrecognized flow value is judged to be erroneous because it contains a preliminarily misrecognized digit and an alarm is raised, the position of the preliminarily misrecognized digit within the preliminarily misrecognized flow value is also analyzed: if the digit lies in the integer part of the flow value, the flow value is replaced with the average of the flow value of the day before and the flow value of the day after the corresponding date; if the digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced with the average of the fractional parts of the flow values of the day before and the day after the corresponding date.
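The replacement rule of this claim can be illustrated with a short Python sketch. The function name `repair_flow` and its boolean parameter are hypothetical, and the sketch assumes non-negative flow values so that the fractional part is simply the value minus its floor.

```python
import math


def repair_flow(prev_value, value, next_value, suspect_in_integer_part):
    """Replace a confirmed-wrong flow value using the neighbouring days."""
    if suspect_in_integer_part:
        # Suspect digit in the integer part: replace the whole value
        # with the mean of the previous and the next day's values.
        return (prev_value + next_value) / 2
    # Suspect digit in the fractional part: keep the integer part and replace
    # only the fraction with the mean of the neighbours' fractional parts.
    prev_frac = prev_value - math.floor(prev_value)
    next_frac = next_value - math.floor(next_value)
    return math.floor(value) + (prev_frac + next_frac) / 2
```

For example, with neighbours 12.0 and 13.0, a value whose integer part is suspect is replaced by 12.5 regardless of what was recognized.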
5. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that said step 004, performing character segmentation on each numeric character in a data image to obtain each numeric character block in the data image, specifically comprises the following steps:
Step a01. Detect each white pixel inside each numeric character in the data image, and, for each edge of the data image, the white pixel on the corresponding numeric character that lies at the minimum distance from that edge; then proceed to step a02;
Step a02. For each white pixel obtained from the data image in the previous step, judge whether the pixels at the positions above, below, left and right of the pixel are all white pixels; if so, judge that the pixel is an interior pixel of a numeric character; otherwise, judge according to an identifier that the pixel is an edge pixel of a character, and obtain the column number of the pixel column in which the pixel lies in the data image; by applying this judgment to each white pixel obtained from the data image in the previous step, the column number of the pixel column in which each edge pixel of each numeric character lies in the data image is obtained; then proceed to step a03;
Step a03. According to the column number of the pixel column in which each edge pixel of each numeric character lies in the data image, divide the numeric characters in the data image to obtain each numeric character block in the data image.
6. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in said step 005, extracting, for each numeric character block in each data image, the grid feature of the numeric character in the numeric character block specifically comprises the following steps:
Step b01. Obtain the top, bottom, left and right boundaries of the numeric character block, and thereby obtain the numeric character body image; then proceed to step b02;
Step b02. Perform centroid normalization on the numeric character body image, and divide the centroid-normalized numeric character body image evenly into a preset number of sub-region images; then proceed to step b03;
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character body image; these proportions together constitute the grid feature of the numeric character in the numeric character block.
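Steps b01 and b03 can be sketched in Python as below. The sketch is simplified and hedged: it assumes a binary image with white pixels stored as 1, uses a hypothetical `grid` parameter for the number of cells per side, and omits the centroid normalization of step b02.

```python
def grid_feature(image, grid=2):
    """Crop to the character's bounding box and return white-pixel ratios per cell."""
    # Step b01: bounding box of the white pixels.
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    body = [row[left:right + 1] for row in image[top:bottom + 1]]

    # Step b03: count white and total pixels in each of the grid x grid cells.
    height, width = len(body), len(body[0])
    white = [[0] * grid for _ in range(grid)]
    total = [[0] * grid for _ in range(grid)]
    for r in range(height):
        for c in range(width):
            gr, gc = r * grid // height, c * grid // width
            total[gr][gc] += 1
            white[gr][gc] += 1 if body[r][c] else 0
    return [
        white[i][j] / total[i][j] if total[i][j] else 0.0
        for i in range(grid)
        for j in range(grid)
    ]
```

The feature is returned row by row, so a 2 x 2 grid yields four ratios in the order top-left, top-right, bottom-left, bottom-right.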
7. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in said step 005, extracting, for each numeric character block in each data image, the Fourier feature of the numeric character in the numeric character block specifically comprises the following steps:
Step c01. Perform a two-dimensional discrete Fourier transform on the numeric character block; then proceed to step c02;
Step c02. Apply a centering transform to the numeric character block after the two-dimensional discrete Fourier transform, that is, divide the numeric character block evenly into four sub-region images and swap them diagonally, to obtain the Fourier image spectrum; then proceed to step c03;
Step c03. Analyze the Fourier coefficients of the Fourier image spectrum after the centering transform, and obtain, among the Fourier coefficients of the numeric character block, the region in which the Fourier coefficients exceeding a preset amplitude threshold are concentrated, which constitutes the significant Fourier coefficient region; then proceed to step c04;
Step c04. From the significant Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them, to constitute the Fourier feature of the numeric character in the numeric character block.
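The transform and the diagonal quadrant swap of steps c01 and c02 can be sketched in pure Python as follows. This is an illustration under stated assumptions: the block has even height and width, the "significant region" of step c03 is approximated by a fixed window around the centred zero-frequency coefficient instead of an amplitude threshold, and normalization divides by the largest magnitude in that window. All function and parameter names are hypothetical.

```python
import cmath


def dft2(block):
    """Step c01: direct two-dimensional discrete Fourier transform (small blocks only)."""
    h, w = len(block), len(block[0])
    return [
        [
            sum(
                block[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]


def center_shift(spectrum):
    """Step c02: swap diagonal quadrants so the zero frequency moves to the centre."""
    h, w = len(spectrum), len(spectrum[0])
    return [
        [spectrum[(u + h // 2) % h][(v + w // 2) % w] for v in range(w)]
        for u in range(h)
    ]


def fourier_feature(block, window=2):
    """Steps c03 and c04 (simplified): normalized magnitudes in a central window."""
    shifted = center_shift(dft2(block))
    h, w = len(shifted), len(shifted[0])
    r0, c0 = h // 2 - window // 2, w // 2 - window // 2
    mags = [
        abs(shifted[r][c])
        for r in range(r0, r0 + window)
        for c in range(c0, c0 + window)
    ]
    peak = max(mags)
    return [m / peak if peak else 0.0 for m in mags]
```

The direct transform costs O(N^4) for an N x N block, which is acceptable only for an illustration; a real implementation would use a fast Fourier transform.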
8. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in said step 005, extracting, for each numeric character block in each data image, the contour moment feature of the numeric character in the numeric character block specifically comprises the following steps:
Step d01. Perform contour extraction on the numeric character in the numeric character block; then proceed to step d02;
Step d02. Apply invariant-moment processing to the contour of the numeric character in the numeric character block, and extract a preset number of two-dimensional contour invariant moment features, to constitute the contour moment feature of the numeric character in the numeric character block.
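As an illustration of step d02, the following Python sketch computes the first two Hu invariant moments from a list of contour pixel coordinates. The claim does not say which invariant moments are used or how many, so the choice of Hu moments and the function name `hu_first_two` are assumptions.

```python
def hu_first_two(points):
    """First two Hu invariant moments of a set of (row, col) contour points."""
    n = len(points)  # zeroth-order moment of a point set
    cy = sum(r for r, _ in points) / n
    cx = sum(c for _, c in points) / n

    def eta(p, q):
        # Central moment, normalized so the result does not depend on scale.
        mu = sum((c - cx) ** p * (r - cy) ** q for r, c in points)
        return mu / n ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = e20 + e02
    phi2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    return phi1, phi2
```

Both values stay the same when the contour is translated or rotated, which is what makes them usable as features for characters that are not perfectly aligned in the scan.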
9. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that said step 007 specifically comprises the following steps:
Step e01. By permutation and combination, for all recognition features of the numeric characters in all data images, combine any two recognition features to form all recognition-feature combinations; then proceed to step e02;
Step e02. From all recognition features of the numeric characters in all data images, form the sample set S corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition-feature combination, according to the following formula (1):
$$C_{ij,A} = \frac{E(S_i \cup S_j) - E(S_i \cap S_j)}{E(S)} \qquad (1)$$
obtain the feature complementarity index C_{ij,A} of this recognition-feature combination relative to each of the standard digits "0" through "9"; in this way, obtain the feature complementarity index C_{ij,A} of each recognition-feature combination relative to each of the standard digits "0" through "9"; then proceed to step e03; wherein S_i and S_j denote the sets of samples in the sample set S that are misclassified by recognition feature F_i and by recognition feature F_j, respectively; E(S) denotes the number of samples in the sample set S; E(S_i ∪ S_j) denotes the number of samples in the union of sample set S_i and sample set S_j; E(S_i ∩ S_j) denotes the number of samples in the intersection of sample set S_i and sample set S_j; A = {0, 1, ..., 9}, and C_{ij,A} denotes the feature complementarity index, relative to standard digit A, of the recognition-feature combination formed by recognition feature F_i and recognition feature F_j;
Step e03. For each recognition-feature combination, according to the following formula (2):
$$TC_k = \frac{\sum_{0,\, i \neq j}^{9} C_{ij,A}}{10^{2}} \qquad (2)$$
obtain the overall complementarity index TC_k of each recognition-feature combination relative to the standard digits; then proceed to step e04; wherein k = {1, ..., K}, K denotes the number of recognition-feature combinations, and TC_k denotes the overall complementarity index of the k-th recognition-feature combination relative to the standard digits;
Step e04. Sort all recognition-feature combinations by their overall complementarity index in descending order, obtain the top two recognition-feature combinations in the ranking, and then perform feature fusion on these two recognition-feature combinations, to constitute the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table.
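The complementarity index of formula (1) and the ranking of step e04 can be sketched in Python as below. The data layout is an assumption made for the example: for every feature and every digit, a set of identifiers of the samples that the feature misclassifies. Because the normalizing constant of formula (2) is the same for every combination, it does not change the ranking, so the sketch ranks by the plain sum of the per-digit indices. The function names are hypothetical.

```python
from itertools import combinations


def complementarity(wrong_i, wrong_j, sample_count):
    """Formula (1): samples misclassified by exactly one of two features, as a fraction."""
    return (len(wrong_i | wrong_j) - len(wrong_i & wrong_j)) / sample_count


def rank_feature_pairs(wrong_by_feature, samples_per_digit):
    """Steps e01 to e04: rank every pair of features by summed complementarity.

    wrong_by_feature[name][digit] is the set of misclassified sample ids,
    samples_per_digit[digit] is the number of samples of that digit.
    """
    scores = {}
    for fi, fj in combinations(sorted(wrong_by_feature), 2):
        scores[(fi, fj)] = sum(
            complementarity(
                wrong_by_feature[fi][digit],
                wrong_by_feature[fj][digit],
                samples_per_digit[digit],
            )
            for digit in samples_per_digit
        )
    return sorted(scores, key=scores.get, reverse=True)
```

A pair scores high when the two features fail on different samples, which is the situation in which fusing them is most likely to help.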
10. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in said step 008, according to the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, and the recognition features of the numeric character in each numeric character block in each data image, a support vector machine classifier is used to obtain the digit corresponding to each numeric character block in each data image.
CN201610232680.9A 2016-04-14 2016-04-14 Paper hydrologic yearbook digitalization method Expired - Fee Related CN105938547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610232680.9A CN105938547B (en) 2016-04-14 2016-04-14 A kind of papery Water Year Book digitizing solution

Publications (2)

Publication Number Publication Date
CN105938547A true CN105938547A (en) 2016-09-14
CN105938547B CN105938547B (en) 2019-02-12

Family

ID=57151427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610232680.9A Expired - Fee Related CN105938547B (en) 2016-04-14 2016-04-14 Paper hydrologic yearbook digitalization method

Country Status (1)

Country Link
CN (1) CN105938547B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3582734B2 (en) * 1993-07-14 2004-10-27 富士通株式会社 Table vectorizer
CN103996057A (en) * 2014-06-12 2014-08-20 武汉科技大学 Real-time handwritten digital recognition method based on multi-feature fusion
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘昱: "《印刷体表格识别的研究》", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
张世平: "《水文年鉴数据的智能识别》", 《人民珠江》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805076A (en) * 2018-06-07 2018-11-13 浙江大学 The extracting method and system of environmental impact assessment report table word
CN108805076B (en) * 2018-06-07 2021-01-08 浙江大学 Method and system for extracting table characters of environmental impact evaluation report
CN109190611A (en) * 2018-08-14 2019-01-11 江西师范大学 Pedigree system makes are compiled in a kind of internet based on crowdsourcing
CN111060527A (en) * 2019-12-30 2020-04-24 歌尔股份有限公司 Character defect detection method and device
CN111060527B (en) * 2019-12-30 2021-10-29 歌尔股份有限公司 Character defect detection method and device
US12002198B2 (en) 2019-12-30 2024-06-04 Goertek Inc. Character defect detection method and device
CN113436117A (en) * 2021-08-03 2021-09-24 东莞理工学院 Hydrology long sequence data extraction method based on image recognition
CN113436117B (en) * 2021-08-03 2022-11-25 东莞理工学院 Hydrological long sequence data extraction method based on image recognition

Also Published As

Publication number Publication date
CN105938547B (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN110033000A (en) A kind of text detection and recognition methods of bill images
CN106529508B (en) Based on local and non local multiple features semanteme hyperspectral image classification method
CN101447017B (en) Method and system for quickly identifying and counting votes on the basis of layout analysis
CN103034848B (en) A kind of recognition methods of form types
CN103996057B (en) Real-time Handwritten Numeral Recognition Method based on multi-feature fusion
CN103810484B (en) The mimeograph documents discrimination method analyzed based on printing character library
CN101256631B (en) Method and apparatus for character recognition
CN105447522A (en) Complex image character identification system
CN106875546A (en) A kind of recognition methods of VAT invoice
CN106611174A (en) OCR recognition method for unusual fonts
CN103914680A (en) Character image jet-printing, recognition and calibration system and method
CN103824373B (en) A kind of bill images amount of money sorting technique and system
CN104732215A (en) Remote-sensing image coastline extracting method based on information vector machine
CN105938547A (en) Paper hydrologic yearbook digitalization method
CN104680130A (en) Chinese character recognition method for identification cards
CN104573685A (en) Natural scene text detecting method based on extraction of linear structures
CN101930549A (en) Second generation curvelet transform-based static human detection method
CN101359373A (en) Method and device for recognizing degraded character
Chaabouni et al. Multi-fractal modeling for on-line text-independent writer identification
CN106778717A (en) A kind of test and appraisal table recognition methods based on image recognition and k nearest neighbor
CN109800756A (en) A kind of text detection recognition methods for the intensive text of Chinese historical document
CN106874901A (en) A kind of driving license recognition methods and device
CN101251896A (en) Object detecting system and method based on multiple classifiers
CN103500323B (en) Based on the template matching method of self-adaptation gray level image filtering
Lu et al. Retrieval of machine-printed latin documents through word shape coding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190212