CN105938547A - Paper hydrologic yearbook digitalization method - Google Patents
- Publication number
- CN105938547A (application CN201610232680.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Abstract
The invention relates to a paper hydrologic yearbook digitalization method. Building on single features, a feature fusion method with high complementarity is proposed, which raises the recognition rate. Because hydrologic processes are driven by similar seasonal climatic factors and other random factors, they resemble one another; that is, the flow values are contextually correlated. In view of this correlation, a time-series-based post-recognition error correction mechanism is also proposed: after classifier recognition, the results are corrected according to a defined criterion. Experiments show that the mechanism effectively improves recognition accuracy while maintaining working efficiency.
Description
Technical field
The present invention relates to a paper hydrologic yearbook digitalization method, and belongs to the interdisciplinary field of computer image processing and hydrology.
Background
Paper hydrologic yearbooks record the most basic hydrologic survey data. These data contain information on long-term natural evolution and on the effects of human activity, and they play an important role in production, scientific research and public service. Because the yearbooks are old, frequently used and often poorly stored, the paper volumes have gradually begun to deteriorate. Once they suffer man-made or natural damage, the loss is hard to make up, so rescuing these valuable historical records has become an urgent problem. The most effective way to protect a hydrologic yearbook is to scan it and produce an electronic file. On this basis the prior art has studied yearbook digitalization and proposed intelligent recognition of yearbook data; recognizing the digits in the hydrologic data (numeric character recognition) is the central task of that digitalization.
Hydrologic data are published year by year and presented in a unified, systematic tabular form. The content is mainly the commonly needed basic hydrologic data measured in the previous year, after rigorous compilation and review. In the table format, months are arranged horizontally and the days of each month vertically; the bottom of the table holds the monthly average flow, maximum flow, minimum flow, annual statistics and notes. Therefore, before the yearbook digits are recognized, layout analysis is performed first and the table lines are extracted.
Yearbook numeric characters are fairly standardized and have few strokes, so extracting features from them is easier than from Chinese characters. However, their shapes vary little and they carry very little stroke information, which in a sense makes it harder to extract effective feature vectors. For example, with the digits "8" and "6", when the ink is somewhat heavy the upper half of a Song-typeface "6" sometimes closes into a small circle and looks almost like an "8". Likewise, with "1" and "3", or "2" and "7", heavy ink or a very small font may cause the two digits to yield identical feature vectors. Consequently, in practice, applying the prior art to hydrologic data suffers from low accuracy and low efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a paper hydrologic yearbook digitalization method that adopts a new feature fusion design, effectively improves the recognition rate and maintains working efficiency.
To solve this problem, the present invention adopts the following technical solution: a paper hydrologic yearbook digitalization method comprising the following steps:
Step 001. According to the layout of the paper hydrologic yearbook page, determine the pixel position of the hydrologic data table within the page, then go to step 002;
Step 002. According to the pixel position of the hydrologic data table within the page, project the table vertically and horizontally, analyze the vertical and horizontal projection profiles separately, and extract the abscissa of each vertical line and the ordinate of each horizontal line in the table, then go to step 003;
Step 003. According to the format of the hydrologic data table, together with the abscissas of its vertical lines and the ordinates of its horizontal lines, obtain from the table image the data image in each numeric cell of the table, then go to step 004. In each data image the numeric characters are white and the background is black;
Step 004. For each data image, segment the numeric characters it contains to obtain the numeric character blocks in that image, and thereby obtain the numeric character blocks of every data image, then go to step 005;
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block; together these form the recognition features of that character. The recognition features of the character in every block of every data image are thus obtained, then go to step 006;
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels is present going downward from the top edge of the block. If so, the block is judged to contain a decimal point; otherwise no further operation is performed. After every block in every data image has been judged, go to step 007;
Step 007. For all recognition features of the numeric characters in all data images, perform feature fusion to form the digit recognition features corresponding to "0" through "9" in the hydrologic data table, then go to step 008;
Step 008. According to the digit recognition features corresponding to "0" through "9" and the recognition features of the character in each numeric character block of each data image, use a preset classifier to obtain the digit corresponding to each numeric character block in each data image, then go to step 009;
Step 009. From the digit or decimal point corresponding to each numeric character block in each data image, assemble the numeric value of the data image in each numeric cell of the table. Combined with the attributes defined by the table format, obtain each attribute of the hydrologic data table and its corresponding value, and store them.
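The projection analysis of steps 002 and 003 can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows in which 1 marks a table-line pixel, one-pixel-wide lines, and a hypothetical `min_fraction` coverage threshold; none of these details come from the patent text, and the function names are illustrative only.

```python
def projection_profiles(img):
    """Return (row_sums, col_sums) for a binary image given as a list of rows."""
    row_sums = [sum(row) for row in img]
    col_sums = [sum(col) for col in zip(*img)]
    return row_sums, col_sums


def line_positions(profile, full_length, min_fraction=0.8):
    """Indices whose pixel count covers at least min_fraction of the table extent.

    Adjacent qualifying indices are merged and reported by their first index.
    min_fraction is an assumed tuning parameter.
    """
    threshold = min_fraction * full_length
    positions, prev = [], None
    for idx, count in enumerate(profile):
        if count >= threshold:
            if prev is None or idx != prev + 1:
                positions.append(idx)
            prev = idx
    return positions


def table_lines(img):
    """Step 002 analogue: abscissas of vertical lines, ordinates of horizontal lines."""
    rows, cols = projection_profiles(img)
    height, width = len(img), len(img[0])
    return line_positions(cols, height), line_positions(rows, width)


def cell_boxes(vertical_x, horizontal_y):
    """Step 003 analogue: (top, bottom, left, right) per cell, end-exclusive.

    Assumes one-pixel-wide table lines.
    """
    return [(y0 + 1, y1, x0 + 1, x1)
            for y0, y1 in zip(horizontal_y, horizontal_y[1:])
            for x0, x1 in zip(vertical_x, vertical_x[1:])]
```

A real scan would first need binarization and skew correction, which this sketch leaves out.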
As a preferred technical solution of the present invention, the method further comprises the following steps after step 009; once step 009 has been executed, go to step 010;
Step 010. For the attributes and corresponding values that have been recognized and stored from the hydrologic data table, perform steps 010-01 and 010-02 below on the flow values of each month, thereby obtaining a preliminary recognition judgment for each daily flow value of each month, then go to step 011;
Step 010-01. Take the flow value of the first day of the month as the first threshold. Then, for each of the first two daily flow values of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives the preliminary recognition judgment for the flow values of the first two days of the month, then go to step 010-02;
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives the preliminary recognition judgment for each daily flow value from the third day onward;
Step 011. According to each value recognized and stored from the hydrologic data table and the recognition features of each digit in each value, use a preset trainer to obtain, for each digit of each value, the ten recognition-result probabilities corresponding to "0" through "9", then go to step 012;
Step 012. For each digit of each value recognized and stored from the hydrologic data table, obtain the largest and the second-largest of its ten recognition-result probabilities and compute their difference. Judge whether this difference is less than a preset recognition-result probability threshold. If so, the digit is judged to be preliminarily misrecognized; otherwise the digit is judged to be correctly recognized. This gives the preliminary recognition judgment for each digit of each stored value, then go to step 013;
Step 013. For each preliminarily misrecognized flow value in each month, judge whether the flow value contains a preliminarily misrecognized digit. If so, the flow value is judged to be wrong and an alarm is raised; otherwise the flow value is judged to be correct. This completes the check of every value recognized and stored from the hydrologic data table.
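The two-stage check in steps 010 to 013 can be sketched as follows. Several details are assumptions rather than statements of the patent: the "difference" is taken as an absolute difference, the last day of the month is skipped because it has no next-day value, the probability margin threshold of 0.2 is arbitrary, and `digit_probs[day]` is a hypothetical layout holding one list of ten probabilities per digit of that day's value.

```python
def suspect_flows(flows):
    """Steps 010-01 and 010-02: indices of days whose jump to the next day is too large."""
    flagged = []
    for day in range(len(flows) - 1):          # the last day has no next-day value
        # First two days use the first-day flow as threshold, later days the previous day.
        limit = flows[0] if day < 2 else flows[day - 1]
        if abs(flows[day + 1] - flows[day]) >= limit:
            flagged.append(day)
    return flagged


def suspect_digit(probabilities, margin=0.2):
    """Step 012: a digit is suspect when its top two class probabilities are close."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second < margin


def confirmed_errors(flows, digit_probs, margin=0.2):
    """Step 013: report a flow value only when both tests flag it."""
    return [day for day in suspect_flows(flows)
            if any(suspect_digit(p, margin) for p in digit_probs[day])]
```

Requiring both tests to agree keeps genuine flood peaks, which fail the time-series test but are recognized with high confidence, from being reported as errors.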
As a preferred technical solution of the present invention, in step 011, according to each value recognized and stored from the hydrologic data table and the recognition features of each digit in each value, a support vector machine trainer is used to obtain, for each digit of each value, the ten recognition-result probabilities corresponding to "0" through "9".
As a preferred technical solution of the present invention, in step 013, when a preliminarily misrecognized flow value is judged to be wrong because it contains a preliminarily misrecognized digit, and the alarm is raised, the position of that digit within the flow value is also analyzed. If the misrecognized digit lies in the integer part of the flow value, the flow value is replaced by the average of the flow values of the previous day and the following day. If the misrecognized digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced by the average of the fractional parts of the flow values of the previous day and the following day.
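The replacement rule above can be written as a short sketch. It assumes non-negative flow values held as floats and that both neighbouring days exist; how the first and last day of a month are handled is not stated in the text, and the function name is illustrative.

```python
import math


def repair_flow(prev_flow, flow, next_flow, error_in_integer_part):
    """Replace a confirmed-wrong flow value using its two neighbouring days.

    Assumes non-negative flows, so math.floor gives the integer part.
    """
    if error_in_integer_part:
        # Whole value replaced by the mean of the previous and following day.
        return (prev_flow + next_flow) / 2
    # Only the fractional part is replaced; the integer part is kept.
    mean_fraction = (math.modf(prev_flow)[0] + math.modf(next_flow)[0]) / 2
    return math.floor(flow) + mean_fraction
```

For example, a value read as 92.5 between neighbours 12.0 and 13.0, with the error located in the integer part, would be replaced by 12.5.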
As a preferred technical solution of the present invention, in step 004, segmenting the numeric characters in a data image to obtain the numeric character blocks in that image specifically comprises the following steps:
Step a01. Detect the white pixels inside each numeric character in the data image and, for each edge of the data image, the minimum distance to each numeric character together with the corresponding white pixel on that character, then go to step a02;
Step a02. For each white pixel obtained in the previous step, judge whether the pixels above, below, to the left and to the right of it are all white. If so, the pixel is judged to be an interior pixel of a numeric character; otherwise the pixel is marked with an identifier as an edge pixel of the character, and the column number of its pixel column in the data image is obtained. After every white pixel obtained in the previous step has been judged, the column numbers of the pixel columns containing the edge pixels of each numeric character in the data image are obtained, then go to step a03;
Step a03. According to the column numbers of the pixel columns containing the edge pixels of each numeric character, divide the data image into its numeric characters and obtain each numeric character block in the data image.
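A simplified reading of steps a01 to a03 is sketched below: white pixels with a non-white 4-neighbour are treated as edge pixels, and runs of consecutive edge-pixel columns are taken as character blocks. The image layout (list of rows, 1 for white) and the rule that a gap of at least one empty column separates characters are assumptions; touching characters would need a more careful method.

```python
def edge_pixels(img):
    """White pixels that have at least one non-white (or out-of-image) 4-neighbour."""
    h, w = len(img), len(img[0])

    def white(r, c):
        return 0 <= r < h and 0 <= c < w and img[r][c] == 1

    neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {(r, c) for r in range(h) for c in range(w)
            if img[r][c] == 1
            and not all(white(r + dr, c + dc) for dr, dc in neighbours)}


def split_characters(img):
    """Return (first_col, last_col) for each run of consecutive edge-pixel columns."""
    cols = sorted({c for _, c in edge_pixels(img)})
    blocks, start, prev = [], None, None
    for c in cols:
        if start is None:
            start = c
        elif c != prev + 1:          # a gap in column numbers ends the current block
            blocks.append((start, prev))
            start = c
        prev = c
    if start is not None:
        blocks.append((start, prev))
    return blocks
```

Every column that contains a white pixel also contains an edge pixel (its topmost white pixel), so the edge-pixel columns cover the full width of each character.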
As a preferred technical solution of the present invention, in step 005, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the top, bottom, left and right boundaries of the numeric character block, and thereby obtain the image of the numeric character itself, then go to step b02;
Step b02. Apply center-of-gravity normalization to the numeric character image, and divide the normalized image evenly into a preset number of sub-region images, then go to step b03;
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character image; these proportions together form the grid feature of the numeric character in the block.
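Steps b01 and b03 can be illustrated with the sketch below, which crops the block to the character's bounding box and measures the white-pixel ratio in each zone of a `grid` by `grid` partition. The center-of-gravity normalization of step b02 is omitted here, the zone count is an assumed parameter, and the block is assumed to contain at least one white pixel.

```python
def crop_to_ink(img):
    """Step b01 analogue: crop a binary block to the bounding box of its white pixels."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def grid_feature(img, grid=2):
    """Step b03 analogue: white-pixel ratio of each zone, row-major order."""
    body = crop_to_ink(img)
    h, w = len(body), len(body[0])
    feature = []
    for gr in range(grid):
        r0, r1 = gr * h // grid, (gr + 1) * h // grid
        for gc in range(grid):
            c0, c1 = gc * w // grid, (gc + 1) * w // grid
            cells = [body[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            # An empty zone can occur when the body is smaller than the grid.
            feature.append(sum(cells) / len(cells) if cells else 0.0)
    return feature
```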
As a preferred technical solution of the present invention, in step 005, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Apply the two-dimensional discrete Fourier transform to the numeric character block, then go to step c02;
Step c02. Apply a centering transformation to the transformed numeric character block: divide it evenly into four sub-region images and swap them diagonally, obtaining the Fourier spectrum image, then go to step c03;
Step c03. Analyze the Fourier coefficients of the centered Fourier spectrum image and find the region in which the coefficients exceeding a preset amplitude threshold are concentrated; this region forms the large-amplitude Fourier coefficient region, then go to step c04;
Step c04. From the large-amplitude Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them; they form the Fourier feature of the numeric character in the block.
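The sketch below walks through steps c01 to c04 with a direct (slow) two-dimensional DFT so that it needs only the standard library. Two choices are assumptions made for illustration: instead of thresholding amplitudes, it takes a fixed window of `radius` around the spectrum center as the large-amplitude region, and it normalizes by the magnitude of the zero-frequency coefficient. The block must be at least `2 * radius + 1` pixels in each dimension.

```python
import cmath


def dft2(img):
    """Step c01: direct 2-D discrete Fourier transform, O(n^4), for small blocks only."""
    h, w = len(img), len(img[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(angle)
            row.append(total)
        out.append(row)
    return out


def center_shift(spec):
    """Step c02: move the zero-frequency coefficient to index (h // 2, w // 2)."""
    h, w = len(spec), len(spec[0])
    return [[spec[(u - h // 2) % h][(v - w // 2) % w] for v in range(w)]
            for u in range(h)]


def fourier_feature(img, radius=1):
    """Steps c03 and c04 (simplified): normalized magnitudes around the center."""
    spec = center_shift(dft2(img))
    cu, cv = len(spec) // 2, len(spec[0]) // 2
    dc = abs(spec[cu][cv]) or 1.0        # avoid dividing by zero on an empty block
    return [abs(spec[u][v]) / dc
            for u in range(cu - radius, cu + radius + 1)
            for v in range(cv - radius, cv + radius + 1)]
```

For even block sizes the shift in `center_shift` is exactly the diagonal swap of four quadrants described in step c02.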
As a preferred technical solution of the present invention, in step 005, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block, then go to step d02;
Step d02. Apply invariant-moment processing to the contour of the numeric character and extract a preset number of two-dimensional contour invariant moment features; these form the contour moment feature of the numeric character in the block.
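As an illustration of steps d01 and d02, the sketch below takes the contour to be the white pixels with a non-white 4-neighbour and computes the first two Hu invariant moments of those contour points. The choice of Hu moments, the number of moments, and the region-style normalization exponent `1 + (p + q) / 2` are assumptions; the text only says that a preset number of invariant moments is extracted.

```python
def contour_points(img):
    """Step d01 analogue: white pixels with a non-white or out-of-image 4-neighbour."""
    h, w = len(img), len(img[0])

    def white(r, c):
        return 0 <= r < h and 0 <= c < w and img[r][c] == 1

    neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [(r, c) for r in range(h) for c in range(w)
            if img[r][c] == 1
            and not all(white(r + dr, c + dc) for dr, dc in neighbours)]


def hu_moments(points):
    """Step d02 analogue: first two Hu invariants of a non-empty point set."""
    n = len(points)
    mean_r = sum(r for r, _ in points) / n
    mean_c = sum(c for _, c in points) / n

    def eta(p, q):
        # Normalized central moment; the zeroth moment of a point set is n.
        mu = sum((r - mean_r) ** p * (c - mean_c) ** q for r, c in points)
        return mu / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Both values are unchanged when the point set is translated or rotated by 90 degrees, which is the property that makes such moments usable as shape features.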
As a preferred technical solution of the present invention, step 007 specifically comprises the following steps:
Step e01. For all recognition features of the numeric characters in all data images, form every combination of two recognition features, giving the set of all recognition feature combinations, then go to step e02;
Step e02. From all recognition features of the numeric characters in all data images, form the sample set S for each of the digits "0" through "9" in the hydrologic data table. Then, for each recognition feature combination, obtain according to formula (1) the feature complementarity index C_ij,A of that combination with respect to each standard digit "0" through "9". The feature complementarity indices C_ij,A of every recognition feature combination are thus obtained, then go to step e03. Here S_i and S_j denote the sets of samples in S that are misclassified by recognition feature F_i and by recognition feature F_j, respectively; E(S) denotes the number of samples in the sample set S; E(S_i ∪ S_j) denotes the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) denotes the number of samples in the intersection of S_i and S_j; A = {0, 1, ..., 9}; and C_ij,A denotes the feature complementarity index, with respect to standard digit A, of the combination formed by recognition features F_i and F_j;
Step e03. For each recognition feature combination, obtain according to formula (2) the overall complementarity index TC_k of that combination with respect to the standard digits, then go to step e04. Here k = {1, ..., K}, K denotes the number of recognition feature combinations, and TC_k denotes the overall complementarity index of the k-th recognition feature combination with respect to the standard digits;
Step e04. Sort all recognition feature combinations by overall complementarity index in descending order, take the two highest-ranked combinations, and perform feature fusion on these two combinations to form the digit recognition features corresponding to "0" through "9" in the hydrologic data table.
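The selection procedure of steps e01 to e04 can be sketched structurally as follows. Formulas (1) and (2) are not reproduced in this text, so `standin_index` and the per-digit sum are hypothetical placeholders chosen only to make the sketch runnable; they are not the patent's definitions. The `errors` layout (feature name, then digit, then set of misclassified sample ids) is likewise an assumption.

```python
from itertools import combinations


def standin_index(wrong_i, wrong_j, total):
    """Hypothetical stand-in for formula (1), NOT the patent's definition.

    Share of samples that exactly one of the two features misclassifies.
    """
    return (len(wrong_i | wrong_j) - len(wrong_i & wrong_j)) / total


def rank_feature_pairs(errors, sample_counts, index=standin_index):
    """Steps e01 to e04: score every feature pair and sort in descending order.

    errors[feature][digit] is the set of sample ids that feature misclassifies;
    sample_counts[digit] is E(S) for that digit.
    """
    scores = {}
    for fi, fj in combinations(sorted(errors), 2):       # step e01: all pairs
        scores[(fi, fj)] = sum(                            # stand-in for formula (2)
            index(errors[fi][d], errors[fj][d], sample_counts[d])
            for d in sample_counts)
    return sorted(scores, key=scores.get, reverse=True)   # step e04: best first
```

The first two entries of the returned list correspond to the two combinations that step e04 would pass on to fusion; any other index function with the same signature can be supplied through the `index` parameter.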
As a preferred technical solution of the present invention, in step 008, according to the digit recognition features corresponding to "0" through "9" in the hydrologic data table and the recognition features of the character in each numeric character block of each data image, a support vector machine (SVM) classifier is used to obtain the digit corresponding to each numeric character block in each data image.
Compared with the prior art, the paper hydrologic yearbook digitalization method of the present invention, using the above technical solution, has the following technical effects. The method proposes, on the basis of single features, a feature fusion method with stronger complementarity, which improves the recognition rate. Because hydrologic processes are influenced by similar seasonal climatic factors and other random factors, they resemble one another; that is, the flow values are contextually correlated. In view of this correlation, the invention also proposes a time-series-based post-recognition error correction mechanism: after classifier recognition, the results are corrected according to a defined criterion. Experiments confirm that the proposed mechanism effectively improves recognition accuracy while maintaining working efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the paper hydrologic yearbook digitalization method designed by the present invention;
Fig. 2a is a schematic diagram of the horizontal projection of the hydrologic data table in the embodiment;
Fig. 2b is a schematic diagram of the vertical projection of the hydrologic data table in the embodiment;
Fig. 3 is a schematic diagram of the table formed by the vertical and horizontal lines extracted from the hydrologic data table in the embodiment;
Fig. 4 is a schematic diagram of the hydrologic yearbook layout analysis in the embodiment;
Fig. 5 is a schematic diagram of obtaining the data image in each numeric cell of the hydrologic data table in the embodiment;
Fig. 6 is a schematic diagram of the numeric character blocks obtained from a data image in the embodiment.
Detailed description
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
In daily business activity, large numbers of documents and forms are used every day. Form documents are used in every field, and people usually have to process them by hand: for example, clients need to file taxes, and librarians need to collect the data contained in paper form documents. With the development of optical character recognition (OCR) technology, people have begun to extract the data in forms from standard form images, which reduces working time and workload. In commercial settings, OCR can improve work quality and cut the large amount of time spent processing form documents. In many OCR applications, a form template is used to tell the user where the target printed character strings are in the image. These strings cover many kinds of content, such as flow information, text and mathematical formulas. The presence of table lines hinders the extraction of the data, so table line detection is an important task in printed table recognition.
In printed hydrologic data documents, the table is an indispensable part. It concentrates all of the document's information and lets the reader grasp its meaning accurately, in a form that is both concise and standardized. Examining the flow tables of the major hydrometric stations in the hydrologic yearbooks shows that the layout of a yearbook flow table follows regular rules, and these rules can be used to cut out the characters.
A hydrologic yearbook is the carrier in which hydrologic agencies publish the results of hydrologic monitoring of the rivers and water bodies in a basin, after processing and compiling them in the following year. Its content includes the compiled results and summary material, presented as charts and tables with the necessary explanatory text. It is a systematic, standardized repository of hydrologic data.
In 1958, the Hydrology Bureau of the Ministry of Water Resources defined a unified nationwide division of hydrologic data into volumes by river basin and water system, and gave the data published year by year the uniform name "Hydrological Yearbook of the People's Republic of China", divided nationwide into 10 volumes and 94 books. Its features are as follows.
Color feature: yellow background.
Structural features: the page is 440 mm wide and 140 mm high, an aspect ratio of 3.14. Digits in the yearbook are about 15 mm wide and 24 mm high, an aspect ratio of 0.625. The characters are located inside the table.
Texture features: the yearbook contains character-like regions in which the gray level of the digits shows a regular peak-and-valley variation in both the horizontal and the vertical direction.
Water Year Book characters are arranged regularly in horizontal rows and have fairly stable structural and texture features. The projection-based, top-down layout analysis method exploits exactly this property. In the character regions of the yearbook, edge information is very rich, so detecting and analyzing the character edges with a suitable tool separates the hydrological data from the background. Pixel values in the data regions fluctuate in a characteristic way, and the frequency of the changes stays within a certain range; these properties make it possible to locate the yearbook characters. Because the numeric regions of the yearbook have richer horizontal and vertical features than the non-numeric regions, a character localization algorithm based on horizontal and vertical projection is proposed: the transition points of the projection are obtained, and candidate character regions are determined from the number of transition points and the distances between them.
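The projection-based localization described above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with character pixels set to 1; the function names and the `min_len` and `max_gap` parameters are hypothetical and are not taken from the patent.

```python
def projection_profile(image, axis):
    """Count foreground (1) pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def find_runs(profile, min_count=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_count.

    Every start and end is a transition point of the projection profile.
    """
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i
        elif value < min_count and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def candidate_regions(runs, min_len, max_gap):
    """Merge runs whose gap is at most max_gap, then drop regions shorter than min_len."""
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]
```

The distance between transition points enters through `max_gap` (runs that are close together are treated as one region) and `min_len` (very short runs are discarded as noise).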
Below the top edge of the page there are about 275 pixels of blank space, followed by the basin name and hydrological station name of the Water Year Book together with the words "daily mean flow table". About 30 pixels to the right of this heading, the catchment area and the unit of flow are given, and the table starts about 20 pixels further on. The Water Year Book table consists of 11 horizontal lines and 14 vertical lines. The month is given between the first two horizontal lines and the day of the month between the first two vertical lines; the region between each subsequent pair of vertical lines and above the third horizontal line holds the flow values of one month. Between the following horizontal lines are given, for each month, the mean flow, the maximum daily flow and the minimum daily flow, together with annual statistics and notes. The final goal is to recognize the flow values, so the hydrological data must first undergo layout analysis: the table structure is analyzed and the table frame lines are extracted so that the flow values of each month can be located.
As shown in Fig. 1, the present invention provides a method for digitizing a paper Water Year Book. First, the hydrological data table on a page of the paper Water Year Book is photographed to obtain a hydrological data table image, which is then preprocessed by graying, binarization, denoising, rotation and inversion. The following steps are then carried out on the preprocessed hydrological data table image:
Step 001. Building on the two classical document layout segmentation approaches (top-down and bottom-up) and combining the advantages of both, the document layout of the Water Year Book is processed using structural features and texture features at the same time. This approach takes into account both segmentation accuracy and processing time, so the table can be located quickly and accurately. According to the layout of the paper Water Year Book page, determine the pixel position of the hydrological data table within the page, then go to step 002.
Step 002. According to the pixel position of the hydrological data table in the paper Water Year Book page, project the table horizontally and vertically (horizontal projection shown in Fig. 2a, vertical projection shown in Fig. 2b) and analyze the two projection profiles separately. In Fig. 2a, the 11 black dots mark the horizontal lines of the Water Year Book table; the hollow dots after the second black dot mark the upper and lower positions of each row of flow values, that is, the two sides of each subsequent peak give the top and bottom of the rows of flow values for the 1st to the 31st. In Fig. 2b, the 14 black dots mark the abscissas of the 14 vertical lines of the table; between each pair of black dots, that is, between each pair of adjacent vertical lines, the two sides of the peak give the left and right coordinates of each month's flow values, marked with hollow dots. The abscissa of each vertical line and the ordinate of each horizontal line of the hydrological data table are then extracted; a practical example is shown in Fig. 3, in which the numeric characters in each data image of the hydrological data table are white and the background is black. Figs. 2a and 2b therefore give a coarse localization of each month's flow values and of the table itself; the final result of the Water Year Book layout analysis is shown in Fig. 4. Then go to step 003.
Counting the black pixels in each row or column avoids detecting straight line segments directly, places few demands on the connectivity of the table lines, and gives good robustness to interference and good generalization. The method yields useful information such as the position and size of targets in the image, which simplifies the subsequent localization of the Water Year Book digits.
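As an illustration of locating the table lines by counting pixels per row or column, the following sketch treats a row or column as part of a ruling line when most of its pixels are line pixels. It assumes a binary image with line pixels set to 1; `line_positions` and the `min_fill` ratio are hypothetical and are not taken from the patent.

```python
def line_positions(binary_image, axis, min_fill=0.8):
    """Return the centre index of each ruling line.

    axis=0 scans rows (horizontal lines); axis=1 scans columns (vertical lines).
    A row or column counts as a line when at least min_fill of its pixels are 1.
    """
    lines = binary_image if axis == 0 else [list(col) for col in zip(*binary_image)]
    length = len(lines[0])
    hits = [i for i, line in enumerate(lines) if sum(line) >= min_fill * length]

    # Adjacent hit indices belong to one thick line; report its centre.
    positions, group = [], []
    for i in hits:
        if group and i != group[-1] + 1:
            positions.append(sum(group) // len(group))
            group = []
        group.append(i)
    if group:
        positions.append(sum(group) // len(group))
    return positions
```

Because only pixel counts are used, a line with small breaks is still found as long as its fill ratio stays above `min_fill`.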
Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line of the table, obtain from the projected image of the hydrological data table the data image in each numeric cell; a practical example is shown in Fig. 5. Then go to step 004. The numeric characters in each data image of the hydrological data table are white and the background is black.
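Cutting the cell images out of the table from the line coordinates can be sketched as follows. The helper `crop_cells` and its `margin` parameter (the number of pixels skipped at each line coordinate so that the ruling lines themselves are excluded) are hypothetical; line coordinates are assumed to be sorted in ascending order.

```python
def crop_cells(image, row_lines, col_lines, margin=1):
    """Return {(row_index, col_index): cell_image} for every table cell.

    row_lines holds the ordinates of the horizontal lines, col_lines the
    abscissas of the vertical lines.
    """
    cells = {}
    for r, (top, bottom) in enumerate(zip(row_lines, row_lines[1:])):
        for c, (left, right) in enumerate(zip(col_lines, col_lines[1:])):
            cells[(r, c)] = [
                row[left + margin:right - margin + 1]
                for row in image[top + margin:bottom - margin + 1]
            ]
    return cells
```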
In digitizing the paper Water Year Book, the accuracy of the features extracted later can only be guaranteed if the hydrological data image is segmented well. Segmentation of the yearbook image is the foundation of the whole digitization process. After a number has been located, its image is still a single whole that includes the blank space between digits. Each extracted number therefore has to be cut into characters, separating the individual characters from the whole number.
Step 004. For each data image, cut apart the numeric characters in it to obtain the numeric character blocks of that data image, as follows:
Step a01. Detect every white pixel inside each numeric character in the data image, together with the minimum distance from each edge of the data image to the numeric characters and the white pixel on the character at which that minimum is attained; then go to step a02;
Step a02. For each white pixel obtained in the previous step, judge whether the pixels directly above, below, to the left and to the right of it are all white. If so, the pixel is an interior pixel of a numeric character; otherwise the pixel is marked as an edge pixel of the character, and the column number of the pixel column in which it lies in the data image is recorded. Applying this test to every white pixel gives, for each numeric character in the data image, the column numbers of the pixel columns containing its edge pixels; then go to step a03;
Step a03. According to the column numbers of the pixel columns containing the edge pixels of each numeric character in the data image, divide the numeric characters in the data image to obtain its numeric character blocks.
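Steps a01 to a03 can be sketched as below. The interior/edge test follows step a02; grouping contiguous edge-pixel columns into one character block is an assumption about how step a03 performs the division, and all function names are hypothetical. The image is a list of rows with white pixels set to 1.

```python
def is_interior(image, y, x):
    """True if the four neighbours (up, down, left, right) are all white."""
    h, w = len(image), len(image[0])
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if not (0 <= ny < h and 0 <= nx < w) or image[ny][nx] == 0:
            return False
    return True


def edge_columns(image):
    """Sorted column numbers that contain at least one character edge pixel."""
    cols = set()
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value == 1 and not is_interior(image, y, x):
                cols.add(x)
    return sorted(cols)


def split_characters(image):
    """Cut the data image into blocks, one per run of contiguous edge columns."""
    blocks, start, prev = [], None, None
    for x in edge_columns(image):
        if start is None:
            start = x
        elif x != prev + 1:
            blocks.append((start, prev + 1))
            start = x
        prev = x
    if start is not None:
        blocks.append((start, prev + 1))
    return [[row[s:e] for row in image] for s, e in blocks]
```

Characters that touch each other share contiguous columns and would come out as a single block under this assumption.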
Based on the above procedure, the numeric character blocks in every data image are obtained; in a practical example, the numeric character blocks obtained from one data image are shown in Fig. 6. Then go to step 005.
If the preprocessed data were fed directly to the classifier, the amount of data to be processed during classification would be large. The purpose of feature extraction is to start from the topological structure of the digits and extract structural features that are relatively insensitive to disturbances such as displacement, size change and font distortion. Supplying the classifier with only the key information that reflects the characteristics of each digit indirectly increases its fault tolerance, and the amount of data is greatly reduced after feature extraction. Feature extraction is critical to recognition and should follow these principles:
(1) the features are easy to extract;
(2) they have strong discriminative power, i.e. a feature should differ greatly between different digits and very little between instances of the same digit;
(3) they are stable, minimizing the influence of broken or touching strokes.
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block; together they form the identification feature of that numeric character. The identification features of the numeric characters in all numeric character blocks of all data images are obtained in this way; then go to step 006.
The grid feature describes the overall distribution of the character image and is highly robust to noise. The main idea is to divide the dot matrix of the digit into several local regions and to use the pixel density in each region as the descriptor, i.e. the percentage of character pixels in each region serves as the feature value. The grid feature reflects local statistics of the image as relative percentages, and a local deformation or noise corresponds in the dot matrix merely to some local elements flipping between "0" and "1". The percentages computed for an image with local deformation or noise therefore differ little from those of the original image without them. In other words, these relative values are insensitive to deformation of individual strokes and to isolated noise points, so digit recognition based on the grid feature has good noise resistance. Each segmented digit is divided here into a 3 × 3 grid of small regions, 9 in total.
In step 005 above, extracting the grid feature of the numeric character in each numeric character block of each data image comprises the following steps:
Step b01. Obtain the upper, lower, left and right boundaries of the numeric character block and from them the image of the numeric character body; then go to step b02.
Step b02. Apply center-of-gravity normalization to the numeric character body image, and divide the normalized image evenly into a preset number of sub-region images; then go to step b03.
Step b03. Obtain the proportion of white pixels in each sub-region image of the numeric character body image; these proportions together form the grid feature of the numeric character in the numeric character block.
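A minimal sketch of steps b01 and b03 follows. It assumes a binary block with white character pixels set to 1 and at least one white pixel; the center-of-gravity normalization of step b02 is omitted for brevity, and the function names are hypothetical.

```python
def bounding_box(image):
    """Return (top, bottom, left, right), end-exclusive, of the white pixels."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x, col in enumerate(zip(*image)) if any(col)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def grid_feature(image, n=3):
    """Proportion of white pixels in each cell of an n x n grid over the character body."""
    top, bottom, left, right = bounding_box(image)
    body = [row[left:right] for row in image[top:bottom]]
    h, w = len(body), len(body[0])
    feature = []
    for i in range(n):
        y0, y1 = i * h // n, (i + 1) * h // n
        for j in range(n):
            x0, x1 = j * w // n, (j + 1) * w // n
            area = (y1 - y0) * (x1 - x0)
            white = sum(sum(row[x0:x1]) for row in body[y0:y1])
            # A body narrower than n pixels yields empty cells; report 0.0 for them.
            feature.append(white / area if area else 0.0)
    return feature
```

With `n=3` the result is the 9-element vector described above, one density value per region in row-major order.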
The Fourier transform is a two-dimensional orthogonal transform widely used in image processing. After the transform, the mean term, i.e. the DC term, is proportional to the mean gray value of the image, while the low-frequency components indicate the strength and direction of object edges in the image. Numeric characters can generally be represented by closed contours made up of many line segments, and a set of discrete values obtained by mapping is sufficient to reflect the variation of these closed contours. Fourier coefficients describe the boundary contour of an image well, and their values are independent of the translation, rotation, displacement and size of similar glyphs. When glyphs are represented and recognized, these features also provide significant data compression.
In step 005 above, extracting the Fourier feature of the numeric character in each numeric character block of each data image comprises the following steps:
Step c01. Apply the two-dimensional discrete Fourier transform to the numeric character block; then go to step c02.
Step c02. Apply a centering transform to the transformed numeric character block: divide it evenly into four sub-region images and exchange them diagonally to obtain the Fourier spectrum image; then go to step c03.
Step c03. Analyze the Fourier coefficients of the centered Fourier spectrum image and find the region of the numeric character block's spectrum in which the coefficients exceeding a preset amplitude threshold are concentrated; this region constitutes the significant Fourier coefficient region. Then go to step c04.
Step c04. From the significant Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them to form the Fourier feature of the numeric character in the numeric character block.
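Steps c01 to c04 can be sketched in plain Python as follows. The naive transform is only practical for small blocks. As a stand-in for the thresholded significant coefficient region of step c03, the sketch simply takes a fixed window around the spectrum centre; that window, the `half_window` parameter, the normalization by the largest amplitude and the function names are all assumptions, not details given in the patent. The block must be at least `2 * half_window + 1` pixels in each dimension.

```python
import cmath


def dft2(image):
    """Naive two-dimensional discrete Fourier transform (step c01)."""
    h, w = len(image), len(image[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    total += image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
            out[u][v] = total
    return out


def center_shift(spectrum):
    """Exchange the four quadrants diagonally so the DC term moves to the centre (step c02)."""
    h, w = len(spectrum), len(spectrum[0])
    return [[spectrum[(y - h // 2) % h][(x - w // 2) % w] for x in range(w)]
            for y in range(h)]


def fourier_feature(image, half_window=1):
    """Normalized amplitudes in a window around the centred DC term (steps c03, c04)."""
    spectrum = center_shift(dft2(image))
    cy, cx = len(spectrum) // 2, len(spectrum[0]) // 2
    amps = [abs(spectrum[y][x])
            for y in range(cy - half_window, cy + half_window + 1)
            for x in range(cx - half_window, cx + half_window + 1)]
    peak = max(amps)
    return [a / peak for a in amps] if peak > 0 else amps
```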
Invariant moment features are statistical features of an image: mathematical features that are invariant to translation, scaling and rotation.
In step 005 above, extracting the contour moment feature of the numeric character in each numeric character block of each data image comprises the following steps:
Step d01. Extract the contour of the numeric character in the numeric character block; then go to step d02.
Step d02. Apply invariant moment processing to the contour of the numeric character in the numeric character block and extract a preset number of two-dimensional contour invariant moment features, which form the contour moment feature of the numeric character in the numeric character block.
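As an illustration of steps d01 and d02, the sketch below takes the contour as the white pixels with at least one non-white 4-neighbour and computes the first two of Hu's invariant moments over those contour pixels. The choice of Hu's moments, the region-style normalization of the central moments and the function names are assumptions; the patent does not state which invariant moments are used or how many.

```python
def contour_points(image):
    """(x, y) coordinates of white pixels that touch the background or the border."""
    h, w = len(image), len(image[0])
    points = []
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1:
                continue
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(not (0 <= ny < h and 0 <= nx < w) or image[ny][nx] == 0
                   for ny, nx in neighbours):
                points.append((x, y))
    return points


def hu_first_two(points):
    """First two Hu invariants of a point set (needs at least one point)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def eta(p, q):
        # Normalized central moment; the zeroth-order moment of a point set is n.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Both values are unchanged when the point set is shifted or rotated by 90 degrees, which is the property the contour moment feature relies on.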
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exist downward from the top edge of the numeric character block. If so, the numeric character block is judged to contain a decimal point; otherwise no further operation is performed. After all numeric character blocks in all data images have been judged, go to step 007.
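One way to read the decimal point test of step 006 is sketched below: a decimal point sits at the bottom of its block, so the rows counted downward from the top edge contain only black (background) pixels. This reading, the helper name `is_decimal_point` and the `min_blank_rows` parameter are assumptions. The block is a list of rows with white character pixels set to 1 and is assumed to keep the full height of its cell.

```python
def is_decimal_point(block, min_blank_rows):
    """True if at least min_blank_rows rows from the top contain no white pixel."""
    blank = 0
    for row in block:
        if any(row):
            break
        blank += 1
    return blank >= min_blank_rows
```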
If the identification features obtained in the above steps are each used separately for classification with a neural network or a support vector machine classifier, the results are unsatisfactory. The main reason is that it is hard to find a single feature that suits all digits, and previous methods all extract and fuse features for one specific digit recognition application. Each digit has its own characteristics, and correct classification requires combining several features. The complementarity of the features is the key to ensuring that the extracted features give a high recognition rate and good generalization, and it is the basis of feature fusion. Therefore, before features are fused, the problem of measuring feature complementarity has to be solved.
Step 007. Perform feature fusion on all identification features of the numeric characters in all data images to form the digit identification features corresponding to "0" through "9" in the hydrological data table; then go to step 008.
Step 007 above comprises the following steps:
Step e01. By permutation and combination, form every combination of two identification features from all identification features of the numeric characters in all data images, giving the set of all identification feature combinations; then go to step e02.
Step e02. Using all identification features of the numeric characters in all data images, build the sample set S for the digits "0" through "9" in the hydrological data table. Then, for each identification feature combination, obtain according to formula (1) its feature complementarity index C_{ij,A} with respect to each standard digit "0" to "9"; in this way the indices C_{ij,A} of every identification feature combination are obtained. Then go to step e03. Here, the larger C_{ij,A} is, the stronger the complementarity of identification features F_i and F_j with respect to standard digit A, and the smaller it is, the weaker the complementarity; S_i and S_j denote the sets of samples in S misclassified by identification feature F_i and by identification feature F_j respectively; E(S) is the number of samples in the sample set S; E(S_i ∪ S_j) is the number of samples in the union of S_i and S_j; E(S_i ∩ S_j) is the number of samples in the intersection of S_i and S_j; A ∈ {0, 1, ..., 9}; and C_{ij,A} is the feature complementarity index, with respect to standard digit A, of the combination formed by identification features F_i and F_j.
Step e03. For each identification feature combination, obtain according to formula (2) its overall complementarity index TC_k with respect to the standard digits; then go to step e04. Here k ∈ {1, ..., K}, K is the number of identification feature combinations, and TC_k is the overall complementarity index of the k-th identification feature combination with respect to the standard digits.
Step e04. Sort all identification feature combinations by overall complementarity index in descending order, take the two top-ranked combinations, and fuse the features of these two combinations to form the digit identification features corresponding to "0" through "9" in the hydrological data table.
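The bookkeeping behind steps e01 to e04 can be sketched as below. Formulas (1) and (2) are not reproduced in this text, so the sketch deliberately computes no complementarity index; it only enumerates the feature pairs, gathers the quantities E(S), E(S_i ∪ S_j) and E(S_i ∩ S_j) that the definitions mention, and selects the two top-ranked pairs from index values supplied by the caller. The function names and data layout are hypothetical.

```python
from itertools import combinations


def pair_counts(misclassified, total):
    """Map each feature pair to (E(S), E(Si ∪ Sj), E(Si ∩ Sj)).

    misclassified maps a feature name to the set of sample ids it misclassifies;
    total is the number of samples in S. For a per-digit index, pass only the
    samples of standard digit A.
    """
    counts = {}
    for fi, fj in combinations(sorted(misclassified), 2):
        si, sj = misclassified[fi], misclassified[fj]
        counts[(fi, fj)] = (total, len(si | sj), len(si & sj))
    return counts


def top_two_pairs(overall_index):
    """Return the two keys with the largest overall complementarity index."""
    return sorted(overall_index, key=overall_index.get, reverse=True)[:2]
```

For example, with three features there are three pairs, and `top_two_pairs` would be called on a dictionary holding one TC_k value per pair once those values have been computed from formula (2).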
In the above technical solution, the different features are each used for classification, the recognition results of the single features are analyzed, and the overall complementarity index of each feature combination is calculated with the above formulas; the selected features are then fused through a linear relationship. Experiments show that the coarse grid feature and the Fourier feature are both beneficial for recognizing the digits in Water Year Book data and that their overall complementarity is strong, so the Fourier feature is concatenated after the coarse grid feature. In the experiments, the recognition rate of the proposed fused feature is 3.8981% higher than that of the Fourier feature alone, 1.4033% higher than that of the grid feature alone, and 83.1956% higher than that of the contour moment feature alone.
Step 008. According to the digit identification features corresponding to "0" through "9" in the hydrological data table and the identification feature of the numeric character in each numeric character block of each data image, use a support vector machine (SVM) classifier to obtain the digit corresponding to each numeric character block in each data image; then go to step 009.
Step 009. From the digit or decimal point corresponding to each numeric character block in each data image, compose the numerical value of the data image in each numeric cell of the hydrological data table; combine these with the attributes of the hydrological data table format to obtain each attribute of the hydrological data table and its corresponding value, and store them. Then go to step 010.
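Composing a cell value from its recognized character blocks, as in step 009, can be sketched as follows. The helper `assemble_value` and the use of `"."` to stand for a block judged to be a decimal point are hypothetical conventions.

```python
def assemble_value(symbols):
    """Join recognized symbols (digits 0-9 or '.') in reading order into a number."""
    text = "".join(str(s) for s in symbols)
    if text.count(".") > 1 or not any(ch.isdigit() for ch in text):
        raise ValueError(f"cannot form a numeric value from {text!r}")
    return float(text)
```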
By analyzing the behavior of the flow values, a post-recognition error correction mechanism based on the time series is proposed. In the experiments, the final recognition rate for the Water Year Book is close to 99% and the error rate is relatively low. A flow value consists of 4 to 5 character blocks, and if any one of its digits is recognized incorrectly the whole value is counted as wrong; this differs slightly from the way error rates are computed on common data sets such as MNIST and USPS. Inspection of the recognition results shows that a flow value usually has only one misrecognized digit and that no more than 3 flow values per month are misrecognized. If an algorithm can find the flow values whose recognition reliability is low, that is, those in which a key digit before the decimal point has been misrecognized, then errors can be corrected with a mean-value method based on the statistics of how the monthly flow varies, which greatly improves practical efficiency.
Flow values are themselves obtained by instrument measurement and so carry a certain error. If the flow fluctuates only within a small range, that is, if only a digit after the decimal point is misrecognized, the error can be tolerated provided it does not affect the analysis and use of the flow data; such a value is not counted as misrecognized.
Step 010. For each attribute and corresponding value recognized and stored from the hydrological data table, perform steps 010-01 and 010-02 below on the flow values of each month to obtain a preliminary recognition judgment for each daily flow value of each month; then go to step 011.
Step 010-01. Take the flow value of the first day of the month as the first threshold. Then, for each of the first two days of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives the preliminary recognition judgment for the flow values of the first two days of the month; then go to step 010-02.
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value. If so, the current day's flow value is judged to be correctly recognized; otherwise it is judged to be preliminarily misrecognized. This gives the preliminary recognition judgment for each daily flow value of the month from the third day onward.
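Steps 010-01 and 010-02 can be sketched as a single pass over one month of daily flow values. Two points are assumptions rather than statements of the text: the difference is taken as an absolute value, and the last day of the month, which has no following day, is left unflagged. The function name is hypothetical.

```python
def preliminary_check(flows):
    """Return one flag per day; True marks a preliminarily misrecognized value."""
    flags = []
    for i, current in enumerate(flows):
        if i + 1 >= len(flows):
            flags.append(False)  # assumption: no next day, so nothing to compare
            continue
        # Days 1 and 2 use the first day's value as threshold (step 010-01);
        # later days use the previous day's value (step 010-02).
        threshold = flows[0] if i < 2 else flows[i - 1]
        flags.append(abs(flows[i + 1] - current) >= threshold)
    return flags
```

Note that one misread value typically flags two neighbouring days, because it takes part in two consecutive differences; the later per-digit probability check narrows this down.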
Step 011. According to each value recognized and stored from the hydrological data table and the identification feature of each digit in each value, use the support vector machine trainer to obtain, for each digit of each stored value, the ten recognition probabilities corresponding to "0" through "9"; then go to step 012.
Step 012. For each digit of each value recognized and stored from the hydrological data table, obtain the largest and the second largest of its ten recognition probabilities for "0" through "9", compute the difference between the two, and judge whether this difference is less than a preset recognition probability threshold of 0.1 to 0.25. If so, the digit is judged to be preliminarily misrecognized; otherwise the digit is judged to be correctly recognized. This gives the preliminary recognition judgment for each digit of each stored value; then go to step 013.
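The probability margin test of step 012 amounts to the small check below. The function name is hypothetical, and the default threshold of 0.2 is simply one value inside the 0.1 to 0.25 range given in the text.

```python
def is_uncertain(probabilities, margin_threshold=0.2):
    """True if the gap between the two largest class probabilities is below the threshold."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return (top - second) < margin_threshold
```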
Step 013. For each preliminarily misrecognized flow value in each month, judge whether the value contains a preliminarily misrecognized digit. There are two cases:
If so, the preliminarily misrecognized flow value is judged to be wrong and an alarm is raised; at the same time, the position of the preliminarily misrecognized digit within the flow value is analyzed. If the digit lies in the integer part of the flow value, the flow value is replaced by the mean of the flow values of the previous day and the following day. If the digit lies in the fractional part of the flow value, the fractional part is replaced by the mean of the fractional parts of the flow values of the previous day and the following day;
Otherwise, the preliminarily misrecognized flow value is judged to be correct. This completes the check of every value recognized and stored from the hydrological data table.
Experimental comparison shows that, in the paper Water Year Book digitization method of the present invention, feature fusion gives a higher recognition rate than any single feature. The Fourier feature alone recognizes the digit 0 well but the digits 6 and 9 poorly; the coarse grid feature recognizes the digit 0 poorly but the digits 6 and 9 well; and the contour moment feature recognizes the digits 0, 6 and 8 poorly. The three features give broadly the same results on the other digits. Calculating the complementarity index between features shows that the fusion of the Fourier feature and the coarse grid feature distinguishes the different digits well: fusing a description of the boundary contour of a digit with a description of its interior describes the whole digit more completely and is sufficient to represent it, so a good recognition result is obtained.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the present invention.
Claims (10)
1. A paper Water Year Book digitization method, characterized by comprising the following steps:
Step 001. According to the layout of the paper Water Year Book page, determine the pixel position of the hydrological data table within the paper Water Year Book page, then go to step 002;
Step 002. According to the pixel position of the hydrological data table in the paper Water Year Book page, project the hydrological data table horizontally and vertically, analyze the horizontal and vertical projection profiles of the table separately, and extract the abscissa of each vertical line and the ordinate of each horizontal line of the hydrological data table, then go to step 003;
Step 003. According to the format of the hydrological data table, together with the abscissa of each vertical line and the ordinate of each horizontal line of the table, obtain from the projected image of the hydrological data table the data image in each numeric cell, then go to step 004; wherein the numeric characters in each data image of the hydrological data table are white and the background is black;
Step 004. For each data image, cut apart the numeric characters in it to obtain the numeric character blocks of that data image, thereby obtaining the numeric character blocks of every data image, then go to step 005;
Step 005. For each numeric character block in each data image, extract the grid feature, the Fourier feature and the contour moment feature of the numeric character in the block, which together form the identification feature of that numeric character, thereby obtaining the identification feature of the numeric character in every numeric character block of every data image, then go to step 006;
Step 006. For each numeric character block in each data image, judge whether a preset number of black pixels exist downward from the top edge of the numeric character block; if so, the numeric character block is judged to contain a decimal point, otherwise no further operation is performed; after all numeric character blocks in all data images have been judged, go to step 007;
Step 007., for all identification features of numeric character in all data images, carries out Feature Fusion, constitutes hydrological data
In form, the most corresponding " 0 " arrives the numerical identification feature of " 9 ", subsequently into step 008;
Step 008. According to the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, and the recognition feature of the numeric character in each numeric character block of each data image, obtain through a preset classifier the digit corresponding to each numeric character block in each data image, then proceed to step 009;
Step 009. According to the digit or decimal point corresponding to each numeric character block in each data image, compose the numerical value corresponding to the data image in each value cell of the hydrological data table; then, combining the attributes defined in the table format, obtain each attribute in the hydrological data table and its corresponding value, and store them.
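Steps 003, 006 and 009 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the one-pixel ruling-line width, the 0/1 pixel encoding (1 = white character, 0 = black background) and the reading of step 006 as "the top rows of the block are entirely black" are all assumptions made for illustration.

```python
def crop_cells(table, xs, ys):
    """Step 003 (sketch): cut a binarized table image into one data image per cell.

    table: list of pixel rows; xs: sorted abscissas of the vertical lines;
    ys: sorted ordinates of the horizontal lines.
    Assumption: ruling lines are one pixel thick and are excluded from the cells.
    """
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        row_of_cells = []
        for left, right in zip(xs, xs[1:]):
            row_of_cells.append([line[left + 1:right] for line in table[top + 1:bottom]])
        cells.append(row_of_cells)
    return cells


def is_decimal_point(block, blank_rows):
    """Step 006 (as read here): the top `blank_rows` rows hold only black (0) pixels."""
    return all(pixel == 0 for row in block[:blank_rows] for pixel in row)


def assemble_value(symbols):
    """Step 009 (sketch): join recognized digits (0-9) and '.' marks, left to right."""
    return float("".join(str(s) for s in symbols))
```

For example, a cell recognized as the symbols 1, 2, '.', 5 is stored as the value 12.5.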
2. The paper hydrologic yearbook digitalization method, characterized in that the following steps are further included after step 009, step 010 being entered once step 009 has been executed:
Step 010. For the attributes and corresponding values in the recognized and stored hydrological data table, perform steps 010-01 to 010-02 below on the flow values of each month, thereby obtaining a preliminary recognition judgment for every daily flow value of each month, then proceed to step 011;
Step 010-01. Take the flow value of the first day of the month as the first threshold; then, for each of the first two daily flow values of the month, judge whether the difference between the next day's flow value and the current day's flow value is less than the first threshold; if so, judge the current day's flow value to be correctly recognized; otherwise judge the current day's flow value to be preliminarily misrecognized. A preliminary recognition judgment is thereby obtained for each of the first two daily flow values of the month, then proceed to step 010-02;
Step 010-02. For each daily flow value of the month from the third day onward, judge whether the difference between the next day's flow value and the current day's flow value is less than the previous day's flow value; if so, judge the current day's flow value to be correctly recognized; otherwise judge the current day's flow value to be preliminarily misrecognized. A preliminary recognition judgment is thereby obtained for each daily flow value of the month from the third day onward;
Step 011. According to each value in the recognized and stored hydrological data table, and the recognition feature of each digit in each value, obtain through a preset trainer, for each digit in each value of the table, the ten recognition result probabilities corresponding respectively to "0" through "9", then proceed to step 012;
Step 012. For each digit in each value of the recognized and stored hydrological data table, obtain the largest and the second-largest of the ten recognition result probabilities corresponding to "0" through "9", and compute the difference between them; judge whether this difference is less than a preset recognition result probability threshold; if so, judge the digit to be preliminarily misrecognized; otherwise judge the digit to be correctly recognized. A preliminary recognition judgment is thereby obtained for each digit in each value of the table, then proceed to step 013;
Step 013. For each preliminarily misrecognized flow value in each month, judge whether the flow value contains a preliminarily misrecognized digit; if so, judge the flow value to be erroneous and raise an alarm; otherwise judge the flow value to be correct. The check of each value in the recognized and stored hydrological data table is thereby completed.
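The two-stage check of steps 010 to 013 can be sketched in Python as follows. This is an illustrative reading only: the function names are hypothetical, the "difference" is assumed to be an absolute difference, and the last day of the month is skipped because it has no next-day value (the text does not say how it is handled).

```python
def suspect_days(flows):
    """Steps 010-01 and 010-02: 0-based indices of days that fail the preliminary check."""
    suspects = []
    first_threshold = flows[0]  # flow of the first day of the month
    for day in range(len(flows) - 1):  # assumption: the last day is skipped
        # First two days use the first threshold; later days use the previous day's flow.
        limit = first_threshold if day < 2 else flows[day - 1]
        if abs(flows[day + 1] - flows[day]) >= limit:
            suspects.append(day)
    return suspects


def is_suspect_digit(probabilities, margin_threshold):
    """Step 012: the two largest class probabilities are closer than the threshold."""
    largest, second = sorted(probabilities, reverse=True)[:2]
    return largest - second < margin_threshold


def confirmed_errors(flows, digit_probabilities, margin_threshold):
    """Step 013: suspect days whose value also contains at least one suspect digit.

    digit_probabilities[day] is a list with one 10-element probability list per digit.
    """
    return [
        day for day in suspect_days(flows)
        if any(is_suspect_digit(p, margin_threshold) for p in digit_probabilities[day])
    ]
```

A day is reported only when both tests agree, so a genuine flood peak whose digits were all recognized with a clear margin is not flagged.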
3. The paper hydrologic yearbook digitalization method, characterized in that, in step 011, according to each value in the recognized and stored hydrological data table, and the recognition feature of each digit in each value, the ten recognition result probabilities corresponding respectively to "0" through "9" are obtained for each digit in each value of the table through a support vector machine trainer.
4. The paper hydrologic yearbook digitalization method, characterized in that, in step 013, when a preliminarily misrecognized flow value is judged to be erroneous because it contains a preliminarily misrecognized digit, and an alarm is raised, the position of that digit within the flow value is also analyzed: if the preliminarily misrecognized digit lies in the integer part of the flow value, the flow value is replaced by the average of the flow values of the day before and the day after the corresponding date; if the preliminarily misrecognized digit lies in the fractional part of the flow value, the fractional part of the flow value is replaced by the average of the fractional parts of the flow values of the day before and the day after the corresponding date.
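The replacement rule of this claim can be sketched as a single Python function. The function name and its boolean parameter are hypothetical, and flow values are assumed to be non-negative.

```python
import math


def repair_flow(prev_value, value, next_value, error_in_integer_part):
    """Replace an erroneous daily flow value using its two neighbouring days."""
    if error_in_integer_part:
        # Integer part is unreliable: replace the whole value by the neighbour average.
        return (prev_value + next_value) / 2
    # Fractional part is unreliable: keep the integer part, average the neighbours' decimals.
    frac_prev = prev_value - math.floor(prev_value)
    frac_next = next_value - math.floor(next_value)
    return math.floor(value) + (frac_prev + frac_next) / 2
```

For instance, with neighbouring values 12.25 and 12.75, a value 12.9 whose fractional digit is suspect becomes 12.5, while a value 92.5 whose integer digit is suspect, between 12.0 and 13.0, becomes 12.5.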
5. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in step 004, performing character segmentation on each numeric character in a data image to obtain each numeric character block in that data image specifically comprises the following steps:
Step a01. Detect and obtain each white pixel inside each numeric character in the data image, as well as the minimum distance from each edge of the data image to each numeric character and the corresponding white pixel on that numeric character, then proceed to step a02;
Step a02. For each white pixel obtained from the data image in the previous step, judge whether the pixels above, below, to the left and to the right of it are all white pixels; if so, judge the pixel to be an interior pixel of a numeric character; otherwise judge, according to the identifier, that the pixel is an edge pixel of a character, and obtain the column number of the pixel column in which it lies in the data image. After each white pixel obtained from the data image in the previous step has been judged in this way, the column numbers of the pixel columns containing the upper-edge pixels of each numeric character in the data image are obtained, then proceed to step a03;
Step a03. According to the column numbers of the pixel columns containing the upper-edge pixels of each numeric character in the data image, divide the numeric characters in the data image to obtain each numeric character block in the data image.
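A simplified Python sketch of steps a02 and a03 is given below. It is an assumption-laden illustration rather than the claimed procedure: pixels are encoded as 1 (white) and 0 (black), pixels outside the image count as non-white, and neighbouring characters are assumed to be separated by at least one column without edge pixels. The function names are hypothetical.

```python
def edge_columns(image):
    """Step a02 (sketch): columns holding a white pixel with a non-white 4-neighbour."""
    height, width = len(image), len(image[0])

    def is_white(r, c):
        return 0 <= r < height and 0 <= c < width and image[r][c] == 1

    columns = set()
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1:
                continue
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if not all(is_white(nr, nc) for nr, nc in neighbours):
                columns.add(c)  # edge pixel: record its column number
    return sorted(columns)


def split_characters(image):
    """Step a03 (sketch): group adjacent edge columns into (first, last) column ranges."""
    blocks = []
    for col in edge_columns(image):
        if blocks and col == blocks[-1][1] + 1:
            blocks[-1][1] = col  # extend the current character block
        else:
            blocks.append([col, col])  # a gap starts a new character block
    return [tuple(block) for block in blocks]
```

Each returned range gives the first and last pixel column of one numeric character block.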
6. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the grid feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step b01. Obtain the upper, lower, left and right boundaries of the numeric character block, and thereby obtain the character body image, then proceed to step b02;
Step b02. Perform centroid normalization on the character body image, and divide the centroid-normalized character body image evenly into a preset number of sub-region images, then proceed to step b03;
Step b03. Obtain the proportion of white pixels in each sub-region image of the character body image; together these proportions constitute the grid feature of the numeric character in the numeric character block.
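Steps b01 and b03 can be sketched in Python as below. The centroid normalization of step b02 is deliberately omitted, so this is only a partial illustration; the function name, the n x n grid layout and the 0/1 pixel encoding are assumptions, and the block is assumed to contain at least one white pixel.

```python
def grid_feature(block, n=2):
    """White-pixel ratio in each cell of an n x n grid over the character's bounding box."""
    # Step b01: bounding box of the white (1) pixels.
    rows = [r for r, row in enumerate(block) if any(row)]
    cols = [c for c in range(len(block[0])) if any(row[c] for row in block)]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    height, width = bottom - top, right - left

    # Step b03: proportion of white pixels per sub-region (step b02 is not modelled here).
    feature = []
    for i in range(n):
        r0, r1 = top + i * height // n, top + (i + 1) * height // n
        for j in range(n):
            c0, c1 = left + j * width // n, left + (j + 1) * width // n
            area = (r1 - r0) * (c1 - c0)
            white = sum(block[r][c] for r in range(r0, r1) for c in range(c0, c1))
            feature.append(white / area if area else 0.0)
    return feature
```

The result is a vector of n * n ratios in row-major order, each between 0 and 1.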
7. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the Fourier feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step c01. Perform a two-dimensional discrete Fourier transform on the numeric character block, then proceed to step c02;
Step c02. Apply a centering transformation to the numeric character block that has undergone the two-dimensional discrete Fourier transform, that is, divide the block evenly into four sub-region images and exchange them diagonally, to obtain the Fourier spectrum image, then proceed to step c03;
Step c03. Analyze the Fourier coefficients of the centered Fourier spectrum image to obtain, among the Fourier coefficients of the numeric character block, the region in which the coefficients exceeding a preset amplitude threshold are concentrated, which constitutes the large-amplitude Fourier coefficient region, then proceed to step c04;
Step c04. From the large-amplitude Fourier coefficient region, extract a preset number of discrete Fourier transform coefficients and normalize them, constituting the Fourier feature of the numeric character in the numeric character block.
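Steps c01 to c04 can be approximated with NumPy as in the sketch below. Two simplifications are assumptions of this sketch and not part of the claim: the large-amplitude region is taken to be a fixed k x k window around the centre of the shifted spectrum instead of being found with an amplitude threshold, and normalization divides by the largest magnitude in that window. The block is assumed to be at least k x k pixels.

```python
import numpy as np


def fourier_feature(block, k=4):
    """Normalized magnitudes of the k x k low-frequency window of the centred spectrum."""
    pixels = np.asarray(block, dtype=float)
    # Step c01: 2-D discrete Fourier transform.
    # Step c02: fftshift swaps the quadrants diagonally, moving the DC term to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(pixels))
    magnitude = np.abs(spectrum)
    # Step c03 (simplified): assume the large coefficients sit around the centre.
    cy, cx = magnitude.shape[0] // 2, magnitude.shape[1] // 2
    half = k // 2
    window = magnitude[cy - half:cy + half, cx - half:cx + half]
    # Step c04: normalize the extracted coefficients.
    peak = window.max()
    return (window / peak).ravel() if peak > 0 else window.ravel()
```

With k = 4 the feature has 16 components; the component that corresponds to the DC term equals 1 after normalization because no Fourier magnitude of a non-negative image can exceed it.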
8. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in step 005, extracting the contour moment feature of the numeric character in each numeric character block of each data image specifically comprises the following steps:
Step d01. Perform contour extraction on the numeric character in the numeric character block, then proceed to step d02;
Step d02. Apply invariant-moment processing to the contour of the numeric character in the numeric character block and extract a preset number of two-dimensional contour invariant-moment features, constituting the contour moment feature of the numeric character in the numeric character block.
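As an illustration of steps d01 and d02, the Python sketch below extracts contour pixels and computes the first two Hu invariant moments from them. The choice of Hu moments, the use of only two of them, the 4-neighbour contour definition and the function names are assumptions; the claim only requires a preset number of contour invariant-moment features.

```python
def contour_pixels(block):
    """Step d01 (sketch): white (1) pixels with at least one non-white 4-neighbour."""
    height, width = len(block), len(block[0])

    def is_white(r, c):
        return 0 <= r < height and 0 <= c < width and block[r][c] == 1

    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [
        (r, c)
        for r in range(height)
        for c in range(width)
        if block[r][c] == 1
        and not all(is_white(r + dr, c + dc) for dr, dc in offsets)
    ]


def contour_moment_feature(points):
    """Step d02 (sketch): first two Hu invariant moments of a set of contour points."""
    n = len(points)
    mean_r = sum(r for r, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    # Second-order central moments of the contour points.
    mu20 = sum((c - mean_c) ** 2 for _, c in points)
    mu02 = sum((r - mean_r) ** 2 for r, _ in points)
    mu11 = sum((r - mean_r) * (c - mean_c) for r, c in points)
    # Normalized central moments: eta_pq = mu_pq / mu_00 ** 2 for p + q = 2, with mu_00 = n.
    eta20, eta02, eta11 = mu20 / n ** 2, mu02 / n ** 2, mu11 / n ** 2
    return (eta20 + eta02, (eta20 - eta02) ** 2 + 4 * eta11 ** 2)
```

Both values are unchanged when the character is shifted within the block or rotated by 90 degrees, which is the property that makes them usable as recognition features.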
9. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that step 007 specifically comprises the following steps:
Step e01. By permutation and combination, for all recognition features of the numeric characters in all data images, combine any two recognition features to form all recognition feature combinations, then proceed to step e02;
Step e02. From all recognition features of the numeric characters in all data images, form the sample set $S$ corresponding to the digits "0" through "9" in the hydrological data table; then, for each recognition feature combination, according to formula (1), obtain the feature complementarity index $C_{ij,A}$ of that combination relative to the standard digits "0" through "9", thereby obtaining the feature complementarity index $C_{ij,A}$ of every recognition feature combination relative to the standard digits "0" through "9", then proceed to step e03; wherein $S_i$ and $S_j$ denote the sets of samples in $S$ that are misclassified by recognition feature $F_i$ and recognition feature $F_j$ respectively; $E(S)$ denotes the number of samples in the sample set $S$; $E(S_i \cup S_j)$ denotes the number of samples in the union of $S_i$ and $S_j$; $E(S_i \cap S_j)$ denotes the number of samples in the intersection of $S_i$ and $S_j$; $A = \{0, 1, \ldots, 9\}$; and $C_{ij,A}$ denotes the feature complementarity index, relative to the standard digit $A$, of the recognition feature combination formed by $F_i$ and $F_j$;
Step e03. For each recognition feature combination, according to formula (2), obtain the overall complementarity index $TC_k$ of that combination relative to the standard digits, then proceed to step e04; wherein $k = \{1, \ldots, K\}$, $K$ denotes the number of recognition feature combinations, and $TC_k$ denotes the overall complementarity index of the $k$-th recognition feature combination relative to the standard digits;
Step e04. Sort all recognition feature combinations in descending order of their overall complementarity index and take the top two combinations; then perform feature fusion on these two recognition feature combinations to constitute the digit recognition features in the hydrological data table that correspond respectively to "0" through "9".
10. The paper hydrologic yearbook digitalization method according to any one of claims 1 to 4, characterized in that, in step 008, according to the digit recognition features corresponding respectively to "0" through "9" in the hydrological data table, and the recognition feature of the numeric character in each numeric character block of each data image, the digit corresponding to each numeric character block in each data image is obtained through a support vector machine classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610232680.9A CN105938547B (en) | 2016-04-14 | 2016-04-14 | Paper hydrologic yearbook digitalization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610232680.9A CN105938547B (en) | 2016-04-14 | 2016-04-14 | Paper hydrologic yearbook digitalization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105938547A (en) | 2016-09-14 |
CN105938547B (en) | 2019-02-12 |
Family
ID=57151427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610232680.9A (Expired - Fee Related), CN105938547B (en) | Paper hydrologic yearbook digitalization method | 2016-04-14 | 2016-04-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105938547B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805076A (en) * | 2018-06-07 | 2018-11-13 | 浙江大学 | Method and system for extracting table characters of environmental impact evaluation report |
CN109190611A (en) * | 2018-08-14 | 2019-01-11 | 江西师范大学 | Crowdsourcing-based Internet genealogy compilation system |
CN111060527A (en) * | 2019-12-30 | 2020-04-24 | 歌尔股份有限公司 | Character defect detection method and device |
CN113436117A (en) * | 2021-08-03 | 2021-09-24 | 东莞理工学院 | Hydrology long sequence data extraction method based on image recognition |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3582734B2 (en) * | 1993-07-14 | 2004-10-27 | 富士通株式会社 | Table vectorizer |
CN103996057A (en) * | 2014-06-12 | 2014-08-20 | 武汉科技大学 | Real-time handwritten digital recognition method based on multi-feature fusion |
CN105184265A (en) * | 2015-09-14 | 2015-12-23 | 哈尔滨工业大学 | Self-learning-based handwritten form numeric character string rapid recognition method |
CN105426834A (en) * | 2015-11-17 | 2016-03-23 | 中国传媒大学 | Projection feature and structure feature based form image detection method |
Non-Patent Citations (2)
Title |
---|
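刘昱: "Research on Printed Table Recognition", China Master's Theses Full-text Database, Information Science and Technology Series * |
张世平: "Intelligent Recognition of Hydrological Yearbook Data", Pearl River (人民珠江) * |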
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805076A (en) * | 2018-06-07 | 2018-11-13 | 浙江大学 | Method and system for extracting table characters of environmental impact evaluation report |
CN108805076B (en) * | 2018-06-07 | 2021-01-08 | 浙江大学 | Method and system for extracting table characters of environmental impact evaluation report |
CN109190611A (en) * | 2018-08-14 | 2019-01-11 | 江西师范大学 | Crowdsourcing-based Internet genealogy compilation system |
CN111060527A (en) * | 2019-12-30 | 2020-04-24 | 歌尔股份有限公司 | Character defect detection method and device |
CN111060527B (en) * | 2019-12-30 | 2021-10-29 | 歌尔股份有限公司 | Character defect detection method and device |
US12002198B2 (en) | 2019-12-30 | 2024-06-04 | Goertek Inc. | Character defect detection method and device |
CN113436117A (en) * | 2021-08-03 | 2021-09-24 | 东莞理工学院 | Hydrology long sequence data extraction method based on image recognition |
CN113436117B (en) * | 2021-08-03 | 2022-11-25 | 东莞理工学院 | Hydrological long sequence data extraction method based on image recognition |
Also Published As
Publication number | Publication date |
---|---|
CN105938547B (en) | 2019-02-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190212 |