CN1269060C - Method and system of digitizing ancient Chinese books and automatizing the content search - Google Patents

Method and system of digitizing ancient Chinese books and automatizing the content search Download PDF

Info

Publication number
CN1269060C
CN1269060C CN 00119542 CN00119542A CN1269060C CN 1269060 C CN1269060 C CN 1269060C CN 00119542 CN00119542 CN 00119542 CN 00119542 A CN00119542 A CN 00119542A CN 1269060 C CN1269060 C CN 1269060C
Authority
CN
China
Prior art keywords
page
retrieval
feature
images
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 00119542
Other languages
Chinese (zh)
Other versions
CN1336604A (en
Inventor
施伯乐
张亮
王勇
陈智峰
印峻
陈国梁
舒韵宏
焦宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd
Fudan University
Original Assignee
SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd, Fudan University filed Critical SHANGHAI JINXIN COMPUTER SYSTEM ENGINEERING Co Ltd
Priority to CN 00119542 priority Critical patent/CN1269060C/en
Publication of CN1336604A publication Critical patent/CN1336604A/en
Application granted granted Critical
Publication of CN1269060C publication Critical patent/CN1269060C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a computer ancient book content retrieving method based on visual similarity and a retrieval system. The computer ancient book content retrieving method is characterized in that object linear order number position characteristics, page characteristics, multi-stage gravity center division area stroke factor cumulative value morphological characteristics and corresponding extractive techniques which are determined by historic relic writing rules are designed, and the page characteristics are composed of page numbers and the geometry coordinates of objects in pages; an arbitrary retrieval point calibrating method and a constraint verification technique which raises matching accuracy are proposed; a dynamic regulation method is created, and retrieval persons are allowed to utilize search accuracy control parameters to balance recall ratios and precision ratios in a retrieval stage so as to immediately select object forms to object semantic mapping. In the present invention, image processing, high-dimensional characteristic space indexing and the techniques are optimized and combined; a universal computer and relevant external equipment are used to realize Chinese ancient book digitization with software/hardware unification and a content retrieval system; thus, the technical effects of computer ancient book content retrieval which is directly completed on page images and supports arbitrary retrieval points are achieved.

Description

Handle and reuse the method for Chinese ancient book and the computer software and hardware system that is adopted thereof
Technical field
The present invention relates to a kind of computer hardware and the method for software processes and repeated use Chinese ancient book and computer system that is adopted thereof of adopting.
Background technology
Ancient books has high academic research and value of art appreciation as the important component part of human culture legacy.Because it is rare, rare, the above-mentioned value of ancient books can't interiorly on a large scale be utilized by the public, even in strict restricted portion, the security of ancient books original paper and sustainable keeping quality still are difficult to ensure.Excavation and effective utilization to literature of ancient book have become one of main target of various countries digital library (Digital Library) engineering.Up to now, the mode of utilizing of the various ancient books digitizings of proposition and digital media can be summarized as follows:
Index adds the picture browsing mode.At first with the predetermined resolution scan ancient books page, the digital media (being called for short " page-images ") as the ancient books page behind the elimination noise is stored in the mass storage device (CD commonly used).Library or museum professional are to page-images index (as by portion/class/genus/order classification, title, author's epoch, author's name, works mode, year of publication, publication ground, publisher, format, the form and arrangement of lines in calligraphy or printing, crowd school person, preface and postscript person, Tibetan seal, front cover, title page, preface, front/rear interpolation page or leaf, the note on the use, catalogue, figure, appendix, postscript etc.), as the additional information of page-images and set up relative index, be kept in the memory storage for future reference.Retrieval person utilizes data input device (keyboard or mouse), retrieval point (commonly book number, portion/class/genus/order classification, title, author's epoch, author's name) the retrieval ancient books of the limited quantity that provides by system, browse the page-images of pandect or partial page then, also can be according to the front cover in the page-images of the ancient books of index information browse in advance, front/rear interpolation page or leaf, title page, preface, the note on the use, catalogue, figure, appendix, postscript etc.System provides also generally that the advance and retreat and the image of the may command page such as amplifies/dwindle at subsidiary function in the navigation process.The characteristics of this mode are:
● realized that the ancient books scarce resource is to conversion that can widely used digitizing page-images
● do not possess the ability of on page-images, separating and extract the part resource
Subsidiary text adds text full-text search mode.At first make corresponding with it text (as the manual keyboard typing), use global search technology then this subsidiary text is realized the word content retrieval, access page-images by corresponding relation more at last according to ancient books.This indirect mode is at the generation phase of its requisite subsidiary text file, and aspects such as the judgement of the homogeneity of body text and ancient books original copy content, character set scale, special symbol processing, automaticity exist the professional unacceptable restriction condition of ancient books works possessor (as library or museum); These problems cause information retrieval method and the system based on the text form that propose among the Chinese patent application prospectus CN-1151558A can't be applied to use for the content retrieval of the ancient books page of its essence with image.In addition, interchangeability of Chinese characters word being extensive use of in ancient books also makes global search technology that the ancient books content retrieval is lacked necessary ability.
Optical character identification adds text full-text search mode.This mode is used global search technology then this subsidiary text is realized word content retrieval with the text and the searching object of optical character identification (OCR) technology generation ancient books correspondence, accesses page-images by corresponding relation more at last.Yet owing to ancient books year of publication, version form difference, ancient books is huge with the word difference, can't set up to comprise all dictionaries of words at all times; More because in the Chinese ancient book between fuzzy, lack of standardization, the stroke of writing brush personal letter Chinese-character stroke/relative position instability between parts, stroke inclination angle/relative length instability, writing style difference, soft stroke distortion etc. are all multifactor, are difficult to finish the accurate identification that soft pen is write in one's own hand font.Proposed a kind of generation text strings (as " middle final accounts " and " between ox final accounts ") similar among the Chinese patent application prospectus CN-1165571A, the method for a text full-text search has been used in every kind of possible distortion respectively, to avoid the problems referred to above that wrong identification is brought to retrieval to the searching object shape.But this method is helpless for ancient books.Because the distortion number of text strings increases with index law with text strings length.For example, the average deformation number of establishing each word is k, and text strings length is n, and then possible deformed letters string adds up to kn.Therefore, this method lacks scalability (Scalability) on algorithm, be reflected in the application, is to lack practicality.OCR as another major defect of additional text file Core Generator is: the semanteme of ancient books literal/symbol object (hereinafter to be referred as " object ") " freezes " at the OCR cognitive phase, and promptly the image of object determinacy is mapped to a literal.Retrieval person changes the Semantic mapping of having been freezed by additional text documenting person without any ability in retrieving.Be in the Chinese ancient book works of principal character with the writing brush personal letter, the stained semanteme of object that all causes inevitably of the variation of hand-written script, page papery can't uniquely be determined, need retrieval person immediately to make a choice, for example determine the compromise of recall ratio and precision ratio according to target.This requirement can't be satisfied by the ancient books content search method based on OCR.
In a word, for writing in one's own hand Chinese character with writing brush is the Chinese ancient book works of its principal character, solve simultaneously this scarce resource is converted to and low-cost unlimited to reuse medium, and directly on this medium, separate and to extract the part resource problem that satisfies objective condition very difficult.Still do not have at present effectively, directly technology and system, promptly its technique effect can not be by every known technology alone or utilize prior art measure and technological process combination to realize.
Summary of the invention
The objective of the invention is to make up known technology, form a whole set of brand-new technology measure and technological process, be converted to and low-costly infinitely reuse medium to solve, and directly on this medium, separate and extract the technical matters of the part resource that satisfies objective condition with this scarce resource of Chinese ancient book of writing brush personal letter Chinese character for its principal character.
One aspect of the present invention provides a kind of method that adopts computer hardware and software processes and repeated use Chinese ancient book, this method will be write in one's own hand Chinese character with writing brush and will be converted to unlimited repeated use medium for the Chinese ancient book of its principal character, and directly separate on this medium and extract the part resource that satisfies objective condition; Be characterized in that it is by the disposable feature space organized processing flow process of finishing and can repeated ancient mat content retrieval treatment scheme in succession forming; Described feature space organized processing flow process may further comprise the steps: produce page figure and deposit it in the page-images storehouse by scanning and pretreatment module, pass to follow-up extraction characteristic module the object in the page-images is decomposed into the ordered set of independent image by skeleton simultaneously; By extracting characteristic module the ordered set of described object is separated into page feature, object global position feature and morphological feature vector also are kept at these features in the mark sheet; Organize described global position feature and morphological feature vector and be stored in the data structure feature space index by the index characteristic module; By the feature space index module morphological feature vector is carried out visual similarity cluster and an eliminating and the dissimilar letter symbol image of retrieval point that computing machine is finished automatically; So that its global position feature is fed back; The described content retrieval stage may further comprise the steps: the page coordinates by demarcating sample retrieval module settings page-images and the order of coordinate sequence to be forming sample retrieval, and the order of coordinate sequence is passed to checking constraint condition module as constraint condition; By obtaining characteristic module, to obtain and the corresponding morphological feature vector of object with the concrete object of page coordinates sequence as condition definite page-images from mark sheet; By the approximate query module is that reference point is sought the arest neighbors element to constitute the analogical object set of reference point with the morphological feature vector; And the global position feature set is formed in the set of the analogical object of correspondence bunch passed to checking constraint condition module; By of the effective combination of checking constraint condition module, to form result for retrieval according to described constraint condition inspection set bunch element; And by showing/browse that the result for retrieval module is apparent in result for retrieval on retrieval person's the client screen.
The present invention provides a kind of computer software and hardware system that realizes the described method of one aspect of the present invention on the other hand, this system is made of jointly the configuration of known computer hardware device and the component models of implementing the computer application software system of the described method flow of one aspect of the present invention, it is characterized in that: the configuration of this known computer hardware device comprises by CPU (central processing unit), random access memory, hard disk, keyboard, display, network access device, the server that scanner and indicating equipment are formed and by CPU (central processing unit), primary memory, hard disk or ROM (read-only memory), keyboard, display, the client computer that network access device and indicating equipment are formed; The parts of this realization computer application software system comprise with lower module and with the configuration of computer hardware equipment and combining: scanning and pretreatment module, utilize the computing power that CPU (central processing unit) provides, the storage capacity of random access memory, the display capabilities of display, the picture acquisition power of scanner to produce page-images, and deposit the page-images storehouse in, page-images stock is put on the storage medium of hard disk, generates the skeleton object simultaneously; Extract characteristic module, utilize the computing power that CPU (central processing unit) provides, the storage capacity of random access memory, the display capabilities of display, the station-keeping ability of indicating equipment, be connected with pretreatment module by skeleton object and described scanning and accept and decompose object in the described page-images, object ordered set is separated into the global position feature and the morphological feature sequence of page feature, object, and be stored in the mark sheet, this process is by means of input related command and parameter, and mark sheet is deposited on the storage medium of hard disk; The index characteristic module utilizes the computing power that CPU (central processing unit) provides, the storage capacity of random access memory, and tissue is by extracting described global position feature and the morphological feature vector that characteristic module extracts; Proper vector is deposited on the storage medium of hard disk; Data structure feature space index module, the global position feature of preservation process tissue and morphological feature vector are on the storage medium of hard disk; The hard disk of above-mentioned server contains permanent storage computer operating system, digitizing ancient books storehouse; Demarcate the sample retrieval module, utilize the computing power that CPU (central processing unit) provides, the storage capacity of primary memory, the display capabilities of display, the station-keeping ability of indicating equipment, the order of determining the page coordinate of object and coordinate sequence is to form retrieval person's sample retrieval; Obtain characteristic module, computing power, the main memory store ability of utilizing CPU (central processing unit) to provide are determined concrete object in the page-images with the page coordinates sequence as condition, to obtain and the corresponding proper vector of object from mark sheet; The approximate module of inquiry, utilize the computing power that CPU (central processing unit) provides, the storage capacity of random access memory, with the reference point is clue, its analogical object of coupling set in the feature space index on the storage medium of hard disk, and the analogical object set of all the object correspondences in the sample retrieval is combined into global position feature set bunch; Checking constraint condition module is utilized the computing power that CPU (central processing unit) provides, the storage capacity of random access memory, and the order of accepting coordinate sequence is as constraint condition and global position feature set bunch, with effective combination of inspection set bunch element and form result for retrieval; With show/browse the result for retrieval module, utilize storage capacity, the display of computing power that CPU (central processing unit) provides, random access memory display capabilities, result for retrieval is apparent on retrieval person's the client computer indicator screen; Network access device is used for communicating by letter between server and the client computer.Method and system of the present invention has obtained to write in one's own hand Chinese character with writing brush, and can be low-cost unlimited for this scarce resource of Chinese ancient book of its principal character is converted to reuse medium be the digitizing page-images of ancient books works, and directly separate on this medium and extract the technique effect that the part resource that satisfies objective condition promptly satisfies the partial page image of the condition that preestablishes.
Description of drawings
Below in conjunction with the description of drawings embodiments of the invention.
Fig. 1 is system architecture and base conditioning process flow diagram;
Fig. 2 is the block scheme of system hardware structure;
Fig. 3 is the search method overview flow chart;
Fig. 4 is the feature space organization flow chart;
Fig. 5 is the content retrieval process flow diagram;
Fig. 6 uses the symbolic significance explanation in the process flow diagram;
Fig. 7 is a bitonal bitmap longitudinal projection synoptic diagram;
Fig. 8 is level and smooth with auxiliary grid;
Fig. 9 a and 9b are the page-images of filling apportion mark;
Figure 10 a and 10b are the results of dividing object from row;
Figure 11 is the object example that branch marks;
Figure 12 is the refinement bitmap of Figure 11;
Figure 13 is the bitmap of Figure 12 after normalization;
Figure 14 is the firsts and seconds area dividing example of Figure 13 based on center of gravity;
Figure 15 be horizontal, vertical, cast aside, press down the definition of stroke factor;
Figure 16 is level-1 area and level-2 area coding rule;
Figure 17 is left-falling stroke, distribution and horizontal, vertical stroke factor the distribution plan in secondary zoning of right-falling stroke stroke factor in the one-level zoning of Figure 14.
Embodiment
Low-cost unlimited repeated use medium is the digitizing page-images (hereinafter to be referred as page-images) of ancient books works among the present invention, directly on medium, separate and extract the part resource that satisfies objective condition showing as content retrieval, extract the result for satisfying the partial page image of the condition that preestablishes.
The base conditioning flow process of system of the present invention search method is described referring now to Fig. 1.Should note: two processing units among Fig. 1 demarcate sample retrievals 121 and show/browse result for retrieval 125 as program file separately or global storage in the hard disk 204b of Fig. 2; The processing unit that all the other each block schemes are represented as data file or program file separately or global storage in the hard disk 204a of Fig. 2.
Search method among the present invention and technology are made of in succession the processing stage 120 two of feature space tissue 100 and ancient books content retrievals, and the digitizing ancient books storehouse 110 that the former produces provides the basis for the latter.Feature space is organized disposable finishing of stages 100, and the ancient books content retrieval stage 120 can repeatedly repeat according to retrieval person's requirement.
Ancient books is through overscanning and pretreatment module 101, produce on the one hand page-images and deposit page-images storehouse 111 in and browse in order to the user, the object in the page-images is passed to the ordered set that follow-up extraction characteristic module 102 is broken down into independent image by skeleton on the other hand.The page-images that deposits in the storehouse 111 can be original scanning result (as coloured image or a gray level image), keeps original visual image of ancient books and style; Also can be through the picture rich in detail after the pre-service processing, obtain readable preferably.Object ordered set is extracted characteristic module 102 again and is separated into and is converted to three category features: the global position feature of page feature, object and morphological feature sequence.These features are kept in the mark sheet 112.Global position feature that module 102 is extracted and morphological feature vector are by higher dimensional space index characteristic module 103 favorable tissue and being stored in the data structure feature space index module 113 in addition.Except the visual similarity cluster to the mathematical expression of proper vector object, another function of feature space index structure 113 is exactly in time to get rid of and the dissimilar literal/glyph image of retrieval point, the object that the vision of acceleration search query point is similar.This is the basis that the ancient books content retrieval is realized high speed.
The content retrieval stage 120 is adopted the working method of inquiry by example.Demarcation sample retrieval module 121 supports that retrieval persons at any time, at random demarcate object on the page-images of being browsed, the page coordinates when record client indicating equipment 209b clicks page-images and the order of this coordinate sequence, formation retrieval person's sample retrieval.The order of coordinate sequence is passed to checking constraint condition module 124 as constraint condition.The page coordinates sequence itself is acquired characteristic module 122 and is used as condition concrete object in definite page-images from mark sheet 112, obtains and the corresponding proper vector of object.The approximate object module 123 of inquiry is a reference point with the proper vector that obtains, and the searching arest neighbors element in feature space index 113 constitutes the analogical object set of reference point.This module 123 will be combined into the global position feature set with the set of all objects are corresponding in the sample retrieval analogical object simultaneously and bunch give checking constraint condition module 124.By the effective combination of module 124, form result for retrieval according to the constraint condition inspection set bunch element that obtains.These results are by showing/browse that result for retrieval module 125 is apparent on retrieval person's the client screen 206b in eye-catching mode.Browse and observe its context for the user.
Referring now to Fig. 2, among the figure illustration in order to implement system hardware structure of the present invention.They are server 200a and the client computer 200b that are connected in network 210.Server 200a is used for the transmission of storage, maintenance, management, retrieval and the result for retrieval of data and page-images.Its hardware system is the universal computer architecture that is linked together by bus 201a, comprises the CPU (central processing unit) 202a with computing and control input/output function, the random access memory 203a of save routine and computing intermediate data, the permanent storage computer operating system, retrieve application software, page-images, the hard disk 204a of contents such as feature space index file, in order to key in order and the keyboard 205a of parameter and the display 206a of display command feedback result, network access device 207a, digitized scanner 208 of the ancient books page and function selecting and auxiliary positioning equipment are indicating equipment 209a; Client computer 200b be responsible for man-machine interface operation, send the demand of inquiring and browsing and display navigation Query Result.Its hardware system is the universal computer architecture that is linked together by bus 201b, comprising the CPU (central processing unit) 202b with computing and control input/output function, the primary memory 203b of save routine and computing intermediate data, the permanent storage computer operating system, the hard disk 204b (or ROM (read-only memory) 204b) of contents such as retrieve application software, in order to key in the keyboard 205b of order and parameter, the display 206b of display page image and order feedback result, network access device 207b, help designated display 206b to go up the indicating equipment of screen position (as Genius mouse, writing pencil) 209b; Server and client computer connect via network 210 by network access device 207a, 207b, exchange information.
As the another kind of special case of above-mentioned embodiment, network 210 can be wide area network (WAN is as Internet).In the system architecture that is known as the browser/server pattern, HTTP (HTML (Hypertext Markup Language)) agreement is followed in the communication between client computer 200a and the server 200b.Client computer 200b specifies certain Web page or leaf by uniform resource locator (URL) address of given server 200a, help retrieval person to prepare retrieval/browse request then, transmission is asked to server 200a, and accepts page-images and relevant information (as the JAVA applet) that server 200a transmits; Server 200a deposits the hypermedia file with HTML (HTML (Hypertext Markup Language)) language compilation, it has a HTTP finger daemon, the request of its subscribing client 200b proposition is also made response, when this process receives a request, just create a new subprocess and be this request service, finish validity checking, handle and make data at the request of client computer, comprise and use CGI (CGI (Common Gateway Interface)) program that data are carried out early stage and post-processed, then, page-images of handling well etc. is sent to the client computer 200b that files a request.
As another special case of above-mentioned embodiment, network 210 can be a Local Area Network.
As another special case again of above-mentioned embodiment, server 200a and client computer 200b can be same machines, do not have network 210, network access device 207a, 207b this moment, adopt the loopback adapter; Bus is that 201a, CPU (central processing unit) are that 202a, random access memory are that 203a, hard disk are that 204a, keyboard are that 205a, display are that 206b, scanner are 208, indicating equipment is 209a.
Client computer in another embodiment can adopt mobile computing device (as notebook, PDA etc.).
The operating system of server can be that the various realization versions of Windows95/Windows98 (Microsoft trade mark), MacOS (Apple trade mark), Unix are as (AIX of IBM or free software Linux), do not require multiwindow and figure man-machine interface, but should support the HTTP access protocal; Client computer can adopt above-mentioned any operating system, but requires multiwindow and figure man-machine interface simultaneously, and supports the HTTP access protocal; When the embodiment that adopts client/server on a computing machine, operating system is got the configuration of client-side; When client computer was handheld devices such as PDA, the operating system of this handheld device or its equivalent should be supported the HTTP access protocal.
Further specify the flow process characteristics of search method of the present invention and the technology that is adopted below.
The computing machine ancient books content search method of visual similarity of the present invention is formed by a series of technical unit organic assembling.Each technical unit can adopt the technique known scheme to realize, also can realize with the technical scheme that the present invention proposes, to exchange higher execution efficient for.Make up these technical units and form the new technical scheme of a cover, solution with writing brush personal letter Chinese character for this scarce resource of Chinese ancient book of its principal character is converted to can the low-cost unlimited medium of reusing, and directly on this medium, separate and the technical matters of extracting the part resource that satisfies objective condition is main contents of the present invention.Fig. 3 is the overview flow chart of search method, and Fig. 4, Fig. 5 are the detail flowcharts of Fig. 3.Fig. 6 uses the symbolic significance explanation in the process flow diagram.
As previously mentioned, search method by 120 two of feature space tissue 100 and ancient books content retrievals in succession treatment scheme constitute.Feature space organization flow 100 is finished by ancient books information services provision merchant is disposable in advance.It generates the result, and promptly the digitizing ancient books storehouse 110 among Fig. 1 is kept in the hard disk or CD 204a of server end among Fig. 2.Ancient books content retrieval flow process 120 can repeatedly repeat according to retrieval person's requirement, and it utilizes the digitizing ancient books storehouse of storing among hard disk or the CD 204a.Two flow processs 100 and 120 needn't be continuous in time, only requires to guarantee that the order that provides as Fig. 3 gets final product.
Now further specify feature space and organize the stage in conjunction with Fig. 4.The purpose of feature space tissue is that the content (object and sequence relation thereof) in the ancient books generates its feature clustering as previously mentioned, sets up the index structure that is easy to search fast according to visual similarity approximate object.Feature space organizes the basic step in stage as follows:
1. scan ancient books page 101a
Scan ancient books by visible light or other light sources page by page according to ancient books page number numbering, obtain its digitizing colour or gray scale image.To intact ancient books, can adopt ordinary flat formula scanner, for the ancient books that is damaged by fire damage or other reasons, available far infrared or other light sources irradiation manifest the literal of being covered.
2. pre-service 101b
For outstanding ancient books content, overcome scanning errors, separate foreground object and ground unrest, the acquisition object, before formal structural attitude spatial index 113, carry out the pre-service work such as graduation, object refinement of space of a whole page slant correction, noise removing, binaryzation and row/object.The function that the preprocessing means of Chinese optical character identification (OCR) technology of available standards or pool image are handled needs a spot of manual intervention to realize in case of necessity.Below provide some embodiment.
(1) color and gray scale are handled
The digitizing ancient books page-images that is obtained by scanning step 101a can be colour or gray scale.The purpose of doing like this is in order to keep the original appearance of ancient books to greatest extent, to be convenient to the user and to view and admire.Be the processing needs of subsequent step, the page-images of confession extraction feature should be converted to black and white, promptly so-called bianry image or bitmap.The page-images of viewing and admiring for the user still can keep original color or gray scale.
Coloured image generally is expressed as RGB (RGB) or other color spaces, as the point set of YIQ (brightness, colourity, saturation degree).From the angle of compression of images, adopt the situation of scheme of non-rgb color space more general.Because these schemes concentrate on the principal character of image on some coordinate axis in the space, the gray level image on this is handled, can embody image aspects substantially.In Chinese ancient book content retrieval field, adopt such scheme to change coloured image into form that gray level image still can keep literal/symbol object.
A kind of specific embodiments is that coloured image is decomposed into Y, I, three components of Q, again Y component is wherein given over to further processing as gray level image.The Y component has comprised the main information of original image.Transformational relation between YIQ and RGB is:
Y I Q = 0.299 0.587 0.114 0.596 - 0.275 - 0.322 0.211 - 0.523 - 0.312 R G B , R G B = 1.0 1.176 0.763 1.0 - 0.411 - 0.677 1.0 - 0.964 1.487 Y I Q
Gray level image becomes bitonal bitmap through binaryzation.The key of binaryzation is to determine appropriate threshold.A kind of system of selection is to determine global threshold according to grey level histogram.If number of grayscale levels is G, the pixel of image adds up to n, and (number of picture elements of 1≤k≤G) is n to k level gray scale k, statistical picture is in the gray level (frequency of occurrences of k1≤k≤G) locate
p ( k ) = n k n , k=1,2,...,G
And be ordinate with p (k), k is the horizontal ordinate mapping, obtains the grey level histogram of image.The grey level histogram of Chinese ancient book generally is bimodal, and two spikes have been represented prospect and background pixels respectively.Gray threshold can be taken at the trough place between bimodal, for example value 1≤g≤G.According to gray threshold g with gray level image IMG gChange bitonal bitmap IMG into b:
IM G b ( i , j ) = 1 , IM G g ( i , j ) &GreaterEqual; g i = 1,2 , &CenterDot; &CenterDot; &CenterDot; , R 0 , IM G g ( i , j ) < g ' j = 1,2 , &CenterDot; &CenterDot; &CenterDot; , C
Wherein, R, C are respectively the number of lines and columns of image pixel matrix.
For the grey level histogram of multimodal, can adopt the local threshold binarization method.
(2) space of a whole page is proofreaied and correct
Deflection can take place because the ancient books original copy puts the inaccurate of angle in the page-images that scanning obtains, and influences subsequent treatment.In most cases, the angle of deflection is not too large.If departing from the scope of normal position (as vertical) is [A ,+A].With a is increment, from-A rotation bitonal bitmap, calculate projected density by following method, until+A.The bitonal bitmap that records maximal projection density is as correction chart.
With reference to Fig. 7, at first,, obtain the horizontal distribution (the latter half of Fig. 7) of display foreground pixel with a certain postrotational bitmap (the first half of Fig. 7) projection longitudinally.Make that projection width is W, then the average line height
h = &Sigma; i &Sigma; j IM G b ( i , j ) W .
On the average line of horizontal distribution, calculate projected density
&rho; = &Sigma; k n k W k ,
In the following formula, n kBe to be higher than counting of h, W in k the continuous segment on the average line kBe these projection widths on average line.Select the projection of projection on the average line rather than all horizontal distribution to help to reduce the page-images influence on horizontal line and border up and down.
(3) eliminate noise
Use smoothing technique to eliminate residual isolated point in the bitonal bitmap, level and smooth stroke edge.Smoothing process is the application of low-pass filtering in the image processing techniques.
A kind of simple embodiment is 3 * 3 grids decision pixel x that adopts as shown in Figure 8 0Value.If represent that with x pixel x value is 1 (foreground), represent that with~x pixel x value is 0 (background colour), then pixel x 0Result after level and smooth is:
x 0′=~x 0[x 3x 7(x 1+x 5)+x 1x 5(x 3+x 7)]+
x 0~[(x 3+x 7)
(~(x 4+x 5+x 6)+~(x 1+x 2+x 8)+~(x 1+x 5)+(~(x 6+x 7+x 8)+~(x 2+x 3+x 4))]
(4) Object Segmentation
The row of Chinese character OCR, character segmentation technology can be directly used in Object Segmentation.It below is another comparatively simple Object Segmentation method.It is divided into apportion, participle and three subsequent steps of adjustment.As previously mentioned, bitonal bitmap IMG bWidth be C, highly be R, (i, the pixel of j) locating is designated as IMG at coordinate b(i, j), IMG b(i, j)=1 this point of expression is foreground.
A. apportion
Make j list sum of all pixels being
C jResult after the horizontal distribution figure that constitutes is smooth is designated as
S j = 1 d &Sigma; d = 0 &mu; - 1 C j + d , (j=0,...,C-μ).
Wherein, μ is smooth step-length.S jMaximal value, minimum value and both difference do not remember with:
M=max{S j},m=min{S j},D=max-min
Order again: Th=m+ α D, wherein the α threshold parameter generally gets 0.1 or 0.2.Obtain S jThe j value j of=Th 0, j 1..., j 2n-1' these values organize in twos in regular turn right, that is: p k=(j 2k+ j 2k+1)/2,0<k<n can obtain the column split line sequence p of the page kShown in the dotted line among Fig. 9 a.For extracting easy-to-handle object row, before participle, also should get rid of vertical line.Concrete grammar is: calculate average col width δ=(p N-1-P 0)/(n-1) is if two adjacent column split line (p kAnd p K+1) spacing is less than 0.1 δ, thinks that then between this two adjacent column split line be the column split vertical line of ancient books, fill out between with them into background colour and with (p k+ p K+1)/2 substitute these two column split lines.Fig. 9 a obtains white stick separated among Fig. 9 b through after getting rid of vertical line.
B. participle
The object that obtains row are considered as the parent page image, the transposing steps A. in the row, column mark.Can obtain the basic division of each object.Concrete outcome is seen Figure 10 a.
C. adjust
Automatically there is a spot of erroneous judgement result sometimes in cut zone, and cutting techniques should provide image feedback, for treatment people manual setting cut zone.This is the indicating equipment 209a selection deletion/increase function with server end among Fig. 1, clicks corresponding object or position then.For example, obtain correct Object Segmentation behind the useless cut-off rule that deletion Figure 10 a top is caused by former ancient books outer rim, shown in Figure 10 b.One to cut apart the object diagram of finishing for example shown in Figure 11.
(5) refinement
The bitonal bitmap of object is converted into the skeleton image that live width is single pixel, to reduce because of of the influence of stroke width difference to feature extraction.Thinning algorithm is as follows:
i.I”=IMG b
ii.Do
a.I=I”;
B. all pixels in the scans I form new bitmap I '.To pixel x among the I 0, investigate its neighborhood as shown in Figure 8, if C 1Set up, then the relevant position puts 1 among the I ';
C. scans I ' in all pixels, form new bitmap I ".To pixel x among the I ' n, investigate its neighborhood as shown in Figure 8, if G 2Set up, then I " in the relevant position put 1;
Until I=I”;
Iii. return I ".
C 1=x 0~x 1~x 2~x 3x 4x 5x 6~x 7~x 8+x 0~x 1~x 2x 3~x 4x 5~x 6~x 7~x 8+x 0~x 1~x 2x 3x 4x 5~x 6~x 7~x 8+
x 0~x 1~x 2x 3~x 4x 5x 6~x 7~x 8+x 0~x 1~x 2x 3x 4x 5x 6~x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5~x 6x 7~x 8+
x 0~x 1~x 2~x 3x 4x 5~x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5x 6x 7~x 8+x 0~x 1~x 2~x 3x 4x 5x 6x 7~x 8+
x 0~x 1~x 2x 3~x 4x 5~x 6x 7~x 8+x 0~x 1~x 2x 3x 4x 5~x 6x 7~x 8+x 0~x 1~x 2x 3~x 4x 5x 6x 7~x 8+
x 0~x 1~x 2x 3x 4x 5x 6x 7~x 8+x 0~x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0~x 1x 2x 3~x 4x 5~x 6~x 7~x 8+
x 0~x 1x 2x 3x 4x 5~x 6~x 7~x 8+x 0~x 1x 2x 3~x 4x 5x 6~xl~x 8+x 0~x 1x 2x 3x 4x 5x 6~x 7~x 8+
x 0~x 1xlx 3~x 4x 5~x 6x 7~x 8+x 0~x 1x 2x 3x 4x 5~x 6x 7~x 8+x 0~x 1x 2x 3~x 4x 5x 6x 7~x 8+
x 0~x 1x 2x 3x 4x 5x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4x 5~x 6x 7x 8+x 0~x 1~x 2~x 3x 4x 5~x 6x 7x 8+
x 0~x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0~x 1~x 2~x 3~x 4x 5x 6x 7x 8+x 0~x 1~x 2~x 3x 4x 5x 6x 7x 8+
x 0~x 1~x 2x 3~x 4x 5~x 6x 7x 8+x 0~x 1~x 2x 3x 4x 5~x 6X 7x 8+x 0~x 1~x 2x 3~x 4x 5x 6x 7x 8+
x 0~x 1~x 2x 3x 4x 5x 6x 7x 8+x 0x 1~x 2~x 3~x 4~x 5~x 6x 7x 8+x 0x 1~x 2~x 3~x 4x 5~x 6x 7x 8+
x 0x 1~x 2~x 3x 4x 5~x 6x 7x 8+x 0x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2~x 3~x 4x 5x 6x 7x 8+
x 0x 1~x 2~x 3x 4x 5x 6x 7x 8
Bitmap when algorithm finishes is the skeleton image after the refinement.Condition in the algorithm
C 2=x 0~x 1~x 2x 3x 4x 5~x 6~x 7~x 8+x 0~x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0~x 1x 2x 3x 4x 5~x 6~x 7~x 8+
x 0x 1~x 2x 3~x 4~x 5~x 6~x 7~x 8~x 0x 1~x 2x 3x 4~x 5~x 6~x 7~x 8+x 0x 1~x 2x 3x 4x 5~x 6~x 7~x 8+
x 0x 1~x 2~x 3~x 4~x 5~x 6x 7~x 8+x 0x 1~x 2~x 3~x 4~x 5x 6x 7~x 8+x 0x 1~x 2x 3~x 4~x 5~x 6x 7~x 8+
x 0x 1~x 2x 3x 4~x 5~x 6x 7~x 8+x 0x 1~x 2x 3~x 4~x 5x 6x 7~x 8+x 0x 1x 2x 3~x 4~x 5~x 6~x 7~x 8+
x 0x 1x 2x 3x 4~x 5~x 6~x 7~x 8+x 0x 1x 2x 3x 4x 5~x 6~x 7~x 8+x 0x 1x 2~x 3~x 4~x 5~x 6x 7~x 8+
x 0x 1x 2~x 3~x 4~x 5x 6x 7~x 8+x 0x 1x 2x 3~x 4~x 5~x 6x 7~x 8+x 0x 1x 2x 3x 4~x 5~x 6x 7~x 8+
x 0x 1x 2x 3~x 4~x 5x 6x 7~x 8+x 0~x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2x 3~x 4~x 5~x 6~x 7x 8+
x 0x 1~x 2x 3x 4~x 5~x 6~x 7x 8+x 0x 1~x 2x 3x 4x 5~x 6~x 7x 8+x 0x 1~x 2~x 3~x 4~x 5~x 6x 7x 8+
x 0x 1~x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1~x 2x 3~x 4~x 5~x 6x 7x 8+x 0x 1~x 2x 3x 4~x 5~x 6x 7x 8+
x 0x 1~x 2x 3~x 4~x 5x 6x 7x 8+x 0x 1x 2~x 3~x 4~x 5~x 6~x 7x 8+x 0x 1x 2x 3~x 4~x 5~x 6~x 7x 8+
x 0x 1x 2x 3x 4~x 5~x 6~x 7x 8+x 0x 1x 2x 3x 4x 5~x 6~x 7x 8+x 0x 1x 2~x 3~x 4~x 5~x 6x 7x 8+
x 0x 1x 2~x 3~x 4~x 5x 6x 7x 8+x 0x 1x 2x 3~x 4~x 5~x 6x 7x 8+x 0x 1x 2x 3x 4~x 5~x 6x 7x 8+
x 0x 1x 2x 3~x 4~x 5x 6x 7x 8
(6) normalization
For eliminating the influence of handwritten form object size and change in location, the skeleton image of each object of standardizing.For example, Figure 13 is the normalization bitmap of the skeleton image of Figure 12, and housing is represented the border of new bitmap.
Standardized method is to select the maximal value of the height of skeleton image and width as monolateral length, makes a square bitmap.Then skeleton image is placed this square bitmap center.Deserving to be called and stating square is MBS (Minimal BoundingSquare).Compare with the conventional standardized method that uses boundary rectangle MBB (Minimal Bounding Box), the standardized method has here kept the ratio of width to height of object.Be difficult for causing elongated objects deviation when feature extraction, to occur.
3. feature extraction 102
This method is at single ancient books definition and extract three class essential characteristics, that is: the global position feature and the morphological feature of page feature, object.If the multireel ancient books that same people transcribes is combined processing, only need to add the books sign.Above-mentioned feature description the ancient books content.
In module 102, each object is separated from page-images, and each object has all possessed geometric coordinate and range of size in the clear and definite page.Following mask body defines described three class essential characteristic and extracting method thereof.
The global position feature (GLF) that defines 1 object is the linear order numbering of this object in the page of an ancient books.
As long as can guarantee object is that 1-1 is corresponding with its global position feature, the linear order in the definition can be taked arbitrary form.For example, global position Feature Extraction method can according to ancient books transcribe custom (page number from small to large, in the page or leaf from right to left, each row from top to bottom), obtain the global position feature of each object of obtaining by the scanner uni pretreatment module.For complicated space of a whole page layout, global position Feature Extraction method can be utilized recurrence curve such as Hilbert or Piano curved scanning layout area earlier, and each intra-zone is handled in the usual way again then.
The page feature (PF) that defines 2 ancient books is made of the geometric coordinate of each object in the page number and the page.
Page feature description by the geometric layout of object in page relation.
The morphological feature of object has been portrayed the vision semanteme of object.And then, remove outside the polyphone, a Chinese character write unique linguistics semanteme that has determined this word.In other words, by comparison, can realize the approximate match of literal, semiotics semanteme to Chinese character-type.Research of Chinese Feature Extraction technology among any Chinese OCR all can be used as object extracting of morphological method.
Yet,, exist a lot of variable factors to influence the extraction of Hanzi component and formation stroke thereof being in the Chinese ancient book of feature with writing brush personal letter Chinese character.For example, stroke weight is inhomogeneous, the part stroke is fuzzy or owe, the relative position skew during repeatedly the occurring of same literal between stroke/parts, stroke inclination angle/relative length variation etc., all can influence the coupling of object on the vision meaning.Need the stronger Feature Extraction Technology of exploitation fault-tolerant ability.Notice " the fixing standardized of block character parts position and ratio is the crystallization of Chinese character calligraphy art for a long time " this fact, below provide a kind of morphological feature of in multistage barycenter graduation zone, adding up stroke factor aggregate-value and describe and extractive technique.It has stronger fault-tolerant ability to the above-mentioned changing factor that exists in the Chinese ancient book.
The morphological feature (MF) that defines 3 objects is the aggregate-value of its image stroke factor component in multistage barycenter graduation zone.
The extracting of morphological method is as follows:
At first, according to the center of gravity of object its MBS is done the multilayer graduation.Each regional graduation point is decided to be the center of gravity of object foreground point in this zone (the stain collection in the accompanying drawing).Further graduation recurrence on the basis of shallow one deck is carried out.The concrete mode of one, two layer of graduation of Figure 13 as shown in figure 14.
Then, add up stroke factor in each zone, classification accumulative total back forms proper vector.So-called stroke factor is meant to constitute fundamental element horizontal, vertical, that cast aside, press down four kinds of strokes that its dot matrix is arranged as shown in figure 15.With respect to complete stroke, phenomenons such as the feature formation based on the stroke factor is inhomogeneous to soft stroke handwriting Chinese character stroke, stroke fuzzy, inclination angle/relative length lacks rule all have stronger fault-tolerant ability, also are convenient to the unified of non-legible symbol object in the ancient books handled.It is easy to extract the stroke factor scheme from the bitmap of object, has multiple embodiments.For example, be structural element (Structure Elements) with four kinds of stroke factors respectively, the applied mathematics morphological method is done corrosion (Erosion) computing to the foreground point (stain among the figure in the square frame) of Figure 13, obtains four kinds of stroke factor distributions in square frame.The extracting method of stroke factor is acted among Figure 14, and the stroke that can obtain in the graduation zone distributes, and the pixel count with all foreground points in the zone removes it again, obtains the distribution density of stroke factor in each zone.Notice in the Chinese character that the occurrence frequency of stroke is much higher than to cast aside anyhow and press down stroke, simultaneously for reducing the dimension of feature space, improve index and effectiveness of retrieval,, can decompose one deck less in the promptly regional graduation casting aside the statistics shallow level of stroke more anyhow of pressing down the stroke factor.A kind of concrete mode is that horizontal, vertical stroke factor is all used two layer region graduation, casts aside, presses down the stroke factor and all use one deck graduation.Among Figure 17 illustration in the double-layer separate partition territory horizontal, vertical stroke distribute and one deck graduation zone in cast aside, press down the stroke distribution.Adopt the zone number rule of Figure 16, the morphological feature vector of all objects has been opened into 16 * 2+4 * 2=40 dimensional feature space in the ancient books.Vector f in the space is calculated by following formula:
In the following formula, p 1(k) and p 2(k) be respectively resembling among the preceding bitmap firsts and seconds zoning k of feature extraction
f ( i ) = &Sigma; 1 &le; k &le; i h ( k ) p 2 ( k ) ,
Figure C0011954200182
f ( 32 + j ) j = 1,2,3,4 = &Sigma; 1 &le; k &le; j p ( k ) p 1 ( k ) , f ( 36 + j ) = &Sigma; 1 &le; k &le; j n ( k ) p 1 ( k )
That vegetarian refreshments number, h (k), s (k), p (k), n (k) are respectively is horizontal, vertical, cast aside, press down the black pixel of stroke factor in bitmap region k counts.
Adopt the object morphological feature of multistage barycenter graduation zone stroke factor aggregate-value, embodied the vision content of handwritten Chinese character preferably, can express literal/symbol with stroke distribution density relatively flexibly.Certain tolerance of definition in feature space (or claiming distance) can form vector space.A kind of tolerance is known Euclidean distance.In the characteristic vector space that forms, the morphological feature vector of object has constituted the coordinate of feature space mid point.Therefore, the unique point of plesiomorphism object has formed cluster naturally, and bigger distance is arranged between the unique point of discrepant Chinese character.
So far, the feature of ancient books has been extracted and has been finished, and the morphological feature and the global position feature of ancient books page feature, object remain to mark sheet 112.Be that mark sheet is made up of as the four-tuple of (geometric coordinate, global position feature, morphological feature in page number, the page or leaf) a plurality of shapes, a plurality of numbers is the object number that scanning pretreatment module 101 is determined.
4. the feature space index 113
In the practical application, the feature space of generation generally has characteristics such as dimension height, unique point quantity are many.Need the design space index structure corresponding with application target, the unique point that rationalization is all exchanges information inquiry fast for less storage overhead.Say on the principle that all space index methods (as R-tree and improve one's methods, X-tree, SR-tree, PK-tree etc.) can both become the embodiment of feature space index structure.Yet the performance of partial index algorithm such as R-tree can sharply descend with the increase of space dimensionality.Provide the optimization embodiment of SR-tree herein.About inner realization and the performance evaluation thereof of SR-tree, see also relevant paper and package description.
A. data structure
Definition of data item E i=(MF i, GLF I)=(f i, GLF I).f iBe the coordinate of feature space mid point i, the morphological feature vector of object i just; GLF iIt is the global position feature of object i.
B. create the SR-tree
Call function new_HnSRTreeFilePath, Dimension, DataSize, BlockSize, SplitFactor, ReinsertFactor.Generate an empty SR-tree and return it, return data type HnSRTreeFile.
The meaning of the input parameter in calling and value such as following table:
Parameter name Type The parameter meaning Value
Path Dimension DataSize BlockSize SplitFactor ReinsertFactor Character string integer integer integer integer integer Preserve the Data Filename feature space Dimension Characteristics spot correlation attribute GLF byte number data block size of SR-tree, the minimum utilization factor of (byte) database, (percent) insert the factor again, (percent) Ancient books name .idx 40 2 8192 (system default value) 40 (system default values) 30 (system default value)
C. insert data item
SR-tree object File according to B. returns calls its method Store (...) with data item E i=(f i, GLF I) insertion SR-tree.Concrete steps are:
HnSRTreeFile File;
File.Store(Point,Data)。
The meaning of parameter wherein and value such as following table:
Parameter name Type The parameter meaning Value
Point HnPoint& The storage address of point coordinate in the feature space The morphological feature vector f of object
Data HnData& The storage address of feature space mid point attribute The GLF of this object
5. treatment scheme control
Ancient books is handled and is adopted recycle design to finish.In a width of cloth page image, each object is implemented 102 to 113 processing, the object in one page whether finish dealing with Fig. 4 105 in judge.If this page or leaf also has other objects, then repeat said process, handle otherwise change time page or leaf.Ancient books whether be converted into fully digitizing ancient books storehouse 110 Fig. 4 106 in judge.
The processing stage of now retrieving 120 in conjunction with Fig. 5 description.Content retrieval must have been finished feature space at the ancient books that is retrieved and carry out after organizing 100 steps.For a cover feature space index structure of being set up, retrieval person can carry out the content retrieval of arbitrary number of times referring to Fig. 3.The purpose of content retrieval is to utilize feature space to organize resulting index structure, obtains all other objects similar to given object vision content fast.The basic step of content retrieval is as follows:
(1) reads precision controlled variable 501
Retrieval person adjusts the retrieval precision controlled variable by man-machine interaction mode.This parameter is only represented notional " strictness " and " loose ", value determine need not any quiet and secluded knowledge.Parameter value generally is divided into multistage, and pairing distance thresholds at different levels can be implemented the people by setting arbitrarily to big monotone increasing add mode by zero by invention.A kind of embodiment is to set 11 grades, and the 0th grade of predetermined distance threshold value is zero, the strict coupling of expression; The 10th grade is the loosest precision controlled condition, and the predetermined distance threshold value is 1; Progressively increase distance threshold by 0.1 increment therebetween.Because content retrieval can repeatedly carry out, retrieval person can dynamically adjust the precision controlled variable with reference to result for retrieval last time, gives new balance to next time recall ratio and precision ratio, satisfies its needs.The approximate hunting zone of Object Query 123 in feature space index 113 of precision controlled variable influence.
(2) open open the beginning browsing pages 502
Retrieval person can page number accesses corresponding page-images or enters certain page in conjunction with general indexing method by importing arbitrarily.Directly the scheme of input page numbering is the simplest.Comparatively practical with the scheme that indexing method is used.This is not only harmonious with the existing retrieval mode of library and ancient books CD server, and formed 2-level search pattern is more convenient for handling the different literature of ancient book of a large amount of writing styles.The retrieval point guiding retrieval person that indexing method provides further offers help for retrieval person finds target in volume based on the content search method of visual similarity at digital library or CD server discovery candidate's ancient books folder.
(3) demarcate searching object 121
On the page displayed image, retrieval person utilizes indicating equipment 209b such as mouse or writing pencil to click object, sets or adjust object-order.Interior geometric coordinate of page number, the page that demarcation sample retrieval module 121 record indicating equipments provide and the natural number of representing this order according to order mark on page-images that retrieval person sets.Can cooperate and browse controlling mechanism, in multipage, demarcate searching object.When retrieval person starts when retrieval, module 121 forms sample retrieval according to the order of geometric coordinate sequence and coordinate sequence in the page number of above-mentioned object, the page.Page number and coordinate set are passed to and are obtained characteristic module 122, and the order of coordinate sequence is passed to checking constraint condition module 124 as constraint condition.Afterwards the member object of each sample retrieval is implemented 122 to 123 and handle, in 506 steps, carry out aftertreatment and judge loop ends.
(4) obtain the morphological feature vector 122 of sample retrieval
From mark sheet, obtain the morphological feature vector of this object according to geometric coordinate in the page label of the member object of sample retrieval and the page.Acquisition methods depends on the organizational form of the interior geometric coordinate of the page of mark sheet.Page-images is after Object Segmentation, and each object all has a rectangle (referring to 2 (4)) that comprises it.If geometric coordinate is provided by the middle point coordinate of this rectangle in the page of object, then should be under the identical situation of page number, in mark sheet, calculate and the immediate point in sample member position earlier, and then obtain the morphological feature vector of object from this list item according to Euclidean distance; If geometric coordinate angular coordinate is provided in the page of object by this rectangle, then should be under the identical situation of page number, whether the check rectangle comprises sample member position in mark sheet earlier, and then obtains the morphological feature vector of object from the list item that comprises sample.Before a kind of method each object is saved the storage space of a pair of coordinate, a kind of method in back can be avoided the multiplication and division computing relatively the time, execution speed is very fast.In ancient books the more or sample retrieval length of object number generally more in short-term, a kind of method is favourable before using.
(5) approximate Object Query 123
With respect to certain sample member object, in the feature space index, search the similar object set of its vision according to nearest neighbouring rule.Specific practice is that establishing by the 123 form vectors that obtain is v, is r by the 501 search precision controlled variable that read, and then uses the set that following A~B obtains the global position feature GLF of analogical object.
A. according to parameter r setting range border.
To each dimension of feature space, establishing its mobility scale is W, then at first sets the range of search width
w = &epsiv; r = 0 W &times; r / s 0 < r &le; s
Wherein, ε is a very little number, general value 0.0001, the situation of corresponding strict search.S is the maximum occurrences of r.If read described in the precision controlled variable step s=10 according to aforementioned.
Then, system adjusts the position of range of search automatically, obtains on this dimension of feature space an interval w who comprises retrieval point x and be positioned at W, makes the w mid point that x is positioned at as much as possible.The border of note w is respectively a iAnd b i
Utilize the method SetRange of HnRect in the SR-tree program bag to set range of search, promptly i is tieed up
rect.SetRange(a i,HnRange::INCLUSIVE,b i,HnRange::INCLUSIVE,i).
Wherein, HnRange::INCLUSIVE is the constant that defines in the software package.
B. scope is searched (Range Search).
According to the range of search of setting among the A., from feature space index 113, return the global position feature GLF of analogical object one by one, form this sample member's analogical object set.Specific algorithm is as follows:
I) call the GetFirst method of HnSRTreeFile object File, return the GLF of first approximate object;
Ii) incorporate this GLF into results set
Iii) call the GetNext method of HnSRTreeFile object File repeatedly, return the GLF of next approximate object.Incorporate this GLF into results set, Key.isValid () test is for false in return parameters.
(6) handle lookup result 123
To all member objects of sample retrieval, the GLF of their approximate object set is accumulated cluster, passes to checking constraint condition module 124.
(7) checking constraint condition 124
So-called constraint condition promptly is the relative order of object elements that retrieval person demarcates in 121.Concrete proof procedure is as follows:
A. make sample retrieval comprise M member object, remember with e successively relatively in proper order by it 1, e 2..., e M, from 506 obtain bunch M GLF souvenir with L 1, L 2..., L M
B. with L 1As L, be circulated to M with subscript i from 2 with increment 1, carry out C
C. to each the element e among the L, establishing its GLF is j, if L iIn not have GLF be the object of j+i-1, then e is left out from L
D. the result who keeps among the L during loop ends is exactly first element list of result for retrieval.
(8) mark result for retrieval 508 on page image
Take out the GLF of header element one by one from 127 result for retrieval, as index, search mark sheet 112, deterministic retrieval is the page number and the page internal coordinate of header element as a result.On page-images, paste additional marking such as red round dot, indicate a continuous N object that begins thus.When this page or leaf when side-play amount begins not enough M object, from inferior beginning of the page portion beginning label residue object.
(9) page-images shows/browses 125
Set up hop button such as first term mark, preceding paragraph mark, consequent mark, last item mark,, provide retrieval person to observe result for retrieval and its contextual function of observation in conjunction with common homepage, preceding page or leaf, back page or leaf and last page navigation button.

Claims (2)

1. method that adopts computer hardware and software processes and repeated use Chinese ancient book, this method will be write in one's own hand Chinese character with writing brush and will be converted to unlimited repeated use medium for the Chinese ancient book of its principal character, and directly separate on this medium and extract the part resource that satisfies objective condition; It is characterized in that this method is made up of disposable feature space tissue (100) treatment scheme of finishing and ancient mat content retrieval (120) treatment scheme that can be repeatedly used;
Described feature space organized processing (100) flow process produces low-cost, reusable content media---digitizing ancient books storehouse (110); This flow process is made of jointly following steps:
Produce page-images and deposit it in page-images storehouse (111) by scanning and pre-service (101), simultaneously the skeleton that generates is passed to follow-up extraction feature (102) the object in the page-images is decomposed into the ordered set of independent image;
By extracting feature (102) ordered set of described object is separated into page feature, object global position feature and morphological feature vector also are kept at these features in the mark sheet (112);
By described global position feature of index feature (103) tissue and morphological feature vector and be stored in the data structure feature space index (113); And
By data structure feature space index (113) morphological feature vector is carried out cluster and an eliminating and the incoherent letter symbol image of retrieval point;
The feature space organized processing produces reusable digitizing ancient books storehouse (110), comprises page-images storehouse (111), mark sheet (112) and feature space index (113);
Described content retrieval (120) flow process is directly separated on ancient books digitizing medium and is extracted the part resource, and it is made of jointly following steps:
By demarcating order that sample retrieval (121) determines the page coordinates of object and coordinate sequence forming sample retrieval, and the order of coordinate sequence is passed to checking constraint condition (124) as constraint condition;
By obtaining feature (122), to obtain and the corresponding morphological feature vector of object with the concrete object of page coordinates sequence as condition definite page-images from mark sheet (112);
By approximate query (123) is that reference point is mated the arest neighbors element to constitute the analogical object set of reference point with the morphological feature vector in feature space index (113); And the global position feature set is formed in the set of the analogical object of correspondence bunch passed to checking constraint condition (124);
By the effective combination of checking constraint condition (124), to form result for retrieval according to described constraint condition inspection set bunch element; And
By showing/browse that result for retrieval (125) is apparent in result for retrieval on retrieval person's the client screen (206b).
2. computer software and hardware system that realizes the method for claim 1, this system is made of jointly the configuration of known computer hardware device and the component models of implementing the computer application software system of the method for claim 1 flow process, it is characterized in that: the configuration of this known computer hardware device comprises by CPU (central processing unit) (202a), random access memory (203a), hard disk (204a), keyboard (205a), display (206a), network access device (207a), the server (200a) that scanner (208) and indicating equipment (209a) are formed and by CPU (central processing unit) (202b), primary memory (203b), hard disk or ROM (read-only memory) (204b), keyboard (205b), display (206b), the client computer (200b) that network access device (207b) and indicating equipment (209b) are formed; The parts of this realization computer application software system comprise with lower module and with the configuration of computer hardware equipment and combining:
Scanning and pretreatment module (101), utilize the computing power that CPU (central processing unit) (202a) provides, the storage capacity of random access memory (203a), the display capabilities of display (206a), the picture acquisition power of scanner (208) to produce page-images, and deposit page-images storehouse (111) in, page-images stock is put on the storage medium of hard disk (204a), generates the skeleton object simultaneously;
Extract characteristic module (102), the computing power of utilizing CPU (central processing unit) (202a) to provide, the storage capacity of random access memory (203a), the display capabilities of display (206a), the station-keeping ability of indicating equipment (209a), be connected with pretreatment module by skeleton object and described scanning and accept and decompose object in the described page-images, so that object ordered set is separated into page feature, the global position feature of object and morphological feature sequence, and be stored in the mark sheet (112), this process is by means of 205a input related command and parameter, and mark sheet is deposited on the storage medium of hard disk (204a);
Index characteristic module (103) utilizes the computing power that CPU (central processing unit) (202a) provides, the storage capacity of random access memory (203a), and tissue is by extracting described global position feature and the morphological feature vector that characteristic module extracts; Proper vector is deposited on the storage medium of hard disk (204a);
Data structure feature space index module (113), the global position feature of preservation process tissue and morphological feature vector are on the storage medium of hard disk (204a);
The hard disk (204a) of above-mentioned server (200a) contains permanent storage computer operating system, digitizing ancient books storehouse (110);
Demarcate sample retrieval module (121), utilize the computing power that CPU (central processing unit) (202b) provides, the storage capacity of primary memory (203b), the display capabilities of display (206b), the station-keeping ability of indicating equipment (209b), the order of determining the page coordinates of object and coordinate sequence is to form retrieval person's sample retrieval;
Obtain characteristic module (122), computing power, primary memory (203b) storage capacity of utilizing CPU (central processing unit) (202b) to provide, the page coordinates sequence is determined concrete object in the page-images as condition from mark sheet (112), to obtain and the corresponding proper vector of object;
The approximate module (123) of inquiry, utilize the computing power that CPU (central processing unit) (202a, 202b) provides, the storage capacity of random access memory (203a, 203b), with the reference point is clue, its analogical object of coupling set in the feature space index (113) on the storage medium of hard disk (204a), and the analogical object set of all the object correspondences in the sample retrieval is combined into global position feature set bunch;
Checking constraint condition module (124), utilize the computing power that CPU (central processing unit) (202a, 202b) provides, the storage capacity of random access memory (203a, 203b), the order of accepting coordinate sequence is as constraint condition and global position feature set bunch, with effective combination of inspection set bunch element and form result for retrieval; With
Show/browse result for retrieval module (125), utilize storage capacity, the display (206b) of computing power that CPU (central processing unit) (202b) provides, random access memory (203b) display capabilities, result for retrieval is apparent on retrieval person's the client computer indicator screen (206b);
Network access device (207a, 207b) is used for communicating by letter between server (200a) and the client computer (200b).
CN 00119542 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search Expired - Fee Related CN1269060C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 00119542 CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 00119542 CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Publications (2)

Publication Number Publication Date
CN1336604A CN1336604A (en) 2002-02-20
CN1269060C true CN1269060C (en) 2006-08-09

Family

ID=4587786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 00119542 Expired - Fee Related CN1269060C (en) 2000-08-01 2000-08-01 Method and system of digitizing ancient Chinese books and automatizing the content search

Country Status (1)

Country Link
CN (1) CN1269060C (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100664311B1 (en) * 2005-11-18 2007-01-04 삼성전자주식회사 Image forming apparatus for automatically forming index and the automatically index forming method thereof
CN101393643B (en) * 2007-09-21 2012-01-18 华东师范大学 Computer stroke deforming system and method
CN102253990A (en) * 2011-07-05 2011-11-23 广东星海数字家庭产业技术研究院有限公司 Interactive application multimedia data query method and device
KR20150006740A (en) * 2013-07-09 2015-01-19 류중하 Method for Composing Mark Image Corresponding Letter, and Method for Analyzing Mark Image About Corresponding Letter
CN105183744A (en) * 2015-06-29 2015-12-23 努比亚技术有限公司 Method and device for carrying out paper book keyword retrieval by mobile phone
US9898452B2 (en) 2015-10-16 2018-02-20 International Business Machines Corporation Annotation data generation and overlay for enhancing readability on electronic book image stream service
CN106502974A (en) * 2016-10-17 2017-03-15 王忠义 A kind of inscriptions on bones or tortoise shells carves diction Electronic record template construction method
CN106503247A (en) * 2016-11-09 2017-03-15 天津赛因哲信息技术有限公司 Ancient book document management system and method based on knowledge discovery technology
CN108550154A (en) * 2018-04-11 2018-09-18 中国科学院西双版纳热带植物园 A kind of method of accurately measuring karst earth's surface bare rock accounting
TWI734037B (en) * 2018-09-28 2021-07-21 愛探極溫度行銷有限公司 Real estate holding and inheritance calculation management system
CN111666262B (en) * 2020-05-28 2021-06-22 重庆中联信息产业有限责任公司 Working method for extracting feature points of massive medical images in network attached storage NAS state

Also Published As

Publication number Publication date
CN1336604A (en) 2002-02-20

Similar Documents

Publication Publication Date Title
CN1158627C (en) Method and apparatus for character recognition
CN1248138C (en) Image processing method and image processing system
US8520889B2 (en) Automated generation of form definitions from hard-copy forms
JP5710624B2 (en) Method and system for extraction
US5350303A (en) Method for accessing information in a computer
CN1701323A (en) Digital ink database searching using handwriting feature synthesis
JP4577931B2 (en) Document processing system and index information acquisition method
CN1625741A (en) An electronic filing system searchable by a handwritten search query
CN1542655A (en) Information processing apparatus, method, storage medium and program
CN1269060C (en) Method and system of digitizing ancient Chinese books and automatizing the content search
DE102011079443A1 (en) Learning weights of typed font fonts in handwriting keyword retrieval
US20040139384A1 (en) Removal of extraneous text from electronic documents
CN1900933A (en) Image search system, image search method, and storage medium
CN1177407A (en) Method and system for velocity-based head writing recognition
CN1877598A (en) Method for gathering and recording business card information in mobile phone by using image recognition
Ahmad et al. Kpti: Katib's pashto text imagebase and deep learning benchmark
Rizo et al. Muret: A music recognition, encoding, and transcription tool
US20150310269A1 (en) System and Method of Using Dynamic Variance Networks
CN115830620B (en) Archive text data processing method and system based on OCR
CN1251130C (en) Method for identifying multi-font multi-character size print form Tibetan character
CN1129331A (en) Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image
CN115774805A (en) File intelligent query method and system based on digital processing
Thammarak et al. Automated data digitization system for vehicle registration certificates using google cloud vision API
Sari et al. A search engine for Arabic documents
CN112464907A (en) Document processing system and method

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060809

Termination date: 20130801