CN1266632C - Document search device - Google Patents

Info

Publication number
CN1266632C
Authority
CN
China
Prior art keywords
document
character
mentioned
retrieval
assistance information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 02105715
Other languages
Chinese (zh)
Other versions
CN1381799A (en
Inventor
龟代泰三
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitsubishi Electric Corp
Original Assignee
Mitsubishi Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corp
Publication of CN1381799A
Application granted
Publication of CN1266632C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

A document retrieval device that achieves high-precision retrieval by taking into account the distinction between printed and handwritten characters. The device comprises: character recognition means 2, which recognizes the characters written in a document input by document input means 1 and extracts, from the image of the input document, information about the quality and condition of the characters as retrieval assistance information; a character dictionary 3, which stores the features of standard character patterns; document storage means 4, which stores the character recognition result and the retrieval assistance information as document data for retrieval; a retrieval document database 7, which holds the document data for retrieval; keyword input means 5, which inputs a keyword for document retrieval; document retrieval means 6, which, when matching the document data for retrieval against the keyword characters, performs matching suited to the retrieval assistance information extracted by the character recognition means; and retrieval result output means 8, which outputs the retrieval result. Retrieval is thereby carried out with high precision, and both retrieval omissions and retrieval noise are reduced.

Description

Document search device
Technical field
The present invention relates to a document search device that electronically stores and retrieves images of scanned documents, drawings, and the like, and in particular to a document search device that performs full-text search with arbitrary keywords on document data generated and stored by recognizing the characters written on a document image or drawing.
Background art
To register paper documents electronically as computer-readable document images, store them, and retrieve and display them, two methods have conventionally been used: manually attaching keyword information to each document image when the document is registered, or having an OCR (optical character reader) recognize the characters in the document image and storing the generated document text together with the document image.
The former method requires a great deal of labor and time to attach keywords at registration. The latter method cannot avoid misrecognition, because character recognition performance is imperfect. Unless the character codes obtained by character recognition are corrected, a keyword search suffers from "retrieval omissions", in which a desired document is not returned as a retrieval result, and from "retrieval noise", in which character strings that differ from the search keyword are returned as retrieval results. Correcting the recognition errors by hand requires as much labor as the former method.
One way to address the problem of the latter method is a technique that reduces retrieval omissions and achieves high-precision character retrieval even when character segmentation errors or character recognition errors are present (Japanese Unexamined Patent Publication No. 2000-057315). In addition to the character codes obtained by character recognition, this technique generates from each character image a feature quantity that expresses the shape of the character (a shape feature), and matches on both the character code and the shape feature at retrieval time.
A conventional document search device is described with reference to the drawings. Fig. 18 shows the structure of the conventional document search device disclosed, for example, in Japanese Unexamined Patent Publication No. 2000-057315.
In Fig. 18, 101 is an input device, 102 is a control device, 103 is a character recognition device, 104 is a feature generating device, 105 is a display device, 106 is a retrieval device, 107 is a feature comparison and decision device, 108 is a retrieval feature generating device, 109 is a recognition dictionary, 110 is a retrieval data storage unit, and 111 is a shape feature dictionary.
The operation of the conventional document search device is described below with reference to the drawings.
Document registration is described first. Fig. 19(a) is the document image to be registered, and Fig. 19(b) shows the result of character recognition device 103 recognizing Fig. 19(a).
Next, feature generating device 104 generates a shape feature for each recognized character. As shown in Fig. 20, the shape feature is generated by dividing each character image into 8 regions and extracting, from the outer contour of the character in each region, the horizontal, vertical, upper-right, and lower-right direction components. The result is shown in Fig. 21.
Next, the matching of the keyword "文字识别" ("character recognition") against the retrieval data "文空识别" is described using Fig. 22.
Retrieval device 106 first matches on character codes. In Fig. 22, the characters "文", "识", and "别" of the input keyword match the retrieval data, but "字" does not.
Retrieval device 106 then matches the unmatched characters on shape features. Specifically, shape feature 122 of the unmatched keyword character "字" is matched against shape feature 123 of the character image for which "空" was output as the recognition result in the retrieval data. For the shape feature of the keyword character "字", the feature values of the standard pattern stored in shape feature dictionary 111 are used.
Let C denote the distance between character codes and D the distance between shape features. The distance between the keyword and the retrieval data is then given by Formula (1).
$$\mathrm{Dist} = \frac{\sum D + \sum C}{\text{number of keyword characters}}$$
Formula (1)
Here $C_{ij} = \alpha$ ($\alpha$: a constant) when the character code of the i-th keyword character does not match that of the j-th character of the retrieval data, and $C_{ij} = 0$ when the two character codes match.
$$D[\mathrm{dic}(i), \mathrm{img}(j)] = \sum_{k=1}^{K} \sum_{l=1}^{L} \left| F_{\mathrm{dic}}(k,l) - F_{\mathrm{img}}(k,l) \right|$$
Formula (2)
In Formula (2), the first sum runs over k = 1 to K and the second over l = 1 to L.
Here $F_{\mathrm{dic}}$ is the feature value of the i-th keyword character stored in shape feature dictionary 111, $F_{\mathrm{img}}$ is the feature value of the j-th character of the retrieval data, K is the number of direction components, and L is the number of features per direction component. When Dist < TH (TH: a threshold) is satisfied, the character string is regarded as matching the keyword and is output as a retrieval result.
When the keyword and the retrieval data contain different numbers of characters, the character strings subjected to shape-feature matching can be aligned by dynamic programming. This realizes fuzzy matching that tolerates character segmentation errors and character recognition errors.
The conventional document search device described above performs fuzzy matching in order to realize retrieval that tolerates character recognition errors and character segmentation errors. Consequently, when it searches character strings that contain no segmentation errors, such as characters written in entry fields that have a frame for each single character (hereinafter, single-character frames), false extractions (retrieval noise) increase compared with retrieval that does not tolerate segmentation errors.
In addition, handwritten characters written in a field without single-character frames vary more in character size and character pitch than printed characters, which makes it difficult for character recognition to detect the correct boundaries between the characters in a line. Handwritten characters therefore suffer more segmentation errors and a lower recognition rate than printed characters. As a result, retrieval over document data generated by recognizing handwritten characters suffers from increased retrieval omissions.
Thus the tendency of character recognition errors differs depending on whether single-character frames are present and on whether the characters are printed or handwritten. Unless these factors are taken into account at retrieval time, high-precision retrieval cannot be achieved.
Summary of the invention
The present invention was made to solve the above problems. Its object is to obtain a document search device that stores retrieval assistance information together with the recognition result when a document is registered and, at retrieval time, matches according to that information, so that high-precision retrieval processing suited to each item of document data can be carried out, reducing retrieval omissions and retrieval noise compared with the case where no retrieval assistance information is used.
A document search device according to aspect 1 of the present invention comprises: a document input unit that inputs a document; a format definition file that holds field information describing region information of the document and attribute information of each region; a character recognition unit that, when recognizing the characters written in the document input by the document input unit using the format definition file, extracts information indicating whether the characters written in the input document are handwritten or printed as first retrieval assistance information; a character dictionary that stores the features of standard character patterns; a document storage unit that stores the character recognition result of the character recognition unit, the first retrieval assistance information, and the field information described in the format definition file as document data for retrieval; a retrieval document database that holds the document data for retrieval stored by the document storage unit; a keyword input unit that inputs a keyword for document retrieval; a document retrieval unit that, when matching the document data for retrieval against the keyword, performs matching that tolerates character segmentation errors and recognition errors if the first retrieval assistance information indicates handwriting, and performs matching that tolerates character segmentation errors if the first retrieval assistance information indicates print; and a retrieval result output unit that outputs the retrieval result of the document retrieval unit.
In the document search device according to aspect 2 of the present invention, whether the characters written in the input document are handwritten or printed is determined by calculating the mean and variance of the sizes of the circumscribed rectangles of the characters in one line and comparing the variance with a threshold calculated in advance from training data of printed and handwritten characters. The characters are judged to be handwritten when the variance for the line exceeds the threshold and printed when the variance is at or below the threshold.
In the document search device according to aspect 3 of the present invention, the document storage unit adds the information in the format definition file on whether single-character frames are present to the document data for retrieval as second retrieval assistance information and stores it in the retrieval document database. When matching the document data for retrieval against the keyword, the document retrieval unit performs: matching that tolerates character recognition errors, if the first retrieval assistance information indicates handwriting and the second indicates that single-character frames are present; matching that tolerates character segmentation errors and character recognition errors, if the first indicates handwriting and the second indicates that no single-character frames are present; matching that tolerates neither character segmentation errors nor character recognition errors, if the first indicates print and the second indicates that single-character frames are present; and matching that tolerates character segmentation errors, if the first indicates print and the second indicates that no single-character frames are present.
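The four cases of aspect 3 reduce to two independent switches: segmentation errors are tolerated exactly when there are no single-character frames, and recognition errors are tolerated exactly when the characters are handwritten. The sketch below expresses that table in Python; the names `MatchMode` and `select_match_mode` are hypothetical and do not appear in the patent.

```python
from typing import NamedTuple


class MatchMode(NamedTuple):
    """Which kinds of OCR error the matching step should tolerate."""
    allow_segmentation_errors: bool
    allow_recognition_errors: bool


def select_match_mode(handwritten: bool, has_char_frames: bool) -> MatchMode:
    """Map the two items of retrieval assistance information to a mode.

    handwritten     -- first retrieval assistance information
    has_char_frames -- second retrieval assistance information
    """
    return MatchMode(
        # Frames fix the character boundaries, so segmentation errors
        # are only expected in fields without them.
        allow_segmentation_errors=not has_char_frames,
        # Recognition errors are tolerated for handwriting only.
        allow_recognition_errors=handwritten,
    )
```

For example, `select_match_mode(handwritten=False, has_char_frames=True)` selects strict matching that tolerates neither kind of error.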
In the document search device according to aspect 4 of the present invention, the document storage unit holds the document data for retrieval in retrieval document databases that correspond to the first retrieval assistance information and the field information, and the document retrieval unit outputs retrieval results by matching separately for each combination of first retrieval assistance information and field information.
Description of drawings
Fig. 1 shows the structure of the document search device of Embodiment 1 of the present invention.
Fig. 2 is a flowchart showing the document registration operation of the document search device of Embodiment 1.
Fig. 3 is a flowchart showing the document retrieval operation of the document search device of Embodiment 1.
Fig. 4 is a flowchart showing the document retrieval operation of the document search device of Embodiment 1.
Fig. 5 shows the correspondence between retrieval assistance information and matching method in the document search device of Embodiment 1.
Fig. 6 shows a document registration form for the document search device of Embodiment 1.
Fig. 7 shows the format information of the document registration form for the document search device of Embodiment 1.
Fig. 8 shows an example of handwritten entries for the document search device of Embodiment 1.
Fig. 9 shows an example of printed entries for the document search device of Embodiment 1.
Fig. 10 shows the document data for Fig. 8.
Fig. 11 shows the document data for Fig. 9.
Fig. 12 shows the correspondence among retrieval assistance information, field information, and matching method in the document search device of Embodiment 1.
Fig. 13 shows another example of the character data of Fig. 8.
Fig. 14 shows another example of the character data of Fig. 9.
Fig. 15 shows an example of a character index for a handwritten document in the document search device of Embodiment 1.
Fig. 16 shows an example of a character index for the fields without single-character frames of a printed document in the document search device of Embodiment 1.
Fig. 17 shows an example of a character index for the fields with single-character frames of a printed document in the document search device of Embodiment 1.
Fig. 18 shows the structure of a conventional document search device.
Fig. 19 shows a character pattern and the character recognition result in the conventional document search device.
Fig. 20 shows the regions from which shape features are generated in the conventional document search device.
Fig. 21 shows the character recognition result and the shape features in the conventional document search device.
Fig. 22 illustrates the matching operation of the conventional document search device.
Embodiment
Embodiment 1
The document search device of Embodiment 1 of the present invention is described with reference to the drawings. Fig. 1 shows the structure of the document search device of Embodiment 1. In the figures, identical reference numerals denote identical or corresponding parts.
In Fig. 1, 1 is a document input device; 2 is a character recognition device that recognizes the characters in the document image input by document input device 1 and extracts retrieval assistance information from the character codes and character patterns; 3 is a character dictionary that stores the image features of standard character patterns; 4 is a document storage device that stores the character recognition result and the retrieval assistance information output by character recognition device 2; 5 is a keyword input device; 6 is a document retrieval device; 7 is a retrieval document database that holds the retrieval document data output by document storage device 4; 8 is a retrieval result output device; and 9 is a format definition file.
Next, the operation of the document search device of Embodiment 1 is described with reference to the drawings.
Document registration processing is described first. Here, registration uses the entry form shown in Fig. 6. In Fig. 6, 202 denotes the name field, 203 the address field, 204 the telephone number field, and 205 the trade name field.
Fig. 7 shows an example of the format definition file used when reading the form shown in Fig. 6. Fig. 7 lists, for each field, whether single-character frames are present and the rectangle coordinates of the field. The format definition file shown in Fig. 7 is created manually.
Fig. 2 is a flowchart of the registration processing of the document search device of Embodiment 1.
Registration processing is described using Fig. 2. First, in step S100 of Fig. 2, document input device 1 inputs a document image. Document input device 1 can be realized by photoelectrically converting a paper document with a scanner. It can also be realized by reading a photoelectrically converted image over a network. Figs. 8 and 9 show examples of document images read in by document input device 1.
Next, character recognition is performed in step S200 of Fig. 2. Character recognition device 2 extracts character patterns from the document image input by document input device 1 and outputs the character code corresponding to each character image. In Embodiment 1, character recognition device 2 is realized with well-known image processing techniques. First, single-character images are extracted from the document image according to the field rectangle coordinates and character frame information in format definition file 9. For a field with single-character frames, the character frames are extracted from the straight-line components of the image, and the image inside each character frame is recognized as a single character. For a field without single-character frames, character lines are extracted within the rectangle coordinates and divided into single characters using the peripheral distribution (projection profile) of each character line.
Next, the features used for character recognition are extracted from each single-character image, the distance to the image features of the standard pattern of each character in character dictionary 3 is calculated, and one or more characters are output as recognition candidate characters in order of increasing distance.
Specifically, for a field with single-character frames, the character frames are detected by finding, within the field rectangle, straight-line components whose horizontal or vertical direction component is at least a certain value; the rectangles enclosed by their intersections are taken as single-character frames. The straight-line components are detected with well-known image processing techniques. The character inside each resulting single-character frame is treated as a single character. For a field without single-character frames, character line extraction and character segmentation are performed. Character line extraction first connects black pixels of the input image (a binary image with white pixel value = 0 and black pixel value = 1) whose Euclidean distance from one another is within a certain value. Labeling, an image processing method, is then applied, and each labeled component whose shape is rectangular is determined to be a character line.
Then each character line is scanned in the horizontal and vertical directions to obtain the projection profile of the black pixel count. Positions where the black pixel count is minimal are taken as character segmentation candidate points, and the character line is divided into single-character images.
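The projection-based segmentation step can be illustrated with a short Python sketch. It assumes a horizontally written line stored as a list of rows of 0/1 pixels and looks only at the column profile; the function names `column_profile` and `split_candidates` are hypothetical, and a practical implementation would add rules for choosing among the candidate points.

```python
def column_profile(image):
    """Count the black pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def split_candidates(profile):
    """Return the column indices where the black-pixel count is minimal.

    These are only candidate cut points between characters.
    """
    lowest = min(profile)
    return [x for x, count in enumerate(profile) if count == lowest]


# Two small blobs separated by an empty column at index 2.
line_image = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
profile = column_profile(line_image)   # [2, 1, 0, 1, 2]
cuts = split_candidates(profile)       # [2]
```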
In character recognition processing, a mesh feature of, for example, 8 vertical × 8 horizontal dimensions is used as the character feature of each single-character image. Specifically, the number of black pixels in each small region of an 8 × 8 grid is counted, the distance to each standard pattern feature in character dictionary 3 is obtained as the sum of the absolute differences in each dimension, and one or more characters are output as recognition candidate characters in order of increasing distance.
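A minimal Python sketch of the mesh feature and the candidate ranking follows. It assumes the single-character image is a list of rows of 0/1 pixels whose height and width are multiples of the grid size, and that the dictionary maps each character to a feature vector of the same length; `mesh_feature` and `rank_candidates` are hypothetical names.

```python
def mesh_feature(image, grid=8):
    """Count black pixels in each cell of a grid x grid mesh.

    Assumes the image height and width are multiples of `grid`.
    """
    cell_h = len(image) // grid
    cell_w = len(image[0]) // grid
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            feature.append(sum(
                image[y][x]
                for y in range(gy * cell_h, (gy + 1) * cell_h)
                for x in range(gx * cell_w, (gx + 1) * cell_w)
            ))
    return feature


def rank_candidates(feature, dictionary, top_n=2):
    """Return the top_n dictionary characters by increasing distance.

    The distance is the sum of absolute differences in each dimension.
    """
    def distance(reference):
        return sum(abs(a - b) for a, b in zip(feature, reference))

    return sorted(dictionary, key=lambda ch: distance(dictionary[ch]))[:top_n]
```

Returning more than one candidate per character image is what later allows the retrieval step to match against a second candidate when the first is wrong.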
Next, character recognition device 2 extracts retrieval assistance information from the image features of the recognized character line. Here it judges whether the characters are printed or handwritten. One method uses the knowledge that the sizes of the individual handwritten characters in a line vary more than those of printed characters: the mean and variance of the circumscribed-rectangle sizes of the characters in one line are calculated and compared with a threshold calculated in advance from training data of printed and handwritten characters. The characters are judged to be handwritten when the variance exceeds the threshold and printed when it is at or below the threshold. Alternatively, standard patterns of both printed and handwritten characters can be held in character dictionary 3, the distances between the features extracted from the character image and the standard pattern features of the handwritten and printed characters can be calculated, and the judgment can be made according to whether the standard pattern nearest to the character pattern is handwritten or printed.
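The variance test can be sketched in a few lines of Python. The sketch assumes each character is summarized by a single size value for its circumscribed rectangle (for example its height or area) and that the threshold has already been learned elsewhere; `is_handwritten` is a hypothetical name, and the population variance from the standard library is one reasonable reading of the variance described in the text.

```python
from statistics import pvariance


def is_handwritten(box_sizes, threshold):
    """Judge one line as handwritten when its size variance exceeds the threshold.

    box_sizes -- circumscribed-rectangle sizes of the characters in one line
    threshold -- value learned in advance from printed and handwritten samples
    """
    return pvariance(box_sizes) > threshold


printed_line = [10, 10, 10, 10]     # uniform sizes, variance 0
handwritten_line = [6, 14, 9, 15]   # uneven sizes, variance 13.5
```

A variance exactly equal to the threshold is classified as printed, matching the "at or below the threshold" rule.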
Finally, in step S300, document storage device 4 stores the recognition candidate characters, and processing ends. In addition to the character codes output by character recognition device 2, the retrieval assistance information indicating handwritten or printed is also stored.
Fig. 10 shows the retrieval document data for the document image shown in Fig. 8, and Fig. 11 shows the retrieval document data for the document image shown in Fig. 9. In the recognition candidate characters of Figs. 10 and 11, characters enclosed in [ ] are multiple recognition candidates output for one single-character image. Holding multiple recognition candidates increases the number of correct characters contained in the character string, which reduces retrieval omissions. Registration ends once the retrieval document data shown in Figs. 10 and 11 have been registered in retrieval document database 7.
Next, the retrieval procedure is described following the flowcharts of Figs. 3 and 4.
Here the search keywords "一郎" and "一明" are used for the description. First, in step S1100 of Fig. 3, keyword input device 5 inputs a search keyword. Keyword input device 5 can be realized with a keyboard, a mouse, a pen, copying, or the like. First, "一郎" is entered as the search keyword.
Then, in step S1200, document retrieval device 6 matches the input keyword against retrieval document database 7. The matching procedure is described using the flowchart of Fig. 4.
In step S1210 of Fig. 4, one item of retrieval document data is extracted from retrieval document database 7, and its retrieval assistance information and recognition candidate characters are loaded into a buffer (not shown). At this point retrieval document database 7 holds the two documents shown in Figs. 10 and 11. First, the retrieval document data shown in Fig. 10 is loaded into the buffer.
Then, in step S1220, document retrieval device 6 performs in-field retrieval.
As shown in Fig. 5, in-field retrieval is carried out according to the retrieval assistance information. In Fig. 5, when the retrieval assistance information is "handwritten", retrieval 151, which accommodates character segmentation errors and recognition errors, is performed; when it is "printed", retrieval 152, which accommodates character segmentation errors, is performed.
First, the retrieval assistance information of field number 1 (name) is obtained from Fig. 10. Because it is "handwritten", retrieval 151 is performed. Retrieval 151 can be realized either by the method of the conventional example, which tolerates segmentation and recognition errors by using both character codes and shape features, or by a method that tolerates such errors by regarding the match as successful, and outputting a retrieval result, when part of the character codes of the input keyword match.
An example of the latter is given here. In this case, a match degree = (number of keyword characters that match characters in the retrieval document data) / (number of keyword characters) is calculated over a run of consecutive characters, and a retrieval result is output when the match degree is at least a certain value (0.5 here). In the recognition candidate string of the name field in Fig. 10, whose given-name part is "一[明郎]", the keyword character "郎" does not match the first recognition candidate "明" but does match the second candidate "郎". The match degree is therefore 2/2 = 1.0, and the document is output as a retrieval result candidate.
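The match-degree calculation can be sketched in Python as below. This is an illustration under assumptions: the recognized field is modeled as a list of candidate lists (one list per character image), the keyword is slid over every run of consecutive positions, and ASCII letters stand in for the characters of the example. The names `match_degree` and `tolerant_match` are hypothetical.

```python
def match_degree(keyword, slots):
    """Best fraction of keyword characters matched over any consecutive run.

    slots -- one list of recognition candidates per character image
    """
    n = len(keyword)
    best = 0.0
    for start in range(max(1, len(slots) - n + 1)):
        window = slots[start:start + n]
        # A keyword character matches if it appears among the candidates.
        hits = sum(1 for ch, cands in zip(keyword, window) if ch in cands)
        best = max(best, hits / n)
    return best


def tolerant_match(keyword, slots, threshold=0.5):
    """Retrieval that tolerates segmentation and recognition errors."""
    return match_degree(keyword, slots) >= threshold


# Stand-in for a field whose last character has two candidates, C then B.
field = [["X"], ["Y"], ["A"], ["C", "B"]]
```

With this field, the keyword "AB" reaches a match degree of 1.0 through the second candidate, while "AC" also reaches 1.0 through the first, which is how tolerant matching can admit retrieval noise.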
Processing then proceeds to step S1230, which checks whether all fields have been processed. Because Fig. 10 still contains fields that have not been matched, processing returns to step S1220 and in-field matching is performed for field number 2 (address). Because no character in the recognition result of field number 2 matches a keyword character, no retrieval result is output.
The above processing is repeated in the same way. When retrieval has finished for all fields, processing proceeds to step S1240, which checks whether retrieval document database 7 still contains retrieval document data that has not been matched. Because the retrieval document data shown in Fig. 11 is still present in retrieval document database 7, processing returns to step S1210 and the above processing is performed in the same way.
Because the retrieval assistance information of the retrieval document data shown in Fig. 11 is "printed", retrieval 152, which accommodates character segmentation errors, is performed. Retrieval 152 is defined here as matching that attributes character recognition errors to characters having been segmented incorrectly. Keyword characters are matched against the first recognition candidate in the retrieval document data, and even when some characters fail to match, the match is not regarded as successful if the mismatched portions contain the same number of characters.
For example, when the keyword "○×电机" is matched against the character string "○酸机", "○" and "机" match, but "×电" and "酸" do not, and their character counts differ (2 and 1, respectively). In this case, retrieval 152 interprets the mismatch as character recognition device 2 having misrecognized "×电" as "酸", and matching succeeds. To improve precision further, the shapes of the mismatched characters can be checked, as in the conventional example, by matching the shape feature of "×电" against that of "酸", and the match can be judged successful only when the shapes are similar.
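The rule for retrieval 152 can be sketched in Python as follows. The sketch assumes the caller has already chosen the segment of the first-candidate string to compare with the keyword, strips the common prefix and suffix, and accepts a mismatch only when the two remaining middle parts are non-empty and differ in length. ASCII letters stand in for the characters of the example, and `segmentation_tolerant_match` is a hypothetical name; the optional shape-feature check is omitted.

```python
def segmentation_tolerant_match(keyword, segment):
    """Accept exact matches and mismatches explainable by a segmentation error."""
    if keyword == segment:
        return True
    shortest = min(len(keyword), len(segment))
    # Length of the common prefix.
    p = 0
    while p < shortest and keyword[p] == segment[p]:
        p += 1
    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while (s < shortest - p
           and keyword[len(keyword) - 1 - s] == segment[len(segment) - 1 - s]):
        s += 1
    mid_keyword = keyword[p:len(keyword) - s]
    mid_segment = segment[p:len(segment) - s]
    if not mid_keyword or not mid_segment:
        return False
    # Equal counts suggest a plain recognition error, which is not tolerated.
    return len(mid_keyword) != len(mid_segment)
```

For instance, `segmentation_tolerant_match("OXEM", "OAM")` succeeds because the two-character middle "XE" faces the one-character middle "A", whereas `segmentation_tolerant_match("AB", "AC")` fails because both middles have one character.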
Among Figure 11, in " hillside plot one [Lang Lang] " as the identification candidate characters of input key word " youth " and name field, " one " and " youth " is owing to consistent with each other so be output as result for retrieval.Below up to not carrying out step S1220 repeatedly to step S1240 till the field of contrast, if finish with the contrast of all data then enter into step S1250, export the result and generate.Result for retrieval output unit 8 is Figure 10, and the retrieval of Figure 11 is output as result for retrieval with any of document data.At last, in Fig. 3, enter into step S1300, the output result for retrieval.
Secondly, use key word " is bright " to retrieve with the manner.In the retrieval of having used " one is bright ", Figure 10, it is ideal results that 11 retrieval all is not output as result for retrieval with any of document data.At first, carry out the corresponding retrieval 151 of Figure 10 and Character segmentation identification error." one [the bright youth] on the river " of Figure 10 therefore contrasts successfully owing to consistent with certain monocase of key word.Its result, the retrieval of Figure 10 is output as result for retrieval with the document data, becomes the retrieval noise.
Next, Figure 11 is compared using the character-segmentation-error-tolerant retrieval 152. In "hillside plot one [Lang Lang]" of Figure 11, the keyword character "one" matches, but the keyword character "bright" does not match the 1st candidate character "youth" in the character row, and the numbers of non-matching characters are both "1", so comparison with the keyword fails. As a result, the retrieval document data of Figure 11 is not output as a retrieval result.
With the above processing, this method produces no retrieval omissions for the keyword "youth", and the retrieval noise for the keyword "one is bright" is 1 document.
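The candidate-based part of retrieval 151 used in this example can be sketched as follows. This is a simplified illustration that covers only matching against any recognition candidate at each position, not the tolerance of segmentation errors. The field contents are hypothetical placeholders, not the characters of Figure 10.

```python
def candidate_match(keyword: str, field: list[list[str]]) -> bool:
    """Return True if the keyword matches a run of positions in the field.

    Each field position holds its recognition candidates. A keyword
    character matches a position when it equals any candidate there.
    """
    if not keyword:
        return False
    span = len(keyword)
    for start in range(len(field) - span + 1):
        if all(ch in field[start + i] for i, ch in enumerate(keyword)):
            return True
    return False


# Hypothetical field: the last position has two recognition candidates.
FIELD = [["K"], ["W"], ["1"], ["a", "b"]]
```

Here both `candidate_match("1a", FIELD)` and `candidate_match("1b", FIELD)` succeed, which mirrors how a hand-written field with several candidates avoids omissions but can also produce retrieval noise.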
For comparison, consider retrieving Figures 10 and 11 by the same methods without using the retrieval auxiliary information. If the retrieval 151 tolerant of character segmentation and recognition errors is used with the keyword "youth", the keyword characters match in both Figure 10 and Figure 11, so comparison succeeds.
Likewise, if the keyword "is bright" is used, both Figure 10 and Figure 11 match the keyword characters, so comparison succeeds and both become retrieval noise. As a result, retrieval using only retrieval 151 produces no retrieval omissions for the keyword "youth", but the retrieval noise for "one is bright" is 2 documents.
Likewise, consider carrying out the character-segmentation-error-tolerant retrieval 152 without using the retrieval auxiliary information. With the keyword "youth", comparison with Figure 11 succeeds. In comparison with Figure 10, however, the keyword character "youth" does not match "bright" in the retrieval document data, and because the numbers of non-matching characters are the same, comparison fails and a retrieval omission results.
On the other hand, in retrieval with the keyword "one is bright", comparison with Figure 10 succeeds and becomes retrieval noise, whereas in comparison with Figure 11 the keyword character "one" matches but "bright" does not, so Figure 11 is not output as a retrieval result. As a result, with retrieval 152 alone, there is 1 retrieval omission for the keyword "youth" and the retrieval noise for the keyword "is bright" is 1 document.
In retrievals using the keywords "youth" and "one is bright", this method reduces the retrieval noise to 1 document compared with using only retrieval 151. In addition, compared with using only retrieval 152, retrieval omissions are reduced by 1 document. Thus, by switching the retrieval method according to the retrieval auxiliary information, retrieval noise can be reduced and high-precision retrieval can be realized.
As a 2nd implementation method of Embodiment 1, in addition to the document retrieval device 6 switching the comparison according to whether the retrieval auxiliary information is "hand-written" or "font", the field information in the format definition file can also be used as retrieval auxiliary information, which allows comparison adapted to more detailed conditions.
An example is described using Figures 12, 13 and 14. In step S300 of Fig. 2, the document storage device 4 adds to the retrieval document data, in addition to the recognition candidate characters and the retrieval auxiliary information output by the character recognition device 2, the information on the presence or absence of single-character frames from the format definition file 9 of Fig. 7 as retrieval auxiliary information, and stores the data in the retrieval document database 7.
Figures 13 and 14 show examples. In Figures 13 and 14, retrieval auxiliary information 1 is the hand-written/font information, and retrieval auxiliary information 2 is the information on the presence or absence of single-character frames.
For the comparison between the keyword and the retrieval document database 7, four methods are set according to the combination of the font/hand-written information and the presence or absence of single-character frames. Figure 12 shows an example. For document data that is font and comes from a field with single-character frames, character recognition errors and character segmentation errors hardly occur, so exact-match retrieval 154 is set. This method outputs a retrieval result only when the input keyword completely matches the character row in the retrieval document data.
When the data is font but has no single-character frames, the character-segmentation-error-tolerant retrieval 152 is used, the same as in the 1st implementation method of Embodiment 1.
When the data is hand-written characters without single-character frames, the retrieval 151 tolerant of character segmentation and recognition errors is used, the same as in the 1st implementation method of Embodiment 1.
When the data is hand-written characters with single-character frames, the character-recognition-error-tolerant retrieval 153 is carried out. Retrieval 153 allows a partial match between the input keyword and the character row in the retrieval document data, and regards the retrieval as successful when the numbers of mutually corresponding non-matching characters are the same.
For example, consider comparing the input keyword "zero * motor" with the character row "zero * thunder machine": "zero", "*" and "machine" match each other, while the corresponding keyword character and "thunder" do not match. Because the non-matching parts are both single characters, "zero * thunder machine" is output as a retrieval result. By preparing retrieval modes corresponding to the retrieval auxiliary information in this way, the retrieval mode best suited to each kind of recognition error can be realized.
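The four-way selection described above, together with a simplified form of retrieval 153, can be sketched in Python. The enum, the table layout, and the rule "same length and at least one matching character" are assumptions made for illustration; only the mode numbers 151 to 154 and their conditions come from the description.

```python
from enum import Enum


class Mode(Enum):
    SEGMENTATION_AND_RECOGNITION_TOLERANT = 151
    SEGMENTATION_TOLERANT = 152
    RECOGNITION_TOLERANT = 153
    EXACT = 154


# (hand-written/font information, single-character frames present) -> mode
MODE_TABLE = {
    ("font", True): Mode.EXACT,
    ("font", False): Mode.SEGMENTATION_TOLERANT,
    ("hand-written", True): Mode.RECOGNITION_TOLERANT,
    ("hand-written", False): Mode.SEGMENTATION_AND_RECOGNITION_TOLERANT,
}


def select_mode(writing: str, has_single_char_frame: bool) -> Mode:
    """Pick the retrieval mode from the two items of auxiliary information."""
    return MODE_TABLE[(writing, has_single_char_frame)]


def recognition_tolerant_match(keyword: str, text: str) -> bool:
    """Simplified retrieval 153 (assumption): the strings have the same
    length, so non-matching parts have equal counts, and at least one
    character matches (a partial match)."""
    if not keyword or len(keyword) != len(text):
        return False
    return any(k == t for k, t in zip(keyword, text))
```

For instance, `select_mode("hand-written", True)` yields `Mode.RECOGNITION_TOLERANT`, and `recognition_tolerant_match("ABCD", "ABXD")` succeeds because one character was substituted without changing the character count.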
In the 2nd implementation method of Embodiment 1, the retrieval auxiliary information and the field information in the format definition file are used in retrieval. However, this is not limiting; for example, only the format information may be registered and used in retrieval.
In addition, in Embodiment 1 the font/hand-written judgment is used as the retrieval auxiliary information, but the retrieval auxiliary information is not limited to this. For example, the quality of the document image (the amount of noise), vertical or horizontal writing, the kind of font, the character size and the like may also be used.
In addition, in Embodiment 1 the retrieval document data for hand-written characters and for font are kept mixed in a single retrieval document database 7. However, this is not limiting; separate retrieval document databases 7 may be generated according to the retrieval auxiliary information, such as hand-written characters or font, and each may be searched with its own specific retrieval mode. In the 2nd implementation method of Embodiment 1, the four retrieval modes shown in Figure 12 are provided for the respective retrieval auxiliary information, and retrieval processing is speeded up by generating the optimum retrieval index (character position index information) for each retrieval mode.
Figures 15, 16 and 17 show the retrieval indexes. Each index keeps a character code, a field number and a character position as index information. Thus keywords present in a document can be searched at high speed without directly comparing the character recognition results with the keyword.
Figure 17 is the retrieval index for exact-match retrieval 154. It is generated from the fields whose retrieval auxiliary information is "font" and "with single-character frames", namely field numbers 3 and 4 of Figure 14. For example, from "" as the recognition result of field number "4", the field number of "" is 4, and its character position, counted in single characters from the start of the field, is "1". Likewise, the field number of "" is 4 and its character position is 2. The remaining entries are generated in the same way. In addition, an index of 2 concatenated characters is also generated, joining "" (field number 4, character position 1) and "" (field number 4, character position 2). The larger the number of concatenated characters, the fewer index reads and comparisons are needed for the input keyword characters, so exact-match retrieval 154 can be speeded up.
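The index structure described here can be sketched as a mapping from a character (or a run of concatenated characters) to its postings. This is an illustrative assumption about the layout, using 1-based character positions as in the description and recording the start position of each concatenated run; the function name `build_index` and the sample field are hypothetical.

```python
from collections import defaultdict


def build_index(fields: dict[int, str], ngram: int = 1) -> dict[str, list[tuple[int, int]]]:
    """Map each run of `ngram` characters to (field number, start position).

    Positions are counted in single characters from the start of the
    field, beginning at 1. ngram=1 gives a single-character index and
    ngram=2 gives a 2-character concatenated index.
    """
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for field_no, text in fields.items():
        for start in range(len(text) - ngram + 1):
            index[text[start:start + ngram]].append((field_no, start + 1))
    return dict(index)
```

For a hypothetical field number 4 containing "AB", `build_index({4: "AB"})` gives one posting per character, and `build_index({4: "AB"}, ngram=2)` gives a single posting for the concatenated pair, so a 2-character keyword needs only one index read.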
Figure 15 is the retrieval index for the character-recognition-error-tolerant retrieval 153 and for the retrieval 151 tolerant of character segmentation and recognition errors, generated from the character recognition results of Figure 13. Likewise, Figure 16 is an example of the retrieval index for the character-segmentation-error-tolerant retrieval 152, generated from field numbers 1 and 2 of Figure 14. Figures 15 and 16 are indexes for retrieval modes that involve ambiguity. To prevent retrieval omissions caused by character segmentation errors and character recognition errors, only single-character indexes are used for retrieval. Compared with keeping concatenated-character indexes as in Figure 17, this reduces the index capacity while still realizing high-speed retrieval. When the same retrieval is performed for hand-written characters and font, the retrieval indexes shown in Figures 15 and 16 can also be merged into one.
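A lookup that uses only a single-character index can be sketched as follows. For brevity the sketch checks for an exact run of consecutive positions; the tolerant modes 151 to 153 would relax this check. The index contents and the name `find_keyword` are hypothetical.

```python
Posting = tuple[int, int]  # (field number, character position)


def find_keyword(index: dict[str, list[Posting]], keyword: str) -> list[Posting]:
    """Return the start postings where the keyword's characters occur at
    consecutive positions of the same field, using single-character
    postings only."""
    if not keyword:
        return []
    hits = []
    for field_no, pos in index.get(keyword[0], []):
        if all((field_no, pos + offset) in index.get(ch, [])
               for offset, ch in enumerate(keyword)):
            hits.append((field_no, pos))
    return hits


# Hypothetical single-character index.
INDEX = {"A": [(4, 1), (5, 2)], "B": [(4, 2)], "C": [(5, 3)]}
```

With this index, `find_keyword(INDEX, "AB")` locates the keyword in field 4 and `find_keyword(INDEX, "AC")` locates it in field 5, without scanning the recognition results themselves.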
As described above, according to Embodiment 1, the retrieval auxiliary information is saved together with the recognition results when a document is registered, and at retrieval time comparison is performed according to the retrieval auxiliary information, so that high-precision retrieval processing suited to each item of document data can be carried out. Compared with the case where the retrieval auxiliary information is not used, retrieval omissions and retrieval noise can therefore be reduced.
As described above, the document search device of aspect 1 of the present invention comprises: a document input device for inputting a document; a character recognition device that recognizes the characters written in the document input by the document input device and extracts, from the image of the input document, information on the quality or state of the characters as retrieval auxiliary information; a character dictionary storing the features of the standard patterns of characters; a document storage device that stores the character recognition results of the character recognition device and the retrieval auxiliary information as retrieval document data; a retrieval document database storing the retrieval document data; a keyword input device for inputting a keyword for document retrieval; a document retrieval device that, when comparing the retrieval document data in the retrieval document database with the keyword characters, performs comparison corresponding to the retrieval auxiliary information extracted by the character recognition device; and a retrieval result output device that outputs the retrieval result of the document retrieval device. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, in the document search device of aspect 2 of the present invention, the retrieval auxiliary information is information for judging whether the characters written in the input document are hand-written or font. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, in the document search device of aspect 3 of the present invention, the document storage device keeps the retrieval document data in retrieval document databases corresponding to the retrieval auxiliary information, and the document retrieval device performs comparison according to the comparison method specified for each retrieval document database. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, the document search device of aspect 4 of the present invention comprises: a document input device for inputting a document; a format definition file holding field information that describes area information of the document and attribute information of the areas; a character recognition device that, when recognizing the characters written in the document input by the document input device using the format definition file, extracts from the image of the input document information on the quality or state of the characters as retrieval auxiliary information; a character dictionary storing the features of the standard patterns of characters; a document storage device that stores the character recognition results of the character recognition device, the retrieval auxiliary information and the field information described in the format definition file; a retrieval document database storing the retrieval document data stored by the document storage device; a keyword input device for inputting a keyword for document retrieval; a document retrieval device that, when comparing the retrieval document data with the keyword, performs comparison according to the comparison method corresponding to the retrieval auxiliary information and the field information; and a retrieval result output device that outputs the retrieval result of the document retrieval device. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, in the document search device of aspect 5 of the present invention, the retrieval auxiliary information is information for judging whether the characters written in the input document are hand-written or font. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, in the document search device of aspect 6 of the present invention, the document retrieval device performs retrieval processing using the information on the presence or absence of single-character frames in the format definition file. It performs comparison that does not allow character segmentation errors when comparing with recognition result characters from a field that has single-character frames, and comparison that allows character segmentation errors when comparing with recognition result characters from a field that has no single-character frames. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.
As described above, in the document search device of aspect 7 of the present invention, the document storage device keeps the retrieval document data in retrieval document databases corresponding to the retrieval auxiliary information and the field information, and the document retrieval device outputs retrieval results by comparison corresponding to each item of retrieval auxiliary information and field information. It therefore has the effect that high-precision retrieval can be performed and retrieval omissions and retrieval noise can be reduced.

Claims (8)

1. A document search device, characterized by comprising:
a document input unit for inputting a document;
a format definition file holding field information that describes area information of the document and attribute information of the areas;
a character recognition unit that, when recognizing the characters written in the document input by the document input unit using the format definition file, extracts information for judging whether the characters written in the input document are hand-written or font as 1st retrieval auxiliary information;
a character dictionary storing the features of the standard patterns of characters;
a document storage unit that stores the character recognition results of the character recognition unit, the 1st retrieval auxiliary information and the field information described in the format definition file as retrieval document data;
a retrieval document database storing the retrieval document data stored by the document storage unit;
a keyword input unit for inputting a keyword for document retrieval;
a document retrieval unit that, when comparing the retrieval document data with the keyword, performs comparison allowing character segmentation errors and recognition errors when the 1st retrieval auxiliary information is hand-written, and performs comparison allowing character segmentation errors when the 1st retrieval auxiliary information is font; and
a retrieval result output unit that outputs the retrieval result of the document retrieval unit.
2. The document search device according to claim 1, characterized in that:
to judge whether the characters written in the input document are hand-written or font, the mean and variance of the sizes of the circumscribed rectangles of the characters in one line are calculated, and the variance is compared with a threshold calculated in advance by learning from font data and hand-written character data; when the variance of the characters in the line is greater than the threshold, the characters are judged to be hand-written, and when the variance of the characters in the line is at or below the threshold, the characters are judged to be font.
3. The document search device according to claim 1, characterized in that:
the document storage unit adds the information on the presence or absence of single-character frames in the format definition file to the retrieval document data as 2nd retrieval auxiliary information and stores it in the retrieval document database, and
when comparing the retrieval document data with the keyword, the document retrieval unit performs comparison allowing character recognition errors when the 1st retrieval auxiliary information is hand-written and the 2nd retrieval auxiliary information indicates single-character frames; performs comparison allowing character segmentation errors and character recognition errors when the 1st retrieval auxiliary information is hand-written and the 2nd retrieval auxiliary information indicates no single-character frames; performs comparison allowing neither character segmentation errors nor character recognition errors when the 1st retrieval auxiliary information is font and the 2nd retrieval auxiliary information indicates single-character frames; and performs comparison allowing character segmentation errors when the 1st retrieval auxiliary information is font and the 2nd retrieval auxiliary information indicates no single-character frames.
4. The document search device according to claim 1, characterized in that:
the document storage unit keeps the retrieval document data in retrieval document databases corresponding to the 1st retrieval auxiliary information and the field information, and
the document retrieval unit outputs retrieval results by comparison corresponding to each item of the 1st retrieval auxiliary information and the field information.
5. A document search method, characterized by comprising:
a document input step of inputting a document;
a character recognition step of recognizing the characters written in the input document using a format definition file holding field information that describes area information of the document and attribute information of the areas, and at the same time extracting information for judging whether the characters written in the input document are hand-written or font as 1st retrieval auxiliary information;
a document storing step of storing the character recognition results of the character recognition step, the 1st retrieval auxiliary information and the field information described in the format definition file in a retrieval document database as retrieval document data;
a keyword input step of inputting a keyword for document retrieval;
a document retrieval step of, when comparing the retrieval document data with the keyword, performing comparison allowing character segmentation errors and recognition errors when the 1st retrieval auxiliary information is hand-written, and performing comparison allowing character segmentation errors when the 1st retrieval auxiliary information is font; and
a retrieval result output step of outputting the retrieval result of the document retrieval step.
6. The document search method according to claim 5, characterized in that:
to judge whether the characters written in the input document are hand-written or font, the mean and variance of the sizes of the circumscribed rectangles of the characters in one line are calculated, and the variance is compared with a threshold calculated in advance by learning from font data and hand-written character data; when the variance of the characters in the line is greater than the threshold, the characters are judged to be hand-written, and when the variance of the characters in the line is at or below the threshold, the characters are judged to be font.
7. The document search method according to claim 5, characterized in that:
the document storing step adds the information on the presence or absence of single-character frames in the format definition file to the retrieval document data as 2nd retrieval auxiliary information and stores it in the retrieval document database, and
when comparing the retrieval document data with the keyword, the document retrieval step performs comparison allowing character recognition errors when the 1st retrieval auxiliary information is hand-written and the 2nd retrieval auxiliary information indicates single-character frames; performs comparison allowing character segmentation errors and character recognition errors when the 1st retrieval auxiliary information is hand-written and the 2nd retrieval auxiliary information indicates no single-character frames; performs comparison allowing neither character segmentation errors nor character recognition errors when the 1st retrieval auxiliary information is font and the 2nd retrieval auxiliary information indicates single-character frames; and performs comparison allowing character segmentation errors when the 1st retrieval auxiliary information is font and the 2nd retrieval auxiliary information indicates no single-character frames.
8. The document search method according to claim 5, characterized in that:
the document storing step keeps the retrieval document data in retrieval document databases corresponding to the 1st retrieval auxiliary information and the field information, and
the document retrieval step outputs retrieval results by comparison corresponding to each item of the 1st retrieval auxiliary information and the field information.
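
The hand-written/font judgment recited in claims 2 and 6 can be illustrated with a minimal Python sketch. It assumes that each character's circumscribed rectangle is reduced to a single size value (for example its height or area) and that the learned threshold is simply passed in as a parameter; the function name and sample values are hypothetical.

```python
from statistics import pvariance


def classify_line(box_sizes: list[float], threshold: float) -> str:
    """Judge one line as hand-written or font from the variance of the
    sizes of the characters' circumscribed rectangles.

    Variance above the threshold -> "hand-written"; otherwise "font".
    The threshold is assumed to have been learned beforehand from font
    data and hand-written character data.
    """
    if not box_sizes:
        raise ValueError("a line needs at least one character")
    return "hand-written" if pvariance(box_sizes) > threshold else "font"
```

Evenly sized characters such as `classify_line([10, 10, 10], 1.0)` are judged to be font, while irregular sizes such as `classify_line([8, 14, 10, 19], 1.0)` are judged to be hand-written.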
CN 02105715 2001-04-16 2002-04-15 Document search device Expired - Fee Related CN1266632C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2001116751A JP3812719B2 (en) 2001-04-16 2001-04-16 Document search device
JP116751/01 2001-04-16

Publications (2)

Publication Number Publication Date
CN1381799A CN1381799A (en) 2002-11-27
CN1266632C true CN1266632C (en) 2006-07-26

Family

ID=18967439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 02105715 Expired - Fee Related CN1266632C (en) 2001-04-16 2002-04-15 Document search device

Country Status (2)

Country Link
JP (1) JP3812719B2 (en)
CN (1) CN1266632C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007094078A1 (en) * 2006-02-14 2007-08-23 Hitachi, Ltd. Character string search method and device thereof
JP4436894B2 (en) * 2007-08-09 2010-03-24 パナソニック株式会社 Content search device
CN105787415B (en) * 2014-12-18 2020-04-07 富士通株式会社 Document image processing device and method and scanner
CN105302626B (en) * 2015-11-09 2021-07-23 深圳市巨鼎医疗股份有限公司 Analytic method of XPS (XPS) structured data

Also Published As

Publication number Publication date
JP3812719B2 (en) 2006-08-23
CN1381799A (en) 2002-11-27
JP2002312398A (en) 2002-10-25

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060726

Termination date: 20110415