CN100351839C

CN100351839C - File searching and reading method and apparatus

Info

Publication number: CN100351839C
Application number: CNB2004100048717A
Authority: CN
Inventors: 永崎健; 丸川胜美; 竹内沙弥香
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-10-29
Filing date: 2004-02-10
Publication date: 2007-11-28
Anticipated expiration: 2024-02-10
Also published as: JP2005135041A; JP4461769B2; CN1612154A

Abstract

The invention provides a method that enables a search and browse of a document image group through the application of a document structure analysis technique and a character recognition technique as searching/browsing means for paper documents and document images. A highly functional document image search/browse system separates an OCR and a document processing apparatus, adopts as OCR output formats data (reading hypothesis data) holding multiple hypotheses of character line extraction, character segmentation and character recognition, and document structure data having ruled line information, frame information, character line information, browse attribute information and the like about a document image, and provides a function of important keyword extraction and document search from typed and handwritten character strings using OCR-added data, and of document display intended by a browser using the document structure data.

Description

Document retrieval and browsing method and document retrieval and browsing apparatus

Technical field

The present invention relates to document analysis technology, and in particular to an apparatus that obtains, on a computer, the information needed when retrieving and browsing a group of paper documents or document images, and to a medium on which a document analysis program is recorded.

Background art

Even today, when digital information technology is widespread, paper documents remain widely used as a medium for conveying information. However, storing paper documents takes up space, and it is hard to find needed information in them. There is therefore strong demand to convert paper documents into electronic images for storage, and to search and browse the resulting documents (hereinafter, document images) on a computer.

The basic approach to searching paper documents is to convert them to text by OCR (optical character recognition) and search the text. However, text produced by OCR usually contains errors, so simple text search fails in some cases. The OCR text can of course be corrected manually and the corrected result searched, but manual correction is impractical in terms of both processing speed and cost.

Japanese Unexamined Patent Publication No. H05-108891 (Patent Document 1) describes applying morphological analysis to OCR recognition results as a way of improving OCR reading accuracy. Knowledge processing such as morphological analysis can correct many misreadings reliably, but it cannot correct 100% of them. In addition, the dictionaries used in ordinary morphological analysis target general text such as news articles, so correcting documents for a specific business purpose with high accuracy requires an additional special-purpose dictionary for that field. This causes problems in maintainability and computational cost.

Japanese Unexamined Patent Publication No. H10-74250 (Patent Document 2) proposes a word search method that uses information on similar characters that OCR easily confuses, in order to avoid the adverse effect of character misreading on search. Japanese Unexamined Patent Publication No. H09-134369 (Patent Document 3) proposes a method that allows multiple character recognition candidates in the OCR reading result and selects character codes from among them when searching for a word. With these techniques, the adverse effect on word search of misreading a single character can indeed be avoided.

However, these methods cannot handle cases where character patterns are segmented incorrectly because their boundaries cannot be determined clearly, for example when a character consists of separated parts or when adjacent characters touch. For example, when OCR reads the string "ハル" as "ヘル", the methods of the above patent documents can cope, but when it is read as "ハノレ", they cannot. Furthermore, in documents that mix text with figures or tables, or in documents with many ruled lines such as forms, it is often difficult even to detect and extract the character lines before character reading begins. The above methods cannot handle this problem either.

Furthermore, there is demand for document image browsing functions that paper documents do not offer. For example, when a large number of forms are checked, the entire document is generally not consulted; attention is concentrated on the fields that have to be filled in. For on-screen checking, one can therefore consider functions that extract specific fields of the document image in advance and display only those fields, or display them with emphasis. Conventional OCR, however, only recognizes the items written in specific fields, so only the recognition result can be displayed on screen. If the recognition result were perfect, displaying the recognition result for a specific field would be fully equivalent to browsing that part of the document image, but this is difficult to achieve in practice. It is therefore desirable for the OCR device to output document structure data, such as the frame structure and ruled line coordinates, alongside the text recognition result, and for the browsing function to use this information.

Formats for handling paper documents converted to electronic images include image formats such as TIFF and GIF and document formats such as PDF. Usually, the image file and the OCR recognition result are output separately, the latter as a file in a format such as CSV or XML, and the two are handled together. In that case, a system is needed to maintain the links between the files. PDF does provide a function for embedding the OCR recognition result in the image file as transparent text, but for handwritten characters the recognition result cannot always be determined uniquely, and embedding document structure data in the image file is not supported. It is also possible to handle the document structure data separately from the image file and build browsing software that combines the two. However, handling them separately is inefficient for document management. This is because document structure data consists of coordinate information for the ruled lines, frames, and character lines in the document image, so, unlike text, it has little use independent of the image file.

When documents are browsed on a computer, display effects such as highlight colors and colored lines are widely used, but generally only for natively electronic document data such as Word or HTML. For document image files, such effects have been avoided because of the time required to produce them.

[Patent Document 1] Japanese Unexamined Patent Publication No. H05-108891

[Patent Document 2] Japanese Unexamined Patent Publication No. H10-74250

[Patent Document 3] Japanese Unexamined Patent Publication No. H09-134369

[Patent Document 4] Japanese Unexamined Patent Publication No. H09-319824

[Patent Document 5] Japanese Unexamined Patent Publication No. 2000-251012

[Patent Document 6] Japanese Unexamined Patent Publication No. 2001-014311

[Patent Document 7] Japanese Patent No. 2886868

[Patent Document 8] Japanese Patent Application No. H09-238032

Summary of the invention

An object of the present invention is to provide a document retrieval and browsing system and apparatus that, based on character recognition results from an OCR device, offer efficient retrieval and browsing functions after a group of paper documents has been converted to electronic images, and to provide a recording medium on which an OCR recognition program and a document browsing system are recorded.

In existing methods, a group of paper documents is searched by searching the text obtained as the OCR reading result. It is difficult, however, to cope with OCR character recognition errors caused by broken or faint characters, OCR character segmentation errors caused by ambiguous character pattern boundaries, and OCR character line extraction errors caused by the mixture of text, figures, and ruled lines. The first object of the present invention is to provide a method that prevents the character recognition, character segmentation, and character line extraction errors that may occur in OCR reading from adversely affecting document retrieval.

In addition, when existing methods display partial regions while a document image is browsed, the regions are specified by fixed coordinates and are therefore affected by image misalignment and the like. In the present method, the OCR device outputs document structure data including ruled line information, frame information, and character line information, and using this data avoids the adverse effect on display. The second object of the present invention is to provide additional functions when browsing document images, such as partial region display, highlighted display of important words, and display with information hidden.

In addition, existing methods take time to convert document image data in order to apply display effects to a document image. In the present method, the document structure data output by the OCR device is used to apply pseudo-coloring in advance to the regions and character strings that are expected to need display effects, which avoids this problem. The third object of the present invention is to reduce the processing time needed to display documents during browsing.

To achieve the first object, the present invention provides a system that separates the OCR device from the document image processing apparatus. As the OCR output format, it uses files holding the document image (including a pseudo-colored document image) together with reading result data, reading hypothesis data, and document structure data (collectively called the OCR additional data). Keyword search and document browsing functions built on the document image and the OCR additional data perform the required document image retrieval and browsing.

To achieve the second object, the present invention provides a browsing system that uses the OCR additional data output by the OCR device to realize visual effects such as highlighted display of partial regions, separate display of partial regions, and highlighted display of specific character strings.

To achieve the third object, the present invention uses the OCR additional data to apply pseudo-color processing to predetermined specific regions, and changes the pseudo-colors when the display mode is switched, thereby providing a high-speed display function.

In existing methods, a group of document images is searched using the text obtained as the OCR reading result, and it is difficult to cope with OCR character recognition errors caused by a mixture of printed and handwritten characters or by broken or faint characters, with OCR character segmentation errors caused by ambiguous character pattern boundaries, and with OCR character line extraction errors caused by the mixture of text, figures, and ruled lines. According to the present invention, these problems can be avoided by performing word search and document retrieval on OCR additional data that holds the candidates for character recognition, character segmentation, and character line extraction. In addition, by using the document structure data included in the OCR additional data, a browsing system can be built that offers additional functions when document images are browsed, such as highlighting the necessary positions and listing many documents together.

Description of drawings

Fig. 1 is a diagram comparing the processing of this patent with that of existing methods.

Fig. 2 is a flowchart of the OCR device that outputs the OCR additional data.

Fig. 3 is a flowchart of document processing that uses the OCR additional data.

Fig. 4 is a conceptual diagram of embedding the OCR additional data in an image file.

Fig. 5 is an example of a document image.

Fig. 6 is an example of document structure analysis.

Fig. 7 is a conceptual diagram of notation knowledge processing using a character string hypothesis.

Fig. 8 is a conceptual diagram of a character string hypothesis.

Fig. 9 is an example of the document browsing system (partial list view).

Fig. 10 is an example of the document browsing system (display of important words).

Fig. 11 is an example of the document browsing system (rule check).

Fig. 12 is an example of the document browsing system (information hiding).

Fig. 13 is a conceptual diagram of pseudo-coloring.

Fig. 14 is an example of the document browsing system (region highlighting).

Fig. 15 is a configuration example of the OCR device and the document processing apparatus.

Embodiment

The difference between existing methods and the proposed method is outlined using Fig. 1 as an example. Fig. 1 models the difference between document processing that uses conventional OCR and document processing that uses the method proposed in this patent.

In the conventional flow, there is first a group of paper documents (0101), which is read by an OCR device (0102). The OCR output (0103) consists of the document images obtained by digitizing the paper and the text obtained as the OCR reading result. Document processing is then performed with an apparatus (0104). Because the OCR output in this flow is only the reading result text and the document images, document processing is limited to text search and document image browsing.

In contrast, in the processing flow proposed in this patent application, there is first a group of paper documents (0105), which is read by an OCR device (0106). The OCR output (0107) consists of the document image obtained by digitizing the paper; the text obtained as the OCR reading result; reading hypothesis data, which holds the candidates for character line extraction, character segmentation, and character recognition; and document structure data, which holds ruled line information, frame information, character line information, and browse attribute information for the document. Alternatively, the output is a document image with additional information, in which this data set is embedded in the document image. Document processing is then performed with an apparatus (0108). Because the OCR output in this flow includes the above information in addition to text and document images, document processing is not limited to simple text search and image browsing. It also supports keyword search over handwriting that is difficult to recognize, highlighted display that emphasizes important keywords or regions in the document with colored lines or contrast, partial region display (partial list view) in which only the necessary parts of document images are shown side by side, and display in which confidential items are partially hidden.

The data output at 0107 carries a document ID code that uniquely identifies the corresponding paper document or document image, and it can be saved on a magnetic storage device or the like. Two storage forms can be considered: one stores the document image, pseudo-colored document image, reading result text, reading hypothesis data, and document structure data separately in a database; the other embeds this data in the document image file as additional data. The advantage of the former is that the document image and the data added by OCR (the reading result text and so on, hereinafter collectively called the OCR additional data) are handled individually, so existing tools can be used independently for document browsing and for search. However, to display the document found by a text search, or to highlight the positions related to a search, the correspondence between the OCR additional data and the document image must be resolved using the document ID. Moreover, when only the reading result text is used, highlighting the search term on the document image is impossible, because no coordinate information on the document image corresponds to the reading result text. The advantage of the latter is that managing the document image file alone gives access to both the image and all of the OCR additional data. Because there is no need, as in the former, to use a document ID to link the OCR additional data (reading result text and so on) to the document image, document management is comparatively easy.
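The two storage forms can be contrasted with a minimal sketch. All names here (`OcrAdditionalData`, `register_separately`, `DocumentImageFile`, the in-memory dictionaries standing in for a database) are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class OcrAdditionalData:
    """Hypothetical container for the data the OCR device adds to an image."""
    result_text: str = ""                                  # reading result text
    hypothesis_data: list = field(default_factory=list)    # reading hypothesis data
    structure_data: dict = field(default_factory=dict)     # document structure data


# Form 1: image and OCR additional data stored separately,
# linked only through the document ID.
image_store = {}        # document ID -> image bytes
additional_store = {}   # document ID -> OcrAdditionalData


def register_separately(doc_id, image, extra):
    image_store[doc_id] = image
    additional_store[doc_id] = extra


def lookup_separately(doc_id):
    # Both lookups are needed, and both depend on the document ID.
    return image_store[doc_id], additional_store[doc_id]


# Form 2: the OCR additional data travels inside the document image file,
# so no ID-based link has to be maintained.
@dataclass
class DocumentImageFile:
    image: bytes
    extra: OcrAdditionalData
```

In the first form every highlight or search hit requires a second lookup through the document ID; in the second, a single object gives access to both the image and its OCR additional data.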

Fig. 2 is described next. In the form recognition apparatus of this embodiment, the OCR device first captures the paper document and converts it to electronic image data. When the original document is already electronic image data, this step can be omitted (0201). Next, document structure analysis is performed on the electronic image data, including ruled line extraction, frame structure analysis, and estimation of the positions of the fields to be read (0202). Document structure analysis uses a document structure dictionary, which contains the ruled line coordinates and frame coordinates of the document images to be read and the attributes of the fields to be read (name entry field, address entry field, browse attribute information, and so on). The recognition processing used here relies on well-known techniques (Japanese Unexamined Patent Publication No. H09-319824 (Patent Document 4), Japanese Unexamined Patent Publication No. 2000-251012 (Patent Document 5), and others). Then, based on the result of the document structure analysis, the character lines to be read are extracted (0203). Character pattern candidates are then segmented from each character line image, and character recognition is performed on each candidate (0204). When the character line structure is complex, multiple character line hypotheses are set up, and candidate segmentation and character recognition are performed for each hypothesis. Character recognition uses a character recognition dictionary, which contains the character codes and structural information (intensity distribution of contour direction components, various statistics, and so on) of the character patterns to be recognized. The character pattern candidates together with their recognition results are called a character string hypothesis.

When the character strings that can be written in the document to be read are determined in advance, notation analysis is performed on the character string hypothesis (0205). This analysis uses a character string notation knowledge dictionary, which contains information such as the words, symbols, and digit strings that may appear in the document and the order in which groups of words may appear. In this way, a character string hypothesis that is ambiguous in character segmentation or character recognition is converted into a character string path and then into character string text. Here, a character string path is an arrangement in which each character code is paired with its corresponding character pattern candidate. When the processing at 0205 fails, or when the notation knowledge for the document is not known in advance, the character string hypothesis is passed directly to the next step.

The next step takes the character string hypothesis or the text as input and selects one or both of them as the OCR output (0206). Usually, the character string hypothesis is interpreted as a directed graph containing paths that satisfy the notation knowledge and run from the start to the end of the graph. When such a path is determined uniquely, and the reliability of the character string path, determined from the character recognition similarities and the combination of character patterns, exceeds a certain threshold, it is decided that the text is to be output. When the decision is to output text, the character string text is output as the reading result text in step 0207. Manual correction can be applied to the reading result text. Conversely, when the reliability of the character string path is low, the character string hypothesis is output. Both the reading result text and the reading hypothesis data hold, as needed, the position on the document image where the character string is written. Through the above processing, the document image file, document structure data, reading result text, and reading hypothesis data are output, and the subsequent document processing is performed on this data. The document processing can be divided roughly into two parts. The first is data registration (0209), in which the data is registered in a database or in the document image so that the data set can be handled. The data is then used for document processing (0210). When the OCR device and the document processing apparatus are separate, the OCR device covers steps 0201 to 0208, or steps 0201 to 0209.
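The output selection at step 0206 can be sketched as follows. The patent does not specify how the reliability of a character string path is computed; the mean of the per-character similarities and the threshold value of 0.8 used here are assumptions for illustration, as are the function and variable names.

```python
def select_output(paths, threshold=0.8):
    """Decide between text output and hypothesis output (sketch of step 0206).

    paths: list of (text, similarities) pairs, one per character string path
           that satisfies the notation knowledge.
    Returns ("text", text) when exactly one path exists and its reliability
    exceeds the threshold; otherwise ("hypothesis", paths).
    """
    if len(paths) == 1:
        text, similarities = paths[0]
        if similarities:
            # Assumed reliability measure: mean recognition similarity.
            reliability = sum(similarities) / len(similarities)
            if reliability > threshold:
                return ("text", text)
    # Ambiguous or unreliable: keep the hypotheses for later search.
    return ("hypothesis", paths)
```

A unique, confident path yields reading result text, while an ambiguous or low-reliability result keeps every candidate available to the search stage.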

Fig. 3 is described next. Fig. 3 is a flowchart of document processing that uses the document image and the OCR additional data. The data and processing from 0301 to 0307 in Fig. 3 may also be handled on the OCR side. In that case, either the document image or pseudo-colored document image holding the OCR additional data (consisting of the document structure data, the reading result text, and the reading hypothesis data), or a database of the OCR additional data and the document image or pseudo-colored document image, is transferred from the OCR side to the document processing unit shown at 0308. First, taking the document image and its corresponding OCR additional data (0301) as input, the data is read from file (0302). If necessary, the document image is pseudo-colored to make later display of the document image more convenient (0303). The pseudo-color processing is described in detail later. Two forms of handling the document image and OCR additional data can be considered: storing the document image, reading result text, reading hypothesis data, and document structure data separately in a database, or embedding the OCR additional data in the document image file. In the former, database registration is performed (0304), and the document image and OCR additional data are registered in the database in association with each other (0305). In the latter, image information embedding is performed (0306), generating a document image file with additional information (0307). These steps correspond to the data registration 0209 in Fig. 2. After them, document processing is performed (0308).

Fig. 4 is described next. Fig. 4 shows an example of embedding the OCR additional data in a document image file. The figure assumes a tagged image file format such as TIFF. In a tagged image file, tag information is generally stored in a block at the start of the file, and the image data body is located at the position that the tag links to. The tag information contains, for each tag, the storage location of the corresponding data body and a tag ID number indicating the type of data recorded in that data body. Tag ID numbers are fixed in advance by the rules of the image file format, and checking the tag ID number reveals whether the data a tag designates is image data or data such as the author or the creation date and time. To add the OCR additional data, tag information for it is appended to the data block. This can be realized by extending the format with a tag ID for the OCR additional data and a pointer to the location where the OCR additional data is stored.
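As a minimal sketch of the tag mechanism, the following packs one 12-byte TIFF directory entry (tag ID, field type, value count, value offset) that points at an OCR additional data payload. The entry layout and the UNDEFINED field type (7) follow the TIFF 6.0 specification; the tag number 65000, the function names, and the payload format are assumptions, since the patent does not state which tag ID is used.

```python
import struct

OCR_TAG_ID = 65000          # hypothetical private tag number (assumption)
TIFF_TYPE_UNDEFINED = 7     # TIFF 6.0 field type for opaque 8-bit bytes


def pack_ifd_entry(tag_id, field_type, count, value_offset, little_endian=True):
    """Pack one 12-byte TIFF IFD entry: tag, type, count, value offset."""
    byte_order = "<" if little_endian else ">"
    return struct.pack(byte_order + "HHII", tag_id, field_type, count, value_offset)


def ocr_tag_entry(payload, payload_offset):
    """Build the entry that points at the OCR additional data.

    payload_offset is the byte position in the file where the payload is
    written. (TIFF stores values of 4 bytes or fewer inline instead of at
    an offset; this sketch assumes a larger payload.)
    """
    return pack_ifd_entry(OCR_TAG_ID, TIFF_TYPE_UNDEFINED, len(payload), payload_offset)
```

A reader that does not know the private tag simply skips it, so a file extended this way can still be opened by ordinary image viewers.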

Fig. 5 is an example of a document image to be processed. Fig. 6 shows the result of document structure analysis and line extraction on the document image of Fig. 5. In Fig. 6(a), the ruled line information, frame information, and character line information obtained by document structure analysis are drawn as thick lines or bounding rectangles. 0601 denotes the injury and disease name field, 0602 the diagnosis date field, 0603 the description field, 0604 the treatment days field, and 0605 the entry time field. Each area enclosed by a thick rectangle is a region that document structure analysis identified as an analysis target field. Analysis target fields are the fields that matter in document processing, and they are specified in advance in the document structure dictionary. The thin rectangles inside the thick frames are regions extracted as character lines. Whether character lines are extracted from a frame (as in 0601 and 0603) or not (as in 0602 and 0604) depends on whether the analysis target field is a reading target. Whether a field is a reading target is also registered in advance in the document structure dictionary. Character line extraction is easy in printed documents, but difficult for handwriting or where printed and handwritten characters are mixed. In such cases, as shown in Fig. 6(b), the extraction preserves the ambiguity of the character lines. That is, multiple hypotheses about which blocks form a character line are set up as the extraction result, and a character pattern candidate is not restricted to belonging to only one character line. In addition, the extraction result that assumes printed characters sometimes differs from the one that assumes handwritten lines; in that case too, multiple character line hypotheses are output. This allows both printed and handwritten document images to be processed. 0607 is a region extracted as a printed character line, and 0608 is a region extracted as an ambiguous handwritten character line. The document structure analysis described above uses the document structure dictionary, which contains the ruled line coordinates and frame coordinates of the document images to be read and the attributes of the fields to be read (name entry field, address entry field, browse attribute information, and so on). As a result of the above processing, information such as the frame coordinates, the attribute of each field, the coordinates of the character lines in the field, the coordinates of the character pattern candidates in the field, and the browse attribute information of the field is obtained as the document structure data in the OCR additional data.

Fig. 7 is used to explain how a character string hypothesis is generated and how character strings are recognized using notation knowledge. Fig. 8 is a conceptual diagram of a character string hypothesis and shows the details of its data. From the character line to be read, 7(a), every part presumed to be a character pattern is segmented to generate character pattern candidates, and the result of running character recognition on each candidate is the character string hypothesis 7(b). At a minimum, a character string hypothesis holds the character pattern candidates, the ordered group of recognized character codes obtained from the character recognition results, and the connection relations among the character pattern candidates in the hypothesis. This way of expressing a character string hypothesis is called a graph-based representation. Character string notation knowledge 7(c) is then used to compute the character string path 7(d) from the character string hypothesis. A character string path is a uniquely determined character code string (text) together with the arrangement of character patterns corresponding to each character code. For example, the candidate strings contained in the notation knowledge dictionary are expressed as words listed with the OR symbol (|); the group of words delimited by "|" is designated as the search target. Besides this representation, methods such as a trie or a context-free grammar are also used to express notation knowledge (described in Japanese Unexamined Patent Publication No. 2001-014311 (Patent Document 6) and elsewhere). Fig. 8 shows a character string hypothesis in detail. The hypothesis is represented as a directed graph whose arcs are character pattern candidates (0801) and whose nodes are character pattern boundaries (0802). Each character pattern holds the ID numbers of the nodes (character pattern boundary candidates) at its left and right boundaries (top and bottom for vertical writing), together with its character recognition candidates (0803) and recognition similarities (0804). Knowledge processing takes the character string hypothesis and the notation knowledge as input and finds the words contained in the hypothesis along with their pattern sequences. For example, the word "blood chemistry test" in the notation knowledge can be found by locating the circled character codes and character pattern candidates (0805) in the character string hypothesis of Fig. 8(b). When the content that can be written in the field is determined in advance, this processing determines the character code string. That is, the above processing yields the character string text (character code string) of the OCR reading result in Fig. 2, or the result of word search in the processing of Fig. 3.
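The graph search described above can be sketched as follows. The tuple layout of an arc, the function names, and the example similarities are assumptions for illustration; the example reuses the "ハル" string from the background section, where the second character may have been segmented as either "ル" or "ノ" plus "レ".

```python
from collections import defaultdict


def find_word(arcs, word):
    """Find every arc sequence in the hypothesis graph that spells `word`.

    arcs: list of (left_node, right_node, candidates), where candidates maps
          a recognized character to its similarity. Each arc is one
          character pattern candidate; nodes are pattern boundaries.
    Returns a list of paths, each a list of arc indices.
    """
    if not word:
        return []
    arcs_from = defaultdict(list)
    for index, (left, _right, _cands) in enumerate(arcs):
        arcs_from[left].append(index)

    def extend(node, position, path):
        if position == len(word):
            yield list(path)
            return
        for index in arcs_from[node]:
            _left, right, cands = arcs[index]
            if word[position] in cands:
                path.append(index)
                yield from extend(right, position + 1, path)
                path.pop()

    results = []
    for start in list(arcs_from):   # a word may start at any boundary
        results.extend(extend(start, 0, []))
    return results


def search_notation_knowledge(arcs, knowledge):
    """Search for each word of an OR-delimited word list such as 'ハル|ヘル'."""
    return {word: find_word(arcs, word) for word in knowledge.split("|")}


# Hypothesis for a line reading "ハル": the second character is ambiguous
# between one pattern ("ル") and two patterns ("ノ" + "レ").
example_arcs = [
    (0, 1, {"ハ": 0.9, "ヘ": 0.6}),
    (1, 2, {"ノ": 0.7}),
    (2, 3, {"レ": 0.8}),
    (1, 3, {"ル": 0.85}),
]
```

Because both segmentations remain in the graph, `find_word(example_arcs, "ハル")` still finds the word even if the top-ranked reading of the line was "ハノレ", which is the case that per-character candidate lists alone cannot recover.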

Fig. 9, Fig. 10, Fig. 11, Fig. 12 and Fig. 14 show examples of browsing functions that use the OCR additional data obtained by the above processing, together with the document image or a pseudo-color image, when documents are browsed. When the OCR additional data is stored in a database separate from the document image file, the browsing function is realized by using the document ID to access the OCR additional data in the database that corresponds to the document image file. When the OCR additional data is stored inside the document image file, the browsing function is realized, as shown in Fig. 4, by referring to the OCR additional data stored in the area designated by a tag in the document image file.

Fig. 9 is described next. Fig. 9 shows an example screen layout of a document processing and browsing system that uses the method proposed in this patent application. A prescription document browsing system is taken as the example. First, a paper prescription is read by OCR, and a document image and OCR additional data are output. In this system, the display can be switched between the whole document image and a part of it. For partial display, the coordinate data of the field is obtained from the document structure data in the OCR additional data, and only that partial region is displayed. 0901 is a block that displays one document image. The title of the displayed document image is shown in 0902, the disease name field of the prescription in 0903, and the description field of the prescription in 0909. Document inspection generally does not require the whole document image to be displayed; by limiting the display to the regions needed for inspection and displaying several documents side by side, inspection efficiency can be improved. In addition, the document structure data in the OCR additional data can be used to rearrange the document layout so that it suits the small screen of a portable information terminal such as a PDA. For example, a document laid out in two columns can be split into blocks that are then arranged vertically, so that the document can be browsed using only the vertical scroll bar. Alternatively, to support document handling work, a function can be realized that displays help or business know-how corresponding to a field when that field is clicked with the mouse.
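Partial display and vertical rearrangement can be sketched as below. This is a minimal sketch under stated assumptions: the image is modeled as a list of pixel rows, the document structure data as a hypothetical mapping from field name to a rectangle `(left, top, right, bottom)`, and the names `crop_field` and `stack_fields` are introduced here for illustration.

```python
def crop_field(image, structure, field_name):
    """Return the sub-image of the named field.

    image: list of pixel rows.
    structure: field name -> (left, top, right, bottom), right and bottom exclusive.
    """
    left, top, right, bottom = structure[field_name]
    return [row[left:right] for row in image[top:bottom]]


def stack_fields(image, structure, field_names):
    """Arrange the named fields one below another, e.g. for a narrow screen."""
    parts = [crop_field(image, structure, name) for name in field_names]
    width = max(len(part[0]) for part in parts)
    stacked = []
    for part in parts:
        for row in part:
            # Pad narrower fields with background pixels (value 0).
            stacked.append(row + [0] * (width - len(row)))
    return stacked


# Hypothetical 4-row page with two fields at different positions.
PAGE = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1],
]
STRUCTURE = {"disease_name": (0, 0, 2, 2), "description": (3, 2, 6, 4)}
```

`crop_field` corresponds to showing a single field, while `stack_fields` corresponds to splitting a multi-column layout and aligning the pieces vertically so that only vertical scrolling is needed.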

Fig. 10 and Fig. 11 are described next. Fig. 10 shows an example screen layout of an important-keyword browsing system that uses the method proposed in this patent application. In 1001, the list of important keywords to be extracted is specified. In 1002, the extracted keywords are marked with underlines. Fig. 11 is an example screen layout of a simplified document image inspection system that applies inspection rules using the important-keyword extraction function described above. First, the inspection rule to be used is specified in the input field 1101. In this figure, an inspection rule is defined as a logical operation on search keywords. Then the keywords are searched for, and the logical operation is applied, on the basis of the reading result text or the reading hypothesis data in the OCR additional data. Algorithms for keyword search include the finite-state automaton method, top-down parsing, bottom-up parsing and dynamic programming (described in Japanese Patent No. 2886868 (Patent Document 7), Japanese Patent Application No. Hei 09-238032 (Patent Document 8) and elsewhere). The document names obtained from the search result are shown in the display field 1103. The documents that satisfy the inspection rule are displayed in the display field 1104. Because the OCR additional data carries a document ID code that corresponds uniquely to the original paper document or document image, the document image and the search result can be displayed together. In addition, because coordinate information is included in the keyword information, the position of each retrieved keyword is indicated by an underline, as shown in 1105. Here, the document images that satisfy the inspection rule "specified disease laboratory fee AND specified disease prescription management total" are displayed. For handwritten characters that are difficult to read with ordinary OCR, the OCR additional data holds reading hypothesis data that preserves the ambiguity of character segmentation and character recognition, so search and inspection can be performed regardless of whether a document is printed or handwritten. Furthermore, when the OCR device and the document processing are operated separately, using the reading hypothesis data in the OCR additional data makes it possible to search for any keyword at any time without correcting the character recognition results from the OCR device.
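An inspection rule expressed as a logical operation on search keywords can be sketched as follows. The rule encoding (a keyword string, or a tuple of an operator and its operands), the function names `evaluate` and `inspect`, the document IDs and the coordinates are all assumptions made for illustration. The keyword hits are taken as already extracted from the OCR additional data.

```python
def evaluate(rule, keywords_found):
    """Evaluate a rule against the keywords found in one document.

    rule: a keyword string, or a tuple (operator, operand, ...) where the
    operator is "AND", "OR" or "NOT" and each operand is itself a rule.
    """
    if isinstance(rule, str):
        return rule in keywords_found
    op, *operands = rule
    results = [evaluate(operand, keywords_found) for operand in operands]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    if op == "NOT":
        return not results[0]
    raise ValueError(f"unknown operator: {op}")


def inspect(rule, documents):
    """Return {document ID: keyword hits} for the documents that satisfy the rule.

    documents: document ID -> {keyword: (x, y, width, height)}; the rectangle
    is what a viewer would underline.
    """
    return {
        doc_id: hits
        for doc_id, hits in documents.items()
        if evaluate(rule, hits.keys())
    }


RULE = (
    "AND",
    "specified disease laboratory fee",
    "specified disease prescription management total",
)
# Hypothetical keyword hits for two documents.
DOCUMENTS = {
    "doc-001": {
        "specified disease laboratory fee": (40, 120, 180, 16),
        "specified disease prescription management total": (40, 150, 260, 16),
    },
    "doc-002": {
        "specified disease laboratory fee": (38, 118, 180, 16),
    },
}
```

With this data, `inspect(RULE, DOCUMENTS)` keeps only "doc-001", and the stored rectangles give the positions at which the matching keywords would be underlined.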

Fig. 12 is described next. Fig. 12 shows an example of the display attribute function for confidential items that uses the method proposed in this patent application. Fig. 12(a) shows the region to be hidden, obtained as a result of document structure analysis, and the result of extracting the character lines in that region. Here, the character line containing the name is the item to be hidden. Fig. 12(b) shows the result of covering the region to be hidden with a black box. In this way, data can be hidden or disclosed according to what each viewer needs. Similarly, Fig. 12(c) shows the result of painting the region to be hidden with the background color (white). Compared with a black box, painting with the background color keeps the viewer from noticing that hidden data exists there, which improves the confidentiality of the data. Several methods can be considered for this latter kind of painting. They are described below with reference to Fig. 13.

Fig. 13 is described next. Fig. 13 is a conceptual diagram of pseudo-color processing applied to a document image. Each pixel has a value (color value) that represents a color. In a black-and-white image, for example, the value is 0 or 1. Which color the value 0 represents is determined by referring to a table called an RGB color map. In the RGB color map of Fig. 13(b), 0 represents white and 1 represents black. Pseudo-color processing assigns a different color value to the black pixels of the target character line in the target region (the color of the object to be hidden is not necessarily limited to black). In Fig. 13(c), color value 2 is assigned to the pixels of the character line in the name field of the document image. If color value 2 is defined as white (the background color) in the RGB color map, the name "Hitachi Taro" is displayed in white on the screen. That is, it appears as if it had been painted over in white. Internally, however, the image data of the name part is not removed: the set of pixels with color value 2 corresponds to the image that makes up the name part. When pseudo-coloring is performed by the OCR device, the original document image is modified by the pseudo-coloring, and the color value and attributes of the pseudo-colored information are stored as browse attribute information in the document structure data within the OCR additional data and then output.
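The pseudo-color processing described above can be sketched as follows, assuming an indexed image held as a list of rows and a color map held as a dictionary. The function names `pseudo_color` and `render` and the sample image are illustrative assumptions, not part of the source.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)


def pseudo_color(image, region, new_value, old_value=1):
    """Give foreground pixels inside `region` a new color value, in place.

    region: (left, top, right, bottom), right and bottom exclusive.
    """
    left, top, right, bottom = region
    for y in range(top, bottom):
        for x in range(left, right):
            if image[y][x] == old_value:
                image[y][x] = new_value


def render(image, color_map):
    """Translate color values into RGB triples through the color map."""
    return [[color_map[value] for value in row] for row in image]


# Hypothetical 2-row image; the name field covers columns 0 to 2.
IMAGE = [
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
pseudo_color(IMAGE, (0, 0, 3, 2), new_value=2)

# Color value 2 is mapped to the background color, so the field looks blank.
COLOR_MAP = {0: WHITE, 1: BLACK, 2: WHITE}
```

The pixels of the name field keep their own color value 2, so mapping that value back to black in the color map restores the field without touching the pixel data.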

Using the frame position information and frame attribute information obtained from the OCR additional data, the region to be hidden can be identified. Various methods can be considered for the actual hiding. When a field is determined to be a hiding target, the character lines in it are extracted and the circumscribed rectangle of each character line is obtained. The inside of that rectangle can then be painted black; or the black (foreground) pixels inside the rectangle can be pseudo-colored with the pseudo-color set to white (the background color), so that the region looks as if it had been painted white; or the black (foreground) pixels inside the rectangle can be pseudo-colored with the pseudo-color set to black (the foreground color), and the inside of the rectangle painted black. When hidden information is to be displayed, the pseudo-color value and its disclosure condition are read from the browse attribute data contained in the OCR additional data, and if the viewer satisfies the disclosure condition, the information can be shown by changing the pseudo-color to the foreground color or to another color that stands out against the background color.
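Disclosure according to the browse attribute data can be sketched as a per-viewer color map. The source does not specify how a disclosure condition is encoded, so the role-based condition, the attribute layout (`color_value`, `allowed_roles`) and the function name `color_map_for` are assumptions made for this sketch.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)


def color_map_for(viewer_roles, base_map, browse_attributes, reveal_color=BLACK):
    """Return a color map for one viewer without modifying `base_map`.

    browse_attributes: list of {"color_value": int, "allowed_roles": set};
    a pseudo-color is revealed when the viewer holds at least one allowed role.
    """
    color_map = dict(base_map)
    for attribute in browse_attributes:
        if viewer_roles & attribute["allowed_roles"]:
            color_map[attribute["color_value"]] = reveal_color
    return color_map


# Base map: pseudo-color value 2 is hidden by mapping it to the background.
BASE_MAP = {0: WHITE, 1: BLACK, 2: WHITE}
# Hypothetical browse attribute: value 2 may be disclosed to physicians.
BROWSE_ATTRIBUTES = [{"color_value": 2, "allowed_roles": {"physician"}}]
```

A viewer with the role "physician" receives a map in which value 2 is black, while any other viewer receives the base map unchanged. The image data itself is the same for both.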

A feature of information hiding by pseudo-coloring is that the document image remains readable in a general-purpose browser (viewer) and that information is hidden without destroying the original image information of the hidden part. In general, one way to hide information in a document image is to use a special format such as PDF together with a special browser, so that the document cannot be opened without password verification or the like, or so that partially blacked-out portions cannot be seen. Another way is to use a general-purpose format in which the hidden information can be seen only with a special browser. Pseudo-color processing is mainly applicable to the latter. This approach keeps system cost down because a general-purpose browser can be used and, in particular, has the advantage that the data is not actually removed from the image but only eliminated visually. To improve confidentiality further, measures such as encrypting the image itself are available. Because this can be achieved by combining the method with general-purpose tools, the above advantages are not impaired.

Fig. 14 is described next. Fig. 14 is an example screen layout for highlighting a region of interest using the method proposed in this patent application. Fig. 14(a) is the result of document structure analysis, in which the disease name field in 1401 and the description field in 1402 have been extracted. When attention is to be paid to these two fields only, the frames could be extracted and displayed as in Fig. 9. Here, instead, the frames are highlighted and the tone of their surroundings is reduced, so that highlighting is achieved without destroying the structure of the actual document image (Fig. 14(b)). The pseudo-coloring described above can also be used in this processing. That is, pseudo-color value 2 is assigned to the pixels of the character lines inside the disease name field and the description field. Before highlighting, the color of color value 2 is set to black in advance. When highlighting is requested, the color of color value 1, which belongs to the black pixels outside these regions, is set to gray. Methods of contrast processing include scanning the image each time and changing its colors, and taking a logical operation between the original image and a mask image. Compared with these, the present processing has the advantage of high processing speed: provided that pseudo-coloring has been performed beforehand, a request from the viewer for contrast highlighting or the like can be met merely by changing values in the RGB color map. Fig. 14(c) shows the case where the same processing is switched as the image viewer's work proceeds. For example, by using the OCR additional data and pseudo-color processing, an inspection method can be realized in which the disease name field in 1405 is examined intensively at the start of the work and the description field in 1406 is examined at the subsequent stage.
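Highlighting by changing only the color map can be sketched as follows. Unlike the example in the text, which assigns color value 2 to both fields, this sketch assumes each field has its own pseudo-color value so that the focus can move from one field to the next as the work proceeds. The `highlight` function and the field names are illustrative assumptions.

```python
BLACK, GRAY, WHITE = (0, 0, 0), (128, 128, 128), (255, 255, 255)


def highlight(field_values, focus):
    """Build a color map in which only the fields named in `focus` stay black.

    field_values: field name -> pseudo-color value assigned beforehand.
    Value 0 is the background; value 1 is every other foreground pixel,
    which is dimmed to gray.
    """
    color_map = {0: WHITE, 1: GRAY}
    for name, value in field_values.items():
        color_map[value] = BLACK if name in focus else GRAY
    return color_map


# Hypothetical assignment of pseudo-color values to the two fields.
FIELDS = {"disease_name": 2, "description": 3}

first_stage = highlight(FIELDS, {"disease_name"})   # examine the disease name
second_stage = highlight(FIELDS, {"description"})   # then the description
```

Switching from `first_stage` to `second_stage` replaces a handful of color map entries and never rescans or rewrites the pixels, which is the source of the speed advantage the text describes.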

Fig. 15 is described next. Fig. 15 is a configuration example in which a document processing system using the method proposed in this patent is built with the OCR device and the document image processing device separated. The upper half of Fig. 15 shows a configuration example of the OCR device, and the lower half shows a configuration example of the document image processing device.

In the OCR device in the upper half, a document is first converted into electronic data (a document image) by the image input device (1501), stored in the external storage device (1504) and the memory (1505), and then read by the central processing unit (1506). The document structure dictionary, the character recognition dictionary, the character-string notation knowledge dictionary and the like in Fig. 2 are stored in the external storage device (1504), and the definitions stored there are referred to during document structure analysis and related processing. These processes can be operated by a person through the operation terminal device (1502), and the processing results and the like are displayed on the display terminal device (1503), stored in the external storage device, or transmitted to an external device through the communication device (1507). The OCR reading result can be output as text, as in conventional devices, but it can also be output as OCR additional data. The OCR additional data, which includes the reading hypothesis data, the reading result text and the document structure data, is embedded in the document image file, stored in the external storage device in association with the document image file, or transmitted to an external device through the communication device. At this time, a document ID code corresponding to the document (or image) read by OCR is assigned to the OCR additional data. By means of this document ID code, the paper document or document image can be associated with the OCR additional data.

The document image processing device in the lower half of Fig. 15 searches and browses documents using the OCR additional data output from the OCR device described above. Once OCR additional data has been generated for a document, the device can search and browse that document any number of times (as long as the OCR additional data exists). The document image processing device reads the OCR additional data from the communication device (1515) or the external storage device (1512), loads it into the memory (1513), and then performs search and browsing processing with the central processing unit (1514). The words to be searched for and the document search rules can be stored in the external storage device or entered from the operation terminal device (1510). The word search results can be displayed on the display terminal device (1511), transmitted to external equipment through the communication device, or stored in the external storage device. These devices are connected by communication buses (1507, 1508, 1509, 1015, 1516).

Claims

1. An OCR device that optically reads a paper document and then performs character recognition processing on the generated document image data, characterized in that the OCR device comprises:

a storage device for storing a document structure dictionary used in document structure analysis and a character recognition dictionary used in character recognition;

an image input unit for inputting the document image data; and

an arithmetic unit,

wherein the arithmetic unit uses the document structure dictionary to analyze the frame structure of the document image data and to designate reading target frames, and generates document structure data; uses the character recognition dictionary to perform character recognition processing on the designated reading target frames, and generates reading result text or reading hypothesis data; and outputs, in association with the document image data, OCR additional data that includes at least one of the document structure data and the reading hypothesis data,

and the reading hypothesis data includes at least the character segmentation pattern candidates generated during the character recognition processing and the recognition results of those character segmentation pattern candidates.

2. The OCR device according to claim 1, characterized in that:

the OCR additional data is recorded in the same file as the document image data.

3. The OCR device according to claim 2, characterized in that:

the file is an image file in a tagged format that comprises a plurality of data blocks and tags corresponding respectively to those data blocks,

and at least one of the data blocks stores the OCR additional data and has a tag containing information indicating that the data stored in that data block is OCR additional data.

4. The OCR device according to claim 1, characterized in that:

the arithmetic unit performs pseudo-color processing in which it specifies, according to the document structure data, a part of the document image data that needs to be hidden, changes the color value of each pixel of the document image data in that part to another color value, and generates the correspondence between that other color value and the display color used when it is displayed,

updates the document image data so that it includes that other color value,

and outputs, in association with the document image data, a color map table containing the correspondence between the display color and that other color value, and browse attribute information that has at least the pseudo-color value and a browse permission condition.

5. The OCR device according to claim 1, characterized in that:

the reading hypothesis data includes a character string hypothesis generated in the character recognition processing, and the character string hypothesis expresses, in graph form, information on the character segmentation pattern candidates and the character recognition results of those candidates.

6. A document processing device that performs document processing using, as input information, the result of document reading processing performed in an OCR device, characterized by comprising an input unit that accepts the input document reading result, a display unit that performs display relating to the document reading result, a user input unit that accepts user input, and an arithmetic unit,

wherein the document reading result includes document image data generated by optically reading a paper document and OCR additional data, the OCR additional data including at least one of document structure data that contains the frame structure of the document image data and reading hypothesis data obtained by performing character recognition processing on the frames designated as reading targets among the frames in the document image data,

and the arithmetic unit, in accordance with an instruction input from the user input unit, uses the OCR additional data to selectively display on the display unit the information included in the document reading result.

7. The document processing device according to claim 6, characterized in that:

the document reading result includes the reading hypothesis data in the OCR additional data,

the reading hypothesis data expresses, in graph form, information on character segmentation pattern candidates and the character recognition results of those candidates,

and the arithmetic unit uses a search keyword input through the user input unit to search the reading hypothesis data expressed in graph form and, according to the search result, displays on the display unit the document image data included in the reading result.

8. The document processing device according to claim 6, characterized in that:

the document reading result includes the document structure data in the OCR additional data,

the document structure data has display target frame information that indicates which of the frames included in the document image data are display target frames,

and the arithmetic unit selectively displays the display target frames included in the document image data according to the display target frame information.

9. The document processing device according to claim 6, characterized in that:

the arithmetic unit highlights the display target frames included in the document image data according to the display target frame information.

10. The document processing device according to claim 6, characterized in that:

pseudo-color processing has been performed on a partial region of the document image data,

the OCR additional data includes a color map table that contains the correspondence between the color value of each pixel in the region subjected to the pseudo-color processing and a display color,

and the arithmetic unit, according to the browsing state designated by the user, determines the display color of the region subjected to the pseudo-color processing by referring to the color map table, and the display unit displays the document image data using the determined display color.

11. A document processing method in a document processing system, the document processing system comprising an OCR input device for inputting document image data generated by optically reading a paper document; a storage device for storing a document structure dictionary used in document structure analysis and a character recognition dictionary used in character recognition; an arithmetic unit for performing computation that includes OCR processing comprising the document structure analysis and the character recognition; a document reading result storage unit for recording the result of the OCR processing; and a display unit, the document processing method being characterized by:

analyzing the frame structure of the document image data using the document structure dictionary,

performing, according to the information on the analyzed frame structure, character recognition processing on the document image data using the character recognition dictionary, and generating reading result text or reading hypothesis data,

and storing OCR additional data, which includes at least one of the document structure data and the reading hypothesis data, in the document reading result storage unit in association with the document image data,

wherein the reading hypothesis data includes at least the character segmentation pattern candidates generated during the character recognition processing and the recognition results of those character segmentation pattern candidates.

12. The document processing method according to claim 11, characterized in that:

13. The OCR device according to claim 12, characterized in that:

14. The document processing method according to claim 11, characterized by:

performing pseudo-color processing that comprises specifying, according to the document structure data, a part of the document image data that needs to be hidden, changing the color value of each pixel of the document image data in that part to another color value, and generating the correspondence between that other color value and the display color used when it is displayed,

updating the document image data so that it includes that other color value,

and storing, in association with the document image data, a color map table containing the correspondence between the display color and that other color value, and browse attribute information that has at least the pseudo-color value and a browse permission condition.

15. The document processing method according to claim 14, characterized in that:

the OCR additional data includes a color map table and the browse attribute information, the color map table containing the correspondence between the color value of each pixel in the region subjected to the pseudo-color processing and a display color; and the arithmetic unit uses the browse attribute information to determine the browsing state permitted to the viewer for that region, determines the display color of the region subjected to the pseudo-color processing by referring to the color map table, and displays the document image data using the determined display color.

16. The document processing method according to claim 11, characterized in that:

the reading hypothesis data includes a character string hypothesis generated in the character recognition processing, and the character string hypothesis expresses, in graph form, information on the character pattern candidates and the character recognition results of those candidates.

17. The document processing method according to claim 11, characterized in that:

the document processing system includes a user input unit that accepts user input,

the OCR additional data includes the reading hypothesis data, the reading hypothesis data is searched using a search keyword input through the user input unit, and the document image data whose reading hypothesis data matches the search keyword is output as the search result.

18. The document processing method according to claim 11, characterized in that:

the document reading result is included in the OCR additional data,

19. The document processing method according to claim 11, characterized in that:

the document structure data has display target frame information that indicates which of the frames included in the document image data are display target frames, and the display target frames included in the document image data are highlighted according to the display target frame information.