CN100351849C - Character recognition apparatus and character recognition method - Google Patents

Character recognition apparatus and character recognition method

Publication number
CN100351849C
Authority
CN
China
Prior art keywords
character
document
field
dictionary database
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100551946A
Other languages
Chinese (zh)
Other versions
CN1741034A (en)
Inventor
榊原正义
中村浩太郎
馆野昌一
田中圭
斋藤照花
小山俊哉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujifilm Business Innovation Corp
Original Assignee
Fuji Xerox Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Xerox Co Ltd
Publication of CN1741034A
Application granted
Publication of CN100351849C
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The present invention provides a character recognition apparatus including: plural dictionary databases that contain terms or characters classified into respective fields; a determination unit that determines which field the contents of a document shown by document image data belong to; a selection unit that selects a dictionary database pertaining to the field determined by the determination unit from among the plural dictionary databases; a recognition unit that recognizes a term or a character written in the document shown by the document image data by using the terms or characters stored in the selected dictionary database as candidates; and an output unit that outputs the result of recognition by the recognition unit.

Description

Character recognition apparatus and character recognition method
Technical field
The present invention relates to a technique for recognizing characters read from a document.
Background art
In the character recognition technology known as OCR (optical character reader), a large number of candidate characters or terms are registered in a dictionary database in advance. Characters (terms) read optically from a document are compared with the characters (terms) registered in the dictionary database in order to recognize the characters (terms) in the document. Recognition accuracy therefore depends to a large extent on whether the dictionary database contains suitable characters or terms.
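The dependence of accuracy on the dictionary contents can be illustrated with a minimal sketch. It assumes that string similarity (Python's `difflib`) stands in for the optical comparison, which the text does not specify; the function name `snap_to_dictionary`, the cutoff value and the sample terms are hypothetical.

```python
import difflib

def snap_to_dictionary(raw_text, dictionary_terms, cutoff=0.6):
    """Return the registered term closest to a raw reading, or None.

    Assumption: string similarity replaces the optical comparison,
    and the cutoff of 0.6 is an arbitrary illustrative value.
    """
    matches = difflib.get_close_matches(raw_text, dictionary_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical dictionary contents
terms = ["histogram", "aperture", "parliament"]

print(snap_to_dictionary("hist0gram", terms))  # a misread zero is corrected
print(snap_to_dictionary("zzzz", terms))       # nothing suitable is registered
```

A reading can only be corrected toward a term that is actually registered, which is why the choice of dictionary matters.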
One known technique provides dictionary databases prepared in advance for multiple languages, such as Japanese and English. Words made up of several characters obtained in the course of document recognition are recognized, and one of these dictionary databases is selected. If the ratio of recognized words that are registered in the selected dictionary (the correlation ratio) is at or above a predetermined value, recognition continues with that dictionary. If the ratio falls below the predetermined value, the same processing is repeated with another dictionary database. However, this technique requires that characters and words be recognized reasonably accurately in the stage before the dictionary is consulted. Moreover, it is intended for language selection and therefore does nothing to improve the recognition accuracy of, for example, Japanese documents themselves.
In another known technique, a series of optically read character strings is split into units of several characters to extract term candidates. It is then determined whether the linkage between the characters in each term candidate matches one of the term candidates registered in a dictionary database. If there is no match, term candidates are extracted in a different way. However, this technique requires that the linkages of all characters making up term candidates be prepared in advance, so the database becomes very large. In addition, searching all linkages character by character makes the processing considerably more complex and requires a large amount of processing time.
Summary of the invention
The present invention has been made in view of the above circumstances and provides a new mechanism for recognizing characters written in a document with higher accuracy.
To address the above problems, the invention provides a character recognition apparatus including: plural dictionary databases that contain terms or characters classified into respective fields; a determination unit that determines which field the contents of a document shown by document image data belong to; a selection unit that selects, from among the plural dictionary databases, a dictionary database pertaining to the field determined by the determination unit; a recognition unit that recognizes a term or a character written in the document shown by the document image data by using the terms or characters stored in the selected dictionary database as candidates; and an output unit that outputs the result of recognition by the recognition unit. With this apparatus, the field to which the document contents belong is determined first, and a field-specific term dictionary database suited to that field is then selected and used for character recognition. An improvement in recognition accuracy can thus be expected.
Brief description of the drawings
Embodiments of the present invention will be described in detail with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram showing the configuration of a character recognition apparatus according to a first embodiment;
Fig. 2 is a flowchart showing the operation of the character recognition apparatus;
Fig. 3 is a flowchart showing the operation of the character recognition apparatus;
Fig. 4 is a block diagram showing the configuration of a character recognition apparatus according to a second embodiment;
Fig. 5 (a) to (e) conceptually illustrate the contents to be stored in a section format database;
Fig. 6 is a flowchart showing the operation of the character recognition apparatus; and
Fig. 7 is a flowchart showing the operation of the character recognition apparatus.
Detailed description of the embodiments
Embodiments of the present invention are described below.
(1) First embodiment
Fig. 1 is a block diagram showing the configuration of a character recognition apparatus 10 according to the first embodiment. The character recognition apparatus 10 may be implemented by a computer embedded in a scanner, a multifunction machine (hybrid machine) or the like, or by a computer serving as a host device connected to a scanner or multifunction machine. In the first embodiment, plural field-specific term dictionary databases containing terms or characters classified into respective fields are prepared, and the field to which the contents of a document belong is determined. The field-specific term dictionary database pertaining to the determined field is then selected from among them, and character recognition is performed using the terms or characters stored in that database as candidates. As an example, Fig. 1 shows field-specific term dictionary databases 11a, 11b and 11c. Database 11a contains terms or characters that occur frequently in the field of image processing, database 11b contains those that occur frequently in the field of photography, and database 11c contains those that occur frequently in the field of politics. Suitable field-specific term dictionary databases may also be prepared for various other fields, such as IT, computers, law, personal names, place names and company names.
A format database 12 stores format information describing document formats in association with the names of the fields to which the contents of such documents belong. More specifically, the format information includes a format identifier assigned to each differently formatted document (such as an order form or an application form) and information describing the features of each format (the shape and structure of the form itself). The character recognition apparatus 10 determines which field the contents of a document belong to on the basis of the contents stored in the format database 12 and the contents of the document image data.
A storage-area-specific document attribute storage unit 13 stores the correspondence between storage areas designated as the storage destination of document image data when that data is generated and the names of the corresponding fields. In multifunction machines and similar devices in common use today, an image read by a scanner can be stored in a storage area corresponding to a number designated from a menu called a "mailbox". A storage area designated through this "mailbox" is an instance of the "storage area designated as the storage destination of document image data when that data is generated" mentioned above. The number designated in the "mailbox" usually differs, for example, for each organizational unit in a company (department, section) or for each user. The storage area assigned to a given number therefore usually holds document image data from similar fields. For example, the documents stored in a mailbox used by the image processing development department of a company usually relate to image processing. Accordingly, each storage area in the mailbox is stored in the storage-area-specific document attribute storage unit 13 in association with the field of the user or organization that regularly uses that storage area. This allows the character recognition apparatus 10 to determine which field the document contents belong to simply by referring to the number designated for the mailbox.
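The correspondence held in unit 13 amounts to a lookup from mailbox number to field name. The following is a minimal sketch under that reading; the mailbox numbers, the field labels and the function name `field_for_mailbox` are hypothetical and do not come from the patent.

```python
from typing import Optional

# Hypothetical contents of the attribute store (unit 13):
# mailbox number -> field of the user or organization that regularly uses it.
MAILBOX_FIELDS = {
    101: "image_processing",
    102: "photography",
    203: "politics",
}

def field_for_mailbox(mailbox_number: int) -> Optional[str]:
    """Return the field associated with a mailbox, or None if none is registered."""
    return MAILBOX_FIELDS.get(mailbox_number)

print(field_for_mailbox(101))
print(field_for_mailbox(999))
```

Returning `None` for an unregistered mailbox leaves room for the other determination methods to take over.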
A standard character feature storage unit 14 stores feature quantities of the standard glyph (character pattern) of each individual character. The character recognition apparatus 10 compares the feature quantities stored in the standard character feature storage unit 14 with the feature quantities of glyphs read optically from a document, and recognizes characters according to the degree of matching between them.
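The patent does not say how feature quantities are defined or how the degree of matching is measured. As a rough sketch only, assume each glyph is summarized by a short numeric vector and that a smaller Euclidean distance means a better match; the vectors and the name `rank_characters` are hypothetical.

```python
import math

# Hypothetical standard feature vectors (unit 14), one per character.
TEMPLATES = {
    "A": (0.9, 0.1, 0.4),
    "B": (0.2, 0.8, 0.5),
}

def rank_characters(observed, templates):
    """Rank characters from best to worst match.

    Assumption: Euclidean distance between feature vectors serves as
    the (inverse) degree of matching.
    """
    scored = [(math.dist(observed, vec), ch) for ch, vec in templates.items()]
    return [ch for _, ch in sorted(scored)]

print(rank_characters((0.85, 0.15, 0.4), TEMPLATES))
```

Keeping the full ranking rather than only the best match lets a later step choose among several plausible readings.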
Incidentally, some fields are closely related to one another and others are only loosely related. For example, the fields of image processing and photography are closely related, whereas image processing and politics, or photography and politics, have little to do with each other. A field relevance storage unit 15 stores information defining this degree of relevance between fields. Suppose, for example, that the maximum degree of relevance is expressed as "1". The information stored in the field relevance storage unit 15 then sets the relevance between image processing and photography to "0.8", and the relevance between image processing and politics and between photography and politics to "0.1" each.
A document reading unit 16 is, for example, an image scanner device. When character recognition processing starts, the document reading unit 16 irradiates the document with light to read the image on the document optically, and generates document image data. A document content determination unit 17 determines which field the contents of the document shown by the document image data belong to, using the several methods described later. A term dictionary selection unit 18 selects the field-specific term dictionary database of the field pertaining to the determined field. Here, the term dictionary selection unit 18 selects not only the field-specific term dictionary database of the field determined by the document content determination unit 17, but also those of fields defined in the field relevance storage unit 15 as having at least a certain degree of relevance to that field.
A character recognition unit 19 recognizes the characters in the document by referring to the feature quantities stored in the standard character feature storage unit 14, the feature quantities of the glyphs read optically from the document, and the selected field-specific term dictionary databases. An output unit 20 outputs the recognition result by a predetermined method, such as display on a panel.
Fig. 2 and 3 is process flow diagrams that the operation of character recognition device 10 is shown.
In Fig. 2, at first, document reading unit 16 utilizes the rayed document with the image on the optically read document, and generates document image data (step S11).From document reading unit 16 the document view data is offered document content determining unit 17.Document content determining unit 17 determines according to process flow diagram shown in Figure 3 the document belongs to which field (step S12).
In Fig. 3,17 references of document content determining unit are stored in the content in the memory block particular document attribute storage unit 13, and determine whether to exist any field that is associated with the zone that comprises described document image data (step S21).Here, if there is any field (is "Yes" at step S21 place) be associated, document content determining unit 17 is identified as field (step S27) under the document content to this field so.
On the other hand, the field that if there is no is associated (is "No" at step S21 place), document content determining unit 17 determines whether the represented image of document image data comprises any format identifier (step S22) so.For example, some format identifier writes on the document bight.Here, if detect any format identifier (being "Yes" at step S22 place) in image, document content determining unit 17 is discerned the field (step S27) corresponding to this format identifier with reference to the content that is stored in the form database 12 so.
On the other hand, if do not detect format identifier (being "No" at step S22 place), 17 pairs of forms by the represented document of document image data of document content determining unit (form and structure) are analyzed (step S23) so.Then, if can be according to analysis result and its field of content recognition (being "Yes" at step S24 place) that is stored in the form database 12, document content determining unit 17 identifies its field (step S27) so.
On the other hand, if can't be according to its field of format identification (is "No" at step S24 place), 17 pairs of a part of execution character identifications (step S25) of document content determining unit so by the represented document of document image data.Handle the character that obtains or term as search key by using via this identification, 17 pairs of all spectra particular term of document content determining unit dictionary database 11a, 11b and 11c search for (step S26).Comprise coupling or similar term or any field particular term dictionary database of character if find in this search, document content determining unit 17 identifies its field (step S27) so.
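The flow of Fig. 3 is a chain of fallbacks: each method is tried in turn, and the first one that yields a field ends the search. A minimal sketch of that control flow follows. The document is represented as a plain dict, and every lookup table, key name and function name here is a hypothetical stand-in for the corresponding unit or database.

```python
from typing import Callable, Optional, Sequence

def determine_field(checks: Sequence[Callable[[dict], Optional[str]]],
                    doc: dict) -> Optional[str]:
    """Try each determination method in order; the first hit wins (S27)."""
    for check in checks:
        field = check(doc)
        if field is not None:
            return field
    return None

# Hypothetical stand-ins for steps S21 to S26.
def by_storage_area(doc):        # S21: mailbox association (unit 13)
    return {7: "photography"}.get(doc.get("mailbox"))

def by_format_identifier(doc):   # S22: identifier printed on the form
    return {"ORD-01": "company_names"}.get(doc.get("format_id"))

def by_format_analysis(doc):     # S23/S24: shape and structure of the form
    return {"two_column_table": "law"}.get(doc.get("layout"))

def by_partial_recognition(doc): # S25/S26: search all field dictionaries
    dictionaries = {
        "image_processing": {"histogram", "pixel"},
        "politics": {"parliament"},
    }
    words = set(doc.get("sample_words", []))
    for field, terms in dictionaries.items():
        if words & terms:
            return field
    return None

CHECKS = [by_storage_area, by_format_identifier,
          by_format_analysis, by_partial_recognition]

print(determine_field(CHECKS, {"mailbox": 7, "sample_words": ["pixel"]}))
print(determine_field(CHECKS, {"sample_words": ["pixel"]}))
```

The ordering runs from the cheapest check to the most expensive one, so partial character recognition is performed only when the earlier checks give no answer.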
The character recognition at step S25 can be performed in several ways, as follows.
Some documents contain handwritten characters as well as printed (typed) characters. In such documents, printed characters are recognized with relatively high accuracy. The document content determination unit 17 therefore determines the field of the document from the result of recognizing the printed characters. Specifically, it divides the character regions of the document represented by the document image data into printed character regions, written in printed characters, and handwritten character regions, written in handwritten characters. It then performs character recognition on the printed characters in the printed character regions and, using the recognition result as a search key, searches all of the field-specific term dictionary databases 11a, 11b and 11c.
A user may also mark characteristic parts of a document with a pen or the like. For example, such parts are sometimes circled, underlined or check-marked with a line marker. The document content determination unit 17 analyzes the document image data and, if any marked position exists, preferentially recognizes the characters written there. Using the recognition result as a search key, it then searches all of the field-specific term dictionary databases 11a, 11b and 11c. In addition, characters written at the top of a document, or in a font size larger than that of other characters, usually form the title or heading of the document and are therefore generally well suited to determining which field its contents belong to. The document content determination unit 17 therefore analyzes the document image data and, if there are characters written at the top of the document or in a larger font size than other characters, preferentially recognizes those characters. Using the recognition result as a search key, it then searches all of the field-specific term dictionary databases 11a, 11b and 11c.
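The text presents these as separate ways of choosing which part of the document to recognize at step S25. One possible way to combine them, offered purely as an assumption, is to keep only printed regions and rank marked regions first, followed by regions near the top of the page or in a larger font. The `Region` fields, the `top_band` value and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Region:
    region_id: str
    printed: bool      # True for printed text, False for handwriting
    marked: bool       # circled, underlined or check-marked by the user
    y: float           # vertical position, 0.0 = top of the page
    font_size: float

def pick_regions(regions, typical_font, top_band=0.15, limit=2):
    """Choose the regions to recognize first when determining the field.

    Assumed policy: printed regions only; marked regions rank highest,
    then title-like regions (near the top or larger than typical).
    """
    def priority(r):
        title_like = r.y <= top_band or r.font_size > typical_font
        return (r.marked, title_like)

    candidates = [r for r in regions if r.printed]
    ranked = sorted(candidates, key=priority, reverse=True)
    return [r.region_id for r in ranked[:limit]]

regions = [
    Region("body", True, False, 0.50, 10),
    Region("title", True, False, 0.05, 18),
    Region("circled", True, True, 0.60, 10),
    Region("note", False, True, 0.02, 20),   # handwritten, so skipped
]
print(pick_regions(regions, typical_font=10))
```

The handwritten note is excluded even though it is marked and near the top, reflecting the observation that printed characters are recognized more reliably.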
Returning to Fig. 2, the term dictionary selection unit 18 selects the field-specific term dictionary database pertaining to the field determined by the document content determination unit 17 (step S13). For example, when the contents of the document are determined to belong to the field of image processing, the term dictionary selection unit 18 selects the field-specific term dictionary database 11a for image processing. In addition, referring to the contents stored in the field relevance storage unit 15, it also selects the field-specific term dictionary database 11b for a field defined as having at least a certain degree of relevance to image processing (here, photography).
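Step S13 can be sketched as a threshold test on the stored relevance values. The values 0.8 and 0.1 are the ones given for unit 15; the threshold of 0.5 is an assumption, since the text only speaks of "a certain degree" of relevance, and the function names are hypothetical.

```python
# Relevance values as described for unit 15 (maximum relevance is 1).
RELEVANCE = {
    frozenset({"image_processing", "photography"}): 0.8,
    frozenset({"image_processing", "politics"}): 0.1,
    frozenset({"photography", "politics"}): 0.1,
}

def relevance(a, b):
    """Symmetric lookup; a field is fully relevant to itself."""
    if a == b:
        return 1.0
    return RELEVANCE.get(frozenset({a, b}), 0.0)

def select_fields(determined, all_fields, threshold=0.5):
    """Return the determined field plus every sufficiently related field.

    Assumption: 0.5 is an illustrative threshold, not a value from the text.
    """
    return [f for f in all_fields if relevance(determined, f) >= threshold]

fields = ["image_processing", "photography", "politics"]
print(select_fields("image_processing", fields))
```

With these values, determining the field as image processing selects the dictionaries for image processing and photography, which corresponds to databases 11a and 11b in the example.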
Next, the character recognition unit 19 recognizes the characters or terms in the document by referring to the feature quantities stored in the standard character feature storage unit 14, the feature quantities of the glyphs read optically from the document, and the contents of the selected field-specific term dictionary databases 11a and 11b (step S14). The output unit 20 outputs the recognition result by a predetermined method, such as display on a panel (step S15).
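How the feature comparison and the dictionary are combined at step S14 is not spelled out. One plausible reading, sketched here as an assumption, is that feature matching produces a best-first list of candidate characters for each position, and the dictionary terms then act as the candidates among which the best-fitting one is chosen. The function name and sample data are hypothetical.

```python
def recognize_term(position_candidates, dictionary_terms):
    """Pick the dictionary term that best fits per-position candidates.

    position_candidates: one best-first list of characters per position.
    Assumed scoring: the sum of each character's rank in its list;
    terms containing a character outside the lists are rejected.
    Falls back to the top-ranked characters if no term fits.
    """
    def cost(term):
        if len(term) != len(position_candidates):
            return None
        total = 0
        for ch, cands in zip(term, position_candidates):
            if ch not in cands:
                return None
            total += cands.index(ch)
        return total

    scored = []
    for term in dictionary_terms:
        c = cost(term)
        if c is not None:
            scored.append((c, term))
    if scored:
        return min(scored)[1]
    return "".join(cands[0] for cands in position_candidates)

# The top-ranked reading is "1ens"; the photography dictionary corrects it.
candidates = [["1", "l"], ["e", "c"], ["n", "m"], ["s", "5"]]
print(recognize_term(candidates, ["lens", "shutter"]))
print(recognize_term(candidates, []))
```

The second call shows the behavior without a suitable dictionary: the raw top-ranked reading is returned unchanged.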
According to the first embodiment described above, a field-specific term dictionary database containing suitable characters or terms is selected in view of the contents of the document. An improvement in recognition accuracy can thus be expected.
(2) Second embodiment
In the first embodiment described above, character recognition is performed on the entire document using the selected field-specific term dictionary database. In the second embodiment described below, a single document is divided into plural regions, and a field-specific term dictionary database suited to each region is selected for character recognition. Fig. 4 is a block diagram showing the configuration of a character recognition apparatus 30 according to the second embodiment. Components identical to those in Fig. 1 are denoted by the same reference numerals. The character recognition apparatus 30 shown in Fig. 4 differs from the character recognition apparatus 10 of the first embodiment shown in Fig. 1 in that it is provided with a section format database 31 and a document content determination unit 34 (a section division unit 32 and a section content determination unit 33) in place of the format database 12, the storage-area-specific document attribute storage unit 13, the field relevance storage unit 15 and the document content determination unit 17. The section format database 31 stores information describing the shape and size of the sections to be filled in on a document. This information includes, for example, the shapes and sizes of the various sections illustrated conceptually in Fig. 5 (a) to (e).
Figs. 6 and 7 are flowcharts showing the operation of the character recognition apparatus 30.
The operation shown in Fig. 6 differs from the operation shown in Fig. 2 in that it includes steps S32 to S35, which are performed section by section, in place of steps S12 to S15, which are performed on the entire document. That is, the document reading unit 16 irradiates the document with light to read the image on the document optically, and generates document image data (step S11). The document content determination unit 34 then determines the contents (field) section by section (step S32). Specifically, as shown in Fig. 7, the section division unit 32 first refers to the contents stored in the section format database 31 and divides the document into the sections to be filled in (step S41). The section content determination unit 33 then analyzes the shape and size of each section and any printed characters, symbols and marks written in it (for example, printed characters such as "name" and "address", and symbols denoting a postal code or telephone number). From the analysis result, the section content determination unit 33 identifies the field of the contents written in the section (step S42). For example, the contents of a section labeled "address" should belong to the field of place names, and the contents of a section labeled "name" should belong to the field of personal names. This processing is repeated until it has been performed on all sections ("Yes" at step S43), at which point the processing shown in Fig. 7 ends.
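Step S42 can be pictured as a mapping from the printed label found in a section to a field. The sketch below covers only the label-based part of the analysis (not the shape, size or symbols), and assumes a fallback field for unlabeled sections; the field labels, the `"general"` default and the function name are hypothetical.

```python
# Printed labels and the fields they indicate, following the examples
# in the text ("name" -> personal names, "address" -> place names).
LABEL_FIELDS = {
    "name": "personal_names",
    "address": "place_names",
}

def field_for_section(printed_labels, default="general"):
    """Return the field for a section based on its printed labels.

    Assumption: sections with no known label fall back to a default field.
    """
    for label in printed_labels:
        field = LABEL_FIELDS.get(label.strip().lower())
        if field is not None:
            return field
    return default

# Repeat for every section of the form, as in the loop ending at step S43.
form_sections = {"s1": ["Name"], "s2": ["Address"], "s3": ["Remarks"]}
section_fields = {sid: field_for_section(labels)
                  for sid, labels in form_sections.items()}
print(section_fields)
```

Each section then gets its own dictionary, so a name is never matched against place names and vice versa.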
Returning to Fig. 6, the term dictionary selection unit 18 selects the field-specific term dictionary database pertaining to the field determined section by section by the document content determination unit 34 (step S33). The character recognition unit 19 recognizes the characters or terms in each section by referring to the feature quantities stored in the standard character feature storage unit 14, the feature quantities of the glyphs read optically from the document, and the contents of the field-specific term dictionary database selected section by section (step S34). The output unit 20 outputs the recognition result by a predetermined method, such as display on a panel (step S35).
According to the second embodiment described above, the document is divided into the sections to be filled in, and a suitable field-specific term dictionary database is selected according to the contents of each section. Character recognition can therefore be performed with higher accuracy than in the first embodiment.
(3) Modified examples
The present invention may also be implemented through the following modifications of the embodiments described above.
The fields and field-specific term dictionary databases are not limited to those illustrated in the embodiments, and may be set freely according to the type and contents of the documents to be subjected to character recognition.
The first and second embodiments may also be combined. For example, in the second embodiment, the degree of relevance between fields may be taken into account when performing character recognition, as in the first embodiment.
When the character region of a document is divided into plural subregions, the division may be made by chapter or paragraph of the document rather than by section to be filled in.
The control program that causes the character recognition apparatuses 10 and 30 to perform the operations described above may be provided to them recorded on a recording medium readable by a CPU or other processor (such as a magnetic recording medium, an optical recording medium or a ROM). The control program may also be downloaded to the character recognition apparatuses 10 and 30 over a network such as the Internet.
As described above, some embodiments of the present invention are summarized as follows.
Embodiments of the invention provide a character recognition apparatus including: plural dictionary databases that contain terms or characters classified into respective fields; a determination unit that determines which field the contents of a document shown by document image data belong to; a selection unit that selects, from among the plural dictionary databases, a dictionary database pertaining to the field determined by the determination unit; a recognition unit that recognizes a term or a character written in the document shown by the document image data by using the terms or characters stored in the selected dictionary database as candidates; and an output unit that outputs the result of recognition by the recognition unit. With this apparatus, the field to which the document contents belong is determined first, and a field-specific term dictionary database suited to that field is then selected and used for character recognition. An improvement in recognition accuracy can thus be expected.
In this embodiment of the invention, the character recognition apparatus further includes a region division unit that divides the character-bearing region of the document into plural subregions. The determination unit determines, subregion by subregion, the field to which the contents written in each subregion belong. The selection unit selects the dictionary database pertaining to each field determined by the determination unit. The recognition unit recognizes the terms or characters written in each subregion by using the terms or characters stored in the selected dictionary database as candidates. According to this aspect, a field-specific term dictionary database suited to each subregion of the document can be selected and used for character recognition.
In this embodiment of the invention, the determination unit divides the character regions of the document represented by the document image data into printed character regions, written in printed characters, and handwritten character regions, written in handwritten characters, performs character recognition on the printed characters in the printed character regions, and compares the recognition result with the terms or characters stored in each of the plural dictionary databases to determine the field to which the contents written in the document belong. Some documents contain both printed and handwritten characters. In such documents, printed characters are recognized with relatively high accuracy. The field can therefore be determined appropriately by basing the determination on the result of character recognition performed on the printed characters.
In this embodiment of the invention, the character recognition apparatus further includes an attribute memory that stores the correspondence between storage areas designated as the storage destination of document image data when that data is generated and the corresponding dictionary databases. The determination unit selects the dictionary database corresponding to the storage area containing the document image data according to the correspondence stored in the attribute memory. In multifunction machines and similar devices in common use today, an image read by a scanner can be stored in a storage area corresponding to a number designated from a menu called a "mailbox". The number designated in the "mailbox" usually differs, for example, for each organizational unit in a company (department, section) or for each user. The storage area assigned to a given number therefore usually holds document image data from similar fields. Accordingly, a storage area designated as the storage destination of document image data when that data is generated (for example, each storage area in the mailbox) is stored in association with a field-specific dictionary (for example, that of the field of the user or organization that regularly uses the storage area). This makes it possible to determine the field to which the document contents belong simply by designating the storage area.
In this embodiment of the invention, the character recognition apparatus further includes a relevance memory that stores degrees of relevance defining how closely fields are related to one another. The selection unit selects the dictionary database of a field defined by the degree of relevance as having a certain degree of relevance to the field determined by the determination unit.
Embodiments of the invention provide a kind of character identifying method, and it may further comprise the steps: store term or character by the field in a plurality of dictionary databases; Determine the affiliated field of content of the document that document image data is represented; From described a plurality of dictionary databases, select the dictionary database relevant with determined field; By using the term stored in the selected dictionary database or character, the term or the character that write in the document that document image data represents are discerned as the candidate; And output recognition result.
In this embodiment of the present invention, the character recognition method further comprises dividing the area of the document that contains characters into a plurality of sub-areas. The determining step comprises determining, sub-area by sub-area, the field to which the content written in each sub-area belongs. The selecting step comprises selecting the dictionary database related to each determined field. The recognizing step comprises recognizing the terms or characters written in the area, using the terms or characters stored in the selected dictionary databases as candidates.
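Per-sub-area processing amounts to running the determine, select, and recognize steps once for each sub-area. The sketch below assumes the sub-areas have already been separated and are given as lists of raw word readings; the dictionaries, the exact-hit scoring used to determine the field, and the `difflib` matching are all hypothetical choices for illustration.

```python
import difflib

# Hypothetical field dictionaries; the terms are invented for illustration.
DICTIONARIES = {
    "address": ["street", "avenue", "city"],
    "billing": ["invoice", "amount", "total"],
}


def determine_field(raw_words):
    """Pick the field whose dictionary contains the most of the raw words.

    Ties (including zero hits everywhere) resolve to the alphabetically
    first field, which is an arbitrary choice made for this sketch.
    """
    def hits(field):
        terms = set(DICTIONARIES[field])
        return sum(1 for word in raw_words if word in terms)

    return max(sorted(DICTIONARIES), key=hits)


def recognize_subareas(subareas):
    """Determine a field, select its dictionary, and recognize, per sub-area."""
    results = []
    for raw_words in subareas:
        field = determine_field(raw_words)
        candidates = DICTIONARIES[field]
        recognized = [
            (difflib.get_close_matches(word, candidates, n=1, cutoff=0.6) or [word])[0]
            for word in raw_words
        ]
        results.append((field, recognized))
    return results
```

Because each sub-area gets its own dictionary, an address block and a billing block on the same form are not forced to share one candidate list.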
In this embodiment of the present invention, the determining step comprises: dividing the character area of the document represented by the document image data into a printed-character area written in printed characters and a handwritten-character area written in handwritten characters; performing character recognition on the printed characters written in the printed-character area; and comparing the recognition result with the terms or characters stored in each of the plurality of dictionary databases, to determine the field to which the content written in the document represented by the document image data belongs.
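The comparison of the printed-character recognition result against every dictionary can be sketched as a hit count per field. The code below is a rough illustration only: regions are represented as hypothetical `(kind, words)` tuples in which the words stand in for recognizer output, the dictionary contents are invented, and counting exact matches is an assumed scoring rule.

```python
from typing import Dict, List, Set, Tuple

# A region is (kind, words); kind is "printed" or "handwritten" (hypothetical layout).
Region = Tuple[str, List[str]]

# Hypothetical field dictionaries; the terms are invented for illustration.
DICTIONARIES: Dict[str, Set[str]] = {
    "medical": {"patient", "diagnosis", "dosage"},
    "banking": {"account", "deposit", "balance"},
}


def field_scores(regions: List[Region]) -> Dict[str, int]:
    """Count, per field, how many printed words appear in that field's dictionary.

    Handwritten regions are ignored here: only the printed text, which is
    assumed to be recognized more reliably, is used to determine the field.
    """
    printed_words = [w for kind, words in regions if kind == "printed" for w in words]
    return {
        field: sum(1 for w in printed_words if w in terms)
        for field, terms in DICTIONARIES.items()
    }


def determine_field(regions: List[Region]) -> str:
    """Return the best-scoring field; ties resolve alphabetically (arbitrary)."""
    scores = field_scores(regions)
    return max(sorted(scores), key=scores.get)
```

The field found this way can then be used to pick the dictionary for the harder handwritten regions of the same document.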
In this embodiment of the present invention, the character recognition method further comprises the step of storing, in an attribute memory, the correspondence between each storage area designated as the storage destination of document image data when that data is generated and the corresponding dictionary database. The determining step comprises selecting, according to the correspondence stored in the attribute memory, the dictionary database corresponding to the storage area that contains the document image data.
In this embodiment of the present invention, the character recognition method further comprises the step of storing, in a degree-of-association memory, degrees of association that define how closely the fields are related to one another. The selecting step comprises selecting the dictionary database of each field that the degrees of association define as having a certain degree of association with the determined field.
The foregoing description of the embodiments of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to understand the invention for various embodiments and with the various modifications suited to the particular use contemplated. The scope of the invention is defined by the claims and their equivalents.

Claims (10)

1. A character recognition device, comprising:
a plurality of dictionary databases that contain terms or characters categorized by field;
a determining unit that determines the field to which the content of a document represented by document image data belongs;
a selecting unit that selects, from the plurality of dictionary databases, the dictionary database related to the field determined by the determining unit;
a recognition unit that recognizes the terms or characters written in the document represented by the document image data, using the terms or characters stored in the selected dictionary database as candidates; and
an output unit that outputs the recognition result of the recognition unit.
2. The character recognition device according to claim 1, further comprising an area dividing unit that divides the area of the document that contains characters into a plurality of sub-areas, wherein:
the determining unit determines, sub-area by sub-area, the field to which the content written in each divided sub-area belongs;
the selecting unit selects the dictionary database related to each field determined by the determining unit; and
the recognition unit recognizes the terms or characters written in the area, using the terms or characters stored in the selected dictionary databases as candidates.
3. The character recognition device according to claim 1, wherein
the determining unit divides the character area of the document represented by the document image data into a printed-character area written in printed characters and a handwritten-character area written in handwritten characters, performs character recognition on the printed characters written in the printed-character area, and compares the recognition result with the terms or characters stored in each of the plurality of dictionary databases, to determine the field to which the content written in the document represented by the document image data belongs.
4. The character recognition device according to claim 1, further comprising an attribute memory that stores the correspondence between each storage area designated as the storage destination of document image data when that data is generated and the corresponding dictionary database, wherein
the determining unit selects, according to the correspondence stored in the attribute memory, the dictionary database corresponding to the storage area that contains the document image data.
5. The character recognition device according to claim 1, further comprising a degree-of-association memory that stores degrees of association defining how closely the fields are related to one another, wherein
the selecting unit selects the dictionary database of each field that the degrees of association define as having a certain degree of association with the field determined by the determining unit.
6. A character recognition method, comprising the following steps:
a storing step of storing terms or characters, by field, in a plurality of dictionary databases;
a determining step of determining the field to which the content of a document represented by document image data belongs;
a selecting step of selecting, from the plurality of dictionary databases, the dictionary database related to the determined field;
a recognizing step of recognizing the terms or characters written in the document represented by the document image data, using the terms or characters stored in the selected dictionary database as candidates; and
an output step of outputting the recognition result.
7. The character recognition method according to claim 6, further comprising the step of dividing the area of the document that contains characters into a plurality of sub-areas, wherein:
the determining step comprises determining, sub-area by sub-area, the field to which the content written in each divided sub-area belongs;
the selecting step comprises selecting the dictionary database related to each determined field; and
the recognizing step comprises recognizing the terms or characters written in the area, using the terms or characters stored in the selected dictionary databases as candidates.
8. The character recognition method according to claim 6, wherein
the determining step comprises:
dividing the character area of the document represented by the document image data into a printed-character area written in printed characters and a handwritten-character area written in handwritten characters;
performing character recognition on the printed characters written in the printed-character area; and
comparing the recognition result with the terms or characters stored in each of the plurality of dictionary databases, to determine the field to which the content written in the document represented by the document image data belongs.
9. The character recognition method according to claim 6, further comprising the step of storing, in an attribute memory, the correspondence between each storage area designated as the storage destination of document image data when that data is generated and the corresponding dictionary database, wherein
the determining step comprises selecting, according to the correspondence stored in the attribute memory, the dictionary database corresponding to the storage area that contains the document image data.
10. The character recognition method according to claim 6, further comprising the step of storing, in a degree-of-association memory, degrees of association defining how closely the fields are related to one another, wherein
the selecting step comprises selecting the dictionary database of each field that the degrees of association define as having a certain degree of association with the determined field.
CNB2005100551946A 2004-08-25 2005-03-16 Character recognition apparatus and character recognition method Expired - Fee Related CN100351849C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2004245311A JP2006065477A (en) 2004-08-25 2004-08-25 Character recognition device
JP2004245311 2004-08-25

Publications (2)

Publication Number Publication Date
CN1741034A CN1741034A (en) 2006-03-01
CN100351849C true CN100351849C (en) 2007-11-28

Family

ID=35943131

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100551946A Expired - Fee Related CN100351849C (en) 2004-08-25 2005-03-16 Character recognition apparatus and character recognition method

Country Status (3)

Country Link
US (1) US20060045340A1 (en)
JP (1) JP2006065477A (en)
CN (1) CN100351849C (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080008391A1 (en) * 2006-07-10 2008-01-10 Amir Geva Method and System for Document Form Recognition
JP5239419B2 (en) * 2008-03-14 2013-07-17 オムロン株式会社 Character recognition program, character recognition electronic component, character recognition device, character recognition method, and data structure
JP2010217996A (en) * 2009-03-13 2010-09-30 Omron Corp Character recognition device, character recognition program, and character recognition method
JP2011065322A (en) * 2009-09-16 2011-03-31 Konica Minolta Holdings Inc Character recognition system and character recognition program, and voice recognition system and voice recognition program
CN102855264B (en) * 2011-07-01 2015-11-25 富士通株式会社 Document processing method and device thereof
US9082035B2 (en) * 2011-08-29 2015-07-14 Qualcomm Incorporated Camera OCR with context information
DE102012008512A1 (en) * 2012-05-02 2013-11-07 Eyec Gmbh Apparatus and method for comparing two graphics and text elements containing files
JP6140946B2 (en) * 2012-07-26 2017-06-07 キヤノン株式会社 Character recognition system and character recognition device
JP2014067303A (en) * 2012-09-26 2014-04-17 Toshiba Corp Character recognition device and method and program
CN104903802B (en) * 2013-02-28 2017-03-08 发纮电机株式会社 Mapping editing device
CN105427696A (en) * 2015-11-20 2016-03-23 江苏沁恒股份有限公司 Method for distinguishing answer to target question
CN108921103B (en) * 2018-07-05 2019-04-16 掌阅科技股份有限公司 For the label synchronous method of check and correction, calculating equipment and computer storage medium
KR20200010777A (en) * 2018-07-23 2020-01-31 휴렛-팩커드 디벨롭먼트 컴퍼니, 엘.피. Character recognition using previous recognition result of similar character
JP2022148922A (en) * 2021-03-24 2022-10-06 富士フイルムビジネスイノベーション株式会社 Information processing device and program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1059414A (en) * 1991-03-12 1992-03-11 窦祖烈 The interpretation method of Chinese sentence
CN1215201A (en) * 1997-10-16 1999-04-28 富士通株式会社 Character identifying/correcting mode
CN1221927A (en) * 1997-12-19 1999-07-07 松下电器产业株式会社 Character recognizor and its method, and recording medium for computer reading out
JPH11203414A (en) * 1998-01-08 1999-07-30 Fuji Xerox Co Ltd Broadly classified dictionary preparing device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4944022A (en) * 1986-12-19 1990-07-24 Ricoh Company, Ltd. Method of creating dictionary for character recognition
JP2713622B2 (en) * 1989-11-20 1998-02-16 富士通株式会社 Tabular document reader
JP3275153B2 (en) * 1993-03-03 2002-04-15 株式会社日立製作所 Dictionary distribution system and dictionary distribution management method
JP3375766B2 (en) * 1994-12-27 2003-02-10 松下電器産業株式会社 Character recognition device
US6101515A (en) * 1996-05-31 2000-08-08 Oracle Corporation Learning system for classification of terminology
JP3525997B2 (en) * 1997-12-01 2004-05-10 富士通株式会社 Character recognition method
JP3895892B2 (en) * 1999-09-22 2007-03-22 株式会社東芝 Multimedia information collection management device and storage medium storing program
JP4377494B2 (en) * 1999-10-22 2009-12-02 東芝テック株式会社 Information input device
US6603464B1 (en) * 2000-03-03 2003-08-05 Michael Irl Rabin Apparatus and method for record keeping and information distribution
US20040205671A1 (en) * 2000-09-13 2004-10-14 Tatsuya Sukehiro Natural-language processing system

Also Published As

Publication number Publication date
CN1741034A (en) 2006-03-01
US20060045340A1 (en) 2006-03-02
JP2006065477A (en) 2006-03-09

Similar Documents

Publication Publication Date Title
CN100351849C (en) Character recognition apparatus and character recognition method
CN100351839C (en) File searching and reading method and apparatus
US6671684B1 (en) Method and apparatus for simultaneous highlighting of a physical version of a document and an electronic version of a document
US8285047B2 (en) Automated method and system for naming documents from a scanned source based on manually marked text
US20090144277A1 (en) Electronic table of contents entry classification and labeling scheme
US6178417B1 (en) Method and means of matching documents based on text genre
US20090123071A1 (en) Document processing apparatus, document processing method, and computer program product
US7593961B2 (en) Information processing apparatus for retrieving image data similar to an entered image
CN1254894A (en) Method for font access, register, display, printing and file processing,and record medium
CN101533317A (en) Fast recording device with handwriting identifying function and method thereof
CN1894685A (en) Translation tool
CN1838113A (en) Translation processing method, document translation device, and programs
JP2004334339A (en) Information processor, information processing method, and storage medium, and program
US7359896B2 (en) Information retrieving system, information retrieving method, and information retrieving program
JPH11282955A (en) Character recognition device, its method and computer readable storage medium recording program for computer to execute the method
CN1106620C (en) Information processing method and apparatus
Couasnon et al. Making handwritten archives documents accessible to public with a generic system of document image analysis
Garris et al. NIST Scoring Package User’s Guide
CN100444194C (en) Automatic extraction device, method and program of essay title and correlation information
JP3145071B2 (en) Character recognition method and device
CN117688162B (en) Full text retrieval method and system based on OCR (optical character recognition)
US20040083242A1 (en) Method and apparatus for locating and transforming data
Furukawa et al. D-pen: A digital pen system for public and business enterprises
Al-Barhamtoshy et al. Universal metadata repository for document analysis and recognition
JP4261831B2 (en) Character recognition processing method, character recognition processing device, character recognition program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071128

Termination date: 20170316
