Summary of the invention
In view of this, the technical problem to be solved by the present invention is the high error rate that arises in the prior art when characters are recognized using OCR technology. To overcome it, the invention provides a method and system for improving the character recognition rate in a format document.
To solve the above technical problem, the present invention provides a method for improving the character recognition rate in a format document, comprising the following steps:
Comparing the original character code of each book character in the format document with the universal standard character code of the same book character, to obtain a code comparison result indicating whether the two codes are identical or different;
Performing probability statistics on the code comparison results of a plurality of the book characters, to obtain a probability value that the book characters adopt the universal standard character code;
Comparing the probability value with a threshold: if the probability value exceeds the threshold, each book character is displayed as the character obtained by looking up its original character code in the universal standard character code library; otherwise, the character recognized by OCR for that book character is displayed directly.
In the method for improving the character recognition rate in a format document, before the step of obtaining the code comparison result, the method further comprises the following steps:
Extracting the glyph bitmap of each book character in the format document;
Extracting the original character code of each book character in the format document;
Performing OCR on the glyph bitmap to obtain a recognized character;
Looking up the recognized character in the universal standard character code library to obtain its universal standard character code.
In the method for improving the character recognition rate in a format document, before the step of extracting the original character code, the method further comprises the following step:
Screening out, as the book characters, the characters in the format document that have an original character code.
In the method for improving the character recognition rate in a format document, after the step of screening out the characters in the format document that have an original character code as the book characters, the method further comprises the following step:
Assigning an ID number to each book character.
In the method for improving the character recognition rate in a format document, after the step of extracting the original character code of each book character in the format document, the method further comprises the following step:
Establishing an original character code table, and storing the ID of each book character together with its corresponding original character code in the original character code table.
In the method for improving the character recognition rate in a format document, after the step of obtaining the universal standard character code, the method further comprises the following step:
Establishing a standard character code table, and storing the ID of each book character together with its corresponding standard character code in the standard character code table.
In the method for improving the character recognition rate in a format document, before the probability value is compared with the threshold and the corresponding operation is carried out, the method further comprises the following step:
Establishing an editable interface for displaying, modifying, and confirming the characters.
A system for improving the character recognition rate in a format document comprises a code comparison device, a probability statistics device, and a probability-threshold comparison device, wherein:
The code comparison device is configured to compare the original character code of each book character in the format document with the universal standard character code of the same book character, to obtain a code comparison result indicating whether the two codes are identical or different;
The probability statistics device is configured to perform probability statistics on the code comparison results of a plurality of the book characters, to obtain a probability value that the book characters adopt the universal standard character code;
The probability-threshold comparison device is configured to compare the probability value with a threshold: if the probability value exceeds the threshold, each book character is displayed as the character obtained by looking up its original character code in the universal standard character code library; otherwise, the character recognized by OCR for that book character is displayed directly.
The system for improving the character recognition rate in a format document further comprises a glyph bitmap extraction device, an original character code extraction device, an OCR recognition device, and a universal standard character code mapping device, wherein:
The glyph bitmap extraction device is configured to extract the glyph bitmap of each book character in the format document;
The original character code extraction device is configured to extract the original character code of each book character in the format document;
The OCR recognition device is configured to perform OCR on the extracted glyph bitmap to obtain a recognized character;
The universal standard character code mapping device is configured to look up the recognized character in the universal standard character code library to obtain its universal standard character code.
The system for improving the character recognition rate in a format document further comprises a book character screening device, configured to screen out, as the book characters, the characters in the format document that have an original character code.
The system for improving the character recognition rate in a format document further comprises an ID numbering device, configured to assign an ID number to each book character.
The system for improving the character recognition rate in a format document further comprises an original character code table establishing device, configured to establish an original character code table and to store the ID of each book character together with its corresponding original character code in the original character code table.
The system for improving the character recognition rate in a format document further comprises a standard character code table establishing device, configured to establish a standard character code table and to store the ID of each book character together with its corresponding standard character code in the standard character code table.
The system for improving the character recognition rate in a format document further comprises an editable interface establishing device, configured to establish an editable interface for displaying, modifying, and confirming the characters.
Compared with the prior art, the above technical solution of the present invention has the following advantages:
1. In the method and system for improving the character recognition rate in a format document according to the present invention, the original character code of each book character in the format document is compared with the universal standard character code of the same book character to obtain a code comparison result indicating whether the codes are identical or different; probability statistics are performed on a plurality of the code comparison results to obtain a probability value; and the probability value is compared with a threshold. If the probability value exceeds the threshold, the character obtained by looking up the original character code in the universal standard character code library is displayed; otherwise, the character recognized by OCR is displayed. By using probability statistics to choose between these two sources, the present invention effectively improves the accuracy of character recognition.
2. In the method and system for improving the character recognition rate in a format document according to the present invention, before the step of obtaining the code comparison result, the method further comprises: extracting the glyph bitmap of each book character in the format document; extracting the original character code of each book character in the format document; performing OCR on the glyph bitmap to obtain a recognized character; and looking up the recognized character in the universal standard character code library to obtain its universal standard character code. Obtaining the recognized character by OCR makes it straightforward to then obtain the universal standard character code. The OCR recognition device is a commercially available general-purpose module and therefore has the advantage of low cost.
3. In the method and system for improving the character recognition rate in a format document according to the present invention, before the step of extracting the original character code, the characters in the format document that have an original character code are screened out as the book characters. This screening reduces the number of characters for which a glyph bitmap must be extracted, which shortens the running time of the invention and improves operating efficiency. The present invention also comprises a step of assigning an ID number to each book character; ID numbering makes it easy to keep each book character in accurate one-to-one correspondence with its original character code and its recognized character. The present invention further comprises steps of establishing an original character code table and a standard character code table. The original character code table manages the original character codes effectively, and the standard character code table manages the standard character codes effectively, which reduces the running time of the invention.
4. The method and system for improving the character recognition rate in a format document according to the present invention further comprise a step of establishing an editable interface. The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected by manual intervention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described herein serve only to illustrate and explain the present invention and are not intended to limit it.
Embodiment 1
As one embodiment of the present invention, as shown in Figure 1, a method for improving the character recognition rate in a format document comprises the following steps:
Comparing the original character code of each book character in the format document with the universal standard character code of the same book character, to obtain a code comparison result indicating whether the two codes are identical or different.
Performing probability statistics on the code comparison results of a plurality of the book characters, to obtain a probability value that the book characters adopt the universal standard character code.
Comparing the probability value with a threshold: if the probability value exceeds the threshold, each book character is displayed as the character obtained by looking up its original character code in the universal standard character code library. Otherwise, the character recognized by OCR for that book character is displayed directly.
Using probability statistics, the present invention chooses between displaying the character obtained by looking up the original character code in the universal standard character code library and displaying the character recognized by OCR. When the book characters adopt the universal standard character coding scheme, the character obtained from the original character code replaces the character recognized by OCR. Because a character obtained from the original character code in this way is more accurate than one recognized by OCR, the present invention improves the overall accuracy of text recognition.
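The three steps of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `match_probability` and `choose_display_text`, the representation of each book character as an `(original_code, standard_code)` pair, the `code_library` dictionary, and the fallback to the OCR character when a code is missing from the library are all hypothetical choices made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def match_probability(code_pairs: Sequence[Tuple[int, int]]) -> float:
    """Fraction of book characters whose original code equals the standard code."""
    if not code_pairs:
        return 0.0
    identical = sum(1 for original, standard in code_pairs if original == standard)
    return identical / len(code_pairs)


def choose_display_text(
    code_pairs: Sequence[Tuple[int, int]],
    ocr_chars: Sequence[str],
    code_library: Dict[int, str],
    threshold: float = 0.9,
) -> List[str]:
    """Pick, for the whole document, between code lookup and OCR output."""
    if len(code_pairs) != len(ocr_chars):
        raise ValueError("one OCR character is expected per code pair")
    if match_probability(code_pairs) > threshold:
        # The document very likely uses the standard coding, so trust the
        # original codes. Falling back to the OCR character when a code is
        # missing from the library is an assumption of this sketch.
        return [
            code_library.get(original, ocr_char)
            for (original, _), ocr_char in zip(code_pairs, ocr_chars)
        ]
    # Otherwise the original codes are not trusted; show the OCR output.
    return list(ocr_chars)
```

In this sketch, a single OCR misrecognition in an otherwise standard-coded document lowers the probability only slightly, so the misrecognized character is still replaced by the character looked up from its original code.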
Embodiment 2
As one embodiment of the present invention, on the basis of Embodiment 1, before the step of obtaining the code comparison result, the method further comprises the following steps:
Extracting the glyph bitmap of each book character in the format document.
Performing OCR on the extracted glyph bitmap to obtain a recognized character.
Looking up the recognized character in the universal standard character code library to obtain its universal standard character code. Here, the universal standard character code is the Chinese national standard GB2312.
Extracting the original character code of each book character in the format document.
The steps of obtaining the universal standard character code and obtaining the original character code may be performed simultaneously or in either order: the universal standard character code may be obtained first and the original character code second, or the other way round. The object of the present invention is achieved as long as both the universal standard character code and the original character code have been obtained before the comparison.
Obtaining the recognized character by OCR makes it straightforward to then obtain the universal standard character code.
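As an illustration of the lookup step, the GB2312 code of a recognized character can be obtained with Python's built-in `gb2312` codec. This is only a sketch: the function name `gb2312_code` and the choice to return the code as a single big-endian integer (or `None` for characters outside GB2312) are assumptions made for the example, and the OCR step itself is not shown.

```python
from typing import Optional


def gb2312_code(recognized_char: str) -> Optional[int]:
    """Return the GB2312 code of one recognized character as an integer.

    Returns None when the character cannot be encoded in GB2312.
    """
    try:
        encoded = recognized_char.encode("gb2312")
    except UnicodeEncodeError:
        return None  # the character lies outside the GB2312 repertoire
    # Hanzi encode to two bytes; combine them into one integer code.
    return int.from_bytes(encoded, "big")


if __name__ == "__main__":
    print(hex(gb2312_code("啊")))  # first hanzi in GB2312
```

The integer returned here can then be compared directly with an original character code extracted from the document, provided that code is expressed in the same form.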
Embodiment 3
As one embodiment of the present invention, on the basis of Embodiment 2, before the step of extracting the original character code, the method further comprises the following step:
Screening out, as the book characters, the characters in the format document that have an original character code. This screening reduces the number of characters for which a glyph bitmap must be extracted, which shortens the running time of the invention and improves operating efficiency.
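The screening step amounts to a filter over the characters of the document. The sketch below assumes a hypothetical `DocumentChar` record in which a missing original code is represented by `None`; neither the record layout nor the function name comes from the specification.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class DocumentChar:
    """One character of the format document (hypothetical record layout)."""
    glyph_ref: str                 # reference to the glyph, extracted later
    original_code: Optional[int]   # None when the document stores no code


def screen_book_characters(chars: Iterable[DocumentChar]) -> List[DocumentChar]:
    """Keep only the characters that carry an original character code."""
    return [char for char in chars if char.original_code is not None]
```

Only the characters that pass this filter go on to glyph bitmap extraction and OCR, which is where the saving in running time comes from.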
Embodiment 4
As one embodiment of the present invention, on the basis of Embodiment 3, after the step of screening out the characters in the format document that have an original character code as the book characters, the method further comprises the following step:
Assigning an ID number to each book character. ID numbering makes it easy to keep each book character in accurate one-to-one correspondence with its original character code and its recognized character.
Embodiment 5
As one embodiment of the present invention, on the basis of Embodiment 4, after the step of extracting the original character code of each book character in the format document, the method further comprises the following step:
Establishing an original character code table, and storing the ID of each book character together with its corresponding original character code in the original character code table. The original character code table manages the original character codes effectively, which reduces the running time of the invention.
Embodiment 6
As one embodiment of the present invention, on the basis of Embodiment 4 or Embodiment 5, after the step of obtaining the universal standard character code, the method further comprises the following step:
Establishing a standard character code table, and storing the ID of each book character together with its corresponding standard character code in the standard character code table. The standard character code table manages the standard character codes effectively, which reduces the running time of the invention.
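The two ID-keyed tables can be pictured as simple mappings from ID to code, with the comparison performed by joining them on the ID. The sketch below is illustrative only: using the position of a character as its ID, and the names `build_code_table` and `compare_tables`, are assumptions introduced here.

```python
from typing import Dict, Optional, Sequence


def build_code_table(codes: Sequence[Optional[int]]) -> Dict[int, Optional[int]]:
    """Map each book character's ID to its code.

    Using the character's position as its ID is an assumption of this sketch.
    """
    return {char_id: code for char_id, code in enumerate(codes)}


def compare_tables(
    original_table: Dict[int, Optional[int]],
    standard_table: Dict[int, Optional[int]],
) -> Dict[int, bool]:
    """For each ID, record whether the original and standard codes are identical."""
    return {
        char_id: code is not None and standard_table.get(char_id) == code
        for char_id, code in original_table.items()
    }
```

Keeping both tables keyed by the same ID means each code is extracted once and then retrieved by lookup, rather than being re-derived from the document for every comparison.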
Embodiment 7
As one embodiment of the present invention, on the basis of the above embodiments, before the probability value is compared with the threshold and the corresponding operation is carried out, the method further comprises the following step:
Establishing an editable interface for displaying, modifying, and confirming the characters.
The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected conveniently by manual intervention.
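The state behind such an editable interface might be modelled as below. This is a hypothetical data model only, with no user interface code: the class name `ReviewSession`, its methods, and the rule that a manual revision also counts as a confirmation are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReviewSession:
    """Displayed characters keyed by ID, plus the IDs an operator has confirmed."""
    displayed: Dict[int, str]
    confirmed: Set[int] = field(default_factory=set)

    def _check(self, char_id: int) -> None:
        if char_id not in self.displayed:
            raise KeyError(f"unknown character ID: {char_id}")

    def revise(self, char_id: int, corrected: str) -> None:
        """Replace a wrongly displayed character with the operator's correction."""
        self._check(char_id)
        self.displayed[char_id] = corrected
        self.confirmed.add(char_id)  # assumption: a revision implies confirmation

    def confirm(self, char_id: int) -> None:
        """Mark a displayed character as checked and correct."""
        self._check(char_id)
        self.confirmed.add(char_id)

    def pending(self) -> List[int]:
        """IDs that have not yet been confirmed, in ascending order."""
        return sorted(set(self.displayed) - self.confirmed)
```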
As one embodiment of the present invention, on the basis of the above embodiments, the threshold is 90%.
Embodiment 8
As one embodiment of the present invention, as shown in Figure 2, a system for improving the character recognition rate in a format document comprises a code comparison device, a probability statistics device, and a probability-threshold comparison device, wherein:
The code comparison device is configured to compare the original character code of each book character in the format document with the universal standard character code of the same book character, to obtain a code comparison result indicating whether the two codes are identical or different.
The probability statistics device is configured to perform probability statistics on the code comparison results of a plurality of the book characters, to obtain a probability value that the book characters adopt the universal standard character code.
The probability-threshold comparison device is configured to compare the probability value with a threshold: if the probability value exceeds the threshold, each book character is displayed as the character obtained by looking up its original character code in the universal standard character code library. Otherwise, the character recognized by OCR for that book character is displayed directly.
Using probability statistics, the present invention chooses between displaying the character obtained by looking up the original character code in the universal standard character code library and displaying the character recognized by OCR, and thereby effectively improves the accuracy of text recognition.
Embodiment 9
As one embodiment of the present invention, on the basis of Embodiment 8, the system further comprises a glyph bitmap extraction device, an original character code extraction device, an OCR recognition device, and a universal standard character code mapping device, wherein:
The glyph bitmap extraction device is configured to extract the glyph bitmap of each book character in the format document.
The original character code extraction device is configured to extract the original character code of each book character in the format document.
The OCR recognition device is configured to perform OCR on the extracted glyph bitmap to obtain a recognized character.
The universal standard character code mapping device is configured to look up the recognized character in the universal standard character code library to obtain its universal standard character code.
Obtaining the recognized character by OCR makes it straightforward to then obtain the universal standard character code. The OCR recognition device is a commercially available general-purpose module and therefore has the advantage of low cost.
Embodiment 10
As one embodiment of the present invention, on the basis of Embodiment 9, the system further comprises a book character screening device, configured to screen out, as the book characters, the characters in the format document that have an original character code. The book character screening device reduces the number of characters for which a glyph bitmap must be extracted, which shortens the running time of the invention and improves operating efficiency.
Embodiment 11
As one embodiment of the present invention, on the basis of Embodiment 10, the system further comprises an ID numbering device, configured to assign an ID number to each book character. The ID numbering device makes it easy to keep each book character in accurate one-to-one correspondence with its original character code and its recognized character.
Embodiment 12
As one embodiment of the present invention, on the basis of Embodiment 11, the system further comprises an original character code table establishing device, configured to establish an original character code table and to store the ID of each book character together with its corresponding original character code in the original character code table. The original character code table establishing device manages the original character codes effectively, which reduces the running time of the invention.
Embodiment 13
As one embodiment of the present invention, on the basis of Embodiment 11 or Embodiment 12, the system further comprises a standard character code table establishing device, configured to establish a standard character code table and to store the ID of each book character together with its corresponding standard character code in the standard character code table. The standard character code table establishing device manages the standard character codes effectively, which reduces the running time of the invention.
Embodiment 14
As one embodiment of the present invention, on the basis of any one of Embodiments 8 to 13, the system further comprises an editable interface establishing device, configured to establish an editable interface for displaying, modifying, and confirming the characters. The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected by manual intervention.
As one embodiment of the present invention, on the basis of the above embodiments, the threshold is 90%.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.