Method and system for improving the character recognition rate in a format document
Technical field
The present invention relates to a method for improving the text recognition rate, and specifically to a method and system for improving the character recognition rate in a format document.
Background technology
To ensure a good reading experience, the typeset documents that publishers of books and periodicals issue before printing are generally format documents. A format document is a file that explicitly records information such as the position, glyph bitmap, font, size and color of each character; it may also record the code of each character. Because a format document describes the glyph bitmaps and the relative positions between characters, it is stable: in any computer environment, the format document a reader views has the same visual appearance as the printed book or periodical. The most common type of format document is PDF.
Although some format documents record character codes, display is generally driven by the glyph bitmaps rather than by the codes. When text is extracted from a format document, the recorded character codes may follow either a universal standard encoding or a custom encoding. For any particular format document, the encoding scheme of its characters is therefore unknown, and the text cannot reliably be obtained from the codes.
For this reason, the prior art generally uses OCR (Optical Character Recognition) to extract the characters in a format document. However, because the recognition rate of OCR is itself limited, text recognized by OCR often has a high error rate, which impairs reading.
Summary of the invention
The technical problem to be solved by the present invention is therefore the high error rate that arises in the prior art when characters are recognized with OCR. To this end, the invention provides a method and system for improving the character recognition rate in a format document.
To solve the above technical problem, the present invention provides a method for improving the character recognition rate in a format document, including the following steps:
For each predetermined character in the format document, comparing the original character code of that character with its universal standard character code to obtain a coding comparison result indicating whether the two codes are the same or different;
Performing probability statistics on the coding comparison results of multiple predetermined characters to obtain a probability value that the predetermined characters use the universal standard character encoding;
Comparing the probability value with a threshold: if the probability value is greater than the threshold, each predetermined character is displayed as the character obtained by looking up its original character code in the universal standard character code library; otherwise, the character recognized by OCR for the predetermined character is displayed directly.
In the method for improving the character recognition rate in a format document, before the step of obtaining the coding comparison result, the method further includes the following steps:
Extracting the glyph bitmap of each predetermined character in the format document;
Extracting the original character code of each predetermined character in the format document;
Performing OCR on the glyph bitmap to obtain a recognized character;
Looking up the recognized character in the universal standard character code library to obtain the universal standard character code.
In the method for improving the character recognition rate in a format document, before the step of extracting the original character code, the method further includes the following step:
Screening out the characters in the format document that have an original character code as the predetermined characters.
In the method for improving the character recognition rate in a format document, after the step of screening out the characters that have an original character code as the predetermined characters, the method further includes the following step:
Assigning an ID number to each predetermined character.
In the method for improving the character recognition rate in a format document, after the step of extracting the original character code of each predetermined character in the format document, the method further includes the following step:
Establishing an original character code table, and storing the ID of each predetermined character together with its corresponding original character code in the original character code table.
In the method for improving the character recognition rate in a format document, after the step of obtaining the universal standard character code, the method further includes the following step:
Establishing a standard character code table, and storing the ID of each predetermined character together with its corresponding standard character code in the standard character code table.
In the method for improving the character recognition rate in a format document, before the probability value is compared with the threshold and the corresponding operation is carried out, the method further includes the following step:
Establishing an editable interface for displaying, modifying and confirming the characters.
A system for improving the character recognition rate in a format document includes a coding comparison device, a probability statistics device, and a probability value and threshold comparison device, wherein:
The coding comparison device is used to compare, for each predetermined character in the format document, the original character code of that character with its universal standard character code to obtain a coding comparison result indicating whether the two codes are the same or different;
The probability statistics device is used to perform probability statistics on the coding comparison results of multiple predetermined characters to obtain a probability value that the predetermined characters use the universal standard character encoding;
The probability value and threshold comparison device is used to compare the probability value with a threshold: if the probability value is greater than the threshold, each predetermined character is displayed as the character obtained by looking up its original character code in the universal standard character code library; otherwise, the character recognized by OCR for the predetermined character is displayed directly.
The system for improving the character recognition rate in a format document further includes a glyph bitmap extraction device, an original character code extraction device, an OCR recognition device and a universal standard character code mapping device, wherein:
The glyph bitmap extraction device is used to extract the glyph bitmap of each predetermined character in the format document;
The original character code extraction device is used to extract the original character code of each predetermined character in the format document;
The OCR recognition device is used to perform OCR on the extracted glyph bitmap to obtain a recognized character;
The universal standard character code mapping device is used to look up the recognized character in the universal standard character code library to obtain the universal standard character code.
The system for improving the character recognition rate in a format document further includes a predetermined character screening device, which is used to screen out the characters in the format document that have an original character code as the predetermined characters.
The system for improving the character recognition rate in a format document further includes an ID numbering device, which is used to assign an ID number to each predetermined character.
The system for improving the character recognition rate in a format document further includes an original character code table establishing device, which is used to establish an original character code table and to store the ID of each predetermined character together with its corresponding original character code in the original character code table.
The system for improving the character recognition rate in a format document further includes a standard character code table establishing device, which is used to establish a standard character code table and to store the ID of each predetermined character together with its corresponding standard character code in the standard character code table.
The system for improving the character recognition rate in a format document further includes an editable interface establishing device, which is used to establish an editable interface for displaying, modifying and confirming the characters.
Compared with the prior art, the above technical solution of the present invention has the following advantages:
1. In the method and system for improving the character recognition rate in a format document according to the present invention, the original character code of each predetermined character in the format document is compared with its universal standard character code to obtain a coding comparison result indicating whether the two codes are the same or different; probability statistics are performed on multiple coding comparison results to obtain a probability value; and the probability value is compared with a threshold. If the probability value is greater than the threshold, the characters obtained by looking up the original character codes in the universal standard character code library are displayed; otherwise, the characters recognized by OCR are displayed. Because the invention uses probability statistics to choose between these two sources of text, it effectively improves the accuracy of character recognition.
2. In the method and system for improving the character recognition rate in a format document according to the present invention, the following steps are further included before the step of obtaining the coding comparison result: extracting the glyph bitmap of each predetermined character in the format document; extracting the original character code of each predetermined character in the format document; performing OCR on the glyph bitmap to obtain a recognized character; and looking up the recognized character in the universal standard character code library to obtain the universal standard character code. The recognized character obtained through OCR makes it straightforward to obtain the universal standard character code. The OCR recognition device is a commercially available general-purpose module and has the advantage of low cost.
3. In the method and system for improving the character recognition rate in a format document according to the present invention, before the step of extracting the original character code, the characters in the format document that have an original character code are screened out as the predetermined characters. Screening reduces the number of characters whose glyph bitmaps need to be extracted, which effectively shortens the run time and improves operating efficiency. The invention also includes a step of assigning an ID number to each predetermined character; ID numbering makes it more convenient and accurate to keep each predetermined character in one-to-one correspondence with its original character code and its recognized character. The invention further includes steps of establishing an original character code table and a standard character code table. The original character code table manages the original character codes effectively, the standard character code table manages the standard character codes effectively, and both can reduce the run time of the invention.
4. The method and system for improving the character recognition rate in a format document according to the present invention further include a step of establishing an editable interface. The editable interface can display, modify and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
Description of the drawings
To make the content of the present invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which:
Fig. 1 is a flow chart of a method for improving the character recognition rate in a format document according to one embodiment of the present invention;
Fig. 2 is a structural diagram of a system for improving the character recognition rate in a format document according to one embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Embodiment 1
As an embodiment of the present invention, as shown in Fig. 1, a method for improving the character recognition rate in a format document includes the following steps:
For each predetermined character in the format document, the original character code of that character is compared with its universal standard character code to obtain a coding comparison result indicating whether the two codes are the same or different.
Probability statistics are performed on the coding comparison results of multiple predetermined characters to obtain a probability value that the predetermined characters use the universal standard character encoding.
The probability value is compared with a threshold. If the probability value is greater than the threshold, each predetermined character is displayed as the character obtained by looking up its original character code in the universal standard character code library. Otherwise, the character recognized by OCR for the predetermined character is displayed directly.
By means of probability statistics, the present invention chooses whether to display the characters obtained by looking up the original character codes in the universal standard character code library or the characters recognized by OCR. When the predetermined characters use the universal standard character encoding, the characters obtained from the original character codes replace the OCR-recognized characters. Because characters obtained from the original character codes through the universal standard character code library are more accurate than OCR results, the invention improves the accuracy of text recognition overall.
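The three steps of this embodiment can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: the `CharRecord` structure, the function names and the integer representation of the codes are all assumptions made for the example, and the default threshold of 0.9 is borrowed from the 90% value mentioned in a later embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharRecord:
    """Hypothetical per-character record; field names are illustrative."""
    original_code: int   # code stored in the format document
    standard_code: int   # universal standard code of the OCR result
    decoded_char: str    # character the original code maps to in the standard code library
    ocr_char: str        # character recognized by OCR from the glyph bitmap


def match_probability(records: List[CharRecord]) -> float:
    """Fraction of characters whose original code equals the standard code."""
    if not records:
        return 0.0
    same = sum(1 for r in records if r.original_code == r.standard_code)
    return same / len(records)


def select_text(records: List[CharRecord], threshold: float = 0.9) -> str:
    """Return decoded characters if the probability exceeds the threshold, else OCR output."""
    if match_probability(records) > threshold:
        return "".join(r.decoded_char for r in records)
    return "".join(r.ocr_char for r in records)
```

In this sketch, a document in which 19 of 20 characters agree has a probability of 0.95, so the decoded characters are shown and the one character that OCR misread is corrected by its original code. A document with a custom encoding produces few agreements, and the OCR output is shown instead.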
Embodiment 2
As an embodiment of the present invention, on the basis of Embodiment 1, the following steps are further included before the step of obtaining the coding comparison result:
The glyph bitmap of each predetermined character in the format document is extracted.
OCR is performed on the extracted glyph bitmap to obtain a recognized character.
The recognized character is looked up in the universal standard character code library to obtain the universal standard character code. Here, the universal standard character encoding is the national standard GB2312.
The original character code of each predetermined character in the format document is extracted.
The steps of obtaining the universal standard character code and the original character code may be performed simultaneously or one after the other in either order: the universal standard character code may be obtained first and the original character code second, or the other way round. The purpose of the invention is achieved as long as both codes are available before the comparison.
Through OCR, the present invention obtains the recognized character, from which the universal standard character code can then be obtained.
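As a hedged illustration of looking up a recognized character in the universal standard character code library, the sketch below uses Python's built-in `gb2312` codec as a stand-in for that library. The function name `standard_code` is an assumption for the example, as is the choice to return `None` for characters that GB2312 cannot represent.

```python
from typing import Optional


def standard_code(ocr_char: str) -> Optional[bytes]:
    """Return the GB2312 code of a recognized character.

    Returns None when GB2312 has no code for the character, so that the
    caller can treat it as a mismatch instead of failing.
    """
    try:
        return ocr_char.encode("gb2312")
    except UnicodeEncodeError:
        return None
```

For example, `standard_code("中")` yields the two bytes `D6 D0`, which can then be compared directly with the original character code extracted from the document.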
Embodiment 3
As an embodiment of the present invention, on the basis of Embodiment 2, the following step is further included before the step of extracting the original character code:
The characters in the format document that have an original character code are screened out as the predetermined characters. Screening reduces the number of characters whose glyph bitmaps need to be extracted, which effectively shortens the run time of the invention and improves operating efficiency.
Embodiment 4
As an embodiment of the present invention, on the basis of Embodiment 3, the following step is further included after the step of screening out the characters that have an original character code as the predetermined characters:
An ID number is assigned to each predetermined character. ID numbering makes it more convenient and accurate to keep each predetermined character in one-to-one correspondence with its original character code and its recognized character.
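The screening step of Embodiment 3 and the ID numbering step of this embodiment might look like the following sketch. The `PageChar` type, its fields and the choice of sequential IDs starting at 1 are assumptions for illustration; the source does not specify how IDs are formed.

```python
from typing import Dict, List, NamedTuple, Optional


class PageChar(NamedTuple):
    """Hypothetical character entry read from a format document."""
    glyph_bitmap: bytes
    original_code: Optional[bytes]  # None when the document records no code


def screen_and_number(chars: List[PageChar]) -> Dict[int, PageChar]:
    """Keep only characters that have an original code and assign each an ID."""
    predetermined = [c for c in chars if c.original_code is not None]
    # Sequential IDs starting at 1 are an arbitrary choice for this sketch.
    return {char_id: c for char_id, c in enumerate(predetermined, start=1)}
```

Characters without an original code are dropped before any glyph bitmap is processed, which is where the saving in run time described above comes from.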
Embodiment 5
As an embodiment of the present invention, on the basis of Embodiment 4, the following step is further included after the step of extracting the original character code of each predetermined character in the format document:
An original character code table is established, and the ID of each predetermined character is stored in it together with the corresponding original character code. The original character code table manages the original character codes effectively and can reduce the run time of the invention.
Embodiment 6
As an embodiment of the present invention, on the basis of Embodiment 4 or Embodiment 5, the following step is further included after the step of obtaining the universal standard character code:
A standard character code table is established, and the ID of each predetermined character is stored in it together with the corresponding standard character code. The standard character code table manages the standard character codes effectively and can reduce the run time of the invention.
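One way to picture the two tables is as dictionaries keyed by character ID, so that the comparison of codes becomes a lookup per ID. The sketch below assumes this representation; the function name `compare_tables` and the byte values in the usage note are illustrative only.

```python
from typing import Dict, Optional


def compare_tables(
    original_table: Dict[int, bytes],
    standard_table: Dict[int, Optional[bytes]],
) -> Dict[int, bool]:
    """For each ID, report whether the original code equals the standard code.

    An ID that is missing from the standard table, or whose standard code is
    None, is reported as a mismatch.
    """
    return {
        char_id: standard_table.get(char_id) == code
        for char_id, code in original_table.items()
    }
```

With `original_table = {1: b"\xd6\xd0", 2: b"\x01\x02"}` and `standard_table = {1: b"\xd6\xd0", 2: b"\xce\xc4"}`, the result marks ID 1 as a match and ID 2 as a mismatch.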
Embodiment 7
As an embodiment of the present invention, on the basis of the above embodiments, the following step is further included before the probability value is compared with the threshold and the corresponding operation is carried out:
An editable interface is established for displaying, modifying and confirming the characters.
The editable interface can display, modify and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
As an embodiment of the present invention, on the basis of the above embodiments, the threshold value is 90%.
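The role of the 90% threshold can be made concrete with a short worked calculation. The symbols below are assumptions introduced for illustration: $N$ is the number of predetermined characters compared, $n_{\text{same}}$ is the number whose original character code equals the universal standard character code, $p$ is the resulting probability value and $T$ is the threshold. The document sizes in the examples are invented.

```latex
p = \frac{n_{\text{same}}}{N}, \qquad T = 0.9

% Example 1: N = 200, n_same = 190
p = \frac{190}{200} = 0.95 > T
\quad\Rightarrow\quad \text{display the characters obtained from the original codes}

% Example 2: N = 200, n_same = 120
p = \frac{120}{200} = 0.60 \le T
\quad\Rightarrow\quad \text{display the characters recognized by OCR}
```

Because the decision requires the probability value to be strictly greater than the threshold, a document with exactly 90% agreement falls back to the OCR output.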
Embodiment 8
As an embodiment of the present invention, as shown in Fig. 2, a system for improving the character recognition rate in a format document includes a coding comparison device, a probability statistics device, and a probability value and threshold comparison device, wherein:
The coding comparison device is used to compare, for each predetermined character in the format document, the original character code of that character with its universal standard character code to obtain a coding comparison result indicating whether the two codes are the same or different.
The probability statistics device is used to perform probability statistics on the coding comparison results of multiple predetermined characters to obtain a probability value that the predetermined characters use the universal standard character encoding.
The probability value and threshold comparison device is used to compare the probability value with a threshold. If the probability value is greater than the threshold, each predetermined character is displayed as the character obtained by looking up its original character code in the universal standard character code library. Otherwise, the character recognized by OCR for the predetermined character is displayed directly.
By means of probability statistics, the present invention chooses whether to display the characters obtained by looking up the original character codes in the universal standard character code library or the characters recognized by OCR, and therefore effectively improves the accuracy of text recognition.
Embodiment 9
As an embodiment of the present invention, on the basis of Embodiment 8, the system further includes a glyph bitmap extraction device, an original character code extraction device, an OCR recognition device and a universal standard character code mapping device, wherein:
The glyph bitmap extraction device is used to extract the glyph bitmap of each predetermined character in the format document.
The original character code extraction device is used to extract the original character code of each predetermined character in the format document.
The OCR recognition device is used to perform OCR on the extracted glyph bitmap to obtain a recognized character.
The universal standard character code mapping device is used to look up the recognized character in the universal standard character code library to obtain the universal standard character code.
Through OCR, the present invention obtains the recognized character, from which the universal standard character code can then be obtained. The OCR recognition device is a commercially available general-purpose module and has the advantage of low cost.
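The way the devices of this embodiment feed the coding comparison can be sketched as a small pipeline. Because the OCR recognition device is described as an off-the-shelf module, the sketch takes it as an injected callable rather than naming any particular OCR library. The function name `build_comparisons`, the tuple layout and the use of Python's `gb2312` codec as the standard code library (following the GB2312 choice in the method embodiments) are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple


def build_comparisons(
    chars: List[Tuple[bytes, bytes]],
    ocr: Callable[[bytes], str],
) -> List[Tuple[str, bool]]:
    """For each (glyph_bitmap, original_code) pair, run OCR and compare codes.

    Returns a list of (recognized_character, codes_match) tuples.
    """
    results: List[Tuple[str, bool]] = []
    for glyph_bitmap, original_code in chars:
        recognized = ocr(glyph_bitmap)  # OCR recognition device (injected)
        try:
            # Standard code mapping device: look up the GB2312 code.
            standard: Optional[bytes] = recognized.encode("gb2312")
        except UnicodeEncodeError:
            standard = None  # not representable in GB2312: treated as a mismatch
        results.append((recognized, standard == original_code))
    return results
```

Passing the OCR step in as a parameter keeps the sketch independent of any specific OCR product and makes the pipeline easy to exercise with a stub.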
Embodiment 10
As an embodiment of the present invention, on the basis of Embodiment 9, the system further includes a predetermined character screening device, which is used to screen out the characters in the format document that have an original character code as the predetermined characters. The predetermined character screening device reduces the number of characters whose glyph bitmaps need to be extracted, which effectively shortens the run time of the invention and improves operating efficiency.
Embodiment 11
As an embodiment of the present invention, on the basis of Embodiment 10, the system further includes an ID numbering device, which is used to assign an ID number to each predetermined character. The ID numbering device makes it more convenient and accurate to keep each predetermined character in one-to-one correspondence with its original character code and its recognized character.
Embodiment 12
As an embodiment of the present invention, on the basis of Embodiment 11, the system further includes an original character code table establishing device, which is used to establish an original character code table and to store the ID of each predetermined character together with its corresponding original character code in the original character code table. The original character code table establishing device manages the original character codes effectively and can reduce the run time of the invention.
Embodiment 13
As an embodiment of the present invention, on the basis of Embodiment 11 or Embodiment 12, the system further includes a standard character code table establishing device, which is used to establish a standard character code table and to store the ID of each predetermined character together with its corresponding standard character code in the standard character code table. The standard character code table establishing device manages the standard character codes effectively and can reduce the run time of the invention.
Embodiment 14
As an embodiment of the present invention, on the basis of any one of Embodiments 8 to 13, the system further includes an editable interface establishing device, which is used to establish an editable interface for displaying, modifying and confirming the characters. The editable interface can display, modify and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
As an embodiment of the present invention, on the basis of the above embodiments, the threshold value is 90%.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) that contain computer-usable program code.
The present invention is described with reference to flow charts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and any combination of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flow chart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.