CN101441713A - Optical character recognition method and apparatus of PDF document - Google Patents

Info

Publication number
CN101441713A
Authority
CN
China
Prior art keywords
page
data
pages
image
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101776734A
Other languages
Chinese (zh)
Other versions
CN101441713B (en
Inventor
刘迎建
刘昌平
江世盛
丁迎
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hanwang Yingyan Technology Co.,Ltd.
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN2007101776734A priority Critical patent/CN101441713B/en
Publication of CN101441713A publication Critical patent/CN101441713A/en
Application granted granted Critical
Publication of CN101441713B publication Critical patent/CN101441713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an optical character recognition (OCR) method for PDF files. The method comprises: determining a target page in a PDF file and acquiring the page size information of the target page; generating an image region of corresponding size in memory according to the page size information and preset resolution information; acquiring the page description instructions of the target page and extracting the page content data and position information from them; drawing the page content data at the corresponding position in the image region according to the position information; and performing optical character recognition on the page content data to obtain a recognition result. The method performs OCR directly on the PDF file without repeated switching among different software programs, which simplifies the user's operation, reduces operation time and improves the user experience.

Description

Optical character recognition method and apparatus for PDF files
Technical field
The present invention relates to the field of optical character recognition, and in particular to an optical character recognition method for PDF files and an optical character recognition apparatus for PDF files.
Background
Optical character recognition (OCR) is a technique that uses character recognition to convert images of characters into the computer's internal character codes. At present, the file formats that OCR technology can recognize are limited to image file formats, that is, files in formats such as tif, bmp or jpg.
A PDF (Portable Document Format) file is an electronic file format used to describe page content. PDF files are independent of the operating system platform (they can be used on Windows, Unix and Mac OS alike), and PDF has become a preferred document format for distributing electronic documents and disseminating digital information on the Internet. However, because a PDF file is not an image file, existing OCR systems cannot recognize it directly. The PDF file must first be converted by third-party software into an image file format that the OCR system can recognize, and only then can the OCR system perform recognition. For example, the user selects the region to be recognized with the snapshot tool of PDF processing software (such as Acrobat) and saves it as an image file through a copy-and-paste operation.
Clearly, performing OCR on a PDF file in this way requires switching back and forth between different programs; the operation is complicated and time-consuming, and the user experience is poor. Those skilled in the art therefore urgently need an OCR processing method and apparatus that can recognize PDF files directly, without repeated switching between multiple programs.
Summary of the invention
The technical problem to be solved by the present invention is to provide an optical character recognition method that can recognize PDF files directly. With this method, OCR can be performed on PDF files simply and efficiently, giving the user a better experience.
The present invention also provides an optical character recognition apparatus that can recognize PDF files, so as to ensure the implementation and practical application of the above method.
To solve the above technical problem, an embodiment of the invention discloses an optical character recognition method for PDF files, comprising:
determining a target page in a PDF file, and acquiring the page size information of the target page;
generating an image region of corresponding size in memory according to the page size information and preset resolution information;
acquiring the page description instructions of the target page, and extracting the page content data and position information from the page description instructions;
drawing the page content data at the corresponding position in the image region according to the position information;
performing optical character recognition on the page content data to obtain a recognition result.
Preferably, the page content data comprises image data, graphic data and/or character data, and the drawing step further comprises:
decoding the image data into a bitmap, and drawing the bitmap at the corresponding position in the image region;
and/or, drawing the graphic data directly at the corresponding position in the image region;
and/or, generating a character image according to the attribute information of the character data, and drawing the character image at the corresponding position in the image region.
Preferably, there are multiple page description instructions, and the drawing step further comprises:
if the target page has a next page description instruction, continuing to extract the page content data and position information from the next page description instruction.
Preferably, before the step of extracting the page content data and position information, the method further comprises:
if the page description instructions have been compression-encoded, decoding the page description instructions.
Preferably, before the target page is determined, the method further comprises:
determining the target PDF file.
Preferably, the target page is determined by the following steps:
acquiring the page number information of the PDF file;
if the currently specified page number is within the range of the page number information, determining that the page corresponding to the page number is the target page.
Preferably, the method further comprises:
saving the page content data in the image region as an image file.
Preferably, the method further comprises:
outputting the recognition result in a specified file format.
An embodiment of the invention also discloses an optical character recognition apparatus for PDF files, comprising:
a target page determining unit, configured to determine a target page in a PDF file;
a first acquiring unit, configured to acquire the page size information of the target page;
a memory allocation unit, configured to generate an image region of corresponding size in memory according to the page size information and preset resolution information;
a second acquiring unit, configured to acquire the page description instructions of the target page;
an extraction unit, configured to extract the page content data and position information from the page description instructions;
a drawing execution unit, configured to draw the page content data at the corresponding position in the image region according to the position information;
a recognition unit, configured to perform optical character recognition on the page content data to obtain a recognition result.
Preferably, the page content data comprises image data, graphic data and/or character data, and the drawing execution unit further comprises:
an image drawing sub-unit, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;
and/or, a graphic drawing sub-unit, configured to draw the graphic data directly at the corresponding position in the image region;
and/or, a character drawing sub-unit, configured to generate a character image according to the attribute information of the character data and draw the character image at the corresponding position in the image region.
Preferably, there are multiple page description instructions, and the drawing execution unit further comprises:
a loop sub-unit, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a next page description instruction.
Preferably, the apparatus further comprises:
a data decoding unit, configured to decode the page description instructions when they have been compression-encoded.
Preferably, the apparatus further comprises:
a target file determining unit, configured to determine the target PDF file.
Preferably, the target file determining unit further comprises:
a page number acquiring sub-unit, configured to acquire the page number information of the PDF file;
a locating sub-unit, configured to determine that the page corresponding to the currently specified page number is the target page when that page number is within the range of the page number information.
Preferably, the apparatus further comprises:
a saving unit, configured to save the page content data in the image region as an image file.
Preferably, the apparatus further comprises:
a specified output unit, configured to output the recognition result in a specified file format.
Compared with the prior art, the embodiments of the invention have the following advantages:
First, the present invention parses the PDF file to obtain the page size information of a page, calculates the pixel height and width of the output image from this page size information and the preset resolution information, allocates image storage space of corresponding size in memory for the output image, and then, by parsing the page description instructions of the target page, draws the page content data into the allocated image storage space. OCR is thus performed directly on the PDF file, without repeated switching between different programs, which simplifies the user's operation, reduces operation time and gives the user a better experience;
Furthermore, the present invention can output the processed recognition result in a specified file format, so that the content of the PDF file can be edited in that format. This makes editing PDF content considerably more flexible and further improves the user experience.
Description of drawings
Fig. 1 is a hierarchical structure diagram of a PDF file;
Fig. 2 is a flowchart of embodiment 1 of the optical character recognition method for PDF files of the present invention;
Fig. 3 is a flowchart of the image conversion and drawing process for a PDF file of the present invention;
Fig. 4 is a flowchart of embodiment 2 of the optical character recognition method for PDF files;
Fig. 5 is a structural block diagram of embodiment 1 of the optical character recognition apparatus for PDF files of the present invention;
Fig. 6 is a structural block diagram of embodiment 2 of the optical character recognition apparatus for PDF files of the present invention;
Fig. 7 is a flowchart of the OCR process for a PDF file using the preferred embodiment shown in Fig. 6.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
From the perspective of how PDF files are generated, there are two ways to produce a PDF file. The first uses optical scanning to convert existing paper documents, books and the like into images and then generates the PDF file from those images; the characters, graphics and other data exist in image form. The second uses an application program and a PDF printer (a kind of virtual printing software) to convert the computer's internal codes for characters and graphic data into the internal representation of PDF; the characters, graphics and other data exist in PDF-encoded form.
From the perspective of data structure, the data in a PDF file is organized as PDF objects. Specifically, PDF objects fall into two classes, direct objects and indirect objects. Direct objects comprise the Boolean, Number, String, Name, Array, Dictionary, Stream and Null types. An indirect object is a direct object that has been given a label so that other objects can reference it.
From the perspective of logical structure, a PDF file can be described as a hierarchy of PDF objects containing a unique root object (Catalog); Fig. 1 shows the hierarchical structure of a PDF file. The root object contains the bookmark tree and the page tree of the PDF document. The bookmark tree contains a number of bookmark items. The page item is the most important object in a PDF file: it contains the page description instructions, that is, the information on how to display the page, such as the fonts used, the content (text, pictures and so on) and the size of the page. Sub-items may also be references to other objects.
From the perspective of storage structure, a standard PDF file consists of four parts: the file header (Header), the file body (Body), the cross-reference table (Cross-reference Table) and the file trailer (Trailer).
The header specifies the version of the PDF specification that the file follows; for example, "%PDF-1.3" indicates version 1.3. The body contains a series of indirect objects that describe the pages of the file. The cross-reference table records the position of each indirect object in the file. The trailer records the starting position of the cross-reference table in the file, the indirect object number of the root object (Catalog), and the end-of-file marker.
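The header and trailer fields described above can be read with a few lines of code. The following is a minimal sketch, assuming the file has been loaded into memory as bytes and uses a classic (uncompressed) trailer with the `startxref` and `/Root` keywords; the function names and the toy file are hypothetical and serve only as an illustration.

```python
import re

def read_pdf_version(data: bytes) -> str:
    """Return the version string from a header such as b'%PDF-1.3'."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF header")
    return match.group(1).decode("ascii")

def read_startxref(data: bytes) -> int:
    """Return the byte offset of the cross-reference table recorded in the trailer."""
    index = data.rfind(b"startxref")  # use the last occurrence in the file
    match = re.match(rb"startxref\s+(\d+)", data[index:]) if index >= 0 else None
    if match is None:
        raise ValueError("missing startxref entry")
    return int(match.group(1))

def read_root_reference(data: bytes) -> tuple:
    """Return (object number, generation number) of the root object (Catalog)."""
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data)
    if match is None:
        raise ValueError("missing /Root entry")
    return int(match.group(1)), int(match.group(2))

# Toy file built for this sketch only: one Catalog object, a cross-reference
# table, and a trailer whose startxref value is the real offset of the table.
body = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
trailer = b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % len(body)
sample = body + xref + trailer

version = read_pdf_version(sample)
xref_offset = read_startxref(sample)
root = read_root_reference(sample)
```

With these three values, a parser would seek to `xref_offset`, look up the root object in the cross-reference table, and descend from there to the page tree.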
For example, an illustrative listing of a PDF file is:
[Listing of a sample PDF file, supplied as images A200710177673D00101 to A200710177673D00131 in the original publication]
Based on the above structural analysis of PDF files, one of the core ideas of the embodiments of the invention is as follows. The pixel height and width of the output image are calculated from the page size information of the target page, obtained by parsing the PDF file, and from the preset resolution information (usually expressed as the number of pixels per inch). Image storage space of corresponding size is then allocated in computer memory for the output image, and the character, graphic and image data are drawn into the allocated image storage space according to the page description instructions obtained by parsing the target page. In this way, OCR of PDF files can be performed simply and quickly, giving the user a better experience.
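The size calculation and memory allocation described above can be sketched as follows. The sketch assumes that the page size is given in points (1/72 inch, the default PDF unit) and that the image region is an 8-bit grayscale buffer with a white background; the names `page_pixel_size` and `allocate_image` are hypothetical.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch

def page_pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Convert a page size in points to pixel dimensions at the given resolution."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px

def allocate_image(width_px: int, height_px: int, background: int = 255) -> bytearray:
    """Allocate one byte per pixel, initialised to the background value (assumed white)."""
    return bytearray([background]) * (width_px * height_px)

# A 612 x 792 point page (8.5 x 11 inches) at a preset resolution of 300 pixels per inch.
width_px, height_px = page_pixel_size(612, 792, 300)
image = allocate_image(width_px, height_px)
```

A higher preset resolution gives the recognizer more pixels per character at the cost of a quadratically larger buffer, which is one reason the resolution is left configurable.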
Referring to Fig. 2, a flowchart of embodiment 1 of the optical character recognition method for PDF files of the present invention is shown, which may specifically comprise the following steps:
Step 201: determine a target page in the PDF file, and acquire the page size information of the target page;
Step 202: generate an image region of corresponding size in memory according to the page size information and preset resolution information;
Step 203: acquire the page description instructions of the target page, and extract the page content data and position information from the page description instructions;
Step 204: draw the page content data at the corresponding position in the image region according to the position information;
Step 205: perform optical character recognition on the page content data to obtain a recognition result.
It will be understood that, in this embodiment, the page size information and the page description instructions of a page can be obtained by parsing the logical structure and storage structure of the PDF file. Specifically, parsing starts from the file trailer: the indirect object number of the root object and the position of the cross-reference table (that is, the byte offset in the file at which the cross-reference table begins) are extracted, and then, using the object index function of the cross-reference table, the objects are parsed level by level starting from the root object.
In practice, the preset resolution information may be set by the user, may be a system default, or may be obtained in some other way; the present invention does not limit this.
The current PDF specification contains more than 70 page description instructions, covering descriptions of characters, graphics, images and other data objects together with their style, position and size information. In this embodiment, the page content data may therefore comprise image data, graphic data and/or character data, in which case step 204 of drawing the page content data may further comprise the following sub-steps:
Sub-step S41: decode the image data into a bitmap, and draw the bitmap at the corresponding position in the image region;
and/or, sub-step S42: draw the graphic data directly at the corresponding position in the image region;
and/or, sub-step S43: generate a character image according to the attribute information of the character data, and draw the character image at the corresponding position in the image region.
To help those skilled in the art better understand this embodiment, the parsing of specific page description instructions from the illustrative PDF file listing above is described below as an example. Suppose the PDF page description instructions obtained from the object "6 0 obj" in the listing are as follows:
BT
/F0 48.000 Tf
72.000 576.000 Td
(Hello World) Tj
ET
The above page description instructions are parsed as follows:
(1) "BT" begins a character (text) object; processing must perform initialization such as resetting the initial coordinate transformation parameters;
(2) "/F0 48.000 Tf" selects the font identified by the name F0 in this file, with a font scaling factor of 48.0. In the file, the font name for identifier F0 is "Times-Roman" and the character encoding name is "WinAnsiEncoding"; processing loads the corresponding font file according to the font name;
(3) "72.000 576.000 Td" takes the lower left corner of the PDF page as the coordinate origin and moves the current coordinate to the position 72.0 points horizontally and 576.0 points vertically from it;
(4) "(Hello World) Tj" outputs the character sequence "Hello World". For each character, the corresponding character representation is looked up in the loaded font file, and a character image is generated and stored in the page image region in memory;
(5) "ET" ends the character object.
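The five parsing steps above can be illustrated with a small interpreter. This is a minimal sketch that handles only the BT, Tf, Td, Tj and ET operators of the example, assumes literal strings without nested parentheses or escapes, and stops at extracting the text attributes rather than rasterizing glyphs; the name `extract_text_runs` is hypothetical.

```python
import re

TOKEN = re.compile(
    rb"\([^()]*\)"                   # literal string without nested parentheses
    rb"|/[^\s()/]+"                  # name such as /F0
    rb"|[-+]?(?:\d+\.?\d*|\.\d+)"    # number such as 48.000
    rb"|[A-Za-z']+"                  # operator such as BT, Tf, Td, Tj, ET
)

def extract_text_runs(stream: bytes) -> list:
    """Return (text, font, size, x, y) for every Tj operator in the stream."""
    font, size, x, y = None, None, 0.0, 0.0
    operands, runs = [], []
    for token in TOKEN.findall(stream):
        text = token.decode("latin-1")
        if text.startswith("("):
            operands.append(text[1:-1])       # string operand
        elif text.startswith("/"):
            operands.append(text[1:])         # name operand
        elif text[0] in "+-.0123456789":
            operands.append(float(text))      # numeric operand
        else:
            if text == "BT":                  # begin text object: reset position
                x, y = 0.0, 0.0
            elif text == "Tf":                # select font and scaling factor
                font, size = operands
            elif text == "Td":                # move the text position
                x, y = x + operands[0], y + operands[1]
            elif text == "Tj":                # show a character sequence
                runs.append((operands[0], font, size, x, y))
            operands = []                     # operands belong to one operator only
    return runs

stream = b"BT\n/F0 48.000 Tf\n72.000 576.000 Td\n(Hello World) Tj\nET"
runs = extract_text_runs(stream)
```

Each returned tuple carries what the drawing step needs: the characters, the font identifier to load, the scaling factor, and the position in points measured from the lower left corner of the page.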
As the above example shows, a page may contain multiple page description instructions, in which case step 204 of drawing the page content data may further comprise the following sub-step:
Sub-step S44: if the target page has a next page description instruction, continue to extract the page content data and position information from the next page description instruction.
In addition, the PDF specification allows PDF objects to be compressed with a variety of data encoding and compression schemes. The schemes that PDF currently supports include ASCIIHex, ASCII85, LZW, RunLength, CCITT Group 3, CCITT Group 4, JPEG, JPEG 2000 and Flate. Therefore, if the page description instructions have been compression-encoded, the present invention may also include, before the PDF page description instructions are parsed, a step of decoding the page description instructions.
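The decoding step can be sketched for two of the listed schemes. The sketch assumes the stream's filter name has already been read from its dictionary (`FlateDecode` and `ASCIIHexDecode` are the filter names used by the PDF specification) and relies only on the Python standard library; the function name `decode_stream` is hypothetical, and the other schemes are deliberately left unimplemented.

```python
import zlib
from typing import Optional

def decode_stream(data: bytes, filter_name: Optional[str]) -> bytes:
    """Decode a content stream according to its filter; None means no encoding."""
    if filter_name is None:
        return data
    if filter_name == "FlateDecode":
        return zlib.decompress(data)            # Flate is zlib/deflate compression
    if filter_name == "ASCIIHexDecode":
        hex_text = data.split(b">", 1)[0]       # '>' marks the end of the data
        digits = b"".join(hex_text.split())     # whitespace between digits is ignored
        if len(digits) % 2:
            digits += b"0"                      # an odd final digit is padded with 0
        return bytes.fromhex(digits.decode("ascii"))
    raise NotImplementedError("filter not covered by this sketch: " + filter_name)

plain = b"BT /F0 48.000 Tf 72.000 576.000 Td (Hello World) Tj ET"
from_flate = decode_stream(zlib.compress(plain), "FlateDecode")
from_hex = decode_stream(b"42 54 20 45 54>", "ASCIIHexDecode")
```

Once decoded, the stream is ordinary page description text and can be passed to the instruction parser unchanged.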
Correspondingly, Fig. 3 shows a flowchart of the image conversion and drawing process for a PDF file of the present invention, which may specifically comprise the following steps:
Step 301: determine a target page in the PDF file, and acquire the page size information of the target page;
Step 302: generate an image region of corresponding size in memory according to the page size information and preset resolution information;
Step 303: acquire the page description instructions of the target page and judge whether they have been compression-encoded; if so, perform step 304; if not, perform step 305;
Step 304: decode the page description instructions, then perform step 305;
Step 305: extract the page content data and position information from the first page description instruction;
Step 306: judge whether the page content data is image data; if so, perform step 307; if not, perform step 308;
Step 307: decode the image data into a bitmap, draw the bitmap at the corresponding position in the image region, then perform step 308;
Step 308: judge whether the page content data is graphic data; if so, perform step 309; if not, perform step 310;
Step 309: draw the graphic data directly at the corresponding position in the image region, then perform step 310;
Step 310: judge whether the page content data is character data; if so, perform step 311; if not, perform step 312;
Step 311: generate a character image according to the attribute information of the character data, draw the character image at the corresponding position in the image region, then perform step 312;
Step 312: judge whether there is a next page description instruction; if so, perform step 313; if not, end the image drawing of the current page;
Step 313: continue to extract the page content data and position information from the next page description instruction, and return to step 306.
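The loop of steps 305 to 313 amounts to dispatching each extracted item to a drawing routine by its type. The following is a minimal sketch under several stated assumptions: the image region is a list of pixel rows, positions are already converted to pixel coordinates with the origin at the top left, image payloads are already decoded to bitmaps, graphics are limited to filled rectangles, and character rasterization is replaced by a one-pixel placeholder per character. All names are hypothetical.

```python
def blit(canvas, bitmap, left, top):
    """Copy a bitmap (rows of 0/1 pixels) into the canvas, clipping at the edges."""
    for r, row in enumerate(bitmap):
        for c, value in enumerate(row):
            if 0 <= top + r < len(canvas) and 0 <= left + c < len(canvas[0]):
                canvas[top + r][left + c] = value

def draw_image(canvas, item):
    # Steps 306-307: the payload stands for image data already decoded to a bitmap.
    blit(canvas, item["payload"], item["x"], item["y"])

def draw_graphic(canvas, item):
    # Steps 308-309: the payload is the (width, height) of a filled rectangle.
    width, height = item["payload"]
    blit(canvas, [[1] * width for _ in range(height)], item["x"], item["y"])

def draw_characters(canvas, item):
    # Steps 310-311: a real renderer would rasterize glyphs from the font file;
    # here each character becomes a single dark pixel as a placeholder.
    blit(canvas, [[1] * len(item["payload"])], item["x"], item["y"])

HANDLERS = {"image": draw_image, "graphic": draw_graphic, "character": draw_characters}

def draw_page(canvas, items):
    """Draw every item in turn (steps 312-313 loop until none remains)."""
    for item in items:
        handler = HANDLERS.get(item["kind"])
        if handler is not None:          # kinds not covered by the sketch are skipped
            handler(canvas, item)
    return canvas

canvas = [[0] * 8 for _ in range(4)]
items = [
    {"kind": "image", "x": 0, "y": 0, "payload": [[1, 0], [0, 1]]},
    {"kind": "graphic", "x": 3, "y": 1, "payload": (2, 2)},
    {"kind": "character", "x": 5, "y": 3, "payload": "Hi"},
]
draw_page(canvas, items)
```

A dispatch table keeps the three branches of the flowchart independent, so support for a further kind of page content could be added without touching the loop.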
Referring to Fig. 4, a flowchart of embodiment 2 of the optical character recognition method for PDF files is shown, which may specifically comprise the following steps:
Step 401: determine the target PDF file;
In practice, the corresponding PDF file can be located by obtaining the name of the file that the user asks to have recognized.
Step 402: determine a target page in the PDF file, and acquire the page size information of the target page;
PDF is a structured file format in which pages are independent of one another, so any page of a PDF file can be accessed at random by its page number. The corresponding page in the PDF file can therefore be determined from the page number specified by the user, in which case step 402 may further comprise the following sub-steps:
Sub-step 4021: acquire the page number information of the PDF file;
Sub-step 4022: judge whether the currently specified page number is within the range of the page number information; if so, perform sub-step 4023; if not, perform sub-step 4024;
Sub-step 4023: determine that the page corresponding to the page number is the target page;
Sub-step 4024: prompt the user that an error has occurred.
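Sub-steps 4021 to 4024 reduce to a range check on the requested page number. A minimal sketch follows, assuming pages are numbered from 1 and that the error prompt is modelled as an exception; the name `select_target_page` is hypothetical.

```python
def select_target_page(page_count: int, requested: int) -> int:
    """Return the requested page number if it lies within 1..page_count."""
    if 1 <= requested <= page_count:      # sub-steps 4022 and 4023
        return requested
    # Sub-step 4024: report the error to the caller, who can prompt the user.
    raise ValueError(
        "page %d is outside the valid range 1..%d" % (requested, page_count)
    )

target = select_target_page(page_count=10, requested=3)
```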
Step 403: generate an image region of corresponding size in memory according to the page size information and preset resolution information;
Step 404: acquire the page description instructions of the target page, and extract the page content data and position information from the page description instructions;
Step 405: draw the page content data at the corresponding position in the image region according to the position information;
At this point, the PDF page content data has been converted into corresponding image data in memory.
Step 406: perform optical character recognition on the page content data to obtain a recognition result;
Because the page content data has been converted into image data by the above steps, any optical character recognition method in the prior art can be used in this embodiment. For example, one optical character recognition method is:
(1) Image data preprocessing:
The image data obtained by converting the PDF page undergoes processing such as skew correction, deformation correction and binarization, to ensure the effectiveness of the subsequent recognition;
(2) Layout analysis:
This mainly performs operations such as locating text image regions, recognizing tables and understanding page information;
(3) Character recognition:
Character images in the image are converted into the computer's internal coded representation of the characters. Besides Chinese and English character recognition, support for traditional Chinese, Japanese and Korean can be added as required;
(4) User proofreading:
The user can correct misrecognitions that occurred during the recognition process.
Of course, the above method is only an example; those skilled in the art may equally use other optical character recognition methods, and the present invention does not limit this.
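Of the preprocessing operations named in item (1), binarization is the simplest to illustrate. The following is a minimal sketch that uses a fixed global threshold, which is an assumption made here for brevity; the text does not specify a thresholding method, and practical systems often choose the threshold adaptively. The name `binarize` is hypothetical.

```python
def binarize(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to 1 (ink) below the threshold, else 0 (background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]

gray = [
    [255, 250, 30],
    [128, 127, 0],
]
binary = binarize(gray)
```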
As is well known, PDF files are read-only. In some cases, however, the content of a PDF file needs to be edited, so this embodiment may further comprise:
Step 407: output the recognition result in a specified file format.
Based on the recognition result produced by OCR, layout restoration is performed first, that is, the recognized data is reorganized into structures such as text paragraphs and tables. The result is then exported as a file in the specified format, for example an editable file format such as RTF, DOC, TXT, EXCEL, WPS or UOML.
In this way, whether the PDF file was generated from scanned images or was generated by application software through conversion of the computer's internal codes, it can be converted into various easily editable file formats according to the size, position and style of the characters, graphics, images and other data on the original page. This effectively solves the problem that PDF content is hard to obtain and reuse, and greatly reduces the workload of manual file entry, page layout and proofreading.
Of course, the output in a specified file format may be implemented by any method in the prior art; the present invention does not limit this.
Preferably, this embodiment may further comprise the following step: saving the page content data in the image region as an image file.
The data may be kept in memory, or saved in any image format on a hard disk or other storage device for use by other programs; the present invention does not limit this.
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
With reference to figure 5, show the structured flowchart of the optical character recognition device embodiment 1 of a kind of pdf document of the present invention, specifically can comprise with lower unit:
Target pages determining unit 501 is used for determining target pages in pdf document;
First acquiring unit 502 is used to obtain the page size information of described target pages;
Memory Allocation unit 503 is used for according to described page size information and preset resolution information, generates the image-region of corresponding size in internal memory;
Second acquisition unit 504, the page-describing instruction that is used to obtain described target pages;
Extraction unit 505 is used for extracting the content of pages data and the positional information of described page-describing instruction;
Draw performance element 506, be used for drawing described content of pages data in the relevant position of described image-region according to described positional information;
Recognition unit 507 is used for described content of pages data are carried out optical character identification, obtains recognition result.
Preferably, described content of pages data can comprise view data, graph data and/or character data, described in this case drafting performance element 506 can comprise following subelement: (do not have S561-S564 in the accompanying drawing, whether will increase the diagram of relevant S561-S564)
Image rendering subelement S561 is used for converting described image data decoding to bitmap, draws described bitmap in the relevant position of described image-region;
And/or graphic plotting subelement S562 is used for directly drawing described graph data in the relevant position of described image-region;
And/or subelement S563 drawn in character, is used for generating character picture according to the attribute information of described character data, draws described character picture in the relevant position of described image-region.
In practice, the target page may contain multiple page-describing instructions. In this case the drawing execution unit 506 may further comprise a loop subunit S564, configured to continue extracting the page-content data and position information in the next page-describing instruction when the target page still has a next page-describing instruction.
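The cooperation of the image, graphic and character drawing subunits with the loop subunit can be sketched as below. This is a minimal sketch under stated assumptions: the instruction layout (a dict with the keys `kind`, `pos` and `data`), the function names, and the caller-supplied `decode_image` and `render_char` callables are all hypothetical stand-ins for a real image decoder and font rasteriser.

```python
def draw_bitmap(region, bitmap, x, y):
    """Copy a bitmap (list of rows of gray values) into the region at (x, y)."""
    for dy, row in enumerate(bitmap):
        for dx, value in enumerate(row):
            # Clip pixels that fall outside the image region.
            if 0 <= y + dy < len(region) and 0 <= x + dx < len(region[0]):
                region[y + dy][x + dx] = value


def draw_page(region, instructions, decode_image, render_char):
    """Loop over the page-describing instructions and draw each one.

    `instructions` is a hypothetical list of dicts with the keys
    "kind" ("image", "graphic" or "char"), "pos" (x, y) and "data".
    """
    for ins in instructions:  # loop subunit: continue while instructions remain
        x, y = ins["pos"]
        if ins["kind"] == "image":
            # Image data is first decoded into a bitmap, then drawn.
            draw_bitmap(region, decode_image(ins["data"]), x, y)
        elif ins["kind"] == "graphic":
            # Graphic data is drawn directly; here it is a list of
            # (dx, dy) offsets of dark pixels relative to the position.
            for dx, dy in ins["data"]:
                draw_bitmap(region, [[0]], x + dx, y + dy)
        elif ins["kind"] == "char":
            # Character data is turned into a character image, then drawn.
            draw_bitmap(region, render_char(ins["data"]), x, y)
        else:
            raise ValueError(f"unknown content kind: {ins['kind']!r}")
    return region
```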
In addition, the page-describing instruction may have been compression-encoded. In that case this embodiment may further comprise a data decoding unit, configured to perform data decoding on the compression-encoded page-describing instruction.
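The description does not name the compression scheme. As a minimal sketch, assuming the instructions were compressed with Flate (zlib/deflate), which is one widely used PDF stream filter, the data decoding unit could be approximated with Python's standard `zlib` module. The function name `decode_page_stream` is hypothetical.

```python
import zlib


def decode_page_stream(raw, is_compressed):
    """Return the page-describing instructions as uncompressed bytes.

    Assumes Flate (zlib) compression; any other encoding would need its
    own decoder.
    """
    if not is_compressed:
        return raw
    return zlib.decompress(raw)
```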
Referring to Fig. 6, a structural block diagram of Embodiment 2 of an optical character recognition apparatus for a PDF file according to the present invention is shown. The apparatus may specifically comprise the following units:
a target file determining unit 601, configured to determine a target PDF file;
a target page determining unit 602, configured to determine a target page in the PDF file;
Preferably, the target file determining unit may comprise the following subunits (S621-S622 are not shown in the drawings):
a page-number obtaining subunit 6021, configured to obtain page-number information of the PDF file;
a locating subunit 6022, configured to determine, when the currently specified page number is within the range of the page-number information, that the page corresponding to that page number is the target page.
a first acquiring unit 603, configured to acquire page-size information of the target page;
a memory allocation unit 604, configured to generate an image region of corresponding size in memory according to the page-size information and preset resolution information;
a second acquiring unit 605, configured to acquire the page-describing instruction of the target page;
an extraction unit 606, configured to extract the page-content data and position information in the page-describing instruction;
a drawing execution unit 607, configured to draw the page-content data at the corresponding position in the image region according to the position information;
a recognition unit 608, configured to perform optical character recognition on the page-content data to obtain a recognition result;
a specified output unit 609, configured to output the recognition result in a specified file format.
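The range check performed by the page-number obtaining subunit and the locating subunit can be sketched as follows. The sketch assumes 1-based page numbers and returns `None` for a page number outside the range; both choices, and the function name, are assumptions, since the description does not state what happens to an out-of-range request.

```python
def locate_target_page(page_count, requested_page):
    """Return the target page number if it lies within the document.

    `page_count` stands for the page-number information of the PDF file.
    Page numbers are assumed to start at 1.
    """
    if 1 <= requested_page <= page_count:
        return requested_page
    return None
```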
Preferably, this embodiment may further comprise a saving unit, configured to save the page-content data in the image region as an image file.
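The description does not specify the image file format used by the saving unit. As one possible sketch, assuming the image region is an 8-bit grayscale buffer, it can be written as a binary PGM (P5) file using only the Python standard library. The function name `save_region_as_pgm` is hypothetical.

```python
def save_region_as_pgm(path, width, height, pixels):
    """Write an 8-bit grayscale buffer to `path` as a binary PGM (P5) file."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match the given dimensions")
    with open(path, "wb") as f:
        # PGM header: magic number, dimensions, maximum gray value.
        f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        f.write(bytes(pixels))
```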
Referring to Fig. 7, a flowchart of an OCR process performed on a PDF file using the preferred embodiment shown in Fig. 6 is shown. The process may specifically comprise the following steps:
Step 701: the target file determining unit determines a target PDF file;
Step 702: the target page determining unit determines a target page in the PDF file, and the first acquiring unit acquires page-size information of the target page;
Step 703: the memory allocation unit generates an image region of corresponding size in memory according to the page-size information and preset resolution information;
Step 704: the second acquiring unit acquires the page-describing instructions of the target page, and the extraction unit extracts the page-content data and position information in the first page-describing instruction;
Step 705: the drawing execution unit draws the page-content data at the corresponding position in the image region according to the position information;
Step 706: the recognition unit performs optical character recognition on the page-content data to obtain a recognition result;
Step 707: the specified output unit outputs the recognition result in a specified file format.
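The order of steps 701 to 707 can be sketched as a single driver function. This is only an illustrative sketch: each unit is represented by a caller-supplied callable, the dictionary keys are hypothetical names, and recognition is assumed to operate on the drawn image region.

```python
def run_ocr_flow(pdf, page_no, dpi, units):
    """Run steps 701 to 707 in order.

    `pdf` is any object the unit callables understand, and `units` is a
    dict of caller-supplied callables keyed by the hypothetical names
    used below.
    """
    page = units["determine_page"](pdf, page_no)           # steps 701-702
    width_pt, height_pt = units["page_size"](page)         # step 702
    region = units["allocate"](width_pt, height_pt, dpi)   # step 703
    for instruction in units["instructions"](page):        # step 704
        content, position = units["extract"](instruction)
        units["draw"](region, content, position)           # step 705
    result = units["recognize"](region)                    # step 706
    return units["export"](result)                         # step 707
```

Passing the units in as callables keeps the driver independent of any particular PDF parser or OCR engine, which mirrors the unit-based structure of the apparatus.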
Since the apparatus embodiments substantially correspond to the method embodiments, the relevant parts may be understood by reference to the description of the method embodiments and are not repeated here. In addition, each embodiment of the present invention is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related description of the other embodiments.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention may also be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The optical character recognition method for a PDF file and the optical character recognition apparatus for a PDF file provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (16)

1. An optical character recognition method for a PDF file, characterized by comprising:
determining a target page in a PDF file, and acquiring page-size information of the target page;
generating an image region of corresponding size in memory according to the page-size information and preset resolution information;
acquiring the page-describing instruction of the target page, and extracting the page-content data and position information in the page-describing instruction;
drawing the page-content data at the corresponding position in the image region according to the position information;
performing optical character recognition on the page-content data to obtain a recognition result.
2. The method according to claim 1, characterized in that the page-content data comprises image data, graphic data and/or character data, and the drawing step further comprises:
decoding the image data into a bitmap, and drawing the bitmap at the corresponding position in the image region;
and/or drawing the graphic data directly at the corresponding position in the image region;
and/or generating a character image according to attribute information of the character data, and drawing the character image at the corresponding position in the image region.
3. The method according to claim 2, characterized in that there are multiple page-describing instructions, and the drawing step further comprises:
if the target page still has a next page-describing instruction, continuing to extract the page-content data and position information in the next page-describing instruction.
4. The method according to claim 1, 2 or 3, characterized in that, before the step of extracting the page-content data and position information, the method further comprises:
if the page-describing instruction has been compression-encoded, performing data decoding on the page-describing instruction.
5. The method according to claim 1, 2 or 3, characterized in that, before the target page is determined, the method further comprises:
determining a target PDF file.
6. The method according to claim 5, characterized in that the target page is determined by the following steps:
obtaining page-number information of the PDF file;
if the currently specified page number is within the range of the page-number information, determining that the page corresponding to that page number is the target page.
7. The method according to claim 2, characterized by further comprising:
saving the page-content data in the image region as an image file.
8. The method according to claim 1 or 7, characterized by further comprising:
outputting the recognition result in a specified file format.
9. An optical character recognition apparatus for a PDF file, characterized by comprising:
a target page determining unit, configured to determine a target page in a PDF file;
a first acquiring unit, configured to acquire page-size information of the target page;
a memory allocation unit, configured to generate an image region of corresponding size in memory according to the page-size information and preset resolution information;
a second acquiring unit, configured to acquire the page-describing instruction of the target page;
an extraction unit, configured to extract the page-content data and position information in the page-describing instruction;
a drawing execution unit, configured to draw the page-content data at the corresponding position in the image region according to the position information;
a recognition unit, configured to perform optical character recognition on the page-content data to obtain a recognition result.
10. The apparatus according to claim 9, characterized in that the page-content data comprises image data, graphic data and/or character data, and the drawing execution unit further comprises:
an image drawing subunit, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;
and/or a graphic drawing subunit, configured to draw the graphic data directly at the corresponding position in the image region;
and/or a character drawing subunit, configured to generate a character image according to attribute information of the character data and draw the character image at the corresponding position in the image region.
11. The apparatus according to claim 10, characterized in that there are multiple page-describing instructions, and the drawing execution unit further comprises:
a loop subunit, configured to continue extracting the page-content data and position information in the next page-describing instruction when the target page still has a next page-describing instruction.
12. The apparatus according to claim 9, 10 or 11, characterized by further comprising:
a data decoding unit, configured to perform data decoding on the page-describing instruction when the page-describing instruction has been compression-encoded.
13. The apparatus according to claim 9, 10 or 11, characterized by further comprising:
a target file determining unit, configured to determine a target PDF file.
14. The apparatus according to claim 13, characterized in that the target file determining unit further comprises:
a page-number obtaining subunit, configured to obtain page-number information of the PDF file;
a locating subunit, configured to determine, when the currently specified page number is within the range of the page-number information, that the page corresponding to that page number is the target page.
15. The apparatus according to claim 10, characterized by further comprising:
a saving unit, configured to save the page-content data in the image region as an image file.
16. The apparatus according to claim 9 or 15, characterized by further comprising:
a specified output unit, configured to output the recognition result in a specified file format.
CN2007101776734A 2007-11-19 2007-11-19 Optical character recognition method and apparatus of PDF document Active CN101441713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101776734A CN101441713B (en) 2007-11-19 2007-11-19 Optical character recognition method and apparatus of PDF document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101776734A CN101441713B (en) 2007-11-19 2007-11-19 Optical character recognition method and apparatus of PDF document

Publications (2)

Publication Number Publication Date
CN101441713A true CN101441713A (en) 2009-05-27
CN101441713B CN101441713B (en) 2010-12-08

Family

ID=40726140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101776734A Active CN101441713B (en) 2007-11-19 2007-11-19 Optical character recognition method and apparatus of PDF document

Country Status (1)

Country Link
CN (1) CN101441713B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853246A (en) * 2010-06-14 2010-10-06 深圳市万兴软件有限公司 Method and device for converting document format
CN102782702A (en) * 2010-03-10 2012-11-14 微软公司 Paragraph recognition in an optical character recognition (OCR) process
CN102831106A (en) * 2012-08-27 2012-12-19 腾讯科技(深圳)有限公司 Electronic document generation method of mobile terminal and mobile terminal
CN103186912A (en) * 2011-12-28 2013-07-03 北京神州泰岳软件股份有限公司 Method and system for showing letter in picture format
CN103279753A (en) * 2013-06-09 2013-09-04 中国科学院自动化研究所 English scene text block identification method based on instructions of tree structures
CN103744609A (en) * 2014-01-20 2014-04-23 华为终端有限公司 Data extraction method and device
CN104077593A (en) * 2013-03-27 2014-10-01 富士通株式会社 Image processing method and image processing device
CN104283921A (en) * 2013-07-08 2015-01-14 腾讯科技(深圳)有限公司 Method and device for releasing microblog
CN105335346A (en) * 2015-11-09 2016-02-17 汉王科技股份有限公司 PDF (Portable Document Format) document text extracting method and device
CN106104518A (en) * 2014-03-08 2016-11-09 微软技术许可有限责任公司 For the framework extracted according to the data of example
CN106446863A (en) * 2016-10-11 2017-02-22 同方知网(北京)技术有限公司 PDF document logic diagram identification method
CN106951362A (en) * 2015-09-18 2017-07-14 Fmr有限责任公司 To the real-time monitoring of computer system processor and affairs performance during ongoing performance test
CN108475335A (en) * 2016-01-27 2018-08-31 霍尼韦尔国际公司 The Method and kit for of the postmortem analysis of tripping field device in process industrial for using optical character identification & intelligent character recognitions
CN109446995A (en) * 2018-10-30 2019-03-08 广西科技大学 The treating method and apparatus of billing information
CN109492199A (en) * 2018-10-17 2019-03-19 四川译讯信息科技有限公司 A kind of pdf document conversion method judged in advance based on OCR
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN110321470A (en) * 2019-05-23 2019-10-11 平安科技(深圳)有限公司 Document processing method, device, computer equipment and storage medium
CN110929479A (en) * 2018-09-03 2020-03-27 珠海金山办公软件有限公司 Method and device for converting PDF scanning piece, electronic equipment and storage medium
CN110991279A (en) * 2019-11-20 2020-04-10 北京灵伴未来科技有限公司 Document image analysis and recognition method and system
CN111143213A (en) * 2019-12-24 2020-05-12 北京数衍科技有限公司 Software automation test method and device and electronic equipment
CN111213156A (en) * 2017-07-25 2020-05-29 惠普发展公司,有限责任合伙企业 Character recognition sharpness determination
CN112036123A (en) * 2020-08-31 2020-12-04 北京奇虎鸿腾科技有限公司 PDF (Portable document Format) generation method, device and equipment based on webpage and storage medium
CN112069771A (en) * 2020-08-26 2020-12-11 中国建设银行股份有限公司 Method and device for analyzing pictures in PDF (Portable document Format) file
CN112446373A (en) * 2020-12-15 2021-03-05 万兴科技(湖南)有限公司 Method, system, computer device and storage medium for identifying converted image file
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN113792659A (en) * 2021-09-15 2021-12-14 上海金仕达软件科技有限公司 Document identification method and device and electronic equipment

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8565474B2 (en) 2010-03-10 2013-10-22 Microsoft Corporation Paragraph recognition in an optical character recognition (OCR) process
CN102782702A (en) * 2010-03-10 2012-11-14 微软公司 Paragraph recognition in an optical character recognition (OCR) process
CN101853246A (en) * 2010-06-14 2010-10-06 深圳市万兴软件有限公司 Method and device for converting document format
CN103186912B (en) * 2011-12-28 2016-07-06 北京神州泰岳软件股份有限公司 The method and system of word are shown with picture format
CN103186912A (en) * 2011-12-28 2013-07-03 北京神州泰岳软件股份有限公司 Method and system for showing letter in picture format
US9019583B2 (en) 2012-08-27 2015-04-28 Tencent Technology (Shenzhen) Company Limited Mobile terminals and methods for generating electronic documents for the same
CN102831106A (en) * 2012-08-27 2012-12-19 腾讯科技(深圳)有限公司 Electronic document generation method of mobile terminal and mobile terminal
CN104077593A (en) * 2013-03-27 2014-10-01 富士通株式会社 Image processing method and image processing device
CN103279753A (en) * 2013-06-09 2013-09-04 中国科学院自动化研究所 English scene text block identification method based on instructions of tree structures
CN103279753B (en) * 2013-06-09 2016-03-09 中国科学院自动化研究所 A kind of English scene text block identifying method instructed based on tree construction
CN104283921A (en) * 2013-07-08 2015-01-14 腾讯科技(深圳)有限公司 Method and device for releasing microblog
CN103744609A (en) * 2014-01-20 2014-04-23 华为终端有限公司 Data extraction method and device
CN103744609B (en) * 2014-01-20 2018-10-19 华为终端(东莞)有限公司 A kind of data extraction method and device
CN106104518A (en) * 2014-03-08 2016-11-09 微软技术许可有限责任公司 For the framework extracted according to the data of example
CN106951362A (en) * 2015-09-18 2017-07-14 Fmr有限责任公司 To the real-time monitoring of computer system processor and affairs performance during ongoing performance test
CN105335346B (en) * 2015-11-09 2018-12-04 汉王科技股份有限公司 A kind of Text Extraction and device of PDF document
CN105335346A (en) * 2015-11-09 2016-02-17 汉王科技股份有限公司 PDF (Portable Document Format) document text extracting method and device
CN108475335A (en) * 2016-01-27 2018-08-31 霍尼韦尔国际公司 The Method and kit for of the postmortem analysis of tripping field device in process industrial for using optical character identification & intelligent character recognitions
CN108475335B (en) * 2016-01-27 2022-10-14 霍尼韦尔国际公司 Method for post-inspection analysis of tripped field devices in process industry using optical character recognition, smart character recognition
CN106446863A (en) * 2016-10-11 2017-02-22 同方知网(北京)技术有限公司 PDF document logic diagram identification method
CN111213156A (en) * 2017-07-25 2020-05-29 惠普发展公司,有限责任合伙企业 Character recognition sharpness determination
CN110929479A (en) * 2018-09-03 2020-03-27 珠海金山办公软件有限公司 Method and device for converting PDF scanning piece, electronic equipment and storage medium
CN109492199A (en) * 2018-10-17 2019-03-19 四川译讯信息科技有限公司 A kind of pdf document conversion method judged in advance based on OCR
CN109446995A (en) * 2018-10-30 2019-03-08 广西科技大学 The treating method and apparatus of billing information
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN110321470A (en) * 2019-05-23 2019-10-11 平安科技(深圳)有限公司 Document processing method, device, computer equipment and storage medium
CN110991279A (en) * 2019-11-20 2020-04-10 北京灵伴未来科技有限公司 Document image analysis and recognition method and system
CN110991279B (en) * 2019-11-20 2023-08-22 北京灵伴未来科技有限公司 Document Image Analysis and Recognition Method and System
CN111143213A (en) * 2019-12-24 2020-05-12 北京数衍科技有限公司 Software automation test method and device and electronic equipment
CN112069771A (en) * 2020-08-26 2020-12-11 中国建设银行股份有限公司 Method and device for analyzing pictures in PDF (Portable document Format) file
CN112036123A (en) * 2020-08-31 2020-12-04 北京奇虎鸿腾科技有限公司 PDF (Portable document Format) generation method, device and equipment based on webpage and storage medium
CN112446373A (en) * 2020-12-15 2021-03-05 万兴科技(湖南)有限公司 Method, system, computer device and storage medium for identifying converted image file
CN112446373B (en) * 2020-12-15 2023-06-06 万兴科技(湖南)有限公司 Method, system, computer device and storage medium for identifying converted image file
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861821B (en) * 2021-04-06 2024-04-19 刘羽 Map data reduction method based on PDF file analysis
CN113792659A (en) * 2021-09-15 2021-12-14 上海金仕达软件科技有限公司 Document identification method and device and electronic equipment
CN113792659B (en) * 2021-09-15 2024-04-05 上海金仕达软件科技股份有限公司 Document identification method and device and electronic equipment

Also Published As

Publication number Publication date
CN101441713B (en) 2010-12-08

Similar Documents

Publication Publication Date Title
CN101441713B (en) Optical character recognition method and apparatus of PDF document
US7681121B2 (en) Image processing apparatus, control method therefor, and program
US7664321B2 (en) Image processing method, system, program, program storage medium and information processing apparatus
US7349577B2 (en) Image processing method and image processing system
US8520006B2 (en) Image processing apparatus and method, and program
US8954845B2 (en) Image processing device, method and storage medium for two-way linking between related graphics and text in an electronic document
US5907835A (en) Electronic filing system using different application program for processing drawing commands for printing
US7493250B2 (en) System and method for distributing multilingual documents
EP2162859B1 (en) Image processing apparatus, image processing method, and computer program
US20040223197A1 (en) Image processing method
US20050286805A1 (en) Image processing apparatus, control method therefor, and program
US20120011429A1 (en) Image processing apparatus and image processing method
US20040213458A1 (en) Image processing method and system
JPH10100484A (en) Computer based document processing method
US8514462B2 (en) Processing document image including caption region
JP3683925B2 (en) Electronic filing device
CN111753717A (en) Method, apparatus, device and medium for extracting structured information of text
JP5551660B2 (en) Computer-implemented method for encoding text into matrix code symbols, computer-implemented method for decoding matrix code symbols, encoder for encoding text into matrix code symbols, and decoder for decoding matrix code symbols
JP2022092119A (en) Image processing apparatus, image processing method, and program
JP2000322417A (en) Device and method for filing image and storage medium
JP5089524B2 (en) Document processing apparatus, document processing system, document processing method, and document processing program
Yu et al. Extracting mathematical components directly from PDF documents for mathematical expression recognition and retrieval
KR100708389B1 (en) The device which the compression and memorial to a PDF file of the security and method thereof
JP5501307B2 (en) Apparatus for decoding matrix code symbols and method for decoding matrix code symbols
JP2005208872A (en) Image processing system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20210831

Address after: 100124 first floor, building 8, No. 1129, Huihe South Street, Banbidian village, Gaobeidian Township, Chaoyang District, Beijing

Patentee after: Beijing Hanwang Yingyan Technology Co.,Ltd.

Address before: 100094, No. 5, building 8, No. three northeast Wang Xi Road, Beijing, Haidian District

Patentee before: HANWANG TECHNOLOGY Co.,Ltd.