CN101441713A

CN101441713A - Optical character recognition method and apparatus of PDF document

Info

Publication number: CN101441713A
Application number: CNA2007101776734A
Authority: CN
Inventors: 刘迎建; 刘昌平; 江世盛; 丁迎; 刘强
Original assignee: Hanwang Technology Co Ltd
Current assignee: Beijing Hanwang Yingyan Technology Co.,Ltd.
Priority date: 2007-11-19
Filing date: 2007-11-19
Publication date: 2009-05-27
Anticipated expiration: 2027-11-19
Also published as: CN101441713B

Abstract

The invention discloses an optical character recognition (OCR) method for PDF files. The method comprises: determining a target page in a PDF file and acquiring page size information of the target page; generating an image region of corresponding size in memory according to the page size information and preset resolution information; acquiring the page description instructions of the target page and extracting the page content data and position information from them; drawing the page content data at the corresponding position in the image region according to the position information; and performing optical character recognition on the page content data to obtain a recognition result. The method performs OCR directly on a PDF file without repeated switching among several software programs, which simplifies user operation, reduces operation time, and improves the user experience.

Description

Optical character recognition method and apparatus for PDF files

Technical field

The present invention relates to the field of optical character recognition, and in particular to an optical character recognition method for PDF files and an optical character recognition apparatus for PDF files.

Background art

Optical character recognition, abbreviated OCR, is a technology that uses character recognition techniques to convert images of characters into the internal computer codes of those characters. At present, the file formats that OCR technology can recognize are limited to image file formats, i.e. files in formats such as tif, bmp, or jpg.

PDF (Portable Document Format) is an electronic file format for describing page content. PDF files are independent of the operating system platform (that is, they can be used alike on Windows, Unix, or Mac OS), and PDF has become the preferred document format for distributing electronic documents and disseminating digital information on the Internet. However, because a PDF file is not an image format file, existing OCR systems cannot recognize PDF files directly. The PDF file must first be converted by third-party software into an image file format that the OCR system can recognize, and only then can the OCR system perform recognition. For example, the region to be recognized is selected with the snapshot tool of PDF processing software (such as Acrobat) and then saved as an image format file through copy and paste operations.

Clearly, performing OCR on PDF files in this way requires switching back and forth between different software programs; the operation is complicated and time-consuming, and the user experience is poor. There is therefore an urgent need for an OCR processing method and apparatus that can recognize PDF files directly, without repeated switching between multiple programs.

Summary of the invention

The technical problem to be solved by the present invention is to provide an optical character recognition method that can recognize PDF files directly, so that OCR can be performed on PDF files simply and efficiently and the user obtains a better experience.

The present invention also provides an optical character recognition apparatus that can recognize PDF files, to ensure the implementation and practical application of the above method.

To solve the above technical problem, an embodiment of the invention discloses an optical character recognition method for PDF files, comprising:

determining a target page in a PDF file, and acquiring page size information of the target page;

generating an image region of corresponding size in memory according to the page size information and preset resolution information;

acquiring the page description instructions of the target page, and extracting the page content data and position information from the page description instructions;

drawing the page content data at the corresponding position in the image region according to the position information;

performing optical character recognition on the page content data to obtain a recognition result.

Preferably, the page content data comprises image data, graphics data, and/or character data, and the drawing step further comprises:

decoding the image data into a bitmap, and drawing the bitmap at the corresponding position in the image region;

and/or, drawing the graphics data directly at the corresponding position in the image region;

and/or, generating a character image according to the attribute information of the character data, and drawing the character image at the corresponding position in the image region.

Preferably, there are multiple page description instructions, and the drawing step further comprises:

if the target page has a further page description instruction, continuing to extract the page content data and position information from the next page description instruction.

Preferably, before the step of extracting the page content data and position information, the method further comprises:

if the page description instructions have been compression-encoded, decoding the page description instructions.

Preferably, before the target page is determined, the method further comprises:

determining the target PDF file.

Preferably, the target page is determined by the following steps:

acquiring the page count information of the PDF file;

if the currently specified page number is within the range of the page count information, determining that the page corresponding to that page number is the target page.

Preferably, the method further comprises:

saving the page content data in the image region as an image file.

Preferably, the method further comprises:

outputting the recognition result in a specified file format.

An embodiment of the invention also discloses an optical character recognition apparatus for PDF files, comprising:

a target page determining unit, configured to determine a target page in a PDF file;

a first acquiring unit, configured to acquire the page size information of the target page;

a memory allocation unit, configured to generate an image region of corresponding size in memory according to the page size information and preset resolution information;

a second acquiring unit, configured to acquire the page description instructions of the target page;

an extraction unit, configured to extract the page content data and position information from the page description instructions;

a drawing execution unit, configured to draw the page content data at the corresponding position in the image region according to the position information;

a recognition unit, configured to perform optical character recognition on the page content data to obtain a recognition result.

Preferably, the page content data comprises image data, graphics data, and/or character data, and the drawing execution unit further comprises:

an image drawing subunit, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;

and/or, a graphics drawing subunit, configured to draw the graphics data directly at the corresponding position in the image region;

and/or, a character drawing subunit, configured to generate a character image according to the attribute information of the character data and draw the character image at the corresponding position in the image region.

Preferably, there are multiple page description instructions, and the drawing execution unit further comprises:

a loop subunit, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a further page description instruction.

Preferably, the apparatus further comprises:

a data decoding unit, configured to decode the page description instructions when they have been compression-encoded.

Preferably, the apparatus further comprises:

a target file determining unit, configured to determine the target PDF file.

Preferably, the target file determining unit further comprises:

a page count acquiring subunit, configured to acquire the page count information of the PDF file;

a locating subunit, configured to determine that the page corresponding to the currently specified page number is the target page when that page number is within the range of the page count information.

Preferably, the apparatus further comprises:

a saving unit, configured to save the page content data in the image region as an image file.

Preferably, the apparatus further comprises:

a specified output unit, configured to output the recognition result in a specified file format.

Compared with the prior art, the embodiments of the invention have the following advantages:

First, the invention parses the PDF file to obtain the page size information of a page, calculates the pixel height and width of the output image from this page size information and the preset resolution information, and allocates an image storage space of corresponding size in memory for the output image. It then parses the page description instructions of the target page and draws the page content data into the allocated image storage space. OCR can thus be performed directly on the PDF file without repeated switching among several software programs, which simplifies user operation, reduces operation time, and gives the user a better experience.

Second, the invention can output the processed recognition result in a specified file format, so that the content of the PDF file can be edited in that format. This effectively improves the flexibility of editing PDF file content and further improves the user experience.

Brief description of the drawings

Fig. 1 is a hierarchical structure diagram of a PDF file;

Fig. 2 is a flowchart of embodiment 1 of an optical character recognition method for PDF files according to the invention;

Fig. 3 is a flowchart of the image conversion and drawing process for a PDF file according to the invention;

Fig. 4 is a flowchart of embodiment 2 of an optical character recognition method for PDF files;

Fig. 5 is a structural block diagram of embodiment 1 of an optical character recognition apparatus for PDF files according to the invention;

Fig. 6 is a structural block diagram of embodiment 2 of an optical character recognition apparatus for PDF files according to the invention;

Fig. 7 is a flowchart of the OCR process for a PDF file performed with the preferred embodiment shown in Fig. 6.

Detailed description of the embodiments

To make the above objects, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

In terms of how PDF files are produced, there are two ways of generating a PDF file. The first uses optical scanning to convert existing paper documents, books, and the like into images, and then generates the PDF file from the images; the characters, graphics, and other data in such a file exist in image form. The second uses an application program and a PDF printer (a kind of virtual printing software) to convert the internal computer codes of characters and the graphics data in the computer into the internal representation of PDF; the characters, graphics, and other data in such a file exist in PDF-encoded form.

In terms of data structure, the data in a PDF file are organized as PDF objects. Specifically, PDF objects fall into two classes: direct objects and indirect objects. Direct objects comprise the Boolean, Number, String, Name, Array, Dictionary, Stream, and Null types. An indirect object is a direct object that has been given a label so that other objects can reference it.

In terms of logical structure, a PDF file can be described as a hierarchy of PDF objects that contains a unique root object (Catalog). Fig. 1 shows the hierarchical structure diagram of a PDF file. The root object contains the bookmark tree and the page tree of the PDF document. The bookmark tree contains a number of bookmark items. The page item is the most important object in a PDF file: it contains the page description instructions, that is, the information on how to display the page, such as the fonts used, the content included (text, pictures, and so on), and the size information of the page. Its sub-items may of course also be references to other objects.

In terms of storage structure, a standard PDF file consists of four parts: the file header (Header), the file body (Body), the cross-reference table (Cross-reference Table), and the file trailer (Trailer).

The file header (Header) specifies the version number of the PDF specification that the file conforms to; for example, "%PDF-1.3" indicates that the version number is 1.3. The file body (Body) contains a series of indirect objects that describe the pages of the file. The cross-reference table (Cross-reference Table) records the position of each indirect object in the file. The file trailer (Trailer) records the starting position of the cross-reference table in the file, the indirect object number of the root object (Catalog), and the end-of-file mark.
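As a small illustration of the file header described above, the following sketch reads the version number from the first bytes of a file. The function name `pdf_header_version` is hypothetical and is not part of the patent; it assumes the header appears at the very start of the data.

```python
import re


def pdf_header_version(data: bytes) -> str:
    """Return the version from a '%PDF-x.y' header, e.g. '1.3'.

    Hypothetical helper: assumes the header is at offset 0 of the data.
    """
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("not a PDF file: missing %PDF- header")
    return match.group(1).decode("ascii")
```

A caller would typically pass the first few bytes of the file, for example `pdf_header_version(open(path, "rb").read(16))`.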

For example, a schematic table of a PDF file is:

Based on the above analysis of the PDF file structure, one of the core ideas of the embodiments of the invention is as follows. From the page size information of the target page, obtained by parsing the PDF file, and the preset resolution information (usually expressed as the number of pixels per inch), the pixel height and width of the output image are calculated. An image storage space of corresponding size is then allocated in computer memory for the output image. Finally, according to the page description instructions of the target page, also obtained by parsing, the character, graphics, and image data are drawn into the allocated image storage space. OCR on the PDF file can thus be performed simply and quickly, giving the user a better experience.
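The size calculation and memory allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions: page dimensions are given in PDF points (1/72 inch in default user space), the resolution is in pixels per inch, and the image region is a white 8-bit greyscale buffer. The function names are hypothetical and do not come from the patent.

```python
import math
from typing import List, Tuple

POINTS_PER_INCH = 72.0  # assumption: default PDF user-space unit


def image_size_in_pixels(width_pt: float, height_pt: float, dpi: float) -> Tuple[int, int]:
    """Convert a page size in points to a pixel width and height at the given resolution."""
    width_px = math.ceil(width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def allocate_image_region(width_px: int, height_px: int, background: int = 255) -> List[bytearray]:
    """Allocate one bytearray per pixel row, filled with the background grey value."""
    return [bytearray([background]) * width_px for _ in range(height_px)]
```

For a US Letter page (612 by 792 points) at 300 pixels per inch, `image_size_in_pixels(612, 792, 300)` gives a 2550 by 3300 pixel image, which is the size of the region that would be allocated.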

Fig. 2 shows a flowchart of embodiment 1 of an optical character recognition method for PDF files according to the invention, which may specifically comprise the following steps:

Step 201: determine a target page in the PDF file, and acquire the page size information of the target page;

Step 202: generate an image region of corresponding size in memory according to the page size information and preset resolution information;

Step 203: acquire the page description instructions of the target page, and extract the page content data and position information from the page description instructions;

Step 204: draw the page content data at the corresponding position in the image region according to the position information;

Step 205: perform optical character recognition on the page content data to obtain a recognition result.

It will be understood that, in this embodiment, the page size information and the page description instructions of the relevant page of the PDF file can be obtained by parsing the logical structure and the storage structure of the PDF file. Specifically, parsing starts from the file trailer: the indirect object number of the root object and the position of the cross-reference table (the byte offset in the file at which the cross-reference table begins) are extracted, and then, using the object indexing function of the cross-reference table, the objects are parsed in turn starting from the root object.
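The first step of this parsing principle, reading the trailer to find the cross-reference table offset and the root object, can be sketched as follows. This is only an illustration under assumptions: it handles a classic trailer with a `startxref` entry and a `/Root` reference located in the last kilobyte of the file, and it does not handle cross-reference streams or incremental updates. The function name `read_trailer_info` is hypothetical.

```python
import re
from typing import Tuple


def read_trailer_info(data: bytes, tail_size: int = 1024) -> Tuple[int, Tuple[int, int]]:
    """Return (xref_offset, (root_object_number, root_generation_number)).

    Assumes a classic trailer near the end of the file (hypothetical sketch).
    """
    tail = data[-tail_size:]
    xref_matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not xref_matches:
        raise ValueError("startxref entry not found near end of file")
    root_matches = re.findall(rb"/Root\s+(\d+)\s+(\d+)\s+R", tail)
    if not root_matches:
        raise ValueError("/Root reference not found in trailer")
    obj_num, gen_num = root_matches[-1]
    return int(xref_matches[-1]), (int(obj_num), int(gen_num))
```

With the offset and the root reference in hand, a parser would then read the cross-reference table at that offset to locate the root object and walk the page tree from there.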

In practice, the preset resolution information may be set by the user, may be a system default, or may be obtained in some other way; the invention does not limit this.

The current PDF specification contains more than 70 page description instructions, covering the description of characters, graphics, images, and other data objects together with their style, position, and size information. In this embodiment, the page content data may therefore comprise image data, graphics data, and/or character data, in which case step 204 of drawing the page content data may further comprise the following substeps:

Substep S41: decode the image data into a bitmap, and draw the bitmap at the corresponding position in the image region;

and/or, substep S42: draw the graphics data directly at the corresponding position in the image region;

and/or, substep S43: generate a character image according to the attribute information of the character data, and draw the character image at the corresponding position in the image region.

To help those skilled in the art better understand this embodiment, the parsing of a concrete set of page description instructions from the schematic table of the PDF file above is described below as an example. Suppose the PDF page description instructions obtained from 60obj in the schematic table are as follows:

BT
/F0 48.000 Tf
72.000 576.000 Td
(Hello World) Tj
ET

The above page description instructions are parsed as follows:

(1) "BT" begins a character (text) object; processing must perform initialization such as resetting the initial coordinate transformation parameters;

(2) "/F0 48.000 Tf" selects the font identified in this file by the name F0, with a font scaling factor of 48.0. In the file, the font name for the identifier F0 is "Times-Roman" and the character encoding name is "WinAnsiEncoding"; during processing the corresponding font file is loaded according to the font name;

(3) "72.000 576.000 Td" takes the lower-left corner of the PDF page as the coordinate origin and moves the current coordinate to the position 72.0 points across and 576.0 points up;

(4) "(Hello World) Tj" outputs the character sequence "Hello World". For each character, the corresponding glyph entry is looked up in the loaded font file, and a character image is generated and stored in the page image region in memory;

(5) "ET" ends the character object.
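The parsing walk-through above can be sketched as a small tokenizer that groups operands with the operator that follows them. This is a simplified illustration, not the patent's parser: it assumes only names, numbers, simple string literals without nested parentheses, and bare operators, and the names `TOKEN` and `parse_content_stream` are hypothetical.

```python
import re
from typing import Iterator, List, Tuple, Union

Operand = Union[str, float]

# String literal (no nesting), then /Name, then any other bare token.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[^\s/()<>\[\]]+")


def parse_content_stream(stream: bytes) -> Iterator[Tuple[str, List[Operand]]]:
    """Yield (operator, operands) pairs from a simple page description stream."""
    operands: List[Operand] = []
    for match in TOKEN.finditer(stream):
        text = match.group(0).decode("latin-1")
        if text.startswith("("):
            operands.append(text[1:-1])      # string operand without parentheses
        elif text.startswith("/"):
            operands.append(text)            # name operand, e.g. "/F0"
        else:
            try:
                operands.append(float(text)) # numeric operand
            except ValueError:
                yield text, operands         # anything else is treated as an operator
                operands = []
```

Applied to the five instructions above, the generator yields `BT` with no operands, `Tf` with the font name and size, `Td` with the two offsets, `Tj` with the string, and `ET` with no operands; a renderer would dispatch on the operator name.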

As the example above shows, a page may contain many page description instructions. In this case, step 204 of drawing the page content data may further comprise the following substep:

Substep S44: if the target page has a further page description instruction, continue to extract the page content data and position information from the next page description instruction.

In addition, the PDF specification states that PDF objects may be compressed with a variety of data encoding and compression methods. The methods currently supported by PDF include ASCIIHex, ASCII85, LZW, RunLength, CCITT Group 3, CCITT Group 4, JPEG, JPEG 2000, and Flate. Therefore, if the page description instructions have been compression-encoded, the invention may also include, before the PDF page description instructions are parsed, a step of decoding the page description instructions.
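The decoding step can be sketched for two of the listed methods. This is a partial illustration only: it covers Flate (via Python's `zlib`) and ASCIIHex, uses the filter names `FlateDecode` and `ASCIIHexDecode` as they appear in PDF stream dictionaries, and raises an error for the other methods. The helper names `decode_ascii_hex` and `decode_stream` are hypothetical.

```python
import zlib
from typing import Callable, Dict, Sequence


def decode_ascii_hex(data: bytes) -> bytes:
    """Decode ASCIIHex data: hex digits, whitespace ignored, '>' marks the end."""
    body = data.split(b">", 1)[0]
    digits = bytes(b for b in body if not chr(b).isspace())
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as followed by 0
    return bytes.fromhex(digits.decode("ascii"))


DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": decode_ascii_hex,
}


def decode_stream(data: bytes, filters: Sequence[str]) -> bytes:
    """Apply the named filters in order and return the decoded bytes."""
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"filter not covered by this sketch: {name}")
        data = DECODERS[name](data)
    return data
```

A stream with no filters passes through unchanged, which corresponds to the branch in which the page description instructions were not compression-encoded.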

Correspondingly, Fig. 3 shows a flowchart of the image conversion and drawing process for a PDF file according to the invention, which may specifically comprise the following steps:

Step 301: determine a target page in the PDF file, and acquire the page size information of the target page;

Step 302: generate an image region of corresponding size in memory according to the page size information and preset resolution information;

Step 303: acquire the page description instructions of the target page and judge whether they have been compression-encoded; if so, perform step 304; if not, perform step 305;

Step 304: decode the page description instructions, then perform step 305;

Step 305: extract the page content data and position information from the first page description instruction;

Step 306: judge whether the page content data is image data; if so, perform step 307; if not, perform step 308;

Step 307: decode the image data into a bitmap, draw the bitmap at the corresponding position in the image region, then perform step 308;

Step 308: judge whether the page content data is graphics data; if so, perform step 309; if not, perform step 310;

Step 309: draw the graphics data directly at the corresponding position in the image region, then perform step 310;

Step 310: judge whether the page content data is character data; if so, perform step 311; if not, perform step 312;

Step 311: generate a character image according to the attribute information of the character data, draw the character image at the corresponding position in the image region, then perform step 312;

Step 312: judge whether there is a further page description instruction; if so, perform step 313; if not, end the image drawing of the current page;

Step 313: continue by extracting the page content data and position information from the next page description instruction, and return to step 306.
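The loop of steps 305 to 313 can be sketched as a dispatch over extracted page items. This is a deliberately simplified illustration: image decoding, graphics rasterization, and glyph rendering are assumed to have already produced a small pixel block for each item, so that only the type checks and the drawing at a position are shown. The names `blit` and `draw_page` and the item layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Pixels = List[List[int]]               # rows of grey values, 0 = black
Item = Tuple[str, Pixels, int, int]    # (kind, pixels, left, top)


def blit(region: List[List[int]], pixels: Pixels, left: int, top: int) -> None:
    """Copy a pixel block into the image region, clipping at the region edges."""
    for dy, row in enumerate(pixels):
        for dx, value in enumerate(row):
            y, x = top + dy, left + dx
            if 0 <= y < len(region) and 0 <= x < len(region[0]):
                region[y][x] = value


def draw_page(region: List[List[int]], items: Iterable[Item]) -> Dict[str, int]:
    """Draw each image, graphic, or character item; return how many of each were drawn."""
    counts = {"image": 0, "graphic": 0, "character": 0}
    for kind, pixels, left, top in items:   # loop until no instruction remains
        if kind not in counts:
            continue                        # not image, graphics, or character data
        blit(region, pixels, left, top)
        counts[kind] += 1
    return counts
```

The returned counts are only a convenience for checking the sketch; the patent's flow simply ends the drawing of the current page when no instruction remains.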

Fig. 4 shows a flowchart of embodiment 2 of an optical character recognition method for PDF files, which may specifically comprise the following steps:

Step 401: determine the target PDF file;

In practice, the corresponding PDF file can be located by obtaining the name of the file that the user asks to have recognized.

Step 402: determine a target page in the PDF file, and acquire the page size information of the target page;

As a structured file format, PDF keeps its pages independent of one another, so any page in a PDF file can be accessed at random by its page number. The corresponding page in the PDF file can therefore be determined from the page number specified by the user, in which case step 402 may further comprise the following substeps:

Substep 4021: acquire the page count information of the PDF file;

Substep 4022: judge whether the currently specified page number is within the range of the page count information; if so, perform substep 4023; if not, perform substep 4024;

Substep 4023: determine that the page corresponding to the page number is the target page;

Substep 4024: prompt the user that an error has occurred.
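Substeps 4021 to 4024 amount to a range check on the requested page number. A minimal sketch, assuming page numbers shown to the user start at 1 and that an error is reported by raising an exception, might look like this; the function name `select_target_page` is hypothetical.

```python
def select_target_page(requested_page: int, page_count: int) -> int:
    """Return the zero-based index of the target page, or raise if out of range."""
    if not 1 <= requested_page <= page_count:
        # Corresponds to substep 4024: report the error to the user.
        raise ValueError(
            f"page {requested_page} is outside the valid range 1..{page_count}"
        )
    # Corresponds to substep 4023: the requested page is the target page.
    return requested_page - 1
```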

Step 403: generate an image region of corresponding size in memory according to the page size information and preset resolution information;

Step 404: acquire the page description instructions of the target page, and extract the page content data and position information from the page description instructions;

Step 405: draw the page content data at the corresponding position in the image region according to the position information;

At this point, the PDF page content data has been converted into corresponding image data in memory.

Step 406: perform optical character recognition on the page content data to obtain a recognition result;

Because the above steps have turned the page content data into image data, any optical character recognition method of the prior art can be used in this embodiment. One such method, for example, is:

(1) Image data preprocessing:

The image data obtained by converting the PDF page undergoes processing such as skew correction, deformation correction, and binarization, to ensure the effectiveness of the later recognition operations;

(2) Layout analysis:

This mainly comprises operations such as locating text image regions, recognizing tables, and understanding page information;

(3) Character recognition:

The character images in the image are converted into the internal computer code representation of the characters; besides Chinese and English character recognition, support for Traditional Chinese, Japanese, and Korean can be added as required;

(4) User proofreading:

The user can correct misrecognitions that occur during the recognition process.

Of course, the above method is only an example; those skilled in the art may equally use other optical character recognition methods, and the invention does not limit this.
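The binarization mentioned in preprocessing step (1) can be illustrated with a fixed global threshold. The patent does not specify a binarization method, so the threshold value and the function name `binarize` below are assumptions chosen for illustration; practical OCR systems often use adaptive thresholds instead.

```python
from typing import List


def binarize(gray_rows: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map 8-bit grey values to 1 (ink) when darker than the threshold, else 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]
```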

As is well known, PDF files are read-only; in some cases, however, the content of a PDF file needs to be edited, so this embodiment may further comprise:

Step 407: output the recognition result in a specified file format.

Based on the recognition result produced by OCR, layout restoration is performed first, that is, the recognized data are reorganized into structures such as text paragraphs and tables. The result is then exported as a file of the specified format, for example an editable file format such as RTF, DOC, TXT, EXCEL, WPS, or UOML.

In this way, whether the PDF file was generated from scanned images or generated by application software through conversion of internal computer codes, it can be converted into various easily editable file formats according to the size, position, and style of the characters, graphics, images, and other data in the original page. This effectively solves the problem that PDF file content is hard to obtain and reuse, and greatly reduces the workload of manual data entry, page typesetting, and proofreading.

Of course, outputting the specified file format can be implemented by any method of the prior art; the invention does not limit this.

Preferably, this embodiment may further comprise the step of saving the page content data in the image region as an image file.

The saved data may be kept in memory, or may be stored in any image format on a hard disk or other storage device for use by other programs; the invention does not limit this.

For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the invention is not limited by the order of actions described, because according to the invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the invention.

Fig. 5 shows a structural block diagram of embodiment 1 of an optical character recognition apparatus for PDF files according to the invention, which may specifically comprise the following units:

a target page determining unit 501, configured to determine a target page in a PDF file;

a first acquiring unit 502, configured to acquire the page size information of the target page;

a memory allocation unit 503, configured to generate an image region of corresponding size in memory according to the page size information and preset resolution information;

a second acquiring unit 504, configured to acquire the page description instructions of the target page;

an extraction unit 505, configured to extract the page content data and position information from the page description instructions;

a drawing execution unit 506, configured to draw the page content data at the corresponding position in the image region according to the position information;

a recognition unit 507, configured to perform optical character recognition on the page content data to obtain a recognition result.

Preferably, the page content data may comprise image data, graphics data, and/or character data, in which case the drawing execution unit 506 may comprise the following subunits (subunits S561 to S564 do not appear in the drawings):

an image drawing subunit S561, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;

and/or, a graphics drawing subunit S562, configured to draw the graphics data directly at the corresponding position in the image region;

and/or, a character drawing subunit S563, configured to generate a character image according to the attribute information of the character data and draw the character image at the corresponding position in the image region.

In practice, the target page may contain many page description instructions, in which case the drawing execution unit 506 may further comprise a loop subunit S564, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a further page description instruction.

In addition, if the page description instructions have been compression-encoded, this embodiment may further comprise a data decoding unit, configured to decode the page description instructions when they have been compression-encoded.

With reference to figure 6, show the structured flowchart of the optical character recognition device embodiment 2 of a kind of pdf document of the present invention, specifically can comprise with lower unit:

File destination determining unit 601 is used for determining target P DF file;

Target pages determining unit 602 is used for determining target pages in described pdf document;

Preferably, the target file determination unit may comprise the following subunits (S621-S622 are not shown in the accompanying drawings):

a page number acquisition subunit 6021, configured to obtain page number information of the PDF file;

a positioning subunit 6022, configured to determine, when the currently specified page number falls within the range of the page number information, that the page corresponding to that page number is the target page.

a first acquisition unit 603, configured to obtain page size information of the target page;

a memory allocation unit 604, configured to generate an image region of corresponding size in memory according to the page size information and preset resolution information;

a second acquisition unit 605, configured to obtain the page description instructions of the target page;

an extraction unit 606, configured to extract the page content data and position information from the page description instructions;

a drawing execution unit 607, configured to draw the page content data at the corresponding position in the image region according to the position information;

a recognition unit 608, configured to perform optical character recognition on the page content data to obtain a recognition result;

a designated output unit 609, configured to output the recognition result in a specified file format.

Preferably, the present embodiment may further comprise a saving unit, configured to save the page content data in the image region as an image file.
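As an illustration of how the positioning subunit and the memory allocation unit might behave, a minimal sketch follows. It assumes that the page size is expressed in points of 1/72 inch (the default PDF user space unit), that the preset resolution is given in dots per inch, and that the image region is an 8-bit grayscale buffer; the function names are hypothetical.

```python
import math
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def locate_target_page(requested: int, page_count: int) -> int:
    """Accept the requested page only if it lies within the document."""
    if 1 <= requested <= page_count:
        return requested
    raise ValueError(f"page {requested} is outside 1..{page_count}")


def image_region_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Convert a page size in points to pixel dimensions at the preset resolution."""
    width_px = math.ceil(width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def allocate_image_region(width_pt: float, height_pt: float, dpi: int) -> bytearray:
    """Allocate a white 8-bit grayscale buffer of the corresponding size."""
    width_px, height_px = image_region_size(width_pt, height_pt, dpi)
    return bytearray(b"\xff" * (width_px * height_px))
```

Rounding up rather than down ensures that content drawn at the right or bottom edge of the page still falls inside the allocated region.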

Referring to FIG. 7, a flow chart of the OCR recognition process for a PDF file using the preferred embodiment shown in FIG. 6 is shown. The process may specifically comprise the following steps:

Step 701: the target file determination unit determines the target PDF file;

Step 702: the target page determination unit determines the target page in the PDF file, and the first acquisition unit obtains the page size information of the target page;

Step 703: the memory allocation unit generates an image region of corresponding size in memory according to the page size information and the preset resolution information;

Step 704: the second acquisition unit obtains the page description instructions of the target page, and the extraction unit extracts the page content data and position information from the first page description instruction;

Step 705: the drawing execution unit draws the page content data at the corresponding position in the image region according to the position information;

Step 706: the recognition unit performs optical character recognition on the page content data to obtain a recognition result;

Step 707: the designated output unit outputs the recognition result in the specified file format.
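The ordering of steps 701 to 707 can be sketched as a small driver that calls one callable per unit. This is a hedged illustration only: the `Units` container, the signatures of its fields and the `run_pipeline` function are assumptions, and the actual rendering and recognition are left to whatever callables the caller supplies.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple


@dataclass
class Units:
    """One callable per unit of FIG. 6; all signatures are assumptions."""
    page_count: Callable[[Any], int]
    page_size: Callable[[Any, int], Tuple[float, float]]
    allocate: Callable[[Tuple[float, float]], Any]
    instructions: Callable[[Any, int], Iterable[Tuple[Any, Tuple[int, int]]]]
    draw: Callable[[Any, Any, Tuple[int, int]], None]
    recognize: Callable[[Any], str]
    export: Callable[[str], Any]


def run_pipeline(pdf: Any, page_number: int, units: Units) -> Any:
    # Step 701: `pdf` is the target PDF file already chosen by the caller.
    # Step 702: determine the target page and obtain its size.
    if not 1 <= page_number <= units.page_count(pdf):
        raise ValueError(f"page {page_number} is outside the document")
    size = units.page_size(pdf, page_number)
    # Step 703: generate the image region in memory.
    region = units.allocate(size)
    # Steps 704-705: extract each instruction's content and position, then draw it.
    for content, position in units.instructions(pdf, page_number):
        units.draw(region, content, position)
    # Step 706: perform optical character recognition on the drawn content.
    text = units.recognize(region)
    # Step 707: output the result in the specified format.
    return units.export(text)
```

Because every unit is injected, the same driver runs against stub units in a test and against real rendering and recognition components in use.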

Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts, which are not repeated here. In addition, the description of each embodiment of the present invention has its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related description of the other embodiments.

The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.

The present invention may also be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The optical character recognition method for a PDF file and the optical character recognition apparatus for a PDF file provided by the present invention have been described in detail above. Specific examples are used herein to set forth the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. An optical character recognition method for a PDF file, characterized by comprising:

2. The method according to claim 1, characterized in that the page content data comprises image data, graphics data and/or character data, and the drawing step further comprises:

3. The method according to claim 2, characterized in that there are multiple page description instructions, and the drawing step further comprises:

4. The method according to claim 1, 2 or 3, characterized by further comprising, before the step of extracting the page content data and position information:

5. The method according to claim 1, 2 or 3, characterized by further comprising, before determining the target page:

determining a target PDF file.

6. The method according to claim 5, characterized in that the target page is determined by the following steps:

obtaining page number information of the PDF file;

7. The method according to claim 2, characterized by further comprising:

saving the page content data in the image region as an image file.

8. The method according to claim 1 or 7, characterized by further comprising:

outputting the recognition result in a specified file format.

9. An optical character recognition apparatus for a PDF file, characterized by comprising:

10. The apparatus according to claim 9, characterized in that the page content data comprises image data, graphics data and/or character data, and the drawing execution unit further comprises:

11. The apparatus according to claim 10, characterized in that there are multiple page description instructions, and the drawing execution unit further comprises:

12. The apparatus according to claim 9, 10 or 11, characterized by further comprising:

13. The apparatus according to claim 9, 10 or 11, characterized by further comprising:

a target file determination unit, configured to determine a target PDF file.

14. The apparatus according to claim 13, characterized in that the target file determination unit further comprises:

15. The apparatus according to claim 10, characterized by further comprising:

16. The apparatus according to claim 9 or 15, characterized by further comprising: