CN101441713B - Optical character recognition method and apparatus of PDF document - Google Patents
- Publication number
- CN101441713B (application CN2007101776734A / CN200710177673A)
- Authority
- CN
- China
- Prior art keywords
- page
- data
- pages
- image
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Character Input (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an optical character recognition method for PDF files. The method comprises: determining a target page in a PDF file and acquiring page size information of the target page; generating an image region of corresponding size in memory according to the page size information and preset resolution information; acquiring the page description instructions of the target page and extracting the page content data and position information from them; drawing the page content data at the corresponding position in the image region according to the position information; and performing optical character recognition on the page content data to obtain a recognition result. The method performs OCR directly on a PDF file without repeated switching between different software programs, which simplifies user operation, reduces operation time, and gives the user a better experience.
Description
Technical field
The present invention relates to the field of optical character recognition, and in particular to an optical character recognition method for PDF files and an optical character recognition apparatus for PDF files.
Background art
Optical character recognition (OCR) is a technology that uses character recognition techniques to convert images of characters into the computer's internal character codes. At present, the file formats that OCR technology can recognize are limited to image file formats, i.e. files in formats such as tif, bmp, or jpg.
PDF (Portable Document Format) is an electronic file format for describing page content. PDF files are independent of the operating system platform (that is, they can be used on Windows, Unix, and Mac OS alike) and have become an ideal document format for distributing electronic documents and disseminating digital information on the Internet. However, because a PDF file is not an image file, existing OCR systems cannot recognize it directly. The PDF file must first be converted by third-party software into an image file format that the OCR system can recognize, and only then can the OCR system perform recognition. For example, the region to be recognized is selected with the snapshot tool of a PDF processing program (such as Acrobat) and then saved as an image file through copy and paste operations.
Clearly, performing OCR on a PDF file in this way requires switching back and forth between different programs; the operation is complicated and time-consuming, and the user experience is poor. Those skilled in the art therefore urgently need an OCR processing method and apparatus that can recognize PDF files directly, without repeated switching between multiple programs.
Summary of the invention
The technical problem to be solved by the present invention is to provide an optical character recognition method that can recognize PDF files directly. With this method, OCR can be performed on PDF files simply and efficiently, giving the user a better experience.
The present invention also provides an optical character recognition apparatus that can recognize PDF files, to ensure the implementation and practical application of the above method.
To solve the above technical problem, an embodiment of the invention discloses an optical character recognition method for PDF files, comprising:
Determining a target page in a PDF file, and obtaining page size information of the target page;
Generating an image region of corresponding size in memory according to the page size information and preset resolution information;
Obtaining the page description instructions of the target page, and extracting the page content data and position information from the page description instructions;
Drawing the page content data at the corresponding position in the image region according to the position information;
Performing optical character recognition on the page content data to obtain a recognition result.
Preferably, the page content data comprise image data, graphic data and/or character data, and the drawing step further comprises:
Decoding the image data into a bitmap, and drawing the bitmap at the corresponding position in the image region;
And/or, drawing the graphic data directly at the corresponding position in the image region;
And/or, generating a character image according to attribute information of the character data, and drawing the character image at the corresponding position in the image region.
Preferably, there are multiple page description instructions, and the drawing step further comprises:
If the target page has a next page description instruction, continuing to extract the page content data and position information from the next page description instruction.
Preferably, before the step of extracting the page content data and position information, the method further comprises:
If the page description instructions have been compression-encoded, performing data decoding on the page description instructions.
Preferably, before determining the target page, the method further comprises:
Determining a target PDF file.
Preferably, the target page is determined by the following steps:
Obtaining page count information of the PDF file;
If the currently specified page number is within the range of the page count information, determining the page corresponding to that page number as the target page.
Preferably, the method further comprises:
Saving the page content data in the image region as an image file.
Preferably, the method further comprises:
Outputting the recognition result in a specified file format.
An embodiment of the invention also discloses an optical character recognition apparatus for PDF files, comprising:
A target page determining unit, configured to determine a target page in a PDF file;
A first acquiring unit, configured to obtain page size information of the target page;
A memory allocation unit, configured to generate an image region of corresponding size in memory according to the page size information and preset resolution information;
A second acquiring unit, configured to obtain the page description instructions of the target page;
An extraction unit, configured to extract the page content data and position information from the page description instructions;
A drawing execution unit, configured to draw the page content data at the corresponding position in the image region according to the position information;
A recognition unit, configured to perform optical character recognition on the page content data to obtain a recognition result.
Preferably, the page content data comprise image data, graphic data and/or character data, and the drawing execution unit further comprises:
An image drawing subunit, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;
And/or a graphic drawing subunit, configured to draw the graphic data directly at the corresponding position in the image region;
And/or a character drawing subunit, configured to generate a character image according to attribute information of the character data and draw the character image at the corresponding position in the image region.
Preferably, there are multiple page description instructions, and the drawing execution unit further comprises:
A loop subunit, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a next page description instruction.
Preferably, the apparatus further comprises:
A data decoding unit, configured to perform data decoding on the page description instructions when they have been compression-encoded.
Preferably, the apparatus further comprises:
A target file determining unit, configured to determine a target PDF file.
Preferably, the target file determining unit further comprises:
A page number obtaining subunit, configured to obtain page count information of the PDF file;
A locating subunit, configured to determine the page corresponding to the currently specified page number as the target page when that page number is within the range of the page count information.
Preferably, the apparatus further comprises:
A saving unit, configured to save the page content data in the image region as an image file.
Preferably, the apparatus further comprises:
A specified output unit, configured to output the recognition result in a specified file format.
Compared with the prior art, the embodiments of the invention have the following advantages:
First, the present invention parses the PDF file to obtain the page size information of a page, and calculates the pixel height and width of the output image from this page size information and the preset resolution information. It then allocates image storage space of corresponding size in memory for the output image, parses the file to obtain the page description instructions of the target page, and draws the page content data into the allocated image storage space. OCR can thus be performed directly on the PDF file without repeated switching between programs, which simplifies user operation, reduces operation time, and gives the user a better experience;
Second, the present invention can output the processed recognition result in a specified file format, so that the content of the PDF file can be edited in that format. This effectively improves the flexibility of editing PDF file content and further improves the user experience.
Description of drawings
Fig. 1 is a hierarchical structure diagram of a PDF file;
Fig. 2 is a flowchart of embodiment 1 of an optical character recognition method for PDF files according to the present invention;
Fig. 3 is a flowchart of the image conversion and drawing process for a PDF file according to the present invention;
Fig. 4 is a flowchart of embodiment 2 of an optical character recognition method for PDF files;
Fig. 5 is a structural block diagram of embodiment 1 of an optical character recognition apparatus for PDF files according to the present invention;
Fig. 6 is a structural block diagram of embodiment 2 of an optical character recognition apparatus for PDF files according to the present invention;
Fig. 7 is a flowchart of the OCR process for a PDF file performed with the preferred embodiment shown in Fig. 6.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
In terms of how PDF files are generated, there are two methods. The first uses optical scanning to convert existing paper documents, books, and the like into images, and then generates the PDF file from the images; the characters, graphics, and other data in such a file exist as images. The second uses an application program and a PDF printer (a kind of virtual printing software) to convert the computer's internal codes for characters and graphic data into the internal representation of PDF; the characters, graphics, and other data in such a file exist in PDF-encoded form.
In terms of data structure, the data in a PDF file are organized as PDF objects. Specifically, PDF objects fall into two classes, direct objects and indirect objects. Direct objects comprise the Boolean, Number, String, Name, Array, Dictionary, Stream, and Null types. An indirect object is a direct object that has been given a label so that other objects can reference it.
In terms of logical structure, a PDF file can be described as a hierarchy of PDF objects containing a single root object (Catalog); Fig. 1 shows the hierarchical structure diagram of a PDF file. The root object contains the bookmark tree and the page tree of the PDF document. The bookmark tree contains multiple bookmark entries. The page entry is the most important object in a PDF; it contains the page description instructions, that is, the information on how to display the page, such as the fonts used, the content (text, pictures, etc.), and the size of the page. Sub-entries may also be references to other objects.
In terms of storage structure, a standard PDF file consists of four parts: the file header (Header), the file body (Body), the cross-reference table (Cross-reference Table), and the file trailer (Trailer).
The file header (Header) specifies the version number of the PDF specification that the file conforms to; for example, "%PDF-1.3" indicates version 1.3. The file body (Body) contains a series of indirect objects that describe the pages of the file. The cross-reference table (Cross-reference Table) records the position of each indirect object in the file. The file trailer (Trailer) records the starting position of the cross-reference table in the file, the indirect object number of the root object (Catalog), and the end-of-file marker.
For example, a schematic listing of a PDF file is:
Based on the above structural analysis of PDF files, one of the core ideas of the embodiments of the invention is as follows. The pixel height and width of the output image are calculated from the page size information of the target page, obtained by parsing the PDF file, and from preset resolution information (usually expressed as the number of pixels per inch). Image storage space of corresponding size is then allocated in computer memory for the output image, and the character, graphic, and image data are drawn into the allocated space according to the page description instructions parsed from the target page. OCR on PDF files can thus be performed simply and quickly, giving the user a better experience.
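The size calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the page size is given as a PDF page box in points (1/72 inch, the default PDF user space unit), rounds up to whole pixels, and uses one byte per pixel for a grayscale buffer. The function names are hypothetical.

```python
import math

POINTS_PER_INCH = 72  # assumption: page size is expressed in PDF points (1/72 inch)


def image_size_in_pixels(page_box, dpi):
    """Return (width_px, height_px) for a page box (x0, y0, x1, y1) in points."""
    x0, y0, x1, y1 = page_box
    width_px = math.ceil(abs(x1 - x0) * dpi / POINTS_PER_INCH)
    height_px = math.ceil(abs(y1 - y0) * dpi / POINTS_PER_INCH)
    return width_px, height_px


def allocate_image_region(page_box, dpi):
    """Allocate a white 8-bit grayscale buffer in memory, one byte per pixel."""
    width_px, height_px = image_size_in_pixels(page_box, dpi)
    return width_px, height_px, bytearray(b"\xff" * (width_px * height_px))


# A 612 x 792 point page (US Letter) at a preset resolution of 300 pixels per inch
width, height, region = allocate_image_region((0, 0, 612, 792), 300)
print(width, height)  # 2550 3300
```

Rounding up rather than to the nearest pixel is a design choice made here so that no part of the page falls outside the buffer; the source does not say which rounding is used.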
Referring to Fig. 2, a flowchart of embodiment 1 of an optical character recognition method for PDF files according to the present invention is shown, which may specifically comprise the following steps:
Step 203: obtain the page description instructions of the target page, and extract the page content data and position information from the page description instructions;
It will be understood that, in this embodiment, the page size information and page description instructions of the relevant pages can be obtained by parsing the logical structure and storage structure of the PDF file. Specifically, parsing starts from the file trailer: the indirect object number of the root object and the position of the cross-reference table (that is, the byte offset in the file at which the cross-reference table begins) are extracted, and the object index function of the cross-reference table is then used to parse the objects level by level, starting from the root object.
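The first parsing step, reading the trailer, might look like the sketch below. It assumes a classic trailer with the `startxref` keyword and a `/Root` entry within the last kilobyte of the file, and it does not handle incrementally updated files or cross-reference streams. The function name and the tiny sample file are made up for illustration.

```python
import re


def read_trailer_info(pdf_bytes):
    """Return (xref_offset, root_object_number) read from the end of the file."""
    tail = pdf_bytes[-1024:]  # assumption: the trailer fits in the last 1 KiB
    start = re.search(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if start is None:
        raise ValueError("startxref not found")
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", tail)
    if root is None:
        raise ValueError("/Root not found in trailer")
    return int(start.group(1)), int(root.group(1))


# Build a tiny illustrative file so that the recorded offset is consistent.
body = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_pos = len(body)  # byte offset at which the cross-reference table begins
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_pos).encode("ascii") + b"\n%%EOF\n"
)

xref_offset, root_number = read_trailer_info(sample)
```

From here a parser would jump to `xref_offset`, look up the byte position of object `root_number` in the cross-reference table, and walk down from the root object to the page tree.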
In practice, the preset resolution information may be set by the user, may be a system default, or may be obtained in some other way; the present invention does not limit this.
The current PDF specification contains more than 70 page description instructions, which describe data objects such as characters, graphics, and images together with their style, position, and size. In this embodiment, the page content data may therefore comprise image data, graphic data and/or character data, in which case step 204 of drawing the page content data may further comprise the following substeps:
Substep S41: decode the image data into a bitmap, and draw the bitmap at the corresponding position in the image region;
And/or substep S42: draw the graphic data directly at the corresponding position in the image region;
And/or substep S43: generate a character image according to the attribute information of the character data, and draw the character image at the corresponding position in the image region.
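Drawing "at the corresponding position" needs a mapping from page coordinates to pixel coordinates. The sketch below is one possible way to do it, assuming page coordinates in points with the origin at the lower-left corner of the page, an image buffer stored row by row from the top-left corner, and 8-bit grayscale pixels. Both function names are hypothetical.

```python
def page_to_pixel(x_pt, y_pt, page_height_pt, dpi):
    """Map a page position in points (origin lower-left) to a pixel (origin top-left)."""
    px = round(x_pt * dpi / 72)
    py = round((page_height_pt - y_pt) * dpi / 72)  # flip the vertical axis
    return px, py


def blit(region, region_w, bitmap, bitmap_w, left, top):
    """Copy a decoded bitmap into the image region, clipping at the region edges."""
    region_h = len(region) // region_w
    bitmap_h = len(bitmap) // bitmap_w
    for row in range(bitmap_h):
        y = top + row
        if not 0 <= y < region_h:
            continue
        for col in range(bitmap_w):
            x = left + col
            if 0 <= x < region_w:
                region[y * region_w + x] = bitmap[row * bitmap_w + col]


# Draw a 2 x 2 black bitmap into a 4 x 4 white region, one pixel in from the top-left
region_4x4 = bytearray(b"\xff" * 16)
blit(region_4x4, 4, bytes(4), 2, 1, 1)
```

The same `blit` routine would serve for decoded images and for rendered character images, since both end up as bitmaps placed at a computed position.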
To help those skilled in the art better understand this embodiment, the parsing of concrete page description instructions from the above schematic listing of a PDF file is described below as an example. Suppose the PDF page description instructions obtained from object 60 (60 obj) in the listing are as follows:
BT
/F0 48.000 Tf
72.000 576.000 Td
(Hello World) Tj
ET
These page description instructions are parsed as follows:
(1) "BT" begins a text object; processing it requires initialization such as restoring the initial coordinate transformation parameters;
(2) "/F0 48.000 Tf" selects the font identified in this file as F0, with a font scaling factor of 48.0. In the file, the font name for identifier F0 is "Times-Roman" and the character encoding name is "WinAnsiEncoding"; the corresponding font file is loaded by font name during processing;
(3) "72.000 576.000 Td" moves the current coordinates to the position 72.0 points horizontally and 576.0 points vertically from the origin, which is the lower-left corner of the PDF page;
(4) "(Hello World) Tj" outputs the character sequence "Hello World". For each character, the corresponding glyph entry is found in the loaded font file, and a character image is generated and stored in the page image region in memory;
(5) "ET" ends the text object.
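The five instructions above can be walked through with a very small interpreter. This is a simplified sketch rather than a full content stream parser: it assumes operands are separated by whitespace, handles only the BT, Tf, Td, Tj, and ET operators, ignores escape sequences inside strings, and treats Td as a plain offset from the position reset by BT. The names `TOKEN` and `interpret_text_ops` are hypothetical.

```python
import re

# Literal strings, names, numbers, then operators (checked in that order)
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z*]+")


def interpret_text_ops(stream):
    """Collect (font, size, x, y, text) for each Tj found in a content stream."""
    operands, runs = [], []
    font, size, x, y = None, 0.0, 0.0, 0.0
    for tok in TOKEN.findall(stream):
        if tok[:1] == b"(":                        # literal string operand
            operands.append(tok[1:-1].decode("latin-1"))
        elif tok[:1] == b"/":                      # name operand, e.g. /F0
            operands.append(tok[1:].decode("latin-1"))
        elif tok[:1] in b"+-.0123456789":          # numeric operand
            operands.append(float(tok.decode("ascii")))
        else:                                      # operator: consume operands
            op = tok.decode("ascii")
            if op == "BT":
                x, y = 0.0, 0.0                    # restore the initial position
            elif op == "Tf":
                font, size = operands[-2], operands[-1]
            elif op == "Td":
                x, y = x + operands[-2], y + operands[-1]
            elif op == "Tj":
                runs.append((font, size, x, y, operands[-1]))
            operands.clear()
    return runs


stream = b"BT\n/F0 48.000 Tf\n72.000 576.000 Td\n(Hello World) Tj\nET"
print(interpret_text_ops(stream))
# [('F0', 48.0, 72.0, 576.0, 'Hello World')]
```

Each returned tuple carries what the drawing step needs: which font to load, at what size, and where on the page the character images should be placed.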
As the example above shows, a page may contain multiple page description instructions, in which case step 204 of drawing the page content data may also comprise the following substep:
Substep S44: if the target page has a next page description instruction, continue to extract the page content data and position information from the next page description instruction.
In addition, the PDF specification allows PDF objects to be compressed with several data encoding and compression methods. The methods currently supported by PDF include ASCIIHex, ASCII85, LZW, RunLength, CCITT Group 3, CCITT Group 4, JPEG, JPEG 2000, and Flate. Therefore, if the page description instructions have been compression-encoded, the present invention may also comprise a step of performing data decoding on them before the PDF page description instructions are parsed.
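As an illustration of the decoding step, the sketch below handles two of the listed encodings using only the Python standard library. It assumes the stream data and the filter name have already been read from the stream object; `FlateDecode` and `ASCIIHexDecode` are the filter names the PDF specification uses for the Flate and ASCIIHex methods, while the function name `decode_stream` is hypothetical. The other encodings are left unimplemented.

```python
import binascii
import zlib


def decode_stream(data, filter_name):
    """Decode raw stream bytes according to a single PDF filter name."""
    if filter_name == "FlateDecode":
        return zlib.decompress(data)
    if filter_name == "ASCIIHexDecode":
        # Keep everything before the '>' end marker and drop whitespace
        digits = b"".join(data.split(b">")[0].split())
        if len(digits) % 2:
            digits += b"0"  # an odd final digit is padded with 0
        return binascii.unhexlify(digits)
    raise NotImplementedError(filter_name)


raw = b"BT /F0 48.000 Tf 72.000 576.000 Td (Hello World) Tj ET"
compressed = zlib.compress(raw)
print(decode_stream(compressed, "FlateDecode") == raw)  # True
```

A stream may name several filters in sequence; a fuller implementation would apply them one after another in the listed order.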
Correspondingly, referring to Fig. 3, a flowchart of the image conversion and drawing process for a PDF file according to the present invention is shown, which may specifically comprise the following steps:
Step 304: perform data decoding on the page description instructions, then execute step 305;
Step 305: extract the page content data and position information from the first page description instruction;
Step 311: generate a character image according to the attribute information of the character data, draw the character image at the corresponding position in the image region, then execute step 312;
Step 313: continue to extract the page content data and position information from the next page description instruction, and return to step 306.
Referring to Fig. 4, a flowchart of embodiment 2 of an optical character recognition method for PDF files is shown, which may specifically comprise the following steps:
In practice, the corresponding PDF file can be located by obtaining the name of the file that the user asks to have recognized.
PDF is a structured file format in which pages are independent of one another, so any page of a PDF file can be accessed at random by its page number. The corresponding page can therefore be determined from the page number specified by the user, in which case step 402 may also comprise the following substeps:
Substep 4022: judge whether the currently specified page number is within the range of the page count information; if so, execute substep 4023; if not, execute step 4024. Substep 4023: determine the page corresponding to that page number as the target page.
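The range check in substep 4022 is simple enough to state in a few lines. The sketch below assumes 1-based page numbers and returns `None` when the number is out of range; what step 4024 actually does in that case is not given in this text, so that branch is a placeholder. The function name is hypothetical.

```python
def select_target_page(requested_page, page_count):
    """Return the 1-based target page number, or None if it is out of range."""
    if 1 <= requested_page <= page_count:
        return requested_page  # substep 4023: this page becomes the target page
    return None                # placeholder for the out-of-range branch


target = select_target_page(3, 10)
```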
Step 404: obtain the page description instructions of the target page, and extract the page content data and position information from the page description instructions;
At this point, the PDF page content data have been converted into corresponding image data in memory.
Because the above steps have turned the page content data into image data, any optical character recognition method of the prior art can be used in this embodiment. One such method, for example, is:
(1) Image data preprocessing:
The image data obtained by converting the PDF page undergo processing such as skew correction, deformation correction, and binarization, to ensure the effectiveness of the subsequent recognition;
(2) Layout analysis:
This mainly covers operations such as locating text image regions, recognizing tables, and understanding page information;
(3) Character recognition:
The character images in the image are converted into the computer's internal character codes; besides Chinese and English character recognition, support for Traditional Chinese, Japanese, and Korean can be added as needed;
(4) User proofreading:
The user can correct misrecognitions that occur during recognition.
Of course, the above processing method is only an example; those skilled in the art may also use other optical character recognition methods, and the present invention does not limit this.
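Of the preprocessing operations listed in (1), binarization is the easiest to illustrate. The sketch below uses a fixed global threshold on an 8-bit grayscale buffer; the threshold value of 128 and the function name are assumptions, and a practical system would more likely choose the threshold adaptively.

```python
def binarize(pixels, threshold=128):
    """Map 8-bit grayscale pixels to pure black (0) or pure white (255)."""
    return bytes(0 if p < threshold else 255 for p in pixels)


print(list(binarize(bytes([0, 100, 128, 255]))))  # [0, 0, 255, 255]
```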
As is well known, PDF files are read-only; in some cases, however, the content of a PDF file needs to be edited, so this embodiment may also comprise:
Performing layout restoration on the recognition result produced by OCR, that is, reorganizing the recognized data into structures such as text paragraphs and tables, and then exporting it as a file in a specified format, for example an editable file format such as RTF, DOC, TXT, EXCEL, WPS, or UOML.
In this case, whether the PDF file was generated from scanned images or by application software through conversion of the computer's internal codes, it can be converted into various easily edited file formats according to the size, position, and style of the characters, graphics, images, and other data on the original page. This effectively solves the problem that PDF file content is difficult to obtain and reuse, and greatly reduces the workload of manual data entry, page layout, and proofreading.
Of course, the above output in a specified file format can be implemented by any method of the prior art; the present invention does not limit this.
Preferably, this embodiment may also comprise the following step: saving the page content data in the image region as an image file.
The data may be kept in memory, or saved in any image format to a hard disk or other storage device for use by other programs; the present invention does not limit this.
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the invention.
Referring to Fig. 5, a structural block diagram of embodiment 1 of an optical character recognition apparatus for PDF files according to the present invention is shown, which may specifically comprise the following units:
A target page determining unit 501, configured to determine a target page in a PDF file;
A first acquiring unit 502, configured to obtain page size information of the target page;
Preferably, the page content data may comprise image data, graphic data and/or character data, in which case the drawing execution unit 506 may comprise the following subunits (S561-S564 are not shown in the drawings):
An image drawing subunit S561, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;
And/or a graphic drawing subunit S562, configured to draw the graphic data directly at the corresponding position in the image region;
And/or a character drawing subunit S563, configured to generate a character image according to the attribute information of the character data and draw the character image at the corresponding position in the image region.
In practice, the target page may contain multiple page description instructions, in which case the drawing execution unit 506 may also comprise a loop subunit S564, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a next page description instruction.
In addition, if the page description instructions have been compression-encoded, this embodiment may also comprise a data decoding unit, configured to perform data decoding on the page description instructions.
Referring to Fig. 6, a structural block diagram of embodiment 2 of an optical character recognition apparatus for PDF files according to the present invention is shown, which may specifically comprise the following units:
A target file determining unit 601, configured to determine a target PDF file;
A target page determining unit 602, configured to determine a target page in the PDF file;
Preferably, the target file determining unit may comprise the following subunits (S621-S622 are not shown in the drawings):
A page number obtaining subunit 6021, configured to obtain page count information of the PDF file;
A first acquiring unit 603, configured to obtain page size information of the target page;
A specified output unit 609, configured to output the recognition result in a specified file format.
Preferably, this embodiment may also comprise a saving unit, configured to save the page content data in the image region as an image file.
Referring to Fig. 7, a flowchart of the OCR process for a PDF file performed with the preferred embodiment shown in Fig. 6 is shown, which may specifically comprise the following steps:
Because the apparatus embodiments correspond substantially to the method embodiments, the relevant details can be found in the description of the method embodiments and are not repeated here. In addition, the description of each embodiment of the invention has its own emphasis; for parts not described in detail in one embodiment, refer to the related description of other embodiments.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention can also be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The optical character recognition method for PDF files and the optical character recognition apparatus for PDF files provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.
Claims (16)
1. An optical character recognition method for a PDF file, characterized by comprising:
determining a target page in the PDF file, and obtaining page size information of the target page by parsing the logical structure and the storage structure of the PDF file;
generating, in memory, an image region of corresponding size according to the page size information and preset resolution information;
obtaining page description instructions of the target page by parsing the logical structure and the storage structure of the PDF file, and extracting page content data and position information from the page description instructions;
drawing the page content data at the corresponding position in the image region according to the position information, whereby the PDF page content data in memory is converted into corresponding image data;
performing optical character recognition on the page content data to obtain a recognition result.
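As a rough illustration of the second step of claim 1, the pixel dimensions of the in-memory image region can be derived from the page size and the preset resolution. The sketch below assumes the page size is expressed in PDF default user-space units (1/72 inch), which the claim itself does not state. The names `image_region_size` and `allocate_image_region`, and the choice of a flat 8-bit grayscale buffer, are hypothetical and are not taken from the patent.

```python
import math

POINTS_PER_INCH = 72  # assumption: page size is given in units of 1/72 inch


def image_region_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Return (width_px, height_px) for a page rendered at the given resolution."""
    if width_pt <= 0 or height_pt <= 0 or dpi <= 0:
        raise ValueError("page size and resolution must be positive")
    # Round up so that the whole page fits inside the image region.
    width_px = math.ceil(width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def allocate_image_region(width_pt: float, height_pt: float, dpi: int):
    """Allocate a white grayscale buffer (one byte per pixel) for the page."""
    width_px, height_px = image_region_size(width_pt, height_pt, dpi)
    buffer = bytearray([255]) * (width_px * height_px)
    return buffer, width_px, height_px
```

Under these assumptions, a 612 x 792 unit page at a preset resolution of 300 dpi would need a region of 2550 x 3300 pixels.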
2. the method for claim 1 is characterized in that, described content of pages data comprise view data, graph data and/or character data, and described plot step further comprises:
Convert described image data decoding to bitmap, draw described bitmap in the relevant position of described image-region;
And/or, directly draw described graph data in the relevant position of described image-region;
And/or, according to the attribute information generation character picture of described character data, draw described character picture in the relevant position of described image-region.
3. method as claimed in claim 2 is characterized in that, described page-describing instruction has many, and described plot step further comprises:
If described target pages also has next bar page-describing instruction, then continue to extract content of pages data and positional information in next bar page-describing instruction.
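The drawing loop described in claims 2 and 3 might be sketched as follows. This is only an illustration under simplifying assumptions: `PageItem`, `new_region`, `blit` and `draw_page` are hypothetical names, and each extracted item is assumed to arrive already rasterized as a small grayscale bitmap. In a real renderer the three content kinds would each need their own step first (decoding image data, rasterizing vector graphics, generating a character image from font attributes).

```python
from dataclasses import dataclass
from typing import List

Bitmap = List[List[int]]  # rows of grayscale pixels, 0 = black, 255 = white


@dataclass
class PageItem:
    """Hypothetical result of extracting one page description instruction."""
    kind: str       # "image", "graphic" or "character"
    x: int          # left edge inside the image region, in pixels
    y: int          # top edge inside the image region, in pixels
    bitmap: Bitmap  # stand-in for the decoded or rasterized content


def new_region(width: int, height: int) -> Bitmap:
    """Create a white image region of the given pixel size."""
    return [[255] * width for _ in range(height)]


def blit(region: Bitmap, item: PageItem) -> None:
    """Copy item.bitmap into the region at (x, y), clipping at the edges."""
    for dy, row in enumerate(item.bitmap):
        for dx, value in enumerate(row):
            px, py = item.x + dx, item.y + dy
            if 0 <= py < len(region) and 0 <= px < len(region[0]):
                region[py][px] = value


def draw_page(region: Bitmap, items: List[PageItem]) -> int:
    """Draw every supported item in order and return how many were drawn."""
    drawn = 0
    for item in items:  # keep going while another instruction remains
        if item.kind not in ("image", "graphic", "character"):
            continue  # unknown content kinds are skipped in this sketch
        blit(region, item)
        drawn += 1
    return drawn
```

Once the loop finishes, the region holds the page as image data, which is what the recognition step would then consume.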
4. The method of claim 1, 2 or 3, characterized in that, before the step of extracting the page content data and position information, the method further comprises:
if the page description instruction has been compression-encoded, performing data decoding on the page description instruction.
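The claim does not name a particular compression scheme. As one possible illustration, the sketch below assumes the page description instructions are stored in a stream compressed with the PDF `FlateDecode` filter, which uses zlib/deflate, and decodes it with Python's standard `zlib` module. The function name `decode_page_stream` and its `filter_name` parameter are hypothetical; other filters are deliberately left unimplemented.

```python
import zlib
from typing import Optional


def decode_page_stream(raw: bytes, filter_name: Optional[str] = None) -> bytes:
    """Return the page description instructions as plain bytes.

    filter_name is the stream's compression filter, or None when the
    stream is stored uncompressed (assumption: at most one filter).
    """
    if filter_name is None:
        return raw  # not compression-encoded, nothing to decode
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    # Other filters are outside the scope of this sketch.
    raise NotImplementedError(f"unsupported filter: {filter_name}")
```

The decoded bytes can then be handed to the extraction step, which reads the page content data and position information.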
5. The method of claim 1, 2 or 3, characterized in that, before determining the target page, the method further comprises:
determining a target PDF file.
6. The method of claim 5, characterized in that the target page is determined by the following steps:
obtaining page number information of the PDF file;
if the currently specified page number is within the range of the page number information, determining the page corresponding to that page number as the target page.
7. The method of claim 2, characterized by further comprising:
saving the page content data in the image region as an image file.
8. The method of claim 1 or 7, characterized by further comprising:
outputting the recognition result in a specified file format.
9. An optical character recognition device for a PDF file, characterized by comprising:
a target page determining unit, configured to determine a target page in the PDF file;
a first acquiring unit, configured to obtain page size information of the target page by parsing the logical structure and the storage structure of the PDF file;
a memory allocation unit, configured to generate, in memory, an image region of corresponding size according to the page size information and preset resolution information;
a second acquiring unit, configured to obtain page description instructions of the target page by parsing the logical structure and the storage structure of the PDF file;
an extraction unit, configured to extract page content data and position information from the page description instructions;
a drawing execution unit, configured to draw the page content data at the corresponding position in the image region according to the position information, whereby the PDF page content data in memory is converted into corresponding image data;
a recognition unit, configured to perform optical character recognition on the page content data to obtain a recognition result.
10. The device of claim 9, characterized in that the page content data comprises image data, graphic data and/or character data, and the drawing execution unit further comprises:
an image drawing subunit, configured to decode the image data into a bitmap and draw the bitmap at the corresponding position in the image region;
and/or a graphic drawing subunit, configured to draw the graphic data directly at the corresponding position in the image region;
and/or a character drawing subunit, configured to generate a character image according to attribute information of the character data and draw the character image at the corresponding position in the image region.
11. The device of claim 10, characterized in that there are a plurality of page description instructions, and the drawing execution unit further comprises:
a loop subunit, configured to continue extracting the page content data and position information from the next page description instruction when the target page has a next page description instruction.
12. The device of claim 9, 10 or 11, characterized by further comprising:
a data decoding unit, configured to perform data decoding on the page description instruction when the page description instruction has been compression-encoded.
13. The device of claim 9, 10 or 11, characterized by further comprising:
a target file determining unit, configured to determine a target PDF file.
14. The device of claim 13, characterized in that the target file determining unit further comprises:
a page number obtaining subunit, configured to obtain page number information of the PDF file;
a locating subunit, configured to determine the page corresponding to the currently specified page number as the target page when that page number is within the range of the page number information.
15. The device of claim 10, characterized by further comprising:
a saving unit, configured to save the page content data in the image region as an image file.
16. The device of claim 9 or 15, characterized by further comprising:
a specified output unit, configured to output the recognition result in a specified file format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007101776734A CN101441713B (en) | 2007-11-19 | 2007-11-19 | Optical character recognition method and apparatus of PDF document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007101776734A CN101441713B (en) | 2007-11-19 | 2007-11-19 | Optical character recognition method and apparatus of PDF document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101441713A (en) | 2009-05-27 |
CN101441713B (en) | 2010-12-08 |
Family
ID=40726140
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2007101776734A Active CN101441713B (en) | 2007-11-19 | 2007-11-19 | Optical character recognition method and apparatus of PDF document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101441713B (en) |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8565474B2 (en) | 2010-03-10 | 2013-10-22 | Microsoft Corporation | Paragraph recognition in an optical character recognition (OCR) process |
CN101853246B (en) * | 2010-06-14 | 2012-05-23 | 深圳市万兴软件有限公司 | Method and device for converting document format |
CN103186912B (en) * | 2011-12-28 | 2016-07-06 | 北京神州泰岳软件股份有限公司 | The method and system of word are shown with picture format |
CN102831106A (en) * | 2012-08-27 | 2012-12-19 | 腾讯科技(深圳)有限公司 | Electronic document generation method of mobile terminal and mobile terminal |
CN104077593A (en) * | 2013-03-27 | 2014-10-01 | 富士通株式会社 | Image processing method and image processing device |
CN103279753B (en) * | 2013-06-09 | 2016-03-09 | 中国科学院自动化研究所 | A kind of English scene text block identifying method instructed based on tree construction |
CN104283921A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(深圳)有限公司 | Method and device for releasing microblog |
CN103744609B (en) * | 2014-01-20 | 2018-10-19 | 华为终端(东莞)有限公司 | A kind of data extraction method and device |
US9542622B2 (en) * | 2014-03-08 | 2017-01-10 | Microsoft Technology Licensing, Llc | Framework for data extraction by examples |
US10158549B2 (en) * | 2015-09-18 | 2018-12-18 | Fmr Llc | Real-time monitoring of computer system processor and transaction performance during an ongoing performance test |
CN105335346B (en) * | 2015-11-09 | 2018-12-04 | 汉王科技股份有限公司 | A kind of Text Extraction and device of PDF document |
US9779293B2 (en) * | 2016-01-27 | 2017-10-03 | Honeywell International Inc. | Method and tool for post-mortem analysis of tripped field devices in process industry using optical character recognition and intelligent character recognition |
CN106446863B (en) * | 2016-10-11 | 2020-01-21 | 同方知网(北京)技术有限公司 | PDF document logic diagram identification method |
US11562557B2 (en) * | 2017-07-25 | 2023-01-24 | Hewlett-Packard Development Company, L.P. | Character-recognition sharpness determinations |
CN110929479A (en) * | 2018-09-03 | 2020-03-27 | 珠海金山办公软件有限公司 | Method and device for converting PDF scanning piece, electronic equipment and storage medium |
CN109492199B (en) * | 2018-10-17 | 2023-04-28 | 四川译讯信息科技有限公司 | PDF file conversion method based on OCR pre-judgment |
CN109446995A (en) * | 2018-10-30 | 2019-03-08 | 广西科技大学 | The treating method and apparatus of billing information |
CN109948123B (en) * | 2018-11-27 | 2023-06-02 | 创新先进技术有限公司 | Image merging method and device |
CN110321470B (en) * | 2019-05-23 | 2024-05-28 | 平安科技(深圳)有限公司 | Document processing method, device, computer equipment and storage medium |
CN110991279B (en) * | 2019-11-20 | 2023-08-22 | 北京灵伴未来科技有限公司 | Document Image Analysis and Recognition Method and System |
CN111143213A (en) * | 2019-12-24 | 2020-05-12 | 北京数衍科技有限公司 | Software automation test method and device and electronic equipment |
CN112069771B (en) * | 2020-08-26 | 2024-05-28 | 中国建设银行股份有限公司 | Method and device for analyzing pictures in PDF (portable document format) file |
CN112036123B (en) * | 2020-08-31 | 2024-05-10 | 三六零数字安全科技集团有限公司 | PDF generation method, device, equipment and storage medium based on webpage |
CN112446373B (en) * | 2020-12-15 | 2023-06-06 | 万兴科技(湖南)有限公司 | Method, system, computer device and storage medium for identifying converted image file |
CN112861821B (en) * | 2021-04-06 | 2024-04-19 | 刘羽 | Map data reduction method based on PDF file analysis |
CN113553962A (en) * | 2021-07-27 | 2021-10-26 | 未鲲(上海)科技服务有限公司 | Electronic signature positioning method, device, equipment and storage medium |
CN113792659B (en) * | 2021-09-15 | 2024-04-05 | 上海金仕达软件科技股份有限公司 | Document identification method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN101441713A (en) | 2009-05-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20210831; Address after: 100124 first floor, building 8, No. 1129, Huihe South Street, Banbidian village, Gaobeidian Township, Chaoyang District, Beijing; Patentee after: Beijing Hanwang Yingyan Technology Co.,Ltd.; Address before: 100094, No. 5, building 8, No. three northeast Wang Xi Road, Beijing, Haidian District; Patentee before: HANWANG TECHNOLOGY Co.,Ltd. |