CN104732228B - Method for detecting and correcting garbled characters in PDF documents - Google Patents

Method for detecting and correcting garbled characters in PDF documents

Info

Publication number
CN104732228B
Authority
CN
China
Prior art keywords
font
character
garbled characters
correction
pdf document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510181385.0A
Other languages
Chinese (zh)
Other versions
CN104732228A (en)
Inventor
邹季英
梁洵
袁仁慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Original Assignee
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd, TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd filed Critical TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority to CN201510181385.0A priority Critical patent/CN104732228B/en
Publication of CN104732228A publication Critical patent/CN104732228A/en
Application granted granted Critical
Publication of CN104732228B publication Critical patent/CN104732228B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for detecting and correcting garbled characters in PDF documents, comprising: extracting all character features from the PDF document; classifying each font as a normal font, a garbled font, or a pending font according to those features; extracting the bitmap images of the characters in each pending font, using a garbled-character detection algorithm based on image statistical features to compute the similarity between each bitmap image and its corresponding code, and judging from that similarity whether each character in the pending font is normal or garbled; submitting the garbled characters found in pending fonts, together with the garbled characters in garbled fonts, to vertical and horizontal proofreading correction; and modifying the PDF document with the correction results to remove the garbled characters. By combining character features with the image features of characters, the invention detects garbled characters automatically, and combining vertical and horizontal proofreading reduces the manual time spent on correction. The method removes garbled characters effectively, eliminates their interference with subsequent fragmentation processing, improves processing efficiency and quality, and reduces processing cost.

Description

Method for detecting and correcting garbled characters in PDF documents
Technical field
The present invention relates to the detection and correction of garbled characters during the fragmentation processing of PDF documents, and more particularly to a method for detecting and correcting garbled characters in Chinese and English PDF documents.
Background art
PDF (Portable Document Format) is an electronic file format that is independent of the operating system platform. It has become a widely used, preferred document format for electronic document publishing and digital information dissemination.
During the fragmentation processing (metadata indexing) of a PDF document, text is extracted from the document. Text extraction here means copying characters from the document and pasting them to a specified location. Normally the document displays correctly and the displayed content matches the extraction result. When the two differ, that is, the display is correct but the extracted text is wrong, the phenomenon is called PDF garbling. When the extraction result contains a large amount of garbled text, the indexer must type the indexed content in by keyboard, word by word and sentence by sentence; when only a few garbled characters are mixed in and are hard to spot, the indexer must spend a great deal of time checking the extraction result to ensure indexing quality. Garbling therefore seriously reduces the efficiency and quality of metadata indexing.
Garbling also seriously affects the accuracy of data content in the secondary processing of electronic documents. With the continuous development of computer and network technology, digital dissemination has become the mainstream way of spreading information. It creates a need to convert between electronic documents of different formats and types, for example between PDF and WORD or EPUB. The following can occur when converting a PDF document: the page text displays correctly, yet the document produced by converting it to another format contains garbled characters. Manual inspection of the converted document can find and correct them, but it is time-consuming and laborious, and a small number of garbled characters scattered through a document are easily missed by the human eye, which compromises the accuracy of the data content and lowers processing quality.
If garbled characters are detected and corrected before a PDF document undergoes fragmentation processing, they are found and fixed at the source, and their adverse effect on subsequent processing is avoided. Detecting and correcting garbled characters in PDF documents is therefore very necessary. At present, few mature methods for solving the PDF garbling problem have been published. One related technique combines OCR (Optical Character Recognition) with PDF text extraction to improve extraction accuracy. OCR converts the image of a character into its internal computer code by means of character recognition. It comprises image preprocessing, layout analysis, character segmentation, and single-character recognition; PDF text extraction mainly uses the single-character recognition step. Applying single-character recognition indiscriminately to every character of a document for garbled-character detection is very costly. For example, in a PDF document where most characters are normal and only a few are garbled, running OCR single-character recognition on every character inevitably spends most of the time recognizing normal characters.
Summary of the invention
To solve the above technical problems, the object of the invention is to provide a method for detecting and correcting garbled characters in PDF documents. By combining character features with the image statistical features of characters, the method detects garbled characters automatically, eliminates their interference with the fragmentation processing of PDF documents, improves processing quality, and reduces processing cost.
The object of the invention is achieved by the following technical solution:
A method for detecting and correcting garbled characters in PDF documents, comprising:
Extracting all character features from the PDF document;
Classifying each font as a normal font, a garbled font, or a pending font according to the character features;
Extracting the bitmap images of the characters in each pending font, computing the similarity between each bitmap image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the pending font is normal or garbled;
Submitting the garbled characters in pending fonts and the garbled characters in garbled fonts to vertical and horizontal proofreading correction;
Modifying the PDF document with the correction results to remove the garbled characters.
Compared with the prior art, one or more embodiments of the invention may have the following advantages:
Detection draws on two complementary angles, the character features of the PDF document and the image features of the characters, which further improves detection efficiency;
Because detection works font by font, a character that recurs in the same font needs to be checked only once, which avoids the inefficient approach of repeatedly extracting and checking text page by page, sentence by sentence, and character by character;
Compared with OCR single-character recognition, the detection algorithm based on image statistical features has the advantage of being guided by the character code. It looks up, by the current character's code, the statistical features of the corresponding bitmap image in the feature library, and decides whether the current character is garbled from the similarity between the character's bitmap image and those statistical features. OCR instead recognizes the bitmap image directly and then compares the recognition result with the character code. OCR single-character recognition usually proceeds in two stages: coarse recognition narrows the range of candidates, and fine recognition determines the final result. In garbled-character detection the character code already fixes the candidate, so no coarse recognition is needed. The detection algorithm based on image statistical features is thus simpler and faster than OCR single-character recognition and better suited to garbled-character detection.
Combining vertical and horizontal proofreading reduces the time spent on manual proofreading and improves correction efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the method for detecting and correcting garbled characters in PDF documents.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawing.
Fig. 1 shows the flow of the method for detecting and correcting garbled characters in PDF documents. The method comprises:
Extracting all character features from the PDF document;
Classifying each font as a normal font, a garbled font, or a pending font according to the character features;
Extracting the bitmap images of the characters in each pending font, computing the similarity between each bitmap image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the pending font is normal or garbled;
Submitting the garbled characters in pending fonts and the garbled characters in garbled fonts to vertical and horizontal proofreading correction;
Modifying the PDF document with the correction results to remove the garbled characters.
The character features include the font type, the font encoding, whether a mapping exists between the current encoding and a standard encoding, whether the font is embedded, and so on. There are two main font types: composite fonts (Composite Font) and simple fonts (Simple Font). They differ in that a composite font describes characters with multi-byte codes, whereas a simple font usually describes them with single-byte codes. The font encoding is either standard or custom. A standard encoding is a published, conventionally accepted encoding such as EUC (Extended Unix Code), UCS2 (Universal Multiple-Octet Coded Character Set 2), or ANSI (American National Standards Institute); a custom encoding is an unpublished, private one. The mapping between the current encoding and a standard encoding matters when the font encoding is custom: with such a mapping, the private custom codes can be converted to published standard codes. An embedded font is one whose resources, such as the glyphs (Glyph) of all the characters the document uses from that font, are stored in the PDF document according to certain rules. A non-embedded font, by contrast, does not store its resources in the document; it relies on resources outside the document, such as system fonts.
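The four character features listed above can be held in a small record per font. The following is a minimal sketch only; the class and field names (`FontFeatures`, `has_standard_mapping`, and so on) and the example font name are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class FontType(Enum):
    COMPOSITE = "composite"  # characters described with multi-byte codes
    SIMPLE = "simple"        # characters usually described with single-byte codes


class Encoding(Enum):
    STANDARD = "standard"    # published encoding, e.g. EUC, UCS2, ANSI
    CUSTOM = "custom"        # private, unpublished encoding


@dataclass(frozen=True)
class FontFeatures:
    """Hypothetical per-font record of the features named in the text."""
    name: str
    font_type: FontType
    encoding: Encoding
    has_standard_mapping: bool  # mapping from current codes to standard codes exists
    embedded: bool              # glyph resources are stored inside the PDF


# Example: an embedded composite font with a private encoding and no mapping.
example = FontFeatures("ABCDEF+SimSun", FontType.COMPOSITE, Encoding.CUSTOM, False, True)
```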
The root cause of garbling in a PDF document is that an embedded font in the document uses a custom encoding but either lacks a mapping to a standard encoding or carries an incorrect one.
For a normal font, all characters under the font are treated as normal characters and no processing is performed;
For a garbled font, all characters under the font are judged to be garbled;
For a pending font, the bitmap images and corresponding codes (Unicode codes) of all the characters the document uses from that font are extracted, the similarity between each image and its code is computed with the garbled-character detection algorithm based on image statistical features, and dissimilar characters are judged to be garbled.
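The text does not spell out the exact rules that sort a font into the three classes. The sketch below is one plausible reading, derived from the stated root cause (an embedded font with a custom encoding and a missing or wrong mapping); the function name and the rules themselves are assumptions.

```python
def classify_font(embedded: bool, custom_encoding: bool, has_mapping: bool) -> str:
    """Return 'normal', 'garbled', or 'pending' for one font (assumed rules)."""
    if not custom_encoding:
        # A standard encoding can be decoded directly, so extraction is trusted.
        return "normal"
    if embedded and not has_mapping:
        # Custom codes with no route to standard codes cannot be decoded at all.
        return "garbled"
    # A mapping exists but may be wrong, so each glyph image must be checked.
    return "pending"
```

Under these assumed rules only the pending fonts reach the image-based check, which is what keeps the per-character work small.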
The detection algorithm based on image statistical features follows the idea of statistical pattern recognition. For each character (mainly Chinese and English characters), image samples in different fonts and sizes are collected, image features are extracted, the distribution of those features in feature space is determined, and the distribution is described with a statistical model. The number of image samples per character must reach a certain level to be statistically meaningful; in this embodiment the lower limit is 100 samples per character. The image features, statistical models, and related information for each character are stored according to certain rules as a feature library for later use.
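As a rough illustration of building one feature-library entry, the sketch below uses zone ink density as the image feature and a per-dimension mean and variance as the statistical model. Both choices are assumptions made for the example; the patent names neither the feature nor the model. Only the lower limit of 100 samples comes from the text.

```python
from statistics import fmean, pvariance

MIN_SAMPLES = 100  # lower limit on samples per character stated for this embodiment


def zone_density(bitmap, zones=2):
    """Feature vector: fraction of ink pixels in each cell of a zones x zones grid.

    bitmap is a list of rows of 0/1 values, at least zones pixels on each side.
    """
    h, w = len(bitmap), len(bitmap[0])
    feats = []
    for zr in range(zones):
        for zc in range(zones):
            rows = range(zr * h // zones, (zr + 1) * h // zones)
            cols = range(zc * w // zones, (zc + 1) * w // zones)
            cells = [bitmap[r][c] for r in rows for c in cols]
            feats.append(sum(cells) / len(cells))
    return feats


def fit_model(samples):
    """Estimate per-dimension mean and variance from one character's sample bitmaps."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError("too few samples for a meaningful statistic")
    vectors = [zone_density(b) for b in samples]
    dims = list(zip(*vectors))
    means = [fmean(d) for d in dims]
    variances = [max(pvariance(d), 1e-4) for d in dims]  # floor avoids zero variance
    return means, variances
```

A feature library would then be a dictionary from character code to the `(means, variances)` pair returned by `fit_model`.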
To compute the similarity, the bitmap image features of the character under test are extracted first; the statistical information of the character corresponding to its code is then looked up in the feature library; finally, the probability that the image features under test occur under the corresponding statistical model is estimated. A higher probability indicates greater similarity and a lower probability less. A threshold can be set in advance, and a character whose probability falls below it is judged to be garbled.
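The lookup-then-score step can be sketched as follows, assuming an independent Gaussian per feature dimension as the statistical model and a log-likelihood as the similarity score. The model, the function names, and the handling of codes missing from the library are all assumptions for illustration.

```python
import math


def log_likelihood(features, means, variances):
    """Log-density of a feature vector under an independent Gaussian per dimension."""
    total = 0.0
    for x, mu, var in zip(features, means, variances):
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def is_garbled(code, features, feature_db, threshold):
    """Look up the model by the character's code and flag low-likelihood glyphs."""
    model = feature_db.get(code)
    if model is None:
        # Assumption: a code with no model in the library cannot be confirmed.
        return True
    means, variances = model
    return log_likelihood(features, means, variances) < threshold
```

Because the code selects a single model, only one comparison is made per character, in contrast to an OCR pass that must first search for candidate characters.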
The proofreading correction exports all garbled characters to a proofreading tool for correction. Vertical proofreading gathers identical characters from different fonts so that they can be modified in one batch; horizontal proofreading gathers the different characters of one font so that they can be checked against the page image and corrected manually.
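The two groupings can be sketched as two indexes over the same list of garbled characters. The text does not say how identical characters are recognized across fonts, so the `glyph_key` used here (for instance a hash of the normalized bitmap) is a hypothetical stand-in, as is the function name.

```python
from collections import defaultdict


def group_for_proofreading(garbled):
    """Build both views from (font_name, char_code, glyph_key) tuples."""
    vertical = defaultdict(set)    # glyph_key -> {(font, code)}: fix once, apply to all
    horizontal = defaultdict(set)  # font -> {(code, glyph_key)}: review beside the page image
    for font, code, glyph_key in garbled:
        vertical[glyph_key].add((font, code))
        horizontal[font].add((code, glyph_key))
    return dict(vertical), dict(horizontal)
```

A proofreader would resolve each vertical group with one decision and fall back on the horizontal view for characters that need to be read in context.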
Following the structure of PDF files and working font by font, the correction results are used to create or update the mapping table between the current codes and the standard codes of all characters under each embedded font, which removes the garbled characters.
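The create-or-update step can be pictured as merging corrections into a per-font table from character code to Unicode text. This is a sketch over plain dictionaries; the function name and data layout are assumptions. In a real PDF such a table would typically correspond to the font's ToUnicode CMap, and writing it back into the file is outside this example.

```python
def apply_corrections(font_maps, corrections):
    """Return new per-font code-to-Unicode maps with the corrections merged in.

    font_maps: {font_name: {char_code: unicode_string}}
    corrections: iterable of (font_name, char_code, unicode_string)
    """
    # Copy so the caller's tables are left untouched.
    updated = {font: dict(table) for font, table in font_maps.items()}
    for font, code, text in corrections:
        updated.setdefault(font, {})[code] = text  # create or overwrite the entry
    return updated
```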
Although the embodiments are disclosed as above, they are described only to facilitate understanding of the invention and do not limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein, but the scope of patent protection of the invention remains as defined by the appended claims.

Claims (2)

1. A method for detecting and correcting garbled characters in PDF documents, characterized in that the method comprises:
Extracting all character features from the PDF document;
Classifying each font as a normal font, a garbled font, or a pending font according to the character features;
Extracting the bitmap images of the characters in each pending font, computing the similarity between each bitmap image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the pending font is normal or garbled;
Submitting the garbled characters in pending fonts and the garbled characters in garbled fonts to vertical and horizontal proofreading correction;
Modifying the PDF document with the correction results to remove the garbled characters;
The character features include: the font type, the font encoding, whether a mapping exists between the current encoding and a standard encoding, and whether the font is embedded;
The garbled-character detection algorithm based on image statistical features includes: an image feature extraction algorithm, a parameter estimation method for the statistical model, and an image feature matching algorithm;
The vertical proofreading correction gathers identical characters from different fonts and modifies them in one batch;
The horizontal proofreading correction gathers the different characters of one font and modifies them by checking them against the page image;
Modifying the PDF document with the correction results means, working font by font, using the correction results to create or update the mapping table between the current codes and the standard codes of all characters under each embedded font of the PDF document, thereby removing the garbled characters.
2. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that if the bitmap image of a character in the pending font is similar to its corresponding code, the character is judged to be normal; otherwise it is judged to be garbled.
CN201510181385.0A 2015-04-16 2015-04-16 Method for detecting and correcting garbled characters in PDF documents Active CN104732228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510181385.0A CN104732228B (en) 2015-04-16 2015-04-16 Method for detecting and correcting garbled characters in PDF documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510181385.0A CN104732228B (en) 2015-04-16 2015-04-16 Method for detecting and correcting garbled characters in PDF documents

Publications (2)

Publication Number Publication Date
CN104732228A CN104732228A (en) 2015-06-24
CN104732228B true CN104732228B (en) 2018-03-30

Family

ID=53456103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510181385.0A Active CN104732228B (en) 2015-04-16 2015-04-16 Method for detecting and correcting garbled characters in PDF documents

Country Status (1)

Country Link
CN (1) CN104732228B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488471B (en) * 2015-11-30 2019-03-29 北大方正集团有限公司 A kind of font recognition methods and device
CN105718554A (en) * 2016-01-19 2016-06-29 深圳市天朗时代科技有限公司 Document collaboration conversion method and system
US10261729B1 (en) 2018-02-27 2019-04-16 Ricoh Company, Ltd. Document manipulation mechanism
CN110728115B (en) * 2018-07-17 2024-01-26 珠海金山办公软件有限公司 Document content messy code identification method and device and electronic equipment
CN110728111B (en) * 2018-07-17 2024-06-25 珠海金山办公软件有限公司 Document content messy code repairing method and device, terminal equipment and server
CN108985289A (en) * 2018-07-18 2018-12-11 百度在线网络技术(北京)有限公司 Messy code detection method and device
CN110765826A (en) * 2018-07-27 2020-02-07 珠海金山办公软件有限公司 Method and device for identifying messy codes in Portable Document Format (PDF)
CN109684962B (en) * 2018-12-14 2023-04-18 苏州梦想人软件科技有限公司 AR electronic book quality detection method
CN111695327B (en) * 2019-02-28 2024-01-26 珠海金山办公软件有限公司 Method and device for repairing messy codes, electronic equipment and readable storage medium
CN111144107B (en) * 2019-12-25 2022-08-09 福建天晴在线互动科技有限公司 Messy code identification method based on slicing algorithm
CN111401362A (en) * 2020-03-06 2020-07-10 上海眼控科技股份有限公司 Tampering detection method, device, equipment and storage medium for vehicle VIN code
CN113627129B (en) * 2020-05-08 2024-06-21 珠海金山办公软件有限公司 Text copying method and device, electronic equipment and readable storage medium
CN113158745B (en) * 2021-02-02 2024-04-02 北京惠朗时代科技有限公司 Multi-feature operator-based messy code document picture identification method and system
CN114529930B (en) * 2022-01-13 2024-03-01 上海森亿医疗科技有限公司 PDF restoration method, storage medium and device based on nonstandard mapping fonts
CN114519858B (en) * 2022-02-16 2023-09-05 北京百度网讯科技有限公司 Document image recognition method and device, storage medium and electronic equipment
CN114629707B (en) * 2022-03-16 2024-05-24 深信服科技股份有限公司 Disorder code detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110072A (en) * 2007-08-21 2008-01-23 无敌科技(西安)有限公司 Device and method for automatic identifying literal code
CN104346616A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Character recognition device and character recognition method
CN104424165A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Messy code detection method and system for text documents
CN104424010A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Method and system for detecting and repairing text document messy codes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02278104A (en) * 1989-04-19 1990-11-14 Fuji Electric Co Ltd Detecting method for angle of inclination of document image

Also Published As

Publication number Publication date
CN104732228A (en) 2015-06-24

Similar Documents

Publication Publication Date Title
CN104732228B (en) Method for detecting and correcting garbled characters in PDF documents
CN101782896B (en) PDF character extraction method combined with OCR technology
CN108415887B (en) Method for converting PDF file into OFD file
CN105068997B (en) The construction method and device of parallel corpora
CN104750666B (en) A kind of recognition methods of text character codes mode and system
CN111428474A (en) Language model-based error correction method, device, equipment and storage medium
US9286527B2 (en) Segmentation of an input by cut point classification
CN106127265B (en) A kind of text in picture identification error correction method based on activating force model
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN113408535B (en) OCR error correction method based on Chinese character level features and language model
CN112085011A (en) OCR recognition result error correction method, device and storage medium
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN104331400B (en) A kind of Mongolian code conversion method and device
CN108038093A (en) PDF text extraction methods and device
CN112084308A (en) Method, system and storage medium for text type data recognition
CN111931489A (en) Text error correction method, device and equipment
CN109325237B (en) Complete sentence recognition method and system for machine translation
CN110705217A (en) Wrongly-written character detection method and device, computer storage medium and electronic equipment
CN105677718A (en) Character retrieval method and apparatus
CN107491441B (en) Method for dynamically extracting translation template based on forced decoding
CN104933030A (en) Uygur language spelling examination method and device
CN102880874B (en) Character identifying method and Character recognizer
CN105653516B (en) The method and apparatus of parallel corpora alignment
CN104699662B (en) The method and apparatus for identifying overall symbol string

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant