CN104732228B

CN104732228B - Method for detecting and correcting garbled characters in PDF documents

Info

Publication number: CN104732228B
Application number: CN201510181385.0A
Authority: CN
Inventors: 邹季英; 梁洵; 袁仁慧
Original assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd; TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Current assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd; TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Priority date: 2015-04-16
Filing date: 2015-04-16
Publication date: 2018-03-30
Anticipated expiration: 2035-04-16
Also published as: CN104732228A

Abstract

The invention discloses a method for detecting and correcting garbled characters in a PDF document, comprising: extracting all character features in the PDF document; classifying fonts into normal fonts, garbled fonts, and pending fonts according to the character features; extracting the bitmap images of the characters in the pending fonts, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between each bitmap image and its corresponding code, and judging from the similarity whether each character in a pending font is normal or garbled; subjecting the garbled characters in the pending fonts and the characters in the garbled fonts to longitudinal and transverse proofreading correction; and modifying the PDF document with the correction results to remove the garbled characters. By combining character features with the image features of characters, the invention detects garbled characters automatically, and combining longitudinal and transverse proofreading reduces the manual time spent on correction. The method effectively removes garbled characters, eliminates their interference with subsequent fragmentation processing, improves processing efficiency and quality, and reduces processing cost.

Description

Method for detecting and correcting garbled characters in PDF documents

Technical field

The present invention relates to methods for detecting and correcting garbled characters during the fragmentation processing of PDF documents, and more particularly to a method for detecting and correcting garbled characters in Chinese and English PDF documents.

Background art

PDF (Portable Document Format) is an electronic file format that is independent of the operating system platform. It has become a widely used, preferred document format for electronic document publishing and digital information dissemination.

During the fragmentation processing (metadata indexing) of a PDF document, text is extracted from the document. Text extraction means copying the characters of the document and pasting them to a specified location. Normally the displayed content is correct and matches the extraction result. When the displayed content and the extraction result are inconsistent, that is, the display is correct but the extracted text is wrong, the phenomenon is called garbled text in the PDF document. When the extraction result contains a large amount of garbled text, the indexer must key in the content to be indexed word by word and sentence by sentence. When a few or individual garbled characters are mixed in and hard to spot, the indexer must spend a great deal of time checking the extraction result to guarantee indexing quality. Garbled text therefore seriously reduces the efficiency and quality of metadata indexing.

Garbled text also seriously affects the accuracy of data content in the secondary processing of electronic documents. With the continuous development of computer and network technology, digital dissemination has become the mainstream way of spreading information. Digital dissemination requires conversion between electronic documents of different formats and types, such as between PDF and WORD or EPUB. The following can occur when converting a PDF document: a PDF document whose page text displays correctly is converted to another electronic document format, and garbled characters appear in the converted document. Although manual inspection of the converted document can find and correct garbled characters, it is time-consuming and laborious. When only a few garbled characters are mixed into the document, they are hard for the human eye to notice, which compromises the accuracy of the data content and lowers processing quality.

When a PDF document undergoes fragmentation processing, detecting and correcting garbled characters first finds and fixes them at the source, which avoids their adverse effect on subsequent processing. Detecting and correcting garbled characters in PDF documents is therefore very necessary. At present, few mature methods for solving the PDF garbled-text problem have been disclosed. A related technique is to combine OCR (Optical Character Recognition) with PDF text extraction to improve extraction accuracy. OCR uses character recognition to convert the image of a character into the character's internal computer code. It comprises image data preprocessing, layout analysis, character segmentation, and single-character recognition. PDF text extraction mainly uses the single-character recognition part of OCR. In garbled-character detection, applying OCR single-character recognition uniformly and indiscriminately to every character of the document is very costly. For example, for a PDF document in which most characters are normal and only a few are garbled, applying OCR single-character recognition to every character inevitably spends a large amount of time recognizing normal characters.

Summary of the invention

To solve the above technical problem, the object of the invention is to provide a method for detecting and correcting garbled characters in PDF documents. The method combines character features with the image statistical features of characters to detect garbled characters automatically, eliminates the interference of garbled characters with PDF fragmentation processing, improves processing quality, and reduces processing cost.

The object of the invention is achieved by the following technical solution:

A method for detecting and correcting garbled characters in a PDF document, comprising:

extracting all character features in the PDF document;

classifying fonts into normal fonts, garbled fonts, and pending fonts according to the character features;

extracting the bitmap images of the characters in the pending fonts, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between each bitmap image and its corresponding code, and judging from the similarity whether each character in a pending font is normal or garbled;

subjecting the garbled characters in the pending fonts and the characters in the garbled fonts to longitudinal and transverse proofreading correction;

modifying the PDF document with the correction results to remove the garbled characters.

Compared with the prior art, one or more embodiments of the invention can have the following advantages:

Detection draws on two complementary angles, the character features of the PDF document and the image features of the characters, which further improves the efficiency of garbled-character detection.

Because detection is performed per font, a character that recurs in the same font needs to be checked only once. This abandons the inefficient approach of extracting and re-checking text page by page, sentence by sentence, and character by character.

In garbled-character detection, the advantage of the detection algorithm based on image statistical features over OCR single-character recognition is that the former uses the character code as a guide and combines it with image features to make the judgment: it looks up the statistical features of the corresponding bitmap image in the feature library by the code of the current character, and decides whether the current character is garbled from the similarity between the character's bitmap image and those statistical features. The latter recognizes the character directly from the bitmap image and then compares the recognition result with the character code. OCR single-character recognition generally proceeds in two stages, coarse recognition and fine recognition: coarse recognition narrows the range of candidates and fine recognition determines the final result. In garbled-character detection the character code already fixes the range, so no coarse recognition is needed. The detection algorithm based on image statistical features is therefore simpler than OCR single-character recognition, saves time and effort, and is better suited to garbled-character detection.

Combining longitudinal and transverse proofreading reduces the time spent on manual proofreading and improves the efficiency of garbled-character correction.
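The saving from per-font detection described above can be illustrated with a minimal Python sketch. It assumes the text of a document is available as a sequence of (font name, character code) occurrences; the function name and the sample data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def unique_chars_by_font(occurrences):
    """Collapse (font_name, char_code) occurrences into one set of codes per font."""
    by_font = defaultdict(set)
    for font_name, char_code in occurrences:
        by_font[font_name].add(char_code)
    return dict(by_font)


# Hypothetical occurrences in reading order: (font name, character code).
occurrences = [("F1", 0x41), ("F1", 0x42), ("F1", 0x41), ("F2", 0x41), ("F1", 0x41)]

unique = unique_chars_by_font(occurrences)
# Number of checks when each (font, code) pair is examined only once.
checks_needed = sum(len(codes) for codes in unique.values())
```

In this example five occurrences reduce to three checks, because the code 0x41 in font F1 is examined once no matter how often it recurs.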

Brief description of the drawings

Fig. 1 is a flow chart of the method for detecting and correcting garbled characters in a PDF document.

Embodiment

To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawing.

As shown in Fig. 1, the flow of the method for detecting and correcting garbled characters in a PDF document includes:

extracting all character features in the PDF document;

modifying the PDF document with the correction results to remove the garbled characters.

The above character features include: font type, font encoding, whether a mapping exists between the current encoding and a standard encoding, whether the font is embedded, and so on. There are two main font types: composite fonts (Composite Font) and simple fonts (Simple Font). They differ in that the former describes characters with multi-byte codes, while the latter generally describes characters with single-byte codes. Font encodings divide into standard encodings and custom encodings. A standard encoding is a published, conventionally accepted encoding such as EUC (Extended Unix Code), UCS2 (Universal Multiple-Octet Coded Character Set 2), or ANSI (American National Standards Institute); a custom encoding is an unpublished, private encoding. The mapping between the current encoding and a standard encoding means that, when the font encoding is custom, the mapping allows the private custom encoding to be converted to a published standard encoding. An embedded font is one for which resources such as the glyphs (Glyph) of all characters the document uses in that font are stored in the PDF document according to certain rules. Embedded fonts contrast with non-embedded fonts: a non-embedded font does not embed its font resources in the document, and the resources it uses lie outside the document, for example system font resources.

The root cause of garbled text in a PDF document is that an embedded font in the document uses a custom encoding but lacks a mapping to a standard encoding, or has an incorrect mapping.
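The four character features and the stated root cause suggest one way a font could be sorted into the three classes. The patent does not spell out the classification rules, so the rules in this Python sketch are an assumption: a standard encoding is treated as normal, an embedded font with a custom encoding and no mapping is treated as garbled, and everything else (including a custom encoding whose mapping may be wrong) is left pending for image-based checking. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FontClass(Enum):
    NORMAL = "normal"
    GARBLED = "garbled"
    PENDING = "pending"


@dataclass(frozen=True)
class FontFeatures:
    is_composite: bool       # composite (multi-byte) vs. simple (usually single-byte) font
    standard_encoding: bool  # True for a published encoding, False for a custom one
    has_mapping: bool        # a mapping from the current encoding to a standard one exists
    embedded: bool           # font resources are stored inside the PDF document


def classify_font(f: FontFeatures) -> FontClass:
    """Assumed rules, not the patent's: see the lead-in above."""
    if f.standard_encoding:
        return FontClass.NORMAL
    if f.embedded and not f.has_mapping:
        # Custom encoding with no way back to a standard code.
        return FontClass.GARBLED
    # A mapping exists but may be wrong, or the case is otherwise unclear.
    return FontClass.PENDING
```

Note that `is_composite` is recorded but not used by these assumed rules; a fuller implementation might treat composite and simple fonts differently.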

For a normal font, all characters in the font are treated as normal characters and receive no processing;

for a garbled font, all characters in the font are judged to be garbled;

for a pending font, the bitmap images and corresponding codes (Unicode codes) of all characters the document uses in that font are extracted, the similarity between the two is calculated with the garbled-character detection algorithm based on image statistical features, and dissimilar characters are judged to be garbled.

The garbled-character detection algorithm based on image statistical features adopts the idea of statistical pattern recognition. For each character (mainly Chinese and English characters), image samples in different fonts and sizes are collected, image features are extracted, the distribution of the feature space is found, and that distribution is described with a statistical model. The image samples of each character must reach a certain number to be statistically meaningful; in this embodiment the lower limit on the number of image samples per character is 100. The image features, statistical models, and related information of the characters are saved according to certain rules as a feature library for later use.
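Building such a feature library can be sketched in Python as follows. The patent names neither the image features nor the statistical model, so both choices here are assumptions made for illustration: the feature is the ink density in each cell of a coarse grid over a square 0/1 bitmap, and the model is an independent mean and variance per feature dimension. Only the lower bound of 100 samples comes from the embodiment; every function name is hypothetical.

```python
from statistics import fmean, pvariance

MIN_SAMPLES = 100  # lower bound on image samples per character stated in the embodiment


def grid_density(bitmap, cells=2):
    """Ink fraction in each cell of a cells x cells grid over a square 0/1 bitmap.

    Densities are size-independent, so bitmaps of different sizes give
    comparable vectors. Pixels beyond cells * (n // cells) are ignored.
    """
    step = len(bitmap) // cells
    feats = []
    for gy in range(cells):
        for gx in range(cells):
            ink = sum(bitmap[y][x]
                      for y in range(gy * step, (gy + 1) * step)
                      for x in range(gx * step, (gx + 1) * step))
            feats.append(ink / (step * step))
    return feats


def fit_model(samples):
    """Fit a per-dimension mean and variance to a list of feature vectors."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError("too few samples to be statistically meaningful")
    dims = list(zip(*samples))
    return {"mean": [fmean(d) for d in dims],
            # A small floor keeps the variance strictly positive.
            "var": [pvariance(d) + 1e-6 for d in dims]}


def build_library(bitmaps_by_code):
    """Map each character code to the model fitted on its sample bitmaps."""
    return {code: fit_model([grid_density(b) for b in bitmaps])
            for code, bitmaps in bitmaps_by_code.items()}
```

A real feature library would likely use richer features and a finer grid; the 2 x 2 default only keeps the example small.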

To calculate similarity, the bitmap image features of the character under test are extracted first; the statistical information of the character corresponding to the character's code is then looked up in the feature library; finally, the probability that the image features under test occur under the corresponding statistical model is estimated. A higher probability indicates greater similarity and a lower probability indicates less. A threshold can be preset, and a character whose probability falls below the threshold is judged to be garbled.
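The lookup, scoring, and threshold steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's algorithm: the statistical model is assumed to be an independent Gaussian per feature dimension, the similarity score is the log-density of the feature vector under that model, and the threshold value is arbitrary. The library layout and function names are hypothetical.

```python
import math


def log_likelihood(features, model):
    """Log-density of a feature vector under independent per-dimension Gaussians."""
    total = 0.0
    for x, mu, var in zip(features, model["mean"], model["var"]):
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def is_garbled(features, code, library, threshold):
    """Judge a character garbled when its score under the model for `code` is too low.

    Codes absent from the library raise KeyError; this sketch does not handle them.
    """
    model = library[code]  # the character code guides the lookup
    return log_likelihood(features, model) < threshold


# Hypothetical library entry for code 0x41 with two feature dimensions.
library = {0x41: {"mean": [0.5, 0.5], "var": [0.01, 0.01]}}
THRESHOLD = -5.0  # assumed value; in practice it would be tuned on labeled data

close = is_garbled([0.5, 0.5], 0x41, library, THRESHOLD)  # features match the model
far = is_garbled([0.0, 1.0], 0x41, library, THRESHOLD)    # features far from the model
```

Because the code selects a single model, each character costs one density evaluation, which is the saving over OCR's coarse-then-fine search that the description points to.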

The above proofreading correction exports all garbled characters to a proofreading tool for correction. Longitudinal proofreading gathers identical characters from different fonts together for centralized batch modification; transverse proofreading gathers the different characters of the same font together for manual modification by comparing image and text.
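The two groupings can be sketched in Python as follows. The patent does not say how "identical characters" are recognized across fonts; this sketch assumes, purely for illustration, that two garbled characters are identical when they carry the same current character code. The record layout and function name are hypothetical.

```python
from collections import defaultdict


def group_for_proofreading(garbled):
    """Group (font_name, char_code) records for the two proofreading passes.

    Returns (longitudinal, transverse):
      longitudinal: code -> fonts containing it, for batch modification
      transverse:   font -> codes in it, for image-versus-text comparison
    """
    longitudinal = defaultdict(list)
    transverse = defaultdict(list)
    for font_name, char_code in garbled:
        longitudinal[char_code].append(font_name)
        transverse[font_name].append(char_code)
    return dict(longitudinal), dict(transverse)


# Hypothetical detection output.
garbled = [("F1", 0x10), ("F2", 0x10), ("F1", 0x11)]
longitudinal, transverse = group_for_proofreading(garbled)
```

Here code 0x10 appears in two fonts and could be fixed in one batch, while font F1 presents two different characters to be checked side by side against their glyph images.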

According to the structure of PDF files, the correction results are used, font by font, to create or update the mapping table between the current encoding and the standard encoding of all characters in each embedded font, which achieves the purpose of removing garbled characters.
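The per-font mapping update can be sketched in Python as follows. The patent does not name the PDF structure that holds the mapping table; this sketch assumes it corresponds to a font's ToUnicode CMap and prints entries in the `bfchar` style used there (source code in hex, then the target text as UTF-16BE hex). The function names and data layout are hypothetical, and writing the result back into a PDF file is outside the scope of the sketch.

```python
def apply_corrections(font_maps, corrections):
    """Return new per-font code -> text maps with the proofreading results applied.

    font_maps:   {font_name: {char_code: text}}
    corrections: iterable of (font_name, char_code, corrected_text)
    """
    updated = {font: dict(mapping) for font, mapping in font_maps.items()}
    for font_name, char_code, text in corrections:
        # Creates the table if the font had none, otherwise updates it.
        updated.setdefault(font_name, {})[char_code] = text
    return updated


def bfchar_lines(mapping, code_bytes=2):
    """Format a code -> text map as bfchar-style lines, e.g. '<0001> <4E2D>'."""
    lines = []
    for code in sorted(mapping):
        target = mapping[code].encode("utf-16-be").hex().upper()
        lines.append(f"<{code:0{code_bytes * 2}X}> <{target}>")
    return lines


# Hypothetical state: F1 has a wrong entry, F2 has no table at all.
font_maps = {"F1": {0x01: "?"}}
corrections = [("F1", 0x01, "中"), ("F2", 0x02, "A")]
fixed = apply_corrections(font_maps, corrections)
```

Working font by font means each corrected code is written once and takes effect for every occurrence of that character in the document.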

Although the embodiments disclosed herein are as described above, the content described is merely an embodiment adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make any modification or change in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be defined by the appended claims.

Claims

1. A method for detecting and correcting garbled characters in a PDF document, characterized in that the method comprises:

extracting all character features in the PDF document;

extracting the bitmap images of the characters in the pending fonts, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between each bitmap image and its corresponding code, and judging from the similarity whether each character in a pending font is normal or garbled;

subjecting the garbled characters in the pending fonts and the characters in the garbled fonts to longitudinal and transverse proofreading correction;

modifying the PDF document with the correction results to remove the garbled characters;

the character features include: font type, font encoding, whether a mapping exists between the current encoding and a standard encoding, and whether the font is embedded;

the garbled-character detection algorithm based on image statistical features includes: an image feature extraction algorithm, a parameter estimation method for the statistical model, and an image feature matching algorithm;

the longitudinal proofreading correction gathers identical characters from different fonts together for centralized batch modification;

the transverse proofreading correction gathers the different characters of the same font together for modification by comparing image and text;

modifying the PDF document with the correction results means that, font by font, the correction results are used to create or update the mapping table between the current encoding and the standard encoding of all characters in the embedded fonts of the PDF document, so as to remove the garbled characters.

2. The method for detecting and correcting garbled characters in a PDF document according to claim 1, characterized in that if the glyph image of a character in the pending font is similar to its corresponding code, the character is judged to be normal; otherwise it is judged to be garbled.