CN104732228B - Method for detecting and correcting garbled characters in PDF documents - Google Patents
- Publication number
- CN104732228B (application number CN201510181385.0A)
- Authority
- CN
- China
- Prior art keywords
- font
- character
- garbled characters
- correction
- pdf document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for detecting and correcting garbled characters in PDF documents, comprising: extracting all character features in the PDF document; classifying fonts into normal fonts, garbled fonts, and undetermined fonts according to the character features; extracting the bitmap (dot-matrix) image of each character in an undetermined font, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between the bitmap image and the corresponding encoding, and judging from the similarity whether each character in the undetermined font is normal or garbled; applying vertical and horizontal proofreading correction to the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts; and modifying the PDF document with the correction results to remove the garbled characters. By combining character features with the image features of characters, the invention detects garbled characters automatically, and combining vertical and horizontal proofreading reduces the manual time spent on correction. The method effectively removes garbled characters, eliminates their interference with subsequent fragmentation processing, improves processing efficiency and quality, and reduces processing cost.
Description
Technical field
The present invention relates to methods for detecting and correcting garbled characters during the fragmentation processing of PDF documents, and in particular to a method for detecting and correcting garbled characters in Chinese and English PDF documents.
Background art
PDF (Portable Document Format) is an electronic file format that is independent of the operating system platform. It has become a widely used and well-suited document format for electronic publishing and digital information distribution.
During the fragmentation processing (metadata indexing) of a PDF document, text is extracted from the document. Text extraction here means copying characters from the document and pasting them to a specified location. Normally the document displays correctly and the extracted text matches what is displayed. When the two disagree, that is, the display is correct but the extracted text is wrong, the phenomenon is called garbled text in a PDF document. When the extracted text contains many garbled characters, the indexer must key in the indexed content word by word and sentence by sentence. When only a few scattered garbled characters are mixed in and hard to spot, the indexer must spend a great deal of time checking the extraction results to guarantee indexing quality. Garbled text therefore seriously reduces the efficiency and quality of metadata indexing.
Garbled text also seriously affects the accuracy of data content in the secondary processing of electronic documents. With the continuing development of computer and network technology, digital distribution has become the mainstream way of disseminating information. Digital distribution must support conversion between electronic documents of different formats and types, for example between PDF and WORD or EPUB. The following can occur when a PDF document is converted: although the page text displays correctly, the document produced by converting the PDF to another format contains garbled characters. Manual inspection of the converted document can find and correct them, but it is time-consuming and laborious, and when only a few garbled characters are mixed into a document they are easy for the human eye to miss, which harms the accuracy of the data content and lowers processing quality.
If garbled-character detection and correction are performed on a PDF document before fragmentation processing, garbled characters are found and corrected at the source, and their adverse effect on subsequent processing is avoided. Detecting and correcting garbled characters in PDF documents is therefore essential. At present, few mature published methods address the problem. A related technique combines OCR (Optical Character Recognition) with PDF text extraction to improve extraction accuracy. OCR converts the image of a character into its internal character code using character recognition, and comprises image preprocessing, layout analysis, character segmentation, and single-character recognition. PDF text extraction mainly uses the single-character recognition part of OCR. In garbled-character detection, applying OCR single-character recognition indiscriminately to every character in a document is very costly. For example, for a PDF document in which most characters are normal and only a few are garbled, running OCR single-character recognition on every character inevitably spends a large amount of time recognizing normal characters.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a method for detecting and correcting garbled characters in PDF documents. The method combines character features with the image statistical features of characters to detect garbled characters automatically, eliminates the interference of garbled characters with PDF fragmentation processing, improves processing quality, and reduces processing cost.
The object of the present invention is achieved by the following technical solution:
A method for detecting and correcting garbled characters in PDF documents, comprising:
Extracting all character features in the PDF document;
Classifying fonts into normal fonts, garbled fonts, and undetermined fonts according to the character features;
Extracting the bitmap (dot-matrix) image of each character in an undetermined font, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between the bitmap image and the corresponding encoding, and judging from the similarity whether each character in the undetermined font is normal or garbled;
Applying vertical and horizontal proofreading correction to the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document with the correction results to remove the garbled characters.
Compared with the prior art, one or more embodiments of the present invention can have the following advantages:
Detection draws on two complementary perspectives, the character features of the PDF document and the image features of the characters, which further improves detection efficiency.
Because detection is performed font by font, a character that recurs within the same font needs to be checked only once, which avoids the inefficient repeated checking involved in extracting text page by page, sentence by sentence, and character by character.
In garbled-character detection, the algorithm based on image statistical features has an advantage over OCR single-character recognition: it uses the character code as a guide and combines it with image features. It looks up, in the feature library, the statistical features of the bitmap image that corresponds to the current character's code, then judges whether the current character is garbled from the similarity between the character's bitmap image and those statistical features. OCR, by contrast, recognizes the character directly from the bitmap image and then compares the recognition result with the character code. OCR single-character recognition generally works in two stages: coarse recognition narrows the candidate range, and fine recognition determines the final result. In garbled-character detection the character code already fixes the range, so no coarse stage is needed. The detection algorithm based on image statistical features is therefore simpler, faster, and better suited to garbled-character detection than OCR single-character recognition.
Combining vertical and horizontal proofreading reduces the time spent on manual proofreading and improves correction efficiency.
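The font-level advantage listed above (a character that recurs within one font is checked once) can be illustrated with a minimal sketch. The input shape `(font_name, [char_code, ...])` and the function name `unique_font_chars` are assumptions made for illustration; the patent does not specify a data layout.

```python
def unique_font_chars(text_runs):
    """Collapse repeated (font, code) occurrences so each pair is checked once.

    text_runs: iterable of (font_name, [char_code, ...]) in reading order.
    Returns {font_name: [unique char codes in first-seen order]}.
    """
    seen = {}
    for font_name, codes in text_runs:
        bucket = seen.setdefault(font_name, {})
        for code in codes:
            # A dict preserves insertion order and ignores repeats.
            bucket.setdefault(code, None)
    return {font: list(codes) for font, codes in seen.items()}


# Hypothetical text runs: font F1 uses code 0x41 twice and 0x42 twice.
runs = [("F1", [0x41, 0x42, 0x41]), ("F2", [0x41]), ("F1", [0x42, 0x43])]
unique = unique_font_chars(runs)
```

Here six character occurrences reduce to four (font, code) pairs, and only those four would go on to the detection step.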
Brief description of the drawings
Fig. 1 is a flowchart of the method for detecting and correcting garbled characters in PDF documents.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawing.
Fig. 1 shows the flow of the method for detecting and correcting garbled characters in PDF documents. The method comprises:
Extracting all character features in the PDF document;
Classifying fonts into normal fonts, garbled fonts, and undetermined fonts according to the character features;
Extracting the bitmap (dot-matrix) image of each character in an undetermined font, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between the bitmap image and the corresponding encoding, and judging from the similarity whether each character in the undetermined font is normal or garbled;
Applying vertical and horizontal proofreading correction to the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document with the correction results to remove the garbled characters.
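The detection part of the flow above (font classification followed by a per-character check for undetermined fonts) can be sketched as follows. This is only an outline under stated assumptions: `detect_garbled`, `classify`, and `is_similar` are hypothetical names, the two callables stand in for steps the patent describes only in prose, and the toy "bitmaps" are plain strings.

```python
def detect_garbled(fonts, classify, is_similar):
    """Return the sorted (font_name, char_code) pairs judged garbled.

    fonts:      {font_name: {char_code: glyph_bitmap}}
    classify:   font_name -> "normal" | "garbled" | "undetermined"
    is_similar: (char_code, glyph_bitmap) -> bool
    """
    garbled = []
    for font_name, glyphs in fonts.items():
        kind = classify(font_name)
        if kind == "normal":
            continue  # every character of a normal font is left untouched
        for code, bitmap in glyphs.items():
            # Garbled font: every character is flagged.
            # Undetermined font: only dissimilar characters are flagged.
            if kind == "garbled" or not is_similar(code, bitmap):
                garbled.append((font_name, code))
    return sorted(garbled)


# Toy data: the "bitmap" is just the displayed letter.
fonts = {
    "A": {65: "A"},
    "B": {1: "x", 2: "y"},
    "C": {66: "B", 67: "Q"},
}
classes = {"A": "normal", "B": "garbled", "C": "undetermined"}
result = detect_garbled(fonts, classes.get, lambda code, bmp: chr(code) == bmp)
```

The returned list is what would be exported for proofreading correction; the correction and write-back steps are not part of this sketch.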
The character features above include: font type, font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, whether the font is embedded, and so on. There are two main font types: composite fonts (Composite Font) and simple fonts (Simple Font). They differ in that a composite font describes characters with multi-byte codes, whereas a simple font generally describes characters with single-byte codes. Font encoding schemes divide into standard encodings and custom encodings. A standard encoding is a published, conventionally accepted encoding, such as EUC (Extended Unix Code), UCS2 (Universal Multiple-Octet Coded Character Set 2), or ANSI (American National Standards Institute); a custom encoding is an unpublished, private encoding. The mapping between the current encoding and a standard encoding means that, when a font uses a custom encoding, this mapping allows the private custom codes to be converted to a published standard encoding. An embedded font is one whose resources, such as the glyphs (Glyph) of all characters that the document uses from that font, are stored in the PDF document according to certain rules. A non-embedded font, by contrast, does not embed its font resources in the document; the resources it uses, such as system font resources, reside outside the document.
The root cause of garbled text in a PDF document is that an embedded font in the document uses a custom encoding but either lacks a mapping to a standard encoding or carries an incorrect mapping.
For a normal font, all characters under the font are treated as normal characters and no processing is performed;
For a garbled font, all characters under the font are judged to be garbled;
For an undetermined font, the bitmap images and the corresponding encodings (Unicode encodings) of all characters that the document uses from the font are extracted, the garbled-character detection algorithm based on image statistical features calculates the similarity between the two, and dissimilar characters are judged to be garbled.
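The text names the character features and the three font classes but does not spell out the rule that maps one to the other. The sketch below shows one plausible rule derived from the stated root cause (embedded font, custom encoding, missing or wrong mapping). The rule itself, the `FontFeatures` record, and `classify_font` are assumptions for illustration, not the patented criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FontFeatures:
    font_type: str     # "composite" or "simple"
    encoding: str      # "standard" or "custom"
    has_mapping: bool  # mapping from current to standard encoding present
    embedded: bool     # font resources stored inside the PDF


def classify_font(f: FontFeatures) -> str:
    """Assumed rule: only embedded fonts with a custom encoding are suspect."""
    if f.encoding == "standard" or not f.embedded:
        return "normal"
    if not f.has_mapping:
        # Custom encoding with no mapping: no code can be trusted.
        return "garbled"
    # A mapping exists but may be wrong, so each character must be checked.
    return "undetermined"


system_font = FontFeatures("simple", "standard", False, False)
unmapped = FontFeatures("composite", "custom", False, True)
mapped = FontFeatures("composite", "custom", True, True)
labels = [classify_font(f) for f in (system_font, unmapped, mapped)]
```

Under this assumed rule, `font_type` does not affect the class; it is carried only because the text lists it among the extracted features.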
The garbled-character detection algorithm based on image statistical features follows the approach of statistical pattern recognition. For each character (mainly Chinese and English characters), image samples in different fonts and sizes are collected, image features are extracted, the distribution of those features in feature space is determined, and the distribution is described with a statistical model. Each character needs a sufficient number of image samples for the statistics to be meaningful; in this embodiment the lower limit is 100 image samples per character. The image features, statistical models, and related information for each character are stored according to certain rules as a feature library for later use.
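A feature library of the kind described can be sketched as below. The patent does not name the image feature or the statistical model, so both are assumptions here: the feature is the ink density of each zone of a square bitmap, and the model is a per-dimension mean and variance (a diagonal Gaussian). Only the lower limit of 100 samples comes from the text.

```python
from statistics import fmean, pvariance

MIN_SAMPLES = 100  # lower limit per character stated in the embodiment


def zone_density(bitmap, zones=2):
    """Ink ratio of each zone in a square 0/1 bitmap (assumed feature)."""
    step = len(bitmap) // zones
    feats = []
    for zr in range(zones):
        for zc in range(zones):
            cells = [bitmap[r][c]
                     for r in range(zr * step, (zr + 1) * step)
                     for c in range(zc * step, (zc + 1) * step)]
            feats.append(sum(cells) / len(cells))
    return feats


def build_library(samples, min_samples=MIN_SAMPLES):
    """samples: {char: [bitmap, ...]} -> {char: (means, variances)}."""
    library = {}
    for char, bitmaps in samples.items():
        if len(bitmaps) < min_samples:
            continue  # too few samples to be statistically meaningful
        columns = list(zip(*(zone_density(b) for b in bitmaps)))
        means = [fmean(col) for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [pvariance(col) + 1e-6 for col in columns]
        library[char] = (means, variances)
    return library


corner = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
library = build_library({"X": [corner] * 100, "Y": [corner] * 5})
```

Character "Y" is skipped because it has only 5 samples. A realistic library would use bitmaps rendered from many fonts and sizes rather than one repeated toy bitmap.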
To calculate similarity, the bitmap image features of the character under test are extracted first; the statistical information of the character corresponding to its code is then looked up in the feature library; finally, the probability that the image features under test occur under the corresponding statistical model is estimated. A higher probability indicates greater similarity, and a lower probability indicates less. A threshold can be preset, and a character whose probability falls below the threshold is judged to be garbled.
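The similarity check can be sketched as a likelihood test. The statistical model is assumed here to be a diagonal Gaussian, the score is its log-density (a stand-in for the "probability" in the text), and the threshold value is arbitrary; `log_likelihood` and `is_garbled` are hypothetical names.

```python
import math


def log_likelihood(features, means, variances):
    """Log-density of the features under a diagonal Gaussian model."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(features, means, variances))


def is_garbled(code, features, library, threshold):
    """Judge a character garbled when its score falls below the threshold."""
    stats = library.get(code)
    if stats is None:
        # Assumption: a code with no statistics cannot be confirmed,
        # so it is flagged for review.
        return True
    means, variances = stats
    return log_likelihood(features, means, variances) < threshold


# Hypothetical library entry for the code "A" with two feature dimensions.
library = {"A": ([0.5, 0.2], [0.01, 0.01])}
THRESHOLD = -10.0  # illustrative value only

close = is_garbled("A", [0.5, 0.2], library, THRESHOLD)
far = is_garbled("A", [0.0, 0.9], library, THRESHOLD)
```

Because the code already selects a single library entry, the test is one model evaluation per character rather than a search over all candidate characters, which is the saving the text claims over OCR single-character recognition.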
The proofreading correction above exports all garbled characters to a proofreading tool for correction. Vertical proofreading correction pools identical characters from different fonts and corrects them together in a batch; horizontal proofreading correction pools the different characters of a single font and corrects them manually by comparing each glyph image against its text.
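The two pooling schemes can be sketched as simple groupings. How "identical characters" are recognized across fonts is not stated in the text; the sketch assumes each garbled item carries a hypothetical `shape_key` that identifies the displayed glyph (for example, a hash of its normalized bitmap).

```python
from collections import defaultdict


def vertical_groups(items):
    """Pool the same displayed character across fonts for batch correction."""
    groups = defaultdict(list)
    for font, code, shape_key in items:
        groups[shape_key].append((font, code))
    return dict(groups)


def horizontal_groups(items):
    """Pool the different characters of one font for image-versus-text review."""
    groups = defaultdict(list)
    for font, code, shape_key in items:
        groups[font].append((code, shape_key))
    return dict(groups)


# Hypothetical garbled items: (font, current code, shape key).
garbled = [
    ("F1", 0x21, "glyph-zhong"),
    ("F2", 0x35, "glyph-zhong"),
    ("F1", 0x22, "glyph-wen"),
]
by_shape = vertical_groups(garbled)
by_font = horizontal_groups(garbled)
```

With vertical grouping, one decision about "glyph-zhong" fixes the entries in both F1 and F2; horizontal grouping presents everything a reviewer needs for font F1 in one place.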
In accordance with the structure of PDF files, and working font by font, the correction results are used to create or update the mapping table between the current encoding and the standard encoding for all characters under each embedded font, which removes the garbled characters.
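The write-back step can be sketched with in-memory tables. In a real PDF the per-font table would most plausibly be the font's ToUnicode CMap, but that reading is an assumption, and the sketch does not touch any PDF library; `update_mappings` and `extract_text` are hypothetical helpers.

```python
def update_mappings(existing, corrections):
    """Create or update per-font code-to-Unicode tables from corrections.

    existing:    {font: {code: unicode_char}}
    corrections: iterable of (font, code, unicode_char)
    Returns a new dict; the input tables are not modified.
    """
    updated = {font: dict(table) for font, table in existing.items()}
    for font, code, char in corrections:
        updated.setdefault(font, {})[code] = char
    return updated


def extract_text(runs, mappings):
    """Map each (font, code) to text; unmapped codes become U+FFFD."""
    return "".join(mappings.get(font, {}).get(code, "\ufffd")
                   for font, codes in runs
                   for code in codes)


existing = {"F1": {1: "P", 2: "X"}}                 # code 2 is mapped wrongly
corrections = [("F1", 2, "D"), ("F1", 3, "F")]      # code 3 had no mapping
runs = [("F1", [1, 2, 3])]

before = extract_text(runs, existing)
after = extract_text(runs, update_mappings(existing, corrections))
```

Because the fix is applied to the font's table rather than to individual text positions, every occurrence of a corrected code in the document is repaired at once.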
Although the embodiments of the present invention are disclosed as above, they are described only to facilitate understanding of the invention and do not limit it. Anyone skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein, but the scope of patent protection of the invention remains as defined by the appended claims.
Claims (2)
1. A method for detecting and correcting garbled characters in PDF documents, characterized in that the method comprises:
Extracting all character features in the PDF document;
Classifying fonts into normal fonts, garbled fonts, and undetermined fonts according to the character features;
Extracting the bitmap (dot-matrix) image of each character in an undetermined font, using a garbled-character detection algorithm based on image statistical features to calculate the similarity between the bitmap image and the corresponding encoding, and judging from the similarity whether each character in the undetermined font is normal or garbled;
Applying vertical and horizontal proofreading correction to the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document with the correction results to remove the garbled characters;
The character features include: font type, font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, and whether the font is embedded;
The garbled-character detection algorithm based on image statistical features includes: an image feature extraction algorithm, a parameter estimation method for the statistical model, and an image feature matching algorithm;
The vertical proofreading correction pools identical characters from different fonts and modifies them together in a batch;
The horizontal proofreading correction pools the different characters of a single font and modifies them by comparing glyph images against text;
Modifying the PDF document with the correction results means, font by font, using the correction results to create or update the mapping table between the current encoding and the standard encoding for all characters under each embedded font of the PDF document, so as to remove the garbled characters.
2. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that a character in the undetermined font is judged to be a normal character if its bitmap image is similar to its corresponding encoding, and is otherwise judged to be a garbled character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510181385.0A CN104732228B (en) | 2015-04-16 | 2015-04-16 | Method for detecting and correcting garbled characters in PDF documents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510181385.0A CN104732228B (en) | 2015-04-16 | 2015-04-16 | Method for detecting and correcting garbled characters in PDF documents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104732228A (en) | 2015-06-24 |
CN104732228B (en) | 2018-03-30 |
Family
ID=53456103
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510181385.0A Active CN104732228B (en) | 2015-04-16 | 2015-04-16 | Method for detecting and correcting garbled characters in PDF documents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104732228B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105488471B (en) * | 2015-11-30 | 2019-03-29 | 北大方正集团有限公司 | Font recognition method and device |
CN105718554A (en) * | 2016-01-19 | 2016-06-29 | 深圳市天朗时代科技有限公司 | Document collaboration conversion method and system |
US10261729B1 (en) | 2018-02-27 | 2019-04-16 | Ricoh Company, Ltd. | Document manipulation mechanism |
CN110728115B (en) * | 2018-07-17 | 2024-01-26 | 珠海金山办公软件有限公司 | Method, device and electronic equipment for identifying garbled document content |
CN110728111B (en) * | 2018-07-17 | 2024-06-25 | 珠海金山办公软件有限公司 | Method, device, terminal equipment and server for repairing garbled document content |
CN108985289A (en) * | 2018-07-18 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Garbled text detection method and device |
CN110765826A (en) * | 2018-07-27 | 2020-02-07 | 珠海金山办公软件有限公司 | Method and device for identifying garbled text in Portable Document Format (PDF) |
CN109684962B (en) * | 2018-12-14 | 2023-04-18 | 苏州梦想人软件科技有限公司 | AR electronic book quality detection method |
CN111695327B (en) * | 2019-02-28 | 2024-01-26 | 珠海金山办公软件有限公司 | Method and device for repairing garbled text, electronic equipment and readable storage medium |
CN111144107B (en) * | 2019-12-25 | 2022-08-09 | 福建天晴在线互动科技有限公司 | Garbled text identification method based on slicing algorithm |
CN111401362A (en) * | 2020-03-06 | 2020-07-10 | 上海眼控科技股份有限公司 | Tampering detection method, device, equipment and storage medium for vehicle VIN code |
CN113627129B (en) * | 2020-05-08 | 2024-06-21 | 珠海金山办公软件有限公司 | Text copying method and device, electronic equipment and readable storage medium |
CN113158745B (en) * | 2021-02-02 | 2024-04-02 | 北京惠朗时代科技有限公司 | Multi-feature operator-based method and system for identifying garbled document images |
CN114529930B (en) * | 2022-01-13 | 2024-03-01 | 上海森亿医疗科技有限公司 | PDF restoration method, storage medium and device based on nonstandard mapping fonts |
CN114519858B (en) * | 2022-02-16 | 2023-09-05 | 北京百度网讯科技有限公司 | Document image recognition method and device, storage medium and electronic equipment |
CN114629707B (en) * | 2022-03-16 | 2024-05-24 | 深信服科技股份有限公司 | Garbled text detection method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101110072A (en) * | 2007-08-21 | 2008-01-23 | 无敌科技(西安)有限公司 | Device and method for automatically identifying text encoding |
CN104346616A (en) * | 2013-08-09 | 2015-02-11 | 北大方正集团有限公司 | Character recognition device and character recognition method |
CN104424165A (en) * | 2013-09-06 | 2015-03-18 | 北大方正集团有限公司 | Garbled text detection method and system for text documents |
CN104424010A (en) * | 2013-09-06 | 2015-03-18 | 北大方正集团有限公司 | Method and system for detecting and repairing garbled text in text documents |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02278104A (en) * | 1989-04-19 | 1990-11-14 | Fuji Electric Co Ltd | Detecting method for angle of inclination of document image |
Also Published As
Publication number | Publication date |
---|---|
CN104732228A (en) | 2015-06-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN104732228B (en) | Method for detecting and correcting garbled characters in PDF documents | |
CN101782896B (en) | PDF character extraction method combined with OCR technology | |
CN108415887B (en) | Method for converting PDF file into OFD file | |
CN105068997B (en) | Method and device for constructing parallel corpora | |
CN104750666B (en) | Method and system for recognizing text character encoding modes | |
CN111428474A (en) | Language model-based error correction method, device, equipment and storage medium | |
US9286527B2 (en) | Segmentation of an input by cut point classification | |
CN106127265B (en) | Error correction method for recognized text in images based on an activation force model | |
CN112396049A (en) | Text error correction method and device, computer equipment and storage medium | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
CN113408535B (en) | OCR error correction method based on Chinese character level features and language model | |
CN112085011A (en) | OCR recognition result error correction method, device and storage medium | |
CN112329482A (en) | Machine translation method, device, electronic equipment and readable storage medium | |
CN104331400B (en) | Mongolian code conversion method and device | |
CN108038093A (en) | PDF text extraction method and device | |
CN112084308A (en) | Method, system and storage medium for text type data recognition | |
CN111931489A (en) | Text error correction method, device and equipment | |
CN109325237B (en) | Complete sentence recognition method and system for machine translation | |
CN110705217A (en) | Wrongly-written character detection method and device, computer storage medium and electronic equipment | |
CN105677718A (en) | Character retrieval method and apparatus | |
CN107491441B (en) | Method for dynamically extracting translation template based on forced decoding | |
CN104933030A (en) | Uyghur spelling check method and device | |
CN102880874B (en) | Character recognition method and character recognizer | |
CN105653516B (en) | Method and apparatus for parallel corpus alignment | |
CN104699662B (en) | Method and apparatus for identifying whole symbol strings | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||