CN102929843B - Text recognition and correction system and method - Google Patents

Info

Publication number
CN102929843B
Authority
CN
China
Prior art keywords
adapt
page analysis
space
printed page
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210338739.4A
Other languages
Chinese (zh)
Other versions
CN102929843A (en)
Inventor
王艳
瞿洋
梁洵
袁仁慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academic Journal (CD) Electronic Publishing House Co., Ltd.
Original Assignee
China Academic Journal (CD) Electronic Publishing House Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-09-14
Filing date
2012-09-14
Publication date
2015-10-14
Application filed by China Academic Journal (CD) Electronic Publishing House Co., Ltd.
Priority to CN201210338739.4A
Publication of CN102929843A
Application granted
Publication of CN102929843B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a text recognition and correction system and a corresponding method. The system comprises a layout analysis module, a layout processing module, and a recognition-correction and merging module. The layout analysis module processes the non-text content of the page, analyzes each unit block in the document by row and column scanning, and computes the language attribute of each block. The layout processing module assists the layout analysis module by adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis. The recognition-correction and merging module takes the document produced by layout analysis, applies a different recognition and correction process to each language to generate separate corrected texts, and merges those texts into the final corrected text. The invention can substantially improve correction efficiency, reduce cost, and improve quality. By combining interactive layout adjustment with independent per-language recognition and correction systems, the correction task can be completed quickly and with high quality; according to the inventors' tests, working according to the invention can save 71.6% of annual cost.

Description

Text recognition and correction system and method
Technical field
The present invention relates to the digitization of scanned documents, and in particular to a text recognition and correction system based on interactive layout analysis.
Background
The mainstream image text recognition tools on current production lines are Han Wang and FineReader, of which Han Wang is the more widely used. In the production department's long experience, these tools work very well in some applications but also have major shortcomings. The Han Wang recognition software supports Chinese quite well but performs poorly on English, whereas FineReader recognizes English documents very well but supports Chinese poorly. Using a single recognition engine alone increases the number of characters that need correction, which holds back correction efficiency. In addition, because more characters must be corrected, at a constant correction error rate the number of erroneous characters also rises, which lowers the quality of the final product. For documents that mix Chinese and English, therefore, whichever recognition tool is chosen has its own bottleneck, and an improved recognition and correction system is needed.
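The quality argument in the background can be written as a one-line relation. The symbols are illustrative assumptions, not notation from the patent: $N$ is the number of characters an engine leaves for manual correction, $p > 0$ is the per-character error rate of the correction step, and $E$ is the expected number of wrong characters remaining in the final text.

```latex
E = p \cdot N, \qquad
N_{\text{single}} > N_{\text{per-language}}
\;\Longrightarrow\;
E_{\text{single}} = p\,N_{\text{single}} > p\,N_{\text{per-language}} = E_{\text{per-language}}
```

With $p$ held constant, any reduction in the number of characters sent for correction lowers the number of residual errors in the same proportion.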
Summary of the invention
To solve the problems and defects described above, the invention provides a recognition and correction system and method that can substantially improve correction efficiency, reduce cost, and improve quality. The technical solution is as follows.
A text recognition and correction system comprises a layout analysis module, a layout processing module, and a recognition-correction and merging module, wherein:
The layout analysis module processes the non-text content of the page, analyzes each unit block in the document by row and column scanning, and computes the language attribute of each block;
The layout processing module assists the layout analysis module by adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis;
The recognition-correction and merging module takes the document produced by layout analysis, applies a different recognition and correction process to each language to generate separate corrected texts, and merges those texts into the final corrected text.
A text recognition and correction method comprises:
Processing the non-text content of the page;
Analyzing each unit block in the document by row and column scanning, and computing the language attribute of each unit block;
Adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis;
Applying a different recognition and correction process to the document for each language to generate separate corrected texts, and merging those texts into the final corrected text.
The beneficial effects of the technical solution provided by the invention are:
Correction efficiency can be substantially improved, cost reduced, and quality improved;
By combining interactive layout adjustment with independent per-language recognition and correction systems, the correction task can be completed quickly and with high quality; according to the inventors' tests, working according to the invention can save 71.6% of annual cost.
Brief description of the drawings
Fig. 1 is a structural diagram of the text recognition and correction system;
Fig. 2 is a flowchart of the text recognition and correction method.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings:
As shown in Fig. 1, the text recognition and correction system comprises a layout analysis module, a layout processing module, and a recognition-correction and merging module, wherein:
The layout analysis module processes the non-text content of the page, analyzes each unit block in the document by row and column scanning, and computes the language attribute of each block;
The layout processing module assists the layout analysis module by adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis;
The recognition-correction and merging module takes the document produced by layout analysis, applies a different recognition and correction process to each language to generate separate corrected texts, and merges those texts into the final corrected text.
Processing of the non-text content of the page covers black borders, noise specks, non-text content in images, and the like.
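As a rough illustration of this clean-up step, the sketch below trims scanner black borders and clears isolated specks from a binarized page. The patent does not specify how this is done; the image representation (a list of rows with 1 for a dark pixel), the function names, and the `dark_ratio` threshold are all assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # assumed format: 1 = dark (ink) pixel, 0 = background


def strip_black_border(img: Image, dark_ratio: float = 0.9) -> Image:
    """Trim edge rows and columns that are almost entirely dark."""
    def is_dark(line: List[int]) -> bool:
        return len(line) > 0 and sum(line) / len(line) >= dark_ratio

    rows = [list(r) for r in img]
    while rows and is_dark(rows[0]):          # top edge
        rows.pop(0)
    while rows and is_dark(rows[-1]):         # bottom edge
        rows.pop()
    if not rows:
        return []
    while rows[0] and is_dark([r[0] for r in rows]):    # left edge
        rows = [r[1:] for r in rows]
    while rows[0] and is_dark([r[-1] for r in rows]):   # right edge
        rows = [r[:-1] for r in rows]
    return rows


def remove_specks(img: Image) -> Image:
    """Clear dark pixels that have no dark pixel among their 8 neighbours."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [list(r) for r in img]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out
```

A production system would more likely use connected-component analysis with a size threshold; the single-pixel rule here only keeps the example short.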
Once the non-text content of the page has been processed, the following algorithm is used to make the layout analysis as accurate as possible:
1) Row scan: scan the image row by row, count the pixels in each row, and use the statistics of these counts to obtain the upper and lower boundaries of each text line.
2) Column scan: scan each text line column by column, count the pixels in each column, and use the statistics of these counts to obtain the left and right boundaries of each line, thereby obtaining each unit block.
3) Unit-block language identification: apply simple recognition to each section of the layout and analyze the features that distinguish Chinese from English, such as the aspect ratio of Chinese characters versus English words.
4) Post-processing: apply type-specific handling to different kinds of documents.
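Steps 1 to 3 can be sketched with projection profiles. This is a minimal sketch under stated assumptions, not the patented implementation: the page is a binarized list of rows (1 = ink), `min_gap` (the column gap that separates two unit blocks), the `square_ratio` threshold, and the `"zh"`/`"en"` labels are hypothetical choices for illustration.

```python
def runs(profile, min_count=1):
    """Return half-open [start, end) intervals where profile >= min_count."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i
        elif value < min_count and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def merge_close(spans, min_gap):
    """Join neighbouring intervals separated by fewer than min_gap positions."""
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def segment_blocks(img, min_gap=3):
    """Row scan, then column scan; returns (top, bottom, left, right) blocks."""
    blocks = []
    row_profile = [sum(row) for row in img]              # step 1: ink per row
    for top, bottom in runs(row_profile):
        band = img[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]   # step 2: ink per column
        for left, right in merge_close(runs(col_profile), min_gap):
            blocks.append((top, bottom, left, right))
    return blocks


def guess_language(img, block, square_ratio=0.7):
    """Step 3 heuristic: near-square glyphs suggest Chinese, narrow ones English."""
    top, bottom, left, right = block
    band = [row[left:right] for row in img[top:bottom]]
    col_profile = [sum(col) for col in zip(*band)]
    height = bottom - top
    ratios = sorted((end - start) / height for start, end in runs(col_profile))
    if not ratios:
        return "unknown"
    median = ratios[len(ratios) // 2]   # upper median of glyph width / line height
    return "zh" if median >= square_ratio else "en"
```

The aspect-ratio test is deliberately crude: Chinese characters with detached left and right parts split into several narrow runs, which is one reason the interactive adjustment step exists.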
Interactive layout analysis
After automatic layout analysis, the result is generally acceptable for most well-typeset documents. For documents with messy or complex formatting, however, some interactive layout analysis is needed as a supplement: the unit blocks of the page, the language of each block, and other attributes are adjusted to ensure that the final layout analysis is correct.
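One way to picture the interactive step is as a small set of operator edits applied to the automatically detected blocks. The sketch below is an assumption about how such edits might be modelled; the `UnitBlock` record, its fields, and both helper functions are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class UnitBlock:
    top: int
    bottom: int
    left: int
    right: int
    language: str           # e.g. "zh" or "en" (assumed labels)
    reviewed: bool = False  # True once an operator has touched the block


def set_language(block: UnitBlock, language: str) -> UnitBlock:
    """Operator overrides the automatically computed language attribute."""
    return replace(block, language=language, reviewed=True)


def merge_blocks(a: UnitBlock, b: UnitBlock,
                 language: Optional[str] = None) -> UnitBlock:
    """Operator joins two blocks that were wrongly split; bounds are the union."""
    return UnitBlock(
        top=min(a.top, b.top),
        bottom=max(a.bottom, b.bottom),
        left=min(a.left, b.left),
        right=max(a.right, b.right),
        language=language or a.language,
        reviewed=True,
    )
```

Keeping the blocks immutable means every manual adjustment produces a new record, so the automatic result stays available for comparison.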
Recognition and correction by language
After interactive layout analysis, the document is split by language and each part is submitted to its own recognition and correction system. The Chinese parts are recognized with both Han Wang and Wen Tong, and passages where the two results disagree are flagged for correction. The English parts are recognized with both FineReader and Wen Tong, and passages where the two results disagree are likewise flagged for correction.
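The two-engine comparison can be sketched as a text alignment in which only the disagreeing spans go to a human. The patent does not say how the two outputs are aligned; the use of Python's standard `difflib.SequenceMatcher`, the function name, and the `"agreed"`/`"review"` labels are assumptions for illustration. The engine names in `ENGINE_PAIRS` are taken from the text, but no engine API is called here.

```python
from difflib import SequenceMatcher

# Engine pairing per language, as described in the text (names only).
ENGINE_PAIRS = {
    "zh": ("Han Wang", "Wen Tong"),
    "en": ("FineReader", "Wen Tong"),
}


def compare_engines(text_a: str, text_b: str):
    """Align two recognition results for the same block.

    Returns a list of (status, part_a, part_b) tuples, where status is
    "agreed" when both engines produced the same span and "review" when
    they differ and the span should be sent for manual correction.
    """
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    segments = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        status = "agreed" if tag == "equal" else "review"
        segments.append((status, text_a[a1:a2], text_b[b1:b2]))
    return segments


def review_load(segments) -> int:
    """Number of characters (from the first engine) that need human review."""
    return sum(len(a) for status, a, _ in segments if status == "review")
```

For example, comparing `"layout analysis"` with `"layout ana1ysis"` leaves a single one-character span for review, which is the effect the description relies on to reduce the number of characters a corrector must check.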
Merging the correction results
The separately corrected texts are merged to produce the final corrected result.
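A minimal sketch of the merge, assuming each corrected block carries the top and left coordinates found during layout analysis and that the page is read top to bottom, left to right in a single column. The tuple layout and the function name are hypothetical; multi-column pages would need a more careful reading-order rule.

```python
def merge_results(blocks):
    """Merge corrected block texts into one document in reading order.

    blocks: iterable of (top, left, text) tuples. Blocks that share the
    same top coordinate are treated as parts of one line.
    """
    ordered = sorted(blocks, key=lambda b: (b[0], b[1]))
    lines = []
    current_top = None
    for top, _left, text in ordered:
        if top == current_top:
            lines[-1] += " " + text     # same line: append to the right
        else:
            lines.append(text)          # new line further down the page
            current_top = top
    return "\n".join(lines)
```

Because the Chinese and English blocks keep their page coordinates, the order in which the per-language systems finish does not affect the merged text.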
As shown in Fig. 2, the text recognition and correction method comprises:
Processing the non-text content of the page;
Analyzing each unit block in the document by row and column scanning, and computing the language attribute of each unit block;
Adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis;
Applying a different recognition and correction process to the document for each language to generate separate corrected texts, and merging those texts into the final corrected text.
The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (2)

1. A text recognition and correction system, characterized in that the system comprises a layout analysis module, a layout processing module, and a recognition-correction and merging module, wherein:
The layout analysis module processes the non-text content of the page, analyzes each unit block in the document by row and column scanning, and computes the language attribute of each unit block, thereby forming the overall layout of the document;
The layout processing module assists the layout analysis module by adjusting the unit blocks obtained from layout analysis and their attributes;
The recognition-correction and merging module takes the document produced by layout analysis, applies a different recognition and correction process to each language to generate separate corrected texts, and merges those texts into the final corrected text;
The row scan counts the effective pixels in each row, and the upper and lower boundaries of each line are obtained from the statistical distribution of these counts;
The column scan scans each line column by column, counts the pixels in each column, and the left and right boundaries of each line are obtained from the statistical features of these counts;
The unit blocks of the document are obtained from the upper, lower, left, and right boundaries of the lines;
The layout analysis module comprises a preprocessing unit and an automatic layout analysis unit; the layout processing module comprises an interactive layout analysis unit; the recognition-correction and merging module comprises a recognition and correction unit and a correction result merging unit.
2. A text recognition and correction method, characterized in that the method comprises:
Processing the non-text content of the page;
Analyzing each unit block in the document by row and column scanning, and computing the language attribute of each unit block;
Adjusting the unit blocks, and the unit-block attributes, that require interactive layout analysis;
Applying a different recognition and correction process to the document for each language to generate separate corrected texts, and merging those texts into the final corrected text;
The row scan counts the effective pixels in each row, and the upper and lower boundaries of each line are obtained from the statistical distribution of these counts;
The column scan scans each line column by column, counts the pixels in each column, and the left and right boundaries of each line are obtained from the statistical features of these counts;
The unit blocks of the document are obtained from the upper, lower, left, and right boundaries of the lines.
CN201210338739.4A 2012-09-14 2012-09-14 Text recognition and correction system and method Active CN102929843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210338739.4A CN102929843B (en) 2012-09-14 2012-09-14 Text recognition and correction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210338739.4A CN102929843B (en) 2012-09-14 2012-09-14 Text recognition and correction system and method

Publications (2)

Publication Number Publication Date
CN102929843A CN102929843A (en) 2013-02-13
CN102929843B true CN102929843B (en) 2015-10-14

Family

ID=47644644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210338739.4A Active CN102929843B (en) 2012-09-14 2012-09-14 Text recognition and correction system and method

Country Status (1)

Country Link
CN (1) CN102929843B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995904B (en) * 2014-06-13 2017-09-12 上海珉智信息科技有限公司 A kind of identifying system of image file electronic bits of data
CN110348000B (en) * 2019-07-16 2023-12-26 仲恺农业工程学院 Typesetting document interaction calculation method, device, equipment and computer readable medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320481C (en) * 2004-11-22 2007-06-06 北京北大方正技术研究院有限公司 Method for conducting title and text logic connection for newspaper pages
JP4835459B2 (en) * 2007-02-16 2011-12-14 富士通株式会社 Table recognition program, table recognition method, and table recognition apparatus
CN102298696B (en) * 2010-06-28 2013-07-24 方正国际软件(北京)有限公司 Character recognition method and system
CN101923643B (en) * 2010-08-11 2012-11-21 中科院成都信息技术有限公司 General form recognizing method
CN101887519B (en) * 2010-08-16 2012-04-18 同方知网(北京)技术有限公司 Character recognition and modification method
CN102054169B (en) * 2010-12-28 2013-01-16 青岛海信网络科技股份有限公司 License plate positioning method
CN102592121B (en) * 2011-12-28 2013-12-04 方正国际软件有限公司 Method and system for judging leakage recognition based on OCR (Optical Character Recognition)

Also Published As

Publication number Publication date
CN102929843A (en) 2013-02-13

Similar Documents

Publication Publication Date Title
US20190294663A1 (en) Method and device for positioning table in pdf document
US10602032B2 (en) Method of correcting image distortion of optical device in display device and display device
US20150262007A1 (en) Detecting and extracting image document components to create flow document
CN108132916A (en) Parse method, the storage medium of PDF list datas
CN102929843B (en) Text recognition and correction system and method
CN106897690A (en) PDF table extracting methods
CN102855232A (en) Table analysis and edit processing method
CN111368511A (en) PDF document analysis method and device
EP2435903A4 (en) System and related method for digital attitude mapping
CN105631447A (en) Method of recognizing characters in round seal
CN109598185B (en) Image recognition translation method, device and equipment and readable storage medium
CN111814598A (en) Financial statement automatic identification method based on deep learning framework
CN104951429A (en) Recognition method and device for page headers and page footers of format electronic document
EP2975574A3 (en) Method, apparatus and terminal for image retargeting
CN111914805A (en) Table structuring method and device, electronic equipment and storage medium
CN105160343A (en) Information identification method and device applied to film on-demand-printing system
US10552535B1 (en) System for detecting and correcting broken words
US9047528B1 (en) Identifying characters in grid-based text
CN106228972B (en) Method and system are read aloud in multi-language text mixing towards intelligent robot system
CN116311317A (en) Paragraph information restoration method after paper document electronization
CN103714047A (en) Lateral proofreading and double-layer PDF file outputting method and device
Saleh et al. Pixel. js: Web-based pixel classification correction platform for ground truth creation
CN106529521A (en) Ancient book character digital recording method
CN114782957A (en) Method, device, electronic equipment and medium for determining text information in stamp image
CN100424683C (en) A typesetting method of character in monotonic range

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: 100084 Haidian District Tsinghua Yuan Tsinghua University Beijing District 1407, 1408, 36, 1409

Applicant after: China Academic Journal (CD) Electronic Publishing House Co., Ltd.

Address before: 100084 Beijing city Haidian District Tsinghua University Tsinghua Yuan 36 zone B1410, Huaye building 1412, room 1414

Applicant before: China Academic Journal (CD) Electronic Publishing House

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: CHINA ACADEMIC JOURNAL (CD) ELECTRONIC PUBLISHING HOUSE TO: CHINA ACADEMIC JOURNAL (CD) ELECTRONIC PUBLISHING HOUSE CO., LTD.

C14 Grant of patent or utility model
GR01 Patent grant