CN105512647A - Method and device for intelligent layout division of scanned file on small-screen equipment - Google Patents

Method and device for intelligent layout division of scanned file on small-screen equipment

Info

Publication number
CN105512647A
Authority
CN
China
Application number
CN201610035391.XA
Other languages
Chinese (zh)
Inventor
张晓博
张斌
Original Assignee
同方知网(北京)技术有限公司
《中国学术期刊(光盘版)》电子杂志社有限公司
同方知网数字出版技术股份有限公司
Application filed by 同方知网(北京)技术有限公司, 《中国学术期刊(光盘版)》电子杂志社有限公司 and 同方知网数字出版技术股份有限公司
Priority to CN201610035391.XA
Publication of CN105512647A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/00442 Document analysis and understanding; Document recognition
    • G06K9/00456 Classification of image contents, e.g. text, photographs, tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/00442 Document analysis and understanding; Document recognition
    • G06K9/00469 Document understanding by extracting the logical structure, e.g. chapters, sections, columns, titles, paragraphs, captions, page number, and identifying its elements, e.g. author, keywords, ZIP code, money amount

Abstract

The invention discloses a method and a device for intelligent layout division of a scanned file on small-screen equipment. By analyzing the layout information of the scanned file and the position of each piece of content on the layout, the method splits and recombines the information in the layout, so that the layout can be reset quickly and efficiently and the reset layout is better suited to reading on a small screen. At the macroscopic level, the whole file layout is reset, which makes the file more suitable for small-screen reading. At the microscopic level, each block of split content is identical to the corresponding content of the original layout, so information loss is avoided. The invention also offers a new approach to reading electronic files on small-screen equipment.

Description

Method and device for intelligent layout division of a scanned file on a small-screen device

Technical field

The present invention relates to pattern recognition and page layout analysis technology in the field of computer information processing, and specifically to a method and a device for intelligent layout division of a scanned file on a small-screen device.

Background art

At present, when a scanned file is read on a small-screen device, the display area is small while the original page layout is comparatively large. For convenient reading, the file is therefore generally converted by OCR technology into a reflowable document (for example EPub) before it is read. However, information is inevitably lost when a fixed-layout document is converted into a reflowable document, and the loss is more severe for documents whose typesetting is complex and rigorous.

If the fixed-layout document is instead simply enlarged for reading, the reader has to drag the page up, down, left and right continually. The reading experience is very poor, and it is inconvenient for the user to find and read the required information.

Summary of the invention

The present invention has been made to solve the above problems, and its object is to provide a method and a device for reading a scanned file on a small-screen device.

The invention provides a method for intelligent layout division of a scanned file on a small-screen device, characterized by comprising the following steps: (i) using OCR technology to extract the information of the substantive content of the file from the page layout of the scanned file; (ii) identifying, from that information, the type-area information (the main body of the page), header and footer information, page number information and separator information of the file, where the type-area information comprises text information, image information, table information and formula information; (iii) filtering out the header and footer information, page number information and separator information, and retaining the type-area information; (iv) sorting the type-area information into reading order with a reading-order sorting algorithm; (v) taking the type-area information as parent information and cutting the parent information into child information according to set rules; (vi) sorting the child information a second time and outputting it.
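Steps (ii) and (iii) can be illustrated with a minimal Python sketch. The `Block` structure, the category labels and the function name `keep_body_blocks` are assumptions introduced here for illustration only; the source does not specify any data structure, and the OCR and classification of step (i) and step (ii) are taken as already done.

```python
from dataclasses import dataclass

# Labels for the four categories that make up the main body of the page in
# step (ii). The exact strings are illustrative assumptions.
BODY_KINDS = frozenset({"text", "image", "table", "formula"})


@dataclass(frozen=True)
class Block:
    """One recognized region of a scanned page (hypothetical structure)."""
    kind: str          # e.g. "text", "header_footer", "page_number", "separator"
    content: str = ""  # recognized text, or a placeholder for non-text regions


def keep_body_blocks(blocks):
    """Step (iii): keep only main-body blocks.

    Anything that is not text, image, table or formula (headers and footers,
    page numbers, separators) is filtered out.
    """
    return [b for b in blocks if b.kind in BODY_KINDS]


page = [
    Block("header_footer", "Journal of Examples"),
    Block("text", "First paragraph."),
    Block("separator"),
    Block("formula", "a^2 + b^2 = c^2"),
    Block("page_number", "12"),
]
body = keep_body_blocks(page)
```

In this sketch the relative order of the surviving blocks is preserved, so the reading-order sort of step (iv) can be applied to the filtered list afterwards.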

The method provided by the invention may further have the following feature: in step (v), the rules for cutting the parent information are: (a) the formula information, image information and table information are identified and left uncut; (b) the text information is cut according to its paragraphs, yielding paragraph child information of the text information.
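Rules (a) and (b) of step (v) might be sketched as follows. The function name `cut_parent` and the convention that paragraphs in recognized text are separated by a blank line are assumptions made for this example; the source does not say how paragraph boundaries are detected.

```python
def cut_parent(kind, content):
    """First cut of step (v) for one parent block.

    Rule (a): formulas, images and tables are returned whole.
    Rule (b): text is cut into one piece per paragraph.
    Assumption: paragraphs are separated by a blank line in the OCR text.
    """
    if kind != "text":
        return [content]
    return [p.strip() for p in content.split("\n\n") if p.strip()]


formula_pieces = cut_parent("formula", "E = mc^2")
text_pieces = cut_parent("text", "First paragraph.\n\nSecond paragraph.")
```

Returning a list in both cases keeps the caller simple: every parent yields one or more child pieces, and non-text parents always yield exactly one.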

The method provided by the invention may further have the following feature: the paragraph child information of the text information may undergo a secondary cut, in which any line of text in the paragraph child information whose width exceeds a set threshold is split in order and wrapped onto new lines, so that the width of each line after splitting is less than the set threshold.
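The secondary cut can be sketched as below, under the assumption that line width is measured as a character count and that a line may be exactly as wide as the threshold. The names `wrap_line` and `secondary_cut` are hypothetical, and a real implementation might measure rendered pixel width instead.

```python
def wrap_line(line, max_chars):
    """Split one line, in order, into pieces of at most max_chars characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(line) <= max_chars:
        return [line]  # already narrow enough, keep as is
    return [line[i:i + max_chars] for i in range(0, len(line), max_chars)]


def secondary_cut(paragraph_lines, max_chars):
    """Apply wrap_line to every line of a paragraph, preserving line order."""
    wrapped = []
    for line in paragraph_lines:
        wrapped.extend(wrap_line(line, max_chars))
    return wrapped


result = secondary_cut(["abcdefg", "hi"], 3)
```

Only the over-wide lines are split; lines already within the threshold pass through unchanged, so the original line breaks of the scanned page are kept wherever possible.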

The method provided by the invention may further have the following feature: the set threshold is set manually or set automatically by the system.
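One possible way to support both manual and automatic thresholds is sketched below. The source does not describe how the system would choose a value automatically; deriving it from the screen width and an average character width is purely an assumption of this example, as are the function and parameter names.

```python
def resolve_threshold(manual_chars=None, screen_width_px=None, char_width_px=None):
    """Return the maximum number of characters per line.

    A manually supplied value takes precedence. Otherwise the value is derived
    from the screen width and an average character width (an assumed rule,
    not one stated in the source).
    """
    if manual_chars is not None:
        if manual_chars < 1:
            raise ValueError("manual_chars must be at least 1")
        return manual_chars
    if screen_width_px is None or char_width_px is None or char_width_px <= 0:
        raise ValueError("screen and character widths are required")
    # At least one character per line, even on a very narrow screen.
    return max(1, int(screen_width_px // char_width_px))


manual = resolve_threshold(manual_chars=19)
automatic = resolve_threshold(screen_width_px=360, char_width_px=18)
```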

The invention also provides a device for using a scanned file on a small-screen device, characterized by comprising: a scanned-file recognition module for reading the information carried by the scanned file; an intelligent layout division module for performing intelligent layout division on the scanned file; and a reading terminal for presenting the information after the intelligent layout division module has re-divided the layout. The intelligent layout division module performs intelligent layout division on the scanned file using any one of the methods above.
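The three-module structure of the device can be pictured with the following sketch. All class and method names are hypothetical, the OCR stage is stubbed out (the "scanned file" is assumed to be an already recognized list of `(kind, text)` pairs), and the division module only filters and cuts by paragraph.

```python
class RecognitionModule:
    """Reads the information carried by a scanned file (OCR is stubbed out)."""

    def read(self, scanned_file):
        # Assumption: the input is already a list of (kind, text) pairs.
        return list(scanned_file)


class LayoutDivisionModule:
    """Performs a simplified layout division: filter, then cut text by paragraph."""

    NON_BODY = ("header_footer", "page_number", "separator")

    def divide(self, blocks):
        pieces = []
        for kind, text in blocks:
            if kind in self.NON_BODY:
                continue  # filtered out
            if kind == "text":
                pieces.extend(p for p in text.split("\n\n") if p)
            else:
                pieces.append(text)  # formulas, images and tables stay whole
        return pieces


class ReadingTerminal:
    """Presents the re-divided information, here as plain text lines."""

    def present(self, pieces):
        return "\n".join(pieces)


class Device:
    """Wires the three modules together in the order described above."""

    def __init__(self):
        self.recognizer = RecognitionModule()
        self.divider = LayoutDivisionModule()
        self.terminal = ReadingTerminal()

    def show(self, scanned_file):
        blocks = self.recognizer.read(scanned_file)
        return self.terminal.present(self.divider.divide(blocks))


shown = Device().show([("page_number", "3"), ("text", "A\n\nB"), ("image", "<fig1>")])
```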

Effects of the invention

In the method for intelligent layout division of a scanned file on a small-screen device according to the invention, the position and typesetting information of the content on the page layout of the scanned file is analyzed, and the elements in the layout are re-cut and recombined, so that the layout is reset quickly and efficiently. The method therefore ensures that reading a scanned file on a small screen loses no main information while the reading experience is improved.

Brief description of the drawings

Fig. 1 is a schematic flow chart of an embodiment of the invention;

Fig. 2 is a diagram of a scanned file in an embodiment of the invention;

Fig. 3 is a schematic diagram of extracting the type-area information in an embodiment of the invention;

Fig. 4 is a schematic diagram of the first cut of the type-area information in an embodiment of the invention;

Fig. 5 is a schematic diagram of the second cut of the type-area information in an embodiment of the invention; and

Fig. 6(a) and Fig. 6(b) are schematic diagrams of the finally output file in an embodiment of the invention.

Detailed description of the embodiments

To make the technical means, inventive features, objects and effects of the invention easy to understand, the following embodiment describes the method and device for intelligent layout division of a scanned file on a small-screen device in detail with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of the present embodiment.

As shown in Fig. 1, the present embodiment comprises the following steps:

(i) using OCR technology to extract the information of the substantive content of the file from the page layout of the scanned file;

(ii) identifying, from that information, the type-area information, header and footer information, page number information and separator information of the file, where the type-area information comprises text information, image information, table information and formula information;

(iii) filtering out the header and footer information, page number information and separator information, and retaining the type-area information;

(iv) sorting the type-area information into reading order with a reading-order sorting algorithm;

(v) taking the type-area information as parent information and cutting the parent information into child information according to set rules:

(a) the formula information, image information and table information are identified and left uncut;

(b) the text information is cut according to its paragraphs, yielding paragraph child information of the text information; the paragraph child information then undergoes a secondary cut, in which any line of text whose width exceeds a set threshold is split in order and wrapped onto new lines, so that the width of each line after splitting is less than the set threshold;

(vi) sorting the child information a second time and outputting it.
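The source names a reading-order sorting algorithm in step (iv) without describing it. As a stand-in only, the sketch below uses a naive column-based ordering: each block is assigned to a column by the horizontal centre of its bounding box, and blocks are read column by column, top to bottom. The dictionary layout, the `bbox` convention `(x, y, width, height)` and the fixed column count are all assumptions of this example.

```python
def reading_order(blocks, page_width, columns=2):
    """Sort blocks left column first, then top to bottom within each column.

    Naive stand-in for the unspecified reading-order algorithm. Each block is
    a dict with a "bbox" of (x, y, width, height) in page coordinates, with
    the origin at the top-left corner of the page.
    """
    if columns < 1 or page_width <= 0:
        raise ValueError("columns must be >= 1 and page_width must be positive")
    column_width = page_width / columns

    def sort_key(block):
        x, y, width, _height = block["bbox"]
        centre = x + width / 2
        # Clamp so that blocks touching the page edges still get a valid column.
        column = min(max(int(centre // column_width), 0), columns - 1)
        return (column, y, x)

    return sorted(blocks, key=sort_key)


blocks = [
    {"name": "B", "bbox": (50, 0, 40, 10)},   # top of right column
    {"name": "C", "bbox": (0, 20, 40, 10)},   # lower in left column
    {"name": "A", "bbox": (0, 0, 40, 10)},    # top of left column
]
ordered_names = [b["name"] for b in reading_order(blocks, page_width=100)]
```

This simple rule mishandles elements that span several columns, such as a full-width title or table, which is one reason a dedicated reading-order algorithm would be needed in practice.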

Fig. 2 is a diagram of a scanned file in the present embodiment; Fig. 3 is a schematic diagram of extracting the type-area information in the present embodiment; Fig. 4 is a schematic diagram of the first cut of the type-area information in the present embodiment; Fig. 5 is a schematic diagram of the second cut of the type-area information in the present embodiment; and Fig. 6(a) and Fig. 6(b) are schematic diagrams of the finally output file in the present embodiment.

As shown in Figs. 2 to 6, the information of the substantive content of the scanned file is read;

OCR technology is used to identify, from that information, the type-area information, header and footer information, page number information and separator information of the file, where the type-area information comprises text information, image information, table information and formula information;

The header and footer information, page number information and separator information are filtered out, and the type-area information is retained;

The type-area information is sorted into reading order with a reading-order sorting algorithm;

With the type-area information as parent information, the parent information is cut into child information according to set rules:

(a) the formula information, image information and table information are identified and left uncut;

(b) the text information is cut according to its paragraphs, yielding paragraph child information of the text information; the paragraph child information then undergoes a secondary cut, in which any line of text whose width exceeds a set threshold is split in order and wrapped onto new lines, so that the width of each line after splitting is less than the set threshold (the set threshold may be set automatically by the system or set manually according to the user's habits; the threshold set in the present embodiment is no more than 19 characters per line);

The child information is integrated, sorted a second time and output.
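The integration and second sort at the end of the embodiment might look like the following sketch, which uses the embodiment's threshold of 19 characters per line. The `Child` tuple, its fields and the sort key are assumptions; the source does not state what the second sort is keyed on, so this example simply restores parent reading order and, within each parent, the original piece order.

```python
from typing import NamedTuple

MAX_CHARS_PER_LINE = 19  # threshold used in this embodiment


class Child(NamedTuple):
    """One piece of child information (hypothetical structure)."""
    parent_order: int  # reading-order rank of the parent block
    piece_order: int   # position of this piece within its parent
    text: str


def split_parent(parent_order, text, max_chars=MAX_CHARS_PER_LINE):
    """Cut one line of parent text into pieces of at most max_chars characters."""
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [Child(parent_order, n, piece) for n, piece in enumerate(pieces)]


def second_sort(children):
    """Order pieces by parent reading order, then by position in the parent."""
    return sorted(children, key=lambda c: (c.parent_order, c.piece_order))


# Pieces arrive out of order here to show the effect of the second sort.
children = split_parent(1, "b" * 20) + split_parent(0, "a" * 5)
output_lines = [c.text for c in second_sort(children)]
```

Carrying the parent rank and piece index on each child means the pieces can be cut independently and still be reassembled in a deterministic order before output.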

In addition, the method of the present embodiment can be used in a device for using a scanned file on a small-screen device. The device comprises: a scanned-file recognition module for reading the information carried by the scanned file; an intelligent layout division module for performing intelligent layout division on the scanned file; and a reading terminal for presenting the information after the intelligent layout division module has re-divided the layout. The intelligent layout division module processes the scanned file using the method of the present embodiment.

Functions and effects of the embodiment

In the method for intelligent layout division of a scanned file on a small-screen device according to the present embodiment, the position and typesetting information of the content on the page layout of the scanned file is analyzed, and the elements in the layout are re-cut and recombined, so that the layout is reset quickly and efficiently. The method of the present embodiment therefore ensures that reading a scanned file on a small screen loses no main information while the reading experience is improved.

The above embodiment is only a basic illustration under the concept of the invention and does not limit the invention. Any equivalent transformation made according to the technical solution of the invention falls within the scope of protection of the invention.

Claims (5)

1. A method for intelligent layout division of a scanned file on a small-screen device, characterized by comprising the following steps:
(i) using OCR technology to extract the information of the substantive content of the file from the page layout of the scanned file;
(ii) identifying, from the information, the type-area information, header and footer information, page number information and separator information of the file, the type-area information comprising text information, image information, table information and formula information;
(iii) filtering out the header and footer information, page number information and separator information, and retaining the type-area information;
(iv) sorting the type-area information into reading order with a reading-order sorting algorithm;
(v) taking the type-area information as parent information and cutting the parent information into child information according to set rules;
(vi) sorting the child information a second time and outputting it.
2. The method for intelligent layout division of a scanned file on a small-screen device according to claim 1, characterized in that:
in step (v), the rules for cutting the parent information are:
(a) the formula information, image information and table information are identified and left uncut;
(b) the text information is cut according to the paragraphs in the text information, yielding paragraph child information of the text information.
3. The method for intelligent layout division of a scanned file on a small-screen device according to claim 2, characterized in that:
the paragraph child information of the text information may further undergo a secondary cut, in which any line of text in the paragraph child information whose width exceeds a set threshold is split in order and wrapped onto new lines, so that the width of each line after splitting is less than the set threshold.
4. The method for intelligent layout division of a scanned file on a small-screen device according to claim 3, characterized in that:
the set threshold is set manually or set automatically by the system.
5. A device for using a scanned file on a small-screen device, characterized by comprising:
a scanned-file recognition module for reading the information carried by the scanned file;
an intelligent layout division module for performing intelligent layout division on the scanned file;
a reading terminal for presenting the information after the intelligent layout division module has re-divided the layout,
wherein the intelligent layout division module performs intelligent layout division on the scanned file using the method of any one of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610035391.XA CN105512647A (en) 2016-01-19 2016-01-19 Method and device for intelligent layout division of scanned file on small-screen equipment

Publications (1)

Publication Number Publication Date
CN105512647A true CN105512647A (en) 2016-04-20

Family

ID=55720614

Country Status (1)

Country Link
CN (1) CN105512647A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334805A (en) * 2017-03-08 2018-07-27 腾讯科技(深圳)有限公司 The method and apparatus for detecting file reading sequences

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10126613A (en) * 1996-10-16 1998-05-15 Ricoh Co Ltd Image processor
CN102479173A (en) * 2010-11-25 2012-05-30 北京大学 Method and device for identifying reading sequence of layout
CN104268127A (en) * 2014-09-22 2015-01-07 同方知网(北京)技术有限公司 Method for analyzing reading order of electronic layout file
CN104834645A (en) * 2014-02-11 2015-08-12 阿里巴巴集团控股有限公司 Method and device for presenting layout document


Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160420
