CN103218351A - Modern local literature electronic book manufacture method - Google Patents

Info

Publication number
CN103218351A
Authority
CN
China
Prior art keywords
text
correction
page
check
xml
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100853160A
Other languages
Chinese (zh)
Other versions
CN103218351B (en)
Inventor
周小芳
朱国明
戚凌均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU ZHONGYUAN DIGITAL TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU ZHONGYUAN DIGITAL TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HANGZHOU ZHONGYUAN DIGITAL TECHNOLOGY Co Ltd filed Critical HANGZHOU ZHONGYUAN DIGITAL TECHNOLOGY Co Ltd
Priority to CN201310085316.0A priority Critical patent/CN103218351B/en
Publication of CN103218351A publication Critical patent/CN103218351A/en
Application granted granted Critical
Publication of CN103218351B publication Critical patent/CN103218351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

To convert modern paper local chronicles with complex content into electronic local chronicles that can be searched according to customer requirements, the invention provides a method for producing e-books of modern local chronicles. The method comprises the following steps: image scanning; image processing; PDF (Portable Document Format) conversion; layout analysis, recognition, and proofreading; TXT (text file) typesetting and JPG illustration indexing; catalogue production; XML (Extensible Markup Language) file production and generation; and XML quality inspection. These systematic steps improve the efficiency of converting paper text into electronic text. Under normal conditions, a twenty-person team can complete a local chronicle of one hundred thousand characters within one week, with an error rate of about one in ten thousand. The catalogue of the finished manuscript is built from the XML files, which makes lookup and retrieval convenient.

Description

Method for producing e-books of modern local chronicles
Technical field
The present invention relates to a method for producing e-books, and in particular to a method for producing e-books from books of the modern local chronicle type.
Background art
Knowledge is power, and for thousands of years books have been the carrier through which knowledge is passed on. With the progress of science and technology, however, people have gradually found that preserving paper books is difficult, mainly because paper, owing to its material, is easily damaged. The main causes of paper damage are as follows:
1. Temperature
Temperature is an index of how hot or cold the air is and a measure of an object's heat energy. Heat is a form of energy that promotes the deterioration of organic cultural relics: the higher the temperature, the more often atoms and molecules collide and the faster chemical reactions proceed. Scientific experiments show that in a chemical reaction a temperature rise of 10 ℃ increases the reaction rate by two times, while for paper a rise of 5 ℃ increases the rate of deterioration by two times. Even a short time at high temperature can make paper yellow and brittle. As the following table shows, the higher the temperature, the shorter the half-life of paper; conversely, the lower the temperature, the longer the half-life.
Fluctuating temperature is also harmful to paper preservation. When the temperature rises suddenly, more water vapour evaporates into the air and the air becomes too humid; conversely, when the temperature drops suddenly, the water vapour evaporating into the air decreases sharply and the air becomes too dry. Because of these temperature swings, the fibres in the paper alternately expand and contract, which reduces the tensile strength of the paper fibres.
2. Humidity
Humidity expresses the water vapour content of the air, that is, how dry or wet it is. A humid environment not only makes paper damp and causes hydrolysis, so that writing with poor water resistance fades and blurs; it also favours the growth and reproduction of microorganisms, causing the paper to become mouldy, worm-eaten, and rotten. In addition, other harmful substances (such as the acidic gases CO2, NO2, and SO2 in the atmosphere) are very easily absorbed by the moisture in damp paper, forming strongly corrosive mineral acids, and alum hydrolyses more readily to form sulfuric acid, which accelerates the damage to the paper.
3. Light
The harm that light does to paper is generally considered to be caused by its thermal effect and its photochemical effect.
For this reason, the common way to protect books today is to digitize them by scanning, converting them to electronic form. In practice, however, many unexpected problems arise in this process. For example, the accuracy of scanning and recognition has always been a difficult problem. There are related patents in the prior art: for example, Shenzhen Datum Data Co., Ltd. holds the invention patent "Bilingual sentence alignment method and device", publication number CN101488126, which is used to improve alignment efficiency, and the invention patent "Layout reversion method", publication number CN101308491, is used to improve the correspondence of layout positions. Neither, however, makes notable progress in text proofreading. Moreover, e-books are consulted differently from paper texts and need conveniences such as links, a requirement that prior-art scanned documents cannot satisfy.
There are many kinds of text, but modern local chronicles have a large volume of text, mix data, charts, and text in many forms, and demand very high accuracy for characters and numbers, so they are among the more difficult documents to convert to electronic form. After conversion to electronic text they must also be easy to query and retrieve, so the post-production requirements are high. The advantage is that the paper of modern local chronicles is of good quality and suits modern scanning equipment.
Summary of the invention
To convert modern local chronicles with complex content from paper to electronic form so that they can be searched according to customer requirements, the invention provides a method for producing e-books of modern local chronicles, comprising the following steps:
Step 1. Image scanning: the paper local chronicle is scanned into a computer with a professional scanner, converting the paper document into electronic images;
Step 2. Image processing: the image processing step comprises checking the completeness of the information, ensuring that no text, pictures, notes, or other information of the body is omitted; deskewing the images so that they are upright; and removing stains so that the images are clean and tidy;
Step 3. PDF conversion: the images are encapsulated in PDF image format according to the smallest organizational unit of the catalogue;
Step 4. Layout analysis, recognition, and proofreading: this comprises image layout analysis, text OCR recognition, and text proofreading; the text proofreading comprises horizontal proofreading and vertical proofreading; horizontal proofreading is line-by-line proofreading, and vertical proofreading selects each distinct character in the book one by one, finds every position where it occurs in the text, and checks each occurrence to confirm that the character is correct;
Step 5. TXT typesetting and JPG illustration indexing: the TXT file of the text recognized in step 4 is typeset and the illustrations within the text are indexed, ensuring that each JPG illustration is embedded at the correct position in the text and that the index is accurate;
Step 6. Catalogue production: the catalogue is organized according to the rules and the catalogue index is completed, and a catalogue file is generated from the finished files;
Step 7. XML file production: from the compiled catalogue text and the TXT text of step 5, an XML file describing each local document is generated for import into the database;
Step 8. XML quality inspection: the XML file generated for each document is inspected, including all fields such as title, author, publisher, body text, and PDF path, to ensure that it corresponds fully to the content of the paper original.
Preferably, the deskewing operation in step 2 includes a preliminary recognition of the text; deskewing is performed once the tilt is confirmed to have been caused by scanning, and after deskewing the angle between the text and the horizontal direction does not exceed 3 degrees. This conversion improves the accuracy of the later text OCR recognition and reduces the later proofreading workload.
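The 3-degree limit can be illustrated with a minimal sketch. It assumes that a text baseline has already been located as two points in image coordinates; how the baseline is found is not specified in the text, and the function names here are hypothetical.

```python
import math

MAX_SKEW_DEGREES = 3.0  # limit stated in the text


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle between a baseline through two points and the horizontal."""
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    # Fold into (-90, 90] so the order of the two points does not matter.
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def within_skew_limit(x1, y1, x2, y2, limit=MAX_SKEW_DEGREES):
    """Return True if the baseline tilt is within the allowed limit."""
    return abs(skew_angle_degrees(x1, y1, x2, y2)) <= limit
```

For example, a baseline that rises 20 pixels over 1000 pixels is tilted by roughly 1.1 degrees and passes, while one that rises 100 pixels over 1000 pixels is tilted by roughly 5.7 degrees and would need further deskewing.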
Preferably, after the deskewing is finished, the text portion is projected. The projection covers a certain region, which can be enlarged or reduced proportionally. After the projected region is adjusted to the same size as the original, its four vertices are compared one by one with the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original. In this way the page layout is checked for omissions or missing content, and wrongly copied pages can also be detected.
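The vertex comparison can be sketched as follows. This is only an illustration under stated assumptions: both regions are rectangles given as four (x, y) corners measured from the same page origin, the scaling factor is taken from the ratio of widths, and the tolerance value and function names are hypothetical. For straight edges, coincidence of all four corners implies coincidence of the edges between them.

```python
def scale_region(corners, factor):
    """Scale a list of (x, y) corners proportionally about the page origin."""
    return [(x * factor, y * factor) for x, y in corners]


def region_matches(projected, original, tolerance=2.0):
    """Return True if all four scaled corners coincide with the original's.

    Corners are ordered top-left, top-right, bottom-right, bottom-left.
    """
    proj_width = projected[1][0] - projected[0][0]
    orig_width = original[1][0] - original[0][0]
    if proj_width == 0:
        raise ValueError("projected region has zero width")
    # Bring the projection to the same size as the original.
    scaled = scale_region(projected, orig_width / proj_width)
    return all(
        abs(sx - ox) <= tolerance and abs(sy - oy) <= tolerance
        for (sx, sy), (ox, oy) in zip(scaled, original)
    )
```

A projection that is missing the bottom of the page scales to a shorter rectangle than the original, so its lower corners fail the comparison.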
Preferably, the encapsulation in PDF image format in step 3 is carried out as follows. The illustrations after the front cover are encapsulated as one node named "frontispiece"; where prefaces are inserted between the frontispiece plates, the plates are divided in order into Frontispiece 1, Frontispiece 2, and so on. The illustrations before the back cover are encapsulated as one node named "appended plates". Content in front of the body text, such as the front cover, frontispiece, catalogue, preface, foreword, inscription page, colophon, title page, and editorial committee list, is encapsulated as separate PDFs named after each item. The front cover, frontispiece, catalogue, inscription page, colophon, and title page do not need OCR recognition; each page is treated as a whole picture, converted to JPG with professional image software, and uploaded to the JPG folder of the corresponding book. The preface, foreword, and editorial committee content need recognition and proofreading.
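The split between front-matter items that are kept as whole-page pictures and those that go through recognition can be sketched as a simple lookup. The section labels below are illustrative English renderings and the function name is hypothetical; an actual production system would use its own labels.

```python
# Items treated as whole-page pictures (converted to JPG, no OCR).
IMAGE_ONLY = {
    "front cover", "frontispiece", "catalogue",
    "inscription page", "colophon", "title page",
}
# Items that must be recognized and proofread.
NEEDS_OCR = {"preface", "foreword", "editorial committee"}


def front_matter_route(section_name):
    """Return 'jpg', 'ocr', or 'unknown' for a front-matter section label."""
    name = section_name.strip().lower()
    if name in IMAGE_ONLY:
        return "jpg"
    if name in NEEDS_OCR:
        return "ocr"
    return "unknown"
```

Returning "unknown" for unlisted labels, rather than guessing, leaves the decision to an operator.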
Preferably, the layout analysis, recognition, and proofreading in step 4 proceed as follows. First, layout analysis is performed separately on the text paragraphs and the pictures in the image of the original, and region frames representing the different recognition types are drawn. After the layout analysis is finished, OCR recognition is performed. After the OCR recognition is finished, the recognized text must be proofread horizontally, line by line. After the horizontal proofreading is finished, vertical proofreading is performed: each distinct character in the book is selected one by one, every position where it occurs in the text is found, and each occurrence is checked to confirm that the character is correct there, ensuring that the character recognition error rate is below one in ten thousand.
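Vertical proofreading depends on knowing every position at which each distinct character occurs. A minimal sketch of such an index is given below; it assumes the recognized text is available as a list of lines, and the function names are hypothetical. The error-rate check uses integer arithmetic to avoid floating-point comparisons.

```python
from collections import defaultdict


def build_character_index(lines):
    """Map each distinct character to its (line, column) positions, 1-based."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for col_no, ch in enumerate(line, start=1):
            if not ch.isspace():
                index[ch].append((line_no, col_no))
    return dict(index)


def meets_error_target(wrong_chars, total_chars):
    """Return True if the error rate is below one in ten thousand."""
    return wrong_chars * 10000 < total_chars
```

A proofreader can then walk through the index one character at a time and inspect every listed position against the original image.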
Preferably, the XML file production in step 7 proceeds as follows. All fields in the XML file, such as the book's Chinese title, collection identifier, first-level catalogue, second-level catalogue, title, body text, and PDF, are entered faithfully as they appear in the book: simplified characters are entered as simplified characters, and traditional characters as traditional characters. If the title on the book cover differs from the title on the inscription page, the inscription page title is used. Where the book content contains the characters "<>", they are replaced with "()" in the XML. Where the book contains notes, they are marked up as follows:
------(notes for this page begin)------
Note content
------(notes for this page end)------
Rare characters that cannot be typed are replaced with a solid black square;
Mathematical formulas, chemical formulas, and equations in the body text are treated as illustrations, and a picture index address is given. Special symbols that cannot be typed are expressed in Chinese text if they can be described in Chinese. If a table does not finish on its first page and continues on the second page without the words "continued table", the words "continued table" are added and entered in the text, followed by the index address of the table. If the catalogue title of the book differs from the title in the body text, the catalogue title is entered.
Preferably, the XML catalogue hierarchy is built to only three levels: first-level catalogue, second-level catalogue, and title. This avoids the confusing navigation that too many catalogue levels would cause.
Preferably, after the quality inspection of step 8 has been completed for each single document, the complete book also undergoes an overall inspection, and the xml file is generated. This makes retrieval more convenient.
For modern local chronicles that do not need proofreading and are only scanned for preservation, the e-book production method comprises the following steps:
Step 1. Image scanning: the paper local chronicle is scanned into a computer with a professional scanner, converting the paper text into images;
Step 2. Image processing: the image processing step comprises checking the completeness of the information, ensuring that no notes or other information of the body is omitted; deskewing the images so that they are upright; and removing stains from the images;
Step 3. PDF conversion: the whole book is encapsulated directly. An electronic text of this kind is generally used for backup, and its preservation value is not high. As long as the images are kept clear during image processing, no follow-up quality inspection is needed after encapsulation.
Preferably, after the deskewing is finished, the text portion is projected. The projection covers a certain region, which can be enlarged or reduced proportionally. After the projected region is adjusted to the same size as the original, its four vertices are compared one by one with the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original. This ensures that the PDF file omits no information of the original text.
The present invention has the following effects: its systematic steps improve the efficiency of converting paper text into electronic text. Under normal conditions, a team of 20 people can complete a local chronicle of 100,000 characters within one week, with an error rate of about one in ten thousand. The catalogue of the finished manuscript is built from the XML files, which makes consultation and retrieval convenient.
Description of drawings
The invention is further described below with reference to the accompanying drawing:
Fig. 1 is an overall flow diagram of the method for producing modern local chronicle e-books.
Embodiment
As shown in Fig. 1, the method is arranged as a production line for a team of 20 people: 3 people do PDF encapsulation, 13 do recognition and proofreading, 1 does catalogue production, and 3 do typesetting and picture indexing together with XML quality inspection. The bulk of the work in producing a local chronicle lies in the recognition and proofreading process: because the production specification requires a recognition error rate below one in ten thousand, both horizontal and vertical proofreading are necessary, and this directly affects the time that recognition takes. After the PDF encapsulation operators finish their work, they can be reassigned to recognition and proofreading or to typesetting and indexing, which balances the staffing of these two processes. Catalogue production, in proportion to its workload, is assigned to 1 person, whose workload is relatively full. The 3 people assigned to typesetting and indexing and to XML quality inspection handle both processes at once. This arrangement gives high staff mobility and makes adjustments easy.
To convert modern local chronicles with complex content from paper to electronic form, the invention provides a method for producing e-books of modern local chronicles, comprising the following steps:
Step 1. Image scanning: the paper local chronicle is scanned into a computer with a professional scanner, converting the paper document into electronic images;
Step 2. Image processing: the image processing step comprises checking the completeness of the information, ensuring that no notes or other information of the body is omitted; deskewing the images so that they are upright; and removing stains from the images. The deskewing operation includes a preliminary recognition of the text; deskewing is performed once the tilt is confirmed to have been caused by scanning, and after deskewing the angle between the text and the horizontal direction does not exceed 3 degrees. This conversion improves the success rate of the later OCR recognition and reduces the later proofreading workload. After the image processing and PDF conversion are finished, the effective-information portion of the image is projected. The projection covers all effective-information regions of the image and can be enlarged or reduced proportionally. The four vertices of the projected region are compared one by one with the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original. This ensures that the PDF file is fully consistent with the original image and omits none of its effective information.
Step 3. PDF conversion: the images are encapsulated in PDF image format according to the catalogue structure. Specifically, the illustrations after the front cover are encapsulated as one node named "frontispiece"; where prefaces are inserted between the frontispiece plates, the plates are divided in order into Frontispiece 1, Frontispiece 2, and so on. The illustrations before the back cover are encapsulated as one node named "appended plates". Content in front of the body text, such as the front cover, frontispiece, catalogue, preface, foreword, inscription page, colophon, title page, and editorial committee list, is encapsulated as separate PDFs named after each item. The front cover, frontispiece, catalogue, inscription page, colophon, and title page do not need OCR recognition; each page is treated as a whole picture, converted to JPG with professional image software, and uploaded to the JPG folder of the corresponding book. The preface, foreword, and editorial committee content need recognition and proofreading.
Step 4. Layout analysis, recognition, and proofreading: this comprises text OCR recognition and text proofreading, and the text proofreading comprises horizontal proofreading and vertical proofreading. Horizontal proofreading is line-by-line proofreading: after OCR recognition, the image of the original text is cut into lines and arranged alongside the recognized text, one image line to one text line, which makes horizontal proofreading convenient.
Vertical proofreading selects each distinct character in the book one by one, finds the positions where it occurs in the text, and checks each occurrence to confirm that the character is correct there;
Step 5. TXT typesetting and JPG illustration indexing;
Step 6. Catalogue production: the catalogue is organized according to the rules and the catalogue index is completed, and the catalogue is generated from the finished files;
Step 7. XML file production: the XML file is generated from the compiled catalogue text and the proofread text;
Step 8. XML quality inspection.
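Step 7 can be illustrated with a minimal sketch that assembles one record per document and wraps the records in a root element. The root and record names TRS and REC follow the example in this document; the field names passed in, and the function names, are hypothetical placeholders rather than the field names actually used in production.

```python
import xml.etree.ElementTree as ET


def build_record(fields):
    """Build one REC element from an ordered mapping of field name to text."""
    rec = ET.Element("REC")
    for name, value in fields.items():
        child = ET.SubElement(rec, name)
        child.text = value  # special characters are escaped on serialization
    return rec


def build_document(records):
    """Wrap REC elements in a TRS root and return the XML as a string."""
    root = ET.Element("TRS")
    root.extend(records)
    return ET.tostring(root, encoding="unicode")
```

Using a library to serialize the records, rather than concatenating strings, guarantees that every opening tag has a matching closing tag, which is one of the things step 8 would otherwise have to check by hand.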
Preferably, the XML file production in step 7 proceeds as follows. All header fields in the XML file are entered faithfully as they appear in the book: simplified characters are entered as simplified characters, and traditional characters as traditional characters. If the title on the book cover differs from the title on the inscription page, the inscription page title is used. Where the book content contains the characters "<>", they are replaced with "()" in the XML. Where the book contains notes, they are marked up as follows:
------(notes for this page begin)------
Note content
------(notes for this page end)------
Rare characters that cannot be typed are replaced with a solid black square "■";
Mathematical formulas, chemical formulas, and equations in the body text are treated as illustrations, and a picture index address is given. Special symbols that cannot be typed are expressed in Chinese text if they can be described in Chinese. If a table does not finish on its first page and continues on the second page without the words "continued table", the words "continued table" are added and entered in the text, followed by the index address of the table. If the catalogue title of the book differs from the title in the body text, the catalogue title is entered.
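Three of the entry rules above lend themselves to a short sketch: replacing angle brackets with parentheses, wrapping page notes in begin and end markers, and substituting "■" for characters that cannot be typed. The marker strings are illustrative English renderings, the handling of the CJK angle brackets "〈〉" is an assumption, and the set of untypeable characters is supplied by the caller.

```python
NOTE_START = "------(notes for this page begin)------"  # illustrative rendering
NOTE_END = "------(notes for this page end)------"      # illustrative rendering
PLACEHOLDER = "■"

# ASCII angle brackets plus, as an assumption, their CJK counterparts.
_BRACKETS = str.maketrans({"<": "(", ">": ")", "〈": "(", "〉": ")"})


def replace_angle_brackets(text):
    """Replace angle brackets with parentheses so the text is XML-safe."""
    return text.translate(_BRACKETS)


def wrap_notes(note_lines, start=NOTE_START, end=NOTE_END):
    """Surround the notes of one page with the begin and end markers."""
    return "\n".join([start, *note_lines, end])


def replace_untypeable(text, untypeable):
    """Replace every character in the untypeable set with the placeholder."""
    return "".join(PLACEHOLDER if ch in untypeable else ch for ch in text)
```

Keeping each rule in its own function lets the quality-inspection stage test them independently.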
Embodiment 1:
<?xml version="1.0" encoding="gbk"?>
<!DOCTYPE TRS [
<!ELEMENT TRS (REC)>
]>
<TRS>
<REC>
<book_chinese_title>Zhejiang Province Ethnic Groups Chronicle</book_chinese_title><br/><br/>
<collection_identifier>F426.21214.21563</collection_identifier><br/><br/>
<catalogue_order>28</catalogue_order><br/><br/>
<catalogue_page>161</catalogue_page><br/><br/>
<first_level_catalogue>Part One: The She</first_level_catalogue><br/><br/>
<second_level_catalogue>Chapter 4: Economic Life</second_level_catalogue><br/><br/>
<title>Section 8: Material Life</title><br/><br/>
<storage_directory>F426.21214.21563</storage_directory><br/><br/>
<body_text>Section 8: Material Life<br/>
Income
Before the founding of the state, the reactionary rule of imperialism, feudalism, and bureaucrat capitalism severely fettered production in the She areas
<img src="http://digldata.zjlib.cn/dfz/F426.21214.21563/1-4-8-5.jpg"/>
<img src="http://digldata.zjlib.cn/dfz/F426.21214.21563/1-4-8-6.jpg"/>
<img src="http://digldata.zjlib.cn/dfz/F426.21214.21563/1-4-8-7.jpg"/>
Housing conditions
Before the founding of the state, She villages were scattered and small, located mostly in mountain hollows and on hillsides. Half of the farmhouses were single-storey earth-and-timber houses, and half were crude, low huts with thatched or fir-bark roofs; a few wealthy families had courtyard-style buildings.
In 1949, 30~40% of the She people of Wenzhou lived in damp, dark, unventilated thatched huts. In the 1980s, 239 tile-roofed rooms totalling 7130 square metres were built in Qingjie, Pingyang County; in Cangnan County, Heshan village built 79 new houses and Ban Gong village built 68 new houses. Juxi town has more than 3100 She people and is the largest She township in the province; in the old days they lived in thatched huts, and they now all live in tile-roofed houses and new houses. By 1990, only 19 She households in the whole city still lived in thatched huts. For the changes in the housing conditions of the She people of Lishui, see Table 4-9.
<img src="http://digldata.zjlib.cn/dfz/F426.21214.21563/1-4-8-8.jpg"/>
In the late 1980s, household appliances entered She homes. Among 620 households in 8 villages of 4 counties in Wenzhou there were 119 television sets, on average 1 for every 6 households. Per thousand She households in Lishui Prefecture there were 11 colour televisions, 215 black-and-white televisions, 10 washing machines, 5 refrigerators, 279 electric fans, 208 electric rice cookers, 501 sewing machines, 880 bicycles, and 3 motorcycles. See Table 4-10.
<img src="/dfz/F426.21214.21563/1-4-8-9.jpg"/>
</body_text><br/><br/>
<pdf_file_name>1-4-8.pdf</pdf_file_name><br/><br/>
</REC>
</TRS>
With this definition, a search for a keyword such as "material life" will find this text. The XML catalogue hierarchy is built to only three levels: first-level catalogue, second-level catalogue, and title.
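One way to keep the hierarchy at three levels is to fold deeper headings into the third level. The sketch below follows the quality-inspection rule in this document that a fourth-level title is prefixed with its third-level title and placed at the third level; the single-space separator, the treatment of levels deeper than four, and the function name are assumptions.

```python
def flatten_to_three_levels(entries):
    """Fold catalogue entries deeper than level 3 into level 3.

    entries: list of (level, title) tuples in document order.
    """
    result = []
    current_level3 = None
    for level, title in entries:
        if level <= 2:
            current_level3 = None  # a new chapter resets the level-3 context
            result.append((level, title))
        elif level == 3:
            current_level3 = title
            result.append((3, title))
        else:
            # Prefix the deeper title with its level-3 title (assumed separator).
            merged = f"{current_level3 or ''} {title}".strip()
            result.append((3, merged))
    return result
```

For instance, a fourth-level heading "Income" under "Section 8" becomes the third-level entry "Section 8 Income".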
After the quality inspection of step 8 has been completed for each single document, the complete book also undergoes an overall inspection, and a composite catalogue is generated. This further ensures that the finished result is accurate and easy to search. The overall quality inspection usually includes:
During catalogue production, the PDF files must be inspected for missing pages, extra pages, or pages out of order, and for whether the content is encapsulated according to the smallest organizational unit of the catalogue;
The collection identifier in the XML file must be correct; note that "-" in the collection identifier is changed to "_" and "()" is changed to "()", and check whether the XML file is named correctly;
Check whether the first-, second-, and third-level catalogues are accurate; a catalogue title of "Appendix" must be prefixed with the title of the preceding level; a fourth-level catalogue title must be prefixed with the title of its third-level catalogue and placed at the same level as the third-level catalogue;
The content of each REC node must be complete and correctly typeset; separators should be present between the title and the subsections and between subsections, and the content must correspond to the PDF image file;
The maximum catalogue order value in the XML file must correspond to the number of PDFs;
The link addresses of pictures must be correct, and the number of link addresses must correspond to the number of JPG pictures; otherwise, after import into the database, links will fail to reach the corresponding illustration. Also check whether the JPG quality meets the specification;
The OTIFF file, PDF files, and XML file are named with the collection identifier;
The number of TIFFs in the OTIFF file must equal the number of PDF pages of the whole book; if they differ, the cause must be found and corrected. Check the workflow sheet for rework records; if there is a rework record, check whether the rework has been completed, and modify the PDF file of the whole book according to the rework record.
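Several of the overall checks are simple counts and can be sketched in code. The example below covers three of them: the hyphen-to-underscore rule for the collection identifier, the match between picture links and JPG files, and the match between the maximum catalogue order and the number of PDFs. The function names and the way the counts are supplied are hypothetical.

```python
import re


def normalize_collection_id(collection_id):
    """Apply the rule that '-' in the collection identifier becomes '_'."""
    return collection_id.replace("-", "_")


def count_image_links(body_text):
    """Count <img src=...> links in the body text of one record."""
    return len(re.findall(r"<img\s+src=", body_text))


def check_record(body_text, jpg_count, max_catalogue_order, pdf_count):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    if count_image_links(body_text) != jpg_count:
        problems.append("image link count differs from JPG count")
    if max_catalogue_order != pdf_count:
        problems.append("maximum catalogue order differs from PDF count")
    return problems
```

Returning a list of problems rather than a single pass or fail flag lets an inspector see every mismatch in one run.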
With the method of the present invention, local chronicle documents can be converted to electronic form quickly, efficiently, and accurately; the conversion is fast and the error rate is low, giving a very good result.
For modern local chronicles that do not need proofreading and are only scanned for preservation, the e-book production method comprises the following steps:
Step 1. Image scanning: the text is scanned into a computer with a professional scanner, converting the paper text into images;
Step 2. Image processing: the image processing step comprises checking the completeness of the information, ensuring that no notes or other information of the body is omitted; deskewing the images so that they are upright; and removing stains from the images;
Step 3. PDF conversion: the whole book is encapsulated directly. An electronic text of this kind is generally used for backup, and its preservation value is not high. As long as the images are kept clear during image processing, no follow-up quality inspection is needed after encapsulation. After the image processing and PDF conversion are finished, the effective-information portion of the image is projected. The projection covers all effective-information regions of the image and can be enlarged or reduced proportionally. The four vertices of the projected region are compared one by one with the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original. This ensures that the PDF file is fully consistent with the original image and omits none of its effective information.
This simplified procedure still ensures that no information is omitted from the scanned archive file.

Claims (10)

1. A method for producing e-books of modern local chronicles, characterized in that it comprises the following steps:
Step 1. Image scanning: the paper local chronicle is scanned into a computer with a professional scanner, converting the paper document into electronic images;
Step 2. Image processing: the image processing step comprises checking the completeness of the information, ensuring that no text, pictures, notes, or other information of the body is omitted; deskewing the images so that they are upright; and removing stains so that the images are clean and tidy;
Step 3. PDF conversion: the images are encapsulated in PDF image format according to the smallest organizational unit of the catalogue;
Step 4. Layout analysis, recognition, and proofreading: this comprises image layout analysis, text OCR recognition, and text proofreading; the text proofreading comprises horizontal proofreading and vertical proofreading; horizontal proofreading is line-by-line proofreading, and vertical proofreading selects each distinct character in the book one by one, finds every position where it occurs in the text, and checks each occurrence to confirm that the character is correct;
Step 5. TXT typesetting and JPG illustration indexing: the TXT file of the text recognized in step 4 is typeset and the illustrations within the text are indexed, ensuring that each JPG illustration is embedded at the correct position in the text and that the index is accurate;
Step 6. Catalogue production: the catalogue is organized according to the rules and the catalogue index is completed, and a catalogue file is generated from the finished files;
Step 7. XML file production: from the compiled catalogue text and the TXT text of step 5, an XML file describing each local document is generated for import into the database;
Step 8. XML quality inspection: the XML file generated for each document is inspected, including all fields such as title, author, publisher, body text, and PDF path, to ensure that it corresponds fully to the content of the paper original.
2. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that: the deskewing operation in step 2 includes a preliminary recognition of the text; deskewing is performed after the tilt is confirmed to have been caused by scanning, and after deskewing the angle between the text and the horizontal direction does not exceed 3 degrees.
3. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that: after the deskewing is finished, the text portion is projected; the projection covers a certain region, which can be enlarged or reduced proportionally; after the projected region is adjusted to the same scale as the original, its four vertices are checked one by one against the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original.
4. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that the encapsulation in PDF image format in step 3 is as follows: the illustrations after the front cover are packaged as one node named "frontispiece"; prefaces are allowed to be inserted in order between frontispiece nodes, and the frontispiece is divided into frontispiece 1, frontispiece 2; the illustrations before the back cover are packaged as one node named "appended figures"; the content preceding the body text, such as the front cover, frontispiece, catalogue, preface, foreword, inscription page, colophon, title page and editorial committee, is encapsulated as separate PDFs by title; the front cover, frontispiece, catalogue, inscription page, colophon and title page do not need OCR recognition, and each page is treated as a whole image, converted to JPG with professional image software, and uploaded to the JPG folder corresponding to each book; the preface, foreword and editorial committee content must be recognized and proofread.
5. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that the layout analysis, recognition and proofreading in step 4 is as follows: layout analysis is first carried out separately on the text paragraphs and the pictures in the page images of the original, and region frames representing the different recognition types are drawn; after the layout analysis is finished, OCR recognition is carried out; after the OCR recognition is finished, the recognized text content must be proofread horizontally, line by line; after the horizontal proofreading is finished, vertical proofreading is carried out, that is, every distinct character in the book is selected one by one, every position at which that character occurs in the text is found, and each occurrence is confirmed by one-by-one comparison to be correct, ensuring that the character recognition error rate is below one in ten thousand.
6. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that the XML file production in step 7 is as follows: all fields in the XML file, such as the Chinese book title, collection identifier, first-level catalogue, second-level catalogue, title, body text and PDF, are entered faithfully as they appear in the book; simplified characters are entered as simplified characters, and traditional characters as traditional characters; if the title on the book cover differs from the title on the inscription page, the inscription-page title is used; where the book content contains "<>" characters, they are replaced with "()" in the XML; where the book contains notes, they are described as follows:
------(notes for this page begin)------
Note content
------(notes for this page end)------
Rare characters that cannot be typed are replaced with a solid black square;
Mathematical formulas, chemical molecular formulas and equations in the text are handled as illustrations, and the image index address is given; special symbols that cannot be typed are expressed in Chinese text if they can be described in Chinese; where a table does not end on its first page and continues on the second page, and the second page does not carry the words "continued table", the words "continued table" are added and entered in the text, followed by the index address of the table; where a heading in the book's catalogue differs from the heading in the body text, the catalogue heading is entered.
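Several of the entry rules in claim 6 are simple text substitutions and can be sketched in a few lines. The following is a hedged illustration only: the function names are hypothetical, the note markers are an English rendering of the markers given in the claim, the "<>" characters are assumed to be the ASCII angle brackets, and the set of untypable characters is assumed to be supplied by the operator.

```python
from typing import Iterable, List

BLACK_SQUARE = "\u25A0"  # solid black square used for untypable rare characters
# English rendering of the note markers; the exact marker text is an assumption.
NOTE_START = "------(notes for this page begin)------"
NOTE_END = "------(notes for this page end)------"


def normalize_text(text: str, untypable: Iterable[str] = ()) -> str:
    """Replace angle brackets with parentheses and untypable characters with a square."""
    text = text.replace("<", "(").replace(">", ")")
    for ch in untypable:
        text = text.replace(ch, BLACK_SQUARE)
    return text


def format_page(body: str, notes: List[str], untypable: Iterable[str] = ()) -> str:
    """Return the page text with its notes wrapped in the begin/end markers."""
    untypable = list(untypable)
    parts = [normalize_text(body, untypable)]
    if notes:
        parts.append(NOTE_START)
        parts.extend(normalize_text(note, untypable) for note in notes)
        parts.append(NOTE_END)
    return "\n".join(parts)


def choose_book_title(cover_title: str, inscription_page_title: str) -> str:
    """If the cover and inscription-page titles differ, the inscription page wins."""
    if cover_title != inscription_page_title:
        return inscription_page_title
    return cover_title
```

Replacing the angle brackets before the text reaches the XML writer keeps the stored text free of characters that would otherwise need escaping, which matches the claim's stated substitution rather than standard XML entity escaping.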
7. The method for producing a modern local literature e-book as claimed in claim 6, characterized in that: the catalogue hierarchy of the XML is built to only three levels, namely the first-level catalogue, the second-level catalogue and the title.
8. The method for producing a modern local literature e-book as claimed in claim 1, characterized in that: after the quality inspection of step 8 has been completed for each individual document, the complete publication additionally undergoes a general review, and the XML file is generated.
9. A method for producing a modern local literature e-book that requires no proofreading and is only scanned for preservation, characterized by comprising the following steps:
Step 1. Image scanning: the paper local literature is scanned into a computer with a professional scanner, converting the paper text into images;
Step 2. Image processing: the image processing step comprises checking information integrity, so that no notes or other information of the original is omitted; deskewing the image, so that the image is upright; and cleaning stains from the image;
Step 3. PDF conversion: the whole book is encapsulated directly.
10. The method for producing a modern local literature e-book as claimed in claim 9, characterized in that: after the deskewing is finished, the text portion is projected; the projection covers a certain region, which can be enlarged or reduced proportionally; after the projected region is adjusted to the same scale as the original, its four vertices are checked one by one against the corresponding positions on the original to see whether they all coincide at the same time, and the projected edge between each pair of adjacent vertices is then checked for coincidence with the original.
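The vertical proofreading recited in claims 1 and 5 amounts to building an index from each distinct character to all of its occurrences, so that a proofreader can review every occurrence of one character in a single pass. A minimal sketch follows, assuming the recognized text is available as one string per page; the function names and the (page, line, column) position format are hypothetical, and only the one-in-ten-thousand threshold comes from the claims.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Position = Tuple[int, int, int]  # (page, line, column), all 1-based


def build_character_index(pages: Sequence[str]) -> Dict[str, List[Position]]:
    """Map every distinct non-space character to all positions where it occurs."""
    index: Dict[str, List[Position]] = defaultdict(list)
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            for col_no, ch in enumerate(line, start=1):
                if not ch.isspace():
                    index[ch].append((page_no, line_no, col_no))
    return dict(index)


def meets_error_target(errors: int, total_characters: int,
                       target: float = 1 / 10000) -> bool:
    """Check that the residual error rate is below the target (one in ten thousand)."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return errors / total_characters < target
```

Grouping occurrences by character complements the line-by-line pass: a systematic OCR confusion between two similar glyphs shows up as a cluster under one index entry instead of being scattered across hundreds of pages.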

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310085316.0A CN103218351B (en) 2013-03-15 2013-03-15 Modern local literature electronic book manufacture method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310085316.0A CN103218351B (en) 2013-03-15 2013-03-15 Modern local literature electronic book manufacture method

Publications (2)

Publication Number Publication Date
CN103218351A true CN103218351A (en) 2013-07-24
CN103218351B CN103218351B (en) 2016-06-22

Family

ID=48816156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310085316.0A Active CN103218351B (en) 2013-03-15 2013-03-15 Modern local literature electronic book manufacture method

Country Status (1)

Country Link
CN (1) CN103218351B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605639A (en) * 2013-11-28 2014-02-26 厦门市乐创信息科技有限公司 Method of making e-books based on EPUB format
WO2015021737A1 (en) * 2013-08-12 2015-02-19 福建福昕软件开发股份有限公司北京分公司 Method for converting paper file into electronic file
CN104750662A (en) * 2015-03-27 2015-07-01 西藏藏医学院 Rescue, reorganization and protection method for only existing copy of ancient literature on Tibetan medicine theories
CN104750370A (en) * 2013-12-30 2015-07-01 顺富科技实业有限公司 Electronic document reading displaying method
TWI547857B (en) * 2013-12-25 2016-09-01 Electronic document reading display method
CN106250830A (en) * 2016-07-22 2016-12-21 浙江大学 Digital book structured analysis processing method
CN106778507A (en) * 2016-11-24 2017-05-31 北京小米移动软件有限公司 Text extraction method and device
CN106844567A (en) * 2016-12-23 2017-06-13 《中国医药科学》杂志社有限公司 A kind of papery contribution is converted to the method and system of the network text page
CN106874240A (en) * 2016-12-22 2017-06-20 华南师范大学 Digital publishing method and system
CN107085505A (en) * 2017-04-21 2017-08-22 武汉印链科技有限公司 A kind of CDR files are automatically processed and automatic comparison method and system
CN107153635A (en) * 2016-03-04 2017-09-12 《中国学术期刊(光盘版)》电子杂志社有限公司 It is a kind of to automatically extract the method and system that paper quotes bibliography after content and correspondence text
CN107194659A (en) * 2017-04-26 2017-09-22 珠海泰坦软件系统有限公司 A kind of archival digitalization copy quality automated detection method
CN108074214A (en) * 2017-12-20 2018-05-25 江苏省质量和标准化研究院 A kind of standard resource processes detergency processing method
CN109635681A (en) * 2018-11-26 2019-04-16 汉王科技股份有限公司 A kind of literature processing method and device
CN110765902A (en) * 2019-10-10 2020-02-07 延安大学 Digital protection and inheritance device for ancient and old newspapers
CN112306433A (en) * 2020-11-12 2021-02-02 深圳市华博创新科技有限公司 Printing processing method for electronic draft bag
CN112836073A (en) * 2021-02-02 2021-05-25 嘉应学院 Historical literature digitization method, system, device and storage medium
CN113448918A (en) * 2021-08-31 2021-09-28 中国建筑第五工程局有限公司 Enterprise scientific research result management method, management platform, equipment and storage medium
CN116092108A (en) * 2023-03-20 2023-05-09 四川竺信档案数字科技有限责任公司 Method, system and storage medium for generating PDF file by scanning entity document

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334775A (en) * 2007-06-29 2008-12-31 高等教育出版社 Formalization representation method based on XML book content structure
CN101464903A (en) * 2009-01-09 2009-06-24 江阴明伦科技有限公司 OCR picture and text recognition and retrieval method and system through web mode
CN101794278A (en) * 2009-09-21 2010-08-04 广东省标准化研究院 Method and software for digitalizing full text of standard document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王天泉: "中华人民共和国档案行业标准 纸质档案数字化技术规范", 《中国档案》, 30 March 2006 (2006-03-30) *
金晨,牛离平: "农业古籍全文数字化加工技术", 《农业图书情报学刊》, vol. 17, no. 10, 5 October 2005 (2005-10-05) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015021737A1 (en) * 2013-08-12 2015-02-19 福建福昕软件开发股份有限公司北京分公司 Method for converting paper file into electronic file
CN104376317A (en) * 2013-08-12 2015-02-25 福建福昕软件开发股份有限公司北京分公司 Method for transforming paper file into electronic file
CN104376317B (en) * 2013-08-12 2018-12-14 福建福昕软件开发股份有限公司北京分公司 A method of paper document is converted into electronic document
CN103605639A (en) * 2013-11-28 2014-02-26 厦门市乐创信息科技有限公司 Method of making e-books based on EPUB format
TWI547857B (en) * 2013-12-25 2016-09-01 Electronic document reading display method
CN104750370A (en) * 2013-12-30 2015-07-01 顺富科技实业有限公司 Electronic document reading displaying method
CN104750662A (en) * 2015-03-27 2015-07-01 西藏藏医学院 Rescue, reorganization and protection method for only existing copy of ancient literature on Tibetan medicine theories
CN107153635A (en) * 2016-03-04 2017-09-12 《中国学术期刊(光盘版)》电子杂志社有限公司 It is a kind of to automatically extract the method and system that paper quotes bibliography after content and correspondence text
CN106250830A (en) * 2016-07-22 2016-12-21 浙江大学 Digital book structured analysis processing method
CN106250830B (en) * 2016-07-22 2019-05-24 浙江大学 Digital book structured analysis processing method
CN106778507A (en) * 2016-11-24 2017-05-31 北京小米移动软件有限公司 Text extraction method and device
CN106874240A (en) * 2016-12-22 2017-06-20 华南师范大学 Digital publishing method and system
CN106844567A (en) * 2016-12-23 2017-06-13 《中国医药科学》杂志社有限公司 A kind of papery contribution is converted to the method and system of the network text page
CN107085505B (en) * 2017-04-21 2020-01-14 武汉印链科技有限公司 CDR file automatic processing and automatic comparison method and system
CN107085505A (en) * 2017-04-21 2017-08-22 武汉印链科技有限公司 A kind of CDR files are automatically processed and automatic comparison method and system
CN107194659A (en) * 2017-04-26 2017-09-22 珠海泰坦软件系统有限公司 A kind of archival digitalization copy quality automated detection method
CN108074214B (en) * 2017-12-20 2020-01-10 江苏省质量和标准化研究院 Standard resource processing decontamination treatment method
CN108074214A (en) * 2017-12-20 2018-05-25 江苏省质量和标准化研究院 A kind of standard resource processes detergency processing method
CN109635681B (en) * 2018-11-26 2021-11-26 汉王科技股份有限公司 Document processing method and device
CN109635681A (en) * 2018-11-26 2019-04-16 汉王科技股份有限公司 A kind of literature processing method and device
CN110765902A (en) * 2019-10-10 2020-02-07 延安大学 Digital protection and inheritance device for ancient and old newspapers
CN110765902B (en) * 2019-10-10 2023-04-18 延安大学 Digital protection and inheritance device for ancient and old newspapers
CN112306433A (en) * 2020-11-12 2021-02-02 深圳市华博创新科技有限公司 Printing processing method for electronic draft bag
CN112836073A (en) * 2021-02-02 2021-05-25 嘉应学院 Historical literature digitization method, system, device and storage medium
CN113448918B (en) * 2021-08-31 2021-11-12 中国建筑第五工程局有限公司 Enterprise scientific research result management method, management platform, equipment and storage medium
CN113448918A (en) * 2021-08-31 2021-09-28 中国建筑第五工程局有限公司 Enterprise scientific research result management method, management platform, equipment and storage medium
CN116092108A (en) * 2023-03-20 2023-05-09 四川竺信档案数字科技有限责任公司 Method, system and storage medium for generating PDF file by scanning entity document

Also Published As

Publication number Publication date
CN103218351B (en) 2016-06-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant