WO2019108209A1 - Digital part-page detectors - Google Patents

Digital part-page detectors

Info

Publication number
WO2019108209A1
Authority
WO
WIPO (PCT)
Prior art keywords
title
pages
book
page
candidate
Prior art date
Application number
PCT/US2017/064023
Other languages
French (fr)
Inventor
Ricardo Da Silva BECK
Rodrigo Marques DALMAS
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2017/064023
Publication of WO2019108209A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/186: Templates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/416: Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Definitions

  • Book publishing traditionally involves many digital document-processing steps before publication of books.
  • the authors write books under contract, and receive royalties for the books sold.
  • the traditional publisher creates books that are mass published without change due to the costs associated with the many steps before publication.
  • the books may be published as digital books or physical books.
  • Figure 1 illustrates examples of a system for generating a custom digital book for publication
  • Figure 2 illustrates examples of heuristics identified by the system 100 on a page in a book for publication
  • Figures 3A, 3B and 3C illustrate examples of the system 100 identifying actual part-title pages in a digital book
  • Figure 4 illustrates a method according to examples for determining a set of chapters for a digital book for publication
  • Figure 5 illustrates a method according to examples for generating a custom book.
  • a technical problem associated with creating a custom digital book based on an existing digital book is the electronic identification of chapters in the existing digital book.
  • chapters of the existing book may be electronically added, removed or modified. After modifications, it is technically difficult to electronically identify where chapters begin and end for creating the custom book and for creating a table of contents for the custom book.
  • systems and methods are provided that may electronically identify chapters in a digital book and create a custom digital book based on the identified chapters.
  • a system may determine part-title pages in a digital book, identify chapters in the book based on the part-title pages, receive specifications to modify the book and generate a digital table of contents for the modified book.
  • a digital book is a book made available in digital form, which may include text, images, or both, and which may be readable on a display of a device, such as a computer display, laptop, smartphone, etc.
  • the digital book may be provided in a pdf format or may be provided in another digital format.
  • a part-title page may separate chapters in books.
  • the part-title page may include details about the title of a chapter and other formatting that differentiates the part-title page from other pages in the book.
  • the part-title page may include the initial paragraph of the chapter, the title in a larger font, and other heuristics that differentiate it from other pages in the book.
  • the system may parse pages of a digital book to identify heuristics present on the pages.
  • the system may heuristically identify candidate part-title pages separating chapters in a book.
  • a heuristic may be a property of a page in a book that makes it more likely that the page is or is not a part-title page. For example, a page with the same font size as the rest of the pages in a book may be less likely to be a part-title page. However, a page that starts with text in a font larger than the font size used for most of the text in the book may be more likely to be a part-title page.
  • a probability indicative of whether a page may be a part-title page may be determined for each page based on the heuristics present on the page, and candidate part-title pages are selected based on the probabilities.
  • the system may determine a template part-title page with the highest probability.
  • a candidate part-title page may have the highest probability when the number of heuristics matched on that page is higher than on other candidate part-title pages.
  • the system may have more than one part-title page with heuristics present on the page.
  • the system may determine the template part-title page using the minimum common subset of heuristics in the pages with the highest probability.
  • the system may compare the template part-title page with the pages in the candidate part-title pages to identify candidate part-title pages that are actual part-title pages. For example, to compare the pages, the system may compare a heuristic in the template part-title page with a heuristic in a candidate part-title page. For example, in a published book part-title pages may appear on the same side of the book, i.e., chapters may begin on the same side of the book. The system may use this heuristic from the template part-title page to select a subset of actual part-title pages from the candidate part-title pages. The system may solve the technical problem of identifying part-title pages without an exhaustive search of permutations and combinations, arriving at an approximate solution.
  • An approximate solution may be a solution not verified to be the best solution to the problem of identifying part-title pages.
  • the system may generate the digital table of contents of the book based on the actual part-title pages.
  • the system may use a set of part-title heuristics to determine part-titles of the book, such as chapter titles. For example, the system may use the position where a part-title header appears.
  • a part-title header may refer to a chapter title.
  • the part-title headers are near the fore-edge of the page. The fore-edge of a book is the right-hand edge of a book when opened, opposite the spine.
  • the part-title page in a book may have a printed recto side and an unprinted verso side, i.e., absence of content in a part-title page.
  • a recto side may mean the right-hand page of an opening in a book.
  • the part-title page may include characters with larger font sizes, i.e., presence of larger font sizes compared to other characters in the rest of the book.
  • books may include a large letter for the first word of the chapter with normal cases for the rest of the chapter.
  • content may be absent in a part-title page verso.
  • there may be a vignette in the part-title page, i.e., presence of a vignette in the part-title page.
  • a vignette may be an image used to begin a chapter.
  • some books may use an image, or an image and text, in a part-title page.
  • the part-title header may include the word “Chapter”.
  • a numerical character may follow in Arabic numerals or in Roman numerals.
  • textbooks may use headers such as "Chapter I" and "Chapter II" to identify the next chapter.
  • the running headers and/or running footer may be absent in part-title pages.
  • a running header may be text at the top of a page.
  • a running header may include a title of a page, page number, and chapter number.
  • a running footer may be text at the bottom of the page and may include the same information as the running header.
  • the information and the use of the running header and the running footer may depend on the publisher. Some publishers may use one or the other or both.
  • the system may use the heuristics listed above as a set of part-title heuristics to identify candidate part-title pages.
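The heuristics listed above can be sketched as boolean checks over per-page features. The following Python fragment is only an illustration under stated assumptions, not the implementation described in the publication: the `PageFeatures` fields, the `PART_TITLE_HEURISTICS` table, and the use of the matched fraction as a score are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical features extracted from one page (names are illustrative)."""
    header_near_fore_edge: bool = False
    verso_blank: bool = False
    has_larger_font: bool = False
    has_vignette: bool = False
    has_chapter_word: bool = False
    has_running_header: bool = True
    has_running_footer: bool = True


# Each heuristic is a named predicate over the page features (assumed set).
PART_TITLE_HEURISTICS = {
    "header_near_fore_edge": lambda p: p.header_near_fore_edge,
    "verso_blank": lambda p: p.verso_blank,
    "larger_font": lambda p: p.has_larger_font,
    "vignette": lambda p: p.has_vignette,
    "chapter_word": lambda p: p.has_chapter_word,
    "no_running_header": lambda p: not p.has_running_header,
    "no_running_footer": lambda p: not p.has_running_footer,
}


def matched_heuristics(page):
    """Return the names of the heuristics that the page satisfies."""
    return {name for name, check in PART_TITLE_HEURISTICS.items() if check(page)}


def part_title_score(page):
    """Fraction of heuristics matched, used here as a stand-in probability."""
    return len(matched_heuristics(page)) / len(PART_TITLE_HEURISTICS)


body_page = PageFeatures()
title_page = PageFeatures(
    has_larger_font=True,
    has_chapter_word=True,
    has_running_header=False,
    has_running_footer=False,
)
```

In this sketch an ordinary body page matches none of the checks, while the example part-title page matches four of the seven, so it would rank higher when candidates are selected.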
  • the system may allow generation of a custom digital book. The system may receive a
  • the specification may be a selection of part-title pages for deletion, a set of pages for inclusion near a part-title page and/or a request to rearrange chapters to generate a custom book.
  • the system 100 may use the specification to generate the custom book.
  • the specification may also include references to indicate where the pages may be inserted such as near or next to a part-title page.
  • a technical problem associated with part-title page identification is that different publishers or even different books by the same publisher may format their books differently. Accordingly, a fixed template may not be viable to accurately identify chapters of a digital book.
  • a part-title page may be on the left-hand side or on the right-hand side.
  • the system may determine part-title pages that demarcate a division between chapters using heuristics.
  • the system may use a set of part-title heuristics to identify candidate part-title pages and determine an approximate solution to determine an actual set of part-title pages.
  • the other traditional approaches fail, because they are not flexible when the part-title page has inconsistent characteristics between books.
  • part-title page identification is slow and may use expensive computational resources. For example, it may not be helpful for the system to identify the best solution to the problem of determining part-title pages. For example, for creating custom books it may be efficient to allow part-title pages to be identified as an approximate solution. In addition, a custom book may be proof-read before publication. A less optimal solution may meet the use case compared to a more optimal solution that uses more system resources and/or is expensive.
  • the system described according to examples of the current disclosure may identify part-title pages separating chapters, based on heuristics to determine an approximate solution. The approximate solution may be the best solution, but the system may not verify the solution is the best solution to conserve system resources such as Central Processing Unit (CPU), Random Access Memory (RAM) and power utilization.
  • the system also allows for custom creation of digital books.
  • portions of the book may be deleted, modified or additional content added to create the custom book.
  • the system can create a digital table of contents for the custom book based on identification of part-title pages in the book.
  • Figure 1 shows examples of a system 100 for identifying part-title pages in a document, extracting chapters, generating a custom book based on a specification and generating a digital table of contents for the custom book based on heuristics.
  • the system 100 may include a processor 102 and a memory 104.
  • the system 100 may include machine readable instructions 106 stored in the memory 104 or another type of non-transitory computer readable medium executable by the processor 102.
  • the system 100 may be a workstation, a desktop computer, a laptop computer, a handheld device or any other device.
  • the system 100 may include instructions in a non-transitory computer readable medium executable by a processor for display on a browser.
  • a server may allow interaction with the system 100 through a remote interface such as the browser.
  • the system 100 may be connected to a network such as the internet.
  • the network may be a wide area network, a local area network, a cellular network, a satellite network and the like.
  • the system 100 may determine an approximate solution to determine actual part-title pages in a book based on heuristics.
  • the approximate solution may be the best solution, but the system may not expend computer resources to determine the best solution or to verify the determined solution is the optimal solution.
  • the system 100 may use a heuristic approach to identify part-title pages that does not consider possible combinations and permutations.
  • the system 100 may be used in publishing industry to allow generation of custom books. For example, a university course may use a custom book.
  • the system 100 may allow a publisher to allow changes to a textbook.
  • the system 100 may allow addition of examples, a set of custom problems after chapters in the book, addition or removal of content in the book, reordering of chapters, automatic generation of digital table of contents and the like.
  • the system 100 may parse pages in a digital book to identify heuristics present on pages of a book.
  • the system 100 may determine probabilities that pages in the digital book are of candidate part-title pages based on a comparison between the identified heuristics and a set of part-title heuristics. From the candidate part-title pages, the system 100 may then determine a template part-title page.
  • the system 100 may determine the template part-title page based on the determined probabilities. For example, the system 100 may determine a page that includes more heuristics from the set of part-title heuristics compared to other candidate part-title pages as the template part-title page.
  • the system may determine a template part-title page comprising part-title heuristics selected from the candidate part-title pages based on a count of the heuristics in candidate part-title pages with a higher probability compared to other candidate part-title pages. For example, assume a group of candidate part-title pages includes six out of ten heuristics in a set of part-title heuristics. The system 100 may determine the template part-title page comprising heuristics from the group of candidate part-title pages based on the heuristics with a higher count of heuristic matches in the candidate part-title pages with a higher determined probability than other candidate part-title pages with a lower determined probability. In examples, when more than one part-title page has a similar number of heuristic matches, the system may determine a common denominator of those part-title pages to generate the template part-title page.
  • the system 100 may compare the template part-title page with the pages in the candidate part-title pages. Comparing the template page may refer to selecting a heuristic on the template part-title page and comparing the heuristic to heuristics present on the candidate part-title pages. For example, the system 100 may determine that part-title pages in the book are on the right-hand side of the book. The system 100 may select a subset of the candidate part-title pages located on the right-hand side of the book. The system may ignore other permutations and combinations rather than verify whether the solution determined is optimal.
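The "common denominator" idea above can be illustrated as a set intersection over the best-scoring candidates. This is a minimal sketch under assumptions: the function name `build_template`, the heuristic names, and the page numbers are hypothetical, and treating "similar number of matches" as "equal to the highest count" is a simplification chosen for the example.

```python
def build_template(candidates):
    """Build a template from the candidates with the most heuristic matches.

    candidates maps a page number to the set of heuristic names matched on
    that page. The template is the set of heuristics shared by every
    top-scoring candidate (an assumed reading of "common denominator").
    """
    if not candidates:
        return set()
    best = max(len(found) for found in candidates.values())
    top = [found for found in candidates.values() if len(found) == best]
    # Intersect the heuristic sets of all top-scoring pages.
    return set(top[0]).intersection(*top[1:])


# Illustrative data only: two pages tie with four matches, one page has one.
candidates = {
    12: {"larger_font", "chapter_word", "recto", "no_running_header"},
    40: {"larger_font", "chapter_word", "recto", "vignette"},
    57: {"larger_font"},
}
template = build_template(candidates)
```

Here pages 12 and 40 tie, so the template keeps only the three heuristics they share; page 57 does not influence the template because it scores lower.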
  • the system 100 may optimize the solution by comparing permutations and combinations of the heuristics in the part-title page to heuristics in a page of the candidate part-title pages.
  • the system 100 may compare the position of a part-title header in the template part-title page to determine whether pages in the candidate part-title pages have a similarly positioned part-title header.
  • the system 100 may identify a heuristic in the candidate part-title pages appearing in the part-title heuristics with the highest probability of identifying actual part-title pages. For example, a heuristic may appear in candidate pages. The system may select a part-title heuristic with the highest count in the candidate part-title pages. The system 100 may identify this heuristic besides the template part-title page to determine actual part-title pages.
  • the system 100 may select a subset of the candidate part-title pages to be actual part-title pages based on the comparison. For example, the system may determine pages with the highest probability in the candidate part-title pages are actual part-title pages. Also, the system may determine other pages with lower probability based on the comparison with the template part-title page. In examples, the system 100 may generate the digital table of contents based on the actual part-title pages. For example, the system 100 may determine a set of chapters based on the part-title pages that separate chapters. Once the system identifies part-title pages separating chapters, the system may determine that the pages between two part-title pages are a chapter. The system 100 may generate a digital table of contents based on the identified set of chapters.
  • the system 100 may allow the generation of custom books.
  • the system 100 may receive a specification for a custom book.
  • specifications may include specifications to include chapters, exclude chapters, delete part of a chapter, and add material to a chapter, location of the material to be deleted, included and/or modified and the like.
  • the specification received includes a request to delete a chapter
  • the system 100 may receive the specification as selection of a chapter from the set of chapters for deletion.
  • the system 100 may delete the chapter.
  • the system 100 may generate a digital table of contents for the modified book.
  • the specification may contain insertion of content to the book.
  • a university course may include additional material for a course in the specification with a location for the additional material.
  • the system 100 may insert the additional material at a location provided in the specification.
  • the specification may provide a chapter where the new content such as a set of pages is to be inserted in the book.
  • the system 100 may rearrange the book.
  • the system 100 may determine a digital table of contents for the custom book.
  • the system 100 may receive a selection from a set of chapters for deletion.
  • the system 100 may delete the received selection from the book.
  • the system 100 may generate the digital table of contents for the book with the deleted selection.
  • the system 100 may generate a proof copy of the book.
  • the proof copy may be printed.
  • the system 100 may publish a limited run of the custom book once the proof is approved.
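The deletion flow described above (receive a chapter selection, delete it, regenerate the table of contents) can be sketched as follows. This is an illustration only: the chapter records, the `delete_chapter` and `table_of_contents` helpers, and the assumption that chapters are repaginated consecutively from page 1 are all hypothetical.

```python
def delete_chapter(chapters, title):
    """Return a new chapter list without the chapter selected for deletion."""
    return [chapter for chapter in chapters if chapter["title"] != title]


def table_of_contents(chapters, first_page=1):
    """Assign each remaining chapter its starting page in the custom book."""
    toc = []
    page = first_page
    for chapter in chapters:
        toc.append((chapter["title"], page))
        page += chapter["page_count"]
    return toc


# Illustrative book: three chapters with made-up page counts.
book = [
    {"title": "Chapter 1", "page_count": 20},
    {"title": "Chapter 2", "page_count": 15},
    {"title": "Chapter 3", "page_count": 30},
]
custom_book = delete_chapter(book, "Chapter 2")
custom_toc = table_of_contents(custom_book)
```

After "Chapter 2" is removed, "Chapter 3" starts on page 21 in this sketch, which is why the table of contents is regenerated rather than reused.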
  • the machine readable instructions 106 may include instructions 108 to parse pages of a digital book to identify heuristics present on the pages.
  • the system 100 may use image processing to parse pages of a digital book to identify heuristics present on the pages.
  • image processing techniques include appearance based methods, edge matching methods, divide and conquer search, greyscale matching, histogram of field responses, model set searches, feature based methods, hypothesis and test method, pose consistency, pose clustering, invariance, geometric hashing, scale-invariant feature transform, speeded up robust features, genetic algorithms, shading, template matching, texture, biologically inspired
  • system 100 may use text recognition, metadata from pdf content, an image processing method or a combination to determine the heuristics present on the page.
  • the instructions 106 may include machine readable instructions 110 to determine probabilities that pages in a book are part of candidate part-title pages.
  • the system 100 may determine whether pages in the book are part of the candidate part-title pages.
  • the system 100 may use a heuristic such as an attribute of a page that indicates the page is likely to be a part-title page.
  • the header on a part-title page may be located near the fore-edge of a page.
  • the word "chapter" or the title of the chapter may be near the fore-edge of the page.
  • the system 100 may determine part-title pages with headers at a similar position near the fore-edge of the page.
  • the system 100 may similarly apply other heuristics to determine candidate part-title pages.
  • the system 100 may use a set of heuristics that may be helpful in determining part-title pages to determine chapters in the book. For example, in a book the publisher may begin a part-title page on one side of the book, i.e., the left-hand side or the right-hand side. In another example, a publisher may include a blank page before the next chapter. The system 100 may use this as a heuristic to determine part-title pages that follow blank pages. In another example, a publisher may use a font size for the beginning of part-title pages that is larger than the rest of the text in the part-title page or larger than the rest of the text in other pages.
  • a publisher may use a vignette in a part-title page.
  • a vignette may be an image used in a part-title page instead of or besides characters.
  • the publisher may use an image without a header at the top of a page.
  • Other pages with illustrations may include a header.
  • a publisher may use an entire page for a part-title page.
  • the part-title page may contain far fewer words than other pages.
  • the part-title page may include the title for the chapter that follows a listing of sub-headings in the chapter and the like.
  • the part-title page may include margins before and after a chapter header.
  • the part-title page may include the word "chapter" followed by a numeral.
  • the numeral may be a Roman numeral or an Arabic numeral.
  • the publisher may not use a running header in a part-title page. Similarly, a publisher may not use a running footer.
  • system 100 may use the heuristics listed above as a set of heuristics to identify the part-title pages.
  • the instructions 106 may include machine readable instructions 112 to determine a template part-title page comprising the heuristics from the candidate part-title pages with a higher probability than other candidate part-title pages.
  • the template part-title page may include the heuristics from one candidate part-title page.
  • the template part-title page may include heuristics from two or more candidate part-title pages.
  • the system 100 may determine the template part-title page based on the candidate part-title pages identified in the previous instruction at 108. The system 100 may select the page with the highest probability as the template part-title page. The highest probability may refer to the number of matches of the template page to heuristics in the set of heuristics. For example, a page in the candidate part-title pages may match seven of the ten heuristics, more than other pages in the candidate part-title pages.
  • the machine readable instructions 106 may include instructions 112 to compare the template part-title page to the candidate part-title pages. After selecting the template part-title page, the system 100 may compare the template part-title page to pages in the candidate part-title pages. In examples, the system 100 may determine whether a heuristic on the template part-title page matches a heuristic on a candidate part-title page. For example, in the template part-title page one of the heuristics may be that the part-title page is always on the right-hand side of a book. The system 100 may determine candidate part-title pages that are on the right-hand side.
  • the instructions 114 may select a subset of the candidate part-title pages to be the actual part-title pages.
  • the system 100 may select a subset of the candidate part-title pages to be the actual part-title pages based on the comparison results at the previous step. For example, the system 100 may select part-title pages on the right-hand side of the book as actual part-title pages.
  • the system 100 may not compare permutations and combinations of the pages in the book to the template part-title page to optimize the result.
  • the system 100 may not compare permutations and combinations of the heuristics in the template part-title page to heuristics on the candidate part-title page.
  • the system 100 may determine the approximate solution, which may also be the optimal solution.
  • the machine readable instructions 106 may include instructions 116 to generate a digital table of contents.
  • the system 100 may use the part-title pages to determine the beginning and end of a chapter. The pages between a part-title page and the next part-title page may be a chapter. The system 100 may then generate a digital table of contents based on the content on the part-title page. For example, the system 100 may identify the title of a chapter and include the title and the page number of the chapter in the digital table of contents. In another example, the system 100 may determine the sub-headings in the chapter and include the sub-headings in the digital table of contents with the appropriate page numbers.
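The rule that a chapter runs from one part-title page up to the page before the next one can be sketched in a few lines. The example below is a sketch under assumptions: the helper names, the chapter titles, and the page numbers are invented for illustration, and sub-headings are left out.

```python
def chapters_from_part_titles(part_title_pages, last_page):
    """Turn part-title page numbers into (first_page, last_page) chapter spans."""
    starts = sorted(part_title_pages)
    spans = []
    for index, start in enumerate(starts):
        # A chapter ends on the page before the next part-title page,
        # or on the final page of the book for the last chapter.
        end = starts[index + 1] - 1 if index + 1 < len(starts) else last_page
        spans.append((start, end))
    return spans


def build_toc(titles_by_page, last_page):
    """Pair each chapter title with the page span it covers."""
    spans = chapters_from_part_titles(titles_by_page.keys(), last_page)
    return [(titles_by_page[start], start, end) for start, end in spans]


# Illustrative input: titles read from three detected part-title pages.
titles_by_page = {5: "Chapter I", 31: "Chapter II", 58: "Chapter III"}
toc = build_toc(titles_by_page, last_page=90)
```

With these inputs the three chapters cover pages 5 to 30, 31 to 57, and 58 to 90, and each table-of-contents entry carries the title found on its part-title page.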
  • the system 100 may include heuristics such as the preferences of a publisher.
  • the system 100 may determine the heuristics specific to books from a publisher. For example, an encyclopedia from Britannica may use a part-title page with changes to indicate the beginning of a new chapter.
  • the system 100 may also receive information about the preferences of a particular publisher such as a vignette, a particular font in part-title pages, a particular font size, absence of headers or footers in part-title pages.
  • the system 100 may receive information about heuristics from a related book, or a set of related books.
  • the figure illustrates heuristics identified by the system 100 on a page in a book.
  • the system 100 may identify, as a heuristic, the presence of bigger fonts in the part-title header compared to fonts in the rest of the paragraph 204.
  • the system 100 may identify the use of the word "chapter" 202 in a part-title page. Also, a number may follow the word "chapter".
  • the system 100 may identify the use of a margin before and/or after a part-title header in the part-title page 206.
  • the system 100 may identify the absence of a running footer 208 on the page.
  • the system 100 may also identify the side on which the part-title page appears 210 as a heuristic present in the page.
  • the system 100 may identify candidate part-title pages. Also, the system 100 may identify counts of the heuristic matches for pages in the candidate part-title pages.
  • Figures 3A, 3B and 3C depict examples of the system 100 identifying actual part-title pages.
  • the book may be a pdf file.
  • the system 100 may receive a book without demarcation between the chapters.
  • the system 100 may determine the heuristics present on the pages of the book received in figure 3A to determine candidate part-title pages as shown in figure 3B.
  • the system 100 may determine the number of heuristic matches on pages of the received book.
  • the system 100 may then generate a count of the heuristic matches in the candidate part-title pages to determine probability that the page is a part-title page.
  • the figure 3B shows two groups, group 308 and group 310.
  • the group 308 has a probability of 5/8 and the group 310 has a probability of 3/8.
  • the system 100 may identify pages with at least one heuristic match to determine the candidate part-title pages.
  • the system 100 may associate a probability with pages in the candidate part-title pages. For example, as shown in figure 3B, the part-title pages in group 308 may match five of the eight heuristics tested.
  • the system 100 may determine the template part-title page as a page with the heuristics in the group 308.
  • the template part-title page may be a page that includes five heuristics of the group 308.
  • the system 100 may then compare the template part-title page with other pages in the candidate part-title pages. For example, the system 100 may compare each page in the group 310 to the template part-title page.
  • the system 100 may remove pages that do not match the template part-title page.
  • the page 310a does not include a header with a spacing before or after the header, includes a running header missing in the template page and the like.
  • the system 100 may determine page 310a is not a part-title page.
  • the system 100 may similarly perform a heuristic match against the template part-title page to select the actual part-title pages shown in figure 3C.
  • the system 100 may not compare the template part-title page to the pages in the group 308 to minimize system resource utilization.
  • the system 100 may select pages in the group 308, without dropping a page from the group.
  • the system 100 may determine that the group 308 pages have a higher probability of being actual part-title pages compared with other candidate part-title pages.
  • the system 100 may determine pages in the group 308 are actual part-title pages.
  • the system 100 may include them in the actual part-title pages in figure 3C.
  • the system 100 may obtain an approximate solution.
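The selection in figures 3B and 3C can be sketched as a single pass over the candidates: the top-scoring group is accepted without comparison, and lower-scoring pages are kept only if they agree with the template. This is an illustration under assumptions, not the method of the publication: the page numbers, heuristic names, and the rule that a lower-scoring page must show every heuristic in a `required` set taken from the template are hypothetical.

```python
def select_actual_part_titles(candidates, required):
    """Select actual part-title pages from scored candidates.

    candidates maps a page number to the set of heuristics matched on it.
    required is a set of template heuristics that a lower-scoring page
    must also show to be kept (an assumed comparison rule).
    """
    best = max(len(found) for found in candidates.values())
    actual = []
    for page, found in sorted(candidates.items()):
        if len(found) == best:
            # Top group: accepted without comparing against the template.
            actual.append(page)
        elif required <= found:
            # Lower group: kept only when it agrees with the template.
            actual.append(page)
    return actual


# Illustrative data: three pages match 5 of 8 heuristics, two pages match 3 of 8.
group_308 = {"larger_font", "chapter_word", "header_margin", "no_running_header", "recto"}
candidates = {
    3: set(group_308),
    25: set(group_308),
    47: set(group_308),
    14: {"larger_font", "header_margin", "no_running_header"},
    33: {"larger_font", "chapter_word", "recto"},  # lacks header margin, has running header
}
actual_pages = select_actual_part_titles(
    candidates, required={"header_margin", "no_running_header"}
)
```

In this sketch page 33 plays the role of page 310a: it has no margin around the header and carries a running header, so it is dropped, while page 14 is retained alongside the three top-scoring pages.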
  • the system 100 may then generate a set of chapters based on the determined part-title pages.
  • the system 100 may allow removal of a chapter.
  • the system 100 may receive a specification to generate a custom book.
  • the specification may add, delete, remove or modify the pages, chapters, contents and the like of the book.
  • the system 100 may, for example, delete chapters not taught in a university course.
  • the system 100 may then determine a digital table of contents for the custom book as discussed above in figure 1.
  • Figure 4 illustrates a method 400 according to examples for determining a set of chapters for the book based on the actual part-title pages.
  • the method 400 may be performed by the system 100 in figure 1 or other systems.
  • the system 100 may determine the probabilities that pages in the book are part of candidate part-title pages. In examples, the system 100 may determine candidate part-title pages based on heuristics. In examples, the system 100 may receive a set of heuristics to use. In another example, the system 100 may use a pre-selected set of heuristics. In examples, the heuristic search may be a quick search to reject pages that are unlikely to be part-title pages. For example, the system 100 may ignore pages with no part-title header or vignette on the page, or pages that include small fonts in comparison to other fonts. In another example, the system 100 may apply the set of heuristics. In the case of borderline matches, where it is uncertain whether the page has the heuristics, the system may include the pages in the candidate part-title pages to avoid system resource utilization. The system 100 may delay a decision on a borderline page until more information is available.
  • the system 100 may identify the template part-title page with the highest probability. In examples, the system 100 may determine the template part-title page based on the highest-probability candidate part-title pages. The system 100 may identify the part-title page with the highest number of heuristic matches in the candidate part-title pages.
  • At 408, the system 100 may compare heuristics present on template part-title pages to heuristics present on candidate part-title pages. In examples, the system 100 may compare one heuristic in the template part-title page against heuristics in the candidate part-title pages. For example, the system 100 may compare whether the part-title page is located on the verso side of a page.
  • a publishing house may use the verso side of a page for the part-title page.
  • the system 100 may find an approximate solution using this heuristic.
  • the system 100 may not waste resources on finding an optimal solution.
  • the system 100 to conserve resources may not compare the template part-title page with pages of the book not on the candidate part-title pages.
  • the system 100 may select a subset of the candidate part-title pages to be actual part-title pages of the book based on the comparison.
  • the system 100 may use heuristics to determine the actual part-title pages.
  • the system 100 may determine an approximate solution based on the comparison in the previous step. For example, a publisher may reuse a part-title page with minor changes in a book. For example, the publisher may always start a part-title page on the verso side of a page. The system 100 may discard candidate part-title pages not on the verso side of the page.
  • Figure 5 illustrates a method 500 according to examples for generating a custom book.
  • the method 500 may be performed by the system 100 in figure 1 or other systems.
  • the system 100 may determine probabilities that pages in a book are part of candidate title pages. For example, the system 100 may determine whether pages in books are part-title pages based on heuristic matches. For example, the system 100 may determine whether the page has a part-title header. The system 100 may use the location of the part-title header.
  • part-title may be offset towards the fore-edge of the book away from the spine of the book.
  • the system 100 may determine whether a page has larger fonts than other pages, the presence or absence of headers and footers, space between the part-title header and the rest of the text and the like. The system 100 may then identify candidate part-title pages that match the selected heuristics. In examples, the system may determine pages that display n heuristics out of m heuristics tested for as part of the candidate part-title pages. In examples, the system 100 may identify the candidate part-title pages that are approximate matches for later filtering. For example, the system 100 may include candidate part-title pages that have a lower probability of being part-title pages in this initial step to weed out later. The system 100 may use approximations to avoid using further resources on borderline cases by deferring the solution to a later point in time.
  • the system 100 may compare the template part-title page with the candidate part-title pages. For example, the system 100 may compare the heuristics on the template part-title page to heuristics in the candidate part-title pages. In another example, the system 100 may determine part-title pages with fewer heuristic matches. In examples, the system 100 may use the set of identified part-title pages. The system 100 may not compare permutations and combinations of heuristics on pages in the book to determine an approximate solution. In examples, the system 100 may use fewer heuristics than the initial set of heuristics to select candidate part-title pages that are actual part-title pages. For example, the system 100 may determine whether the template part-title page was on the right-hand side.
  • the system 100 may select a set of the candidate part-title pages to be actual part-title pages.
  • the system 100 may select the subset based on the heuristic comparison. For example, the probability that publishers locate part-title pages on the same side of the book may be higher than part-title pages on both sides in the same book.
  • the system 100 may determine the actual part-title pages from the candidate part-title pages based on this probability.
  • the system 100 may solve the problem without applying permutations and combinations, i.e., without trying to optimize the solution to further improve the detection. Although the system 100 may use approximations to solve the problem, the solution may be the best solution, i.e., the system 100 may not verify whether the solution is the best solution.
  • the system 100 may winnow the candidate part-title pages to determine the actual part-title pages.
  • the system 100 may receive specifications for a custom book.
  • the specifications may include specifications to include additional material provided.
  • the specifications may describe chapters to be deleted.
  • a university course may use a custom set of chapter problems. The university course may skip chapters, reorder chapters for the course and the like.
  • the system 100 may generate a proof copy of the custom book based on the specifications.
  • the system 100 may change the book based on the specifications received in the previous step.
  • the system 100 may then generate a proof copy of the custom book for review.
  • the system 100 may send a proof copy to the university or professor for review. After approval of the proof copy, the system 100 may publish low volumes of copies on demand for the course.
  • the system 100 may also allow students registered at the university to access additional custom material for the custom book.

Abstract

In examples, a system may parse pages of a digital book and determine candidate part-title pages separating chapters of the book based on part-title heuristics. The system may identify a template part-title and compare heuristics of the candidate part-title pages and the template part-title to select a subset of the candidate part-title pages to be actual part-title pages of the book. The system can generate a digital table of contents for the book from the actual part-title pages.

Description

DIGITAL PART-PAGE DETECTORS
BACKGROUND
[0001] Book publishing traditionally involves many digital document-processing steps before publication of books. The authors write books under contract, and receive royalties for the books sold. The traditional publisher creates books that are mass published without change due to the costs associated with the many steps before publication. The books may be published as digital books or physical books.
BRIEF DESCRIPTION OF DRAWINGS
[0002] Examples are described in detail in the following description with reference to the following figures. In the accompanying figures, like reference numerals indicate similar elements.
[0003] Figure 1 illustrates examples of a system for generating a custom digital book for publication;
[0004] Figure 2 illustrates examples of heuristics identified by the system 100 on a page in a book for publication;
[0005] Figures 3A, 3B and 3C illustrate examples of the system 100 identifying actual part-title pages in a digital book;
[0006] Figure 4 illustrates a method according to examples for determining a set of chapters for a digital book for publication; and
[0007] Figure 5 illustrates a method according to examples for generating a custom book.
DETAILED DESCRIPTION OF EMBODIMENTS
[0008] Traditional publishers typically create books that are mass published without change. For example, a predetermined quantity of the books may be printed or the books may be offered as electronic books. However, once the books are published, either physically or electronically, the books are not modified. There may be scenarios where custom books are in demand. For example, a university may teach a course that is not based on a single book, and may utilize content from multiple different text books. In these instances, the students may purchase multiple books for the course because a single book does not provide the content for the course. Traditional publishing steps are not suitable for creating custom digital books and are often not able to create custom digital books on demand.
[0009] A technical problem associated with creating a custom digital book based on an existing digital book is the electronic identification of chapters in the existing digital book. To create the custom book, chapters of the existing book may be electronically added, removed or modified. After modifications, it is technically difficult to electronically identify where chapters begin and end for creating the custom book and for creating a table of contents for the custom book. According to examples of the present disclosure, systems and methods are provided that may electronically identify chapters in a digital book and create a custom digital book based on the identified chapters.
[0010] For simplicity and illustrative purposes, the principles of the embodiments are described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide an understanding of the embodiments. It will be apparent, however, to one of ordinary skill in the art, that the embodiments may be practiced without limitation to these specific details. In some instances, well-known methods and/or structures have not been described in detail so as not to unnecessarily obscure the description of the embodiments and examples described herein.
[0011] According to examples of the present disclosure, a system may determine part-title pages in a digital book, identify chapters in the book based on the part-title pages, receive specifications to modify the book and generate a digital table of contents for the modified book. A digital book is a book made available in digital form, which may include text, images, or both, and which may be readable on a display of a device, such as a computer display, laptop, smartphone, etc. The digital book may be provided in a pdf format or may be provided in another digital format. A part-title page may separate chapters in books. For example, the part-title page may include details about the title of a chapter and other formatting that differentiates the part-title page from other pages in the book. The part-title page may include the initial paragraph of the chapter, the title in larger font, and other heuristics that differentiate it from other pages in the book.
[0012] The system according to examples of the present disclosure may parse pages of a digital book to identify heuristics present on the pages. The system may heuristically identify candidate part-title pages separating chapters in a book. A heuristic may be a property of a page in a book that makes it more likely that the page is or is not a part-title page. For example, a page with the same font size as the rest of the pages in a book may indicate the page is less likely a part-title page. However, a page that starts with text in a font larger than the font size for most of the text in the book may be a heuristic that indicates the page is more likely a part-title page. A probability indicative of whether a page may be a part-title page may be determined for each page based on the heuristics present on the page, and candidate part-title pages are selected based on the probabilities.
[0013] From the candidate part-title pages, the system may determine a template part-title page with the highest probability. A candidate part-title page may have the highest probability when the number of heuristics matched in the part-title page is higher when compared to other candidate part-title pages. In examples, the system may have more than one part-title page with heuristics present on the page. In another example, the system may determine the template part-title page using the minimum common subset of heuristics in the pages with the highest probability.
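The scoring described in paragraph [0012] can be illustrated with a minimal Python sketch. It assumes the probability of a page is simply the fraction of part-title heuristics found on that page, which is consistent with the 5/8 and 3/8 values given for Figure 3B but is not stated as the only possible formula. The heuristic names, the `threshold` parameter and both function names are hypothetical and chosen only for illustration.

```python
# Hypothetical names for the part-title heuristics; the disclosure describes
# these properties but does not assign identifiers to them.
PART_TITLE_HEURISTICS = frozenset({
    "larger_font", "chapter_word", "vignette", "wide_margin",
    "no_running_header", "no_running_footer",
    "header_near_fore_edge", "blank_verso",
})


def part_title_probability(page_heuristics, heuristic_set=PART_TITLE_HEURISTICS):
    """Return the fraction of the part-title heuristics present on a page."""
    if not heuristic_set:
        return 0.0
    return len(set(page_heuristics) & heuristic_set) / len(heuristic_set)


def candidate_pages(pages, threshold=0.25):
    """Map page number -> probability for pages at or above the threshold.

    `pages` maps a page number to the set of heuristics identified on it.
    The threshold value is an assumption, not taken from the disclosure.
    """
    scored = {n: part_title_probability(h) for n, h in pages.items()}
    return {n: p for n, p in scored.items() if p >= threshold}
```

A low threshold keeps borderline pages in the candidate set so that they can be filtered later against the template part-title page.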
[0014] The system may compare the template part-title page with the pages in the candidate part-title pages to identify candidate part-title pages that are actual part-title pages. For example, to compare the pages, the system may compare a heuristic in the template part-title page with a heuristic feature in the candidate part-title page. For example, in a published book part-title pages may appear on the same side of the book, i.e., chapters may begin on the same side of the book. The system may use this heuristic from the template part-title page to select a subset of actual part-title pages from the candidate part-title pages. The system may solve the technical problem of identifying part-title pages, without an exhaustive search of permutations and combinations, to arrive at an approximate solution. An approximate solution may be a solution not verified to be the best solution to the problem of identifying part-title pages. The system may generate the digital table of contents of the book based on the actual part-title pages.
[0015] The system may use a set of part-title heuristics to determine part-titles of the book, such as chapter titles. For example, the system may use the position where a part-title header appears. A part-title header may refer to a chapter title. In some books, the part-title headers are near the fore-edge of the page. The fore-edge of a book is the right-hand edge of a book when opened, opposite the spine. In some books, the part-title page in a book may have a printed recto side and an unprinted verso side, i.e., absence of content in a part-title page. A recto side may mean the right-hand page of an opening in a book. In some books, the part-title page may include characters with larger font sizes, i.e., presence of larger font sizes compared to other characters in the rest of the book. For example, books may include a large letter for the first word of the chapter with normal cases for the rest of the chapter. In some books, content may be absent in a part-title page verso. In some books, there may be a vignette in the part-title page, i.e., presence of a vignette in the part-title page. A vignette may be an image used to begin a chapter. For example, some books may use an image in a part-title page, or an image and text. In some books, there may be a larger margin before and after a part-title header compared to the margin in the rest of the book. For example, some books may include two or more line spaces before the part-title header compared to the line space between lines in the chapter. In some books, the part-title header may include the word “Chapter”. A numerical character may follow in Arabic numerals or in Roman numerals. For example, textbooks may use Chapter I, Chapter II to identify the next chapter. In some books, the running headers and/or running footer may be absent in part-title pages. A running header may be text at the top of the page. A running header may include a title of a page, page number, and chapter number. A running footer may be text at the bottom of the page and may include the same information as the running header. The information and the use of the running header and the running footer may depend on the publisher. Some publishers may use one or the other or both. In examples, the system may use the heuristics listed above as a set of part-title heuristics to identify candidate part-title pages.
[0016] The system according to examples of the present disclosure may allow generation of a custom digital book. The system may receive a specification for modification of a book. For example, the specification may be a selection of part-title pages for deletion, a set of pages for inclusion near a part-title page and/or a request to rearrange chapters to generate a custom book. The system 100 may use the specification to generate the custom book. In other examples, the specification may also include references to indicate where the pages may be inserted such as near or next to a part-title page.
[0017] A technical problem associated with part-title page identification is that different publishers or even different books by the same publisher may format their books differently. Accordingly, a fixed template may not be viable to accurately identify chapters of a digital book. For example, a part-title page may be on the left-hand side or on the right-hand side. The system according to examples of the current disclosure may determine part-title pages that demarcate a division between chapters using heuristics. The system may use a set of part-title heuristics to identify candidate part-title pages and determine an approximate solution to determine an actual set of part-title pages. Other traditional approaches fail because they are not flexible when the part-title page has inconsistent characteristics between books.
[0018] Another technical problem associated with part-title page identification in books is that optimal chapter identification is slow and may use expensive computational resources. For example, it may not be helpful for the system to identify the best solution to the problem of determining part-title pages. For example, for creating custom books it may be efficient to allow part-title chapters to be identified as an approximate solution. However, a custom book may be proof-read before publication. A less optimal solution may meet the use case compared to a more optimal solution that uses more system resources and/or is expensive. The system described according to examples of the current disclosure may identify part-title pages separating chapters, based on heuristics to determine an approximate solution. The approximate solution may be the best solution, but the system may not verify the solution is the best solution to conserve system resources such as Central Processing Unit (CPU), Random Access Memory (RAM) and power utilization.
[0019] The system also allows for custom creation of digital books. However, portions of the book may be deleted, modified or additional content added to create the custom book. The system can create a digital table of contents for the custom book based on identification of part-title pages in the book.
[0020] Figure 1 shows examples of a system 100 for identifying part-title pages in a document, extracting chapters, generating a custom book based on a specification and generating a digital table of contents for the custom book based on heuristics. The system 100 may include a processor 102 and a memory 104. The system 100 may include machine readable instructions 106 stored in the memory 104 or another type of non-transitory computer readable medium executable by the processor 102. In examples, the system 100 may be a workstation, a desktop computer, a laptop computer, a handheld device or any other device. In examples, the system 100 may include instructions in a non-transitory computer readable medium executable by a processor for display on a browser. For example, a server may allow interaction with the system 100 through a remote interface such as the browser. The system 100 may be connected to the internet through a network. The network may be a wide area network, a local area network, a cellular network, a satellite network and the like.
[0021] In examples, the system 100 may determine an approximate solution to determine actual part-title pages in a book based on heuristics. The approximate solution may be the best solution, but the system may not expend computer resources to determine the best solution or to verify the determined solution is the optimal solution. In examples, the system 100 may use a heuristic approach to identify part-title pages that does not consider possible combinations and permutations. The system 100 may be used in the publishing industry to allow generation of custom books. For example, a university course may use a custom book. The system 100 may allow a publisher to allow changes to a textbook. The system 100 may allow addition of examples, a set of custom problems after chapters in the book, addition or removal of content in the book, reordering of chapters, automatic generation of digital table of contents and the like.
[0022] In examples, the system 100 may parse pages in a digital book to identify heuristics present on pages of a book. The system 100 may determine probabilities that pages in the digital book are candidate part-title pages based on a comparison between the identified heuristics and a set of part-title heuristics. From the candidate part-title pages, the system 100 may then determine a template part-title page. The system 100 may determine the template part-title page based on the determined probabilities. For example, the system 100 may determine a page that includes more heuristics from the set of part-title heuristics compared to other candidate part-title pages as the template part-title page. In examples, the system may determine a part-title page comprising part-title heuristics selected from the candidate part-title pages based on a count of the heuristics in candidate part-title pages with a higher probability compared to other part-title pages. For example, assume a group of candidate part-title pages includes six out of ten heuristics in a set of part-title heuristics. The system 100 may determine the template part-title page comprising heuristics from the group of candidate part-title pages based on the heuristics with a higher count of heuristic matches in the candidate part-title pages with a higher determined probability than other candidate part-title pages with a lower determined probability. In examples, when more than one part-title page has a similar number of heuristic matches, the system may determine a common denominator of the part-title pages to generate the template part-title page.
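As an illustration of the template determination in paragraph [0022], the following Python sketch takes the candidate pages with the highest number of heuristic matches and, when several pages tie, keeps only the heuristics they share (one reading of the "common denominator"). Treating "a similar number of heuristic matches" as an exactly equal count is an assumption, as are the function name and the data layout.

```python
def template_heuristics(candidates):
    """Return the heuristics that make up the template part-title page.

    `candidates` maps a page number to the set of heuristics matched on it.
    Pages with the highest match count are treated as the top group
    (assumption: "similar" is taken to mean an equal count), and the
    template is the intersection of their heuristic sets.
    """
    if not candidates:
        return set()
    best = max(len(h) for h in candidates.values())
    top = [set(h) for h in candidates.values() if len(h) == best]
    return set.intersection(*top)
```

With a single top-scoring page the template is simply that page's heuristic set; with several, the intersection discards heuristics that only some of them exhibit.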
[0023] The system 100 may compare the template part-title page with the pages in the candidate part-title pages. Comparing the template page may refer to selecting a heuristic on the template part-title page and comparing the heuristic to heuristics present on the candidate part-title pages. For example, the system 100 may determine that part-title pages in the book are on the right-hand side of the book. The system 100 may select a subset of the candidate part-title pages located on the right-hand side of the book. The system may ignore other permutations and combinations rather than verify whether the solution determined is optimal. For example, the system 100 may not optimize the solution by comparing permutations and combinations of the heuristics in the template part-title page to heuristics in a page of the candidate part-title pages. In examples, the system 100 may compare the position of a part-title header in the template part-title page to determine whether pages in the candidate part-title pages have a similarly positioned part-title header. In examples, the system 100 may identify a heuristic in the candidate part-title pages appearing in the part-title heuristics with the highest probability of identifying actual part-title pages. For example, a heuristic may appear in candidate pages. The system may select a part-title heuristic with the highest count in the candidate part-title pages. The system 100 may use this heuristic in addition to the template part-title page to determine actual part-title pages.
[0024] The system 100 may select a subset of the candidate part-title pages to be actual part-title pages based on the comparison. For example, the system may determine pages with the highest probability in the candidate part-title pages are actual part-title pages. Also, the system may determine other pages with lower probability based on the comparison with the template part-title page. In examples, the system 100 may generate the digital table of contents based on the actual part-title pages. For example, the system 100 may determine a set of chapters based on the part-title pages that separate chapters. Once the system identifies part-title pages separating chapters, the system may determine that the pages between two part-title pages form a chapter. The system 100 may generate a digital table of contents based on the identified set of chapters.
[0025] In examples, the system 100 may allow the generation of custom books. To generate a custom book, the system 100 may receive a specification for the custom book. Examples of specifications may include specifications to include chapters, exclude chapters, delete part of a chapter, add material to a chapter, indicate the location of the material to be deleted, included and/or modified and the like. When the specification received includes a request to delete a chapter, the system 100 may receive the specification as a selection of a chapter from the set of chapters for deletion. The system 100 may delete the chapter. After deletion, the system 100 may generate a digital table of contents for the modified book. In another example, the specification may request insertion of content into the book. For example, a university course may include additional material for a course in the specification with a location for the additional material. The system 100 may insert the additional material at a location provided in the specification. In another example, the specification may provide a chapter where the new content such as a set of pages is to be inserted in the book. In another example, the specification may request rearranging the book. The system 100 may rearrange the book accordingly. After rearranging the book, the system 100 may determine a digital table of contents for the custom book. In examples, the system 100 may receive a selection from a set of chapters for deletion. The system 100 may delete the received selection from the book. In addition, the system 100 may generate the digital table of contents for the book with the deleted selection. In examples, the system 100 may generate a proof copy of the book. The proof copy may be printed. The system 100 may publish a limited run of the custom book once the proof is approved.
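The single-heuristic comparison in paragraph [0023] might look like the Python sketch below, which filters candidates by the side of the book on which the template part-title page sits. It assumes that odd page numbers fall on the right-hand (recto) side and even page numbers on the left-hand (verso) side; a real digital book may need a different mapping. The function names are hypothetical.

```python
def page_side(page_number):
    """Return "recto" or "verso" for a page number.

    Assumption: odd page numbers are right-hand (recto) pages and even
    page numbers are left-hand (verso) pages.
    """
    return "recto" if page_number % 2 == 1 else "verso"


def select_actual_part_title_pages(candidate_pages, template_page):
    """Keep the candidates that lie on the same side as the template page.

    Only one heuristic is compared; no permutations or combinations of
    heuristics are explored, so the result is an approximate solution.
    """
    side = page_side(template_page)
    return sorted(p for p in candidate_pages if page_side(p) == side)
```

For instance, with candidates on pages 3, 8, 17, 24 and 31 and a template on page 17, only pages 3, 17 and 31 would be kept.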
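The chapter determination in paragraph [0024] can be sketched in Python as follows. The sketch assumes that a chapter starts at its part-title page and ends on the page before the next part-title page, and that the last chapter runs to the final page of the book; the disclosure only states that the pages between two part-title pages are a chapter. The function name is hypothetical.

```python
def chapter_ranges(part_title_pages, last_page):
    """Return (first_page, last_page) tuples, one per chapter.

    Each chapter is assumed to start at a part-title page and end on the
    page before the next part-title page; the final chapter is assumed to
    run to `last_page`.
    """
    starts = sorted(part_title_pages)
    ranges = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else last_page
        ranges.append((start, end))
    return ranges
```

The start pages of these ranges are what a digital table of contents would list alongside the chapter titles.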
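One possible way to apply a specification as described in paragraph [0025] is sketched below in Python. The `Chapter` class, the `delete`, `order` and `append` parameters, and the page numbering from 1 are all assumptions made for illustration; the disclosure does not define a specification format.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    pages: list = field(default_factory=list)  # page contents, in order


def apply_specification(chapters, delete=(), order=None, append=None):
    """Return a new chapter list with the specification applied.

    delete: titles of chapters to remove.
    order:  titles in the requested new order; chapters not named in
            `order` are dropped (an assumption of this sketch).
    append: maps a chapter title to extra pages added at its end.
    """
    append = append or {}
    kept = [Chapter(c.title, list(c.pages) + list(append.get(c.title, [])))
            for c in chapters if c.title not in delete]
    if order is not None:
        by_title = {c.title: c for c in kept}
        kept = [by_title[t] for t in order if t in by_title]
    return kept


def table_of_contents(chapters, first_page=1):
    """Return (title, start_page) pairs for the (possibly modified) book."""
    toc, page = [], first_page
    for chapter in chapters:
        toc.append((chapter.title, page))
        page += len(chapter.pages)
    return toc
```

Regenerating the table of contents after every modification keeps the page numbers consistent with deleted, inserted or reordered chapters.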
[0026] The machine readable instructions 106 may include instructions 108 to parse pages of a digital book to identify heuristics present on the pages. In examples, the system 100 may use image processing to parse pages of a digital book to identify heuristics present on the pages. Examples of image processing techniques include appearance based methods, edge matching methods, divide and conquer search, greyscale matching, histogram of field responses, model set searches, feature based methods, hypothesis and test method, pose consistency, pose clustering, invariance, geometric hashing, scale-invariant feature transform, speeded up robust features, genetic algorithms, shading, template matching, texture, biologically inspired recognition, context, explicit and implicit models and the like. In another example, the system 100 may use text recognition, metadata from pdf content, an image processing method or a combination to determine the heuristics present on the page.
[0027] The instructions 106 may include machine readable instructions 110 to determine probabilities that pages in a book are part of candidate part-title pages. The system 100 may determine whether pages in the book are part of the candidate part-title pages. In examples, the system 100 may use a heuristic such as an attribute of a page that indicates the page is likely to be a part-title page. For example, the header on a part-title page may be located near the fore-edge of a page. The word chapter or the title of the chapter may be near the fore-edge of the page. The system 100 may determine part-title pages with headers at a similar position near the fore-edge of the page. The system 100 may similarly apply other heuristics to determine candidate part-title pages.
[0028] In examples, the system 100 may use a set of heuristics that may be helpful in determining part-title pages to determine chapters in the book. For example, in a book the publisher may begin a part-title page on one side of the book, i.e., the left-hand side or the right-hand side. In another example, a publisher may include a blank page before the next chapter. The system 100 may use this as a heuristic to determine part-title pages that follow blank pages. In another example, a publisher may use a font size for the beginning of part-title pages that is larger than the rest of the text in the part-title page or larger than the rest of the text in other pages. In another example, a publisher may use a vignette in a part-title page. A vignette may be an image used in a part-title page instead of or besides characters. In another example, the publisher may use an image without a header at the top of a page. Other pages with illustrations may include a header. In another example, a publisher may use an entire page for a part-title page. The part-title page may contain a lot fewer words. For example, the part-title page may include the title for the chapter that follows a listing of sub-headings in the chapter and the like. In another example, the part-title page may include margins before and after a chapter header. In another example, the part-title page may include the word “chapter” in the part-title page followed by a numeral. The numeral may be a Roman numeral or an Arabic numeral. In another example, the publisher may not use a running header in a part-title page. Similarly, a publisher may not use a running footer. In examples, the system 100 may use the heuristics listed above as a set of heuristics to identify the part-title pages.
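One of the heuristics listed above, the word “chapter” followed by an Arabic or Roman numeral, lends itself to a short Python sketch using the standard `re` module. The pattern and the function name are illustrative assumptions; the check is deliberately loose and accepts any run of Roman-numeral letters without validating that it forms a well-formed numeral.

```python
import re

# The word "chapter" at the start of a line, followed by either Arabic
# digits or a run of Roman-numeral letters. Matching is case-insensitive.
CHAPTER_HEADER = re.compile(
    r"^\s*chapter\s+(?P<num>\d+|[IVXLCDM]+)\b",
    re.IGNORECASE,
)


def has_chapter_header(line):
    """Return True if the line looks like a "Chapter <numeral>" header."""
    return CHAPTER_HEADER.match(line) is not None
```

A match on the first text line of a page would count as one heuristic match for that page; it would not by itself decide that the page is a part-title page.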
[0029] The instructions 106 may include machine readable instructions 1 12 to determine a template part-title page comprising the heuristics from the candidate part-title pages with a higher probability than other candidate part-title pages. In examples, the template part-title page may include the heuristics from one-candidate part-title pages. In another example, the template part-title page may include heuristics from two or more candidate part-title pages. In examples, the system 100 may determine the template part-title page based on the candidate part-title pages identified in the previous instruction at 108. The system 100 may select the page with the highest probability as the template part-title page. The highest probability may refer to the number of matches of the template page to a heuristic in the set of heuristics. For example, a page in the candidate part-title pages may match seven of the ten heuristics compared to other pages in the candidate part-title pages.
[0030] In examples, the machine readable instructions 106 may include instructions 1 12 to compare template part-title page to the candidate part-title pages. After selecting the template part-title page, the system 100 may compare the template part-title page to pages in the candidate part-title pages. In examples, the system 100 may determine whether a heuristic on the template part-title page matches a heuristic on a candidate part-title page. For example, in the template part-title page one of the heuristics may be that the part-title page is always on the right-hand side of a book. The system 100 may determine candidate part-title pages that are on the right-hand side.
[0031] In examples, the instructions 1 14 may select a subset of the candidate part-title pages to be the actual part-title pages. The system 100 may select a subset of the candidate part-title pages to be the actual part-title pages based on the comparison results at the previous step. For example, the system 100 may select part-title pages on the right-hand side of the page as actual part- title pages. The system 100 may not compare permutation and combinations of the pages in the book to the template part-title page to optimize the result. In another example, the system 100 may not compare permutations and combinations of the heuristics in the template part-title page to heuristics on the candidate part-title page. The system 100 may determine the approximate solution, which may also be the optimal solution. However, the system 100 may not verify whether an optimal solution is determined. [0032] In examples, the machine readable instructions 106 may include instructions 1 16 to generate a digital table of contents. In examples, the system 100 may use the part-title pages to determine the beginning and end of a chapter. The pages between a part-title page and the next part-title page may be a chapter. The system 100 may then generate a digital table of contents based on the content on the part-title page. For example, the system 100 may identify the title of a chapter and include the title and the page number of the chapter in the digital table of contents. In another example, the system 100 may determine the sub-headings in the chapter and include the sub-headings in the digital table of contents with the appropriate page numbers.
[0033] In examples, the system 100 may include heuristics such as the preferences of a publisher. In examples, the system 100 may determine the heuristics specific to books from a publisher. For example, an encyclopedia from Britannica may use a part-title page with changes to indicate the beginning of a new chapter. The system 100 may also receive information about the preferences of a particular publisher, such as a vignette, a particular font in part-title pages, a particular font size, or the absence of headers or footers in part-title pages. In examples, the system 100 may receive information about heuristics from a related book, or a set of related books.
[0034] Referring to figure 2, the figure illustrates heuristics identified by the system 100 on a page in a book. The system 100 may identify, as a heuristic, the presence of bigger fonts in the part-title header compared to fonts in the rest of the paragraph 204. In another example, the system 100 may identify the use of the word “chapter” 202 in a part-title page. Also, a number may follow the word “chapter”. Similarly, the system 100 may identify the use of a margin before and/or after a part-title header in the part-title page 206. The system 100 may identify the absence of a running footer 208 on the page. The system 100 may also identify the side on which the part-title page is present 210 as a heuristic present in the page. Thus, the system 100 may identify candidate part-title pages. Also, the system 100 may identify counts of the heuristic matches for pages in the candidate part-title pages.
[0035] Figures 3A, 3B and 3C depict examples of the system 100 identifying actual part-title pages. In examples, as shown in figure 3A, the book may be a PDF file. The system 100 may receive a book without demarcation between the chapters. In examples, the system 100 may determine the heuristics present on the pages of the book received in figure 3A to determine candidate part-title pages as shown in figure 3B. For example, the system 100 may determine the number of heuristic matches on pages of the received book. The system 100 may then generate a count of the heuristic matches in the candidate part-title pages to determine the probability that a page is a part-title page. For example, figure 3B shows two groups, group 308 and group 310. The group 308 has a probability of 5/8 and the group 310 has a probability of 3/8. In examples, the system 100 may identify pages with at least one heuristic match to determine the candidate part-title pages. Also, the system 100 may associate a probability with pages in the candidate part-title pages. For example, as shown in figure 3B, the part-title pages in group 308 may match five of the eight heuristic tests. The system 100 may determine the template part-title page as a page with the heuristics in the group 308. For example, the template part-title page may be a page that includes the five heuristics of the group 308. The system 100 may then compare the template part-title page with other pages in the candidate part-title pages. For example, the system 100 may compare each page in the group 310 to the template part-title page. The system 100 may remove pages that do not match the template part-title page. For example, the page 310a does not include a header with a spacing before or after the header, includes a running header that is absent from the template page, and the like. The system 100 may determine that page 310a is not a part-title page. The system 100 may similarly perform a heuristic match against the template part-title page to select the actual part-title pages shown in figure 3C. In addition, the system 100 may not compare the template part-title page to the pages in the group 308, to minimize system resource utilization. In examples, the system 100 may select pages in the group 308 without dropping a page from the group. The system 100 may determine that the group 308 pages have a higher probability of being actual part-title pages compared with other candidate part-title pages. The system 100 may determine that pages in the group 308 are actual part-title pages. The system 100 may include them in the actual part-title pages in figure 3C. The system 100 may thereby obtain an approximate solution.
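The grouping in figure 3B may be illustrated with the following Python sketch. It treats the probability of a page as the fraction of tested heuristics that the page matches, which reproduces the 5/8 and 3/8 values above. The heuristic names, the function names, and in particular the matching rule (a lower-probability candidate is kept when it shares at least `min_shared` heuristics with the template) are assumptions made for illustration; the disclosure does not specify the matching rule.

```python
from fractions import Fraction

# Hypothetical names for the eight heuristics tested on every page.
TESTED = frozenset({
    "header_near_fore_edge", "blank_verso", "larger_font", "vignette",
    "header_margin", "chapter_word", "no_running_header", "no_footer",
})

def probability(found):
    """Fraction of the tested heuristics that a page matches."""
    return Fraction(len(found & TESTED), len(TESTED))

def select_actual_pages(pages, min_shared=3):
    """pages maps page number -> set of heuristic names found on that page."""
    scores = {page: probability(found) for page, found in pages.items()}
    # Pages with at least one heuristic match are candidates.
    candidates = {page for page, score in scores.items() if score > 0}
    if not candidates:
        return set()
    best = max(scores[page] for page in candidates)
    top_group = {page for page in candidates if scores[page] == best}
    # Template: the heuristics shared by every page in the top group.
    template = frozenset.intersection(
        *(frozenset(pages[page]) & TESTED for page in top_group)
    )
    # The top group is accepted without re-comparison; other candidates
    # must share enough heuristics with the template (assumed rule).
    actual = set(top_group)
    for page in candidates - top_group:
        if len(pages[page] & template) >= min_shared:
            actual.add(page)
    return actual

top = {"larger_font", "header_margin", "chapter_word",
       "no_running_header", "no_footer"}
pages = {
    5: set(),
    10: set(top),
    20: {"chapter_word", "larger_font", "header_margin"},
    25: {"larger_font", "vignette", "blank_verso"},
    30: set(top),
}
actual = select_actual_pages(pages)
print(sorted(actual))
```

In this sample, pages 20 and 25 both score 3/8, but only page 20 is kept because all three of its heuristics also appear in the template.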
[0036] The system 100 may then generate a set of chapters based on the determined part-title pages. The system 100 may allow removal of a chapter. In examples, the system 100 may receive a specification to generate a custom book. The specification may add, delete or modify the pages, chapters, contents and the like of the book. The system 100 may, for example, delete chapters not taught in a university course. The system 100 may then determine a digital table of contents for the custom book as discussed above with reference to figure 1.
[0037] Figure 4 illustrates a method 400 according to examples for determining a set of chapters for the book based on the actual part-title pages. The method 400 may be performed by the system 100 in figure 1 or other systems.
[0038] At 404, the system 100 may determine the probabilities that pages in the book are part of candidate part-title pages. In examples, the system 100 may determine candidate part-title pages based on heuristics. In examples, the system 100 may receive a set of heuristics to use. In another example, the system 100 may use a pre-selected set of heuristics. In examples, the heuristic search may be a quick search to reject pages that are unlikely to be part-title pages. For example, the system 100 may ignore pages with no part-title header or vignette on the page, or pages that include small fonts in comparison to other fonts. In another example, the system 100 may apply the set of heuristics. In the case of borderline matches, where the probability that the page has the heuristics is borderline, the system 100 may include the pages in the candidate part-title pages to avoid system resource utilization. The system 100 may delay a decision on a borderline page until more information is available.
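The quick rejection and the deferred handling of borderline pages at 404 may be sketched as follows. The function name `triage`, the `accept` threshold and the `margin` width are hypothetical values chosen for illustration; the disclosure does not give numeric thresholds.

```python
def triage(probabilities, accept=0.5, margin=0.2):
    """Split pages into candidates and rejects; flag borderline candidates.

    probabilities maps page number -> probability of being a part-title page.
    Pages within `margin` below `accept` are kept as candidates so that the
    decision on them can be deferred to the later template comparison.
    """
    candidates, borderline, rejected = [], [], []
    for page, p in sorted(probabilities.items()):
        if p >= accept:
            candidates.append(page)
        elif p >= accept - margin:
            candidates.append(page)
            borderline.append(page)
        else:
            rejected.append(page)
    return candidates, borderline, rejected

candidates, borderline, rejected = triage({3: 0.625, 7: 0.375, 9: 0.125, 12: 0.0})
print(candidates, borderline, rejected)
```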
[0039] At 406, the system 100 may identify the template part-title page with the highest probability. In examples, the system 100 may determine the template part-title page based on the candidate part-title pages with the highest probabilities. The system 100 may identify the part-title page with the highest number of heuristic matches in the candidate part-title pages.
[0040] At 408, the system 100 may compare heuristics present on the template part-title page to heuristics present on candidate part-title pages. In examples, the system 100 may compare one heuristic in the template part-title page against heuristics in the candidate part-title pages. For example, the system 100 may compare whether the part-title page is located on the verso side of a page. A publishing house may use the verso side of a page for the part-title page. Thus, the system 100 may find an approximate solution using this heuristic. The system 100 may not waste resources on finding an optimal solution. In examples, to conserve resources, the system 100 may not compare the template part-title page with pages of the book that are not among the candidate part-title pages.
[0041] At 410, the system 100 may select a subset of the candidate part-title pages to be actual part-title pages of the book based on the comparison. In examples, the system 100 may use heuristics to determine the actual part-title pages. The system 100 may determine an approximate solution based on the comparison in the previous step. For example, a publisher may reuse a part-title page with minor changes in a book. For example, the publisher may always start a part-title page on the verso side of a page. The system 100 may discard candidate part-title pages not on the verso side of the page.
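Steps 406 to 410 may be sketched in Python as follows. The `Candidate` record and the `select_actual` function are hypothetical names; the sketch assumes that the page side is the single heuristic used for the comparison, as in the verso example above, and that ties for the highest match count are resolved by taking the first such candidate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    page: int
    side: str      # "verso" or "recto"
    matches: int   # number of heuristic matches found on the page

def select_actual(candidates):
    """Pick a template, then keep candidates on the same side as the template."""
    if not candidates:
        return []
    # Step 406: the template is the candidate with the most heuristic matches.
    template = max(candidates, key=lambda c: c.matches)
    # Steps 408 and 410: compare one heuristic (the side) and discard mismatches.
    return [c.page for c in candidates if c.side == template.side]

candidates = [
    Candidate(2, "verso", 5),
    Candidate(9, "recto", 2),
    Candidate(18, "verso", 3),
    Candidate(31, "recto", 1),
    Candidate(40, "verso", 5),
]
actual_pages = select_actual(candidates)
print(actual_pages)
```

Only the candidate list is scanned, so pages outside the candidate part-title pages are never compared to the template.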
[0042] Figure 5 illustrates a method 500 according to examples for generating a custom book. The method 500 may be performed by the system 100 in figure 1 or other systems.
[0043] At 504, the system 100 may determine probabilities that pages in a book are part of candidate part-title pages. For example, the system 100 may determine whether pages in the book are part-title pages based on heuristic matches. For example, the system 100 may determine whether the page has a part-title header. The system 100 may use the location of the part-title header. In some books, the part-title header may be offset towards the fore-edge of the book, away from the spine of the book. Also, the system 100 may determine whether a page has larger fonts than other pages, the presence or absence of headers and footers, the space between the part-title header and the rest of the text, and the like. The system 100 may then identify candidate part-title pages that match the selected heuristics. In examples, the system 100 may determine pages that display n heuristics out of m heuristics tested for as part of the candidate part-title pages. In examples, the system 100 may identify the candidate part-title pages that are approximate matches for later filtering. For example, the system 100 may include candidate part-title pages that have a lower probability of being part-title pages in this initial step, to weed out later. The system 100 may use approximations to avoid using further resources on borderline cases by deferring the solution to a later point in time.
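The fore-edge heuristic mentioned above may be sketched as a simple geometric test. The sketch assumes a left-to-right book, in which the fore-edge is on the right of a recto page and on the left of a verso page. The function name, the use of the horizontal center of the header, and the `tolerance` value are hypothetical.

```python
def header_near_fore_edge(side, header_center_x, page_width, tolerance=0.05):
    """Return True if the header is offset from the page center toward the fore-edge.

    side is "recto" (right-hand page) or "verso" (left-hand page).
    header_center_x and page_width use the same units, measured from the
    left edge of the page.
    """
    # Positive offset: header center lies to the right of the page center.
    offset = header_center_x / page_width - 0.5
    if side == "recto":
        return offset > tolerance    # fore-edge is on the right
    if side == "verso":
        return offset < -tolerance   # fore-edge is on the left
    raise ValueError(f"unknown page side: {side!r}")

print(header_near_fore_edge("recto", 450, 600))
print(header_near_fore_edge("verso", 450, 600))
```

A centered header fails the test on either side, so the heuristic only fires for headers that are clearly displaced away from the spine.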
[0044] At 508, the system 100 may compare the template part-title page with the candidate part-title pages. For example, the system 100 may compare the heuristics on the template part-title page to heuristics in the candidate part-title pages. In another example, the system 100 may determine part-title pages with fewer heuristic matches. In examples, the system 100 may use the set of identified part-title pages. The system 100 may determine an approximate solution without comparing permutations and combinations of heuristics on pages in the book. In examples, the system 100 may use fewer heuristics than the initial set of heuristics to select which candidate part-title pages are actual part-title pages. For example, the system 100 may determine whether the template part-title page was on the right-hand side.
[0045] At 510, the system 100 may select a subset of the candidate part-title pages to be actual part-title pages. In examples, the system 100 may select the subset based on the heuristic comparison. For example, the probability that publishers locate part-title pages on the same side of the book may be higher than the probability that part-title pages appear on both sides in the same book. The system 100 may determine the actual part-title pages from the candidate part-title pages based on this probability. In examples, the system 100 may solve the problem without applying permutations and combinations, i.e., without trying to optimize the solution to further improve the detection. Although the system 100 may use approximations to solve the problem, the solution may be the best solution; however, the system 100 may not verify whether the solution is the best solution. The system 100 may winnow the candidate part-title pages to determine the actual part-title pages.
[0046] At 512, the system 100 may receive specifications for a custom book. In examples, the specifications may include specifications to include additional material provided. In another example, the specifications may describe chapters to be deleted. For example, a university course may use a custom set of chapter problems. The university course may skip chapters, reorder chapters for the course and the like.
[0047] At 514, the system 100 may generate a proof copy of the custom book based on the specifications. In examples, the system 100 may change the book based on the specifications received in the previous step. The system 100 may then generate a proof copy of the custom book for review. In examples, the system 100 may send a proof copy to the university or professor for review. After approval of the proof copy, the system 100 may publish low volumes of copies on demand for the course. The system 100 may also allow students registered at the university to access additional custom material for the custom book.
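Applying a specification at 512 and 514 may be sketched as follows. The shape of the specification (a list of chapter titles to delete and an optional new order), the representation of a chapter as a title with a page count, and the function name `apply_spec` are assumptions made for illustration only.

```python
def apply_spec(chapters, delete=(), order=None):
    """Apply a custom-book specification and return the new table of contents.

    chapters is a list of (title, page_count) tuples in original order.
    delete lists chapter titles to drop; order, if given, lists the titles
    of the remaining chapters in the desired sequence.
    """
    dropped = set(delete)
    kept = [(title, count) for title, count in chapters if title not in dropped]
    if order is not None:
        by_title = dict(kept)
        kept = [(title, by_title[title]) for title in order if title in by_title]
    # Renumber: each chapter starts where the previous one ended.
    toc, page = [], 1
    for title, count in kept:
        toc.append((title, page))
        page += count
    return toc

chapters = [("Chapter 1", 14), ("Chapter 2", 25), ("Chapter 3", 21)]
custom_toc = apply_spec(chapters, delete=["Chapter 2"],
                        order=["Chapter 3", "Chapter 1"])
print(custom_toc)
```

The returned entries could then feed the proof copy, with the page numbers reflecting the deleted and reordered chapters.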
[0048] While embodiments of the present disclosure have been described referring to examples, those skilled in the art can variously modify the described embodiments without departing from the claimed embodiments.

Claims

We claim:
1. A system comprising:
a processor; and
a memory on which is stored machine readable instructions that are to cause the processor to:
parse pages of a digital book to identify heuristics present on the pages;
determine probabilities that the pages in the digital book are candidate part-title pages separating chapters of the book based on a comparison between the identified heuristics and a set of part-title heuristics;
identify a template part-title page based on the determined probabilities;
compare the template part-title page to the candidate part-title pages;
select at least a subset of the candidate part-title pages to be actual part-title pages of the book based on the comparison; and
generate a digital table of contents for the book from the actual part-title pages.
2. The system of claim 1, wherein the instructions to compare the template part-title page to the candidate part-title pages cause the processor to:
determine whether the template part-title page is located near a fore-edge of the book; and
in response to the determination that the template part-title page is near the fore-edge of the book, identify actual part-title pages from the candidate part-title pages located near the fore-edge of the book.
3. The system of claim 1, wherein the instructions to compare the template part-title page to the candidate part-title pages cause the processor to:
determine whether the template part-title page is not near a fore-edge of the book; and
in response to a determination that the template part-title page is not near the fore-edge of the book, identify actual part-title pages from the candidate part-title pages not located near the fore-edge of the book.
4. The system of claim 1, wherein the instructions to compare the template part-title page to the candidate part-title pages cause the processor to:
determine a position of a part-title header on the template part-title page; and
identify actual part-title pages from the candidate part-title pages that have a similarly positioned part-title header.
5. The system of claim 1, wherein the instructions cause the processor to:
identify a set of chapters based on the actual part-title pages of the book;
receive a selection from the set of chapters in the book for deletion;
delete the received selection from the book; and
modify the digital table of contents according to the deleted selection.
6. The system of claim 1, wherein the instructions cause the processor to:
receive a specification for a custom book;
modify the digital book based on the specification; and
generate a table of contents for the custom book.
7. The system of claim 1, wherein the set of part-title heuristics comprises at least one of:
a position where a part-title header appears near a fore-edge of a page;
an absence of content in a part-title page verso;
a presence of content in a part-title page verso;
a presence of larger font sizes in part-title pages;
an absence of a larger font size in a part-title page compared to other text in a book;
a presence of vignette in a part-title page;
a margin before and/or after a part-title header;
a presence of a word chapter on a part-title page followed by a number;
an absence of a running header on a part-title page; and
an absence of a footer on a part-title page.
8. The system of claim 1, wherein the instructions cause the processor to:
determine the set of part-title heuristics based on heuristics present in a set of related books.
9. The system of claim 1, wherein the template part-title page comprises heuristics from the candidate part-title pages with a higher probability than other candidate part-title pages.
10. The system of claim 1, wherein the instructions to identify the template part-title page based on the determined probabilities cause the processor to:
determine the template part-title page based on a minimum set of heuristics from two or more candidate part-title pages having higher probabilities than other candidate part-title pages.
11. The system of claim 1, wherein the instructions to identify the template part-title page based on the determined probabilities cause the processor to:
determine the template part-title page from the candidate part-title pages with a higher count of heuristic matches compared to other candidate part-title pages.
12. A non-transitory computer readable medium comprising machine readable instructions executable by a processor to:
determine probabilities that pages in a digital book are candidate part-title pages separating chapters of a book based on part-title heuristics;
identify a template part-title page from the candidate part-title pages having a higher probability than other candidate part-title pages based on the determined probabilities;
compare heuristics present on the template part-title page to heuristics present on the candidate part-title pages;
select a subset of the candidate part-title pages to be actual part-title pages of the book based on the comparison; and
determine chapters for the book based on the actual part-title pages.
13. The non-transitory computer readable medium of claim 12, comprising machine readable instructions executable by the processor to:
receive a selection from the actual part-title pages in the book for deletion;
delete pages of the book based on the received selection from the book; and
generate a digital table of contents based on the determined chapters and the deleted pages.
14. The non-transitory computer readable medium of claim 12, wherein the instruction to compare the template part-title page to the candidate part-title pages comprises machine readable instructions executable by the processor to:
determine a heuristic count for each heuristic in the candidate part-title pages;
select a heuristic having the highest heuristic count; and
compare the selected heuristic to heuristics present on the candidate part-title pages.
15. A method comprising:
determining probabilities that pages in a digital book are candidate part-title pages separating chapters of a book based on part-title heuristics;
determining a template part-title page with a higher probability than other candidate part-title pages based on the probabilities;
comparing heuristics present on the template part-title page to heuristics present on the candidate part-title pages;
selecting a subset of the candidate part-title pages to be actual part-title pages of the book based on the comparison;
receiving a specification for a custom book based on the digital book; and
generating a proof of the custom book based on the specification and the actual part-title pages.
PCT/US2017/064023 2017-11-30 2017-11-30 Digital part-page detectors WO2019108209A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2017/064023 WO2019108209A1 (en) 2017-11-30 2017-11-30 Digital part-page detectors

Publications (1)

Publication Number Publication Date
WO2019108209A1 true WO2019108209A1 (en) 2019-06-06

Family

ID=66665213

Country Status (1)

Country Link
WO (1) WO2019108209A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030042319A1 (en) * 2001-08-31 2003-03-06 Xerox Corporation Automatic and semi-automatic index generation for raster documents
US20150304521A1 (en) * 2014-04-17 2015-10-22 Xerox Corporation Dynamically generating table of contents for printable or scanned content
US20160360063A1 (en) * 2011-10-06 2016-12-08 Uri Zernik Device, System and Method for Identifying Sections of Documents

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460801A (en) * 2020-03-30 2020-07-28 北京百度网讯科技有限公司 Title generation method and device and electronic equipment
CN111460801B (en) * 2020-03-30 2023-08-18 北京百度网讯科技有限公司 Title generation method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17933665

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17933665

Country of ref document: EP

Kind code of ref document: A1