GB2336012A - Document production - Google Patents

Info

Publication number
GB2336012A
GB2336012A GB9825582A GB9825582A GB2336012A GB 2336012 A GB2336012 A GB 2336012A GB 9825582 A GB9825582 A GB 9825582A GB 9825582 A GB9825582 A GB 9825582A GB 2336012 A GB2336012 A GB 2336012A
Authority
GB
United Kingdom
Prior art keywords
document
conversion
format
validation
tags
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9825582A
Other versions
GB2336012B (en)
GB9825582D0 (en)
Inventor
David Keenan
Anthony Joseph Donnelly
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datapage Ireland Ltd
Original Assignee
Datapage Ireland Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from IE1998/0234A (IE83604B1)
Application filed by Datapage Ireland Ltd
Publication of GB9825582D0
Publication of GB2336012A
Application granted
Publication of GB2336012B
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/157 Transformation using dictionaries or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Structured-format documents are produced in a process which receives a file in a particular word processing format or in any other format which is then converted to the latter. The system loads a parameter activation table which sets document parameter values to allow Document Type Definition information to be automatically implemented. The document is cleaned and tagged. The tagging provides an important link to allow automatic conversion at a later stage in the process. There is copy-editing followed by validation of the file preparation stage. This involves automatic validation of tags, including validation of their order and nesting arrangement. Automatic conversion to SGML is performed in a sequence of symbol/character conversion, tag conversion, equation processing, and floating element processing. Final validation is then performed.

Description

"Document Production"
The invention relates to production of documents in a structured format such as Standard Generalized Markup Language (SGML) format.
A structured format such as SGML allows output of a document in a wide variety of formats using available tools. Such a structured format is therefore of enormous benefit to the document production industry, such as for publication of academic journals. In the art, WO98/34179, US5557720, and US5140521 describe techniques for processing structured-format documents. In general, this prior art relates to either altering a structured-format document, or processing such documents to generate a required output format for either display or printing.
However, a major problem for production of documents in a structured format is that of reaching this format. If the document is authored in the structured format, then specialised knowledge is required and the task is time-consuming. Alternatively, if the document is authored in a conventional word processor format and is subsequently converted, the conversion is very time-consuming and is error-prone.
The invention is therefore directed toward providing a process for producing a document in a structured format in a more efficient manner. Another object is that errors in the document be more consistently reduced.
According to the invention, there is provided a document production process carried out by a processor having an editor interface and memory access means, the process comprising the steps of:
writing a document in a word processor format to memory;
writing document parameter rules to memory;
automatically correcting the document according to typesetting and document rules;
automatically tagging the document according to typesetting and document rules; and
automatically converting the document characters and tags to a structured format to provide a structured document.
The steps of automatically correcting according to typesetting and document rules, automatic tagging, and automatic conversion allow for a highly automated process for bringing a document from a standard word processor format to a structured format. This allows the document author to use a word processor which he or she is familiar with, and divorces him or her from structured format techniques. These steps also help to ensure that errors are minimised.
In one embodiment, character and tag conversion is performed by automatic comparison with reference characters and tags stored in look-up tables.
Preferably, the conversion step includes the sub-step of passing foreign objects to a separate processor, which converts the foreign object to a text format, and subsequently processing the text to convert to the structured format.
In one embodiment, the conversion step comprises the sub-step of separately converting floating elements according to document parameter rules and structure of the floating element.
These automatic conversion steps in sequence provide comprehensive conversion to a structured format.
Preferably, the process comprises the further step of parsing the structured format code for final validation. This helps to ensure document quality.
In one embodiment, the document parameter rules are written as an array of flags which activate and deactivate parameter options. This is a very effective way of recording parameter rules for a particular document.
Preferably, the tagging step involves automatic recognition of elements.
In one embodiment, the process comprises the further step of copy-editing the document after tagging by automatically converting words according to a break-down of the word characters.
In another embodiment, the copy-editing includes the sub-steps of building an array of document references by automatic recognition and subsequently sorting them according to an operator-inputted sort criterion.
Preferably, the process comprises the further step of automatic pre-conversion validation, in which tags are compared with reference tags and nesting is validated according to the document parameter values.
In one embodiment, the pre-conversion validation step includes the sub-step of automatically locating any invalid symbols and generating corresponding error messages.
In another embodiment, the pre-conversion validation step includes the sub-step of automatically identifying references, building an array in memory, and searching to determine if any do not exist in the document.
The invention will be more clearly understood from the following description of some embodiments thereof, given by way of example only with reference to the accompanying drawings in which:
Figs. 1(a), 1(b) and 1(c) are together a flow chart illustrating a production process of the invention; and
Figs. 2 to 4 are document samples at various process stages.
The drawings show a process 1 for producing a document in a structured format, in this embodiment SGML. The process takes an authored document in a particular word processor format (input A), or in a different word processor format or a manually type-written document (input B). If input B, the process in step 2 converts the document to the particular word processor format by optical character recognition or word processor conversion as applicable. Fig. 2 is a sample from a received document in Word™ format.
In step 3 process identifiers are inputted by an operator. These identifiers identify the particular document being published, the client and other identification information.
In step 4, a parameter activation table is loaded. This table includes flags which activate or deactivate various document parameter values. The rules and the table are structured to represent Document Type Definition (DTD) information in the system so that DTD information may be automatically processed.
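By way of illustration only, a parameter activation table of this kind might be sketched as a mapping of flag names to booleans. The parameter names and the `load_activation_table` helper below are hypothetical; the specification does not list the actual options.

```python
# Hypothetical parameter options; the specification does not name them.
DEFAULT_FLAGS = {
    "numbered_references": False,   # False implies name/date reference style
    "floats_at_end_of_body": True,
    "allow_footnotes": True,
}


def load_activation_table(overrides):
    """Return the default flags with per-document overrides applied."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    table = dict(DEFAULT_FLAGS)  # copy so the defaults are never mutated
    table.update({name: bool(value) for name, value in overrides.items()})
    return table


# Example: a journal that requires numbered references.
table = load_activation_table({"numbered_references": True})
```

Later steps could then consult `table["numbered_references"]` and similar entries rather than hard-coding publisher rules.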
In step 5 the document is automatically prepared and cleaned. This involves the system processor applying typesetting rules, such as removing multiple spaces. In addition, various rules are applied for consistency such as removing spaces from around mathematical symbols. Also, spelling mistakes are corrected using a spell-checker program. Tables and figures are moved to the end of the document to facilitate later processing steps.
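The two cleaning rules named above (removing multiple spaces and removing spaces from around mathematical symbols) can be sketched as follows. This is a minimal sketch, not the patented implementation: the set of symbols in `MATH_SYMBOLS` is an assumption, and spell-checking and the moving of tables and figures are left out.

```python
import re

# Assumed symbol set. The hyphen/minus is deliberately excluded because
# it also occurs in ordinary hyphenated prose.
MATH_SYMBOLS = "=+<>±×"

_MULTI_SPACE = re.compile(r"[ \t]{2,}")
_AROUND_SYMBOL = re.compile(r"[ \t]*([" + re.escape(MATH_SYMBOLS) + r"])[ \t]*")


def clean_text(text):
    """Apply two illustrative typesetting rules to a piece of text."""
    text = _MULTI_SPACE.sub(" ", text)       # collapse runs of spaces or tabs
    text = _AROUND_SYMBOL.sub(r"\1", text)   # drop spaces around math symbols
    return text.strip()


cleaned = clean_text("where  a  =  b   +  c")
```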
In step 6 the document is tagged with internal system tags. The system progresses through the frontmatter, bodymatter, backmatter, tables and figures and the cross references in sequence. The tags are subsequently of benefit in automatically converting the document to SGML. The system asks various questions of the operator and, based on the operator's responses and internally stored rules, the system recognises elements of the sections and then tags them accordingly. An example of such tags is shown in Fig. 3.
In step 7 the system performs copy-editing. This involves spell-checking and grammar-checking the document. The processor operates according to a find/replace program which automatically breaks down character strings to validate internal fonts used. For example, the author may mean x^(2/3) but have typed it in differing forms (such as x2/3) at different places in the text. The system converts all such instances into their correct form.
As part of the editing step 7, the system converts styles in the document into their correct form as required by the document parameter values. A particular example is bibliographic reference style. Some publishers require these references to be name/date references, while others have these references numbered. For example, if the first reference in a document is a reference to an article published by "Smith and Jones" in 1998 in the name/date format, the text for this in the bibliography group would be ordered alphabetically, and so would therefore be about half way down the list. On the other hand, in a numbered reference format, the text would be at the start of the bibliography list as it is the first one cited. The system prompts the operator to select between these styles and then automatically implements them by generating a list of all of the references and sorting them accordingly.
Finally, the editing step 7 involves pulling all floating elements to the end of the document to facilitate faster handling at a later stage in the process.
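The choice between name/date and numbered reference styles amounts to sorting the same reference list by a different key. The sketch below assumes a hypothetical `Reference` record holding the position at which each reference is first cited; the specification does not describe the actual data layout.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    authors: str
    year: int
    first_cited: int  # order of first citation in the body (hypothetical field)


def sort_references(refs, style):
    """Order the bibliography for the operator-selected style."""
    if style == "name_date":
        # Alphabetical by author, then by year.
        return sorted(refs, key=lambda r: (r.authors.lower(), r.year))
    if style == "numbered":
        # Order of first citation in the text.
        return sorted(refs, key=lambda r: r.first_cited)
    raise ValueError(f"unknown reference style: {style!r}")


refs = [
    Reference("Smith and Jones", 1998, 1),
    Reference("Adams", 1995, 2),
    Reference("Young", 1990, 3),
]
by_name = sort_references(refs, "name_date")
by_number = sort_references(refs, "numbered")
```

With this sample data "Smith and Jones" comes first in the numbered list but falls in the middle of the name/date list, matching the example in the text.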
This work completes a preparatory stage of the process and this stage is then verified as illustrated in Fig. 1(b) in steps 8 to 15. In step 8 the tags are automatically compared with an internally-stored set of reference tags. This comparison is performed according to the received document parameter values. The order and nesting of the tags are checked in steps 9 and 10, again according to the document parameter values. In step 11 symbols within the document are checked to locate any unknown ones. This is performed by automated searching for characters which are not in the ranges 1-9, a-z, or A-Z and do not match a list of valid characters held by the system. Any unknown characters found in the document are reported for correction.
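A minimal sketch of the tag comparison, nesting check and unknown-symbol search might look like the following. Everything concrete here is an assumption made for illustration: the angle-bracket tag syntax, the tag names, the `ALLOWED_CHILDREN` nesting table and the list of valid symbols. The sketch also treats all ASCII digits as plain characters, and it does not check tag order.

```python
import re
import string

# Hypothetical reference tags and nesting rules.
REFERENCE_TAGS = {"front", "title", "body", "para", "back"}
ALLOWED_CHILDREN = {
    None: {"front", "body", "back"},  # None stands for the document root
    "front": {"title"},
    "body": {"para"},
    "back": {"para"},
    "title": set(),
    "para": set(),
}
PLAIN_CHARS = set(string.ascii_letters + string.digits + " \n")
VALID_SYMBOLS = set(".,;:()-'\"")  # assumed list of valid characters

TAG_RE = re.compile(r"<(/?)([A-Za-z]+)>")


def validate_tags(text):
    """Return error messages for unknown, badly nested or unclosed tags."""
    errors, stack = [], []
    for match in TAG_RE.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in REFERENCE_TAGS:
            errors.append(f"unknown tag <{name}>")
        elif closing:
            if stack and stack[-1] == name:
                stack.pop()
            else:
                errors.append(f"unexpected </{name}>")
        else:
            parent = stack[-1] if stack else None
            if name not in ALLOWED_CHILDREN[parent]:
                errors.append(f"<{name}> not allowed inside {parent}")
            stack.append(name)
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors


def find_unknown_symbols(text):
    """Return characters that are neither plain nor in the valid list."""
    stripped = TAG_RE.sub("", text)
    return sorted({c for c in stripped
                   if c not in PLAIN_CHARS and c not in VALID_SYMBOLS})
```

Each returned message would correspond to an error reported to the operator for correction.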
In step 12 cross-references are checked for validity. Cross-references include bibliographic references and references to tables, figures, and footnotes. This involves the system making a list in memory of the items referred to. The system then checks each reference in the body of the document. The system reports on references that cite non-existent items and on items that should be referred to but are not. As for steps 8 to 11, errors found are reported. However, in addition to step 12 there is an additional step 13 in which a list of unlinked cross-references is generated to prompt feedback by the operator. Generation of error messages is indicated by the step 14, and correction by step 15. The correction may involve interactive input by the operator.
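The cross-reference check reduces to two set differences: items cited but not present, and items present but never cited. The sketch below assumes a hypothetical citation marker syntax such as `[fig:1]` or `[ref:smith98]`; the system's real internal tags are not given in the text.

```python
import re

# Hypothetical marker syntax: [ref:...], [fig:...], [tab:...], [fn:...]
CITATION_RE = re.compile(r"\[((?:ref|fig|tab|fn):[^\]\s]+)\]")


def check_cross_references(body, items):
    """Compare citations found in the body with the items that exist.

    Returns (missing, uncited): identifiers cited but absent from the
    document, and identifiers present but never referred to.
    """
    cited = set(CITATION_RE.findall(body))
    missing = sorted(cited - set(items))
    uncited = sorted(set(items) - cited)
    return missing, uncited


body = "See [fig:1] and [ref:smith98]. Results are in [tab:2]."
items = {"fig:1", "fig:2", "ref:smith98"}
missing, uncited = check_cross_references(body, items)
```

Here `missing` lists the table that is cited but does not exist, and `uncited` lists the figure that exists but is never referred to; both lists would be reported to the operator.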
Referring now to Fig. 1(c), the final phase of the process is illustrated. In step 20 every symbol and character not in the 1-9, a-z, and A-Z ranges is checked against a list to locate the SGML code for that character. The SGML code is substituted in the text automatically. In step 21, tags which were inserted in the preparation stage of the process are converted to their SGML equivalent. Again, this is automated because the tags are simply checked against a list in a look-up table and substituted.
In step 22 equations and foreign objects in the document are converted to their correct SGML tags. This involves the system transmitting commands to convert the object into a format which can be understood by the system. For example, for a mathematical equation, a command is sent to a "MathType™" application to convert the equation into a text equivalent of the object's code. The system then converts this into SGML by searching the (now text) object and processing sub-objects.
Floating elements are converted to SGML and are embodied in the SGML document at the correct position in step 23. For example, the document parameter values may require the "floats" to be at the end of the body of the document, while others require each float to be located immediately after the first reference to it. The floats are converted based on rules held in memory. These rules are taken from both document parameter values and the float structure so that, for example, tables will always have cells and rows and this structure is used in the process. A sample of an SGML file is shown in Fig. 4.
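The symbol and tag conversions of steps 20 and 21 are both look-up table substitutions and can be sketched together. The table contents are illustrative assumptions: the entity names follow the common ISO entity sets, while the internal tag names (`ttl`, `par`, `bib`) and their SGML equivalents are hypothetical. Equation and float processing are not covered.

```python
import re

# Illustrative look-up tables; a real system would hold far larger ones.
ENTITY_TABLE = {"α": "&alpha;", "β": "&beta;", "±": "&plusmn;", "é": "&eacute;"}
TAG_TABLE = {"ttl": "title", "par": "p", "bib": "bibliography"}

TAG_RE = re.compile(r"<(/?)(\w+)>")


def convert_symbols(text):
    """Replace each character found in the table with its SGML entity."""
    return "".join(ENTITY_TABLE.get(ch, ch) for ch in text)


def convert_tags(text):
    """Replace each internal tag with its SGML equivalent.

    An unknown tag raises KeyError, on the assumption that the earlier
    validation stage has already rejected such tags.
    """
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{TAG_TABLE[name]}>"
    return TAG_RE.sub(swap, text)


def convert(text):
    # Symbols first, then tags, following the order of steps 20 and 21.
    return convert_tags(convert_symbols(text))


sgml = convert("<par>α ± 2</par>")
```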
In step 24 the SGML file is passed through a parser to ensure that the SGML is perfectly correct. This parser is a tool which exhaustively checks and validates the file against the complete document parameter values. This ensures that the correct set of document parameter values are used as are the various rules held by the system. This acts as a system check and reports any errors.
An intermediate-output SGML is provided in step 25 and this is used as the basis for the final output. For example, there may be DTD-specific conversion in step 27 to provide a final output SGML file in step 28. Alternatively, there may be journal-specific conversion in step 29 with typeset code editing in step 30 and a postscript output generated in step 31. Thus, the output SGML file may be converted into the typeset code required to correctly style and display the document for a typesetting system. Because the document provided in step 26 is in SGML format, many alternatives are possible.
The invention is not limited to the embodiments described, but may be varied in construction and detail within the scope of the claims.

Claims (1)

  1. A document production process carried out by a processor having an editor interface and memory access means, the process comprising the steps of:
    writing a document in a word processor format to memory;
    writing document parameter rules to memory;
    automatically correcting the document according to typesetting and document rules;
    automatically tagging the document according to typesetting and document rules; and
    automatically converting the document characters and tags to a structured format to provide a structured document.
  2. A process as claimed in claim 1, wherein character and tag conversion is performed by automatic comparison with reference characters and tags stored in look-up tables.
  3. A process as claimed in claims 1 or 2, wherein the conversion step includes the sub-step of passing foreign objects to a separate process, which converts the foreign object to a text format, and subsequently processing the text to convert to the structured format.
  4. A process as claimed in any preceding claim, wherein the conversion step comprises the sub-step of separately converting floating elements according to document parameter rules and structure of the floating element.
  5. A process as claimed in any preceding claim, wherein the process comprises the further step of parsing the structured format code for final validation.
  6. A process as claimed in any preceding claim, wherein the document parameter rules are written as an array of flags which activate and deactivate parameter options.
  7. A process as claimed in any preceding claim, wherein the tagging step involves automatic recognition of elements.
  8. A process as claimed in any preceding claim, wherein the process comprises the further step of copy-editing the document after tagging, by automatically converting words according to a break-down of the word characters.
  9. A process as claimed in claim 8, wherein the copy-editing includes the sub-steps of building an array of document references by automatic recognition and subsequently sorting them according to an operator-inputted sort criterion.
  10. A process as claimed in any preceding claim, comprising the further step of automatic pre-conversion validation, in which tags are compared with reference tags and nesting is validated according to the document parameter values.
  11. A process as claimed in claim 10, wherein the pre-conversion validation step includes the sub-step of automatically locating any invalid symbols and generating corresponding error messages.
  12. A process as claimed in claims 10 or 11, wherein the pre-conversion validation step includes the sub-steps of automatically identifying references, building an array in memory, and searching to determine if any do not exist in the document.
  13. A process substantially as described with reference to the drawings.
  14. Documents whenever produced by a process as claimed in any preceding claim.
GB9825582A 1998-03-31 1998-11-24 Document production Expired - Fee Related GB2336012B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IE1998/0234A IE83604B1 (en) 1998-03-31 Document production.

Publications (3)

Publication Number Publication Date
GB9825582D0 GB9825582D0 (en) 1999-01-13
GB2336012A GB2336012A (en) 1999-10-06
GB2336012B GB2336012B (en) 2001-10-03

Family

ID=11041749

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9825582A Expired - Fee Related GB2336012B (en) 1998-03-31 1998-11-24 Document production

Country Status (2)

Country Link
GB (1) GB2336012B (en)
IE (1) IES980960A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0400269A2 (en) * 1989-04-26 1990-12-05 International Business Machines Corporation Method for manipulating elements within a structured document and actively interpreting user intentions
US5133051A (en) * 1990-12-13 1992-07-21 Handley George E Automatic high speed publishing system
WO1996013781A1 (en) * 1994-10-31 1996-05-09 Moore Business Forms, Inc. Method and system for checking print orders for short run printing applications

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2405730A (en) * 2003-09-03 2005-03-09 Business Integrity Ltd Cross-reference generation
US7506251B2 (en) 2003-09-03 2009-03-17 Business Intergity Limited Cross-reference generation
GB2425635A (en) * 2005-04-30 2006-11-01 Hewlett Packard Development Co Recursive flows in variable-data printing document templates
US8904280B2 (en) 2005-04-30 2014-12-02 Hewlett-Packard Development Company, L.P. Recursive flows in variable-data printing document templates

Also Published As

Publication number Publication date
IES80866B2 (en) 1999-04-21
GB2336012B (en) 2001-10-03
IE980234A1 (en) 1999-10-20
IES980960A2 (en) 1999-04-21
GB9825582D0 (en) 1999-01-13

Similar Documents

Publication Publication Date Title
US6490603B1 (en) Method and system for producing documents in a structured format
US8108202B2 (en) Machine translation method for PDF file
EP0597611B1 (en) Apparatus and Method for Use in Aligning Bilingual Corpora
JPH0969101A (en) Method and device for generating structured document
JP2007073044A (en) Text correction for pdf conversion apparatus
WO2004084094A1 (en) Conversion of structured information
CN112506488A (en) Method for generating programming language class based on sql creating statement
GB2336012A (en) Document production
JP3724878B2 (en) Keyword extraction rule generation method
KR101497411B1 (en) A converting apparatus and a method for a literary style, a storage means and a service system and a method for automatic chatting
CN112905025B (en) Information processing method, electronic device, and readable storage medium
CN116992824A (en) Method and system for converting LaTex formula into natural language
US9430451B1 (en) Parsing author name groups in non-standardized format
KR100631086B1 (en) Method and apparatus for text normalization using extensible markup language(xml)
JP5295576B2 (en) Natural language analysis apparatus, natural language analysis method, and natural language analysis program
JP2003099428A (en) Translation supporting device, translator terminal control program and proofreader terminal control program
US6832197B2 (en) Machine interface
CN112750434B (en) Method and device for optimizing voice recognition system and electronic equipment
CN112949283B (en) Text processing method, device, nonvolatile storage medium and processor
JPH04211867A (en) System for analyzing japanese syntax
JPH09101959A (en) Structured document generator
JP2021117743A (en) Business support system and business support method
Wadata et al. Boosting Algorithm for Empty Node Recovery in a Sentence
Jamal et al. XML Schema Validation Using Java API for XML Processing
JPH0546370A (en) Program generating device

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20111124