IES980960A2 - Document production

Info

Publication number: IES980960A2
Application number: IE980960A
Authority: IE
Inventors: David Keenan; Anthony Joseph Donnelly
Original assignee: Datapage Ireland Ltd
Priority date: 1998-03-31
Filing date: 1998-11-20
Publication date: 1999-04-21
Also published as: IE980234A1; GB2336012B; GB2336012A; IES80866B2; GB9825582D0

Abstract

Structured-format documents are produced in a process in which a file in a particular word processor format (input A) or in any other format (input B) is converted (2) to the particular word processor format. The system loads a parameter activation table which sets document parameter values to allow DTDs to be automatically implemented. The document is cleaned (5) and tagged (6). The tagging provides an important link to allow automatic conversion at a later stage in the process. There is copy editing (7) followed by validation of the file preparation stage. This involves automatic validation of tags, including validation of their order and nesting arrangement. Automatic conversion to SGML is performed in a sequence of symbol/character conversion (20), tag conversion (21), equation processing (22), and floating element processing (23). Final validation (24) is then performed. <Fig. 1>

Description

The invention relates to production of documents in a structured format such as Standard Generalized Markup Language (SGML) format. A structured format such as SGML allows output of a document in a wide variety of formats using available tools. Such a structured format is therefore of enormous benefit to the document production industry, such as for publication of academic journals. In the art, WO98/34179, US5557720, and US5140521 describe techniques for processing structured-format documents. In general, this prior art relates to either altering a structured-format document, or processing such documents to generate a required output format for either display or printing.

However, a major problem for production of documents in a structured format is that of reaching this format. If the document is authored in the structured format, then specialised knowledge is required and the task is time-consuming. Alternatively, if the document is authored in a conventional word processor format and is subsequently converted, the conversion is very time-consuming and is error-prone.

The invention is therefore directed toward providing a process for producing a document in a structured format in a more efficient manner. Another object is that errors in the document be more consistently reduced.

According to the invention, there is provided a document production process carried out by a processor having an editor interface and memory access means, the process comprising the steps of: writing a document in a word processor format to memory; automatically correcting the document according to typesetting and document rules; automatically tagging the document according to typesetting and document rules; and automatically converting the document characters and tags to a structured format to provide a structured document.

The steps of automatically correcting according to typesetting and document rules, automatic tagging, and automatic conversion allow for a highly automated process for bringing a document from a standard word processor format to a structured format. This allows the document author to use a word processor which he or she is familiar with, and divorces him or her from structured format techniques. These steps also help to ensure that errors are minimised.

In one embodiment, character and tag conversion is performed by automatic comparison with reference characters and tags stored in look-up tables.

Preferably, the conversion step includes the sub-step of passing foreign objects to a separate processor, which converts the foreign object to a text format, and subsequently processing the text to convert to the structured format.

In one embodiment, the conversion step comprises the sub-step of separately converting floating elements according to document parameter rules and structure of the floating element.

These automatic conversion steps in sequence provide comprehensive conversion to a structured format.

Preferably, the process comprises the further step of parsing the structured format code for final validation. This helps to ensure document quality.

In one embodiment, the document parameter rules are written as an array of flags which activate and deactivate parameter options. This is a very effective way of recording parameter rules for a particular document.

Preferably, the tagging step involves automatic recognition of elements.

In one embodiment, the process comprises the further step of copy-editing the document after tagging by automatically converting words according to a break-down of the word characters.

In another embodiment, the copy-editing includes the sub-steps of building an array of document references by automatic recognition and subsequently sorting them according to an operator-inputted sort criterion.

Preferably, the process comprises the further step of automatic pre-conversion validation, in which tags are compared with reference tags and nesting is validated according to the document parameter values.

In one embodiment, the pre-conversion validation step includes the sub-step of automatically locating any invalid symbols and generating corresponding error messages.

In another embodiment, the pre-conversion validation step includes the sub-step of automatically identifying references, building an array in memory, and searching to determine if any do not exist in the document.

The invention will be more clearly understood from the following description of some embodiments thereof, given by way of example only with reference to the accompanying drawings in which Figs. 1(a), 1(b) and 1(c) are together a flow chart illustrating a production process of the invention; and Figs. 2 to 4 are document samples at various process stages.

The drawings show a process 1 for producing a document in a structured format, in this embodiment SGML. The process takes an authored document in a particular word processor format (input A), or in a different word processor format or a manually typewritten document (input B). If input B, the process in step 2 converts the document to the particular word processor format by optical character recognition or word processor conversion as applicable. Fig. 2 is a sample from a received document in Word™ format.

In step 3 process identifiers are inputted by an operator. These identifiers identify the particular document being published, the client and other identification information.

In step 4, a parameter activation table is loaded. This table includes flags which activate or deactivate various document parameter values. The rules and the table are structured to represent Document Type Definition (DTD) information in the system so that DTD information may be automatically processed.
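A parameter activation table of this kind can be pictured as an array of on/off flags indexed by parameter. The following is a minimal sketch only: the parameter names in `PARAMETERS` and the `ParameterTable` class are hypothetical, since the text does not list the actual parameters or their storage layout.

```python
from dataclasses import dataclass, field

# Hypothetical parameter names; the document does not enumerate the real ones.
PARAMETERS = ("numbered_references", "floats_at_end", "abstract_required")


@dataclass
class ParameterTable:
    """An array of flags, one per document parameter option."""

    flags: list = field(default_factory=lambda: [False] * len(PARAMETERS))

    def activate(self, name: str) -> None:
        self.flags[PARAMETERS.index(name)] = True

    def deactivate(self, name: str) -> None:
        self.flags[PARAMETERS.index(name)] = False

    def is_active(self, name: str) -> bool:
        return self.flags[PARAMETERS.index(name)]


# Example: a journal that wants numbered references and floats at the end.
table = ParameterTable()
table.activate("numbered_references")
table.activate("floats_at_end")
```

Later steps could then consult `table.is_active(...)` instead of hard-coding the requirements of one publisher.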

In step 5 the document is automatically prepared and cleaned. This involves the system processor applying typesetting rules, such as removing multiple spaces. In addition, various rules are applied for consistency, such as removing spaces from around mathematical symbols. Also, spelling mistakes are corrected using a spell-checker program. Tables and figures are moved to the end of the document to facilitate later processing steps.
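Two of the cleaning rules named above, collapsing multiple spaces and removing spaces around mathematical symbols, might look roughly like this. It is an illustrative sketch: the set of symbols in `MATH_SYMBOLS` and the function name `clean` are assumptions, and spell-checking and float relocation are not shown.

```python
import re

# Assumed set of mathematical symbols; the document does not list them.
MATH_SYMBOLS = "=+<>±×"

# Matches one symbol together with any spaces on either side of it.
_SYMBOL_RE = re.compile(r" *([" + re.escape(MATH_SYMBOLS) + r"]) *")


def clean(text: str) -> str:
    """Apply two typesetting rules: single spacing and tight math symbols."""
    text = re.sub(r" {2,}", " ", text)   # collapse runs of spaces
    text = _SYMBOL_RE.sub(r"\1", text)   # drop spaces around math symbols
    return text
```

For instance, `clean("a  =  b +  c")` yields `"a=b+c"`.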

In step 6 the document is tagged with internal system tags. The system progresses through the frontmatter, bodymatter, backmatter, tables and figures, and the cross-references in sequence. The tags are subsequently of benefit in automatically converting the document to SGML. The system asks various questions of the operator and, based on the operator's responses and internally stored rules, the system recognises elements of the sections and then tags them accordingly. An example of such tags is shown in Fig. 3.

In step 7 the system performs copy-editing. This involves spell-checking and grammar-checking the document. The processor operates according to a find/replace program which automatically breaks down character strings to validate internal fonts used. For example, the author may mean one particular form of x23 but have used differently formatted variants of it at different places in the text. The system converts all instances of x23 into their correct form.

As part of the editing step 7, the system converts styles in the document into their correct form as required by the document parameter values. A particular example is bibliographic reference style. Some publishers require these references to be name/date references, while others have these references numbered. For example, if the first reference in a document is a reference to an article published by “Smith and Jones” in 1998, then in the name/date format the text for this in the bibliography group would be ordered alphabetically and so would be about half way down the list. On the other hand, in a numbered reference format, the text would be at the start of the bibliography list as it is the first one cited. The system prompts the operator to select between these styles and then automatically implements them by generating a list of all of the references and sorting them accordingly.

Finally, the editing step 7 involves pulling all floating elements to the end of the document to facilitate faster handling at a later stage in the process.
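The choice between name/date and numbered reference ordering can be illustrated with a short sketch. The `Reference` record, the style names `"name_date"` and `"numbered"`, and the `order_references` function are all hypothetical; the text only states that the references are listed and sorted according to the operator's selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    key: str       # hypothetical identifier used by citations in the text
    authors: str
    year: int


def order_references(refs, citation_order, style):
    """Sort the bibliography for the style selected by the operator."""
    if style == "name_date":
        # Alphabetical by author, then by year.
        return sorted(refs, key=lambda r: (r.authors.lower(), r.year))
    if style == "numbered":
        # Order of first citation; uncited entries keep their order at the end.
        first_seen = {}
        for position, key in enumerate(citation_order):
            first_seen.setdefault(key, position)
        return sorted(refs, key=lambda r: first_seen.get(r.key, len(citation_order)))
    raise ValueError(f"unknown reference style: {style!r}")


refs = [
    Reference("adams99", "Adams", 1999),
    Reference("smith98", "Smith and Jones", 1998),
    Reference("brown97", "Brown", 1997),
]
cited = ["smith98", "adams99", "smith98", "brown97"]
```

With this data, the numbered style places “Smith and Jones” first because it is cited first, whereas the name/date style places it last of the three.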

This work completes a preparatory stage of the process and this stage is then verified as illustrated in Fig. 1(b) in steps 8 to 15. In step 8 the tags are automatically compared with an internally-stored set of reference tags. This comparison is performed according to the received document parameter values. The order and nesting of the tags are checked in steps 9 and 10, again according to the document parameter values. In step 11 symbols within the document are checked to locate any unknown ones. This is performed by automated searching for characters which are not in the ranges 1-9, a-z, or A-Z and do not match a list of valid characters held by the system. Any unknown characters found in the document are reported for correction.
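The tag checks of steps 8 to 10 and the symbol check of step 11 could be sketched as follows. This is an assumption-laden illustration: the angle-bracket tag syntax, the tag names and nesting rules in `ALLOWED_CHILDREN`, and the contents of `VALID_EXTRA` are invented for the example, and the sketch accepts the digit 0 although the text names the range 1-9.

```python
import re
import string

# Hypothetical reference tags and the children each may contain.
ALLOWED_CHILDREN = {
    "doc": {"front", "body", "back"},
    "front": {"title"},
    "body": {"sec"},
    "back": set(),
    "sec": {"sec", "para"},
    "title": set(),
    "para": set(),
}
TAG_RE = re.compile(r"<(/?)([a-z]+)>")  # assumed internal tag syntax


def validate_tags(text):
    """Return error messages for unknown, badly nested or unclosed tags."""
    errors, stack = [], []
    for match in TAG_RE.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in ALLOWED_CHILDREN:
            errors.append(f"unknown tag <{name}>")
        elif not closing:
            if stack and name not in ALLOWED_CHILDREN[stack[-1]]:
                errors.append(f"<{name}> not allowed inside <{stack[-1]}>")
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}>")
    errors.extend(f"unclosed <{name}>" for name in stack)
    return errors


PLAIN = set(string.ascii_letters + string.digits)
VALID_EXTRA = set(" .,;:()'\"\n<>/-")  # assumed list of valid characters


def find_unknown_symbols(text):
    """Return (position, character) pairs for characters needing correction."""
    return [(i, ch) for i, ch in enumerate(text)
            if ch not in PLAIN and ch not in VALID_EXTRA]
```

Each function returns a list so that the caller can report every problem in one pass rather than stopping at the first.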

In step 12 cross-references are checked for validity. Cross-references include bibliographic references and references to tables, figures, and footnotes. This involves the system making a list in memory of the items referred to. The system then checks each reference in the body of the document. The system reports on references that cite non-existent items and on items that should be referred to but are not. As for steps 8 to 11, errors found are reported. However, in addition to step 12 there is a further step 15 in which a list of unlinked cross-references is generated to prompt feedback by the operator. Generation of error messages is indicated by step 14, and correction is indicated by a separate step. The correction may involve interactive input by the operator.
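The two-way cross-reference check can be expressed with set differences. In this sketch the markup for citable items (`<fig id="...">` and similar) and for citations (`<xref ref="...">`) is hypothetical, as is the function name; the text does not specify how references are marked internally.

```python
import re

# Hypothetical markup for citable items and for citations of them.
ITEM_RE = re.compile(r'<(?:fig|table|bib|fn) id="([^"]+)">')
XREF_RE = re.compile(r'<xref ref="([^"]+)">')


def check_cross_references(text):
    """Return (dangling, uncited) identifier lists for operator review."""
    items = set(ITEM_RE.findall(text))   # items that exist in the document
    cited = set(XREF_RE.findall(text))   # items the body refers to
    dangling = sorted(cited - items)     # references to non-existent items
    uncited = sorted(items - cited)      # items never referred to
    return dangling, uncited


sample = ('See <xref ref="f1"> and <xref ref="t9">. '
          '<fig id="f1"> <table id="t1">')
dangling, uncited = check_cross_references(sample)
```

Here `dangling` holds `t9` (cited but missing) and `uncited` holds `t1` (present but never cited), matching the two kinds of report described above.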

Referring now to Fig. 1(c), the final phase of the process is illustrated. In step 20 every symbol and character not in the 1-9, a-z, and A-Z ranges is checked against a list to locate the SGML code for that character. The SGML code is substituted in the text automatically. In step 21, tags which were inserted in the preparation stage of the process are converted to their SGML equivalent. Again, this is automated because the tags are simply checked against a list in a look-up table and substituted.

In step 22 equations and foreign objects in the document are converted to their correct SGML tags. This involves the system transmitting commands to convert the object into a format which can be understood by the system. For example, for a mathematical equation, a command is sent to a “MathType™” application to convert the equation into a text equivalent of the object's code. The system then converts this into SGML by searching the (now text) object and processing sub-objects.

Floating elements are converted to SGML and are embodied in the SGML document at the correct position in step 23. For example, some document parameter values may require the “floats” to be at the end of the body of the document, while others require each float to be located immediately after the first reference to it. The floats are converted based on rules held in memory. These rules are taken from both document parameter values and the float structure so that, for example, tables will always have cells and rows and this structure is used in the process. A sample of an SGML file is shown in Fig. 4.
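The look-up-table substitution of steps 20 and 21 can be sketched in a few lines. The table contents are illustrative only: the entity names are standard ISO 8879 entities, but the bracketed internal tag syntax, the SGML element names and the function names are assumptions, and equation and float handling (steps 22 and 23) are not shown.

```python
# Illustrative character table: symbol -> SGML entity reference.
CHAR_TABLE = {
    "&": "&amp;",
    "\u00b1": "&plusmn;",   # plus-minus sign
    "\u00d7": "&times;",    # multiplication sign
    "\u00e9": "&eacute;",   # e with acute accent
}

# Hypothetical internal tags and their assumed SGML equivalents.
TAG_TABLE = {
    "[TI]": "<title>",
    "[/TI]": "</title>",
    "[P]": "<p>",
    "[/P]": "</p>",
}


def convert_characters(text: str) -> str:
    """Step 20: substitute each listed symbol with its SGML code."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)


def convert_tags(text: str) -> str:
    """Step 21: substitute each internal tag with its SGML equivalent."""
    for internal, sgml in TAG_TABLE.items():
        text = text.replace(internal, sgml)
    return text


def to_sgml(text: str) -> str:
    # Characters first, then tags, following the order given in the text.
    return convert_tags(convert_characters(text))
```

Characters that are absent from `CHAR_TABLE` pass through unchanged in this sketch, on the assumption that the earlier validation stage has already reported any unknown symbols.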

In step 24 the SGML file is passed through a parser to ensure that the SGML is correct. This parser is a tool which exhaustively checks and validates the file against the complete document parameter values. This ensures that the correct set of document parameter values is used, as are the various rules held by the system. This acts as a system check and reports any errors.

An intermediate-output SGML is provided in step 25 and this is used as the basis for the final output. For example, there may be DTD-specific conversion in step 27 to provide a final output SGML file in step 28. Alternatively, there may be journal-specific conversion in step 29 with typeset code editing in step 30 and a PostScript output generated in step 31. Thus, the output SGML file may be converted into the typeset code required to correctly style and display the document for a typesetting system. Because the document provided in step 26 is in an SGML format, many alternatives are possible.

The invention is not limited to the embodiments described, but may be varied in construction and detail within the scope of the claims.

Claims

1. A document production process carried out by a processor having an editor interface and memory access means, the process comprising the steps of: writing a document in a word processor format to memory; writing document parameter rules to memory; automatically correcting the document according to typesetting and document rules; automatically tagging the document according to typesetting and document rules; and automatically converting the document characters and tags to a structured format to provide a structured document.

2. A process as claimed in claim 1, wherein character and tag conversion is performed by automatic comparison with reference characters and tags stored in look-up tables, and wherein the conversion step includes the sub-step of passing foreign objects to a separate process, which converts the foreign object to a text format, and subsequently processing the text to convert to the structured format, and wherein the conversion step comprises the sub-step of separately converting floating elements according to document parameter rules and structure of the floating element, and wherein the process comprises the further step of parsing the structured format code for final validation.

3. A process as claimed in any preceding claim, wherein the document parameter rules are written as an array of flags which activate and deactivate parameter options, and wherein the tagging step involves automatic recognition of elements, and wherein the process comprises the further step of copy-editing the document after tagging by automatically converting words according to a break-down of the word characters, and wherein the copy-editing includes the sub-steps of building an array of document references by automatic recognition and subsequently sorting them according to an operator-inputted sort criterion.

4. A process as claimed in any preceding claim, comprising the further step of automatic pre-conversion validation, in which tags are compared with reference tags and nesting is validated according to the document parameter values, and wherein the pre-conversion validation step includes the sub-step of automatically locating any invalid symbols and generating corresponding error messages, and wherein the pre-conversion validation step includes the sub-steps of automatically identifying references, building an array in memory, and searching to determine if any do not exist in the document.

5. Documents whenever produced by a process as claimed in any preceding claim.