GB2369219A - System for synchronous display of text and audio data - Google Patents


Info

Publication number
GB2369219A
GB2369219A (application GB0118739A)
Authority
GB
United Kingdom
Prior art keywords
word
document
speech
user
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0118739A
Other versions
GB0118739D0 (en)
Inventor
John Christian Doughty Nissen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of GB0118739D0 publication Critical patent/GB0118739D0/en
Publication of GB2369219A publication Critical patent/GB2369219A/en
Withdrawn legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/16 Constructional details or arrangements
    • G06F1/1613 Constructional details or arrangements for portable computers
    • G06F1/163 Wearable computers, e.g. on a belt
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system for synchronously displaying text and audio data comprises display and audio output devices and a user-controlled input device. In use, the system synchronises speech output from the audio device with a word-by-word text display on the display device. The user can control the speed, style and position within the text using the input device. Preferably the input device comprises either a small keypad, similar to those on a mobile phone, or on-screen buttons which are activated by a mouse and which allow a user to navigate through a hierarchical menu. Typically the display may be a small LCD panel on which the words displayed can be sized to fill the screen. In other embodiments of the system the output means can display the text as braille, or as symbols or signs as used in sign language.

Description

SPECIFICATION OF SERIAL PRESENTATION SYSTEM

1. Background and inventive aspects

1.1 Mobile devices

There is a real problem with current mobile information devices, such as mobile phones, palmtops, and PDAs (personal digital assistants), concerning the ability to present information in a universally accessible way, and to allow the user to browse through information quickly and efficiently. The problems arise from the small screen and limited physical space for buttons.
The WAP approach has information marked up on a web site and presented in small chunks, and the user has to step from one chunk to another. In the current invention, the information is analysed at the time of access for syntactic units and for structural and style elements, typically marked up in HTML. The larger syntactic units, such as paragraphs and sentences, may be divided into smaller units and ultimately into words and individual characters. Words or groups of words are displayed serially, where the time between successive displays is dependent either on synchronisation with the speaking of the words using a speech synthesiser or on an approximation to the time it would take if they were spoken. The speed is set by the user so that a whole document can be read without user intervention, as compared to WAP where the user has to step from one chunk to another. But the user can also step through units of different sizes, to scan the document quickly, and get from one point to another in a way which is based on the meaning of the document, rather than how it would appear on the page.
For example, they can step through sentences rather than lines.
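The pacing described above, where the time between successive displays approximates the time the words would take if spoken, might be sketched as follows. This is an illustrative sketch, not part of the specification; the words-per-minute rate and average word length are assumed constants.

```python
# Illustrative pacing model: display time for a word approximates the time
# it would take to speak it, scaled by word length. Constants are assumptions.

def display_duration(word: str, wpm: float = 180.0) -> float:
    """Approximate seconds to show one word, scaled by its length."""
    base = 60.0 / wpm                      # seconds per average-length word
    avg_len = 5.0                          # assumed average word length
    return base * max(len(word), 1) / avg_len

def schedule(words, wpm=180.0):
    """Return (word, start_time) pairs for a word-by-word serial display."""
    t, out = 0.0, []
    for w in words:
        out.append((w, t))
        t += display_duration(w, wpm)
    return out
```

The user's speed setting maps directly onto the `wpm` parameter, so a whole document can be read without further intervention.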
With the word-by-word display, or display of a small group of words, it is possible to have relatively large characters on the screen, thus improving legibility, speed of reading, and accessibility for people with visual impairment. By allowing synchronised speech, accessibility is given to people who may be blind, dyslexic, illiterate, or otherwise "print disabled". This increases the market for the mobile device.
The display of the invention has a main field for reading the document, and subsidiary fields in which structural elements of a document, such as headings, and presentation values, such as speed, can be shown. However these elements and values can be read out in the main display.
This caters for the situation where the display has little or no space for subsidiary fields.
Another aspect of the current invention concerns the control of such a device, and allows control of serial presentation and stepping, all using a few keys or buttons. A state machine can be used in the implementation, and the interface can be dynamically changed (e.g. simplified to fewer levels for a child) by changing the definition of the states and transitions to which the state machine is working.
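The table-driven state machine mentioned above could be sketched like this. The state names, key names and transition tables are invented for illustration; the point is that swapping the transition table is what dynamically simplifies the interface.

```python
# Illustrative table-driven state machine. Replacing the transition table
# at run time changes (e.g. simplifies) the interface, as the text suggests.

class KeyStateMachine:
    def __init__(self, transitions, start):
        self.transitions = transitions     # {(state, key): new_state}
        self.state = start

    def press(self, key):
        # Unknown (state, key) pairs are ignored, leaving the state unchanged.
        self.state = self.transitions.get((self.state, key), self.state)
        return self.state

# Full interface: word and sentence levels (invented example).
full = {("word", "Up"): "sentence", ("sentence", "Down"): "word"}
# Simplified interface for a child: level changes disabled.
simple = {}

sm = KeyStateMachine(full, "word")
sm.press("Up")            # moves to the sentence level
sm.transitions = simple   # dynamically simplify the interface
sm.press("Down")          # level change now has no effect
```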
1.2 Reading and writing

Another area of application of the invention is as a reading and writing aid. In this case there may be no limitation on screen size or space for buttons, and a conventional computer monitor and keyboard may be used. The invention allows print disabled people to read. Through combination of the invention with an editor, it allows such people to write as well as to read.
Writing using just four buttons is possible using the means described below, see 3.10.
1.3 Learning to read, speak and write, or to communicate in other ways

The invention can be used as a language learning aid. For example, where the language of the synthesiser is not the first language of the user, the user is able to see the words and hear them at the same time, and associate sight with sound. The user can write words and immediately hear them spoken. Thus the invention can be used in the teaching of, say, English as a foreign language. It can be used in teaching literacy skills to children with special needs. An embodiment below describes the system in terms of adult and child usage in this context. The simplicity of operation by the child using the cursor keys is important in this context.
The invention has other optional output modalities. There is a tactile modality, which may be dynamic refreshable Braille, or a code for serial tactile output of characters or phonemes across fingers and thumb. There is a symbol modality, in which words are associated with symbols, according to the PCS or Rebus symbol sets for example. And there is a sign language modality, in which words are associated with signs, according to British Sign Language (BSL) for example.
The invention allows output of a word in different modalities to be synchronised. As a word is displayed or spoken, the corresponding symbol (if there is one) is displayed in a separate window. There is a database of symbols associated with words in a wordlist. In the case where there are several symbols for a single word, such as "present", the choices can be displayed together. However, the disambiguation can be recognised in mark-up, so that the appropriate choice of symbol is displayed.
A similar arrangement is possible with sign language, where there is a database of signs corresponding to words. As a word of text is displayed or spoken, the corresponding sign is displayed in parallel and synchrony.
1.4 Web application

In a certain embodiment, the speech synthesiser is mounted on the web server, but the presentation is still controlled from the user's client. Speech is sent from the server as an encoded file, such as a ".wav" file, to the client. However, this file, or a parallel file, has extra coding to mark word boundaries, allowing synchronisation of the serial visual display with the speech. Typically the server sends a sentence in advance of the sentence that the user is reading. The files are stored in the client, for reuse, in case the user wants to step backwards a sentence or more. The user can step forwards and backwards a word at a time, and the relevant word is extracted from the file, using the word boundary markings.
In another embodiment, the speech synthesiser is split into a front end and a back end, with the text converted to a phonetic notation by the front end, and the notation passed from the server to the client machine and then converted to speech sounds on the user's client. The former conversion may be done at an earlier stage, so that a web page contains both a text version and a corresponding static phonetic version. The latter conversion may be performed by an applet previously downloaded from the web site, or by a program allowing the conversion of a stream of standard phonetic notation from any web site.
Such a notation might be used in a pronunciation dictionary. It would typically contain allophones and syllable stress markings, but also word stress and intonation, in order to give a complete inflection for each word, and thus for the sentence as a whole.
The notation can allow a reverse translation, back to words of text, so that the software (e.g. an applet) on the client can display each word as its synthesised sound is output, i.e. the word is displayed at the same time as it is spoken.
The notation can also disambiguate words that have several meanings. For example it can use mark-up similar to that proposed for the semantic web. This disambiguation is useful if there is a choice of symbol according to the meaning.
This system allows a web site to be made accessible to a visitor without the need for assistive technology. The functionality on the user client can be implemented as an applet, which can be downloaded from the site. If spoken sentences are downloaded as compressed audio in advance of reading, it only requires a bandwidth sufficient for this audio stream, and the system is able to keep up with a reader reading through the page. The synthesiser can be implemented as a servlet, for portability across different server platforms.
Note that the storing of a sentence as a file with word boundaries has advantages in this and other embodiments, because the speech synthesiser can work out inflection and pronunciation for each word appropriate for the sentence as a whole. Thus the user can stop, start and step through the sentence while this inflection and pronunciation is maintained for each word.
Normally if you give a synthesiser a part of a sentence or a single word, it cannot reproduce the inflection and pronunciation it would have given in the context of a complete sentence.
The same approach can be used with pre-recorded speech, such as that produced by an actor for a talking book. This speech is divided, typically into sentences, and each sentence is marked up with word boundaries so that the display can subsequently be synchronised with the speech.
The mark-up may be virtual, e.g. a time indication of when each word ends with respect to the beginning of the sentence.
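The virtual mark-up idea, where each word is marked only by the time at which it ends relative to the start of the sentence, might look like the following sketch. The timing values are invented for illustration.

```python
# Illustrative sketch of "virtual" word-boundary mark-up: each word of a
# sentence recording is marked by its end time relative to sentence start.
# From those end times, the audio span of any word can be recovered, e.g.
# to replay a single word when the user steps backwards.

def word_spans(end_times):
    """Turn per-word end times into (start, end) spans within the audio."""
    spans, start = [], 0.0
    for end in end_times:
        spans.append((start, end))
        start = end
    return spans

# Word n of the sentence occupies spans[n] within the sentence audio file.
spans = word_spans([0.30, 0.75, 1.40])   # three words, invented timings
```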
1.5 Maintaining context

It is important that the reader is aware of the current context. For this reason it is useful to be able to read the title of the current document, the proportion of the document that has been read, the heading of the current section, and the closest previous link. These values are maintained while the position of presentation (i.e. where you are reading in the document) changes. For example, when searching through a document for instances of a particular word, the context is provided as each instance is examined.
In another embodiment, the reader is presented with just two levels in the tree: the current level on which the operations apply, and the level above, which gives the user the context. For example, when setting the year, the word "year" might be seen above its value, say "2001", see figure 5.
1.6 Consistent presentation

With this invention it is possible to treat all information in the same way, i.e. with the same serial presentation and the same means of control. Structural information and sets of presentation parameters (or style sheets) can be treated as meta-documents associated with the "principal" document, i.e. the document being read. The meta-documents are in effect subtrees of a main tree, for example as shown in Figure 5.
2. Description of invention

The system comprises one or more serial computer output means, and one or more computer input means by which the presentation of the serial output is controlled and by which the content of that output is selected; where the content can be a document, and the presentation can be controlled by the input with respect to style, speed, and position within the document; and the selection of a document for presentation involves navigation of documents associated through a directory structure and/or hypertext linkage.
The serial output means typically comprises a word-by-word serial visual presentation synchronised with the speaking of the said words by a speech synthesiser. The control input means typically comprises a small number of buttons or keys.
This combination of serial visual presentation, synthesised speech and operation with a small number of buttons or keys allows use of the system as a user interface for small hand-held or body-worn devices providing access to information services and/or reading material, where the physical size of the device necessitates a small display and a few buttons.
This combination of serial visual presentation and synthesised speech allows use of the system as a user interface to electronic text for people with a visual impairment, a visual processing impairment (such as dyslexia), a general learning difficulty, illiteracy, or any other difficulty in reading the written word in the language of the system (for example if this language is their second language). The system can also be used for language learning.
The operation with a small number of operations allows the system to be used by a person with limited manual dexterity. It allows for a simple variant with an alternative input based on speech recognition of a few commands corresponding to each of these operations.
An embodiment of the invention has four basic operations and a further operation to change mode, so can be operated using five keys, buttons or voice commands. Typically the four basic operations would be activated by the four arrow or "cursor" keys: Up, Down, Left and Right. The fifth key might be the tab key.
The serial visual output of the system can be accompanied by visual display of significant text from, or associated with, the document being read via the serial output. Such text can include headings in the document, the link text from hypertext links embodied in the document, and the title of the document. Each piece of significant text can have one or more associated values; for example the link text has an associated URL but could also have a binary value indicating whether the target of the link has been visited or not.
The serial visual output of the system can be accompanied by visual indication of the values of parameters associated with the serial presentation, both visual and audio. This visual indication has a textual form, such as a percentage number, but may also have an analogue form, such as a progress bar.
The abovementioned pieces of significant text with their associated values, and the above parameters with their associated values, are stored as lists, which can be examined by the user through list operations. The system allows values to be changed, and items to be added to or deleted from the lists, where it is appropriate to do so. One aspect of the invention is that parameters, lists and embedded objects can all be examined and manipulated using the same means of navigation and control. Thus a single paradigm can cover the whole system; this makes the system easier to use and to learn to use.
Normally, at any given moment, there is one document which is the principal text, and this is presented through the "main field". Information about this document can be extracted and inserted into lists. One of the lists can be a list of hypertext links within the document; another can be a list of headings within the document. Such lists can be considered as "meta-documents", as they are documents about documents. In the case of the headings list, it describes the structure of the principal document.
The presentation parameters and their values can also be grouped together into lists, which can be considered as style sheets, another kind of meta-document. This kind of meta-document can also be used for changing the values of parameters, and thus changing the characteristics of the presentation of the principal document.
The document may contain embedded objects, such as tables, which have a structure in their own right. When the user reads through the document and reaches an embedded object, the system provides a view of the embedded object as a self-contained document nested inside the "outer" document.
The principal document and its associated meta-documents and nested documents can each be navigated and controlled as a document. The system provides the means for changing the focus (for user navigation and control) between the principal document and the other documents.
Each of these associated documents can be serially presented (e.g. through a word-by-word display or speech) through the same output means. Thus for visual presentation, the same field can be used for the display; so effectively this field is time-shared between the display of the words from the principal document and from the other documents, depending on the current focus for navigation and control. The style of presentation may be changed while on another document, so that the user is aware that they are not dealing with the principal document. For example, a different voice or voice pitch can be used for speech, and a different character size for text display.
However, the system also allows the user to continue serial presentation of the principal document while altering the focus to one of the other documents. This allows one of these other documents to be navigated and changed, which may have an instantaneous effect on the position or style of presentation within the principal document. For example, if the current heading is changed in the heading list, the position of presentation changes so that presentation continues at the point of that heading. As another example, if the colour is changed for a presentation parameter, then the visual presentation of the principal document continues in that colour. The navigation and control of the "other" document is performed in parallel with the presentation of the principal document.
An embodiment of the system can handle multi-way communication by Internet Relay Chat.
The log of the conversation or "conference" is treated as the principal document, which is appended to as text arrives from parties to the discussion. The input by the user is treated as an associated document, which is cleared as the input is sent. The focus can be moved between these two documents. The user has the options (a) of having the presentation of the input as the user types it, interrupting any presentation of the text arriving from other parties to the discussion; or conversely (b) of having the presentation of the principal document interrupt the input as new text arrives; or (c) of having the presentation move with the focus under user control.
An embodiment of the system can handle the presentation of synchronised multimedia, where the synchronisation and relative timing of parallel output streams may be defined by mark-up of the material to be presented, according to a standard such as SMIL. Such streams typically include text and accompanying audio, images and video. The system allows the user to change position in any one stream, and adjusts the position in the other streams according to defined rules of synchronisation and relative timing. For example, if there are pictures associated with chunks of text, and the user changes the position of reading of the text from one chunk to another, then the picture will change accordingly; while conversely, if the user changes one picture for the next, the position of the reading of the text is moved to the beginning of the next chunk.
An embodiment of the system includes means to synchronise audio output of words of recorded speech with visual output of words in a serial display of the text which was spoken.
This allows the word-by-word display and control to be used in a multimedia presentation, where one of the streams is recorded speech, e.g. for a talking book. The user can navigate the text, and control the style of presentation, while maintaining the synchronisation at the word level. The means of synchronisation can include: a speech recognition engine which converts speech to text; a correlator which compares and aligns this text with the text being spoken; a marking process whereby the digitised speech recording is marked with word-beginning and/or word-ending codes; and a search engine that looks for these marks when speech output is to accompany visual output of the words.
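The correlator step could be sketched as below. This is a deliberately naive illustration, assuming the recogniser may drop or mangle words but never inserts extra ones; a real correlator would need proper sequence alignment.

```python
# Illustrative correlator: the recogniser yields (word, end_time) guesses,
# which are aligned against the known text so that each word of the true
# text receives an end-time mark where possible. Naive assumption: the
# recogniser may drop words but never inserts extra ones.

def align(true_words, recognised):
    marks, j = {}, 0
    for i, w in enumerate(true_words):
        if j < len(recognised) and recognised[j][0] == w.lower():
            marks[i] = recognised[j][1]
            j += 1
    return marks   # {index of true word: end time in seconds}

marks = align(["Hello", "there", "world"],
              [("hello", 0.4), ("world", 1.1)])
# "there" was missed by the recogniser, so it receives no mark
```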
The system can be extended to deal with tactile output in the form of a dynamic Braille multi-character display, a Braille single-cell display or another tactile display. The tactile output can be synchronised with speech, the speed of speech being adjusted to allow the user to read the tactile display, e.g. by lengthening gaps between words. As an aid to learning the tactile code, the speech may be delayed with respect to the tactile display.
In general the system can deal with the serial presentation of any set of data streams (or sequences) where the data stream can be divided into units of different sizes. For example a table can be divided into rows, and rows into individual cells.
3. Some implementation details

3.1 FIVE KEY OPERATION

For use on a keyboard or keypad with four cursor keys, and at least one other control key, there is a means of operating the system with five keys. Four of the keys are used for Left, Right, Up and Down operations, these operations changing the "state" of reading of a sequence of data units (typically textual), considered as a "document". A fifth key is used for changing from one document to another, while leaving the first document in its current state, i.e. without disturbing the reading process on that document. The first document may be the "principal" document and the second document may be a meta-document, i.e. a document about the "principal" document, e.g. controlling the presentation (speech and/or display) of the principal document.
The state of reading of the sequence of data units can be shown by a diagram, see figure 1.
This state-chart shows how, with four operations, you can control the reading of the sequence of units, allowing you to:
* pause at any point;
* continue reading from that point;
* step back to the beginning of a unit;
* repeat the reading of a unit;
* step back to the previous unit;
* step forward onto the next unit;
* change the size of the unit.
The unit size is considered as a level, and, in this example embodiment, there is a document level as the top level, with sentence, word and spelling levels underneath. At the spelling level, the unit is a single character. The Up and Down operations change the level (i. e. the unit size) up and down respectively.
The Down operations are accompanied by the output of a "system" message, typically in a different voice for speech output, to distinguish it from the voice used in reading the document itself. Such system messages provide feedback to the user as to the state of the system and the level.
Right is used for moving forward through text, and Left for moving backward. These operations would be interchanged for use in a language like Arabic, where printed text is read from right to left.
Other interchange of operations may be appropriate for other situations, e. g. Up and Down instead of Left and Right for moving backwards and forwards through Chinese text.
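The level/unit hierarchy of section 3.1 might be sketched as follows. The splitting rules are deliberately naive (sentences split on full stops only) and purely illustrative.

```python
# Illustrative sketch of reading levels: the same text can be stepped
# through as one document, as sentences, as words, or character by
# character at the spelling level. Splitting rules are deliberately naive.

def units(text, level):
    if level == "document":
        return [text]
    if level == "sentence":
        # Naive: split on full stops only; real text needs better rules.
        return [s.strip() + "." for s in text.rstrip(".").split(".")]
    if level == "word":
        return text.replace(".", "").split()
    if level == "spelling":
        return [c for c in text if not c.isspace()]
    raise ValueError(level)

doc = "Hello world. Good day."
units(doc, "sentence")   # two sentence units
units(doc, "word")       # four word units
```

The Up and Down operations then simply move between these level names, and Left/Right step within the unit list for the current level.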
3.2 ALTERNATIVE FIVE KEY OPERATION

An alternative state chart is shown in figure 2, which provides these same capabilities. In this implementation there is feedback for all level change operations.
Left is used for pausing and then stepping left. Right is used for playing a unit and then stepping right. After playing to the end of a unit, a Left takes you into the pause state with a message to say that you have reached the end of the unit, and a Right starts you reading the next unit (unless you are at the end of the document, in which case you are told this). In general, after a Left, the Right will continue the reading from the point reached; the following Right will start reading the next unit. In general, after a Right, the Left will pause the reading, and the following Left will step you back to the first word of the unit; but if you are on the first word it will step you to the first word of the previous unit.
Up and Down are used for going up and down a level. Note that in this alternative, the Left and Right have no effect on the level. The fifth key can have the same effect as before: to change you from one document (or meta-document) to another.
3.3 THREE KEY OPERATION

Note that by having the level changes in a cycle, one can replace the Up and Down operations with a single operation. This allows operation with three keys plus a fourth to change document.
A further reduction to three keys is possible. The Left and Right operations are as in Figure 2, while you are on a level. But after the third key is operated, the Left and Right are used to step left and right in a list of options, some of which may be to change level, others to change to an associated document, others to take a link (e.g. if the pause position in the document is over link text) which may be to another document, and others to perform operations appropriate to the level and/or to the type of document or meta-document that you were reading. One of the options may take you to a further list of options. There will normally be an option to take you back to where you were in the document, with a null action. The third key is operated a second time to take the action associated with the option selected using the Left and Right operations.
These types of operation might be most suitable in an interface for a very small device, such as a wearable computer worn on the wrist like a watch, see below.
3.4 WRIST-WORN EMBODIMENT

An embodiment is worn on the wrist like a wrist-watch. The buttons are placed such that they can be reached and pressed by fingers of the other hand, reaching around the wrist, see figure 4. This enables the user to read documents and control their presentation while viewing the text at the same time. This embodiment can use one of the operation means described above, with three to five buttons/keys.
3.5 SYNCHRONISATION

Synchronisation of the serial visual display with speech depends on how and when the speech is generated.
In the case of synchronisation with synthesised speech, before (or after) each word is output from the speech synthesiser, an interrupt message is sent to the display processing software to change the display to show this (or the next) word, or, for multi-word display, to move onto the next group of words if appropriate.

In the case of pre-synthesised speech, or in the case of natural speech which has been recorded and marked with word boundary markers, the display is moved on to the next word as a word boundary is reached in playing the encoded speech image (typically a sentence). The word boundaries may be indicated by codes embedded in the encoded speech file, or by codes in a parallel file. Such a parallel file could contain timing information about when each word finishes in relation to the start of the sentence.

3.6 CHANGING PARAMETER VALUES

The meta-document allows user input to change parameters. An HTML form with radio buttons and select options is treated in a similar manner.
Each parameter has a list of values. When a parameter is interrogated, the current value is simultaneously displayed and read out. A value can be changed by viewing the list of values and selecting a different value. In the five key implementation, the Left and Right operations may be used to view the list and the Down operation may be used to select a value.
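The parameter model of section 3.6 could be sketched like this. The parameter names and values are invented; the circular stepping follows the Left/Right/Down operations described above.

```python
# Illustrative parameter model: each parameter holds a list of values;
# Left/Right view the list (circularly) and Down selects the viewed value.

class Parameter:
    def __init__(self, name, values, index=0):
        self.name, self.values, self.index = name, values, index
        self.current = values[index]       # the value actually in effect

    def left(self):
        self.index = (self.index - 1) % len(self.values)

    def right(self):
        self.index = (self.index + 1) % len(self.values)

    def down(self):
        """Select the viewed value, making it current."""
        self.current = self.values[self.index]

speed = Parameter("speed", [120, 180, 240], index=1)
speed.right()   # view the next value in the list
speed.down()    # select it: the presentation speed changes
```

Note that merely viewing values with Left/Right does not change the presentation; only Down commits the change, matching the behaviour described for the five key implementation.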
3.7 LISTS AND LISTS OF LISTS

The meta-document is a structure containing lists. Such lists may contain names and addresses, words and their pronunciations, etc. Lists of lists can be used to implement tables.
A list can be considered as a single unit at one level, items or rows at the next level, words at the next level, and characters at the bottom level. This allows the list to be examined using the five key operation above.
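Viewing a list as units at several levels, as just described, might be sketched as follows; the example list is invented.

```python
# Illustrative sketch of section 3.7: a list of lists viewed as units at
# four levels (whole list, items/rows, words, characters), so the same
# five key stepping applies to lists as to documents.

contacts = [["Alice", "Smith"], ["Bob", "Jones"]]

def units_at(level):
    if level == 0:                       # the whole list as a single unit
        return [contacts]
    if level == 1:                       # items (rows)
        return contacts
    if level == 2:                       # words
        return [w for row in contacts for w in row]
    if level == 3:                       # characters
        return [c for row in contacts for w in row for c in w]
    raise ValueError(level)
```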
3.8 HIERARCHY TREE: PRINCIPLES OF OPERATION

All the information in the system can be kept in a single hierarchy of documents, i.e. a single directory tree, which can be navigated as a single structure, e.g. by five key operation. This is a suitable embodiment for the watch, where the fifth key might be a help key or emergency button.
Parameters and their values can also be arranged within a tree structure, see figure 5.
It works on the following principles:
* All functions are on a tree, whose top is called "Menu".
* The tree is navigated using 4 cursor keys: Up, Down, Left, Right.
* The tree is a list of lists of lists, etc.
* List items are arranged left to right; the tree has the root at the top, and leaves at the bottom.
* You can perform an operation or change a value by going 'Down' from a leaf identifying that operation or value. You then automatically return to reading the current document.
* If you don't want to perform an operation or change a value, you can exit by going back 'Up' to the top of the tree and then, with a further 'Up', returning to reading the current document.
* Some lists are circular (e.g. the days of the week, seen when "setting time").
* For non-circular lists of quantitative values, 'Right' increases the value, and 'Left' decreases the value.
For playing the current document, there is a simpler state machine than shown in Figure 1 or 2, since the stepping size is chosen from the tree, see Figure 5.
* The "main field" has a word-at-a-time display of the current document.
* At any moment, the UI "focus" is either on the main field or on the tree.
* A 'Down' moves the focus from the main field to the top of the tree.
* The text may continue playing in the main field, while the focus is moved to the tree, and an operation performed or a value changed. Either when you select a value by going Down from a leaf, or when you leave the tree by going Up from the top of the tree, the focus is returned to the main field.
For reading the text in the main field there is a normal mode (corresponding to the document level), and a stepping mode (for the other levels).
* The stepping size is a value on the tree, which can be set.
Normal mode:
* While the "focus" is on the main field, and in normal reading mode (not stepping), 'Right' plays the text; then a 'Left' pauses.
* When the end of text is reached, you are paused on the last word; a further 'Right' gives an "End of document" message.
* When paused, a 'Left' takes you back a word.
* If you do a 'Left' when you are on the first word, you get a "Start of document" message.
Stepping mode:
* In stepping mode, after the focus is returned to the main field, a 'Right' continues playing the unit, and a 'Left' takes you to the start of the current unit.
* A 'Right' after another 'Right' takes you to the next unit, unless you are in the last unit, when you get "End of document".
* A 'Left' after another 'Left' takes you to the start of the previous unit, unless you are in the first unit, when you get "Start of document".
* A 'Right' after a 'Left' plays the current unit.
* A 'Left' after a 'Right' takes you to the start of the current unit.
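The stepping-mode rules above depend only on the previous key pressed, so they can be sketched as a tiny state machine over a list of units. This is an illustration of the described behaviour, not code from the patent; names and return strings are ours.

```python
class Stepper:
    """Step through units (sentences, words, ...) with Left/Right keys."""
    def __init__(self, units):
        self.units = units
        self.pos = 0
        self.last_key = None

    def right(self):
        if self.last_key == "R":              # Right after Right: next unit
            if self.pos == len(self.units) - 1:
                return "End of document"      # boundary message
            self.pos += 1
        self.last_key = "R"                   # Right after Left: play current
        return "play " + self.units[self.pos]

    def left(self):
        if self.last_key == "L":              # Left after Left: previous unit
            if self.pos == 0:
                return "Start of document"    # boundary message
            self.pos -= 1
        self.last_key = "L"                   # Left after Right: start of current
        return "at start of " + self.units[self.pos]
```

For example, two Rights in a row advance a unit, while a Left after a Right only rewinds to the start of the current unit.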
3.9 THE TREE ITSELF - See Figure 5
At the top is "Menu". Below this are the top-level functions that are provided by the system: Time, Mode, Navigation, Controls, and so on. Figure 5 is a simplified example of the tree, fleshing out 'Time' in more detail than the rest. Note that there can be any number of alarms, and these are numbered 1, 2, etc. Under 'Year' will be its value '2001', which cannot be changed. However, under each alarm is a time and a period which can be changed. (The mechanism is not shown in the tree.)

3.10 EDITING AND CHARACTER INPUT
An editing function is required for web navigation in order that the user can input URLs. It is also required for HTML forms where there are text entry fields.
The five-key operation for reading can be used at the same time as editing, since the five keys (four cursor keys plus tab) are independent of typed input of print characters, carriage-return/line-feed, and typical edit operations (such as control-C for copy). For the blind user, there can be immediate speech feedback as characters or words are typed, and then the passage can be read using the cursor keys. The fifth key (say tab) can take you out of edit mode, and back into normal reading of the principal document.
A special method of input of text is possible using four keys and the tree structure. For the first letter of a word, you go Down. Then you are presented with a circular list of groups, such as ABCD, EFGH, IJKLMN, OPQRST, UVWXYZ; these are presented as a branch of the tree, which you can traverse using Left and Right. You select one group with a Down, and then you are presented with the individual letters of that group to choose between, and you select one with a second Down. You are then presented with the circular list of groups again, and you can go on to select the second letter, third letter, and so on. Thus there is a two-stage selection for each character. When you finish the word you do an Up, and you can hear the word spoken to give you feedback. Then you can do a Down to start the next word, or do a Left or Right for other possibilities, such as selection of punctuation mark, or change to (or from) capitalisation.
A similar method of input of numbers is possible, but only one stage is needed. Again the numbers are presented as a circular list. On Down, you are presented with '0'; then a single Right will give you '1', two Rights give you '2', and so on.
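The two-stage letter entry and one-stage digit entry can be sketched as follows. The letter groups follow claim 18 (ABCD, EFGH, IJKLMN, OPQRST, UVWXYZ); the function names and the encoding of key sequences as counts of Rights are our own illustrative choices.

```python
# Letter groups as given in claim 18 of the patent.
GROUPS = ["ABCD", "EFGH", "IJKLMN", "OPQRST", "UVWXYZ"]

def select(options, rights):
    """Circular list: each Right advances one item; a Down selects it."""
    return options[rights % len(options)]

def type_word(stages):
    """stages: one (rights_to_group, rights_to_letter) pair per character.
    Each number counts Rights before the Down that selects."""
    word = ""
    for to_group, to_letter in stages:
        group = select(GROUPS, to_group)   # first Down selects the group
        word += select(group, to_letter)   # second Down selects the letter
    return word                            # an Up would finish the word

def type_digit(rights):
    """Digits need only one stage: Down presents '0', each Right adds one."""
    return select("0123456789", rights)
```

For example, spelling "CAB" takes three stages, each choosing group ABCD and then a letter within it.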
3.11 LEARNING
There are various modes of using the invention, in which the output modalities can be combined:
* output modalities in parallel and synchronised, e.g. speech with word display;
* using a modality as a check or prompt, e.g. speech for the reader to check a word; for example the user can click on a particular button (e.g. a fifth button) to hear the word, and the word is recorded in a word list;
* using one modality to follow another, e.g. so the user sees the word before hearing it;
* the user inputting in one modality, before, during or after the system has output a word in the same or another modality, e.g. so the user sees the word and tries speaking it.
The system can record the user's input, either as separate words or as phrases or sentences, and then synchronise with the corresponding output of the system in a different modality. For example a user can type words as they are spoken, and then play back with the speech and text displayed in synchrony. Or the user can speak words as they are displayed, and then play back the text with their own speech synchronised to it. If the recording is done a word at a time, the synchronisation can be precise. An instructor can listen and watch the playback, to spot errors.
The system can thus be used as a language laboratory. For example, the user can read silently, then read with the synthesised or instructor's voice, then read a word and record their own speech, then hear their speech back, synchronised with the text. The system records each word spoken along with the text of the word, so it can be played back at a later moment. There can be a record of the instructor's spoken word as well as the student's, so they can be played together, or alternating.
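Because recording is done a word at a time, precise synchronisation reduces to storing each recorded clip against the word of text it belongs to. The sketch below illustrates this data layout for the language-laboratory use; the structure and names are our own, not taken from the patent.

```python
def make_lesson(words):
    """One entry per word of text, with slots for each speaker's clip."""
    return [{"text": w, "instructor": None, "student": None} for w in words]

def record(lesson, index, speaker, clip):
    """Store a recorded clip (e.g. a filename or audio buffer) against
    the word at position `index` for the given speaker."""
    lesson[index][speaker] = clip

def playback(lesson, order=("instructor", "student")):
    """Yield (word, clip) pairs word by word, so the two voices can be
    played together or alternating, each aligned with its word of text."""
    for entry in lesson:
        for speaker in order:
            if entry[speaker] is not None:
                yield entry["text"], entry[speaker]
```

An instructor reviewing the playback simply walks the same word list, so each spoken clip appears against the word that was displayed when it was recorded.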
4. Mixed paradigm embodiment
In one embodiment of the invention, aimed at universal accessibility, there are three means for you (the user) to operate the system: there is the "five key" operation, there is full keyboard operation, and there is "point and click" operation, typically using a mouse.
There is a "main field" which is used for word-by-word display of the document you are reading, and also for editing names and values when in "editing mode".
4.1 SPECIAL NEEDS SUBSET
The embodiment includes a subset of functionality designed for a person (assumed to be a child for the purposes of the description below) with special needs, specifically for assistance in reading. It provides five-key and mouse operation to give access to a subset of the full system functionality. The full functionality is available, via keyboard and mouse operation, to a supervisor (assumed to be an adult for the purposes of the description below).
4.2 LIST OPERATIONS
In this embodiment, each list item is a duple (name, value), with "=" as separator. For example, the bookmark list item is a name and its associated URL address.

Each list can be scanned either by cursor keys, or by "short cut" operation from the keyboard, or by clicking on buttons or fields associated with the list. There are a number of operations for acting on a list, e.g. a list of bookmarks: Enter (Enter), Cancel (Esc), Copy (control-c), Add (control-a), and Remove (control-d).

The Copy operation copies the current item to the main field where it can be edited if necessary. The item in the edit field can be added to or deleted from certain lists, using Add or Remove. Such lists have items arranged in alphabetic order. When editing is finished, the Escape operation takes you back to reading the document. If the item in the edit field has a URL part to it, you can follow this link using the Enter operation, whereupon a new page or document is fetched and editing mode is left.
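The Copy, Add and Remove operations on a "name=value" bookmark list, with items kept in alphabetic order, can be sketched as below. Function names and the example URLs are illustrative, not from the patent.

```python
import bisect

def copy_item(bookmarks, index):
    """Copy: render the current (name, value) duple as 'name=value'
    text for the main field, where it can be edited."""
    name, value = bookmarks[index]
    return name + "=" + value

def add_item(bookmarks, edited):
    """Add: parse the edited 'name=value' text and insert it,
    keeping the list in alphabetic order by name."""
    name, value = edited.split("=", 1)
    bisect.insort(bookmarks, (name, value))

def remove_item(bookmarks, edited):
    """Remove: delete the item matching the edited text."""
    name, value = edited.split("=", 1)
    bookmarks.remove((name, value))
```

Splitting on the first "=" only means the URL part may itself contain "=" characters, as query-string URLs often do.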
4.3 CHILD CONTROLS
The facilities for the child allow keyboard or mouse operation of a very limited set of functions:
* play through a document;
* play, step forward and step backward a paragraph, sentence or word;
* step through the characters of a word to spell it.
The document is divided into units: the document itself, its sections, paragraphs, sentences, words and characters. The child can change the unit size or"level", so that the system plays, or steps, through a document, paragraph, sentence, word or character.
4.4 CHILD SCREEN
There is a simple window on the screen for the child, showing:
* the title of the document (such as a book);
* a percentage showing how far they are through the document;
* the current section heading;
* the current word;
* the set of four arrow keys arranged as an inverted T;
* an indication of the level (document, paragraph, sentence, word or spelling);
* an indication of the state (paused, playing or stopped at the end of a unit).

The current word is shown in the "main field".
4.5 ADULT CONTROL
The adult can use either mouse or keyboard to access full functionality.
The adult can set up a booklist and the particular book to be read by the child. The adult can navigate hypertext, select links, edit them, etc. The adult can control the display and speech parameters, for example the size of text in the main field, and the volume of the speech.
To select a field, the user can click on the field itself, and the system will read the full value in the field. The user can then click on up and down buttons to change the value. In the case of navigation fields, the user can press Enter to select the address to go to, or Copy to put it in the main field for editing.
For control of display and speech parameters using the mouse, the user has a button which brings up a dialogue box giving each parameter name and its value, for example:
* "Magnify" and the approximate character size (in points) of text in the main field;
* "Text" and its colour;
* "Paper" (background) and its colour;
* "Volume" and its value (on a scale 0-9);
* "Modality" and its value (see below);
* "Speed" and its approximate value in words per minute.

The value is applied when you click on the "OK" button in the box.
Correspondingly there are "short cut" operations using upper case, lower case and control characters:
* M increases text size, m decreases it, and control-m queries it;
* T and t take you up and down the list of colours for text; control-t queries the colour;
* P and p take you up and down the list of colours for paper; control-p queries the colour;
* V increases the volume, v decreases the volume, and control-v queries the volume;
* + increases the degree of speech, = decreases it, and control-= queries it;
* S increases the speed, s decreases it, and control-s queries it.
The effect of the short-cut operations is immediate, i.e. there is no need to press an OK button.

When selecting colours by short-cut, you cannot select a text colour the same as the current paper colour, or vice versa.
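A dispatch for these short-cut keys can be sketched as below: an upper-case letter increases a parameter, its lower-case counterpart decreases it, and a control combination (modelled here as a "^" prefix, an assumption of ours) queries it. Only three parameters are shown, and the clamping to the 0-9 volume scale follows the dialogue-box description above; everything else is illustrative.

```python
def shortcut(params, key):
    """Apply one short-cut key to the parameter dict and return the
    resulting (or queried) value. '^x' stands for control-x."""
    table = {"v": "volume", "s": "speed", "m": "size"}
    if key.startswith("^"):                 # control-letter: query only
        return params[table[key[1]]]
    name = table[key.lower()]
    step = 1 if key.isupper() else -1       # upper case up, lower case down
    params[name] += step
    if name == "volume":                    # volume is on a 0-9 scale
        params[name] = max(0, min(9, params[name]))
    return params[name]
```

As the text says, the effect is immediate: each keypress updates the live parameter with no confirming OK.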
4.6 MODALITIES
The system has several modalities of operation, with different degrees of speech, including a modality where there is no speech. When there is no speech, words are displayed for a time which is a function of the speed setting and of the types and numbers of characters in the word, with a longer time for punctuation. Otherwise the display is synchronised with the synthesised speech.
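The no-speech timing rule can be illustrated as follows. The patent specifies only that the time depends on the speed setting and the character types, with longer times for punctuation; the particular weights below are our own assumptions for the sketch.

```python
def display_time(word, speed_wpm):
    """Seconds to show `word` when there is no speech.
    The base time comes from the words-per-minute speed setting and is
    scaled by word length, with punctuation weighted double (assumed)."""
    base = 60.0 / speed_wpm                       # seconds per average word
    letters = sum(c.isalnum() for c in word)
    punct = sum(not c.isalnum() for c in word)
    # Normalise against a nominal five-character word (assumed).
    return base * (letters + 2 * punct) / 5.0
```

So at a given speed, long words stay on screen longer than short ones, and a trailing full stop lengthens the display of its word.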
There is also a modality with no text visible in the main field as a word is spoken. The user can type in the word, and have it checked for spelling against the word that was spoken.
There are two independent speed parameters: one for speech when the display is synchronised to the speech, and one for the display when it is not thus synchronised. When using the short-cut operations, the speed refers to the display speed when there is no speech; otherwise it refers to the speech speed to which the display is synchronised.
4.7 EMBEDDED STRUCTURED OBJECTS
As stated above, a document is divided into units of different sizes. However, the document may contain a structured object, such as a table, where the units inside the object have a different set of values, e.g.:
* table;
* row;
* cell;
* word;
* formula;
* characters.
There is a default value depending on context, e. g. cell for tables.
4.8 CHANGING FOCUS USING KEYBOARD
In this embodiment, the focus is moved from the main field to another field by typing control+letter, where the letter is associated with that field. To return to the main field you can type Escape. However, when selecting an address to go to, Enter will return you to the main field for the page or file that you have fetched.

You can use tab and shift-tab to move focus between fields. For numeric pad operation you can use the 1 and 7 keys for "tab" and "shift-tab".
4.9 CHANGING FOCUS USING MOUSE
You can change focus by clicking on a field. You can thus change focus back to the main field by clicking on the main field. You can also change focus back to the main field by clicking on the Escape button. And after Enter, the focus will automatically revert to the main field, as for keyboard operation.
4.10 CHANGING FOCUS WHILE PLAYING
If you change focus while in the playing state, the system should continue reading out, uninterrupted by changes in values. For example, you should be able to change the volume without interruption of the reading of the main text. Note that all fields except the main field and the time are static. The time ticks over regardless of what is happening in other fields, so it is as if it is a "track" playing in parallel with other fields.
4.11 EDITING MODE
Editing mode is entered when you "copy" a value from a field using Copy or control-c, and the main field will display the value "copied", for you to edit. You exit from editing mode either by pressing Escape or by pressing Return. The latter has the result of taking the text you have edited and treating the part after the "=" sign as a URL for a page to be fetched, which then becomes the principal document.
4.12 WINDOWS FOR FUNCTIONS
There can be different windows, or pull-down boxes, for different functions. This is important for situations where there is limited screen space, since such a window or box can overlay the main window. The operations and presentation in the subsidiary windows or boxes may obey the same paradigms as in the main window, and their content may be treated as "meta-documents" associated with the "principal document" being displayed in the main window.
The simplest window is a reading window for the child. There can be other windows or boxes for:
* bookmarks and URL editing;
* forms;
* configuration of speech and display parameters;
* tables;
* search;
* conventional scrolled display of original/source text;
* conventional edit;
* frames;
* dictionary.
It is possible to change from the main (read-only) window to another window, either directly, using the numeric pad, or indirectly, with an existing "short-cut" command, using letters.
4.12.1 READING WINDOW
This will be as for "layout for child", but without the bookmarks, which will now be in a separate window.

4.12.2 BOOKMARKS WINDOW
This will contain the bookmark list, analogous to the list of links, containing name-URL pairs as at present. The main field will be for displaying text read from the list. There will be an edit field, also with resizeable characters, but scrolling.

Whilst in any window, a control-c on a URL or name of a URL-name pair will copy the value into the edit field of the bookmarks window and transfer focus to it, ready for editing and/or addition to the bookmark list. While in the bookmarks window, the Enter command will cause the URL in focus to be used to fetch a file or page, and the focus will be transferred back to the reading window's main field, whilst the name-URL pair is added to the return stack, with the name appearing in the Return field.

4.12.3 FORMS WINDOW
When reading a page and encountering a form, the focus can be transferred to the forms window, allowing the form to be filled and submitted. The forms window will have a main field for reading fixed text on the form, e.g. questions. There will be a text field for typing written answers, and a choice field allowing multiple-choice selection (with radio button or tick box functions).

The text field will be scrolling, with resizeable characters, as for URL editing in the Bookmarks window.

4.12.4 CONFIGURATION WINDOW
This will have the same format as the forms window, but without the editing field. It will allow values of speech and display parameters to be changed.

4.12.5 TABLES WINDOW
When you are reading through an HTML page and come across a table, you can switch to a table window, which will allow you to navigate up and down columns, and across rows, seeing the column and row headings.

4.12.6 SEARCH WINDOW
This will have a format corresponding to a conventional search or find dialogue box. There will be an edit field for typing the search string, and there will be other fields where you can select an option (case sensitive, etc.).

4.12.7 ORIGINALS (SOURCE) WINDOW
This will allow you to look at the original "source" text, i.e. showing the HTML markings if any. It will present the text conventionally in a scrollable window. It will have a cursor corresponding to the position reached in the Reading window.

4.12.8 DICTIONARY WINDOW
This allows you to look at a dictionary definition of a word that appears in the main window.
Notes on Diagrams

FIGURE 1
This shows various states, e.g. 10 shows the "Play all through" state (when the system is reading through the document without stopping). These states are connected by directed lines, e.g. 11 and 12. On these lines: D = down arrow, U = up arrow, L = left arrow, R = right arrow. Ellipses show return to state, to read the next or previous unit (sentence, word or character).
There are levels according to stepping size. The top level is for playing the whole document, then there are levels for sentence, word and character. Pause from top level is achieved by D (see 11), then U to carry on playing (see 12), or L to play current sentence, or R to play next sentence. While playing sentence, you can press U to continue playing to end of document (back to top level), or L to play previous sentence, R to play next sentence, or D to pause on a word (ready to step through words).
The system can be extended with levels for intermediate sizes, e.g. section and paragraph levels between document level and sentence level. There is a pair of states for each level, as for the sentence and word levels above, with corresponding interconnections (e.g. paragraph to sentence has the same interconnection pattern as sentence to word).
FIGURE 2
Figure 2a shows the operation of the four cursor keys (Left, Right, Up and Down) in the "five key operation" embodiment. There are two basic states, 13 and 14, at the higher levels, and one basic state, 15, at the lower (word and character) levels.

With three-key operation, two keys are Left and Right, operating as shown, and the third key takes you from one of the levels into a list of options, see 16, as shown in Figure 2b. Operating the third key again then takes the action selected from the options using Left and Right.
FIGURE 3
This shows a window on the screen for a particular embodiment.
17 shows the title field, 18 shows the heading field, 19 shows the progress indicators, 20 shows the main field, and 21 shows the cursor buttons.
FIGURE 4
This shows an embodiment as a wrist-worn unit. 22 shows the display. 23 shows one of a number of buttons on each side of the display for the fingers and thumb to operate.
FIGURE 5
Figure 5 shows a tree of values, documents, etc. This would be a typical structure.
Menu
  Time
    Current time
      Year
      Month
      Day (of month)
      Day (of week)
      Time (of day)
      Exact time (minutes and seconds)
    Alarms
      1. Alarm
        Time (of day)
        Period (daily, weekly)
      2. Alarm
      Chime
        Hourly
        Quarter-hourly
        Off
    Setting the time (with some values and effect of operations below)
      Year
        2001 (Rights take you to 2002, etc., and Lefts to 2000, etc.)
      Month
        July (Rights take you to August, etc., and Lefts to June, etc.)
      Day (of month)
        31 (Right gives you 1, Left gives you 30)
      Day (of week)
        Tuesday (Right gives you Wednesday, Left gives you Monday)
      Hour
        16 (Right gives you 17, Left gives you 15)
      Minutes
        05
      Seconds
        24
  Stepping
    Word
    Sentence
    Paragraph
    Section
    Column
    Row
    Off (not stepping, as for reading straight through a document)
  Speak (things displayed, in case they can't be seen)
    Current word
    Current heading
    Current title
    Current URL
    Progress (% through reading a document)
  Navigation
    Start (go to start of document)
    End (go to end of document)
    Link
    Back
    Forward
  Documents
    Drives
      A:
      B:
      C:
      .... (root directory with tree below)
      ..... (subdirectory)
  Display
    Normal text
      Speed
        10 (words per minute)
        20
        20
      Size
        10 (in points)
        20
        20
      Text colour
        Red (degree)
        Blue (degree)
        Yellow (degree)
      Background
        Red (degree)
        Blue (degree)
        Yellow (degree)
      Swap colours
    Unvisited link
    Visited link
    Headings
  Speech
    Normal text
      Speed (speed that each word is spoken)
      Gap (proportion of gaps to words)
      Volume
      Voice
      Pitch
    Unvisited links
    Visited links
    Headings
    System messages
      Off
  Controls (for environment and home appliances)
    TV
      Off
      Volume
      Channels
    Front door
      Lock
      Unlock
      Open
    Central heating
  Memoranda
    Bookmarks
    Shopping list
    Engagements
  Phone
    Phone book
      Call (find name and call the number)
      1.... (First entry)
      2.... (Second entry)
      Add entry
    Messages
    Past calls (call register)
    Settings

Claims (20)

Serial Presentation System - Claims

1. A serial presentation system comprising two or more serial computer output means, including a display device such as a computer monitor and a sound output device such as a loudspeaker; and one or more computer input means by which the presentation of the serial output is controlled and by which the content of that output is selected; where the content can be a document, and the presentation can be controlled by the input with respect to style, speed, and position within the document; and where the serial computer output means includes a word-by-word serial visual presentation on the display device which can be synchronised to the output on the sound output device of the said words as serial speech presentation, from a speech synthesiser or using pre-recorded speech, such that each word is displayed as it is spoken.
2. A system as claimed in claim 1, including a state machine by which the presentation is controlled; where the state machine has states corresponding to the playing of the serial output and the pausing of the serial output, and these states may exist at several levels, a level typically corresponding to a syntactic unit of a document being read; and where transitions between states are caused by operations through the input means, thereby controlling the presentation of the serial output, and allowing the user of the system to step backwards and forwards by a unit, such as a sentence, word or character of the document being presented; see figures 1 and 2 for example.
3. A system as claimed in claim 2, in which the state machine can be altered for different users, to provide them with a different set of state definitions and transitions, according to each user's preference or capabilities.
4. A system as claimed in any preceding claim, in which the input means for control of presentation comprises a small set of keys or buttons as typically found on a keypad or on a mobile phone.
5. A system as claimed in claim 4, in which the buttons are arranged so that they can be operated with fingers and thumb while the user is viewing the display; see figure 4.
6. A system as claimed in any preceding claim, in which the input means for control of presentation comprises a small set of on-screen buttons, each being an area on the display which the user can select, typically by clicking with a mouse.
7. A system as claimed in any preceding claim, in which the input means for control of presentation includes a speech recognition engine capable of recognising any word from a set of spoken commands from the user of the system, where some or all of these commands cause operations to be performed.
8. A system as claimed in any of claims 2 to 7, in which the input means allows four basic operations, called herein Up, Down, Left and Right, corresponding to the four cursor keys on a keyboard or keypad, but which could be implemented by four commands with speech recognition, or by four buttons on a watch, or by the four non-numeric buttons typically found on a mobile phone; see figures 1 and 2.
9. A system as claimed in claim 8, in which the four basic operations are used to navigate a tree structure, using Up and Down to go up and down the tree, and Left and Right to select a branch or a 'leaf' of the tree, where a leaf is typically a value of a presentation parameter, such as speed; see figure 5.
10. A system as claimed in claim 8, but in which there are only three basic operations: Left and Right, with the third taking the action selected from the options using the Left and Right; see bottom of figure 2.
11. A system as claimed in any preceding claim, in which there is an extra operation allowing a 'principal' document to continue to be serially displayed while the other four operations are used in navigating the tree, 'meta-document' or nested document, allowing control of the presentation characteristics of the principal document, such as speed, while it is playing; similar parallel presentation being possible for a principal document and other documents, or 'streams' of serial presentation, which might include:
* information about this document extracted and inserted into lists, e.g. a list of hypertext links within the document, or a list of headings within the document;
* style sheets;
* embedded objects, such as tables and hypertext links, which have a structure in their own right;
* communication by Internet Relay Chat, where the log of the conversation or "conference" is treated as the principal document, which is appended to as text arrives from parties to the discussion, and the input by the user is treated as an associated document, which is cleared as the input is sent;
* streams defined by a mark-up of the material to be presented according to a standard such as SMIL;
where the system provides the means for changing of focus, for user navigation and control, between the principal document and the other documents or streams.
12. A system as claimed in any preceding claim, in which the selection of a document for presentation involves navigation of documents associated through a directory structure, for example as part of a tree (see figure 5 under "Documents"), and/or through hypertext linkage.
13. A system as claimed in any preceding claim, in which the display is physically small compared to a typical computer display, and would typically be a liquid crystal display (LCD), see figure 4 for example; and in which the size of the characters of the words being serially displayed can be adjusted such that longer words occupy much of the width of the display area of the display device.
14. A system as claimed in any preceding claim, in which there is a main field for reading the document, and subsidiary fields in which structural elements of a document, such as headings, and presentation values, such as speed, can be shown; see figure 3.
15. A system as claimed in any preceding claim, in which there are other modalities:
* a tactile output modality, which may be dynamic refreshable Braille, or may be a code for serial tactile output of characters or phonemes across fingers and thumb;
* a symbol modality, where words are associated with symbols, according to the PCS or Rebus symbol sets for example;
* and/or a sign language modality, where words are associated with signs, according to British Sign Language (BSL) for example.
16. A system as claimed in any preceding claim, allowing output of a word in different modalities to be synchronised, such that, as a word is displayed or spoken, the corresponding symbol or sign (if there is one) is displayed in a separate window, there being a database of symbols or signs associated with words in a wordlist; in the case where there are several symbols for a single word, such as "present", the choices can be displayed together.
17. A system as claimed in any preceding claim, with capabilities for the use of modalities in combination:
* output modalities in parallel and synchronised, e.g. speech with word display;
* using a modality as a check or prompt, where the user can check a word by clicking on a certain button or pressing a certain key, to obtain that word expressed in a different modality;
* using one modality to follow another, e.g. so the user sees the word before hearing it;
* with the user inputting in one modality, before, during or after the system has output a word in the same or another modality, e.g. so the user sees the word and tries speaking it;
* the system can record the user's input, either as separate words or as phrases or sentences, and then synchronise with the corresponding output of the system in a different modality;
* a user can type words as they are spoken, and then play back with the speech and text displayed in synchrony;
* the user can speak words as they are displayed, and then play back the text with their own speech synchronised to it;
* the user can read silently, then read with the synthesised or instructor's voice, then read a word and record their own speech, then hear their speech back, synchronised with the text, the system recording each word spoken along with the text of the word, so it can be played back at a later moment;
* there can be a record of the instructor's spoken word as well as the student's, so they can be played together, or alternating.
18. A system as claimed in any preceding claim, allowing input of text using four keys and the tree structure, for example using the following procedure: for the first letter of a word, you go Down; then you are presented with a circular list of groups, such as ABCD, EFGH, IJKLMN, OPQRST, UVWXYZ; these are presented as a branch of the tree, which you can traverse using Left and Right; you select one group with a Down; and then you are presented with the individual letters of that group to choose between, and you select one with a second Down; you are then presented with the circular list of groups again, and you can go on to select the second letter, third letter, and so on, thus there is a two-stage selection for each character; when you finish the word you do an Up, and you can hear the word spoken to give you feedback; then you can do a Down to start the next word, or do a Left or Right for other possibilities, such as selection of punctuation mark, or change to (or from) capitalisation; with a similar procedure for input of numbers being possible, but only one stage is needed, where the numbers are presented as a circular list: on Down, you are presented with '0', and then a single Right will give you '1', two Rights give you '2', etc.
  19. A system as claimed in any preceding claim, in which the speech synthesiser is mounted on the web server, but the presentation is still controlled from the user's client: speech is sent from the server as an encoded file, such as a ".wav" file, to the client; this file, or a parallel file, has extra coding to mark word boundaries, allowing synchronisation of serial visual display with the speech; typically the server sends a sentence in advance of the sentence that the user is reading; the files are stored in the client, for reuse, in case the user wants to step backwards a sentence or more; the user can step forwards and backwards a word at a time, and the relevant word is extracted from the file, using the word boundary markings; or in which the speech synthesiser is split into a front end and a back end, with the text converted to a phonetic notation by the front end, and the notation passed from the server to the client machine and then converted to speech sounds on the user's client; but where the former conversion may be done at an earlier stage, so that a web page contains both a text version and a corresponding static phonetic version; and the latter conversion may be performed by an applet previously downloaded from the web site, or by a program allowing the conversion of a stream of standard phonetic notation from any web site: such a notation might be that used in a pronunciation dictionary; it would typically contain allophones and syllable stress markings, but also word stress and intonation, in order to give a complete inflection for each word, and thus for the sentence as a whole; the notation can allow a reverse translation, back to words of text, so that the software (e.g. applet) on the client can display each word as its synthesised sound is output, i.e. the word is displayed at the same time as it is spoken; the notation can also disambiguate words that have several meanings, so for example it can use mark-up similar to that proposed for the semantic web, this disambiguation being useful if there is a choice of symbol according to the meaning; or the same approach can be used with pre-recorded speech, such as that produced by an actor for a talking book: this speech is divided typically into sentences, and each sentence is marked up with word boundaries so that the display can subsequently be synchronised with the speech; the mark-up may be virtual, e.g. by a time indication of when each word ends with respect to the beginning of the sentence.
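The "virtual" word-boundary mark-up described in claim 19 can be illustrated with a minimal sketch: alongside each sentence's audio the server sends the end time of each word, measured from the start of the sentence, and the client uses those marks to highlight or extract the audio for a single word when stepping forwards or backwards. The data layout and function names below are invented for illustration; the claim does not prescribe a format.

```python
# Hypothetical per-sentence payload: words plus their end times in seconds,
# measured from the beginning of the sentence's audio (the claim's "virtual"
# mark-up by time indication).
sentence = {
    "text": ["System", "for", "synchronous", "display"],
    "word_ends": [0.42, 0.58, 1.31, 1.90],
}


def word_span(sentence, i):
    """Return the (start, end) seconds of word i, derived from the marks.

    A word starts where the previous word ends; the first word starts at 0.
    The client would play (or skip to) exactly this span of the audio file.
    """
    start = 0.0 if i == 0 else sentence["word_ends"][i - 1]
    return start, sentence["word_ends"][i]


def step(sentence, i, direction):
    """Move the reading cursor one word forwards (+1) or backwards (-1),
    clamped to the sentence boundaries."""
    return max(0, min(len(sentence["text"]) - 1, i + direction))
```

With this layout the client never needs to re-contact the server to replay a word: the stored file plus the boundary marks are sufficient, which matches the claim's reuse of cached sentences for stepping backwards.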
  20. A system as claimed in any preceding claim, including means to synchronise audio output of words of recorded speech with visual output of words in a serial display of the text which was spoken; this allowing the word-by-word display and control to be used in multimedia presentation, where one of the streams is recorded speech, e.g. for a talking book; such that the user can navigate the text, and control the style of presentation, while maintaining the synchronisation at the word level; and where the means of synchronisation can include: a speech recognition engine which converts speech to text; a correlator which compares and aligns this text with the text being spoken; a marking process whereby the digitised speech recording is marked with word-beginning and/or word-ending codes; and a search engine that looks for these marks when speech output is to accompany visual output of the words.
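The correlator step of claim 20 can be sketched with a standard sequence-alignment routine; here Python's `difflib.SequenceMatcher` stands in for the correlator (the claim does not prescribe any particular algorithm, and the function name is hypothetical). Recognised words carry timings from the speech recognition engine; aligning them against the reference text lets the marking process attach word-boundary codes to the recording even where the recogniser mis-hears some words.

```python
import difflib


def align_words(recognised, reference):
    """Align the recogniser's word list with the reference text.

    Returns a mapping from reference-word index to recognised-word index
    for the words the two sequences agree on; timings attached to the
    recognised words can then be transferred to the reference text.
    """
    sm = difflib.SequenceMatcher(a=recognised, b=reference, autojunk=False)
    mapping = {}
    for a, b, n in sm.get_matching_blocks():
        for k in range(n):           # each matching block aligns n words
            mapping[b + k] = a + k
    return mapping
```

Words the recogniser missed simply get no entry in the mapping; their boundary marks would be interpolated from the neighbouring aligned words, which is sufficient for word-level (rather than phoneme-level) synchronisation.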
GB0118739A 2000-07-31 2001-07-31 System for synchronous display of text and audio data Withdrawn GB2369219A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0018733A GB0018733D0 (en) 2000-07-31 2000-07-31 Serial presentation system

Publications (2)

Publication Number Publication Date
GB0118739D0 GB0118739D0 (en) 2001-09-26
GB2369219A true GB2369219A (en) 2002-05-22

Family

ID=9896669

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0018733A Ceased GB0018733D0 (en) 2000-07-31 2000-07-31 Serial presentation system
GB0118739A Withdrawn GB2369219A (en) 2000-07-31 2001-07-31 System for synchronous display of text and audio data

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB0018733A Ceased GB0018733D0 (en) 2000-07-31 2000-07-31 Serial presentation system

Country Status (1)

Country Link
GB (2) GB0018733D0 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997037344A1 (en) * 1996-03-29 1997-10-09 Hitachi, Ltd. Terminal having speech output function, and character information providing system using the terminal
US5915256A (en) * 1994-02-18 1999-06-22 Newsweek, Inc. Multimedia method and apparatus for presenting a story using a bimodal spine
WO2000021057A1 (en) * 1998-10-01 2000-04-13 Mindmaker, Inc. Method and apparatus for displaying information
GB2352062A (en) * 1999-02-12 2001-01-17 John Christian Doughty Nissen Computing device for seeking and displaying information
WO2001046853A1 (en) * 1999-12-20 2001-06-28 Koninklijke Philips Electronics N.V. Audio playback for text edition in a speech recognition system
WO2001080027A1 (en) * 2000-04-19 2001-10-25 Telefonaktiebolaget Lm Ericsson (Publ) System and method for rapid serial visual presentation with audio

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005034490A1 (en) * 2003-09-30 2005-04-14 Sony Ericsson Mobile Communications Ab Method and apparatus to synchronize multi-media events
US7966034B2 (en) 2003-09-30 2011-06-21 Sony Ericsson Mobile Communications Ab Method and apparatus of synchronizing complementary multi-media effects in a wireless communication device
US10395555B2 (en) 2015-03-30 2019-08-27 Toyota Motor Engineering & Manufacturing North America, Inc. System and method for providing optimal braille output based on spoken and sign language

Also Published As

Publication number Publication date
GB0118739D0 (en) 2001-09-26
GB0018733D0 (en) 2000-09-20

Similar Documents

Publication Publication Date Title
US6564186B1 (en) Method of displaying information to a user in multiple windows
US6377925B1 (en) Electronic translator for assisting communications
US9263026B2 (en) Screen reader having concurrent communication of non-textual information
Coombs Making online teaching accessible: Inclusive course design for students with disabilities
US20040218451A1 (en) Accessible user interface and navigation system and method
AU2003290395A1 (en) A system of interactive dictionary
US8726145B2 (en) Content communication system and methods
US20050137872A1 (en) System and method for voice synthesis using an annotation system
Green et al. Keep it simple: A guide to assistive technologies
Hersh et al. Accessible information: an overview
KR20030049791A (en) Device and Method for studying foreign languages using sentence hearing and memorization and Storage media
GB2369219A (en) System for synchronous display of text and audio data
JP6858913B1 (en) Foreign language learning equipment, foreign language learning systems, foreign language learning methods, programs, and recording media
Maybury Universal multimedia information access
Basu et al. Vernacular education and communication tool for the people with multiple disabilities
US20040236565A1 (en) System and method for playing vocabulary explanations using multimedia data
Heim et al. User profiles for adapting speech support in the opera web browser to disabled users
Sadh Assistive Technology for Students with Visual Impairments: A Resource for Teachers, Parents, and Students
Snyder Human-computer interface specifications for people with blindness
Roberts Using an access-centered design to improve accessibility: A primer for technical communicators
Hines et al. Assistive technology
CN110955337A (en) Character input method and device
Unit Accessibility of Web Course Curriculum Applications
Chang Accommodating Students with Disabilities: A Guide for School Teachers.
Reynolds et al. Metatext: Computerized Materials for the Study of Shakespeare's Language

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)