GB2197097A - Data entry using dynamic word predication algorithm - Google Patents

Data entry using dynamic word predication algorithm

Info

Publication number
GB2197097A
GB2197097A GB08626367A GB8626367A GB2197097A GB 2197097 A GB2197097 A GB 2197097A GB 08626367 A GB08626367 A GB 08626367A GB 8626367 A GB8626367 A GB 8626367A GB 2197097 A GB2197097 A GB 2197097A
Authority
GB
United Kingdom
Prior art keywords
word
words
data input
sub
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB08626367A
Other versions
GB8626367D0 (en)
Inventor
Adrian Pickering
Andrew Swiffen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Dundee
Original Assignee
University of Dundee
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Dundee filed Critical University of Dundee
Priority to GB08626367A priority Critical patent/GB2197097A/en
Publication of GB8626367D0 publication Critical patent/GB8626367D0/en
Publication of GB2197097A publication Critical patent/GB2197097A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/274Converting codes to words; Guess-ahead of partial word inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A method of data input comprises establishing a database or vocabulary of words and entering a word prefix. The word prefix is used to select all words from the vocabulary commencing with that prefix to form a sub-set for subsequent processing. The subsequent processing involves ordering the members of the sub-set in accordance with their potential relevance and truncating the ordered sub-set so as to identify a predetermined number of the most potentially relevant words. The truncated ordered sub-set is subsequently presented to enable the intended input word to be selected, if present, from among the presented words. The words are ordered in accordance with their recency of use, or their historical frequency of use, or preferably by a combination of the two.

Description

SPECIFICATION

Dynamic word prediction algorithm

This invention relates to an algorithm for dynamically developing a list of the most likely words that follow a given word prefix. The algorithm is adaptive in that it takes account of the past patterns of use of the words in developing its predictions. It has application in effort-efficient user interfaces in the field of Man-Machine interaction. In particular, the technique enables significant effort savings in computer interfaces for those whose physical abilities are impaired.
A 'word' is an ordered sequence of elements (normally characters) picked from an alphabet of elements bounded by word delimiters. A delimiter is normally an element from another set which is disjoint from the word alphabet set. A word prefix comprises the sub-sequences made up from zero or more of the first elements in a particular word. All the words together make up the vocabulary (set of all known words) from which higher order structures are built, such as sentences, which in turn make up a 'text'.
As words are chosen to make up sentences it is possible to record their usage statistics and make use of these to develop predictions of what is most likely to follow. The prediction subset can be refined every time further prefix information becomes available, and its members can be presented to the user so that a selection can be made in order to complete the word. To develop the entire subset of the vocabulary which applies at a particular instant is trivial. In practice, the cardinality of these subsets is so large that it is neither possible nor desirable to present the user with all their members. Thus it is necessary to order the set and truncate it so that the prediction list is assessable by the user and a member is selectable.
The algorithm presently to be described seeks to ensure that the set ordering is optimal and only the most likely words are offered to the user. The member words that are not displayed remain accessible through use of the conventional input method, normally the QWERTY keyboard of a computer terminal. However, every time another element is added to the word prefix, the subset can be further refined. The selection device for choosing the predictions can be any of the wide variety available; examples being function keys on a conventional computer terminal, a touch-sensitive screen or the use of a 'mouse'.
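To make the subset/order/truncate pipeline concrete, the following Python sketch is offered. It is illustrative only: the predict function, the scores table and the single relevance number are assumptions made for the example, not structures prescribed by this specification.

    # Minimal sketch of the subset/order/truncate pipeline described above.
    # All names are illustrative; the patent does not prescribe an API.

    def predict(vocabulary, relevance, prefix, p=5):
        """Return at most p words matching prefix, most relevant first."""
        subset = [w for w in vocabulary if w.startswith(prefix)]  # the subset V'
        subset.sort(key=relevance, reverse=True)                  # order by relevance
        return subset[:p]                                         # truncate to P

    # Example with made-up relevance scores (higher = more relevant).
    scores = {"the": 28675, "then": 9123, "this": 15008, "theory": 407}
    print(predict(scores, scores.get, "the", p=3))  # ['the', 'then', 'theory']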
A word w belongs to a vocabulary V = {w_1, w_2, ..., w_n} containing n words. A text is an ordered sequence of t words in which words may be repeated, e.g. w_4 w_1 w_4 w_7 ... The text is fully defined by an index set {G_k | G_k ⊆ T}, where T is the set of all numbers from 1..t, k = 1..n, and the disjoint sets G_k contain the word positions for w_k. The cardinality of G_k, |G_k|, corresponds to the number of times w_k was used. Typically the text demonstrates a Zipf distribution, expressed as:

|G_k| / t = p_k = μ / k

where:

μ = 1 / (ln(n) + γ)

with γ being Euler's constant (≈ 0.57721). Note that this is a very slow function of n when n >> 100. A rank ordering over V is denoted by k, with w_1 being the highest frequency, lowest ranking word.
All members of V, w_k, have a corresponding average distance d_k between their uses given by:

d_k = k / μ words

Thus, in order to capture all words that have a frequency rank position greater than some threshold value k', text samples of d_k' words must be examined. A special case of this is the average distance between the occurrences of the rarest word w_min, which must be the same as the text sample size. Thus a relationship exists between n and t:

t = n [ ln(n) + γ ]

The Zipf distribution is an expression of the 'least cost' behaviour of humans when generating information-carrying material when it is analysed a posteriori. It is essentially a static description and takes no account of the temporal behaviour of word selection; words fall into disuse and others increase in popularity as time proceeds and context changes.
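The formulae above can be checked numerically. The short Python sketch below (the variable names are ours; the constants follow the specification) reproduces the figures used in the worked example later in this document:

    import math

    n = 1000                        # vocabulary size, as in the worked example below
    gamma = 0.57721                 # Euler's constant
    mu = 1 / (math.log(n) + gamma)  # mu = 1 / (ln(n) + gamma)
    print(round(mu, 4))             # 0.1336

    def d(k):
        """Average distance, in words, between uses of the rank-k word."""
        return k / mu

    print(round(d(1), 1))           # ~7.5 words for the most frequent word
    print(round(d(5), 2))           # 37.42 words for the 5th-ranked word

    t = n * (math.log(n) + gamma)   # text size at which the rarest word recurs
    print(round(t))                 # 7485 words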
The recency of a word can be expressed by its instantaneous word distance d'_k, which is the span between the last position of w_k and the last word in the text (the current word). If the spans are measured in words, the values of d'_k are unique, lie in the range 1..t and are non-contiguous. The average of all the past d'_k spans will tend towards d_k.
Recency defines another ordering on V in terms of d'_k; the word w_x is said to be more recent than w_y if d'_x < d'_y. Studies have shown that words in texts with Zipf-type distributions are found more quickly on average if they are examined in recency order. The 'move to front' strategy of self-organising lists and the caching principle are practical embodiments of this principle used in other computing domains. In the context of prediction presentation, recency ordering provides a better basis on which to perform the set truncation.
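A minimal Python sketch of recency ordering via the 'move to front' idea mentioned above; the OrderedDict-based structure is one possible embodiment, not the one the specification mandates:

    from collections import OrderedDict

    # Sketch of recency ordering via a move-to-front structure (illustrative only).
    recency = OrderedDict()  # the most recently used word is kept at the end

    def use_word(word):
        recency.pop(word, None)  # remove the old position, if any
        recency[word] = True     # re-insert at the most recent end

    for w in "the cat sat on the mat".split():
        use_word(w)

    # Most recent first: the words with the smallest instantaneous distance d'_k.
    print(list(reversed(recency)))  # ['mat', 'the', 'on', 'sat', 'cat']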
The prediction subset V' is defined:

V' = { w_k | w_k ∈ V and all w_k have a common prefix }

An ordering on V, and thus on V', enables the first few members of V' to be chosen to be presented to the user. These form the set of predictions P ⊆ V' such that |P| ≤ p, with p being the maximum number of shown predictions.
Recency ordering does not perform so well when the d'_k are comparable with p, since the members of P then tend to be the same as the last few words in the text. This situation is encountered when trying to produce predictions with a zero initial prefix, which yields a prediction subset of V. Here the historical ordering, based upon the frequency ordering of V, might provide a better basis for truncation. But, since the instantaneous distance d'_k is of greater importance for dynamic adaptation, it may not be economic to maintain information relating to absolute word frequency as well. This frequency information might take the form of word usage counts or a running average of d_k. A discrete approximation sufficient for identifying the p most frequent words may be all that is necessary.
Each word in the vocabulary V has associated with it (a) recency information and, optionally, (b) frequency information. Both forms of information enable an ordering over V to be defined which may be used for developing the most suitable predictions P. Both need keeping up to date as the text being created is developed. Frequency information is relatively simple to update, since a word's frequency is affected only when that word is used. However, the recency information changes every time the text expands. If the number of words in the vocabulary, n, is large then updating the recency information can be impractical, and the following methods may be employed to reduce the processing effort.
(1) The distance unit can be increased. In order that the d'_k are still unique, the unit can be increased to that of the average distance of the most frequent word. Assuming a Zipf distribution, the formula given above will give a minimum d_k of about 10 words.
The recency information can therefore be kept in units of 10 words and will only need updating after every tenth word. A further advantage of using a greater span unit is economy in the storage of the recency number.
(2) If a non-unique recency value is acceptable, then the span unit can be increased much further. Indeed, the span can be such as to ensure that the p most regularly used words will always retain the highest recency. A problem with non-unique recency values is that, failing any further information, the truncation of V' to P can be somewhat arbitrary. However, if frequency information is available, this can be used in conjunction with the recency information to resolve ties and produce a better P.
(3) The current value of the text size, t (in span units), can be written into the recency information field when the word is used. This can be done at the same time as the frequency information is updated. To derive d'_k this value is subtracted from t. Thus, to derive the recency ordering for V', the d'_k are computed by subtracting the recency field from t each time t changes. This avoids having to run explicitly through the entire vocabulary, updating each recency value, after each span is complete.
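A Python sketch of method (3), assuming for simplicity a span unit of one word; the names record_use, last_used and instantaneous_distance are illustrative, not taken from the specification:

    # Sketch of method (3): store the text size t at last use; derive d'_k on demand.

    t = 0           # running text size, in span units
    last_used = {}  # word -> value of t when the word was last used
    counts = {}     # word -> frequency count, updated at the same time

    def record_use(word):
        global t
        t += 1
        last_used[word] = t
        counts[word] = counts.get(word, 0) + 1

    def instantaneous_distance(word):
        # d'_k is derived only when needed, so no vocabulary-wide update is required.
        return t - last_used[word]

    for w in "to be or not to be".split():
        record_use(w)
    print(instantaneous_distance("or"))  # 3: "or" was used 3 words ago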
The algorithm will now be presented by reference to an actual implementation embodied in a small, portable typing aid for those with motion disabilities. The microcomputer-based aid has provision for a vocabulary of up to 1000 words and the ability to display and use up to 5 predictions. The vocabulary is 'complete' in that new words are captured and added when they occur for the first time. With n = 1000, μ = 1/7.485 = 0.1336 and t = 7485 words by the above formulae. For efficiency reasons explicit updating of the recency values is performed. The average distance between occurrences of the pth-ranked word is d_p = 37.42 words, and thus the span between updates should well exceed this. A convenient span is the 150-200 words of a typical paragraph, which provides a natural break in the text when updating can be performed. The frequency information provides the additional information for determining the exact truncation position in V'. The cardinality of the recency information should allow the current 'working' vocabulary to be regarded as being recent. If 25% of the vocabulary is to be so treated, then d_k for the 250th word is 1871 words, or about 8-12 spans.
The frequency information is simply a word incidence count. Its size should be estimated by the formulae and allow for use well past the expected text size t. The highest expected count for the 1000 word vocabulary is 1000.
The ordering is performed by simple numeric comparison on the concatenation of recency and frequency, treating the whole as an unsigned, binary-coded number. Recency is represented using 1's complement so that the ordering is compatible with that of the word count, which is represented conventionally. In the specific example cited, 15 bits are used to generate the ordering, identified b_0 to b_14, with b_0 having the lowest and b_14 the highest binary weighting. The highest-weighted 3 bits, b_12 to b_14, are used for the recency representation and the remaining 12 bits, b_0 to b_11, are used to accommodate the count. The exact split of the available number of bits determines when the transition between recency and word-count ordering occurs; minimum recency is represented by all zeros, thus leaving the word count with its true weighting.
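A Python sketch of the packed ordering key described above, assuming recency has already been quantised to a 3-bit span count (0 = most recent); the helper name ordering_key is an illustrative assumption:

    # Sketch of the 15-bit packed key: 3 high bits of 1's-complement recency
    # (b14..b12) concatenated with a 12-bit word count (b11..b0).

    RECENCY_BITS, COUNT_BITS = 3, 12

    def ordering_key(recency, count):
        """Larger key = more relevant; recency 0 = most recent span."""
        recency_field = (~recency) & ((1 << RECENCY_BITS) - 1)  # 1's complement
        count_field = min(count, (1 << COUNT_BITS) - 1)         # saturate at 4095
        return (recency_field << COUNT_BITS) | count_field

    # A word used this span outranks one with a far higher count from long ago;
    # minimum recency (an all-zero field) leaves the count with its true weighting.
    print(ordering_key(recency=0, count=3) > ordering_key(recency=7, count=4000))  # True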
Other systems seeking to employ the ordering algorithm herein described should use the formulae and guidelines to produce an implementation to suit the specific circumstances.

Claims (13)

1. A method of data input comprising the steps of establishing a database or vocabulary of words (as hereinbefore defined); entering a prefix of a word to be input; establishing from the vocabulary a sub-set of words each of which includes the entered prefix; arranging the words of the sub-set in order of potential relevance; truncating the ordered sub-set so as to limit the sub-set to a predetermined number of the most potentially relevant words; and presenting the truncated ordered sub-set so that the word to be input may be selected if present in the truncated ordered sub-set.
2. A method of data input as claimed in Claim 1, wherein ordering of words of the sub-set includes determining the recency of use of each word of the sub-set.
3. A method of data input as claimed in Claim 1 or 2, wherein ordering of the sub-set includes determining the historical frequency of use of each of the words of the sub-set for a predetermined period of data input.
4. A method of data input as claimed in Claim 3, comprising the step of storing the absolute word frequency count for each word in the vocabulary so as to define said historical frequency of use.
5. A method of data input as claimed in Claim 3, comprising the step of storing a cumulative average for the frequency of use of each word of the vocabulary in order to determine said historical frequency of use.
6. A method of data input as claimed in Claim 2, comprising the step of storing an absolute recency value for each word of the vocabulary in order to define recency of use of each word.
7. A method of data input as claimed in Claim 2, comprising the step of storing for each word of the vocabulary a recency value which is updated only after input of a plurality of words.
8. A method of data input as claimed in Claim 7, comprising the step of selecting said plurality to be equal to the number of words between subsequent occurrences of the most frequently used word for a predetermined period of data input.
9. A method of data input as claimed in Claim 2, comprising the step of storing for each word of the vocabulary the value of the total number of words input at the time the word was most recently used and subsequently subtracting said value from the current total number of words input in order to define the recency of use of the word for ordering of the sub-set.
10. A method of data input as claimed in any preceding Claim wherein the words are stored in binary digital form.
11. A method of data input as claimed in Claim 10 and Claim 2, wherein the recency information for each word is stored in 1's complement form.
12. A method of data input substantially as hereinbefore described.
13. Apparatus when adapted and arranged to carry out the method of any preceding claim.
GB08626367A 1986-11-04 1986-11-04 Data entry using dynamic word predication algorithm Pending GB2197097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB08626367A GB2197097A (en) 1986-11-04 1986-11-04 Data entry using dynamic word predication algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB08626367A GB2197097A (en) 1986-11-04 1986-11-04 Data entry using dynamic word predication algorithm

Publications (2)

Publication Number Publication Date
GB8626367D0 GB8626367D0 (en) 1986-12-31
GB2197097A true GB2197097A (en) 1988-05-11

Family

ID=10606795

Family Applications (1)

Application Number Title Priority Date Filing Date
GB08626367A Pending GB2197097A (en) 1986-11-04 1986-11-04 Data entry using dynamic word predication algorithm

Country Status (1)

Country Link
GB (1) GB2197097A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4438505A (en) * 1979-02-09 1984-03-20 Sharp Kabushiki Kaisha Electronic dictionary and language interpreter with auto-search key for deriving a full-length word and its associated translation word based on a partial word entered
EP0172357A2 (en) * 1984-06-28 1986-02-26 Yoshiro Akiyama Display type dictionary apparatus

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1054329A3 (en) * 1999-05-19 2002-04-24 Sun Microsystems, Inc. Methods and apparatus for processing a document
EP1054329A2 (en) * 1999-05-19 2000-11-22 Sun Microsystems, Inc. Methods and apparatus for processing a document
US8364706B2 (en) 2003-06-18 2013-01-29 Zi Corporation Of Canada, Inc. System and method for information identification
WO2004111871A1 (en) * 2003-06-18 2004-12-23 Zi Corporation Configurable information identification system and method
WO2006024157A1 (en) * 2004-08-31 2006-03-09 Research In Motion Limited Handheld electronic device with text disambiguation
GB2434237A (en) * 2004-08-31 2007-07-18 Research In Motion Ltd Handheld electronic device with text disambiguation
US7475004B2 (en) 2004-08-31 2009-01-06 Research In Motion Limited Handheld electronic device with text disambiguation
GB2434237B (en) * 2004-08-31 2010-01-06 Ontario Inc 2012244 Handheld electronic device with text disambiguation
US8489383B2 (en) 2004-08-31 2013-07-16 Research In Motion Limited Text disambiguation in a handheld electronic device with capital and lower case letters of prefix objects
US8768685B2 (en) 2004-08-31 2014-07-01 Blackberry Limited Handheld electronic device with text disambiguation
US9588596B2 (en) 2004-08-31 2017-03-07 Blackberry Limited Handheld electronic device with text disambiguation
WO2007040872A1 (en) * 2005-09-30 2007-04-12 Motorola Inc. Method of predictive text input in which phrases are input in abbreviated form
WO2009090262A3 (en) * 2008-01-18 2010-03-25 Ugochukwu Akuwudike Method and system for storing and retrieving characters, words and phrases
US8903718B2 (en) 2008-01-18 2014-12-02 Ugochukwu Akuwudike Method and system for storing and retrieving characters, words and phrases

Also Published As

Publication number Publication date
GB8626367D0 (en) 1986-12-31

Similar Documents

Publication Publication Date Title
KR100604039B1 (en) Dynamic database reordering system
Park et al. Hybrid text mining for finding abbreviations and their definitions
KR101431194B1 (en) Encoding and adaptive, scalable accessing of distributed models
Ebeling et al. Entropy and long-range correlations in literary English
US5963671A (en) Enhancement of soft keyboard operations using trigram prediction
CA2411227C (en) System and method of creating and using compact linguistic data
US6470307B1 (en) Method and apparatus for automatically identifying keywords within a document
JP5147712B2 (en) System, computer program and method for forming data input
US20040002945A1 (en) Program for changing search results rank, recording medium for recording such a program, and content search processing method
US9430473B1 (en) Data table performance optimization
RU2006133906A (en) INTERFACES "MAN-COMPUTER"
KR20090007343A (en) Alphanumeric data entry apparatus and method using multicharacter keys of a keypad
Carneiro et al. A universal multilingual weightless neural network tagger via quantitative linguistics
GB2197097A (en) Data entry using dynamic word predication algorithm
US5553283A (en) Stored mapping data with information for skipping branches while keeping count of suffix endings
CN109542248A (en) A kind of control method and control device of incremental update dictionary data
CN105808636B (en) Hypertext link pushing system based on APP information data
Ktistakis et al. Succinct bwt-based sequence prediction
US6223174B1 (en) Method and apparatus for performing radix lookups using valid bit tables with pointers
US6185570B1 (en) Method and apparatus for performing radix lookups using transition bits and fields in transition tables
WO2004006122A2 (en) System and method of creating and using compact linguistic data
JPH0683812A (en) Kana/kanji converting device for document input device
Lehikoinen et al. An empirical and theoretical evaluation of binscroll: A rapid selection technique for alphanumeric lists
KR100798008B1 (en) A method and a device for frequency counting
Tapsai et al. TLS-ART-MC, A new algorithm for Thai word segmentation