GB1403816A - Apparatus for identifying an unidentified item of data

Info

Publication number: GB1403816A
Application number: GB4240273A
Authority: GB
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1972-10-03
Filing date: 1973-09-10
Publication date: 1975-08-28
Also published as: FR2201788A5; DE2349116A1; JPS4973934A

Abstract

1403816 Character recognition systems INTERNATIONAL BUSINESS MACHINES CORP 10 Sept 1973 [3 Oct 1972] 42402/73 Heading G4R

An apparatus for identifying an unidentified data item, such as a character, in a group of such items, such as an intelligible word, comprises means for determining a first group of possible identities of the item. Items adjacent to the unidentified item are identified and each is classified in one of a set of groups to provide a context word. The context word is used to derive a second group of possible identities, and the first and second groups are compared to determine the correct identity of the unidentified item. The various operations may be performed by hardware or software.

As applied to a character recognition system, a recognition unit (not described) attempts to identify successive characters, and provides a code signal indicative of each character. Each code signal may be accompanied by a "confidence" flag indicating whether the identification is deemed to be correct or to be unreliable. In the latter case the first group of possible identities is determined, e.g. by table look-up, or by other means (Fig. 6, not shown). Thus, for example, a character unconfidently identified as O may result in a first group, termed a confusion list, consisting of O, D, Q and U.

The context word is derived by consideration of characters adjacent to the unreliable character. For example, if the character is the O of GROUP the context word may be derived from GR and UP, or from GR, RU and UP. The codes for these characters are used to classify each of the characters into one of a number of sets of characters. The sets constitute the context word. Since only certain combinations of characters are permitted in the language under consideration, the characters forming two sets of characters can be associated only with certain other characters. The context word can therefore be used to select, e.g. by table look-up, a set of characters most probably associated with the sets forming the context word. This set (the second group) is compared with the first group (confusion list) and the common character is deemed to be the correct identity of the unreliably identified character. If the most probable set does not contain a character appearing in the confusion list the unreliable character is deemed to be unidentified and a reject signal is given. A reject signal may also be given if certain of the characters from which the context word is derived are unreliable, provided that the character under consideration has not been reliably identified by the prior recognition unit.

Hardware implementations of the above arrangement are described with reference to Figs. 2 and 4 (not shown). The arrangement may be combined with a further processor (Fig. 3, not shown) which, if the recognition unit has not reliably identified the character, determines whether it is part of a short word, i.e. one of less than 6 characters. If it is not, the character is determined as above, but if it is, then the five characters address a small-word dictionary table look-up which indicates directly the most likely identification of the character.

The character codes provided by the above arrangements may be passed to a further unit (Fig. 5, not shown) which detects character sequences whose probability of occurrence is so low that they may be considered to be impossible. For example, the sub-sequences GRO, ROU and OUP of the sequence GROUP are examined, and the common character O is rejected if any sub-sequence is impermissible.
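
The two table-driven steps summarised in the abstract, namely comparing a confusion list with a context-derived set and checking the three-character sub-sequences around a character, can be illustrated with a minimal software sketch. It is not the patented hardware or its actual tables: every table entry, the class numbering and the function names (`context_word`, `resolve`, `trigrams_around`, `is_permissible`) are hypothetical and chosen only to reproduce the GROUP example. Rejecting when more than one character is common to both groups is likewise an assumption, since the abstract does not cover that case.

```python
# Illustrative sketch only: all table contents below are invented for the
# GROUP example and are not taken from the patent.

# First group: confusion list for an unreliably identified character.
CONFUSION_LISTS = {"O": {"O", "D", "Q", "U"}}

# Hypothetical classification of each adjacent character into a set (class).
CHAR_CLASS = {"G": 0, "R": 1, "U": 2, "P": 3}

# Hypothetical look-up: context word -> characters most probably associated
# with it (the second group).
CONTEXT_TABLE = {((0, 1), (2, 3)): {"O", "A", "E"}}

# Hypothetical list of permitted three-character sub-sequences.
PERMITTED_TRIGRAMS = {"GRO", "ROU", "OUP"}


def context_word(left_pair, right_pair):
    """Classify the adjacent characters (e.g. 'GR' and 'UP') into sets."""
    left = tuple(CHAR_CLASS[c] for c in left_pair)
    right = tuple(CHAR_CLASS[c] for c in right_pair)
    return (left, right)


def resolve(unreliable_char, left_pair, right_pair):
    """Return the character common to both groups, or None as a reject."""
    first_group = CONFUSION_LISTS.get(unreliable_char, {unreliable_char})
    key = context_word(left_pair, right_pair)
    second_group = CONTEXT_TABLE.get(key, set())
    common = first_group & second_group
    if len(common) == 1:
        return next(iter(common))
    # No common character (or, by assumption, an ambiguous result): reject.
    return None


def trigrams_around(word, index):
    """All three-character sub-sequences of word that contain word[index]."""
    first = max(0, index - 2)
    last = min(index, len(word) - 3)
    return [word[j:j + 3] for j in range(first, last + 1)]


def is_permissible(word, index):
    """Accept word[index] only if every sub-sequence containing it is permitted."""
    return all(t in PERMITTED_TRIGRAMS for t in trigrams_around(word, index))


if __name__ == "__main__":
    print(resolve("O", "GR", "UP"))
    print(trigrams_around("GROUP", 2))
    print(is_permissible("GRDUP", 2))
```

In this sketch a missing table entry yields an empty second group, which plays the role of the reject signal described above.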