GB2367914A

GB2367914A - An iterative method for identification of an item from a dataset by selecting questions to eliminate the maximum number of items at each iteration

Info

Publication number: GB2367914A
Application number: GB0024633A
Authority: GB
Inventors: David Boris Johnson-Davies
Original assignee: JOHNSON DAVIES DAVID BORIS
Current assignee: JOHNSON DAVIES DAVID BORIS
Priority date: 2000-10-09
Filing date: 2000-10-09
Publication date: 2002-04-17
Also published as: GB0024633D0; GB2367914A9

Abstract

A method and system for identifying an item from a dataset comprises the steps of presenting a series of questions to a user such that the question is iteratively selected to maximise the number of candidate items eliminated. Items are stored in a dataset and a number of characteristics are associated with each item. The data set further comprises a number of questions, each question being associated with a characteristic. In use the questions are asked in series to the user to narrow down the number of candidate items. On receiving each answer each possible subordinate question is rated, the question having the highest value being presented to the user. Preferably the formula uses a weighted probability function to calculate the number of items likely to be eliminated. If the user provides an answer indicating that none of the remaining characteristic features are present no reduction is made in the number of items.

Description

A Computer-based Interactive Item Identification System

The present invention relates to a computer-based interactive system for identifying one or more candidate items from amongst a range of possible candidate items.

Many items can be grouped or collected into sets of similar such items. Once such a set is formed however, it can be more difficult to select a desired particular item from the set according to known characteristics of that desired item, or to match a known item to one already in the set on the basis that such items have similar characteristics.

For example, a person may wish to identify a particular font used on printed matter. This may be necessary if it is desired to produce new additional printed matter with text in a font that matches the earlier printed matter.

Although workers in a printing company may have sufficient experience to identify a particular font, or at least to match it with another very similar font, this is a difficult task for a non-expert user.

It is an object of the present invention to provide a computer-based interactive system for identifying one or more candidate items from amongst a range of possible candidate items.

According to the invention, there is provided a computer-based apparatus for matching an unidentified item with one or more candidate items from amongst a set of a plurality of possible candidate items, comprising data processing means, a memory for storing data, user data entry means for entering information input into the computer, and a user display for displaying information output by the computer, wherein:
- the memory stores data relating to said set;
- said set data includes at least one characteristic feature associated with each item in said set, and a plurality of questions each of which when presented to a user on the user display asks about the presence or absence of one or more characteristic features;
- the data processing means is arranged to present in stages a series of said questions on the user display, the answers to said questions determining whether or not items remain possible candidate items; and
- the questions are selected by the data processing means at each stage in order to eliminate as many candidate items as possible at each stage of questioning.

Also according to the invention, there is provided a method for identifying one or more candidate items from amongst a set of a plurality of possible candidate items, using an apparatus comprising data processing means, a memory for storing data, user data entry means for entering information input into the computer, and a user display for displaying information output by the computer, wherein the method comprises the steps of:
- storing data relating to said set in the memory;
- including in said set data at least one characteristic feature associated with each item in said set, and a plurality of questions each of which when presented to a user on the user display asks about the presence or absence of one or more characteristic features;
- presenting in stages a series of said questions on the user display, the answers to said questions determining whether or not items remain possible candidate items; and
- selecting the questions at each stage in order to eliminate as many candidate items as possible at each stage of questioning.

In order to determine which questions should be asked at each stage of questioning, the data processing means may calculate which questions can be expected to eliminate the greatest number of candidates at each stage of questioning assuming that each of the remaining candidates is equally likely. If it is known that some candidates are more likely than others, for example based on a history of previous question and answer sessions, then the questions can be selected according to this unequal weighting of the remaining candidates.
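The unequal weighting mentioned above could be sketched as follows. This is a minimal illustration only, not a formula given in this specification: the function name `weighted_rating`, the per-candidate `weights`, and the use of normalised weights as probabilities are all assumptions made for the example.

```python
from collections import defaultdict

def weighted_rating(candidates, weights, feature_of):
    """Expected number of candidates eliminated by one question.

    candidates: list of item names still in play.
    weights:    dict item -> non-negative weight (assumed, e.g. from past sessions).
    feature_of: dict item -> the value that item has for this question.
    """
    total_weight = sum(weights[c] for c in candidates)
    if total_weight == 0:
        return 0.0
    count = defaultdict(int)    # number of candidates with each value
    mass = defaultdict(float)   # summed weight of candidates with each value
    for c in candidates:
        v = feature_of[c]
        count[v] += 1
        mass[v] += weights[c]
    n_total = len(candidates)
    # If the sought item has value v (probability mass[v] / total_weight),
    # every candidate with a different value is eliminated.
    return sum((mass[v] / total_weight) * (n_total - count[v]) for v in count)
```

With all weights equal, this reduces to the equally-likely case; raising the weight of some candidates shifts the rating towards questions that separate those candidates from the rest.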

After all possible questions have been responded to by the user, the computer-based system can present to the user the sole or the minimum number of possible candidates that have been identified.

The invention will now be described in further detail by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows an embodiment of computer-based apparatus for identifying candidate items from amongst a set of a plurality of possible candidate items, including a screen displaying a question and a set of possible answers to that question; and

Figure 2 shows another screen display displaying another question and another set of possible answers to that question.

Figure 1 shows schematically an apparatus 1 for identifying a particular font or typeface. The apparatus includes a microcomputer 2. The microcomputer 2 is in this example a standard personal computer. The computer is connected 4 to a standard user display 6 for displaying information to a user, and also connected 8,12 to a standard keyboard 10 and mouse 14 by which the user may enter information. Optionally, the display 6 is a touch sensitive display by which the user may enter information via the display 6.

The computer 2 includes a microprocessor 16 and a memory 18, which term includes both solid state and disk-based memory.

The memory 18 is loaded with data, referred to herein as "set data", relating to a list of items (items I1, I2, ... IN) 20, and associated with each item in the list, one or more characteristic features (CF) 22, which in the memory 18 are each represented by a different numerical value. For example, the first item I1 may have three characteristic features CF, CF2 and CF, that are respectively represented in memory by discrete values. The second item I2 may have four characteristic features CF2, CF3, CF, and CF,, that are respectively represented in memory by four distinct values. In general, an item IN will have "x" characteristic features CFa, CFb, CFc, ... that are respectively represented in memory by "x" discrete values, where "x" is an integer number greater than or equal to 1.

The memory 18 also stores a separate list of questions 24 (Q1 to Qmax) not associated with any of the items 20 in particular. Associated with each question (Q) 24 are possible answers which, when given by the user, indicate the presence or absence of characteristic features (CF) 25, and which are represented in memory by distinct values.

For example, the first question Q1 may have two characteristic features CF, and CF3. The second question Q2 may have four answers CF4, CF5, CF, and CF,,. In general, a question QN will have "x" characteristic features CFa, CFb, CFc, where "x" is an integer number greater than or equal to 1.

The answer to a question indicates the presence of one of the characteristic features 25 presented in the question, and this is then matched to the characteristic features 22 of the remaining items 20. Alternatively the answer could be "not sure", in which case no match is possible. The answers to a question could also be the mutually exclusive answers "affirmative", "negative", or "not sure". This will result in a characteristic feature 25 being present or absent, which is then matched to the characteristic features 22 of the remaining items 20.
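The arrangement of set data described above might be modelled as in the following sketch. It is illustrative only: the names `ITEMS`, `QUESTIONS`, `NOT_SURE` and `apply_answer`, the font names, and the particular integer codes are assumptions chosen for the example and do not appear in this specification.

```python
# Hypothetical set data: each characteristic feature is a distinct integer code.
ITEMS = {
    "Font A": {1, 3},
    "Font B": {2, 3},
    "Font C": {1, 4},
}

# Each question lists its possible answers and the feature code each one indicates.
QUESTIONS = {
    "a_storeys": {"Double storey": 1, "Single storey": 2},
    "i_dot": {"Circular dot": 3, "Square dot": 4},
}

NOT_SURE = "Not sure"

def apply_answer(candidates, question_id, answer):
    """Return the candidates that remain after one answer.

    A 'Not sure' answer indicates no feature, so no candidate is eliminated.
    Otherwise only items having the indicated feature code are retained.
    """
    if answer == NOT_SURE:
        return set(candidates)
    code = QUESTIONS[question_id][answer]
    return {name for name in candidates if code in ITEMS[name]}
```

For instance, answering "Double storey" to the first question would leave "Font A" and "Font C" in this hypothetical data set, while "Not sure" would leave all three.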

In the particular example illustrated in Figure 1, the display 6 shows a question regarding the identification of a font style. Here, a question 124 is presented: "Is the 'a' single-story or double-story?". Three possible answers 125, 225, 325, which are the characteristic features "Double storey" and "Single storey", and the option "not sure", are also presented together with appropriate graphic illustrations 30, 31, 32.

This question 124 can be answered either via the keyboard 10 or mouse 14. If the answer is anything other than "not sure", the processor 16 is then able to eliminate a number of potential candidate items 20 which do not have the characteristic feature 22 of either the single-storey "a" or the double-storey "a". In this example, it is preferred if the "not sure" answer is included, which indicates that neither of the characteristic features "Double storey" or "Single storey" can be said to be present, as the user may be trying to identify the font from a section of text not including the lower case letter "a". If the answer is "not sure" then no reduction of potential candidate answers can be made.

Figure 2 shows another example of a screen display 33 where a question 224 has been asked: "What is the shape of the dot on the 'i' or 'j'?" There are five possible answers 425, 525, 625, 725, 825 representing the characteristic features "Circular dot", "Square dot", "Diamond-shaped dot" and "No dot", and the option "Not sure". These characteristic features are represented by appropriate graphic illustrations 34, 35, 36, 37, 38. In this example, it is preferred if the "not sure" answer is included, as the user may be trying to identify the font from a section of text not including the lower case letter "i" or "j". If the answer is "not sure" then none of the characteristic features can be said to be present, and no reduction of potential candidate answers can be made.

The system can be designed to accept more than one answer. For example, if the user selects both "Square dot" and "Diamond dot", then this is equivalent to saying that the characteristic feature "Circular dot" is not present, and a corresponding reduction in the remaining candidate items can be made.
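Accepting more than one answer could be handled as in the sketch below. The function name `apply_multi_answer`, the integer feature codes, and the choice to treat an empty selection like "not sure" are assumptions made for illustration.

```python
def apply_multi_answer(candidates, item_features, selected_codes):
    """Keep candidates having at least one of the selected feature codes.

    candidates:     iterable of item names still in play.
    item_features:  dict item -> set of feature codes for that item.
    selected_codes: feature codes for every answer the user ticked.
    An empty selection is treated like 'not sure' (an assumption).
    """
    selected = set(selected_codes)
    if not selected:
        return set(candidates)
    return {c for c in candidates if item_features[c] & selected}
```

Selecting the codes for both "Square dot" and "Diamond dot" would then remove every item whose only dot-shape feature is something else, such as "Circular dot".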

In the case of font identification, each font can be classified according to a number of features, each of which can have two or more mutually exclusive values.

Therefore, each font is one of the items in the list of items 20, and is characterised by numerous associated characteristic features 22. The identification procedure involves asking a series of the questions 24, each of which determines if a characteristic feature 25 is present, until either a single candidate typeface (i.e. particular item 20) is uniquely identified, or a greatly reduced number of potential candidate typefaces (i.e. a reduced number of items 20) has been identified.

The memory 18 only needs to contain information about characteristic features 22, 25 that will be particularly useful in distinguishing a typeface 20 from the other typefaces in the database.

The identification procedure is constructed so that at each stage the processor 16 selects a question 24 that is likely to be most effective in reducing the size of the list of potential typeface candidates 20.

As a result of this, the processor 16 does not ask an inappropriate question. For example, if all the remaining candidates are serif typefaces it will not ask a question relevant only to sans-serif typefaces.

Apart from the first question, the sequence of questions will in general vary from session to session depending on the earlier answers.

The mechanism of the identification procedure is as follows.

The following terms are used in this explanation:

An 'answer' is the user's reply to a question about a particular characteristic feature, and is a value "v" of that feature.

'Answers' is the list of replies given by the user.

'Potential candidates' is the list of typefaces that remain as potential solutions to the identification procedure, because they have not been eliminated by previous questions.

'Potential questions' are the remaining questions, i.e. those questions relating to characteristic features that have not yet been identified by the user.

At any stage during the identification procedure a state can be defined by the list of answers already given by the user.

It should be noted that although Figure 1 shows the procedure running on a single computer, it would of course be possible for the process to be spread over a network, or the internet, for example with a user at a terminal computer interacting with a host at a remote location. In this case, the host would most likely have the processor 16 and memory 18, while the display 6 and data entry means 10, 14 would be at the user's site. In the Web-based implementation of the procedure, these characteristic features 25 are supplied to the remote Web server as an encoded string of letters and digits. Note that this is the only state information that needs to be given to the processor 16 at each stage in an identification procedure. The Web server processes this response to retain candidate items 20 for which this characteristic feature 22 is present.

From this a list of characteristic features that are present can be built up, together with a list of questions 24 that have already been asked. Initially the list of answers is blank.
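The specification does not give the format of the encoded string of letters and digits, so the following is purely a hypothetical scheme to show how the list of answers could serve as the only state passed to the server. The one-letter question identifiers, the numeric answer values, and the function names `encode_state` and `decode_state` are all assumptions.

```python
import re

def encode_state(answers):
    """Encode answers as a string such as 'a1d0k12'.

    answers: list of (question_letter, value_number) pairs in the order asked.
    In this hypothetical scheme each question is one lowercase letter and
    each answer value is a non-negative integer.
    """
    return "".join(f"{q}{v}" for q, v in answers)

def decode_state(state):
    """Recover the list of (question_letter, value_number) pairs."""
    return [(q, int(v)) for q, v in re.findall(r"([a-z])(\d+)", state)]
```

An empty string then corresponds to the initial state in which the list of answers is blank, and the questions already asked can be read off as the letters in the string.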

Any feature can also have the value 'Not sure', corresponding to the case where the user clicks the 'Not sure' button in reply to a question. This is handled by a special case in the following procedure:

1) Calculate the set of candidates for the list of answers supplied, as follows.
   a) Start with a list of all the typefaces 20 in the memory 18.
   b) For each answer,
      - if the value is 'Not sure', then ignore it
      - else remove all typefaces that have a different characteristic feature for that question.

2) If there is only one potential candidate left in the candidates list 20, display it as the identified typeface.

3) If there are no potential candidates left in the candidates list 20, display to the user that no typefaces match the sequence of answers given by the user.

4) Recalculate the set of potential questions 24 as follows:
   a) Start with a list of all the questions 24 in the memory 18.
   b) For each supplied answer 25, remove the question 24 corresponding to that answer.

5) Use the processor 16 to calculate the best question to ask, that is, the question that will eliminate as many possible candidate items as possible at each stage of questioning. This can be done by calculating a rating R for each particular characteristic feature 26 associated with each question 24 amongst the list of remaining potential questions. The rating R is calculated as follows:
   a) For each value of a potential characteristic feature, define n(v) to be the number of potential candidates that have value v.
   b) Define a total T to be the sum of all the n(v) values for all potential characteristic features:

$$T = \sum_{v} n(v) \tag{1}$$

   c) Define N = total number of remaining candidates.
   d) The rating R can then be defined as:

$$R = \frac{\sum_{v} n(v)\,\{T - n(v)\}}{N} \tag{2}$$

$$R = \frac{\sum_{v} n(v)\,\{\sum_{v'} n(v') - n(v)\}}{N} \tag{3}$$

For example, if the feature 'i dot shape' has values 'square', 'circle' or 'diamond', and there are respectively 20, 30, and 50 candidates with these values amongst the remaining potential candidates, then:

$$T = 20 + 30 + 50 = 100 \tag{4}$$

$$R = \frac{20 \cdot (100 - 20) + 30 \cdot (100 - 30) + 50 \cdot (100 - 50)}{100} \tag{5}$$

$$R = \frac{20 \cdot 80 + 30 \cdot 70 + 50 \cdot 50}{100} = 62 \tag{6}$$

This is equivalent to the expected number of candidates that will be weeded out by the question.

In the example of equations (4) to (6), there is an assumption that the user is equally likely to give any of the possible answers for any given question. It would, of course, be possible to include other factors into equations (4) to (6) in order to scale the various contributions to the rating from each question according either to a pattern deduced from previous answers, or a known likelihood of various answers based on previous experience or the known occurrence for various items in the item list 20.

6) The next step is to find the feature with the best rating.
   a) If the best rating is zero, none of the remaining questions is any help in narrowing the list of candidates, so report the current list of candidates.
   b) Otherwise ask the user the question relating to the characteristic feature with the best rating.

The same procedure can be used to determine the first question to present to the user.

7) Add the answer to the list of answers already supplied.
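Steps 1) to 6) of the procedure above can be sketched in a few lines. This is an illustrative reading of the procedure rather than the actual implementation: the data layout (each font as a mapping from feature name to value), the function names, the use of `None` for 'Not sure', and the treatment of a font with no recorded value for an answered feature as eliminated are all assumptions.

```python
NOT_SURE = None  # stands in for the 'Not sure' button (assumption)

def candidates_for(fonts, answers):
    """Step 1: fonts is {name: {feature: value}}, answers is {feature: value}."""
    remaining = dict(fonts)
    for feature, value in answers.items():
        if value is NOT_SURE:
            continue  # 'Not sure' eliminates nothing
        remaining = {n: f for n, f in remaining.items() if f.get(feature) == value}
    return remaining

def rating(candidates, feature):
    """Step 5: R = sum over v of n(v) * (T - n(v)), divided by N."""
    counts = {}
    for feats in candidates.values():
        v = feats.get(feature)
        if v is not None:
            counts[v] = counts.get(v, 0) + 1
    total = sum(counts.values())   # T
    n = len(candidates)            # N
    if n == 0:
        return 0.0
    return sum(c * (total - c) for c in counts.values()) / n

def next_step(fonts, all_features, answers):
    """Steps 1 to 6: return ('identified' | 'no_match' | 'candidates' | 'ask', payload)."""
    cands = candidates_for(fonts, answers)
    if len(cands) == 1:                       # step 2
        return ("identified", list(cands))
    if not cands:                             # step 3
        return ("no_match", [])
    potential = [f for f in all_features if f not in answers]   # step 4
    if not potential:
        return ("candidates", sorted(cands))
    best = max(potential, key=lambda f: rating(cands, f))       # steps 5 and 6
    if rating(cands, best) == 0:
        return ("candidates", sorted(cands))  # step 6a
    return ("ask", best)                      # step 6b
```

Step 7) then amounts to adding the user's reply to `answers` and calling `next_step` again. With 20, 30 and 50 candidates having the three dot shapes, `rating` reproduces the value of 62 from equations (4) to (6).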

Additional refinements and modifications can be made to this procedure. For example, the computer 2 (or website hosting the identification service) may include an 'Identify from a sample' feature which allows the user to specify a sample of text, such as a word or sentence. The program then only presents questions involving letters that are present in the sample, or general questions (such as whether the typeface is serif or sans serif).
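The 'Identify from a sample' refinement could be sketched as a filter over the potential questions. The function name `questions_for_sample`, the question identifiers, and the convention that an empty letter set marks a general question are assumptions made for this example.

```python
def questions_for_sample(questions, sample):
    """Return the questions that can be answered from the sample text.

    questions: dict question_id -> set of letters the question depends on;
               an empty set marks a general question (e.g. serif or sans serif).
    sample:    the word or sentence supplied by the user (case-sensitive).
    """
    present = {ch for ch in sample if ch.isalpha()}
    return [q for q, letters in questions.items() if not letters or letters & present]
```

For a sample such as "Hello", a question about the lower case "a" or about the dot on the "i" or "j" would be skipped, while a general serif question would still be presented.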

Optionally, if several potential characteristic features have equal ratings in step 5) above, one corresponding question may be chosen by the processor 16 at random. This makes the identification process different even for the same sequence of answers, which is more entertaining for the user.
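A random choice between equally rated questions might look like the sketch below. The function name `pick_best` and the `rng` parameter are assumptions; ratings are compared for exact equality, which is adequate for an illustration but may need a tolerance with floating-point ratings.

```python
import random

def pick_best(ratings, rng=random):
    """Pick a question with the highest rating, breaking ties at random.

    ratings: dict question_id -> rating R (must be non-empty).
    rng:     any object with a choice() method, e.g. random.Random(seed).
    """
    best = max(ratings.values())
    tied = [q for q, r in ratings.items() if r == best]
    return rng.choice(tied)
```

Passing a seeded `random.Random` instance makes the choice repeatable for testing, while the default module-level generator gives the varied sessions described above.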

Although the invention has been described in terms of a system for identifying fonts, the invention may also be used to identify other types of item from a list of items 20. For example, the invention may be used to help identify house plants, trees, roses, wild flowers, garden pests or blights, antiques, mechanical spare parts, or silver marks.

The invention can also be used for selecting a software product for a particular application, from amongst a range of possible candidate software products. For example, the database could include information about the range of word-processing packages available. It would present questions such as "What platform do you use?" with answers such as "Macintosh", "PC", "Unix", or questions such as "Do you need mail-merge capabilities?" and after several such questions it would present a suggested list of suitable software products.

Claims

1. A computer-based apparatus for matching an unidentified item with one or more candidate items from amongst a set of a plurality of possible candidate items, comprising data processing means, a memory for storing data, user data entry means for entering information input into the computer, and a user display for displaying information output by the computer, wherein: - the memory stores data relating to said set; - said set data includes at least one characteristic feature associated with each item in said set, and a plurality of questions each of which when presented to a user on the user display asks about the presence or absence of one or more characteristic features; - the data processing means is arranged to present in stages a series of said questions on the user display, the answers to said questions determining whether or not items remain possible candidate items; and - the questions are selected by the data processing means at each stage in order to eliminate as many candidate items as possible at each stage of questioning.
2. A method for identifying one or more candidate items from amongst a set of a plurality of possible candidate items, using an apparatus comprising data processing means, a memory for storing data, user data entry means for entering information input into the computer, and a user display for displaying information output by the computer, wherein the method comprises the steps of: - storing data relating to said set in the memory; - including in said set data at least one characteristic feature associated with each item in said set, and a plurality of questions each of which when presented to a user on the user display asks about the presence or absence of one or more characteristic features; - presenting in stages a series of said questions on the user display, the answers to said questions determining whether or not items remain possible candidate items; and - selecting the questions at each stage in order to eliminate as many candidate items as possible at each stage of questioning.
3. A method as claimed in Claim 2, in which the plurality of possible candidate items are different fonts.
4. A method as claimed in Claim 2 or Claim 3, in which the answers include an answer indicating that none of the characteristic features is present, which when selected as the answer to a question results in no reduction in the number of candidate items.
5. A method as claimed in any of Claims 2 to 4, in which for at least one question, there are at least three possible answers.
6. A method as claimed in any of Claims 2 to 5, in which for at least one question, more than one answer may be provided.
7. A method as claimed in any of Claims 2 to 6, in which questions are selected at each stage by calculating a rating value "R" for each particular characteristic feature associated with each question, R being defined by the equation $R = \sum_{v} [n(v) \cdot \{T - n(v)\}] / N$ where: "v" is a value of a particular characteristic feature identified by an answer to said question; "n(v)" is the number of potential candidates that have the value v; "T" is a total number being the sum of all the n(v) values for all potential characteristic features, $T = \sum_{v} n(v)$; and "N" is the total number of remaining candidates having possible values v for answers to said question.
8. A computer-based apparatus for matching an unidentified item with one or more candidate items from amongst a set of a plurality of possible candidate items, substantially as herein described, with reference to the accompanying drawings.
9. A method for identifying one or more candidate items from amongst a set of a plurality of possible candidate items, substantially as herein described, with reference to the accompanying drawings.