GB2448357A

GB2448357A - System for estimating text readability

Info

Publication number: GB2448357A
Application number: GB0707148A
Authority: GB
Inventors: Stephen Molton
Original assignee: Individual
Current assignee: Individual
Priority date: 2007-04-13
Filing date: 2007-04-13
Publication date: 2008-10-15
Also published as: GB0707148D0

Abstract

A system for estimating text readability based on the collocates (word pairs or bigrams) of words found in a corpus. The collocates may be stored in a collocation database. Text readability can be estimated by constructing an index based on the possible collocates of a word by taking into account the frequency of occurrence of a collocate and/or the position of a collocate in an array of possible collocates sorted by frequency of occurrence. Left- or right-hand collocates may be used.

Description

A system for estimating the readability of texts

Readability indices are sometimes useful for:

* Comparing the difficulties of texts used for examination purposes;
* Estimating the difficulty of texts used for public information and for language teaching;
* Estimating how well student texts are written.

A number of readability indices exist (e.g. Fog, Flesch, Kincaid, Fry, Bormuth, Coleman-Liau, etc.), but they mainly rely on sentence length, number of syllables in a word, paragraph length, etc., and pay no attention to the words which are actually used in a text or how they are used in context.

This method relies on the fact that any particular word can naturally be followed by only a limited set of words, and that within that set some words occur more frequently than others. This can be established by examining a corpus and finding the collocations of a word (see the collocates of each word (headword) of the sentence "This is a good idea.", given in Tables 1-4).
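As an illustration of how such collocate lists might be derived from a corpus, the following is a minimal sketch and not the patent's own implementation. The function name `build_collocation_table`, the crude tokenisation, and the toy corpus are all assumptions made for the example; a real database would be built from a large corpus.

```python
import re
from collections import Counter, defaultdict

def build_collocation_table(corpus_text):
    """Map each headword to its right-hand collocates, most frequent first.

    Assumption: word pairs are never counted across punctuation marks.
    """
    counts = defaultdict(Counter)
    # Break the corpus at punctuation so that pairs do not span it.
    for segment in re.split("[.,;:\"!?]+", corpus_text.lower()):
        words = re.findall("[a-z']+", segment)
        for headword, following in zip(words, words[1:]):
            counts[headword][following] += 1
    # most_common() yields (collocate, occurrences) sorted by descending count.
    return {head: ctr.most_common() for head, ctr in counts.items()}

# Toy corpus, invented purely for illustration.
toy_corpus = "This is a good idea. It is a good idea. That is a good thing."
table = build_collocation_table(toy_corpus)
print(table["good"])
```

Each entry keeps both the order of the collocates and their counts, so the same table could serve either a count-based or a rank-based index.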

The method requires a database of the commonest words in the language (at least 20,000 words). For each of these words, the list of possible following words is given in order of frequency of occurrence. If Method A (see below) is to be used, the database must also include the number of occurrences of each collocate for each source word in the corpus used to construct the database.

It also requires a parsing program which takes the string of entered text and examines it.

There are two methods of calculating the index.

Method A requires that the database contain the actual number of occurrences of each collocate (as in row a in the example tables).

Method B requires only the order of frequency of occurrence of each collocate (as in row b in the example tables).

Method A

For each word, its right-hand collocate is examined and two numbers are extracted from the database. X is the number of occurrences of the highest-scoring collocate of the word (for the words "good idea", X equals 1917, the count of the highest-scoring collocate of the word "good"; see Table 4), and Y is the number of occurrences of the collocate actually found in the text (so for "good idea", "idea" scores 1861).

The final score for each word is given by Y/X × 100 (so "idea" following the word "good" would score 97.07877).

Hence the word at the beginning of the array will always score 100, and every other word will score a percentage given by its number of occurrences relative to the number of occurrences of the first word in the array.

As the program parses through the text each word is given a score for its collocate and this score is added to the total. The total is divided by the number of words examined to give a final readability score.
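The Method A calculation can be sketched as follows. This is a hedged illustration only: the function names, the dictionary layout of the database, and the decision to score unseen pairs as 0 are assumptions not stated in the description. The counts 1917 and 1861 are the figures quoted above; the key `"<top collocate>"` is a placeholder, because the word holding the top count is given only in Table 4.

```python
def method_a_word_score(collocates, following_word):
    """Return Y / X * 100 for one headword.

    collocates: dict mapping each right-hand collocate to its occurrence count.
    X is the count of the most frequent collocate, Y the count of the word
    that actually follows the headword in the text.
    """
    if not collocates:
        return 0.0  # assumption: headwords missing from the database score 0
    x = max(collocates.values())
    y = collocates.get(following_word, 0)  # assumption: unseen pairs score 0
    return y / x * 100

def method_a_index(words, database):
    """Average the per-word scores over all words that have a following word."""
    total = 0.0
    examined = 0
    for headword, following in zip(words, words[1:]):
        total += method_a_word_score(database.get(headword, {}), following)
        examined += 1
    return total / examined if examined else 0.0

# Counts taken from the text; the top collocate's identity is a placeholder.
database = {"good": {"<top collocate>": 1917, "idea": 1861}}
score = method_a_word_score(database["good"], "idea")
print(round(score, 5))  # 97.07877
```

Because every score is expressed relative to the top collocate of the same headword, the most frequent collocate always scores exactly 100.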

Method B

Since, in most cases, the number of occurrences of each collocate of a given headword falls off in roughly logarithmic fashion, a good approximate index can be calculated without a database that holds a score for each word. All that is necessary is that, in the database, collocates are ordered by frequency of occurrence.

For each word in a sentence, the following word is examined and its position in the order of frequency is noted. The higher the frequency (and therefore the earlier in the array the word occurs), the higher the index given to the word.

In this case, for each word the index is 100 divided by its position in the array. See tables 7 & 8 for an illustration using methods A & B with the sentence "This is a good idea".
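The rank-based index of Method B might be computed as in the sketch below. The function name, the list representation of the ordered array, and the score of 0 for a word absent from the array are assumptions; the collocate list shown is invented for illustration and is not taken from the tables.

```python
def method_b_word_score(ordered_collocates, following_word):
    """Return 100 divided by the 1-based position of the following word.

    ordered_collocates: list of collocates, most frequent first.
    """
    try:
        position = ordered_collocates.index(following_word) + 1
    except ValueError:
        return 0.0  # assumption: a word absent from the array scores 0
    return 100 / position

# Hypothetical ordered array for some headword (illustrative only).
ordered = ["first", "second", "third", "fourth"]
print(method_b_word_score(ordered, "first"))   # 100.0
print(method_b_word_score(ordered, "second"))  # 50.0
print(method_b_word_score(ordered, "fourth"))  # 25.0
```

Under this formula the index falls quickly with rank: a collocate in 680th position, for example, would score 100/680, or about 0.147.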

The parsing program skips words followed by punctuation (. , ; : " ! ?) because these words can have no right-hand collocates; the running total remains the same and the number of words counted is not incremented until the next word which can have a right-hand collocate is encountered (normally the first word of the next sentence).
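The skipping rule can be sketched as below, under stated assumptions: the tokenisation by regular expression, the function names `scorable_pairs` and `readability_index`, and the `score_pair` callback (standing in for either scoring method) are hypothetical and chosen for the example.

```python
import re

PUNCTUATION = set('.,;:"!?')

def scorable_pairs(text):
    """Return (headword, following word) pairs, skipping words followed by punctuation."""
    # Tokens are either runs of letters/apostrophes or single punctuation marks.
    tokens = re.findall("[A-Za-z']+|[.,;:\"!?]", text)
    pairs = []
    for current, nxt in zip(tokens, tokens[1:]):
        if current in PUNCTUATION:
            continue  # punctuation is never a headword
        if nxt in PUNCTUATION:
            continue  # no right-hand collocate: total and count are unchanged
        pairs.append((current.lower(), nxt.lower()))
    return pairs

def readability_index(text, score_pair):
    """Average score_pair(headword, following) over the scorable pairs."""
    pairs = scorable_pairs(text)
    if not pairs:
        return 0.0
    return sum(score_pair(h, f) for h, f in pairs) / len(pairs)

pairs = scorable_pairs("This is a good idea. It works!")
print(pairs)
```

In this example "idea" and "works" contribute nothing, because each is followed by punctuation, so five pairs are scored. The last word of a text with no trailing punctuation is likewise left out, since it has no following word.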

This system could be further refined by incorporating a similar index for left-hand collocates (i.e. words preceding the headword). Take as an example the words "modern convenience": in the list of possible right-hand collocates of "modern", the word "convenience" is very low (680th in the British National Corpus). But to a native ear these words seem to go together very naturally. The problem is that, like many adjectives, "modern" collocates well with very many nouns. If we take the word "convenience" and examine its possible left-hand collocates, we find that "modern" comes eighth in the list. So combining both left-hand and right-hand collocation indices

The simplest way to implement this system is to hold the database in the parsing program as an XML file.
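The left-hand index could be derived from the same word-pair counts as the right-hand one, as in the sketch below. The counts are made up for illustration and are not British National Corpus figures; the function names and the rank-based scoring (100 divided by position) are assumptions. The sketch computes the two indices separately and does not attempt to combine them.

```python
from collections import Counter, defaultdict

def ordered_collocates(bigram_counts, side):
    """Build ordered collocate arrays from {(first_word, second_word): count}.

    side="right": for each first word, the words that follow it.
    side="left":  for each second word, the words that precede it.
    """
    table = defaultdict(Counter)
    for (first, second), n in bigram_counts.items():
        if side == "right":
            table[first][second] += n
        else:
            table[second][first] += n
    return {head: [w for w, _ in ctr.most_common()] for head, ctr in table.items()}

def position_index(ordered, word):
    """Return 100 divided by the 1-based position, or 0.0 if absent (assumption)."""
    return 100 / (ordered.index(word) + 1) if word in ordered else 0.0

# Invented counts, for illustration only.
toy_counts = {
    ("modern", "world"): 9,
    ("modern", "art"): 7,
    ("modern", "convenience"): 2,
    ("public", "convenience"): 3,
}
right = ordered_collocates(toy_counts, "right")
left = ordered_collocates(toy_counts, "left")
right_index = position_index(right["modern"], "convenience")
left_index = position_index(left["convenience"], "modern")
print(right_index, left_index)
```

With these toy counts "convenience" ranks third among the right-hand collocates of "modern", while "modern" ranks second among the left-hand collocates of "convenience", so the left-hand index is the higher of the two.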

Claims

1. A system for estimating text readability based on word collocates found in a corpus.
2. A system for estimating text readability according to claim 1, based on the actual numbers of collocates of a word found in a corpus.
3. A system for estimating text readability according to claim 1, based on the order of frequency of collocates of a word found in a corpus.
4. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of right-hand collocates of words in a text.
5. A system for estimating text readability according to claim 1, calculated by using the actual numbers of right-hand collocates of a word found in a corpus.
6. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of left-hand collocates of words in a text.
7. A system for estimating text readability according to claim 1, calculated by using the actual numbers of left-hand collocates of a word found in a corpus.