GB2448357A - System for estimating text readability - Google Patents

Info

Publication number
GB2448357A
Authority
GB
United Kingdom
Prior art keywords
collocates
word
words
estimating
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0707148A
Other versions
GB0707148D0 (en)
Inventor
Stephen Molton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to GB0707148A
Publication of GB0707148D0
Publication of GB2448357A
Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G06F17/274
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system for estimating text readability based on the collocates (word pairs or bigrams) of words found in a corpus. The collocates may be stored in a collocation database. Text readability can be estimated by constructing an index based on the possible collocates of a word by taking into account the frequency of occurrence of a collocate and/or the position of a collocate in an array of possible collocates sorted by frequency of occurrence. Left- or right-hand collocates may be used.

Description

A system for estimating the readability of texts
Readability indices are sometimes useful for:
* Comparing the difficulties of texts used for examination purposes;
* Estimating the difficulty of texts used for public information and for language teaching;
* Estimating how well student texts are written.
A number of readability indices exist (e.g. Fog, Flesch, Kincaid, Fry, Bormuth, Coleman-Liau) but they mainly rely on sentence length, number of syllables in a word, paragraph length, etc., and pay no attention to the words actually used in a text or to how they are used in context.
This method relies on the fact that any particular word can naturally be followed by only a limited set of words, and that within that set some words occur more frequently than others. This can be established by examining a corpus and finding the collocations of a word (see the collocates of each word (headword) of the sentence "This is a good idea.", given in Tables 1-4).
The method requires a database of the commonest words in the language (at least 20,000 words). For each of these words, the possible following words are listed in order of frequency of occurrence. If Method A (see below) is to be used, the database must also include the number of occurrences of each collocate for each source word in the corpus used to construct the database.
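As an illustration of the database described above, the following is a minimal sketch of how right-hand collocate counts might be gathered from a corpus. The function name `build_collocation_db`, the toy corpus, and the tokenisation rules are assumptions made for this example; they are not taken from the patent text.

```python
import re
from collections import Counter, defaultdict


def build_collocation_db(corpus_text):
    """Return {headword: [(collocate, count), ...]} sorted by descending count.

    Hypothetical helper: the source only says the database is built
    by examining a corpus, not how.
    """
    counts = defaultdict(Counter)
    # Split on punctuation so that no word pair spans a punctuation mark.
    for segment in re.split(r'[.,;:"!?]+', corpus_text.lower()):
        words = re.findall(r"[a-z']+", segment)
        for headword, follower in zip(words, words[1:]):
            counts[headword][follower] += 1
    # most_common() orders collocates by frequency, highest first.
    return {head: followers.most_common() for head, followers in counts.items()}


# Toy corpus, invented purely for illustration.
toy_corpus = "This is a good idea. This is a good plan. That is a good idea."
db = build_collocation_db(toy_corpus)
print(db["good"])
```

Keeping the counts serves Method A, while the ordering alone is enough for Method B.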
It also requires a parsing program which takes the string of entered text and examines it.
There are two methods of calculating the index.
Method A requires that the database contain the actual number of occurrences of each collocate (as in row a in the example tables).
Method B requires only the order of frequency of occurrence of each collocate (as in row b in the example tables).
Method A
For each word, its right-hand collocate is examined and two numbers are extracted from the database. X is the number of occurrences of the highest-scoring collocate of the word (for the words "good idea", X equals 1917, the count of the highest-scoring collocate of the word "good"; see Table 4), and Y is the number of occurrences of the collocate that actually follows the word in the text (so for "good idea", "idea" scores 1861).
The final score for each word is given by Y/X x 100 (so "idea" for the word "good" would score 97.07877).
Hence the word at the beginning of the array will always score 100 and the rest will score a percentage related to its number of occurrences compared to the number of occurrences of the first word in the array.
As the program parses through the text each word is given a score for its collocate and this score is added to the total. The total is divided by the number of words examined to give a final readability score.
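The Method A calculation can be sketched as follows, taking X as the count of the highest-scoring collocate of a word and the second number as the count of the collocate actually found in the text. The function names and the dictionary layout are assumptions for this example. The figures 1917 and 1861 come from the description; the name of the top collocate of "good" is not given there, so a placeholder key is used. Scoring an unlisted collocate as 0 is also an assumption, since the description does not say how such cases are handled.

```python
def method_a_word_score(collocate_counts, following_word):
    """Score one word as Y / X * 100 (hypothetical helper)."""
    if not collocate_counts:
        return 0.0  # assumption: unknown headword scores 0
    x = max(collocate_counts.values())           # count of the top collocate
    y = collocate_counts.get(following_word, 0)  # assumption: unlisted collocate scores 0
    return y / x * 100


def method_a_text_score(db, words):
    """Average the per-word scores over all words examined."""
    total = 0.0
    examined = 0
    for headword, follower in zip(words, words[1:]):
        total += method_a_word_score(db.get(headword, {}), follower)
        examined += 1
    return total / examined if examined else 0.0


# Counts for "good": 1917 and 1861 are from the description;
# "<top collocate>" is a placeholder because the word itself is not named.
good_counts = {"<top collocate>": 1917, "idea": 1861}
print(round(method_a_word_score(good_counts, "idea"), 5))  # 97.07877
```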
Method B
Since, in most cases, the numbers of occurrences of the collocates of a given headword fall off in roughly logarithmic fashion, a good approximate index can be calculated without a database that holds a score for each word. All that is necessary is that, in the database, collocates are ordered by frequency of occurrence.
For each word in a sentence, the following word is examined and its position in the order of frequency is noted. The higher the frequency (and therefore the earlier in the array the word occurs) the higher the index given to the word.
In this case, for each word the index is 100 divided by its position in the array. See tables 7 & 8 for an illustration using methods A & B with the sentence "This is a good idea".
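A minimal sketch of the Method B index, 100 divided by the 1-based position of the following word in the frequency-ordered array, might look like this. The function name and the placeholder collocate list are assumptions for illustration, as is the choice to score a word that is absent from the array as 0.

```python
def method_b_word_score(ranked_collocates, following_word):
    """Return 100 / position, where position is 1-based (hypothetical helper)."""
    try:
        position = ranked_collocates.index(following_word) + 1
    except ValueError:
        return 0.0  # assumption: a word missing from the array scores 0
    return 100 / position


# Placeholder collocates ordered by frequency, most frequent first.
ranked = ["first", "second", "third", "fourth"]
print(method_b_word_score(ranked, "first"))   # 100.0
print(method_b_word_score(ranked, "fourth"))  # 25.0
```

Only the ordering is consulted, so the database needs no occurrence counts for this method.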
The parsing program skips words followed by punctuation (., ; : " ! ?) because these words can have no right-hand collocates; the running total remains the same and the number of words counted is not augmented till the next word which can have a right hand collocate is encountered (normally the first word of the next sentence).
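The skipping rule can be sketched as a small tokeniser that yields only those word pairs in which the first word is not followed by punctuation. The names `collocate_pairs` and `readability`, and the regular expression used for tokenising, are assumptions for this example. The per-pair scoring function is passed in, so either Method A or Method B could be supplied.

```python
import re

# The punctuation marks listed in the description.
PUNCTUATION = set('.,;:"!?')


def collocate_pairs(text):
    """Return (word, next_word) pairs, skipping words followed by punctuation."""
    tokens = re.findall(r"[A-Za-z']+|[.,;:\"!?]", text)
    pairs = []
    for current, following in zip(tokens, tokens[1:]):
        # A punctuation token is never a headword, and a word followed
        # by punctuation has no right-hand collocate to score.
        if current in PUNCTUATION or following in PUNCTUATION:
            continue
        pairs.append((current.lower(), following.lower()))
    return pairs


def readability(text, score_pair):
    """Average score_pair(word, next_word) over the pairs that were examined."""
    pairs = collocate_pairs(text)
    if not pairs:
        return 0.0
    return sum(score_pair(head, follower) for head, follower in pairs) / len(pairs)


print(collocate_pairs("This is a good idea. It works!"))
```

Skipped words add nothing to the running total or to the word count, which matches the behaviour described above.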
This system could be further refined by incorporating a similar index for left-hand collocates (i.e. words preceding the headword). Take as an example the words "modern convenience": in the list of possible right-hand collocates of "modern", the word "convenience" is very low (680th in the British National Corpus). But to a native ear these words seem to go together very naturally. The problem is that, like many adjectives, "modern" collocates well with very many nouns. If we take the word "convenience" and examine its possible left-hand collocates, we find that "modern" comes eighth in the list. So combining both left-hand and right-hand collocation indices
The simplest way to implement this system is to hold the database in the parsing program as an XML file.
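One way the XML-held database might be read is sketched below using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`collocations`, `headword`, `collocate`, `form`, `count`) are an invented schema, since the description does not specify one. The counts 1917 and 1861 are from the description; the word "placeholder" stands in for the unnamed top collocate of "good".

```python
import xml.etree.ElementTree as ET

# Hypothetical schema; the patent text does not define the XML layout.
SAMPLE_XML = """
<collocations>
  <headword form="good">
    <collocate form="idea" count="1861"/>
    <collocate form="placeholder" count="1917"/>
  </headword>
</collocations>
"""


def load_collocation_db(xml_text):
    """Parse the XML into {headword: [(collocate, count), ...]}, highest count first."""
    root = ET.fromstring(xml_text.strip())
    db = {}
    for head in root.findall("headword"):
        entries = [
            (coll.get("form"), int(coll.get("count")))
            for coll in head.findall("collocate")
        ]
        # Sort so that list position reflects frequency rank (needed for Method B).
        entries.sort(key=lambda entry: entry[1], reverse=True)
        db[head.get("form")] = entries
    return db


xml_db = load_collocation_db(SAMPLE_XML)
print(xml_db["good"])
```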

Claims (7)

  1. A system for estimating text readability based on word collocates found in a corpus.
  2. A system for estimating text readability according to claim 1, based on the actual numbers of collocates of a word found in a corpus.
  3. A system for estimating text readability according to claim 1, based on the order of frequency of collocates of a word found in a corpus.
  4. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of right-hand collocates of words in a text.
  5. A system for estimating text readability according to claim 1, calculated by using the actual numbers of right-hand collocates of a word found in a corpus.
  6. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of left-hand collocates of words in a text.
  7. A system for estimating text readability according to claim 1, calculated by using the actual numbers of left-hand collocates of a word found in a corpus.
GB0707148A 2007-04-13 2007-04-13 System for estimating text readability Withdrawn GB2448357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0707148A GB2448357A (en) 2007-04-13 2007-04-13 System for estimating text readability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0707148A GB2448357A (en) 2007-04-13 2007-04-13 System for estimating text readability

Publications (2)

Publication Number Publication Date
GB0707148D0 2007-05-23
GB2448357A 2008-10-15

Family

ID=38116677

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0707148A Withdrawn GB2448357A (en) 2007-04-13 2007-04-13 System for estimating text readability

Country Status (1)

Country Link
GB (1) GB2448357A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2511831A1 (en) * 2011-04-14 2012-10-17 James Lawley Text processor and method of text processing
WO2013142852A1 (en) * 2012-03-23 2013-09-26 Sententia, LLC Method and systems for text enhancement

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5617488A (en) * 1995-02-01 1997-04-01 The Research Foundation Of State University Of New York Relaxation word recognizer
EP0933713A2 (en) * 1998-01-30 1999-08-04 Sharp Kabushiki Kaisha Method of and apparatus for processing an input text, method of and apparatus for performing an approximate translation and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Anagnostou, N. K. and Weir, G. R. S. (2006) "From corpus-based collocation frequencies to readability measure." In: ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, 21-22 Aug 2006, Glasgow, UK. Available at URL: http://eprints.cdlr.strath.ac.uk/2381/ *

Also Published As

Publication number Publication date
GB0707148D0 (en) 2007-05-23

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)