GB2448357A - System for estimating text readability - Google Patents
System for estimating text readability
Info
- Publication number
- GB2448357A (application number GB0707148A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- collocates
- word
- words
- estimating
- hand
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G06F17/274—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
A system for estimating text readability based on the collocates (word pairs or bigrams) of words found in a corpus. The collocates may be stored in a collocation database. Text readability can be estimated by constructing an index based on the possible collocates of a word by taking into account the frequency of occurrence of a collocate and/or the position of a collocate in an array of possible collocates sorted by frequency of occurrence. Left- or right-hand collocates may be used.
Description
A system for estimating the readability of texts
Readability indices are sometimes useful for:
- Comparing the difficulties of texts used for examination purposes;
- Estimating the difficulty of texts used for public information and for language teaching;
- Estimating how well student texts are written.
A number of readability indices exist (e.g. Fog, Flesch, Kincaid, Fry, Bormuth, Coleman-Liau, etc.), but they rely mainly on sentence length, number of syllables in a word, paragraph length, etc., and pay no attention to the words which are actually used in a text or to how they are used in context.
This method relies on the fact that any particular word can naturally be followed by only a limited set of words, and that within that set some words occur more frequently than others. This can be established by examining a corpus and finding the collocations of a word (see the collocates of each word (headword) of the sentence "This is a good idea.", given in Tables 1-4).
The method requires a database of the commonest words in the language (at least 20,000 of them). For each of these words, the list of possible following words is given in order of frequency of occurrence. If Method A (see below) is to be used, the database must also include the number of occurrences of each collocate for each source word in the corpus used to construct the database.
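As a minimal sketch of how such a database might be derived from a corpus, the following counts the word that immediately follows each headword and stores the collocates most frequent first. The function name `build_database` and the three-sentence `toy_corpus` are assumptions made for illustration; the specification does not prescribe a construction procedure, and a real database would be built from a large corpus.

```python
from collections import Counter, defaultdict


def build_database(sentences):
    """Map each headword to its right-hand collocates, most frequent first.

    Each entry is a list of (collocate, occurrences) pairs. Method A needs
    the occurrence counts; Method B needs only the ordering.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that immediately follows it.
        for headword, following in zip(words, words[1:]):
            counts[headword][following] += 1
    return {head: c.most_common() for head, c in counts.items()}


# Toy corpus invented for illustration only (not data from the specification).
toy_corpus = [
    "this is a good idea",
    "this is a good thing",
    "that is a good idea",
]
database = build_database(toy_corpus)
print(database["good"])  # [('idea', 2), ('thing', 1)]
```

Keeping the counts alongside the ordering lets one structure serve both methods; a store intended only for Method B could drop the counts and keep just the ordered word lists.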
It also requires a parsing program which takes the string of entered text and examines it. There are two methods of calculating the index.
Method A requires that the database contain the actual number of occurrences of each collocate (as in row a in the example tables).
Method B requires only the order of frequency of occurrence of each collocate (as in row b in the example tables).
Method A
For each word, its right-hand collocate is examined and two numbers are extracted from the database. X is the number of occurrences of the highest-scoring collocate of the word (if we take the words "good idea", X equals 1917, the count for the highest-scoring collocate of the word "good"; see Table 4), and Y is the number of occurrences of the collocate that actually follows the word in the text (so for "good idea", "idea" scores 1861).
The final score for each word is given by Y/X × 100 (so "idea" for the word "good" would score 97.07877).
Hence the word at the beginning of the array will always score 100, and each of the others will score a percentage given by its number of occurrences relative to the number of occurrences of the first word in the array.
As the program parses through the text, each word is given a score for its collocate and this score is added to the total. The total is divided by the number of words examined to give a final readability score.
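The Method A calculation can be sketched as follows. The counts 1917 and 1861 for "good" are taken from the description above; the identity of the top collocate is not stated there, so `"<top collocate>"` is a placeholder. The function names, the dictionary layout, and the choice to score an unknown headword or collocate as 0 are assumptions, since the specification does not say how unseen pairs are handled.

```python
def method_a_score(database, headword, following):
    """Score one headword as Y / X * 100 for its right-hand collocate."""
    counts = database.get(headword, {})
    if not counts:
        return 0.0  # assumption: an unknown headword scores 0
    x = max(counts.values())      # X: occurrences of the top collocate
    y = counts.get(following, 0)  # Y: occurrences of the collocate in the text
    return y / x * 100


def method_a_index(database, words):
    """Average the per-word scores over every word that has a following word."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    total = sum(method_a_score(database, h, f) for h, f in pairs)
    return total / len(pairs)


# Counts from the description; "<top collocate>" is a placeholder name.
database = {"good": {"<top collocate>": 1917, "idea": 1861}}
print(round(method_a_score(database, "good", "idea"), 5))  # 97.07877
```

The last word of a sequence has no following word, so it contributes neither to the total nor to the count of words examined.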
Method B
Since, in most cases, the numbers of occurrences of the collocates of a given headword fall off in roughly logarithmic fashion, a good approximate index can be calculated without a database that holds a score for each word. All that is necessary is that, in the database, collocates are ordered by frequency of occurrence.
For each word in a sentence, the following word is examined and its position in the order of frequency is noted. The higher the frequency (and therefore the earlier in the array the word occurs), the higher the index given to the word.
In this case, for each word the index is 100 divided by the position of its collocate in the array. See Tables 7 & 8 for an illustration using methods A & B with the sentence "This is a good idea".
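A comparable sketch for Method B needs only the ordered collocate lists. The ranking below is invented purely to show the arithmetic, and the function names and the zero score for an unlisted collocate are assumptions rather than part of the specification.

```python
def method_b_score(ordered_db, headword, following):
    """Return 100 divided by the position of `following` in the ranking."""
    ranking = ordered_db.get(headword, [])
    if following not in ranking:
        return 0.0  # assumption: an unlisted collocate scores 0
    position = ranking.index(following) + 1  # positions start at 1
    return 100 / position


def method_b_index(ordered_db, words):
    """Average the per-word scores over every word that has a following word."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    total = sum(method_b_score(ordered_db, h, f) for h, f in pairs)
    return total / len(pairs)


# Hypothetical ranking, most frequent collocate first.
ordered_db = {"head": ["first", "second", "third", "fourth"]}
for word in ordered_db["head"]:
    print(word, method_b_score(ordered_db, "head", word))
```

With this ranking the four collocates score 100, 50, 33.3 (approximately) and 25, so the score falls away quickly as the collocate moves down the list.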
The parsing program skips words followed by punctuation (. , ; : " ! ?) because these words can have no right-hand collocates; the running total remains the same and the number of words counted is not incremented until the next word that can have a right-hand collocate is encountered (normally the first word of the next sentence).
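The skipping rule might be implemented along the following lines. This is a sketch under simplifying assumptions: the text is split on whitespace, punctuation is assumed to be attached to the word it follows (or, for an opening quotation mark, to the word it precedes), and the names `PUNCTUATION` and `collocate_pairs` are hypothetical.

```python
PUNCTUATION = '.,;:"!?'


def collocate_pairs(text):
    """Return (headword, following word) pairs, skipping words before punctuation."""
    tokens = text.split()
    pairs = []
    for current, following in zip(tokens, tokens[1:]):
        # A word followed by punctuation has no right-hand collocate.
        if current[-1] in PUNCTUATION or following[0] in PUNCTUATION:
            continue
        head = current.strip(PUNCTUATION).lower()
        nxt = following.strip(PUNCTUATION).lower()
        if head and nxt:
            pairs.append((head, nxt))
    return pairs


print(collocate_pairs("This is a good idea. It works!"))
```

Here "idea." produces no pair, so scoring resumes with "It" at the start of the next sentence, and the number of pairs returned is the number of words examined.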
This system could be further refined by incorporating a similar index for left-hand collocates (i.e. words preceding the headword). Take, for example, the words "modern convenience": in the list of possible right-hand collocates of "modern", the word "convenience" ranks very low (680th in the British National Corpus), yet to a native ear these words go together very naturally. The problem is that, like many adjectives, "modern" collocates well with very many nouns. If we take the word "convenience" and examine its possible left-hand collocates, we find that "modern" comes eighth in the list. So combining both left-hand and right-hand collocation indices
The simplest way to implement this system is to hold the database in the parsing program as an XML file.
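The specification does not give an XML layout, so the sketch below assumes one: the element and attribute names (`collocations`, `headword`, `collocate`, `form`, `occurrences`) are hypothetical, as is the function `load_database`. The two counts for "good" come from the Method A description, and `top-collocate` is a placeholder for the unnamed highest-scoring collocate.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: collocates are listed most frequent first.
XML_DATABASE = """
<collocations>
  <headword form="good">
    <collocate form="top-collocate" occurrences="1917"/>
    <collocate form="idea" occurrences="1861"/>
  </headword>
</collocations>
"""


def load_database(xml_text):
    """Parse the XML into {headword: [(collocate, occurrences), ...]}."""
    root = ET.fromstring(xml_text)
    database = {}
    for head in root.findall("headword"):
        database[head.get("form")] = [
            (c.get("form"), int(c.get("occurrences")))
            for c in head.findall("collocate")
        ]
    return database


database = load_database(XML_DATABASE)
print(database["good"])
```

Because the collocates are stored in frequency order, the same file can serve Method A (using the counts) and Method B (using only the positions); a second file of the same shape could hold left-hand collocates.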
Claims (7)
- 1. A system for estimating text readability based on word collocates found in a corpus.
- 2. A system for estimating text readability according to claim 1, based on the actual numbers of collocates of a word found in a corpus.
- 3. A system for estimating text readability according to claim 1, based on the order of frequency of collocates of a word found in a corpus.
- 4. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of right-hand collocates of words in a text.
- 5. A system for estimating text readability according to claim 1, calculated by using the actual numbers of right-hand collocates of a word found in a corpus.
- 6. A system for estimating text readability according to claim 1, calculated by using the frequency of occurrence of left-hand collocates of words in a text.
- 7. A system for estimating text readability according to claim 1, calculated by using the actual numbers of left-hand collocates of a word found in a corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0707148A GB2448357A (en) | 2007-04-13 | 2007-04-13 | System for estimating text readability |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0707148A GB2448357A (en) | 2007-04-13 | 2007-04-13 | System for estimating text readability |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0707148D0 (en) | 2007-05-23 |
GB2448357A (en) | 2008-10-15 |
Family
ID=38116677
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0707148A Withdrawn GB2448357A (en) | 2007-04-13 | 2007-04-13 | System for estimating text readability |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2448357A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2511831A1 (en) * | 2011-04-14 | 2012-10-17 | James Lawley | Text processor and method of text processing |
WO2013142852A1 (en) * | 2012-03-23 | 2013-09-26 | Sententia, LLC | Method and systems for text enhancement |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5617488A (en) * | 1995-02-01 | 1997-04-01 | The Research Foundation Of State University Of New York | Relaxation word recognizer |
EP0933713A2 (en) * | 1998-01-30 | 1999-08-04 | Sharp Kabushiki Kaisha | Method of and apparatus for processing an input text, method of and apparatus for performing an approximate translation and storage medium |
-
2007
- 2007-04-13 GB GB0707148A patent/GB2448357A/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
Anagnostou, N. K. and Weir, G. R. S. (2006) "From corpus-based collocation frequencies to readability measure." In: ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, 21-22 Aug 2006, Glasgow, UK. Available at URL: http://eprints.cdlr.strath.ac.uk/2381/ * |
Also Published As
Publication number | Publication date |
---|---|
GB0707148D0 (en) | 2007-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Schroeder et al. | childLex: A lexical database of German read by children | |
Khandelwal et al. | Gender prediction in english-hindi code-mixed social media content: Corpus and baseline system | |
US9164983B2 (en) | Broad-coverage normalization system for social media language | |
US9043339B2 (en) | Extracting terms from document data including text segment | |
US8375033B2 (en) | Information retrieval through identification of prominent notions | |
Scannell | Statistical unicodification of African languages | |
Al-Haj et al. | The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation | |
JPWO2016051551A1 (en) | Sentence generation system | |
Greavu et al. | A classification of borrowings: Observations from Romanian/English contact | |
Bourgonje et al. | The Potsdam commentary corpus 2.2: Extending annotations for shallow discourse parsing | |
Guégan et al. | A French Human Reference Corpus for Multi-Document Summarization and Sentence Compression. | |
Anglemark et al. | The use of English-language business and finance terms in European languages | |
Szmrecsanyi | An analytic-synthetic spiral in the history of English | |
US8346541B2 (en) | Method for constructing Chinese dictionary and apparatus and storage media using the same | |
Khan et al. | Towards domain adaptation for parsing web data | |
GB2448357A (en) | System for estimating text readability | |
Miller | Analysing frequency lists | |
Vlachos | Tackling the BioCreative2 gene mention task with conditional random fields and syntactic parsing | |
Costa-jussa et al. | Towards human linguistic machine translation evaluation | |
Skirgård | Français Tirailleur Pidgin–a corpus study | |
Augustinus et al. | The IPP effect in Afrikaans: a corpus analysis | |
Ronan | Simple versus light verb constructions in late modern Irish English correspondence: A qualitative and quantitative analysis | |
Busse et al. | Problem areas of English grammar between usage, norm, and variation | |
Volk | The automatic resolution of prepositional phrase attachment ambiguities in German | |
Garley | Crossing the lexicon: Anglicisms in the German hip hop community |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |