WO2008052239A1 - Email document parsing method and apparatus - Google Patents

Info

Publication number
WO2008052239A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
email
ratio
analysis
processing
Prior art date
Application number
PCT/AU2007/000440
Other languages
French (fr)
Inventor
Ben Hutchinson
Tanja Gaustad
Dominique Estival
Wil Radford
Son Bao Pham
Original Assignee
Appen Pty Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2006906095A
Application filed by Appen Pty Limited
Priority to EP07718687A (EP2092447A4)
Priority to US12/447,898 (US20100100815A1)
Priority to AU2007314123A (AU2007314123B2)
Publication of WO2008052239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Definitions

  • the present invention relates to a method and apparatus for parsing electronic mail (also known as "email") documents.
  • Embodiments of the present invention find application, though not exclusively, in the field of computational text processing, which is also known in some contexts as natural language processing, human language technology or computational linguistics.
  • the outputs of some preferred embodiments of the invention may be used in a wide range of computing tasks such as automatic email categorization techniques, sentiment analysis, author attribution, and the like.
  • the known prior art is typically restricted to analysing emails that are composed in the English language and which are expressed in the ASCII character set. Further, at least some of the prior art was developed at a point in time that was prior to the use of email becoming extremely widespread and such prior art is therefore not well adapted to parse the contemporary genre of email expression.
  • a computer implemented method of parsing an email document so as to categorize text from the email document as author composed text or non-author composed text said method including the steps of: processing the text to determine the presence of signature text and categorizing any such signature text as non-author composed text; processing the text to determine the presence of automatically appended advertisement text and categorizing any such automatically appended advertisement text as non-author composed text; processing the text to determine the presence of quotation text and categorizing any such quotation text as non-author composed text; processing the text to determine the presence of text contained in an embedded reply chain of email messages and categorizing any such text contained in an embedded reply chain of email messages as non-author composed text; and categorizing at least some of the remaining text as author composed text.
  • At least one of the text processing steps includes a linguistic analysis of the words in the text.
  • the linguistic analysis includes identification of predefined words and phrases of any one or more of the following types: peoples' names, locations, dates, times, organizations, currency, uniform resource locators (URL's), email addresses, addresses, organizational descriptors, phone numbers, typical greetings and/or typical farewells.
  • Such a preferred embodiment typically includes a database of words and phrases of any one or more of the said types.
  • preferred embodiments of the invention further include the step of anonymising information contained within the text of the email document.
  • At least one of the text processing steps includes an analysis of the punctuation used in the text. Also preferably, at least one of the text processing steps includes an analysis of the paragraph and sentence segmentation used in the text.
  • results of the linguistic analysis, the punctuation analysis and the paragraph and sentence segmentation are represented by one or more data structures associated with segments of the text.
  • segments of the text are lines of the text, although in other embodiments alternative segments are used.
  • At least one of the text processing steps further includes utilizing a machine learning system that is responsive to the one or more data structures.
  • the data structures are feature vectors and the machine learning system utilizes any one or more of the following techniques:
  • the machine learning system has been trained with reference to a representative sample of email documents in which at least a proportion of the email documents are contemporary.
  • the concept of a "contemporary email document" should be construed as being an email document that was originally authored within the preceding two year period.
  • a preferred embodiment includes a step of processing the text to determine the presence of header text and categorizing any such header text as non-author composed text.
  • This preferred embodiment also includes a step of processing the email document to determine the presence of any attachments and stripping any such attachments from the email document prior to processing the text.
  • Another step taken by this preferred embodiment relates to processing the email document to determine the presence of any forwarded material and stripping any such forwarded material from the email document prior to processing the text.
  • Yet another step taken by the preferred embodiment relates to processing the email document to ascertain whether the email document is in a preferred format and, if the email document is not in the preferred format, converting at least some of the information within the email document to the preferred format.
  • a computer-readable medium containing computer executable code for instructing a computer to perform a method in accordance with the first aspect of the present invention.
  • a computing apparatus having a central processing unit, associated memory and storage devices, and input and output devices, said apparatus being configured to perform a method according to the first aspect of the present invention.
  • Figure 1 is a flow chart illustrating the main processing steps carried out by a preferred embodiment of the invention
  • Figure 2 is a schematic depiction of a typical email document; and Figure 3 is a schematic depiction of a preferred embodiment of a computing apparatus according to the invention.
  • A preferred example of the process flow of the inventive method 1 is depicted in figure 1.
  • the first step 2 of the method 1 is to import an email document 3 to be parsed.
  • A typical email document 3 may include some or all of a number of different sections, as shown schematically in figure 2. These sections may consist of, for example, a link 4 to one or more attachments, a header 5, a body 6, a signature block 7, some automatically appended advertisement materials 8 and/or an embedded reply chain of previous email messages 9. It will be appreciated that the ordering and number of occurrences of these various sections 4 to 9 may vary from that depicted in figure 2. With the exception of the link to an attachment 4, each of the sections 5 to 9 is at least initially coded by the processing computer as a single block of text, with the divisions between the various sections typically being initially unknown to the processing computer. In other words, the header 5, body 6, signature block 7, advertisement 8 and the embedded reply chain 9 are typically all encoded as a single unparsed text field.
  • each email 3 is imported and parsed in real time immediately after receipt or interception.
  • a database of received or intercepted emails is maintained and each email 3 is imported from the database as required, either immediately after receipt, or at some later point in time.
  • an original copy of the email 3 is stored for later reference, and all analysis takes place upon a copy of the original.
  • the computing apparatus is a stand alone computer, whilst in other embodiments the computing apparatus is formed from a networked array of interconnected computers.
  • the preferred embodiment utilizes a computing apparatus 50 as shown in figure 3, which is configured to perform the parsing processing.
  • This computing apparatus includes a computer 51 having a central processing unit (CPU); associated memory, in particular RAM and ROM; and storage devices such as hard drives, writable CD ROMs and flash memory.
  • the computer 51 is also communicatively connected via a wireless network hub 52 to an email server 53, a database server 54 and a laptop computer 56, which functions as a user interface to the networked hardware.
  • The laptop computer 56 provides the user with input devices such as a keyboard 57 and a mouse (not illustrated); and a display in the form of a screen 58.
  • the laptop computer 56 is also communicatively connected via the wireless network hub 52 to an output device in the form of a printer 59.
  • The email server 53 includes an external communications link in the form of a modem. Email messages 3 are received by the email server 53 and relayed via the wireless network hub 52 to the computer 51 for parsing. Depending upon user requirements, a copy of the email 3 may also be stored on the database server 54.
  • the email 3 is processed to determine the presence of any header text 5 (excluding any header text that may be within the embedded reply chain) or attachments 4, including attached email documents, if any.
  • This preprocessing is relatively straightforward for those skilled in the art. It may be thought of as a basic "cleaning up" of the email 3 prior to more sophisticated parsing.
  • the preprocessing step 10 takes place in real time immediately prior to the parsing steps described below. In other embodiments, the preprocessing 10 takes place separately from the remaining steps, for example when a copy of the email 3 is saved on the database server 54 for future parsing.
  • these components of the email 3 are categorized by the computer 51 as non-author composed text.
  • The recordal of such categorization is achieved by inserting annotations into the text, for example by: inserting the tag "<header>" at the commencement of the header 5; and inserting the tag "</header>" at the conclusion of the header 5.
  • Alternative embodiments record the categorization by means other than by inserting annotations into the text.
  • the text that has been categorized is copied into a memory location or bulk storage location that is exclusively reserved for the relevant category of text.
  • the appearance of the categorized text is altered, for example by altering the background or foreground colour or font of the categorized text.
  • the annotations are stored in an annotation repository, along with pointer data indicating the positions within the text of the email 3 to which the annotation is applicable. It will be appreciated that many other means for recording the categorization of text may be devised by those skilled in the art.
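As an illustration of the annotation repository alternative, the following is a minimal sketch in which the email text is left untouched and each annotation is stored with pointer data in the form of character offsets. The class names `Annotation` and `AnnotationRepository`, the offset convention and the sample text are assumptions made for illustration; the source does not specify a storage layout.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One categorisation record pointing into the unmodified email text."""
    label: str  # e.g. "header" or "signature"
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered


@dataclass
class AnnotationRepository:
    """Hypothetical stand-off store: text plus a list of annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, label, start, end):
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("span lies outside the text")
        self.annotations.append(Annotation(label, start, end))

    def spans(self, label):
        """Return the substrings recorded under the given label."""
        return [self.text[a.start:a.end]
                for a in self.annotations if a.label == label]


email_text = "From: a@example.com\n\nHi Bob,\nSee you soon.\n"
repo = AnnotationRepository(email_text)
# Treat everything before the first blank line as header text.
repo.add("header", 0, email_text.index("\n\n"))
```

Because the text itself is never modified, several overlapping categorisations can coexist without interfering with later character-level analysis.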
  • any header text 5, attachments 4 or other forwarded materials are simply stripped from the version of the email 3 that progresses to the further parsing steps.
  • the process flow of the parsing computer 51 moves to the step of normalization 11.
  • This entails processing the email document 3 to ascertain whether it is in a preferred format and, if the email document 3 is not in the preferred format, converting at least some of the information within the email document to the preferred format.
  • The imported emails 3 may be in any one of a variety of character sets and encodings, for example US-ASCII, UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-6, windows-1251, windows-1252 or windows-1256.
  • Occasionally documents may have headers which specify an incorrect encoding (e.g. a UTF-8 document may have a header claiming it is ISO-8859-1). In such cases, a set of heuristics is used to guess at the correct encoding. Once the encoding is known, all text in formats other than UTF-8 is converted to UTF-8 so as to provide a single consistent format for the parsing to follow. Of course, formats other than UTF-8 are used as preferred formats in other embodiments.
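The normalization step can be sketched as follows. The source does not disclose its heuristics, so the ordering used here (try strict UTF-8 first, then the declared encoding, then windows-1252, then ISO-8859-1) and the function name `to_utf8` are assumptions chosen only to illustrate the idea of recovering from a mislabelled header.

```python
from typing import Optional


def to_utf8(raw: bytes, declared: Optional[str] = None) -> bytes:
    """Decode raw bytes with a best-guess encoding and re-encode as UTF-8."""
    # Assumed heuristic: non-ASCII text rarely decodes as strict UTF-8 by
    # accident, so UTF-8 is tried first even if the header says otherwise.
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    candidates.append("windows-1252")
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a header naming an unknown encoding.
            continue
    # Last resort: ISO-8859-1 maps every byte, so decoding cannot fail.
    return raw.decode("iso-8859-1").encode("utf-8")
```

For example, a UTF-8 body whose header claims ISO-8859-1 is passed through unchanged, whereas a genuine ISO-8859-1 body is transcoded.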
  • the process flow of the parsing computer 51 now progresses through several analysis steps, referred to as the segmentation step 12, the linguistic analysis step 13 and the punctuation analysis step 14.
  • the results of these analysis steps 12 to 14 are recorded in suitable memory or storage means accessible to the CPU of the parsing computer 51.
  • In the segmentation step 12, the text of the email 3 is split into paragraphs, and the paragraphs are split into sentences.
  • This segmentation analysis 12 is performed by a publicly available third party tool, known as the General Architecture for Text Engineering (GATE) segmentation tool, which is distributed by the University of Sheffield.
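The effect of the segmentation step can be approximated with a few lines of code. This sketch is not the GATE tool: it substitutes two naive rules (paragraphs are separated by blank lines; sentences end at ".", "!" or "?" followed by whitespace) and the function names are hypothetical. The `<paragraph>` and `<sentence>` tag names follow the annotations used in the running example.

```python
import re


def segment(text):
    """Split text into paragraphs, and each paragraph into sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]


def annotate_segments(text):
    """Return the text with paragraph and sentence annotations inserted."""
    return "".join(
        "<paragraph>"
        + "".join(f"<sentence>{s}</sentence>" for s in sentences)
        + "</paragraph>"
        for sentences in segment(text)
    )
```

A rule this simple will mis-split abbreviations such as "e.g. this", which is one reason a dedicated segmentation tool is preferable in practice.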
  • the preferred embodiment records segmentation using annotations inserted in the text. As applied to the running example, this results in the following annotated email text:
  • the parsing computer 51 performs linguistic analysis of the words in the text at step 13. This analysis includes identification of predefined words and phrases of various types. An exemplary list of some of the types of words and phrases that are identified in this stage of the analysis is set out in table 1.
  • the preferred embodiment has an extensive database of examples of such types of words and phrases, which functions as a lexicon to assist in the identification of such key words and phrases.
  • This data is stored in database server 54.
  • the results of the linguistic analysis are inserted as annotations into the text in the manner described above. As applied to the running example, this results in the following annotated email text (for the sake of clarity only some of the possible annotations are shown here):
  • <paragraph><sentence>If you require any moving services, however, please feel free to browse our website for our movers' information and then call us at <Phone>888.572.9427</Phone> so that we can set up an appointment for an estimate.</sentence></paragraph>
  • <paragraph><sentence>If you have any questions, please don't hesitate to email or call at <Phone>888.572.9427</Phone>.</sentence></paragraph>
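A small part of the linguistic analysis can be sketched with a pattern for phone numbers and a toy greeting lexicon. The regular expression, the three-word lexicon and the function names are assumptions for illustration only; the preferred embodiment relies on an extensive database whose contents are not reproduced in the source.

```python
import re

# Assumed pattern: three groups of digits separated by dots or hyphens,
# as in the running example.
PHONE = re.compile(r"\b\d{3}[.\-]\d{3}[.\-]\d{4}\b")

# Toy stand-in for the lexicon of typical greetings.
GREETINGS = ("hi", "hello", "dear")


def tag_phones(text):
    """Wrap each phone number in <Phone>...</Phone> annotations."""
    return PHONE.sub(lambda m: f"<Phone>{m.group(0)}</Phone>", text)


def starts_with_greeting(line):
    """Report whether the first word of the line is a known greeting."""
    words = line.strip().lower().split()
    return bool(words) and words[0].strip(",!.") in GREETINGS
```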
  • Punctuation analysis takes place at step 14 of the process flow.
  • The parsing computer 51 analyses the text at the character level so as to check for use of sentence punctuation marks and other predefined characters, such as: special markers, e.g. two hyphens "-- " (which often indicate that an email signature follows); the greater-than character ">" (which often indicates the presence of reply lines); quotation marks (which may signal the presence of a quotation); and emoticons (e.g. ":-)", ":o)") (which are typically indicative of either an emotive state of the author, or an emotive state that the author wishes to elicit from the recipient of the email).
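The character-level checks described for the punctuation analysis step might look like the sketch below, which inspects a single line. The dictionary keys and the small emoticon pattern are hypothetical; the source names the kinds of markers but not how they are represented.

```python
import re

# Assumed emoticon pattern covering forms such as ":-)", ";)" and ":o)".
EMOTICON = re.compile(r"[:;]-?[)(DPp]|:o\)")


def punctuation_markers(line):
    """Return the predefined character markers found in one line of text."""
    return {
        # Two hyphens alone on a line often precede a signature.
        "signature_marker": line.rstrip() == "--",
        # A leading ">" often indicates a reply line.
        "reply_marker": line.lstrip().startswith(">"),
        # Quotation marks may signal a quotation.
        "has_quote_mark": '"' in line,
        "emoticons": len(EMOTICON.findall(line)),
    }
```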
  • step 15 in which the analysed email document, including any annotations that have been inserted, is saved into the memory of the computing apparatus, along with any extraneous results of the analysis.
  • Steps 16 and 17 are optional and relate to the anonymisation of the document. This entails stripping some of the text identified in the linguistic analysis step 13, such as the names of people, locations, phone numbers, URLs, and email addresses so as to remove any information that may identify one or more parties associated with the email. This typically entails stripping text from the body 6 of the email 3, and also from any signatures 7 and headers 5. For many applications it is not necessary to anonymise the email text, in which case steps 16 and 17 are omitted and the parsing processing instead proceeds directly from step 15 to step 18.
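A minimal sketch of the optional anonymisation follows. It handles only items that can be found by pattern (email addresses, URLs and phone numbers) and replaces them with placeholders rather than deleting them; both the patterns and the placeholder convention are assumptions. Names of people and locations would additionally need the lexicon-based annotations from step 13.

```python
import re

# Hypothetical patterns, applied in order.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("URL", re.compile(r"https?://\S+")),
    ("PHONE", re.compile(r"\b\d{3}[.\-]\d{3}[.\-]\d{4}\b")),
]


def anonymise(text):
    """Replace identifying strings with a bracketed category placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping a placeholder instead of removing the text outright preserves line and sentence structure, so later feature extraction is not distorted.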
  • a feature is a descriptive statistic calculated from either or both of the raw text and the annotations.
  • a feature might express the ratio of frequencies of two different annotation types (e.g. the ratio of sentence annotations to paragraph annotations), or the presence or absence of an annotation type (e.g. greeting). More particularly, the features can be generally divided into three groupings:
  • Character level features which summarise the analysis of each individual character in the text of the email. Typically the results of the punctuation analysis step 14 provide the majority of these features. Examples include: o proportion of characters that are:
  • Lexical level features which summarise the keywords and phrases, emoticons, multiword prepositional phrases, farewell expressions, greeting expressions, part-of-speech tags, etc. identified during the linguistic analysis step 13. Examples include:
      o frequency and distribution of different parts of speech;
      o word type-token ratio;
      o frequency distribution of specific function words drawn from the keyword database;
      o frequency distribution of multiword prepositions; and
      o proportion of words that are function words.
  • Structural level features which typically refer to the annotations made regarding structural features of the text such as the presence of a signature block, reply status, attachments, headers, etc. Examples include information regarding:
      o indentation of paragraphs;
      o presence of farewells;
      o document length in characters, words, lines, sentences and/or paragraphs; and
      o mean paragraph length in lines, sentences and/or words.
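Two of the features mentioned above, the ratio of sentence annotations to paragraph annotations and the word type-token ratio, can be computed from annotated text as in this sketch. The function name, the dictionary keys and the tokenisation rule are assumptions; they are not the feature names used by the preferred embodiment.

```python
import re


def document_features(annotated):
    """Compute a few descriptive statistics from annotated email text."""
    sentences = annotated.count("<sentence>")
    paragraphs = annotated.count("<paragraph>")
    # Drop the annotations before counting words.
    plain = re.sub(r"<[^>]+>", " ", annotated)
    words = re.findall(r"[A-Za-z']+", plain.lower())
    return {
        "sentence_paragraph_ratio": sentences / paragraphs if paragraphs else 0.0,
        # Distinct word forms (types) divided by running words (tokens).
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "word_count": len(words),
    }
```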
  • Information regarding the categories, descriptions and names of the various features that are calculated for a typical email document 3 in the preferred embodiment is set out in the following table:
  • posVBU: words whose part-of-speech equals VBU (Word_ratio_posVBU_all)
  • posIN: words whose part-of-speech equals IN (Word_ratio_posIN_all)
  • posJJ: words whose part-of-speech equals JJ (Word_ratio_posJJ_all)
  • posRB: words whose part-of-speech equals RB (Word_ratio_posRB_all)
  • posPR: words whose part-of-speech equals PR (Word_ratio_posPR_all)
  • posNNP: words whose part-of-speech equals NNP (Word_ratio_posNNP_all)
  • posPOS: words whose part-of-speech equals POS (Word_ratio_posPOS_all)
  • posMD: words whose part-of-speech equals MD (Word_ratio_posMD_all)
  • caseUpper: words of character case type upper (Word_ratio_caseUpper_all)
  • caseLower: words of character case type lower (Word_ratio_caseLower_all)
  • caseCamel: words of character case type
  • Wordclasses: all wordclasses annotations (Word_ratio_wordClass_all)
  • wordclassesSP: wordclass spelling error (SP) (Word_ratio_wordClassSP_all)
  • wordclassesTP: wordclass typing error (TP) (Word_ratio_wordClassTP_all)
  • wordclassesCF: wordclass creative wordformation (CF) (Word_ratio_wordClassCF_all)
  • wordclassesAB: wordclass abbreviation (AB) (Word_ratio_wordClassAB_all)
  • wordclassesWS: wordclass missing whitespace (WS) (Word_ratio_wordClassWS_all)
  • wordclassesGR: wordclass grammatical error (GR) (Word_ratio_wordClassGR_all)
  • wordclassesFW: wordclass foreign word (FW) (Word_ratio_wordClassFW_all)
  • MultiwordPrepositions: all multiword prepositions (mwp) (MultiwordPreposition_count_all)
  • All annotations of farewell words (Farewell_count_all)
  • farewell0 through farewell186: annotations matching farewell0 through farewell186 (Farewell_count_farewell0, etc.)
  • html: HTML annotations, and annotations concerning the HTML (HTML_count_all)
  • htmlFontAttributeSize-1: HTML font tag with attribute size -1 (HTML_ratio_htmlFontAttributeSize-1_htmlTag)
  • htmlFontAttributeSize+1: HTML font tag with attribute size +1 (HTML_ratio_htmlFontAttributeSize+1_htmlTag)
  • htmlFontAttributeSize-2: HTML font tag with attribute size -2 (HTML_ratio_htmlFontAttributeSize-2_htmlTag)
  • htmlFontAttributeColorLime: HTML font tag with attribute color lime (HTML_ratio_htmlFontAttributeColorLime_htmlTag)
  • htmlFontAttributeColorGreen: HTML font tag with attribute color green (HTML_ratio_htmlFontAttributeColorGreen_htmlTag)
  • htmlFontAttributeColorSilver: HTML font tag with attribute color silver (HTML_ratio_htmlFontAttributeColorSilver_htmlTag)
  • htmlFontAttributeColorFuchsia: HTML font tag with attribute color fuchsia (HTML_ratio_htmlFontAttributeColorFuchsia_htmlTag)
  • htmlFontAttributeColorWhite: HTML font tag with attribute color white (HTML_ratio_htmlFontAttributeColorWhite_htmlTag)
  • htmlFontAttributeColorYellow: HTML font tag with attribute color yellow (HTML_ratio_htmlFontAttributeColorYellow_htmlTag)
  • htmlFontAttributeColorBlack: HTML font tag with attribute color black (HTML_ratio_htmlFontAttributeColorBlack_htmlTag)
  • htmlFontAttributeColorPurple: HTML font tag with attribute color purple (HTML_ratio_htmlFontAttributeColorPurple_htmlTag)
  • htmlFontAttributeColorOlive: HTML font tag with attribute color olive (HTML_ratio_htmlFontAttributeColorOlive_htmlTag)
  • htmlFontAttributeColorRed: HTML font tag with attribute color red (HTML_ratio_htmlFontAttributeColorRed_htmlTag)
  • htmlFontAttributeColorMaroon: HTML font tag with attribute color maroon (HTML_ratio_htmlFontAttributeColorMaroon_htmlTag)
  • htmlFontAttributeColorAqua: HTML font tag with attribute color aqua (HTML_ratio_htmlFontAttributeColorAqua_htmlTag)
  • htmlFontAttributeColorGray: HTML font tag with attribute color gray (HTML_ratio_htmlFontAttributeColorGray_htmlTag)
  • htmlFontAttributeColorBlue: HTML font tag with attribute color blue (HTML_ratio_htmlFontAttributeColorBlue_htmlTag)
  • htmlFontAttributeColorOther: HTML font tag with attribute color other (HTML_ratio_htmlFontAttributeColorOther_htmlTag)
  • htmlFontAttributeFaceArial: HTML font tag with attribute face arial (HTML_ratio_htmlFontAttributeFaceArial_htmlTag)
  • htmlFontAttributeFaceVerdana: HTML font tag with attribute face verdana (HTML_ratio_htmlFontAttributeFaceVerdana_htmlTag)
  • htmlFontAttributeFaceDefault: HTML font tag with attribute face default (HTML_ratio_htmlFontAttributeFaceDefault_htmlTag)
  • htmlTagB: HTML <B> tags (HTML_ratio_htmlTagB_htmlTag)
  • htmlTagI: HTML <I> tags (HTML_ratio_htmlTagI_htmlTag)
  • htmlTagSTRONG: HTML <STRONG> tags (HTML_ratio_htmlTagSTRONG_htmlTa
  • htmlTagU: HTML <U> tags (HTML_ratio_
  • Time_count_all, Time_ratio_all_allWords, Time_meanLengthIn_Char, Time_meanLengthIn_Word
  • time24: time annotations such as 23:15 or 08:15 (Time_ratio_time24_all)
  • timeAmbiguous: time annotations that are ambiguous, e.g. 8:15 (Time_ratio_timeAmbiguous_all)
  • hasDay: date annotations with a day specified (Date_ratio_hasDay_all)
  • the feature Char_count_punc33 is a numeric value equal to the number of times ASCII code 33 (i.e. !) is used in the document being parsed.
  • Some of the other features mentioned in the above list are counts and/or ratios associated with user-defined lexicons of commonly used emoticons, farewells, function words, greetings and multiword prepositions.
  • Each of the feature names is a variable that is set to a numeric value that is calculated for the respective feature. For example, for an email comprised of 488 characters, the feature char_count_all is set to a value of 488.
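The two character-level features just described can be reproduced directly. This sketch counts every ASCII punctuation character and stores each count under a name of the form `Char_count_punc<code>`, following the `Char_count_punc33` example; extending the naming pattern to the other punctuation codes is an assumption. The spelling of `char_count_all` follows the text above.

```python
import string


def char_count_features(text):
    """Map feature names to counts of characters in the document."""
    features = {"char_count_all": len(text)}
    for ch in string.punctuation:
        # ord("!") is 33, so "!" is recorded as Char_count_punc33.
        features[f"Char_count_punc{ord(ch)}"] = text.count(ch)
    return features
```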
  • the features extracted at step 18 are converted into data structures associated with segments of the text.
  • the type of data structure chosen must be suitable for use with the type of machine learning system that will be used in step 20.
  • the preferred embodiment uses feature vectors as the preferred data structure and makes use of the Conditional Random Fields technique in the machine learning system.
  • Each of the feature vectors is associated with a line of the text of the email 3.
  • a feature vector is essentially a list of features that is structured in a predefined manner to function as input for the Conditional Random Field processing that occurs at the next step.
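The idea of one feature vector per line, with the features laid out in a predefined order, can be sketched as below. The four features and their names are hypothetical placeholders; the preferred embodiment draws on the much larger feature set tabulated earlier.

```python
# Fixed ordering so that position i means the same thing in every vector.
FEATURE_ORDER = ("length", "starts_with_gt", "is_sig_marker", "digit_ratio")


def line_features(line):
    """Compute illustrative features for a single line of email text."""
    stripped = line.strip()
    digits = sum(ch.isdigit() for ch in stripped)
    return {
        "length": float(len(stripped)),
        "starts_with_gt": float(stripped.startswith(">")),
        "is_sig_marker": float(stripped == "--"),
        "digit_ratio": digits / len(stripped) if stripped else 0.0,
    }


def feature_vectors(email_text):
    """Pair each line with its feature vector, in FEATURE_ORDER."""
    vectors = []
    for line in email_text.splitlines():
        feats = line_features(line)
        vectors.append((line, [feats[name] for name in FEATURE_ORDER]))
    return vectors
```

A sequence of such (line, vector) pairs is the kind of input a sequence labeller such as a Conditional Random Field consumes, since the label of one line can depend on its neighbours.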
  • The machine learning system uses the Conditional Random Fields technique, receives the feature vectors and associated lines of text as input and is responsive to that input so as to categorise each line of text as broadly falling into one of two categories: author composed text or non-author composed text. More specifically, the category of non-author composed text is divided into five sub-categories as follows: 1. signature text 7;
  • the machine learning categorization step 20 focuses upon identifying the other four sub-categories of non-author composed text.
  • the results are stored in accordance with a storage protocol.
  • the preferred embodiment once again makes use of annotations, as described in detail above, to record the results of the parsing.
  • The identified sub-categories of non-author composed text are denoted by the following tags: <header>, <quote>, <signature>, <reply> and <advert>.
  • The text that does not fall into any of these non-author composed sub-categories is categorized as author composed text and is annotated with the following tag: <AuthorText>.
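Once each line has been assigned a category, the tags listed above can be written back into the text. The sketch below groups consecutive lines with the same category and wraps each group in the corresponding tag; the label strings used as dictionary keys and the one-tag-per-run layout are assumptions, and the labels themselves would come from the machine learning step rather than being supplied by hand.

```python
from itertools import groupby

# Hypothetical label strings mapped to the tag names given in the text.
TAGS = {
    "author": "AuthorText", "header": "header", "quote": "quote",
    "signature": "signature", "reply": "reply", "advert": "advert",
}


def annotate_lines(lines, labels):
    """Wrap each run of identically labelled lines in its category tag."""
    if len(lines) != len(labels):
        raise ValueError("one label is required per line")
    blocks = []
    for label, group in groupby(zip(labels, lines), key=lambda pair: pair[0]):
        tag = TAGS[label]
        body = "\n".join(line for _, line in group)
        blocks.append(f"<{tag}>\n{body}\n</{tag}>")
    return "\n".join(blocks)
```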
  • the annotated text reads as follows:
  • The machine learning system makes use of a predictive model that is established during a training phase, in which the machine learning system receives training data consisting of pairs of feature vectors and line statuses, where the status of a line can be any one of: author composed text 6; automatically appended advertisement text 8; signature text 7; embedded reply chain text 9 or quotation text.
  • the training data is compiled from a representative sample of email documents 3, at least some of which are preferably contemporary.
  • the machine learning system formulates the predictive model that is used in the machine learning categorization of step 20.
  • various other preferred embodiments make use of one or more of the following types of known machine learning techniques, including:
  • the present invention may be embodied in computer software in the form of executable code for instructing a computer to perform the inventive method.
  • The software and its associated data are capable of being stored upon a computer-readable medium in the form of one or more compact disks (CD's).
  • Alternative embodiments make use of other forms of digital storage media, such as Digital Versatile Discs (DVD's), hard drives, flash memory, Erasable Programmable Read-Only Memory (EPROM), and the like.
  • the software and its associated data may be stored as one or more downloadable or remotely executable files that are accessible via a computer communications network such as the internet.
  • the processing of email text undertaken by the preferred embodiment advantageously identifies advertisements and quotations in addition to reply lines, signatures and text written by the author.
  • This parsing may be performed with a comparatively high degree of accuracy. It is achieved with the use of a rich set of linguistic features, such as a database storing a plurality of named entities, common greetings and farewell phrases.
  • the parsing also makes use of a comprehensive set of punctuation features.
  • segmentation analysis provides further useful input to the parsing processing, for example to help avoid incorrectly categorizing half of a sentence as author composed text and the other half of a sentence as a reply line.
  • the preferred embodiment can advantageously function with input email text represented in a variety of formats.
  • alternative preferred embodiments are configurable for use in parsing email text expressed in languages other than English.
  • Provided the machine learning system is regularly re-trained on a contemporary set of training data, the preferred embodiment can effectively keep abreast of newly emergent email writing styles and expressions. This assists in maintaining a comparatively high degree of accuracy as the email writing genre evolves over time.

Abstract

A preferred example of the process flow of the inventive method (1) is depicted in figure 1. The first step (2) of the method (1) is to import an email document (3) to be parsed. In the preprocessing step (10) the email (3) is processed to determine the presence of any header text (5) (excluding any header text that may be within the embedded reply chain) or attachments (4), including attached email documents, if any. Once the header text (5), attachments (4) or other forwarded materials have been identified in the preprocessing step (10), these components of the email (3) are categorized by the computer (51) as non-author composed text. Next the process flow of the parsing computer (51) moves to the step of normalization (11). This entails processing the email document (3) to ascertain whether it is in a preferred format and, if the email document (3) is not in the preferred format, converting at least some of the information within the email document to the preferred format. The parsing computer (51) now progresses through several analysis steps, referred to as the segmentation step (12), the linguistic analysis step (13) and the punctuation analysis step (14). The results of these analysis steps (12) to (14) are recorded in suitable memory or storage means accessible to the CPU of the parsing computer (51). In the segmentation step (12) the text of email (3) is split into paragraphs, and the paragraphs are split into sentences. The linguistic analysis step (13) includes identification of predefined words and phrases of various types. In the punctuation analysis step (14) the parsing computer (51) analyses the text at the character level so as to check for use of sentence punctuation marks and other predefined characters.
At the completion of the analysis steps (12) to (14), the process flow proceeds to step (15), in which the analysed email document, including any annotations that have been inserted, is saved into the memory of the computing apparatus, along with any extraneous results of the analysis. Next a number of features are defined at step (18). Typically, a feature is a descriptive statistic calculated from either or both of the raw text and the annotations. At step (19) the features extracted at step (18) are converted into data structures associated with segments of the text. At step (20) the machine learning system receives the data structures and associated lines of text as input and is responsive to that input so as to categorise each line of text as broadly falling into one of two categories: author composed text or non-author composed text.

Description

EMAIL DOCUMENT PARSING METHOD AND APPARATUS
STATEMENT RE U.S. GOVERNMENT RIGHTS
This invention was made with U.S. Government support under Contract No. W91CRB-06-C-0012 awarded by U.S. Army RDECOM ACQ CTR - W91CRB. The U.S. Government has certain rights in this invention.
FIELD OF THE INVENTION
The present invention relates to a method and apparatus for parsing electronic mail (also known as "email") documents. Embodiments of the present invention find application, though not exclusively, in the field of computational text processing, which is also known in some contexts as natural language processing, human language technology or computational linguistics. The outputs of some preferred embodiments of the invention may be used in a wide range of computing tasks such as automatic email categorization techniques, sentiment analysis, author attribution, and the like.
BACKGROUND OF THE INVENTION
The use of electronic mail, or "email", has become increasingly pervasive throughout the last decade and hence the data contained within email messages may constitute a valuable source of data to some entities, particularly those that either receive or intercept a large volume of email traffic. To assist in extracting and analysing data from emails it is useful in some contexts to focus analysis upon text that has been composed by the author of the email and to disregard other types of text that may be included with typical email documents. It has been appreciated by the inventors of the present invention that the known prior art attempts to automatically parse text from emails can suffer from a number of disadvantages. In particular, the known prior art identifies only a very limited range of types of non-author composed text and utilises fairly unsophisticated processing techniques. Additionally, the known prior art is typically restricted to analysing emails that are composed in the English language and which are expressed in the ASCII character set. Further, at least some of the prior art was developed at a point in time that was prior to the use of email becoming extremely widespread and such prior art is therefore not well adapted to parse the contemporary genre of email expression.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in this specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed in Australia or elsewhere before the priority date of this application.
SUMMARY OF THE INVENTION
It is an object of the present invention to overcome, or substantially ameliorate, one or more of the disadvantages of the prior art, or to provide a useful alternative.
In accordance with a first aspect of the present invention there is provided a computer implemented method of parsing an email document so as to categorize text from the email document as author composed text or non-author composed text, said method including the steps of:
processing the text to determine the presence of signature text and categorizing any such signature text as non-author composed text;
processing the text to determine the presence of automatically appended advertisement text and categorizing any such automatically appended advertisement text as non-author composed text;
processing the text to determine the presence of quotation text and categorizing any such quotation text as non-author composed text;
processing the text to determine the presence of text contained in an embedded reply chain of email messages and categorizing any such text contained in an embedded reply chain of email messages as non-author composed text; and
categorizing at least some of the remaining text as author composed text.
Preferably at least one of the text processing steps includes a linguistic analysis of the words in the text. In one preferred embodiment the linguistic analysis includes identification of predefined words and phrases of any one or more of the following types: people's names, locations, dates, times, organizations, currency, uniform resource locators (URLs), email addresses, addresses, organizational descriptors, phone numbers, typical greetings and/or typical farewells. Such a preferred embodiment typically includes a database of words and phrases of any one or more of the said types. For some applications preferred embodiments of the invention further include the step of anonymising information contained within the text of the email document.
Preferably at least one of the text processing steps includes an analysis of the punctuation used in the text. Also preferably, at least one of the text processing steps includes an analysis of the paragraph and sentence segmentation used in the text.
In a preferred embodiment the results of the linguistic analysis, the punctuation analysis and the paragraph and sentence segmentation are represented by one or more data structures associated with segments of the text. Preferably the segments of the text are lines of the text, although in other embodiments alternative segments are used.
Preferably at least one of the text processing steps further includes utilizing a machine learning system that is responsive to the one or more data structures. In a preferred embodiment the data structures are feature vectors and the machine learning system utilizes any one or more of the following techniques:
Conditional Random Fields;
Support Vector Machines;
Naïve Bayes;
Decision Trees; and/or
Maximum Entropy.
Preferably the machine learning system has been trained with reference to a representative sample of email documents in which at least a proportion of the email documents are contemporary. As used in this document, the concept of a "contemporary email document" should be construed as being an email document that was originally authored within the preceding two year period.
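By way of non-limiting illustration only, the representation of lines of text as feature vectors might be sketched in Python as follows. The function names `line_features` and `document_features`, and the particular features chosen, are assumptions made for the purpose of this sketch; they are not the feature set of the preferred embodiment, and no particular machine learning technique is implied.

```python
from typing import Dict, List


def line_features(line: str) -> Dict[str, float]:
    """Return a small, illustrative feature vector for one line of text."""
    total = len(line)
    stripped = line.strip()
    # Avoid division by zero on empty lines.
    denom = total if total else 1
    return {
        "char_count": float(total),
        "alpha_ratio": sum(c.isalpha() for c in line) / denom,
        "digit_ratio": sum(c.isdigit() for c in line) / denom,
        "space_ratio": sum(c.isspace() for c in line) / denom,
        # A leading ">" is a common cue for quoted reply text.
        "starts_with_gt": 1.0 if stripped.startswith(">") else 0.0,
        "is_blank": 1.0 if not stripped else 0.0,
    }


def document_features(text: str) -> List[Dict[str, float]]:
    """Compute one feature vector per line (the segment type assumed here)."""
    return [line_features(line) for line in text.splitlines()]
```

Each dictionary could then be supplied, together with its line of text, to whichever classifier is selected from the techniques listed above.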
A preferred embodiment includes a step of processing the text to determine the presence of header text and categorizing any such header text as non-author composed text. This preferred embodiment also includes a step of processing the email document to determine the presence of any attachments and stripping any such attachments from the email document prior to processing the text. Another step taken by this preferred embodiment relates to processing the email document to determine the presence of any forwarded material and stripping any such forwarded material from the email document prior to processing the text. Yet another step taken by the preferred embodiment relates to processing the email document to ascertain whether the email document is in a preferred format and, if the email document is not in the preferred format, converting at least some of the information within the email document to the preferred format.
In another aspect of the present invention there is provided a computer-readable medium containing computer executable code for instructing a computer to perform a method in accordance with the first aspect of the present invention.
In yet another aspect of the present invention there is provided a downloadable or remotely executable file or combination of files containing computer executable code for instructing a computer to perform a method in accordance with the first aspect of the present invention.
In a yet further aspect of the present invention there is provided a computing apparatus having a central processing unit, associated memory and storage devices, and input and output devices, said apparatus being configured to perform a method according to the first aspect of the present invention.
The features and advantages of the present invention will become further apparent from the following detailed description of preferred embodiments, provided by way of example only, together with the accompanying drawings.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
Figure 1 is a flow chart illustrating the main processing steps carried out by a preferred embodiment of the invention;
Figure 2 is a schematic depiction of a typical email document; and
Figure 3 is a schematic depiction of a preferred embodiment of a computing apparatus according to the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION
A preferred example of the process flow of the inventive method 1 is depicted in figure 1. The first step 2 of the method 1 is to import an email document 3 to be parsed.
A typical email document 3 may include some or all of a number of different sections, as shown schematically in figure 2. These sections may consist of, for example, a link 4 to one or more attachments, a header 5, a body 6, a signature block 7, some automatically appended advertisement materials 8 and/or an embedded reply chain of previous email messages 9. It will be appreciated that the ordering and number of occurrences of these various sections 4 to 9 may vary from that depicted in figure 2. With the exception of the link to an attachment 4, the sections 5 to 9 are at least initially encoded together by the processing computer as a single block of text, with the divisions between the various sections being typically initially unknown to the processing computer. In other words, the header 5, body 6, signature block 7, advertisement 8 and the embedded reply chain 9 are typically all encoded as a single unparsed text field.
In some embodiments each email 3 is imported and parsed in real time immediately after receipt or interception. In other embodiments, a database of received or intercepted emails is maintained and each email 3 is imported from the database as required, either immediately after receipt, or at some later point in time. In the preferred embodiment, an original copy of the email 3 is stored for later reference, and all analysis takes place upon a copy of the original.
It will be appreciated that the actual hardware platform upon which the invention is implemented will vary depending upon the amount of processing power required. In some embodiments the computing apparatus is a stand alone computer, whilst in other embodiments the computing apparatus is formed from a networked array of interconnected computers.
The preferred embodiment utilizes a computing apparatus 50 as shown in figure 3, which is configured to perform the parsing processing. This computing apparatus includes a computer 51 having a central processing unit (CPU); associated memory, in particular RAM and ROM; and storage devices such as hard drives, writable CD ROMs and flash memory. The computer 51 is also communicatively connected via a wireless network hub 52 to an email server 53, a database server 54 and a laptop computer 56, which functions as a user interface to the networked hardware. The laptop computer 56 provides the user with input devices such as a keyboard 57 and a mouse (not illustrated), and a display in the form of a screen 58. The laptop computer 56 is also communicatively connected via the wireless network hub 52 to an output device in the form of a printer 59. The email server 53 includes an external communications link in the form of a modem. Email messages 3 are received by the email server 53 and relayed via the wireless network hub 52 to the computer 51 for parsing. Depending upon user requirements, a copy of the email 3 may also be stored on the database server 54.
For the sake of a running example, the processing of the following exemplary email document shall be described:
Original Message
From: Commercial Services Sent: Monday, May 08, 2006 3:23 PM To: 'jbloggs@hotmail.com'
Subject: RE: Special Request
Hi Joe,
Thank you for inquiring about our Commercial Services program. Thank you for your recent Commercial Services inquiry. The B&W Commercial Services program can give you one-stop convenience for all of your upkeep and commercial improvement needs, including online change of address and utilities connections with the QC product. Here is the link to access this information: http://commercialservices.bw.com. The vendors are listed by category and their contact information is also available on-line. In order to receive quotes on the services you've requested, it is advised to directly contact that vendor as Commercial Services does not have access to pricing information.
If you require any moving services, however, please feel free to browse our website for our movers' information and then call us at 888.572.9427 so that we can set up an appointment for an estimate.
If you have any questions, please don't hesitate to email or call at 888.572.9427.
Best Regards,
The Commercial Services Team
888.572.9427 commercialservices@bw.com
Original Message
From: jbloggs@hotmail.com [mailto:jbloggs@hotmail.com] Sent: Monday, May 08, 2006 3:13 PM To: Commercial Services Subject: Special Request
BW Commercial Services - Special Request
Submitted
Time: 5/8/2006 4:12:32 PM
Origins
Origin: Our Site Origin 2 : Message from
Name: Joe Bloggs E-mail: jbloggs@hotmail.com Phone: (507) 359-7891 Additional Phone: Contact Method: phone
Contact Time: Evening (5:00 pm - 8:00 pm) Contact ASAP: Yes
Customer responses I'm interested in renting, and I would like:
More information on your Commercial Services program
B&W - Your Favorite Commercial Services Provider Since 1875
In the preprocessing step 10 the email 3 is processed to determine the presence of any header text 5 (excluding any header text that may be within the embedded reply chain) or attachments 4, including attached email documents, if any. This preprocessing is relatively straightforward for those skilled in the art. It may be thought of as a basic "cleaning up" of the email 3 prior to more sophisticated parsing. In some embodiments the preprocessing step 10 takes place in real time immediately prior to the parsing steps described below. In other embodiments, the preprocessing 10 takes place separately from the remaining steps, for example when a copy of the email 3 is saved on the database server 54 for future parsing.
Once the header text 5, attachments 4 or other forwarded materials have been identified in the preprocessing step 10, these components of the email 3 are categorized by the computer 51 as non-author composed text. In the preferred embodiment the recordal of such categorization is achieved by inserting annotations into the text, for example by: inserting the tag "<header>" at the commencement of the header 5; and inserting the tag "</header>" at the conclusion of the header 5.
As applied to the running example, this results in the following annotated header text 5:
<header> Original Message
From: Commercial Services Sent: Monday, May 08, 2006 3:23 PM To: 'jbloggs@hotmail.com' Subject: RE: Special Request</header>
Alternative embodiments record the categorization by means other than by inserting annotations into the text. In one such embodiment, the text that has been categorized is copied into a memory location or bulk storage location that is exclusively reserved for the relevant category of text. In yet another embodiment the appearance of the categorized text is altered, for example by altering the background or foreground colour or font of the categorized text. In a further embodiment the annotations are stored in an annotation repository, along with pointer data indicating the positions within the text of the email 3 to which the annotation is applicable. It will be appreciated that many other means for recording the categorization of text may be devised by those skilled in the art. In further alternative embodiments, any header text 5, attachments 4 or other forwarded materials are simply stripped from the version of the email 3 that progresses to the further parsing steps.
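By way of non-limiting illustration only, the relationship between inline annotations and an annotation repository holding pointer data might be sketched in Python as follows. The class name `Annotation`, the function name `to_inline` and the use of character offsets as pointer data are assumptions made for the purpose of this sketch, and the sketch handles only non-overlapping annotations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """A stand-off annotation: a label plus character offsets into the text."""
    label: str
    start: int  # inclusive offset
    end: int    # exclusive offset


def to_inline(text: str, annotations: List[Annotation]) -> str:
    """Render non-overlapping stand-off annotations as inline tags."""
    out = []
    pos = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        out.append(text[pos:ann.start])  # untouched text before the span
        out.append(f"<{ann.label}>{text[ann.start:ann.end]}</{ann.label}>")
        pos = ann.end
    out.append(text[pos:])  # untouched text after the last span
    return "".join(out)
```

Keeping the annotations separate from the text in this way leaves the original email text unmodified, while still allowing an inline-tagged view to be generated when required.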
Subsequent to preprocessing 10, the process flow of the parsing computer 51 moves to the step of normalization 11. This entails processing the email document 3 to ascertain whether it is in a preferred format and, if the email document 3 is not in the preferred format, converting at least some of the information within the email document to the preferred format. More particularly, the imported emails 3 may be in any one of a variety of character sets and encodings, for example US-ASCII, UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-6, windows-1251, windows-1252 or windows-1256.
Occasionally documents may have headers which specify an incorrect encoding (e.g. a UTF-8 document may have a header claiming it is ISO-8859-1). In such cases, a set of heuristics is used to guess at the correct encoding. Once the encoding is known, all text in formats other than UTF-8 is converted to UTF-8 so as to provide a single consistent format for the parsing to follow. Of course, formats other than UTF-8 are used as preferred formats in other embodiments.
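By way of non-limiting illustration only, one possible set of encoding heuristics might be sketched in Python as follows. The function names and the particular order of the heuristics are assumptions made for the purpose of this sketch; the heuristics actually used by the preferred embodiment are not specified here. It should be noted that some byte sequences in other encodings may happen to be valid UTF-8, so the first heuristic can occasionally guess wrongly.

```python
from typing import Optional


def decode_email_bytes(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw email bytes to a string using simple heuristics."""
    # Heuristic 1: bytes that are valid UTF-8 are treated as UTF-8,
    # even if the header declares something else.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Heuristic 2: fall back to the declared encoding, if it is usable.
    if declared:
        try:
            return raw.decode(declared)
        except (UnicodeDecodeError, LookupError):
            pass
    # Heuristic 3: last resort; ISO-8859-1 maps every byte to a character.
    return raw.decode("iso-8859-1")


def normalize_to_utf8(raw: bytes, declared: Optional[str] = None) -> bytes:
    """Return the message text re-encoded in the preferred format, UTF-8."""
    return decode_email_bytes(raw, declared).encode("utf-8")
```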
The process flow of the parsing computer 51 now progresses through several analysis steps, referred to as the segmentation step 12, the linguistic analysis step 13 and the punctuation analysis step 14. The results of these analysis steps 12 to 14 are recorded in suitable memory or storage means accessible to the CPU of the parsing computer 51.
In the segmentation step 12 the text of email 3 is split into paragraphs, and the paragraphs are split into sentences. In the preferred embodiment this segmentation analysis 12 is performed by a publicly available third party tool, known as the General Architecture for Text Engineering (GATE) segmentation tool, which is distributed by The University of Sheffield. Other third party segmentation tools, such as those provided by Stanford University, may also be utilised.
The preferred embodiment records segmentation using annotations inserted in the text. As applied to the running example, this results in the following annotated email text:
<header> Original Message
From: Commercial Services Sent: Monday, May 08, 2006 3:23 PM To: 'jbloggs@hotmail.com'
Subject: RE: Special Request</header>
<paragraph>Hi Joe,</paragraph>
<paragraph><sentence>Thank you for inquiring about our Commercial Services program.</sentence><sentence>Thank you for your recent Commercial Services inquiry.</sentence><sentence>The B&W Commercial Services program can give you one-stop convenience for all of your upkeep and commercial improvement needs, including online change of address and utilities connections with the QC product.</sentence><sentence>Here is the link to access this information: http://commercialservices.bw.com.</sentence><sentence>The vendors are listed by category and their contact information is also available on-line.</sentence><sentence>In order to receive quotes on the services you've requested, it is advised to directly contact that vendor as Commercial Services does not have access to pricing information.</sentence></paragraph>
<paragraph><sentence>If you require any moving services, however, please feel free to browse our website for our movers' information and then call us at 888.572.9427 so that we can set up an appointment for an estimate.</sentence></paragraph>
<paragraph><sentence>If you have any questions, please don't hesitate to email or call at 888.572.9427.</sentence></paragraph>
<paragraph>Best Regards,
The Commercial Services Team
888.572.9427 commercialservices@bw.com</paragraph>
<paragraph> Original Message
From: jbloggs@hotmail.com [mailto:jbloggs@hotmail.com]
Sent: Monday, May 08, 2006 3:13 PM
To: Commercial Services
Subject: Special Request</paragraph>
<paragraph>BW Commercial Services - Special
Request</paragraph>
<paragraph> Submitted
Time: 5/8/2006 4:12:32 PM</paragraph>
<paragraph>Origins Origin: Our Site
Origin 2 : </paragraph>
<paragraph>Message from
Name: Joe Bloggs E-mail: jbloggs@hotmail.com Phone: (507) 359-7891 Additional Phone: Contact Method: phone Contact Time: Evening (5:00 pm - 8:00 pm) Contact ASAP: Yes </paragraph>
<paragraph>Customer responses
<sentence>I'm interested in renting, and I would like:</sentence>
<sentence>More information on your Commercial Services program</sentence></paragraph>
<paragraph>B&W - Your Favorite Commercial Services Provider Since 1875</paragraph>
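By way of non-limiting illustration only, a much simplified segmentation of text into paragraphs and sentences might be sketched in Python as follows. This sketch is not the GATE segmentation tool; the regular expressions and the function names `segment` and `annotate` are assumptions made for the purpose of the sketch. Unlike the annotated example above, the sketch wraps every piece of text in a sentence tag, including fragments such as greetings.

```python
import re
from typing import List

# Paragraphs are assumed to be separated by one or more blank lines.
_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
# A sentence is assumed to end at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def segment(text: str) -> List[List[str]]:
    """Split text into paragraphs, each a list of sentence strings."""
    paragraphs = []
    for block in _PARAGRAPH_BREAK.split(text.strip()):
        block = " ".join(block.split())  # collapse line breaks and space runs
        if block:
            paragraphs.append(_SENTENCE_BREAK.split(block))
    return paragraphs


def annotate(paragraphs: List[List[str]]) -> str:
    """Render the segmentation with <paragraph> and <sentence> tags."""
    return "\n".join(
        "<paragraph>"
        + "".join(f"<sentence>{s}</sentence>" for s in sentences)
        + "</paragraph>"
        for sentences in paragraphs
    )
```

Because a full stop must be followed by white space to end a sentence in this sketch, a phone number such as 888.572.9427 is not split.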
Following segmentation analysis, the parsing computer 51 performs linguistic analysis of the words in the text at step 13. This analysis includes identification of predefined words and phrases of various types. An exemplary list of some of the types of words and phrases that are identified in this stage of the analysis is set out in table 1.
[Table 1: exemplary types of words and phrases identified during linguistic analysis; the table image is not reproduced here]
The preferred embodiment has an extensive database of examples of such types of words and phrases, which functions as a lexicon to assist in the identification of such key words and phrases. This data is stored in database server 54. In the preferred embodiment the results of the linguistic analysis are inserted as annotations into the text in the manner described above. As applied to the running example, this results in the following annotated email text (for the sake of clarity only some of the possible annotations are shown here):
<header> Original Message
From: <Organization>Commercial Services</Organization> Sent: <Date>Monday, May 08, 2006</Date> <Time>3:23 PM</Time> To: '<Email>jbloggs@hotmail.com</Email>' Subject: RE: Special Request</header>
<paragraph>Hi <Person>Joe</Person>,</paragraph>
<paragraph><sentence>Thank you for inquiring about our <Organization>Commercial Services</Organization> program.</sentence><sentence>Thank you for your recent <Organization>Commercial Services</Organization> inquiry.</sentence><sentence>The <Organization>B&W Commercial Services</Organization> program can give you one-stop convenience for all of your upkeep and commercial improvement needs, including online change of address and utilities connections with the QC product.</sentence><sentence>Here is the link to access this information: <Url>http://commercialservices.bw.com</Url>.</sentence><sentence>The vendors are listed by category and their contact information is also available on-line.</sentence><sentence>In order to receive quotes on the services you've requested, it is advised to directly contact that vendor as <Organization>Commercial Services</Organization> does not have access to pricing information.</sentence></paragraph>
<paragraph><sentence>If you require any moving services, however, please feel free to browse our website for our movers' information and then call us at <Phone>888.572.9427</Phone> so that we can set up an appointment for an estimate.</sentence></paragraph>
<paragraph><sentence>If you have any questions, please don't hesitate to email or call at <Phone>888.572.9427</Phone>.</sentence></paragraph>
<paragraph>Best Regards,
The <Organization>Commercial Services</Organization> Team
<Phone>888.572.9427</Phone> <Email>commercialservices@bw.com</Email></paragraph>
<paragraph> Original Message
From: <Email>jbloggs@hotmail.com</Email> [mailto:<Email>jbloggs@hotmail.com</Email>]
Sent: <Date>Monday, May 08, 2006</Date> <Time>3:13 PM</Time>
To: <Organization>Commercial Services</Organization>
Subject: Special Request</paragraph>
<paragraph><Organization>BW Commercial Services</Organization> - Special Request</paragraph>
<paragraph> Submitted
Time: <Date>5/8/2006</Date> <Time>4:12:32 PM</Time></paragraph>
<paragraph>Origins
Origin: Our Site
Origin 2 : </paragraph>
<paragraph>Message from
Name: <Person>Joe Bloggs</Person> E-mail: <Email>jbloggs@hotmail.com</Email> Phone: <Phone>(507) 359-7891</Phone> Additional Phone: Contact Method: phone
Contact Time: Evening (<Time>5:00 pm</Time> - <Time>8:00 pm</Time>)
Contact ASAP: Yes </paragraph>
<paragraph>Customer responses
<sentence>I'm interested in renting, and I would like:</sentence>
<sentence>More information on your <Organization>Commercial Services</Organization> program</sentence></paragraph>
<paragraph><Organization>B&W</Organization> - Your Favorite <Organization>Commercial Services</Organization> Provider Since 1875</paragraph>
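By way of non-limiting illustration only, the insertion of annotations for a few pattern-like types (URLs, email addresses and phone numbers) might be sketched in Python as follows. The regular expressions and the function name `annotate_entities` are assumptions made for the purpose of this sketch and are deliberately incomplete. Types such as names of people, organizations and locations would instead rely on the lexicon held in the database, which is not shown.

```python
import re

# Illustrative patterns only; each alternative is a named group whose
# name doubles as the annotation tag.
_ENTITY = re.compile(
    r"(?P<Url>https?://[^\s<>]+[^\s<>.,;:!?])"
    r"|(?P<Email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<Phone>\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4})"
)


def annotate_entities(text: str) -> str:
    """Wrap URLs, email addresses and phone numbers in inline tags."""
    def tag(match):
        label = match.lastgroup  # name of the alternative that matched
        return f"<{label}>{match.group(0)}</{label}>"
    return _ENTITY.sub(tag, text)
```

The URL pattern excludes trailing sentence punctuation, so that the full stop ending a sentence is left outside the annotation, as in the annotated example above.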
Punctuation analysis takes place at step 14 of the process flow. In this step the parsing computer 51 analyses the text at the character level so as to check for use of sentence punctuation marks and other predefined characters, such as:
• special markers, e.g. two hyphens "-- " (which often indicate that an email signature follows);
• the greater-than character ">" (which often indicates the presence of reply lines);
• quotation marks (which may signal the presence of a quotation); and
• emoticons, e.g. ":-)" and ":o)" (which are typically indicative of either an emotive state of the author, or an emotive state that the author wishes to elicit from the recipient of the email).
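By way of non-limiting illustration only, a character level check of a single line for the cues discussed above might be sketched in Python as follows. The function name `punctuation_flags`, the flag names and the emoticon pattern are assumptions made for the purpose of this sketch, and the emoticon pattern in particular is incomplete.

```python
import re
from typing import Dict

# Emoticons such as :-) ;-) :o) :( ; an illustrative, incomplete pattern.
_EMOTICON = re.compile(r"[:;][-o]?[)(DP]")


def punctuation_flags(line: str) -> Dict[str, bool]:
    """Report which character level cues occur in a single line."""
    return {
        # two hyphens, optionally followed by spaces, alone on a line
        "signature_marker": line.rstrip() == "--",
        # a leading ">" often marks a line quoted from an earlier message
        "reply_marker": line.lstrip().startswith(">"),
        "quotation_mark": '"' in line,
        "emoticon": _EMOTICON.search(line) is not None,
        "sentence_end": line.rstrip().endswith((".", "!", "?")),
    }
```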
At the completion of the analysis steps 12 to 14, the process flow proceeds to step 15, in which the analysed email document, including any annotations that have been inserted, is saved into the memory of the computing apparatus, along with any extraneous results of the analysis.
Steps 16 and 17 are optional and relate to the anonymisation of the document. This entails stripping some of the text identified in the linguistic analysis step 13, such as the names of people, locations, phone numbers, URLs, and email addresses so as to remove any information that may identify one or more parties associated with the email. This typically entails stripping text from the body 6 of the email 3, and also from any signatures 7 and headers 5. For many applications it is not necessary to anonymise the email text, in which case steps 16 and 17 are omitted and the parsing processing instead proceeds directly from step 15 to step 18.
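By way of non-limiting illustration only, the stripping of identifying text from an inline-annotated document might be sketched in Python as follows. The function name `anonymise`, the default list of annotation types and the choice to retain the emptied tags (so that the presence and type of the stripped item remain visible) are assumptions made for the purpose of this sketch.

```python
import re
from typing import Iterable

# Annotation types assumed, for this sketch, to carry identifying text.
DEFAULT_TYPES = ("Person", "Location", "Phone", "Url", "Email")


def anonymise(annotated: str, types: Iterable[str] = DEFAULT_TYPES) -> str:
    """Strip the content of selected inline annotations, keeping the tags."""
    for label in types:
        pattern = re.compile(
            rf"<{re.escape(label)}>.*?</{re.escape(label)}>", re.DOTALL
        )
        annotated = pattern.sub(f"<{label}></{label}>", annotated)
    return annotated
```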
To summarise the results of the processing that has occurred to this point a number of features are defined at step 18. Typically, a feature is a descriptive statistic calculated from either or both of the raw text and the annotations. For example, a feature might express the ratio of frequencies of two different annotation types (e.g. the ratio of sentence annotations to paragraph annotations), or the presence or absence of an annotation type (e.g. greeting). More particularly, the features can be generally divided into three groupings:
• Character level features - which summarise the analysis of each individual character in the text of the email. Typically the results of the punctuation analysis step 14 provide the majority of these features. Examples include:
  o proportion of characters that are:
    ■ alphabetic,
    ■ numeric,
    ■ white space,
    ■ punctuation, and
    ■ special symbols;
  o proportion of words with less than four characters; and
  o mean word length.
• Lexical level features - which summarise the keywords and phrases, emoticons, multiword prepositional phrases, farewell expressions, greeting expressions, part-of-speech tags, etc. identified during the linguistic analysis step 13. Examples include:
  o frequency and distribution of different parts of speech;
  o word type-token ratio;
  o frequency distribution of specific function words drawn from the keyword database;
  o frequency distribution of multiword prepositions; and
  o proportion of words that are function words.
• Structural level features - which typically refer to the annotations made regarding structural features of the text such as the presence of a signature block, reply status, attachments, headers, etc. Examples include information regarding:
  o indentation of paragraphs;
  o presence of farewells;
  o document length in characters, words, lines, sentences and/or paragraphs; and
  o mean paragraph length in lines, sentences and/or words.
Information regarding the categories, descriptions and names of the various features that are calculated for a typical email document 3 in the preferred embodiment is set out in the following table:
Feature Category | Feature Description | Feature Name
CHARACTERS
- | All chars | Char_count_all, Char_ratio_inWord_all
alpha | Alpha chars | Char_ratio_alpha_all
upperCase | Upper case chars | Char_ratio_upperCase_all, Char_ratio_upperCase_alpha
lowerCase | Lower case chars | -
digit | Digit chars | Char_ratio_digit_all
whiteSpace | White spaces | Char_ratio_space_whiteSpace, Char_ratio_whiteSpace_all
space | Spaces | Char_ratio_space_all
tab | Tabs | Char_count_tab, Char_ratio_tab_all, Char_ratio_tab_whiteSpace
punctuation | Punctuation | Char_count_punctuation, Char_ratio_punctuation_all
alphabeticA through alphabeticZ | character A, etc. | Char_count_alphabeticA, etc.
punc44 | punctuation character , | Char_count_punc44
punc46 | punctuation character . | Char_count_punc46
punc63 | punctuation character ? | Char_count_punc63
punc33 | punctuation character ! | Char_count_punc33
punc58 | punctuation character : | Char_count_punc58
punc59 | punctuation character ; | Char_count_punc59
punc39 | punctuation character ' | Char_count_punc39
punc34 | punctuation character " | Char_count_punc34
specialChar126 | special character ~ | Char_count_specialChar126
specialChar64 | special character @ | Char_count_specialChar64
specialChar35 | special character # | Char_count_specialChar35
specialChar36 | special character $ | Char_count_specialChar36
specialChar37 | special character % | Char_count_specialChar37
specialChar94 | special character ^ | Char_count_specialChar94
specialChar38 | special character & | Char_count_specialChar38
specialChar42 | special character * | Char_count_specialChar42
specialChar45 | special character - | Char_count_specialChar45
specialChar95 | special character _ | Char_count_specialChar95
specialChar61 | special character = | Char_count_specialChar61
specialChar43 | special character + | Char_count_specialChar43
specialChar60 | special character < | Char_count_specialChar60
specialChar62 | special character > | Char_count_specialChar62
specialChar91 | special character [ | Char_count_specialChar91
specialChar93 | special character ] | Char_count_specialChar93
specialChar123 | special character { | Char_count_specialChar123
specialChar125 | special character } | Char_count_specialChar125
specialChar92 | special character \ | Char_count_specialChar92
specialChar47 | special character / | Char_count_specialChar47
specialChar124 | special character (vertical bar) | Char_count_specialChar124
WORDS
Word All word Tokens Word_count_all
Word_meanLengthIn_Char
Word_ratio_wordType_all
Short words of length less than 4 shortWord characters Word_ratio_shortWord_all
Function words from predefined functionWord lexicon such as: up, to Word ratio functionWord all
Intermediate entities consisting of wordLength entities having various word lengths Word_ratio_wordLenl_all, etc.
1-30 characters
Intermediate entities consisting of posTag entities of various part-of-speech Word_ratio_posTag_all types posNN Words its part-of-speech equal NN Word_ratio_posNN_all posVBT Words its part-of-speech equal VBT Word_ratio_posVBT_all
Words its part-of-speech equal posVBU Word_ratio_posVBU_all VBU posIN Words its part-of-speech equal IN Word_ratio_posIN_all posJJ Words its part-of-speech equal JJ Word_ratio_posJJ_all posRB Words its part-of-speech equal RB Word_ratio_posRB_all posPR Words its part-of-speech equal PR Word_ratio_posPR_all posNNP Words its part-of-speech equal NNP Word_ratio_posNNP_all posPOS Words its part-of-speech equal POS Word_ratio_posPOS_all posMD Words its part-of-speech equal MD Word_ratio_posMD_all caseUpper Words of character case type upper Word_ratio_caseUpper_all caseLower Words of character case type lower Word_ratio_caseLower_all caseCamel Words of character case type camel Word_ratio_caseCamel_all
Words of character case type caseFirstUpper firstUpper Word_ratio_caseFirstUpper_all
Words of character case type caseSlowShiftRelease slowShiftRelease Word_ratio_caseSlowShiftRelease_all
Words of character case type caseSingletonUpper singletonUpper Word_ratio_caseSingletonUpper_all
CorrelateEducated Words correlating with author trait Educated Word_ratio_CorrelateEducated_all
CorrelateFemale Words correlating with author trait Female Word_ratio_CorrelateFemale_all
CorrelateHighAgreeableness Words correlating with author trait HighAgreeableness Word_ratio_CorrelateHighAgreeableness_all
CorrelateHighConscientiousness Words correlating with author trait HighConscientiousness Word_ratio_CorrelateHighConscientiousness_all
CorrelateHighExtraversion Words correlating with author trait HighExtraversion Word_ratio_CorrelateHighExtraversion_all
CorrelateHighNeuroticism Words correlating with author trait HighNeuroticism Word_ratio_CorrelateHighNeuroticism_all
CorrelateHighOpenness Words correlating with author trait HighOpenness Word_ratio_CorrelateHighOpenness_all
CorrelateLowAgreeableness Words correlating with author trait LowAgreeableness Word_ratio_CorrelateLowAgreeableness_all
CorrelateLowConscientiousness Words correlating with author trait LowConscientiousness Word_ratio_CorrelateLowConscientiousness_all
CorrelateLowExtraversion Words correlating with author trait LowExtraversion Word_ratio_CorrelateLowExtraversion_all
CorrelateLowNeuroticism Words correlating with author trait LowNeuroticism Word_ratio_CorrelateLowNeuroticism_all
CorrelateLowOpenness Words correlating with author trait LowOpenness Word_ratio_CorrelateLowOpenness_all
CorrelateMale Words correlating with author trait Male Word_ratio_CorrelateMale_all
CorrelateNonUS Words correlating with author trait NonUS Word_ratio_CorrelateNonUS_all
CorrelateOld Words correlating with author trait Old Word_ratio_CorrelateOld_all
CorrelateUneducated Words correlating with author trait Uneducated Word_ratio_CorrelateUneducated_all
CorrelateUS Words correlating with author trait US Word_ratio_CorrelateUS_all
CorrelateYoung Words correlating with author trait Young Word_ratio_CorrelateYoung_all
Wordclasses all wordclasses annotations Word_ratio_wordClass_all
wordclassesSP wordclass spelling error (SP) Word_ratio_wordClassSP_all
wordclassesTP wordclass typing error (TP) Word_ratio_wordClassTP_all
wordclassesCF wordclass creative wordformation (CF) Word_ratio_wordClassCF_all
wordclassesAB wordclass abbreviation (AB) Word_ratio_wordClassAB_all
wordclassesWS wordclass missing whitespace (WS) Word_ratio_wordClassWS_all
wordclassesGR wordclass grammatical error (GR) Word_ratio_wordClassGR_all
wordclassesFW wordclass foreign word (FW) Word_ratio_wordClassFW_all
MULTIWORD PREPOSITIONS
MultiwordPrepositions All multiword prepositions (mwp) MultiwordPreposition_count_all
MultiwordPreposition_ratio_all_allWords
MultiwordPreposition_meanLengthIn_Word
MultiwordPreposition_meanLengthIn_Char
mwp0 through mwp19 mwp's from predefined lexicon MultiwordPreposition_ratio_mwp1_all
FUNCTION WORDS
FunctionWord All annotations of function words FunctionWord_count_all
function0 through function149 Annotations matching function word lexicon FunctionWord_ratio_function0_all, etc.
GREETINGS
Greeting All annotations of greeting words Greeting_count_all
greeting0 through greeting86 Annotations matching greeting lexicon Greeting_count_greeting0, etc.
FAREWELLS
Farewell All annotations of farewell words Farewell_count_all
farewell0 through farewell186 Annotations matching farewell lexicon Farewell_count_farewell0, etc.
EMOTICONS
Emoticon All annotations representing emoticon symbols Emoticon_count_all
emoticon0 through emoticon70 Annotations matching emoticon lexicon Emoticon_count_emoticon0, etc.
LINES
Line All line strings Line_count_all
Line_meanLengthIn_Char
blank Blank lines Line_ratio_blank_all
SENTENCES
Sentence All sentence annotations Sentence_count_all
Sentence_meanLengthIn_Char
Sentence_meanLengthIn_Word
PARAGRAPHS
Paragraph All paragraph annotations Paragraph_count_all
Paragraph_meanLengthIn_Char
Paragraph_meanLengthIn_Word
Paragraph_meanLengthIn_Sentence
indented Paragraphs with the first line indented Paragraph_ratio_indented_all
HTML
html HTML annotations, and annotations concerning the HTML HTML_count_all
HTML_ratio_all_allWords
HTML_meanLengthIn_Char
HTML_meanLengthIn_Word
htmlTag Intermediate entities consisting of entities of various HTML tags HTML_ratio_htmlTag_all
htmlFontAttributeSize1 through Size7 HTML font tag with attribute size = 1, etc. HTML_ratio_htmlFontAttributeSize1_htmlTag, etc.
htmlFontAttributeSize-1 HTML font tag with attribute size = -1 HTML_ratio_htmlFontAttributeSize-1_htmlTag
htmlFontAttributeSize+1 HTML font tag with attribute size = +1 HTML_ratio_htmlFontAttributeSize+1_htmlTag
htmlFontAttributeSize-2 HTML font tag with attribute size = -2 HTML_ratio_htmlFontAttributeSize-2_htmlTag
htmlFontAttributeColorNavy HTML font tag with attribute color = navy HTML_ratio_htmlFontAttributeColorNavy_htmlTag
htmlFontAttributeColorTeal HTML font tag with attribute color = teal HTML_ratio_htmlFontAttributeColorTeal_htmlTag
htmlFontAttributeColorLime HTML font tag with attribute color = lime HTML_ratio_htmlFontAttributeColorLime_htmlTag
htmlFontAttributeColorGreen HTML font tag with attribute color = green HTML_ratio_htmlFontAttributeColorGreen_htmlTag
htmlFontAttributeColorSilver HTML font tag with attribute color = silver HTML_ratio_htmlFontAttributeColorSilver_htmlTag
htmlFontAttributeColorFuchsia HTML font tag with attribute color = fuchsia HTML_ratio_htmlFontAttributeColorFuchsia_htmlTag
htmlFontAttributeColorWhite HTML font tag with attribute color = white HTML_ratio_htmlFontAttributeColorWhite_htmlTag
htmlFontAttributeColorYellow HTML font tag with attribute color = yellow HTML_ratio_htmlFontAttributeColorYellow_htmlTag
htmlFontAttributeColorBlack HTML font tag with attribute color = black HTML_ratio_htmlFontAttributeColorBlack_htmlTag
htmlFontAttributeColorPurple HTML font tag with attribute color = purple HTML_ratio_htmlFontAttributeColorPurple_htmlTag
htmlFontAttributeColorOlive HTML font tag with attribute color = olive HTML_ratio_htmlFontAttributeColorOlive_htmlTag
htmlFontAttributeColorRed HTML font tag with attribute color = red HTML_ratio_htmlFontAttributeColorRed_htmlTag
htmlFontAttributeColorMaroon HTML font tag with attribute color = maroon HTML_ratio_htmlFontAttributeColorMaroon_htmlTag
htmlFontAttributeColorAqua HTML font tag with attribute color = aqua HTML_ratio_htmlFontAttributeColorAqua_htmlTag
htmlFontAttributeColorGray HTML font tag with attribute color = gray HTML_ratio_htmlFontAttributeColorGray_htmlTag
htmlFontAttributeColorBlue HTML font tag with attribute color = blue HTML_ratio_htmlFontAttributeColorBlue_htmlTag
htmlFontAttributeColorOther HTML font tag with attribute color = other HTML_ratio_htmlFontAttributeColorOther_htmlTag
htmlFontAttributeFaceArial HTML font tag with attribute face = arial HTML_ratio_htmlFontAttributeFaceArial_htmlTag
htmlFontAttributeFaceVerdana HTML font tag with attribute face = verdana HTML_ratio_htmlFontAttributeFaceVerdana_htmlTag
htmlFontAttributeFaceTahoma HTML font tag with attribute face = tahoma HTML_ratio_htmlFontAttributeFaceTahoma_htmlTag
htmlFontAttributeFaceGaramond HTML font tag with attribute face = garamond HTML_ratio_htmlFontAttributeFaceGaramond_htmlTag
htmlFontAttributeFaceGeorgia HTML font tag with attribute face = georgia HTML_ratio_htmlFontAttributeFaceGeorgia_htmlTag
htmlFontAttributeFaceWingdings HTML font tag with attribute face = wingdings HTML_ratio_htmlFontAttributeFaceWingdings_htmlTag
htmlFontAttributeFacePapyrus HTML font tag with attribute face = papyrus HTML_ratio_htmlFontAttributeFacePapyrus_htmlTag
htmlFontAttributeFaceDefault HTML font tag with attribute face = default HTML_ratio_htmlFontAttributeFaceDefault_htmlTag
htmlTagB HTML <B> tags HTML_ratio_htmlTagB_htmlTag
htmlTagI HTML <I> tags HTML_ratio_htmlTagI_htmlTag
htmlTagSTRONG HTML <STRONG> tags HTML_ratio_htmlTagSTRONG_htmlTag
htmlTagU HTML <U> tags HTML_ratio_htmlTagU_htmlTag
htmlTagTT HTML <TT> tags HTML_ratio_htmlTagTT_htmlTag
htmlTagSMALL HTML <SMALL> tags HTML_ratio_htmlTagSMALL_htmlTag
htmlTagBIG HTML <BIG> tags HTML_ratio_htmlTagBIG_htmlTag
htmlTagEM HTML <EM> tags HTML_ratio_htmlTagEM_htmlTag
htmlTagTABLE HTML <TABLE> tags HTML_ratio_htmlTagTABLE_htmlTag
htmlTagTR HTML <TR> tags HTML_ratio_htmlTagTR_htmlTag
htmlTagTD HTML <TD> tags HTML_ratio_htmlTagTD_htmlTag
htmlTagHR HTML <HR> tags HTML_ratio_htmlTagHR_htmlTag
htmlTagCENTER HTML <CENTER> tags HTML_ratio_htmlTagCENTER_htmlTag
htmlTagLI HTML <LI> tags HTML_ratio_htmlTagLI_htmlTag
htmlTagUL HTML <UL> tags HTML_ratio_htmlTagUL_htmlTag
AUTHOR_TEXT
AuthorText All author text annotations AuthorText_count_all
REPLY
Reply All reply annotations Reply_count_all
SIGNATURE
Signature All signature annotations Signature_count_all
PERSONAL
personal all category personal annotations personal_count_all
PROFESSIONAL
professional all category professional annotations professional_count_all
BUSINESS
business all category business annotations business_count_all
TIME
Time All Time annotations Time_count_all
Time_ratio_all_allWords
Time_meanLengthIn_Char
Time_meanLengthIn_Word
time24 Time annotations such as 23:15 or 08:15 Time_ratio_time24_all
timeAMPM Time annotations having am or pm tokens e.g. 8:15 am Time_ratio_timeAMPM_all
timeOClock Time annotations such as 5 o'clock Time_ratio_timeOClock_all
timeAmbiguous Time annotations that are ambiguous e.g. 8:15 Time_ratio_timeAmbiguous_all
MONEY
Money All Money annotations Money_count_all
Money_ratio_all_allWords
Money_meanLengthIn_Char
Money_meanLengthIn_Word
hasDollarSign Money annotations having a dollar sign e.g. $5.0 Money_ratio_hasDollarSign_all
PERSON
Person All Person annotations Person_count_all
Person_ratio_all_allWords
Person_meanLengthIn_Char
Person_meanLengthIn_Word
hasTitle Person annotations having a title e.g. Mr. John Smith Person_ratio_hasTitle_all
DATE
Date All Date annotations Date_count_all
Date_ratio_all_allWords
Date_meanLengthIn_Char
Date_meanLengthIn_Word
dateNum Date annotations with numeric month component Date_ratio_dateNum_all
dateWorded Date annotations with worded month component Date_ratio_dateWorded_all
hasDay Date annotations with a day specified Date_ratio_hasDay_all
hasYear Date annotations with a year specified Date_ratio_hasYear_all
dateUK Numeric Date annotations written in UK format e.g. 30/12/2005 Date_ratio_dateUK_dateNum
dateUS Numeric Date annotations written in US format e.g. 12/30/2005 Date_ratio_dateUS_dateNum
dateAmbiguous Numeric Date annotations with ambiguous (US or UK) style e.g. 5/6/2005 Date_ratio_dateAmbiguous_dateNum
monthDate Worded Date annotations with month before date e.g. July 7th Date_ratio_monthDate_dateWorded
dateMonth Worded Date annotations with date before month e.g. 7th of July Date_ratio_dateMonth_dateWorded
ADDRESS
Address all address annotations Address_count_all
Address_meanLengthIn_Char
Address_meanLengthIn_Word
Address_ratio_all_allWords
EMAIL
Email all email annotations Email_count_all
Email_meanLengthIn_Char
Email_meanLengthIn_Word
Email_ratio_all_allWords
LOCATION
Location all location annotations Location_count_all
Location_meanLengthIn_Char
Location_meanLengthIn_Word
Location_ratio_all_allWords
ORGANIZATION
Organization all organization annotations Organization_count_all
Organization_meanLengthIn_Char
Organization_meanLengthIn_Word
Organization_ratio_all_allWords
PERCENT
Percent all percent annotations Percent_count_all
Percent_meanLengthIn_Char
Percent_meanLengthIn_Word
Percent_ratio_all_allWords
PHONE
Phone all phone annotations Phone_count_all
Phone_meanLengthIn_Char
Phone_meanLengthIn_Word
Phone_ratio_all_allWords
URL
Url all url annotations Url_count_all
Url_meanLengthIn_Char
Url_meanLengthIn_Word
Url_ratio_all_allWords
It will be appreciated by those skilled in the art that in the above feature list "char" is short for "character" and the numbers after the terms "punc" and "specialChar" are American Standard Code for Information Interchange (ASCII) codes. Hence, for example, the feature Char_count_punc33 is a numeric value equal to the number of times ASCII code 33 (i.e. !) is used in the document being parsed. Some of the other features mentioned in the above list are counts and/or ratios associated with user-defined lexicons of commonly used emoticons, farewells, function words, greetings and multiword prepositions. Each of the feature names is a variable that is set to a numeric value that is calculated for the respective feature. For example, for an email consisting of 488 characters, the feature Char_count_all is set to a value of 488.
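By way of illustration only, the character-level features described above might be computed along the following lines. This is a minimal Python sketch and not the code of the preferred embodiment: the function name char_features is hypothetical, and the definition of Char_ratio_punctuation_all as the punctuation count divided by the total character count is an assumption inferred from the naming pattern of the feature list.

```python
# Characters taken from the feature list; each feature name embeds the ASCII code.
PUNCTUATION = ",.?!:;'\""
SPECIAL = "~@#$%^&*-_=+<>[]{}\\/|"


def char_features(text):
    """Return a dict of character-level features for one document (hypothetical helper)."""
    features = {"Char_count_all": len(text)}
    for prefix, chars in (("punc", PUNCTUATION), ("specialChar", SPECIAL)):
        for ch in chars:
            # e.g. "!" has ASCII code 33, giving the key Char_count_punc33
            features["Char_count_%s%d" % (prefix, ord(ch))] = text.count(ch)
    punctuation_total = sum(text.count(ch) for ch in PUNCTUATION)
    # Assumed meaning of the ratio feature: punctuation characters / all characters.
    features["Char_ratio_punctuation_all"] = (
        punctuation_total / len(text) if text else 0.0
    )
    return features


features = char_features("Hi! Really?! Yes, ok.")
print(features["Char_count_all"], features["Char_count_punc33"])
```

Keying each count by ord(ch) keeps the feature names aligned with the ASCII-based naming used in the list.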
At step 19 the features extracted at step 18 are converted into data structures associated with segments of the text. The type of data structure chosen must be suitable for use with the type of machine learning system that will be used in step 20. The preferred embodiment uses feature vectors as the data structure and makes use of the Conditional Random Fields technique in the machine learning system. Each of the feature vectors is associated with a line of the text of the email 3. A feature vector is essentially a list of features that is structured in a predefined manner to function as input for the Conditional Random Fields processing that occurs at the next step.
At step 20 the machine learning system, using the Conditional Random Fields technique, receives the feature vectors and associated lines of text as input and is responsive to that input so as to categorise each line of text as broadly falling into one of two categories: author composed text or non-author composed text. More specifically, the category of non-author composed text is divided into five sub-categories as follows:
1. signature text 7;
2. automatically appended advertisement text 8;
3. quotation text;
4. text contained in an embedded reply chain of email messages 9; and
5. header text 5.
In the preferred embodiment, if the text does not fall into any of these five sub-categories of non-author composed text, it is categorized as author composed text. Since header text 5 is typically identified in the preprocessing step 10, the machine learning categorization step 20 focuses upon identifying the other four sub-categories of non-author composed text.
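As an illustration of step 19, the association of one feature vector with each line of text might be sketched in Python as follows. The choice of four features, the name Line_is_blank and the helper names are assumptions made for the example; the preferred embodiment draws on the much larger feature list set out above and passes the resulting vectors to a Conditional Random Fields system, which is not shown here.

```python
# Fixed, predefined ordering of the features within every vector.
# Line_is_blank is a hypothetical per-line feature used only in this sketch.
FEATURE_ORDER = (
    "Char_count_all",
    "Char_count_punc58",         # ":" (ASCII 58), common in header-like lines
    "Char_count_specialChar62",  # ">" (ASCII 62), common in quoted reply lines
    "Line_is_blank",
)


def line_feature_vector(line):
    """Build the feature vector for a single line, in FEATURE_ORDER."""
    return [
        float(len(line)),
        float(line.count(":")),
        float(line.count(">")),
        1.0 if not line.strip() else 0.0,
    ]


def email_to_vectors(email_text):
    """Pair every line of the email text with its feature vector."""
    return [(line, line_feature_vector(line)) for line in email_text.splitlines()]


vectors = email_to_vectors("Hi Joe,\n\n> From: x")
for line, vector in vectors:
    print(repr(line), vector)
```

Keeping the ordering fixed means that position i of every vector always refers to the same feature, which is what allows a learning system to consume the vectors directly.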
Once the parsing is complete, the results are stored in accordance with a storage protocol. The preferred embodiment once again makes use of annotations, as described in detail above, to record the results of the parsing. The identified sub-categories of non-author composed text are denoted by the following tags: <header>, <quote>, <signature>, <reply> and <advert>. The text that does not fall into any of these non-author composed sub-categories is categorized as author composed text and is annotated with the following tag: <AuthorText>. With reference to the running example, the annotated text reads as follows:
<header> Original Message
From: <Organization>Commercial Services</Organization>
Sent: <Date>Monday, May 08, 2006</Date> <Time>3:23 PM</Time>
To: '<Email>jbloggs@hotmail.com</Email>'
Subject: RE: Special Request</header>
<AuthorText><paragraph>Hi <Person>Joe</Person>,</paragraph>
<paragraph><sentence>Thank you for inquiring about our <Organization>Commercial Services</Organization> program.</sentence> <sentence>Thank you for your recent <Organization>Commercial Services</Organization> inquiry.</sentence> <sentence>The <Organization>B&W Commercial Services</Organization> program can give you one-stop convenience for all of your upkeep and commercial improvement needs, including online change of address and utilities connections with the QC product.</sentence> <sentence>Here is the link to access this information: <Url>http://commercialservices.bw.com</Url>.</sentence> <sentence>The vendors are listed by category and their contact information is also available on-line.</sentence> <sentence>In order to receive quotes on the services you've requested, it is advised to directly contact that vendor as <Organization>Commercial Services</Organization> does not have access to pricing information.</sentence></paragraph>
<paragraph><sentence>If you require any moving services, however, please feel free to browse our website for our movers' information and then call us at <Phone>888.572.9427</Phone> so that we can set up an appointment for an estimate.</sentence></paragraph>
<paragraph><sentence>If you have any questions, please don't hesitate to email or call at <Phone>888.572.9427</Phone>.</sentence></paragraph>
<paragraph>Best Regards,
<signature>The <Organization>Commercial Services</Organization> Team
<Phone>888.572.9427</Phone>
<Email>commercialservices@bw.com</Email></signature></paragraph></AuthorText>
<reply><paragraph> Original Message
From: <Email>jbloggs@hotmail.com</Email> [mailto:<Email>jbloggs@hotmail.com</Email>]
Sent: <Date>Monday, May 08, 2006</Date> <Time>3:13 PM</Time>
To: <Organization>Commercial Services</Organization>
Subject: Special Request</paragraph>
<paragraph><Organization>BW Commercial Services</Organization> - Special request</paragraph>
<paragraph> Submitted
Time: <Date>5/8/2006</Date> <Time>4:12:32 PM</Time></paragraph>
<paragraph>Origins
Origin: Our Site
Origin 2:</paragraph>
<paragraph>Message from
Name: <Person>Joe Bloggs</Person>
E-mail: <Email>jbloggs@hotmail.com</Email>
Phone: <Phone>(507) 359-7891</Phone>
Additional Phone:
Contact Method: phone
Contact Time: Evening (<Time>5:00 pm</Time> - <Time>8:00 pm</Time>)
Contact ASAP: Yes</paragraph>
<paragraph>Customer responses
<sentence>I'm interested in renting, and I would like:</sentence>
<sentence>More information on your <Organization>Commercial Services</Organization> program</sentence></paragraph></reply>
<advert><paragraph><Organization>B&W</Organization> - Your Favorite <Organization>Commercial Services</Organization> Provider Since 1875</paragraph></advert>
The above annotated email text represents an example of a structured document 21, which is the final output of the preferred method 1. Note that not all of the annotations generated during steps 12 to 14 are included in the output of the method 1, for example some of the annotations associated with character level features are not included. Other embodiments are specifically tailored to recognize further sub-categories of non-authored text, however it has been appreciated by the inventors of the present invention that identification of the five sub-categories of non-author composed text that are set out above is sufficient to identify the vast bulk of non-author composed text present in a typical representative sample of email messages as at the priority date of this patent application. In other words, restricting the identification of non-authored text to the five sub-categories set out above represents a workable compromise between accuracy and processing requirements.
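The recording of per-line categories as annotation tags can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name annotate is hypothetical, consecutive lines sharing a category are wrapped in a single flat tag, and the nested <paragraph>, <sentence> and named-entity annotations of the running example are omitted.

```python
from itertools import groupby

# Tag names taken from the description; everything else here is illustrative.
TAGS = {"header", "quote", "signature", "reply", "advert", "AuthorText"}


def annotate(labelled_lines):
    """Wrap each run of consecutive lines that share a category in one tag.

    labelled_lines is a sequence of (line, category) pairs in document order.
    """
    parts = []
    for label, group in groupby(labelled_lines, key=lambda pair: pair[1]):
        if label not in TAGS:
            raise ValueError("unknown category: %r" % (label,))
        body = "\n".join(line for line, _ in group)
        parts.append("<%s>%s</%s>" % (label, body, label))
    return "\n".join(parts)


annotated = annotate([
    ("Hi Joe,", "AuthorText"),
    ("Thanks.", "AuthorText"),
    ("The Team", "signature"),
])
print(annotated)
```

Grouping consecutive lines avoids emitting one tag pair per line, so that a multi-line signature or reply chain is recorded as a single annotated region.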
The machine learning system makes use of a predictive model that is established during a training phase, in which the machine learning system receives training data consisting of pairs of feature vectors and line statuses, where the status of a line can be any one of: author composed text 6; automatically appended advertisement text 8; signature text 7; embedded reply chain text 9; or quotation text. The training data is compiled from a representative sample of email documents 3, at least some of which are preferably contemporary. Once sufficient training iterations have been completed, the machine learning system formulates the predictive model that is used in the machine learning categorization of step 20. In addition to, or as an alternative to, the Conditional Random Fields technique, various other preferred embodiments make use of one or more of the following known machine learning techniques:
Support Vector Machines;
Naïve Bayes;
Decision Trees; and/or
Maximum Entropy.
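To make the training phase concrete, the following Python sketch trains a small Naïve Bayes model, one of the alternative techniques listed above, on pairs of feature vectors and line statuses. It is an assumption-laden toy rather than the preferred embodiment: the two binary features, the six training pairs and the function names train and predict are invented for the example, and unlike Conditional Random Fields this model categorizes each line independently of its neighbours.

```python
import math
from collections import Counter, defaultdict


def train(pairs):
    """Count statuses and, per status, how often each binary feature is on."""
    label_counts = Counter()
    feature_on = defaultdict(Counter)  # status -> feature index -> count of 1s
    for vector, status in pairs:
        label_counts[status] += 1
        for i, value in enumerate(vector):
            if value:
                feature_on[status][i] += 1
    return label_counts, feature_on


def predict(model, vector):
    """Return the status with the highest log-probability for the vector."""
    label_counts, feature_on = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior for this status
        for i, value in enumerate(vector):
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            p_on = (feature_on[label][i] + 1) / (count + 2)
            score += math.log(p_on if value else 1.0 - p_on)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical features: [line starts with ">", line contains a phone number]
training = [
    ([1, 0], "reply"), ([1, 0], "reply"),
    ([0, 1], "signature"), ([0, 1], "signature"),
    ([0, 0], "AuthorText"), ([0, 0], "AuthorText"),
]
model = train(training)
print(predict(model, [1, 0]))
```

Re-running train on a fresh, contemporary sample is all that re-training amounts to in this sketch, since the model consists only of counts.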
It will be appreciated by those skilled in the art that the present invention may be embodied in computer software in the form of executable code for instructing a computer to perform the inventive method. The software and its associated data are capable of being stored upon a computer-readable medium in the form of one or more compact disks (CDs). Alternative embodiments make use of other forms of digital storage media, such as Digital Versatile Discs (DVDs), hard drives, flash memory, Erasable Programmable Read-Only Memory (EPROM), and the like. Alternatively the software and its associated data may be stored as one or more downloadable or remotely executable files that are accessible via a computer communications network such as the internet.
Hence, the processing of email text undertaken by the preferred embodiment advantageously identifies advertisements and quotations in addition to reply lines, signatures and text written by the author. This parsing may be performed with a comparatively high degree of accuracy. It is achieved with the use of a rich set of linguistic features, such as a database storing a plurality of named entities, common greetings and farewell phrases. The parsing also makes use of a comprehensive set of punctuation features. Additionally, the use of segmentation analysis provides further useful input to the parsing processing, for example to help avoid incorrectly categorizing half of a sentence as author composed text and the other half of a sentence as a reply line.
The preferred embodiment can advantageously function with input email text represented in a variety of formats. Advantageously, alternative preferred embodiments are configurable for use in parsing email text expressed in languages other than English. Provided the machine learning system is regularly re-trained on a contemporary set of training data, the preferred embodiment can effectively keep abreast of newly emergent email writing styles and expressions. This assists in maintaining a comparatively high degree of accuracy as the email writing genre evolves over time.
While a number of preferred embodiments have been described, it will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A computer implemented method of parsing an email document so as to categorize text from the email document as author composed text or non-author composed text, said method including the steps of:
processing the text to determine the presence of signature text and categorizing any such signature text as non-author composed text;
processing the text to determine the presence of automatically appended advertisement text and categorizing any such automatically appended advertisement text as non-author composed text;
processing the text to determine the presence of quotation text and categorizing any such quotation text as non-author composed text;
processing the text to determine the presence of text contained in an embedded reply chain of email messages and categorizing any such text contained in an embedded reply chain of email messages as non-author composed text; and
categorizing at least some of the remaining text as author composed text.
2. A method according to claim 1 wherein at least one of the text processing steps includes a linguistic analysis of the words in the text.
3. A method according to claim 2 wherein said linguistic analysis includes identification of predefined words and phrases.
4. A method according to claim 3 wherein said words and phrases include any one or more of the following types: peoples' names, locations, dates, times, organizations, currency, uniform resource locators (URL's), email addresses, addresses, organizational descriptors, phone numbers, typical greetings and/or typical farewells.
5. A method according to claim 4 further including a database of words and phrases of any one or more of the following types: peoples' names, locations, dates, times, organizations, currency, uniform resource locators (URL's), email addresses, addresses, organizational descriptors, phone numbers, typical greetings and/or typical farewells.
6. A method according to claim 4 or 5 further including the step of anonymising information contained within the text of the email document.
7. A method according to any one of the preceding claims wherein at least one of the text processing steps includes an analysis of the punctuation used in the text.
8. A method according to any one of the preceding claims wherein at least one of the text processing steps includes an analysis of the paragraph segmentation used in the text.
9. A method according to any one of the preceding claims wherein at least one of the text processing steps includes an analysis of the sentence segmentation used in the text.
10. A method according to claim 1 wherein at least one of the text processing steps includes any one or more of: a linguistic analysis of the words in the text, an analysis of the punctuation used in the text; an analysis of the paragraph segmentation used in the text; and/or an analysis of the sentence segmentation used in the text, and wherein the results of said analyses are represented by one or more data structures associated with segments of the text.
11. A method according to claim 10 wherein said segments of the text are lines of the text.
12. A method according to claim 10 or 11 wherein at least one of the text processing steps further includes utilising a machine learning system that is responsive to said one or more data structures.
13. A method according to claim 12 wherein the data structures are feature vectors and the machine learning system utilizes any one or more of the following techniques:
Conditional Random Fields;
Support Vector Machines;
Naïve Bayes;
Decision Trees; and/or
Maximum Entropy.
14. A method according to claim 12 or 13 wherein the machine learning system has been trained with reference to a representative sample of email documents.
15. A method according to claim 14 wherein the representative sample of email documents includes a proportion of contemporary email documents.
16. A method according to any one of the preceding claims including a step of processing the text to determine the presence of header text and categorizing any such header text as non-author composed text.
17. A method according to any one of the preceding claims including a step of processing the email document to determine the presence of any attachments and stripping any such attachments from the email document prior to processing the text.
18. A method according to any one of the preceding claims including a step of processing the email document to determine the presence of any forwarded material and stripping any such forwarded material from the email document prior to processing the text.
19. A method according to any one of the preceding claims including a step of processing the email document to ascertain whether the email document is in a preferred format and, if the email document is not in the preferred format, converting at least some of the information within the email document to the preferred format.
20. A computer-readable medium containing computer executable code for instructing a computer to perform a method according to any one of the preceding claims.
21. A downloadable or remotely executable file or combination of files containing computer executable code for instructing a computer to perform a method according to any one of claims 1 to 19.
22. A computing apparatus having a central processing unit, associated memory and storage devices, and input and output devices, said apparatus being configured to perform a method according to any one of claims 1 to 19.
Dated: 5 April, 2007
Appen Pty Limited,
By Their Patent Attorneys,
ADAMS PLUCK
PCT/AU2007/000440 2006-11-03 2007-04-05 Email document parsing method and apparatus WO2008052239A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP07718687A EP2092447A4 (en) 2006-11-03 2007-04-05 Email document parsing method and apparatus
US12/447,898 US20100100815A1 (en) 2006-11-03 2007-04-05 Email document parsing method and apparatus
AU2007314123A AU2007314123B2 (en) 2006-11-03 2007-04-05 Email document parsing method and apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AU2006906095 2006-11-03
AU2006906095A AU2006906095A0 (en) 2006-11-03 Email document parsing method and apparatus
AU2006906623 2006-11-28
AU2006906623A AU2006906623A0 (en) 2006-11-28 Document processor and associated method

Publications (1)

Publication Number Publication Date
WO2008052239A1 (en) 2008-05-08

Family

ID=39343669

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/AU2007/000440 WO2008052239A1 (en) 2006-11-03 2007-04-05 Email document parsing method and apparatus
PCT/AU2007/000441 WO2008052240A1 (en) 2006-11-03 2007-04-05 Document processor and associated method

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/AU2007/000441 WO2008052240A1 (en) 2006-11-03 2007-04-05 Document processor and associated method

Country Status (4)

Country Link
US (2) US20100114562A1 (en)
EP (2) EP2092447A4 (en)
AU (2) AU2007314124B2 (en)
WO (2) WO2008052239A1 (en)

US9563689B1 (en) 2014-08-27 2017-02-07 Google Inc. Generating and applying data extraction templates
US9785705B1 (en) 2014-10-16 2017-10-10 Google Inc. Generating and applying data extraction templates
US10178043B1 (en) 2014-12-08 2019-01-08 Conviva Inc. Dynamic bitrate range selection in the cloud for optimized video streaming
US10305955B1 (en) 2014-12-08 2019-05-28 Conviva Inc. Streaming decision in the cloud
US10216837B1 (en) 2014-12-29 2019-02-26 Google Llc Selecting pattern matching segments for electronic communication clustering
US10097489B2 (en) 2015-01-29 2018-10-09 Sap Se Secure e-mail attachment routing and delivery
US9578493B1 (en) 2015-08-06 2017-02-21 Adtile Technologies Inc. Sensor control switch
US10003561B2 (en) 2015-08-24 2018-06-19 Microsoft Technology Licensing, Llc Conversation modification for enhanced user interaction
US9639524B2 (en) 2015-08-26 2017-05-02 International Business Machines Corporation Linguistic based determination of text creation date
US9659007B2 (en) 2015-08-26 2017-05-23 International Business Machines Corporation Linguistic based determination of text location origin
US10275446B2 (en) 2015-08-26 2019-04-30 International Business Machines Corporation Linguistic based determination of text location origin
US10437463B2 (en) 2015-10-16 2019-10-08 Lumini Corporation Motion-based graphical input system
US9940318B2 (en) * 2016-01-01 2018-04-10 Google Llc Generating and applying outgoing communication templates
US10140291B2 (en) 2016-06-30 2018-11-27 International Business Machines Corporation Task-oriented messaging system
US10387559B1 (en) * 2016-11-22 2019-08-20 Google Llc Template-based identification of user interest
US9983687B1 (en) 2017-01-06 2018-05-29 Adtile Technologies Inc. Gesture-controlled augmented reality experience using a mobile communications device
US10762895B2 (en) 2017-06-30 2020-09-01 International Business Machines Corporation Linguistic profiling for digital customization and personalization
US10929617B2 (en) * 2018-07-20 2021-02-23 International Business Machines Corporation Text analysis in unsupported languages using backtranslation
US11068530B1 (en) * 2018-11-02 2021-07-20 Shutterstock, Inc. Context-based image selection for electronic media

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001025966A1 (en) * 1999-10-01 2001-04-12 Talisma Corporation Web mail management method and system
US20040024825A1 (en) * 2002-08-01 2004-02-05 Peter Chou Method and system for parsing e-mail
WO2006083820A2 (en) * 2005-02-01 2006-08-10 Metalincs Cor. Thread identification and classification

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5111398A (en) * 1988-11-21 1992-05-05 Xerox Corporation Processing natural language text using autonomous punctuational structure
US5636325A (en) * 1992-11-13 1997-06-03 International Business Machines Corporation Speech synthesis and analysis of dialects
US6173406B1 (en) * 1997-07-15 2001-01-09 Microsoft Corporation Authentication systems, methods, and computer program products
US6285978B1 (en) * 1998-09-24 2001-09-04 International Business Machines Corporation System and method for estimating accuracy of an automatic natural language translation
US6732087B1 (en) * 1999-10-01 2004-05-04 Trialsmith, Inc. Information storage, retrieval and delivery system and method operable with a computer network
US6836768B1 (en) * 1999-04-27 2004-12-28 Surfnotes Method and apparatus for improved information representation
US6507829B1 (en) * 1999-06-18 2003-01-14 Ppd Development, Lp Textual data classification method and apparatus
US6647395B1 (en) * 1999-11-01 2003-11-11 Kurzweil Cyberart Technologies, Inc. Poet personalities
US7275029B1 (en) * 1999-11-05 2007-09-25 Microsoft Corporation System and method for joint optimization of language model performance and size
US6567805B1 (en) * 2000-05-15 2003-05-20 International Business Machines Corporation Interactive automated response system
US7346492B2 (en) * 2001-01-24 2008-03-18 Shaw Stroz Llc System and method for computerized psychological content analysis of computer and media generated communications to produce communications management support, indications, and warnings of dangerous behavior, assessment of media images, and personnel selection support
US20030043188A1 (en) * 2001-08-30 2003-03-06 Daron John Bernard Code read communication software
US6993534B2 (en) * 2002-05-08 2006-01-31 International Business Machines Corporation Data store for knowledge-based data mining system
US7369985B2 (en) * 2003-02-11 2008-05-06 Fuji Xerox Co., Ltd. System and method for dynamically determining the attitude of an author of a natural language document
US7813917B2 (en) * 2004-06-22 2010-10-12 Gary Stephen Shuster Candidate matching using algorithmic analysis of candidate-authored narrative information
US20060129602A1 (en) * 2004-12-15 2006-06-15 Microsoft Corporation Enable web sites to receive and process e-mail
WO2006088915A1 (en) * 2005-02-14 2006-08-24 Inboxer, Inc. System for applying a variety of policies and actions to electronic messages before they leave the control of the message originator
US20080084972A1 (en) * 2006-09-27 2008-04-10 Michael Robert Burke Verifying that a message was authored by a user by utilizing a user profile generated for the user

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2092447A4 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055206A1 (en) * 2008-01-15 2011-03-03 West Services, Inc. Systems, methods and software for processing phrases and clauses in legal documents
US8788523B2 (en) * 2008-01-15 2014-07-22 Thomson Reuters Global Resources Systems, methods and software for processing phrases and clauses in legal documents
WO2011154023A1 (en) * 2010-06-11 2011-12-15 Siemens Enterprise Communications Gmbh & Co. Kg Method for producing a document with the aid of an information processing system
US20180124007A1 (en) * 2016-10-28 2018-05-03 Hewlett Packard Enterprise Development Lp Hashes of email text
US10511563B2 (en) * 2016-10-28 2019-12-17 Micro Focus Llc Hashes of email text
US11463500B1 (en) * 2017-08-04 2022-10-04 Grammarly, Inc. Artificial intelligence communication assistance for augmenting a transmitted communication
US11620566B1 (en) 2017-08-04 2023-04-04 Grammarly, Inc. Artificial intelligence communication assistance for improving the effectiveness of communications using reaction data
US11727205B1 (en) 2017-08-04 2023-08-15 Grammarly, Inc. Artificial intelligence communication assistance for providing communication advice utilizing communication profiles

Also Published As

Publication number Publication date
AU2007314124B2 (en) 2009-08-20
AU2007314123A1 (en) 2008-05-08
EP2092447A1 (en) 2009-08-26
AU2007314123B2 (en) 2009-09-03
EP2084620A4 (en) 2011-05-11
US20100100815A1 (en) 2010-04-22
AU2007314124A1 (en) 2008-05-08
EP2092447A4 (en) 2011-03-02
US20100114562A1 (en) 2010-05-06
WO2008052240A1 (en) 2008-05-08
EP2084620A1 (en) 2009-08-05

Similar Documents

Publication Publication Date Title
AU2007314123B2 (en) Email document parsing method and apparatus
EP0914637B1 (en) Document producing support system
Maekawa et al. Balanced corpus of contemporary written Japanese
US7269544B2 (en) System and method for identifying special word usage in a document
US20150278195A1 (en) Text data sentiment analysis method
US8706470B2 (en) Methods of offering guidance on common language usage utilizing a hashing function consisting of a hash triplet
Chen et al. Mining user requirements to facilitate mobile app quality upgrades with big data
US20100023318A1 (en) Method and device for retrieving data and transforming same into qualitative data of a text-based document
CN101887414A (en) Server for automatically scoring the evaluation conveyed by a text message containing pictorial symbols
WO2013003008A2 (en) Automatic classification of electronic content into projects
US20150026178A1 (en) Subject-matter analysis of tabular data
GB2389437A (en) Automatic data checking and correction
Rao et al. CMEE-IL: Code Mix Entity Extraction in Indian Languages from Social Media Text @ FIRE 2016 - An Overview
JP7208872B2 (en) Systems and methods for generating proposals based on request for proposals (RFPs)
Şeker et al. Extending a CRF-based named entity recognition model for Turkish well formed text and user generated content
Almuqren et al. AraCust: a Saudi Telecom Tweets corpus for sentiment analysis
CN111259645A (en) Referee document structuring method and device
Jahan et al. A pronoun replacement-based special tagging system for Bengali language processing (BLP)
Litvak et al. Multilingual Text Analysis: Challenges, Models, and Approaches
US20180075157A1 (en) Method and System for Converting Disparate Financial, Regulatory, and Disclosure Documents to a Linked Table
Gobin-Rahimbux et al. KreolStem: A hybrid language-dependent stemmer for Kreol Morisien
Gupta et al. LemmaQuest Lemmatizer: A Morphological Analyzer Handling Nominalization
Šostaka et al. The Semi-Algorithmic Approach to Formation of Latvian Information and Communication Technology Terms.
Varadarajan et al. Text-mining: Application development challenges
Kim et al. From Words to Numbers: Getting Started with Text Analysis for Applied Social Scientists

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 07718687; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2007314123; Country of ref document: AU
NENP Non-entry into the national phase. Ref country code: DE
WWE WIPO information: entry into national phase. Ref document number: 2007718687; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2007314123; Country of ref document: AU; Date of ref document: 20070405; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 12447898; Country of ref document: US