GB2615736A - Text-based document processing - Google Patents

Text-based document processing

Info

Publication number
GB2615736A
Authority
GB
United Kingdom
Prior art keywords: text, image, data, document, block
Prior art date
Legal status
Pending
Application number
GB2118781.0A
Inventor
Meijer Bernet
Wood Lewis
Jew Luke
Satti Riham
Doraiswamy Vivek
Lamplough Nathan
Current Assignee
Oxiway Ltd
Original Assignee
Oxiway Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2023-08-23
Application filed by Oxiway Ltd filed Critical Oxiway Ltd
Priority to GB2118781.0A
Publication of GB2615736A

Classifications

    • G06F40/30 Semantic analysis
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G06F40/216 Parsing using statistical methods
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Abstract

A computer-implemented method of processing a text-based document 400 comprises receiving data representative of a text-based document and applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text (103-109, fig 1) in the image. Each block contains a plurality of words. The method further comprises, for each of the identified blocks of text, identifying a respective portion of text data 413-419 encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label 423-429 for the block of text. The text-based document may be a CV or resume and the method may support accurate classification of its sections. Object detection may involve determining boundaries around each block 403-409 and/or applying a lossy compression operation (downsampling or blurring, fig 3) to the image representation prior to identifying the blocks.

Description

Text-Based Document Processing
BACKGROUND OF THE INVENTION
This invention relates to methods, systems and software for processing text-based documents.
There are many types of text-based document for which different examples have similar content but do not share a common, well-defined structure. For example, résumés and curricula vitae (CVs) typically contain standard content, such as academic qualifications and employment history, but exhibit a great diversity in their layout, ordering, design and typography. This lack of consistent structure makes it challenging for computer-based methods to correctly classify the text within such documents.
The present invention seeks to provide an improved approach to processing text-based documents that can allow for more accurate classification.
SUMMARY OF THE INVENTION
From a first aspect, the invention provides a computer-implemented method of processing a text-based document, the method comprising: receiving data representative of a text-based document; applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identifying a respective portion of text data encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label for the block of text.
From a second aspect, the invention provides computer software which, when executed on a processing system, causes the processing system to: receive data representative of a text-based document; apply an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identify a respective portion of text data encoding one or more words contained in the respective block, and process the identified portion of text data to determine a label for the block of text.
From a further aspect, the invention provides a processing system comprising a processor and a memory storing a computer program comprising instructions which, when executed by the processor, cause the processor to: receive data representative of a text-based document; apply an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identify a respective portion of text data encoding one or more words contained in the respective block, and process the identified portion of text data to determine a label for the block of text.
Thus it will be seen that, in accordance with embodiments of the invention, image processing is first used to identify one or more blocks of text (i.e. sections of text) from an image representation of the document, and then text-based processing (e.g. semantic analysis) is used to classify each of the blocks based on its textual content. This can enable the text of a block to be labelled more accurately because all the text is likely to relate to the same subject (e.g. "hobbies") by virtue of its physical proximity within the document, as determined from the image representation.
The blocks of text within the document may be separated by regions in which no text is present, e.g. white space. The blocks of text may therefore be well separated, making it easier to define boundaries around them. The image-based object detection process may include determining a respective boundary for each of the identified blocks of text.
This may allow the portion of the text data associated with each block of text identified in the image to be more accurately determined. Each boundary may be a closed path. Each may be polygonal. The boundaries may be rectangular, although in some embodiments at least one boundary may comprise more than four edges; this may be useful for segmenting documents having complex layouts where blocks of text cannot be separated using only rectangular boxes. In some embodiments, the image-based object detection process may determine a respective polygonal boundary (e.g. rectangular bounding box) around each of the blocks of text. Coordinates for vertices of the boundary may be determined. These coordinates may be used to determine (e.g. identify and/or extract from the text-based document) the respective portion of text data, e.g. by processing layout data associated with text data for the document.
The plurality of blocks identified by the image-based object detection process may, in some embodiments and for at least some documents, collectively contain all the text of the document. The blocks are preferably non-overlapping.
In certain document types, text-based content may be presented in the form of one or more horizontal or vertical lines of text, each line comprising characters grouped into one or more words. In some embodiments, the image-based object detection process is configured such that each block of text contains only complete lines (e.g. one or more complete lines). However, some embodiments may be able to identify blocks that contain one or more partial lines.
Receiving the data representative of the document may comprise reading the data from a memory, or receiving the data over a software interface (e.g. an API) or a physical interface (e.g. a network connection). The document data may be received as one or more files or data streams.
The received data may comprise any one or more of: image data, text data, and layout data. In some embodiments, it comprises the image representation of the text-based document and also comprises text data encoding some or all of the textual content of the text-based document, optionally with layout data. However, this is not necessary in all embodiments. Some methods may comprise generating the image representation from received text data and layout data, e.g. by processing text data and layout data to generate an image (e.g. one or more bitmap images) of the document. Other methods may comprise generating text data (optionally with layout data) from received image data, e.g. from a or the received image representation of the text-based document.
The text-based document may be received as a Portable Document Format (PDF) file or a PostScript (PS) file, or as a word-processing file, e.g. a Microsoft Word file or an Open Document Format for Office Applications (ODF) file.
In some embodiments, the text data may be stored as data in a text file, e.g., in the form of ASCII or Unicode characters. Layout information defining the position of text within the document (e.g. a location for each character or word on an identified page of the document) may be stored separately, e.g. in a separate file, or may be received in a common file with the text data.
The image representation of the text-based document may comprise raster or vector image data. It may be encoded in any appropriate manner. It may comprise one or more bitmaps. It may be compressed. It may be stored as one or more files. It may be encoded within a Portable Document Format (PDF) file. It may be received as input to the system, or it may be generated from received text data. In some embodiments, the image representation may be received in addition to the text data of the text-based document, embedded within the same file or as a separate file.
In some embodiments, the text-based document may be received as image data without containing text data, e.g. if the text-based document is received in the form of a scan or photograph of a paper document. In such embodiments, methods may comprise generating text data encoding textual content of the text-based document. This may be achieved, for example, by applying optical character recognition (OCR) to the image representation of the text-based document to generate the text data. Thus, in some embodiments, the text data is generated from an image using an optical character recognition (OCR) process.
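By way of non-limiting illustration, such an OCR step might be implemented with the open-source Tesseract engine via its pytesseract wrapper. This is a minimal sketch of one possible implementation; the disclosure does not prescribe a particular OCR engine:

```python
# Illustrative sketch only: generating text data from an image representation
# of a scanned document using OCR. The choice of pytesseract/Tesseract is an
# assumption, not a requirement of the method.
from PIL import Image
import pytesseract

def generate_text_data(image_path: str) -> str:
    """Apply OCR to a page image to recover its textual content."""
    return pytesseract.image_to_string(Image.open(image_path))
```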
In some embodiments, the image-based object detection process may comprise applying a trained machine learning algorithm to identify and/or classify objects, including the plurality of blocks of text, within the image representation. The machine learning algorithm may have been trained on training data. The training data may comprise image representations of a plurality of text-based documents. These documents may be of a same type as the received text-based document (e.g. all being CVs). The training data may include data that identifies blocks of text in the images, which may have been identified by a human. The machine learning algorithm may be trained using supervised or unsupervised learning techniques. The image-based object detection process may use a trained classifier, such as a convolutional neural network (CNN), to identify the plurality of blocks of text.
When determining a label for an identified portion of text data corresponding to a respective block of text, the text data may be processed using any appropriate method to determine information about its contents. In some embodiments, the text data may be processed to identify a heading, e.g. by identifying the initial words before a first line break, or identifying the initial words before a reduction in font size. The text data may be processed by applying one or more regular expressions to the portion of text data. This may allow one or more predetermined strings (e.g. a heading) to be identified within the text data. A label may be assigned in dependence on the identified string or strings. The string itself may be assigned as the label, or a mapping may be used to determine the label. For example, if the string "Personal Skills" is identified, this may then be determined to be equivalent, in mapping data, to a predetermined string "Skills". The predetermined string may thus be assigned as the label for the block of text in which the text data is contained.
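The following minimal sketch illustrates one way such heading identification and mapping might be implemented; the regular expression and the mapping entries are illustrative assumptions only, not a prescribed or exhaustive mapping:

```python
# Illustrative sketch: take the initial words before the first line break as
# a heading, then map the heading to a predetermined label via mapping data.
# HEADING_TO_LABEL is a hypothetical example mapping.
import re

HEADING_TO_LABEL = {
    "personal skills": "Skills",
    "skills": "Skills",
    "career summary": "Work Experience",
}

def label_from_heading(block_text: str):
    """Return a predetermined label for the block's heading, or None."""
    match = re.match(r"^(.+?)(?:\r?\n|$)", block_text.strip())
    heading = match.group(1).strip().lower() if match else ""
    return HEADING_TO_LABEL.get(heading)
```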
In some embodiments, determining a label for a portion of text data may comprise applying a naive Bayes classifier to the portion of text data. Words within the text data may be identified, and input to the naive Bayes classifier, which may be configured (e.g. trained) to process each word within the text data independently of any other word. An appropriate label may be determined based on the frequency with which certain words appear within the portion of text data. Thus, in some embodiments, processing the identified portion of text data may comprise applying a trained naive Bayes classifier to the portion of text data. The naive Bayes classifier may have been trained on training data which may comprise portions of text data extracted from text-based documents of a same type as the received text-based document (e.g. CVs). The training data may include labels for portions of text data, which may have been labelled by a human.
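One possible realisation of such a classifier is sketched below, under the assumption that scikit-learn is used and that a corpus of human-labelled CV sections is available for training; neither assumption is mandated by this disclosure:

```python
# Illustrative sketch: a naive Bayes section classifier operating on word
# frequencies, with each word treated independently of any other word.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_section_classifier(section_texts, section_labels):
    """Train on labelled text portions extracted from documents of the
    same type (e.g. CVs); returns a fitted text -> label classifier."""
    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(section_texts, section_labels)
    return classifier

# Example use:
#   clf = train_section_classifier(train_texts, train_labels)
#   label = clf.predict([portion_of_text])[0]   # e.g. "Skills"
```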
The identified portion of data may be further processed, e.g. to extract information from the text data. Information may be extracted by inspecting the text in each block line-by-line and/or sentence-by-sentence. One or more regular expressions may be applied to identify numerical data, such as dates, within the text data.
At least for some document types, the image-based object detection process may comprise applying a lossy compression operation, such as downsampling (i.e. reducing the image resolution) or blurring, to the image prior to the object detection process being performed. Such lossy compression may advantageously remove some or all personally-identifying information from the image (e.g. from the text content represented in the image). This may help support privacy and/or blind recruiting policies.
Compressing the image may also advantageously improve efficiency by reducing storage and/or processing requirements of the system.
The same or a different lossy compression operation may be applied (or have been applied) to training data used to train the image-based object detection process. In addition to anonymising the data, this may, in some embodiments, usefully bias the detection process towards considering larger layout features of the document, rather than finer details which may be less relevant to effective segmentation. In particular, it may reduce or remove language-specific features from the image, thereby enabling image data to be usefully included in the training dataset regardless of its language or content (e.g. even if it is in a different language from the text-based document processed during inference). The image-based object detection process may thus be language-agnostic in some embodiments.
In some embodiments, the image-based object detection process may comprise applying a downsampling operation to the image representation prior to identifying a plurality of blocks of text in the image. The downsampling may comprise reducing a spatial resolution of the image in one or preferably two dimensions. It may comprise combining (e.g. averaging) adjacent pixels into fewer, larger pixels (e.g. each equivalent to 3x3 original pixels).
In some embodiments, the image-based object detection process may additionally or alternatively comprise applying a blurring filter to the image representation prior to identifying a plurality of blocks of text in the image. Blurring of the image may be achieved by applying a two-dimensional convolution function to the image. In some embodiments, applying the blurring filter to the image comprises applying a Gaussian blur to the image. It may also comprise adding white noise to the image, which may help ensure that small-scale structure of the image, already attenuated by the blur, is lost. The properties of the Gaussian blur function may be selected to optimise the results of the object detection process. For example, the full width at half maximum (FWHM) of the Gaussian blur may be selected to correspond to 3 pixels in the image.
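A minimal sketch of such a lossy compression operation follows, assuming OpenCV and illustrative parameter values; note that a Gaussian sigma of about 1.3 pixels corresponds to a FWHM of roughly 3 pixels, since FWHM = 2 * sqrt(2 * ln 2) * sigma (about 2.355 * sigma):

```python
# Illustrative sketch: downsample then Gaussian-blur an image representation
# prior to object detection. The target size and sigma are assumptions.
import cv2

def lossy_compress(image, target_long_edge=320, sigma=1.3):
    h, w = image.shape[:2]
    scale = target_long_edge / max(h, w)
    # INTER_AREA averages adjacent pixels, i.e. combines them into fewer,
    # larger pixels.
    small = cv2.resize(image, (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_AREA)
    # Kernel size (0, 0) lets OpenCV derive it from sigma; sigma ~1.3 px
    # gives a FWHM of ~3 px.
    return cv2.GaussianBlur(small, (0, 0), sigma)
```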
The document may comprise one or more pages. It may be a portion of a larger document or file.
The document may be any type of document. However, in a preferred set of embodiments the data representative of the document is unstructured, but the document is of a type that can be labelled with a set of predetermined labels. In some preferred embodiments the document is a résumé or curriculum vitae (CV). Methods disclosed herein are particularly well suited for labelling blocks of text within such unstructured documents.
Features of any aspect or embodiment described herein may, wherever appropriate, be applied to any other aspect or embodiment described herein. Where reference is made to different embodiments or sets of embodiments, it should be understood that these are not necessarily distinct but may overlap.
BRIEF DESCRIPTION OF THE DRAWINGS
Certain preferred embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is an image of a CV that may be processed using a method embodying the invention;
FIG. 2 is a flow diagram illustrating steps of a method, embodying the invention, for extracting relevant content from the CV;
FIG. 3 is an image of the CV after being downsampled according to a method embodying the invention;
FIG. 4 is a schematic diagram of data involved in a process of labelling a CV in accordance with a method embodying the invention; and
FIG. 5 is a schematic diagram of a processing system embodying the invention.
DETAILED DESCRIPTION
Figure 1 shows an exemplary simplified curriculum vitae (CV) 100 of a job applicant that may be processed using the method of the present disclosure. In this example the text-based document 100 is a CV; however, it will be appreciated that the methods described herein are also applicable to other types of text-based document, such as newspaper or scientific articles, administrative forms, novels, etc. The CV 100 comprises the name 101 of the person to whom the CV 100 relates, an image 102 of the applicant, and a number of blocks of text, here exemplified as four sections 103-109, each section comprising a respective section heading 103a, 105a, 107a, 109a and associated body text 103b, 105b, 107b, 109b. A first section heading 103a is part of a first section relating to the "personal details" of the applicant, such as their address, age, contact details, etc. These details are provided in the associated body text 103b. A second section heading 105a is part of a second section relating to a "career summary" of the applicant. Details of the applicant's career history, e.g. places of work and start and end dates, are provided in the associated body text 105b. A third section heading 107a is part of a section relating to "awards and achievements" of the applicant, and the associated body text 107b provides details on awards won by the applicant. A final fourth section heading 109a is part of a section relating to "personal skills" of the applicant, which are described in detail in the associated body text 109b.
Although the example shown in Figure 1 contains four sections of text, in addition to the name field 101, in some embodiments a text-based document could instead have fewer sections, e.g. one, two or three sections, or more sections, e.g. five, six, seven, eight or more sections. It could have tens or hundreds of sections, each of which may have a respective section heading and associated body text. The headings and body text thus do not need to match those shown in Figure 1, which is provided for the purposes of illustration only.
During a recruitment process, recruiters are typically required to process large numbers (e.g. hundreds or thousands) of CVs. Automated classification and extraction of textual information from such documents for review or comparison has the potential to improve the speed at which these documents can be processed and sent to humans or CV analysis software for closer review.
Methods for information extraction from text-based documents, such as CV 100, are described in the following in accordance with the present disclosure, with reference to Figures 2 to 5.
The methods disclosed herein break a text-based document into sections, and perform analysis of the text in each section to determine what category of information is contained in each section. In a first stage, sections of the text-based document are identified by analysing the structure of the document using image-based object detection, i.e. without any textual semantic analysis of the document. Computer vision methods, such as semantic image segmentation and object detection, are used to identify sections of text based on image analysis. In a second stage, the textual content of each of the identified sections of text is then processed to label each section, so as to allow for more efficient extraction of information from the CV 100 for further processing by a computer or a human.
Figure 2 is a flow diagram illustrating a method for information extraction from text-based documents such as the CV 100 shown in Figure 1 according to an embodiment of the present disclosure.
In a first step 200, the CV 100 is received by a processing system 500 (described in more detail below in relation to Figure 5) as a Portable Document Format (PDF) file, which may encapsulate a complete description of the CV 100, including the text, fonts, layout information for the text, as well as any vector graphics, raster images and other information needed to render the document. Other file formats may also be supported.
In step 202, an image of the CV 100 is generated, for example as bitmap image data, using conventional PDF rendering techniques. In some embodiments, the PDF file may already contain an image representation of the document (e.g. if the document was scanned from a paper version).
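A minimal sketch of this rendering step follows, assuming a recent version of the PyMuPDF library is used; the disclosure does not prescribe a particular PDF renderer:

```python
# Illustrative sketch: rasterise each page of a received PDF to a bitmap
# image, as one possible implementation of step 202.
import fitz  # PyMuPDF

def render_pdf_pages(pdf_path: str, dpi: int = 200):
    """Return one pixmap (bitmap image) per page of the document."""
    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi) for page in doc]
```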
In step 204, a downsampling (and/or blurring) operation is optionally applied to the image to generate a lossily-compressed image 300, as shown in Figure 3. For example, an image that is 2350 pixels on the longest edge may be reduced to 320 pixels on the longest edge. The present inventors have recognised that by irreversibly compressing the image at this stage, the efficiency of the image segmentation process described in the following steps may be improved, particularly due to a reduction in file size, at least for some types of document and some types of image-based object detection processes, without significantly reducing detection performance.
An image-based object detection process is then performed on the original image or on the compressed image 300 in step 206 to detect blocks of text within the CV 100, i.e. to identify the four sections 103, 105, 107 and 109. This process may use a trained object-detection process based on deep-learning.
In some embodiments, a deep-learning algorithm is used to detect the sections 103, 105, 107 and 109 within the compressed image 300 and to associate respective sets of pixels (referred to herein as annotations) with each of the blocks of text of the four sections 103, 105, 107, and 109. Each annotation defines a set of pixels, which may form a single contiguous region (referred to herein as a mask), where a section object is identified by the process, with a different value being assigned to each section (i.e. to each block of text) on each page of the text-based document. The annotations thus define a plurality of masks overlaying the sections 103, 105, 107, 109.
Figure 3 shows an example of four masks 303, 305, 307, 309, having closed boundary paths, overlaid on the compressed image 300. The masks 303, 305, 307 and 309 contain the blocks of text within the four sections 103, 105, 107 and 109, and identify the position of these sections within the compressed image 300, and hence within the CV 100.
In one embodiment, the section detection process is performed using the Mask-RCNN R50-FPN instance segmentation model of Facebook AI Research's library DETECTRON2, pre-trained on the COCO dataset. The Mask-RCNN framework enables object detection on two levels: bounding boxes can first be predicted around objects (such as sections 103, 105, 107 and 109), and instance segmentation can subsequently be performed, in which each instance of an object is provided with a mask 303, 305, 307, 309. Although the DETECTRON2 library is used in the embodiment described in relation to Figure 2, it will be appreciated that other object detection algorithms may be employed in some embodiments, such as those built using TENSORFLOW.
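The following sketch shows how the Mask-RCNN R50-FPN model might be configured with DETECTRON2 for this purpose; the single "text section" class, the input file name and the fine-tuned weights file are hypothetical placeholders:

```python
# Illustrative sketch: section detection with DETECTRON2's Mask R-CNN
# R50-FPN model, pre-trained on COCO. "cv_sections.pth" and "page.png"
# are hypothetical placeholders.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # a single "text section" class
cfg.MODEL.WEIGHTS = "cv_sections.pth"    # hypothetical fine-tuned weights
predictor = DefaultPredictor(cfg)

compressed_image = cv2.imread("page.png")  # BGR numpy array of the page
outputs = predictor(compressed_image)
instances = outputs["instances"]
boxes = instances.pred_boxes             # one bounding box per section
masks = instances.pred_masks             # one pixel mask per section
```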
Having identified broad areas in which blocks of text are present (i.e. within the masks 303, 305, 307 and 309), a rectangular bounding box is next created, using the Mask-RCNN, for each detected section within the compressed image 300 in step 208; these bounding boxes are used for extracting corresponding text data from the PDF file. Examples of such bounding boxes are shown in Figure 3 by the dashed boxes 304, 306, 308 and 310 formed around each block of text.
In step 210, the bounding boxes 304, 306, 308 and 310 identified using the compressed image 300 of the CV 100 (i.e. within the image file) are used to extract, from the PDF file, the text data contained within each of the bounding boxes 304, 306, 308 and 310 (i.e. each identified section). This may be achieved by determining the coordinates of the vertices of the bounding boxes 304, 306, 308, 310 within the image file and processing layout data in the PDF version of the text-based document to determine what characters fall wholly or partly within the bounding box.
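A minimal sketch of this step follows, assuming PyMuPDF is used to read the PDF's layout data; the image-to-page coordinate scaling assumes the image was rendered from the same page:

```python
# Illustrative sketch: collect the words of a PDF page whose rectangles fall
# wholly or partly within an image-space bounding box.
import fitz  # PyMuPDF

def text_in_box(page, box, image_width, image_height):
    """box is (x0, y0, x1, y1) in pixel coordinates of the rendered image."""
    sx = page.rect.width / image_width    # image -> page scale factors
    sy = page.rect.height / image_height
    x0, y0, x1, y1 = box
    rect = fitz.Rect(x0 * sx, y0 * sy, x1 * sx, y1 * sy)
    # page.get_text("words") yields (x0, y0, x1, y1, word, ...) tuples.
    return " ".join(w[4] for w in page.get_text("words")
                    if fitz.Rect(w[:4]).intersects(rect))
```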
Figure 4 illustrates four exemplary bounding boxes 403, 405, 407, 409 for a text-based document 400, corresponding to different sections of the document 400 identified by the object-detection process. The textual content present within each respective bounding box is extracted as four portions 413, 415, 417, 419 of text data from the document data, in order to be classified and optionally further processed in subsequent steps.
If the PDF does not encode text data, or if the document is provided only as image data (e.g. as a TIFF image), the masks or bounding boxes may be used to determine respective regions of the image to be input, individually, to an optical character recognition (OCR) process, to determine the contents of each isolated section.
Having extracted the text associated with each section 103, 105, 107, 109 of the CV 100 or document 400, the text data of each extracted text portion 413, 415, 417, 419 is analysed to determine an appropriate label 423, 425, 427, 429 relating to the classification of each section of text, e.g. 'education', 'experience', 'skills', 'hobbies', 'personal statement' etc. in the case of a CV 100.
In some approaches, the purpose of each section may be determined based solely on a heading of the section, e.g. by identifying any section heading 103a, 105a, 107a, 109a that is present and using a mapping of all possible section headings to a predetermined set of headings. For example: the section headings 'Education History' and 'Qualifications' would both map to a purpose of 'Education'. Similarly, the section headings 'Prior Employment' and 'Work Experience' would both map to an 'Employment' section. However, such an approach is not particularly robust, and has several disadvantages. Firstly, it can be difficult to distinguish section headers 103a, 105a, 107a, 109a from body text 103b, 105b, 107b, 109b. Secondly, sometimes there are no section headings present at all. Thirdly, there is a large variety of possible section headers that people could use and it is therefore difficult to build a complete mapping from possible headings to purposes.
Instead, in some methods disclosed herein, an alternative, more robust approach is taken. Rather than analysing headings, a classifier is applied to each extracted text portion 413, 415, 417, 419 to determine an appropriate label for the portion based on all of the text data within said portion.
In some embodiments, a naive Bayes classifier is used to determine an appropriate label for each extracted text portion 413, 415, 417, 419. Naive Bayes classifiers are probabilistic classifiers that make use of Bayes' rule and assume independence of the features (i.e. each word within the block of text). By treating the text as a collection of words, a naive Bayes classifier can be trained to predict an appropriate section purpose label based on the frequencies with which different words appear in a section. Possible labels that may be predicted by the naive Bayes classifier, when trained on a database of CVs, include, for example: "Personal Information", "Personal Statement", "Work Experience", "Education", "Skills", "Professional Memberships", "References" etc. Thus, for each extracted text portion 413, 415, 417, 419, a respective label 423, 425, 427, 429 is determined using a naive Bayes classifier that has been trained on a database of real CVs. These labels are represented in Figure 4 for simplicity as "Label 1" 423, "Label 2" 425, "Label 3" 427 and "Label 4" 429 respectively.
Once an extracted portion of text 413, 415, 417, 419 is labelled, detailed information may optionally be extracted from the text data, in step 214, by processing the text associated with each label. Information may be extracted by inspecting the text in each block of text line-by-line and sentence-by-sentence. The exact method of this extraction is dependent on the section label and the desired end goal for the processing.
For example, in a 'work experience' section, each instance of experience will often be accompanied by start and end dates. Regular expressions can be used to identify any dates and a search can be conducted on lines near these dates for organization names and job titles. Similarly, in an 'education' section, each educational establishment a person attends will often be accompanied by dates. Regular expressions are therefore applied to identify any dates and then search nearby lines for educational establishments and qualifications. In a further example, 'References' sections tend not to include dates, but it may be important to extract the contact details of any referees. These can then be used to contact the referees as part of the recruitment process. These elements can be extracted with a rules-based approach by searching for special symbols or patterns of characters (such as '@', and numbers).
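By way of illustration, a regular expression of the following kind might be used to locate dates so that nearby lines can then be searched; the pattern is an assumption covering some common CV date formats only:

```python
# Illustrative sketch: find lines containing dates in a section's text,
# as a starting point for a rules-based extraction of nearby details.
import re

DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{4}"
    r"|\b\d{1,2}/(?:19|20)\d{2}\b"
    r"|\b(?:19|20)\d{2}\b")

def lines_with_dates(section_text: str):
    """Return (line number, line) pairs for lines that mention a date."""
    return [(i, line) for i, line in enumerate(section_text.splitlines())
            if DATE_RE.search(line)]
```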
Having identified sections of the CV 100 and assigned labels to each section, this information may be stored in the form of data in a memory 504 of the processing system 500, e.g. in a database in which text portions are associated with the respective labels. In some embodiments, no further processing is performed and the portions of text may simply be output or stored, e.g. for viewing by a human user. However, where analysis of the text data has been performed, any further relevant information associated with each section may also be stored.
Figure 5 shows an exemplary processing system 500 arranged to perform the method shown in Figure 2 or other methods disclosed herein. The processing system 500 comprises a processor 502 and a memory 504. The memory 504 stores instructions in the form of software 508, which can be executed by the processor 502 in order to perform the method. The memory 504 also stores a plurality of text-based documents 506, such as CVs, for example in the form of a database. The text-based documents 506 can be processed by the processor 502 when executing software 508 to identify blocks of text within the text-based documents 506 using image-based processing, then to identify a respective portion of text data of the text-based document 506 corresponding to each respective block, and to determine a label for the block of text.
The text data associated with each label can optionally be processed to extract detailed information in the form of individual entries, such as an employer and an associated date, and the processed data may then be stored in a database 510 in the memory 504, arranged to store processed data.
It will be appreciated by those skilled in the art that the invention has been illustrated by describing one or more specific embodiments thereof, but is not limited to these embodiments; many variations and modifications are possible, within the scope of the accompanying claims.

Claims (20)

  1. A computer-implemented method of processing a text-based document, the method comprising: receiving data representative of a text-based document; applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identifying a respective portion of text data encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label for the block of text.
  2. The method of claim 1, wherein the image-based object detection process determines a respective boundary around each block of text.
  3. The method of claim 2, wherein the boundaries are rectangular bounding boxes.
  4. The method of any preceding claim, comprising receiving the text-based document as a Portable Document Format (PDF) or PostScript (PS) file.
  5. The method of any preceding claim, comprising generating the image representation from text data and layout data in the received data representative of the text-based document.
  6. The method of any of claims 1 to 4, comprising generating text data and layout data from image data in the received data representative of the text-based document using an optical character recognition (OCR) process.
  7. The method of any preceding claim, wherein the image-based object detection process uses a trained classifier to identify the plurality of blocks of text in the image.
  8. The method of claim 7, wherein the trained classifier comprises a convolutional neural network.
  9. The method of any preceding claim, wherein processing each identified portion of the text data to determine a label for the respective block of text comprises identifying a respective heading of the block of text and processing the heading to determine the label for the respective block of text.
  10. The method of any preceding claim, wherein processing each identified portion of the text data to determine a label for the block of text comprises applying a naive Bayes classifier to the respective portion of text data.
  11. The method of any preceding claim, comprising further processing one or more of the identified portions of text data to extract information from the portion of text data.
  12. The method of claim 11, comprising using a regular expression to extract information from one or more of the identified portions of text data.
  13. The method of any preceding claim, wherein the image-based object detection process comprises applying a lossy compression operation to the image representation of the text-based document prior to identifying the plurality of blocks of text in the image.
  14. The method of claim 13, wherein the image-based object detection process uses a trained classifier to identify the plurality of blocks of text in the image, and wherein the trained classifier has been trained on training data comprising image representations of a plurality of text-based documents to which said lossy compression operation has been applied.
  15. The method of claim 13 or 14, wherein the lossy compression operation comprises downsampling.
  16. The method of any of claims 13 to 15, wherein the lossy compression operation comprises applying a blurring filter.
  17. The method of any preceding claim, wherein the data representative of the text-based document is unstructured data.
  18. The method of any preceding claim, wherein the text-based document is a résumé or curriculum vitae.
  19. Computer software which, when executed on a processing system, causes the processing system to perform the method of any preceding claim.
  20. A processing system comprising a processor and a memory storing a computer program comprising instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1 to 18.
Application GB2118781.0A, filed 2021-12-22 (priority date 2021-12-22): Text-based document processing. Published as GB2615736A (en); legal status: Pending.

Priority Applications (1)

Application Number: GB2118781.0A
Priority Date: 2021-12-22
Filing Date: 2021-12-22
Title: Text-based document processing
Publication: GB2615736A (en)


Publications (1)

Publication Number: GB2615736A (en)
Publication Date: 2023-08-23

Family

ID=87430936

Family Applications (1)

Application Number: GB2118781.0A (published as GB2615736A (en))
Priority Date: 2021-12-22
Filing Date: 2021-12-22
Title: Text-based document processing

Country Status (1)

Country: GB; Link: GB2615736A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
US5164899A (en) * Priority 1989-05-01, published 1992-11-17, Resumix, Inc.: Method and apparatus for computer understanding and manipulation of minimally formatted text documents
US20200210695A1 (en) * Priority 2019-01-02, published 2020-07-02, Capital One Services, LLC: Utilizing optical character recognition (OCR) to remove biasing
US20210374398A1 (en) * Priority 2020-05-29, published 2021-12-02, Microsoft Technology Licensing, LLC: Constructing a computer-implemented semantic document
