GB2615736A - Text-based document processing - Google Patents

Text-based document processing

Info

Publication number
GB2615736A
Authority
GB
United Kingdom
Prior art keywords: text, image, data, document, block
Prior art date
Legal status
Pending
Application number
GB2118781.0A
Inventor
Meijer Bernet
Wood Lewis
Jew Luke
Satti Riham
Doraiswamy Vivek
Lamplough Nathan
Current Assignee
Oxiway Ltd
Original Assignee
Oxiway Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2023-08-23
Application filed by Oxiway Ltd filed Critical Oxiway Ltd
Priority to GB2118781.0A
Publication of GB2615736A

Classifications

    • G06F40/30 Semantic analysis
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G06F40/216 Parsing using statistical methods
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Abstract

A computer-implemented method of processing a text-based document 400 comprises receiving data representative of a text-based document and applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text (103-109, fig 1) in the image. Each block contains a plurality of words. The method further comprises, for each of the identified blocks of text, identifying a respective portion of text data 413-419 encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label 423-429 for the block of text. The text-based document may be a CV or resume and the method may support accurate classification of its sections. Object detection may involve determining boundaries around each block 403-409 and/or applying a lossy compression operation (downsampling or blurring, fig 3) to the image representation prior to identifying the blocks.

Description

Text-Based Document Processing
BACKGROUND OF THE INVENTION
This invention relates to methods, systems and software for processing text-based documents.
There are many types of text-based document for which different examples have similar content but do not share a common, well-defined structure. For example, résumés and curricula vitae (CVs) typically contain standard content, such as academic qualifications and employment history, but exhibit a great diversity in their layout, ordering, design and typography. This lack of consistent structure makes it challenging for computer-based methods to correctly classify the text within such documents.
The present invention seeks to provide an improved approach to processing text-based documents that can allow for more accurate classification.
SUMMARY OF THE INVENTION
From a first aspect, the invention provides a computer-implemented method of processing a text-based document, the method comprising: receiving data representative of a text-based document; applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identifying a respective portion of text data encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label for the block of text.
From a second aspect, the invention provides computer software which, when executed on a processing system, causes the processing system to: receive data representative of a text-based document; apply an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identify a respective portion of text data encoding one or more words contained in the respective block, and process the identified portion of text data to determine a label for the block of text.
From a further aspect, the invention provides a processing system comprising a processor and a memory storing a computer program comprising instructions which, when executed by the processor, cause the processor to: receive data representative of a text-based document; apply an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identify a respective portion of text data encoding one or more words contained in the respective block, and process the identified portion of text data to determine a label for the block of text.
Thus it will be seen that, in accordance with embodiments of the invention, image processing is first used to identify one or more blocks of text (i.e. sections of text) from an image representation of the document, and then text-based processing (e.g. semantic analysis) is used to classify each of the blocks based on its textual content. This can enable the text of a block to be labelled more accurately because all the text is likely to relate to the same subject (e.g. "hobbies") by virtue of its physical proximity within the document, as determined from the image representation.
The blocks of text within the document may be separated by regions in which no text is present, e.g. white space. The blocks of text may therefore be well separated, making it easier to define boundaries around them. The image-based object detection process may include determining a respective boundary for each of the identified blocks of text.
This may allow the portion of the text data associated with each block of text identified in the image to be more accurately determined. Each boundary may be a closed path. Each may be polygonal. The boundaries may be rectangular, although in some embodiments at least one boundary may comprise more than four edges; this may be useful for segmenting documents having complex layouts where blocks of text cannot be separated using only rectangular boxes. In some embodiments, the image-based object detection process may determine a respective polygonal boundary (e.g. rectangular bounding box) around each of the blocks of text. Coordinates for vertices of the boundary may be determined. These coordinates may be used to determine (e.g. identify and/or extract from the text-based document) the respective portion of text data, e.g. by processing layout data associated with text data for the document.
The plurality of blocks identified by the image-based object detection process may, in some embodiments and for at least some documents, collectively contain all the text of the document. The blocks are preferably non-overlapping.
In certain document types, text-based content may be presented in the form of one or more horizontal or vertical lines of text, each line comprising characters grouped into one or more words. In some embodiments, the image-based object detection process is configured such that each block of text contains only complete lines (e.g. one or more complete lines). However, some embodiments may be able to identify blocks that contain one or more partial lines.
Receiving the data representative of the document may comprise reading the data from a memory, or receiving the data over a software interface (e.g. an API) or a physical interface (e.g. a network connection). The document data may be received as one or more files or data streams.
The received data may comprise any one or more of: image data, text data, and layout data. In some embodiments, it comprises the image representation of the text-based document and also comprises text data encoding some or all of the textual content of the text-based document, optionally with layout data. However, this is not necessary in all embodiments. Some methods may comprise generating the image representation from received text data and layout data, e.g. by processing text data and layout data to generate an image (e.g. one or more bitmap images) of the document. Other methods may comprise generating text data (optionally with layout data) from received image data, e.g. from a or the received image representation of the text-based document.
The text-based document may be received as a Portable Document Format (PDF) file or a PostScript (PS) file, or as a word-processing file, e.g. a Microsoft Word file or an Open Document Format for Office Applications (ODF) file.
In some embodiments, the text data may be stored as data in a text file, e.g., in the form of ASCII or Unicode characters. Layout information defining the position of text within the document (e.g. a location for each character or word on an identified page of the document) may be stored separately, e.g. in a separate file, or may be received in a common file with the text data.
The image representation of the text-based document may comprise raster or vector image data. It may be encoded in any appropriate manner. It may comprise one or more bitmaps. It may be compressed. It may be stored as one or more files. It may be encoded within a Portable Document Format (PDF) file. It may be received as input to the system, or it may be generated from received text data. In some embodiments, the image representation may be received in addition to the text data of the text-based document, embedded within the same file or as a separate file.
In some embodiments, the text-based document may be received as image data without containing text data, e.g. if the text-based document is received in the form of a scan or photograph of a paper document. In such embodiments, methods may comprise generating text data encoding textual content of the text-based document. This may be achieved, for example, by applying optical character recognition (OCR) to the image representation of the text-based document to generate the text data. Thus, in some embodiments, the text data is generated from an image using an optical character recognition (OCR) process.
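By way of non-limiting illustration, such an OCR step might be implemented with the open-source Tesseract engine via its pytesseract wrapper. This is a minimal sketch of one possible implementation; the disclosure does not prescribe a particular OCR engine:

```python
# Illustrative sketch only: generating text data from an image representation
# of a scanned document using OCR. The choice of pytesseract/Tesseract is an
# assumption, not a requirement of the method.
from PIL import Image
import pytesseract

def generate_text_data(image_path: str) -> str:
    """Apply OCR to a page image to recover its textual content."""
    return pytesseract.image_to_string(Image.open(image_path))
```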
In some embodiments, the image-based object detection process may comprise applying a trained machine learning algorithm to identify and/or classify objects, including the plurality of blocks of text, within the image representation. The machine learning algorithm may have been trained on training data. The training data may comprise image representations of a plurality of text-based documents. These documents may be of a same type as the received text-based document (e.g. all being CVs). The training data may include data that identifies blocks of text in the images, which may have been identified by a human. The machine learning algorithm may be trained using supervised or unsupervised learning techniques. The image-based object detection process may use a trained classifier, such as a convolutional neural network (CNN), to identify the plurality of blocks of text.
When determining a label for an identified portion of text data corresponding to a respective block of text, the text data may be processed using any appropriate method to determine information about its contents. In some embodiments, the text data may be processed to identify a heading, e.g. by identifying the initial words before a first line break, or identifying the initial words before a reduction in font size. The text data may be processed by applying one or more regular expressions to the portion of text data. This may allow one or more predetermined strings (e.g. a heading) to be identified within the text data. A label may be assigned in dependence on the identified string or strings. The string itself may be assigned as the label, or a mapping may be used to determine the label. For example, if the string "Personal Skills" is identified, this may then be determined to be equivalent, in mapping data, to a predetermined string "Skills". The predetermined string may thus be assigned as the label for the block of text in which the text data is contained.
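The following minimal sketch illustrates one way such heading identification and mapping might be implemented; the regular expression and the mapping entries are illustrative assumptions only, not a prescribed or exhaustive mapping:

```python
# Illustrative sketch: take the initial words before the first line break as
# a heading, then map the heading to a predetermined label via mapping data.
# HEADING_TO_LABEL is a hypothetical example mapping.
import re

HEADING_TO_LABEL = {
    "personal skills": "Skills",
    "skills": "Skills",
    "career summary": "Work Experience",
}

def label_from_heading(block_text: str):
    """Return a predetermined label for the block's heading, or None."""
    match = re.match(r"^(.+?)(?:\r?\n|$)", block_text.strip())
    heading = match.group(1).strip().lower() if match else ""
    return HEADING_TO_LABEL.get(heading)
```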
In some embodiments, determining a label for a portion of text data may comprise applying a naive Bayes classifier to the portion of text data. Words within the text data may be identified, and input to the naive Bayes classifier, which may be configured (e.g. trained) to process each word within the text data independently of any other word. An appropriate label may be determined based on the frequency with which certain words appear within the portion of text data. Thus, in some embodiments, processing the identified portion of text data may comprise applying a trained naive Bayes classifier to the portion of text data. The naive Bayes classifier may have been trained on training data which may comprise portions of text data extracted from text-based documents of a same type as the received text-based document (e.g. CVs). The training data may include labels for portions of text data, which may have been labelled by a human.
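One possible realisation of such a classifier is sketched below, under the assumption that scikit-learn is used and that a corpus of human-labelled CV sections is available for training; neither assumption is mandated by this disclosure:

```python
# Illustrative sketch: a naive Bayes section classifier operating on word
# frequencies, with each word treated independently of any other word.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_section_classifier(section_texts, section_labels):
    """Train on labelled text portions extracted from documents of the
    same type (e.g. CVs); returns a fitted text -> label classifier."""
    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(section_texts, section_labels)
    return classifier

# Example use:
#   clf = train_section_classifier(train_texts, train_labels)
#   label = clf.predict([portion_of_text])[0]   # e.g. "Skills"
```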
The identified portion of data may be further processed, e.g. to extract information from the text data. Information may be extracted by inspecting the text in each block line-by-line and/or sentence-by-sentence. One or more regular expressions may be applied to identify numerical data, such as dates, within the text data.
At least for some document types, the image-based object detection process may comprise applying a lossy compression operation, such as downsampling (i.e. reducing the image resolution) or blurring, to the image prior to the object detection process being performed. Such lossy compression may advantageously remove some or all personally-identifying information from the image (e.g. from the text content represented in the image). This may help support privacy and/or blind recruiting policies.
Compressing the image may also advantageously improve efficiency by reducing storage and/or processing requirements of the system.
The same or a different lossy compression operation may be applied (or have been applied) to training data used to train the image-based object detection process. In addition to anonymising the data, this may, in some embodiments, usefully bias the detection process towards considering larger layout features of the document, rather than finer details which may be less relevant to effective segmentation. In particular, it may reduce or remove language-specific features from the image, thereby enabling image data to be usefully included in the training dataset regardless of its language or content (e.g. even if it is in a different language from the text-based document processed during inference). The image-based object detection process may thus be language-agnostic in some embodiments.
In some embodiments, the image-based object detection process may comprise applying a downsampling operation to the image representation prior to identifying a plurality of blocks of text in the image. The downsampling may comprise reducing a spatial resolution of the image in one or preferably two dimensions. It may comprise combining (e.g. averaging) adjacent pixels into fewer, larger pixels (e.g. each equivalent to 3x3 original pixels).
In some embodiments, the image-based object detection process may additionally or alternatively comprise applying a blurring filter to the image representation prior to identifying a plurality of blocks of text in the image. Blurring of the image may be achieved by applying a two-dimensional convolution function to the image. In some embodiments, applying the blurring filter to the image comprises applying a Gaussian blur to the image. It may also comprise adding white noise to the image, which may help ensure that small-scale structure of the image, already attenuated by the blur, is lost. The properties of the Gaussian blur function may be selected to optimise the results of the object detection process. For example, the full width at half maximum (FWHM) of the Gaussian blur may be selected to correspond to 3 pixels in the image.
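A minimal sketch of such a lossy compression operation follows, assuming OpenCV and illustrative parameter values; note that a Gaussian sigma of about 1.3 pixels corresponds to a FWHM of roughly 3 pixels, since FWHM = 2 * sqrt(2 * ln 2) * sigma (about 2.355 * sigma):

```python
# Illustrative sketch: downsample then Gaussian-blur an image representation
# prior to object detection. The target size and sigma are assumptions.
import cv2

def lossy_compress(image, target_long_edge=320, sigma=1.3):
    h, w = image.shape[:2]
    scale = target_long_edge / max(h, w)
    # INTER_AREA averages adjacent pixels, i.e. combines them into fewer,
    # larger pixels.
    small = cv2.resize(image, (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_AREA)
    # Kernel size (0, 0) lets OpenCV derive it from sigma; sigma ~1.3 px
    # gives a FWHM of ~3 px.
    return cv2.GaussianBlur(small, (0, 0), sigma)
```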
The document may comprise one or more pages. It may be a portion of a larger document or file.
The document may be any type of document. However, in a preferred set of embodiments the data representative of the document is unstructured, but the document is of a type that can be labelled with a set of predetermined labels. In some preferred embodiments the document is a résumé or curriculum vitae (CV). Methods disclosed herein are particularly well suited for labelling blocks of text within such unstructured documents.
Features of any aspect or embodiment described herein may, wherever appropriate, be applied to any other aspect or embodiment described herein. Where reference is made to different embodiments or sets of embodiments, it should be understood that these are not necessarily distinct but may overlap.
BRIEF DESCRIPTION OF THE DRAWINGS
Certain preferred embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is an image of a CV that may be processed using a method embodying the invention;
FIG. 2 is a flow diagram illustrating steps of a method, embodying the invention, for extracting relevant content from the CV;
FIG. 3 is an image of the CV after being downsampled according to a method embodying the invention;
FIG. 4 is a schematic diagram of data involved in a process of labelling a CV in accordance with a method embodying the invention; and
FIG. 5 is a schematic diagram of a processing system embodying the invention.
DETAILED DESCRIPTION
Figure 1 shows an exemplary simplified curriculum vitae (CV) 100 of a job applicant that may be processed using the method of the present disclosure. In this example the text-based document 100 is a CV; however, it will be appreciated that the methods described herein are also applicable to other types of text-based document, such as newspaper or scientific articles, administrative forms, novels, etc. The CV 100 comprises the name 101 of the person to whom the CV 100 relates, an image 102 of the applicant, and a number of blocks of text, here exemplified as four sections 103-109, each section comprising a respective section heading 103a, 105a, 107a, 109a and associated body text 103b, 105b, 107b, 109b. A first section heading 103a is part of a first section relating to the "personal details" of the applicant, such as their address, age, contact details, etc. These details are provided in the associated body text 103b. A second section heading 105a is part of a second section relating to a "career summary" of the applicant. Details of the applicant's career history, e.g. places of work and start and end dates, are provided in the associated body text 105b. A third section heading 107a is part of a section relating to "awards and achievements" of the applicant, and the associated body text 107b provides details on awards won by the applicant. A final fourth section heading 109a is part of a section relating to "personal skills" of the applicant, which are described in detail in the associated body text 109b.
Although the example shown in Figure 1 contains four sections of text, in addition to the name field 101, in some embodiments a text-based document could instead have fewer sections, e.g. one, two or three sections, or more sections, e.g. five, six, seven, eight or more sections. It could have tens or hundreds of sections, each of which may have a respective section heading and associated body text. The headings and body text thus do not need to match those shown in Figure 1, which is provided for the purposes of illustration only.
During a recruitment process, recruiters are typically required to process large numbers (e.g. hundreds or thousands) of CVs. Automated classification and extraction of textual information from such documents for review or comparison has the potential to improve the speed at which these documents can be processed and sent to humans or CV analysis software for closer review.
Methods for information extraction from text-based documents, such as CV 100, are described in the following in accordance with the present disclosure, with reference to Figures 2 to 5.
The methods disclosed herein break a text-based document into sections, and perform analysis of the text in each section to determine what category of information is contained in each section. In a first stage, sections of the text-based document are identified by analysing the structure of the document using image-based object detection, i.e. without any textual semantic analysis of the document. Computer vision methods, such as semantic image segmentation and object detection, are used to identify sections of text based on image analysis. In a second stage, the textual content of each of the identified sections of text is then processed to label each section, so as to allow for more efficient extraction of information from the CV 100 for further processing by a computer or a human.
Figure 2 is a flow diagram illustrating a method for information extraction from text-based documents such as the CV 100 shown in Figure 1 according to an embodiment of the present disclosure.
In a first step 200, the CV 100 is received by a processing system 500 (described in more detail below in relation to Figure 5) as a Portable Document Format (PDF) file, which may encapsulate a complete description of the CV 100, including the text, fonts, layout information for the text, as well as any vector graphics, raster images and other information needed to render the document. Other file formats may also be supported.
In step 202, an image of the CV 100 is generated, for example as bitmap image data, using conventional PDF rendering techniques. In some embodiments, the PDF file may already contain an image representation of the document (e.g. if the document was scanned from a paper version).
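A minimal sketch of this rendering step follows, assuming a recent version of the PyMuPDF library is used; the disclosure does not prescribe a particular PDF renderer:

```python
# Illustrative sketch: rasterise each page of a received PDF to a bitmap
# image, as one possible implementation of step 202.
import fitz  # PyMuPDF

def render_pdf_pages(pdf_path: str, dpi: int = 200):
    """Return one pixmap (bitmap image) per page of the document."""
    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi) for page in doc]
```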
In step 204, a downsampling (and/or blurring) operation is optionally applied to the image to generate a lossily-compressed image 300, as shown in Figure 3. For example, an image that is 2350 pixels on the longest edge may be reduced to 320 pixels on the longest edge. The present inventors have recognised that by irreversibly compressing the image at this stage, the efficiency of the image segmentation process described in the following steps may be improved, particularly due to a reduction in file size, at least for some types of document and some types of image-based object detection processes, without significantly reducing detection performance.
An image-based object detection process is then performed on the original image or on the compressed image 300 in step 206 to detect blocks of text within the CV 100, i.e. to identify the four sections 103, 105, 107 and 109. This process may use a trained object-detection process based on deep-learning.
In some embodiments, a deep-learning algorithm is used to detect the sections 103, 105, 107 and 109 within the compressed image 300 and to associate respective sets of pixels (referred to herein as annotations) with each of the blocks of text of the four sections 103, 105, 107, and 109. Each annotation defines a set of pixels, which may form a single contiguous region (referred to herein as a mask), where a section object is identified by the process, with a different value being assigned to each section (i.e. to each block of text) on each page of the text-based document. The annotations thus define a plurality of masks overlaying the sections 103, 105, 107, 109.
Figure 3 shows an example of four masks 303, 305, 307, 309, having closed boundary paths, overlaid on the compressed image 300. The masks 303, 305, 307 and 309 contain the blocks of text within the four sections 103, 105, 107 and 109, and identify the position of these sections within the compressed image 300, and hence within the CV 100.
In one embodiment, the section detection process is performed using the Mask-RCNN R50-FPN instance segmentation model of Facebook AI Research's library DETECTRON2, pre-trained on the COCO dataset. The Mask-RCNN framework enables object detection on two levels: bounding boxes can first be predicted around objects (such as sections 103, 105, 107 and 109), and instance segmentation can subsequently be performed, in which each instance of an object is provided with a mask 303, 305, 307, 309. Although the DETECTRON2 library is used in the embodiment described in relation to Figure 2, it will be appreciated that other object detection algorithms may be employed in some embodiments, such as those built using TENSORFLOW.
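The following sketch shows how the Mask-RCNN R50-FPN model might be configured with DETECTRON2 for this purpose; the single "text section" class, the input file name and the fine-tuned weights file are hypothetical placeholders:

```python
# Illustrative sketch: section detection with DETECTRON2's Mask R-CNN
# R50-FPN model, pre-trained on COCO. "cv_sections.pth" and "page.png"
# are hypothetical placeholders.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # a single "text section" class
cfg.MODEL.WEIGHTS = "cv_sections.pth"    # hypothetical fine-tuned weights
predictor = DefaultPredictor(cfg)

compressed_image = cv2.imread("page.png")  # BGR numpy array of the page
outputs = predictor(compressed_image)
instances = outputs["instances"]
boxes = instances.pred_boxes             # one bounding box per section
masks = instances.pred_masks             # one pixel mask per section
```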
Having identified broad areas in which blocks of text are present (i.e. within the masks 303, 305, 307 and 309), a rectangular bounding box is next created, using the Mask-RCNN, for each detected section within the compressed image 300 in step 208; these bounding boxes are used for extracting corresponding text data from the PDF file. Examples of such bounding boxes are shown in Figure 3 by the dashed boxes 304, 306, 308 and 310 formed around each block of text.
In step 210, the bounding boxes 304, 306, 308 and 310 identified using the compressed image 300 of the CV 100 (i.e. within the image file) are used to extract, from the PDF file, the text data contained within each of the bounding boxes 304, 306, 308 and 310 (i.e. each identified section). This may be achieved by determining the coordinates of the vertices of the bounding boxes 304, 306, 308, 310 within the image file and processing layout data in the PDF version of the text-based document to determine what characters fall wholly or partly within the bounding box.
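A minimal sketch of this step follows, assuming PyMuPDF is used to read the PDF's layout data; the image-to-page coordinate scaling assumes the image was rendered from the same page:

```python
# Illustrative sketch: collect the words of a PDF page whose rectangles fall
# wholly or partly within an image-space bounding box.
import fitz  # PyMuPDF

def text_in_box(page, box, image_width, image_height):
    """box is (x0, y0, x1, y1) in pixel coordinates of the rendered image."""
    sx = page.rect.width / image_width    # image -> page scale factors
    sy = page.rect.height / image_height
    x0, y0, x1, y1 = box
    rect = fitz.Rect(x0 * sx, y0 * sy, x1 * sx, y1 * sy)
    # page.get_text("words") yields (x0, y0, x1, y1, word, ...) tuples.
    return " ".join(w[4] for w in page.get_text("words")
                    if fitz.Rect(w[:4]).intersects(rect))
```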
Figure 4 illustrates four exemplary bounding boxes 403, 405, 407, 409 for a text-based document 400, corresponding to different sections of the document 400 identified by the object-detection process. The textual content present within each respective bounding box is extracted as four portions 413, 415, 417, 419 of text data from the document data, in order to be classified and optionally further processed in subsequent steps.
If the PDF does not encode text data, or if the document is provided only as image data (e.g. as a TIFF image), the masks or bounding boxes may be used to determine respective regions of the image to be input, individually, to an optical character recognition (OCR) process, to determine the contents of each isolated section.
Having extracted the text associated with each section 103, 105, 107, 109 of the CV 100 or document 400, the text data of each extracted text portion 413, 415, 417, 419 is analysed to determine an appropriate label 423, 425, 427, 429 relating to the classification of each section of text, e.g. 'education', 'experience', 'skills', 'hobbies', 'personal statement' etc. in the case of a CV 100.
In some approaches, the purpose of each section may be determined based solely on a heading of the section, e.g. by identifying any section heading 103a, 105a, 107a, 109a that is present and using a mapping of all possible section headings to a predetermined set of headings. For example: the section headings 'Education History' and 'Qualifications' would both map to a purpose of 'Education'. Similarly, the section headings 'Prior Employment' and 'Work Experience' would both map to an 'Employment' section. However, such an approach is not particularly robust, and has several disadvantages. Firstly, it can be difficult to distinguish section headers 103a, 105a, 107a, 109a from body text 103b, 105b, 107b, 109b. Secondly, sometimes there are no section headings present at all. Thirdly, there is a large variety of possible section headers that people could use and it is therefore difficult to build a complete mapping from possible headings to purposes.
Instead, in some methods disclosed herein, an alternative, more robust approach is taken. Rather than analysing headings, a classifier is applied to each extracted text portion 413, 415, 417, 419 to determine an appropriate label for the portion based on all of the text data within said portion.
In some embodiments, a naive Bayes classifier is used to determine an appropriate label for each extracted text portion 413, 415, 417, 419. Naive Bayes classifiers are probabilistic classifiers that make use of Bayes' rule and assume independence of the features (i.e. each word within the block of text). By treating the text as a collection of words, a naive Bayes classifier can be trained to predict an appropriate section purpose label based on the frequencies with which different words appear in a section. Possible labels that may be predicted by the naive Bayes classifier, when trained on a database of CVs, include, for example: "Personal Information", "Personal Statement", "Work Experience", "Education", "Skills", "Professional Memberships", "References" etc. Thus, for each extracted text portion 413, 415, 417, 419, a respective label 423, 425, 427, 429 is determined using a naive Bayes classifier that has been trained on a database of real CVs. These labels are represented in Figure 4 for simplicity as "Label 1" 423, "Label 2" 425, "Label 3" 427 and "Label 4" 429 respectively.
Once an extracted portion of text 413, 415, 417, 419 is labelled, detailed information may optionally be extracted from the text data, in step 214, by processing the text associated with each label. Information may be extracted by inspecting the text in each block of text line-by-line and sentence-by-sentence. The exact method of this extraction is dependent on the section label and the desired end goal for the processing.
For example, in a 'work experience' section, each instance of experience will often be accompanied by start and end dates. Regular expressions can be used to identify any dates and a search can be conducted on lines near these dates for organization names and job titles. Similarly, in an 'education' section, each educational establishment a person attends will often be accompanied by dates. Regular expressions are therefore applied to identify any dates and then search nearby lines for educational establishments and qualifications. In a further example, 'References' sections tend not to include dates, but it may be important to extract the contact details of any referees. These can then be used to contact the referees as part of the recruitment process. These elements can be extracted with a rules-based approach by searching for special symbols or patterns of characters (such as '@', and numbers).
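By way of illustration, a regular expression of the following kind might be used to locate dates so that nearby lines can then be searched; the pattern is an assumption covering some common CV date formats only:

```python
# Illustrative sketch: find lines containing dates in a section's text,
# as a starting point for a rules-based extraction of nearby details.
import re

DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{4}"
    r"|\b\d{1,2}/(?:19|20)\d{2}\b"
    r"|\b(?:19|20)\d{2}\b")

def lines_with_dates(section_text: str):
    """Return (line number, line) pairs for lines that mention a date."""
    return [(i, line) for i, line in enumerate(section_text.splitlines())
            if DATE_RE.search(line)]
```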
Having identified sections of the CV 100 and assigned labels to each section, this information may be stored in the form of data in a memory 504 of the processing system 500, e.g. in a database in which text portions are associated with the respective labels. In some embodiments, no further processing is performed and the portions of text may simply be output or stored, e.g. for viewing by a human user. However, where analysis of the text data has been performed, any further relevant information associated with each section may also be stored.
Figure 5 shows an exemplary processing system 500 arranged to perform the method shown in Figure 2 or other methods disclosed herein. The processing system 500 comprises a processor 502 and a memory 504. The memory 504 stores instructions in the form of software 508, which can be executed by the processor 502 in order to perform the method. The memory 504 also stores a plurality of text-based documents 506, such as CVs, for example in the form of a database. The text-based documents 506 can be processed by the processor 502 when executing software 508 to identify blocks of text within the text-based documents 506 using image-based processing, then to identify a respective portion of text data of the text-based document 506 corresponding to each respective block, and to determine a label for the block of text.
The text data associated with each label can optionally be processed to extract detailed information in the form of individual entries, such as an employer and an associated date, and the processed data may then be stored in a database 510 in the memory 504, arranged to store processed data.
It will be appreciated by those skilled in the art that the invention has been illustrated by describing one or more specific embodiments thereof, but is not limited to these embodiments; many variations and modifications are possible, within the scope of the accompanying claims.

Claims (20)

  1. A computer-implemented method of processing a text-based document, the method comprising: receiving data representative of a text-based document; applying an image-based object detection process to an image representation of the text-based document to identify a plurality of blocks of text in the image, each block containing a plurality of words; and for each of the identified blocks of text, identifying a respective portion of text data encoding one or more words contained in the respective block, and processing the identified portion of text data to determine a label for the block of text.
  2. The method of claim 1, wherein the image-based object detection process determines a respective boundary around each block of text.
  3. The method of claim 2, wherein the boundaries are rectangular bounding boxes.
  4. The method of any preceding claim, comprising receiving the text-based document as a Portable Document Format (PDF) or PostScript (PS) file.
  5. The method of any preceding claim, comprising generating the image representation from text data and layout data in the received data representative of the text-based document.
  6. The method of any of claims 1 to 4, comprising generating text data and layout data from image data in the received data representative of the text-based document using an optical character recognition (OCR) process.
  7. The method of any preceding claim, wherein the image-based object detection process uses a trained classifier to identify the plurality of blocks of text in the image.
  8. The method of claim 7, wherein the trained classifier comprises a convolutional neural network.
  9. The method of any preceding claim, wherein processing each identified portion of the text data to determine a label for the respective block of text comprises identifying a respective heading of the block of text and processing the heading to determine the label for the respective block of text.
  10. The method of any preceding claim, wherein processing each identified portion of the text data to determine a label for the block of text comprises applying a naive Bayes classifier to the respective portion of text data.
  11. The method of any preceding claim, comprising further processing one or more of the identified portions of text data to extract information from the portion of text data.
  12. The method of claim 11, comprising using a regular expression to extract information from one or more of the identified portions of text data.
  13. The method of any preceding claim, wherein the image-based object detection process comprises applying a lossy compression operation to the image representation of the text-based document prior to identifying the plurality of blocks of text in the image.
  14. The method of claim 13, wherein the image-based object detection process uses a trained classifier to identify the plurality of blocks of text in the image, and wherein the trained classifier has been trained on training data comprising image representations of a plurality of text-based documents to which said lossy compression operation has been applied.
  15. The method of claim 13 or 14, wherein the lossy compression operation comprises downsampling.
  16. The method of any of claims 13 to 15, wherein the lossy compression operation comprises applying a blurring filter.
  17. The method of any preceding claim, wherein the data representative of the text-based document is unstructured data.
  18. The method of any preceding claim, wherein the text-based document is a résumé or curriculum vitae.
  19. Computer software which, when executed on a processing system, causes the processing system to perform the method of any preceding claim.
  20. A processing system comprising a processor and a memory storing a computer program comprising instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1 to 18.
Application GB2118781.0A, filed 2021-12-22 (priority date 2021-12-22): Text-based document processing. Published as GB2615736A (en); legal status: Pending.

Priority Applications (1)

Application Number: GB2118781.0A
Priority Date: 2021-12-22
Filing Date: 2021-12-22
Title: Text-based document processing
Publication: GB2615736A (en)


Publications (1)

Publication Number: GB2615736A (en)
Publication Date: 2023-08-23

Family

ID=87430936

Family Applications (1)

Application Number: GB2118781.0A (published as GB2615736A (en))
Priority Date: 2021-12-22
Filing Date: 2021-12-22
Title: Text-based document processing

Country Status (1)

Country: GB; Link: GB2615736A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
US5164899A (en) * Priority 1989-05-01, published 1992-11-17, Resumix, Inc.: Method and apparatus for computer understanding and manipulation of minimally formatted text documents
US20200210695A1 (en) * Priority 2019-01-02, published 2020-07-02, Capital One Services, LLC: Utilizing optical character recognition (OCR) to remove biasing
US20210374398A1 (en) * Priority 2020-05-29, published 2021-12-02, Microsoft Technology Licensing, LLC: Constructing a computer-implemented semantic document
