WO2022241190A2 - Machine learning-based systems and methods for extracting information from pathology reports - Google Patents

Machine learning-based systems and methods for extracting information from pathology reports

Info

Publication number
WO2022241190A2
Authority
WO
WIPO (PCT)
Prior art keywords
histology
site
machine learning
transformer
learning model
Prior art date
Application number
PCT/US2022/029142
Other languages
French (fr)
Other versions
WO2022241190A3 (en)
Inventor
Ross Mitchell
Original Assignee
H. Lee Moffitt Cancer Center And Research Institute, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by H. Lee Moffitt Cancer Center And Research Institute, Inc.
Publication of WO2022241190A2 publication Critical patent/WO2022241190A2/en
Publication of WO2022241190A3 publication Critical patent/WO2022241190A3/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H 10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • BERT and related architectures have also inspired multiple medical applications, including: processing of electronic health records (EHR)[23,24]; outcome prediction [25-27]; identification of medical terms and concepts[28]; medical chatbots[29]; sentiment analysis[30]; recommender systems[31]; and others.
  • BERT has been applied to free-text pathology reports [32].
  • This study focused on classification of text into only a few cancer-related categories including: afflicted organ (15 organ groups); disease type (non-cancer, pre-malignant, or cancer); cancer reason (6 histology groups); and presence of metastatic disease (no, yes: in lymph nodes, and yes: in non-lymph node tissue).
  • the computing device is configured to provide a disease diagnosis, provide a treatment option, identify a patient for a clinical trial, or monitor adherence to a treatment pathway.
  • the site description and the histology description are received from the transformer-based machine learning model.
  • the pathology report is a free-text pathology report.
  • the method includes receiving a pathology report, inputting the pathology report into a transformer-based machine learning model, and extracting, using the transformer-based machine learning model, at least one of a site description or a histology description for a disease from the pathology report.
  • the method further includes recommending a treatment for the disease based on the at least one of the site description or the histology description.
  • the method further includes treating a patient with the recommended treatment for the disease.
  • the pathology report is a free-text pathology report.
  • the pathology report is for a solid tumor.
  • the system includes a first transformer-based machine learning model configured to extract information from pathology reports, a second transformer-based machine learning model configured to predict site codes for diseases, and a third transformer-based machine learning model configured to predict histology codes for diseases.
  • the first transformer-based machine learning model is configured to receive a pathology report, and extract information from the pathology report, where the extracted information includes a site description and a histology description for a disease.
  • the second transformer-based machine learning model is configured to receive the extracted information, and predict a site code for the disease based on the extracted information.
  • the third transformer-based machine learning model is configured to receive the extracted information, and predict a histology code for the disease based on the extracted information.
  • the second transformer-based machine learning model is further configured to predict a top-n most accurate site codes for the disease, wherein n is an integer greater than 1.
  • the third transformer-based machine learning model is optionally further configured to predict a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
  • the pathology report is a free-text pathology report.
  • the pathology report is for a solid tumor.
  • the method further includes generating a report comprising the site code and the histology code.
  • the method further includes diagnosing the disease based on the site code and the histology code.
  • the second transformer-based machine learning model predicts a top-n most accurate site codes for the disease, wherein n is an integer greater than 1.
  • the third transformer-based machine learning model optionally predicts a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
  • the pathology report is for a solid tumor.
  • the step of creating the second dataset includes creating a first hierarchal tree structure configured to hold acceptable site description terminology, creating a second hierarchal tree structure configured to hold acceptable histology description terminology, constructing a dictionary using the first and second hierarchal tree structures, and for each of the pathology reports in the second dataset, performing a search, using the dictionary, to identify respective matching text strings within a pathology report that correspond to the respective ground truth labels.
  • the method further includes performing supervised training on the transformer-based machine learning model with the second dataset.
  • a respective matching text string is an exact match for preferred or acceptable diverse terminology contained in the dictionary.
  • FIGURE 1 is a diagram of an example system according to implementations described herein.
  • FIGURE 2 is an example computing device.
  • FIGURE 4 is a diagram illustrating how a pathology language model described herein ("Cancer BERT”) leverages work performed by others.
  • FIGURE 5 is a diagram illustrating the process of training three instances of a pathology language model described herein ("caBERT").
  • FIGURE 6 is a diagram illustrating an example data curation process described herein.
  • FIGURE 7 is a chart illustrating the impact of training caBERT, a pathology-specific language model, on question-and-answering performance.
  • FIGURE 8 is a graph illustrating the effect of culling rare tumor sites and histologies on the top-N accuracy of predicting fine-grained ICD-O-3 codes.
  • FIGURE 9 illustrates the overall mean accuracy (93.5%) of predicting tumor site group codes from unstructured, and previously unseen, pathology reports on solid tumors using caBERTnet.
  • FIGURE 10 illustrates the overall mean accuracy (97.7%) of predicting tumor histology group codes from unstructured, and previously unseen, pathology reports on solid tumors using caBERTnet.
  • FIGURE 11 is a diagram illustrating construction of an example acceptable answer phrase table.
  • FIGURE 12 is Table S1, which illustrates experimental parameters used to train caBERT instances.
  • Described herein are machine learning-based systems and methods for extracting information from pathology reports. Also described herein are machine learning-based systems and methods for predicting site and histology codes for a disease. Also described herein are machine learning model training methods.
  • the system and methods described herein provide improvements over conventional NLP systems which include, but are not limited to, 1) extracting accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodating the diverse terminology used in free-text pathology reports to indicate the same pathology; and/or 3) providing accurate standardized tumor site and histology codes for use by downstream applications.
  • the systems and methods can be used to diagnose disease, recommend treatments, reduce treatment delays, increase enrollment in clinical trials of new therapies, and/or improve patient outcomes.
  • the pathology reports relate to a disease such as cancer.
  • Cancer is a disease caused by uncontrolled division of abnormal cells, e.g., a malignant growth.
  • a solid tumor is an abnormal mass of hyperproliferative or neoplastic cells from a tissue other than blood, bone marrow, or the lymphatic system, which may be benign or cancerous.
  • the tumors described herein are cancerous. This disclosure contemplates that the systems and methods
  • carcinoma is art recognized and refers to malignancies of epithelial or endocrine tissues including respiratory system carcinomas, gastrointestinal system carcinomas, genitourinary system carcinomas, testicular carcinomas, breast carcinomas, prostatic carcinomas, endocrine system carcinomas, and melanomas.
  • the disease is lung carcinoma, rectal carcinoma, colon carcinoma, esophageal carcinoma, prostate carcinoma, head and neck carcinoma, or melanoma.
  • Exemplary carcinomas include those forming from tissue of the cervix, lung, prostate, breast, head and neck, colon and ovary.
  • carcinosarcomas e.g., which include malignant tumors composed of carcinomatous and sarcomatous tissues.
  • An "adenocarcinoma” refers to a carcinoma derived from glandular tissue or in which the tumor cells form recognizable glandular structures.
  • the term “sarcoma” is art recognized and refers to malignant tumors of mesenchymal derivation.
  • the computing device can be, for example, the computing device of Fig. 2.
  • the transformer-based machine learning model 102 and computing device can be coupled through one or more communication links.
  • This disclosure contemplates the communication links are any suitable communication link.
  • a communication link may be implemented by any medium that facilitates data exchange including, but not limited to, wired, wireless and optical links.
  • Example communication links include, but are not limited to, a local area network (LAN), a wireless local area network (WLAN), a wide area network (WAN), a metropolitan area network (MAN), Ethernet, the Internet, or any other wired or wireless link such as WiFi, WiMax, 3G, 4G, or 5G.
  • Transformer-based machine learning models are deep learning models commonly used in the field of natural language processing (NLP).
  • Transformer-based machine learning models have an encoder-decoder architecture, where a plurality of encoder layers iteratively process the input layer-by-layer and a plurality of decoder layers iteratively process the output layer-by-layer.
  • Each encoder and decoder layer also includes an attention unit (e.g., scaled dot-product) that weights the relevance of the layer inputs.
  • the attention unit can be implemented using a computing device (e.g., a processing unit and memory as described herein).
  • each encoder and decoder layer includes an artificial neural network.
  • An example transformer-based machine learning model for NLP is the Bidirectional Encoder Representations from Transformers (BERT) developed by Google LLC of Mountain View, California. It should be understood that BERT is provided only as an example. This disclosure contemplates that the transformer-based machine learning models described herein can be models other than BERT.
  • An artificial neural network is a computing system including a plurality of interconnected neurons (e.g., also referred to as "nodes").
  • the nodes can be implemented using a computing device (e.g., a processing unit and memory as described herein).
  • the nodes can optionally be arranged in a plurality of layers such as input layer, output layer, and one or more hidden layers. Each node is connected to one or more other nodes in the ANN.
  • each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer.
  • the nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another.
  • nodes in the input layer receive data from outside of the ANN
  • nodes in the hidden layer(s) modify the data between the input and output layers
  • nodes in the output layer provide the results.
  • Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanH, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function.
  • Transformer-based machine learning models are trained with a data set (or "dataset").
  • the training is supervised training (i.e., using labeled training data).
  • the training is unsupervised training (i.e., using unlabeled training data such as a plain text corpus).
  • the training is semi-supervised, where unsupervised training (i.e., using unlabeled training data) is followed by supervised training (i.e., using labeled training data).
  • Transformer-based machine learning models are known in the art and are therefore not described in further detail herein.
  • the transformer-based machine learning model 102 is operating in inference mode.
  • the transformer-based machine learning model 102 has therefore been trained with a dataset such that it has "learned" a function that maps an input 105 (e.g., a pathology report) to an output 110.
  • the output 110 of the transformer-based machine learning model 102 is at least one of a site description for the disease or a histology description for the disease.
  • the site description indicates the disease's site of origin (e.g., the tumor site).
  • the site of origin is sometimes referred to as topographic location.
  • a coding system is used to define the site of origin.
  • An example coding system for cancer is the International Classification of Disease for Oncology (ICD-O). It should be understood that ICD-O is provided only as an example. This disclosure contemplates using other coding and/or classification systems to define the disease's site of origin.
  • the histology description indicates the disease's microscopic appearance (e.g., the tumor's microscopic appearance).
  • a coding system is used to define the histology.
  • One example coding system for cancer is the International Classification of Disease for Oncology (ICD-O). It should be understood that ICD-O is provided only as an example. This disclosure contemplates using other coding and/or classification systems to define the histology.
  • the output 110 of the transformer-based machine learning model 102 includes both the site description and the histology description. In other words, the transformer- based machine learning model 102 is configured to extract a site description and/or a histology description for a disease from a pathology report. Machine learning model training is discussed in further detail below.
  • the computing device is configured to receive a pathology report, transmit the pathology report to the transformer-based machine learning model 102, and receive at least one of a site description or a histology description for a disease from the transformer-based machine learning model 102.
  • the site and/or histology descriptions are extracted by the transformer-based machine learning model 102.
  • there is no conventional system (even an NLP system) capable of automatically extracting information from free-text pathology reports with high accuracy.
  • clinicians, certified tumor registrars (CTRs), or other users typically need to review free-text pathology reports and manually obtain such information. This is a time-intensive process. For example, patients may not be timely identified as eligible for clinical trials.
  • the system also includes a second transformer-based machine learning model 104, which is configured to predict site codes for diseases.
  • a coding system such as ICD-O, for example, can be used to define the disease's site of origin.
  • in conventional practice, site codes are assigned by clinicians or CTRs.
  • the system also includes a third transformer-based machine learning model 106, which is configured to predict histology codes for diseases.
  • a coding system such as ICD-O, for example, can be used to define the disease's histology.
  • histology codes are assigned by clinicians or CTRs.
  • the third transformer-based machine learning model 106 is operating in inference mode.
  • the third transformer-based machine learning model 106 has therefore been trained with a dataset such that it has "learned" a function that maps the output 110 of the first transformer-based machine learning model 102 (e.g., the extracted site and/or histology descriptions) to an output 120.
  • the output 120 of the third transformer-based machine learning model 106 is the histology code for the disease.
  • the third transformer-based machine learning model 106 is further configured to predict a top-m most accurate histology codes for the disease, e.g., the top-5 most accurate codes.
  • the third transformer-based machine learning model 106 is configured to receive the extracted information (e.g., output 110 in Fig. 1), and predict a histology code for the disease based on the extracted information.
  • the histology code is the output 120.
  • the computing device is further configured to receive the site code (e.g., output 115) predicted by the second transformer-based machine learning model 104 and the histology code (e.g., output 120) predicted by the third transformer-based machine learning model 106.
  • the computing device is further configured to generate a report comprising the site code and the histology code.
  • the Q&A system can be designed to search for the answers to two predefined questions in each pathology report: 1) "What organ contains the tumor?"; and, 2) "What is the kind of tumor or carcinoma?".
  • this involved supervised training on 8,197 pathology reports (e.g., a second dataset), each with ground truth answers to these two questions determined by Certified Tumor Registrars.
  • the dataset included 214 tumor sites and 193 histologies.
  • an example computing device 200 upon which the methods described herein may be implemented is illustrated. It should be understood that the example computing device 200 is only one example of a suitable computing environment upon which the methods described herein may be implemented.
  • the computing device 200 can be a well-known computing system including, but not limited to, personal computers, servers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network personal computers (PCs), minicomputers, mainframe computers, embedded systems, and/or distributed computing environments including a plurality of any of the above systems or devices.
  • Distributed computing environments enable remote computing devices, which are connected to a communication network or other data transmission medium, to perform various tasks.
  • computing device 200 typically includes at least one processing unit 206 and system memory 204.
  • system memory 204 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two.
  • This most basic configuration is illustrated in Fig. 2 by dashed line 202.
  • the processing unit 206 may be a standard programmable processor that performs arithmetic and logic operations necessary for operation of the computing device 200.
  • the computing device 200 may also include a bus or other communication mechanism for communicating information among various components of the computing device 200.
  • the processing unit 206 may be configured to execute program code encoded in tangible, computer-readable media.
  • Tangible, computer-readable media refers to any media that is capable of providing data that causes the computing device 200 (i.e., a machine) to operate in a particular fashion.
  • Various computer-readable media may be utilized to provide instructions to the processing unit 206 for execution.
  • Example tangible, computer-readable media include, but are not limited to, volatile media, non-volatile media, removable media, and non-removable media.
  • In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application programming interface (API).
  • a pathology language model was trained to answer pathology questions.
  • the Q&A system was designed to search for the answers to two predefined questions in each pathology report: 1) "What organ contains the tumor?"; and, 2) "What is the kind of tumor or carcinoma?".
  • the tumor site and histology phrases extracted by the Q&A model were used to predict ICD-O-3 site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes, and the second to predict histology codes.
  • Figure 3 illustrates Table 1, a fragment of text from an example pathology report generated at Moffitt Cancer Center. Note the terse grammatical style and use of technical terms. A single report often has multiple negative indications (illustrated in underlined text) along with a positive diagnosis indicating (in bold) the tumor site (right lower lobe) and histology (squamous cell carcinoma). The goal of this project was to develop an NLP system that, when passed this block of text, would respond to the question "What organ contains the tumor?" with "C343: lower lobe, lung", and would respond to the question "What is the kind of tumor or carcinoma?" with "8070/3: squamous cell carcinoma, nos".
  • Transfer learning was accomplished by performing masked-language modeling [9]. Briefly, 15% of words in the corpus are selected at random, then replaced with a "mask" token. The language model is then trained to predict the masked words. The word masking process is performed automatically at the beginning of each training run.
  • the pathology Q&A "head" lesson plan involved 3 stages, each intended to improve the system's comprehension of pathology reports, and thereby increase the accuracy of question answering (Figure 5).
  • the three stages involved training the Q&A head to: 1) answer general English language questions; 2) answer technical biomedical science questions; and, 3) answer questions from Moffitt pathology reports.
  • Training input to "B” and “C” consisted of concatenated site and histology descriptions, for example, "site: lung lower lobe, histology: squamous cell carcinoma”. Combining the phrases allows each caBERT instance to leverage correlations between site and histology to improve its performance.
  • caBERT "B" is trained to map the input phrase onto ICD-O-3 site codes (e.g., "C343"), while "C" is trained to map this phrase onto ICD-O-3 histology codes (e.g., "8070/3").
  • Output from each classification head consists of the five ICD-O-3 codes with the highest probabilities, or scores.
  • Moffitt Cancer Center's enterprise data warehouse was searched to find solid tumor pathology reports post 2006, with matched Cancer Registry (CR) data (Figure 6). These reports were screened to ensure they contained a description of a positive diagnosis of a single primary tumor. Next, each report was processed to ensure that it contained an answer to at least one of the Q&A questions. This was accomplished programmatically by searching each report for a phrase contained in a table of acceptable answer phrases (Figure 11, S1), as described in more detail below. The search produced 16,782 reports that met these inclusion criteria (Figure 6).
  • Exact Match refers to a perfect, word-for-word, match to the Cancer Registry phrase.
  • F1 is the F1 measure of overlap between words in the extracted phrase and the Cancer Registry phrase. This varies from 0 (no words in common) to 1 (all words in common, but not necessarily in the same order), and is expressed here as a percentage.
  • the pathology reports came supplied with ground truth labels for primary site and histology in the form of ICD-O-3 codes[41], which were abstracted by Moffitt CTRs.
  • a method was needed to determine the precise location within each pathology report of the actual text corresponding to those labels. This proved non-trivial owing to the rather diverse terminology within each pathology report used to refer to each primary site and histology.
  • caBERTnet performance was evaluated using the sequestered Moffitt model-head test dataset described above.
  • the CR generated site and histology phrases were used to create a "ground truth” combined phrase.
  • the site and histology phrases extracted by the Q&A stage of caBERTnet were used to create a "predicted” combined phrase.
  • the predicted phrase was tokenized to prepare it for input into each classification head.
  • Ground truth site and histology codes from the CR were enumerated, as above, and stored as true labels.
  • the trained site and histology classification heads were used to classify the tokenized "predicted” combined phrases for each test sample.
  • the outputs from this classification, logits, were converted to probabilities, sorted and converted back into ICD-O-3 codes as described above, labeled as "predicted” codes, and saved for further performance analysis.
  • Group codes are useful for search and summary applications.
  • the group codes for both the predicted and ground truth fine-grained codes were determined by searching in the tree data structures described above. For each fine-grained code, the search started at that code's location in the tree and proceeded upward. Finally, the mean accuracy of prediction within each group code, for both site and histology predictions, were calculated.
  • caBERTnet can be used to extract information from pathology reports in a timelier way, thus facilitating the use of the data for clinical pathway reporting and screening for clinical trials.
  • CTRs only abstract the subset of pathology reports associated with the cancer diagnosis and first course treatment.
  • CaBERTnet can be used to extract information from pathology reports associated with subsequent biopsies and surgeries that would never be manually curated by the CTRs.
  • tumor site and histology information can be extracted in close to real time and linked to other patient data stored in the analytics platform.
  • the site classification head was initialized to have 332 labels or classes, one for each of the 332 possible ICD-O-3 site codes.
  • the histology classification head initialization was similar, except that it had 1,143 labels, one for each ICD-O-3 histology code.
  • code "8070/3" ended up with a total of 161 alternate phrases.
  • the complete table comprised all 332 site codes and 1,143 histology codes. After restricting to codes contained in the Moffitt dataset, the final table contained 214 site codes and 193 histology codes, each with a preferred phrase and a list of acceptable alternate phrases; a minimal lookup sketch follows this list.
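To make the curation step above concrete, the following is a minimal sketch, under assumed data, of how an acceptable-answer-phrase table could be used to locate the ground-truth text span for a given ICD-O-3 code within a report. The two table entries, the function name, and the matching strategy are illustrative only and are not taken verbatim from the patent.

```python
# Minimal sketch (not the patent's implementation) of matching a report
# against a table of acceptable answer phrases keyed by ICD-O-3 code.
ACCEPTABLE_PHRASES = {
    # code: (preferred phrase, [acceptable alternate phrases]) -- illustrative entries only
    "C343":   ("lower lobe, lung", ["lung lower lobe", "right lower lobe", "left lower lobe"]),
    "8070/3": ("squamous cell carcinoma, nos", ["squamous cell carcinoma", "squamous carcinoma"]),
}

def find_ground_truth_span(report_text: str, code: str):
    """Return (phrase, start, end) for the first exact phrase match, else None."""
    text = report_text.lower()
    preferred, alternates = ACCEPTABLE_PHRASES[code]
    for phrase in [preferred, *alternates]:
        start = text.find(phrase)
        if start != -1:
            return phrase, start, start + len(phrase)
    return None

report = "Final diagnosis: squamous cell carcinoma involving the right lower lobe."
print(find_ground_truth_span(report, "8070/3"))  # ('squamous cell carcinoma', 17, 40)
```

In the pipeline described above, the table holds a preferred phrase plus acceptable alternate phrases for every retained site and histology code, and an exact match against any of them qualifies a report for the supervised training set.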

Abstract

Described herein are machine learning-based systems and methods for extracting information from pathology reports and related machine learning model training methods. Also described herein are machine learning-based systems and methods for predicting site and histology codes for a disease. An example system for automatically extracting information from pathology reports includes a transformer-based machine learning model and a computing device. The computing device includes a processor and a memory operably coupled to the processor, the memory having computer-executable instructions stored thereon. The computing device is configured to receive a pathology report, transmit the pathology report to the transformer-based machine learning model, and receive at least one of a site description or a histology description for a disease. The transformer-based machine learning model is configured to extract the at least one of the site description or the histology description from the pathology report.

Description

MACHINE LEARNING-BASED SYSTEMS AND METHODS FOR EXTRACTING INFORMATION FROM
PATHOLOGY REPORTS
CROSS-REFERENCE TO RELATED APPLICATIONS [0001] This application claims the benefit of U.S. provisional patent application No. 63/188,611, filed on May 14, 2021, and titled "MACHINE LEARNING-BASED SYSTEMS AND METHODS FOR EXTRACTING INFORMATION FROM PATHOLOGY REPORTS," the disclosure of which is expressly incorporated herein by reference in its entirety.
BACKGROUND
[0002] Much of the information in electronic medical records (EMRs) required for the practice of clinical oncology and cancer research is contained in unstructured text. Natural language processing (NLP) has been used to extract information from medical text for several decades [1-5]. Many of the early systems employed regular expression and rule-based systems [6,7]. However, these require considerable up-front development and can be difficult to adapt and maintain. Consequently, there is growing interest in more highly automated deep learning (DL) approaches for clinical NLP. Recent literature reviews on this topic [5,8] suggest that DL methods "have not yet fully penetrated clinical NLP" [5], but are growing rapidly.
[0003] Many of the reviewed papers pre-date the late 2018 publication and release of a powerful new DL NLP algorithm: Bidirectional Encoder Representations from Transformers (BERT)[9]. BERT rediscovered the classical NLP pipeline [10], and established new, state-of-the-art performance levels on common non-clinical NLP benchmarks [11]. This success spawned rapid research and development of multiple BERT-inspired and transformer-based neural architectures [12-20]. Several of these have, for the first time, achieved or surpassed human-level performance on tasks as diverse as question-answering, named entity recognition, speech recognition, and more [17, 20-22].
[0004] The success of BERT and related architectures has also inspired multiple medical applications, including: processing of electronic health records (EHR)[23,24]; outcome prediction [25-27]; identification of medical terms and concepts[28]; medical chatbots[29]; sentiment analysis[30]; recommender systems[31]; and others. However, only one study has applied BERT to free-text pathology reports [32]. This study focused on classification of text into only a few cancer-related categories including: afflicted organ (15 organ groups); disease type (non-cancer, pre-malignant, or cancer); cancer reason (6 histology groups); and presence of metastatic disease (no, yes: in lymph nodes, and yes: in non-lymph node tissue).
SUMMARY
[0005] An example system for automatically extracting information from pathology reports is described herein. The system includes a transformer-based machine learning model, and a computing device. The computing device includes a processor and a memory operably coupled to the processor, and the memory having computer-executable instructions stored thereon. The computing device is configured to receive a pathology report, transmit the pathology report to the transformer-based machine learning model, and receive at least one of a site description or a histology description for a disease. The transformer-based machine learning model is configured to extract the at least one of the site description or the histology description from the pathology report.
[0006] In some implementations, the computing device is configured to provide a disease diagnosis, provide a treatment option, identify a patient for a clinical trial, or monitor adherence to a treatment pathway.
[0007] In some implementations, the site description and the histology description are received from the transformer-based machine learning model.
[0008] Alternatively or additionally, the pathology report is a free-text pathology report.
[0009] Alternatively or additionally, the pathology report is for a solid tumor.
[0010] An example method for automatically extracting information from pathology reports is also described herein. The method includes receiving a pathology report, inputting the pathology report into a transformer-based machine learning model, and extracting, using the transformer-based machine learning model, at least one of a site description or a histology description for a disease from the pathology report.
[0011] In some implementations, the method further includes diagnosing the disease based on the at least one of the site description or the histology description.
[0012] In some implementations, the method further includes recommending a treatment for the disease based on the at least one of the site description or the histology description. Optionally, the method further includes treating a patient with the recommended treatment for the disease.
[0013] Alternatively or additionally, the pathology report is a free-text pathology report.
[0014] Alternatively or additionally, the pathology report is for a solid tumor.
[0015] An example system for predicting site and histology codes for diseases is also described herein. The system includes a first transformer-based machine learning model configured to extract information from pathology reports, a second transformer-based machine learning model configured to predict site codes for diseases, and a third transformer-based machine learning model configured to predict histology codes for diseases. The first transformer-based machine learning model is configured to receive a pathology report, and extract information from the pathology report, where the extracted information includes a site description and a histology description for a disease. The second transformer-based machine learning model is configured to receive the extracted information, and predict a site code for the disease based on the extracted information. The third transformer-based machine learning model is configured to receive the extracted information, and predict a histology code for the disease based on the extracted information.
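The listing below is a minimal sketch of the dataflow through such a three-model system. The function names (extract_descriptions, predict_site_code, predict_histology_code) are hypothetical stand-ins for the three trained transformer-based models, and the combined-phrase format mirrors the example phrase given later in this description.

```python
# Illustrative dataflow only; each stage is an opaque callable standing in for
# one of the three trained transformer-based machine learning models.
def code_pathology_report(report_text, extract_descriptions,
                          predict_site_code, predict_histology_code):
    # Stage 1: extract site and histology descriptions from the free-text report.
    site_desc, histology_desc = extract_descriptions(report_text)
    # Stages 2 and 3: classify the concatenated descriptions into standardized codes.
    combined = f"site: {site_desc}, histology: {histology_desc}"
    return {
        "site_description": site_desc,
        "histology_description": histology_desc,
        "site_code": predict_site_code(combined),        # e.g., an ICD-O-3 site code
        "histology_code": predict_histology_code(combined),
    }
```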
[0016] In some implementations, the system further includes a computing device. The computing device includes a processor and a memory operably coupled to the processor, and the
memory having computer-executable instructions stored thereon. The computing device is configured to transmit the pathology report to the first transformer-based machine learning model. In some implementations, the computing device is further configured to receive the site code predicted by the second transformer-based machine learning model and the histology code predicted by the third transformer-based machine learning model. In some implementations, the computing device is further configured to generate a report comprising the site code and the histology code.
[0017] Optionally, in some implementations, the second transformer-based machine learning model is further configured to predict a top-n most accurate site codes for the disease, wherein n is an integer greater than 1. Alternatively or additionally, in some implementations, the third transformer-based machine learning model is optionally further configured to predict a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
[0018] Alternatively or additionally, the pathology report is a free-text pathology report.
[0019] Alternatively or additionally, the pathology report is for a solid tumor.
[0020] An example method for predicting site and histology codes for diseases is also described herein. The method includes receiving a pathology report, inputting the pathology report into a first transformer-based machine learning model, and extracting, using the first transformer- based machine learning model, information from the pathology report, where the extracted information includes a site description and a histology description for a disease. The method also includes inputting the extracted information into a second transformer-based machine learning model, and predicting, using the second transformer-based machine learning model, a site code for the disease based on the extracted information. The method further includes inputting the extracted information into a third transformer-based machine learning model, and predicting, using the third transformer-based machine learning model, a histology code for the disease based on the extracted information.
[0021] In some implementations, the method further includes generating a report comprising the site code and the histology code.
[0022] In some implementations, the method further includes diagnosing the disease based on the site code and the histology code.
[0023] Optionally, in some implementations, the second transformer-based machine learning model predicts a top-n most accurate site codes for the disease, wherein n is an integer greater than 1. Alternatively or additionally, in some implementations, the third transformer-based machine learning model optionally predicts a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
[0024] In some implementations, the method further includes recommending a treatment for the disease based on the site code and the histology code. Optionally, the method further includes treating a patient with the recommended treatment for the disease.
[0025] Alternatively or additionally, the pathology report is a free-text pathology report.
[0026] Alternatively or additionally, the pathology report is for a solid tumor.
[0027] An example machine learning training method is also described herein. The method includes performing unsupervised training on a transformer-based machine learning model with a first dataset, where the first dataset includes a plurality of pathology reports. The method also includes creating a second dataset including a plurality of pathology reports, where each of the pathology reports in the second dataset includes respective ground truth labels for a site description and a histology description for a disease. The step of creating the second dataset includes creating a first hierarchal tree structure configured to hold acceptable site description terminology, creating a second hierarchal tree structure configured to hold acceptable histology description terminology, constructing a dictionary using the first and second hierarchal tree structures, and for each of the pathology reports in the second dataset, performing a search, using the dictionary, to identify respective matching text strings within a pathology report that correspond to the respective ground truth labels. The method further includes performing supervised training on the transformer-based
machine learning model with the second dataset, where the second dataset further includes the respective matching text strings identified by the search.
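As one illustration of how such hierarchal tree structures can also support the group-code roll-up described later in this disclosure, the sketch below walks a fine-grained topography code upward toward its group code. The parent map is a tiny illustrative excerpt, not the complete ICD-O-3 hierarchy used by the described system.

```python
# Minimal sketch of a hierarchal code tree; the entries are illustrative
# placeholders drawn from ICD-O-3 topography, not the full hierarchy.
SITE_TREE_PARENT = {
    "C343": "C34",      # lower lobe, lung -> lung
    "C341": "C34",      # upper lobe, lung -> lung
    "C34": "C30-C39",   # lung -> respiratory and intrathoracic organs
}

def group_code(fine_grained_code, parent_map, levels_up=1):
    """Walk upward from a fine-grained code toward its group code."""
    code = fine_grained_code
    for _ in range(levels_up):
        code = parent_map.get(code, code)  # stop at the root if no parent exists
    return code

print(group_code("C343", SITE_TREE_PARENT))               # 'C34'
print(group_code("C343", SITE_TREE_PARENT, levels_up=2))  # 'C30-C39'
```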
[0028] Additionally, a respective matching text string is an exact match for preferred or acceptable diverse terminology contained in the dictionary.
[0029] It should be understood that the above-described subject matter may also be implemented as a computer-controlled apparatus, a computer process, a computing system, or an article of manufacture, such as a computer-readable storage medium.
[0030] Other systems, methods, features and/or advantages will be or may become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features and/or advantages be included within this description and be protected by the accompanying claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The components in the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding parts throughout the several views.
[0032] FIGURE 1 is a diagram of an example system according to implementations described herein.
[0033] FIGURE 2 is an example computing device.
[0034] FIGURE 3 is Table 1, which illustrates a fragment of text from an example pathology report.
[0035] FIGURE 4 is a diagram illustrating how a pathology language model described herein ("Cancer BERT") leverages work performed by others.
[0036] FIGURE 5 is a diagram illustrating the process of training three instances of a pathology language model described herein ("caBERT").
[0037] FIGURE 6 is a diagram illustrating an example data curation process described herein.
[0038] FIGURE 7 is a chart illustrating the impact of training caBERT, a pathology-specific language model, on question-and-answering performance.
[0039] FIGURE 8 is a graph illustrating the effect of culling rare tumor sites and histologies on the top-N accuracy of predicting fine-grained ICD-O-3 codes.
[0040] FIGURE 9 illustrates the overall mean accuracy (93.5%) of predicting tumor site group codes from unstructured, and previously unseen, pathology reports on solid tumors using caBERTnet.
[0041] FIGURE 10 illustrates the overall mean accuracy (97.7%) of predicting tumor histology group codes from unstructured, and previously unseen, pathology reports on solid tumors using caBERTnet.
[0042] FIGURE 11 is a diagram illustrating construction of an example acceptable answer phrase table.
[0043] FIGURE 12 is Table S1, which illustrates experimental parameters used to train caBERT instances.
[0044] FIGURE 13 is Table S2, which illustrates examples of histology and site term hierarchies.
DETAILED DESCRIPTION
[0045] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure. As used in the specification, and in the appended claims, the singular forms "a," "an," "the" include plural referents unless the context clearly dictates otherwise. The term "comprising" and variations thereof as used herein is used synonymously with the term "including" and variations thereof and are open, non-limiting terms. The terms "optional" or "optionally" used herein mean that the subsequently described feature, event or circumstance may or may not occur, and that the
description includes instances where said feature, event or circumstance occurs and instances where it does not. Ranges may be expressed herein as from "about" one particular value, and/or to "about" another particular value. When such a range is expressed, an aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent "about," it will be understood that the particular value forms another aspect. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. While implementations will be described for extracting information from free-text pathology reports related to tumors, it will become evident to those skilled in the art that the implementations are not limited thereto, but may be applicable for extracting information from free-text pathology reports related to other diseases.
[0046] Described herein are machine learning-based systems and methods for extracting information from pathology reports. Also described herein are machine learning-based systems and methods for predicting site and histology codes for a disease. Also described herein are machine learning model training methods. The system and methods described herein provide improvements over conventional NLP systems which include, but are not limited to, 1) extracting accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodating the diverse terminology used in free-text pathology reports to indicate the same pathology; and/or 3) providing accurate standardized tumor site and histology codes for use by downstream applications. As described herein, the systems and methods can be used to diagnose disease, recommend treatments, reduce treatment delays, increase enrollment in clinical trials of new therapies, and/or improve patient outcomes.
[0047] Additionally, the pathology reports relate to a disease such as cancer. Cancer is a disease caused by uncontrolled division of abnormal cells, e.g., a malignant growth. A solid tumor is an abnormal mass of hyperproliferative or neoplastic cells from a tissue other than blood, bone marrow, or the lymphatic system, which may be benign or cancerous. In general, the tumors described herein are cancerous. This disclosure contemplates that the systems and methods
described herein may be applicable to solid tumors in any cells, tissue, or organ of the subject. As used herein, the terms "hyperproliferative" and "neoplastic" refer to cells having the capacity for autonomous growth, i.e., an abnormal state or condition characterized by rapidly proliferating cell growth. Hyperproliferative and neoplastic disease states may be categorized as pathologic, i.e., characterizing or constituting a disease state, or may be categorized as non-pathologic, i.e., a deviation from normal but not associated with a disease state. The term is meant to include all types of solid cancerous growths, metastatic tissues or malignantly transformed cells, tissues, or organs, irrespective of histopathologic type or stage of invasiveness. "Pathologic hyperproliferative" cells occur in disease states characterized by malignant tumor growth. Examples of non-pathologic hyperproliferative cells include proliferation of cells associated with wound repair. Examples of solid tumors are sarcomas, carcinomas, and lymphomas. Leukemias (cancers of the blood) generally do not form solid tumors.
[0048] The term "carcinoma" is art recognized and refers to malignancies of epithelial or endocrine tissues including respiratory system carcinomas, gastrointestinal system carcinomas, genitourinary system carcinomas, testicular carcinomas, breast carcinomas, prostatic carcinomas, endocrine system carcinomas, and melanomas. In some implementations, the disease is lung carcinoma, rectal carcinoma, colon carcinoma, esophageal carcinoma, prostate carcinoma, head and neck carcinoma, or melanoma. Exemplary carcinomas include those forming from tissue of the cervix, lung, prostate, breast, head and neck, colon and ovary. The term also includes carcinosarcomas, e.g., which include malignant tumors composed of carcinomatous and sarcomatous tissues. An "adenocarcinoma" refers to a carcinoma derived from glandular tissue or in which the tumor cells form recognizable glandular structures. The term "sarcoma" is art recognized and refers to malignant tumors of mesenchymal derivation.
[0049] Referring now to Fig. 1, an example system for automatically extracting information from pathology reports is shown. The system includes a transformer-based machine learning model 102 (also referred to as "first transformer-based machine learning model 102" with
regard to Fig. 1) and a computing device (not shown). The computing device can be, for example, the computing device of Fig. 2. The transformer-based machine learning model 102 and computing device can be coupled through one or more communication links. This disclosure contemplates the communication links are any suitable communication link. For example, a communication link may be implemented by any medium that facilitates data exchange including, but not limited to, wired, wireless and optical links. Example communication links include, but are not limited to, a local area network (LAN), a wireless local area network (WLAN), a wide area network (WAN), a metropolitan area network (MAN), Ethernet, the Internet, or any other wired or wireless link such as WiFi, WiMax, 3G, 4G, or 5G.
[0050] Transformer-based machine learning models are deep learning models commonly used in the field of natural language processing (NLP). Transformer-based machine learning models have an encoder-decoder architecture, where a plurality of encoder layers iteratively process the input layer-by-layer and a plurality of decoder layers iteratively process the output layer-by-layer. Each encoder and decoder layer also includes an attention unit (e.g., scaled dot-product) that weights the relevance of the layer inputs. This disclosure contemplates that the attention unit can be implemented using a computing device (e.g., a processing unit and memory as described herein). Additionally, each encoder and decoder layer includes an artificial neural network. An example transformer-based machine learning model for NLP is the Bidirectional Encoder Representations from Transformers (BERT) developed by Google LLC of Mountain View, California. It should be understood that BERT is provided only as an example. This disclosure contemplates that the transformer-based machine learning models described herein can be models other than BERT.
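As a concrete illustration of the attention unit described above, the fragment below implements generic scaled dot-product attention in NumPy. This is a textbook formulation provided for orientation only, not code from the patent or from any particular BERT implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

# Toy example: 3 tokens, each with a 4-dimensional representation.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```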
[0051] An artificial neural network is a computing system including a plurality of interconnected neurons (e.g., also referred to as "nodes"). This disclosure contemplates that the nodes can be implemented using a computing device (e.g., a processing unit and memory as described herein). The nodes can optionally be arranged in a plurality of layers such as input layer, output layer, and one or more hidden layers. Each node is connected to one or more other nodes in
the ANN. For example, each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer. The nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another. As used herein, nodes in the input layer receive data from outside of the ANN, nodes in the hidden layer(s) modify the data between the input and output layers, and nodes in the output layer provide the results. Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanH, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function. Additionally, each node is associated with a respective weight. ANNs are trained with a dataset to maximize or minimize an objective function. In some implementations, the objective function is a cost function, which is a measure of the ANN's performance (e.g., error such as L1 or L2 loss) during training, and the training algorithm tunes the node weights and/or bias to minimize the cost function. This disclosure contemplates that any algorithm that finds the maximum or minimum of the objective function can be used for training the ANN. Training algorithms for ANNs include, but are not limited to, backpropagation. It should be understood that an artificial neural network is provided only as an example machine learning model.
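For completeness, the following minimal sketch shows a fully connected layer of the kind just described, with per-layer weights and selectable activation functions. It is a generic illustration under the definitions above, not the network architecture used by the described system.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # rectified linear unit activation

def dense_layer(inputs, weights, bias, activation=relu):
    """One fully connected layer: every node sees every output of the previous layer."""
    return activation(inputs @ weights + bias)

# Toy forward pass: 5 inputs -> hidden layer of 3 nodes -> 1 output node.
rng = np.random.default_rng(1)
x = rng.normal(size=(5,))
hidden = dense_layer(x, rng.normal(size=(5, 3)), np.zeros(3))
output = dense_layer(hidden, rng.normal(size=(3, 1)), np.zeros(1),
                     activation=lambda z: 1 / (1 + np.exp(-z)))  # sigmoid output
print(output.shape)  # (1,)
```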
[0052] Transformer-based machine learning models are trained with a data set (or "dataset"). In some implementations, the training is supervised training (i.e., using labeled training data). In other implementations, the training is unsupervised training (i.e., using unlabeled training data such as a plain text corpus). In yet other implementations, the training is semi-supervised, where unsupervised training (i.e., using unlabeled training data) is followed by supervised training (i.e., using labeled training data). Transformer-based machine learning models are known in the art and are therefore not described in further detail herein.
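The fragment below sketches the semi-supervised recipe described above: one masked-language-modeling (unsupervised) training step on unlabeled text, after which the same encoder would be fine-tuned on labeled data. It uses the Hugging Face transformers library for illustration; the checkpoint name, example sentence, and masking details are placeholder assumptions rather than the actual training configuration described in this disclosure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Unsupervised stage: one masked-language-modeling step on unlabeled text
# (placeholder checkpoint and example sentence, not the patent's corpus).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

batch = tokenizer(["invasive squamous cell carcinoma of the right lower lobe"],
                  return_tensors="pt")
labels = batch["input_ids"].clone()

# Choose ~15% of positions to mask; never mask the [CLS]/[SEP] special tokens,
# and force at least one masked token so this tiny demo always produces a loss.
mask = torch.rand(labels.shape) < 0.15
mask[0, 0] = mask[0, -1] = False
mask[0, 3] = True
labels[~mask] = -100                       # loss is computed only at masked positions
batch["input_ids"][mask] = tokenizer.mask_token_id

loss = model(**batch, labels=labels).loss  # cross-entropy over the masked tokens
loss.backward()                            # backpropagation for one unsupervised example

# Supervised stage (not shown): the same encoder is subsequently fine-tuned on
# labeled examples, e.g. with a question-answering or classification head.
```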
[0053] In Fig. 1, the transformer-based machine learning model 102 is operating in inference mode. The transformer-based machine learning model 102 has therefore been trained with a dataset such that it has "learned" a function that maps an input 105 (e.g., a pathology report) to an output 110. The output 110 of the transformer-based machine learning model 102 is at least
one of a site description for the disease or a histology description for the disease. As used herein, the site description indicates the disease's site of origin (e.g., the tumor site). The site of origin is sometimes referred to as topographic location. Optionally, in some implementations described herein, a coding system is used to define the site of origin. An example coding system for cancer is the International Classification of Disease for Oncology (ICD-O). It should be understood that ICD-O is provided only as an example. This disclosure contemplates using other coding and/or classification systems to define the disease's site of origin. Additionally, as used herein, the histology description indicates the disease's microscopic appearance (e.g., the tumor's microscopic appearance). Optionally, in some implementations described herein, a coding system is used to define the histology. One example coding system for cancer is the International Classification of Disease for Oncology (ICD-O). It should be understood that ICD-O is provided only as an example. This disclosure contemplates using other coding and/or classification systems to define the histology. Optionally, in some implementations, the output 110 of the transformer-based machine learning model 102 includes both the site description and the histology description. In other words, the transformer-based machine learning model 102 is configured to extract a site description and/or a histology description for a disease from a pathology report. Machine learning model training is discussed in further detail below.
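As described elsewhere herein, the extraction stage can be framed as extractive question answering over the report text. The sketch below shows what inference with such a model might look like using the Hugging Face transformers question-answering pipeline; the public checkpoint named here is a stand-in assumption, not the fine-tuned pathology model of this disclosure, and the report text is invented for illustration.

```python
from transformers import pipeline

# A public extractive question-answering checkpoint stands in for the
# fine-tuned pathology model; the report text below is invented for this demo.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

report = ("Right lower lobe, wedge resection: invasive squamous cell carcinoma, "
          "2.3 cm, margins negative. Hilar lymph nodes negative for tumor.")

for question in ["What organ contains the tumor?",
                 "What is the kind of tumor or carcinoma?"]:
    answer = qa(question=question, context=report)
    # 'answer' is a dict with keys 'answer', 'score', 'start', and 'end'
    print(f"{question} -> {answer['answer']} (score {answer['score']:.2f})")
```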
[0054] The computing device is configured to receive a pathology report, transmit the pathology report to the transformer-based machine learning model 102, and receive at least one of a site description or a histology description for a disease from the transformer-based machine learning model 102. The site and/or histology descriptions are extracted by the transformer-based machine learning model 102. As noted herein, there is no conventional system (even an NLP system) capable of automatically extracting information from free-text pathology reports with high accuracy. For example, clinicians, certified tumor registrars (CTRs), or other users typically need to review free-text pathology reports and manually obtain such information. This is a time-intensive process. For example, patients may not be timely identified as eligible for clinical trials. Some of these patients
inevitably experience negative health outcomes, including death, as a result. In other cases, reporting (which is required by government agencies) is not made in a timely manner. For example, CTRs at some institutions may have pathology report backlogs of 6 months, 1 year, or even longer. Optionally, in some implementations, the computing device is configured to provide a disease diagnosis, provide a treatment option, identify a patient for a clinical trial, or monitor adherence to a treatment pathway. Such diagnosis, treatment options, clinical trial eligibility, monitoring, etc. are based on the information extracted by the transformer-based machine learning model 102. For example, tumor site and histology descriptions are used clinically for disease diagnosis and/or treatment. The machine learning-based systems and methods described herein, which use transformer-based machine learning models, offer improvements over conventional systems, including those using NLP. This is at least in part due to the use of transformer-based machine learning.
[0055] In some implementations, the system optionally includes a plurality of transformer-based machine learning models. In other words, the system includes a network of transformer-based machine learning models. The transformer-based machine learning models and computing device can be coupled through one or more communication links. As described herein, this disclosure contemplates that the communication links can be any suitable communication links. The network is configured for predicting site and histology codes for diseases as described herein. For example, as described above, the system includes the transformer-based machine learning model 102 (also referred to herein as "first transformer-based machine learning model 102"), which is configured to extract information from pathology reports. As described herein, the extracted information includes a site description and a histology description for a disease. Site and histology descriptions for disease are described above.
[0056] The system also includes a second transformer-based machine learning model 104, which is configured to predict site codes for diseases. As described above, a coding system such as ICD-O, for example, can be used to define the disease's site of origin. In conventional practice, site
codes are assigned by clinicians or CTRs. In Fig. 1, the second transformer-based machine learning model 104 is operating in inference mode. The second transformer-based machine learning model 104 has therefore been trained with a dataset such that it has "learned" a function that maps the output 110 of the first transformer-based machine learning model 102 (e.g., the extracted site and/or histology descriptions) to an output 115. The output 115 of the second transformer-based machine learning model 104 is the site code for the disease. Optionally, the second transformer-based machine learning model 104 is further configured to predict the top-n most accurate site codes for the disease, e.g., the top-5 most accurate codes. In other words, the second transformer-based machine learning model 104 is configured to receive the extracted information (e.g., output 110 in Fig. 1), and predict a site code for the disease based on the extracted information. The site code is the output 115.
[0057] The system also includes a third transformer-based machine learning model 106, which is configured to predict histology codes for diseases. As described above, a coding system such as ICD-O, for example, can be used to define the disease's histology. In conventional practice, histology codes are assigned by clinicians or CTRs. In Fig. 1, the third transformer-based machine learning model 106 is operating in inference mode. The third transformer-based machine learning model 106 has therefore been trained with a dataset such that it has "learned" a function that maps the output 110 of the first transformer-based machine learning model 102 (e.g., the extracted site and/or histology descriptions) to an output 120. The output 120 of the third transformer-based machine learning model 106 is the histology code for the disease. Optionally, the third transformer-based machine learning model 106 is further configured to predict the top-m most accurate histology codes for the disease, e.g., the top-5 most accurate codes. In other words, the third transformer-based machine learning model 106 is configured to receive the extracted information (e.g., output 110 in Fig. 1), and predict a histology code for the disease based on the extracted information. The histology code is the output 120.
[0058] Optionally, in some implementations, the computing device is further configured to receive the site code (e.g., output 115) predicted by the second transformer-based machine learning model 104 and the histology code (e.g., output 120) predicted by the third transformer-based machine learning model 106. Optionally, in some implementations, the computing device is further configured to generate a report comprising the site code and the histology code.
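A highly simplified sketch of how the three models of Fig. 1 could be chained by the computing device is shown below; the wrapper objects and their methods are hypothetical placeholders, not an API defined by this disclosure:

```python
def run_pipeline(pathology_report, extraction_model, site_model, histology_model):
    """Chain the three transformer-based models of Fig. 1 (hypothetical wrapper objects)."""
    # First model (102): extract site and histology descriptions from the free-text report.
    descriptions = extraction_model.extract(pathology_report)               # hypothetical API

    # Second model (104): predict the top-5 most probable ICD-O-3 site codes.
    site_codes = site_model.predict_codes(descriptions, top_n=5)            # hypothetical API

    # Third model (106): predict the top-5 most probable ICD-O-3 histology codes.
    histology_codes = histology_model.predict_codes(descriptions, top_n=5)  # hypothetical API

    # The computing device can then assemble a report from the predicted codes.
    return {"site_codes": site_codes, "histology_codes": histology_codes}
```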
[0059] An example method for training a transformer-based machine learning model is described below. The method can be used, for example, to train the first transformer-based machine learning model 102 of Fig. 1. First, the base language model is trained to comprehend the technical language in pathology reports. In the examples below, this involved unsupervised learning on a training corpus of 275,605 electronic pathology reports (e.g., a first dataset) from 164,531 unique patients that included 121 million words. Next, a Q&A "head" that connects to, and works with, the pathology language model is trained to answer pathology questions. The Q&A system can be designed to search for the answers to two predefined questions in each pathology report: 1) "What organ contains the tumor?"; and, 2) "What is the kind of tumor or carcinoma?". In the examples below, this involved supervised training on 8,197 pathology reports (e.g., a second dataset), each with ground truth answers to these two questions determined by Certified Tumor Registrars. The dataset included 214 tumor sites and 193 histologies. The second dataset is created by creating a first hierarchical tree structure configured to hold acceptable site description terminology, creating a second hierarchical tree structure configured to hold acceptable histology description terminology, constructing a dictionary using the first and second hierarchical tree structures, and for each of the pathology reports in the second dataset, performing a search, using the dictionary, to identify respective matching text strings within a pathology report that correspond to the respective ground truth labels. The tumor site and histology phrases extracted by the Q&A model are used to predict ICD-O-3 site and histology codes. This involved fine-tuning two additional BERT models, for example, the second transformer-based machine learning model 104 of Fig. 1 and the third transformer-based
machine learning model 106 of Fig. 1: one to predict site codes, and the second to predict histology codes.
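For illustration only, a fine-tuned extractive Q&A model of this kind could be queried with the two predefined questions using the Hugging Face transformers pipeline API; the checkpoint path below is a hypothetical placeholder rather than a model provided by this disclosure:

```python
from transformers import pipeline

# Hypothetical local checkpoint for a Q&A model fine-tuned on pathology reports.
qa = pipeline("question-answering", model="./pathology-qa-checkpoint")

report_text = "... free-text pathology report ..."
questions = [
    "What organ contains the tumor?",
    "What is the kind of tumor or carcinoma?",
]

for question in questions:
    result = qa(question=question, context=report_text)
    # Each result contains the extracted answer span and a confidence score.
    print(question, "->", result["answer"], result["score"])
```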
[0060] It should be appreciated that the logical operations described herein with respect to the various figures may be implemented (1) as a sequence of computer-implemented acts or program modules (i.e., software) running on a computing device (e.g., the computing device described in Fig. 2), (2) as interconnected machine logic circuits or circuit modules (i.e., hardware) within the computing device and/or (3) as a combination of software and hardware of the computing device. Thus, the logical operations discussed herein are not limited to any specific combination of hardware and software. The implementation is a matter of choice dependent on the performance and other requirements of the computing device. Accordingly, the logical operations described herein are referred to variously as operations, structural devices, acts, or modules. These operations, structural devices, acts and modules may be implemented in software, in firmware, in special-purpose digital logic, and any combination thereof. It should also be appreciated that more or fewer operations may be performed than shown in the figures and described herein. These operations may also be performed in a different order than those described herein.
[0061] Referring to Fig. 2, an example computing device 200 upon which the methods described herein may be implemented is illustrated. It should be understood that the example computing device 200 is only one example of a suitable computing environment upon which the methods described herein may be implemented. Optionally, the computing device 200 can be a well-known computing system including, but not limited to, personal computers, servers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network personal computers (PCs), minicomputers, mainframe computers, embedded systems, and/or distributed computing environments including a plurality of any of the above systems or devices. Distributed computing environments enable remote computing devices, which are connected to a communication network or other data transmission medium, to perform various tasks. In the
distributed computing environment, the program modules, applications, and other data may be stored on local and/or remote computer storage media.
[0062] In its most basic configuration, computing device 200 typically includes at least one processing unit 206 and system memory 204. Depending on the exact configuration and type of computing device, system memory 204 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in Fig. 2 by dashed line 202. The processing unit 206 may be a standard programmable processor that performs arithmetic and logic operations necessary for operation of the computing device 200. The computing device 200 may also include a bus or other communication mechanism for communicating information among various components of the computing device 200.
[0063] Computing device 200 may have additional features/functionality. For example, computing device 200 may include additional storage such as removable storage 208 and non-removable storage 210 including, but not limited to, magnetic or optical disks or tapes. Computing device 200 may also contain network connection(s) 216 that allow the device to communicate with other devices. Computing device 200 may also have input device(s) 214 such as a keyboard, mouse, touch screen, etc. Output device(s) 212 such as a display, speakers, printer, etc. may also be included. The additional devices may be connected to the bus in order to facilitate communication of data among the components of the computing device 200. All these devices are well known in the art and need not be discussed at length here.
[0064] The processing unit 206 may be configured to execute program code encoded in tangible, computer-readable media. Tangible, computer-readable media refers to any media that is capable of providing data that causes the computing device 200 (i.e., a machine) to operate in a particular fashion. Various computer-readable media may be utilized to provide instructions to the processing unit 206 for execution. Example tangible, computer-readable media may include, but is not limited to, volatile media, non-volatile media, removable media and non-removable media
implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. System memory 204, removable storage 208, and non-removable storage 210 are all examples of tangible, computer storage media. Example tangible, computer-readable recording media include, but are not limited to, an integrated circuit (e.g., field-programmable gate array or application-specific IC), a hard disk, an optical disk, a magneto-optical disk, a floppy disk, a magnetic tape, a holographic storage medium, a solid-state device, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices.
[0065] In an example implementation, the processing unit 206 may execute program code stored in the system memory 204. For example, the bus may carry data to the system memory 204, from which the processing unit 206 receives and executes instructions. The data received by the system memory 204 may optionally be stored on the removable storage 208 or the non-removable storage 210 before or after execution by the processing unit 206.
[0066] It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination thereof. Thus, the methods and apparatuses of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium wherein, when the program code is loaded into and executed by a machine, such as a computing device, the machine becomes an apparatus for practicing the presently disclosed subject matter. In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application
programming interface (API), reusable controls, or the like. Such programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired.
In any case, the language may be a compiled or interpreted language and it may be combined with hardware implementations.
[0067] Examples
[0068] The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how the compounds, compositions, articles, devices and/or methods claimed herein are made and evaluated, and are intended to be purely exemplary and are not intended to limit the disclosure. Efforts have been made to ensure accuracy with respect to numbers (e.g., amounts, temperature, etc.), but some errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by weight, temperature is in °C or is at ambient temperature, and pressure is at or near atmospheric.
[0069] Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, Bidirectional Encoder Representations from Transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.
[0070] The systems and methods described herein pursue three specific aims: 1) extract accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodate the diverse terminology used to indicate the same pathology; and 3) provide accurate standardized tumor site and histology codes for use by downstream applications. First, a base language-model was trained to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, a Q&A "head" that would connect to, and work with, the
pathology language model was trained to answer pathology questions. The Q&A system was designed to search for the answers to two predefined questions in each pathology report: 1) "What organ contains the tumor?"; and, 2) "What is the kind of tumor or carcinoma?". This involved supervised training on 8,197 pathology reports, each with ground truth answers to these two questions determined by Certified Tumor Registrars. The dataset included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict ICD-O-3 site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes, and the second to predict histology codes. The final system includes a network of 3 BERT-based models, which is referred to in this example as caBERTnet (pronounced "Cabernet"). caBERTnet was evaluated using a sequestered test dataset of 2,050 pathology reports with ground truth answers determined by Certified Tumor Registrars.
[0071] caBERTnet's accuracies for predicting group-level site and histology codes were 93.5% and 97.7%, respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training dataset were 93.6% and 95.4%, respectively.
[0072] caBERTnet has achieved expert-level performance predicting ICD-O-3 codes across a broad range of tumor sites and histologies. This level of performance has not been achieved by state-of-the-art NLP systems. caBERTnet can help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
[0073] The BERT-based system described herein was developed and evaluated to extract detailed tumor site and histology information from free-text pathology reports. Initial efforts were focused on these data elements since they are particularly critical for cancer diagnosis and care, including the selection of treatment options, identification of patients eligible for clinical trials, and monitoring of adherence to established clinical treatment pathways. More broadly, the discrete information contained in pathology reports is of critical utility for cancer research and population-based cancer surveillance. The availability of manually curated data within the Cancer Registry at the
H. Lee Moffitt Cancer Center and Research Institute (Moffitt) represented a unique opportunity to train a BERT-based system using a gold standard dataset classified using a standard ontology.
[0074] BERT's proficiency at question-answering prompted us to construct a question-and-answer (Q&A) system to extract clinical data from pathology reports. This concept has long been compelling - Q&A systems for medical data extraction have been pursued for over 40 years [33].
Such a system has several desirable properties: an intuitive user interface; the ability to extract additional data fields by searching for answers to additional questions; and the ability to generalize to other medical documents. Furthermore, it allows users to make data available for clinical and research use close to real-time, thus reducing treatment delays, increasing enrollment in clinical trials of new therapies, and improving patient outcomes.
[0075] The systems and methods described herein make the following contributions to the field of NLP: 1) a new BERT language model for comprehension of pathology reports in oncology, referred to below as "CancerBERT", or "caBERT" for short; 2) a new question-and-answer caBERT-based system, tolerant to varied terminologies, word orders, and spelling mistakes, to extract tumor site and histology descriptions from free-text pathology reports; and 3) a new caBERT network (caBERTnet, pronounced "Cabernet") to predict International Classification of Diseases for Oncology, version 3.2 (ICD-O-3.2) codes from the extracted descriptions. This system can handle up to 332 organ sites and 1,143 tumor histologies. On an unseen test dataset with 214 sites and 193 histologies it achieved overall accuracies comparable to human experts.
[0076] Methods
[0077] To construct the system described herein, three specific aims were achieved: 1) extract accurate tumor site and histology descriptions from complex free-text pathology reports; 2) accommodate the diverse terminology used to indicate the same pathology; and 3) provide accurate standardized tumor site and histology codes for use by downstream applications.
[0078] Extract Tumor Site and Histology Descriptions
[0079] Constructing this system first required training a base language model to comprehend the technical language in pathology reports (Table 1, Fig. 3). This involved unsupervised training on a large corpus of pathology text. For this, a "pathology-language-model training dataset", described in more detail below, was constructed.
[0080] Figure 3 illustrates Table 1, an example pathology report generated at Moffitt Cancer Center. Note the terse grammatical style and use of technical terms. A single report often has multiple negative indications (illustrated in underlined text) along with a positive diagnosis indicating (in bold) the tumor site (right lower lobe) and histology (squamous cell carcinoma). The goal of this project was to develop an NLP system that, when passed this block of text, would respond to the question "What organ contains the tumor?" with "C343: lower lobe, lung", and would respond to the question "What is the kind of tumor or carcinoma?" with "8070/3: squamous cell carcinoma, nos".
[0081] Next, it required training a Q&A "head" that would connect to, and work with, the pathology language model to answer pathology questions. The Q&A system was designed to search for the answers to two predefined questions in each pathology report: 1) "What organ contains the tumor?"; and 2) "What is the kind of tumor or carcinoma?"
[0082] This involved supervised training on a set of pathology reports, each with ground truth answers to these two questions determined by human experts. To do this, a second "model-head training dataset", described in more detail below, was constructed. Finally, the system was evaluated using a sequestered "model-head testing dataset" described in more detail below.
[0083] Pathology Language Model
[0084] Training a base language model to comprehend pathology reports leveraged prior work by several groups (Figure 4). Lee et al. [34] performed transfer learning on BERT using nearly 18 billion words extracted from PubMed abstracts. The result, BioBERT, is tuned for biomedical language comprehension tasks and is publicly available. Next, Alsentzer et al. [35] performed transfer learning on BioBERT to tune it for clinical language comprehension. They used electronic medical record notes in the Medical Information Mart for Intensive Care, version 3
(MIMIC-III) dataset [36], which includes data from approximately 60,000 intensive care unit stays by patients at Beth-Israel Hospital in Boston, MA. Alsentzer's model, ClinicalBERT, was also made publicly available.
[0085] Figure 4 shows how the pathology language model described herein ("CancerBERT") leverages work performed by others. Google released the original BERT in late 2018.
BioBERT, a model for biomedical language comprehension, was created by tuning BERT with PubMed abstracts. ClinicalBERT, a model for EMR language comprehension, was created by tuning BioBERT with MIMIC-III notes. CancerBERT (caBERT), a language model for solid tumor pathology report comprehension, was created by tuning ClinicalBERT with 275,605 pathology reports on solid tumor cases at Moffitt Cancer Center.
[0086] Alsentzer et al. created two models built upon BioBERT: one trained on all MIMIC-III notes, and one trained on just the MIMIC-III discharge summaries. Initial pre-training experimentation revealed that the latter provided higher accuracies on a separate sample of our pathology reports. It is noted that Moffitt pathology reports have a language structure closer to discharge summaries than to general clinical notes. Consequently, the model described herein was initialized with weights from the latter of the two ClinicalBERT models: "ClinicalBERT - Bio + Discharge Summary BERT Model".
[0087] Transfer learning was accomplished by performing masked-language modeling [9]. Briefly, 15% of words in the corpus are selected at random, then replaced with a "mask" token. The language model is then trained to predict the masked words. The word masking process is performed automatically at the beginning of each training run.
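As a minimal sketch of this masking scheme (an assumption about tooling, not necessarily the exact training script used here), the Hugging Face transformers data collator below selects 15% of tokens for masked-language-model training; the ClinicalBERT checkpoint identifier is assumed to be the publicly released one:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Checkpoint name assumed for the publicly released "Bio + Discharge Summary" ClinicalBERT.
checkpoint = "emilyalsentzer/Bio_Discharge_Summary_BERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Select 15% of tokens at random for masking; the model is trained to predict them.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```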
[0088] The language-model training corpus included electronic pathology reports of solid tumors produced by pathologists at Moffitt between 1986 and 2020. The year 1986 was the earliest date on pathology reports catalogued in Moffitt's enterprise data warehouse. The dataset was restricted to solid tumors for two reasons: first, to focus the problem domain for this proof-of-
concept study; and, second, Moffitt hematologic pathology reports follow a quasi-structured format, reducing the need for extraction of data from unstructured text.
[0089] This dataset contained both Health Level Seven International (HL7) messages and plain-text pathology reports. These were minimally processed in order to extract and clean the text relevant to pathology; more details can be found in the supplementary material. The final language-model training corpus included 275,605 electronic pathology reports from 164,531 unique patients and included 121 million words.
[0090] Pathology Question & Answer Head
[0091] The pathology Q&A "head" lesson plan involved 3 stages, each intended to improve the system's comprehension of pathology reports, and thereby increase the accuracy of question answering (Figure 5). The three stages involved training the Q&A head to: 1) answer general English language questions; 2) answer technical biomedical science questions; and, 3) answer questions from Moffitt pathology reports.
[0092] Each training stage used supervised learning. This required a training dataset that included passages of text, one or more questions related to each passage, and ground-truth answers to those questions that appeared as contiguous phrases within the related passage. At the end of each stage the system was evaluated using the same sequestered test dataset constructed from Moffitt pathology reports, as described in more detail below. The experimental parameters used to train the Q&A head were held constant over all stages and are listed in Figure 12, Table S1.
[0093] For the first stage of training, the Stanford Question Answering Dataset (SQuAD v1.1) [37] was used. SQuAD consists of over 100,000 questions and answers created by crowd workers on Wikipedia articles. The SQuAD data format is widely used in NLP research. Therefore, the system was designed to read and process datasets in this format.
[0094] For the second stage of training, the large-scale biomedical semantic indexing and question answering dataset (BioASQ) [38] was used. In particular, data from BioASQ Challenge 7b: Biomedical Semantic Question Answering was used. This dataset contains 2,747 training
questions along with their ground-truth answers. According to the BioASQ Challenge 7b description:
"All the questions are constructed by biomedical experts from around Europe". This dataset was converted to SQuAD format by Yoon et. al. and made available for public use [39].
[0095] Next, a Q&A dataset was constructed in SQuAD format based on Moffitt pathology reports. Ground-truth answers to the two questions were obtained from data abstracted by Moffitt Certified Tumor Registrars (CTRs). CTRs undergo an extensive training and internship program to become proficient at extracting quantitative and categorical data from unstructured pathology reports. They are widely employed by cancer centers and other organizations to extract data for clinical and research applications and for reporting to state and national agencies. The Moffitt Cancer Registry deploys state-of-the-art quality assurance procedures: its benchmark for quality is 90% and its target accuracy is 95% [40].
[0096] Figure 5 shows how the pathology language model described herein ("CancerBERT") was developed. Three instances of caBERT were trained using the same 8,197 pathology reports (the Moffitt model-head training dataset). Ground truth answers were created from data in the Moffitt Cancer Registry. The first caBERT instance, "A", uses a question-and-answer head to extract tumor site and histology descriptions from pathology reports. The training process involved a 3-stage "lesson plan": 1) train the system to answer general English language questions with the SQuAD v1.1 dataset; 2) train it to answer technical biomedical questions with the BioASQ 7b dataset; and finally, 3) train it to answer our pathology questions with the Moffitt model-head dataset. Training input to "B" and "C" consisted of concatenated site and histology descriptions, for example, "site: lung lower lobe, histology: squamous cell carcinoma". Combining the phrases allows each caBERT instance to leverage correlations between site and histology to improve its performance. caBERT "B" is trained to map the input phrase onto ICD-O-3 site codes (e.g., "C343"), while "C" is trained to map this phrase onto ICD-O-3 histology codes (e.g., "8070/3"). Output from each classification head consists of the five ICD-O-3 codes with the highest probabilities, or scores. At the
end of each training stage the models were evaluated using 2,050 sequestered pathology reports in the Moffitt model-head testing dataset. Additional information is provided in the text.
[0097] Moffitt Cancer Center's enterprise data warehouse was searched to find solid tumor pathology reports post 2006, with matched Cancer Registry (CR) data (Figure 6). These reports were screened to ensure they contained a description of a positive diagnosis of a single primary tumor. Next, each report was processed to ensure that it contained an answer to at least one of the Q&A questions. This was accomplished programmatically by searching each report for a phrase contained in a table of acceptable answer phrases (Figure 11, S1), as described in more detail below. The search produced 16,782 reports that met these inclusion criteria (Figure 6).
[0098] Next, these reports were curated to ensure that: a) the relative frequencies of the 10 most common tumor sites and histologies in this collection matched the relative frequencies in Moffitt's patient population as a whole; and, b) all CR-assigned tumor sites and histologies reported for Moffitt patients post 2006 were represented in the dataset. The final curated collection contained 10,247 reports (Figure 6).
[0099] Figure 6 is a flowchart depicting the data curation process. 1) all Moffitt pathology reports stored in HL7 format from 2007 onward were retrieved. 2) surgical records post 2006 abstracted by our Cancer Registry were also retrieved. However, one patient may have multiple tumor records and one tumor record may have multiple surgical records. Therefore, 3) HL7 pathology reports were cross-referenced with a Cancer Registry surgical record using MRN and date. This produced 4) a dataset with discrete abstracted data existing in the research biobanking system. These records were linked using MRN and surgical case id. Next, the Cancer Registry primary site and histology were verified against the biobanking primary site and histology. In parallel, we identified 5) cases with a synoptic pathology report verified using Cancer Registry primary site and histology data and the discrete synoptic diagnosis. This resulted in 6) a collection of cases with pathology validated via a third source (biobanking or synoptic pathology report). Finally, this group was curated to create
7) a collection of cases that reflected Moffitt's patient distribution.
[00100] The curated collection of reports was randomly divided to create two datasets: 80% (N=8,197) of the reports were used to create the Moffitt model-head training dataset and 20% (N=2,050) were used to create the Moffitt model-head testing dataset. Each dataset was saved in SQuAD format using a custom-written Python program. The training dataset was used for the final stage of Q&A training, and also for ICD-O-3 code predictions, described in more detail below. The testing subset was used to evaluate the impact of Q&A training at the end of each training stage (Figure 7), and also to evaluate the performance of the final pipeline (Figure 1). The final caBERT network shown in Figure 1 connects caBERT instances "A", "B" and "C". This network was tested using the 2,050 sequestered test reports, with ground truth ICD-O-3 codes determined from our Cancer Registry (the Moffitt model-head testing dataset). These test reports were first fed into "A", which produced predicted site and histology descriptions. These descriptions were combined into a single phrase, as described above, which was fed into "B" and "C". "B" then predicted the 5 most probable site codes, while "C" predicted the 5 most probable histology codes. These codes, and their corresponding group codes, were then compared to ground truth values to measure performance.
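The custom conversion program itself is not reproduced here; the sketch below only illustrates the general SQuAD v1.1 JSON layout (context passages, questions, and character-offset answers) that such a program would emit, using hypothetical example values:

```python
import json

def make_squad_entry(report_id, report_text, site_phrase, histology_phrase):
    """Build one SQuAD v1.1-style record for a single pathology report."""
    qas = []
    for qid, question, answer in [
        ("site", "What organ contains the tumor?", site_phrase),
        ("hist", "What is the kind of tumor or carcinoma?", histology_phrase),
    ]:
        start = report_text.find(answer)  # ground-truth answers must be contiguous spans
        if start >= 0:
            qas.append({
                "id": f"{report_id}-{qid}",
                "question": question,
                "answers": [{"text": answer, "answer_start": start}],
            })
    return {"title": str(report_id),
            "paragraphs": [{"context": report_text, "qas": qas}]}

report = "... squamous cell carcinoma involving the right lower lobe of the lung ..."
dataset = {"version": "1.1",
           "data": [make_squad_entry(1, report, "lower lobe of the lung",
                                     "squamous cell carcinoma")]}
print(json.dumps(dataset, indent=2))
```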
[00101] Question-answering accuracy was evaluated using two metrics: exact match and F1-score. Exact match is true if the caBERT-extracted phrase is an identical word-for-word match with the CR phrase, and false otherwise. The average number of true results was calculated across all test samples. The F1-score is a measure of the degree of overlap between words in the caBERT-extracted phrase and the CR phrase. This varies from 0 (no words in common) to 1 (all words in common, but not necessarily in the same order). Each exact match corresponded to an F1-score of 1.0. The average F1-score was calculated across all test samples, then expressed as a percentage.
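A minimal implementation of these two metrics, in the spirit of the standard SQuAD evaluation script (not the exact evaluation code used in this work), could look like:

```python
from collections import Counter

def exact_match(pred, truth):
    # True only if the extracted phrase matches the Cancer Registry phrase word for word.
    return pred.strip().lower() == truth.strip().lower()

def f1_score(pred, truth):
    # Degree of word overlap between the extracted phrase and the registry phrase (0 to 1).
    pred_tokens = pred.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("lung lower lobe", "lower lobe of the lung"))        # False
print(round(f1_score("lung lower lobe", "lower lobe of the lung"), 2)) # 0.75
```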
[00102] After all training stages were complete, they were repeated using the initial
ClinicalBERT model ("ClinicalBERT - Bio + Discharge Summary BERT") with a new randomly initialized
Q&A "head". This allowed for determination of the impact of developing a pathology-tuned BERT
model on extraction accuracy over the baseline accuracy of using ClinicalBERT alone (Figure 7). The training parameters were set to those optimized for ClinicalBERT and reported by Alsentzer et al. [35].
[00103] Figure 7 illustrates the impact of training caBERT, a pathology-specific language model, on question-and-answering performance. We trained two Q&A "heads": one connected to ClinicalBERT and one connected to caBERT. Q&A head training proceeded in 3 stages: 1) train it to answer general English language questions with SQuAD v1.1; 2) train it to answer technical biomedical questions with BioASQ 7b; and, 3) train it to answer pathology questions with a local (Moffitt) model-head training dataset derived from our pathology reports, with ground truth provided by our Cancer Registry. At the end of each stage, the system was tested using the Moffitt model-head testing dataset. This contained 2,050 sequestered pathology reports, each with ground truth. "Exact Match" refers to a perfect, word-for-word match to the Cancer Registry phrase. "F1" is the F1 measure of overlap between words in the extracted phrase and the Cancer Registry phrase. This varies from 0 (no words in common) to 1 (all words in common, but not necessarily in the same order), and is expressed here as a percentage.
[00104] Accommodate Diverse Terminology
[00105] The pathology reports came supplied with ground truth labels for primary site and histology in the form of ICD-O-3 codes [41], which were abstracted by Moffitt CTRs. In order to train the Q&A head, a method was needed to determine the precise location within each pathology report of the actual text corresponding to those labels. This proved non-trivial owing to the rather diverse terminology within each pathology report used to refer to each primary site and histology.
[00106] To address this issue, data from several canonical sources was utilized. The primary source was the ICD-O-3 standards [42], which was used to define the primary "preferred" terminology for each code. Within the ICD-O-3 standards there are 332 unique site codes and 1,143 unique histology codes, each with accompanying preferred terms. Along with the preferred term,
many codes also have an additional set of synonyms, which were stored together with the preferred term in a table of acceptable phrases for each code. In addition to the ICD-O-3 tables, terminology from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program Site/Histology validation table [43] was used along with the SEER site-specific training module website [44].
[00107] In a little more detail, the specific sources used to construct the acceptable phrase tables were as follows. For histology, the ICD-O-3.2 morphology table (version 15112019) [42] was used and supplemented with terms from the SEER Site/Histology validation table (version 20150918), current versions of which are both available in Excel format from their respective websites. For the site terms, the ICD-O-3 mapping table maintained by NCI [45] was used, supplemented again by the SEER Site/Histology validation table. In addition, for the site terms, the tables contained in the SEER site-specific learning module website were also scraped for any new terms.
[00108] The above sources have the benefit of being subject to an international standard and are useful in designating preferred terms for each histology and site code. However, it should be noted that there do exist slight discrepancies between the World Health Organization (WHO) maintained ICD-O-3 coding standards and the North American Association of Central Cancer Registries (NAACCR) coding guidelines, which are followed in the SEER materials. For simplicity, the model described herein is based on the ICD-O-3 standards, but this caveat may prove relevant for any future Cancer Registry applications of the model.
[00109] While these sources provided preferred and alternate terminologies, they did not encompass the full range of language used for every label in our pathology reports — which often included things such as permutations of word orderings as well as acronyms and other typographical differences with the canonical terms. Fortunately, Moffitt CTRs routinely record a short description of the histology and site for every labeled pathology report in a text-based field.
For each histology and site code, these additional phrases were appended to the list of synonyms of the preferred canonical terminology.
[00110] Using the sources described above, two hierarchical tree structures were created as illustrated in Figure 13, Table S2: one to hold histology terms, and one to hold site terms. To construct these trees, the histology and site codes were first grouped into broad morphology and site groups as specified in the ICD-O-3 tables. Within each group is a collection of specific codes, where each code has an associated preferred term, along with a list of synonyms. For efficient searching, these trees were stored as JavaScript Object Notation (JSON) objects, which were imported into Python as nested dictionaries and lists. See Figure 13, Table S2 for an example of one entry in the acceptable phrase table.
[00111] In order to search each pathology report for appropriate spans of text, the trees were used to construct a dictionary with keys given by the specific site and histology codes and values given by the associated acceptable phrase table. Using this dictionary, for each pathology report we implemented a simple search for an exact match from the list of preferred terms and synonyms for the labeled ground truth histology and site code, giving preference to the preferred term, followed by each synonym ordered by length (with the longest matching synonym given preference over the others).
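A simplified version of this lookup, assuming the acceptable-phrase dictionary has been flattened so that each ICD-O-3 code maps to its preferred term followed by its synonyms, might be:

```python
def find_ground_truth_span(report_text, code, acceptable_phrases):
    """Return the first acceptable phrase for `code` found verbatim in the report.

    `acceptable_phrases[code]` is assumed to be a list whose first entry is the
    preferred term; the remaining entries are synonyms.
    """
    text = report_text.lower()
    preferred, *synonyms = acceptable_phrases[code]

    # Preference order: preferred term first, then synonyms from longest to shortest.
    candidates = [preferred] + sorted(synonyms, key=len, reverse=True)
    for phrase in candidates:
        start = text.find(phrase.lower())
        if start >= 0:
            return report_text[start:start + len(phrase)]
    return None
```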
[00112] Even with the diverse terminology within the acceptable phrase table for each code, not every pathology report contained an exact match within the list of allowed terms. For pathology reports that did not contain an exact match, the search was further refined by allowing for matches that only overlapped with a subset of the word tokens within each phrase, again giving preference to the longest synonyms and also employing a set of stop terms to avoid overly general terminology. In order to capture potential word ordering differences, these word token subsets were allowed to be constructed in an arbitrarily permuted order, which was made efficient by utilizing the itertools module available as part of the Python Standard Library.
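When no exact phrase match exists, a fallback along the following lines (a sketch with a hypothetical stop-term list, and with the token-subset refinement omitted for brevity) checks whether the words of a synonym appear in the report in any order:

```python
from itertools import permutations

STOP_TERMS = {"of", "the", "nos"}  # hypothetical stop-term list

def find_permuted_match(report_text, phrase):
    """Return True if some ordering of the phrase's (non-stop) words appears in the report."""
    words = report_text.lower().split()
    tokens = [t for t in phrase.lower().split() if t not in STOP_TERMS]
    n = len(tokens)
    if n == 0:
        return False
    # Try every permuted ordering of the tokens as a contiguous word sequence.
    for perm in permutations(tokens):
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == perm:
                return True
    return False

print(find_permuted_match("carcinoma, squamous cell, of the lung", "squamous cell carcinoma"))
```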
[00113] Using the above procedure, of the 10,247 pathology reports in the Moffitt model-head training and model-head testing datasets, appropriate textual answers were found within 10,096 (98.5%) reports for primary site and within 10,218 (99.7%) reports for histology.
[00114] Accurate ICD-O-3 Codes
[00115] The tumor site and histology phrases extracted by the Q&A model (Figure 5) were used to predict ICD-O-3 site and histology codes. This involved fine-tuning two additional copies of caBERT: one to predict site codes, and the second to predict histology codes. The final system (illustrated in Figure 1) includes a network of 3 caBERT-based models.
[00116] Training the ICD-O-3 Site and Histology Classifiers
[00117] Classifier training parameters are described in more detail in the
Supplementary Material (Figure 12, Table S1). Briefly, each caBERT instance was trained to perform a classification task: given an input phrase, predict the corresponding ICD-O-3 code. Classification tasks were trained using the Moffitt model-head training dataset, which is described above. Training samples were screened to ensure that each contained ground-truth site and histology codes, and at least one site or histology phrase, provided by the Moffitt Cancer Registry. Missing site and histology phrases were filled using SEER preferred terms. These were identified by performing a lookup into the ICD-O-3 table using the site or histology code in the training sample.
[00118] After screening, the ground-truth phrases were labeled, then concatenated to form a single combined phrase. For example, if the CR phrases were "lung lower lobe" and "squamous cell carcinoma", then the combined phrase would be "site: lung lower lobe, histology: squamous cell carcinoma." The combined phrase was used to train both the caBERT site classification head and the caBERT histology classification head. The use of a combined phrase allowed caBERTnet to leverage any correlation between site and histology to improve its performance. For example, astrocytomas are brain tumors. When caBERTnet encountered a previously unseen pathology report during the test phase with the combined phrase "site: frontal,
histology: anaplastic astrocytoma", it correctly predicted a brain site of "C711, frontal lobe", and a histology of "9401/3 astrocytoma anaplastic NOS" (not otherwise specified).
[00119] Testing the ICD-O-3 Site and Histology Code Classifiers
[00120] After training of the site and histology ICD-O-3 code classification heads was complete, caBERTnet performance was evaluated using the sequestered Moffitt model-head test dataset described above. For each test sample, the CR generated site and histology phrases were used to create a "ground truth" combined phrase. Next, the site and histology phrases extracted by the Q&A stage of caBERTnet were used to create a "predicted" combined phrase. The predicted phrase was tokenized to prepare it for input into each classification head. Ground truth site and histology codes from the CR were enumerated, as above, and stored as true labels. Then, the trained site and histology classification heads were used to classify the tokenized "predicted" combined phrases for each test sample. The outputs from this classification, logits, were converted to probabilities, sorted and converted back into ICD-O-3 codes as described above, labeled as "predicted" codes, and saved for further performance analysis.
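The conversion from classifier logits to ranked ICD-O-3 codes can be sketched as follows, assuming PyTorch tensors and a label list indexed in the same order used to train the classification head:

```python
import torch

def top_codes(logits, index_to_code, n=5):
    """Convert one sample's classification logits into the n most probable ICD-O-3 codes."""
    probs = torch.softmax(logits, dim=-1)        # logits -> probabilities
    k = min(n, probs.shape[-1])
    scores, indices = torch.topk(probs, k=k)     # highest-probability classes first
    return [(index_to_code[i], float(s))
            for i, s in zip(indices.tolist(), scores.tolist())]

# Hypothetical 3-class label list; the real heads cover the full ICD-O-3 code sets.
print(top_codes(torch.tensor([2.1, 0.3, -1.0]), ["C343", "C341", "C349"]))
```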
[00121] caBERTnet performance was evaluated four different ways. First, the top-5 accuracies were determined. This metric (or its inverse, the top-5 error rate) is commonly used to evaluate classification algorithms [46]. Briefly, it calculates the average probability that the correct site or histology code occurs within the top-N predicted codes, as N is varied from 1 through 5. Top-1 accuracy, the accuracy of the code scored most highly by the classification algorithm, is equivalent to precision, recall, and F1 score for this classification task.
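The top-N accuracy itself can be computed from the ranked predictions as in the sketch below (the example values are hypothetical):

```python
def top_n_accuracy(ranked_predictions, true_codes, n):
    """Fraction of test samples whose true code appears among the top-n predictions.

    `ranked_predictions` is a list of code lists sorted from most to least probable;
    `true_codes` is the list of ground-truth codes from the Cancer Registry.
    """
    hits = sum(1 for preds, truth in zip(ranked_predictions, true_codes)
               if truth in preds[:n])
    return hits / len(true_codes)

# Example: top-1 vs. top-5 accuracy for two samples.
ranked = [["C343", "C349", "C341", "C340", "C342"],
          ["8070/3", "8052/3", "8051/0", "8083/3", "8071/3"]]
truth = ["C349", "8070/3"]
print(top_n_accuracy(ranked, truth, 1), top_n_accuracy(ranked, truth, 5))  # 0.5 1.0
```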
[00122] Second, the effect of culling, or removing, infrequently occurring codes was examined. The hypothesis was that the caBERT site and histology code classifiers suffer when they do not have enough training data to learn from. Therefore, to examine the effect of training sample size, site and histology codes were iteratively eliminated from the test dataset when the number of examples with a particular code in the training set (alone) fell below a specified threshold. The
threshold was varied from 0 samples (no culling) to 35 samples, in increments of 5 samples. At each culling threshold the top-5 performance of the site and histology classifiers was recalculated.
[00123] Third, the overall mean accuracy of predicting the correct "group" code for each site and histology code was calculated. "Group" codes occur higher up in the ontological tree, and as the name implies, encompass a group or range of related tumor sites or histologies. For example, the site codes "C341 upper lobe, lung" and "C349 lung, NOS" have the same group code: "C34 bronchus and lung". The histology codes "8070/3 squamous cell carcinoma, NOS" and "8051/0 verrucous carcinoma" both have the same group code: "805-808 squamous cell neoplasms". The ICD-O-3 ontology includes 82 group-level site codes covering the 332 fine-grained site codes. It includes 49 group-level histology codes covering the 1,143 fine-grained histology codes.
[00124] Group codes are useful for search and summary applications. The group codes for both the predicted and ground truth fine-grained codes were determined by searching in the tree data structures described above. For each fine-grained code, the search started at that code's location in the tree and proceeded upward. Finally, the mean accuracy of prediction within each group code, for both site and histology predictions, was calculated.
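Given a nested tree of group codes and fine-grained codes, the upward lookup can be sketched as follows; the tree layout shown is an assumption about how the JSON objects were organized:

```python
# Assumed layout: {group_code: {"name": ..., "codes": {fine_code: {...}, ...}}, ...}
SITE_TREE = {
    "C34": {"name": "bronchus and lung",
            "codes": {"C341": {"preferred": "upper lobe, lung"},
                      "C343": {"preferred": "lower lobe, lung"},
                      "C349": {"preferred": "lung, NOS"}}},
}

def group_code_for(fine_code, tree):
    """Walk upward from a fine-grained code to the group code that contains it."""
    for group, entry in tree.items():
        if fine_code in entry["codes"]:
            return group
    return None

print(group_code_for("C343", SITE_TREE))  # -> "C34"
```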
[00125] Results
[00126] The accuracy of both the ClinicalBERT and caBERT Q&A models when tested on the Moffitt model-head testing dataset improved at each Q&A training stage (SQuAD, BioASQ, and Moffitt training data, Figure 7). ClinicalBERT had higher accuracy than caBERT on the Moffitt test set after each of the first two training stages. This suggests that the specialized pathology language tuning reduced caBERT's ability to learn from the SQuAD and BioASQ training datasets. However, caBERT outperformed ClinicalBERT after training on Moffitt pathology reports. This was true both for Exact Match (81.3% for caBERT vs. 76.4% for ClinicalBERT) and F1-score (88.4% for caBERT vs. 85.7% for ClinicalBERT).
[00127] The top-N accuracy of predicting fine-grained site codes ranged from 72.6%
(top-1) to 91.4% (top-5), without culling (Figure 8). The accuracy for predicting histology codes
ranged from 84.0% (top-1) to 94.7% (top-5). Culling 3.5% of the test samples - those site and histology codes with fewer than 5 samples in the training dataset - improved accuracy for site code prediction to 75.0% (+2.4%, top-1) and 93.6% (+2.2%, top-5). The same culling improved the accuracy of histology code prediction to 84.6% (+0.6%, top-1) and 95.4% (+0.7%, top-5).
[00128] Figure 8 illustrates the effect of culling rare tumor sites and histologies on the top-N accuracy of predicting fine-grained ICD-O-3 codes. Site and histology codes were iteratively eliminated from the test dataset when the number of examples with a particular code in the training set (alone) fell below a specified threshold, "E". E was varied from 0 samples (no exclusion) to 35 samples, in increments of 5 samples. The legend indicates the percentage of test samples removed at each threshold, and the corresponding plot color. Without exclusion the accuracy of site code prediction ranged from 72.6% (top-1) to 91.4% (top-5; blue line, round symbols). The accuracy of histology code prediction ranged from 84.0% (top-1) to 94.7% (top-5; blue line, square symbols). Excluding codes with < 5 samples in the training set reduced the test set by 3.5% (72 samples) but increased the top-1 and top-5 site code accuracies to 75.0% and 93.6%, respectively (green line, round symbols) and the histology code top-1 and top-5 accuracies to 84.6% and 95.4%, respectively (green line, square symbols).
[00129] The accuracy of predicting group-level site codes was 93.5% overall (Figure
9). The 10 most commonly represented sites (1. breast; 2. skin; 3. lung and bronchus; 4. prostate gland; 5. corpus uteri; 6. thyroid gland; 7. kidney; 8. large intestine; 9. rectum; and, 10. ovary) included 79.5% of the test samples and had an average accuracy of 97.6%. (Note the term accuracy is used here to refer to the proportion of correct predictions within a fixed ground-truth group-level code, so the average accuracy here can also be referred to as the macro-averaged recall). Accuracies below 80% were observed for: connective and soft tissues (51.7% with 1.4% of samples); stomach (48.1% with 1.3% of samples); small intestine (69.2% with 0.6% of samples); retroperitoneum and peritoneum (72.7% with 0.5% of samples), and other (56.2%, a collection of 27 sites totaling 4.4% of samples).
[00130] Figure 9 illustrates that the overall mean accuracy of predicting tumor site group codes from unstructured, and previously unseen, pathology reports on solid tumors was 93.5%. Ground truth values were provided by Moffitt Cancer Registry Certified Tumor Registrars. In total, 2,009 pathology reports were included in this test dataset. These included 214 unique SEER ICD-O-3 site codes combined into 82 site group codes. Each slice in this figure corresponds to a group code, except "Other", which includes summary values from 27 rare sites (each < 0.4% of cases). Within each slice, the name is indicated in bold, followed by its average prediction accuracy, which also determines the slice color. The 10 most common sites (labeled 1 through 10, excluding "Other") included 79.5% of the test samples and had an average accuracy of 97.6%.
[00131] The accuracy of predicting group-level histology codes was 97.7% overall
(Figure 10). The 10 most commonly represented histologies (1. adenomas and adenocarcinomas; 2. ductal and lobular neoplasms; 3. nevi and melanomas; 4. squamous cell carcinomas; 5. cystic mucinous and serous neoplasms; 6. transitional cell papillomas and carcinomas; 7. gliomas; 8. epithelial neoplasms; 9. complex mixed and stromal neoplasms; and, 10. lipomatous neoplasms) included 96.0% of the test samples and had an average accuracy of 98.0%. An accuracy below 80% was observed for epithelial neoplasms only (77.8% with 0.9% of samples).
[00132] Figure 10 illustrates that the overall mean accuracy of predicting tumor histology group codes from unstructured, and previously unseen, pathology reports on solid tumors was 97.7%. Ground truth values were provided by Moffitt Cancer Registry Certified Tumor Registrars. In total, 2,041 pathology reports were included in this test dataset. These included 193 unique SEER ICD-O-3 histology codes, combined into 27 histology group codes. Each slice in this figure corresponds to a group code, except "Other", which includes summary values from 16 rare histologies (each < 0.5% of cases). Within each slice, the name is indicated in bold, followed by its average prediction accuracy, which also determines the slice color. The 10 most common histologies (labeled 1 through 10, excluding "Other") included 96.0% of the test samples and had an average accuracy of 98.0%.
[00133] Discussion
[00134] The systems and methods described herein make several contributions.
First, caBERT, a BERT-based language model for comprehension of cancer pathology reports, was created. Only one other attempt to create a pathology/oncology-specific BERT language model [32] is known. That study included 290,438 pathology reports created between 2005 and 2015 from a tertiary teaching hospital in the United States. However, only 8,870 of those reports were from patients who had cancer. This study included 275,605 pathology reports from cancer patients diagnosed or treated at Moffitt Cancer Center. The larger corpus of cancer-specific reports should help the system achieve higher performance levels with cancer-related NLP tasks. Additionally, as noted above, that study focused on classification of text into only a few cancer-related categories including: afflicted organ (15 organ groups); disease type (non-cancer, pre-malignant, or cancer); cancer reason (6 histology groups); and, presence of metastatic disease (no, yes: in lymph nodes, and yes: in non-lymph-node tissue).
[00135] Second, a question-and-answer system was created to extract tumor site and histology descriptions from free-text pathology reports. This is the first functional Q&A system for extracting information from pathology reports known to the authors. The Q&A format has two important benefits. First, it provides a user-friendly interface to the information extraction system. Second, incorporation of additional questions into the system is straightforward. With appropriate ground-truth labeled training data, this allows one to extract additional data fields from free-text pathology reports.
[00136] Third, a caBERT network, caBERTnet, was created to predict fine-grained
ICD-O-3 site and histology codes using the answers extracted via the initial Q&A component. There has been considerable prior work using NLP methods to predict ICD-O codes from pathology reports [47-50]. The results of this study are compared to five of the most highly cited recent publications in this area.
[00137] Comparisons with Prior Work
[00138] Much of the prior work has focused on a single anatomical site or histology.
For example, Coden et al. [48] described a system to extract information on tumor site, histology, grade, lymph nodes, tumor size, and reporting date from free-text pathology reports of colon cancer. They achieved precision and recall values between 0.95 and 0.98 for both site and histology ICD-O codes. Their system used a rule-based NLP pipeline, with a large number of controlling parameters that required extensive manual tuning to obtain optimal results. BERT-based NLP systems, in contrast, can both discover and tune the steps of a traditional NLP pipeline automatically [10]. This has significant advantages in terms of reduced effort, but also allows these systems to be quickly re-tuned for datasets from other institutions or different applications via transfer learning [51].
[00139] The BERT system from Ma et al. [32], mentioned above, was used to extract information on 15 "primary cancer sites", 6 "cancer reasons" and 3 "metastatic disease" states. Eleven of their cancer sites corresponded to ICD-O-3 group-level site classifications (e.g., Breast, Lung or Bronchus). The others were broader groupings (e.g., Colorectal, Upper GI, Head and Neck). Four of their cancer reasons corresponded to ICD-O-3 group-level histology classifications (e.g., Melanoma, Soft Tissue / Sarcoma). The remaining two were very broad groupings (e.g., Carcinoma, Blastoma). They achieved accuracies on the full test set of 96.7% and 98.5% for cancer site and cancer reason, respectively. However, they did not predict ICD-O-3 group or fine-grained codes.
[00140] Nguyen et al. [52] developed a system to monitor HL7 electronic pathology reports from across the state of Queensland in Australia. Their system relied on business rules and symbolic reasoning using Systematized Nomenclature of Medicine (SNOMED) codes. They tuned their system using 201 pathology reports, then tested it on 220 unseen reports. They extracted 8 different cancer characteristics from these reports. These characteristics included ICD-O-3 site codes (both fine-grained, Cxxx, and group, Cxx), and histological type. Their dataset included 66 sites and 94 histologies. They achieved F1 scores of 61.1%, 73.2% and 63.7% on fine-grained site codes, group site codes and histology codes, respectively.
[00141] Alawad et al. developed a multi-stage system of deep convolutional neural networks (CNNs) to extract the primary site, histological grade, and laterality from pathology reports [50]. They achieved an F1 score of 77.5% over 12 ICD-O-3 site codes.
[00142] Qiu et al. [47] also developed a deep CNN to extract ICD-O-3 codes from breast and lung cancer pathology reports. Training was based on 942 pathology reports annotated by Cancer Registry experts. The dataset included 7 breast sites and 5 lung sites. Six of the 12 sites had at least 50 samples per code. The remaining 6 sites had between 10 and 50 samples each. They evaluated their system using 10-fold cross-validation. Their overall mean F1 score for predicting tumor sites across all 12 ICD-O-3 codes was 72.2%.
[00143] The study described herein was more comprehensive than prior studies yet obtained similar or better accuracy scores. For example, the dataset in this study included 214 site codes and 193 histology codes. This is far more than any previously reported work and represents the diversity of cases encountered at a large academic National Cancer Institute Designated Comprehensive Cancer Center. Many of the site and histology codes in the training dataset of this study included 5 samples or fewer, while prior studies reported 10 or more training samples per code. Culling codes from the test set with 5 or fewer samples in the training set reduced the size of the test dataset by 3.5% (72 pathology reports). However, this increased our top-1 accuracies on the test data to 75.0% (+2.4%) and 84.6% (+0.6%) for site and histology, respectively. The system described herein also ranks and reports the top-5 predictions for ICD-O-3 site and histology codes. This has useful clinical applications: often there is a degree of uncertainty, or "hedging", in the pathology reports [48]. Listing the top-5 predicted codes could help mitigate this uncertainty. For example, an AI-assisted abstraction system that provides the top-5 predicted ICD-O-3 codes for a particular pathology report (in a pull-down menu, for example) could aid the process of abstraction and enhance the workflow in Cancer Registries. Our top-5 accuracies for fine-grained codes with 5 or more training samples were 93.6% and 95.4% for site and histology, respectively.
[00144] Additional Insights from the Results
[00145] Figure 8 shows the top-5 results at various levels of rare code elimination from the test dataset. This figure provides three additional insights. First, as N increased from 1 to 5, the improvement in accuracy for sites was larger than that for histologies. This suggests that there is more uncertainty predicting site codes than in predicting histology codes. Second, eliminating rare codes, for example going from E=0 (green lines) to E=5 (blue lines), improved site accuracy more than it improved histology accuracy. This suggests that site prediction was more dependent on sample size. Third, site accuracy failed to improve for E > 20. This suggests that 20 samples per code were required to maximize site code prediction accuracy. Similarly, histology accuracy failed to improve for E > 5, suggesting that 5 samples per code were required to maximize histology prediction accuracy.
[00146] The overall mean accuracy for predicting site group codes was 93.5% (Figure
9). Nevertheless, several site group codes had accuracies below 80%. The "Other" group (56.2% accuracy) is discussed here along with two of the site group codes with the lowest accuracies: "C49 Connective, Subcutaneous and Other Soft Tissues" (51.7% accuracy); and "C16 Stomach" (48.1% accuracy).
[00147] The "Other" site category included 27 group codes. Together, these group codes contained 61 fine- grained codes with at least one sample pathology report each in the training dataset, as determined by Moffitt's Cancer Registry. The mean and median number of reports in the training dataset for each fine- grained code in the "Other" category were 6.3 and 4, respectively. Consequently, caBERTnet accuracy on these rare sites was likely limited by the availability of training data.
[00148] caBERTnet failed to predict the CR site code for 9 test cases in the group
"C49 Connective, Subcutaneous and Other Soft Tissues". All of these cases were labeled by Moffitt's CR as soft tissue of the limb, shoulder, hip or pelvis (codes "C491", "C492" and "C495"). In two of these cases the information required to determine the correct site was not present in the pathology report text. In these situations, CTRs will utilize additional information in the patient record.
However, this information was not available to caBERTnet. In the remaining seven cases the pathology report described characteristics of a lesion that had metastasized from the limb, shoulder, hip or pelvis to another location. The CR recorded the originating organ as the tumor site, while caBERTnet predicted the metastasis site.
[00149] caBERTnet failed to predict the CR site code for 13 test cases in the group
"C16 Stomach". All of these cases were labeled by our CR as "C160 Cardia, NOS", and by caBERTnet as lesions of the lower third of the esophagus ("C155", 11 cases) or as overlapping lesions of the esophagus ("C158", 2 cases). The CR labels are due to a rule in the American Joint Committee on Cancer Staging Manual, 8th Edition [53]. On page 189 in that manual it states:
[00150] "Cancers involving the Esophagogastric Junction (EGJ) that have their epicenter within the proximal 2 cm of the cardia (Siewert types I/ll) are to be staged as esophageal cancers. Cancers whose epicenter is more than 2 cm distal from the EGJ, even if the EGJ is involved, will be staged using the stomach cancer TNM (primary tumor, lymph nodes, and distant metastases) and stage groupings (see Chapter 17)."
[00151] The pathology reports on these cases did not mention the spatial location of the tumor sample in relation to the EGJ. Consequently, measurement of the tumor location in pretreatment imaging was required to determine the correct tumor site code.
[00152] The overall mean accuracy for predicting histology group codes was 97.7%
(Figure 10). Only one group code had an accuracy below 80%: "801 - 804 Epithelial Neoplasms, NOS" (77.8%). caBERTnet failed to predict the CR histology code for 6 of these cases. In 4 of these cases the pathology report was based on histology at the metastatic site of disease. Moffitt's CR coded these as the histology of the originating tumor, while caBERTnet predicted the histology at the metastatic site. In one case, the information required to determine the correct histology code was not present in the pathology report and required the CTR to review the patient's medical record.
[00153] The last case was quite interesting since the pathology report included an initial intraoperative diagnosis that disagreed with the final diagnosis. The former indicated a histological type of "spindle cell carcinoma". The latter included the following statements: "the differential diagnosis includes sarcomatoid carcinoma and inflammatory myofibroblastic tumor ... the histomorphologic and immunoprofile support the diagnosis of sarcomatoid carcinoma". Moffitt's CR coded the histology as "8032/3 spindle cell carcinoma, NOS", based on the intraoperative statements, while caBERTnet predicted "8033/3 pseudosarcomatous carcinoma". The phrase "sarcomatoid carcinoma" is an alternate form of the ICD-O-3 preferred phrase "pseudosarcomatous carcinoma". Although caBERTnet's prediction did not agree with our CR, downstream applications may still value automatic prediction and codification of the final diagnosis.
[00154] Applications
[00155] There are multiple applications of caBERTnet. For example, there is a delay of several months between initial pathology report dictation and CTR abstraction, as CTRs typically wait for enough time to have elapsed for the first-course treatment to have been administered, in order to minimize the number of times they have to review the medical record. caBERTnet can be used to extract information from pathology reports in a more timely way, thus facilitating the use of the data for clinical pathway reporting and screening for clinical trials. Furthermore, CTRs only abstract the subset of pathology reports associated with the cancer diagnosis and first-course treatment. caBERTnet can be used to extract information from pathology reports associated with subsequent biopsies and surgeries that would never be manually curated by the CTRs. To facilitate these use cases, tumor site and histology information can be extracted in close to real time and linked to other patient data stored in the analytics platform. These data can be incorporated into real-time dashboards and datasets for a wide range of decision support and research applications.
[00156] caBERTnet may help simplify and accelerate CR workflows. For example, caBERTnet can pre-process pathology reports to identify the top-5 site and histology ICD-O-3 codes
and their corresponding phrases. The phrases could then be highlighted within the report body. Two pull-down menus could be pre-populated with top-5 code predictions: one for site and the other for histology. The CTR could then quickly choose a code from either pull-down menu. If the correct code was not among the top-5, then the CTR would resort to their current workflow - entering this information by hand.
[00157] This disclosure contemplates extending caBERTnet with additional questions and CR-derived ground-truth labels to train it to extract additional tumor characteristics. These include grade, size, involvement of lymph nodes, primary or metastatic status, presence or absence of molecular markers, and others. caBERTnet could also be customized to extract information on hematological malignancies.
[00158] A caBERTnet-assisted CR abstraction tool may also be used for active [54], or human-in-the-loop [55], learning. Briefly, this approach uses human-labeled data to improve the performance of machine learning algorithms over time. It is particularly useful when the subject-matter expert (a CTR in our case) provides labels for cases with low-confidence predictions by the machine learning algorithm. However, it would require careful engineering to avoid common pitfalls and ensure seamless operation [56].
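As an illustration only, the selection step of such a human-in-the-loop workflow could be as simple as routing the lowest-confidence predictions to a CTR for labeling; the function below is a sketch under that assumption, with the review budget as a placeholder.

```python
# Sketch of uncertainty-based case selection for human-in-the-loop labeling.
# "predictions" pairs each report with the probability of its top-1 code;
# the review budget is an assumed placeholder.
def select_for_review(predictions, budget=50):
    """Return the report IDs with the least-confident top-1 predictions."""
    ranked = sorted(predictions, key=lambda item: item[1])  # lowest probability first
    return [report_id for report_id, _ in ranked[:budget]]

# Example: select_for_review([("rpt-001", 0.42), ("rpt-002", 0.97)], budget=1)
# returns ["rpt-001"], the case a CTR would label next.
```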
[00159] This disclosure contemplates that caBERTnet's accuracy on rare sites and histologies can be improved with additional training data. For example, caBERTnet can be distributed for use in other organizations for training on their local pathology reports. Their local CR data can be used for ground-truth labeling of training and test datasets, as done at Moffitt. This process is enhanced by the highly standardized nature of CR data across cancer centers, nationally and even internationally. The standardization of ground-truth labels would also be beneficial for federated multi-task learning [57]. Briefly, this approach distributes copies of a central model to multiple spoke sites, which each tune the central model on local data. Information learned at the local sites (e.g., model weights), but not sensitive local data, is then transmitted back to the central node. There, the information is combined in a pluralistic way that avoids the need to
impose consensus on the data distributions at the spoke sites. This allows for both heterogeneity in local data and broad generalizability of the central model. Such a model could be distributed widely for use in clinical and research applications.
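For illustration, a minimal sketch of the hub-and-spoke aggregation step is shown below, assuming each spoke site returns a tuned copy of the model's weights; plain federated averaging is used here, whereas the pluralistic combination described in [57] may differ.

```python
# Sketch of aggregating locally tuned weights at the central node.
# Plain federated averaging is shown as one possible combination rule.
import copy
import torch

def federated_average(spoke_state_dicts):
    """Average parameter tensors across spoke sites, key by key."""
    averaged = copy.deepcopy(spoke_state_dicts[0])
    for key in averaged:
        stacked = torch.stack([sd[key].float() for sd in spoke_state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged

# central_model.load_state_dict(federated_average(updates_from_spoke_sites))
```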
[00160] Conclusions
[00161] The NLP system described herein, caBERTnet, is built around a network of 3 cooperating BERT instances. On a sequestered test dataset, it produced top-5 accuracies of 93.6% and 95.4% for fine-grained ICD-O-3 site and histology codes, respectively. The test dataset had cases with < 5 training samples removed. This level of accuracy is on par with CTRs operating at state Cancer Registries [58]. To the authors' knowledge, this is the first time an NLP system has achieved CTR-level performance on tumor site and histology prediction across a broad range of ICD-O-3 codes from free-text pathology reports.
[00162] Supplementary Material
[00163] The NLP system described herein was developed using Python 3.6
(Anaconda Inc., Austin, TX) and the PyCharm integrated development environment (JetBrains Inc., Prague, Czech Republic). The PyTorch machine learning framework [59] and the HuggingFace Transformers [60] library were used. A DGX-1 Deep Learning System (Nvidia Inc., San Jose, CA) equipped with 8 x Tesla V100-SXM2-32GB GPUs was employed. This system ran Ubuntu 18.04 LTS and had the CUDA 10.2 (Nvidia Inc.) acceleration framework installed. Training programs were executed inside Docker containers (version 19.03, Docker Inc., Palo Alto, CA).
[00164] In total, 275,605 unique pathology reports were available for processing.
Each pathology report was stored either as a plain-text file, or as an HL7 message. For the latter, we used the "python-hl7" library to extract OBX segments containing diagnoses and comments into plain text files. Prior to training, all plain text files were concatenated into a single training file. A single carriage return marked the end of each report within the training file. This meant that line-by-line processing, an option during training, would cause each report to be processed separately.
Otherwise, all reports would be treated as related paragraphs in a larger contiguous document. Also,
all text was converted to lowercase and cleaned in order to remove obvious PHI along with a set of special characters (including asterisks, commas, underscores, dashes, tildes, brackets, parentheses, carets, quotes, pipes, accents). Five percent of the reports were selected at random and moved from the training file into a separate evaluation file. These reports were used to provide an independent estimate of neural network loss during training each time log results were saved (every 1,000 steps). No other preprocessing was performed.
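A sketch of this preprocessing is shown below, assuming each HL7 message is available as a string; the OBX field index, the cleaning pattern, and the file names are illustrative rather than the exact rules used in this work.

```python
# Sketch of the report preprocessing described above. Field positions, the
# special-character set, and file names are illustrative assumptions.
import re
import hl7

SPECIAL_CHARS = "*,_-~[](){}^\"'|`"

def obx_text(hl7_message: str) -> str:
    """Collect the observation-value fields (OBX-5) holding diagnoses and comments."""
    msg = hl7.parse(hl7_message)
    return " ".join(str(seg[5]) for seg in msg.segments("OBX"))

def clean(text: str) -> str:
    """Lowercase the text and blank out the special characters removed before training."""
    return re.sub("[" + re.escape(SPECIAL_CHARS) + "]", " ", text.lower())

# One cleaned report per line; a single newline ends each report, so
# line-by-line processing treats reports independently during training.
# with open("training.txt", "a") as out:
#     out.write(clean(obx_text(message)) + "\n")
```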
[00165] The default vocabulary provided with the ClinicalBERT model was extended to add words that appeared frequently in our sample of pathology reports. Vocabulary expansion has been shown to improve language model performance for domain-specific tasks. The general process described previously for extending the English-language BERT vocabulary for nuclear science [61] was followed. However, for this application the open-source scispaCy package for biomedical, scientific and clinical text processing [62] and its "en_core_sci_md" language model were used. This model was used to construct a vocabulary of words in the pathology report text file described above. This vocabulary contained a list of all unique words in the file, along with the number of occurrences of each word in the file. Next, the pathology vocabulary was filtered to remove any words that appeared in the ClinicalBERT vocabulary. The filtered pathology vocabulary was then sorted by word count, and the 100 most common words identified. Next, 100 unused entries in the ClinicalBERT vocabulary, each marked with an "[unused]" token, were replaced with these common words.
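The following sketch illustrates this vocabulary swap, assuming the corpus and the WordPiece vocabulary are available as plain-text files; paths are placeholders and tokenization details may differ from the exact procedure used.

```python
# Sketch of the vocabulary extension: count in-domain words with scispaCy,
# drop words already in the ClinicalBERT WordPiece vocabulary, and overwrite
# 100 "[unused]" slots with the most frequent remaining words. Paths are placeholders.
from collections import Counter
import spacy

nlp = spacy.load("en_core_sci_md", disable=["parser", "ner"])

counts = Counter()
with open("pathology_reports.txt") as f:
    for line in f:
        counts.update(tok.text for tok in nlp(line) if tok.is_alpha)

with open("clinicalbert/vocab.txt") as f:
    vocab = [w.rstrip("\n") for w in f]
existing = set(vocab)

new_words = [w for w, _ in counts.most_common() if w not in existing][:100]
unused_slots = [i for i, w in enumerate(vocab) if w.startswith("[unused")]

for slot, word in zip(unused_slots, new_words):
    vocab[slot] = word               # replace an "[unused]" entry with a domain word

with open("clinicalbert/vocab.txt", "w") as f:
    f.write("\n".join(vocab) + "\n")
```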
[00166] The experimental parameters used to train caBERT are listed in Figure 12,
Table S1. Two epochs were used for each of the first two Q&A "head" training stages. However, the final Q&A training stage, on Moffitt data, trained over 3 epochs. This value was derived from hyperparameter tuning using a separate validation dataset that was not part of the training or test set. Moving from 2 to 3 epochs was found to increase accuracy. Moving from 3 to 4 epochs reduced accuracy. Increasing the number of epochs to 5 or 6 reduced accuracy further.
[00167] Training the Site and Histology Code Classifiers: Additional Information
[00168] CR-generated ground-truth phrases were labeled, then concatenated to form a combined phrase. For example, if the CR phrases were "lung lower lobe" and "squamous cell carcinoma" then the combined phrase would be "site: lung lower lobe. histology: squamous cell carcinoma." Labels and punctuation were included in the combined phrase to enhance performance. The standard BERT tokenizer employed in our experiments adds a classification token to each sentence. The two periods in our combined phrase resulted in two classification tokens, allowing more complex relationships between phrases and codes to be represented. The "site:" and "histology:" labels were included to provide clear delineation of the site and histology sub-phrases and, once tokenized, to provide additional focus points for the transformer attention mechanism [63,64]. This can help caBERTnet learn to leverage site and histology phrases that contain similar or complementary words. For example, given the combined phrase "site: upper lobe lung, hist: with bronchioloalveolar features." caBERTnet correctly predicted a histology of "8250/3, lepidic adenocarcinoma" and a site of "C341, upper lobe lung".
[00169] The site classification head was initialized with 332 labels or classes, one for each of the 332 possible ICD-O-3 site codes. We used a linear lookup table to map, or enumerate, each site code onto a unique integer between 0 and 331. The histology classification head initialization was similar, except that it had 1,143 labels, one for each ICD-O-3 histology code.
[00170] Next, the combined phrase for each sample was tokenized and stored.
During training, the input sequence consisted of the tokenized combined phrase. The training label was the enumeration corresponding to the appropriate ICD-O-3 code - the site enumeration for the site classifier, and the histology enumeration for the histology classifier. Cross entropy was used to calculate the loss for each classifier, as is common for multi-label classification tasks [65]. Each trained classifier returned logits: 332 for the site classifier, and 1,143 for the histology classifier. The logits were converted to probabilities using a softmax function, then sorted to identify and return the 5 highest probability enumerations for each classifier. These enumerations were converted back into ICD-O-3 codes by inverting the application of the code-to-enumeration lookup table.
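A sketch of this classification step for the site head is given below; the checkpoint path is a placeholder and the code list is generated synthetically here rather than taken from ICD-O-3, so both are assumptions. The histology head is identical apart from its 1,143 labels.

```python
# Sketch of the site-code classification head: tokenize the combined phrase,
# score 332 enumerated site codes, and return the 5 highest-probability codes.
# The checkpoint path and the placeholder code list are assumptions.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

site_codes = [f"C{i:03d}" for i in range(332)]        # placeholder for the real ICD-O-3 site codes
code_to_idx = {code: i for i, code in enumerate(site_codes)}   # linear lookup table
idx_to_code = {i: code for code, i in code_to_idx.items()}     # inverse mapping

tokenizer = BertTokenizer.from_pretrained("./caBERT-site")
model = BertForSequenceClassification.from_pretrained("./caBERT-site",
                                                      num_labels=len(site_codes))
model.eval()

phrase = "site: lung lower lobe. histology: squamous cell carcinoma."
inputs = tokenizer(phrase, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits                   # shape: (1, 332)
probs = torch.softmax(logits, dim=-1)
top5 = torch.topk(probs, k=5, dim=-1)

predicted_codes = [idx_to_code[i.item()] for i in top5.indices[0]]
```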
[00171] Figure 11, S1: Construction of the acceptable answer phrase table.
Pathology reports used in our training procedures were screened against this table for acceptable "answers" (phrases) corresponding to specific site and histology codes. If a report did not include a match to at least one phrase from this table, it lacked a ground-truth answer and could not be used for training. This table was constructed starting with the ICD-O-3 and SEER coding materials, which provided the codes, a preferred phrase for each code, and a short list of acceptable alternate phrases. We complemented that with additional phrases from the Moffitt Cancer Registry. In this example, the code "8070/3" corresponds to a preferred phrase of "squamous cell carcinoma, NOS" (Not Otherwise Specified). The phrases in the table labeled 1 through 17 were alternates found in the Cancer Registry that also mapped onto code "8070/3". Ultimately, code "8070/3" ended up with a total of 161 alternate phrases. The complete table comprised all 332 site codes and 1,143 histology codes. After restricting to codes contained in the Moffitt dataset, the final table contained 214 site codes and 193 histology codes, each with a preferred phrase and a list of acceptable alternate phrases.
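To illustrate the screening step, the fragment below checks a report against a tiny, hypothetical slice of the phrase table; the phrases shown are examples only, not the full lists used.

```python
# Sketch of screening a report against the acceptable-answer phrase table.
# The table entries below are illustrative fragments, not the full phrase lists.
phrase_table = {
    "8070/3": ["squamous cell carcinoma, nos", "squamous cell carcinoma",
               "epidermoid carcinoma"],
    "C341": ["upper lobe lung", "lung, upper lobe"],
}

def ground_truth_matches(report_text: str) -> dict:
    """Return {code: matched phrase} for each acceptable phrase found verbatim."""
    text = report_text.lower()
    matches = {}
    for code, phrases in phrase_table.items():
        for phrase in phrases:
            if phrase in text:
                matches[code] = phrase
                break
    return matches

# A report with no matches lacks a ground-truth answer and is excluded from training.
```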
[00172] Table S1: Experimental parameters used to train our caBERT instances.
Training and testing leveraged the HuggingFace PyTorch NLP library (v3.03). Training was performed on an Nvidia Inc. DGX-1 with 8 x Tesla V100-SXM2-32GB GPUs. Any parameters not listed used default values.
[00173] Table S2: Example of histology and site term hierarchies. Next to each code in the histology and site trees, we list the preferred term as determined by the ICD-O-3.2 table (for histology) and the SEER site-specific coding manual (for site). For each code, the remaining terms are referred to as synonyms. Histology group and site group labels (located at the base of the tree, in bold) are also associated with each code; these represent the broad classes of histological morphology and site location. These morphology and site groups are the main prediction targets of the caBERT network.
[00174] References
1. Pratt AW, Thomas LB. An information processing system for pathology data.
Pathol Annu. 1966.
2. Dunham GS, Pacak MG, Pratt AW. Automatic indexing of pathology data. J Am Soc Inf Sci. 1978;29: 81-90. doi:10.1002/asi.4630290207
3. Meystre S, Haug PJ. Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J Biomed Inform. 2006;39: 589-599. doi:10.1016/j.jbi.2005.11.004
4. Murff HJ, FitzHenry F, Matheny ME, Gentry N, Kotter KL, Crimin K, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA. 2011;306: 848-855. doi:10.1001/jama.2011.1204
5. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27: 457-470. doi:10.1093/jamia/ocz200
6. Schadow G, McDonald CJ. Extracting structured information from free text pathology reports. AMIA Annu Symp Proc. 2003; 584-588. Available: https://www.ncbi.nlm.nih.gov/pubmed/14728240
7. Burger G, Abu-Hanna A, de Keizer N, Cornet R. Natural language processing in pathology: a scoping review. J Clin Pathol. 2016. doi:10.1136/jclinpath-2016-203872
8. Neveol A, Zweigenbaum P, Section Editors for the IMIA Yearbook Section on Clinical Natural Language Processing. Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook. Yearb Med Inform. 2018;27: 193-198. doi:10.1055/s-0038-1667080
9. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL]. 2018. Available: http://arxiv.org/abs/1810.04805
10. Tenney I, Das D, Pavlick E. BERT Rediscovers the Classical NLP Pipeline. arXiv
[cs.CL]. 2019. Available: http://arxiv.org/abs/1905.05950
11. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv [cs.CL]. 2018. Available: http://arxiv.org/abs/1804.07461
12. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. cs.ubc.ca; 2018. Available: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
13. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog. 2019;1: 9. Available: https://www.ceid.upatras.gr/webpages/faculty/zaro/teaching/alg-ds/PRESENTATIONS/PAPERS/2019-Radford-et-al_Language-Models-Are-Unsupervised-Multitask-%20Learners.pdf
14. Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv [cs.LG]. 2019. Available: http://arxiv.org/abs/1901.02860
15. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1906.08237
16. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self- supervised Learning of Language Representations. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1909.11942
17. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1907.11692
18. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1910.01108
19. Lample G, Conneau A. Cross-lingual Language Model Pretraining. arXiv [cs.CL].
2019. Available: http://arxiv.org/abs/1901.07291
20. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv [cs.LG]. 2019. Available: http://arxiv.org/abs/1910.10683
21. Ju Y, Zhao F, Chen S, Zheng B, Yang X, Liu Y. Technical report on Conversational Question Answering. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1909.10772
22. Han W, Zhang Z, Zhang Y, Yu J, Chiu C-C, Qin J, et al. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. arXiv [eess.AS]. 2020. Available: http://arxiv.org/abs/2005.03191
23. Peng X, Long G, Shen T, Wang S, Jiang J, Zhang C. BiteNet: Bidirectional Temporal Encoder Network to Predict Medical Outcomes. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2009.13252
24. Li F, Jin Y, Liu W, Rawat BPS, Cai P, Yu H. Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study. JMIR Medical Informatics. 2019;7: e14830. Available: https://medinform.jmir.org/2019/3/e14830/
25. Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1904.05342
26. Tahayori B, Chini-Foroush N, Akhlaghi H. Advanced natural language processing technique to predict patient disposition based on emergency triage notes. Emerg Med Australas.
2020. doi:10.1111/1742-6723.13656
27. Blinov P, Avetisian M, Kokh V, Umerenkov D, Tuzhilin A. Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-Based Neural Networks. Artificial Intelligence in Medicine. Springer International Publishing; 2020. pp. 111-121. doi:10.1007/978-3-030-59137-3_11
28. Xu D, Gopale M, Zhang J, Brown K, Begoli E, Bethard S. Unified Medical Language
System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)--based ranking for concept normalization. J Am Med Inform Assoc. 2020;27:
1510-1519. Available: https://academic.oup.com/jamia/article-abstract/27/10/1510/5876963
29. Kandpal P, Jasnani K, Raut R, Bhorge S. Contextual Chatbot for Healthcare Purposes (using Deep Learning). 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4). ieeexplore.ieee.org; 2020. pp. 625-634. doi: 10.1109/WorldS450073.2020.9210351
30. Zhang L, Fan H, Peng C, Rao G, Cong Q. Sentiment Analysis Methods for HPV Vaccines Related Tweets Based on Transfer Learning. Healthc Pap. 2020;8: 307. doi:10.3390/healthcare8030307
31. Shang J, Ma T, Xiao C, Sun J. Pre-training of Graph Augmented Transformers for Medication Recommendation. arXiv [cs.AI]. 2019. Available: http://arxiv.org/abs/1906.00346
32. Ma R, Chen P-HC, Li G, Weng W-H, Lin A, Gadepalli K, et al. Human-centric Metric for Accelerating Pathology Reports Annotation. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1911.01226
33. Grishman R, Hirschman L. Question answering from natural language medical data bases. Artif Intell. 1978;11: 25-43. doi:10.1016/0004-3702(78)90011-5
34. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36: 1234-1240. doi:10.1093/bioinformatics/btz682
35. Alsentzer E, Murphy JR, Boag W, Weng W-H, Jin D, Naumann T, et al. Publicly
Available Clinical BERT Embeddings. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1904.03323
36. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al.
MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. doi:10.1038/sdata.2016.35
37. Rajpurkar P, Zhang J, Lopyrev K, Liang P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv [cs.CL]. 2016. Available: http://arxiv.org/abs/1606.05250
38. Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 2015;16: 138. doi:10.1186/s12859-015-0564-6
39. Yoon W, Lee J, Kim D, Jeong M, Kang J. Pre-trained Language Model for Biomedical Question Answering. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1909.08229
40. Hawhee V, Jones J, Stewart S, Lucia K, Rollison DE. Quality Assurance and Continuing Education: A Cyclic Approach for Maintaining High Quality Data in a High Volume Cancer Registry. Cancer Control. 2020;27: 1073274820946794. doi:10.1177/1073274820946794
41. World Health Organization. International Classification of Diseases for Oncology. World Health Organization; 2013. Available: https://play.google.com/store/books/details?id=g83ljgEACAAJ
42. IACR - ICD-O-3. [cited 9 Nov 2020]. Available: http://www.iacr.com.fr/index.php?option=com_content&view=category&layout=blog&id=100&Itemid=577
43. ICD-O-3 Coding Materials. [cited 9 Nov 2020]. Available: https://seer.cancer.gov/icd-o-3/
44. Site-Specific Modules. [cited 9 Nov 2020]. Available: https://training.seer.cancer.gov/modules_site_spec.html
45. ICD-O-3.1-NCIt Mapping Files. [cited 9 Nov 2020]. Available: https://evs.nci.nih.gov/ftpl/NCI_Thesaurus/Mappings/ICD-0-3_Mappings/About.html
46. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012. pp. 1097-1105.
47. Qiu JX, Yoon H-J, Fearn PA, Tourassi GD. Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports. IEEE J Biomed Health Inform. 2018;22: 244-251. doi:10.1109/JBHI.2017.2700722
48. Coden A, Savova G, Sominsky I, Tanenblatt M, Masanz J, Schuler K, et al. Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model. J Biomed Inform. 2009;42: 937-949. doi:10.1016/j.jbi.2008.12.005
49. Gao S, Young MT, Qiu JX, Yoon H-J, Christian JB, Fearn PA, et al. Hierarchical attention networks for information extraction from cancer pathology reports. J Am Med Inform Assoc. 2018;25: 321-330. doi:10.1093/jamia/ocx131
50. Alawad M, Yoon H, Tourassi GD. Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports. 2018 IEEE EMBS International Conference on Biomedical Health Informatics (BHI). ieeexplore.ieee.org; 2018. pp. 218-221. doi:10.1109/BHI.2018.8333408
51. Pan SJ, Yang Q. A Survey on Transfer Learning. IEEE Trans Knowl Data Eng. 2010;22: 1345-1359. doi:10.1109/TKDE.2009.191
52. Nguyen AN, Moore J, O'Dwyer J, Philpot S. Assessing the Utility of Automatic Cancer Registry Notifications Data Extraction from Free-Text Pathology Reports. AMIA Annu Symp Proc. 2015;2015: 953-962. Available: https://www.ncbi.nlm.nih.gov/pubmed/26958232
53. Amin MB, Edge SB. AJCC Cancer Staging Manual. Springer; 2017.
54. Settles B. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences; 2009. Available: https://minds.wisconsin.edu/handle/1793/60660
55. Holzinger A. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inform. 2016;3: 119-131. doi:10.1007/s40708-016-0042-6
56. Settles B. From theories to queries: Active learning in practice. Active Learning and Experimental Design workshop, in conjunction with AISTATS 2010. jmlr.org; 2011. pp. 1-18. Available: http://www.jmlr.org/proceedings/papers/v16/settles11a/settles11a.pdf
57. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated Learning for Healthcare Informatics. Int J Healthc Inf Syst Inform. 2020; 1-19. doi:10.1007/s41666-020-00082-4
58. Thoburn KK, German RR, Lewis M, Nichols PJ, Ahmed F, Jackson-Thompson J.
Case completeness and data accuracy in the Centers for Disease Control and Prevention's National Program of Cancer Registries. Cancer. 2007;109: 1607-1616. doi:10.1002/cncr.22566
59. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An
Imperative Style, High-Performance Deep Learning Library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; 2019. pp. 8026-8037. Available: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
60. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1910.03771
61. Jain A, Meenachi NM, Venkatraman B. NukeBERT: A Pre-trained language model for Low Resource Nuclear Domain. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2003.13821
62. Neumann M, King D, Beltagy I, Ammar W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1902.07669
63. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al.,
editors. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017. pp.
5998-6008. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
64. Clark K, Khandelwal U, Levy O, Manning CD. What Does BERT Look At? An Analysis of BERT's Attention. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1906.04341
65. Zhang Z, Sabuncu M. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2018. pp. 8778-8788. Available: https://proceedings.neurips.cc/paper/2018/file/f2925f97bc13ad2852a7a551802feea0-Paper.pdf
[00175] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

WHAT IS CLAIMED:
1. A system for automatically extracting information from pathology reports, comprising: a transformer-based machine learning model; and a computing device comprising a processor and a memory operably coupled to the processor, the memory having computer-executable instructions stored thereon that, when executed by the processor, cause the processor to: receive a pathology report; transmit the pathology report to the transformer-based machine learning model; and receive at least one of a site description or a histology description for a disease, wherein the transformer-based machine learning model is configured to extract the at least one of the site description or the histology description from the pathology report.
2. The system of claim 1, wherein the memory has further computer-executable instructions stored thereon that, when executed by the processor, cause the processor to provide a disease diagnosis, provide a treatment option, identify a patient for a clinical trial, or monitor adherence to a treatment pathway.
3. The system of claim 1 or 2, wherein the site description and the histology description are received from the transformer-based machine learning model.
4. The system of any one of claims 1-3, wherein the pathology report is a free-text pathology report.
5. The system of any one of claims 1-4, wherein the pathology report is for a solid tumor.
6. A method, comprising: receiving a pathology report; inputting the pathology report into a transformer-based machine learning model; and extracting, using the transformer-based machine learning model, from the pathology report at least one of a site description or a histology description for a disease.
7. The method of claim 6, further comprising diagnosing the disease based on the at least one of the site description or the histology description.
8. The method of claim 6 or 7, further comprising recommending a treatment for the disease based on the at least one of the site description or the histology description.
9. The method of claim 8, further comprising treating a patient with the recommended treatment for the disease.
10. The method of any one of claims 6-9, wherein the pathology report is a free-text pathology report.
11. The method of any one of claims 6-10, wherein the pathology report is for a solid tumor.
12. A system for predicting site and histology codes for diseases, comprising: a first transformer-based machine learning model configured to extract information from pathology reports; a second transformer-based machine learning model configured to predict site codes for diseases; and a third transformer-based machine learning model configured to predict histology codes for diseases, wherein the first transformer-based machine learning model is configured to: receive a pathology report, and extract information from the pathology report, the extracted information comprising a site description and a histology description for a disease, wherein the second transformer-based machine learning model is configured to: receive the extracted information, and predict a site code for the disease based on the extracted information, wherein the third transformer-based machine learning model is configured to: receive the extracted information, and predict a histology code for the disease based on the extracted information.
13. The system of claim 12, further comprising a computing device comprising a processor and a memory operably coupled to the processor, the memory having computer-
executable instructions stored thereon that, when executed by the processor, cause the processor to transmit the pathology report to the first transformer-based machine learning model.
14. The system of claim 13, wherein the memory has further computer-executable instructions stored thereon that, when executed by the processor, cause the processor to receive the site code predicted by the second transformer-based machine learning model and the histology code predicted by the third transformer-based machine learning model.
15. The system of claim 14, wherein the memory has further computer-executable instructions stored thereon that, when executed by the processor, cause the processor to generate a report comprising the site code and the histology code.
16. The system of any one of claims 12-15, wherein the second transformer-based machine learning model is further configured to predict a top-n most accurate site codes for the disease, wherein n is an integer greater than 1.
17. The system of any one of claims 12-16, wherein the third transformer-based machine learning model is further configured to predict a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
18. The system of any one of claims 12-17, wherein the pathology report is a free-text pathology report.
19. The system of any one of claims 12-18, wherein the pathology report is for a solid tumor.
20. A method, comprising: receiving a pathology report; inputting the pathology report into a first transformer-based machine learning model; extracting, using the first transformer-based machine learning model, information from the pathology report, the extracted information comprising a site description and a histology description for a disease; inputting the extracted information into a second transformer-based machine learning model;
predicting, using the second transformer-based machine learning model, a site code for the disease based on the extracted information; inputting the extracted information into a third transformer-based machine learning model; and predicting, using the third transformer-based machine learning model, a histology code for the disease based on the extracted information.
21. The method of claim 20, further comprising generating a report comprising the site code and the histology code.
22. The method of claim 20 or 21, further comprising diagnosing the disease based on the site code and the histology code.
23. The method of claim 22, further comprising recommending a treatment for the disease based on the site code and the histology code.
24. The method of claim 23, further comprising treating a patient with the recommended treatment for the disease.
25. The method of any one of claims 20-24, wherein the second transformer-based machine learning model predicts a top-n most accurate site codes for the disease, wherein n is an integer greater than 1.
26. The method of any one of claims 20-25, wherein the third transformer-based machine learning model predicts a top-m most accurate histology codes for the disease, wherein m is an integer greater than 1.
27. The method of any one of claims 20-26, wherein the pathology report is a free-text pathology report.
28. The method of any one of claims 20-27, wherein the pathology report is for a solid tumor.
29. A machine learning training method, comprising:
performing unsupervised training on a transformer-based machine learning model with a first dataset, wherein the first dataset comprises a plurality of pathology reports; creating a second dataset comprising a plurality of pathology reports, wherein each of the pathology reports in the second dataset comprises respective ground truth labels for a site description and a histology description for a disease, and wherein creating the second dataset comprises: creating a first hierarchal tree structure configured to hold acceptable site description terminology; creating a second hierarchal tree structure configured to hold acceptable histology description terminology; constructing a dictionary using the first and second hierarchal tree structures; for each of the pathology reports in the second dataset, performing a search, using the dictionary, to identify respective matching text strings within a pathology report that correspond to the respective ground truth labels; and performing supervised training on the transformer-based machine learning model with the second dataset, wherein the second dataset further comprises the respective matching text strings identified by the search.
30. The method of claim 29, wherein a respective matching text string is an exact match for preferred or acceptable diverse terminology contained in the dictionary.