CA3160093A1 - Systems and methods for optimizing multi-stage data processing - Google Patents

Systems and methods for optimizing multi-stage data processing

Info

Publication number
CA3160093A1
Authority
CA
Canada
Prior art keywords
data
inference data
filtered
inference
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3160093A
Other languages
French (fr)
Inventor
Dan Ni Yang
Meline Nikoghossian
Elham Hajarian
Behjat Soltanifar
Karishma Harshal Patel
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toronto Dominion Bank
Original Assignee
Toronto Dominion Bank
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toronto Dominion Bank filed Critical Toronto Dominion Bank
Priority to CA3160093A priority Critical patent/CA3160093A1/en
Publication of CA3160093A1 publication Critical patent/CA3160093A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Systems and methods for optimizing multi-stage machine learning data processing by providing an intermediate processing step that identifies or filters records in voluminous inference data that meet a predetermined threshold. Multiple downstream processes can be optimized by applying multiple thresholds to the voluminous inference data.

Description

SYSTEMS AND METHODS FOR OPTIMIZING MULTI-STAGE DATA PROCESSING
TECHNICAL FIELD
[0001] The disclosed exemplary embodiments relate to computer-implemented systems and methods for processing data.
BACKGROUND
[0002] Many distributed or cloud-based computing clusters provide parallelized, fault-tolerant distributed computing and analytical protocols (e.g., the Apache Spark™ distributed cluster-computing framework, the Databricks™ analytical platform, etc.) that facilitate adaptive training of machine learning or artificial intelligence processes, and real-time application of the adaptively trained machine learning processes or artificial intelligence processes to input datasets or input feature vectors. These processes can involve large numbers of massively parallelizable vector-matrix operations, and the distributed or cloud-based computing clusters often include graphics processing units (GPUs) capable of processing thousands of operations (e.g., vector operations) in a single clock cycle and/or tensor processing units (TPUs) capable of processing hundreds of thousands of operations (e.g., matrix operations) in a single clock cycle.
Use of such distributed or cloud-based computing clusters can therefore accelerate the training and subsequent deployment of the machine-learning and artificial-intelligence processes, and may result in a higher throughput during training and subsequent deployment, when compared to the training and subsequent deployment of the machine-learning and artificial-intelligence processes across the existing computing systems of a particular organization.
SUMMARY
[0003] The following summary is intended to introduce the reader to various aspects of the detailed description, but not to define or delimit any invention.
[0004] In at least one broad aspect, there is provided a system for optimized multi-stage processing, the system comprising: an application database storing inference data from a machine learning model, the inference data having fields and records in tabular form; a computer operatively coupled to the application database, the computer comprising a memory and a processor configured to: retrieve the inference data from the application database; process the inference data to identify records that meet a predetermined threshold; generate filtered inference data, the filtered inference data having fields and records in tabular form, the filtered inference data further having a threshold column, wherein for each record, an indication of whether the respective record meets the predetermined threshold is stored in the threshold column; and store the filtered inference data in the application database.
[0005] In some cases, the system further comprises a downstream application server operatively coupled to the application database, the downstream application server comprising a downstream application server memory and a downstream application server processor configured to retrieve a subset of the filtered inference data from the application database and process the subset of the filtered inference data to generate application data, wherein the subset is determined based on the threshold column.
[0006] In some cases, the processing the subset of the filtered inference data comprises providing the subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the application data.
[0007] In some cases, the system further comprises a user device operatively coupled to the downstream application server, wherein the downstream application server processor is further configured to generate a notification based on the application data, and transmit the notification to the user device.
[0008] In some cases, the processor is further configured to process the inference data to identify records that meet a second predetermined threshold, wherein the filtered inference data has a second threshold column, wherein for each record, an indication of whether the respective record meets the second predetermined threshold is stored in the second threshold column.
[0009] In some cases, the system further comprises a second computer operatively coupled to the application database, the second computer comprising a second memory and a second processor configured to: retrieve a second subset of the filtered inference data from the application database and process the second subset of the filtered inference data to generate second application data, wherein the second subset is determined based on the threshold column.
[0010] In some cases, the system further comprises a preprocessor configured to preprocess input data in tabular form to generate preprocessed data, and a machine learning processor configured to execute the machine learning model on the preprocessed data to generate inference data and store the inference data in the application database, prior to the processor retrieving the inference data from the application database.
[0011] In some cases, the system further comprises a source database and a publishing server operatively coupled to the source database, the publishing server comprising a publishing server memory and a publishing server processor configured to export the input data from the source database to the application database.
[0012] In some cases, the predetermined threshold is determined based on a percentile placement of each of the records in the inference data.
[0013] In some cases, at least one of the fields of the tabular data comprises numerical data, and the percentile placement is based on the numerical data.
[0014] In another broad aspect, there is provided a method of optimized multi-stage processing, the method comprising: receiving, using a first processor, inference data from a machine learning model, the inference data having fields and records in tabular form; processing, using the first processor, the inference data to identify records that meet a predetermined threshold; generating, using the first processor, filtered inference data, the filtered inference data having fields and records in tabular form, the filtered inference data further having a threshold column, wherein for each record, an indication of whether the respective record meets the predetermined threshold is stored in the threshold column; and the first processor storing the filtered inference data in an application database.
[0015] In some cases, the method further comprises a second processor retrieving a subset of the filtered inference data from the application database and processing the subset of the filtered inference data to generate application data, wherein the subset is determined based on the threshold column.
[0016] In some cases, the processing the subset of the filtered inference data comprises providing the subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the application data.
[0017] In some cases, the method further comprises generating a notification based on the application data, and transmitting the notification to a user device.
[0018] In some cases, the method further comprises processing the inference data to identify records that meet a second predetermined threshold, wherein the filtered inference data has a second threshold column, wherein for each record, an indication of whether the respective record meets the second predetermined threshold is stored in the second threshold column.
[0019] In some cases, the method further comprises a second processor retrieving a second subset of the filtered inference data from the application database and processing the second subset of the filtered inference data to generate second application data, wherein the second subset is determined based on the threshold column.
[0020] In some cases, the method further comprises, prior to receiving the inference data from the machine learning model: preprocessing, using a preprocessor, input data in tabular form to generate preprocessed data; and executing, using a machine learning processor, the machine learning model on the preprocessed data to generate the inference data.
[0021] In some cases, the predetermined threshold is determined based on a percentile placement of each of the records in the inference data.
[0022] In some cases, at least one of the fields of the tabular data comprises numerical data, and the percentile placement is based on the numerical data.
[0023] According to some aspects, the present disclosure provides a non-transitory computer-readable medium storing computer-executable instructions. The computer-executable instructions, when executed, configure a processor to perform any of the methods described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The drawings included herewith are for illustrating various examples of articles, methods, and systems of the present specification and are not intended to limit the scope of what is taught in any way. In the drawings:
FIG. 1A is a schematic block diagram of a system for optimized multi-stage processing in accordance with at least some embodiments;
FIG. 1B is a schematic block diagram of a computing cluster of FIG. 1A in accordance with at least some embodiments;
FIG. 2 is a block diagram of a computer in accordance with at least some embodiments;
FIG. 3A is a flowchart diagram of an example method of preprocessing data and executing a machine learning model in accordance with at least some embodiments; and
FIG. 3B is a flowchart diagram of an example method of optimized multi-stage processing in accordance with at least some embodiments.
DETAILED DESCRIPTION
[0025] Many organizations possess and maintain confidential data regarding their operations. For instance, some organizations may have confidential data concerning industrial formulas and processes. Other organizations may have confidential data concerning customers and their interactions with those customers. In a large organization, this confidential data may be stored in a variety of databases, which may have different, sometimes incompatible schemas, fields and compositions. A sufficiently large organization may have hundreds of millions of records across these various databases, corresponding to tens of thousands, hundreds of thousands or even millions of customers. This quantity and scope of confidential data represents a highly desirable source of data to be used as input into machine learning models that can be trained, e.g., to predict future occurrences of events, such as customer interactions or non-interactions.
[0026] With such large volumes of data, it may be desirable to use the computational resources available in distributed or cloud-based computing systems. For instance, machine learning models may be used to generate predictions or inferences regarding these sets of data. In some cases, a first model may be trained to predict a likelihood of an event occurring in the future, given certain existing information relevant to the prospective event. For instance, a first model may be trained to predict the likelihood of a person taking a specific action, given historical or biographical knowledge of the person. Often, the first model will have a large volume of data to consider, both about the person, and also because there may be large numbers of persons in the data for which to generate predictions, resulting in large output inference data. If further processing of the inference data is performed, particularly by further machine learning models, the computational effort required may be compounded.
[0027] The described embodiments generally provide for an intermediate processing step to identify and/or filter records in voluminous inference data that meet a predetermined threshold, such as a given percentile, thereby simplifying and speeding downstream processing by concentrating only on those records that meet the predetermined threshold. Multiple downstream processes can be made more efficient by applying multiple thresholds. In this way, computational resources are conserved and dedicated only to those records that are relevant to the downstream process.
[0028] According to at least some embodiments, when inference data is generated, further processing or prediction may take place, for example in a second model. However, it may be inefficient for the second model to process all of the prediction data generated by the first model. One solution is to prune the volume of information produced by the first model; however, this may not be desirable in all cases. For instance, if the first model feeds a plurality of downstream models, it may be difficult to configure the first model to prune its output to only the data that is relevant to each of the downstream models. The described embodiments provide a way to reduce the computing burden for downstream models or processing, without unnecessarily reducing the size of the output of the first model. Generally, systems and methods are provided for processing prediction data output by a first model to identify records in the prediction data that meet certain threshold criteria, and then generating tagged prediction data output that can be easily filtered by any downstream models or other processing.
[0029] Referring now to FIG. 1A, there is illustrated a block diagram of an example computing system, in accordance with at least some embodiments. Computing system 100 has a source database system 110, an enterprise data provisioning platform (EDPP) 120 operatively coupled to the source database system 110, and a cloud-based computing cluster 130 that is operatively coupled to the EDPP 120.
[0030] Source database system 110 has one or more databases, of which three are shown for illustrative purposes: database 112a, database 112b and database 112c.
Each of the databases of source database system 110 may contain confidential information that is subject to restrictions on export. One or more export modules 114 may periodically (e.g., daily, weekly, monthly, etc.) export data from the databases 112 to EDPP 120. In some cases, the export data may be exported in the form of comma separated value (CSV) data; however, other formats may also be used.
[0031] EDPP 120, which may also be referred to as a publishing server, receives source data exported by the export modules 114 of source database system 110, processes it and exports the processed data to an application database within the cluster 130. For example, a parsing module 122 of EDPP 120 may perform extract, transform and load (ETL) operations on the received source data.
[0032] In many environments, access to the EDPP may be restricted to relatively few users, such as administrative users. However, with appropriate access permissions, data relevant to an application or group of applications (e.g., a client application) may be exported via reporting and analysis module 124 or an export module 126. In particular, parsed data can then be processed and transmitted to the cloud-based computing cluster 130 by a reporting and analysis module 124. Alternatively, one or more export modules 126 can export the parsed data to the cluster 130.
[0033] In some cases, there may be confidentiality and privacy restrictions imposed by governmental, regulatory, or other entities on the use or distribution of the source data. These restrictions may prohibit confidential data from being transmitted to computing systems that are not "on-premises" or within the exclusive control of an organization, for example, or that are shared among multiple organizations, as is common in a cloud-based environment. In particular, such privacy restrictions may prohibit the confidential data from being transmitted to distributed or cloud-based computing systems, where it can be processed by machine learning systems, without appropriate anonymization or obfuscation of PII in the confidential data. Moreover, such "on-premises" systems typically are designed with access controls to limit access to the data, and thus may not be resourced or otherwise suitable for use in broader dissemination of the data. To comply with such restrictions, one or more modules of EDPP 120 may "de-risk" data tables that contain confidential data prior to transmission to cluster 130. This de-risking process may, for example, obfuscate or mask elements of confidential data, or may exclude certain elements, depending on the specific restrictions applicable to the confidential data. The specific type of obfuscation, masking or other processing is referred to as a "data treatment."
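By way of illustration only, a data treatment of the kind described above might obfuscate, mask, and exclude confidential elements along the following lines. This is a minimal sketch in Python (pandas); the column names (customer_id, account_number, date_of_birth) are hypothetical and not part of the specification, and the specific treatments applied in practice would depend on the applicable restrictions.

    import hashlib

    import pandas as pd

    def apply_data_treatment(df: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical de-risking treatment applied before a table
        # leaves the exclusive control of the organization.
        treated = df.copy()
        # Obfuscation: replace a direct identifier with a one-way hash.
        treated["customer_id"] = treated["customer_id"].map(
            lambda v: hashlib.sha256(str(v).encode("utf-8")).hexdigest()[:16]
        )
        # Masking: retain only the last four characters of an account number.
        treated["account_number"] = treated["account_number"].map(
            lambda v: "*" * max(len(str(v)) - 4, 0) + str(v)[-4:]
        )
        # Exclusion: drop an element entirely where restrictions require it.
        return treated.drop(columns=["date_of_birth"])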
[0034] Referring now to FIG. 1B, there is illustrated a block diagram of computing cluster 130, showing greater detail of the elements of the cluster, which may be implemented by computing nodes of the cluster that are operatively coupled.
[0035] Within cluster 130, both data received from reporting and analysis module 124 and data received from export modules 126 is ingested by a data ingestion module 134. Ingested data may be stored, e.g., in a distributed file system 132 such as the Hadoop Distributed File System (HDFS). HDFS can be used to implement one or more application database 139, each of which may contain one or more tables, and which may be partitioned temporally or otherwise.
[0036] For ease of illustration, only one application database 139 is shown, with two temporal partitions 140a and 140b depicted. However, additional application databases may be provided. Generally, the application database stores data, such as inference data from a machine learning model, the inference data having fields and records in tabular form.
[0037] Partition 140a is a current partition, corresponding to a current operation. Partition 140b is a partition corresponding to a previous run. Additional previous partitions may also be present. Each partition corresponds to a run of application data processing, such as an execution of a machine learning model. Data for and from each previous run may be stored in its own partition.
[0038] Each partition of application database 139 has one or more input data tables 142, and one or more inference data tables for storing inference data generated by execution of a machine learning model. Generally, a machine learning model can be executed by a node (or nodes) that has access to the application database. During execution, the node may retrieve information from the application database, perform processing to generate an output inference file, and store the output inference data in the appropriate table of the application database.
[0039] In the illustrated example embodiment, the inference data tables include an inference data table 144, a ground truth table 146, and a predicted activity table 148.
[0040] Input data table 142 contains input data that may be received directly from data ingestion module 134, or from preprocessing module 152 following preprocessing.
Inference data table 144 stores inference data output by a processing node 180 following execution of a first machine learning model. Similarly, ground truth table 146 stores ground truth data output by the processing node 180 following execution of the first machine learning model. However, predicted activity table 148 stores inference data output by a processing node 150 following execution of an activity prediction machine learning model.
[0041] Application database 139 also includes one or more tables that may exist outside of temporal partitions in the distributed file system. In some cases, these tables may be implemented in Apache Hive™.
[0042] Processing node 150 has a preprocessing module 152 for conditioning data from ingestion module 134. For example, in many cases, it may be desirable for input to a machine learning model to be preprocessed or otherwise conditioned prior to ingestion to the model.
[0043] Generally, the preprocessing module preprocesses input data in tabular form to generate preprocessed data. The preprocessing module 152 may perform several functions. It may preprocess data to, e.g., perform input validation, data normalization and filtering to remove unneeded data. For instance, data normalization may include converting alphabetic characters to all uppercase, formatting numerical values, trimming data, etc. In some cases, preprocessing module 152 may apply data treatments. Following preprocessing, the output of preprocessing module 152 may be stored in input data table 142, in the current partition of application database 139. In some cases, the output of preprocessing module 152 may also be stored in a Hive™ table (not shown).
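As a minimal sketch of these preprocessing steps, assuming the tabular data is held in a pandas DataFrame: the required "id" field and the "unused_notes" column are hypothetical names introduced for illustration only.

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Input validation: drop records missing a required identifier.
        out = out.dropna(subset=["id"])
        # Data normalization: trim whitespace and convert alphabetic
        # characters to all uppercase in every string-typed column.
        for col in out.select_dtypes(include="object").columns:
            out[col] = out[col].str.strip().str.upper()
        # Filtering: remove unneeded data (hypothetical column name).
        return out.drop(columns=["unused_notes"], errors="ignore")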
[0044] Processing node 150 has a prediction decision module 154 that retrieves or receives input data from inference data table 144 of the application database, processes the inference data to identify records that meet one or more predetermined thresholds, and generates filtered inference data. For instance, for a first predetermined threshold, processing node 150 processes the inference data to identify records that meet the first predetermined threshold and adds a first threshold column to the filtered inference data, where for each record the field corresponding to the record row and first threshold column serves as an indication of whether the respective record meets the first predetermined threshold. Similarly, for a second predetermined threshold, the processing node processes the inference data to identify records that meet the second predetermined threshold and adds a second threshold column to the filtered inference data, where for each record the field corresponding to the record row and second threshold column serves as an indication of whether the respective record meets the second predetermined threshold. This process may be repeated for as many thresholds as are desired.
[0045] In some cases, the processing for a second or subsequent predetermined threshold may be optimized by processing only those records that meet a prior predetermined threshold.
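By way of illustration only, the thresholding of paragraphs [0044] and [0045] might be sketched in Python (pandas) as follows. The threshold predicates shown anticipate Table 1 below; none of the names are prescribed by the specification.

    import pandas as pd

    def add_threshold_columns(inference, thresholds):
        # "thresholds" maps an added column name to a per-record predicate.
        filtered = inference.copy()
        for name, predicate in thresholds.items():
            met = inference.apply(predicate, axis=1)
            # Binary "Yes"/"No" indication; a numerical score could be
            # stored instead, as noted in paragraph [0048]. Per paragraph
            # [0045], later thresholds could optionally be evaluated only
            # on records that met a prior threshold.
            filtered[name] = met.map({True: "Yes", False: "No"})
        return filtered

    # Example thresholds mirroring Table 1 below.
    filtered = add_threshold_columns(
        pd.DataFrame({"ID": ["0001", "0002", "0003"],
                      "Length": [900, 400, 800],
                      "Weight": [50, 110, 105]}),
        {"Threshold 1": lambda r: r["Length"] > 500,
         "Threshold 2": lambda r: r["Weight"] > 100})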
[0046] Once the processing is completed, the filtered inference data can then be stored together with the original inference data in a single table 172. Alternatively, only those records that satisfy the one or more predetermined thresholds may be stored in a filtered table 174.
[0047] In general, prediction decision module 154 may retrieve inference files generated by a machine learning model and perform analysis to determine whether individual records in the inference files meet one or more threshold requirements. The inference files may be in tabular form, with rows of data representing individual records, and columns corresponding to fields.
[0048] The thresholding process may add one or more additional columns to the inference data table to contain an indication of whether each record has met a particular threshold, and thereby produce filtered inference data. If there is one threshold, then only one column may be added. If there is more than one threshold to be evaluated (e.g., for different downstream purposes), then multiple columns may be added. The value for each record in the threshold column may be binary to indicate whether the threshold of that column has been met. In some cases, a numerical score may be provided instead.
[0049] Various thresholds can be set. For example, a threshold may be an indication of whether each record belongs to a predetermined percentile of a desired metric, that is, whether a record falls within a certain percentile of all records under consideration for some metric, such as a numerical value. In one example, the desired metric may be a credit risk of a user, where each record corresponds to a single user. In such an example, the threshold may be set at the 95th percentile, with the result that records of users who present a credit risk that is in the top 5% of all users will be flagged. The threshold can, of course, be set at different levels. As previously noted, multiple thresholds may also be set (e.g., 50th percentile, 95th percentile, 99th percentile, etc.).
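Continuing the credit-risk example, a percentile threshold of this kind might be computed as in the following sketch; the column name "risk_score" is an assumption for illustration and does not appear in the specification.

    def percentile_flag(inference, column, pct):
        # Flag records whose value for "column" places them at or above
        # the pct-th percentile of all records under consideration.
        cutoff = inference[column].quantile(pct / 100.0)
        return (inference[column] >= cutoff).map({True: "Yes", False: "No"})

    # Flag users whose credit risk is in the top 5% of all users:
    # inference["Threshold 1"] = percentile_flag(inference, "risk_score", 95.0)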
[0050] The thresholding process may involve employing a machine learning model configured to determine the category into which each record falls, or it may involve conventional processing.
[0051] As with the inference data, the filtered inference data generally has fields and records in tabular form. The filtered inference data also has an additional column corresponding to each predetermined threshold, which is used to indicate whether each record (corresponding to a row) meets the respective threshold. Table 1 illustrates an example filtered inference data table.
        Original Inference Data     Added Columns
ID      Length    Weight    Threshold 1          Threshold 2
                            (length > 500)       (weight > 100)
0001    900       50        Yes                  No
0002    400       110       No                   Yes
0003    800       105       Yes                  Yes
Table 1 - Example filtered inference data
[0052] In the example of Table 1, a first predetermined threshold is a length of greater than 500 units. Records 0001 and 0003 meet this threshold, therefore the "Threshold 1" column contains an indication of "Yes" for the rows corresponding to records 0001 and 0003. The row corresponding to record 0002 contains an indication of "No" in the "Threshold 1" column.
[0053] Similarly, a second predetermined threshold is a weight of greater than 100 units. Records 0002 and 0003 meet this threshold, therefore the "Threshold 2" column contains an indication of "Yes" for the rows corresponding to records 0002 and 0003. The row corresponding to record 0001 contains an indication of "No" in the "Threshold 2" column.
[0054] Although "Yes" and "No" indications are shown, any kind of suitable indication may be used, including numerical indications. In some cases, a numerical value within a range may be used to indicate a degree to which a given threshold is met.
[0055] Furthermore, the example above depicts a simple threshold based on a single numerical value. However, the predetermined threshold may be based on a combination of factors, on percentiles, or may be based on meeting a threshold determined by application of a machine learning model.
[0056] Once the thresholding process of prediction decision module 154 is complete, filtered inference data is stored in the application database.
[0057] Subsequently, further processing of the filtered inference data can be performed by other machine learning models or conventional processes. For example, a first process may take the filtered inference data, identify the records that have met a first predetermined threshold, and perform processing on only those records that have met the first predetermined threshold to generate first application data. A second process may take the filtered inference data, identify the records that have met a second predetermined threshold, and perform processing on only those records that have met the second predetermined threshold to generate second application data, and so forth.
[0058] In the illustrated example embodiments, processing node 150 further has a predicted activity module 156 that receives input data from filtered inference data table 174, and processes the filtered inference data to generate predictions regarding activity, such as user activity. The further processing may therefore involve a prediction of an upcoming event or activity that can generate recommendations for users who fall within a particular threshold. These recommendations, once generated, can be sent to a downstream application which distributes them to the users' devices.
[0059] For example, in one example embodiment, the filtered inference data table contains user records that include information such as account balance information, recent account activity, and so forth. The predicted activity module 156 may apply a machine learning model to identify users who are at risk of default for a credit facility. In this case, the filtered inference data would have been filtered by prediction decision module 154 using a threshold that identified users, e.g., in the bottom 5th percentile of account balances, to screen out the remaining 95% of users and thereby reduce the processing load on the predicted activity module 156.
[0060] The predicted activity module 156 outputs its prediction data to activity inference data table 178. Optionally, predicted activity module 156 also may output training and evaluation data to tables 176 and 177, where it can be used to improve the performance of the predicted activity module 156.
[0061] A notification module 158 retrieves the prediction data from activity inference data table 178 and provides it to publish module 160, which can format and transmit the prediction data to a downstream application or server 190, or to user devices, or both.
[0062] Although processing node 150 is shown as one node, in practice its functions may be implemented by several nodes distributed throughout the cluster 130. Similarly, processing node 180 may be implemented by several nodes distributed throughout the cluster 130.
[0063] In some embodiments, a downstream application server may be operatively coupled to cluster 130 and to application database 139. The downstream application server may be a remote server, for example, that is configured to retrieve a subset of the filtered inference data from the application database and process the subset of the filtered inference data to generate application data, wherein the subset is determined based on the threshold column.
[0064] The downstream application server may implement an additional machine learning model, in which case the processing involves providing the subset of the filtered inference data as input to the additional machine learning model, and an output of the additional machine learning model is the application data that is generated.
[0065] In some cases, the downstream application server may further generate notifications based on the application data, and transmit those notifications to one or more user devices.
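For illustration only, a downstream application server of this kind might select its subset via the threshold column and apply the additional machine learning model as sketched below, assuming a scikit-learn-style model object with a predict method; the threshold and feature column arguments are hypothetical names, not part of the specification.

    def run_downstream(filtered, model, threshold_col, feature_cols):
        # Retrieve only the subset determined by the threshold column.
        subset = filtered[filtered[threshold_col] == "Yes"]
        # Provide the subset as input to the additional machine learning
        # model; the model's output is the application data.
        application_data = subset.copy()
        application_data["prediction"] = model.predict(subset[feature_cols])
        return application_data

Because only the flagged records are retrieved and processed, the computational resources of the downstream server are conserved, as described in paragraph [0027].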
[0066] Referring now to FIG. 2, there is illustrated a simplified block diagram of a computer in accordance with at least some embodiments. Computer 200 is an example implementation of a computer such as source database system 110, EDPP 120, processing node 150 or 180 of FIG. 1. Computer 200 has at least one processor 210 operatively coupled to at least one memory 220, at least one communications interface 230, and at least one input/output device 240.
[0067] The at least one memory 220 includes a volatile memory that stores instructions executed or executable by processor 210, and input and output data used or generated during execution of the instructions. Memory 220 may also include non-volatile memory used to store input and/or output data (e.g., within a database) along with program code containing executable instructions.
[0068] Processor 210 may transmit or receive data via communications interface 230, and may also transmit or receive data via any additional input/output device 240 as appropriate.
[0069] In some implementations, computer 200 may be a batch processing system that is generally designed and optimized to run a large volume of operations at once, and is typically used to perform high-volume, repetitive tasks that do not require real-time interactive input or output. Source database system 110 may be one such example. Conversely, some implementations of computer 200 may be interactive systems that accept input (e.g., commands and data) and produce output in real-time. In contrast to batch processing systems, interactive systems generally are designed and optimized to perform small, discrete tasks as quickly as possible, although in some cases they may also be tasked with performing long-running computations similar to batch processing tasks. Processing nodes 150 and 180 are examples of interactive systems, which are nodes in a distributed or cloud-based computing system.
[0070] Referring now to FIG. 3A, there is illustrated a flowchart diagram of an example method of preprocessing data and executing a machine learning model in accordance with at least some embodiments. Method 300-1 may be carried out, e.g., by system 100 of FIG. 1.
[0071] Method 300-1 begins at 310 with a processor, such as a processor of processing node 150, receiving data from distributed file system 132 and/or data ingestion module 134.
[0072] At 315, preprocessing module 152 is executed by the processor to take input data from the distributed file system, e.g., in tabular form, and generate preprocessed data. As described elsewhere herein, the preprocessing module preprocesses input data in tabular form to generate preprocessed data. The preprocessing may involve, e.g., input validation, data normalization and filtering to remove unneeded data. For instance, data normalization may include converting alphabetic characters to all uppercase, formatting numerical values, trimming data, etc. In some cases, preprocessing module 152 may apply data treatments.
[0073] At 320, following preprocessing, the preprocessed data may be stored in an application database of the distributed file system. For example, the preprocessed data may be stored in an input data table 142a, in the current partition of application database 139.
[0074] At 325, a machine learning node, such as processing node 180, retrieves the preprocessed data from the input data table 142a. Optionally, depending on the machine learning model, the processing node 180 may retrieve additional data at 330, such as inference data from a prior run of the machine learning model (e.g., from inference data table 144b).
[0075] At 335, the processing node executes a machine learning model on the retrieved data to generate inference data and, at 340, the output inference data is stored in the appropriate table or tables of the application database, such as inference data table 144a or ground truth table 146a.
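A compact sketch of steps 310 through 340 follows, reusing the preprocess function sketched earlier. The read_table and write_table helpers, the table names, the "id" column, and the scikit-learn-style predict_proba interface are all assumptions introduced for illustration.

    def run_method_300_1(read_table, write_table, model, feature_cols):
        raw = read_table("input_raw")                       # 310: receive data
        preprocessed = preprocess(raw)                      # 315: preprocess
        write_table("input_data_142a", preprocessed)        # 320: store
        inputs = read_table("input_data_142a")              # 325: retrieve
        inference = inputs[["id"]].copy()                   # 335: execute the
        scores = model.predict_proba(inputs[feature_cols])  # machine learning
        inference["score"] = scores[:, 1]                   # model
        write_table("inference_data_144a", inference)       # 340: store output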
[0076] Referring now to FIG. 3B, there is illustrated a flowchart diagram of an example method of optimizing multi-stage data processing in accordance with at least some embodiments. Method 300-2 may be carried out, e.g., by system 100 of FIG. 1. In at least some cases, method 300-2 continues from 335 or 340 of method 300-1.
[0077] Method 300-2 begins at 350 with a processor, such as a processor of processing node 150, executing a prediction decision module 154 to receive inference data from an inference data table, such as inference data table 144a of application database 139 of distributed file system 132. As described elsewhere herein, execution of prediction decision module 154 may cause the processor to retrieve inference files generated by a machine learning model and perform analysis to determine whether individual records in the inference files meet one or more threshold requirements. The inference files may be in tabular form, with rows of data representing individual records, and columns corresponding to fields.
[0078] At 355, the processor processes the inference data to identify records that meet a first predetermined threshold. As described elsewhere herein, the predetermined threshold can be determined based on a percentile placement of each of the records in the inference data, including based on the numerical data.
[0079] At 360, the processor adds a column to the tabular data and, for each record, adds an indication representing whether the first predetermined threshold is met, creating filtered inference data. For instance, for a first predetermined threshold, processing node 150 processes the inference data to identify records that meet the first predetermined threshold and adds a first threshold column to the filtered inference data, where for each record the field corresponding to the record row and first threshold column serves as an indication of whether the respective record meets the first predetermined threshold.
[0080] At 365, the processor determines whether there are additional predetermined thresholds to be evaluated. If there are, the processor returns to 355 to process the second or subsequent predetermined threshold. This process may be repeated for as many thresholds as are desired.
[0081] If there are no more thresholds to evaluate, then at 370, the processor stores the filtered inference data in one or more tables of an application database, such as table 172 or filtered table 174, described elsewhere herein.
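As a sketch only, the two storage options of step 370 (described in paragraph [0046]) might look as follows, with write_table again a hypothetical helper and the table names taken from FIG. 1B.

    def store_filtered(filtered, threshold_cols, write_table):
        # Option 1: the filtered inference data stored together with the
        # original inference data in a single table (table 172).
        write_table("table_172", filtered)
        # Option 2: only records satisfying at least one predetermined
        # threshold stored in a filtered table (filtered table 174).
        passing = filtered[(filtered[threshold_cols] == "Yes").any(axis=1)]
        write_table("filtered_table_174", passing)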
[0082] One or more downstream applications 399 may retrieve the filtered inference data for processing to generate application data.
[0083] In a first downstream application 399a, at 375a, a first processor (e.g., of a processing node such as processing node 150 or 180) retrieves a first subset of the filtered inference data from the application database. The subset may be determined based on the indication in a first threshold column indicative of a first predetermined threshold or, in some cases, a plurality of threshold columns including the first threshold column. In some cases, the entire filtered inference data may be retrieved.
[0084] At 380a, the filtered inference data is processed to generate first application data. Processing may involve providing the subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the first application data.
[0085] At 390a, the first application data is stored, e.g., in one or more tables of the application database. In one example, the first application data is generated by a predicted activity module 156 of processing node 150 that receives input data from filtered inference data table 174, and processes the filtered inference data to generate predictions regarding activity, such as user activity. In this example, the predicted activity module 156 prediction data is stored to activity inference data table 178. Optionally, predicted activity module 156 also may store training and evaluation data in tables 176 and 177, where it can be used to improve the performance of the predicted activity module 156.
[0086] At 395a, the processor may take a further action based on or using the first application data, such as generating one or more notifications based on the first application data and transmitting them to one or more user devices.
[0087] In a second downstream application 399b, at 375b, a second processor (e.g., of a processing node such as processing node 150 or 180) retrieves a second subset of the filtered inference data from the application database. The second subset may be determined based on the indication in a second threshold column indicative of a second predetermined threshold or, in some cases, a plurality of threshold columns including the second threshold column. In some cases, the entire filtered inference data may be retrieved.
[0088] At 380b, the filtered inference data is processed to generate second application data. Processing may involve providing the second subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the second application data.
[0089] At 390b, the second application data is stored, e.g., in one or more tables of the application database.
[0090] At 395b, the processor may take a further action based on or using the second application data, such as generating one or more notifications based on the second application data and transmitting them to one or more user devices.
[0091] There may be a plurality of downstream applications 399 that perform unique processing of the filtered inference data, based on different subsets or combinations of subsets of filtered inference data determined based on the threshold columns.
[0092] Various systems or processes have been described to provide examples of embodiments of the claimed subject matter. No such example embodiment described limits any claim and any claim may cover processes or systems that differ from those described. The claims are not limited to systems or processes having all the features of any one system or process described above or to features common to multiple or all the systems or processes described above. It is possible that a system or process described above is not an embodiment of any exclusive right granted by issuance of this patent application. Any subject matter described above and for which an exclusive right is not granted by issuance of this patent application may be the subject matter of another protective instrument, for example, a continuing patent application, and the applicants, inventors or owners do not intend to abandon, disclaim or dedicate to the public any such subject matter by its disclosure in this document.
[0093] For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth to provide a thorough understanding of the subject matter described herein. However, it will be understood by those of ordinary skill in the art that the subject matter described herein may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the subject matter described herein.
[0094] The terms "coupled" or "coupling" as used herein can have several different meanings depending in the context in which these terms are used. For example, the terms coupled or coupling can have a mechanical, electrical or communicative connotation. For example, as used herein, the terms coupled or coupling can indicate that two elements or devices are directly connected to one another or connected to one another through one or more intermediate elements or devices via an electrical element, electrical signal, or a mechanical element depending on the particular context. Furthermore, the term "operatively coupled" may be used to indicate that an element or device can electrically, optically, or wirelessly send data to another element or device as well as receive data from another element or device.
[0095] As used herein, the wording "and/or" is intended to represent an inclusive-or. That is, "X and/or Y" is intended to mean X or Y or both, for example. As a further example, "X, Y, and/or Z" is intended to mean X or Y or Z or any combination thereof.
[0096] Terms of degree such as "substantially", "about", and "approximately" as used herein mean a reasonable amount of deviation of the modified term such that the result is not significantly changed. These terms of degree may also be construed as including a deviation of the modified term if this deviation would not negate the meaning of the term it modifies.
[0097] Any recitation of numerical ranges by endpoints herein includes all numbers and fractions subsumed within that range (e.g., 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.90, 4, and 5). It is also to be understood that all numbers and fractions thereof are presumed to be modified by the term "about" which means a variation of up to a certain amount of the number to which reference is being made if the result is not significantly changed.
[0098] Some elements herein may be identified by a part number, which is composed of a base number followed by an alphabetical or subscript-numerical suffix (e.g., 112a or 112₁). All elements with a common base number may be referred to collectively or generically using the base number without a suffix (e.g., 112).
[0099] The systems and methods described herein may be implemented as a combination of hardware or software. In some cases, the systems and methods described herein may be implemented, at least in part, by using one or more computer programs, executing on one or more programmable devices including at least one processing element, and a data storage element (including volatile and non-volatile memory and/or storage elements). These systems may also have at least one input device (e.g., a pushbutton keyboard, mouse, a touchscreen, and the like), and at least one output device (e.g., a display screen, a printer, a wireless radio, and the like) depending on the nature of the device. Further, in some examples, one or more of the systems and methods described herein may be implemented in or as part of a distributed or cloud-based computing system having multiple computing components distributed across a computing network. For example, the distributed or cloud-based computing system may correspond to a private distributed or cloud-based computing cluster that is associated with an organization. Additionally, or alternatively, the distributed or cloud-based computing system may be a publicly accessible, distributed or cloud-based computing cluster, such as a computing cluster maintained by Microsoft Azure™, Amazon Web Services™, Google Cloud™, or another third-party provider. In some instances, the distributed computing components of the distributed or cloud-based computing system may be configured to implement one or more parallelized, fault-tolerant distributed computing and analytical processes, such as processes provisioned by an Apache Spark™ distributed, cluster-computing framework or a Databricks™ analytical platform. Further, and in addition to the CPUs described herein, the distributed computing components may also include one or more graphics processing units (GPUs) capable of processing thousands of operations (e.g., vector operations) in a single clock cycle, and additionally, or alternatively, one or more tensor processing units (TPUs) capable of processing hundreds of thousands of operations (e.g., matrix operations) in a single clock cycle.
[00100] Some elements that are used to implement at least part of the systems, methods, and devices described herein may be implemented via software that is written in a high-level procedural language such as an object-oriented programming language. Accordingly, the program code may be written in any suitable programming language such as Python or Java, for example. Alternatively, or in addition thereto, some of these elements implemented via software may be written in assembly language, machine language or firmware as needed. In either case, the language may be a compiled or interpreted language.
[00101] At least some of these software programs may be stored on a storage media (e.g., a computer readable medium such as, but not limited to, read-only memory, magnetic disk, optical disc) or a device that is readable by a general or special purpose programmable device. The software program code, when read by the programmable device, configures the programmable device to operate in a new, specific, and predefined manner to perform at least one of the methods described herein.
[00102] Furthermore, at least some of the programs associated with the systems and methods described herein may be capable of being distributed in a computer program product including a computer readable medium that bears computer usable instructions for one or more processors. The medium may be provided in various forms, including non-transitory forms such as, but not limited to, one or more diskettes, compact disks, tapes, chips, and magnetic and electronic storage. Alternatively, the medium may be transitory in nature such as, but not limited to, wire-line transmissions, satellite transmissions, internet transmissions (e.g. downloads), media, digital and analog signals, and the like. The computer usable instructions may also be in various formats, including compiled and non-compiled code.
[00103] While the above description provides examples of one or more processes or systems, it will be appreciated that other processes or systems may be within the scope of the accompanying claims.
[00104] To the extent any amendments, characterizations, or other assertions previously made (in this or in any related patent applications or patents, including any parent, sibling, or child) with respect to any art, prior or otherwise, could be construed as a disclaimer of any subject matter supported by the present disclosure of this application, Applicant hereby rescinds and retracts such disclaimer. Applicant also respectfully submits that any prior art previously considered in any related patent applications or patents, including any parent, sibling, or child, may need to be re-visited.

Claims (20)

We claim:
1. A system for optimized multi-stage processing, the system comprising:
an application database storing inference data from a machine learning model, the inference data having fields and records in tabular form;
a computer operatively coupled to the application database, the computer comprising a memory and a processor configured to:
retrieve the inference data from the application database;
process the inference data to identify records that meet a predetermined threshold;
generate filtered inference data, the filtered inference data having fields and records in tabular form, the filtered inference data further having a threshold column, wherein for each record, an indication of whether the respective record meets the predetermined threshold is stored in the threshold column; and store the filtered inference data in the application database.
2. The system of claim 1, further comprising a downstream application server operatively coupled to the application database, the downstream application server comprising a downstream application server memory and a downstream application server processor configured to retrieve a subset of the filtered inference data from the application database and process the subset of the filtered inference data to generate application data, wherein the subset is determined based on the threshold column.
3. The system of claim 2, wherein the processing the subset of the filtered inference data comprises providing the subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the application data.
4. The system of claim 2, further comprising a user device operatively coupled to the downstream application server, wherein the downstream application server processor is further configured to generate a notification based on the application data, and transmit the notification to the user device.
5. The system of claim 1, wherein the processor is further configured to:
process the inference data to identify records that meet a second predetermined threshold, wherein the filtered inference data has a second threshold column, wherein for each record, an indication of whether the respective record meets the second predetermined threshold is stored in the second threshold column.
6. The system of claim 5, further comprising a second computer operatively coupled to the application database, the second computer comprising a second memory and a second processor configured to: retrieve a second subset of the filtered inference data from the application database and process the second subset of the filtered inference data to generate second application data, wherein the second subset is determined based on the threshold column.
7. The system of claim 1, further comprising a preprocessor configured to preprocess input data in tabular form to generate preprocessed data, and a machine learning processor configured to execute the machine learning model on the preprocessed data to generate inference data and store the inference data in the application database, prior to the processor retrieving the inference data from the application database.
8. The system of claim 1, further comprising a source database and a publishing server operatively coupled to the source database, the publishing server comprising a publishing server memory and a publishing server processor configured to export the input data from the source database to the application database.
9. The system of claim 1, wherein the predetermined threshold is determined based on a percentile placement of each of the records in the inference data.
10. The system of claim 9, wherein at least one of the fields of the tabular data comprises numerical data, and wherein the percentile placement is based on the numerical data.
11. A method of optimized multi-stage processing, the method comprising:
receiving, using a first processor, inference data from a machine learning model, the inference data having fields and records in tabular form;
processing, using the first processor, the inference data to identify records that meet a predetermined threshold;
generating, using the first processor, filtered inference data, the filtered inference data having fields and records in tabular form, the filtered inference data further having a threshold column, wherein for each record, an indication of whether the respective record meets the predetermined threshold is stored in the threshold column; and the first processor storing the filtered inference data in an application database.
12. The method of claim 11, further comprising a second processor retrieving a subset of the filtered inference data from the application database and processing the subset of the filtered inference data to generate application data, wherein the subset is determined based on the threshold column.
13. The method of claim 12, wherein the processing the subset of the filtered inference data comprises providing the subset of the filtered inference data as input to an additional machine learning model, wherein an output of the additional machine learning model is the application data.
14. The method of claim 12, further comprising generating a notification based on the application data, and transmitting the notification to a user device.
15. The method of claim 11, further comprising:
processing the inference data to identify records that meet a second predetermined threshold, wherein the filtered inference data has a second threshold column, wherein for each record, an indication of whether the respective record meets the second predetermined threshold is stored in the second threshold column.
16. The method of claim 15, further comprising a second processor retrieving a second subset of the filtered inference data from the application database and processing the second subset of the filtered inference data to generate second application data, wherein the second subset is determined based on the threshold column.
17. The method of claim 11, further comprising, prior to receiving the inference data from the machine learning model:
preprocessing, using a preprocessor, input data in tabular form to generate preprocessed data; and executing, using a machine learning processor, the machine learning model on the preprocessed data to generate the inference data.
18. The method of claim 11, wherein the predetermined threshold is determined based on a percentile placement of each of the records in the inference data.
19. The method of claim 18, wherein at least one of the fields of the tabular data comprises numerical data, and wherein the percentile placement is based on the numerical data.
20. A non-transitory computer readable medium storing computer executable instructions which, when executed by a computer processor, cause the computer processor to carry out a method of processing machine learning model predictions, the method comprising:

receiving, using a first processor, inference data from a machine learning model, the inference data having fields and records in tabular form;
processing, using the first processor, the inference data to identify records that meet a predetermined threshold;
generating, using the first processor, filtered inference data, the filtered inference data having fields and records in tabular form, the filtered inference data further having a threshold column, wherein for each record, an indication of whether the respective record meets the predetermined threshold is stored in the threshold column; and the first processor storing the filtered inference data in an application database.
CA3160093A 2022-05-11 2022-05-11 Systems and methods for optimizing multi-stage data processing Pending CA3160093A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3160093A CA3160093A1 (en) 2022-05-11 2022-05-11 Systems and methods for optimizing multi-stage data processing


Publications (1)

Publication Number Publication Date
CA3160093A1 true CA3160093A1 (en) 2023-11-11

Family

ID=88689343

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3160093A Pending CA3160093A1 (en) 2022-05-11 2022-05-11 Systems and methods for optimizing multi-stage data processing

Country Status (1)

Country Link
CA (1) CA3160093A1 (en)
