US20210073649A1 - Automated data ingestion using an autoencoder - Google Patents

Automated data ingestion using an autoencoder

Info

Publication number
US20210073649A1
Authority
US
United States
Prior art keywords
autoencoder
subset
values
numeric
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/101,517
Inventor
Austin Grant Walters
Jeremy Edward Goodsitt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital One Services LLC
Original Assignee
Capital One Services LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital One Services LLC
Priority to US17/101,517
Assigned to CAPITAL ONE SERVICES, LLC. Assignment of assignors interest (see document for details). Assignors: GOODSITT, JEREMY EDWARD; WALTERS, AUSTIN GRANT
Publication of US20210073649A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • Embodiments disclosed herein generally relate to deep learning, and more specifically, to training an autoencoder to perform automated data ingestion.
  • Input data is often received in different formats.
  • Data engineering involves converting the format of input data to a desired format.
  • data engineering is conventionally a manual process which requires significant time and resources.
  • data engineering solutions are not portable, such that a new solution needs to be manually designed for different types of input data and/or desired output formats.
  • Embodiments disclosed herein provide systems, methods, articles of manufacture, and computer-readable media for training an autoencoder to perform automated data ingestion.
  • the autoencoder may receive streaming data comprising numeric values during a first time interval.
  • the autoencoder may determine, during the first time interval, a maximum value and a minimum value of a first subset of the numeric values.
  • the autoencoder may then process, during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
  • FIG. 1 illustrates an embodiment of a system that uses an autoencoder to perform automated data ingestion.
  • FIG. 2 illustrates an embodiment of training an autoencoder to perform automated data ingestion.
  • FIG. 3 illustrates an embodiment of a processing pipeline.
  • FIG. 4 illustrates an embodiment of a first logic flow.
  • FIG. 5 illustrates an embodiment of a second logic flow.
  • FIG. 6 illustrates an embodiment of a computing architecture.
  • Embodiments disclosed herein provide techniques to use an autoencoder to automatically format input data according to a desired output format.
  • a statistical model (or other machine learning (ML) model) may format the data sampled from the dataset, thereby generating a formatted output dataset.
  • a training dataset may then be used to train the autoencoder to format data.
  • the training dataset may include the data sampled from the dataset as an input dataset and the formatted output dataset generated by the statistical model as an output dataset.
  • the training dataset may include overlapping “chunks” such that the same data may appear in two or more chunks.
  • the autoencoder attempts to format the input dataset, thereby generating an output.
  • the statistical model may analyze the output of the autoencoder to determine an accuracy of the autoencoder.
  • the determined accuracy of the autoencoder may then be used to train the values of a latent vector of the autoencoder.
  • the training of the autoencoder may be repeated until the accuracy of the autoencoder exceeds a threshold.
  • the trained autoencoder may then be used for data ingestion, e.g., by attaching the trained autoencoder to all new models and/or datasets.
  • embodiments disclosed herein provide techniques to automatically format data using an autoencoder.
  • the autoencoder may be trained to appropriately format all data, even if the data has not been previously analyzed.
  • embodiments disclosed herein provide scalable solutions that can be ported to any type of data processing pipeline, regardless of any particular input and/or output data formats. Further still, embodiments disclosed herein may train the autoencoder using only the training dataset and/or a portion thereof.
  • FIG. 1 depicts an exemplary system 100 , consistent with disclosed embodiments.
  • the system 100 includes a computing system 101 .
  • the computing system 101 is representative of any type of computing system, such as servers, compute clusters, desktop computers, smartphones, tablet computers, wearable devices, laptop computers, workstations, portable gaming devices, virtualized computing systems, and the like.
  • the computing system 101 includes a processor 102 , a memory 103 , and may further include a storage, network interface, and/or other components not pictured for the sake of clarity.
  • the memory 103 includes an autoencoder 104 , a machine learning (ML) model 105 , a statistical model 106 , and data stores of training data 107 and formatted data 108 .
  • the autoencoder 104 is representative of any type of autoencoder, including variational autoencoders, denoising autoencoders, sparse autoencoders, and contractive autoencoders.
  • an autoencoder is a type of artificial neural network that learns data codings (e.g., the latent vector 109 ) in an unsupervised manner.
  • Values of the latent vector 109 may be learned (or refined) during training of the autoencoder 104 , thereby training the autoencoder 104 to format input data according to a desired output format (which may include formatting according to a desired operation).
  • the trained autoencoder 104 may approximate any function and/or operation applied to input data.
  • the autoencoder 104 may convert input data comprising integer values to floating point values.
  • the autoencoder 104 may perform any encoding operation, which may include, but is not limited to, normalizing values of input data, computing a z-score (e.g., a signed value reflecting a number of standard deviations the value of input data is from a mean value) for values of input data, standardizing values of input data, recasting values of input data, filtering the input data according to one or more filtering criteria, fuzzing of the values of input data, applying statistical filters to the input data, and the like.
  • the use of any particular type of encoding operation as a reference example herein should not be considered limiting of the disclosure, as the disclosure is equally applicable to all types of encoding operations.
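  • As a non-limiting illustration, two of the encoding operations named above (normalization and z-score computation) can be written directly. A trained autoencoder 104 would approximate such functions rather than call them. The function names and sample values below are assumptions made for this sketch and do not appear in the disclosure.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly into [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Signed number of (population) standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical integer input column.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = min_max_normalize(raw)
scores = z_scores(raw)
```

  • In this sketch the integer inputs are recast to floating point values as a side effect of the division, which also mirrors the integer-to-floating-point conversion mentioned above.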
  • the use of the term “vector” to describe the latent vector 109 should not be considered limiting of the disclosure, as the latent vector 109 is also representative of a matrix.
  • the training data 107 comprises columnar and/or row-based data, e.g., one or more columns of integer values, one or more columns of floating point values, etc.
  • the training data 107 may be representative of multiple datasets of any size.
  • the training data 107 may include 50 column-based datasets, where each dataset has thousands of records (or more).
  • the training data 107 may be segmented (e.g., the training data 107 may comprise a plurality of segments of one or more datasets). In one embodiment, each segmented dataset of training data 107 is overlapping, such that at least one value of the training data 107 appears in at least two segments.
  • For example, a first dataset may include rows 0-1000 of the training data 107, and a second dataset may include rows 900-2000 of the training data 107, such that rows 900-1000 appear in both the first and second datasets.
  • the size of the datasets may be learned based on hyperparameter tuning.
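  • The overlapping segmentation described above may be sketched as follows. The helper name `overlapping_segments`, the fixed segment size, and the fixed overlap are assumptions for illustration only; the disclosure permits segments of differing sizes and contemplates learning the size through hyperparameter tuning.

```python
def overlapping_segments(rows, segment_size, overlap):
    """Split rows into consecutive segments that share `overlap` rows with their neighbor."""
    if segment_size <= 0 or not 0 <= overlap < segment_size:
        raise ValueError("require segment_size > 0 and 0 <= overlap < segment_size")
    step = segment_size - overlap
    segments = []
    start = 0
    while start < len(rows):
        segments.append(rows[start:start + segment_size])
        if start + segment_size >= len(rows):
            break  # the last segment already reaches the end of the data
        start += step
    return segments


# Hypothetical dataset of 2,000 row indices.
rows = list(range(2000))
segments = overlapping_segments(rows, segment_size=1000, overlap=100)
shared = set(segments[0]) & set(segments[1])
```

  • With these assumed parameters, rows 900 through 999 appear in both the first and second segments, so at least one value of the training data appears in at least two segments.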
  • the ML model 105 and the statistical model 106 are representative of any type of computing model, such as deep learning models, machine learning models, neural networks, classifiers, clustering algorithms, support vector machines, and the like.
  • the ML model 105 and the statistical model 106 comprise the same model.
  • the ML model 105 (and/or the statistical model 106 ) may be configured to transform (or encode) input data to a target format, thereby generating an output dataset.
  • the ML model 105 may be configured to normalize integer values of input data to floating point values, and the output dataset may comprise the floating point values.
  • the ML model 105 may compute an output dataset for each input dataset of training data 107 .
  • An input dataset and corresponding formatted output dataset generated by the ML model 105 may be referred to as a “training sample” herein.
  • the autoencoder 104 may then be trained using the input dataset of one or more training samples. Generally, the autoencoder 104 may receive the input dataset as input, convert the dataset to an encoded format using the values of the latent vector 109 , and decode the converted dataset. In some embodiments, the converted dataset generated by the autoencoder 104 may then be compared to the formatted data of the training sample generated by the ML model 105 . The comparison may include determining a difference and/or least squared error of the converted dataset generated by the autoencoder 104 and the formatted data of the training sample generated by the ML model 105 . Doing so generates one or more values reflecting an accuracy of the autoencoder 104 . In some embodiments, the accuracy may comprise a loss of the autoencoder 104 .
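  • The comparison described above may, in one hypothetical sketch, be computed as a mean of squared differences between the output of the autoencoder 104 and the formatted data generated by the ML model 105. The function name and the sample values are assumptions and are not taken from the disclosure.

```python
def mean_squared_error(predicted, reference):
    """Average squared difference between two equally sized sequences of numbers."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical formatted data of a training sample (stand-in for the ML model 105 output).
reference_output = [0.0, 0.25, 0.5, 1.0]
# Hypothetical converted dataset produced by the autoencoder for the same input.
autoencoder_output = [0.1, 0.25, 0.4, 1.0]

loss = mean_squared_error(autoencoder_output, reference_output)
```

  • A smaller value indicates closer agreement, which is why such a value may serve as a loss of the autoencoder 104.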
  • the ML model 105 and/or the statistical model 106 may receive the converted data generated by the autoencoder 104 to determine the accuracy of the autoencoder 104 relative to the data of the training sample generated by the ML model 105 .
  • the ML model 105 may process the converted data generated by the autoencoder 104 and compare the output to the formatted data of the training sample.
  • the statistical model 106 may classify the converted data generated by the autoencoder 104 and compare the classification to a classification of the input dataset of the training sample.
  • For example, the statistical model 106 may classify the formatted output generated by the autoencoder 104 as a dataset of credit card data.
  • If the classification for the input dataset is also credit card data, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104. If, however, the classification for the input dataset is for purchase order amounts, the statistical model 106 may compute a relatively low accuracy value for the autoencoder 104. In one embodiment, the statistical model 106 may compute the accuracy value for the autoencoder 104 based on a distance between the classifications in a data space, where the accuracy increases as the distance between the classifications decreases.
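  • One hypothetical way to realize a distance-based accuracy value is sketched below. The coordinates assigned to each classification and the mapping 1 / (1 + distance) are assumptions chosen for illustration; the disclosure states only that the accuracy increases as the distance between the classifications decreases.

```python
import math


def classification_accuracy(output_class_point, input_class_point):
    """Map the Euclidean distance between two classifications to a score in (0, 1]."""
    distance = math.dist(output_class_point, input_class_point)
    return 1.0 / (1.0 + distance)


# Hypothetical positions of two classifications in a two-dimensional data space.
credit_card_data = (0.9, 0.1)
purchase_order_amounts = (0.2, 0.8)

matching = classification_accuracy(credit_card_data, credit_card_data)
mismatched = classification_accuracy(credit_card_data, purchase_order_amounts)
```

  • Identical classifications yield the maximum score of 1.0 in this sketch, and the score falls toward zero as the classifications move apart.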
  • the determined accuracy of the autoencoder 104 may then be used to refine the values of the latent vector 109 and/or other components of the autoencoder 104 via a backpropagation operation.
  • the backpropagation may be performed using any feasible backpropagation algorithm.
  • the values of the latent vector 109 and/or the other components of the autoencoder 104 are refined based on the accuracy of the formatted output generated by the autoencoder 104 . Doing so may result in a latent vector 109 that most accurately maps the input data to the desired output format.
  • the training of the autoencoder 104 may be repeated any number of times until the accuracy of the autoencoder 104 exceeds a threshold (and/or the loss of the autoencoder 104 is below a threshold).
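  • The repeated refinement described above may be illustrated with a deliberately small model: a linear encoder and decoder joined by a single latent value, trained by backpropagation until the loss falls below a threshold. The architecture, initial values, learning rate, threshold, and target operation (scaling integers 0 through 10 into the range 0 to 1) are all assumptions for this sketch and are not the disclosed implementation.

```python
def forward(params, x):
    """Encode x into a one-value latent code, then decode it."""
    w_enc, b_enc, w_dec, b_dec = params
    latent = w_enc * x + b_enc
    return latent, w_dec * latent + b_dec


def loss_and_gradients(params, inputs, targets):
    """Mean squared error and its gradient, obtained by backpropagation."""
    w_dec = params[2]
    n = len(inputs)
    loss = 0.0
    grads = [0.0, 0.0, 0.0, 0.0]
    for x, target in zip(inputs, targets):
        latent, output = forward(params, x)
        error = output - target
        loss += error * error / n
        d_output = 2.0 * error / n      # derivative of the loss w.r.t. the output
        d_latent = d_output * w_dec     # chain rule back through the decoder
        grads[0] += d_latent * x        # encoder weight
        grads[1] += d_latent            # encoder bias
        grads[2] += d_output * latent   # decoder weight
        grads[3] += d_output            # decoder bias
    return loss, grads


def train(inputs, targets, loss_threshold=1e-4, learning_rate=0.01, max_epochs=5000):
    """Repeat gradient-descent updates until the loss is below the threshold."""
    params = [0.5, 0.0, 0.5, 0.0]  # assumed starting values
    for epoch in range(1, max_epochs + 1):
        loss, grads = loss_and_gradients(params, inputs, targets)
        if loss < loss_threshold:
            break
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    else:
        # Epoch budget exhausted: report the loss of the final parameters.
        loss, _ = loss_and_gradients(params, inputs, targets)
    return params, loss, epoch


inputs = list(range(11))
targets = [x / 10 for x in inputs]  # stand-in for formatted data from the ML model
params, final_loss, epochs = train(inputs, targets)
```

  • A practical autoencoder would have many more weights and a wider latent representation; the sketch is intended only to show the stopping condition and the role of backpropagation.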
  • the autoencoder 104 may then be configured to ingest (e.g., format) data to be processed in any processing platform, such as a streaming data platform, thereby generating the formatted data 108 .
  • the autoencoder 104 may perform estimated ingestion operations. For example, the autoencoder 104 may receive streaming data over a time interval. If the streaming data is of a reasonable size, the autoencoder 104 may perform a predictive formatting operation on the streaming data.
  • the autoencoder 104 may determine the minimum and maximum values therein. Doing so may allow the autoencoder 104 to normalize the streaming data in a predictive fashion in a single pass. Stated differently, the autoencoder 104 may normalize the streaming data in a single processing phase, rather than having to process the streaming data twice (e.g., to discover the minimum/maximum values, then normalize the data based on the identified minimum/maximum values).
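  • The single-pass, predictive normalization described above may be approximated by the following sketch, in which the minimum and maximum are estimated from a first subset of the stream and applied to the values that follow. The generator name, the `sample_size` parameter, and the sample values are assumptions. The sketch uses explicit arithmetic where the disclosure contemplates a trained autoencoder 104.

```python
from itertools import islice


def predictive_normalize(stream, sample_size):
    """Estimate min/max from the first `sample_size` values, then normalize the rest."""
    iterator = iter(stream)
    first_subset = list(islice(iterator, sample_size))
    if not first_subset:
        return  # nothing to estimate from
    lo, hi = min(first_subset), max(first_subset)
    span = (hi - lo) or 1.0  # avoid dividing by zero when the subset is constant
    for value in iterator:
        # Each later value is read exactly once; no second pass over the stream.
        yield (value - lo) / span


# Hypothetical streaming values received during one time interval.
stream = [10, 20, 30, 15, 25, 40]
second_subset_normalized = list(predictive_normalize(stream, sample_size=3))
```

  • Because the bounds are estimates, a later value outside the observed range (such as 40 in this example) is mapped outside the range 0 to 1, which is the trade-off accepted for avoiding a second pass.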
  • FIG. 2 is a schematic 200 illustrating an embodiment of training the autoencoder 104 to perform automated data ingestion.
  • one or more datasets of training data 107 may be segmented.
  • the training data 107 may include row-based data and/or column-based data.
  • the segments may have a minimum size (e.g., 10,000 rows and/or columns of data).
  • one or more of the segments may be modified, for example, by dropping one or more columns of data, formatting one or more columns of data, and the like.
  • Doing so may produce varying segments of training data 107 , e.g., where a first segment has had a column dropped, a second segment has had a column formatted, a third segment has had one column dropped and one column formatted, and a fourth segment has not been modified.
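  • The four kinds of segment described above may be produced, in one hypothetical sketch, as follows. The column names, the dictionary-of-lists layout, and the helper names are assumptions made for illustration.

```python
def drop_column(segment, name):
    """Return a copy of the segment without the named column."""
    return {col: list(vals) for col, vals in segment.items() if col != name}


def format_column(segment, name, formatter):
    """Return a copy of the segment with `formatter` applied to the named column."""
    return {
        col: [formatter(v) for v in vals] if col == name else list(vals)
        for col, vals in segment.items()
    }


# Hypothetical column-based segment of training data.
segment = {"amount": [10, 20, 30], "count": [1, 2, 3]}

variants = [
    drop_column(segment, "count"),                                   # column dropped
    format_column(segment, "amount", float),                         # column formatted
    format_column(drop_column(segment, "count"), "amount", float),   # both
    {col: list(vals) for col, vals in segment.items()},              # unmodified copy
]
```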
  • the ML model 105 may process the segmented training data 107 to format the segmented training data 107 according to one or more formatting rules and/or operations. For example, the ML model 105 may normalize, convert, and/or filter the segmented training data 107 .
  • one or more output datasets generated by the ML model 105 at block 202 may be stored. The output datasets may include each segment of training data 204 and the corresponding formatted data 205 generated by the ML model 105 at block 202 .
  • For example, the segmented training data 204 may include 1,000 segments, and the formatted data 205 may include 1,000 formatted datasets generated by the ML model 105 by processing each segment at block 202.
  • In that case, each of 1,000 training samples may comprise a segment of the segmented training data 204 as input data and the corresponding formatted data 205 generated by the ML model 105.
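  • The pairing of each segment with its formatted counterpart may be sketched as follows. The `TrainingSample` type, the stand-in `reference_format` function (which takes the place of the ML model 105), and the generated segments are assumptions for illustration.

```python
from typing import List, NamedTuple


class TrainingSample(NamedTuple):
    """One segment of training data paired with its formatted output."""
    input_data: List[int]
    formatted_data: List[float]


def reference_format(segment):
    """Stand-in for the ML model: recast integer values as floating point values."""
    return [float(value) for value in segment]


# Hypothetical segmented training data: 1,000 small segments.
segments = [[i, i + 1, i + 2] for i in range(1000)]
samples = [TrainingSample(seg, reference_format(seg)) for seg in segments]
```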
  • overlapping datasets may be generated using the training samples of segmented training data 204 and formatted data 205 .
  • the 1,000 training samples may be modified to include overlapping values.
  • the autoencoder 104 may be trained using the overlapping datasets generated at block 206 .
  • the autoencoder 104 may process each input dataset (e.g., the segmented training data 204 ) of each training sample, e.g., to convert each of the input datasets of the training samples to a desired output format and/or based on a predefined operation.
  • the accuracy of the autoencoder 104 is determined based on the output generated by the autoencoder 104 at block 207 .
  • a difference and/or a least squared error may be computed between the output of the autoencoder 104 based on the segmented training data 204 and the corresponding formatted data 205 generated by the ML model 105 .
  • the difference and/or least squared error may be used as accuracy values for the autoencoder 104 .
  • the statistical model 106 may classify the output generated by the autoencoder 104 at block 207 and compare the generated classification to a classification of the corresponding segmented training data 204. For example, if the classification of the output generated by the autoencoder 104 at block 207 for a first overlapping segment of training data 204 matches a classification generated for the formatted data 205 corresponding to the first overlapping segment of training data 204, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104 for the first training sample.
  • the determined accuracy may be used to train the autoencoder 104 via a backpropagation operation. Doing so refines the values of the autoencoder 104 , including the latent vector 109 , based on the determined accuracy values for the autoencoder 104 and/or a loss of the autoencoder 104 .
  • the accuracy at block 208 may be determined for each training sample. Therefore, continuing with the previous example, the accuracy for each of the 1,000 training samples processed by the autoencoder 104 may be determined at block 208 .
  • Each of the 1,000 accuracy values may be provided to the autoencoder 104 to update the weights of the autoencoder 104 , e.g., via 1,000 (or fewer) backpropagation operations.
  • FIG. 3 illustrates an embodiment of a processing pipeline 300 .
  • streaming input data is received in the processing pipeline 300 .
  • the streaming input data may be any type of data, such as transaction data, stock ticker data, financial data, sensor data, and the like.
  • the streaming input data includes numeric values in one or more rows and/or columns.
  • the streaming input data may have varying types and/or formats which may need to be modified to be compatible with various components of the processing pipeline. Therefore, at block 302 , the trained autoencoder 104 may process the streaming input data.
  • the trained autoencoder 104 may format the streaming input data according to a desired output format, normalize the values of the streaming input data, compute a z-score for the streaming input data, standardize values of the streaming input data, recast values of the streaming input data, filter the streaming input data according to one or more filtering criteria, fuzz the values of the streaming input data, and the like.
  • one or more components of the processing pipeline process the output generated by the autoencoder 104 at block 302 , e.g., the formatted and/or converted streaming input data.
  • the autoencoder 104 may process the streaming data in a single pass, e.g., by providing estimated normalization, recasting, etc., and without having to process the streaming data in two or more passes.
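  • The flow of the processing pipeline 300 may be sketched with generators, so that each record is formatted once as it arrives and then handed to a downstream component. The function names, the stand-in formatter used in place of the trained autoencoder 104, and the running-total consumer are assumptions for illustration.

```python
def ingest(stream, formatter):
    """Format each incoming record exactly once, as it arrives."""
    for record in stream:
        yield formatter(record)


def downstream_total(formatted_stream):
    """Hypothetical downstream component that expects floating point values."""
    total = 0.0
    for value in formatted_stream:
        total += value
    return total


def stand_in_autoencoder(record):
    """Stand-in for the trained autoencoder: recast any numeric record to float."""
    return float(record)


# Hypothetical streaming input with mixed types and formats.
raw_stream = iter(["3", 4, 5.5])
result = downstream_total(ingest(raw_stream, stand_in_autoencoder))
```

  • Because the stages are chained lazily, no stage stores the whole stream, which is consistent with processing the streaming data in a single pass.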
  • FIG. 4 illustrates an embodiment of a logic flow 400 .
  • the logic flow 400 may be representative of some or all of the operations executed by one or more embodiments described herein.
  • the logic flow 400 may include some or all of the operations to provide automated data ingestion using an autoencoder. Embodiments are not limited in this context.
  • the logic flow 400 begins at block 410 , where a target data format is determined for data.
  • the target format may specify a datatype (e.g., integers, floating points, etc.), a data space (e.g., a range of values), etc. More generally, any type of operation may be determined for the data at block 410 , e.g., normalization, filtering, score computation, etc.
  • the autoencoder 104 is trained to format data according to the target formats and/or operations defined at block 410 . Generally, the training of the autoencoder 104 is guided by the ML model 105 and/or the statistical model 106 as described in greater detail herein.
  • the accuracy of the autoencoder 104 may be determined to exceed a threshold accuracy level. For example, if the threshold is 90% accuracy and the accuracy of the autoencoder 104 is 95%, the accuracy of the autoencoder 104 exceeds the threshold.
  • the autoencoder 104 is configured to format data in a processing pipeline.
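  • Two of the decisions in the logic flow 400 may be sketched as follows: describing a target format as a datatype and a range of values, and checking a measured accuracy against a threshold before the autoencoder 104 is placed in a pipeline. The `TargetFormat` type and the helper names are assumptions, and the 90% default mirrors the example threshold above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetFormat:
    """Hypothetical description of a target format: a datatype and a value range."""
    datatype: type
    minimum: float
    maximum: float


def conforms(values, target):
    """True if every value has the target datatype and lies within the target range."""
    return all(
        isinstance(v, target.datatype) and target.minimum <= v <= target.maximum
        for v in values
    )


def ready_for_pipeline(accuracy, threshold=0.90):
    """True once the measured accuracy exceeds the threshold accuracy level."""
    return accuracy > threshold


unit_floats = TargetFormat(datatype=float, minimum=0.0, maximum=1.0)
formatted_ok = conforms([0.0, 0.5, 1.0], unit_floats)
unformatted = conforms([0, 2], unit_floats)
deploy = ready_for_pipeline(0.95)
```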
  • FIG. 5 illustrates an embodiment of a logic flow 500 .
  • the logic flow 500 may be representative of some or all of the operations executed by one or more embodiments described herein.
  • the logic flow 500 may include some or all of the operations performed to train the autoencoder 104 .
  • Embodiments are not limited in this context.
  • the logic flow 500 begins at block 510 , where the training data 107 , which may comprise one or more datasets, is segmented into overlapping training data subsets.
  • the training data 107 may include row and/or column-based numerical values. By generating overlapping subsets, one or more values of the training data 107 may appear in two or more subsets.
  • the ML model 105 transforms the training data subsets according to the format defined at block 410 .
  • the ML model 105 may be configured to transform the training data from a first format to a second format. More generally, the ML model 105 may perform any operation on the training data as described above. Doing so may generate a respective transformed output dataset for each of the training data subsets.
  • Each training dataset and corresponding transformed output dataset pair may comprise a training sample for the autoencoder.
  • One or more of the training samples may be selected at block 530 .
  • the autoencoder 104 may process the input dataset of the training sample selected at block 530 . Generally, the autoencoder 104 may transform the input dataset of the training sample (or perform any other operation) based at least in part on the current weights of the latent vector 109 . Doing so may generate a transformed output. At block 550 , the accuracy of the autoencoder 104 is determined based at least in part on the transformed output generated by the autoencoder 104 . As stated, the ML model 105 and/or the statistical model 106 may be used to determine the accuracy of the autoencoder 104 .
  • a difference and/or a least squared error may be computed between the transformed output dataset of the training sample (e.g., the output of the ML model 105) and the output generated by the autoencoder 104 at block 540.
  • the difference and/or least squared error may be used as accuracy values for the autoencoder 104 .
  • the statistical model 106 may classify the output generated by the autoencoder 104 at block 540 and compare the generated classification to a classification of the training data of the input sample selected at block 530 .
  • the accuracy of the autoencoder 104 may then be determined based on a similarity of the classifications, where more similar classifications result in higher accuracy values for the autoencoder 104 .
  • the accuracy determined at block 550 may be provided to the autoencoder 104 .
  • the values of the latent vector 109 and any other values of the autoencoder 104 may be refined during a backpropagation operation. Doing so may allow the values of the latent vector 109 to more accurately reflect a mapping required to perform the desired operation on data (e.g., filtering, formatting, recasting, etc.).
  • the logic flow 500 may return to block 530 , where another training sample is selected, thereby repeating the training process until the accuracy of the autoencoder 104 exceeds the threshold. Once the accuracy of the autoencoder 104 exceeds a threshold and/or all training samples have been used to train the autoencoder 104 , the logic flow 500 may end.
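  • The control flow of the logic flow 500 may be sketched as a loop over training samples that stops once the accuracy exceeds a threshold or every sample has been used. A single trainable value stands in for the latent vector 109, the reference transform halves each value, and accuracy is taken as 1 / (1 + mean squared error). All of these choices, together with the names and parameters, are assumptions made for illustration.

```python
def mse(predicted, reference):
    """Mean squared error between two equally sized sequences."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def run_training_flow(samples, accuracy_threshold=0.999, learning_rate=0.5):
    """Process samples one at a time, refining `scale` until accurate or exhausted."""
    scale = 1.0      # single trainable value standing in for the latent vector
    accuracy = 0.0
    used = 0
    for inputs, reference in samples:                   # select a training sample
        output = [scale * x for x in inputs]            # process the input dataset
        accuracy = 1.0 / (1.0 + mse(output, reference)) # determine the accuracy
        used += 1
        if accuracy > accuracy_threshold:
            break
        # Gradient of the mean squared error with respect to `scale`.
        grad = sum(2.0 * (o - r) * x
                   for o, r, x in zip(output, reference, inputs)) / len(inputs)
        scale -= learning_rate * grad                   # refine the trainable value
    return scale, accuracy, used


# Hypothetical training data in [0, 1], split into overlapping subsets.
rows = [i / 20 for i in range(21)]
subsets = [rows[start:start + 7] for start in range(0, 20, 5)]
# Stand-in for the ML model: the desired operation halves every value.
samples = [(subset, [0.5 * x for x in subset]) for subset in subsets]

scale, accuracy, used = run_training_flow(samples)
```

  • In this sketch the reported accuracy is the value measured on the last sample before its update. A single pass over the four subsets moves `scale` toward the target of 0.5 without reaching the threshold, so further passes over the samples could be made if a higher accuracy were required.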
  • FIG. 6 illustrates an embodiment of an exemplary computing architecture 600 comprising a computing system 602 that may be suitable for implementing various embodiments as previously described.
  • the computing architecture 600 may comprise or be implemented as part of an electronic device.
  • the computing architecture 600 may be representative, for example, of a system that implements one or more components of the system 100 .
  • computing system 602 may be representative, for example, of the computing system 101 of the system 100 .
  • the embodiments are not limited in this context. More generally, the computing architecture 600 is configured to implement all logic, applications, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5 .
  • a component can be, but is not limited to being, a process running on a computer processor, a computer processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a server and the server can be a component.
  • One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the uni-directional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.
  • the computing system 602 includes various common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components, power supplies, and so forth.
  • the embodiments are not limited to implementation by the computing system 602 .
  • the computing system 602 comprises a processor 604 , a system memory 606 and a system bus 608 .
  • the processor 604 can be any of various commercially available computer processors, including without limitation AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; Intel® Celeron®, Core®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processor 604.
  • the system bus 608 provides an interface for system components including, but not limited to, the system memory 606 to the processor 604 .
  • the system bus 608 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures.
  • Interface adapters may connect to the system bus 608 via a slot architecture.
  • Example slot architectures may include without limitation Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and the like.
  • the system memory 606 may include various types of computer-readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory (e.g., one or more flash arrays), polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory, solid state drives (SSD)), and any other type of storage media suitable for storing information.
  • the system memory 606 can include non-volatile memory (EEPROM), flash
  • the computing system 602 may include various types of computer-readable storage media in the form of one or more lower speed memory units, including an internal (or external) hard disk drive (HDD) 614 , a magnetic floppy disk drive (FDD) 616 to read from or write to a removable magnetic disk 618 , and an optical disk drive 620 to read from or write to a removable optical disk 622 (e.g., a CD-ROM or DVD).
  • the HDD 614 , FDD 616 and optical disk drive 620 can be connected to the system bus 608 by a HDD interface 624 , an FDD interface 626 and an optical drive interface 628 , respectively.
  • the HDD interface 624 for external drive implementations can include at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.
  • the computing system 602 is generally configured to implement all logic, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.
  • the drives and associated computer-readable media provide volatile and/or nonvolatile storage of data, data structures, computer-executable instructions, and so forth.
  • a number of program modules can be stored in the drives and memory units 610 , 612 , including an operating system 630 , one or more application programs 632 , other program modules 634 , and program data 636 .
  • the one or more application programs 632 , other program modules 634 , and program data 636 can include, for example, the various applications and/or components of the system 100 , e.g., the autoencoder 104 , ML model 105 , statistical model 106 , training data 107 , formatted data 108 , and latent vector 109 .
  • a user can enter commands and information into the computing system 602 through one or more wire/wireless input devices, for example, a keyboard 638 and a pointing device, such as a mouse 640 .
  • Other input devices may include microphones, infra-red (IR) remote controls, radio-frequency (RF) remote controls, game pads, stylus pens, card readers, dongles, finger print readers, gloves, graphics tablets, joysticks, keyboards, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, trackpads, sensors, styluses, and the like.
  • input devices are often connected to the processor 604 through an input device interface 642 that is coupled to the system bus 608 , but can be connected by other interfaces such as a parallel port, IEEE 1394 serial port, a game port, a USB port, an IR interface, and so forth.
  • a monitor 644 or other type of display device is also connected to the system bus 608 via an interface, such as a video adaptor 646 .
  • the monitor 644 may be internal or external to the computing system 602 .
  • a computer typically includes other peripheral output devices, such as speakers, printers, and so forth.
  • the computing system 602 may operate in a networked environment using logical connections via wire and/or wireless communications to one or more remote computers, such as a remote computer 648 .
  • the remote computer 648 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computing system 602 , although, for purposes of brevity, only a memory/storage device 650 is illustrated.
  • the logical connections depicted include wire/wireless connectivity to a local area network (LAN) 652 and/or larger networks, for example, a wide area network (WAN) 654 .
  • LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, for example, the Internet.
  • When used in a LAN networking environment, the computing system 602 is connected to the LAN 652 through a wire and/or wireless communication network interface or adaptor 656.
  • the adaptor 656 can facilitate wire and/or wireless communications to the LAN 652 , which may also include a wireless access point disposed thereon for communicating with the wireless functionality of the adaptor 656 .
  • the computing system 602 can include a modem 658 , or is connected to a communications server on the WAN 654 , or has other means for establishing communications over the WAN 654 , such as by way of the Internet.
  • the modem 658, which can be internal or external and a wire and/or wireless device, connects to the system bus 608 via the input device interface 642.
  • program modules depicted relative to the computing system 602 can be stored in the remote memory/storage device 650 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.
  • the computing system 602 is operable to communicate with wired and wireless devices or entities using the IEEE 802 family of standards, such as wireless devices operatively disposed in wireless communication (e.g., IEEE 802.16 over-the-air modulation techniques).
  • the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices.
  • Wi-Fi networks use radio technologies called IEEE 802.11x (a, b, g, n, etc.) to provide secure, reliable, fast wireless connectivity.
  • a Wi-Fi network can be used to connect computers to each other, to the Internet, and to wire networks (which use IEEE 802.3-related media and functions).
  • Various embodiments may be implemented using hardware elements, software elements, or a combination of both.
  • hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein.
  • Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor.
  • Some embodiments may be implemented, for example, using a machine-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments.
  • Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software.
  • the machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), a tape, a cassette, or the like.
  • the instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

Abstract

Systems, methods, apparatuses, and computer program products for processing data using an autoencoder. In one example, the autoencoder may receive streaming data comprising numeric values during a first time interval. The autoencoder may determine, during the first time interval, a maximum value and a minimum value of a first subset of the numeric values. The autoencoder may then process, during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.

Description

    RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 16/549,465, titled “AUTOMATED DATA INGESTION USING AN AUTOENCODER” filed on Aug. 23, 2019. The contents of the aforementioned application are incorporated herein by reference in their entirety.
  • TECHNICAL FIELD
  • Embodiments disclosed herein generally relate to deep learning, and more specifically, to training an autoencoder to perform automated data ingestion.
  • BACKGROUND
  • Input data is often received in different formats. Data engineering involves converting the format of input data to a desired format. However, data engineering is conventionally a manual process which requires significant time and resources. Furthermore, data engineering solutions are not portable, such that a new solution needs to be manually designed for different types of input data and/or desired output formats.
  • SUMMARY
  • Embodiments disclosed herein provide systems, methods, articles of manufacture, and computer-readable media for training an autoencoder to perform automated data ingestion. In one example, the autoencoder may receive streaming data comprising numeric values during a first time interval. The autoencoder may determine, during the first time interval, a maximum value and a minimum value of a first subset of the numeric values. The autoencoder may then process, during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 illustrates an embodiment of a system that uses an autoencoder to perform automated data ingestion.
  • FIG. 2 illustrates an embodiment of training an autoencoder to perform automated data ingestion.
  • FIG. 3 illustrates an embodiment of a processing pipeline.
  • FIG. 4 illustrates an embodiment of a first logic flow.
  • FIG. 5 illustrates an embodiment of a second logic flow.
  • FIG. 6 illustrates an embodiment of a computing architecture.
  • DETAILED DESCRIPTION
  • Embodiments disclosed herein provide techniques to use an autoencoder to automatically format input data according to a desired output format. Generally, embodiments disclosed herein may sample a dataset. A statistical model (or other machine learning (ML) model) may format the data sampled from the dataset, thereby generating a formatted output dataset. A training dataset may then be used to train the autoencoder to format data. The training dataset may include the data sampled from the dataset as an input dataset and the formatted output dataset generated by the statistical model as an output dataset. The training dataset may include overlapping “chunks” such that the same data may appear in two or more chunks. Generally, during training, the autoencoder attempts to format the input dataset, thereby generating an output. The statistical model (or other ML model) may analyze the output of the autoencoder to determine an accuracy of the autoencoder. The determined accuracy of the autoencoder may then be used to train the values of a latent vector of the autoencoder. The training of the autoencoder may be repeated until the accuracy of the autoencoder exceeds a threshold. The trained autoencoder may then be used for data ingestion, e.g., by attaching the trained autoencoder to all new models and/or datasets.
  • Advantageously, embodiments disclosed herein provide techniques to automatically format data using an autoencoder. Advantageously, the autoencoder may be trained to appropriately format all data, even if the data has not been previously analyzed. Furthermore, embodiments disclosed herein provide scalable solutions that can be ported to any type of data processing pipeline, regardless of any particular input and/or output data formats. Further still, embodiments disclosed herein may train the autoencoder using only the training dataset and/or a portion thereof.
  • With general reference to notations and nomenclature used herein, one or more portions of the detailed description which follows may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are used by those skilled in the art to most effectively convey the substances of their work to others skilled in the art. A procedure is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. These operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities.
  • Further, these manipulations are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. However, no such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein that form part of one or more embodiments. Rather, these operations are machine operations. Useful machines for performing operations of various embodiments include digital computers as selectively activated or configured by a computer program stored within that is written in accordance with the teachings herein, and/or include apparatus specially constructed for the required purpose or a digital computer. Various embodiments also relate to apparatus or systems for performing these operations. These apparatuses may be specially constructed for the required purpose. The required structure for a variety of these machines will be apparent from the description given.
  • Reference is now made to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for the purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the novel embodiments can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate a description thereof. The intention is to cover all modifications, equivalents, and alternatives within the scope of the claims.
  • FIG. 1 depicts an exemplary system 100, consistent with disclosed embodiments. As shown, the system 100 includes a computing system 101. The computing system 101 is representative of any type of computing system, such as servers, compute clusters, desktop computers, smartphones, tablet computers, wearable devices, laptop computers, workstations, portable gaming devices, virtualized computing systems, and the like. The computing system 101 includes a processor 102, a memory 103, and may further include a storage, network interface, and/or other components not pictured for the sake of clarity.
  • As shown, the memory 103 includes an autoencoder 104, a machine learning (ML) model 105, a statistical model 106, and data stores of training data 107 and formatted data 108. The autoencoder 104 is representative of any type of autoencoder, including variational autoencoders, denoising autoencoders, sparse autoencoders, and contractive autoencoders. Generally, an autoencoder is a type of artificial neural network that learns data codings (e.g., the latent vector 109) in an unsupervised manner. Values of the latent vector 109 (also referred to as a code, coding, latent variables, and/or latent representation) may be learned (or refined) during training of the autoencoder 104, thereby training the autoencoder 104 to format input data according to a desired output format (which may include formatting according to a desired operation). Stated differently, the trained autoencoder 104 may approximate any function and/or operation applied to input data. As one example, the autoencoder 104 may convert input data comprising integer values to floating point values. More generally, the autoencoder 104 may perform any encoding operation, which may include, but is not limited to, normalizing values of input data, computing a z-score (e.g., a signed value reflecting a number of standard deviations the value of input data is from a mean value) for values of input data, standardizing values of input data, recasting values of input data, filtering the input data according to one or more filtering criteria, fuzzing of the values of input data, applying statistical filters to the input data, and the like. The use of any particular type of encoding operation as a reference example herein should not be considered limiting of the disclosure, as the disclosure is equally applicable to all types of encoding operations. 
Similarly, the use of the term “vector” to describe the latent vector 109 should not be considered limiting of the disclosure, as the latent vector 109 is also representative of a matrix having multiple dimensions (e.g., a vector of vectors).
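  • By way of a non-limiting illustration, two of the encoding operations named above (normalizing values and computing a z-score) can be written out directly. The following is a minimal sketch of the reference operations that a trained autoencoder would approximate, not of the autoencoder itself; the function names `min_max_normalize` and `z_scores` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def z_scores(values):
    """Return, for each value, the signed number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


normalized = min_max_normalize([10, 20, 30])
scores = z_scores([1, 2, 3])
```

Here `normalized` also shows the integer-to-floating-point conversion mentioned above, since integer inputs yield floating point outputs.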
  • To train the autoencoder 104, one or more datasets of training data 107 may be generated. In one embodiment, the training data 107 comprises columnar and/or row-based data, e.g., one or more columns of integer values, one or more columns of floating point values, etc. Generally, the training data 107 may be representative of multiple datasets of any size. For example, the training data 107 may include 50 column-based datasets, where each dataset has thousands of records (or more). Furthermore, the training data 107 may be segmented (e.g., the training data 107 may comprise a plurality of segments of one or more datasets). In one embodiment, each segmented dataset of training data 107 is overlapping, such that at least one value of the training data 107 appears in at least two segments. For example, a first dataset may include rows 0-1000 of the training data 107, while a second dataset may include rows 900-2000 of the training data 107, such that rows 900-1000 appear in the first and second datasets. In one embodiment, the size of the datasets may be learned based on hyperparameter tuning.
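  • The overlapping segmentation described above can be sketched as a sliding window over the rows of a dataset. This is an illustrative sketch only: the function name `overlapping_segments` and the parameters `segment_size` and `overlap` are hypothetical, and a fixed segment size is assumed even though the description allows segments of differing sizes.

```python
def overlapping_segments(rows, segment_size, overlap):
    """Split rows into consecutive segments that share `overlap` rows with their neighbor."""
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must be non-negative and smaller than segment_size")
    step = segment_size - overlap
    segments = []
    start = 0
    while start < len(rows):
        segments.append(rows[start:start + segment_size])
        if start + segment_size >= len(rows):
            break  # the last segment already reaches the end of the data
        start += step
    return segments


segments = overlapping_segments(list(range(10)), segment_size=4, overlap=1)
```

With these toy values, rows 3 and 6 each appear in two segments, mirroring how rows 900-1000 appear in both datasets in the example above.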
  • The ML model 105 and the statistical model 106 are representative of any type of computing model, such as deep learning models, machine learning models, neural networks, classifiers, clustering algorithms, support vector machines, and the like. In one embodiment, the ML model 105 and the statistical model 106 comprise the same model. Generally, the ML model 105 (and/or the statistical model 106) may be configured to transform (or encode) input data to a target format, thereby generating an output dataset. For example, the ML model 105 may be configured to normalize integer values of input data to floating point values, and the output dataset may comprise the floating point values. Generally, the ML model 105 may compute an output dataset for each input dataset of training data 107. An input dataset and corresponding formatted output dataset generated by the ML model 105 may be referred to as a “training sample” herein.
  • The autoencoder 104 may then be trained using the input dataset of one or more training samples. Generally, the autoencoder 104 may receive the input dataset as input, convert the dataset to an encoded format using the values of the latent vector 109, and decode the converted dataset. In some embodiments, the converted dataset generated by the autoencoder 104 may then be compared to the formatted data of the training sample generated by the ML model 105. The comparison may include determining a difference and/or least squared error of the converted dataset generated by the autoencoder 104 and the formatted data of the training sample generated by the ML model 105. Doing so generates one or more values reflecting an accuracy of the autoencoder 104. In some embodiments, the accuracy may comprise a loss of the autoencoder 104.
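  • As one hedged illustration of the comparison described above, a mean squared error may be computed between the converted dataset and the formatted data of the training sample. The function name `mean_squared_error` is hypothetical, and averaging the squared differences (rather than summing them) is an assumption.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between model output and reference-formatted data."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("cannot compute an error over empty data")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


perfect = mean_squared_error([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
imperfect = mean_squared_error([1.0, 2.0], [0.0, 0.0])
```

A lower value indicates output closer to the reference formatting, which is why such a value can serve as a loss of the autoencoder.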
  • In some embodiments, the ML model 105 and/or the statistical model 106 may receive the converted data generated by the autoencoder 104 to determine the accuracy of the autoencoder 104 relative to the data of the training sample generated by the ML model 105. For example, the ML model 105 may process the converted data generated by the autoencoder 104 and compare the output to the formatted data of the training sample. In another embodiment, the statistical model 106 may classify the converted data generated by the autoencoder 104 and compare the classification to a classification of the input dataset of the training sample. For example, the statistical model 106 may classify the formatted output generated by the autoencoder 104 as a dataset of credit card data. If the statistical model classifies the input dataset of the training sample as being credit card data, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104. If, however, the classification for the input dataset is for purchase order amounts, the statistical model 106 may compute a relatively low accuracy value for the autoencoder 104. In one embodiment, the statistical model 106 may compute the accuracy value for the autoencoder 104 based on a distance between the classifications in a data space, where the accuracy increases as the distance between the classifications decreases.
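  • The distance-based accuracy described above can be sketched as follows. The description does not specify how classifications are represented or how distance maps to accuracy, so this sketch assumes that each classification is a point in a numeric data space, that distance is Euclidean, and that accuracy is 1 / (1 + distance); the function name `accuracy_from_distance` is hypothetical.

```python
import math


def accuracy_from_distance(output_class, input_class):
    """Map the distance between two classification points to an accuracy in (0, 1]."""
    distance = math.dist(output_class, input_class)  # Euclidean distance (assumed)
    return 1.0 / (1.0 + distance)


same = accuracy_from_distance((0.0, 0.0), (0.0, 0.0))
near = accuracy_from_distance((0.0, 0.0), (1.0, 0.0))
far = accuracy_from_distance((0.0, 0.0), (3.0, 4.0))
```

Any monotonically decreasing function of the distance would satisfy the stated property that accuracy increases as the distance between the classifications decreases.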
  • The determined accuracy of the autoencoder 104 may then be used to refine the values of the latent vector 109 and/or other components of the autoencoder 104 via a backpropagation operation. The backpropagation may be performed using any feasible backpropagation algorithm. Generally, during backpropagation, the values of the latent vector 109 and/or the other components of the autoencoder 104 are refined based on the accuracy of the formatted output generated by the autoencoder 104. Doing so may result in a latent vector 109 that most accurately maps the input data to the desired output format.
  • The training of the autoencoder 104 may be repeated any number of times until the accuracy of the autoencoder 104 exceeds a threshold (and/or the loss of the autoencoder 104 is below a threshold). The autoencoder 104 may then be configured to ingest (e.g., format) data to be processed in any processing platform, such as a streaming data platform, thereby generating the formatted data 108. In some embodiments, the autoencoder 104 may perform estimated ingestion operations. For example, the autoencoder 104 may receive streaming data over a time interval. If the streaming data is of a reasonable size, the autoencoder 104 may perform a predictive formatting operation on the streaming data. For example, by ingesting enough streaming data during the time interval, the autoencoder 104 may determine the minimum and maximum values therein. Doing so may allow the autoencoder 104 to normalize the streaming data in a predictive fashion in a single pass. Stated differently, the autoencoder 104 may normalize the streaming data in a single processing phase, rather than having to process the streaming data twice (e.g., to discover the minimum/maximum values, then normalize the data based on the identified minimum/maximum values).
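  • The single-pass, predictive normalization described above can be sketched without an autoencoder: bounds are estimated from a first subset of the stream and then applied to a second subset as it arrives. The function name `predictive_normalize` and the parameter `warmup` (the size of the first subset) are hypothetical, and emitting only the second subset is an assumption made for brevity.

```python
from itertools import islice


def predictive_normalize(stream, warmup):
    """Estimate min/max from the first `warmup` values, then normalize the rest on the fly."""
    it = iter(stream)
    first_subset = list(islice(it, warmup))
    if not first_subset:
        return  # nothing was received during the interval
    lo, hi = min(first_subset), max(first_subset)
    span = (hi - lo) or 1.0  # guard against a constant first subset
    for value in it:
        # Each later value is seen exactly once; no second pass over the stream is needed.
        yield (value - lo) / span


result = list(predictive_normalize([0, 10, 5, 20], warmup=2))
```

Because the bounds are estimates, a later value outside the observed range (20 in this toy stream) normalizes to a value outside [0, 1]; ingesting a larger first subset makes this less likely.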
  • FIG. 2 is a schematic 200 illustrating an embodiment of training the autoencoder 104 to perform automated data ingestion. As shown, at block 201, one or more datasets of training data 107 may be segmented. The training data 107 may include row-based data and/or column-based data. The segments may have a minimum size (e.g., 10,000 rows and/or columns of data). In some embodiments, one or more of the segments may be modified, for example, by dropping one or more columns of data, formatting one or more columns of data, and the like. Doing so may produce varying segments of training data 107, e.g., where a first segment has had a column dropped, a second segment has had a column formatted, a third segment has had one column dropped and one column formatted, and a fourth segment has not been modified.
  • At block 202, the ML model 105 may process the segmented training data 107 to format the segmented training data 107 according to one or more formatting rules and/or operations. For example, the ML model 105 may normalize, convert, and/or filter the segmented training data 107. At block 203, one or more output datasets generated by the ML model 105 at block 202 may be stored. The output datasets may include each segment of training data 204 and the corresponding formatted data 205 generated by the ML model 105 at block 202. For example, if 1,000 segments of training data were generated at block 201, the segmented training data 204 may include the 1,000 segments, while the formatted data 205 may include 1,000 formatted datasets generated by the ML model 105 by processing each segment at block 202. In such an example, 1,000 training samples may comprise the segmented training data as input data and the corresponding formatted data 205 generated by the ML model 105.
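  • The pairing of segments with formatted data into training samples can be sketched as follows. The function `to_float` is a hypothetical stand-in for the formatting performed by the ML model 105 (here, the integer-to-floating-point conversion used as an example earlier), and `build_training_samples` is likewise a hypothetical name.

```python
def to_float(segment):
    """Stand-in formatter: recast every value in a segment to floating point."""
    return [float(v) for v in segment]


def build_training_samples(segments, formatter):
    """Pair each input segment with the formatted output produced for it."""
    return [(list(segment), formatter(segment)) for segment in segments]


samples = build_training_samples([[1, 2], [3]], to_float)
```

Each element of `samples` is an (input, formatted output) pair, so 1,000 segments would yield 1,000 training samples as in the example above.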
  • At block 206, overlapping datasets may be generated using the training samples of segmented training data 204 and formatted data 205. Continuing with the previous example, the 1,000 training samples may be modified to include overlapping values. At block 207, the autoencoder 104 may be trained using the overlapping datasets generated at block 206. For example, the autoencoder 104 may process each input dataset (e.g., the segmented training data 204) of each training sample, e.g., to convert each of the input datasets of the training samples to a desired output format and/or based on a predefined operation. At block 208, the accuracy of the autoencoder 104 is determined based on the output generated by the autoencoder 104 at block 207. For example, a difference and/or a least squared error may be computed between the output of the autoencoder 104 based on the segmented training data 204 and the corresponding formatted data 205 generated by the ML model 105. The difference and/or least squared error may be used as accuracy values for the autoencoder 104.
  • As another example, the statistical model 106 may classify the output generated by the autoencoder 104 at block 207 and compare the generated classification to a classification of the corresponding segmented training data 204. For example, if the classification of the output generated by the autoencoder 104 at block 207 for a first overlapping segment of training data 204 matches a classification generated for the formatted data 205 corresponding to the first overlapping segment of training data 204, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104 for the first training sample.
  • The determined accuracy may be used to train the autoencoder 104 via a backpropagation operation. Doing so refines the values of the autoencoder 104, including the latent vector 109, based on the determined accuracy values for the autoencoder 104 and/or a loss of the autoencoder 104. Generally, the accuracy at block 208 may be determined for each training sample. Therefore, continuing with the previous example, the accuracy for each of the 1,000 training samples processed by the autoencoder 104 may be determined at block 208. Each of the 1,000 accuracy values may be provided to the autoencoder 104 to update the weights of the autoencoder 104, e.g., via 1,000 (or fewer) backpropagation operations.
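  • The loop of computing a loss against reference-formatted targets, updating weights from the gradient of that loss, and repeating until the loss falls below a threshold can be sketched with a deliberately small model. This sketch is an assumption-laden stand-in: it uses a two-parameter linear map (`w`, `b`) instead of an autoencoder with a latent vector, plain gradient descent instead of a full backpropagation implementation, and hypothetical names and default values throughout.

```python
def train_until_threshold(inputs, targets, loss_threshold=1e-6, lr=0.05, max_epochs=10_000):
    """Fit y = w * x + b by gradient descent until the mean squared error is below a threshold."""
    w, b = 0.0, 0.0
    n = len(inputs)
    loss = float("inf")
    for _ in range(max_epochs):
        errors = [(w * x + b) - t for x, t in zip(inputs, targets)]
        loss = sum(e * e for e in errors) / n
        if loss < loss_threshold:
            break  # accurate enough; stop training
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [x / 4.0 for x in inputs]  # reference output: min-max normalization of inputs
w, b, loss = train_until_threshold(inputs, targets)
```

In this toy case the learned parameters approach w = 0.25 and b = 0, i.e., the model learns to reproduce the reference normalization. The learning rate suits these small inputs only; inputs of larger magnitude would need a smaller rate or rescaling.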
  • FIG. 3 illustrates an embodiment of a processing pipeline 300. At block 301, streaming input data is received in the processing pipeline 300. The streaming input data may be any type of data, such as transaction data, stock ticker data, financial data, sensor data, and the like. In some embodiments, the streaming input data includes numeric values in one or more rows and/or columns. However, the streaming input data may have varying types and/or formats which may need to be modified to be compatible with various components of the processing pipeline. Therefore, at block 302, the trained autoencoder 104 may process the streaming input data. For example, the trained autoencoder 104 may format the streaming input data according to a desired output format, normalize the values of the streaming input data, compute a z-score for the streaming input data, standardize values of the streaming input data, recast values of the streaming input data, filter the streaming input data according to one or more filtering criteria, fuzz the values of the streaming input data, and the like. At block 303, one or more components of the processing pipeline process the output generated by the autoencoder 104 at block 302, e.g., the formatted and/or converted streaming input data. Advantageously, the autoencoder 104 may process the streaming data in a single pass, e.g., by providing estimated normalization, recasting, etc., and without having to process the streaming data in two or more passes.
  • FIG. 4 illustrates an embodiment of a logic flow 400. The logic flow 400 may be representative of some or all of the operations executed by one or more embodiments described herein. For example, the logic flow 400 may include some or all of the operations to provide automated data ingestion using an autoencoder. Embodiments are not limited in this context.
  • As shown, the logic flow 400 begins at block 410, where a target data format is determined for data. For example, the target format may specify a datatype (e.g., integers, floating points, etc.), a data space (e.g., a range of values), etc. More generally, any type of operation may be determined for the data at block 410, e.g., normalization, filtering, score computation, etc. At block 420, the autoencoder 104 is trained to format data according to the target formats and/or operations defined at block 410. Generally, the training of the autoencoder 104 is guided by the ML model 105 and/or the statistical model 106 as described in greater detail herein. At block 430, the accuracy of the autoencoder 104 may be determined to exceed a threshold accuracy level. For example, if the threshold is 90% accuracy, and the accuracy of the autoencoder 104 is 95%, the accuracy of the autoencoder may exceed the threshold. At block 440, the autoencoder 104 is configured to format data in a processing pipeline.
  • FIG. 5 illustrates an embodiment of a logic flow 500. The logic flow 500 may be representative of some or all of the operations executed by one or more embodiments described herein. For example, the logic flow 500 may include some or all of the operations performed to train the autoencoder 104. Embodiments are not limited in this context.
  • As shown, the logic flow 500 begins at block 510, where the training data 107, which may comprise one or more datasets, is segmented into overlapping training data subsets. As stated, the training data 107 may include row and/or column-based numerical values. By generating overlapping subsets, one or more values of the training data 107 may appear in two or more subsets. At block 520, the ML model 105 transforms the training data subsets according to the format defined at block 410. For example, the ML model 105 may be configured to transform the training data from a first format to a second format. More generally, the ML model 105 may perform any operation on the training data as described above. Doing so may generate a respective transformed output dataset for each of the training data subsets. Each training dataset and corresponding transformed output dataset pair may comprise a training sample for the autoencoder. One or more of the training samples may be selected at block 530.
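  • By way of non-limiting illustration, blocks 510 and 520 may be sketched as follows. A sliding window with a stride smaller than the window size yields overlapping subsets, so that a value may appear in two or more subsets. The min_max_scale function is an assumed stand-in for the transformation applied by the ML model 105; all function and parameter names are hypothetical.

```python
def overlapping_subsets(values, size, stride):
    """Block 510: split values into windows that overlap when stride < size."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    last_start = max(len(values) - size, 0)
    return [values[i:i + size] for i in range(0, last_start + 1, stride)]


def min_max_scale(subset):
    """Assumed example transform: rescale a subset to the range [0, 1]."""
    lo, hi = min(subset), max(subset)
    if hi == lo:
        return [0.0 for _ in subset]
    return [(v - lo) / (hi - lo) for v in subset]


def build_training_samples(subsets, transform):
    """Block 520: pair each subset with its transformed output."""
    return [(subset, transform(subset)) for subset in subsets]
```

  • For example, overlapping_subsets([1, 2, 3, 4, 5, 6], size=4, stride=2) produces two subsets that share the values 3 and 4, and each resulting (input, transformed output) pair may serve as one training sample.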
  • At block 540, the autoencoder 104 may process the input dataset of the training sample selected at block 530. Generally, the autoencoder 104 may transform the input dataset of the training sample (or perform any other operation) based at least in part on the current weights of the latent vector 109. Doing so may generate a transformed output. At block 550, the accuracy of the autoencoder 104 is determined based at least in part on the transformed output generated by the autoencoder 104. As stated, the ML model 105 and/or the statistical model 106 may be used to determine the accuracy of the autoencoder 104. For example, a difference and/or a least squared error may be computed for the output of the autoencoder 104 based on the transformed output dataset of the training sample (e.g., the output of the ML model 105) and the output generated by the autoencoder 104 at block 540. The difference and/or least squared error may be used as accuracy values for the autoencoder 104. As another example, the statistical model 106 may classify the output generated by the autoencoder 104 at block 540 and compare the generated classification to a classification of the training data of the input sample selected at block 530. The accuracy of the autoencoder 104 may then be determined based on a similarity of the classifications, where more similar classifications result in higher accuracy values for the autoencoder 104.
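  • By way of non-limiting illustration, the squared-error comparison of block 550 may be sketched as follows. The text does not specify how an error is converted into an accuracy value; the mapping 1 / (1 + error) used here is an assumption chosen only so that a smaller error yields a higher accuracy. Function names are hypothetical.

```python
def sum_squared_error(expected, actual):
    """Sum of squared differences between reference and generated outputs."""
    if len(expected) != len(actual):
        raise ValueError("outputs must have the same length")
    return sum((e - a) ** 2 for e, a in zip(expected, actual))


def accuracy_from_error(error):
    """Assumed mapping: zero error gives 1.0, larger errors approach 0."""
    return 1.0 / (1.0 + error)
```

  • Here, expected corresponds to the transformed output dataset of the training sample (e.g., the output of the ML model 105), and actual corresponds to the output generated by the autoencoder 104 at block 540.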
  • At block 560, the accuracy determined at block 550 may be provided to the autoencoder 104. At block 570, the values of the latent vector 109 and any other values of the autoencoder 104 may be refined during a backpropagation operation. Doing so may allow the values of the latent vector 109 to more accurately reflect a mapping required to perform the desired operation on data (e.g., filtering, formatting, recasting, etc.). If the accuracy of the autoencoder 104 determined at block 550 is lower than a threshold accuracy, the logic flow 500 may return to block 530, where another training sample is selected, thereby repeating the training process until the accuracy of the autoencoder 104 exceeds the threshold. Once the accuracy of the autoencoder 104 exceeds a threshold and/or all training samples have been used to train the autoencoder 104, the logic flow 500 may end.
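  • By way of non-limiting illustration, the control flow of blocks 530 through 570 may be sketched as follows. For brevity, the sketch replaces the autoencoder 104 and its latent vector 109 with a two-parameter linear model updated by gradient descent; this substitution, the mean-squared-error accuracy mapping, and all names and default values are assumptions. Each sample is assumed to contain at least one value.

```python
def train_until_threshold(samples, threshold=0.99, learning_rate=0.2):
    """Train on successive samples until accuracy exceeds the threshold
    or every sample has been used.

    samples: iterable of (inputs, targets) pairs of equal, non-zero length.
    Returns (weight, bias, accuracy of the last processed sample).
    """
    weight, bias = 0.0, 0.0  # stand-in for the learned values
    accuracy = 0.0
    for inputs, targets in samples:           # block 530: select a sample
        # Block 540: generate a transformed output with current values.
        outputs = [weight * x + bias for x in inputs]
        # Block 550: derive an accuracy value from the mean squared error.
        errors = [o - t for o, t in zip(outputs, targets)]
        mse = sum(e * e for e in errors) / len(errors)
        accuracy = 1.0 / (1.0 + mse)
        # Blocks 560-570: refine the values using the error gradient.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / len(errors)
        grad_b = 2.0 * sum(errors) / len(errors)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
        if accuracy > threshold:              # stop once threshold exceeded
            break
    return weight, bias, accuracy
```

  • The loop ends either when the measured accuracy exceeds the threshold or when the supply of training samples is exhausted, matching the two termination conditions described above.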
  • FIG. 6 illustrates an embodiment of an exemplary computing architecture 600 comprising a computing system 602 that may be suitable for implementing various embodiments as previously described. In various embodiments, the computing architecture 600 may comprise or be implemented as part of an electronic device. In some embodiments, the computing architecture 600 may be representative, for example, of a system that implements one or more components of the system 100. In some embodiments, computing system 602 may be representative, for example, of the computing system 101 of the system 100. The embodiments are not limited in this context. More generally, the computing architecture 600 is configured to implement all logic, applications, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.
  • As used in this application, the terms “system” and “component” and “module” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution, examples of which are provided by the exemplary computing architecture 600. For example, a component can be, but is not limited to being, a process running on a computer processor, a computer processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the uni-directional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.
  • The computing system 602 includes various common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components, power supplies, and so forth. The embodiments, however, are not limited to implementation by the computing system 602.
  • As shown in FIG. 6, the computing system 602 comprises a processor 604, a system memory 606 and a system bus 608. The processor 604 can be any of various commercially available computer processors, including without limitation AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; Intel® Celeron®, Core®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processor 604.
  • The system bus 608 provides an interface for system components including, but not limited to, the system memory 606 to the processor 604. The system bus 608 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. Interface adapters may connect to the system bus 608 via a slot architecture. Example slot architectures may include without limitation Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and the like.
  • The system memory 606 may include various types of computer-readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory (e.g., one or more flash arrays), polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory, solid state drives (SSD)), and any other type of storage media suitable for storing information. In the illustrated embodiment shown in FIG. 6, the system memory 606 can include non-volatile memory 610 and/or volatile memory 612. A basic input/output system (BIOS) can be stored in the non-volatile memory 610.
  • The computing system 602 may include various types of computer-readable storage media in the form of one or more lower speed memory units, including an internal (or external) hard disk drive (HDD) 614, a magnetic floppy disk drive (FDD) 616 to read from or write to a removable magnetic disk 618, and an optical disk drive 620 to read from or write to a removable optical disk 622 (e.g., a CD-ROM or DVD). The HDD 614, FDD 616 and optical disk drive 620 can be connected to the system bus 608 by an HDD interface 624, an FDD interface 626 and an optical drive interface 628, respectively. The HDD interface 624 for external drive implementations can include at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies. The computing system 602 is generally configured to implement all logic, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.
  • The drives and associated computer-readable media provide volatile and/or nonvolatile storage of data, data structures, computer-executable instructions, and so forth. For example, a number of program modules can be stored in the drives and memory units 610, 612, including an operating system 630, one or more application programs 632, other program modules 634, and program data 636. In one embodiment, the one or more application programs 632, other program modules 634, and program data 636 can include, for example, the various applications and/or components of the system 100, e.g., the autoencoder 104, ML model 105, statistical model 106, training data 107, formatted data 108, and latent vector 109.
  • A user can enter commands and information into the computing system 602 through one or more wire/wireless input devices, for example, a keyboard 638 and a pointing device, such as a mouse 640. Other input devices may include microphones, infra-red (IR) remote controls, radio-frequency (RF) remote controls, game pads, stylus pens, card readers, dongles, finger print readers, gloves, graphics tablets, joysticks, keyboards, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, trackpads, sensors, styluses, and the like. These and other input devices are often connected to the processor 604 through an input device interface 642 that is coupled to the system bus 608, but can be connected by other interfaces such as a parallel port, IEEE 1394 serial port, a game port, a USB port, an IR interface, and so forth.
  • A monitor 644 or other type of display device is also connected to the system bus 608 via an interface, such as a video adaptor 646. The monitor 644 may be internal or external to the computing system 602. In addition to the monitor 644, a computer typically includes other peripheral output devices, such as speakers, printers, and so forth.
  • The computing system 602 may operate in a networked environment using logical connections via wire and/or wireless communications to one or more remote computers, such as a remote computer 648. The remote computer 648 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computing system 602, although, for purposes of brevity, only a memory/storage device 650 is illustrated. The logical connections depicted include wire/wireless connectivity to a local area network (LAN) 652 and/or larger networks, for example, a wide area network (WAN) 654. Such LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, for example, the Internet.
  • When used in a LAN networking environment, the computing system 602 is connected to the LAN 652 through a wire and/or wireless communication network interface or adaptor 656. The adaptor 656 can facilitate wire and/or wireless communications to the LAN 652, which may also include a wireless access point disposed thereon for communicating with the wireless functionality of the adaptor 656.
  • When used in a WAN networking environment, the computing system 602 can include a modem 658, or is connected to a communications server on the WAN 654, or has other means for establishing communications over the WAN 654, such as by way of the Internet. The modem 658, which can be internal or external and a wire and/or wireless device, connects to the system bus 608 via the input device interface 642. In a networked environment, program modules depicted relative to the computing system 602, or portions thereof, can be stored in the remote memory/storage device 650. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.
  • The computing system 602 is operable to communicate with wired and wireless devices or entities using the IEEE 802 family of standards, such as wireless devices operatively disposed in wireless communication (e.g., IEEE 802.16 over-the-air modulation techniques). This includes at least Wi-Fi (or Wireless Fidelity), WiMax, and Bluetooth™ wireless technologies, among others. Thus, the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices. Wi-Fi networks use radio technologies called IEEE 802.11x (a, b, g, n, etc.) to provide secure, reliable, fast wireless connectivity. A Wi-Fi network can be used to connect computers to each other, to the Internet, and to wire networks (which use IEEE 802.3-related media and functions).
  • Various embodiments may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor. Some embodiments may be implemented, for example, using a machine-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), a tape, a cassette, or the like. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
  • The foregoing description of example embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present disclosure to the precise forms disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the present disclosure be limited not by this detailed description, but rather by the claims appended hereto. Future filed applications claiming priority to this application may claim the disclosed subject matter in a different manner, and may generally include any set of one or more limitations as variously disclosed or otherwise demonstrated herein.

Claims (20)

What is claimed is:
1. A system, comprising:
a processor circuit; and
a memory storing instructions which when executed by the processor circuit, cause the processor circuit to:
receive, by an autoencoder during a first time interval, streaming data comprising numeric values;
determine, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and
process, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
2. The system of claim 1, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
3. The system of claim 1, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
4. The system of claim 1, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
5. The system of claim 1, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector.
6. The system of claim 5, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
7. The system of claim 1, the memory storing instructions which when executed by the processor circuit, cause the processor circuit to:
provide, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
8. A non-transitory computer-readable storage medium storing instructions that when executed by a processor cause the processor to:
receive, by an autoencoder during a first time interval, streaming data comprising numeric values;
determine, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and
process, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
9. The medium of claim 8, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
10. The medium of claim 8, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
11. The medium of claim 8, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
12. The medium of claim 8, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector.
13. The medium of claim 12, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
14. The medium of claim 8, storing instructions which when executed by the processor, cause the processor to:
provide, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
15. A method, comprising:
receiving, by an autoencoder executing on a computer processor, streaming data comprising numeric values during a first time interval;
determining, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and
processing, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
16. The method of claim 15, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
17. The method of claim 15, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
18. The method of claim 15, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
19. The method of claim 15, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
20. The method of claim 15, further comprising:
providing, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
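
By way of non-limiting illustration only, and without limiting the scope of the claims, the operations recited above (determining a maximum and a minimum from a first subset, then processing a second subset based on those values) may be sketched as follows. The sketch uses plain functions rather than an autoencoder, and the rescaling shown is only one assumed reading of normalizing values "to be within" the determined bounds. All names are hypothetical.

```python
def bounds(first_subset):
    """Determine the minimum and maximum of the first subset."""
    return min(first_subset), max(first_subset)


def filter_outside(second_subset, low, high):
    """Remove values that are not within [low, high]."""
    return [v for v in second_subset if low <= v <= high]


def rescale_into(second_subset, low, high):
    """Assumed normalization: map the subset's own range onto [low, high]."""
    s_lo, s_hi = min(second_subset), max(second_subset)
    if s_hi == s_lo:
        # A constant subset has no range to map; place it at the lower bound.
        return [float(low) for _ in second_subset]
    return [low + (v - s_lo) * (high - low) / (s_hi - s_lo)
            for v in second_subset]
```

With a first subset of [2, 10, 4] the determined bounds are 2 and 10; filtering a second subset of [0, 5, 20] then leaves [5], while rescaling the same subset yields values between 2 and 10.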
US17/101,517 2019-08-23 2020-11-23 Automated data ingestion using an autoencoder Pending US20210073649A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/101,517 US20210073649A1 (en) 2019-08-23 2020-11-23 Automated data ingestion using an autoencoder

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/549,465 US10853728B1 (en) 2019-08-23 2019-08-23 Automated data ingestion using an autoencoder
US17/101,517 US20210073649A1 (en) 2019-08-23 2020-11-23 Automated data ingestion using an autoencoder

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/549,465 Continuation US10853728B1 (en) 2019-08-23 2019-08-23 Automated data ingestion using an autoencoder

Publications (1)

Publication Number Publication Date
US20210073649A1 (en) 2021-03-11

Family

ID=73554888

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/549,465 Active US10853728B1 (en) 2019-08-23 2019-08-23 Automated data ingestion using an autoencoder
US17/101,517 Pending US20210073649A1 (en) 2019-08-23 2020-11-23 Automated data ingestion using an autoencoder

Country Status (1)

Country Link
US (2) US10853728B1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11561948B1 (en) 2021-03-01 2023-01-24 Era Software, Inc. Database indexing using structure-preserving dimensionality reduction to accelerate database operations
EP4083858A1 (en) * 2021-04-29 2022-11-02 Siemens Aktiengesellschaft Training data set reduction and image classification
US11734318B1 (en) * 2021-11-08 2023-08-22 Servicenow, Inc. Superindexing systems and methods

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180222407A1 (en) * 2017-02-06 2018-08-09 Korea University Research And Business Foundation Apparatus, control method thereof and recording media
US20180262525A1 (en) * 2017-03-09 2018-09-13 General Electric Company Multi-modal, multi-disciplinary feature discovery to detect cyber threats in electric power grid
US20190050465A1 (en) * 2017-08-10 2019-02-14 International Business Machines Corporation Methods and systems for feature engineering

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015130928A1 (en) * 2014-02-26 2015-09-03 Nancy Packes, Inc. Real estate evaluating platform methods, apparatuses, and media
US10460251B2 (en) * 2015-06-19 2019-10-29 Preferred Networks Inc. Cross-domain time series data conversion apparatus, methods, and systems
US10832168B2 (en) 2017-01-10 2020-11-10 Crowdstrike, Inc. Computational modeling and classification of data streams
EP3599575B1 (en) 2017-04-27 2023-05-24 Dassault Systèmes Learning an autoencoder
US10417556B1 (en) * 2017-12-07 2019-09-17 HatchB Labs, Inc. Simulation-based controls optimization using time series data forecast
US11586915B2 (en) 2017-12-14 2023-02-21 D-Wave Systems Inc. Systems and methods for collaborative filtering with variational autoencoders

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lee, Doyup. "Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis" 9 October 2017 [ONLINE] Downloaded 10/15/2024 https://arxiv.org/pdf/1708.02635 (Year: 2017) *
Sun, Haonan et al. "Stacked Denoising Autoencoder Based Stock Market Trend Prediction via K-Nearest Neighbor Data Selection" 2017 [ONLINE] Downloaded 5/18/2023 https://link.springer.com/chapter/10.1007/978-3-319-70096-0_90 (Year: 2017) *

Also Published As

Publication number Publication date
US10853728B1 (en) 2020-12-01

Legal Events

Date Code Title Description
AS Assignment

Owner name: CAPITAL ONE SERVICES, LLC, VIRGINIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WALTERS, AUSTIN GRANT;GOODSITT, JEREMY EDWARD;REEL/FRAME:054447/0080

Effective date: 20190822

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: PRE-INTERVIEW COMMUNICATION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

STCC Information on status: application revival

Free format text: WITHDRAWN ABANDONMENT, AWAITING EXAMINER ACTION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER