US20220076157A1 - Data analysis system using artificial intelligence

Data analysis system using artificial intelligence

Info

Publication number
US20220076157A1
Authority
US
United States
Prior art keywords
data
results
machine learning
analysis system
web application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/013,106
Inventor
Damian Watkins
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aperio Global LLC
Original Assignee
Aperio Global LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aperio Global LLC filed Critical Aperio Global LLC
Priority to US17/013,106
Assigned to Aperio Global, LLC. Assignor: WATKINS, DAMIAN (assignment of assignors interest; see document for details)
Priority to US17/017,289 (US11037073B1)
Publication of US20220076157A1
Legal status: Abandoned

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/906 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G06F 16/9538 - Presentation of query results
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04847 - Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 - Computing arrangements based on specific mathematical models
    • G06N 7/01 - Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the claimed subject matter relates to data analysis, and more specifically to the field of data analysis systems utilizing artificial intelligence.
  • Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.
  • Current ways of processing big data are vulnerable to multiple issues, including high reliance on accurate presentation of datasets, limitations to binary features, and vulnerability to overfitting in which an analysis corresponds too closely to a particular set of data resulting in failure to fit additional data or predict future observations reliably.
  • current ways of processing big data present complications for users attempting to calculate a specific feature's influence on the outcome of the data.
  • a data analysis system utilizing custom unsupervised machine learning processes over a communications network is disclosed, the system comprising a repository of data connected to the communications network, a web application deployed on a web server connected to the communications network, the web application including a data collection interface between the web server and the repository of data, wherein the web application is configured for providing a graphical user interface for modifying, by a user, a plurality of threshold parameters of a clustering algorithm for clustering the data, executing the clustering algorithm with the plurality of threshold parameters that were modified by the user, thereby producing a set of results, providing a graphical user interface for reviewing, by the user, the set of results of the clustering algorithm and re-executing steps a) through c) if the set of results are not useful, copying the set of results into a deep learning software framework and, executing a deep learning algorithm in the deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.
  • FIG. 1 is a diagram of an operating environment that supports a system for data analysis using artificial intelligence, according to an example embodiment
  • FIG. 2 is a diagram of a machine learning modeling architecture that supports the system for data analysis using artificial intelligence, according to an example embodiment
  • FIG. 3 is a flow chart illustrating a first process flow of the system for data analysis using artificial intelligence, according to an example embodiment
  • FIG. 4 is a flow chart illustrating a second process flow of the system for data analysis using artificial intelligence, according to an example embodiment
  • FIG. 5 is a block diagram illustrating a computer system according to exemplary embodiments of the present technology.
  • the disclosed embodiments improve upon the problems with the prior art by providing a system for data analysis using artificial intelligence configured to process data sets that are too large or complex to be dealt with by traditional data-processing application software.
  • the disclosed embodiments also improve upon the problems with the prior art by providing a system for data analysis that reduces or eliminates the problems associated with processing big data, such as failure to fit additional data or predict future observations reliably.
  • the disclosed embodiments further improve upon the problems with the prior art by providing a system for data analysis using artificial intelligence configured to improve filtering for useful data within large datasets by reducing noise and identifying useful data in an unconventional manner that utilizes logistic regression tests on datasets iteratively in order to determine which datasets are useful to train a complex machine learning model.
  • the system improves the accuracy of data utilized for training sets configured for machine learning models by allowing users to establish threshold parameters and iteratively removing data that is below the established threshold parameters.
  • the system also improves visual representations and analytics associated with calculations of data features and their influence on outcomes.
  • the system improves overall computer performance by reducing the amount of computing resources utilized to generate a trained model that not only makes more accurate predictions on datasets, but also more efficiently establishes relationships between data and provides generalizations of the data.
  • System 100 comprises a user 102 operating on a computing device 104 in which computing device 104 is communicatively coupled to a communications network 106 , such as the Internet.
  • System 100 further comprises a server 108 and a data repository 110 (hereinafter referred to as “database”) in which server 108 and database 110 are communicatively coupled to the communications network 106 .
  • System 100 may further comprise a customer data repository 111 (also hereinafter referred to as "database") communicatively coupled to communications network 106.
  • Customer data repository 111 may house data that belongs to a customer of user 102, and said data may also or alternatively reside in the database 110.
  • System 100 further comprises a machine learning database 112 , a machine learning model generator 114 , and a regression and clustering module 116 each of which are communicatively coupled to the communications network 106 .
  • data communicated on network 106 can be implemented using a variety of protocols on a network such as a wide area network (WAN), a virtual private network (VPN), metropolitan area networks (MANs), system area networks (SANs), a public switched telephone network (PSTN), a global Telex network, or a 2G, 3G, 4G or 5G network.
  • Such networks can also generally contextually be referred to herein as the Internet or cloud.
  • Machine learning is the study of computer algorithms that improve automatically through experience.
  • Machine learning is a subset of artificial intelligence that uses algorithms that build a mathematical model based on sample data, also referred to as training data, in order to make predictions or decisions.
  • Machine learning algorithms are used when it is difficult or infeasible to develop conventional algorithms to perform needed tasks.
  • Machine learning database 112 is a database configured for storing the data structures that are used for machine learning processes, such as storing data models, mathematical models, training data, training examples in array or vector form, etc.
  • Machine learning model generator 114 is a set of routines and/or processes configured for generating models that are trained on training data and then can process additional data.
  • Various types of models can be used for system 100 , including artificial neural networks, decision trees, support vector machines, regression analysis, Bayesian networks and genetic algorithms.
  • Regression and clustering module 116 is a set of routines and/or processes configured for conducting regression then cluster analysis. Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
  • a common form of regression analysis is linear regression, which comprises the discovery of a line that most closely fits the data according to a specific mathematical criterion.
  • Cluster analysis is the task of grouping a set of data or objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups or clusters. Members of the same cluster are grouped based on one or more metrics indicating that the members of the group share more metric similarities with each other than the members of other groups.
  • system 100 and said components may be implemented in hardware, software, or a combination thereof.
  • the various components of system 100 are implemented at least partially by hardware at one or more computing devices, such as one or more hardware processors executing instructions stored in one or more memories for performing various functions described herein.
  • descriptions of various components (or modules) as described in this application may be interpreted by one of ordinary skill in the art as providing pseudocode, an informal high-level description of one or more computer structures.
  • the descriptions of the components may be converted into software code, including code executable by an electronic processor.
  • System 100 illustrates only one of many possible arrangements of components configured to perform the functionality described herein. Other arrangements may include fewer or different components, and the division of work between the components may vary depending upon the arrangement.
  • computing device 104 , server 108 and database 111 may include, but are not limited to, one or more laptop computers, tablet computers, smartphones, desktop computers, internet of things (IOT) devices, servers, workstations, virtual machines, and any other mechanisms used to access networks and software.
  • Computing device 104 , server 108 and database 111 (which may include web servers) may be configured to run or host a web application communicatively coupled to network 106 .
  • the web application may be configured on a software platform configured to allow user 102 to contribute, monitor, and/or analyze data sets in addition to serving as a mechanism to present graphical user interfaces to user 102 that support receiving inputs from user 102 in order to perform components necessary for server 108 to perform execution of one or more algorithms disclosed herein.
  • user 102 provides a plurality of user input to computing device 104 in which server 108 receives the plurality of user input over communications network 106 and automatically performs tasks described herein on data housed in databases 110 and/or 111 .
  • machine learning database 112, machine learning model generator 114, and regression and clustering module 116 are configured to operate in unison in order to perform tasks such as execution of regression analysis and one or more high quality clustering algorithms in order to efficiently cluster data received by system 100.
  • machine learning model generator 114 may comprise one or more servers to support execution of applying the regression analysis and clustering algorithms and/or any other applicable algorithms to the data in databases 110 , 111 .
  • server 108 accesses databases 110 and/or 111 and data derived from user 102 and other applicable data sources (described in greater detail during the discussion of FIG. 2) and provides that data to machine learning database 112 in order for machine learning model generator 114 to apply the regression analysis and one or more high quality clustering algorithms associated with regression and clustering module 116 to the data.
  • regression and clustering module 116 supports a plurality of regression analysis algorithms and clustering algorithms including, but not limited to, k-means clustering, k-medoids clustering, CLARANS clustering, BIRCH clustering, CURE clustering, Chameleon clustering, DBSCAN clustering, OPTICS clustering, DENCLUE clustering, STING clustering, CLIQUE clustering, WaveCluster clustering, or any combination thereof, configured to support machine learning model generator 114 in discovering patterns in the data in order to generate groups in which the members share a significant number of associations.
  • Architecture 200 comprises a plurality of data sources 202 - 206 configured to transmit various types of datasets to system 100 in which server 108 serves as the receiving point configured to filter the data received from data sources 202 - 206 and store components of the data in databases 110 and/or 111 .
  • architecture 200 comprises a plurality of data sources 202 - 206 configured to serve various types of datasets to system 100 in which server 108 downloads or requests said data from data sources 202 - 206 and stores components of the data in databases 110 and/or 111 .
  • Architecture 200 further comprises one or more training sets 208 (which may be stored in machine learning database 112 ) comprising a plurality of training instances (training data) configured to comprise a plurality of features (in some instances values associated with each of the features) in which the plurality of features are associated with at least one of the received data or the result of the one or more clustering algorithms and one or more other applicable functions applied to the received data during an iteration.
  • Architecture 200 further comprises a machine learning model 210 (which may be stored in machine learning database 112 and which may have been produced by machine learning model generator 114 ) configured to utilize one or more techniques in order to train the model based on at least training set 208 .
  • training set 208 and machine learning model 210 may be comprised in a deep learning software framework, in which machine learning software provides generic functionality that can be selectively changed by additional user-written code, thus providing application-specific software.
  • the deep learning software framework provides a standard way to build and deploy machine learning applications and is a universal, reusable software environment that provides particular machine learning functionality as part of a larger software platform to facilitate development of machine learning software applications, products and solutions.
  • the deep learning software framework may include support programs, compilers, code libraries, tool sets, application programming interfaces (APIs) and deep learning algorithms that bring together all the different components to enable development of a machine learning project or system.
  • machine learning is the study and construction of algorithms that can learn from, and make predictions on data, and that such algorithms operate by building a model from inputs in order to make data-driven predictions or decisions.
  • machine learning model 210 is trained based on training set 208 representing weights or coefficients of the plurality of features and/or derivatives of the plurality of features.
  • Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning that can be supervised, semi-supervised or unsupervised.
  • Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
  • a common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.
  • the clusters are modeled using a measure of similarity which is defined upon metrics such as Euclidean or probabilistic distance.
  • the claimed process described herein is considered unsupervised learning or unsupervised machine learning due to the use of cluster analysis.
  • server 108 is configured to continuously download, receive, filter, compile, and transmit data received from data sources 202 - 206 and computing device 104 in order for machine learning model generator 114 to perform one or more functions on said data in addition to utilizing regression and clustering module 116 to apply one or more of the aforementioned regression and clustering algorithms to the received data.
  • server 108 and/or machine learning model generator 114 may iteratively apply one or more logistic regression functions on said data (in combination with the one or more clustering algorithms) in which the clustering algorithms are configured to filter noisy data in order to make said data less noisy, and the logistic regression functions are configured to identify the most useful components of said data.
  • server 108 and/or machine learning model generator 114 are configured to execute a plurality of logistic regression tests on said data in order to determine which components of said data are useful to include in training set 208 in order to train machine learning model 210.
  • server 108 applies one or more logistic regression tests at each iteration for said data in order to determine which components of said data are considered useful enough to be included in training set 208 .
  • a regression model may be applied to the applicable feature in order to calculate the influence of the applicable feature on the overall outcome/output of the data after the one or more aforementioned algorithms are applied during the iteration.
  • the application of the multiple logistic regression tests at each iteration increases the accuracy of the datasets overall in addition to ensuring that said data is useful enough to be included in training set 208 in order to provide more accurate predictions.
  • Each logistic regression test may also calculate a metric of influence that measures the rank or quantifies the influence of each feature of the data.
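  • As a minimal sketch of such an influence metric (assuming scikit-learn; scoring by absolute coefficient magnitude is an illustrative choice rather than the claimed rule), one logistic regression test could rank features as follows:

```python
# Illustrative sketch: rank feature influence with a single logistic regression test.
# Assumes scikit-learn and NumPy; the dataset is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X = StandardScaler().fit_transform(X)      # put features on a common scale

test = LogisticRegression(max_iter=1000).fit(X, y)
influence = np.abs(test.coef_[0])          # one "metric of influence" per feature

for rank, idx in enumerate(np.argsort(influence)[::-1], start=1):
    print(f"rank {rank}: feature {idx} influence {influence[idx]:.3f}")
```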
  • Logistic regression is a statistical model configured to be utilized in order to model the probability of a certain class or event existing.
  • logistic regression models are represented by utilizing weights or coefficient values to predict an output value typically modeled as a binary value.
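  • In standard notation, such a model estimates the probability of the positive class from a weighted combination of the feature values, for example:

$$ P(y = 1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}^{\top}\mathbf{x} + b\right) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}} $$

  • Here \(\mathbf{w}\) holds the per-feature weights (coefficients) and \(b\) is the intercept; the binary output value referred to above is obtained by thresholding this probability, typically at 0.5.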
  • the combination of clustering and linear regression may be utilized in order to remove noise from data and identify useful data for training a complex machine learning model.
  • the claimed process described herein is considered machine learning logistic regression due to the use of regression analysis in a machine learning environment.
  • Logistic regression may also be used to calculate functional data within the data being processed, wherein functional data comprises data providing information about curves, surfaces or anything else varying over a continuum.
  • server 108 is also configured to provide a graphical user interface to computing device 104 over network 106 prompting user 102 for a plurality of threshold parameters (or modifications to said threshold parameters) configured to indicate the type, nature and amount of said data that will be included in training set 208 .
  • Server 108 is configured to allocate a plurality of scores among the features of the data collected and processed (as defined above), wherein after each iteration, certain data are removed from the dataset resulting in a more accurate dataset.
  • server 108 provides functionality configured to allow user 102 to modify or toggle the plurality of threshold parameters in order for user 102 to tune the data based on his or her preferences.
  • the applicable threshold parameter of the plurality of threshold parameters set by user 102 for the applicable iteration ensures that the data or applicable feature comprising a score that is lower than the applicable threshold parameter is removed from the overall dataset; thus, preventing the applicable feature from being included in the following iteration and ultimately increasing the accuracy of the dataset overall.
  • user 102 may reset or adjust a plurality of hyperparameters associated with the set of results in order to increase the accuracy of the set of results or subsequent sets of results.
  • the web application provided by server 108 on computing device 104 is configured to allow user 102 to view the results after each iteration (of the processes undertaken by the machine learning modeling architecture 200 , as defined above) in order to not only effectively evaluate the utility of the algorithms applied during the applicable iteration, but also provide user 102 with the opportunity to adjust or toggle the plurality of threshold parameters if necessary. Furthermore, the aforementioned feature provides user 102 with the opportunity to scrutinize both the features associated with the lower scores (scores below the applicable threshold parameter) and the effect of a particular feature on the results of an iteration and/or the dataset overall. Threshold parameters may comprise a plurality of numerical values, wherein each numerical value comprises a decimal number.
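  • A minimal sketch of this iterative, threshold-driven filtering, assuming scikit-learn, is shown below; the absolute-coefficient scoring rule and the example decimal threshold values are placeholders for whatever scores and user-supplied parameters a given deployment actually uses:

```python
# Sketch of iterative, threshold-driven feature removal (hypothetical scoring rule).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=1)
features = list(range(X.shape[1]))             # indices of features still in play

def run_iteration(X_subset, y, threshold):
    """Score each remaining feature and flag those scoring below the threshold."""
    scores = np.abs(LogisticRegression(max_iter=1000).fit(X_subset, y).coef_[0])
    return scores >= threshold, scores

# Decimal threshold parameters a user might enter and later toggle via the GUI.
user_thresholds = [0.05, 0.10, 0.20]

for threshold in user_thresholds:              # one pass per user-reviewed iteration
    if not features:                           # everything was filtered out
        break
    keep, scores = run_iteration(X[:, features], y, threshold)
    removed = [f for f, k in zip(features, keep) if not k]
    features = [f for f, k in zip(features, keep) if k]
    print(f"threshold {threshold:.2f}: removed {removed}, kept {features}")

training_data = X[:, features]                 # candidate data for training set 208
```

  • In a real deployment the thresholds would come from the graphical user interface described above rather than a hard-coded list, and the surviving columns would be what is copied into training set 208.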
  • server 108 accesses database 110, database 111 and/or data sources 202-206 in order to collect data on which data analysis must be performed. It is to be understood that server 108 is designed and configured to download, request, receive, collect, filter, and compile various types of data derived from various data sources prior to performing the subsequent steps provided in this disclosure. Server 108 may continuously perform step 302 while simultaneously performing the subsequent provided steps in order to ensure that data is continuously being fed into system 100. Any data that is collected by server 108 may be stored in database 110 and/or database 111 and may be referred to herein as "said data".
  • server 108 provides a graphical user interface to computing device 104 over network 106 , the interface configured to allow user 102 to provide and/or modify a plurality of threshold parameters.
  • Said graphical user interface may also be supportive in the sense that it suggests a plurality of threshold parameters or changes to an existing plurality of threshold parameters.
  • server 108 may take into account various factors associated with said data such as source of the data, computational demands required in order to process said data, pre-existing user preferences stored within databases 110 and/or 111 , or any other applicable factors.
  • server 108 utilizes architecture 200 to execute one or more high quality clustering algorithms in addition to logistic regression tests to said data.
  • the plurality of threshold parameters serves the purpose of determining the amount of data that is to be included in training set 208 .
  • machine learning model generator 114 may be continuously receiving data from server 108 and iteratively applying the aforementioned algorithms to the data set while simultaneously removing data and features of data associated with the scores that are less than the threshold parameters. In other words, at each iteration, the clustering algorithms and logistic regression tests are applied to the data and after the application of the aforementioned, the data associated with one or more scores lower than the corresponding threshold parameter are removed, thereby producing a set of results.
  • server 108 provides a graphical user interface to computing device 104 over network 106 providing user 102 with an opportunity to not only review the set of results, but also adjust or toggle the plurality of threshold parameters, if necessary, based on the set of results. If user 102 decides in step 309 to adjust or toggle the plurality of threshold parameters, then steps 304 - 308 are re-executed and server 108 (using architecture 200 ) re-executes the aforementioned algorithms and removes the applicable features of data based upon the new plurality of threshold parameters. If user 102 decides in step 309 that the resulting data is useful, then control flows to step 310 .
  • Once server 108 and/or user 102 arrive at a determination that the set of results (the accuracy of the data) is adequate to be included in training set 208, server 108 copies the resulting data to the deep learning software framework comprising training set 208 and machine learning model 210.
  • the set of results manifested via training set 208 are used to train machine learning model 210 .
  • training set 208 comprises multiple training instances associated with the set of results and in some embodiments the training instances comprise a label.
  • different sets of training data may be based on different features.
  • machine learning model 210 is configured to capture the correlation between features of the set of results and the labels associated with the training instances. Each feature represents a measurable piece of data that can be used for analysis.
  • Features may also be referred to as variables or attributes. The features included in the dataset can vary widely.
  • a graphical user interface may be configured to present the set of results of each iteration of the aforementioned steps to user 102 .
  • the graphical user interface may provide user 102 with the option to toggle the plurality of threshold parameters in order to effectively target one or more components of the dataset.
  • execution of the aforementioned steps results in relationships being established between the data and due to the increased accuracy of the data more efficient generalizations of the data are provided to user 102 as well.
  • Relationships between data are defined as a connection between two or more pieces of data, such as a causal relationship connection or a correlated relationship connection.
  • Generalizations about data are defined as true statements that may be made about two or more pieces of data, such as a category, or classification. Generalizations posit the existence of a domain or set of data elements, as well as one or more common characteristics shared by those data elements.
  • the web application may also be configured for generating a downloadable report comprising the set of results for review by the user.
  • a flow chart 400 illustrating a second exemplary process of data analysis using artificial intelligence is depicted.
  • a plurality of data is read by server 108 from the plurality of sources 202-206, and/or databases 110 and 111 are accessed by server 108 in order to retrieve data.
  • server 108 reads the plurality of threshold parameters from user 102 via inputs on computing device 104 .
  • the one or more clustering algorithms in addition to the logistic regression tests are performed by server 108 on the plurality of data resulting in a plurality of data clusters or groups.
  • server 108 provides a graphical user interface to computing device 104 allowing user 102 to review the set of results associated with the most recent iteration and/or a compilation of the rendered iterations. If there is a lack of scores that are beneath the applicable threshold parameter, then data may not be removed and training set 208 will represent instances of the current dataset and be used to train machine learning model 210 . It is to be understood that training set 208 comprising instances associated with the current dataset may be utilized within machine learning model 210 while server 108 is simultaneously running iterations of applying the algorithms and removing data associated with components of the data that are scored lower than the threshold parameter.
  • the set of results may comprise a plurality of scores associated with each of a plurality of features associated with the data that was analyzed.
  • server 108 determines whether there is at least one remaining score that is lower than the threshold parameter and, if so, identifies which feature is associated with that score. If such a feature is identified, then step 412 occurs and the feature is removed from the dataset. Otherwise, the process proceeds directly to step 414, in which components of the dataset are stored into training set 208.
  • machine learning model 210 is trained based on training set 208 . It is to be understood that none of the aforementioned steps are required to be performed in a particular order and variations are possible; however, it is important for server 108 to receive a threshold parameter or establish a default threshold parameter in order for features to be removed from the dataset based on said threshold parameter.
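  • The second process flow might be sketched as follows; the choice of k-means for clustering, absolute logistic-regression coefficients for scoring, and a small neural network standing in for machine learning model 210 are illustrative assumptions rather than requirements of the disclosure:

```python
# Illustrative end-to-end sketch of the second process flow (assumed components).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def analyze(X, y, n_clusters=3):
    """Cluster the data, then score features with a logistic regression test."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores = np.abs(LogisticRegression(max_iter=1000).fit(X, y).coef_[0])
    return labels, scores

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=2)
threshold = 0.1                                # user-supplied threshold parameter

while X.shape[1] > 1:                          # always keep at least one feature
    labels, scores = analyze(X, y)
    below = np.where(scores < threshold)[0]
    if below.size == 0:                        # nothing scores below the threshold,
        break                                  # so no data is removed
    worst = below[np.argmin(scores[below])]    # step 412: drop lowest-scoring feature
    X = np.delete(X, worst, axis=1)

# Step 414: store the surviving data as the training set, then train the model.
model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                      random_state=0).fit(X, y)
print(f"{X.shape[1]} features kept across {len(set(labels))} clusters; "
      f"training accuracy {model.score(X, y):.2%}")
```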
  • FIG. 5 is a block diagram of a system including an example computing device 500 and other computing devices. Consistent with the embodiments described herein, the aforementioned actions performed by various components of system 100 (such as devices 112 , 114 , 116 , 111 , 108 and 104 ) may be implemented in a computing device, such as computing device 500 . Any suitable combination of hardware, software, or firmware may be used to implement computing device 500 .
  • the aforementioned system, device, and processors are examples and other systems, devices, and processors may comprise the aforementioned computing device.
  • computing device 500 may comprise an operating environment for system 100 . Processes, data related to system 100 may operate in other environments and are not limited to computing device 500 .
  • a system consistent with an embodiment of the claimed subject matter may include a plurality of computing devices, such as a computing device 500 of FIG. 5 .
  • computing device 500 may include at least one processing unit 502 and a system memory 504 .
  • system memory 504 may comprise, but is not limited to, volatile memory (e.g., random-access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM)), flash memory, or any combination thereof.
  • System memory 504 may include operating system 502 , and one or more programming modules 506 . Operating system 502 , for example, may be suitable for controlling computing device 500 's operation.
  • This basic configuration is illustrated in FIG. 5 by those components within a dashed line 520.
  • Computing device 500 may have additional features or functionality.
  • computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape.
  • additional storage is illustrated in FIG. 5 by a removable storage 509 and a non-removable storage 510 .
  • Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.
  • System memory 504, removable storage 509, and non-removable storage 510 are all examples of computer storage media (i.e., memory storage).
  • Computer storage media may include, but is not limited to, RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store information and which can be accessed by computing device 500 . Any such computer storage media may be part of system 500 .
  • Computing device 500 may also have input device(s) 512 such as a keyboard, a mouse, a pen, a sound input device, a camera, a touch input device, etc.
  • Output device(s) 514 such as a display, speakers, a printer, etc. may also be included.
  • the aforementioned devices are only examples, and other devices may be added or substituted.
  • Computing device 500 may also contain a network connection device 515 that may allow device 500 to communicate with other computing devices 518 , such as over a network in a distributed computing environment, for example, an intranet or the Internet.
  • Device 515 may be a wired or wireless network interface controller, a network interface card, a network interface device, a network adapter, or a LAN adapter.
  • Device 515 allows for a communication connection 516 for communicating with other computing devices 518 .
  • Communication connection 516 is one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • modulated data signal may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
  • communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
  • computer readable media as used herein may include both computer storage media and communication media.
  • program modules 506 may perform processes including, for example, one or more of the stages of a process.
  • processing unit 502 may perform other processes.
  • Other programming modules that may be used in accordance with embodiments of the claimed subject matter may include electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
  • program modules may include routines, programs, components, data structures, and other types of structures that may perform particular tasks or that may implement particular abstract data types.
  • embodiments of the claimed subject matter may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
  • Embodiments of the claimed subject matter may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote memory storage devices.
  • embodiments of the claimed subject matter may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip (such as a System on Chip) containing electronic elements or microprocessors.
  • Embodiments of the claimed subject matter may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
  • embodiments of the claimed subject matter may be practiced within a general-purpose computer or in any other circuits or systems.
  • Embodiments of the claimed subject matter are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to embodiments of the claimed subject matter.
  • the functions/acts noted in the blocks may occur out of the order as shown in any flowchart.
  • two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
  • components of the system may be interchangeable or modular so that the components may be easily changed or supplemented with additional or alternative components.

Abstract

A data analysis system utilizing custom unsupervised machine learning processes over a communications network is disclosed, the system including a repository of data, a web application deployed on a web server, the web application including a data collection interface, wherein the web application is configured for providing a graphical user interface for modifying threshold parameters of a clustering algorithm for clustering the data, executing the clustering algorithm with the threshold parameters that were modified, thereby producing a set of results, providing a graphical user interface for reviewing the set of results of the clustering algorithm and re-executing previous steps if the set of results are not useful and, executing a deep learning algorithm in a deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • Not applicable.
  • STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • Not applicable.
  • INCORPORATION BY REFERENCE OF MATERIAL SUBMITTED ON A COMPACT DISC
  • Not applicable.
  • TECHNICAL FIELD
  • The claimed subject matter relates to data analysis, and more specifically to the field of data analysis systems utilizing artificial intelligence.
  • BACKGROUND
  • With the onset of the Information Age—the historical period that began in the late 20th century, characterized by a rapid epochal shift from the traditional industry established by the Industrial Revolution to an economy primarily based upon information technology—vast amounts of data (sometimes referred to as “big data”) have become available. To gain insight into said data, individuals traditionally manually inspected the data, cleaned the data, and labelled it correctly for processing. The volume of data today, however, has made it infeasible for humans to perform this conventional process. There are simply not enough personnel with the appropriate qualifications to process the data conventionally.
  • The emergence of big data has prompted research into different ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Current ways of processing big data are vulnerable to multiple issues, including high reliance on accurate presentation of datasets, limitations to binary features, and vulnerability to overfitting in which an analysis corresponds too closely to a particular set of data resulting in failure to fit additional data or predict future observations reliably. Furthermore, current ways of processing big data present complications for users attempting to calculate a specific feature's influence on the outcome of the data.
  • Therefore, there exists a need for improvements to data analysis for large datasets in order to provide users with a more efficient manner for determining the accuracy of the data in addition to measuring the influence of a specific feature value on the outcome of the data.
  • SUMMARY
  • This Summary is provided to introduce a selection of disclosed concepts in a simplified form that are further described below in the Detailed Description including the drawings provided. This Summary is not intended to identify key features or essential features of the claimed subject matter. Nor is this Summary intended to be used to limit the claimed subject matter's scope.
  • In one embodiment, a data analysis system utilizing custom unsupervised machine learning processes over a communications network is disclosed, the system comprising a repository of data connected to the communications network, a web application deployed on a web server connected to the communications network, the web application including a data collection interface between the web server and the repository of data, wherein the web application is configured for providing a graphical user interface for modifying, by a user, a plurality of threshold parameters of a clustering algorithm for clustering the data, executing the clustering algorithm with the plurality of threshold parameters that were modified by the user, thereby producing a set of results, providing a graphical user interface for reviewing, by the user, the set of results of the clustering algorithm and re-executing steps a) through c) if the set of results are not useful, copying the set of results into a deep learning software framework and, executing a deep learning algorithm in the deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.
  • To the accomplishment of the above and related objects, the claimed subject matter may be embodied in the form illustrated in the accompanying drawings, attention being called to the fact, however, that the drawings are illustrative only, and that changes may be made in the specific construction illustrated and described within the scope of the appended claims. The foregoing and other features and advantages of the claimed subject matter will be apparent from the following more particular description of the preferred embodiments of the claimed subject matter, as illustrated in the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute part of this specification, illustrate embodiments of the claimed subject matter and, together with the description, serve to explain the principles of the claimed subject matter. The embodiments illustrated herein are presently preferred, it being understood, however, that the claimed subject matter is not limited to the precise arrangements and instrumentalities shown, wherein:
  • FIG. 1 is a diagram of an operating environment that supports a system for data analysis using artificial intelligence, according to an example embodiment;
  • FIG. 2 is a diagram of a machine learning modeling architecture that supports the system for data analysis using artificial intelligence, according to an example embodiment;
  • FIG. 3 is a flow chart illustrating a first process flow of the system for data analysis using artificial intelligence, according to an example embodiment;
  • FIG. 4 is a flow chart illustrating a second process flow of the system for data analysis using artificial intelligence, according to an example embodiment;
  • FIG. 5 is a block diagram illustrating a computer system according to exemplary embodiments of the present technology.
  • DETAILED DESCRIPTION
  • The following detailed description refers to the accompanying drawings. Whenever possible, the same reference numbers are used in the drawings and the following description to refer to the same or similar elements. While disclosed embodiments may be described, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the elements illustrated in the drawings, and the methods described herein may be modified by substituting, reordering, or adding additional stages or components to the disclosed methods and devices. Accordingly, the following detailed description does not limit the disclosed embodiments. Instead, the proper scope of the disclosed embodiments is defined by the appended claims.
  • The disclosed embodiments improve upon the problems with the prior art by providing a system for data analysis using artificial intelligence configured to process data sets that are too large or complex to be dealt with by traditional data-processing application software. The disclosed embodiments also improve upon the problems with the prior art by providing a system for data analysis that reduces or eliminates the problems associated with processing big data, such as failure to fit additional data or predict future observations reliably.
  • The disclosed embodiments further improve upon the problems with the prior art by providing a system for data analysis using artificial intelligence configured to improve filtering for useful data within large datasets by reducing noise and identifying useful data in an unconventional manner that utilizes logistic regression tests on datasets iteratively in order to determine which datasets are useful to train a complex machine learning model. The system improves the accuracy of data utilized for training sets configured for machine learning models by allowing users to establish threshold parameters and iteratively removing data that is below the established threshold parameters. The system also improves visual representations and analytics associated with calculations of data features and their influence on outcomes. Finally, the system improves overall computer performance by reducing the amount of computing resources utilized to generate a trained model that not only makes more accurate predictions on datasets, but also more efficiently establishes relationships between data and provides generalizations of the data.
  • Referring now to FIG. 1, a system 100 for data analysis using artificial intelligence is depicted. System 100 comprises a user 102 operating on a computing device 104 in which computing device 104 is communicatively coupled to a communications network 106, such as the Internet. System 100 further comprises a server 108 and a data repository 110 (hereinafter referred to as "database") in which server 108 and database 110 are communicatively coupled to the communications network 106. System 100 may further comprise a customer data repository 111 (also hereinafter referred to as "database") communicatively coupled to communications network 106. Customer data repository 111 may house data that belongs to a customer of user 102, and said data may also or alternatively reside in the database 110.
  • System 100 further comprises a machine learning database 112, a machine learning model generator 114, and a regression and clustering module 116 each of which are communicatively coupled to the communications network 106. In one embodiment, data communicated on network 106 can be implemented using a variety of protocols on a network such as a wide area network (WAN), a virtual private network (VPN), metropolitan area networks (MANs), system area networks (SANs), a public switched telephone network (PSTN), a global Telex network, or a 2G, 3G, 4G or 5G network. Such networks can also generally contextually be referred to herein as the Internet or cloud. Machine learning is the study of computer algorithms that improve automatically through experience. Machine learning is a subset of artificial intelligence that uses algorithms that build a mathematical model based on sample data, also referred to as training data, in order to make predictions or decisions. Machine learning algorithms are used when it is difficult or infeasible to develop conventional algorithms to perform needed tasks.
  • Machine learning database 112 is a database configured for storing the data structures that are used for machine learning processes, such as storing data models, mathematical models, training data, training examples in array or vector form, etc. Machine learning model generator 114 is a set of routines and/or processes configured for generating models that are trained on training data and then can process additional data. Various types of models can be used for system 100, including artificial neural networks, decision trees, support vector machines, regression analysis, Bayesian networks and genetic algorithms. Regression and clustering module 116 is a set of routines and/or processes configured for conducting regression then cluster analysis. Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. A common form of regression analysis is linear regression, which comprises the discovery of a line that most closely fits the data according to a specific mathematical criterion. Cluster analysis is the task of grouping a set of data or objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups or clusters. Members of the same cluster are grouped based on one or more metrics indicating that the members of the group share more metric similarities with each other than the members of other groups.
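  • As a concrete illustration of the linear regression definition above (a sketch, not a component of the claimed system), the line that most closely fits noisy data under the least-squares criterion can be computed directly:

```python
# Least-squares fit: the line that most closely fits the data under a squared-error
# criterion. NumPy only; the data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy linear relationship

slope, intercept = np.polyfit(x, y, deg=1)   # minimizes the sum of squared residuals
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```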
  • It is to be understood that system 100 and said components may be implemented in hardware, software, or a combination thereof. In some embodiments, the various components of system 100 are implemented at least partially by hardware at one or more computing devices, such as one or more hardware processors executing instructions stored in one or more memories for performing various functions described herein. For example, descriptions of various components (or modules) as described in this application may be interpreted by one of ordinary skill in the art as providing pseudocode, an informal high-level description of one or more computer structures. The descriptions of the components may be converted into software code, including code executable by an electronic processor. System 100 illustrates only one of many possible arrangements of components configured to perform the functionality described herein. Other arrangements may include fewer or different components, and the division of work between the components may vary depending upon the arrangement.
  • In one embodiment, computing device 104, server 108 and database 111 may include, but are not limited to, one or more laptop computers, tablet computers, smartphones, desktop computers, internet of things (IOT) devices, servers, workstations, virtual machines, and any other mechanisms used to access networks and software. Computing device 104, server 108 and database 111 (which may include web servers) may be configured to run or host a web application communicatively coupled to network 106. The web application may be configured on a software platform configured to allow user 102 to contribute, monitor, and/or analyze data sets in addition to serving as a mechanism to present graphical user interfaces to user 102 that support receiving inputs from user 102 in order to perform components necessary for server 108 to perform execution of one or more algorithms disclosed herein. In one embodiment, user 102 provides a plurality of user input to computing device 104 in which server 108 receives the plurality of user input over communications network 106 and automatically performs tasks described herein on data housed in databases 110 and/or 111.
  • In one embodiment, the combination of machine learning database 112, machine learning model generator 114, and regression and clustering module 116 is configured to operate in unison in order to perform tasks such as execution of regression analysis and one or more high quality clustering algorithms in order to efficiently cluster data received by system 100. In one embodiment, machine learning model generator 114 may comprise one or more servers to support execution of applying the regression analysis and clustering algorithms and/or any other applicable algorithms to the data in databases 110, 111. In one embodiment, server 108 accesses databases 110 and/or 111 and data derived from user 102 and other applicable data sources (described in greater detail during the discussion of FIG. 2) and provides that data to machine learning database 112 in order for machine learning model generator 114 to apply the regression analysis and one or more high quality clustering algorithms associated with regression and clustering module 116 to the data. It is to be understood that regression and clustering module 116 supports a plurality of regression analysis algorithms and clustering algorithms including, but not limited to, k-means clustering, k-medoids clustering, CLARANS clustering, BIRCH clustering, CURE clustering, Chameleon clustering, DBSCAN clustering, OPTICS clustering, DENCLUE clustering, STING clustering, CLIQUE clustering, WaveCluster clustering, or any combination thereof, configured to support machine learning model generator 114 in discovering patterns in the data in order to generate groups in which the members share a significant number of associations.
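  • Several of the named clustering algorithms are available off the shelf; the sketch below, which assumes scikit-learn (one possible library, not one named by the disclosure), swaps interchangeable algorithms over the same dataset:

```python
# Sketch: interchangeable clustering algorithms from the family named above.
# scikit-learn provides k-means, BIRCH, DBSCAN and OPTICS; the remaining algorithms
# (k-medoids, CLARANS, CURE, and so on) would require other libraries.
from sklearn.cluster import DBSCAN, OPTICS, Birch, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

algorithms = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "BIRCH": Birch(n_clusters=4),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "OPTICS": OPTICS(min_samples=5),
}

for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(X)
    n_groups = len(set(labels) - {-1})      # label -1 marks noise for density methods
    print(f"{name}: found {n_groups} clusters")
```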
  • Referring now to FIG. 2, a machine learning modeling architecture 200 configured to support the system 100 for data analysis is depicted according to an example embodiment. Architecture 200 comprises a plurality of data sources 202-206 configured to transmit various types of datasets to system 100 in which server 108 serves as the receiving point configured to filter the data received from data sources 202-206 and store components of the data in databases 110 and/or 111. Alternatively, architecture 200 comprises a plurality of data sources 202-206 configured to serve various types of datasets to system 100 in which server 108 downloads or requests said data from data sources 202-206 and stores components of the data in databases 110 and/or 111.
  • Architecture 200 further comprises one or more training sets 208 (which may be stored in machine learning database 112) comprising a plurality of training instances (training data). Each training instance comprises a plurality of features (and, in some instances, values associated with each of the features), wherein the plurality of features are associated with at least one of the received data or the result of the one or more clustering algorithms and one or more other applicable functions applied to the received data during an iteration.
  • Architecture 200 further comprises a machine learning model 210 (which may be stored in machine learning database 112 and which may have been produced by machine learning model generator 114) configured to utilize one or more techniques in order to train the model based on at least training set 208. In one embodiment, training set 208 and machine learning model 210 may be comprised in a deep learning software framework in which machine learning software provides generic functionality that can be selectively changed by additional user-written code, thus providing application-specific software. The deep learning software framework provides a standard way to build and deploy machine learning applications and is a universal, reusable software environment that provides particular machine learning functionality as part of a larger software platform to facilitate development of machine learning software applications, products and solutions. The deep learning software framework may include support programs, compilers, code libraries, tool sets, application programming interfaces (APIs) and deep learning algorithms that bring together all the different components needed to develop a machine learning project or system. It is to be understood that machine learning is the study and construction of algorithms that can learn from, and make predictions on, data, and that such algorithms operate by building a model from inputs in order to make data-driven predictions or decisions. In particular, machine learning model 210 is trained based on training set 208, the resulting model representing weights or coefficients of the plurality of features and/or derivatives of the plurality of features. Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning that can be supervised, semi-supervised or unsupervised.
  • Unsupervised learning (otherwise referred to as unsupervised machine learning) is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. A common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data. The clusters are modeled using a measure of similarity which is defined upon metrics such as Euclidean or probabilistic distance. The claimed process described herein is considered unsupervised learning or unsupervised machine learning due to the use of cluster analysis.
  • In one embodiment, server 108, either alone or in combination with architecture 200, is configured to continuously download, receive, filter, compile, and transmit data received from data sources 202-206 and computing device 104 in order for machine learning model generator 114 to perform one or more functions on said data, in addition to utilizing regression and clustering module 116 to apply one or more of the aforementioned regression and clustering algorithms to the received data. In one embodiment, server 108 and/or machine learning model generator 114 may iteratively apply one or more logistic regression functions to said data (in combination with the one or more clustering algorithms), in which the clustering algorithms are configured to filter noise from said data and the logistic regression functions are configured to identify the most useful components of said data. In one embodiment, server 108 and/or machine learning model generator 114 are configured to execute a plurality of logistic regression tests on said data in order to determine which components of said data are useful enough to be included in training set 208 for training machine learning model 210.
  • In one embodiment, server 108 applies one or more logistic regression tests at each iteration for said data in order to determine which components of said data are considered useful enough to be included in training set 208. For example, for each feature of the plurality of features in each iteration, a regression model may be applied to the applicable feature in order to calculate the influence of the applicable feature on the overall outcome/output of the data after the one or more aforementioned algorithms are applied during the iteration. The application of the multiple logistic regression tests at each iteration increases the accuracy of the datasets overall in addition to ensuring that said data is useful enough to be included in training set 208 in order to provide more accurate predictions. In other words, it is possible for clustering algorithms and other applicable algorithms, such as those associated with logistic regression tests, to be applied to said data or the resulting data of the most recent iteration. Each logistic regression test may also calculate a metric of influence that measures the rank or quantifies the influence of each feature of the data.
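  • One way such a per-feature logistic regression test and its metric of influence could be realized is sketched below. This is an illustrative assumption only: it assumes scikit-learn, uses the cluster assignments of the current iteration as the regression target, and uses cross-validated accuracy as the influence metric; none of these choices are mandated by the disclosure, and the helper name feature_influence is hypothetical.

```python
# Sketch: score each feature's influence with a single-feature logistic regression test,
# assuming the cluster labels produced in the current iteration serve as the target.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_influence(data: np.ndarray, cluster_labels: np.ndarray) -> np.ndarray:
    """Return one influence score per feature (column) of `data`."""
    scores = []
    for j in range(data.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        # Cross-validated accuracy of the single feature at predicting the clusters
        # quantifies how much that feature contributes to the overall outcome.
        score = cross_val_score(clf, data[:, [j]], cluster_labels, cv=3).mean()
        scores.append(score)
    return np.array(scores)
```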
  • Logistic regression is a statistical model that may be utilized to model the probability of a certain class or event existing. Traditionally, logistic regression models utilize weights or coefficient values to predict an output value typically modeled as a binary value. The combination of clustering and logistic regression may be utilized to remove noise from data and to identify useful data for training a complex machine learning model. The claimed process described herein is considered machine learning logistic regression due to the use of regression analysis in a machine learning environment. Logistic regression may also be used to calculate functional data within the data being processed, wherein functional data comprises data providing information about curves, surfaces or anything else varying over a continuum.
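  • For reference, the weight-and-coefficient form of a logistic regression prediction described above can be sketched as follows; this is a minimal NumPy illustration, and the variable names and sample values are assumptions rather than part of the original disclosure.

```python
# Sketch: logistic regression maps a weighted sum of feature values to a probability
# between 0 and 1, which is then thresholded to produce a binary output value.
import numpy as np

def predict_probability(features: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """P(class = 1) = sigmoid(w . x + b)."""
    z = np.dot(weights, features) + bias
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 1.2, -0.7])          # feature values for one data point
w = np.array([0.8, -0.5, 0.3])          # learned coefficients (weights)
p = predict_probability(x, w, bias=0.1)
binary_output = int(p >= 0.5)           # typical binary decision
print(p, binary_output)
```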
  • In one embodiment, server 108 is also configured to provide a graphical user interface to computing device 104 over network 106 prompting user 102 for a plurality of threshold parameters (or modifications to said threshold parameters) configured to indicate the type, nature, and amount of said data that will be included in training set 208. Server 108 is configured to allocate a plurality of scores among the features of the data collected and processed (as defined above), wherein after each iteration certain data are removed from the dataset, resulting in a more accurate dataset. In one embodiment, server 108 provides functionality configured to allow user 102 to modify or toggle the plurality of threshold parameters in order to tune the data based on his or her preferences. For example, the applicable threshold parameter of the plurality of threshold parameters set by user 102 for the applicable iteration ensures that data or an applicable feature with a score lower than the applicable threshold parameter is removed from the overall dataset, thus preventing the applicable feature from being included in the following iteration and ultimately increasing the accuracy of the dataset overall. In one embodiment, user 102 may reset or adjust a plurality of hyperparameters associated with the set of results in order to increase the accuracy of the set of results or subsequent sets of results.
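  • A minimal sketch of the threshold-based pruning described above, assuming per-feature influence scores such as those computed in the earlier sketch and a user-supplied threshold value; the helper name prune_features is hypothetical.

```python
# Sketch: remove every feature whose score falls below the user-supplied threshold,
# so that only the higher-scoring features survive into the next iteration.
import numpy as np

def prune_features(data: np.ndarray, scores: np.ndarray, threshold: float):
    """Return the pruned data and the indices of the features that were kept."""
    keep = np.where(scores >= threshold)[0]
    return data[:, keep], keep
```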
  • It is to be understood that the web application provided by server 108 on computing device 104 is configured to allow user 102 to view the results after each iteration (of the processes undertaken by the machine learning modeling architecture 200, as defined above) in order to not only effectively evaluate the utility of the algorithms applied during the applicable iteration, but also provide user 102 with the opportunity to adjust or toggle the plurality of threshold parameters if necessary. Furthermore, the aforementioned feature provides user 102 with the opportunity to scrutinize both the features associated with the lower scores (scores below the applicable threshold parameter) and the effect of a particular feature on the results of an iteration and/or the dataset overall. Threshold parameters may comprise a plurality of numerical values, wherein each numerical value comprises a decimal number.
  • Referring now to FIG. 3, a flow chart 300 illustrating a first exemplary process of data analysis using artificial intelligence is depicted. At step 302, server 108 accesses database 110, database 111 and/or data sources 202-206 in order to collect data on which data analysis must be performed. It is to be understood that server 108 is designed and configured to download, request, receive, collect, filter, and compile various types of data derived from various data sources prior to performing the subsequent steps provided in this disclosure. Server 108 may continuously perform step 302 while simultaneously performing the subsequent steps in order to ensure that data is continuously being fed into system 100. Any data that is collected by server 108 may be stored in database 110 and/or database 111 and may be referred to herein as “said data”.
  • At step 304, server 108 provides a graphical user interface to computing device 104 over network 106, the interface configured to allow user 102 to provide and/or modify a plurality of threshold parameters. Said graphical user interface may also be supportive in the sense that it suggests a plurality of threshold parameters or changes to an existing plurality of threshold parameters. In one embodiment, when suggesting changes to the plurality of threshold parameters, server 108 may take into account various factors associated with said data, such as the source of the data, the computational demands required in order to process said data, pre-existing user preferences stored within databases 110 and/or 111, or any other applicable factors.
  • At step 306, server 108 utilizes architecture 200 to execute one or more high quality clustering algorithms, in addition to logistic regression tests, on said data. It is to be understood that the plurality of threshold parameters serves the purpose of determining the amount of data that is to be included in training set 208. Machine learning model generator 114 may be continuously receiving data from server 108 and iteratively applying the aforementioned algorithms to the data set while simultaneously removing data and features of data associated with scores that are less than the threshold parameters. In other words, at each iteration, the clustering algorithms and logistic regression tests are applied to the data, and after their application, the data associated with one or more scores lower than the corresponding threshold parameter are removed, thereby producing a set of results.
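  • Tying the previous sketches together, one possible (purely assumed) shape of the iteration performed at step 306 is shown below. It is a continuation of the earlier sketches rather than a standalone script: it reuses the hypothetical cluster_data, feature_influence, and prune_features helpers and treats the threshold as a decimal value on the same scale as the influence scores.

```python
# Sketch: one possible iteration loop for step 306, reusing the hypothetical
# cluster_data, feature_influence, and prune_features helpers sketched earlier.
import numpy as np

def run_iterations(data: np.ndarray, threshold: float, max_iterations: int = 10):
    """Repeat cluster -> score -> prune until no feature scores below the threshold."""
    for _ in range(max_iterations):
        labels = cluster_data(data, algorithm="kmeans", n_clusters=2)
        scores = feature_influence(data, labels)
        if (scores >= threshold).all():
            break                             # nothing left to remove; results are stable
        data, kept = prune_features(data, scores, threshold)
        if data.shape[1] == 0:                # guard: every feature was pruned
            break
    return data, labels, scores               # the "set of results" for user review
```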
  • At step 308, server 108 provides a graphical user interface to computing device 104 over network 106 providing user 102 with an opportunity to not only review the set of results, but also adjust or toggle the plurality of threshold parameters, if necessary, based on the set of results. If user 102 decides in step 309 to adjust or toggle the plurality of threshold parameters, then steps 304-308 are re-executed and server 108 (using architecture 200) re-executes the aforementioned algorithms and removes the applicable features of data based upon the new plurality of threshold parameters. If user 102 decides in step 309 that the resulting data is useful, then control flows to step 310.
  • At step 310, once server 108 and/or user 102 arrive at a determination that the set of results (the accuracy of the data) is adequate for inclusion in training set 208, server 108 copies the resulting data to the deep learning software framework comprising training set 208 and machine learning model 210. At step 312, the set of results manifested via training set 208 is used to train machine learning model 210. In one embodiment, training set 208 comprises multiple training instances associated with the set of results, and in some embodiments the training instances comprise a label. In one embodiment, different sets of training data may be based on different features. In one embodiment, machine learning model 210 is configured to capture the correlation between features of the set of results and the labels associated with the training instances. Each feature represents a measurable piece of data that can be used for analysis. Features may also be referred to as variables or attributes. The features included in the dataset can vary widely.
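  • As an illustration only, training machine learning model 210 inside a deep learning software framework could resemble the following sketch. TensorFlow/Keras is assumed as one example of such a framework, and labeled training instances derived from the set of results are assumed; the framework choice, layer sizes, and variable names are illustrative assumptions rather than requirements of the disclosure.

```python
# Sketch: train a small neural network (standing in for machine learning model 210)
# on training instances derived from the set of results (training set 208).
import numpy as np
import tensorflow as tf

def train_model(train_features: np.ndarray, train_labels: np.ndarray) -> tf.keras.Model:
    """Fit a simple feed-forward network that correlates features with labels."""
    num_classes = int(train_labels.max()) + 1
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(train_features.shape[1],)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_features, train_labels, epochs=10, batch_size=32, verbose=0)
    return model
```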
  • As explained above, a graphical user interface may be configured to present the set of results of each iteration of the aforementioned steps to user 102. In one embodiment, the graphical user interface may provide user 102 with the option to toggle the plurality of threshold parameters in order to effectively target one or more components of the dataset. It is to be understood that execution of the aforementioned steps results in relationships being established between the data and, due to the increased accuracy of the data, more efficient generalizations of the data are provided to user 102 as well. Relationships between data are defined as a connection between two or more pieces of data, such as a causal relationship connection or a correlated relationship connection. Generalizations about data are defined as true statements that may be made about two or more pieces of data, such as a category or classification. Generalizations posit the existence of a domain or set of data elements, as well as one or more common characteristics shared by those data elements. The web application may also be configured for generating a downloadable report comprising the set of results for review by the user.
  • Referring now to FIG. 4, a flow chart 400 illustrating a second exemplary process of data analysis using artificial intelligence is depicted. At step 402, server 108 reads a plurality of data from the plurality of data sources 202-206 and/or accesses databases 110, 111 in order to retrieve data. At step 404, server 108 reads the plurality of threshold parameters from user 102 via inputs on computing device 104. At step 406, the one or more clustering algorithms, in addition to the logistic regression tests, are performed by server 108 on the plurality of data, resulting in a plurality of data clusters or groups.
  • At step 408, server 108 provides a graphical user interface to computing device 104 allowing user 102 to review the set of results associated with the most recent iteration and/or a compilation of the rendered iterations. If no scores fall beneath the applicable threshold parameter, then no data is removed, and training set 208 will represent instances of the current dataset and be used to train machine learning model 210. It is to be understood that training set 208 comprising instances associated with the current dataset may be utilized within machine learning model 210 while server 108 is simultaneously running iterations of applying the algorithms and removing data associated with components of the data that are scored lower than the threshold parameter. This configuration allows for continuous processing of new or altered data by server 108 while allowing machine learning model 210 to continuously receive data and make predictions based on the dataset at each iteration. The set of results may comprise a plurality of scores associated with each of a plurality of features associated with the data that was analyzed.
  • At step 410, server 108 determines whether at least one remaining score is lower than the threshold parameter and, if so, identifies the feature associated with that score. If a feature is associated with the at least one remaining score, then step 412 occurs and the feature is removed from the dataset. Otherwise, the process proceeds directly to step 414, in which components of the dataset are stored into training set 208. At step 416, machine learning model 210 is trained based on training set 208. It is to be understood that none of the aforementioned steps are required to be performed in a particular order and variations are possible; however, it is important for server 108 to receive a threshold parameter or establish a default threshold parameter in order for features to be removed from the dataset based on said threshold parameter.
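  • A compact sketch of the decision branch in steps 410-416, under the same assumptions as the earlier sketches (per-feature scores already computed, a hypothetical train_model helper as sketched above, and an illustrative function name step_decision):

```python
# Sketch: decision branch of FIG. 4 -- prune features scoring below the threshold;
# otherwise store the surviving dataset as training data and (re)train the model.
import numpy as np

def step_decision(data: np.ndarray, scores: np.ndarray, labels: np.ndarray, threshold: float):
    below = scores < threshold
    if below.any():                                    # steps 410/412: remove low-scoring features
        data = data[:, ~below]
        return data, None                              # keep iterating on the reduced dataset
    training_features, training_labels = data, labels  # step 414: store into training set 208
    model = train_model(training_features, training_labels)  # step 416: train model 210
    return data, model
```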
  • FIG. 5 is a block diagram of a system including an example computing device 500 and other computing devices. Consistent with the embodiments described herein, the aforementioned actions performed by various components of system 100 (such as devices 112, 114, 116, 111, 108 and 104) may be implemented in a computing device, such as computing device 500. Any suitable combination of hardware, software, or firmware may be used to implement computing device 500. The aforementioned system, device, and processors are examples, and other systems, devices, and processors may comprise the aforementioned computing device. Furthermore, computing device 500 may comprise an operating environment for system 100. Processes and data related to system 100 may operate in other environments and are not limited to computing device 500.
  • A system consistent with an embodiment of the claimed subject matter may include a plurality of computing devices, such as a computing device 500 of FIG. 5. In a basic configuration, computing device 500 may include at least one processing unit 502 and a system memory 504. Depending on the configuration and type of computing device, system memory 504 may comprise, but is not limited to, volatile memory (e.g. random-access memory (RAM)), non-volatile memory (e.g. read-only memory (ROM)), flash memory, or any combination of memory. System memory 504 may include operating system 502, and one or more programming modules 506. Operating system 502, for example, may be suitable for controlling computing device 500's operation. Furthermore, embodiments of the claimed subject matter may be practiced in conjunction with a graphics library, other operating systems, or any other application program and are not limited to any particular application or system. This basic configuration is illustrated in FIG. 5 by those components within a dashed line 520.
  • Computing device 500 may have additional features or functionality. For example, computing device 500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 5 by a removable storage 509 and a non-removable storage 510. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. System memory 504, removable storage 509, and non-removable storage 510 are all examples of computer storage media (i.e., memory storage). Computer storage media may include, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store information and which can be accessed by computing device 500. Any such computer storage media may be part of system 500. Computing device 500 may also have input device(s) 512 such as a keyboard, a mouse, a pen, a sound input device, a camera, a touch input device, etc. Output device(s) 514 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are only examples, and other devices may be added or substituted.
  • Computing device 500 may also contain a network connection device 515 that may allow device 500 to communicate with other computing devices 518, such as over a network in a distributed computing environment, for example, an intranet or the Internet. Device 515 may be a wired or wireless network interface controller, a network interface card, a network interface device, a network adapter, or a LAN adapter. Device 515 allows for a communication connection 516 for communicating with other computing devices 518. Communication connection 516 is one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media. The term computer readable media as used herein may include both computer storage media and communication media.
  • As stated above, a number of program modules and data files may be stored in system memory 504, including operating system 502. While executing on processing unit 502, programming modules 506 (e.g. program module 507) may perform processes including, for example, one or more of the stages of a process. The aforementioned processes are examples, and processing unit 502 may perform other processes. Other programming modules that may be used in accordance with embodiments of the claimed subject matter may include electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.
  • Generally, consistent with embodiments of the claimed subject matter, program modules may include routines, programs, components, data structures, and other types of structures that may perform particular tasks or that may implement particular abstract data types. Moreover, embodiments of the claimed subject matter may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. Embodiments of the claimed subject matter may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
  • Furthermore, embodiments of the claimed subject matter may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip (such as a System on Chip) containing electronic elements or microprocessors. Embodiments of the claimed subject matter may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the claimed subject matter may be practiced within a general-purpose computer or in any other circuits or systems.
  • Embodiments of the claimed subject matter, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to embodiments of the claimed subject matter. The functions/acts noted in the blocks may occur out of the order as shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. It is also understood that components of the system may be interchangeable or modular so that the components may be easily changed or supplemented with additional or alternative components.
  • While certain embodiments of the claimed subject matter have been described, other embodiments may exist. Furthermore, although embodiments of the claimed subject matter have been described as being associated with data stored in memory and other storage mediums, data can also be stored on or read from other types of computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or a CD-ROM, or other forms of RAM or ROM. Further, the disclosed methods' stages may be modified in any manner, including by reordering stages and/or inserting or deleting stages, without departing from the claimed subject matter.
  • Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

We claim:
1. A data analysis system utilizing custom unsupervised machine learning processes over a communications network, the system comprising:
a repository of data connected to the communications network;
a web application deployed on a web server connected to the communications network, the web application including a data collection interface between the web server and the repository of data, wherein the web application is configured for:
a) providing a graphical user interface for modifying, by a user, a plurality of threshold parameters of a clustering algorithm for clustering the data, wherein the clustering algorithm comprises at least a machine learning logistic regression function;
b) executing the clustering algorithm with the plurality of threshold parameters that were modified by the user, thereby producing a set of results;
c) providing a graphical user interface for reviewing, by the user, the set of results of the clustering algorithm and re-executing steps a) through c) if the set of results are not useful;
d) copying the set of results into a deep learning software framework; and
e) executing a deep learning algorithm in the deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.
2. The data analysis system of claim 1, further comprising:
wherein the machine learning logistic regression function is configured for identifying a set of functional data comprised within the data.
3. The data analysis system of claim 2, further comprising:
wherein the machine learning logistic regression function is configured to calculate a metric of influence for each feature of a plurality of features associated with the data.
4. The data analysis system of claim 3, further comprising:
wherein the plurality of threshold parameters comprises a plurality of numerical values, wherein each numerical value comprises a decimal number.
5. The data analysis system of claim 4, further comprising:
wherein the graphical user interface for reviewing, by the user, the set of results comprises a supportive graphical user interface.
6. The data analysis system of claim 5, further comprising:
wherein the web application is further configured for generating a downloadable report comprising the set of results for review by the user.
7. The data analysis system of claim 6, further comprising:
wherein the set of results comprises a plurality of scores associated with each of the plurality of features associated with the data.
8. A method for data analysis utilizing custom unsupervised machine learning processes over a communications network, the method comprising:
storing data in a repository connected to the communications network;
providing a web application deployed on a web server connected to the communications network, the web application including a data collection interface between the web server and the repository, wherein the web application is configured for:
a) providing a graphical user interface for modifying, by a user, a plurality of threshold parameters of a clustering algorithm for clustering the data, wherein the clustering algorithm comprises at least a machine learning logistic regression function;
b) executing the clustering algorithm with the plurality of threshold parameters that were modified by the user, thereby producing a set of results;
c) providing a graphical user interface for reviewing, by the user, the set of results of the clustering algorithm and re-executing steps a) through c) if the set of results are not useful;
d) copying the set of results into a deep learning software framework; and
e) executing a deep learning algorithm in the deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.
9. The method of claim 8, further comprising:
wherein the machine learning logistic regression function is configured for identifying a set of functional data comprised within the data.
10. The method of claim 9, further comprising:
wherein the machine learning logistic regression function is configured to calculate a metric of influence for each feature of a plurality of features associated with the data.
11. The method of claim 10, further comprising:
wherein the plurality of threshold parameters comprises a plurality of numerical values, wherein each numerical value comprises a decimal number.
12. The method of claim 11, further comprising:
wherein the graphical user interface for reviewing, by the user, the set of results comprises a supportive graphical user interface.
13. The method of claim 12, further comprising:
wherein the web application is further configured for generating a downloadable report comprising the set of results for review by the user.
14. The method of claim 13, further comprising:
wherein the set of results comprises a plurality of scores associated with each of the plurality of features associated with the data.
15. A data analysis system utilizing custom unsupervised machine learning processes over a communications network, the system comprising:
a repository of data connected to the communications network;
a web application deployed on a web server connected to the communications network, the web application including a data collection interface between the web server and the repository of data, wherein the web application is configured for:
a) providing a graphical user interface for modifying, by a user, a plurality of threshold parameters of a clustering algorithm for clustering the data, wherein the clustering algorithm comprises at least a machine learning logistic regression function;
b) executing the clustering algorithm with the plurality of threshold parameters that were modified by the user, thereby producing a set of results including a plurality of scores associated with at least one component of the data;
c) providing a graphical user interface for reviewing, by the user, the set of results of the clustering algorithm and re-executing steps a) through c) with an adjusted set of the plurality of threshold parameters, if the set of results are not useful;
d) copying the set of results into a deep learning software framework; and
e) executing a deep learning algorithm in the deep learning software framework on the set of results, thereby establishing relationships between the data, and providing generalizations of the data.
16. The data analysis system of claim 15, further comprising:
wherein the machine learning logistic regression function is configured for identifying a set of functional data comprised within the data.
17. The data analysis system of claim 16, further comprising:
wherein the machine learning logistic regression function is configured to calculate a metric of influence for each feature of a plurality of features associated with the data.
18. The data analysis system of claim 17, further comprising:
wherein the plurality of threshold parameters comprises a plurality of numerical values, wherein each numerical value comprises a decimal number.
19. The data analysis system of claim 18, further comprising:
wherein the graphical user interface for reviewing, by the user, the set of results comprises a supportive graphical user interface.
20. The data analysis system of claim 19, further comprising:
wherein the web application is further configured for generating a downloadable report comprising the set of results for review by the user.
US17/013,106 2020-09-04 2020-09-04 Data analysis system using artificial intelligence Abandoned US20220076157A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/013,106 US20220076157A1 (en) 2020-09-04 2020-09-04 Data analysis system using artificial intelligence
US17/017,289 US11037073B1 (en) 2020-09-04 2020-09-10 Data analysis system using artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US17/013,106 US20220076157A1 (en) 2020-09-04 2020-09-04 Data analysis system using artificial intelligence

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/017,289 Continuation US11037073B1 (en) 2020-09-04 2020-09-10 Data analysis system using artificial intelligence

Publications (1)

Publication Number Publication Date
US20220076157A1 true US20220076157A1 (en) 2022-03-10

Family

ID=76320948

Family Applications (2)

Application Number Title Priority Date Filing Date
US17/013,106 Abandoned US20220076157A1 (en) 2020-09-04 2020-09-04 Data analysis system using artificial intelligence
US17/017,289 Active US11037073B1 (en) 2020-09-04 2020-09-10 Data analysis system using artificial intelligence

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/017,289 Active US11037073B1 (en) 2020-09-04 2020-09-10 Data analysis system using artificial intelligence

Country Status (1)

Country Link
US (2) US20220076157A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3575813B1 (en) * 2018-05-30 2022-06-29 Siemens Healthcare GmbH Quantitative mapping of a magnetic resonance imaging parameter by data-driven signal-model learning
US20220076157A1 (en) * 2020-09-04 2022-03-10 Aperio Global, LLC Data analysis system using artificial intelligence
CN113238908B (en) * 2021-06-18 2022-11-04 浪潮商用机器有限公司 Server performance test data analysis method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014014417A1 (en) * 2012-07-17 2014-01-23 Lim Shio Hwi System and method for multi party characteristics and requirements matching and timing synchronization
US20180000462A1 (en) * 2016-06-29 2018-01-04 Niramai Health Analytix Pvt. Ltd. Classifying hormone receptor status of malignant tumorous tissue from breast thermographic images
US20180060759A1 (en) * 2016-08-31 2018-03-01 Sas Institute Inc. Automated computer-based model development, deployment, and management
US20180114123A1 (en) * 2016-10-24 2018-04-26 Samsung Sds Co., Ltd. Rule generation method and apparatus using deep learning
WO2019092267A1 (en) * 2017-11-13 2019-05-16 Technische Universität München Automated noninvasive determining the fertility of a bird's egg
US20190155797A1 (en) * 2016-12-19 2019-05-23 Capital One Services, Llc Systems and methods for providing data quality management
US10449454B2 (en) * 2017-01-27 2019-10-22 Mz Ip Holdings, Llc System and method for determining events of interest in a multi-player online game
US20200251224A1 (en) * 2017-09-20 2020-08-06 Koninklijke Philips N.V. Evaluating input data using a deep learning algorithm
US11037073B1 (en) * 2020-09-04 2021-06-15 Aperio Global, LLC Data analysis system using artificial intelligence

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5862312A (en) 1995-10-24 1999-01-19 Seachange Technology, Inc. Loosely coupled mass storage computer cluster
US7043500B2 (en) 2001-04-25 2006-05-09 Board Of Regents, The University Of Texas Syxtem Subtractive clustering for use in analysis of data
US8117203B2 (en) 2005-07-15 2012-02-14 Fetch Technologies, Inc. Method and system for automatically extracting data from web sites
US8359279B2 (en) * 2010-05-26 2013-01-22 Microsoft Corporation Assisted clustering
US9043326B2 (en) 2011-01-28 2015-05-26 The Curators Of The University Of Missouri Methods and systems for biclustering algorithm
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US9218574B2 (en) * 2013-05-29 2015-12-22 Purepredictive, Inc. User interface for machine learning
EP2963577B8 (en) 2014-07-03 2020-01-01 Palantir Technologies Inc. Method for malware analysis based on data clustering
US10089581B2 (en) 2015-06-30 2018-10-02 The Boeing Company Data driven classification and data quality checking system
TW201812646A (en) 2016-07-18 2018-04-01 美商南坦奧美克公司 Distributed machine learning system, method of distributed machine learning, and method of generating proxy data
US20180039946A1 (en) 2016-08-03 2018-02-08 Paysa, Inc. Career Data Analysis Systems And Methods
US20200012890A1 (en) 2018-07-06 2020-01-09 Capital One Services, Llc Systems and methods for data stream simulation
US10681056B1 (en) * 2018-11-27 2020-06-09 Sailpoint Technologies, Inc. System and method for outlier and anomaly detection in identity management artificial intelligence systems using cluster based analysis of network identity graphs
CA3122509C (en) 2018-12-11 2023-09-19 Exxonmobil Research And Engineering Company Machine learning-augmented geophysical inversion

Also Published As

Publication number Publication date
US11037073B1 (en) 2021-06-15

Legal Events

Date Code Title Description
AS Assignment

Owner name: APERIO GLOBAL, LLC, VIRGINIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WATKINS, DAMIAN;REEL/FRAME:053698/0927

Effective date: 20200823

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED