AU2020101987A4 - DIMA-Dataset Discovery: DATASET DISCOVERY IN DATA INVESTIGATIVE USING MACHINE LEARNING AND AI-BASED PROGRAMMING - Google Patents


Info

Publication number
AU2020101987A4
AU2020101987A4
Authority
AU
Australia
Prior art keywords
data
avatar
discovery
professor
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020101987A
Inventor
K. Harinath
Raja boina Raja Kumar
Venkata Rajesh Masina
Attili Venkata Ramana
Annaluri Sreenivasa Rao
N Chandra Sekhar Reddy
T. Rama Reddy
M. Shanmukhi
Divya N. Sree
Nazia Tabassum
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Divya N Sree
Original Assignee
Divya N Sree
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Divya N Sree filed Critical Divya N Sree
Priority to AU2020101987A
Application granted
Publication of AU2020101987A4
Legal status: Ceased
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/043 Distributed expert systems; Blackboards
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics

Abstract

DIMA-Dataset Discovery: DATASET DISCOVERY IN DATA INVESTIGATIVE USING MACHINE LEARNING AND AI-BASED PROGRAMMING

ABSTRACT

Our invention "DIMA-Dataset Discovery" describes improved capabilities for designing, mapping, developing, training, validating and deploying discovery virtual avatars: avatars embodying mathematical models to be used on documents and large data repositories. For example, an avatar may be constructed by machine learning and AI-based programming methods and processes, including by processing information related to what types of information analysts and investigators find useful in large data sets. In the invention, an avatar may be deployed as an aid to human intuition in a wide range of analytical processes, such as those related to international and national security, enterprise management, data management, query mapping (advertising, sales, marketing, product, promotions, placement, pricing, etc.), dispute resolution (including litigation), forensic analysis, criminal, administrative, civil and private investigations, scientific investigations, research and development, and a wide range of others. The data elements from the source data may be presented, tracked, scored, rated, mapped or ranked based at least in part on the identifiers within the data cluster relating to the super-set topic. The mathematical model may be optimized based at least in part on a comparison of the scored data elements and, upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model may be saved and stored as a computer-based discovery avatar parent. A second set of extracted data features may be extracted from the source data that share a second attribute that is related to both the super-set topic and a subset topic.

Dr. M. Shanmukhi (Professor), Nazia Tabassum (Assistant Professor), Dr. Raja boina Raja Kumar (Associate Professor), Dr. Attili Venkata Ramana (Associate Professor), Dr. Annaluri Sreenivasa Rao (Assistant Professor), N. Sree Divya (Assistant Professor), K. Harinath (Assistant Professor), Dr. Rama Reddy T (Professor), Dr. Venkata Rajesh Masina (Associate Professor), Dr. N Chandra Sekhar Reddy (Professor & Head). TOTAL NO. OF SHEETS: 05; NO. OF FIG.: 05.

FIG. 1 is a simplified diagram of a Curiosity Engine method and system for the creation and training of discovery avatars.

Description

FIG. 1 is a simplified diagram of a Curiosity Engine method and system for the creation and training of discovery avatars.
Australian Government, IP Australia. Innovation Patent, Australia. Patent Title: DIMA-Dataset Discovery: DATASET DISCOVERY IN DATA INVESTIGATIVE USING MACHINE LEARNING AND AI-BASED PROGRAMMING
Name and address of patentee(s):
Dr. M. Shanmukhi (Professor), Department of CSE. Address: Vasavi College of Engineering (Autonomous), Ibrahimbagh, Hyderabad, TS, India.
Nazia Tabassum (Assistant Professor), Department of IT. Address: Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, TS, India.
Dr. Raja boina Raja Kumar (Associate Professor), Department of CSE. Address: Rajeev Gandhi College of Engineering and Technology (Autonomous), Nandyal, Kurnool, Andhra Pradesh, India.
Dr. Attili Venkata Ramana (Associate Professor), Department of Electronics and Computer Engineering (ECM). Address: Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar, Hyderabad, Telangana, India.
Dr. Annaluri Sreenivasa Rao (Assistant Professor), Department of Information Technology. Address: VNR Vignana Jyothi Institute of Engineering & Technology, Hyderabad 500090, India.
N. Sree Divya (Assistant Professor), Department of IT. Address: Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, TS, India.
K. Harinath (Assistant Professor), Department of IT. Address: Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, TS, India.
Dr. Rama Reddy T (Professor), Department of CSE. Address: Aditya Engineering College (A), Surampalem, E.G. Dt., A.P., India.
Dr. Venkata Rajesh Masina (Associate Professor), Department of CSE. Address: Aditya Engineering College (Autonomous), Surampalem, A.P., India.
Dr. N Chandra Sekhar Reddy (Professor & Head), Department of CSE. Address: MLR Institute of Technology, Hyderabad, Telangana, India.
Complete Specification: Australian Government.
FIELD OF THE INVENTION
Our invention "DIMA-Dataset Discovery" relates to dataset discovery in data investigation using machine learning and AI-based programming, and to data management, discovery, and organization within voluminous data repositories.
BACKGROUND OF THE INVENTION
With the rapid increase in data creation and the capability to cheaply and reliably store vast volumes of data has come an increasing complexity in organizing, searching and discovering data elements within large data repositories. One result is that traditional techniques for searching data for needed elements, such as keyword searching, Boolean operators, and enhanced search are insufficient to cull wanted data from large data repositories because even a small mismatch between, for example, a keyword and data included in a document, may result in the document being omitted from the search results. Similarly, the presence of a keyword in too many documents within a data stream may result in over-inclusive searching, producing search results that are too voluminous for a human to review in an acceptable amount of time.
Further, a keyword match may lack intelligence and produce data query results that combine documents simply on the basis of sharing a word (e.g., "state"), even though that keyword has substantively different meanings in the documents (e.g., "solid state" and "state of mind,"). Also, individuals may have a strong intuitive sense of what information is valuable within a set of results, but may not be able to develop keywords that properly reflect that intuition. Therefore, a need exists for document and data discovery methods and systems that are capable of being trained, that are capable of representing intuitive review processes, that are scalable, and that may be deployed within large data repositories.
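The two failure modes described above, under-inclusive matching ("solid" missing "solid-state") and over-inclusive matching ("state" hitting unrelated senses), can be reproduced in a few lines. The corpus and the `keyword_search` helper below are invented purely for illustration.

```python
# Hypothetical three-document corpus illustrating keyword-search pitfalls.
docs = {
    1: "the drive failed in a solid state array",
    2: "his state of mind at the time was unclear",
    3: "solid-state storage outperformed the disk",
}

def keyword_search(keyword, corpus):
    """Return ids of documents containing the exact keyword token."""
    return sorted(i for i, text in corpus.items() if keyword in text.split())

# Over-inclusive: "state" matches both the hardware and the mental sense.
hits_state = keyword_search("state", docs)
# Under-inclusive: "solid" misses the hyphenated variant in document 3.
hits_solid = keyword_search("solid", docs)
```

Even this tiny example shows why the patent argues for trained models over literal keyword matching: the search returns documents 1 and 2 for "state" while silently dropping document 3 for "solid".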
Currently, most innovations in diagnosis and in therapy remain within the framework of morphology (e.g., the study of tumor shapes), physiology (the study of organ function), and chemistry. With the advent of molecular biology and molecular genetics, medicine and pharmacology have entered the information age. Information technology, which has been so widely applied to the understanding of human intelligence (artificial intelligence, neural networks), telecommunications, and the Internet, should be applicable to the study of the program of life.
Disease used to be understood as the intrusion of foreign agents (e.g., bacteria) that should be deleted, or as a chemical imbalance that should be compensated. In the genomic era, diseases are interpreted as a deficiency of the genetic program to adapt to its environment caused by missing, lost, exaggerated or corrupted genetic information. We are moving towards an age when disease and disease susceptibility will be described and remedied not only in terms of their symptoms (phenotype), but in term of their cause: external agents and genetic malfunction (genotype).
A great deal of effort of the pharmaceutical industry is presently being directed toward detecting the genetic malfunction (diagnosis) and correcting it (cure), using the tools of modern genomics and biotechnology. Correcting a genetic malfunction can occur at the DNA level using gene therapy. The replacement of tissues destroyed by, e.g., arthrosis, heart disease, or neuro-degeneration could be achieved by activating natural regeneration processes, following a mechanism similar to that of embryonic development.
Most genes, when activated, yield the production of one or several specific proteins. Acting on proteins is projected to be the domain of modern drug therapy. There are two complementary ways of acting on proteins:
(1) the concentration of proteins soluble in serum can be modified by using them directly as drugs.
(2) chemical compounds that interact selectively with given proteins can be used as drugs.
It has been estimated that between 10,000 and 15,000 human genes code for soluble proteins. If only a small percentage of these proteins have a therapeutic effect, a considerable number of new medicinal substances based on proteins remain to be found. Presently, approximately 100 proteins are used as medicines.
All of today's drugs that are known to be safe and effective are directed at approximately 500 target molecules. Most drug targets are either enzymes (22%) or receptors (52%). Enzymes are proteins responsible for activating certain chemical reactions (catalysts). Enzyme inhibitors can, for example, halt cell reproduction for purposes of fighting bacterial infection. The inhibition of enzymes is one of the most successful strategies for finding new medicines, one example of which is the use of reverse transcriptase inhibitors to fight infection by the HIV retrovirus.
Receptors can be defined as proteins that form stable bonds with ligands such as hormones or neurotransmitters. Receptors can serve as "docking stations" for toxic substances to selectively poison parasites or tumor cells (chemotherapy). In the pharmacological definition, receptors are stimuli or signal transceivers. Blocking a receptor such as a neurotransmitter receptor, a hormone receptor or an ion channel alters the functioning of the cell. Since the 1950's, many successful drugs which function as receptor blockers have been introduced, including psycho-pharmaceuticals, beta blockers, calcium antagonists, diuretics, new anesthetics, and anti-inflammatory preparations.
It can be estimated that about one thousand genes are involved in common diseases. The proteins associated with these genes may not be all good drug targets, but among the dozens of proteins that participate in the regulatory pathway, one can assume that at least three to five represent good drug targets. According to this estimate, 3,000 to 5,000 proteins could become the targets of new medicines, which is an order of magnitude greater than what is known today.
With a typical drug development process costing about $300-500 million per drug, providing a better ranking of potential leads is of the utmost importance. With the recent completion of the first draft of the human genome that revealed its 30,000 genes, and with the new microarray and combinatorial chemistry technologies, the quantity and variety of genomics data are growing at a significantly more rapid pace than the informatics capacity to analyze them.
The emphasis of molecular biology is shifting from a hypothesis driven model to a data driven model. Previously, years of intense laboratory research were required to collect data and test hypotheses regarding a single system or pathway and studying the effect of one particular drug. The new data intensive paradigm relies on a combination of proprietary data and data gathered and shared worldwide on tens of thousands of simultaneous miniaturized experiments. Bioinformatics is playing a crucial role in managing and analyzing this data.
While drug development will still follow its traditional path of animal experimentation and clinical trials for the most promising leads, it is expected that the acquisition, of data from arraying technology and combinatorial chemistry followed by proper data analysis will considerably accelerate drug discovery and cut down the development cost.
Additionally, completely new areas will develop such as personalized medicine. As is known, a mix of genetic and environmental factors causes diseases. Understanding the relationships between such factors promises to improve considerably disease prevention and yield to significant health care cost savings. With genomic diagnosis, it will also be possible to prescribe a well-targeted drug, adjust the dosage and monitor treatment.
Following the challenge of genome sequencing, it is generally recognized that the two most important bioinformatics challenges are microarray data analysis (with the analysis of tens of thousands of variables) and the construction of decision systems that integrate data analysis from different sources. The essence of the problem of designing good cost effective diagnosis test or determining good drug targets is to establish a ranking among candidate genes or proteins, the most promising ones coming at the top of the list.
To be truly effective, such a ranked list must incorporate knowledge from a great variety of sources, including genomic DNA information, gene expression, protein concentration, and pharmacological and toxicological data. Challenges include: analyzing data sets with few samples but very large numbers of inputs (thousands of gene expression coefficients from only 10-20 patients); using data of poor quality or incomplete data; combining heterogeneous data sets; visualizing results; incorporating the assistance of human experts; complying with rules and checks for safety requirements; satisfying economic constraints (e.g., selecting only one or two best leads to be pursued); in the case of an aid to decision makers, providing justifications of the system's recommendations; and, in the case of personalized medicine, making the information easily accessible to the public.
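One way to read the ranking requirement above is as a weighted combination of per-source evidence scores, with missing sources tolerated. The sketch below is not from the patent; the gene names, scores, and source weights are invented, and a missing source simply contributes zero.

```python
# Minimal sketch: rank candidates by a weighted sum of heterogeneous
# evidence sources (all names and numbers are illustrative inventions).
def rank_candidates(evidence, weights):
    """Rank candidate names by weighted sum of per-source scores."""
    combined = {
        name: sum(weights[src] * scores.get(src, 0.0) for src in weights)
        for name, scores in evidence.items()
    }
    return sorted(combined, key=combined.get, reverse=True)

evidence = {
    "geneA": {"expression": 0.9, "literature": 0.2},
    "geneB": {"expression": 0.4, "literature": 0.8},
    "geneC": {"expression": 0.1},   # incomplete data is tolerated
}
weights = {"expression": 0.6, "literature": 0.4}
ranking = rank_candidates(evidence, weights)
```

The most promising candidates come to the top of the list, mirroring the lead-ranking goal stated in the text.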
PRIOR ART SEARCH
US6272507B1 (1997-04-09 / 2001-08-07), Xerox Corporation: System for ranking search results from a collection of documents using spreading activation techniques.
US6363429B1 (1999-04-20 / 2002-03-26), 3Com Corporation: Method and system for automatic determination of priority data streams on computer networks.
US6938097B1 (1999-07-02 / 2005-08-30), Sonicwall, Inc.: System for early packet steering and FIFO-based management with priority buffer support.
US7164678B2 (2001-06-25 / 2007-01-16), Intel Corporation: Control of processing order for received network packets.
US7376642B2 (2004-03-30 / 2008-05-20), Microsoft Corporation: Integrated full text search system and method.
EP1983445B1 (2006-02-17 / 2018-12-26), Google LLC: Encoding and adaptive, scalable accessing of distributed models.
WO2007130546A2 (2006-05-04 / 2007-11-15), JPMorgan Chase Bank, N.A.: System and method for restricted party screening and resolution services.
US20100318957A1 (2009-06-16 / 2010-12-16), International Business Machines Corporation: System, method, and apparatus for extensible business transformation using a component-based business model.
The present application claims the priority of each of the following U.S. provisional patent applications: Ser. No. 60/298,842, Ser. No. 60/298,757, and Ser. No. 60/298,867, all filed Jun. 15, 2001, and, for U.S. national stage purposes, is a continuation-in-part of PCT application Ser. No. PCT/US02/16012, which was filed in the U.S. Receiving Office on May , 2002, and was filed as U.S. national stage application Ser. No. 10/478,192 on Nov. 18, 2003, which is a continuation-in-part of U.S. patent application Ser. No. 10/057,849, filed Jan. 24, 2002, now issued as Pat. No. 7,117,188, which is a continuation-in-part of application Ser. No. 09/633,410, filed Aug. 7, 2000, now issued as Pat. No. 6,882,990, which is a continuation-in-part of application Ser. No. 09/578,011, filed May 24, 2000, now issued as Pat. No. 6,658,395, which is a continuation-in-part of application Ser. No. 09/568,301, filed May 9, 2000, now issued as Pat. No. 6,427,141, which is a continuation of application Ser. No. 09/303,387, filed May 1, 1999, now issued as Pat. No. 6,128,608, which claims priority to U.S. provisional application Ser. No. 60/083,961, filed May 1, 1998.
This application is related to co-pending applications Ser. No. 09/633,615, now abandoned, Ser. No. 09/633,616, now issued as Pat. No. 6,760,715, and Ser. No. 09/633,850, now issued as Pat. No. 6,789,061, all filed Aug. 7, 2000, which are also continuations-in-part of application Ser. No. 09/578,011. This application is also related to applications Ser. No. 09/303,386, now abandoned, and Ser. No. 09/305,345, now issued as Pat. No. 6,157,921, both filed May 1, 1999, and to application Ser. No. 09/715,832, filed Nov. 14, 2000, now abandoned, all of which also claim priority to provisional application Ser. No. 60/083,961. Each of the above-identified applications is incorporated herein by reference.
OBJECTIVES OF THE INVENTION
1. An objective of the invention is that the source data may be a stored repository of documents.
2. An objective of the invention is that the source data may derive from a plurality of distributed data storage repositories.
3. An objective of the invention is that the tokenization may be white-space tokenization.
4. An objective of the invention is that the scoring may be performed by a human, and the scoring by the human may be quantitatively weighted by a metadatum associated with the human. A metadatum may be a job title, a credential, or some other type of metadatum. The scoring may also be performed by an algorithm.
5. An objective of the invention is that the discovery avatar may categorize the source data based at least in part on the use of support vector machines.
6. An objective of the invention is that the discovery avatar may be deployed for use on a second data source to create a second set of data clusters using the optimized model of the discovery avatar.
7. An objective of the invention is that the discovery avatar may be deployed for use on a plurality of data sources to create a plurality of data clusters that are scored and used to rank each of the plurality of data sources according to relevance to the substantive topic.
8. An objective of the invention is that the subset topic may be defined by terms that are included in a set of terms used to define the super-set topic. In embodiments, the subset topic may be defined by terms that are additive to a set of terms used to define the super-set topic.
9. An objective of the invention is that the avatar parent may be memorialized and locked from further iterative improvement.
10. An objective of the invention is that the avatar parent may be deployed as an analytic commodity for use on a third source of data.
11. An objective of the invention is that the genealogy of avatar parent-avatar child relations may be presented in a graphic user interface.
12. An objective of the invention is that the relevance of the at least one attribute may be based at least in part on a quantitative association to a substantive topic inherent to a data source.
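Two of the objectives above, white-space tokenization and metadatum-weighted human scoring, are simple enough to sketch directly. The weight table below is an invented illustration; the patent does not specify particular titles or multipliers.

```python
def whitespace_tokenize(text):
    """White-space tokenization: split source text on runs of white space."""
    return text.split()

# Invented weight table: a metadatum (here, job title) scales a human score.
TITLE_WEIGHTS = {"senior analyst": 1.5, "analyst": 1.0}

def weighted_score(raw_score, job_title):
    """Quantitatively weight a human relevance score by a metadatum."""
    return raw_score * TITLE_WEIGHTS.get(job_title, 1.0)

tokens = whitespace_tokenize("dataset discovery in large repositories")
score = weighted_score(0.8, "senior analyst")
```

An unknown title falls back to a neutral weight of 1.0, so scoring by humans without recorded metadata still works.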
SUMMARY OF THE INVENTION
Provided herein are methods and systems for building, modifying, deploying, using and managing one or more computer-implemented avatars, referred to herein in some cases as "discovery avatars," that can assist one or more human analysts in conducting analysis of problems or exploration of topics, where analysis or exploration may include review of one or more source data sets, such as presented to the analysts in one or more data streams. An avatar may be constructed by machine learning processes, including by processing information related to what types of information analysts find useful in large data sets, such that each avatar represents an automated, mathematical representation of an analyst's knowledge and intuition about the relevance of material that appears in such data sets.
Once constructed, an avatar as described herein may be deployed as an aid to human intuition in a wide range of analytical processes, such as related to national security, enterprise management (e.g., programs related to sales, marketing, product, promotions, placement, pricing and the like), dispute resolution (including litigation), forensic analysis, criminal, administrative, civil and private investigations, scientific investigations, research and development, and a wide range of others.
In the invention, source data may be tokenized, and from the tokenized data a plurality of data features may be extracted. The extracted data features may be stored as quantitative vectors. The extracted data features may be analyzed using a mathematical model to determine a data cluster, wherein the data cluster includes extracted data features that share an attribute and includes identifiers that are associated with a plurality of data elements from the source data. Continuing the example, a first source datum, from the plurality of data elements from the source data, may be presented for review based at least in part on the identifiers within the data cluster.
The first source datum may be scored, rated, or ranked based at least in part on its relevance to a substantive topic. A second source datum from the plurality of data elements from the source data may also be presented, based at least in part on the identifiers within the data cluster, and scored, rated, or ranked based at least in part on its relevance to the substantive topic. The score of the first source datum may be compared to the score of the second source datum, and a mathematical model component of a discovery avatar may be optimized based at least in part on the comparison of scores. Following the optimization of the mathematical model, data may be iteratively selected from the source data and scored, rated, or ranked to further optimize the mathematical model. Upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model may be saved and/or stored as a computer-based discovery avatar.
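The tokenize, vectorize, and cluster stages of the pipeline described above can be compressed into a stdlib-only sketch. The similarity measure, threshold, and tiny corpus below are illustrative stand-ins for the trained mathematical model, not the patent's actual implementation.

```python
import math
from collections import Counter

def vectorize(text):
    """Store extracted token features as a quantitative (count) vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster_ids(corpus, seed_id, threshold=0.3):
    """Identifiers of documents whose vectors share the seed's attribute."""
    seed = vectorize(corpus[seed_id])
    return sorted(i for i, text in corpus.items()
                  if cosine(seed, vectorize(text)) >= threshold)

corpus = {
    "d1": "gene expression microarray analysis",
    "d2": "microarray gene expression data",
    "d3": "sales and marketing promotions",
}
members = cluster_ids(corpus, "d1")   # the data cluster's identifiers
```

Documents surfaced via these identifiers would then be presented for scoring, and the score comparisons would drive the iterative model optimization the text describes.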
In the invention, source data may be tokenized, and from the tokenized data a plurality of data features may be extracted. The extracted data features may be stored as quantitative vectors. The extracted data features may be analyzed using a mathematical model to determine a data cluster, wherein the data cluster includes extracted data features that share an attribute that is related to a super-set topic, and includes identifiers that are associated with a plurality of data elements from the source data. The data elements from the source data may be presented and scored, rated, or ranked based at least in part on the identifiers within the data cluster relating to the super-set topic.
The mathematical model may be optimized based at least in part on a comparison of the scored data elements. Upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model may be saved and/or stored as a computer based discovery avatar parent. A second set of extracted data features may be extracted from the source data that share a second attribute that is related to both the super-set topic and a subset topic. This may result in a second optimized mathematical model that is based on the super-set and subset topics and is stored as a computer-based discovery avatar child.
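The avatar-parent and avatar-child relationship above amounts to two extractions: the parent selects features sharing an attribute of the super-set topic, and the child re-extracts those that also relate to a subset topic. The helper and topic terms below are invented for illustration.

```python
def extract_features(source_data, required_terms):
    """Features from the source that carry every required topic term."""
    return [f for f in source_data if required_terms <= set(f.split())]

source = ["oncology gene panel", "oncology drug trial", "marketing budget"]

# Parent avatar: features sharing an attribute related to the super-set topic.
parent_features = extract_features(source, {"oncology"})
# Child avatar: a second extraction related to both super-set and subset topics.
child_features = extract_features(parent_features, {"oncology", "gene"})
```

The child's feature set is by construction a refinement of the parent's, which is the genealogy the patent later proposes to display in a graphic user interface.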
An attribute of a first mathematical model inherent in a first computer-based discovery avatar may be identified that is relevant to a second mathematical model inherent in a second computer-based discovery avatar. A second attribute from the first mathematical model inherent in the first computer-based discovery avatar may be incorporated within the second computer-based discovery avatar to create a cross-trained mathematical model in the second computer-based discovery avatar. The cross-trained mathematical model may then be validated by deploying the second computer-based discovery avatar on a set of source data substantially similar to source data on which the first computer based avatar was developed, wherein the validation is confirmed based at least in part on a comparison of data clusters derived using the first discovery avatar and data clusters derived using the cross-trained mathematical model of the second computer-based discovery avatar.
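Cross-training, as described above, can be caricatured with weighted-keyword "models": an attribute (a feature weight) of the first avatar's model is incorporated into the second avatar's model, and the cross-trained model is validated by checking that it reproduces the first avatar's cluster on similar source data. All weights, tokens, and documents below are invented.

```python
def classify(model, document, threshold=1.0):
    """True if the document's summed feature weights reach the threshold."""
    return sum(model.get(tok, 0.0) for tok in document.split()) >= threshold

avatar_a = {"fraud": 2.0, "invoice": 0.5}   # first avatar's trained weights
avatar_b = {"litigation": 1.5}              # second avatar's trained weights

# Cross-train: incorporate a relevant attribute of model A into model B.
avatar_b_crossed = dict(avatar_b)
avatar_b_crossed["fraud"] = avatar_a["fraud"]

# Validate on source data substantially similar to A's development data:
source = ["fraud invoice scheme", "routine invoice", "litigation hold notice"]
cluster_a = [d for d in source if classify(avatar_a, d)]
cluster_b = [d for d in source if classify(avatar_b_crossed, d)]
validated = set(cluster_a) <= set(cluster_b)   # A's cluster is reproduced
```

The comparison of the two derived clusters is the validation step the paragraph describes; here it simply confirms the cross-trained model recovers everything the first avatar found.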
The data mining platform of the present invention comprises a plurality of system modules, each formed from a plurality of components. Each module comprises an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit and to provide communication to other units. Each module processes a different type of data, for example, a first module processes microarray (gene expression) data while a second module processes biomedical literature on the Internet for information supporting relationships between genes and diseases and gene functionality. In the preferred embodiment, the data analysis engine is a kernel based learning machine, and in particular, one or more support vector machines (SVMs).
The data analysis engine includes a pre-processing function for feature selection, for reducing the amount of data to be processed by selecting the optimum number of attributes, or "features", relevant to the information to be discovered. In the preferred embodiment, the feature selection means is recursive feature elimination (RFE), such that the preferred embodiment of the data analysis engine uses RFE-SVM. The output of the data analysis engine of one module may be input into the data analysis engine of a different module. Thus, the output data from one module is treated as input data which would be subject to feature ranking and/or selection so that the most relevant features for a given analysis are taken from different data sources. Alternatively, the outputs of two or more modules may be input into an independent data analysis engine so that the knowledge is progressively distilled. For example, analysis results of microarray data can be validated by comparison against documents retrieved in an on-line literature search, or the results of the different modules can be otherwise combined into a single result or format.
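The elimination loop at the heart of RFE can be sketched in a few lines. In a real RFE-SVM, features are ranked by the magnitudes of the retrained SVM's weights after every elimination; here a static scoring function stands in for that ranking, and the per-feature scores are invented.

```python
def rfe(features, score_fn, n_keep):
    """Recursively eliminate the lowest-scoring feature until n_keep remain."""
    surviving = list(features)
    while len(surviving) > n_keep:
        surviving.remove(min(surviving, key=score_fn))
    return surviving

# Invented relevance scores standing in for |SVM weight| per feature; a real
# RFE-SVM would retrain the model and re-rank after every elimination.
scores = {"gene_a": 0.9, "gene_b": 0.1, "gene_c": 0.6, "gene_d": 0.3}
selected = rfe(scores, scores.get, n_keep=2)
```

The surviving features are the reduced attribute set that the module would pass to training or to a downstream module.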
In the data analysis engine, pre-processing can include identifying missing or erroneous data points, or outliers, and taking appropriate steps to correct the flawed data or, as appropriate, remove the observation or the entire field from the scope of the problem. Such pre-processing can be referred to as "data cleaning". Pre-processing can also include clustering of data, which provides means for feature selection by substituting the cluster center for the features within that cluster, thus reducing the quantity of features to be processed. The features remaining after pre-processing are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection.
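The "substitute the cluster center for the features within that cluster" step reduces many correlated features to one representative each. The sketch below uses one-dimensional feature values and a hand-made grouping; a real pipeline would obtain the groups from a clustering algorithm.

```python
def reduce_by_centers(clusters):
    """Substitute each cluster's center (mean) for its member features."""
    return [sum(cluster) / len(cluster) for cluster in clusters]

# Three hand-grouped sets of correlated feature values (illustrative numbers).
clusters = [[0.5, 1.5, 1.0], [4.0, 4.5], [8.0]]
centers = reduce_by_centers(clusters)   # one representative per cluster
```

Six raw features collapse to three representatives, shrinking the input the learning machine must process while preserving the coarse structure of the data.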
A test data set is pre-processed in the same manner as was the training data set. Then, the trained learning machine is tested using the pre-processed test data set. A test output of the trained learning machine may be post-processed to determine if the test output is an optimal solution based on the known outcome of the test data set.
In the context of a kernel-based learning machine such as a support vector machine, the present invention also provides for the selection of at least one kernel prior to training the support vector machine. The selection of a kernel may be based on prior knowledge of the specific problem being addressed or analysis of the properties of any available data to be used with the learning machine, and is typically dependent on the nature of the knowledge to be discovered from the data.
Kernels are usually defined for patterns that can be represented as a vector of real numbers. For example, linear kernels, radial basis function kernels and polynomial kernels all measure the similarity of a pair of real vectors. Such kernels are appropriate when the patterns are best represented as a sequence of real numbers.
An iterative process comparing post-processed training outputs or test outputs can be applied to make a determination as to which kernel configuration provides the optimal solution. If the test output is not the optimal solution, the selection of the kernel may be adjusted and the support vector machine may be retrained and retested. Once it is determined that the optimal solution has been identified, a live data set may be collected and pre-processed in the same manner as was the training data set to select the features that best represent the data. The pre-processed live data set is input into the learning machine for processing. The live output of the learning machine may then be post-processed by interpreting the live output into a computationally derived alphanumeric classifier or other form suitable to further utilization of the analysis results.
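The retrain-and-retest kernel selection loop might be sketched as follows, with a deliberately simplified kernel classifier (assignment to the class with greatest mean kernel similarity) standing in for a full support vector machine; all function names are illustrative.

```python
import math

def linear_k(x, z):
    # Linear kernel: dot product of the two pattern vectors.
    return sum(a * b for a, b in zip(x, z))

def rbf_k(x, z, gamma=1.0):
    # Radial basis function kernel: similarity decays with squared distance.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_classify(x, X_train, y_train, kernel):
    # Toy classifier: assign x to the class whose training points are,
    # on average, most similar to x under the chosen kernel.
    def mean_sim(c):
        members = [xi for xi, yi in zip(X_train, y_train) if yi == c]
        return sum(kernel(x, xi) for xi in members) / len(members)
    return max(set(y_train), key=mean_sim)

def select_kernel(kernels, X_tr, y_tr, X_te, y_te):
    # "Retrain and retest" with each candidate kernel on the held-out
    # test set; keep the configuration with the best test accuracy.
    def accuracy(k):
        return sum(kernel_classify(x, X_tr, y_tr, k) == y
                   for x, y in zip(X_te, y_te))
    return max(kernels, key=accuracy)
```

In practice the candidate set would also sweep kernel parameters (e.g., gamma, polynomial degree), not just kernel families.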
BRIEF DESCRIPTION OF THE DIAGRAM
FIG. 1: is a simplified diagram of a Curiosity Engine method and system for the creation and training of discovery avatars.
FIG. 2: is an embodiment of discovery avatar development and optimization.
FIG. 3: is steps for developing, optimizing and storing a discovery avatar.
FIG. 4: is an embodiment of avatar-parent and avatar-child development and optimization.
FIG. 5: is an embodiment of cross-training discovery avatars and mathematical models associated with discovery avatars.
DESCRIPTION OF THE INVENTION
The invention can be used to analyze biological data generated at multiple stages of investigation into biological functions, and further, to integrate the different kinds of data for novel diagnostic and prognostic determinations. For example, biological data obtained from clinical case information, such as diagnostic test data, family or genetic histories, prior or current medical treatments and the clinical outcomes of such activities, and published medical literature, can be utilized in the method and system of the present invention. Additionally, clinical samples such as diseased tissues or fluids, and normal tissues and fluids, and cell separations can provide biological data that can be utilized by the current invention.
Proteomic determinations such as 2-D gel, mass spectrophotometry and antibody screening can be used to establish databases that can be utilized by the present invention. Genomic databases can also be used alone or in combination with the above-described data and databases by the present invention to provide comprehensive diagnosis, prognosis or predictive capabilities to the user of the present invention.
A first aspect of the present invention facilitates analysis of data by pre-processing the data prior to using the data to train a learning machine and/or optionally post-processing the output from a learning machine. Generally stated, pre-processing data comprises reformatting or augmenting the data in order to allow the learning machine to be applied most advantageously. More specifically, pre-processing involves selecting a method for reducing the dimensionality of the feature space, i.e., selecting the features which best represent the data.
In the preferred embodiment, recursive feature elimination (RFE) is used, however, other methods may be used to select an optimal subset of features, such as those disclosed in co-pending PCT application Serial No. PCT/US02/16012, filed in the U.S. Receiving Office on May 20, 2002, entitled "Methods for Feature Selection in a Learning Machine", which is incorporated herein by reference. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection.
In a manner similar to pre-processing, post-processing involves interpreting the output of a learning machine in order to discover meaningful characteristics thereof. The meaningful characteristics to be ascertained from the output may be problem- or data specific. Post-processing involves interpreting the output into a form that, for example, may be understood by or is otherwise useful to a human observer, or converting the output into a form which may be readily received by another device for, e.g., archival or transmission.
FIG. 1: in embodiments of the present invention, a computer-based discovery avatar may be created based at least in part on starting with a data ingestion 101 or entry phase in which a set of data are selected to be used for creating and training a discovery avatar. In embodiments, data ingestion 101 may be performed using a web crawler or any search engine combined with a data storage system.
An example paradigm may include a combination such as, but not limited to, a web search software tool such as the open source tool NUTCH® provided by APACHE® and a search server, such as the Solr search tool provided by APACHE, which is based on the Lucene Java search library. Such a paradigm may use a distributed storage and computation tool such as the open source HADOOP™ framework from APACHE. In various embodiments, a wide variety of tools known to those of ordinary skill in the art may be used to extract, transform, load and store data from disparate sources into one or more formats suitable for ingestion by a discovery avatar, including in situations using distributed storage and computation capabilities. Similarly, various known techniques for normalizing, de-duplicating, error correcting, and otherwise cleansing input data sets may be used to provide a discovery avatar with a consistent, clean data set for its use.
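A minimal sketch of the normalization and de-duplication step, using only the Python standard library rather than the distributed tooling named above; cleanse is a hypothetical helper name.

```python
import re
import unicodedata

def cleanse(docs):
    # Normalize unicode, collapse whitespace, lowercase, and drop empty
    # strings and exact duplicates, so the discovery avatar ingests a
    # consistent, de-duplicated corpus.
    seen, out = set(), []
    for d in docs:
        norm = unicodedata.normalize("NFKC", d)
        norm = re.sub(r"\s+", " ", norm).strip().lower()
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```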
A discovery avatar's point of ingesting data may be conceptualized as a gate (hereinafter, "Pantheon") to the discovery avatar. As data pass through the Pantheon, a discovery avatar works to extract 102 features from data. Data feature extractors may include, but are not limited to, custom Java® or Python™ or similar programming processes that use Natural Language Processing to identify key elements of a document. Once data features are extracted 102, the discovery avatar software may again compute to transform documents and/or document elements (such as tokenized data derived from documents) into vectors 102 for further analysis, such as deriving clusters 105 that relate to a topic 104 of interest that is used by the Curiosity Engine, as described herein, to develop, train, optimize and store discovery avatars.
These vectors may be very high dimensional mathematical objects. Statistical techniques, such as variants of k-means clustering and LDA topic modeling, may be used to create data clusters 105 and/or document clouds. The discovery avatar may take the largest member in n-space of each data cluster 105 or data cloud, the second largest member, and so forth, until a human user provides sufficient feedback for the supervised learning of the discovery avatar. Supervised learning routines such as Support Vector Machines may be trained according to a human-user-specified topic 104, and used to queue 108 and score 111 data and/or documents, such as test documents 107, from a data source according to their relevance to the specified topic.
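Selecting the "largest member in n-space" of each cluster for presentation to the user might look like the following sketch, assuming the vectors and cluster labels have already been computed; largest_members is an illustrative name.

```python
def largest_members(vectors, labels):
    # For each cluster label, pick the index of the vector with the
    # greatest Euclidean norm (the "largest member in n-space") as the
    # candidate document to show to the human user first.
    best = {}
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        norm = sum(x * x for x in v) ** 0.5
        if lab not in best or norm > best[lab][0]:
            best[lab] = (norm, i)
    return {lab: i for lab, (_, i) in best.items()}
```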
These scores may then be used to determine a subset of the data and/or documents to present to the user for feedback 109. Once the user is presented with a list of documents selected by the discovery avatar, the user may label documents 110 as relevant or not as they pertain to a particular topic. New labels 110 may indicate the need for new vectors or new training of additional discovery avatars focused on other topics that are discovered in the data source. The discovery avatar may provide relevance scores for both labeled and unlabeled documents.
The former may be done for the purpose of computing precision-recall and ROC curves. In embodiments, users may add new or custom features 112, ranging from timestamps on files to word-pair proximity (e.g., how far is the word "analytic" from the word "engine"?). New documents 113 may enter the system and be prepared for examination by the discovery avatar. Once a discovery avatar is trained and is performing well, a user may choose to cap the training 114, stop further review 116, and lock and memorialize the discovery avatar, allowing no further influence on the mathematical model of the avatar. The mathematical model used by the discovery avatar may be applied to incoming documents 115 before they are fully ingested, allowing the user the option of adding them to the corpus. A data corpus may be determined "complete" and memorialized with a set of discovery avatars.
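The label-retrain-lock cycle described above can be caricatured with a toy keyword-weight model in place of the actual vector-space learning machine; the class and method names are hypothetical.

```python
class DiscoveryAvatar:
    # Toy relevance model: per-word weights updated from user labels.
    # Once locked, further labels have no influence on the model.
    def __init__(self):
        self.weights = {}
        self.locked = False

    def score(self, doc):
        # Relevance score: sum of the weights of the words in the document.
        return sum(self.weights.get(w, 0.0) for w in doc.split())

    def label(self, doc, relevant):
        if self.locked:  # capped training 114: model is memorialized
            return
        delta = 1.0 if relevant else -1.0
        for w in doc.split():
            self.weights[w] = self.weights.get(w, 0.0) + delta

    def lock(self):
        # Lock and memorialize the avatar's mathematical model.
        self.locked = True
```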
The invention may provide for an avatar for modeling iterative investigation, such as for obtaining an indication of some elements of a data stream that are perceived to be helpful to at least one human analyst conducting an investigation, and characterizing the helpful elements in a computer-based avatar that manages the queue of additional data stream elements to improve the quality of the data stream for the analyst, and the like. In embodiments, managing the queue may include ordering, ranking, filtering, clustering, and the like, the data stream elements.
The invention may provide for a discovery avatar for modeling iterative investigation, such as for constructing a computer-based avatar that manages a queue of data stream elements to aid at least one human analyst who is conducting an investigation, such as including tokenizing source data within a data stream presented to an analyst such that the source data may be extracted based on a topic.
A topic of investigation may be identified by the analyst, and a set of source data extracted and queued that is related to the topic. Items within the set of source data may initially be rated by the human analyst, the ratings allowing formation of a computer-based avatar for the topic that is based on the human ratings of the source data. The avatar may then be used to queue additional source data, and the avatar may be iteratively improved by a set of cycles of avatar formation, queuing, and analyst rating, and the like, such that with each cycle the avatar increasingly reflects the human ratings, which may be based on explicit intent, intuition, or a combination of both.
The invention may provide for a discovery avatar for modeling iterative investigation. Once a sufficient number of iterations have been conducted (as judged by human evaluation of the quality of the avatar, or by comparison, optionally automated, of the performance of the avatar against a performance metric), an avatar may be locked and/or memorialized, so that in future usage the avatar is used to queue data within new data sets for an analyst, but the avatar itself remains unchanged. For instance, an indication may be obtained of some elements of a data stream that are perceived to be helpful to at least one human analyst who is conducting an investigation on a topic.
The helpful elements may be characterized in a computer-based avatar that manages a queue of additional data stream elements to improve the quality of the data stream for the topic, and a topical avatar may be iteratively improved through a series of rounds of human review and rating of the elements presented in the managed queue. A version of the avatar for the topic may be locked after such improvement. A locked avatar might, for example, represent the intuition of a particular analyst, such as a very skilled police investigator or intelligence analyst, who is perceived to have unique knowledge, training or insight when reviewing potentially relevant information. Future analysts may thus benefit from the knowledge of past expert analysts by receiving data sets that are queued according to the ratings of the past expert.
The invention may provide for a discovery avatar for modeling iterative investigation, such as for using the avatar as a commodity. For instance, an indication of some elements of a data stream may be obtained that are perceived to be helpful to at least one human analyst conducting an investigation on a topic. The helpful elements may be characterized in a computer-based avatar that manages the queue of additional data stream elements to improve the quality of the data stream presented to the analyst for the topic. The formulation of the avatar may be stored as a computing element that can be deployed by another. In embodiments, the stored avatar computing element may be an application that can be deployed as a commodity, a mathematical summary of the elements of the data stream and their relation to the topic, and the like. The mathematical summary of the elements of the data stream may be based at least in part on an algorithmic modeling of tokenized elements from the data stream.
The invention may provide for a discovery avatar for modeling iterative investigation, such as an avatar that is used for a group of participants. For instance, an indication may be obtained of some elements of a data stream that are perceived to be helpful to a plurality of human participants who are contributing to at least one analytic investigation, characterizing the helpful elements in a computer-based avatar that manages the queue of additional data stream elements to improve the quality of the data stream for the participants in the investigation, and the like. In embodiments, each member of the group may participate in rating documents, with the collective ratings being used to form the mathematical representation that comprises the avatar and that is used to queue future information.
The contributions or ratings of group members may be weighted, such that, for example, a supervisor's ratings, or the ratings of a more experienced person, are provided with more weight as compared to a less experienced or junior person. In embodiments a group avatar may be trained and locked, but variants may be spawned and maintained as "children," such as for each of the group participants, such that a data flow might be initially queued based on the group avatar, then shuffled based on the preferences of a particular member of the group.
The preferences may be specified by an analyst in a rule-based manner, in conjunction with a process that uses a discovery avatar. For example, an analyst might declare a rule to see all documents of a certain type first, notwithstanding what would otherwise be queued for the analyst based on past ratings. Thus, an avatar may be used in a compound analytic data presentation process where data queued by the avatar may be presented together with data found in other ways, such as conventional web searching, database queries, or the like.
The invention may provide for a discovery avatar for modeling iterative investigation, such as in conjunction with question-based call and response of human experts. For instance, an indication may be obtained of some elements of a data stream that are perceived to be helpful to at least one human analyst who is conducting a question-based investigation, characterizing the helpful elements in a computer-based avatar that manages the queue of additional data stream elements to improve the quality of the data stream for the analyst with respect to the topic to which the questions relates, and the like. In embodiments, this may form the topic that is the investigative purpose of the discovery avatar.
The invention may provide for a discovery avatar for modeling iterative investigation, such as using a trained avatar as a mathematical model, deployable, scalable, and the like, and which may not be reliant on the document source on which it was trained. For instance, an indication may be obtained of some elements of a data stream that are perceived to be helpful to at least one human analyst who is conducting an investigation on a topic. The helpful elements may be characterized in a computer-based avatar that manages the queue of additional data stream elements to improve the quality of the data stream presented to the analyst for the topic, and the formulation of the avatar may be stored as a mathematical model-based computing element that can be deployed on another stream independent of the data stream on which it was trained, and the like.
The invention may provide for constructing a longitudinal avatar that manages a queue of data stream elements to aid at least one human analyst conducting an investigation, including tokenizing source data within a data stream presented to an analyst such that the source data may be extracted based on a topic. A topic of investigation may be identified by the analyst, and a set of source data may be extracted and queued related to the topic.
The set of source data may then be rated by the human analyst, or a computer running an algorithm, and a computer-based discovery avatar for the topic may be formed based on the human ratings of the source data, wherein the human ratings are mathematically weighted according to a criterion. The discovery avatar may be used to queue additional source data, facilitating analyst rating of the additional source data, and the discovery avatar may be iteratively improved by a set of cycles of avatar formation, queuing, and analyst rating, and the like. In embodiments, the criterion may be used to mathematically weight the human rating based on the date of the human rating, the expertise of the human, the title of the human, and the like.
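The criterion-based weighting of ratings reduces to a weighted average; the following is a minimal sketch in which each rating carries a weight encoding, e.g., recency of the rating or seniority of the rater (the helper name is illustrative).

```python
def weighted_rating(ratings):
    # ratings: list of (score, weight) pairs. The weight encodes a
    # criterion such as the date of the rating or the rater's expertise,
    # so that, e.g., a supervisor's rating counts more than a junior's.
    total_w = sum(w for _, w in ratings)
    return sum(s * w for s, w in ratings) / total_w if total_w else 0.0
```

In a full system these weights would enter the learning machine as per-sample weights during training rather than as a simple post-hoc average.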
The invention may provide for a user and/or management interface for an avatar for modeling an iterative investigation, such as in a computer program product embodied in a non-transitory computer readable medium that, when executing on one or more computers, may perform the steps of presenting an interface that is enabled to manage a computer-based avatar, wherein the avatar is a mathematical summary of data stream elements that is based at least in part on an algorithmic modeling of tokenized elements from the data stream.
A parameter selection may be received from a user of the interface, wherein the parameter relates at least in part to a criterion on which the mathematical summary is based. A visualization of the criterion may be presented to the interface. In embodiments, the criterion may be a data source, a date, or some other type of data. The visualization may depict a longitudinal trend relating to the criterion, a comparison of a first criterion with a second criterion (e.g., Data Source 1 with Data Source 2), and the like.
The invention may provide for parent-child avatars for modeling iterative investigation, such as in a method of constructing a computer-based discovery avatar that may manage a queue of data stream elements to aid at least one human analyst conducting an investigation, including tokenizing source data within a data stream presented to an analyst such that the source data may be extracted based on a super-set topic. A super-set topic of investigation may be identified by the analyst, and a set of source data related to the super-set topic may be extracted and queued.
The human analyst may rate the set of source data, forming a computer-based parent avatar for the super-set topic based on the human ratings of the source data. A second set of source data may then be tokenized such that the second set of source data may be extracted based on a subset topic relating at least in part to the super-set topic; using the avatar to queue additional source data from the source data and the second set of source data, facilitating analyst rating of the additional source data, and iteratively improving a child-avatar by a set of cycles of avatar formation, queuing, and analyst rating, wherein the cycles of formation queuing and analyst rating are based at least in part on the superset topic and subset topic. In embodiments, the second set of source data may be a subset of the set of source data, an additive to the set of source data, and the like. The parent avatar may be memorialized and locked from further iterative improvement. The parent avatar may be deployed as an analytic commodity for use on a third source of data. The genealogy of parent-child avatar relations may be tracked/visualized (e.g., "Korea" and "Japan" avatars branching from an "East Asia Industrial Organization" avatar).
The discovery avatars may be capable of communicating with one another, in order to find hidden patterns, mathematical similarities, topical relationships, connections and correlations between their models and the content they explore. This cross-avatar communication may result in relevant alerts and, where appropriate, information sharing between avatars. Avatars may alert their users where there are other avatars and research topics relevant to their own existing topics and research. By analogy, the avatars may exist within an avatar social network in which the avatars communicate, locate, identify and "friend" (i.e., initiate a social networking-based relationship) other avatars in a manner similar to humans within a social network identifying and "friending" other humans with whom they, for example, share an interest (i.e., topic). The friending of avatars may enable nuanced recommendations to users. The friending that occurs among avatars may also enable users to learn from other users that they may not otherwise be in communication with.
The invention may provide for avatar cross-training, such as a method of optimizing a computer-based discovery avatar, including automating identification of at least one common attribute of at least one mathematical model inherent in a first computer-based avatar and at least one mathematical model inherent in a second computer-based avatar, and incorporating a second attribute from at least one mathematical model inherent in the first computer-based avatar within the second computer-based avatar to create a cross-trained mathematical model in the second computer-based avatar. The cross trained mathematical model may then be validated by deploying the second computer based avatar on a set of source data substantially similar to source data on which the first computer-based avatar was developed/trained.
The present invention may provide for an avatar-search hybrid facility, such as a method of constructing a computer-based avatar that manages a queue of data stream elements to aid at least one human analyst conducting an investigation, including tokenizing source data within a data stream presented to an analyst such that the source data may be extracted based on a topic; identifying a topic of investigation by the analyst, wherein the topic identification is further assisted using collaborative filtering based at least in part on a concordance of a stored data attribute relating to the analyst and a second stored data attribute relating to at least one other human. A set of source data related to the topic may then be extracted and queued, facilitating rating of the set of source data by the human analyst. A computer-based discovery avatar for the topic may be formed based on the human ratings of the source data, and the discovery avatar used to queue additional source data, further facilitating analyst rating of the additional source data. The discovery avatar may then be iteratively improved by a set of cycles of avatar formation, queuing, and analyst rating. In embodiments, the stored data attribute may be a job title, a credential, and the like.
The invention may provide for a discovery avatar that may be deployed in different data venues including, but not limited to, the Internet, enterprise data systems, distributed storage, cloud-based storage, or some other data source or repository.
The invention may provide for a spiral processing method for populating a discovery avatar that may be used for modeling an iterative investigation, such as a method of constructing a topic for a computer-based avatar that manages a queue of data stream elements to aid at least one human analyst conducting an investigation. The method may include tokenizing source data within a data stream, wherein priority is given to tokenizing larger data components within the data stream over smaller data components; extracting topic clusters from the source data, wherein the extracted topic clusters are formed based at least in part on a frequency of keyword occurrence, or "magnitude," of topic prevalence; and identifying a topic of interest from the extracted topic clusters and queuing a set of source data related to the topic of interest. The topic of interest may then be validated by rating the set of source data by a human analyst, computer algorithm, or some other scoring, rating, or ranking method or system.
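The spiral ordering (larger components first) and the frequency-based topic extraction might be sketched as follows, with whitespace tokenization and raw term counts standing in for the full pipeline; topic_clusters is a hypothetical name.

```python
from collections import Counter

def topic_clusters(docs, top_n=2):
    # Spiral processing sketch: tokenize larger data components first,
    # then rank keywords by frequency of occurrence (the "magnitude" of
    # topic prevalence) to propose candidate topics of interest.
    counts = Counter()
    for doc in sorted(docs, key=len, reverse=True):
        counts.update(doc.lower().split())
    return [w for w, _ in counts.most_common(top_n)]
```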
FIGS. 2 and 3: in embodiments of the present invention, source data from a data stream 202 may be tokenized 204, and from the tokenized data a plurality of data features may be extracted 208. The extracted data features may be analyzed 210 and stored as quantitative vectors. The extracted data features may be analyzed using a mathematical model to determine a data cluster, wherein the data cluster includes extracted data features that share an attribute and includes identifiers that are associated with a plurality of data elements from the source data. Continuing the example, a first source datum, from the plurality of data elements from the source data, may be presented for review based at least in part on the identifiers within the data cluster. The first source datum may be scored, rated, or ranked based at least in part on its relevance to a substantive topic.
A second source datum from the plurality of data elements from the source data may also be presented, based at least in part on the identifiers within the data cluster, and scored, rated, or ranked based at least in part on its relevance to the substantive topic. The score of the first source datum may be compared to the score of the second source datum, and a mathematical model component of a discovery avatar may be optimized based at least in part on the comparison of scores 212. Following the optimization of the mathematical model, data may be iteratively selected from the source data and scored, rated, or ranked to further optimize the mathematical model. Upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model may be saved and/or stored as a computer-based discovery avatar 214.
The source data may be a stored repository of documents. In embodiments, the source data may derive from a plurality of distributed data storage repositories. In embodiments, the tokenization may be white space tokenization. In embodiments, the scoring may be performed by a human, and the scoring by the human may be quantitatively weighted by a metadatum associated with the human. A metadatum may be a job title, a credential, or some other type of metadatum. The scoring may also be performed by an algorithm. In embodiments, the discovery avatar may categorize the source data based at least in part on the use of support vector machines.
The discovery avatar may be deployed for use on a second data source to create a second set of data clusters using the optimized model of the discovery avatar. In embodiments the discovery avatar may be deployed for use on a plurality of data sources to create a plurality of data clusters that are scored and used to rank each of the plurality of data sources according to relevance to the substantive topic.
FIG. 3: is steps for developing, optimizing and storing a discovery avatar. As illustrated in FIG. 3, a computer-based discovery avatar is constructed in 302. The computer-based discovery avatar can manage a queue of data stream elements to aid an investigation. In 304, the source data is tokenized and a plurality of data features can be extracted from the tokenized source data. Features of the extracted data can be analyzed in 308 using a mathematical model to determine a data cluster. In 310, a first source datum is presented for review. The first source datum can be scored in 312 based at least in part on its relevance to a substantive topic. In 314, a second source datum may be presented for review and scored in 318. The score of the first source datum can be compared to the score of the second source datum in 320. The mathematical model can then be optimized in 322 based, at least in part, on the comparison of scores. To improve the scores received by data elements from the source data, 308 through 322 can be repeated and the optimized model can be stored as a computer-based discovery avatar in 324.
FIGS. 4 and 5: in embodiments of the present invention, source data, such as from a data stream 402, may be tokenized 404 and from the tokenized data a plurality of data features may be extracted 408. The extracted data features may be stored as quantitative vectors. The extracted data features may be analyzed 410 and modeled 412, using a mathematical model to determine a data cluster, wherein the data cluster includes extracted data features that share an attribute that is related to a super-set topic, and includes identifiers that are associated with a plurality of data elements from the source data. The data elements from the source data may be presented and scored, rated, or ranked based at least in part on the identifiers within the data cluster relating to the super-set topic.
The mathematical model may be optimized based at least in part on a comparison of the scored data elements. Upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model may be saved and/or stored as a computer based discovery avatar parent 414. A second set of extracted data features 418 may be extracted from the source data 402 that share a second attribute that is related to both the super-set topic and a subset topic. This may be analyzed in a second analysis 420 and result in a second optimized mathematical model that is based on the super-set and subset topics and is stored as a computer-based discovery avatar child 422.
The subset topic may be defined by terms that are included in a set of terms used to define the super-set topic. In embodiments, the subset topic may be defined by terms that are additive to a set of terms used to define the super-set topic. In embodiments, the avatar parent may be memorialized and locked from further iterative improvement. In embodiments, the avatar parent may be deployed as an analytic commodity for use on a third source of data. In embodiments, the genealogy of avatar parent-avatar child relations may be presented in a graphic user interface.
FIG. 5: a plurality of source data (602, 610) may be used to create a plurality of discovery avatars that may be used to cross-train each other, resulting in a cross-trained mathematical model 620 that may be utilized by at least one discovery avatar. In embodiments, an attribute 618 of a first mathematical model 608 inherent in a first computer-based discovery avatar 604 may be identified that is relevant to a second mathematical model inherent in a second computer-based discovery avatar.
A second attribute from the first mathematical model inherent in the first computer based discovery avatar 604 may be incorporated within the second computer-based discovery avatar 612 to create a cross-trained mathematical model 620 in the second computer-based discovery avatar 612. The cross-trained mathematical model 620 may then be validated 622 by deploying the second computer-based discovery avatar 612 on a set of source data 610 substantially similar to source data 602 on which the first computer-based avatar 604 was developed, wherein the validation is confirmed based at least in part on a comparison of data clusters derived using the first discovery avatar 604 and data clusters derived using the cross-trained mathematical model 620 of the second computer-based discovery avatar 612.
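Validating a cross-trained model by comparing the data clusters it derives against those of the original avatar could use a pairwise agreement measure such as the following sketch (a simplified stand-in for measures like the Rand index); the function name is illustrative.

```python
def cluster_agreement(labels_a, labels_b):
    # Fraction of item pairs on which two clusterings agree, i.e. the
    # pair is in the same cluster in both, or in different clusters in
    # both. 1.0 means identical groupings (labels themselves may differ).
    n, agree, pairs = len(labels_a), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
    return agree / pairs if pairs else 1.0
```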
Pre-processing Functions. Pre-processing can have a strong impact on SVM performance. In particular, feature scales must be comparable. A number of pre-processing methods may be used individually or in combination. One method is to subtract the mean of a feature from each of its values, then divide the result by the feature's standard deviation. Such pre-processing is not necessary if scaling is taken into account in the computational cost function. Another pre-processing operation can be performed to reduce skew in the data distribution and provide a more uniform distribution. This step involves taking the log of the value, which is particularly advantageous when the data consists of gene expression coefficients, which are often obtained by computing the ratio of two values. For example, in a competitive hybridization scheme, DNA from two samples that are labeled differently is hybridized onto the array. At every point of the array one obtains two coefficients corresponding to the fluorescence of the two labels and reflecting the fraction of DNA of either sample that hybridized to the particular gene. Typically, the first pre-processing step is to take the ratio a/b of these two values. Although this step is adequate, it may not be optimal when the two values are small.
Other initial preprocessing steps include:
(a-b)/(a+b) and
(log a-log b)/(log a+log b).
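A minimal NumPy sketch of these transforms; the fluorescence values `a` and `b` are hypothetical, and the final standardization (subtract the mean, divide by the standard deviation) is the one described above:

```python
import numpy as np

# Hypothetical fluorescence values for the two labels at three array spots
a = np.array([120.0, 45.0, 300.0])
b = np.array([100.0, 90.0, 150.0])

ratio = a / b                                   # simple ratio a/b
sym = (a - b) / (a + b)                         # bounded alternative, stable for small values
logsym = (np.log(a) - np.log(b)) / (np.log(a) + np.log(b))

# Standardize: subtract the mean, divide by the standard deviation
x = np.log(ratio)                               # log transform reduces skew in ratio data
standardized = (x - x.mean()) / x.std()
```

After standardization the feature has zero mean and unit variance, making its scale comparable to other features.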
Another pre-processing step involves normalizing the data across all samples by subtracting the mean. This step is supported by the fact that, with tissue samples, there are variations in experimental conditions from microarray to microarray. Although the standard deviation seems to remain fairly constant, another possible pre-processing step is to divide the gene expression values by the standard deviation to obtain centered data of standardized variance.
To normalize each gene expression across multiple tissue samples, the mean expression value and standard deviation for each gene can be computed. For all the tissue sample values of that gene (training and test), that mean is then subtracted and the resultant value divided by the standard deviation. In some experiments, an additional pre-processing step can be added by passing the data through a squashing function f(x) = c·arctan(x/c) to diminish the importance of outliers.
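A sketch of the per-gene normalization followed by the arctan squashing function; the expression matrix and the constant `c` are illustrative:

```python
import numpy as np

# Rows are genes, columns are tissue samples (values are hypothetical);
# the last sample of gene 0 is an outlier
X = np.array([[2.0, 4.0, 6.0, 100.0],
              [1.0, 1.5, 2.0, 2.5]])

# Per-gene normalization: subtract each gene's mean, divide by its std
mu = X.mean(axis=1, keepdims=True)
sd = X.std(axis=1, keepdims=True)
Xn = (X - mu) / sd

# Squashing function f(x) = c * arctan(x / c) damps the influence of outliers:
# every value is bounded by c * pi / 2 regardless of how extreme it was
c = 2.0
Xs = c * np.arctan(Xn / c)
```

The squashing step leaves small values nearly unchanged (arctan is close to the identity near zero) while compressing extremes.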
In a variation on several of the preceding pre-processing methods, the data can be pre-processed by a simple "whitening" to make the data matrix resemble "white noise." The samples can be pre-processed to: normalize matrix columns; normalize matrix lines; and normalize columns again. Normalization consists of subtracting the mean and dividing by the standard deviation. A further normalization step can be taken when the samples are split into a training set and a test set.
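The column/line/column normalization pass can be sketched as follows; the matrix dimensions and distribution are synthetic stand-ins:

```python
import numpy as np

def standardize(M, axis):
    # Subtract the mean and divide by the standard deviation along one axis
    return (M - M.mean(axis=axis, keepdims=True)) / M.std(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(20, 8))   # samples x features (synthetic)

# "Whitening"-style pass: normalize columns, then lines (rows), then columns again
X = standardize(X, axis=0)
X = standardize(X, axis=1)
X = standardize(X, axis=0)
```

The final pass guarantees each column (feature) ends with zero mean and unit variance, which is the property the SVM cost function cares about.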
Clustering Methods:
Because of data redundancy, it may be possible to find many subsets of data that provide a reasonable separation. To analyze the results, the relatedness of the data should be understood. In correlation methods, the rank order characterizes how correlated the data is with the separation. Generally, a highly ranked data point taken alone provides a better separation than a lower-ranked data point. It is therefore possible to set a threshold (e.g., keep only the top-ranked data points) that separates "highly informative data points" from "less informative data points".
Feature selection methods such as SVM-RFE, described below, provide subsets of data that are both smaller and more discriminant. The data selection method using SVM-RFE also provides a ranked list of data. With this list, nested subsets of data of increasing sizes can be defined. However, the fact that one data point has a higher rank than another does not mean that this one factor alone characterizes the better separation. In fact, data eliminated in an early iteration could well be very informative but redundant with others that were kept. Data ranking allows for building nested subsets of data that provide good separations; however, it provides no information as to how good an individual data point may be.
Data ranking alone is insufficient to characterize which data points are informative and which are not, or to determine which data points are complementary and which are redundant. Therefore, additional pre-processing in the form of clustering may be appropriate. Feature ranking is often combined with clustering. One can obtain a ranked list of subsets of equivalent features by ranking the clusters. In one such method, a cluster can be replaced by its cluster center and scores can be computed for the cluster center. In another method, the features can be scored individually and the score of a cluster computed as the average score of the features in that cluster.
To overcome the problems of data ranking alone, the data can be pre-processed with an unsupervised clustering method. Using the QT clust ("quality clustering") algorithm, which is known in the art, particularly to those in the field of analysis of gene expression profiles, or some other clustering algorithm such as hierarchical clustering or SVM-based clustering, data can be grouped according to resemblance (according to a given metric). Cluster centers can then be used instead of the data points themselves and processed by SVM-RFE to produce nested subsets of cluster centers. An optimum subset size can be chosen with the same cross-validation method used before.
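The "cluster, then feed the centers to feature selection" step can be illustrated with a toy k-means (standing in for QT clust, which the text names; the data, cluster count, and iteration budget are all illustrative):

```python
import numpy as np

def kmeans_centers(X, k, iters=20, seed=0):
    # Minimal k-means: returns cluster centers and point-to-cluster labels
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated synthetic groups of "expression profiles"
X = np.vstack([np.zeros((5, 3)), np.full((5, 3), 10.0)])
centers, labels = kmeans_centers(X, k=2)
# The centers (rather than the individual points) can now be fed to SVM-RFE
```

Replacing redundant points by their cluster center is what lets SVM-RFE then rank a much smaller set of "super features."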
Supervised clustering may be used to show specific clusters that have relevance for the specific knowledge being determined. For example, in analysis of gene expression data for diagnosis of colon cancer, a very large cluster of genes has been found that contained muscle genes that may be related to tissue composition and may not be relevant to the cancer vs. normal separation. Thus, these genes are good candidates for elimination from consideration as having little bearing on the diagnosis or prognosis for colon cancer.
Feature Selection: The problem of selecting a small amount of data from a large data source, such as a gene subset from a microarray, is particularly well addressed by the methods, devices and systems described herein. Previous attempts to address this problem used correlation techniques, i.e., assigning a coefficient to the strength of association between variables. In examining genetic data to find determinative genes, these methods eliminate gene redundancy automatically and yield better and more compact gene subsets. The methods, devices and systems described herein can be used with publicly available data to find relevant answers, such as genes determinative of a cancer diagnosis, or with specifically generated data.
The score of a feature is a quantity that measures the relevance or usefulness of that feature (or feature subset), with a larger score indicating that the feature is more useful or relevant. The problem of feature selection can only be well defined in light of the purpose of selecting a subset of features. Examples of feature selection problems that differ in their purpose include designing a diagnostic test that is economically viable. In this case, one may wish to find the smallest number of features that provides the smallest prediction error, or provides a prediction error less than a specified threshold. Another example is that of finding good candidate drug targets. The two examples differ in a number of ways.
In diagnosis and prognosis problems, the predictor cannot be dissociated from the problem because the ultimate goal is to provide a good predictor. One can refer to the usefulness of a subset of features to build a good predictor. The expected value of prediction error (the prediction error computed over an infinite number of test samples) would be a natural choice to derive a score. One problem is to obtain an estimate of the expected value of the prediction error of good precision by using only the available data. Another problem is that it is usually computationally impractical to build and test all the predictors corresponding to all possible subsets of features. As a result of these constraints, one typically resorts to use of sub-optimal scores in the search of good feature subsets.
In drug target selection, the predictor is only used to substitute for the biological organism under study. For example, choosing subsets of genes and building new predictors are ways of substituting computer experiments for laboratory experiments that knock out genes and observe the consequence on the phenotype. The goal of target selection is to determine which feature(s) have the greatest impact on the health of the patient. The predictor itself is not going to be used. One refers to the "relevance" of the feature(s) with respect to the condition or phenotype under study. It may be a good idea to score features using multiple predictors and to use a combined score to select features. Also, in diagnosis and prognosis, correlated features may be substituted for one another; whether the correlations reflect causal relationships is not significant there. On the other hand, in target selection, it is much more desirable to select the feature that is at the source of a cascade of events as opposed to a feature that is further down the chain. For these reasons, designing a good score for target selection can be a complex problem. In order to compare scores obtained from a number of different sources, and to allow simple score arithmetic, it is useful to normalize the scores. (See the above discussion of pre-processing.) The ranking obtained with a given score is not affected by applying any monotonically increasing function. This includes exponentiation, multiplication or division by a positive constant, and addition or subtraction of a constant. Thus, a wide variety of normalization schemes may be applied.
As an example, the following considers conversion of scores into a quantity that can be interpreted as a probability or a degree of belief that a given feature or feature subset is "good". Assume that a given method generated scores for a family of subsets of features. Such family may include: all single features, all feature pairs, all possible subsets of features. Converting a score to a probability-like quantity may include exponentiation (to make the score positive), and normalization by dividing by the sum of all the scores in the family.
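The exponentiate-and-normalize conversion described above (in effect, a softmax over the family of scores) can be sketched as follows; the raw scores are hypothetical:

```python
import numpy as np

# Hypothetical raw scores for a family of candidate feature subsets
scores = np.array([1.2, -0.5, 0.3, 2.0])

# Exponentiate (making every score positive), then divide by the family sum
p = np.exp(scores - scores.max())   # subtracting the max improves numerical stability
p = p / p.sum()                     # p now sums to one and preserves the ranking
```

Because exponentiation is monotonically increasing, the ranking of subsets is unchanged; the values simply become interpretable as degrees of belief.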
In the following, P(f1, f2, …, fn) denotes the score normalized as a probability for the feature subset (f1, f2, …, fn). Then, P(f1 | f2, …, fn) is the score, normalized as a probability, of feature f1 given that features (f2, …, fn) have already been selected.
Scores converted to probabilities can be combined according to the chain rule, P(f1, f2, …, fn) = P(f1 | f2, …, fn) P(f2, …, fn), or Bayes' rule, P(f1, f2, …, fn) = Σi P(f1, f2, …, fn | Ci) P(Ci), where the Ci could be various means of scoring using different experimental data or evidence and P(Ci) would be weights measuring the reliability of each data source (Σi P(Ci) = 1).
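The Bayes-rule combination reduces to a reliability-weighted average of per-source scores. A minimal sketch (the per-source probabilities and weights are hypothetical):

```python
import numpy as np

# Score of one feature subset under three evidence sources C1..C3,
# each already normalized as a probability (values are hypothetical)
p_given_c = np.array([0.60, 0.20, 0.45])

# Reliability weights P(Ci); they must sum to one
p_c = np.array([0.5, 0.3, 0.2])

# Bayes-rule combination: P(subset) = sum_i P(subset | Ci) * P(Ci)
p_combined = float(np.dot(p_given_c, p_c))
```

A source deemed unreliable (small P(Ci)) contributes little to the combined belief, which is exactly the intended effect of the weights.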
Scoring a large number of feature subsets is often computationally impractical. One can attempt to estimate the score of a larger subset of features from the scores of smaller subsets of features by making independence assumptions, i.e., P(f1, f2, …, fn) = P(f1) P(f2) … P(fn). Or, if there are scores for pairs of features, scores for triplets can be derived by replacing the exact identity P(f1, f2, f3) = P(f1, f2 | f3) P(f3) = P(f1, f3 | f2) P(f2) = P(f2, f3 | f1) P(f1) with the average P(f1, f2, f3) = (1/3)(P(f1, f2 | f3) P(f3) + P(f1, f3 | f2) P(f2) + P(f2, f3 | f1) P(f1)). Other scores for large numbers of features can be derived from the scores of small numbers of features in a similar manner.
One of the simplest structures for representing alternative choices of features is a ranked list. The features are sorted according to their scores such that the most promising feature according to that score is ranked top and the least promising features are ranked lowest. The opposite order is also possible. Scores include: the prediction success rate of a classifier built using a single feature; the absolute value of the weights of a linear classifier; the value of a correlation coefficient between the feature vector and the target vector consisting of (+1) and (-1) values corresponding to class labels A or B (in a two-class problem), such as the Pearson correlation coefficient; and the value of the Fisher criterion in a multi-class problem. It is often desirable to select a subset of features that complement each other to provide the best prediction accuracy. Using a ranked list of features, one can rank subsets of features. For example, using scores normalized as probabilities and making feature-independence assumptions, the above-described chain rule can be applied.
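One of the scores named above, the Pearson correlation with a (+1)/(-1) target, can be used to build a ranked list as follows; the target and features are small hand-built examples:

```python
import numpy as np

# Two-class target of (+1)/(-1) labels and three hand-built features
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
X = np.column_stack([
    [3.0, 1.0, 2.0, 2.0, 2.0, 1.0, 3.0, 2.0],   # unrelated to y
    [5.0, 6.0, 5.0, 6.0, 1.0, 2.0, 1.0, 2.0],   # tracks y strongly
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],   # monotone trend, anti-correlated with y
])

# Score each feature by |Pearson correlation| with the target, best feature first
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(-scores)
```

The absolute value matters: a strongly anti-correlated feature (like the third one here) is just as informative for separation as a positively correlated one.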
Independence assumptions are often incorrect. Methods of forward feature selection or backward elimination (including RFE, discussed below) allow the construction of nested subsets of complementary features F1 ⊂ F2 ⊂ … ⊂ Fm using a greedy search algorithm that progressively adds or removes features. For scores normalized as probabilities, the chain rule applies. For example, assume F1 = {fa} and F2 = {fa, fb}. The relationship P(F2) = P(F1) P(F2 | F1) = P(fa) P(fa, fb | fa) describes a forward selection scheme in which fb is added once fa has been selected, with probability P(fa, fb | fa) of making a good choice. Similarly, if it is assumed that Fm-1 = {fa, fb, …, fj} and Fm = {fa, fb, …, fj, fk}, then P(Fm-1) = P(Fm) P(Fm-1 | Fm) = P(fa, fb, …, fj, fk) P(fa, fb, …, fj | fa, fb, …, fj, fk). This can be read as a backward elimination scheme: eliminate fk when the current subset is {fa, fb, …, fj, fk}, with probability P(fa, fb, …, fj | fa, fb, …, fj, fk) of making a good choice.
Alternatively, one can add or remove more than one feature at a time. (See the detailed description of RFE below.) As an example, RFE-SVM is a backward elimination procedure that uses, as the score for ranking the next feature to be eliminated, a quantity that approximates the difference in success rate S(Fm-1) - S(Fm). Scores are additive and probabilities multiplicative, so by using exponentiation and normalization, the score difference can be mapped to the conditional probability P(Fm-1 | Fm). P(Fm | Fm-1) = 1 because of the backward elimination procedure. Since P(Fm) is proportional to exp(Sm),
P(Fm-1 | Fm) = P(Fm | Fm-1) P(Fm-1)/P(Fm) = exp(Sm-1 - Sm). In a manner similar to that described for ranked lists of subsets of equivalent features, nested subsets can be constructed of complementary subsets of equivalent features. Clustering can be used to create "super features" (cluster centers). The nested subsets of super features define nested subsets of subsets of equivalent features, i.e., the corresponding clusters.
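The backward-elimination loop of RFE can be sketched as follows. RFE-SVM itself ranks features by the weights of a trained SVM; here a ridge-regularized least-squares fit stands in for the SVM, and the data, regularization strength `lam`, and helper name `rfe_ranking` are all illustrative:

```python
import numpy as np

def rfe_ranking(X, y, lam=1e-2):
    # Backward elimination: fit a linear classifier on the remaining features,
    # drop the feature with the smallest squared weight, repeat until empty.
    remaining = list(range(X.shape[1]))
    eliminated = []                    # least important features first
    while remaining:
        Xr = X[:, remaining]
        # Ridge-regularized least-squares fit (stand-in for the SVM weights)
        w = np.linalg.solve(Xr.T @ Xr + lam * np.eye(len(remaining)), Xr.T @ y)
        worst = int(np.argmin(w ** 2))
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]            # most important feature first

# Toy data: feature 0 carries the class signal, feature 1 is an unrelated pattern
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
X = np.column_stack([y * 2.0, [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]])
order = rfe_ranking(X, y)
```

Because the classifier is retrained after every elimination, the surviving features are ranked as a complementary set rather than individually, which is the point the text makes about nested subsets.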
In an alternative method, nested subsets of complementary features can be constructed using a cardinality increment of one. The first few subsets are kept, then the remaining features are aggregated to the features in the nested subset. In other words, the features in the nested subset are used as cluster centers, then clusters are formed around those centers with the remaining features. One application of such structures is the selection of alternate subsets of complementary features by replacing the cluster centers in a subset of cluster centers with one of the cluster members.
Nested subsets of complementary subsets of equivalent features may produce alternate complementary subsets of features that are sub-optimal. Trees can provide a better alternative for representing a large number of alternate nested subsets of complementary features. Each node of the tree is labeled with a feature and a feature-subset score. The children of the root node represent alternate choices for the first feature. The children of the children of the root are alternate choices for the second feature, etc. The path from the root node to a given node is a feature subset, the score of which is attached to that node. The score for siblings is the score of the subset including the child feature and all its ancestors. For scores normalized to probabilities, sorting of the siblings is done according to the joint probability P(ancestors, child). Given that siblings share the same ancestors, such sibling ranking also corresponds to a ranking according to P(child | ancestors). This provides a ranking of alternate subsets of features of the same size.
Trees can be built with forward selection algorithms, backward selection algorithms, exhaustive feature-subset evaluation or other search strategies. Trees are structures that generalize both ranked lists and nested subsets of features. A tree of depth one is a ranked list (of all children of the root). A tree that has only one branch defines nested subsets of features. One can also build trees of super-features (cluster centers) and thereby obtain a structure that contains multiple alternative nested subsets of subsets of equivalent features. Another variant is to build a tree using only the top features of a ranked list. Subsequently, the eliminated features can be aggregated to the nodes of the tree they most resemble. Other graphs, particularly other kinds of directed acyclic graphs, may also be relevant for describing subsets of features. For example, Bayesian networks have been used to describe relationships between genes.
For some features, e.g., genes, one can obtain patterns from various sources. Assume that one wishes to assess the relevance of certain genes with respect to a given disease. Gene scores (or gene-subset scores) can be derived from DNA microarray gene expression coefficients for a variety of diseased and normal patients. Other scores can be obtained from protein arrays, and still other scores can be obtained by correlating the citation of various genes with the given disease in published medical articles. In each case, a feature-subset data structure can be constructed. These structures can then be combined to select feature subsets based on the combined information.
WE CLAIM:
1. Our invention "DIMA-Dataset Discovery" describes an improved capability for designing, mapping, developing, training, validating and deploying discovery virtual avatars, the avatars embodying mathematical models to be used for discovery in documents and large data repositories. For example, an avatar may be constructed by machine learning and AI-based programming methods and processes, including by processing information related to what types of information analysts and investigators find useful in large data sets. An avatar may be deployed as an aid to human intuition in a wide range of analytical processes, such as those related to international and national security, enterprise management, data management, and query mapping (advertising, sales, marketing, product, promotions, placement, pricing, etc.), as well as dispute resolution (including litigation), forensic analysis, criminal, administrative, civil and private investigations, scientific investigations, research and development, and a wide range of others. Data elements from the source data may be presented, tracked, scored, rated, mapped or ranked based at least in part on identifiers within the data cluster relating to the super-set topic. The mathematical model is optimized based at least in part on a comparison of the scored data elements and, upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model is saved and stored as a computer-based discovery avatar parent. A second set of extracted data features is extracted from the source data that share a second attribute related to both the super-set topic and a subset topic.
2. According to claim 1, the invention describes an improved capability for designing, mapping, developing, training, validating and deploying discovery virtual avatars, the avatars embodying mathematical models to be used for documents and large data repositories.
3. According to claims 1 and 2, an avatar may be constructed by machine learning and AI-based programming methods and processes, including by processing information related to what types of information analysts and investigators find useful in large data sets.
4. According to claims 1, 2 and 3, an avatar may be deployed as an aid to human intuition in a wide range of analytical processes, such as those related to international and national security, enterprise management, data management, and query mapping (advertising, sales, marketing, product, promotions, placement, pricing, etc.).
5. According to claims 1 and 2, the invention also covers dispute resolution (including litigation), forensic analysis, criminal, administrative, civil and private investigations, scientific investigations, research and development, and a wide range of others, wherein data elements from the source data may be presented, tracked, scored, rated, mapped or ranked based at least in part on identifiers within the data cluster relating to the super-set topic.
6. According to claims 1, 2 and 4, the mathematical model is optimized based at least in part on a comparison of the scored data elements and, upon reaching a threshold of optimization, accuracy, quality, or merit, the optimized mathematical model is saved and stored as a computer-based discovery avatar parent.
7. According to claims 1, 2 and 5, a second set of extracted data features is extracted from the source data that share a second attribute related to both the super-set topic and a subset topic, wherein analyzing the extracted data features from the tokenized data source to identify the data cluster is based on k-means clustering or latent Dirichlet allocation (LDA) topic modeling.
8. According to claims 1, 2, 4 and 7, the method further comprises using a custom feature list to generate the extracted data features, wherein the data cluster is a first data cluster in a plurality of data clusters identified by analyzing the extracted data features, and the method further comprises selecting an object within each of the plurality of data clusters having a largest magnitude for presentation to the analyst.
9. According to claims 1, 2 and 5, the method further comprises creating the tokenized data source using white-space tokenization.
10. According to claims 1, 2, 5 and 8, the computer-based discovery avatar categorizes the tokenized data source based at least in part on the use of support vector machines.
Date: 25/8/2020 Dr. M. Shanmukhi (Professor) Nazia Tabassum (Assistant Professor) Dr. Raja boina Raja Kumar (Associate Professor) Dr. Attili Venkata Ramana (Associate Professor) Dr. Annaluri Sreenivasa Rao (Assistant Professor) N. Sree Divya (Assistant Professor) K. Harinath (Assistant Professor) Dr. Rama Reddy T (Professor) Dr. Venkata Rajesh Masina (Associate Professor) Dr. N Chandra Sekhar Reddy (Professor & Head)
FOR Dr. M. Shanmukhi (Professor) NaziaTabassum (Assistant Professor) Dr. Raja boina Raja Kumar (Associate Professor) Dr. Attili Venkata Ramana (Associate Professor) Dr. Annaluri Sreenivasa Rao (Assistant Professor) N. Sree Divya (Assistant Professor) K. Harinath (Assistant Professor) Dr. Rama Reddy T (Professor) 26 Aug 2020
Dr. Venkata Rajesh Masina (Associate Professor) Dr. N Chandra Sekhar Reddy (Professor & Head) TOTAL NO OF SHEET: 05 NO OF FIG.:05 2020101987
FIG. 1: IS A SIMPLIFIED DIAGRAM OF A CURIOSITY ENGINE METHOD AND SYSTEM FOR THE CREATION AND TRAINING OF DISCOVERY AVATARS.
FIG. 2: IS AN ILLUSTRATION OF DISCOVERY AVATAR DEVELOPMENT AND OPTIMIZATION.
FIG. 3: SHOWS STEPS FOR DEVELOPING, OPTIMIZING AND STORING A DISCOVERY AVATAR.
FIG. 4: IS AN ILLUSTRATION OF AVATAR-PARENT AND AVATAR-CHILD DEVELOPMENT AND OPTIMIZATION.
FIG. 5: IS AN ILLUSTRATION OF CROSS-TRAINING DISCOVERY AVATARS AND MATHEMATICAL MODELS ASSOCIATED WITH DISCOVERY AVATARS.
AU2020101987A 2020-08-26 2020-08-26 DIMA-Dataset Discovery: DATASET DISCOVERY IN DATA INVESTIGATIVE USING MACHINE LEARNING AND AI-BASED PROGRAMMING Ceased AU2020101987A4 (en)


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry