US20080228766A1 - Efficiently Compiling Co-associating Attributes - Google Patents



Publication number
US20080228766A1
Authority
US
United States
Prior art keywords
attribute
query
combinations
attributes
combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/031,671
Inventor
Andrew Alexander Kenedy
Charles Anthony Eldering
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Expanse Bioinformatics Inc
Original Assignee
EXPANSE NETWORKS Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US89523607P
Application filed by EXPANSE NETWORKS Inc
Priority to US12/031,671
Assigned to EXPANSE NETWORKS, INC.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KENEDY, ANDREW A.
Assigned to EXPANSE NETWORKS, INC.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ELDERING, CHARLES A.
Publication of US20080228766A1
Assigned to EXPANSE BIOINFORMATICS, INC.: CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: EXPANSE NETWORKS, INC.
Application status is Abandoned

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING; COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
              • G06F16/22 Indexing; Data structures therefor; Storage structures
                • G06F16/2282 Tablespace storage structures; Management thereof
              • G06F16/24 Querying
                • G06F16/245 Query processing
                  • G06F16/2457 Query processing with adaptation to user needs
                    • G06F16/24575 Query processing with adaptation to user needs using context
                    • G06F16/24578 Query processing with adaptation to user needs using ranking
              • G06F16/28 Databases characterised by their database models, e.g. relational or object models
                • G06F16/284 Relational databases
                  • G06F16/285 Clustering or classification
            • G06F16/90 Details of database functions independent of the retrieved data types
              • G06F16/95 Retrieval from the web
                • G06F16/951 Indexing; Web crawling techniques
                • G06F16/953 Querying, e.g. by the use of web search engines
                  • G06F16/9535 Search customisation based on user profiles and personalisation
                • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
          • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
            • G06F17/20 Handling natural language data
              • G06F17/27 Automatic analysis, e.g. parsing
          • G06F19/00 Digital computing or data processing equipment or methods, specially adapted for specific applications
            • G06F19/30 Medical informatics, i.e. computer-based analysis or dissemination of patient or disease data
              • G06F19/32 Medical data management, e.g. systems or protocols for archival or communication of medical images, computerised patient records or computerised general medical references
                • G06F19/324 Management of patient independent data, e.g. medical references in digital format
                  • G06F19/325 Medical practices, e.g. general treatment protocols
        • G06Q DATA PROCESSING SYSTEMS OR METHODS, SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES, NOT OTHERWISE PROVIDED FOR
          • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
            • G06Q40/08 Insurance, e.g. risk analysis or pensions
          • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
            • G06Q50/10 Services
              • G06Q50/22 Social work
      • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
        • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
          • G09B19/00 Teaching not covered by other main groups of this subclass
      • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
          • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
        • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
          • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
            • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
          • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
            • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
            • G16H40/60 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
              • G16H40/63 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for local operation
          • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
            • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
          • G16H70/00 ICT specially adapted for the handling or processing of medical references
            • G16H70/20 ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
            • G16H70/60 ICT specially adapted for the handling or processing of medical references relating to pathologies
    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L51/00 Arrangements for user-to-user messaging in packet-switching networks, e.g. e-mail or instant messages
            • H04L51/32 Messaging within social networks
          • H04L65/00 Network arrangements or protocols for real-time communications
            • H04L65/40 Services or applications
              • H04L65/403 Arrangements for multiparty communication, e.g. conference

Abstract

A method, software, database and system are presented in which attribute profiles of query-attribute-positive individuals and query-attribute-negative individuals are compared, and combinations of attributes that occur at a higher frequency in the group of query-attribute-positive individuals are identified and stored to generate a compilation of attribute combinations that co-associate with the query attribute (i.e., an attribute of interest). Several computationally efficient approaches for identifying the attribute combinations are incorporated.
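
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is a brute-force illustration only and does not represent the computationally efficient approaches the application refers to. The function name `co_associating_combinations`, the `max_size` cap, the dictionary layout of the profiles, and the sample attribute labels are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def co_associating_combinations(profiles, query, max_size=2):
    """Return combinations that are more frequent among query-positive profiles.

    profiles: dict mapping an individual's identifier to a set of attributes
              (hypothetical layout, not taken from the application).
    query:    the attribute of interest.
    max_size: largest combination size to enumerate (an assumed cap).
    """
    # Split individuals by whether their profile contains the query attribute.
    positive = [p - {query} for p in profiles.values() if query in p]
    negative = [p for p in profiles.values() if query not in p]

    def frequencies(group):
        # Fraction of profiles in the group that contain each combination.
        counts = Counter()
        for profile in group:
            for size in range(1, max_size + 1):
                for combo in combinations(sorted(profile), size):
                    counts[combo] += 1
        return {c: n / len(group) for c, n in counts.items()} if group else {}

    pos_freq = frequencies(positive)
    neg_freq = frequencies(negative)
    # Keep only combinations with a higher frequency in the positive group.
    return {
        combo: (freq, neg_freq.get(combo, 0.0))
        for combo, freq in pos_freq.items()
        if freq > neg_freq.get(combo, 0.0)
    }


# Hypothetical data: "Q" is the query attribute.
profiles = {
    1: {"A", "B", "Q"},
    2: {"A", "B", "Q"},
    3: {"A", "C"},
    4: {"B", "C"},
}
result = co_associating_combinations(profiles, "Q")
```

In this toy dataset the pair `("A", "B")` occurs in every query-positive profile and in no query-negative profile, whereas `"A"` and `"B"` individually each occur in half of the negative group. The combination therefore separates the groups more sharply than either attribute alone.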

Description

  • This application claims priority to U.S. Provisional Application Ser. No. 60/895,236, which was filed on Mar. 16, 2007, and which is incorporated herein by reference in its entirety.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The following detailed description will be better understood when read in conjunction with the appended drawings, in which there is shown one or more of the multiple embodiments of the present invention. It should be understood, however, that the various embodiments are not limited to the precise arrangements and instrumentalities shown in the drawings.
  • FIG. 1 illustrates attribute categories and their relationships;
  • FIG. 2 illustrates a system diagram including data formatting, comparison, and statistical computation engines and dataset input/output for a method of creating an attribute combinations database;
  • FIG. 3 illustrates examples of genetic attributes;
  • FIG. 4 illustrates examples of epigenetic attributes;
  • FIG. 5 illustrates representative physical attributes classes;
  • FIG. 6 illustrates representative situational attributes classes;
  • FIG. 7 illustrates representative behavioral attributes classes;
  • FIG. 8 illustrates an attribute determination system;
  • FIG. 9 illustrates an example of expansion and reformatting of attributes;
  • FIG. 10 illustrates the advantage of identifying attribute combinations in a two attribute example;
  • FIG. 11 illustrates the advantage of identifying attribute combinations in a three attribute example;
  • FIG. 12 illustrates an example of statistical measures & formulas useful for the methods;
  • FIG. 13 illustrates a flow chart for a method of creating an attribute combinations database;
  • FIG. 14 illustrates a 1st dataset example for a method of creating an attribute combinations database;
  • FIG. 15 illustrates 2nd dataset and combinations table examples for a method of creating an attribute combinations database;
  • FIG. 16 illustrates a 3rd dataset example for a method of creating an attribute combinations database;
  • FIG. 17 illustrates a 4th dataset example for a method of creating an attribute combinations database;
  • FIG. 18 illustrates a 4th dataset example for a method of creating an attribute combinations database;
  • FIG. 19 illustrates a flowchart for a method of identifying predisposing attribute combinations;
  • FIG. 20 illustrates a rank-ordered tabulated results example for a method of identifying predisposing attribute combinations;
  • FIG. 21 illustrates a flowchart for a method of predisposition prediction;
  • FIG. 22 illustrates 1st and 2nd dataset examples for a method of predisposition prediction;
  • FIG. 23 illustrates 3rd dataset and tabulated results examples for a method of predisposition prediction;
  • FIG. 24 illustrates a flowchart for a method of destiny modification;
  • FIG. 25 illustrates 1st dataset, 3rd dataset and tabulated results examples for destiny modification of individual #113;
  • FIG. 26 illustrates 1st dataset, 3rd dataset and tabulated results examples for destiny modification of individual #114;
  • FIG. 27 illustrates a flowchart for a method of predisposition modification;
  • FIG. 28 illustrates a flowchart for a method of genetic attribute analysis;
  • FIG. 29 illustrates 3rd dataset examples from a method of destiny modification for use in synergy discovery;
  • FIG. 30 illustrates one embodiment of a computing system on which the present method and system can be implemented; and
  • FIG. 31 illustrates a representative deployment diagram for an attribute determination system.
  • DETAILED DESCRIPTION
  • Disclosed herein are methods, computer systems, databases and software for identifying combinations of attributes associated with individuals that co-occur (i.e., co-associate, co-aggregate) with attributes of interest, such as specific disorders, behaviors and traits. Also disclosed are databases and database systems for creating and accessing databases describing those attributes and for performing analyses based on those attributes. The methods, computer systems and software are useful for identifying intricate combinations of attributes that predispose human beings toward having or developing specific disorders, behaviors and traits of interest, determining the level of predisposition of an individual toward such attributes, and revealing which attribute associations can be added or eliminated to effectively modify what may heretofore have been believed to be destiny. The methods, computer systems and software are also applicable to tissues and non-human organisms, as well as to identifying combinations of attributes that correlate with or cause behaviors and outcomes in complex non-living systems including molecules, electrical and mechanical systems and various devices and apparatus whose functionality is dependent on a multitude of attributes.
  • Previous methods have been largely unsuccessful in determining the complex combinations of attributes that predispose individuals to most disorders, behaviors and traits. The level of resolution afforded by the data typically used is too low, the number and types of attributes considered are too limited, and the sensitivity needed to detect low frequency, high complexity combinations is lacking. Being able to determine the complex combinations of attributes that predispose an individual to physical or behavioral disorders has clear implications for improving individualized diagnoses, choosing the most effective therapeutic regimens, making beneficial lifestyle changes that prevent disease and promote health, and reducing associated health care expenditures. It is also desirable to determine those combinations of attributes that promote certain behaviors and traits such as success in sports, music, school, leadership, career and relationships.
  • Advances in technology within the field of genetics now provide the ability to achieve maximum resolution of the entire genome. Discovery and characterization of epigenetic modifications—reversible chemical modifications of DNA and structural modification of chromatin that dramatically alter gene expression—has provided an additional level of information that may be altered due to environmental conditions, life experiences and aging. Along with a collection of diverse nongenetic attributes including physical, behavioral, situational and historical attributes associated with an organism, the present invention provides the ability to utilize the above information to enable prediction of the predisposition of an organism toward developing a specific attribute of interest provided in a query.
  • There are approximately 25,000 genes in the human genome. Approximately 1,000 of these genes are involved in monogenic disorders, which are disorders caused solely by the properties of a single gene. This collection of disorders represents less than two percent of all human disorders. The remaining 98 percent of human disorders, termed complex disorders, are caused by multiple genetic influences or a combination of multiple genetic and non-genetic influences that have yet to be determined because they have resisted current methods of discovery.
  • Previous methods using genetic information have suffered from either a lack of high resolution information, very limited coverage of total genomic information, or both. Genetic markers such as single nucleotide polymorphisms (SNPs) do not provide a complete picture of a gene's nucleotide sequence or the total genetic variability of the individual. The SNPs typically used occur at a frequency of at least 5% in the population. However, the majority of genetic variation that exists in the population occurs at frequencies below 1%. Furthermore, SNPs are spaced hundreds of nucleotides apart and do not account for genetic variation that occurs in the genetic sequence lying between, which is vastly more sequence than the single nucleotide position represented by an SNP. SNPs are typically located within gene coding regions and do not allow consideration of 98% of the 3 billion base pairs of genetic code in the human genome that does not encode gene sequences. Other markers such as STS, gene locus markers and chromosome loci markers also provide very low resolution and incomplete coverage of the genome. Complete and partial sequencing of an individual's genome provides the ability to incorporate that detailed information into the analysis of factors contributing toward expressed attributes.
  • Genomic influence on traits is now known to involve more than just the DNA nucleotide sequence of the genome. Regulation of expression of the genome can be influenced significantly by epigenetic modification of the genomic DNA and chromatin (3-dimensional genomic DNA with bound proteins). Termed the epigenome, this additional level of information can make genes in an individual's genome behave as if they were absent. Epigenetic modification can dramatically affect the expression of at least approximately 6% of all genes.
  • Epigenetic modification silences the activity of gene regulatory regions required to permit gene expression. Genes can undergo epigenetic silencing as a result of methylation of cytosines occurring in CpG dinucleotide motifs, and to a lesser extent by deacetylation of chromatin-associated histone proteins, which inhibits gene expression by creating 3-dimensional conformational changes in chromatin. Assays such as bisulfite sequencing, differential methyl hybridization using microarrays, methylation sensitive polymerase chain reaction, and mass spectrometry enable the detection of cytosine nucleotide methylation, while chromatin immunoprecipitation (ChIP) can be used to detect histone acetylation states of chromatin.
  • In one embodiment, epigenetic attributes are incorporated in the present invention to provide certain functionality. First, major mental disorders such as schizophrenia and bipolar mood disorder are thought to be caused by or at least greatly influenced by epigenetic imprinting of genes. Second, all epigenetic modification characterized to date is reversible in nature, allowing for the potential therapeutic manipulation of the epigenome to alter the course and occurrence of disease and certain behaviors. Third, because epigenetic modification of the genome occurs in response to experiences and stimuli encountered during prenatal and postnatal life, epigenetic data can help fill gaps resulting from unobtainable personal data, and reinforce or even substitute for unreliable self-reported data such as life experiences and environmental exposures.
  • In addition to genetic and epigenetic attributes, which can be referred to collectively as pangenetic attributes, numerous other attributes likely influence the development of traits and disorders. These other attributes, which can be referred to collectively as non-pangenetic attributes, can be categorized individually as physical, behavioral, or situational attributes. FIG. 1 displays one embodiment of the attribute categories and their interrelationships according to the present invention and illustrates that physical and behavioral attributes can be collectively equivalent to the broadest classical definition of phenotype, while situational attributes can be equivalent to those typically classified as environmental. In one embodiment, historical attributes can be viewed as a separate category containing a mixture of genetic, epigenetic, physical, behavioral and situational attributes that occurred in the past. Alternatively, historical attributes can be integrated within the genetic, epigenetic, physical, behavioral and situational categories provided they are made readily distinguishable from those attributes that describe the individual's current state. In one embodiment, the historical nature of an attribute is accounted for via a time stamp or other time based marker associated with the attribute. As such, there are no explicit historical attributes, but through use of time stamping, the time associated with the attribute can be used to make a determination as to whether the attribute is occurring in what would be considered the present, or if it has occurred in the past. Traditional demographic factors are typically a small subset of attributes derived from the phenotype and environmental categories and can be therefore represented within the physical, behavioral and situational categories.
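
The time-stamp embodiment described above can be sketched as a small data structure. This is an illustrative assumption rather than the application's implementation: the class name `TimedAttribute`, its fields, the helper `is_historical`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TimedAttribute:
    """An attribute descriptor with a time-based marker (hypothetical layout)."""
    name: str
    category: str                 # e.g. "genetic", "behavioral", "situational"
    start: date                   # when the attribute began to apply
    end: Optional[date] = None    # None means the attribute still applies


def is_historical(attribute: TimedAttribute, as_of: date) -> bool:
    """Return True if the attribute ended before the reference date."""
    return attribute.end is not None and attribute.end < as_of


# Hypothetical examples: one past attribute, one current attribute.
smoker = TimedAttribute("smoker", "behavioral", date(1995, 1, 1), date(2003, 6, 30))
resident = TimedAttribute("resides in ZIP 10001", "situational", date(2005, 1, 1))
reference_date = date(2008, 1, 1)
```

With this layout no separate historical category is needed: `is_historical(smoker, reference_date)` is true, while `is_historical(resident, reference_date)` is false, and both attributes keep their original behavioral or situational category.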
  • In the present invention the term ‘attributes’ rather than the term ‘factors’ is used since many of the entities are characteristics associated with an individual that may have no influence on the vast majority of their traits, behaviors and disorders. As such, there may be many instances during execution of the methods disclosed herein when a particular attribute does not act as a factor in determining predisposition. Nonetheless, every attribute remains a potentially important characteristic of the individual and may contribute to predisposition toward some other attribute or subset of attributes queried during subsequent or future implementation of the methods disclosed herein. In the present invention, the term ‘bioattribute’ can be used to refer to any attribute associated with a biological entity, such as an attribute associated with an organism or an attribute associated with a biologic molecule, for example. Therefore, even a numerical ZIP code, which is not itself a biological entity, can be a bioattribute when used to describe the residential location associated with a biological entity such as a person.
  • An individual possesses many associated attributes which may be collectively referred to as an ‘attribute profile’ associated with that individual. In one embodiment, an attribute profile can be considered as being comprised of the attributes that are present (i.e., occur) in that profile, as well as being comprised of the various combinations (i.e., combinations and subcombinations) of those attributes. The attribute profile of an individual is preferably provided to embodiments of the present invention as a dataset record whose association with the individual can be indicated by a unique identifier contained in the dataset record. An actual attribute of an individual can be represented by an attribute descriptor in attribute profiles, records, datasets, and databases. Herein, both actual attributes and attribute descriptors may be referred to simply as attributes. In one embodiment, statistical relationships and associations between attribute descriptors are a direct result of relationships and associations between actual attributes of an individual. In the present disclosure, the term ‘individual’ can refer to a singular group, person, organism, organ, tissue, cell, virus, molecule, thing, entity or state, wherein a state includes but is not limited to a state-of-being, an operational state or a status. Individuals, attribute profiles and attributes can be real and/or measurable, or they may be hypothetical and/or not directly observable.
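
As a rough illustration of an attribute profile supplied as a dataset record, the sketch below pairs a unique identifier with a set of attribute descriptors and enumerates every combination and subcombination of those attributes. The record layout, the key names `id` and `attributes`, and the function `attribute_combinations` are assumptions made for this example.

```python
from itertools import combinations


def attribute_combinations(attributes):
    """Yield every non-empty combination (and subcombination) of the attributes."""
    ordered = sorted(attributes)
    for size in range(1, len(ordered) + 1):
        yield from combinations(ordered, size)


# Hypothetical dataset record: a unique identifier plus attribute descriptors.
record = {"id": 1, "attributes": {"A", "B", "C"}}
combos = list(attribute_combinations(record["attributes"]))
```

A profile with n attributes contains 2^n - 1 non-empty combinations, so the three attributes here yield seven. That exponential growth is the reason exhaustive enumeration becomes impractical for large profiles.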
  • In one embodiment the present invention can be used to discover combinations of attributes regardless of number or type, in a population of any size, that cause predisposition to an attribute of interest. In doing so, this embodiment also has the ability to provide a list of attributes one can add or subtract from an existing profile of attributes in order to respectively increase or decrease the strength of predisposition toward the attribute of interest. The ability to accurately detect predisposing attribute combinations naturally benefits from being supplied with datasets representing large numbers of individuals and having a large number and variety of attributes for each. Nevertheless, the present invention will function properly with a minimal number of individuals and attributes. One embodiment of the present invention can be used to detect not only attributes that have a direct (causal) effect on an attribute of interest, but also those attributes that do not have a direct effect such as instrumental variables (i.e., correlative attributes), which are attributes that correlate with and can be used to predict predisposition for the attribute of interest but are not causal. For simplicity of terminology, both types of attributes are referred to herein as predisposing attributes, or simply attributes, that contribute toward predisposition toward the attribute of interest, regardless of whether the contribution or correlation is direct or indirect.
  • It is beneficial, but not necessary, in most instances, that the individuals whose data is supplied for the method be representative of the individual or population of individuals for which the predictions are desired. In a preferred embodiment, the attribute categories collectively encompass all potential attributes of an individual. Each attribute of an individual can be appropriately placed in one or more attribute categories of the methods, system and software of the invention. Attributes and the various categories of attributes can be defined as follows:
      • a) attribute: a quality, trait, characteristic, relationship, property, factor or object associated with or possessed by an individual;
      • b) genetic attribute: any genome, genotype, haplotype, chromatin, chromosome, chromosome locus, chromosomal material, deoxyribonucleic acid (DNA), allele, gene, gene cluster, gene locus, gene polymorphism, gene mutation, gene marker, nucleotide, single nucleotide polymorphism (SNP), restriction fragment length polymorphism (RFLP), variable tandem repeat (VTR), genetic marker, sequence marker, sequence tagged site (STS), plasmid, transcription unit, transcription product, ribonucleic acid (RNA), and copy DNA (cDNA), including the nucleotide sequence and encoded amino acid sequence of any of the above;
      • c) epigenetic attribute: any feature of the genetic material—all genomic, vector and plasmid DNA, and chromatin—that affects gene expression in a manner that is heritable during somatic cell divisions and sometimes heritable in germline transmission, but that is nonmutational to the DNA sequence and is therefore fundamentally reversible, including but not limited to methylation of DNA nucleotides and acetylation of chromatin-associated histone proteins;
      • d) pangenetic attribute: any genetic or epigenetic attribute;
      • e) physical attribute: any material quality, trait, characteristic, property or factor of an individual present at the atomic, molecular, cellular, tissue, organ or organism level, excluding genetic and epigenetic attributes;
      • f) behavioral attribute: any singular, periodic, or aperiodic response, action or habit of an individual to internal or external stimuli, including but not limited to an action, reflex, emotion or psychological state that is controlled or created by the nervous system on either a conscious or subconscious level;
      • g) situational attribute: any object, condition, influence, or milieu that surrounds, impacts or contacts an individual; and
      • h) historical attribute: any genetic, epigenetic, physical, behavioral or situational attribute that was associated with or possessed by an individual in the past. As such, the historical attribute refers to a past state of the individual and may no longer describe the current state.
  • The methods, systems, software, and databases disclosed herein apply to and are suitable for use with not only humans, but for other organisms as well. The methods, systems, software and databases may also be used for applications that consider attribute identification, predisposition potential and destiny modification for organs, tissues, individual cells, and viruses both in vitro and in vivo. For example, the methods can be applied to behavior modification of individual cells being grown and studied in a laboratory incubator by providing pangenetic attributes of the cells, physical attributes of the cells such as size, shape and surface receptor densities, and situational attributes of the cells such as levels of oxygen and carbon dioxide in the incubator, temperature of the incubator, and levels of glucose and other nutrients in the liquid growth medium. Using these and other attributes, the methods, systems, software and databases can then be used to predict predisposition of the cells for such characteristics as susceptibility to infection by viruses, general growth rate, morphology, and differentiation potential. The methods, systems, software, and databases disclosed herein can also be applied to complex non-living systems to, for example, predict the behavior of molecules or the performance of electrical devices or machinery subject to a large number of variables.
  • FIG. 2 illustrates system components corresponding to one embodiment of a method, system, software, and databases for compiling predisposing attribute combinations. Attributes can be stored in the various datasets of the system. In one embodiment, 1st dataset 200 is a raw dataset of attributes that may be converted and expanded by conversion/formatting engine 220 into a more versatile format and stored in expanded 1st dataset 202. Comparison engine 222 can perform a comparison between attributes from records of the 1st dataset 200 or expanded 1st dataset 202 to determine candidate predisposing attributes which are then stored in 2nd dataset 204. Comparison engine 222 can tabulate a list of all possible combinations of the candidate attributes and then perform a comparison of those combinations with attributes contained within individual records of 1st dataset 200 or expanded 1st dataset 202. Comparison engine 222 can store those combinations that are found to occur and meet certain selection criteria in 3rd dataset 206 along with a numerical frequency of occurrence obtained as a count during the comparison. Statistical computation engine 224 can perform statistical computations using the numerical frequencies of occurrence to obtain results (values) for strength of association between attributes and attribute combinations and then store those results in 3rd dataset 206. Statistical computation engine 224, alone or in conjunction with comparison engine 222, can create a 4th dataset 208 containing attributes and attribute combinations that meet a minimum or maximum statistical requirement by applying a numerical or statistical filter to the numerical frequencies of occurrence or the values for strength of association stored in 3rd dataset 206. Although represented as a system and engines, the system and engines can be considered subsystems of a larger system, and as such referred to as subsystems. Such subsystems may be implemented as sections of code, objects, or classes of objects within a single system, or may be separate hardware and software platforms which are integrated with other subsystems to form the final system.
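  • By way of illustration only, the flow of data between the datasets of FIG. 2 can be sketched as follows. The sketch is a simplified assumption rather than the disclosed implementation: the function names, the toy records, the use of sets to represent attribute records, and the limit on combination size are all hypothetical, every attribute present in a record is treated as a candidate, and only a frequency filter (not a strength-of-association computation) is shown.

```python
from collections import Counter
from itertools import combinations


def candidate_attributes(records):
    """Collect every attribute seen in any record (stand-in for the 2nd dataset)."""
    return sorted(set().union(*records))


def count_combinations(records, candidates, max_size=2):
    """Count the records containing each candidate combination (stand-in for the 3rd dataset)."""
    counts = Counter()
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            n = sum(1 for rec in records if set(combo) <= rec)
            if n > 0:  # store only combinations that are found to occur
                counts[combo] = n
    return counts


def filter_by_frequency(counts, min_count):
    """Apply a numerical filter to the counts (stand-in for the 4th dataset)."""
    return {combo: n for combo, n in counts.items() if n >= min_count}


# Hypothetical attribute records for three individuals.
records = [
    {"smoker", "senior", "diabetes"},
    {"smoker", "senior"},
    {"senior"},
]
candidates = candidate_attributes(records)
counts = count_combinations(records, candidates)
frequent = filter_by_frequency(counts, min_count=2)
```

  • In this sketch the enumeration of all combinations grows rapidly with the number of candidates, which is why the hypothetical `max_size` parameter bounds the combination size.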
  • FIGS. 3A and 3B show a representative form for genetic attributes as DNA nucleotide sequence with each nucleotide position associated with a numerical identifier. In this form, each nucleotide is treated as an individual genetic attribute, thus providing maximum resolution of the genomic information of an individual. FIG. 3A depicts a portion of the known gene sequence for the HTR2A gene for two individuals having a nucleotide difference at nucleotide sequence position number 102. Comparing known genes simplifies the task of properly phasing nucleotide sequence comparisons. However, for comparison of non-gene sequences, due to the presence of insertions and deletions of varying size in the genome of one individual versus another, markers such as STS sequences can be used to allow for a proper in-phase comparison of the DNA sequences between different individuals. FIG. 3B shows genomic DNA plus-strand sequence for two individuals beginning at the STS#68777 forward primer which provides a known location of the sequence within the genome and facilitates phasing of the sequence with other sequences from that region of the genome during sequence comparison.
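  • As a minimal sketch of an in-phase comparison of the kind described above, two sequences that begin at the same known position can be compared nucleotide by nucleotide, with each position carrying its numerical identifier. The function name and the short sequences below are hypothetical and are not taken from FIG. 3A or FIG. 3B; the sketch also assumes the sequences are already in phase and of equal length, so insertions and deletions are not handled.

```python
def differing_positions(seq_a, seq_b, start=1):
    """Return the numbered positions at which two in-phase sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be in phase and of equal length")
    return [start + i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical sequences for two individuals, numbered from position 100.
individual_1 = "GATCCG"
individual_2 = "GACCCG"
differences = differing_positions(individual_1, individual_2, start=100)
```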
  • Conversion/formatting engine 220 of FIG. 2 can be used in conjunction with comparison engine 222 to locate and number the STS marker positions within the sequence data and store the resulting data in expanded 1st dataset 202. In one embodiment, comparison engine 222 has the ability to recognize strings of nucleotides with a word size large enough to enable accurately phased comparison of individual nucleotides in the span between marker positions. This function is also valuable in comparing known gene sequences. Nucleotide sequence comparisons in the present invention can also involve transcribed sequences in the form of mRNA, tRNA, rRNA, and cDNA sequences which all derive from genomic DNA sequence and are handled in the same manner as nucleotide sequences of known genes.
  • FIGS. 3C and 3D show two other examples of genetic attributes that may be compared in one embodiment of the present invention and the format they may take. Although not preferred because of the relatively small amount of information provided, SNP polymorphisms (FIG. 3C) and allele identity (FIG. 3D) can be processed by one or more of the methods herein to provide a limited comparison of the genetic content of individuals.
  • FIGS. 4A and 4B show examples of epigenetic data that can be compared, the preferred epigenetic attributes being methylation site data. FIG. 4A represents a format of methylation data for hypothetical Gene X for two individuals, where each methylation site (methylation variable position) is distinguishable by a unique alphanumeric identifier. The identifier may be further associated with a specific gene, site or chromosomal locus of the genome. In this embodiment, the methylation status at each site is an attribute that can have either of two values: methylated (M) or unmethylated (U). Other epigenetic data and representations of epigenetic data can be used to perform the methods disclosed herein, and to construct the systems, software and databases disclosed herein, as will be understood by one skilled in the art.
  • As shown in FIG. 4B, an alternative way to organize epigenetic methylation data is to append it directly to the corresponding genetic sequence attribute dataset as methylation status at each candidate CpG dinucleotide occurring in that genomic nucleotide sequence, in this example for hypothetical Gene Z for two individuals. The advantage of this format is that it inherently includes chromosome, gene and nucleotide position information. In this format, which is the most complete and informative format for the raw data, the epigenetic data can be extracted and converted to another format at any time. Both formats (that of FIG. 4A as well as that of FIG. 4B) provide the same resolution of methylation data, but it is preferable to adhere to one format in order to facilitate comparison of epigenetic data between different individuals. Regarding either data format, in instances where an individual is completely lacking a methylation site due to a deletion or mutation of the corresponding CpG dinucleotide, the corresponding epigenetic attribute value should be omitted (i.e., assigned a null).
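  • The handling of methylation attributes, including the null assignment for a missing site, might be represented as in the following sketch. The site identifiers, the dictionary layout, and the function name are assumptions made for illustration; `None` stands in for the null value of a site that an individual lacks.

```python
def shared_methylation(individual_a, individual_b):
    """Return the sites at which both individuals have the same non-null status."""
    return {
        site: status
        for site, status in individual_a.items()
        if status is not None and individual_b.get(site) == status
    }


# Hypothetical methylation data: 'M' = methylated, 'U' = unmethylated,
# None = site absent (e.g., the CpG dinucleotide is deleted or mutated).
person_1 = {"X1": "M", "X2": "U", "X3": None}
person_2 = {"X1": "M", "X2": "M", "X3": None}
shared = shared_methylation(person_1, person_2)
```

  • Under this assumption a null value never counts as a match, so two individuals who both lack a site are not treated as sharing an epigenetic attribute at that site.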
  • FIG. 5 illustrates representative classes of physical attributes as defined by physical attributes metaclass 500, which can include physical health class 510, basic physical class 520, and detailed physical class 530, for example. In one embodiment physical health class 510 includes a physical diagnoses subclass 510.1 that includes the following specific attributes (objects), which when positive indicate a known physical diagnosis:
  • 510.1.1 Diabetes
  • 510.1.2 Heart Disease
  • 510.1.3 Osteoporosis
  • 510.1.4 Stroke
  • 510.1.5 Cancer
      • 510.1.5.1 Prostate Cancer
      • 510.1.5.2 Breast Cancer
      • 510.1.5.3 Lung Cancer
      • 510.1.5.4 Colon Cancer
      • 510.1.5.5 Bladder Cancer
      • 510.1.5.6 Endometrial Cancer
      • 510.1.5.7 Non-Hodgkin's Lymphoma
      • 510.1.5.8 Ovarian Cancer
      • 510.1.5.9 Kidney Cancer
      • 510.1.5.10 Leukemia
      • 510.1.5.11 Cervical Cancer
      • 510.1.5.12 Pancreatic Cancer
      • 510.1.5.13 Skin melanoma
      • 510.1.5.14 Stomach Cancer
  • 510.1.6 Bronchitis
  • 510.1.7 Asthma
  • 510.1.8 Emphysema
  • The above classes and attributes represent the current condition of the individual. In the event that the individual (e.g. consumer 810) had a diagnosis for an ailment in the past, the same classification methodology can be applied, but with an “h” placed after the attribute number to denote a historical attribute. For example, 510.1.4 h can be used to create an attribute to indicate that the individual suffered a stroke in the past, as opposed to 510.1.4 which indicates the individual is currently suffering a stroke or the immediate aftereffects. Using this approach, historical classes and attributes mirroring the current classes and attributes can be created, as illustrated by historical physical health class 510 h, historical physical diagnoses class 510.1 h, historical basic physical class 520 h, historical height class 520.1 h, historical detailed physical class 530 h, and historical hormone levels class 530.1 h. In an alternate embodiment historical classes and historical attributes are not utilized. Rather, time stamping of the diagnosis or event is used. In this approach, an attribute of 510.1.4-05FEB03 would indicate that the individual suffered a stroke on Feb. 5, 2003. Alternate classification schemes and attribute classes/classifications can be used and will be understood by one of skill in the art. In one embodiment, time stamping of attributes is preferred in order to permit accurate determination of those attributes or attribute combinations that are associated with an attribute of interest (i.e., a query attribute or target attribute) in a causative or predictive relationship, or alternatively, those attributes or attribute combinations that are associated with an attribute of interest in a consequential or symptomatic relationship. In one embodiment, only attributes bearing a time stamp that predates the time stamp of the attribute of interest are processed by the methods. In another embodiment, only attributes bearing a time stamp that postdates the time stamp of the attribute of interest are processed by the methods. In another embodiment, both attributes that predate and attributes that postdate an attribute of interest are processed by the methods.
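  • The time-stamp embodiments described above can be sketched as a simple filter. The sketch assumes that every attribute is written in the form shown in the text (class index, hyphen, two-digit day, three-letter month, two-digit year, as in 510.1.4-05FEB03); the function names, the mode names, and the example profile are hypothetical, and parsing of the month abbreviation assumes an English locale.

```python
from datetime import datetime


def parse_attribute(code):
    """Split a code such as '510.1.4-05FEB03' into its class index and date."""
    index, stamp = code.rsplit("-", 1)
    return index, datetime.strptime(stamp, "%d%b%y").date()


def select_by_time(codes, query_code, mode="before"):
    """Keep attributes that predate, postdate, or simply differ from the query attribute."""
    _, query_date = parse_attribute(query_code)
    selected = []
    for code in codes:
        if code == query_code:
            continue  # the attribute of interest itself is not a predictor
        _, date = parse_attribute(code)
        if (mode == "before" and date < query_date) \
                or (mode == "after" and date > query_date) \
                or mode == "both":
            selected.append(code)
    return selected


# Hypothetical time-stamped profile; 510.1.4 (stroke) is the attribute of interest.
profile = ["720.4-10JAN99", "510.1.4-05FEB03", "610.2-20MAR04"]
query = "510.1.4-05FEB03"
predictive = select_by_time(profile, query, mode="before")
consequential = select_by_time(profile, query, mode="after")
```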
  • As further shown in FIG. 5, physical prognoses subclass 510.2 can contain attributes related to clinical forecasting of the course and outcome of disease and chances for recovery. Basic physical class 520 can include the attributes age 520.1, sex 520.2, height 520.3, weight 520.4, and ethnicity 520.5, whose values provide basic physical information about the individual. Hormone levels 530.1 and strength/endurance 530.4 are examples of attribute subclasses within detailed physical class 530. Hormone levels 530.1 can include attributes for testosterone level, estrogen level, progesterone level, thyroid hormone level, insulin level, pituitary hormone level, and growth hormone level, for example. Strength/endurance 530.4 can include attributes for various weight lifting capabilities, stamina, running distance and times, and heart rates under various types of physical stress, for example. Blood sugar level 530.2, blood pressure 530.3 and body mass index 530.5 are examples of attributes whose values provide detailed physical information about the individual. Historical physical health class 510 h, historical basic physical class 520 h and historical detailed physical class 530 h are examples of historical attribute classes. Historical physical health class 510 h can include historical attribute subclasses such as historical physical diagnoses class 510.1 h which would include attributes for past physical diagnoses of various diseases and physical health conditions which may or may not be representative of the individual's current health state. Historical basic physical class 520 h can include attributes such as historical height class 520.1 h which can contain heights measured at particular ages. Historical detailed physical class 530 h can include attributes and attribute classes such as the historical hormone levels class 530.1 h which would include attributes for various hormone levels measured at various time points in the past.
  • In one embodiment, the classes and indexing illustrated in FIG. 5 and disclosed above can be matched to health insurance information such as health insurance codes, such that information collected by health care professionals (such as clinician 820 of FIG. 8, which can be a physician, nurse, nurse practitioner or other health care professional) can be directly incorporated as attribute data. In this embodiment, the health insurance database can directly form part of the attribute database, such as one which can be constructed using the classes of FIG. 5.
  • FIG. 6 illustrates classes of situational attributes as defined by situational attributes metaclass 600, which in one embodiment can include medical class 610, exposures class 620, and financial class 630, for example. In one embodiment, medical class 610 can include treatments subclass 610.1 and medications subclass 610.2; exposures class 620 can include environmental exposures subclass 620.1, occupational exposures subclass 620.2 and self-produced exposures 620.3; and financial class 630 can include assets subclass 630.1, debt subclass 630.2 and credit report subclass 630.3. Historical medical class 610 h can include historical treatments subclass 610.1 h, historical medications subclass 610.2 h, historical hospitalizations subclass 610.3 h and historical surgeries subclass 610.4 h. Other historical classes included within the situational attributes metaclass 600 can be historical exposures subclass 620 h, historical financial subclass 630 h, historical income history subclass 640 h, historical employment history subclass 650 h, historical marriage/partnerships subclass 660 h, and historical education subclass 670 h.
  • In one embodiment, commercial databases such as credit databases and databases containing purchase information (e.g. frequent shopper information) can be used either as the basis for extracting attributes for classes such as those in financial subclass 630 and historical financial subclass 630 h, or for direct mapping of the information in those databases to situational attributes. Similarly, accounting information such as that maintained by the consumer 810 of FIG. 8, or a representative of the consumer (e.g. the consumer's accountant) can also be incorporated, transformed, or mapped into the classes of attributes shown in FIG. 6.
  • Measurement of financial attributes such as those illustrated and described with respect to FIG. 6 allows financial attributes such as assets, debt, credit rating, income and historical income to be utilized in the methods, systems, software and databases described herein. In some instances, such financial attributes can be important with respect to a query attribute. Similarly, other situational attributes such as the number of marriages/partnerships, length of marriages/partnerships, number of jobs held, and income history can be important attributes and will be found to be related to certain query attributes. In one embodiment a significant number of attributes described in FIG. 6 are extracted from public or private databases, either directly or through manipulation, interpolation, or calculations based on the data in those databases.
  • FIG. 7 illustrates classes of behavioral attributes as defined by behavioral attributes metaclass 700, which in one embodiment can include mental health class 710, habits class 720, time usage class 730, mood/emotional state class 740, and intelligence quotient class 750, for example. In one embodiment, mental health class 710 can include mental/behavioral diagnoses subclass 710.1 and mental/behavioral prognoses subclass 710.2; habits class 720 can include diet subclass 720.1, exercise subclass 720.2, alcohol consumption subclass 720.3, substances usage subclass 720.4, and sexual activity subclass 720.5; and time usage class 730 can include work subclass 730.1, commute subclass 730.2, television subclass 730.3, exercise subclass 730.4 and sleep subclass 730.5. Behavioral attributes metaclass 700 can also include historical classes such as historical mental health class 710 h, historical habits 720 h, and historical time usage class 730 h.
  • As discussed with respect to FIGS. 5 and 6, in one embodiment, external databases such as health care provider databases, purchase records and credit histories, and time tracking systems can be used to supply the data which constitutes the attributes of FIG. 7. Also with respect to FIG. 7, classification systems used by mental health professionals, such as the classifications found in the DSM-IV, can be used directly, such that the attributes of mental health class 710 and historical mental health class 710 h have a direct correspondence to the DSM-IV. The classes and objects of the present invention, as described with respect to FIGS. 5, 6 and 7, can be implemented using a number of database architectures including, but not limited to, flat files, relational databases and object oriented databases.
  • Unified Modeling Language (“UML”) can be used to model and/or describe methods and systems and provide the basis for better understanding their functionality and internal operation as well as describing interfaces with external components, systems and people using standardized notation. When used herein, UML diagrams including, but not limited to, use case diagrams, class diagrams and activity diagrams, are meant to serve as an aid in describing the embodiments of the present invention but do not constrain implementation thereof to any particular hardware or software embodiments. Unless otherwise noted, the notation used with respect to the UML diagrams contained herein is consistent with the UML 2.0 specification or variants thereof and is understood by those skilled in the art.
  • FIG. 8 illustrates a use case diagram for an attribute determination system 800 which, in one embodiment, allows for the determination of attributes which are statistically relevant or related to a query attribute. Attribute determination system 800 allows for a consumer 810, clinician 820, and genetic database administrator 830 to interact, although the multiple roles may be filled by a single individual, to input attributes and query the system regarding which attributes are relevant to the specified query attribute. In a contribute genetic sample use case 840 a consumer 810 contributes a genetic sample.
  • In one embodiment this involves the contribution by consumer 810 of a swab of the inside of the cheek, a blood sample, or contribution of other biological specimen associated with consumer 810 from which genetic and epigenetic data can be obtained. In one embodiment, genetic database administrator 830 causes the genetic sample to be analyzed through a determine genetic and epigenetic attributes use case 850. Consumer 810 or clinician 820 may collect physical attributes through a describe physical attributes use case 842. Similarly, behavioral, situational, and historical attributes are collected from consumer 810 or clinician 820 via describe behavioral attributes use case 844, describe situational attributes use case 846, and describe historical attributes use case 848, respectively. Clinician 820 or consumer 810 can then enter a query attribute through receive query attribute use case 852. Attribute determination system 800 then, based on attributes of large query-attribute-positive and query-attribute-negative populations, determines which attributes and combinations of attributes, extending across the pangenetic (genetic/epigenetic), physical, behavioral, situational, and historical attribute categories, are statistically related to the query attribute. As previously discussed, and with respect to FIG. 1 and FIGS. 4-6, historical attributes can, in certain embodiments, be accounted for through the other categories of attributes. In this embodiment, describe historical attributes use case 848 is effectively accomplished through determine genetic and epigenetic attributes use case 850, describe physical attributes use case 842, describe behavioral attributes use case 844, and describe situational attributes use case 846.
  • With respect to the aforementioned method of collection, inaccuracies can occur, sometimes due to outright misrepresentations of the individual's habits. For example, it is not uncommon for patients to self-report alcohol consumption levels which are significantly below actual levels. This can occur even when a clinician/physician is involved, as the patient reports consumption levels to the clinician/physician that are significantly below their actual consumption levels. Similarly, it is not uncommon for an individual to over-report the amount of exercise they get.
  • In one embodiment, disparate sources of data including consumption data as derived from purchase records, data from blood and urine tests, and other observed characteristics are used to derive attributes such as those shown in FIGS. 5-7. By analyzing sets of disparate data, corrections to self-reported data can be made to produce more accurate determinations of relevant attributes. In one embodiment, heuristic rules are used to generate attribute data based on measured, rather than self-reported attributes. Heuristic rules are defined as rules which relate measurable (or accurately measurable) attributes to less measurable or less reliable attributes such as those from self-reported data. For example, an individual's recorded purchases including cigarette purchases can be combined with urine analysis or blood test results which measure nicotine levels or another tobacco related parameter and heuristic rules can be applied to estimate cigarette consumption level. As such, one or more heuristic rules, typically based on research which statistically links a variety of parameters, can be applied by data conversion/formatting engine 220 to the data representing the number of packs of cigarettes purchased by an individual or household, results of urine or blood tests, and other studied attributes, to derive an estimate of the extent to which the individual smokes.
  • In one embodiment, the heuristic rules take into account attributes such as household size and self-reported data to assist in the derivation of the desired attribute. For example, if purchase data is used in a heuristic rule, household size and even the number of self-reported smokers in the household, can be used to help determine actual levels of consumption of tobacco by the individual. In one embodiment, household members are tracked individually, and the heuristic rules provide for the ability to approximately assign consumption levels to different people in the household. Details such as individual brand usages or preferences may be used to help assign consumptions within the household. As such, in one embodiment the heuristic rules can be applied by data conversion/formatting engine 220 to a number of disparate pieces of data to assist in extracting one or more attributes.
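  • One possible form of such a heuristic rule is sketched below. The rule itself is an assumption invented for illustration and is not a rule disclosed herein: it divides household purchases evenly among the self-reported smokers and, because self-reported consumption may be understated, uses the larger of the purchase-based and self-reported figures. The function and parameter names are hypothetical.

```python
def estimate_packs_per_week(household_packs, smokers_in_household, self_reported_packs):
    """Estimate one individual's weekly cigarette consumption (illustrative rule only)."""
    if smokers_in_household < 1:
        raise ValueError("at least one smoker is required to assign purchases")
    # Assumption: purchases are shared evenly among the household's smokers.
    purchase_based = household_packs / smokers_in_household
    # Assumption: self-reports tend to understate, so take the larger estimate.
    return max(purchase_based, self_reported_packs)


# Hypothetical household: 14 packs purchased per week, 2 smokers,
# and an individual who self-reports 3 packs per week.
estimate = estimate_packs_per_week(14, 2, 3)
```

  • A fuller rule could weight the result by urine or blood test values or by brand preferences within the household, as described above; those inputs are omitted here because no specific weighting is given in the text.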
  • Physical, behavioral, situational and historical attribute data may be stored or processed in a manner that allows retention of maximum resolution and accuracy of the data while also allowing flexible comparison of the data so that important shared similarities between individuals are not overlooked. This can be important when processing narrow and extreme attribute values, or when using smaller populations of individuals where the reduced number of individuals makes the occurrence of identical matches of attributes rare. In these and other circumstances, flexible treatment and comparison of attributes can reveal predisposing attributes that are related to or legitimately derive from the original attribute values but have broader scope, lower resolution, and extended or compounded values compared to the original attributes. In one embodiment, attributes and attribute values can be qualitative (categorical) or quantitative (numerical). In another embodiment, attributes and attribute values can be discrete or continuous numerical values.
  • There are several ways flexible treatment and comparison of attributes can be accomplished. As shown in FIG. 2, one approach is to incorporate data conversion/formatting engine 220 which is able to create expanded 1st dataset 202 from 1st dataset 200. In one embodiment, 1st dataset 200 can comprise one or more primary attributes, or original attribute profiles containing primary attributes, and expanded 1st dataset 202 can comprise one or more secondary attributes, or expanded attribute profiles containing secondary attributes. A second approach is to incorporate functions into attribute comparison engine 222 that allow it to expand the original attribute data into additional values or ranges during the comparison process. This provides the functional equivalent of reformatting the original dataset without having to create and store the entire set of expanded attribute values.
  • In one embodiment, original attributes (primary attributes) can be expanded into one or more sets containing derived attributes (secondary attributes) having values, levels or degrees that are above, below, surrounding or including that of the original attributes. In one embodiment, original attributes can be used to derive attributes that are broader or narrower in scope than the original attributes. In one embodiment, two or more original attributes can be used in a computation (i.e., compounded) to derive one or more attributes that are related to the original attributes. As shown in FIG. 9A, a historical situational attribute indicating a time span of smoking, from age 25-27, and a historical behavioral attribute indicating a smoking habit, 10 packs per week, may be compounded to form a single value for the historical situational attribute of total smoking exposure to date, 1560 packs, as shown in FIG. 9B, by simply multiplying 156 weeks by 10 packs/week. Similar calculations enable the derivation of historical situational attributes such as total nicotine and total cigarette tar exposure based on known levels of nicotine and tar in the specific brand smoked, Marlboro as indicated by the cigarette brand attribute, multiplied by the total smoking exposure to date. In another example, a continuous numerical attribute, {time=5.213 seconds}, can be expanded to derive the discrete numerical attribute, {time=5 seconds}.
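  • The compounding and discretization examples above can be reproduced with the following sketch. The function names are hypothetical. Two assumptions are made: the age span 25-27 is treated as inclusive (three years of 52 weeks, giving the 156 weeks stated above), and the continuous value is discretized by rounding to the nearest integer, since the text does not say whether rounding or truncation is intended.

```python
WEEKS_PER_YEAR = 52


def total_smoking_exposure(start_age, end_age, packs_per_week):
    """Compound a smoking time span and a smoking habit into total packs smoked."""
    years = end_age - start_age + 1  # ages 25-27 inclusive span three years
    return years * WEEKS_PER_YEAR * packs_per_week


def discretize(value):
    """Derive a discrete attribute value from a continuous one (rounding is an assumption)."""
    return round(value)


total_packs = total_smoking_exposure(25, 27, 10)
discrete_time = discretize(5.213)
```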
  • Attribute expansion of a discrete numerical attribute, such as age, can be exemplified in one embodiment using a population comprised of four individuals ages 80, 66, 30 and 15. In this example, Alzheimer's disease is the query attribute, and both the 80 year old and the 66 year old individual have Alzheimer's disease, as indicated by an attribute for a positive Alzheimer's diagnosis in their attribute profiles. Therefore, for this small population, the 80 and 66 year old individuals constitute the query-attribute-positive group (the group associated with the query attribute). If a method of discovering attribute associations is executed, none of the attribute combinations identified as being statistically associated with the query attribute will include age, since the numerical age attributes 80 and 66 are not identical. However, it is already known from empirical scientific research that Alzheimer's disease is an age-associated disease, with prevalence of the disease being much higher in the elderly. By using the original (primary) age attributes to derive new (secondary) age attributes, a method of discovering attribute associations can appropriately identify attribute combinations that contain age as a predisposing attribute for Alzheimer's disease based on the query-attribute-positive group of this population. To accomplish this, a procedure of attribute expansion derives lower resolution secondary age attributes from the primary age attributes and consequently expands the attribute profiles of the individuals in this population. This can be achieved by either categorical expansion or numerical expansion.
  • In one embodiment of a categorical attribute expansion, primary numerical age attributes are used to derive secondary categorical attributes selected from the following list: infant (ages 0-1), toddler (ages 1-3), child (ages 4-8), preadolescent (ages 9-12), adolescent (ages 13-19), young adult (ages 20-34), mid adult (ages 35-49), late adult (ages 50-64), and senior (ages 65 and up). This particular attribute expansion will derive the attribute ‘senior’ for the 80 year old individual, ‘senior’ for the 66 year old, ‘young adult’ for the 30 year old, and ‘adolescent’ for the 15 year old. These derived attributes can be added to the respective attribute profiles of these individuals to create an expanded attribute profile for each individual. As a consequence of this attribute expansion procedure, the 80 and 66 year old individuals will both have expanded attribute profiles containing an identical age attribute of ‘senior’, which will then be identified in attribute combinations that are statistically associated with the query attribute of Alzheimer's disease, based on a higher frequency of occurrence of this attribute in the query-attribute-positive group for this example.
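  • The categorical expansion just described can be sketched as a lookup over the listed age ranges. The names and the string form of the attributes are hypothetical. Because the listed infant and toddler ranges both include age 1, the sketch assumes that the first matching category is used.

```python
# (category, lowest age, highest age); None means no upper limit.
AGE_CATEGORIES = [
    ("infant", 0, 1),
    ("toddler", 1, 3),
    ("child", 4, 8),
    ("preadolescent", 9, 12),
    ("adolescent", 13, 19),
    ("young adult", 20, 34),
    ("mid adult", 35, 49),
    ("late adult", 50, 64),
    ("senior", 65, None),
]


def age_category(age):
    """Derive the secondary categorical age attribute from a primary numerical age."""
    for name, low, high in AGE_CATEGORIES:
        if age >= low and (high is None or age <= high):
            return name  # first match wins where ranges overlap (assumption)
    raise ValueError("age must be a non-negative whole number")


def expanded_profile(age):
    """Return a profile holding both the primary and the derived age attribute."""
    return {f"age={age}", f"age category={age_category(age)}"}


categories = [age_category(a) for a in (80, 66, 30, 15)]
# Attributes shared by the two query-attribute-positive individuals.
shared = expanded_profile(80) & expanded_profile(66)
```

  • Before expansion the two query-attribute-positive profiles share no age attribute; after expansion they share the derived category.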
  • As an alternative to the above categorical expansion, a numerical attribute expansion can be performed in which numerical age is used to derive a set of secondary numerical attributes comprising a sequence of inequality statements containing progressively larger numerical values than the actual age and a set of secondary attributes comprising a sequence of inequality statements containing progressively smaller quantitative values than the actual age. For example, attribute expansion can produce the following two sets of secondary age attributes for the 80 year old: {110>age, 109>age . . . , 82>age, 81>age} and {age>79, age>78 . . . , age>68, age>67, age>66, age>65, age>64 . . . , age>1, age>0}. And attribute expansion can produce the following two sets of secondary age attributes for the 66 year old: {110>age, 109>age . . . , 82>age, 81>age, 80>age, 79>age, 78>age . . . , 68>age, 67>age} and {age>65, age>64 . . . , age>1, age>0}.
  • Identical matches of age attributes found in the largest attribute combination associated with Alzheimer's disease, based on the 80 and 66 year old individuals that have Alzheimer's in this sample population, would contain both of the following sets of age attributes: {110>age, 109>age . . . , 82>age, 81>age} and {age>65, age>64 . . . , age>1, age>0}. This result indicates that being less than 81 years of age but greater than 65 years of age (i.e., having an age in the range: 81>age>65) is a predisposing attribute for having Alzheimer's disease in this population. This particular method of attribute expansion of age into a numerical sequence of inequality statements provides identical matches between at least some of the age attributes between individuals, and provides an intermediate level of resolution between actual age and the broader categorical age attribute of ‘senior’ derived in the first example above.
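The numerical expansion and the resulting identical matches can be sketched as set operations. This is an illustrative Python fragment only. The function names are hypothetical, and the upper limit of 110 is taken from the example sets above rather than from any stated rule.

```python
MAX_AGE = 110  # upper limit used in the example sets in the text


def expand_age_inequalities(age, max_age=MAX_AGE):
    """Expand an integer age into the two sets of inequality attributes."""
    larger = {f"{n}>age" for n in range(age + 1, max_age + 1)}
    smaller = {f"age>{n}" for n in range(0, age)}
    return larger | smaller


def shared_age_range(shared):
    """Read the tightest bounds off a set of shared inequality attributes."""
    upper = min(int(s.split(">")[0]) for s in shared if s.endswith(">age"))
    lower = max(int(s.split(">")[1]) for s in shared if s.startswith("age>"))
    return lower, upper


# Attributes that match identically between the 80 and 66 year old individuals.
shared = expand_age_inequalities(80) & expand_age_inequalities(66)
lower, upper = shared_age_range(shared)
```

Here `lower` and `upper` describe the range 81>age>65 discussed above.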
  • Expansion of age attributes can also be used for instances in which age is used to designate a point in life at which a specific activity or behavior occurred. For example, FIG. 9 demonstrates an example in which the actual ages of exposure to smoking cigarettes, ages 25-27, are expanded into a low resolution categorical age attribute of ‘adult’, a broader numerical age range of ‘21-30’, and a set of age attributes comprising a sequence of progressively larger numerical inequality statements for age of the individual, {age>24, age>23 . . . , age>2, age>1}.
  • Attribute expansion can also be used to reduce the amount of genetic information to be processed by the methods of the present invention, essentially 3 billion nucleotides of information per individual and numerous combinations comprised thereof. For example, attribute expansion can be used to derive a set of lower resolution genetic attributes (e.g., categorical genetic attributes such as names) that can be used instead of the whole genomic sequence in the methods. Categorical genetic attributes can be assigned based on only one or a few specific nucleotide attributes out of hundreds or thousands in a sequence segment (e.g., a gene, or a DNA or RNA sequence read). However, using only lower resolution categorical genetic attributes may cause the same inherent limitations of sensitivity as using only SNPs and genomic markers, which represent only a portion of the full genomic sequence content. So, while categorical genetic attributes can be used to greatly decrease processing times required for execution of the methods, they exact a cost in terms of loss of information when used in place of the full high resolution genomic sequence, and the consequence of this can be the failure to identify certain predisposing genetic variations during execution of the methods. In one embodiment, this can show up statistically in the form of attribute combinations having lower strengths of association with query attributes and/or an inability to identify any attribute combination having an absolute risk of 1.0 for association with a query attribute. So the use of descriptive genetic attributes would be most suitable, and the accuracy and sensitivity of the methods increased, once the vast majority of influential genetic variations in the genome (both in gene encoding regions and non-coding regions) have been identified and can be incorporated into rules for assigning categorical genetic attributes.
  • Instead of being appended to the whole genome sequence attribute profile of an individual, categorical genetic attributes can be used to create a separate genetic attribute profile for the individual that comprises thousands of genetic descriptors, rather than billions of nucleotide descriptors. As an example, 19 different nucleotide mutations have been identified in the Cystic Fibrosis Conductance Regulator Gene, each of which can disrupt function of the gene's encoded protein product resulting in clinical diagnosis of cystic fibrosis disease. Since this is the major known disease associated with this gene, the presence of any of the 19 mutations can be the basis for deriving a single lower resolution attribute of ‘CFCR gene with cystic fibrosis mutation’ with a status value of {1=Yes} to represent possession of the genomic sequence of one of the diseased variations of this gene, with the remaining sequence of the gene ignored. For individuals that do not possess any of the 19 mutations in their copies of the gene, the attribute ‘CFCR gene with cystic fibrosis mutation’ and a status value {0=No} can be derived. This approach not only reduces the amount of genetic information that needs to be processed, it allows for creation of an identical genetic attribute associated with 19 different individuals, each possessing one of 19 different nucleotide mutations in the Cystic Fibrosis Conductance Regulator Gene, but all having the same gene mutated and sharing the same disease of cystic fibrosis. This allows for identification of an identical genetic attribute within their attribute profiles with respect to a defect of the CFCR gene without regard for which particular nucleotide mutation is responsible for the defect. This type of attribute expansion can be performed for any genetic sequence, not just gene encoding sequences, and need not be related to disease phenotypes. Further, the genetic attribute descriptors can be names or numeric codes, for example.
In one embodiment, a single categorical genetic attribute descriptor can be used to represent a collection of nucleotide variations occurring simultaneously across multiple locations of a genetic sequence or genome.
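Collapsing many nucleotide-level variants into one categorical genetic attribute can be sketched as a set-membership test. The fragment below is a hedged illustration in Python. The mutation identifiers are placeholders, since the 19 mutations are not enumerated in the text, and `derive_cf_attribute` is a hypothetical name.

```python
# Placeholder identifiers; the actual 19 mutations are not listed in the text.
CF_MUTATIONS = {f"cf_mut_{i}" for i in range(1, 20)}

ATTRIBUTE_NAME = "CFCR gene with cystic fibrosis mutation"


def derive_cf_attribute(observed_variants):
    """Return the categorical attribute with status 1 (Yes) or 0 (No)."""
    has_mutation = bool(CF_MUTATIONS & set(observed_variants))
    return {ATTRIBUTE_NAME: 1 if has_mutation else 0}


# Two individuals with different mutations receive the identical attribute.
first = derive_cf_attribute(["cf_mut_3"])
second = derive_cf_attribute(["cf_mut_17", "unrelated_variant"])
unaffected = derive_cf_attribute(["unrelated_variant"])
```

The same pattern would apply to the epigenetic case, where several methylation patterns map to one categorical descriptor.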
  • Similar to expansion of genetic attributes, attribute expansion can be performed with epigenetic attributes. For example, multiple DNA methylation modifications are known to occur simultaneously at different nucleotide positions within DNA segments and can act in a cooperative manner to effect regulation of expression of one gene, or even a collection of genes located at a chromosomal locus. Based on information which indicates that several different patterns of epigenetic DNA methylation, termed epigenetic polymorphisms, can produce the same phenotypic effect, a single categorical epigenetic attribute descriptor can be derived as a descriptor for that group of epigenetic DNA methylation patterns, thereby ensuring the opportunity for an epigenetic attribute match between individuals sharing predisposition to the same outcome but having a different epigenetic polymorphism that produces that outcome. For example, it has been suggested by researchers that several different patterns of epigenetic modification of the HTR2A serotonin gene locus are capable of predisposing an individual to schizophrenia. For individuals associated with one of these particular schizophrenia-predisposing epigenetic patterns, the same categorical epigenetic attribute of ‘HTR2A epigenetic schizophrenia pattern’ with a status value of {1=yes} can be derived. For an individual who is negative for all known schizophrenia-predisposing epigenetic patterns in the HTR2A gene, the categorical epigenetic attribute of ‘HTR2A epigenetic schizophrenia pattern’ with a status value {0=no} can be derived to indicate that the individual does not possess any of the epigenetic modifications of the HTR2A serotonin gene locus that are associated with predisposition to schizophrenia.
  • In one embodiment, the original attribute value is retained and the expanded attribute values are provided in addition, allowing the opportunity to detect similarities at both the maximal resolution level provided by the original attribute value and the lower level of resolution and/or broader coverage provided by the expanded attribute values or attribute value range. In one embodiment, attribute values are determined from detailed questionnaires which are completed by the consumer/patient directly or with the assistance of clinician 820. Based on these questionnaires, attribute values such as those shown in FIGS. 9A and 9B can be derived. In one or more embodiments, when tabulating, storing, transmitting and reporting results of methods of the present invention, wherein the results include both narrow attributes and broad attributes that encompass those narrow attributes, the broader attributes may be included and the narrow attributes eliminated, filtered or masked in order to reduce the complexity and lengthiness of the final results.
  • Attribute expansion can be used in a variety of embodiments, many of which are described in the present disclosure, in which statistical associations between attribute combinations and one or more query attributes are determined, identified or used. As such, attribute expansion can be performed to create expanded attribute profiles that are more strongly associated with a query attribute than the attribute profiles from which they were derived. As explained previously, attribute expansion can accomplish this by introducing predisposing attributes that were missing or introducing attributes of the correct resolution for maximizing attribute identities between attribute profiles of a group of query-attribute-positive individuals. In effect, expansion of attribute profiles can reveal predisposing attributes that were previously masked from detection and increase the ability of a method that uses the expanded attribute profiles to predict an individual's risk of association with a query attribute with greater accuracy and certainty as reflected by absolute risk results that approach either 1.0 (certainty of association) or 0.0 (certainty of no association) and have higher statistical significance. To avoid introducing bias error into methods of the present invention, expansion of attribute profiles should be performed according to a set of rules, which can be predetermined, so that identical types of attributes are expanded in the attribute profiles of all individuals processed by the methods. 
For example, if a method processes the attribute profiles of a group of query-attribute-positive individuals and a group of query-attribute-negative individuals, and the query-attribute-positive individuals have had their primary age attributes expanded into secondary categorical age attributes which have been added to their attribute profiles, then attribute expansion of the primary age attributes of the query-attribute-negative individuals should also be performed according to the same rules used for the query-attribute-positive individuals before processing any of the attribute profiles by the method. Ensuring uniform application of attribute expansion across a collection of attribute profiles will minimize the introduction of bias into those methods that use expanded attribute profiles or data derived from them.
  • Consistent with the various embodiments of the present invention disclosed herein, computer based systems (which can comprise a plurality of subsystems), datasets, databases and software can be implemented for methods of generating and using secondary attributes and expanded attribute profiles.
  • In one embodiment, a computer based method for compiling attribute combinations using expanded attribute combinations is provided. A query attribute is received, and a set of expanded attribute profiles associated with a group of query-attribute-positive individuals and a set of expanded attribute profiles associated with a group of query-attribute-negative individuals are accessed, both sets of expanded attribute profiles comprising a set of primary attributes and a set of secondary attributes, wherein the set of secondary attributes is derived from the set of primary attributes and has lower resolution than the set of primary attributes. Attribute combinations having a higher frequency of occurrence in the set of expanded attribute profiles associated with the group of query-attribute-positive individuals than in the set of expanded attribute profiles associated with the group of query-attribute-negative individuals are identified. The identified attribute combinations are stored to create a compilation of attribute combinations that co-occur (i.e., co-associate, co-aggregate) with the query attribute, thereby generating what can be termed an ‘attribute combination database’.
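The compilation step described above can be sketched as follows. This Python fragment is a simplified illustration, not the disclosed method: it enumerates candidate combinations exhaustively up to a small size, which would not scale to real profiles, and the function names, the `max_size` parameter, and the ‘smoker’ attribute in the sample data are hypothetical.

```python
from itertools import combinations


def frequency(combo, profiles):
    """Fraction of profiles that contain every attribute in combo."""
    return sum(1 for p in profiles if combo <= p) / len(profiles)


def compile_combinations(positive, negative, max_size=2):
    """Keep combinations that occur more often in the positive group."""
    candidates = set()
    for profile in positive:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(profile), size):
                candidates.add(frozenset(combo))
    database = {}
    for combo in candidates:
        f_pos = frequency(combo, positive)
        f_neg = frequency(combo, negative)
        if f_pos > f_neg:
            database[combo] = (f_pos, f_neg)
    return database


# Expanded profiles: primary age attributes plus secondary categorical ones.
positive = [{"age=80", "senior", "smoker"}, {"age=66", "senior"}]
negative = [{"age=30", "young adult", "smoker"}, {"age=15", "adolescent"}]
attribute_combination_database = compile_combinations(positive, negative)
```

In this toy data the secondary attribute ‘senior’ is stored with frequencies (1.0, 0.0), whereas ‘smoker’ alone is excluded because it is equally frequent in both groups.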
  • In one embodiment, a computer based method for expanding attribute profiles to increase the strength of association between a query attribute and a set of attribute profiles associated with query-attribute-positive individuals is provided. A query attribute is received, and a set of attribute profiles associated with a group of query-attribute-positive individuals and a set of attribute profiles associated with a group of query-attribute-negative individuals are accessed. A first statistical result indicating strength of association of the query attribute with an attribute combination having a higher frequency of occurrence in the set of attribute profiles associated with the group of query-attribute-positive individuals than in the set of attribute profiles associated with the group of query-attribute-negative individuals is determined. One or more attributes in the set of attribute profiles associated with the group of query-attribute-positive individuals and one or more attributes in the set of attribute profiles associated with the query-attribute-negative individuals are expanded to create a set of expanded attribute profiles associated with the group of query-attribute-positive individuals and a set of expanded attribute profiles associated with the group of query-attribute-negative individuals. A second statistical result indicating strength of association of the query attribute with an attribute combination having a higher frequency of occurrence in the set of expanded attribute profiles associated with the group of query-attribute-positive individuals than in the set of expanded attribute profiles associated with the group of query-attribute-negative individuals is determined. 
If the second statistical result is higher than the first statistical result, the expanded attribute profiles associated with the group of query-attribute-positive individuals and the expanded attribute profiles associated with the group of query-attribute-negative individuals are stored.
  • In one embodiment, a computer based method for determining attribute associations using an expanded attribute profile is provided. A query attribute is received, and one or more primary attributes in an attribute profile associated with a query-attribute-positive individual are accessed. One or more secondary attributes are then derived from the primary attributes such that the secondary attributes are lower resolution attributes than the primary attributes. The secondary attributes are stored in association with the attribute profile to create an expanded attribute profile. Attribute combinations that are associated with the query attribute are determined by identifying attribute combinations from the expanded attribute profile that have higher frequencies of occurrence in a set of attribute profiles associated with a group of query-attribute-positive individuals than in a set of attribute profiles associated with a group of query-attribute-negative individuals.
  • In one embodiment, a computer based method for determining attribute associations using an expanded attribute profile is provided in which one or more primary attributes in an attribute profile are accessed. One or more secondary attributes are generated from the primary attributes such that the secondary attributes have lower resolution than the primary attributes. The secondary attributes are stored in association with the attribute profile to create an expanded attribute profile. The strength of association between the expanded attribute profile and a query attribute is determined by comparing the expanded attribute profile to a set of attribute combinations that are statistically associated with the query attribute.
  • The methods, systems, software and databases disclosed herein are able to achieve determination of complex combinations of predisposing attributes not only as a consequence of the resolution and breadth of data used, but also as a consequence of the process methodology used for discovery of predisposing attributes. An attribute may have no effect on expression of another attribute unless it occurs in the proper context, the proper context being co-occurrence with one or more additional predisposing attributes. In combination with one or more additional attributes of the right type and degree, an attribute may be a significant contributor to predisposition of the organism for developing the attribute of interest. This contribution is likely to remain undetected if attributes are evaluated individually. As an example, complex diseases require a specific combination of multiple attributes to promote expression of the disease. The required disease-predisposing attribute combinations will occur in a significant percentage of those that have or develop the disease and will occur at a lower frequency in a group of unaffected individuals.
  • FIG. 10 illustrates an example of the difference in frequencies of occurrence of attributes when considered in combination as opposed to individually. In the example illustrated, there are two groups of individuals referred to based on their status of association with a query attribute (a specific attribute of interest that can be submitted in a query). One group does not possess (is not associated with) the query attribute, the query-attribute-negative group, and the other does possess (is associated with) the query attribute, the query-attribute-positive group. In one embodiment, the query attribute of interest is a particular disease or trait. The two groups are analyzed for the occurrence of two attributes, A and X, which are candidates for causing predisposition to the disease. When frequencies of occurrence are computed individually for A and for X, the observed frequencies are identical (50%) for both groups. When the frequency of occurrence is computed for the combination of A with X for individuals of each group, the frequency of occurrence is dramatically higher in the positive group compared to the negative group (50% versus 0%). Therefore, while both A and X are significant contributors to predisposition in this theoretical example, their association with expression of the disease in individuals can only be detected by determining the frequency of co-occurrence of A with X in each individual.
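The effect described for FIG. 10 can be reproduced with a small invented data set. FIG. 10 itself is not reproduced here, so the group compositions below are assumptions chosen only to match the stated percentages, and the function name `frequency` is hypothetical.

```python
# Invented groups matching the percentages stated in the text.
positive_group = [{"A", "X"}, {"A", "X"}, set(), set()]
negative_group = [{"A"}, {"A"}, {"X"}, {"X"}]


def frequency(group, attributes):
    """Fraction of individuals in the group possessing all given attributes."""
    required = set(attributes)
    return sum(1 for profile in group if required <= profile) / len(group)


individual = {
    name: (frequency(positive_group, {name}), frequency(negative_group, {name}))
    for name in ("A", "X")
}
combined = (
    frequency(positive_group, {"A", "X"}),
    frequency(negative_group, {"A", "X"}),
)
```

Evaluated individually, A and X each occur at 50% in both groups; only the combination separates the groups, at 50% versus 0%.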
  • FIG. 11 illustrates another example of the difference in frequencies of occurrence of attributes when considered in combination as opposed to individually. In this example there are again two groups of individuals that are positive or negative for an attribute of interest submitted in a query, which could again be a particular disease or trait of interest. Three genes are under consideration as candidates for causing predisposition to the query attribute. Each of the three genes has three possible alleles (each labeled A, B, or C for each gene). This example not only illustrates the requirement for attributes occurring in combination to cause predisposition, but also the phenomenon that there can be multiple different combinations of attributes that produce the same outcome. In the example, a combination of either all A, all B, or all C alleles for the genes can result in predisposition to the query attribute. The query-attribute-positive group is evenly divided among these three attribute combinations, each having a frequency of occurrence of 33%. The same three combinations occur with 0% frequency in the query-attribute-negative group. However, if the attributes are evaluated individually, the frequency of occurrence of each allele of each gene is an identical 33% in both groups, which would appear to indicate no contribution to predisposition by any of the alleles in one group versus the other. As can be seen from FIG. 11, this is not the case, since every gene allele considered in this example does contribute to predisposition toward the query attribute when occurring in a particular combination of alleles, specifically a combination of all A, all B, or all C. This demonstrates that a method of attribute predisposition determination needs to be able to detect attributes that express their predisposing effect only when occurring in particular combinations.
It also demonstrates that the method should be able to detect multiple different combinations of attributes that may all cause predisposition to the same query attribute.
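The three-gene example can be checked numerically with a small invented data set. FIG. 11 is not reproduced here, so the query-attribute-negative genotypes below are an assumption chosen so that each allele of each gene occurs at the same frequency in both groups. The function names are hypothetical.

```python
# Each tuple holds the alleles of gene 1, gene 2 and gene 3 for one individual.
positive_group = [("A", "A", "A"), ("B", "B", "B"), ("C", "C", "C")]
negative_group = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]


def allele_frequency(group, gene_index, allele):
    """Frequency of one allele of one gene, evaluated individually."""
    return sum(1 for g in group if g[gene_index] == allele) / len(group)


def genotype_frequency(group, genotype):
    """Frequency of a full three-gene allele combination."""
    return sum(1 for g in group if g == genotype) / len(group)


# Individually, every allele looks identical between the two groups.
individual_identical = all(
    allele_frequency(positive_group, gene, allele)
    == allele_frequency(negative_group, gene, allele)
    for gene in range(3)
    for allele in "ABC"
)

# In combination, three distinct genotypes occur only in the positive group.
predisposing = [
    g for g in positive_group
    if genotype_frequency(positive_group, g) > genotype_frequency(negative_group, g)
]
```

The list `predisposing` contains all three combinations, illustrating that several different combinations can be associated with the same query attribute.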
  • Although the previous two figures present frequencies of occurrence as percentages, for the methods of the present invention the frequencies of occurrence of attribute combinations can be stored as ratios for both the query-attribute-positive individuals and the query-attribute-negative individuals. Referring to FIG. 12A and FIG. 12B, the frequency of occurrence for the query-attribute-positive group is the ratio of the number of individuals of that group having the attribute combination (the exposed query-attribute-positive individuals designated ‘a’) to the total number of individuals in that group (‘a’ plus ‘c’). The number of individuals in the query-attribute-positive group that do not possess the attribute combination (the unexposed query-attribute-positive individuals designated ‘c’) can either be tallied and stored during comparison of attribute combinations, or computed afterward from the stored frequency as the total number of individuals in the group minus the number of exposed individuals in that group (i.e., (a+c)−a=c). For the same attribute combination, the frequency of occurrence for the query-attribute-negative group is the ratio of the number of individuals of that group having the attribute combination (the exposed query-attribute-negative individuals designated ‘b’) to the total number of individuals in that group (‘b’ plus ‘d’). The number of individuals in the query-attribute-negative group that do not possess the attribute combination (the unexposed query-attribute-negative individuals designated ‘d’) can either be tallied and stored during comparison of attribute combinations or can be computed afterward from the stored frequency as the total number of individuals in the group minus the number of exposed individuals in that group (i.e., (b+d)−b=d).
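Storing each frequency as an unreduced ratio keeps both the exposed count and the group total, so the unexposed count can be recovered later. The Python fragment below is a minimal sketch with hypothetical names and invented profiles. It stores the ratio as a pair of integers rather than a reduced fraction, since reducing 2/4 to 1/2 would lose the group size.

```python
def stored_frequency(profiles, combination):
    """Return (exposed count, group total) for an attribute combination."""
    exposed = sum(1 for p in profiles if combination <= p)
    return exposed, len(profiles)


def unexposed_count(frequency_ratio):
    """Recover the unexposed count, e.g. (a + c) - a = c."""
    exposed, total = frequency_ratio
    return total - exposed


combination = {"A", "X"}
positive = [{"A", "X"}, {"A", "X", "Q"}, {"A"}, {"Q"}]
negative = [{"A", "X"}, {"X"}, {"Q"}, {"A"}, set()]

a, positive_total = stored_frequency(positive, combination)  # a / (a + c)
b, negative_total = stored_frequency(negative, combination)  # b / (b + d)
c = unexposed_count((a, positive_total))
d = unexposed_count((b, negative_total))
```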
  • The frequencies of occurrence of an attribute or attribute combination, when compared for two or more groups of individuals with respect to a query attribute, are statistical results (values) that can indicate strength of association of the attribute combination with a query attribute and can therefore be referred to as corresponding statistical results in one or more embodiments of the present invention. Frequencies of occurrence can also be utilized by statistical computation engine 224 to compute additional statistical results for strength of association (i.e., strength of association values) of the attribute combinations with the query attribute, and these statistical results may also be referred to as corresponding statistical results in one or more embodiments. The statistical measures used to compute these statistical results may include, but are not limited to, prevalence, incidence, probability, absolute risk, relative risk, attributable risk, excess risk, odds (a.k.a. likelihood), and odds ratio (a.k.a. likelihood ratio). Absolute risk (a.k.a. probability), relative risk, odds, and odds ratio are the preferred statistical computations for the present invention. Among these, absolute risk and relative risk are the more preferable statistical computations because their values can still be calculated for an attribute combination in instances where the frequency of occurrence of the attribute combination in the query-attribute-negative group is zero. Odds and odds ratio are undefined in instances where the frequency of occurrence of the attribute combination in the query-attribute-negative group is zero, because in that situation their computation requires division by zero which is mathematically undefined. 
One embodiment of the present invention, when supplied with ample data, is expected to routinely yield frequencies of occurrence of zero in query-attribute-negative groups because of its ability to discover large predisposing attribute combinations that are exclusively associated with the query attribute.
  • FIG. 12B illustrates formulas for the statistical measures that can be used to compute statistical results. In one embodiment absolute risk is computed as the probability that an individual has or will develop the query attribute, given exposure to an attribute combination. In one embodiment, relative risk is computed as the ratio of the probability that an exposed individual has or will develop the query attribute to the probability that an unexposed individual has or will develop the query attribute. In one embodiment, odds is computed as the ratio of the probability that an exposed individual has or will develop the query attribute (absolute risk of the exposed query-attribute-positive individuals) to the probability that an exposed individual does not have and will not develop the query attribute (absolute risk of the exposed query-attribute-negative individuals). In one embodiment, the odds ratio is computed as the ratio of the odds that an exposed individual has or will develop the query attribute to the odds that an unexposed individual has or will develop the query attribute.
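The four measures described above can be written directly in terms of the counts a, b, c and d. FIG. 12B is not reproduced here, so the formulas below are the standard 2x2 contingency table definitions, which appear consistent with the verbal descriptions in the text; treat them as an assumption rather than a transcription of the figure. The function names are hypothetical, and `None` is used to signal a mathematically undefined result.

```python
def absolute_risk(a, b):
    """P(query attribute | exposed) = a / (a + b)."""
    return a / (a + b) if (a + b) else None


def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    exposed = absolute_risk(a, b)
    unexposed = absolute_risk(c, d)
    if exposed is None or not unexposed:
        return None  # undefined when the unexposed risk is zero or unavailable
    return exposed / unexposed


def odds(a, b):
    """Odds of the query attribute among the exposed = a / b."""
    return a / b if b else None


def odds_ratio(a, b, c, d):
    """Odds in the exposed divided by odds in the unexposed."""
    if b == 0 or c == 0 or d == 0:
        return None  # requires division by zero
    return (a / b) / (c / d)


# Invented counts. In the second case no query-attribute-negative
# individual has the attribute combination (b = 0).
typical = (30, 10, 20, 40)
exclusive = (10, 0, 5, 20)
```

With b = 0, absolute risk and relative risk remain computable while odds and odds ratio do not, which matches the stated reason for preferring the former two.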
  • In one embodiment, results for absolute risk and relative risk can be interpreted as follows with respect to an attribute combination predicting association with a query attribute: 1) if absolute risk=1.0, and relative risk is mathematically undefined, then the attribute combination is sufficient and necessary to cause association with the query attribute, 2) if absolute risk=1.0, and relative risk is not mathematically undefined, then the attribute combination is sufficient but not necessary to cause association with the query attribute, 3) if absolute risk<1.0, and relative risk is not mathematically undefined, then the attribute combination is neither sufficient nor necessary to cause association with the query attribute, and 4) if absolute risk<1.0, and relative risk is mathematically undefined, then the attribute combination is not sufficient but is necessary to cause association with the query attribute. In an alternate embodiment, a relative risk that is mathematically undefined can be interpreted to mean that there are two or more attribute combinations, rather than just one attribute combination, that can cause association with the query attribute. In one embodiment, an absolute risk<1.0 can be interpreted to mean one or more of the following: 1) the association status of one or more attributes, as provided to the methods, is inaccurate or missing (null), 2) not enough attributes have been collected, provided to or processed by the methods, or 3) the resolution afforded by the attributes that have been provided is too narrow or too broad. These interpretations can be used to increase accuracy and utility of the methods for use in many applications including but not limited to attribute combination discovery, attribute prediction, predisposition prediction, predisposition modification and destiny modification.
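The four-way interpretation above amounts to two independent tests, one on absolute risk and one on whether relative risk is defined. A minimal Python sketch follows. The function name is hypothetical, and an undefined relative risk is represented as `None` by assumption.

```python
def interpret(absolute_risk, relative_risk):
    """Classify an attribute combination as sufficient and/or necessary.

    relative_risk is None when it is mathematically undefined.
    """
    sufficient = absolute_risk == 1.0
    necessary = relative_risk is None
    if sufficient and necessary:
        return "sufficient and necessary"
    if sufficient:
        return "sufficient but not necessary"
    if necessary:
        return "not sufficient but necessary"
    return "neither sufficient nor necessary"


cases = [
    interpret(1.0, None),
    interpret(1.0, 5.0),
    interpret(0.75, 2.25),
    interpret(0.75, None),
]
```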
  • The statistical results obtained from computing the statistical measures, as well as the attribute combinations to which they correspond, can be subjected to inclusion, elimination, filtering, and evaluation based on meeting one or more statistical requirements which may be predetermined, predesignated, preselected or alternatively, computed de novo based on the statistical results. Statistical requirements can include, but are not limited to, numerical thresholds, statistical minimum or maximum values, and statistical significance (confidence) values which may collectively be referred to as predetermined statistical thresholds. Ranks (e.g., numerical rankings) assigned to attribute combinations based on their attribute content and/or the corresponding statistical results can likewise be subjected to inclusion, elimination, filtering, and evaluation based on a predetermined threshold, in this case applied to rank, which can be specified by a user or by the computer system implementing the methods.
  • One embodiment of the present invention can be used in many types of statistical analyses including but not limited to Bayesian analyses (e.g., Bayesian probabilities, Bayesian classifiers, Bayesian classification tree analyses, Bayesian networks), linear regression analyses, non-linear regression analyses, multiple linear regression analyses, uniform analyses, Gaussian analyses, hierarchical analyses, recursive partitioning (e.g., classification and regression trees), resampling methods (e.g., bootstrapping, cross-validation, jackknife), Markov methods (e.g., Hidden Markov Models, Regular Markov Models, Markov Blanket algorithms), kernel methods (e.g., Support Vector Machine, Fisher's linear discriminant analysis, principal component analysis, canonical correlation analysis, ridge regression, spectral clustering, matching pursuit, partial least squares), multivariate data analyses including cluster analyses, discriminant analyses and factor analyses, parametric statistical methods (e.g., ANOVA), non-parametric inferential statistical methods (e.g., binomial test, Anderson-Darling test, chi-square test, Cochran's Q, Cohen's kappa, Efron-Petrosian Test, Fisher's exact test, Friedman two-way analysis of variance by ranks, Kendall's tau, Kendall's W, Kolmogorov-Smirnov test, Kruskal-Wallis one-way analysis of variance by ranks, Kuiper's test, Mann-Whitney U or Wilcoxon rank sum test, McNemar's test, median test, Pitman's permutation test, Siegel-Tukey test, Spearman's rank correlation coefficient, Student-Newman-Keuls test, Wald-Wolfowitz runs test, Wilcoxon signed-rank test).
  • In one embodiment, the methods, databases, software and systems of the present invention can be used to produce data for use in and/or results for the above statistical analyses. In another embodiment, the methods, databases, software and systems of the present invention can be used to independently verify the results produced by the above statistical analyses.
  • In one embodiment a method is provided which accesses a first dataset containing attributes associated with a set of query-attribute-positive individuals and query-attribute-negative individuals, the attributes being pangenetic, physical, behavioral and situational attributes associated with individuals, and creates a second dataset of attributes associated with a query-attribute-positive individual but not associated with one or more query-attribute-negative individuals. A third dataset can be created which contains combinations of attributes from the second dataset (i.e., attribute combinations) that are either associated with one or more query-attribute-positive individuals or are not present in any of the query-attribute-negative individuals, along with the frequency of occurrence in the query-attribute-positive individuals and the frequency of occurrence in the query-attribute-negative individuals. Statistical computations based on the frequencies of occurrence can be performed for each attribute combination, where the statistical computation results indicate the strength of association, as measured by one or more well known statistical measures, between each attribute combination and the query attribute. The process can be repeated for a number of query attributes, and multiple query-positive individuals can be studied to create a computer-stored and machine-accessible compilation of different attribute combinations that co-occur with the queried attributes. The compilation can be ranked (i.e., attribute combinations can be assigned individual ranks) and co-occurring attribute combinations not meeting a statistical requirement for strength of association with the query attribute and/or at least a minimum rank can be eliminated from the compilation. The statistical requirement can be a minimum or maximum statistical value and/or a value of statistical significance applied to one or more statistical results. 
In a further embodiment, ranking the attribute combinations can also be based on the attribute content of the attribute combinations, such as whether certain attributes are present or absent in a particular attribute combination, what percentage of attributes in a particular attribute combination are modifiable, what specific modifiable attributes are present in a particular attribute combination, and/or what types or categories of attributes (i.e., epigenetic, genetic, physical, behavioral, situational) are present in a particular attribute combination and in what relative percentages. These methods of ranking attribute combinations can be applied in various embodiments of the present invention disclosed herein.
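  • The compilation steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the toy data, the function names `relative_risk` and `compile_combinations`, the `MODIFIABLE` set, the choice of relative risk as the statistical measure, and the `min_strength` cutoff are all assumptions made for illustration. The sketch builds the second dataset (attributes of query-attribute-positive individuals that are absent from at least one query-attribute-negative individual), enumerates attribute combinations with their frequencies in each group, computes a strength of association, eliminates weak combinations, and ranks the remainder, breaking ties by the fraction of modifiable attributes.

```python
from itertools import combinations

# Hypothetical toy data: each individual maps to a set of attribute labels.
# "Q" is the query attribute; every name here is illustrative only.
INDIVIDUALS = {
    "p1": {"Q", "A", "B"},
    "p2": {"Q", "A", "B", "C"},
    "p3": {"Q", "C"},
    "n1": {"A"},
    "n2": {"B", "C"},
    "n3": {"C"},
}
MODIFIABLE = {"B"}  # assumed set of modifiable attributes


def relative_risk(a, b, c, d):
    """a, b: positives/negatives having the combination;
    c, d: positives/negatives lacking it."""
    if c == 0:
        # No positive individual lacks the combination, so the ratio is
        # unbounded (or undefined); reported as infinity in this sketch.
        return float("inf")
    return (a / (a + b)) / (c / (c + d))


def compile_combinations(individuals, query, max_size=2, min_strength=1.5):
    pos = [attrs - {query} for attrs in individuals.values() if query in attrs]
    neg = [attrs for attrs in individuals.values() if query not in attrs]

    # Second dataset: attributes of positive individuals that are missing
    # from at least one negative individual.
    candidates = {x for attrs in pos for x in attrs
                  if any(x not in n for n in neg)}

    # Third dataset: attribute combinations with their group frequencies.
    rows = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(candidates), size):
            members = set(combo)
            a = sum(members <= attrs for attrs in pos)
            b = sum(members <= attrs for attrs in neg)
            if a == 0:
                continue  # never occurs in a positive individual
            strength = relative_risk(a, b, len(pos) - a, len(neg) - b)
            if strength < min_strength:
                continue  # fails the assumed statistical requirement
            rows.append({
                "combination": combo,
                "freq_positive": a / len(pos),
                "freq_negative": b / len(neg),
                "relative_risk": strength,
                "modifiable_fraction": len(members & MODIFIABLE) / size,
            })

    # Rank by strength of association, then by attribute content.
    rows.sort(key=lambda r: (-r["relative_risk"], -r["modifiable_fraction"]))
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


ranked = compile_combinations(INDIVIDUALS, "Q")
for row in ranked:
    print(row["rank"], row["combination"], round(row["relative_risk"], 2))
```

  • In this toy dataset the combination of attributes A and B occurs in two of the three query-attribute-positive individuals and in none of the query-attribute-negative individuals, so it ranks first. Exhaustive enumeration grows combinatorially with the number of attributes, which is why the sketch caps the combination size with the assumed `max_size` parameter.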
  • Similarly, a system can be developed which contains a subsystem for accessing a query attribute, a second subsystem for accessing a set of databases containing pangenetic, physical, behavioral, and situational attributes associated with a plurality of query-attribute-positive and query-attribute-negative individuals, a data processing subsystem for identifying combinations of pangenetic, physical, behavioral, and situational attributes associated with query-attribute-positive individuals, but not with query-attribute-negative individuals, and a calculating subsystem for determining a set of statistical results that indicates a strength of association between the combinations of pangenetic, physical, behavioral, and situational attributes and the query attribute. The system can also include a communications subsystem for retrieving at least some of the pangenetic, physical, behavioral, and situational attributes from at least one external database; a ranking subsystem for ranking the co-occurring attributes according to the strength of the association of each co-occurring attribute with the query attribute; and a storage subsystem for storing the set of statistical results indicating the strength of association between the combinations of pangenetic, physical, behavioral, and situational attributes and the query attribute. The various subsystems can be discrete components, configurations of electronic circuits within other circuits, software modules running on computing platforms including classes of objects and object code, or individual commands or lines of code working in conjunction with one or more Central Processing Units (CPUs). A variety of storage units can be used including but not limited to electronic, magnetic, electromagnetic, optical, opto-magnetic and electro-optical storage.
  • In one application, the method and/or system is used in conjunction with a plurality of databases, such as those that would be maintained by health-insurance providers, employers, or health-care providers, which serve to store the aforementioned attributes. In one embodiment, the pangenetic (genetic and epigenetic) data is stored separately from the other attribute data and is accessed by the system/method. In another embodiment, the pangenetic data is stored with the other attribute data. A user, such as a clinician, physician or patient, can input a query attribute, and that query attribute can form the basis for determination of the attribute combinations associated with that query attribute. In one embodiment, the associations will have been previously stored and are retrieved and displayed to the user, with the highest ranked (most strongly associated) combinations appearing first. In an alternate embodiment, the calculation is made at the time the query is entered, and a threshold can be used to determine the number of attribute combinations that are to be displayed.
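  • The two query-handling embodiments described above (retrieval of previously stored associations versus calculation at query time, with a threshold limiting what is displayed) might be sketched as follows. The function name `display_combinations`, the `STORED` dictionary, the `compute` callback, and the `min_strength` and `max_results` parameters are hypothetical names introduced only for this illustration, and the scores are made-up example values.

```python
# Hypothetical store: query attribute -> list of (combination, strength) pairs.
STORED = {
    "Q": [(("A",), 2.0), (("A", "B"), 4.0), (("C",), 1.0)],
}


def display_combinations(query, stored, compute=None,
                         min_strength=0.0, max_results=None):
    """Return combinations for a query attribute, strongest first."""
    if query in stored:
        results = stored[query]      # previously stored associations
    elif compute is not None:
        results = compute(query)     # calculated when the query is entered
    else:
        results = []
    # Apply the threshold, then order by strength of association.
    kept = sorted((r for r in results if r[1] >= min_strength),
                  key=lambda r: r[1], reverse=True)
    return kept if max_results is None else kept[:max_results]


shown = display_combinations("Q", STORED, min_strength=1.5)
for combination, strength in shown:
    print(combination, strength)
```

  • Here the threshold is expressed both as a minimum strength and as an optional cap on the number of results; either one alone would satisfy the description, and which is preferable depends on how the results are presented to the user.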
  • FIG. 13 illustrates a flowchart of one embodiment of a metho