MXPA99008824A - Coincidence detection method, products and apparatus - Google Patents

Coincidence detection method, products and apparatus

Info

Publication number
MXPA99008824A
Authority
MX
Mexico
Prior art keywords
attributes
subset
objects
data set
coincidence
Prior art date
Application number
MXPA/A/1999/008824A
Other languages
Spanish (es)
Inventor
W Steeg Evan
Original Assignee
Queen's University At Kingston
W Steeg Evan
Priority date
Filing date
Publication date
Application filed by Queen's University At Kingston, W Steeg Evan
Publication of MXPA99008824A


Abstract

A method and system for detecting coincidences in a data set of objects, where each object has a number of attributes. Iteratively, equally-sized subsets of the data set are sampled, and coincidences (co-occurrences of a plurality of attribute values in one or more objects in the subset) are recorded. For each coincidence of interest, the expected coincidence count is determined and compared with the observed coincidence count; this comparison is used to determine a measure of correlation for the plurality of attributes for the coincidence. The resulting set of k-tuples of correlated attributes is reported, a k-tuple of correlated attributes being a plurality of attributes for which the measure of correlation is above a predetermined threshold. The method and system (implemented on an array of processing nodes) are suitable for protein structure analysis, e.g. in HIV research.

Description

COINCIDENCE DETECTION METHOD, PRODUCTS AND APPARATUS

DESCRIPTION OF THE INVENTION

The invention relates to methods, devices and systems for detecting coincidences among a multitude of variables. Furthermore, the invention relates to the application of coincidence detection methods to various fields and to products derived from such application.

k-tuples of Correlated Attributes

The discovery of correlations between pairs or k-tuples of variables has applications in many areas of science, medicine, industry and commerce. For example, it is of great interest to physicians and public health professionals to know which lifestyle, diet and environmental factors correlate with each other and with which particular diseases in a database of patient records. It is potentially profitable for a trader in stocks or commodities to discover a set of financial instruments whose prices co-vary over time. The marketing staff of a supermarket chain or a mail-order distributor would be interested in knowing which consumers who buy a product A also tend to buy products B and C, and this can be discovered in a database of sales records. Molecular biologists and computer-based drug discovery researchers would similarly like to infer aspects of three-dimensional molecular structure from correlations between sequence-distant elements in aligned sets of RNA or protein sequences.

A formulation of the general problem that encompasses many diverse applications, and that facilitates understanding of the principles described herein, is a matrix of discrete features, in which the rows correspond to "objects" (such as individual patients, stock prices, consumers, or protein sequences) and the columns correspond to features, attributes or variables (such as lifestyle factors, stocks, sales items, or amino acid residue positions).
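As a minimal illustration of the matrix formulation above, the following Python sketch turns a few sales records into an object-by-attribute matrix. The transaction names and product labels are hypothetical and serve only to show the layout (rows are objects, columns are attributes).

```python
# Hypothetical sales records: each object (transaction) lists the products sold.
transactions = {
    "t1": {"A", "B", "C"},
    "t2": {"A", "B"},
    "t3": {"C"},
}

# Columns of the matrix: every attribute (product) seen in any object.
attributes = sorted(set().union(*transactions.values()))

# Rows of the matrix: one per object; 1 means the object has the attribute.
matrix = [[int(a in items) for a in attributes]
          for items in transactions.values()]
```

Here `matrix[0]` is the row for transaction `t1`, and column `j` of every row refers to `attributes[j]`.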
Mathematical methods for determining a measure of the type, degree and statistical significance of the correlation between any two, or even three or four, particular variables are widespread and well understood. These methods include linear and non-linear regression analysis and contingency table analysis techniques for discrete variables. However, great difficulties arise when trying to estimate the correlation, or even just the joint or conditional probabilities, over much larger sets of variables. This intractability has one main cause: there are too many joint attribute-value probability density terms. It manifests itself in two serious problems: (1) computing and storing frequency counts over all the terms, on the database, requires too much computation and memory; (2) there is usually an insufficient number of database records to support reliable probability estimates based on those frequency counts.

Let us consider some details. For M records (objects), N variables (attributes, fields), and assuming that each variable has the same set A of possible values (of size |A|), there exist $\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$ k-tuples of columns. Adding the number of k-tuples for each k = 1, 2, ..., N results in $2^N - 1$ such tuples of all sizes. This exponential complexity has been a major obstacle in the way of higher-order probability estimation and correlation detection methodologies.

One way to think about this complexity is in terms of the power set of the set of column variables. This power set forms a mathematical lattice under the relation $\subset$, a "tower" that corresponds to a graph whose nodes are subsets of the set of column variables. (Note that if a set has N members, its power set has $2^N$ members.) From this point of view, two nodes representing subsets $s_1$ and $s_2$ are connected if and only if either $s_1 \subset s_2$ or $s_2 \subset s_1$. Node $s_2$ is said to be above $s_1$ if $s_1 \subset s_2$.
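The counting argument above can be checked with a short Python sketch. The function names are illustrative only; the subset relation is modelled with Python's proper-subset operator on frozensets.

```python
from math import comb


def num_k_tuples(n: int, k: int) -> int:
    """Number of k-tuples (k-subsets) of n columns."""
    return comb(n, k)


def total_tuples(n: int) -> int:
    """Number of non-empty column subsets of all sizes k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))


def is_above(s2: frozenset, s1: frozenset) -> bool:
    """Node s2 is above node s1 in the tower when s1 is a proper subset of s2."""
    return s1 < s2


def connected(s1: frozenset, s2: frozenset) -> bool:
    """Two nodes are connected when one is a proper subset of the other."""
    return s1 < s2 or s2 < s1
```

For example, `total_tuples(10)` equals `2**10 - 1`, and `is_above(frozenset({1, 2}), frozenset({1}))` holds while two disjoint column sets are not connected.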
This gives a natural meaning to the term "higher order", since such terms appear higher up the tower. The null-set node is called the bottom, the 0th row; the individual column terms form the first row, and so on. Continuing with the tower analogy, each "floor" k of this building has $\binom{N}{k}$ "apartments" and each apartment contains $|A|^k$ "rooms". In other words, the kth level of the lattice corresponds to $\binom{N}{k}$ different k-tuples of column variables, and associated with each k-tuple is a (|A| by |A| ... by |A|) table, each cell of which must store the counted frequency of a particular joint symbol $(a_{l_1}, a_{l_2}, \ldots, a_{l_k})$, as in a classical contingency table test for the correlation between those particular k columns (See Figure 1).

For any $k \in \{1, 2, \ldots, N\}$ and any k-tuple of columns $(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, there are $|A|^k$ possible joint values. For any such k-tuple of columns, estimating the Kullback divergence or another correlation function from the data set is at least an $O(Mk)$ or $O(|A|^k)$ computation, depending on the relative sizes of M, k and |A|. A comprehensive probabilistic model of the database must be able to specify the probability estimates for $\sum_{k=1}^{N} \binom{N}{k} |A|^k$ terms. This means, for example, in the domain of computational molecular biology, that for a family of heptapeptide sequences, each sequence having a length of seven amino acid residues, there are 1,801,088,540 terms to specify. For an RNA of fifteen nucleotides in length, over the RNA alphabet of four basic symbols, there are 30,517,578,124 terms. These numbers can become intractably huge.

What about the space of possible models through which a modeling or learning procedure must search? Consider a latent variable model, which seeks to explain the correlations among sets of observable variables by positing latent variables whose states jointly influence the observables.
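The two term counts quoted above are consistent with summing, over every level k of the tower, the number of k-tuples of columns times the number of joint values per k-tuple. The sketch below assumes that reading (the summation is inferred from the quoted figures; the function name is hypothetical) and also uses the binomial identity that the sum equals $(|A|+1)^N - 1$.

```python
from math import comb


def num_probability_terms(n_columns: int, alphabet_size: int) -> int:
    """Sum over k = 1..N of C(N, k) * |A|**k joint-value terms."""
    return sum(comb(n_columns, k) * alphabet_size ** k
               for k in range(1, n_columns + 1))


# Heptapeptides: 7 positions over the 20 amino acids.
heptapeptide_terms = num_probability_terms(7, 20)

# RNA of 15 nucleotides over a 4-symbol alphabet.
rna_terms = num_probability_terms(15, 4)

# Binomial theorem: the same sum in closed form.
closed_form_ok = heptapeptide_terms == (20 + 1) ** 7 - 1
```

With these inputs, `heptapeptide_terms` is 1,801,088,540 and `rna_terms` is 30,517,578,124, matching the figures in the text.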
Since each model must specify a set of k-tuples of variables, and there exist $2^{2^N}$ (that is, 2 to the power $2^N$) such sets, there exist $2^{2^N}$ possible models in the search space in the worst case. Several methods for determining higher-order probability estimates evade the combinatorial explosion through various a priori restrictions on the width k (See Figure 3), the locality (Figure 2), the number, or the degree of the higher-order feature correlations sought, and on the types of models involved (See Figure 4).

Three Objectives of Probability Estimation

Before discussing the details of the existing methods and of the present invention, it is useful to delineate three different possible objectives of probability estimation in large data sets, each corresponding to a large body of research and current practice:

1. Fully specified, complete higher-order joint probability distribution estimation: estimation of a probability density q that specifies $q(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$ for all possible k-tuples of attributes and values.
2. Hypothesis testing for particular hypotheses concerning particular attributes and values. For example, are the data compatible with the hypothesis that the columns $c_{i_1}, c_{i_2}, \ldots, c_{i_k}$ are independent?
3. Feature detection or "data mining": detect the most suspicious coincidences, that is, the joint attribute occurrences that are more likely than would be predicted from the lower-order marginals. Relatedly, find the most highly correlated k-tuples of columns.
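To make the third objective concrete, the following Python sketch measures how far the joint distribution of a set of columns departs from what the first-order marginals alone would predict. It assumes, for illustration, that the correlation measure is the Kullback divergence (in bits) between the empirical joint distribution and the product of the first-order marginals; the toy data set and the function name are hypothetical.

```python
from collections import Counter
from math import log2


def kullback_divergence(rows, cols):
    """Divergence between the joint distribution of `cols` and the
    product of their first-order marginals, estimated from `rows`."""
    m = len(rows)
    joint = Counter(tuple(row[c] for c in cols) for row in rows)
    marginals = [Counter(row[c] for row in rows) for c in cols]
    total = 0.0
    for values, n in joint.items():
        p = n / m                      # observed joint frequency
        q = 1.0                        # frequency predicted under independence
        for i, v in enumerate(values):
            q *= marginals[i][v] / m
        total += p * log2(p / q)
    return total


# Toy data: the third column is the exclusive-or of the first two.
rows = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)] * 25

triple = kullback_divergence(rows, (0, 1, 2))
pair = kullback_divergence(rows, (0, 1))
```

In this toy data every joint value that occurs in the three columns is twice as frequent as the first-order marginals predict, so `triple` is 1 bit, while the pair of columns 0 and 1 shows no departure at all (`pair` is 0).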
It is the feature detection and data mining applications that are most relevant to the present invention. However, some of the most successful ways to estimate a full higher-order joint probability distribution for a database require the accurate specification of those higher-order terms that represent high correlations among sets of k > 2 variables, and invoke maximum-entropy assumptions; the present invention is therefore directed at those applications as well.

Related Work

Different mathematical and computational methods have been proposed and used to estimate higher-order probabilities, to detect correlations and to model higher-order relationships in databases. All previous methods either perform a global, sometimes exhaustive, search through all the possible k-tuples of variables, which is too expensive or complex, or else limit their search to k-tuples of a specific, small, fixed size k. (Frequently k = 2, so that only pairwise correlations are considered.) Some representative examples of the related work are given below.
Assuming Independence between Attributes.

The easiest way to avoid the complexity of higher-order correlations is simply to pretend that they do not exist. Many of the algorithms and computer programs that have historically dominated some application fields of the present method simply build and use a data model in which all the variables, and all the attributes, are independent. For example, modeling of DNA and protein sequences in computational molecular biology is often done with consensus sequences and profiles that incorrectly assume that the different base or amino acid residue positions are independent. Relying on such models can obscure critical functional and structural features within the DNA or proteins being modeled.

A priori limits on k.

One proposal, for Gibbs models of databases, is based on the use of Gibbs potentials and includes a method for computing those special terms. Each kth-order potential requires an estimate of a kth-order joint probability density as well as some number of lower-order densities (usually of order k-1). The asymptotic time complexity of Miller's pattern-collection subroutine, the main component of the potential calculation, is, when interpreted in our terminology, an expression in M, N and K (illegible in the source text), where K = kmax is the highest order of the features that are searched for and by means of which the database objects are represented. This exponential growth precludes the search for higher-order features (HOFs) of any order k much greater than 4 or 5 in databases with hundreds of attributes.

Many methods, in different application areas, simply limit k to k = 2. Examples are second-order inter-residue correlation methods, which can be useful in the prediction of protein structure and function and which can be built into classifiers and fold recognizers that are more sensitive than first-order sequence classifiers.
To the extent that k-ary interactions are important, and to the extent that such interactions leave traces in sets of homologous sequences, the pairwise method is deficient. One can try to infer k-ary correlations from sets of 2-ary correlations [9] (essentially by computing the transitive closure of the "correlates-with" binary relation), but this heuristic can lead to problems: high pairwise correlations among the variables x, y, z neither imply in general, nor are necessarily implied by, a high 3-ary correlation (as measured by the Kullback divergence) of the three variables x, y, z. In other areas of application, such as the study of multiple drug interactions, it is similarly true that important higher-order relationships can be missed by pairwise correlation detection methods.

The Paturi et al. method for identifying the most correlated random variables.

A method has been reported for the problem of finding the most highly correlated pair $X_i, X_j$ of variables among a large set of N binary random variables $X_1, X_2, \ldots, X_N$. The method is easily extended to find the most correlated k-tuple of binary random variables, although with a significant increase in computational complexity and only for k > 2 fixed a priori. The method uses a definition of correlation in which correlation$(X_i, X_j) = P[X_i = X_j]$ over some set of M samples (here $P[X_i = X_j]$ means the probability that the variable $X_i$ has the same value or state as the variable $X_j$). The computational complexity of the method comprises both time complexity and sample complexity. The two variants of the Paturi method are asymptotically quadratic and subquadratic in N, respectively, the faster procedure requiring more samples. When the method is extended to the search for the largest k-ary correlation, where the correlation is defined as $P[X_{i_1} = X_{i_2} = \cdots = X_{i_k}]$, the time complexity grows to approximately O (k2N * log3N). The search for highly correlated attribute groups of width k much larger than 5 or 6 in very large data sets is once again excluded.

Hidden Markov Models.

Hidden Markov models (HMMs) have been widely used, with increasing success in recent years, both in automatic speech recognition and in the modeling of protein, DNA and RNA sequences. Although some groups have reported significant success in modeling protein sequence families and continuous speech data with HMMs, great improvements can nevertheless be made in the learning time and robustness of the model through the "hardwiring" of pre-selected higher-order features into the HMM. (This has been investigated for recurrent neural networks similar to HMMs, in different domains.) Some of the same reasons why HMMs are very good at aligning protein sequences or recorded utterances in the first place, using local sequential correlations, make such methods less useful for finding important sequence-distant correlations in data that have already been partially or completely aligned. The phenomenon responsible for this dilemma is termed "diffusion". A first-order HMM, by definition, assumes independence between the sequence columns given a hidden state sequence. Multiple alternative state sequences can be used to capture higher-order interactions, although the number of these increases exponentially with the number of k-tuples of correlated columns.

The Agrawal et al. Method for Discovering Association Rules.

This method was developed in the context of the purest form of data mining, the automatic extraction of knowledge-base rules from databases. A database of M transactions (objects, rows) and N items (attributes, columns) is considered, and the aim is to extract rules of the form a → b. The search is therefore for pairs of attributes a, b such that "a transaction containing a tends to contain b", that is, those pairs with high values of p(b|a). "People who buy compact disc players tend to buy compact discs" is just one example that suggests the potential commercial interest in such methods. (More generally, one can look for sets of attributes with high $p(b_1, b_2, \ldots, b_k \mid a_1, a_2, \ldots, a_j)$.) A rule a → b is said to have:

1. confidence c if c% of the transactions containing a also contain b (therefore, approximately, if p(b|a) ≈ c/100);
2. support s if s% of the transactions contain both a and b (therefore, approximately, if p(a, b) ≈ s/100).

The objectives behind this method are different from the objects of the invention. However, the different objectives meet if one focuses the Agrawal method on the discovery of symmetric rules (so that the search is for pairs of attributes that display high values of p(a|b) as well as of p(b|a)) and if one reduces the emphasis on support (so that coincidences that are suspicious, even if they occur rarely, are sought). The Agrawal method is shown to have $O(\|s\| \cdot MN)$ time complexity, where $\|s\|$ is the sum of all values Support(a) for a potentially exponentially large number of k-tuples a of attributes, of any size 1 < k < N, that reach a particular processing stage in the procedure. The method is therefore $O(2^N)$ in the worst case. A series of empirical tests was run on what are considered to be realistic data sets for its domain. The running time of the procedure increases only linearly with the number M of transactions, although the number of items (attributes) remains constant at N = 1000, and the constructed data sets probably contain no correlated k-tuples of width k > 10. An analysis of the algorithm, which is based on the incremental build-up of (k+1)th-order groups from kth-order groups, makes clear that the method requires much more computation to find wide HOFs (large k) than narrow HOFs (small k) of equivalent statistical significance.

Steeg, Robinson, Deerfield, Lappa - 1993.
Some heuristic methods have been presented for finding k-tuples of correlated residues (positions) in sets of aligned protein sequences. One of the methods presented employed a rudimentary version of the representation and coincidence detection steps described herein.

Alternative methods of, and devices for, finding correlations between attributes, and applications for these correlations, are required.

In a first aspect, the present invention provides a coincidence detection method for use with a data set of objects having a number of attributes. The base method includes the following steps:
• represent a set of M objects in terms of a number N of variables ("attributes"), where an attribute is said to occur in an object if the object has the attribute;
• sample a subset of $r_i$ out of the M objects, for each iteration i among a predetermined number of iterations;
• detect and record coincidences among sets of k attributes in each sampled subset of objects, a coincidence being the co-occurrence of 1 < k < N attributes in the same $h_i$ out of the $r_i$ objects in the sampled subset, where 0 < $h_i$ < $r_i$;
• determine an expected count of coincidences for any set of k attributes and a predetermined number of sampling iterations and coincidence counts as described above, the determination being performed before, at the same time as, or after the sampling and recording;
• compare, for any set of k attributes and number of sampling iterations and coincidence counts, the observed count against the expected count of coincidences, and from this comparison determine a measure of correlation (or association or dependence) for the set of k attributes; and
• report a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a set of k of the N attributes that have been determined by this process to have a value of a selected correlation measure above a predetermined threshold value.
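The steps of the first aspect can be sketched in Python as follows. This is only an illustrative reading under stated assumptions, not the patented implementation: attributes are binary, every iteration samples the same number r of objects, the expected count is computed under the assumption that attributes are independent with the first-order marginals of the full data set (treating sampled objects as independent draws), the correlation measure is simply the ratio of observed to expected count, and the threshold value, the toy data and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import comb, prod


def count_coincidences(matrix, r, T, seed=0):
    """Sample r objects T times; record sets of attributes that occur in
    exactly the same h of the r sampled objects (h >= 1)."""
    rng = random.Random(seed)
    n_attrs = len(matrix[0])
    counts = Counter()
    for _ in range(T):
        rows = rng.sample(matrix, r)
        by_pattern = defaultdict(list)
        for j in range(n_attrs):
            # Positions within the sample at which attribute j occurs.
            pattern = tuple(i for i, row in enumerate(rows) if row[j])
            if pattern:
                by_pattern[pattern].append(j)
        for pattern, attrs in by_pattern.items():
            if len(attrs) >= 2:
                counts[(tuple(attrs), len(pattern))] += 1
    return counts


def observed_count(counts, attrs, h):
    """Iterations in which all of `attrs` occurred in the same h objects."""
    wanted = set(attrs)
    return sum(c for (group, hh), c in counts.items()
               if hh == h and wanted <= set(group))


def expected_count(marginals, attrs, h, r, T):
    """Expected number of such iterations if the attributes were independent."""
    per_pattern = prod(marginals[j] ** h * (1 - marginals[j]) ** (r - h)
                       for j in attrs)
    return T * comb(r, h) * per_pattern


# Toy data: attribute 1 copies attribute 0; attributes 2 and 3 are independent.
data_rng = random.Random(1)
matrix = []
for _ in range(200):
    a = data_rng.randint(0, 1)
    matrix.append([a, a, data_rng.randint(0, 1), data_rng.randint(0, 1)])

M, r, T, h = len(matrix), 5, 2000, 2
marginals = [sum(row[j] for row in matrix) / M for j in range(4)]
counts = count_coincidences(matrix, r, T)

threshold = 5.0  # hypothetical cut-off on observed / expected
ratios = {}
for pair in [(0, 1), (2, 3)]:
    obs = observed_count(counts, pair, h)
    exp = expected_count(marginals, pair, h, r, T)
    ratios[pair] = obs / exp
reported = [pair for pair, ratio in ratios.items() if ratio > threshold]
```

In this toy setting the duplicated attributes (0, 1) coincide far more often than independence predicts, whereas the independent pair (2, 3) stays close to its expected count. Grouping attributes by their occurrence pattern within the sample is one way to avoid enumerating every k-tuple explicitly.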
In a second aspect, the invention provides a coincidence detection method for use with a data set of objects having a number of attributes, the method comprising the steps of:
• sampling a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determining an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
In any of these aspects, the comparison of observed and expected counts can be calculated using a Chernoff bound on tail probabilities, and the counts can be recorded by storing a running total of the count of each coincidence over all the sampled subsets.

In a third aspect, the invention provides a method for visual exploration of a data set of objects having a number of attributes, the method comprising the steps of:
• sampling a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same number of objects, but not necessarily the same objects, and having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determining an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes to a user through a graphical interface, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is greater than a respective predetermined threshold.
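As a hedged illustration of a Chernoff-style comparison of observed and expected counts, the Python sketch below uses the standard multiplicative Chernoff bound for a sum of independent 0/1 variables with mean μ: the probability of observing at least (1 + δ)μ is at most $(e^{\delta} / (1+\delta)^{1+\delta})^{\mu}$. The text does not say which form of the bound is used, so this particular form and the function name are assumptions.

```python
from math import exp, log


def chernoff_upper_tail(observed: float, expected: float) -> float:
    """Upper bound on P[count >= observed] when the count is a sum of
    independent 0/1 variables with mean `expected` (must be > 0)."""
    if observed <= expected:
        return 1.0  # the bound is only informative above the mean
    delta = observed / expected - 1.0
    # Work in log space: mu * (delta - (1 + delta) * ln(1 + delta)).
    log_bound = expected * (delta - (1.0 + delta) * log(1.0 + delta))
    return exp(log_bound)


# A coincidence seen 20 times when 10 were expected.
bound = chernoff_upper_tail(20, 10)
```

A small bound means that the observed count would be very unlikely if the attributes were independent, which is the kind of evidence the comparison step looks for; here `bound` is roughly 0.021.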
In a fourth aspect, the invention provides a pre-processing method for use with a data modeling unit, to capture and report to the data modeling unit the higher-order interactions in a data set of objects having a number of attributes, the method comprising the steps of:
• sampling a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determining an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and
• reporting to the data modeling unit a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is greater than a respective predetermined threshold.
In a fifth aspect, the invention provides a correlation elimination method for use with a data set of objects having a number of attributes, the method comprising the steps of:
• sampling a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determining an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and
• eliminating a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure exceeds a respective predetermined threshold.

In any of the aspects, the objects can be sales transactions, each transaction comprising one or more products sold, and the attributes can be, for example, instances of sales of particular products or types of products. The objects can be time segments and the attributes can be states of elements in a system. The objects can be time segments and the attributes can be prices, or price changes, of financial instruments.

In any of the aspects, the steps of the method can be represented by the following pseudo-code:

    0. begin
    1.   read(MATRIX);
    2.   read(R, T);
    3.   compute_first_order_marginals(MATRIX);
    4.   csets := {};
    5.   for iter = 1 to T do
    6.     sampled_rows := tsample(R, MATRIX);
    7.     attributes := get_attributes(sampled_rows);
    8.     all_coincidences := find_all_coincidences(attributes);
    9.     for coincidence in all_coincidences do
    10.      if cset_already_exists(coincidence, csets)
    11.        then update_cset(coincidence, csets);
    12.        else add_new_cset(coincidence, csets);
    13.      endif
    14.    endfor
    15.  endfor
    16.  for cset in csets do
    17.    expected := compute_expected_match_count(cset);
    18.    observed := get_observed_match_count(cset);
    19.    stats := update_stats(cset, hypoth_test(expected, observed));
    20.  endfor
    21.  print_final_stats(csets, stats);
    22. end

In a sixth aspect, the invention provides a coincidence detection system for use with a data set of objects, each object having a plurality of attributes, the system comprising:
• means for sampling a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• means for detecting and recording counts of coincidences in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• means for determining an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• means for comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and
• means for reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.

In the system of the sixth aspect, the means for sampling a subset of the data set may comprise means for partitioning the data set into the subsets for sampling. The means for detecting and recording counts of coincidences may comprise an array of processing nodes, each processing node detecting and recording respective sub-counts of coincidences, and the means for comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences may comprise means for combining the sub-counts to provide the observed count. At least one of the processing nodes may comprise a respective sub-array of processing nodes that detect and record respective sub-sub-counts of coincidences, and means for combining the sub-sub-counts to provide the sub-counts and/or the observed count. Each processing node may comprise memory, comprising an input buffer for storing received subsets of the data set and an output buffer for storing the sub-sub-count or the sub-count, and a memory bus that transfers data to and from the memory.
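The division of work among processing nodes described for the sixth aspect can be sketched in Python as below. The sketch simulates the nodes with ordinary function calls, and it simplifies a coincidence to a pair of attributes occurring together in one object; the round-robin partitioning scheme, the toy data and all names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def partition(data, n_nodes):
    """Divide the data set into one subset per processing node."""
    return [data[i::n_nodes] for i in range(n_nodes)]


def node_count(subset):
    """Work done by one node: its sub-count of co-occurring attribute pairs."""
    counts = Counter()
    for obj in subset:
        for pair in combinations(sorted(obj), 2):
            counts[pair] += 1
    return counts


def merge(subcounts):
    """Combine the sub-counts from all nodes into the observed counts."""
    total = Counter()
    for sub in subcounts:
        total.update(sub)  # Counter.update adds counts key by key
    return total


# Toy data set: each object is the set of attributes it has.
data = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b"}, {"c"}, {"a", "c"}]

observed = merge(node_count(part) for part in partition(data, 3))
```

Because the sub-counts are simply added, the merged result equals the count obtained by a single node over the whole data set, and the same merge can be applied recursively to sub-sub-counts from a sub-array of nodes.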
In a seventh aspect, the invention provides programmed means of coincidence detection for use with a computer and with a data set of objects having a number of attributes, the programmed means comprising a computer program stored on storage media compatible with the computer, the computer program containing instructions for directing the computer to:
• sample a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• detect and record coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determine an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• compare, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determine a correlation measure for the plurality of attributes for the coincidence; and
• report a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
In an eighth aspect, the invention provides a programmed coincidence detection system for use with a data set of objects having a number of attributes, the system comprising: a computer; and a computer program stored on computer-compatible storage media, the computer program directing the computer to:
• sample a subset of the data set for a predetermined number of iterations, each sampled subset of the data set having the same subset of attributes for each object;
• detect and record coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
• determine an expected count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detection and recording;
• compare, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determine a correlation measure for the plurality of attributes for the coincidence; and
• report a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.

In any of its aspects, the methods of the invention may further comprise the step of representing the objects and attributes in a matrix of objects against attributes, the data set being sampled by sampling the matrix.
In a ninth aspect, the invention provides a product having a set of attributes selected by:
• sampling a subset of a data set representing objects against attributes for a predetermined number of iterations, the subset sampled in each iteration having the same number of objects, but not necessarily the same objects, and having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detecting and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detecting and recording of coincidence counts in other subsets;
• determining an expected coincidence count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detecting and recording;
• comparing, for each coincidence of interest, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
In a tenth aspect, the invention provides a product defined by the application of a set of rules generated by:
• sampling a subset of a data set representing objects against attributes for a predetermined number of iterations, the subset sampled in each iteration having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detecting and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detecting and recording of coincidence counts in other subsets;
• determining an expected coincidence count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detecting and recording;
• comparing, for each coincidence of interest, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
In any aspect, the methods of the invention may further comprise the step of applying rules that are defined by the reported correlated attributes.
In an eleventh aspect, the invention provides a peptide or peptidomimetic that includes a structural motif of the V3 loop of the HIV envelope protein, the motif including the spatial coordinates of residues A18/Q31/H33.
In a twelfth aspect, the invention provides a pharmaceutical composition comprising a ligand that interacts with a protein having a structural motif identified using the method of claim 2, and a pharmaceutically acceptable carrier or excipient therefor.
The ligand can comprise chemical moieties of suitable identity, spatially located relative to each other so that the moieties interact with corresponding residues or portions of the motif. The ligand, by interaction with the motif, can interfere with the function of a region of the protein comprising the motif.
In a thirteenth aspect, the invention provides a diagnostic agent comprising a ligand that interacts with a protein having a structural motif identified by the method of the previous aspects of the invention, and a detectable label linked to the ligand.
In a fourteenth aspect, the invention provides a pharmaceutical composition for interacting with a human immunodeficiency virus (HIV) envelope protein, the envelope protein including a structural motif of the V3 loop having the spatial coordinates of residues A18/Q31/H33, the composition comprising a ligand that includes at least one functional group that interacts with the motif, and a pharmaceutically acceptable carrier or excipient therefor. The ligand may include at least one functional group capable of binding to residue 18 and present at an effective position in the ligand to do so, at least one functional group capable of binding to residue 31 and present at an effective position in the ligand to do so, and at least one functional group capable of binding to residue 33 and present at an effective position in the ligand to do so.
In a fifteenth aspect, the invention provides a method of designing a ligand for interacting with a structural motif of a human immunodeficiency virus (HIV) envelope protein, the method comprising the steps of: providing a template having the spatial coordinates of residues A18, Q31 and H33 in the V3 loop of the HIV envelope protein; and computationally evolving a chemical ligand using an effective algorithm with spatial constraints, so that the evolved ligand includes at least one effective functional group that binds to the motif.
The ligand may comprise at least one functional group capable of binding to residue 18 and present at an effective position in the ligand to do so, at least one functional group capable of binding to residue 31 and present at an effective position in the ligand to do so, and at least one functional group capable of binding to residue 33 and present at an effective position in the ligand to do so.
In a sixteenth aspect, the invention provides a method of identifying a ligand for binding to a structural motif of a human immunodeficiency virus (HIV) envelope protein, the method comprising the steps of: providing a template having the spatial coordinates of A18, Q31 and H33 in the V3 loop of the HIV envelope protein; providing a database that contains the structure and orientation of molecules; and screening the molecules to determine whether they contain effective moieties spaced relative to one another so that the moieties interact with the motif. A first moiety of the molecule can interact with residue 18, a second moiety of the molecule with residue 31, and a third moiety of the molecule with residue 33.
In a seventeenth aspect, the invention can provide antigens and vaccines that present the covarying k-tuples described herein.
In an eighteenth aspect, the invention provides a product that is defined by its interaction with a set of attributes selected by:
• sampling a subset of a data set representing objects against attributes for a predetermined number of iterations, the subset sampled in each iteration having the same number of objects, but not necessarily the same objects, and having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset, where the plurality of attribute values is the same for each occurrence, the detecting and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detecting and recording of coincidence counts in other subsets;
• determining an expected coincidence count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detecting and recording;
• comparing, for each coincidence of interest, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a predetermined threshold.
In any of the aspects, the objects can be compounds and the attributes can comprise particular chemical moieties. The objects can be peptides or proteins and the attributes can comprise particular structural or sub-structural patterns or motifs. The objects can be selected from the group consisting of compounds, molecular structures, nucleotide sequences and amino acid sequences, and the attributes can be characteristics of the selected objects.
The objects can be time segments and the attributes can be biological parameters of genes or gene products. The objects can be documents that are stored and/or indexed electronically and the attributes can be topics. The objects can be customers and the attributes can include products purchased or not purchased by those customers. The attributes may also include mailings made or not made by the customers. The objects can comprise products and the attributes can comprise customers who have or have not purchased those products. The attributes may include demographic variables of the customers. The objects can be people with a particular disease or condition and the attributes can be potential contributing factors for the disease or condition. The objects can be people with a number of different diseases or conditions and the attributes can be potential contributing factors for the diseases or conditions. The objects can include factors that potentially contribute to a disease or condition and the attributes can be people with or without those factors, in which case the method associates groups of people of substantially equivalent risk for the disease or condition. The objects can be time segments and the attributes can comprise the states of components in a system in time segments before a system failure, in which case the method associates component states that can potentially cause system failure. In the first aspect, r can be the same for each iteration. In any aspect, the method provided may further comprise the steps of first creating a database of transitions between states of a system, where a state of the system is represented by the value of a state variable, over a selected amount of time, and then supplying the database, in whole or in part, as the data set, so that each state-to-state transition corresponds to one of the M objects and each state variable corresponds to an attribute.
In any of these aspects, the method provided may further comprise the steps of first creating a database of states and actions covering a selected amount of time, and then supplying the database, in whole or in part, as the data set, so that each state/action/state triple corresponds to one of the M objects and each state variable or action type corresponds to an attribute.
In a thirteenth aspect, the invention provides a method of coincidence detection for use with a data set of objects having a number of attributes, represented in a matrix of objects against attributes, the method comprising the steps of:
• sampling a subset of the matrix for a predetermined number of iterations, the subset sampled in each iteration having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the matrix, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the matrix, where the plurality of attribute values is the same for each occurrence, the detecting and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detecting and recording of coincidence counts in other subsets;
• determining an expected coincidence count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detecting and recording;
• comparing, for each coincidence of interest, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
In the first aspect, the numerical correlation values can be reported together with the set of k-tuples of correlated attributes.
BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the accompanying drawings, which show the preferred embodiment of the present invention, and in which:
Figure 1 is an illustration of a power set of a set with N = 6 objects, arranged as a lattice under the subset operation, which represents all possible k-tuples of columns from the power set.
Figure 1a is an illustration of the relative proportion of all the lattice nodes shown (dark squares) or omitted (light squares) in Figure 1.
Figure 2 is an illustration of n-grams of all sizes n = 1, 2, ..., 6 for the power set of Figure 1.
Figure 2a is an illustration of the relative proportion of all lattice nodes shown or omitted in Figure 2, with a subset of the terms highlighted.
Figure 3 is an illustration of all the possible pairwise correlations for the power set of Figure 1, which corresponds to analysis of the third row up from the bottom of the lattice. This is the approach taken in work on inter-residue correlations in protein and RNA sequence families, for example. As another example, this Figure represents the approach taken by a method that simply finds all pairs of sales items that tend to be purchased together by customers.
Figure 3a illustrates the relevant correlations from Figure 3 outside the power set of Figure 1.
Figure 4 is an illustration of a partition of the object variables from the power set of Figure 1. A partition is an important special class of component model of a sequence family or other aligned data set. In a component model, a set of $N_y$ latent variables $y_i$ is found to "generate" or "explain" a larger set of $N$ observable variables $c_j$. In a partition model $N_y \le N$, each $c_j$ is generated by exactly one of the $y_i$, and typically $N_y < N$. Observables that correspond to one latent variable form a group or class, and presumably are highly correlated with each other and relatively uncorrelated with the variables outside the group. In Figure 4, the observables are formed into three groups: $(c_1)$, $(c_2, c_5, c_6)$ and $(c_3, c_4)$.
Figure 4a illustrates the partition of Figure 4 outside the power set of Figure 1.
Figure 5 is an illustration of three sampling iterations of a data set according to one embodiment of the invention.
Figure 5a is an illustration of the three sampling iterations of Figure 5 with explanatory notes.
Figure 6 is a general flow diagram of a program method of a preferred embodiment.
Figure 7 is a schematic diagram of a system that implements the program method of Figure 6.
Figure 8 is a general flow diagram of the program method of Figure 6 adapted to control a process for production of a product.
Figure 9 is a schematic diagram of a system that implements the adapted program method of Figure 8.
Figure 10 is a general flow diagram of the program method of Figure 6 adapted to generate rules for a rule-based system which in turn produces a product.
Figure 11 is a schematic diagram of a system implementing the adapted program method of Figure 10.
Figure 12 is a general flow diagram of the program method of Figure 6 adapted to generate rules used to control a process for producing a product.
Figure 13 is a schematic diagram of a system implementing the adapted program method of Figure 12.
Figure 14 is a diagram of a node of an electromechanical component implementation of a preferred embodiment.
Figure 15 is a residue diagram for given sequences for the three-dimensional sample structure of Figure 15a, where coincidences in the sequences can indicate conserved physical or structural relationships.
Figure 15a is a diagram of a three-dimensional structure for a sample protein.
Figure 16 is a diagram of stages in prediction of tertiary structure that can employ the methods described herein.
As previously stated, a base method described herein employs the steps of:
• representing a set of M objects in terms of a number $N_A$ of variables ("attributes"), where an attribute is said to occur in an object if the object has the attribute;
• sampling a subset of $r_i$ out of the M objects, for each iteration $i$ of a predetermined number of iterations;
• detecting and recording coincidences among sets of k of the attributes in each sampled subset of objects, a coincidence being the co-occurrence of $1 < k \le N_A$ attributes in the same $h_i$ out of the $r_i$ objects in the sampled subset, where $0 < h_i \le r_i$;
• determining an expected coincidence count for any set of k attributes, given the predetermined number of sampling iterations and the coincidence counting described above, the determination being performed before, at the same time as, or after the sampling and counting;
• comparing, for any set of k attributes and the number of sampling iterations and coincidence counts, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation (or association, or dependency) for the set of k attributes; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a set of k of the $N_A$ attributes that has been determined by this process to have a value of a selected correlation measure above a predetermined threshold value.
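The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program of the Appendices: the function name `find_correlated_ktuples`, the fixed sample size r, the restriction to coincidences shared by all r sampled objects (h = r), the observed/expected ratio used as the correlation measure, and the toy data are all hypothetical choices made for the example. The expected count treats the r draws as independent, which is only an approximation when sampling without replacement.

```python
import random
from collections import Counter
from itertools import combinations
from math import prod


def find_correlated_ktuples(objects, r, iterations, k, threshold, seed=0):
    """Sample r objects per iteration, count k-attribute coincidences shared by
    all r sampled objects, and report k-tuples whose observed/expected ratio
    exceeds `threshold`. Each object is a set of binary attributes."""
    rng = random.Random(seed)
    m = len(objects)
    # Marginal frequency of every attribute over the whole data set.
    freq = Counter(attr for obj in objects for attr in obj)

    observed = Counter()
    for _ in range(iterations):
        sample = rng.sample(objects, r)        # r distinct objects
        shared = set.intersection(*sample)     # attributes present in all r
        for ktuple in combinations(sorted(shared), k):
            observed[ktuple] += 1

    reported = {}
    for ktuple, obs in observed.items():
        # Expected count if the k attributes were statistically independent
        # (assumption: the r draws are treated as independent).
        p_all = prod(freq[a] / m for a in ktuple) ** r
        ratio = obs / (iterations * p_all)
        if ratio > threshold:
            reported[ktuple] = ratio
    return reported


# Toy data: six objects, six variables; attributes are written value@column.
rows = ["ABCDEF", "WUCVEG", "ZLCMWM", "VUCVAG", "ABCDZZ", "WLCMEZ"]
objects = [{f"{v}@{j}" for j, v in enumerate(row, start=1)} for row in rows]

result = find_correlated_ktuples(objects, r=2, iterations=2000, k=2, threshold=2.0)
for ktuple, ratio in sorted(result.items(), key=lambda item: -item[1]):
    print(ktuple, round(ratio, 2))
```

Because each iteration is independent of the others, the sampling loop could be distributed over many workers and the counters merged afterwards; the sketch keeps it serial for clarity.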
An alternative base method may include the following steps:
• sampling a subset of the data set for a predetermined number of iterations, the subset sampled in each iteration having the same subset of attributes for each object;
• detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detecting and recording of coincidence counts in each sampled subset being performed before, at the same time as, or after the sampling, detecting and recording of coincidence counts in other subsets;
• determining an expected coincidence count for each coincidence of interest, the determination being performed before, at the same time as, or after the sampling, detecting and recording;
• comparing, for each coincidence of interest, the observed coincidence count against the expected coincidence count, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
The embodiments described herein provide extensions of the base methods described above and employ similar principles. The principles of one application as described herein may be applied to others as appropriate. Therefore, the description of all the elements of an application will not always be repeated for each application.
In the preferred embodiment, for simplicity of programming and interpretation, a matrix is used in which the objects are rows and the attributes are columns; however, this is not strictly required, and any of the embodiments may use a data set of objects and attributes that is not represented in matrix form, by sampling subsets of the data set directly. As those skilled in the art know, any relational database can easily be transformed into a two-dimensional matrix format.
The embodiments described herein lend themselves particularly well to parallel processing, as the steps of detecting, recording and counting coincidences for each of the samples of r objects can run simultaneously across many different samples or other subsets of the data set. Each of the characteristics or variables that describe an object can be numerical or qualitative. If it is qualitative, a characteristic or variable described in terms of some number z of levels or qualities can be transformed into a numerical variable with z possible values or states. A numerical variable with z possible values or states can be transformed into z binary variables called attributes. A numerical variable or characteristic with a continuous range of possible values or levels can be transformed into, or represented by, a variable with z possible values or states, and can therefore also be transformed into, or represented by, a set of z binary attributes. More formally, it is assumed that we are given a database of M objects $O_1, O_2, \ldots, O_M$, each of which is characterized by particular values $a_{ij} \in A_j$ for each of N discrete variables $v_j$. A particular value a for a particular variable j is denoted a@j. One can start with continuously valued variables and use any of the known methods for quantization into discrete variables. Note also that, in many applications, the same alphabet A of possible values is used for all the variables. Each object can be a particular record in a database or it can be a sample from a random source. If the initial N variables are not binary then they can be converted into a set of $N_A$ binary attributes. For example, in the listing in Appendix "B" each amino acid position is a variable that has 20 possible values, corresponding to the 20 naturally occurring amino acids, each represented by a letter of the alphabet.
In order to convert the variables into binary attributes, each variable becomes 20 different attributes, each of which has one of two states, such as "A" or "not A", "B" or "not B", and so on. One way of representing variables of this type is included in the source code listed in Appendix "A". Other techniques for representing data as attributes can be used. The principles set forth in this description can also be extended to higher-order attributes, for example, trinomial attributes for use with higher-order computing machines. The binary examples used here are the easiest to implement.
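The conversion of multi-valued variables into binary attributes described above can be sketched as follows. This is a minimal illustration, not the representation used in Appendix "A": the function name `to_binary_attributes`, the ordering of the 20-letter alphabet and the two short sequences are hypothetical and chosen only for the example.

```python
def to_binary_attributes(rows, alphabet):
    """Expand every variable (position) into len(alphabet) binary attributes.

    The attribute at index position * len(alphabet) + alphabet.index(symbol)
    is 1 when the object has `symbol` at `position`, and 0 otherwise.
    """
    encoded = []
    for row in rows:
        bits = []
        for value in row:
            bits.extend(1 if value == symbol else 0 for symbol in alphabet)
        encoded.append(bits)
    return encoded


# One-letter codes of the 20 naturally occurring amino acids (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Two hypothetical three-residue sequences, i.e. three variables per object.
sequences = ["ACQ", "AHQ"]
binary = to_binary_attributes(sequences, AMINO_ACIDS)

print(len(binary[0]))   # 3 variables x 20 values = 60 binary attributes
print(sum(binary[0]))   # exactly one attribute is set per variable
```

Each object then has exactly one "on" attribute per original variable, so a coincidence of k attributes in this encoding corresponds to a shared combination of values at k positions.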
This situation can be represented by a table in which each row represents an object and each column represents a variable, and in which each table entry $a_{ij}$ therefore represents the fact that the $i$th object has the value $a_{ij}$ for the $j$th variable. One can also write $c_j$ (for "column j") for the $j$th variable, and an attribute as $a@c_j$. For example, consider the small matrix of six rows (objects) and six columns (variables):

col1  col2  col3  col4  col5  col6
A     B     C     D     E     F
W     U     C     V     E     G
Z     L     C     M     W     M
V     U     C     V     A     G
A     B     C     D     Z     Z
W     L     C     M     E     Z

Object number 1 has the value 'A' for variable 1, 'B' for variable 2, 'C' for variable 3, and so on. For some applications, it may be useful to find that, for example, variables 2 and 4 are correlated. In the (small, fictitious) matrix of the previous example, this correlation seems plausible, because whenever an object has B@2, it also has D@4; whenever an object has L@2, it has M@4; and whenever an object has U@2, it also has V@4. Variable number 3 does not vary: each object has the attribute C@3, and therefore variable 3 does not correlate in an interesting way with any other variable.
Given a data matrix, we also assume that there is some underlying "true" probability distribution $q()$ which, for all orders $k = 1, 2, \ldots, N_A$, specifies the probabilities of each possible k-tuple of attributes. For example, for $k = 1$ we have $q(c_j): A_j \to [0, 1]$, and we might have, for some data set, q(B@2) = 0.33. The distribution also specifies higher-order probabilities, such as q(B@2, F@6) = 0.166. Inherent in the particular problems posed is the problem of estimating or approximating the distribution $q()$, or at least parts of it.
The problem is to find some or all of the k-tuples of columns $(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, for $k = 2, \ldots, N_A$, whose correlation is greater than some predetermined value. For example, one may want a procedure that, given an M-by-N table of values, returns a list of k-tuples of column indices $\{j_1, j_2, \ldots, j_k\}$ such that

$$D\Big(q(v_{j_1}, v_{j_2}, \ldots, v_{j_k}) \,\Big\|\, \prod_{i=1}^{k} q(v_{j_i})\Big) > p_k$$

for some real $p_k$. Here $D(p_1 \| p_2)$ is the Kullback divergence measure, which in this case estimates the difference between the observed distribution of values over the column variables and the distribution in which all the column variables are statistically independent. The Kullback measure is only one of many possible correlation or association measures applicable to this type of problem. For our purposes we consider correlation in terms of deviation from statistical independence. One can compare the observed number of occurrences of some event in the database against the number expected if an underlying hypothesis of independent variables were true. That is, the problem is: given the table of values, for all $k = 2, \ldots, N_A$, return the list of all k-tuples of attributes $\{(\alpha_{i_1}@c_{i_1}, \alpha_{i_2}@c_{i_2}, \ldots, \alpha_{i_k}@c_{i_k})\}$ such that

$$P\big(\mathrm{Observed}(\alpha_{i_1}@c_{i_1}, \alpha_{i_2}@c_{i_2}, \ldots, \alpha_{i_k}@c_{i_k}) \mid \mathrm{Independent}(c_{i_1}, c_{i_2}, \ldots, c_{i_k}), \mathrm{Model}\big) < \theta$$

for some observed behavior of $(\alpha_{i_1}@c_{i_1}, \alpha_{i_2}@c_{i_2}, \ldots, \alpha_{i_k}@c_{i_k})$, for some real-number threshold $\theta \in [0, 1]$, and some Model that underlies the estimation or hypothesis-testing method.
The sampling subprocess may be random sampling, and if random it may be subject to any number of possible probability distributions over the objects, including a uniform distribution. Similarly, there are restrictions on the statistical independence or dependencies between each of the T samples drawn during the operation of the method and between each of the r objects drawn within a sample.
Sample Advantages of Preferred Embodiments
There is at least one class of problems, arising in many different application areas, for which the comparative advantages of the coincidence detection method and apparatus described above, and further described below, are most obvious. Such problems are characterized by:
1. a large number of attributes (columns, in our representation);
2. the possible existence of some number of groups of highly mutually correlated attributes in the given set, each member attribute of each group being relatively uncorrelated with attributes outside its own group; and
3. lack of prior knowledge of the precise number, the width (k, as in k-ary correlation and kth-order feature), and the location of such attribute groups.
All other procedures of which we are aware either place a priori limits on the width of k-tuple that can be discovered, or implement an exhaustive search, serial or parallel, over all or almost all the possible k-tuples of attributes. To put it more simply, the method of the preferred embodiment takes approximately the same computation time and memory to find a 44-ary correlation as it takes to find a 2-ary correlation in the same very high-dimensional data set. Most previous methods, by comparison, either rule out discovery of the 44th-order feature or require orders of magnitude more time or space to find it. People engaged in modelling very large data sets have been frustrated in their attempts to compute a full higher-order probabilistic model, both because of the computational complexity of the task and because of the lack of data needed to support statistically significant estimates of most of the higher-order terms.
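The Kullback divergence criterion and the probabilities quoted for the six-by-six example matrix can be checked with a short Python sketch. It is an illustration under stated assumptions: the function name `kullback_divergence` is hypothetical, empirical frequencies in the table stand in for the unknown distribution q(), and base-2 logarithms are an arbitrary choice of units.

```python
from collections import Counter
from math import log2


def kullback_divergence(rows, cols):
    """Divergence between the empirical joint distribution of the chosen
    columns and the product of their empirical marginal distributions.
    `cols` holds zero-based column indices."""
    m = len(rows)
    joint = Counter(tuple(row[c] for c in cols) for row in rows)
    marginals = [Counter(row[c] for row in rows) for c in cols]
    total = 0.0
    for values, count in joint.items():
        p_joint = count / m
        p_indep = 1.0
        for marginal, value in zip(marginals, values):
            p_indep *= marginal[value] / m
        total += p_joint * log2(p_joint / p_indep)
    return total


# The example matrix: six objects (rows) by six variables (columns).
rows = ["ABCDEF", "WUCVEG", "ZLCMWM", "VUCVAG", "ABCDZZ", "WLCMEZ"]

# Empirical estimates of q(B@2) and q(B@2, F@6).
q_b2 = sum(row[1] == "B" for row in rows) / len(rows)
q_b2_f6 = sum(row[1] == "B" and row[5] == "F" for row in rows) / len(rows)

d_24 = kullback_divergence(rows, (1, 3))  # variables 2 and 4
d_23 = kullback_divergence(rows, (1, 2))  # variables 2 and 3 (3 is constant)

print(round(q_b2, 3), round(q_b2_f6, 3))
print(round(d_24, 3), round(d_23, 3))
```

Variables 2 and 4 determine each other in this table, so their divergence equals the entropy of either column, log2(3) bits, while the constant variable 3 gives a divergence of zero with any other variable. An exhaustive search would have to evaluate such a measure for every k-tuple of columns, which is the cost the sampling approach is intended to avoid.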
The preferred embodiment computes only a subset of the higher-order probabilities, and extracts a limited selection of higher-order features ("HOFs") for building a model of the database. Efficient use can be made of computational resources by pre-selecting sets of higher-order features using the correlation-detection methods described here, and building the most important ones (statistically and in terms of application-specific criteria) into model-based classifiers and predictors that use neural-network, statistical, rule-based or grammar-based methods. Pre-selected sets of HOFs can be used to create rules for such systems. For example, a data set can be analyzed using the methods set forth herein to determine that, if a company is filing a patent application, then an assignment from the inventor must be filed. This rule is then used in the systems to generate assignments whenever it has been determined that a company is filing a patent application. Many rule-based networks could benefit from pre-processing using the methods described here; see, for example, the System and Method for Building a Computer-Based Rete Pattern Matching Network by Grady et al. described in U.S. Patent Number 5,159,662 published October 27, 1992; the inference engine of Hingland et al. described in U.S. Patent Number 5,119,470 published June 2, 1992; and the Fast Method for a Bidirectional Inference by Masui et al. described in U.S. Patent Number 5,179,632 published on January 12, 1993. The discovered HOFs can alternatively be used directly to create products, for example, in the prediction or determination of protein structure when fed into existing methods based on distance geometry or on empirically estimated patterns of cooperativity and folding, or in marketing schemes based on sales information for correlated products. Next, the practice of the principles described herein using the Los Alamos HIV database is described.
In particular, the principles were applied to study the V3 loop of the human immunodeficiency virus (HIV) envelope proteins. In biochemistry and molecular biology in general, the covariation of particular residues of a protein can also indicate the existence of a structural motif that characterizes a region of the protein having a functional physiological role. The envelope proteins are partially embedded in the lipid membrane that surrounds a virus particle and project outward from the lipid. When the lipid of an HIV particle fuses with the membrane of a host cell during infection, the envelope proteins may also protrude from the membrane of the infected cell. The V in V3 stands for "variable", since the sequence of the V3 loop is highly variable among different isolates of the virus.
Previously, a Los Alamos group, in B.T.M. Korber, R.M. Farber, D.H. Wolpert and A.S. Lapedes, "Covariations in the V3 loop of HIV-1: An information-theoretic analysis", Proc. Natl. Acad. Sci. U.S.A. 90 (1993), the description of which is incorporated herein by reference, described 2-ary covariation of mutations at certain residues of the V3 loop of the HIV-1 envelope proteins. The practice of the present principles has confirmed some of the results of the Los Alamos group, and has also allowed the discovery of other groups of covarying residues. Whereas the Los Alamos group could only discover pairwise covariation, k-ary residue covariation is described here, where k > 2. That is, we have identified previously unrecognized motifs of the HIV envelope protein. For a particular test, the input consisted of the respective amino acid sequences of V3 regions from 657 different virus isolates, which are shown in Appendix "B". The source code used is shown in Appendices "A" and "D", named "File coinc.pl" and "File probsort.pl", respectively. The output is shown in Appendix "C". Referring to Tables C.1 through C.9, which are set out below, the results of 6 separate tests are shown. The parameter values are as indicated in the respective legends. In each Table, the results are ordered by statistical significance, with the most significant correlation first, and the standard one-letter amino acid code is used. Therefore, referring to Table C.6, the most significant coincidence observed is the occurrence of alanine (A) at residue 18, glutamine (Q) at residue 31, and histidine (H) at residue 33. This, like the other coincidences set out in the cited pages, represents the identification of a structural motif of the HIV-1 V3 loop that comprises those residues.
Continuing with the particular example of A18/Q31/H33, the V3 structural motif comprising those residues presumably exists on the outside of the virus particle, and that region of the V3 loop performs a specific function that requires the particular structural motif. The structural motif would therefore have to be conserved across mutations in order to preserve that function. This reasoning extends to the other coincidences identified herein. The identification of a particular conserved structural motif of HIV has several uses. Using techniques known in the art, a peptide presenting the motif can be produced for use as an antigen. Consequently, a vaccine could be prepared.
The peptide presenting the motif can be made using known recombinant methods as described generally, for example, in Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1982) and in Sambrook et al., Molecular Cloning: A Laboratory Manual (2nd Edition), Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1989). Alternatively, the peptide or a peptidomimetic can be synthesized chemically using standard chemical techniques. Monoclonal antibodies to the peptide or peptidomimetic could be generated using standard methods as described, for example, in Harlow, E. and Lane, D., Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1988). Fragments of such monoclonal antibodies, for example Fab fragments, which have specific affinity for the novel structural motif could also be generated.

In another embodiment, a ligand that interacts with a structural motif identified according to the invention could be generated. That is, the ligand would be characterized as having chemical moieties of suitable identity, spaced apart from each other so that the moieties interact with corresponding residues or portions of the motif. In some embodiments, the ligand can be an agent (for example, a drug) which, by binding to the motif, interferes with the function of the region. The ligand would therefore be an HIV antagonist with potential therapeutic utility. Alternatively, the ligand would bind to the particular V3 region comprising the identified motif, providing diagnostic utility. Such diagnostic utility can be ex vivo. A ligand with diagnostic utility (e.g., an antibody) should comprise a label, such as a fluorophore or an enzyme conjugate for use in a colorimetric reaction. Fluorescently labeled viruses or virus-infected cells can be visualized or counted using fluorescence microscopy or FACS (fluorescence-activated cell sorting).
Methods of designing and identifying ligands that bind to structural motifs identified in accordance with the invention are also provided. Thus, in one embodiment, the invention provides a ligand for binding to an envelope protein of the human immunodeficiency virus (HIV), wherein the envelope protein includes a structural motif comprising amino acid residues A18/Q31/H33. The ligand includes at least one functional group capable of binding to the motif. In a preferred embodiment, the ligand includes at least one functional group capable of binding to residue 18 and present at a position on said ligand effective for binding to residue 18, at least one functional group capable of binding to residue 31 and present at a position on said ligand effective for binding to residue 31, and at least one functional group capable of binding to residue 33 and present at a position on said ligand effective for binding to residue 33.

In another embodiment, the invention provides a method of designing a ligand to bind to a structural motif of a human immunodeficiency virus (HIV) envelope protein. The method includes providing a template including the spatial coordinates of A18, Q31 and H33 in the V3 loop of the HIV-1 envelope protein, and computationally evolving a chemical ligand using an effective algorithm with spatial constraints, so that the evolved ligand includes at least one effective functional group that binds to the motif. In a preferred embodiment, the ligand includes at least one functional group capable of binding to residue 18 and present at a position on said ligand effective for binding to residue 18, at least one functional group capable of binding to residue 31 and present at a position on said ligand effective for binding to residue 31, and at least one functional group capable of binding to residue 33 and present at a position on said ligand effective for binding to residue 33.
In another embodiment, the invention provides a method of identifying a ligand for binding to a structural motif of a human immunodeficiency virus (HIV) envelope protein. The method includes: providing a template including the spatial coordinates of A18, Q31 and H33 in the V3 loop of the HIV-1 envelope protein; providing a database containing the structure and orientation of molecules; and screening said molecules to determine whether they contain effective moieties spaced apart from one another so that the moieties interact with the motif. In a preferred embodiment, a first moiety of the molecule interacts with residue 18, a second moiety of the molecule interacts with residue 31 and a third moiety of the molecule interacts with residue 33. The principles described herein encompass respective similar embodiments, including antigens and vaccines, for the other covarying k-tuples described herein, i.e., the residues of the V3 loop that covary and the particular amino acids at certain covarying residues.

The method of the present invention can be viewed as a "high-pass filter" for the detection of higher-order features (HOFs). Such HOFs play an important role in database modeling, machine learning, and pattern recognition and perception. In the contexts of modeling and database model building, a procedure for discovering these features can serve in any of several main roles, including:
1.
Pre-processing of large complex data: Many of the best model training methods, including Gibbs models, hidden Markov models and EM, MacKay's density networks, and related factorial learning methods from the neural network community, can be helped significantly to capture higher-order interactions, without exhaustive search or combinatorial explosion of the parameter space, if they are preceded by a fast pre-processing procedure, such as one provided by an implementation of the principles described herein, which finds the variables that are likely to be correlated in the database.
2. Visual exploration of large complex data: If coupled to a simple graphical display interface, a procedure such as the present one allows a user to visualize quickly (with a small number of r-samples) the most interesting higher-order features likely to be present in high-dimensional data.
3. Pre-conditioning and elimination of redundancy: So far, the usefulness of finding correlations between attributes in order to use them in modeling has been emphasized; in many optimization, learning and data-fitting applications, however, correlations between variables must be found and eliminated, by any of a number of subspace methods such as principal component analysis (PCA).

An Embodiment Using a Programmable Digital Computer

Components of the Digital Computer Embodiment

Data Matrix, Sampling and Coincidences. Given a set of M objects, each of which has a value "Yes" (represented by 1) or "No" (represented by 0) for each of a fixed set of $N_A$ attributes, the input data set can be placed in an M-by-$N_A$ table of values, which can be called a data matrix or just a matrix. This matrix, as well as all its sub-matrices and related vectors that comprise functional parts of the system/process described below, is stored in memory locations within a programmable computer.
In this representation the rows of the matrix correspond to the objects and the columns correspond to the attributes. The matrix can be labeled $V$ and each element of this two-dimensional array labeled $v_{ij} \in \{0,1\}$, where $i$ refers to the $i$-th object (row) $o_i$ and $j$ refers to the $j$-th attribute (column) $a_j$. The set of objects can be listed, for purposes of this description, as $O = o_1, o_2, \ldots, o_M$ and the set of attributes as $A = a_1, a_2, \ldots, a_{N_A}$. Figure 5A illustrates these terms as they apply to the example illustrated in Figure 5, which is described in greater detail below in the description of the program method of a preferred embodiment. A particular attribute $a_j$ is said to occur in a particular object (row) $i$ if $v_{ij} = 1$.

Given an ordered list of $1 \le m \le M$ objects (rows) 5, an incidence vector 2 for an attribute $a_j$ can be defined as the binary vector or string of length $m$ whose $g$-th bit is 1 if and only if the attribute $a_j$ occurs in the $g$-th object in the given list of objects. The incidence vector 2 is a simple representation of the pattern of occurrence of the attribute over some set of objects, for example, the set of all M objects or the set of objects corresponding to an r-sample as described below.

An r-sample, for example the three rows identified by reference numeral 4 in Figure 5A, is a set of r of the M records drawn at random from some probability distribution. In some preferred embodiments, the rows within the sample are considered to be drawn independently from a uniform distribution. The drawing of a sample 4 is performed by the system once within each of the specified number of iterations. In some preferred embodiments, the samples drawn over the total number T of iterations are considered to be drawn independently from a uniform distribution.
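To make the terms above concrete, here is a minimal Python sketch of an r-sample and of the incidence vectors it induces. The function names `draw_r_sample` and `incidence_vector`, the toy matrix `V`, and the choice of sampling without replacement are illustrative assumptions, not taken from the appendices.

```python
import random

def draw_r_sample(num_rows, r, rng):
    """Draw r distinct row indices uniformly at random.
    Assumption: sampling without replacement; the text allows either."""
    return sorted(rng.sample(range(num_rows), r))

def incidence_vector(matrix, rows, j):
    """Bit string whose g-th character is '1' iff attribute j occurs
    in the g-th row of the given ordered list of rows."""
    return "".join("1" if matrix[i][j] else "0" for i in rows)

# Toy data matrix: 4 objects (rows) by 3 attributes (columns), illustrative only.
V = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

rows = [0, 1, 2]  # a fixed 3-sample, so that the example is reproducible
vectors = [incidence_vector(V, rows, j) for j in range(3)]
# vectors == ["110", "110", "011"]: attributes 0 and 1 share a pattern
```

Attributes 0 and 1 have the same incidence vector over this sample, which is exactly the situation the method looks for.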
In some preferred embodiments, different values of r are used for different sequential sampling iterations, and/or for different subsets of the data set processed by different processing nodes in a parallel computing embodiment. In such cases, it can be said that in the $i$-th iteration, or in the $i$-th sample, the number of objects sampled is $r_i$. Advantages of using different sample sizes include: the ability to test, within one run of the method, different values of r when one is not sure which values of r are best; and the ability to select different values of r for different processing nodes in a parallel computing embodiment, in order to make optimal use of the different processor sizes/speeds and memory sizes of the different processing nodes. An advantage of using the same single value of r throughout a run of the method is a slight gain in simplicity of the program code.

A coincidence set, or cset, can be defined as a pattern comprising the joint occurrence of $1 \le k \le N_A$ attributes (columns) 1 within some set of objects (rows) 5. That is, given one or more rows 5 under consideration, there is a cset $a_{j_1}, a_{j_2}, \ldots, a_{j_k}$ if $a_{j_1}$, $a_{j_2}$, ..., and $a_{j_k}$ all occur in the given row or rows. For example, the elements A@c1, B@c2, D@c4 identified by reference numeral 3 in Figure 5A are a coincidence set (cset).

Within the memory of the computer a data structure called the cset table is stored, which is a means of storing the identity and the occurrence count of each cset that occurs in one or more iterations within the process. The identity of a cset is the list of attributes (columns) comprising the cset; the occurrence count is a number corresponding to the number of occurrences of the cset that have been observed up to a particular iteration within the process, or at the end of all iterations. In some preferred embodiments, the cset table is implemented as a hash table stored in a computer memory.
A cset has, for a given r-sample, a particular incidence vector, which is its binary-encoded record of occurrences (denoted by '1') and non-occurrences (denoted by '0') over the r data items in the sample. Thus a cset, which corresponds to a set of k attributes, can have an associated incidence vector, and an individual attribute can have an associated incidence vector. A match (or coincidence) of size h is said to occur, in a given r-sample, for a cset $\alpha = (a_{i_1}, \ldots, a_{i_k})$ when $a_{i_1}$ appears in h of the r records, ..., $a_{i_k}$ appears in h of the r records, and these are exactly the same h of the r records (see Figure 5A).

Observed Coincidence Counts. Coincidences are observed, and the corresponding csets stored or updated, by means of a binning method. In each iteration, the attributes are binned, that is, placed into separate subsets according to their incidence vectors 2 over the r-sample 4 for the current iteration. In this described matrix-based embodiment of the invention, those vectors act as r-bit addresses into a very small subset of the address space of size $2^r$ (see Figures 5 and 5A). All attributes in a bin constitute a cset. The cset is recorded: if the particular cset has occurred in a previous iteration, its occurrence count is updated; if it has not previously occurred, an entry is created for it in the cset table and its occurrence count is then updated. In this described embodiment, the system stores the number $h$, $0 \le h \le r$, of occurrences for this cset in each iteration. After the specified number T of iterations has been completed, the cset table contains a list of all observed csets and, for each cset $\alpha$, a total number of observed matches, corresponding to $\sum_{i=1}^{T} h_i(\alpha)$, where $h_i(\alpha)$ is the number of joint occurrences of the k attributes comprising $\alpha$ in the $i$-th iteration.
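The binning step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Perl listing from the appendices: the names `find_coincidences` and `record` are hypothetical, the cset table is modeled as a plain dictionary, and attributes that occur in none of the sampled rows are skipped (an assumption, made so that the all-zero pattern does not form a bin).

```python
from collections import defaultdict

def find_coincidences(matrix, rows):
    """Bin attributes by their incidence vector over the sampled rows.
    Returns {cset: h}, where cset is a tuple of column indices and h is
    the number of sampled rows in which all of its attributes occur."""
    bins = defaultdict(list)
    for j in range(len(matrix[0])):
        vec = tuple(matrix[i][j] for i in rows)   # r-bit "address" of the bin
        if any(vec):                              # assumption: skip all-zero patterns
            bins[vec].append(j)
    # Bins holding a single attribute are ignored.
    return {tuple(attrs): sum(vec) for vec, attrs in bins.items() if len(attrs) > 1}

def record(cset_table, coincidences):
    """Create or update the running match total of each observed cset."""
    for cset, h in coincidences.items():
        cset_table[cset] = cset_table.get(cset, 0) + h

# Toy data: 4 objects by 4 attributes, illustrative values only.
V = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]
cset_table = {}
record(cset_table, find_coincidences(V, [0, 1, 2]))   # first 3-sample
record(cset_table, find_coincidences(V, [1, 2, 3]))   # second 3-sample
# cset_table == {(0, 1, 3): 2, (0, 3): 1}
```

Because each attribute is hashed once per iteration, the cost of an iteration is linear in the number of attributes, however large k turns out to be.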
Expected Count Function. An expected count function is a mathematical function, implemented as a computer program or subroutine, or in electronic or optical circuits, which takes a set of attributes $a_{j_1}, a_{j_2}, \ldots, a_{j_k}$ and a number T and produces a number corresponding to the expected number of matches for that set of attributes in a process of T iterations of drawing r-samples and observing coincidences. In a particular embodiment of the invention, the function $p_{\mathrm{match}}(\alpha, h, r)$ is obtained from the multinomial distribution:

$$p_{\mathrm{match}}(\alpha, h, r) = \frac{r!}{h!\,(r-h)!}\; p(a_{i_1}, \ldots, a_{i_k})^{h}\; p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})^{r-h}$$

This formula gives an estimate of the probability of finding exactly h occurrences of $a_{i_1}$, h occurrences of $a_{i_2}$, ..., and h occurrences of $a_{i_k}$, all occurring in the same h rows, in an r-sample. (This function definition has a simple form because all but two of the large number of factors $p(\cdot)$ in the standard multinomial expression vanish with zero exponents.) The probability of a match of size h for the k attributes that make up a potential cset has thus been defined in terms of the joint probability $p(a_{i_1}, \ldots, a_{i_k})$; the Expected Count Function can use particular estimates for those joint probabilities. In this preferred embodiment, the joint probability estimates incorporate the hypothesis of independence between the individual attributes. Therefore, in the formula given above, $\prod_{j=1}^{k} p(a_{i_j})$ is substituted for $p(a_{i_1}, \ldots, a_{i_k})$ and $\prod_{j=1}^{k} p(\bar{a}_{i_j})$ for $p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})$.

The Hypothesis Test Function and the Correlation Measure.
A hypothesis test is a mathematical procedure, implemented as a computer program or subroutine, or in special-purpose electronic and/or optical hardware, which takes a pair of numbers $H_{exp}$ and $H_{obs}$, representing the expected and observed numbers of matches, respectively, for a particular set of k attributes, and produces a number C representing an estimate of the correlation between the k attributes. In some preferred embodiments, a Chernoff bound on the tail probabilities provides the hypothesis testing function, as described below. Let the random variable $X_i$ hold the value $h_i$ for each iteration $i$, let $X = \sum_{i=1}^{T} X_i$, and note that $0 \le X \le T \cdot r$. The Chernoff-Hoeffding method [8] provides the following theorem: Let $X = X_1 + X_2 + \cdots + X_n$ be the sum of n independent random variables, where $l_i \le X_i \le u_i$ for real $l_i$ ("lower") and $u_i$ ("upper"). Then

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{n}(u_i - l_i)^2}\right) \qquad (1)$$

For our purposes we set $n = T$, and $l_i = 0$ and $u_i = r_i$ for all $i = 1, 2, \ldots, T$, and thereby obtain

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right)$$

Using this mathematical relationship, with $\delta = H_{obs} - H_{exp}$, one can define an effective procedure for calculating a correlation value:

$$\mathrm{Corr}(\alpha) = 1 - \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right)$$

In the special case where the same sample size r is used for each sampling iteration, that is, when $r_i = r$ for all $i = 1, 2, \ldots, T$, the formulas reduce to simpler forms:

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{T r^2}\right) \qquad (2a)$$

$$\mathrm{Corr}(\alpha) = 1 - \exp\left(\frac{-2\delta^2}{T r^2}\right)$$

Here the correlation value corresponds to an estimate of 1 minus the probability of having observed $H_{obs}$ coincidences, over T iterations of r-sampling, if the hypotheses underlying the expected count $H_{exp}$ were true.
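The expected count function and the Chernoff-Hoeffding correlation measure can be illustrated together with a short Python sketch for the constant-r case. The names `p_match`, `expected_count` and `corr` are hypothetical. Two points are assumptions rather than statements from the text: the expected total is taken to be T times the mean match size under `p_match`, and a non-positive difference between observed and expected counts is mapped to a correlation of zero because the bound is one-sided.

```python
import math

def p_match(marginals, h, r):
    """Probability that all attributes occur in exactly the same h of r rows,
    using the independence hypothesis for the joint probabilities."""
    p_all = math.prod(marginals)                    # all attributes present in a row
    p_none = math.prod(1.0 - p for p in marginals)  # all attributes absent from a row
    return math.comb(r, h) * p_all ** h * p_none ** (r - h)

def expected_count(marginals, r, T):
    """Assumption: H_exp = T * sum over h of h * p_match(h)."""
    return T * sum(h * p_match(marginals, h, r) for h in range(1, r + 1))

def corr(h_obs, h_exp, r, T):
    """Corr = 1 - exp(-2 * delta^2 / (T * r^2)) with delta = H_obs - H_exp."""
    delta = h_obs - h_exp
    if delta <= 0:          # assumption: no evidence of positive correlation
        return 0.0
    return 1.0 - math.exp(-2.0 * delta ** 2 / (T * r ** 2))

# Two attributes, each present in half of the rows; r = 1, T = 4.
h_exp = expected_count([0.5, 0.5], r=1, T=4)   # 4 * 0.25 = 1.0
c = corr(3, h_exp, r=1, T=4)                   # 1 - exp(-2), about 0.865
```

With larger T the denominator grows only linearly while delta grows linearly for truly dependent attributes, so the measure approaches 1 quickly as iterations accumulate.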
If the assumption of independence between attributes was used to calculate $H_{exp}$, as described above for some preferred embodiments, then this hypothesis test provides, for each cset, a correlation value that tests the independence hypothesis; that is, it estimates the statistical dependence between the attributes that make up the cset.

Operation of the Components Within a Process

Typically the representation component is executed first within the general process of the present invention. A plurality of sampling iterations is executed on the representation of the data, and for each r-sample the detection and recording of coincidences is executed. Sampling iterations can be executed sequentially or in parallel, or in some combination of sequential and parallel stages.

At any stage within the process, the determination of an expected count of coincidences is executed for some or all of the coincidence sets of attributes. This component of the process can be executed all at once for all coincidence sets, or incrementally; sequentially or in parallel, or in some combination. It can be executed for coincidence sets (csets) as each coincidence is detected or recorded, or it can be executed before or after such detection and recording.

After some number of sampling iterations has been executed, the comparison of the actual number with the expected number of coincidences can be executed for some or all of the recorded coincidence sets. This can be done for all csets at the same time, or for any subsets of them at different points throughout the process. The comparisons for different csets can be executed sequentially or in parallel, or in some combination thereof.
After some number of sampling iterations has been executed, the reporting of correlated attribute sets can be executed for some or all of the recorded coincidence sets that have been determined, in the comparisons, to signal significant correlations between their component attributes. This can be done for all csets at the same time, or for any subsets of them at different points throughout the process. The reports for different csets can be executed sequentially or in parallel, or in some combination thereof.

Description of the Program Method of a Preferred Embodiment

Below is pseudocode for a program, stored on any suitable medium, for example a floppy disk, hard drive, random access memory (RAM) or any other medium, corresponding to one possible embodiment on a programmable digital computer. Figure 5 provides an illustrative example of the application of that embodiment to a fictitious data set. Three iterations of r-sampling (for r = 3) on the small data set are illustrated from top to bottom. For each iteration, the left box represents the data set, with the underlined entries representing the sampled rows. The right box represents the set of bins into which the attributes collide. For example, in the first iteration A@1, B@2 and D@4 all occur in the first and second of the three sampled rows, so each has incidence vector 110 and they collide in the bin labeled by that binary address. Bins containing only a single attribute are ignored, and "empty" bins are never created. All bins are created and discarded within each iteration, although the collisions are recorded in the global data structure Csets.
The procedure to find sets of correlated attributes:

0. begin
1.   read(MATRIX);
2.   read(R, T);
3.   compute_first_order_marginals(MATRIX);
4.   csets := {};
5.   for iter = 1 to T do
6.     sampled_rows := rsample(R, MATRIX);
7.     attributes := get_attributes(sampled_rows);
8.     all_coincidences := find_all_coincidences(attributes);
9.     for coincidence in all_coincidences do
10.      if cset_already_exists(coincidence, csets)
11.        then update_cset(coincidence, csets);
12.        else add_new_cset(coincidence, csets);
13.      endif
14.    endfor
15.  endfor
16.  for cset in csets do
17.    expected := compute_expected_match_count(cset);
18.    observed := get_observed_match_count(cset);
19.    stats := update_stats(cset, hypoth_test(expected, observed));
20.  endfor
21.  print_final_stats(csets, stats);
22. end

Steps 5 to 21 of the pseudocode represent the steps of the basic method described herein, namely:
• sampling a subset of the matrix for a predetermined number of iterations, each subset being of the same size,
• detecting and recording counts of coincidences of attributes in each sampled subset, a coincidence being the occurrence of a plurality of attributes in an object in a sampled subset, where the plurality of attributes is the same for each occurrence,
• determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording,
• comparing, for each coincidence of interest, the observed count of coincidences with the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes of the coincidence, and
• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a predetermined threshold.

Appendix "B" contains the actual source code, written in the Perl language to run on a Sun4 computer under the UNIX operating system.
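For readers who prefer something executable, the pseudocode above can be approximated by the following self-contained Python sketch. It is not the Perl program of the appendices. The function name `find_correlated_csets`, the `threshold` and `seed` parameters, sampling without replacement, skipping attributes absent from every sampled row, and the form of the expected count (T times the mean match size under independence) are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def find_correlated_csets(matrix, r, T, threshold, seed=0):
    """Return (cset, observed, expected, corr) tuples with corr > threshold."""
    rng = random.Random(seed)
    M, N = len(matrix), len(matrix[0])
    # Step 3: first-order marginals p(a_j).
    marginals = [sum(row[j] for row in matrix) / M for j in range(N)]
    csets = defaultdict(int)                          # step 4: cset table
    for _ in range(T):                                # steps 5-15
        rows = rng.sample(range(M), r)                # step 6: draw an r-sample
        bins = defaultdict(list)                      # steps 7-8: bin by incidence vector
        for j in range(N):
            vec = tuple(matrix[i][j] for i in rows)
            if any(vec):                              # assumption: skip all-zero patterns
                bins[vec].append(j)
        for vec, attrs in bins.items():               # steps 9-14: record coincidences
            if len(attrs) > 1:
                csets[tuple(attrs)] += sum(vec)
    report = []
    for cset, observed in csets.items():              # steps 16-20
        p_all = math.prod(marginals[j] for j in cset)
        p_none = math.prod(1.0 - marginals[j] for j in cset)
        expected = T * sum(
            h * math.comb(r, h) * p_all ** h * p_none ** (r - h)
            for h in range(1, r + 1)
        )
        delta = observed - expected
        c = 1.0 - math.exp(-2.0 * delta ** 2 / (T * r ** 2)) if delta > 0 else 0.0
        if c > threshold:
            report.append((cset, observed, expected, c))
    return sorted(report, key=lambda t: -t[3])        # step 21: most significant first

# Toy data: columns 0 and 1 always agree, column 2 is their complement.
data = [[i % 2, i % 2, 1 - i % 2] for i in range(10)]
result = find_correlated_csets(data, r=3, T=50, threshold=0.9)
```

On this toy data the only bin that ever holds more than one attribute is the one shared by columns 0 and 1, so the report contains the single cset `(0, 1)`.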
The sample input data for the code listing in appendix "B" are listed in appendix "C", for partial amino acid sequences from the V3 loop of the HIV envelope proteins. The corresponding output of the code in appendix "B" for the input in appendix "C" is shown in appendix "D". In order to produce the output of appendix "D", the accompanying Perl language program listed in appendix "E" was used to sort and present the output of the main code listing in appendix "B". A general flow diagram for this embodiment is shown in Figure 6, while a general block diagram is shown in Figure 7. The resulting report was stored in a flat file as a relatively unstructured ASCII database, which was then printed; it could equally have been sent directly to a printer, or over a network for reporting elsewhere.

The descriptions of alternative embodiments of the present invention can be divided into two categories, described separately below: first, the different physical embodiments of the system/process as they can be used in many potential specific problem applications; and second, the different interpretations of the components listed in the previous description, according to different specific problem applications of the present invention.
I. Different Implementations

For example, among the many possible embodiments as programs on digital computers: the method can run strictly sequentially, as in the most direct interpretation of the pseudocode provided above, or the method can run on parallel (vector or multiprocessor) or distributed computer systems in many possible ways. A set of computations can run in parallel, in which each computation executes all the steps of the program outlined above, but with each separate computation using a different value for r, the sample size; or each separate computation could run the same program steps with the same values of the key parameters, but starting with different random number seeds for the random r-sampling. Alternatively, all the program steps outlined above could be run once, but with each different r-sample handed to a separate process running on a different processor, where each such process would comprise the detection and, optionally, recording steps, with the Csets counts subsequently merged in the global process into the global data structures.

Additionally, the calculation of the expected counts, and the comparisons of expected counts with observed counts, could be executed all at once or incrementally, sequentially or in parallel. Similarly, the reporting of the estimated correlation values can be executed for some or all of the Csets, once at the end of the computation or incrementally throughout it, in series or in parallel. The output of the method, which includes the report of significantly correlated k-tuples of attributes (the csets deemed highly correlated in the comparison, a.k.a. hypothesis testing, stage), can be verbal and/or numeric and/or graphic. A number of sampling schemes are possible, including deterministic, pseudo-random, or purely random.
If the sampling is pseudo-random or random, any random sampling scheme can be used, including hypergeometric and multinomial sampling. The r objects within an r-sample can be sampled "with replacement" or "without replacement". At the next higher level, the set of r-samples can be drawn "with replacement" or "without replacement". Different choices are possible for the key sampling parameter r, and it is not necessary to use the same number r for each sample.

Many possible choices exist for T, the number of sampling iterations. It is possible to use any of a number of mathematical methods to select T in order to achieve a predetermined confidence level in the estimated degrees of correlation for the k-tuples of attributes discovered by the method of the present invention. Alternatively, it is possible to run the procedure for a given fixed number of iterations and then print or display the results, or to interleave runs of a number of iterations with the printing or display of partial results.

There are many possible ways to represent, store and access the Csets data structure used during the processing of the algorithm. The Csets data can be stored and accessed by means of a hash table, a k-d tree, a Patricia tree (also called a trie), and/or in other ways, known to those skilled in the art, of storing and accessing data efficiently. Whatever data structure is selected, the structure can be physically stored in registers, in main memory, and/or on secondary or external storage media such as magnetic disks, magnetic tape or optical storage media.

As an alternative to embodiments of the method on general-purpose computing hardware of various types, there are many possible embodiments in special-purpose electronic, optical or electro-optical hardware, or in some combination of general-purpose and special-purpose architectures and devices.
For example, highly efficient special-purpose electronic components (LSI or VLSI) can be used to implement the matrix representation of the present invention, by virtue of the fact that the attribute incidence vectors are simple binary vectors; by virtue of the fact that the coincidence bins described above in one embodiment of the present invention correspond to "addresses" in a memory space of size $2^r$ for each r-sample; and by virtue of the ability, with current technology, to design, manufacture and use special-purpose hardware for random number generation and sampling, for fast-access storage of the Csets data structures, and for the mathematical functions used in calculating the expected count estimates, hypothesis tests and correlation estimates.

Description of the Special-Purpose Hardware Method of a Preferred Embodiment

1. Overview.
Referring now to Figure 14, a special-purpose hardware embodiment of the kind mentioned above is explained, which exploits the potential benefits of parallel execution of the algorithm. A node (defined below) divides a given data set along M (the number of rows of data) and distributes those portions to its CPs (also defined below). The CPs can be other nodes (in a recursive definition) or they can be special-purpose processors developed to execute step 8 of the method as described in the high-level pseudocode in the section above describing the program method of a preferred embodiment. When the results have been calculated by the node's CPs, the merge step (steps 9 to 14 in the pseudocode description noted above) is executed by the node. Once the merge has been performed, the results are passed to the node's parent. If the node is the root of the tree, the complete results are sent to the driver that controls this hardware.

The system described below can operate separately from a host computer's CPU; among other possibilities for the commercial marketing and use of such a system is its implementation on a special board or card that a user can purchase and install in his or her own personal computer or workstation. One can also envision the use of one or a number of such special subsystems in a local area network or a "supercomputer" installation. The described embodiment represents only one of many possible forms, as will be understood by those skilled in the art, of parallelizing the methods described herein.

The implementation described below is assumed to act only on character-valued data attributes. This is not a limitation of the basic methods described herein, only a property of this specific implementation of them. The implementation could easily accommodate a binary attribute encoding as described elsewhere herein. A diagram of a node with its computation processors (CPs) is shown in Figure 14.
The node includes the following:
• A memory bank in which the input to be sent to the CPs is stored (temporary input memory) and in which the results found by the CPs are stored (temporary output memory).
• A memory bus, divided into control, address and data buses, which is used to regulate communication over the bus itself and serves as the vehicle for data transfer.
• A set of bit flags and a small additional portion of memory (LastOut). LastOut is the address of the section of the temporary output memory that was written last. The two bit flags are used by the merge processor and the input/output processor to determine which state each is in.
• A size-J array of computation processors (CPs), each with its own local memory, which performs the discovery of the coincidences.
• A merge processor (MG), which has its own memory space into which it writes the merged results of the CPs.
• An input/output processor (IO), whose main responsibility is to control the use of the bus.
• A clock, used to ensure that each element in the system runs synchronously with respect to every other element. The execution of each of the parts of the system can be considered as running in lock step.

The computation processors are defined as either nodes themselves or special processors executing the r-sampling step of the algorithm (step 8 in the pseudocode description, shown graphically in Figure 5). This allows the possibility of a tree structure of such nodes, instead of limiting embodiments to a single vector array. For any particular choice of hardware for the memory bus, there may be a maximum useful limit on the number of CPs per node. A tree structure provides a way around this limit.

The implementation assumes maximum values of the method parameters R and N (Rmax and Nmax), which are specified by the implementer. It is the responsibility of the software driver to detect when those limits have been violated and to react accordingly.
2.
Memory Bank. Each node has memory of size 2 * J * Amax * Rmax * Nmax, where Amax is the maximum total number of iterations that can be performed at the node. This memory is divided equally between the temporary input and output memories. Note that the size of the input for a single iteration is not greater than J * Rmax * Nmax, and neither the locally produced results nor the final merged results (formed by combining the partial results from the J CPs) can exceed this limit, so there is no risk of exceeding the available memory. Access to this memory is as follows:
• IO has write access to the temporary input memory and read access to the temporary output memory.
• MG has no access to the temporary input memory and read access to the temporary output memory.
• CP has read access to the temporary input memory and write access to the temporary output memory.

3. Memory Bus. Control of the memory bus is the responsibility of the IO processor. Each device is assigned a numerical identifier (0 to J + 1, as IO is simply assigned zero and MG is assigned 1). The memory bus is divided into three sections:
• Control: Two wires for each CP, two for MG and two for IO make up the control bus. The first of each pair is called the request wire, while the second is known as the response wire.
• Address: Each device in the system is assigned a unique range of memory addresses. The address bus, used in combination with the data bus, determines which device the current value on the data bus will be written to and, if applicable, where within that device it will be stored. The width of the address bus (i.e. the number of wires in it) is determined by the sizes chosen for the input and output memory stores and is therefore not specified here.
• Data: Given the assumption that only character-valued data attributes will be handled by this system, the data bus is eight wires wide.
Arbitration of the bus is handled through the control bus.
When a device (which here means MG, IO or one of the CPs) wants to use the bus, it asserts a logical one on its request wire. In any given cycle more than one device may do so. IO, when it returns to its bus arbitration duties, simply sets the response wire of the lowest-numbered requesting device to 1 and zeros all the other response wires. This indicates to the lowest-numbered device that it has permission to use the bus (whether for reading or for writing is not indicated; IO is responsible for establishing this context) and that all others must wait. All devices that wish to use the bus continue to assert 1 on their request wires until they are granted permission. When the permitted device has finished with the bus, it asserts 0 on its request wire, indicating to IO that the bus can be assigned to another device. Handshaking protocols such as the one described above are well known and understood by those skilled in the art.

4. Bit Flags and Additional Memory

The additional memory is used by IO to store the index of the last output section written. There is no need to store a list of such sections for MG, since writing to the output buffer is done incrementally and MG can determine how many unprocessed sections are waiting by comparing its last-read index with the last-written index. Only IO can write to this memory and only MG can read from it. Two bit flags are used to indicate "IO finished" (meaning that IO has sent all data and received all CP outputs) and "merge finished".

5. Array of J Computing Processors

As noted above, these are either nodes or special-purpose processors that compute one R-sampling stage of the algorithmic description of the general method of the present invention. In the latter case each may comprise:

• A processor that executes coincidence detection in addition to the functions listed below.
• Local memory of size 2 * Nmax * Rmax.

A CP requests input by asserting 1 on its request wire. Once it observes its response wire set to one in a following cycle, it expects the current values of R and N to be sent, followed by the data themselves (otherwise it waits until this is the case). From the first two values it can determine when the current input is complete. It then asserts 0 on its request wire and executes the binning and coincidence detection steps of the method. When those steps have been completed, the CP again asserts logical 1 on its request wire, this time indicating that it wishes to send its results. When it is granted permission to use the bus, it sends its set of coincidences to IO. IO is responsible for managing the storage location of this data. The CP output stream comprises a count of the coincidences found, followed by the coincidences themselves (csets). Coincidences are of the form:

• Count (no greater than Rmax).
• Size (that is, the width of the cset, i.e. the number of component attributes).
• A list, of length Size, of the attributes of the coincidence, each in the form (value, position).

When all the data has been sent to IO, the CP asserts 1 on its request wire to request more data.

6. Merge Processor MG

The merge processor can comprise:

• A processor that performs the merge stage.
• Local memory of size Nmax * Rmax, used to store the output from one CP.
• Counters C1 and C2 (the first tracks the last output section read by MG; the second counts the number of coincidences currently stored in the merge buffer).
• Memory used to store the current value of A.
• A memory of size J * Nmax * Rmax * Amax, used to store the merged results (the merge buffer).

Initially, MG sets its counters to zero and its request wire to zero and waits for IO to signal (by setting this wire to 1) that there is output data to be processed. When MG sees its request wire asserted, it knows to begin reading the output data indexed by its counter into its local memory. Once this has been done, MG can start the merge algorithm.
The merge is performed from local memory directly into the merge buffer (C2 must hold the current number of coincidences when this stage is finished). When this stage is finished, MG retrieves the current value of LastOut. If this is greater than C1, MG knows to increment C1 and move directly on to the next output section. If C1 and LastOut are equal, MG sets its request wire to zero. If C1 has reached A * J, MG knows that all results have been computed and merged (and therefore that all CPs and IO are idle), that it must set its bit flag to 1 (indicating that it has finished), and that it must begin sending the contents of the merge buffer back to IO for transmission to this node's parent. The results are sent simply as the value of C2 followed by the list of coincidences stored in the merge buffer (the form of the coincidences is identical to that described in section 5 above).

7. Input/Output Processor IO

IO contains:

• A bit vector of size J.
• A counter C1 indicating the next available output buffer section.
• A counter C2 indicating the next unused R * N portion of the input.

IO is intended to control the execution of the algorithm as a whole, since it is responsible for the bus arbitration scheme outlined above. Initially, IO sets C1 and C2 to zero, zeros its bit vector (indicating that it has not sent data to any CP) and waits for the software driver to start sending data to it. During this time it knows that no work can be done and therefore zeros all permissions for the bus. An interrupt signals the start of data from the driver, and IO continues to zero all communication requests until all the data has been written to the input buffer. The input data is of the form: N, R, T (the total number of row sets of size R sent), followed by a data stream of size T * R * N. IO can therefore determine when no more data is to be expected. Note that it is the responsibility of the driver to:

• divide data mining requests into sizes no greater than R * N;
• ensure that the number of rows sent as input is evenly divisible by R;
• ensure that Rmax and Nmax are not exceeded by the current data set;
• merge all results sent from the device.

Once the entire input has been stored, IO sends R * N sized blocks of data to each CPi by first setting the i-th bit in the vector to one (this indicates that IO must wait for output from CPi), then notifying that CP by setting its response wire to 1 while the others are at zero, then sending the data on the bus, and finally incrementing C2. When all CPs are busy (or all available input has been exhausted), IO waits for a CP to assert 1 on its request wire, indicating that it is ready to send results. Once the signal has been received from a CP, IO retrieves the results from the CP, stores them in the output section indexed by counter C1, zeros the bit associated with that CP, increments C1 and asserts 1 on MG's request wire. If there is unused data in the input buffer, IO sends the next available R * N set to the CP that has just returned results (setting the bit for that CP to one). When C2 is equal to T and the bit vector contains no bits set to 1, IO knows that it has finished and sets the IO bit flag to 1. At this point IO returns to the wait state previously described until it sees that the MG bit flag is also set to 1 (indicating that MG has finished its work).

Once this happens, IO raises an interrupt (if this node is the root of the tree) or simply requests to send (if this node has another node as its parent), gives MG permission to write on the bus, and then passes all the data sent from MG to the parent.
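The way IO hands input blocks to the CPs can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the hardware protocol itself: the function name dispatch, the per-block cycle costs and the event queue are hypothetical, bus transfers are treated as instantaneous, and ties between CPs finishing in the same cycle are resolved in favour of the lower CP index.

```python
import heapq


def dispatch(num_cps, block_costs):
    """Simulate IO distributing T input blocks of size R * N to J CPs.

    block_costs[k] is a hypothetical number of cycles needed to process block k.
    Returns (block_index, cp_index) pairs in the order results reach the output buffer.
    """
    total_blocks = len(block_costs)   # T
    busy = [0] * num_cps              # bit vector: 1 means IO awaits output from that CP
    c1 = 0                            # next available output section
    c2 = 0                            # next unused input block
    running = []                      # heap of (finish_cycle, cp_index, block_index)
    output_sections = []

    # Initial distribution: one block per CP while input remains.
    for cp in range(num_cps):
        if c2 == total_blocks:
            break
        busy[cp] = 1
        heapq.heappush(running, (block_costs[c2], cp, c2))
        c2 += 1

    # Each time a CP reports results, store them and reuse that CP at once.
    while running:
        now, cp, block = heapq.heappop(running)
        output_sections.append((block, cp))
        busy[cp] = 0
        c1 += 1
        if c2 < total_blocks:
            busy[cp] = 1
            heapq.heappush(running, (now + block_costs[c2], cp, c2))
            c2 += 1

    # Termination condition from the text: C2 equals T and no bits remain set.
    assert c2 == total_blocks and not any(busy) and c1 == total_blocks
    return output_sections


print(dispatch(2, [5, 1, 1, 1]))
```

In this run the second CP processes three short blocks while the first is still occupied with one long block, so CPs with unequal execution times are kept busy without any fixed assignment of blocks to processors.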
Note that the proposed scheme allows for unequal execution times among the CPs: the next CP to obtain data is the one that has most recently finished with its last allotment of data. Therefore, although the overall operation of the system is synchronized, there is a degree of asynchronous processing capability. The choice of particular processors, buses and other components is left to the discretion of the designers, manufacturers, fabricators, vendors, buyers and users, and the ranges of options are known to those skilled in the art. In particular, all parts of the above-described embodiment can be obtained from off-the-shelf sources, or they can be specially designed at the VLSI level by persons skilled in the art.

Various General Applications

Special-purpose embodiments are also possible. For example, in an application for marketing and analysis of sales/transaction data, the objects input to the methods of the present invention may correspond to transactions and the attributes to instances of sale of particular products or services. In an application for process management, industrial engineering or computer systems management, the objects can correspond to particular time segments or periods and the attributes to the on/off or used/unused state of particular components, resources or subsystems. The objective of the application could be to find k-ary conflicts or conflicting demands among interacting subsystems or users, in order to improve efficiency or reduce operating costs. For example, the methods can be linked to the control of a process for production of a product, as shown in the general flow diagram of Figure 8 and the schematic diagram of Figure 9. This example may represent an automated sheet-metal assembly plant.
The methods could be applied to an existing data set in order to discover a correlation indicating that demand for one of the plant's products decreases significantly in the summer months due to cyclical variations, while demand for another of its products increases. A link to the automated process control systems in the plant could then reduce orders for the first product while increasing orders for the other. Many other examples will be apparent to those skilled in the art, including variations to the actual structure of the products as a result of the discovered correlations. In an alternative embodiment, discovered correlations can be used to generate rules for a rule-based system that in turn produces products based on those rules. A general flow diagram for such an embodiment is set forth in Figure 10. A corresponding schematic diagram is set forth in Figure 11. In a further alternative embodiment, the rule-based system could be used to control a process that creates products. A general flow diagram for such an embodiment is set forth in Figure 12. A corresponding schematic diagram is set forth in Figure 13.

In an application for financial analysis or trading, the objects may correspond to particular time segments or periods, and the variables may relate to particular prices, or price changes, of particular financial instruments or facilities. By dividing the prices of each instrument or facility into a set of discrete levels, or by using a simple binary code for "increase versus decrease", each instrument or facility can be represented by a set of attributes, and the invention can be used to discover k-tuples of instruments or facilities whose price movements are correlated. Those skilled in the art know many ways of obtaining value from such discovered information.
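The binary "increase versus decrease" coding mentioned above can be sketched as follows. The function names and the sample prices are hypothetical, and treating an unchanged price as a decrease is an assumption made only to keep the code binary.

```python
def up_down_code(prices):
    """Code each period-to-period move as 'U' (increase) or 'D' (decrease or unchanged)."""
    return ["U" if later > earlier else "D"
            for earlier, later in zip(prices, prices[1:])]


def build_objects(price_series):
    """Turn per-instrument price lists into one object (row) per time period.

    price_series maps an instrument name to a list of prices of equal length.
    Each returned row maps instrument name to its 'U'/'D' symbol for that period.
    """
    lengths = {len(prices) for prices in price_series.values()}
    if len(lengths) != 1:
        raise ValueError("all price series must have the same length")
    coded = {name: up_down_code(prices) for name, prices in price_series.items()}
    periods = lengths.pop() - 1
    return [{name: coded[name][t] for name in coded} for t in range(periods)]


rows = build_objects({"A": [1.0, 2.0, 3.0], "B": [5.0, 4.0, 6.0]})
print(rows)
```

Each row is then an object whose attributes are (instrument, symbol) pairs, which is the matrix form the coincidence detection method operates on.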
In applications for medicine, epidemiology or environmental science, the objects may correspond to individual patients, to different time-stamped observations of an individual patient, or to samples from the same or different environmental resources (such as air, soil or water); the variables and derived attributes would correspond to levels, or the presence/absence, of particular symptoms, drugs, toxins or contaminants. In this way, one can use the present invention to discover interactions that can cause disease or environmental risks.

In applications for structural biology, the objects may correspond to DNA, RNA or protein sequences and/or structures. The attributes may correspond to the presence of particular bases or amino acids at particular sequence positions, or to substructures with particular geometric, physical, chemical or biological properties at particular sequence or structural positions, or to the presence, absence or levels of other global or local properties. For example, a detailed application of the method to the prediction of protein structure is set forth below, examples of which have previously been described.
In pharmacological applications, the objects may correspond to molecular structures or other labeled representations of particular compounds or drugs, and the attributes may correspond to the presence, absence or levels of particular geometric, chemical, biological, toxicological and/or other properties and characteristics, for example particular chemical moieties. The present method would be used to find correlations between k-tuples of such properties, and this information may be useful in the design and testing of compounds and drugs, in the design of combinatorial libraries for screening and testing, or in other processes and stages of drug discovery and drug design. Alternatively, the previous mapping can be transposed, so that the objects correspond to the properties and characteristics and the attributes correspond to the compounds and drugs. In this way, the present invention can be used to find sets of drugs with similar, complementary, synergistic or antagonistic activities. This, too, is extremely useful in drug discovery and design.

In applications for demographic classification, marketing, insurance and credit, and/or collections, the objects may correspond to individuals, companies or organizations. The attributes could correspond to the presence, absence or levels of properties and characteristics related to employment, income, welfare, credit history, lifestyle, consumption patterns, opinions or social/political affiliations. The present method could be used to discover associations between such factors, which may be useful in tasks such as forecasting credit or insurance risks, detecting fraud, or determining the best targets for the allocation of limited marketing or fund-raising resources, for example.
The problem of searching for all significant correlations among pairs or k-tuples of attributes in a database is ubiquitous in computer science and in industrial, financial and medical applications. The principles described here include a probabilistic algorithm that has the interesting property of finding significant higher-order k-ary correlations, for all k such that 2 < k < N in an N-attribute database, for the same computational cost as searching only for significant pairwise correlations. In addition, k does not need to be fixed in advance in our procedure, in contrast to other known procedures. The procedure was designed for the task of searching for conserved structural relationships in aligned protein sequences, although it may have useful application in other domains.

Application of the Principles Described Herein to Protein Sequence Analysis

Interactions between sequence-distant amino acid residues in the protein chain, sometimes detectable as correlations between positions (columns) in a set of aligned sequences from a structural protein family, play an important role in the determination of structure and function. The discovered correlations may represent an evolutionary history of compensatory mutations and may provide useful features in structural/functional models of protein families, although they are ignored or underestimated by most ML (machine learning) classification methods, due in part to the high computational complexity of the search for k-tuples of correlated positions.

In order to practice the invention on a matrix of biological sequences, such as nucleotide or amino acid sequences, the different sequences are first optimally aligned for comparison purposes. A position in a first sequence is compared with a corresponding position in a second sequence. When the positions are occupied by the same nucleotide or amino acid, as the case may be, the two sequences are identical at that position.
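The position-by-position comparison of two aligned sequences can be sketched as follows; the function also reports the share of compared positions that are identical, expressed as a percentage. The function name is hypothetical, and skipping positions where either sequence has a gap character "-" is an assumption, since conventions for gaps differ between tools.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of compared positions at which two aligned sequences are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped positions are not counted as compared
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no positions to compare")
    return 100.0 * identical / compared


print(round(percent_identity("GATTACA", "GACTATA"), 2))
```

For the two seven-letter sequences shown, five of the seven positions agree.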
The degree of identity between two such sequences is often expressed as a percentage representing the ratio of the number of matching (identical) positions in the two sequences to the total number of positions compared. Optimally aligning two or more sequences generally involves maximizing the degree of sequence identity between them. Various algorithms and computer programs for aligning sequences are known to those of ordinary skill in the art. These tools include the PILEUP program of the Genetics Computer Group (Madison, WI) package (version 8), which uses a modified version of the progressive alignment method of Feng and Doolittle [J. Mol. Evol. 25, 351 (1987)]; CLUSTAL X, public domain software available from the European Molecular Biology Laboratory (EMBL), Heidelberg, Germany; and BLAST, public domain software available from the National Institutes of Health (NIH), Bethesda, MD. BLAST-P is used for amino acid sequences, BLAST-N for nucleotide sequences, and BLAST-X for nucleic acid/amino acid codon translation.

Several types of useful information can be obtained from the analysis of protein sequence families.
First, there is information to be extracted at the level of individual sequences, in the form of joint symbol frequencies. It is well known that an abnormally high observed frequency of a particular individual positional pattern (e.g. "G occurs at residue number 3 in 98% of these sequences") may reveal an important physical or chemical constraint on the secondary or tertiary structure. The same is true of surprisingly frequent joint symbol occurrences (e.g. "G in position 3, L in position 5 and M in position 87 occur together much more frequently than would be predicted from the individual marginal frequencies"). Such long-distance co-occurrences may be especially indicative of tertiary constraints, because the designated positions may be close to each other in the three-dimensional structure to which the patterned sequences correspond. (This detection of "suspicious coincidences", where p(A, B) >> p(A) p(B), is central to pattern recognition and learning, as observed long ago by others.)

Second, there is information to be extracted at the next higher level, that of statistical relationships between positions (columns in an alignment of homologous sequences). If the existence of frequently occurring joint symbol k-tuples can be used to infer three-dimensional structural interactions, such inference is even better supported by certain information-theoretic relationships between positions (columns) over a set of many different joint symbol occurrences. This is because such symbolic relationships can signify evolutionarily conserved physical or structural relationships between different parts of the protein chain. (See Figure 15.) The observation of high values of mutual information and other measures of correlation between columns has been used successfully to predict three-dimensional structural interactions in RNA and in HIV proteins; see, for example, C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, The University of Illinois Press, 1964. While these previously reported efforts have focused on pairwise residue-residue interactions, the principles described here are directed to the detection of k-ary interactions for 2 < k < N.

The discovered k-tuples of correlated amino acid residues can be used in protein structure prediction and structure determination. Local predictions can help narrow the search for the best global structure predictions.

First, there are distance geometry constraints. The prediction of secondary structure and the discovery of long-distance k-ary interactions give evidence of putative contacts, of the form contact(i, j) for the i-th and j-th amino acid residues in a protein. Using the kind of distance geometry theory developed by others (see, for example, T.F. Havel, L.D. Kuntz, G.M. Crippen, The Theory and Practice of Distance Geometry, Bull. of Mathematical Biology v.45, 1983, pp. 665-720, and K.A. Dill, K.M. Feibig, H.S. Chan, Cooperativity in Protein-Folding Kinetics, Proc. Natl. Acad. Sci. USA v.90, March 1993, pp. 1942-1946), one can derive an inferred set of contacts. Inferred blocking sets can also be derived, that is, contacts that are prohibited by a given set of assumed or inferred contacts. Essentially, given a model of a polymer chain constrained to exist within a fixed volume, the assumption that two particular pieces come into contact implies that some other pieces are in proximity and that still others are kept apart. In fact, others have concluded that considerable amounts of internal architecture (helices and parallel and antiparallel sheets) are predicted to arise in compact polymers simply because of steric constraints. This appears to account for why there is so much internal organization in globular proteins.
Second, as described throughout the previous sections, one can infer and exploit the empirical relationships between local and global configurations. Local sequence stretches, or selected non-local pairs of residues, may be found to occur, with some high probability, in particular global configurations. Heuristic rules of any form can be used to prune large parts of conformation space. The inference of particular models of cooperativity in folding is a special case. Knowledge of "rules" such as p(contact(i, j) | contact(i + 1, j - 1)) > p(contact(i, j)) can help significantly. For example, Figure 16 illustrates stages in tertiary structure prediction. The methods described throughout this application can be applied as part of a larger tertiary structure prediction system, in which the principles described above are used in the block concerned with the analysis of aligned sequence families. The system predicts the structure of a protein.

Discovery of Evolutionarily Conserved Structural Constraints

Three questions are addressed in this section:

1. What structurally or functionally significant multiple-residue constraints are evolutionarily conserved, such that they can be expected to be found by detecting correlations between columns in a multiple sequence alignment?
2. Do correlation detection efforts in fact find significant structural or functional constraints?
3. How much information do such discoveries provide toward the prediction or determination of a native tertiary molecular structure?

What is expected to be observed?

A protein family is a set of amino acid sequences that are considered to share a common global tertiary structure. The theory and observation of protein folding and evolution support the following general picture of evolution and conservation within a protein family:

• Functional constraints are conserved in surface residues;
• Structural constraints are conserved in core residues;
• Neutral drift dominates in loop residues.

Functional constraints frequently involve other molecules, such as other proteins, nucleic acids, lipids, metals, oxygen or other small molecules. The structural constraints that are expected to be conserved through the evolution of a protein family are mainly those that involve a few key residues that stabilize a conformation. Where electrostatic interactions are considered important, one can expect to find conservation of net charge across two or more sequence positions. When one of two residues that interact electrostatically carries a positive charge, its "partner" residue (presumably close in the three-dimensional structure even if distant in sequence) must be negatively charged, and vice versa.

The situation is similar for packing constraints. One should reasonably expect that the volume of sections of the protein core varies only slightly across the many different proteins in the same structural family, while non-core regions may exhibit great variability in volume. Therefore one should expect to find pairs or small k-tuples of residues that exhibit mutually compensatory mutations with respect to side-chain volume: when one residue mutates from "Large" to "Small", another mutates from "Small" to "Large", to put it most simply.
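Compensatory mutation of the kind described above shows up as statistical dependence between two alignment columns. One common way to quantify such dependence is the mutual information between the columns; the sketch below is a plain plug-in estimate with hypothetical function names and a toy alignment, and it is not the coincidence detection procedure of the invention.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return the symbols found at one position across all aligned sequences."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_x, col_y):
    """Plug-in estimate of mutual information (in bits) between two columns."""
    if len(col_x) != len(col_y) or not col_x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    total = 0.0
    for (x, y), joint in count_xy.items():
        # (joint / n) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        total += (joint / n) * log2(joint * n / (count_x[x] * count_y[y]))
    return total


# Toy alignment: a positive residue (K) at position 0 is always paired with a
# negative residue (D) at position 2, and vice versa; position 1 never varies.
toy_alignment = ["KAD", "KAD", "DAK", "DAK"]
print(mutual_information(column(toy_alignment, 0), column(toy_alignment, 2)))
print(mutual_information(column(toy_alignment, 0), column(toy_alignment, 1)))
```

In the toy alignment the charge-swapping pair of columns carries one full bit of mutual information, while pairing either of them with the invariant column carries none.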
What has been observed?

Neher et al. (How frequent are correlated changes in families of protein sequences?, PNAS, 91: 98-102, 1994) attempted to quantify the frequency of compensatory changes within a single protein family by using physical and chemical property indices for amino acids and then estimating Pearson correlations between columns in an alignment. They tried to avoid the small-data-set problem with a bootstrap-like resampling scheme based on the examination of pairs of sequences from the family. Their study of the myoglobin family of protein sequences found the degree of compensatory mutation to be low for the side-chain volume property but high for electric charge, close to the correlation level expected for perfect conservation of local charge. The authors speculate that, because their column-pair analyses focused heavily on neighboring residue contact pairs, they were able to detect a very locally acting constraint such as charge conservation but not a more distributed constraint such as conservation of volume. (In other words, an individual positively charged residue must be in contact with its individual negatively charged structural partner, whereas a set of volume-compatible partners can comprise more than two residues, which need not be in contact.)

Others have also found some evidence of coordinated mutation in the evolution of protein structural families. Whereas most studies of compensatory mutation to date focus on highly conserved "core" type regions of protein structures, Korber et al. (Covariation of mutations in the V3 loop of HIV-1: An information-theoretic analysis, Proc. Natl. Acad. Sci., 90, 1993) analyzed the highly variable V3 loop of the HIV-1 envelope protein. The researchers computed bootstrapped estimates of pairwise mutual information for all column pairs from the set of 31 columns representing the V3 residues.
They found a set of approximately 7 pairs that showed considerable and statistically significant mutual information, and their analysis of particular attributes (amino acids) suggested a particular pattern of likely compensatory mutations. Although the authors do not argue or provide evidence for any particular properties or relationships that have been conserved, subsequent mutational analysis in laboratory experiments indicated a functional link between some of the pairs of sites with high mutual information. Because the V3 region is known to be functionally and immunologically important, the inventor of the present application suggested that such analyses should be important in the search for HIV/AIDS vaccine designs.

What Type of Method is Necessary?

Clearly, several well-studied and effective methodologies exist for comprehensive modeling of protein sequence families. In each case, the mathematical machinery is in place to handle and detect the very local and lower-order statistical structure in the data. In each case, difficulties with computational complexity and statistical estimation arise in the attempt to account comprehensively for all possible non-local and higher-order interactions between the residues, i.e. the columns in the aligned sequence data. The most straightforward progress in modeling can be made if HMMs or density networks are used in conjunction with a fast heuristic preprocessor that focuses on the explicit detection of plausible non-local interactions while sacrificing a degree of precision in the modeling of those interactions. Such a procedure is provided by the principles described herein.

a) HIV PROTEIN SEQUENCE ANALYSIS

Tests on an HIV Protein Database

The Los Alamos HIV database contains, among other things, the amino acid sequences for the V3 loop region of the HIV envelope proteins.
This region is known to have functional and immunological significance, and the discovery of sets of sites linked by evolutionary covariation may have important implications for the understanding and prevention of HIV infection and reproduction. A simpler and smaller version of the same database was used by the Los Alamos scientists in their analysis of pairwise mutual information between residues (columns). Experiments were run on an HIV data set with the coincidence detection procedure, over a set of different values for r and T. The tables of results are shown and described below.

Results of HIV Protein Database Experiments

The aforementioned version of the HIV V3 data set was edited to focus on the thirty-three residues considered most conserved, and structurally and functionally most important, by the Los Alamos researchers. The data set therefore consists of M = 657 rows (sequences) of N = 33 columns (residues). For the coincidence detection procedure these 33 columns were transformed into N_A = N · |A| = 33 · 21 = 693 attributes. As with the artificial data sets, a set of experiments with different values of T and r was executed. Coincidence detection was performed with T = 10,000 and r = 5, 6, 7 and 10 respectively, then with T = 100,000 and r = 7, and finally with T = 750,000 and r = 7. The results are shown in Tables C1 to C9 below.

Table C1: The most probably correlated attributes, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 10,000 and r = 5. HIV data set, T = 10,000, r = 5.
Rank  CSET                   Observed  Expected     Probability
1     ßI7 | 24               1012      632.553864   0.316056
2     R17 \ T1 \             901       610.770465   0.509734
3     R \ 2 \ Q \ 1          570       348.605833   0.675621
4     LU \ W \ 9 \ Q24       195       5.535741     0.750381
5     N4 | £ 9 | -421        226       74.167398    0.831582
6     pi |? I2 | ri8         159       20.764346    0.858239
7     ? 12 | 718             454       318.517747   0.863429
8     Z.13 | A31             419       300.333903   0.893461

Table C2: The most probably correlated attributes, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 10,000 and r = 6. HIV data set, T = 10,000, r = 6.
Rank  CSET                   Observed  Expected     Probability
1     ßl7 | Z) 24            1177      385.853329   0.030891
2     R17 | 721              957       368.736702   0.146238
3     H12I-418               1047      577.583832   0.294000
4     -S10 | £ > 24          859       424.457490   0.350274
5     ? 12jßl7               656       224.743830   0.355855
6     i? 12 | 718            628       283.191527   0.516585
7     R17 | £ 24             563       234.477161   0.549033
8     H12J? 17               760       434.274580   0.554644
9     ¿18 | 721              560       315.973734   0.718330
10    7111-R17               861       627.014684   0.737741
11    L13 \ W \ 9 \ Q24      230       5.365202     0.755529
12    A2l \ D24              619       405.487239   0.776262
13    N4 | * T9 | -421       237       25.176801    0.779367
14    pi | ? i2 | p8         220       15.841474    0.793296
15    1131 ^ 31              462       267.211446   0.809942
16    GIO | H12              324       157.554658   0.857348
17    M13 | Í? 5             245       84.760597    0.867059
18    Q \ 7 \ K3 l           384       231.749746   0.879169
19    m2 \ Rp \ A \ s        147       8.219536     0.898526
20    N4 \ K9 \ H33          309       170.353419   0.898711

Table C3: The most probably correlated attributes, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 10,000 and r = 7. HIV data set, T = 10,000, r = 7.

Rank  CSET                   Observed  Expected     Probability
1     Q17J 24                1312      228.829775   0.008322
2     N4 | * 9               2023      996.505631   0.013558
3     H12I-418               1175      328.263693   0.053591
4     ? 17 | 721             940       216.431391   0.118015
5     ß31 | H33              3198      2481.050915  0.122699
6     ? 12 | 718             879       244.789294   0.193645
7     S10 24                 836       232.201517   0.225812
8     J? 12 | ßl7            720       140.866087   0.254370
9     /111? 17               808       360.719364   0.441944
10    H12 |? 17              659       253.717115   0.511491
11    R? \ A \ IZ            720       361.819054   0.592356
12    A2l \ D24              554       236.085429   0.661974
13    R \ 7 \ E24            452       138.843412   0.670137
14    L13 | iO l             537       231.137972   0.682602
15    L131 ?? 9 | ß24        292       5.055474     0.714573
16    -418J721               442       165.231990   0.731502
17    18 | ß31lH33           480       209.122778   0.741198
18    M13 | Í? 5             355       88.975694    0.749122
19    N4 | K9 | H33          340       75.556215    0.751690
20    pi | i? 12             513       253.001684   0.758878

Table C4: The most probably correlated attributes, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 10,000 and r = 10. HIV data set, T = 10,000, r = 10.
Rank  CSET                 Observed  Expected      Probability
1     Q3 \\ H33            3933      883.532458    0.000000
2     N4 \ K9              2898      251.248235    0.000001
3     S10 | 19             2245      907.769718    0.027977
4     19 | G23             2660      1588.173503   0.100497
5     ? 12IH8              1155      142.229768    0.128554
6     K9 \ IU              1230      311.653160    0.185125
7     AZ \ H33             1720      990.576490    0.345032
8     K9 \ H33             1125      405.874883    0.355482
9     H12] / 118           732       54.213558     0.399002
10    S10 | G23            1492      856.152048    0.445479
11    N4 | tf33            1257      689.784961    0.525468
12    ? 18 | Q31           1188      636.901303    0.544755
13    7 | £ > 24           571       42.938312     0.572525
14    K11 | K12            670       143.659674    0.574607
15    7111? 17             562       61.788305     0.606274
16    N4 \ R17             992       498.586806    0.614520
17    R \ 2 \ Q \ 1        484       31.204991     0.663619
18    K3 \\ Y33            578       130.131866    0.669535
19    -R1717-21            479       39.372545     0.679400
20    -S10 | > 24          451       34.199456     0.706491

Table C5: The thirty most probably correlated attribute sets, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 100,000 and r = 7.
HIV data set, T = 100,000, r = 7.
Rank  CSET                 Observed  Expected      Probability
1     H12 \ A I &          11686     3282.636926   0.000000
2     N4 | * 9             21853     9965.056308   0.000000
3     Q \ 7 \ D24          11585     2288.297747   0.000000
4     031JH33              31715     24810.509148  0.000000
5     ? 17 | 721           9355      2164.313906   0.000000
6     ? 12 | ßl7           7259      1408.660868   0.000001
7     ? 12 | 718           8380      2447.892936   0.000001
8     S10 \ D24            7666      2322.015166   0.000009
9     711I-R17             8336      3607.193645   0.000109
10    A2l \ D24            6342      2360.854285   0.001550
11    H12J? 17             6363      2537.171146   0.002543
12    ? 171-418            7162      3618.190543   0.005941
13    R \ 7 \ E24          4451      1388.434119   0.021747
14    A \ Z \ T2 \         4673      1652.319901   0.024130
15    pi |? 12             5486      2530.016841   0.028256
16    L \ 3 \ K3 \         5224      2311.379719   0.031348
17    N4 \ K9 \ H3         3519      755.562151    0.044291
18    í8 | ß31 | H33       4665      2091.227775   0.066951
19    L13 \ W 9 \ Q24      2585      50.554739     0.072672
20    R \ 7 \ Q3 \         5967      3574.032278   0.096592
21    M13 \ W \ 5          3204      889.756945    0.112364
22    Fll ] i? 12 | p8     2424      117.500168    0.114017
23    N41-421              6209      4030.321314   0.144077
24    K31 \ Y33            4878      2773.817984   0.164117
25    Q \ 7 \ K3 l         3440      1450.098718   0.198651
26    91-421               5614      3692.671816   0.221632
27    P19 | D24            3998      2250.071839   0.287354
28    Q \ 7 \ A2 \         4151      2414.536189   0.292077
29    G101H12              2661      953.572593    0.304245
30    H1 2 | £ 24          3018      1458.576938   0.370622

Table C6: The first twenty-five of the fifty most probably correlated attribute sets, as estimated by the coincidence detection procedure, for the HIV data set. These results were produced with parameter settings T = 750,000 and r = 7. Note the appearance, at this level of sampling, of several statistically significant higher-order features with k > 3.
HIV data set, T = 750,000, r = 7.
Rank  CSET                 Observed  Expected      Probability
0     -4l8 | ß3 I | fí33   36019     15684.208314  0.000000
1     -418 | 721           33816     12392.399254  0.000000
2     A21 \ D24            45549     17706.407140  0.000000
3     # 121-418            86025     24619.776947  0.000000
4     H12 \ R17            48257     19028.783592  0.000000
5     IU \ R17             64548     27053.952336  0.000000
6     I13 | -O 1           39382     17335.347894  0.000000
7     L13 | # 19 | ß24     20184     379.160544    0.000000
8     M13 | * H 5          23300     6673.177086   0.000000
9     N4 \ K9              162152    74737.922307  0.000000
10    N4 \ K9 \ H33        26376     5666.716129   0.000000
11    Q \ 7 \ D24          86891     17162.233105  0.000000
12    ß31 | -? 33          23319086078.318611      0.000000
13    R12 | ßl7            53740     10564.956512  0.000000
14    ¿.121718             62774     18359.197022  0.000000
15    ? 17J-418            54366     27136.429076  0.000000
16    ? 17 | £ 24          33748     10413.255892  0.000000
17    ? I7 | Q31           45065     26805.242087  0.000000
18    ? 17 | 721           70301     16232.354294  0.000000
19    5101 24              57772     17415.113746  0.000000
20    VU \ R \ 2           39546     18975.126308  0.000000
21    H 1 |? 12 | 718      17628     881.251263    0.000000
22    £ 311133             36346     20803.634880  0.000002
23    N4 \ A2l             45441     30227.409858  0.000003
24    ß! 7 | i01           25033     10875.740384  0.000018
25    s? O | # i2          20779     7151.794446   0.000041

Table C7: Continuation of the fifty most probably correlated attribute sets, as estimated by the coincidence detection procedure, for the HIV data set: csets ranked 26 to 50. These results were produced with parameter settings T = 750,000 and r = 7. Note the appearance, at this level of sampling, of several statistically significant higher-order features with k > 3.
HIV data set, T = 750,000, r = 7.
Rank  CSET                                Observed  Expected      Probability
26    £ 91-421                            40098     27695.038620  0.000231
27    F \ 9 \ D24                         29121     16875.538795  0.000286
28    2171-421                            29621     18109.021417  0.000737
29    H12 \ E24                           22348     10939.327036  0.000839
30    [illegible]                         15175     4159.316971   0.001355
31    S4 \ 7? \ T12 \ V \ S \ R21         10919     1.718549      0.001524
32    N4 \ K9 \ A2l                       11233     623.181959    0.002185
33    N4 \ Q3l \ H33                      21868     11328.342993  0.002369
34    Fí9 A2l                             44400     34516.144368  0.004910
35    K9 \ Q3 \\ H33                      16593     6991.723718   0.006625
36    W \ 9 \ Q24                         16738     7234.038664   0.007331
37    £ lf-V12                            10844     1492.835945   0.008575
38    K9 \ E24                            13847     4587.312260   0.009408
39    K9 \ Rp                             33735     24568.179150  0.010326
40    T12 \ VS                            23076     14893.617567  0.026158
41    Rl 21-421                           15497     7516.155896   0.031231
42    -V4 | K9 | ß31 | i? 33              8280      493.681367    0.036905
43    -V4 | £ 9 | -418                    11655     4250.900600   0.050618
44    S4 \ T9 \ Tl2 \ Vl% \ R2l \ Y33     7370      0.093039      0.052029
45    ? 121Q171718                        7452      240.364918    0.058992
46    F11 | Q17                           14350     7329.962834   0.068429
47    H12 | 721                           23263     16324.923094  0.072825
48    ßl7 | ? 33                          17288     10374.788061  0.074203
49    Z, 1 3 | FR9                        15536     8921.243955   0.092437
50    S17 | H28                           6529      138.997153    0.108375

Table C8: The thirty-five highest pairwise inter-column mutual information values for the HIV-V3 data set, as estimated by our methodology as described in the main text.
Rank  Pair i|j  MI(c_i, c_j)  Standard Error
1     12|18     0.340449      0.037792
2     4|9       0.337943      0.0389162
3     9|21      0.319481      0.0353829
4     23|24     0.315202      0.0337213
5     12|24     0.314393      0.0330382
6     9|24      0.313992      0.0344732
7     19|24     0.305609      0.0335857
8     11|24     0.297498      0.0358645
9     24|26     0.290044      0.0384839
10    9|11      0.289911      0.0344244
11    9|23      0.285019      0.0343224
12    4|21      0.284936      0.0332236
13    18|21     0.278151      0.0404634
14    4|11      0.277189      0.0353993
15    12|21     0.273137      0.033385
16    4|24      0.262226      0.036189
17    21|24     0.260366      0.0338395
18    11|23     0.260337      0.0323302
19    11|19     0.249877      0.0320634
20    10|24     0.248938      0.0325318
21    19|23     0.242185      0.032301
22    5|26      0.239395      0.0386373
23    9|19      0.238318      0.0331283
24    4|23      0.23359       0.0302795
25    24|25     0.222109      0.0358744
26    6|26      0.220371      0.0397722
27    4|26      0.220213      0.0333324
28    6|24      0.218815      0.0335123
29    9|12      0.214844      0.0280984
30    15|24     0.213921      0.0301834
31    10|12     0.2133        0.0306496
32    9|18      0.21078       0.031734
33    11|21     0.210155      0.0308121
34    11|12     0.209421      0.0294066
35    4|19      0.20911       0.0290533

Table C9: The seven highest pairwise inter-column mutual information values for the HIV-V3 data set, as estimated by the Los Alamos group.
Rank  Pair i|j
1     23|24
2     12|24
3     12|18
4     12|23
5     19|24
6     10|24
7     10|12

Tables C1 to C4 show the most significant csets (measured, again, by our procedure's estimate of Prob(Observed | Independence) for the observed number of coincidences for each detected coincidence of attributes). As might be expected, a clear separation between probably correlated and probably uncorrelated csets does not manifest itself at this comparatively low level of sampling for this real data set. The results for r = 7 and r = 10 indicate more significant discovered csets than those for r = 5 and r = 6. At these higher values of r, one can see the emergence of a few csets with "Prob" values less than 0.1: (Q@17, D@24), (N@4, K@9), (H@12, A@18), (Q@31, H@33) and (S@10, F@19). All of these csets appear among the most significant csets reported in the more intensive sampling runs (with T = 100,000 and T = 750,000), with the notable exception of (S@10, F@19). This last cset is discovered at this low level of sampling only in the r = 10 run, and does not appear in the more intensive sampling runs shown, both of which used r = 7.
Table C5 gives the result for T = 100,000 and r = 7, and here it is clear that some separation of signal from noise takes place within the HOF set, with seventeen pairwise correlations, together with 3-ary correlations, appearing within our significance level Prob < 0.1. At T = 750,000, we obtain statistically significant detection of almost fifty 2-ary, 3-ary and even 6-ary attribute correlations, as shown in Tables C6 and C7.
In order to obtain a better sense of the possible meanings of these results, consider these inter-attribute correlations together with some inter-column correlations, in the form of pairwise mutual information estimates carried out in our own analysis and also by the Los Alamos group. Table C8 shows the highest estimated mutual information values among all N(N - 1)/2 = 528 pairs of columns from our 33-column data set. The estimates were obtained using a bootstrap-like procedure in which 1000 sample data subsets of m = 300 out of M = 657 were drawn and run through the standard mutual information calculation. Reported in the table are therefore the mean values over the repeated sampling and the associated standard error values.
There is significant overlap between the set of column pairs indicated by the highest-ranked csets in Tables C6 and C7 and those indicated by the highest mutual information values in Table C8. The correspondence between the two rankings is not perfect, for a few reasons (in addition to noise and simple sampling error). First, while the suspicious coincidence of an individual combination of attributes certainly contributes to the mutual information within the corresponding set of columns, the behavior of other symbols appearing within those columns can obviously also have a large effect. Second, we again note the observed sensitivity of the coincidence detection results to the choice of r.
Table C9 lists the most statistically significant mutual information values as estimated by the Los Alamos group. Note the overlap between their list and ours, although we again emphasize that the group used an earlier, smaller and perhaps otherwise different database to which we did not have access.
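The resampled mutual information estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce Table C8: the function names are hypothetical, the plug-in estimator in bits is assumed (the text does not give the logarithm base), subsets are assumed to be drawn without replacement, and the reported "standard error" is assumed to be the standard deviation of the estimates over the resamples.

```python
import math
import random
from collections import Counter

N_COLUMNS = 33
N_PAIRS = N_COLUMNS * (N_COLUMNS - 1) // 2  # 528 column pairs, as stated in the text


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits, an assumption) between two symbol columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def resampled_mi(col_i, col_j, m, n_resamples, seed=0):
    """Mean MI and its spread over random m-row subsets (drawn without replacement)."""
    rng = random.Random(seed)
    rows = range(len(col_i))
    estimates = []
    for _ in range(n_resamples):
        idx = rng.sample(rows, m)
        estimates.append(
            mutual_information([col_i[k] for k in idx], [col_j[k] for k in idx])
        )
    mean = sum(estimates) / n_resamples
    var = sum((e - mean) ** 2 for e in estimates) / (n_resamples - 1)
    return mean, math.sqrt(var)
```

With the figures quoted in the text, a call would look like `resampled_mi(column_12, column_18, m=300, n_resamples=1000)`, where each column holds the 657 residue symbols at one aligned position.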
The application of the coincidence detection method of the invention to biological data such as aligned HIV sequences therefore leads to the identification of covarying structural elements that were not previously recognized. The statistically significant coincidence of structural elements, such as amino acid residues, also indicates a biological role for a motif comprising the covarying elements, since structure and function are closely linked in biochemical systems. An example from the above application of the invention is the statistically significant coincidence of residues A18, Q31 and H33 in the V3 loop of the HIV envelope protein. These residues are expected to contribute to a structural motif of the V3 loop that plays a biological role in the life cycle of HIV. This new knowledge about A18/Q31/H33, which before the invention had never been grouped together for a particular biological role, can be exploited in several ways, as follows.
A peptide or peptide mimetic that resembles the aforementioned structural motif of the V3 loop (or another protein motif identified by the coincidence detection method) is provided by the invention. Such a peptide or peptide mimetic would include the spatial coordinates of amino acid residues A18/Q31/H33, although not every atom of those amino acids would necessarily be required. Instead, the peptide or peptide mimetic would have such spatial coordinates of A18/Q31/H33, as well as such topological and electrostatic attributes, as would make it useful for a biological function, such as competing with the actual HIV V3 loop for binding to another biological molecule, where such binding of V3 would employ the structural motif that is mimicked by the peptide or peptide mimetic. Alternatively, a peptide or peptide mimetic that is designed on the basis of covarying k-tuples discovered by the coincidence detection method could be used as an antigen.
That is, the biological function that the molecule mimics is that of eliciting an immune response in an animal. Similarly, vaccines exhibiting the covariations described herein are also encompassed by the invention. Morgan et al. (Morgan et al., 1989. In Annual Reports in Medicinal Chemistry, Ed.: Vinick, F.J., Academic Press, San Diego, CA, pp. 243-252) define peptide mimetics as "structures that serve as appropriate substitutes for peptides in interactions with receptors and enzymes. The mimetic must possess not only affinity but also efficacy and substrate function." For purposes of this description, the terms "peptide mimetic" and "peptidomimetic" are used interchangeably according to the above definition. That is, a peptidomimetic exhibits the functions of a particular peptide, without restriction as to structure. The peptide mimetics of the invention, for example analogs of the structural motif of the V3 loop defined above, may include amino acid residues or other chemical moieties that provide the desired functional characteristics.
The invention further provides a ligand that interacts with a protein having a structural motif identified using the coincidence detection method of the invention, as well as a pharmaceutical composition that includes the ligand and a pharmaceutically acceptable carrier or excipient therefor. The ligand would include chemical moieties of appropriate identity, spatially located relative to each other so that the moieties interact with the corresponding residues or portions of the motif. By interacting with the motif, the ligand could interfere with the function of the region of the protein that includes the motif.
Thus, the invention provides a pharmaceutical composition for interacting with a human immunodeficiency virus (HIV) envelope protein, which includes a ligand having a functional group that interacts with the structural motif of the V3 loop having the spatial coordinates of residues A18/Q31/H33, and a pharmaceutically acceptable carrier or excipient therefor. The ligand can have more than one functional group that interacts with the motif, such as, for example, a first functional group capable of binding to residue 18 and present at an effective position in the ligand to do so, a second functional group capable of binding to residue 31 and present at an effective position in the ligand to do so, and a third functional group capable of binding to residue 33 and present at an effective position in the ligand to do so.
The invention further provides a method of designing a ligand for interacting with a structural motif of a protein, such as, for example, the envelope protein of human immunodeficiency virus (HIV). For example, in the case where the motif of interest is the A18/Q31/H33 motif identified by the coincidence detection method described above, the design method includes the steps of providing a template having the spatial coordinates of residues A18, Q31 and H33 in the V3 loop of the HIV envelope protein, and computationally evolving a chemical ligand using an effective algorithm with spatial constraints, so that the evolved ligand includes at least one effective functional group that binds to the motif. The template provided may also include topological and/or electrostatic attributes, in which case the effective algorithm includes topological and/or electrostatic constraints. Similar method steps would be employed for other proteins comprising a motif identified by the coincidence detection method.
The invention further provides a method of identifying a ligand for binding to a structural motif of a protein. The structural motif is preferably identified by the coincidence detection method. For example, in the case where the motif is that identified by the coincidence detection method as comprising residues A18, Q31 and H33 of the HIV envelope protein described above, the method includes the steps of: providing a template having the coordinates of A18, Q31 and H33 in the V3 loop of the HIV envelope protein; providing a database containing the structure and orientation of molecules; and screening the molecules in the database to determine whether they contain effective moieties spaced in relation to each other so that the moieties interact with the motif. The database may also contain topological and/or electrostatic attributes of the molecules, in which case the screening step further includes determining whether the moieties are effective in that respect to interact with the motif. For example, a molecule described in the database may have physical and chemical attributes such that it includes a first moiety that interacts with residue 18, a second moiety that interacts with residue 31 and a third moiety that interacts with residue 33. Similar method steps would be employed for other proteins comprising a structural motif of interest.
When a ligand provided by the invention is included in a pharmaceutical composition, the pharmaceutical composition further includes a pharmaceutically acceptable carrier, as is known to those skilled in the art in relation to such compositions. The term "pharmaceutically acceptable carrier" as used herein includes diluents such as saline and aqueous pH buffer solutions; liquid, solid or gaseous phase vehicles; carriers such as liposomes (Strejan et al., 1984. J. Neuroimmunol 7:27); and dispersing agents such as glycerol, liquid polyethylene glycols, and the like.
The pharmaceutical composition can include any of the solvents, dispersion media, coatings, stability enhancers, antibacterial and antifungal agents (e.g., parabens, chlorobutanol, phenol, ascorbic acid, thimerosal), isotonic agents (e.g., sodium chloride, sugars, polyalcohols such as mannitol) and absorption-delaying agents (e.g., aluminum monostearate and gelatin) that are known in the art.
Alternatively, a ligand provided by the invention, such as a ligand that binds to a biological target, can be used for diagnostic purposes. A diagnostic agent according to the invention can include a ligand that interacts with a protein having a structural motif identified using the coincidence detection method, and a detectable label linked to the ligand. The detectable label can be any detectable substance known in the art, such as, for example, a fluorescent substance or a radioactive substance. Alternatively, the label can be an enzyme (such as, for example, horseradish peroxidase or alkaline phosphatase) that catalyzes a reaction having a detectable (for example, colored) product, or the label can be the substrate for such an enzyme.
Application of the Described Principles in the Context of Drug Discovery:
The multi-billion-dollar pharmaceutical industry is largely based on the design or discovery, and the refinement, of small molecules ("ligands") that interact with larger molecules ("targets") and in some way inhibit, enhance, block or otherwise modify the structure, function or activity of the target. It is the structure, function or activity of the target that is involved in some way in some disease mechanism. The target molecule is frequently an enzyme or receptor composed of protein, nucleic acid or some combination thereof. There is a very large number of possible ligands, and only relatively few of them are developed and marketed as therapeutic compounds that work against or with one or more targets and therefore act against the disease.
Therefore, it is of great interest to biotechnology and pharmaceutical researchers to be able to consider a huge number of potentially useful compounds while avoiding spending too many resources on the development of therapies based on compounds that may not be useful, safe, effective or economically viable. The methods described herein can be used to improve and accelerate the process of discovering suitable, effective compounds and to distinguish promising compounds from unpromising or less promising compounds in a private or public collection of molecules or of their computer database representations. The methods can be used effectively, and contribute value, in this application in many ways: by helping to understand and infer target structures, and by searching for ligands whose geometric, topological, electrostatic or other characteristics make them likely candidates for effective interaction with those targets.
Application of the Principles Described Herein to Databases of Molecules and Their Characteristics
One way to represent a large number of molecular structures within a computer database (whether stored in main memory or on a magnetic disk, tape or other electronic or optical medium) is in terms of "screens". Those skilled in the art will recognize screens as binary attributes, where a given screen or attribute represents the presence or absence of a particular substructure pattern, for example a sulfate group. If a set of compounds is represented by screens, then a particular compound, which will be denoted by C, can be represented by a string of 1s and 0s, where the 1s stand for the predefined substructure patterns that C contains and the 0s for the predefined substructure patterns that C does not contain. This scheme may be extended to represent the primary structure of a nucleic acid or protein in terms of attributes, as described elsewhere herein.
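The screen-based string of 1s and 0s for a compound C can be illustrated with a minimal sketch. The screen names and the compound's substructure set below are hypothetical, chosen only for illustration; a real system would derive the substructures from a chemical structure representation.

```python
def screen_string(compound_substructures, screens):
    """Return one character per predefined screen: '1' if present in the compound, else '0'."""
    return "".join("1" if s in compound_substructures else "0" for s in screens)


# Hypothetical predefined list of substructure patterns (screens).
SCREENS = ["sulfate", "carboxyl", "benzene_ring", "amide"]

# Hypothetical compound C, described by the substructure patterns it contains.
compound_c = {"sulfate", "amide"}

print(screen_string(compound_c, SCREENS))  # prints 1001
```

The order of the screens is fixed once for the whole database, so the same position in every compound's string answers the same presence-or-absence question.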
The primary structure is also known as the "sequence", i.e., a sequence of bases (nucleotides) in DNA or RNA, or a sequence of amino acids (also called amino acid residues) in a protein. It is simple to represent a protein sequence, for example, as a sequence of symbols, each symbol being a letter of the alphabet that corresponds to one of the twenty naturally occurring standard amino acids. It is also simple to transform this representation by representing each residue or position in the sequence by a set of twenty binary attributes, if such a representation is desired. The attributes act like the screens described above. For example, if the first amino acid in protein P is an alanine, represented by "A", it can also be represented by a value of "1" in the attribute that stands for the question "Is the amino acid in position 1 an alanine?", and by values of "0" for the attributes that represent "Is the amino acid in position 1 a cysteine?", "Is it a phenylalanine?", and so on. Figure 15 provides an illustration of the amino acid and residue positions.
It is also easy and sensible to represent other aspects or characteristics of the compounds in terms of attributes. For example, a given compound may be known to be active against a particular target T, in which case an attribute corresponding to the question "Active against T?" would have the value 1 in the row corresponding to compound C. As another example, a pharmaceutical company may have run a number of compounds through a set of "assays", or tests of biological or chemical activity. An assay can test for some aspect of effectiveness against a target, or for the ability to cross the blood-brain barrier, or for toxicity, for example. The assay results can likewise be represented in terms of discrete-valued, and even binary, attributes by means of pre-processing routines known to those skilled in the art.
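The transformation of a protein sequence into twenty binary attributes per position can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, and the handling of symbols outside the twenty standard one-letter codes (such as alignment gaps) is an assumption not addressed in the text.

```python
# The twenty standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_to_attributes(sequence):
    """Encode each residue position as 20 binary attributes, one per amino acid.

    Assumption: a symbol outside the 20 standard codes (e.g. a gap '-')
    yields all zeros for that position.
    """
    bits = []
    for residue in sequence:
        bits.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return bits


# Position 1 is alanine: the "Is position 1 an alanine?" attribute is 1,
# and the other nineteen attributes for position 1 are 0.
first_position = sequence_to_attributes("A")
```

A sequence of length L therefore becomes a row of 20 × L binary attributes, which can be placed directly in the data matrix alongside screens and assay attributes.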
Other characteristics of particular compounds may include literature citations (i.e., references to documents or studies in which the compound was described, designed, discovered or analyzed) and the ownership or patent status of the compound. Not only can small therapeutic compounds be represented in terms of screens and other attributes, but so can potentially larger therapeutic molecules such as DNA, RNA, peptides, proteins, carbohydrates and lipids. Target molecules can be represented in this way as well. All that is required is a predefined list (though one that may be updated by changing, shrinking or growing) of substructure patterns or other characteristics considered important by researchers or users. For target structures, one may wish to represent substructure patterns as well as their one-dimensional linear structures ("sequences"), genetic linkage information, interactions with other proteins in the disease pathway, literature citations and so on. Sometimes a particular molecule may be listed as more than one object in a database, the different objects representing different conformations that the molecule can adopt.
Clearly, this use of screens and other attributes in the representation of a compound database can also be expressed in terms of the M by N data matrix that has been used to describe the operation of the invention. The M by N data matrix is illustrated below in Table 1. The rows in Table 1 correspond to a set of molecules, compounds, molecular structures or sequences, while the columns correspond to characteristics that may include substructure patterns, assay results or other aspects of the molecules. The value in table cell [i, j] is 1 if molecule i has characteristic j and is 0 otherwise.
Table 1
The steps involved in applying the methods described herein to the analysis of a molecular database include:
1. Obtain a molecular database that supports the described screen and attribute representation for the 1D, 2D and/or 3D molecular structures of interest (or obtain the molecular database and use standard methods to produce such a representation); also use standard methods to transform sequence and other information about the molecules of interest into attribute representations.
2. Present this database, in whole or in part, to an embodiment of the current invention, so that each compound in the database corresponds to one of the M objects (rows) in the data matrix of the embodiment and each substructure pattern represented by a screen corresponds to an attribute (column) of the data matrix. Additional attributes that represent activity, assay results, known targets against which the compound has been used, source or means of production or storage of the compound, ownership or patent status of the compound, and so on, together with the substructure pattern attributes, make up the N attributes (columns) of the data matrix.
3. Run the base method described above, or one of the other embodiments described herein, on the data matrix.
4. Direct the discovered correlated k-tuples of attributes to:
• a graphical viewer, or
• a rule-generating preprocessor for a rule-based system, or
• a report, or a report generation system, for researchers or managers, or
• another computer program that performs some type of additional analysis of the compounds, sequences or structures represented in the database, or
• another computer program that performs some transformation or optimization on the database, or
• another computer program that directs humans and/or robots in drug screening experiments or in the design, refinement or production of therapeutic compounds.
The output of the current invention in the drug discovery application can be useful in many possible ways. First, it can be used in establishing or optimizing the screen-based representation of the molecules. For example, it is known in the art that a good screen-based representation should use a set of screens (attributes) that are mutually uncorrelated and approximately equiprobable. The method of the present invention, when used as described above, would produce sets of correlated screens. This information can be used to add, remove or combine the characteristics that the screens represent, in order to bring the modified set of screens closer to the ideal of uncorrelated and equiprobable.
Other useful and valuable aspects of the information produced by the method include the following. For example, it is not uncommon for a pharmaceutical company to have good "lead" compounds that work in in vivo or in vitro experiments even when the researchers do not know the target structure, the active site of the target structure, or even which of the different proteins in the biological system is the target. If the methods described herein are used to discover correlations between substructure patterns and assay results, this information can help to infer a target structure and even to design more effective lead compounds, since it allows researchers to associate structure with the desired activity.
Another example is that of finding correlated amino acid residues in that part of a drug discovery database that corresponds to an aligned set of DNA, RNA or protein sequences, as described herein. In this case, some of the correlated k-tuples of residues (positions) may correspond to evolutionarily conserved structural and functional relationships. The principles described herein can thus be used to help predict or resolve the structure and function of important biological macromolecules, including pharmaceutical targets such as receptors and enzymes.
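As a simple illustration of the "uncorrelated and approximately equiprobable" ideal for screens mentioned above, the following sketch computes per-screen frequencies and a pairwise phi coefficient on a binary data matrix. Note that this is a conventional pairwise diagnostic swapped in for illustration; it is not the coincidence detection method of the invention, which reports correlated k-tuples, and the function names are hypothetical.

```python
import math


def screen_frequencies(matrix):
    """Fraction of objects (rows) in which each screen (column) is set to 1."""
    m = len(matrix)
    return [sum(row[j] for row in matrix) / m for j in range(len(matrix[0]))]


def phi(matrix, a, b):
    """Phi coefficient between binary columns a and b (0.0 if a column is constant)."""
    m = len(matrix)
    n11 = sum(1 for row in matrix if row[a] and row[b])  # both screens present
    n_a = sum(row[a] for row in matrix)                  # screen a present
    n_b = sum(row[b] for row in matrix)                  # screen b present
    denom = math.sqrt(n_a * (m - n_a) * n_b * (m - n_b))
    if denom == 0:
        return 0.0
    return (m * n11 - n_a * n_b) / denom


# Toy data matrix: 4 compounds (rows) by 3 screens (columns).
toy = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
```

In this toy matrix all three screens are equiprobable (frequency 0.5), screens 0 and 1 are perfectly correlated and so redundant, and screens 0 and 2 are uncorrelated; a redundant pair like the first is a candidate for removal or combination.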
Another example is to find correlations between structural, functional, disease-pathway or other aspects of one target molecule, T1, and those of another target molecule, T2; or to find correlations between functional, structural or other aspects of a set of potential therapeutic compounds intended for T1 and those of a set of potential therapeutic compounds intended for T2. In either case, this correlation information is useful because it allows drug designers to apply their knowledge of the compounds and techniques effective against T1 to the effort against T2.
Another, quite different, application of the principles described herein to drug discovery and medical science is obtained by considering the transpose of the data matrix described above. Instead of compounds as objects (rows) and characteristics of the compounds as attributes (columns), consider the case in which the compounds correspond to columns and their characteristics correspond to rows. See Table 2 below. Using the current invention in this scenario produces correlated k-tuples of compounds in feature space. The k-tuples produced can embody several kinds of valuable information. For example, if the characteristics in the rows mostly represent substructure patterns (screens), then the k-tuples produced correspond to groups, or clusters, of compounds. Such clustering of compound databases is very useful in high-throughput screening (HTS) with biological or chemical assays (in vivo or in vitro) and with computational assays. In HTS, it is useful and economical to test only one or a few members of each group of compounds initially; then, only in cases where a "hit" occurs (that is, a compound "passes" the biological or chemical activity assay) are the other members of the corresponding group sent through the assay. The "transposed" form of the molecule database shown previously, used in order to group the compounds in feature space, is shown in Table 2.
Now it is the columns that correspond to the set of molecules, compounds, molecular structures or sequences, while the rows correspond to the characteristics, which may include substructure patterns, assay results or other aspects of the molecules. There are M' rows and N' columns, where M' = N and N' = M for the original M and N described above. The value in table cell [j, i] is 1 if molecule i has characteristic j and is 0 otherwise.
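The transposition described above, from an M by N compound-by-characteristic matrix to an M' by N' characteristic-by-compound matrix, can be sketched in a few lines. The function name and the toy matrix are illustrative assumptions.

```python
def transpose(matrix):
    """Swap rows and columns: cell [i][j] of the input becomes cell [j][i] of the output."""
    return [list(column) for column in zip(*matrix)]


# Toy data: M = 2 compounds (rows) by N = 3 characteristics (columns).
compounds_by_features = [
    [1, 0, 1],
    [0, 1, 1],
]

# Transposed: M' = 3 characteristics (rows) by N' = 2 compounds (columns).
features_by_compounds = transpose(compounds_by_features)
```

Running the same correlation procedure on `features_by_compounds` then treats compounds as the attributes, so that the k-tuples it reports are groups of compounds rather than groups of characteristics.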
Table 2
Application of the Principles Described Herein to Discover and Analyze Genetic Networks
The advanced molecular and computational biology techniques applied in large-scale genome mapping and sequencing efforts are beginning to give us access to complete genome sequences, to the full expression patterns of genes, and to the ability to store and manipulate this information. Such information can be used to accelerate the discovery of new disease targets and successful therapeutic compounds. It is known that the genes that make up the "blueprint" for particular physical characteristics and systems within an organism often act together in complex ways. Genes interact in a mutually regulating way, promoting, repressing and otherwise modulating their own activation and expression and that of other genes. Traditionally, molecular biology has focused on the study of individual genes in isolation. However, to understand complex biological phenomena such as neuronal development or oncogenesis, for example, it is necessary to study the expression patterns of dozens or hundreds of genes in parallel, taking into account temporal patterns as well as anatomical patterns. Such an analysis requires novel computational and statistical capabilities such as those provided by the principles described herein.
While many variations are possible and can be envisaged by those skilled in the art, a basic scheme for employing the methods described herein in the analysis of genetic networks may include the following steps:
Step 1: Select the genes of interest.
Step 2: Select the biological parameters by means of which the state of a gene at a particular moment is represented.
Biological parameters may include: the expression of a gene (concentration levels of the associated mRNA or protein product), a particular state of a protein such as a biologically relevant phosphorylation or any other post-translational modification, the location of a given protein, or the presence or absence of a cofactor. For example, one can use the polymerase chain reaction (PCR) to amplify, and then known methods to detect, the mRNA levels of each gene; normalize these by dividing by the maximum expression level of each gene; and then quantize these continuously varying levels into a set of z discrete levels that can be represented in the data matrix format described throughout this document. It is also possible to use concentration levels of protein products as indicators of activity and interactivity. The change in protein concentrations over the course of the timed observations is governed mainly by three processes: the direct regulation of protein synthesis from a given gene by the protein products of other genes (including self-regulation as a special case); the transport of molecules between cell nuclei; and the decay of protein concentrations.

Step 3: Select a time sampling scheme for the biological parameters of the genes in the genetic system under analysis. At each appropriate time, use methods known in the art to measure the selected biological parameters for the selected genes.

Step 4: Represent the selected genes in terms of the selected biological parameters, and represent the measured values of the biological parameters as attributes in the data matrix. Represent the time samples (the instances of measurement of the biological parameters) as rows in the data matrix.
That is, for the cell in row i and column j of the data matrix, record the quantity or characteristic measured in time sample i for biological parameter j (which may or may not correspond to gene j, depending on whether one or more parameters are measured for each gene). The recorded quantity, level, or characteristic can be binary (for example, the gene is "active" or "inactive"), or it can be one of the z discrete values described above. As described elsewhere in this document, any discrete-valued attribute may be represented by a binary coding of whether each value is absent or present in a particular object, so that any of the preferred embodiments of the present invention may be applied to data of this type.

Step 5: Apply the base method described above, or one of the other embodiments described herein, to the data matrix.

The output of the preceding steps, that is, a set of k-tuples of correlated attributes, can be interpreted as a set of correlated gene groups. For example, one may discover that one gene is "active" whenever another gene is "active". Or one may discover that when one gene G1 is in "low expression", another gene G2 is "off"; when G1 is in "medium expression", G2 is in "low expression"; and when G1 is in "high expression", G2 is in "medium expression". Such a result may lend support to the hypothesis that G1 promotes the expression of G2, or that "G1 activates G2". Similarly, correlated k-tuples of genes or biological parameters can provide evidence that a gene represses or "disables" another gene or set of genes, and so on. All of this information can be useful in constructing a model, for example a "Boolean network", of an interacting set of genes. Such models are known to those skilled in the art to provide valuable assistance in the diagnosis, prevention, and cure of disease and in the design of effective and economically valuable therapies.
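Steps 2 through 4 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: it uses simple equal-width binning rather than any particular quantization method, and the gene names, expression values, and function names (`discretize`, `binary_encode`) are hypothetical.

```python
from typing import Dict, List


def discretize(levels: List[float], z: int) -> List[int]:
    """Normalize a gene's levels by its maximum, then bin into z levels 0..z-1.

    Equal-width binning is an assumption made for illustration only.
    """
    peak = max(levels)
    if peak <= 0:
        return [0] * len(levels)
    return [min(int((value / peak) * z), z - 1) for value in levels]


def binary_encode(matrix: List[List[int]], z: int) -> List[List[int]]:
    """Expand each discrete column into z binary columns (value present or absent)."""
    encoded = []
    for row in matrix:
        bits: List[int] = []
        for value in row:
            bits.extend(1 if value == level else 0 for level in range(z))
        encoded.append(bits)
    return encoded


# Hypothetical measurements: one series of three time samples per gene.
expression: Dict[str, List[float]] = {
    "G1": [0.2, 1.0, 2.0],
    "G2": [0.0, 0.5, 3.0],
}
Z = 3  # number of discrete expression levels

columns = [discretize(series, Z) for series in expression.values()]
# Rows are time samples, columns are genes.
data_matrix = [list(row) for row in zip(*columns)]
binary_matrix = binary_encode(data_matrix, Z)
```

Here `data_matrix` holds one discrete level per gene per time sample, and `binary_matrix` holds the same information with one 0/1 column per (gene, level) combination.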
The rows in Table 3 correspond to a set of time samples (also known as time points or time segments), that is, times or periods of observation of the activity of a particular gene or gene product. The columns correspond to particular genes or gene products. The value in table cell [i, j] is one (1) if gene j is considered "active" (for example, "on" or "expressed") during time sample i and is zero (0) otherwise. This representation and application extends easily to situations in which the simple active/inactive state of a gene is replaced by a set of z different expression levels, for example as measured by the observed amounts of a major protein product of the gene. It also extends easily to situations in which more than one biological parameter is used to represent the state of an individual gene.
Table 3

The method described herein has been applied to a set of gene expression data for genes involved in the development of the spinal cord in rats, as described in (G. S. Michaels, D. B. Carr, M. Askenazi, S. Fuhrman, X. Wen, and R. Somogyi, Pacific Symposium on Biocomputing 3: 42-53, 1998). The data set is available from these authors and, as of March 1998, is also available on the World Wide Web (WWW) at http://rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html. Using a reverse transcriptase polymerase chain reaction (RT-PCR) protocol, the expression of 112 genes (mRNA levels normalized by the maximum expression level) was assayed at nine developmental time points (E11, E13, E15, E18, E21, P0, P7, P14, and P90 or adult, where E = embryonic and P = postnatal). The list of genes used includes genes considered important in the development of the CNS (Central Nervous System), covering the nine main gene families.
The aforementioned data set was easily transformed into a data matrix of objects and attributes, suitable for analysis with the methods described herein, in a few steps:

1. The real-valued (that is, continuous-valued) gene expression levels were transformed into a set of discrete values by using the Bayesian clustering method implemented in the SNOB software described in (C. S. Wallace and D. L. Dowe, "Intrinsic Classification by MML - the SNOB program", Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence, pp. 37-44, 1994). Bayesian methods for the quantization or discretization of real numbers are well known to those skilled in the art. For convenience in interpreting the output, these six discrete values were further transformed into a small set of alphabetic symbols from A to F.

2. A data matrix was set up so that the columns of the matrix correspond to the 112 different genes and the rows of the matrix correspond to the nine different developmental time points.

3. The methods described herein were run on the transformed gene data set several times, each time using a different combination of values for the parameters r (sample size) and T (number of sampling iterations).

The method can be applied to this data set by using a computer program very similar to the embodiment described in Appendices A and D; however, that particular embodiment was adapted for application to the protein sequence analysis domain, which means that some of the parameter values were set to suit those particular experiments on the HIV protein data. The program must be modified to allow parameter values appropriate to the input data. The runs on the gene expression data were executed on an IBM PC-compatible computer under the Windows 95 operating system. For each run, a table of results was printed for viewing and analysis.
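The roles of the parameters r and T can be illustrated with a deliberately simplified Python sketch. It is not the program of Appendices A and D. It assumes one particular reading of a "coincidence" (all r sampled objects share the same value in an attribute), handles pairs of attributes only, and scores each pair by the ratio of observed to expected coincidence counts under independence. The function name `detect_pairs` and the toy data are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def detect_pairs(rows, r, T, seed=0):
    """Score attribute-value pairs by observed / expected coincidence counts.

    rows: list of equal-length tuples (objects by attributes).
    r: number of objects drawn per sample; T: number of sampling iterations.
    """
    rng = random.Random(seed)
    m, n = len(rows), len(rows[0])

    # Exact probability that r objects drawn without replacement
    # all show value v in column j.
    p_coincide = {}
    for j in range(n):
        for value, count in Counter(row[j] for row in rows).items():
            p_coincide[(j, value)] = comb(count, r) / comb(m, r)

    observed = Counter()
    for _ in range(T):
        sample = rng.sample(rows, r)
        # Attributes on which every sampled object agrees (assumed definition).
        cset = [(j, sample[0][j]) for j in range(n)
                if all(obj[j] == sample[0][j] for obj in sample)]
        for pair in combinations(cset, 2):
            observed[pair] += 1

    scores = {}
    for (a, b), count in observed.items():
        expected = T * p_coincide[a] * p_coincide[b]  # independence assumption
        scores[(a, b)] = count / expected
    return scores


# Toy data: columns 0 and 1 are identical; column 2 is unrelated to them.
rows = [("A", "A", "A"), ("A", "A", "B"), ("A", "A", "A"), ("A", "A", "B"),
        ("B", "B", "A"), ("B", "B", "B"), ("B", "B", "A"), ("B", "B", "B")]
scores = detect_pairs(rows, r=2, T=2000)
linked = scores[((0, "A"), (1, "A"))]
unlinked = scores[((0, "A"), (2, "A"))]
```

In this toy run the score for the two identical columns comes out well above 1, while the score involving the unrelated column stays near or below 1. A larger T reduces sampling noise in the observed counts, and a larger r makes chance coincidences rarer.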
The results of one run, with T = 100,000 and r = 5, are appended for reference as Appendix E. A researcher may wish to print only the 10, 50, or 1000 most highly correlated k-tuples of genes (or any other number). The first 25 are shown in Appendix E. In the appended printout, the following conventions were used. Each group of one or more lines reports a correlated k-tuple of genes, that is, a cset (coincidence set) whose individual component attributes exhibited a low probability of being statistically independent, as described elsewhere in this document. A low probability of independence is a form of high correlation, as is known to those skilled in the art and as previously explained in this document. For each k-tuple, the k genes are shown, followed by a numerical value for their probability of independence. (This number is often displayed as zero, because the calculated value is so close to zero that the decimal expansion is truncated to zero.) Again, a low probability value means a high degree of correlation. For each gene, a symbol in A...F is shown, representing the quantized level of expression, followed by the internal data set name for the gene, followed by the most widely accepted standard name for the gene. The correlated k-tuples produced can be compared with the results reported by the authors of the aforementioned scientific paper. Among the methods of analysis these authors applied to this gene expression data set was a pairwise mutual information analysis. In such an analysis, a particular correlation measure, known as mutual information, was computed for each pair of the 112 genes, and the results were displayed graphically so that groups of genes with high mutual information tended to appear close to each other. The method described herein is capable, as shown by the results in Appendix E, of discovering not only highly correlated gene pairs but also 3-tuples, 4-tuples, and so on.
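Why looking beyond pairs can matter is easy to show with a small constructed example, which is not taken from the gene data. In the Python sketch below, three binary attributes are related by exclusive-or: every pair has zero mutual information, yet the three taken together are strongly dependent. The function names are hypothetical, and total correlation is used here only as a convenient multi-attribute dependence measure.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, cols):
    """Shannon entropy (in bits) of the joint distribution over the given columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def mutual_information(rows, a, b):
    """Pairwise mutual information I(a; b) in bits."""
    return entropy(rows, [a]) + entropy(rows, [b]) - entropy(rows, [a, b])


def total_correlation(rows, cols):
    """Sum of single-column entropies minus the joint entropy, in bits."""
    return sum(entropy(rows, [c]) for c in cols) - entropy(rows, cols)


# All four equally likely combinations of x and y, with a third attribute x XOR y.
rows = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]

pairwise = {pair: mutual_information(rows, *pair)
            for pair in combinations(range(3), 2)}
triple = total_correlation(rows, [0, 1, 2])
```

Every entry of `pairwise` is zero, while `triple` equals one bit, so an analysis restricted to pairs would report no structure at all in this data.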
Examination of the results in Appendix E and of the results of the authors of the aforementioned scientific paper shows that the two different methods tend to corroborate each other, although the present method goes further in searching for correlations among larger numbers of attributes. For example, examination of an output line from our results reveals a set of correlated genes such that the different pairs of genes in that set are usually also listed as having high pairwise mutual information by the other authors' method. It is not always true, however, that a correlated k-tuple of attributes implies that all possible pairs from that k-tuple are also mutually correlated, and vice versa. Therefore, a method such as those described herein, which can find pairwise or higher-order k-ary correlations, offers advantages over pairwise methods, which may fail to detect important higher-order correlations between genes or, in other applications, other attributes.

Application of the Principles Described Herein to Category Discovery in Internet/Intranet Document Databases for Use in Document Search Engines

Document search by subject or keyword implies the existence of an efficient search system, and indeed much effort has been applied to the development of effective search algorithms. However, this represents only part of the total solution; the problem also requires an effective document categorization strategy. Information theory dictates that an effective set of categories or topics used to organize documents should be uncorrelated and approximately equiprobable. When topics occur with widely varying probabilities, the search space of documents will be divided too broadly by some topics and too narrowly by others.
If there is a correlation between the topics (that is, if knowing that one topic appears in a given document implies a greater probability that other topics appear in that document as well), then the topic set can be reduced in size by withdrawing some of the correlated topics from the categorization set. The "equiprobability" requirement can be addressed by applying the principles described herein. This problem is readily approached with statistical techniques, although standard statistical techniques usually do not capture the higher-order joint probability terms. The problem of "correlation" is much more subtle and intractable. A suboptimal topic set forces the search system to examine more topics than necessary before results can be returned to users (and may confuse interpretation of the organization of the documents themselves). Since each increase in search efficiency allows larger numbers of users to use the system, developers of such systems cannot afford to be without effective document categorization. The application of the method to the reduction of a topic set to an optimal or near-optimal one can also be represented in terms of the M by N data matrix that we have used to describe the operation of the invention in other sections of this document. In a specific application embodiment, the rows of the data matrix correspond to the particular documents in the database, and the columns correspond to the set of proposed topics intended to categorize them (see Table 6).
The rows in Table 6 correspond to the documents in a database, while the columns correspond to the proposed topics used to classify them. The value in table cell [i, j] is one (1) if document i mentions topic j and is zero (0) otherwise.
Table 6

The steps involved in applying the present invention to the search for a near-optimal set of topics with which to classify a set of documents include:

1. Obtain an initial set of topics. The field of document search is well established, and effective methodologies for creating such sets are known to those skilled in the art.

2. Create the database using the set of topics and the set of documents that the set of topics categorizes. Given the set of topics, all that is needed is to examine each document to determine whether or not it mentions each topic.

3. Present this database, in whole or in part, so that each document in the database corresponds to one of the M objects (rows) in the data matrix of the embodiment and so that each proposed topic corresponds to an attribute (column) of the data matrix.

4. Apply the base method described above, or one of the other embodiments described herein, to the data matrix.

5. Direct the discovered correlated k-tuples of attributes to:
• A display or graphics printer, or
• A rule-generating preprocessor for a rule-based system, or
• A report for administrators or other users of the computer database query system, or a report generation system, or
• Another computer program that performs some type of additional analysis of the data, for example more in-depth statistical analysis (for example, multiple regression) on the correlated variables, or
• Another computer program that performs some transformation or optimization of the database.
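Steps 2 and 3 above, together with a first look at the "equiprobability" and "correlation" criteria, can be sketched as follows. This is a minimal Python illustration, assuming that a document "mentions" a topic when the topic string occurs in its lower-cased text. The sample documents, the topic list, and the function names are hypothetical, and the pairwise `lift` ratio is only a simple stand-in for the correlation measures of the method itself.

```python
def build_topic_matrix(documents, topics):
    """Row i, column j is 1 if document i mentions topic j, else 0.

    Substring matching is an assumption made for illustration.
    """
    return [[1 if topic in text.lower() else 0 for topic in topics]
            for text in documents]


def topic_frequencies(matrix):
    """Fraction of documents mentioning each topic (equiprobability check)."""
    m = len(matrix)
    return [sum(column) / m for column in zip(*matrix)]


def lift(matrix, a, b):
    """Observed joint frequency of topics a and b over the product of marginals."""
    m = len(matrix)
    joint = sum(1 for row in matrix if row[a] and row[b]) / m
    freq_a = sum(row[a] for row in matrix) / m
    freq_b = sum(row[b] for row in matrix) / m
    return joint / (freq_a * freq_b) if freq_a and freq_b else 0.0


documents = [
    "Wine regions of California",
    "California sun and surf",
    "North Dakota winters",
    "Wine and sun",
]
topics = ["california", "wine", "sun"]

matrix = build_topic_matrix(documents, topics)
frequencies = topic_frequencies(matrix)
california_wine_lift = lift(matrix, 0, 1)
```

A lift near 1 suggests that two topics co-occur about as often as independence would predict; values well above 1 flag candidate topics to merge or remove.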
Any statistically significant correlation between the topics in the topic set may indicate an ineffective initial selection of topics. The correlated k-tuples discovered by the method of the present invention correspond to "highly correlated topics" (with respect to the goal of "uncorrelated topics") and to "highly probable joint topics" (with respect to the goal of approximately equiprobable topics). A person skilled in the art can use the correlation outputs in this application as a guide to determining which topics co-occur and should be removed from, or combined within, the topic set. Using the output of the application in this way would allow the administrator of such a document search system to increase the performance of the system by reducing the number of categories to be searched in response to a user request. Improved system performance would benefit the service provider in two ways: the response time of the system to user queries would decrease, and the total number of queries that could be served would increase.

Application of the Principles Described Herein to Internet and Intranet Search and Storage

Internet and intranet search systems can be subjectively ranked by examining the length of time users need to find sites or documents relevant to their queries. Any improvement to the underlying algorithms that drive the output of the search system, if it allows users to find what they are looking for faster, improves the utility of that system, allows it to serve more users, and makes it more attractive both to user communities and advertisers (in the case of internet search) and to users and management (in the case of company intranet search). Below are two uses of the principles described herein that provide ways to get relevant information to users faster and to manage document storage better in internet or intranet search systems.
In the description of the following examples, the principles described apply equally whether one considers the internet/web, and therefore individual web pages and websites, or an intranet maintained within the information systems of an individual company or other institution, in which case the search is for documents rather than websites per se. For the purposes of this description, assume that each page in the set of web pages, or each internal intranet document in the set of such documents known to the search system, has already been classified by topic and that the set of topics is fixed a priori. The objective is to present the user with the normal output of the search system but to supplement that list of links with an additional list of topics known to be related to the user's request. The rows in Table 7 correspond to a set of web pages or internal intranet documents, while the columns correspond to topics. The value in table cell [i, j] is one (1) if web page or document i mentions topic j and is zero (0) otherwise.
Table 7

Table 7 illustrates the database on which the base method or another embodiment described herein will be run, in the data matrix format for representing objects and attributes that has been defined and described elsewhere herein. Note that, owing to the characteristics of the embodiments described herein, the pages used in the table need not be the complete set of all web pages. The embodiment, when run on this table, will find those topics that are frequently found in the same document. This indicates that these topics are related in some way and, since the set of web pages supports their association, they may be of interest to the user as well. The advantages are several. The computational cost of these embodiments increases linearly with the number of columns in the database. In this application, the number of columns is the number of topics associated with the web pages. Since this number is almost certain to be very large, this characteristic of the method is a real benefit. In addition, if the web pages are kept in random order, the embodiments can be run on more manageable subsets of the complete set of web pages. This allows the work of searching for these associations to be divided into many smaller jobs that can be run, in series or in parallel, during idle time on the server where the search system resides. This method can produce novel associations of large width (k) at any point during its execution. Many other "association mining" methods find the larger k-tuples of associated attributes only in the later stages of their extended run times. Finally, as the list of associated topics found by this algorithm grows, pages that collect the links for these new "joint topics" can be created and retained. This would reduce the load on the server (thus allowing more users to access the system).
Since the method provides a measure of the statistical relevance of each association, this information could be used to select which new topic indexes would be retained and which would be generated only as necessary.

Alternative Application of the Principles Described Herein to Managing the Storage and Retrieval of Web Pages and Documents

Internet and intranet search systems try to order the space of web pages or documents by topic. Generally, an initial ordering, for example an alphabetical one, is unlikely to divide that space evenly. For example, the topic "California" will have a vastly larger set of pages associated with it than "North Dakota". A simple tree-like storage of the pages by topic (with sub-topics at lower levels of the tree) will leave "California" with a very deep tree. What would be useful in this situation is some better way to divide the search space of the pages than by individual topics alone. In the example noted, it would be better to have the large set of web pages related to California divided into smaller sets closer in size to the North Dakota set. We can maintain our ordering of the pages by topic if we choose to divide large sets into small ones by replacing the individual topic with a series of lists of associated topics that span the same space. Going back to the example, if "California" were strongly associated only with "Sun", "Wine", and "Automobiles", we would replace the "California" tree node with the set of nodes "California and Sun", "California and Wine", "California and Automobiles", and "California and Others". This would allow faster search and storage of those pages because it reduces the height of this part of the tree (in this case by one). Recursive application of the same technique to all nodes in the tree would provide a method to ensure a better balance than could be had before.
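The node replacement described above can be sketched in Python as follows. This is a hedged illustration rather than a prescribed storage design: the page identifiers, the dictionary representation, and the function name `split_topic` are hypothetical, and the sketch assumes that a page mentioning several associated topics is filed under each matching node.

```python
def split_topic(pages, topic, associated):
    """Replace one large topic node with one node per associated topic.

    pages: dict mapping page id -> set of topics mentioned on that page.
    Returns a dict mapping node label -> list of page ids.
    """
    nodes = {f"{topic} and {other}": [] for other in associated}
    nodes[f"{topic} and Others"] = []
    for page_id, topics in sorted(pages.items()):
        if topic not in topics:
            continue  # page does not belong under this node at all
        matched = [other for other in associated if other in topics]
        if matched:
            for other in matched:
                nodes[f"{topic} and {other}"].append(page_id)
        else:
            nodes[f"{topic} and Others"].append(page_id)
    return nodes


pages = {
    "p1": {"California", "Sun"},
    "p2": {"California", "Wine"},
    "p3": {"California", "Wine", "Sun"},
    "p4": {"California"},
    "p5": {"North Dakota"},
}
nodes = split_topic(pages, "California", ["Sun", "Wine", "Automobiles"])
```

Applying the same function recursively to any resulting node that is still too large would continue the rebalancing.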
The only thing missing from this formulation of the new tree-balancing function is the discovery of the associations themselves. Applying the embodiments described herein to the same table discussed in the previous section extracts this information from the set of pages. The method tells us not only which topics are related but also gives an indication of the level of support for each association in the database. Once a problematically large topic has been identified, the list of associations found by the algorithm that include this topic can be consulted to determine how to divide the topic. The use of tree-based storage and retrieval techniques is known to those skilled in the art, and such methods include variations such as B-trees, k-d trees, tries, k-d tries, and grid files. Hashing schemes can be used in place of, or in addition to, tree-based methods per se. There are efficiency gains to be made both in storage (main memory and offline storage) and in running time by taking advantage of the distribution of the application data. The embodiments described herein may, as shown above and in other ways, be used to obtain a better understanding and exploitation of the distribution of the data. The advantages include all those listed for the first alternative above, with one important addition: if one is already using the method to find lists of sites related to a given query, then one is already compiling the exact list of associations needed here to help balance the search tree.

Application of the Principles Described Herein to the Analysis of Sales, Mailings, and Related Marketing Activities

Marketing executives within retail companies, advertising/marketing agencies, magazines, newspapers, radio, television, film, and internet companies, and nonprofit and charitable organizations need to know what types of people are likely to buy or contribute.
In all these and other marketing contexts, it is very useful and valuable to be able to analyze the data from previous marketing campaigns (we will use the term "mailings", although other campaigns and promotions are also included) and from previous purchases of the relevant goods and services, or previous charitable contributions (we will refer to these as "products"). It is useful for marketing executives, salespeople, and management to know such things as: Which products tend to be bought together (by the same customer, perhaps within the same transaction)? Which previous advertising campaigns or mailings produced a good response (high sales of a product) and which did not? Which demographic factors correlated with total spending on our company's products last year? Is it women aged 25 to 40 in the Midwest region who buy our products? Such questions can be addressed by analysis of databases organized in terms of customers, transactions, demographic factors, prior marketing campaigns, and sales of particular products. For charitable organizations the basic idea is the same, although instead of "sales" and "customers" the application is to "contributions" and "donors". The principles described herein can be applied successfully to these analysis tasks, where one of the main computational challenges is the discovery of associations (correlations) between sets of variables or attributes in very large databases. Table 8 illustrates the application to the analysis of a database of product purchases. Table 9 is similar, except that it illustrates the case in which the data record not only purchases but also information on previous marketing campaigns. Note that these schemes can be augmented by the inclusion of additional columns corresponding to demographic attributes of the customers, for example region of residence, age group, income group, gender, occupational category, and community involvement or leisure activities. The rows in Table 8 correspond to customers (and/or potential customers), while the columns correspond to the products (goods or services) that are purchased (denoted by 1) or not purchased (denoted by 0) by particular customers. The value in table cell [i, j] is one (1) if customer i has purchased product j and is zero (0) otherwise.
Table 8

The rows in Table 9 correspond to customers (and/or potential customers), while the columns correspond to mailings (or other marketing campaigns) and to products (goods or services) that were purchased (denoted by 1) or not purchased (denoted by 0) by particular customers. For the mailing columns, the value in table cell [i, j] is one (1) if customer i received mailing j and is zero (0) otherwise. For the product columns, the value in table cell [i, j] is one (1) if customer i purchased product j and is zero (0) otherwise.
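A data matrix of the kind just described, with mailing columns followed by product columns, might be assembled as in the following Python sketch. The customer identifiers, mailing and product names, and the function name `build_marketing_matrix` are all hypothetical, and the input format (one set of mailings and one set of products per customer) is an assumption made for illustration.

```python
def build_marketing_matrix(customers, mailings, products, received, purchased):
    """Build a 0/1 matrix with one row per customer.

    Columns are the mailings followed by the products.
    received / purchased: dicts mapping customer -> set of mailings / products.
    """
    columns = ([("mailing", name) for name in mailings]
               + [("product", name) for name in products])
    matrix = []
    for customer in customers:
        got = received.get(customer, set())
        bought = purchased.get(customer, set())
        row = [1 if name in got else 0 for name in mailings]
        row += [1 if name in bought else 0 for name in products]
        matrix.append(row)
    return columns, matrix


customers = ["c1", "c2"]
mailings = ["spring_catalog"]
products = ["jersey", "shoes"]
received = {"c1": {"spring_catalog"}}
purchased = {"c1": {"jersey", "shoes"}, "c2": {"shoes"}}

columns, matrix = build_marketing_matrix(
    customers, mailings, products, received, purchased)
```

Demographic attributes, once discretized, could be appended as further 0/1 columns in the same way.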
Table 9

The steps involved in applying the principles described herein to a sales/marketing database include:

1. Obtain the sales/marketing database as described above. Where necessary, use methods known in the art to transform continuous-valued variables into discrete-state variables.

2. Present this database, in whole or in part, so that each customer in the database corresponds to one of the M objects (rows) in the data matrix of the embodiment and so that each product or mailing corresponds to an attribute (column) of the data matrix. The mailing attributes (if any) plus the product attributes together comprise the N attributes (columns) in the data matrix.

3. Apply the base method described above, or one of the other embodiments described herein, to the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:
• A display or graphics printer, or
• A rule-generating preprocessor for a rule-based system, or
• A report for marketing personnel, magazine/newspaper circulation directors, sales personnel, managers, or other users of the database query system, or a report generation system, or
• Another computer program that performs some type of additional analysis of the data, for example deeper statistical analysis (for example, multiple regression) on the correlated variables, or
• Another computer program that performs some transformation or optimization of the database.

The output in this application can be useful in several possible ways.
For example, the output may include correlated k-tuples that comprise sets of products that tend to be purchased together, either within the same transaction or by the same customer across different transactions. Such information can be used to develop "package" and co-marketing campaigns, such as when buyers of NBA basketball tickets receive discount coupons for NBA team jerseys, basketball shoes, and other basketball-related merchandise. While it is perhaps not surprising that basketball fans like to wear NBA team jerseys, the steps described above are able to uncover other associations between products that are not so obvious. As another example, the output may include correlated k-tuples that represent particular advertising campaigns correlated with particular product purchases. Such information can help marketing executives focus their resources on new marketing campaigns of the type most likely to increase sales.

Use of the Principles Described Herein in the Clustering of Customer Data

Another, quite different application of the principles described herein to marketing practice is obtained by considering the transpose of the data matrix described above. Instead of customers as objects (rows) and products and demographic factors as attributes (columns), consider the case in which the customers correspond to the columns and the product and demographic variables correspond to the rows (see Table 10). Applying the principles described herein to this scenario produces correlated k-tuples of customers or customer profiles in the feature space of demographic and purchasing patterns. This can be seen as a way of clustering customer data into customer groups or customer profiles that are roughly similar in terms of their buying habits and lifestyles. Such clustering can be useful in the design of special "target groups", to allow a more nearly optimal allocation of marketing resources.
Once this transposition of the data is contemplated, the other steps apply in a manner completely analogous to the descriptions given above for marketing activities. The use of the method on the "transpose" of the marketing database shown previously, in order to cluster customers, is illustrated in Table 10. Now the columns correspond to a set of customers, while the rows correspond to purchased products and demographic characteristics. There are M' rows and N' columns, where perhaps M' = N and N' = M, for the original M and N described above. The value in table cell [j, i] is one (1) if customer i purchased product j or has demographic characteristic j and is zero (0) otherwise.
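The transposition itself is mechanical, as the following short Python sketch suggests. The matrix values are hypothetical; only the relationship M' = N and N' = M is taken from the text.

```python
# Original matrix: M = 2 customers (rows) by N = 3 attributes (columns).
matrix = [
    [1, 0, 1],
    [0, 1, 1],
]

# Transpose: attributes become rows and customers become columns.
transposed = [list(column) for column in zip(*matrix)]

m_prime = len(transposed)     # M' equals the original N
n_prime = len(transposed[0])  # N' equals the original M
```

The same base method can then be run on `transposed` without modification, since it only sees a matrix of objects by attributes.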
Table 10

Application of the Principles Described Herein to the Analysis of Medical, Epidemiological, and/or Public Health Databases

Scientists and medical practitioners are well aware that many human diseases and conditions, physical and mental, are caused by complex interactions among many potential contributing factors. Such factors may include particular genetic conditions or abnormalities, exposure to biological pathogens, aspects of diet, environmental factors (air, water, and noise pollution), exposure to hazards in the home or workplace, emotional stress, drug abuse, and poverty, among others. The true "causes" of a given condition often remain impossible to determine, although much anecdotal evidence is typically offered in attempts to explain some cases. The discovery and prevention of threats to health has been aided in recent times by the ability of researchers, insurance company representatives, epidemiologists, and public health officials to compile and analyze large amounts of data on real people, healthy and sick, living or deceased. As in other applications of computation and statistical analysis to databases, one is confronted in this field with enormous numbers of variables and the exponential complexity of their potential interactions. This type of analysis can be greatly improved by methods that efficiently find correlations and associations among tens, hundreds, or thousands of variables. The principles described herein are applicable to such a situation. The application to medical data can be represented in terms of the M by N data matrix that we have used in other sections of this document. In a specific application embodiment, the rows of the data matrix correspond to particular patients or subjects in a health study, and the columns correspond to factors contributing to a specific disease or group of diseases.
Again, these factors can include socioeconomic factors, lifestyle factors (exercise, diet), aspects of the patient's domestic or work environment (for example, exposure to carcinogenic chemicals), previous medical treatments, and so on (see Table 11). The rows in Table 11 correspond to patients or human subjects in a study, while the columns correspond to potential disease factors. The value in table cell [i, j] is one (1) if patient i has experienced or been exposed to factor j, and is zero (0) otherwise.
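The binary patient-by-factor matrix just described can be sketched as follows. This is only an illustration: the record format, the patient identifiers and the factor names are hypothetical and do not come from the source.

```python
# Hypothetical patient records: each maps a patient id to the set of
# factors that patient has experienced or been exposed to.
records = {
    "patient_1": {"smoker", "high_fat_diet"},
    "patient_2": {"asbestos_exposure"},
    "patient_3": {"smoker", "asbestos_exposure"},
}

def build_matrix(records):
    """Return (row_labels, column_labels, matrix) where matrix[i][j] is 1
    if patient i has factor j and 0 otherwise."""
    patients = sorted(records)
    factors = sorted(set().union(*records.values()))
    matrix = [[1 if f in records[p] else 0 for f in factors] for p in patients]
    return patients, factors, matrix

patients, factors, matrix = build_matrix(records)
```

Sorting the row and column labels is merely a convenience that makes the layout reproducible; any fixed ordering of patients and factors would serve.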
Table 11

In some specific embodiments, there is not just a single disease represented implicitly, but instead a number of different diseases represented as attributes together with the factors shown in Table 11 and described above. For example, a particular patient p may have lung cancer but not diabetes or heart disease; row p would therefore have a 1 in the column corresponding to lung cancer and values of 0 in the columns corresponding to diabetes and heart disease. The steps involved in applying the present invention to the database of medical, epidemiological and lifestyle factors include:
1. Obtain the database of medical, epidemiological and lifestyle factors as described above. When necessary, use methods known in the art to transform continuous-valued variables into discrete-state variables.
2. Present this database, in whole or in part, so that each patient or subject in the database corresponds to one or more of the M objects (rows) in the data matrix of the embodiment and so that each potential disease factor corresponds to an attribute (column) of the data matrix. The additional attributes that represent different diseases, together with the disease factors, comprise the N attributes (columns) in the data matrix.
3. Use the base method or other embodiments described herein on the data matrix.
4. Direct the discovered correlated k-tuples of attributes to:
• A visualizer or graphic printer, or
• A rule-generator pre-processor for a rule-based system, or
• A report for doctors, researchers, public health officials, managers or other users of the computer database query system, or a reporting system, or
• Another computer program that performs some type of additional data analysis, for example, one that performs deeper statistical analysis (for example, multiple regression) on the correlated variables, or
• Another computer program that executes some transformation or optimization of the database.
The output in this application can be useful in several ways. For example, the output may include correlated k-tuples that comprise sets of factors associated with one or more disease conditions. Such information, perhaps refined through additional statistical analysis, can provide insight into the understanding, treatment and prevention of those particular diseases. As another example, the output may include correlated k-tuples that comprise sets of factors associated with each other, where such associations were not previously known. The discovery of associated lifestyle factors, such as particular diets and obesity, or particular professions and high levels of alcohol consumption, may by itself be useful in improving public health policy and medical practice. All discovered correlations can potentially be of great benefit to insurance providers, public or private, since they can make their actuarial tables and insurance policies reflect precise predictions of health and life expectancy based, for example, on lifestyle, socioeconomic and other factors.

Uses of the Principles Described Herein in the Grouping of Patient Data

A different application of the principles described herein to public health and insurance policy and practice is obtained by considering the transposition of the data matrix described above. Instead of patients as objects (rows) and potential disease factors as attributes (columns), consider the case where patients correspond to columns and factors correspond to rows (see Table 12). The present invention is used in this scenario to produce correlated k-tuples of patients or patient profiles in feature space. This can be seen as a form of patient data grouping, into groups of patients or patient profiles that are approximately similar in terms of their lifestyle factors.
Such grouping may be useful in designating "low risk" or "high risk" types of patients or insurance applicants, to allow a better allocation of health services, research programs, insurance protection, or other resources. Once this transposition of the data is contemplated, the other steps of the preceding application to the analysis of medical and other data apply in a manner completely analogous to the descriptions given above. The use of the principles on the "transposition" of the disease factor database shown above, in order to group the patients or policyholders in factor space, is shown in Table 12. Now the columns correspond to a group of patients, subjects of a medical study or potential insurance policyholders, while the rows correspond to potential disease factors, which may include lifestyle factors, socioeconomic factors, workplace factors and so on. There are M' rows and N' columns, where perhaps M' = N and N' = M for the M and N described above. The value in table cell [j, i] is one (1) if patient i has or has been exposed to factor j, and is zero (0) otherwise.
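The transposition described above, in which an M by N matrix of patients by factors becomes an M' by N' matrix of factors by patients, can be illustrated with a small sketch. The matrix values below are invented for illustration only.

```python
def transpose(matrix):
    """Swap rows and columns: an M x N matrix becomes an N x M matrix."""
    return [list(column) for column in zip(*matrix)]

# Rows are patients, columns are factors (M = 3, N = 2).
by_patient = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# Rows are now factors, columns are patients (M' = 2, N' = 3).
by_factor = transpose(by_patient)
```

After transposition, cell [j, i] of the new matrix holds what cell [i, j] held before, so the same coincidence detection steps can be applied with patients now playing the role of attributes.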
Table 12

Application of the Principles Described Herein to the Discovery of the Causes of Failures in Complex Systems

The administrators of complex integrated systems, such as computer networks and factory automation systems, have faced difficult diagnostic problems since such systems first appeared. When a series of events in the system (perhaps over a prolonged period) leads to a failure of the system as a whole, diagnosing the true cause of the failure can be an almost insurmountable task. For example, a network interface card in a bridge computer that fails intermittently under high load may not cause the host computer to crash, but may lead to errors in other computers that rely on that card (by proxy) to service their network requests. Such a problem would be extremely difficult to track down using conventional diagnostic techniques. Tools that give administrators a better analysis of the conditions in the system as a whole that lead to the failure would speed up the diagnosis and correction of the underlying problem. We must first define the database to which the principles described herein will be applied. The database as a whole can be thought of as a status record of a series of components over time. The columns of this database, when viewed in the data matrix format used throughout this document, represent the series of components; the rows represent discrete points in time. The values in the table are an encoding of each component's state (active, inactive, empty, in error and so on) at the time in question. Such logging procedures are well known to those skilled in the art. The rows in Table 13 correspond to points in time, while the columns correspond to individual components in the system. The value in table cell [i, j] is the coded state of component j at time i.
The steps involved in applying the method of the present invention to the analysis of a system operations database include:
1. Create a database of system components and their states as described above. The selection of the set of component states will be driven by the behaviors of interest to system administrators, as well as by the components themselves.
2. Present this database, in whole or in part, as a data matrix so that each column in the data matrix corresponds to a component in the system and each row in the data matrix corresponds to a point in time in the series.
3. Use the base method or one of the other embodiments described herein on the data matrix.
4. Direct the discovered correlated k-tuples of attributes to:
• A graphical viewer or printer, or
• A rule-generator pre-processor for a rule-based system, or
• A report for system administrators, or a report generation system, or
• Another computer program that executes some type of additional analysis of the data, for example, one that executes more in-depth analysis on the correlated variables.
The output in this application can be used to indicate the events in the system that are normally observed to co-occur with a given fault. Given the formulation of the database, we need not restrict ourselves to the states of the components at the time of failure; we can expand the examination of fault conditions to any range of points in time for which the database has records. This allows the method to help elucidate the subtle causal relationships between components that ultimately lead to the failure. In the simplest case, the output can be used to eliminate some components from scrutiny if they are observed not to be correlated with the failure.

Application of the Principles Described Herein to the Analysis of Complex Systems

Complex systems define a large family of applications that are similar in some respects.
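To make concrete the idea of eliminating components that are not correlated with a failure, the sketch below compares the observed number of co-occurrences of two component states with the number expected if the two columns were independent. This is a deliberately naive stand-in written for illustration, not the sampling-based base method of the invention; the status log, component names and state codes are all hypothetical.

```python
# Hypothetical status log: one row per time point, one coded state per
# component, in the column order (nic, disk, server).
log = [
    ("ok",    "ok", "ok"),
    ("error", "ok", "failed"),
    ("ok",    "ok", "ok"),
    ("error", "ok", "failed"),
    ("ok",    "ok", "ok"),
    ("error", "ok", "ok"),
]

def cooccurrence_ratio(log, col_a, state_a, col_b, state_b):
    """Observed count of the two states occurring together, divided by the
    count expected if the two columns were statistically independent."""
    n = len(log)
    count_a = sum(1 for row in log if row[col_a] == state_a)
    count_b = sum(1 for row in log if row[col_b] == state_b)
    observed = sum(
        1 for row in log if row[col_a] == state_a and row[col_b] == state_b
    )
    expected = count_a * count_b / n
    return observed / expected if expected else 0.0

nic_ratio = cooccurrence_ratio(log, 0, "error", 2, "failed")
disk_ratio = cooccurrence_ratio(log, 1, "ok", 2, "failed")
```

A ratio well above 1 (here the network card errors) suggests a co-occurrence worth examining, while a ratio near 1 (here the disk) suggests that the component can be set aside. A real analysis would also need a significance measure, which this sketch omits.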
For the purpose of this description, complex systems are defined as systems for which no direct, detailed modeling approaches exist, because these systems comprise a huge number of interacting individual components or parts. Examples include (but are not limited to) economies, individual human behavior, productivity in groups of employees, weather patterns, crime in a nation, etc. In each of these cases, there are no known methods for modeling the system exactly, although variables or sets of variables are used to measure the state of those systems (examples in the case of the economy would be the interest rate, stock market values and inflation rates). For the purposes of this description, the events in those complex systems take the form: pre-condition, action and post-condition. These represent the state of the system before the actions were taken, the actions themselves, and the resulting state at some point after the implementation of the actions. In other words, the set of previous perturbations of the system and their results are used as a history of the system from which information about the characteristics of the system is derived. Databases of complex systems that can effectively use the principles described herein must meet certain restrictions. There must be some set of variables (either commonly used or derivable from knowledge of the domain) used to measure the state of the given system. These variables are used in the pre-condition and post-condition parts of each database entry. Additionally, there is some general set of actions applicable to the system that covers the methods by which the system is known to be alterable. Returning to the example of economies, the action set would include everything under the heading of "fiscal policy".
Formally, the database must include attributes that represent zero or more pre-condition variables, zero or more action variables, and zero or more post-condition variables. Putting aside the trivial case where the database contains zero pre-condition and post-condition variables and zero action variables, there are seven cases to consider. These are presented exhaustively below, with examples where appropriate. Note that in each case there are two interpretations of relevance. For example, consider the case where there are pre-condition variables and action variables but no post-condition variables. Correlations can be derived in two ways. The database itself could have no post-condition variables in it (and the returned set of correlations is filtered to remove any correlations that involve only variables of one type), or it may be that only the returned set of correlations contains no post-condition variables, although the database actually contains them. For the purposes of discussion, we assume that the former is the case: we can always filter the results of the method on a database that has more types of variables to leave a set of correlations that lacks some types of variables. If the database contains variables of only one type (that is, only action variables, or only pre-condition or post-condition variables), then the correlations derived from it can be interpreted in one of two ways. If the variables are pre-condition or post-condition variables, then the results indicate situation archetypes, that is, sets of attribute values (or, equivalently, variable states) that tend to be observed together. An example from the domain of weather patterns would be rain and low barometric pressure. If only action variables are present in the database, then the correlations found between them would indicate sets of decisions that tend to be made together.
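The filtering of returned correlations by variable type, mentioned above, might be sketched as follows. The attribute names, their assignment to the three kinds of variable, and the representation of a correlation as a tuple of attribute names are all assumptions made for this illustration.

```python
# Hypothetical mapping from attribute name to kind of variable.
kind_of = {
    "rain": "pre",
    "low_pressure": "pre",
    "seed_clouds": "action",
    "flooding": "post",
}

def kinds_spanned(correlation, kind_of):
    """Return the set of variable kinds that appear in one correlated k-tuple."""
    return {kind_of[attribute] for attribute in correlation}

def drop_single_kind(correlations, kind_of):
    """Remove correlations whose attributes are all of one kind."""
    return [c for c in correlations if len(kinds_spanned(c, kind_of)) > 1]

def without_kind(correlations, kind_of, excluded):
    """Remove correlations that involve any attribute of the excluded kind."""
    return [c for c in correlations if excluded not in kinds_spanned(c, kind_of)]

correlations = [
    ("rain", "low_pressure"),
    ("low_pressure", "seed_clouds"),
    ("seed_clouds", "flooding"),
]
mixed = drop_single_kind(correlations, kind_of)
no_post = without_kind(mixed, kind_of, "post")
```

Applying both filters to results from a database holding all three kinds of variable leaves only pre-condition/action correlations, which is one way to read the second interpretation described in the text.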
In a military domain, we might discover that flanking and offensive maneuvers tend to co-occur. As these types of databases are very similar to others described elsewhere in this document (as would be the application of the method in those cases), this section does not explicitly address them. The cases where the database contains variables of only two of the three types are three in number. The correlations found in a database that contains only pre-condition and action variables describe the relationship between situations in the domain and the selection of actions. An example is football play calling (note that this also involves a complex system that cannot be modeled in any direct, detailed way: the play caller). Here the correlations indicate the tendencies of the entity that takes the action, for example a coach or a quarterback. If the database contains only action and post-condition variables, then the correlations found reveal the effectiveness of sets of actions regardless of the pre-conditions. Returning to the football example, correlations of this type would indicate the ability of the team in question to execute certain actions (for example, if plays called on "third and long" tended to result in a poor set of post-conditions, such as a fourth down, we would know that the team tended to be ineffective in this situation). Another important example is drug interaction. In this domain, the actions are the drugs given and the post-conditions are the side effects reported by a patient. While the utility of the case where the database contains only pre-condition and post-condition variables may not be clear on first examination, it may be one of the most useful cases.
Here we are interested in the things that tend to happen after a given situation in the domain regardless of the actions taken by the decision maker, or we are in a domain where no actions can be taken (or none that affect the system itself). An example of the former is the fact that the "third and long" pre-condition in football tends to be followed by the "fourth and long" post-condition. In fact, it may be the latter case that is the most interesting. Consider weather patterns. If we focus on the post-condition "tornadoes" (that is, we filter the resulting correlations to keep only those that involve the appearance of "tornadoes" in the post-condition), then those correlations tell us which signs are precursors of imminent tornadoes. The last case is the most general: the database contains all three types of variables. Note that a database of this form can exhibit correlations of all the preceding types. Example domains have already been given (economies, crime in a population, etc.). Here the correlations can be regarded as classifying sets of actions (given some set of pre-conditions) according to the quality of the post-conditions. The last consideration is the types of data contained in the database entries. Binary-valued attributes, as noted throughout this document, are readily accepted by this method. Other value types must have a limited range of discrete values. When this is not the case (that is, for real-valued or integer-valued attributes), some transformation must be applied to the values in question to reduce their range to a more manageable number. Various clustering methods are among the preferred methods for this, and are well known to those skilled in the art. In all cases, the correlations returned by the method are ideal inputs for a case-based reasoning package.
Given a system condition (that is, the current condition), a case-based reasoning tool could use the associations found by the principles described herein as a basis for analyzing the possible results of selections from the set of actions that can be applied to the system. Generally, the principles described herein can be used as a tool to help decision makers. Those decision makers can be "real" or artificial (that is, the method can be used as part of an artificial intelligence system whose purpose is to make decisions in the domain of interest).

Description of the Application of the Principles Described Herein to Databases with Pre-Condition Variables and Action Variables:

Given the restrictions noted above on the form of the database, it is clear that the input requirements of the embodiments described elsewhere herein are met. In the convenient data matrix representation used elsewhere in this document, the M rows in this context are the selected set of pre-conditions and actions taken. If the entity that applies the actions can sensibly be personified, then those rows can represent a record of the decisions made by this entity and the states of the system at the time they were made. The N columns comprise the set of state variables that define the state of the system and the set of all action variables that describe the ways in which the system can be altered (see Table 14). The rows in Table 14 correspond to instances of, or combinations of, system states (the pre-condition of the system) followed by the actions taken in response to that state, while the columns correspond to the variables that describe the state of the system and the possible actions that can be applied to the system. The value in table cell [i, p] is an encoding of the measurement of state variable p in instance i if column p is a pre-condition column, and is an encoding of the action taken in instance i if column p is an action column.
There are some other considerations that must be addressed before applying the principles described herein in any given domain. The set of state variables must be defined. This is left to experts in the domain itself (for example, football coaches, military analysts, etc.). The examples noted previously are football play calling by coaches and military decision making by generals. In general, the preferred implementations of this invention will apply the method to databases of this form in order to extract information about the action-taking entity. Correlated state variables and actions describe the tendencies of this entity. As noted above, these can be further analyzed using case-based reasoning tools to give a better picture of the decisions the entity is likely to make in a given system state. Another use of the invention on databases of this type is the discovery of indicators of fraud in tax collection. Here the pre-conditions can be a set of attributes intended to capture the salient details of a tax return (aspects such as total income, total tax owed as reported by the individual or company, tax exemptions claimed, etc.), and the action variables can be selected to define a set of possible tax evasion methods. The correlations found by the invention then indicate associations between types of tax returns and types of tax evasion. Because coincidence detection statistically bounds the correlations it returns, we find not only indicators of evasion but also the reliability of those findings. Since tax collectors cannot investigate every tax return sent to them, this method lets them find a well-selected subset of returns that are more likely to result in findings of fraud (and greater monetary returns for the government).
The last use presented here is in the domain of insurance fraud and is very similar to the application of the principles described herein to tax collection. The pre-condition variables are intended to capture a set of details in an insurance claim that are thought to be possible indicators of fraud (amount claimed, particulars of the insured entity, etc.), and the action variables represent types of fraud. The results found when the principles described herein are applied show correlations between the details of insurance claims and types of fraud. Insurance companies cannot investigate all claims sent to them; applying the principles described herein will thus reduce the total list of claims to a set that is more likely to lead to fruitful investigations. The steps involved in applying the principles described herein to a database containing pre-condition and action variables include:
1. Create the database of the states of the system and the actions taken by the action-taking entity as described above. When necessary, use methods known in the art to transform continuous-valued attributes into discrete-state attributes.
2. Present this database, in whole or in part, so that each state/action set corresponds to one of the M objects (rows) in the data matrix and so that each state variable and each action type corresponds to an attribute (column) of the data matrix.
3. Use the base method or another embodiment described herein on the data matrix.
4. Direct the discovered correlated k-tuples of attributes to:
• A visualizer or graphic printer, or
• A report for decision makers or a report generation system, or
• Another computer program that will use the correlations found as a basis for decision making (for example, a case-based reasoning package), or
• Another computer program that executes some transformation or optimization of the database.
This application of the principles described herein yields a list of correlated state/action sets that gives insight into the tendencies of the action-taking entity. When one is interested in only one state of the system (or in only a few aspects of a given state), for example the current state, one could filter out of the results any correlations that do not share a given set of aspects with that state. The result would represent the correlations between the aspects of interest and the actions taken in response. The resulting insight into the methodology of the action-taking entity can be used in further decision making.

Description of the Application of the Principles Described Herein to Databases with Pre-Condition Variables and Post-Condition Variables:

Here also, the restrictions noted above on the form of the database ensure that the input requirements of the embodiments described elsewhere herein are met. The M rows in this context are the instances or combinations of pre-conditions and post-conditions (viewed together, those rows can be thought of as the transitions of the system between states). The N columns comprise the set of state variables that define the state of the system before and after the transition (see Table 15). The value in cell [i, j] of Table 15 is an encoding of the measurement of state variable j either before or after the transition.
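For the pre-condition/action case, the filtering of correlations against a state of interest (keeping only those that share aspects with, for example, the current state) might look like the sketch below. Representing a correlated k-tuple as a tuple of (attribute, value) pairs is an assumption of this example, as are the football attribute names, and "sharing aspects" is interpreted here as agreeing on at least one state attribute while contradicting none; other readings are possible.

```python
def matching_current_state(correlations, current_state):
    """Keep correlated k-tuples that agree with the current state on at
    least one state attribute and contradict it on none."""
    kept = []
    for correlation in correlations:
        shared = [(a, v) for a, v in correlation if a in current_state]
        if shared and all(current_state[a] == v for a, v in shared):
            kept.append(correlation)
    return kept

# Hypothetical correlated k-tuples of (attribute, value) pairs.
correlations = [
    (("down", 3), ("distance", "long"), ("play", "pass")),
    (("down", 1), ("play", "run")),
    (("play", "pass"), ("formation", "shotgun")),
]
current_state = {"down": 3, "distance": "long"}
relevant = matching_current_state(correlations, current_state)
```

What remains pairs the aspects of interest with the actions that tended to be taken in response to them.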
Table 15

There are some other considerations that must be addressed before applying this invention in any domain. The set of state variables must be defined. This is left to experts in the domain itself. Equally important is the selection of the amount of time that defines the granularity of the transitions. This too is left to those skilled in the art to decide, based on their own experience and the kinds of information they wish to extract. It is assumed that some minimum granularity is imposed either by the complexity of collecting such data or by the useful limits of such data. Given this, any multiple of this minimum granularity can be chosen as the time between the pre-conditions and post-conditions. In the end, this time distance must be large enough for the system to have changed its state. Possible domains of application for this invention include economic and fiscal policy, stock market forecasting, athletic talent scouting and weather forecasting. Brief descriptions of each follow in turn, to show how these problems can be arranged to fit the specifications of the method of the present invention. In the domain of economic and fiscal policy, a database of states is proposed in which the states are sets of economic indicators (inflation and interest rates, housing starts, GDP and so on). Each row in the database must contain two such states (the pre-condition and post-condition of the system) separated by a fixed amount of time. The correlations found by the method of the present invention then give insight into the cycles in the economy.
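The construction of transition rows, each holding two states separated by a fixed amount of time, can be sketched as follows. The indicator names, the discretized values and the use of a step count as the time quantum are assumptions made for illustration.

```python
def build_transitions(states, quantum):
    """Pair each observed state with the state `quantum` steps later.

    Each returned row holds the pre-condition values followed by the
    post-condition values."""
    if quantum < 1:
        raise ValueError("quantum must be a positive number of steps")
    return [
        states[t] + states[t + quantum] for t in range(len(states) - quantum)
    ]

# Hypothetical discretized indicators per period: (inflation, interest_rate).
states = [
    ("low", "low"),
    ("low", "high"),
    ("high", "high"),
    ("high", "low"),
]
rows = build_transitions(states, quantum=2)
```

Choosing a larger quantum yields fewer rows from the same history but lets the correlations speak about states further in the future, which mirrors the trade-off on granularity discussed in the text.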
For stock market forecasting, a (presumably large) set of stocks that are thought to influence one another is proposed. Again, a fixed period is selected for the transitions. The rows of this database indicate the transition of those stocks over the selected period. The output of the invention then indicates which sets of stocks "move" in a correlated manner over that period. Athletic talent scouting (for example, by professional teams before a draft of young players) would involve an examination of the history of such selections. Each row of the data matrix would pertain to an individual player. The pre-condition state is a selection of statistics (and any other information available about the player) thought to be indicative of future performance at the professional level. The post-condition state would then be some set of variables designed to measure the success of that player at the professional level. The correlations discovered by the invention would help teams find the best set of indicators of future success on which to base their selections. Note that in this case the pre-conditions and post-conditions need not be exactly the same. No restriction is intended that would force the state representations to be equivalent.
Weather forecasting is a very direct application of this invention. Here, the granularity of the selected time quantum depends only on the type of information that the user wishes to discover. In other words, the time quantum determines the desired horizon of prediction. If we select a single day, then the correlations found by the method will help us predict the weather (given a set of values for each of the pre-condition variables that describe the current weather) one day in advance. If a week (or a month, etc.) is the selected quantum, then that is how far into the future the predictions will extend. In general, the preferred embodiments of this invention will apply the method of the present invention to databases of this form in order to extract information about how the current state of the system acts as a predictor of a future state. Given probabilistically bounded correlations between the states of the system, effective predictions can be made about the behavior of the system. The steps involved in applying the present invention to a database containing pre-condition and post-condition variables include:
1. Create the database of transitions between the states of the system over the selected time quantum as described above, where a system state is represented by values of the state variables. When necessary, use methods known in the art to transform any continuous-valued state variables into discrete-state variables.
2. Present this database, in whole or in part, so that each state-to-state transition corresponds to one of the M objects (rows) in the data matrix of the embodiment and so that each state variable corresponds to an attribute (column) of the data matrix.
3. Use the base method or another embodiment described herein on the data matrix.
4. Direct the discovered correlated k-tuples of attributes to:
• A visualizer or graphic printer, or
• A report for decision makers or a report generation system, or
• Another computer program that will use the correlations found as a basis for decision making (for example, a case-based reasoning package), or
• Another computer program that executes some transformation or optimization of the database.
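Step 1 above calls for continuous-valued state variables to be transformed into discrete-state variables and leaves the choice of method to those skilled in the art. As a minimal sketch, equal-width binning is used below purely because it is the simplest option to show; it is a technique swapped in for illustration and is not claimed to be the method preferred by the source. The pressure readings are invented.

```python
def equal_width_bins(values, n_bins):
    """Map each real value to a bin index in 0..n_bins-1 using bins of
    equal width spanning the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical barometric pressure readings in hPa.
pressures_hpa = [980.0, 995.0, 1010.0, 1025.0, 1040.0]
states = equal_width_bins(pressures_hpa, 3)
```

Each reading is thereby reduced to one of three coded states, giving the attribute a small, manageable range of discrete values suitable for the data matrix.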
Description of the Application of the Principles Described Herein to Databases with Action Variables and Post-Condition Variables:

Here, too, the restrictions noted above on the form of the database ensure that the input requirements of the embodiments described elsewhere herein are met. The M rows in this context are the total selected set of actions and post-conditions. The N columns comprise the action types together with the set of state variables that define the resulting state of the system (see Table 16). The rows in Table 16 correspond to observed instances of, or hypothetical combinations of, the actions applied to the system and the resulting system states. The columns correspond either to possible actions that can be applied to the system or to individual state representation variables. If column p corresponds to one of the action types in the database, the value in cell [i, p] of Table 16 is an encoding of the action taken. If column j is a column used to indicate some aspect of a system state, then the value in table cell [i, j] is an encoding of the measurement of that aspect.
Table 16

As noted in the previous examples, the decisions to be made before applying the method of the present invention to databases of this type include the selection of the state variables used to record the state of the system at a given point in time and the selection of the time quantum used to separate the actions from the post-conditions. Those selections are left to those with expertise in the domain of the application. The time quantum selected must, in all but trivial cases, be large enough for the actions to have had some effect on the state of the system. Possible uses of this invention include fields as widely varying as player management in hockey and the study of drug interaction. For the purpose of this document, player management in hockey refers only to the selection of players for the next shift on the ice, given knowledge of the history of those players. The action variables in this case are binary values that indicate whether or not a player is selected for the shift, while the post-condition variables include a set of outcomes within the hockey domain (such as the relative score in the shift, the penalties incurred, the duration of any penalties, the relative number of shots taken, etc.). Formulated this way, the findings produced by the invention indicate the correlations between sets of selected players and the results of the next shift. In situations where the opposing players are known in advance, those players can be added to the action variables. In this case, we will find correlations between sets of players, both on our team and on the opposing team, and the results. Given this knowledge, the invention is useful as an aid to coaches in selecting players who are likely to produce better results. The study of drug interaction is a natural subject for this invention.
Here, the action variables are binary values that indicate whether or not a given patient has received a drug or combination of drugs. The post-condition variables indicate the list of side effects reported by the patient. The results found by the invention then indicate the statistical correlations between sets of drugs given to patients and side effects. Thus, the method of the present invention can be used to determine contraindications in the use of drugs, although it is perhaps better regarded as a way to select the sets of interactions on which to focus additional study.

The steps involved in the application of the present invention to a database containing action and post-condition variables include:

1. Create the database of transitions between the states of the system and actions over the selected quantum of time as described earlier, where a state of the system is represented by the value of a state variable and an action is represented by the value of an action type. When necessary, use methods known in the art to transform continuous-valued state variables and action types into discrete state variables and action types.

2. Present this database, in whole or in part, to an embodiment of the present invention so that each action set / state set pair corresponds to one of the M objects (rows) in the data matrix of the embodiment and so that each state variable or action type corresponds to an attribute (column) of the data matrix.

3. Apply the base method or another embodiment described herein to the data matrix.

4. Direct the discovered k-tuples of correlated attributes to:
• A display or graphics printer, or
• A report for decision makers or a report generation system, or
• Another computer program that will use the correlations found as a basis for decision making (for example, a case-based reasoning package), or
• Another computer program that executes some transformation or optimization of the database.

Description of the Application of the Principles Described Herein to Databases with Pre-condition Variables, Action Variables and Post-condition Variables:

Here too, the constraints previously noted on the form of the database apply, so that the database complies with the input requirements of the embodiments described elsewhere herein. The M rows in this application are the total selected set of pre-conditions, actions and post-conditions. The N columns comprise the set of state variables that define the state of the system before and after the transition, as well as the coded action types (see Table 17). The rows in Table 17 correspond to instances or combinations of pre-conditions, actions taken and the resulting post-conditions. The columns correspond to the possible types of actions in the domain, as well as to all aspects of interest for any given situation in the domain (for both the pre-condition and post-condition columns). If column p corresponds to one of the action types in the database, the value in cell [i, p] of Table 17 is an encoding of the action taken. If column j is used to specify some aspect of either the pre-condition or the post-condition, the value in cell [i, j] is an encoding of the measurement of that aspect.
Table 17

As noted in the previous examples, the decisions to be made before applying the method of the present invention to databases of this type include the selection of the state variables used to record the state of the system at a given point in time and the selection of the quantum of time used to separate the actions temporally from the post-conditions. In this case, it should be noted that the pre-conditions and post-conditions need not be equivalent (with respect to the selection of variables). Those selections are left to those with expertise in the domain of the application. The quantum of time selected must, for example, be long enough for the actions to have had some effect on the state of the system.

Possible uses of this invention include economic policy, the fight against crime and military strategy. Given some set of variables that define the state of an economy (interest rates, inflation, GNP and so on) and a set of actions taken as part of the economic policy of the governing body (issuance and sale of government bonds, etc.), a database of economic events is created of the form: existing economic state, fiscal policy measure taken, and economic state following the policy decision. The correlations found by the method of the present invention give a measure of the effectiveness of economic policy decisions, given a state of the economy. Such knowledge would be beneficial in deciding economic policy, since it would show the historical support (or the lack of it) for a given set of decisions.

In a similar vein, the present invention can help in setting anti-crime policy through the creation of a database of the previous states of crime in the community, the policy measures taken and the resulting state of crime in the community.
State variables could include such things as rates for different types of crime (breaking and entering, self-harming, etc.), different characteristics of the crime (that is, whether or not handguns were used, etc.), and so on. The action variables in this case could include such things as minimum sentencing guidelines for various crimes, laws, the adoption of the death penalty, and funding for education and mental health. In a database of this type, the invention would find correlations that involve these existing conditions, the policy decisions and the results of those decisions. It is proposed that these correlations can be a valuable aid to those who are in charge of making such decisions.

The concept of "decision maker" needs careful consideration in the domain of military strategy. It may be the case that there is not enough of a track record to fill a database with a sufficient history of any one general's decision making. In such a case, preferred implementations can extend the concept of decision maker to include all similar decision makers. As an example, consider a single general who commands a tank division. If the general was promoted recently, it would be wise to consider the full history of all generals of the same allegiance. To increase the granularity of the use of the method, the database can be filled with the decisions made by all infantry colonels instead of those of any one colonel. The correlations found would be indicative of the tendencies of that class of generals, given some measure of the battlefield conditions they confronted when they made their decisions. Similarly, one would be in a position to determine which battlefield situations were handled poorly, because the results of the sets of decisions are available. Such knowledge would be vital when selecting a strategy against an opponent.
The steps involved in the application of the principles described herein to a database containing pre-condition, action and post-condition variables include:

1. Create the database of states and actions covering the selected quantum of time as described above. When necessary, use methods known in the art to transform continuous-valued state variables and action types into discrete state variables and action types.

2. Present this database, in whole or in part, so that each state / action / state triple corresponds to one of the M objects (rows) in a data matrix and so that each state variable or action type corresponds to an attribute (column) of the data matrix.

3. Apply the base method or another embodiment described herein to the data matrix.

4. Direct the discovered k-tuples of correlated attributes to:
• A display or graphics printer, or
• A report for decision makers or a report generation system, or
• Another computer program that will use the correlations found as a basis for decision making (for example, a case-based reasoning package), or
• Another computer program that executes some transformation or optimization of the database.
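Steps 1 and 2 above can be sketched as follows. This is a minimal Python illustration under stated assumptions: equal-width binning stands in for the "methods known in the art" for discretization, and the function names, column prefixes, bin count and sample economic figures are hypothetical. The resulting matrix would then be handed to the base method (step 3), which is not implemented here.

```python
from typing import Dict, List, Sequence, Tuple


def discretize(values: Sequence[float], bins: int = 3) -> List[str]:
    """Equal-width binning of continuous values into symbols 'a', 'b', 'c', ...

    Equal-width binning is an assumption of this sketch; any standard
    discretization method could be substituted.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    symbols = []
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        symbols.append(chr(ord("a") + idx))
    return symbols


Triple = Tuple[Dict[str, str], Dict[str, str], Dict[str, str]]


def triples_to_matrix(triples: Sequence[Triple],
                      pre_vars: Sequence[str],
                      action_types: Sequence[str],
                      post_vars: Sequence[str]) -> Tuple[List[str], List[List[str]]]:
    """One row per (pre-condition, action, post-condition) triple.

    Columns are ordered: pre-condition variables, action types,
    post-condition variables.
    """
    header = ([f"pre_{v}" for v in pre_vars]
              + [f"act_{a}" for a in action_types]
              + [f"post_{v}" for v in post_vars])
    rows = [[pre[v] for v in pre_vars]
            + [act[a] for a in action_types]
            + [post[v] for v in post_vars]
            for pre, act, post in triples]
    return header, rows


# Hypothetical data: interest rate before and after each period, and
# whether government bonds were issued ("1") or not ("0") in that period.
rate_before = [2.0, 5.0, 8.0, 5.5]
rate_after = [2.5, 4.0, 6.0, 5.0]
bond_issue = ["1", "0", "1", "0"]

pre_codes = discretize(rate_before)
post_codes = discretize(rate_after)
triples = [({"rate": p}, {"bond_issue": a}, {"rate": q})
           for p, a, q in zip(pre_codes, bond_issue, post_codes)]
header, matrix = triples_to_matrix(triples, ["rate"], ["bond_issue"], ["rate"])
```

Prefixing the column names keeps a pre-condition variable and the post-condition variable of the same name as separate attributes, which reflects the remark above that the pre-conditions and post-conditions need not use the same selection of variables.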
Those skilled in the art will understand that this description is made with reference to the preferred embodiment and that it is possible to make other embodiments employing the principles of the invention that remain within its spirit and scope as defined by the claims on the pages that follow Appendices A to E annexed hereto, which Appendices form part of this description.
APPENDIX A

File coinc.pl

# perl version of Evan Steeg's Coincidence Detection Algorithm,
# here applied to data which comes in rows and columns of ascii
# symbols. Used first for tests on artificial and real (HIV)
# protein sequence data.
# march 1996
########################################################################

$tiny_num = 0.000001;

$fact[0] = 1;
$fact[1] = 1;
$fact[2] = 2;
$fact[3] = 6;
$fact[4] = 24;
$fact[5] = 120;
$fact[6] = 720;
$fact[7] = 5040;
$fact[8] = 40320;
$fact[9] = 362880;
$fact[10] = 3628800;
$fact[11] = 39916800;

sub compare
{
    if ($a < $b) { $r = -1; }
    elsif ($a == $b) { $r = 0; }
    else { $r = 1; }
    # print "a: $a, b: $b, r: $r\n";
    return $r;
}

sub comp_aa
{
    # my ($a1, $c1, $a2, $c2, $r);
    my ($c1, $c2);
    # $a1 = substr $a, 0, 1;
    $c1 = substr $a, 1;
    # $a2 = substr $b, 0, 1;
    $c2 = substr $b, 1;
    if ($c1 < $c2) { $r = -1; }
    elsif ($c1 == $c2) { $r = 0; }
    else { $r = 1; }
    return $r;
}

# calc the factorial of a number. want (n)
# for now, it's just easier and faster to hard code them into a table
sub factorial
{
    my ($n) = @_;
    # print "n: $n\n";
    if ($n >= 0 && $n <= 11)
    {
        return $fact[$n];
    }
    else
    {
        print "ERROR: n larger than max defined factorial requested. ($n)\n";
        exit(0);
    }
}

# calc the binomial coeff. want r (number of iterations) and h
# (observed number of hits)
sub binomial_coeff
{
    my ($r, $h) = @_;
    # print "r: $r, h: $h\n";
    $rf = &factorial($r);
    $hf = &factorial($h);
    $rhf = &factorial(($r - $h));
    # print "rf: $rf, hf: $hf, rhf: $rhf\n";
    return ($rf / ($hf * $rhf));
}

# calc the chernoff. want ($observed, $expected, $r1, $T1)
sub chernoff
{
    my ($observed, $expected, $r1, $T1) = @_;
    $diff = $observed - $expected;
    $diff_sq = $diff * $diff;
    $numerator = 2.0 * (0.0 - $diff_sq);
    $denominator = $T1 * ($r1 * $r1);
    return (exp($numerator / $denominator));
}

# calc the ith power of a number. NOTE: this thing can only grok
# positive integer exponents larger than 0!
sub pow
{
    my ($i, $p) = @_;
    if ($p < 0 || $p != int($p))
    {
        print "ERROR: I can only grok positive integer exponents larger than 0.\n";
        exit(0);
    }
    $a = 1.0;
    for ($n = 0; $n < $p; $n++)
    {
        $a *= $i;
    }
    # print "i: $i, p: $p, a: $a\n";
    return $a;
}

# want ($r, $h, $c_element), cset and aasites assumed as global
sub prob_coincidence
{
    my ($r, $h, $c_element) = @_;
    my @elements;
    if ($r > 0)
    {
        $joint = 1.0;
        $joint_neg = 1.0;
        @aalist = split /\|/, $c_element;
        # print "c_element: $c_element, aalist: @aalist\n";
        foreach $aa (@aalist)
        {
            $joint *= $aasites{$aa};
            $joint_neg *= (1.0 - $aasites{$aa});
            # print "aa: $aa, joint: $joint, joint_neg: $joint_neg\n";
        }
        $ans = &binomial_coeff($r, $h) * &pow($joint, $h) * &pow($joint_neg, ($r - $h));
        $ans = &binomial_coeff($r, $h) * ($joint ** $h) * ($joint_neg ** ($r - $h));
    }
    else
    {
        return (0.0);
    }
    # print "joint: $joint, joint_neg: $joint_neg, ans: $ans\n";
    return $ans;
}

sub expected_size
{
    my ($r, $c_element) = @_;
    $sum = 0.0;
    foreach $h (1 .. $r)
    {
        $sum += (&prob_coincidence($r, $h, $c_element) * $h);
        # print "r: $r, h: $h, sum: $sum\n";
    }
    return $sum;
}

sub prob_of_correlation
{
    my ($c_element, $h_total_obs, $h_expected_total, $r, $T) = @_;
    $h_expected_total = &expected_size($r, $c_element);
    $ch = &chernoff($h_total_obs, ($h_expected_total * $T), $r, $T);
    return $ch;
}

# Randomly select list of 'sample_size' unique sequences
# in the range from 0 to the number of rows in @family
# want sample_size, family
sub rsample_family
{
    my $R = shift @_;
    my @family = @_;
    my (%which_rows, @sampled_family, @sampled_rows);
    # print "whichrows: ", keys %which_rows, "\n";
    # generate $R number of unique keys
    $f = scalar @family;
    while (scalar(keys %which_rows) < $R)
    {
        $n = int(rand $f);
        # print "rand_num: $n\n";
        $which_rows{$n} = 1;
    }
    # print "whichrows: ", keys %which_rows, "\n";
    # pick out the corresponding sequence from the
    # family list
    @sampled_rows = keys %which_rows;
    foreach $line (@sampled_rows)
    {
        push @sampled_family, $family[$line];
    }
    # print "RSAMPLE\n";
    # $i = 0;
    # foreach $line (@sampled_family)
    # {
    #     print $line, ": ";
    #     $n = $sampled_rows[$i];
    #     print $n, ": ", $family[$n], "\n";
    #     $i++;
    #     print "$line\n";
    # }
    # print "RSAMPLE END.\n";
    # exit(0);
    return @sampled_family;
}

# return the n'th column of an array
# want ($n, @array)
sub column
{
    my $n = shift @_;
    my @a = @_;
    my $col;
    # print "COLUMN: $n\n";
    # foreach (@a)
    # {
    #     print "$_\n";
    # }
    # go thru and append the n'th element of each row in @array to $col
    $col = "";
    foreach $line (@a)
    {
        $col = $col . substr $line, $n, 1;
    }
    # print length $col, $col, "\n";
    # print "COLUMN END\n";
    return $col;
}

# find all occurences of a character 'aa' in the n'th column of the
# array sampled_family
# want ($aa, $n, @sampled_family)
sub find_all
{
    my $aa = shift @_;
    my $n = shift @_;
    my @sampled_family = @_;
    my ($bstring, $col);
    print "FIND_ALL: $aa, $n\n";
    print "012345678901234567890\n";
    foreach (@sampled_family)
    {
        print "$_\n";
    }
    # print "JUMPING TO COL\n";
    $col = &column($n, @sampled_family);
    # print "GOT: $col\n";
    $bstring = "";
    if ((index $col, $aa) != -1)    # make sure $aa is found in $col
    {
        for ($i = 0; $i < length $col; $i++)
        {
            $c = substr $col, $i, 1;
            if ($c eq $aa)
            {
                $bstring = $bstring . "1";
            }
            else
            {
                $bstring = $bstring . "0";
            }
        }
    }
    else
    {
        $bstring = "NOT_FOUND";
    }
    # print "$bstring\n";
    # print "FIND ALL END\n";
    # exit(0);
    return $bstring;
}

# this subroutine isn't exactly the most optimal code, but
sub mi
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $row, $p1, $p2, $pj, $a1, $a2, $s, %sj, $contrib, $total);
    $s1 = column($col1, @family);
    $s2 = column($col2, @family);
    # print "col1: $col1, $s1\n";
    # print "col2: $col2, $s2\n";
    # print "keys1: ", keys %sj, "\n";
    # calc the joint prob
    for $row (0 .. ($m - 1))
    {
        $a1 = substr $s1, $row, 1;
        $a2 = substr $s2, $row, 1;
        $s = $a1 . $a2;
        if (exists $sj{$s})
        {
            $sj{$s}++;
        }
        else
        {
            $sj{$s} = 1;
        }
        print "a1: $a1, a2: $a2, $s\n";
    }
    print "keys2: ", keys %sj, "\n";
    foreach $s (keys %sj)
    {
        $sj{$s} = $sj{$s} / $m;
        if ($sj{$s} < $tiny_num)
        {
            $sj{$s} = $tiny_num;
        }
        print "$s: $sj{$s}\n";
    }
    $total = 0;
    foreach $s (keys %sj)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;
        # find partial probs
        $a1 = $a1 . $col1;
        $a2 = $a2 . $col2;
        $pj = $sj{$s};
        $p1 = $aasites{$a1};
        $p2 = $aasites{$a2};
        if ($p1 < $tiny_num) { $p1 = $tiny_num; }
        if ($p2 < $tiny_num) { $p2 = $tiny_num; }
        if ($pj < $tiny_num) { $pj = $tiny_num; }
        $contrib = ($pj * log($pj / ($p1 * $p2)));
        $total += $contrib;
        # print "a1: $a1, a2: $a2, s: $s, pj: $pj, p1: $p1, p2: $p2, contrib: $contrib, total: ...
    }
    return $total;
}

sub incidence_vec
{
    my ($col, $key) = @_;
    my ($vec);
    $vec = "";
    if ((index $col, $key) != -1)
    {
        for $i (0 .. ((length $col) - 1))
        {
            $c = substr $col, $i, 1;
            if ($c eq $key)
            {
                $vec = $vec . "1";
            }
            else
            {
                $vec = $vec . "0";
            }
        }
    }
    else
    {
        $vec = "NOT_FOUND";
    }
    return $vec;
}

# given two columns, go through each letter in the alphabet and
# generate the incidence vector for them. Then if the results are
# non-zero, send them to mi2_real for the real computations
sub mi2
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $key1, $key2, $total, $sum);
    $s1 = column($col1, @family);
    $s2 = column($col2, @family);
    $sum = 0.0;
    foreach $key1 (keys %alphabet)
    {
        $vec1 = incidence_vec($s1, $key1);
        if ($vec1 ne "NOT_FOUND")
        {
            foreach $key2 (keys %alphabet)
            {
                $vec2 = incidence_vec($s2, $key2);
                if ($vec2 ne "NOT_FOUND")
                {
                    $total = mi2_real($vec1, $vec2, $m);
                    print "s1: $s1\n";
                    print "vec1: $vec1\n";
                    print "s2: $s2\n";
                    print "vec2: $vec2\n";
                    if ($total > 1.0)
                    {
                        # the argument list is cut off in the source listing;
                        # the arguments below are inferred from the format string
                        printf "mi2, cols: %d,%d | key1: $key1 | key2: $key2 total: %.9f\n", $col1, $col2, $total;
                    }
                    $sum += $total;
                }
            }
        }
    }
    print "total sum: $sum\n";
}

# Given two columns (current string of amino acid symbols),
# produces all combinations (pairs) of attr1, attr2, where attr1 is
# an incidence vector for a symbol occurring in col1 and
# likewise for attr2 from col2. Then call mi2 on the pair
# of incidence vectors.
# Compute mutual_info(attr1, attr2) where attri are binary incidence
# vectors for two a1@col1, a2@col2.
sub mi2_real
{
    # the sub header and the declarations are illegible in the source
    # listing; these two lines are inferred from the call in mi2 and
    # from the variables used below
    my ($attr1, $attr2, $m) = @_;
    my ($row, $a1, $a2, $s, $a, $pj, $p1, $p2, $total, %hash_joint, %hash_single1, %hash_single2);
    for $row (0 .. ($m - 1))
    {
        $a1 = substr $attr1, $row, 1;
        $a2 = substr $attr2, $row, 1;
        $s = $a1 . $a2;
        # print "row: $row, a1: $a1, a2: $a2, s: $s\n";
        if (exists $hash_single1{$a1})
        {
            $hash_single1{$a1}++;
        }
        else
        {
            $hash_single1{$a1} = 1;
        }
        if (exists $hash_single2{$a2})
        {
            $hash_single2{$a2}++;
        }
        else
        {
            $hash_single2{$a2} = 1;
        }
        if (exists $hash_joint{$s})
        {
            $hash_joint{$s}++;
        }
        else
        {
            $hash_joint{$s} = 1;
        }
    }
    foreach $s (keys %hash_joint)
    {
        $hash_joint{$s} = $hash_joint{$s} / $m;
        if ($hash_joint{$s} < $tiny_num)
        {
            $hash_joint{$s} = $tiny_num;
        }
        # print "s: $s, h: $hash_joint{$s}\n";
    }
    foreach $a (keys %hash_single1)
    {
        $hash_single1{$a} = $hash_single1{$a} / $m;
        if ($hash_single1{$a} < $tiny_num)
        {
            $hash_single1{$a} = $tiny_num;
        }
        # print "a: $a, hs1: $hash_single1{$a}\n";
    }
    foreach $a (keys %hash_single2)
    {
        $hash_single2{$a} = $hash_single2{$a} / $m;
        if ($hash_single2{$a} < $tiny_num)
        {
            $hash_single2{$a} = $tiny_num;
        }
        # print "a: $a, hs2: $hash_single2{$a}\n";
    }
    foreach $s (keys %hash_joint)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;
        $pj = $hash_joint{$s};
        $p1 = $hash_single1{$a1};
        $p2 = $hash_single2{$a2};
        if ($p1 < $tiny_num) { $p1 = $tiny_num; }
        if ($p2 < $tiny_num) { $p2 = $tiny_num; }
        if ($pj < $tiny_num) { $pj = $tiny_num; }
        $total += ($pj * log($pj / ($p1 * $p2)));
    }
    return $total;
}

########################################################################
# check to make sure a file name was given
if (scalar @ARGV != 4)
{
    print "usage: $0 data_file sample_size iterations min_freq\n";
    exit;
}
$filename = $ARGV[0];
$sample_size = $ARGV[1];
$iterations = $ARGV[2];
$min_freq = $ARGV[3];

# read contents of file into array family
open(DATAFILE, $filename);
@family = <DATAFILE>;
chop @family;

# remove nial's +, -, and | delimiters
# @family = grep(!/^\+/, @family);    # get rid of lines beginning with '+'
# foreach (@family)                   # remove all '|'s
# {
#     tr/|//d;
# }
# @family = grep(/^\w/, @family);
# foreach (@family)
# {
#     print "$_\n";
# }
# while (length $family[(scalar @family) - 1] < 1)
# {
#     print "Empty line: ", scalar @family, " deleted.\n";
#     pop @family;
# }
# $i = 0;
# foreach (@family)
# {
#     print "$i: $_\n";
#     $i++;
# }

########################################################################
# NOW for the real stuff!
print "Sample_size: $sample_size\n";
print "Iterations: $iterations\n";
print "Min_freq: $min_freq\n";

# construct aasite list
$n = length $family[0];
$m = scalar @family;
foreach $row (@family)
{
    for $j (0 .. ($n - 1))
    {
        $c = substr $row, $j, 1;
        if (length $c != 1)
        {
            print "BUG! $row, $j\n";
            exit;
        }
        # print "$line: $j: $c\n";
        $i = $j;    # + 1;
        $s = $c . $i;    # create aasite name
        # print "c: $c, j: $j, i: $i, s: $s\n";
        if (exists $aasites{$s})
        {
            $aasites{$s}++;
        }
        else
        {
            $aasites{$s} = 1;
        }
    }
}

# figure out the alphabet
# @a = keys %aasites;
# print @a, "\n";
# foreach (@a)
# {
#     print "$_: $aasites{$_}\n";
# }
foreach $entry (keys %aasites)
{
    $c = substr $entry, 0, 1;    # want the first character in each entry
    # print $c, "\n";
    $alphabet{$c} = 1;
}
print keys %alphabet, "\n";

# calc marginal probabilities for each column of aasites
foreach $key (keys %aasites)
{
    $p = $aasites{$key} / $m;
    $aasites{$key} = $p;
    # print "$key: $p\n";
}

for $col1 (0 .. ($n - 2))
{
    for $col2 (($col1 + 1) .. ($n - 1))
    {
        $mi = &mi($col1, $col2, $m);
        print "columns: ", ($col1 + 1), " ", ($col2 + 1), " mi = $mi\n";
        $mi2 = mi2($col1, $col2, $m);    # might as well do mi2 while we're here
    }
}
# exit;

### MAIN LOOP
# seed the random number generator
# $seed = 111;
# srand($seed);    # remove '($seed)' to get seed from the system clock
srand();
# print "START MAIN LOOP\n";
for ($iter = 0; $iter < $iterations; $iter++)
{
    my %BINS;
    # print "\nITERATION: $iter\n";
    print STDERR "ITERATION: $iter\n";
    # print "JUMP TO rsample_family\n";
    @sampled_family = &rsample_family($sample_size, @family);
    # print "sample size: $sample_size\n";
    # print "012345678901234567890\n";
    # $i = 0;
    # foreach (@sampled_family)
    # {
    #     print "$i $_\n";
    #     $i++;
    # }
    # print "rsample printed\n";
    foreach $aasite (keys %aasites)
    {
        $aa = substr $aasite, 0, 1;
        $col_num = substr $aasite, 1;
        print "aa: $aa, colnum: $col_num\n";
        $occurence_string = &find_all($aa, $col_num, @sampled_family);
        print $occurence_string, "\n";
        if ($occurence_string ne "NOT_FOUND")
        {
            print "FOUND occ_str: $occurence_string\n";
            if (exists $BINS{$occurence_string})
            {
                $BINS{$occurence_string} = $BINS{$occurence_string} . $aasite . "|";
            }
            else
            {
                $BINS{$occurence_string} = $aasite . "|";
            }
        }
    }
    foreach (keys %BINS)
    {
        print "$_: $BINS{$_}\n";
    }

    # sort the collision list associated with each BIN and throw away
    # entries with just one 'collision'
    foreach $bin (keys %BINS)
    {
        my @aalist;
        $s = $BINS{$bin};
        # print $s, "\n";
        @aalist = split /\|/, $s;
        # $i = 0;
        # foreach (@aalist)
        # {
        #     print "$i: $_\n";
        #     $i++;
        # }
        if ((scalar @aalist) > 1)    # throw away single 'collisions'
        {
            # then sort the others
            # $sorted_aalist = join "|", sort comp_aa @aalist;
            $sorted_aalist = join "|", sort @aalist;
            print "sorted aalist: $sorted_aalist\n";
            $BINS{$bin} = $sorted_aalist;
        }
        else
        {
            print "chucked\n";
            delete $BINS{$bin};
        }
    }
    print "BINS DRAWING\n";
    $z = 0;
    foreach (keys %BINS)
    {
        print "$z: $_: $BINS{$_}\n";
        $z++;
    }

    # now we update the cset table
    foreach $bin (keys %BINS)
    {
        $count = 0;
        # sum up bin hits; sample_size should equal length of bins
        for ($i = 0; $i <= $sample_size; $i++)
        {
            $c = substr $bin, $i, 1;
            if ($c eq "1")
            {
                $count++;
            }
        }
        $key = $BINS{$bin};
        print "cset key: $key\n";
        if (exists $cset{$key})
        {
            $cset{$key} += $count;
        }
        else
        {
            $cset{$key} = $count;
        }
    }
    print "CSET\n";
    $z = 0;
    foreach (keys %cset)
    {
        print "$z: $_: $cset{$_}\n";
        $z++;
    }
    print "$iter, BINS: ", scalar keys %BINS, "\n";
    print "CSETS: ", scalar keys %cset, "\n";
    print STDERR "BINS: ", scalar keys %BINS, "\n";
    print STDERR "CSETS: ", scalar keys %cset, "\n";
}
print "CSETS: ", scalar keys %cset, "\n";

print "\n\nGathering stats.\n";
foreach $entry (keys %cset)
{
    $h_total_obs = $cset{$entry};
    $h_expected_total = &expected_size($sample_size, $entry);
    $correlation = &prob_of_correlation($entry, $h_total_obs, $h_expected_total, $sample_size, $iterations);
    if ($correlation < 0.000000001)
    {
        $correlation = 0.0;
    }
    if ($h_total_obs >= $min_freq)
    {
        # this is a really ugly hack to prevent hash key collisions
        $h = $h_total_obs;
        while (exists $output{$h})
        {
            $h = $h . "*";
        }
        # print "\nEntry: $entry\n";
        # print "Obsrv hits: $h_total_obs\n";
        # printf "Expct hits: %.9f\n", $h_expected_total * $iterations;
        # printf "Prob ran: %.9f\n", $correlation;
        $output{$h}[0] = $entry;
        $output{$h}[1] = $h_total_obs;
        $output{$h}[2] = $h_expected_total * $iterations;
        $output{$h}[3] = $correlation;
    }
}

@hits = keys %output;
@hits = sort compare @hits;
# @hits = sort @hits;
# foreach (@prob)
# {
#     print "$_\n";
# }
print "SORTED\n";
foreach $hit (@hits)
{
    my (@aalist);
    # $i = index $hit, ...
    # if ($i != -1)
    # {
    #     $h = substr $hit, 0, (index $hit, ...
    $s = $output{$hit}[0];
    @aalist = split /\|/, $s;
    foreach (@aalist)
    {
        $aa = substr $_, 0, 1;
        $col_num = substr $_, 1;
        $_ = $aa . ($col_num + 1);
    }
    $s = join "|", sort comp_aa @aalist;
    $observed = $output{$hit}[1];
    $expected = $output{$hit}[2];
    $prob = $output{$hit}[3];
    if ($expected < $observed && $prob < 0.5)
    {
        print "\nEntry: ", $s, "\n";
        print "Obsrv hits: ", $output{$hit}[1], "\n";
        printf "Expct hits: %.9f\n", $output{$hit}[2];
        printf "Prob ran: %.9f\n", $output{$hit}[3];
    }
}

APPENDIX B
??? TRPMJNTRKSvKIGPGQAFY TG-.IIGDIRQAH TRPNNYTRKHIPTGPGQVIY TGKIisDIRKAY BD-V input: 1/10 SRPNNNTRKSVHMGPGR FYATGDIIGDIRQAY IRPGtnrrR - SMHIGPGRPFYARG-VIGDIRQAH IRPNN TIUSIHIGFGQAFYATGDIIGNIRQAH IRPNm-TRTSVHMGPGKTFYATGDIIGDIRQAH TRPNNNTRRSMRIGPGQtFYATGDIIGDIRQAY TRPNNNT-Ü-SIRIGPGQAFY TGDIIGDIRQ H TRPSNNRTSIHI PGR FY TGAIIGDIRQVH Irp-???? INNTRR - VRIGPSQAFYATGDIIGDIRQAH T - KNNNTR-CSIRIGPGQAFYATGDIIGDIRÍ-AH TU > SNNT-U SIRIG-K-QAFYXTGDII? DIRC-?
H TRPNNNTRR = IHIGSGRAFY IIGDIRQAH IRPSRTTRKRWHIGSGQAFYAIDGITGDIRKAY TRPNNNTHRRMHIGPGRAFIATI-AIVGDIRQAY TRPSNNTRKSVPIGPGQAFYATCOIIGDIROAH RPSHUTSKSIRIGPGOTFYA GRIIGDIRQAH IRP-3NNTRK - VNIGP -AFYATG - IXG - XKQAH TRPGNNTRKSVRIGPGQ? FY? TGDIIGDIRC.A-- TRPGNITGRKSWHXQPGRAF? TGDGI XGOXRKAY IRPGNNTRKGVHIGPGQAFYARGDIIGDIRQAH TRPGNNTRKSLRXGPGQTFYATGOXXGDXRQ? H RPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH ? IRPHKMTR- --VHIGPGQAFYA GDIIGDIRQAY -RPNNNTR- SVRXGPGQTFY TGDIIGDIROAH TRPGNYT-U-SVRTGPGQTFYA'RGKI IGDIRQAH TRP - NNTR-.GIHIGPGSAIYATGDIIGDIRQ H T - P-OJ TRTGIHIGPQQTFYATGEIIGNIR0AH TRPNNNT - RSV-U: GPGQTFYATG IIGDXRO -?. IRPNNTRK5VRXGPGQTFYAAGDXZGDXRQAH TRPGNNTRR - VRIGPGOAFYATGEIIGDIR- AH TRI-SNNTR -? VRXGPGQTFYA GEIIGDIRRAH TRPNNNTR- SVRIGPGOTFYATGDIIGDIRQ? H TRPNNNTRTSVRIGPGQAFYATGDIIGDTRQAH TRPGNNIWSVRIGPGQArYATGDIisDIRKAK SRPNNHTRR - IHFGPGOTLYAGHIIGDIRQAH TRPNNOT'RRSIRIGSGQTSYATGDIIG-.IREAK SRPGNNTRKSVRIGPGQTFYATGDIIGDIRQAH TRPNNNTWCSVRIGPGQTFYATGDIIGDIRQAH TRPNN-? RK - VRIGPGQTF? AO-JXIGDIR - AH TRPSNNTRKGXHXGPGRAFYATSQI GDIRQAH TRPGNNTNKNVHIGPGQAFYARGRIIGDIRKAH TRPHNNTRMSIRXGPGQAFYAGDIIGNIROAH TRPNNNTRKSIRXGPGQAFYAGDXXGNIRQAH TRPNNTRTGIHIGPGQAFYARGAI'K-DIRKAY TRPXNNTRKSIHXGPGQAFYATGDIXGDIRKAH TRPNNNTRTSIRIGPGQTFYAGDIIGNIRQAH TRPGNNTRTSIRIGPsQAFYGRsi-IIGDtRKAH TRPNNNT-U - SIRXGPGQAFYATGDI GDIRQAH ARPNUNTRRSIHXGPGQAFY? -SDXIGDIRQAH TRPNHNTRKSVHIGPGQAFYATGDIIGDIRQAH TRPNNNTRKSIRIGPGQAFYTrGDIIGDI - QAH IRPN TRTSIRIGPGQAFYATGDIIGDIRQAH TRPNNNT - KSVFIGPGQAFYATDNIIGDIRQAH TRPNm-TRTSICIGPGQTFYA-GGIIGDIRQAK TRPNNWrRKSVHIGPGQAFY-vTGDIIG-LIRQ H TRPNOTm-XSIHIGPGQA-rYATGDIIGDIRQAH T ^ PSNHTR SIRIGPGQAFYATGDIIGDIRQAH TRPNNNTRKSANIGPGQAFYATGEIIGDIRQAH ?. IRPNNNTL GimGPGQSFYATGSIVGHIRQJ-H IRPYN-ÍTRKSIHIGPGQAFYA-SRIIGNIRQAH TRPHNNTRKSIRIGPGQTF - - --- IRQAK TRPN GEIX NTRKGVHIGPGQAFYA G IIGDIRQAH TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAY TRPNNNTRTSIRIGPGQSFHATGDIIGDIRQAH SRPNNN RK-RVHIGPGQAFYATGDVIGDIRQAY IRPNNNTRKSVPIGPGRAFYATGDIIGNIRQAH HIV TRPNNNTRKGVRIGPGQAFYATGGIIGDIRQAH input: 2/10? 
TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH TRPNNNTRTSVRIGFGQTFYATGDIIGDIRRAY VRPNNNTRTSVRIGPGQTFYATGÉIIGDIRRAF TRPNNNTRRSIRIGPGQAFYATGOIIGDIRKAH IRPNNNTRKSVHIGPGQAFYATGDIIGDIRQAH IRPNNNTRKSVHIGPGQTSYATGDIIGDIRQ - H TRPNNNTRKSVHIGPGQAFYATGDIIGDIRQAH TRPNNNTRRSVHIGPGQ? FYATGDIIGDIRRAH RP NNTRKSIHLGPGRAFYATGDIIGDIRQAH SRPYN-T ---- T-YSIG-5 - Q - FYVTGKIIGDIRQAH TRPYK-CV - RRIHIGPGRSFY-T-SN - GDIRQAY TRP-JNNISRRIHIGRGQ-U ^ ATGGMTG-NI QAY IRP NNTRKSVRIGPGQAFYATGDIIGNIRQAH TRPNNWPRRSVRIGPGQTFYATGD? GDIRQAH TRPNNNTRTSVHIGPGQAFYARGDIIGDIRQAH TRPNNNTRKÍSIHIGPGQAFYARGDIIGNIRQAH TRPNNNTRKSVHIGPGQAFYATGEIIGDIRQAH TRPNNNTRKSVRisPGQTFYATGDIIGNrRQAH TRPNNTRKGVHIGPGQAFYATGDIIGNIRRAH TRPNNTRQSVHIGPGKAFYATOGIVGDIRQAY TRPNNNTRKSVHXGPGQAFYATGAXXGSIRQAH TRPNNNTRRSVHIGPGQAFYATGDXIGDIRQAH TRPGNNTRRSVRIGPGQTFYATGDIIGDIRQAH IRPNNNTRTSVRIGPGQAFYATGDIXGDIRKAY TRPNNNTRKSIGIGPGQTFYAADNIIGDIRQAH TRPGNTRTSVRIGPGQAFYATGDIIGDIRQAH TRPN - NTRT - VRXGPGQSFYATGDXXGDIKQAH MRPNNTRKSISIGPGRAFFATGDIIGDIRQAH TRPSm-RRQSVRIGPGQAFYATGDXXGDIRRAH TRPNNNTSQGVHIGPGQVFYARDRIIGDIRKAY TRPNNNTRKSVRIGPGQT-nrATsDIIGDIRQAY IRPNNNTRRGIHMGPGQI YATGSIIGDIRQAH RPÍINNTR1SIRIGPGQVTYTN-DIIGDIRQ - H TRPNNOTRKSVHIGPGQ FYATGDIIGNIRQAH TRPNNNTRKSIRIGPGQAFYATGDIIGNIRQAH TRPNNNTRKSIRIGPGQVFYATG * - * ------ TRPTNJNTRKSVRIGPGQTFYATGDIIGDIRQAH TRPNNNTRTSVRIGPGQAFYATCDIIGDIRRAH RPNNNTRKSIHIGPGRAFYRTGEIIGDIRQAH TRP-TOS - RKT - HMGPKRAFYATGDIGGYIRQAH TRP-TONTRKSIQIGPGR? FYTTGEIIGDIRC-AH TRPNNNTRKGI-OK? PGSTFYATGEIIGDIRQAH TRPSKT-TR GIKI-GFGRALYATGEITGDIRQAH TRPNNNTRKSLSLGPGRAFY TGDIVGDIRQAH TRPSNNT - KGIHIGPGRTFFATGEIIGDIRQAH TRPNNNTSKGI-O-GPGGAFYTTGRIIGDIRRAY TRPNNNTRK3ISÍGPGRAFYATGDIIGDIRQAH TRPNNÍGGRKGIHKGWGRTFYATGEIIGAIRQPH TRPNNNTRKSIHMGWGRAFY? TGDIIGDIRQAH TRPNNNT-USI - VGWGR - I-FTTGEIIGNIR - AH TRPNNNTRKSIHMGWGRAFYATGEIIGDIREAH TRPNNNTRKRIYIGPGRAVYTTGQIIGDIRRAH ERPNNNTR -. 
XNIGPGRAFYTTGDIIGDXRQAH TRPSNNTRKSIHT-G GRAFYTTGDIIGDIRQAH TRPHNNTRRSITIGPGRAFYTK-DIIGDIRQAH TRPSNHTRKSIHLGWGRAFYATGEIIGDIRQAH T - LNNNTRTSIHIGPGQAFYATGDIIGDIRQAH TRPNNTR SIHIGPGSAFYATGDXIGDIRQAH TRPNNNTRKSIHMGWGRTFYATGEIIGDIRQAH TRPRTONTP.KGIHIGPGRAFYAT-EITGDIRQA-- T-RPSNNTRKSIHMGWGRAFYATGEIIGDIRQAH TRPMNNTRKSIHMGWGRAFYATGEIIGNIRQAH TRPGNUTRKGIPIGPGGSFYATERIIGDIRQAH IRPNNNTRR? 11IGPGRAFYATGDIIGDIRQAY TRPNNNTR SIHIGPGRAFYATGDIIGDXRQAH HIV input: 3/10 TRPNNNTXKSIHIGPGSAFYATGDIIGDIRQAH TRPGNNTRRSIHMGWGRAFYATGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH TRPNNN RKSIHMG GRAFYATGEIIGNIRQAH TRPNNNTRKSIHIGPGKAFY TGEIIGNIRQAY TRPNNNTRKSIHU3WG - AF ATGEIVGDXRQAH TRPNNNTRKSITIGPGRAFYATGEirGDIRQ - H TRPNNNTRKSXHMGWGRTFYATGEXXGDXRQAH TRPSNNTRKGIHIGPGRAFYATGDIIGDIRQA-. TRPSNNTRKSIHIGWGRAIY? TG? XXGDIRQ? H TRPNNNTRKSIHVG graj-.YTTGEIIGNIRQAH TRPNNNTRKSIQYGTOGAFYATGEIVGDIRQAH TRTO --- mUCSIHIGPGR? FYTTGDIIGDIRQ? H TRPNNQTR? -SIHMsS? -iRAF-pUGEIIGNIROAH TRPNNNTRKGIHHGLGR? FYATQGIVGDIRQ? H TRPSNNTRKGIHIG GRAFYATGEITGDIRKAY SRPNNNTRKSIH GHGRAFYTTGEIIGDIRQAH TRPNN? -TRKSIHIGPG - AFYTTGEXIGDIRQAH TRPGNNTHKSIHLG GRAFYATGAIIGOIRQAH TRPSNNTRKSIH GWGRAFYATGEXVGDIR - AH TRP - NNTRRSXH GPGGAFYTTGEIXGNIRKAF TRPNNNTRKSIRSGPGSAFYATGDIXGDIRQAH TRPNNNTR SXPXAPGSA FATGEXXGOXRQ - H TRPNNNTRKSIHLG GRAFYTTGQIIGEIROAH TRPNNNTRKSIHVGVGRAIYATGEIIGDSRQAH TRPSNNTRKSIHMGWGRAFYATGEIIGDIRRAH TRPNNNTRKSIHHGWGRAFYTTGDIIGDIRQAH TRL > N --- MUCR --- SIGPG - AFYTTC - VTGDIRQAH TRPNNNTRKSIHMGPG - AIYATGEIIGDIRKAY TRPNNNTRKGIKXGPGRAFYTT-DXXGDIRQAH TRPNNYTSKRIRIG? RRAFYTKGKIIGDIRQAH TRPNNNT-UCGXHIGPG-UWYTOSRXVGDIR - AH •? -PNM-T-U SIORGPSRAFVTIGKI-GNKRQAH TRPNNNTRNRISIGPGR? FHTTKQIIGDIRQ? H ? IPIOINTRKSITKGPGRVGYATGQIIGDIRKAH TRPYNNVRRS ---- IGPGRAF * RTREIIGIIRQAH TRPNNNTRKSXNXGPGR? WY? 
T-NXXGDXRQAH IRPNNNTRKSIPICPGRAFYATGDIIGDIRQAH TRPNNOTRKSIRIGPGRAFYT-GEIIGDIRCAH TRPNNNTSKRÍ SIGPGRAFRAT-KI IGNIRQAH TRPNNSTRKRISISPGRVWYTTGQIIGDIR AH TRPNNNTRKRISIGPGRVWYTTGQIIGNIR1-AH TRPNl-N-KIVsDIRKAH RRSGHXGCiGRTLFTT TRPNNNTRÍ SIHIGPGRAFYT-GEIIGDIRQAH TRPNNNTSKRISIGPGR FRAT-KISGN RQAH T -? PNNNTRKI.ISXGPGRASYTTGQIIGDXRK H TRPNNNTR RISIGPGRAWYTTGQIIGI-IRKAH TRP-NJHTRRSGHIGGGRTLFTT-HIVGDI - K - H TRPSNNTRKSIPMGPGKAFYTTGDIIG IRQAY? TRPN-n-TRKSIHXsPGRTFFTTGDIis- > IRQAH TRPNN - TRKSINIGPGRAFYATGEIIGNIR ---- H ERPNNNTKRSITIGPGRAFD? YGGIIGDIRQAH TRPNNNTRKSIHMGPGKAFTTGEIVGDIRQAH TRPNNNTRXGIHIGPGG FYATGGXIGDIRQAH TRI-mJNTRKSINIGPGRAFYAT - DIIGDIRQAH TRPHMtRKSIHIGPGRSFYTTGDIIGDIRQAH TRPNmJTRKSIHIGPGRAF'irPTGDIIGDIRQAH TRPNDNTRKSIPMGPG - AFYATGDIIGNIRQAH TRPNNNTRKSIHIGPGRAFYTTGSIIGDIP.QAH TRPNNNTRKGITIGPGRAFYATEKItGOIRRAY IRPNNNTRKSIPIGPGRAFYATGDIIGDIRKAH TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAY TRPNDNTRKSIHIGPGRAFYTTGQIIGNIRQAH TRPNNI-T - KSIHMGPGSAFYATGDIIGNIRQAH TRPNNNTRKSIPIGPGRAFFTTGDIIGDIRQAH HIV RPNNNTRRSIHIGPGRAFYATGDIIGDIRQAH input: 4/10-DIIGDIRQAH TRPSNNTRKGIHIGPGGAFYTTGEIIGDIRQAH TRPSNNTRKSIHIGPGRAFYAT TRPKNEIKRRXKIGPGRAFVATGT-VGDTRQAQ TRPNNSXKRRIHIGPGRAFFATNT-VGDTRQAQ TRP - NEIRRSI-QVGPGRAVAAGT-AGDTRQAQ TRPGNNTRRSIHIGPGRAFFATGDITGDIROAH TRPNNNTRKSXTXGSGRAFMAXE XXGNXRQAH TRPSKTTRRRIHIGPGRAFYTTKQIAGDLRQAH TRPNNNTRKSIRIGPGRAFVTIG-KIGNMRQAH TRPNNNTRKSIHIGPGKAFYATGEIIGDIROAH T - ^ JNNT-USIHIGPGSAFYTTsDriGDIROAH TRPNNNTRKRVTMGPGRV YTTGEIIGNI QAH? TRPNNTRKGIHDGPGGTFYATGEIIGDIRQAH IRPNNNTRKSINIGPGR? FYTTGEXXGDXRQAH TRPNNTRRGIHIGLGRSFYT-RKIIGDIRQAH TRPHNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPGNNTRRSXPIGPGKAFFTT-EIIGDIRQAH TRFNNNTRKSIHIGtGRAFYTTGDIIGDIRQ? H TRPNNNTRKSIPisPGRAFYATGEXIGDIRQAH TRPNNNTRKSIPIGPGRAFYTTGEIIGDIRQAH TRPNNNTRkSIHIGPGRAFYTTGEIIGNIRQAH TRPMtn-TRRSIGIGPGRAIYATDRIVGNIROAH IRPNNNTRKSISIGPGRAFYATGEIIGNIRQAH TRPNNTRKGIHIGPGRAFYAT - RIIGNIRQAH TRPNNNTRRGIHIGPGRAVYTTG- iGDIRQ? 
H TRPSNNTRRSrHIGPGRAFYTTGQITGNIRQAH TRPNNNTRKSIQIGPGRAFYTTGEIIGNIRQAH TRPNNNTRKSIHIGPG - AFYTrGDIIGDIRQAH TRPNNNTR - SIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH TRPNN-rr ------ MT - s - KVF TTGEIIGDIROAH XRPNNNTRKSXHIGPG - AFYTTGEXXGDIRQAH TRPN-ÍNTRKSIHIGPGRAFYTTGEIIGDIRQAH • raPNNNTR SIHIGPGRAFYATGEVIGDIRQAH TRPNNNTRXGIHIGPGR FYTTGDIIGDIRQAH TRPNNNT ---- SIHIGPGRAFYTTGEIIGDIRQAH IRPNNNTRKStHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIPIGPGRAFYTTGDIIGNIRQAH --RPNNNTRRSIPIG - GSAFYTT-EIIGDIRQAH TRPNNNTRKSIHMGPGKTFYTTGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGQIICiDIRQAY TRPNNNTRKSIPIGPGRAFYTTGEIIGDISQAH TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGEIXGDIRQAH IRPGNNTRKSIPTGPGRAFYATGDIIGDIRQAH TP.PNNNT-yGIRIGPGRAFIAATKIIGDIRQAH TRPNNNTRKSIPIGPGRAFYTTGDIIGDIRQAK TRPNNNTRKSIHIGPGKAFYATGEIIGDIRQAH TRPNNNTRKGIHIGPGP-AFYATEAIIGDI - KAY TRP NirrfÍKGIHIGPGKAFYTrGEIIGDIRQAH TP.PNNNTRKSIHIGPGRAFYTTGEXIGDIRQAH TRPNNNTRKSINIGPGRAFYTTGGCIGDIRQAH TRPNNNTR SIHIGPGR? FYTTGEXIGDIRQAH TRPNNN RK = IHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPNNSTRKSIHIGPGGAFYATGEIIGOIRQAH TRPNNNTRRGIHIGPGRAFYTTGQIIGNIRQAH TRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH rSPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKSIH - GPGKAVYTTGEIIGDIROAH TRPNNNTRKSIPIGPGRAFYTTGEIIGDIROAH TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH TRPNNNTRKSTHIGPGRAFYTTGEIIGNIRQAH TRPNNNTRKSIHIGPG - AFYATGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGDXIGDIRQAH HIV input: 5/10 TRPNNNTRKSINIGPG - AFYATGEI IGDIRQAH TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQ? H TRPNNNTRRS I PIGPGR? FYATGNIIGDIRQAH TRPNNNTRKSINIGPGRAFYTTGEIIGDISQAH TRPFN TRKSIPrGPGRAFYTTGDIIGDIRQ- - TRPNNNTRRSIHIGPGRAFYTTGGIIGDIRQA-. TRPNNNTRKSIHIGPG - AFYTTGDISGDIRQAH TRPNNNTRIGIHIGPGRAFYATGEIIGDIRQAH T P lNN RKSINTGPGRAFYTTGDIIGDIRQAH TRPSNNTRKsXQXGPGRAFYTTGQXTGOIRQAH TRPNNNTRKGIHIGPGRAFYATGEIIGNIRQAH TRPNNNTR - SITI &PGR? FYTTGEIIGDIRQAH TRPNNNT-U S? 
HIGM- ^ FYTTGEIIGDIRQAH T - PN - NTR - SIHIGPG - AFYT GEIIGDIROAH TRPNNNTRKSSHSGPGRAFY? TGEXXGDXRQ? H TRPNNNTR - SIHIGPGRAFYTTGEIIGNIRQAH TRPNNNTRRGIHIGPGRAVYTTGEIIGNIRQAH TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH TRPNNNTRKSINIGPGP-AFFTTGKIIGDIRQAH TRPSNNTRKXIHIGPGRAFY? TGEIIGDIRQ? H TRPNNNTSKGIHIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKstHIGPGRAFYATGEIIGDIRQAH TRPGNNTSRGIHIGPGRAFYTTXKIIGOIRQAH TRPNN - TRKSINIGPG -? FYTTGDIIGDIRQ? H TRPNNNTRKSIPMGPGRAFYTTGDIIGNIRQAH TRPHNNTRKSXPIGPGR? FYTTGEIXGDXRQ? H TRPNNNTRKGIHIGPGRAFYTTGEIIGNIRQAH TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH TRPNNNTRKSIXIGPGRAFYATGEIIGDIRQAH TRPNNNTRKSINIGPGRAFYTTGEIIGDIROAH TRPNNNT-USIPIGPGRAFYTTGQIIGDI - QAH TRPNNNTRKGIHIGPGKAFYATGEIIGNIRQAY TRPNNTRXGIHIGPGS? FY? TGEIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGDIVGDIRQAY TRPNNNTRKSXHXGPGRAFYATGEXXGDXRQAH TRPNÑNTRKSIHIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFY? TGQIIGDIRQAH TRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH TRPNNNTXKSIHIGPGRAFYTTGQIIGDIRQAH TRPNNTRKGIHIGPGRAFYTTG? IIGDIRQAH TRPNNNTRKSITIGPGRAFYTTGDIIGDIRQAH TRPNNNTRRSINIGPGRAFYATGEIIGDIRQAJ. TRPNNNT - KSXHXAPGRAFYATGEIXGDIRQAY TRPNNNTRKSIHIGPGR? FYATGAIIGNIRQAH TRPNNNTRKSIHI-GPGQAWYATGEITGDIRQAH TRPNNNTR-CSIHI-GQGQAWYATGEXXGDXRQAH TRPNNNTRKSIHLGPGQAWYTTGQIIGDIRQAH TRPNNNTRKSIP GPGRAWYATGEIIGDIRQAH TRPMMNTRKSIPLGPGQAWYTTGQIIGDIRQAH TRPNNNTRKGIHLGPGQAWYTTGQIIGDIRQAH TRPN --- JTR - SIPI-GPGQAWYTTGQIIGDIRQAH TRPNNNTRKSIP CPGQVWFTTGQIIGDIRQ? 
H TRPNNNTRKSIHLGPGQAWYTTGQIXGDXRQAH TRP - NYTRKXIXMGPGRXXYTTGEIIGDIRRAH TRPNNNTRKSXHl-GPGRAWYTTGQXIGDXROAH TRPNNNTRKSIH - GPGRAV-YTTGQIIGDIRQAH TRPNNNTRKSIPI-GPGQAWYTTGQIIGDIRQAH TRPNNNTRKGIPIGPGRAFYTTGDIIGDIRQAH TRPNNNTSKGIPIGPGRAFYATGXIIGDIRQAH TRPNNNTP-KGIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIPIGPGRAF HIV GTGQIIGDIRQAH input: 6110 TRPNNNTRKGIHIGPGRAFYTTGEIVGDIRQAH TRPNNNTRKGIHRGPGRAFYTTGGIIGDIRQAH TRPNNNTRKSXHMGQGRAFYATGGIIGDRRQAY TRPNNNTRKGIH - GPGQAWYTTGQIIGDIRQAH TRPN BITRKGTPI-GPsQAWYTTGQirGDIRQAQ TRLNNWGRKSIAIGPGRTVYATDRIIGDIRQAH TRPSKNI - RSIHISSGRAFYTIEGVAGDVR - AY TRPNNNTRRGIHIGPGRAFYATGNIIGDIRQAH RPSNNTRKSIHIGPGRVFHATGEIIGDIRQAH TRPNNNTRKRIYIGPGRAVYTT - QXIGNIRQAH TRPGNNTRERISIGPG - AFIARGQIIGDIRQAH TRPGNNTRKSXPXGPGRAFX TSQIIGOIRKAH??? IRPNNNTRKsiGisPGRTVYTAEKIIGDIRQAH TRPNNNTRKSIHIGPG - AFYTTGEIIGDIRQAH TRPNIYRKGRIHIGPGRAFI-TTRQII - NIRQAH TRPN - - T ---- SIHIGPG - AFYTTGEIIGDIRQAH TRPN - KTR - RITTGPGRVYYTTGEIVGDIRQAH TRPNNNTRKRITMGPGRVYYTTGQIIGDIRRAH IRPNNNTRKGINVsPGRALYTTGDIIGDIROAH OIPNNHTRKRVTLGPGRVWYTTGEII-GNIRQAH TRPNNNTRKSITLGPGRAFYTTGDIIGDIRQAH TRPNNNTRK £ HIAPG - AFYTTGDIIG - IRKAH TRPSNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPGNNTRKSIPMGPGRAFYATGDIIGDIRKAH TRPNYNKRKRIHIGPGRAFYTTKNIIGTI QAH TRPNNNTRKGIAIGPGRTLYAR - KIIGDIRQAH TRPNNNTRRRLSXGPGRAFYARRNXXGDIRQAH TRPNTKKIRHIHIGPGRAFYATGGIMGDIRQAH TRPNNNTRRSINIGPGRAFYTTGDIIGDIRQAH TRPNNNTSKRISIGPGRAFV? AREIIGDIRK? H IRPNNNT ---- SISIGPG - AFYTTGEIIGDIRQAH TRPNNNTTRSIHIGPGRAFYATGDIIGDIRQAH TRPNNNTRKSITIGPGRAFY? 
TGDIIGDIRQAH TRPNNNT-USIYIGPGRA-mTTGRIIGDIRKAH T-U'NNN ---- R1-XTSGPG-CVI.YTTGEIIGDIRKAY IRPNNNTRKGIHIGPsKAFYTTGEIIGNIRQAH TRPNNNTRKSINIGPGRALYTTGEIIGDIRQAH TRPNNTRKGIHIGPGRAFYATGEIIGDIRQAH TRPNNNT-WSIPMGPGKAFYTT-EIIGNIRQAH TRPSNYTGKRLSIGPGRAFVATRKIIGDIRQAH TRPGNNTRKSITMGPGKVFYA-GEIIGDIRQAH TRPNNNTRKSIPMGPGRAFYTTGEIIGDIRKAY VRPSNNTRQeiPIGPGKAFYATGEIIGDIRKAH TRPNNNTRR-WHIGPGSALYTT-DIIGDIRQAH IRPNNNTRRSINKGPGRAFYTTGDIIGDIRQAH TRPNNTRRSIHIG ?? sR? YTTGKITGDIRQAH TRPNNNTRKRITMGPGRVI, YTTSQIIGDVRR - H TRPNNNTRKSIHI? PGRAFYATGEIIGDIRQAH TRPNNNTRKGLHIGPGRAFYATGDIIGDIRQ? Y TRPSNNTRKGIPIGPGRAFYTTGGIIGDIRQAH TRPNNNTRKSIHIAPGRAFYATGGIIGDIRQAH TRPNNNTRRSINMGPGRAFYTTGDIIGDIRQAH TRPSNNTRKSITIGPGRAFYTTGEVIGDIRQAH TRPNNNTRRGXKIGPGRAFYTTGEIIGDIRQAH TRPNNN RKSIP ^ '. PGRAFYATGDIIGDIRQAH TRPNNNTRKSIH ± GPGKAFDAT-DIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGEIIGDIR - AH TRPNNNTRKGIHMGPGRAFYTTGAIIGDIREAH TRPNNNTRRSITIGPGRAFYAT-DIIGDIRQAH TRLSNKTRRSIHIGPGRAFYAT-DIIGDIRQAH TRPNNNTRR? IHIAPGRAFYATGDIIGDIRQAY TRPNNNTSRRISIGPGRAFTAREGIIGDIRQAH TRPNNNTRRSIHIGPGKAFYATGGIIGDIRQAK TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKSIHIGPGSAFYTTGDIIGHIRQAH HIV input: 7/10 TRPNNNTGKSIHI-APGRGFHATGEITGNIRQAH TRPNNNTRKGIAIGPGRTVYATGRIIGDIRQAH TRPNNNTRKSIHIGPGRAFYATGGIIGEIRQAH TRPNNNTRKGIPIGPGRAFYTTGDIIGDIRQAH TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH SRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH TRPGNNT - RSSHIGPGRAFYTTGEIIGNXRLAH TRPNNNTRKSIPSGPGRAFYATGDIIGDIRQAH TRPNNNTRKSIHIGPGRAFYTTGDIIGDIROAH TRGNNNTRKGXHXGPGRAFYATGEXXGNXRQAH TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH T-U > NNNTR-CGIHSGPGRAFYTTG - VIGNIRQ? H TRPNNNTRKSIPMGPGKAHYATGEIIGDIRKAY TRPNNNTRKSIHIGPGRAFYTTGEIVGDXRQAH TRPNNNTRKSIHIGPGRAFYAT-DIIGDIRQAH TRPNNNTRKSSPMGPGR? 
FYTTG - VIGNSRQAY TRPNNNTRKSIHIGPGRAFHTTGEVIGDIRQAH TRPNNNTRKSINIGPGRAFYATGEIIGDIRQAH TRPNNNTRKSINIGPGRAFYTTGEIIGDIRQAH IRPNNNTRRSSHKGPGRAFYATGDIIGDIRQAH IRPNNNTRRSINIGPGRAFYTTGDIIGNIRQAH TRPGNKTIRSS = KGPGRAF-RTGQIIGNIRQAN TRPNNNT - KSSPIGPG - AFYATGDIIGDSRQ? H TRPNNNTRRSIHIAPGRAFHATGNIIGDIRQAH TRPSNNTRKSVHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHI-GFGRAFY? TGEIIGDIRQAH IRPNNNTRKSIHSGPGR? FYTTGDSSGDSRKAH TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH TRPNNNTRKSIHIGPGR? FYTTGQIIGDIRQAH TRPNNNTRKSIPIGPGRAFYTTGDIIGDIRKAH TRPSNNTRRSIHMGLGRAFYTTGDIIGDIRQAH TRPNNNT-? CGXHIGPG-U-FYTTGQIIGDIRKAH • -rRPNNNTRRSXPIGPßRAFYTTGQIIGDXRQAH IRPNNNTRKSITHGPGKVFYVT-DIIGDIRQAQ TRPSNNTRKRXAIGPGRAVYTTE IIGDIRRAH ERPNNNTRKSXMXGPGRAFYATGDXXGDXRQAH TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQA-. TRPNNNTRKSIRIGPGQAFYATGEIIGDIRQAH TRPNNNTRKSISLGPGQAFYATGDIIGNIRQAH TRPNNNTRESIRIGPGQTFYATGDIIGDIRQAH TRPNNNTRQSIRIGPGTFYATGDIIGDIRQAH TRPNNNTRKSIRIGPGTFYATGOIIGDIRQAY TRPNNNTRKGVRIGPGQTFYATGDIIGDIRQAH TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH TRPNNNTRKSSRIGPGQTFYATCDIIGDIRQAH TRPNNNTRKSIRIGPGQTFYATGDIIGDIRRAY TRPSNNTRKSIRIGPGQTFYATGEIIGDIRQAH TRPNNNTRKSLRIGPGQTFYATGDIIGDIRRAK TRPNNNTRKSTRIGPGQTFYATGDXIGDIRQAH TRPNNNTRKSSRIGPGOTFYATGDtIGDIRRAY TRPNNNTRKSIRIGPGQAFYATGDIIGDXRQAY TRPNNNTRKSIRSGPGQAFYATNDIIGHIRQAH TRPNNTRQSSRIGPGQVFYATKDIIGDSRQAH TRPTNNTOOSIRIGPGQAFFATKGIIGDIRQAH TRPNNNTrtKSIfclGPGQTFYATGDIIGDIRQAH TRPNNNTRKSIRIGPGQAFYATGGIIGDIRQAH TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAY TRPNNNTRKSVRIGPGQTFYATGDIXGNIRQAH TRPGNNTRKSMRIGPGQPFYATGDIIGNIRQAH TRPNNNTRKSIRIGPGQAFYATNDIIGDIRQAH TRPNNNTRKSMRIGPGQTFYATGDIIGNIRQAH TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH VRPNNNTRKSIRIGPGQTFYATN * • - •• - •• * - TRPNNNTRQSVRIGPGQAFYATKDIIGDIRQAH TRPGNNTRKSIRSGPGQTFY TGDSSGDIRQ H TRPNNNTRRSIR SPGQVFYANNDSSGDSROAH HTV ipput: 8/10 TRPNNNTRKSIRIGPGQTFYATNEIIGNIREAH ARPNNNTRKSMRIGPGQTFYATGDIIGDSRQAH TRPNNNTRKSVRIGPGQTFYAT DIIGDIRQAH TRYANNTRKSVRSGPGQTF -TNDSSGDSRQAH ARPNNNTRESXRIGPGQTFYATG SSGDSRQAY TRPNNNTRKRIRVGÉGQTVYAT-ÍIAISGDIRQAH TRPSNNTRKSIRSGTOQAFYATss isNIRQAH 
ARPGNNTRKSIRIGPGQTFF TGASSGDIRQAH??????? TRPNNNTRKSrRItíPGQTFYATGDÍSGNXRQAH TRPYNNTRQRTHSGPGQAI-YTT-RISGDIRQAH TRPNNYKRQGTPSGLGQA YTT-RVIGDSR - AH TRPNNNTRQGTHIGPGQALYTT-GVSGDIRKAH TRPYNNTRQSTRSGPGQTS.FTT-KSSGDIRQAH TRPYNNTRQGTHisPGRAYYTT-NISGDIRQAH TRPYNNTRQGTHIGPGQT FTT-KSSGDSRQAH TRPYNNKRORTPIGL§QVLHTT-RVKGDIRQAH TRPYSRVRQGAHSGPGRAYYAT-NIFGDIRQAR TRPSNNTROSTRSGPGQALYTN-KSSGNSRQAH ARPYNNTROSTRSGPGQALFTS-KSXGNIRQAH TRPYENMRQRTPIGLGQAI-VTS-RIKGRIRPAY TRPYNNTRQGTHIGPGRAYYTT-RSLGNSRQAH TRPYNNTSQGTHSGPGRAYYTTISVIGDIRQAH TRPYNNTIQKTSSGRGQALYTT-ETRGDIKQAF TRPYNNIRQRTPIGSGQAI-YTT-RRIGDIRQAY TRPYNNTRQGTHIGPGRAYYTT-RIVGNIRQAH TRPYNNTRQSTHFGPGRAYYTT-DIIGDIRQAH TRPNNNTRQSTQIGPGQALFTKTRIIGDIRQAH TRPY --- TPIGLGQALI'3 WU-N-RSKAKIGQAY TRPYNQIRQRTSIGQGQALYTT-RVTGDIRKAY TRPYNNTRKGIHSGPGRAYYTT-NSVGNSRQAH TRPYDKVSYRTPIGVGRASYTT-RSKGDSRQAH TRPYNNIRQRTPIGLGQAUYTT- --RI - DI - RAH XRPYNNTREGTHXGP RALFTT-DXXGDIRQAH ARPYAIERQRTPSGQGQVI-YTT-KKIGRIGQAH TRPNNNTRQSTHSGPGQASYTI-TICWGDIRQAH SRPYEN - RRRTPSGLGQAYYTT-KI.KGYSRPAH TRPEKIKRRGTPIGLGQAYL-TT-TK-GIKGVAGQPH QITGYIRQAH TRPYRNIRQRTHXGTGQAY IRPNKTKIQRTSIG -? GQALYTNDKIIGNIRQAY ARPY KIWRRTHIGSGOAYSTK-RIQN TGPAH TRPKNITÍQRTPIGLGQALYTT-KP.IGVIGQAS? RPRNVTIQRTSSGSGQAI-YTT-KR1GYIKQAH TRPYTO-KIQRTHIGTßQA HTT-RITGYIGQAH TRPYYNSRQRTPIGI ---- QA YTTRGTTKVIGQAH TRPYNKTSQP.TSIGQGRAI-YTT-KPTGYIRQAY SRPYKSTRIRTHIG? GQAYYRT-NIQGDIRQ? 
And TRPYRAMRRRTSIGQGOAYYTT TGIGGNIRQAY TRPYSNKRQSTPÍGLGQALYTT-RGRGDIRKAH ARPYEKKRRTTPIGL & 2ALITS-RNFEKIGQAH TRPYKS IRRIGPGRWQTYY-TTNITGRAK IRPNKRTRQRTHIGSGQALYTT-KIVGDIRQAH TRPD- "KRQRTPSGQ ^ AI-YTTR1_.TTRRSGQPH HRPYNNXRQSVHSGPGRAFYTT-NSSG - IRQAK TRPYHNTRQGTHIGPGRAY TT-NIIGDIRQAH TRPYNNTRQGIHIGPGRAYYTD-QITGDSRQAH TRPSNNTRKSIHIGPGQALFTI-DSIGNIRQAH TRPNNNTRQSTHIGPGQALYTT-K1IGDSRRAH TRPANNTRQSVHLGPGQALYTT-RVIGDIRQAY TRPYNNIKIQTPSGRGQALFTT-RSKGIKGQAH TRPNNNTRQSIHIGPGQAt-YTr- NVSGDIRQAH TRPYTNKRQGT - MGPGRA - YTI-DITGDIRQAY TRPYNNTRQSTHIGPGQA.I-YTT-NIIGDIRQAH VRPYSNQRRRTPIG 3QALYTTMDNMKNIKQAY TRPYNNIKIQTPIGRGOAI-FTT-RRKGIKGQAH TRPYTKTRH RAQGRAVWTTGITGDIRQAY ARPYENSRQRTPIGTGQALYTGKK-SGKSGQAH HIVinput: 10.09 TRPYSKERI-KTSSGOGQALYTTVKVTGDSRQAH ARPYQNTRQRTPIGI-GQSLYTT-RSRSIIGQAH TRPNKITRQSTPIGLGQALYTT-RSKGDIRQAY TRPGNNTRRGIHFGPGQALYTT-GIVGDIRRAY TRPYKYTRQRTSIGLRQSI- YTIKKKTGYIGQAH TRPYRNIRQRTSIGLGQALYTT-KTRSSIGQAY IRPNNNTRQSTHLGPGQAI-YTT-KVSGDIRQAY TRPNNNTRKSIHISPGQAIYTT-DVIGDIROAY TRPNNNTRKGIHIGPGCALYTSGDIVGDIRQAH TRPNNNVRQRTPIGPGQAFYTTG ********** TRPSNNTRTSITI GPGQVFYRTGDIIGDIRKAY TRPFKNMRTSARIGPGQVFYKTGSITGDIRKAY TRPFK-amiSARIGPGRVFHTTGNINGDIRKAY TRPFKRVRT VRIGPGRVFHKTGAINGDIRKAY TRPSro-TRTSVRSGPGOVFYKTGpiSGDSRSAY TRPFKKTRSSARSGPGRVFHKTGAILGDSRKAF TRPSNNTRTSVRSGPGQVFYKTGESSGDSRKAF TRPSNKIRT-5VRIGPGQ TYX-K- I -??? 
GDIRKAF TRPSNNIRTSVRIGPGQVFYKTGSSTGDS - KAF TRPFK - MRTSVRIGPGRVFYKTGSSTGDIR - AY TRPYKNTRTSARIGPGQVFYKTGSITGDSRKAY RPSNNTRTSVRIGPGQVFYGTGEIIGDIRRAF TRPSTTIRTSSRIGPGQAFYKIEGISGNSRAAY TRPSNNTRTRSTSGPGQVFYRTGDSSGDSRKAY TRPSNNTRTSITIGPGQIFYRTGDIIGDSRKAY RPSNNTRTSXTIGPGQVFYRTGDIXGDIRKAY TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSXTSGPGOVFYRTGDSIGNIRKAY TRPSNNTP.TSITIGPGQVFYRTGDITC-ÍÍIR - AY TRPSNNTRTSIPIGPGQVFYRTGDIIGNIRKAY TRPSNNTRTSITMGPGQVFYRTGDIIG - IP --- AY TRPSNNTRPSITIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSITIGPGQVFYKTGDSS-AIS - KAY TRPSNNTRTSIPIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSSTIGPGQVFYRTGDSSGDIRKAY TRPSNNTRTSIPSGPGQAFYRTGD1IGDIRKAY TRPSNNTRTSSTIGPGQVFYRTGDIX-NIRKAY TRPSNNTRTSITIGFGQVFYRTGDIIGDIRKAY TRPSNNTRTSITIGPGQVFYRTGDISGDSXKAY TRPSNNTXPSSTXGPGQVFYRTGDISGDIRXAY TRPSNNTRTSITIGPGQVFYRTG-.IIGDIRXAY TRP - NNTRTSINIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSSTVGPGQVFYF.TGDSTGDIRKAY TRPSNNTRTSSPSGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSITIGPGQVFYRTGOIIGDXRQAY TRPSNNTRTSINIGPGQVFYRTGDIIGDIRXAY TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSSTIGPGQVFYRTGDIIGNSRKAY TRPSNNTRTGITIGPGQVFYRTGDIIGDIRXAY TRPSNNTRTSITIGPGQIFYRTGDIIGDIRKAY TRPSNNTRTSITIGPGOVFYRTGDIIGDIRKAH TRPSNNTRTSLTIGPGQVFYRTGDIIGDIRKAY TRPSNNTRTSLTIGPGOVTYRTGDXIGDIRKAY TRPSNNTRTSITIGPGQVFYRTGDSIGDXRRAY TRPSNNTRTSINIGPGOVFYRTGDIIGDIRKAY TRPSNNTRTSITIGPGQVLYKTGDIIGDIRKAY TRPSNNTRTSTTIGPGQVFYRTCDITGNIR - AY TRPSNNTRTSVRIGPGQVFYRTGDIIGDIRKAY TRP-.NNTRTSITIGPGQVFYRTGDIIGNSRKAY TRPNNNTRKSIH GPGQAFYATGDIIGDIRKAH TRPNNNTRKSIQ - GPGRAFYTTGEIIGDIRKAH TRPNNYTRKSIYFGPGRAFHTAGKISGDIRKAH TRPNNNTRKGIKIGPGRAFYATGDIIGDIRKAH ? 
TRPNNNIRKSIPLGPGRAFYATGEIIGDIRKAH TRPSKTIRRRIRIG -RVFYAT-GVNGDSRKAY TRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAY HIVinput: 10/10 TRPNNNTRKSIR GPGQVFYATGDISGDSRKAY TRPNNNTRJGSRSGPGRVSYATSAITGDIRQAH TRPNNNTRKSIHLGPGQAFYATGDISGDSRKAH TRPNNNTRKSIHLGPGQAFYATDDISGDSRKAH TRPNNNTRKSIHU5PGQAFYATGDIIGDSRKAY TRPNNNTRKSIHI-GPGQAFYATGDIIGDSRKAH TRPNNNTRKSSHLGPGQAFYATGDSIGDIRKAY TRPNNNTRKSIHI-GPGQAFYTTGDIXGDSRKAH TRPNNNTRKSIH - GPGQAFYATGDIXGDIR-CAH TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH TRPNNNTRKSIKLGPGQAFYATGDISGDSRI-AH TRPNNNTRKSSH GPGQAFYATGXIIGHIRKAY TRPNNNTRKGSH SVGRPFYRTVDIVSDSRKAH TRPNNNTRKSSH GPGQ FY TGDISGDSR -? AY TRPNN-TO - KSix - GPC-AFYTTGNIIGDIRKAH TRPNNNTRKSIHIGPGQAFYATGDIIGNIR - TRPNNNTRKSIHI-GPGQAFYATGNIIGDSR AH - AH-AH TRPNNNTRKSIHSGPGQAXYTTGDIIGDIR TRPNNNTRKSIHLGPGQAFYATGDSSGDSRKAH TRPNNNTRKSIHLGPGQAFYTTGDSIGDIRKAH TRPNNNTRKSIHGPGQAFYTTGDSIGDIRKAH TRPNNNTRKSIH - GPGQAFYATGDIIGDIRKAY TRPNNNTRK? IHLGGGQAFYATGDSSGDIRKAK TRPNNNTRKSSHLGPGQAFYATGDISGDSRKAH TRPNNNTRKSSHI-GPGQAFYATGDSSGDXRKAH TRPNNNTRKSIHLGPGQAFYATGGIIGNIRKAH TRPNNNTRKSIHLGPGQAFYATGDIIGDSRKAY TRPNNNTRKSIH - GPGQAFYATGDSIGDIRKA-- TRPNNNTRKSIHIGPGQAFYATGDISGDIRKAH TRPNNNTRKSSHSGPGQAFYATGDSSGDSRKAH TRPNNNTRKSXHXGPGQAFY? TGEVXGDSRK? K TRPNNNTR? IHLGPGQAFYATGDIIGDIRKAH TRPNNNTRKSIHLGPGQAFYATGDISGDIRKAH • raPNNin-RKSSHI-GPGQAFYTTGES SGDIRKAH TRPNNNTRKSSTSGPGQAFYATGDSIGDIRQAH TRPNNNTRKSISFGPGQAFYATGDXSGDSRQAH TRPNNNTRKSIHIGPGQA - YATGASSGDSRQAK TRPNNNTRKSIKFGTGRVT-YATCAISGNIRQAH TRPNNNTRKSIRIGPGQAFYATGESSGDSRQAH TRPNNNTRKSITLGPGQAFYATGDISGNSRQAH TRPNNNTRKSITFAPGQAFYATGDIIGNT QAH IRPNNNTRKSIPIGPGQAFYATGDIIGDIRQAH TRPNNNTRKSISIGPGQAFYATGDISGDSRKAY TRPNNNTRK -. ISIGPGQAFYATGDISGDSRKAY TRPNNNTRRSMRIGIGRGQTFHGAIIGDIRQAH TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAH TRPNNNTRKSINIGPGRAFYATGDIIGDIRQAY TRPNN RNIRTHSGSGQAIFTT-KVIGDIRK And TRPNNNTRTSIHLGPGRAFYATGDIIGDIRQAH? 
TRPGNTTRRSMRIGPGRTFYTS GDIRKAH TRPNNNTRKSVRIGPGOTFYATGDKKGDIRQAK TRPNNNIRK3IRIGPGQAFFATGDIIGNSRQAQ TRPNNNTRKSIRFGPGQAFYT-SDIIGDIRQAY TRPNNNTRRS? HVGPGQAFYATGDSIGNIRKAH TRPSNNTRRSIRFGPGQAFY-TNDSSGDIRQAY TRPGSDKKIRIRIGPGKVFYAKGGITG QAH ERPGIDSQE-IRIGPMA-WYSMGLGGTSSRAAY --RPO? NIQE-KRIGPMA-WYSMGIGGTSSRAAY IREIAEVQD-IYTGPMR-wrsm KRSNPRSRVA ERPGNQTIQKIMAGPMA-WYSM- TKRA-AY

APPENDIX C

HIV output: 1/6

or 0.00 A18 | Q31 | H33 S fc 36019 fc 15684.208314 fc 0.000000 fc \ cr
1 0.00 A18 T21 S 33816 fc 12392.399254 fc 0.000000 t \ cr
2 0.01 A21JD24 S 4S549 k 17706.407140 fc 0.000000 k See
3 0.01 H12JA18 S 86025 fc 24619.776947 fc 0.000000 k \ cr
4 0.01 H12JR17 $ 48257 and 19028.783S92 fc 0.000000 i \ cr
5 0.01 I11JR17 S 64548 £ 27053.952336 and 0.000000 k .cr
6 0.02 L13JK31 S 39382 k 17335.347894 fc 0.000000 k \ cr
7 0.02 - 13JW19 | Q24 $ fc 20184 & 379.160544 k 0.000000 k \ cr
8 0.02 K13J 1S S i 23300 k 6673.177086 i 0.000000 fc See
9 0.02 N4I 9 $ t 162152 k 74737.922307 k 0.000000 k See
10 0.03 $ N4 | K9fH33 $ fc 26376 k 5666.716129 fc 0.000000 fc \ cr
11 0.03 $ Q17 | D24 $ fc 86891 k 17162.233105 fc 0.000000 fc \ cr
12 0.03 Q31JH33 $ fc 233190 k 186078.818611 fc 0.000000 k \ cr
3 0.03 R12 | 017 k 53740 k 10564.956S12 fc 0.000000 fc \ cr
4 0.04 R12JT18 k 62774 k 183S9. 197022 fc 0.000000 4 \ cr
S 0.04 R17JA18 k 54366 fc 27136.429076 k 0.000000 fc See
6 0.04 R17JE24 k 33748 fc 10413.2S5S92 fc 0.000000 k \ cr
7 0.04 R17JQ31 k 45065 k 26805.242087 k 0.000000 k See
8 0.05 I R17JT21 k 70301 k 16232.354294 fc 0.000000 k See
9 O.OS!
$ S10 | D24 k 57772 fc 17415.113746 k 0.000000 k See 0 0.05 ¡k S V11 | R12 S k 39546 k 18975.126308 k 0.000000 k See 1 0.05 $ V11 | R12 | T18 $ k 17628 and 881.251263 k O.000000 k Vc-2 0.06 $ 31J 33 S fc 36346 k 20803.634880 fc 0.000002 fc See 3 0.06 N4 | A21 S k 45441 fc 30227.409858 k 0.000003 - See 4 0.06 Q17 | K31 $ k 2S033 k 10875.740384 k 0.000018 k See 5 0.06 G10JH12 $ t 20779 fc 7151.794446 k 0.000041 k See 6 0.07 K9 | A21 $ k 40098 k 27695.038620 k 0.000231 J- See 7 0.07 F19 | D24 $ t 29121 fc 16875.538795 k 0.000286 fc See 8 0.07 Q17JA21 $ k 29621 k 18109.021417 k 0.000737 k See 9 0.07 H12JE24 $ k 22348 k 10939.327036 fc 0.000839 fc \ cr 0 0.08 N4 | K9 | I11 S k 15175 k 4159.316971 k 0.0013SS k See 1 0.08 $ S4JT9¡T12 | V18 | R21 $ k 10919 k 1.718549 k 0.001524 k See 2 0.08 $ N4JK9JA21 $ k 11233 k 623.181959 k 0.002185 k See 3 0.09 $ N4 | 031 | H33 $ t 21868 fc 11328.342993 fc 0.002369 fc See 4 0.09 $ F19 | A21 $ fc 44400 t 34516.144368 fc 0.004910 fc See 5 0.09 K9 | Q31 | H33 $ k 16593 fc 6991.723713 k 0.00 6625 fc See 6 0.09 W19 | Q24 $ k 16738 k 7234.038664 k 0.007331 k See 7 0.10 The | NI2 S fc 10844 k 1492.335945 k 0.00857S fc See ß 0.10 K9JE24 S k 13847 fc 4S8 .312260 fc 0.009408 fc See 9 0.10 9JF. 17 S i 3373S k 24568.179150 fc 0.010326 fc \ cr 0 0.10 T12 | V18 S k 23076 fc 14893.617567 fc 0.026158 fc See 1 0.11 S R12 | A21 S fc 15497 fc 7516.155896 fc 0.031231 k See 2 0.11 $ - 4 (K9! Q31 | H33 S fc 6280 fc 493.681367 fc 0.0369OS fc See 3 0.11 $ N4 | K9JA18 $ k 116S5 fc 4250.900600 fc 0.050618 fc See 4 0.11 fc S S4¡T9 | T12! V18jR21 | Y33 S * 7370 fc 0.093039 fc 0.0S2029 fc View 5 0.12 fc S R12 | Q17 | T18 $ fc 7452 fc 240.364918 fc 0.058992 -. 
See 6 0.12 S V11 | Q17 S fc 14350 fc 7329.962834 fc 0.068429 fc See 7 0.12 $ K12JT21 $ t 23263 fc 16324.923094 fc 0.072B25 fc See 8 0.12 S 017JY33 S fc 17288 k 10374.788061 fc 0.074203 fc See 9 0.13 S L13JW19 S fc 15S36 fc 8921.243955 fc 0.092437 fc See 0 0.13 S S17JH28 S fc 6529 fc 138.997153 fc 0.108375 fc See 1 0.13 5 N4 | K9 | Q31 $ fc 10228 fc 38S4.612095 fc 0.11270a fc See 2 0.13 $ X8JS17 $ fc 6573 fc 275.512362 fc 0.115524 fc See 3 0.14 S R17¡Q31 | K33 S fc 7265 k 1223.984346 fc 0.13723S fc See 4 0.14 S T9IT12 | V18 | R21 S 6 6003 t 30.417827 fc 0.143515 fc See S 0.14 $ K4¡K9 |? l8 | H33 S fc 6380 fc 549.756091 fc 0.157254 • - See 6 0.14 S S10! F19 | D24 S fc 6150 fc 620.344848 fc 0.189437 í See 7 0.15 fc S I11ÍR17 | A18 S k 65S5 fc 1027.737537 fc 0.189642 fc \ cr 8 0.1S & S VllJR12 | Q17 $ í 5751 fc 247.598509 fc 0.192378 fc See 59 0.15 fc S S4 | T9! V18 | R21 S fc 5514 fc 35.313082 fc 0.195240 fc See 60 0.15 fc S S4 | T9JT12 | V18lR21 | K31 S fc 5462 fc 0.090571 fc 0.197200 fc See 61 0.16 fc SH) 2 | R17 (A18 S fc S618 fc 172.948903 fc 0.199184 fc See 62 0.16 fc S Q9 | Tll | -.19 | -23 S fc 5464 fc 38.188997 fc 0.201464 fc See 63 0.16 fc S Y4 | Q9 | TU! 
-23 S t S364 fc 35.276055 fc 0.213243 fc See 64 0.16 fc $ N4 | A18 | Q31 | H33 S fc 6378 fc 1180.344841 fc 0.229871 fc \ cr 65 0.17 fc S L3JN12IR23 S fc 5114 fc 15.794611 fc 0.243044 fc See HTV output: 2/6 fc S V13 | V15 | X19 S fc S095 fc 4.314940 fc 0.244059 I See fc S R12 (Q17 | D24 S fc S088 fc 122.811489 fc 0.261410 fc See fc S S4 | T9 | V18 $ fc 5671 fc 868.949180 fc 0.285090 4 See $ G24 | E2B S fc S363 fc 579.114112 and 0.23780S fc See fc S S4 | T9 | R21 S fc 5425 fc 650.238601 fc 0.289174 k \ cr fc $ K9 | I11 | R17 $ fc 5315 fc 590.615207 fc 0.296804 fc See fc S V1B | K31 | Y33 S £ 5524 4 852.751002 £ 0.304979 fc See fc $ T21JE24 S fc 19192 fc 14557.811161 fc 0.310756 fc See fc S S4 | T9 | T12 | V18 | R21 | K31 | Y33 $ fc 4390 fc 0.004904 fc 0.350351 fc View & S S4JT9 | V1BJR21 | Y33 $ k 4341 fc 1.910712 k 0.358927 & See & $ I11 | H12 | A18 $ i 5225 and 890,707158 fc 0.359740 fc \ cr fc $ H12JI-13 $ fc 9314 & 5009.363342 - 0.364791 fc See £ M1 | S12 | F20 $ i 4243 fc 17.800459 fc 0.378494 £ See fc $ Y41Til | -23 $ t 4876 £ 710.489341 fc 0.388952 - See £ S H12 | A1B | H33 $ fc S292 6 1141.301814 £ 0.391569 fc See fc $ N12 | G24 | 25 $ fc 4169 fc 18.987442 fc 0.391690 t See fc $ N12JT13 $ fc 5365 £ 125S.021021 fc 0.398803 fc See £ $ N4 | K9 | G23 $ £ 9804 fc 5726.074196 £ 0.404S40 £ View £ $ P12 | 13 | W19 | C2 * $ £ 4070 £ 20.998880 fc 0.409748 £ View & S Q12JY13 | T15 | G17 | V26 $ k 4024 £ 0.000255 & 0.414274 fc See £ S S10¡F19 | A21 $ £ 5598 fc 1607.067572 £ 0.420292 £ View £ $ K9 | H12 S £ 26788 £ 22912.753561 £ 0.441631 £ View $ S10 | Q17 | D24 $ Fc 3960 Fc 93.803024 £ 0.443318 Fc See $ Q17JA21JD24 $ fc 3949 £ 133.098101 £ 0.452738 fc See $ N4 | K9 | H12 S £ 4239 £ 450.472945 £ 0.457896 £ View S T9 | T12 | V18 | R21 | Y33 S £ 3784 fc 1.646276 fc 0.4S9063 fc See $ Y4 | Q9 | T11 $ £ 4402 £ 638.401728 £ 0.462612 fc See S N4 | K9 | R17 $ £ 4239 £ 507.820002 £ 0.468770 £ View $ N4 | K12 |? 
Lß $ £ 4450 £ 726.677198 £ 0.470266 £ View S 09 | T11 | --19 $ fc 4413 £ 691.708041 £ 0.470653 £ View $ S4 (T9) T12 | R21 S £ 3747 £ 31.482325 £ 0.471755 £ View S N12 | S30 $ i 4440 £ 766.347625 £ 0.479764 £ View $ X1JS2 $ fc 3970 k 345.480880 £ 0.489218 £ \ cr $ S4JT12 | V18 | R21 S & 3643 £ 32.472859 £ 0.491921 £ View $ Q9 ¡Til | -23 $ £ 4299 £ 742.828036 fc 0.502461 £ View S T21JQ31 S £ 16089 £ 12621.469597 £ 0.519777 £ View S K9 |? 18 | Q31 | H33 $ £ 4160 £ 697.083962 £ 0.520683 £ View S S4 | T9 | R2l | Y33 $ £ 3460 £ 35.030271 fc 0.528142 fc See S Y4JQ9Jtiljl-19 | -23 $ £ 342S fc 1.824291 fc 0.528495 £ View S S6¡K7 | T10 | 1-, 11 | M13 | K16 | G26 | Y28 S 4 3409 £ 0.000000 t 0 531288 See S S4JT9 | V18 | R21 | K31 S £ 3406 £ 1.8600S7 fc 0.532246 fc See S S17 | IT9 S £ 4910 i- 3.S10.151983 fc 0.533093 fc See S Y12 | H20 [R24 $ £ 3401 fc 29.556849 fc 0.533702 fc See $ S4 | T9 | V18 | R21IK31 | Y33 S fc 3370 & 0 100690 fc. 0 S39008 fc See S S10JQ17 $ £ - 22065 £ 18738.120311 fc 0.547525 £ View 5 All-22 | S23 $ £ 3303 £ 7.355264 £ 0 553724 £ View $ M13 | W15 | E31 S * 3339 fc Sfi. 71417 fc 0.556389 £ See S • 24l * 25 | * 26 | «27 | * 2ß | * 29 | * 30 | * 31 | * 32 2 |, • 3333 $$ fcfc 3 3226699 fcfc ii .0 00000000000 fcfc 0 0. .555599002200 í \ fc S R17 | H33 S £ 31466 £ 28229.156188 fc 0 565421 £ See £ S M13 | W15 | T18 S «• 3501 and 360.659791 fc 0.584679 4 \ cr fc S F13Í-22JS23 $ fc 3123 fc 6.681356 fc C 589460 £ See fc S R17 | A18JT21 $ fc 3190 fc 89.245042 fc 0.592593 fc \ cr fc S N4 | K9 | A18 | U31¡H33 S «- 3143 fc. 55.455693 fc 0.594235 fc See £ S R17 |? 
18 | 031 | H33 $ £ 3144 fc 101.027645 & 0.6041S3 fc \ cr £ S VllN23 | * 24 | * 25 | * 26t * 27 | * 28j * 29 | * 30 | »31 | * 32l« 33 S fc 3030 fc 0.000000 i 0.606 b $ A11 | N12 S £ 4517 £ 1492.835945 £ 0.607916 fc See fc S R12JT18 | A21 S £ 3150 U 134.485298 £ 0.609647 fc See £ S S10 | G23 | D24 S £ 3606 fc 599.551395 fc 0.611461 fc See fc S S1 | M13 | 15 S í 3087 £ 91.193028 £ 0.613590 £ \ cr fc S N12 | F20 (K24 S £ 3202 £ 213.735139 fc 0.615099 £ \ cr fc S K13 | W1S | E24 S fc 3282 fc 306.430052 fc 0.617639 fc See t S K9 | I11 | F19 | G23 S fc 4153 fc 1180.595212 fc 0.618272 fc See fc S R2 | P3 | N5 | N6 | T7 | R8 | G14 | P15lG16 | Y20 | T-; 2 | G23 | I25iI26lG271I29 | R3O | A32 S fc 3353 & S H121? L8 | Q31 S fc 3759 fc 845.163446 i 0.629981 fc > cr fc S Kl? | D20J-23 S fc 2928 fc 25.438797 fc 0.632234 £ See fc S Y5 | K7 | R10 | K23lM24 | T28 S fc 2897 fc 0.000008 fc 0.633345 fc See HIV output: 3/6 132 0.34 £ S G10 | R17 S £ 9506 £ 6637.164691 £ 0.638967 fc See 133 0.34 £ S Y4 | Q9 | -23 S £ 3539 & 699.676594 £ 0.644852 fc See 134 0.35 £ S G12 | A22 | D23 | N24 S fc 2838 £ 0.092735 £ 0.645134 fc See 135 0.35 fc $ Tl | R2 | P3 | NS | N6 | T7 | R8 | G14 | PlS | G16 | Y20 | T22 | I25 | I26 | G27 | I29 | R30 | A32 S fc 3787 s 136 0.35 £ S Tl | R2JP3IN5 | N6IT7JR8 | G14 | PlSJG16 | Y20 | T22 | G23 | I25 | I26 | G27 | r29 | R30 | A32 S fc 3C 137 0.35 fc $ N4JH12 S * 26775 £ 23945.157075 fc 0.646741 fc See 138 0.36 fc S V18 | * 24 | * 25 |, 26 | * 27 | * 28 | * 29 | * 30 | * 31 | '32 |, 33 S £ 2775 fc 0.000000 & 0.657651 139 0.36 fc $? 
L | R9 | -22 | S23 S fc 2763 £ 0.413224 £ 0.660115 £ See 140 0.36 fc $ Y6 | Y12 | F13 | H20 | A22 | K24 $ fc 2761 fc 0.000052 fc 0.660430 & See 141 0.36 fc $ VllfA24 | S2B $ £ 2788 £ 31,535138 £ 0.661330 fc See 142 0.37 £ $ V20 | I22 | -24 | K25 | H29 $ fc 2748 t 0.000084 £ 0.663009 £ See 143 0.37 £ $ K9 | H12 | A18 $ £ 3267 £ 526.343072 £ 0.664465 £ View 144 0.37 £ $ T9 | T12 | V18 | R21 | K31 £ £ 2742 £ 1.602621 £ 0.664517 £ View 145 0.37 fc $ T8JR9 $ £ 3185 £ 445.441758 fc 0.664683 £ View 146 0.38 £ $ I11 | H12 | R17 $ £ 2909 fc 172.776969 fc 0.665344 £ View 147 0.38 £ $ Y6 | X10 | X12 | H13 | X18 | X19 | R31 $ £ 2736 £ 0.000000 £ 0.665388 £ View 148 0.38 £ $ A24 | S28 $ £ 3300 £ 566.063943 £ 0.665797 fc See 149 0.38 £ $ G12 T18 |? 22 | D23 | N24 S £ 2692 £ 0.005083 £ 0.674094 fc See 150 0.39 £ $ P12J 19JQ24 $ £ 3054 £ 395.340658 £ 0.680669 £ View 151 0.39 £ S? 14JH20JN24 $ £ 2697 fc 47.434702 fc 0.682460 fc See 152 0.39 £ $ T9 | T12 | V18 | R21 | K31 | Y33 $ fc 2632 £ 0.086760 fc 0.685931 fc See 153 0.39 fc $ R12 | Q17 | A21 S fc 2701 fc 79.04S229 £ 0.687887 £ View 154 0.4O £ $ R17JA18JH33 S £ 3944 fc 1325.820339 £ 0.68862 8 fc See 155 0.40 fc $ W15JX19JA24 $ £ 26S5 £ 56.835384 £ 0.692552 £ View 156 0.40 £ $ Q12 | R13 | V20 | I22 | K24 | -26 | M29 $ fc 2584 £ 0.000000 £ 0.695324 fc See 1S7 0.40 fc $ Sl | Y4f-6 | N10 | Yll | S12 | S151V21 | K24 $ £ 2554 Fc 0.000000 Fc 0.701181 & '.cr 158 0.41 fc $ T18 | A21 $ £ 6883 £ 4332.151205 £ 0.701796 £ View 159 0.41 fc $ K17JD20 $ £ 2996 fc 458.835571 fc 0.704460 fc See 160 C.41 fc $ Q17 | D24 | K31 $ fc 2660 £ 125.180912 £ 0.70491E £ See 161 0.41 £ $ 13JQ15J 19 $ fc. 
2582 £ 98.222466 fc 0.714B12 £ See 162 0.42 £ S S4 | T9 | R21 | K31 | Y33 $ fc 2474 £ 1.844223 fc 0.717056 fc See 163 0.42 fc $ ri | a4 | Mll | P18 | R22 | -24 | V2S $ fc 2445 fc 0.000002 £ 0.722286 fc See 164 0.42 fc $ S12 | F13 $ £ 4939 £ 2502.178252 £ 0.723857 £ See 16S 0.43 £ $ l-13 | Q17 | K31 S £ 2663 £ 227.467572 £ 0.724104 £ View 166 0.43 fc $ K9 | - U7 | K33 $ £ 3142 £ 710.504406 £ 0.724879 £ View 167 0.43 £ $ P12 | 13 | 19 $ £ 2907 £ 483.231131 £ 0.726360 £ View 168 0.43 £ $ K9 | R17 | A18 $ £ 3012 £ 598.308696 £ 0.728290 £ \ cr 169 0.44 £ S S4JT12JR21 S t 3010 £ 597.264141 £ 0.728473 £ View 170 0.44 fc S N4J? 11JR17 S fc 3233 fc 820.559839 i 0.728529 fc See 171 0.44 fc S M13 | A24 | E31 S fc 2426 fc 50.435106 fc 0.735563 fc See 172 0.44 £ S L3 | A12 | T18 | V19 | D23 | R24 $ fc 2374 & 0.000104 fc 0.735861 fc \ cr 173 0.45 fc S K9 | A21 | H33 S £ 3269 fc 897.012220 fc 0.736243 fc See 174 0.45 £ S R2 | P3 | N5 (N6ÍT7jR8 | G14 | P15 | C16 | F19 | Y20 | T22 | G23 ( I2S | I26 | G27tr29 | F.30jA32 S i - 175 0.45 fc S R10 | X11 | S12 | V25 S i 2345 £ 0.448221 £ 0.741446 fc See 176 0.45 £ $ K4 | K9lIll | G23 $ fc 2883 fc 541.944923 fc O 742108 fc See 177 0.46 fc S R17IA13IQ31 S £ 3304 fc 973.769538 fc 0.744153 fc See 173 0.46 fc S Y4 | Q9 | Tll | F13 | Y19 | -23 S fc 2321 £ C.009829 fc 0.74589b fc See 179 0.46 fc S I7jF20 | Q33 S fc 2355 fc 38.678004 fc 0.746775 fc See 180 0.46 fc S T9 | Vl | | K31 | Y33 $ fc 2352 fc 43.522103 i 0.748251 fc See 181 0.47 fc $ 3 | A12 | V19 | D23 | R24 $ £ 2307 fc 0.001890 fc 0.748529 fc See 182 0.47 fc 5 G4 | M11 | P18 S £ 2306 fc 12.419975 fc 0.751048 fc See 183 0.47 fc S S4 | T12¡V18 | R21IY33 $ fc 2292 fc 1.757250 fc 0 751673 í \ cr 184 0.47 fc S K12 | R17 | T21 S fc 2417 fc 129.651999 fc 0.752215 fc See 185 0.48 fc S R10 | S12 | W19 | Q24 $ fc 2299 fc 14.238983 fc 0.7S2700 fc Vc > - 186 0.48 fc S D4 | E6 | I7 | I-lllC12 | Vl3jV20 | A22 | T24l-25 |? 26.T29 | Q33 $ fc 2279 and 0.000000 - C. ^ c. 187 0.48 fc S G10¡? 
24 S £ 2727 £ 449.008967 £ 0.753966 £ View 188 0.48 £ S V19 | R24 | V26 | 31 S £ 2272 £ 0.404088 £ C.755161 fc See 189 0 49 '$ V111R12ÍQ17JT18 S t 2281 £ 11.909386 fc 0.755629 6 See 190 0.49 «S 13¡W15 | Q24 | E28 $ fc 2270 £ 0.994080 fc 0.755644 fc See 191 0.49 fc S T1 | R21P3 | N51N6! T7 | R8 | G141P1S | G16 | F19 | Y20 | T22 | G23 | I2S | I26 | G27 | I29 | R30 | A32 S 192 0.49 fc S R17 | T21 | E24 S fc 2366 £ 123.762808 £ 0.760627 fc See 193 0.50 fc S M13 (W15 | N28 S fc 2610 fc 372.687253 &0.761541 fc See 194 0.S0 &S M13 | K17 | V26 S fc 2455 fc 218.504333 fc 0.761692 fc See 195 0.50 & $ M13¡Q15 | G24 S fc 2336 fc 100.386181 fc 0.761856 fc \ cr 196 O. If fc S F19 | G23¡D24 $ fc 3105 fc 885.799386 fc 0.764893 fc See 197 0.51 fc S M11 | I15ÍG18 [Q] 9 | T201F21 | H22 | A24 S £ 2218 fc 0.000000 fc 0.7651V5 fc See HIV output: 4/6 198 0.S1 fc S R9 |? L4 | H20 | N24 S £ 2214 £ 2.664734 £ 0.766345 £ View 199 0.51 & S S4 | T9 | V18 | K31 $ £ 22S3 £ 4S.36768S fc 0.767028 £ View 200 0.52 & S Y4 | Tll | 19 | -23 S £ 2222 £ 36,549243 fc 0.771106 £ See 201 0.52 fc S K9 |? 
L8 | H33 S fc 9877 fc 7701.775158 £ 0.772980 £ View 202 0.52 & $ T9 | S18 | H20 S fc 2246 £ 73.652930 £ 0.773506 fc See 203 0.52 £ $ G10 | S17 | I19 S £ 2252 £ 85.340274 fc 0.774546 £ View 204 0.53 £ $ T12JF13JA14 $ £ 2217 £ 52.732117 fc 0.774983 fc See 205 0.53 £ S N4 | A21 | H33 $ £ 3446 fc 1316.842768 fc 0.781367 fc See 206 0.53 & S N9 | R10 | S12 | H20 | K23 | Q24 S fc 2120 £ 0.000164 fc 0.783023 4 See 207 0.53 fc S R12 | T18 | D24 S £ 2373 £ 258.082710 fc 0.783941 fc See 208 0.54 £ $ T21JH33 S £ 13722 and 11609.274530 £ 0.784336 £ See 209 0.S4 £ S T9 | V18 | R21 £ £ 2733 £ 627.501151 fc 0.785638 fc See 210 0.54 fc S L13 | K17 | V19 $ fc 2223 fc 123.584647 & 0.786733 £ View 211 0.54 £ $ T9 | V18 | Y33 $ £ 2928 £ 837.332414 £ 0.788304 £ See 212 0.5S £ S The | Q4 | rS | D6 | I7 | Q8 | E9 | -10 | Mll | M16 | A17 | - 18 | 19 | S21 | M22 | l24 | G25 | G26 | T27 | S28 | S2 213 0.55 £ $ Y5 | K7 | K23 | N24 | T28 S £ 2075 £ 0.000148 £ 0.791109 fc See 214 0.55 fc S S4 | W15 | I19 | A24 S £ 2071 £ 3.211998 £ 0.792396 £ View 215 0.55 £ $ N4JK9 | A18 | Q31 $ £ 2414 £ 350.433693 £ 0.793149 £ View 216 0.56 fc S X12 | 13 | N24 S £ 2091 £ 32.654961 fc 0.794078 £ View 217 0.56 £ S S8 | R10 | S12 | R20 | -23 | K24 S £ 2056 £ 0.001995 £ 0.794496 £ View 218 0.56 fc S S10 | A21 | D24 5 and 2195 & 141.715389 fc 0.794978 fc See 219 0.56 & S T5 | K6 | K7 | I8 | H10 | G24 | M26 S fc 2049 £ 0.000000 fc 0.795739 fc See 220 0.57 fc S T9 | V18 | K31 S 4 2861 fc 816.350692 & 0.796510 £ See 221 0.57 fc S I20 | A22 | T23 | K24 S £ 2040 fc 0.135518 & 0.797358 & See 222 0.57 fc S Y3 | A4 | -21 | N23 S fc 2039 £ 0.001752 fc 0.797S11 fc See 223 0.57 fc S G4 | W11 | D23 | G24 S fc 2039 £ 0.335758 fc 0.797570 £ See 224 0.S8 fc S 111 | E24 S £ 4624 fc 2601.997572 fc 0.798748 £ View 225 0.58 fc S G10 | G17 | G24 $ £ 21S7 £ 138.303116 £ 0.801095 £ See 226 0.58 fc $ Y6 | S8 | R10 | A15 | R16 | K22 | K24 $ fc 2011 £ 0.000000 fc 0.802448 & See 227 0.59 4 $ S4 | T9JR21 | K31 $ fe 2043 fc 34.105157 fc 
0.802818 £ See 228 0.59 £ $ D4JE6 | I7 | R9 | 11 | Q12 | V13 | V20 | A22 | T24 | -25 | A26 | T29 | Q33 $ fc 1999 £ O.OOGOOO fc 229 0.59 £ $ Q9 | T11 | L19 | -23 | K24 S £ 1990 £ 1.569195 £ 0.806400 £ View 230 0.59 £ $ S8JP12JX24 $ £ 2000 £ 16.301963 £ 0.807225 £ View 231 0.60 £ $ S4¡T9 | T12 | R21 | Y33 S £ 1985 £ 1.703737 fc 0.807295 fc See 232 0.60 fc S R10 | Y12 | V19 | Q24 | R31 5 t 1982 fc 0.043977 fc 0.807529 fc See 233 0.60 £ S T4 | Q9 | F20 | K23 | G24 $ fc 1979 fc 0.004973 i 0.808044 fc See 234 0.60 fc $ ll | S12 | r-13 | V26 S fc 1972 fc 4.533048 - 0.810047 fc \ cr 235 0.61 £ 5 T5 | K6 | K7 | I8 | R9 | H10 | G24 | M26 S fc 1967 fc 0.000000 fc 0.810128 & \ cr 236 0.61 fc S S6¡K7 | T10 | ll | K16lG26 | Y28 S fc 1356 fc 0.000000 fc 0.812033 £ \ cr 237 | 0.62 fc S T9JV13 | R21 | Y33 S fc 1983 fc 33.839576 fc 0.B13214 £ See 238 0.61 fc S R2 | P3 | NS | N6 | T7) R8 | G14 | P15lG16 | Y20 | T22 | G23 | I2b | r26] G2 1-.28II29) RiO | A32 S i 239 0.62 fc S F19¡A21 | D24 S & 2034 fc 139.045764 fc 0.813940 fc \ cr 240 0.62 & S I-1 | M13 | W15 S fc 1949 & 9.905173 fc 0.814948 fc See 241 0.62 fc S Q9jTllÍQ12i - 19 | F20 | K22¡T23lR24 S £ 1933 fc 0.000000 fc. 0.615996 i View 242 0.62 £ S F19 |? 21lG23 S 4 4336 k 2404.525279 fc 0.816257 fc \ cr 243 0.63 fc S H12R17 | E24 S fc 2006 t 91.3867.94 fc 0.819143 fc See 244 0.63 fc S 13JW15 | V26 S t 2149 fc 237.129335 i 0.819611 fc See 245 0.63 fc S N12ÍWT.9J-23I - 24 S - < 1909 fc 7.808753 £ 0.821430 t. - ~ r 246 0.63 fc S Tl | R2 | P3 | N5 | N6iT7 | R8 | G14lP15 | GlSIY20 | T22¡I25 | G271I2i. 
| R3C | A32 S 4 4991 t 3- 247 0.64 t S T21 | Q24 S fc 7773 £ 5882.829453 fc 0.823300 £ See 248 0.64 fc S G4 | V11 | R12 S i 2497 4 608.660368 S 0.823S10 fc \ cr 249 0.64 fc 3 Q17 | K3l | Y33 S & 2149 fc 263.756833 fc 0.824134 fc See 2S0 0.64 £ S K25JK26 S fc 2096 & 217.906699 fc 0.82S341 fc See 251 3.65 fc S T21¡Q31lH33 S 4 2236 fc 361.582066 & 0.825961 4 See 252 3.65 4 S T12¡V18.R21 S 4 2446 fc 576.530816 fc 0.826794 See 253 0.65 fc S Y4 | Q9 | TÍ1 | L19 | -23 | N24 S £ 1869 fc 0.047981 fc 0.826881 4 See 254 0.65 £ S H12 | Q31 | H33 S £ 2S22 í -055.347497 £ 0.827268 £ \ cr 255 «6 fc S R9 | M1HI15 (GISIQ19 | T20 | F21 | H22 | A24 S fc 1865 £ 0.000000 Sr 0.827546 4 View 256 C. 6 fc S Vl! TlR | N23 | * 24j "25 |« 26 «2 |« 28 | «2 |« 30 | «31 ¡« 32 | «33 S fc 1659 4 0.000000 257 0.66 fc S G4 | I-13 | W15¡A24 S fc 1866 & 7.095250 fc 0.828568 4 See 258 0.66 fc S P12 | T21 S 4 11256 -. 9405,119546 fc 0.829912 fc \ cr 259 0.S7 4 S T9 | K31¡Y33 S 4 2669 4 823.122139 4 0.830747 fc See 260 0.67 4 S K7 | A14 (A24 | V3-S 4 1H44 4 0.130705 4 Q.831083 4 See 261 0.67 fc S Q9Í'rll | I19) - 221T23 | K24 | V2S | V26 S 4 1841 4 0.000000 4 0 331561 4 View 262 0.68 S R9¡K17 | G24 S 4 2157 fc 318.307130 fc 0.831945 4 See 263 0.68 fc S A14IG24 S 4 5018 4 3183.960671 fc 0.832719 fc See HIV output: 5/6 £ $ Sß | R10 | S12 | V20 | A22 | R23 $ 4 1834 £ 0.000248 £ 0.832726 fc See £ S L11 | S12 | V26 $ £ 1919 fc 85.184287 fc 0.832757 £ View 4 S R2 | P3 | NS | N6 | T7 | R8 | S10 | G14 | P15 | G16 | F19 | Y20 | T221G23II25 | I26 | G27 | I29 | R30 | A32 S £ S R12 | T18 | R31 $ £ 2336 £ 510.426023 £ 0.834125 £ View $ V11 | R12 | D24 S £ 2089 £ 265.729908 £ 0.834506 £ View S H12JA18JT21 ££ 1899 £ 83.244349 £ 0.835749 £ View $ R12 | -.? 4 $ £ 12816 £ 11016.724531 £ 0.838463 View View $ DS | C24 | N28 $ £ 1856 £ 57.570055 £ 0.838602 £ View $ V11 | A21 $ £ 6489 £ 4695.336767 £ 0.839384 £ View $ T9 | T12 | R21 $ £ 2344 £ 558.736361 fc 0. 
r '758 £ View $ G10jL13 | W19 | Q24 $ £ 1805 £ 21.432425 £ 0.841035 View View $ S4JX10 $ £ 2670 £ 889.423177 £ 0.841523 £ View $ Q9 | 19 | -23 | -í24 | V2S $ £ 1781 £ 0.559631 £ 0.841545 fc See $ N12 | W19 | N24 $ £ 1920 £ 140.804894 fc 0.841748 fc See $ H4 | R9 | T12 S £ 1843 £ 63.837063 £ 0.841754 fc See $ - U0 | S12 | S19 | Q24 $ £ 1775 £ 2.690479 £ 0.842869 £ View $ M13 | K17 | T18 $ £ 2153 £ 386.143038 £ 0.843755 £ View $ R17JT21J031 $ £ 1850 £ 91.352915 £ 0.845085 £ View $ SS | X24 $ £ 2047 £ 293.960536 £ 0.845991 £ View - $ Y4 | Q9 | R10 | T11 | X.19 | -23 | R24 $ £ 1749 £ 0.003580 £ 0.846644 £ View £ $ Il | E3 | I4JAS | E6 | V7 | Q | | D9 | -10 | Y12 | T13 | M16 | -18 | 19 | R20 | S21 | M22 | l-23¡ - 24jR2Sf? 2t £ $ DSJQ24 $ £ 27S9 £ 1018.937453 £ 0.848081 £ See S Tl | R2 | P3 | N5 | N6 | T7 | R8 | G14 | P15 | G16 | Y20 | T22 | X25 | I26 | G27 | O28 | I29 | R30 |? 32 fc l. $ K9 | H12 | R17 $ £ 1894 fc 166.406797 £ 0.8S0079 £ View $ V11 | Q17 | D24 $ £ 1812 £ 93.103584 £ 0.851467 £ View $ Al | Tho $ £ 2729 £ 1013.253017 fc 0.851968 £ View $ H12 | A18 | Q31 | H33 $ £ 1795 fc 85.511085 £ 0.852963 fc View 5 19JS22JV26 $ £ 1757 £ 49.527891 £ 0.8S3283 £ View $ R9 (N12 | H13 $ £ 2146 £ 444.904483 £ 0.854292 £ View $ Q6 | M13 | W15 | H20 | N22 $ fc 1695 £ 0.005297 fc 0.85S2S6 fc See S Y4 | T11 | 19 $ £ 2355 £ 661.700054 £ 0.855524 fc See $ X19 | -23 | V2S $ fc 1827 fc 134.704350 fc 0.8SS682 fc See $ T9 | V18 | R21 | K31 | Y33 S £ 1692 £ 1.781938 £ 0.856009 £ View $ S15 | -21 | -22 | -24 $ £ 1664 £ 0.060460 £ 0.860125 £ View $ X12 | N24 $ £ 2272 £ 614.028472 £ 0.861054 £ View $ A1 | H11 | T18 $ £ 1713 fc 55.464572 £ 0.861122 £ View $ V20 | X22f-24 | K25 | N2ß | M29 $ fc 1657 £ O.OOOOOS £ 0.861205 £ View S H9 | N28 $ £ 2094 fc 448.807779 £ C.B63034 £ View S A21 | Q31 | H33 $ £ 3220 and 1574.859557 fc 0.863042 í See $ Q12JR13JV20 | 122.K21 | -26 | N2ßj - 2.> S fc 1645 fc 0.000 000 fc 0.863064 4 '.cr S I-3JT9 S fc 3676 fc 2031.-24C46 4 0.863083 4 See S D241K31 S fc 
12967 fc 11324.565776 fc 0.863460 fc .cr $ L13ÍK31 | Y33 S fc 2465 fc 827.871734 £ 0.864278 £ See S L19 | T2lj-23 S fc 1949 fc 312.393933 4 0.864436 4 View S G4 | E9 | R10 | S12] I20 [R22 | Q24 S fc 1633 4 0.000021 4 0.964913 4 See S G4 | W15 | A24 S & 1765 & 1J3.349343 4 0.865121 fc See $ Tl | R2 | P3lN5 | N6 | T7 | Ra | G14lP15 | G16 | Y20 | T22 | G23 | I2S | I26 | G27! -. 28¡12í, ¡R30 | A32 S £ S N4 | R17 |? L8 S £ 2464 £ 833.718030 fc 0.865331 £ See fc S Y4 | TS | K6 | H9 | -101-ll | -12¡-13 | R14 | A15tOi6 | G17 | R18lA191V.201 21 | T23¡C24l, -'26 -. fc S M13¡S17lI19 S fc 1697 fc 69.061642 4 0.865691 fc See i 5 I13JR17 | 0-1 S h 285 'fc 1234.2C8538 k 0.866479 View 4 S SltVU¡V25 S £ 1725 fc 114.913068 fc 0.868418 fc See fc S Q9l-- 19 | -23 S £ 236- 'fc 7S8.375850 i. 0.868641 fc \ cr fc S A12 | V19 | R24 $ 4 1625 fc 17.163947 fc 0.868764 fc See £ S X8 | P9 | X13 | X31 S 4 1606 fc 0.000107 fc C.869040 fc See fc S P12 |? L4 | S17] W19lF20 í í 1605 £ 0.078275 i 0.869203 fc See 4 5 EllKS | T12¡-.20 | Y22 | G24 S i 1602 fc 0.000007 fc 0.86S647 fc See 4 S Tll | 19 | -23 S 4 2366 fc 770.136365 fc 0.870576 fc See fc $ I1 | R9 | I12 S fc 1615 i. 
19.398937 fc 0.870616 fc See fc S H3 | R12 S £ 2021 £ 425.578926 £ 0.870643 £ See fc $ I1JX12 $ £ 1939 fc 345.480880 fc 0.870930 £ View 4 S R9 | S15 | -21 | -22 | -23 | -24 S fc 1S92 4 0.000188 4 0.671160 fc \ cr 4 S A12 | T18 (V19 | R24 S 4 1583 4 0.942061 4 0.872657 4 er £ S W1S | Q24 | E28 $ £ 1596 fc 18.677870 fc 0.873368 £ See £ S Y12 | V19 | Q24 | R31 $ i 1578 fc 0.826255 fc 0.873390 fc See 4 S | s4 | lS { D6 | I7 | Q8 | E9 | -10 | M16 | A17 | -; 9 | W19 | S2i | M22 | L241G25 | G261T2 1S28 | S29 | A3 HIV outpnt: 6/6 330 0.85 | fc $ G8) - 13 | A14 | G18 | H20 S £ 1575 fc 0.000857 £ 0.873716 fc See 331 0.85 I & amp; $ M11 | R12 | T18 $ fc 1784 fc 214.819937 4 0.874586 fc See 332 0.86 j fc S Y4jI8 | Q9 | Tll¡Y19 | I23 | S24 | V2S $ & 1569 £ 0.000000 £ 0.874613 fc View 333 0.86 j fc S S4¡K7 | T9 | A14 | A24 | V32 S £ 1568 £ 0.000391 £ 0.874763 4 See 334 0.86 j £ S G10 | V19 | R24 | V26 | L31 S £ 1567 £ 0.022878 £ 0.874915 4 View 335 0.86 £ S A1B | T21 | H33 $ fc 1950 £ 386.091755 £ 0.875373 £ View 336 0.87 £ S T9 | V18 | R21 | K31 S t 1595 £ 32.945102 £ 0.875649 fc See 337 0.87 fc S N12 | N28 | E31 $ £ 1644 £ 84.308230 £ 0.876001 £ View 338 0.87 | £ $ N4 | K9 | F19 | G23 $ £ 2413 £ 853.445657 fc 0.876021 £ View 339 0.87 £ $ S15 | -2l | -22 | -23 | -24 S * 1550 fc 0.003354 £ 0.877439 4 See 340 0.88 £ S V15 | P18 | R21 | V23 | V26 £ 1543 fc 0.000396 £ 0.878473 4 View 341 0.88 £ S V11JR12 | A21 S £ 1677 £ 139.056915 fc 0.879218 fc See 342 0.88 £ $ S4fK31 | Y33 $ fc 2429 £ 891.874986 £ 0.879339 £ See 343 0.88 £ $ Y4 | Q9 | T11 | I-19 $ £ 1566 £ 32.897928 £ 0.879930 £ View 344 0.89 £ $ A14 | S17 | 19 | F20 S £ 1534 £ 1.410979 4 0.880005 4 View 345 0.89 £ S G4 | R9 | F20 | T26 $ £ 1540 £ 12.834655 £ 0.880800 £ See 346 0.89 £ $ Y6 | X10 | X12 | X18fX19 | R31 S £ 1525 £ 0.000000 £ 0.881117 4 View 347 0.89 £ $ I-1 | S4 | M13 | W1S £ 1525 £ 0.559824 £ 0.8B1199 £ View 348 0.90 £ $ N4 | 9 | A21 | H33 S £ 1568 £ 48.227041 £ 0.881881 £ See 349 0.90 fc S T9 | V11 | R22 S fc 
1769 & 253.858213 fc 0.882556 fc See 350 0.90 4 S Y6 | G8 | R10 | L11 | S121V20 | R23 | K24 S fc 1515 & 0.000000 i 0.882576 £ View 351 0.90 fc S X4JK31 $ 4 1926 fc 418.267274 6 0.883632 fc See 352 0.91 £ S P12 | D23 | -.24 S £ 1623 fc 115.955006 4 0.883732 £ View 353 0.91 fc S Q9 | K23 S fc 4487 fc 2986.952760 4 0.984744 fc View 354 0.91 4 $ G4 | R9 | M13 | W15 $ fc 1511 4 14.154263 4 0.885206 & See 355 0.91 4 $ R2lP3 | N5 | N6 | T7 | R8 | ll3iG14jP15 | G16lF19 | Y20 | T22 | G23 | I25 | l26 | G27 | I29 | R30 | A32 $ 356 0.92 4 $ V13 | W15 | 19 $ fc 1573 £ 83.913345 £ 0.S86323 4 See 357 0.92 £ $ P12JS30 S £ 2786 £ 129B.604637 £ 0.886566 £ View 358 0.92 fc S Vl | R12 | T18 | N23 | «24 |« 25 | «26 |« 27 | * 2B | * 29 | * 30 | «3l! * 32 | * 33 S 4 1487 £ C.000000 359 0.93 £ $ Q17 | D24 | Y33 $ £ 1608 £ 121.315232 £ 0.886668 £ View 360 0.93 £ $ E9 | R12 | T18 $ £ 1614 £ 133.600703 £ 0.887568 £ View 361 0.93 £ $ G4 | R12 | Tlß $ £ 2078 £ 597.832556 £ 0.887601 £ View 362 0.93 4 $ H4JP12 $ fc 2777 fc 1298.604637 £ 0.887855 £ See 363 0.94 £ S T1 | R2 | P3 | N5 | N6 | T7 | R8 | G14 | G16 | Y20) T22 | G23 | I25 | I26 | G27 | I29 | R30 | A32 S 4 1925 & 364 0.94 £ $ S4 | T9 | N12 | V18 | R21 £ £ 1474 £ 1.158008 £ 0.888647 £ View 365 0.94 £ $ W19 | K24 | T26 S £ 1724 £ 252.469231 fc 0.888834 fc See 366 0.94 fc S A1 | E9 $ £ 2089 fc 630.489943 fc 0.890631 fc See 367 0.95 4 S A1 | G4! F20 | A24 S fc 1455 4 2.044033 4 0.891465 4 See 368 0.95 fc S T9 | C12j-22 | G24 S 4 1450 4 0.214607 4 0.891912 & See 369 0.95 fc S? 4 | Q9 | T11 | Y19 | W20 | -23 | N24 S fc 1447 4 0.000061 fc 0.892304 4 \ cr 370 C.95 4 S G10) M13 | A24 | E31 S 4 1446 fc 2.25S1S8 4 C.8S2845 4 See 371 0.96 4 $ S10 | D24 | I26 $ 4 2229 4 789.361165 4 0.393336 4 See 372 0.96 4 $ G4 | K13 | W15 S fc 1691 fc 252.133281 4 0.393444 £ see 373 0.96 fc S N12 | E31 $ £ 2929 £ 1492.835945 £ 0.893S2? 4 See 374 0.96 fc S T12 | F13 | A14 | N28 S £ 1436 4 2.983773 -. 0.894262 á See 375 0.97 £ S S4 | T12jVlà | | R2l! 
K31 S 4 1434 4 1.7106SB fc 0.894363 4 Ve. 376 0.97 4 $ S8 | G10jX24 S 6 1444 4 16.637502 4 0.895049 4 See 377 0.97 4 S Q9 | Tll | L19 | -23 | K24 | R3. S i 1427 fc 0.051495 4 0.8951C6 4 \ cr 378 C.97 fc S M13 | Sl? | N2? S fc 1661 4 239.34 ^ 729 fc 0.B95842 4 See 379 0.98 fc S R101Y12 | V19 | E23 | Q24 S 4 1420 fc 0.021913 fc C.8960 4 4 See 380 0.98 4 $ R12II13 S fc 2328 fc 90S.283714 fc 0.396110 £ See 381 0.98 4 S G10ÍI20¡A22 | T231K24 S fc 1415 fc 0.007673 f. 0.396763 s- .er 382 0.98 < £ S Q9 | V18lK23 S £ 1572 and 162.149493 £ = 0 897472 fc .er 383 0.99 1 £ S R17 | T21 | H33 S 4 1486 £ 82.159457 -. 0.898299 4 See 384 0.99! fc S T9 | S18 | K30 S 4 142S fc 25.486414 fc C.898892 fc ci 385 0.99 1 4 $ G8 | A14 | G18 | H20 S 4 1399 4 0.016110 40.898964 4 \ r 386 0.99 < £ S T12lS15 | H20 | I22 | E23 | X24 $ £ 1393 and 0.000820 4 0.09978: 4 •.: - • 387 1.00 £ £ IlIY4¡-22 | S23 | R24 $ fc 1393 £ 0.040715 fc 0.899788 fc View
APPENDIX D
File probsort.pl: 1/1
$fm = $ARGV[0];
# read the input table and keep only the rows that end in \cr
open(IN, $fm); @prob = <IN>; chomp @prob; close(IN);
@prob = grep(/cr/, @prob);
open(TEMP, ">probsort.temp");
foreach (@prob) { print TEMP "$_\n"; }
close(TEMP);
# exit;
$f = $fm . ".prob";
# print "fm: $fm\n";
# numeric sort of the filtered rows (the sort key options are illegible in the source)
`sort -o prob.tmp -n01234567890. «-9 probsort.temp`;
`rm probsort.temp`;
open(IN, "prob.tmp"); @m2 = <IN>; chomp @m2; close(IN);
`rm prob.tmp`;
# rewrite the input file with a rank and a cumulative fraction in front of each row
open(TEMP, ">$fm");
$total = scalar @m2; $i = 0;
foreach (@m2) { printf TEMP "%3d | %.2f | %s\n", $i, ($i / $total), $_; $i++; }
close(TEMP);
APPENDIX E A [GAP43 RATGAP43] A [ODC RATODC] B [nAChRa4 RATNARAA] 0.00000 D [nAChRd RNZCRDl] D [nAChRe RNACRE] A [CNTFR S54212] A [PTN RATHBGAM] BfFGFRRATFGFRl] B [TGFR RATTGFBIIR] Dflnsl RNINS1] A [cyclin A RATPCNA] A [H2AZ RATHIS2AZ] B [cjun KNRJG9] A [T CP (II)] F [actin RNAC01] A [CC01 ATMTCYTOC] A [CCQ2 RATMTCYTOC] A [SC1 RNU19135] A [DD63.2 (II)] A [GAP43 RATGAP43] A [GAD65 RATGAD65] A [GRg2 (#)] 0.00000 B [-?
AC - Ra4 RATNARAA] B [FGFR RATFGFR1] B [cjun RNRJG9] A [CC02 RATMTCYTOC] A [GAP43 RATGAP43] FfMFM RATNFM] A [G67I80 / 86 RATGAD67] 0.00000 C [mGiuR2 RATMGLI-IRB] B [nAChRa4 RATNARAA] B [nAChRa5 RATNACHRR] BfFGFR RATFGFRl] A [IGF I RATIGFIA] B [cjun RNRJG9] B [SOD RNSODR] A [CC02 RATMTCYTOC] B [NMDA2D RNU08260] D [nAChRc RNACRE] C [mAChR4 RATACHRMD] 0.00000 B [EGF RATEPGF] B [TGFR RATTGFBIIR] B [G67I80 / 86 RATGAD67] A [SOD RNSODR] B [SC7 RNU19141] 0.00000 A [MAP2 RATMAP2] A [GAP43 RATGAP43] B [L1 S55536] 0.00000 A [synaptophys-n RNSYM] Afneno RATENONS] F [GATI RATGABAT] B [ChAT (*)] A [ODC RATODC] B [NOS RRBNOS] A [GRa2 (D] A [GRa3 RNGABAA] A [GRa5 (#)] A [GRb3 RATGARB3] A [GRg3 RATGABAA] B [mGluR3 RATMGLURq C [mGluR8 MMU17252] B [NMDA2B RATNMDA2B] B [nAChRa4 RATNARAA] D [nAChRa6 RATNARA6S] D [nAChRd RNZCRD 1] B [5HTlb RAT5HT1BR] 'Aftr B RATTRKB1] A [CNTFR S54212] A [MK2 MUSMK] A [PTN RATHBGAM] BfFGFR RATFGFR1] D [Ipsl RNINS1] B [IGF II RATGFI2] A [IP3R2 RNITPR2R] Afcyclin A RATPCNA] A [H2AZ RATHIS2AZ] B [cjun RNRJG9] BtBrm ( II)] A [TCP (II)] F [actin RNAC01] AfCCOl RATMTCYTOC] A [CC02 RATMTCYTOC] A [SC1 RNU19135] D [SC6 RNU19140] A [DD63.2 (II)] DfGFAP RNU03700] D [GRb2 RATGARB2] D [NMDA2C RATNMCA2C] 0.00000 A [NT3 RATHDNFNT] B [CNTF RNCNTF] D (bFGF RNFGFT] C [PDGFb RNPDGFBCP] B [PDGFR RNPDGFRBE] A [cyclin B RATCYCLNB] Cfcfos RNCFOSR] F [cellubrevin s63830] D [G67I86 RATGAD67] B [IGF I RA? GFIA] 0.00000 BflnsR RATINSAB] A [GAD67 RATGAD67] C [mGluR6 RATMGLUR6.] C [mAChR3 RATACHRMB] 0.00000 B [5HT2 RATSR5HT2] B [Ins2 RN INS2] Ffcellubrcvin s63830] D [mGluR6 RATMGLUR6.] 
D [5HT3 MOUSESHT3] 0.00000 B [InsR RATINSAB] A [SC2 RNU19136] A [ncstin RATNESTIN] B [TH RATTOHA] C [mAChR4 RATACHRMD] 0.00000 B [CNTF RNCNTF] B [ EGF RATEPGF] Afncstin RATNESTIN] B [TH RATTOHA] C [NGF RNNGFB] 0.00000 A [MK2 MUSMK] B [IGF II RATGFI2] B [Brm (II)] AfODC RATODC] D [nAChRd RNZCRDl] D [nAChRe RNACRE] 0.00000 C [NGF RNNGFB] D [trk RATTRKPREC] A [CNTFR S54212] A [MK2 MUSMK] A { PTN RATHBGAM] B [TGFR RATTGFBpR] DP - S1 RNINS1] BtIGF II RATGFI2] A [cyclin A RATPCNA] AtH2AZRATHIS2AZ] B [Brm (II)] AfTCP (II)] F [actin RNAC01] A [CC01 RATMTCYTOC] A [ SC1 RNU19135] A [DD63.2 (II)] A [GAP43 RATGAP43] F [NFM RATNFM] C { mGIuR2 RATMGLURB] 0.00000 B [nAChRa4 RATNARAA] B [nAChRa5 RATNACHRR] Bf-rkC RATTRKCN3] B [FGFRRATFGFRI] Bfcjun RNRJG9] A [CC02 RATMTCYTOC] B [TH RATTOHA] A [MK2 MUSMK] B [IGF II RATGFI2] 0.00000 B [Bpn (II)] D [GluRl RATGPCGR] D [mGluR4 RATMGLUR4B] D [nAChRa2 RATNNAR] 0.00000 AIEGFR RATEGFR] AflGFRl RA? GFI] AfIGFR2 MU04710 ] F [NFL RATNFL] D [mGluR4 RATMGLUR4B] D [GluR6 RATMGLUR6.] 0.00000 D [nAChRa2 RATNNAR] D [5HT3 MOUSE5HT3] A { IGFR1 RA? GFI] A [SC2 RNU19136] D [MOG RATMOG] B [GRal (#)] D [mG-uR- RATGPCGR] 0.00000 D [mGluR4 RATMGLUR4B] D [pAChRa2 RATNNAR] AfEGFR RATEGFR] A [IGFR MMU047I0] C [IP3R3 RATIP3R3X] A [GAP43 RATGAP43] F [NFM RATNFM] B [nAChRa4 RATNARAA] 0.00000 BCuAChRa5 RATNACHR] BfFGFR RATFGFRl] A [IGF I RATIGFIA] Bfcjun RNRJG9] A [CC02 RATMTCYTOC] Afceliubrev-n s63830] A [GRbl RATGARB1] A [IGF I RA? GFIA] 0.00000 A [CRAF RATRAFA] B [IP3R1 RATI145TR] B [eratin RNKER19] A [ccllubrevin s63830] B { TH RATTOHA] 0.00000 B [CNTFRNCNTF] AFIGF I ATIGFIA] A [InsR RATINSAB]

Claims (54)

  1. A coincidence detection method for use with a data set of objects, the objects having a number of attributes, the method characterized in that it comprises the steps of:
    • sampling a subset of the data set for a predetermined number of iterations, the sampled subset in each iteration having the same subset of attributes for each object;
    • detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidences in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidences in other subsets;
    • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording;
    • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
    • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
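The steps of claim 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the sampling with replacement, the restriction to coincidences shared by all r sampled rows, and the observed/expected ratio used as the correlation measure are all hypothetical choices made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def count_coincidences(matrix, r, iterations, k=2, seed=0):
    """Count k-sets of (column, value) pairs shared by all r sampled rows."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    observed = Counter()
    for _ in range(iterations):
        # Assumption: rows are sampled with replacement, which keeps the
        # expected count below exact under independence.
        rows = rng.choices(matrix, k=r)
        # Columns in which every sampled row carries the same value.
        shared = [(c, rows[0][c]) for c in range(n_cols)
                  if all(row[c] == rows[0][c] for row in rows)]
        for combo in combinations(shared, k):
            observed[combo] += 1
    return observed


def expected_count(matrix, combo, r, iterations):
    """Expected count if the columns were independent: T * prod(p_j ** r)."""
    m = len(matrix)
    prob = 1.0
    for col, value in combo:
        p = sum(1 for row in matrix if row[col] == value) / m
        prob *= p ** r
    return iterations * prob


def correlated_tuples(matrix, r, iterations, k=2, threshold=2.0, seed=0):
    """Report k-tuples whose observed/expected ratio exceeds the threshold."""
    observed = count_coincidences(matrix, r, iterations, k, seed)
    report = {}
    for combo, count in observed.items():
        ratio = count / expected_count(matrix, combo, r, iterations)
        if ratio > threshold:
            report[combo] = ratio
    return report
```

Calling `correlated_tuples(matrix, r=2, iterations=2000)` on a list of equal-length rows returns a dictionary keyed by tuples of `(column, value)` pairs; a ratio near 1 indicates that the attribute values co-occur about as often as independence would predict.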
  2. The coincidence detection method according to claim 1, characterized in that the comparison of observed and expected counts is calculated using a Chernoff bound on the tail probabilities.
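As an illustration of how a Chernoff bound can turn an observed and an expected count into a tail probability, the sketch below uses the standard Chernoff-Hoeffding form for a binomial count, P(X ≥ an) ≤ exp(−n·D(a‖p)), where D is the Kullback-Leibler divergence between Bernoulli distributions. The claim does not say which form of the bound is used, so this particular form and the function name are assumptions.

```python
import math


def chernoff_upper_tail(n, p, observed):
    """Upper bound on P(X >= observed) for X ~ Binomial(n, p).

    n is the number of sampling iterations, p the per-iteration probability
    of the coincidence under independence (so n * p is the expected count).
    """
    a = observed / n
    if a <= p:
        return 1.0  # the bound is only informative above the expectation
    if p <= 0.0:
        return 0.0  # an event of probability zero was observed
    # Kullback-Leibler divergence D(a || p) between Bernoulli(a) and Bernoulli(p)
    kl = a * math.log(a / p)
    if a < 1.0:
        kl += (1.0 - a) * math.log((1.0 - a) / (1.0 - p))
    return math.exp(-n * kl)
```

A small bound means the observed count is very unlikely under independence, so one possible (hypothetical) correlation measure is 1 minus this bound, compared against the predetermined threshold.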
  3. The coincidence detection method according to claim 1, characterized in that the counts are recorded by storing a running total of the count of each coincidence over all the sampled subsets.
  4. The method according to claim 1, characterized in that the objects correspond to sales transactions, each transaction comprising one or more products purchased, and the attributes correspond to the instances of sale of particular products or types of products.
  5. The method according to claim 1, characterized in that the objects correspond to points selected in time and the attributes correspond to the state of the elements in a system.
  6. The method according to claim 1, characterized in that the objects correspond to points selected in time and the attributes correspond to prices, or price changes, of financial instruments or facilities.
  7. The method according to claim 1, characterized in that the steps of the method are represented by the following pseudo-code:
    0. begin
    1.   read(MATRIX);
    2.   read(R, T);
    3.   compute_first_order_marginals(MATRIX);
    4.   csets := {};
    5.   for iter = 1 to T do
    6.     sampled_rows := rsample(R, MATRIX);
    7.     attributes := get_attributes(sampled_rows);
    8.     all_coincidences := find_all_coincidences(attributes);
    9.     for coincidence in all_coincidences do
    10.      if cset_already_exists(coincidence, csets)
    11.        then update_cset(coincidence, csets);
    12.        else add_new_cset(coincidence, csets);
    13.      endif
    14.    endfor
    15.  endfor
    16.  for cset in csets do
    17.    expected := compute_expected_match_count(cset);
    18.    observed := get_observed_match_count(cset);
    19.    stats := update_stats(cset, hypoth_test(expected, observed));
    20.  endfor
    21.  print_final_stats(csets, stats);
    22. end
  8. The coincidence detection method according to claim 1, characterized in that it further comprises the step of representing the objects and attributes in a matrix of objects against attributes before sampling the data set, the data set being sampled by sampling the matrix.
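The matrix representation of claim 8 can be illustrated with the sales-transaction case of claim 4. The sketch below is a minimal example, assuming each transaction is given as a set of product names; the function name and the 0/1 encoding are hypothetical choices.

```python
def transactions_to_matrix(transactions):
    """Build an objects-by-attributes matrix from sales transactions.

    Rows are transactions (objects); columns are products (attributes);
    an entry is 1 if the product was sold in that transaction, else 0.
    """
    products = sorted({item for basket in transactions for item in basket})
    matrix = [[1 if item in basket else 0 for item in products]
              for basket in transactions]
    return products, matrix
```

Sampling the data set then amounts to drawing rows of `matrix`, and a coincidence is a set of columns that carry the same value in the sampled rows.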
  9. A method, characterized in that it comprises: • the method of claim 1, and the additional step of: • applying rules that are defined by the reported correlated attributes.
  10. The method according to claim 1, characterized in that the objects correspond to compounds and at least some of the attributes correspond to particular chemical moieties.
  11. The method according to claim 1, characterized in that the objects correspond to peptides or proteins and at least some of the attributes correspond to particular structural or sub-structural patterns or motifs.
  12. The method according to claim 1, characterized in that the objects correspond to selections from the group consisting of compounds, molecular structures, nucleotide sequences and amino acid sequences and the attributes correspond to characteristics of the selected objects.
  13. The method according to claim 1, characterized in that the objects correspond to points selected in time and the attributes correspond to biological parameters of genes or gene products.
  14. The method according to claim 1, characterized in that the objects correspond to documents that are electronically stored and/or electronically indexed and the attributes correspond to the subjects.
  15. The method according to claim 1, characterized in that the objects correspond to customers and at least some of the attributes correspond to the products purchased or not purchased by those customers.
  16. The method according to claim 15, characterized in that at least some of the attributes correspond to consignments made or not made to the customers.
  17. The method according to claim 1, characterized in that at least some of the objects correspond to products and at least some of the attributes correspond to customers who have or have not purchased those products.
  18. The method according to claim 17, characterized in that at least some of the attributes correspond to the demographic variables of the customers.
  19. The method according to claim 1, characterized in that the objects correspond to persons with a particular disease or condition and the attributes correspond to potential contributing factors for the disease or condition.
  20. The method according to claim 1, characterized in that the objects correspond to persons with a number of different diseases or conditions and the attributes correspond to potential contributing factors for those diseases or conditions.
  21. The method according to claim 1, characterized in that at least some of the objects correspond to factors that potentially contribute to a disease or condition and the attributes correspond to people with or without these factors, where the method associates groups of people of substantially equivalent risk for the disease or condition.
  22. The method according to claim 1, characterized in that the objects correspond to points selected in time and at least some of the attributes correspond to the state of the components in a system at points in time before the system fails, where the method associates the component states that can potentially cause the system to fail.
  23. The method according to claim 1, characterized in that it further comprises the steps of first creating a database of transitions between states of the system, wherein a state of the system is represented by a value of a state variable, over a selected quantum of time, and presenting the database, in whole or in part, as a data set so that each state-to-state transition corresponds to one of the M objects and each state variable corresponds to an attribute.
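One way to picture the transition database of claim 23 is sketched below. It assumes the system history is available as a list of state snapshots, one per time quantum, each mapping a state-variable name to its value; the function name and the "@t" / "@t+1" naming of the before and after attributes are hypothetical.

```python
def transitions_to_dataset(states):
    """Turn a sequence of state snapshots into one object per transition.

    Each object (row) describes the move from one snapshot to the next;
    every state variable contributes a "before" and an "after" attribute.
    """
    rows = []
    for before, after in zip(states, states[1:]):
        row = {f"{name}@t": value for name, value in before.items()}
        row.update({f"{name}@t+1": value for name, value in after.items()})
        rows.append(row)
    return rows
```

Coincidence detection over these rows would then look for state-variable values that co-occur across transitions more often than expected.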
  24. The method according to claim 1, characterized in that it further comprises the steps of first creating a database of states and actions that cover a selected quantum of time and presenting the database, in whole or in part, as a data set so that each state/action/state triple corresponds to one of the M objects and each state variable or action type corresponds to an attribute.
  25. A coincidence detection method for use with a data set of objects, the objects having a number of attributes, the method characterized in that it comprises the steps of:
    • representing a set of M objects in terms of a number N_A of variables ("attributes"), where an attribute is said to occur in an object if the object has the attribute;
    • sampling a subset of r_i out of the M objects, for each iteration i of a predetermined number of iterations;
    • detecting and recording the coincidences between sets of k of the attributes in each sampled subset of objects, a coincidence being the co-occurrence of k attributes, 1 ≤ k ≤ N_A, in the same h_i out of the r_i objects in the sampled subset, where 0 ≤ h_i ≤ r_i;
    • determining an expected count of coincidences for any set of k attributes and a predetermined number of sampling iterations and coincidence counts as described above, the determination being executed before, at the same time as, or after the sampling and recording;
    • comparing, for any set of k attributes and number of sampling iterations and coincidence counts, the observed count against the expected count of coincidences, and from this comparison determining a measure of correlation (or association or dependence) for the set of k attributes; and
    • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a set of k of the N_A attributes that has been determined by this process to have a value for a selected correlation measure above a predetermined threshold value.
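Claim 25 allows a coincidence to occur in h of the r sampled objects rather than in all of them. The sketch below shows one way an expected count could be computed in that setting. It rests on several assumptions that the claim does not state: attributes are treated as independent, rows are sampled with replacement, r is the same in every iteration, and a coincidence is counted when at least h of the r rows carry all k attribute values. The function name is hypothetical.

```python
from math import comb, prod


def expected_coincidences(marginals, r, h, iterations):
    """Expected number of iterations with a coincidence in at least h of r rows.

    marginals holds, for each of the k attributes, the fraction of the M
    objects that carry the attribute value in question.
    """
    # Probability that a single sampled row carries all k values,
    # if the attributes were independent.
    q = prod(marginals)
    # Binomial tail: at least h of the r sampled rows carry all k values.
    tail = sum(comb(r, j) * q ** j * (1.0 - q) ** (r - j)
               for j in range(h, r + 1))
    return iterations * tail
```

With h equal to r this reduces to iterations · q^r, the case in which every sampled row must carry the whole k-tuple.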
  26. The coincidence detection method according to claim 25, characterized in that r_i is the same for each iteration.
  27. The method according to claim 25, characterized in that the numerical correlation values are reported together with the set of k-tuples of correlated attributes.
  28. A method for the visual exploration of a data set of objects, the objects having a number of attributes, the method characterized in that it comprises the steps of:
    • sampling a subset of the data set for a predetermined number of iterations, the sampled subset in each iteration having the same subset of attributes for each object;
    • detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidences in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidences in other subsets;
    • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording;
    • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
    • reporting a set of k-tuples of correlated attributes to a user through a graphical interface, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
  29. A preprocessing method for use with a data modeling unit to capture and report to the data modeling unit the higher-order interactions of a data set of objects, the objects having a number of attributes, the method characterized in that it comprises the steps of:
    • sampling a subset of the data set for a predetermined number of iterations, the sampled subset in each iteration having the same subset of attributes for each object;
    • detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
    • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording;
    • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
    • reporting to the data modeling unit a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
  30. A correlation elimination method for use with a data set of objects, the objects having a number of attributes, the method characterized in that it comprises the steps of:
    • sampling a subset of the data set for a predetermined number of iterations, the sampled subset in each iteration having the same subset of attributes for each object;
    • detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
    • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording;
    • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
    • eliminating a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
  31. A coincidence detection system for use with a data set of objects, each object having a plurality of attributes, the system characterized in that it comprises:
    • means for sampling a subset of the data set for a predetermined number of iterations, the sampled subset in each iteration having the same subset of attributes for each object;
    • means for detecting and recording coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidences in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets;
    • means for determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording;
    • means for comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a measure of correlation for the plurality of attributes for the coincidence; and
    • means for reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the measure of correlation is above a respective predetermined threshold.
  32. The coincidence detection system according to claim 31, characterized in that the means of the system in the aggregate carry out a method represented by the following pseudo-code:
    0. begin
    1.   read(MATRIX);
    2.   read(R, T);
    3.   compute_first_order_marginals(MATRIX);
    4.   csets := {};
    5.   for iter = 1 to T do
    6.     sampled_rows := rsample(R, MATRIX);
    7.     attributes := get_attributes(sampled_rows);
    8.     all_coincidences := find_all_coincidences(attributes);
    9.     for coincidence in all_coincidences do
    10.      if cset_already_exists(coincidence, csets)
    11.        then update_cset(coincidence, csets);
    12.        else add_new_cset(coincidence, csets);
    13.      endif
    14.    endfor
    15.  endfor
    16.  for cset in csets do
    17.    expected := compute_expected_match_count(cset);
    18.    observed := get_observed_match_count(cset);
    19.    stats := update_stats(cset, hypoth_test(expected, observed));
    20.  endfor
    21.  print_final_stats(csets, stats);
    22. end
  33. The coincidence detection system according to claim 31, characterized in that the means for sampling a subset of the data set comprises means for dividing the data set into subsets for sampling.
  34. The coincidence detection system according to claim 33, characterized in that the means for detecting and recording the coincidence counts comprise an array of processing nodes, each processing node detecting and recording a respective sub-count of coincidences, and in that the means for comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences comprises means for merging the sub-counts to provide the observed count.
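The division of counting work across processing nodes described in claim 34 can be sketched as follows. The example runs the "nodes" as ordinary function calls in one process, which is an assumption made to keep the sketch self-contained; the function names are hypothetical, and each sample is assumed to arrive as a list of hashable coincidence keys.

```python
from collections import Counter


def count_on_node(samples):
    """Sub-count kept by one processing node.

    samples is the list of sampled subsets assigned to the node, each
    already reduced to the coincidences found in it.
    """
    sub_count = Counter()
    for coincidences in samples:
        sub_count.update(coincidences)
    return sub_count


def merge_sub_counts(sub_counts):
    """Merge the per-node sub-counts into the observed count."""
    observed = Counter()
    for sub_count in sub_counts:
        observed.update(sub_count)
    return observed
```

Because the merge is a plain sum, the same function could merge counts from a nested sub-array of nodes before the result is passed up to the top level.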
  35. The coincidence detection system according to claim 34, characterized in that at least one of the processing nodes comprises a respective sub-array of processing nodes that detects and records respective sub-sub-counts of coincidences, and in that the means for merging merge the sub-sub-counts to provide the sub-counts and/or the observed count.
  36. The coincidence detection system according to claim 34 or 35, characterized in that each processing node comprises memory that includes an input buffer for storing the subsets received from the data set and an output buffer for storing the sub-counts or sub-sub-counts; and a bus that transfers data to and from the memory.
  37. Programmed means for coincidence detection for use with a computer and with a data set of objects having a number of attributes represented in a matrix of objects versus attributes, the programmed means comprising: a computer program stored on storage media compatible with the computer, the computer program containing instructions for directing the computer to: • sample a subset of the data set for a predetermined number of iterations, in each iteration the sampled subset of the data set having, for each object, the same subset of attributes; • detect and record the coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determine an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • compare, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determine a correlation measure for the plurality of attributes for the coincidence; and • report a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
  38. A coincidence detection system for use with a data set of objects having a number of attributes, the system characterized in that it comprises: a computer; and a computer program on media compatible with the computer, the computer program directing the computer to: • sample a subset of the data set for a predetermined number of iterations, in each iteration the sampled subset of the data set having, for each object, the same subset of attributes; • detect and record the coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determine an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • compare, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determine a correlation measure for the plurality of attributes for the coincidence; and • report a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
  39. A product that has a set of attributes selected by: • sampling a subset of a data set that represents objects versus attributes for a predetermined number of iterations, in each iteration the sampled subset having, for each object, the same subset of attributes; • detecting and recording the coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
  40. A product defined by the application of a set of rules generated from: • sampling a subset of the data set that represents objects versus attributes for a predetermined number of iterations, in each iteration the sampled subset having, for each object, the same subset of attributes; • detecting and recording the coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the data set, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
  41. A product that is defined by its interaction with a set of attributes selected by: • sampling a subset of the data set that represents objects versus attributes for a predetermined number of iterations, in each iteration the sampled subset of the data set having, for each object, the same subset of attributes; • detecting and recording the coincidence counts in each sampled subset of the data set, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
  42. A coincidence detection method for use with a data set of objects having a number of attributes represented in a matrix of objects versus attributes, the method characterized in that it comprises the steps of: • sampling a subset of the matrix for a predetermined number of iterations, in each iteration the sampled subset of the matrix having, for each object, the same subset of attributes; • detecting and recording the coincidence counts in each sampled subset of the matrix, a coincidence being the co-occurrence of a plurality of attribute values in one or more objects in a sampled subset of the matrix, where the plurality of attribute values is the same for each occurrence, the detection and recording of coincidence counts in each sampled subset being executed before, at the same time as, or after the sampling, detection and recording of coincidence counts in other subsets; • determining an expected count for each coincidence of interest, the determination being executed before, at the same time as, or after the sampling, detection and recording; • comparing, for each coincidence of interest, the observed count of coincidences against the expected count of coincidences, and from this comparison determining a correlation measure for the plurality of attributes for the coincidence; and • reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated attributes is a plurality of attributes for which the correlation measure is above a respective predetermined threshold.
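The comparing and reporting steps recited in these claims can be sketched in Python. The claims do not fix a particular correlation measure; this sketch assumes, purely for illustration, a z-score of the observed count against the expected count under a binomial model over `t` sampling iterations. The name `report_correlated_ktuples` and the input layout `{ktuple: (observed, expected)}` are hypothetical.

```python
import math


def report_correlated_ktuples(counts, t, threshold):
    """Return (ktuple, z) pairs whose z-score exceeds the threshold.

    counts maps each k-tuple to (observed, expected); t is the number
    of sampling iterations. The z-score is an assumed correlation
    measure, not one prescribed by the claims.
    """
    reported = []
    for ktuple, (observed, expected) in counts.items():
        # Binomial variance t*p*(1-p), written with expected = t*p.
        variance = expected * (1.0 - expected / t)
        if variance <= 0.0:
            continue  # degenerate expectation; no meaningful score
        z = (observed - expected) / math.sqrt(variance)
        if z > threshold:
            reported.append((ktuple, z))
    # Strongest candidates first.
    return sorted(reported, key=lambda item: item[1], reverse=True)


example = {"ab": (30, 12.5), "cd": (13, 12.5)}
report = report_correlated_ktuples(example, t=200, threshold=3.0)
```

In this made-up example only the k-tuple labelled "ab" is reported, since its observed count lies several standard deviations above expectation while "cd" is close to it.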
  43. Antigens and vaccines that present the covarying k-tuples described herein.
  44. A peptide or peptide mimetic that includes a structural motif of the V3 loop of the HIV envelope protein having the spatial coordinates of residues A18/Q31/H33.
  45. A pharmaceutical composition characterized in that it comprises a ligand that interacts with a protein having a structural motif identified using the method according to claim 1, and a pharmaceutically acceptable carrier or excipient therefor.
  46. The pharmaceutical composition according to claim 45, characterized in that the ligand comprises chemical moieties of suitable identity, spatially located relative to one another such that the moieties interact with corresponding residues or portions of the motif.
  47. The pharmaceutical composition according to claim 46, characterized in that the ligand, by interaction with the motif, interferes with the function of a region of the protein comprising the motif.
  48. A diagnostic agent comprising a ligand that interacts with a protein having a structural motif identified using the method of claim 1, and a detectable label linked to the ligand.
  49. A pharmaceutical composition for interacting with a human immunodeficiency virus (HIV) envelope protein, the envelope protein including a structural motif of the V3 loop having spatial coordinates of residues A18/Q31/H33, the composition comprising a ligand that includes at least one functional group that interacts with the motif, and a pharmaceutically acceptable carrier or excipient therefor.
  50. The pharmaceutical composition according to claim 49, characterized in that the ligand includes at least one functional group capable of binding to residue 18 and present at an effective position in the ligand to bind to residue 18, at least one functional group capable of binding to residue 31 and present at an effective position in the ligand to bind to residue 31, and at least one functional group capable of binding to residue 33 and present at an effective position in the ligand to bind to residue 33.
  51. A method of designing a ligand for interacting with a structural motif of a human immunodeficiency virus (HIV) envelope protein, the method characterized in that it comprises the steps of: providing a template having spatial coordinates of residues A18, Q31 and H33 in the V3 loop of the HIV envelope protein; and computationally evolving a chemical ligand using an effective algorithm with spatial constraints, such that the evolved ligand includes at least one effective functional group that binds to the motif.
  52. The method according to claim 51, characterized in that the ligand comprises: at least one functional group capable of binding to residue 18 and present at an effective position in the ligand to bind to residue 18, at least one functional group capable of binding to residue 31 and present at an effective position in the ligand to bind to residue 31, and at least one functional group capable of binding to residue 33 and present at an effective position in the ligand to bind to residue 33.
  53. A method of identifying a ligand for binding to a structural motif of a human immunodeficiency virus (HIV) envelope protein, the method characterized in that it comprises the steps of: providing a template having spatial coordinates of residues A18, Q31 and H33 in the V3 loop of the HIV envelope protein; providing a database that contains the structure and orientation of molecules; and screening such molecules to determine whether they contain effective moieties spaced apart from one another such that the moieties interact with the motif.
  54. The method according to claim 53, characterized in that a first moiety of the molecule interacts with residue 18, a second moiety of the molecule interacts with residue 31 and a third moiety of the molecule interacts with residue 33.
MXPA/A/1999/008824A 1997-03-24 1999-09-23 Coincidence detection method, products and apparatus MXPA99008824A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US60/041472 1997-03-24

Publications (1)

Publication Number Publication Date
MXPA99008824A (en) 2000-02-02
