WO1996012822A1 - Method for identifying two nucleic acid base code sequences

Info

Publication number
WO1996012822A1
Authority
WO
WIPO (PCT)
Prior art keywords
codes
base
sequence
code
sequences
Prior art date
Application number
PCT/SE1995/001213
Other languages
French (fr)
Inventor
Lennart BJÖRKESTEN
Original Assignee
Pharmacia Biotech Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pharmacia Biotech Ab filed Critical Pharmacia Biotech Ab
Priority to AU38193/95A priority Critical patent/AU3819395A/en
Publication of WO1996012822A1 publication Critical patent/WO1996012822A1/en

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/156Polymorphic or mutational markers

Definitions

  • the invention relates to a method and an apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes.
  • a data base containing all known HLA-DPB sequences makes it possible to analyze heterozygous individuals by combinatorial comparison through all base code sequences and thus identify the one or two base code sequences involved.
  • the HLA-DPB sequences in the data base are selected from published sequences (Marsh S.G.E., Bodmer J.G.; "HLA class II nucleotide sequences", 1992, Tissue Antigens 40:229, 1992).
  • a disadvantage with the known method is its inability to handle artifacts in terms of inserted or removed base codes in a test sequence.
  • the object of the invention is to bring about a method which is less sensitive to artifacts in non-crucial parts of a test sequence produced by sequencing equipment, when analyzing low quality samples, the artifacts being described in terms of inserted, removed and exchanged base codes and ambiguity codes, and which is less time consuming and involves less data than the known method, as well as an apparatus for carrying that method into effect.
  • A, C, G and T stand for adenine, cytosine, guanine and thymine, respectively, while other one-letter codes stand for combinations of nucleotides at the same position as defined by Nomenclature Committee of the International Union of Biochemistry (NC-IUB): Nomenclature for incompletely specified bases in nucleic acid sequences.
  • NC-IUB: Nomenclature Committee of the International Union of Biochemistry
  • V = G and A and C; H = A and T and C
  • N = G and A and T and C
  • test sequences are obtained in a manner known per se by means of sequencing equipment, and are to be analyzed in order to identify the two nucleic acid base code sequences which, superposed on each other, make up the original sequence.
  • the starting point is a given set of alternative base code sequences (alleles) for a gene in the HLA complex.
  • alternative base code sequences or subtypes could be used:
  • Subtype 1 ACC GCT GAT CCC TGT CG
  • the first subtype is explicitly defined, while merely deviations from the first subtype are indicated for the other two subtypes.
  • a master template sequence is constructed from the above given set of subtypes by assigning every conserved position, i.e. every position where the base code is the same all through the set, that particular base code in said master template sequence, while every non-conserved position, i.e. every position where the base code differs through the set, is assigned a wild-card code corresponding to $ in said master template sequence.
  • the master template sequence will be as follows: ACCGC$$$TCCCTG$$G.
  • the non-conserved positions are extracted from every base code sequence in the above given set in order to obtain a corresponding set of non-conserved position subsequences which only contain the non-conserved base codes.
  • a test sequence, obtained as indicated above, is then aligned with the master template sequence in such a manner that, accepting gaps in either sequence, the matching between the test sequence and the master template sequence is optimized.
  • This algorithm functions so that all types of alignments between the two sequences are given points. This is accomplished in that different points are awarded e.g. for matching position, mismatching position, inserted or removed characters etc. The alignment that obtains the highest number of points is kept.
  • the wild-card code introduced in accordance with the invention gives matching points in combination with any character in the other sequence.
  • the master template sequence will have the function of pointing out non-conserved positions in the test sequence based on the local appearance of the alignment between the sequences. This will function despite different forms of artifacts (inserted, removed and/or exchanged characters) in the conserved regions and without actual knowledge of where the respective test sequence starts.
  • a first embodiment it is supposed that the below single test sequence has been obtained:
  • Test sequence A CGCWKRTCCCTGCSG Master template sequence A CGC$$$TCCCTG$$G
  • This extracted sequence of base codes and ambiguity codes is then compared with all the above combination sequences of base codes and ambiguity codes.
  • the combination 2/3 above corresponds exactly with the extracted sequence, which means that the two nucleic acid base code sequences superposed on each other, in other words, the two HLA alleles for a certain gene present in the sequence obtained from a sample from a human individual, can be identified.
  • the test sequence is, in fact, a superposition of subtypes 2 and 3.
  • Test sequence I CGGTATCGCWKRTCCCTGCSGGAT
  • Test sequence II CGGTACCGTTKRTCCCTGCSGGAT.
  • Test sequence I A CGCWKRTCCCTGCSG Master template sequence A CGC$$$TCCCTG$$G
  • Test sequence II ACCG TKRTCCCTGCSG
  • a consensus sequence of base codes and ambiguity codes is then determined from the two or more extracted sequences in the following way:
  • the first component, 0.5 and 1.0, respectively, reflects the quality of the local alignment in such a manner that 1.0 means that the quality of the local alignment is perfect, while 0.5 means that the quality of the local alignment is not perfect, in this case, in view of the mismatch immediately to the left of the position in question. It should be understood that, in this example, 0.5 has been chosen to reflect a mismatch in an adjacent position.
  • the second component, 0.0001*5 in both cases gives a negative contribution due to the position in the test sequences in such a manner that a position located closer to the beginning of the test sequence gives a smaller negative contribution than a position located further away from the beginning of the test sequence.
  • a match between one of said combination sequences and the extracted sequence of base codes and ambiguity codes indicates that that particular combination sequence corresponds to the two nucleic acid base code sequences to be identified.
  • the above combination 2/3 corresponds exactly with the extracted sequence, which means that the two nucleic acid base code sequences superposed on each other, in other words, the two HLA alleles for a certain gene present in the sequence obtained from a sample from a human individual, can be identified.
  • test sequence is, in fact, a superposition of subtypes 2 and 3.
  • a first embodiment of an apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, comprises master template sequence constructing means (not shown) for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence, non-conserved position extracting means (not shown) for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes, superposing means (not shown) for superposing in pairs all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences
  • a second embodiment of an apparatus according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes comprises master template sequence constructing means (not shown) for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence, non-conserved position extracting means (not shown) for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes, superposing means (not shown) for superposing, in pairs, all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination
  • the apparatuses according to the invention are preferably implemented in computer software.

Abstract

In a method and an apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, a master template sequence is constructed from said given set of base code sequences as combination sequences of base codes and ambiguity codes. One or more determinations of the original sequence are made to obtain one or more test sequences which are aligned against said master template sequence in such a manner that the matching between the sequences is optimized. A consensus sequence is determined from the aligned test sequences and is compared with all the combination sequences. A match between one of the combination sequences and the consensus sequence indicates that that particular combination sequence corresponds to said two nucleic acid base code sequences to be identified.

Description

METHOD FOR IDENTIFYING TWO NUCLEIC ACID BASE CODE SEQUENCES
TECHNICAL FIELD
The invention relates to a method and an apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes.
BACKGROUND OF THE INVENTION
Such a method is known from Erik H. Rozemuller et al., "Assignment of HLA-DPB alleles by computerized matching based upon sequence data", Human Immunology 37, 207-212 (1993).
According to the known method, a data base containing all known HLA-DPB sequences makes it possible to analyze heterozygous individuals by combinatorial comparison through all base code sequences and thus identify the one or two base code sequences involved. The HLA-DPB sequences in the data base are selected from published sequences (Marsh S.G.E., Bodmer J.G.; "HLA class II nucleotide sequences", 1992, Tissue Antigens 40:229, 1992).
A disadvantage with the known method is its inability to handle artifacts in terms of inserted or removed base codes in a test sequence.
Moreover, the known method is time consuming and involves a great amount of data.
BROAD DESCRIPTION OF THE INVENTION
The object of the invention is to bring about a method which is less sensitive to artifacts in non-crucial parts of a test sequence produced by sequencing equipment, when analyzing low quality samples, the artifacts being described in terms of inserted, removed and exchanged base codes and ambiguity codes, and which is less time consuming and involves less data than the known method, as well as an apparatus for carrying that method into effect.
This is attained by a first embodiment of the method according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, in that it comprises the steps of
a) constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
b) extracting from every base code sequence of said given set the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
c) superposing in pairs all possible combinations of the non-conserved position sequences extracted in step b) to obtain combination sequences of base codes and ambiguity codes,
d) making a determination of the original sequence in order to obtain a test sequence,
e) aligning said test sequence against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between them is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in said test sequence,
f) extracting from said test sequence all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, and
g) comparing the base codes and ambiguity codes extracted in step f) with all the combination sequences of base codes and ambiguity codes obtained in step c), a match between one of said combination sequences obtained in step c) and the base codes and ambiguity codes extracted in step f) indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
This is also attained by a second embodiment of the method according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, in that it comprises the steps of
a) constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
b) extracting from every base code sequence of said given set the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
c) superposing, in pairs, all possible combinations of the non-conserved position sequences extracted in step b) to obtain combination sequences of base codes and ambiguity codes,
d) making one or more determinations of the original sequence in order to obtain one or more test sequences,
e) aligning each of said one or more test sequences against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between the master template and each test sequence is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in each test sequence,
f) extracting from each of said test sequences all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence,
g) determining, for each non-conserved position, a consensus base code or ambiguity code on the basis of the non-conserved bases extracted from each test sequence by summing up a score for each base code for each non-conserved position and keeping the base code with the highest score, the score being a function of the position of the base code in the respective test sequence as well as of the local quality of the alignment between the respective test sequence and said master template sequence, and
h) comparing the consensus base codes and ambiguity codes determined in step g) with all the combination sequences of base codes and ambiguity codes obtained in step c), a match between one of said combination sequences obtained in step c) and the consensus base codes and ambiguity codes determined in step g) indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
A first embodiment of the apparatus according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, comprises master template sequence constructing means for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence, non-conserved position extracting means for extracting from every base code sequence of said given set the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes, superposing means for superposing in pairs all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes, original sequence determining means for making a determination of the original sequence in order to obtain a test sequence, aligning means for aligning said test sequence against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between them is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in said test sequence, base code and ambiguity code extracting means for extracting from said test sequence all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, and comparing means for comparing the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
A second embodiment of the apparatus according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, comprises master template sequence constructing means for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence, non-conserved position extracting means for extracting from every base code sequence of said given set the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes, superposing means for superposing, in pairs, all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes, original sequence determining means for making one or more determinations of the original sequence in order to obtain one or more test sequences, aligning means for aligning each of said one or more test sequences against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between the master template and each test sequence is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in each test sequence, base code and ambiguity code extracting means for extracting from each of said test sequences all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, determining means for determining, for each non-conserved position, a consensus base code or ambiguity code on the basis of the non-conserved bases extracted from each test sequence by summing up a score for each base code for each non-conserved position and keeping the base code with the highest score, the score being a function of the position of the base code in the respective test sequence as well as of the local quality of the alignment between the respective test sequence and said master template sequence, and comparing means for comparing the consensus base codes and ambiguity codes determined by said determining means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the consensus base codes and ambiguity codes determined by said determining means indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
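The consensus determination of the second embodiment (summing a score per code for each non-conserved position and keeping the highest) might be sketched as follows. This is only an illustration: the text states that the score depends on the position in the test sequence and on the local alignment quality, but the additive form `quality - penalty * position`, the function name `consensus_codes` and the input layout are assumptions of this sketch, not the patent's definition.

```python
from collections import defaultdict

def consensus_codes(observations_per_position, position_penalty=0.0001):
    """Pick one code per non-conserved position from several test sequences.

    observations_per_position: one list per non-conserved position, each
    holding (code, position_in_test_sequence, local_quality) tuples, one
    tuple per test sequence. The score formula below is an assumed form.
    """
    consensus = []
    for observations in observations_per_position:
        totals = defaultdict(float)
        for code, position, quality in observations:
            # Better local alignment raises the score; a position further
            # from the start of the test sequence lowers it slightly.
            totals[code] += quality - position_penalty * position
        # Keep the code with the highest summed score.
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)

# Hypothetical observations from two test sequences for two positions.
example = [
    [("W", 9, 1.0), ("T", 9, 0.5)],    # second read has a poorer local alignment
    [("K", 10, 1.0), ("K", 10, 1.0)],  # both reads agree
]
print(consensus_codes(example))
```

A position with no observations would make `max` raise `ValueError`; a fuller implementation would have to decide how to report such a position.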
DESCRIPTION OF PREFERRED EMBODIMENTS
In the following description, A, C, G and T stand for adenine, cytosine, guanine and thymine, respectively, while other one-letter codes stand for combinations of nucleotides at the same position as defined by Nomenclature Committee of the International Union of Biochemistry (NC-IUB): Nomenclature for incompletely specified bases in nucleic acid sequences. Eur J Biochem 150:1, 1985 as follows:
R = G and A
Y = T and C
W = A and T
S = G and C
M = A and C
K = G and T
B = G and T and C
D = G and A and T
V = G and A and C
H = A and T and C
N = G and A and T and C
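The code table above can be represented as a mapping from each one-letter code to the set of bases it stands for, which makes superposing two codes a set union. The following is a minimal sketch; the names `CODE_TO_BASES`, `BASES_TO_CODE` and `superpose_codes` are chosen here for illustration and do not come from the patent.

```python
# NC-IUB one-letter codes and the bases each one stands for.
_TABLE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "B": "GTC", "D": "GAT", "V": "GAC", "H": "ATC", "N": "GATC",
}
CODE_TO_BASES = {code: frozenset(bases) for code, bases in _TABLE.items()}
# Reverse lookup: every non-empty subset of {A, C, G, T} has exactly one code.
BASES_TO_CODE = {bases: code for code, bases in CODE_TO_BASES.items()}

def superpose_codes(x, y):
    """Return the single code that covers every base allowed by x or by y."""
    return BASES_TO_CODE[CODE_TO_BASES[x] | CODE_TO_BASES[y]]

print(superpose_codes("A", "T"))  # A and T together are written W
```

Because the table covers all 15 non-empty subsets of the four bases, the union of any two codes always has an entry in the reverse lookup.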
In the method according to the invention, one or more determinations of an original sequence are made in order to obtain one or more test sequences. The test sequences are obtained in a manner known per se by means of sequencing equipment, and are to be analyzed in order to identify the two nucleic acid base code sequences which, superposed on each other, make up the original sequence.
To accomplish this, the starting point is a given set of alternative base code sequences (alleles) for a gene in the HLA complex. For this example, the following set of three alternative base code sequences or subtypes could be used:
Subtype 1 ACC GCT GAT CCC TGT CG
Subtype 2 --- --A TG- --- --C G-
Subtype 3 --- --- --- --- --C --
According to the nomenclature above, the first subtype is explicitly defined, while merely deviations from the first subtype are indicated for the other two subtypes, a hyphen denoting the same base code as in the first subtype.
It is to be understood that, in practice, the number of subtypes is very large.
According to the invention, a master template sequence is constructed from the above given set of subtypes by assigning every conserved position, i.e. every position where the base code is the same all through the set, that particular base code in said master template sequence, while every non-conserved position, i.e. every position where the base code differs through the set, is assigned a wild-card code corresponding to $ in said master template sequence.
Applying this to the above given set of just three base code sequences, the master template sequence will be as follows: ACCGC$$$TCCCTG$$G.
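The construction of the master template sequence can be illustrated by the following minimal Python sketch. It is an illustration only, not the claimed implementation: the function name `build_master_template` and the fully written-out forms of subtypes 2 and 3 (derived from the deviations given above) are assumptions introduced here, and all sequences are assumed to be pre-aligned and of equal length.

```python
WILDCARD = "$"  # wild-card code used for non-conserved positions


def build_master_template(alleles):
    """Keep conserved base codes; replace every non-conserved position by WILDCARD."""
    if len({len(a) for a in alleles}) != 1:
        raise ValueError("all base code sequences must have the same length")
    # zip(*alleles) yields one column, i.e. one position across the whole set, at a time.
    return "".join(
        column[0] if len(set(column)) == 1 else WILDCARD
        for column in zip(*alleles)
    )


# Subtypes 1 to 3 of the example, written out in full (assumed expansion).
subtypes = [
    "ACCGCTGATCCCTGTCG",
    "ACCGCATGTCCCTGCGG",
    "ACCGCTGATCCCTGCCG",
]
template = build_master_template(subtypes)
print(template)
```

For the three example subtypes, the sketch reproduces the master template sequence stated above.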
According to the invention, the non-conserved positions are also extracted from every base code sequence in the above given set in order to obtain a corresponding set of non-conserved position subsequences which contain only the non-conserved base codes.
Applying this to the above given set of subtypes, the following three non-conserved position subsequences are obtained:
1. TGATC
2. ATGCG
3. TGACC.
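The extraction of the non-conserved position subsequences may be sketched as follows. The function name `non_conserved_subsequences` and the written-out subtype sequences are assumptions made for illustration; the sketch presumes equal-length, pre-aligned sequences.

```python
def non_conserved_subsequences(alleles):
    """Return, for each sequence, only the base codes at non-conserved positions."""
    columns = list(zip(*alleles))
    # A position is non-conserved when more than one distinct base code occurs in it.
    variable = [i for i, column in enumerate(columns) if len(set(column)) > 1]
    return ["".join(allele[i] for i in variable) for allele in alleles]


# Subtypes 1 to 3 of the example, written out in full (assumed expansion).
subtypes = [
    "ACCGCTGATCCCTGTCG",
    "ACCGCATGTCCCTGCGG",
    "ACCGCTGATCCCTGCCG",
]
subsequences = non_conserved_subsequences(subtypes)
print(subsequences)
```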
According to the invention, all possible combinations of the above non-conserved position sequences are superposed in pairs in order to obtain combination sequences of base codes and ambiguity codes.
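The pairwise superposition can be sketched in Python as below. This is a hedged illustration: the names `IUPAC`, `superpose` and `combination_table` are hypothetical, and the sketch simply takes, at each position, the union of the bases denoted by the two codes and maps it back to the one-letter code of the NC-IUB nomenclature given above.

```python
from itertools import combinations_with_replacement

# Bases denoted by each one-letter code (NC-IUB nomenclature).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "V": "ACG", "H": "ACT", "N": "ACGT",
}
# Reverse lookup: set of bases -> one-letter code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def superpose(seq_a, seq_b):
    """Superpose two equal-length sequences position by position."""
    return "".join(
        CODE_FOR[frozenset(IUPAC[a]) | frozenset(IUPAC[b])]
        for a, b in zip(seq_a, seq_b)
    )


def combination_table(subsequences):
    """Superpose all pairs (including each sequence with itself), numbered from 1."""
    numbered = list(enumerate(subsequences, start=1))
    return {
        (i, j): superpose(a, b)
        for (i, a), (j, b) in combinations_with_replacement(numbered, 2)
    }


table = combination_table(["TGATC", "ATGCG", "TGACC"])
for (i, j), sequence in table.items():
    print(f"{i}/{j} {sequence}")
```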
For the above three non-conserved position sequences, the following combination sequences are obtained:
Combination
1/1   TGATC
1/2   WKRYS
1/3   TGAYC
2/2   ATGCG
2/3   WKRCS
3/3   TGACC
In accordance with the invention, a test sequence, obtained as indicated above, is then aligned with the master template sequence in such a manner that, accepting gaps in either sequence, the matching between the test sequence and the master template sequence is optimized.
For this alignment, the dynamic programming algorithm described by S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970), may be used.
This algorithm awards points to every possible alignment between the two sequences. Different points are awarded, e.g., for a matching position, a mismatching position, inserted or removed characters, etc. The alignment that obtains the highest number of points is kept.
According to the invention, the wild-card code also gives matching points in combination with any character in the other sequence. Thus, the master template sequence will have the function of pointing out non-conserved positions in the test sequence based on the local appearance of the alignment between the sequences. This will function despite different forms of artifacts (inserted, removed and/or exchanged characters) in the conserved regions and without actual knowledge of where the respective test sequence starts.
According to a first embodiment, it is supposed that the below single test sequence has been obtained:
CGGTATCGCWKRTCCCTGCSGGAT.
Aligning the above test sequence and the master template sequence in the above manner would give the following result:
                          CGGT T               GAT
Test sequence                 A CGCWKRTCCCTGCSG
Master template sequence      A CGC$$$TCCCTG$$G
                               C
According to the invention, all base codes and ambiguity codes which are aligned with the wild-card codes in the master template sequence are then extracted, which gives the following sequence:
WKRCS.
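The alignment and extraction steps may be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: the scoring values (match +1, mismatch -1, gap -2) are chosen arbitrarily, leading and trailing characters of the test sequence are left unpenalized (a semi-global variant of the dynamic programming approach), and the function name `align_and_extract` is hypothetical.

```python
WILDCARD = "$"


def align_and_extract(master, test, match=1, mismatch=-1, gap=-2):
    """Align master against test and return the test codes opposite wild-cards."""
    n, m = len(master), len(test)

    def pair_score(a, b):
        # The wild-card code gives matching points with any character.
        return match if (a == WILDCARD or a == b) else mismatch

    # score[i][j]: best score for master[:i] aligned so that it ends at test[:j].
    # Row 0 stays 0, so leading test characters cost nothing.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(master[i - 1], test[j - 1]),
                score[i - 1][j] + gap,  # master character opposite a gap
                score[i][j - 1] + gap,  # test character opposite a gap
            )

    # Trailing test characters are free: start the traceback at the best end column.
    i, j = n, max(range(m + 1), key=lambda k: score[n][k])
    extracted = []
    while i > 0:
        if j > 0:
            a, b = master[i - 1], test[j - 1]
            if score[i][j] == score[i - 1][j - 1] + pair_score(a, b):
                if a == WILDCARD:
                    extracted.append(b)
                i, j = i - 1, j - 1
                continue
            if score[i][j] == score[i][j - 1] + gap:
                j -= 1
                continue
        i -= 1  # master character opposite a gap
    return "".join(reversed(extracted))


MASTER = "ACCGC$$$TCCCTG$$G"
print(align_and_extract(MASTER, "CGGTATCGCWKRTCCCTGCSGGAT"))
```

With these assumed scores, the single exchanged character in the conserved region is treated as a mismatch rather than as a pair of gaps, and the codes opposite the wild-card positions are recovered.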
This extracted sequence of base codes and ambiguity codes is then compared with all the above combination sequences of base codes and ambiguity codes.
A match between one of said combination sequences and the extracted sequence of base codes and ambiguity codes indicates that that particular combination sequence corresponds to the two nucleic acid base code sequences to be identified. In this case, the combination 2/3 above corresponds exactly with the extracted sequence, which means that the two nucleic acid base code sequences superposed on each other, in other words, the two HLA alleles for a certain gene present in the sequence obtained from a sample from a human individual, can be identified.
Thus, in the present case, since the subsequences in the combination 2/3 are extracted from subtypes 2 and 3, the test sequence is, in fact, a superposition of subtypes 2 and 3.
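The comparison step amounts to a lookup of the extracted sequence among the combination sequences, as in the following sketch. The table is copied from the example above; the function name `identify` and the choice to return a list of all matching pairs (which may be empty, or hold more than one pair for a larger set of subtypes) are assumptions made for illustration.

```python
# Combination sequences of the example, keyed by the pair of subtype numbers.
COMBINATIONS = {
    (1, 1): "TGATC",
    (1, 2): "WKRYS",
    (1, 3): "TGAYC",
    (2, 2): "ATGCG",
    (2, 3): "WKRCS",
    (3, 3): "TGACC",
}


def identify(extracted, combinations):
    """Return every pair of subtypes whose combination sequence equals `extracted`."""
    return [pair for pair, sequence in combinations.items() if sequence == extracted]


print(identify("WKRCS", COMBINATIONS))
```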
According to a second embodiment, it is supposed that the below two test sequences have been obtained:
Test sequence I    CGGTATCGCWKRTCCCTGCSGGAT
Test sequence II   CGGTACCGTTKRTCCCTGCSGGAT.
Aligning the above two test sequences and the master template sequence would give the following results:
                          CGGT T               GAT
Test sequence I               A CGCWKRTCCCTGCSG
Master template sequence      A CGC$$$TCCCTG$$G
                               C
and
                          CGGT    T            GAT
Test sequence II              ACCG TKRTCCCTGCSG
Master template sequence      ACCG $$$TCCCTG$$G
                                  C
As in the first embodiment, all base codes and ambiguity codes which are aligned with the wild-card codes in the master template sequence, are then extracted from each test sequence, which gives the following extracted sequences:
WKRCS, and TKRCS.
According to the invention, when two or more test sequences are obtained, a consensus sequence of base codes and ambiguity codes is then determined from the two or more extracted sequences in the following way:
For each non-conserved position, a score is assigned to all possible code types. For the first position, this gives:
Code   1st sequence score   2nd sequence score   Total score
A      0                    0                    0
C      0                    0                    0
G      0                    0                    0
T      0                    0.5-(0.0001*5)       0.4995
R      0                    0                    0
Y      0                    0                    0
W      1.0-(0.0001*5)       0                    0.9995
S      0                    0                    0
M      0                    0                    0
K      0                    0                    0
The code with the highest total score, in this case W=0.9995, is kept for the consensus sequence. The first component, 0.5 and 1.0, respectively, reflects the quality of the local alignment in such a manner that 1.0 means that the quality of the local alignment is perfect, while 0.5 means that the quality of the local alignment is not perfect, in this case, in view of the mismatch immediately to the left of the position in question. It should be understood that, in this example, 0.5 has been chosen to reflect a mismatch in an adjacent position. The second component, 0.0001*5 in both cases, gives a negative contribution due to the position in the test sequences in such a manner that a position located closer to the beginning of the test sequence gives a smaller negative contribution than a position located further away from the beginning of the test sequence.
The next position is treated in the same way:
Code   1st sequence score   2nd sequence score   Total score
A      0                    0                    0
C      0                    0                    0
G      0                    0                    0
T      0                    0                    0
R      0                    0                    0
Y      0                    0                    0
W      0                    0                    0
S      0                    0                    0
M      0                    0                    0
K      1.0-(0.0001*6)       1.0-(0.0001*6)       1.9988
The code with the highest total score, in this case K=1.9988, is kept for the consensus sequence. It should be pointed out that the total score may be used as a quality measure of the position in question. Thus, in the second of the above two examples, the quality of K is almost as high as possible. Treating the rest of the positions in the same manner gives the following final consensus sequence:
WKRCS.
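The consensus determination can be sketched as follows. Only the figures for the first two non-conserved positions (local qualities 1.0 and 0.5, position values 5 and 6, and the factor 0.0001) are taken from the example above; the position values 7, 14 and 15 and the qualities of 1.0 used for the remaining three positions are assumptions made solely to complete the illustration, as are the names `consensus` and `POSITION_PENALTY`.

```python
from collections import defaultdict

POSITION_PENALTY = 0.0001  # factor of the position-dependent negative contribution


def consensus(observations):
    """Pick, for each non-conserved position, the code with the highest total score.

    `observations` holds one list per non-conserved position; each list contains
    one (code, local_quality, position) tuple per test sequence.
    """
    result = []
    for calls in observations:
        totals = defaultdict(float)
        for code, quality, position in calls:
            totals[code] += quality - POSITION_PENALTY * position
        best = max(totals, key=totals.get)
        result.append((best, totals[best]))
    return result


observations = [
    [("W", 1.0, 5), ("T", 0.5, 5)],    # from the example: W=0.9995, T=0.4995
    [("K", 1.0, 6), ("K", 1.0, 6)],    # from the example: K=1.9988
    [("R", 1.0, 7), ("R", 1.0, 7)],    # assumed values
    [("C", 1.0, 14), ("C", 1.0, 14)],  # assumed values
    [("S", 1.0, 15), ("S", 1.0, 15)],  # assumed values
]
calls = consensus(observations)
consensus_sequence = "".join(code for code, _ in calls)
print(consensus_sequence, [round(total, 4) for _, total in calls])
```

The total score returned with each code can serve as the quality measure of the position mentioned above.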
These determined consensus base codes and ambiguity codes are then compared with all the above combination sequences of base codes and ambiguity codes.
As in the first embodiment, a match between one of said combination sequences and the consensus sequence of base codes and ambiguity codes indicates that that particular combination sequence corresponds to the two nucleic acid base code sequences to be identified.
Also in this second embodiment, the above combination 2/3 corresponds exactly with the consensus sequence, which means that the two nucleic acid base code sequences superposed on each other, in other words, the two HLA alleles for a certain gene present in the sequence obtained from a sample from a human individual, can be identified.
Thus, also in this case, since the subsequences in the combination 2/3 are extracted from subtypes 2 and 3, the test sequence is, in fact, a superposition of subtypes 2 and 3.
It should be understood that the above second embodiment of the method according to the invention, with two (or more) test sequences, could also be applied to just a single test sequence. In that case, the consensus sequence would, of course, be the same as the sequence extracted from that test sequence.
A first embodiment of an apparatus according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, comprises
- master template sequence constructing means (not shown) for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
- non-conserved position extracting means (not shown) for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
- superposing means (not shown) for superposing in pairs all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes,
- original sequence determining means (not shown) for making a determination of the original sequence in order to obtain a test sequence,
- aligning means (not shown) for aligning said test sequence against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between them is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in said test sequence,
- base code and ambiguity code extracting means (not shown) for extracting from said test sequence all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, and
- comparing means (not shown) for comparing the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means, indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
A second embodiment of an apparatus according to the invention for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, comprises
- master template sequence constructing means (not shown) for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
- non-conserved position extracting means (not shown) for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
- superposing means (not shown) for superposing, in pairs, all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes,
- original sequence determining means (not shown) for making one or more determinations of the original sequence in order to obtain one or more test sequences,
- aligning means (not shown) for aligning each of said one or more test sequences against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between the master template and each test sequence is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in each test sequence,
- base code and ambiguity code extracting means (not shown) for extracting from each of said test sequences all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence,
- determining means (not shown) for determining, for each non-conserved position, a consensus base code or ambiguity code on the basis of the non-conserved bases extracted from each test sequence by summing up a score for each base code for each non-conserved position and keeping the base code with the highest score, the score being a function of the position of the base code in the respective test sequence as well as of the local quality of the alignment between the respective test sequence and said master template sequence, and
- comparing means (not shown) for comparing the consensus base codes and ambiguity codes determined by said determining means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the consensus base codes and ambiguity codes determined by said determining means, indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
The apparatuses according to the invention are preferably implemented in computer software.

Claims

1. A method for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, characterized by the steps of
a) constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
b) extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
c) superposing, in pairs, all possible combinations of the non-conserved position sequences extracted in step b) to obtain combination sequences of base codes and ambiguity codes,
d) making a determination of the original sequence in order to obtain a test sequence,
e) aligning said test sequence against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between them is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in said test sequence,
f) extracting from said test sequence all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, and
g) comparing the base codes and ambiguity codes extracted in step f) with all the combination sequences of base codes and ambiguity codes obtained in step c), a match between one of said combination sequences obtained in step c) and the base codes and ambiguity codes extracted in step f), indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
2. A method for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, characterized by the steps of
a) constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
b) extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
c) superposing, in pairs, all possible combinations of the non-conserved position sequences extracted in step b) to obtain combination sequences of base codes and ambiguity codes,
d) making one or more determinations of the original sequence in order to obtain one or more test sequences,
e) aligning each of said one or more test sequences against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between the master template and each test sequence is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in each test sequence,
f) extracting from each of said test sequences all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence,
g) determining, for each non-conserved position, a consensus base code or ambiguity code on the basis of the non-conserved bases extracted from each test sequence by summing up a score for each base code for each non-conserved position and keeping the base code with the highest score, the score being a function of the position of the base code in the respective test sequence as well as of the local quality of the alignment between the respective test sequence and said master template sequence, and
h) comparing the consensus base codes and ambiguity codes determined in step g) with all the combination sequences of base codes and ambiguity codes obtained in step c), a match between one of said combination sequences obtained in step c) and the consensus base codes and ambiguity codes determined in step g), indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
3. A method of genetic analysis, comprising the steps of (i) subjecting a test sample to a sequencing procedure to obtain two superposed base code sequences representing the alleles present for a specific gene, and (ii) identifying the base code sequences by the method according to claim 1 or 2.
4. Use of the method according to claim 1 or 2 for HLA typing.
5. An apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, characterized in that it comprises
- master template sequence constructing means for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
- non-conserved position extracting means for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
- superposing means for superposing, in pairs, all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes,
- original sequence determining means for making a determination of the original sequence in order to obtain a test sequence,
- aligning means for aligning said test sequence against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between them is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in said test sequence,
- base code and ambiguity code extracting means for extracting from said test sequence all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence, and
- comparing means for comparing the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the base codes and ambiguity codes extracted by said base code and ambiguity code extracting means, indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
6. An apparatus for identifying two nucleic acid base code sequences belonging to a given set of known base code sequences and being superposed on each other in an original sequence which comprises base codes as well as ambiguity codes, characterized in that it comprises
- master template sequence constructing means for constructing a master template sequence from said given set of base code sequences by assigning every conserved position, where the base code is the same all through the set, that particular base code in said master template sequence, and assigning every non-conserved position, where the base code differs through the set, a wild-card code in said master template sequence,
- non-conserved position extracting means for extracting from every base code sequence of said given set, the non-conserved positions to obtain non-conserved position subsequences containing only the non-conserved base codes,
- superposing means for superposing, in pairs, all possible combinations of the non-conserved position sequences extracted by said non-conserved position extracting means to obtain combination sequences of base codes and ambiguity codes,
- original sequence determining means for making one or more determinations of the original sequence in order to obtain one or more test sequences,
- aligning means for aligning each of said one or more test sequences against said master template sequence in such a manner that, accepting gaps in either sequence, the matching between the master template and each test sequence is optimized, said wild-card coded non-conserved positions in said master template sequence being considered as matching any base code and any ambiguity code in each test sequence,
- base code and ambiguity code extracting means for extracting from each of said test sequences all base codes and ambiguity codes which are aligned with the wild-card codes in said master template sequence,
- determining means for determining, for each non-conserved position, a consensus base code or ambiguity code on the basis of the non-conserved bases extracted from each test sequence by summing up a score for each base code for each non-conserved position and keeping the base code with the highest score, the score being a function of the position of the base code in the respective test sequence as well as of the local quality of the alignment between the respective test sequence and said master template sequence, and
- comparing means for comparing the consensus base codes and ambiguity codes determined by said determining means with all the combination sequences of base codes and ambiguity codes obtained by means of said superposing means, a match between one of said combination sequences obtained by means of said superposing means and the consensus base codes and ambiguity codes determined by said determining means, indicating that that particular combination sequence of base codes and ambiguity codes corresponds to said two nucleic acid base code sequences to be identified.
PCT/SE1995/001213 1994-10-21 1995-10-17 Method for indentifying two nucleic acid base code sequences WO1996012822A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU38193/95A AU3819395A (en) 1994-10-21 1995-10-17 Method for indentifying two nucleic acid base code sequences

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE9403612A SE9403612D0 (en) 1994-10-21 1994-10-21 Method for identifying two nucleic acid base code sequences
SE9403612-6 1994-10-21

Publications (1)

Publication Number Publication Date
WO1996012822A1 true WO1996012822A1 (en) 1996-05-02

Family

ID=20395701

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE1995/001213 WO1996012822A1 (en) 1994-10-21 1995-10-17 Method for indentifying two nucleic acid base code sequences

Country Status (3)

Country Link
AU (1) AU3819395A (en)
SE (1) SE9403612D0 (en)
WO (1) WO1996012822A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUMAN IMMUNOLOGY, Volume 37, 1993, ERIK H. ROZEMULLER et al., "Assignment of HLA-DPB Alleles by Computerized Matching Based Upon Sequence Data", pages 207-212. *
J. MOL. BIOL., Volume 221, 1991, B. EDWIN BLAISDELL et al., "An Efficient Algorithm for Identifying Matches With Errors in Multiple Long Molecular Sequences", pages 1367-1378. *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5972604A (en) * 1991-03-06 1999-10-26 Regents Of The University Of Minnesota DNA sequence-based HLA typing method
WO2013166303A1 (en) * 2012-05-02 2013-11-07 Ibis Biosciences, Inc. Dna sequencing
US20150184238A1 (en) * 2012-05-02 2015-07-02 Ibis Biosciences, Inc. Dna sequencing
US10202642B2 (en) 2012-05-02 2019-02-12 Ibis Biosciences, Inc. DNA sequencing

Also Published As

Publication number Publication date
AU3819395A (en) 1996-05-15
SE9403612D0 (en) 1994-10-21


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref country code: US

Ref document number: 1997 809868

Date of ref document: 19970402

Kind code of ref document: A

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA