JP6902258B2

JP6902258B2 - Method for determining an allele pair of a subject's HLA gene

Info

Publication number: JP6902258B2
Application number: JP2016257041A
Authority: JP
Inventors: 松田 文彦; 川口 修治; 日笠 幸一郎; 山田 亮
Original assignee: Kyoto University
Current assignee: Kyoto University
Priority date: 2016-12-28
Filing date: 2016-12-28
Publication date: 2021-07-14
Anticipated expiration: 2036-12-28
Also published as: JP2018108042A

Description

The present invention relates to a method for determining an allele pair of a subject's HLA gene, and to a program for determining an allele pair of a subject's HLA gene.

Accurately determining (typing) HLA alleles is extremely important in research on intractable diseases related to transplantation and immunity. There are more than 30 HLA genes, and the IPD-IMGT/HLA database, the best-known HLA allele database, contains more than 10,000 alleles and is updated daily. Combining next-generation sequencers, a high-speed gene sequencing technology, with the information in this database enables highly accurate and highly efficient HLA typing.

However, very few of the registered HLA alleles have a full-length sequence registered; for more than 90% of them, only some exons are registered. In particular, the exons corresponding to the antigen-presenting region, called the G-DOMAIN, are registered for all alleles, but for the other exons most alleles have no sequence information. For this reason, existing typing techniques based on next-generation sequencers could handle only a subset of alleles. For example, a method (OptiType) has been reported in which partial base sequence information of HLA genes (hereinafter, reads) is mapped to the alleles of multiple HLA genes, the mapping result is formulated as a linear programming problem, and the best-fitting allele pair is detected (Non-Patent Document 1). However, this method can determine only some alleles of the Class I genes HLA-A, -B, and -C. In addition, it can resolve alleles only to 4-digit resolution. As another approach, a method (HLAreporter) has been reported in which the read mapping results are assembled de novo and alleles are determined from the assembly (Non-Patent Document 2). This method can determine alleles up to 6-digit resolution. However, because it requires high sequencing coverage for analysis, many samples cannot be typed. Furthermore, because both OptiType and HLAreporter use only some exons to determine alleles, they cannot take diversity in the other exons into account and may select the wrong allele.

Therefore, it was necessary to overcome a problem that was difficult for conventional methods: comparing all HLA alleles on an equal footing, including HLA alleles for which only some exon base sequences are registered in the database.

Non-Patent Document 1: Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M. and Kohlbacher, O. (2014) OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30, 3310-3316.
Non-Patent Document 2: Huang, Y., Yang, J., Ying, D., Zhang, Y., Shotelersuk, V., Hirankarn, N., Sham, P.C., Lau, Y.L. and Yang, W. (2015) HLAreporter: a tool for HLA typing from next generation sequencing data. Genome Med, 7, 25.

An object of the present invention is to provide a new method and program that can accurately determine the alleles of a subject's HLA genes even when the reference is a group of HLA alleles whose exon base sequence information registered in the database is not uniform.

The present inventors first extracted the full base sequences of the exons and introns of all registered HLA alleles from the IPD-IMGT/HLA database and grouped them by base sequence pattern. They created an HLA dictionary recording these patterns together with the alleles to which each pattern belongs, and mapped the reads obtained from a next-generation sequencer using this dictionary as the reference. For this mapping, fixed-length N sequences were appended to both ends of every exon and intron in the dictionary, and the exon or intron to which each read mapped was then examined. A read was matched to the allele containing an exon or intron when a contiguous base sequence covering 50% or more of the read's full length either overlapped the base sequence of any exon in the HLA dictionary and matched it exactly over the overlapping range, or overlapped the base sequence of any intron in the HLA dictionary and matched it with no more than two mismatched bases over the overlapping range. Furthermore, for a read matched to an intron, when the remaining base sequence that did not match the intron overlapped the base sequence of an exon in the HLA dictionary adjacent to that intron and matched it exactly over the overlapping range, the read was matched to the allele containing that exon. Next, each read was given a per-gene weight according to the number of alleles it matched, and this weight was used in the score for determining the allele pair.
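The matching rule in the paragraph above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the function names and the `offset` parameter (the position of the read's first base relative to the region's first base, as some aligner would report it) are assumptions made for illustration. A negative offset models a read that overhangs the region, which is the situation the N padding is meant to accommodate.

```python
def overlap_matches(read: str, region: str, offset: int, max_mismatches: int) -> bool:
    """Check the overlap rule for one read against one exon or intron sequence.

    offset is the position of read[0] relative to region[0]; it may be
    negative when the read starts before the region.
    """
    start = max(offset, 0)                      # overlap start, region coordinates
    end = min(offset + len(read), len(region))  # overlap end (exclusive)
    overlap_len = end - start
    # The overlapping stretch must cover at least 50% of the read length.
    if overlap_len * 2 < len(read):
        return False
    mismatches = sum(
        1 for i in range(start, end) if region[i] != read[i - offset]
    )
    return mismatches <= max_mismatches


def matches_exon(read: str, exon: str, offset: int) -> bool:
    # Exons: exact match over the overlapping range.
    return overlap_matches(read, exon, offset, max_mismatches=0)


def matches_intron(read: str, intron: str, offset: int) -> bool:
    # Introns: up to two mismatched bases over the overlapping range.
    return overlap_matches(read, intron, offset, max_mismatches=2)
```

The additional case in which the non-matching remainder of an intron-matched read is checked against the adjacent exon could be expressed by calling `matches_exon` on that remainder; it is left out here to keep the sketch short.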
Typing was performed for each HLA gene by selecting the allele pair with the largest number of weighted reads mapped to exons (the score), at 6-digit resolution. Because the number of mapped reads depends on the number of exons for which a base sequence is registered, a simple comparison of read counts cannot judge all alleles fairly. Therefore, the search for the allele pair that maximizes the score was first restricted to the exons that are registered for all alleles of each HLA gene. In most HLA genes these are the exons corresponding to the G-DOMAIN, the region involved in antigen presentation: specifically, the second and third exons for Class I HLA genes and the second exon for Class II HLA genes. In this case, several allele pair candidates that share the same G-DOMAIN sequence may attain the maximum score. Therefore, when the HLA dictionary contained an exon outside the G-DOMAIN whose base sequence was registered for all alleles in the selected allele pair candidates, that exon was added to the score calculation, and the allele pair maximizing the score was searched for again among the candidates. This was repeated until the allele pair was uniquely determined.
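The stepwise search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented scoring: the score formula is not reproduced in this text, so `pair_score` simply sums precomputed read weights for reads that hit either allele of a pair in one of the exons under consideration. The data layout (`reads` as `(weight, {exon: alleles})` tuples, `registered` as allele to registered exons) and all names are hypothetical, and the homozygote rule is omitted.

```python
from itertools import combinations_with_replacement


def pair_score(pair, reads, exons):
    """Sum the weights of reads that hit either allele of pair in any exon of exons."""
    total = 0.0
    for weight, hits in reads:
        if any(hits.get(x, set()) & set(pair) for x in exons):
            total += weight
    return total


def search_allele_pair(reads, registered, initial_exons):
    """Return the best allele pair(s), widening the exon set while candidates tie."""
    exons = set(initial_exons)
    candidates = list(combinations_with_replacement(sorted(registered), 2))
    while True:
        scores = {p: pair_score(p, reads, exons) for p in candidates}
        best = max(scores.values())
        candidates = [p for p in candidates if scores[p] == best]
        if len(candidates) == 1:
            return candidates
        # Exons registered for every allele still in play and not yet used.
        alleles = {a for p in candidates for a in p}
        shared = set.intersection(*(registered[a] for a in alleles)) - exons
        if not shared:
            return candidates  # still tied; another criterion is needed
        exons |= shared


# Toy data: exon 4 is not registered for A*04, so it cannot be used at first.
registered = {"A*01": {2, 3, 4}, "A*02": {2, 3, 4}, "A*03": {2, 3, 4}, "A*04": {2, 3}}
reads = [
    (1.0, {2: {"A*01", "A*02"}}),
    (1.0, {3: {"A*03"}}),
    (1.0, {4: {"A*01"}}),
]
result = search_allele_pair(reads, registered, initial_exons={2, 3})
```

In this toy example the first round ties between the pairs (A*01, A*03) and (A*02, A*03); once A*04 has dropped out, exon 4 is shared by all remaining alleles and resolves the tie.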
Using genomic data obtained from the 1000 Genomes Project, allele typing was performed with the conventional methods and with the method developed above, and the concordance with HLA alleles typed by the PCR-SBT and PCR-SSOP methods was examined. The developed method showed higher concordance than the conventional methods, and many of the discordant alleles were alleles that cannot be detected by the PCR-SSOP method or other methods.
As a result of further studies based on these findings, the present inventors completed the present invention.

That is, the present invention comprises the following.
[1] A method for determining an allele pair of a subject's HLA gene, comprising the following steps:
(1) A step of acquiring partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject,
(2) A step of acquiring base sequence information of the exons and introns of known HLA alleles and creating an HLA dictionary that records the correspondence between the base sequence information of the exons and introns and the exons and introns of the HLA alleles,
(3) A step of determining, when a read acquired in step (1) matches the base sequence of any exon or intron contained in the HLA dictionary created in step (2), that the read matches the allele containing that base sequence,
(4) A step of calculating the weight of the read for the HLA gene containing the allele determined in step (3) to match the read,
(5) A step of calculating the score of an allele pair of the HLA gene based on the read weights calculated in step (4),
(6) A step of searching for the allele pair that maximizes the score calculated in step (5) and determining it as the allele pair of the subject's HLA gene.
[2] A method for determining an allele pair of a subject's HLA gene, comprising the following steps:
(1) A step of acquiring partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject,
(2) A step of acquiring base sequence information of the exons and introns of known HLA alleles, assigning an ID to the base sequence information, and creating an HLA dictionary that records the correspondence between the IDs and the exons and introns of the HLA alleles,
(3) A step of performing the following (A), (B) and (C):
(A) When a contiguous base sequence whose length is 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any exon contained in the HLA dictionary created in step (2) and the two base sequences match exactly over the overlapping range, it is determined that the read matches the allele containing the exon that corresponds to the ID assigned to that exon base sequence,
(B) When a contiguous base sequence whose length is 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) and the two base sequences match with no more than two mismatched bases over the overlapping range, it is determined that the read matches the allele containing the intron that corresponds to the ID assigned to that intron base sequence,
(C) When a contiguous base sequence whose length is 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) and the two base sequences match with no more than two mismatched bases over the overlapping range, and the remaining base sequence that does not match the intron overlaps the base sequence of any exon contained in the HLA dictionary that is adjacent to the intron and matches it exactly over the overlapping range, it is determined that the read matches the allele containing the exon that corresponds to the ID assigned to that exon base sequence,
(4) A step of calculating the weight of the read for the HLA gene containing the allele determined in step (3) to match the read, according to

[equation omitted in the source text]

when the reads acquired in step (1) are single-end reads, and additionally according to

[equation omitted in the source text]

when the reads acquired in step (1) are paired-end reads,
(5) A step of calculating the score of an allele pair of the HLA gene according to

[equation omitted in the source text]

and
(6) A step of determining, according to

[equation omitted in the source text]

the allele pair that maximizes the score calculated in step (5) as the allele pair of the subject's HLA gene (provided that, when the allele pair that maximizes the score satisfies

[equation omitted in the source text]

the homozygous genotype of allele A of the score-maximizing allele pair A, B is determined as the allele pair of the subject's HLA gene).
[3] The method according to [1] or [2], wherein the base sequence information of the exons and introns of known HLA alleles recorded in the HLA dictionary created in step (2) is obtained from the IPD-IMGT/HLA database.
[4] The method according to any one of [1] to [3], wherein an N sequence whose length is 1/2 of the maximum length of the reads acquired in step (1) is appended to both ends of the exons and introns of the known HLA alleles recorded in the HLA dictionary created in step (2).
[5] The method according to any one of [2] to [4], wherein the set T in step (5) contains the second and third exons of Class I HLA genes and the second exon of Class II HLA genes.
[6] The method according to any one of [2] to [5], further comprising the following steps when two or more allele pairs attain the same maximum score in step (6):
(7) A step of adding to the set T of step (5) the exons that are contained in the HLA dictionary in common for the alleles included in those allele pairs, and calculating the score of the allele pairs of the HLA gene again,
(8) A step of determining the allele pair that maximizes the score calculated in step (7) as the allele pair of the subject's HLA gene.
[7] The method according to any one of [2] to [5], further comprising the following step when two or more allele pairs attain the same maximum score in step (6) and no exon is contained in the HLA dictionary in common for the alleles included in those allele pairs:
(7') A step of determining, among the two or more allele pairs having the same maximum score obtained in step (6), the allele pair with the highest product of the frequencies of its alleles as the allele pair of the subject's HLA gene, using allele frequency data for the HLA gene registered in a known database.
[8] The method according to [7], wherein the known database in step (7') is the Allele Frequency Net Database.
[9] A program for determining an allele pair of a subject's HLA gene, the program causing a computer to function as:
receiving means 1 for receiving partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject;
receiving means 2 for receiving an HLA dictionary that is created by acquiring base sequence information of the exons and introns of known HLA alleles and that records the correspondence between the base sequence information of the exons and introns and the exons and introns of the HLA alleles;
determination means 1 for determining, when a received read matches the base sequence of any exon or intron contained in the received HLA dictionary, that the read matches the allele containing that base sequence;
calculation means 1 for calculating the weight of the read for the HLA gene containing the allele determined to match the read;
calculation means 2 for calculating the score of an allele pair of the HLA gene based on the calculated read weights;
determination means 2 for searching for the allele pair that maximizes the calculated score and determining it as the allele pair of the subject's HLA gene; and
output means for outputting the determined allele pair.
[10] A program for determining an allele pair of a subject's HLA gene, the program causing a computer to function as:
receiving means 1 for receiving partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject;
receiving means 2 for receiving an HLA dictionary that is created by acquiring base sequence information of the exons and introns of known HLA alleles and assigning an ID to the base sequence information, and that records the correspondence between the IDs and the exons and introns of the HLA alleles;
determination means 1 for performing the following (A), (B) and (C):
(A) When a contiguous base sequence whose length is 50% or more of the full length of a received read overlaps the base sequence of any exon contained in the HLA dictionary and the two base sequences match exactly over the overlapping range, it is determined that the read matches the allele containing the exon that corresponds to the ID assigned to that exon base sequence,
(B) When a contiguous base sequence whose length is 50% or more of the full length of a received read overlaps the base sequence of any intron contained in the HLA dictionary and the two base sequences match with no more than two mismatched bases over the overlapping range, it is determined that the read matches the allele containing the intron that corresponds to the ID assigned to that intron base sequence,
(C) When a contiguous base sequence whose length is 50% or more of the full length of a received read overlaps the base sequence of any intron contained in the HLA dictionary and the two base sequences match with no more than two mismatched bases over the overlapping range, and the remaining base sequence that does not match the intron overlaps the base sequence of any exon contained in the HLA dictionary that is adjacent to the intron and matches it exactly over the overlapping range, it is determined that the read matches the allele containing the exon that corresponds to the ID assigned to that exon base sequence;
calculation means 1 for calculating the weight of the read for the HLA gene containing the allele determined to match the read, according to

[equation omitted in the source text]

when the received reads are single-end reads, and additionally according to

[equation omitted in the source text]

when the received reads are paired-end reads;
calculation means 2 for calculating the score of an allele pair of the HLA gene according to

[equation omitted in the source text]

determination means 2 for determining, according to

[equation omitted in the source text]

the allele pair that maximizes the calculated score as the allele pair of the subject's HLA gene (provided that, when the allele pair that maximizes the score satisfies

[equation omitted in the source text]

the homozygous genotype of allele A of the score-maximizing allele pair A, B is determined as the allele pair of the subject's HLA gene); and
output means for outputting the determined allele pair.
[11] The program according to [9] or [10], wherein the base sequence information of the exons and introns of known HLA alleles recorded in the HLA dictionary received by receiving means 2 is obtained from the IPD-IMGT/HLA database.
[12] The program according to any one of [9] to [11], wherein an N sequence whose length is 1/2 of the maximum length of the reads received by receiving means 1 is appended to both ends of the exons and introns of the known HLA alleles recorded in the HLA dictionary received by receiving means 2.
[13] The program according to any one of [10] to [12], wherein, in calculation means 2, the set T contains the second and third exons of Class I HLA genes and the second exon of Class II HLA genes.
[14] The program according to any one of [10] to [13], further comprising the following means for the case where two or more allele pairs attain the same maximum score in determination means 2:
calculation means 2' for adding to the set T of calculation means 2 the exons that are contained in the HLA dictionary in common for the alleles included in those allele pairs, and calculating the score of the allele pairs of the HLA gene again,
determination means 2 for determining the allele pair that maximizes the score calculated by calculation means 2' as the allele pair of the subject's HLA gene,
output means for outputting the determined allele pair.
[15] The program according to any one of [10] to [13], further comprising the following means for the case where two or more allele pairs attain the same maximum score in determination means 2 and no exon is contained in the HLA dictionary in common for the alleles included in those allele pairs:
determination means 2' for determining, among the two or more allele pairs having the same maximum score obtained in determination means 2, the allele pair with the highest product of the frequencies of its alleles as the allele pair of the subject's HLA gene, using allele frequency data for the HLA gene registered in a known database,
output means for outputting the determined allele pair.
[16] The program according to [15], wherein the known database used by determination means 2' is the Allele Frequency Net Database.
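The frequency-based tie-break of [7] and [15] can be sketched in a few lines of Python. This is only an illustration: the function name, the allele labels, and the frequency values below are made up, and treating an allele that is absent from the frequency table as having frequency 0 is an assumption of this sketch rather than something the text specifies.

```python
def break_tie_by_frequency(tied_pairs, allele_freq):
    """Return the tied pair whose allele frequencies have the largest product."""
    return max(
        tied_pairs,
        # Alleles missing from the table count as frequency 0.0 (assumption).
        key=lambda pair: allele_freq.get(pair[0], 0.0) * allele_freq.get(pair[1], 0.0),
    )


# Hypothetical frequencies, for illustration only.
allele_freq = {"allele_1": 0.10, "allele_2": 0.30, "allele_3": 0.001}
tied_pairs = [("allele_1", "allele_3"), ("allele_1", "allele_2")]
chosen = break_tie_by_frequency(tied_pairs, allele_freq)
```

Here the products are 0.10 × 0.001 and 0.10 × 0.30, so the second pair is chosen. In practice the frequency table would come from a population-appropriate source such as the Allele Frequency Net Database named in [8] and [16].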

According to the present invention, a subject's HLA alleles can be typed on the basis of all HLA genes and alleles registered in known databases, including the IPD-IMGT/HLA database. The present invention can also respond quickly to data updates of HLA gene databases. Furthermore, NGS sequence data obtained by whole-genome sequencing, whole-exome sequencing, HLA gene sequencing by the Long-PCR method, and the like can be used as the base sequence information for HLA allele typing. In particular, in combination with large-scale sequencing by the Long-PCR method, typing becomes possible at lower cost than with the conventional PCR-SSOP and PCR-SBT methods. In addition, correcting alleles that have been determined erroneously by conventional methods makes it possible to calculate correct HLA allele frequencies. Moreover, it becomes possible to detect and analyze rare alleles and mutations in exons outside the antigen-presenting domain.

A diagram showing an example of an apparatus for determining an allele pair of a subject's HLA gene, in which the program of the present invention is used.
A diagram showing an example of a computer system that executes the program of the present invention.
A flowchart of a specific operation of the program of the present invention.
A) and B) are diagrams showing the conditions for determining whether a read matches a specific exon. A) When a contiguous base sequence whose length is 50% or more of the full length of the read overlaps the base sequence of any exon contained in the HLA dictionary and the two base sequences match exactly over the overlapping range, the read is matched to the allele containing the exon that corresponds to the ID assigned to that exon base sequence. B) When a contiguous base sequence whose length is 50% or more of the full length of the read overlaps the base sequence of any intron contained in the HLA dictionary and the two base sequences match with no more than two mismatched bases over the overlapping range, the read is matched to the allele containing the intron that corresponds to the ID assigned to that intron base sequence. Furthermore, for a read matched to the intron, when the remaining base sequence that did not match the intron overlaps the base sequence of an exon contained in the HLA dictionary adjacent to the intron and matches it exactly over the overlapping range, the read is matched to the allele containing the exon that corresponds to the ID assigned to that exon base sequence.
A diagram showing an example of weight calculation for a single-end read. It shows how the weights of read r for genes G₁ and G₂ are calculated when read r matches four alleles in exon X₁ of gene G₁ and one allele each in exons X₂ and X₃ of gene G₂.
A diagram showing the process of determining an allele pair by score comparison. In the initial search, the set T of exons used for score calculation consists of the exons that are present for all alleles of the gene. Accordingly, the exons contained in T in the initial search are limited to exons 2 and 3 for HLA Class I genes and exon 2 for HLA Class II genes. When the initial search selects multiple allele pair candidates, the exons whose base sequences are registered for all alleles of those candidates are added to T, the scores are recalculated, and the allele pair with the maximum score is searched for again. This is repeated until a single allele pair candidate is obtained or no further exon can be added.

The present invention provides a method for determining the allele pair of a subject's HLA gene (hereinafter, determination method 1 of the present invention), comprising the following steps:
(1) a step of acquiring partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject;
(2) a step of acquiring the base sequence information of the exons and introns of known HLA alleles, and creating an HLA dictionary that records the correspondence between that base sequence information and the exons and introns of the HLA alleles;
(3) a step of determining, when a read acquired in step (1) matches the base sequence information of any exon or intron contained in the HLA dictionary created in step (2), that the read matches the alleles containing that base sequence;
(4) a step of calculating the weight of the read for each HLA gene that contains an allele determined in step (3) to match the read;
(5) a step of calculating the scores of allele pairs of the HLA gene based on the read weights calculated in step (4); and
(6) a step of searching for the allele pair that maximizes the score calculated in step (5) and determining it to be the allele pair of the subject's HLA gene.

The subject to whom determination method 1 of the present invention is applied is not particularly limited. Examples include recipients or donors for the transplantation of organs, tissues, or iPS cell-derived tissues, and donors of iPS cells for the construction of an iPS cell library. The method is also useful for HLA genotyping of large numbers of healthy individuals in cohort studies.

The HLA genes of a subject that can be typed by determination method 1 of the present invention are not particularly limited as long as they are registered in the HLA dictionary. Examples include HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, and HLA-G, which are classified as Class I genes, and HLA-DRA, HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1, HLA-DMA, HLA-DMB, HLA-DOA, and HLA-DOB, which are classified as Class II genes. Among these, the genes known as classical HLA genes, namely HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, and HLA-DRB1, are particularly important.

Determination method 1 of the present invention can determine any known HLA allele of the subject, provided that its exon lengths are not atypical. Determination method 1 of the present invention can also report alleles at 6-digit resolution.

Determination method 1 of the present invention includes a step of acquiring partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject (hereinafter, step (1) of the present invention).
The reads acquired in step (1) of the present invention are received by receiving means 1.

The biological sample in step (1) of the present invention is not particularly limited as long as it contains DNA derived from the subject, and may be a tissue, cells, or the like of the subject. Samples whose collection is minimally invasive for the subject are preferred. Examples include blood, plasma, serum, urine, and saliva, which can be easily collected from the body. When serum or plasma is used, it can be prepared by drawing blood from the subject by a conventional method and separating the liquid component.

The reads of step (1) of the present invention can be obtained from genomic DNA extracted from a biological sample obtained from the subject. Genomic DNA can be extracted from the biological sample using a genomic DNA extraction method known in the art. For example, a genomic DNA extract can be obtained by centrifuging the biological sample to pellet the cells containing genomic DNA, disrupting the cells physically or enzymatically, and removing the cell debris. Genomic DNA can also be extracted using a commercially available genomic DNA extraction kit. Reads can be obtained from the extracted genomic DNA by, for example, sequencing it with a next-generation sequencer. Alternatively, read sequence data published in public databases such as those of the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory (EMBL) can be downloaded. Each read carries base sequence information from a random position inside or outside the HLA genes. The read length is not particularly limited as long as the reads can be mapped to the base sequence information of the HLA dictionary described below. It is usually 50 to 300 bases, and longer reads are preferred. With long reads, however, the quality of the base sequence information in the latter part of the read decreases, so trimming with quality control software is preferred. As for the number of reads suited to determination method 1 of the present invention, the number of reads containing the base sequence at any given position on an exon of an HLA gene is usually 50 or more per allele on average, and more preferably 100 or more per allele on average.
The reads may be reads obtained by the single-end method (hereinafter, single-end reads) or reads obtained by the paired-end method (hereinafter, paired-end reads). Paired-end reads are preferred because they can be expected to improve accuracy.

Determination method 1 of the present invention includes a step of acquiring the base sequence information of the exons and introns of known HLA alleles, and creating an HLA dictionary that records the correspondence between that base sequence information and the exons and introns of the HLA alleles (hereinafter, step (2) of the present invention).
The HLA dictionary created in step (2) of the present invention is received by receiving means 2.

The base sequence information of the exons and introns of the HLA alleles contained in the HLA dictionary of step (2) of the present invention can be obtained from known databases. Examples of known databases include the IPD-IMGT/HLA database provided by the European Bioinformatics Institute (EMBL-EBI) and NCBI Reference Sequences (RefSeq). Information on newly discovered HLA genes and HLA alleles can also be added to the dictionary as it becomes available.

The correspondence between the exon and intron base sequence information acquired in step (2) of the present invention and the exons and introns of the HLA alleles is recorded by assigning an ID to each base sequence and recording the correspondence between that ID and the exons and introns of the HLA alleles. Identical base sequences are assigned the same ID. The HLA dictionary then records which HLA allele has the base sequence with which ID at which exon or intron. For example, HLA-A*01:01:01 and HLA-A*03:01:01 have the same base sequence at Exon1, so that base sequence is assigned the single ID A:Exon1_1, and the HLA dictionary records that Exon1 of both HLA-A*01:01:01 and HLA-A*03:01:01 has the base sequence with ID A:Exon1_1. Reads are mapped to the ID-assigned base sequences, and the matching reads of each allele are tallied and the scores are calculated through the HLA dictionary.
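The ID assignment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the data structure used by the inventors: the function name `build_hla_dictionary`, the tuple layout of `records`, and the three returned mappings are hypothetical, and the sequences are toy strings rather than real HLA data. Only the ID pattern `A:Exon1_1` is taken from the text.

```python
from collections import defaultdict


def build_hla_dictionary(records):
    """Build a toy HLA dictionary (hypothetical layout).

    records: iterable of (gene, allele, region, sequence) tuples, e.g.
    ("A", "HLA-A*01:01:01", "Exon1", "ATGGCC").
    """
    seq_to_id = {}                     # (gene, region, sequence) -> ID
    next_index = defaultdict(int)      # (gene, region) -> last index issued
    id_to_sequence = {}                # ID -> sequence (the mapping target)
    allele_to_ids = defaultdict(dict)  # allele -> {region: ID}
    id_to_alleles = defaultdict(set)   # ID -> alleles carrying the sequence

    for gene, allele, region, sequence in records:
        key = (gene, region, sequence)
        if key not in seq_to_id:
            # First time this sequence is seen for this region: issue a new ID.
            next_index[(gene, region)] += 1
            new_id = f"{gene}:{region}_{next_index[(gene, region)]}"
            seq_to_id[key] = new_id
            id_to_sequence[new_id] = sequence
        seq_id = seq_to_id[key]
        allele_to_ids[allele][region] = seq_id
        id_to_alleles[seq_id].add(allele)

    return id_to_sequence, dict(allele_to_ids), dict(id_to_alleles)


# Toy sequences for illustration only, not real HLA data.
records = [
    ("A", "HLA-A*01:01:01", "Exon1", "ATGGCC"),
    ("A", "HLA-A*03:01:01", "Exon1", "ATGGCC"),
    ("A", "HLA-A*02:01:01", "Exon1", "ATGGCG"),
]
id_to_sequence, allele_to_ids, id_to_alleles = build_hla_dictionary(records)
print(allele_to_ids["HLA-A*03:01:01"])  # {'Exon1': 'A:Exon1_1'}
```

Keeping the reverse mapping from ID to alleles means that a read mapped to one ID-assigned sequence can be credited to every allele sharing that sequence without mapping it again.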

N sequences may be added to both ends of each exon and each intron registered in the HLA dictionary. Here, an N sequence is a sequence that is never counted as a mismatch, whatever the base sequence of the read, when reads are mapped to the base sequence information of the exons and introns registered in the HLA dictionary. So that the N sequence can cover up to 50% of a read, its length is one half of the length of the longest read among all reads. If the mapping software uses a character other than N for this purpose, the sequence is added using the corresponding character.
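As a small illustration of the padding rule above, the following sketch pads a reference sequence on both sides. The function name `pad_with_n` and its parameters are hypothetical, and rounding the half length down for odd read lengths is an assumption, since the text does not say how odd lengths are handled.

```python
def pad_with_n(sequence, max_read_length, wildcard="N"):
    """Add a wildcard run of half the longest read length to both ends.

    Rounding down for odd read lengths is an assumption of this sketch.
    """
    pad = wildcard * (max_read_length // 2)
    return pad + sequence + pad


print(pad_with_n("ACGT", 6))  # NNNACGTNNN
```

The `wildcard` parameter reflects the remark that some mapping software uses a character other than N.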

Determination method 1 of the present invention includes a step of determining, when a read acquired in step (1) of the present invention matches the base sequence of any exon or intron contained in the HLA dictionary created in step (2) of the present invention, that the read matches the alleles containing that base sequence (hereinafter, step (3) of the present invention).
The determination in step (3) of the present invention is made by determination means 1.

In step (3) of the present invention, a read is determined to match the base sequence of an exon or intron contained in the HLA dictionary in any of the following cases (A), (B), and (C).
(A) A contiguous base sequence with a length of 50% or more of the full length of a read acquired in step (1) of the present invention overlaps the base sequence of any exon contained in the HLA dictionary created in step (2) of the present invention, and the two base sequences are identical over the overlapping range.
(B) A contiguous base sequence with a length of 50% or more of the full length of a read acquired in step (1) of the present invention overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) of the present invention, and the two base sequences agree with no more than two mismatched bases over the overlapping range.
(C) A contiguous base sequence with a length of 50% or more of the full length of a read acquired in step (1) of the present invention overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) of the present invention, the two base sequences agree with no more than two mismatched bases over the overlapping range, and the remaining portion of the read that does not match the intron exactly matches the base sequence of an exon contained in the HLA dictionary that is adjacent to the intron.
When any of these is satisfied, the read is determined to match the alleles containing the exon or intron that corresponds to the ID assigned to that exon or intron base sequence.
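Conditions (A) and (B) can be illustrated with a brute-force check over every placement of a read on a reference sequence. This is a minimal sketch under stated assumptions, not the procedure of the patent, which relies on mapping software and N-padded references: the names `matches_region`, `matches_exon`, and `matches_intron` are hypothetical, the 50% threshold is rounded up, and only substitutions are counted as mismatches. Condition (C) is not covered.

```python
import math


def matches_region(read, region, max_mismatches, min_fraction=0.5):
    """Return True if some placement of `read` on `region` overlaps by at
    least `min_fraction` of the read length with at most `max_mismatches`
    substitutions inside the overlap."""
    min_overlap = math.ceil(len(read) * min_fraction)
    # offset = position of read[0] in region coordinates (may be negative,
    # which means the read hangs over the left end of the region).
    first = -(len(read) - min_overlap)
    last = len(region) - min_overlap
    for offset in range(first, last + 1):
        start = max(offset, 0)
        end = min(offset + len(read), len(region))
        if end - start < min_overlap:
            continue
        read_part = read[start - offset:end - offset]
        region_part = region[start:end]
        mismatches = sum(a != b for a, b in zip(read_part, region_part))
        if mismatches <= max_mismatches:
            return True
    return False


def matches_exon(read, exon):
    # Condition (A): exact agreement over the overlapping range.
    return matches_region(read, exon, max_mismatches=0)


def matches_intron(read, intron):
    # Condition (B): up to two mismatched bases over the overlapping range.
    return matches_region(read, intron, max_mismatches=2)


# Toy sequences for illustration only.
print(matches_exon("AAACGT", "CGTTTT"))        # True: last 3 bases overlap exactly
print(matches_exon("ACGTAC", "GGACGAACGG"))    # False: no exact placement
print(matches_intron("ACGTAC", "GGACGAACGG"))  # True: one mismatch at full overlap
```

Allowing the read to hang over either end of the region plays the same role here as the N sequences added to the dictionary entries.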

Determination method 1 of the present invention includes a step of calculating the weight of the read for each HLA gene that contains an allele determined in step (3) of the present invention to match the read (hereinafter, step (4) of the present invention).
The calculation in step (4) of the present invention is performed by calculation means 1.

In step (4) of the present invention, when the exons or introns that a read matches are determined, the read may match exons or introns belonging to several different HLA genes. For example, a read may match the first exon of HLA-A and also the first exon of HLA-B. In that case, the read is weighted for each gene according to the number of alleles it matched at each exon. The weight of a read for each HLA gene is calculated from the correspondence between the exon base sequences that the read matches and the alleles containing those base sequences. In step (4) of the present invention, the weight of the read for each HLA gene can be calculated according to the following formula when the reads acquired in step (1) of the present invention are single-end reads,

and, when the reads acquired in step (1) of the present invention are paired-end reads, additionally according to the following formula.


Determination method 1 of the present invention includes a step of calculating the scores of allele pairs of the HLA gene based on the read weights calculated in step (4) of the present invention (hereinafter, step (5) of the present invention).
The calculation in step (5) of the present invention is performed by calculation means 2.

In step (5) of the present invention, the score of an allele pair of an HLA gene is calculated from the read weights according to the following formula.

The set T of exons used to compare scores between allele pairs must consist of exons whose base sequences are available for all of the alleles being compared. For Class I HLA genes these are the second and third exons, and for Class II HLA genes the second exon. Taking T to be these exons allows scores to be obtained under identical conditions for all allele pairs.

Determination method 1 of the present invention includes a step of searching for the allele pair that maximizes the score calculated in step (5) of the present invention and determining it to be the allele pair of the subject's HLA gene (hereinafter, step (6) of the present invention).
The determination in step (6) of the present invention is made by determination means 2.

In step (6) of the present invention, the allele pair of the subject's HLA gene is determined according to the following formula,

by searching for the allele pair that maximizes the score calculated in step (5) of the present invention and determining it to be the allele pair of the subject's HLA gene. However, when the allele pair that maximizes the score satisfies

the homozygous genotype of allele A, from the score-maximizing allele pair A, B, is determined to be the allele pair of the subject's HLA gene.

When two or more allele pairs attain the same maximum score in step (6) of the present invention, the present invention provides a method for determining the allele pair of a subject's HLA gene (hereinafter, determination method 2 of the present invention) in which determination method 1 of the present invention further comprises the following steps:
(7) a step of adding to the set T of step (5) of the present invention the exons that the HLA dictionary contains for all of the alleles in those allele pairs, and calculating the scores of the allele pairs of the HLA gene again (hereinafter, step (7) of the present invention); and
(8) a step of determining the allele pair that maximizes the score calculated in step (7) of the present invention to be the allele pair of the subject's HLA gene (hereinafter, step (8) of the present invention).
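The tie-resolution loop of steps (7) and (8) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `resolve_ties`, the layout of `registered_exons`, and the `score` callback are hypothetical. In particular, `score(pair, exon_set)` stands in for the score of step (5), whose formula is not reproduced here, and the toy `support` table exists only to make the example runnable.

```python
def resolve_ties(candidates, registered_exons, initial_exons, score):
    """Repeatedly extend the exon set T with exons registered for all
    remaining candidate alleles, rescore, and keep the top-scoring pairs.

    candidates: list of (allele_a, allele_b) pairs tied at the maximum score
    registered_exons: allele -> set of exon names present in the dictionary
    score: hypothetical callable (pair, exon_set) -> number
    """
    exon_set = set(initial_exons)
    candidates = list(candidates)
    while len(candidates) > 1:
        alleles = {allele for pair in candidates for allele in pair}
        shared = set.intersection(*(set(registered_exons[a]) for a in alleles))
        extra = shared - exon_set
        if not extra:
            break  # nothing left to add; the tie cannot be resolved this way
        exon_set |= extra
        scores = {pair: score(pair, exon_set) for pair in candidates}
        best = max(scores.values())
        candidates = [pair for pair in candidates if scores[pair] == best]
    return candidates, exon_set


# Toy data for illustration only: per-exon read support for each pair.
registered = {a: {"Exon2", "Exon3", "Exon4"} for a in ("A1", "A2", "A3")}
support = {
    ("A1", "A2"): {"Exon2": 5, "Exon3": 5, "Exon4": 3},
    ("A1", "A3"): {"Exon2": 5, "Exon3": 5, "Exon4": 1},
}


def toy_score(pair, exons):
    return sum(support[pair][e] for e in exons)


winners, final_exons = resolve_ties(
    [("A1", "A2"), ("A1", "A3")], registered, {"Exon2", "Exon3"}, toy_score
)
print(winners)  # [('A1', 'A2')]
```

If the loop exits with more than one candidate, no shared exon remains to be added, which is the situation handled by the frequency-based rule described next.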

Furthermore, when two or more allele pairs attain the same maximum score in step (6) of the present invention and the HLA dictionary contains no exon common to all of the alleles in those allele pairs, the present invention provides a method for determining the allele pair of a subject's HLA gene (hereinafter, determination method 3 of the present invention) in which determination method 1 of the present invention further comprises the following step:
(7') a step of determining, among the two or more allele pairs having the same maximum score obtained in step (6) of the present invention, the allele pair with the highest product of the frequencies of its two alleles to be the allele pair of the subject's HLA gene, using HLA allele frequency data registered in a known database (hereinafter, step (7') of the present invention).

In step (7') of the present invention, the known database is not particularly limited as long as HLA allele frequency data are registered in it. A representative example is the Allele Frequency Net Database.
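The frequency-based rule of step (7') amounts to choosing the tied pair with the largest product of allele frequencies. The sketch below assumes a plain dictionary of frequencies. The function name `pick_by_frequency`, the placeholder allele names, and the frequency values are made up for illustration and are not taken from any database, and treating an allele that is absent from the table as having frequency 0 is an assumption of the sketch.

```python
def pick_by_frequency(tied_pairs, allele_frequency):
    """Return the tied allele pair whose two allele frequencies have the
    largest product. Alleles missing from the table count as frequency 0
    (an assumption of this sketch)."""
    def frequency_product(pair):
        a, b = pair
        return allele_frequency.get(a, 0.0) * allele_frequency.get(b, 0.0)

    return max(tied_pairs, key=frequency_product)


# Made-up frequencies for illustration only.
frequencies = {"A1": 0.20, "A2": 0.10, "A3": 0.05}
tied = [("A1", "A3"), ("A1", "A2")]
print(pick_by_frequency(tied, frequencies))  # ('A1', 'A2')
```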

A program for determining the allele pair of a subject's HLA gene, which causes a computer to carry out determination method 1, 2, or 3 of the present invention described above, is also one aspect of the present invention. Accordingly, the present invention also provides a program for determining the allele pair of a subject's HLA gene (hereinafter, the program of the present invention) that causes a computer to function as:
receiving means 1, which receives partial base sequences (reads) of HLA genes in a biological sample obtained from the subject;
receiving means 2, which receives an HLA dictionary created by acquiring the base sequence information of the exons and introns of known HLA alleles and recording the correspondence between that base sequence information and the exons and introns of the HLA alleles;
determination means 1, which determines, when a received read matches the base sequence of any exon or intron contained in the received HLA dictionary, that the read matches the alleles containing that base sequence;
calculation means 1, which calculates the weight of the read for each HLA gene that contains an allele determined to match the read;
calculation means 2, which calculates the scores of allele pairs of the HLA gene based on the calculated read weights;
determination means 2, which searches for the allele pair that maximizes the calculated score and determines it to be the allele pair of the subject's HLA gene; and
output means, which outputs the determined allele pair.

More specifically, the program of the present invention is a program for determining the allele pair of a subject's HLA gene that causes a computer to function as:
receiving means 1, which receives partial base sequence information (reads) of HLA genes in a biological sample obtained from the subject;
receiving means 2, which receives an HLA dictionary created by acquiring the base sequence information of the exons and introns of known HLA alleles, assigning IDs to that base sequence information, and recording the correspondence between the IDs and the exons and introns of the HLA alleles;
determination means 1, which performs the following (A), (B), and (C):
(A) when a contiguous base sequence with a length of 50% or more of the full length of an acquired read overlaps the base sequence of any exon contained in the created HLA dictionary, and the two base sequences are identical over the overlapping range, determining that the read matches the alleles containing the exon that corresponds to the ID assigned to that exon base sequence;
(B) when a contiguous base sequence with a length of 50% or more of the full length of an acquired read overlaps the base sequence of any intron contained in the created HLA dictionary, and the two base sequences agree with no more than two mismatched bases over the overlapping range, determining that the read matches the alleles containing the intron that corresponds to the ID assigned to that intron base sequence;
(C) when a contiguous base sequence with a length of 50% or more of the full length of an acquired read overlaps the base sequence of any intron contained in the created HLA dictionary, the two base sequences agree with no more than two mismatched bases over the overlapping range, and the remaining portion of the read that does not match the intron overlaps the base sequence of any exon contained in the HLA dictionary that is adjacent to the intron and is identical to it over the overlapping range, determining that the read matches the alleles containing the exon that corresponds to the ID assigned to that exon base sequence;
when the acquired reads are single-end reads,

calculation means 1, which, further in accordance with the formula above, calculates the weight of the read for each HLA gene that contains an allele determined to match the read;

calculation means 2, which, in accordance with the formula above, calculates the scores of allele pairs of the HLA gene;

determination means 2 (where, when the condition above is satisfied, the homozygous genotype of allele A, from the score-maximizing allele pair A, B, is determined to be the allele pair of the subject's HLA gene); and
output means, which outputs the determined allele pair.

When two or more allele pairs examined by determination means 2 attain the same maximum score, the program of the present invention further includes the following means:
calculation means 2', which adds to the set T used by calculation means 2 the exons that the HLA dictionary contains for all of the alleles in those allele pairs, and calculates the scores of the allele pairs of the HLA gene again;
determination means 2, which determines the allele pair that maximizes the score calculated by calculation means 2' to be the allele pair of the subject's HLA gene; and
output means, which outputs the determined allele pair.

Furthermore, when two or more allele pairs attain the same maximum score in determination means 2 and the HLA dictionary contains no exon common to all of the alleles in those allele pairs, the program of the present invention further includes the following means:
determination means 2', which determines, among the two or more allele pairs having the same maximum score obtained by determination means 2, the allele pair with the highest product of the frequencies of its two alleles to be the allele pair of the subject's HLA gene, using HLA allele frequency data registered in a known database; and
output means, which outputs the determined allele pair.

FIG. 1 shows an example of a device for determining the allele pair of a subject's HLA gene, in which the program of the present invention is used. The device consists of a device 1 that analyzes the base sequences of DNA fragments in a biological sample obtained from the subject (hereinafter, read analysis device 1), a computer 2, and a cable 3 connecting them. The read sequence data produced by read analysis device 1 can be stored in the non-volatile storage device (hard disk) of computer 2 via cable 3. Read analysis device 1 does not have to be connected to computer 2. In that case, read sequence data recorded on a portable recording medium are input to computer 2 and stored on the hard disk. Read sequence data analyzed at other research institutions can also be obtained by downloading them over the Internet from public databases such as NCBI and EMBL, and stored on the hard disk.
When a loaded read matches the base sequence of any exon or intron contained in the separately created HLA dictionary, computer 2 determines that the read matches the alleles containing that base sequence, calculates the weight of the read for each HLA gene that contains an allele determined to match the read, calculates the scores of allele pairs of the HLA gene based on the calculated read weights, and determines the allele pair that maximizes the calculated score to be the allele pair of the subject's HLA gene.

The program of the present invention can implement the determination methods of the present invention described above in cooperation with a computer 2 that has a central processing unit, a storage unit, a reader for recording media such as compact discs and floppy (registered trademark) disks, an operation input unit such as a keyboard, and an output unit such as a display. FIG. 2 shows a more specific example of a computer system for carrying out the above methods.

The computer 2 shown in FIG. 2 consists mainly of a main body 110, a display 120, and an operation input unit 130. The main body 110 consists mainly of a CPU 110a, a RAM 110b, a hard disk 110c, a reading device 110d, an input/output interface 110e, and an image output interface 110f. The CPU 110a, RAM 110b, hard disk 110c, reading device 110d, input/output interface 110e, and image output interface 110f are connected by a bus 110g so that they can exchange data.

CPU110aは、RAM110bにロードされたコンピュータプログラムを実行することが可能である。 CPU110a can execute computer programs loaded in RAM110b.

RAM110bは、SRAMまたはDRAMなどによって構成されている。RAM110bは、ハードディスク110cに記録されているコンピュータプログラムの読み出しに用いられる。また、これらのコンピュータプログラムを実行するときに、CPU110aの作業領域として利用される。 RAM110b is composed of SRAM, DRAM, and the like. The RAM 110b is used to read the computer program recorded on the hard disk 110c. It is also used as a work area for the CPU 110a when executing these computer programs.

The hard disk 110c in the present embodiment stores various computer programs to be executed by the CPU 110a, such as an operating system and application programs, together with the data used to execute them. The hard disk 110c also stores: the read base sequence data analyzed by the read analysis device 1; the HLA dictionary data; the conditions for determining whether an acquired read matches the base sequence information of any exon or intron contained in the HLA dictionary; the formula for calculating the weight of a read for the HLA gene containing the alleles determined to match the read; the formula for calculating the score of an allele pair of the HLA gene; the criterion for determining the allele pair that maximizes the calculated score as the allele pair of the subject's HLA gene; and HLA allele frequency information obtained from a known database. The application program 140a described later is also installed on the hard disk 110c.

The reading device 110d is composed of a flexible disk drive, a CD-ROM drive, a DVD-ROM drive, or the like, and can read computer programs or data recorded on a portable recording medium 140. The portable recording medium 140 stores an application program 140a for causing a computer to execute the method of the present embodiment; the CPU 110a can read the application program 140a according to the present invention from the portable recording medium 140 and install it on the hard disk 110c.

The application program 140a can be provided not only on the portable recording medium 140 but also, through a telecommunication line (wired or wireless), from an external device communicably connected to the computer main body 110 by that line. For example, the application program 140a may be stored on the hard disk of a server computer on the Internet, and the main body 110 can access the server computer, download the application program, and install it on the hard disk 110c.

An operating system that provides a user interface environment, such as Windows (registered trademark) manufactured and sold by Microsoft Corporation (USA), is also installed on the hard disk 110c. In the following description, the application program 140a according to the present embodiment is assumed to run on this operating system.

The input/output interface 110e is composed of, for example, serial interfaces such as USB, IEEE 1394, and RS-232C; parallel interfaces such as SCSI, IDE, and IEEE 1284; and an analog interface comprising a D/A converter, an A/D converter, and the like. The read analysis device 1 is connected to the input/output interface 110e via the cable 3, so that the read base sequence data produced by the read analysis device 1 can be stored on the hard disk 110c of the computer main body 110. The operation input unit 130, consisting of a keyboard, is also connected to the input/output interface 110e; by entering operating instructions through the operation input unit 130, the user can instruct the computer main body 110 to store data and to execute the application program 140a. These operations can also be performed, through a telecommunication line (wired or wireless), from the operation input unit of an external computer communicably connected to the computer main body 110 by that line.

The image output interface 110f is connected to the display 120, which is composed of an LCD, a CRT, or the like, and outputs to the display 120 a video signal corresponding to the image data supplied by the CPU 110a. The display 120 displays an image (screen) according to the input video signal. Besides being connected directly to the main body 110, the image output interface can also output through a telecommunication line (wired or wireless) by connecting to an external device communicably connected to the computer main body 110 by that line.

FIG. 3 shows a flowchart of a more specific operation of the computer 2 as means implemented by the program of the present invention.
When the read analysis device 1 has analyzed the base sequences of the DNA fragments in the biological sample obtained from the subject, it outputs the analyzed base sequence data (hereinafter, "read analysis data"), and the output data is stored on the hard disk 110c of the computer 2. The user invokes the application program 140a through the operation input unit 130. On a command from the application, the CPU 110a reads the stored read analysis data from the hard disk 110c and stores it in the RAM 110b (step S1).

Next, the CPU 110a reads, from the hard disk 110c, (i) information describing the IDs assigned to each identical base sequence in each exon and each intron on the basis of the base sequence of each HLA allele, together with the relationship between the IDs and the alleles (hereinafter, "HLA dictionary"), (ii) information on the exon set T at the initial step for each HLA gene, and (iii) HLA allele frequency information obtained from a known database (step S2).

Next, when the base sequence of the read analysis data stored in the RAM 110b matches the base sequence of any exon or intron contained in the HLA dictionary, the CPU 110a determines that the read matches the alleles containing that base sequence (step S3).

Next, the CPU 110a calculates the weight of the read for the HLA gene containing the alleles that matched the read (step S4).

Next, the CPU 110a calculates the score of each allele pair over the exon set T from the read weights (step S5).

Next, the CPU 110a determines whether two or more allele pairs maximize the score (step S6).

If step S6 finds exactly one allele pair that maximizes the score, the CPU 110a determines that allele pair as the allele pair of the subject's HLA gene (step S7).

If step S6 finds two or more allele pairs that maximize the score (that is, two or more allele pairs share the same maximum score), the CPU 110a determines whether any exon commonly contained in the HLA dictionary for the score-maximizing allele pairs (hereinafter, "common exon") exists outside the exon set T (step S8).

If step S8 finds a common exon, the common exon is added to the exon set T used for the allele pair score calculation (step S9), and processing returns to step S5.

If step S8 finds no common exon, then among the score-maximizing allele pairs, the one with the highest product of the frequencies of its two alleles, according to HLA allele frequency data registered in a known database, is determined as the allele pair of the subject's HLA gene (step S10).

Finally, the result determined in step S7 or step S10 is stored in the RAM 110b and displayed on the display 120 of the computer via the image output interface 110f (step S11).

In the present embodiment, the read analysis data is stored on the hard disk 110c from the read analysis device 1 via the input/output interface 110e, but the invention is not limited to this. For example, read analysis data obtained with a read analysis device independent of the computer 2 can be written to a portable recording medium and then stored on the hard disk 110c through the reading device 110d. Read base sequence data analyzed by other research institutes can also be obtained by downloading from public databases such as NCBI and EMBL via the Internet and stored on the hard disk 110c.

The present invention is described more specifically below with reference to Examples, but it is clearly not limited to them.

1. Creating the HLA dictionary
The base sequence information of the HLA alleles registered in the IPD-IMGT/HLA database, provided by the European Bioinformatics Institute (EMBL-EBI), is registered in the HLA dictionary. Accuracy can be expected to improve by additionally registering the 9 HLA-U and 7 HLA-DQB2 sequences registered in NCBI Reference Sequences (RefSeq) and the pseudogenes newly identified by the Center for Genomic Medicine, Kyoto University. The user is free to add to and update the alleles registered in the HLA dictionary. Each registered base sequence is divided into the base sequences of its exons and introns, an ID is assigned to each distinct base sequence within the same exon or the same intron, and a correspondence table between IDs and alleles is created. As off-target regions, N sequences half the maximum read length are added to both sides of every registered exon and intron base sequence. For the set of base sequences of a given exon, the frequency of each sequence length is calculated (counted by number of alleles, not number of IDs), and alleles whose exon sequence length has a frequency of 1% or less are excluded from the typing targets (this exclusion is optional).
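The ID assignment and N padding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `build_hla_dictionary`, the input layout (allele name mapped to region name mapped to sequence), and the `region:number` ID format are all hypothetical, and region names are assumed to be already qualified by gene. The optional 1% length filter is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_hla_dictionary(
    allele_regions: Dict[str, Dict[str, str]],
    max_read_length: int,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Assign one ID per distinct sequence within each exon/intron.

    Returns (ID -> N-padded sequence, ID -> alleles carrying that sequence).
    """
    pad = "N" * (max_read_length // 2)            # off-target flank, half the max read length
    ids: Dict[Tuple[str, str], str] = {}          # (region, sequence) -> ID
    counters: Dict[str, int] = defaultdict(int)   # per-region running number
    padded: Dict[str, str] = {}
    id_to_alleles: Dict[str, List[str]] = defaultdict(list)

    for allele in sorted(allele_regions):
        for region, seq in sorted(allele_regions[allele].items()):
            key = (region, seq)
            if key not in ids:                    # first time this sequence is seen in this region
                counters[region] += 1
                ids[key] = f"{region}:{counters[region]}"   # hypothetical ID format
                padded[ids[key]] = pad + seq + pad
            id_to_alleles[ids[key]].append(allele)
    return padded, dict(id_to_alleles)
```

Alleles that share an identical exon sequence end up under the same ID, so a read matching that ID is treated as matching all of those alleles at once.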

2. Analysis of HLA gene base sequences using a next-generation sequencer
The base sequences of the sample's HLA genes are analyzed using a next-generation sequencer. Each read carries base sequence information 50 to 300 bases long from a random position inside or outside the sample's HLA genes. It is desirable to obtain enough data that, on average, 100 or more reads (50 or more per allele) contain the base sequence at any given position on an exon. Reads may be single-end or paired-end, but including paired-end reads can be expected to improve the accuracy of HLA typing of the sample.

3. Mapping to the HLA dictionary
Each read is mapped to the base sequences in the HLA dictionary using bowtie2 (GNU Free Documentation License software). Mapping is run with options that exclude the N sequences from the score and report every alignment in which 50% or more of the read's full length falls on an exon or intron region. For example, it is desirable to run the mapping with the following options:
'bowtie2 --n-ceil L,0,0.5 --score-min L,-1.0,0 -a'
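As a rough illustration of how the options quoted above might be combined with input and output files, the sketch below assembles a bowtie2 argument list. The function name and all file names are hypothetical, and an index built from the HLA dictionary sequences is assumed to already exist. The flags `-x`, `-U`, `-1`, `-2` and `-S` are standard bowtie2 options, and `--score-min` is written with bowtie2's documented double-dash spelling.

```python
from typing import List, Optional


def bowtie2_command(
    index_prefix: str,
    sam_out: str,
    reads_1: str,
    reads_2: Optional[str] = None,
) -> List[str]:
    """Build a bowtie2 command line with the mapping options from the text."""
    cmd = [
        "bowtie2",
        "--n-ceil", "L,0,0.5",       # allow N positions up to half the read length
        "--score-min", "L,-1.0,0",   # minimum alignment score of -1, independent of read length
        "-a",                        # report all alignments
        "-x", index_prefix,          # index built from the HLA dictionary (assumed to exist)
    ]
    if reads_2 is None:
        cmd += ["-U", reads_1]                   # single-end reads
    else:
        cmd += ["-1", reads_1, "-2", reads_2]    # paired-end reads
    cmd += ["-S", sam_out]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where bowtie2 is installed.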

4. Examining the mapping results
The mapping results are loaded, and the correspondence between reads and the exon or intron base sequences they match is examined. Specifically, a read is determined to match the alleles containing an exon or intron if a contiguous base sequence spanning 50% or more of the read's full length either overlaps the base sequence of an exon registered in the HLA dictionary and matches it exactly over the overlap, or overlaps the base sequence of an intron registered in the HLA dictionary and matches it with at most 2 mismatched bases over the overlap. Furthermore, when the remaining, unmatched part of a read that matched an intron extends beyond that intron, it is checked whether this partial base sequence overlaps the base sequence of an exon adjacent to the intron and matches it exactly over the overlap; if so, the read is determined to match the alleles containing that exon (FIG. 4).
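The exon and intron matching conditions above can be sketched as a single overlap check. This is a simplified illustration, not the patent's code: the function names and the `offset` convention (position of the read's first base relative to the start of the region, negative if the read starts before it) are assumptions, and the additional rule for a read that spills from an intron into an adjacent exon is not covered.

```python
def region_match(read: str, region: str, offset: int, max_mismatches: int) -> bool:
    """True if >= 50% of the read overlaps the region with few enough mismatches."""
    start = max(offset, 0)                        # first overlapping position in the region
    end = min(offset + len(read), len(region))    # one past the last overlapping position
    overlap = end - start
    if overlap <= 0 or overlap * 2 < len(read):   # less than half of the read on the region
        return False
    mismatches = sum(
        1 for pos in range(start, end) if read[pos - offset] != region[pos]
    )
    return mismatches <= max_mismatches


def matches_exon(read: str, exon: str, offset: int) -> bool:
    return region_match(read, exon, offset, max_mismatches=0)    # exact match required


def matches_intron(read: str, intron: str, offset: int) -> bool:
    return region_match(read, intron, offset, max_mismatches=2)  # up to 2 mismatches allowed
```

In practice the offset and mismatch positions would come from the aligner output rather than being supplied by hand.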
Some reads match multiple exons or introns belonging to different HLA genes. In that case, each gene is weighted according to the number of matching alleles in each exon. When the read is a single-end read, the weight of the read for each gene is calculated according to the following formula

and, when the read is a paired-end read, additionally according to the following formula

For example, as shown in FIG. 5, when a single-end read r matches four alleles in exon X₁ on gene G₁ and one allele each in exons X₂ and X₃ on gene G₂, the weight of read r for gene G₁ is expressed by the following formula.

The weight of read r for gene G₂ is expressed by the following formula.

5. Allele pair score calculation
Using the reads weighted as described above, all allele pairs of the sample's HLA gene are scored. The score for an allele pair A, B is given by the following formula.

Here, S(R) is given by the following formula.

S(R^p) is given by the following formula.

6. Allele pair search
As in the following formula, the allele pair that maximizes the score obtained as described above is determined to be the allele pair of the sample's HLA gene.

However, when the score-maximizing allele pair satisfies the following formula, the homozygote of allele A in the score-maximizing allele pair A, B is determined to be the allele pair of the sample's HLA gene.

In the initial search, the set T of exons used for the allele pair score calculation is restricted to exons whose base sequence is available for all alleles, so that the scores of all allele pairs can be compared. Specifically, the initial set T consists of exons 2 and 3 for Class I HLA genes and exon 2 for Class II HLA genes (these exons correspond to the region called the G-DOMAIN). The set is likewise defined in advance for other genes.

The allele pair is searched for by the following procedure (FIG. 3). Alleles are output at 6-digit resolution.
1. With the initial exon set T, search for the allele pair A, B that maximizes the score.
2. If the resulting allele pair is unique, the search ends. If several candidate allele pairs share the same maximum score, the exons whose base sequences are registered for all alleles in those candidate pairs are added to the set T (if there are several such common exons, all are added), thereby extending T. If no exon can be added, the search ends.
3. With the extended set T, search again for the allele pair that maximizes the score. The allele pairs searched are limited to pairs of the remaining candidate alleles.
4. Repeat steps 2 and 3 as long as the set T can be extended or until the allele pair is uniquely determined.
5. If a single allele pair still cannot be determined, use the frequency data registered in the Allele Frequency Net Database, a known database published on the web, to select the pair with the highest product of its alleles' frequencies (all final candidates are recorded in the result data). If this still does not yield a single allele pair, all remaining candidates are output.
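The five-step search above can be sketched as a loop. This is a minimal sketch under stated assumptions rather than the actual HLA-HD code: the scoring formula is not reproduced in this text, so `score` is a hypothetical caller-supplied function taking an allele pair and the current exon set; `registered_exons` (allele to set of exons with a registered sequence) and `frequency` (allele to population frequency) are assumed inputs; alleles missing from `frequency` are treated as frequency 0; and the homozygote condition is omitted.

```python
from itertools import combinations_with_replacement
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def search_allele_pair(
    alleles: Iterable[str],
    initial_exons: Set[str],
    registered_exons: Dict[str, Set[str]],
    score: Callable[[Pair, FrozenSet[str]], float],   # hypothetical stand-in for the score
    frequency: Dict[str, float],
) -> List[Pair]:
    """Return the selected allele pair(s); more than one only if ties remain."""
    exon_set = set(initial_exons)
    candidates = sorted(set(alleles))
    while True:
        pairs = list(combinations_with_replacement(candidates, 2))
        scores = {pair: score(pair, frozenset(exon_set)) for pair in pairs}
        best = max(scores.values())
        tied = [pair for pair in pairs if scores[pair] == best]
        if len(tied) == 1:                            # unique maximum: done
            return tied
        # Keep only alleles that still appear in a top-scoring pair.
        candidates = sorted({a for pair in tied for a in pair})
        common = set.intersection(*(registered_exons[a] for a in candidates))
        new_exons = common - exon_set
        if not new_exons:                             # T cannot be extended: use frequencies
            product = {
                p: frequency.get(p[0], 0.0) * frequency.get(p[1], 0.0) for p in tied
            }
            top = max(product.values())
            return [p for p in tied if product[p] == top]
        exon_set |= new_exons                         # extend T and rescore
```

Each pass either returns or strictly enlarges the exon set, so the loop terminates once no further common exon is available.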

7. Accuracy verification using 1000 Genomes Project data
The method of the present invention (hereinafter, HLA-HD) was applied to the whole-exome sequencing data (hereinafter, WES data) of 62 samples from the 1000 Genomes Project (The 1000 Genomes Project Consortium. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56-65.), and the concordance rate was measured against the PCR-SBT typing results for 10 of these samples, 100 alleles in total (Set A), by Liu, C. et al. (Liu, C. et al. (2013) ATHLATES: accurate typing of human leukocyte antigen through exome sequencing. Nucleic Acids Res, 41, e142) and against the PCR-SSOP typing results for 51 samples, 600 alleles in total (Set B), by de Bakker, P.I. et al. (de Bakker, P.I. et al. (2006) A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet, 38, 1166-1172.) and Erlich, R.L. et al. (Erlich, R.L. et al. (2011) Next-generation sequencing for HLA typing of class I loci. BMC Genomics, 12, 42). HLA-HD can determine alleles at 6-digit resolution, whereas the conventional methods determine alleles at 4-digit resolution at most, so the HLA-HD results were reduced to 4-digit or 2-digit resolution for the comparison.
As a result, the HLA-HD typing results agreed completely with the PCR-SBT typing results for the 10 samples, 100 alleles in total (Table 1).
For the 51 samples, 600 alleles in total, typed by PCR-SSOP, 54 alleles did not agree: 17 in Class I and 37 in Class II (Table 2). Close examination of these discordant alleles showed that, in most cases, the cause was a difference in the base sequence of exons not covered by PCR-SSOP. For the samples with discordant results, the WES mapping results were therefore inspected to verify whether the HLA-HD call or the PCR-SSOP call was correct. As a result, 39 alleles supported the HLA-HD call and 5 alleles could not be resolved (Table 2). The remaining 10 alleles had low coverage, and HLA-HD was considered not to have worked well for them. When the concordance rate was recalculated to reflect these findings (corrected data), accuracy improved, and the Class II concordance rate in particular reached 100% (Table 1, rows for the corrected data).
Next, accuracy was compared with other methods that, like HLA-HD, use a next-generation sequencer. The methods compared were HLAreporter (Huang, Y. et al. (2015) HLAreporter: a tool for HLA typing from next generation sequencing data. Genome Med, 7, 25.), OptiType (Szolek, A. et al. (2014) OptiType: precision HLA typing from next-generation sequencing data. Bioinformatics, 30, 3310-3316.), and PHLAT (Bai, Y. et al. (2014) Inference of high resolution HLA types using genome-wide RNA or DNA sequencing reads. BMC Genomics, 15, 325.). Because HLAreporter does not type low-coverage data, it yielded typing results for only some of the samples in Set B. OptiType supports only the Class I genes A, B, and C and therefore cannot type Class II. For PHLAT, only the Set A results have been published. In Set A, all methods except PHLAT agreed completely (Table 1). In Set B, the alleles called by HLAreporter and OptiType both had higher concordance with the PCR-SSOP calls than HLA-HD did; however, against the data in which the typing results were corrected from the WES mapping results as described above, the concordance of HLA-HD in Class I was equivalent to that of OptiType and higher than that of HLAreporter. In Class II as well, the ranking of HLA-HD and HLAreporter reversed between the uncorrected and corrected data. This result arises because both OptiType and HLAreporter, like the conventional methods, are limited in the range of alleles they can type.
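The reduction of 6-digit calls to 4-digit or 2-digit resolution used in the comparison above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, allele names are assumed to follow the colon-delimited form such as `A*02:01:01` (two digits per field), and `concordance` assumes the two lists are already aligned allele by allele, which sidesteps the question of how the two alleles of a genotype are ordered.

```python
from typing import Sequence


def truncate_resolution(allele: str, digits: int) -> str:
    """Reduce an allele name to the given resolution (2, 4 or 6 digits)."""
    if digits < 2 or digits % 2:
        raise ValueError("digits must be a positive even number")
    gene, _, fields = allele.partition("*")
    kept = fields.split(":")[: digits // 2]      # one colon-separated field per two digits
    return f"{gene}*{':'.join(kept)}"


def concordance(called: Sequence[str], reference: Sequence[str], digits: int) -> float:
    """Fraction of positions where both calls agree at the given resolution."""
    hits = sum(
        truncate_resolution(c, digits) == truncate_resolution(r, digits)
        for c, r in zip(called, reference)
    )
    return hits / len(reference)
```

Truncating both sides before comparing means a 6-digit call is never penalized for carrying more detail than a 4-digit reference.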

Combining the method of the present invention with robotic HLA gene sequencing enables detailed HLA analysis of large sample sets. The method of the present invention can also be incorporated into integrated bioinformatics analysis software. Furthermore, the method can be used for safer compatibility testing and for securing more donors in organ transplantation and future iPS therapy.

1 Read analysis device
2 Computer
3 Cable
110 Main body
110a CPU
110b RAM
110c Hard disk
110d Reading device
110e Input/output interface
110f Image output interface
110g Bus
120 Display
130 Operation input unit
140 Portable recording medium
140a Application program

Claims

A method for determining an allele pair of a subject's HLA gene, comprising the following steps:
(1) a step of acquiring partial base sequence information (reads) of the HLA gene in a biological sample obtained from the subject;
(2) a step of acquiring the base sequence information of the exons and introns of known HLA alleles, assigning IDs to the base sequence information, and creating an HLA dictionary that records the correspondence between the IDs and the exons and introns of the HLA alleles;
(3) a step of performing the following (A), (B), and (C):
(A) when a contiguous base sequence spanning 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any exon contained in the HLA dictionary created in step (2) and the two base sequences match exactly over the overlap, determining that the read matches the alleles containing the exon corresponding to the ID assigned to the base sequence of that exon;
(B) when a contiguous base sequence spanning 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) and the two base sequences match with at most 2 mismatched bases over the overlap, determining that the read matches the alleles containing the intron corresponding to the ID assigned to the base sequence of that intron;
(C) when a contiguous base sequence spanning 50% or more of the full length of a read acquired in step (1) overlaps the base sequence of any intron contained in the HLA dictionary created in step (2) and the two base sequences match with at most 2 mismatched bases over the overlap, and the remaining base sequence of the read that does not match the intron overlaps the base sequence of any exon contained in the HLA dictionary that is adjacent to the intron and matches it exactly over the overlap, determining that the read matches the alleles containing the exon corresponding to the ID assigned to the base sequence of that exon;
(4) When the read acquired in step (1) is a single-end read, according to the following formula,

and when the read acquired in step (1) is a paired-end read, according to the following formula,

a step of calculating the weight of the read with respect to the HLA gene containing the allele determined to match the read in step (3);
(5) A step of calculating the score of each allele pair of the HLA gene according to the following formula; and

(6) A step of determining, according to the following formula,

the allele pair that maximizes the score calculated in step (5) as the allele pair of the subject's HLA gene (provided that, when the following condition is satisfied,

the homozygote of allele A, where A and B are the alleles of the score-maximizing allele pair, is determined as the allele pair of the subject's HLA gene).
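
The overlap rules (A) and (B) of step (3) can be illustrated with a minimal sketch. It assumes reads and dictionary sequences are plain strings, tries every ungapped alignment offset, and accepts an offset when the overlap covers at least 50% of the read and stays within the allowed mismatch count (0 for exons, 2 for introns). The function names `overlaps_feature`, `matches_exon` and `matches_intron` are hypothetical, rule (C) is not covered, and the weight and score formulas of steps (4) to (6) are not reproduced here.

```python
import math


def overlaps_feature(read: str, feature: str, max_mismatches: int,
                     min_fraction: float = 0.5) -> bool:
    """Return True if some ungapped alignment of `read` against `feature`
    overlaps on at least `min_fraction` of the read length with at most
    `max_mismatches` mismatched bases inside the overlap."""
    min_overlap = math.ceil(len(read) * min_fraction)
    # offset = position of read[0] relative to feature[0]; negative values
    # mean the read overhangs the left end of the feature.
    for offset in range(-(len(read) - 1), len(feature)):
        start = max(0, offset)
        end = min(len(feature), offset + len(read))
        if end - start < min_overlap:
            continue
        feature_part = feature[start:end]
        read_part = read[start - offset:end - offset]
        mismatches = sum(1 for f, r in zip(feature_part, read_part) if f != r)
        if mismatches <= max_mismatches:
            return True
    return False


def matches_exon(read: str, exon: str) -> bool:
    # Rule (A): exact match required within the overlap.
    return overlaps_feature(read, exon, max_mismatches=0)


def matches_intron(read: str, intron: str) -> bool:
    # Rule (B): up to 2 mismatched bases tolerated within the overlap.
    return overlaps_feature(read, intron, max_mismatches=2)
```

A read that overhangs the end of an exon can still match under rule (A) as long as the overlapping part is at least half of the read; a practical implementation would use an indexed aligner rather than this brute-force scan.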

The method according to claim 1, wherein the base sequence information of the exons and introns of known HLA alleles recorded in the HLA dictionary created in step (2) is obtained from the IPD-IMGT/HLA database.

The method according to claim 1 or 2, wherein a run of N bases with a length of 1/2 the maximum length of the reads acquired in step (1) is added to both ends of each exon and intron of the known HLA alleles recorded in the HLA dictionary created in step (2).
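
The N-padding of dictionary entries can be sketched as follows. The helper name `pad_with_n` is hypothetical, and rounding down when the maximum read length is odd is an assumption, since the claim only states "1/2".

```python
def pad_with_n(sequence: str, max_read_length: int) -> str:
    """Add a run of N bases, half the maximum read length long,
    to both ends of an exon or intron sequence."""
    pad = "N" * (max_read_length // 2)  # assumption: floor for odd lengths
    return pad + sequence + pad


padded = pad_with_n("ACGT", max_read_length=6)
```

Here `padded` is `"NNNACGTNNN"`. Padding of this kind presumably lets a read that extends past the end of an exon or intron be placed against a reference of sufficient length.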

The method according to any one of claims 1 to 3, wherein the set T of step (5) contains the second and third exons of Class I HLA genes and the second exon of Class II HLA genes.

The method according to any one of claims 1 to 4, further comprising the following steps when the same maximum score is obtained for two or more allele pairs in step (6):
(7) A step of adding, to the set T of step (5), the exons that are recorded in the HLA dictionary for all of the alleles included in those allele pairs, and calculating the scores of the allele pairs of the HLA gene again; and
(8) A step of determining the allele pair that maximizes the score calculated in step (7) as the allele pair of the subject's HLA gene.
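
Steps (7) and (8) can be sketched as below, under stated assumptions: exons are identified by integer numbers, `exons_in_dictionary` maps each allele name to the exon numbers recorded for it, and `score` is a caller-supplied function standing in for the scoring formula of step (5), which is not reproduced here. All names and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

AllelePair = Tuple[str, str]


def common_exons(pairs: List[AllelePair],
                 exons_in_dictionary: Dict[str, Set[int]]) -> Set[int]:
    """Exons recorded in the dictionary for every allele in the tied pairs."""
    alleles = {allele for pair in pairs for allele in pair}
    return set.intersection(*(set(exons_in_dictionary[a]) for a in alleles))


def rescore_ties(tied_pairs: List[AllelePair],
                 exon_set_t: Set[int],
                 exons_in_dictionary: Dict[str, Set[int]],
                 score: Callable[[AllelePair, Set[int]], float]) -> List[AllelePair]:
    """Extend T with the common exons, rescore, and keep the best pair(s)."""
    extended_t = set(exon_set_t) | common_exons(tied_pairs, exons_in_dictionary)
    scores = {pair: score(pair, extended_t) for pair in tied_pairs}
    best = max(scores.values())
    return [pair for pair in tied_pairs if scores[pair] == best]


# Toy data, invented for illustration only.
recorded = {"A*01": {2, 3, 4}, "A*02": {2, 3, 4}, "A*03": {2, 3, 4, 5}}
support = {("A*01", 4): 5, ("A*02", 4): 1, ("A*03", 4): 3}


def toy_score(pair: AllelePair, exons: Set[int]) -> float:
    # Placeholder score: summed read support over the exons in T.
    return sum(support.get((allele, exon), 0) for allele in pair for exon in exons)


tied = [("A*01", "A*02"), ("A*02", "A*03")]
winners = rescore_ties(tied, {2, 3}, recorded, toy_score)
```

In this toy case exon 4 is common to all three alleles, so it is added to T and the tie is resolved in favour of the pair with more support on that exon. If several pairs still tie, `rescore_ties` returns all of them.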

The method according to any one of claims 1 to 4, further comprising the following step when the same maximum score is obtained for two or more allele pairs in step (6) and no exon is recorded in the HLA dictionary for all of the alleles included in those allele pairs:
(7') A step of calculating, for each of the two or more allele pairs having the same maximum score in step (6), the product of the frequencies of the alleles of the pair using allele frequency data of the HLA gene registered in a known database, and determining the allele pair with the highest product as the allele pair of the subject's HLA gene.
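
The frequency-based tie-break of step (7') reduces to picking the pair with the largest product of allele frequencies. The sketch below assumes frequencies are supplied as a plain mapping from allele name to frequency; the function name `pick_by_frequency`, the treatment of missing alleles as frequency 0, and the numeric values are all assumptions made for illustration and are not taken from any database.

```python
from typing import Dict, List, Tuple

AllelePair = Tuple[str, str]


def pick_by_frequency(tied_pairs: List[AllelePair],
                      allele_frequency: Dict[str, float]) -> AllelePair:
    """Return the tied pair whose two allele frequencies have the largest product."""
    def frequency_product(pair: AllelePair) -> float:
        first, second = pair
        # Assumption: an allele absent from the frequency table counts as 0.
        return allele_frequency.get(first, 0.0) * allele_frequency.get(second, 0.0)

    return max(tied_pairs, key=frequency_product)


# Invented frequencies, for illustration only.
frequencies = {"A*24:02": 0.36, "A*02:01": 0.11, "A*11:01": 0.09}
candidates = [("A*24:02", "A*11:01"), ("A*24:02", "A*02:01")]
chosen = pick_by_frequency(candidates, frequencies)
```

With these invented values the products are 0.0324 and 0.0396, so the second candidate is chosen.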

The method according to claim 6, wherein the known database of step (7') is the Allele Frequency Net Database.

Reception means 1 for receiving, using a computer, partial base sequence information (reads) of an HLA gene in a biological sample obtained from a subject;
Reception means 2 for accepting an HLA dictionary created by acquiring the base sequence information of the exons and introns of known HLA alleles, assigning an ID to each base sequence, and recording the correspondence between each ID and the exons and introns of the HLA alleles;
Determination means 1 for performing the following (A), (B) and (C):
(A) When a contiguous stretch of at least 50% of the total length of an acquired read overlaps the base sequence of any exon included in the accepted HLA dictionary, and the two base sequences match completely within the overlapping range, the read is determined to match the allele containing the exon corresponding to the ID assigned to that exon's base sequence;
(B) When a contiguous stretch of at least 50% of the total length of an acquired read overlaps the base sequence of any intron included in the accepted HLA dictionary, and the two base sequences match with no more than 2 mismatched bases within the overlapping range, the read is determined to match the allele containing the intron corresponding to the ID assigned to that intron's base sequence;
(C) When a contiguous stretch of at least 50% of the total length of an acquired read overlaps the base sequence of any intron included in the accepted HLA dictionary, the two base sequences match with no more than 2 mismatched bases within the overlapping range, and the remainder of the read that does not match the intron overlaps the base sequence of an exon that is included in the HLA dictionary and adjacent to the intron, with the two base sequences matching completely within that overlapping range, the read is determined to match the allele containing the exon corresponding to the ID assigned to that exon's base sequence;
When the acquired read is a single-end read, according to the following formula,

and when the acquired read is a paired-end read, according to the following formula,

calculation means 1 for calculating the weight of the read with respect to the HLA gene containing the allele determined to match the read;
Calculation means 2 for calculating the score of each allele pair of the HLA gene according to the following formula;

Determination means 2 for determining, according to the following formula,

the allele pair that maximizes the calculated score as the allele pair of the subject's HLA gene (provided that, when the following condition is satisfied,

the homozygote of allele A, where A and B are the alleles of the score-maximizing allele pair, is determined as the allele pair of the subject's HLA gene);
Output means for outputting the determined allele pair;
A program for determining an allele pair of a subject's HLA gene, the program causing a computer to function as the means listed above.

The program according to claim 8, wherein the base sequence information of the exons and introns of known HLA alleles recorded in the HLA dictionary accepted by the reception means 2 is obtained from the IPD-IMGT/HLA database.

The program according to claim 8 or 9, wherein a run of N bases with a length of 1/2 the maximum length of the reads accepted by the reception means 1 is added to both ends of each exon and intron of the known HLA alleles recorded in the HLA dictionary accepted by the reception means 2.

The program according to any one of claims 8 to 10, wherein, in the calculation means 2, the set T includes the second and third exons of Class I HLA genes and the second exon of Class II HLA genes.

The program according to any one of claims 8 to 11, further comprising the following means for use when the same maximum score is obtained for two or more allele pairs in the determination means 2:
Calculation means 2' for adding, to the set T of the calculation means 2, the exons that are recorded in the HLA dictionary for all of the alleles included in those allele pairs, and calculating the scores of the allele pairs of the HLA gene again;
Determination means 2 for determining the allele pair that maximizes the score calculated by the calculation means 2' as the allele pair of the subject's HLA gene; and
Output means for outputting the determined allele pair.

The program according to any one of claims 8 to 11, further comprising the following means for use when the same maximum score is obtained for two or more allele pairs in the determination means 2 and no exon is recorded in the HLA dictionary for all of the alleles included in those allele pairs:
Determination means 2' for calculating, for each of the two or more allele pairs having the same maximum score in the determination means 2, the product of the frequencies of the alleles of the pair using allele frequency data of the HLA gene registered in a known database, and determining the allele pair with the highest product as the allele pair of the subject's HLA gene; and
Output means for outputting the determined allele pair.

The program according to claim 13, wherein the known database of the determination means 2' is the Allele Frequency Net Database.