CN108350498A - Classifying method and device - Google Patents

Classifying method and device

Info

Publication number
CN108350498A
Authority
CN
China
Prior art keywords
type
candidate
types
haplotype
reads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201680067128.7A
Other languages
Chinese (zh)
Other versions
CN108350498B (en
Inventor
章元伟
张涛
曹红志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Publication of CN108350498A
Application granted
Publication of CN108350498B
Legal status: Active
Anticipated expiration

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Abstract

The present invention discloses a typing method comprising the steps of: obtaining sequencing data; aligning a reference sequence group to a consensus sequence group to obtain a type set; aligning the sequencing data to the reference sequence group to obtain an alignment result; converting the alignment result into a converted alignment result relative to the consensus sequences; assembling the reads aligned to the same consensus sequence to obtain haplotypes; determining the types supported by the haplotypes; grouping the types to obtain a first and a second candidate type group; screening the types in the first and second candidate type groups respectively to determine a primary and a secondary type; and judging the genotype of the target region based on the difference between the numbers of reads supporting the primary and the secondary type. The typing method is suitable for genotyping any region, and especially for typing highly polymorphic regions, for example HLA-DRB3, 4 and/or 5.

Description

Typing method and device
Technical Field
The invention relates to the field of bioinformatics, and in particular to a typing method and a typing device.
Background
HLA-DRB3, DRB4 and DRB5 are homologous genes encoding the β chain of HLA (human leukocyte antigen) class II molecules.
HLA class II molecules are expressed on the cell membrane of antigen-presenting cells, present peptide fragments of proteins from outside the cell, and play a central role in the immune system. HLA-DRB3 has been reported to be associated with Crohn's disease, Graves' disease, type I diabetes, and the like. HLA-DRB4 has been reported to be associated with childhood acute lymphoblastic leukemia, Hashimoto's thyroiditis, allergic granulomatous vasculitis, vitiligo, etc. HLA-DRB5 has been reported to be associated with keloid, systemic lupus erythematosus, multiple sclerosis, lethargy, and the like.
Typing HLA-DRB3, DRB4 and DRB5 therefore has important value for medical applications and disease research.
Conventional typing of HLA-DRB3, 4 and 5 relies mainly on exon PCR combined with gene sequencing, or on long-fragment PCR combined with gene sequencing. These methods require many PCR primer design and testing steps and cannot be applied to high-throughput whole-genome sequencing or high-throughput chip-capture sequencing.
Disclosure of Invention
The present invention aims to solve at least one of the above problems or to provide at least one commercial alternative.
According to an aspect of the present invention, there is provided a typing method comprising: obtaining sequencing data of a sample to be tested, wherein the sequencing data comprise a plurality of reads from a target region; aligning a reference sequence group to a consensus sequence group to obtain a type set, wherein the type set comprises a plurality of types, the consensus sequence group comprises one or more consensus sequences, the consensus sequence group completely covers all gene sequences of the target region, and different consensus sequences comprise different full-length gene sequences, and wherein the reference sequence group comprises a plurality of reference sequences, the reference sequence group completely covers all exons of the coding region sequences of the target region, and different reference sequences comprise different exons; aligning the sequencing data to the reference sequence group to obtain an alignment result; converting the alignment result into an alignment result relative to the consensus sequence group to obtain a converted alignment result; based on the converted alignment result, separately assembling the reads aligned to each consensus sequence to obtain an assembly result comprising a plurality of haplotypes; comparing the variations on each haplotype with the types in the type set to determine the types supported by the haplotype; dividing the types into two groups according to the haplotypes and the reads supporting the types, to obtain a first candidate type group and a second candidate type group; screening the types in the first and second candidate type groups respectively, based on the haplotypes and the reads supporting the types, to determine a primary type and a secondary type; and determining the genotype of the target region based on the difference between the numbers of reads supporting the primary type and the secondary type.
According to another aspect of the present invention, there is provided a computer-readable medium storing a computer-executable program whose execution includes performing the typing method of the above aspect of the present invention. Those skilled in the art will understand that, when the computer-executable program is executed, all or part of the steps of the typing method may be carried out by hardware under the control of program instructions. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.
According to still another aspect of the present invention, there is provided a typing device comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including a computer-executable program; and a processor connected to the data input unit, the data output unit and the storage unit and configured to execute the computer-executable program, wherein execution of the program includes carrying out the typing method described above.
According to yet another aspect of the present invention, there is provided a typing system comprising: a data input module for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; an alignment module comprising a first alignment module and a second alignment module, wherein the first alignment module aligns a reference sequence group to a consensus sequence group to obtain a type set comprising a plurality of types, and the second alignment module aligns the sequencing data from the data input module to the reference sequence group to obtain an alignment result, wherein the consensus sequence group comprises one or more consensus sequences, completely covers all gene sequences of the target region, and different consensus sequences comprise different full-length gene sequences, and wherein the reference sequence group comprises a plurality of reference sequences, completely covers all exons of the coding region sequences of the target region, and different reference sequences comprise different exons; a conversion module for converting the alignment result into an alignment result relative to the consensus sequence group to obtain a converted alignment result; an assembly module for separately assembling, based on the converted alignment result, the reads aligned to each consensus sequence to obtain an assembly result comprising a plurality of haplotypes; a haplotype-supported-type determination module for comparing the variations on each haplotype with the types in the type set to determine the types supported by the haplotype; a grouping module for dividing the types into two groups according to the haplotypes and the reads supporting the types, to obtain a first candidate type group and a second candidate type group; a primary/secondary type determination module for screening the types in the first and second candidate type groups respectively, based on the haplotypes and the reads supporting the types, to determine a primary type and a secondary type; and a genotype determination module for determining the genotype of the target region based on the difference between the numbers of reads supporting the primary type and the secondary type.
The method builds a type set from constructed consensus sequences and reference sequences of the target region, builds haplotypes from the read information, and compares the types with the haplotypes to genotype the target region. The typing method of the present invention is applicable to genotyping any region and is particularly suitable for typing highly polymorphic regions, for example HLA-DRB3, 4 and/or 5, and in particular for typing HLA-DRB3, 4 and/or 5 from any form of sequencing data that contains sequence information of HLA-DRB3, 4 and/or 5. The method requires neither exon PCR nor long-fragment PCR targeting the gene of interest, which reduces the experimental workload and difficulty and gives more flexibility in designing an application or study.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a typing method in an embodiment of the present invention.
FIG. 2 is a flow chart of a typing method in an embodiment of the present invention.
FIG. 3 is a flow chart of a typing method in an embodiment of the present invention.
FIG. 4 is a schematic structural view of a typing device in an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a typing system in an embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a typing system in an embodiment of the present invention.
Detailed Description
As shown in FIG. 1, a typing method according to an embodiment of the present invention includes the following steps:
S10 Obtaining the sequencing data.
Obtaining sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region.
The sequencing data are obtained by preparing a sequencing library from the nucleic acid of the sample to be tested and sequencing it on a sequencer. According to an embodiment of the invention, obtaining the sequencing data comprises: obtaining nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the library. The library is prepared as required by the chosen sequencing method. Depending on the chosen platform, the sequencing method may be, without limitation, the Illumina HiSeq 2000/2500 platform, the Ion Torrent platform of Life Technologies, the BGISEQ platform of BGI, or a single-molecule sequencing platform. Sequencing may be single-end or paired-end, and the output data consist of sequenced fragments called reads.
The target region can be any gene region of interest. According to an embodiment of the present invention, the target region includes a member of the MHC (major histocompatibility complex) gene family. Mammalian MHC genes are highly polymorphic, and the human MHC is commonly referred to as HLA (human leukocyte antigen). According to one embodiment of the invention, the sample to be tested is from a human, and the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5. According to another embodiment of the invention, the target region includes HLA-DRB3, HLA-DRB4 and HLA-DRB5. With the method of the present invention, such highly polymorphic regions can be typed accurately without PCR, from any form of sequencing data that includes HLA-DRB3, HLA-DRB4 and/or HLA-DRB5.
S20 Alignment.
The reference sequence group is aligned to the consensus sequence group to obtain a type set, wherein the type set comprises a plurality of types.
The sequencing data are aligned to the reference sequence group to obtain an alignment result.
The two alignment processes above are independent of each other and may be performed in either order. They yield two alignment outputs, referred to here as the type set and the alignment result.
The consensus sequence group includes one or more consensus sequences. A consensus sequence is a predetermined sequence containing the full length of a gene, and may be a previously obtained reference template of the biological category to which the sample belongs. For example, if the sample comes from a human individual, the consensus sequence may be the full sequence of a gene of the target region contained in HG19 as provided in the NCBI database. According to one embodiment of the invention, a consensus sequence is the full-length sequence of one gene in the target region. If a gene has several full-length sequences, the longest can be selected as the consensus sequence of that gene. The consensus sequence group completely covers all gene sequences of the target region, and different consensus sequences comprise different full-length gene sequences.
The reference sequence group includes a plurality of reference sequences. A reference sequence is a predetermined sequence containing an exon, and may be a reference template of the biological category to which the sample belongs. For example, if the sample comes from a human individual, the reference sequence may be the sequence of an exon of the target region contained in HG19 as provided in the NCBI database. A resource library containing more reference sequences may also be configured in advance, for example by selecting, or determining and assembling, more similar sequences as reference sequences according to factors such as the condition and geographic origin of the individual from whom the sample was taken. The reference sequence group completely covers all exons of the coding region sequences of the target region, and different reference sequences comprise different exons.
A type is an allele of a gene, and is essentially a collection of specific variations in a genomic region. Certain international organizations such as the WHO name the known alleles of certain genes, and a named allele is generally referred to as a type. According to one embodiment of the invention, the target region comprises at least a portion of an HLA gene. Because HLA genes have many alleles and play a key role in medical transplantation, all known alleles of HLA genes are named by the WHO, for example DRB3*01:01:01, and a given type refers to one named allele. Typing is the process of determining, by any of various methods, the type of a target gene in a target individual.
Each type in the type set is given as position information and variation information relative to the consensus sequence group. According to one embodiment of the invention, the longest full-length gene sequence is selected as the consensus sequence, the variations of all coding region sequences and full-length gene sequences relative to the consensus sequence are recorded, and the variations are associated with the types. Establishing this association between variation information and types makes it easier to set type criteria from the detected variations and to perform genotyping later.
Aligning the sequencing data to the reference sequence group yields an alignment result that comprises the position information and variation information of each aligned read relative to the reference sequence in the reference sequence group to which it aligns.
In this embodiment, the sequencing data are not aligned directly to the consensus sequence group, because the consensus sequence group contains the sequence of only one allele of each gene. Instead, the sequencing data are first aligned to the reference sequence group, and the alignment result is then converted to be relative to the consensus sequences. Because the reference sequence group contains all alleles, this increases the number of reads that align and significantly improves data utilization.
According to one embodiment of the present invention, the target region is a highly polymorphic region, and the reference sequence group is constructed by: obtaining the coding region sequences and full-length gene sequences of the target region; dividing each coding region sequence by exon to obtain a plurality of exon sequences; and extracting the K bp flanking each side of the exon from the full-length gene sequence of the type closest to the exon sequence and adding them to the two sides of the corresponding exon sequence to obtain a reference sequence of the reference sequence group, where K is the read length. For example, to type HLA-DRB3, HLA-DRB4 and/or HLA-DRB5, HLA genotype and sequence data, including multiple coding region sequences and multiple full-length gene sequences, can be downloaded from the IMGT/HLA database. The format of the downloaded data can be modified to simplify later analysis. Each coding region sequence is then divided by exon, and the K bp on each side of the exon are extracted from the full-length gene sequence of the most similar type and added to the two sides of the corresponding exon sequence to form a reference sequence. K depends on the read length. The exon sequences are extended so that reads aligning to the edge of an exon are retained during alignment, which lets this part of the data be used and improves data utilization. K can vary depending on whether the reads are of uniform or unequal length. K also need not be a single exact value: it can be a range, such as an interval of plus or minus 10% of the length of any read, and the K added on each side of an exon sequence can take different values.
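The flank-extension step can be illustrated with a minimal Python sketch. The function name `build_reference_sequence` and the coordinate parameters `exon_start` and `exon_end` (0-based, end-exclusive position of the corresponding exon within the closest full-length gene sequence) are hypothetical, and a single fixed K is assumed on both sides even though the text allows K to differ.

```python
def build_reference_sequence(exon_seq, gene_seq, exon_start, exon_end, k):
    """Add up to k bp of flanking sequence to each side of an exon.

    exon_seq   : exon sequence of the allele being represented
    gene_seq   : full-length gene sequence of the closest type (assumed given)
    exon_start : start of the matching exon in gene_seq (0-based)
    exon_end   : end of the matching exon in gene_seq (exclusive)
    k          : flank length, typically about one read length
    """
    # Flanks are clipped at the ends of the gene sequence.
    left_flank = gene_seq[max(0, exon_start - k):exon_start]
    right_flank = gene_seq[exon_end:exon_end + k]
    return left_flank + exon_seq + right_flank


gene = "AAACCCGGGTTT"          # toy full-length sequence
reference = build_reference_sequence("CCTGGG", gene, 3, 9, 2)
```

In this toy case the exon of the allele differs from the closest gene sequence by one base, yet still receives that gene's flanks, which is the situation the text describes.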
According to one embodiment of the present invention, the target region to be typed comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5. The inventors obtained from the IMGT/HLA database 94 published coding region sequences and 7 full-length gene sequences of HLA-DRB3, HLA-DRB4 or HLA-DRB5. These coding region sequences and full-length gene sequences differ from one another in length, number of exons, base sequence or structure at specific positions, and so on. The constructed reference sequence group comprises 166 reference sequences.
The expression "the full-length gene sequence of the type closest to the exon sequence" means the full-length gene sequence whose corresponding exon matches the exon sequence best, that is, is the most similar and has the fewest differences. For example, if an exon sequence differs from the reference genome at several specific positions, and the corresponding positions of the corresponding exon region of a full-length gene sequence are all identical to the exon sequence, the type of that full-length gene sequence is determined to be the closest to the exon sequence.
The term "aligned" means matched. Known alignment software such as SOAP, BWA or TeraMap may be used for the alignment; this embodiment places no limitation on the choice. During alignment, depending on the alignment parameters, at most n mismatched bases are allowed for a read pair or a single read, for example with n set to 1 or 2. If more than n bases of a read are mismatched, the read pair is considered unable to align to the reference sequence; alternatively, if all of the more than n mismatched bases lie in one read of the pair, only that read is considered unable to align to the reference sequence.
A read is said to support a variation when it matches perfectly, that is, when the read has no mismatch against the sequence that contains the variation site, or against the signature sequence that the variation produces when it occurs.
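The mismatch threshold and the perfect-match notion of support can be sketched as follows. This is a simplified, gapless illustration and not how SOAP, BWA or TeraMap work internally; the function names are hypothetical.

```python
def count_mismatches(read, target, pos):
    """Count mismatched bases of `read` placed at `pos` on `target`.

    Returns None when the read does not fit inside the target.
    Gapless comparison only (an assumption of this sketch).
    """
    window = target[pos:pos + len(read)]
    if pos < 0 or len(window) < len(read):
        return None
    return sum(1 for a, b in zip(read, window) if a != b)


def read_aligns(read, reference, pos, n=1):
    """A read aligns if it has at most n mismatches (n is a parameter)."""
    mismatches = count_mismatches(read, reference, pos)
    return mismatches is not None and mismatches <= n


def read_supports_variation(read, variant_sequence, pos):
    """A read supports a variation only on a perfect match against the
    sequence that carries the variation."""
    return count_mismatches(read, variant_sequence, pos) == 0
```

With `n=1`, a read carrying one differing base still aligns, but it supports only the sequence it matches exactly.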
S30 Obtaining the converted alignment result.
The alignment result is converted into an alignment result relative to the consensus sequence group to obtain a converted alignment result. That is, all position information and variation information in the alignment result are converted into position information and variation information relative to the consensus sequence group. The converted alignment result and the type set from S20 are then both expressed relative to the consensus sequences, which makes it straightforward to compare the two later when judging the genotype.
According to one embodiment of the invention, the most complete full-length gene sequence of each target gene is selected as the consensus sequence of that gene, and the position information and variation information in the alignment result are converted into position information and variation information relative to the consensus sequence group to obtain the converted alignment result, so that the genotype can be determined by screening and judging the types from the detected variation information.
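A minimal sketch of the conversion follows, under two assumptions that the text does not state: each exon-based sequence has a known offset on the full-length sequence used as the common coordinate system, and the mapping is gapless. The name `convert_alignment` and its parameters are hypothetical.

```python
def convert_alignment(read, ref_pos, ref_offset, baseline):
    """Re-express a read alignment in the coordinates of the full-length
    baseline sequence.

    read       : read bases
    ref_pos    : 0-based start of the read on the exon-based sequence
    ref_offset : position on the baseline where that sequence starts
                 (assumed known from aligning it to the baseline)
    baseline   : full-length sequence used as common coordinates

    Returns (start_on_baseline, [(position, baseline_base, read_base), ...]).
    """
    start = ref_pos + ref_offset
    variations = [
        (start + i, baseline[start + i], base)
        for i, base in enumerate(read)
        if baseline[start + i] != base
    ]
    return start, variations


baseline = "AAACCCGGGTTT"
converted = convert_alignment("CTG", ref_pos=1, ref_offset=3, baseline=baseline)
```

After conversion, the variations of a read and the variations recorded for each type share one coordinate system and can be compared directly.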
S40 Assembly.
Based on the converted alignment result, the reads aligned to each consensus sequence are assembled separately to obtain an assembly result comprising a plurality of haplotypes.
A haplotype is a group of associated single-nucleotide polymorphisms located in a specific region of a chromosome that tend to be inherited together by offspring.
The sequences may be assembled by any known assembly method; this embodiment places no limitation on the choice. According to one embodiment of the invention, this step comprises: assembling the reads aligned to the same consensus sequence by joining reads that overlap and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
Each consensus sequence in the consensus sequence group is the full-length sequence of one gene, and a eukaryotic gene generally contains several exons. The haplotypes assembled from the reads aligned to the same consensus sequence, using reads whose overlapping portions are completely identical, are therefore essentially sequences that contain at least part of at least one exon and that differ from one another in base sequence or length. According to one embodiment of the invention, after the assembly result is obtained, haplotypes that cover less than 95% of an exon are filtered out. Discarding these relatively short sequences improves the reliability of the data used in later analysis steps, reduces data complexity, and facilitates accurate typing.
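One way to picture the overlap-based assembly and the 95% coverage filter is the greedy sketch below. It is not a known assembler and makes simplifying assumptions: reads are gapless `(start, sequence)` pairs in shared coordinates, and a read extends only the first haplotype with which its overlap is identical. All names are hypothetical.

```python
def assemble_haplotypes(reads):
    """Greedily join reads whose overlapping bases are identical.

    reads: list of (start, seq) in shared coordinates.
    Returns a list of (start, seq) haplotypes.
    """
    haplotypes = []  # each entry is [start, seq]
    for start, seq in sorted(reads):
        for hap in haplotypes:
            h_start, h_seq = hap
            h_end = h_start + len(h_seq)
            if h_start <= start < h_end:
                overlap = min(h_end, start + len(seq)) - start
                offset = start - h_start
                if h_seq[offset:offset + overlap] == seq[:overlap]:
                    hap[1] = h_seq + seq[overlap:]  # extend, or no-op if contained
                    break
        else:
            # No haplotype overlaps identically: start a new one.
            haplotypes.append([start, seq])
    return [(s, q) for s, q in haplotypes]


def exon_coverage(haplotype, exon_start, exon_end):
    """Fraction of the exon [exon_start, exon_end) covered by the haplotype."""
    h_start, h_seq = haplotype
    covered = min(exon_end, h_start + len(h_seq)) - max(exon_start, h_start)
    return max(0, covered) / (exon_end - exon_start)


def filter_by_coverage(haplotypes, exon_start, exon_end, min_cov=0.95):
    """Drop haplotypes covering less than min_cov of the exon."""
    return [h for h in haplotypes
            if exon_coverage(h, exon_start, exon_end) >= min_cov]


reads = [(0, "ACGT"), (2, "GTAA"), (2, "GAAA"), (4, "AACC")]
haps = assemble_haplotypes(reads)
kept = filter_by_coverage(haps, 0, 8)
```

Here the read `GAAA` disagrees with the first haplotype in the overlap and therefore seeds a second, shorter haplotype, which the coverage filter then removes.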
The step of obtaining the type set in S20 is not bound to any order relative to S30 and S40. For example, the type set of S20 may be obtained first and S30 and S40 performed afterwards; alternatively, S30 and S40 may be performed first and the alignment that yields the type set of S20 afterwards.
S50 Determining the types supported by the haplotypes.
The variations on each haplotype are compared with the types in the type set to determine the types supported by the haplotype.
According to one embodiment of the invention, the following is performed before this step: each haplotype is scored based on the assembly result of the reads aligned to the same consensus sequence, and the haplotypes are screened by their scores to obtain candidate haplotypes; the subsequent steps are then performed with the candidate haplotypes in place of the haplotypes. This reduces interference in the data and the amount of data to be processed.
According to another embodiment of the present invention, scoring each haplotype based on the assembly result of the reads aligned to the same consensus sequence and screening the haplotypes by score to obtain candidate haplotypes comprises: determining the score of a haplotype, Score, using a formula in which c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, R represents the reliability of the haplotype, Xi is the sequencing depth at position i of the haplotype, i is the position index on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype. The formula judges the reliability of a haplotype from its average sequencing depth, the sequencing depth at each position, and its length, and assigns the haplotype a score from this reliability, its coverage of the corresponding exon, and the number of reads supporting it. The score characterizes the haplotype and makes it possible to compare several haplotypes.
The screening comprises: for the assembly result of the reads aligned to the same consensus sequence, taking out the haplotype with the highest score and then removing from the assembly result every haplotype that meets the following condition: its sequence is not identical to that of the haplotype taken out, and its sequencing depth at the inconsistent site is less than 20% of the sequencing depth at the corresponding site of the haplotype taken out; and repeating these steps, taking out no more than 4 haplotypes in total, to obtain the candidate haplotypes. Because the target region comes from a diploid genome, a region has at most two types (the heterozygous case). Screening the haplotypes in this way leaves each exon with at most 4 corresponding haplotypes and eliminates haplotypes that do not reflect the real situation or carry no significance, which reduces data complexity and helps later analysis type quickly and accurately.
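The iterative screening can be sketched as below. The haplotype scores are taken as given inputs, since the score formula is not reproduced here. Two points are assumptions of this sketch rather than statements of the text: haplotypes are compared position by position over equal-length sequences, and a haplotype is removed only if every inconsistent site is below the 20% depth threshold.

```python
def is_low_depth_variant(hap, best, frac=0.2):
    """True if `hap` differs from `best` and, at every differing site,
    its depth is below `frac` of the depth of `best` at that site."""
    diffs = [i for i, (a, b) in enumerate(zip(hap["seq"], best["seq"])) if a != b]
    return bool(diffs) and all(
        hap["depth"][i] < frac * best["depth"][i] for i in diffs
    )


def screen_haplotypes(haplotypes, max_keep=4, frac=0.2):
    """Repeatedly take the top-scoring haplotype and drop low-depth variants
    of it, keeping at most `max_keep` candidate haplotypes."""
    pool = sorted(haplotypes, key=lambda h: h["score"], reverse=True)
    candidates = []
    while pool and len(candidates) < max_keep:
        best = pool.pop(0)
        candidates.append(best)
        pool = [h for h in pool if not is_low_depth_variant(h, best, frac)]
    return candidates


haplotypes = [
    {"seq": "ACGT", "depth": [100, 100, 100, 100], "score": 10.0},
    {"seq": "ACTT", "depth": [100, 100, 10, 100], "score": 5.0},   # likely error
    {"seq": "AGGT", "depth": [50, 50, 50, 50], "score": 3.0},      # real variant
]
candidates = screen_haplotypes(haplotypes)
```

The second haplotype differs from the best one at a site with only 10% of its depth and is dropped, while the third is retained as a plausible second allele.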
According to one embodiment of the invention, performing the subsequent step with the candidate haplotypes in place of the haplotypes comprises: comparing the variations on each candidate haplotype with the types in the type set, and, if the variations match a type completely, determining that the candidate haplotype supports that type. This establishes which types each candidate haplotype supports, so that the types can later be judged further according to their haplotype support.
According to one embodiment of the present invention, after the types supported by the candidate haplotypes are determined, candidate haplotypes that support the same types are merged. This speeds up the subsequent steps.
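The complete-match test can be sketched as a set comparison. The representation is an assumption of this sketch: a variation is a `(position, base)` pair in shared coordinates, a type is compared only over the span the haplotype covers, and the type names are placeholders rather than real allele names.

```python
def supported_types(hap_variations, hap_span, type_set):
    """Return the types whose variations inside the haplotype span match
    the haplotype's variations exactly.

    hap_variations : set of (position, base)
    hap_span       : (start, end) covered by the haplotype, end exclusive
    type_set       : dict mapping type name -> set of (position, base)
    """
    start, end = hap_span
    matches = []
    for name, variations in type_set.items():
        in_span = {v for v in variations if start <= v[0] < end}
        if in_span == set(hap_variations):
            matches.append(name)
    return sorted(matches)


type_set = {
    "type_A": {(5, "T"), (40, "A")},
    "type_B": {(5, "T"), (8, "G")},
    "type_C": {(40, "A")},
}
support = supported_types({(5, "T")}, (0, 10), type_set)
```

A haplotype can support several types at once when those types differ only outside the region it covers, which is why haplotypes supporting the same types can be merged afterwards.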
S60 Determining a first candidate type group and a second candidate type group.
The types are divided into two groups according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group.
The coding sequence of the target region comprises a plurality of exons. According to one embodiment of the invention, before this step is performed, types for which the number of supported exons is less than 30% of the total number of exons are filtered out. This removes meaningless or marginal types from the type set, reduces data interference and typing complexity, and facilitates fast and accurate typing later.
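As a small illustration of the 30% exon filter, the sketch below assumes that each type has already been mapped to the set of exons on which it has haplotype support; the function name and data layout are hypothetical.

```python
def filter_types_by_exon_support(exons_by_type, total_exons, min_frac=0.3):
    """Keep only types supported on at least min_frac of all exons.

    exons_by_type : dict mapping type name -> set of supported exon numbers
    total_exons   : total number of exons in the coding sequence
    """
    return {
        name for name, exons in exons_by_type.items()
        if len(exons) / total_exons >= min_frac
    }


remaining = filter_types_by_exon_support(
    {"type_A": {1, 2, 3}, "type_B": {2}}, total_exons=6
)
```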
As shown in FIG. 2, according to one embodiment of the present invention, this step comprises: performing a first scoring of the types according to the haplotypes and the reads supporting the types, and screening the types by the resulting first scores to obtain a first candidate type and a second candidate type; and, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assigning each of the other types to the first candidate type or the second candidate type to obtain the first candidate type group and the second candidate type group.
According to an embodiment of the present invention, performing the first scoring and screening the types by the first scores to obtain the first candidate type and the second candidate type comprises: determining the first score of a type, TScore, as TScore = N × S, where N is the number of reads supporting the type and S is the sum of the scores of the candidate haplotypes supporting the type; and combining all the types in pairs and taking the two types of the pair with the highest sum of first scores as the first candidate type and the second candidate type. The first score assigned by this formula characterizes the type and helps later type judgment.
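The first score and the pairwise selection can be sketched directly from TScore = N × S. The function names are hypothetical, and ties between pairs are broken by name order, which is an arbitrary choice of this sketch.

```python
from itertools import combinations


def first_score(n_reads, haplotype_scores):
    """TScore = N x S: read count times the summed candidate-haplotype scores."""
    return n_reads * sum(haplotype_scores)


def pick_candidate_pair(tscores):
    """Return the pair of types whose first scores sum highest.

    tscores: dict mapping type name -> TScore (needs at least two types).
    """
    pairs = combinations(sorted(tscores), 2)
    return max(pairs, key=lambda pair: tscores[pair[0]] + tscores[pair[1]])


tscores = {"type_A": 30.0, "type_B": 50.0, "type_C": 10.0}
first_candidate, second_candidate = pick_candidate_pair(tscores)
```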
According to an embodiment of the present invention, assigning the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, comprises: for each of the other types, comparing the sizes of a first intersection and a second intersection, and assigning the type to the first candidate type or the second candidate type according to the comparison, to obtain the first candidate type group and the second candidate type group. The first intersection is the intersection of the reads supporting the other type with the reads supporting the first candidate type, and the second intersection is the intersection of the reads supporting the other type with the reads supporting the second candidate type. If the first intersection is larger than the second, that is, more of the reads supporting the other type fall among the reads supporting the first candidate type, the other type is placed in the same group as the first candidate type; otherwise it is placed in the group of the second candidate type. The other types in the type set are thus divided into two groups around the first and second candidate types, which prepares for further screening and accurate typing.
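The intersection-based grouping maps naturally onto set operations. In this sketch the reads supporting each type are represented as sets of read identifiers, which is an assumed data layout; ties go to the second group, following the "otherwise" in the text.

```python
def group_types(reads_by_type, first, second):
    """Split all types into two groups around the two candidate types.

    reads_by_type : dict mapping type name -> set of supporting read ids
    first, second : names of the first and second candidate types
    """
    group1, group2 = [first], [second]
    for name, reads in reads_by_type.items():
        if name in (first, second):
            continue
        first_intersection = len(reads & reads_by_type[first])
        second_intersection = len(reads & reads_by_type[second])
        if first_intersection > second_intersection:
            group1.append(name)
        else:
            group2.append(name)
    return group1, group2


reads_by_type = {
    "type_A": {1, 2, 3, 4},
    "type_B": {5, 6, 7},
    "type_C": {1, 2, 5},
    "type_D": {6, 7, 8},
}
group1, group2 = group_types(reads_by_type, "type_A", "type_B")
```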
S70: determine the primary type and the secondary type.
The types in the first candidate type group and the second candidate type group are screened, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine the primary type and the secondary type.
According to one embodiment of the invention, this step comprises: performing a second scoring of the types in the first candidate type group and the second candidate type group, respectively, based on the reads supporting the types and the first scores of the types, and determining the primary type and the secondary type based on the obtained second scores of the types.
According to an embodiment of the invention, the second score TScore_New of a type is determined using the formula TScore_New = N* × TScore, where N* is, for a type in the first candidate type group, the number of reads supporting the type that lie outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of reads supporting the type that lie outside the reads supporting the first candidate type; the type with the highest second score in each of the first candidate type group and the second candidate type group is selected, and the two selected types are determined as the primary type and the secondary type. The so-called primary and secondary types are relative concepts based on the relative magnitude of their frequencies. Here, the primary and secondary types are distinguished based on the relative numbers of reads that support them. The formula adjusts the first score assigned to the type in the previous step, i.e., the first score is corrected to obtain a second score that characterizes the type, and the primary type and the secondary type are determined according to the second scores, which facilitates accurate subsequent type determination.
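A minimal sketch of the second scoring, under the assumption that reads are stored as sets of read identifiers and that the first scores are already available; the function names are hypothetical.

```python
def second_scores(group, other_candidate_reads, type_reads, tscore):
    """Return TScore_New = N* * TScore for every type in one group.

    N* counts the reads supporting the type that lie outside the reads
    supporting the candidate type of the other group.
    """
    return {
        type_name: len(type_reads[type_name] - other_candidate_reads) * tscore[type_name]
        for type_name in group
    }


def best_type(scores):
    """Return the type with the highest second score in a group."""
    return max(scores, key=scores.get)
```

Calling `second_scores` once per group and `best_type` on each result gives the two types from which the primary and secondary types are chosen.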
S80: determine the genotype of the target region.
The genotype of the target region is determined based on the difference between the numbers of reads supporting the primary type and the secondary type.
According to one embodiment of the invention, this step comprises: if the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, determining that the target region is heterozygous, the major allele and the minor allele constituting the heterozygote being the primary type and the secondary type, respectively; otherwise, determining that the target region is homozygous, both alleles constituting the homozygote being the primary type. Based on intermediate results of the foregoing steps and the analysis of a large amount of sample data, the inventors have found that setting the threshold for the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type to 0.1 enables simple and accurate determination of the genotype of the target region.
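The threshold rule can be sketched as follows. The function name and the handling of the case in which no read supports only the primary type are assumptions; the 0.1 threshold is the value given in this embodiment.

```python
def call_genotype(primary, secondary, primary_reads, secondary_reads, threshold=0.1):
    """Return the two alleles of the target region as a tuple.

    Reads are sets of read identifiers. Only reads exclusive to one of
    the two types enter the ratio.
    """
    only_primary = len(primary_reads - secondary_reads)
    only_secondary = len(secondary_reads - primary_reads)
    if only_primary == 0:
        # Assumption: the ratio is undefined here, so any exclusive
        # support for the secondary type is treated as heterozygous.
        heterozygous = only_secondary > 0
    else:
        heterozygous = only_secondary / only_primary > threshold
    return (primary, secondary) if heterozygous else (primary, primary)
```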
It should be noted that most of the specific numerical values referred to in the present invention are values in a statistical sense; thus, unless otherwise indicated, any numerical value expressed in a precise manner may represent a range, e.g., an interval of plus or minus 10% of the numerical value.
As shown in fig. 3, according to an embodiment of the present invention, the typing method in the above embodiment may further include the following steps:
S90: determine the copy number of the target region.
The average sequencing depth of at least one region whose copy number is fixed at 2 is taken as a reference depth, the ratio of the sequencing depth of the target region to the reference depth is calculated, and the copy number of the target region is determined according to the calculated ratio. According to an embodiment of the present invention, by using sequencing depth information in this step, the copy number of HLA-DRB3, 4 and/or 5 in the target region can be determined.
The determined genotype of the target region is further corrected according to the copy number of the target region, so as to determine the genotype of the target region more accurately. According to an embodiment of the present invention, if the copy number of the target region is 0, it is determined that the target region does not exist; if the copy number of the target region is 1, the target region is determined to be homozygous, and both alleles constituting the homozygote are the primary type; if the copy number of the target region is 2, the target region is determined to be heterozygous, and the major allele and the minor allele constituting the heterozygote are the primary type and the secondary type, respectively. It should be noted that the particular values referred to herein are values in a statistical sense; thus, the values "0", "1" and "2" expressed herein in a precise manner may each represent a range, such as an interval of plus or minus 10% or plus or minus 20% of the value.
It will be understood by those skilled in the art that all or part of the steps of the above-described typing method may be programmed in a machine-recognizable language and stored in a storage medium. According to another embodiment of the present invention, a computer-readable medium is provided for storing a computer-executable program, wherein executing the program comprises performing all or part of the steps of the typing method in any one of the above-mentioned embodiments. It will be understood by those skilled in the art that, when the computer-executable program is executed, all or part of the steps of any of the above-described typing methods may be performed by hardware under the control of instructions. The storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.
As shown in fig. 4, there is provided a typing device 100 according to still another embodiment of the present invention, the typing device 100 including: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a storage unit 140 for storing data including computer executable programs; a processor 130, connected to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, where executing the program includes performing all or part of the steps of the typing method in any of the above embodiments.
As shown in fig. 5, according to an embodiment of the present invention, there is further provided a typing system 1000, which can be used to implement the typing method in any embodiment of the present invention. The system 1000 includes: a data input module 1001 for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; an alignment module 1003 including a first alignment module 1013 and a second alignment module 1023, wherein the first alignment module 1013 is configured to align a reference sequence group to obtain a type set, the type set includes a plurality of types, and the second alignment module 1023 is configured to align the sequencing data from the data input module to the reference sequence group to obtain an alignment result, wherein the reference sequence group includes one or more reference sequences, the reference sequence group can completely cover all gene sequences of the target region, and different reference sequences include different complete gene sequences, and wherein the reference sequence group includes a plurality of reference sequences, the reference sequence group can completely cover all exons of the coding region sequence of the target region, and different reference sequences include different exons; a conversion module 1005, configured to convert the alignment result into an alignment result relative to the reference sequence group, to obtain a converted alignment result; an assembling module 1007, configured to assemble, based on the converted alignment result, the reads aligned to the same reference sequence, respectively, to obtain an assembly result, where the assembly result includes a plurality of haplotypes; a haplotype support type determination module 1009 for comparing the variation on the haplotype with the types in the type set to determine the types supported by the haplotype; a clustering module 1011, configured to divide the types into two groups according to the haplotypes supporting the types and the reads supporting the types, so as to obtain a first candidate type group and a second candidate type group; a primary/secondary type determination module 1013 for screening the types in the first candidate type group and the second candidate type group based on the haplotypes supporting the types and the reads supporting the types to determine a primary type and a secondary type; and a genotype determination module 1015 for determining a genotype of the target region based on the difference between the numbers of reads supporting the primary type and the secondary type. The above description of the advantages and technical features of the typing method in any embodiment of the present invention also applies to the typing system in this embodiment and is not repeated here. One skilled in the art will appreciate that the functional modules in the above-described system may contain sub-modules or be linked to other functional modules to perform optional or optimized steps or processes in the typing method.
According to one embodiment of the invention, the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB 5.
According to one embodiment of the invention, the target regions comprise HLA-DRB3, HLA-DRB4 and HLA-DRB 5.
According to an embodiment of the present invention, the reference sequence group in the alignment module 1003 is constructed by the following steps: obtaining a coding region sequence and a gene complete sequence comprising the target region; dividing the coding region sequence according to exons to obtain a plurality of exon sequences; extracting sequences of K bp on two sides of the exon sequence from the gene complete sequence of the type closest to the exon sequence, and adding the sequences to two sides of the corresponding exon sequence to obtain the reference sequence in the reference sequence group, wherein K is determined according to the length of a read segment.
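The flank extension can be illustrated with a short sketch. It assumes that the exon position in the closest type's complete gene sequence is known as a 0-based, half-open interval; the function name and parameters are hypothetical.

```python
def build_exon_reference(exon_seq, closest_gene_seq, exon_start, exon_end, k):
    """Return an exon sequence extended by up to k bp on each side.

    The flanks are taken from the complete gene sequence of the closest
    type; exon_start and exon_end give the exon's position in that gene
    sequence (0-based, end exclusive). Flanks are shorter than k when
    the exon lies near either end of the gene sequence.
    """
    left_flank = closest_gene_seq[max(0, exon_start - k):exon_start]
    right_flank = closest_gene_seq[exon_end:exon_end + k]
    return left_flank + exon_seq + right_flank
```

Choosing k close to the read length lets reads that only partly overlap the exon still align across its edges.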
According to one embodiment of the present invention, one of the reference sequences in the first alignment module 1013 and/or the transformation module 1005 is the complete gene sequence of one gene in the target region.
According to one embodiment of the invention, the following is performed with the assembly module 1007: for the reads aligned to the same reference sequence, performing the assembly using reads that overlap one another and whose overlapping portions are identical in sequence, to obtain the plurality of haplotypes.
In assembly module 1007, haplotypes with coverage of less than 95% of the exons are filtered out after the assembly results are obtained, according to one embodiment of the invention.
According to one embodiment of the invention, a candidate haplotype determination module 1008 is further included for performing the following before the types supported by the haplotypes are determined using the haplotype support type determination module 1009: scoring each haplotype based on the assembly result of the reads aligned to the same reference sequence, and screening the haplotypes based on the obtained haplotype scores to obtain candidate haplotypes; and replacing the haplotypes with the candidate haplotypes.
In candidate haplotype determination module 1008, according to one embodiment of the present invention, each of the haplotypes is scored using the following formula to determine a Score for the haplotype: where c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, R represents the reliability of the haplotype, Xi is the sequencing depth of position i of the haplotype, i is the number of positions on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype.
According to one embodiment of the present invention, screening the haplotypes based on the obtained haplotype scores in the candidate haplotype determination module 1008 includes: for the assembly result of the reads aligned to the same reference sequence, taking out the haplotype with the highest score in the assembly result, and then removing from the assembly result the haplotypes that meet the following condition: the sequence is inconsistent with that of the haplotype taken out, and the sequencing depth at the inconsistent site is less than 20% of the sequencing depth at the corresponding site of the haplotype taken out; and repeating the above steps until at most 4 haplotypes have been taken out, to obtain the candidate haplotypes.
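The iterative screening can be sketched as a greedy loop. Several details are assumptions made for illustration: haplotypes are dictionaries holding an aligned sequence of equal length, a per-position depth list and a precomputed score, and a haplotype is removed only when the depth condition holds at every inconsistent site.

```python
def conflicts_weakly(hap, top, depth_fraction):
    """True if hap differs from top and is shallow at every differing site."""
    sites = [i for i, (a, b) in enumerate(zip(hap["seq"], top["seq"])) if a != b]
    return bool(sites) and all(
        hap["depth"][i] < depth_fraction * top["depth"][i] for i in sites
    )


def select_candidate_haplotypes(haplotypes, max_candidates=4, depth_fraction=0.2):
    """Take haplotypes in descending score order, pruning weak conflicts."""
    remaining = sorted(haplotypes, key=lambda h: h["score"], reverse=True)
    chosen = []
    while remaining and len(chosen) < max_candidates:
        top = remaining.pop(0)
        chosen.append(top)
        # Drop haplotypes that conflict with the one just taken out and
        # have less than depth_fraction of its depth at the conflict sites.
        remaining = [h for h in remaining
                     if not conflicts_weakly(h, top, depth_fraction)]
    return chosen
```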
In accordance with one embodiment of the present invention, the following is performed in haplotype support type determination module 1009: comparing the variation on each candidate haplotype with the type in the set of types, and if the variation is completely matched with the type, determining that the candidate haplotype supports the type.
According to one embodiment of the present invention, after determining the types supported by the candidate haplotypes, the candidate haplotypes that support the same types are merged in haplotype support type determination module 1009.
According to an embodiment of the present invention, the method further comprises a type filtering module 1010 for filtering out types supporting a number of exons less than 30% of the total number of exons before obtaining the first candidate type group and the second candidate type group by using the clustering module 1011.
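The exon-count filter can be sketched in a few lines. The mapping from each type to the set of exons it supports is an assumed data layout, and the function name is hypothetical.

```python
def filter_types_by_exon_support(type_exons, total_exons, min_fraction=0.3):
    """Keep only types that support at least min_fraction of all exons.

    type_exons maps a type name to the set of exons for which at least
    one candidate haplotype supports that type.
    """
    return {
        type_name: exons
        for type_name, exons in type_exons.items()
        if len(exons) >= min_fraction * total_exons
    }
```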
According to an embodiment of the invention, the clustering module 1011 is configured to: perform a first scoring of the types according to the haplotypes supporting the types and the reads supporting the types, and screen the types based on the obtained first scores to obtain a first candidate type and a second candidate type; and, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assign the other types in the type set to the first candidate type and the second candidate type, respectively, to obtain the first candidate type group and the second candidate type group.
According to an embodiment of the present invention, performing, in the clustering module 1011, the first scoring of the types according to the haplotypes supporting the types and the reads supporting the types, and screening the types based on the obtained first scores to obtain a first candidate type and a second candidate type, includes: determining a first score TScore for each type using the formula TScore = N × S, where N is the number of reads supporting the type and S is the sum of the scores of the candidate haplotypes supporting the type; and combining all the types pairwise, and determining the two types in the combination with the highest sum of first scores as the first candidate type and the second candidate type, respectively.
According to an embodiment of the present invention, performing, in the clustering module 1011, support of other types in the type set based on the reads supporting the first candidate type and the reads supporting the second candidate type, and attributing the other types in the type set to the first candidate type and the second candidate type respectively to obtain the first candidate type group and the second candidate type group includes: for each of the other types, comparing the sizes of a first intersection and a second intersection, and classifying each of the other types into the first candidate type or the second candidate type according to the comparison result to obtain the first candidate type group and the second candidate type group, where the first intersection is an intersection of the read supporting the other type and the read supporting the first candidate type, and the second intersection is an intersection of the read supporting the other type and the read supporting the second candidate type.
According to one embodiment of the invention, the primary/secondary type determination module 1013 is configured to: perform a second scoring of the types in the first candidate type group and the second candidate type group, respectively, based on the reads supporting the types and the first scores of the types, and determine the primary type and the secondary type based on the obtained second scores of the types.
According to one embodiment of the invention, the following is performed in the primary/secondary type determination module 1013: determining a second score TScore_New for each type using the formula TScore_New = N* × TScore, where N* is, for a type in the first candidate type group, the number of reads supporting the type that lie outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of reads supporting the type that lie outside the reads supporting the first candidate type; and selecting the type with the highest second score in each of the first candidate type group and the second candidate type group, and determining the two selected types as the primary type and the secondary type.
According to one embodiment of the invention, the genotype determination module 1015 is configured to perform the following: if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, then the target region is determined to be a heterozygote, the major allele and the minor allele that make up the heterozygote are the major type and the minor type, respectively, otherwise the target region is determined to be a homozygote, and both alleles that make up the homozygote are the major type.
As shown in fig. 6, according to an embodiment of the present invention, further comprises a copy number determination module 1016 for performing the following: calculating the ratio of the sequencing depth of the target region to the reference depth by taking the average sequencing depth of at least one region with the fixed copy number of 2 as the reference depth, and determining the copy number of the target region according to the calculated ratio; and determining the genotype of the target region according to the copy number of the target region.
According to one embodiment of the present invention, determining the genotype of the target region based on the copy number of the target region using the copy number determination module 1016 includes: if the copy number of the target area is 0, judging that the target area does not exist; if the copy number of the target region is 1, determining that the target region is a homozygote, and the two alleles forming the homozygote are the main type; if the copy number of the target region is 2, the target region is judged to be a heterozygote, and the major allele and the minor allele constituting the heterozygote are the major type and the minor type, respectively.
The typing method, device and/or system in any of the above embodiments constructs reference sequences and a type set for the target region, constructs haplotypes based on read information, and genotypes the target region. The typing method is suitable for genotyping any region, and is particularly suitable for typing highly polymorphic regions, for example, HLA-DRB3, 4 and/or 5. The method does not require PCR targeting the gene of interest, which reduces experimental workload and difficulty and improves the flexibility of scheme design in application or research.
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements throughout, or elements having the same or similar function. The following examples are illustrative only and are not to be construed as limiting the invention.
It should be noted that the terms "first," "second," and the like, as used herein, are used for convenience of description only and are not to be construed as indicating or implying relative importance, nor order relationships therebetween. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
Unless otherwise noted, the reagents, instruments and software referred to in the following examples are conventional commercially available products or open-source resources; for example, sequencing library preparation kits were purchased from Illumina, and libraries were constructed according to the kit instructions.
Example 1
Typing of human HLA-DRB3,4,5, comprising the general steps of:
1. HLA genotype and sequence data are downloaded from the IMGT/HLA database. Sequence data includes coding region sequences and full-length sequences of genes.
2. The format of the downloaded HLA genotype and sequence data is adjusted for analysis. The coding region sequence is divided by exon; 100 bp of sequence (or another length, depending on the sequencing read length) on each side of the exon is extracted from the full-length gene sequence of the most similar type and added to both sides of the corresponding exon sequence to form a reference sequence. The exon sequences are extended so that exon edges are retained during alignment. A complete gene sequence is selected as the reference sequence, the variations of the coding region sequences and the gene sequences relative to this reference sequence are recorded, and the variations are associated with the types.
3. The sample sequencing data are aligned to the reference sequence to obtain the position information and variation information of the aligned reads, and this information is converted into position information and variation information relative to the reference sequence.
4. Using the position information and variation information of the aligned reads, for each exon, assembly is performed using reads that overlap one another and whose overlapping portions are completely identical in sequence, to obtain multiple haplotypes. Haplotypes with low exon coverage are filtered out. The number of supporting reads and the coverage of each haplotype are obtained, and each haplotype is scored accordingly. Haplotypes are taken out in descending order of score; each time a haplotype is taken out, other haplotypes whose sequences conflict with it and whose depths at the conflicting sites are low are removed, yielding the candidate haplotypes. The variation of each candidate haplotype is compared with the variation of each type; a perfectly matched type is a type supported by the haplotype, and one haplotype may support multiple types.
5. The candidate haplotypes of all exons, together with their corresponding types and supporting reads, are extracted. Types that support only a small number of exons are filtered out. The types are scored according to the scores of the supporting haplotypes and the number of supporting reads. All the types are combined pairwise, and the combination with the highest sum of scores is selected as the candidate types. Two corresponding groups of candidate reads are obtained from the two candidate types, all possible types are compared with the two groups of candidate reads, and all types are divided into two groups according to how the candidate reads support them. The types are then re-scored according to the candidate reads, and the type with the highest score in each of the two groups is taken as the best type of that group. The best type from the group whose candidate reads are more numerous is the primary type, and the other best type is the secondary type. If the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is large, e.g., greater than 0.1, the gene is judged to be heterozygous and both the primary and secondary types are retained; otherwise, it is judged to be homozygous and only the primary type is retained.
The following steps 6 and 7 are optional steps.
6. The copy numbers of HLA-DRB3, 4 and 5 are determined according to the sequencing depth. Unlike common genes, the copy numbers of HLA-DRB3, 4 and 5 are not fixed; for any one of these genes, three copy numbers, 0, 1 and 2, may occur in a human sample. The sequencing depth of a region (another gene or a non-coding sequence) whose copy number is fixed at 2, or the average sequencing depth of a larger region, is used as the reference depth, and the ratio of the depth of HLA-DRB3, 4, 5 to the reference depth is calculated: if the ratio is close to 0, the copy number is 0; if the ratio is close to 0.5, the copy number is 1; if the ratio is close to 1, the copy number is 2.
7. The retained type information is combined with the copy number information to obtain the final HLA-DRB3, 4, 5 genotype. If the copy number is 0, the gene is absent and there is no type; if the copy number is 1, the primary type among the retained types is selected as the final type; if the copy number is 2, all the retained types are selected as final types: if only the primary type is retained, the gene is homozygous, and if both the primary and secondary types are retained, the gene is heterozygous.
If the typing result obtained by performing only steps 1 to 5 differs from that obtained by performing all the steps, the typing result obtained after performing steps 6 and 7 prevails.
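Steps 6 and 7 can be illustrated with a minimal sketch. The nearest-value rule used to interpret "close to 0, 0.5 or 1", the function names and the use of the "Blank" notation of Table 1 for an absent allele are assumptions made for illustration.

```python
def estimate_copy_number(target_depth, reference_depth):
    """Map the depth ratio to the nearest of the copy numbers 0, 1, 2.

    The reference depth comes from a region with a fixed copy number of
    2, so the expected ratios are 0, 0.5 and 1 respectively.
    """
    ratio = target_depth / reference_depth
    return min((0, 1, 2), key=lambda copies: abs(ratio - copies / 2))


def final_genotype(copy_number, retained_types):
    """Combine the retained types with the copy number.

    retained_types is [primary] for a homozygous call from step 5 or
    [primary, secondary] for a heterozygous call.
    """
    if copy_number == 0:
        return ("Blank", "Blank")
    primary = retained_types[0]
    if copy_number == 1:
        return (primary, "Blank")
    secondary = retained_types[1] if len(retained_types) > 1 else primary
    return (primary, secondary)
```

For example, a target depth of 48 against a reference depth of 100 gives a ratio of 0.48 and a copy number of 1, so only the primary type is reported.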
Example 2
1. HLA genotype and sequence data are downloaded from the IMGT/HLA database. Sequence data includes coding region sequences and full-length sequences of genes.
2. The format of the downloaded HLA genotype and sequence data is adjusted for analysis. The gene coding sequence is divided by exon; 100 bp of sequence (or another length, depending on the sequencing read length) on each side of the exon is extracted from the full-length gene sequence of the most similar type and added to both sides of the corresponding exon sequence to form a reference sequence. The exon sequences are extended so that exon edges are retained during alignment. A complete gene sequence is selected as the reference sequence, the variations of the coding region sequences and the gene sequences relative to this reference sequence are recorded, and the variations are associated with the types.
3. MHC region capture high-throughput sequencing data of one YH cell line sample (a cell sample from the YanHuang project) are used.
4. The sequencing data are aligned to the reference sequence to obtain the position information and variation information of the aligned reads, and this information is converted into position information and variation information relative to the reference sequence.
5. Using the position information and variation information of the aligned reads, for each exon, assembly is performed using reads that overlap one another and whose overlapping portions are completely identical in sequence, to obtain multiple haplotypes. Haplotypes with less than 95% exon coverage are filtered out. The number of supporting reads, the depth and the coverage of each haplotype are obtained, and the Score of each haplotype is calculated. The formula for Score uses the following quantities:
c: the coverage of the exon by the haplotype;
N: the number of reads supporting this haplotype;
R: the reliability of the haplotype, which is calculated from the following quantities:
Xi: the depth at position i of the haplotype;
X: the average depth of the haplotype;
L: the sequence length of the haplotype.
Haplotypes are taken out in descending order of Score; each time a haplotype is taken out, other haplotypes whose sequences conflict with it and whose depths at the conflicting sites are below 20% of the depth of the haplotype taken out are removed. At most 4 haplotypes are taken out as candidate haplotypes. The variation of each candidate haplotype is compared with the variation of each type; a perfectly matched type is a type supported by the haplotype, and one haplotype may support multiple types.
6. The information of haplotypes supporting the same type is merged. Types that support fewer than 30% of the total number of exons are filtered out. The score of each type, TScore, is calculated using the formula:
TScore = N × S
N: the number of reads supporting this type;
S: the sum of the scores of the haplotypes supporting this type.
All types are combined pairwise, and the combination with the highest sum of scores is selected as the candidate types. A first group of candidate reads is obtained from the higher-scoring candidate type, and a second group of candidate reads is obtained from the lower-scoring candidate type. All types are compared with the two groups of candidate reads: if more of the reads supporting a type fall in the first group of candidate reads, the type is assigned to the first group; otherwise, it is assigned to the second group. All types are thus divided into two groups based on the candidates. The new score of each type, TScore_New, is calculated using the formula:
TScore_New = N* × TScore
N*: the number of reads supporting this type that lie outside the other group of candidate reads.
The type with the highest TScore_New in each of the two groups is taken as the best type of that group. The best type in the first group is the primary type, and the best type in the second group is the secondary type. If the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, the gene is judged to be heterozygous and both the primary and secondary types are retained; otherwise, the gene is judged to be homozygous and only the primary type is retained.
7. The average sequencing depth of a region whose copy number is fixed at 2 is used as the reference depth, and the ratio of the depth of HLA-DRB3, 4, 5 to the reference depth is calculated: the copy number is 0 if the ratio is close to 0, 1 if the ratio is close to 0.5, and 2 if the ratio is close to 1. In this sample, the copy number of HLA-DRB3 is 1, the copy number of HLA-DRB4 is 0, and the copy number of HLA-DRB5 is 1.
8. The retained types are combined with the copy numbers to obtain the final HLA-DRB3, 4, 5 typing results for the YH cell line sample, shown in Table 1 below, where "Blank" indicates allele loss: a copy number of 0 is written as Blank/Blank, a copy number of 1 as DRB/Blank, and a copy number of 2 as DRB/DRB.
TABLE 1
HLA-DRB3               HLA-DRB4       HLA-DRB5
DRB3*02:02:01/Blank    Blank/Blank    DRB5*01:01:01/Blank
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," "an implementation," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (44)

  1. A typing method, comprising:
    obtaining sequencing data of a sample to be tested, wherein the sequencing data comprises a plurality of reads from a target area;
    comparing the reference sequence group to obtain a type set, wherein the type set comprises a plurality of types,
    the set of reference sequences comprises one or more reference sequences, the set of reference sequences is capable of completely covering all gene sequences of the target region, different ones of the reference sequences comprise different complete sequences of genes,
    the reference sequence set comprises a plurality of reference sequences, the reference sequence set is capable of completely covering all exons of the coding region sequence of the target region, different ones of the reference sequences comprise different exons;
    comparing the sequencing data to the reference sequence group to obtain a comparison result;
    converting the comparison result into a comparison result relative to the reference sequence group to obtain a converted comparison result;
    assembling, for each reference sequence separately, the reads aligned to that reference sequence based on the converted comparison result to obtain an assembly result, wherein the assembly result comprises a plurality of haplotypes;
    comparing the variation on the haplotype to the types in the type set to determine the types supported by the haplotype;
    dividing the types into two groups according to the haplotypes supporting the types and the reads supporting the types, and obtaining a first candidate type group and a second candidate type group;
    screening the types in the first and second candidate type groups, respectively, based on the haplotypes that support the types and the reads that support the types to determine primary and secondary types;
    determining a genotype for the target region based on a difference in the number of reads supporting the primary type and the secondary type.
  2. The method of claim 1, wherein said target region comprises at least one of HLA-DRB3, HLA-DRB4, and HLA-DRB5.
  3. The method of claim 1, wherein said target region comprises HLA-DRB3, HLA-DRB4, and HLA-DRB5.
  4. The method of claim 1, wherein the set of reference sequences is obtained by:
    obtaining a coding region sequence and a gene complete sequence comprising the target region;
    dividing the coding region sequence according to exons to obtain a plurality of exon sequences;
    extracting sequences of K bp on two sides of the exon sequence from the gene complete sequence of the type closest to the exon sequence, and adding the sequences to two sides of the corresponding exon sequence to obtain the reference sequence in the reference sequence group, wherein K is determined according to the length of a read segment.
  5. The method of claim 1, wherein one of said reference sequences is the complete gene sequence of a gene in said target region.
  6. The method of claim 1, wherein assembling the reads aligned to the same reference sequence based on the converted comparison result to obtain an assembly result comprising a plurality of haplotypes comprises:
    comparing, pairwise, the reads aligned to the same reference sequence, and performing the assembly using reads whose overlapping portions are identical to each other, to obtain the plurality of haplotypes.
  7. The method of claim 1, characterized in that, after obtaining said assembly result, haplotypes with a coverage of said exons lower than 95% are filtered out.
  8. The method of claim 1, wherein the following is performed prior to determining the haplotype-supported type:
    scoring each haplotype based on the assembly result of the reads aligned to the same reference sequence, and screening the haplotypes based on the scores obtained to obtain candidate haplotypes;
    replacing the haplotype with the candidate haplotype.
  9. The method of claim 8, characterized in that each of said haplotypes is scored using the following formula to determine a Score for said haplotype:
    wherein,
    c is the coverage of the exon by the haplotype,
    n is the number of reads supporting the haplotype,
    r represents the reliability of the haplotype(s),
    x_i is the sequencing depth at position i of the haplotype, where i is the index of a position on the haplotype,
    x is the average depth of the haplotype,
    l is the length of the haplotype.
  10. The method of claim 8, wherein screening for haplotypes based on the score obtained for said haplotypes comprises:
    for the assembly result of the reads aligned to the same reference sequence, taking out the haplotype with the highest score, and then removing from the assembly result any haplotype that meets the following condition: its sequence is inconsistent with that of the haplotype taken out, and the sequencing depth at the inconsistent site is less than 20% of the sequencing depth at the corresponding site of the haplotype taken out;
    repeating the above steps until at most 4 haplotypes are taken out, and obtaining the candidate haplotypes.
  11. The method of claim 8, wherein said comparing the variation across each haplotype to the type in said set of types to determine the type supported by said haplotype comprises:
    comparing the variation on each candidate haplotype with the type in the set of types, and if the variation is completely matched with the type, determining that the candidate haplotype supports the type.
  12. The method of claim 11, wherein after determining the type supported by said candidate haplotype, merging said candidate haplotypes that support the same type.
  13. The method of claim 8, wherein, prior to dividing said types into two groups based on the haplotypes supporting said types and the reads supporting said types to obtain a first group of candidate types and a second group of candidate types,
    any type for which the number of supported exons is less than 30% of the total number of exons is filtered out.
  14. The method of claim 8 wherein said dividing said type into two groups based on haplotypes supporting said type and reads supporting said type to obtain a first group of candidate types and a second group of candidate types comprises:
    performing a first scoring of the types according to the haplotypes supporting the types and the reads supporting the types, and screening the types based on the first scores obtained, to obtain a first candidate type and a second candidate type;
    assigning each of the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group.
  15. The method of claim 14, wherein performing the first scoring of said types based on the haplotypes supporting said types and the reads supporting said types, and screening said types based on the first scores obtained to obtain a first candidate type and a second candidate type, comprises:
    determining the first score of a type, TScore, using the following formula,
    TScore = n × s, where,
    n is the number of reads supporting the type,
    s is the sum of the scores of the candidate haplotypes that support the type;
    combining all the types pairwise, and determining the two types in the combination with the highest sum of first scores as the first candidate type and the second candidate type, respectively.
  16. The method of claim 14, wherein said obtaining said first set of candidate types and said second set of candidate types based on support of other types in said set of types by reads supporting said first candidate type and reads supporting said second candidate type comprises:
    for each of said other types, comparing the magnitudes of the first and second intersections, attributing each of said other types to either said first candidate type or said second candidate type depending on the result of the comparison to obtain said first set of candidate types and said second set of candidate types,
    the first intersection is the intersection of the reads supporting the other type and the reads supporting the first candidate type,
    the second intersection is an intersection of the reads supporting the other type and the reads supporting the second candidate type.
  17. The method of claim 14 wherein said screening the types in said first set of candidate types and said second set of candidate types to determine primary types and secondary types based on the haplotypes supporting said types and the reads supporting said types, respectively, comprises:
    performing a second scoring of the types in the first candidate type group and in the second candidate type group, respectively, based on the reads supporting the types and the first scores of the types, and determining the primary type and the secondary type based on the second scores obtained.
  18. The method of claim 17, comprising:
    determining a second score of a type, TScore_New, using the following formula,
    TScore_New = N* × TScore, wherein,
    N* is, for a type in the first candidate type group, the number of reads supporting that type that are not among the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of reads supporting that type that are not among the reads supporting the first candidate type;
    determining the type with the highest second score in the first candidate type group as the primary type, and the type with the highest second score in the second candidate type group as the secondary type.
  19. The method of claim 1, wherein said determining the genotype of said target region based on the difference in the number of reads supporting said major type and said minor type comprises:
    if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, determining that the target region is a heterozygote, and the major allele and the minor allele constituting the heterozygote are the major type and the minor type, respectively,
    otherwise, the target region is judged to be homozygote, and the two alleles forming the homozygote are the main type.
  20. The method of any of claims 1-19, further comprising:
    calculating the ratio of the sequencing depth of the target region to the reference depth by taking the average sequencing depth of at least one region with the fixed copy number of 2 as the reference depth, and determining the copy number of the target region according to the calculated ratio;
    and determining the genotype of the target region according to the copy number of the target region.
  21. The method of claim 20, wherein said determining the genotype of said target region based on the copy number of said target region comprises:
    if the copy number of the target area is 0, judging that the target area does not exist;
    if the copy number of the target region is 1, determining that the target region is a homozygote, and the two alleles forming the homozygote are the main type;
    if the copy number of the target region is 2, the target region is judged to be a heterozygote, and the major allele and the minor allele constituting the heterozygote are the major type and the minor type, respectively.
  22. A computer-readable medium storing a computer-executable program, execution of which comprises performing the method of any one of claims 1-21.
  23. A typing device, comprising:
    a data input unit for inputting data;
    a data output unit for outputting data;
    a storage unit for storing data including a computer executable program;
    a processor coupled to the data input unit, the data output unit, and the storage unit, for executing the computer-executable program, the executing the program comprising performing the method of any of claims 1-21.
  24. A typing system, comprising:
    the data input module is used for inputting sequencing data of a sample to be tested, and the sequencing data comprises a plurality of reads from a target area;
    a comparison module comprising a first comparison module and a second comparison module,
    the first comparison module is used for comparing the reference sequence group to obtain a type set, wherein the type set comprises a plurality of types,
    a second comparison module for comparing the sequencing data from the data input module to the reference sequence group to obtain a comparison result,
    the set of reference sequences comprises one or more reference sequences, the set of reference sequences is capable of completely covering all gene sequences of the target region, different ones of the reference sequences comprise different complete sequences of genes,
    the reference sequence set comprises a plurality of reference sequences, the reference sequence set is capable of completely covering all exons of the coding region sequence of the target region, different ones of the reference sequences comprise different exons;
    the conversion module is used for converting the comparison result into a comparison result relative to the reference sequence group to obtain a converted comparison result;
    an assembly module for assembling, for each reference sequence separately, the reads aligned to that reference sequence based on the converted comparison result to obtain an assembly result, the assembly result comprising a plurality of haplotypes;
    a haplotype support type determination module for comparing the variation on the haplotype to the types in the set of types to determine the type supported by the haplotype;
    the clustering module is used for dividing the types into two groups according to the haplotypes supporting the types and the reads supporting the types to obtain a first candidate type group and a second candidate type group;
    a primary/secondary type determination module for screening types in the first and second candidate type groups, respectively, based on the haplotypes supporting the types and the reads supporting the types to determine primary and secondary types;
    a genotype determination module to determine a genotype for the target region based on a difference in the number of reads supporting the primary type and the secondary type.
  25. The system of claim 24, wherein said target region comprises at least one of HLA-DRB3, HLA-DRB4, and HLA-DRB5.
  26. The system of claim 24, wherein said target region comprises HLA-DRB3, HLA-DRB4, and HLA-DRB5.
  27. The system of claim 24, wherein the set of reference sequences in the alignment module is constructed by:
    obtaining a coding region sequence and a gene complete sequence comprising the target region;
    dividing the coding region sequence according to exons to obtain a plurality of exon sequences;
    extracting sequences of K bp on two sides of the exon sequence from the gene complete sequence of the type closest to the exon sequence, and adding the sequences to two sides of the corresponding exon sequence to obtain the reference sequence in the reference sequence group, wherein K is determined according to the length of a read segment.
  28. The system of claim 24, wherein one of said reference sequences in the alignment module and/or the transformation module is the complete sequence of a gene in said target region.
  29. The system of claim 24, wherein the following is performed using the assembly module:
    comparing, pairwise, the reads aligned to the same reference sequence, and performing the assembly using reads whose overlapping portions are identical to each other, to obtain the plurality of haplotypes.
  30. The system of claim 24, characterized in that in said assembly module haplotypes with a coverage of less than 95% of said exons are filtered out after obtaining said assembly result.
  31. The system of claim 24, further comprising a candidate haplotype determination module for, prior to determining the type supported by the haplotype using said haplotype support type determination module, performing the following:
    scoring each haplotype based on the assembly result of the reads aligned to the same reference sequence, and screening the haplotypes based on the scores obtained to obtain candidate haplotypes;
    replacing the haplotype with the candidate haplotype.
  32. The system of claim 31, wherein each of said haplotypes is scored in said candidate haplotype determination module using the following formula to determine a Score for said haplotype:
    wherein,
    c is the coverage of the exon by the haplotype,
    n is the number of reads supporting the haplotype,
    r represents the reliability of the haplotype(s),
    x_i is the sequencing depth at position i of the haplotype, where i is the index of a position on the haplotype,
    x is the average depth of the haplotype,
    l is the length of the haplotype.
  33. The system of claim 31, wherein said screening said haplotype in said candidate haplotype determination module based on the score obtained for said haplotype comprises:
    for the assembly result of the reads aligned to the same reference sequence, taking out the haplotype with the highest score, and then removing from the assembly result any haplotype that meets the following condition: its sequence is inconsistent with that of the haplotype taken out, and the sequencing depth at the inconsistent site is less than 20% of the sequencing depth at the corresponding site of the haplotype taken out;
    repeating the above steps until at most 4 haplotypes are taken out, and obtaining the candidate haplotypes.
  34. The system of claim 31 wherein the following is performed in said haplotype support type determination module:
    comparing the variation on each candidate haplotype with the type in the set of types, and if the variation is completely matched with the type, determining that the candidate haplotype supports the type.
  35. The system of claim 34 wherein, in said haplotype support type determination module, after determining the type supported by said candidate haplotypes, merging said candidate haplotypes that support the same type.
  36. The system of claim 31, further comprising a type filtering module for, prior to obtaining the first set of candidate types and the second set of candidate types using the clustering module,
    filtering out any type for which the number of supported exons is less than 30% of the total number of exons.
  37. The system of claim 31, wherein the clustering module is configured to perform the following:
    performing a first scoring of the types according to the haplotypes supporting the types and the reads supporting the types, and screening the types based on the first scores obtained, to obtain a first candidate type and a second candidate type;
    assigning each of the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group.
  38. The system of claim 37, wherein, in said clustering module, performing the first scoring of said types based on the haplotypes supporting said types and the reads supporting said types, and screening said types based on the first scores obtained to obtain a first candidate type and a second candidate type, comprises:
    determining the first score of a type, TScore, using the following formula,
    TScore = n × s, where,
    n is the number of reads supporting the type,
    s is the sum of the scores of the candidate haplotypes that support the type;
    combining all the types pairwise, and determining the two types in the combination with the highest sum of first scores as the first candidate type and the second candidate type, respectively.
  39. The system of claim 37, wherein, in said clustering module, assigning the other types in said set of types to said first candidate type and said second candidate type, based on the support of those other types by the reads supporting said first candidate type and the reads supporting said second candidate type, to obtain said first set of candidate types and said second set of candidate types, comprises:
    for each of said other types, comparing the magnitudes of the first and second intersections, attributing each of said other types to either said first candidate type or said second candidate type depending on the result of the comparison to obtain said first set of candidate types and said second set of candidate types,
    the first intersection is the intersection of the reads supporting the other type and the reads supporting the first candidate type,
    the second intersection is an intersection of the reads supporting the other type and the reads supporting the second candidate type.
  40. The system of claim 37, wherein said primary/secondary type determination module is operative to:
    performing a second scoring of the types in the first candidate type group and in the second candidate type group, respectively, based on the reads supporting the types and the first scores of the types, and determining the primary type and the secondary type based on the second scores obtained.
  41. The system of claim 40, wherein the following is performed in said primary/secondary type determination module:
    determining a second score of a type, TScore_New, using the following formula,
    TScore_New = N* × TScore, wherein,
    N* is, for a type in the first candidate type group, the number of reads supporting that type that are not among the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of reads supporting that type that are not among the reads supporting the first candidate type;
    determining the type with the highest second score in the first candidate type group as the primary type, and the type with the highest second score in the second candidate type group as the secondary type.
  42. The system of claim 24, wherein said genotype determination module is configured to perform the following:
    if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, determining that the target region is a heterozygote, and the major allele and the minor allele constituting the heterozygote are the major type and the minor type, respectively,
    otherwise, the target region is judged to be homozygote, and the two alleles forming the homozygote are the main type.
  43. The system of any of claims 24-42, further comprising a copy number determination module configured to:
    calculating the ratio of the sequencing depth of the target region to the reference depth by taking the average sequencing depth of at least one region with the fixed copy number of 2 as the reference depth, and determining the copy number of the target region according to the calculated ratio;
    and determining the genotype of the target region according to the copy number of the target region.
  44. The system of claim 43, wherein determining the genotype of the target region based on the copy number of the target region using the copy number determination module comprises:
    if the copy number of the target area is 0, judging that the target area does not exist;
    if the copy number of the target region is 1, determining that the target region is a homozygote, and the two alleles forming the homozygote are the main type;
    if the copy number of the target region is 2, the target region is judged to be a heterozygote, and the major allele and the minor allele constituting the heterozygote are the major type and the minor type, respectively.
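
The scoring and grouping recited in claims 15, 16 and 18 (and mirrored in claims 38, 39 and 41) can be illustrated with a short Python sketch. It is a hedged illustration under stated assumptions rather than the claimed implementation: the type labels, read IDs and haplotype scores are toy data, all function names are hypothetical, and the tie-breaking rules (the higher-scoring member of the best pair becomes the first candidate; ties between intersections go to the first candidate) are assumptions not stated in the claims.

```python
from itertools import combinations


def first_scores(type_reads, type_hap_scores):
    """TScore = n x s: n supporting reads, s summed candidate-haplotype scores."""
    return {t: len(type_reads[t]) * sum(type_hap_scores[t]) for t in type_reads}


def pick_candidates(tscore):
    """Choose the pair of types whose first scores have the highest sum."""
    a, b = max(combinations(tscore, 2), key=lambda p: tscore[p[0]] + tscore[p[1]])
    # Assumption: the higher-scoring member is called the first candidate.
    return (a, b) if tscore[a] >= tscore[b] else (b, a)


def group_types(type_reads, first, second):
    """Assign every other type to the candidate it shares more reads with."""
    groups = {first: [first], second: [second]}
    for t, reads in type_reads.items():
        if t in (first, second):
            continue
        n1 = len(reads & type_reads[first])    # first intersection
        n2 = len(reads & type_reads[second])   # second intersection
        # Assumption: ties are assigned to the first candidate.
        groups[first if n1 >= n2 else second].append(t)
    return groups[first], groups[second]


def best_by_second_score(group, other_candidate, type_reads, tscore):
    """TScore_New = N* x TScore; N* excludes reads supporting the other candidate."""
    new = {t: len(type_reads[t] - type_reads[other_candidate]) * tscore[t]
           for t in group}
    return max(new, key=new.get), new


# Toy data: read IDs supporting each (hypothetical) type, and haplotype scores.
type_reads = {"A": {1, 2, 3, 4, 5, 6}, "A2": {1, 2, 3, 4},
              "B": {7, 8, 9, 10}, "B2": {4, 7, 8}}
type_hap_scores = {"A": [2.0], "A2": [1.0], "B": [1.5], "B2": [1.0]}

tscore = first_scores(type_reads, type_hap_scores)
first, second = pick_candidates(tscore)
group1, group2 = group_types(type_reads, first, second)
primary, new1 = best_by_second_score(group1, second, type_reads, tscore)
secondary, new2 = best_by_second_score(group2, first, type_reads, tscore)
print(primary, secondary)
```

In this toy example the reads supporting only the primary type and only the secondary type would then feed the 0.1 ratio test of claim 19.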
CN201680067128.7A 2016-02-18 2016-02-18 Typing method and device Active CN108350498B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/074027 WO2017139945A1 (en) 2016-02-18 2016-02-18 Typing method and device

Publications (2)

Publication Number Publication Date
CN108350498A true CN108350498A (en) 2018-07-31
CN108350498B CN108350498B (en) 2021-10-19

Family

ID=59624680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680067128.7A Active CN108350498B (en) Typing method and device

Country Status (2)

Country Link
CN (1) CN108350498B (en)
WO (1) WO2017139945A1 (en)

Also Published As

Publication number Publication date
CN108350498B (en) 2021-10-19
WO2017139945A1 (en) 2017-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant