CN111883211B - Gene scar for representing HRD homologous recombination repair defect and identification method - Google Patents

Info

Publication number
CN111883211B
Authority
CN
China
Prior art keywords
score
preset
length
allele
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010789009.0A
Other languages
Chinese (zh)
Other versions
CN111883211A (en
Inventor
张哲�
孟元光
杜欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202010789009.0A priority Critical patent/CN111883211B/en
Publication of CN111883211A publication Critical patent/CN111883211A/en
Application granted granted Critical
Publication of CN111883211B publication Critical patent/CN111883211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Zoology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Wood Science & Technology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a gene scar for characterizing HRD homologous recombination repair defects and an identification method, comprising the following steps: step 1, performing whole genome sequencing to obtain raw genetic data; step 2, establishing a chromosome information matrix X (X1, X2, X3 … Xn); step 3, performing data quality control; step 4, aligning sequences; step 5, deleting duplicate reads; step 6, extracting an LRP file from the signals of the two genome copies; step 7, extracting a BAF file by comparing the number of high-quality non-reference bases with the number of high-quality bases; step 8, preprocessing to generate segmented LRP and BAF files and obtaining the A allele copy number nA and the B allele copy number nB; step 9, obtaining the telomeric allelic imbalance (TAI) score from nA and nB; step 10, obtaining the large-scale transition (LST) score from the length of each site; step 11, evaluating nA and nB of each SNP site to obtain the loss of heterozygosity (LOH) score; and step 12, performing statistical calculation to obtain the HRD score and making the determination.

Description

Gene scar for representing HRD homologous recombination repair defect and identification method
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a gene scar for characterizing HRD homologous recombination repair defects and an identification method.
Background
At present, a domestic patent, "a genome recombination fingerprint for representing hHRD homologous recombination defects and an identification method thereof", focuses on the genome recombination fingerprint of HRD, but its efficiency in predicting the response to PARP inhibitors is not high. Foreign studies have shown that HRD can cause the phenomenon of "genomic scars", which include loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), and large-scale state transition (LST). Some studies score LOH, TAI and LST and use the resulting score to define BRCA mutants as HRD-positive. However, those data are derived from gene chip data, which suffer from large errors and poor robustness.
Disclosure of Invention
Therefore, the invention provides a gene scar for characterizing HRD homologous recombination repair defects and an identification method thereof, which overcome the large error and poor robustness of gene chip data in the prior art.
To achieve this aim, the invention provides a gene scar for characterizing HRD homologous recombination repair defects and an identification method, comprising the following steps:
step 1: collecting samples and performing whole genome sequencing to obtain raw genetic data;
step 2: establishing chromosome information matrices X (X1, X2, X3 … Xn) by extracting data from a public database, wherein X1 represents a first preset chromosome information matrix, X2 represents a second preset chromosome information matrix, X3 represents a third preset chromosome information matrix, and Xn represents an nth preset chromosome information matrix, and each preset chromosome information matrix comprises the length, start point and end point of each of the 24 chromosomes, and the position, start point and end point of the centromere;
step 3: data quality control, in which the quality score of each base is calculated to determine the probability that the corresponding nucleotide call is incorrect;
step 4: sequence alignment, in which the reads are aligned against the chromosome information matrix;
step 5: deleting duplicate reads, in which duplicates are identified from the start coordinates and orientations of the two reads of a read pair, based on identical 5' mapping coordinates, and are deleted;
step 6: extracting the LRP file by comparing the observed signal of the two genome copies with the expected signal;
step 7: extracting a BAF file by comparing the number of high-quality non-reference bases with the number of high-quality bases;
step 8: preprocessing, in which the LRP and BAF files generated in steps 6 and 7 are preprocessed, averaged segments are generated at the SNP sites, the chromosome position of each site is recorded, and the A allele copy number nA and the B allele copy number nB are obtained at the start point and end point of each region;
step 9: obtaining the telomeric allelic imbalance (TAI) score by comparing nA and nB at the telomere positions of the chromosomes;
step 10: obtaining the large-scale transition (LST) score by comparing the length of each SNP site with preset lengths;
step 11: obtaining the loss of heterozygosity (LOH) score by evaluating the nA and nB values of each SNP site and comparing the site lengths;
step 12: tallying the TAI, LST and LOH scores and then making the determination.
Further, in step 3, the quality score is expressed as:
$$Q = -10 \log_{10}(P)$$
wherein Q represents the quality score in the sequence format and P represents the error probability of each base.
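The relation between Q and P can be checked numerically. The following is a minimal sketch, not part of the claimed method; the function names are hypothetical and chosen only for illustration.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) for a base error probability P in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("P must lie in (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the formula: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

A quality score of 20 therefore corresponds to an error probability of about 0.01, and a score of 30 to about 0.001.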
Further, in step 6, the extracted LRP file includes the LRP values and the corresponding chromosome positions, and the LRP value may be expressed as:
Figure BDA0002623092800000021
where LRP denotes the signal density of the file, X denotes the actual value extracted for each region, i denotes the position of the start of the chromosome, and k denotes the position of the last region in the chromosome;
in step 7, the extracted BAF file includes the BAF values and the corresponding chromosome positions, and the BAF value may be expressed as:
$$\mathrm{BAF} = \frac{X_{AD}}{X_{AP}}$$
wherein BAF represents the allele frequency of the file, $X_{AD}$ represents the number of extracted high-quality non-reference bases, and $X_{AP}$ represents the number of high-quality bases.
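As an illustration of step 7, the sketch below computes a BAF value as the number of high-quality non-reference bases divided by the number of high-quality bases, which is how the embodiment describes the calculation. The function name and the handling of zero coverage are assumptions made for this example.

```python
from typing import Optional


def b_allele_frequency(non_ref_count: int, total_count: int) -> Optional[float]:
    """Return non_ref_count / total_count, or None when no bases are counted.

    non_ref_count: high-quality non-reference bases at the site.
    total_count:   all high-quality bases at the site.
    Returning None for zero coverage is an assumption of this sketch.
    """
    if total_count == 0:
        return None
    if not 0 <= non_ref_count <= total_count:
        raise ValueError("counts must satisfy 0 <= non_ref_count <= total_count")
    return non_ref_count / total_count


if __name__ == "__main__":
    print(b_allele_frequency(12, 40))
```

Because the numerator can never exceed the denominator, the result always lies between 0 and 1.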
Further, in step 8, by recording the corresponding chromosome positions given in the LRP file and the BAF file, the A allele copy number nA and the B allele copy number nB can be obtained at the start point and end point of each chromosome region, nA being expressed as:
Figure BDA0002623092800000032
wherein nA represents the A allele copy number, α and β represent constraint parameters, X is the segmented LRP value, and Y is the segmented BAF value;
nB is expressed as:
Figure BDA0002623092800000033
where nB represents the B allele copy number, α and β represent constraint parameters, X is the segmented LRP value, and Y is the segmented BAF value.
Further, the BAF value ranges from 0 to 1.
Further, in step 9, SNP denotes a single nucleotide polymorphism, and the telomeric allelic imbalance (TAI) score is obtained as follows: each SNP site is evaluated first; if the site is not at a telomere position, its score is 0; if the site is at a telomere position, nA is compared with nB.
In comparing nA with nB, the A allele copy numbers nA occurring at all preset SNP sites form a matrix Ta (Ta1, Ta2, Ta3 … Tan), wherein Ta1 represents a first preset copy number of the A allele, Ta2 represents a second preset copy number of the A allele, Ta3 represents a third preset copy number of the A allele, and Tan represents an nth preset copy number of the A allele;
the B allele copy numbers nB occurring at all preset SNP sites form a matrix Tb (Tb1, Tb2, Tb3 … Tbn), wherein Tb1 represents a first preset copy number of the B allele, Tb2 represents a second preset copy number of the B allele, Tb3 represents a third preset copy number of the B allele, and Tbn represents an nth preset copy number of the B allele;
Ta1 is compared with Tb1, Ta2 with Tb2, Ta3 with Tb3, and Tan with Tbn; if Tai = Tbi (the i-th entries of Ta and Tb), the site score is 1, and if Tai ≠ Tbi, the site score is 0; the telomeric allelic imbalance (TAI) score is calculated over all sites of the chromosomes.
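The per-site decision for the TAI score can be sketched as follows. This is only an illustration that follows the rule exactly as worded above; the `Site` class and its `is_telomeric` flag are hypothetical, since the text does not say how telomere positions are encoded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Site:
    n_a: int            # A allele copy number (nA)
    n_b: int            # B allele copy number (nB)
    is_telomeric: bool  # hypothetical flag: site lies at a telomere position


def tai_site_score(site: Site) -> int:
    """Score one site: 0 if not telomeric, else 1 when nA == nB and 0 otherwise."""
    if not site.is_telomeric:
        return 0
    return 1 if site.n_a == site.n_b else 0


def tai_score(sites: Iterable[Site]) -> int:
    """Sum the per-site scores over all sites of all chromosomes."""
    return sum(tai_site_score(s) for s in sites)


if __name__ == "__main__":
    demo = [Site(1, 1, True), Site(2, 0, True), Site(1, 1, False)]
    print(tai_score(demo))
```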
Further, in step 10, the preset SNP site lengths include a preset first length and a preset second length. The length of each SNP site is first compared with the preset first length: if the site length is greater than or equal to the preset first length, the site score is 0; if the site length is less than the preset first length, the distance from the end point of the preceding site to the start point of the site is compared with the preset second length. If that distance is less than the preset second length, the site score is 1; if it is greater than or equal to the preset second length, the site score is 0. The large-scale transition (LST) score is calculated from these SNP site length relationships.
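A minimal sketch of this LST decision order is given below. It follows the wording of this paragraph literally. The default thresholds reuse the 10 and 3 megabase figures quoted in the embodiment, and scoring the first site of a chromosome as 0 (it has no preceding site) is an assumption of the sketch.

```python
from typing import Dict, Iterable, Tuple

# (chromosome, start, end) of one site; a hypothetical representation.
SiteSpan = Tuple[str, int, int]


def lst_score(
    sites: Iterable[SiteSpan],
    first_length: int = 10_000_000,
    second_length: int = 3_000_000,
) -> int:
    """Count sites that score 1 under the two-threshold rule."""
    score = 0
    previous_end: Dict[str, int] = {}
    for chrom, start, end in sorted(sites):
        prev = previous_end.get(chrom)
        previous_end[chrom] = end
        if prev is None:
            continue  # assumption: no preceding site, so the site scores 0
        if end - start >= first_length:
            continue  # site length >= preset first length: score 0
        if start - prev < second_length:
            score += 1  # distance to preceding site < preset second length
    return score


if __name__ == "__main__":
    demo = [
        ("chr1", 0, 5_000_000),
        ("chr1", 6_000_000, 8_000_000),
        ("chr1", 20_000_000, 35_000_000),
        ("chr2", 0, 1_000_000),
    ]
    print(lst_score(demo))
```

Passing the thresholds as parameters keeps the sketch usable with whatever preset lengths an implementation chooses.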
Further, in step 11, the preset SNP site lengths include a preset third length. nA of each SNP site is evaluated first: if nA is not equal to 0, the site score is 0; if nA is equal to 0, nB is evaluated. If nB is not equal to 0, the site score is 0; if nB is equal to 0, the length of the site is compared with the preset third length. If the site length is greater than the preset third length, the site score is 1; if it is less than or equal to the preset third length, the site score is 0. The loss of heterozygosity (LOH) score is obtained from the relationship between the nA and nB values of each site.
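The layered LOH decision can be written as a short function. The sketch below mirrors the order of checks described in this paragraph; the text gives no value for the preset third length, so it is a required parameter, and the 15,000,000 used in the demonstration is a made-up placeholder.

```python
from typing import Iterable, Tuple

# (nA, nB, site length) for one site; a hypothetical representation.
SiteRecord = Tuple[int, int, int]


def loh_site_score(n_a: int, n_b: int, length: int, third_length: int) -> int:
    """Apply the checks in order: nA, then nB, then the length comparison."""
    if n_a != 0:
        return 0
    if n_b != 0:
        return 0
    return 1 if length > third_length else 0


def loh_score(sites: Iterable[SiteRecord], third_length: int) -> int:
    """Sum the per-site scores."""
    return sum(loh_site_score(a, b, n, third_length) for a, b, n in sites)


if __name__ == "__main__":
    demo = [(0, 0, 20_000_000), (0, 1, 20_000_000), (0, 0, 5_000_000)]
    print(loh_score(demo, third_length=15_000_000))  # placeholder threshold
```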
Further, in step 12, HRD denotes homologous recombination deficiency, and the HRD score is expressed as:
HRD = LOH + TAI + LST
wherein HRD represents the homologous recombination deficiency score, LOH represents the loss of heterozygosity score, LST represents the large-scale transition score, and TAI represents the telomeric allelic imbalance score;
a comparison reference value P is preset for HRD; if HRD is greater than or equal to P, HRD is judged to be positive, and if HRD is less than P, HRD is judged to be negative.
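The final combination and threshold test can be sketched as below. The function name is hypothetical, and the reference value used in the demonstration is an arbitrary placeholder, since the text does not state a value for P.

```python
from typing import Tuple


def hrd_status(loh: int, tai: int, lst: int, reference: int) -> Tuple[int, str]:
    """Return (HRD score, status) with HRD = LOH + TAI + LST.

    The status is "positive" when HRD >= reference and "negative" otherwise.
    """
    hrd = loh + tai + lst
    return hrd, ("positive" if hrd >= reference else "negative")


if __name__ == "__main__":
    # reference=40 is a placeholder chosen only for this demonstration.
    print(hrd_status(loh=15, tai=12, lst=20, reference=40))
```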
Further, in step 1, the raw data obtained are in FASTQ format, and in step 2, the public database resource used is hg38.
Compared with the prior art, the invention has the following advantages. Based on the HRD genomic scar phenomenon, a new identification workflow is established using second-generation sequencing data, and whole genome sequencing is performed on ex vivo body fluids and tissues, which improves the detection accuracy of whole-genome determination. Raw data are obtained by whole genome sequencing of the samples; the probability of incorrect calls is determined from the quality score of each base of the raw data; the raw data are aligned against the chromosome information matrix; duplicate reads are deleted; the LRP and BAF files are extracted; the A allele copy number nA and the B allele copy number nB are calculated from the LRP and BAF files; the telomeric allelic imbalance (TAI) score, the large-scale transition (LST) score and the loss of heterozygosity (LOH) score are calculated from the position and length relationships of nA and nB; and the determination is made after the TAI, LST and LOH scores are tallied. This systematic procedure reduces the errors introduced at each step and thereby improves the robustness of the identification method.
Furthermore, the invention uses the BWA, Picard, GATK and VarScan2 software to process the sequencing data; computing with software reduces human factors and improves the accuracy of the data.
Furthermore, the quality score is used to determine the probability that the corresponding nucleotide call is incorrect for quality control, which reduces errors. By recording the corresponding chromosome positions given in the LRP file and the BAF file, the A allele copy number nA and the B allele copy number nB can be obtained at the start point and end point of each chromosome region. Calculating nA and nB separately and evaluating their values further reduces errors and avoids data inaccuracy caused by uncontrollable factors, and the formula-based calculation improves the adaptability of the data.
Furthermore, in comparing nA with nB, the A allele copy numbers nA occurring at all preset SNP sites form a matrix Ta (Ta1, Ta2, Ta3 … Tan), and the B allele copy numbers nB occurring at all preset SNP sites form a matrix Tb (Tb1, Tb2, Tb3 … Tbn). Ta1 is compared with Tb1, Ta2 with Tb2, Ta3 with Tb3, and Tan with Tbn; if Tai = Tbi, the site score is 1, and if Tai ≠ Tbi, the site score is 0. The scores of all sites of the chromosomes are summed to obtain the telomeric allelic imbalance (TAI) score, and evaluating each site individually further reduces errors.
Furthermore, the preset SNP site lengths include a preset first length and a preset second length, against which the actual length of each SNP site is compared: the site is first compared with the preset first length and then with the preset second length. Calculating the large-scale transition (LST) score from these two length comparisons further reduces errors, improves the accuracy of the data during calculation, and improves the robustness of the calculation system.
Furthermore, the preset SNP site lengths include a preset third length. nA of each SNP site is evaluated first, then nB, and finally the site length is compared with the preset third length; the loss of heterozygosity (LOH) score is obtained from the relationship between the nA and nB values of each site. This layer-by-layer decision procedure prevents a single data point from having a large effect on the overall result: a wrong decision at one node does not greatly affect the data as a whole, which further reduces data errors and improves the robustness of the identification method.
Furthermore, the LOH score, the LST score and the TAI score are combined to obtain the homologous recombination deficiency (HRD) score, which is compared with a preset comparison reference value to determine whether HRD is negative or positive. Combining the three scores and setting a cutoff point to determine whether the gene scar is positive or negative further reduces the possibility of data errors.
Drawings
FIG. 1 is a schematic flow chart of the gene scar for characterizing HRD homologous recombination repair defects and the identification method according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of the telomeric allelic imbalance (TAI) scoring in the method according to an embodiment of the invention;
FIG. 3 is a schematic flow chart of the large-scale transition (LST) scoring in the method according to an embodiment of the invention;
FIG. 4 is a schematic flow chart of the loss of heterozygosity (LOH) scoring in the method according to an embodiment of the invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Referring to FIG. 1, a gene scar for characterizing HRD homologous recombination repair defects and an identification method include the following steps:
step 1: collecting samples and performing whole genome sequencing to obtain raw genetic data;
step 2: establishing chromosome information matrices X (X1, X2, X3 … Xn) by extracting data from a public database, wherein X1 represents a first preset chromosome information matrix, X2 represents a second preset chromosome information matrix, X3 represents a third preset chromosome information matrix, and Xn represents an nth preset chromosome information matrix, and each preset chromosome information matrix comprises the length, start point and end point of each of the 24 chromosomes, and the position, start point and end point of the centromere;
step 3: data quality control, in which the quality score of each base is calculated to determine the probability that the corresponding nucleotide call is incorrect;
step 4: sequence alignment, in which the reads are aligned against the chromosome information matrix;
step 5: deleting duplicate reads, in which duplicates are identified from the start coordinates and orientations of the two reads of a read pair, based on identical 5' mapping coordinates, and are deleted;
step 6: extracting the LRP file by comparing the observed signal of the two genome copies with the expected signal;
step 7: extracting a BAF file by comparing the number of high-quality non-reference bases with the number of high-quality bases;
step 8: preprocessing, in which the LRP and BAF files generated in steps 6 and 7 are preprocessed, averaged segments are generated at the SNP sites, the chromosome position of each site is recorded, and the A allele copy number nA and the B allele copy number nB are obtained at the start point and end point of each region;
step 9: obtaining the telomeric allelic imbalance (TAI) score by comparing nA and nB at the telomere positions of the chromosomes;
step 10: obtaining the large-scale transition (LST) score by comparing the length of each SNP site with preset lengths;
step 11: obtaining the loss of heterozygosity (LOH) score by evaluating the nA and nB values of each SNP site and comparing the site lengths;
step 12: tallying the TAI, LST and LOH scores and then making the determination.
Specifically, in the embodiment of the present invention, the raw data obtained in step 1 are in FASTQ format. FASTQ is the de facto file format for sequence reads generated by second-generation sequencing technologies. It evolved from FASTA and contains both sequence data and quality information. Like FASTA, each FASTQ record begins with a header line. The difference is that the FASTQ header line starts with the @ character, and the fourth line of each record contains characters that encode the quality of each nucleotide in the read.
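To make the four-line record layout concrete, the following sketch reads FASTQ records from a text stream and decodes the quality line. It assumes the common Phred+33 character encoding, which the text does not specify, and the function name is hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, quality scores) for each four-line record.

    offset=33 assumes Phred+33 encoding; other encodings need another offset.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, [ord(c) - offset for c in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for name, seq, quals in read_fastq(example):
        print(name, seq, quals)
```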
Specifically, in the embodiment of the present invention, in step 2, a chromosome information matrix X (X1, X2, X3 … Xn) is established, where X1 represents a first preset chromosome information matrix, X2 represents a second preset chromosome information matrix, X3 represents a third preset chromosome information matrix, and Xn represents an nth preset chromosome information matrix, and each preset chromosome information matrix includes the length, start point and end point of each of the 24 chromosomes, and the position, start point and end point of the centromere. The genome build used for the chromosome information matrix in the invention is hg38, but other builds may also be used; the invention places no limitation on the genome build, which depends on the specific implementation.
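One possible in-memory layout for such a chromosome information matrix is sketched below. The class, field names and helper are hypothetical, and the numbers in the demonstration are toy placeholders, not hg38 coordinates.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ChromosomeInfo:
    length: int
    start: int
    end: int
    centromere_start: int
    centromere_end: int


# Chromosomes 1-22 plus X and Y give the 24 chromosomes mentioned in the text.
CHROMOSOME_NAMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def build_matrix(rows: Dict[str, Tuple[int, int, int]]) -> Dict[str, ChromosomeInfo]:
    """Build the table from {name: (length, centromere_start, centromere_end)}."""
    matrix = {}
    for name, (length, cen_start, cen_end) in rows.items():
        if not 0 <= cen_start < cen_end <= length:
            raise ValueError(f"inconsistent centromere coordinates for {name}")
        # 1-based inclusive start and end are an assumption of this sketch.
        matrix[name] = ChromosomeInfo(length, 1, length, cen_start, cen_end)
    return matrix


# Toy placeholder values only; a real table would be read from the public
# database for the chosen genome build.
toy_rows = {name: (1_000_000, 400_000, 450_000) for name in CHROMOSOME_NAMES}
toy_matrix = build_matrix(toy_rows)

if __name__ == "__main__":
    print(len(toy_matrix), toy_matrix["chr1"])
```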
Specifically, in the embodiment of the present invention, in step 3, the quality score is logarithmic and is calculated by the FastQC software; the formula for the quality score is:
$$Q = -10 \log_{10}(P)$$
wherein Q represents the quality score in the sequence format and P represents the error probability of each base.
Specifically, in the embodiment of the present invention, the preset value of Q is 20, which corresponds to a base error probability of 0.01, so that only bases with an accuracy of at least 99% are retained and bases with a quality score Q below 20 are removed. The preset value of Q may also be 10 or 30; the invention does not limit the specific value of Q, which depends on the specific implementation.
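The effect of the Q threshold can be illustrated with a small filter. The text says only that low-quality bases are removed; replacing them with N, as done here, is one possible convention and an assumption of this sketch, as is the function name.

```python
from typing import Sequence


def mask_low_quality(sequence: str, qualities: Sequence[int], min_q: int = 20) -> str:
    """Replace every base whose quality score is below min_q with 'N'."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base if q >= min_q else "N" for base, q in zip(sequence, qualities)
    )


if __name__ == "__main__":
    print(mask_low_quality("ACGT", [40, 19, 20, 2]))
```

Changing `min_q` to 10 or 30 gives the looser and stricter settings mentioned above.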
Specifically, in the embodiment of the present invention, in step 4, the BWA software is used to align the reads to the reference genome; other alignment software may be used in place of BWA. Either the hg38 or the hg19 genome build may be selected as the reference, and BWA aligns the reads against the reference genome to generate a BAM file.
Specifically, in the embodiment of the present invention, in step 5, because of errors in sample or library preparation, multiple reads may originate from the exact same input DNA template and pile up at the same start position on the reference genome. Any sequencing errors are thereby multiplied and may lead to artifacts in downstream variant calling. Although duplicate reads may represent true DNA material, they cannot be distinguished from PCR artifacts, which result from uneven amplification of DNA fragments. To reduce this harmful effect of duplicates before variant discovery, a duplicate-removal application based on the Picard MarkDuplicates tool is run. To identify duplicates, Picard MarkDuplicates uses the start coordinates and orientations of the two reads of a read pair. Based on identical 5' mapping coordinates, it discards all duplicates except the "best" copy.
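The idea of grouping read pairs by their 5' coordinates and orientations can be sketched in a few lines. This is a simplified illustration, not Picard's implementation: the `ReadPair` fields are hypothetical, both mates are assumed to map to the same chromosome, and choosing the "best" copy by the highest summed base quality is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class ReadPair:
    name: str
    chrom: str
    pos1: int              # 5' mapping coordinate of the first read
    strand1: str           # "+" or "-"
    pos2: int              # 5' mapping coordinate of the second read
    strand2: str
    base_quality_sum: int  # hypothetical score used to pick the best copy


def remove_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep one pair per (coordinates, orientations) key, preferring higher quality."""
    best: Dict[Tuple[str, int, str, int, str], ReadPair] = {}
    for pair in pairs:
        key = (pair.chrom, pair.pos1, pair.strand1, pair.pos2, pair.strand2)
        kept = best.get(key)
        if kept is None or pair.base_quality_sum > kept.base_quality_sum:
            best[key] = pair
    return list(best.values())


if __name__ == "__main__":
    demo = [
        ReadPair("a", "chr1", 100, "+", 300, "-", 100),
        ReadPair("b", "chr1", 100, "+", 300, "-", 150),
        ReadPair("c", "chr1", 100, "+", 310, "-", 90),
    ]
    print([p.name for p in remove_duplicates(demo)])
```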
Specifically, in the embodiment of the present invention, in step 6, LRP represents the signal density of the file, a normalized measure of the signal intensity at each SNP marker, where SNP denotes a single nucleotide polymorphism. It is calculated as the log2 of the ratio of the observed signal to the expected signal for two genomic copies. After normalization, the signal is expected to be centered around 0 when a region is present in two copies. A higher value may indicate a duplication event, and a lower value may indicate a deletion. The extracted LRP file includes the LRP values and the corresponding chromosome positions, and the LRP value may be expressed as:
Figure BDA0002623092800000091
where LRP denotes the signal density of the file, and X denotes the actual value extracted for each region. i denotes the position of the start of the chromosome and k denotes the position of the last region in the chromosome.
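The log2 ratio described in the prose can be illustrated as follows. This sketch covers only the observed-to-expected ratio; it is not a reconstruction of the formula shown in the figure, and the function name and inputs are hypothetical.

```python
import math


def log2_ratio(observed: float, expected: float) -> float:
    """Return log2(observed / expected) for positive signal values."""
    if observed <= 0 or expected <= 0:
        raise ValueError("signals must be positive")
    return math.log2(observed / expected)


if __name__ == "__main__":
    # Equal signals give 0; more signal is positive, less signal is negative.
    for observed in (50.0, 100.0, 150.0):
        print(observed, log2_ratio(observed, 100.0))
```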
Specifically, in the embodiment of the present invention, in step 7, BAF represents the B allele frequency of the file, and the BAF value lies between 0 and 1. The B allele frequency at a heterozygous site can be calculated by dividing the number of high-quality non-reference bases by the number of high-quality bases, and the extracted BAF file contains the BAF values and the corresponding chromosome positions. The BAF value may be expressed as:
$$\mathrm{BAF} = \frac{X_{AD}}{X_{AP}}$$
wherein BAF represents the allele frequency of the file, $X_{AD}$ represents the number of high-quality non-reference bases extracted from the BAM file, and $X_{AP}$ represents the number of high-quality bases.
Specifically, in the embodiment of the present invention, in step 8, segmented LRP and BAF values (averaged for each sample) are generated. This step generates averaged segments at the SNP sites, records the chromosome position of each site, and obtains, at the start point and end point of each region, the A allele copy number nA, the B allele copy number nB, and the ploidy (the chromosome multiple). nA can be expressed as:
Figure BDA0002623092800000101
wherein nA represents the A allele copy number, and α and β represent constraint parameters.
nB can be expressed as:
Figure BDA0002623092800000102
where nB represents the B allele copy number, α and β represent constraint parameters, X is the segmented LRP value, and Y is the segmented BAF value.
Specifically, in the embodiment of the present invention, α = 1, β = 0.1, nA and nB are integers, and ploidy is the chromosome multiple.
Referring to FIG. 2, in step 9, TAI denotes telomeric allelic imbalance. For each chromosome, nA at the telomere position is compared with nB; if they are equal, the site is recorded as 1, otherwise as 0, and the TAI score of the sample is the number of sites recorded as 1. Each SNP site is evaluated first: if the site is not at a telomere position, its score is 0; if the site is at a telomere position, nA is compared with nB, and the score is 1 if nA equals nB and 0 if nA does not equal nB. In comparing nA with nB, the A allele copy numbers nA occurring at all preset SNP sites form a matrix Ta (Ta1, Ta2, Ta3 … Tan), wherein Ta1 represents a first preset copy number of the A allele, Ta2 represents a second preset copy number of the A allele, Ta3 represents a third preset copy number of the A allele, and Tan represents an nth preset copy number of the A allele. The B allele copy numbers nB occurring at all preset SNP sites form a matrix Tb (Tb1, Tb2, Tb3 … Tbn), where Tb1 represents a first preset copy number of the B allele, Tb2 represents a second preset copy number of the B allele, Tb3 represents a third preset copy number of the B allele, and Tbn represents an nth preset copy number of the B allele. Ta1 is compared with Tb1, Ta2 with Tb2, Ta3 with Tb3, and Tan with Tbn; if Tai = Tbi, the site score is 1.
Referring to FIG. 3, in step 10, LST denotes the large-scale transition score. In each sample, for each SNP site, if the length of the site is greater than 10 Mb and the length from the end point of the previous site to the start point of the site is less than 3 Mb, the site is recorded as 1; otherwise it is recorded as 0. The LST score of the sample is the number of sites recorded as 1. A preset first length and a preset second length are defined for the site lengths of each SNP.
First, the length of each site of each SNP is compared with the preset first length. If the site length is greater than or equal to the preset first length, the site score is 0. If the site length is smaller than the preset first length, the length from the end point of the previous site to the start point of the site is compared with the preset second length: if it is smaller than the preset second length, the site score is 1; if it is greater than or equal to the preset second length, the site score is 0.
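The two-stage comparison in the preceding paragraph can be sketched as below. It follows the flow as worded there (site length against the preset first length, then the gap to the previous site against the preset second length). The `Segment` record and function names are hypothetical, both preset lengths are passed in by the caller, and scoring a site with no previous site on the same chromosome as 0 is an assumption the text does not state.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Segment:
    """Hypothetical record for one site/region (names are illustrative only)."""
    chrom: str
    start: int  # start point, in bases
    end: int    # end point, in bases


def lst_site_score(seg: Segment, prev: Optional[Segment],
                   first_len: int, second_len: int) -> int:
    """Score one site following the flow described for step 10."""
    length = seg.end - seg.start
    if length >= first_len:
        return 0
    if prev is None or prev.chrom != seg.chrom:
        return 0  # assumption: no previous site on this chromosome scores 0
    gap = seg.start - prev.end  # end of previous site to start of this site
    return 1 if gap < second_len else 0


def lst_score(segments: Sequence[Segment],
              first_len: int, second_len: int) -> int:
    """LST score of a sample; segments are assumed sorted by chromosome and position."""
    total = 0
    prev: Optional[Segment] = None
    for seg in segments:
        total += lst_site_score(seg, prev, first_len, second_len)
        prev = seg
    return total
```

The 10 Mb and 3 Mb values mentioned in the description would be supplied as `first_len=10_000_000` and `second_len=3_000_000`.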
Referring to FIG. 4, in step 11, LOH denotes loss of heterozygosity. The LOH score is obtained from the nA and nB values of each SNP site in each sample together with the length of the site. First, nA of each site of each SNP is examined: if nA is not equal to 0, the site score is 0. If nA is equal to 0, nB is examined: if nB is not equal to 0, the site score is 0. If nB is also equal to 0, the length of the site is compared with a preset third length: if the site length is greater than the preset third length, the site score is 1; if it is less than or equal to the preset third length, the site score is 0.
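A minimal sketch of the step 11 decision chain, as worded above, is given below. The `Region` record and function names are hypothetical. The text does not give a value for the preset third length, so it is left as a required parameter.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Region:
    """Hypothetical record for one SNP site (names are illustrative only)."""
    n_a: int     # copy number of the A allele (nA)
    n_b: int     # copy number of the B allele (nB)
    length: int  # site length, in bases


def loh_site_score(region: Region, third_len: int) -> int:
    """Score one site following the order of checks described for step 11."""
    if region.n_a != 0:
        return 0
    if region.n_b != 0:
        return 0
    # Both checks passed: compare the site length with the preset third length.
    return 1 if region.length > third_len else 0


def loh_score(regions: Iterable[Region], third_len: int) -> int:
    """LOH score of a sample: the number of sites scored 1."""
    return sum(loh_site_score(r, third_len) for r in regions)
```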
Specifically, in this embodiment, in step 12 the scores for telomere allele imbalance (TAI), large-scale transition (LST), and loss of heterozygosity (LOH) are tallied and evaluated as follows:
HRD=LOH+TAI+LST
wherein HRD represents the homologous recombination defect score, LOH represents the heterozygosity loss score, LST represents the large-scale transition score, and TAI represents the telomere allele imbalance score. The HRD score is the sum of the three.
A comparison reference value P is preset: if HRD is greater than or equal to P, the result is judged HRD-positive; if HRD is less than P, the result is judged HRD-negative.
In this embodiment, P is set to 43; other values may also be used, depending on the specific implementation. HRD is computed here by summation, but a weighted average may be used instead. The invention does not limit the specific calculation method or the specific value of the preset comparison reference value; if a weighted average is used, the preset comparison reference value is adjusted accordingly.
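The summation and threshold test of step 12 reduce to a few lines. The sketch below uses the reference value P = 43 from this embodiment as a default; the function names and the returned strings are illustrative assumptions.

```python
def hrd_score(loh: int, tai: int, lst: int) -> int:
    """HRD = LOH + TAI + LST, as stated for step 12."""
    return loh + tai + lst


def hrd_status(hrd: int, p: int = 43) -> str:
    """Judge the sample against the preset comparison reference value P."""
    return "positive" if hrd >= p else "negative"


# Illustrative values only, not measured data.
example = hrd_score(loh=15, tai=14, lst=14)
print(example, hrd_status(example))
```

If a weighted average were used instead of a plain sum, both `hrd_score` and the default `p` would need to change together, as the description notes.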
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A gene scar for representing HRD homologous recombination repair defect and an identification method, characterized by comprising the following steps:
step 1: collecting samples and performing whole genome sequencing to obtain raw gene data;
step 2: establishing chromosome information matrices X (X1, X2, X3 … Xn) by extracting a public database, wherein X1 represents a first preset chromosome information matrix, X2 represents a second preset chromosome information matrix, X3 represents a third preset chromosome information matrix, and Xn represents an nth preset chromosome information matrix, and each preset chromosome information matrix comprises the length, start point, and end point of the 24 chromosomes, and the position, start point, and end point of the centromere;
step 3: data quality control, namely calculating the quality score of each gene to determine the probability that the corresponding nucleotide call is incorrect;
step 4: sequence alignment, namely aligning the reads against the chromosome information matrix;
step 5: deleting duplicate reads, namely identifying as duplicates the read pairs whose two reads have the same 5' mapping coordinates and orientations, and deleting them;
step 6: extracting the LRP file by comparing the observed signal of the two genome copies with the expected signal;
step 7: extracting the BAF file by comparing the number of high-quality non-reference bases with the number of high-quality bases;
step 8: preprocessing, namely preprocessing the LRP and BAF files generated in steps 6 and 7, generating averaged fragments at the SNP sites, recording the chromosome position of each site, and obtaining the A allele copy number nA and the B allele copy number nB at the start point and end point of each region;
step 9: obtaining the telomere allele imbalance (TAI) score by comparing nA and nB at the telomere positions of the chromosomes;
step 10: obtaining the large-scale transition (LST) score by comparing the length of each SNP site with preset lengths;
step 11: obtaining the loss of heterozygosity (LOH) score by examining the nA and nB values of each SNP site and comparing the site lengths;
step 12: tallying the TAI, LST, and LOH scores and making the determination for the gene;
in step 6, the extracted LRP file includes LRP values and corresponding chromosome positions, where the LRP values are expressed as:
Figure FDA0002968444570000021
where LRP denotes the signal density of the file, X denotes the actual value extracted for each region, i denotes the start position of the chromosome, and k denotes the position of the last region in the chromosome;
in step 7, the extracted BAF file includes BAF values and corresponding chromosome positions, and the BAF values are expressed as:
Figure FDA0002968444570000022
wherein BAF represents the allele frequency of the file, X_AD denotes the number of extracted high-quality non-reference bases, and X_AP denotes the number of high-quality bases;
in step 8, using the corresponding chromosome positions recorded in the LRP file and the BAF file, nA and nB are obtained at the start point and end point of each chromosome region, and nA is expressed as:
Figure FDA0002968444570000023
wherein nA represents the copy number of the A allele, alpha, beta represent constraint parameters, X is the fragmented LRP value, and Y is the fragmented BAF value;
nB is represented as:
Figure FDA0002968444570000024
wherein nB is the B allele copy number, α, β represent the constraint parameter, X is the fragmented LRP value, and Y is the fragmented BAF value;
in step 9, SNP denotes single nucleotide polymorphism, and the telomere allele imbalance (TAI) score is obtained as follows: each SNP site is checked first; if the site is not at a telomere position, it is scored 0, and if it is at a telomere position, nA is compared with nB;
for the comparison of nA with nB, the A allele copy numbers nA at all sites of the preset SNPs form a matrix Ta (Ta1, Ta2, Ta3 … Tan), wherein Ta1 represents a first preset copy number of the A allele, Ta2 represents a second preset copy number of the A allele, Ta3 represents a third preset copy number of the A allele, and Tan represents an nth preset copy number of the A allele;
the B allele copy numbers nB at all sites of the preset SNPs form a matrix Tb (Tb1, Tb2, Tb3 … Tbn), wherein Tb1 represents a first preset copy number of the B allele, Tb2 represents a second preset copy number of the B allele, Tb3 represents a third preset copy number of the B allele, and Tbn represents an nth preset copy number of the B allele;
Ta1 is compared with Tb1, Ta2 with Tb2, Ta3 with Tb3, and Tan with Tbn; if Tai = Tbi, the site score is 1, and if Tai ≠ Tbi, the site score is 0; the TAI score is obtained by evaluating all sites of the chromosomes;
in step 10, a preset first length and a preset second length are defined for the site lengths of the preset SNPs; the length of each site of each SNP is first compared with the preset first length; if the site length is greater than or equal to the preset first length, the site score is 0; if the site length is smaller than the preset first length, the length from the end point of the previous site to the start point of the site is compared with the preset second length; if that length is less than the preset second length, the site score is 1, and if it is greater than or equal to the preset second length, the site score is 0; the large-scale transition (LST) score is calculated from these relationships among the SNP site lengths.
2. The gene scar for representing HRD homologous recombination repair defect and identification method according to claim 1, wherein in step 3, the quality score of the gene is expressed as:
Q = -10 × log10(P)
wherein Q represents a sequence format quality score and P represents the probability of each base error.
3. The gene scar for representing HRD homologous recombination repair defect and identification method according to claim 2, wherein the BAF value ranges from 0 to 1.
4. The gene scar for representing HRD homologous recombination repair defect and identification method according to claim 3, wherein in step 11, a preset third length is defined for the site lengths of the preset SNPs; first, nA of each site of each SNP is examined; if nA is not equal to 0, the site score is 0; if nA is equal to 0, nB is examined; if nB is not equal to 0, the site score is 0; if nB is equal to 0, the length of the site is compared with the preset third length; if the site length is greater than the preset third length, the site score is 1, and if the site length is less than or equal to the preset third length, the site score is 0; the loss of heterozygosity (LOH) score is obtained from the relationship between nA and nB at each site.
5. The gene scar for representing HRD homologous recombination repair defect and identification method according to claim 4, wherein in step 12, HRD denotes homologous recombination defect, and the HRD score is expressed as:
HRD=LOH+TAI+LST
wherein, HRD represents homologous recombination defect score, LOH represents heterozygosity loss score, LST represents large-scale transition score, and TAI represents telomere allele imbalance score;
presetting a comparison reference value of HRD as P, if the HRD is more than or equal to P, judging that the HRD is positive, and if the HRD is less than P, judging that the HRD is negative.
6. The gene scar for representing HRD homologous recombination repair defect and identification method according to claim 1, wherein the raw data obtained in step 1 is in the form of a FASTQ file, and the public database extracted in step 2 is hg38.
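As an illustration of the quality score in claim 2, the relation Q = -10 × log10(P) and its inverse can be sketched as follows. The function names are hypothetical. The decoding helper assumes the common Phred+33 ASCII encoding of FASTQ quality strings, which the claims do not specify.

```python
import math
from typing import List


def phred_quality(p_error: float) -> float:
    """Q = -10 * log10(P), where P is the per-base error probability."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_fastq_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding by default."""
    return [ord(ch) - offset for ch in quality_line]
```

For example, an error probability of 0.001 corresponds to a quality score of about 30, and a score of 20 corresponds to an error probability of about 0.01.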
CN202010789009.0A 2020-08-07 2020-08-07 Gene scar for representing HRD homologous recombination repair defect and identification method Active CN111883211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010789009.0A CN111883211B (en) 2020-08-07 2020-08-07 Gene scar for representing HRD homologous recombination repair defect and identification method

Publications (2)

Publication Number Publication Date
CN111883211A CN111883211A (en) 2020-11-03
CN111883211B true CN111883211B (en) 2021-04-23

Family

ID=73211843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010789009.0A Active CN111883211B (en) 2020-08-07 2020-08-07 Gene scar for representing HRD homologous recombination repair defect and identification method

Country Status (1)

Country Link
CN (1) CN111883211B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant