WO2019242187A1 - Method and apparatus for detecting chromosomal copy number variations, and storage medium - Google Patents

Info

Publication number
WO2019242187A1
WO2019242187A1 PCT/CN2018/111958 CN2018111958W WO2019242187A1 WO 2019242187 A1 WO2019242187 A1 WO 2019242187A1 CN 2018111958 W CN2018111958 W CN 2018111958W WO 2019242187 A1 WO2019242187 A1 WO 2019242187A1
Authority
WO
WIPO (PCT)
Prior art keywords
chromosome
mer
specific
standard
occurrences
Prior art date
Application number
PCT/CN2018/111958
Other languages
French (fr)
Chinese (zh)
Inventor
孙亚洲
肖贡
陈斌
杜刘稳
牛团结
陈杰
Original Assignee
深圳市达仁基因科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市达仁基因科技有限公司
Publication of WO2019242187A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present application relates to a method, a device, a computer device, and a storage medium for detecting an abnormality in chromosome copy number.
  • a method, a device, and a storage medium for detecting an abnormality in chromosome copy number are provided.
  • a method for detecting chromosome copy number abnormalities including:
  • a specific k-mer corresponding to each chromosome contained in the target species stored in the target database is obtained, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k;
  • a copy number of each specific k-mer is obtained from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
  • a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.
  • a device for detecting abnormal chromosome copy number includes:
  • a specific k-mer acquisition module configured to acquire sequencing data of a sample to be detected as the data to be detected, determine a target species corresponding to the data to be detected, and acquire a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that meets a preset specificity condition, and a k-mer refers to a genomic sequence of length k;
  • an actual occurrence number acquisition module configured to obtain the actual number of occurrences, in the data to be detected, of the specific k-mers included in each chromosome;
  • a copy number obtaining module configured to obtain a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
  • a determination module configured to calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer, and to determine that a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is a chromosome with abnormal copy number.
  • a computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • a specific k-mer corresponding to each chromosome contained in the target species stored in the target database is obtained, where the specific k-mer is a k-mer in each chromosome that meets a preset specificity condition, and a k-mer refers to a genomic sequence of length k;
  • a copy number of each specific k-mer is obtained from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
  • a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • a specific k-mer corresponding to each chromosome contained in the target species stored in the target database is obtained, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k;
  • a copy number of each specific k-mer is obtained from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
  • a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.
  • FIG. 1 is a schematic flowchart of a method for detecting a chromosome copy number abnormality according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of the steps before step 102 according to one or more embodiments.
  • FIG. 3 is a schematic flowchart of the steps before step 102 according to another embodiment.
  • FIG. 4 is a list of copy numbers of specific k-mers of chromosome X according to one or more embodiments.
  • FIG. 5 is a schematic flowchart of step 110 according to one or more embodiments.
  • FIG. 6 is a schematic flowchart of a method for detecting an abnormal chromosome copy number according to one or more embodiments, which further includes other steps.
  • FIG. 7A is a standard signal intensity recording table of a chromosome in a normal male sample according to one or more embodiments.
  • FIG. 7B is a standard signal intensity recording table of a chromosome in a normal female sample according to one or more embodiments.
  • FIG. 8 is a schematic flowchart of step 610 according to one or more embodiments.
  • FIG. 9A is a distribution table of the standard signal intensities of chromosomes in normal male samples at the preset confidence value P in one or more embodiments.
  • FIG. 9B is a distribution table of the standard signal intensities of chromosomes in normal female samples at the preset confidence value P in one or more embodiments.
  • FIG. 10 is a schematic flowchart of step 610 according to another embodiment.
  • FIG. 11 is a schematic flowchart of a method for detecting an abnormal chromosome copy number according to one or more other embodiments, which further includes other steps.
  • FIG. 12 is a schematic flowchart of the steps before step 102 according to still another embodiment.
  • FIG. 13 is a table showing the actual number of occurrences of the specific k-mer of a specific chromosome according to one or more embodiments.
  • FIG. 14 is a schematic flowchart of a method for detecting an abnormality of a chromosome copy number according to one or more other embodiments.
  • FIG. 15 is a schematic flowchart of step 1402 according to one or more embodiments.
  • FIG. 16 is a table of human chromosome copy numbers in accordance with one or more embodiments.
  • FIG. 17 is a schematic flowchart of step 1404 according to one or more embodiments.
  • FIG. 18 is a single-copy signal strength calculation table for a specific chromosome according to one or more embodiments.
  • FIG. 19 is a single copy signal intensity recording table for each chromosome according to one or more embodiments.
  • FIG. 20 is a calculation table of actual signal intensities of individual chromosomes according to one or more embodiments.
  • FIG. 21 is a block diagram of an apparatus for detecting an abnormality in chromosome copy number according to one or more embodiments.
  • FIG. 22 is a block diagram of a computer device in accordance with one or more embodiments.
  • a method for detecting an abnormal chromosome copy number which includes the following steps:
  • Step 102 Obtain sequencing data of the sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected.
  • the data to be detected refers to the data output after the sequence of a biomolecule contained in a sample is read by a DNA sequencer, an RNA sequencer, or a protein sequencing device.
  • DNA sequencing is the process of determining the exact sequence of nucleotides within a DNA molecule. It includes any method or technique for determining the order of the four bases adenine, guanine, cytosine, and thymine in a DNA strand.
  • a sequencer is an instrument capable of measuring the sequence of an input sample. The sequence measured here includes not only DNA sequences but also sequences of other substances such as proteins and RNA. Samples can be in the form of a drop of blood, a sputum sample, a handful of soil, and so on.
  • The target species is the species to which the data to be detected belongs. For example, when the sequencing data is a human gene sequence, the target species is human.
  • Step 104 Obtain a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k.
  • Each target species contains one or more individuals. Each individual contains one or more genomes, and each genome contains one or more chromosomes. Therefore, each target species contains multiple chromosomes.
  • the target database may store a feature target sequence set previously established for each chromosome, and the feature target sequence set corresponding to each chromosome may include a specific k-mer corresponding to each chromosome.
  • the specific k-mer refers to a k-mer selected from the k-mers contained in each chromosome and meeting a preset specificity condition, that is, a specific k-mer corresponding to each chromosome.
  • the preset specific condition is a condition set by a technician in advance for selecting a matching k-mer. The preset specific condition may be determined according to a technician's consideration or an actual project requirement.
  • k-mer refers to a genomic sequence of length k, where k is a natural number. If genomic data contains a distinct deterministic characters, then for a given k there can be at most a^k (a to the power of k) distinct k-mers.
  • For nucleic acid sequences, deterministic characters refer to the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, deterministic characters are the defined amino acid characters.
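  • As an illustration of the k-mer definition, the following minimal Python sketch counts the possible k-mers for an alphabet of a given size and lists the k-mers of a short sequence. The function names and the example sequence are hypothetical and are not taken from the application.

```python
def count_possible_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound on distinct k-mers: alphabet_size to the power of k."""
    return alphabet_size ** k


def kmers(sequence: str, k: int):
    """Yield every substring of length k, sliding one position at a time."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


# With the four DNA bases and k = 3 there are 4^3 possible k-mers.
print(count_possible_kmers(4, 3))   # 64
print(list(kmers("ATCGA", 3)))      # ['ATC', 'TCG', 'CGA']
```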
  • Step 106 Obtain the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected.
  • The data to be detected can be compared with each chromosome separately; that is, the number of times each specific k-mer in the feature target sequence set of each chromosome appears in the data to be detected is counted. This count is the actual number of occurrences of each specific k-mer in the data to be detected.
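  • The counting in step 106 can be sketched as below. This is a minimal illustration that assumes the data to be detected is available as a list of read strings and that all specific k-mers of a chromosome share the same length k; the function name and the example reads are hypothetical.

```python
def count_specific_kmers(reads, specific_kmers, k):
    """Count how often each specific k-mer appears across all reads."""
    wanted = set(specific_kmers)
    # Start every specific k-mer at zero so absent k-mers are still reported.
    counts = {kmer: 0 for kmer in wanted}
    for read in reads:
        for start in range(len(read) - k + 1):
            window = read[start:start + k]
            if window in wanted:
                counts[window] += 1
    return counts


# Illustrative reads only, not data from the application.
reads = ["ATCGATCG", "GGATC"]
print(count_specific_kmers(reads, {"ATC", "TTT"}, 3))
```

  • Only windows that match a specific k-mer are counted, so the work grows with the size of the sequencing data and not with the size of the whole genome.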
  • Step 108 Obtain a copy number of each specific k-mer from the target database.
  • the copy number of each specific k-mer refers to the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the least number of occurrences on the chromosome.
  • Step 110 Calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer.
  • the actual signal intensity of each specific k-mer can be calculated according to these two parameters.
  • For each specific k-mer, the ratio of its actual number of occurrences to its copy number (the two quantities denoted Ci and Fi) can be calculated, and this ratio is used as the adjusted number of occurrences of that specific k-mer. In this way, the adjusted number of occurrences of every specific k-mer contained in each chromosome can be calculated. The average of the adjusted numbers of occurrences of the specific k-mers in each chromosome is then calculated and used as the single-copy signal intensity E of the corresponding chromosome.
  • Step 112 Determine a chromosome whose actual signal strength is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.
  • the standard confidence interval refers to the standard signal intensity interval calculated in advance based on a large number of samples.
  • The standard signal intensity is calculated in the same way as the actual signal intensity. The difference is that the standard signal intensity is calculated from the data of a standard test sample, which is a sample confirmed to have no abnormal chromosome copy number, whereas the actual signal intensity is calculated from the data to be detected.
  • If the actual signal intensity of a chromosome is within the standard confidence interval of the corresponding chromosome, it can be judged that the chromosome has no copy number abnormality; otherwise, it can be judged that the chromosome has a copy number abnormality.
  • a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome can be determined as a chromosome with an abnormal copy number.
  • The actual signal intensity of each chromosome is compared with the standard confidence interval of the corresponding chromosome. For example, the actual signal intensity of chromosome 1 is compared with the pre-established standard confidence interval of chromosome 1, and the actual signal intensity of chromosome 2 is compared with the pre-established standard confidence interval of chromosome 2.
  • the actual signal intensity of each chromosome can be compared with the standard confidence interval of the corresponding chromosome, and the chromosome that is not within the standard confidence interval of the corresponding chromosome can be determined as a chromosome with abnormal copy number.
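  • The comparison in step 112 amounts to an interval check per chromosome. The sketch below assumes the actual signal intensities and the standard confidence intervals are held in dictionaries keyed by chromosome name; the function name, the keys, and the numeric values are hypothetical and serve only as an illustration.

```python
def find_abnormal_chromosomes(actual_intensity, standard_intervals):
    """Return chromosomes whose actual signal intensity lies outside [LB, UB]."""
    abnormal = []
    for chromosome, intensity in actual_intensity.items():
        lower, upper = standard_intervals[chromosome]
        if not (lower <= intensity <= upper):
            abnormal.append(chromosome)
    return abnormal


# Illustrative values only, not data from the application.
actual = {"chr1": 0.3, "chr2": -0.8, "chr21": 3.4}
intervals = {"chr1": (-2.0, 2.0), "chr2": (-2.0, 2.0), "chr21": (-2.0, 2.0)}
print(find_abnormal_chromosomes(actual, intervals))  # ['chr21']
```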
  • In this method of detecting chromosome copy number abnormalities, the data to be detected is compared only with the feature target sequences of each chromosome of the target species, that is, the specific k-mers, which are only a part of the whole genome of the target species. Comparing against the specific k-mers therefore reduces the comparison space, thereby shortening the analysis time and improving the detection efficiency.
  • the specific k-mer refers to the k-mer in a chromosome whose appearance frequency in the genome occurrence number index table corresponding to the chromosome meets a preset error condition.
  • the set of characteristic target sequences corresponding to each chromosome includes a specific k-mer in each chromosome that satisfies a predetermined specificity condition.
  • the preset specificity condition is that the occurrence number of a k-mer of a chromosome in the genome occurrence number index table corresponding to that chromosome meets a preset error condition.
  • the preset error condition refers to the error condition preset by the technician according to the actual project requirements.
  • the error condition can be a range; that is, a k-mer selected as a specific k-mer is allowed a certain error and does not have to satisfy a strict, exact condition.
  • For each chromosome, there is a genome occurrence number index table corresponding to that chromosome.
  • From the genome occurrence number index table corresponding to each chromosome, the number of genomes in which each k-mer contained in the chromosome has appeared can be obtained. The k-mers of the chromosome whose occurrence number in the genome occurrence number index table meets the preset error condition can then be selected and used as the specific k-mers.
  • the method, before step 102, further includes the following steps: generating a genome occurrence number index table corresponding to each chromosome, where the table records, for each k-mer, the number of genomes of the corresponding chromosome that contain the k-mer; and storing the genome occurrence number index table in the feature target sequence set corresponding to the chromosome.
  • the genome is all the genetic information in an organism. This genetic information is stored in the form of a nucleotide sequence.
  • the sum of the genetic material in a complete monomer of an organism is the genome.
  • an individual's complete genome can contain multiple chromosomes, and each chromosome can contain multiple k-mers.
  • The term "chromosome genome", as commonly used in the art, is used here to refer to the sum of all sequences contained in a complete chromosome.
  • The genome occurrence number index table corresponding to each chromosome records, for each k-mer, the number of genomes of the chromosome to which the k-mer belongs that contain the k-mer.
  • the genome occurrence number index table corresponding to each chromosome can be stored in the feature target sequence set corresponding to that chromosome, that is, in the target database. After storage, the data in the genome occurrence number index table can be retrieved from the target database whenever needed, which improves the detection efficiency.
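  • A genome occurrence number index table and the selection of specific k-mers from it might be sketched as follows. The source does not state the concrete error condition, so the `max_missing` tolerance used here (a k-mer may be absent from at most that many genomes of the chromosome) is purely an assumption for illustration, as are the function names and the toy sequences.

```python
def build_genome_occurrence_index(chromosome_genomes, k):
    """Map each k-mer to the number of chromosome genomes containing it."""
    index = {}
    for genome in chromosome_genomes:
        # A set ensures each genome contributes at most once per k-mer.
        seen = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        for kmer in seen:
            index[kmer] = index.get(kmer, 0) + 1
    return index


def select_specific_kmers(index, n_genomes, max_missing=0):
    """Keep k-mers present in all but at most max_missing genomes.

    max_missing stands in for the preset error condition (an assumption).
    """
    return sorted(k for k, n in index.items() if n_genomes - n <= max_missing)


# Toy sequences standing in for several genomes of the same chromosome.
genomes = ["ATCGA", "ATCGG", "TTCGA"]
index = build_genome_occurrence_index(genomes, 3)
print(select_specific_kmers(index, len(genomes)))     # ['TCG']
print(select_specific_kmers(index, len(genomes), 1))  # ['ATC', 'CGA', 'TCG']
```

  • Widening the tolerance admits more k-mers at the cost of a looser condition, which mirrors the statement that the error condition can be a range.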
  • the method further includes the following steps:
  • Step 100 Select a k-mer that satisfies a preset specific condition from the k-mers corresponding to each chromosome.
  • Step 101 Store a k-mer that satisfies a preset specific condition into a feature target sequence set corresponding to each chromosome.
  • each feature target sequence set includes a specific k-mer corresponding to each chromosome.
  • A specific k-mer is a k-mer selected from the k-mers contained in each chromosome because it satisfies the preset specificity condition. After the k-mers that satisfy the preset specificity condition, that is, the specific k-mers, are selected, they can be stored in the feature target sequence set corresponding to each chromosome.
  • Because the feature target library is established in advance, the required specific k-mer data can be called directly when detecting whether a chromosome is abnormal, which improves the detection efficiency.
  • the method further includes the following steps:
  • Step 302 Obtain the number of occurrences C, on the corresponding chromosome, of each specific k-mer included in each chromosome contained in the target species stored in the target database, and take the number of occurrences of the specific k-mer with the least number of occurrences on the chromosome as the minimum number of occurrences Cm.
  • Step 304 Use the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer.
  • Step 306 Generate a specific k-mer copy number list corresponding to each chromosome according to the copy number of the specific k-mer included in each chromosome.
  • Step 308 Store the specific k-mer copy number list into the target database.
  • the above step 108 includes: obtaining the copy number of each specific k-mer according to the specific k-mer copy number list.
  • the target species contains multiple chromosomes, and each chromosome contains one or more specific k-mers.
  • the number of occurrences C, on the chromosome, of each specific k-mer contained in each chromosome can be obtained, and the number of occurrences of the specific k-mer with the least number of occurrences on the chromosome can be obtained as the minimum number of occurrences Cm.
  • the ratio of the number of occurrences C to the number of occurrences of the k-mer with the least number of occurrences on the chromosome is the copy number of the specific k-mer.
  • the copy number of each specific k-mer can be calculated to generate a list of specific k-mer copy numbers corresponding to the chromosome.
  • Each specific k-mer copy number list can be stored in the characteristic target sequence set corresponding to the chromosome, which is convenient for directly calling the list to obtain relevant data when needed, improving detection efficiency.
  • the copy number of the specific k-mer with the least occurrence is equal to Cm / Cm, that is, the copy number of the specific k-mer with the least occurrence is 1.
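  • Steps 302 to 306 reduce to dividing every occurrence count C by the minimum count Cm of the same chromosome. The following minimal sketch assumes the per-chromosome occurrence counts are held in a dictionary; the function name and the counts are hypothetical.

```python
def copy_number_list(occurrences):
    """Return copy number C / Cm for each specific k-mer of one chromosome."""
    if not occurrences:
        raise ValueError("at least one specific k-mer is required")
    c_min = min(occurrences.values())  # Cm: the fewest occurrences
    return {kmer: c / c_min for kmer, c in occurrences.items()}


# Illustrative counts only; the least frequent k-mer gets copy number 1.
print(copy_number_list({"AAA": 2, "CCC": 4, "GGG": 6}))
# {'AAA': 1.0, 'CCC': 2.0, 'GGG': 3.0}
```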
  • the above step 110 includes:
  • Step 502 Calculate the ratio of the actual number of occurrences of each specific k-mer to its copy number.
  • Step 504 Calculate, for each chromosome, the average of the ratios of the actual number of occurrences to the copy number over all specific k-mers of the chromosome, and use it as the single-copy signal intensity of the corresponding chromosome.
  • Step 506 Calculate the actual signal intensity of the corresponding chromosome according to the single-copy signal intensity of each chromosome.
  • Each chromosome can contain multiple specific k-mers, so the ratio of the actual number of occurrences to the copy number can be obtained for every specific k-mer contained in each chromosome, and the average of these ratios can be calculated. Each chromosome thus has a corresponding average of the ratio of the actual number of occurrences to the copy number, and this average is the single-copy signal intensity of the chromosome. The actual signal intensity corresponding to each chromosome can then be calculated based on the single-copy signal intensity of each chromosome.
  • the actual signal intensity of the corresponding chromosome is calculated according to the following formula:
  • actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M) / SD, where M is the average of the single-copy signal intensities of all chromosomes, and SD is the variance of the single-copy signal intensities of all chromosomes.
  • the average value M and the variance of the single-copy signal intensity of all chromosomes can be calculated.
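  • Steps 502 to 506 can be sketched as below. The source describes SD as the variance; this sketch divides by the population standard deviation, which is an assumption made so that the result is a conventional standardized score. Replace `pstdev` with `pvariance` to follow the wording literally. The function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def single_copy_intensity(actual_counts, copy_numbers):
    """Average of (actual occurrences / copy number) over one chromosome."""
    return mean(actual_counts[k] / copy_numbers[k] for k in copy_numbers)


def actual_signal_intensities(single_copy):
    """Apply (E - M) / SD to the single-copy intensity E of every chromosome."""
    values = list(single_copy.values())
    m = mean(values)
    sd = pstdev(values)  # assumption: SD taken as standard deviation
    if sd == 0:
        raise ValueError("all single-copy intensities are identical")
    return {chrom: (e - m) / sd for chrom, e in single_copy.items()}


# Illustrative numbers only, not data from the application.
e_chr1 = single_copy_intensity({"AAA": 10, "CCC": 20}, {"AAA": 1.0, "CCC": 2.0})
print(e_chr1)  # 10.0
print(actual_signal_intensities({"chr1": 10.0, "chr2": 12.0, "chr3": 14.0}))
```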
  • the method for detecting an abnormal chromosome copy number further includes the following steps:
  • Step 602 Obtain a preset number of standard test samples, and the standard test samples are samples confirmed to have no abnormal chromosome copy number.
  • Step 604 Obtain the actual number of occurrences of the specific k-mer contained in each chromosome in the standard detection sample in the data to be detected.
  • the chromosome in the data to be detected can be compared with a standard confidence interval list corresponding to a predetermined chromosome, and it can be determined whether there is an abnormal copy number of the chromosome in the data to be detected.
  • To establish a standard confidence interval list corresponding to the chromosomes, a preset number of standard detection samples need to be obtained first.
  • the standard test sample is a sample confirmed as having no abnormal chromosome copy number.
  • the preset number is a quantity that can be set by the technician, but it should meet the statistical requirement of a large sample. Generally the preset number should be greater than 30, or greater than 100. After multiple standard detection samples are obtained, the actual number of occurrences, in the data to be detected, of the specific k-mers in the chromosomes contained in each standard detection sample can be obtained.
  • Step 606 Obtain a copy number of each specific k-mer in each chromosome included in the standard detection sample from the target database.
  • Step 608 Obtain the standard signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer included in the standard detection sample.
  • the copy number of each specific k-mer refers to the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the least number of occurrences on the chromosome.
  • a standard signal intensity record table can be established according to different genders. For example, if the target species is a human, a standard signal intensity record table of chromosomes in a standard test sample belonging to a male and a standard signal intensity record table of a chromosome in a standard test sample belonging to a female can be established.
  • For example, FIG. 7A shows a standard signal intensity recording table for chromosomes in normal male samples, and FIG. 7B shows a standard signal intensity recording table for chromosomes in normal female samples.
  • The tables record the standard signal intensity corresponding to each chromosome included in a male sample and the standard signal intensity corresponding to each chromosome included in a female sample, respectively.
  • In FIG. 7A, the standard signal intensity of chromosome 1 in sample 1 is recorded as S_{1,1} and that of chromosome 2 as S_{1,2}; likewise, the standard signal intensity of chromosome 1 in sample i is recorded as S_{i,1} and that of chromosome 2 in sample i as S_{i,2}, where the first index is the sample and the second index is the chromosome. The recording method is the same in FIG. 7B.
  • Step 610 Determine, according to the standard signal intensity of each chromosome in the multiple standard detection samples, the standard confidence interval corresponding to the chromosome at the preset confidence value.
  • Step 612 Obtain a list of standard confidence intervals corresponding to the chromosomes included in the target species according to the standard confidence intervals corresponding to each chromosome.
  • a confidence interval is an interval for a population parameter to be estimated. By obtaining a random sample from the population, the calculated confidence interval may include the population parameter of the population. This confidence is also called the confidence level.
  • the preset confidence value P here refers to a confidence value set by a technician in advance. It is generally set to a value greater than 0.95 and can be arbitrarily close to 1 but not equal to 1. The preset confidence value can be adjusted by a technician in actual applications as needed. For example, if the confidence is set to 95%, P is 0.95, and if the confidence is set to 99.9%, P is 0.999.
  • the two boundary values LB and UB of the standard signal intensity of the chromosome can be determined according to the preset confidence value, giving the confidence interval corresponding to the preset confidence value.
  • LB is the minimum of the confidence interval and UB is the maximum of the confidence interval. The confidence interval obtained is therefore an interval of standard signal intensity.
  • The standard signal intensity interval of each chromosome at the preset confidence value can thus be obtained; this interval is the standard confidence interval corresponding to the chromosome. Since the target species contains multiple chromosomes, a list of standard confidence intervals corresponding to the chromosomes contained in the target species can be obtained.
  • the standard confidence interval list contains the standard confidence interval corresponding to each chromosome. For example, if the preset confidence value P is set to 0.98, the standard signal intensity interval corresponding to each chromosome at a probability of 98% can be obtained.
  • the above step 610 includes:
  • Step 802 Obtain a standard signal intensity of each chromosome contained in each standard detection sample.
  • Step 804 Calculate, separately for each gender of the standard detection samples, the mean and variance of the standard signal intensity of each chromosome over all the standard detection samples of that gender.
  • Step 806 Determine, for each gender, the standard confidence interval corresponding to each chromosome at the preset confidence value according to the mean and variance of the standard signal intensities in the multiple standard detection samples of that gender.
  • each chromosome refers to each numbered chromosome.
  • the mean and variance of the standard signal intensity of chromosome 1 can be calculated.
  • the mean and variance of standard signal intensities of chromosomes 2, 3, ..., 22 and X, Y and other chromosomes can be calculated.
  • the standard confidence interval, that is, the standard signal intensity interval, corresponding to each chromosome at the preset confidence value can then be determined. For example, with humans as the target species, standard test samples of different genders can be used to create a distribution table of the standard signal intensities of chromosomes in male samples at the preset confidence value P and a distribution table of the standard signal intensities of chromosomes in female samples at the preset confidence value P.
  • the normal male sample contains 22 autosomes and the X and Y sex chromosomes.
  • M ′ represents the average value of the standard signal intensity of all chromosomes
  • SD ′ represents the variance of the standard signal intensity of all chromosomes.
  • LB represents the minimum value of the confidence interval corresponding to the preset confidence value P for each chromosome
  • UB represents the maximum value of the confidence interval corresponding to the preset confidence value P for each chromosome. The minimum and maximum values give the corresponding confidence intervals.
  • The difference between FIG. 9A and FIG. 9B is that the genomes of individuals of different sexes have different chromosomal compositions. For example, the male sample in FIG. 9A includes the X and Y sex chromosomes in addition to 22 autosomes, while the female sample in FIG. 9B includes 22 autosomes and two X sex chromosomes. The rest of the data have the same meaning.
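  • Steps 802 to 806 can be illustrated with the sketch below. The source does not say how LB and UB are derived from the mean, the variance, and P, so the sketch assumes the standard signal intensities of a chromosome are approximately normally distributed and takes a two-sided interval around the mean. The function names and sample values are hypothetical; one such table would be built per gender.

```python
from statistics import NormalDist, mean, stdev


def standard_confidence_interval(standard_intensities, p=0.95):
    """Return (LB, UB) for one chromosome under a normality assumption."""
    m = mean(standard_intensities)
    sd = stdev(standard_intensities)
    # Two-sided quantile: e.g. P = 0.95 gives roughly 1.96.
    z = NormalDist().inv_cdf((1 + p) / 2)
    return (m - z * sd, m + z * sd)


def standard_interval_list(intensities_by_chromosome, p=0.95):
    """Build the per-chromosome interval list for samples of one gender."""
    return {
        chromosome: standard_confidence_interval(values, p)
        for chromosome, values in intensities_by_chromosome.items()
    }


# Illustrative values only; real use needs a large number of samples.
male_samples = {"chr1": [9.0, 10.0, 11.0], "chrY": [4.5, 5.0, 5.5]}
print(standard_interval_list(male_samples, p=0.95))
```

  • A larger P widens every interval, so fewer chromosomes are flagged as abnormal.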
  • the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby.
  • the peripheral blood samples include a peripheral blood sample of a normal mother carrying a normal baby boy, a peripheral blood sample of a normal mother carrying a normal baby girl, a peripheral blood sample of a normal mother carrying normal boy twins, a peripheral blood sample of a normal mother carrying normal girl twins, and a peripheral blood sample of a normal mother carrying normal boy-girl twins.
  • Peripheral blood is blood other than bone marrow.
  • a normal mother means that the mother's chromosome copy number is not abnormal
  • a normal baby means that the baby's chromosome copy number is also normal.
  • the criteria for identifying as a normal mother or normal baby can also be adjusted by technical staff based on actual project research.
  • the standard test samples can be peripheral blood samples of normal mothers carrying normal babies.
  • the peripheral blood samples include peripheral blood samples from normal mothers with normal boys, and normal mothers with normal girls.
  • the standard test sample may also be a peripheral blood sample from a normal mother with multiple normal babies.
  • for example, a peripheral blood sample of a normal mother carrying normal triplets
  • or a peripheral blood sample of a normal mother carrying normal quadruplets, and so on.
  • the number of babies carried by the normal mother need not be limited; any peripheral blood sample of a normal mother carrying normal babies can be used as a standard test sample.
  • step 610 includes the following steps:
  • Step 1002 Determine a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby boy.
  • a standard confidence interval corresponding to a chromosome at a preset confidence value is determined according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby girl.
  • a standard confidence interval corresponding to a chromosome at a preset confidence value is determined according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby boy twin.
  • Step 1008 Determine a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying normal baby girl twins.
  • Step 1010 Determine a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying normal boy-girl twins.
  • steps 1002 to 1010 above determine the standard confidence interval of each chromosome at the preset confidence value according to the standard signal intensity of the chromosomes contained in the different standard test samples.
  • when the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby boy, the standard confidence interval of each chromosome at the preset confidence value can be determined according to the standard signal intensity of each chromosome contained in that sample.
  • likewise, when the standard test sample is a peripheral blood sample of a normal mother carrying normal boy-girl twins, the standard confidence interval of each chromosome at the preset confidence value can be determined according to the standard signal intensity of each chromosome contained in that sample.
  • the above-mentioned step 112 includes: when it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of that chromosome, determining that chromosome to be a chromosome with abnormal copy number.
  • the standard confidence interval of each chromosome of the target species can be calculated in advance to obtain a list of standard confidence intervals. The actual signal intensity of each chromosome of the target species to which the data to be tested belong can then be compared with the pre-obtained standard confidence interval of the corresponding chromosome. When the actual signal intensity of a chromosome does not fall within the standard confidence interval of that chromosome, the chromosome is determined to be a chromosome with abnormal copy number. In the comparison, each chromosome is compared with its own standard confidence interval.
  • the actual signal intensity of chromosome 1 of the target species to which the sequencing data belong is compared with the pre-calculated standard confidence interval of chromosome 1.
  • the actual signal intensity of chromosome 2 is compared with the pre-calculated standard confidence interval of chromosome 2. In this way, all chromosomes of the target species to which the sequencing data belong are compared to determine whether any chromosome has an abnormal copy number.
  • for example, if the pre-calculated standard confidence interval of chromosome 1 is (LB1, UB1), it is determined whether the actual signal intensity of chromosome 1 in the test sample lies within the interval (LB1, UB1). If it does not, the copy number of chromosome 1 is abnormal; if it does, chromosome 1 is normal and has no copy number abnormality.
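  • The interval check described above can be sketched in Python as follows. The function name `abnormal_chromosomes`, the chromosome labels, and all numeric values are hypothetical; the interval is treated as open, following the notation (LB1, UB1), which is an assumption about how boundary values are handled.

```python
def abnormal_chromosomes(actual, intervals):
    """List chromosomes whose actual signal intensity lies outside (LB, UB)."""
    flagged = []
    for chrom, intensity in actual.items():
        lb, ub = intervals[chrom]
        if not (lb < intensity < ub):  # open interval, as written in the text
            flagged.append(chrom)
    return flagged

# Hypothetical standard confidence intervals and sample intensities.
intervals = {"chr1": (0.80, 1.20), "chr21": (0.85, 1.15)}
actual = {"chr1": 1.02, "chr21": 1.31}
flagged = abnormal_chromosomes(actual, intervals)
```

  • With these made-up numbers only chr21 falls outside its interval, so it alone would be reported as having an abnormal copy number.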
  • the method for detecting an abnormal chromosome copy number further includes the following steps:
  • Step 1102 Determine a standard confidence interval list of a chromosome corresponding to each gender according to the gender of the target species.
  • Step 1104 Obtain the gender of the sample to be tested.
  • Step 1106 Compare the actual signal intensity of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list of the corresponding sex of the target species.
  • Step 1108 When it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome of the corresponding sex, determine that chromosome to be a chromosome with abnormal copy number.
  • the target database stores lists of standard confidence intervals for the chromosomes contained in a sample, created separately for each sex. Taking humans as an example, the target database stores a distribution table of the standard signal intensity of the chromosomes in normal male samples at the preset confidence value P; this table records the standard confidence interval, at the preset confidence value, of each chromosome contained in a normal male sample.
  • the target database likewise stores a distribution table of the standard signal intensity of the chromosomes in normal female samples at the preset confidence value P; this table records the standard confidence interval, at the preset confidence value, of each chromosome contained in a normal female sample.
  • the target species is classified by sex, that is, divided into groups according to sex. For example, when the target species is human, it is divided into male and female. The standard confidence interval of each chromosome can then be determined for each sex. Once the target species has been classified by sex, the chromosomes contained in individuals of each sex are known, and the standard confidence interval of each of those chromosomes can be obtained. For example, since a female of the target species has 22 autosomes and two X sex chromosomes, the distribution table of the standard signal intensity of the chromosomes in normal female samples at the preset confidence value P can be obtained from the target database.
  • the standard confidence intervals of these 22 autosomes and of the X chromosome are read from that table. That is, when the sample to be tested comes from a female, the distribution table corresponding to females is obtained, and the actual signal intensity of each chromosome of the female sample is compared with the standard confidence interval of the same chromosome in the female list. When the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome of the corresponding sex, that chromosome is determined to be a chromosome with abnormal copy number.
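  • Steps 1102 to 1108 can be sketched in Python as a lookup in a per-sex table followed by the interval comparison. The table `STANDARD_TABLES` stands in for the target database; it is abbreviated to a few chromosomes, and every name and number in it is hypothetical.

```python
# Hypothetical per-sex tables: sex -> chromosome -> (LB, UB).
STANDARD_TABLES = {
    "male": {"chr1": (0.80, 1.20), "chrX": (0.40, 0.60), "chrY": (0.40, 0.60)},
    "female": {"chr1": (0.80, 1.20), "chrX": (0.80, 1.20)},
}

def check_sample(sex, actual):
    """Pick the interval list for the sample's sex, then compare each chromosome."""
    table = STANDARD_TABLES[sex]            # Steps 1102 and 1104
    abnormal = []
    for chrom, (lb, ub) in table.items():   # Step 1106
        if not (lb < actual[chrom] < ub):   # Step 1108
            abnormal.append(chrom)
    return abnormal

# A hypothetical female sample whose X intensity is too low for the female table.
result = check_sample("female", {"chr1": 1.0, "chrX": 0.5})
```

  • Keeping one table per sex means the same X intensity can be normal in a male sample and abnormal in a female sample, which is the point of the sex-specific lists.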
  • the method further includes the following steps:
  • Step 1202 Obtain multiple chromosomes included in the target species.
  • Step 1204 Classify and organize the multiple chromosomes included in the target species.
  • Step 1206 Obtain a pre-selected high-confidence genome that meets a preset reliability condition.
  • Step 1208 Determine a high-confidence genome corresponding to each chromosome contained in the target species.
  • the target species is the species from which the test sample is derived.
  • the human is the target species.
  • the target species can be a human or a species other than a human.
  • the genomic data of target and non-target species can be derived from the RefSeq data set of the NCBI (the reference sequence database of the National Center for Biotechnology Information, which provides biologically meaningful, non-redundant gene and protein sequences) or from other public or private genomes. The genomes of all target and non-target species are integrated into a complete set.
  • an individual's complete genome contains multiple chromosomes. Therefore, after the genomes of different individuals of the target species are obtained, the multiple chromosomes contained in the target species can be obtained. Several genomes of the target species may have been collected, that is, genomes of different individuals or populations of the same target species. Taking humans as the target species, the collected genomes may include genomes of European, North American Indian, and Chinese Han origin. Each chromosome of the target species may therefore contain sequences belonging to that chromosome from different genomes. For example, human chromosome 1 can include chromosome 1 of European descent, chromosome 1 of North American Indian descent, and chromosome 1 of Chinese Han descent. The data of each identical chromosome of the target species are put together to form the sequence data set of that chromosome.
  • a preselected high-confidence genome that meets the preset credibility conditions is obtained, and the high-confidence genomes corresponding to each chromosome of the target species can then be determined.
  • A high-confidence genome is a genome that satisfies a preset credibility condition. The order here can also be changed: a large number of genomes can first be collected from the NCBI, and these genomes can then be screened to select those that meet the preset credibility conditions as high-confidence genomes.
  • the high-confidence sequence data set of each chromosome of each target species is determined by putting together the data of each identical chromosome from all high-confidence genomes of that target species.
  • satisfying the preset credibility condition includes any of the following: the proportion of non-deterministic characters contained in the chromosome sequence is lower than a preset proportion threshold; the number of sequence fragments belonging to the same chromosome contained in the chromosome sequence is lower than a preset fragment threshold; or, when the chromosome sequence is compared with all other chromosome sequences whose genetic relationship falls within a preset genetic distance threshold range to determine its average coverage percentage in those similar sequences, the average coverage percentage is higher than a preset percentage value.
  • for a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that piece of data is a suspected low-confidence genome.
  • for DNA or RNA sequences, non-deterministic characters refer to characters other than ACGTU.
  • for protein sequences, non-deterministic characters refer to characters other than the characters of specific amino acids.
  • the genome can be considered to satisfy a preset credibility condition.
  • the genome sequence is a suspected low-confidence genome. Conversely, when the number of sequence fragments belonging to the same chromosome contained in a genomic sequence is lower than a preset fragment threshold, the genomic sequence data can also be considered to satisfy a preset credibility condition.
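  • As an illustration of the first credibility condition, the following Python sketch computes the proportion of non-deterministic characters in a DNA sequence and compares it with a threshold. The function names and the 1% threshold are hypothetical; the text only says that the threshold is preset.

```python
def non_deterministic_fraction(sequence, alphabet="ACGT"):
    """Fraction of characters that are not in the deterministic alphabet."""
    if not sequence:
        return 1.0  # treat an empty sequence as fully non-deterministic
    allowed = set(alphabet)
    bad = sum(1 for ch in sequence.upper() if ch not in allowed)
    return bad / len(sequence)

def passes_character_check(sequence, max_fraction=0.01):
    """True when the non-deterministic proportion is below the preset threshold."""
    return non_deterministic_fraction(sequence) < max_fraction

# Hypothetical sequence with two ambiguous bases (N) out of eight characters.
fraction = non_deterministic_fraction("ACGTNNAC")
```

  • For RNA or protein sequences the same function could be called with a different `alphabet` argument.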
  • Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals).
  • each specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence index table of the chromosome meets a first preset error condition; and its number of occurrences in the genome occurrence index table of the chromosome, together with its number of occurrences in the genome occurrence index table of the complete set, meets a second preset error condition.
  • the genome occurrence index table of a chromosome records, for each k-mer, the number of genomes containing that k-mer among the genomes corresponding to that chromosome; the genome occurrence index table of the complete set records, for each k-mer contained in each chromosome of the target species, the number of genomes containing that k-mer among the genomes included in the complete set.
  • each chromosome has its own set of characteristic target sequences, and the specific k-mer included in the set of characteristic target sequences refers to a k-mer that satisfies a preset specific condition.
  • the preset specific condition includes a first preset error condition and a second preset error condition. When a k-mer satisfies both conditions at the same time, it is considered to meet the preset specific condition and is taken as a specific k-mer.
  • the complete set refers to the collection of all high-confidence genomes collected.
  • the high-confidence genomes include both the genomes of each target species and the genomes of non-target species, such as high-confidence genomes of pathogenic bacteria, symbiotic bacteria, probiotics, humans, animals, plants, and so on.
  • the genome occurrence index table of a chromosome records, for each k-mer, the number of genomes containing that k-mer among the genomes corresponding to that chromosome.
  • the count recorded for each k-mer in the genome occurrence index table of the complete set represents how many genomes in the complete set the k-mer has appeared in. If the k-mer appears multiple times in the same genome, it is counted only once.
  • that is, the table of a chromosome counts the genomes containing the k-mer among the genomes corresponding to that chromosome,
  • while the table of the complete set counts the genomes containing the k-mer among all genomes included in the complete set.
  • the selection of the specific k-mer involves two parameters, a first preset error condition and a second preset error condition, and thus allows a specific k-mer to be non-specific within a certain range. Without these two parameters, no non-specificity can be tolerated, and it is often difficult to find a specific k-mer for a chromosome. By selecting specific k-mers with a certain tolerated error and building the set of characteristic target sequences from them, a specific target that can represent the chromosome can be found with high probability.
  • the first preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table corresponding to each chromosome to the number of genomes contained in the corresponding chromosome and the first threshold is greater than or equal to 1.
  • in other words, the ratio of the number of occurrences recorded in the genome occurrence index table of the chromosome to the number of genomes corresponding to the chromosome, plus the first threshold, must be greater than or equal to 1. Assume that the chromosome has N corresponding genomes, that the number of occurrences of a certain k-mer in the genome occurrence index table of the chromosome is C1, and that the first threshold is P1; the first preset error condition is then C1/N + P1 ≥ 1.
  • the first threshold value P1 represents an acceptable error probability, and can be any value between 0 and 1.
  • the first threshold value can be set by a technician according to the actual project.
  • the first threshold is less than 5%.
  • the first threshold is an acceptable error probability.
  • the first threshold may be any value between 0 and 1.
  • the first threshold may be set to a value less than 5%.
  • the second preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of occurrences in the genome occurrence index table of the complete set, and the second threshold, is greater than or equal to 1.
  • in other words, the ratio of the number of occurrences recorded in the genome occurrence index table of the chromosome to the number of occurrences in the genome occurrence index table of the complete set, plus the second threshold, must be greater than or equal to 1.
  • assume that the number of occurrences of a k-mer in the genome occurrence index table of the chromosome is C1,
  • that the number of occurrences of the k-mer in the genome occurrence index table of the complete set is C2,
  • and that the second threshold is P2.
  • the second preset error condition is then C1/C2 + P2 ≥ 1.
  • like the first threshold, the second threshold P2 represents an acceptable error probability and can be any value between 0 and 1.
  • the second threshold value P2 can also be set by a technician based on the actual project.
  • the second threshold is less than 5%.
  • like the first threshold, the second threshold is an acceptable error probability.
  • the second threshold can also be any value between 0 and 1, and can be set to a value less than 5%.
  • the first threshold and the second threshold may be equal or different.
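  • The two preset error conditions, C1/N + P1 ≥ 1 and C1/C2 + P2 ≥ 1, can be checked together as in the following Python sketch. The function name `is_specific` and the default thresholds of 0.02 are hypothetical choices within the "less than 5%" range mentioned in the text; the sketch also assumes C2 ≥ C1, since every genome of the chromosome belongs to the complete set.

```python
def is_specific(c1, n, c2, p1=0.02, p2=0.02):
    """Check both preset error conditions for one k-mer of one chromosome.

    c1: genomes of this chromosome that contain the k-mer
    n:  genomes in this chromosome's data set
    c2: genomes in the complete set that contain the k-mer (assumed >= c1)
    """
    if c1 == 0:
        return False  # a k-mer absent from the chromosome cannot represent it
    first = c1 / n + p1 >= 1    # present in nearly all genomes of the chromosome
    second = c1 / c2 + p2 >= 1  # rarely present outside the chromosome
    return first and second

# Hypothetical counts: found in all 100 chromosome genomes and in 1 other genome.
accepted = is_specific(c1=100, n=100, c2=101)
```

  • With both thresholds set to 0 the check degenerates to strict specificity: the k-mer must occur in every genome of the chromosome and in no other genome.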
  • before step 102, the method further includes the following steps: generating a genome occurrence index table corresponding to each chromosome, which records, for each k-mer, the number of genomes containing that k-mer among the genomes corresponding to that chromosome; and storing the genome occurrence index table in the characteristic target sequence set corresponding to the chromosome.
  • the genome is all the genetic information in an organism. This genetic information is stored in the form of a nucleotide sequence.
  • the sum of the genetic material in one complete haploid set of an organism is its genome.
  • Each individual's complete genome can contain multiple chromosomes, while the genome of each chromosome can contain multiple k-mers.
  • the term "chromosome genome” commonly used in the art is used here to refer to the sum of all sequences contained in a complete chromosome.
  • the genome occurrence index table corresponding to each chromosome records, for each k-mer, the number of genomes containing that k-mer among the genomes corresponding to that chromosome.
  • the genome occurrence index table corresponding to each chromosome can be stored in the characteristic target sequence set of that chromosome, that is, in the target database. Once stored, the data in the genome occurrence index table can be retrieved from the target database whenever needed, which improves detection efficiency.
  • before obtaining the sequencing data of the sample, the method further includes: generating a genome occurrence index table of the complete set, which records, for each k-mer, the number of genomes containing that k-mer among the genomes included in the complete set.
  • the genome occurrence index table of the complete set is stored in the target database.
  • a characteristic target sequence set corresponding to each chromosome is stored.
  • the complete set contains all the high-confidence genomes collected; that is, it contains both the high-confidence genomes of the target species corresponding to the data to be detected and the high-confidence genomes of multiple species that are not the target species of the data to be detected.
  • the genome occurrence index table of the complete set records, for each k-mer contained in each chromosome, how many genomes in the complete set the k-mer has appeared in.
  • in other words, the table records how many genomes of the complete set contain each k-mer.
  • what is counted is the number of genomes, not the number of occurrences of the k-mer. If a k-mer occurs more than once in the same genome, it is still counted only once in the genome occurrence index table of the complete set.
  • an index table of the number of occurrences of the genome for the complete set can be established.
  • the genome occurrence index table of the complete set is different from the genome occurrence index table corresponding to each chromosome.
  • the genome occurrence index table of a chromosome belongs to that chromosome, and each chromosome has its own table, whereas only one genome occurrence index table of the complete set is generated, covering all the data. Once the genome occurrence index table of the complete set has been stored, its data can be retrieved from the target database whenever they are needed during detection, which improves detection efficiency.
  • the method further includes: generating a specific k-mer actual occurrence frequency record table corresponding to the chromosome according to the actual occurrence number.
  • the specific k-mers contained in each chromosome are stored. After the data to be detected are obtained, they can be compared with the specific k-mers of each chromosome to obtain the actual number of times each specific k-mer appears in the data to be detected. From the actual number of occurrences of each specific k-mer in the sequencing data, a record table of the actual occurrences of the specific k-mers can be generated for each chromosome.
  • M corresponding record tables of the actual occurrences of specific k-mers are generated, one per chromosome; each table records the actual number of occurrences in the sequencing data of the specific k-mers contained in the corresponding chromosome.
  • in the record table for a particular chromosome X, the leftmost column records the specific k-mers contained in chromosome X, and the second column records the actual number of occurrences of each of those specific k-mers in the sequencing data, C1, C2, and so on. Generating the record table from the actual number of occurrences of the specific k-mers in the sequencing data and storing it for later retrieval improves detection efficiency.
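  • A record table of actual occurrences can be sketched in Python as a dictionary from specific k-mer to count. The sketch assumes exact matching on the forward strand only; the text does not say whether reverse complements or mismatches are considered. The function name, reads, and k-mers are hypothetical.

```python
def count_specific_kmers(reads, specific_kmers, k):
    """Count how often each specific k-mer of one chromosome occurs in the reads."""
    counts = {kmer: 0 for kmer in specific_kmers}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts

# Hypothetical sequencing reads and specific 3-mers of one chromosome.
reads = ["ACGTAC", "GTACGT"]
table = count_specific_kmers(reads, ["ACG", "TAC", "GGG"], k=3)
```

  • Here every occurrence is counted, unlike the genome occurrence index tables, which count a k-mer at most once per genome.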
  • a method for detecting an abnormal chromosome copy number includes the following steps:
  • Step 1402 A feature target sequence set corresponding to each chromosome is established.
  • step 1402 includes:
  • Step 1402A Collection and sorting of high-confidence genomes.
  • the high-confidence genomes can include both genomes of the target species corresponding to the data to be detected and genomes that do not belong to that target species,
  • for example, high-confidence genomes of commensal bacteria, probiotics, humans, animals, plants, and the like.
  • High confidence genomes can be derived from the NCBI's RefSeq dataset or other public or private high confidence genomes.
  • For example, for a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that piece of data is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters refer to characters other than ACGTU. For protein sequences, non-deterministic characters refer to characters other than the characters of specific amino acids.
  • Genomes with a low average percentage of coverage are those that are suspected of having low completion, ie, low confidence.
  • Genetic distance refers to an index that measures the size of the overall genetic difference between species (or individuals).
  • Step 1402B Determine a high-confidence sequence data set of each chromosome in the target species corresponding to the data to be detected.
  • each chromosome of the target species may contain sequences belonging to that chromosome from a different genome.
  • the first chromosome of humans can include the first chromosome of European descent, the first chromosome of North American Indian, and the first chromosome of Chinese Han.
  • the data of each identical chromosome of all high-confidence genomes of the target species are put together, that is, a high-confidence sequence data set of each chromosome of the target species is assembled.
  • the high-confidence sequence data sets of all chromosomes of the target species and the high-confidence sequence data sets of all non-target species are brought together to form a complete set. That is, the high-confidence sequence data sets of all chromosomes of the target species corresponding to the data to be detected and the high-confidence sequence data sets of all non-target species together form the complete set.
  • the ratio of the copy number of each chromosome of the target species corresponding to the data to be detected is determined under normal circumstances, and the autosome and sex chromosome are distinguished.
  • a normal human genome contains 23 pairs of chromosomes, 46 in total.
  • chromosomes 1 to 22 are autosomes, and their copy numbers are two.
  • X and Y chromosomes are sex chromosomes. Normal males have only one X chromosome and one Y chromosome. Normal women have two X chromosomes and no Y chromosomes.
  • Copy number refers to the number of copies of a certain gene or a specific DNA sequence in a haploid genome.
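  • The normal copy numbers described above for humans can be written down as a small Python lookup. The function name and the chromosome and sex labels are hypothetical; the values follow the text (two copies of each autosome, XY in males, XX in females).

```python
def expected_copy_number(chromosome, sex):
    """Normal copy number of a human chromosome, as described in the text."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    if chromosome == "X":
        return 1 if sex == "male" else 2
    if chromosome == "Y":
        return 1 if sex == "male" else 0
    return 2  # autosomes 1 to 22

CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]
male = {c: expected_copy_number(c, "male") for c in CHROMOSOMES}
female = {c: expected_copy_number(c, "female") for c in CHROMOSOMES}
```

  • Both dictionaries sum to 46 chromosomes, matching the 23 pairs of a normal human genome.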
  • the information determined in FIG. 16 is generated only once when the target species corresponding to the data to be detected is determined, and then the information in FIG. 16 is called when analyzing each sample data that needs to be detected.
  • Step 1402C Generate the genome occurrence index table of the complete set.
  • the genome occurrence index table of the complete set can be generated.
  • k-mer refers to a genomic sequence of length k.
  • k can be chosen freely, generally in the range of 11 to 32. If genomic data contain a different deterministic characters, then for a specific k there are at most a^k (a to the power k) different k-mers.
  • DNA has four deterministic characters, A, C, G and T, so for a particular k there are 4^k possible different k-mers.
  • for a genome of length n, there may be at most n - k + 1 different k-mers.
  • the number of different k-mers contained in a genome of n characters is usually much smaller than n - k + 1. Therefore, with an ordinary k-mer counting method, a given k-mer may appear multiple times in a given genome and be counted multiple times.
  • the genome occurrence index table of the complete set differs from that method: if a k-mer occurs more than once in a genome, the table still counts it only once. The count corresponding to a k-mer in the resulting table therefore represents how many genomes in the complete set the k-mer has appeared in.
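  • The once-per-genome counting can be sketched in Python as follows. Each genome is reduced to the set of its k-mers before counting, so repeated occurrences within one genome add only 1. The function names, genome identifiers, and toy sequences are hypothetical, and k = 3 is used only to keep the example readable.

```python
from collections import Counter

def kmers(sequence, k):
    """Yield every k-mer of a sequence; a length-n sequence yields n - k + 1 of them."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def genome_occurrence_index(genomes, k):
    """Map each k-mer to the number of genomes that contain it at least once."""
    index = Counter()
    for sequence in genomes.values():
        index.update(set(kmers(sequence, k)))  # set(): count once per genome
    return index

# Two hypothetical genomes; "ACG" occurs twice in g1 but is counted once for it.
genomes = {"g1": "ACGACG", "g2": "ACGTTT"}
index = genome_occurrence_index(genomes, k=3)
```

  • Passing only the genomes of one chromosome would give that chromosome's table, and passing all genomes would give the table of the complete set.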
  • each chromosome of the target species can be treated as a species here; that is, each individual sequence in the high-confidence data set of a chromosome of the target species that completely represents that chromosome is regarded as a single genome.
  • for example, the high-confidence data set of human chromosome 1 may contain three sequences: the chromosome 1 sequence of European descent, the chromosome 1 sequence of North American Indian descent, and the chromosome 1 sequence of Chinese Han descent. The European chromosome 1 sequence is then regarded as a complete, independent genome in the counting of the k-mer genome occurrence index table, and so is the North American Indian chromosome 1 sequence.
  • the Chinese Han chromosome 1 sequence is likewise regarded as a complete, independent genome in the counting of the k-mer genome occurrence index table.
  • Step 1402D Generate a genome occurrence index table corresponding to each chromosome.
  • the genome occurrence index table of a chromosome is different from the genome occurrence index table of the complete set in step 1402C.
  • the genome occurrence index table of the complete set covers the complete set, that is, it records how many genomes in the complete set contain a given k-mer. The genome occurrence index table of a chromosome is built per chromosome and records, for each k-mer contained in that chromosome, how many of the genomes corresponding to that chromosome contain it.
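The relationship between the two kinds of table can be illustrated as follows. This is a sketch under the assumption that every sequence belongs to exactly one chromosome's high-confidence data set, in which case the complete-set table is the sum of the per-chromosome tables. The names `per_chromosome_indexes`, `complete_set_index`, and the toy data are hypothetical.

```python
from collections import Counter

def kmer_set(sequence, k):
    """Return the distinct k-mers of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def per_chromosome_indexes(dataset, k):
    """dataset maps a chromosome name to the list of genome sequences in its data set."""
    tables = {}
    for chromosome, genomes in dataset.items():
        table = Counter()
        for sequence in genomes:
            table.update(kmer_set(sequence, k))  # each genome counts once per k-mer
        tables[chromosome] = table
    return tables

def complete_set_index(tables):
    """Sum the per-chromosome tables to get the count over the complete set."""
    total = Counter()
    for table in tables.values():
        total.update(table)  # Counter.update adds counts
    return total

# Toy data, invented for illustration only.
toy_dataset = {"chr1": ["AAACCC", "AAACCG"], "chr2": ["GGGTTT", "AAATTT"]}
toy_tables = per_chromosome_indexes(toy_dataset, 3)
toy_total = complete_set_index(toy_tables)
```

In the toy data, "AAA" is found in two chr1 genomes and one chr2 genome, so its complete-set count is 3 while its chr1 count is 2.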
  • Step 1402E Generate a specific k-mer table corresponding to each chromosome.
  • the specific k-mer table corresponding to each chromosome records the k-mers in that chromosome that satisfy the preset specificity conditions, that is, the specific k-mers.
  • a specific k-mer is a k-mer that meets the preset specificity conditions. A specific k-mer must meet the following two conditions:
  • the high-confidence data set of the chromosome contains N genomes
  • the number of occurrences of a certain k-mer in the genome occurrence index table corresponding to the chromosome is C1.
  • the first threshold P1 and the second threshold P2 may be equal or different.
  • the two parameters P1 and P2 allow an error rate within a certain range, that is, they allow a specific k-mer to be non-specific to a limited degree. Without these two parameters, no non-specificity could be tolerated, and it would often be difficult to find specific k-mers for a given chromosome.
  • the probability of a false positive on this chromosome is less than or equal to P2^n' (that is, P2 raised to the power n'). For sufficiently large n', the probability of a false positive is extremely small.
  • the false negative rate refers to the proportion of positives that produce a negative test result, that is, the conditional probability of a negative test result given that the condition being looked for is present.
  • when calculating the false positive probability, the k-mers can be corrected for independence. For any two k-mers A and B in the specific k-mer list, if their ends coincide over no fewer than j characters (for example, the last j characters of A are exactly the same as the first j characters of B), then A and B are considered to have coincident ends.
  • j is generally a value greater than 5 and less than or equal to k-1, that is, 5 < j ≤ (k-1).
  • the end coincidence detection should cover A and B, the reverse complementary sequence A' of A and B, A and the reverse complementary sequence B' of B, and A' and B'.
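The end coincidence check described above, including the four strand combinations, could look like the sketch below. It assumes sequences over the alphabet A, C, G, T and treats "no fewer than j characters" as any suffix-to-prefix overlap of length j up to k-1. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complementary sequence of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]

def ends_overlap(a, b, j):
    """True if a suffix of one sequence equals a prefix of the other over at least j characters."""
    longest = min(len(a), len(b))
    for m in range(j, longest):  # overlap lengths j .. k-1
        if a[-m:] == b[:m] or b[-m:] == a[:m]:
            return True
    return False

def coincident_ends(a, b, j):
    """Check A/B, A'/B, A/B' and A'/B', where ' denotes the reverse complement."""
    a_rc, b_rc = reverse_complement(a), reverse_complement(b)
    pairs = [(a, b), (a_rc, b), (a, b_rc), (a_rc, b_rc)]
    return any(ends_overlap(x, y, j) for x, y in pairs)

# Toy 8-mers, invented for illustration; j = 6 satisfies 5 < j <= k-1.
kmer_a = "ACGTACGG"
kmer_b = "GTACGGTT"                      # starts with the last 6 characters of kmer_a
kmer_c = reverse_complement(kmer_b)      # same region seen from the other strand
kmer_d = "TTTTTTTT"
```

With these toy values, `kmer_a` coincides with both `kmer_b` and `kmer_c`, which shows why the reverse complement pairs have to be checked as well.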
  • each specific k-mer or specific region retained in the table in its final state is one non-coincident specific region.
  • multiple k-mers belonging to the same non-coincident specific region contribute the value of P1 or P2 only once. If the target species has M chromosomes, M specific k-mer tables are created here, one per chromosome.
  • Step 1402F Generate a specific k-mer copy number list corresponding to each chromosome.
  • the number of occurrences of each selected specific k-mer is calculated, that is, the total number of times the specific k-mer of this chromosome occurs in all genomes of the high-confidence data set is recorded.
  • the copy number of each specific k-mer of the chromosome is calculated relative to the number of occurrences Cm of the least frequent specific k-mer of the chromosome. If the target species has a total of M chromosomes, M specific k-mer copy number lists are created here, one per chromosome.
  • the copy number of a specific k-mer is a value greater than or equal to one.
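As a small worked example of the copy number definition (occurrences C divided by the minimum occurrences Cm), consider the sketch below. The function name and the toy counts are assumptions for illustration.

```python
def copy_number_list(occurrences):
    """occurrences maps each specific k-mer of one chromosome to its occurrence count C."""
    c_min = min(occurrences.values())  # Cm: count of the least frequent specific k-mer
    return {kmer: count / c_min for kmer, count in occurrences.items()}

# Toy occurrence counts, invented for illustration only.
toy_occurrences = {"ACGTAC": 3, "GGTTAA": 6, "TTGACC": 3}
toy_copies = copy_number_list(toy_occurrences)
```

Because every count is divided by the smallest one, each resulting copy number is at least 1, which matches the statement above.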
  • Module A can be run from time to time in order to continuously update the feature target sequence set corresponding to each chromosome, that is, update the target database. For example, whenever the reference genome data is updated, module A can be run. However, module A does not need to be run or updated during the analysis of each actual sample.
  • Step 1404 Calculate the actual signal strength of each chromosome contained in the target sample corresponding to the data to be detected.
  • step 1404 includes:
  • Step 1404A Obtain data to be detected.
  • Step 1404B Obtain a specific k-mer list and a specific k-mer copy number list.
  • Step 1404C Obtain the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected.
  • the specific k-mer list and specific k-mer copy number list of each chromosome of the target species generated in step 1402 are retrieved. If the target species corresponding to the data to be detected has M chromosomes, M specific k-mer lists and M specific k-mer copy number lists, one pair per chromosome, need to be retrieved. The actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome of the target species is then obtained. The number of occurrences of each specific k-mer can be recorded at the corresponding position in the record table of actual occurrences for the corresponding chromosome. That is, from the actual numbers of occurrences of the specific k-mers contained in each chromosome, a record table of the actual number of occurrences of the specific k-mers is generated for that chromosome.
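Counting the actual occurrences in the data to be detected can be sketched as a scan over the sequencing reads. This example assumes the reads are plain strings and ignores strand handling and sequencing errors, which the text does not specify at this step. The function name and toy reads are hypothetical.

```python
def count_specific_kmers(reads, specific_by_chromosome, k):
    """Return {chromosome: {specific k-mer: number of occurrences in the reads}}."""
    tables = {}
    for chromosome, kmers in specific_by_chromosome.items():
        counts = dict.fromkeys(kmers, 0)  # one row per specific k-mer, starting at 0
        for read in reads:
            for start in range(len(read) - k + 1):
                kmer = read[start:start + k]
                if kmer in counts:
                    counts[kmer] += 1
        tables[chromosome] = counts
    return tables

# Toy reads and specific k-mers, invented for illustration only.
toy_reads = ["AAACGTTT", "CCACGTGG", "TTTTGGCC"]
toy_specific = {"chr1": ["ACGT"], "chr2": ["GGCC", "CATG"]}
toy_counts = count_specific_kmers(toy_reads, toy_specific, 4)
```

Only the specific k-mers are looked up, so the reads never have to be aligned against whole chromosome sequences.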
  • Step 1404D Calculate the single-copy signal intensity E of each chromosome.
  • a single-copy signal intensity calculation table for a specific chromosome is shown in FIG.
  • any specific k-mer belonging to this specific chromosome can be obtained.
  • the adjusted numbers of occurrences of all specific k-mers of the chromosome are averaged, and the average is the single-copy signal intensity E of the chromosome.
  • the single copy signal intensity E of each chromosome contained in the target species can be recorded and stored through the single copy signal intensity record table of each chromosome as shown in FIG. 19 .
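The single-copy signal intensity can be illustrated with the following sketch, which takes the adjusted number of occurrences of a specific k-mer to be its actual number of occurrences divided by its copy number, as stated for the determination module later in this document. The function name and the toy numbers are assumptions.

```python
def single_copy_signal(actual_counts, copy_numbers):
    """Average of actual occurrences divided by copy number over all specific k-mers."""
    adjusted = [actual_counts[kmer] / copy_numbers[kmer] for kmer in copy_numbers]
    return sum(adjusted) / len(adjusted)

# Toy values for one chromosome, invented for illustration only.
toy_actual = {"ACGTAC": 10, "GGTTAA": 24, "TTGACC": 11}
toy_copy_numbers = {"ACGTAC": 1.0, "GGTTAA": 2.0, "TTGACC": 1.0}
toy_e = single_copy_signal(toy_actual, toy_copy_numbers)
```

The adjusted counts here are 10, 12 and 11, so the toy chromosome's E is their mean, 11.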
  • Step 1404E Calculate the actual signal strength S of each chromosome.
  • the mean M and variance SD of the single-copy signal intensities E of all chromosomes are calculated.
  • the actual signal strength of every other chromosome is calculated in the same way.
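The normalization S = (E - M) / SD given later in this document can be sketched as below. One point is an assumption: the text calls SD the variance, while this sketch uses the population standard deviation (`statistics.pstdev`), which is what the symbol SD usually denotes; swapping in `statistics.pvariance` would follow the wording literally. The function name and toy values are hypothetical.

```python
from statistics import mean, pstdev

def actual_signal_strengths(single_copy):
    """single_copy maps chromosome -> E; returns chromosome -> S = (E - M) / SD."""
    values = list(single_copy.values())
    m = mean(values)      # M: mean of E over all chromosomes
    sd = pstdev(values)   # SD: assumed here to be the population standard deviation
    if sd == 0:
        raise ValueError("all single-copy signal intensities are identical")
    return {chromosome: (e - m) / sd for chromosome, e in single_copy.items()}

# Toy E values, invented for illustration only.
toy_single_copy = {"chr1": 10.0, "chr2": 12.0, "chr3": 14.0}
toy_s = actual_signal_strengths(toy_single_copy)
```

A chromosome whose E equals the mean gets S = 0, and chromosomes above or below the mean get positive or negative values respectively.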
  • Step 1406 Calculate a standard confidence interval list corresponding to the chromosome contained in the target species according to the standard detection sample.
  • the actual signal intensity of each chromosome contained in each standard test sample can be calculated in the manner in step 1404.
  • the actual signal intensity of a standard test sample is referred to as the standard signal intensity.
  • the standard signal intensity of each chromosome contained in each standard detection sample can be calculated.
  • the standard signal intensities of the chromosomes contained in all the standard test samples can be recorded in a table. Further, the records can be separated by gender. That is, a standard signal intensity record table of chromosomes in normal male samples and a standard signal intensity record table of chromosomes in normal female samples are generated.
  • the standard signal intensities of each chromosome across all the standard test samples are analysed statistically, and the mean M' and the variance SD' of the distribution of standard signal intensities of each chromosome are calculated.
  • for example, if the standard test samples are human and there are 100 of them, then there are 100 chromosomes 1, 100 chromosomes 2, ..., and 100 chromosomes 22.
  • the specific numbers of X and Y sex chromosomes depend on the genders of these 100 people. To obtain enough X and Y sex chromosomes, a minimum number of standard test samples of each sex should therefore also be required. For chromosome 1, there are thus 100 standard signal intensities.
  • the mean and variance for chromosome 1 can be calculated from the standard signal intensities of the 100 chromosomes 1, and the mean and variance of the standard signal intensities of the other chromosomes can be calculated in the same way.
  • a standard confidence interval, that is, an interval of standard signal intensity, can be determined for each chromosome contained in the standard test samples at the preset confidence value. In other words, the two boundary values LB and UB of the confidence interval with confidence level P are obtained. LB is the minimum of the confidence interval, and UB is the maximum of the confidence interval.
  • P is generally a value greater than 0.95, infinitely close to 1 but not equal to 1. In practical applications, the confidence level can be adjusted as required. For example, with 95% confidence, P is 0.95, and with 99.9% confidence, P is 0.999.
  • a distribution table of P-confidence boundary values of the actual signal intensities of the chromosomes for the two sexes of the target species can be obtained.
  • the standard confidence interval corresponding to each chromosome of the target species can be estimated statistically from the standard signal intensities of the chromosomes in a large number of standard test samples. That is, the interval in which the actual signal intensity of each chromosome of the target species falls in a normal sample, at the preset confidence value P, is estimated.
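One way to turn the mean and spread of the standard signal intensities into the boundary values LB and UB is sketched below. The text does not say which distribution is used; this example assumes a normal distribution and a two-sided interval, and takes SD' as the sample standard deviation. It relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function names and toy data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def standard_confidence_interval(intensities, confidence):
    """Return (LB, UB) for one chromosome under an assumed normal model."""
    center = mean(intensities)    # M'
    spread = stdev(intensities)   # SD', taken here as the sample standard deviation
    tail = (1.0 - confidence) / 2.0
    model = NormalDist(center, spread)
    return model.inv_cdf(tail), model.inv_cdf(1.0 - tail)

def interval_list(intensities_by_chromosome, confidence):
    """Build the standard confidence interval list, one (LB, UB) pair per chromosome."""
    return {
        chromosome: standard_confidence_interval(values, confidence)
        for chromosome, values in intensities_by_chromosome.items()
    }

# Toy standard signal intensities, invented for illustration only.
toy_standard = {"chr1": [-0.2, -0.1, 0.0, 0.1, 0.2]}
toy_intervals = interval_list(toy_standard, 0.95)
toy_lb, toy_ub = toy_intervals["chr1"]
```

A separate list can be built per gender by calling `interval_list` once on the male samples and once on the female samples.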
  • the above standard test sample can also be a peripheral blood sample of a normal mother carrying a normal baby. Such samples include peripheral blood samples of normal mothers carrying a normal baby boy, a normal baby girl, normal baby boy twins, normal baby girl twins, or normal twins of one boy and one girl. Therefore, when making the distribution table of P-confidence boundary values, the table can also be adjusted according to the type of standard test sample.
  • Step 1408 Detect whether a copy number abnormality exists in the data to be detected.
  • the actual signal intensity of each chromosome can be compared with the standard confidence interval of the corresponding chromosome of the target species obtained in step 1406 at the preset confidence value P.
  • the comparison is made separately for each chromosome.
  • for example, the actual signal intensity of chromosome 1 contained in the target species corresponding to the data to be detected is compared with the standard confidence interval of chromosome 1.
  • if the actual signal intensity of chromosome 1 is not within the standard confidence interval of chromosome 1, it can be determined that chromosome 1 has a copy number abnormality. Otherwise, it can be determined that chromosome 1 has no copy number abnormality.
  • in step 1406, a distribution table of the standard signal intensities of the chromosomes at the preset confidence value P is established for each gender of the target species. The actual signal intensities of the sex chromosomes can therefore also be compared with the distribution tables for the different genders: the actual signal intensity of the X chromosome and the actual signal intensity of the Y chromosome calculated from the data to be detected are compared with the boundary values of the confidence intervals in the distribution table for each gender.
  • the gender corresponding to the data to be detected is male. If the calculated actual signal intensities of the X chromosome and the Y chromosome in the data to be detected fall within the intervals in the distribution table of the standard signal intensities of the chromosomes in normal female samples at the preset confidence value P, then the gender corresponding to the data to be detected is female.
  • the actual signal intensity of each chromosome in the data to be detected is then compared with the confidence interval of the corresponding chromosome in the distribution table at the preset confidence value P.
  • the probability of false positives can be reduced by increasing the preset reliability value P. But increasing P increases the probability of false negatives.
  • the actual signal intensity of each chromosome can be compared with the standard confidence interval of the corresponding chromosome, and the chromosome that is not within the standard confidence interval of the corresponding chromosome can be determined as a chromosome with abnormal copy number.
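The comparison step, including the gender determination from the X and Y chromosomes, can be sketched as follows. The interval bounds are treated as inclusive, which is an assumption, and all names and numbers are invented for illustration.

```python
def within(value, interval):
    """True if value lies inside the (LB, UB) interval, bounds included."""
    lower, upper = interval
    return lower <= value <= upper

def infer_gender(signal, tables, sex_chromosomes=("chrX", "chrY")):
    """Return the gender whose table contains both sex chromosome signals, else None."""
    for gender, intervals in tables.items():
        if all(within(signal[c], intervals[c]) for c in sex_chromosomes):
            return gender
    return None

def abnormal_chromosomes(signal, intervals):
    """List the chromosomes whose actual signal intensity falls outside the interval."""
    return [c for c, value in signal.items() if not within(value, intervals[c])]

# Toy confidence interval tables and sample signals, invented for illustration only.
toy_tables = {
    "male": {"chr1": (-0.5, 0.5), "chr21": (-0.6, 0.6), "chrX": (-1.5, -0.5), "chrY": (0.5, 1.5)},
    "female": {"chr1": (-0.5, 0.5), "chr21": (-0.6, 0.6), "chrX": (0.5, 1.5), "chrY": (-1.5, -0.5)},
}
toy_signal = {"chr1": 0.1, "chr21": 2.4, "chrX": -1.0, "chrY": 1.0}
toy_gender = infer_gender(toy_signal, toy_tables)
toy_flagged = abnormal_chromosomes(toy_signal, toy_tables[toy_gender])
```

In this toy sample the sex chromosome signals match the male table, and only chr21 lies outside its interval, so it is the only chromosome reported as having a copy number abnormality.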
  • This method of detecting chromosome copy number abnormalities compares the data against the characteristic target sequences of each chromosome of the target species, that is, the specific k-mers, which are only a part of the entire genome of the target species.
  • comparing against specific k-mers therefore reduces the comparison space, which shortens the analysis time and improves the detection efficiency.
  • the characteristic targets of each chromosome of the target species generated here integrate multiple genomes from different individuals or populations of the target species, thus avoiding the problem that "when a set of data comes from an individual whose genetic relationship is far from the reference genome, the effect of using whole-genome alignment becomes worse".
  • an apparatus for detecting chromosome copy number abnormalities, including:
  • a specific k-mer acquisition module 2102, configured to obtain sequencing data of a sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected; and to acquire the specific k-mers corresponding to each chromosome of the target species stored in the target database, where a specific k-mer is a k-mer in a chromosome that meets the preset specificity conditions, and a k-mer is a genomic sequence of length k;
  • an actual occurrence count acquisition module 2104, configured to obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
  • a copy number acquisition module 2106, configured to obtain the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
  • a determination module 2108, configured to calculate the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality.
  • the determination module 2108 is further configured to calculate the ratio of the actual number of occurrences of each specific k-mer to its copy number; take the average of these ratios over all specific k-mers contained in each chromosome as the single-copy signal intensity of the corresponding chromosome; and calculate the actual signal intensity of the corresponding chromosome from the single-copy signal intensities of the chromosomes.
  • the actual signal intensity of the corresponding chromosome is calculated according to the following formula:
  • actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M) / SD, where M is the average of the single-copy signal intensities of all chromosomes, and SD is the variance of the single-copy signal intensities of all chromosomes.
  • the apparatus for detecting chromosome copy number abnormalities further includes a standard confidence interval list calculation module (not shown in the figure), configured to: obtain a preset number of standard test samples, a standard test sample being a sample confirmed to have no chromosome copy number abnormality; obtain the actual number of occurrences, in the sequencing data of each standard test sample, of the specific k-mers contained in each chromosome; obtain from the target database the copy number of each specific k-mer of each chromosome contained in the standard test sample; calculate the standard signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer for the standard test sample; determine, from the standard signal intensities of each chromosome in the multiple standard test samples, the standard confidence interval corresponding to the chromosome at the preset confidence value; and obtain, from the standard confidence interval corresponding to each chromosome, the list of standard confidence intervals corresponding to the chromosomes contained in the target species.
  • the above-mentioned standard confidence interval list calculation module is further configured to obtain the standard signal intensity of each chromosome contained in each standard test sample; calculate, for each gender, the mean and variance of the standard signal intensities of each chromosome; and, based on the mean and variance of the standard signal intensities of each chromosome across the multiple standard test samples of the corresponding gender, determine the standard confidence interval at the preset confidence value for the chromosomes contained in the standard test samples of each gender.
  • the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby.
  • the peripheral blood samples include peripheral blood samples of normal mothers carrying a normal baby boy, peripheral blood samples of normal mothers carrying a normal baby girl, peripheral blood samples of normal mothers carrying normal baby boy twins, peripheral blood samples of normal mothers carrying normal baby girl twins, and peripheral blood samples of normal mothers carrying normal twins of one boy and one girl.
  • the above-mentioned standard confidence interval list calculation module is further configured to determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal baby boy;
  • to determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal baby girl; to determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying normal baby boy twins; and to determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying normal baby girl twins.
  • the above-mentioned determination module 2108 is further configured to, when it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome, determine that chromosome to be a chromosome with a copy number abnormality.
  • the apparatus for detecting chromosome copy number abnormalities further includes a gender-specific comparison module (not shown in the figure), configured to determine the standard confidence interval list of the chromosomes for each gender of the target species; compare the actual signal intensity of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list for the corresponding gender of the target species; and, when it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome for the corresponding gender, determine that chromosome to be a chromosome with a copy number abnormality.
  • the above-mentioned apparatus for detecting chromosome copy number abnormalities further includes a target sequence creation module (not shown in the figure), configured to obtain the number of occurrences C, in the corresponding chromosome, of each specific k-mer contained in each chromosome of the target species stored in the target database, and take the number of occurrences of the least frequent specific k-mer in the corresponding chromosome as the minimum number of occurrences Cm; take the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer; generate a specific k-mer copy number list for each chromosome from the copy numbers of the specific k-mers contained in that chromosome; and store the specific k-mer copy number list in the target database.
  • the above-mentioned target sequence creation module is further configured to obtain the multiple chromosomes contained in the target species; classify and sort the chromosomes contained in the target species; obtain high-confidence genomes that satisfy a preset credibility condition; and determine the high-confidence genomes corresponding to each chromosome contained in the target species.
  • satisfying the preset credibility condition includes any of the following: the proportion of non-deterministic characters contained in the chromosome sequence is lower than a preset proportion threshold; the number of sequence fragments belonging to the same chromosome in the chromosome sequence is below a preset fragment threshold; or, when a chromosome sequence is compared with all other sequences of the same chromosome whose genetic relationship falls within a preset genetic distance threshold range to determine the average full coverage percentage of the chromosome sequence in those similar chromosome sequences, the average coverage percentage is higher than a preset percentage value.
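A screening function for the credibility conditions might look like the sketch below. All threshold values are placeholders rather than values from the text, non-deterministic characters are assumed to be anything other than A, C, G and T, and the average coverage percentage is assumed to have been computed beforehand by a separate alignment step that is not shown.

```python
def ambiguous_fraction(sequence):
    """Proportion of characters that are not one of A, C, G, T."""
    return sum(ch not in "ACGT" for ch in sequence.upper()) / len(sequence)

def meets_credibility_condition(sequence, fragment_count, mean_coverage,
                                max_ambiguous=0.01, max_fragments=10,
                                min_coverage=0.95):
    """Return True if any one of the three conditions holds (placeholder thresholds)."""
    checks = (
        ambiguous_fraction(sequence) < max_ambiguous,  # few non-deterministic characters
        fragment_count < max_fragments,                # assembled from few fragments
        mean_coverage > min_coverage,                  # well covered by related sequences
    )
    return any(checks)

# Toy sequence, invented for illustration only.
toy_sequence = "ACGTNNNNAC"
```

The conditions are combined with `any` because the text says that satisfying any one of them is sufficient; a stricter pipeline could combine them with `all` instead.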
  • a specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence index table corresponding to the chromosome meets a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to the chromosome, together with its number of occurrences in the genome occurrence index table of the complete set, meets a second preset error condition. The genome occurrence index table corresponding to a chromosome records, for each k-mer, the number of genomes corresponding to that chromosome that contain the k-mer; the genome occurrence index table of the complete set records, for the k-mers contained in each chromosome of the target species, the number of genomes in the complete set that contain the k-mer.
  • the first preset error condition is: the sum of the first threshold and the ratio of the number of occurrences in the genome occurrence index table corresponding to the chromosome to the number of genomes contained in the corresponding chromosome is greater than or equal to 1.
  • the first threshold is less than 5%.
  • the second preset error condition is: the sum of the second threshold and the ratio of the number of occurrences in the genome occurrence index table corresponding to the chromosome to the number of occurrences in the genome occurrence index table of the complete set is greater than or equal to 1.
  • the second threshold is less than 5%.
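The two error conditions can be written as C1/N + P1 ≥ 1 and C1/C2 + P2 ≥ 1, where C1 is the count in the chromosome's table, N is the number of genomes of that chromosome, and C2 is the count in the complete-set table. The symbol C2, the function names, the default thresholds of 0.04, and the toy tables below are assumptions made for this sketch.

```python
def is_specific(c1, n, c2, p1, p2):
    """Apply both preset error conditions to one k-mer."""
    first = c1 / n + p1 >= 1.0    # present in nearly all genomes of this chromosome
    second = c1 / c2 + p2 >= 1.0  # nearly absent from genomes of other chromosomes
    return first and second

def select_specific_kmers(chromosome_table, complete_table, n, p1=0.04, p2=0.04):
    """Return the k-mers of one chromosome that pass both conditions, sorted."""
    return sorted(
        kmer for kmer, c1 in chromosome_table.items()
        if is_specific(c1, n, complete_table[kmer], p1, p2)
    )

# Toy tables for a chromosome with N = 100 genomes, invented for illustration only.
toy_chromosome_table = {"AAAC": 100, "CCGT": 98, "GGTA": 60, "TTAG": 100}
toy_complete_table = {"AAAC": 100, "CCGT": 99, "GGTA": 60, "TTAG": 180}
toy_selected = select_specific_kmers(toy_chromosome_table, toy_complete_table, n=100)
```

In the toy data, "GGTA" fails the first condition because it is missing from 40 of the 100 genomes, and "TTAG" fails the second because 80 of its 180 genome hits come from other chromosomes.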
  • Each module in the above apparatus for detecting chromosome copy number abnormalities can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 22.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the computer equipment database is used to store data for detecting abnormal chromosome copy numbers.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by the processor, implement a method for detecting chromosome copy number abnormalities.
  • FIG. 22 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer parts than shown in the figure, combine certain parts, or have a different arrangement of parts.
  • a computer device includes a memory and one or more processors.
  • Computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the steps of the method for detecting chromosome copy number abnormalities provided in any embodiment of the present application are implemented.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the method for detecting chromosome copy number abnormalities provided in any one of the embodiments of the present application.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A method for detecting chromosomal copy number variations, comprising: obtaining sequencing data of a sample to be detected as data to be detected, and determining a target species corresponding to the data to be detected; obtaining the specific k-mer corresponding to each chromosome included in the target species stored in a target database; obtaining the actual occurrence number of the specific k-mer included in each chromosome in the data to be detected; obtaining the copy number of each specific k-mer from the target database; obtaining, according to the actual occurrence number and the copy number of each specific k-mer, the actual signal intensity of the corresponding chromosome by calculation; and determining chromosomes of which the actual signal intensities are not in a standard confidence interval of the corresponding chromosome as chromosomes having copy number variations.

Description

Method, apparatus and storage medium for detecting chromosome copy number abnormalities
This application claims priority to Chinese patent application No. 2018106514416, filed with the China Patent Office on June 22, 2018 and entitled "Method, Apparatus, Computer Device, and Storage Medium for Detecting Chromosome Copy Number Abnormalities", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to a method, an apparatus, a computer device, and a storage medium for detecting chromosome copy number abnormalities.
Background
In medicine and biology, to detect whether a sample has a chromosome copy number abnormality, conventional technical solutions use the genome sequencing data of the sample to be tested and determine by data analysis whether a chromosome copy number abnormality exists in the sample. However, current technical solutions generally require aligning the sequencing data against the complete sequences of all chromosomes of a species, so they demand substantial computing resources, take a long time, and consume a large amount of memory.
Summary
According to various embodiments disclosed in the present application, a method, an apparatus, and a storage medium for detecting chromosome copy number abnormalities are provided.
A method for detecting chromosome copy number abnormalities, comprising:
obtaining sequencing data of a sample to be detected as data to be detected, and determining a target species corresponding to the data to be detected;
obtaining the specific k-mers corresponding to each chromosome of the target species stored in a target database, where a specific k-mer is a k-mer in a chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
obtaining the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
obtaining the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
determining a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality.
An apparatus for detecting chromosome copy number abnormalities, the apparatus comprising:
a specific k-mer acquisition module, configured to obtain sequencing data of a sample to be detected as data to be detected, and determine a target species corresponding to the data to be detected; and to obtain the specific k-mers corresponding to each chromosome of the target species stored in a target database, where a specific k-mer is a k-mer in a chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
an actual occurrence count acquisition module, configured to obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
a copy number acquisition module, configured to obtain the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome; and
a determination module, configured to calculate the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer, and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality.
A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
Acquiring sequencing data of a sample to be detected as the data to be detected, and determining the target species corresponding to the data to be detected;
Obtaining a specific k-mer corresponding to each chromosome of the target species stored in a target database, wherein the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
Obtaining the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
Obtaining the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in its corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
Calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
Determining that a chromosome whose actual signal intensity is not within the standard confidence interval of that chromosome is a chromosome with abnormal copy number.
One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
Acquiring sequencing data of a sample to be detected as the data to be detected, and determining the target species corresponding to the data to be detected;
Obtaining a specific k-mer corresponding to each chromosome of the target species stored in a target database, wherein the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
Obtaining the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
Obtaining the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in its corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
Calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
Determining that a chromosome whose actual signal intensity is not within the standard confidence interval of that chromosome is a chromosome with abnormal copy number.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for detecting abnormal chromosome copy number according to one or more embodiments.
FIG. 2 is a schematic flowchart of the steps performed before step 102 according to one or more embodiments.
FIG. 3 is a schematic flowchart of the steps performed before step 102 according to another one or more embodiments.
FIG. 4 is a copy number list of the specific k-mers of chromosome X according to one or more embodiments.
FIG. 5 is a schematic flowchart of step 110 according to one or more embodiments.
FIG. 6 is a schematic flowchart of additional steps included in the method for detecting abnormal chromosome copy number according to one or more embodiments.
FIG. 7A is a record table of the standard signal intensities of chromosomes in normal male samples according to one or more embodiments.
FIG. 7B is a record table of the standard signal intensities of chromosomes in normal female samples according to one or more embodiments.
FIG. 8 is a schematic flowchart of step 610 according to one or more embodiments.
FIG. 9A is a distribution table of the preset confidence value P for the standard signal intensities of chromosomes in normal male samples according to one or more embodiments.
FIG. 9B is a distribution table of the preset confidence value P for the standard signal intensities of chromosomes in normal female samples according to one or more embodiments.
FIG. 10 is a schematic flowchart of step 610 according to another one or more embodiments.
FIG. 11 is a schematic flowchart of additional steps included in the method for detecting abnormal chromosome copy number according to another one or more embodiments.
FIG. 12 is a schematic flowchart of the steps performed before step 102 according to yet another one or more embodiments.
FIG. 13 is a record table of the actual numbers of occurrences of the specific k-mers of a particular chromosome according to one or more embodiments.
FIG. 14 is a schematic flowchart of a method for detecting abnormal chromosome copy number according to another one or more embodiments.
FIG. 15 is a schematic flowchart of step 1402 according to one or more embodiments.
FIG. 16 is a table of human chromosome copy numbers according to one or more embodiments.
FIG. 17 is a schematic flowchart of step 1404 according to one or more embodiments.
FIG. 18 is a calculation table of the single-copy signal intensity of a particular chromosome according to one or more embodiments.
FIG. 19 is a record table of the single-copy signal intensity of each chromosome according to one or more embodiments.
FIG. 20 is a calculation table of the actual signal intensity of each chromosome according to one or more embodiments.
FIG. 21 is a block diagram of a device for detecting abnormal chromosome copy number according to one or more embodiments.
FIG. 22 is a block diagram of a computer device according to one or more embodiments.
DETAILED DESCRIPTION
To make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
In one embodiment, as shown in FIG. 1, a method for detecting abnormal chromosome copy number is provided, which includes the following steps:
Step 102: Acquire sequencing data of a sample to be detected as the data to be detected, and determine the target species corresponding to the data to be detected.
The data to be detected is the data output by a device such as a DNA sequencer, an RNA sequencer, or a protein sequencer after it reads the sequences of the biomolecules contained in a sample. DNA sequencing is the process of determining the exact order of nucleotides within a DNA molecule; it includes any method or technique for determining the order of the four bases adenine, guanine, cytosine, and thymine in a DNA strand. A sequencer is an instrument capable of measuring the sequences in an input sample; the sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA. A sample can take various forms, such as a drop of blood, a mouthful of sputum, or a handful of soil. Once the data to be detected has been obtained, the species to which it belongs, that is, the target species, can be determined. For example, when the sequencing data is a human gene sequence, the target species is human.
Step 104: Obtain a specific k-mer corresponding to each chromosome of the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k.
Each target species contains one or more individuals. Each individual contains one or more genomes, and each genome contains one or more chromosomes. Each target species therefore contains multiple chromosomes. The target database may store a feature target sequence set established in advance for each chromosome, and the feature target sequence set of each chromosome contains the specific k-mers of that chromosome. A specific k-mer is a k-mer selected from the k-mers contained in a chromosome because it satisfies the preset specificity condition. The preset specificity condition is set in advance by a technician for selecting qualifying k-mers, and may be chosen according to the technician's judgment or the requirements of the actual project.
A k-mer is a genomic sequence of length k, where k is a natural number. If a type of genomic data contains a total of a different deterministic characters (a being the alphabet size), then for a given k there are a^k possible distinct k-mers. For DNA or RNA (ribonucleic acid) sequences, the deterministic characters are the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, the deterministic characters are the defined amino acid characters.
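The definition of a k-mer can be illustrated with a minimal Python sketch. The function names `iter_kmers`, `count_kmers`, and `max_distinct_kmers` are hypothetical and are introduced here only for illustration; the sketch assumes sequences are plain strings over the deterministic characters.

```python
from collections import Counter


def iter_kmers(sequence: str, k: int):
    """Yield every length-k window of the sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in the sequence."""
    return Counter(iter_kmers(sequence, k))


def max_distinct_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound a^k on the number of distinct k-mers for an alphabet of size a."""
    return alphabet_size ** k
```

For a DNA-only alphabet of four bases and k = 3, the bound is 4^3 = 64 possible k-mers, of which a short sequence contains only a few.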
Step 106: Obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome.
Once the data to be detected has been obtained, it can be compared with each chromosome in turn: the number of times each specific k-mer in the chromosome's feature target sequence set occurs in the data to be detected is counted, and this count is the actual number of occurrences of that specific k-mer in the data to be detected.
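Counting the actual occurrences can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the function name `count_specific_kmers` is hypothetical, reads are assumed to be plain strings, all specific k-mers are assumed to share the same length k, and reverse-complement matching (which the text does not discuss) is ignored.

```python
def count_specific_kmers(reads, specific_kmers, k):
    """Count occurrences of each specific k-mer across all sequencing reads.

    Only k-mers in the specific set are tracked, so the comparison space is
    the specific set rather than the whole genome.
    """
    counts = {kmer: 0 for kmer in specific_kmers}
    for read in reads:
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            if window in counts:
                counts[window] += 1
    return counts
```

Specific k-mers that never appear in the reads keep a count of zero, so every specific k-mer of a chromosome still contributes to later averages.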
Step 108: Obtain the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in its corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome.
The copy number of a specific k-mer is the ratio of the number of occurrences of that specific k-mer in its corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome. To obtain the copy numbers from the target database, the specific k-mer copy number list corresponding to each chromosome can be retrieved from the target database, and the copy number of each specific k-mer in the chromosome can then be read from that list. The specific k-mer copy number lists are built in advance and stored in the target database, so they can be retrieved when needed, which improves detection efficiency.
Step 110: Calculate the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer.
After the actual number of occurrences in the data to be detected and the copy number of each specific k-mer have been obtained, the actual signal intensity of the corresponding chromosome can be calculated from these two parameters. Given the actual number of occurrences Ci and the copy number Fi of each specific k-mer, the ratio Ci/Fi is calculated and taken as the adjusted number of occurrences of that k-mer. In this way the adjusted number of occurrences is obtained for all specific k-mers in each chromosome. The mean of the adjusted numbers of occurrences of the specific k-mers in a chromosome is then taken as the single-copy signal intensity E of that chromosome. Once E has been calculated for all chromosomes, the mean M and the variance SD of the single-copy signal intensities of all chromosomes are calculated. The difference between each chromosome's single-copy signal intensity and M, divided by SD, is the actual signal intensity of that chromosome: S_i = (E_i - M) / SD.
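The calculation described above can be sketched in Python as follows. The function names and the dictionary layout are hypothetical. The text calls SD the variance, so the sketch follows it literally and uses the population variance; this choice is an assumption, since the text does not say whether a population or sample variance is meant, and substituting a standard deviation would give a conventional z-score instead.

```python
from statistics import mean, pvariance


def single_copy_intensity(actual_counts, copy_numbers):
    """Mean of Ci / Fi over the specific k-mers of one chromosome (E)."""
    adjusted = [actual_counts[kmer] / copy_numbers[kmer] for kmer in copy_numbers]
    return mean(adjusted)


def actual_signal_intensities(counts_by_chrom, copy_numbers_by_chrom):
    """Return S_i = (E_i - M) / SD for every chromosome.

    Assumption: SD is the population variance of the E values, following the
    wording of the text literally.
    """
    e = {
        chrom: single_copy_intensity(counts_by_chrom[chrom], copy_numbers)
        for chrom, copy_numbers in copy_numbers_by_chrom.items()
    }
    values = list(e.values())
    m = mean(values)
    sd = pvariance(values)
    if sd == 0:
        raise ValueError("all chromosomes have identical single-copy intensity")
    return {chrom: (value - m) / sd for chrom, value in e.items()}
```

The explicit check for a zero variance is a design choice of this sketch; the text does not describe how that degenerate case should be handled.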
Step 112: Determine that a chromosome whose actual signal intensity is not within the standard confidence interval of that chromosome is a chromosome with abnormal copy number.
Each chromosome has its own standard confidence interval, which is an interval of standard signal intensity calculated in advance from a large number of samples. Standard signal intensity is calculated in the same way as actual signal intensity; the difference is that standard signal intensity is computed from standard test samples, which are confirmed to have no chromosome copy number abnormality, whereas actual signal intensity is computed from the data to be detected. When the actual signal intensity of a chromosome lies within the standard confidence interval of that chromosome, the chromosome can be judged to have no copy number abnormality; otherwise, it can be judged to have a copy number abnormality. A chromosome whose actual signal intensity is not within its standard confidence interval can therefore be determined to be a chromosome with abnormal copy number. Here, the actual signal intensity of each chromosome is compared with the standard confidence interval of the same chromosome. For example, the actual signal intensity of chromosome 1 is compared with the pre-established standard confidence interval of chromosome 1, and the actual signal intensity of chromosome 2 is compared with the pre-established standard confidence interval of chromosome 2.
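The per-chromosome comparison can be sketched as below. The function name `flag_abnormal_chromosomes` and the representation of each standard confidence interval as a `(low, high)` tuple are assumptions made for illustration; treating the interval bounds as inclusive is also an assumption, since the text does not specify it.

```python
def flag_abnormal_chromosomes(actual_intensity, standard_intervals):
    """Return the chromosomes whose actual signal intensity lies outside
    their own standard confidence interval (bounds treated as inclusive)."""
    abnormal = []
    for chrom, intensity in actual_intensity.items():
        low, high = standard_intervals[chrom]
        if not (low <= intensity <= high):
            abnormal.append(chrom)
    return abnormal
```

Each chromosome is looked up in its own interval, mirroring the chromosome 1 and chromosome 2 example in the text.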
By determining the target species corresponding to the data to be detected and obtaining the specific k-mers of each chromosome in the target species, the actual signal intensity of each chromosome is calculated from the actual number of occurrences of the specific k-mers in the data to be detected and the copy number of each specific k-mer. The actual signal intensity of each chromosome can then be compared with the standard confidence interval of that chromosome, and a chromosome whose intensity falls outside the interval is determined to be a chromosome with abnormal copy number. This method compares the data against the feature target sequences of each chromosome of the target species, that is, the specific k-mers. Because the specific k-mers are only part of the whole genome of the target species, comparing against them reduces the comparison space, which shortens analysis time and improves detection efficiency.
In one embodiment, a specific k-mer is a k-mer of a chromosome whose number of occurrences in the genome occurrence index table corresponding to that chromosome satisfies a preset error condition.
The feature target sequence set of each chromosome contains the specific k-mers of that chromosome that satisfy the preset specificity condition. More specifically, the preset specificity condition is that the number of occurrences of the k-mer in the genome occurrence index table of its chromosome satisfies the preset error condition. The preset error condition is set in advance by a technician according to the requirements of the actual project. It can be a range, which means that a k-mer selected as specific is allowed a certain amount of error rather than having to satisfy a strict objective condition exactly.
Each chromosome has its own genome occurrence index table. From this table it can be learned in how many of the genomes contained in the chromosome each k-mer of the chromosome has appeared. The k-mers whose number of occurrences in the table satisfies the preset error condition can then be selected and used as the specific k-mers.
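One possible reading of this selection step is sketched below. The text does not define the preset error condition, so the rule used here is purely an assumption for illustration: a k-mer qualifies if it appears in at least a fraction `1 - tolerance` of the chromosome's genomes. The function name `select_specific_kmers` and the `tolerance` parameter are hypothetical.

```python
def select_specific_kmers(genome_occurrence_index, total_genomes, tolerance=0.05):
    """Select k-mers whose genome count satisfies a hypothetical error condition.

    genome_occurrence_index maps each k-mer to the number of genomes of the
    chromosome that contain it. The threshold rule is an illustrative
    assumption, not the condition used by the described method.
    """
    min_genomes = (1.0 - tolerance) * total_genomes
    return sorted(
        kmer
        for kmer, n_genomes in genome_occurrence_index.items()
        if n_genomes >= min_genomes
    )
```

With a tolerance of zero the rule becomes strict (the k-mer must appear in every genome); a positive tolerance admits k-mers that are missing from a few genomes, which is the kind of allowed error the text describes.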
Because a certain amount of error is allowed when selecting specific k-mers, sequences that represent the chromosome can be found with high probability within a given error range. As a result, only the specific sequences, rather than the whole genome sequence, are used when determining the chromosomes contained in the sequencing data. This reduces the sequence comparison space when processing real data to be detected, which shortens analysis time and improves detection efficiency.
In one embodiment, before step 102, the method further includes the following steps: generating a genome occurrence index table corresponding to each chromosome, where the table records, for each k-mer, the number of genomes of its corresponding chromosome that contain the k-mer; and storing the genome occurrence index table in the feature target sequence set corresponding to the chromosome.
The genome is the totality of genetic information in an organism, stored in the form of nucleotide sequences. The sum of the genetic material within one complete unit of an organism (for example an individual animal or plant, an animal or plant cell, or an individual bacterium) is the genome. Generally, the complete genome of an individual may contain multiple chromosomes, and each chromosome may contain multiple k-mers. The notion of the "genome of a chromosome", commonly used in the art, is used here to mean the sum of all sequences contained in one complete chromosome. Under this notion, the genome occurrence index table of each chromosome records in how many of the genomes corresponding to that chromosome each k-mer of the chromosome has appeared; that is, for each k-mer, it records the number of genomes of its chromosome that contain the k-mer.
The genome occurrence index table therefore records in how many of the genomes corresponding to its chromosome each k-mer has appeared. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. Once the number of genomes containing each k-mer is known, the genome occurrence index table for each chromosome can be built. If there are M chromosomes in total, M corresponding genome occurrence index tables are generated.
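Building such a table can be sketched as follows, assuming each "genome of a chromosome" is available as a plain string. The function name `build_genome_occurrence_index` is hypothetical.

```python
from collections import Counter


def build_genome_occurrence_index(chromosome_genomes, k):
    """Map each k-mer to the number of genomes that contain it at least once.

    Converting each genome's k-mers to a set first ensures that repeated
    occurrences within one genome are counted only once.
    """
    index = Counter()
    for genome in chromosome_genomes:
        distinct = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        index.update(distinct)
    return index
```

Calling this function once per chromosome yields one table per chromosome, matching the M tables described in the text.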
After the genome occurrence index table of each chromosome has been built, it can be stored in the feature target sequence set corresponding to that chromosome, that is, in the target database. Once stored, the table can be retrieved from the target database whenever it is needed, which improves detection efficiency.
In one embodiment, as shown in FIG. 2, before step 102, the method further includes the following steps:
Step 100: Select, from the k-mers corresponding to each chromosome, the k-mers that satisfy the preset specificity condition.
Step 101: Store the k-mers that satisfy the preset specificity condition in the feature target sequence set corresponding to each chromosome.
The target database stores a feature target sequence set for each chromosome, and each feature target sequence set contains the specific k-mers of its chromosome. A specific k-mer is a k-mer selected from the k-mers contained in a chromosome because it satisfies the preset specificity condition. Once the k-mers that satisfy the preset specificity condition, that is, the specific k-mers, have been selected, they can be stored in the feature target sequence set of the corresponding chromosome. Because the feature target library is built in advance in this way, the specific k-mer data can be retrieved directly when checking whether a chromosome is abnormal, which improves detection efficiency.
In one embodiment, as shown in FIG. 3, before step 102, the method further includes the following steps:
Step 302: For each chromosome of the target species stored in the target database, obtain the number of occurrences C of each specific k-mer in its corresponding chromosome, and take the number of occurrences of the least frequent specific k-mer in that chromosome as the minimum number of occurrences Cm.
Step 304: Take the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer.
Step 306: Generate a specific k-mer copy number list for each chromosome from the copy numbers of the specific k-mers contained in that chromosome.
Step 308: Store the specific k-mer copy number list in the target database.
Step 108 then includes: obtaining the copy number of each specific k-mer from the specific k-mer copy number list.
The target species contains multiple chromosomes, and each chromosome contains one or more specific k-mers. For each specific k-mer in a chromosome, its number of occurrences C in that chromosome can be obtained, and the number of occurrences of the least frequent specific k-mer in the chromosome is taken as the minimum number of occurrences Cm.
For each specific k-mer, the ratio of its number of occurrences C to the number of occurrences Cm of the least frequent specific k-mer on the chromosome is the copy number of that specific k-mer. Once the numbers of occurrences of all specific k-mers in a chromosome are known, the copy number of each specific k-mer can be calculated, and the specific k-mer copy number list for that chromosome can be generated. Each specific k-mer copy number list can be stored in the feature target sequence set corresponding to its chromosome, so that the list can be retrieved directly when needed, which improves detection efficiency.
When the copy number of a specific k-mer is needed, the specific k-mer copy number list of the chromosome to which the k-mer belongs is retrieved first, and the copy number recorded in the list is then read. Consider the copy number list of the specific k-mers of chromosome X shown in FIG. 4, and suppose chromosome X contains N specific k-mers. Their numbers of occurrences in chromosome X are C1, C2, ..., Cn. One of these specific k-mers has the fewest occurrences in chromosome X, and its count is denoted Cm. The copy numbers F of the N specific k-mers are then F1 = C1/Cm, F2 = C2/Cm, ..., Fn = Cn/Cm. The copy number of the least frequent specific k-mer equals Cm/Cm, that is, 1.
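The construction of a copy number list for one chromosome can be sketched as below. The function name `copy_number_list` is hypothetical, and the sketch assumes every specific k-mer occurs at least once in its chromosome, so that Cm is nonzero.

```python
def copy_number_list(occurrences_in_chromosome):
    """Compute F = C / Cm for every specific k-mer of one chromosome.

    occurrences_in_chromosome maps each specific k-mer to its count C in the
    chromosome; Cm is the smallest of those counts.
    """
    c_min = min(occurrences_in_chromosome.values())
    return {kmer: c / c_min for kmer, c in occurrences_in_chromosome.items()}
```

By construction the least frequent specific k-mer receives a copy number of exactly 1, and every other copy number is at least 1.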
In one embodiment, as shown in FIG. 5, the above step 110 includes:
Step 502: calculate, for each specific k-mer, the ratio of its actual number of occurrences to its copy number.
Step 504: for each chromosome, calculate the mean of these ratios over all specific k-mers contained in the chromosome, and take the mean as the single-copy signal strength of that chromosome.
Step 506: calculate the actual signal strength of each chromosome from its single-copy signal strength.
The actual number of occurrences of each specific k-mer in the data to be detected and the copy number of each specific k-mer have both been obtained, so the ratio of the actual number of occurrences to the copy number can be obtained for each specific k-mer. A chromosome may contain multiple specific k-mers, so these ratios can be obtained for all specific k-mers contained in the chromosome and their mean computed. Each chromosome therefore has its own mean ratio of actual number of occurrences to copy number, and this mean is the single-copy signal strength of the chromosome. The actual signal strength of each chromosome can then be calculated from its single-copy signal strength.
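Steps 502 and 504 can be sketched as follows. The function name `single_copy_signal`, the dictionary inputs, and the choice to count a k-mer that is absent from the sample as zero occurrences are assumptions made for illustration; the patent does not specify them.

```python
from statistics import mean


def single_copy_signal(actual_counts, copy_numbers):
    """Mean of (actual occurrences / copy number) over a chromosome's k-mers.

    `copy_numbers` maps each specific k-mer of the chromosome to its copy
    number; `actual_counts` maps k-mers to their counts in the sample data.
    A k-mer missing from `actual_counts` is treated as 0 (an assumption).
    """
    if not copy_numbers:
        raise ValueError("chromosome has no specific k-mers")
    ratios = [actual_counts.get(kmer, 0) / f for kmer, f in copy_numbers.items()]
    return mean(ratios)


# Hypothetical values: copy numbers from the target database, counts from a sample.
kmer_copy_numbers = {"ACGTAC": 2.0, "TTGACA": 1.0, "GGCATC": 3.0}
sample_counts = {"ACGTAC": 20, "TTGACA": 10, "GGCATC": 33}
signal = single_copy_signal(sample_counts, kmer_copy_numbers)
```

Dividing by the copy number puts every k-mer on a per-copy scale before averaging, so k-mers that occur many times on the chromosome do not dominate the mean.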
In one embodiment, the actual signal strength of each chromosome is calculated according to the following formula:
Actual signal strength of a chromosome = (single-copy signal strength of the chromosome - M) / SD, where M is the mean of the single-copy signal strengths of all chromosomes and SD is the variance of the single-copy signal strengths of all chromosomes.
Once the single-copy signal strength of each chromosome has been obtained, the mean M and the variance SD of the single-copy signal strengths of all chromosomes can be calculated. The actual signal strength of each chromosome is the difference between its single-copy signal strength and the mean M, divided by the variance SD. That is, actual signal strength of each chromosome = (single-copy signal strength of the chromosome - M) / SD.
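The normalization formula can be sketched as below. One point is an explicit assumption: the text describes SD as the variance, while the symbol SD and the form of the formula suggest a standard deviation. The sketch uses the population standard deviation; replacing `pstdev` with `pvariance` would follow the wording literally. The function name and example values are hypothetical.

```python
from statistics import mean, pstdev


def actual_signal_strengths(single_copy):
    """Apply (signal - M) / SD to every chromosome of one sample.

    `single_copy` maps chromosome names to single-copy signal strengths.
    SD is taken here as the population standard deviation (an assumption;
    the source text calls SD the variance).
    """
    values = list(single_copy.values())
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all chromosomes have the same signal strength")
    return {chrom: (s - m) / sd for chrom, s in single_copy.items()}


# Hypothetical single-copy signal strengths for three chromosomes.
sample_signals = {"chr1": 10.0, "chr2": 10.0, "chr21": 13.0}
normalized = actual_signal_strengths(sample_signals)
```

In this made-up example, chr21 lies above the sample mean and receives a positive value, while chr1 and chr2 receive equal negative values.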
In one embodiment, as shown in FIG. 6, the above method for detecting chromosomal copy number abnormalities further includes the following steps:
Step 602: obtain a preset number of standard detection samples, a standard detection sample being a sample confirmed to have no chromosomal copy number abnormality.
Step 604: obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome of the standard detection samples.
Before the chromosomes in the data to be detected can be checked for copy number abnormalities, a standard confidence interval list covering each chromosome must be determined in advance. The chromosomes in the data to be detected can then be compared against this predetermined list to decide whether any of them has a copy number abnormality. To determine the standard confidence interval list, a preset number of standard detection samples must first be obtained. A standard detection sample is a sample confirmed to have no chromosomal copy number abnormality. The preset number can be set by the technician, but it should satisfy the statistical requirement for a large sample. In general, the preset number should be greater than 30, or greater than 100. After multiple standard detection samples have been obtained, the actual number of occurrences, in the data to be detected, of the specific k-mers in the chromosomes contained in each standard detection sample can be obtained.
Step 606: obtain, from the target database, the copy number of each specific k-mer in each chromosome contained in the standard detection samples.
Step 608: obtain the standard signal strength of each chromosome from the actual number of occurrences and the copy number of each specific k-mer contained in the standard detection samples.
The copy number of a specific k-mer is the ratio of its number of occurrences on the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome. After the copy number of each specific k-mer in each chromosome contained in the standard detection samples has been obtained from the target database, the standard signal strength of each chromosome can be obtained from the actual number of occurrences and the copy number of each specific k-mer contained in the standard detection samples. The standard signal strength is calculated in the same way as the actual signal strength; the only difference is that the standard signal strength refers to the standard detection samples, whereas the actual signal strength refers to the data to be detected. After the standard signal strength of each chromosome has been obtained, standard signal strength record tables can be built separately for each sex. For example, if the target species is human, one table can record the standard signal strengths of the chromosomes in the male standard detection samples and another the standard signal strengths of the chromosomes in the female standard detection samples.
FIG. 7A shows the standard signal strength record table for chromosomes in normal male samples, and FIG. 7B shows the corresponding table for normal female samples. The two tables record the standard signal strengths of the chromosomes contained in the male samples and in the female samples, respectively. For example, as shown in FIG. 7A, the standard signal strength of chromosome 1 in sample 1 is recorded as $S_{1,1}$ and that of chromosome 2 as $S_{1,2}$, where the first index is the sample and the second is the chromosome. The standard signal strength of chromosome 1 in sample i is recorded as $S_{i,1}$, and that of chromosome 2 in sample i as $S_{i,2}$. FIG. 7B uses the same notation.
Step 610: determine, from the standard signal strength of each chromosome in the multiple standard detection samples, the standard confidence interval of the chromosome at the preset confidence value.
Step 612: obtain, from the standard confidence interval of each chromosome, the standard confidence interval list for the chromosomes contained in the target species.
A confidence interval is an interval, computed from a random sample drawn from a population, that may contain a population parameter to be estimated at a given confidence. This confidence is also called the confidence level. The preset confidence value P here is a confidence value set in advance by the technician; it is generally set to a value greater than 0.95 and may be arbitrarily close to, but never equal to, 1. The technician can adjust the preset confidence value as needed in practice. For example, a confidence of 95% corresponds to P = 0.95, and a confidence of 99.9% corresponds to P = 0.999.
The two boundary values LB and UB of the standard signal strength of a chromosome can be determined from the preset confidence value, which gives the confidence interval corresponding to that value. LB is the lower bound of the confidence interval and UB is the upper bound. The confidence interval obtained is therefore an interval of standard signal strength. For every chromosome, the standard signal strength interval at the preset confidence value can be obtained, and this interval is the standard confidence interval of that chromosome. Because the target species contains multiple chromosomes, the result is a standard confidence interval list for the chromosomes contained in the target species, holding the standard confidence interval of each chromosome. For example, if the preset confidence value P is set to 0.98, the standard signal strength interval of each chromosome at a probability of 98% is obtained.
In one embodiment, as shown in FIG. 8, the above step 610 includes:
Step 802: obtain the standard signal strength of each chromosome contained in each standard detection sample.
Step 804: separately for each sex of the standard detection samples, calculate the mean and variance of the standard signal strengths of the chromosomes contained in all standard detection samples.
Step 806: from the mean and variance of the standard signal strength of each chromosome across the standard detection samples of a given sex, determine the standard confidence interval, at the preset confidence value, of each chromosome contained in the standard detection samples of that sex.
After the standard signal strength of each chromosome contained in each standard detection sample has been obtained, the mean and variance of the standard signal strength can be calculated for each numbered chromosome. For example, once the standard signal strength of chromosome 1 has been obtained for every standard detection sample, the mean and variance of the standard signal strength of chromosome 1 can be calculated. In the same way, the mean and variance of the standard signal strength can be calculated for chromosomes 2, 3, ..., 22 and for the X and Y chromosomes.
After the mean and variance of the standard signal strength of each chromosome have been calculated, the standard confidence interval of each chromosome at the preset confidence value, that is, the corresponding standard signal strength interval, can be determined. For example, with human as the target species, the standard detection samples of each sex can be used to build a distribution table of the standard signal strengths of the chromosomes at the preset confidence value P for male samples, and a corresponding table for female samples.
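The text does not state how LB and UB are derived from the mean and variance. The sketch below shows one common choice, offered purely as an assumption: treating the standard signal strength of a chromosome as normally distributed and taking the central two-sided interval that covers probability P. The function name and example values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def standard_confidence_interval(signals, p):
    """Return (LB, UB) for one chromosome from its reference-sample signals.

    Assumes the signals are approximately normal and uses the central
    interval mean +/- z * sd covering probability `p`. This is one possible
    reading, not the formula given in the source.
    """
    if not 0 < p < 1:
        raise ValueError("confidence value must lie strictly between 0 and 1")
    m = mean(signals)
    sd = stdev(signals)  # sample standard deviation across reference samples
    z = NormalDist().inv_cdf(0.5 + p / 2)
    return m - z * sd, m + z * sd


# Hypothetical standard signal strengths of one chromosome in three samples.
lb, ub = standard_confidence_interval([-1.0, 0.0, 1.0], 0.95)
```

Applying the function to each chromosome of the male reference samples and of the female reference samples would give one interval list per sex, analogous to the tables of FIG. 9A and FIG. 9B.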
FIG. 9A shows the distribution table of the standard signal strengths of chromosomes in normal male samples at the preset confidence value P. A normal male sample contains 22 autosomes and the X and Y chromosomes. M' denotes the mean of the standard signal strengths of all chromosomes, and SD' denotes the variance of the standard signal strengths of all chromosomes. LB denotes the lower bound, and UB the upper bound, of the confidence interval of each chromosome at the preset confidence value P; together they define the confidence interval. FIG. 9B shows the corresponding distribution table for normal female samples. FIG. 9A and FIG. 9B differ because the genomes of individuals of different sexes have different chromosome compositions: the male table of FIG. 9A covers the X and Y sex chromosomes in addition to the 22 autosomes, whereas the female table covers the 22 autosomes and two X chromosomes. The remaining entries have the same meaning in both tables.
In one embodiment, the standard detection samples are peripheral blood samples from normal mothers carrying normal babies. They include peripheral blood samples from normal mothers carrying a normal boy, a normal girl, normal twin boys, normal twin girls, and normal boy-girl twins.
Peripheral blood is the blood outside the bone marrow. A normal mother is a mother whose chromosome copy number is not abnormal, and a normal baby is a baby whose chromosome copy number is not abnormal. The criteria for classifying a mother or a baby as normal can also be adjusted by the technician according to the actual study. To build the sample data, a large number of standard detection samples can be obtained, and these can be peripheral blood samples from normal mothers carrying normal babies. Because the baby may be a boy or a girl, and the mother may be carrying twins, the peripheral blood samples include samples from normal mothers carrying a normal boy, a normal girl, normal twin boys, normal twin girls, and normal boy-girl twins. In other cases, the standard detection sample can also be a peripheral blood sample from a normal mother carrying more than two normal babies, such as normal triplets or normal quadruplets. The number of babies carried by a normal mother need not be limited; any peripheral blood sample from a normal mother carrying normal babies can serve as a standard detection sample.
As shown in FIG. 10, the above step 610 includes the following steps:
Step 1002: determine the standard confidence interval of each chromosome at the preset confidence value from the standard signal strength of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal boy.
Step 1004: determine the standard confidence interval of each chromosome at the preset confidence value from the standard signal strength of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal girl.
Step 1006: determine the standard confidence interval of each chromosome at the preset confidence value from the standard signal strength of each chromosome contained in the peripheral blood samples of normal mothers carrying normal twin boys.
Step 1008: determine the standard confidence interval of each chromosome at the preset confidence value from the standard signal strength of each chromosome contained in the peripheral blood samples of normal mothers carrying normal twin girls.
Step 1010: determine the standard confidence interval of each chromosome at the preset confidence value from the standard signal strength of each chromosome contained in the peripheral blood samples of normal mothers carrying normal boy-girl twins.
When the standard confidence interval of a chromosome at the preset confidence value is determined from its standard signal strength, different kinds of standard detection samples require the standard signal strengths to be determined separately for each kind. Steps 1002 to 1010 therefore determine the standard confidence intervals at the preset confidence value from the standard signal strengths of the chromosomes in each kind of standard detection sample. For example, when the standard detection samples are peripheral blood samples from normal mothers carrying a normal boy, the standard confidence interval of each chromosome is determined from the standard signal strengths of the chromosomes in those samples. When the standard detection samples are peripheral blood samples from normal mothers carrying normal boy-girl twins, the standard confidence interval of each chromosome is determined from the standard signal strengths of the chromosomes in those samples.
In one embodiment, the above step 112 includes: when the actual signal strength of a chromosome is detected to lie outside the standard confidence interval of that chromosome, determining the chromosome to have a copy number abnormality.
Once a large number of standard detection samples have been obtained, the standard confidence interval of each chromosome of the target species can be calculated, which gives a standard confidence interval list. The actual signal strength of each chromosome of the target species to which the data to be detected belong can then be compared with the precomputed standard confidence interval of the same chromosome. When the actual signal strength of a chromosome lies outside the standard confidence interval of that chromosome, the chromosome is determined to have a copy number abnormality. Each chromosome is compared with its own standard confidence interval. For example, chromosome 1 in the sequencing data is compared with the precomputed standard confidence interval of chromosome 1, and chromosome 2 with the precomputed standard confidence interval of chromosome 2. All chromosomes of the target species to which the sequencing data belong are compared in this way to determine whether any chromosome has a copy number abnormality.
Suppose the chromosome being compared is chromosome 1 of the target species to which the sequencing data belong, and the precomputed standard confidence interval of chromosome 1 is (LB1, UB1). The method then checks whether the actual signal strength of chromosome 1 in the sample to be tested lies within the interval (LB1, UB1). If it does not, the copy number of chromosome 1 is abnormal; if it does, chromosome 1 is normal and has no copy number abnormality.
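The per-chromosome comparison can be sketched as a simple interval check. The function name, the data layout, and the example values are hypothetical; the open interval (LB, UB) follows the notation used in the text.

```python
def find_abnormal_chromosomes(actual_signal, intervals):
    """Return the chromosomes whose actual signal falls outside (LB, UB).

    `actual_signal` maps chromosome names to actual signal strengths and
    `intervals` maps the same names to (LB, UB) tuples from the standard
    confidence interval list.
    """
    abnormal = []
    for chrom, value in actual_signal.items():
        lb, ub = intervals[chrom]
        if not (lb < value < ub):
            abnormal.append(chrom)
    return abnormal


# Hypothetical interval list and sample values.
interval_list = {"chr1": (-2.0, 2.0), "chr2": (-2.0, 2.0), "chr21": (-2.0, 2.0)}
sample = {"chr1": 0.3, "chr2": -1.1, "chr21": 3.4}
abnormal = find_abnormal_chromosomes(sample, interval_list)
```

Here only chr21 is reported, because its value lies above its upper bound while the other two values lie inside their intervals.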
In one embodiment, as shown in FIG. 11, the above method for detecting chromosomal copy number abnormalities further includes the following steps:
Step 1102: determine, according to the sexes of the target species, the standard confidence interval list of the chromosomes for each sex.
Step 1104: obtain the sex of the sample to be tested.
Step 1106: compare the actual signal strength of each chromosome with the standard confidence interval of the same chromosome in the standard confidence interval list for the corresponding sex of the target species.
Step 1108: when the actual signal strength of a chromosome is detected to lie outside the standard confidence interval of that chromosome for the corresponding sex, determine the chromosome to have a copy number abnormality.
The target database stores standard confidence interval lists for the chromosomes contained in the samples, built separately by sex. Taking human as an example, the target database stores the distribution table of the standard signal strengths of chromosomes in normal male samples at the preset confidence value P, which records the standard confidence interval, at the preset confidence value, of each chromosome contained in normal male samples. The target database likewise stores the distribution table for normal female samples, which records the standard confidence interval, at the preset confidence value, of each chromosome contained in normal female samples.
The target species is divided by sex. For example, when the target species is human, it is divided into male and female, and the standard confidence intervals of the chromosomes can then be determined for each sex. After the division, the chromosomes contained in individuals of each sex are known, so the standard confidence interval of each chromosome can be obtained. For example, a female of the target species has 22 autosomes and two X chromosomes, so the distribution table of the standard signal strengths of chromosomes in normal female samples at the preset confidence value P can be retrieved from the target database, and the standard confidence intervals of the 22 autosomes and of the X chromosome read from it. That is, when the sample to be tested comes from a female, the female distribution table is retrieved, and the actual signal strength of each chromosome of the female sample is compared with the standard confidence interval of the same chromosome in the female standard confidence interval list.
In this way, when the actual signal strength of a chromosome is detected to lie outside the standard confidence interval of that chromosome for the corresponding sex, the chromosome is determined to have a copy number abnormality.
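Steps 1102 to 1108 add one lookup before the comparison: the interval list is chosen by the sex of the sample. A minimal sketch follows, assuming the target database can be represented as a dictionary keyed by sex; all names and values are hypothetical.

```python
def abnormal_for_sample(sample_sex, actual_signal, tables_by_sex):
    """Compare a sample only against the interval list of its own sex."""
    table = tables_by_sex[sample_sex]
    abnormal = []
    # Iterate over the chromosomes expected for this sex, e.g. no chrY for females.
    for chrom, (lb, ub) in table.items():
        value = actual_signal[chrom]
        if not (lb < value < ub):
            abnormal.append(chrom)
    return abnormal


# Hypothetical sex-specific standard confidence interval lists.
tables_by_sex = {
    "male": {"chr21": (-2.0, 2.0), "chrX": (-3.0, -1.0), "chrY": (-4.0, -2.0)},
    "female": {"chr21": (-2.0, 2.0), "chrX": (-1.0, 1.0)},
}
female_sample = {"chr21": 2.6, "chrX": 0.2}
flagged = abnormal_for_sample("female", female_sample, tables_by_sex)
```

Keeping one table per sex means the X chromosome of a female sample is judged against female reference values rather than against a pooled interval.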
In one embodiment, as shown in FIG. 12, the following steps are further included before step 102:
Step 1202: obtain the multiple chromosomes contained in the target species.
Step 1204: classify and organize the multiple chromosomes contained in the target species.
Step 1206: obtain the preselected high-confidence genomes that satisfy a preset confidence condition.
Step 1208: determine the high-confidence genomes corresponding to each chromosome contained in the target species.
The target species is the species from which the sample to be tested originates. For example, when chromosomal copy number abnormalities in humans are to be assessed, human is the target species. The target species can be human or any other species. Genome data for target and non-target species can come from the RefSeq data set of NCBI (National Center for Biotechnology Information), a reference sequence database of biologically meaningful, non-redundant gene and protein sequences, or from other public or private genomes. The genomes of all target and non-target species are combined to form the complete set.
The complete genome of an individual contains multiple chromosomes. Therefore, once the genomes of different individuals of the target species have been obtained, the multiple chromosomes contained in the target species can be obtained. Several sets of genomes may be collected for the target species, that is, different genomes from different individuals or populations of the same species. With human as the target species, for example, the collected genomes may include genomes of European, Native North American, and Han Chinese ancestry. Each chromosome of the target species may therefore include sequences of that chromosome from different genomes. For human, chromosome 1 can include chromosome 1 of European ancestry, chromosome 1 of Native North American ancestry, and chromosome 1 of Han Chinese ancestry. The data for each chromosome of the target species are collated across genomes, forming the sequence data set of each chromosome of the target species.
From the sequence data set of each chromosome, the preselected genomes that satisfy the preset confidence condition, that is, the high-confidence genomes, are then obtained, which determines the high-confidence genomes corresponding to each chromosome contained in the target species. A high-confidence genome is a genome that satisfies the preset confidence condition. The order of these operations can also be changed. A large number of genomes can first be collected from NCBI and screened, and the genomes that satisfy the preset confidence condition selected as high-confidence genomes. The high-confidence sequence data set of each chromosome contained in each target species is then determined by collating the data of each chromosome across all high-confidence genomes of that species.
In one embodiment, the preset confidence condition is satisfied in any one of the following cases: the proportion of non-deterministic characters in the chromosome sequence is below a preset proportion threshold; the number of sequence fragments in the chromosome sequence that belong to the same chromosome is below a preset fragment threshold; or, when a chromosome sequence is aligned against all other chromosome sequences whose genetic relationship to it falls within a preset genetic distance threshold range, its average full-sequence coverage percentage among those closely related chromosome sequences is above a preset percentage value.
For a DNA genome, the proportion of non-deterministic characters is the proportion of non-ACGT characters it contains; if that proportion is too high in a DNA genome record, the record is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are all characters other than the deterministic characters A, C, G, T, and U; for protein sequences, non-deterministic characters are all characters other than the defined amino acid characters.
When the proportion of non-deterministic characters in a genome sequence is below the preset proportion threshold, the genome can be considered to satisfy the preset confidence condition. Genomes can also be screened by the number of sequence data fragments that make up one complete chromosome: if too many fragments belong to the same chromosome, the genome sequence is a suspected low-confidence genome. In other words, when the number of sequence fragments in a genome sequence that belong to the same chromosome is below the preset fragment threshold, the genome sequence data can also be considered to satisfy the preset confidence condition. Finally, a genome sequence can be aligned against all other genome sequences whose genetic relationship to it falls within the preset genetic distance threshold range, in order to determine its average full-sequence coverage percentage among those closely related genome sequences; when that average coverage percentage is above the preset percentage value, the genome can be considered to satisfy the preset confidence condition. Genetic distance is an index that measures the overall genetic difference between species (or between individuals).
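The first two screening criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the default thresholds (`max_ambiguous`, `max_fragments`, `min_coverage`), and the idea of passing a precomputed `mean_coverage` value are hypothetical and do not come from the source, which leaves all thresholds as preset values and does not specify how the alignment coverage is computed.

```python
def ambiguous_fraction(seq, alphabet="ACGT"):
    """Fraction of characters in seq that are outside the deterministic alphabet."""
    if not seq:
        return 1.0  # assumption: an empty record is treated as fully unreliable
    allowed = set(alphabet)
    return sum(ch not in allowed for ch in seq.upper()) / len(seq)


def confidence_checks(fragments, max_ambiguous=0.01, max_fragments=50,
                      mean_coverage=None, min_coverage=0.90):
    """Evaluate each screening criterion separately for one chromosome.

    fragments     -- list of sequence fragments that together form the chromosome
    mean_coverage -- optional, externally computed average coverage (0..1)
                     against closely related sequences
    All threshold defaults are illustrative placeholders.
    """
    checks = {
        "ambiguous_ok": ambiguous_fraction("".join(fragments)) < max_ambiguous,
        "fragments_ok": len(fragments) < max_fragments,
    }
    if mean_coverage is not None:
        checks["coverage_ok"] = mean_coverage > min_coverage
    return checks
```

The checks are returned individually rather than combined, so the caller can decide whether any one criterion or all of them must hold.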
In one embodiment, a k-mer among the specific k-mers satisfies the following two conditions: its occurrence count in the genome occurrence index table of the corresponding chromosome satisfies a first preset error condition; and its occurrence count in the genome occurrence index table of the corresponding chromosome, together with its occurrence count in the genome occurrence index table of the complete set, satisfies a second preset error condition. The genome occurrence index table of a chromosome records, for each k-mer, the number of genomes among those corresponding to that chromosome that contain the k-mer; the genome occurrence index table of the complete set records, for each k-mer contained in a chromosome of the target species, the number of genomes in the complete set that contain the k-mer.
In the target database, each chromosome has its own characteristic target sequence set, and the specific k-mers in that set are the k-mers that satisfy a preset specificity condition. The preset specificity condition comprises the first preset error condition and the second preset error condition; a k-mer that satisfies both is considered to satisfy the preset specificity condition and can be used as a specific k-mer. More precisely, the occurrence count of the k-mer in the genome occurrence index table of the chromosome must satisfy the first preset error condition, and that count together with the occurrence count of the k-mer in the genome occurrence index table of the complete set must satisfy the second preset error condition. The complete set is the collection of all high-confidence genomes that have been gathered. It contains the genomes of each target species as well as the genomes of non-target species, for example high-confidence genomes of pathogenic bacteria, commensal bacteria, probiotics, humans, animals, and plants.
The genome occurrence index table of a chromosome records, for each k-mer, the number of genomes among those corresponding to that chromosome that contain the k-mer. The count recorded for each k-mer in the genome occurrence index table of the complete set is the number of genomes in the complete set in which the k-mer has appeared. If the k-mer appears several times in the same genome, it is still counted only once.
The selection of specific k-mers introduces two parameters, the first preset error condition and the second preset error condition, and thereby tolerates a limited degree of non-specificity in the specific k-mers. Without these two parameters no non-specificity could be tolerated, and it would often be very difficult to find any specific k-mer for a given chromosome. By selecting specific k-mers with a tolerated error and building the characteristic target sequence set from them, specific targets that can represent the chromosome can be found with high probability. Consequently, to determine which chromosomes are present in the data to be detected, the data only need to be compared against the predetermined characteristic target sequence sets of the chromosomes of the corresponding target species. This reduces the comparison space, shortens the analysis time, and improves detection efficiency.
In one embodiment, the first preset error condition is: the ratio of the occurrence count in the genome occurrence index table of the chromosome to the number of genomes contained in that chromosome, plus a first threshold, is greater than or equal to 1.
That is, the first preset error condition requires that the ratio of the occurrence count recorded in the genome occurrence index table of the chromosome to the number of genomes corresponding to the chromosome, plus the first threshold, be greater than or equal to 1. Suppose the chromosome has N corresponding genomes, a given k-mer has an occurrence count of C1 in the genome occurrence index table of that chromosome, and the first threshold is P1; the first preset error condition is then C1/N + P1 ≥ 1. The first threshold P1 represents the acceptable error probability and may be any value between 0 and 1; it can be set by the practitioner according to the actual project.
In one embodiment, the first threshold is less than 5%.
The first threshold is the acceptable error probability. It may be any value between 0 and 1, and may be set to a value below 5%.
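The first condition can be read as a lower bound on C1: since C1/N + P1 ≥ 1 is equivalent to C1 ≥ N(1 − P1), at most a fraction P1 of the chromosome's genomes may lack the k-mer. The sketch below illustrates this; the function names are hypothetical, and the use of exact rational arithmetic (with the threshold passed as a string such as "0.04") is an assumption made here to avoid floating-point surprises at the boundary.

```python
import math
from fractions import Fraction


def meets_first_condition(c1, n, p1):
    """Return True if C1/N + P1 >= 1.

    c1 -- number of the chromosome's genomes that contain the k-mer
    n  -- total number of genomes for the chromosome (must be > 0)
    p1 -- first threshold, as a string like "0.04" or a Fraction
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return Fraction(c1, n) + Fraction(p1) >= 1


def min_required_occurrences(n, p1):
    """Smallest C1 that satisfies the condition, i.e. ceil(N * (1 - P1))."""
    return math.ceil(n * (1 - Fraction(p1)))
```

For example, with N = 100 and P1 = 0.04 a k-mer must appear in at least 96 genomes, whereas with N = 10 it must appear in all 10, because 9/10 + 0.04 is still below 1.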
In one embodiment, the second preset error condition is: the ratio of the occurrence count in the genome occurrence index table of the chromosome to the occurrence count in the genome occurrence index table of the complete set, plus a second threshold, is greater than or equal to 1.
That is, the second preset error condition requires that the ratio of the occurrence count recorded in the genome occurrence index table of the chromosome to the occurrence count in the genome occurrence index table of the complete set, plus the second threshold, be greater than or equal to 1. Suppose a given k-mer has an occurrence count of C1 in the genome occurrence index table of the chromosome and an occurrence count of C2 in the genome occurrence index table of the complete set, and the second threshold is P2; the second preset error condition is then C1/C2 + P2 ≥ 1. Like the first threshold, the second threshold represents the acceptable error probability and may be any value between 0 and 1; P2 can likewise be set by the practitioner according to the actual project.
In one embodiment, the second threshold is less than 5%.
Like the first threshold, the second threshold is the acceptable error probability. It may be any value between 0 and 1, and may be set to a value below 5%. The first and second thresholds may be equal or different.
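The second condition can be viewed from the off-target side: C1/C2 + P2 ≥ 1 is equivalent to (C2 − C1)/C2 ≤ P2, so at most a fraction P2 of the genomes that contain the k-mer may lie outside the chromosome in question. A minimal sketch follows; the function names are hypothetical, and the check that C1 does not exceed C2 assumes that the chromosome's own genomes are part of the complete set.

```python
from fractions import Fraction


def off_target_fraction(c1, c2):
    """Share of k-mer-containing genomes that do not belong to the chromosome."""
    if c2 <= 0 or c1 > c2:
        raise ValueError("expected 0 <= c1 <= c2 and c2 > 0")
    return Fraction(c2 - c1, c2)


def meets_second_condition(c1, c2, p2):
    """Return True if C1/C2 + P2 >= 1.

    c1 -- occurrence count in the chromosome's genome occurrence index table
    c2 -- occurrence count in the complete set's genome occurrence index table
    p2 -- second threshold, as a string like "0.04" or a Fraction
    """
    return off_target_fraction(c1, c2) <= Fraction(p2)
```

With C1 = 48, C2 = 50, and P2 = 0.04, the off-target fraction is 2/50 = 0.04 and the condition holds exactly at the boundary.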
In one embodiment, before step 102 above, the method further includes the following steps: generating a genome occurrence index table for each chromosome, the table recording, for each k-mer, the number of genomes among those corresponding to the chromosome that contain the k-mer; and storing the genome occurrence index table in the characteristic target sequence set corresponding to the chromosome.
A genome is all of the genetic information in an organism, stored in the form of nucleotide sequences. The genome is the sum of the genetic material in one complete unit of an organism (for example an individual animal or plant, an animal or plant cell, or an individual bacterium). The complete genome of each individual may contain multiple chromosomes, and the genome of each chromosome may contain multiple k-mers. The term "genome of a chromosome", which is commonly used in the field, refers here to the sum of all sequences contained in one complete chromosome. Under this definition, the genome occurrence index table of each chromosome records in how many of the genomes corresponding to that chromosome each of its k-mers has appeared; that is, for each k-mer it records the number of genomes among those corresponding to the chromosome that contain the k-mer.
The genome occurrence index table therefore records in how many of the genomes corresponding to its chromosome each k-mer has appeared. If a k-mer appears more than once in the same genome, it is still counted only once in the table. Once these counts have been obtained for every k-mer, the genome occurrence index table of each chromosome can be built. If there are M chromosomes in total, M corresponding genome occurrence index tables are generated. After the table for each chromosome has been built, it can be stored in the characteristic target sequence set corresponding to that chromosome, that is, in the target database. Whenever a genome occurrence index table is needed later, it can be retrieved from the target database, which improves detection efficiency.
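The "count each genome at most once" rule can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch deliberately ignores reverse complements and non-deterministic characters to stay small.

```python
from collections import Counter


def distinct_kmers(seq, k):
    """Set of distinct k-mers in one sequence (duplicates collapse)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def genome_occurrence_table(genomes, k):
    """Map each k-mer to the number of genomes that contain it.

    genomes -- iterable of sequence strings, one per genome of the chromosome
    A k-mer repeated inside one genome still adds only 1 for that genome,
    because each genome contributes a set of k-mers.
    """
    table = Counter()
    for seq in genomes:
        table.update(distinct_kmers(seq.upper(), k))
    return table
```

For the two toy genomes `"ACGACG"` and `"ACGTTT"` with k = 3, the k-mer `ACG` gets a count of 2 even though it occurs three times in total.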
In one embodiment, before the sequencing data of the sample are obtained, the method further includes: generating the genome occurrence index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer; and storing the genome occurrence index table of the complete set in the target database.
The target database stores the characteristic target sequence set corresponding to each chromosome. The complete set contains all collected high-confidence genomes, that is, both the high-confidence genomes of the target species corresponding to the data to be detected and the high-confidence genomes of target species that do not correspond to the data to be detected. Once it is known in how many genomes of the complete set each k-mer of each chromosome has appeared, the genome occurrence index table of the complete set can be generated. This table records, for each k-mer contained in a chromosome, the number of genomes in the complete set that contain the k-mer.
The genome occurrence index table of the complete set therefore records in how many genomes of the complete set each k-mer has appeared; the unit being counted is the number of genomes, not the number of occurrences of the k-mer. If a k-mer appears more than once in the same genome, it is still counted only once in the table. Once these counts have been obtained for every k-mer, the genome occurrence index table of the complete set can be built. This table differs from the genome occurrence index tables of the individual chromosomes: each chromosome has its own table, whereas only one table is generated for the complete set, covering all of the data. After the genome occurrence index table of the complete set has been stored, it can be retrieved from the target database whenever it is needed while processing data to be detected, which improves detection efficiency.
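The relationship between the M per-chromosome tables and the single table for the complete set can be sketched as follows. The names `presence_table` and `build_tables` are hypothetical. The sketch assumes that every sequence is treated as its own genome and belongs to exactly one group (one chromosome of the target species, or a non-target species); under that assumption the table for the complete set is simply the sum of the per-group tables.

```python
from collections import Counter


def presence_table(genomes, k):
    """Number of genomes in `genomes` that contain each k-mer (once per genome)."""
    table = Counter()
    for seq in genomes:
        table.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return table


def build_tables(groups, k):
    """Return (per_group_tables, complete_set_table).

    groups -- dict mapping a label (e.g. "chr1", "other") to its genome sequences
    """
    per_group = {name: presence_table(seqs, k) for name, seqs in groups.items()}
    complete = Counter()
    for table in per_group.values():
        complete.update(table)  # Counter.update adds the counts together
    return per_group, complete
```

A k-mer that occurs in two chromosome 1 genomes and in one unrelated genome thus has a count of 2 in the chromosome 1 table and 3 in the table for the complete set.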
In one embodiment, after step 106 above, the method further includes: generating, from the actual occurrence counts, a record table of the actual occurrence counts of the specific k-mers corresponding to each chromosome.
The target database stores the specific k-mers of each chromosome. Once the data to be detected have been obtained, they can be compared against each specific k-mer of each chromosome to obtain the actual number of times each specific k-mer occurs in the data to be detected. From these counts, a record table of actual occurrence counts of specific k-mers can be generated for each chromosome. If the target database contains M chromosomes in total, M corresponding record tables are generated, each recording the actual number of occurrences in the sequencing data of the specific k-mers of one chromosome.
Figure 13 shows the record table of actual occurrence counts of specific k-mers for one particular chromosome. The leftmost column lists the specific k-mers contained in chromosome X, and the second column lists the actual number of occurrences of each of those specific k-mers in the sequencing data, namely C1, C2, and so on. Generating this record table from the actual occurrence counts and storing it for later retrieval improves detection efficiency.
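A record table of this kind can be sketched as a simple dictionary. The function name and the representation of the sequencing data as a list of read strings are assumptions for illustration. Unlike the genome occurrence index tables, every occurrence in the reads is counted here.

```python
def count_specific_kmers(reads, specific_kmers, k):
    """Count how often each specific k-mer of one chromosome occurs in the reads.

    reads          -- iterable of read sequences from the sample
    specific_kmers -- the chromosome's specific k-mers, all of length k
    Returns a dict {k-mer: actual occurrence count}; k-mers never seen keep 0.
    """
    counts = dict.fromkeys(specific_kmers, 0)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts
```

Calling the function once per chromosome, each time with that chromosome's specific k-mers, yields the M record tables described above.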
In one embodiment, as shown in Figure 14, a method for detecting chromosomal copy number abnormalities is provided, including the following steps:
Step 1402: establish the characteristic target sequence set corresponding to each chromosome.
As shown in Figure 15, step 1402 includes:
Step 1402A: collect and organize high-confidence genomes.
To establish the characteristic target sequence set corresponding to each chromosome, high-confidence genome data must first be collected and organized. The high-confidence genomes include genomes of the target species corresponding to the data to be detected as well as genomes that do not belong to that target species, for example high-confidence genomes of commensal bacteria, probiotics, humans, animals, and plants. High-confidence genomes can be taken from the NCBI RefSeq data set or from other public or private sources of high-confidence genomes.
High-confidence genomes can be identified and screened in the following three ways:
1. Screening by the proportion of non-deterministic characters in a genome record. For a DNA genome, for example, the proportion of non-deterministic characters is the proportion of non-ACGT characters it contains; if that proportion is too high in a DNA genome record, the record is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are all characters other than the deterministic characters A, C, G, T, and U; for protein sequences, non-deterministic characters are all characters other than the defined amino acid characters.
2. Screening by the number of genome data fragments that make up one complete chromosome. If too many fragments belong to the same chromosome, the genome is a suspected low-confidence genome.
3. Screening by whole-genome alignment. The genome is aligned against multiple genomes that are closely related to it (for example, whose genetic distance is below a certain threshold) to determine its average whole-genome coverage percentage among those closely related genomes, and genomes are then screened by this value: a genome whose average coverage percentage is too low is a suspected low-completeness, and therefore low-confidence, genome. Genetic distance is an index that measures the overall genetic difference between species (or between individuals).
Step 1402B: determine the high-confidence sequence data set of each chromosome of the target species corresponding to the data to be detected.
Several sets of genomes may have been collected for the target species in step 1402A, that is, different genomes from different individuals or populations of the same target species. Taking humans as the target species, the collected genomes may include genomes of European, North American Indian, and Han Chinese origin. Each chromosome of the target species may therefore comprise sequences of that chromosome drawn from different genomes. For example, human chromosome 1 may include chromosome 1 of European origin, chromosome 1 of North American Indian origin, and chromosome 1 of Han Chinese origin.
Here, the data for each identical chromosome across all high-confidence genomes of the target species are grouped together to form the high-confidence sequence data set of each chromosome of the target species. The high-confidence sequence data sets of all chromosomes of the target species and the high-confidence sequence data sets of all non-target species are then pooled to form the complete set. In other words, the complete set is obtained by pooling the high-confidence sequence data sets of all chromosomes of the target species corresponding to the data to be detected with the high-confidence sequence data sets of all chromosomes of the other species.
After the target species corresponding to the data to be detected has been determined, the normal ratio of copy numbers among the chromosomes of that species is determined, and autosomes are distinguished from sex chromosomes. As shown in Figure 16, taking humans as an example, a normal human genome contains 23 pairs of chromosomes, 46 in total. Chromosomes 1 to 22 are autosomes, each with a copy number of 2. The X and Y chromosomes are sex chromosomes: a normal male has one X chromosome and one Y chromosome, while a normal female has two X chromosomes and no Y chromosome. Copy number refers to the number of times a given gene or a specific DNA sequence appears in the haploid genome. The information in Figure 16 is prepared only once, when the target species corresponding to the data to be detected is determined, and is then looked up each time a sample to be tested is analyzed.
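The reference information described for Figure 16 amounts to a small lookup table. A minimal sketch for the human case follows; the function name and the `"male"`/`"female"` labels are hypothetical, and only the copy numbers stated above are encoded.

```python
def expected_copy_numbers(sex):
    """Normal copy number of each human chromosome for the given sex.

    sex -- "male" or "female" (labels chosen for this sketch)
    Autosomes 1-22 have copy number 2; X and Y depend on sex.
    """
    expected = {str(i): 2 for i in range(1, 23)}
    if sex == "male":
        expected.update({"X": 1, "Y": 1})
    elif sex == "female":
        expected.update({"X": 2, "Y": 0})
    else:
        raise ValueError("sex must be 'male' or 'female'")
    return expected
```

In both cases the values sum to 46, matching the 46 chromosomes of a normal human genome.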
Step 1402C: generate the genome occurrence index table of the complete set.
The genome occurrence index table of the complete set is generated from the complete set. It records in how many genomes of the complete set each k-mer contained in the complete set has appeared. A k-mer is a genome sequence of length k; k can be chosen freely and is generally set between 11 and 32. If a type of genome data has a different deterministic characters in total, then for a given k there are a^k possible different k-mers.
For example, DNA has four different deterministic characters (A, C, G, T), so for a given k there are 4^k possible different k-mers in DNA genome data. A genome of length n can contain at most n-k+1 different k-mers. Because a genome contains repeated regions, however, the number of different k-mers in a genome of n characters is generally far smaller than n-k+1. With ordinary k-mer counting, a particular k-mer may therefore occur several times in a given genome and be counted several times. The genome occurrence index table of the complete set built here differs from that approach: if a k-mer occurs more than once in a genome, it is still counted only once in the table. The count recorded for a k-mer in the resulting genome occurrence index table therefore represents the number of genomes in the complete set in which the k-mer has appeared.
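The three quantities in this paragraph (the a^k possible k-mers, the upper bound n-k+1, and the number of distinct k-mers actually present) can be computed directly. The function name and the returned dictionary keys are hypothetical.

```python
def kmer_stats(seq, k, alphabet_size=4):
    """Compare possible, maximal, and actually observed distinct k-mers.

    possible    -- alphabet_size ** k, e.g. 4 ** k for DNA
    upper_bound -- n - k + 1 windows in a sequence of length n (0 if n < k)
    distinct    -- number of different k-mers found in seq
    """
    windows = max(len(seq) - k + 1, 0)
    distinct = len({seq[i:i + k] for i in range(windows)})
    return {
        "possible": alphabet_size ** k,
        "upper_bound": windows,
        "distinct": distinct,
    }
```

For the repetitive toy sequence `"ACGACGACG"` with k = 3 there are 64 possible k-mers and 7 windows, but only 3 distinct k-mers, which illustrates why repeats push the distinct count below n-k+1.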
If DNA or RNA genome sequences are used, then, because nucleic acid sequences are reverse complementary, whenever a k-mer A appears its reverse complement A' should also be regarded as having appeared, so both A and A' should be recorded in the table. In the subsequent steps, for k-mers of DNA or RNA sequences, whenever an operation is said to be applied to a k-mer A, the same operation is by default also applied to its reverse complement A'.
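Recording a k-mer together with its reverse complement can be sketched as below for DNA. The function names are hypothetical, and the sketch assumes upper-case sequences containing only A, C, G, and T.

```python
# Translation table for the DNA complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of an upper-case DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmers_with_reverse_complements(seq, k):
    """Distinct k-mers of seq plus the reverse complement of each one."""
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        found.add(kmer)
        found.add(reverse_complement(kmer))
    return found
```

A common alternative is to store only one canonical form per pair (for example the lexicographically smaller of A and A'), which halves the table size; the source describes recording both, so the sketch does the same.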
In addition, each chromosome of the target genome can be handled here as if it were a species: every individual sequence in the high-confidence data set of a chromosome of the target species that completely represents that chromosome is treated as a separate genome. For example, with humans as the target species, the high-confidence data set of human chromosome 1 may contain three records, namely the chromosome 1 sequence of European origin, the chromosome 1 sequence of North American Indian origin, and the chromosome 1 sequence of Han Chinese origin. Each of these three chromosome 1 sequences is then treated as a complete, independent genome when counting for the k-mer genome occurrence index table.
Step 1402D: generate the genome occurrence index table corresponding to each chromosome.
The genome occurrence index table of a chromosome differs from the genome occurrence index table of the complete set described in step 1402C. The table of the complete set covers the whole complete set, recording in how many genomes of the complete set a k-mer has appeared, whereas the table of a chromosome is specific to that chromosome, recording in how many of the genomes corresponding to that chromosome each of its k-mers has appeared.
Step 1402E: generate the specific k-mer table corresponding to each chromosome.
The specific k-mer table of a chromosome records the k-mers of that chromosome that satisfy the preset specificity conditions, i.e. the specific k-mers. To be selected as a specific k-mer, a k-mer must satisfy the following two conditions:
1. If the high-confidence data set of the chromosome contains N genomes and a k-mer has an occurrence count of C_1 in the genome occurrence index table of that chromosome, the condition C_1/N + P_1 ≥ 1 must hold. That is, the ratio of the k-mer's occurrence count in the chromosome's index table to the number of genomes in the chromosome's high-confidence data set, plus the first threshold, must be greater than or equal to 1. The first threshold P_1 is usually less than 5%.
2. If a k-mer has an occurrence count of C_1 in the genome occurrence index table of the chromosome and an occurrence count of C_2 in the genome occurrence index table of the complete set, the condition C_1/C_2 + P_2 ≥ 1 must hold. That is, the ratio of the k-mer's occurrence count in the chromosome's index table to its occurrence count in the complete-set index table, plus the second threshold, must be greater than or equal to 1. The second threshold P_2 is usually less than 5%.
The first threshold P_1 and the second threshold P_2 may or may not be equal. Introducing these two parameters when selecting specific k-mers tolerates a bounded error rate, i.e. it allows the specific k-mers to be non-specific to a limited degree. Without the two parameters no such tolerance is possible, and for a given chromosome it is often very hard to find any specific k-mer at all.
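The two conditions can be sketched as a simple filter. This is a minimal illustration, not the patented implementation: the function name, the dictionary layout of the two index tables, and the default thresholds of 0.05 are assumptions made for the example.

```python
def select_specific_kmers(chrom_counts, complete_counts, n_genomes, p1=0.05, p2=0.05):
    """Return the k-mers of one chromosome that satisfy both specificity conditions.

    chrom_counts:    k-mer -> C_1, number of this chromosome's genomes containing it
    complete_counts: k-mer -> C_2, number of genomes of the complete set containing it
    n_genomes:       N, number of genomes in the chromosome's high-confidence data set
    p1, p2:          first and second thresholds (hypothetical defaults)
    """
    specific = []
    for kmer, c1 in chrom_counts.items():
        c2 = complete_counts[kmer]
        cond1 = c1 / n_genomes + p1 >= 1  # condition 1: present in nearly all N genomes
        cond2 = c1 / c2 + p2 >= 1         # condition 2: rarely present elsewhere
        if cond1 and cond2:
            specific.append(kmer)
    return specific
```

With N = 100, a k-mer seen in 97 genomes of the chromosome and 98 genomes of the complete set passes both conditions, while one seen in only 90 genomes of the chromosome fails the first.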
For a chromosome for which n specific k-mers have been found, if the P_1 cases in condition (1) of this step are assumed to be randomly distributed among the genomes corresponding to that chromosome, the probability of a false negative for the chromosome is at most P_1^n. For sufficiently large n, the likelihood of a false negative is extremely small. Likewise, if n' specific k-mers of the chromosome are actually detected, and the P_2 cases in condition (2) of this step are assumed to be randomly distributed among the other genomes that do not belong to this chromosome, the probability of a false positive for the chromosome is at most P_2^n' (P_2 raised to the power n'). For sufficiently large n', the likelihood of a false positive is extremely small. The false negative rate is the proportion of positives that yield a negative test result, i.e. the conditional probability of a negative test result given that the condition being looked for is present.
The estimates above treat the specific k-mers as independent of one another. Therefore, when calculating the false positive probability, an independence correction can be applied to the k-mers. For any two k-mers A and B in the specific k-mer list, if at least j characters coincide at their ends (for example, the last j characters of A are identical to the first j characters of B), A and B are considered end-overlapping. Here j is generally greater than 5 and at most k-1, i.e. 5 < j ≤ (k-1). Note that for DNA or RNA sequences, because nucleic acid sequences are reverse-complementary, the end-overlap check for two k-mers A and B in the specific k-mer list should cover A with B, the reverse complement A' of A with B, A with the reverse complement B' of B, and A' with B'.
All k-mers in the specific k-mer list are copied into a non-overlapping specific region list in its initial state. If k-mers A and B in that list are confirmed to be end-overlapping, the two are replaced by the smallest region C that covers both (a specific region). The check is repeated for every pair of specific k-mers or specific regions in the list, each time replacing an end-overlapping pair with the smallest region that covers both, until no end-overlapping specific k-mers or specific regions remain. The specific k-mers and specific regions that remain make up the final non-overlapping specific region list; each entry retained in the final state is one non-overlapping specific region. When calculating false positives and false negatives, multiple k-mers belonging to the same non-overlapping specific region contribute the value of P_1 or P_2 only once. If the target species has M chromosomes, M corresponding chromosome-specific k-mer tables are built here.
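The merging procedure can be sketched as follows. This is one possible reading of the text, assuming DNA over the alphabet A, C, G, T; the function names and the greedy pair search are choices made for the example and are not taken from the source.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_if_end_overlapping(a, b, j):
    """If a suffix of a (at least j characters) equals a prefix of b,
    return the smallest region covering both; otherwise return None."""
    for n in range(min(len(a), len(b)), j - 1, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None


def find_merge(regions, j):
    """Find the first end-overlapping pair, checking both strands of each entry."""
    for x, a in enumerate(regions):
        for y, b in enumerate(regions):
            if x == y:
                continue
            for cand_a in (a, reverse_complement(a)):
                for cand_b in (b, reverse_complement(b)):
                    merged = join_if_end_overlapping(cand_a, cand_b, j)
                    if merged is not None:
                        return x, y, merged
    return None


def merge_specific_regions(kmers, j):
    """Build the final non-overlapping specific region list."""
    regions = list(dict.fromkeys(kmers))  # initial state: every specific k-mer
    while True:
        hit = find_merge(regions, j)
        if hit is None:
            return regions
        x, y, merged = hit
        # Replace the overlapping pair with the region that covers both.
        regions = [r for i, r in enumerate(regions) if i not in (x, y)] + [merged]
```

Under this sketch, the number of entries in the returned list, rather than the raw number of specific k-mers, would serve as the exponent when bounding the false negative or false positive probability.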
Step 1402F: generate the specific k-mer copy number list corresponding to each chromosome.
For the high-confidence data set of each chromosome of the target species corresponding to the data to be detected, the number of occurrences of each selected specific k-mer is counted: every actual occurrence of a specific k-mer in any genome of the chromosome's high-confidence data set is recorded. Then, using the occurrence count C_m of the least frequent specific k-mer of the chromosome, the copy number of each specific k-mer of the chromosome is calculated. If the target species has M chromosomes in total, M corresponding chromosome-specific k-mer copy number lists are built here. The copy number of a specific k-mer is a value greater than or equal to 1.
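A minimal sketch of the copy number calculation follows, using the definition given later for the copy number acquisition module (a k-mer's occurrence count divided by the count C_m of the least frequent specific k-mer of the same chromosome). The function name and the dictionary input are assumptions made for the example.

```python
def specific_kmer_copy_numbers(occurrences):
    """Map each specific k-mer of one chromosome to its copy number.

    occurrences: k-mer -> total occurrences across all genomes of the
                 chromosome's high-confidence data set
    """
    c_m = min(occurrences.values())  # count of the least frequent specific k-mer
    return {kmer: count / c_m for kmer, count in occurrences.items()}
```

The least frequent k-mer always receives a copy number of exactly 1, so every value in the list is at least 1, as the text requires.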
After the specific k-mer copy number list of every chromosome has been generated, consider any two specific k-mers, k-merA and k-merB, from different chromosomes. If their occurrence counts in a data set from a normal individual of the target species are Ca and Cb, and Fa and Fb are their values in the specific k-mer copy number lists of their respective chromosomes, then the ratio of Ca/Fa to Cb/Fb should equal the ratio of the copy numbers of the two chromosomes given in Table 16.
The process of creating the set of characteristic target sequences for each chromosome can be referred to collectively as module A. Module A can be run from time to time to keep the characteristic target sequence set of each chromosome, i.e. the target database, up to date; for example, it can be run whenever the reference genome data is updated. Module A does not need to be run or updated each time an actual sample is analyzed.
Step 1404: calculate the actual signal intensity of each chromosome contained in the target sample corresponding to the data to be detected.
As shown in FIG. 17, step 1404 includes:
Step 1404A: obtain the data to be detected.
Step 1404B: obtain the specific k-mer lists and the specific k-mer copy number lists.
Step 1404C: obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome.
The data to be detected is obtained and the target species it corresponds to is determined. The specific k-mer list and specific k-mer copy number list of each chromosome of the target species, generated in step 1402, are then loaded. If the target species has M chromosomes in total, M specific k-mer lists and M specific k-mer copy number lists, one pair per chromosome, need to be loaded. Next, the actual number of occurrences in the data to be detected is obtained for the specific k-mers of each chromosome. Each count can be recorded at the corresponding position in the record table of actual specific k-mer occurrences of the chromosome concerned; in other words, a record table of actual specific k-mer occurrences is generated for each chromosome from the counts observed in the data to be detected.
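Counting the specific k-mers in the data to be detected could look like the sketch below. It assumes the data is a list of read strings and counts exact matches on the forward strand only; whether reverse complements are also counted is not stated in the text, and the function name is hypothetical.

```python
def count_specific_kmers(reads, specific_kmers, k):
    """Count how often each specific k-mer occurs in the sequencing reads.

    reads:          iterable of read sequences (strings)
    specific_kmers: the specific k-mers of one chromosome
    k:              k-mer length
    """
    counts = dict.fromkeys(specific_kmers, 0)
    for read in reads:
        # Slide a window of length k over the read.
        for start in range(len(read) - k + 1):
            window = read[start:start + k]
            if window in counts:
                counts[window] += 1
    return counts
```

Because only windows that match a specific k-mer are tallied, memory use depends on the size of the specific k-mer list and not on the size of the genome.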
Step 1404D: calculate the single-copy signal intensity E of each chromosome.
FIG. 18 shows the single-copy signal intensity calculation table of a particular chromosome. For a given chromosome, the data in the specific k-mer copy number list and in the record table of actual specific k-mer occurrences give, for any specific k-mer i belonging to the chromosome, its actual number of occurrences C'_i in the data set and its copy number F_i. The adjusted occurrence count of the k-mer is then C'_i/F_i. The mean of the adjusted occurrence counts over all specific k-mers of the chromosome is the single-copy signal intensity E of the chromosome.
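The calculation of E can be illustrated with a short sketch. The function name and dictionary inputs are assumptions; a specific k-mer that never appears in the data is treated as having zero occurrences, which is also an assumption of the example.

```python
from statistics import mean


def single_copy_signal_intensity(actual_counts, copy_numbers):
    """Mean of C'_i / F_i over all specific k-mers of one chromosome.

    actual_counts: k-mer -> C'_i, occurrences in the data to be detected
    copy_numbers:  k-mer -> F_i, copy number from the chromosome's list
    """
    adjusted = [actual_counts.get(kmer, 0) / f for kmer, f in copy_numbers.items()]
    return mean(adjusted)
```

Dividing by F_i first means that a k-mer occurring several times per chromosome copy does not inflate the average.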
Once the single-copy signal intensity E of each chromosome has been calculated, the values for every chromosome of the target species can be recorded and stored in a single-copy signal intensity record table such as the one shown in FIG. 19.
Step 1404E: calculate the actual signal intensity S of each chromosome.
Once the single-copy signal intensity E of each chromosome has been calculated, the mean M and the variance SD of all the single-copy signal intensities E can be computed. The actual signal intensity of chromosome i is S_i = (E_i - M)/SD. FIG. 20 shows the calculation table of the actual signal intensity of each chromosome. For chromosome 1, for example, S_1 = (E_1 - M)/SD; the other chromosomes are calculated in the same way.
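The normalization can be sketched as below. The text calls SD the variance; this example assumes SD is used as a standard deviation, as the abbreviation and the form of the formula suggest, and uses the population standard deviation. Both points are assumptions of the sketch, as is the function name.

```python
from statistics import mean, pstdev


def actual_signal_intensities(single_copy):
    """Compute S_i = (E_i - M) / SD for every chromosome.

    single_copy: chromosome name -> single-copy signal intensity E_i
    """
    m = mean(single_copy.values())
    sd = pstdev(single_copy.values())  # assumption: SD taken as a standard deviation
    return {chrom: (e - m) / sd for chrom, e in single_copy.items()}
```

Because every E_i is centered and scaled by statistics of the same sample, S is largely insensitive to the overall sequencing depth of that sample.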
Step 1406: calculate, from standard detection samples, the list of standard confidence intervals corresponding to the chromosomes of the target species.
After a large number of standard detection samples have been obtained, the actual signal intensity of each chromosome in each standard detection sample can be calculated as in step 1404. To distinguish the target species from the standard detection samples, the actual signal intensity of a standard detection sample is called the standard signal intensity. In this way the standard signal intensity of every chromosome in every standard detection sample is obtained, and the values for all standard detection samples can be recorded in a table. The records can further be kept separately by sex, producing one standard signal intensity record table for the chromosomes of normal male samples and one for the chromosomes of normal female samples.
Statistics are computed over the standard signal intensities of each chromosome across all standard detection samples, giving for each chromosome the mean M' and the variance SD' of its standard signal intensity distribution. Suppose the standard detection samples are human and there are 100 of them. There are then 100 instances of chromosome 1, 100 of chromosome 2, ..., and 100 of chromosome 22. The numbers of X and Y sex chromosomes, however, depend on the sexes of the 100 individuals, so to obtain enough X and Y chromosomes a minimum number of standard detection samples of each sex should also be required. Chromosome 1 thus has 100 standard signal intensities, from which its mean and variance can be calculated; the mean and variance of the standard signal intensity of every other chromosome are calculated in the same way.
The standard confidence interval, i.e. the interval of standard signal intensity, of each chromosome in the standard detection samples at a preset confidence value can then be determined. That is, the two boundary values LB and UB of the confidence interval at confidence P are obtained, where LB is the lower bound and UB is the upper bound of the interval. P is generally no less than 0.95 and may approach 1 arbitrarily closely without equaling 1; in practice it can be adjusted as needed. For example, P is 0.95 for 95% confidence and 0.999 for 99.9% confidence. Once the standard confidence interval of each chromosome in the standard detection samples has been determined at the preset confidence value, a distribution table of the P-confidence boundary values of the actual signal intensity of the chromosomes is obtained for each of the two sexes of the target species. In statistical terms, computing the standard signal intensities of the chromosomes over a large number of standard detection samples yields an estimate of the standard confidence interval of each chromosome of the target species at the preset confidence value P, that is, an estimate of the range of actual signal intensity that each chromosome has in normal samples at that confidence value.
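The text does not say how LB and UB are derived from the mean and spread. One common choice, shown here purely as an assumption, is to treat the standard signal intensities of a chromosome as normally distributed and take a symmetric two-sided interval around the mean; the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def standard_confidence_interval(standard_intensities, p=0.95):
    """Return (LB, UB) for one chromosome at confidence p.

    Assumes normally distributed standard signal intensities; this
    assumption is not stated in the source.
    """
    m = mean(standard_intensities)
    sd = stdev(standard_intensities)       # sample standard deviation
    z = NormalDist().inv_cdf((1 + p) / 2)  # two-sided critical value
    return m - z * sd, m + z * sd
```

Raising p widens the interval, which is consistent with the trade-off between false positives and false negatives described for the detection step.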
In non-invasive prenatal testing (NIPT), where fetal chromosome copy number abnormalities are inferred by sequencing fetal DNA in maternal peripheral blood, the fetal material in the maternal peripheral blood is mixed with maternal material. The standard detection samples may therefore also be peripheral blood samples from normal mothers carrying normal babies, including peripheral blood samples from normal mothers carrying a normal boy, a normal girl, normal twin boys, normal twin girls, and normal boy-girl twins. Accordingly, when the distribution table of P-confidence boundary values is built, the table can be adjusted to the type of standard detection sample.
Step 1408: detect whether the data to be detected contains a chromosome with a copy number abnormality.
After the actual signal intensity of each chromosome of the target species corresponding to the data to be detected has been calculated, it can be compared with the standard confidence interval of the corresponding chromosome at the preset confidence value P obtained in step 1406. For example, the actual signal intensity of chromosome 1 in the data to be detected is compared with the standard confidence interval of chromosome 1. If the actual signal intensity of chromosome 1 lies outside the standard confidence interval of chromosome 1, chromosome 1 can be judged to have a copy number abnormality. Otherwise, chromosome 1 can be judged to have no copy number abnormality.
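The comparison amounts to an interval check per chromosome, sketched below with hypothetical names and with the intervals stored as (LB, UB) pairs.

```python
def abnormal_chromosomes(actual, intervals):
    """List the chromosomes whose actual signal intensity lies outside
    their standard confidence interval.

    actual:    chromosome name -> actual signal intensity S
    intervals: chromosome name -> (LB, UB) standard confidence interval
    """
    flagged = []
    for chrom, s in actual.items():
        lb, ub = intervals[chrom]
        if not (lb <= s <= ub):
            flagged.append(chrom)
    return flagged
```

A value exactly on a boundary is treated as inside the interval in this sketch; the source does not specify how boundary values are handled.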
Further, since step 1406 builds, for each sex of the target species, a distribution table of the standard signal intensities of the chromosomes at the preset confidence value P, the actual signal intensities of the sex chromosomes can also be compared with the distribution table of each sex. The actual signal intensities of the X chromosome and the Y chromosome calculated from the data to be detected are compared with the confidence interval boundary values in the distribution tables of the two sexes. If both values fall within the intervals of the distribution table for normal male samples, the sex corresponding to the data to be detected is male. If both values fall within the intervals of the distribution table for normal female samples, the sex corresponding to the data to be detected is female. If the values fall within the confidence intervals of neither the male table nor the female table, a potential sex chromosome copy number abnormality can be inferred.
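The sex inference can be sketched as two interval checks. The table layout (a dictionary with "X" and "Y" entries holding (LB, UB) pairs), the return strings, and the choice to test the male table first are all assumptions of this example.

```python
def infer_sex(s_x, s_y, male_table, female_table):
    """Classify a sample from the actual signal intensities of X and Y.

    male_table / female_table: {"X": (LB, UB), "Y": (LB, UB)} taken from the
    distribution tables of normal male and normal female samples.
    """
    def inside(table):
        x_lb, x_ub = table["X"]
        y_lb, y_ub = table["Y"]
        return x_lb <= s_x <= x_ub and y_lb <= s_y <= y_ub

    if inside(male_table):
        return "male"
    if inside(female_table):
        return "female"
    return "potential sex chromosome copy number abnormality"
```

The interval values used to call this function would come from the standard detection samples of each sex; none are given in the source.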
Once the sex corresponding to the data to be detected has been determined as above, the actual signal intensity of each chromosome in the data is compared with the confidence interval of the corresponding chromosome in the distribution table at the preset confidence value P. Which sex's distribution table is used depends on the sex corresponding to the data to be detected. If the actual signal intensity of a chromosome falls outside its confidence interval in the table, the chromosome is judged to have a potential copy number abnormality. The probability of a false positive can be reduced by increasing the preset confidence value P, but increasing P also increases the probability of a false negative.
Once the target species corresponding to the data to be detected has been determined and the specific k-mers of each of its chromosomes have been obtained, the actual signal intensity of each chromosome is calculated from the actual number of occurrences of the specific k-mers in the data to be detected and the copy number of each specific k-mer. The actual signal intensity of each chromosome can then be compared with the standard confidence interval of that chromosome, and any chromosome whose value lies outside its interval is judged to have a copy number abnormality. This method compares the data against the characteristic target sequences, i.e. the specific k-mers, of each chromosome of the target species. Because the specific k-mers are only a part of the whole genome of the target species, the comparison space is smaller, which shortens the analysis time and improves detection efficiency. In addition, the characteristic targets of each chromosome are derived from multiple genomes of different individuals or populations of the target species, which avoids the problem that whole-genome alignment performs worse when a data set comes from an individual that is genetically distant from the reference genome. Because the characteristic target library of each chromosome is built from multiple genomes of different individuals or populations of the target species, it is more generally applicable than a single reference genome. And since analyzing a data set only involves comparing the data with the sequences in the characteristic target library, the space and time cost of alignment is greatly reduced.
It should be understood that although the steps in the flowcharts of FIGS. 1-20 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in each figure may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times; nor are these sub-steps or stages necessarily performed one after another, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 21, an apparatus for detecting chromosome copy number abnormalities is provided, including:
a specific k-mer acquisition module 2102, configured to obtain the sequencing data of a sample to be detected as the data to be detected and determine the target species corresponding to the data to be detected; and to obtain the specific k-mers, stored in the target database, corresponding to each chromosome of the target species, where a specific k-mer is a k-mer of a chromosome that satisfies the preset specificity conditions and a k-mer is a genomic sequence of length k;
an actual occurrence count acquisition module 2104, configured to obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome;
a copy number acquisition module 2106, configured to obtain the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in its chromosome to the number of occurrences of the least frequent specific k-mer of that chromosome; and
a determination module 2108, configured to calculate the actual signal intensity of each chromosome from the actual number of occurrences and the copy number of each specific k-mer, and to judge any chromosome whose actual signal intensity lies outside the standard confidence interval of that chromosome to be a chromosome with a copy number abnormality.
In one embodiment, the determination module 2108 is further configured to calculate the ratio of the actual number of occurrences of each specific k-mer to its copy number; to calculate the mean of these ratios over all specific k-mers contained in each chromosome as the single-copy signal intensity of that chromosome; and to calculate the actual signal intensity of each chromosome from the single-copy signal intensities of the chromosomes.
In one embodiment, the actual signal intensity of a chromosome is calculated according to the following formula:
actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M)/SD, where M is the mean of the single-copy signal intensities of all chromosomes and SD is the variance of the single-copy signal intensities of all chromosomes.
In one embodiment, the apparatus for detecting chromosome copy number abnormalities further includes a standard confidence interval list calculation module (not shown in the figure), configured to obtain a preset number of standard detection samples, a standard detection sample being a sample confirmed to have no chromosome copy number abnormality; to obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome of the standard detection samples; to obtain from the target database the copy number of each specific k-mer of each chromosome contained in the standard detection samples; to obtain the standard signal intensity of each chromosome from the actual number of occurrences and the copy number of each specific k-mer contained in the standard detection samples; to determine, from the standard signal intensities of each chromosome across multiple standard detection samples, the standard confidence interval of the chromosome at the preset confidence value; and to obtain, from the standard confidence interval of each chromosome, the list of standard confidence intervals corresponding to the chromosomes of the target species.
In one embodiment, the standard confidence interval list calculation module is further configured to obtain the standard signal intensity of each chromosome contained in each standard detection sample; to calculate, separately by the sex of the standard detection samples, the mean and variance of the standard signal intensities of the chromosomes contained in all standard detection samples; and to determine, from the mean and variance of the standard signal intensity of each chromosome across the multiple standard detection samples of a given sex, the standard confidence interval at the preset confidence value of the chromosomes contained in the standard detection samples of that sex.
In one embodiment, the standard test samples are peripheral blood samples from normal mothers carrying normal infants. The peripheral blood samples include samples from normal mothers carrying a normal male infant, samples from normal mothers carrying a normal female infant, samples from normal mothers carrying normal male twins, samples from normal mothers carrying normal female twins, and samples from normal mothers carrying normal twins of one male and one female. The standard confidence interval list calculation module is further configured to: determine the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal male infant; determine the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying a normal female infant; determine the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying normal male twins; determine the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying normal female twins; and determine the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of normal mothers carrying normal twins of one male and one female.
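As a hedged illustration of how a per-group standard confidence interval list might be derived from the mean and variance of the standard signal intensities, the sketch below assumes the intensities are approximately normally distributed and uses a two-sided interval of the form mean ± z·sd. The normality assumption, the function names, and the group labels are illustrative choices that the text does not specify.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(values, confidence=0.95):
    """Two-sided interval mean +/- z * sd under a normal approximation (assumption)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    m = mean(values)
    sd = stdev(values)  # sample standard deviation; needs at least two values
    return (m - z * sd, m + z * sd)


def interval_table(samples, confidence=0.95):
    """Build {group: {chromosome: (low, high)}} from labelled standard samples.

    samples: iterable of (group, {chromosome: standard signal intensity}),
    where group is a hypothetical label such as "male" or "female twins".
    """
    grouped = {}
    for group, intensities in samples:
        for chrom, value in intensities.items():
            grouped.setdefault(group, {}).setdefault(chrom, []).append(value)
    return {
        group: {chrom: confidence_interval(vals, confidence)
                for chrom, vals in chroms.items()}
        for group, chroms in grouped.items()
    }
```

Keeping one table per group mirrors the embodiment's separation of samples by fetal sex and by singleton or twin pregnancy.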
In one embodiment, the determination module 2108 is further configured to, upon detecting that the actual signal intensity of a chromosome falls outside the standard confidence interval of that chromosome, determine the chromosome corresponding to that actual signal intensity to be a chromosome with a copy number abnormality.
In one embodiment, the apparatus for detecting chromosome copy number abnormalities further includes a sex-specific comparison module (not shown in the figures), configured to: determine, according to the sexes of the target species, the standard confidence interval list of the chromosomes for each sex; compare the actual signal intensity of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list for the corresponding sex of the target species; and, upon detecting that the actual signal intensity of a chromosome falls outside the standard confidence interval of the corresponding chromosome for the corresponding sex, determine the chromosome corresponding to that actual signal intensity to be a chromosome with a copy number abnormality.
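The comparison performed by the determination module and the sex-specific comparison module amounts to an interval membership test. A minimal sketch follows; the data layout (dictionaries keyed by chromosome name and by sex) and the function names are hypothetical.

```python
def flag_abnormal(actual, intervals):
    """Return chromosomes whose actual signal intensity lies outside [low, high]."""
    abnormal = []
    for chrom, value in actual.items():
        low, high = intervals[chrom]
        if not (low <= value <= high):
            abnormal.append(chrom)
    return abnormal


def flag_by_sex(actual, sex, tables):
    """Select the interval list for the sample's sex, then run the same test."""
    return flag_abnormal(actual, tables[sex])
```

For example, with an interval of (-2.0, 2.0) for every chromosome, a sample whose chr21 intensity is 3.1 would be reported as abnormal for chr21 only.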
In one embodiment, the apparatus for detecting chromosome copy number abnormalities further includes a target sequence establishment module (not shown in the figures), configured to: obtain, for each specific k-mer contained in each chromosome of the target species stored in the target database, its number of occurrences C in the corresponding chromosome, and take the number of occurrences of the least frequent specific k-mer in that chromosome as the minimum number of occurrences Cm; take the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer; generate, from the copy numbers of the specific k-mers contained in each chromosome, a specific k-mer copy number list corresponding to each chromosome; and store the specific k-mer copy number list in the target database.
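The copy number definition above (C divided by Cm) is simple enough to show directly. The sketch below assumes the per-chromosome occurrence counts are already available as a dictionary; the function name is hypothetical.

```python
def copy_number_list(occurrences):
    """Map each specific k-mer to C / Cm for one chromosome.

    occurrences: mapping specific k-mer -> number of occurrences C on the chromosome.
    Cm is the smallest C among the chromosome's specific k-mers.
    """
    if not occurrences:
        raise ValueError("chromosome has no specific k-mers")
    c_min = min(occurrences.values())
    return {kmer: c / c_min for kmer, c in occurrences.items()}
```

With this definition the least frequent specific k-mer always has copy number 1, and every other k-mer is expressed as a multiple of it.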
In one embodiment, the target sequence establishment module is further configured to: obtain the multiple chromosomes contained in the target species; classify and organize the multiple chromosomes contained in the target species; obtain pre-selected high-confidence genomes that satisfy a preset credibility condition; and determine the high-confidence genomes corresponding to each chromosome contained in the target species.
In one embodiment, satisfying the preset credibility condition includes any one of the following: the proportion of non-deterministic characters contained in the chromosome sequence is below a preset proportion threshold; the number of sequence fragments belonging to the same chromosome contained in the chromosome sequence is below a preset fragment threshold; or, when a chromosome sequence is aligned against all other chromosome sequences whose genetic relationship to it falls within a preset genetic distance threshold range and the average whole-sequence coverage percentage of that chromosome sequence among its closely related chromosome sequences is determined, that average coverage percentage is above a preset percentage value.
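A rough sketch of the credibility check is given below. It treats any character other than A, C, G, or T as non-deterministic, takes the fragment count and the average coverage percentage as precomputed inputs (the alignment itself is not shown), and combines the three criteria with a logical OR to reflect "any one of the following". All threshold defaults and names are hypothetical placeholders; the text does not give values.

```python
def ambiguous_fraction(sequence):
    """Fraction of characters that are not A, C, G or T (case-insensitive)."""
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    return sum(1 for base in seq if base not in "ACGT") / len(seq)


def is_high_confidence(sequence, n_fragments, mean_coverage_pct,
                       max_ambiguous=0.01, max_fragments=100,
                       min_coverage_pct=90.0):
    """True if any one of the three credibility criteria is met.

    The three threshold defaults are illustrative assumptions only.
    """
    return (
        ambiguous_fraction(sequence) < max_ambiguous
        or n_fragments < max_fragments
        or mean_coverage_pct > min_coverage_pct
    )
```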
In one embodiment, a k-mer qualifies as a specific k-mer when it satisfies the following two conditions: its number of occurrences in the genome occurrence index table corresponding to each chromosome satisfies a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to each chromosome, together with its number of occurrences in the full-set genome occurrence index table, satisfies a second preset error condition. The genome occurrence index table records, for each k-mer, the number of genomes that contain that k-mer among the genomes corresponding to the chromosome; the full-set genome occurrence index table records, for each k-mer contained in each chromosome of the target species, the number of genomes that contain that k-mer among the genomes of the full set.
In one embodiment, the first preset error condition is: the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of genomes contained in the corresponding chromosome, plus a first threshold, is greater than or equal to 1.
In one embodiment, the first threshold is less than 5%.
In one embodiment, the second preset error condition is: the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of occurrences in the full-set genome occurrence index table, plus a second threshold, is greater than or equal to 1.
In one embodiment, the second threshold is less than 5%.
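The two preset error conditions can be read as a sensitivity check (the k-mer appears in nearly all genomes of its chromosome) and a specificity check (nearly all genomes containing the k-mer belong to that chromosome). The sketch below encodes both inequalities as stated; the function name, argument names, and the default thresholds of 0.04 (chosen only because they are below 5%) are assumptions.

```python
def is_specific(in_chrom, n_chrom_genomes, in_all, t1=0.04, t2=0.04):
    """Check both preset error conditions for one k-mer.

    in_chrom: genomes of this chromosome that contain the k-mer.
    n_chrom_genomes: total genomes associated with this chromosome.
    in_all: genomes in the full set that contain the k-mer.
    t1, t2: first and second thresholds (hypothetical defaults below 5%).
    """
    if n_chrom_genomes == 0 or in_all == 0:
        return False
    first = in_chrom / n_chrom_genomes + t1 >= 1
    second = in_chrom / in_all + t2 >= 1
    return first and second
```

A k-mer found in 98 of 100 genomes of its chromosome and in 99 genomes overall would pass both checks under these defaults, whereas one found in 120 genomes overall would fail the second.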
For specific limitations of the apparatus for detecting chromosome copy number abnormalities, refer to the limitations of the method for detecting chromosome copy number abnormalities described above; they are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 22. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and the database. The internal memory provides an environment for running the operating system and the computer-readable instructions held in the non-volatile storage medium. The database of the computer device stores the data used for detecting chromosome copy number abnormalities. The network interface of the computer device communicates with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a method for detecting chromosome copy number abnormalities.
Those skilled in the art will understand that the structure shown in FIG. 22 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processors, implement the steps of the method for detecting chromosome copy number abnormalities provided in any embodiment of the present application.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for detecting chromosome copy number abnormalities provided in any embodiment of the present application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments may be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The embodiments described above represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. The protection scope of this patent application is therefore defined by the appended claims.

Claims (19)

  1. A method for detecting chromosome copy number abnormalities, comprising:
    obtaining sequencing data of a sample to be tested as data to be tested, and determining a target species corresponding to the data to be tested;
    obtaining the specific k-mers, stored in a target database, corresponding to each chromosome contained in the target species, wherein a specific k-mer is a k-mer in a chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
    obtaining the actual number of occurrences, in the data to be tested, of the specific k-mers contained in each chromosome;
    obtaining the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
    calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
    determining a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality.
  2. The method according to claim 1, wherein calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer comprises:
    calculating the ratio of the actual number of occurrences of each specific k-mer to its copy number;
    calculating the mean of the ratios of all specific k-mers contained in each chromosome as the single-copy signal intensity of the corresponding chromosome; and
    calculating the actual signal intensity of the corresponding chromosome from the single-copy signal intensity of each chromosome.
  3. The method according to claim 2, wherein the actual signal intensity of the corresponding chromosome is calculated according to the following formula:
    actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M) / SD, where M is the mean of the single-copy signal intensities of all chromosomes and SD is the variance of the single-copy signal intensities of all chromosomes.
  4. The method according to claim 1, further comprising:
    obtaining a preset number of standard test samples, the standard test samples being samples confirmed to have no chromosome copy number abnormality;
    obtaining the actual number of occurrences, in the data to be tested, of the specific k-mers contained in each chromosome of the standard test samples;
    obtaining from the target database the copy number of each specific k-mer in each chromosome contained in the standard test samples;
    obtaining the standard signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer contained in the standard test samples;
    determining, from the standard signal intensities of each chromosome across multiple standard test samples, the standard confidence interval of the chromosome at a preset confidence level; and
    obtaining, from the standard confidence interval of each chromosome, a list of standard confidence intervals for the chromosomes contained in the target species.
  5. The method according to claim 4, wherein determining, from the standard signal intensities of each chromosome across multiple standard test samples, the standard confidence interval of the chromosome at a preset confidence level comprises:
    obtaining the standard signal intensity of each chromosome contained in each of the standard test samples;
    computing, separately for each sex of the standard test samples, the mean and variance of the standard signal intensities of the chromosomes across all standard test samples; and
    determining, from the mean and variance of each chromosome's standard signal intensity across the multiple standard test samples of the corresponding sex, the standard confidence interval, at the preset confidence level, of the chromosomes contained in the standard test samples of each sex.
  6. The method according to claim 4, wherein the standard test samples are peripheral blood samples from normal mothers carrying normal infants, the peripheral blood samples including samples from normal mothers carrying a normal male infant, samples from normal mothers carrying a normal female infant, samples from normal mothers carrying normal male twins, samples from normal mothers carrying normal female twins, and samples from normal mothers carrying normal twins of one male and one female;
    and wherein determining, from the standard signal intensities of each chromosome across multiple standard test samples, the standard confidence interval of the chromosome at a preset confidence level comprises:
    determining the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of the normal mothers carrying a normal male infant;
    determining the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of the normal mothers carrying a normal female infant;
    determining the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of the normal mothers carrying normal male twins;
    determining the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of the normal mothers carrying normal female twins; and
    determining the standard confidence interval of each chromosome at the preset confidence level from the standard signal intensity of each chromosome contained in the peripheral blood samples of the normal mothers carrying normal twins of one male and one female.
  7. The method according to claim 1, wherein determining a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality comprises:
    upon detecting that the actual signal intensity of a chromosome falls outside the standard confidence interval of the corresponding chromosome, determining the chromosome corresponding to that actual signal intensity to be a chromosome with a copy number abnormality.
  8. The method according to claim 1, further comprising:
    determining, according to the sexes of the target species, the standard confidence interval list of the chromosomes for each sex;
    obtaining the sex of the sample to be tested;
    comparing the actual signal intensity of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list for the corresponding sex of the target species; and
    upon detecting that the actual signal intensity of a chromosome falls outside the standard confidence interval of the corresponding chromosome for the corresponding sex, determining the chromosome corresponding to that actual signal intensity to be a chromosome with a copy number abnormality.
  9. The method according to claim 1, further comprising, before obtaining the sequencing data of the sample to be tested as the data to be tested:
    obtaining, for each specific k-mer contained in each chromosome of the target species stored in the target database, its number of occurrences C in the corresponding chromosome, and taking the number of occurrences of the least frequent specific k-mer in that chromosome as the minimum number of occurrences Cm;
    taking the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer;
    generating, from the copy numbers of the specific k-mers contained in each chromosome, a specific k-mer copy number list corresponding to each chromosome; and
    storing the specific k-mer copy number list in the target database;
    wherein obtaining the copy number of each specific k-mer from the target database comprises: obtaining the copy number of each specific k-mer from the specific k-mer copy number list.
  10. The method according to claim 1, further comprising, before obtaining the sequencing data of the sample to be tested as the data to be tested:
    obtaining the multiple chromosomes contained in the target species;
    classifying and organizing the multiple chromosomes contained in the target species;
    obtaining pre-selected high-confidence genomes that satisfy a preset credibility condition; and
    determining the high-confidence genomes corresponding to each chromosome contained in the target species.
  11. The method according to claim 1, wherein satisfying the preset credibility condition includes any one of the following:
    the proportion of non-deterministic characters contained in the chromosome sequence is below a preset proportion threshold;
    the number of sequence fragments belonging to the same chromosome contained in the chromosome sequence is below a preset fragment threshold; and
    when a chromosome sequence is aligned against all other chromosome sequences whose genetic relationship to it falls within a preset genetic distance threshold range and the average whole-sequence coverage percentage of that chromosome sequence among its closely related chromosome sequences is determined, that average coverage percentage is above a preset percentage value.
  12. The method according to claim 1, wherein a k-mer that is a specific k-mer satisfies the following two conditions:
    its number of occurrences in the genome occurrence index table corresponding to each chromosome satisfies a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to each chromosome, together with its number of occurrences in the full-set genome occurrence index table, satisfies a second preset error condition;
    wherein the genome occurrence index table records, for each k-mer, the number of genomes that contain that k-mer among the genomes corresponding to the chromosome; and the full-set genome occurrence index table records, for each k-mer contained in each chromosome of the target species, the number of genomes that contain that k-mer among the genomes of the full set.
  13. The method according to claim 12, wherein the first preset error condition is: the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of genomes contained in the corresponding chromosome, plus a first threshold, is greater than or equal to 1.
  14. The method according to claim 13, wherein the first threshold is less than 5%.
  15. The method according to claim 12, wherein the second preset error condition is: the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of occurrences in the full-set genome occurrence index table, plus a second threshold, is greater than or equal to 1.
  16. The method according to claim 15, wherein the second threshold is less than 5%.
  17. An apparatus for detecting chromosome copy number abnormalities, comprising:
    a specific k-mer acquisition module, configured to obtain sequencing data of a sample to be tested as data to be tested and determine a target species corresponding to the data to be tested, and to obtain the specific k-mers, stored in a target database, corresponding to each chromosome contained in the target species, wherein a specific k-mer is a k-mer in a chromosome that satisfies a preset specificity condition, and a k-mer is a genomic sequence of length k;
    an actual occurrence count acquisition module, configured to obtain the actual number of occurrences, in the data to be tested, of the specific k-mers contained in each chromosome;
    a copy number acquisition module, configured to obtain the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome; and
    a determination module, configured to calculate the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer, and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome to be a chromosome with a copy number abnormality.
  18. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    acquiring sequencing data of a sample to be tested as the data to be tested, and determining the target species corresponding to the data to be tested;
    acquiring, from a target database, the specific k-mers corresponding to each chromosome of the target species, wherein the specific k-mers are the k-mers in each chromosome that satisfy a preset specificity condition, and a k-mer is a genomic sequence of length k;
    obtaining the actual number of occurrences, in the data to be tested, of the specific k-mers contained in each chromosome;
    obtaining the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
    calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
    determining any chromosome whose actual signal intensity falls outside the standard confidence interval of that chromosome to be a chromosome with a copy number abnormality.
  19. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring sequencing data of a sample to be tested as the data to be tested, and determining the target species corresponding to the data to be tested;
    acquiring, from a target database, the specific k-mers corresponding to each chromosome of the target species, wherein the specific k-mers are the k-mers in each chromosome that satisfy a preset specificity condition, and a k-mer is a genomic sequence of length k;
    obtaining the actual number of occurrences, in the data to be tested, of the specific k-mers contained in each chromosome;
    obtaining the copy number of each specific k-mer from the target database, wherein the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the least frequent specific k-mer on that chromosome;
    calculating the actual signal intensity of the corresponding chromosome from the actual number of occurrences and the copy number of each specific k-mer; and
    determining any chromosome whose actual signal intensity falls outside the standard confidence interval of that chromosome to be a chromosome with a copy number abnormality.
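
For illustration only, the steps recited in the claims above can be sketched in Python. This is not the applicant's implementation. The function names, the toy target database, the toy reads, and the interval values are all hypothetical. In particular, the claims do not state how the signal intensity is computed from the occurrence counts and copy numbers; the sketch assumes the mean of (actual count / copy number), which is only one plausible choice. The last helper expresses the inequality of claim 15; which counts it is applied to is defined in claim 12, which lies outside this excerpt.

```python
from statistics import mean


def count_specific_kmers(reads, k, targets):
    """Count how often each target k-mer occurs across all reads."""
    counts = {kmer: 0 for kmer in targets}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


def copy_numbers(reference_counts):
    """Copy number as recited: occurrences of a specific k-mer on its
    chromosome divided by the occurrences of the least frequent one."""
    least = min(reference_counts.values())
    return {kmer: n / least for kmer, n in reference_counts.items()}


def signal_intensity(actual_counts, copy_number):
    """ASSUMPTION: mean of (actual count / copy number). The claims do
    not give the formula; this is a stand-in for illustration."""
    return mean(actual_counts[kmer] / cn for kmer, cn in copy_number.items())


def abnormal_chromosomes(reads, k, target_db, intervals):
    """Return chromosomes whose signal lies outside their standard interval."""
    flagged = []
    for chrom, reference_counts in target_db.items():
        cn = copy_numbers(reference_counts)
        actual = count_specific_kmers(reads, k, cn)
        low, high = intervals[chrom]
        if not low <= signal_intensity(actual, cn) <= high:
            flagged.append(chrom)
    return flagged


def meets_second_error_condition(chrom_count, genome_count, threshold):
    """Claim 15 inequality: per-chromosome / whole-genome occurrence ratio
    plus the second threshold is at least 1 (claim 16: threshold < 5%)."""
    return chrom_count / genome_count + threshold >= 1


# Hypothetical toy data: reference occurrences of specific 3-mers per
# chromosome, and made-up standard confidence intervals.
target_db = {"chr1": {"ACG": 1, "TTA": 2}, "chr2": {"GGC": 1}}
intervals = {"chr1": (1.5, 2.0), "chr2": (1.5, 2.0)}
reads = ["ACGTTA", "TTAGGC", "TTAACG"]
flagged = abnormal_chromosomes(reads, 3, target_db, intervals)
print(flagged)
```

Dividing each observed count by its copy number puts repeated and single-copy k-mers on the same scale before they are averaged, which is why the copy number is looked up per k-mer rather than per chromosome.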
PCT/CN2018/111958 2018-06-22 2018-10-25 Method and apparatus for detecting chromosomal copy number variations, and storage medium WO2019242187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810651441.6A CN109192246B (en) 2018-06-22 2018-06-22 Method, apparatus and storage medium for detecting chromosomal copy number abnormalities
CN201810651441.6 2018-06-22

Publications (1)

Publication Number Publication Date
WO2019242187A1 (en) 2019-12-26

Family ID: 64948725

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111958 WO2019242187A1 (en) 2018-06-22 2018-10-25 Method and apparatus for detecting chromosomal copy number variations, and storage medium

Country Status (2)

Country Link
CN (1) CN109192246B (en)
WO (1) WO2019242187A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151112A (en) * 2019-06-27 2020-12-29 天津中科智虹生物科技有限公司 Method and device for detecting genetic gene
CN113409885B (en) * 2021-06-21 2022-09-20 天津金域医学检验实验室有限公司 Automatic data processing and mapping method and system
CN113793641B (en) * 2021-09-29 2023-11-28 苏州赛美科基因科技有限公司 Method for rapidly judging sample gender from FASTQ file

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140100792A1 (en) * 2012-10-04 2014-04-10 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
CN104745718A (en) * 2015-04-23 2015-07-01 北京嘉宝仁和医疗科技有限公司 Method for detecting chromosome microdeletion and micro-duplication of human embryo
CN104789686A (en) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 Kit and device for detecting aneuploidy of chromosomes
WO2017094941A1 (en) * 2015-12-04 2017-06-08 주식회사 녹십자지놈 Method for determining copy-number variation in sample comprising mixture of nucleic acids

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2362958A2 (en) * 2008-10-31 2011-09-07 Abbott Laboratories Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations
US9898687B2 (en) * 2011-08-03 2018-02-20 Trigeminal Solutions, Inc. Technique for identifying association variables
PT2893040T (en) * 2012-09-04 2019-04-01 Guardant Health Inc Systems and methods to detect rare mutations and copy number variation
CN104951672B (en) * 2015-06-19 2017-08-29 中国科学院计算技术研究所 Joining method and system associated with a kind of second generation, three generations's gene order-checking data
CN107287285A (en) * 2017-03-28 2017-10-24 上海至本生物科技有限公司 It is a kind of to predict the method that homologous recombination absent assignment and patient respond to treatment of cancer

Also Published As

Publication number Publication date
CN109192246B (en) 2020-10-16
CN109192246A (en) 2019-01-11

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18923012; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18/05/2021))
122 Ep: PCT application non-entry in European phase (Ref document number: 18923012; Country of ref document: EP; Kind code of ref document: A1)