WO2019242187A1

WO2019242187A1 - Method and apparatus for detecting chromosomal copy number variations, and storage medium

Info

Publication number: WO2019242187A1
Application number: PCT/CN2018/111958
Authority: WO
Inventors: 孙亚洲; 肖贡; 陈斌; 杜刘稳; 牛团结; 陈杰
Original assignee: 深圳市达仁基因科技有限公司
Priority date: 2018-06-22
Filing date: 2018-10-25
Publication date: 2019-12-26
Also published as: CN109192246B; CN109192246A

Abstract

A method for detecting chromosomal copy number variations, comprising: obtaining sequencing data of a sample to be detected as data to be detected, and determining a target species corresponding to the data to be detected; obtaining the specific k-mer corresponding to each chromosome included in the target species stored in a target database; obtaining the actual occurrence number of the specific k-mer included in each chromosome in the data to be detected; obtaining the copy number of each specific k-mer from the target database; obtaining, according to the actual occurrence number and the copy number of each specific k-mer, the actual signal intensity of the corresponding chromosome by calculation; and determining chromosomes of which the actual signal intensities are not in a standard confidence interval of the corresponding chromosome as chromosomes having copy number variations.

Description

Method, device and storage medium for detecting abnormal chromosome copy number

This application claims priority to Chinese patent application No. 2018106514416, filed on June 22, 2018 and entitled "Method, Device, Computer Equipment, and Storage Medium for Detecting Abnormal Chromosome Copy Numbers", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to a method, a device, a computer device, and a storage medium for detecting an abnormality in chromosome copy number.

Background

In medicine and biology, to detect whether a sample has a chromosome copy number abnormality, conventional technical solutions use the genome sequencing data of the sample to be tested and determine through data analysis whether the sample has a chromosome copy number problem. However, current solutions generally require aligning the sequencing data against the complete sequences of all chromosomes of a species, so they demand substantial computing resources, time, and memory.

Summary of the Invention

According to various embodiments disclosed in the present application, a method, a device, and a storage medium for detecting an abnormality in chromosome copy number are provided.

A method for detecting chromosome copy number abnormalities, including:

Acquiring sequencing data of a sample to be detected as the data to be detected, and determining a target species corresponding to the data to be detected;

Obtaining, from a target database, a specific k-mer corresponding to each chromosome contained in the target species, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k;

Obtaining the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected;

Obtaining a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the fewest occurrences on that chromosome;

Calculating the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer; and

Determining a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.

A device for detecting abnormal chromosome copy number, the device includes:

A specific k-mer acquisition module, configured to acquire sequencing data of a sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected; and to acquire, from the target database, a specific k-mer corresponding to each chromosome contained in the target species, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k;

An actual appearance frequency acquisition module, configured to obtain the actual appearance frequency of the specific k-mer included in each chromosome in the data to be detected;

A copy number obtaining module, configured to obtain a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the fewest occurrences on that chromosome; and

A determination module, configured to calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer, and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.

A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

Obtaining, from the target database, a specific k-mer corresponding to each chromosome contained in the target species, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k;

Obtaining a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the fewest occurrences on that chromosome;

One or more non-volatile computer-readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for detecting a chromosome copy number abnormality according to one or more embodiments.

FIG. 2 is a schematic flow chart before step 102 according to one or more embodiments.

FIG. 3 is a schematic flow chart before step 102 according to another embodiment.

FIG. 4 is a list of copy numbers of specific k-mers of chromosome X according to one or more embodiments.

FIG. 5 is a schematic flowchart of step 110 according to one or more embodiments.

FIG. 6 is a schematic flowchart of a method for detecting an abnormal chromosome copy number according to one or more embodiments, which further includes other steps.

FIG. 7A is a standard signal intensity recording table of a chromosome in a normal male sample according to one or more embodiments.

FIG. 7B is a standard signal intensity recording table of a chromosome in a normal female sample according to one or more embodiments.

FIG. 8 is a schematic flowchart of step 610 according to one or more embodiments.

FIG. 9A is a distribution table of pre-set reliability values P according to standard signal intensities of chromosomes in normal male samples in one or more embodiments.

FIG. 9B is a distribution table of pre-set reliability values P according to standard signal intensities of chromosomes in normal female samples in one or more embodiments.

FIG. 10 is a schematic flowchart of step 610 according to another embodiment.

FIG. 11 is a schematic flowchart of a method for detecting an abnormal chromosome copy number according to one or more other embodiments, which further includes other steps.

FIG. 12 is a schematic flowchart before step 102 according to still another embodiment.

FIG. 13 is a table showing the actual number of occurrences of the specific k-mer of a specific chromosome according to one or more embodiments.

FIG. 14 is a schematic flowchart of a method for detecting an abnormal chromosome copy number according to one or more other embodiments.

FIG. 15 is a schematic flowchart of step 1402 according to one or more embodiments.

FIG. 16 is a table of human chromosome copy numbers in accordance with one or more embodiments.

FIG. 17 is a schematic flowchart of step 1404 according to one or more embodiments.

FIG. 18 is a single-copy signal strength calculation table for a specific chromosome according to one or more embodiments.

FIG. 19 is a single copy signal intensity recording table for each chromosome according to one or more embodiments.

FIG. 20 is a calculation table of actual signal intensities of individual chromosomes according to one or more embodiments.

FIG. 21 is a block diagram of an apparatus for detecting an abnormality in chromosome copy number according to one or more embodiments.

FIG. 22 is a block diagram of a computer device in accordance with one or more embodiments.

Detailed Description

In order to make the technical solution and advantages of the present application more clear, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the application, and are not used to limit the application.

In one embodiment, as shown in FIG. 1, a method for detecting an abnormal chromosome copy number is provided, which includes the following steps:

Step 102: Obtain sequencing data of the sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected.

The data to be detected refers to the data output after a DNA sequencer, an RNA sequencer, or a protein sequencing device reads the sequence of the biomolecules contained in a sample. DNA sequencing is the process of determining the exact sequence of nucleotides within a DNA molecule. It includes any method or technique for determining the order of the four bases adenine, guanine, cytosine, and thymine in a DNA strand. A sequencer is an instrument capable of measuring the sequence of an input sample. The sequences measured here include not only DNA sequences but also sequences of other substances such as proteins and RNA. A sample can take the form of a drop of blood, a sputum specimen, a handful of soil, and so on. After the data to be detected is obtained, the species to which the data belongs, that is, the target species, is determined. For example, when the sequencing data is a human gene sequence, the target species is human.

Step 104: Obtain a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and a k-mer refers to a genomic sequence of length k.

Each target species contains one or more individuals. Each individual contains one or more genomes, and each genome contains one or more chromosomes. Therefore, each target species contains multiple chromosomes. The target database may store a feature target sequence set previously established for each chromosome, and the feature target sequence set corresponding to each chromosome may include a specific k-mer corresponding to each chromosome. The specific k-mer refers to a k-mer selected from the k-mers contained in each chromosome and meeting a preset specificity condition, that is, a specific k-mer corresponding to each chromosome. The preset specific condition is a condition set by a technician in advance for selecting a matching k-mer. The preset specific condition may be determined according to a technician's consideration or an actual project requirement.

A k-mer refers to a genomic sequence of length k, where k is a natural number. If genomic data contains a distinct deterministic characters, then for a particular k there can be a total of a^k (a to the power of k) different k-mers. For DNA or RNA (ribonucleic acid) sequences, the deterministic characters are the five bases A (adenine), T (thymine), C (cytosine), G (guanine), and U (uracil); for protein sequences, the deterministic characters are the defined amino acid characters.
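As an illustration of the k-mer definition, the following minimal Python sketch counts the possible k-mers for an alphabet of a characters (a to the power of k) and lists the overlapping k-mers of a sequence. The function names and the four-letter DNA alphabet are assumptions chosen for the example and do not come from the application.

```python
from itertools import product


def count_possible_kmers(alphabet_size: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet of `alphabet_size` characters."""
    return alphabet_size ** k


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Illustrative alphabet: the four DNA bases (an assumption for this example).
dna_alphabet = "ACGT"

# Enumerate all 2-mers explicitly to confirm the a ** k count.
all_2mers = ["".join(p) for p in product(dna_alphabet, repeat=2)]
```

For example, `list(kmers("ACGTA", 3))` walks the sequence with a sliding window of width 3, and `len(all_2mers)` matches `count_possible_kmers(4, 2)`.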

Step 106: Obtain the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected.

After the data to be detected is obtained, it can be compared with each chromosome separately: the number of times each specific k-mer in the feature target sequence set corresponding to each chromosome appears in the data to be detected is counted, and this count is the actual number of occurrences of that specific k-mer in the data to be detected.
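The counting in step 106 might be sketched as follows. This is a simplified illustration, not the application's implementation: it assumes the reads are plain strings, that every specific k-mer belongs to exactly one chromosome, and that reverse complements and sequencing errors are ignored. The name `count_specific_kmers` and the data layout are hypothetical.

```python
from collections import Counter


def count_specific_kmers(reads, specific_kmers, k):
    """Count occurrences of each chromosome's specific k-mers in the reads.

    reads          : iterable of read strings (the data to be detected)
    specific_kmers : dict mapping chromosome name -> set of specific k-mers
    k              : k-mer length
    Returns a dict mapping chromosome name -> Counter(k-mer -> actual count).
    """
    # Reverse lookup: which chromosome does a specific k-mer belong to?
    # (Assumes a k-mer is specific to at most one chromosome.)
    owner = {kmer: chrom
             for chrom, kmer_set in specific_kmers.items()
             for kmer in kmer_set}

    # Start every specific k-mer at zero so unseen k-mers are still reported.
    counts = {chrom: Counter({kmer: 0 for kmer in kmer_set})
              for chrom, kmer_set in specific_kmers.items()}

    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            chrom = owner.get(kmer)
            if chrom is not None:
                counts[chrom][kmer] += 1
    return counts
```

Only k-mers present in the lookup are counted, which reflects the idea that the reads are compared against the specific k-mers rather than against whole chromosome sequences.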

Step 108: Obtain a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the fewest occurrences on that chromosome.

The copy number of each specific k-mer refers to the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the fewest occurrences on that chromosome. To obtain the copy number of each specific k-mer, the specific k-mer copy number list corresponding to each chromosome can first be retrieved from the target database, and the copy number of each specific k-mer contained in the chromosome can then be read from that list. The specific k-mer copy number list is established in advance and stored in the target database, so it can be called when needed, which improves detection efficiency.

Step 110: Calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer.

After the actual number of occurrences and the copy number of each specific k-mer in the data to be detected are obtained, the actual signal intensity of each chromosome can be calculated from these two parameters. Given the actual number of occurrences Ci and the copy number Fi of each specific k-mer, the ratio Ci / Fi is calculated and used as the adjusted number of occurrences of that specific k-mer. In this way, the adjusted number of occurrences of every specific k-mer contained in each chromosome can be calculated. The average of the adjusted numbers of occurrences of the specific k-mers in each chromosome is then calculated and used as the single-copy signal intensity E of the corresponding chromosome. After the single-copy signal intensity E of all chromosomes is calculated, the average value M and the variance SD of the single-copy signal intensities of all chromosomes can be calculated. The difference between the single-copy signal intensity of each chromosome and the average value M, divided by the variance SD, is taken as the actual signal intensity of that chromosome. That is, the actual signal intensity of chromosome i is S_i = (E_i - M) / SD.
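The calculation of step 110 can be sketched in Python as follows. The function names and dictionary layout are hypothetical. One point is an explicit assumption: the text calls SD the variance, whereas this sketch uses the population standard deviation (`statistics.pstdev`), the quantity the symbol SD conventionally denotes; `statistics.pvariance` can be substituted if the variance is literally intended.

```python
from statistics import mean, pstdev


def single_copy_intensity(actual_counts, copy_numbers):
    """Single-copy signal intensity E of one chromosome.

    actual_counts : dict k-mer -> actual number of occurrences Ci
    copy_numbers  : dict k-mer -> copy number Fi
    E is the mean of the adjusted occurrence counts Ci / Fi.
    """
    adjusted = [actual_counts[kmer] / copy_numbers[kmer] for kmer in copy_numbers]
    return mean(adjusted)


def actual_signal_intensities(single_copy):
    """Actual signal intensity S_i = (E_i - M) / SD for every chromosome.

    single_copy : dict chromosome -> single-copy signal intensity E_i
    Assumption: SD is taken as the population standard deviation of all E_i.
    Raises ZeroDivisionError if all E_i are identical (SD == 0).
    """
    values = list(single_copy.values())
    m = mean(values)
    sd = pstdev(values)
    return {chrom: (e - m) / sd for chrom, e in single_copy.items()}
```

Under this reading, S_i is a standardized score: a chromosome whose single-copy intensity equals the mean M gets an actual signal intensity of 0.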

Step 112: Determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.

Each chromosome has its own standard confidence interval. The standard confidence interval is a standard signal intensity interval calculated in advance from a large number of samples. The standard signal intensity is calculated in the same way as the actual signal intensity; the difference is that the standard signal intensity is calculated from the data of standard test samples, which are samples confirmed to have no abnormal chromosome copy number, whereas the actual signal intensity is calculated from the data to be detected. When the actual signal intensity of a chromosome is within the standard confidence interval of the corresponding chromosome, it can be judged that the chromosome has no copy number abnormality; otherwise, it can be judged that the copy number of the chromosome is abnormal. Therefore, a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome can be determined as a chromosome with abnormal copy number. The actual signal intensity of each chromosome is compared with the standard confidence interval of that same chromosome. For example, the actual signal intensity of chromosome 1 is compared with the pre-established standard confidence interval of chromosome 1, and the actual signal intensity of chromosome 2 is compared with the pre-established standard confidence interval of chromosome 2.
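The per-chromosome comparison of step 112 can be illustrated with a short sketch. The function name and data layout are hypothetical, and treating the interval as closed (boundary values count as normal) is an assumption, since the text does not say how boundary values are handled.

```python
def abnormal_chromosomes(actual, intervals):
    """Return the chromosomes whose actual signal intensity lies outside
    the standard confidence interval of the same chromosome.

    actual    : dict chromosome -> actual signal intensity S
    intervals : dict chromosome -> (LB, UB) standard confidence interval
    """
    flagged = []
    for chrom, s in actual.items():
        lb, ub = intervals[chrom]
        # Assumption: the interval is closed, so S == LB or S == UB is normal.
        if not (lb <= s <= ub):
            flagged.append(chrom)
    return flagged
```

Each chromosome is looked up against its own interval, so a value that is normal for one chromosome can still be flagged on another.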

After the target species corresponding to the data to be detected is determined and the specific k-mers corresponding to each chromosome of the target species are obtained, the actual signal intensity of each chromosome is calculated from the actual number of occurrences of the specific k-mers in the data to be detected and the copy number of each specific k-mer. The actual signal intensity of each chromosome can then be compared with the standard confidence interval of the corresponding chromosome, and a chromosome whose actual signal intensity falls outside that interval is determined to be a chromosome with abnormal copy number. In this method, the data is compared against the feature target sequences of each chromosome of the target species, that is, the specific k-mers, which are only a part of the whole genome of the target species. Comparing against the specific k-mers therefore reduces the comparison space, which shortens the analysis time and improves detection efficiency.

In one embodiment, the specific k-mer refers to the k-mer in a chromosome whose appearance frequency in the genome occurrence number index table corresponding to the chromosome meets a preset error condition.

The feature target sequence set corresponding to each chromosome includes the specific k-mers in that chromosome that satisfy the preset specificity condition. More precisely, a k-mer satisfies the preset specificity condition when its number of occurrences in the genome occurrence number index table corresponding to the chromosome meets a preset error condition. The preset error condition is set in advance by a technician according to actual project requirements. It can be a range, that is, a k-mer selected as specific is allowed a certain error rather than having to satisfy a strict exact condition.

For each chromosome there is a corresponding genome occurrence number index table. From this table, the number of genomes in which each k-mer contained in the chromosome has appeared can be obtained. The k-mers whose number of occurrences in the table meets the preset error condition can then be selected and used as the specific k-mers.

Because a certain degree of error is allowed when selecting specific k-mers, specific sequences representing the chromosome can be found with high probability within that error range. As a result, only the specific sequences, not the whole genome sequence, are needed when determining which chromosomes the sequencing data contains. This reduces the sequence comparison space when processing real data to be detected, thereby reducing analysis time and improving detection efficiency.

In one embodiment, before step 102, the method further includes the following steps: generating a genome occurrence number index table corresponding to each chromosome, where the table records, for each k-mer, the number of genomes of the corresponding chromosome that contain the k-mer; and storing the genome occurrence number index table in the feature target sequence set corresponding to the chromosome.

The genome is all the genetic information in an organism, stored in the form of nucleotide sequences. The sum of the genetic material in one complete individual unit of an organism (such as an individual animal or plant, an animal or plant cell, or an individual bacterium) is its genome. Generally, an individual's complete genome can contain multiple chromosomes, and each chromosome can contain multiple k-mers. The term "chromosome genome" commonly used in the art is used here to refer to the sum of all sequences contained in one complete chromosome. Under this definition, the genome occurrence number index table corresponding to each chromosome records, for each k-mer, the number of genomes of that chromosome in which the k-mer appears.

Therefore, what the genome occurrence number index table actually records is the number of genomes of the corresponding chromosome in which each k-mer has appeared. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. Once the number of genomes in which each k-mer has appeared is known, the genome occurrence number index table corresponding to each chromosome can be established. If there are M chromosomes in total, M corresponding genome occurrence number index tables are generated.
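A genome occurrence number index table for one chromosome could be built along the following lines. This is a minimal sketch that assumes each "chromosome genome" is available as a plain string; the function name is hypothetical. Each k-mer is counted at most once per genome, as described above.

```python
from collections import Counter


def genome_occurrence_index(chromosome_genomes, k):
    """Build the genome occurrence number index table for one chromosome.

    chromosome_genomes : list of sequences, one per genome, all of the
                         same chromosome (hypothetical input layout)
    k                  : k-mer length
    Returns Counter(k-mer -> number of genomes containing that k-mer).
    """
    index = Counter()
    for sequence in chromosome_genomes:
        # A set removes repeats, so a k-mer that occurs several times
        # in the same genome contributes only 1 to the table.
        seen = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        index.update(seen)
    return index
```

Calling this function once per chromosome yields M tables for M chromosomes.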

After the genome occurrence number index table corresponding to each chromosome is established, it can be stored in the feature target sequence set corresponding to that chromosome, that is, in the target database. When needed, data can then be retrieved from the genome occurrence number index table in the target database, which improves detection efficiency.

In one embodiment, as shown in FIG. 2, before step 102, the method further includes the following steps:

Step 100: Select a k-mer that satisfies a preset specific condition from the k-mers corresponding to each chromosome.

Step 101: Store a k-mer that satisfies a preset specific condition into a feature target sequence set corresponding to each chromosome.

The target database stores a feature target sequence set corresponding to each chromosome, and each feature target sequence set includes the specific k-mers corresponding to that chromosome. The specific k-mers are the k-mers selected from those contained in each chromosome that satisfy the preset specificity condition. Once the k-mers that satisfy the preset specificity condition, that is, the specific k-mers, are selected, they can be stored in the feature target sequence set corresponding to each chromosome. Because the feature target library is established in advance, the required specific k-mer data can be called directly when detecting whether a chromosome is abnormal, which improves detection efficiency.

In one embodiment, as shown in FIG. 3, before step 102, the method further includes the following steps:

Step 302: For each specific k-mer of each chromosome of the target species stored in the target database, obtain its number of occurrences C in the corresponding chromosome, and take the number of occurrences of the specific k-mer that occurs least often in that chromosome as the minimum number of occurrences Cm.

Step 304: Use the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer.

Step 306: Generate a specific k-mer copy number list corresponding to each chromosome according to the copy number of the specific k-mer included in each chromosome.

Step 308: Store the specific k-mer copy number list into the target database.

The above step 108 includes: obtaining the copy number of each specific k-mer according to the specific k-mer copy number list.

The target species contains multiple chromosomes, and each chromosome contains one or more specific k-mers. The number of occurrences C of each specific k-mer on its chromosome can be obtained, and the number of occurrences of the specific k-mer that occurs least often on the chromosome is taken as the minimum number of occurrences Cm.

For each specific k-mer, the ratio of its number of occurrences C to the number of occurrences of the least frequent specific k-mer on the chromosome is the copy number of that specific k-mer. After the numbers of occurrences of all specific k-mers contained in a chromosome are obtained, the copy number of each specific k-mer can be calculated to generate the specific k-mer copy number list corresponding to the chromosome. Each specific k-mer copy number list can be stored in the feature target sequence set corresponding to the chromosome, so that the list can be called directly to obtain the relevant data when needed, which improves detection efficiency.

When the copy number of each specific k-mer is needed, the specific k-mer copy number list corresponding to the chromosome to which the specific k-mer belongs can be obtained first, and the copy number of each specific k-mer read from that list. Taking the copy number list of the specific k-mers of chromosome X shown in FIG. 4 as an example, assume that chromosome X contains N specific k-mers whose numbers of occurrences in chromosome X are C1, C2, ..., Cn. The number of occurrences of the specific k-mer that appears least often in chromosome X is recorded as Cm. The copy numbers F of the N specific k-mers are then F1 = C1 / Cm, F2 = C2 / Cm, ..., Fn = Cn / Cm. The copy number of the specific k-mer with the fewest occurrences is Cm / Cm, that is, 1.
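Steps 302 to 306 amount to dividing every occurrence count by the smallest one. A minimal sketch, with a hypothetical function name and made-up counts, assuming every specific k-mer occurs at least once on its chromosome so that Cm is non-zero:

```python
def copy_number_list(occurrences):
    """Build the specific k-mer copy number list for one chromosome.

    occurrences : dict specific k-mer -> number of occurrences C on the chromosome
    Returns dict specific k-mer -> copy number F = C / Cm,
    where Cm is the smallest occurrence count on that chromosome.
    """
    cm = min(occurrences.values())  # minimum number of occurrences Cm
    return {kmer: c / cm for kmer, c in occurrences.items()}


# Illustrative counts only; they are not taken from the application.
example = copy_number_list({"AAA": 2, "CCC": 4, "GGG": 6})
```

The least frequent specific k-mer always receives copy number 1, matching the Cm / Cm case described above.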

In one embodiment, as shown in FIG. 5, the above step 110 includes:

Step 502: Calculate the ratio of the actual number of occurrences of each specific k-mer to its copy number.

Step 504: Calculate the average of the ratios of the actual number of occurrences to the copy number over all specific k-mers in each chromosome, as the single-copy signal intensity of the corresponding chromosome.

Step 506: Calculate the actual signal intensity of the corresponding chromosome according to the single-copy signal intensity of each chromosome.

Once the actual number of occurrences of each specific k-mer in the data to be detected and the copy number of each specific k-mer are obtained, the ratio of the actual number of occurrences to the copy number can be calculated for each specific k-mer. Each chromosome can contain multiple specific k-mers, so this ratio can be calculated for all specific k-mers contained in a chromosome and the ratios averaged. Each chromosome therefore has its own average ratio of actual number of occurrences to copy number, and this average is the single-copy signal intensity of that chromosome. The actual signal intensity corresponding to each chromosome can then be calculated from the single-copy signal intensities of the chromosomes.

In one embodiment, the actual signal intensity of the corresponding chromosome is calculated according to the following formula:

Actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M) / SD, where M is the average of the single-copy signal intensities of all chromosomes, and SD is the variance of the single-copy signal intensities of all chromosomes.

Once the single-copy signal intensity of each chromosome is obtained, the average value M and the variance SD of the single-copy signal intensities of all chromosomes can be calculated. The actual signal intensity of each chromosome is the difference between the single-copy signal intensity of the chromosome and the average value M, divided by the variance SD. That is, actual signal intensity of each chromosome = (single-copy signal intensity of the chromosome - M) / SD.
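Written out symbolically, the chain from counts to actual signal intensity can be summarized as below. The indices are assumptions introduced for this summary: j runs over the K chromosomes, and i runs over the N_j specific k-mers of chromosome j, with C_i the actual number of occurrences and F_i the copy number of specific k-mer i. SD is left as the text defines it.

```latex
E_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \frac{C_i}{F_i},
\qquad
M = \frac{1}{K} \sum_{j=1}^{K} E_j,
\qquad
S_j = \frac{E_j - M}{SD}
```

Here E_j is the single-copy signal intensity and S_j the actual signal intensity of chromosome j.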

In one embodiment, as shown in FIG. 6, the method for detecting an abnormal chromosome copy number further includes the following steps:

Step 602: Obtain a preset number of standard test samples, and the standard test samples are samples confirmed to have no abnormal chromosome copy number.

Step 604: Obtain the actual number of occurrences of the specific k-mers of each chromosome in the sequencing data of each standard detection sample.

To detect whether a chromosome in the data to be detected has an abnormal copy number, a list of standard confidence intervals corresponding to each chromosome must be determined in advance. The chromosomes in the data to be detected can then be compared with this predetermined standard confidence interval list to determine whether any chromosome in the data to be detected has an abnormal copy number. To determine the standard confidence interval list, a preset number of standard detection samples must first be obtained. A standard detection sample is a sample confirmed to have no abnormal chromosome copy number. The preset number can be set by the technician, but it should meet the statistical requirement for a large sample; generally, the preset number should be greater than 30, or greater than 100. After the multiple standard detection samples are obtained, the actual number of occurrences of the specific k-mers of each chromosome in the sequencing data of each standard detection sample can be obtained.

Step 606: Obtain a copy number of each specific k-mer in each chromosome included in the standard detection sample from the target database.

Step 608: Obtain the standard signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer included in the standard detection sample.

The copy number of each specific k-mer refers to the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer with the least number of occurrences on the chromosome. After obtaining the copy number of each specific k-mer in each chromosome included in the standard test sample from the target database, the actual number of occurrences and copies of each specific k-mer included in the standard test sample can be obtained Number to get the standard signal intensity of the corresponding chromosome. The calculation method of the standard signal strength and the actual signal strength is the same, except that the standard signal strength is for the standard test samples, and the actual signal strength is for the data to be detected. After obtaining the standard signal intensity of each chromosome, a standard signal intensity record table can be established according to different genders. For example, if the target species is a human, a standard signal intensity record table of chromosomes in a standard test sample belonging to a male and a standard signal intensity record table of a chromosome in a standard test sample belonging to a female can be established.

FIG. 7A shows the standard signal intensity record table for chromosomes in normal male samples, and FIG. 7B shows the standard signal intensity record table for chromosomes in normal female samples. These two tables record the standard signal intensity of each chromosome included in the male samples and in the female samples, respectively. For example, as shown in FIG. 7A, the standard signal intensity of chromosome 1 in sample 1 is recorded as S¹₁, and that of chromosome 2 in sample 1 is recorded as S¹₂. The standard signal intensity of chromosome 1 in sample i is recorded as Sⁱ₁, and that of chromosome 2 in sample i is recorded as Sⁱ₂. The same recording method is used in FIG. 7B.

Step 610: According to the standard signal intensity of each chromosome in the multiple standard detection samples, determine the standard confidence interval corresponding to the chromosome at a preset confidence value.

Step 612: Obtain a list of standard confidence intervals corresponding to the chromosomes included in the target species according to the standard confidence intervals corresponding to each chromosome.

A confidence interval is an interval estimate of a population parameter. It is calculated from a random sample drawn from the population, and it contains the population parameter with a certain probability; this probability is called the confidence level. The preset confidence value P here refers to a confidence level set in advance by a technician. It is generally set to a value greater than 0.95 and may be arbitrarily close to, but not equal to, 1. The preset confidence value can be adjusted by a technician as needed in actual applications. For example, if the confidence level is set to 95%, P is 0.95, and if it is set to 99.9%, P is 0.999.

The two boundary values LB and UB of the standard signal intensity of a chromosome can be determined according to the preset confidence value, which gives the confidence interval corresponding to that value. LB is the minimum of the confidence interval, and UB is the maximum. The confidence interval obtained is therefore an interval of standard signal intensity. For each chromosome, the standard signal intensity interval corresponding to the preset confidence value can be obtained; this interval is the standard confidence interval of that chromosome. Since the target species contains multiple chromosomes, a list of standard confidence intervals can be obtained that contains the standard confidence interval of each chromosome of the target species. For example, if the preset confidence value P is set to 0.98, the standard signal intensity interval of each chromosome at a probability of 98% can be obtained.

In one embodiment, as shown in FIG. 8, the above step 610 includes:

Step 802: Obtain a standard signal intensity of each chromosome contained in each standard detection sample.

Step 804: Calculate the mean and variance of the standard signal strengths of the chromosomes included in all the standard detection samples according to the gender of the standard detection samples.

Step 806: For each gender, determine the standard confidence interval, at the preset confidence value, of each chromosome contained in the standard detection samples of that gender according to the mean and variance of the standard signal intensities in the multiple standard detection samples of that gender.

After obtaining the standard signal intensity of each chromosome contained in each standard detection sample, the mean and variance of the standard signal intensity of each chromosome can be calculated. Each chromosome refers to each numbered chromosome. For example, after obtaining the standard signal intensity of chromosome 1 of each standard test sample, the mean and variance of the standard signal intensity of chromosome 1 can be calculated. Similarly, the mean and variance of standard signal intensities of chromosomes 2, 3, ..., 22 and X, Y and other chromosomes can be calculated.

After the mean and variance of the standard signal intensity of each chromosome are calculated, the standard confidence interval of each chromosome at the preset confidence value, that is, the corresponding standard signal intensity interval, can be determined. For example, with humans as the target species, two tables can be created from the standard test samples of the two genders: a distribution table of the standard signal intensities of chromosomes in male samples at the preset confidence value P, and a distribution table of the standard signal intensities of chromosomes in female samples at the preset confidence value P.
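The text does not state how LB and UB are derived from the mean and variance. The following Python sketch is one possible reading, under the loudly stated assumption that the standard signal intensities of a chromosome are approximately normally distributed, so that a two-sided interval at confidence value P is the mean plus or minus a normal quantile times the standard deviation. The function names are hypothetical.

```python
from statistics import NormalDist, mean, variance


def standard_confidence_interval(signals, p=0.95):
    """Return (mean, variance, LB, UB) for one chromosome.

    `signals` holds the standard signal intensities of one chromosome across
    the standard samples of one gender. ASSUMPTION: the values are roughly
    normal, so the interval is mean +/- z * standard deviation.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    m = mean(signals)
    var = variance(signals)                # sample variance, needs >= 2 values
    z = NormalDist().inv_cdf((1 + p) / 2)  # two-sided standard normal quantile
    half_width = z * var ** 0.5
    return m, var, m - half_width, m + half_width


def confidence_interval_list(signals_by_chromosome, p=0.95):
    """Build a {chromosome: (LB, UB)} list for one gender."""
    table = {}
    for chrom, signals in signals_by_chromosome.items():
        _, _, lb, ub = standard_confidence_interval(signals, p)
        table[chrom] = (lb, ub)
    return table
```

A larger P yields a wider interval, so fewer chromosomes fall outside it; other interval constructions (for example, empirical quantiles) would fit the same description.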

FIG. 9A shows the distribution table of the standard signal intensities of chromosomes in normal male samples at the preset confidence value P. A normal male sample contains 22 autosomes and the X and Y sex chromosomes. M′ represents the mean of the standard signal intensity of the chromosomes, and SD′ represents the variance of the standard signal intensity of the chromosomes. LB represents the minimum of the confidence interval of each chromosome at the preset confidence value P, and UB represents the maximum; together, the minimum and maximum define the corresponding confidence interval. FIG. 9B shows the corresponding distribution table for normal female samples. FIG. 9A and FIG. 9B differ because the genomes of individuals of different sexes have different chromosomal compositions: the male samples of FIG. 9A include the X and Y sex chromosomes in addition to the 22 autosomes, whereas the female samples of FIG. 9B include 22 autosomes and two X sex chromosomes. The remaining data have the same meaning in both tables.

In one embodiment, the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby. Such peripheral blood samples include: a peripheral blood sample of a normal mother carrying a normal baby boy, a peripheral blood sample of a normal mother carrying a normal baby girl, a peripheral blood sample of a normal mother carrying normal twin boys, a peripheral blood sample of a normal mother carrying normal twin girls, and a peripheral blood sample of a normal mother carrying normal twins of one boy and one girl.

Peripheral blood is blood other than bone marrow. A normal mother is a mother whose chromosome copy number is not abnormal, and a normal baby is a baby whose chromosome copy number is also normal. The criteria for identifying a normal mother or a normal baby can also be adjusted by technicians according to the actual project. To establish the sample data, a large number of standard test samples can be obtained, and these can be peripheral blood samples of normal mothers carrying normal babies. Because the baby may be a boy or a girl, and the mother may also be carrying twins, the peripheral blood samples include those from normal mothers carrying a normal baby boy, normal mothers carrying a normal baby girl, normal mothers carrying normal twin boys, normal mothers carrying normal twin girls, and normal mothers carrying normal twins of one boy and one girl. In other cases, the standard test sample may also be a peripheral blood sample from a normal mother carrying more than two normal babies, for example a peripheral blood sample from a normal mother carrying normal triplets or normal quadruplets. The number of babies carried by the normal mother is not limited here; any peripheral blood sample of a normal mother carrying normal babies can be used as a standard test sample.

As shown in FIG. 10, the above step 610 includes the following steps:

Step 1002: Determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in peripheral blood samples of normal mothers carrying a normal baby boy.

Step 1004: Determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in peripheral blood samples of normal mothers carrying a normal baby girl.

Step 1006: Determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in peripheral blood samples of normal mothers carrying normal twin boys.

Step 1008: Determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in peripheral blood samples of normal mothers carrying normal twin girls.

Step 1010: Determine the standard confidence interval corresponding to each chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in peripheral blood samples of normal mothers carrying normal twins of one boy and one girl.

When the standard confidence interval of a chromosome at the preset confidence value is determined from the standard signal intensity of each chromosome, different types of standard test samples must be handled separately, because the standard signal intensities are determined from the chromosomes contained in each type of sample. Steps 1002 to 1010 therefore each determine the standard confidence interval at the preset confidence value from the standard signal intensities of the chromosomes contained in one type of standard test sample. For example, when the standard test samples are peripheral blood samples of normal mothers carrying a normal baby boy, the standard confidence interval of each chromosome at the preset confidence value is determined from the standard signal intensity of each chromosome contained in those samples. Likewise, when the standard test samples are peripheral blood samples of normal mothers carrying normal twins of one boy and one girl, the standard confidence interval of each chromosome at the preset confidence value is determined from the standard signal intensity of each chromosome contained in those samples.

In one embodiment, the above-mentioned step 112 includes: when it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of that chromosome, determining the chromosome corresponding to the actual signal intensity to be a chromosome with a copy number abnormality.

After a large number of standard test samples are obtained, the standard confidence interval of each chromosome of the target species can be calculated, and a list of standard confidence intervals can be obtained. The actual signal intensity of each chromosome contained in the target species to which the data to be detected belongs can then be compared with the previously obtained standard confidence interval of the corresponding chromosome. When it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of that chromosome, the chromosome is determined to be a chromosome with abnormal copy number. In the comparison, each chromosome is compared with its own standard confidence interval. For example, chromosome 1 of the target species to which the sequencing data belongs is compared with the pre-calculated standard confidence interval of chromosome 1, and chromosome 2 is compared with the pre-calculated standard confidence interval of chromosome 2. In this way, all chromosomes contained in the target species to which the sequencing data belongs are compared to determine whether any chromosome has an abnormal copy number.

Suppose the comparison concerns chromosome 1 of the target species to which the sequencing data belongs, and the pre-calculated standard confidence interval of chromosome 1 is (LB1, UB1). It is then determined whether the actual signal intensity of chromosome 1 contained in the sample to be tested lies within the interval (LB1, UB1). If it does not, the copy number of chromosome 1 is abnormal; if it does, chromosome 1 is normal and has no copy number abnormality.
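The per-chromosome comparison described above can be sketched in Python as follows. The function name, the dictionary layout, and the numeric values are illustrative assumptions; the interval is treated as open, following the (LB1, UB1) notation.

```python
def abnormal_chromosomes(actual_signal, intervals):
    """Return the chromosomes whose actual signal intensity lies outside
    their standard confidence interval (LB, UB).

    `actual_signal` maps chromosome name -> actual signal intensity.
    `intervals` maps chromosome name -> (LB, UB). Both layouts are
    hypothetical and chosen only for this sketch.
    """
    flagged = []
    for chrom, signal in actual_signal.items():
        lb, ub = intervals[chrom]
        if not (lb < signal < ub):  # outside the interval: abnormal copy number
            flagged.append(chrom)
    return flagged


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    intervals = {"chr1": (0.9, 1.1), "chr21": (0.9, 1.1)}
    print(abnormal_chromosomes({"chr1": 1.0, "chr21": 1.6}, intervals))
```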

In one embodiment, as shown in FIG. 11, the method for detecting an abnormal chromosome copy number further includes the following steps:

Step 1102: Determine a standard confidence interval list of a chromosome corresponding to each gender according to the gender of the target species.

Step 1104: Obtain the gender of the sample to be tested.

Step 1106: Compare the actual signal intensity of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list for the corresponding gender of the target species.

Step 1108: When it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome for the corresponding gender, determine the chromosome corresponding to the actual signal intensity to be a chromosome with abnormal copy number.

The target database stores lists of standard confidence intervals, created by gender, for the chromosomes included in the samples. Taking humans as an example, the target database stores a distribution table of the standard signal intensities of chromosomes in normal male samples at the preset confidence value P, which records the standard confidence interval, at the preset confidence value, of each chromosome contained in normal male samples. The target database likewise stores a distribution table of the standard signal intensities of chromosomes in normal female samples at the preset confidence value P, which records the standard confidence interval, at the preset confidence value, of each chromosome contained in normal female samples.

Classifying the target species by gender means dividing it into groups according to gender. For example, when the target species is human, it is divided into male and female. The standard confidence interval of each chromosome can then be determined for each gender. After the target species is classified by gender, the chromosomes contained in individuals of each gender are known, and the standard confidence interval of each of those chromosomes can be obtained. For example, since a female of the target species has 22 autosomes and two X sex chromosomes, the distribution table of the standard signal intensities of chromosomes in normal female samples at the preset confidence value P can be retrieved from the target database, and the standard confidence intervals of the 22 autosomes and the X chromosome can be read from it. That is, when the sample to be tested comes from a female, the distribution table corresponding to females is used, and the actual signal intensity of each chromosome of the female sample is compared with the standard confidence interval of that chromosome in the female standard confidence interval list. When it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome for the corresponding gender, the chromosome is determined to be a chromosome with abnormal copy number.
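A minimal Python sketch of the gender-specific comparison in steps 1102 to 1108 follows. The nested dictionary standing in for the stored distribution tables, the function name, and all numeric values are assumptions for illustration, not data from the figures.

```python
def abnormal_chromosomes_by_sex(sample_sex, actual_signal, interval_lists):
    """Compare each chromosome against the interval list of the sample's sex.

    `interval_lists` maps a sex label to {chromosome: (LB, UB)}; this layout
    is a hypothetical stand-in for the tables stored in the target database.
    """
    intervals = interval_lists[sample_sex]  # select the list for this sex
    flagged = []
    for chrom, signal in actual_signal.items():
        lb, ub = intervals[chrom]
        if not (lb < signal < ub):
            flagged.append(chrom)
    return flagged


if __name__ == "__main__":
    # Made-up intervals: the same X signal can be normal for one sex
    # and abnormal for the other.
    lists = {
        "male": {"chr1": (0.9, 1.1), "chrX": (0.4, 0.6)},
        "female": {"chr1": (0.9, 1.1), "chrX": (0.9, 1.1)},
    }
    sample = {"chr1": 1.0, "chrX": 0.5}
    print(abnormal_chromosomes_by_sex("male", sample, lists))
    print(abnormal_chromosomes_by_sex("female", sample, lists))
```

Keeping one interval list per gender means the same X chromosome signal is judged against the expectation for that gender rather than against a pooled reference.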

In one embodiment, as shown in FIG. 12, before step 102, the method further includes the following steps:

Step 1202: Obtain multiple chromosomes included in the target species.

Step 1204: Classify and organize the multiple chromosomes included in the target species.

Step 1206: Obtain a pre-selected high-confidence genome that meets a preset reliability condition.

Step 1208: Determine a high-confidence genome corresponding to each chromosome contained in the target species.

The target species is the species from which the test sample is derived. For example, when abnormal copy numbers of human chromosomes are to be detected, human is the target species. The target species can be human or a species other than human. The genomic data of target and non-target species can be derived from the NCBI RefSeq data set (the reference sequence database of the National Center for Biotechnology Information, which provides biologically meaningful, non-redundant gene and protein sequences) or from other public or private genomes. The genomes of all target and non-target species are integrated into a complete set.

An individual's complete genome contains multiple chromosomes. Therefore, after the genomes of different individuals of the target species are obtained, the multiple chromosomes contained in the target species can be obtained. Multiple genomes of the target species may be collected, that is, genomes from different individuals or populations of the same target species. Taking humans as the target species, the collected genomes may include genomes from European, North American Indian, and Chinese Han populations. Each chromosome of the target species may therefore include sequences of that chromosome from different genomes. For example, human chromosome 1 can include chromosome 1 of the European genome, chromosome 1 of the North American Indian genome, and chromosome 1 of the Chinese Han genome. The data for each identical chromosome of the target species are grouped together to form the sequence data set of that chromosome.

Then, from the sequence data set of each chromosome, the pre-selected genomes that meet the preset credibility condition are obtained; that is, high-confidence genomes are selected, and the high-confidence genome corresponding to each chromosome of the target species can be determined. A high-confidence genome is a genome that satisfies the preset credibility condition. The order of these operations can also be changed: a large number of genomes can first be collected from the NCBI and screened, and those that meet the preset credibility condition are selected as high-confidence genomes. The high-confidence sequence data set of each chromosome of each target species is then determined by grouping together the data for each identical chromosome from all high-confidence genomes of that target species.

In one embodiment, the preset credibility condition is satisfied in any of the following cases: the proportion of non-deterministic characters contained in the chromosome sequence is lower than a preset proportion threshold; the number of sequence fragments belonging to the same chromosome in the chromosome sequence is lower than a preset fragment threshold; or, when a chromosome sequence is compared with all other chromosome sequences whose genetic distance from it falls within a preset genetic distance threshold range, the average full-sequence coverage percentage of the chromosome sequence in those similar chromosome sequences is higher than a preset percentage value.

For a DNA genome, the proportion of non-deterministic characters refers to the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that piece of data is suspected to be a low-confidence genome. For DNA or RNA sequences, non-deterministic characters are characters other than A, C, G, T, and U. For protein sequences, non-deterministic characters are characters other than the definite amino acid characters.

When the proportion of non-deterministic characters contained in a genomic sequence is lower than the preset proportion threshold, the genome can be considered to satisfy the preset credibility condition. The number of sequence fragments that make up a complete chromosome is also considered: if too many fragments belong to the same chromosome, the genome sequence is suspected to be a low-confidence genome. That is, when the number of sequence fragments belonging to the same chromosome in a genomic sequence is lower than the preset fragment threshold, the genomic sequence data can also be considered to satisfy the preset credibility condition. Finally, a genomic sequence is compared with all other genomic sequences whose genetic distance from it falls within the preset genetic distance threshold range, to determine the average full-sequence coverage percentage of the genomic sequence in those similar genomic sequences. When the average coverage percentage is higher than the preset percentage value, the genome can be considered to meet the preset credibility condition. Genetic distance is an index that measures the overall genetic difference between species (or individuals).
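The three credibility checks can be sketched in Python as below. The text gives no concrete thresholds, so the default values, the function names, and the choice to report each check separately rather than combine them are all assumptions made for illustration. The fragment count and average coverage are taken as precomputed inputs, since the alignment step is not specified here.

```python
def non_deterministic_fraction(sequence, alphabet="ACGT"):
    """Fraction of characters in `sequence` that are not in `alphabet`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    allowed = set(alphabet)
    bad = sum(1 for ch in sequence.upper() if ch not in allowed)
    return bad / len(sequence)


def credibility_checks(sequence, fragment_count, mean_coverage,
                       max_fraction=0.01, max_fragments=100, min_coverage=0.9):
    """Evaluate the three conditions; all thresholds are hypothetical."""
    return {
        # proportion of non-deterministic characters below the threshold
        "non_deterministic": non_deterministic_fraction(sequence) < max_fraction,
        # number of fragments of the same chromosome below the threshold
        "fragments": fragment_count < max_fragments,
        # average full-sequence coverage above the preset percentage
        "coverage": mean_coverage > min_coverage,
    }


if __name__ == "__main__":
    print(non_deterministic_fraction("ACGTNNAC"))
    print(credibility_checks("ACGTNNAC", fragment_count=5, mean_coverage=0.95))
```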

In one embodiment, each specific k-mer satisfies the following two conditions: its number of occurrences in the genome occurrence number index table corresponding to the chromosome meets a first preset error condition; and its number of occurrences in the genome occurrence number index table corresponding to the chromosome, together with its number of occurrences in the genome occurrence number index table of the complete set, meets a second preset error condition. The genome occurrence number index table of a chromosome records, for each k-mer, the number of genomes that contain the k-mer among the genomes included in that chromosome; the genome occurrence number index table of the complete set records, for each k-mer included in the chromosomes of the target species, the number of genomes that contain the k-mer among the genomes included in the complete set.

In the target database, each chromosome has its own characteristic target sequence set, and the specific k-mers included in that set are k-mers that satisfy a preset specificity condition. The preset specificity condition includes a first preset error condition and a second preset error condition. When a k-mer satisfies both conditions at the same time, it is considered to meet the preset specificity condition and is taken as a specific k-mer. Specifically, the number of occurrences of the k-mer in the genome occurrence number index table corresponding to the chromosome must satisfy the first preset error condition, and the number of occurrences of the k-mer in the genome occurrence number index table corresponding to the chromosome, together with its number of occurrences in the genome occurrence number index table of the complete set, must satisfy the second preset error condition. The complete set is the collection of all high-confidence genomes collected. It contains the high-confidence genomes of each target species as well as those of non-target species, such as the high-confidence genomes of pathogenic bacteria, symbiotic bacteria, probiotics, humans, animals, and plants.

The genome occurrence number index table of a chromosome records, for each k-mer, the number of genomes that contain the k-mer among the genomes corresponding to that chromosome. The count recorded for each k-mer in the genome occurrence number index table of the complete set represents the number of genomes in the complete set in which the k-mer appears. If a k-mer appears multiple times in the same genome, it is counted only once.

The selection of specific k-mers involves two parameters, the first preset error condition and the second preset error condition, which allow a specific k-mer to be non-specific within a certain range. Without these two parameters, no non-specificity could be tolerated, and it would often be difficult to find any specific k-mer for a chromosome. By selecting specific k-mers that allow a certain amount of error, and establishing the characteristic target sequence set from them, specific targets that can represent the chromosome can be found with high probability. Therefore, when the chromosomes contained in the data to be detected are determined, the data only need to be compared with the characteristic target sequence set of each chromosome of the corresponding target species, which reduces the comparison space, shortens the analysis time, and improves detection efficiency.

In one of the embodiments, the first preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table corresponding to each chromosome to the number of genomes contained in the corresponding chromosome and the first threshold is greater than or equal to 1.

The first preset error condition means that the sum of two terms is greater than or equal to 1: the ratio of the number of occurrences recorded in the genome occurrence number index table corresponding to the chromosome to the number of genomes corresponding to the chromosome, and the first threshold. Suppose the chromosome has N corresponding genomes, the number of occurrences of a certain k-mer in the genome occurrence number index table corresponding to the chromosome is C1, and the first threshold is P1. The first preset error condition is then C1/N + P1 ≥ 1. The first threshold P1 represents an acceptable error probability and can be any value between 0 and 1. It can be set by a technician according to the actual project.
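As a small worked example of the first preset error condition C1/N + P1 ≥ 1, the following Python sketch evaluates it with exact rational arithmetic so that boundary cases are not affected by floating-point rounding. The function name and the use of `fractions.Fraction` are choices made for this sketch only.

```python
from fractions import Fraction


def meets_first_condition(c1, n, p1):
    """Check C1/N + P1 >= 1.

    c1: number of genomes of the chromosome that contain the k-mer
    n:  number of genomes corresponding to the chromosome
    p1: first threshold (acceptable error probability, between 0 and 1)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return Fraction(c1, n) + Fraction(str(p1)) >= 1


if __name__ == "__main__":
    # With P1 = 0.05, a k-mer must appear in at least 95 of 100 genomes.
    print(meets_first_condition(95, 100, 0.05))
    print(meets_first_condition(94, 100, 0.05))
```

Intuitively, the condition requires the k-mer to be present in nearly all genomes of the chromosome, with P1 bounding the fraction allowed to lack it.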

In one of these embodiments, the first threshold is less than 5%.

The first threshold is an acceptable error probability. The first threshold may be any value between 0 and 1. The first threshold may be set to a value less than 5%.

In one embodiment, the second preset error condition is: the sum of the ratio of the number of occurrences in the genome occurrence number index table corresponding to each chromosome to the number of occurrences in the genome occurrence number index table of the complete set and the second threshold is greater than or equal to 1.

The second preset error condition means that the sum of two terms is greater than or equal to 1: the ratio of the number of occurrences recorded in the genome occurrence number index table corresponding to the chromosome to the number of occurrences in the genome occurrence number index table of the complete set, and the second threshold. Suppose the number of occurrences of a k-mer in the genome occurrence number index table corresponding to the chromosome is C1, the number of occurrences of the k-mer in the genome occurrence number index table of the complete set is C2, and the second threshold is P2. The second preset error condition is then C1/C2 + P2 ≥ 1. Like the first threshold, the second threshold represents an acceptable error probability and can be any value between 0 and 1. The second threshold P2 can also be set by a technician according to the actual project.
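The second preset error condition, and its combination with the first condition C1/N + P1 ≥ 1 into a single specificity test, can be sketched as follows. The function names are hypothetical, and exact rational arithmetic is used only to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction


def meets_second_condition(c1, c2, p2):
    """Check C1/C2 + P2 >= 1.

    c1: genomes of the chromosome that contain the k-mer
    c2: genomes of the complete set that contain the k-mer
    p2: second threshold (acceptable error probability)
    """
    if c2 <= 0:
        raise ValueError("c2 must be positive")
    return Fraction(c1, c2) + Fraction(str(p2)) >= 1


def is_specific_kmer(c1, n, c2, p1, p2):
    """A k-mer is specific when both preset error conditions hold."""
    first = Fraction(c1, n) + Fraction(str(p1)) >= 1  # C1/N + P1 >= 1
    return first and meets_second_condition(c1, c2, p2)


if __name__ == "__main__":
    # Illustrative counts: present in 98 of 100 chromosome genomes,
    # and in 100 genomes of the complete set overall.
    print(is_specific_kmer(c1=98, n=100, c2=100, p1=0.05, p2=0.05))
```

The second condition limits how often the k-mer appears in genomes outside the chromosome: when C2 is much larger than C1, the k-mer is widespread elsewhere and is rejected.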

In one of these embodiments, the second threshold is less than 5%.

The second threshold value is the same as the first threshold value, which means an acceptable error probability. The second threshold value can also be any value between 0 and 1, and the second threshold value can be set to a value less than 5%. The first threshold and the second threshold may be equal or different.

In one embodiment, before step 102, the method further includes the following steps: generating a genome occurrence number index table corresponding to each chromosome, where the table records, for each k-mer, the number of genomes that contain the k-mer among the genomes corresponding to that chromosome; and storing the genome occurrence number index table in the characteristic target sequence set corresponding to the chromosome.

The genome is all the genetic information in an organism. This genetic information is stored in the form of nucleotide sequences. The sum of the genetic material in one complete unit of an organism (such as an individual animal or plant, an animal or plant cell, or an individual bacterium) is the genome. Each individual's complete genome can contain multiple chromosomes, and the genome of each chromosome can contain multiple k-mers. The term "chromosome genome", commonly used in the art, is used here to refer to the sum of all sequences contained in a complete chromosome. According to this concept, the genome occurrence number index table corresponding to each chromosome records, for each k-mer, the number of genomes that contain the k-mer among the genomes corresponding to that chromosome.

Therefore, what the genome occurrence number index table actually records is the number of genomes of the corresponding chromosome in which each k-mer has appeared. If a k-mer occurs more than once in the same genome, it is still counted only once in the table. After the number of genomes in which each k-mer appears has been obtained for the corresponding chromosome, the genome occurrence number index table of each chromosome can be established. If there are M chromosomes in total, M genome occurrence number index tables are generated. After the table for each chromosome is established, it can be stored in the characteristic target sequence set corresponding to that chromosome, that is, in the target database. When needed, the data can then be retrieved from the genome occurrence number index table in the target database, which improves detection efficiency.
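The counting rule described above, in which each genome contributes at most one count per k-mer, can be illustrated with a short Python sketch. Representing each chromosome genome as a plain string, and the function names used, are assumptions for illustration; real chromosome sequences would call for a more memory-efficient approach.

```python
from collections import Counter


def distinct_kmers(sequence, k):
    """Return the set of distinct k-mers in one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def genome_occurrence_index(genomes, k):
    """Build a genome occurrence number index table for one chromosome.

    `genomes` is an iterable of sequences, one per genome of the chromosome.
    Each genome adds at most 1 to a k-mer's count, however many times the
    k-mer occurs within that genome.
    """
    table = Counter()
    for sequence in genomes:
        table.update(distinct_kmers(sequence, k))  # a set: one count per genome
    return table


if __name__ == "__main__":
    # Toy sequences: "ACG" occurs twice in the first genome but counts once.
    table = genome_occurrence_index(["ACGACG", "ACGTTT"], k=3)
    print(table["ACG"], table["CGA"], table["TTT"])
```

Applying the same function to every genome in the complete set would yield the complete-set table under the same counting rule.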

In one embodiment, before obtaining the sequencing data of the sample, the method further includes: generating a genome occurrence index table of the complete set, which records, for each k-mer, the number of genomes in the complete set that contain the k-mer; and storing the genome occurrence index table of the complete set in the target database.

The target database stores a characteristic target sequence set corresponding to each chromosome. The complete set contains all of the high-confidence genomes collected, that is, both the high-confidence genomes of the target species corresponding to the data to be detected and the high-confidence genomes of multiple species other than that target species. After determining how many genomes in the complete set contain each k-mer of each chromosome, the genome occurrence index table of the complete set can be generated. This table records, for each k-mer contained in each chromosome, the number of genomes in the complete set that contain the k-mer.

Therefore, the genome occurrence index table of the complete set actually records how many genomes in the complete set contain each k-mer. What is counted is the number of genomes, not the number of k-mer occurrences: if a k-mer occurs more than once in the same genome, it is still counted only once in the genome occurrence index table of the complete set. The genome occurrence index table of the complete set differs from the genome occurrence index table corresponding to each chromosome. Each chromosome has its own genome occurrence index table, whereas only one genome occurrence index table is generated for the complete set, covering all of the data. After the genome occurrence index table of the complete set has been stored, it can be retrieved from the target database whenever it is needed while analyzing the data to be detected, thereby improving detection efficiency.

In one embodiment, after step 106 described above, the method further includes: generating a specific k-mer actual occurrence frequency record table corresponding to the chromosome according to the actual occurrence number.

The target database stores the specific k-mers contained in each chromosome. After the data to be detected is obtained, it can be compared with the specific k-mers of each chromosome to obtain the actual number of times each specific k-mer appears in the data to be detected. From these counts, a record table of the actual occurrences of the specific k-mers can be generated for each chromosome. If there are M chromosomes in total in the target database, M corresponding record tables are generated, each recording the actual number of occurrences in the sequencing data of the specific k-mers contained in the corresponding chromosome.

As shown in FIG. 13, in the record table of actual occurrences of the specific k-mers of a particular chromosome, the leftmost column records the specific k-mers contained in chromosome X, and the second column records the actual number of occurrences of the corresponding specific k-mer in the sequencing data, namely C₁, C₂, .... A record table of actual occurrences is generated from the actual number of occurrences of the specific k-mers in the sequencing data, and the data is stored for subsequent retrieval, thereby improving detection efficiency.

In one embodiment, as shown in FIG. 14, a method for detecting an abnormal chromosome copy number is provided, which includes the following steps:

Step 1402: A feature target sequence set corresponding to each chromosome is established.

As shown in FIG. 15, step 1402 includes:

Step 1402A: Collection and sorting of high-confidence genomes.

When establishing the characteristic target sequence set corresponding to each chromosome, high-confidence genomic data must first be collected and sorted. The high-confidence genomes can include both genomes of the target species corresponding to the data to be detected and genomes that do not belong to that target species, for example high-confidence genomes of commensal bacteria, probiotics, humans, animals, and plants. High-confidence genomes can be obtained from NCBI's RefSeq dataset or from other public or private collections of high-confidence genomes.

There are three ways to identify and screen high-confidence genomes:

1. Screen based on the proportion of non-deterministic characters contained in a piece of genomic data. For a DNA genome, for example, the proportion of non-deterministic characters is the proportion of non-ACGT characters it contains. If the proportion of non-ACGT characters in a piece of DNA genome data is too high, that piece of data is a suspected low-confidence genome. For DNA or RNA sequences, non-deterministic characters are characters other than A, C, G, T, and U. For protein sequences, non-deterministic characters are characters other than those that denote a definite amino acid.

2. Screen based on the number of genomic data fragments included in a complete chromosome. If there are too many fragments that belong to the same chromosome, then the genome is a suspected low-confidence genome.

3. Perform whole-genome sequence alignment among multiple genomes that are closely related (for example, whose genetic distance is less than a certain threshold) to determine the average whole-genome coverage percentage of each genome across its closely related genomes, and then screen on this average coverage percentage: genomes with a low average coverage percentage are suspected of being incomplete, that is, of low confidence. Genetic distance is an index that measures the overall genetic difference between species (or individuals).
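The first two screens can be sketched in a few lines of Python. The function names and the thresholds `max_non_acgt` and `max_fragments` are illustrative assumptions; the application does not specify numeric cut-offs, and the third screen (whole-genome alignment) is not shown.

```python
def non_acgt_fraction(seq):
    """Fraction of characters that are not A, C, G or T (case-insensitive)."""
    if not seq:
        return 0.0
    bad = sum(1 for ch in seq.upper() if ch not in "ACGT")
    return bad / len(seq)

def is_suspected_low_confidence(fragments, max_non_acgt=0.01, max_fragments=50):
    """Apply screens 1 and 2 to one chromosome given as a list of fragments.

    Both thresholds are assumed values chosen only for illustration.
    """
    joined = "".join(fragments)
    too_ambiguous = non_acgt_fraction(joined) > max_non_acgt   # screen 1
    too_fragmented = len(fragments) > max_fragments            # screen 2
    return too_ambiguous or too_fragmented
```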

Step 1402B: Determine a high-confidence sequence data set of each chromosome in the target species corresponding to the data to be detected.

Multiple genomes of the target species may have been collected in step 1402A, that is, genomes of different individuals or populations of the same target species. Taking humans as the target species, the collected genomes may include genomes of European, North American Indian, and Han Chinese origin. Therefore, each chromosome of the target species may be represented by sequences of that chromosome from several different genomes. For example, the data for human chromosome 1 can include chromosome 1 of European origin, chromosome 1 of North American Indian origin, and chromosome 1 of Han Chinese origin.

Here, the data for the same chromosome from all high-confidence genomes of the target species are put together, that is, a high-confidence sequence data set is assembled for each chromosome of the target species. The high-confidence sequence data sets of all chromosomes of the target species corresponding to the data to be detected are then brought together with the high-confidence sequence data sets of all non-target species to form the complete set.

After the target species corresponding to the data to be detected has been determined, the ratio of the copy numbers of the chromosomes of that species under normal circumstances is determined, distinguishing autosomes from sex chromosomes. As shown in FIG. 16, taking humans as an example, a normal human genome contains 23 pairs of chromosomes, 46 in total. Chromosomes 1 to 22 are autosomes, each with a copy number of two. The X and Y chromosomes are sex chromosomes: normal males have one X chromosome and one Y chromosome, while normal females have two X chromosomes and no Y chromosome. Copy number refers to the number of times a certain gene or specific DNA sequence occurs in a haploid genome. The information in FIG. 16 is generated only once, when the target species corresponding to the data to be detected is determined, and is then retrieved when analyzing each set of sample data to be detected.

In step 1402C, an index table of the number of occurrences of the genome of the complete set is generated.

Using the complete set, the genome occurrence index table of the complete set can be generated. This table records, for each k-mer in the complete set, how many genomes in the complete set contain it. A k-mer is a genomic sequence of length k. The value of k can be chosen freely and is generally set between 11 and 32. If the genomic data has a different deterministic characters (where a denotes the alphabet size), then for a specific k there are a total of a^k possible different k-mers.

For example, DNA genomic data has four deterministic characters, A, C, G, and T, so for a particular k there are 4^k possible different k-mers. A genome of length n can contain at most n-k+1 different k-mers. However, because a genome contains repeated regions, the number of different k-mers in a genome of n characters is generally much smaller than n-k+1. With an ordinary k-mer counting method, a k-mer that appears multiple times in a given genome is therefore counted multiple times. The genome occurrence index table of the complete set differs from that method: if a k-mer occurs more than once in a genome, it is still counted only once. Consequently, the count for a k-mer in the resulting genome occurrence index table represents the number of genomes in the complete set in which that k-mer has appeared.
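The three quantities in this paragraph (the 4^k possible k-mers, the n-k+1 upper bound, and the smaller number actually observed in a repetitive sequence) can be checked on a toy example. The sequence and the small value of k below are assumptions chosen for readability; the application suggests k between 11 and 32.

```python
def distinct_kmers(seq, k):
    """Return the set of different k-mers that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

seq = "ACACACACAC"                 # hypothetical repetitive genome, n = 10
k = 3                              # toy value, far below the suggested 11 to 32

possible = 4 ** k                  # number of possible DNA k-mers
upper_bound = len(seq) - k + 1     # most k-mers a genome of length n can hold
observed = len(distinct_kmers(seq, k))   # only ACA and CAC occur here
```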

If DNA or RNA genomic sequences are used, then because of the reverse complementarity of nucleic acid sequences, whenever a k-mer A appears, its reverse complementary sequence A' should also be considered to have appeared, so both A and A' are recorded in the table. In the subsequent steps, for k-mers of DNA or RNA sequences, whenever an operation on a k-mer A is mentioned, the same operation is by default also applied to its reverse complementary sequence A'.
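A minimal sketch of the reverse-complement rule for DNA follows. The helper names are illustrative assumptions, and only the four characters A, C, G, and T are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complementary sequence A' of a DNA k-mer A."""
    return kmer.translate(_COMPLEMENT)[::-1]

def with_reverse_complements(kmers):
    """Return the k-mers together with their reverse complements."""
    out = set()
    for kmer in kmers:
        out.add(kmer)
        out.add(reverse_complement(kmer))
    return out
```

A k-mer that is its own reverse complement, such as `ACGT`, yields a single entry.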

Moreover, each chromosome of the target species can be treated here as a species in its own right; that is, each individual sequence in the high-confidence data set of a chromosome of the target species that completely represents that chromosome is regarded as a single genome. For example, if humans are the target species, the high-confidence data set of human chromosome 1 may contain three pieces of data, namely the chromosome 1 sequences of European, North American Indian, and Han Chinese origin. Each of these three chromosome 1 sequences is then regarded as a complete, independent genome when counting for the k-mer genome occurrence index table.

In step 1402D, an index table of the number of occurrences of the genome corresponding to each chromosome is generated.

The genome occurrence index table of a chromosome differs from the genome occurrence index table of the complete set in step 1402C. The table for the complete set records how many genomes in the complete set contain a given k-mer, whereas the table corresponding to a chromosome records, for each k-mer contained in that chromosome, how many of the genomes corresponding to that chromosome contain it.

Step 1402E: Generate a specific k-mer table corresponding to each chromosome.

The specific k-mer table corresponding to each chromosome records the k-mers in that chromosome that satisfy the preset specificity conditions, that is, the specific k-mers. A specific k-mer must meet the following two conditions:

1. Suppose the high-confidence data set of the chromosome contains N genomes and the count of a certain k-mer in the genome occurrence index table corresponding to the chromosome is C₁. The condition C₁/N + P₁ ≥ 1 must then be satisfied; that is, the ratio of the k-mer's count in the chromosome's genome occurrence index table to the number of genomes in the chromosome's high-confidence data set, plus the first threshold, must be greater than or equal to 1. The first threshold P₁ is usually less than 5%.

2. Suppose the count of a k-mer in the genome occurrence index table corresponding to the chromosome is C₁ and its count in the genome occurrence index table of the complete set is C₂. The condition C₁/C₂ + P₂ ≥ 1 must then be satisfied; that is, the ratio of the k-mer's count in the chromosome's genome occurrence index table to its count in the genome occurrence index table of the complete set, plus the second threshold, must be greater than or equal to 1. The second threshold P₂ is usually less than 5%.

The first threshold P₁ and the second threshold P₂ may be equal or different. Introducing these two parameters when selecting specific k-mers allows an error rate within a certain range, that is, it tolerates a limited degree of non-specificity in the specific k-mers. Without these two parameters, no non-specificity could be tolerated, and it would often be difficult to find any specific k-mer for a given chromosome.
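The two conditions can be applied as a simple filter over the two index tables. The following is a sketch under stated assumptions: the function name, the dictionary layout of the tables, the toy counts, and the threshold values of 0.04 are all illustrative and not taken from the application.

```python
def select_specific_kmers(chrom_index, full_index, n_genomes, p1, p2):
    """Keep k-mers with C1/N + P1 >= 1 and C1/C2 + P2 >= 1.

    chrom_index: k-mer -> C1, count in the chromosome's genome occurrence table
    full_index:  k-mer -> C2, count in the complete set's genome occurrence table
    n_genomes:   N, number of genomes in the chromosome's high-confidence set
    """
    specific = []
    for kmer, c1 in chrom_index.items():
        c2 = full_index[kmer]
        present_in_nearly_all = c1 / n_genomes + p1 >= 1   # condition 1
        rare_elsewhere = c1 / c2 + p2 >= 1                 # condition 2
        if present_in_nearly_all and rare_elsewhere:
            specific.append(kmer)
    return sorted(specific)

# Hypothetical counts for a chromosome whose data set holds N = 3 genomes.
chrom_index = {"AAAA": 3, "CCCC": 3, "GGGG": 2}
full_index = {"AAAA": 3, "CCCC": 7, "GGGG": 2}
selected = select_specific_kmers(chrom_index, full_index, 3, p1=0.04, p2=0.04)
```

Here `CCCC` fails condition 2 because it also appears in four genomes outside the chromosome, and `GGGG` fails condition 1 because it is missing from one of the three genomes.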

For a chromosome for which n specific k-mers are found, assuming that the exceptions allowed by P₁ in condition (1) of this step are randomly distributed among the genomes corresponding to the chromosome, the probability of a false negative for the chromosome is less than or equal to P₁ⁿ. For sufficiently large n, this probability is extremely small. Likewise, if n' specific k-mers of the chromosome are actually detected, and assuming that the exceptions allowed by P₂ in condition (2) of this step are randomly distributed among the genomes other than those of the chromosome, the probability of a false positive for the chromosome is less than or equal to P₂^{n'} (that is, P₂ raised to the power n'). For sufficiently large n', this probability is also extremely small. The false negative rate is the proportion of positives that yield a negative test result, that is, the conditional probability of a negative test result given that the condition being tested for is present.
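Under the random-distribution assumptions stated in this paragraph, the two bounds can be written compactly as follows. The numeric line is only an illustration with assumed values P₁ = 0.05 and n = 10, which are not taken from the application.

```latex
\Pr(\text{false negative}) \le P_1^{\,n},
\qquad
\Pr(\text{false positive}) \le P_2^{\,n'}

% Illustration with assumed values P_1 = 0.05 and n = 10:
P_1^{\,n} = 0.05^{10} \approx 9.8 \times 10^{-14}
```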

Therefore, when calculating the false positive probability, the k-mers can be corrected for independence. For any two k-mers A and B in the specific k-mer list, if their ends coincide over no fewer than j characters (for example, the last j characters of A are exactly the same as the first j characters of B), the two k-mers A and B are considered to overlap at their ends. Here, j is generally a value greater than 5 and less than or equal to k-1, that is, 5 < j ≤ k-1. It should be noted that for DNA or RNA sequences, because of the reverse complementarity of nucleic acid sequences, the end-overlap test for two k-mers A and B in the specific k-mer list should cover four pairs: A and B; the reverse complementary sequence A' of A and B; A and the reverse complementary sequence B' of B; and A' and B'.

In the initial state, copy all k-mers in the specific k-mer list into a list of non-overlapping specific regions. If k-mers A and B in this list are confirmed to overlap at their ends, replace the two specific k-mers with the smallest region C (a specific region) that covers both of them. In the same way, repeatedly test every pair of specific k-mers or specific regions in the list for end overlap, and replace each overlapping pair with the smallest region covering both, until no specific k-mers or specific regions that satisfy the end-overlap condition remain. When this step is complete, the remaining specific k-mers and specific regions constitute the final list of non-overlapping specific regions; that is, each specific k-mer or specific region retained in the list in its final state is one non-overlapping specific region. When calculating false positives and false negatives, multiple k-mers belonging to the same non-overlapping specific region contribute the value of P₁ or P₂ only once. If the target species has M chromosomes, M specific k-mer tables, one per chromosome, are created here.
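The merging procedure can be sketched as a repeated pairwise scan. This is a simplified illustration, not the application's implementation: the function names are assumptions, only forward-strand overlaps are tested (the reverse-complement pairs described above are omitted for brevity), and the quadratic scan is chosen for clarity rather than speed.

```python
def end_overlap(a, b, j_min):
    """Length of the longest suffix of a equal to a prefix of b, if >= j_min, else 0."""
    for j in range(min(len(a), len(b)) - 1, j_min - 1, -1):
        if a[-j:] == b[:j]:
            return j
    return 0

def merge_overlapping(kmers, j_min):
    """Merge end-overlapping k-mers until no overlapping pair remains."""
    regions = list(dict.fromkeys(kmers))   # copy of the list without duplicates
    merged = True
    while merged:
        merged = False
        for x in range(len(regions)):
            for y in range(len(regions)):
                if x == y:
                    continue
                j = end_overlap(regions[x], regions[y], j_min)
                if j:
                    # Smallest region covering both: append the non-shared tail.
                    combined = regions[x] + regions[y][j:]
                    regions = [r for i, r in enumerate(regions) if i not in (x, y)]
                    regions.append(combined)
                    merged = True
                    break
            if merged:
                break
    return regions

# Hypothetical 8-mers; the first two share the 6 characters "GTACGA".
result = merge_overlapping(["ACGTACGA", "GTACGATT", "CCCCCCCC"], j_min=6)
```

Each merge shortens the list by one entry, so the loop always terminates.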

Step 1402F: Generate a specific k-mer copy number list corresponding to each chromosome.

For the high-confidence data set of each chromosome of the target species corresponding to the data to be detected, the number of occurrences of each selected specific k-mer is counted, that is, the total number of times the specific k-mer of the chromosome occurs across all genomes of the high-confidence data set. The copy number of each specific k-mer of the chromosome is then calculated relative to the number of occurrences Cm of the least frequent of all specific k-mers of that chromosome. If the target species has M chromosomes in total, M specific k-mer copy number lists, one per chromosome, are created here. The copy number of a specific k-mer is a value greater than or equal to one.
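One reading of this step, consistent with the statement that every copy number is at least one, is that each specific k-mer's occurrence count is divided by Cm, the count of the least frequent specific k-mer on the same chromosome. The sketch below assumes that reading; the function name and the toy counts are illustrative.

```python
def copy_number_list(occurrences):
    """Divide each specific k-mer's occurrence count by the smallest count Cm."""
    c_min = min(occurrences.values())
    return {kmer: count / c_min for kmer, count in occurrences.items()}

# Hypothetical occurrence counts in one chromosome's high-confidence data set.
occurrences = {"AAAC": 4, "CCGT": 2, "GGTA": 6}
copy_numbers = copy_number_list(occurrences)
```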

After the specific k-mer copy number lists of all chromosomes have been generated, suppose k-mer A and k-mer B are any two specific k-mers from different chromosomes, Ca and Cb are their numbers of occurrences in a set of data from a normal individual of the target species, and Fa and Fb are their values in the specific k-mer copy number lists of their respective chromosomes. The ratio of Ca/Fa to Cb/Fb should then equal the ratio of the copy numbers of the two chromosomes as shown in FIG. 16.

The process of creating a set of characteristic target sequences corresponding to each chromosome may be collectively referred to as module A. Module A can be run from time to time in order to continuously update the feature target sequence set corresponding to each chromosome, that is, update the target database. For example, whenever the reference genome data is updated, module A can be run. However, module A does not need to be run or updated during the analysis of each actual sample.

Step 1404: Calculate the actual signal strength of each chromosome contained in the target sample corresponding to the data to be detected.

As shown in FIG. 17, step 1404 includes:

Step 1404A: Obtain data to be detected.

Step 1404B: Obtain a specific k-mer list and a specific k-mer copy number list.

Step 1404C: Obtain the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected.

The data to be detected is obtained, and the target species corresponding to it is determined. The specific k-mer list and the specific k-mer copy number list of each chromosome of the target species, generated in step 1402, are retrieved. If the target species corresponding to the data to be detected has M chromosomes, M specific k-mer lists and M specific k-mer copy number lists, one of each per chromosome, need to be retrieved. The actual number of occurrences in the data to be detected of each specific k-mer contained in each chromosome of the target species is then obtained and recorded at the corresponding position in the record table of actual occurrences of the specific k-mers of the corresponding chromosome. That is, a record table of the actual occurrences of the specific k-mers of each chromosome is generated from the actual number of occurrences in the data to be detected.
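Counting the actual occurrences amounts to scanning the reads and looking each k-mer up in the chromosome's specific k-mer list, with no alignment involved. The sketch below is an assumption-laden illustration: the function name and the reads are hypothetical, and only the forward strand is scanned, so reverse complements would have to be added to the list beforehand.

```python
from collections import Counter

def count_specific_kmers(reads, specific_kmers, k):
    """Count how often each specific k-mer occurs in the sequencing reads."""
    targets = set(specific_kmers)
    counts = Counter({kmer: 0 for kmer in targets})   # keep zero counts visible
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in targets:        # set lookup, no sequence alignment
                counts[kmer] += 1
    return counts

# Hypothetical reads and specific k-mers for one chromosome.
reads = ["ACGTACGT", "TTACGT"]
actual = count_specific_kmers(reads, ["ACGT", "GGGG"], k=4)
```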

In step 1404D, a single copy signal intensity E of each chromosome is calculated.

A single-copy signal intensity calculation table for a specific chromosome is shown in FIG. For a specific chromosome, the data in the specific k-mer copy number list and in the record table of actual occurrences of the specific k-mers give, for any specific k-mer i belonging to the chromosome, its actual number of occurrences C′_i in the set of data and its copy number F_i. The adjusted number of occurrences of the k-mer, C′_i / F_i, can therefore be calculated. The adjusted numbers of occurrences of all specific k-mers of the chromosome are averaged, and the average is the single-copy signal intensity E of the chromosome.
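The calculation of E is a mean of adjusted counts. The following minimal sketch uses an assumed function name and hypothetical counts.

```python
def single_copy_signal(actual_counts, copy_numbers):
    """Average of C'_i / F_i over all specific k-mers of one chromosome."""
    adjusted = [actual_counts[kmer] / copy_numbers[kmer] for kmer in copy_numbers]
    return sum(adjusted) / len(adjusted)

# Hypothetical values for three specific k-mers of one chromosome.
actual_counts = {"AAAC": 20, "CCGT": 9, "GGTA": 33}      # C'_i
copy_numbers = {"AAAC": 2.0, "CCGT": 1.0, "GGTA": 3.0}   # F_i
E = single_copy_signal(actual_counts, copy_numbers)      # mean of 10, 9, 11
```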

After the single-copy signal intensity E of each chromosome has been calculated, the values for all chromosomes of the target species can be recorded and stored in a single-copy signal intensity record table as shown in FIG. 19.

Step 1404E, calculate the actual signal strength S of each chromosome.

After the single-copy signal intensity E of each chromosome has been calculated, the mean M and the variance SD of all single-copy signal intensities E can be calculated. The actual signal intensity S of each chromosome is calculated as S_i = (E_i - M) / SD. FIG. 20 shows a calculation table of the actual signal intensity of each chromosome. For chromosome 1, for example, the formula is S₁ = (E₁ - M) / SD, and the other chromosomes are calculated in the same way.
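The formula S_i = (E_i - M) / SD can be sketched as below. One assumption is made loudly here: SD is treated as the population standard deviation of the E values (`statistics.pstdev`), which is the usual choice for a standardized score; the text itself calls SD the variance, so this is an interpretation. The function name and the values are hypothetical.

```python
from statistics import mean, pstdev

def actual_signal_strengths(single_copy):
    """Standardize each chromosome's E against the mean and spread of all E."""
    values = list(single_copy.values())
    m = mean(values)
    sd = pstdev(values)   # assumption: SD is the population standard deviation
    return {chrom: (e - m) / sd for chrom, e in single_copy.items()}

# Hypothetical single-copy signal intensities E for three chromosomes.
signals = actual_signal_strengths({"chr1": 10.0, "chr2": 10.0, "chr3": 13.0})
```

If all E values are identical, SD is zero and the division is undefined, so a real implementation would need to handle that case.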

Step 1406: Calculate a standard confidence interval list corresponding to the chromosome contained in the target species according to the standard detection sample.

After a large number of standard detection samples have been obtained, the actual signal intensity of each chromosome in each standard detection sample can be calculated in the manner of step 1404. To distinguish it from the actual signal intensity of the data to be detected, the actual signal intensity of a standard detection sample is referred to as the standard signal intensity. The standard signal intensities of the chromosomes in all standard detection samples can be recorded in a table. Further, the records can be kept separately by sex; that is, one standard signal intensity record table is generated for the chromosomes in normal male samples and another for the chromosomes in normal female samples.

The standard signal intensities of each chromosome across all standard detection samples are analyzed statistically, and the mean M' and the variance SD' of the distribution of standard signal intensities of each chromosome over the standard detection samples are calculated. Suppose the standard detection samples are human and there are 100 of them; then there are 100 instances of chromosome 1, 100 instances of chromosome 2, ..., and 100 instances of chromosome 22. The number of X and Y sex chromosomes, however, depends on the sexes of these 100 people, so to obtain enough X and Y chromosomes a sufficient number of standard detection samples of each sex is also required. For chromosome 1 there are thus 100 standard signal intensities, from which the mean and variance for chromosome 1 can be calculated; the means and variances of the standard signal intensities of the other chromosomes are calculated in the same way.

In this way, the standard confidence interval, that is, the interval of standard signal intensity, corresponding to each chromosome in the standard detection samples at a preset confidence level can be determined. In other words, the two boundary values LB and UB of the confidence interval with confidence level P are obtained, where LB is the lower bound and UB the upper bound. P is generally a value of 0.95 or greater that can approach but never equals 1, and in practical applications it can be adjusted as required; for example, P is 0.95 for 95% confidence and 0.999 for 99.9% confidence. Once the standard confidence interval of each chromosome at the preset confidence level has been determined, a distribution table of P-confidence boundary values of the signal intensities of the chromosomes can be obtained for each of the two sexes of the target species. By computing statistics on the standard signal intensities of the chromosomes over a large number of standard detection samples, the standard confidence interval of each chromosome of the target species can thus be estimated statistically, that is, the interval in which the actual signal intensity of each chromosome falls in a normal sample at the preset confidence level P.
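The text does not state how LB and UB are derived from M' and SD'. One common choice, assumed here purely for illustration, is to treat the standard signal intensities of a chromosome as normally distributed and take a symmetric two-sided interval around the mean. The function name, the use of the sample standard deviation, and the input values are all assumptions.

```python
from statistics import NormalDist, mean, stdev

def standard_confidence_interval(standard_signals, p=0.95):
    """Return (LB, UB) for one chromosome, assuming normally distributed signals."""
    m = mean(standard_signals)
    sd = stdev(standard_signals)             # sample standard deviation (assumed)
    z = NormalDist().inv_cdf((1 + p) / 2)    # two-sided normal quantile for P
    return m - z * sd, m + z * sd

# Hypothetical standard signal intensities of one chromosome in three samples.
lb, ub = standard_confidence_interval([-1.0, 0.0, 1.0], p=0.95)
```

Raising `p` widens the interval, which matches the trade-off between false positives and false negatives described for the preset confidence level.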

In the application scenario of non-invasive prenatal testing (NIPT), that is, detecting fetal chromosome copy number abnormalities by sequencing fetal DNA in maternal peripheral blood, the fetal sample in the maternal peripheral blood is mixed with the maternal sample. The standard detection samples can therefore also be peripheral blood samples of normal mothers carrying normal fetuses, including samples from normal mothers carrying a normal male fetus, a normal female fetus, normal male twins, normal female twins, and normal twins of one male and one female. Accordingly, when the distribution table of P-confidence boundary values is made, the table can be adjusted to the type of standard detection sample.

Step 1408: It is detected whether there is an abnormal copy number in the data to be detected.

After the actual signal intensity of each chromosome of the target species corresponding to the data to be detected has been calculated, it can be compared with the standard confidence interval of the same chromosome obtained in step 1406 at the preset confidence level P. For example, the actual signal intensity of chromosome 1 in the data to be detected is compared with the standard confidence interval of chromosome 1. If the actual signal intensity of chromosome 1 is not within the standard confidence interval of chromosome 1, it can be determined that chromosome 1 has a copy number abnormality; otherwise, it can be determined that chromosome 1 has no copy number abnormality.
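The comparison itself is a per-chromosome interval check. The sketch below uses an assumed function name and hypothetical signal values and interval bounds.

```python
def find_abnormal_chromosomes(actual_signals, intervals):
    """Return chromosomes whose actual signal lies outside [LB, UB]."""
    abnormal = []
    for chrom, s in actual_signals.items():
        lb, ub = intervals[chrom]
        if not (lb <= s <= ub):
            abnormal.append(chrom)
    return abnormal

# Hypothetical actual signal intensities and standard confidence intervals.
actual_signals = {"chr1": 0.3, "chr21": 3.1}
intervals = {"chr1": (-1.96, 1.96), "chr21": (-1.96, 1.96)}
flagged = find_abnormal_chromosomes(actual_signals, intervals)
```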

Further, in step 1406 a distribution table of P-confidence boundary values of the standard signal intensities of the chromosomes is established for each sex of the target species. The actual signal intensities of the sex chromosomes can therefore be compared with the tables for the different sexes. The actual signal intensities of the X chromosome and the Y chromosome calculated from the data to be detected are compared with the confidence interval boundaries in each table. If both fall within the confidence intervals in the table for normal male samples, the sex corresponding to the data to be detected is male. If both fall within the confidence intervals in the table for normal female samples, the sex corresponding to the data to be detected is female. If they fall within the confidence intervals of neither table, it can be determined that a potential sex chromosome copy number abnormality exists.

After the sex corresponding to the data to be detected has been determined in this way, the actual signal intensity of each chromosome in the data to be detected is compared with the confidence interval of that chromosome in the distribution table for that sex. If the actual signal intensity of a chromosome is not within its confidence interval in the table, it is determined that a potential copy number abnormality exists for that chromosome. The probability of false positives can be reduced by increasing the preset confidence level P, but increasing P also increases the probability of false negatives.

After the target species corresponding to the data to be detected has been determined and the specific k-mers corresponding to each chromosome of the target species have been obtained, the actual signal intensity of each chromosome is calculated from the actual number of occurrences of the specific k-mers in the data to be detected and the copy number of each specific k-mer. The actual signal intensity of each chromosome can then be compared with the standard confidence interval of the corresponding chromosome, and any chromosome whose actual signal intensity is not within that interval is determined to be a chromosome with abnormal copy number. In this method, the data is compared only against the characteristic target sequences of each chromosome of the target species, that is, the specific k-mers, which are only a part of the whole genome of the target species. Comparing against the specific k-mers therefore reduces the comparison space, shortening analysis time and improving detection efficiency. In addition, the characteristic targets of each chromosome of the target species integrate multiple genomes of different individuals or populations of the target species, which avoids the problem that whole-genome alignment performs worse when a set of data comes from an individual that is genetically distant from the reference genome. Because multiple genomes of different individuals or populations are included when establishing the characteristic target library of each chromosome, the library is more universally applicable than a single reference genome. And because a set of data to be detected is compared only with the sequences in the characteristic target library, the space and time consumed by alignment are greatly reduced.

It should be understood that although the steps in the flowcharts of FIGS. 1-20 are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated in this document, the execution of these steps is not strictly limited in order, and these steps can be performed in other orders. Moreover, at least a part of the steps in each figure may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time, but may be performed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a part of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 21, a device for detecting an abnormal chromosome copy number is provided, including:

The specific k-mer acquisition module 2102 is used to obtain sequencing data of a sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected; and to acquire the specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where a specific k-mer is a k-mer in each chromosome that meets the preset specificity conditions, and a k-mer is a genomic sequence of length k;

The actual appearance frequency obtaining module 2104 is configured to obtain the actual appearance times of the specific k-mer included in each chromosome in the data to be detected;

The copy number acquisition module 2106 is used to obtain the copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer that occurs least often on that chromosome; and

A determination module 2108, configured to calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer, and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.
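The counting step performed by module 2104 can be sketched as follows. This is a minimal illustration only, not the implementation of the application: the function name `count_specific_kmers`, the in-memory set of specific k-mers, and the toy reads are all assumptions, and strand handling (reverse complements) and sequencing-error handling are deliberately left out.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_specific_kmers(reads: Iterable[str],
                         specific_kmers: Set[str],
                         k: int) -> Dict[str, int]:
    """Count how often each specific k-mer occurs in the sequencing reads.

    Only k-mers present in `specific_kmers` are counted, so the reads are
    never aligned against complete chromosome sequences.
    """
    # Start every specific k-mer at zero so absent k-mers are still reported.
    counts = Counter({kmer: 0 for kmer in specific_kmers})
    for read in reads:
        read = read.upper()
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in specific_kmers:
                counts[kmer] += 1
    return dict(counts)


# Hypothetical toy data (k = 3 is far shorter than any realistic k).
reads = ["ACGTAC", "CGTACG"]
specific = {"ACG", "GTA", "TTT"}
counts = count_specific_kmers(reads, specific, k=3)
```

Because membership is tested against a hash set, the cost per read grows with the read length rather than with the genome size, which is the saving the description attributes to comparing only against the specific k-mers.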

In one embodiment, the determination module 2108 is further configured to calculate the ratio of the actual number of occurrences of each specific k-mer to its copy number; calculate the average of these ratios over all specific k-mers contained in each chromosome as the single-copy signal strength of the corresponding chromosome; and calculate the actual signal strength of the corresponding chromosome based on the single-copy signal strength of each chromosome.
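The calculation described above can be sketched as follows, using the formula stated in the claims: actual signal intensity = (single-copy signal intensity - M) / SD. The function names and sample values are hypothetical. The claims describe SD as the variance of the single-copy signal intensities; this sketch assumes, as the abbreviation suggests, that the population standard deviation is meant, and that assumption should be checked against the original text.

```python
from statistics import mean, pstdev
from typing import Dict


def single_copy_signal(observed: Dict[str, int],
                       copy_number: Dict[str, float]) -> float:
    """Average of (actual occurrences / copy number) over a chromosome's
    specific k-mers."""
    ratios = [observed.get(kmer, 0) / cn for kmer, cn in copy_number.items()]
    return mean(ratios)


def actual_signal(single_copy: Dict[str, float]) -> Dict[str, float]:
    """Normalise each chromosome's single-copy signal as (signal - M) / SD."""
    values = list(single_copy.values())
    m = mean(values)
    sd = pstdev(values)  # assumption: SD is the population standard deviation
    if sd == 0:
        raise ValueError("all chromosomes have identical single-copy signal")
    return {chrom: (s - m) / sd for chrom, s in single_copy.items()}


# Hypothetical values for illustration only.
chr_signal = single_copy_signal({"AAA": 20, "CCC": 10},
                                {"AAA": 2.0, "CCC": 1.0})
z = actual_signal({"chr1": 10.0, "chr2": 12.0, "chr3": 14.0})
```

In the toy call, both k-mers give a ratio of 10, so the single-copy signal of that chromosome is 10; in the second call the chromosome whose signal equals the mean M is mapped to zero.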

In one embodiment, the apparatus for detecting abnormal copy number of a chromosome further includes a standard confidence interval list calculation module (not shown in the figure), configured to obtain a preset number of standard test samples, the standard test samples being samples confirmed as having no abnormal chromosome copy number; obtain the actual number of occurrences, in the data to be detected, of the specific k-mers contained in each chromosome in the standard test samples; obtain from the target database the copy number of each specific k-mer of each chromosome contained in the standard test samples; obtain the standard signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer included in the standard test samples; determine, according to the standard signal intensity of each chromosome in the multiple standard test samples, the standard confidence interval of the chromosome at the preset confidence value; and obtain, according to the standard confidence interval corresponding to each chromosome, a list of standard confidence intervals corresponding to the chromosomes contained in the target species.

In one embodiment, the above-mentioned standard confidence interval list calculation module is further configured to obtain the standard signal intensity of each chromosome contained in each standard test sample; calculate, separately for each gender of the standard test samples, the mean and variance of the standard signal intensities of each chromosome contained in all the standard test samples; and, based on the mean and variance of the standard signal intensities of each chromosome in the multiple standard test samples of the corresponding gender, determine the standard confidence interval, at the preset confidence value, of the chromosomes contained in the standard test samples of each gender.
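The text states only that the interval is determined from the mean and variance at a preset confidence value; it does not fix a distribution. The following sketch therefore makes an explicit assumption: the standard signal intensities of a chromosome across normal samples are treated as normally distributed, and a symmetric two-sided interval covering a fraction P of that distribution is used. The function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, pvariance
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def standard_confidence_interval(signals: List[float], p: float) -> Interval:
    """Two-sided interval covering a fraction p of a normal distribution
    fitted to the standard signal intensities (assumed model)."""
    mu = mean(signals)
    var = pvariance(signals, mu)
    z = NormalDist().inv_cdf(0.5 + p / 2.0)  # e.g. about 1.96 for p = 0.95
    half_width = z * sqrt(var)
    return (mu - half_width, mu + half_width)


def build_interval_list(samples: List[Dict[str, float]],
                        p: float) -> Dict[str, Interval]:
    """Per-chromosome intervals from several standard samples of one gender."""
    chromosomes = samples[0].keys()
    return {c: standard_confidence_interval([s[c] for s in samples], p)
            for c in chromosomes}


# Hypothetical standard signal intensities of one chromosome in three samples.
low95, high95 = standard_confidence_interval([-1.0, 0.0, 1.0], p=0.95)
low99, high99 = standard_confidence_interval([-1.0, 0.0, 1.0], p=0.99)
table = build_interval_list([{"chr21": -1.0}, {"chr21": 0.0}, {"chr21": 1.0}],
                            p=0.95)
```

Under this model a larger P gives a wider interval, so fewer chromosomes fall outside it. This matches the trade-off stated in the description, where raising P lowers the false positive rate at the cost of more false negatives.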

In one embodiment, the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby. The peripheral blood samples include a peripheral blood sample of a normal mother carrying a normal baby boy, a peripheral blood sample of a normal mother carrying a normal baby girl, a peripheral blood sample of a normal mother carrying normal boy twins, a peripheral blood sample of a normal mother carrying normal girl twins, and a peripheral blood sample of a normal mother carrying normal twins of one boy and one girl. The above-mentioned standard confidence interval list calculation module is further configured to determine the standard confidence interval of a chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying a normal baby boy; determine the standard confidence interval of a chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying a normal baby girl; determine the standard confidence interval of a chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying normal boy twins; determine the standard confidence interval of a chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying normal girl twins; and determine the standard confidence interval of a chromosome at the preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying normal twins of one boy and one girl.

In one embodiment, the above-mentioned determination module 2108 is further configured to, when it is detected that the actual signal intensity of a chromosome does not fall within the standard confidence interval of the corresponding chromosome, determine the chromosome corresponding to the actual signal intensity as a chromosome with abnormal copy number.

In one embodiment, the apparatus for detecting abnormal copy number of a chromosome further includes a gender division comparison module (not shown in the figure), configured to determine a standard confidence interval list of the chromosomes corresponding to each gender according to the genders of the target species; compare the actual signal strength of each chromosome with the standard confidence interval of the corresponding chromosome in the standard confidence interval list for the corresponding gender of the target species; and, when it is detected that the actual signal strength of a chromosome does not fall within the standard confidence interval of the corresponding chromosome for the corresponding gender, determine the chromosome corresponding to the actual signal intensity as a chromosome with abnormal copy number.
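The gender-specific comparison can be sketched as a lookup followed by a range check. The names, the dictionary layout, and all interval values below are hypothetical and serve only to illustrate the control flow; chromosomes without an interval in the selected list are skipped in this sketch.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def flag_abnormal_chromosomes(actual_signal: Dict[str, float],
                              intervals_by_sex: Dict[str, Dict[str, Interval]],
                              sex: str) -> List[str]:
    """Return chromosomes whose actual signal lies outside the standard
    confidence interval listed for the given sex."""
    intervals = intervals_by_sex[sex]
    flagged = []
    for chrom, signal in actual_signal.items():
        if chrom not in intervals:
            continue  # no interval for this chromosome in this sex's list
        low, high = intervals[chrom]
        if not (low <= signal <= high):
            flagged.append(chrom)
    return flagged


# Hypothetical interval lists and sample signals.
intervals_by_sex = {
    "male": {"chr21": (-2.0, 2.0), "chrX": (-1.5, 1.5)},
    "female": {"chr21": (-2.0, 2.0), "chrX": (-1.0, 3.0)},
}
sample = {"chr21": 3.1, "chrX": 2.0}
flagged_male = flag_abnormal_chromosomes(sample, intervals_by_sex, "male")
flagged_female = flag_abnormal_chromosomes(sample, intervals_by_sex, "female")
```

The same signal on chrX is flagged against the male list but not against the female list, which is why the gender of the sample has to be established before the comparison.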

In one embodiment, the above-mentioned apparatus for detecting abnormal copy number of a chromosome further includes a target sequence creation module (not shown in the figure), configured to obtain the number of occurrences C, in the corresponding chromosome, of each specific k-mer contained in each chromosome of the target species stored in the target database, and take the number of occurrences of the specific k-mer that occurs least often in the corresponding chromosome as the minimum number of occurrences Cm; take the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer; generate a specific k-mer copy number list corresponding to each chromosome according to the copy numbers of the specific k-mers contained in each chromosome; and store the specific k-mer copy number list in the target database.
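The copy number list for one chromosome follows directly from the definition C / Cm. The sketch below uses a hypothetical function name and toy counts, and keeps the list in a plain dictionary rather than in the target database.

```python
from typing import Dict


def copy_number_list(occurrences: Dict[str, int]) -> Dict[str, float]:
    """Copy number of each specific k-mer on one chromosome: C / Cm, where
    Cm is the smallest occurrence count among that chromosome's specific
    k-mers."""
    cm = min(occurrences.values())
    if cm <= 0:
        raise ValueError("every specific k-mer must occur at least once")
    return {kmer: c / cm for kmer, c in occurrences.items()}


# Hypothetical occurrence counts C for three specific k-mers on one chromosome.
copies = copy_number_list({"AAA": 4, "CCC": 2, "GGG": 2})
```

With these counts Cm is 2, so the k-mer occurring four times gets copy number 2 and the other two get copy number 1.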

In one embodiment, the above-mentioned target sequence creation module is further configured to obtain multiple chromosomes contained in the target species; classify and sort the multiple chromosomes contained in the target species; obtain pre-selected high-confidence genomes that satisfy a preset credibility condition; and determine the high-confidence genome corresponding to each chromosome contained in the target species.

In one embodiment, a k-mer among the specific k-mers satisfies the following two conditions: its number of occurrences in the genome occurrence index table corresponding to each chromosome meets a first preset error condition; and its number of occurrences in the genome occurrence index table corresponding to each chromosome, together with its number of occurrences in the genome occurrence index table of the complete set, meets a second preset error condition. The genome occurrence index table corresponding to a chromosome records, for each k-mer, the number of genomes that contain the k-mer among the genomes corresponding to that chromosome; the genome occurrence index table of the complete set records, for each k-mer contained in each chromosome of the target species, the number of genomes that contain the k-mer among the genomes included in the complete set.

In one of these embodiments, the first threshold is less than 5%.

In one of these embodiments, the second threshold is less than 5%.
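The two conditions can be sketched as a filter over the index tables. The sketch reads the conditions as they are worded in the claims: (genomes of the chromosome containing the k-mer / genomes of the chromosome) + first threshold >= 1, and (genomes of the chromosome containing the k-mer / genomes of the complete set containing the k-mer) + second threshold >= 1. That reading, the function name, and the threshold value 0.04 (chosen only because it is below 5%) are assumptions.

```python
def is_specific_kmer(chrom_genomes_with_kmer: int,
                     chrom_genome_total: int,
                     all_genomes_with_kmer: int,
                     first_threshold: float = 0.04,
                     second_threshold: float = 0.04) -> bool:
    """Check the two preset error conditions for one k-mer on one chromosome."""
    if chrom_genomes_with_kmer == 0 or chrom_genome_total == 0:
        return False
    # Condition 1: the k-mer is present in nearly all genomes of the chromosome.
    cond1 = (chrom_genomes_with_kmer / chrom_genome_total
             + first_threshold) >= 1
    # Condition 2: nearly all genomes containing the k-mer belong to
    # this chromosome rather than to other chromosomes.
    cond2 = (chrom_genomes_with_kmer / all_genomes_with_kmer
             + second_threshold) >= 1
    return cond1 and cond2


# Hypothetical counts taken from the two index tables.
conserved_and_unique = is_specific_kmer(99, 100, 100)
not_conserved = is_specific_kmer(90, 100, 90)
shared_elsewhere = is_specific_kmer(100, 100, 200)
```

Read this way, the first condition selects k-mers that are conserved across the genomes of a chromosome, and the second selects k-mers that rarely appear outside that chromosome.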

For the specific limitations of the device for detecting abnormal chromosome copy number, refer to the foregoing limitations on the method for detecting abnormal chromosome copy number; details are not described herein again. Each module in the above apparatus for detecting abnormal copy number of a chromosome can be realized in whole or in part by software, hardware, or a combination thereof. The above-mentioned modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 22. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium. The computer equipment database is used to store data for detecting abnormal chromosome copy numbers. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by a processor to implement a method for detecting abnormalities in chromosome copy number.

Those skilled in the art can understand that the structure shown in FIG. 22 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. A specific computer device may include more or fewer parts than shown in the figure, or combine certain parts, or have a different arrangement of parts.

A computer device includes a memory and one or more processors. Computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the steps of the method for detecting an abnormality of a chromosome copy number provided in any embodiment of the present application are implemented.

One or more non-volatile computer-readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the method for detecting chromosome copy number abnormalities provided in any embodiment of the present application.

A person of ordinary skill in the art may understand that all or part of the processes in the methods of the foregoing embodiments may be implemented by computer-readable instructions instructing related hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined. To keep the description concise, not all possible combinations of the technical features in the above embodiments have been described. However, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope described in this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and the description thereof is more specific and detailed, but cannot be understood as a limitation on the scope of the invention patent. It should be noted that, for those of ordinary skill in the art, without departing from the concept of the present application, several modifications and improvements can be made, and these all belong to the protection scope of the present application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims

A method for detecting chromosome copy number abnormalities, including:

Acquiring sequencing data of a sample to be detected as the data to be detected, and determining a target species corresponding to the data to be detected;

Obtaining a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and the k-mer refers to a genomic sequence of length k;

Obtaining the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected;

Obtaining a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer that occurs least often on that chromosome;

Calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and copy number of each specific k-mer; and

A chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.
The method according to claim 1, wherein the calculating the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and copy number of each specific k-mer comprises:

Calculate the ratio of the actual number of occurrences of each specific k-mer to the number of copies;

Calculating the average of said ratios of all specific k-mers contained in each chromosome as the single copy signal intensity of the corresponding chromosome; and

The actual signal intensity of the corresponding chromosome is calculated from the single-copy signal intensity of each chromosome.
The method according to claim 2, wherein the actual signal intensity of the corresponding chromosome is calculated according to the following formula:

Actual signal intensity of the chromosome = (single-copy signal intensity of the chromosome - M) / SD, where M is the average of the single-copy signal intensities of all chromosomes, and SD is the variance of the single-copy signal intensities of all chromosomes.
The method according to claim 1, further comprising:

Obtaining a preset number of standard test samples, the standard test samples being samples confirmed as having no abnormal chromosome copy number;

Obtaining the actual number of occurrences of the specific k-mer contained in each chromosome in the standard detection sample in the data to be detected;

Obtaining the copy number of each specific k-mer in each chromosome contained in the standard detection sample from the target database;

Obtaining the standard signal intensity of the corresponding chromosome according to the actual number of occurrences and copy number of each specific k-mer included in the standard detection sample;

Determining a standard confidence interval corresponding to a preset confidence value of the chromosome according to a standard signal intensity of each chromosome in a plurality of standard detection samples; and

According to the standard confidence intervals corresponding to each chromosome, a list of standard confidence intervals corresponding to the chromosomes contained in the target species is obtained.
The method according to claim 4, wherein the determining a standard confidence interval corresponding to a preset confidence value of the chromosome according to a standard signal intensity of each chromosome in a plurality of standard detection samples comprises:

Obtaining a standard signal intensity of each chromosome contained in each of the standard detection samples;

Calculating, separately for each sex of the standard detection samples, the mean and variance of the standard signal intensities of each chromosome contained in all the standard detection samples; and

Determining, according to the mean and variance of the standard signal intensity of each chromosome in the plurality of standard detection samples of the corresponding sex, the standard confidence interval, at the preset confidence value, of the chromosomes contained in the standard detection samples of each sex.
The method according to claim 4, wherein the standard test sample is a peripheral blood sample of a normal mother carrying a normal baby, and the peripheral blood samples include a peripheral blood sample of a normal mother carrying a normal baby boy, a peripheral blood sample of a normal mother carrying a normal baby girl, a peripheral blood sample of a normal mother carrying normal boy twins, a peripheral blood sample of a normal mother carrying normal girl twins, and a peripheral blood sample of a normal mother carrying normal twins of one boy and one girl;

The determining a standard confidence interval corresponding to a preset confidence value of the chromosome according to a standard signal intensity of each chromosome in a plurality of standard detection samples includes:

Determining a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby boy;

Determining a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby girl;

Determining a standard confidence interval corresponding to a chromosome at a preset confidence value according to a standard signal intensity of each chromosome contained in a peripheral blood sample of a normal mother carrying a normal baby boy twin;

Determining a standard confidence interval corresponding to a chromosome at a preset confidence value according to the standard signal intensity of each chromosome contained in the peripheral blood sample of the normal mother carrying a normal baby girl twin; and

A standard confidence interval corresponding to a chromosome at a preset confidence value is determined according to a standard signal intensity of each chromosome contained in a peripheral blood sample of the normal mother carrying a normal one male and one female twin.
The method according to claim 1, wherein the determining that the chromosome whose actual signal intensity is not within a standard confidence interval of the corresponding chromosome as having a copy number abnormality comprises:

When it is detected that the actual signal intensity corresponding to the chromosome does not belong to the standard confidence interval corresponding to the corresponding chromosome, the chromosome corresponding to the actual signal intensity is determined to be a chromosome with abnormal copy number.
The method according to claim 1, further comprising:

Determining a standard confidence interval list of a chromosome corresponding to each sex according to the sex of the target species;

Get the gender of the sample to be tested;

Comparing the actual signal intensity of each chromosome with the standard confidence interval corresponding to the corresponding chromosome in the list of standard confidence intervals for the corresponding sex of the target species; and

When it is detected that the actual signal intensity of the chromosome does not belong to the standard confidence interval of the corresponding chromosome of the corresponding sex, the chromosome corresponding to the actual signal intensity is determined as a chromosome with abnormal copy number.
The method according to claim 1, before the obtaining the sequencing data of the sample to be detected as the data to be detected, further comprising:

Obtaining the number of occurrences C, in the corresponding chromosome, of each specific k-mer contained in each chromosome of the target species stored in the target database, and taking the number of occurrences of the specific k-mer that occurs least often in the corresponding chromosome as the minimum number of occurrences Cm;

Taking the ratio of the number of occurrences C to the minimum number of occurrences Cm as the copy number of the specific k-mer;

Generating a specific k-mer copy number list corresponding to each chromosome based on the copy number of the specific k-mer contained in each chromosome; and

Storing the specific k-mer copy number list into the target database;

The obtaining the copy number of each specific k-mer from the target database includes: obtaining the copy number of each specific k-mer according to the specific k-mer copy number list.
The method according to claim 1, characterized in that before obtaining the sequencing data of the sample to be detected as the data to be detected, further comprising:

Obtaining multiple chromosomes contained in the target species;

Classify and sort multiple chromosomes contained in the target species;

Obtaining pre-selected high-confidence genomes that meet preset confidence conditions; and

A high-confidence genome corresponding to each chromosome contained in the target species is determined.
The method according to claim 1, wherein the meeting the preset credibility condition includes any one of the following:

When the proportion of non-deterministic characters contained in the chromosome sequence is lower than a preset proportion threshold;

When the sequence fragments contained in the chromosome sequence belonging to the same chromosome are below a preset fragment threshold; and

Performing a sequence comparison between a chromosome sequence and all other chromosome sequences whose genetic relationship meets the preset genetic distance threshold range, to determine the average full coverage percentage of the chromosome sequence among the similar chromosome sequences, when the average full coverage percentage satisfies a preset percentage value.
The method according to claim 1, characterized in that the k-mer in the specific k-mer satisfies the following two conditions:

The number of occurrences in the genome occurrence index table corresponding to each chromosome meets a first preset error condition; and the number of occurrences in the genome occurrence index table corresponding to each chromosome, together with the number of occurrences in the genome occurrence index table of the complete set, meets a second preset error condition;

The genome occurrence index table corresponding to a chromosome records, for each k-mer, the number of genomes that contain the k-mer among the genomes corresponding to that chromosome; the genome occurrence index table of the complete set records, for each k-mer contained in each chromosome of the target species, the number of genomes that contain the k-mer among the genomes included in the complete set.
The method according to claim 12, characterized in that the first preset error condition is: the sum of (i) the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of genomes contained in the corresponding chromosome and (ii) the first threshold is greater than or equal to 1.
The method according to claim 13, wherein said first threshold is less than 5%.
The method according to claim 12, characterized in that the second preset error condition is: the sum of (i) the ratio of the number of occurrences in the genome occurrence index table corresponding to each chromosome to the number of occurrences in the genome occurrence index table of the complete set and (ii) the second threshold is greater than or equal to 1.
The method according to claim 15, wherein said second threshold is less than 5%.
A device for detecting abnormal copy number of chromosome, including:

A specific k-mer acquisition module, configured to acquire sequencing data of a sample to be detected as the data to be detected, and determine a target species corresponding to the data to be detected; and to acquire a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that meets a preset specificity condition, and the k-mer refers to a genomic sequence of length k;

An actual appearance frequency acquisition module, configured to obtain the actual appearance frequency of the specific k-mer included in each chromosome in the data to be detected;

A copy number obtaining module, configured to obtain a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer in the corresponding chromosome to the number of occurrences of the specific k-mer that occurs least often in that chromosome; and

A determination module, configured to calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and the copy number of each specific k-mer, and to determine a chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome as a chromosome with abnormal copy number.
A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

Acquiring sequencing data of a sample to be detected as the data to be detected, and determining a target species corresponding to the data to be detected;

Obtaining a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and the k-mer refers to a genomic sequence of length k;

Obtaining the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected;

Obtaining a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer that occurs least often on that chromosome;

Calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and copy number of each specific k-mer; and

A chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.
One or more non-volatile computer-readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:

Acquiring sequencing data of a sample to be detected as the data to be detected, and determining a target species corresponding to the data to be detected;

Obtaining a specific k-mer corresponding to each chromosome contained in the target species stored in the target database, where the specific k-mer is a k-mer in each chromosome that satisfies a preset specificity condition, and the k-mer refers to a genomic sequence of length k;

Obtaining the actual number of occurrences of the specific k-mer contained in each chromosome in the data to be detected;

Obtaining a copy number of each specific k-mer from the target database, where the copy number is the ratio of the number of occurrences of the specific k-mer on the corresponding chromosome to the number of occurrences of the specific k-mer that occurs least often on that chromosome;

Calculate the actual signal intensity of the corresponding chromosome according to the actual number of occurrences and copy number of each specific k-mer; and

A chromosome whose actual signal intensity is not within the standard confidence interval of the corresponding chromosome is determined to be a chromosome with abnormal copy number.