CN111477277A - Sample quality evaluation method and device - Google Patents

Info

Publication number
CN111477277A
Authority
CN
China
Prior art keywords
variation
sites
sample
quality
snp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010478389.6A
Other languages
Chinese (zh)
Inventor
单光宇
张静波
徐冰
杨静怡
伍启熹
王建伟
刘倩
唐宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Usci Medical Laboratory Co ltd
Original Assignee
Beijing Usci Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Usci Medical Laboratory Co ltd
Priority to CN202010478389.6A
Publication of CN111477277A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6827 Hybridisation assays for detection of mutation or polymorphism
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Abstract

The invention provides a sample quality evaluation method and a sample quality evaluation device. The evaluation method comprises: obtaining sequencing data of a tissue sample to be tested and of a control cell sample, respectively; detecting the SNP (single nucleotide polymorphism) variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites; and calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites, and judging the sample quality according to that proportion. By identifying the germline SNP variant sites shared by the paired tissue and cell samples and judging the quality of the sequencing data from the proportion of homozygous and heterozygous variant sites, the method addresses the current inability to perform quality control on sample sequencing data when no reference material is available.

Description

Sample quality evaluation method and device
Technical Field
The invention relates to the field of sequencing data quality control, in particular to a sample quality evaluation method and a sample quality evaluation device.
Background
In the clinical practice of next-generation sequencing, paired-sample sequencing, i.e., sequencing both a pathological sample and a control sample, is often required in order to identify somatic variants accurately. However, sample quality is often degraded by causes such as careless experimental handling, prolonged storage, or contamination, so that the sample cannot be used for subsequent data analysis. Accurately determining sample quality is therefore important for the effective detection of somatic mutations.
Existing methods do not judge the quality state of a sample directly. Instead, following the general technical guidelines for next-generation sequencing, a positive reference material and a negative reference material are included in each sequencing batch and quality is inferred indirectly. In real clinical practice, however, the purchase and use of reference materials are often neglected for cost reasons, which creates a risk that poor sample quality goes unidentified. In other words, there is currently no effective way to determine sample quality when no reference material is available.
Disclosure of Invention
The main object of the invention is to provide a sample quality evaluation method and a sample quality evaluation device that solve the prior-art problem that sample quality is difficult to judge when no reference material is available.
To achieve the above object, according to one aspect of the present invention, there is provided a method for evaluating sample quality, the method comprising: obtaining sequencing data of a tissue sample to be tested and of a control cell sample, respectively; detecting the SNP (single nucleotide polymorphism) variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites; and calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites, and judging the sample quality according to that proportion.
Further, detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain the germline SNP variant sites comprises: detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain candidate sites; and removing, from the candidate sites, sites located in repetitive sequence regions and strand-bias sites to obtain the germline SNP variant sites.
Further, calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites comprises: dividing the germline SNP variant sites into homozygous variant sites, heterozygous variant sites, and remaining variant sites according to variant frequency; and calculating the proportion of the sum of the numbers of homozygous and heterozygous variant sites to the total number of germline SNP variant sites. Preferably, homozygous variant sites have a variant frequency of at least 90%, heterozygous variant sites have a variant frequency of at least 40% and at most 60%, and sites with any other variant frequency are remaining variant sites.
Further, determining the sample quality based on the ratio comprises: when the proportion is larger than or equal to the quality threshold value, judging that the quality of the sample is qualified; when the proportion is lower than the quality threshold value, judging that the sample quality is unqualified; preferably, the quality threshold is 0.7.
Further, the sequencing data is sequencing data of a targeted capture library, whole-genome sequencing data, or whole-exome sequencing data.
According to a second aspect of the present application, there is provided a device for evaluating sample quality, the device comprising: an acquisition module for obtaining sequencing data of a tissue sample to be tested and of a control cell sample, respectively; a germline SNP variant detection module for detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites; a proportion calculation module for calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites; and a quality determination module for judging the sample quality according to the proportion.
Further, the germline SNP variant detection module comprises: a variant screening module for detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain candidate sites; and a filtering module for removing, from the candidate sites, sites located in repetitive sequence regions and strand-bias sites to obtain the germline SNP variant sites.
Further, the proportion calculation module comprises: a site division module for dividing the germline SNP variant sites into homozygous variant sites, heterozygous variant sites, and remaining variant sites according to variant frequency; and a proportion statistics module for calculating the proportion of the sum of the numbers of homozygous and heterozygous variant sites to the total number of germline SNP variant sites. Preferably, homozygous variant sites have a variant frequency of at least 90%, heterozygous variant sites have a variant frequency of at least 40% and at most 60%, and sites with any other variant frequency are remaining variant sites.
Further, the quality determination module comprises: the first judging module is used for judging that the quality of the sample is qualified when the proportion is greater than or equal to the quality threshold value; the second judging module is used for judging that the quality of the sample is unqualified when the proportion is lower than the quality threshold value; preferably, the quality threshold is 0.7.
Further, the sequencing data is sequencing data of a targeted capture library, whole-genome sequencing data, or whole-exome sequencing data.
In order to achieve the above object, according to a third aspect of the present invention, there is provided a storage medium including a stored program, wherein a device on which the storage medium is located is controlled to perform any one of the above evaluation methods when the program is executed.
According to a fourth aspect of the present invention, there is provided a processor for running a program, wherein the program when running performs any one of the above-mentioned evaluation methods.
By applying the technical scheme of the invention, germline SNP variant sites are found by selecting the SNP variant sites shared by the sequencing data of the paired tissue sample and cell sample, and the quality of the sample sequencing data is judged from the proportion of homozygous and heterozygous variant sites among all germline SNP variant sites. This addresses the current inability to perform quality control on sample sequencing data when no reference material is available.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flowchart of the method for evaluating sample quality in Example 1 according to the present invention;
FIG. 2 is a detailed flowchart of the method for evaluating sample quality in Example 2 according to the present invention;
FIG. 3 is a schematic structural diagram of the device for evaluating sample quality in Example 5 according to the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
As mentioned in the Background section, there is currently no solution for assessing the quality of sample sequencing data when no reference material is available. To address this, and given that sequencing is routinely performed on a control cell sample (such as a leukocyte sample) together with a tissue sample to be tested (such as a pathological tissue sample), the present application provides a quality assessment scheme for the sequencing data of paired samples. In routine sequencing practice, the inventors found that when sample quality is poor, germline variant frequencies fluctuate widely and fall outside the normal range; for example, a large number of variants appear with a frequency of ≤ 40%. The inventors therefore attempted to measure sample quality by the proportion of germline variants whose frequency falls within the normal range, and verified the approach on a large number of samples of known quality, obtaining good results. The scheme of the present application is proposed on this basis.
Example 1
The present embodiment provides a method for evaluating sample quality, as shown in fig. 1, the method includes:
Step S101, obtaining sequencing data of a tissue sample to be tested and of a control cell sample, respectively;
Step S103, detecting the SNP (single nucleotide polymorphism) variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites;
Step S105, calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites;
Step S107, judging the sample quality according to the proportion.
In this evaluation method, germline SNP variant sites are found by selecting the SNP variant sites shared by the sequencing data of the paired tissue sample and cell sample, and the quality of the sample sequencing data is judged from the proportion of homozygous and heterozygous variant sites among all germline SNP variant sites. This addresses the current inability to perform quality control on sample sequencing data when no reference material is available.
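The selection of shared sites in step S103 can be illustrated with a minimal sketch. It assumes that each sample's SNP calls have already been reduced to a mapping from a site key (chromosome, position, reference base, alternate base) to a variant frequency; this data layout, the function name `shared_snp_sites`, and the example values are illustrative assumptions and are not taken from the application.

```python
from typing import Dict, Tuple

# Hypothetical site key: (chromosome, 1-based position, reference base, alternate base).
SiteKey = Tuple[str, int, str, str]


def shared_snp_sites(
    tissue: Dict[SiteKey, float],
    control: Dict[SiteKey, float],
) -> Dict[SiteKey, Tuple[float, float]]:
    """Return the sites called in both samples.

    Each shared site maps to (tissue frequency, control frequency) so that a
    later step can decide which frequency to use.
    """
    return {key: (tissue[key], control[key]) for key in tissue if key in control}


# Made-up example calls for illustration only.
tissue_calls = {
    ("chr1", 1000, "A", "G"): 0.48,
    ("chr1", 2000, "C", "T"): 0.97,
    ("chr2", 3000, "G", "A"): 0.12,  # seen only in the tissue sample
}
control_calls = {
    ("chr1", 1000, "A", "G"): 0.51,
    ("chr1", 2000, "C", "T"): 0.99,
}

germline_candidates = shared_snp_sites(tissue_calls, control_calls)
for key in sorted(germline_candidates):
    print(key, germline_candidates[key])
```

A site present only in the tissue sample is dropped here because it is not shared with the control, which is the property the method uses to treat the remaining sites as germline.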
The step of detecting germline SNP variant sites in the paired samples can be implemented with existing detection software, such as Mutect2. To further improve the accuracy of the evaluation and reduce interference from certain SNP variant sites, in a preferred embodiment, the step of detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites comprises: detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain candidate sites; and removing, from the candidate sites, sites located in repetitive sequence regions and strand-bias sites to obtain the germline SNP variant sites.
Sites located in repetitive regions and strand-bias sites tend to bias the counts of homozygous and heterozygous SNP variant sites, so removing them helps improve the accuracy of the calculated proportion.
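The filtering step can be sketched as follows. The application does not specify how repetitive regions are annotated or how strand bias is measured, so both are assumptions here: repeats are given as a hypothetical list of (chromosome, start, end) intervals, and a site is treated as strand-biased when the less-represented strand carries under 10% of the alternate-allele reads. The names `Candidate`, `filter_candidates`, and `min_fraction` are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    chrom: str
    pos: int
    alt_forward: int  # alternate-allele reads on the forward strand
    alt_reverse: int  # alternate-allele reads on the reverse strand


def in_repeat(site: Candidate, repeats: List[Tuple[str, int, int]]) -> bool:
    """True if the site falls inside any (chrom, start, end) interval, inclusive."""
    return any(c == site.chrom and start <= site.pos <= end for c, start, end in repeats)


def strand_biased(site: Candidate, min_fraction: float = 0.1) -> bool:
    """Assumed criterion: the minor strand supplies less than min_fraction of alt reads."""
    total = site.alt_forward + site.alt_reverse
    if total == 0:
        return True  # no supporting reads, so the site cannot be trusted
    return min(site.alt_forward, site.alt_reverse) / total < min_fraction


def filter_candidates(
    candidates: List[Candidate],
    repeats: List[Tuple[str, int, int]],
    min_fraction: float = 0.1,
) -> List[Candidate]:
    """Keep candidates that are neither in a repeat region nor strand-biased."""
    return [
        s for s in candidates
        if not in_repeat(s, repeats) and not strand_biased(s, min_fraction)
    ]


repeat_regions = [("chr1", 100, 200)]          # made-up interval
candidates = [
    Candidate("chr1", 150, 10, 9),             # inside the repeat interval
    Candidate("chr1", 500, 20, 0),             # all alt reads on one strand
    Candidate("chr2", 900, 10, 12),            # passes both filters
]
kept = filter_candidates(candidates, repeat_regions)
print(kept)
```

For large interval sets, a sorted interval index would be preferable to the linear scan used in this sketch.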
When the germline SNP variant sites are divided, thresholds on variant frequency are set to distinguish homozygous variant sites from heterozygous variant sites. In a preferred embodiment, calculating the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites comprises: dividing the germline SNP variant sites into homozygous variant sites, heterozygous variant sites, and remaining variant sites according to variant frequency; and calculating the proportion of the sum of the numbers of homozygous and heterozygous variant sites to the total number of germline SNP variant sites.
The variant frequency ranges for homozygous and heterozygous variant sites can be set as appropriate for the specific sequencing data. In the present application, SNP variant sites with a variant frequency of at least 90% are preferably marked as homozygous variant sites, SNP variant sites with a variant frequency of 40% to 60% are preferably marked as heterozygous variant sites, and sites with any other variant frequency are marked as remaining variant sites.
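A minimal sketch of the division and proportion calculation follows, using the preferred frequency ranges stated above (at least 90% for homozygous, 40% to 60% inclusive for heterozygous). The function names and the example frequencies are illustrative assumptions.

```python
from typing import Iterable

HOM_MIN = 0.90                 # preferred lower bound for homozygous sites
HET_MIN, HET_MAX = 0.40, 0.60  # preferred range for heterozygous sites


def classify_site(vaf: float) -> str:
    """Label a germline SNP site by its variant frequency (a fraction in [0, 1])."""
    if vaf >= HOM_MIN:
        return "homozygous"
    if HET_MIN <= vaf <= HET_MAX:
        return "heterozygous"
    return "remaining"


def quality_ratio(vafs: Iterable[float]) -> float:
    """(homozygous count + heterozygous count) / total germline SNP site count."""
    labels = [classify_site(v) for v in vafs]
    if not labels:
        raise ValueError("no germline SNP sites to evaluate")
    in_range = sum(1 for label in labels if label != "remaining")
    return in_range / len(labels)


# Made-up frequencies: 3 homozygous, 4 heterozygous, 3 remaining.
example_vafs = [0.95, 0.50, 0.45, 0.30, 1.00, 0.75, 0.58, 0.92, 0.41, 0.20]
print(quality_ratio(example_vafs))
```

With these ten made-up values, seven sites fall in the homozygous or heterozygous ranges, so the proportion is 7/10.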
The variant frequency of each germline SNP site can be obtained from detection software, such as Mutect2.
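For orientation, the variant frequency of a site is the fraction of reads covering the site that support the alternate allele. The sketch below computes it from per-allele read counts; the function name and the read counts are illustrative, and in practice the value reported by the variant caller would normally be used directly.

```python
def variant_allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("site has no covering reads")
    return alt_reads / depth


# Made-up read counts for illustration.
print(variant_allele_frequency(ref_reads=52, alt_reads=48))
print(variant_allele_frequency(ref_reads=2, alt_reads=98))
```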
In a preferred embodiment, judging the sample quality according to the proportion comprises: when the proportion is greater than or equal to a quality threshold, judging the sample quality to be qualified; and when the proportion is lower than the quality threshold, judging the sample quality to be unqualified. In the present application, the quality threshold is preferably 0.7; verification on a large number of samples showed that the evaluation of sample quality is more accurate when the quality threshold is set to 0.7.
The sequencing data used in the above evaluation method is not particularly limited, and may be sequencing data of a targeted capture library, whole-genome sequencing data, or whole-exome sequencing data.
Example 2
The embodiment provides an evaluation method for detecting the quality state of a next-generation sequencing sample, which comprises the following specific steps:
1) pretreating the sample and extracting DNA;
2) target region capture: capturing the targeted regions of the sample using probes of specific sequences;
3) sequencing by a high-throughput method to obtain sequencing reads;
4) filtering out low-quality sequences and determining quality using the procedure below.
The detailed process is shown in FIG. 2.
The process is divided into three main parts:
Part 1: Sample processing
The sample DNA is extracted, fragmented, and ligated to adapters, followed by hybridization capture, elution, enrichment, and sequencing;
Part 2: Data processing
The high-throughput sequencing reads are aligned to the human reference genome using the BWA-MEM alignment software; portions of reads that do not align are soft-clipped. The alignments are then sorted by their position on the reference genome, and an index is built using samtools;
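The alignment, sorting, and indexing steps can be sketched as command lines assembled in Python. This is an illustration under assumptions: the file paths and thread count are hypothetical, paired-end input is assumed, the application does not state the exact options used, and the reference is assumed to have been indexed beforehand with `bwa index`. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import List


def alignment_commands(
    reference: str,
    fastq1: str,
    fastq2: str,
    out_bam: str,
    threads: int = 4,
) -> List[str]:
    """Build shell commands for BWA-MEM alignment, coordinate sorting, and indexing."""
    ref, r1, r2, bam = (shlex.quote(p) for p in (reference, fastq1, fastq2, out_bam))
    return [
        # Align paired-end reads and pipe the SAM stream into a coordinate sort.
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -@ {threads} -o {bam} -",
        # Build the BAM index needed for random access by downstream tools.
        f"samtools index {bam}",
    ]


# Hypothetical file names for illustration only.
commands = alignment_commands("ref.fa", "tumor_1.fq", "tumor_2.fq", "tumor.bam", threads=2)
for command in commands:
    print(command)
```

The same two commands would be run once for the tissue sample and once for the control sample before variant detection.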
Part 3: Determining the quality state of the sample
1) Variant detection is performed on the SNP sites of the tumor sample and the control sample using Mutect2, and the germline SNP sites of the sample and their frequencies are determined;
2) sites in repetitive regions and strand-bias sites are removed from the determined germline SNP sites;
3) the variant frequencies of the filtered germline SNP sites are summarized;
4) finally, the filtered germline SNP sites are divided into three groups: homozygous sites (variant frequency ≥ 90%), heterozygous sites (variant frequency ≥ 40% and ≤ 60%), and remaining sites. The proportion of the homozygous and heterozygous sites in the total set of germline SNP sites is then evaluated (according to the following formula), and the quality state of the sample is determined.
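$$\text{Quality value} = \frac{N_{\text{homozygous sites}} + N_{\text{heterozygous sites}}}{N_{\text{total germline SNP sites}}}$$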
For any pair of sequenced samples, the total number of confirmed germline SNP variants is obtained through the filtering process above. If the sample quality is poor, the number of heterozygous/homozygous sites meeting the frequency requirements decreases and the quality value drops; if the sample quality is good, the number of heterozygous/homozygous sites meeting the frequency requirements is large and the quality value remains at a high level.
Example 3
The samples to be detected in this embodiment are lung cancer pathological samples with known quality and corresponding control samples. The main reagents in the examples are as follows:
Table 1:
[Table 1: main reagents (image BDA0002516548950000052 in the original publication)]
1. The DNA was quantified using a fluorometer (Qubit); the concentration was 3.8 ng/μl and the volume was 130 μl. The sample was fragmented using an ultrasonicator (Covaris) so that the DNA fragments were between 200 and 400 bp in size, and agarose gel electrophoresis was used to check whether the fragment size met the requirement.
2. The fragmented sample was first purified with magnetic beads, and then end repair and 3' adenylation were performed. The reaction system is shown in the table below. The basic steps were: incubation at 20 ℃ for 30 min, followed by incubation at 65 ℃ for 30 min to complete the reaction.
Table 2:
Reagent | Volume
End repair and 3' adenylation buffer | 7 μl
End repair and 3' adenylation enzyme mix | 3 μl
DNA | 50 μl (500 ng)
3. Adapters were ligated to the repaired DNA; the adapter ligation system is detailed in the table below. The reaction was incubated at 20 ℃ for 15 min.
Table 3:
Reagent | Volume
Tagged adapter | 2.5 μl
DNA sample | 60 μl
Ligation reaction solution | 30 μl
Ligase | 10 μl
Nuclease-free water | 7.5 μl
4. The adapter-ligated product was purified with magnetic beads and then amplified by PCR to obtain a sufficient amount of adapter-ligated DNA fragments. The basic steps were: pre-denaturation at 98 ℃ for 45 s; then denaturation at 98 ℃ for 15 s, annealing at 60 ℃ for 30 s, and extension at 72 ℃ for 30 s, with this denaturation-annealing-extension cycle repeated 7 times; and a final extension at 72 ℃ for 1 min to end the reaction. The amplification system is shown in the table below:
Table 4:
Reagent | Volume
Rapid hot-start polymerase | 25 μl
Amplification primers | 1 μl
Adapter-ligated DNA fragments | 24 μl
5. After the PCR amplification product was purified with magnetic beads and its concentration was determined by Qubit quantification, 500 ng of the amplification product was taken and concentrated to a volume of 4.4 μl using a concentrator, then blocked and hybridized with the probes. The hybridization reaction system is shown in the table below.
Table 5:
[Table 5: hybridization reaction system (images BDA0002516548950000061 and BDA0002516548950000071 in the original publication)]
The hybridization reaction conditions are shown in the table below:
Table 6:
[Table 6: hybridization reaction conditions (image BDA0002516548950000072 in the original publication)]
6. The probe-bound sample was captured using streptavidin magnetic beads, as follows: 50 μl of magnetic beads were added to a 1.5 ml centrifuge tube, the tube was placed on a magnetic rack, and the supernatant was discarded. The beads were washed three times with 200 μl of binding buffer, resuspended in 200 μl of binding buffer, and added to the probe-hybridized sample. The mixture was mixed by inversion on a mixer for 30 min and placed on the magnetic rack, and the supernatant was discarded. The beads were washed once with washing solution 1 and then three times with washing solution 2 preheated to 65 ℃, keeping the beads and washing solution 2 at 65 ℃ throughout. Finally, the tube was placed on the magnetic rack, the supernatant was discarded, 38 μl of nuclease-free water was added, and the beads were resuspended.
7. The DNA fragments captured on the magnetic beads were amplified by PCR to obtain a sufficient amount of adapter-ligated DNA fragments. The basic steps were: pre-denaturation at 98 ℃ for 2 min; then denaturation at 98 ℃ for 30 s, annealing at 60 ℃ for 30 s, and extension at 72 ℃ for 1 min, with this denaturation-annealing-extension cycle repeated 14 times; and a final extension at 72 ℃ for 5 min to end the reaction. The reaction system is shown in the table below.
Table 7:
[Table 7: PCR reaction system (images BDA0002516548950000073 and BDA0002516548950000081 in the original publication)]
8. The PCR amplification product was purified with magnetic beads, quantified by qPCR, and checked for fragment size on a 2100 instrument.
9. Sequencing was performed on a gene sequencer. The sequencing platform converted the optical signals into raw base-sequence data, and all sequencing reads were stored as fq files.
10. The raw fq files were aligned to the reference genome, low-quality sequences were removed, and the detection procedure of Example 2 was applied.
11. Sample detection results:
The sample quality score in this example was 0.5, which is below the set threshold of 0.7. The sample quality was therefore judged to be unqualified, consistent with the true state of the sample.
Example 4
The method was tested on six next-generation sequencing samples of known quality status. All samples were judged correctly; the results are shown in the table below.
Table 8:
Sample No. | Sample type | True state | Quality score | Quality determination | Consistent with true state
S1 | Lung cancer sample | Qualified | 0.86 | Qualified | Consistent
S2 | Lung cancer sample | Qualified | 0.89 | Qualified | Consistent
S3 | Lung cancer sample | Qualified | 0.92 | Qualified | Consistent
S4 | Lung cancer sample | Unqualified | 0.57 | Unqualified | Consistent
S5 | Lung cancer sample | Unqualified | 0.63 | Unqualified | Consistent
S6 | Lung cancer sample | Unqualified | 0.55 | Unqualified | Consistent
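The determinations in Table 8 can be reproduced by applying the preferred quality threshold of 0.7 to the reported quality scores. The sketch below is only a check of that decision rule; the function name `judge` and the label strings are illustrative.

```python
QUALITY_THRESHOLD = 0.7  # preferred threshold stated in the application


def judge(score: float, threshold: float = QUALITY_THRESHOLD) -> str:
    """Qualified when the score is at least the threshold, otherwise unqualified."""
    return "Qualified" if score >= threshold else "Unqualified"


# (sample number, true state, quality score) as reported in Table 8.
table8 = [
    ("S1", "Qualified", 0.86),
    ("S2", "Qualified", 0.89),
    ("S3", "Qualified", 0.92),
    ("S4", "Unqualified", 0.57),
    ("S5", "Unqualified", 0.63),
    ("S6", "Unqualified", 0.55),
]

for sample, truth, score in table8:
    verdict = judge(score)
    print(sample, score, verdict, "consistent" if verdict == truth else "inconsistent")
```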
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, since some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily all required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for causing a computing device to execute the methods according to the embodiments of the present invention or a processor to execute the methods according to the embodiments of the present invention.
Example 5
The present embodiment provides a device for evaluating sample quality. As shown in FIG. 3, the device comprises an acquisition module 20, a germline SNP variant detection module 40, a proportion calculation module 60, and a quality determination module 80, wherein the acquisition module 20 is configured to obtain sequencing data of a tissue sample to be tested and of a control cell sample, respectively; the germline SNP variant detection module 40 is configured to detect the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain germline SNP variant sites; the proportion calculation module 60 is configured to calculate the proportion of homozygous and heterozygous variant sites among the germline SNP variant sites; and the quality determination module 80 is configured to judge the sample quality according to the proportion.
In a preferred embodiment, the germline SNP variant detection module comprises: a variant screening module for detecting the SNP variant sites shared by the sequencing data of the tissue sample and the control cell sample to obtain candidate sites; and a filtering module for removing, from the candidate sites, sites located in repetitive sequence regions and strand-bias sites to obtain the germline SNP variant sites.
In a preferred embodiment, the proportion calculation module comprises: a site division module for dividing the germline SNP variant sites into homozygous variant sites, heterozygous variant sites, and remaining variant sites according to variant frequency; and a proportion statistics module for calculating the proportion of the sum of the numbers of homozygous and heterozygous variant sites to the total number of germline SNP variant sites.
In a preferred embodiment, the quality determination module comprises: a first determination module for judging the sample quality to be qualified when the proportion is greater than or equal to the quality threshold; and a second determination module for judging the sample quality to be unqualified when the proportion is lower than the quality threshold.
In a preferred embodiment, the sequencing data is sequencing data of a targeted capture library, whole-genome sequencing data, or whole-exome sequencing data.
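One way to picture how the four modules fit together is as a chain of callables. The sketch below is an assumption-laden illustration and not the structure of the claimed device: the class name `QualityEvaluator`, the callable signatures, and the stub implementations with made-up frequencies are all hypothetical, and the module numbers appear only as comments for orientation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Calls = Dict[str, float]  # hypothetical: site identifier -> variant frequency


@dataclass
class QualityEvaluator:
    acquire: Callable[[], Tuple[Calls, Calls]]              # acquisition module 20
    detect_germline: Callable[[Calls, Calls], List[float]]  # detection module 40
    compute_ratio: Callable[[List[float]], float]           # proportion module 60
    judge: Callable[[float], bool]                          # determination module 80

    def run(self) -> Tuple[float, bool]:
        """Run the modules in order and return (proportion, qualified?)."""
        tissue, control = self.acquire()
        vafs = self.detect_germline(tissue, control)
        ratio = self.compute_ratio(vafs)
        return ratio, self.judge(ratio)


def acquire_stub() -> Tuple[Calls, Calls]:
    # Made-up calls standing in for real sequencing data.
    tissue = {"site1": 0.50, "site2": 0.95, "site3": 0.30, "site4": 0.10}
    control = {"site1": 0.50, "site2": 1.00, "site3": 0.35}
    return tissue, control


def detect_stub(tissue: Calls, control: Calls) -> List[float]:
    # Shared sites only; the tissue-sample frequency is used here by assumption.
    return [vaf for site, vaf in tissue.items() if site in control]


def ratio_stub(vafs: List[float]) -> float:
    in_range = sum(1 for v in vafs if v >= 0.90 or 0.40 <= v <= 0.60)
    return in_range / len(vafs)


evaluator = QualityEvaluator(acquire_stub, detect_stub, ratio_stub, lambda r: r >= 0.7)
ratio, qualified = evaluator.run()
print(round(ratio, 3), qualified)
```

Keeping each module as a separate callable mirrors the modular description above and lets the screening, filtering, and threshold logic be swapped independently.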
Example 6
The present embodiment provides a storage medium including a stored program, wherein a device on which the storage medium is located is controlled to execute any one of the above-described evaluation methods when the program is executed.
The present embodiment provides a processor for running a program, wherein the program runs to perform any one of the above evaluation methods.
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effects: compared with the indirect determination method that relies on positive and negative reference materials, the method and device of the present application determine sample quality directly, have higher detection precision, and yield a clear determination threshold. In addition, the detection process makes good use of the sequencing data of the pathological and control samples and, with its purpose-designed filtering process, can accurately identify sample quality, so that sample quality can be determined even when reference materials are absent.
As can be seen from the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by means of software plus the necessary hardware, such as detection instruments. Based on this understanding, the data processing part of the technical solution of the present application may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments, or of some parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
It will be apparent to those skilled in the art that some of the above-described modules or steps of the present application may be implemented on a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for evaluating sample quality, the method comprising:
respectively obtaining sequencing data of a tissue sample to be detected and a control cell sample;
detecting SNP variation sites common to the sequencing data of the tissue sample to be detected and the control cell sample to obtain germline SNP variation sites;
and calculating the proportion of homozygous variant sites and heterozygous variant sites in the germline SNP variant sites, and judging the sample quality according to the proportion.
2. The method of claim 1, wherein detecting the SNP variation sites common to the sequencing data of each of the tissue sample to be tested and the control cell sample to obtain germline SNP variation sites comprises:
detecting SNP variation sites shared in the sequencing data of the tissue sample to be detected and the control cell sample to obtain candidate sites;
and removing, from the candidate sites, sites located in repetitive sequence regions and sites showing strand bias, to obtain the germline SNP variation sites.
3. The method of claim 1, wherein calculating the proportion of homozygous variation sites and heterozygous variation sites in the germline SNP variation sites comprises:
dividing the germline SNP variation sites into homozygous variation sites, heterozygous variation sites and residual variation sites according to the variation frequency;
counting the proportion of the sum of the number of the homozygous variation sites and the number of the heterozygous variation sites to the total number of the germline SNP variation sites;
preferably, the variation frequency of the homozygous variation sites is greater than or equal to 90 percent, the variation frequency of the heterozygous variation sites is greater than or equal to 40 percent and less than or equal to 60 percent, and sites with any other variation frequency are the residual variation sites.
4. The evaluation method of claim 1, wherein determining the sample quality based on the proportion comprises:
when the proportion is greater than or equal to a quality threshold value, judging that the sample quality is qualified;
when the proportion is lower than the quality threshold value, judging that the sample quality is unqualified;
preferably, the quality threshold is 0.7.
5. The method of claim 1, wherein the sequencing data is sequencing data of a targeted capture library, whole genome sequencing data, or whole exome sequencing data.
6. An apparatus for evaluating quality of a sample, the apparatus comprising:
an acquisition module for respectively acquiring sequencing data of a tissue sample to be detected and a control cell sample;
a germline SNP variation detection module for detecting SNP variation sites common to the sequencing data of the tissue sample to be detected and the control cell sample, to obtain germline SNP variation sites;
a proportion calculation module for calculating the proportion of homozygous variation sites and heterozygous variation sites in the germline SNP variation sites; and
a quality determination module for judging the quality of the sample according to the proportion.
7. The apparatus according to claim 6, wherein the germline SNP variation detection module comprises:
a variation screening module for detecting common SNP variation sites in the sequencing data of the tissue sample to be detected and the control cell sample to obtain candidate sites;
and a filtering module for removing, from the candidate sites, sites located in repetitive sequence regions and sites showing strand bias, to obtain the germline SNP variation sites.
8. The evaluation device of claim 6, wherein the proportion calculation module comprises:
a site classification module for dividing the germline SNP variation sites into homozygous variation sites, heterozygous variation sites and residual variation sites according to the variation frequency;
a proportion statistics module for counting the proportion of the sum of the number of the homozygous variation sites and the number of the heterozygous variation sites in the total number of the germline SNP variation sites;
preferably, the variation frequency of the homozygous variation sites is greater than or equal to 90 percent, the variation frequency of the heterozygous variation sites is greater than or equal to 40 percent and less than or equal to 60 percent, and sites with any other variation frequency are the residual variation sites.
9. The evaluation device of claim 6, wherein the quality determination module comprises:
a first judging module for judging that the sample quality is qualified when the proportion is greater than or equal to a quality threshold value;
a second judging module for judging that the sample quality is unqualified when the proportion is lower than the quality threshold value;
preferably, the quality threshold is 0.7.
10. The evaluation device of claim 6, wherein the sequencing data is sequencing data of a targeted capture library, whole genome sequencing data, or whole exome sequencing data.
11. A storage medium, characterized in that the storage medium comprises a stored program, wherein a device in which the storage medium is located is controlled to perform the evaluation method according to any one of claims 1 to 5 when the program is run.
12. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the evaluation method of any one of claims 1 to 5.
CN202010478389.6A 2020-05-29 2020-05-29 Sample quality evaluation method and device Pending CN111477277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478389.6A CN111477277A (en) 2020-05-29 2020-05-29 Sample quality evaluation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478389.6A CN111477277A (en) 2020-05-29 2020-05-29 Sample quality evaluation method and device

Publications (1)

Publication Number Publication Date
CN111477277A true CN111477277A (en) 2020-07-31

Family

ID=71765409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478389.6A Pending CN111477277A (en) 2020-05-29 2020-05-29 Sample quality evaluation method and device

Country Status (1)

Country Link
CN (1) CN111477277A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599189A (en) * 2020-12-29 2021-04-02 北京优迅医学检验实验室有限公司 Data quality evaluation method for whole genome sequencing and application thereof
CN112746097A (en) * 2021-01-29 2021-05-04 深圳裕康医学检验实验室 Method for detecting sample cross contamination and method for predicting cross contamination source

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109536588A (en) * 2018-12-26 2019-03-29 北京优迅医学检验实验室有限公司 Detect the method and device of the FFPE sample state of oxidation
CN109686404A (en) * 2018-12-26 2019-04-26 北京优迅医学检验实验室有限公司 The method and device that detection sample is obscured
CN109949861A (en) * 2019-03-29 2019-06-28 深圳裕策生物科技有限公司 Tumor mutations load testing method, device and storage medium

Similar Documents

Publication Publication Date Title
EP3143537B1 (en) Rare variant calls in ultra-deep sequencing
CN105543380B (en) A kind of method and device detecting Gene Fusion
KR101795124B1 (en) Method and system for detecting copy number variation
KR102638152B1 (en) Verification method and system for sequence variant calling
Daber et al. Understanding the limitations of next generation sequencing informatics, an approach to clinical pipeline validation using artificial data sets
CN106715711B (en) Method for determining probe sequence and method for detecting genome structure variation
CN109767810B (en) High-throughput sequencing data analysis method and device
CN104894271B (en) Method and device for detecting gene fusion
CN111304303B (en) Method for predicting microsatellite instability and application thereof
CN110211633A (en) The detection method of mgmt gene promoter methylation, the processing method of sequencing data and processing unit
CN111477277A (en) Sample quality evaluation method and device
CN110846411A (en) Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
CN109022562A (en) For detecting the screening technique of the SNP site of sample contamination and the method for detecting sample contamination in high-flux sequence
CN114530198A (en) Screening method of SNP (single nucleotide polymorphism) sites for detecting sample pollution level and detection method of sample pollution level
CN112746097A (en) Method for detecting sample cross contamination and method for predicting cross contamination source
CN111052249A (en) Methods for determining conserved regions of predetermined chromosomes, methods, systems, and computer readable media for determining the presence or absence of copy number variations in a sample genome
CN109686404B (en) Method and device for detecting sample confusion
CN110468189A (en) The method and device of detection sample somatic variation is sequenced based on single two generation of sample
CN107075565B (en) Individual single nucleotide polymorphism site typing method and device
US7912652B2 (en) System and method for mutation detection and identification using mixed-base frequencies
CN110993024B (en) Method and device for establishing fetal concentration correction model and method and device for quantifying fetal concentration
CN115896256A (en) Method, device, equipment and storage medium for detecting RNA insertion deletion mutation based on second-generation sequencing technology
CN113981070B (en) Method, device, equipment and storage medium for detecting embryo chromosome microdeletion
CN114517223A (en) Method for screening SNP (Single nucleotide polymorphism) sites and application thereof
CN109536588A (en) Detect the method and device of the FFPE sample state of oxidation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination