CN111370065A - Method and device for detecting cross-sample contamination rate of RNA - Google Patents

Info

Publication number
CN111370065A
Authority
CN
China
Prior art keywords
sample
polymorphic sites
cross
comparison result
result file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010224358.8A
Other languages
Chinese (zh)
Other versions
CN111370065B (en)
Inventor
黄毅
易鑫
杨玲
王申杰
刘久成
吴玲清
王旭文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guiinga Medical Laboratory
Beijing Jiyinjia Medical Laboratory Co ltd
Original Assignee
Shenzhen Guiinga Medical Laboratory
Beijing Jiyinjia Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Guiinga Medical Laboratory, Beijing Jiyinjia Medical Laboratory Co ltd filed Critical Shenzhen Guiinga Medical Laboratory
Priority to CN202010224358.8A priority Critical patent/CN111370065B/en
Publication of CN111370065A publication Critical patent/CN111370065A/en
Application granted granted Critical
Publication of CN111370065B publication Critical patent/CN111370065B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a device for detecting the cross-sample cross-contamination rate of RNA. The method comprises the following steps: obtaining a comparison result file between the sequencing data of a sample to be tested and a reference genome; screening, from the comparison result file, housekeeping gene protein coding regions that cover polymorphic sites and have an expression level not lower than a set threshold, to serve as information extraction intervals; and calculating the sample contamination rate using the information extraction intervals, the comparison result file and a genetic polymorphic site information database. By screening stably expressed polymorphic sites as the input of contamination rate calculation software, the method overcomes the limitation that such software could only be used to evaluate DNA contamination rates. The program is convenient to run, fast and highly automated, and its results agree closely with standard samples, so the method enables quality evaluation of RNA samples and benefits the accuracy of subsequent analysis.

Description

Method and device for detecting cross-sample contamination rate of RNA
Technical Field
The invention relates to the field of cross-sample cross-contamination rate detection, in particular to a method and a device for detecting cross-sample cross-contamination rate of RNA.
Background
The gene expression profile of a tumor sample is a powerful biomarker for prognosis and prediction. To date, transcript profiling has been performed on a large number of frozen cancer tissue samples. However, because fresh frozen tissue from tumor samples of clinical patients under long-term follow-up is not easy to collect and store, formalin-fixed paraffin-embedded (FFPE) tissue has become the more widely used biomaterial in the medical field. Genome-wide gene expression profiling of tumor samples is essential for cancer research and also facilitates extensive retrospective clinical genomic studies. FFPE processing, however, requires fixation, paraffin embedding, sectioning, staining and other steps to keep cellular tissue from degrading, and contamination is often introduced during these slide-preparation operations as well as during transportation, storage and manual experimental handling. Contamination falls into three main types: across individuals, within an individual, and across species.
Currently, contamination rate evaluation for RNA samples is essentially limited to cross-species contamination. The main method is to take the sequences that cannot be aligned to the reference genome, align them against the NCBI database, and calculate the fraction of contamination from non-human species such as microorganisms, plants and viruses. For contamination across individuals, the prior art discloses only a way of assessing the contamination rate of DNA samples, as follows: the GATK tool ContEst calculates cross-contamination in human DNA next-generation sequencing data from single nucleotide polymorphic site information. Given the genotype information of the sequenced sample (http://www.1000genomes.org), population frequency information (provided with ContEst) and the alignment file of the sequenced sample, it uses a Bayesian method to calculate the posterior probability of each contamination level and reports the contamination level with the maximum posterior probability. The software is mainly used for detecting DNA contamination rates. The applicant found that when it is used to detect RNA contamination rates, the final result is extremely inaccurate because RNA expression levels are not uniform. That is, when detecting the cross-sample cross-contamination rate of RNA, different genes in different samples are expressed at different levels at different times and locations, which severely affects the sequencing coverage of single nucleotide polymorphic sites and, in turn, the contamination rate evaluation.
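The maximum-posterior idea described above can be illustrated with a deliberately simplified sketch. It is not ContEst's actual model: it assumes a uniform prior, a fixed sequencing error rate, and sites at which the sample itself is homozygous for the reference allele, so that alternate-allele reads arise only from contamination or error. All function names and the input values are hypothetical.

```python
import math

def log_likelihood(c, sites, error_rate=0.001):
    """Log-likelihood of contamination fraction c under a simplified read model."""
    total = 0.0
    for alt_reads, depth, pop_alt_freq in sites:
        # A contaminating read carries the alternate allele at the population frequency.
        p_alt_contaminant = pop_alt_freq * (1 - error_rate) + (1 - pop_alt_freq) * error_rate
        # A read from the homozygous-reference sample shows the alternate allele only by error.
        p_alt = (1 - c) * error_rate + c * p_alt_contaminant
        total += alt_reads * math.log(p_alt) + (depth - alt_reads) * math.log(1 - p_alt)
    return total

def map_contamination(sites, grid_steps=1000):
    """Grid search over c in [0, 1]; with a uniform prior the posterior mode
    coincides with the likelihood maximum."""
    grid = [i / grid_steps for i in range(grid_steps + 1)]
    return max(grid, key=lambda c: log_likelihood(c, sites))

# Hypothetical input: (alt_reads, depth, population_alt_allele_frequency) per site.
example_sites = [(5, 1000, 0.5)] * 20
estimate = map_contamination(example_sites)
print(estimate)
```

The sketch also shows why uneven coverage matters: each site's contribution to the likelihood scales with its read depth, so sites in weakly expressed genes carry little or no information.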
Even a small degree of cross-individual RNA contamination in human samples can lead to false positives in analysis results, particularly in studies comparing RNA samples from tumor and normal tissues, so such contamination needs to be strictly controlled. However, no software or workflow is currently available that accurately assesses cross-individual RNA contamination.
Disclosure of Invention
The technical problem to be solved by the present invention is therefore the lack, in the prior art, of a method and device for accurately evaluating cross-individual RNA cross-contamination. To this end, the present invention provides a method and a device for evaluating the cross-sample cross-contamination rate of RNA more accurately.
A method of detecting a cross-sample cross-contamination rate of RNA, comprising:
obtaining a comparison result file between sequencing data of a sample to be detected and a reference genome;
screening a housekeeping gene protein coding region which covers the polymorphic sites and has an expression level not lower than a set threshold value from the comparison result file to serve as an information extraction interval;
and calculating the sample contamination rate by using the information extraction interval, the comparison result file and the genetic polymorphism site information database.
The step of screening the information extraction intervals comprises the following steps:
selecting housekeeping gene protein coding regions from a housekeeping gene database; then screening out all housekeeping gene protein coding regions that contain polymorphic sites, according to the coordinates of the polymorphic sites in the genome; the polymorphic sites contained in these housekeeping gene protein coding regions are denoted polymorphic sites Q;
calculating the expression levels of the genes in the comparison result file, selecting the genes M whose expression level is not lower than a set threshold, selecting from the polymorphic sites Q the polymorphic sites P that fall within the genes M, and taking the housekeeping gene protein coding regions covering the polymorphic sites P as the information extraction intervals. The genes M are preferably housekeeping genes whose expression level is not lower than the set threshold.
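The screening steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: coordinates are taken to be inclusive, the data structures and the function name are hypothetical, and the gene names and values in the usage example are invented.

```python
def screen_extraction_intervals(cds_regions, polymorphic_sites, expression, threshold):
    """Return the housekeeping CDS regions that cover polymorphic sites P.

    cds_regions:       list of (gene, chrom, start, end), inclusive coordinates
    polymorphic_sites: list of (chrom, position)
    expression:        dict mapping gene -> expression level in the sample
    threshold:         minimum expression level for a gene to count as a gene M
    """
    # Polymorphic sites Q: sites lying inside a housekeeping gene coding region.
    sites_q = [
        (site, region)
        for region in cds_regions
        for site in polymorphic_sites
        if site[0] == region[1] and region[2] <= site[1] <= region[3]
    ]
    # Genes M: expression level not lower than the set threshold.
    genes_m = {gene for gene, level in expression.items() if level >= threshold}
    # Polymorphic sites P: sites Q that fall within genes M. The regions
    # covering them are the information extraction intervals.
    return sorted({region[1:] for _site, region in sites_q if region[0] in genes_m})

# Hypothetical usage example.
regions = [("GENE_A", "1", 100, 200), ("GENE_B", "1", 500, 600), ("GENE_C", "2", 50, 80)]
sites = [("1", 150), ("1", 550), ("2", 10)]
levels = {"GENE_A": 12.0, "GENE_B": 0.0, "GENE_C": 3.0}
intervals = screen_extraction_intervals(regions, sites, levels, threshold=1.0)
print(intervals)
```

In the example, GENE_B is dropped because its expression is below the threshold and GENE_C is dropped because no polymorphic site falls inside its coding region.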
The housekeeping gene protein coding regions selected from the housekeeping gene database, and their number, can be adjusted according to whether polymorphic sites can be found in the housekeeping genes for a given sample; if none can be found, the selected coding regions are replaced or their number is increased. A smaller number runs faster but makes it more likely that no polymorphic site is found in the end, which lowers accuracy; a larger number gives higher accuracy but runs more slowly. In theory, anywhere from 1 to 3800 housekeeping genes can meet the requirements of the present invention; for higher accuracy, the subsequent analysis is preferably performed on 2000 to 3800 housekeeping genes obtained from the housekeeping gene database.
After the comparison result file is obtained, its data volume can be reduced by down-sampling. Reducing the data volume speeds up the workflow and lowers memory consumption.
The down-sampling method is one of the following: extracting, with Samtools, the sequencing reads of the housekeeping gene protein coding regions covering the polymorphic sites from the comparison result file; or randomly sampling the comparison result file with Picard to obtain randomly extracted sequencing reads.
The contamination rate calculation software is ContEst or Conta. During the ContEst calculation, the --filter_reads_with_N_cigar parameter is used to filter out sequencing reads containing unrecognized bases; during the Conta calculation, the detection sensitivity is adjusted through the -min_maf parameter.
When the contamination rate calculation software is ContEst, the width of the contamination rate confidence interval at 95% confidence is calculated; when it is Conta, the RNA cross-sample cross-contamination rate at min_maf = 0.05 is calculated.
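The random-sampling variant can be illustrated conceptually as follows. This sketch does not reproduce the algorithm of Picard or Samtools; it only shows one common way to keep a fixed fraction of reads reproducibly, by hashing the read name so that both mates of a pair (which share a name) are kept or dropped together. The function names and the seed handling are hypothetical.

```python
import hashlib

def keep_read(read_name, fraction, seed="0"):
    """Decide deterministically whether a read is kept at the given fraction."""
    digest = hashlib.md5((seed + read_name).encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    return int(digest[:8], 16) / 16**8 < fraction

def downsample(read_names, fraction, seed="0"):
    """Keep roughly `fraction` of the reads; the same name always gets the same decision."""
    return [name for name in read_names if keep_read(name, fraction, seed)]

# Hypothetical usage example: keep about 10% of 10,000 reads.
reads = [f"read_{i}" for i in range(10000)]
subset = downsample(reads, 0.1)
print(len(subset))
```

Changing the seed yields a different but equally reproducible subset.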
The sample to be tested is an RNA sample contaminated by mixing with cells from different individuals.
An apparatus for detecting the rate of cross-contamination of RNA across a sample, comprising:
the detection module is used for acquiring a comparison result file between sequencing data of a sample to be detected and a reference genome, and screening a housekeeping gene protein coding region which covers polymorphic sites and has an expression level not lower than a set threshold value from the comparison result file as an information extraction interval;
and the contamination rate calculation module is used for calculating the sample contamination rate through the information extraction interval, the comparison result file and the genetic polymorphism site information database.
The detection module comprises:
the data comparison module is used for comparing the sequencing data of the sample to be detected with the reference genome to obtain a comparison result file;
a coding region identification module for selecting a housekeeping gene protein coding region from a housekeeping gene database;
the polymorphic site identification module is used for screening all housekeeping gene protein coding regions containing polymorphic sites according to the coordinate information of the polymorphic sites in a genome; the polymorphic sites contained in the protein coding region of the housekeeping gene are marked as polymorphic sites Q;
and the screening module is used for obtaining the expression quantity of the genes in the comparison result file, selecting the genes M of which the expression quantity is not lower than a set threshold value, then selecting the polymorphic sites P falling into the genes M from the polymorphic sites Q, and adopting the housekeeping gene protein coding region covering the polymorphic sites P as an information extraction interval. The gene M is a housekeeping gene.
The detection module also comprises a down-sampling module which is used for reducing the data volume of the analysis sample in the comparison result file.
The technical scheme of the invention has the following advantages:
1. The method for detecting the RNA cross-sample cross-contamination rate can effectively calculate the proportion of contamination that comes from individuals other than the sample donor, and fills the gap left by the absence of an evaluation method for the RNA cross-sample cross-contamination rate.
The housekeeping genes referred to in the present invention are genes that are stably expressed in all cells and whose products are necessary for maintaining the basic life activities of the cell. The applicant found that, when calculating the RNA cross-sample cross-contamination rate, the sequencing read coverage of polymorphic sites is severely affected by RNA expression levels, which makes the final contamination rate evaluation extremely inaccurate. To solve this problem, stably expressed housekeeping genes are selected as the target genes for evaluating the RNA cross-sample cross-contamination rate. Specifically, housekeeping gene protein coding regions covering polymorphic sites are screened from the comparison result file to serve as information extraction intervals, and the sample contamination rate is calculated from the selected intervals. This effectively reduces the influence that differences in gene expression between samples and between genes have on the contamination rate result, and improves the accuracy of the evaluation. When the results obtained by the method in the embodiments are further verified against standard samples, the average error of the contamination rate estimate is only 0.16%, so the analysis results are highly reliable. The method can therefore better achieve quality evaluation of RNA samples and benefits the accuracy of subsequent analysis.
In addition, by using the stably expressed housekeeping gene protein coding regions that contain polymorphic sites as the input of contamination rate calculation software (such as ContEst or Conta), the invention overcomes the limitation that such software could only be used to evaluate DNA contamination rates. The RNA cross-sample cross-contamination rate can thus be estimated more simply and quickly, with convenient program operation, high analysis speed and a high degree of automation.
2. The invention reduces the data volume of the analyzed sample by down-sampling, which further improves the running efficiency of the program while maintaining detection performance.
Detailed Description
Example 1
A method of detecting a cross-sample cross-contamination rate of RNA, comprising:
(1) RNA sample sequencing data of two different individuals are provided and aligned to the human reference genome (GRCh37) with the BWA-MEM algorithm to obtain a comparison (alignment) result file. The human reference genome GRCh37 is version 37 of the human genome sequence published by NCBI, the United States National Center for Biotechnology Information.
(2) The comparison result files of the two individuals are sampled in set proportions and mixed to obtain three simulated samples forming a cross-sample contamination gradient, with true contamination rates of 1%, 3% and 5% respectively.
(3) To improve the running efficiency of the program, the comparison results of the three simulated samples are first down-sampled. In this embodiment, the comparison result files are randomly sampled with Picard to obtain randomly extracted sequencing reads.
(4) Stably expressed housekeeping genes are selected as the target genes for assessing the RNA cross-sample cross-contamination rate. In this embodiment, 3458 housekeeping genes were randomly downloaded as target genes from the website http://www.tau.ac.il/~elieis/HKG/.
(5) The gene interval of each housekeeping gene selected as a target gene in the preceding step is traversed programmatically in turn; the NCBI database is searched once for each gene interval to find the protein coding regions within it. Because the protein coding regions of different transcripts of the same gene overlap one another, overlapping protein coding regions on the same chromosome are merged according to their coordinates. The final result contains 36521 merged protein coding regions and has a size of 744 KB.
(6) All housekeeping gene protein coding regions containing polymorphic sites are selected from these protein coding regions as follows: the human genome haplotype map database (the HapMap database) is searched, and the regions are screened according to the genomic coordinates of the polymorphic sites. In this embodiment, 459 housekeeping gene protein coding regions covering polymorphic sites are screened from the protein coding regions obtained in step (5), as shown in table 1; together they cover 4326 polymorphic sites Q.
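The arithmetic behind such a mixture can be sketched as follows, assuming (the text does not say) that the contamination rate is measured as the fraction of reads in the mixed sample that come from the contaminating individual, and that all reads of the main individual are kept. The function name and the read count of one million are hypothetical.

```python
def contaminant_read_count(host_reads, contamination_rate):
    """Number of reads to draw from the contaminating individual so that they
    make up `contamination_rate` of the mixture, keeping all host reads.

    Solves n / (host_reads + n) = rate, i.e. n = host_reads * rate / (1 - rate).
    """
    if not 0 <= contamination_rate < 1:
        raise ValueError("contamination_rate must be in [0, 1)")
    return round(host_reads * contamination_rate / (1 - contamination_rate))

# Hypothetical usage example for the three gradients of this embodiment.
for rate in (0.01, 0.03, 0.05):
    n = contaminant_read_count(1_000_000, rate)
    print(rate, n, n / (1_000_000 + n))
```

Note that simply adding 1% as many contaminant reads as host reads would give a slightly lower contamination rate than 1%, because the contaminant reads also enlarge the total.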
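The coordinate-based merging of overlapping coding regions described in step (5) can be sketched as follows. This is a minimal illustration assuming inclusive coordinates; the function names are hypothetical and the example coordinates are invented.

```python
from collections import defaultdict

def merge_cds_regions(regions):
    """Merge overlapping (chrom, start, end) regions on the same chromosome.

    Coordinates are treated as inclusive. Returns a list sorted by chromosome
    name and start position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps the running region: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged

def total_length(merged):
    """Total number of bases covered by the merged regions."""
    return sum(end - start + 1 for _chrom, start, end in merged)

# Hypothetical usage example: two transcripts of one gene share an exon.
example = [("1", 100, 200), ("1", 150, 300), ("1", 400, 500), ("2", 10, 20)]
print(merge_cds_regions(example), total_length(merge_cds_regions(example)))
```

Merging first ensures that a polymorphic site lying in a shared exon is counted once rather than once per transcript.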
TABLE 1
[Table 1 appears only as images in the original publication (Figure BDA0002427146820000071 through Figure BDA0002427146820000151); it lists the 459 housekeeping gene protein coding regions covering polymorphic sites.]
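For thousands of sites and tens of thousands of regions, the coordinate lookup in step (6) is typically done with a sorted index rather than a pairwise comparison. The following sketch uses binary search from the Python standard library; it assumes the regions are already merged (non-overlapping) with inclusive coordinates, and all names and values are hypothetical.

```python
import bisect

def regions_with_sites(merged_regions, sites):
    """Map each region that contains at least one site to the positions it covers.

    merged_regions: non-overlapping (chrom, start, end), inclusive coordinates
    sites:          iterable of (chrom, position)
    """
    index = {}
    for chrom, start, end in sorted(merged_regions):
        index.setdefault(chrom, []).append((start, end))

    hits = {}
    for chrom, pos in sites:
        intervals = index.get(chrom, [])
        # Position of the last interval whose start is <= pos.
        i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            hits.setdefault((chrom,) + intervals[i], []).append(pos)
    return hits

# Hypothetical usage example.
demo_regions = [("1", 100, 300), ("1", 400, 500), ("2", 10, 20)]
demo_sites = [("1", 100), ("1", 350), ("1", 500), ("2", 5), ("3", 1)]
covered = regions_with_sites(demo_regions, demo_sites)
print(covered)
```

The keys of the result correspond to the coding regions that cover polymorphic sites, and the values together correspond to the polymorphic sites Q.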
(7) The expression levels of the housekeeping genes containing polymorphic sites in the three down-sampled simulated samples from step (3) are calculated with the transcriptome expression quantification software StringTie. Housekeeping genes with an expression level of 0 in any simulated sample are filtered out, and the protein coding regions of the housekeeping genes with non-zero expression in all simulated samples are taken as the information extraction intervals.
The information extraction intervals in this step are screened as follows: genes M whose expression level is not lower than a set threshold are selected, the polymorphic sites P that fall within the genes M are selected from the polymorphic sites Q, and the housekeeping gene protein coding regions covering the polymorphic sites P are used as the information extraction intervals. The genes M can be all genes whose expression level in the simulated sample is not lower than the set threshold, or only the housekeeping genes that meet that condition.
The method can screen out information extraction intervals suited to a single simulated sample, or intervals common to all simulated samples. To screen intervals for a single simulated sample, the genes with zero expression are filtered out of each simulated sample to obtain its genes M; the polymorphic sites Q are then compared with the genes M of each simulated sample to obtain the polymorphic sites P that fall within them; finally, the housekeeping gene protein coding regions that cover the polymorphic sites P and have non-zero expression in that simulated sample are taken as its information extraction intervals. To screen intervals common to all simulated samples, the genes with zero expression are filtered out across all simulated samples at once to obtain the genes M that have expression greater than zero in every sample; the polymorphic sites Q are then compared with these shared genes M to obtain the polymorphic sites P that fall within them, which finally yields the information extraction intervals common to all simulated samples.
This embodiment screens the information extraction intervals common to all simulated samples, that is, the three filtered simulated samples are compared simultaneously with the 4326 polymorphic sites Q obtained in step (6). Specifically, the housekeeping genes M with expression greater than zero in all three simulated samples are screened out, the polymorphic sites P that fall within these housekeeping genes M are selected from the polymorphic sites Q, and the housekeeping gene protein coding regions containing the polymorphic sites P are taken as the information extraction intervals for calculating the RNA cross-sample cross-contamination rate. In this embodiment, 16 information extraction intervals are screened out from the three simulated samples, as shown in table 2.
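The two screening modes described above differ only in whether the expressed-gene sets are kept per sample or intersected across samples. A minimal sketch, with hypothetical function names and invented expression values:

```python
def expressed_genes(expression, threshold=0.0):
    """Genes whose expression level is strictly above `threshold` in one sample."""
    return {gene for gene, level in expression.items() if level > threshold}

def per_sample_genes(samples):
    """Genes M for each simulated sample individually."""
    return {name: expressed_genes(expr) for name, expr in samples.items()}

def common_genes(samples):
    """Genes M shared by all simulated samples."""
    sets = [expressed_genes(expr) for expr in samples.values()]
    return set.intersection(*sets) if sets else set()

# Hypothetical expression values for three simulated samples.
samples = {
    "mix_01": {"GENE_A": 5.2, "GENE_B": 0.0, "GENE_C": 1.1},
    "mix_03": {"GENE_A": 4.8, "GENE_B": 2.0, "GENE_C": 0.9},
    "mix_05": {"GENE_A": 6.1, "GENE_B": 1.5, "GENE_C": 0.0},
}
print(per_sample_genes(samples)["mix_01"])
print(common_genes(samples))
```

The shared set is never larger than any per-sample set, which is consistent with the common mode yielding fewer intervals than a single-sample screen would.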
TABLE 2
Chromosome    Start        End
1             15986363     15988217
1             52498368     52499433
4             165118158    165118863
6             30297094     30297547
6             30679650     30681131
8             67341366     67342464
8             104427218    104427744
11            11373490     11374666
13            28009811     28010030
13            52602940     52605241
19            10224308     10225414
19            42582758     42585449
19            50411495     50413064
19            52393870     52395150
22            19951088     19951282
22            50962039     50962840
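To pass the intervals of table 2 to a downstream tool, they usually have to be written in that tool's interval syntax. The sketch below converts the table rows to chromosome:start-end strings, a form that GATK tools generally accept. Whether the coordinates in table 2 are 1-based and inclusive is an assumption, and the function name is hypothetical.

```python
# Rows copied from table 2: chromosome, start, end.
TABLE_2 = """\
1 15986363 15988217
1 52498368 52499433
4 165118158 165118863
6 30297094 30297547
6 30679650 30681131
8 67341366 67342464
8 104427218 104427744
11 11373490 11374666
13 28009811 28010030
13 52602940 52605241
19 10224308 10225414
19 42582758 42585449
19 50411495 50413064
19 52393870 52395150
22 19951088 19951282
22 50962039 50962840
"""

def to_interval_strings(table_text):
    """Convert whitespace-separated rows into 'chrom:start-end' strings."""
    intervals = []
    for line in table_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split()
        if int(start) > int(end):
            raise ValueError(f"start after end in row: {line!r}")
        intervals.append(f"{chrom}:{start}-{end}")
    return intervals

interval_strings = to_interval_strings(TABLE_2)
print("\n".join(interval_strings))
```

Writing the printed lines to a text file, one interval per line, gives a simple interval list.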
(8) The contamination rate is calculated with ContEst, the cross-individual contamination rate calculation tool of GATK, using as input the information extraction intervals obtained in the preceding step, the down-sampled comparison result file, and the HapMap database, which provides the genetic polymorphic site information. In the invention, the contamination rate can be obtained by inputting the information extraction intervals of a single simulated sample, the comparison result file of that simulated sample and the HapMap database; it can also be obtained by inputting the information extraction intervals common to several simulated samples, the comparison result files of those samples and the HapMap database. In this embodiment, the information extraction intervals play the role of the genotype information used for DNA sequencing, the HapMap database plays the role of the population frequency information, and the down-sampled comparison result file plays the role of the alignment file of the sequenced sample. An index is first built for the down-sampled comparison result file, and ContEst is then run. The --filter_reads_with_N_cigar parameter is used during the calculation to filter out sequencing reads containing unrecognized bases. In the output file, name is the sample name, contamination is the sample contamination rate, confidence_interval_95_width is the width of the 95% confidence interval, confidence_interval_95_low is its lower limit, and confidence_interval_95_high is its upper limit.
The BWA-MEM algorithm, Picard, StringTie and ContEst used in the above steps are all prior art, and their specific operation is common knowledge for those skilled in the art, so they are not described in detail here.
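Reading such an output file can be sketched as follows. The sketch assumes a tab-delimited text file with a header row, and assumes that the contamination-rate column is named contamination; the other column names are those listed above. The example row is invented for illustration and is not a result from this embodiment.

```python
import csv
import io

def read_contamination_table(text):
    """Parse a tab-delimited contamination report into a list of dicts."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        low = float(record["confidence_interval_95_low"])
        high = float(record["confidence_interval_95_high"])
        rows.append({
            "name": record["name"],
            "contamination": float(record["contamination"]),
            "ci_low": low,
            "ci_high": high,
            # The width is recomputed from the limits as a consistency check.
            "ci_width": high - low,
        })
    return rows

# Invented example content (assumed layout, made-up values).
example_report = (
    "name\tcontamination\tconfidence_interval_95_width\t"
    "confidence_interval_95_low\tconfidence_interval_95_high\n"
    "demo\t2.0\t1.0\t1.5\t2.5\n"
)
parsed = read_contamination_table(example_report)
print(parsed[0])
```

A narrow 95% confidence interval indicates that enough informative reads covered the selected polymorphic sites.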
(9) Performance verification
To verify the accuracy of the contamination rate calculation process, the contamination rates of the simulated samples were calculated. The detection error was 0.07% at the 1% contamination level, 0.01% at the 3% level and 0.40% at the 5% level. The average error over the three gradient simulated samples was only 0.16%; the specific detection results are shown in table 3.
TABLE 3
Simulated sample    Sample contamination rate    Detection result
mix_01              1%                           1.0744%
mix_03              3%                           3.0144%
mix_05              5%                           5.3974%
These results show that the average error of the estimated RNA cross-sample cross-contamination rate is only 0.16% and that the analysis results are highly reliable; the method can therefore better achieve quality evaluation of RNA samples and benefits the accuracy of subsequent analysis.
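The reported errors can be recomputed from table 3 as absolute differences, in percentage points, between the detected and the true contamination rates. The function names are hypothetical; the values are those of table 3.

```python
# (sample, true contamination rate in %, detected contamination rate in %)
TABLE_3 = [
    ("mix_01", 1.0, 1.0744),
    ("mix_03", 3.0, 3.0144),
    ("mix_05", 5.0, 5.3974),
]

def absolute_errors(rows):
    """Absolute error per sample, in percentage points."""
    return {name: abs(detected - true) for name, true, detected in rows}

def mean_absolute_error(rows):
    """Average of the per-sample absolute errors."""
    errors = absolute_errors(rows)
    return sum(errors.values()) / len(errors)

errors = absolute_errors(TABLE_3)
mean_error = mean_absolute_error(TABLE_3)
print({name: round(value, 2) for name, value in errors.items()}, round(mean_error, 2))
```

Rounded to two decimals, the per-sample errors are 0.07, 0.01 and 0.40, and their mean is 0.16, matching the figures stated in the text.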
Example 2
This embodiment differs from embodiment 1 in that the contamination rates of the simulated samples are calculated with the Conta software. Steps (1) to (7) are the same as in embodiment 1, except that different simulated samples and different final calculation software are used. Specifically, this embodiment uses 23 simulated RNA samples with contamination rates of 1% to 50%, obtained by mixing two different individuals according to different contamination gradients. With these 23 simulated RNA samples, 459 information extraction intervals are screened out (see table 1 for the specific intervals). Finally, the comparison result files of the different simulated RNA samples, the information extraction intervals and the polymorphic site information file of the HapMap database are input into Conta together, and the contamination rates of the simulated RNA samples are calculated with different values of the min_maf parameter. The calculation results are shown in table 4.
TABLE 4
[Table 4 appears only as images in the original publication (Figure BDA0002427146820000191 and Figure BDA0002427146820000201); it gives the calculation results for the simulated RNA samples at different min_maf values.]
Examples 1 and 2 show that the method can effectively evaluate RNA cross-sample contamination rates across different gradients. When Conta is used with the calculation parameter min_maf set to 0.05, the evaluation accuracy is further improved, with an average error of 0.034.
Example 3
A device for detecting RNA cross-sample cross-contamination rate comprises a detection module and a contamination rate calculation module.
The detection module is used for obtaining a comparison result file between the sequencing data of the sample to be tested and the reference genome, and for screening, from the comparison result file, a housekeeping gene protein coding region which covers the polymorphic sites and has an expression level not lower than a set threshold value to serve as an information extraction interval. The contamination rate calculation module is used for calculating the sample contamination rate with contamination rate calculation software, through the information extraction interval, the comparison result file and the genetic polymorphism site information database.
The detection module comprises a data comparison module, a down-sampling module, a coding region identification module, a polymorphic site identification module and a screening module, specifically:
the data comparison module is used for comparing the sequencing data of the sample to be tested with the reference genome to obtain a comparison result file;
the down-sampling module is used for reducing the data volume of the analysis sample in the comparison result file;
the coding region identification module is used for selecting a housekeeping gene protein coding region from a housekeeping gene database;
the polymorphic site identification module is used for screening all housekeeping gene protein coding regions containing polymorphic sites according to the coordinate information of the polymorphic sites in a genome; the polymorphic sites contained in the protein coding region of the housekeeping gene are marked as polymorphic sites Q;
and the screening module is used for obtaining the expression quantity of the genes in the comparison result file, selecting the genes M of which the expression quantity is not lower than a set threshold value, then selecting the polymorphic sites P falling into the genes M from the polymorphic sites Q, and adopting the housekeeping gene protein coding region covering the polymorphic sites P as an information extraction interval. The gene M is a housekeeping gene.
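The module structure described above can be sketched as a composition of callables. This is only one possible arrangement under stated assumptions: the class names, method names and call signatures are hypothetical, and each module is represented by a function supplied by the user rather than by a concrete implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DetectionModule:
    align: Callable[[Any, Any], Any]        # data comparison module
    downsample: Callable[[Any], Any]        # down-sampling module
    select_cds: Callable[[], Any]           # coding region identification module
    find_sites: Callable[[Any], Any]        # polymorphic site identification module
    screen: Callable[[Any, Any, Any], Any]  # screening module

    def run(self, reads, reference):
        """Return the comparison result and the information extraction intervals."""
        alignment = self.downsample(self.align(reads, reference))
        cds = self.select_cds()
        sites_q = self.find_sites(cds)
        intervals = self.screen(alignment, cds, sites_q)
        return alignment, intervals

@dataclass
class ContaminationDevice:
    detection: DetectionModule
    calculate_rate: Callable[[Any, Any, Any], float]  # contamination rate calculation module
    site_database: Any                                # genetic polymorphic site database

    def run(self, reads, reference):
        alignment, intervals = self.detection.run(reads, reference)
        return self.calculate_rate(intervals, alignment, self.site_database)
```

Keeping each module behind a plain callable makes it straightforward to swap, for example, one contamination rate calculator for another without touching the detection module.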
It should be understood that the above examples are given only for clarity of illustration and do not limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A method of detecting a cross-sample cross-contamination rate of RNA, comprising:
obtaining a comparison result file between sequencing data of a sample to be detected and a reference genome;
screening a housekeeping gene protein coding region which covers the polymorphic sites and has an expression level not lower than a set threshold value from the comparison result file to serve as an information extraction interval;
and calculating the sample contamination rate by using the information extraction interval, the comparison result file and the genetic polymorphism site information database.
2. The method for detecting RNA cross-sample cross-contamination rate according to claim 1, wherein the step of screening the information extraction interval comprises:
selecting a housekeeping gene protein coding region in a housekeeping gene database; then screening all housekeeping gene protein coding regions containing the polymorphic sites according to the coordinate information of the polymorphic sites in the genome; the polymorphic sites contained in the protein coding region of the housekeeping gene are marked as polymorphic sites Q;
calculating the expression quantity of the genes in the comparison result file, selecting the genes M of which the expression quantity is not lower than a set threshold value, selecting the polymorphic sites P falling into the genes M from the polymorphic sites Q, and using the housekeeping gene protein coding region covering the polymorphic sites P as an information extraction interval.
3. The method for detecting RNA cross-sample cross-contamination rate according to claim 1 or 2, wherein after the comparison result file is obtained, the data volume of the comparison result file can be reduced by a down-sampling method.
4. The method for detecting RNA cross-sample cross-contamination rate of claim 3, wherein the down-sampling method is: extracting, from the comparison result file, the sequencing reads of the housekeeping gene protein coding regions covering the polymorphic sites using Samtools software; or randomly sampling the comparison result file using Picard software to obtain randomly extracted sequencing reads.
5. The method for detecting RNA cross-sample cross-contamination rate according to any one of claims 1 to 4, wherein the contamination rate is calculated using Contest software or Conta software; wherein the filter_reads_with_N_cigar parameter is used during the Contest calculation to filter sequencing reads containing unidentifiable bases, and the detection sensitivity is adjusted through the -min_maf parameter during the Conta calculation.
6. The method for detecting RNA cross-sample cross-contamination rate according to claim 5, wherein when the contamination rate calculation software is Contest software, the width of the contamination rate confidence interval at 95% confidence is calculated; when the contamination rate calculation software is Conta software, the RNA cross-sample cross-contamination rate at min_maf = 0.05 is calculated.
7. The method for detecting the cross-sample contamination rate of RNA according to any one of claims 1 to 6, wherein the sample to be detected is a contaminated RNA sample mixed with cells of different individuals.
8. An apparatus for detecting the rate of cross-contamination of RNA across a sample, comprising:
a detection module for acquiring a comparison result file between sequencing data of a sample to be detected and a reference genome, and screening, from the comparison result file, housekeeping gene protein coding regions that cover polymorphic sites and have an expression level not lower than a set threshold as an information extraction interval;
and a contamination rate calculation module for calculating the sample contamination rate using the information extraction interval, the comparison result file, and a genetic polymorphism site information database.
9. The apparatus of claim 8, wherein the detection module comprises:
a data comparison module for comparing the sequencing data of the sample to be detected with the reference genome to obtain the comparison result file;
a coding region identification module for selecting housekeeping gene protein coding regions from a housekeeping gene database;
a polymorphic site identification module for screening all housekeeping gene protein coding regions that contain polymorphic sites according to the coordinate information of the polymorphic sites in the genome, the polymorphic sites contained in the housekeeping gene protein coding regions being denoted polymorphic sites Q;
and a screening module for obtaining the expression levels of the genes in the comparison result file, selecting the genes M whose expression level is not lower than a set threshold, then selecting from the polymorphic sites Q the polymorphic sites P that fall within the genes M, and using the housekeeping gene protein coding regions covering the polymorphic sites P as the information extraction interval.
10. The apparatus according to claim 8 or 9, wherein the detection module further comprises a down-sampling module for reducing the data amount of the analysis sample in the comparison result file.
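For illustration, the two down-sampling strategies named in claim 4 (region-based extraction and random sampling) can be sketched in Python on in-memory read records. This is a simplified stand-in and does not invoke Samtools or Picard. The read representation `(name, chrom, start, end)`, the function names, and the `seed` argument are assumptions made for the example.

```python
import random


def extract_reads_in_intervals(reads, intervals):
    """Keep reads that overlap at least one information extraction interval.

    reads     : list of (name, chrom, start, end), 1-based inclusive coordinates
    intervals : list of (chrom, start, end), 1-based inclusive coordinates
    """
    kept = []
    for read in reads:
        _, chrom, start, end = read
        # Two closed ranges overlap when each one starts before the other ends.
        if any(chrom == c and start <= e and end >= s for c, s, e in intervals):
            kept.append(read)
    return kept


def random_downsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    return [read for read in reads if rng.random() < fraction]


if __name__ == "__main__":
    # Made-up reads and interval, for demonstration only.
    reads = [("r1", "chr12", 90, 140),
             ("r2", "chr12", 300, 350),
             ("r3", "chr7", 100, 150)]
    intervals = [("chr12", 100, 200)]
    print(extract_reads_in_intervals(reads, intervals))
    print(random_downsample(reads, 0.5, seed=7))
```

Region-based extraction keeps every read that carries information about the selected sites, whereas random sampling reduces the data volume uniformly; which of the two is preferable depends on the sequencing depth of the sample.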

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010224358.8A CN111370065B (en) 2020-03-26 2020-03-26 Method and device for detecting cross-sample contamination rate of RNA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010224358.8A CN111370065B (en) 2020-03-26 2020-03-26 Method and device for detecting cross-sample contamination rate of RNA

Publications (2)

Publication Number Publication Date
CN111370065A (en) 2020-07-03
CN111370065B (en) 2022-10-04

Family

ID=71209256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010224358.8A Active CN111370065B (en) 2020-03-26 2020-03-26 Method and device for detecting cross-sample contamination rate of RNA

Country Status (1)

Country Link
CN (1) CN111370065B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111944807A (en) * 2020-08-26 2020-11-17 天津诺禾医学检验所有限公司 Human sequencing sample tracking marker, and monitoring method and monitoring device for human sequencing sample cross contamination
CN115083529A (en) * 2022-07-11 2022-09-20 北京吉因加医学检验实验室有限公司 Method and device for detecting sample pollution rate

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050206402A1 (en) * 2004-03-22 2005-09-22 Jianou Shi Methods and systems for determining one or more properties of a specimen
WO2013142982A1 (en) * 2012-03-28 2013-10-03 Ontario Institute For Cancer Research Colca1 and colca2 and their use for the treatment and risk assessment of colon cancer
US20160046997A1 (en) * 2012-10-18 2016-02-18 Oslo Universitetssykehus Hf Biomarkers for cervical cancer
CN106460070A (en) * 2014-04-21 2017-02-22 纳特拉公司 Detecting mutations and ploidy in chromosomal segments
CN108384842A (en) * 2018-02-27 2018-08-10 宁波海尔施基因科技有限公司 A kind of unknown sample to being suspected to be people source carries out species and the composite PCR amplification method of people source individual identification identification
CN108754010A (en) * 2018-06-14 2018-11-06 中国农业科学院蔬菜花卉研究所 It is a kind of quickly to detect the remaining method of genomic DNA in total serum IgE sample
CN109402241A (en) * 2017-08-07 2019-03-01 深圳华大基因研究院 Identification and the method for analyzing ancient DNA sample
CN110093406A (en) * 2019-05-27 2019-08-06 新疆农业大学 A kind of argali and its filial generation gene research method
CN110797078A (en) * 2020-01-06 2020-02-14 北京吉因加科技有限公司 Method and device for constructing microsatellite unstable site screening and analyzing model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
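EWA A. BERGMANN et al.: "Conpair: concordance and contamination estimator for matched tumor–normal pairs", BIOINFORMATICS *
PAUL SIMION et al.: "A software tool ‘CroCo’ detects pervasive cross-species contamination in next generation sequencing data", BMC BIOLOGY *
LI Shaobo et al.: "Analysis methods for cSNPs and differentially expressed genes in transcriptome sequencing data", Journal of Shanghai Jiao Tong University (Medical Science) *
WU Nana et al.: "Effect of different types of skin disinfectants on blood culture contamination in children", Applied Preventive Medicine *
HUANG Yi: "Using cDNA microarrays to study the gene expression profiles of rice hybrids and their parents and the molecular biological basis of heterosis", China Doctoral Dissertations Full-text Database, Agricultural Science and Technology *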

Also Published As

Publication number Publication date
CN111370065B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
US11043283B1 (en) Systems and methods for automating RNA expression calls in a cancer prediction pipeline
CN108319813B (en) Method and device for detecting circulating tumor DNA copy number variation
CN110444255B (en) Biological information quality control method and device based on second-generation sequencing and storage medium
CN111341383B (en) Method, device and storage medium for detecting copy number variation
CN110070915A (en) The next generation utilizes the Prognosis in Breast Cancer prediction technique and forecasting system based on machine learning of base sequence analysis
CN109767810B (en) High-throughput sequencing data analysis method and device
CN112270953A (en) Analysis method, device and equipment based on BD single cell transcriptome sequencing data
CN112687333B (en) Single-sample microsatellite instability analysis method and device for pan-carcinomatous species
CN111370065B (en) Method and device for detecting cross-sample contamination rate of RNA
CN112289376B (en) Method and device for detecting somatic cell mutation
CN112634987B (en) Method and device for detecting copy number variation of single-sample tumor DNA
CN113470743A (en) Differential gene analysis method based on BD single cell transcriptome and proteome sequencing data
CN117947163A (en) Method for evaluating background level of variant nucleic acid sample
CN115678994A (en) Biomarker combination, reagent containing biomarker combination and application of biomarker combination
CN109461473B (en) Method and device for acquiring concentration of free DNA of fetus
CN114530199A (en) Method and device for detecting low-frequency mutation based on double sequencing data and storage medium
Wang et al. Systematic benchmarking of imaging spatial transcriptomics platforms in FFPE tissues
CN110729025B (en) Paraffin section sample somatic mutation detection method and device based on second-generation sequencing
CN108715891B (en) Expression quantification method and system for transcriptome data
CN111696622A (en) Method for correcting and evaluating detection result of mutation detection software
Wilmott et al. Tumour procurement, DNA extraction, coverage analysis and optimisation of mutation-detection algorithms for human melanoma genomes
WO2023184330A1 (en) Method and apparatus for processing genome methylation sequencing data, device, and medium
CN113981070B (en) Method, device, equipment and storage medium for detecting embryo chromosome microdeletion
CN112735533B (en) Aquatic ecology analysis method and system based on eDNA
CN116994647A (en) Method for constructing model for analyzing mutation detection result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant