CN113793641B - Method for rapidly judging sample gender from FASTQ file - Google Patents

Method for rapidly judging sample gender from FASTQ file

Info

Publication number
CN113793641B
Authority
CN
China
Prior art keywords
mers
fastq
mer
data
chromosome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111149249.5A
Other languages
Chinese (zh)
Other versions
CN113793641A (en)
Inventor
吴星辰
栗海波
梁萌萌
余伟师
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Semek Gene Technology Co ltd
Original Assignee
Suzhou Semek Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Semek Gene Technology Co ltd
Priority to CN202111149249.5A
Publication of CN113793641A
Application granted
Publication of CN113793641B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention discloses a method for rapidly determining sample sex from FASTQ files, comprising the following steps: (1) generating K-mers unique to the Y chromosome from the reference genome; (2) obtaining the intersection of the design intervals of whole-exome sequencing capture probes, removing K-mers outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture-probe design intervals, and selecting the top-ranked K-mers as the unique K-mer set; (3) randomly reading FASTQ files, counting the unique K-mers, analyzing how the distribution of the unique K-mers differs between FASTQ files of different sexes using real data from equal numbers of males and females, and determining a sex-determination threshold; (4) determining the sex of a FASTQ file according to the threshold. The method is applicable to various NGS data types, has a simple analysis workflow, is easy to operate, and greatly improves determination efficiency.

Description

Method for rapidly judging sample gender from FASTQ file
Technical Field
The invention relates to the technical field of high-throughput sequencing and variant detection in biology and precision medicine, and in particular to a method for rapidly determining the sex of a sample from a FASTQ file.
Background
With the rapid development of modern medicine, the cost of high-throughput sequencing (Next-Generation Sequencing, NGS) keeps falling, and NGS is becoming the first choice for genetic testing in hereditary disease, oncology and other areas. FASTQ is the most common file format for storing NGS base calls together with their quality scores and other related information. FASTQ is also the raw data for sequencing data delivery and genomic analysis; NGS data and results in other formats, such as the alignment file BAM and the variant call file VCF, are derived from it through extensive computation. When analyzing NGS data, researchers often need to verify that the sex recorded for the sample matches the sex indicated by the data. This check is critical for confirming that data and sample correspond, for detecting contamination, and for downstream chromosome copy number analysis and variant interpretation.
Current approaches to determining sex from NGS data mainly analyze the coverage of specific genes on the X and Y chromosomes from a BAM file, or the genotype distribution on the X and Y chromosomes from a VCF file. These methods have the following disadvantages:
(1) Generating the BAM alignment file and the VCF variant call file from FASTQ requires substantial computing resources and storage space, and the analysis typically takes from several hours to tens of hours depending on data volume. The drawback is most pronounced when only the sex of the data needs to be determined and no further analysis is needed for the time being.
(2) Most of the software used in the analysis runs only on Linux and is very difficult to install and run on a Windows computer. Much data, however, is delivered through Windows network-disk software, so it must be uploaded to a Linux server before sex can be determined, which is inconvenient for analysts.
Analysts therefore urgently need a new technical solution that significantly reduces resource requirements and system dependencies, and that rapidly determines sample sex, and contamination between samples of different sexes, from FASTQ files.
Disclosure of Invention
The invention aims to solve the problems in the prior art by providing a method for rapidly determining the sex of a sample from a FASTQ file that significantly reduces resource requirements, reduces system dependence, and determines sample sex quickly.
The technical solution of the invention is as follows:
A method for rapidly determining the sex of a sample from a FASTQ file, comprising the following steps:
(1) Generating K-mers unique to the Y chromosome from the reference genome;
(2) Obtaining the intersection of the design intervals of whole-exome sequencing capture probes from different sources, removing K-mers outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture-probe design intervals, and selecting a preset number of top-ranked K-mers as the final unique K-mer set;
(3) Randomly reading data from different FASTQ files, counting the unique K-mers they contain, analyzing how the distribution of the unique K-mers differs between FASTQ files using real data from equal numbers of males and females, and determining a sex-determination threshold;
(4) Determining the sex of the FASTQ file according to the threshold.
Optionally or preferably, in the above method, the threshold comprises an upper threshold U and a lower threshold L on the number of K-mers: data with a count greater than U are male, and data with a count less than L are female; when the count is between L and U, contamination between samples of different sexes is determined to exist.
Optionally or preferably, in the above method, the FASTQ file is generated by whole genome sequencing or whole exome sequencing.
Optionally or preferably, in the above method, the K-mers outside the intersection in step (2) include those with coverage below 50% and those occurring fewer than 3 times on the Y chromosome.
Optionally or preferably, in the above method, the preset number of top-ranked K-mers in step (2) is 100.
Optionally or preferably, in the above method, 100,000 records are randomly read from the FASTQ files in step (3).
Compared with the prior art, the invention has the following beneficial effects:
The determination method is based on K-mers unique to the Y chromosome. In theory these K-mers occur only in data from male samples and therefore carry sex information. The threshold separating male from female data is determined from the difference in how often these K-mers occur in FASTQ files of different sexes, so that both the sex of the data and contamination between samples of different sexes can be determined from raw NGS data.
Removing K-mers that are not covered or have low coverage, together with K-mers that occur relatively rarely on the Y chromosome, further improves the robustness of the K-mer set and the speed of computation.
In addition, the invention has the following advantages:
1. Fast determination without large computing resources
Conventionally, determining the sex of data from the BAM alignment file or the VCF variant call file requires several to tens of hours of computation on a dedicated server. The workflow designed in the invention is simple to deploy and easy to use: the entire analysis can be completed merely by deploying the relevant executable files. Its demands on server computing resources are low; an ordinary laptop using multithreading can determine the sex of dozens of FASTQ files per minute.
2. Operating-system independence and wide applicability
The method is applicable to the various current NGS data types, including whole genome sequencing data of different depths and whole exome sequencing data from various capture probes; it runs not only on large Linux servers but also on personal Windows laptops.
Drawings
FIG. 1 is a flow chart of the overall determination method of Example 1;
FIG. 2 is a first partial flow chart of Example 1;
FIG. 3 is a second partial flow chart of Example 1;
FIG. 4 is a third partial flow chart of Example 1;
FIG. 5 is a fourth partial flow chart of Example 1.
Detailed Description
The following detailed description of the invention is presented in conjunction with the drawings and preferred embodiments to enable one skilled in the art to better understand and practice the invention.
Example 1
Referring to FIG. 1, the method for rapidly determining the sex of a sample from a FASTQ file comprises the following parts:
Part 1: generating K-mers unique to the Y chromosome from the reference genome;
Part 2: screening the Y-unique K-mers by probe interval and number of occurrences;
Part 3: using real data to analyze how the distribution of the screened K-mers differs between FASTQ files of different sexes, and thereby determining the sex-determination threshold;
Part 4: determining the sex of NGS FASTQ data according to the threshold.
The steps of each part are described in detail below.
Part 1: Generating K-mers unique to the Y chromosome from the reference genome
By comparing the K-mers of the Y chromosome with those of the other chromosomes in the reference genome, K-mers unique to the Y chromosome are found. In theory these K-mers occur only in data from male samples and therefore carry sex information. See FIG. 2 for the specific flow.
Input: the human genome reference sequence;
Output: K-mers unique to the Y chromosome.
The steps are as follows:
(1) Download the human genome reference sequence in FASTA format, e.g. hg38.fa.gz, from UCSC or another public database.
(2) Use a script to split the reference sequence by chromosome into two parts: the Y chromosome sequence (Y.fa) and the sequences of the other chromosomes (other.fa).
(3) Set different K-mer lengths (in this embodiment 7, 9, 11, 13, 15, 17, 19 and 21) and, for each length, count the K-mers of the two sequence files from step (2) with the Jellyfish software.
(4) Compare the K-mer sets of the two sequence files to find the K-mers unique to the Y chromosome.
(5) Weighing running time against the number of unique K-mers, set the K-mer length to 13.
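The comparison in steps (3) and (4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the patent counts K-mers with Jellyfish, whereas this sketch substitutes plain in-memory Python collections, which is only practical for small inputs. The function names `kmers` and `y_unique_kmers` are hypothetical, and the sequences are assumed to be already loaded as strings.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k window of seq that contains only A, C, G, T."""
    seq = seq.upper()
    valid = set("ACGT")
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= valid:  # skip windows containing N or other codes
            yield window


def y_unique_kmers(y_seq: str, other_seqs: Iterable[str], k: int) -> Dict[str, int]:
    """Return {k-mer: occurrences on Y} for k-mers absent from all other sequences."""
    y_counts = Counter(kmers(y_seq, k))
    seen_elsewhere = set()
    for seq in other_seqs:
        seen_elsewhere.update(kmers(seq, k))
    return {km: n for km, n in y_counts.items() if km not in seen_elsewhere}


if __name__ == "__main__":
    # Toy sequences only; a real run would use Y.fa and other.fa with k = 13.
    print(y_unique_kmers("ACGTACGTTT", ["ACGTAAAA"], k=4))
```

The sketch ignores reverse complements; because sequencing reads come from both strands, a real pipeline would have to account for strand when building or matching the K-mer set.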
A second part: screening for unique K-mers on Y chromosome based on probe interval and occurrence number
To ensure that the Y-unique K-mers are well covered across different sequencing technologies and capture probes, the intersection of capture-probe design intervals is obtained from mainstream whole-exome capture probes on the market from different sources (produced by different manufacturers). K-mers that are not covered or have low coverage are filtered out, and K-mers that occur relatively rarely on the Y chromosome are also removed, which improves the robustness of the K-mer set and the speed of computation. The remaining K-mers are sorted in descending order of their number of occurrences within the capture-probe design intervals, and the top 100 are selected as the final unique K-mer set. See FIG. 3 for the specific flow.
Input: K-mers unique to the Y chromosome; probe capture regions;
Output: the screened unique K-mers.
The steps are as follows:
(1) Obtain the design intervals of whole-exome sequencing capture probes from different probe design companies;
(2) Obtain the intersection of the capture-probe design intervals of the different companies using the tool bedtk;
(3) Remove K-mers outside the intersection of the capture-probe design intervals;
(4) Sort the K-mers in descending order of their number of occurrences within the capture-probe design intervals;
(5) Select the top 100 K-mers as the final unique K-mer set.
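Steps (3) to (5) can be sketched as follows, under stated assumptions. The intersected intervals are assumed to be already computed (the patent uses bedtk for that) and passed in as sorted, non-overlapping, half-open (start, end) coordinates on the Y chromosome; the positions of each K-mer on the Y chromosome are assumed to be known. The function names are hypothetical. The minimum of 3 occurrences on the Y chromosome follows the text; the "coverage below 50%" criterion is not modeled because the source does not define how coverage is computed.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the Y chromosome


def count_in_intervals(positions: Sequence[int], intervals: Sequence[Interval], k: int) -> int:
    """Count k-mer start positions whose full k-mer lies inside one interval."""
    starts = [s for s, _ in intervals]
    inside = 0
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos + k <= intervals[i][1]:
            inside += 1
    return inside


def select_top_kmers(kmer_positions: Dict[str, List[int]],
                     intervals: Sequence[Interval],
                     k: int,
                     top_n: int = 100,
                     min_y_count: int = 3) -> List[str]:
    """Drop rare or uncovered k-mers, then keep the top_n by in-interval count."""
    ranked = []
    for kmer, positions in kmer_positions.items():
        if len(positions) < min_y_count:
            continue  # occurs too rarely on the Y chromosome
        inside = count_in_intervals(positions, intervals, k)
        if inside == 0:
            continue  # entirely outside the probe intersection
        ranked.append((inside, kmer))
    # Descending by in-interval count; ties broken alphabetically for determinism.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [kmer for _, kmer in ranked[:top_n]]
```

Using binary search over the interval starts keeps each lookup logarithmic in the number of intervals, which matters when the position lists are long.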
Part 3: Using real data to analyze how the distribution of the screened K-mers differs between FASTQ files of different sexes, and thereby determining the sex-determination threshold
100,000 records are randomly read from FASTQ files (covering both sexes), and a script counts the unique K-mers screened in Part 2, i.e. computes the number of unique K-mers in the FASTQ file. A large amount of real data, with equal numbers of males and females, is used to analyze how the distribution of the unique K-mers differs between FASTQ files, and to derive an upper threshold (U; data above it are male) and a lower threshold (L; data below it are female) on the K-mer count that separate the sexes well. If the K-mer count lies between L and U, there may be contamination between samples of different sexes. See FIG. 4 for the specific flow.
Input: the screened unique K-mers, FASTQ files, and the true sex;
Output: the sex-determination thresholds.
The steps are as follows:
(1) Randomly read 100,000 records from the FASTQ file;
(2) Count the screened K-mers with a script;
(3) Set the thresholds according to the true sex of the data.
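The patent does not state how L and U are derived from the labeled counts. The sketch below shows one simple possibility, purely as an assumption: L is placed just above the largest count seen in female data and U just below the smallest count seen in male data, so that every training sample satisfies "count < L is female" and "count > U is male". The function name `calibrate_thresholds` and the "M"/"F" labels are hypothetical.

```python
from typing import Sequence, Tuple


def calibrate_thresholds(counts: Sequence[int], sexes: Sequence[str]) -> Tuple[int, int]:
    """Return (L, U) from per-file unique-k-mer counts and true sex labels.

    Assumed rule (not from the patent): L = max female count + 1 and
    U = min male count - 1, so all training data fall outside [L, U].
    """
    male = [c for c, s in zip(counts, sexes) if s == "M"]
    female = [c for c, s in zip(counts, sexes) if s == "F"]
    if not male or not female:
        raise ValueError("need at least one male and one female sample")
    if max(female) >= min(male):
        raise ValueError("male and female count distributions overlap")
    lower = max(female) + 1
    upper = min(male) - 1
    return lower, upper


if __name__ == "__main__":
    # Invented toy counts for three female and three male files.
    print(calibrate_thresholds([0, 1, 2, 40, 55, 60], ["F", "F", "F", "M", "M", "M"]))
```

With real data the two distributions may overlap or contain outliers, in which case a percentile-based or otherwise more tolerant rule would be needed; this sketch simply raises an error.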
Part 4: Determining the sex of NGS FASTQ data according to the threshold
For FASTQ files generated by whole genome sequencing (WGS) or whole exome sequencing (WES), the screened unique K-mers from Part 2 are counted, and sex is determined using the threshold interval obtained in Part 3. See FIG. 5.
Input: the screened unique K-mers, the FASTQ file, and the sex-determination thresholds;
Output: the determined sex.
The steps are as follows:
(1) Randomly read 100,000 records from the FASTQ file;
(2) Count the screened unique K-mers with a script;
(3) Determine the sex according to the thresholds.
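The three steps above can be sketched in Python as follows. Several points are assumptions rather than details from the patent: the random reading is done here by reservoir sampling (the patent does not say how the records are drawn), the FASTQ input is assumed to use exactly four lines per record, and all function names are hypothetical. The decision rule itself (above U male, below L female, otherwise possible contamination) follows the text.

```python
import random
from typing import Iterable, Iterator, List, Set


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def reservoir_sample(items: Iterable[str], n: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to n items from a stream in a single pass."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, item in enumerate(items):
        if i < n:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = item
    return sample


def count_hits(reads: Iterable[str], kmer_set: Set[str], k: int) -> int:
    """Count read windows of length k that match a screened unique k-mer."""
    hits = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in kmer_set:
                hits += 1
    return hits


def judge_sex(hits: int, lower: int, upper: int) -> str:
    """Apply the L/U rule: above U male, below L female, else possible contamination."""
    if hits > upper:
        return "male"
    if hits < lower:
        return "female"
    return "possible contamination"
```

For a file on disk, the lines could come from `open(path)` or, for gzip-compressed FASTQ, from `gzip.open(path, "rt")`; the sampled reads would then be passed to `count_hits` with k = 13 and the 100 screened K-mers.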
The method uses K-mers unique to the Y chromosome as the basis for determination and randomly samples the raw FASTQ data to determine the sex of NGS data. It is applicable to the various NGS data types, its analysis workflow is simple and easy to operate, and the entire analysis can be completed merely by deploying the relevant executable files. An ordinary laptop using multithreading can determine the sex of dozens of FASTQ files per minute, a large efficiency gain over the conventional approach of computing for several to tens of hours on a dedicated server.
Specific examples are used herein to describe the invention in detail; the above description of the examples is intended only to aid understanding of the core concept of the invention. It should be noted that any modifications, equivalents, or other improvements that are obvious to those skilled in the art and do not depart from the inventive concept fall within the scope of protection of the invention.

Claims (6)

1. A method for rapidly determining the sex of a sample from a FASTQ file, comprising the following steps:
(1) generating K-mers unique to the Y chromosome from the reference genome, specifically by:
a. obtaining the reference sequence of the reference genome in FASTA format;
b. splitting the reference sequence by chromosome into two sequence files: the Y chromosome and the other chromosomes;
c. setting different K-mer lengths and counting the K-mers of each of the two sequence files with the Jellyfish program;
d. comparing the K-mer sets of the two sequence files to obtain the K-mers unique to the Y chromosome;
e. setting the length of the K-mers unique to the Y chromosome to 13;
(2) obtaining the intersection of the design intervals of whole-exome sequencing capture probes from different sources, removing K-mers outside the intersection, sorting the retained K-mers in descending order of their number of occurrences within the capture-probe design intervals, and selecting a preset number of top-ranked K-mers as the final unique K-mer set;
(3) randomly reading data from different FASTQ files, counting the unique K-mers they contain, analyzing how the distribution of the unique K-mers differs between FASTQ files using real data from equal numbers of males and females, and determining a sex-determination threshold;
(4) determining the sex of the FASTQ file according to the threshold.
2. The method of claim 1, wherein the K-mers outside the intersection in step (2) include those with coverage below 50% and those occurring fewer than 3 times on the Y chromosome.
3. The method of claim 1, wherein the preset number of top-ranked K-mers in step (2) is 100.
4. The method of claim 1, wherein the threshold in step (3) comprises an upper threshold U and a lower threshold L on the number of K-mers, data with a count greater than U being male and data with a count less than L being female; when the count is between L and U, contamination between samples of different sexes is determined to exist.
5. The method of claim 1, wherein 100,000 records are randomly read from the different FASTQ files in step (3).
6. The method of claim 1, wherein the FASTQ file is generated by whole genome sequencing or whole exome sequencing.
CN202111149249.5A 2021-09-29 2021-09-29 Method for rapidly judging sample gender from FASTQ file Active CN113793641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111149249.5A CN113793641B (en) 2021-09-29 2021-09-29 Method for rapidly judging sample gender from FASTQ file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111149249.5A CN113793641B (en) 2021-09-29 2021-09-29 Method for rapidly judging sample gender from FASTQ file

Publications (2)

Publication Number Publication Date
CN113793641A CN113793641A (en) 2021-12-14
CN113793641B CN113793641B (en) 2023-11-28

Family

ID=78877534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111149249.5A Active CN113793641B (en) 2021-09-29 2021-09-29 Method for rapidly judging sample gender from FASTQ file

Country Status (1)

Country Link
CN (1) CN113793641B (en)

Also Published As

Publication number Publication date
CN113793641A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
Quintelier et al. Analyzing high-dimensional cytometry data using FlowSOM
US10347365B2 (en) Systems and methods for visualizing a pattern in a dataset
US11954614B2 (en) Systems and methods for visualizing a pattern in a dataset
Browning et al. Haplotype phasing: existing methods and new developments
Zhou et al. RNA-QC-chain: comprehensive and fast quality control for RNA-Seq data
CN106021984A (en) Whole-exome sequencing data analysis system
Yao et al. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers
Olson et al. Variant calling and benchmarking in an era of complete human genome sequences
WO2020035446A1 (en) Systems and methods for using neural networks for germline and somatic variant calling
US20090226916A1 (en) Automated Analysis of DNA Samples
CN111718982A (en) Tumor tissue single sample somatic mutation detection method and device
US20170228496A1 (en) System and method for process control of gene sequencing
CN107944228A (en) A kind of method for visualizing of gene sequencing variant sites
Gombolay et al. Ribose-Map: a bioinformatics toolkit to map ribonucleotides embedded in genomic DNA
Parrish et al. Assembly of non-unique insertion content using next-generation sequencing
CN110211640B (en) GPU parallel computing-based complex disease gene interaction correlation analysis method
CN109461473B (en) Method and device for acquiring concentration of free DNA of fetus
CN108256291A (en) It is a kind of to generate the method with higher confidence level detection in Gene Mutation result
Trapnell et al. Monocle: Cell counting, differential expression, and trajectory analysis for single-cell RNA-Seq experiments
CN113793641B (en) Method for rapidly judging sample gender from FASTQ file
JP6356015B2 (en) Gene expression information analyzing apparatus, gene expression information analyzing method, and program
Sater et al. UMI-Gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries
Rodriguez et al. A scalable, flexible workflow for MethylCap-seq data analysis
RU2804535C1 (en) Whole genome sequencing data processing system
do Nascimento et al. Copy number variations detection: unravelling the problem in tangible aspects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant