CN113257360B - Cancer screening model, and construction method and construction device of cancer screening model - Google Patents

Cancer screening model, and construction method and construction device of cancer screening model

Info

Publication number
CN113257360B
Authority
CN
China
Prior art keywords
file
genome
cnv
sample
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110707095.0A
Other languages
Chinese (zh)
Other versions
CN113257360A (en)
Inventor
曹善柏
周涛
张萌萌
郭璟
孙宏
楼峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Xiangxin Biotechnology Co ltd
Tianjin Xiangxin Medical Instrument Co ltd
Beijing Xiangxin Biotechnology Co ltd
Original Assignee
Tianjin Xiangxin Biotechnology Co ltd
Tianjin Xiangxin Medical Instrument Co ltd
Beijing Xiangxin Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Xiangxin Biotechnology Co ltd, Tianjin Xiangxin Medical Instrument Co ltd, Beijing Xiangxin Biotechnology Co ltd filed Critical Tianjin Xiangxin Biotechnology Co ltd
Priority to CN202110707095.0A priority Critical patent/CN113257360B/en
Publication of CN113257360A publication Critical patent/CN113257360A/en
Application granted granted Critical
Publication of CN113257360B publication Critical patent/CN113257360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Bioethics (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a cancer screening model, and a construction method and a construction device for the cancer screening model. The construction method comprises the following steps: constructing different CNV baselines based on different data; performing CNV detection on sWGS data of the samples to be tested using each of the CNV baselines to obtain a plurality of genomic features; establishing a training set and a test set from the sWGS data of the samples to be tested, plotting an ROC curve for the values of each genomic feature, and selecting the genomic feature with the largest AUC as the final genomic feature; and training a model on the final genomic feature to obtain the cancer screening model. With sWGS incorporated, a plurality of genomic features are obtained based on different baseline data, a training set and a test set are established, and a cancer screening model is built by a machine learning method, which improves the accuracy of (early-stage) cancer screening.

Description

Cancer screening model, and construction method and construction device of cancer screening model
Technical Field
The invention relates to the technical field of biology, in particular to a cancer screening model, a construction method and a construction device of the cancer screening model.
Background
Bladder cancer is one of the most common malignant tumors of the urinary system. Current bladder cancer monitoring relies on repeated cystoscopy, needle biopsy, and imaging examinations. Cystoscopy is considered the current gold standard for bladder cancer diagnosis, but the procedure is time-consuming, costly, and insensitive to carcinoma in situ, and it may lead to complications such as urinary tract infection, urinary tract injury, and bladder injury. Needle biopsy is highly invasive and can traumatize tissue. Imaging examinations involve radiation exposure. All of these conventional examination methods cause patients pain.
Because of the location of the lesion, the urine of bladder cancer patients often contains a large number of tumor cells shed from bladder cancer tissue, as well as small fragments of free DNA released when cancer cells undergo apoptosis and rupture. Urine exfoliative cytology, another current method for diagnosing bladder cancer, is non-invasive and highly specific, but its sensitivity is low (about 30%) owing to multiple factors, and it is reduced further in early-stage tumors, when intercellular adhesion keeps cells from shedding.
Therefore, when tumor tissue cannot be obtained or conventional methods perform poorly, low-depth whole genome sequencing of the tumor DNA contained in urine by NGS, followed by screening for genomic features associated with bladder cancer, offers a better means of examination that can reflect whether a patient has bladder cancer more efficiently, comprehensively, sensitively, and in real time.
At present, early cancer diagnosis mainly focuses on detection of common hot spot gene mutation conditions, and detection accuracy is improved by increasing the number of detection genes, detection depth and other methods. There is room for the development of early screening for cancer by increasing the detection of other genomic features.
Disclosure of Invention
The invention aims to provide a cancer screening model, a construction method and a construction device of the cancer screening model, and aims to solve the technical problem that the detection accuracy is difficult to improve under the condition of not increasing the number of detection genes or the detection depth in the prior art.
To achieve the above objects, according to one aspect of the present invention, there is provided a method of constructing a cancer screening model. The construction method comprises the following steps: constructing different CNV baselines based on different data; performing CNV detection on sWGS data of a sample to be detected by using different CNV baselines respectively to obtain a plurality of genome characteristics; establishing a training set and a testing set by using sWGS data of a sample to be detected, making an ROC curve for numerical values of all genome characteristics, and selecting the genome characteristic with the largest AUC value as a final genome characteristic; and carrying out model training on the final genome characteristics to obtain a cancer screening model.
Furthermore, the cancer is bladder cancer, and the samples to be detected comprise urine exfoliated cell samples of healthy people and urine exfoliated cell samples of patients with bladder cancer.
Further, constructing different CNV baselines based on different data includes: selecting a predetermined number of healthy people, obtaining their cfDNA sequencing information, aligning it to a reference genome, and constructing a cfCNV baseline; and constructing a 1000G.CNVbaseline based on population sample genome data in the 1000 Genomes database and the reference genome; preferably, the predetermined number is 50 or more; preferably, the population genome data in the 1000 Genomes database is the genome data of the Chinese population in that database.
Further, the construction method further comprises, after obtaining the cancer screening model: verifying the model using the test set; preferably, the final genomic feature is trained with a random forest model; preferably, the genomic features include the number of large-segment CNVs and the abnormal reads ratio.
Further, constructing a CNV baseline includes: S1, constructing a coordinate file from the reference genome information file; S2, using the coordinate file obtained in S1, the reference genome information file, and either the cfDNA sequencing data of the predetermined number of normal people or the population sample genome data in the 1000 Genomes database, obtaining for each sample a reads statistics file that records the number of reads in each interval bin; S3, analyzing the reference genome information file and the coordinate file to obtain a file containing the GC content of each bin; S4, analyzing the per-sample reads statistics files obtained in S2 together with the GC content file obtained in S3 to obtain the CNV baseline file.
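Step S3 can be illustrated with a minimal sketch that computes the GC content of consecutive fixed-width bins of one contig. The function name `gc_content_per_bin`, the fixed `bin_size`, and the handling of N bases are assumptions made for illustration; the text itself delegates this step to an existing tool rather than prescribing a computation.

```python
def gc_content_per_bin(sequence, bin_size):
    """Return (start, end, gc_fraction) for consecutive bins of one contig.

    Coordinates are 0-based and end-exclusive. N bases are left out of the
    denominator (an assumption); a bin without any A/C/G/T gets None.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    bins = []
    for start in range(0, len(sequence), bin_size):
        chunk = sequence[start:start + bin_size].upper()
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        fraction = gc / (gc + at) if (gc + at) else None
        bins.append((start, start + len(chunk), fraction))
    return bins
```

For example, `gc_content_per_bin("GGCCAATTNN", 4)` yields one fully GC bin, one fully AT bin, and a trailing all-N bin with no defined GC fraction.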
Further, CNV detection includes: using the coordinate file, the reference genome information file, and the sWGS data of the samples to be tested, obtaining for each sample to be tested a reads statistics file that records the number of reads in each interval bin; denoising the reads statistics file of the sample to be tested with the CNV baseline file to obtain a denoised file; merging the CNA segments in the denoised file to obtain a merged file; and analyzing the merged file to determine whether each CNV is an amplification, a deletion, or normal.
Further, the number of large-segment CNVs is obtained by large-segment CNV detection, which includes: 1) downloading the cytogenetic band file corresponding to the reference genome from the UCSC database; 2) calculating the copy number within each segment from the merged file to obtain the CNV segments in an amplified or deleted state, wherein a CNV segment whose start-end interval intersects the range of a chromosome band in the cytogenetic band file is a large-segment CNV.
According to another aspect of the present invention, there is provided an apparatus for constructing a cancer screening model. The construction apparatus includes: the CNV baseline construction module is set to construct different CNV baselines based on different data; the genome feature acquisition module is set to carry out CNV detection on sWGS data of a sample to be detected by using different CNV baselines respectively so as to acquire a plurality of genome features; the final genome feature determination module is set to establish a training set and a test set by utilizing the sWGS data of the sample to be detected, the numerical value of each genome feature is made into an ROC curve, and the genome feature with the largest AUC value is selected as the final genome feature; and the model training module is used for carrying out model training on the final genome characteristics to obtain a cancer screening model.
Furthermore, the cancer is bladder cancer, and the samples to be detected comprise urine exfoliated cell samples of healthy people and urine exfoliated cell samples of patients with bladder cancer.
Further, the CNV baseline construction module comprises: a cfCNV baseline construction submodule configured to select a predetermined number of healthy people, obtain their cfDNA sequencing information, align it to a reference genome, and construct a cfCNV baseline; and a 1000G.CNVbaseline construction submodule configured to construct a 1000G.CNVbaseline based on population sample genome data in the 1000 Genomes database and the reference genome; preferably, the predetermined number is 50 or more; preferably, the population genome data in the 1000 Genomes database is the genome data of the Chinese population in that database.
Further, the construction apparatus further includes: a verification module configured to verify the model using the test set; preferably, the final genomic feature is trained with a random forest model; preferably, the genomic features include the number of large-segment CNVs and the abnormal reads ratio.
Further, the cfCNV baseline construction submodule and the 1000G.CNVbaseline construction submodule each include: a coordinate file construction submodule configured to construct a coordinate file from the reference genome information file; a reads statistics submodule configured to use the coordinate file obtained by the coordinate file construction submodule, the reference genome information file, and either the cfDNA sequencing data of the predetermined number of normal people or the population sample genome data in the 1000 Genomes database to obtain, for each sample, a reads statistics file that records the number of reads in each interval bin; a GC content statistics submodule configured to analyze the reference genome information file and the coordinate file to obtain a file containing the GC content of each bin; and a CNV baseline file forming submodule configured to analyze the per-sample reads statistics files obtained by the reads statistics submodule together with the GC content file obtained by the GC content statistics submodule to obtain the CNV baseline file.
Further, the genome feature acquisition module comprises: a reads statistics file acquisition submodule configured to use the coordinate file, the reference genome information file, and the sWGS data of the samples to be tested to obtain, for each sample to be tested, a reads statistics file that records the number of reads in each interval bin; a denoising submodule configured to denoise the reads statistics file of the sample to be tested with the CNV baseline file to obtain a denoised file; a merging submodule configured to merge the CNA segments in the denoised file to obtain a merged file; and a judgment submodule configured to analyze the merged file to determine whether each CNV is an amplification, a deletion, or normal.
Further, the number of large-segment CNVs is obtained by a large-segment CNV detection submodule, which is configured to: 1) download the cytogenetic band file corresponding to the reference genome from the UCSC database; 2) calculate the copy number within each segment from the merged file to obtain the CNV segments in an amplified or deleted state, wherein a CNV segment whose start-end interval intersects the range of a chromosome band in the cytogenetic band file is a large-segment CNV.
According to yet another aspect of the present invention, a cancer screening model is provided. The cancer screening model is constructed by any one of the construction methods of the cancer screening model.
According to yet another aspect of the present invention, a cancer screening device is provided. The cancer screening device comprises the cancer screening model.
By applying the technical solution of the invention, with sWGS incorporated, a plurality of genomic features, such as the number of large-segment CNVs and the abnormal reads ratio, are obtained based on different baseline data; a training set and a test set are established from the sWGS data of the samples to be tested; and a cancer screening model is built by a machine learning method, which improves the accuracy of (early-stage) cancer screening.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows a schematic flow diagram of CNV baseline construction according to an embodiment of the present invention;
FIG. 2 shows a CNV detection flow diagram according to an embodiment of the invention;
FIG. 3 is a schematic diagram illustrating a large segment CNV detection process according to an embodiment of the present invention;
FIG. 4 illustrates a model building and prediction flow diagram according to an embodiment of the present invention;
FIG. 5 shows an abnormalReads characteristic ROC curve according to example 1;
FIG. 6 shows the cfDNAbaselineCNV characteristic ROC curve according to example 1;
FIG. 7 shows the 1000GbaselineCNV characteristic ROC curve according to example 1; and
FIG. 8 shows a test set test ROC curve according to example 1.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Interpretation of terms:
sWGS: shallow whole genome sequencing, i.e., low-depth whole genome sequencing.
SCOB: bladder cancer.
CNV: copy number variation, i.e., variation in gene copy number.
Baseline: a baseline.
Hereinafter, the present invention will be described in further detail by taking bladder cancer as an example.
Conventional detection methods for bladder cancer, such as cystoscopy, tissue puncture, and urine cytology, suffer from complicated procedures, additional pain for patients, and low accuracy.
Liquid biopsy currently targets three main categories of analyte: circulating tumor cells (CTC), circulating tumor DNA (ctDNA), and exosomes. These three tumor markers are obtained mainly by drawing peripheral venous blood and centrifuging it. Because of the location of the lesion, the urine of bladder cancer patients often contains a large number of tumor cells shed from bladder cancer tissue, as well as small fragments of free DNA released when cancer cells undergo apoptosis and rupture. The invention performs low-depth whole genome sequencing on the tumor DNA contained in urine by NGS and screens genomic features associated with bladder cancer in order to screen for bladder cancer. This provides a new means of examination that can reflect whether a patient has bladder cancer more efficiently, comprehensively, sensitively, and in real time.
Compared with other liquid biopsy methods, the technical algorithm is characterized in that:
1. Low-depth whole genome sequencing (sWGS) is performed using urine exfoliated cells.
2. cfDNA from normal people (in the invention, normal people means healthy people) and Chinese population data from the 1000 Genomes database are collected to construct CNV baselines, and the CNV baselines are used to detect CNVs in the samples.
3. Multiple kinds of genomic features, such as the number of large-segment CNVs (generally large genomic segments longer than 1 kb) obtained with the different CNV baselines and the abnormal reads ratio, are screened, and the best genomic feature is retained.
4. An early diagnosis model of bladder cancer is constructed by a machine learning method.
In one embodiment of the present invention, the CNV baseline construction process includes:
Step 1: from the reference genome fasta file and the WGS bed file, construct a new coordinate file, preprocessed.interval.list, using the PreprocessIntervals command in GATK, with no maximum value set for the bin window;
Step 2: using the coordinate file preprocessed.interval.list from Step 1, the existing reference genome fasta file, and the sWGS result bam file obtained by sequencing each sample, obtain a reads statistics file for each sample with the CollectReadCounts command in GATK; the file records the number of reads in each bin;
Step 3: from the reference genome fasta file and the preprocessed.interval.list file from Step 1, obtain a file containing the GC content of each bin, GC_content.interval_list, using the AnnotateIntervals command in GATK, for use in the subsequent baseline construction;
Step 4: from the per-sample interval reads statistics files obtained in Step 2 and the GC_content.interval_list file from Step 3, obtain the final CNV baseline file, CNVbaseline.pon, using the CreateReadCountPanelOfNormals command in GATK.
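The four baseline construction steps can be sketched as a small helper that assembles the corresponding GATK command lines without running them. This is an illustration under stated assumptions, not the patented pipeline: the function name `baseline_commands`, the output directory, the `.counts.hdf5` and `.pon.hdf5` file names, and the choice of `--interval-merging-rule OVERLAPPING_ONLY` are all hypothetical, and no bin-length option is set because the text does not give one.

```python
import os


def baseline_commands(reference, bed, normal_bams, out_dir="baseline"):
    """Build GATK argument lists for Steps 1-4 of baseline construction."""
    intervals = f"{out_dir}/preprocessed.interval.list"
    gc_file = f"{out_dir}/GC_content.interval_list"
    pon = f"{out_dir}/CNVbaseline.pon.hdf5"  # hypothetical output name
    merge = ["--interval-merging-rule", "OVERLAPPING_ONLY"]  # assumed setting

    # Step 1: bin the genome into a coordinate (interval) file.
    cmds = [["gatk", "PreprocessIntervals", "-R", reference, "-L", bed,
             *merge, "-O", intervals]]

    # Step 2: count reads per bin for every baseline sample.
    count_files = []
    for bam in normal_bams:
        sample = os.path.splitext(os.path.basename(bam))[0]
        counts = f"{out_dir}/{sample}.counts.hdf5"
        count_files.append(counts)
        cmds.append(["gatk", "CollectReadCounts", "-I", bam, "-R", reference,
                     "-L", intervals, *merge, "-O", counts])

    # Step 3: annotate each bin with its GC content.
    cmds.append(["gatk", "AnnotateIntervals", "-R", reference, "-L", intervals,
                 *merge, "-O", gc_file])

    # Step 4: combine the per-sample counts and GC annotation into the baseline.
    pon_cmd = ["gatk", "CreateReadCountPanelOfNormals"]
    for counts in count_files:
        pon_cmd += ["-I", counts]
    pon_cmd += ["--annotated-intervals", gc_file, "-O", pon]
    cmds.append(pon_cmd)
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where GATK is installed; keeping construction separate from execution makes the sequence easy to inspect first.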
In an embodiment of the present invention, the CNV detection process (see fig. 2) includes:
Step 1: using the WGS coordinate file preprocessed.interval.list from Step 1 of the CNV baseline construction, the reference genome fasta file, and the bam file of the sample to be tested, obtain the reads statistics file for the sample to be tested with the CollectReadCounts command in GATK;
Step 2: denoise the reads statistics file obtained in Step 1 against the existing CNV baseline file using the DenoiseReadCounts command in GATK;
Step 3: from the condensed.tsv file produced in Step 2, merge the CNA segments using the ModelSegments command in GATK to obtain a merged file whose contents include the mean log2 copy ratio;
Step 4: from the result file of Step 3, determine whether each CNV is an amplification, a deletion, or normal using the CallCopyRatioSegments command in GATK.
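The detection steps for one sample can be sketched in the same way. The helper below only assembles the four GATK command lines; the function name `detection_commands`, the output directory, and the intermediate file names (for example `standardizedCR.tsv` and `denoisedCR.tsv`) are assumptions for illustration and may differ from the files used in the embodiment.

```python
import os


def detection_commands(sample_bam, reference, intervals, baseline, out_dir="cnv"):
    """Build GATK argument lists for Steps 1-4 of CNV detection on one sample."""
    sample = os.path.splitext(os.path.basename(sample_bam))[0]
    counts = f"{out_dir}/{sample}.counts.hdf5"
    standardized = f"{out_dir}/{sample}.standardizedCR.tsv"
    denoised = f"{out_dir}/{sample}.denoisedCR.tsv"
    return [
        # Step 1: per-bin read counts for the sample to be tested.
        ["gatk", "CollectReadCounts", "-I", sample_bam, "-R", reference,
         "-L", intervals, "--interval-merging-rule", "OVERLAPPING_ONLY",
         "-O", counts],
        # Step 2: denoise the counts against the CNV baseline.
        ["gatk", "DenoiseReadCounts", "-I", counts,
         "--count-panel-of-normals", baseline,
         "--standardized-copy-ratios", standardized,
         "--denoised-copy-ratios", denoised],
        # Step 3: merge bins into segments with a mean log2 copy ratio.
        ["gatk", "ModelSegments", "--denoised-copy-ratios", denoised,
         "--output", out_dir, "--output-prefix", sample],
        # Step 4: call each segment as amplified (+), deleted (-) or normal (0).
        ["gatk", "CallCopyRatioSegments",
         "--input", f"{out_dir}/{sample}.cr.seg",
         "--output", f"{out_dir}/{sample}.called.seg"],
    ]
```

Running the function once per baseline (cfDNA-based and 1000 Genomes-based) gives two sets of calls per sample, which is what lets the two baselines produce separate features.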
In one embodiment of the present invention, the large-segment CNV detection process (see fig. 3) includes:
Step 1: download the cytogenetic band file corresponding to the reference genome from the UCSC database and name it CytoBand.txt;
Step 2: calculate the copy number within each segment from the mean log2 copy ratio in the Step 3 result file of the CNV detection process to obtain the CNV segments in the gain and loss states; a CNV whose start-end interval intersects the range of a chromosome band in the CytoBand.txt file is a large-segment CNV.
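A minimal sketch of the counting step follows. It assumes tab-separated segment rows in the order contig, start, end, number of points, mean log2 copy ratio, call (with `+`, `-`, `0` calls and `@`-prefixed header lines), and cytogenetic band rows that begin with chromosome, start, end. The function names, the 1-based inclusive versus 0-based half-open coordinate handling, and the conversion `ploidy * 2 ** log2_ratio` are illustrative assumptions rather than details stated in the text.

```python
def parse_called_segments(lines):
    """Return (contig, start, end, mean_log2_cr, call) tuples from segment rows."""
    segments = []
    for line in lines:
        if not line.strip() or line.startswith("@") or line.startswith("CONTIG"):
            continue  # skip blank lines, header block, and column header
        contig, start, end, _n, log2cr, call = line.rstrip("\n").split("\t")[:6]
        segments.append((contig, int(start), int(end), float(log2cr), call))
    return segments


def parse_cytobands(lines):
    """Return {chrom: [(start, end), ...]} from cytogenetic band rows."""
    bands = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        bands.setdefault(chrom, []).append((int(start), int(end)))
    return bands


def copy_number(mean_log2_copy_ratio, ploidy=2):
    """Convert a mean log2 copy ratio to an absolute copy number (assumed diploid)."""
    return ploidy * 2 ** mean_log2_copy_ratio


def count_large_cnvs(segments, bands):
    """Count gain/loss segments whose interval intersects any band on the contig."""
    count = 0
    for contig, start, end, _log2cr, call in segments:
        if call not in ("+", "-"):
            continue
        # Segment is treated as 1-based inclusive, band as 0-based half-open.
        if any(start <= b_end and end > b_start
               for b_start, b_end in bands.get(contig, [])):
            count += 1
    return count
```

The result of `count_large_cnvs` is the per-sample number that later serves as a candidate feature; a segment on a contig without any band entry is simply not counted.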
In one embodiment of the present invention, the model construction and prediction process (see FIG. 4) includes:
Step A: collect urine from a number (e.g., 29) of healthy people and a number (e.g., 21) of bladder cancer patients;
Step B: collect cfDNA samples from the normal population and align them to the reference genome to obtain bam files for constructing a CNV baseline, then count, for each sample, the number of large-segment CNVs detected against the normal-population cfDNA baseline (cfDNAbaselineCNV). The CNV baseline construction process is shown in fig. 1;
Step C: download bam files of Chinese population samples from the 1000 Genomes database for constructing a CNV baseline, then count, for each sample, the number of large-segment CNVs detected against the 1000 Genomes population baseline (1000GbaselineCNV). The CNV baseline construction process is shown in fig. 1;
Step D: from the bam file of each sample, compute the abnormally aligned reads statistics: the ratio of soft-clip reads to all reads, the ratio of reads with an insert size greater than 100000, and the total ratio of these two kinds of abnormal reads (abnormalReads);
step E: dividing the sample into a training set (80%) and a testing set (20%);
step F: making an ROC curve for the numerical value of each characteristic, and selecting the characteristic with the maximum AUC value as a final characteristic;
step G: training the characteristics through a random forest model;
step I: the model is validated using a test set.
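The abnormal reads statistics of Step D can be sketched on SAM-formatted text lines (for instance, as produced by converting a bam file to text). The function name `abnormal_read_ratios` is hypothetical, and the total ratio is computed here as the fraction of reads that are soft-clipped or have a large insert size, which is one possible reading of "total ratio"; the text does not say whether reads that meet both criteria are counted once or twice.

```python
import re

SOFT_CLIP = re.compile(r"\d+S")  # a soft-clip operation inside a CIGAR string


def abnormal_read_ratios(sam_lines, max_insert=100000):
    """Return soft-clip, large-insert, and combined abnormal read ratios."""
    total = soft = large = either = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        cigar, tlen = fields[5], int(fields[8])
        is_soft = bool(SOFT_CLIP.search(cigar))
        is_large = abs(tlen) > max_insert
        total += 1
        soft += is_soft
        large += is_large
        either += is_soft or is_large
    if total == 0:
        raise ValueError("no alignment records found")
    return {
        "soft_clip": soft / total,
        "large_insert": large / total,
        "abnormal": either / total,  # assumed definition of the total ratio
    }
```

The `abnormal` value corresponds to the abnormalReads feature under the stated assumption; summing the two separate ratios instead would be the alternative reading.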
The following examples are provided to further illustrate the advantageous effects of the present invention.
Example 1
Objective: train a model on the processed data features, then verify the accuracy of the model using the test set.
The method comprises the following steps:
1. a data set was established consisting of 29 healthy human urine samples, 21 bladder cancer patients urine samples.
2. A baseline (baseline) for CNV was constructed based on cfDNA samples from the normal population. The step1 of the construction process: and (3) comparing the sequencing of the cfDNA of 60 normal people with a reference genome to obtain a bam file. Step2: construct a new coordinate file preprcessed.interval.list using preprcessintervals command in GATK by referring to the genome fasta file and the bed file of WGS. And 3, step3: and (3) obtaining a reads statistical file corresponding to each sample by utilizing a CollactReadCounts command in the GATK, and recording the number of reads in each bin through the coordinate file preprocessed. interval.list obtained in the step2, the existing reference genome fasta file, and the existing bam files for sequencing and comparing cfDNA of 60 normal crowds. And 4, step 4: and obtaining a file GC _ content.interval _ list containing GC content in each bin by using an AntotateInterval command in the GATK by referring to a genome fasta file and a preprocessed.interval.list file in the step 2. And 5, step 5: and obtaining a final cfDNA-based CNVbias file cfDNA-CNVbias-pon by using a createReadCountPannelOfNormals command in the GATK through the section reads statistical file of each sample obtained in the step3 and the GC _ content _ interval _ list file in the step 4.
3. A CNV baseline (baseline) is constructed based on Chinese population samples in a thousand-person genome database. The first step is as follows: the bam files of Chinese people (CHB) were downloaded from the 1000 genes database. The subsequent construction process of the obtained bam file is the same as the CNVbias line construction process based on cfDNA, and finally the 1000 gene-based CNVbias line file 1000 G.CNVbias line
4. CNVbaseline based on cfDNA, CNV detection was performed on 50 test samples (29 normal patient samples, 21 urine samples from cancer patients) using GATK software. Step1 of the detection process: a reads statistical file corresponding to a sample to be detected is obtained by using a CollactAdCounts command in a GATK through a WGS coordinate file preprocessed. interval.list in step1 in a CNV baseline process constructed based on cfDNA, a genome fasta file and 50 sample bam files to be detected. Step2: and (3) carrying out noise reduction on the reads statistical file by using the reads statistical file corresponding to the sample to be detected obtained in the step1 and the existing CNVbase. And 3, step3: merge the CNV segments with model segments command in GATK via step2 to get merged file, the content of which includes mean log2 copy ratio. And 4, judging whether the CNV is amplified (+), deleted (-), or normal (0) by using the result file in the step3 and a CallCopyRatioSegments command in the GATK. Exemplary results are shown in table 1.
5. The CNV obtained by CNVbase detection based on cfDNA is counted, the number of large-fragment CNV of each sample to be detected is counted, the first column of the file is the name of the sample, and the second example is the number of the large-fragment CNV. Step1 of the statistical process: the UCSC database downloads a cytogenetic band file corresponding to the reference genome, named cytoband. Step2: and calculating copy number in the segment through mean log2 copy ratio in a Step3 result file in a CNV detection process to obtain a CNV segment in gain and loss states, wherein the start interval and the end interval of the CNV have intersection with the range of chromosome segments in the CytoBand.
6. Using the 1000 Genomes-based CNV baseline, CNV detection was performed with GATK on the 50 test samples (29 urine samples from healthy people and 21 urine samples from bladder cancer patients). The detection procedure is the same as in step 4.
7. The CNVs detected with the 1000 Genomes-based CNV baseline are tallied, and the number of large-fragment CNVs in each test sample is written to a file whose first column is the sample name and whose second column is the number of large-fragment CNVs. The counting procedure is the same as in step 5.
8. Abnormally aligned reads are counted. The BAM files of the 50 test samples are read one by one, and three ratios are computed for each: the proportion of soft-clipped reads among all reads, the proportion of reads with an insert size greater than 100000 bp (insert_size > 100000), and the combined proportion of these two kinds of abnormal reads.
9. The data set of 50 samples (29 urine samples from healthy people and 21 urine samples from bladder cancer patients) was divided into a training set (80%) and a test set (20%).
10. An ROC curve is plotted for each of the 3 features (abnormalReads, cfDNAbaselineCNV, and 1000GbaselineCNV), and the feature with the largest AUC is selected as the feature used for model training.
11. A model is trained on the selected feature with the random forest method, using the training set (80%).
12. The model was validated using the test set (20%).
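The abnormal-read statistics of step 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the BAM file has already been converted to SAM text lines, the function name `abnormal_read_ratios` is hypothetical, every alignment record is counted as one read, and the combined ratio counts a read once if it meets either criterion (the text does not say whether the two ratios are summed or unioned).

```python
import re

# Threshold taken from step 8 (insert_size > 100000); the constant name is hypothetical.
MAX_INSERT_SIZE = 100_000


def abnormal_read_ratios(sam_lines):
    """Return (soft_clip_ratio, large_insert_ratio, combined_ratio) for SAM records."""
    total = soft_clipped = large_insert = either = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        cigar = fields[5]      # SAM column 6: CIGAR string
        tlen = int(fields[8])  # SAM column 9: signed template length
        is_soft = re.search(r"\d+S", cigar) is not None
        is_large = abs(tlen) > MAX_INSERT_SIZE
        total += 1
        soft_clipped += is_soft
        large_insert += is_large
        # Assumption: a read meeting both criteria is counted once.
        either += is_soft or is_large
    if total == 0:
        return 0.0, 0.0, 0.0
    return soft_clipped / total, large_insert / total, either / total
```

Each of the three returned values could serve as the abnormalReads feature value for one sample; which of them the model actually uses is not stated in this section.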
Table 1: Sample CNV detection result file (provided as an image in the original document)
1. Feature selection results:
Comparing the AUC of the 3 features (abnormalReads, cfDNAbaselineCNV, and 1000GbaselineCNV) led to one feature being retained: 1000GbaselineCNV. The ROC curves of the features are shown in Figs. 5, 6 and 7: Fig. 5 shows the ROC curve of the abnormalReads feature (AUC = 0.8), Fig. 6 shows the ROC curve of the cfDNAbaselineCNV feature (AUC = 0.66), and Fig. 7 shows the ROC curve of the 1000GbaselineCNV feature (AUC = 1).
2. Model performance verification results:
The selected feature was used to train a random forest model, which was then verified on the test set; the ROC curve for the test set is shown in Fig. 8. On the test set the model reached AUC = 1, indicating that it completely distinguished cancer samples from normal samples, with 100% accuracy.
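The AUC-based feature selection described above can be sketched in a few lines. This is an illustrative sketch only: the function names `auc` and `select_feature` are hypothetical, the AUC is computed through the rank-sum (Mann-Whitney) identity rather than by plotting the curve, and it assumes that a larger feature value indicates cancer (label 1).

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Fraction of (cancer, healthy) pairs in which the cancer sample scores higher.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_feature(features, labels):
    """Return the name of the feature with the largest AUC, plus all AUC values."""
    aucs = {name: auc(values, labels) for name, values in features.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs
```

Here `features` maps a feature name such as 1000GbaselineCNV to its per-sample values, in the same order as `labels`. The selected feature would then be passed to a random forest trainer, which is outside the scope of this sketch.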
From the above description, it can be seen that the above embodiments of the present invention achieve the following technical effect: by applying low-depth whole-genome sequencing to urine exfoliated cells and using the large-fragment CNVs detected against the 1000 Genomes CNV baseline as the model training feature, bladder cancer can be diagnosed effectively and at an early stage.
Example 2
In this embodiment, an apparatus for constructing a cancer screening model is provided. The construction apparatus includes: a CNV baseline construction module configured to construct different CNV baselines based on different data; a genome feature acquisition module configured to perform CNV detection on the sWGS data of a sample to be detected using each of the different CNV baselines, so as to obtain a plurality of genome features; a final genome feature determination module configured to establish a training set and a test set from the sWGS data of the sample to be detected, plot an ROC curve for the values of each genome feature, and select the genome feature with the largest AUC value as the final genome feature; and a model training module configured to perform model training on the final genome feature to obtain the cancer screening model.
Specifically, the cancer is bladder cancer, and the sample to be detected comprises a sample of urine exfoliated cells of healthy people and a sample of urine exfoliated cells of bladder cancer patients.
The CNV baseline construction module includes: a cfCNV baseline construction submodule configured to select a predetermined number of normal persons, acquire their cfDNA sequencing information, align the cfDNA sequencing information to a reference genome, and construct the cfCNV baseline; and a 1000G.CNVbaseline construction submodule configured to construct the 1000G.CNVbaseline based on population sample genome data in the 1000 Genomes database and the reference genome. Preferably, the predetermined number is 50 or more, for example 60; preferably, the population genome data in the 1000 Genomes database is the genome data of the Chinese population in that database.
The construction apparatus further includes a verification module configured to verify the model using the test set. Preferably, the final genome feature is trained with a random forest model; preferably, the genome features include the number of large-segment CNVs and the proportion of abnormal reads.
The cfCNV baseline construction submodule and the 1000G.CNVbaseline construction submodule each include:
a coordinate file construction submodule configured to construct a coordinate file from the reference genome information file;
a read count submodule configured to use the coordinate file obtained by the coordinate file construction submodule, the reference genome information file, and either the cfDNA sequencing data of the predetermined number of normal persons or the population sample genome data in the 1000 Genomes database, to obtain for each sample a read count file recording the number of reads in each interval bin;
a GC content statistics submodule configured to analyze the reference genome information file and the coordinate file to obtain a file containing the GC content of each bin;
and a CNV baseline file forming submodule configured to analyze the per-sample read count files obtained by the read count submodule together with the per-bin GC content file obtained by the GC content statistics submodule to obtain the CNV baseline file.
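To make the per-bin GC content statistic concrete, the following minimal sketch computes it for a single chromosome sequence split into fixed-width bins. It is an assumption-laden illustration: the function name `gc_content_per_bin` is hypothetical, bins are reported as 0-based half-open intervals, and N bases are excluded from the denominator. In a GATK workflow this kind of annotation is usually produced by a dedicated tool such as AnnotateIntervals; this section does not name the command actually used.

```python
def gc_content_per_bin(sequence, bin_size):
    """Return (start, end, gc_fraction) for consecutive bins of one sequence.

    gc_fraction is None for a bin that contains no A/C/G/T bases (e.g. all N).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    rows = []
    for start in range(0, len(sequence), bin_size):
        chunk = sequence[start:start + bin_size].upper()
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        fraction = gc / (gc + at) if gc + at else None
        rows.append((start, start + len(chunk), fraction))
    return rows
```

The resulting per-bin values are what a baseline builder would use to correct read counts for GC bias before the baseline file is formed.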
The genome feature acquisition module includes:
a read count file acquisition submodule configured to use the coordinate file, the reference genome information file, and the sWGS data of the sample to be detected to obtain, for each sample to be detected, a read count file recording the number of reads in each interval bin;
a noise reduction submodule configured to denoise the read count file of each sample to be detected using the CNV baseline file, obtaining a denoised file;
a merging submodule configured to merge the CNA segments in the denoised file to obtain a merged file;
and a judgment submodule configured to determine, by analyzing the merged file, whether each CNV is amplified, deleted, or normal.
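The four submodules above correspond to the four GATK steps of Example 1. The sketch below only assembles the command lines; it does not run them. Several points are assumptions rather than statements from the text: the function name `build_cnv_commands` and all file names are hypothetical, DenoiseReadCounts is our guess for the noise reduction step (the text names only CollectReadCounts, ModelSegments, and CallCopyRatioSegments), and the CNV baseline file is assumed to be a GATK panel of normals.

```python
from pathlib import Path


def build_cnv_commands(sample, bam, intervals, reference, baseline, outdir):
    """Return the GATK command lines (as argument lists) for one sample."""
    out = Path(outdir)
    counts = out / f"{sample}.counts.hdf5"
    standardized = out / f"{sample}.standardizedCR.tsv"
    denoised = out / f"{sample}.denoisedCR.tsv"
    return [
        # 1) Count reads per interval bin.
        ["gatk", "CollectReadCounts", "-I", str(bam), "-L", str(intervals),
         "-R", str(reference), "--interval-merging-rule", "OVERLAPPING_ONLY",
         "-O", str(counts)],
        # 2) Denoise the counts against the CNV baseline (assumed panel of normals).
        ["gatk", "DenoiseReadCounts", "-I", str(counts),
         "--count-panel-of-normals", str(baseline),
         "--standardized-copy-ratios", str(standardized),
         "--denoised-copy-ratios", str(denoised)],
        # 3) Merge bins into segments; writes <sample>.cr.seg into outdir.
        ["gatk", "ModelSegments", "--denoised-copy-ratios", str(denoised),
         "--output-prefix", sample, "-O", str(out)],
        # 4) Call each segment as amplified (+), deleted (-), or normal (0).
        ["gatk", "CallCopyRatioSegments", "-I", str(out / f"{sample}.cr.seg"),
         "-O", str(out / f"{sample}.called.seg")],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine with GATK installed; keeping command construction separate from execution makes the pipeline easy to inspect before it is run.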
The number of large-segment CNVs is obtained by a large-segment CNV detection submodule configured to: 1) download the cytogenetic band file corresponding to the reference genome from the UCSC database; and 2) calculate the copy number within each segment from the merged file to obtain CNV segments in the amplified and deleted states, wherein a CNV segment whose start-end interval intersects the range of a chromosome segment in the cytogenetic band file is a large-segment CNV.
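The large-segment CNV count can be sketched as below, under stated assumptions. The conversion from mean log2 copy ratio to copy number assumes a diploid reference (copy number = 2 × 2^ratio). The gain and loss cut-offs `gain_cn` and `loss_cn` are hypothetical defaults, since the text gives no thresholds. The intersection test is a plain interval overlap on 0-based half-open coordinates, as used by UCSC cytoBand files; any additional size criterion the inventors may apply is not stated and is not modeled here.

```python
def copy_number(mean_log2_ratio, ploidy=2):
    """Convert a mean log2 copy ratio into an absolute copy number."""
    return ploidy * 2 ** mean_log2_ratio


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a


def count_large_cnvs(segments, bands, gain_cn=2.5, loss_cn=1.5):
    """Count gain/loss segments that intersect a cytogenetic band.

    segments: iterable of (chrom, start, end, mean_log2_ratio)
    bands:    iterable of (chrom, start, end) taken from the cytoband file
    gain_cn / loss_cn are hypothetical thresholds, not values from the source.
    """
    count = 0
    for chrom, start, end, log2_ratio in segments:
        cn = copy_number(log2_ratio)
        if loss_cn < cn < gain_cn:
            continue  # treated as copy-neutral
        if any(b_chrom == chrom and overlaps(start, end, b_start, b_end)
               for b_chrom, b_start, b_end in bands):
            count += 1
    return count
```

Running `count_large_cnvs` once per sample yields the second column of the per-sample statistics file described in Example 1.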
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of constructing a cancer screening model, comprising the steps of:
constructing different CNV baselines based on different data;
performing CNV detection on the sWGS data of a sample to be detected by using different CNV baselines respectively to obtain a plurality of genome characteristics;
establishing a training set and a testing set by using the sWGS data of the sample to be detected, making an ROC curve for the numerical value of each genome characteristic, and selecting the genome characteristic with the maximum AUC value as a final genome characteristic;
performing model training on the final genome characteristics to obtain the cancer screening model;
the constructing of different CNV baselines based on different data comprises:
selecting a predetermined number of healthy people, obtaining cfDNA sequencing information of the healthy people, aligning the cfDNA sequencing information to a reference genome, and constructing a cfCNV baseline;
constructing a 1000G.CNVbaseline based on population sample genome data in the 1000 Genomes database and the reference genome;
the construction method further comprises, after obtaining the cancer screening model: verifying the model using the test set;
the genome characteristics comprise the number of large-segment CNVs and the proportion of abnormal reads;
the construction of each CNV baseline comprises the following steps:
S1, constructing a coordinate file from the reference genome information file;
S2, using the coordinate file obtained in S1, the reference genome information file, and either the cfDNA sequencing data of the predetermined number of normal persons or the population sample genome data in the 1000 Genomes database, to obtain for each sample a read count file recording the number of reads in each interval bin;
S3, analyzing the reference genome information file and the coordinate file to obtain a file containing the GC content of each bin;
S4, analyzing the per-sample read count files obtained in S2 together with the per-bin GC content file obtained in S3 to obtain a CNV baseline file;
the CNV detection comprises:
using the coordinate file, the reference genome information file, and the sWGS data of the sample to be detected to obtain, for each sample to be detected, a read count file recording the number of reads in each interval bin;
denoising the read count file of each sample to be detected using the CNV baseline file to obtain a denoised file;
merging the CNA segments in the denoised file to obtain a merged file;
analyzing the merged file to determine whether each CNV is amplified, deleted, or normal;
the number of large-segment CNVs is obtained by large-segment CNV detection, which comprises the following steps:
1) downloading the cytogenetic band file corresponding to the reference genome from the UCSC database;
2) calculating the copy number within each segment from the merged file to obtain CNV segments in the amplified and deleted states, wherein a CNV segment whose start-end interval intersects the range of a chromosome segment in the cytogenetic band file is a large-segment CNV;
the cancer is bladder cancer, and the samples to be detected comprise urine exfoliated cell samples of healthy people and urine exfoliated cell samples of bladder cancer patients.
2. The construction method according to claim 1, wherein the predetermined number is 50 or more people.
3. The construction method according to claim 1, wherein the population genome data in the 1000 Genomes database is the genome data of the Chinese population in the 1000 Genomes database.
4. The construction method according to claim 1, wherein the final genome features are trained using a random forest model.
5. A construction apparatus for a cancer screening model, the construction apparatus comprising:
a CNV baseline construction module configured to construct different CNV baselines based on different data;
a genome feature acquisition module configured to perform CNV detection on the sWGS data of a sample to be detected using each of the different CNV baselines to obtain a plurality of genome features;
a final genome feature determination module configured to establish a training set and a test set from the sWGS data of the sample to be detected, plot an ROC curve for the values of each genome feature, and select the genome feature with the largest AUC value as the final genome feature;
a model training module configured to perform model training on the final genome feature to obtain the cancer screening model;
the CNV baseline construction module comprises:
a cfCNV baseline construction submodule configured to select a predetermined number of healthy people, acquire cfDNA sequencing information of the healthy people, align the cfDNA sequencing information to a reference genome, and construct a cfCNV baseline;
a 1000G.CNVbaseline construction submodule configured to construct a 1000G.CNVbaseline based on population sample genome data in the 1000 Genomes database and the reference genome;
the construction apparatus further includes: a validation module configured to validate a model using the test set;
the genome characteristics comprise the number of large-segment CNV and the proportion of abnormal reads;
the cfCNV baseline construction submodule and the 1000G.CNVbaseline construction submodule each comprise:
a coordinate file construction submodule configured to construct a coordinate file from the reference genome information file;
a read count submodule configured to use the coordinate file obtained by the coordinate file construction submodule, the reference genome information file, and either the cfDNA sequencing data of the predetermined number of normal persons or the population sample genome data in the 1000 Genomes database, to obtain for each sample a read count file recording the number of reads in each interval bin;
a GC content statistics submodule configured to analyze the reference genome information file and the coordinate file to obtain a file containing the GC content of each bin;
a CNV baseline file forming submodule configured to analyze the per-sample read count files obtained by the read count submodule together with the per-bin GC content file obtained by the GC content statistics submodule to obtain the CNV baseline file;
the genome feature acquisition module comprises:
a read count file acquisition submodule configured to use the coordinate file, the reference genome information file, and the sWGS data of the sample to be detected to obtain, for each sample to be detected, a read count file recording the number of reads in each interval bin;
a noise reduction submodule configured to denoise the read count file of each sample to be detected using the CNV baseline file to obtain a denoised file;
a merging submodule configured to merge the CNA segments in the denoised file to obtain a merged file;
a judging submodule configured to determine, by analyzing the merged file, whether each CNV is amplified, deleted, or normal;
the number of large-segment CNVs is obtained by a large-segment CNV detection submodule configured to:
1) download the cytogenetic band file corresponding to the reference genome from the UCSC database;
2) calculate the copy number within each segment from the merged file to obtain CNV segments in the amplified and deleted states, wherein a CNV segment whose start-end interval intersects the range of a chromosome segment in the cytogenetic band file is a large-segment CNV;
the cancer is bladder cancer, and the samples to be detected comprise urine exfoliated cell samples of healthy people and urine exfoliated cell samples of bladder cancer patients.
6. The construction apparatus according to claim 5, wherein the predetermined number is 50 or more people.
7. The construction apparatus according to claim 5, wherein the population genome data in the 1000 Genomes database is the genome data of the Chinese population in the 1000 Genomes database.
8. The construction apparatus according to claim 5, wherein the final genome features are trained using a random forest model.
9. A cancer screening model constructed by the method of constructing a cancer screening model according to any one of claims 1 to 4.
10. A cancer screening device comprising the cancer screening model of claim 9.
CN202110707095.0A 2021-06-24 2021-06-24 Cancer screening model, and construction method and construction device of cancer screening model Active CN113257360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707095.0A CN113257360B (en) 2021-06-24 2021-06-24 Cancer screening model, and construction method and construction device of cancer screening model

Publications (2)

Publication Number Publication Date
CN113257360A CN113257360A (en) 2021-08-13
CN113257360B true CN113257360B (en) 2021-10-15

Family

ID=77189642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707095.0A Active CN113257360B (en) 2021-06-24 2021-06-24 Cancer screening model, and construction method and construction device of cancer screening model

Country Status (1)

Country Link
CN (1) CN113257360B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114566285B (en) * 2022-04-26 2022-07-19 北京橡鑫生物科技有限公司 Early screening model for bladder cancer, construction method of early screening model, kit and use method of early screening model
CN115691667B (en) * 2022-12-30 2023-04-18 北京橡鑫生物科技有限公司 Urology early screening device, model construction method and equipment
CN116564508B (en) * 2023-07-07 2023-09-29 北京橡鑫生物科技有限公司 Early prostate cancer screening model and construction method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833963A (en) * 2019-05-07 2020-10-27 中国科学院北京基因组研究所 cfDNA classification method, device and application
CN112111565A (en) * 2019-06-20 2020-12-22 上海其明信息技术有限公司 Mutation analysis method and device for cell free DNA sequencing data
CN111599407B (en) * 2020-05-13 2021-10-15 北京橡鑫生物科技有限公司 Method and device for detecting copy number variation

Also Published As

Publication number Publication date
CN113257360A (en) 2021-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant