CN115148287A - Construction method of gene focus amplification typing model and typing method of tumor sample - Google Patents


Info

Publication number
CN115148287A
Authority
CN
China
Prior art keywords
gene
typing
tumor
amplification
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211067952.6A
Other languages
Chinese (zh)
Other versions
CN115148287B (en)
Inventor
徐瑞华
赵齐
王诗翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University Cancer Center
Original Assignee
Sun Yat Sen University Cancer Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University Cancer Center
Priority to CN202211067952.6A
Publication of CN115148287A
Application granted
Publication of CN115148287B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Software Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to a method for constructing a gene focus amplification typing model, a method for typing a tumor sample, a device for constructing a gene focus amplification typing model, a system for typing a tumor sample, a computer device and a storage medium. The construction method comprises the following steps: acquiring first associated feature sample data specific to each allele in a preset number of tumor samples; randomly dividing the first associated feature sample data to generate a training sample set and a test sample set; training a preset genotyping model on the training sample set and testing it on the test sample set; adjusting the model parameters of the genotyping model based on the PR-AUC (area under the precision-recall curve) indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, thereby generating the trained gene focus amplification typing model; and outputting a gene focus amplification typing result based on the trained model. With this method, whether a gene is carried on ecDNA can be accurately predicted.

Description

Construction method of gene focus amplification typing model and typing method of tumor sample
Technical Field
The application relates to the technical field of intelligent biological information processing, in particular to a method for constructing a gene focus amplification typing model, a method for typing a tumor sample, a device for constructing the gene focus amplification typing model, a system for typing the tumor sample, computer equipment and a storage medium.
Background
Mutations in somatic genomic DNA drive tumor initiation and progression. Extrachromosomal circular DNA (ecDNA) is widespread in many types of tumor cells but is rarely seen in normal cells. It is a major source of the oncogenes expressed by tumor cells, and oncogenes on ecDNA are much more transcriptionally active than the same genes on the chromosome, while the genes that make up ecDNA vary from tumor to tumor. Because ecDNA has a high copy number and strong transcriptional activity, and promotes tumor heterogeneity through unequal segregation during cell division, it enhances the adaptability of the tumor to its environment, further promotes rapid tumor evolution and treatment resistance, and leads to a poor prognosis for tumor patients.
At present, the research of ecDNA is important for the prevention and cure of tumors. However, the conventional techniques still lack practical tools for the study of ecDNA, and thus cannot accurately predict whether a gene is carried in ecDNA.
Disclosure of Invention
In view of the above, it is desirable to provide a method for constructing a gene focus amplification typing model, a method for typing a tumor sample, a device for constructing a gene focus amplification typing model, a system for typing a tumor sample, a computer device, and a storage medium, which can accurately predict whether a gene is carried on ecDNA.
In a first aspect, a method for constructing a gene focus amplification typing model is provided, the method comprising:
acquiring first associated feature sample data specific to each allele in a preset number of tumor samples; the first associated feature sample data comprises allele-specific absolute copy number information in the tumor sample; the absolute copy number information includes one or more of total copy number of a gene, minor allele copy number of a gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern;
randomly dividing each first associated characteristic sample data to generate a training sample set and a test sample set;
training a preset genotyping model according to a training sample set, testing the genotyping model according to a test sample set, adjusting model parameters of the genotyping model based on PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating the trained gene focus amplification genotyping model, and outputting a gene focus amplification genotyping result based on the trained gene focus amplification genotyping model; the result of the gene focus amplification typing includes that the gene is carried in extrachromosomal circular DNA or that the gene is carried in chromosomal DNA.
In one embodiment, the first associated feature sample data further comprises allele-specific amplification frequency information in the tumor sample; the amplification frequency information includes one or more of a circular amplification frequency, a breakage-fusion-bridge (BFB) cycle amplification frequency, a complex rearrangement amplification frequency, and a linear amplification frequency.
In one embodiment, the step of acquiring the first associated feature sample data specific to each allele in the preset number of tumor samples comprises: obtaining first target gene sequencing results of a preset number of tumor patients, where each first target gene sequencing result comprises a gene sequencing result of a tumor sample of a tumor patient and a gene sequencing result of a corresponding normal control sample; and analyzing each first target gene sequencing result to obtain the corresponding first associated feature sample data.
In one embodiment, the copy number variation pattern comprises 19 copy number variation submodes and a corresponding ratio of each copy number variation submode.
In one embodiment, the step of analyzing the sequencing result of each first target gene to obtain the corresponding first associated characteristic sample data includes: based on the allele-specific copy number detection software, analyzing the sequencing result of each first target gene to obtain corresponding absolute copy number information; and calculating each absolute copy number information based on copy number variation analysis and copy number variation mode analysis software to obtain the corresponding heterozygosity loss ratio and the ratio corresponding to each copy number variation sub-mode.
In one embodiment, the first target gene sequencing result is determined according to one or more of a whole exome sequencing method, a targeted gene sequencing method, a SNP detection method, and a whole genome sequencing method.
In one embodiment, the genotyping model is one of XGBoost, logistic regression, random forest and GBDT.
In one embodiment, the construction method further comprises: performing performance verification on the trained gene focus amplification typing model based on a grouped k-fold cross-validation method to obtain a performance verification result.
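The grouped k-fold idea can be sketched as follows. This is a minimal standard-library illustration, not the patented implementation: it assumes the grouping key is the tumor sample that each gene-level record came from (the text does not state the grouping key), and the function name `grouped_k_fold` and the sample identifiers are hypothetical.

```python
from collections import defaultdict

def grouped_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs; rows sharing a group never straddle a split."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if k < 2 or k > len(by_group):
        raise ValueError("k must be between 2 and the number of distinct groups")
    # Greedy balancing: place the largest groups first into the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[g])
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for other in range(k) if other != f for i in folds[other])
        yield train_idx, test_idx

# Hypothetical gene-level rows, labelled by the tumor sample they belong to (assumption).
sample_ids = ["S1", "S1", "S1", "S2", "S2", "S3", "S4", "S4"]
splits = list(grouped_k_fold(sample_ids, k=2))
```

Keeping all rows of one group in the same fold prevents records from the same tumor from appearing on both sides of a validation split, which would otherwise inflate the measured performance.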
In one embodiment, the construction method further comprises: and screening target genes related to the extrachromosomal circular DNA according to the gene focus amplification typing result, analyzing chromosome genome sites and occurrence frequency of the target genes, and generating a sequenced target gene list.
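The screening and ranking step might look like the sketch below. It assumes predictions arrive as `(sample_id, gene, locus, typing)` tuples and that genes are ordered by how often they are predicted on ecDNA; the tuple layout, the label strings and the function name `rank_ecdna_genes` are hypothetical, and the example rows are illustrative values rather than results from the source.

```python
from collections import Counter

def rank_ecdna_genes(predictions):
    """Count ecDNA calls per gene and return (gene, locus, count), most frequent first."""
    counts = Counter()
    locus_of = {}
    for sample_id, gene, locus, typing in predictions:
        if typing == "ecDNA":
            counts[gene] += 1
            locus_of[gene] = locus
    # Ties are broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, locus_of[gene], n) for gene, n in ranked]

# Illustrative predictions only (not data from the source).
predictions = [
    ("S1", "MYC", "8q24", "ecDNA"),
    ("S2", "MYC", "8q24", "ecDNA"),
    ("S2", "EGFR", "7p11", "ecDNA"),
    ("S3", "EGFR", "7p11", "chromosomal"),
    ("S3", "CDK4", "12q14", "ecDNA"),
]
ranked_genes = rank_ecdna_genes(predictions)
```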
In one embodiment, the first associated feature sample data is an associated feature sample data matrix or an associated feature sample data list.
In a second aspect, there is provided a method for typing a tumor sample, the method comprising:
acquiring second associated feature sample data specific to each allele in the target tumor sample; the second associated feature sample data comprises allele-specific absolute copy number information in the target tumor sample; the absolute copy number information includes one or more of total copy number of a gene, minor allele copy number of a gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern;
inputting the second associated feature sample data into the gene focus amplification typing model trained by the construction method of any one of the above embodiments, and outputting a corresponding gene focus amplification typing result;
determining a sample typing result of the target tumor sample according to the gene focus amplification typing result; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
In one embodiment, the gene focus amplification typing result further comprises a typing result for a gene carried on chromosomal DNA; the typing method further comprises: determining the typing result for the gene carried on chromosomal DNA according to the absolute copy number information in the second associated feature sample data and the corresponding gene focus amplification typing result; the typing result for a gene carried on chromosomal DNA includes that the gene is carried on chromosomal DNA and is amplified, or that the gene is carried on chromosomal DNA and is not amplified.
In one embodiment, the typing method further comprises: obtaining a second target gene sequencing result of a tumor patient to be detected; the second target gene sequencing result comprises a gene sequencing result of a target tumor sample of the tumor patient to be detected and a gene sequencing result of a corresponding normal control sample; and analyzing the sequencing result of each second target gene to obtain corresponding second associated characteristic sample data.
In one embodiment, the typing method further comprises: performing Cox survival analysis on the target tumor sample according to the sample typing result to obtain a prognosis prediction result for each tumor type.
In one embodiment, the step of determining a sample typing result of the target tumor sample according to the gene focus amplification typing result comprises: judging whether each gene focus amplification typing result is that the gene is carried on extrachromosomal circular DNA; if any gene focus amplification typing result is that the gene is carried on extrachromosomal circular DNA, judging that the sample typing result is circular focus amplification; if all gene focus amplification typing results are that the gene is carried on chromosomal DNA, judging that the sample typing result is non-circular focus amplification or non-focus amplification according to the tumor ploidy corresponding to each gene focus amplification typing result.
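The sample-level decision described above can be sketched as a small rule. The text does not give the ploidy-based criterion, so the sketch assumes, purely for illustration, that a chromosomal gene counts as amplified when its total copy number exceeds the tumor ploidy by at least `min_gain`; that threshold, the function names and the short label strings are all hypothetical.

```python
def is_amplified(total_cn, ploidy, min_gain=4):
    # ASSUMPTION: the source does not define this threshold; min_gain is illustrative.
    return total_cn >= ploidy + min_gain

def type_sample(gene_calls, ploidy, min_gain=4):
    """gene_calls: list of (gene, typing, total_cn); typing is 'ecDNA' or 'chromosomal'."""
    # Any gene predicted on ecDNA makes the whole sample circular.
    if any(typing == "ecDNA" for _, typing, _ in gene_calls):
        return "circular"
    # Otherwise every gene is chromosomal; check for amplification relative to ploidy.
    if any(is_amplified(cn, ploidy, min_gain) for _, _, cn in gene_calls):
        return "noncircular"
    return "nofocal"

# Hypothetical gene-level calls for three samples.
circular_case = type_sample([("A", "ecDNA", 12), ("B", "chromosomal", 2)], ploidy=2.0)
noncircular_case = type_sample([("A", "chromosomal", 9)], ploidy=2.1)
nofocal_case = type_sample([("A", "chromosomal", 3)], ploidy=2.1)
```

The ecDNA check runs first because a single ecDNA-borne gene is enough to classify the sample as circular, regardless of the copy number of the remaining genes.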
In a third aspect, a device for constructing a gene focus amplification typing model is provided, where the device includes a first data obtaining module, a sample data dividing module, and a typing model constructing module.
The first data acquisition module is used for acquiring first associated feature sample data specific to each allele in a preset number of tumor samples; the first associated feature sample data comprises allele-specific absolute copy number information in the tumor sample; the absolute copy number information includes one or more of total copy number of a gene, minor allele copy number of a gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern.
The sample data dividing module is used for randomly dividing the sample data of each first associated characteristic to generate a training sample set and a test sample set.
The genotyping model construction module is used for training a preset genotyping model according to a training sample set, testing the genotyping model according to a testing sample set, adjusting model parameters of the genotyping model based on PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating the trained gene focus amplification genotyping model, and outputting a gene focus amplification genotyping result based on the trained gene focus amplification genotyping model; the result of the gene focus amplification typing includes that the gene is carried in extrachromosomal circular DNA or that the gene is carried in chromosomal DNA.
In a fourth aspect, a tumor sample typing system is provided, which comprises a second data acquiring device, a typing model applying device and a sample typing generation device.
The second data acquisition device is used for acquiring second associated feature sample data specific to each allele in the target tumor sample; the second associated feature sample data comprises allele-specific absolute copy number information in the target tumor sample; the absolute copy number information includes one or more of total copy number of a gene, minor allele copy number of a gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern.
The typing model application device is used for inputting each second related characteristic sample data into the trained gene focus amplification typing model constructed in any one of the embodiments of the gene focus amplification typing model construction device and outputting a corresponding gene focus amplification typing result.
The sample typing generation device is used for determining a sample typing result of the target tumor sample according to the gene focus amplification typing result; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
In a fifth aspect, a computer device is provided, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method for constructing a gene focus amplification typing model or the method for typing a tumor sample in any one of the above-mentioned method embodiments when executing the computer program.
In a sixth aspect, a computer readable storage medium is provided, which has a computer program stored thereon, and the computer program is executed by a processor to implement the steps of the method for constructing a gene focus amplification typing model or the method for typing a tumor sample in any one of the above-mentioned method embodiments.
With the above construction method of the gene focus amplification typing model, typing method of the tumor sample, construction device of the gene focus amplification typing model, typing system of the tumor sample, computer device and storage medium, first associated feature sample data specific to each allele in a preset number of tumor samples are obtained; the first associated feature sample data are then randomly divided to generate a training sample set and a test sample set; a preset genotyping model is then trained on the training sample set and tested on the test sample set, and the model parameters of the genotyping model are adjusted based on the PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating the trained gene focus amplification typing model. The trained model can therefore accurately output a gene focus amplification typing result, namely that a gene is carried on extrachromosomal circular DNA or that a gene is carried on chromosomal DNA, so that whether a gene is carried on ecDNA can be accurately predicted.
Drawings
FIG. 1 is a diagram showing an environment in which a method for constructing a gene focus amplification typing model and a method for typing a tumor sample are applied in one embodiment;
FIG. 2 is a first flowchart of a method for constructing a gene focus amplification typing model according to one embodiment;
FIG. 3 is a schematic flowchart illustrating the steps of obtaining sample data of association features specific to each allele in a predetermined number of tumor samples according to an embodiment;
FIG. 4 is a schematic flowchart illustrating the steps of analyzing the sequencing result of each first target gene to obtain corresponding first associated feature sample data according to one embodiment;
FIG. 5 is a diagram showing a second flow of a method for constructing a gene focus amplification typing model in one embodiment;
FIG. 6 is a schematic diagram showing feature importance of sample data for evaluating the result of the gene focus amplification typing model and the first associated feature in one embodiment;
FIG. 7 is a third flowchart showing a method of constructing a gene focus amplification typing model in one embodiment;
FIG. 8 is a first flowchart of a method for typing a tumor sample according to an embodiment;
FIG. 9 is a diagram illustrating a second process of a method for typing a tumor sample according to an embodiment;
FIG. 10 is a schematic diagram of a third flow chart of a typing method for a tumor sample according to an embodiment;
FIG. 11 is a schematic flow chart illustrating the steps for determining a sample typing result for a target tumor sample in one embodiment;
FIG. 12 is a fourth schematic flow diagram of a typing method for a tumor sample according to an embodiment;
FIG. 13 is a block diagram showing an apparatus for constructing a gene focus amplification typing model according to one embodiment;
FIG. 14 is a block diagram showing the structure of a typing apparatus for a tumor sample according to an embodiment;
FIG. 15 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application, and are not intended to limit the present application.
The construction method of the gene focus amplification typing model and the typing method of the tumor sample provided by the application can be applied to the application environment shown in figure 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for constructing a gene focus amplification typing model is provided, which is exemplified by applying the method to the terminal in fig. 1, and includes the following steps 201 to 203.
Step 201, obtaining first association feature sample data specific to each allele in a preset number of tumor samples.
Wherein the first associated feature sample data includes allele-specific absolute copy number information in the tumor sample. It is understood that the absolute copy number information includes one or more of total copy number of a gene (total_cn), minor allele copy number of a gene (minor_cn), tumor purity (purity), tumor ploidy (ploidy), copy number variation load (cna_burden), heterozygosity loss ratio (pLOH), tumor aneuploidy score (AScore), and copy number variation pattern.
Specifically, the Copy Number Variation (CNV) pattern is variation information for characterizing a DNA fragment having a size of 1kb or more, and is closely related to the occurrence and progression of cancer. The terminal can obtain first associated characteristic sample data specific to each allele in a preset number of tumor samples.
In one specific example, the first associated feature sample data specific to each allele in the preset number of tumor samples further comprises de-identified (desensitized) clinical information of the tumor patient corresponding to each tumor sample. The de-identified clinical information includes patient age and patient sex. The above is only a specific example; in practice the settings can be adapted flexibly to user requirements and are not limited herein.
In one embodiment, the first associated feature sample data further comprises allele-specific amplification frequency information in the tumor sample.
Wherein the allele-specific amplification frequency information in the tumor sample comprises one or more of a circular amplification frequency (freq_Circular), a breakage-fusion-bridge cycle amplification frequency (freq_BFB), a complex rearrangement amplification frequency (freq_HR), and a linear amplification frequency (freq_Linear).
In one embodiment, the copy number variation pattern of allele-specific absolute copy number information in the tumor sample comprises 19 copy number variation submodes and a corresponding ratio of each copy number variation submode.
In one embodiment, the 19 copy number variation submodes and their corresponding ratios are the first through nineteenth copy number variation submodes (CN1 to CN19), each with its corresponding ratio. The above is only a specific example; in practice the settings can be adapted flexibly to user requirements and are not limited herein.
In one embodiment, the first associated feature sample data is an associated feature sample data matrix or an associated feature sample data list.
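One way to picture the associated feature sample data matrix is as fixed-order rows, one per gene in a tumor sample. The sketch below is an illustration under assumptions: the column names are modelled on the identifiers mentioned in the text (`purity` is an assumed name for tumor purity), the submode columns are assumed to be named CN1 to CN19, and the helper names and example values are hypothetical.

```python
# Assumed column names, modelled on the identifiers mentioned in the text.
SCALAR_FEATURES = ["total_cn", "minor_cn", "purity", "ploidy",
                   "cna_burden", "pLOH", "AScore"]
PATTERN_FEATURES = [f"CN{i}" for i in range(1, 20)]  # ratios of the 19 submodes
FEATURE_ORDER = SCALAR_FEATURES + PATTERN_FEATURES

def to_row(record):
    """Flatten one gene-level record (a dict) into a fixed-order list of floats."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]

def to_matrix(records):
    """Stack records into a list-of-lists matrix with one row per record."""
    return [to_row(r) for r in records]

# Hypothetical values for a single gene in a single tumor sample.
example = {"total_cn": 12, "minor_cn": 1, "purity": 0.8, "ploidy": 2.3,
           "cna_burden": 0.41, "pLOH": 0.12, "AScore": 9}
example.update({name: 1 / 19 for name in PATTERN_FEATURES})
matrix = to_matrix([example])
```

Fixing the column order once means the same layout can be reused when the second associated feature sample data is fed to the trained model.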
In one embodiment, as shown in fig. 3, the step of obtaining first associated feature sample data specific to each allele in a predetermined number of tumor samples includes steps 301 and 302.
Step 301, obtaining the sequencing results of the first target genes of a preset number of tumor patients.
Wherein the first target gene sequencing result comprises a gene sequencing result of a tumor sample of a tumor patient and a gene sequencing result of a corresponding normal control sample. It can be understood that the terminal can obtain the gene sequencing result of the tumor sample of the tumor patient and the gene sequencing result of the normal control sample corresponding to each tumor patient, i.e. the first target gene sequencing result.
In a specific example, the first target gene sequencing result can be, but is not limited to, a gene sequencing result based on whole genome sequencing (WGS). The terminal may obtain the first target gene sequencing results of a preset number of tumor patients by, but not limited to, downloading them from The Cancer Genome Atlas (TCGA) public database. The preset number may be, but is not limited to, 386. The above is only a specific example; in practice the settings can be adapted flexibly to user requirements and are not limited herein.
In one embodiment, the first target gene sequencing result is determined according to one or more of a whole exome sequencing method, a targeted gene sequencing method, a SNP detection method, and a Whole Genome Sequencing (WGS) method.
Step 302, analyzing the sequencing result of each first target gene to obtain corresponding first associated characteristic sample data.
The terminal can analyze the obtained first target gene sequencing results of the preset number of tumor patients, so as to obtain first associated characteristic sample data corresponding to each first target gene sequencing result.
In the embodiment, the sequencing result of the first target gene of a preset number of tumor patients is obtained; and then, analyzing each first target gene sequencing result to obtain corresponding first associated characteristic sample data, so that the acquisition efficiency and accuracy of the first target gene sequencing result are improved, the convenience of the construction process of the gene focus amplification typing model is further improved, and the cost of the construction process of the gene focus amplification typing model is reduced.
In one embodiment, as shown in fig. 4, the step of analyzing the sequencing result of each first target gene to obtain the corresponding first associated characteristic sample data includes step 401 and step 402.
Step 401, analyzing the sequencing result of each first target gene based on allele-specific copy number detection software to obtain corresponding absolute copy number information.
Step 402, calculating each absolute copy number information based on copy number variation analysis and copy number variation pattern analysis software, obtaining the corresponding heterozygosity loss ratio and the ratio corresponding to each copy number variation submode.
The terminal can analyze the sequencing result of each first target gene based on allele specific copy number detection software to obtain absolute copy number information corresponding to the sequencing result of each first target gene; and then, calculating each absolute copy number information based on copy number variation analysis and copy number variation mode analysis software to obtain the heterozygosity loss ratio corresponding to each absolute copy number information and the ratio corresponding to each copy number variation sub-mode corresponding to each absolute copy number information.
In one specific example, the allele-specific copy number detection software can be, but is not limited to, ASCAT, FACETS, or Sequenza. Of these, ASCAT gives better efficiency and quality when analyzing the first target gene sequencing results. In addition, the copy number variation analysis and copy number variation pattern analysis software may be, but is not limited to, sigminer. The above is only a specific example; in practice the settings can be adapted flexibly to user requirements and are not limited herein.
In this embodiment, each first target gene sequencing result is analyzed with allele-specific copy number detection software to obtain the corresponding absolute copy number information; each absolute copy number information is then processed with copy number variation analysis and copy number variation pattern analysis software to obtain the corresponding heterozygosity loss ratio and the ratio corresponding to each copy number variation submode. This improves the efficiency and accuracy of obtaining the absolute copy number information, the heterozygosity loss ratio and the ratio corresponding to each copy number variation submode, which in turn makes the construction of the gene focus amplification typing model more convenient and reduces its cost.
Step 202, performing random division processing on each first associated feature sample data to generate a training sample set and a test sample set.
The terminal can randomly divide the first associated feature sample data, so that a training sample set and a test sample set can be generated. It is to be understood that the training sample set is used to train a pre-set genotyping model and the testing sample set is used to test the genotyping model.
In a specific example, the terminal may perform random division processing on each first associated feature sample data, so as to generate a training sample set and a test sample set at a preset division ratio of 4:1. It can be understood that each first associated feature sample data is assigned to only one of the training sample set and the test sample set, so that no data is shared between the two sets. The above is only a specific example; in practical applications it can be set flexibly according to user requirements and is not limited herein.
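The random 4:1 division described above can be sketched as follows. This is only a minimal illustration using the Python standard library; the function name `split_train_test`, the record layout and the fixed random seed are assumptions made for the example and are not taken from the application.

```python
import random


def split_train_test(records, train_parts=4, test_parts=1, seed=0):
    """Randomly assign each record to exactly one of two disjoint sets."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    # Number of test records implied by the train:test ratio (4:1 by default).
    n_test = round(len(indices) * test_parts / (train_parts + test_parts))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test


# Hypothetical feature records; real records would hold the associated features.
records = [{"gene": f"gene_{i}", "total_cn": 2 + i % 5} for i in range(100)]
train_set, test_set = split_train_test(records)
```

Because membership is decided by index, a record can never appear in both sets, which matches the requirement that each sample datum belongs to only one of the two sets.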
Step 203, training a preset genotyping model according to the training sample set, testing the genotyping model according to the test sample set, adjusting model parameters of the genotyping model based on the PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating the trained gene focus amplification typing model, and outputting a gene focus amplification typing result based on the trained gene focus amplification typing model.
Wherein the gene focus amplification typing result comprises that the gene is carried in extrachromosomal circular DNA or that the gene is carried in chromosomal DNA. The terminal can train a preset genotyping model according to the training sample set, test the genotyping model according to the test sample set, and adjust model parameters of the genotyping model based on the PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, thereby generating the trained gene focus amplification typing model.
It is understood that the trained gene focus amplification typing model outputs a gene focus amplification typing result indicating that the gene is carried either in extrachromosomal circular DNA or in chromosomal DNA; that is, the model classifies each gene by the DNA that carries it. Further, whether a gene is carried in extrachromosomal circular DNA can be predicted from the gene focus amplification typing result. In addition, the PR-AUC index (the area under the precision-recall curve) is an evaluation index of the genotyping model, and adjusting the model parameters against it helps improve the performance of the genotyping model. The preset requirements can be set flexibly in practical applications according to user requirements and are not limited herein.
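As an illustration of the PR-AUC index mentioned above, the sketch below computes the area under the precision-recall curve in its step-wise (average precision) form. It is an assumption that this particular estimator is used; the application does not specify how the area is computed. The label encoding (1 for a gene carried in ecDNA, 0 for a gene carried in chromosomal DNA) is also hypothetical, and tied scores are simply ranked in input order.

```python
def pr_auc(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 = positive class (assumed: gene carried in ecDNA), 0 = negative.
    scores: model-predicted score for the positive class.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Each positive raises recall by 1/n_pos at the current precision.
            area += (true_pos / rank) / n_pos
    return area


perfect = pr_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = pr_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike ROC-AUC, this index is sensitive to the fraction of positives, which is one common reason to prefer it when the positive class is rare.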
In one embodiment, the genotyping model is one of XGBoost, logistic regression, random forest and GBDT.
According to the construction method of the gene focus amplification typing model, first associated characteristic sample data specific to each gene in a preset number of tumor samples is obtained; then, carrying out random division processing on the sample data of each first associated characteristic to generate a training sample set and a test sample set; then, training a preset genotyping model according to the training sample set, testing the genotyping model according to the testing sample set, adjusting model parameters of the genotyping model based on PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet the preset requirements, generating the trained gene focus amplification genotyping model, therefore, the focus amplification typing result of the gene including the gene carried in the extrachromosomal circular DNA or the gene carried in the chromosomal DNA is accurately output based on the trained gene focus amplification typing model, and whether the gene is carried in the ecDNA or not can be accurately predicted, namely, the specific ecDNA member gene of each tumor can be accurately identified.
In addition, the construction method of the gene focus amplification typing model can be applied to typing of tumor samples, and corresponding gene focus amplification typing results can be output by inputting the acquired second associated characteristic sample data of each allele specificity in the acquired target tumor sample into the trained gene focus amplification typing model; then, determining a sample typing result of the target tumor sample according to the amplification typing result of each gene focus; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
In one embodiment, as shown in FIG. 5, the construction method further includes step 204.
Step 204, performing performance verification on the trained gene focus amplification typing model based on a grouped k-fold cross-validation method to obtain a performance verification result.
The terminal can perform performance verification on the trained gene focus amplification typing model based on a grouped k-fold cross-validation method to obtain the performance verification result corresponding to the trained gene focus amplification typing model.
In one specific example, k in the grouped k-fold cross-validation method may be any positive integer. For instance, the terminal can perform performance verification on the trained gene focus amplification typing model by 5-fold cross-validation repeated 3 times, so as to obtain the corresponding performance verification result. This is only a specific example; in practical applications it can be set flexibly according to user requirements and is not limited herein.
In this embodiment, the trained gene focus amplification typing model is subjected to performance verification based on a grouped k-fold cross-validation method to obtain a performance verification result, which improves the accuracy and convenience of constructing the gene focus amplification typing model.
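The grouped k-fold cross-validation with repetition (5 folds, 3 repeats in the example above) could be organized as in the following sketch. The grouping key is an assumption: here all gene records from the same tumor sample share one group, so that a tumor never contributes to both the training and the validation part of a split. The function name and data layout are hypothetical.

```python
import random


def repeated_group_kfold(groups, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx), keeping whole groups together."""
    unique_groups = sorted(set(groups))
    if k < 2 or k > len(unique_groups):
        raise ValueError("k must be between 2 and the number of groups")
    rng = random.Random(seed)
    for repeat in range(repeats):
        shuffled = unique_groups[:]
        rng.shuffle(shuffled)
        # Deal the shuffled groups round-robin into k folds.
        fold_of_group = {g: i % k for i, g in enumerate(shuffled)}
        for fold in range(k):
            valid_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
            train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
            yield repeat, fold, train_idx, valid_idx


# Hypothetical layout: 10 tumor samples with 3 gene records each.
groups = [f"tumor_{i // 3}" for i in range(30)]
splits = list(repeated_group_kfold(groups))
```

Each split would then be used to fit the model on `train_idx` and score it on `valid_idx`, with the scores averaged over all 15 splits.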
In one specific example, the evaluation results of the gene focus amplification typing model and the feature importance of the first associated feature sample data are shown in FIG. 6. FIG. 6A is the precision-recall curve (from which the PR-AUC is computed); FIG. 6B is the true positive rate versus false positive rate curve, i.e., the receiver operating characteristic (ROC) curve; FIG. 6C is the feature importance estimate for the first associated feature sample data. The feature importance obtained in the test decreases in the following order: total copy number (total_CN), tumor ploidy (ploidy), copy number variation load (cna_burden), first copy number variation submode (CN1) and its proportion, ninth copy number variation submode (CN9) and its proportion, tumor purity (purity), second copy number variation submode (CN2) and its proportion, circular amplification frequency (freq_Circular), eighteenth copy number variation submode (CN18) and its proportion, gene minor allele copy number (minor_CN), linear amplification frequency (freq_Linear), patient age (age), eighth copy number variation submode (CN8) and its proportion, heterozygosity loss proportion (pLOH), breakage-fusion-bridge amplification frequency (freq_BFB), third copy number variation submode (CN3) and its proportion, tumor aneuploidy score (AScore), seventh copy number variation submode (CN7) and its proportion, fifteenth copy number variation submode (CN15) and its proportion, complex rearrangement amplification frequency (freq_HR), nineteenth copy number variation submode (CN19) and its proportion, seventeenth copy number variation submode (CN17) and its proportion, sixteenth copy number variation submode (CN16) and its proportion, sixth copy number variation submode (CN6) and its proportion, fourth copy number variation submode (CN4) and its proportion, tenth copy number variation submode (CN10) and its proportion, and eleventh copy number variation submode (CN11) and its proportion. The above is only a specific example; in practical applications it can be set flexibly according to user requirements and is not limited herein.
In one embodiment, as shown in FIG. 7, the construction method further comprises step 205.
Step 205, according to the gene focus amplification typing result, screening target genes related to the extrachromosomal circular DNA, analyzing chromosome genome sites and occurrence frequency of the target genes, and generating a sorted target gene list.
The terminal can screen target genes related to the extrachromosomal circular DNA according to the gene focus amplification typing results output by the trained gene focus amplification typing model, and analyze the chromosomal genome loci and occurrence frequency of the target genes, so as to generate a sorted target gene list that provides a list of potential targets for tumor treatment.
In this embodiment, the sorted target gene list is generated according to the gene focus amplification typing results, so that a list of potential targets for tumor therapy is provided through the target gene list, and the convenience and applicability of the gene focus amplification typing model are improved.
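One way to picture the generation of the sorted target gene list is the sketch below, which counts in how many tumor samples each gene is typed as carried in ecDNA and sorts genes by that count. The tuple layout, the label strings "ecDNA" and "chromosomal", and the example predictions are all hypothetical and serve only to illustrate the counting and sorting.

```python
from collections import Counter


def rank_ecdna_genes(predictions):
    """Rank genes by the number of tumor samples in which they are typed as ecDNA.

    predictions: iterable of (sample_id, gene, locus, label) tuples, where
    label is "ecDNA" or "chromosomal" (hypothetical label names).
    """
    counts = Counter()
    locus_of = {}
    seen = set()
    for sample_id, gene, locus, label in predictions:
        # Count each gene at most once per sample, and only ecDNA calls.
        if label != "ecDNA" or (sample_id, gene) in seen:
            continue
        seen.add((sample_id, gene))
        counts[gene] += 1
        locus_of[gene] = locus
    # Sort by descending frequency, then by gene name for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, locus_of[gene], n) for gene, n in ordered]


# Made-up predictions for illustration only.
predictions = [
    ("S1", "MYC", "8q24.21", "ecDNA"),
    ("S2", "MYC", "8q24.21", "ecDNA"),
    ("S2", "EGFR", "7p11.2", "ecDNA"),
    ("S3", "EGFR", "7p11.2", "chromosomal"),
    ("S3", "ERBB2", "17q12", "ecDNA"),
]
ranked_genes = rank_ecdna_genes(predictions)
```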
In a second aspect, as shown in fig. 8, a method for typing a tumor sample is provided, the method comprising steps 701, 702 and 704.
Step 701, obtaining second association feature sample data specific to each allele in the target tumor sample.
Wherein the second associated feature sample data comprises allele-specific absolute copy number information in the target tumor sample. It is understood that the absolute copy number information includes one or more of total copy number of the gene (total_cn), minor allele copy number of the gene (minor_cn), tumor purity (purity), tumor ploidy (ploidy), copy number variation load (cna_burden), heterozygosity loss ratio (pLOH), tumor aneuploidy score (AScore), and copy number variation pattern.
In a specific example, the target tumor sample may be, but is not limited to, a rectal cancer sample of a patient to be tested, which is only a specific example, and the target tumor sample is flexibly set according to user requirements in practical applications, and is not limited herein.
In one embodiment, the second associated feature sample data further comprises allele-specific amplification frequency information in the tumor sample.
Wherein the allele-specific amplification frequency information in the tumor sample comprises one or more of a circular amplification frequency (freq_Circular), a breakage-fusion-bridge amplification frequency (freq_BFB), a complex rearrangement amplification frequency (freq_HR), and a linear amplification frequency (freq_Linear).
In one embodiment, the copy number variation pattern of allele-specific absolute copy number information in the tumor sample comprises 19 copy number variation submodes and a corresponding ratio of each copy number variation submode.
In one embodiment, the second associated feature sample data is an associated feature sample data matrix or an associated feature sample data list.
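To make the matrix and list forms of the associated feature sample data concrete, the following sketch defines one record per gene and flattens it into a matrix row. The class name, field defaults and column order are assumptions; only the feature names (total_cn, minor_cn, purity, ploidy, cna_burden, pLOH, AScore, the 19 copy number variation submodes and the four amplification frequencies) come from the description.

```python
from dataclasses import dataclass, field

CN_SUBMODES = [f"CN{i}" for i in range(1, 20)]  # CN1 .. CN19


@dataclass
class GeneFeatureRecord:
    """One associated feature sample datum for a single gene (hypothetical layout)."""
    gene: str
    total_cn: float
    minor_cn: float
    purity: float
    ploidy: float
    cna_burden: float
    pLOH: float
    AScore: float
    cn_submode_ratio: dict = field(default_factory=dict)  # submode name -> ratio
    freq_Circular: float = 0.0
    freq_BFB: float = 0.0
    freq_HR: float = 0.0
    freq_Linear: float = 0.0

    def to_row(self):
        """Flatten the record into one matrix row with a fixed column order."""
        row = [self.total_cn, self.minor_cn, self.purity, self.ploidy,
               self.cna_burden, self.pLOH, self.AScore]
        # Submodes missing from the dict are treated as ratio 0.0.
        row += [self.cn_submode_ratio.get(name, 0.0) for name in CN_SUBMODES]
        row += [self.freq_Circular, self.freq_BFB, self.freq_HR, self.freq_Linear]
        return row


# Made-up values for illustration only.
record = GeneFeatureRecord("MYC", 12.0, 1.0, 0.8, 2.4, 0.35, 0.1, 9.0,
                           cn_submode_ratio={"CN1": 0.6, "CN9": 0.4})
feature_list = [record]                              # list form
feature_matrix = [r.to_row() for r in feature_list]  # matrix form
```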
In one embodiment, as shown in fig. 9, the typing method further includes step 700A and step 700B.
Step 700A, a second target gene sequencing result of a patient to be detected with a tumor is obtained.
And the second target gene sequencing result comprises a gene sequencing result of a target tumor sample of the tumor patient to be detected and a gene sequencing result of a corresponding normal control sample. It can be understood that the terminal can obtain the gene sequencing result of the tumor sample of the tumor patient to be detected and the gene sequencing result of the normal control sample corresponding to each tumor patient, i.e. the second target gene sequencing result. The tumor patient to be detected is a patient who needs to be classified by a tumor sample in a hospital or a detection department.
In one embodiment, the second target gene sequencing result is determined according to one or more of a whole exome sequencing method, a targeted gene sequencing method, a SNP detection method, and a Whole Genome Sequencing (WGS) method.
And step 700B, analyzing the sequencing result of each second target gene to obtain corresponding second associated characteristic sample data.
The terminal can analyze the obtained second target gene sequencing result of the tumor patient to be detected, so as to obtain second associated characteristic sample data corresponding to the second target gene sequencing result.
In the embodiment, the sequencing result of the second target gene of the tumor patient to be detected is obtained; and then, analyzing each second target gene sequencing result to obtain corresponding second associated characteristic sample data, so that the obtaining efficiency and accuracy of the second target gene sequencing result are improved, and the typing efficiency of the tumor sample is further improved.
Step 702, inputting each second associated feature sample data into the trained gene focus amplification typing model obtained by any one of the above construction methods of the gene focus amplification typing model, and outputting the corresponding gene focus amplification typing result.
Wherein the gene focus amplification typing result comprises that the gene is carried in extrachromosomal circular DNA or the gene is carried in chromosomal DNA. The terminal can input each second associated characteristic sample data into the trained gene focus amplification typing model and output a corresponding gene focus amplification typing result.
In one embodiment, as shown in FIG. 10, the gene focus amplification typing results further include a typing result for genes carried in chromosomal DNA. The typing method further includes step 703.
Step 703, determining the typing result for the gene carried in chromosomal DNA according to the absolute copy number information in the second associated feature sample data and the corresponding gene focus amplification typing result.
Wherein the typing result for the gene carried in chromosomal DNA includes that the gene is carried in chromosomal DNA and is amplified, or that the gene is carried in chromosomal DNA and is not amplified. The terminal can determine this typing result according to the absolute copy number information in the second associated feature sample data and the gene focus amplification typing result corresponding to that absolute copy number information, and thereby further determine whether a gene carried in chromosomal DNA is amplified.
In this embodiment, the typing result for the gene carried in chromosomal DNA is determined according to the absolute copy number information in the second associated feature sample data and the corresponding gene focus amplification typing result, so that the accuracy of the gene focus amplification typing result is further improved, and the typing efficiency of the tumor sample is improved.
Step 704, determining a sample typing result of the target tumor sample according to the gene focus amplification typing result.
Wherein the sample typing result comprises circular focus amplification, non-circular focus amplification or afocal amplification. Based on the gene focus amplification typing results output by the trained gene focus amplification typing model, the terminal can accurately determine the sample typing result of the target tumor sample.
In the typing method of the tumor sample, second associated feature sample data specific to each allele in a target tumor sample is obtained; each second associated feature sample data is then input into the trained gene focus amplification typing model obtained in any embodiment of the above construction method, and the corresponding gene focus amplification typing result is output; finally, the sample typing result of the target tumor sample is determined according to the gene focus amplification typing results. Whether a tumor patient under clinical examination carries ecDNA can thus be accurately predicted from the sample typing result of the target tumor sample, which improves the convenience, efficiency and accuracy of tumor sample typing and reduces its cost. Further, the sample typing result of the target tumor sample accelerates the extraction of genomic markers for tumor prognosis prediction, which is of great significance for early screening, diagnosis, prognosis evaluation, and recurrence and metastasis monitoring of tumors.
In one embodiment, as shown in fig. 11, the step of determining the sample typing result of the target tumor sample according to each gene focus amplification typing result includes steps 1001 to 1003.
Step 1001, determining whether each gene focus amplification typing result is that the gene is carried in extrachromosomal circular DNA.
Step 1002, if a gene focus amplification typing result is that the gene is carried in extrachromosomal circular DNA, judging that the sample typing result is circular focus amplification.
Step 1003, if the gene focus amplification typing results are all that the gene is carried in chromosomal DNA, judging whether the sample typing result is non-circular focus amplification or afocal amplification according to the tumor ploidy corresponding to each gene focus amplification typing result.
The terminal can obtain the gene focus amplification typing results output by the trained gene focus amplification typing model and judge whether any of them is that the gene is carried in extrachromosomal circular DNA. If so, the sample typing result is judged to be circular focus amplification. If the gene focus amplification typing results are all that the gene is carried in chromosomal DNA, the sample typing result is judged to be non-circular focus amplification or afocal amplification according to the tumor ploidy corresponding to each gene focus amplification typing result. In this embodiment, by analyzing whether any gene focus amplification typing result indicates extrachromosomal circular DNA or whether all of them indicate chromosomal DNA, the sample is accurately typed as circular focus amplification, non-circular focus amplification or afocal amplification, which improves the efficiency and accuracy of tumor sample typing.
In one embodiment, the step of determining whether the typing result of the sample is non-circular focus amplification or non-focus amplification according to the tumor ploidy corresponding to the focus amplification typing result of each gene comprises:
if the tumor ploidy corresponding to a gene focus amplification typing result is greater than a preset tumor ploidy threshold, judging that the sample typing result is non-circular focus amplification;
and if the tumor ploidy corresponding to each gene focus amplification typing result is less than or equal to the preset tumor ploidy threshold, judging that the sample typing result is afocal amplification.
The preset tumor ploidy threshold may be, but is not limited to, 4 times the tumor ploidy standard value; in practical applications it can be set flexibly according to user requirements and is not limited herein.
In this example, when the tumor ploidy corresponding to a gene focus amplification typing result is greater than the preset tumor ploidy threshold, the sample typing result is judged to be non-circular focus amplification; when the tumor ploidy corresponding to each gene focus amplification typing result is less than or equal to the preset tumor ploidy threshold, the sample typing result is judged to be afocal amplification. The sample typing result is thereby further refined for the case in which all gene focus amplification typing results indicate chromosomal DNA, which improves the efficiency and accuracy of tumor sample typing.
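The decision logic of steps 1001 to 1003 together with the ploidy threshold rule can be summarized in a short sketch. The function name, the label strings and the input layout are assumptions, and the threshold is passed in as a parameter because the description leaves its value to the user.

```python
def type_sample(gene_results, ploidy_threshold):
    """Derive the sample typing result from gene-level typing results.

    gene_results: list of (label, tumor_ploidy) pairs, one per gene, where
    label is "ecDNA" or "chromosomal" (hypothetical label names).
    """
    if not gene_results:
        raise ValueError("no gene-level typing results")
    # Steps 1001/1002: any gene carried in ecDNA makes the sample circular.
    if any(label == "ecDNA" for label, _ in gene_results):
        return "circular focus amplification"
    # Step 1003: all genes are chromosomal, so decide by the ploidy threshold.
    if any(ploidy > ploidy_threshold for _, ploidy in gene_results):
        return "non-circular focus amplification"
    return "afocal amplification"


# Made-up inputs for illustration only; the threshold value is arbitrary.
circular = type_sample([("chromosomal", 2.1), ("ecDNA", 2.1)], ploidy_threshold=8.0)
non_circular = type_sample([("chromosomal", 9.5), ("chromosomal", 2.0)], ploidy_threshold=8.0)
afocal = type_sample([("chromosomal", 2.0), ("chromosomal", 8.0)], ploidy_threshold=8.0)
```

Note that a value exactly equal to the threshold falls on the afocal side, following the "less than or equal to" wording of the rule.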
In one embodiment, as shown in fig. 12, the typing method further comprises step 705.
Step 705, according to the sample typing result, performing Cox survival analysis on the target tumor sample to obtain the prognosis prediction result of each tumor type.
According to the sample typing result of the target tumor sample, Cox survival analysis is performed on the target tumor sample, and the prognosis prediction result of each tumor type can be obtained. The Cox survival analysis can be performed on the target tumor sample using, but not limited to, a Cox model analysis algorithm.
In this embodiment, Cox survival analysis is performed on the target tumor sample according to the sample typing result to obtain the prognosis prediction result of each tumor type, so that the clinical prognosis of the tumor patient to be detected can be accurately predicted from the target tumor sample, and a reference basis is provided for clinically developing new treatment means and targets.
It should be understood that although the various steps in the flow charts of fig. 2-12 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2-12 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In a third aspect, as shown in fig. 13, there is provided an apparatus for constructing a gene focus amplification typing model, the apparatus includes a first data obtaining module 1210, a sample data partitioning module 1220 and a typing model constructing module 1230.
The first data obtaining module 1210 is configured to obtain first associated feature sample data specific to each allele in a preset number of tumor samples; the first associated feature sample data comprises allele-specific absolute copy number information in the tumor sample; the absolute copy number information includes one or more of total copy number of the gene, minor allele copy number of the gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern.
The sample data partitioning module 1220 is configured to perform random partitioning processing on each first associated feature sample data, and generate a training sample set and a test sample set.
The genotyping model constructing module 1230 is configured to train a preset genotyping model according to the training sample set, test the genotyping model according to the test sample set, adjust model parameters of the genotyping model based on PR-AUC indexes obtained by the training and the testing until the PR-AUC indexes meet preset requirements, generate a trained gene focus amplification genotyping model, and output a gene focus amplification genotyping result based on the trained gene focus amplification genotyping model; the result of the gene focus amplification typing includes that the gene is carried in extrachromosomal circular DNA or that the gene is carried in chromosomal DNA.
In one embodiment, the first associated feature sample data further comprises allele-specific amplification frequency information in the tumor sample; the amplification frequency information includes one or more of a circular amplification frequency, a breakage-fusion-bridge amplification frequency, a complex rearrangement amplification frequency, and a linear amplification frequency.
In one embodiment, the first data acquisition module 1210 includes a first sequencing result acquisition unit and a first sequencing result analysis unit.
The first sequencing result acquisition unit is used for acquiring the sequencing results of first target genes of a preset number of tumor patients; the first target gene sequencing result comprises a gene sequencing result of a tumor sample of a tumor patient and a gene sequencing result of a corresponding normal control sample; the first sequencing result analysis unit is used for analyzing each first target gene sequencing result to obtain corresponding first associated characteristic sample data.
In one embodiment, the copy number variation pattern comprises 19 copy number variation submodes and a corresponding ratio of each copy number variation submode.
In one embodiment, the first sequencing result analysis unit includes an absolute copy number information obtainer and a heterozygosity loss ratio obtainer. The absolute copy number information obtainer is used for analyzing the sequencing result of each first target gene based on allele-specific copy number detection software to obtain the corresponding absolute copy number information; the heterozygosity loss ratio obtainer is used for processing each piece of absolute copy number information based on copy number variation analysis and copy number variation pattern analysis software to obtain the corresponding heterozygosity loss ratio and the ratio corresponding to each copy number variation sub-mode.
In one embodiment, the first target gene sequencing result is determined according to one or more of a whole exome sequencing method, a targeted gene sequencing method, a SNP detection method, and a whole genome sequencing method.
In one embodiment, the genotyping model is one of XGBoost, logistic regression, random forest and GBDT.
In one embodiment, the apparatus for constructing the gene focus amplification typing model further comprises a performance verification module.
The performance verification module is used for performing performance verification on the trained gene focus amplification typing model based on a grouped k-fold cross-validation method to obtain a performance verification result.
In one embodiment, the apparatus for constructing the gene focus amplification typing model further comprises a gene list generation module.
The gene list generation module is used for screening target genes related to extrachromosomal circular DNA according to gene focus amplification typing results, analyzing chromosome genome sites and occurrence frequency of the target genes, and generating a sequenced target gene list.
In one embodiment, the first associated feature sample data is an associated feature sample data matrix or an associated feature sample data list.
For the specific limitations of the apparatus for constructing the gene focus amplification typing model, reference may be made to the above limitations of the method for constructing the gene focus amplification typing model, which are not described herein again. The modules in the apparatus for constructing the gene focus amplification typing model can be wholly or partially realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In a fourth aspect, as shown in fig. 14, a tumor sample typing system is provided, which comprises a second data acquiring device 1310, a typing model applying device 1320 and a sample typing generation device 1330.
The second data acquiring device 1310 is configured to obtain second associated feature sample data specific to each allele in the target tumor sample; the second associated feature sample data comprises allele-specific absolute copy number information in the target tumor sample; the absolute copy number information includes one or more of total copy number of the gene, minor allele copy number of the gene, tumor purity, tumor ploidy, copy number variation load, heterozygosity loss ratio, tumor aneuploidy score, and copy number variation pattern.
The typing model applying device 1320 is configured to input each second associated feature sample data into the trained gene focus amplification typing model constructed in any one of the embodiments of the apparatus for constructing a gene focus amplification typing model described above, and output the corresponding gene focus amplification typing result.
The sample typing generation device 1330 is configured to determine a sample typing result of the target tumor sample according to the gene focus amplification typing result; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
In one embodiment, the gene focus amplification typing results further include a typing result for genes carried in chromosomal DNA; the typing system for tumor samples further comprises a device for analyzing the typing result for genes carried in chromosomal DNA.
Wherein this typing result analysis device is used for determining the typing result for the gene carried in chromosomal DNA based on the absolute copy number information in the second associated feature sample data and the corresponding gene focus amplification typing result. The typing result for the gene carried in chromosomal DNA includes that the gene is carried in chromosomal DNA and is amplified, or that the gene is carried in chromosomal DNA and is not amplified.
In one embodiment, the typing system for tumor samples further comprises a sequencing result obtaining device and a sequencing result analyzing device.
The sequencing result acquisition device is used for acquiring a second target gene sequencing result of the tumor patient to be detected; the second target gene sequencing result comprises a gene sequencing result of a target tumor sample of the tumor patient to be detected and a gene sequencing result of a corresponding normal control sample; and the sequencing result analysis device is used for analyzing the sequencing result of each second target gene to obtain corresponding second associated characteristic sample data.
In one embodiment, the system for typing a tumor sample further comprises a survival analysis device.
The survival analysis device is used for carrying out Cox survival analysis on the target tumor sample according to the sample typing result to obtain the prognosis prediction result of each tumor type.
In one embodiment, the sample typing generation device 1330 comprises a gene focus amplification typing result determination module.
The gene focus amplification typing result judging module is used for judging whether each gene focus amplification typing result is gene-carried extrachromosomal circular DNA or not; the gene focus amplification and typing result judging module is also used for judging that the typing result of the sample is circular amplification if the gene focus amplification and typing result is circular DNA carried outside the chromosome; the gene focus amplification and typing result judging module is also used for judging whether the sample typing result is non-circular focus amplification or non-focus amplification according to the tumor ploidy corresponding to each gene focus amplification and typing result if the gene focus amplification and typing results are all carried in chromosome DNA.
For specific limitations of the typing system for tumor samples, reference may be made to the limitations of the typing method for tumor samples described above, which are not repeated here. The modules in the typing system for tumor samples can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 14. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of constructing a gene focus amplification typing model or a method of typing a tumor sample. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 15 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In a fifth aspect, a computer device is provided, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method for constructing a gene focus amplification typing model or the method for typing a tumor sample in any one of the above-mentioned method embodiments when executing the computer program.
In a sixth aspect, a computer readable storage medium is provided, which has a computer program stored thereon, and the computer program is executed by a processor to implement the steps of the method for constructing a gene focus amplification typing model or the method for typing a tumor sample in any one of the above-mentioned method embodiments.
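As a non-limiting illustration of part of such a computer program, the random division of samples and the PR-AUC evaluation used in the construction method could be sketched as follows. PR-AUC is computed here as average precision (the step-wise area under the precision-recall curve), which is one common definition and an assumption about what the source intends; the function names `random_split` and `pr_auc`, the 20% test fraction, and the fixed seed are likewise illustrative.

```python
import random
from typing import List, Sequence, Tuple


def random_split(n_samples: int,
                 test_fraction: float = 0.2,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly divide sample indices into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]


def pr_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for binary labels (1 = positive class).

    Samples are ranked by descending score; precision is accumulated at
    each rank where a positive is found. Tied scores are kept in input
    order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have equal length")
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos
```

In a tuning loop, a candidate model would be fitted on the training indices, scored on the test indices, and its parameters adjusted until `pr_auc` meets the preset requirement. PR-AUC is generally preferred over ROC-AUC when positives are rare, because it is not inflated by a large number of true negatives.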
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (19)

1. A method for constructing a gene focus amplification typing model is characterized by comprising the following steps:
obtaining first associated characteristic sample data specific to each allele in a preset number of tumor samples; the first associated characteristic sample data comprises allele-specific absolute copy number information in the tumor sample; the absolute copy number information includes one or more of total copy number of genes, minor allele copy number of genes, tumor purity, tumor ploidy, copy number variation load, heterozygosity deletion ratio, tumor heteroploidy score, and copy number variation pattern;
carrying out random division processing on the first associated characteristic sample data to generate a training sample set and a test sample set;
training a preset genotyping model according to the training sample set, testing the genotyping model according to the testing sample set, adjusting model parameters of the genotyping model based on PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating a trained gene focus amplification genotyping model, and outputting a gene focus amplification genotyping result based on the trained gene focus amplification genotyping model; the gene focus amplification typing result comprises that the gene is carried in extrachromosomal circular DNA or the gene is carried in chromosomal DNA.
2. The method for constructing a gene focus amplification typing model according to claim 1, wherein the first associated characteristic sample data further comprises allele-specific amplification frequency information in the tumor sample; the amplification frequency information includes one or more of a circular amplification frequency, a breakage-fusion-bridge (BFB) cycle amplification frequency, a complex rearrangement amplification frequency, and a linear amplification frequency.
3. The method for constructing a gene focus amplification typing model according to claim 1, wherein the step of obtaining the first associated characteristic sample data specific to each allele in the preset number of tumor samples comprises:
obtaining a first target gene sequencing result of the preset number of tumor patients; the first target gene sequencing result comprises a gene sequencing result of a tumor sample of the tumor patient and a gene sequencing result of a corresponding normal control sample;
and analyzing the sequencing result of each first target gene to obtain corresponding first associated characteristic sample data.
4. The method of claim 3, wherein the copy number variation pattern comprises 19 copy number variation submodes and a ratio corresponding to each of the copy number variation submodes.
5. The method of claim 4, wherein the step of analyzing each of the first target gene sequencing results to obtain the corresponding first associated feature sample data comprises:
analyzing the sequencing result of each first target gene based on allele-specific copy number detection software to obtain corresponding absolute copy number information;
and calculating the absolute copy number information based on copy number variation analysis and copy number variation mode analysis software to obtain the corresponding heterozygosity deletion ratio and the ratio corresponding to each copy number variation submode.
6. The method for constructing a gene focus amplification typing model according to claim 3, wherein the first target gene sequencing result is determined according to one or more of a whole exome sequencing method, a targeted gene sequencing method, an SNP detection method and a whole genome sequencing method.
7. The method for constructing a gene focus amplification and typing model according to claim 1, wherein the genotyping model is one of XGBoost, logistic regression, random forest and GBDT.
8. The method for constructing a gene focus amplification typing model according to claim 1, further comprising:
and performing performance verification on the trained gene focus amplification typing model based on a grouped k-fold cross-validation method to obtain a performance verification result.
9. The method for constructing a gene focus amplification typing model according to claim 1, further comprising:
and screening target genes related to the extrachromosomal circular DNA according to the gene focus amplification typing result, analyzing the chromosomal genomic loci and occurrence frequency of the target genes, and generating a sorted target gene list.
10. The method of constructing a gene focus amplification typing model according to claim 1, wherein the first associated feature sample data is an associated feature sample data matrix or an associated feature sample data list.
11. A method of typing a tumor sample, the method comprising:
acquiring second associated characteristic sample data specific to each allele in the target tumor sample; the second associated characteristic sample data comprises allele-specific absolute copy number information in the target tumor sample; the absolute copy number information includes one or more of total copy number of genes, minor allele copy number of genes, tumor purity, tumor ploidy, copy number variation load, heterozygosity deletion ratio, tumor heteroploidy score, and copy number variation pattern;
inputting each of said second associated characteristic sample data into the trained gene focus amplification typing model according to any one of claims 1 to 10, and outputting the corresponding gene focus amplification typing result;
determining a sample typing result of the target tumor sample according to each gene focus amplification typing result; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
12. The method of claim 11, wherein the gene focus amplification typing result further comprises a typing result of the gene carried in chromosomal DNA; the typing method further comprises the following steps:
determining the typing result of the gene carried in chromosomal DNA according to the absolute copy number information in the second associated characteristic sample data and the corresponding gene focus amplification typing result; the typing result of the gene carried in chromosomal DNA includes that the gene is carried in chromosomal DNA and amplification occurs, or that the gene is carried in chromosomal DNA and amplification does not occur.
13. The method of typing a tumor sample according to claim 11, further comprising:
obtaining a second target gene sequencing result of a tumor patient to be detected; the second target gene sequencing result comprises a gene sequencing result of a target tumor sample of the tumor patient to be detected and a gene sequencing result of a corresponding normal control sample;
and analyzing the sequencing result of each second target gene to obtain corresponding second associated characteristic sample data.
14. The method of typing a tumor sample according to claim 11, further comprising:
and carrying out Cox survival analysis on the target tumor sample according to the sample typing result to obtain a prognosis prediction result of each tumor type.
15. The method of claim 11, wherein the step of determining a sample typing result of the target tumor sample according to each of the gene focus amplification typing results comprises:
judging whether the gene focus amplification typing result is that the gene is carried in extrachromosomal circular DNA or not;
if the gene focus amplification typing result is that the gene is carried in extrachromosomal circular DNA, judging that the sample typing result is the circular focus amplification;
and if all the gene focus amplification typing results are that the gene is carried in chromosomal DNA, judging whether the sample typing result is the non-circular focus amplification or the afocal amplification according to the tumor ploidy corresponding to each gene focus amplification typing result.
16. A construction device of a gene focus amplification typing model is characterized by comprising:
the first data acquisition module is used for acquiring first associated characteristic sample data specific to each allele in a preset number of tumor samples; the first associated characteristic sample data comprises allele-specific absolute copy number information in the tumor sample; the absolute copy number information includes one or more of total copy number of genes, minor allele copy number of genes, tumor purity, tumor ploidy, copy number variation load, heterozygosity deletion ratio, tumor heteroploidy score, and copy number variation pattern;
the sample data dividing module is used for randomly dividing each first associated characteristic sample data to generate a training sample set and a test sample set;
the genotyping model construction module is used for training a preset genotyping model according to the training sample set, testing the genotyping model according to the testing sample set, adjusting model parameters of the genotyping model based on PR-AUC indexes obtained by training and testing until the PR-AUC indexes meet preset requirements, generating the trained gene focus amplification genotyping model, and outputting a gene focus amplification genotyping result based on the trained gene focus amplification genotyping model; the gene focus amplification typing result comprises that the gene is carried in extrachromosomal circular DNA or the gene is carried in chromosomal DNA.
17. A system for typing a tumor sample, said system comprising:
the second data acquisition device is used for acquiring second associated characteristic sample data specific to each allele in the target tumor sample; the second associated characteristic sample data comprises allele-specific absolute copy number information in the target tumor sample; the absolute copy number information includes one or more of total copy number of genes, minor allele copy number of genes, tumor purity, tumor ploidy, copy number variation load, heterozygosity deletion ratio, tumor heteroploidy score, and copy number variation pattern;
typing model application means for inputting each of the second associated characteristic sample data to the trained gene focus amplification typing model constructed by the gene focus amplification typing model construction means according to claim 16, and outputting a corresponding gene focus amplification typing result;
the sample typing generation device is used for determining a sample typing result of the target tumor sample according to each gene focus amplification typing result; the sample typing results include circular focus amplification, non-circular focus amplification or afocal amplification.
18. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the steps of the method of constructing a gene focus amplification typing model according to any one of claims 1 to 10 or the method of typing a tumor sample according to any one of claims 11 to 15.
19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of constructing a gene focus amplification typing model according to any one of claims 1 to 10 or the method of typing a tumor sample according to any one of claims 11 to 15.
CN202211067952.6A 2022-09-01 2022-09-01 Construction method of gene focus amplification typing model and typing method of tumor sample Active CN115148287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211067952.6A CN115148287B (en) 2022-09-01 2022-09-01 Construction method of gene focus amplification typing model and typing method of tumor sample

Publications (2)

Publication Number Publication Date
CN115148287A (en) 2022-10-04
CN115148287B CN115148287B (en) 2024-05-31

Family

ID=83415492

Country Status (1)

Country Link
CN (1) CN115148287B (en)

Also Published As

Publication number Publication date
CN115148287B (en) 2024-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant