CN118197523A - Method and system for generating gene comparison table and generating prognosis rehabilitation report - Google Patents

Method and system for generating gene comparison table and generating prognosis rehabilitation report

Info

Publication number
CN118197523A
Authority
CN
China
Prior art keywords
information
gene
data
gene sequence
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410231123.XA
Other languages
Chinese (zh)
Inventor
赵梓丞
王轶男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Bairen Biotechnology Co ltd
Original Assignee
Shenzhen Bairen Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Bairen Biotechnology Co ltd
Priority to CN202410231123.XA
Publication of CN118197523A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of gene sequence retrieval, in particular to a method and a system for generating a gene comparison table and a prognosis rehabilitation report. The method for generating the gene comparison table comprises the following steps: obtaining a plurality of gene sequence data from a normalized dataset; screening the gene sequence data according to a first preset screening rule to obtain optimized gene sequence data; analyzing the optimized gene sequence data according to the point location information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor; judging whether the correlation degree corresponding to the mutation data is greater than or equal to a first preset threshold value; when the correlation degree is greater than or equal to the first preset threshold value, marking the corresponding mutation data as effective mutation data, and when the correlation degree is less than the first preset threshold value, eliminating the corresponding mutation data; and forming a gene comparison table from all the effective mutation data. This solves the problem that existing gene sequence libraries cannot be searched automatically and gene reports cannot be generated automatically.

Description

Method and system for generating gene comparison table and generating prognosis rehabilitation report
Technical Field
The invention relates to the field of gene library retrieval, in particular to a method and a system for generating a gene comparison table and a prognosis rehabilitation report.
Background
Single nucleotide polymorphisms (SNPs), which mainly refer to DNA sequence diversity at the genomic level caused by single nucleotide variation, are numerous and polymorphic, and are prevalent in the human genome. So far, the single nucleotide polymorphism database (dbSNP) has accumulated 3.3 billion SNPs. SNPs are unevenly distributed in the genome and frequently occur in non-coding regions that are subject to natural selection pressure. Non-coding SNPs have been found to affect disease progression and clinical phenotype by affecting transcription factor binding activity, mRNA structure, gene expression, epigenetic status, and other factors.
Some SNP sites also affect gene function, resulting in altered biological properties and even pathogenicity. Interpreting the results of different mutation processes at SNP loci provides a new perspective for understanding the functional basis of the molecular changes related to somatic cell aging.
It is therefore necessary to provide a method and system for generating a gene comparison table and a prognosis rehabilitation report, so that genes can be searched and reports generated automatically.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method and a system for generating a gene comparison table and a prognosis rehabilitation report, and aims to solve the problem that the existing gene search library cannot automatically search genes and generate reports.
The technical scheme provided by the invention is as follows:
A method of generating a gene comparison table, the method comprising:
obtaining a plurality of gene sequence data from a normalized dataset;
Screening the gene sequence data according to a first preset screening rule to obtain optimized gene sequence data;
analyzing the optimized gene sequence data according to the point location information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor;
judging whether the correlation degree corresponding to the mutation data is larger than or equal to a first preset threshold value;
when the correlation degree is greater than or equal to a first preset threshold value, marking the corresponding mutation data as effective mutation data,
When the correlation degree is smaller than a first preset threshold value, eliminating the corresponding mutation data;
forming a gene comparison table from all the effective mutation data.
Preferably, the gene sequence data includes short-read long-sequence support rate information in normal tissue, sequencing depth information in tumor, and allele frequency information in tumor, and the screening the gene sequence data according to the first preset screening rule includes:
Judging whether the short-reading long-sequence support rate information in the normal tissue of the gene sequence data is lower than a second preset threshold value, whether the sequencing depth information in the tumor is lower than a third preset threshold value or whether the allele frequency information in the tumor is lower than a fourth preset threshold value;
And rejecting the gene sequence data when the support rate of the short-reading long sequences in the normal tissues of the gene sequence data is lower than a second preset threshold value, the sequencing depth in the tumor is lower than a third preset threshold value or the allele frequency in the tumor is lower than a fourth preset threshold value.
Preferably, the screening the gene sequence data according to the first preset screening rule further includes:
Constructing the gene sequence data into a 2×2 list (contingency table);
Judging whether the check value of each gene sequence data in the list is smaller than a fifth preset threshold value;
When the check value is smaller than a fifth preset threshold value, reserving the corresponding gene sequence data;
And when the check value is greater than or equal to a fifth preset threshold value, rejecting the corresponding gene sequence data.
Preferably, analyzing the gene sequence according to the point location information of the gene sequence to obtain a plurality of mutation data includes:
combining the mutation data according to a preset type and generating a plurality of trinucleotide context matrixes;
Generating a plurality of mutation data by the trinucleotide context matrix and a preset classification table.
Preferably, prior to obtaining several gene sequence data from the somatic mutation dataset, the method comprises:
the somatic cell dataset is pre-processed to obtain transformed normalized data.
Preferably, the method for generating a prognostic rehabilitation report comprises:
acquiring personal gene information, pathological information and physiological information;
obtaining comparison information by referring to the gene comparison table of any one of the above according to the personal gene information;
generating first integral information according to the pathological information and the contrast information;
generating second integral information according to the physiological information and the contrast information;
extracting palindromic information of the gene information to generate third integral information;
calculating total integral according to the first integral information, the second integral information and the third integral information;
and generating a first prognosis rehabilitation report according to the total integral.
Preferably, the gene comparison table generation system includes:
The acquisition module is used for acquiring a plurality of gene sequence data from the somatic mutation data set;
The screening module is used for screening the gene sequence data according to a first preset screening rule to obtain the optimized gene sequence data;
The analysis module is used for analyzing the optimized gene sequence data according to the point position information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor;
the first judging module is used for judging whether the correlation degree corresponding to the mutation data is larger than or equal to a first preset threshold value;
A marking module, configured to mark the corresponding mutation data as valid mutation data when the correlation degree is greater than or equal to a first preset threshold value,
The rejecting module is used for rejecting the corresponding mutation data when the correlation degree is smaller than a first preset threshold value;
The first generation module is used for forming a gene comparison table according to all the effective mutation data.
Preferably, the screening module includes:
The first judging submodule is used for judging whether the short-reading long sequence support rate information in the normal tissue of the gene sequence data is lower than a second preset threshold value, whether the sequencing depth information in the tumor is lower than a third preset threshold value or whether the allele frequency information in the tumor is lower than a fourth preset threshold value;
The first sub-eliminating module is used for eliminating the gene sequence data when the support rate of the short-reading long sequences in the normal tissues of the gene sequence data is lower than a second preset threshold value, the sequencing depth in the tumor is lower than a third preset threshold value or the allele frequency in the tumor is lower than a fourth preset threshold value.
Preferably, the screening module includes:
the association module is used for constructing the gene sequence data into a 2×2 list (contingency table);
the second judging submodule is used for judging whether the check value of each gene sequence data in the list is smaller than a fifth preset threshold value or not;
The retaining module is used for retaining the corresponding gene sequence data when the check value is smaller than a fifth preset threshold value;
and the second sub-eliminating module is used for eliminating the corresponding gene sequence data when the check value is greater than or equal to a fifth preset threshold value.
Preferably, the parsing module includes:
the reconstruction module is used for merging the mutation data according to a preset type and generating a plurality of trinucleotide context matrixes;
And the first sub-generation module is used for generating a plurality of mutation data by the trinucleotide context matrix and a preset classification table.
Preferably, the gene comparison table generation system includes:
and the transformation module is used for preprocessing the somatic cell data set to obtain normalized data after transformation.
Preferably, the prognostic rehabilitation report generating system includes:
the personal information module is used for acquiring personal gene information, pathological information and physiological information;
a comparison module for obtaining comparison information by referring to the gene comparison table of any one of the above claims 1 to 5 according to the personal gene information;
The first integration module is used for generating first integration information according to the pathological information and the contrast information;
the second integration module is used for generating second integration information according to the physiological information and the contrast information;
The third integration module is used for extracting palindromic information of the gene information to generate third integration information;
the calculation module is used for calculating total integral according to the first integral information, the second integral information and the third integral information;
and the second generation module is used for generating a first prognosis rehabilitation report according to the total integral.
In order to solve the above problem, an embodiment of the present invention further provides an electronic device, including:
At least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the method of generating a gene comparison table as described above.
In order to solve the above-described problems, an embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of generating a gene comparison table as described above.
According to the technical scheme, a plurality of gene sequence data are obtained from the normalized data set; the gene sequence data are screened according to a first preset screening rule to obtain optimized gene sequence data; the optimized gene sequence data are analyzed according to the point location information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor; whether the correlation degree corresponding to the mutation data is greater than or equal to a first preset threshold value is judged; when the correlation degree is greater than or equal to the first preset threshold value, the corresponding mutation data is marked as effective mutation data, and when the correlation degree is less than the first preset threshold value, the corresponding mutation data is rejected; and a gene comparison table is formed from all the effective mutation data. This solves the problems of insufficient screening, comparison, and retrieval accuracy in existing gene report generation.
Drawings
FIG. 1 is a main flow chart of a method for generating a gene comparison table according to an embodiment of the present invention;
FIG. 2 is a first sub-flowchart of a method for generating a gene comparison table according to an embodiment of the present invention;
FIG. 3 is a second sub-flowchart of a method for generating a gene comparison table according to an embodiment of the present invention;
FIG. 4 is a third sub-flowchart of a method for generating a gene comparison table according to an embodiment of the present invention;
FIG. 5 is a fourth sub-flowchart of a method for generating a gene comparison table according to an embodiment of the present invention;
FIG. 6 is a main flow chart of a method of generating a prognostic rehabilitation report provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a first substructure of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a second substructure of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a third substructure of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a fourth substructure of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a system for generating a gene comparison table according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Furthermore, in the description of the present specification and the appended claims, the terms "first" and "second" and the like are used solely to distinguish one from another, and are not to be construed as indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places throughout this specification are not necessarily all referring to the same embodiment, but mean "one or more, but not all, embodiments" unless expressly specified otherwise.
The embodiment of the application provides a method for generating a gene comparison table. The execution subject of the method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiment of the application. In other words, the method of generating the gene comparison table may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server side includes, but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Referring to FIG. 1, a main flow chart of a method for generating a gene comparison table according to an embodiment of the invention is shown. The method for generating the gene comparison table comprises the following steps:
Step S101, acquiring a plurality of gene sequence data from a normalized data set.
More specifically, a specific procedure for acquiring several gene sequence data from the normalized dataset will be described in detail below.
Step S102, screening the gene sequence data according to a first preset screening rule to obtain optimized gene sequence data.
More specifically, the gene sequence data is screened according to a first preset screening rule, and a specific process of obtaining the optimized gene sequence data will be described in detail below.
Step S103, analyzing the optimized gene sequence data according to the point location information of the gene sequence to obtain a plurality of mutation data.
Wherein each mutation data includes a correlation corresponding to the first factor. Specifically, the first factor is a specific cancer type gene sequence, and the optimized gene sequence data is divided into two groups according to whether the gene sequence of the specific cancer type is carried or not.
Analysis methods include, but are not limited to: extracting the sequence context of the gene sequence data and using STREME to find the motifs enriched therein; examining, for the significantly enriched gene sequences, their distribution among the gene sequences of each cancer type (single nucleotide polymorphisms of somatic mutations), as well as among common gene sequences and somatic mutations that are not somatic mutation single nucleotide polymorphisms; performing Kaplan-Meier survival analysis with the R package "survminer"; using multivariate Cox regression analysis to determine the prognostic impact of the gene sequence data; identifying significantly mutated genes (SMGs) in different cancer groups; and decomposing and comparing the contributions of the mutational signatures in different cancer groups.
Step S104, judging whether the correlation degree corresponding to the mutation data is larger than or equal to a first preset threshold value, and executing step S105 when the correlation degree is larger than or equal to the first preset threshold value; when the correlation is smaller than the first preset threshold, step S106 is performed.
Specifically, the first preset threshold value in this embodiment is 0.3; that is, when the correlation degree is greater than or equal to 0.3, the mutation data is marked as effective mutation data.
Step S105, marking the corresponding mutation data as valid mutation data.
Step S106, eliminating the corresponding mutation data.
Step S107, a gene comparison table is formed according to all the effective mutation data.
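Steps S104 to S107 can be illustrated with a minimal sketch. The record layout, the site identifiers, and the function name below are hypothetical and are not taken from the embodiment; only the 0.3 threshold and the keep-or-eliminate logic come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRecord:
    site: str           # hypothetical identifier, e.g. "chr1:100:C>T"
    correlation: float  # correlation degree with the first factor


def build_comparison_table(records, threshold=0.3):
    """Keep records whose correlation degree is >= threshold (S104 to S107)."""
    table = {}
    for rec in records:
        if rec.correlation >= threshold:
            # S105: mark as effective mutation data and add it to the table
            table[rec.site] = rec.correlation
        # S106: records below the threshold are eliminated (not added)
    return table


# Hypothetical example data
records = [
    MutationRecord("chr1:100:C>T", 0.45),
    MutationRecord("chr2:200:G>A", 0.30),
    MutationRecord("chr3:300:T>C", 0.12),
]
table = build_comparison_table(records)
```

A dictionary keyed by site is only one possible representation of the comparison table; it is chosen here because the later lookup by personal gene information would then be a direct key access.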
Recurrent somatic mutations in tumors cannot be found by conventional methods. Somatic mutations are found by whole-genome sequencing followed by quality control, alignment, and mutation detection in comparison with paracancerous tissue, whereas recurrent mutations are those that appear repeatedly during tumor development.
Referring to FIG. 2 in combination, a first sub-flowchart of a method for generating a gene comparison table according to an embodiment of the invention is shown. The gene sequence data includes short-read long-sequence support rate information in normal tissue, sequencing depth information in tumor, and allele frequency information in tumor, and step S102 includes:
Step S201, judging whether the short-reading long sequence support rate information in the normal tissue of the gene sequence data is lower than a second preset threshold value, whether the sequencing depth information in the tumor is lower than a third preset threshold value, or whether the allele frequency information in the tumor is lower than a fourth preset threshold value. When the short-reading long sequence support rate in the normal tissue is lower than the second preset threshold value, the sequencing depth in the tumor is lower than the third preset threshold value, or the allele frequency in the tumor is lower than the fourth preset threshold value, step S202 is executed; when the short-reading long sequence support rate in the normal tissue is not lower than the second preset threshold value, the sequencing depth in the tumor is not lower than the third preset threshold value, and the allele frequency in the tumor is not lower than the fourth preset threshold value, step S203 is executed.
Specifically, the second preset threshold is 3, the third preset threshold is 5%, and the fourth preset threshold is 0.05. Sites in the dataset whose minor allele frequency is less than or equal to 1% are first filtered out, because such mutations may result from sequencing anomalies, alignment anomalies, and the like, and are therefore unreliable mutation information. Then, using the mutation coverage depth information carried in the database, the method removes mutations having any one of the following characteristics:
a. mutations with a short-reading long sequence support rate of not lower than 3 times in normal tissue;
b. mutations with a sequencing depth of less than 5% in tumors;
c. mutations with an allele frequency (VAF) below 0.05 in tumors.
Through the filtering and screening, the mutation set which is effective in the data set and can be applied to subsequent screening and detection can be obtained.
Step S202, deleting the gene sequence data.
Step S203, reserving gene sequence data.
In summary, the system filters out data with infrequent mutations to improve the effectiveness of the data.
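The removal criteria a to c listed above can be sketched as a single predicate. This is an illustration under stated assumptions: the function and parameter names are hypothetical, and because the unit of the third threshold is not clear from the text, the minimum tumor depth is left as a required argument rather than given a default.

```python
def keep_mutation(normal_support, tumor_depth, tumor_vaf, *,
                  min_tumor_depth, max_normal_support=3, min_tumor_vaf=0.05):
    """Return True if a mutation survives criteria a to c, False otherwise."""
    if normal_support >= max_normal_support:
        return False  # criterion a: too much read support in normal tissue
    if tumor_depth < min_tumor_depth:
        return False  # criterion b: sequencing depth in tumor too low
    if tumor_vaf < min_tumor_vaf:
        return False  # criterion c: allele frequency in tumor too low
    return True


# Hypothetical records: (normal_support, tumor_depth, tumor_vaf)
candidates = [(0, 40, 0.20), (3, 40, 0.20), (0, 5, 0.20), (0, 40, 0.01)]
kept = [c for c in candidates if keep_mutation(*c, min_tumor_depth=10)]
```

Only the first candidate passes all three checks; each of the others fails exactly one criterion.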
Referring to FIG. 3, a second sub-flowchart of a method for generating a gene comparison table according to an embodiment of the invention is shown. Step S102 further includes:
Step S203, constructing the gene sequence data into a 2×2 list (contingency table).
Step S204, judging whether the check value of each gene sequence data in the list is smaller than a fifth preset threshold value; when the check value is smaller than the fifth preset threshold, step S205 is performed; when the check value is greater than or equal to the fifth preset threshold, step S206 is performed.
Step S205, reserving corresponding gene sequence data.
Step S206, eliminating the corresponding gene sequence data.
Specifically, the fifth preset threshold value used in this embodiment is 0.05, and the check used in this embodiment is the Fisher exact test; that is, a recurrent somatic mutation is confirmed when the check value is less than 0.05.
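The check in steps S204 to S206 can be sketched with a two-sided Fisher exact test written against the Python standard library. The function name and the example counts are hypothetical; which counts populate the four cells of the 2×2 list is not specified in the text and is assumed here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Hypothetical counts: carriers / non-carriers in two groups
p_value = fisher_exact_two_sided(8, 2, 1, 5)
retain = p_value < 0.05  # S205 when True, S206 when False
```

The small relative tolerance in the comparison guards against floating-point ties between tables that are exactly as likely as the observed one.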
Referring to FIG. 4, a third sub-flowchart of a method for generating a gene comparison table according to an embodiment of the invention is shown. Step S103 includes:
step S301, combining mutation data according to preset types and generating a plurality of trinucleotide context matrixes.
Step S302, generating a plurality of mutation data by the trinucleotide context matrix and a preset classification table.
Specifically, the R tool "sigminer" is used to combine, by cancer type, the 96 trinucleotide context counts generated for the samples into a 96×33 matrix. The 96×33 matrix is then decomposed into known COSMIC profiles, and the contribution of each mutation signature to each cancer type is determined.
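The 96 trinucleotide contexts arise from 6 pyrimidine-centered substitution types combined with 4 possible upstream and 4 possible downstream bases. The sketch below is not the "sigminer" implementation; it only illustrates, with hypothetical function names and input tuples, how single-base substitutions can be assigned to the 96 channels and counted. Stacking one such count vector per cancer type would give a 96×N matrix.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
# 6 substitution types x 4 upstream bases x 4 downstream bases = 96 channels
CHANNELS = [
    f"{up}[{sub}]{down}"
    for sub in SUBSTITUTIONS
    for up, down in product("ACGT", repeat=2)
]


def channel(context, alt):
    """Map a trinucleotide context and an alternate base to a channel label."""
    if context[1] in "AG":
        # Fold purine references onto the complementary (pyrimidine) strand
        context = "".join(COMPLEMENT[base] for base in reversed(context))
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def count_channels(mutations):
    """Count (context, alt) pairs per channel; returns a dict with 96 keys."""
    counts = dict.fromkeys(CHANNELS, 0)
    for context, alt in mutations:
        counts[channel(context, alt)] += 1
    return counts


# Hypothetical mutations: (reference trinucleotide, alternate base)
counts = count_channels([("ACG", "T"), ("CGT", "A"), ("TTA", "C")])
```

In this example the first two mutations land in the same channel, because "CGT" with alternate base A is the reverse complement of "ACG" with alternate base T.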
Referring to FIG. 5, a fourth sub-flowchart of a method for generating a gene comparison table according to an embodiment of the invention is shown. Before performing step S101, the method comprises:
step S401, preprocessing the somatic cell data set to obtain normalized data after transformation.
Specifically, smSNPs (somatic mutation single nucleotide polymorphisms) are identified from a mutation dataset (somatic dataset), and the somatic mutation data are retrieved and collected. The mutation dataset is input in VCF format or a VCF-like format to form the normalized data.
The VCF format is a commonly used text file format for storing variant information in a genome; VCF is an abbreviation of "Variant Call Format". This format is often used to describe single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and other types of genetic variation. A VCF file is mainly composed of the following parts:
Header: lines beginning with "##", which describe the meta-information of the file, including the file format version, the reference genome, annotation descriptions for the various variant information, sample information, etc.
Column headers: one row after the header, beginning with a hash sign (#). It typically includes the following fields:
#CHROM: chromosome number;
POS: the specific position (counted in base pairs) where the mutation occurs;
ID: a unique identifier of the variant, represented by a dot (.) if none exists;
REF: the sequence of the reference genome;
ALT: the variant sequence (possibly more than one);
QUAL: the quality score of the mutation detection;
FILTER: whether the site passes quality control;
INFO: an additional information field containing various information and annotations related to the mutation;
FORMAT: the format of the variant call details, followed by the sample information.
Data lines: each row represents one mutation in the genome. The data in a row correspond to the fields defined in the column headers, and the fields are separated by tab characters. Within the INFO and FORMAT fields, a semicolon (;) or a colon (:), respectively, is typically used to separate the different pieces of information.
The VCF format effectively integrates mutation information and comprehensive annotations to the mutation and allows information sharing using standardized means. This format is widely used in bioinformatic analysis, genomic research and clinical genetic diagnosis.
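The layout described above can be read with a few lines of code. This is a minimal sketch that handles only the eight fixed columns; the function names and the sample lines are hypothetical, and a production pipeline would more likely rely on an established VCF parsing library.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternates are allowed
    info = {}
    for item in record["INFO"].split(";"):
        if item == ".":
            continue  # a dot marks an empty INFO field
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flags carry no value
    record["INFO"] = info
    return record


def read_vcf(lines):
    """Skip header lines (starting with '#') and parse the data lines."""
    return [parse_vcf_line(l) for l in lines if l.strip() and not l.startswith("#")]


# Hypothetical three-line VCF
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\trs1\tC\tT,G\t50\tPASS\tDP=14;DB",
]
parsed = read_vcf(sample)
```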
Specifically, common SNP (single nucleotide polymorphism) sites are obtained from a known SNP data table and used to identify somatic mutation single nucleotide polymorphisms (smSNPs) in the TCGA somatic mutation dataset (somatic dataset). The smSNPs located in regulatory feature regions are then further selected, so that the non-coding functional smSNPs in the TCGA mutation dataset are identified.
Referring to FIG. 6 in combination, a main flow chart of a method for generating a prognostic rehabilitation report according to an embodiment of the present invention is shown. The method for generating the prognosis rehabilitation report comprises the following steps:
step S501, acquiring personal gene information, pathology information and physiology information.
Step S502, obtaining comparison information by referring to the gene comparison table of any one of the above items according to the personal gene information.
Step S503, generating first integral information according to the pathological information and the contrast information.
Wherein the pathology information includes cancer type information.
Step S504, generating second integral information according to the physiological information and the contrast information.
Wherein the physiological information includes gender and age information.
Step S505, extracting the palindromic information of the gene information to generate third integral information.
Step S506, calculating total integral according to the first integral information, the second integral information and the third integral information.
Step S507, a first prognosis rehabilitation report is generated according to the total integral.
In detail, each step of the prognostic rehabilitation report generating method in the embodiment of the present invention adopts the same technical means as the method for generating the gene comparison table in FIGS. 1 to 5 and can produce the same technical effects, which are not described again here.
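Steps S505 and S506 can be sketched as follows. The text does not state how palindromic information is extracted or how the three integrals are combined, so both parts are assumptions: palindromes are taken to be windows equal to their own reverse complement, and the total integral is taken to be a plain sum. All names and values are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, length=6):
    """Return (start, window) for each window equal to its reverse complement."""
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if window == reverse_complement(window):
            hits.append((start, window))
    return hits


def total_integral(first, second, third):
    # Assumption: a plain sum; the source does not give the formula.
    return first + second + third


# Hypothetical sequence and integral values
hits = find_palindromes("AAGAATTCTT")
total = total_integral(2, 1, len(hits))
```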
Referring to FIG. 7, a schematic diagram of a system for generating a gene comparison table according to an embodiment of the invention is shown. The gene comparison table generation system 1 includes: an acquisition module 11, a screening module 12, an analysis module 13, a first judging module 14, a marking module 15, a rejecting module 16 and a first generation module 17.
The acquisition module 11 is used for acquiring a plurality of gene sequence data from the somatic mutation data set.
The screening module 12 is used for screening the gene sequence data according to a first preset screening rule to obtain optimized gene sequence data.
The analysis module 13 is used for analyzing the optimized gene sequence data according to the point location information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor.
Specifically, the first factor is a specific cancer type gene sequence, and the optimized gene sequence data is divided into two groups according to whether the gene sequence of the specific cancer type is carried or not.
Methods of analysis include, but are not limited to: extracting the sequence context of the gene sequence data and using STREME to find the motifs enriched therein; examining the distribution of the gene sequences of each cancer type (somatic mutation single nucleotide polymorphisms) in the significantly enriched gene sequences, as well as the distribution of common gene sequences and of somatic mutations that are not somatic mutation single nucleotide polymorphisms; performing Kaplan-Meier survival analysis using the R package "survminer"; using multivariate Cox regression analysis to determine the prognostic impact of the gene sequence data; identifying significantly mutated genes (SMGs) in different cancer groups; and decomposing and comparing the contributions of the mutational signatures in different cancer groups.
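The Kaplan-Meier survival analysis mentioned above can be illustrated in plain Python. This is only a sketch of the estimator itself, not of the R package used in the embodiment; the function name `kaplan_meier` and the input layout (follow-up times plus event flags, where 1 means the event was observed and 0 means censored) are assumptions made for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an observed event."""
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    # Number of observed events at each time point.
    deaths = Counter(t for t, e in zip(times, events) if e)
    # Number of subjects leaving the risk set (event or censoring) at each time.
    leaving = Counter(times)
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve
```

Calling the function separately on the carrier group and the non-carrier group yields two curves that can be compared, which is the kind of grouping the first factor describes.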
The first determining module 14 is configured to determine whether the correlation degree corresponding to the mutation data is greater than or equal to a first preset threshold.
Specifically, the first preset threshold value in this embodiment is 0.3; that is, when the correlation degree is greater than or equal to 0.3, the mutation data is marked as effective mutation data.
The marking module 15 is configured to mark the corresponding mutation data as effective mutation data when the correlation degree is greater than or equal to the first preset threshold value.
The rejecting module 16 is configured to reject the corresponding mutation data when the correlation degree is smaller than the first preset threshold value.
The first generation module 17 is configured to form a gene comparison table according to all the effective mutation data.
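The judging, marking, rejecting and generating modules can be sketched together as a single filter. The record layout (a dictionary with a `correlation` key) and the function name `build_comparison_table` are hypothetical; only the threshold of 0.3 and the greater-than-or-equal comparison come from the description.

```python
def build_comparison_table(mutation_data, threshold=0.3):
    """Keep the mutation records whose correlation degree reaches the threshold."""
    # Records at or above the threshold are the effective mutation data;
    # all other records are rejected.
    return [m for m in mutation_data if m["correlation"] >= threshold]
```

A record whose correlation degree is exactly 0.3 is kept, which matches the behaviour described for the marking module.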
Recurrent somatic mutations in tumors cannot be found by conventional methods. Somatic mutations are found by whole genome sequencing followed by quality control, alignment and mutation detection, with paracancerous tissue used as the comparison, whereas recurrent mutations are those that appear repeatedly during tumor development.
Referring to fig. 8, a schematic diagram of a first substructure of a system for generating a gene mapping table according to an embodiment of the invention is shown. The screening module 12 includes: a first determination sub-module 121 and a first culling sub-module 122.
The first determination sub-module 121 is configured to judge whether the short-read sequence support rate information in normal tissue of the gene sequence data is lower than a second preset threshold, whether the sequencing depth information in the tumor is lower than a third preset threshold, or whether the allele frequency information in the tumor is lower than a fourth preset threshold.
Specifically, the second preset threshold is 3, the third preset threshold is 5%, and the fourth preset threshold is 0.05. Sites in the dataset whose minor allele frequency is less than or equal to 1% are filtered out, because such mutations may result from sequencing anomalies, alignment anomalies and the like, and constitute unreliable mutation information. Then, using the mutation coverage depth information carried in the database, the method removes any mutation that has one of the following characteristics:
a. mutations supported no fewer than 3 times by short-read sequences in normal tissue;
b. mutations with a sequencing depth lower than 5% in the tumor;
c. mutations with a variant allele frequency (VAF) lower than 0.05 in the tumor.
Through the filtering and screening, the mutation set which is effective in the data set and can be applied to subsequent screening and detection can be obtained.
The first culling sub-module 122 is configured to reject the gene sequence data when the short-read sequence support rate in normal tissue of the gene sequence data is lower than the second preset threshold, the sequencing depth in the tumor is lower than the third preset threshold, or the allele frequency in the tumor is lower than the fourth preset threshold.
In summary, the system filters out data with infrequent mutations to improve the effectiveness of the data.
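The removal conditions a to c listed above can be sketched as a single predicate. The `Variant` record and its field names are hypothetical. Because the description writes the depth cutoff as 5% without stating its unit, the sketch leaves `min_tumor_depth` as a required argument rather than guessing a value; the other two defaults are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    normal_support: int   # supporting short reads observed in normal tissue
    tumor_depth: float    # sequencing depth at the site in the tumor
    tumor_vaf: float      # variant allele frequency in the tumor


def keep_variant(v: Variant, *, min_tumor_depth: float,
                 normal_support_cutoff: int = 3,
                 min_tumor_vaf: float = 0.05) -> bool:
    """Return False if the variant matches any of removal conditions a to c."""
    if v.normal_support >= normal_support_cutoff:   # condition a
        return False
    if v.tumor_depth < min_tumor_depth:             # condition b
        return False
    if v.tumor_vaf < min_tumor_vaf:                 # condition c
        return False
    return True
```

Applying `keep_variant` to every record of the dataset leaves the mutation set that is passed on to the subsequent screening.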
Referring to fig. 9, a second sub-structure diagram of a system for generating a gene mapping table according to an embodiment of the invention is shown. The screening module 12 further includes: the association module 123, the second determination sub-module 124, the retention module 125, and the second culling module 126.
The association module 123 is configured to construct the gene sequence data into a 2×2 contingency table.
The second determination sub-module 124 is configured to judge whether the test value of each gene sequence data in the contingency table is smaller than a fifth preset threshold value.
The retention module 125 is configured to retain the corresponding gene sequence data when the test value is smaller than the fifth preset threshold value.
The second culling module 126 is configured to reject the corresponding gene sequence data when the test value is greater than or equal to the fifth preset threshold value.
Specifically, the fifth preset threshold value used in this embodiment is 0.05; that is, when the test value is less than 0.05, the mutation is confirmed as a recurrent somatic mutation. The test used in this embodiment is Fisher's exact test.
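The test on a 2×2 table can be sketched with the standard library alone. The function below computes a two-sided Fisher exact p-value from the hypergeometric distribution. The description does not say what the rows and columns of the table hold, so the cell layout `[[a, b], [c, d]]` and the name `fisher_exact_two_sided` are assumptions of this sketch.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(low, high + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)
```

A record would be retained when the returned value is below 0.05, for example with `fisher_exact_two_sided(8, 2, 1, 5)`, and rejected otherwise.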
Referring to fig. 10, a third sub-structure diagram of a system for generating a gene mapping table according to an embodiment of the invention is shown. The parsing module 13 includes: a reconstruction module 131 and a first sub-generation module 132.
The reconstruction module 131 is configured to combine the mutation data according to a preset type and generate a plurality of trinucleotide context matrices.
The first sub-generation module 132 is configured to generate a plurality of mutation data from the trinucleotide context matrix and the preset classification table.
Specifically, the R tool "sigminer" is used to combine, by cancer type, the 96 trinucleotide contexts generated for the samples into a 96×33 matrix; the 96×33 matrix is then decomposed into known COSMIC signatures, and the contribution of each mutational signature to each cancer type is determined.
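The construction of the trinucleotide context counts can be illustrated in Python. This sketch is not the R tool used in the embodiment: it only shows how a single-base substitution and its two flanking bases are mapped to one of the 96 channels, with purine references folded onto the pyrimidine strand, and how the counts are accumulated per cancer type. The function names and the input tuple layout are hypothetical.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# The 96 channels: 2 pyrimidine references x 3 alternatives x 16 flanking pairs.
CHANNELS = [f"{left}[{ref}>{alt}]{right}"
            for ref in "CT" for alt in "ACGT" if alt != ref
            for left in "ACGT" for right in "ACGT"]


def sbs96_channel(context: str, alt: str) -> str:
    """Map a trinucleotide context and alternative base to its channel label."""
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or alt == context[1]:
        raise ValueError("expected a 3-base context and a different alternative base")
    if context[1] in "AG":
        # Purine reference: describe the change on the reverse-complement strand.
        context = context.translate(_COMPLEMENT)[::-1]
        alt = alt.translate(_COMPLEMENT)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def count_by_cancer_type(mutations):
    """Accumulate channel counts from (cancer_type, context, alt) tuples."""
    counts = defaultdict(Counter)
    for cancer_type, context, alt in mutations:
        counts[cancer_type][sbs96_channel(context, alt)] += 1
    return counts
```

Arranging the counters of 33 cancer types as columns over `CHANNELS` gives a 96×33 matrix of the shape described above.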
Referring to FIG. 11, a fourth sub-structure diagram of a gene comparison table generation system according to an embodiment of the invention is shown. The gene comparison table generation system 1 further includes: a conversion module 18.
The conversion module 18 is configured to pre-process the somatic mutation dataset to obtain converted normalized data.
Specifically, smSNPs (somatic mutation single nucleotide polymorphisms) are identified from a mutation dataset (somatic dataset), and somatic mutation data is retrieved and collected. The mutation dataset is input in VCF format or a VCF-like format to form the normalized data.
The VCF format is a commonly used text file format for storing variant information in a genome, where VCF is an abbreviation of "Variant Call Format". This format is often used to describe single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and other types of genetic variation. A VCF file is mainly composed of the following parts:
Header: the lines beginning with "##", which describe the meta-information of the file, including the version of the file format, the reference genome, annotation descriptions of the various variant information, sample information, etc.
Column headers: one row after the header, beginning with a hash sign (#). It typically includes the following fields:
#CHROM: the chromosome number;
POS: the specific position (counted in base pairs) where the mutation occurs;
ID: the unique identifier of the variant, represented by a dot (.) if there is none;
REF: the sequence of the reference genome;
ALT: the variant sequence (possibly more than one);
QUAL: the quality score of the mutation detection;
FILTER: whether the site passes quality control;
INFO: an additional information field containing various information and annotations related to the mutation;
FORMAT: the format of the per-sample call details, followed by the sample information;
Data lines: each row represents one mutation in the genome. The data in a row corresponds to the fields defined in the column headers, and the fields are separated by tab characters. Within the INFO field, entries are typically separated by semicolons (;), and within the FORMAT field, by colons (:).
The VCF format effectively integrates mutation information and comprehensive annotations to the mutation and allows information sharing using standardized means. This format is widely used in bioinformatic analysis, genomic research and clinical genetic diagnosis.
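The layout of a VCF data line described above can be illustrated with a small parser. This is a minimal sketch using only the standard library; the function name `parse_vcf_line` and the keys of the returned dictionary are hypothetical, and header lines are assumed to have been skipped by the caller.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO entries are separated by ';'; an entry without '=' is a flag.
    info_map = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_map[key] = value if sep else True

    # FORMAT keys and per-sample values are both separated by ':'.
    samples = []
    if len(fields) > 9:
        keys = fields[8].split(":")
        samples = [dict(zip(keys, s.split(":"))) for s in fields[9:]]

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),          # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_map,
        "samples": samples,
    }
```

In use, lines that start with "#" are skipped and every remaining line of the file is passed to `parse_vcf_line`.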
Specifically, common SNP (single nucleotide polymorphism) sites are obtained from a known SNP data table and used to identify somatic mutation single nucleotide polymorphisms (smSNPs) in the TCGA somatic mutation dataset (somatic dataset); the smSNPs located in regulatory feature regions are then selected, so that the non-coding functional smSNPs in the TCGA mutation dataset are identified using the common SNP sites.
In detail, each module in the gene comparison table generation system 1 in the embodiment of the present invention adopts the same technical means as the method for generating the gene comparison table in fig. 1 to 5, and can generate the same technical effects, which are not described herein.
Please refer to fig. 12 in combination, which is a schematic diagram of a prognosis rehabilitation report generating system according to an embodiment of the present invention. The prognostic rehabilitation report generating system 2 includes: the personal information module 21, the comparison module 22, the first integration module 23, the second integration module 24, the third integration module 25, the calculation module 26 and the second generation module 27.
The personal information module 21 is used for acquiring personal gene information, pathology information and physiological information.
The comparison module 22 is configured to obtain comparison information by referring, according to the personal gene information, to the gene comparison table of any one of the above embodiments.
The first integration module 23 is configured to generate first integral information according to the pathology information and the comparison information.
Wherein the pathology information includes cancer type information.
The second integration module 24 is configured to generate second integral information according to the physiological information and the comparison information.
Wherein the physiological information includes gender and age information.
And a third integration module 25 for extracting palindromic information of the genetic information to generate third integration information.
The calculating module 26 is configured to calculate a total integral according to the first integral information, the second integral information and the third integral information.
The second generation module 27 is configured to generate a first prognosis rehabilitation report according to the total integral.
In detail, each module in the prognostic rehabilitation report generating system 2 in the embodiment of the present invention adopts the same technical means as the method for generating the prognostic rehabilitation report in fig. 6, and can produce the same technical effects, and will not be described here again.
The invention also discloses an electronic device 1000, please refer to fig. 13, fig. 13 is a schematic structural diagram of the electronic device according to the embodiment of the invention.
The electronic device 1000 may include at least one processor 100 and a memory 200 communicatively coupled to the at least one processor 100. The memory 200 stores a computer program executable by the at least one processor 100, and the computer program, when executed by the at least one processor 100, enables the at least one processor 100 to perform the method of generating a gene comparison table and/or the method of generating a prognosis rehabilitation report as described above.
The processor 100 may in some embodiments be formed by an integrated circuit, for example a single packaged integrated circuit, or by a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 100 is the control core (Control Unit) of the electronic device 1000; it connects the components of the entire electronic device 1000 using various interfaces and lines, and executes the functions of the electronic device 1000 and processes data by running or executing programs or modules (e.g., a gene comparison table generation program) stored in the memory 200 and calling data stored in the memory 200.
Further, the modules/units integrated with the electronic device 1000 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a stand alone product. The computer readable storage medium may be volatile or nonvolatile. For example, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium comprising a computer program executable by the processor 100 to perform a method of generating a gene mapping table and/or a method of generating a prognostic rehabilitation report as described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

Claims (10)

1. A method of generating a gene mapping table, the method comprising:
obtaining a number of gene sequence data from a normalized dataset;
screening the gene sequence data according to a first preset screening rule to obtain optimized gene sequence data;
analyzing the optimized gene sequence data according to the point location information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor;
judging whether the correlation degree corresponding to the mutation data is larger than or equal to a first preset threshold value;
when the correlation degree is greater than or equal to the first preset threshold value, marking the corresponding mutation data as effective mutation data;
when the correlation degree is smaller than the first preset threshold value, eliminating the corresponding mutation data;
forming a gene comparison table according to all the effective mutation data.
2. The method of generating a gene mapping table according to claim 1, wherein the gene sequence data includes short-read sequence support rate information in normal tissue, sequencing depth information in the tumor, and allele frequency information in the tumor, and wherein the screening the gene sequence data according to the first preset screening rule comprises:
judging whether the short-read sequence support rate information in normal tissue of the gene sequence data is lower than a second preset threshold value, whether the sequencing depth information in the tumor is lower than a third preset threshold value, or whether the allele frequency information in the tumor is lower than a fourth preset threshold value;
rejecting the gene sequence data when the short-read sequence support rate in normal tissue of the gene sequence data is lower than the second preset threshold value, the sequencing depth in the tumor is lower than the third preset threshold value, or the allele frequency in the tumor is lower than the fourth preset threshold value.
3. The method of generating a gene mapping table according to claim 1, wherein the screening the gene sequence data according to the first preset screening rule further comprises:
constructing the gene sequence data into a 2×2 contingency table;
judging whether the test value of each gene sequence data in the contingency table is smaller than a fifth preset threshold value;
when the test value is smaller than the fifth preset threshold value, retaining the corresponding gene sequence data;
when the test value is greater than or equal to the fifth preset threshold value, rejecting the corresponding gene sequence data.
4. The method of generating a gene mapping table according to claim 1, wherein analyzing the gene sequence according to the point location information of the gene sequence to obtain a plurality of mutation data comprises:
combining the mutation data according to a preset type and generating a plurality of trinucleotide context matrixes;
Generating a plurality of mutation data by the trinucleotide context matrix and a preset classification table.
5. The method of generating a gene mapping table of claim 1, wherein prior to obtaining a plurality of gene sequence data from a somatic mutation dataset, the method comprises:
pre-processing the somatic mutation dataset to obtain converted normalized data.
6. A method of generating a prognostic rehabilitation report, the method comprising:
acquiring personal gene information, pathological information and physiological information;
obtaining comparison information by referring, according to the personal gene information, to a gene comparison table generated by the method of any one of claims 1 to 5;
generating first integral information according to the pathological information and the comparison information;
generating second integral information according to the physiological information and the comparison information;
extracting palindromic information of the gene information to generate third integral information;
calculating total integral according to the first integral information, the second integral information and the third integral information;
and generating a prognosis rehabilitation report according to the total integral.
7. A gene mapping table generation system, characterized in that the gene mapping table generation system comprises:
the acquisition module is used for acquiring a plurality of gene sequence data from the somatic mutation data set;
the screening module is used for screening the gene sequence data according to a first preset screening rule to obtain the optimized gene sequence data;
the analysis module is used for analyzing the optimized gene sequence data according to the point position information of the gene sequence to obtain a plurality of mutation data, wherein each mutation data comprises a correlation degree corresponding to a first factor;
the first judging module is used for judging whether the correlation degree corresponding to the mutation data is larger than or equal to a first preset threshold value;
the marking module is used for marking the corresponding mutation data as effective mutation data when the correlation degree is greater than or equal to the first preset threshold value;
the rejecting module is used for rejecting the corresponding mutation data when the correlation degree is smaller than the first preset threshold value;
the first generation module is used for forming a gene comparison table according to all the effective mutation data.
8. A prognostic rehabilitation report generation system, wherein the prognostic rehabilitation report generation system comprises:
the personal information module is used for acquiring personal gene information, pathological information and physiological information;
the comparison module is used for obtaining comparison information by referring, according to the personal gene information, to a gene comparison table generated by the method of any one of claims 1 to 5;
the first integration module is used for generating first integration information according to the pathological information and the comparison information;
the second integration module is used for generating second integration information according to the physiological information and the comparison information;
The third integration module is used for extracting palindromic information of the gene information to generate third integration information;
the calculation module is used for calculating total integral according to the first integral information, the second integral information and the third integral information;
and the second generation module is used for generating a first prognosis rehabilitation report according to the total integral.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the method of generating a gene mapping table according to any one of claims 1 to 5 and/or the method of generating a prognostic rehabilitation report according to claim 6.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the method of generating a gene mapping table according to any one of claims 1 to 5 and/or the method of generating a prognostic rehabilitation report according to claim 6.
CN202410231123.XA 2024-02-29 2024-02-29 Method and system for generating gene comparison table and generating prognosis rehabilitation report Pending CN118197523A (en)


Publications (1)

Publication Number Publication Date
CN118197523A true CN118197523A (en) 2024-06-14


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination