CN116758989B - Breast cancer marker screening method and related device - Google Patents

Publication number
CN116758989B
Authority
CN
China
Prior art keywords
marker
methylation
breast cancer
markers
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310687187.6A
Other languages
Chinese (zh)
Other versions
CN116758989A (en
Inventor
陈明
崔哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Nebula Bioinformatics Technology Development Co ltd
Original Assignee
Harbin Nebula Bioinformatics Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Nebula Bioinformatics Technology Development Co ltd filed Critical Harbin Nebula Bioinformatics Technology Development Co ltd
Priority to CN202310687187.6A priority Critical patent/CN116758989B/en
Publication of CN116758989A publication Critical patent/CN116758989A/en
Application granted granted Critical
Publication of CN116758989B publication Critical patent/CN116758989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of cancer marker screening, and in particular to a breast cancer marker screening method and a related device. The invention compares a breast cancer sample with a reference sample, screens out the genomic variations of the breast cancer sample relative to the reference sample, and screens out the methylation variations of the breast cancer sample relative to the reference sample; these are the primary markers. The invention then performs transcriptome data analysis on the primary markers to screen out the markers that genuinely result from variation. The invention can thus screen markers of cancerous variation in breast cancer samples from both the genomic and the methylation perspectives, enabling a more comprehensive screening of the markers that lead to cancer.

Description

Breast cancer marker screening method and related device
Technical Field
The invention relates to the technical field of cancer marker screening, in particular to a breast cancer marker screening method and a related device.
Background
Breast cancer is one of the most common malignant tumors in women, and early detection and diagnosis can improve both the success rate of treatment and patient survival. As genome sequencing technology has developed, sequencing costs have fallen and sequencing accuracy has risen, allowing scientists to study the human genome and related disease mechanisms more deeply and broadly. The aim of such studies is to determine, from a population that already has breast cancer, the features that the population shares, that is, the features in which the cancer population has varied relative to the non-cancer population; these features are the markers. However, a single-omics study can hardly elucidate the pathogenesis of breast cancer in full. The prior art screens the markers that arise in breast cancer samples (ex vivo samples) relative to non-breast-cancer samples only from the genomic perspective (if such a marker appears in human tissue, it is used to judge whether cancer is present). If a breast cancer sample carries cancerous variation somewhere other than the genome, the prior art does not screen out the markers produced by that variation, so the markers it screens are relatively one-sided and insufficient for classifying cancerous lesions in humans.
In summary, the markers screened by the prior art are relatively one-sided.
Accordingly, there is a need for improvement and advancement in the art.
Disclosure of Invention
To solve the above technical problem, the invention provides a breast cancer marker screening method and a related device, which address the problem that the markers screened by the prior art are relatively one-sided.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a breast cancer marker screening method, comprising:
identifying each primary genomic marker generated by a breast cancer sample relative to a reference sample, the reference sample being a sample that is not cancerous, and the genomic markers being used to identify breast cancer at the genetic level;
identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample, the methylation markers being used to identify breast cancer at the methylation level;
and performing transcriptome data analysis on each primary genomic marker and each primary methylation marker respectively, and screening target genomic markers and target methylation markers from the primary markers, wherein the transcriptome data is used for recording all genome data and methylation data.
In one implementation, identifying each primary genomic marker generated by the breast cancer sample relative to the reference sample comprises:
applying a plurality of somatic mutation identification tools to the breast cancer sample and the reference sample to obtain the first genomic marker sets respectively identified by the tools, wherein the first genomic markers are genomic variants arising during the formation of somatic cells;
applying a plurality of copy number variation identification tools to the breast cancer sample and the reference sample to obtain the second genomic marker sets respectively identified by the tools, wherein the second genomic markers are genomic variants in gene copy number;
performing intersection processing on the first genomic marker sets to obtain the primary first genomic markers among the primary genomic markers;
and performing intersection processing on the second genomic marker sets to obtain the primary second genomic markers among the primary genomic markers.
In one implementation, identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample comprises:
applying the ChAMP tool to the breast cancer sample and the reference sample to identify the first differential methylation site set and the first differential methylation region set carried by the breast cancer sample;
applying the Bismark tool and the METHYLKIT tool to the breast cancer sample and the reference sample to identify the second differential methylation site set carried by the breast cancer sample;
applying the metilene tool to the breast cancer sample and the reference sample to identify the second differential methylation region set carried by the breast cancer sample;
performing intersection processing on the first differential methylation site set and the second differential methylation site set to obtain the primary differential methylation sites among the primary methylation markers;
and performing intersection processing on the first differential methylation region set and the second differential methylation region set to obtain the primary differential methylation regions among the primary methylation markers.
In one implementation, performing transcriptome data analysis on each primary genomic marker and each primary methylation marker respectively, and screening the target genomic markers and target methylation markers from the primary markers, comprises:
applying a feature counting tool to each primary genomic marker and each primary methylation marker respectively, to obtain the genomic significance features output by the tool from transcriptome data analysis of each primary genomic marker and the methylation significance features output by the tool from transcriptome data analysis of each primary methylation marker;
screening the target genomic markers from the primary genomic markers according to the genomic significance features of each primary genomic marker;
and screening the target methylation markers from the primary methylation markers according to the methylation significance features of each primary methylation marker.
In one implementation, the method further comprises:
and applying a lasso algorithm to each target genome marker and each target methylation marker to obtain each marker weight value, wherein the marker weight values comprise genome marker weight values and methylation marker weight values.
In one implementation, applying the lasso algorithm to each target genomic marker and each target methylation marker to obtain the marker weight values comprises:
assigning a cancer label to the breast cancer sample, assigning a marker label to each marker, and assigning a weight parameter to the weight corresponding to each marker;
constructing the objective function of the lasso algorithm from the cancer label, the marker labels, and the weight parameters;
taking the sum of the absolute values of the weight parameters as the limiting condition of the lasso algorithm, wherein the sum is smaller than a set value;
determining the parameter value corresponding to each weight parameter based on the objective function and the limiting condition;
and determining the weight value of each marker according to the parameter values.
In one implementation manner, the determining, based on the objective function and the limiting condition, each parameter value corresponding to each weight parameter includes:
and under the limiting condition, determining each parameter value corresponding to each weight parameter when the objective function takes the minimum value.
In a second aspect, an embodiment of the present invention further provides a screening device for a breast cancer marker, where the device includes the following components:
a genomic marker identification module for identifying each primary genomic marker generated by a breast cancer sample relative to a reference sample, wherein the reference sample is a sample that is not cancerous, and the genomic markers are used to identify breast cancer at the genetic level;
a methylation marker identification module for identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample, wherein the methylation markers are used to identify breast cancer at the methylation level;
and a marker screening module for performing transcriptome data analysis on each primary genomic marker and each primary methylation marker respectively, and screening target genomic markers and target methylation markers from the primary markers, wherein the transcriptome data is used for recording all genome data and methylation data.
In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a breast cancer marker screening program stored in the memory and capable of running on the processor, and when the processor executes the breast cancer marker screening program, the steps of the breast cancer marker screening method described above are implemented.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a breast cancer marker screening program is stored on the computer readable storage medium, and when the breast cancer marker screening program is executed by a processor, the steps of the breast cancer marker screening method are implemented.
Beneficial effects: the invention compares a breast cancer sample with a reference sample, screens out the genomic variations of the breast cancer sample relative to the reference sample (the primary genomic markers), and screens out the methylation variations of the breast cancer sample relative to the reference sample (the primary methylation markers). The invention then performs transcriptome data analysis on the primary markers to screen out the markers that genuinely result from variation (the target genomic markers and target methylation markers). The invention can thus screen markers of cancerous variation in breast cancer samples from both the genomic and the methylation perspectives, enabling a more comprehensive screening of the markers that lead to cancer.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a framework diagram of marker screening in an embodiment of the present invention;
FIG. 3 is a schematic block diagram of the internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is clearly and completely described below with reference to the examples and the drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Research shows that breast cancer is one of the most common malignant tumors in women, and that early detection and diagnosis can improve both the success rate of treatment and patient survival. As genome sequencing technology has developed, sequencing costs have fallen and sequencing accuracy has risen, allowing scientists to study the human genome and related disease mechanisms more deeply and broadly. The aim of such studies is to determine, from a population that already has breast cancer, the features that the population shares, that is, the features in which the cancer population has varied relative to the non-cancer population; these features are the markers. However, a single-omics study can hardly elucidate the pathogenesis of breast cancer in full. The prior art screens the markers that arise in breast cancer samples (ex vivo samples) relative to non-breast-cancer samples only from the genomic perspective (if such a marker appears in human tissue, it is used to judge whether cancer is present). If a breast cancer sample carries cancerous variation somewhere other than the genome, the prior art does not screen out the markers produced by that variation, so the markers it screens are relatively one-sided and insufficient for classifying cancerous lesions in humans.
To solve the above technical problem, the invention provides a breast cancer marker screening method and a related device, which address the problem that the markers screened by the prior art are relatively one-sided. In a specific implementation, each primary genomic marker and each primary methylation marker generated by the breast cancer sample relative to the reference sample are first identified; transcriptome data analysis is then performed on each of them, and the target genomic markers and target methylation markers are screened from the primary markers. The invention can therefore screen out markers that lead to canceration from both the genomic perspective and the methylation perspective, achieving comprehensive marker screening.
For example, the technical problem to be solved by the invention is how to identify, comprehensively and accurately, the markers that lead to breast cancer (that is, markers whose presence in human tissue indicates that the tissue has undergone cancerous variation). To solve this problem, the invention adopts the following technical scheme:
A breast cancer sample (tissue removed from a breast cancer patient) and a reference sample (a healthy sample removed from a human body) are collected, and the genomic differences of the breast cancer sample from the reference sample are identified. For example, the identified differential genomes may comprise genome A1, genome A2, genome A3, genome A4, and genome A5; these five are the primary genomic markers. Methylation that differs between the breast cancer sample and the reference sample is likewise identified; for example, methylation B1, methylation B2, and methylation B3 are three primary methylation markers. The five primary genomic markers and three primary methylation markers are not necessarily true markers that lead to breast cancer, so they need to be screened further.
Exemplary method
The breast cancer marker screening method of the embodiment can be applied to terminal equipment, and the terminal equipment can be a terminal product with an image acquisition function, such as a computer and the like. In this embodiment, as shown in fig. 1, the breast cancer marker screening method specifically includes the following steps:
S100, identifying each primary genomic marker generated by the breast cancer sample relative to a reference sample, wherein the reference sample is a sample which is not cancerous, and the genomic markers are used for identifying breast cancer from genes.
In this embodiment, the breast cancer sample and the reference sample are both derived from ex vivo tissue. The breast cancer sample is the sequence data corresponding to human tissue in which canceration has occurred (the sequence data represents human cell variation information), and the reference sample is the sequence data corresponding to human tissue in which canceration has not occurred. Taking the reference sample as the standard, the genetic differences of the breast cancer sample from the reference sample are determined; these differences are the markers.
In one embodiment, step S100 includes steps S101, S102, S103, S104 as follows:
S101, applying a plurality of somatic mutation identification tools to the breast cancer sample and the reference sample to obtain the first genomic marker sets respectively identified by the tools, wherein the first genomic markers are genomic variants arising during the formation of somatic cells.
Three identification tools are used: Mutect2 (which identifies somatic mutations using a local realignment approach similar to that of HaplotypeCaller), Strelka2 (which uses a Bayesian mixture model), and MuSE (which uses a Bayesian statistical model to identify somatic mutations by comparing allele frequencies in the tumor and normal samples, and estimates the number of mutated cells relative to the reference sample by considering factors such as mutation type, genomic location, and sequencing read coverage).
Each identification tool identifies a set of variant genomes $SNV_i$:
$SNV_i = \{ SNV_{ij} \mid 1 \le j \le V_i \}, \quad 1 \le i \le 3$
$SNV_i$ is the set of variant genomes (the first genomic marker set) identified by the i-th identification tool, $SNV_{ij}$ is the j-th variant genome identified by the i-th identification tool, and $V_i$ is the total number of variant genomes identified by the i-th identification tool.
S102, applying a plurality of copy number variation identification tools to the breast cancer sample and the reference sample to obtain the second genomic marker sets respectively identified by the tools, wherein the second genomic markers are genomic variants in gene copy number.
In one embodiment, three copy number variation identification tools are used: the mutect tool (which detects CNVs, i.e. second genomic markers, from the depth and average depth of the sequencing data), the CNVkit tool (which calculates the depth of each region, using windows or genes as regions), and a variation identification algorithm tool (a karyotype number algorithm and a whole-genome segmentation algorithm).
Each copy number variation identification tool identifies a set of variant genomes $CNV_i$:
$CNV_i = \{ CNV_{ij} \mid 1 \le j \le V_i' \}, \quad 1 \le i \le 3$
$CNV_i$ is the set of variant genomes (the second genomic marker set) identified by the i-th copy number variation identification tool, $CNV_{ij}$ is the j-th variant genome identified by the i-th tool, and $V_i'$ is the total number of variant genomes identified by the i-th tool.
In one embodiment, erroneous entries in $CNV_i$ are filtered out, and the identified CNVs are annotated and visualized. CNV-facets (based on likelihood ratio testing) can be used to estimate the probability of a CNV and to build a foreground-background model that describes the change in depth for CNV detection.
S103, performing intersection processing on the first genomic marker sets to obtain the primary first genomic markers among the primary genomic markers.
The first genomic marker sets identified by the different tools contain some markers in common and some that differ; the markers identified in common are the primary first genomic markers.
In one embodiment, the haplotype comparison tool provided by Illumina (which normalizes the coordinates of different variant representations and compares genotyping results) is used to screen for variants supported by at least two tools (i.e., to screen for genomes that at least two identification tools judge to be variant).
The primary first genomic markers constitute the set $SNV$:
$SNV = \{ snv_i \mid 1 \le i \le N \}$
$snv_i$ is the i-th primary first genomic marker, and $N$ is the total number of primary first genomic markers.
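The "supported by at least two tools" rule can be sketched with plain Python sets. This is a minimal illustration, not the patent's implementation: the variant keys `(chromosome, position, ref, alt)`, the example calls, and the function name `variants_supported_by` are all hypothetical, and real pipelines normalize variant representations before comparing them.

```python
from collections import Counter


def variants_supported_by(call_sets, min_tools=2):
    """Return the variants reported by at least `min_tools` of the callers."""
    support = Counter()
    for calls in call_sets:
        support.update(set(calls))  # each tool counts at most once per variant
    return {variant for variant, n in support.items() if n >= min_tools}


# Hypothetical calls, keyed as (chromosome, position, ref, alt).
mutect2 = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "T")}
strelka2 = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
muse = {("chr1", 100, "A", "T"), ("chr4", 400, "T", "G")}

# Relaxed rule from the embodiment: keep variants seen by two or more tools.
primary_snv = variants_supported_by([mutect2, strelka2, muse], min_tools=2)

# Strict intersection of all three sets, for comparison.
strict_snv = mutect2 & strelka2 & muse
```

The relaxed rule keeps more candidates than the strict intersection, which is why the later transcriptome screening step is still needed.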
S104, performing intersection processing on the second genomic marker sets to obtain the primary second genomic markers among the primary genomic markers.
The second genomic marker sets identified by the different copy number variation identification tools contain some markers in common and some that differ; the markers identified in common are the primary second genomic markers.
In one embodiment, bedtools is used to calculate the overlap between CNVs (second genomic markers); when the overlap is greater than 70%, the corresponding CNVs are merged. The variants are then filtered according to variant position and population data. This embodiment keeps the variants supported by at least two tools, finally obtaining a merged variant set.
That is, if the variant regions reported for a genome by two copy number variation identification tools overlap by more than 70%, the genome is determined to be a variant genome (a primary second genomic marker).
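The 70% overlap rule can be illustrated as follows. This is a sketch under stated assumptions: the text does not say which interval the 70% is measured against, so the code assumes a reciprocal overlap (each call must be more than 70% covered by the other), uses half-open `(chromosome, start, end)` intervals, and merges concordant calls into their union. The function names and coordinates are hypothetical, and the embodiment itself uses bedtools rather than hand-written code.

```python
def overlap_fraction(a, b):
    """Fraction of interval `a` covered by interval `b` (half-open coordinates)."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    return max(shared, 0) / (end_a - start_a)


def merge_if_concordant(a, b, threshold=0.70):
    """Merge two CNV calls when they overlap reciprocally by more than `threshold`."""
    if overlap_fraction(a, b) > threshold and overlap_fraction(b, a) > threshold:
        return (a[0], min(a[1], b[1]), max(a[2], b[2]))
    return None  # not concordant: no merged call


# Hypothetical CNV calls from three tools.
tool_1 = ("chr8", 1000, 2000)
tool_2 = ("chr8", 1100, 2100)
tool_3 = ("chr8", 1900, 4000)

merged = merge_if_concordant(tool_1, tool_2)      # 90% reciprocal overlap
not_merged = merge_if_concordant(tool_1, tool_3)  # only 10% of tool_1 is shared
```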
S200, identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample, wherein the methylation markers are used for identifying breast cancer from methylation.
Breast cancer can result not only from genomic variation but also from methylation variation. In one embodiment, identifying methylation variations includes steps S201 through S205 as follows:
S201, applying the ChAMP tool to the breast cancer sample and the reference sample to identify the first differential methylation site set DMC′ and the first differential methylation region set DMR′ carried by the breast cancer sample.
ChAMP is a methylation array analysis tool; it is used here for quality control, preprocessing, normalization, methylation site detection, and related analyses. Differential methylation analysis is then performed according to the sample grouping to identify differential methylation sites and differential methylation regions.
Here dmc′_i is the i-th first differential methylation site and dmr′_i is the i-th first differential methylation region; each set is indexed from 1 up to the total number of first differential methylation sites or first differential methylation regions, respectively.
S202, applying the Bismark tool and the METHYLKIT tool to the breast cancer sample and the reference sample to identify the second differential methylation site set $DMC_s$ carried by the breast cancer sample.
The Bismark tool performs alignment of bisulfite sequencing reads and methylation site identification, and the METHYLKIT tool performs differential methylation site identification.
S203, applying the metilene tool to the breast cancer sample and the reference sample to identify the second differential methylation region set $DMR_s$ carried by the breast cancer sample.
The metilene tool performs differential methylation region screening.
In steps S202 and S203, whole-genome bisulfite sequencing (WGBS) is performed on the collected breast cancer samples and control samples. Trim Galore is used to remove adapter and contaminating sequences and to filter out low-quality bases. Bismark is then used for sequence alignment, duplicate sequence marking, and methylation site identification. METHYLKIT is used to identify differential methylation sites, and metilene is used to screen differential methylation regions.
The sets $DMC_s$ and $DMR_s$ are indexed in the same way, from 1 up to the total number of differential sites and the total number of differential regions, respectively.
S204, performing intersection processing on the first differential methylation site set and the second differential methylation site set to obtain the primary differential methylation sites DMC among the primary methylation markers.
That is, the first and second differential methylation site sets have some differential methylation sites in common, and the sites common to both sets are the required primary differential methylation sites DMC.
$DMC = \{ dmc_i \mid 1 \le i \le N_c \}$
where $N_c$ is the total number of primary differential methylation sites.
S205, performing intersection processing on the first differential methylation region set and the second differential methylation region set to obtain the primary differential methylation regions DMR among the primary methylation markers.
That is, the first and second differential methylation region sets have some differential methylation regions in common, and the regions common to both sets are the required primary differential methylation regions DMR.
$DMR = \{ dmr_i \mid 1 \le i \le N_r \}$
where $N_r$ is the total number of primary differential methylation regions.
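The two intersections in S204 and S205 can be sketched as below. Sites are matched by exact coordinate; for regions the text only says "the same region", so treating two regions as the same where their coordinates overlap, and keeping the shared part, is an assumption of this sketch. All names and coordinates are hypothetical.

```python
def intersect_sites(first, second):
    """Sites (chromosome, position) present in both site sets."""
    return sorted(set(first) & set(second))


def intersect_regions(first, second):
    """Overlapping parts of regions (chromosome, start, end) from two region sets."""
    shared = []
    for chrom_a, start_a, end_a in first:
        for chrom_b, start_b, end_b in second:
            if chrom_a == chrom_b:
                start, end = max(start_a, start_b), min(end_a, end_b)
                if start < end:  # half-open intervals that really overlap
                    shared.append((chrom_a, start, end))
    return sorted(shared)


# Hypothetical array-based (first) and WGBS-based (second) results.
dmc_first = [("chr1", 1050), ("chr1", 2075), ("chr5", 900)]
dmc_second = [("chr1", 2075), ("chr5", 900), ("chr7", 30)]
dmr_first = [("chr1", 1000, 1500), ("chr2", 400, 800)]
dmr_second = [("chr1", 1200, 1800), ("chr3", 10, 50)]

primary_dmc = intersect_sites(dmc_first, dmc_second)
primary_dmr = intersect_regions(dmr_first, dmr_second)
```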
S300, performing transcriptome data analysis on each primary genomic marker and each primary methylation marker respectively, and screening target genomic markers and target methylation markers from the primary markers, wherein the transcriptome data is used for recording all genome data and methylation data.
Steps S100 and S200 screen out the primary genomic markers and the primary methylation markers. These primary markers are then screened again according to their significance features in order to obtain the true markers.
In one embodiment, step S300 includes steps S301, S302, S303 as follows:
S301, applying the feature counting tool FeatureCounts to each primary genomic marker and each primary methylation marker respectively, to obtain the genomic significance features FC and P output by the tool from transcriptome data analysis of each primary genomic marker, and the methylation significance features FC and P output by the tool from transcriptome data analysis of each primary methylation marker.
FC denotes the fold change, and P denotes the P value, a measure of statistical significance.
In one embodiment, the primary genomic markers and primary methylation markers in step S301 are preprocessed markers; the preprocessing is as follows:
Quality control of the sequencing data is assessed using FASTQC; adapter sequences are removed and low-quality bases are filtered using Trim Galore. The filtered primary genomic markers and primary methylation markers are aligned to the GRCh38 human reference genome using STAR, yielding the preprocessed markers.
In one embodiment, transcript counts are calculated using FeatureCounts and converted to TPM values. The data are then normalized, and differential genes are screened using DESeq2 to obtain the significance features FC and P.
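The count-to-TPM conversion mentioned above follows the standard definition: divide each count by the feature length in kilobases, then rescale so the values sum to one million. The sketch below illustrates that arithmetic only; the gene names, counts, and lengths are hypothetical, and the patent does not specify how its own conversion is implemented.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Reads per kilobase for each feature.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that all TPM values sum to 1e6.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical counts and feature lengths.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

tpm = counts_to_tpm(counts, lengths_bp)
```

In this example geneA and geneB receive the same TPM even though geneB has three times the reads, because geneB is also three times as long.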
S302, screening target genome markers from the primary genome markers according to the genome significance features FC and P of each primary genome marker.
A primary genome marker is taken as a target genome marker if its FC and P satisfy the following condition:
|logFC| > 1.5, P < 0.05
S303, screening target methylation markers (comprising methylation sites and methylation regions) from the primary methylation markers (comprising methylation sites and methylation regions) according to the methylation significance features FC and P of each primary methylation marker.
A primary methylation marker is taken as a target methylation marker if its FC and P satisfy the following condition:
|logFC| > 1.5, P < 0.05
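The threshold rule used in S302 and S303 can be expressed as a short filter. This is a minimal sketch: the `MarkerStat` record and its field names are hypothetical, and the logarithm base of logFC is not stated in the text (DESeq2 reports log2 fold changes, so base 2 is assumed here).

```python
from dataclasses import dataclass


@dataclass
class MarkerStat:
    name: str       # marker identifier (hypothetical field)
    log_fc: float   # log fold change, assumed log2
    p_value: float  # P value from the differential analysis


def select_target_markers(stats, log_fc_cutoff=1.5, p_cutoff=0.05):
    """Keep markers with |logFC| > cutoff and P < cutoff (both strict)."""
    return [s.name for s in stats
            if abs(s.log_fc) > log_fc_cutoff and s.p_value < p_cutoff]
```

The same function applies to genome markers and to methylation markers (sites or regions), since both use identical cutoffs.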
S400, determining a marker weight value.
Step S300 determines the target genome markers and target methylation markers (including methylation sites and methylation regions). The weight of each marker then needs to be determined, so that isolated human tissue can be classified according to the markers and their weights, and a doctor can judge from the classification result whether cancerous variation has occurred.
In one embodiment, the weights of the individual markers are determined using a Lasso model (Lasso algorithm model). This embodiment includes steps S401 to S405 as follows:
S401, assigning a cancer label Y_i to each breast cancer sample, a marker label X_ij to each marker, and a weight parameter β_j to the weight corresponding to each marker.
The marker label X_ij is the label of the j-th marker in the i-th breast cancer sample, where the markers comprise the target genome markers and the target methylation markers. Here, a "label" is a numerical value: one for the cancer and one for each marker.
S402, constructing an objective function of the lasso algorithm from the cancer labels, the marker labels, and the weight parameters.
In the objective function, N represents the number of breast cancer samples, p represents the total number of markers, and λ is the Lasso model parameter (a constant).
S403, taking the sum of the absolute values |β_j| of the weight parameters as the limiting condition of the lasso algorithm, wherein the sum is smaller than a set value τ.
S404, determining the parameter value of each weight parameter based on the objective function and the limiting condition.
S405, determining the weight value of each marker according to the parameter values.
The values of β_j that satisfy equations (1) and (2) are the marker weight values to be solved.
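For reference, the Lasso objective function and limiting condition are conventionally written as follows, using the symbols defined above. This is the standard textbook form and is given here as an assumption: the scaling factor 1/(2N), the omission of an intercept term, and the use of ≤ in the constraint are illustrative choices rather than details confirmed by the text.

```latex
\hat{\beta} = \arg\min_{\beta}\left\{ \frac{1}{2N}\sum_{i=1}^{N}\Bigl(Y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \right\} \tag{1}

\sum_{j=1}^{p}\lvert\beta_j\rvert \le \tau \tag{2}
```

The penalized form in (1) and the constrained form in (2) describe the same family of solutions: each value of λ corresponds to some bound τ on the sum of absolute weights.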
In one embodiment, each target genomic marker and each target methylation marker is further screened according to the marker weight value, and markers with weight values equal to zero are removed to obtain the final markers.
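The weight determination and the removal of zero-weight markers can be sketched as follows. The text does not specify a solver; this sketch assumes cyclic coordinate descent with soft-thresholding, a common way to fit the penalized Lasso form, and all function and variable names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2N)) * sum_i (y_i - sum_j X_ij*b_j)^2 + lam * sum_j |b_j|.

    X is a list of N rows with p marker values each; y holds the N labels.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of marker j with the partial residual
            z = 0.0    # mean squared value of marker j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


def keep_nonzero_markers(names, beta, tol=1e-12):
    """Drop markers whose fitted weight is zero, keeping the final markers."""
    return [name for name, b in zip(names, beta) if abs(b) > tol]
```

Because the soft-threshold step sets small coefficients exactly to zero, markers that contribute little to the fit drop out, which is what makes the weight-based second screening possible.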
In one embodiment, as shown in FIG. 2, multi-omics techniques are used to mine and screen breast cancer markers, and breast-cancer-specific markers are screened on the basis of the features of the different omics; meanwhile, breast cancer markers (biomarkers) with the characteristics of the national population are screened using large-scale national population data as the background; finally, the multi-omics markers obtained by screening are quantified, a robust classification model is constructed, and the screening effectiveness of the selected markers is evaluated. The final markers can be used for precise diagnosis and treatment of breast cancer in population cohorts.
In summary, the invention compares the breast cancer sample with the reference sample, screens out the genomic variations of the breast cancer sample relative to the reference sample (denoted primary genome markers), and screens out the methylation variations of the breast cancer sample relative to the reference sample (denoted primary methylation markers). The invention then performs transcriptome data analysis on the primary markers to select from them the markers that genuinely arise from variation (the target genome markers and target methylation markers). As this analysis shows, the invention screens breast cancer samples for cancer-variation markers on the basis of both genomics and methylation, and can therefore screen the markers that cause cancer more comprehensively.
Exemplary apparatus
This embodiment also provides a breast cancer marker screening device, which comprises the following components:
a genome marker identification module, used for identifying each primary genome marker generated by a breast cancer sample relative to a reference sample, wherein the reference sample is a sample that is not cancerous;
a methylation marker identification module, used for identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample;
a marker screening module, used for performing transcriptome data analysis on each primary genome marker and each primary methylation marker, respectively, and screening target genome markers and target methylation markers from the primary markers, wherein the transcriptome data records all genome data and methylation data.
Based on the above embodiment, the present invention also provides a terminal device, a functional block diagram of which may be as shown in FIG. 3. The terminal device comprises a processor, a memory, a network interface, a display screen and a temperature sensor, which are connected through a system bus. The processor of the terminal device provides computing and control capabilities. The memory of the terminal device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the terminal device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the breast cancer marker screening method. The display screen of the terminal device may be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor is built into the terminal device and is used for detecting the operating temperature of the internal components.
It will be appreciated by persons skilled in the art that the functional block diagram shown in fig. 3 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or fewer components than shown, or may combine some of the components, or may have a different arrangement of components.
In one embodiment, a terminal device is provided, the terminal device including a memory, a processor, and a breast cancer marker screening program stored in the memory and executable on the processor, the processor implementing the following operating instructions when executing the breast cancer marker screening program:
Identifying each of the primary genomic markers generated by the breast cancer sample relative to a reference sample, the reference sample being a sample that is not cancerous;
Identifying each primary methylation marker generated by the breast cancer sample relative to a reference sample;
Performing transcriptome data analysis on each primary genome marker and each primary methylation marker, respectively, and screening target genome markers and target methylation markers from the primary markers, wherein the transcriptome data records all genome data and methylation data.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for screening a breast cancer marker, comprising:
Identifying each of the primary genomic markers generated by the breast cancer sample relative to a reference sample, the reference sample being a sample that is not cancerous;
Identifying each primary methylation marker generated by the breast cancer sample relative to a reference sample;
Respectively carrying out transcriptome data analysis on each primary genome marker and each primary methylation marker, and screening target genome markers and target methylation markers from each genome marker, wherein the transcriptome data are used for recording all genome data and methylation data;
Further comprises:
and applying a lasso algorithm to each target genome marker and each target methylation marker to obtain each marker weight value, wherein the marker weight values comprise genome marker weight values and methylation marker weight values.
2. The breast cancer marker screening method of claim 1, wherein identifying each of the primary genomic markers produced by a breast cancer sample relative to a reference sample, the reference sample being a sample that is not cancerous, comprises:
applying a plurality of somatic mutation identification tools to a breast cancer sample and a reference sample to obtain first genome marker sets respectively identified by the somatic mutation identification tools, wherein the first genome markers are genomes mutated in the process of forming somatic cells;
applying a plurality of copy number variation identification tools to the breast cancer sample and the reference sample to obtain second genome marker sets respectively identified by the copy number variation identification tools, wherein the second genome markers are genomes with genes mutated in the replication process;
performing intersection processing on the first genome marker sets to obtain the primary first genome markers among the primary genome markers;
and performing intersection processing on the second genome marker sets to obtain the primary second genome markers among the primary genome markers.
3. The breast cancer marker screening method of claim 1, wherein said identifying each primary methylation marker generated in a breast cancer sample relative to a reference sample comprises:
applying the ChAMP tool to a breast cancer sample and a reference sample, and identifying each first differential methylation site set and each first differential methylation region set carried by the breast cancer sample;
applying the Bismark tool and the methylKit tool to the breast cancer sample and the reference sample, and identifying each second differential methylation site set carried by the breast cancer sample;
applying the metilene tool to the breast cancer sample and the reference sample, and identifying each second differential methylation region set carried by the breast cancer sample;
Performing intersection processing on each first differential methylation site set and each second differential methylation site set to obtain each primary differential methylation site in each primary methylation marker;
and performing intersection processing on each first differential methylation region set and each second differential methylation region set to obtain each primary differential methylation region in each primary methylation marker.
4. The breast cancer marker screening method of claim 1, wherein said performing a transcriptome data analysis on each of said primary genomic markers and each of said primary methylation markers, respectively, screens each of said genomic markers for a target genomic marker and a target methylation marker, said transcriptome data being used to record all genomic data and methylation data, comprises:
applying a feature counting tool to each of said primary genome markers and each of said primary methylation markers, respectively, to obtain genome significance features output by said feature counting tool from the transcriptome data analysis of each of said primary genome markers, and methylation significance features output by said feature counting tool from the transcriptome data analysis of each of said primary methylation markers;
screening target genome markers from the primary genome markers according to the genome significance features of each of the primary genome markers;
and screening target methylation markers from the primary methylation markers according to the methylation significance features of each of the primary methylation markers.
5. The breast cancer marker screening method of claim 1, wherein said applying a lasso algorithm to each of said target genomic markers and each of said target methylation markers results in each marker weight value, said marker weight values comprising genomic marker weight values and methylation marker weight values, comprising:
assigning a cancer label to the breast cancer sample, assigning a marker label to each marker, and assigning a weight parameter to the weight corresponding to each marker;
Constructing an objective function of the lasso algorithm with the cancer tag and each of the marker tags and each of the weight parameters;
Taking the sum of the absolute values of the weight parameters as a limiting condition of the lasso algorithm, wherein the sum of the absolute values of the weight parameters is smaller than a set value;
determining each parameter value corresponding to each weight parameter based on the objective function and the limiting condition;
And determining the weight value of each marker according to each parameter value.
6. The breast cancer marker screening method according to claim 5, wherein said determining respective parameter values corresponding to respective weight parameters based on said objective function and said limiting condition comprises:
and under the limiting condition, determining each parameter value corresponding to each weight parameter when the objective function takes the minimum value.
7. A breast cancer marker screening device, comprising the following components:
the genome marker identification module is used for identifying each primary genome marker generated by a breast cancer sample relative to a reference sample, wherein the reference sample is a sample which is not cancerous;
a methylation marker identification module, used for identifying each primary methylation marker generated by the breast cancer sample relative to the reference sample;
A marker screening module, configured to perform transcriptome data analysis on each of the primary genome markers and each of the primary methylation markers, and screen a target genome marker and a target methylation marker from each of the genome markers, where the transcriptome data is used to record all genome data and methylation data;
Further comprises:
and applying a lasso algorithm to each target genome marker and each target methylation marker to obtain each marker weight value, wherein the marker weight values comprise genome marker weight values and methylation marker weight values.
8. A terminal device comprising a memory, a processor and a breast cancer marker screening program stored in the memory and executable on the processor, when executing the breast cancer marker screening program, performing the steps of the breast cancer marker screening method according to any one of claims 1-6.
9. A computer readable storage medium, wherein a breast cancer marker screening program is stored on the computer readable storage medium, which, when executed by a processor, implements the steps of the breast cancer marker screening method according to any one of claims 1-6.
CN202310687187.6A 2023-06-09 2023-06-09 Breast cancer marker screening method and related device Active CN116758989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310687187.6A CN116758989B (en) 2023-06-09 2023-06-09 Breast cancer marker screening method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310687187.6A CN116758989B (en) 2023-06-09 2023-06-09 Breast cancer marker screening method and related device

Publications (2)

Publication Number Publication Date
CN116758989A CN116758989A (en) 2023-09-15
CN116758989B true CN116758989B (en) 2024-04-30

Family

ID=87947138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310687187.6A Active CN116758989B (en) 2023-06-09 2023-06-09 Breast cancer marker screening method and related device

Country Status (1)

Country Link
CN (1) CN116758989B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830045A (en) * 2018-06-29 2018-11-16 深圳先进技术研究院 A kind of biomarker screening system method based on multiple groups
CN111378754A (en) * 2020-04-23 2020-07-07 嘉兴市第一医院 TCGA (TCGA-based genetic algorithm) database-based breast cancer methylation biomarker and screening method thereof
CN111440869A (en) * 2020-03-16 2020-07-24 武汉百药联科科技有限公司 DNA methylation marker for predicting primary breast cancer occurrence risk and screening method and application thereof
CN114171115A (en) * 2021-11-12 2022-03-11 深圳吉因加医学检验实验室 Differential methylation region screening method and device thereof
CN115820860A (en) * 2022-11-23 2023-03-21 华中农业大学 Method for screening non-small cell lung cancer marker based on methylation difference of enhancer, marker and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Integrative analysis of DNA methylation and gene expression profiles identified potential breast cancer-specific diagnostic";Xinhua Liu etc.;《BIOSCIENCE REPORTS》;20200527;第40卷(第5期);全文 *

Also Published As

Publication number Publication date
CN116758989A (en) 2023-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant