CN114093424A

CN114093424A - Method, device, equipment and storage medium for screening and processing lesion specific data

Info

Publication number: CN114093424A
Application number: CN202111439901.7A
Authority: CN
Inventors: 陈澍宜
Original assignee: Zhushi Biotechnology Suzhou Co ltd
Current assignee: Zhushi Biotechnology Suzhou Co ltd
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2022-02-25

Abstract

The invention provides a method, a device, equipment and a storage medium for screening and processing lesion specific data, wherein the method comprises the following steps: preprocessing lesion sample data and control sample data to obtain a first lesion sample read data set and a first control sample read data set; searching for target differentially methylated regions between the first lesion sample read data set and the first control sample read data set based on the two sets, and obtaining a second lesion sample read data set and a second control sample read data set which represent each target differentially methylated region; calculating a methylation continuity score for each second lesion sample read data and each second control sample read data, and screening the lesion specific data from the second lesion sample read data set with the second control sample read data set as a reference and according to the methylation continuity score. The invention helps improve the training quality and the recognition accuracy of a neural network.

Description

Method, device, equipment and storage medium for screening and processing lesion specific data

Technical Field

The invention relates to the technical field of biological information, in particular to a method, a device, equipment and a storage medium for screening and processing lesion specific data.

Background

Read data obtained by high-throughput DNA methylation sequencing of diseased tissue obtained in clinical work (e.g., by surgical resection) is often contaminated with a significant proportion of data from non-diseased cells, which makes downstream data analysis noisy. For example, DNA from diseased tissue and DNA from normal cells are each subjected to high-throughput DNA methylation sequencing, the data from the two sources are labeled separately and used to train a neural network, and the trained neural network is then expected to automatically identify sequencing data from plasma circulating DNA (cfDNA) and to distinguish whether new read data comes from diseased cells or normal cells.

However, during training, the sequencing data from diseased tissue DNA comes not only from diseased cells but is also mixed with a considerable amount of data from normal cells, which introduces noise and confusion into the training of the neural network and reduces both its training quality and its final recognition accuracy.

Disclosure of Invention

The invention provides a method, a device, equipment and a storage medium for screening and processing lesion specific data, which are used for solving the problem that, in the prior art, the sequencing data of lesion tissue DNA introduces noise and confusion into the training of a neural network.

In a first aspect, an embodiment of the present invention provides a method for screening and processing lesion specific data, where the method includes:

acquiring lesion sample data and control sample data, and preprocessing the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set, wherein each first lesion sample read data and each first control sample read data represent a base arrangement sequence on a DNA sequence and contain methylation state information of each cytosine on the DNA sequence;

searching for at least one target differential methylation region between the first lesion sample read data set and the first control sample read data set based on the first lesion sample read data set and the first control sample read data set, and obtaining a second lesion sample read data set and a second control sample read data set which represent each target differential methylation region;

calculating a methylation continuity score for each second lesion sample read data and second control sample read data, and screening lesion specific data from the second lesion sample read data set with the second control sample read data set as a reference and according to the methylation continuity score.

In an embodiment of the present invention, the method further includes:

and after labeling the screened lesion specific data, training a preset neural network so that the trained neural network can predict the probability of lesion occurrence based on the input sample.

In an embodiment of the present invention, the preprocessing the lesion sample data and the control sample data includes:

dividing the lesion sample data and the control sample data into a training set and a testing set, wherein the sample data in the training set is used for screening out the lesion specific data and training a neural network, and the sample data in the testing set is used for testing the trained neural network;

wherein the lesion sample data comprises DNA data of lesion tissue from a preset number of patients, and the control sample data comprises DNA data of non-lesion tissue adjacent to the lesion tissue of the preset number of patients and cfDNA data of a preset number of healthy people and of high risk people.

In an embodiment of the present invention, the obtaining of the lesion sample data and the control sample data, and the preprocessing the lesion sample data and the control sample data to obtain the first lesion sample reading data set and the first control sample reading data set further includes:

breaking the DNA molecules of each sample into DNA fragments of a preset length, and performing enzymatic or bisulfite conversion on the DNA fragments respectively to construct a DNA library;

performing genome-wide methylation sequencing on the DNA library to obtain the first lesion sample read data set and a first control sample read data set.

In one embodiment of the present invention, said performing genome-wide methylation sequencing on said DNA library to obtain said first lesion sample read data set and said first control sample read data set comprises:

performing whole genome methylation sequencing on the DNA library, and performing data filtration on a sequencing result based on a preset rule;

and aligning the filtered data to a preset reference genome and identifying the methylation state to obtain the first lesion sample read data set and the first control sample read data set.

In an embodiment of the invention, the finding at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the two sets and obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region comprises:

searching for differentially methylated regions between the first lesion sample read data corresponding to the lesion tissues of the preset number of patients and the first control sample read data corresponding to the non-lesion tissues of those patients, to obtain a first set;

searching for differentially methylated regions between the first lesion sample read data corresponding to the lesion tissues of the preset number of patients and the first control sample read data corresponding to the DNA data of the preset number of high risk people, to obtain a second set;

taking the intersection of the first set and the second set as the target differentially methylated regions;

obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region according to the target differentially methylated regions.

In one embodiment of the present invention, the calculating the methylation continuity score of each of the second lesion sample read data and the second control sample read data, and the screening of lesion specific data from the second lesion sample read data by using the second control sample read data as a reference and according to the methylation continuity score comprises:

calculating the methylation continuity score of each second lesion sample read data and the methylation continuity score of each read data in the second control sample read data set corresponding to the healthy population respectively, and marking the maximum value and the minimum value of the methylation state continuity scores of all second control sample read data corresponding to the healthy population as Smax and Smin respectively;

if the target differentially methylated region is a hypomethylated region, removing from the second lesion sample read data set the read data whose methylation continuity score is greater than or equal to Smin;

and if the target differentially methylated region is a hypermethylated region, removing from the second lesion sample read data set the read data whose methylation continuity score is less than or equal to Smax.
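The removal rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_lesion_reads`, the `"hypo"`/`"hyper"` labels and the example scores are hypothetical and do not come from the source; the scores are assumed to have been computed already.

```python
from typing import List


def filter_lesion_reads(lesion_scores: List[float],
                        healthy_scores: List[float],
                        region_type: str) -> List[float]:
    """Keep lesion-read scores that fall outside the healthy-control score range."""
    if not healthy_scores:
        raise ValueError("need at least one healthy control score")
    s_min = min(healthy_scores)  # Smin in the text
    s_max = max(healthy_scores)  # Smax in the text
    if region_type == "hypo":
        # Hypomethylated region: remove lesion reads scoring >= Smin.
        return [s for s in lesion_scores if s < s_min]
    if region_type == "hyper":
        # Hypermethylated region: remove lesion reads scoring <= Smax.
        return [s for s in lesion_scores if s > s_max]
    raise ValueError("region_type must be 'hypo' or 'hyper'")


# Made-up scores, for illustration only.
healthy = [0.2, 0.35, 0.5]
lesion = [0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0]
hypo_kept = filter_lesion_reads(lesion, healthy, "hypo")    # [0.0, 0.1]
hyper_kept = filter_lesion_reads(lesion, healthy, "hyper")  # [0.8, 1.0]
```

In both cases the retained lesion reads are those whose scores no healthy control read reaches, which is what makes them candidates for lesion specific data.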

In one embodiment of the present invention, said calculating a methylation continuity score for each of the second lesion sample read data and the second control sample read data, and screening lesion specific data from the second lesion sample read data set with the second control sample read data set as a reference and according to the methylation continuity score further comprises:

calculating the methylation continuity score of each second lesion sample read data and the methylation continuity score of each read data corresponding to the high risk group in the second control sample read data set, and sorting the methylation continuity scores of all the second control sample read data corresponding to the high risk group from high to low;

if the target differentially methylated region is a hypomethylated region, removing a preset percentage of the lowest-ranked read data, by methylation continuity score, from the second control sample read data set corresponding to the high risk group, marking the minimum methylation continuity score of the remaining read data as Smin, and removing from the second lesion sample read data set the read data whose methylation continuity score is greater than or equal to Smin;

if the target differential methylation region is a hypermethylation region, removing a preset percentage of read data with methylation state continuity scores ranked ahead in the second control sample read data set corresponding to the high risk group, marking the maximum value of the methylation state continuity scores of the rest read data as Smax, and removing the read data with the methylation state continuity score smaller than or equal to Smax in the second lesion sample read data set.
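A minimal Python sketch of the trimmed thresholds described above follows. The names `trimmed_threshold` and `trim_percent`, the rounding rule (integer floor of the trimmed count) and the example scores are assumptions made for illustration; the source only says that a preset percentage is removed.

```python
from typing import List


def trimmed_threshold(high_risk_scores: List[float],
                      region_type: str,
                      trim_percent: int) -> float:
    """Return Smin (hypo) or Smax (hyper) after trimming the high-risk control scores."""
    ranked = sorted(high_risk_scores, reverse=True)  # high to low
    n_trim = len(ranked) * trim_percent // 100       # assumed rounding: floor
    if n_trim >= len(ranked):
        raise ValueError("trimming would remove every control read")
    if region_type == "hypo":
        # Drop the lowest-ranked scores, then take the minimum of the rest.
        return min(ranked[:len(ranked) - n_trim])
    if region_type == "hyper":
        # Drop the highest-ranked scores, then take the maximum of the rest.
        return max(ranked[n_trim:])
    raise ValueError("region_type must be 'hypo' or 'hyper'")


# Made-up scores, for illustration only.
high_risk = [0.9, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.0]
lesion = [0.0, 0.05, 0.1, 0.5, 0.7, 1.0]

s_min = trimmed_threshold(high_risk, "hypo", 10)   # 0.1
s_max = trimmed_threshold(high_risk, "hyper", 10)  # 0.6
hypo_kept = [s for s in lesion if s < s_min]       # [0.0, 0.05]
hyper_kept = [s for s in lesion if s > s_max]      # [0.7, 1.0]
```

Compared with using the raw minimum and maximum, trimming makes the threshold less strict, so a few extreme reads in the high risk group do not cause every lesion read to be removed.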

In one embodiment of the present invention, said screening lesion specific data from said second set of lesion sample reads with said second set of control sample reads as a reference and according to said methylation continuity score further comprises:

after the removing operation, calculating an index value associated with each target differential methylation region according to a preset index function and the read data in all the second lesion sample read data sets corresponding to the target differential methylation region;

sorting all the index values from large to small, and selecting the target differentially methylated regions associated with a preset number of top-ranked index values as targeted differentially methylated regions;

determining read data in the second lesion sample read data set after the removal operation corresponding to each targeted differentially methylated region as the lesion specific data.
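The ranking step above can be sketched as follows. The source does not define the preset index function, so this sketch uses the number of retained reads per region purely as a hypothetical stand-in; `select_top_regions`, `top_k` and the region labels are likewise illustrative names, not part of the source.

```python
from typing import Callable, Dict, List


def select_top_regions(retained: Dict[str, List[float]],
                       index_fn: Callable[[List[float]], float],
                       top_k: int) -> Dict[str, List[float]]:
    """Rank regions by index value (large to small) and keep the top_k regions."""
    index_values = {region: index_fn(scores) for region, scores in retained.items()}
    ranked = sorted(index_values, key=lambda r: index_values[r], reverse=True)
    return {region: retained[region] for region in ranked[:top_k]}


# Scores of lesion reads that survived the removal operation, per region (made up).
retained_reads = {
    "chr1:1000-1500": [0.9, 1.0, 0.8],
    "chr2:200-700": [0.7],
    "chr3:50-400": [1.0, 0.9],
}

# Hypothetical index function: the count of retained reads in the region.
selected = select_top_regions(retained_reads, index_fn=len, top_k=2)
selected_regions = list(selected)  # ["chr1:1000-1500", "chr3:50-400"]
```

Any other index function with the same signature can be passed in without changing the ranking logic.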

In a second aspect, the embodiments of the present invention further provide a device for screening and processing lesion-specific data, the device including:

a preprocessing module for acquiring lesion sample data and control sample data, and preprocessing the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set, wherein each first lesion sample read data and each first control sample read data represent a base arrangement sequence on a DNA sequence and contain methylation state information of each cytosine on the DNA sequence;

a search module for searching for at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the two sets, and obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region;

and the screening module is used for calculating the methylation continuity score of each second lesion sample read data and the second control sample read data, taking the second control sample read data set as a reference and screening out lesion specific data from the second lesion sample read data set according to the methylation continuity score.

The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for screening and processing lesion specific data according to any one of the above first aspect.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for lesion specific data screening and processing according to any of the above first aspects.

According to the method, the device, the equipment and the storage medium for screening and processing lesion specific data, differentially methylated regions are searched for between a first lesion sample read data set and a first control sample read data set, and a second lesion sample read data set and a second control sample read data set which represent each target differentially methylated region are obtained. With the second control sample read data set as a reference, lesion specific data are screened from the second lesion sample read data set according to the methylation continuity scores. The lesion specific data are then labeled and used to train a neural network, so that the training quality and the final recognition accuracy of the neural network are improved.

Drawings

In order to illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly described below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method for screening and processing lesion-specific data according to an embodiment of the present invention;

FIG. 2 is a graphical illustration of the methylation state continuity score for each piece of read data provided by an embodiment of the present invention;

FIG. 3(a) is one of the schematic illustrations of hypomethylated and hypermethylated regions provided by embodiments of the present invention;

FIG. 3(b) is a second schematic diagram of the hypomethylated region and the hypermethylated region provided by the embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a lesion-specific data screening and processing device according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein.

The technical background related to the invention is described as follows:

Diseased tissue obtained in clinical work (e.g., by surgical resection) often contains not only diseased cells but also non-diseased cells. The non-diseased cells mixed into the diseased tissue mainly include non-diseased tissue cells, immune inflammatory cells, stromal cells, vascular endothelial cells, and the like. The proportion of diseased cells in the diseased tissue is the lesion purity. For example, lesion tissue specimens obtained by standard surgical procedures typically have a lesion purity of less than 70%.

Although algorithms for calculating the lesion purity exist at present, the algorithms can only estimate the overall possible lesion purity according to model assumptions, and an efficient noise reduction method for screening lesion cell specific data from DNA methylation sequencing data of lesion tissues does not exist.

To solve the prior-art problem that sequencing data of lesion tissue DNA introduces noise and confusion into the training of a neural network, the invention provides a method, a device, equipment and a storage medium for screening and processing lesion specific data. A first lesion sample read data set and a first control sample read data set are obtained, and differentially methylated regions between the two sets are found, yielding a second lesion sample read data set and a second control sample read data set that represent each target differentially methylated region. Lesion specific data are then screened from the second lesion sample read data set, with the second control sample read data set as a reference and according to the methylation continuity score. Finally, the lesion specific data are labeled and used to train a neural network, which improves the training quality and the final recognition accuracy of the neural network.

The method, device, apparatus and storage medium for screening and processing lesion specific data provided by the embodiments of the present invention are described below with reference to fig. 1 to 5.

FIG. 1 is a schematic flow chart of a method for screening and processing lesion-specific data according to an embodiment of the present invention. As shown in FIG. 1, an embodiment of the invention provides a method for screening and processing lesion specific data, which comprises the following steps:

Step 101, obtaining lesion sample data and control sample data, and preprocessing the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set, wherein each first lesion sample read data and each first control sample read data represent a base arrangement sequence on a DNA sequence and contain methylation state information of each cytosine on the DNA sequence.

Step 102, finding at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the two sets, and obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region.

Step 103, calculating methylation continuity scores of each second lesion sample read data and second control sample read data, and screening out lesion specific data from the second lesion sample read data set by using the second control sample read data set as a reference and according to the methylation continuity scores.

Illustratively, after the step 103, the method further comprises:

Step 104, labeling the screened lesion specific data and using it to train a preset neural network, so that the trained neural network can predict the probability of lesion occurrence based on an input sample.

The following describes the steps 101 to 104.

In the step 101, the preprocessing the lesion sample data and the control sample data includes:

Step 1011, dividing the lesion sample data and the control sample data into a training set and a testing set, wherein the sample data in the training set is used for screening out the lesion specific data and training a neural network, and the sample data in the testing set is used for testing the trained neural network.

Wherein the lesion sample data comprises DNA data of lesion tissues from a preset number of patients, and the control sample data comprises DNA data of non-lesion tissues beside the lesion tissues of the preset number of patients, DNA data of a preset number of healthy people and high risk people.

Illustratively, the training set includes paired tissue samples of diseased and non-diseased tissue from a predetermined number of cancer patients (e.g., 50 esophageal cancer patients), and blood samples from a predetermined number (e.g., 20) of healthy people and a predetermined number (e.g., 50) of high risk people; the test set includes blood samples from a predetermined number of high risk people and a predetermined number of cancer patients.

The blood samples (including blood samples of healthy people and blood samples of high risk people) are centrifuged to obtain the corresponding plasma samples, namely plasma samples of healthy people and plasma samples of high risk people.

The above-mentioned tissue samples (including the tissue samples of diseased tissue of cancer patients and the tissue samples of non-diseased tissue paired with the diseased tissue) and plasma samples are stored separately; for example, the tissue samples and the plasma samples are both stored at -80 ℃.

Exemplarily, in the step 101, the preprocessing the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set further includes:

Step 1011, the DNA molecules of each sample are broken into DNA fragments of a predetermined length, and the DNA fragments are converted enzymatically or with bisulfite, respectively, to construct a DNA library.

The DNA molecules of each sample are DNA molecule entities, and since the DNA molecule chains extracted from cells or plasma are very long, they need to be broken into short fragments by ultrasonic oscillation or other physicochemical means.

Herein, cfDNA denotes circulating DNA, also referred to as free DNA (circulating free DNA or cell-free DNA, cfDNA). It refers to short fragments of partially degraded DNA, free outside the cell, that are released into the peripheral blood of the human body during apoptosis or necrosis.

ctDNA denotes circulating lesion DNA, that is, the DNA fragments in cfDNA that are derived from diseased cells.

For example, genomic DNA is extracted from tissue samples using the DNeasy Blood & Tissue Kit (Qiagen).

For example, circulating DNA (cfDNA) is extracted from plasma samples using the QIAamp Circulating Nucleic Acid Kit (Qiagen).

For example, genomic DNA extracted from tissues is sheared into fragments of about 200 bp in length using a Bioruptor sonicator.

For example, the DNA fragments and the circulating DNA are enzymatically converted and libraries are constructed using the NEBNext Enzymatic Methyl-seq Kit (NEB).

Illustratively, diseased tissue DNA may be subjected to a conversion treatment using bisulfite or an enzyme.

Step 1012, performing genome-wide methylation sequencing on the DNA library to obtain the first lesion sample read data set and the first control sample read data set.

For example, whole genome methylation sequencing is performed on the DNA library using the second-generation high-throughput sequencer NovaSeq 6000. The sequencing strategy is paired-end sequencing with a read length of 150 bp. The sequencing depth of each sample is about 10X.

Illustratively, the fastq files containing the read data in the sequencing results are quality controlled and filtered using the fastp software: (1) reads in which more than 40% of the bases have a quality value below 15 are filtered out; (2) reads containing more than 5 N bases are filtered out; (3) adapter sequences are trimmed off; (4) 15 bp are trimmed from the rear end of the first read (read1) and from the front end of the second read (read2); (5) reads shorter than 36 bp are filtered out.
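To make the filter rules concrete, the following Python sketch applies rules (1), (2), (4) and (5) to one read pair. It is not fastp itself and does not reproduce its internals: adapter removal (rule 3) is omitted, and both the order of operations and the choice to discard a pair when either mate fails are assumptions. The names `Read`, `low_quality`, `too_many_n` and `filter_pair` are hypothetical.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base Phred quality values)


def low_quality(read: Read, min_q: int = 15, max_fraction: float = 0.40) -> bool:
    """Rule (1): more than 40% of the bases have a quality value below 15."""
    _, quals = read
    if not quals:
        return True
    return sum(q < min_q for q in quals) / len(quals) > max_fraction


def too_many_n(read: Read, max_n: int = 5) -> bool:
    """Rule (2): the read contains more than 5 N bases."""
    return read[0].upper().count("N") > max_n


def filter_pair(read1: Read, read2: Read,
                trim: int = 15, min_len: int = 36) -> Optional[Tuple[Read, Read]]:
    """Return the trimmed pair, or None if the pair is filtered out."""
    if any(low_quality(r) or too_many_n(r) for r in (read1, read2)):
        return None
    (s1, q1), (s2, q2) = read1, read2
    keep1 = max(len(s1) - trim, 0)
    r1 = (s1[:keep1], q1[:keep1])  # rule (4): trim the rear end of read1
    r2 = (s2[trim:], q2[trim:])    # rule (4): trim the front end of read2
    if len(r1[0]) < min_len or len(r2[0]) < min_len:
        return None                # rule (5): shorter than 36 bp
    return r1, r2


# Made-up reads, for illustration only.
good = ("ACGT" * 15, [30] * 60)
low = ("ACGT" * 15, [10] * 30 + [30] * 30)
many_n = ("N" * 6 + "ACGT" * 14, [30] * 62)
short = ("ACGT" * 10, [30] * 40)

kept = filter_pair(good, good)
```

Here `kept` holds two 45 bp reads, while pairs containing `low`, `many_n` or `short` are rejected.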

It should be noted that reads denotes the base sequence data read by the second-generation high-throughput sequencer from each DNA molecule, and is also referred to as read data.

Illustratively, in step 1012 above, said performing genome-wide methylation sequencing on said DNA library to obtain said first lesion sample read data set and said first control sample read data set comprises:

step 10121, performing whole genome methylation sequencing on the DNA library, and performing data filtration on a sequencing result based on a preset rule.

Step 10122, aligning the filtered data to a preset reference genome and identifying the methylation state to obtain the first lesion sample read data set and the first control sample read data set.

For example, the filtered reads are aligned to the human reference genome hg38 using the Bismark software. PCR (polymerase chain reaction) duplicates among the reads are removed using the deduplication tool in the Bismark software, based on information such as the alignment position of the reads on the reference genome. Methylation state information of each cytosine site in the genome is obtained using the methylation extractor tool in the Bismark software. For a cytosine site, the methylation rate of the site is the ratio of the number of reads supporting methylation of the site to the total number of reads covering the site.
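The methylation rate defined above is a simple per-site ratio. The sketch below computes it from a list of per-read methylation calls; the input layout `(chromosome, position, is_methylated)` and the function name `site_methylation_rates` are assumptions for illustration and do not reflect the Bismark output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def site_methylation_rates(calls: Iterable[Tuple[str, int, bool]]) -> Dict[Site, float]:
    """Methylation rate = reads supporting methylation / reads covering the site."""
    methylated: Dict[Site, int] = defaultdict(int)
    covered: Dict[Site, int] = defaultdict(int)
    for chrom, pos, is_methylated in calls:
        covered[(chrom, pos)] += 1
        if is_methylated:
            methylated[(chrom, pos)] += 1
    return {site: methylated[site] / covered[site] for site in covered}


# Made-up calls: four reads cover chr1:100, two reads cover chr1:120.
example_calls = [
    ("chr1", 100, True), ("chr1", 100, True),
    ("chr1", 100, False), ("chr1", 100, True),
    ("chr1", 120, False), ("chr1", 120, False),
]
rates = site_methylation_rates(example_calls)
# rates[("chr1", 100)] is 0.75 and rates[("chr1", 120)] is 0.0
```

Sites with no covering reads never appear in the result, so no division by zero can occur.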

Wherein, DNA methylation refers to the chemical modification in which a methyl group is added to the 5th carbon atom of cytosine in a nucleotide.

Wherein, CpG sites refer to sites on a DNA sequence where a cytosine nucleotide is directly followed by a guanine nucleotide, the two being linked by a phosphodiester bond.

Wherein, CpG site methylation refers to methylation of cytosine in CpG dinucleotide. Methylation of cytosine in DNA typically occurs at CpG sites.

Illustratively, the first lesion sample read data and the first control sample read data are high throughput DNA methylation sequencing data.

Illustratively, high-throughput DNA methylation sequencing refers to obtaining the order of the bases on a DNA sequence and the methylation state of each cytosine therein by means of second-generation high-throughput sequencing. The data is presented as reads. Before sequencing, the DNA is treated with bisulfite or an enzyme so that unmethylated cytosines in the DNA molecules are converted into uracil; methylated and unmethylated cytosines can then be distinguished in the sequencing results, and the methylation state of each cytosine can be obtained. High-throughput DNA methylation sequencing includes, but is not limited to, whole genome methylation sequencing and high-throughput targeted methylation sequencing.
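The principle can be illustrated with a deliberately simplified Python sketch: after conversion, an unmethylated cytosine is read as T while a methylated cytosine is still read as C, so comparing a read with the reference at CpG positions reveals the methylation state. The sketch assumes a gap-free, forward-strand alignment of equal length, which real aligners do not require; `call_cpg_states` is a hypothetical name.

```python
from typing import List, Tuple


def call_cpg_states(reference: str, read: str) -> List[Tuple[int, bool]]:
    """Return (position, is_methylated) for each reference CpG covered by the read."""
    if len(reference) != len(read):
        raise ValueError("sketch assumes a gap-free alignment of equal length")
    calls = []
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            if read[i] == "C":
                calls.append((i, True))    # C survived conversion: methylated
            elif read[i] == "T":
                calls.append((i, False))   # C converted and read as T: unmethylated
            # any other base is treated as uninformative and skipped
    return calls


# Made-up sequences: the reference has CpG sites at positions 1, 5 and 8.
reference = "ACGTTCGACGA"
read = "ACGTTTGACGA"
states = call_cpg_states(reference, read)
# states == [(1, True), (5, False), (8, True)]
```

The resulting list of per-CpG states is the kind of per-read information on which a continuity score can later be computed.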

The above step 102 is described in detail below.

In the step 102, the finding at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the first lesion sample read data set and the first control sample read data set, and obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region includes:

Step 1021, finding differentially methylated regions between the first lesion sample read data corresponding to the lesion tissues of the preset number of patients and the first control sample read data corresponding to the non-lesion tissues of those patients, to obtain a first set.

Illustratively, the R language package dmrseq may be used to find differentially methylated regions (DMRs) based on a Generalized Least Squares (GLS) regression model.

Here, DMR denotes a Differentially Methylated Region, that is, a genomic region whose methylation rate differs significantly between two samples from different sources (e.g., from diseased tissue and from non-diseased tissue). A DMR in which the methylation rate of diseased tissue is increased relative to non-diseased tissue is called a hypermethylated region, whereas a DMR in which the methylation rate is decreased is called a hypomethylated region.
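The naming convention can be sketched in Python as below. The sketch only labels the direction of the difference between mean methylation rates; it performs no significance testing, which a real DMR caller does, and the function name `classify_dmr` and the example rates are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def classify_dmr(lesion_rates: List[float], control_rates: List[float]) -> Optional[str]:
    """Label a region by the direction of the lesion-versus-control difference."""
    diff = mean(lesion_rates) - mean(control_rates)
    if diff > 0:
        return "hypermethylated"   # lesion rate increased relative to control
    if diff < 0:
        return "hypomethylated"    # lesion rate decreased relative to control
    return None                    # no difference in either direction


# Made-up per-site methylation rates within one region.
hyper_label = classify_dmr([0.8, 0.9, 0.7], [0.2, 0.1, 0.3])  # "hypermethylated"
hypo_label = classify_dmr([0.1, 0.2], [0.7, 0.9])              # "hypomethylated"
```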

For example, the tissue samples of diseased tissue and of non-diseased tissue are assigned to different groups, and the software package dmrseq is used to find DMRs between the groups based on the corresponding first lesion sample read data and first control sample read data; the DMRs found in this way are referred to as the first set.

Step 1022, finding differentially methylated regions between the first lesion sample read data corresponding to the lesion tissues of the preset number of patients and the first control sample read data corresponding to the cfDNA data of the preset number of high risk people, to obtain a second set.

Step 1023, the intersection of the first set and the second set is taken as the target differentially methylated region.

Step 1024, obtaining a second lesion sample read data set and a second control sample read data set representing each target differential methylation region according to the target differential methylation regions.
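The intersection in step 1023 can be sketched as an interval overlap. The source does not say whether the intersection means overlapping coordinates or identical regions; this sketch assumes coordinate overlap on half-open intervals, and the names `Region` and `intersect_regions` are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def intersect_regions(first: List[Region], second: List[Region]) -> List[Region]:
    """Return the overlapping parts of regions present in both sets."""
    overlaps = []
    for chrom1, start1, end1 in first:
        for chrom2, start2, end2 in second:
            if chrom1 != chrom2:
                continue
            start, end = max(start1, start2), min(end1, end2)
            if start < end:  # non-empty overlap
                overlaps.append((chrom1, start, end))
    return sorted(overlaps)


# Made-up DMR coordinates: lesion vs. paired non-lesion tissue, lesion vs. high-risk cfDNA.
first_set = [("chr1", 100, 500), ("chr2", 1000, 1500)]
second_set = [("chr1", 300, 800), ("chr3", 0, 100)]
target_dmrs = intersect_regions(first_set, second_set)
# target_dmrs == [("chr1", 300, 500)]
```

The nested loop is quadratic in the number of regions, which is acceptable for a sketch; a sorted sweep or an interval tree would be the usual choice for genome-scale sets.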

For example, in order to reduce the potential influence of other variables, two covariates, namely covariate A and covariate B, can be set in the analysis. Covariates are input parameters of the dmrseq software package. Age (a numerical value) and gender (a character code) are used as covariates A and B of the above model, respectively.

Thus, high-throughput methylation sequencing is performed separately on DNA from diseased tissue samples and on DNA from non-diseased samples (e.g., paracancerous tissue, plasma, etc.). The resulting high-throughput methylation sequencing data is used to find Differentially Methylated Regions (DMRs) between the DNA of diseased tissue samples and the DNA of non-diseased samples.

The above step 103 is specifically described below.

The Methylation Continuity Score (MCS) of each read within a DMR is calculated. Using reads from plasma cfDNA of non-diseased populations as a reference, lesion specific data are screened from the reads of diseased tissue DNA according to the methylation continuity scores, thereby reducing the data noise caused by DNA from non-diseased cells mixed into diseased tissue DNA.

Wherein, the lesion specific data refers to reads generated by high-throughput DNA methylation sequencing, which are specific to lesion cells but not in other normal cells.

Wherein, the calculation formula of the Methylation Continuity Score (MCS) of a read is as follows:

wherein L represents the number of CpG sites on the read; i uninterrupted, consecutive methylated CpG sites on a read form a block of continuous methylation state, and n_i indicates the number of such blocks (of length i) on the read.

Illustratively, the value of MCS ranges from 0 to 1. The greater the MCS value, the higher the methylation level on the read, and the more likely the methylated CpG sites are to be uninterruptedly and contiguously distributed on the read, and less separated by unmethylated CpG sites.

FIG. 2 is a graphical representation of the methylation state continuity score for each piece of read data provided by an embodiment of the present invention. As shown in FIG. 2, each horizontal line represents a read. Filled circles represent methylated CpG sites and open circles represent unmethylated CpG sites. The numbers on the left (1, 0.44, 0.22, 0) are the methylation state continuity scores (MCS) of the respective reads.
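The score can be sketched in code, with one important caveat: the formula itself is not reproduced in this text, so the form used below is an assumption. It takes the sum of i^2 * n_i over all block lengths i and divides by L^2, which stays between 0 and 1 and rewards long uninterrupted blocks as described. The function name and the three-site example reads are hypothetical; they are chosen only because they yield the same values (1, 0.44, 0.22, 0) listed for FIG. 2, not because they are the reads drawn in the figure.

```python
from typing import Sequence


def methylation_continuity_score(states: Sequence[int]) -> float:
    """Return an MCS-like score for one read.

    `states` has one entry per CpG site: 1 = methylated, 0 = unmethylated.
    ASSUMED form (not taken from the source): the squared length of every
    uninterrupted methylated block is summed and divided by L squared.
    """
    num_sites = len(states)
    if num_sites == 0:
        raise ValueError("read has no CpG sites")
    total = 0
    run = 0  # length of the current uninterrupted methylated block
    for state in states:
        if state:
            run += 1
        else:
            total += run * run  # an unmethylated site closes the block
            run = 0
    total += run * run  # close a block that reaches the end of the read
    return total / (num_sites * num_sites)


# Hypothetical three-site reads, for illustration only.
for read in ([1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0]):
    print(read, round(methylation_continuity_score(read), 2))
```

Under this assumed form, splitting the same number of methylated sites into more blocks lowers the score, which matches the stated behaviour that contiguous methylation scores higher than interrupted methylation.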

Fig. 3(a) is one of schematic diagrams of hypomethylated regions and hypermethylated regions provided by the embodiment of the present invention, fig. 3(b) is the second of the schematic diagrams of hypomethylated regions and hypermethylated regions provided by the embodiment of the present invention, and fig. 3(a) and fig. 3(b) show methods for screening lesion-specific reads for DMRs of the hypomethylated regions and the hypermethylated regions, respectively. The diagrams on the left of fig. 3(a) and on the left of fig. 3(b) are for hypomethylated regions, and the diagrams on the right of fig. 3(a) and on the right of fig. 3(b) are for hypermethylated regions. The dark diagonal filled portions in the columns are the portions to be removed and the light dotted filled portions are the portions to be retained.

Specifically, in step 103, the calculating the methylation continuity score of each of the second lesion sample read data and the second control sample read data, and the screening of lesion specific data from the second lesion sample read data by using the second control sample read data as a reference and according to the methylation continuity score comprises:

step 1031, calculating the methylation continuity score of each second lesion sample read data and the methylation continuity score of each read data corresponding to the healthy population in the second control sample read data set respectively, and marking the maximum value and the minimum value of the methylation state continuity scores of all second control sample read data corresponding to the healthy population as Smax and Smin respectively.

For example, as shown in fig. 3(a), the left diagram for the hypomethylated region plots all read data from the diseased tissue DNA of a patient (corresponding to diseased tissue DNA on the abscissa) and all read data from the cfDNA of a healthy population (corresponding to cfDNA of a healthy population on the abscissa); the ordinate is the distribution interval of the MCS values of all read data (reads). The minimum value Smin of the methylation state continuity scores among all read data of the healthy population is marked in fig. 3(a).

For example, as shown in fig. 3(a), the right diagram for the hypermethylated region plots all read data from the diseased tissue DNA of the patient (corresponding to diseased tissue DNA on the abscissa) and all read data from the cfDNA of the healthy population (corresponding to cfDNA of healthy population on the abscissa); the ordinate is the distribution interval of the MCS values of all read data (reads). The maximum value Smax of the methylation state continuity scores among all read data of the healthy population is marked in fig. 3(a).

Step 1032, if the target differentially methylated region is a hypomethylated region, removing the read data in the second lesion sample read data set whose methylation state continuity score is greater than or equal to Smin.

For example, as shown in FIG. 3(a), in the left hypomethylated region, the portion above the dotted line of the diseased tissue DNA is removed.

Step 1033, removing the read data in the second lesion sample read data set with the methylation state continuity score less than or equal to Smax if the target differentially methylated region is a hypermethylated region.

For example, as shown in FIG. 3(a), in the hypermethylated region on the right, the portion below the dotted line of the DNA of the lesion tissue is removed.

Therefore, with the screening method in steps 1031 to 1033, on the one hand, reads from the various other cells (such as normal esophageal cells, immune cells, vascular endothelial cells and the like) in surgically excised diseased tissue (such as esophageal cancer tissue) can be filtered out as far as possible, which improves the purity of the diseased reads in the training set. On the other hand, when, for example, the methylation state of the ctDNA of a diseased cell overlaps with that of DNA from other organs and tissues of the human body (such as the stomach and heart), the reads specific to the diseased cell can be selected, which reduces as far as possible the errors in which the neural network judges trace reads derived from other organs and tissues in human blood to be ctDNA.
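Steps 1031 to 1033 can be sketched as follows, assuming the MCS of every read has already been computed and that each read is represented only by its score. The function name, the "hypo"/"hyper" labels and the numbers in the usage lines are hypothetical.

```python
from typing import List, Sequence


def filter_lesion_scores(lesion: Sequence[float],
                         healthy: Sequence[float],
                         region_type: str) -> List[float]:
    """Keep lesion-read scores that fall strictly outside the healthy range.

    region_type is "hypo" (hypomethylated DMR) or "hyper" (hypermethylated DMR).
    """
    if len(healthy) == 0:
        raise ValueError("healthy control reads are required as a reference")
    if region_type == "hypo":
        s_min = min(healthy)                      # step 1031: Smin
        return [s for s in lesion if s < s_min]   # step 1032: drop score >= Smin
    if region_type == "hyper":
        s_max = max(healthy)                      # step 1031: Smax
        return [s for s in lesion if s > s_max]   # step 1033: drop score <= Smax
    raise ValueError("region_type must be 'hypo' or 'hyper'")


# Made-up scores, for illustration only.
kept_hypo = filter_lesion_scores([0.0, 0.1, 0.5, 0.9], [0.4, 0.8, 1.0], "hypo")
kept_hyper = filter_lesion_scores([0.1, 0.5, 0.9, 1.0], [0.0, 0.2, 0.5], "hyper")
print(kept_hypo, kept_hyper)
```

Because the comparisons are strict, a lesion read whose score equals Smin or Smax is removed, in line with the "greater than or equal to" and "less than or equal to" wording of steps 1032 and 1033.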

In step 103, the calculating the methylation continuity score of each of the second lesion sample read data and the second control sample read data, and the screening the lesion specific data from the second lesion sample read data by using the second control sample read data as a reference and according to the methylation continuity score further comprises:

step 1034, calculating the methylation continuity score of each second lesion sample read data and the methylation continuity score of each read data corresponding to the high risk group in the second control sample read data set respectively, and sorting the methylation state continuity scores of all the second control sample read data corresponding to the high risk group from high to low.

For example, as shown in fig. 3(b), the left diagram for the hypomethylated region plots all read data from the diseased tissue DNA of the patient (corresponding to tumor tissue DNA on the abscissa) and all read data of the cfDNA of the healthy population (corresponding to healthy population cfDNA on the abscissa); the ordinate is the distribution interval of the MCS values of all read data (reads). The minimum value Smin of the methylation state continuity scores among all read data of the healthy population is marked in fig. 3(b).

For example, as shown in fig. 3(b), the right diagram for the hypermethylated region plots all read data from the diseased tissue DNA of the patient (corresponding to tumor tissue DNA on the abscissa) and all read data of the cfDNA of the healthy population (corresponding to healthy population cfDNA on the abscissa); the ordinate is the distribution interval of the MCS values of all read data (reads). The maximum value Smax of the methylation state continuity scores among all read data of the healthy population is marked in fig. 3(b).

Step 1035, if the target differentially methylated region is a hypomethylated region, removing the read data whose methylation state continuity scores rank in the bottom preset percentage (e.g., 5%, as shown in fig. 3(b)) of the second control sample read data set corresponding to the high risk group, marking the minimum value of the methylation state continuity scores of the remaining read data as Smin, and removing the read data in the second lesion sample read data set whose methylation state continuity score is greater than or equal to Smin.

For example, as shown in FIG. 3(b), in the left hypomethylated region, the portion above the dotted line of the diseased tissue DNA is removed.

Step 1036, if the target differentially methylated region is a hypermethylated region, removing the read data whose methylation state continuity scores rank in the top preset percentage (e.g., 5%, as shown in fig. 3(b)) of the second control sample read data set corresponding to the high risk group, marking the maximum value of the methylation state continuity scores of the remaining read data as Smax, and removing the read data in the second lesion sample read data set whose methylation state continuity score is less than or equal to Smax.

For example, as shown in FIG. 3(b), in the hypermethylated region on the right, the portion below the dotted line of the DNA of the lesion tissue is removed.

Therefore, the screening method of steps 1034 to 1036 can further filter out those reads of diseased tissue DNA whose methylation patterns resemble the cfDNA of high risk groups (such as people with hepatitis or liver cirrhosis). The reason for removing 5% of the reads in the cfDNA of the high risk population in advance is that it cannot be ruled out that the cfDNA of a small number of high risk individuals contains a small amount of ctDNA.
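Steps 1034 to 1036 can be sketched in the same style. Two points are assumptions rather than statements of the source: that in the hypomethylated case the discarded high risk reads are the lowest-scoring ones (those most likely to resemble ctDNA), and that the number of discarded reads is rounded down. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def trimmed_threshold(high_risk: Sequence[float], region_type: str,
                      percent: float = 5.0) -> float:
    """Return Smin ("hypo") or Smax ("hyper") after trimming the control reads."""
    ranked = sorted(high_risk, reverse=True)        # step 1034: high to low
    k = math.floor(len(ranked) * percent / 100.0)   # ASSUMED rounding: floor
    if k >= len(ranked):
        raise ValueError("no control reads left after trimming")
    if region_type == "hypo":
        remaining = ranked[:len(ranked) - k]        # ASSUMED: drop the k lowest
        return min(remaining)                       # step 1035: Smin
    if region_type == "hyper":
        remaining = ranked[k:]                      # drop the k highest
        return max(remaining)                       # step 1036: Smax
    raise ValueError("region_type must be 'hypo' or 'hyper'")


def filter_against_high_risk(lesion: Sequence[float], high_risk: Sequence[float],
                             region_type: str, percent: float = 5.0) -> List[float]:
    """Remove lesion reads that overlap the trimmed high risk score range."""
    threshold = trimmed_threshold(high_risk, region_type, percent)
    if region_type == "hypo":
        return [s for s in lesion if s < threshold]
    return [s for s in lesion if s > threshold]


# Made-up scores: 20 control reads, one of which looks like ctDNA.
hypo_control = [0.1] + [0.6] * 19
hyper_control = [0.9] + [0.3] * 19
print(filter_against_high_risk([0.05, 0.3, 0.7], hypo_control, "hypo"))
print(filter_against_high_risk([0.2, 0.5, 0.95], hyper_control, "hyper"))
```

In the hypomethylated usage line, trimming moves Smin from 0.1 to 0.6, so the lesion read scored 0.3 is kept; without trimming, the single outlying control read would have caused it to be removed.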

After the read screening and filtering of the previous steps, the number of lesion tissue DNA reads in each DMR changes for each individual. In order to select, as far as possible, the DMRs that retain more reads after filtering in more individuals, the method for screening and processing lesion specific data of the present invention further comprises further screening the DMRs after step 1036. Therefore, in step 103, the screening of the lesion specific data from the second lesion sample read data set by using the second control sample read data set as a reference and according to the methylation continuity score further comprises:

step 1037, after the removing operation, for each target differential methylation region, calculating an index value associated with the target differential methylation region according to a preset index function and the read data in all the second lesion sample read data sets corresponding to the target differential methylation region.

Illustratively, the preset index function is DMR Universal Score (DUS), which is calculated by the following formula:

wherein, for each DMR, n represents the total number of individuals, i represents the i-th individual, t represents the ratio of the number of reads remaining after filtering to the total number of reads before filtering, and d represents the proportion of individuals with t > 0 among all individuals.

Illustratively, the DUS value ranges from 0 to 1. DUS = 0 means that the number of reads remaining in the DMR after filtering is 0 in all individuals; DUS = 1 means that no reads of the DMR were filtered out in any individual.

Step 1038, ranking all index values from large to small, and selecting the target differentially methylated regions associated with a preset number of top-ranked index values as the final target differentially methylated regions.

For example, the DMRs are arranged from large to small according to the DUS value, and a preset number (e.g., 200) of top-ranked DMRs are selected for subsequent analysis.

Step 1039, determining the read data that remain, after the removing operation, in the second lesion sample read data set corresponding to each final target differentially methylated region as the lesion specific data.
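The bookkeeping behind steps 1037 to 1039 can be sketched as follows. Since the DUS formula is not reproduced in this text, the sketch does not attempt to compute DUS; it only derives the two ingredients named in the definitions (the per-individual retained fraction t and the proportion d of individuals with t > 0, read here as "reads kept divided by reads before filtering") and then ranks DMRs by whatever index value is supplied. Function names, DMR identifiers and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def retention_stats(before: Sequence[int],
                    after: Sequence[int]) -> Tuple[List[float], float]:
    """For one DMR, return ([t_1, ..., t_n], d).

    before[i] / after[i] are the lesion read counts of individual i before and
    after filtering. ASSUMPTION: t_i = after[i] / before[i].
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("need matching, non-empty per-individual counts")
    t = [a / b if b > 0 else 0.0 for a, b in zip(after, before)]
    d = sum(1 for value in t if value > 0) / len(t)
    return t, d


def select_top_dmrs(index_values: Dict[str, float], k: int = 200) -> List[str]:
    """Step 1038: rank DMRs by index value, large to small, and keep the top k."""
    ranked = sorted(index_values, key=index_values.get, reverse=True)
    return ranked[:k]


# Made-up counts for four individuals and made-up index values.
t_values, d_value = retention_stats([10, 20, 0, 5], [5, 0, 0, 5])
top = select_top_dmrs({"dmr_a": 0.2, "dmr_b": 0.9, "dmr_c": 0.5}, k=2)
print(t_values, d_value, top)
```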

The above step 104 is described in detail below.

In step 104, the screened lesion specific data is labeled and then a preset neural network is trained, so that the trained neural network can predict the probability of lesion occurrence based on the input sample.

Illustratively, after the above filtering and screening steps, all screened lesion specific data (reads) in the above 200 DMRs are labeled, for example, with 1, and all data (reads) from the cfDNA of high risk and healthy people are labeled with 0. The neural network is trained using the labeled reads. The neural network structure may be a typical Transformer network.

Illustratively, after the step 104, the method for screening and processing lesion specific data according to the embodiment of the present invention further includes the following steps 105 and 106.

Step 105, after the screened lesion specific data are labeled and the preset neural network is trained, inputting a plurality of DNA read data from the same target to be detected into the trained neural network to obtain, for each read data, the probability that it is lesion specific data.

Illustratively, all reads within the DMRs from the circulating DNA (cfDNA) samples of the test set population are input into the trained neural network for identification. The neural network gives the probability that each read is derived from circulating lesion DNA (ctDNA).

Step 106, performing an estimation operation according to the probabilities corresponding to the DNA read data and a preset function, to obtain the proportion of lesion read data in the DNA read data of the target to be detected.

Illustratively, the expression of the preset function is:

wherein the quantity on the left side of the equation represents the proportion of lesion read data in the DNA read data from the target to be detected, p_i is the probability, output by the neural network, that the i-th read data is lesion specific data, n represents the number of all read data from the target to be detected, and t represents a preset value.

For example, the 1001 values t = 0%, 0.1%, 0.2%, 0.3%, ..., 99.9%, and 100% are substituted in turn into the right side of the above equation. That is, the interval from 0% to 100% is divided evenly into 1000 parts at intervals of 0.1%, and t takes one value at a time. The value of t that maximizes the above formula is taken as the estimated ratio of the individual's ctDNA to cfDNA. This method obtains the globally optimal solution for t at a precision of 0.1%.
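The exhaustive search over t can be sketched as below. The grid (1001 values from 0% to 100% in 0.1% steps) follows the description, but the preset function is not reproduced in this text, so the objective is passed in as a parameter. The default objective, a two-component mixture log-likelihood, is only an assumed stand-in chosen for illustration and should not be read as the function of the embodiment; all names are hypothetical.

```python
import math
from typing import Callable, Sequence


def mixture_log_likelihood(t: float, probs: Sequence[float]) -> float:
    """ASSUMED stand-in objective, not taken from the source.

    Treats each read as lesion-derived with weight t and normal with weight
    1 - t, using the network output p_i as the evidence for "lesion".
    """
    eps = 1e-12  # guards against log(0)
    return sum(math.log(max(t * p + (1.0 - t) * (1.0 - p), eps)) for p in probs)


def estimate_fraction(probs: Sequence[float],
                      objective: Callable[[float, Sequence[float]], float]
                      = mixture_log_likelihood,
                      steps: int = 1000) -> float:
    """Try t = 0, 1/steps, ..., 1 and return the t with the largest objective."""
    candidates = [k / steps for k in range(steps + 1)]  # 1001 values by default
    return max(candidates, key=lambda t: objective(t, probs))


# Made-up network outputs for the reads of one individual.
print(estimate_fraction([0.9, 0.2, 0.1, 0.05, 0.8]))
```

Because every grid point is evaluated, the result is the best value on the grid regardless of the shape of the objective, which is the sense in which the search is global at a precision of 0.1%.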

Illustratively, a threshold is set for the proportion calculated above: individuals whose value is greater than the threshold are determined to be diseased (e.g., cancer patients), and the others are determined to be non-diseased (e.g., non-cancerous).

Illustratively, a series of thresholds may be defined in 0.1% steps from 0% to 100%. The population from the test set is determined using each threshold value separately. With the change of the threshold value, the judgment sensitivity, specificity and accuracy of the test set are changed, so that a Receiver Operating Characteristic Curve (ROC) can be drawn, and the size of an Area Under the Curve (AUC) value can be calculated.
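The threshold sweep and the resulting ROC curve can be sketched with the standard library alone. The sketch assumes each test-set individual has an estimated proportion between 0 and 1 and a known true status; the function names and the sample values are hypothetical, and the AUC is computed with the trapezoidal rule, which is one common choice rather than something the text prescribes.

```python
from typing import List, Sequence, Tuple


def roc_points(values: Sequence[float], is_patient: Sequence[bool],
               steps: int = 1000) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) for each threshold."""
    pos = sum(1 for y in is_patient if y)
    neg = len(is_patient) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one patient and one non-patient")
    points = [(1.0, 1.0)]  # limit where everyone is called diseased
    for k in range(steps + 1):
        threshold = k / steps  # 0%, 0.1%, ..., 100%
        # An individual is called diseased when the value exceeds the threshold.
        tp = sum(1 for v, y in zip(values, is_patient) if y and v > threshold)
        fp = sum(1 for v, y in zip(values, is_patient) if not y and v > threshold)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(set(points))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up estimates: two patients followed by two non-patients.
estimates = [0.9, 0.8, 0.1, 0.05]
labels = [True, True, False, False]
print(auc(roc_points(estimates, labels)))
```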

ROC curves are drawn and AUC values are calculated for the results of the traditional method and of the embodiment of the invention, and the two AUC values are compared. It is generally understood by those skilled in the art that a larger AUC indicates a better method. The comparison shows that the noise reduction method of screening lesion specific data from high-throughput DNA methylation sequencing data helps improve the identification accuracy in subsequent analysis.

In conclusion, the embodiment of the invention is beneficial to improving the training quality of the neural network and the final recognition accuracy rate by screening the lesion specific reads in the DMR and further screening the DMR.

The disease-specific data screening and processing device provided by the present invention is described below, and the disease-specific data screening and processing device described below and the disease-specific data screening and processing method described above may be referred to in correspondence with each other.

Fig. 4 is a schematic structural diagram of a lesion-specific data screening and processing device according to an embodiment of the present invention. As shown in fig. 4, the embodiment of the invention provides a lesion specific data screening and processing device 400, wherein the device 400 comprises a preprocessing module 410, a searching module 420 and a screening module 430, wherein:

the preprocessing module 410 is configured to obtain lesion sample data and control sample data, and preprocess the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set, where each first lesion sample read data and each first control sample read data indicate a base arrangement order on a DNA sequence and include methylation state information of each cytosine on the DNA sequence.

The searching module 420 is configured to find at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the two sets, and to obtain a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region.

The screening module 430 is configured to calculate a methylation continuity score for each of the second lesion sample read data and the second control sample read data, and to screen lesion specific data from the second lesion sample read data set with the second control sample read data set as a reference and according to the methylation continuity score.

Illustratively, the apparatus 400 further comprises:

a training module 440 (not shown), the training module 440 configured to:

Illustratively, the preprocessing module 410 is further configured to:

breaking the DNA molecules of each sample into DNA fragments with preset lengths, and performing enzymatic conversion on the DNA fragments respectively to construct a DNA library;

Illustratively, the preprocessing module 410 is further configured to:

said performing genome-wide methylation sequencing of said DNA library to obtain said first lesion sample read data set and said first control sample read data set comprises:

Illustratively, the searching module 420 is further configured to:

Illustratively, the screening module 430 is further configured to:

Illustratively, the apparatus 400 further comprises:

an estimation module 450 (not shown), the estimation module 450 being configured to:

after labeling the screened lesion specific data and training a preset neural network, inputting DNA reading data from the same target to be detected into the trained neural network so as to identify the probability of whether each reading data is the lesion specific data;

and carrying out estimation operation according to the probability corresponding to the DNA reading data and a preset function so as to obtain the proportion of lesion reading data in the DNA reading data of the target to be detected.

Illustratively, the expression of the preset function is:


Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the lesion specific data screening and processing method, the method comprising:

Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method for screening and processing lesion specific data provided by the above methods.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the lesion specific data screening and processing method provided above.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for screening and processing lesion specific data, the method comprising:

2. The method for screening and processing lesion specific data according to claim 1, wherein obtaining lesion sample data and control sample data and preprocessing the lesion sample data and the control sample data to obtain a first lesion sample read data set and a first control sample read data set further comprises:

3. The method for screening and processing lesion specific data according to claim 2, wherein performing genome-wide methylation sequencing on the DNA library to obtain the first lesion sample read data set and the first control sample read data set comprises:

4. The method for screening and processing lesion specific data according to claim 3, wherein finding at least one target differentially methylated region between the first lesion sample read data set and the first control sample read data set based on the two sets and obtaining a second lesion sample read data set and a second control sample read data set representing each target differentially methylated region comprises:

5. The method for screening and processing lesion specific data according to claim 1, wherein said calculating a methylation continuity score for each of said second lesion sample read data and said second control sample read data, and screening lesion specific data from said second lesion sample read data set with said second control sample read data set as a reference and according to said methylation continuity score comprises:

6. The method for screening and processing lesion specific data according to claim 5, wherein said calculating a methylation continuity score for each of said second lesion sample read data and said second control sample read data, and screening lesion specific data from said second lesion sample read data set with said second control sample read data set as a reference and based on said methylation continuity score further comprises:

7. The method for screening and processing lesion specific data according to claim 6, wherein said screening lesion specific data from said second set of lesion sample reads with said second set of control sample reads as a reference and according to said methylation continuity score further comprises:

8. A lesion-specific data screening and processing device, the device comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the lesion specific data screening and processing method of any one of claims 1 to 7.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the lesion specific data screening and processing method of any one of claims 1 to 7.