WO2023184330A1 - Method, device, equipment and medium for processing genome methylation sequencing data - Google Patents

Method, device, equipment and medium for processing genome methylation sequencing data

Info

Publication number
WO2023184330A1
WO2023184330A1 PCT/CN2022/084386 CN2022084386W WO2023184330A1 WO 2023184330 A1 WO2023184330 A1 WO 2023184330A1 CN 2022084386 W CN2022084386 W CN 2022084386W WO 2023184330 A1 WO2023184330 A1 WO 2023184330A1
Authority
WO
WIPO (PCT)
Prior art keywords
methylation
sequence
genome
window
methylated
Prior art date
Application number
PCT/CN2022/084386
Other languages
English (en)
French (fr)
Inventor
宋阳
Original Assignee
京东方科技集团股份有限公司
成都京东方光电科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 成都京东方光电科技有限公司
Priority to PCT/CN2022/084386 (WO2023184330A1)
Priority to CN202280000605.3A (CN117157714A)
Publication of WO2023184330A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present disclosure belongs to the field of gene detection technology, and particularly relates to a method, device, equipment and medium for processing genome methylation sequencing data.
  • DNA (deoxyribonucleic acid) methylation is an epigenetic modification that does not change the DNA sequence: a methyl group is added to the carbon-5 position of cytosine.
  • In the human body, DNA methylation generally occurs at CpG dinucleotide sites and can regulate the expression of coding genes.
  • High-throughput sequencing technology can be used to obtain the methylation pattern of the genome. Studies have shown that DNA methylation patterns play an important regulatory role in individual growth and development, gene expression patterns, and genome stability, and that abnormal DNA methylation is closely related to the occurrence and development of tumors and to cell carcinogenesis. Using methylation sequencing to identify an individual's methylation pattern, and thereby obtain a personalized disease assessment, is a developing trend in the field of disease surveillance.
  • the present disclosure provides a method, device, equipment and medium for processing genome methylation sequencing data.
  • Some embodiments of the present disclosure provide a method for detecting genome methylation sequencing data.
  • the method includes:
  • the methylation index includes: the total number of methylated bases at methylation sites, and the total number of bases at methylation sites, in the alignment result covered by the window in the target region;
  • the steps of analyzing the methylation indicators of the comparison results covered by the windows at each different position and outputting the comprehensive methylation assessment results of the genome methylation sequencing sequence include:
  • the position is calculated.
  • the regional window methylation level value corresponding to each window located at different positions in the target region is analyzed to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
  • the step of analyzing the regional window methylation level values corresponding to each window located at different positions in the target region to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence includes:
  • the methylation index includes: the methylation index in the comparison result covered by the window covering the target site.
  • the steps of analyzing the methylation indicators of the comparison results covered by the windows at each different position and outputting the comprehensive methylation assessment results of the genome methylation sequencing sequence include:
  • each base number covering the target site calculates each base number covering the target site.
  • the site window methylation level value corresponding to the window covering each different position of the target site is analyzed to obtain the site methylation level value of the target site.
  • the step of analyzing the site window methylation level value corresponding to the window covering each different position of the target site to obtain the site methylation level value of the target site includes:
  • the step of sequentially sliding the window from the first end to the second end of the comparison result and, during each sliding, performing statistics on the methylation indicators of the comparison results covered by the windows at different positions includes:
  • the step of comparing the genome methylation sequencing sequence and the reference genome sequence to obtain the comparison result includes:
  • the positive and negative strands of each of the secondary reference genome fragment sequences are compared with the genome methylation sequencing sequence to obtain the alignment results.
  • the step of segmenting the reference genome sequence to obtain multiple reference genome fragment sequences includes:
  • Each reference chromosome genome sequence is divided according to a preset length to obtain multiple reference genome fragment sequences.
  • the reference genome sequence includes: a first converted reference genome sequence and a second converted reference genome sequence; the methylated genome sequence at least includes: a first amplified methylated genome sequence, a second amplified methylated genome sequence.
  • the step of comparing the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison results includes:
  • the methylated genome sequence is subjected to methylation amplification to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
  • the first converted reference genome sequence and the second converted reference genome sequence are respectively combined with the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, The sixth amplified methylated genome sequence was compared;
  • the parent sequence of the amplified methylated genome sequence that aligns identically with the first converted reference genome sequence is used as the positive strand, and the parent sequence of the amplified methylated genome sequence that aligns identically with the second converted reference genome sequence is used as the negative strand.
  • the first amplified methylated genome sequence is subjected to methylation amplification to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and the amplified methylated genome sequence is obtained.
  • the steps of performing methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence include:
  • the method further includes:
  • filtering out the target type of sequences in the obtained original gene sequencing sequence, where the target type of sequences includes at least one of: linker sequences, sequences whose number of bases overlapping the linker sequence is greater than a preset number of bases, end sequences whose quality value is lower than a quality value threshold, and sequences whose length is less than a length threshold;
  • if the filtered original gene sequencing sequence does not meet the target requirements, continuing to perform the filtering operation on the original gene sequencing sequence until the filtered original gene sequencing sequence meets the target requirements, and using the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected, wherein the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and linker sequence requirements.
  • Some embodiments of the present disclosure provide a device for processing genome methylation data, the device including:
  • An acquisition module configured to acquire the genome methylation sequencing sequence to be detected and the reference genome sequence
  • An alignment module configured to compare the genome methylation sequencing sequence with the reference genome sequence to obtain an alignment result
  • the statistics module is configured to construct a window, move the window successively from the first end to the second end of the comparison result, and during each movement collect statistics on the methylation indicators of the comparison results covered by the windows at different positions, wherein the step size of each movement of the window is smaller than the length of the window;
  • the evaluation module is configured to analyze the methylation indicators of the comparison results counted by the windows covering each different position, and output the comprehensive methylation evaluation results of the genome methylation sequencing sequence.
  • the methylation index includes: the total number of methylated bases at methylation sites, and the total number of bases at methylation sites, in the alignment result covered by the window in the target region;
  • the evaluation module is also configured to:
  • the position is calculated.
  • the regional window methylation level value corresponding to each window located at different positions in the target region is analyzed to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
  • the evaluation module is also configured to:
  • the methylation index includes: the methylation index in the comparison result covered by the window covering the target site.
  • the evaluation module is also configured to:
  • each base number covering the target site calculates each base number covering the target site.
  • the site window methylation level value corresponding to the window covering each different position of the target site is analyzed to obtain the site methylation level value of the target site.
  • the evaluation module is also configured to:
  • the statistics module is also configured to:
  • the comparison module is also configured to:
  • the positive and negative strands of each of the secondary reference genome fragment sequences are compared with the genome methylation sequencing sequence to obtain the alignment results.
  • the comparison module is also configured to:
  • Each reference chromosome genome sequence is divided according to a preset length to obtain multiple reference genome fragment sequences.
  • the reference genome sequence includes: a first converted reference genome sequence and a second converted reference genome sequence; the methylated genome sequence at least includes: a first amplified methylated genome sequence, a second amplified methylated genome sequence.
  • the comparison module is also configured to:
  • the methylated genome sequence is subjected to methylation amplification to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
  • the first converted reference genome sequence and the second converted reference genome sequence are respectively combined with the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, The sixth amplified methylated genome sequence was compared;
  • the parent sequence of the amplified methylated genome sequence that aligns identically with the first converted reference genome sequence is used as the positive strand, and the parent sequence of the amplified methylated genome sequence that aligns identically with the second converted reference genome sequence is used as the negative strand.
  • the comparison module is also configured to:
  • the acquisition module is also configured to:
  • filtering out the target type of sequences in the obtained original gene sequencing sequence, where the target type of sequences includes at least one of: linker sequences, sequences whose number of bases overlapping the linker sequence is greater than a preset number of bases, end sequences whose quality value is lower than a quality value threshold, and sequences whose length is less than a length threshold;
  • if the filtered original gene sequencing sequence does not meet the target requirements, continuing to perform the filtering operation on the original gene sequencing sequence until the filtered original gene sequencing sequence meets the target requirements, and using the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected, wherein the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and linker sequence requirements.
  • Some embodiments of the present disclosure provide a computing processing device, including:
  • a memory having computer readable code stored therein;
  • one or more processors; when the computer readable code is executed by the one or more processors, the computing processing device performs the method for processing genome methylation sequencing data as described above.
  • Some embodiments of the present disclosure provide a computer program, including computer readable code, which, when run on a computing processing device, causes the computing processing device to perform the method for processing genome methylation sequencing data as described above.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable medium in which the method for processing genome methylation sequencing data as described above is stored.
  • Figure 1 schematically shows a flow chart of a method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 2 schematically shows one of the flow diagrams of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 3 schematically shows one of the principle diagrams of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 4 schematically shows the second flow diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 5 schematically shows the second principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 6 schematically shows the third flowchart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 7 schematically shows the third principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 8 schematically shows the fourth schematic flowchart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 9 schematically shows the fourth schematic diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure.
  • Figure 10 schematically shows the fifth principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 11 schematically shows the fifth flow diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 12 schematically shows a flow chart of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 13 schematically shows one of the effect diagrams of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 14 schematically shows the second effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 15 schematically shows the third effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 16 schematically shows the fourth effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 17 schematically shows a structural diagram of a device for processing genome methylation sequencing data provided by some embodiments of the present disclosure
  • Figure 18 schematically illustrates a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure
  • Figure 19 schematically illustrates a storage unit for holding or carrying program code implementing methods according to some embodiments of the present disclosure.
  • the genome methylation sequencing sequence is usually obtained through high-throughput sequencing technology, and the entire genome methylation sequencing sequence is then aligned to a pre-prepared reference genome sequence so that the methylation levels of the sequencing sequence can be identified from the alignment results.
  • because the sequence must be aligned as a whole, the alignment takes a long time and its efficiency is low.
  • in addition, if methylation sites do not undergo complete methylation conversion during the experiment, or the sequencing resolution is too low, the methylation level of some sites or regions will not be identified or will be identified inaccurately.
  • Figure 1 schematically shows a flow chart of a method for processing genome methylation sequencing data provided by the present disclosure.
  • the method includes:
  • Step 101 Obtain the genome methylation sequencing sequence to be detected and the reference genome sequence.
  • the genome methylation sequencing sequence to be detected refers to the human genome sequence obtained by using high-throughput sequencing technology to sequence the methylated genome obtained from upstream experiments.
  • the reference genome sequence is a high-quality human genome sequence downloaded from a global gene database.
  • after the genome methylation sequencing sequence is obtained, it can be further quality screened and data filtered to improve the quality of the sequences that subsequently take part in identification and to ensure that the methylation indicators are identified accurately.
  • low-quality gene sequencing sequences can be filtered based on the base ratio and base distribution of the genome methylation sequencing sequence, or preprocessing can be performed by removing duplicates, removing incomplete fragments, and so on. The specific preprocessing method can be set according to actual needs and is not limited here.
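  • The filtering described above (linker sequences, low-quality ends, short sequences) can be illustrated with a minimal sketch. The `Read` type, the function names, and all thresholds are hypothetical and chosen only for illustration; the disclosure does not specify an implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    seq: str          # base calls
    quals: List[int]  # per-base quality values (assumed Phred-like)


def strip_linker(read: Read, linker: str, min_overlap: int) -> Read:
    """Cut a full linker occurrence, or a read suffix that matches a linker
    prefix of more than `min_overlap` bases (hypothetical rule)."""
    pos = read.seq.find(linker)
    if pos != -1:
        return Read(read.seq[:pos], read.quals[:pos])
    longest = min(len(linker) - 1, len(read.seq))
    for k in range(longest, min_overlap, -1):
        if read.seq.endswith(linker[:k]):
            cut = len(read.seq) - k
            return Read(read.seq[:cut], read.quals[:cut])
    return read


def trim_low_quality_end(read: Read, min_qual: int) -> Read:
    """Drop trailing bases whose quality is below `min_qual`."""
    end = len(read.seq)
    while end > 0 and read.quals[end - 1] < min_qual:
        end -= 1
    return Read(read.seq[:end], read.quals[:end])


def filter_reads(reads, linker, min_overlap=3, min_qual=20, min_len=20):
    """Apply linker stripping, end trimming, and a length filter in turn."""
    kept = []
    for read in reads:
        read = strip_linker(read, linker, min_overlap)
        read = trim_low_quality_end(read, min_qual)
        if len(read.seq) >= min_len:
            kept.append(read)
    return kept
```

  • In practice the filter would be re-run, as the disclosure describes, until the remaining reads satisfy the target requirements; that loop is omitted here.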
  • Step 102 Compare the genome methylation sequencing sequence and the reference genome sequence to obtain an alignment result.
  • the genome methylation sequencing sequence is compared to the reference genome sequence to obtain the comparison results of the genome methylation sequencing sequence.
  • the converted DNA (deoxyribonucleic acid) is amplified by PCR (polymerase chain reaction).
  • base conversion is performed so that the various possible combinations of sequences can be compared during alignment. Based on the alignment results, the number of methylated bases at different sites or regions in the genome methylation sequencing sequence can be calculated for subsequent identification of methylation levels.
  • the alignment results, in the form of BAM files, can be sorted by position, and sequences with identical tags can be deduplicated to eliminate errors caused by PCR amplification bias.
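  • As an illustration of the base conversion step, the sketch below assumes the in-silico conversions commonly used for bisulfite-style data (C to T on one strand, G to A on the other). The disclosure does not state which conversions produce its first and second converted reference sequences or its amplified sequences, so the function names and the mapping shown here are assumptions.

```python
def convert_c_to_t(seq: str) -> str:
    """Replace every C with T (assumed conversion for one strand)."""
    return seq.upper().replace("C", "T")


def convert_g_to_a(seq: str) -> str:
    """Replace every G with A (assumed conversion for the opposite strand)."""
    return seq.upper().replace("G", "A")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def converted_references(reference: str) -> dict:
    """Two converted copies of the reference (labels are hypothetical)."""
    return {
        "first": convert_c_to_t(reference),
        "second": convert_g_to_a(reference),
    }


def read_variants(read: str) -> dict:
    """Four converted forms of a read, so that each can be compared
    against both converted references (an illustrative combination)."""
    rc = reverse_complement(read)
    return {
        "read_C2T": convert_c_to_t(read),
        "read_G2A": convert_g_to_a(read),
        "rc_C2T": convert_c_to_t(rc),
        "rc_G2A": convert_g_to_a(rc),
    }
```

  • Comparing every read variant against both converted references is what allows the strand of origin to be assigned afterwards.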
  • Step 103 Construct a window, move the window successively from the first end to the second end of the comparison result, and during each movement collect statistics on the methylation indicators of the comparison results covered by the windows at different positions, wherein the step size of each movement of the window is smaller than the length of the window.
  • a window is an object used to limit the value range of data.
  • the value range of the window can be used to cover a specific range of data to limit the value range of the data to be processed.
  • the comparison results are regarded as a series of continuous data.
  • a window covering part of the comparison result can be used to locate the comparison results within the range of values it covers. Specifically, the window can be moved over the comparison result by sliding, jumping, and so on, so that it covers different value intervals in the comparison result.
  • if methylation sites do not undergo complete methylation conversion during the experiment, or the resolution of the methylation sequencing is too low, the methylation level of some sites or regions may be inaccurate. These errors affect the accuracy of the methylation assessment results for those sites or regions. Embodiments of the present disclosure therefore adopt a window-based identification method to reduce the negative impact of upstream experimental errors and sequencing errors.
  • the embodiment of the present disclosure uses a window covering part of the comparison result and moves it from one end of the area to be detected in the comparison result to the other end.
  • the step size of each slide is the same and the length of the window is the same, and the step size must be smaller than the length of the window so that the areas covered by windows at different positions overlap.
  • statistics are collected on the methylation indicators of the different areas covered by each slide of the window, so that the methylation indicators in the comparison result are divided into multiple, partly overlapping areas, yielding the methylation indicators corresponding to windows at different positions with multiple coverage.
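  • The overlapping window positions described above can be sketched as follows. The half-open coordinate convention and the function name `sliding_windows` are assumptions made for illustration.

```python
def sliding_windows(start: int, end: int, length: int, step: int):
    """Yield (window_start, window_end) pairs, half-open, moving from
    `start` towards `end`. The step must be smaller than the window
    length so that neighbouring windows overlap."""
    if not 0 < step < length:
        raise ValueError("step must be positive and smaller than the window length")
    if length > end - start:
        raise ValueError("window must fit inside the region")
    pos = start
    while pos + length <= end:
        yield (pos, pos + length)
        pos += step
```

  • With a region of 10 positions, a window of length 4, and a step of 2, this yields four windows, each sharing two positions with its neighbour. Any tail of the region shorter than one step is left uncovered in this sketch.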
  • Step 104 Analyze the methylation indicators of the comparison results covered by the windows at each different position, and output the comprehensive methylation evaluation results of the genome methylation sequencing sequence.
  • the methylation indicators of the comparison results covered by windows at different positions are integrated to obtain the comprehensive methylation evaluation result.
  • the integration method can be weighted fitting, averaging, etc.
  • the specific integration method can be set according to actual needs and is not limited here.
  • by integrating the methylation indicators collected from different windows with multiple coverage, the embodiments of the present disclosure combine the methylation indicators of multiple windows covering a specific region or site. This minimizes the impact of local errors in the sequencing sequence on the accuracy of the methylation assessment result for the whole region or site, and improves the accuracy of the methylation assessment results of the methylation sequencing sequence.
  • the present disclosure provides two specific methods for evaluating methylation level values, as follows. In practical applications, they can be used together or alone. When they are combined, their order can be set according to actual needs and does not affect the accuracy of the results.
  • the methylation index includes: the total number of methylated bases at methylation sites, and the total number of bases at methylation sites, in the alignment result covered by the window in the target region;
  • the step 104 includes:
  • Step 104A1 According to the ratio of the total number of methylated bases at the methylation sites in the comparison result covered by the window at each different position of the target region to the total number of bases at those methylation sites, calculate the regional window methylation level value corresponding to each window located at a different position in the target region.
  • referring to FIG. 3, Reference represents the reference genome sequence, window represents the window, B represents the alignment area in the alignment result, b1 to b4 represent windows at four different positions, the step size between adjacent windows is Step, and MRL (methylation region level) represents the regional methylation level value.
  • the number of windows can be any number greater than or equal to 2, and the sliding step size can also be any length smaller than the length of the window.
  • the window starts from the left end of the target area B, where it is marked b1.
  • it then slides toward the right end of the target area B through positions b2, b3, and b4 in turn.
  • at each position of the window, the total number of methylated bases and the total number of bases at the methylation sites in the covered area are counted.
  • Step 104A2 Analyze the regional window methylation level values corresponding to each window located at different positions in the target region to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
  • the regional methylation level value of the target region is obtained by analyzing the regional window methylation level values counted by multiple different windows, each of which covers only part of the target area. This minimizes the impact of local errors in the sequencing sequence on the accuracy of the methylation level of the entire region, and improves the accuracy of the methylation assessment results of the methylation sequencing sequence.
  • the specific integration method can be weighted fitting, averaging, etc. The specific integration method can be set according to actual needs and is not limited here.
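  • Steps 104A1 and 104A2 can be sketched as follows, taking plain averaging as the integration method. The input layout (a mapping from site position to methylated and unmethylated base counts), the function name, and the choice to skip windows that contain no methylation site are assumptions made for illustration.

```python
def region_methylation_level(sites, region_start, region_end, length, step):
    """Return (MRL, window_levels) for one target region.

    sites: dict mapping site position -> (methylated, unmethylated) counts.
    Each window level is methylated / (methylated + unmethylated) over the
    sites inside the window; MRL is the mean of the window levels.
    """
    if not 0 < step < length:
        raise ValueError("step must be positive and smaller than the window length")
    window_levels = []
    pos = region_start
    while pos + length <= region_end:
        covered = [counts for p, counts in sites.items() if pos <= p < pos + length]
        methylated = sum(m for m, _ in covered)
        total = sum(m + um for m, um in covered)
        if total > 0:  # assumption: windows without any site are skipped
            window_levels.append(methylated / total)
        pos += step
    if not window_levels:
        return None, []
    return sum(window_levels) / len(window_levels), window_levels
```

  • For example, with sites at positions 1, 3, 5, and 7 carrying counts (8, 2), (5, 5), (0, 10), and (10, 0), a region of 0 to 8, a window of length 4, and a step of 2, the three window levels are 0.65, 0.25, and 0.5. A weighted fit could replace the final mean without changing the per-window calculation.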
  • the step 104A2 includes: averaging the regional window methylation level values corresponding to each window located at different positions in the target region, and calculating the regional methylation level value of the target region based on that average.
  • the methylation level value of each window in the target area is first calculated through the following formula (1):
  • Mbi represents the regional window methylation level value of the i-th window
  • Mi represents the total number of methylated bases at the methylation sites covered by the i-th window
  • UMi represents the total number of unmethylated bases at the methylation sites covered by the i-th window
  • the sum of Mi and UMi is the total number of bases at the methylation sites covered by the i-th window.
  • MRL represents the regional methylation level value of the target area
  • i is the index of the window
  • C represents the number of windows obtained after covering the target area.
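  • The formulas themselves are not reproduced in this text. A reconstruction that agrees with the definitions of Mbi, Mi, UMi, MRL, i, and C above, and with the ratio and averaging described in steps 104A1 and 104A2, would read as follows. It is an assumption derived from those definitions, not the original formulas (1) and (2) verbatim.

```latex
M_{b_i} = \frac{M_i}{M_i + UM_i} \tag{1}

MRL = \frac{1}{C} \sum_{i=1}^{C} M_{b_i} \tag{2}
```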
  • the methylation index includes: the total number of bases at methylation sites in the window covering the target site, and the number of methylated bases at the target site.
  • referring to Figure 4, the step 104 includes:
  • Step 104B1 Calculate the site window methylation level value corresponding to each window covering the target site at a different position, based on the number of methylated bases at the target site and the total number of bases at the methylation sites in the different windows covering the target site.
  • the window starts from the left end of the target area B.
  • at this position the window is marked b1.
  • according to the step size Step, it then slides toward the right end of the target area B through positions b2, b3, and b4 in turn.
  • the number of windows covering each target site differs; for example, counting from left to right, 2 windows cover the first target site, 3 windows cover the second target site, and 4 windows cover the third target site.
  • for a specific target site, the number of methylated bases at the target site and the total number of bases at methylation sites in the different windows covering the target site are counted.
  • Step 104B2 Analyze the site window methylation level value corresponding to the window covering each different position of the target site to obtain the site methylation level value of the target site.
  • the site methylation level value of the target site is obtained by analyzing the site window methylation level values counted from multiple different windows covering the target site.
  • this minimizes the impact of local errors in the sequencing sequence on the accuracy of the methylation assessment result for the site, and improves the accuracy of the methylation assessment results of the methylation sequencing sequence.
  • the specific integration method can be weighted fitting, averaging, etc. The specific integration method can be set according to actual needs and is not limited here.
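  • Steps 104B1 and 104B2 can be sketched in the same way, again taking plain averaging as the integration method. The per-window ratio follows the wording of step 104B1 (methylated bases at the target site over the total bases at methylation sites in the window); the input layout and the function name are assumptions.

```python
def site_methylation_level(sites, target, region_start, region_end, length, step):
    """Return (MSL, window_levels) for one target site.

    sites: dict mapping site position -> (methylated, unmethylated) counts.
    For every window that covers `target`, the window level is the number
    of methylated bases at the target site divided by the total number of
    bases at all methylation sites inside that window.
    """
    if not 0 < step < length:
        raise ValueError("step must be positive and smaller than the window length")
    target_methylated = sites[target][0]
    window_levels = []
    pos = region_start
    while pos + length <= region_end:
        if pos <= target < pos + length:
            total = sum(m + um for p, (m, um) in sites.items()
                        if pos <= p < pos + length)
            if total > 0:
                window_levels.append(target_methylated / total)
        pos += step
    if not window_levels:
        return None, []
    return sum(window_levels) / len(window_levels), window_levels
```

  • Only windows that contain the target site contribute, so sites near the ends of the region are averaged over fewer windows than sites in the middle.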
  • the step 104B2 includes: averaging the site window methylation level values corresponding to the windows covering each different position of the target site, and calculating the site window methylation level value according to the site window methylation level value. Calculate the site methylation level value of the target site in the genome methylation sequencing sequence.
  • the site window methylation level value of each window covering the target site is first calculated through the following formula (3):
  • Msi represents the site window methylation level value of the i-th window covering the target site
  • Ms represents the number of methylated bases at the target site
  • Mi represents the total number of methylated bases at the methylation sites covered by the i-th window
  • UMi represents the total number of unmethylated bases at the methylation sites covered by the i-th window
  • the sum of Mi and UMi is the total number of bases at the methylation sites covered by the i-th window
  • MSL (Methylation site level) represents the site methylation level value of the target site
  • i is the index of the window
  • c is the number of windows that repeatedly cover the target site.
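The symbol definitions above can be turned into a minimal Python sketch. Formulas (3) and (4) themselves are not reproduced in this text, so the sketch assumes, from the definitions and from the averaging described in step 104B2, that formula (3) is the ratio Ms / (Mi + UMi) and that formula (4) is the plain mean of Msi over the c covering windows. The function names are hypothetical.

```python
from statistics import mean


def site_window_level(ms: int, m_i: int, um_i: int) -> float:
    """Assumed form of formula (3): Msi = Ms / (Mi + UMi)."""
    total = m_i + um_i  # total bases at methylation sites in window i
    if total == 0:
        raise ValueError("window covers no bases at methylation sites")
    return ms / total


def site_methylation_level(ms: int, windows: list[tuple[int, int]]) -> float:
    """Assumed form of formula (4): MSL = mean of Msi over the c windows.

    `windows` holds one (Mi, UMi) pair per window covering the target site.
    """
    return mean(site_window_level(ms, m_i, um_i) for m_i, um_i in windows)


# Two windows cover the site: (Mi, UMi) = (30, 10) and (20, 10), with Ms = 6.
level = site_methylation_level(6, [(30, 10), (20, 10)])
```

If the disclosure's weighted-fitting option were used instead of averaging, only `site_methylation_level` would need to change.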
  • the step 103 includes: sliding the window from the first end to the second end of the comparison result by a preset length each time, counting the methylation index of the covered comparison result before the first slide, and counting the methylation index of the comparison result covered by the window after each slide, wherein the step size of each movement of the window is smaller than the length of the window.
  • the length of the window that partially covers the comparison result is set to a
  • the length of the comparison result is expressed as L
  • a is less than L
  • the sliding step size is preset to s
  • s is less than a
  • both a and s are negatively correlated with the number of window slides. Within a certain range, the shorter the window and the smaller the sliding step, the more slides there are and the more the windows overlap; the more data is collected, the more complete the resulting methylation index is, and the more accurate the subsequently calculated comprehensive methylation assessment result will be.
  • conversely, the longer the window and the larger the sliding step, the less methylation index data is collected, the less time the statistics take, and the more efficient the methylation assessment becomes.
  • the length of the window and the step size of each slide can be set according to actual needs. Any setting under which the resulting windows cover the comparison result multiple times is applicable to the embodiments of the present disclosure, and is not limited here.
  • the preset step size of each slide can be greater than or equal to half the length of the window, which ensures that the multiply covered data of the comparison result is continuous and uninterrupted, so that as much of the comparison result as possible is covered more than once and the methylation level evaluation obtained in the subsequent process is more accurate.
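The relationship between the window length a, the step s and the result length L can be sketched as follows. This is an illustrative sketch only: it assumes 0-based offsets and that every window must lie entirely inside the comparison result, neither of which is specified in the text, and the function names are hypothetical.

```python
def window_starts(length_l: int, a: int, s: int) -> list[int]:
    """Start offsets of windows of length a, moved by step s, over length L."""
    if not (0 < s < a < length_l):
        raise ValueError("expected 0 < s < a < L")
    # Assumption: windows that would run past the end are not generated.
    return list(range(0, length_l - a + 1, s))


def coverage(position: int, starts: list[int], a: int) -> int:
    """Number of windows [start, start + a) that contain the position."""
    return sum(1 for st in starts if st <= position < st + a)


starts = window_starts(10, 4, 2)        # [0, 2, 4, 6]
denser = window_starts(10, 4, 1)        # smaller step, more windows
overlap_at_3 = coverage(3, starts, 4)   # windows starting at 0 and 2
```

Because s < a, adjacent windows always share a - s positions, which is what produces the repeated coverage described above.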
  • step 102 includes:
  • Step 1021 Segment the reference genome sequence to obtain multiple reference genome fragment sequences.
  • the reference genome sequence can be divided according to specific scales or specific units, such as chromosome units, to obtain multiple reference chromosome genome sequences.
  • Step 1022: Compare the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison result.
  • compared with the related-art approach of aligning the methylation sequencing sequence to the entire reference genome sequence, the present disclosure aligns the methylation sequencing sequence to the multiple reference genome fragment sequences one by one, which significantly reduces the alignment time and thereby improves the throughput and speed of post-sequencing alignment.
  • step 1021 includes:
  • Step 10211 Divide the reference genome sequence into chromosome units to obtain multiple reference chromosome genome sequences.
  • Step 10212 Divide each reference chromosome genome sequence according to a preset length to obtain multiple reference genome fragment sequences.
  • the reference genome sequence is first divided with the chromosome as the unit, where Refsplit denotes the reference chromosome genome sequences obtained by this division; each reference chromosome genome sequence is then further divided according to the preset length, where Chrsplit denotes the resulting reference genome fragment sequences.
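The two-stage split of steps 10211 and 10212 can be sketched as below. The in-memory dictionary input, the non-overlapping fragments and the function name are assumptions made for illustration; the text does not say how reads that span a fragment boundary are handled.

```python
def split_reference(genome: dict[str, str], chrsplit: int) -> list[tuple[str, int, str]]:
    """Split a reference into (chromosome, start offset, fragment) records.

    genome   -- hypothetical mapping of chromosome name to sequence
    chrsplit -- preset fragment length within each chromosome
    """
    if chrsplit <= 0:
        raise ValueError("chrsplit must be positive")
    fragments = []
    for chrom, seq in genome.items():                 # stage 1: per chromosome
        for start in range(0, len(seq), chrsplit):    # stage 2: preset length
            fragments.append((chrom, start, seq[start:start + chrsplit]))
    return fragments


frags = split_reference({"chr1": "ACGTACGTAC", "chr2": "GGCC"}, 4)
```

Keeping the start offset with each fragment lets alignment positions be mapped back to chromosome coordinates afterwards.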
  • the reference genome sequence includes: a first converted reference genome sequence and a second converted reference genome sequence; the methylated genome sequence at least includes: a first amplified methylated genome sequence, a second amplified methylated genome sequence.
  • step 1022 includes:
  • Step 201 Obtain the original genome methylation sequencing sequence obtained by sequencing the human genome, and download the human reference genome sequence through the database.
  • Step 202: Perform G-to-A base conversion on the original reference genome sequence to obtain the first converted reference genome sequence, and C-to-T base conversion on the original reference genome sequence to obtain the second converted reference genome sequence; likewise, perform G-to-A base conversion on the original methylated genome sequence to obtain the first amplified methylated genome sequence, and C-to-T base conversion on the original methylated genome sequence to obtain the second amplified methylated genome sequence.
  • G represents guanine
  • A represents adenine
  • C represents cytosine
  • T represents thymine. In the upstream experiment, PCR amplification of the DNA after methylation conversion produces 4 possible sequences. Referring to Figure 9, the final sequences can be regarded as the complementary pairings obtained after performing G-to-A and C-to-T conversion, respectively, on the positive strand prior to methylation conversion.
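The in-silico base conversions of step 202 amount to a single-character substitution over the sequence. A minimal sketch, with hypothetical function names and uppercase-only handling assumed:

```python
def convert(seq: str, src: str, dst: str) -> str:
    """Replace every occurrence of base `src` with base `dst`."""
    return seq.upper().replace(src, dst)


def ga_conversion(seq: str) -> str:
    """G-to-A conversion (guanine to adenine)."""
    return convert(seq, "G", "A")


def ct_conversion(seq: str) -> str:
    """C-to-T conversion (cytosine to thymine)."""
    return convert(seq, "C", "T")


ga = ga_conversion("ACGTCG")
ct = ct_conversion("ACGTCG")
```

The same two functions would apply to both the reference sequence and the sequencing reads, since the step performs the identical conversions on each.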
  • Step 203: Perform C-to-T base conversion on the first amplified methylated genome sequence to obtain a third amplified methylated genome sequence, and G-to-A base conversion on the first amplified methylated genome sequence to obtain a fourth amplified methylated genome sequence; likewise, perform C-to-T base conversion on the second amplified methylated genome sequence to obtain a fifth amplified methylated genome sequence, and G-to-A base conversion on the second amplified methylated genome sequence to obtain a sixth amplified methylated genome sequence.
  • strand 1 of the genome methylation sequencing sequence (the first amplified methylated genome sequence) undergoes C-to-T and G-to-A conversion, and the results are named REAF1 and REAF2; strand 2 (the second amplified methylated genome sequence) is converted in the same way, and the results are named REAR1 and REAR2.
  • REAF1: third amplified methylated genome sequence
  • REAF2: fourth amplified methylated genome sequence
  • REAR1: fifth amplified methylated genome sequence
  • REAR2: sixth amplified methylated genome sequence
  • Step 204: Compare the first converted reference genome sequence and the second converted reference genome sequence each against the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, and the sixth amplified methylated genome sequence.
  • a 2X2 comparison method is used: REF1 and REF2 are each compared against REAF1 and REAF2, and against REAR1 and REAR2.
  • Step 205: Take the parent sequence of an amplified methylated genome sequence that aligns identically to the first converted reference genome sequence as the positive strand, and take the parent sequence of an amplified methylated genome sequence that aligns identically to the second converted reference genome sequence as the negative strand.
  • the method further includes:
  • Step 301: Filter sequences of the target type out of the obtained original gene sequencing sequence, where the sequences of the target type include at least one of: adapter sequences, sequences whose overlap with the adapter sequence is greater than a preset number of bases, end sequences whose quality value is lower than the quality value threshold, and sequences whose length is smaller than the length threshold.
  • data filtering detects and removes possible adapter sequences; when the sequence at either end of a read overlaps the adapter by at least a preset number of bases, for example 3, it is also regarded as adapter sequence and removed. After adapter removal, sequence ends whose quality value is lower than the quality threshold (for example, 20) are trimmed, and fragment sequences whose length falls below the length threshold (for example, 20) as a result of trimming are discarded.
  • the maximum allowed error rate is 0.1 (the number of errors divided by the length of the matching region).
  • the specific threshold parameters can be set according to actual needs and are not limited here.
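The quality-trimming and length-filtering part of step 301 can be sketched as below, using the example thresholds of 20 from the text. This is a simplified illustration, not the disclosure's implementation: it assumes only the 3' end is trimmed, takes Phred scores as a list of integers, and leaves adapter detection out entirely. The function name is hypothetical.

```python
from typing import Optional


def trim_and_filter(
    seq: str, quals: list[int], q_min: int = 20, len_min: int = 20
) -> Optional[tuple[str, list[int]]]:
    """Trim low-quality bases from the read end, then apply the length filter.

    Returns the trimmed (sequence, qualities), or None if the read is discarded.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Assumption: bases are removed from the 3' end while quality < q_min.
    while end > 0 and quals[end - 1] < q_min:
        end -= 1
    if end < len_min:
        return None  # too short after trimming, so the fragment is discarded
    return seq[:end], quals[:end]


kept = trim_and_filter("A" * 25, [30] * 22 + [10] * 3)
dropped = trim_and_filter("A" * 25, [30] * 15 + [10] * 10)
```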
  • Step 302: When the filtered original gene sequencing sequence does not meet the target requirements, continue to perform the filtering operation on the original gene sequencing sequence until the filtered sequence meets the target requirements, and then use the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected; wherein the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and adapter sequence requirements.
  • single-end or paired-end sequencing data are evaluated for base quality, base ratio, average GC distribution of the sequence, N content distribution, sequence length, repeat sequences, and adapter sequences, and the data are then filtered. The filtered sequencing data are evaluated again for data quality against the target requirements, to check whether the previously unqualified indicators have been corrected and to ensure the quality of the sequencing sequences that take part in the subsequent identification.
  • Figure 12 shows a schematic flow chart of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure:
  • the number of reads and the number of bases before and after data quality control and data filtering of methylation sequencing data are as follows in Table 1:
  • Sample_fq1 and Sample_fq2 are the names of the original sequence files for paired-end sequencing
  • Sample_fq1_qc and Sample_fq2_qc are the names of the sequence files obtained from Sample_fq1 and Sample_fq2, respectively, after data quality control and filtering.
  • Figure 15 and Figure 16 show the adapter distribution by read position before and after data quality control and data filtering. Before data quality control (Figure 15), partial adapter sequences are present in the last 20 bases of the reads; after data quality control (Figure 16), essentially no adapter sequences remain.
  • the name of the original reference sequence is genome.fa
  • the name of the reference genome converted from G to A is genome_mfa.GA_conversion.fa
  • the name of the reference genome converted from C to T is genome_mfa.CT_conversion.fa.
  • the names of the sequencing data after data quality control and filtering are Sample_fq1_qc.fastq and Sample_fq2_qc.fastq.
  • Sample_fq1_qc.fastq forms two new sequence files: Sample_fq1_qc.CT_conversion.fastq, Sample_fq1_qc.GA_conversion.fastq
  • Sample_fq2_qc.fastq forms two new sequence files: Sample_fq2_qc.CT_conversion.fastq, Sample_fq2_qc.GA_conversion.fastq.
  • the reference genome is first divided into chromosome units according to the Refsplit width, and then divided within the chromosome according to a specific Chrsplit width, and then the data is compared according to the 2X2 method introduced previously.
  • the number of paired reads before deduplication is 134,701,491.
  • the BAM files are sorted according to coordinate, and the same-labeled sequences are deduplicated.
  • the total repeated sequences are 1,701,997, accounting for 1.26%.
  • the number of paired reads after removal is 132,999,494.
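The deduplication step and its reported counts can be illustrated with a small sketch. The text does not define what makes two sequences "same-labeled", so the sketch assumes, purely for illustration, that alignments with identical (chromosome, start, end, strand) tuples are duplicates; the function name is hypothetical. The arithmetic at the end only re-derives the figures quoted above.

```python
def dedup(alignments: list[tuple[str, int, int, str]]) -> list[tuple[str, int, int, str]]:
    """Sort by coordinate and keep one alignment per identical coordinate tuple."""
    kept = []
    seen = set()
    for aln in sorted(alignments):   # coordinate sort, as done for the BAM file
        if aln not in seen:          # assumption: identical tuple means duplicate
            seen.add(aln)
            kept.append(aln)
    return kept


unique = dedup([("chr1", 5, 9, "+"), ("chr1", 1, 4, "+"), ("chr1", 5, 9, "+")])

# Consistency check on the counts reported in the text.
before, duplicates = 134_701_491, 1_701_997
after = before - duplicates                           # paired reads remaining
percent = round(100 * duplicates / before, 2)         # duplicate percentage
```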
  • part of the extracted information is as follows in Table 2 (random sampling of 20 sites):
  • the regional methylation level is identified according to the previously introduced method. Taking the region from 9000 to 30000 on chromosome 1 as an example, the Window width is set to 2000 and the Step width is set to 1000.
  • by intersecting the windows obtained after sliding with the methylation information obtained in S6, part of the resulting raw window methylation data is shown in Table 3 (the 20 sites in the middle of the data set):
  • formula (2) was used to calculate the methylation level value of the region from 9000 to 30000 in chromosome number 1, which was 0.7481512.
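As a worked check on the window layout in this example, the number of windows can be estimated under two assumptions that the text does not state: the first window starts at coordinate 9000, and every window must lie entirely inside the region, whose span is taken as 30000 minus 9000, that is 21000 positions.

```latex
n = \left\lfloor \frac{(30000 - 9000) - 2000}{1000} \right\rfloor + 1 = 19 + 1 = 20
```

Under the same assumptions, with the Step width equal to half the Window width, every position except those in the first and last 1000 of the region is covered by exactly two windows.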
  • Figure 17 schematically shows a structural diagram of a genome methylation sequencing data processing device 40 provided by the present disclosure.
  • the device includes:
  • the acquisition module 401 is configured to acquire the genome methylation sequencing sequence to be detected and the reference genome sequence;
  • the comparison module 402 is configured to compare the genome methylation sequencing sequence with the reference genome sequence to obtain a comparison result
  • the statistics module 403 is configured to construct a window, move the window successively from the first end to the second end of the comparison result, and, during each movement, count the methylation index of the comparison result covered by the window at its different positions, wherein the step size of each movement of the window is smaller than the length of the window;
  • the evaluation module 404 is configured to analyze the methylation indicators of the comparison results counted by the windows covering each different position, and output the comprehensive methylation evaluation results of the genome methylation sequencing sequence.
  • the methylation index includes: the total number of methylated bases at methylation sites and the total number of bases at methylation sites in the comparison result covered by the window in the target region;
  • the evaluation module 404 is also configured to:
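  • the regional window methylation level value corresponding to each window at a different position in the target region is calculated according to the ratio of the total number of methylated bases at methylation sites to the total number of bases at methylation sites contained in the comparison result covered by that window.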
  • the regional window methylation level value corresponding to each window located at different positions in the target region is analyzed to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
  • the evaluation module 404 is also configured to:
  • the regional window methylation level values corresponding to the windows at different positions in the target region are averaged, and the regional methylation level value of the target region in the genome methylation sequencing sequence is calculated from that average.
  • the methylation index includes: the total number of bases at methylation sites in the comparison result covered by the windows covering the target site, and the number of methylated bases at the target site.
  • the evaluation module 404 is also configured to:
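  • the site window methylation level value corresponding to each window, at each different position, that covers the target site is calculated according to the number of methylated bases at the target site and the total number of bases at methylation sites in the comparison result covered by the different windows covering the target site.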
  • the site window methylation level value corresponding to the window covering each different position of the target site is analyzed to obtain the site methylation level value of the target site.
  • the evaluation module 404 is also configured to:
  • the site window methylation level values corresponding to the windows, at each different position, that cover the target site are averaged, and the site methylation level value of the target site in the genome methylation sequencing sequence is calculated from that average.
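  • the statistics module 403 is also configured to:
  • slide the window from the first end to the second end of the comparison result by a preset length each time, count the methylation index of the covered comparison result before the first slide, and count the methylation index of the comparison result covered by the window after each slide, wherein the step size of each movement of the window is smaller than the length of the window.
  • the comparison module 402 is also configured to:
  • segment the reference genome sequence to obtain multiple reference genome fragment sequences;
  • compare the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison result.
  • the comparison module 402 is also configured to:
  • divide the reference genome sequence into chromosome units to obtain multiple reference chromosome genome sequences;
  • divide each reference chromosome genome sequence according to a preset length to obtain multiple reference genome fragment sequences.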
  • the reference genome sequence includes: a first converted reference genome sequence and a second converted reference genome sequence; the methylated genome sequence at least includes: a first amplified methylated genome sequence, a second amplified methylated genome sequence.
  • the comparison module 402 is also configured to:
  • the step of comparing the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison results includes:
  • the first amplified methylated genome sequence is subjected to methylation amplification to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and the second amplified methylated genome sequence is subjected to methylation amplification to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
  • the first converted reference genome sequence and the second converted reference genome sequence are each compared against the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, and the sixth amplified methylated genome sequence;
  • the parent sequence of an amplified methylated genome sequence that aligns identically to the first converted reference genome sequence is used as the positive strand, and the parent sequence of an amplified methylated genome sequence that aligns identically to the second converted reference genome sequence is used as the negative strand.
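  • the comparison module 402 is also configured to:
  • perform C-to-T base conversion on the first amplified methylated genome sequence to obtain the third amplified methylated genome sequence, and G-to-A base conversion on the first amplified methylated genome sequence to obtain the fourth amplified methylated genome sequence;
  • perform C-to-T base conversion on the second amplified methylated genome sequence to obtain the fifth amplified methylated genome sequence, and G-to-A base conversion on the second amplified methylated genome sequence to obtain the sixth amplified methylated genome sequence.
  • the acquisition module 401 is also configured to:
  • filter sequences of the target type out of the obtained original gene sequencing sequence, where the sequences of the target type include at least one of: adapter sequences, sequences whose overlap with the adapter sequence is greater than the preset number of bases, end sequences whose quality value is lower than the quality value threshold, and sequences whose length is less than the length threshold;
  • when the filtered original gene sequencing sequence does not meet the target requirements, continue to perform the filtering operation on the original gene sequencing sequence until the filtered sequence meets the target requirements, and use the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected; wherein the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and adapter sequence requirements.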
  • the embodiments of the present disclosure obtain the comprehensive methylation evaluation result by integrating the methylation index counted from different, multiply overlapping windows. Combining the methylation index of the comparison results covered by the multiple windows over a specific region or site reduces, as far as possible, the impact of local errors in the sequencing sequence on the accuracy of the methylation assessment of the whole region or site, and improves the accuracy of the methylation assessment results of the methylation sequencing sequence.
  • Various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in a computing processing device according to embodiments of the present disclosure.
  • the present disclosure may also be implemented as an apparatus or apparatus program (eg, computer program and computer program product) for performing part or all of the methods described herein.
  • Such a program implementing the present disclosure may be stored on a non-transitory computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, or provided on a carrier signal, or in any other form.
  • Figure 18 illustrates a computing processing device that may implement methods in accordance with the present disclosure.
  • the computing processing device conventionally includes a processor 510 and a computer program product in the form of memory 520 or non-transitory computer-readable media.
  • Memory 520 may be electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the memory 520 has storage space 530 for program code 531 for executing any method steps in the above-described methods.
  • the storage space 530 for program codes may include individual program codes 531 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to Figure 19.
  • the storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 520 in the computing processing device of FIG. 18 .
  • the program code may, for example, be compressed in a suitable form.
  • the storage unit includes computer readable code 531', i.e., code that can be read by a processor such as 510, which, when executed by a computing processing device, causes the computing processing device to perform the various steps of the methods described above.
  • any reference signs placed between parentheses shall not be construed as limiting the claim.
  • the word “comprising” does not exclude the presence of elements or steps not listed in a claim.
  • the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • the present disclosure may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the element claim enumerating several means, several of these means may be embodied by the same item of hardware.
  • the use of the words first, second, third, etc. does not indicate any order. These words can be interpreted as names.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

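The present disclosure provides a method, apparatus, device, and medium for processing genome methylation sequencing data, belonging to the technical field of genome methylation detection. The method includes: acquiring a genome methylation sequencing sequence to be detected and a reference genome sequence; comparing the genome methylation sequencing sequence with the reference genome sequence to obtain a comparison result; constructing a window, moving the window successively from a first end to a second end of the comparison result, and, during each movement, counting the methylation index of the comparison result covered by the window at its different positions, where the step size of each movement of the window is smaller than the length of the window; and analyzing the methylation index counted for the comparison result covered by the window at each different position, and outputting a comprehensive methylation evaluation result for the genome methylation sequencing sequence.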

Description

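Method, apparatus, device, and medium for processing genome methylation sequencing data
Technical Field
The present disclosure belongs to the technical field of gene detection, and in particular relates to a method, apparatus, device, and medium for processing genome methylation sequencing data.
Background
DNA (deoxyribonucleic acid) methylation is an epigenetic modification that does not change the DNA sequence: a methyl group is added to the 5' carbon of cytosine. In the human body, DNA methylation generally occurs at CpG sites and can regulate the expression of coding genes. High-throughput sequencing can be used to obtain the methylation pattern of a genome. Studies have shown that DNA methylation patterns play an important regulatory role in an individual's growth, development, gene expression patterns, and genome stability, and that abnormal DNA methylation is closely related to the occurrence and development of tumors and to the malignant transformation of cells. Identifying an individual's methylation pattern through methylation sequencing, in order to obtain a personalized disease assessment, is a current trend in the field of disease monitoring.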
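Summary
The present disclosure provides a method, apparatus, device, and medium for processing genome methylation sequencing data.
Some embodiments of the present disclosure provide a method for detecting genome methylation sequencing data, the method including:
acquiring a genome methylation sequencing sequence to be detected and a reference genome sequence;
comparing the genome methylation sequencing sequence with the reference genome sequence to obtain a comparison result;
constructing a window, moving the window successively from a first end to a second end of the comparison result, and, during each movement, counting the methylation index of the comparison result covered by the window at its different positions, where the step size of each movement of the window is smaller than the length of the window;
analyzing the methylation index counted for the comparison result covered by the window at each different position, and outputting a comprehensive methylation evaluation result for the genome methylation sequencing sequence.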
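Optionally, when the comprehensive methylation evaluation result is the regional methylation level value of a target region in the comparison result, the methylation index includes: the total number of methylated bases at methylation sites and the total number of bases at methylation sites in the comparison result covered by the window within the target region;
the step of analyzing the methylation index counted for the comparison result covered by the window at each different position, and outputting the comprehensive methylation evaluation result of the genome methylation sequencing sequence, includes:
calculating the regional window methylation level value corresponding to each window at a different position in the target region, according to the ratio of the total number of methylated bases at methylation sites to the total number of bases at methylation sites contained in the comparison result covered by that window;
analyzing the regional window methylation level values corresponding to the windows at different positions in the target region to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
Optionally, the step of analyzing the regional window methylation level values corresponding to the windows at different positions in the target region to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence includes:
averaging the regional window methylation level values corresponding to the windows at different positions in the target region, and calculating the regional methylation level value of the target region in the genome methylation sequencing sequence from that average.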
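Optionally, when the comprehensive methylation evaluation result is the site methylation level value of a target site, the methylation index includes: the total number of bases at methylation sites in the comparison result covered by the windows covering the target site, and the number of methylated bases at the target site;
the step of analyzing the methylation index counted for the comparison result covered by the window at each different position, and outputting the comprehensive methylation evaluation result of the genome methylation sequencing sequence, includes:
calculating the site window methylation level value corresponding to each window, at each different position, that covers the target site, according to the number of methylated bases at the target site and the total number of bases at methylation sites in the comparison result covered by the different windows covering the target site;
analyzing the site window methylation level values corresponding to the windows, at each different position, that cover the target site to obtain the site methylation level value of the target site.
Optionally, the step of analyzing the site window methylation level values corresponding to the windows, at each different position, that cover the target site to obtain the site methylation level value of the target site includes:
averaging the site window methylation level values corresponding to the windows, at each different position, that cover the target site, and calculating the site methylation level value of the target site in the genome methylation sequencing sequence from that average.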
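Optionally, the step of sliding the window successively from the first end to the second end of the comparison result and, during each slide, counting the methylation index of the comparison result covered by the window at its different positions includes:
sliding the window from the first end to the second end of the comparison result by a preset length each time, counting the methylation index of the covered comparison result before the first slide, and counting the methylation index of the comparison result covered by the window after each slide, where the step size of each movement of the window is smaller than the length of the window.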
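Optionally, the step of comparing the genome methylation sequencing sequence with the reference genome sequence to obtain the comparison result includes:
segmenting the reference genome sequence to obtain multiple reference genome fragment sequences;
comparing the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison result.
Optionally, the step of segmenting the reference genome sequence to obtain multiple reference genome fragment sequences includes:
dividing the reference genome sequence into chromosome units to obtain multiple reference chromosome genome sequences;
dividing each reference chromosome genome sequence according to a preset length to obtain multiple reference genome fragment sequences.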
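Optionally, the reference genome sequence includes a first converted reference genome sequence and a second converted reference genome sequence, and the methylated genome sequence includes at least a first amplified methylated genome sequence and a second amplified methylated genome sequence;
the step of comparing the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison result includes:
performing methylation amplification on the first amplified methylated genome sequence to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and performing methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
comparing the first converted reference genome sequence and the second converted reference genome sequence each against the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, and the sixth amplified methylated genome sequence;
taking the parent sequence of an amplified methylated genome sequence that aligns identically to the first converted reference genome sequence as the positive strand, and taking the parent sequence of an amplified methylated genome sequence that aligns identically to the second converted reference genome sequence as the negative strand.
Optionally, the step of performing methylation amplification on the first amplified methylated genome sequence to obtain at least the third and fourth amplified methylated genome sequences, and performing methylation amplification on the second amplified methylated genome sequence to obtain at least the fifth and sixth amplified methylated genome sequences, includes:
performing C-to-T base conversion on the first amplified methylated genome sequence to obtain the third amplified methylated genome sequence, and performing G-to-A base conversion on the first amplified methylated genome sequence to obtain the fourth amplified methylated genome sequence;
and performing C-to-T base conversion on the second amplified methylated genome sequence to obtain the fifth amplified methylated genome sequence, and performing G-to-A base conversion on the second amplified methylated genome sequence to obtain the sixth amplified methylated genome sequence.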
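Optionally, after the step of acquiring the genome methylation sequencing sequence to be detected and obtaining the reference genome sequence by downloading it from a database, the method further includes:
filtering sequences of a target type out of the obtained original gene sequencing sequence, where the sequences of the target type include at least one of: adapter sequences, sequences whose overlap with the adapter sequence is greater than a preset number of bases, end sequences whose quality value is lower than a quality value threshold, and sequences whose length is less than a length threshold;
when the filtered original gene sequencing sequence does not meet the target requirements, continuing to perform the filtering operation on the original gene sequencing sequence until the filtered original gene sequencing sequence meets the target requirements, and using the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected, where the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and adapter sequence requirements.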
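Some embodiments of the present disclosure provide an apparatus for processing genome methylation data, the apparatus including:
an acquisition module, configured to acquire a genome methylation sequencing sequence to be detected and a reference genome sequence;
a comparison module, configured to compare the genome methylation sequencing sequence with the reference genome sequence to obtain a comparison result;
a statistics module, configured to construct a window, move the window successively from a first end to a second end of the comparison result, and, during each movement, count the methylation index of the comparison result covered by the window at its different positions, where the step size of each movement of the window is smaller than the length of the window;
an evaluation module, configured to analyze the methylation index counted for the comparison result covered by the window at each different position, and output a comprehensive methylation evaluation result for the genome methylation sequencing sequence.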
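Optionally, when the comprehensive methylation evaluation result is the regional methylation level value of a target region in the comparison result, the methylation index includes: the total number of methylated bases at methylation sites and the total number of bases at methylation sites in the comparison result covered by the window within the target region;
the evaluation module is further configured to:
calculate the regional window methylation level value corresponding to each window at a different position in the target region, according to the ratio of the total number of methylated bases at methylation sites to the total number of bases at methylation sites contained in the comparison result covered by that window;
analyze the regional window methylation level values corresponding to the windows at different positions in the target region to obtain the regional methylation level value of the target region in the genome methylation sequencing sequence.
Optionally, the evaluation module is further configured to:
average the regional window methylation level values corresponding to the windows at different positions in the target region, and calculate the regional methylation level value of the target region in the genome methylation sequencing sequence from that average.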
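Optionally, when the comprehensive methylation evaluation result is the site methylation level value of a target site, the methylation index includes: the total number of bases at methylation sites in the comparison result covered by the windows covering the target site, and the number of methylated bases at the target site;
the evaluation module is further configured to:
calculate the site window methylation level value corresponding to each window, at each different position, that covers the target site, according to the number of methylated bases at the target site and the total number of bases at methylation sites in the comparison result covered by the different windows covering the target site;
analyze the site window methylation level values corresponding to the windows, at each different position, that cover the target site to obtain the site methylation level value of the target site.
Optionally, the evaluation module is further configured to:
average the site window methylation level values corresponding to the windows, at each different position, that cover the target site, and calculate the site methylation level value of the target site in the genome methylation sequencing sequence from that average.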
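Optionally, the statistics module is further configured to:
slide the window from the first end to the second end of the comparison result by a preset length each time, count the methylation index of the covered comparison result before the first slide, and count the methylation index of the comparison result covered by the window after each slide, where the step size of each movement of the window is smaller than the length of the window.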
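Optionally, the comparison module is further configured to:
segment the reference genome sequence to obtain multiple reference genome fragment sequences;
compare the positive and negative strands of each of the reference genome fragment sequences with the genome methylation sequencing sequence to obtain the comparison result.
Optionally, the comparison module is further configured to:
divide the reference genome sequence into chromosome units to obtain multiple reference chromosome genome sequences;
divide each reference chromosome genome sequence according to a preset length to obtain multiple reference genome fragment sequences.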
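Optionally, the reference genome sequence includes a first converted reference genome sequence and a second converted reference genome sequence, and the methylated genome sequence includes at least a first amplified methylated genome sequence and a second amplified methylated genome sequence;
the comparison module is further configured to:
perform methylation amplification on the first amplified methylated genome sequence to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and perform methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
compare the first converted reference genome sequence and the second converted reference genome sequence each against the third amplified methylated genome sequence, the fourth amplified methylated genome sequence, the fifth amplified methylated genome sequence, and the sixth amplified methylated genome sequence;
take the parent sequence of an amplified methylated genome sequence that aligns identically to the first converted reference genome sequence as the positive strand, and take the parent sequence of an amplified methylated genome sequence that aligns identically to the second converted reference genome sequence as the negative strand.
Optionally, the comparison module is further configured to:
perform C-to-T base conversion on the first amplified methylated genome sequence to obtain the third amplified methylated genome sequence, and perform G-to-A base conversion on the first amplified methylated genome sequence to obtain the fourth amplified methylated genome sequence;
and perform C-to-T base conversion on the second amplified methylated genome sequence to obtain the fifth amplified methylated genome sequence, and perform G-to-A base conversion on the second amplified methylated genome sequence to obtain the sixth amplified methylated genome sequence.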
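Optionally, the acquisition module is further configured to:
filter sequences of a target type out of the obtained original gene sequencing sequence, where the sequences of the target type include at least one of: adapter sequences, sequences whose overlap with the adapter sequence is greater than a preset number of bases, end sequences whose quality value is lower than a quality value threshold, and sequences whose length is less than a length threshold;
when the filtered original gene sequencing sequence does not meet the target requirements, continue to perform the filtering operation on the original gene sequencing sequence until the filtered original gene sequencing sequence meets the target requirements, and use the filtered original gene sequencing sequence as the genome methylation sequencing sequence to be detected; where the target requirements include at least one of: base quality requirements, base ratio requirements, sequence average GC distribution requirements, N content distribution requirements, sequence length requirements, repeat sequence requirements, and adapter sequence requirements.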
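Some embodiments of the present disclosure provide a computing processing device, including:
a memory storing computer-readable code;
one or more processors, where, when the computer-readable code is executed by the one or more processors, the computing processing device performs the method for processing genome methylation sequencing data described above.
Some embodiments of the present disclosure provide a computer program including computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the method for processing genome methylation sequencing data described above.
Some embodiments of the present disclosure provide a non-transitory computer-readable medium storing the method for processing genome methylation sequencing data described above.
The above is only an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the present disclosure are more readily apparent, specific embodiments of the present disclosure are set out below.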
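Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 schematically shows a flow chart of a method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 2 schematically shows the first flow chart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 3 schematically shows the first principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 4 schematically shows the second flow chart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 5 schematically shows the second principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 6 schematically shows the third flow chart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 7 schematically shows the third principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 8 schematically shows the fourth flow chart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 9 schematically shows the fourth principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 10 schematically shows the fifth principle diagram of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 11 schematically shows the fifth flow chart of another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 12 schematically shows a flow chart of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 13 schematically shows the first effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 14 schematically shows the second effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 15 schematically shows the third effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 16 schematically shows the fourth effect diagram of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 17 schematically shows a structural diagram of an apparatus for processing genome methylation sequencing data provided by some embodiments of the present disclosure;
Figure 18 schematically shows a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure;
Figure 19 schematically shows a storage unit for holding or carrying program code that implements methods according to some embodiments of the present disclosure.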
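Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
In the related art, a genome methylation sequencing sequence is usually obtained by high-throughput sequencing and then aligned as a whole to a previously prepared reference genome sequence, and the methylation level of the genome methylation sequencing sequence is identified from the comparison result. However, because the alignment must be performed against the whole reference, it takes a long time and is inefficient. In addition, errors in the upstream experiment and in sequencing, for example methylation sites that were not fully converted during the experiment or a sequencing resolution that is too low, can cause the methylation level of some sites or regions to be missed or identified inaccurately.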
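Figure 1 schematically shows a flow chart of a method for processing genome methylation sequencing data provided by the present disclosure. The method includes:
Step 101: acquire a genome methylation sequencing sequence to be detected and a reference genome sequence.
It should be noted that the genome methylation sequencing sequence to be detected is the human genome sequence obtained by high-throughput sequencing of the methylated genome produced in the upstream experiment, and the reference genome sequence is a high-quality human genome sequence downloaded from a global gene database.
In the embodiments of the present disclosure, after the genome methylation sequencing sequence is acquired, it can be further subjected to quality screening and data filtering to improve the quality of the sequences that take part in the subsequent identification and to ensure the accuracy of the methylation index identification. For example, low-quality gene sequencing sequences can be filtered out based on the base ratio and base distribution of the genome methylation sequencing sequence, or preprocessing can be performed by deduplication, removal of incomplete fragments, and the like. The specific preprocessing method can be set according to actual needs and is not limited here.
Step 102: compare the genome methylation sequencing sequence with the reference genome sequence to obtain a comparison result.
In the embodiments of the present disclosure, the genome methylation sequencing sequence is aligned to the reference genome sequence to obtain the comparison result of the genome methylation sequencing sequence. Because the DNA (deoxyribonucleic acid) after methylation conversion was amplified by PCR (polymerase chain reaction) in the upstream experiment, base conversion must additionally be performed on the methylation sequencing sequence and the reference genome sequence to make the comparison sufficiently complete, so that all of the possible sequence combinations can be compared. Based on the comparison result, the number of methylated bases at different sites or regions of the genome methylation sequencing sequence can be calculated for use in the subsequent identification of methylation levels.
Further, the PCR amplification bias in the upstream experiment may cause the reads (sequence fragments) in some regions of the methylation gene sequencing sequence to be unevenly distributed. The comparison result, in the form of a BAM file, can therefore be sorted by alignment position, and sequences with the same label can be deduplicated, which eliminates the error introduced by the PCR amplification bias.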
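Step 103: construct a window and move it step by step from a first end of the alignment result to a second end, collecting, at each move, statistics on the methylation indicators of the alignment result covered by the window at its current position, where the step length of each move is smaller than the length of the window.
In the embodiments of the present disclosure, a window is an object that delimits a value interval of data: the position range of the window covers a specific interval of the data and thereby defines the interval to be processed. In the present disclosure the alignment result is a continuous series of data; to analyze a portion of it, a window covering that portion locates the alignment result within the covered position range. Specifically, the window may move over the alignment result by sliding, jumping, or similar means, so that it covers different intervals of the alignment result.
In the embodiments of the present disclosure, errors may arise during the DNA experiment and during sequencing. For example, methylation sites may be incompletely converted during the experiment, or the resolution of methylation sequencing may be too low, so that the methylation level of some sites or regions is inaccurate. Such errors reduce the accuracy of site-level or region-level methylation assessment. The embodiments therefore use window-based identification to reduce the negative impact of upstream experimental errors and sequencing errors.
Specifically, a window that covers part of the alignment result is moved step by step from one end of the region to be detected to the other. Every slide uses the same step length and the same window length, and the step length must be smaller than the window length so that windows at different positions overlap. During sliding, the methylation indicators of the region covered by the window at each position are counted. The methylation indicators of the alignment result are thus divided into multiple partially overlapping regions, yielding methylation indicators for windows at different positions that cover the alignment result multiple times.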
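The sliding procedure of step 103 can be sketched as a small generator. This is an illustrative sketch only: the function name `sliding_windows`, the half-open `[lo, hi)` coordinate convention, and the choice to drop a trailing partial window are assumptions that the disclosure does not specify.

```python
from typing import Iterator, Tuple


def sliding_windows(start: int, end: int, window: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) windows of fixed length over the region [start, end).

    Assumption: a trailing window that would extend past `end` is dropped.
    """
    if not 0 < step < window:
        # The step must be smaller than the window so that neighbours overlap.
        raise ValueError("step must be positive and smaller than the window length")
    if window > end - start:
        raise ValueError("window must not be longer than the region")
    lo = start
    while lo + window <= end:
        yield lo, lo + window
        lo += step


if __name__ == "__main__":
    # Four overlapping windows, analogous to b1..b4 in the figures.
    print(list(sliding_windows(0, 10, window=4, step=2)))
```

Because the step is smaller than the window, each interior position falls into more than one window, which is the multiple coverage the method relies on.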
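Step 104: analyze the methylation indicators collected for the alignment result covered by the window at each position, and output a comprehensive methylation assessment result for the genome methylation sequencing sequence.
In the embodiments of the present disclosure, the methylation indicators of the alignment result covered by windows at different positions are analyzed and integrated into the comprehensive methylation assessment result. The integration may be weighted fitting, averaging, or another approach; it may be set according to actual needs and is not limited here.
By integrating the methylation indicators collected from multiple overlapping windows, the embodiments of the present disclosure combine the indicators of all windows that cover a specific region or site. This reduces, as far as possible, the effect of local errors in the sequencing sequence on the accuracy of the assessment of the whole region or site, and improves the accuracy of the methylation assessment of the methylation sequencing sequence.
In some embodiments of the present disclosure, specific assessment procedures are given below for identifying the region methylation level value and the site methylation level value. In practice the two assessments may be used together or separately; when they are combined, their order may be set according to actual needs and does not affect the accuracy of the result.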
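1) Identification of the region methylation level:
Optionally, when the comprehensive methylation assessment result is the region methylation level value of a target region in the alignment result, the methylation indicators include: the total number of methylated bases at the methylation sites, and the total number of bases at the methylation sites, in the alignment result covered by the window within the target region;
Referring to FIG. 2, step 104 includes:
Step 104A1: for the window at each position in the target region, calculate the region-window methylation level value of that window as the ratio of the total number of methylated bases at the methylation sites to the total number of bases at the methylation sites in the alignment result covered by the window.
In the embodiments of the present disclosure, referring to FIG. 3, Reference denotes the reference genome sequence, window denotes the window, B denotes the aligned region in the alignment result, b1 to b4 denote windows at four different positions, the step between adjacent windows is Step, and MRL (methylation region level) denotes the region methylation level value. This is only an example: the number of windows may be any number greater than or equal to 2, and the sliding step may be any length smaller than the window length.
Referring to FIG. 3, the window starts at the left end of target region B, where it is labeled b1, and slides by Step through positions b2, b3 and b4 to the right end of target region B. At each position, that is, over each of the regions covered by b1 to b4, the total number of methylated bases and the total number of bases at the methylation sites in the covered region are counted.
Step 104A2: analyze the region-window methylation level values of the windows at the different positions in the target region to obtain the region methylation level value of the target region in the genome methylation sequencing sequence.
In the embodiments of the present disclosure, the region methylation level value of the target region is obtained by analyzing the region-window methylation level values of multiple windows, each of which covers only part of the target region. This reduces, as far as possible, the effect of local errors in the sequencing sequence on the accuracy of the methylation level of the whole region, and improves the accuracy of the methylation assessment of the methylation sequencing sequence. The integration may be weighted fitting, averaging, or another approach; it may be set according to actual needs and is not limited here.
Optionally, step 104A2 includes: averaging the region-window methylation level values of the windows at the different positions in the target region, and calculating the region methylation level value of the target region in the genome methylation sequencing sequence from that mean.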
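In some embodiments of the present disclosure, the methylation level value of each window in the target region is first calculated by formula (1):
$$Mb_i = \frac{M_i}{M_i + UM_i} \tag{1}$$
where $Mb_i$ is the region-window methylation level value of the i-th window, $M_i$ is the total number of methylated bases at the methylation sites covered by the i-th window, $UM_i$ is the total number of unmethylated bases at the methylation sites covered by the i-th window, and the sum of $M_i$ and $UM_i$ is the total number of bases at the methylation sites covered by the i-th window.
The region-window methylation level values of all windows at the different positions in the target region are then combined by formula (2):
$$MRL = \frac{1}{C}\sum_{i=1}^{C} Mb_i \tag{2}$$
where MRL is the region methylation level value of the target region, i is the window index, and C is the number of windows obtained once the target region has been fully covered.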
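As a minimal sketch of formulas (1) and (2), assuming the averaging variant of step 104A2, the per-window ratio and the region mean can be computed as follows. The function names and the `(methylated, unmethylated)` tuple layout are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def window_level(methylated: int, unmethylated: int) -> float:
    """Formula (1): methylated count divided by total count for one window."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("window covers no bases at methylation sites")
    return methylated / total


def region_level(window_counts: Sequence[Tuple[int, int]]) -> float:
    """Formula (2), averaging variant: mean of the per-window levels."""
    if not window_counts:
        raise ValueError("at least one window is required")
    levels = [window_level(m, um) for m, um in window_counts]
    return sum(levels) / len(levels)


if __name__ == "__main__":
    # Two windows with (M_i, UM_i) = (3, 1) and (1, 1): levels 0.75 and 0.5.
    print(region_level([(3, 1), (1, 1)]))
```

Windows with no covered bases are rejected here rather than skipped; how such windows should be handled is not stated in the source, so that choice is also an assumption.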
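2) Identification of the site methylation level:
Optionally, when the comprehensive methylation assessment result is the site methylation level value of a target site, the methylation indicators include: the total number of bases at the methylation sites in the windows covering the target site, and the number of methylated bases at the target site. Referring to FIG. 4, step 104 includes:
Step 104B1: from the number of methylated bases at the target site and the total number of bases at the methylation sites in each of the windows covering the target site, calculate the site-window methylation level value of each window, at its respective position, that covers the target site.
In the embodiments of the present disclosure, referring to FIG. 5, Reference denotes the reference genome sequence, window denotes the window, B denotes the aligned region in the alignment result, b1 to b4 denote windows at four different positions, the step between adjacent windows is Step, and M1 is a target site (every point marked with a diamond is a target site). This is only an example: the number of windows may be any number greater than or equal to 2, and the sliding step may be any length smaller than the window length.
Referring to FIG. 5, the window starts at the left end of target region B, where it is labeled b1, and slides by Step through positions b2, b3 and b4 to the right end of target region B. As the window occupies the different positions b1 to b4, the number of windows covering each target site differs: for example, the first target site from the left is covered by 2 windows, the second by 3 windows, and the third by 4 windows. For a given target site, the number of methylated bases at the site and the total number of bases at the methylation sites in each window covering the site are counted.
Step 104B2: analyze the site-window methylation level values of the windows, at their respective positions, that cover the target site to obtain the site methylation level value of the target site.
In the embodiments of the present disclosure, the site methylation level value of the target site is obtained by analyzing the site-window methylation level values of the multiple windows that cover the target site. This reduces, as far as possible, the effect of local errors in the sequencing sequence on the accuracy of the site methylation assessment, and improves the accuracy of the methylation assessment of the methylation sequencing sequence. The integration may be weighted fitting, averaging, or another approach; it may be set according to actual needs and is not limited here.
Optionally, step 104B2 includes: averaging the site-window methylation level values of the windows, at their respective positions, that cover the target site, and calculating the site methylation level value of the target site in the genome methylation sequencing sequence from that mean.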
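In the embodiments of the present disclosure, the site-window methylation level value of each window covering the target site is first calculated by formula (3):
$$Ms_i = \frac{Ms}{M_i + UM_i} \tag{3}$$
where $Ms_i$ is the site-window methylation level value of the i-th window covering the target site, $Ms$ is the number of methylated bases at the target site, $M_i$ is the total number of methylated bases at the methylation sites covered by the i-th window, $UM_i$ is the total number of unmethylated bases at the methylation sites covered by the i-th window, and the sum of $M_i$ and $UM_i$ is the total number of bases at the methylation sites covered by the i-th window.
The site-window methylation level values of all windows covering the target site are then combined by formula (4) to obtain the site methylation level value of the target site:
$$MSL = \frac{1}{c}\sum_{i=1}^{c} Ms_i \tag{4}$$
where MSL (methylation site level) is the site methylation level value of the target site, i is the window index, and c is the number of windows that cover the target site.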
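Formulas (3) and (4) can be sketched in the same way, again assuming the averaging variant of step 104B2. The function name `site_level` and the input layout are hypothetical.

```python
from typing import Sequence, Tuple


def site_level(site_methylated: int, covering_windows: Sequence[Tuple[int, int]]) -> float:
    """Average of Ms / (M_i + UM_i) over every window that covers the site.

    `covering_windows` holds one (M_i, UM_i) pair per covering window.
    """
    if not covering_windows:
        raise ValueError("the site must be covered by at least one window")
    levels = []
    for methylated, unmethylated in covering_windows:
        total = methylated + unmethylated
        if total == 0:
            raise ValueError("window covers no bases at methylation sites")
        levels.append(site_methylated / total)  # formula (3)
    return sum(levels) / len(levels)  # formula (4), averaging variant


if __name__ == "__main__":
    # Ms = 2, covered by windows with totals 4 and 8: (0.5 + 0.25) / 2.
    print(site_level(2, [(3, 1), (6, 2)]))
```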
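Optionally, step 103 includes: sliding the window step by step, by a preset length each time, from the first end to the second end of the alignment result; counting the methylation indicators of the covered alignment result before the first slide, and counting the methylation indicators of the alignment result covered by the window after each slide, where the step length of each move is smaller than the length of the window.
In some embodiments of the present disclosure, the length of the window that partially covers the alignment result is set to a and the length of the alignment result is L, with a smaller than L; the preset sliding step is s, with s smaller than a. Both a and s are negatively correlated with the number of slides. Within a certain range, the smaller the window length and the sliding step, the more slides there are, the more the windows overlap, the larger the resulting data volume, and the more complete the computed methylation indicators, so the subsequently calculated comprehensive methylation assessment result is more accurate. Conversely, the larger the window length and the sliding step, the smaller the volume of methylation indicator data, the less time the statistics take, and the more efficient the methylation assessment. The window length and the step of each slide may be set according to actual needs; any setting under which the resulting windows cover the alignment result multiple times can be used in the embodiments of the present disclosure, and it is not limited here.
Further, if the preset step of each slide is greater than or equal to half the window length, the multiply covered data of the alignment result is continuous and uninterrupted, so that the alignment result is covered multiple times as far as possible and the methylation level assessment obtained subsequently is more accurate.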
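To illustrate how the window length a and the step s affect multiple coverage, the sketch below counts how many windows cover each position of an alignment result of length L. The function name is hypothetical, and a trailing partial window is assumed to be dropped.

```python
from typing import List


def coverage_depth(length: int, window: int, step: int) -> List[int]:
    """Return, for each position 0..length-1, the number of windows covering it."""
    if not 0 < step < window <= length:
        raise ValueError("require 0 < step < window <= length")
    depth = [0] * length
    lo = 0
    while lo + window <= length:
        for pos in range(lo, lo + window):
            depth[pos] += 1
        lo += step
    return depth


if __name__ == "__main__":
    print(coverage_depth(10, window=4, step=2))  # interior positions covered twice
    print(coverage_depth(10, window=4, step=1))  # smaller step, deeper coverage
```

With a fixed window, reducing the step increases both the number of windows and the depth at interior positions, which matches the accuracy versus run time trade-off described above.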
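Optionally, referring to FIG. 6, step 102 includes:
Step 1021: split the reference genome sequence to obtain multiple reference genome fragment sequences.
In the embodiments of the present disclosure, the reference genome sequence may be split at a specific scale or by a specific unit, for example by chromosome, to obtain multiple reference chromosome genome sequences.
Step 1022: perform positive- and negative-strand alignment between each reference genome fragment sequence and the genome methylation sequencing sequence to obtain the alignment result.
In some embodiments of the present disclosure, compared with the related-art approach of aligning all methylation sequencing sequences to the entire reference genome sequence, the present disclosure aligns the methylation sequencing sequence to the multiple reference genome fragment sequences one by one, which significantly reduces alignment time and thereby improves post-sequencing alignment throughput and speed.
Optionally, referring to FIG. 6, step 1021 includes:
Step 10211: split the reference genome sequence by chromosome to obtain multiple reference chromosome genome sequences.
Step 10212: split each reference chromosome genome sequence by a preset length to obtain multiple reference genome fragment sequences.
In some embodiments of the present disclosure, referring to FIG. 7, the reference genome sequence is first split by chromosome, where Refsplit denotes the resulting reference chromosome genome sequences; each reference chromosome genome sequence is then further split by a preset length, where Chrsplit denotes the resulting reference genome fragment sequences.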
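The two-level split of steps 10211 and 10212 can be sketched as follows. The function name, the in-memory dictionary of chromosome sequences, and the `(chromosome, offset, fragment)` output layout are assumptions for illustration; a real reference genome would normally be streamed from a FASTA file rather than held as Python strings.

```python
from typing import Dict, List, Tuple


def split_reference(chromosomes: Dict[str, str], chunk_len: int) -> List[Tuple[str, int, str]]:
    """Split by chromosome first, then cut each chromosome into fixed-length fragments.

    Each fragment keeps its chromosome name and 0-based offset so that
    alignment positions can be mapped back to genome coordinates.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    fragments = []
    for name, sequence in chromosomes.items():  # chromosome-level split
        for offset in range(0, len(sequence), chunk_len):  # fixed-length split
            fragments.append((name, offset, sequence[offset:offset + chunk_len]))
    return fragments


if __name__ == "__main__":
    toy_reference = {"chr1": "ACGTACGTAC", "chr2": "GGCC"}
    for fragment in split_reference(toy_reference, 4):
        print(fragment)
```

Keeping the offset with each fragment is a design choice of this sketch; the source does not say how fragment coordinates are tracked.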
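Optionally, the reference genome sequence includes a first converted reference genome sequence and a second converted reference genome sequence, and the methylated genome sequence includes at least a first amplified methylated genome sequence and a second amplified methylated genome sequence;
Referring to FIG. 8, step 1022 includes:
Step 201: obtain the original genome methylation sequencing sequence produced by sequencing a human genome, and download the human reference genome sequence from a database.
Step 202: apply G-to-A base conversion to the original reference genome sequence to obtain the first converted reference genome sequence, and apply C-to-T base conversion to the original reference genome sequence to obtain the second converted reference genome sequence; and apply G-to-A base conversion to the original methylated genome sequence to obtain the first amplified methylated genome sequence, and apply C-to-T base conversion to the original methylated genome sequence to obtain the second amplified methylated genome sequence.
In the embodiments of the present disclosure, G denotes guanine, A denotes adenine, C denotes cytosine, and T denotes thymine. When the methylation-converted DNA is amplified by PCR in the upstream experiment, four possible sequences are produced. Referring to FIG. 9, the final sequences can be regarded as the complementary pairings of the positive strand, before methylation conversion, after G-to-A and C-to-T conversion respectively.
Step 203: apply C-to-T base conversion to the first amplified methylated genome sequence to obtain a third amplified methylated genome sequence, and apply G-to-A base conversion to the first amplified methylated genome sequence to obtain a fourth amplified methylated genome sequence; and apply C-to-T base conversion to the second amplified methylated genome sequence to obtain a fifth amplified methylated genome sequence, and apply G-to-A base conversion to the second amplified methylated genome sequence to obtain a sixth amplified methylated genome sequence.
In the embodiments of the present disclosure, referring to FIG. 10, to distinguish the positive and negative strands of the methylation sequencing sequence after methylation, strand 1 (the first amplified methylated genome sequence), obtained by C-to-T conversion of the genome methylation sequencing sequence, undergoes C-to-T and G-to-A conversion, and the results are named REAF1 (the third amplified methylated genome sequence) and REAF2 (the fourth amplified methylated genome sequence). The same operations are applied to strand 2 (the second amplified methylated genome sequence), giving REAR1 (the fifth amplified methylated genome sequence) and REAR2 (the sixth amplified methylated genome sequence). Likewise, the C-to-T converted reference genome is named REF1 (the first converted reference genome sequence), and the G-to-A converted reference genome is named REF2 (the second converted reference genome sequence).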
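The C-to-T and G-to-A base conversions used in steps 202 and 203 are plain character substitutions and can be sketched with `str.translate`. The helper names are hypothetical, and preserving lowercase (soft-masked) bases is an assumption of this sketch.

```python
# Translation tables for the two in-silico conversions; case is preserved.
_C_TO_T = str.maketrans("Cc", "Tt")
_G_TO_A = str.maketrans("Gg", "Aa")


def convert_c_to_t(sequence: str) -> str:
    """Replace every cytosine with thymine."""
    return sequence.translate(_C_TO_T)


def convert_g_to_a(sequence: str) -> str:
    """Replace every guanine with adenine."""
    return sequence.translate(_G_TO_A)


if __name__ == "__main__":
    read = "ACGTCG"
    print(convert_c_to_t(read))  # ATGTTG
    print(convert_g_to_a(read))  # ACATCA
```

Applying both helpers to each read and to the reference yields the converted copies that the 2×2 comparison operates on.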
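Step 204: align the first converted reference genome sequence and the second converted reference genome sequence with each of the third, fourth, fifth and sixth amplified methylated genome sequences.
In the embodiments of the present disclosure, again referring to FIG. 10, a 2×2 alignment scheme is used: REF1 and REF2 are aligned against REAF1 and REAF2, and against REAR1 and REAR2.
Step 205: take the parent sequence of the amplified methylated genome sequence that aligns identically with the first converted reference genome sequence as the positive strand, and take the parent sequence of the amplified methylated genome sequence that aligns identically with the second converted reference genome sequence as the negative strand.
In the embodiments of the present disclosure, referring to FIG. 10, when the aligned sequences are identical, that is, when REF1 and REAF1 are identical, the parent sequence of REAF1 is determined to be the positive strand; when REF2 and REAR2 are identical, the parent sequence of REAR2 is determined to be the negative strand.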
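Optionally, referring to FIG. 11, after step 101 the method further includes:
Step 301: filter out sequences of target types from the obtained original gene sequencing sequences, where the target types include at least one of: adapter sequences, sequences whose overlap with the adapter sequence exceeds a preset number of bases, end sequences whose quality value is below a quality threshold, and sequences shorter than a length threshold.
In some embodiments of the present disclosure, data filtering detects and removes any adapter sequences. When the sequence at either end of a read overlaps the adapter by at least a preset number of bases, for example 3, it is likewise treated as adapter sequence and removed. After adapter removal, read ends whose quality value is below a quality threshold, for example 20, are trimmed, and fragments whose length falls below a length threshold, for example 20, as a result of trimming are discarded. The maximum allowed error rate is 0.1 (the number of errors divided by the length of the matching region). The specific threshold parameters may be set according to actual needs and are not limited here.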
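The quality-trimming and length-filtering part of step 301 can be illustrated with a deliberately simplified sketch. It trims trailing bases one at a time while they are below the threshold, which is not necessarily the algorithm used by any particular trimming tool, and adapter detection is not shown. The function name and defaults (20 and 20, taken from the example thresholds above) are assumptions.

```python
from typing import List, Optional, Tuple


def trim_and_filter(
    sequence: str,
    qualities: List[int],
    min_quality: int = 20,
    min_length: int = 20,
) -> Optional[Tuple[str, List[int]]]:
    """Trim low-quality bases from the 3' end, then drop reads that became too short.

    Returns the trimmed (sequence, qualities) pair, or None if the read is discarded.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1  # remove one trailing low-quality base
    if end < min_length:
        return None  # too short after trimming
    return sequence[:end], qualities[:end]


if __name__ == "__main__":
    print(trim_and_filter("ACGTAC", [30, 30, 30, 30, 10, 5], min_quality=20, min_length=3))
```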
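Further, before filtering, an automatic fastq format detection script written in Perl may be run on the original gene sequencing sequences to determine whether the fastq file (fastq being a text-based file format that stores biological sequences together with the quality of each base or amino acid) uses Phred33 or Phred64 quality encoding, so that the format of the original gene sequencing sequences meets the filtering criteria.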
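The source uses a Perl script for this check and does not show it. The Python sketch below illustrates one common heuristic for the same purpose, based on the ASCII range of the quality characters; the function name and the cut-off values 59 and 74 are assumptions of this sketch, and ambiguous input is reported as `None` rather than guessed.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Guess the quality offset (33 or 64) from fastq quality strings.

    Heuristic: characters below ASCII 59 occur only with a +33 offset;
    if all characters are >= 64 and some exceed 'J' (74), +64 is likely.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if lowest >= 64 and highest > 74:
        return 64
    return None  # range is compatible with both encodings


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5"]))  # contains '#', so +33
    print(guess_phred_offset(["hhhhd"]))   # only high characters, so +64
```

In practice the check would be run on a sample of the quality lines of the file rather than on every read.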
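Step 302: when the filtered original gene sequencing sequences do not meet the target requirements, continue to apply the filtering operation to the original sequencing sequences until the filtered original sequencing sequences meet the target requirements, and use the filtered original sequencing sequences as the genome methylation sequencing sequence to be detected; the target requirements include at least one of: a base quality requirement, a base composition requirement, a mean sequence GC distribution requirement, an N content distribution requirement, a sequence length requirement, a duplicate sequence requirement, and an adapter sequence requirement.
In the embodiments of the present disclosure, single-end or paired-end sequencing data are evaluated for base quality, base composition, mean sequence GC distribution, N content distribution, sequence length, duplicate sequences, adapter sequences, and so on. After data filtering, the sequencing data are evaluated again against the target requirements to check whether the indicators that previously failed have been corrected, which ensures the quality of the sequencing sequences used in subsequent identification.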
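Optionally, FIG. 12 shows a flowchart of yet another method for processing genome methylation sequencing data provided by some embodiments of the present disclosure:
S1: data quality control and data filtering module;
In the embodiments of the present disclosure, the numbers of reads and bases before and after data quality control and data filtering of the methylation sequencing data are shown in Table 1:
Figure PCTCN2022084386-appb-000003
Table 1
Here Sample_fq1 and Sample_fq2 are the file names of the original paired-end sequencing sequences, and Sample_fq1_qc and Sample_fq2_qc are the file names of Sample_fq1 and Sample_fq2, respectively, after data quality control and filtering.
The distribution of base quality values by read position before and after data quality control and filtering is shown in FIG. 13 and FIG. 14. Before quality control (FIG. 13), base quality is low over the last 30 bases of the reads; after quality control (FIG. 14), quality values by read position are essentially all above 28.
The distribution of adapters by read position before and after data quality control and filtering is shown in FIG. 15 and FIG. 16. Before quality control (FIG. 15), some adapter sequence is present in the last 20 bases of the reads; after quality control (FIG. 16), adapter sequence is essentially no longer present.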
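S2: sequence conversion of the reference genome;
In the embodiments of the present disclosure, the original reference sequence is named genome.fa, the G-to-A converted reference genome is named genome_mfa.GA_conversion.fa, and the C-to-T converted reference genome is named genome_mfa.CT_conversion.fa.
S3: sequence conversion of the sequencing data after quality control and filtering;
In the embodiments of the present disclosure, the sequencing data after quality control and filtering are named Sample_fq1_qc.fastq and Sample_fq2_qc.fastq. After C-to-T and G-to-A conversion, Sample_fq1_qc.fastq yields two new sequence files, Sample_fq1_qc.CT_conversion.fastq and Sample_fq1_qc.GA_conversion.fastq, and Sample_fq2_qc.fastq yields two new sequence files, Sample_fq2_qc.CT_conversion.fastq and Sample_fq2_qc.GA_conversion.fastq.
S4: align the converted sequencing data to the converted reference genome;
In the embodiments of the present disclosure, the reference genome is first split by chromosome according to the Refsplit width and then split within each chromosome according to a specific Chrsplit width, after which the data are aligned with the 2×2 method described above.
S5: deduplicate the aligned bam file;
In the embodiments of the present disclosure, the number of paired reads before deduplication is 134,701,491. The bam file is sorted by coordinate and sequences with the same mark are deduplicated; the total number of duplicate sequences is 1,701,997 (1.26%), and the number of paired reads after removal is 132,999,494.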
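The deduplication figures reported for S5 are mutually consistent, as the following short arithmetic check shows. Only the three numbers quoted in the text are used; the variable names are illustrative.

```python
# Figures quoted in the S5 example.
pairs_before = 134_701_491
duplicates = 1_701_997

# Reads remaining after duplicates are removed, and the duplicate rate.
pairs_after = pairs_before - duplicates
duplicate_rate_percent = round(duplicates / pairs_before * 100, 2)

if __name__ == "__main__":
    print(pairs_after)             # 132999494
    print(duplicate_rate_percent)  # 1.26
```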
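S6: extract the total count and the methylated count at each site after alignment;
In the embodiments of the present disclosure, taking the interval 9000 to 30000 on chromosome 1 as an example, part of the extracted information is shown in Table 2 (20 randomly sampled sites):
Figure PCTCN2022084386-appb-000004
Table 2
S7: identify the methylation levels of regions and sites.
In the embodiments of the present disclosure, the region methylation level is identified with the method described above, taking the interval 9000 to 30000 on chromosome 1 as an example, with the Window width set to 2000 and the Step width set to 1000. The windows obtained after sliding are intersected with the methylation information obtained in S6; part of the resulting raw window methylation data is shown in Table 3 (20 sites from the middle of the data set):
Figure PCTCN2022084386-appb-000005
Table 3
Formula (1) is then used to calculate the methylation level value $Mb_i$ of each window, as shown in Table 4:
Figure PCTCN2022084386-appb-000006
Table 4
Finally, formula (2) gives a region methylation level value of 0.7481512 for the region 9000 to 30000 on chromosome 1.
Next, following the comprehensive site methylation level assessment method described above, formula (3) is used to calculate the site-window methylation level values; part of the resulting information is shown in Table 5 (20 randomly sampled sites):
Figure PCTCN2022084386-appb-000007
Table 5
Formula (4) is then used to calculate the site methylation level values; part of the resulting information is shown in Table 6 (20 randomly sampled sites).
Figure PCTCN2022084386-appb-000008
Table 6
The above specific embodiment is only illustrative; the specific usage may be set according to actual needs and is not limited here.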
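FIG. 17 schematically shows a structural diagram of an apparatus 40 for processing genome methylation sequencing data provided by the present disclosure. The apparatus includes:
an obtaining module 401, configured to obtain a genome methylation sequencing sequence to be detected and a reference genome sequence;
an alignment module 402, configured to align the genome methylation sequencing sequence with the reference genome sequence to obtain an alignment result;
a statistics module 403, configured to construct a window and move the window step by step from a first end of the alignment result to a second end, collecting, at each move, statistics on the methylation indicators of the alignment result covered by the window at its current position, where the step length of each move is smaller than the length of the window;
an assessment module 404, configured to analyze the methylation indicators collected for the alignment result covered by the window at each position, and to output a comprehensive methylation assessment result for the genome methylation sequencing sequence.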
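Optionally, when the comprehensive methylation assessment result is the region methylation level value of a target region in the alignment result, the methylation indicators include: the total number of methylated bases at the methylation sites, and the total number of bases at the methylation sites, in the alignment result covered by the window within the target region;
the assessment module 404 is further configured to:
for the window at each position in the target region, calculate the region-window methylation level value of that window as the ratio of the total number of methylated bases at the methylation sites to the total number of bases at the methylation sites in the alignment result covered by the window;
analyze the region-window methylation level values of the windows at the different positions in the target region to obtain the region methylation level value of the target region in the genome methylation sequencing sequence.
Optionally, the assessment module 404 is further configured to:
average the region-window methylation level values of the windows at the different positions in the target region, and calculate the region methylation level value of the target region in the genome methylation sequencing sequence from that mean.
Optionally, when the comprehensive methylation assessment result is the site methylation level value of a target site, the methylation indicators include: the total number of bases at the methylation sites in the alignment result covered by the windows covering the target site, and the number of methylated bases at the target site;
the assessment module 404 is further configured to:
from the number of methylated bases at the target site and the total number of bases at the methylation sites in the alignment result covered by each of the windows covering the target site, calculate the site-window methylation level value of each window, at its respective position, that covers the target site;
analyze the site-window methylation level values of the windows, at their respective positions, that cover the target site to obtain the site methylation level value of the target site.
Optionally, the assessment module 404 is further configured to:
average the site-window methylation level values of the windows, at their respective positions, that cover the target site, and calculate the site methylation level value of the target site in the genome methylation sequencing sequence from that mean.
Optionally, the statistics module 403 is further configured to:
slide the window step by step, by a preset length each time, from the first end to the second end of the alignment result; count the methylation indicators of the covered alignment result before the first slide, and count the methylation indicators of the alignment result covered by the window after each slide, where the step length of each move is smaller than the length of the window.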
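Optionally, the alignment module 402 is further configured to:
split the reference genome sequence to obtain multiple reference genome fragment sequences;
perform positive- and negative-strand alignment between each reference genome fragment sequence and the genome methylation sequencing sequence to obtain the alignment result.
Optionally, the alignment module 402 is further configured to:
split the reference genome sequence by chromosome to obtain multiple reference chromosome genome sequences;
split each reference chromosome genome sequence by a preset length to obtain multiple reference genome fragment sequences.
Optionally, the reference genome sequence includes a first converted reference genome sequence and a second converted reference genome sequence, and the methylated genome sequence includes at least a first amplified methylated genome sequence and a second amplified methylated genome sequence;
the alignment module 402 is further configured such that
performing positive- and negative-strand alignment between each reference genome fragment sequence and the genome methylation sequencing sequence to obtain the alignment result includes:
performing methylation amplification on the first amplified methylated genome sequence to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and performing methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
aligning the first converted reference genome sequence and the second converted reference genome sequence with each of the third, fourth, fifth and sixth amplified methylated genome sequences;
taking the parent sequence of the amplified methylated genome sequence that aligns identically with the first converted reference genome sequence as the positive strand, and taking the parent sequence of the amplified methylated genome sequence that aligns identically with the second converted reference genome sequence as the negative strand.
Optionally, the alignment module 402 is further configured to:
apply C-to-T base conversion to the first amplified methylated genome sequence to obtain the third amplified methylated genome sequence, and apply G-to-A base conversion to the first amplified methylated genome sequence to obtain the fourth amplified methylated genome sequence;
and apply C-to-T base conversion to the second amplified methylated genome sequence to obtain the fifth amplified methylated genome sequence, and apply G-to-A base conversion to the second amplified methylated genome sequence to obtain the sixth amplified methylated genome sequence.
Optionally, the obtaining module 401 is further configured to:
filter out sequences of target types from the obtained original gene sequencing sequences, where the target types include at least one of: adapter sequences, sequences whose overlap with the adapter sequence exceeds a preset number of bases, end sequences whose quality value is below a quality threshold, and sequences shorter than a length threshold;
when the filtered original sequencing sequences do not meet the target requirements, continue to apply the filtering operation to the original sequencing sequences until the filtered original sequencing sequences meet the target requirements, and use the filtered original sequencing sequences as the genome methylation sequencing sequence to be detected, where the target requirements include at least one of: a base quality requirement, a base composition requirement, a mean sequence GC distribution requirement, an N content distribution requirement, a sequence length requirement, a duplicate sequence requirement, and an adapter sequence requirement.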
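By integrating the methylation indicators collected from multiple overlapping windows, the embodiments of the present disclosure combine the indicators of all windows that cover a specific region or site. This reduces, as far as possible, the effect of local errors in the sequencing sequence on the accuracy of the assessment of the whole region or site, and improves the accuracy of the methylation assessment of the methylation sequencing sequence.
The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the computing processing device according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described here. Such a program implementing the present disclosure may be stored on a non-transitory computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 18 shows a computing processing device that can implement the method according to the present disclosure. The computing processing device conventionally includes a processor 510 and a computer program product or non-transitory computer-readable medium in the form of a memory 520. The memory 520 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 520 has a storage space 530 for program code 531 for performing any of the method steps described above. For example, the storage space 530 for program code may include individual program codes 531 for implementing the various steps of the above method. The program code may be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 19. The storage unit may have storage segments, storage space, and so on, arranged similarly to the memory 520 in the computing processing device of FIG. 18. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 531’, that is, code that can be read by a processor such as 510; when run by a computing processing device, the code causes the computing processing device to perform the steps of the method described above.
It should be understood that although the steps in the flowcharts of the drawings are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
References herein to "one embodiment", "an embodiment" or "one or more embodiments" mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. In addition, note that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It will be understood, however, that the embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several apparatuses, several of these apparatuses may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.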

Claims (15)

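  1. A method for detecting genome methylation sequencing data, characterized in that the method comprises:
    obtaining a genome methylation sequencing sequence to be detected and a reference genome sequence;
    aligning the genome methylation sequencing sequence with the reference genome sequence to obtain an alignment result;
    constructing a window and moving the window step by step from a first end of the alignment result to a second end, and collecting, at each move, statistics on the methylation indicators of the alignment result covered by the window at its current position, wherein the step length of each move of the window is smaller than the length of the window;
    analyzing the methylation indicators collected for the alignment result covered by the window at each position, and outputting a comprehensive methylation assessment result for the genome methylation sequencing sequence.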
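  2. The method according to claim 1, characterized in that, when the comprehensive methylation assessment result is the region methylation level value of a target region in the alignment result, the methylation indicators comprise: the total number of methylated bases at the methylation sites, and the total number of bases at the methylation sites, in the alignment result covered by the window within the target region;
    the step of analyzing the methylation indicators collected for the alignment result covered by the window at each position and outputting a comprehensive methylation assessment result for the genome methylation sequencing sequence comprises:
    for the window at each position in the target region, calculating the region-window methylation level value of that window as the ratio of the total number of methylated bases at the methylation sites to the total number of bases at the methylation sites in the alignment result covered by the window;
    analyzing the region-window methylation level values of the windows at the different positions in the target region to obtain the region methylation level value of the target region in the genome methylation sequencing sequence.
  3. The method according to claim 2, characterized in that the step of analyzing the region-window methylation level values of the windows at the different positions in the target region to obtain the region methylation level value of the target region in the genome methylation sequencing sequence comprises:
    averaging the region-window methylation level values of the windows at the different positions in the target region, and calculating the region methylation level value of the target region in the genome methylation sequencing sequence from that mean.
  4. The method according to claim 1, characterized in that, when the comprehensive methylation assessment result is the site methylation level value of a target site, the methylation indicators comprise: the total number of bases at the methylation sites in the alignment result covered by the windows covering the target site, and the number of methylated bases at the target site;
    the step of analyzing the methylation indicators collected for the alignment result covered by the window at each position and outputting a comprehensive methylation assessment result for the genome methylation sequencing sequence comprises:
    from the number of methylated bases at the target site and the total number of bases at the methylation sites in the alignment result covered by each of the windows covering the target site, calculating the site-window methylation level value of each window, at its respective position, that covers the target site;
    analyzing the site-window methylation level values of the windows, at their respective positions, that cover the target site to obtain the site methylation level value of the target site.
  5. The method according to claim 4, characterized in that the step of analyzing the site-window methylation level values of the windows, at their respective positions, that cover the target site to obtain the site methylation level value of the target site comprises:
    averaging the site-window methylation level values of the windows, at their respective positions, that cover the target site, and calculating the site methylation level value of the target site in the genome methylation sequencing sequence from that mean.
  6. The method according to claim 1, characterized in that the step of sliding the window step by step from the first end of the alignment result to the second end and collecting, at each slide, statistics on the methylation indicators of the alignment result covered by the window at its current position comprises:
    sliding the window step by step, by a preset length each time, from the first end to the second end of the alignment result; counting the methylation indicators of the covered alignment result before the first slide, and counting the methylation indicators of the alignment result covered by the window after each slide, wherein the step length of each move of the window is smaller than the length of the window.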
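  7. The method according to claim 1, characterized in that the step of aligning the genome methylation sequencing sequence with the reference genome sequence to obtain an alignment result comprises:
    splitting the reference genome sequence to obtain multiple reference genome fragment sequences;
    performing positive- and negative-strand alignment between each reference genome fragment sequence and the genome methylation sequencing sequence to obtain the alignment result.
  8. The method according to claim 7, characterized in that the step of splitting the reference genome sequence to obtain multiple reference genome fragment sequences comprises:
    splitting the reference genome sequence by chromosome to obtain multiple reference chromosome genome sequences;
    splitting each reference chromosome genome sequence by a preset length to obtain multiple reference genome fragment sequences.
  9. The method according to claim 7, characterized in that the reference genome sequence comprises a first converted reference genome sequence and a second converted reference genome sequence, and the methylated genome sequence comprises at least a first amplified methylated genome sequence and a second amplified methylated genome sequence;
    the step of performing positive- and negative-strand alignment between each reference genome fragment sequence and the genome methylation sequencing sequence to obtain the alignment result comprises:
    performing methylation amplification on the first amplified methylated genome sequence to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and performing methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence;
    aligning the first converted reference genome sequence and the second converted reference genome sequence with each of the third, fourth, fifth and sixth amplified methylated genome sequences;
    taking the parent sequence of the amplified methylated genome sequence that aligns identically with the first converted reference genome sequence as the positive strand, and taking the parent sequence of the amplified methylated genome sequence that aligns identically with the second converted reference genome sequence as the negative strand.
  10. The method according to claim 9, characterized in that the step of performing methylation amplification on the first amplified methylated genome sequence to obtain at least a third amplified methylated genome sequence and a fourth amplified methylated genome sequence, and performing methylation amplification on the second amplified methylated genome sequence to obtain at least a fifth amplified methylated genome sequence and a sixth amplified methylated genome sequence, comprises:
    applying C-to-T base conversion to the first amplified methylated genome sequence to obtain the third amplified methylated genome sequence, and applying G-to-A base conversion to the first amplified methylated genome sequence to obtain the fourth amplified methylated genome sequence;
    and applying C-to-T base conversion to the second amplified methylated genome sequence to obtain the fifth amplified methylated genome sequence, and applying G-to-A base conversion to the second amplified methylated genome sequence to obtain the sixth amplified methylated genome sequence.
  11. The method according to claim 1, characterized in that, after the step of obtaining the genome methylation sequencing sequence to be detected and obtaining the reference genome sequence by downloading from a database, the method further comprises:
    filtering out sequences of target types from the obtained original gene sequencing sequences, wherein the target types comprise at least one of: adapter sequences, sequences whose overlap with the adapter sequence exceeds a preset number of bases, end sequences whose quality value is below a quality threshold, and sequences shorter than a length threshold;
    when the filtered original gene sequencing sequences do not meet the target requirements, continuing to apply the filtering operation to the original gene sequencing sequences until the filtered original gene sequencing sequences meet the target requirements, and using the filtered original gene sequencing sequences as the genome methylation sequencing sequence to be detected, wherein the target requirements comprise at least one of: a base quality requirement, a base composition requirement, a mean sequence GC distribution requirement, an N content distribution requirement, a sequence length requirement, a duplicate sequence requirement, and an adapter sequence requirement.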
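  12. An apparatus for detecting genome methylation sequencing data, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a genome methylation sequencing sequence to be detected and a reference genome sequence;
    an alignment module, configured to align the genome methylation sequencing sequence with the reference genome sequence to obtain an alignment result;
    a statistics module, configured to construct a window and move the window step by step from a first end of the alignment result to a second end, collecting, at each move, statistics on the methylation indicators of the alignment result covered by the window at its current position, wherein the step length of each move of the window is smaller than the length of the window;
    an assessment module, configured to analyze the methylation indicators collected for the alignment result covered by the window at each position, and to output a comprehensive methylation assessment result for the genome methylation sequencing sequence.
  13. A computing processing device, characterized by comprising:
    a memory in which computer-readable code is stored;
    one or more processors, wherein when the computer-readable code is executed by the one or more processors, the computing processing device performs the method for processing genome methylation sequencing data according to any one of claims 1-11.
  14. A computer program, characterized by comprising computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the method for processing genome methylation sequencing data according to any one of claims 1-11.
  15. A non-transitory computer-readable medium, characterized in that it stores a computer program for the method for processing genome methylation sequencing data according to any one of claims 1-11.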
PCT/CN2022/084386 2022-03-31 2022-03-31 Method, apparatus, device and medium for processing genome methylation sequencing data WO2023184330A1 (zh)


Publications (1)

Publication Number Publication Date
WO2023184330A1 true WO2023184330A1 (zh) 2023-10-05


Also Published As

Publication number Publication date
CN117157714A (zh) 2023-12-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934177

Country of ref document: EP

Kind code of ref document: A1