WO2021238441A1 - Vectorized characterization of methylation level, and method and device for detecting specific sequencing intervals - Google Patents

Vectorized characterization of methylation level, and method and device for detecting specific sequencing intervals

Info

Publication number
WO2021238441A1
Authority
WO
WIPO (PCT)
Prior art keywords
methylation
interval
sequencing
vector
specific
Prior art date
Application number
PCT/CN2021/086169
Other languages
English (en)
French (fr)
Inventor
杨昊
蒋泽宇
Original Assignee
广州市基准医疗有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市基准医疗有限责任公司
Publication of WO2021238441A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Definitions

  • This application relates to the technical field of genetic information processing, and in particular to a vectorized characterization method for methylation levels, a detection method and a recording method for specific sequencing intervals, and the corresponding devices, computer equipment, and readable storage media.
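  • A vectorized characterization method of methylation level includes:
  • S10 Acquire methylation information of each methylation sequencing interval of the test sample; wherein the test sample includes a plurality of methylation sequencing intervals;
  • S20 Determine the number of Reads of the various sequencing results in the preset reading interval according to the methylation information of each methylation sequencing interval; wherein the number of Reads is the number of occurrences of the sequencing result of the corresponding category in the methylation information of the corresponding methylation sequencing interval, and the order of the various sequencing results in the reading interval is preset;
  • S30 Slide the sliding window from the first position to the last position of the reading interval according to the sliding step length, and obtain, from the number of Reads of the various sequencing results, the number of occurrences of each sequence combination in the sliding window during each window reading; wherein the sliding window slides backward by the sliding step after the number of occurrences of each sequence combination has been read;
  • S40 Determine, according to the number of occurrences of each sequence combination in each methylation sequencing interval, the number vector generated during each window reading of each methylation sequencing interval;
  • S50 Splice the number vectors generated during each window reading of each methylation sequencing interval into the methylation vector of that methylation sequencing interval.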
  • A vectorized characterization device for methylation level includes:
  • the first acquisition module, used to acquire methylation information of each methylation sequencing interval of the test sample; wherein the test sample includes a plurality of methylation sequencing intervals;
  • the first determining module, used to determine the number of Reads of the various sequencing results in the preset reading interval according to the methylation information of each methylation sequencing interval; wherein the number of Reads is the number of occurrences of the sequencing result of the corresponding category in the methylation information of the corresponding methylation sequencing interval, and the order of the various sequencing results in the reading interval is preset;
  • the sliding module, used to slide the sliding window from the first position to the last position of the reading interval according to the sliding step length, and to obtain, from the number of Reads of the various sequencing results, the number of occurrences of each sequence combination during each window reading; wherein the sliding window slides backward by the sliding step after the number of occurrences of each sequence combination has been read;
  • the second determining module, used to determine the number vector generated during each window reading of each methylation sequencing interval according to the number of occurrences of each sequence combination in that methylation sequencing interval;
  • the reading module, used to splice the number vectors generated during each window reading of each methylation sequencing interval into the methylation vector of that methylation sequencing interval.
  • The vectorized characterization method and device for methylation level described above acquire the methylation information of each methylation sequencing interval of the test sample, determine the number of Reads of the various sequencing results in the preset reading interval, and slide the sliding window from the first position to the last position of the reading interval according to the sliding step length, so as to obtain the number of occurrences of each sequence combination in the sliding window during each window reading from the number of Reads of the various sequencing results.
  • The number vector generated during each window reading of each methylation sequencing interval is then determined, and these number vectors are spliced into the methylation vector of that methylation sequencing interval. Each vector element of the resulting methylation vector thus corresponds to a sequence combination of the methylation sequencing interval in the test sample, so the methylation information of each methylation sequencing interval can be characterized comprehensively; specifically, each methylation sequencing interval of the test sample is characterized by its methylation vector and is no longer simply summarized as a single value.
  • A detection method for a specific sequencing interval includes:
  • the methylation vector of each methylation sequencing interval in each sample is obtained by the above-mentioned vectorized characterization method of methylation level;
  • the samples are divided into training samples and test samples; the methylation vector of each methylation sequencing interval in the training samples is determined as a training vector, and the methylation vector of each methylation sequencing interval in the test samples is determined as a test vector; the methylation sequencing intervals with the same interval order are determined as a group of methylation sequencing intervals, yielding multiple groups of methylation sequencing intervals; each group of training vectors is input into the classification model for training to obtain an evaluation model for each group of methylation sequencing intervals, and each evaluation model is tested against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval; wherein the interval order of the methylation sequencing intervals in each sample is preset;
  • the methylation vectors of a group of methylation sequencing intervals include the group of training vectors and the group of test vectors corresponding to that group of methylation sequencing intervals;
  • the methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold is determined as the specific sequencing interval.
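The selection step above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which evaluation index is used, so a rank-based AUC is assumed here (the figures refer to ROC curves), and the interval names, labels, scores, and threshold are all hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Assumption: AUC stands in for the unspecified "evaluation index".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_specific_intervals(evaluation_index, threshold):
    """Keep the intervals whose evaluation index is >= threshold."""
    return [name for name, value in evaluation_index.items() if value >= threshold]


# Hypothetical test labels (1 = malignant, 0 = benign) and, for each interval
# group, the scores its evaluation model assigned to the same test samples.
labels = [1, 1, 0, 0]
scores_by_interval = {
    "interval_1": [0.9, 0.8, 0.3, 0.1],
    "interval_2": [0.9, 0.2, 0.8, 0.1],
}
evaluation_index = {name: auc(labels, s) for name, s in scores_by_interval.items()}
specific = select_specific_intervals(evaluation_index, threshold=0.9)
```

With these made-up scores only `interval_1` passes the threshold; any other per-interval metric could be substituted for `auc` without changing the selection logic.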
  • A detection device for a specific sequencing interval includes:
  • the second acquisition module, used to acquire the methylation vector of each methylation sequencing interval in each sample by means of the above-mentioned vectorized characterization device for methylation level;
  • the dividing module, used to divide the samples into training samples and test samples, determine the methylation vector of each methylation sequencing interval in the training samples as a training vector, determine the methylation vector of each methylation sequencing interval in the test samples as a test vector, determine the methylation sequencing intervals with the same interval order as a group of methylation sequencing intervals so as to obtain multiple groups of methylation sequencing intervals, input each group of training vectors into the classification model for training to obtain an evaluation model for each group of methylation sequencing intervals, and test each evaluation model against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval; wherein the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals include the group of training vectors and the group of test vectors corresponding to that group;
  • the third determining module, used to determine a methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold as a specific sequencing interval.
  • The above-mentioned detection method and device for specific sequencing intervals obtain the methylation vector of each methylation sequencing interval in each sample and divide the samples into training samples and test samples. The methylation vector of each methylation sequencing interval in the training samples is determined as a training vector, and that of each methylation sequencing interval in the test samples as a test vector.
  • The methylation sequencing intervals with the same interval order are determined as a group of methylation sequencing intervals, yielding multiple groups. Each group of training vectors is input into the classification model for training to obtain an evaluation model for each group, and each evaluation model is tested against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval.
  • A methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold is determined as a specific sequencing interval, so that, in each group of specific sequencing intervals detected, the differences between samples of different types are far greater than the differences within samples.
  • a recording method of a specific sequencing interval includes:
  • A recording device for a specific sequencing interval includes:
  • the second acquisition module, used to obtain the specific sequencing intervals of each sample by means of the above-mentioned detection device for specific sequencing intervals, and to set up a dictionary corresponding to each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that specific sequencing interval; each dictionary is recorded in a specific format so as to record each specific sequencing interval.
  • The above-mentioned recording method and device for specific sequencing intervals set up a dictionary corresponding to each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value represents its methylation vector. Each dictionary is recorded in a specific format such as json, which is easy for the relevant personnel to process and easy for machines to parse and generate; this further improves the efficiency of subsequent analysis of each specific sequencing interval.
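A minimal sketch of such a record, using Python's standard `json` module. The interval names and vectors are hypothetical, and collecting all intervals in a single dictionary (rather than one dictionary per interval) is a simplification made here for brevity.

```python
import json


def record_specific_intervals(vectors_by_interval):
    """Serialize {interval name: methylation vector} as JSON text."""
    return json.dumps(vectors_by_interval, indent=2, sort_keys=True)


# Hypothetical specific sequencing intervals and their methylation vectors.
record = record_specific_intervals(
    {"interval_7": [0, 5, 0, 2], "interval_3": [1, 0, 0, 0]}
)
restored = json.loads(record)  # round trip back to a Python dict
```

Sorting the keys keeps the file layout stable between runs, which makes recorded files easier to compare.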
  • FIG. 1 is a flowchart of a method for vectorized characterization of methylation level according to an embodiment
  • FIG. 2 is a schematic diagram of an information recording method of a methylated region according to an embodiment
  • FIG. 3 is a schematic diagram of some of the combinations of C and T in a reading interval of length 8 according to an embodiment
  • FIG. 4 is a schematic diagram of the methylation region intercepted at each step of movement according to an embodiment
  • FIG. 5 is a schematic structural diagram of a vectorized characterization device for methylation level according to an embodiment
  • FIG. 6 is a flowchart of a method for detecting a specific sequencing interval according to an embodiment
  • FIG. 7 is a schematic diagram of the methylation vector and corresponding tag data according to an embodiment
  • FIG. 8 is a schematic diagram of an ROC curve corresponding to a tree-based model according to an embodiment
  • FIG. 9 is a schematic diagram of an ROC curve corresponding to a regression-based model according to an embodiment
  • FIG. 10 is a schematic diagram of an ROC curve obtained by using the methylation entropy value and the xgboost model according to an embodiment
  • FIG. 11 is a schematic diagram of an ROC curve obtained by using the methylation entropy value and the lasso model according to an embodiment
  • FIG. 12 is a schematic diagram of an ROC curve obtained by using methylation apparent polymorphism and the xgboost model according to an embodiment
  • FIG. 13 is a schematic diagram of an ROC curve obtained by using methylation apparent polymorphism and the lasso model according to an embodiment
  • FIG. 14 is a schematic diagram of an ROC curve obtained by using methylation frequency and the xgboost model according to an embodiment
  • FIG. 15 is a schematic diagram of an ROC curve obtained by using methylation frequency and the lasso model according to an embodiment
  • FIG. 16 is a Venn diagram of the intervals ranked in the top 10 by importance in the xgboost model and the lasso model according to an embodiment
  • FIG. 17 is a Venn diagram of the intervals ranked in the top 25 by importance in the xgboost model and the lasso model according to an embodiment
  • FIG. 18 is a Venn diagram of the intervals ranked in the top 50 by importance in the xgboost model and the lasso model according to an embodiment
  • FIG. 19 is a Venn diagram of the intervals ranked in the top 100 by importance in the xgboost model and the lasso model according to an embodiment
  • FIG. 20 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model over 100 divisions using the methylation entropy value
  • FIG. 21 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model over 100 divisions using methylation apparent polymorphism
  • FIG. 22 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model over 100 divisions using methylation frequency
  • FIG. 23 is a schematic structural diagram of a detection device for a specific sequencing interval according to an embodiment
  • FIG. 24 is a schematic diagram of a json file recording important methylation region names and important arrangement forms according to an embodiment
  • FIG. 25 is a schematic diagram of the directory hierarchy of the screened-feature record files according to an embodiment
  • FIG. 26 is a schematic diagram of a json file describing how each sample is affected by a single feature according to an embodiment
  • FIG. 27 is a schematic diagram of a txt file of the influence of a single feature on the corresponding methylated region according to an embodiment
  • FIG. 28 is a schematic diagram of a computer device according to an embodiment.
  • a method for vectorized characterization of methylation level includes the following steps:
  • Test samples can be derived from specific tissues whose methylation level needs to be analyzed, such as breast cancer (BrC) tissue, or from plasma samples. For each test sample, a benign or malignant label can be identified by relevant medical methods and used in subsequent analysis and processing.
  • the acquiring methylation information of each methylation sequencing interval of the test sample includes:
  • Each methylation region (methylation sequencing interval) of each test sample contains multiple continuous or discontinuous methylation sites (the actual span along the DNA can range from 100 to 250 nt).
  • T indicates that the site has not been methylated.
  • C indicates that the site has been methylated.
  • A combination of T and C represents one methylation pattern across the methylation sites.
  • Actual sequenced test samples usually contain methylation sites for which methylation information cannot be detected. These sites are often denoted by N.
  • N can be uniformly replaced by T, which means that there is no methylation information.
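As an illustration of this encoding, the sketch below normalizes undetected sites and accumulates the number of Reads per pattern. The `(pattern, count)` input layout and the function names are assumptions made for the example, not part of the source.

```python
from collections import Counter


def normalize_read(read):
    """Replace undetected sites (N) with T, as the text describes."""
    return read.upper().replace("N", "T")


def count_reads(records):
    """Sum the number of Reads per normalized C/T pattern.

    records: iterable of (pattern, count) pairs (hypothetical input layout).
    """
    counts = Counter()
    for pattern, n in records:
        counts[normalize_read(pattern)] += n
    return counts


# Hypothetical records for a region with four methylation sites.
records = [("TTCN", 3), ("TTCT", 2), ("CCCC", 5)]
counts = count_reads(records)
```

After normalization `TTCN` and `TTCT` are the same pattern, so their counts are merged.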
  • The above-mentioned first identifier may be C and the second identifier may be T; that is, C is used to indicate the sites that are methylated in each methylation sequencing interval, and T is used to indicate the sites that are not methylated in each methylation sequencing interval.
  • S20 Determine the number of Reads of the various sequencing results in the preset reading interval according to the methylation information of each methylation sequencing interval; wherein the number of Reads is the number of occurrences of the sequencing result of the corresponding category in the methylation information of the corresponding methylation sequencing interval, and the order of the various sequencing results in the reading interval is preset.
  • The length of the reading interval can be preset, and the order of the various sequencing results in the reading interval is preset. Generally, once the order of the various sequencing results is determined, it needs to be declared in advance, and the same preset order is used in the subsequent analysis, evaluation, and recording.
  • Up to a point, a longer reading length is better: the longer the reading interval, the fewer slides are needed, so the information of the whole sequencing interval is reflected more completely instead of being fragmented across multiple slides.
  • When fixing the reading length, a balance needs to be struck between the amount of information obtained in a single reading interval and the number of combinations of methylation states of all sites under that reading length. The number of combinations cannot be too large, because it grows exponentially with the reading length and would put tremendous pressure on storage and calculation.
  • Based on tests with related samples, the reading length of the reading interval can be set to 8-10 methylation sites. All possible methylation situations in a reading interval can be sorted alphabetically, or in any other meaningful way. The chosen ordering principle should be easy to implement in various computer programming languages, and the ordering must stay consistent across methylation sequencing intervals of different lengths.
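One way to obtain an alphabetical ordering is sketched below with `itertools.product`, which yields tuples in lexicographic order when its input is sorted. The function name and the `index_of` lookup table are illustrative assumptions.

```python
from itertools import product


def ordered_combinations(length, alphabet="CT"):
    """All C/T patterns of the given length, in alphabetical order."""
    symbols = sorted(alphabet)  # product() emits lexicographic order for sorted input
    return ["".join(p) for p in product(symbols, repeat=length)]


combos = ordered_combinations(8)  # 2**8 = 256 patterns
index_of = {combo: i for i, combo in enumerate(combos)}  # pattern -> vector position
```

The list doubles with every additional site (256 entries at length 8, 1024 at length 10), which is the storage pressure the text warns about.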
  • Sliding the sliding window from the first position to the last position of the reading interval according to the sliding step length, and obtaining the number of occurrences of each sequence combination during each window reading according to the number of Reads of the various sequencing results, includes:
  • in the m-th window reading, the first position of the sliding window is set to position (s(m-1)+1) of the reading interval, and the number of occurrences of each sequence combination in the sliding window is read according to the number of Reads of the various sequencing results; where the initial value of m is 1, s represents the sliding step length, and s is less than the length of the reading interval;
  • Reading the number of occurrences of each sequence combination in the sliding window according to the number of Reads of the various sequencing results includes:
  • if the end of the sliding window exceeds the end of the reading interval, the sliding window is shortened so that its end falls at the end of the reading interval, and the number of occurrences of each sequence combination in the current sliding window is determined according to the number of Reads of the various sequencing results.
  • Specifically, in the first window reading, the first position of the sliding window can be set to the first position of the reading interval, and the number of occurrences of each sequence combination in the current sliding window is read from the number of Reads of the various sequencing results. The first position of the sliding window is then moved backward by s for the second window reading, in which the number of occurrences of each sequence combination in the current sliding window is read in the same way, and so on, until the end of the sliding window coincides with or exceeds the end of the reading interval.
  • In the latter case, the sliding window used in that window reading needs to be shortened: the start of the sliding window stays fixed while its end is moved to the end of the reading interval, so that the information in the reading interval is read completely and accurately through the successive window readings.
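The window positions described above can be sketched as follows, using 1-based inclusive positions to match the formula s(m-1)+1. The function and parameter names are hypothetical, and allowing a step equal to the window length is a small relaxation made here.

```python
def window_bounds(n_sites, window, step):
    """Return 1-based inclusive (start, end) pairs for successive window readings.

    The m-th window starts at step*(m-1)+1; the last window is shortened so
    that it ends exactly at the final site.
    """
    if not 1 <= step <= window:
        raise ValueError("step must satisfy 1 <= step <= window")
    bounds = []
    start = 1
    while True:
        end = min(start + window - 1, n_sites)  # shorten if it would overrun
        bounds.append((start, end))
        if end == n_sites:
            return bounds
        start += step


print(window_bounds(18, 8, 5))
print(window_bounds(20, 8, 5))
```

For 18 sites with a window of 8 and a step of 5, the windows are 1-8, 6-13 and 11-18; with 20 sites a fourth, shortened window 16-20 is added.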
  • S40 Determine, according to the number of occurrences of each sequence combination in each methylation sequencing interval, the number vector generated during each window reading process of each methylation sequencing interval.
  • Determining the number vector generated in each window reading according to the number of occurrences of each sequence combination includes:
  • the numbers of occurrences of the sequence combinations are arranged in combination order to obtain the number vector.
  • In each window reading, the number of occurrences of each sequence combination in the corresponding sliding window is generated.
  • The number of occurrences of the sequence combination with combination order 1 is taken as the first vector element, that of the sequence combination with combination order 2 as the second vector element, and so on.
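A short sketch of this arrangement, assuming alphabetical combination order over C and T (one of the orderings the text allows). The `window_counts` mapping and the function name are illustrative.

```python
from itertools import product


def count_vector(window_counts, length):
    """Arrange per-pattern counts in alphabetical combination order.

    window_counts maps C/T patterns seen in one window to their counts;
    patterns that never occurred contribute 0.
    """
    order = ["".join(p) for p in product("CT", repeat=length)]
    return [window_counts.get(combo, 0) for combo in order]


# Window of length 2: the combination order is CC, CT, TC, TT.
vector = count_vector({"CT": 5, "TT": 2}, 2)
```

Because the order is fixed in advance, the patterns themselves never need to be stored alongside the counts.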
  • S50 Assemble the number vectors generated during each window reading of each methylation sequencing interval into a methylation vector of each methylation sequencing interval.
  • Splicing the number vectors generated in each window reading into a methylation vector includes:
  • the number vectors generated in each window reading are connected end to end in window reading order to obtain the methylation vector.
  • For example, if the number vectors of a certain methylation interval are the first number vector 0 5 0 0 0 3 0 2 and the second number vector 0 0 0 0 0 2 3 5, the two vectors can be connected end to end in the order in which they were obtained, giving the methylation vector 0 5 0 0 0 3 0 2 0 0 0 0 0 2 3 5.
  • Each value in the numerical vector corresponds to a sequence combination of the reading interval in a given window reading, so subsequent recording and analysis do not need to repeat the methylation combination represented by each value. Compared with traditional schemes for recording methylation level, this greatly reduces the complexity of recording as well as the pressure on storage and calculation.
  • Compared with traditional methods of characterizing methylation information, a specific methylation sequencing interval is no longer simply summarized as a single value. Instead, the numerical vector can restore the abundance distribution of the different methylation patterns in the methylation sequencing interval, which makes follow-up analysis of the interval more convenient for researchers.
  • In this embodiment, the number of Reads of the various sequencing results in the preset reading interval is determined from the methylation information of each methylation sequencing interval of the test sample, and the sliding window is slid from the first position to the last position of the reading interval according to the sliding step length to obtain the number of occurrences of each sequence combination in the sliding window during each window reading.
  • The number vector generated during each window reading of each methylation sequencing interval is then determined, and these number vectors are spliced into the methylation vector of that interval, so that each vector element of the methylation vector corresponds one-to-one with a sequence combination of the methylation sequencing interval in the test sample. This fully characterizes the methylation information of each methylation sequencing interval in the test sample.
  • Each methylation sequencing interval of the test sample can thus be characterized by its methylation vector and is no longer simply summarized as a single value.
  • Bisulfite sequencing is performed with the aid of a pre-designed panel to obtain the methylation information of each site in each test sample. By setting a reading interval of an appropriate length, all possible methylation situations in a reading interval of a given methylation sequencing interval are arranged in a specific order to obtain the various sequencing results of the reading interval; the frequency of each sequencing result is then counted to obtain the number of Reads for each sequencing result.
  • BrC: breast cancer
  • A methylation sequencing method based on a pre-designed panel is applied to each methylated region of each sample.
  • The resulting information can usually be recorded in the form shown in FIG. 2:
  • The corresponding methylation region contains multiple continuous or discontinuous methylation sites (the actual span along the DNA can range from 100 to 250 nt).
  • The sequence of T and C on the left represents the methylation sites obtained by sequencing.
  • T indicates that the site has not been methylated.
  • C indicates that the site has been methylated.
  • A combination of T and C represents one methylation pattern across the methylation sites.
  • The number on the right is the frequency with which that combination occurs in the region of the sample, as measured by the methylation sequencing method.
  • Actual sequenced samples usually contain methylation sites for which no methylation information can be detected; these are denoted by N.
  • N is uniformly replaced by T, which means that there is no methylation information.
  • this example uses the above-mentioned methylation vector method to characterize the methylation level in a single region of a single sample.
  • the above methylation vector can be generated as follows:
  • Select a reading length to serve as the "reading window".
  • All possible combinations of C and T are listed and sorted for the above reading interval.
  • When the selected interval length is 8, the ordered results (sequencing results) comprise 256 cases.
  • FIG. 3 shows some of the combinations of C and T among the 256 cases.
  • An ordered dictionary can be initialized in the Python programming language, with the above 256 cases as its keys and every value initialized to 0. At the same time, an empty list is generated.
  • In the second step, after moving by the sliding step of 5, positions 6-13 are intercepted to get TTTTTTTT, TTTTTTTT, CCCTTTTT...
  • The values of the ordered dictionary are taken out in order and appended to the list, whose length is now 512. All values in the dictionary are then reset to zero.
  • In the third step, after moving by the step length of 5 again, positions 11-18 are intercepted to get TTTCCCCC, TTTCCCCC, TTTCCCCC... The values of the corresponding ordered dictionary keys are increased by 50, 120, 30... respectively. The values are then taken out in order and appended to the list, whose length is now 768. All values in the dictionary are then reset to zero.
  • The first type: the length of the selected reading interval is greater than the number of methylation sites in the methylation region. In this case, the number of methylation sites in the region is used as the reading interval length for that region, and the statistics are gathered in a single pass (no sliding is performed). For example, for a methylated region with 4 sites, the reading interval is set to 4, and the final statistical vector in the list has length 16.
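The worked example above can be put together in one short sketch. It is written under several assumptions: patterns are already normalized to C and T and all have the region's length, alphabetical combination order is used, and a shortened final window is counted over patterns of its own (shorter) length, which the source does not specify. The function name is hypothetical.

```python
from itertools import product


def methylation_vector(read_counts, window=8, step=5):
    """Build the spliced count vector for one methylation region.

    read_counts maps each C/T pattern (all of equal length) to its Reads count.
    """
    if not 1 <= step <= window:
        raise ValueError("step must satisfy 1 <= step <= window")
    n_sites = len(next(iter(read_counts)))
    window = min(window, n_sites)  # short region: single pass, no sliding
    vector = []
    start = 0  # 0-based start of the current window
    while True:
        end = min(start + window, n_sites)  # shorten the last window if needed
        # Fresh zeroed counts per window, keyed in alphabetical order
        # (dicts keep insertion order, standing in for the ordered dictionary).
        counts = {"".join(p): 0 for p in product("CT", repeat=end - start)}
        for pattern, n in read_counts.items():
            counts[pattern[start:end]] += n
        vector.extend(counts.values())
        if end == n_sites:
            return vector
        start += step


# Hypothetical 18-site region: three windows of length 8 give 3 * 256 values.
long_vec = methylation_vector({"T" * 13 + "C" * 5: 50})
# Region with only 4 sites: the window shrinks to 4 and the vector has 16 values.
short_vec = methylation_vector({"CTCT": 4, "TTTT": 1})
```

In the 18-site example the third window (positions 11-18) reads `TTTCCCCC`, matching the pattern quoted in the text, and its count lands in the third block of 256 values.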
  • The methylation level of a single methylation region of a single sample is thus represented by a numerical vector with a length in the tens or hundreds, rather than by a single value as before.
  • The nonzero values of the vector are relatively sparse, so a feature screening step is required: only the truly informative features are used for subsequent modeling and classification, which also compresses the amount of data.
  • a vectorized characterization device for methylation level including:
  • the first acquisition module 10 is configured to acquire methylation information of each methylation sequencing interval of a test sample; wherein, the test sample includes a plurality of methylation sequencing intervals;
  • the first determining module 20 is configured to determine the number of Reads of the various sequencing results in the preset reading interval according to the methylation information of each methylation sequencing interval; wherein the number of Reads is the number of occurrences of the sequencing result of the corresponding category in the methylation information of the corresponding methylation sequencing interval, and the order of the various sequencing results in the reading interval is preset;
  • the sliding module 30 is configured to slide the sliding window from the first position to the last position in the reading interval according to the sliding step length, and obtain the reading process of each window according to the number of Reads of various sequencing results. The number of occurrences of each sequence combination; wherein the sliding window slides backward according to the sliding step after reading the number of occurrences of each sequence combination;
  • the second determining module 40 is configured to determine the number vector of each methylation sequencing interval generated during each window reading process according to the number of occurrences of each sequence combination in each methylation sequencing interval;
  • the reading module 50 is used for splicing the number vectors generated during each window reading of each methylation sequencing interval into a methylation vector of each methylation sequencing interval.
  • Each module in the above-mentioned vectorized characterization device for methylation level can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The foregoing modules may be embedded in, or be independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
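  • As a minimal sketch of the module pipeline above (not the implementation of this application), the following Python code assumes that each Read is encoded as a string of '0'/'1' characters, one per site of the interval, that every Read covers the whole interval, and that the preset order of sequence combinations is ascending binary order. The names `window_starts` and `methylation_vector` are hypothetical.

```python
from collections import Counter
from itertools import product


def window_starts(region_len, window, step):
    """0-based start positions of the sliding window, first to last position."""
    return list(range(0, region_len - window + 1, step))


def methylation_vector(reads, window, step):
    """Splice the per-window count vectors into one methylation vector.

    reads: equal-length strings over {'0', '1'} (assumed encoding, one char per site).
    """
    region_len = len(reads[0])
    # Preset order of sequence combinations: ascending binary order (assumption).
    patterns = ["".join(p) for p in product("01", repeat=window)]
    vector = []
    for start in window_starts(region_len, window, step):
        # Count how many Reads show each combination inside this window.
        counts = Counter(read[start:start + window] for read in reads)
        vector.extend(counts.get(p, 0) for p in patterns)
    return vector


reads = ["0110", "0110", "1111", "0011"]
vec = methylation_vector(reads, window=2, step=1)
print(vec)
```

  • With a window of 2 and a step of 1 over 4 sites, three windows are read and each contributes 4 counts, so the spliced vector has 12 elements.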
  • a method for detecting a specific sequencing interval including:
  • the methylation vector of each methylation sequencing interval in each sample is obtained by using the vectorized characterization method of methylation level described in any of the foregoing embodiments.
  • each methylation sequencing interval with the same interval sequence is determined as a set of methylation sequencing intervals, and multiple sets of methylation sequencing intervals are obtained.
  • Each group of training vectors is input into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each evaluation model is tested against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval; wherein the interval order of each methylation sequencing interval in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals include the group of training vectors and the group of test vectors corresponding to that group of methylation sequencing intervals.
  • The samples include multiple training samples and multiple test samples; each sample (training or test) has a type label such as benign or malignant. Accordingly, the type label of the methylation vector of each methylation sequencing interval in a sample can be determined from the type label of that sample.
  • Each sample includes multiple methylation sequencing intervals, and the number of methylation sequencing intervals is the same in every sample; in actual operation, if a certain interval is not obtained through sequencing, an empty vector of the specific length (matching the corresponding ordered vector information) is generated with reference to the remaining samples, and every element of the vector is kept at 0.
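  • The zero-filling rule above can be sketched as follows. This is an illustration only; the function name `fill_missing_intervals`, the interval names, and the dictionary layout are assumptions, and the vector lengths are taken to be known from the remaining samples.

```python
def fill_missing_intervals(sample_vectors, vector_lengths):
    """Return one vector per interval, using an all-zero vector when the
    interval was not obtained through sequencing for this sample.

    sample_vectors: dict interval name -> methylation vector (may lack keys).
    vector_lengths: dict interval name -> expected length, in preset order.
    """
    return {
        name: sample_vectors.get(name, [0] * length)
        for name, length in vector_lengths.items()
    }


lengths = {"interval_1": 4, "interval_2": 3}   # hypothetical intervals
sample = {"interval_1": [2, 0, 1, 5]}          # interval_2 was not sequenced
filled = fill_missing_intervals(sample, lengths)
print(filled)
```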
  • the sequence (interval sequence) of each methylation sequencing interval in the corresponding sample is preset.
  • Each group of methylation sequencing intervals can include the methylation sequencing intervals of the training samples and the methylation sequencing intervals of the test samples; the methylation vectors of each group of methylation sequencing intervals may include the group of training vectors and the group of test vectors corresponding to that group.
  • Each group of training vectors is input into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each group of test vectors is input into the evaluation model of the corresponding methylation sequencing interval for testing, so that each evaluation model outputs the corresponding AUC (Area Under Curve) for its group of test vectors; the evaluation index of each evaluation model is determined based on the AUC.
  • the aforementioned classification models may include models such as tree-based models and regression-based models.
  • the methylation vectors of a set of methylation sequencing intervals include a set of training vectors and a set of test vectors.
  • The group of methylation vectors corresponds to one evaluation model, and both the training vectors and the test vectors in the group correspond to this evaluation model.
  • After a group of test vectors is input into its corresponding evaluation model, the evaluation model outputs the AUC for that group of test vectors, and this AUC is the evaluation index of the evaluation model;
  • when each group of test vectors is input into the evaluation model of the corresponding methylation sequencing interval, each evaluation model outputs an AUC for its group of test vectors, which determines the evaluation index of each evaluation model.
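  • For reference, the AUC that an evaluation model reports for a group of test vectors can be computed from the model's scores as sketched below. This is a plain pairwise formulation, assuming binary labels (1 for malignant, 0 for benign) and one score per test sample; the function name `auc` is hypothetical and the classification model itself is not shown.

```python
def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample receives the higher score (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present in the test vectors")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


test_labels = [0, 0, 1, 1]
test_scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical model outputs
print(auc(test_labels, test_scores))
```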
  • The samples can be divided multiple times, so that corresponding training samples and test samples are obtained in each division.
  • In each division, each group of training vectors is input into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each evaluation model is tested against the test vectors obtained in that division to obtain its AUC in that division.
  • Collecting the AUC of each evaluation model over all divisions gives each methylation sequencing interval multiple AUCs, and the evaluation index of a methylation sequencing interval can be determined from its AUCs.
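  • A minimal sketch of turning the multiple AUCs of each interval into one evaluation index, assuming the mean or the median is used as the summary. The function name `evaluation_index`, the interval names, and the AUC values are hypothetical.

```python
from statistics import mean, median


def evaluation_index(aucs_by_interval, how="median"):
    """Summarize the per-division AUCs of each interval as one number."""
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    aggregate = median if how == "median" else mean
    return {name: aggregate(values) for name, values in aucs_by_interval.items()}


# Hypothetical AUCs from three divisions of the samples.
aucs = {"interval_1": [0.9, 0.8, 1.0], "interval_2": [0.6, 0.7, 0.5]}
index = evaluation_index(aucs, how="median")
print(index)
```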
  • S63 Determine a methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold value as a specific sequencing interval.
  • Each evaluation model with an evaluation index greater than or equal to the specific threshold has a corresponding group of methylation sequencing intervals, and the intervals in each such group have the same interval order in every sample.
  • Each set of methylation sequencing intervals determined above is a specific sequencing interval.
  • the method for determining the specific threshold includes:
  • a set proportion of the maximum value among the evaluation indices is determined as the specific threshold.
  • The set proportion can be, for example, 80%, in which case 80% of the maximum evaluation index is used as the specific threshold.
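  • The threshold rule above can be sketched as follows, assuming the evaluation indices are held in a dictionary keyed by interval name. The function name `select_specific_intervals` and the example values are hypothetical.

```python
def select_specific_intervals(evaluation_indices, set_ratio=0.8):
    """Keep the intervals whose evaluation index reaches set_ratio times
    the maximum evaluation index."""
    threshold = set_ratio * max(evaluation_indices.values())
    selected = [
        name for name, value in evaluation_indices.items() if value >= threshold
    ]
    return selected, threshold


indices = {"interval_1": 0.98, "interval_2": 0.80, "interval_3": 0.70}
selected, threshold = select_specific_intervals(indices, set_ratio=0.8)
print(selected, round(threshold, 3))
```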
  • In a specific sequencing interval, the differences between samples of different types are far greater than the differences within samples of the same type. These specific sequencing intervals are therefore good indicators for distinguishing the various types of samples; for example, when the sample types are benign and malignant, the difference between benign and malignant samples in a specific sequencing interval is far greater than the difference within benign samples or within malignant samples, so the interval is a good indicator for distinguishing benign from malignant samples.
  • the specific sequencing interval can more clearly characterize the type of the corresponding sequence or sample, that is, the methylation vector corresponding to the specific sequencing interval can make a greater contribution to the judgment of the type of the sequence in the specific sequencing interval.
  • the genome and other samples of this embodiment include many methylation sequencing intervals (for example, the number is 10,000), and each methylation sequencing interval has a corresponding interval sequence in the corresponding sample, and the sequence of the intervals in the samples is the same.
  • The methylation sequencing intervals with the same interval order form a group of methylation sequencing intervals: the first methylation sequencing interval in the first sample, the first methylation sequencing interval in the second sample, ..., up to the first methylation sequencing interval in the last sample form one group; the second methylation sequencing interval in the first sample, the second methylation sequencing interval in the second sample, ..., up to the second methylation sequencing interval in the last sample form another group, and so on.
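  • The grouping by interval order amounts to a transpose of the per-sample lists, as sketched below. It assumes every sample stores its methylation vectors in the same preset interval order; the function name `group_by_interval` and the placeholder values are hypothetical.

```python
def group_by_interval(samples):
    """samples: one list per sample, holding that sample's methylation
    vectors in the preset interval order. Returns one group per interval."""
    expected = len(samples[0])
    if any(len(sample) != expected for sample in samples):
        raise ValueError("every sample must have the same number of intervals")
    return [list(group) for group in zip(*samples)]


# Placeholders stand in for methylation vectors: a1 = sample a, interval 1.
samples = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
groups = group_by_interval(samples)
print(groups)
```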
  • Each sample uses all the above-mentioned methylation sequencing intervals for sequencing and extracting its own information.
  • To determine the specific sequencing intervals, the samples are divided into a training set and a test set, and the vector (methylation vector) of each interval (methylation sequencing interval) is used as the feature; modeling is performed independently, one interval at a time.
  • Different samples are compared based on the same interval, so when the sample includes 10,000 methylation sequencing intervals, there are a total of 10,000 evaluation models.
  • The AUC here is in effect the evaluation index of the corresponding methylation sequencing interval, that is, the average or median of the AUCs estimated for each interval over multiple divisions (which must meet at least one threshold).
  • the optimal interval (set) of these 10,000 intervals can be obtained, and the optimal interval (set) is the interval corresponding to the maximum AUC.
  • The detection method of the specific sequencing interval described above obtains the methylation vector of each methylation sequencing interval in each sample, divides the samples into training samples and test samples, determines the methylation vector of each methylation sequencing interval in the training samples as a training vector and the methylation vector of each methylation sequencing interval in the test samples as a test vector, and determines the methylation sequencing intervals with the same interval order as a group of methylation sequencing intervals, so that multiple groups of methylation sequencing intervals are obtained. Each group of training vectors is input into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each evaluation model is tested against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval. A methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold is determined as a specific sequencing interval, so that in each group of specific sequencing intervals the difference between samples of different types is much greater than the difference within samples of the same type.
  • these specific sequencing intervals can be a good indicator for distinguishing various types of samples; for example, when the sample types include benign and malignant samples, the difference between benign and malignant samples in the specific sequencing interval is much greater than that of benign samples Internal differences or internal differences in malignant samples can be a good indicator for distinguishing between benign and malignant samples.
  • the specific sequencing interval can more clearly characterize the type of the corresponding sequence or sample, that is, the methylation vector corresponding to the specific sequencing interval can make a greater contribution to the judgment of the sequence type in the specific sequencing interval.
  • dividing the samples into training samples and test samples, determining the methylation vector of each methylation sequencing interval in the training samples as a training vector, determining the methylation vector of each methylation sequencing interval in the test samples as a test vector, determining the methylation sequencing intervals with the same interval order as a group of methylation sequencing intervals to obtain multiple groups of methylation sequencing intervals, inputting each group of training vectors into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and testing each evaluation model against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval, includes:
  • in each division of the samples, the methylation vector of each methylation sequencing interval in the training samples is determined as a training vector, the methylation vector of each methylation sequencing interval in the test samples is determined as a test vector, the methylation sequencing intervals with the same interval order are determined as a group of methylation sequencing intervals so that multiple groups are obtained, each group of training vectors is input into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each evaluation model is tested against the corresponding group of test vectors to obtain the AUC of each evaluation model in that division;
  • The samples are divided multiple times (for example, 100 times or more). The training samples obtained from each division are used to train the evaluation model of each group of methylation sequencing intervals for that division, and each evaluation model is tested against the corresponding group of test vectors to obtain its AUC for that division. After every division and the corresponding training and testing, multiple AUCs are available for each methylation sequencing interval; determining the evaluation index of each methylation sequencing interval from these AUCs can effectively reduce the related errors in the evaluation index.
  • Across divisions, the division ratio may be the same or may differ; the divisions themselves differ from one another (that is, each division yields different training samples and test samples).
  • For example, in one division the ratio of training samples to test samples can be 6:4, and in another division it can be 7:3, and so on.
  • the corresponding numbers are 300 and 200 respectively.
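  • A minimal sketch of one random division, assuming a simple shuffle without stratification by type label (the text does not say whether the split is stratified). The total of 500 samples is an assumption chosen so that a 6:4 ratio gives 300 and 200; the function name `split_samples` is hypothetical.

```python
import random


def split_samples(sample_ids, train_fraction, rng):
    """Shuffle the sample identifiers and cut them at the given fraction."""
    ids = list(sample_ids)
    rng.shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
all_ids = range(500)                   # assumed total number of samples
train_a, test_a = split_samples(all_ids, 0.6, rng)   # a 6:4 division
train_b, test_b = split_samples(all_ids, 0.7, rng)   # a 7:3 division
print(len(train_a), len(test_a), len(train_b), len(test_b))
```

  • Calling `split_samples` repeatedly with the same `rng` yields a different division each time, which is what the repeated-division procedure relies on.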
  • the detection method of the above-mentioned specific sequencing interval further includes:
  • the positions or combinations of positions corresponding to each target element in the corresponding specific sequencing interval are detected, and the detected positions or combinations of positions are determined as the methylation sites or combinations of methylation sites in the corresponding specific sequencing intervals.
  • Each element in the methylation vector corresponds to a site or a combination of sites in the methylation sequencing interval. According to the sequence of the various sequencing results in the reading interval, the length of the sliding window, and the sliding step length, the positions or combinations of positions corresponding to each element in the methylation vector can be respectively located.
  • the aforementioned vector elements with specific contribution degrees are the more important vector elements in the corresponding specific sequencing interval.
  • The vector elements with specific contributions in the methylation vector of a specific sequencing interval correspond to sites or combinations of sites in that specific sequencing interval, and these often carry more important information. Determining these sites or combinations of sites as the methylation sites or combinations of methylation sites of the corresponding specific sequencing interval allows the methylation sites or combinations of methylation sites in each specific sequencing interval to be recorded and analyzed more accurately, and helps determine the effect of the detected methylation sites or combinations of methylation sites on the corresponding specific sequencing interval.
  • searching the methylation vectors of each specific sequencing interval for vector elements with a specific degree of contribution to training the corresponding evaluation model, to obtain multiple target elements, includes:
  • obtaining the importance ranking parameter output by each specific model for each vector element in a specific vector, so that multiple importance ranking parameters are obtained for each vector element;
  • the specific model is an evaluation model corresponding to a specific sequencing interval;
  • the specific vector is a methylation vector corresponding to a specific sequencing interval;
  • the importance ranking parameters of each vector element are added together to obtain the sum for each vector element, and the vector elements whose sums rank before a set position are determined as vector elements with a specific contribution.
  • The set position can be chosen according to the total number of sums, for example, 30% of the total number of sums.
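  • The rank-sum rule above can be sketched as follows. It assumes that rank 1 denotes the most important element in a division, so that a smaller sum means a more important element; the function name `rank_sum_selection` and the example rankings are hypothetical.

```python
def rank_sum_selection(rankings, top_fraction=0.3):
    """rankings: one dict per division, mapping element number -> rank
    (1 = most important). Returns the selected elements and all sums."""
    sums = {}
    for ranking in rankings:
        for element, rank in ranking.items():
            sums[element] = sums.get(element, 0) + rank
    # Smallest sum first; ties are broken by element number.
    ordered = sorted(sums, key=lambda element: (sums[element], element))
    keep = max(1, int(len(ordered) * top_fraction))
    return ordered[:keep], sums


rankings = [
    {1: 1, 2: 2, 3: 3, 4: 4},
    {1: 2, 2: 1, 3: 4, 4: 3},
    {1: 1, 2: 3, 3: 2, 4: 4},
]
selected, sums = rank_sum_selection(rankings, top_fraction=0.5)
print(selected, sums)
```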
  • searching the methylation vectors of each specific sequencing interval for vector elements with a specific degree of contribution to training the corresponding evaluation model, to obtain multiple target elements, includes:
  • in each division process, obtaining the importance score output by each specific model for each vector element in a specific vector, so that multiple importance scores are obtained for each vector element;
  • the specific model is an evaluation model corresponding to a specific sequencing interval;
  • the specific vector is the methylation vector corresponding to the specific sequencing interval;
  • the vector elements for which the number of non-zero importance scores is greater than or equal to the number threshold are determined as vector elements with a specific contribution.
  • the above-mentioned number threshold can be set according to the number of sample division times, for example, set to a value equivalent to 20% of the number of sample division times.
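  • A minimal sketch of the non-zero-count rule, assuming each division yields a dictionary of importance scores keyed by element number and that the number threshold is a fraction of the number of divisions. The function name `nonzero_count_selection` and the scores are hypothetical.

```python
def nonzero_count_selection(scores_per_division, min_fraction=0.2):
    """Count, per vector element, the divisions in which its importance
    score is non-zero, and keep the elements that reach the threshold."""
    threshold = min_fraction * len(scores_per_division)
    counts = {}
    for scores in scores_per_division:
        for element, score in scores.items():
            counts[element] = counts.get(element, 0) + (score != 0)
    return {element: n for element, n in counts.items() if n >= threshold}


# Five hypothetical divisions; element 3 never receives a non-zero score.
scores = [
    {1: 0.4, 2: 0.0, 3: 0.0},
    {1: 0.3, 2: 0.1, 3: 0.0},
    {1: 0.5, 2: 0.0, 3: 0.0},
    {1: 0.2, 2: 0.0, 3: 0.0},
    {1: 0.6, 2: 0.0, 3: 0.0},
]
important = nonzero_count_selection(scores, min_fraction=0.2)
print(important)
```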
  • The AUC measured on the test set characterizes the generalization ability of the corresponding evaluation model, while modeling on the training set (for example, on the training vectors of the corresponding methylation vectors) yields the importance ranking of the vector elements in the corresponding methylation vector (for example, the importance ranking parameter and importance score of each vector element).
  • This process evaluates which elements contribute relatively more to the corresponding evaluation model. Since the above example uses the ordered methylation vector as the feature for modeling, the important methylation sites (or combinations of methylation sites) can be located from the importance ranking of the features.
  • Each methylation sequencing interval has an evaluation model in each division, with a corresponding set of feature importances, so the rankings of these features across the 100 divisions need to be considered together.
  • the vector elements can be ranked correspondingly using the methods provided in the above two examples to obtain relatively important vector elements.
  • the first example method includes: across the 100 division schemes, summing the ranking (importance ranking parameter) of each feature; the feature with the smallest sum is defined as the most important feature across the 100 divisions, the feature with the second smallest sum is defined as the second most important feature, and so on;
  • the second example method includes: in each division, after training and testing, the importance scores of the different features are obtained from the evaluation model and used to define the important features. Assuming that there are 100 division schemes, a feature whose importance score is non-zero in at least 20 divisions can be counted as an important feature.
  • These important features are ranked as follows: a feature whose importance is non-zero in all 100 divisions ranks first, one that is non-zero in 99 divisions ranks second, and so on; rankings can be tied. In this way, the relatively important vector elements in the methylation vector of the specific sequencing interval are determined, so as to locate the important methylation sites (or combinations of methylation sites).
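  • The tied ranking described above can be sketched as follows. The sketch assumes a dense ranking, in which the next distinct count always receives the next rank; the text could also be read as assigning rank 2 strictly to a count of 99, so this choice is an assumption. The function name `rank_by_nonzero_count` is hypothetical.

```python
def rank_by_nonzero_count(counts):
    """counts: element number -> number of divisions with a non-zero
    importance score. Elements with equal counts share a rank."""
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct)}
    return {element: rank_of_count[count] for element, count in counts.items()}


counts = {32: 100, 56: 99, 64: 100, 80: 40}   # hypothetical counts
ranks = rank_by_nonzero_count(counts)
print(ranks)
```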
  • The length of each calculated vector is the same across different samples. Assuming there are a total of m samples, and each sample has n features in a specific interval, X_ij represents the j-th feature of sample i, and Y_i represents the attribute label of sample i (for example, benign or malignant).
  • Feature screening can include the following steps.
  • Step 1: Determine the methylation intervals (specific sequencing intervals) in which the important features are located.
  • Each methylation interval is modeled on the training set and evaluated on the test set.
  • The evaluation index is the AUC (Area Under Curve); its average or median over the test sets is taken as the representative value, and the intervals are sorted by it. A specific threshold is selected, and the intervals with the highest overall AUC level are chosen.
  • Step 2: Determine the positions of the important features in the selected methylation intervals.
  • For the intervals with the highest overall AUC level, return to the models fitted on the training set and select the most important features in each interval according to the importance measure of the respective model. For example, for a methylated region with a length of 18, if a reading interval of 8 and an interval overlap of 3 are selected (corresponding to a sliding step of 5), a numerical vector with a length of 768 is finally obtained (consistent with three window readings of 2^8 = 256 elements each).
  • The degree to which features No. 1-768 in this vector contribute to the model can be learned from the process of building the model on the training set.
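  • The arithmetic behind the length of 768, and the way a feature number can be traced back to sites, can be sketched as follows. The sketch assumes that each window reading contributes 2^window elements in ascending binary order and that window readings are spliced in reading order; both the ordering and the names `vector_length` and `decode_feature` are assumptions.

```python
def vector_length(region_len, window, step):
    """Number of window readings times the number of combinations per reading."""
    readings = (region_len - window) // step + 1
    return readings * 2 ** window


def decode_feature(number, window=8, step=5):
    """Map a 1-based feature number to (window index, 1-based sites, pattern)."""
    per_window = 2 ** window
    window_index, pattern_index = divmod(number - 1, per_window)
    first_site = window_index * step + 1
    sites = list(range(first_site, first_site + window))
    pattern = format(pattern_index, f"0{window}b")   # assumed binary ordering
    return window_index, sites, pattern


print(vector_length(18, 8, 5))
print(decode_feature(257))
```

  • Under these assumptions, feature No. 257 is the first element of the second window reading, which covers sites 6 to 13.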
  • the final result of the two-step selection will obtain several important methylation regions (specific sequencing intervals) and several important features (methylation sites or combinations of methylation sites) inside these methylation regions.
  • the highest average AUC obtained with the xgboost model is 0.981 (standard deviation 0.012), and the highest median AUC is 0.983 (IQR 0.015);
  • the highest average AUC obtained with the lasso model is 0.977 (standard deviation 0.012), and the highest median AUC is 0.976 (IQR 0.016).
  • The four highest AUC averages and medians above all belong to the same methylation detection interval, which shows that the optimal methylation detection intervals found by the tree-based model and the regression-based model are consistent.
  • the above-mentioned optimal methylation detection interval has a length of 13.
  • 16 features are recognized by both models (numbers 32, 56, 64, 80, 95, 96, 112, 144, 158, 160, 176, 188, 208, 240, 256, 492).
  • the influence of these 16 features on the optimal methylation interval can be represented by the following vector:
  • the above label only contains 0 and 1.
  • The length of the digit string equals the number of methylation sites in the methylation region; 0 indicates that the feature does not require methylation at that site, and 1 indicates that methylation is required.
  • For comparison, this example also uses several methylation characterization methods reported in the literature: methylation entropy, methylation epi-polymorphism, and methylation frequency.
  • For each sample, the corresponding methylation characterization value (range 0-1) was calculated for each of the 11939 methylation detection intervals.
  • The 257 benign and malignant breast cancer tissue samples thus give an observations × variables matrix of 257 × 11939.
  • the results obtained are as follows:
  • Methylation entropy: the mean AUC of the xgboost model is 0.970, with a standard deviation of 0.024; the mean AUC of the lasso model is 0.984, with a standard deviation of 0.012.
  • For the corresponding ROC curves, refer to Figure 10 and Figure 11.
  • Methylation epi-polymorphism: the mean AUC of the xgboost model is 0.974, with a standard deviation of 0.018; the mean AUC of the lasso model is 0.986, with a standard deviation of 0.009.
  • For the corresponding ROC curves, refer to Figure 12 and Figure 13.
  • Methylation frequency: the mean AUC of the xgboost model is 0.972, with a standard deviation of 0.021; the mean AUC of the lasso model is 0.969, with a standard deviation of 0.017.
  • For the corresponding ROC curves, refer to Figure 14 and Figure 15.
  • The above evaluation results are close, in both mean AUC and standard deviation, to those of the detection method of the specific sequencing interval provided in this embodiment.
  • However, the detection method of the specific sequencing interval provided in this embodiment uses only a single optimal methylation interval, and with it achieves the effect that the traditional methods obtain by modeling on the entire methylation panel.
  • In addition, the corresponding ROC curves show that, in terms of the stability of the ROC curve across different divisions, the algorithm is significantly better than the traditional methylation characterization methods.
  • The detection method of the specific sequencing interval can sort the intervals in descending order of the median AUC obtained over 100 divisions, and obtain the top 10, 25, 50, and 100 important intervals for the xgboost model and the lasso model respectively. Intersecting the intervals of the two models gives a Venn diagram for each case; refer to Figure 16, Figure 17, Figure 18, and Figure 19. The intersections account for 25%, 19%, 19%, and 18% of the union, respectively.
  • For the three traditional methylation characterization methods (methylation entropy, methylation epi-polymorphism, and methylation frequency), the important intervals obtained by the two machine learning models are likewise intersected for each characterization method, giving the Venn diagrams shown in Figure 20, Figure 21, and Figure 22. The intersections account for 8%, 8%, and 7% of the union, respectively.
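  • The percentage of the union covered by the intersection can be computed as sketched below. The function names `top_k` and `overlap_percent`, the interval names, and the median AUC values are hypothetical and do not reproduce the figures reported above.

```python
def top_k(median_auc_by_interval, k):
    """Interval names sorted by descending median AUC, truncated to k."""
    ordered = sorted(
        median_auc_by_interval, key=median_auc_by_interval.get, reverse=True
    )
    return ordered[:k]


def overlap_percent(a, b):
    """Size of the intersection as a percentage of the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


# Hypothetical median AUCs of four intervals under two models.
model_one = {"i1": 0.98, "i2": 0.95, "i3": 0.90, "i4": 0.70}
model_two = {"i1": 0.97, "i2": 0.80, "i3": 0.75, "i4": 0.96}
share = overlap_percent(top_k(model_one, 2), top_k(model_two, 2))
print(round(share, 1))
```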
  • a detection device for a specific sequencing interval including:
  • the second obtaining module 61 is configured to obtain the methylation vector of each methylation sequencing interval in each sample by using the vectorized characterization device for methylation level described in any of the foregoing embodiments;
  • the dividing module 62 is configured to divide the samples into training samples and test samples, determine the methylation vector of each methylation sequencing interval in the training samples as a training vector, determine the methylation vector of each methylation sequencing interval in the test samples as a test vector, determine the methylation sequencing intervals with the same interval order as a group of methylation sequencing intervals so that multiple groups of methylation sequencing intervals are obtained, input each group of training vectors into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and test each evaluation model against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval; wherein the interval order of each methylation sequencing interval in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals include the group of training vectors and the group of test vectors corresponding to that group of methylation sequencing intervals;
  • the third determining module 63 is configured to determine a methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold value as a specific sequencing interval.
  • Each module in the detection device for the above-mentioned specific sequencing interval can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The foregoing modules may be embedded in, or be independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a method for recording a specific sequencing interval including:
  • the specific sequencing intervals of each sample are obtained using the detection method of the specific sequencing interval described in any of the above embodiments, and a dictionary corresponding to each specific sequencing interval is set, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of the corresponding specific sequencing interval; each dictionary is recorded in a specific format so as to record each specific sequencing interval.
  • the above-mentioned specific format may be a json format.
  • Recording each dictionary in the json format makes it convenient for the relevant personnel to read and write, easy for machines to parse and generate, and can effectively improve the efficiency of network transmission.
  • In this method, a dictionary corresponding to each specific sequencing interval is set, so that the key of each dictionary represents the corresponding specific sequencing interval and the value represents the methylation vector of that interval, and each dictionary is recorded in a specific format such as json, which is easy for people to read and write and for machines to parse and generate. This can further improve the efficiency of the subsequent analysis of each specific sequencing interval.
  • the specific sequencing interval (important methylation region) of each sample and the features (such as methylation site or methylation site combination) obtained by screening can be divided into two parts for recording.
  • The important methylation region in Figure 24 is a specific sequencing interval and serves as the key of each dictionary; each dictionary characterizes one important methylation region, and the value of the dictionary can also record the important arrangement forms of the corresponding important methylation region. An important arrangement form may include the arrangement form of the sites or combinations of sites it contains.
  • the file shown in Figure 24 is in the json format and records the content of the dictionary, where the keys of the dictionary are important methylation areas, and the value of the dictionary is a list, which contains important arrangement forms in the methylation areas.
  • For example, if a methylated region of length 18 yields a numeric vector with a length of 768, then the key of the dictionary is the name of the methylated region of length 18, and the value of the dictionary lists the names of the features in the 768-element numeric vector that make outstanding contributions.
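  • A minimal sketch of such a json record, using Python's standard `json` module. The region name `region_A` and the feature names are hypothetical placeholders, and the exact layout of the file in Figure 24 is not reproduced here.

```python
import json


def dump_interval_record(record):
    """Serialize {region name: [important feature names]} as json text."""
    return json.dumps(record, indent=2, sort_keys=True)


# Key: name of an important methylation region (hypothetical).
# Value: list of the features with outstanding contributions (hypothetical).
record = {"region_A": ["feature_32", "feature_56", "feature_64"]}
text = dump_interval_record(record)
restored = json.loads(text)
print(text)
```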
  • Figure 25 is a preferred file information storage method.
  • the above json file is placed in the root directory together with a parent folder containing more detailed information.
  • the level 1 subfolder is named after the important methylation area, and the level 2 subfolder is numbered with the number corresponding to the important features in the methylation area. In other words, the number reflects the order of the feature in 1-768.
  • The level 2 folder contains two files: the first is a json file, which records the impact of this feature on each sample; the second is a txt file, which records the overall impact of the feature on the methylated region.
  • Figure 26 shows the format of the json file in the level 2 subfolder.
  • This file records the contents of the dictionary.
  • The dictionary key is the name of the sample used to evaluate the impact, and the dictionary value is a list with four parts: the affected Reads sequence; the location coordinates of the impact; the sequence obtained by cutting the Read at those coordinates (in theory consistent with the important feature); and the measured count of that Read in the sample. If only one Read is affected, the list is a single-level list; if more than one is affected, it is a nested list; if no Reads are affected, an empty list "[]" is kept as a placeholder.
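  • The three list shapes described above (empty, single-level, nested) can be sketched as follows. The coordinate convention (0-based start, exclusive end), the Read strings, and the function name `sample_entry` are assumptions made for illustration; Figure 26 may use a different convention.

```python
import json


def sample_entry(affected_reads):
    """affected_reads: list of (read sequence, (start, end), count).
    Returns [] for no Reads, one flat list for one Read, else a nested list."""
    rows = [
        [sequence, [start, end], sequence[start:end], count]
        for sequence, (start, end), count in affected_reads
    ]
    if not rows:
        return []
    return rows[0] if len(rows) == 1 else rows


per_sample = {
    "sample_1": sample_entry([("ACGTACGT", (2, 5), 7)]),
    "sample_2": sample_entry([("ACGTACGT", (2, 5), 3), ("TTGTATT", (1, 4), 1)]),
    "sample_3": sample_entry([]),
}
print(json.dumps(per_sample, indent=2))
```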
  • Figure 27 shows the format of the txt file in the level-2 subfolder. This file records the influence of one feature in the methylation region on the interval as a whole. The file contains only 0 and 1; the number of digits equals the number of methylation sites in the methylation region, 0 indicates that the feature regards the site as not needing methylation, and 1 indicates that methylation is needed. Again taking a methylation region of length 18 with a reading interval of 8 and an interval overlap of 3, Figure 27 shows three examples: the first indicates that one feature emphasizes the importance of sites 2, 3, 6, and 9 of the region; the second that another feature emphasizes sites 6, 8, 9, 10, 11, and 13; and the third that a third feature emphasizes sites 11, 12, 14, 15, and 17 (the three examples are stored in three separate files).
  • Assuming the methylation region has exactly these three important features, the method considers that sites 2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 15, and 17 of the whole region need to be methylated, and that sites 6, 9, and 11 carry a higher weight (for example, a weight of 2, with the remaining sites set to 1).
  • This example provides a recording method for specific sequencing intervals. The JSON file at the same level as the parent folder gives the names of the features that contribute most. The JSON file in a level-2 subfolder gives the affected Reads sequence, the position coordinates of the impact, the sequence obtained by cutting the Reads at those coordinates, and the measured count of that Read in the sample. The txt file in a level-2 subfolder gives the influence of one feature in the methylation region on the interval as a whole. With this information, the strengths and weaknesses of the corresponding methylation sites, and potential problems in subsequent experiments, can be evaluated in detail and from every angle.
  • A recording device for specific sequencing intervals, including:
  • a second acquisition module, used to obtain the specific sequencing intervals of each sample with the detection device for specific sequencing intervals described in any of the above embodiments, and to set a dictionary for each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that interval; each dictionary is recorded in a specific format so as to record each specific sequencing interval.
  • Each module of the above recording device for specific sequencing intervals may be implemented wholly or partly in software, hardware, or a combination of the two.
  • The modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
  • A computer device is provided.
  • The computer device may be a terminal, and its internal structure may be as shown in FIG. 28.
  • The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
  • The processor of the computer device provides computation and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system and a computer program.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The network interface of the computer device is used to communicate with an external terminal over a network connection.
  • When executed by the processor, the computer program implements a vectorized characterization method for methylation levels, a detection method for specific sequencing intervals, or a recording method for specific sequencing intervals.
  • The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 28 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer parts than shown in the figure, combine some parts, or arrange the parts differently.
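  • A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the vectorized characterization method for methylation levels, the detection method for specific sequencing intervals, or the recording method for specific sequencing intervals of any of the above embodiments.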
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

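A vectorized characterization of methylation levels, a detection method for specific sequencing intervals, a recording method for specific sequencing intervals, and the corresponding devices, computer equipment, and readable storage media. In the vectorized characterization method and device, the methylation information of a sample is obtained and the number of Reads of each type of sequencing result in the reading interval is determined; a sliding window is slid over the reading interval by a sliding step to obtain the number of occurrences of each sequence combination of the sliding window in each window read; the count vector of each methylation sequencing interval in each window read is then determined, and the count vectors are concatenated into the methylation vector of the corresponding methylation sequencing interval. Each element of the resulting methylation vector corresponds one-to-one to a sequence combination of a methylation sequencing interval in the test sample, so the methylation information of each methylation sequencing interval in the test sample can be characterized comprehensively.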

Description

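Vectorized characterization of methylation levels, and method and device for detecting specific sequencing intervals
Technical Field
This application relates to the technical field of genetic information processing, and in particular to a vectorized characterization of methylation levels, the detection of specific sequencing intervals, and a recording method for specific sequencing intervals, together with the corresponding devices, computer equipment, and readable storage media.
Background
In medicine, early diagnosis and early screening of cancer are of great significance for curing it. A representative technical route for early prevention is to take samples such as plasma, perform methylation sequencing with a pre-designed panel, and use machine learning to separate benign from malignant samples effectively. Such traditional methods usually represent a given methylation region on the panel by a single computed value (for example, "methylation entropy"). That value reflects only the overall methylation level of a single region and easily loses a large amount of the information it contains, so the methylation level it characterizes is one-sided.
Summary
In view of the above technical problems, it is necessary to provide a vectorized characterization that reflects the overall methylation level of a sample, a detection method for specific sequencing intervals, a recording method for specific sequencing intervals, and the corresponding devices, computer equipment, and readable storage media.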
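A vectorized characterization method for methylation levels includes:
S10, obtaining the methylation information of each methylation sequencing interval of a test sample, where the test sample includes multiple methylation sequencing intervals;
S20, determining, from the methylation information of each methylation sequencing interval, the number of Reads of each type of sequencing result in a preset reading interval, where the number of Reads is the number of occurrences of the corresponding type of sequencing result in the methylation information of the corresponding methylation sequencing interval, and the order of the types of sequencing results in the reading interval is preset;
S30, sliding a sliding window over the reading interval from the first site to the last site by a sliding step, and obtaining, from the number of Reads of each type of sequencing result, the number of occurrences of each sequence combination of the sliding window in each window read, where the sliding window slides onward by the sliding step after reading the number of occurrences of each sequence combination;
S40, determining, from the number of occurrences of each sequence combination in each methylation sequencing interval, the count vector produced by each methylation sequencing interval in each window read;
S50, concatenating the count vectors produced by each methylation sequencing interval in the successive window reads into the methylation vector of that methylation sequencing interval.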
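A vectorized characterization device for methylation levels includes:
a first acquisition module, configured to obtain the methylation information of each methylation sequencing interval of a test sample, where the test sample includes multiple methylation sequencing intervals;
a first determination module, configured to determine, from the methylation information of each methylation sequencing interval, the number of Reads of each type of sequencing result in a preset reading interval, where the number of Reads is the number of occurrences of the corresponding type of sequencing result in the methylation information of the corresponding methylation sequencing interval, and the order of the types of sequencing results in the reading interval is preset;
a sliding module, configured to slide a sliding window over the reading interval from the first site to the last site by a sliding step and to obtain, from the number of Reads of each type of sequencing result, the number of occurrences of each sequence combination of the sliding window in each window read, where the sliding window slides onward by the sliding step after reading the number of occurrences of each sequence combination;
a second determination module, configured to determine, from the number of occurrences of each sequence combination in each methylation sequencing interval, the count vector produced by each methylation sequencing interval in each window read;
a reading module, configured to concatenate the count vectors produced by each methylation sequencing interval in the successive window reads into the methylation vector of that methylation sequencing interval.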
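With the above vectorized characterization method and device, the methylation information of each methylation sequencing interval of the test sample is obtained, the number of Reads of each type of sequencing result in the preset reading interval is determined, and the sliding window is slid over the reading interval from the first site to the last site by the sliding step, so that the number of occurrences of each sequence combination of the sliding window in each window read is obtained from the Reads counts. The count vectors produced in the successive window reads are then determined and concatenated into the methylation vector of each methylation sequencing interval. Each element of the methylation vector corresponds one-to-one to a sequence combination of a methylation sequencing interval in the test sample, so the methylation information of each interval can be characterized comprehensively. Specifically, every methylation sequencing interval of the test sample is characterized by a methylation vector rather than reduced to a single value, which makes it possible to recover the abundance distribution of the different methylation sites in each interval and makes subsequent analysis of the interval easier for researchers. Subsequent recording and analysis need not repeatedly record the methylation combination that each vector element stands for, so, compared with traditional methods of recording methylation levels, the recording is much less complex and the storage and computation load of the analysis is greatly reduced.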
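A detection method for specific sequencing intervals includes:
obtaining the methylation vector of each methylation sequencing interval in each sample with the above vectorized characterization method for methylation levels;
dividing the samples into training samples and test samples; taking the methylation vectors of the methylation sequencing intervals in the training samples as training vectors and those in the test samples as test vectors; treating the methylation sequencing intervals that share the same interval order as one group, giving multiple groups of methylation sequencing intervals; feeding each group of training vectors into a classification model for training to obtain an evaluation model for each group; and testing each evaluation model on its corresponding group of test vectors to determine the evaluation metric of each methylation sequencing interval; where the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group;
determining the methylation sequencing intervals whose evaluation metric is greater than or equal to a specific threshold as specific sequencing intervals.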
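A detection device for specific sequencing intervals includes:
a second acquisition module, configured to obtain the methylation vector of each methylation sequencing interval in each sample with the above vectorized characterization device for methylation levels;
a division module, configured to divide the samples into training samples and test samples; take the methylation vectors of the methylation sequencing intervals in the training samples as training vectors and those in the test samples as test vectors; treat the methylation sequencing intervals that share the same interval order as one group, giving multiple groups of methylation sequencing intervals; feed each group of training vectors into a classification model for training to obtain an evaluation model for each group; and test each evaluation model on its corresponding group of test vectors to determine the evaluation metric of each methylation sequencing interval; where the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group;
a third determination module, configured to determine the methylation sequencing intervals whose evaluation metric is greater than or equal to a specific threshold as specific sequencing intervals.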
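With the above detection method and device, the methylation vector of each methylation sequencing interval in each sample is obtained, the samples are divided into training and test samples, the intervals with the same interval order are grouped, an evaluation model is trained for each group on its training vectors and tested on the corresponding group of test vectors to determine the evaluation metric of each interval, and the intervals whose evaluation metric is greater than or equal to the specific threshold are determined as specific sequencing intervals. In the specific sequencing intervals detected this way, the differences between samples of different types are far greater than the differences within samples of the same type, so these intervals can serve as good indicators for distinguishing sample types. For example, when the sample types are benign and malignant, the difference between benign and malignant samples in a specific sequencing interval is far greater than the difference within benign samples or within malignant samples, so the interval is a good indicator for separating benign from malignant samples. A specific sequencing interval therefore characterizes the type of the corresponding sequence or sample fairly clearly; that is, the methylation vector of a specific sequencing interval contributes more to judging the type of the sequences in that interval.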
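A recording method for specific sequencing intervals includes:
obtaining the specific sequencing intervals of each sample with the above detection method for specific sequencing intervals, and setting a dictionary for each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that interval; the dictionaries are recorded in a specific format so as to record each specific sequencing interval.
A recording device for specific sequencing intervals includes:
a second acquisition module, configured to obtain the specific sequencing intervals of each sample with the above detection device for specific sequencing intervals, and to set a dictionary for each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that interval; the dictionaries are recorded in a specific format so as to record each specific sequencing interval.
The above recording method and device set a dictionary for each specific sequencing interval, with the key representing the interval and the value representing its methylation vector, and record the dictionaries in a specific format such as JSON that is convenient for the people handling the data and easy for machines to parse and generate. This further improves the efficiency of subsequent analysis of the specific sequencing intervals.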
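Brief Description of the Drawings
Figure 1 is a flowchart of the vectorized characterization method for methylation levels in one embodiment;
Figure 2 is a schematic diagram of how the information of a methylation region is recorded in one embodiment;
Figure 3 is a schematic diagram of some of the C and T combinations in a reading interval of length 8 in one embodiment;
Figure 4 is a schematic diagram of the methylation region captured at each movement step in one embodiment;
Figure 5 is a schematic structural diagram of the vectorized characterization device for methylation levels in one embodiment;
Figure 6 is a flowchart of the detection method for specific sequencing intervals in one embodiment;
Figure 7 is a schematic diagram of the methylation vectors and the corresponding label data in one embodiment;
Figure 8 is a schematic diagram of the ROC curves for the tree-based model in one embodiment;
Figure 9 is a schematic diagram of the ROC curves for the regression-based model in one embodiment;
Figure 10 is a schematic diagram of the ROC curves obtained with methylation entropy and the xgboost model in one embodiment;
Figure 11 is a schematic diagram of the ROC curves obtained with methylation entropy and the lasso model in one embodiment;
Figure 12 is a schematic diagram of the ROC curves obtained with methylation epi-polymorphism and the xgboost model in one embodiment;
Figure 13 is a schematic diagram of the ROC curves obtained with methylation epi-polymorphism and the lasso model in one embodiment;
Figure 14 is a schematic diagram of the ROC curves obtained with methylation frequency and the xgboost model in one embodiment;
Figure 15 is a schematic diagram of the ROC curves obtained with methylation frequency and the lasso model in one embodiment;
Figure 16 is a Venn diagram of the 10 most important intervals under the xgboost model and the lasso model in one embodiment;
Figure 17 is a Venn diagram of the 25 most important intervals under the xgboost model and the lasso model in one embodiment;
Figure 18 is a Venn diagram of the 50 most important intervals under the xgboost model and the lasso model in one embodiment;
Figure 19 is a Venn diagram of the 100 most important intervals under the xgboost model and the lasso model in one embodiment;
Figure 20 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model, respectively, over 100 splits using methylation entropy;
Figure 21 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model, respectively, over 100 splits using methylation epi-polymorphism;
Figure 22 is a Venn diagram of the important intervals obtained with the xgboost model and the lasso model, respectively, over 100 splits using methylation frequency;
Figure 23 is a schematic structural diagram of the detection device for specific sequencing intervals in one embodiment;
Figure 24 is a schematic diagram of the JSON file recording the names of the important methylation regions and the important arrangement forms in one embodiment;
Figure 25 is a schematic diagram of the directory hierarchy of the feature-screening record files in one embodiment;
Figure 26 is a schematic diagram of the JSON file describing how each sample is affected by a single feature in one embodiment;
Figure 27 is a schematic diagram of the txt file giving the effect of a single feature on the corresponding methylation region in one embodiment;
Figure 28 is a schematic diagram of the computer device in one embodiment.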
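Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the application and do not limit it.
A reference to an "embodiment" in this document means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. Occurrences of the phrase in different places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described here can be combined with other embodiments.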
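In one embodiment, as shown in Figure 1, a vectorized characterization method for methylation levels is provided, including the following steps:
S10, obtaining the methylation information of each methylation sequencing interval of a test sample, where the test sample includes multiple methylation sequencing intervals.
The test sample may come from specific tissue that requires methylation-level analysis, for example several samples of breast cancer (BrC) tissue or plasma. For each type of test sample, benign and malignant labels can be determined by relevant medical means to facilitate subsequent analysis.
Specifically, obtaining the methylation information of each methylation sequencing interval of the test sample includes:
performing bisulfite sequencing on each methylation sequencing interval of the test sample with a preset panel, so that methylated sites in each interval are represented by a first identifier and unmethylated sites by a second identifier, giving the methylation information of each site in each methylation sequencing interval.
In panel-based methylation sequencing, each test sample contains, in each methylation region (methylation sequencing interval), multiple contiguous or non-contiguous methylation sites (the actual span on the DNA may range from 100 to 250 nt). Under the recording convention of the bisulfite method, T indicates that the site is not methylated and C indicates that it is methylated. Combinations of T and C represent the different ways the methylation sites can be composed. Real sequenced samples usually contain methylation sites whose methylation information could not be measured; these are typically denoted N. In this embodiment, to reduce the subsequent computational load, N may be uniformly replaced by T, meaning that there is no methylation information. In this case the first identifier may be C and the second identifier may be T; that is, C denotes the methylated sites and T the unmethylated sites of each methylation sequencing interval.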
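S20, determining, from the methylation information of each methylation sequencing interval, the number of Reads of each type of sequencing result in a preset reading interval, where the number of Reads is the number of occurrences of the corresponding type of sequencing result in the methylation information of the corresponding methylation sequencing interval, and the order of the types of sequencing results in the reading interval is preset.
The length of the reading interval can be preset, and the order of the types of sequencing results in the reading interval is preset. Normally, once this order is fixed it must be declared in advance, and the same preset order is used throughout the subsequent analysis, evaluation, and recording.
Specifically, within limits, a longer reading length is better: a longer reading interval needs fewer slides and reflects the information of the whole sequencing interval more completely, instead of fragmenting it through many slides. For a given reading length, a balance is needed between the amount of information obtained in a single reading interval and the number of combinations of methylation states over all sites at that length; the number of combinations must not be too large, otherwise the amount of information, which grows exponentially with the reading length, puts great pressure on storage and computation.
In one example, tests on relevant samples suggest a reading length of 8-10 methylation sites. All methylation patterns that can occur in a reading interval can be ordered alphabetically by sequence. Any meaningful ordering may be used for the types of sequencing results in the reading interval; the guiding principle is that the ordering is easy to implement in any programming language and stays consistent across methylation sequencing intervals of different lengths.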
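The alphabetical ordering described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the alphabet is exactly C and T (with C sorting before T); the names `reading_patterns` and `index_of` are hypothetical and do not come from the application.

```python
from itertools import product


def reading_patterns(length, alphabet="CT"):
    """Return every C/T combination of the given length in alphabetical order."""
    # product() over a sorted alphabet yields tuples in lexicographic order.
    return ["".join(p) for p in product(sorted(alphabet), repeat=length)]


patterns = reading_patterns(8)
# Map each pattern to its 0-based slot in a count vector.
index_of = {p: i for i, p in enumerate(patterns)}
print(len(patterns), patterns[0], patterns[-1])
```

Because the number of patterns is 2 to the power of the reading length, moving from 8 to 10 sites already raises the count from 256 to 1024, which illustrates the storage trade-off discussed above.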
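S30, sliding a sliding window over the reading interval from the first site to the last site by a sliding step, and obtaining, from the number of Reads of each type of sequencing result, the number of occurrences of each sequence combination of the sliding window in each window read, where the sliding window slides onward by the sliding step after reading the number of occurrences of each sequence combination.
In one embodiment, this step includes:
S31, in the m-th window read, placing the first position of the sliding window at position (s(m-1)+1) of the reading interval and reading, from the number of Reads of each type of sequencing result, the number of occurrences of each sequence combination in the sliding window; where the initial value of m is 1, s denotes the sliding step, and s is smaller than the length of the reading interval;
S32, if in the m-th window read the last position of the sliding window is not the last position of the reading interval, updating m to m+1 and returning to step S31, until the last position of the sliding window is the last position of the reading interval.
Specifically, if the last position of the sliding window extends beyond the reading interval, reading the number of occurrences of each sequence combination in the sliding window includes:
shortening the sliding window so that its last position falls on the last position of the reading interval, and determining the number of occurrences of each sequence combination in the current sliding window from the number of Reads of each type of sequencing result.
In this embodiment, in the first window read, the first position of the sliding window is placed at position 1 of the reading interval, and the number of occurrences of each sequence combination in the current window is read from the Reads counts. The first position of the window is then moved onward by s positions for the second window read, in which the occurrences of each sequence combination in the current window are read in the same way, and so on, until the last position of the sliding window coincides with or extends beyond the last position of the reading interval. If the window extends beyond the last position of the reading interval, the window used for that read is shortened: its first position stays fixed and its last position is moved to the last position of the reading interval, so that the successive window reads capture the information of the reading interval completely and accurately.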
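A minimal sketch of steps S31 and S32, including the shortening of a final window that would overrun the end, is given below. The function name `window_bounds` and its 1-based inclusive return format are assumptions made for illustration.

```python
def window_bounds(n_sites, window, step):
    """Return the 1-based inclusive (first, last) positions of every window read."""
    if not 0 < step < window:
        raise ValueError("the sliding step must be positive and smaller than the window")
    bounds = []
    m = 1
    while True:
        first = step * (m - 1) + 1                 # S31: first position of the m-th read
        last = min(first + window - 1, n_sites)    # shorten a window that would overrun
        bounds.append((first, last))
        if last == n_sites:                        # S32: stop once the end is reached
            return bounds
        m += 1


print(window_bounds(18, 8, 5))
print(window_bounds(16, 8, 5))
```

With 18 sites, a window of 8, and a step of 5, the reads cover positions 1-8, 6-13, and 11-18; with 16 sites the last read is shortened to positions 11-16.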
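S40, determining, from the number of occurrences of each sequence combination in each methylation sequencing interval, the count vector produced by each methylation sequencing interval in each window read.
Specifically, determining the count vector produced in each window read from the number of occurrences of each sequence combination includes:
setting the order of the sequence combinations of the sliding window to obtain a combination order;
in each window read, arranging the numbers of occurrences of the sequence combinations in the combination order to obtain the count vector.
In this embodiment, every window read yields the number of occurrences of each sequence combination in the corresponding sliding window. The count of the sequence combination with combination order 1 becomes the first vector element, the count of the combination with order 2 becomes the second element, and so on, until the count of the last combination becomes the last element. Building the count vector this way ensures that it is accurate.
S50, concatenating the count vectors produced by each methylation sequencing interval in the successive window reads into the methylation vector of that methylation sequencing interval.
Specifically, concatenating the count vectors produced in the window reads into the methylation vector includes:
joining the count vectors end to end in the order of the window reads to obtain the methylation vector.
In one example, if the count vectors of a methylation interval are the two vectors 0 5 0 0 0 3 0 2 (first) and 0 0 0 0 0 2 3 5 (second), they are joined end to end in the order in which they were obtained, giving the methylation vector:
0 5 0 0 0 3 0 2 0 0 0 0 0 2 3 5.
The advantage of this methylation vector is that, once the order of the types of sequencing results in the reading interval and the order of the sequence combinations of the sliding window are preset, every value (vector element) of the numeric vector (methylation vector) corresponds one-to-one to a sequence combination of the reading interval in a particular window read. Subsequent recording and analysis need not repeatedly record the methylation combination that each value stands for, so recording is much less complex and the storage and computation load is greatly reduced compared with traditional schemes for recording methylation levels. Unlike traditional methods of characterizing methylation information, a given methylation sequencing interval is no longer reduced to a single value. Instead, the numeric vector makes it possible to recover the abundance distribution of the different methylation sites in the interval, which makes subsequent analysis of the interval easier for researchers.
With the above vectorized characterization method, the methylation information of each methylation sequencing interval of the test sample is obtained, the number of Reads of each type of sequencing result in the preset reading interval is determined, and the sliding window is slid over the reading interval from the first site to the last site by the sliding step, so that the number of occurrences of each sequence combination of the sliding window in each window read is obtained from the Reads counts. The count vectors produced in the successive window reads are then determined and concatenated into the methylation vector of each methylation sequencing interval. Each element of the methylation vector corresponds one-to-one to a sequence combination of a methylation sequencing interval in the test sample, so the methylation information of each interval can be characterized comprehensively. Specifically, every methylation sequencing interval of the test sample is characterized by a methylation vector rather than reduced to a single value, which makes it possible to recover the abundance distribution of the different methylation sites in each interval and makes subsequent analysis easier for researchers. Subsequent recording and analysis need not repeatedly record the methylation combination that each vector element stands for, so, compared with traditional methods of recording methylation levels, the recording is much less complex and the storage and computation load of the analysis is greatly reduced.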
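In one embodiment, test samples are selected from a number of benign and malignant samples of breast cancer (BrC) tissue and plasma origin, and bisulfite sequencing is performed with a pre-designed panel to obtain the methylation information of each site in each test sample. With a reading interval of suitable length, all methylation patterns that can occur in a reading interval of a methylation sequencing interval are arranged in a specific order to give the types of sequencing results of that reading interval, and the frequency with which each possible sequencing result occurs is counted to give the number of Reads of each type. The sliding window is slid over the reading interval by a given sliding step so that the whole methylation sequencing interval is read out, and the methylation level is converted into the count vectors produced in the window read before each slide; the count vectors produced in the successive window reads are then concatenated into the methylation vector of that interval. These operations are carried out for each methylation sequencing interval to obtain its methylation vector. Based on these methylation vectors, a machine learning model can subsequently be built on the training-set samples for each single methylation sequencing region to distinguish benign from malignant samples and be evaluated on the test set, yielding the methylation regions that matter for separating benign from malignant samples and the methylation patterns of those regions.
In one example, as shown in Figure 2, when methylation information is used to discriminate benign from malignant samples with a methylation sequencing method based on a pre-designed panel, the information of each methylation region of each sample can usually be recorded in the form shown in Figure 2: the methylation region contains multiple contiguous or non-contiguous methylation sites (the actual span on the DNA may range from 100 to 250 nt). Taking the 18 sites in Figure 2 as an example, the sequences of T and C on the left show the methylation sites as sequenced; under the recording convention of the bisulfite method, T indicates that the site is not methylated and C indicates that it is methylated. Combinations of T and C represent the different ways the methylation sites can be composed, and the number on the right is the frequency with which that combination was measured by methylation sequencing in that region of that sample. In practice, sequenced samples usually contain methylation sites whose methylation information could not be measured; these are denoted N. To reduce the subsequent computational load, N is uniformly replaced by T here, meaning that there is no methylation information.
In earlier methods, such a region is usually represented by a single value (for example methylation entropy) for the methylation level. This direct calculation simplifies and discards a large amount of methylation information, and when a machine learning model is later used for classification, such methods risk degrading the classification performance. For this reason, this example uses the methylation vector described above to characterize the methylation level of a single region in a single sample.
The methylation vector can be generated as follows: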
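Choose a reading length to define a "reading interval" (reading window), then enumerate and sort all possible combinations of C and T for that reading interval. This example uses an interval length of 8, which gives an ordered list (the sequencing results) of 256 cases; Figure 3 shows some of these C and T combinations. In the Python programming language, an ordered dictionary can be initialized whose keys are the 256 cases, each with an initial value of 0, together with an empty list.
Based on this ordered list of all possibilities for the reading interval, each methylation region of each sample shown in Figure 2 can be tallied to obtain the number of Reads of each type of sequencing result. Figure 4 shows the methylation region captured at each movement step. This example uses a reading interval of length 8 and an interval overlap of 3 on a methylation region of length 18, and is described step by step below with reference to Figure 4.
At the initial position (step one), positions 1-8 of the length-18 region are cut out, giving the sequencing results TTTTTTTT, CCCCTTTT, TTTTCCCC, ... (underlined in step one of Figure 4). Taking into account how often each full sequence occurs in the sequencing results, the ordered-dictionary keys TTTTTTTT, CCCCTTTT, TTTTCCCC, ... are increased by 50, 120, 30, ... respectively. Note that no sequence repeats over the full length 1-18, but repeats can occur at the cut-out position. Repeated keys are not merged, nor is only one or a few of them counted; instead, each adds the frequency of its own length-18 sequence to the value of the corresponding key. When the tally is complete, the dictionary values are taken out in order and put into the empty list, whose length is now 256. All dictionary values are then reset to zero.
At the step-two position, after moving by the sliding step of 5, positions 6-13 are cut out, giving TTTTTTTT, TTTTTTTT, CCCTTTTT, ...; the ordered-dictionary keys TTTTTTTT, TTTTTTTT, CCCTTTTT, ... are increased by 50, 120, 30, ... respectively. The dictionary values are taken out in order and appended to the list, whose length is now 512. All dictionary values are then reset to zero.
At the step-three position, after moving by the step of 5 again, positions 11-18 are cut out, giving TTTCCCCC, TTTCCCCC, TTTCCCCC, ...; the ordered-dictionary keys TTTCCCCC, TTTCCCCC, TTTCCCCC, ... are increased by 50, 120, 30, ... respectively. The dictionary values are taken out in order and appended to the list, whose length is now 768. All dictionary values are then reset to zero.
Thus, for a methylation region of length 18 with a reading interval of 8 and an interval overlap of 3, the final result is a numeric vector of length 768 that represents the degree of methylation of the region.
Because the number of methylation sites in each methylation region on the panel is not constant (the same methylation site in …), the algorithm must also handle two special cases.
Case one: the chosen reading-interval length is greater than the number of methylation sites in the region. The number of methylation sites in the region is then used as the reading-interval length for that region, and a single tally is made (without any sliding). For example, for a methylation region of length 4, the reading interval is set to 4, and the resulting count vector in the list has length 16.
Case two: the number of methylation sites in the region does not allow a whole number of complete moves for the given reading interval and interval overlap. The last, incomplete interval is then tallied with its own length as the reading-interval length. For example, for a methylation region of length 16 with a reading interval of 8 and an interval overlap of 3, the last incomplete interval has length 6 and is tallied with a reading interval of 6. The resulting count vector in the list has length 256+256+64=576.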
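The procedure above, including both special cases, can be sketched as follows. This is an illustrative reading of the text rather than the applicants' implementation: the function name `methylation_vector`, the input format (a dictionary from full-length read pattern to frequency), and the three example reads are assumptions, and the reads are assumed to contain only C, T, and N.

```python
from collections import OrderedDict
from itertools import product


def methylation_vector(read_counts, window=8, overlap=3):
    """Concatenate per-window pattern counts for one region of one sample.

    read_counts maps a full-length read pattern (C/T/N) to its measured frequency.
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("the overlap must be smaller than the reading interval")
    n_sites = len(next(iter(read_counts)))
    vector = []
    start = 0  # 0-based first site of the current window
    while True:
        # A short region or an incomplete last window uses its own length.
        end = min(start + window, n_sites)
        counts = OrderedDict(
            ("".join(p), 0) for p in product("CT", repeat=end - start)
        )
        for read, n in read_counts.items():
            # N (no information) is treated as T; repeated keys accumulate.
            counts[read[start:end].replace("N", "T")] += n
        vector.extend(counts.values())
        if end == n_sites:
            return vector
        start += step


# Hypothetical reads for an 18-site region, with frequencies 50, 120, and 30.
reads = {
    "T" * 18: 50,
    "CCCC" + "T" * 14: 120,
    "TTTTCCCC" + "T" * 10: 30,
}
vec = methylation_vector(reads)
print(len(vec))
```

For these hypothetical reads the vector has 768 elements, and each block of 256 sums to the total read count of 200. A 16-site region gives 576 elements and a 4-site region gives 16, matching the two special cases.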
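Thus, the methylation level of a single methylation region in a single sample changes from a single value into a numeric vector with tens to hundreds of elements. The meaningful values in this vector are sparse, so a feature-screening step is needed to pass only the truly effective features to subsequent modeling and classification, thereby compressing the data.
In one embodiment, as shown in Figure 5, a vectorized characterization device for methylation levels is provided, including:
a first acquisition module 10, configured to obtain the methylation information of each methylation sequencing interval of a test sample, where the test sample includes multiple methylation sequencing intervals;
a first determination module 20, configured to determine, from the methylation information of each methylation sequencing interval, the number of Reads of each type of sequencing result in a preset reading interval, where the number of Reads is the number of occurrences of the corresponding type of sequencing result in the methylation information of the corresponding methylation sequencing interval, and the order of the types of sequencing results in the reading interval is preset;
a sliding module 30, configured to slide a sliding window over the reading interval from the first site to the last site by a sliding step and to obtain, from the number of Reads of each type of sequencing result, the number of occurrences of each sequence combination of the sliding window in each window read, where the sliding window slides onward by the sliding step after reading the number of occurrences of each sequence combination;
a second determination module 40, configured to determine, from the number of occurrences of each sequence combination in each methylation sequencing interval, the count vector produced by each methylation sequencing interval in each window read;
a reading module 50, configured to concatenate the count vectors produced by each methylation sequencing interval in the successive window reads into the methylation vector of that methylation sequencing interval.
For the specific limitations of the vectorized characterization device, refer to the limitations of the vectorized characterization method above; they are not repeated here. Each module of the device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.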
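In one embodiment, referring to Figure 6, a detection method for specific sequencing intervals is provided, including:
S61, obtaining the methylation vector of each methylation sequencing interval in each sample with the vectorized characterization method for methylation levels described in any of the above embodiments.
S62, dividing the samples into training samples and test samples; taking the methylation vectors of the methylation sequencing intervals in the training samples as training vectors and those in the test samples as test vectors; treating the methylation sequencing intervals that share the same interval order as one group, giving multiple groups of methylation sequencing intervals; feeding each group of training vectors into a classification model for training to obtain an evaluation model for each group; and testing each evaluation model on its corresponding group of test vectors to determine the evaluation metric of each methylation sequencing interval; where the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group.
The samples include multiple training samples and multiple test samples. Every sample (training or test) carries a type label such as benign or malignant, so the type label of the methylation vector of each methylation sequencing interval in a sample can be determined from the sample's label. Every sample includes multiple methylation sequencing intervals, and the number of intervals is the same in every sample. In practice, if sequencing yields no information for some interval, an empty vector of the specific length (matching the ordering information of that interval's vector) must be generated by reference to the other samples, with every element kept at 0. The order of the methylation sequencing intervals within a sample (the interval order) is preset. Thus, when the intervals with the same interval order are treated as one group, each group can include both the intervals from the training samples and those from the test samples, and the methylation vectors of each group can include a group of training vectors and a group of test vectors.
Each group of training vectors is fed into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals; each group of test vectors is then fed into the evaluation model of the corresponding interval, so that each evaluation model outputs an AUC (Area Under Curve) for its group of test vectors, and the evaluation metric of each evaluation model is determined from the AUC.
The classification model may be a tree-based model, a regression-based model, or another model. As described above, the methylation vectors of a group of methylation sequencing intervals comprise a group of training vectors and a group of test vectors; accordingly, the group of methylation vectors corresponds to one evaluation model, and both its training vectors and its test vectors correspond to that model. When a group of test vectors is fed into its evaluation model, the model outputs the AUC for that group, and this AUC is the evaluation metric of the model. Feeding each group of test vectors into the evaluation model of the corresponding interval therefore yields one AUC per model, which determines the evaluation metric of each evaluation model.
In one example, the samples can be split several times, so that each split yields its own training samples and test samples. In each split, each group of training vectors is fed into the classification model for training to obtain the evaluation model of each group of methylation sequencing intervals, and each evaluation model is tested on the test vectors from that split to obtain its AUC for that split. Each methylation sequencing interval thus has several AUCs, one per split, from which its evaluation metric can be determined.
S63, determining the methylation sequencing intervals whose evaluation metric is greater than or equal to a specific threshold as specific sequencing intervals.
There may be several evaluation models whose evaluation metric is greater than or equal to the specific threshold. Each of them corresponds to one group of methylation sequencing intervals, and the intervals of each such group have the same interval order in every sample. These groups of methylation sequencing intervals are the specific sequencing intervals.
Specifically, the specific threshold is obtained as follows:
a set proportion of the maximum evaluation metric is taken as the specific threshold.
The set proportion may be, for example, 80%; in that case 80% of the maximum evaluation metric is taken as the specific threshold.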
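The threshold rule can be illustrated with a short sketch. The function name `select_intervals`, the interval names, and the metric values are hypothetical.

```python
def select_intervals(metrics, ratio=0.8):
    """Keep intervals whose metric is at least `ratio` times the best metric."""
    threshold = ratio * max(metrics.values())
    return {name for name, value in metrics.items() if value >= threshold}


# Hypothetical evaluation metrics (for example median AUCs) for three intervals.
metrics = {"interval_1": 0.98, "interval_2": 0.80, "interval_3": 0.70}
selected = select_intervals(metrics)
print(sorted(selected))
```

Here the threshold is 80% of 0.98, so the first two intervals are kept and the third is dropped.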
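In the specific sequencing intervals determined in this embodiment, the differences between samples of different types are far greater than the differences within samples of the same type, so these intervals can serve as good indicators for distinguishing sample types. For example, when the sample types are benign and malignant, the difference between benign and malignant samples in a specific sequencing interval is far greater than the difference within benign samples or within malignant samples, so the interval is a good indicator for separating benign from malignant samples. A specific sequencing interval therefore characterizes the type of the corresponding sequence or sample fairly clearly; that is, the methylation vector of a specific sequencing interval makes a large contribution to judging the type of the sequences in that interval.
Further, a sample such as a genome in this embodiment includes many methylation sequencing intervals (for example 10000), each with its own interval order within the sample. The intervals that share the same interval order across samples form one group: for example, the first interval of the first sample, the first interval of the second sample, ..., and the first interval of the last sample form one group; the second interval of the first sample, the second interval of the second sample, ..., and the second interval of the last sample form another group; and so on. Every sample is sequenced over all of these intervals and its information extracted. When evaluating specific sequencing intervals, the samples are divided into a training set and a test set and the vector (methylation vector) of each interval is used as the features; modeling is done separately, interval by interval, and different samples are compared on the same interval. So when the samples include 10000 methylation sequencing intervals, there are 10000 evaluation models. By ranking the 10000 AUCs (each AUC here is in fact the evaluation metric of the corresponding interval, that is, the mean or median of the interval's AUCs over multiple splits) and requiring at least one threshold to be met, the optimal interval (set) among the 10000 intervals is obtained; the optimal interval (set) is the interval with the largest AUC.
With the above detection method, the methylation vector of each methylation sequencing interval in each sample is obtained, the samples are divided into training and test samples, the intervals with the same interval order are grouped, an evaluation model is trained for each group on its training vectors and tested on the corresponding group of test vectors to determine the evaluation metric of each interval, and the intervals whose evaluation metric is greater than or equal to the specific threshold are determined as specific sequencing intervals. In the specific sequencing intervals detected this way, the differences between samples of different types are far greater than the differences within samples of the same type, so these intervals can serve as good indicators for distinguishing sample types; when the sample types are benign and malignant, for example, they are good indicators for separating benign from malignant samples. A specific sequencing interval therefore characterizes the type of the corresponding sequence or sample fairly clearly; that is, its methylation vector contributes more to judging the type of the sequences in that interval.
In one embodiment, dividing the samples into training and test samples, grouping the intervals, training the evaluation models, and testing them to determine the evaluation metric of each methylation sequencing interval includes:
dividing the samples into training samples and test samples multiple times;
in each split, taking the methylation vectors of the methylation sequencing intervals in the training samples as training vectors and those in the test samples as test vectors; treating the intervals that share the same interval order as one group, giving multiple groups of methylation sequencing intervals; feeding each group of training vectors into the classification model for training to obtain the evaluation model of each group; and testing each evaluation model on its corresponding group of test vectors to obtain the AUC of each evaluation model in that split;
collecting the AUCs of each evaluation model over all splits, giving multiple AUCs for each methylation sequencing interval, and determining the evaluation metric of each interval from the mean or median of its AUCs.
In this embodiment the samples are split multiple times (for example 100 times or more). For each split, the evaluation model of each group of methylation sequencing intervals is trained on the training samples of that split and tested on the corresponding group of test vectors to obtain its AUC for that split. After all splits, each methylation sequencing interval has multiple AUCs from which its evaluation metric is determined, which effectively removes errors that would otherwise affect the evaluation metric.
When the samples are split, trained, and tested multiple times, the split ratio may be the same each time or may vary; the splits themselves differ from one another (that is, the training and test samples differ from split to split). For example, the ratio of training to test samples may be 6:4 in one split and 7:3 in another. As a further example, with 500 samples and a 6:4 split, the training and test sets contain 300 and 200 samples respectively. For a fixed ratio there are many possible 6:4 partitions; if 100 non-repeating splits are made, the combination of 300 training samples and 200 test samples differs each time. For a given methylation sequencing interval, the 100 splits in effect produce 100 evaluation models, each with its own AUC on its test set. These 100 AUCs vary, so to allow comparison across methylation sequencing intervals, the median or mean of the 100 AUCs can be taken to characterize the quality of the model for that interval and to remove the error of any single split, training, and test run.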
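The repeated-split evaluation for one methylation sequencing interval can be sketched as below. To stay self-contained, the sketch replaces xgboost or lasso with a toy centroid scorer, and computes the AUC directly from its pairwise-ranking definition; `repeated_split_aucs`, `centroid_fit`, and the toy data are all hypothetical names and values, and the loop assumes that enough distinct partitions exist.

```python
import random
from statistics import median


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_split_aucs(vectors, labels, fit, n_splits=100, train_frac=0.6, seed=0):
    """Train and test one interval's model on n_splits distinct random partitions."""
    rng = random.Random(seed)
    order = list(range(len(vectors)))
    cut = round(train_frac * len(order))
    seen, aucs = set(), []
    while len(aucs) < n_splits:
        rng.shuffle(order)
        train, test = order[:cut], order[cut:]
        y_test = [labels[i] for i in test]
        # Skip repeated partitions and test sets that lack one of the classes.
        if frozenset(train) in seen or len(set(y_test)) < 2:
            continue
        seen.add(frozenset(train))
        score = fit([vectors[i] for i in train], [labels[i] for i in train])
        aucs.append(auc(y_test, [score(vectors[i]) for i in test]))
    return aucs


def centroid_fit(X, y):
    """Toy stand-in for a classifier: project onto (mean of class 1 - mean of class 0)."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mp = mean([x for x, t in zip(X, y) if t == 1])
    mn = mean([x for x, t in zip(X, y) if t == 0])
    w = [a - b for a, b in zip(mp, mn)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


# Toy methylation vectors for one interval: 10 malignant (1) and 10 benign (0) samples.
vectors = [[5, 0]] * 10 + [[0, 5]] * 10
labels = [1] * 10 + [0] * 10
aucs = repeated_split_aucs(vectors, labels, centroid_fit, n_splits=20)
metric = median(aucs)  # evaluation metric of this interval
print(len(aucs), metric)
```

Repeating this per interval gives one evaluation metric per interval, which can then be ranked across intervals.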
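Specifically, the detection method for specific sequencing intervals further includes:
searching the methylation vector of each specific sequencing interval for the vector elements that make a specific contribution to training the corresponding evaluation model, giving multiple target elements;
determining the site or site combination of the specific sequencing interval to which each target element corresponds, and taking the sites or site combinations so determined as the methylation sites or methylation site combinations of that specific sequencing interval.
Every element of a methylation vector corresponds to one site or site combination of the methylation sequencing interval. The site or site combination of each element can be located from the order of the types of sequencing results in the reading interval, the length of the sliding window, and the sliding step.
The vector elements with a specific contribution are the more important elements of the corresponding specific sequencing interval. The sites or site combinations to which they correspond are methylated sites or site combinations and tend to carry more important information. Taking them as the methylation sites or methylation site combinations of the specific sequencing interval allows these sites to be recorded and analyzed more precisely, and helps in determining the influence of the detected methylation sites or site combinations on the specific sequencing interval.
In one example, searching the methylation vector of each specific sequencing interval for the vector elements with a specific contribution, giving multiple target elements, includes:
in each split, obtaining the importance ranking parameter that each specific model outputs for each vector element of the specific vector, giving multiple importance ranking parameters for each vector element; where a specific model is the evaluation model of a specific sequencing interval and a specific vector is the methylation vector of a specific sequencing interval;
adding up the importance ranking parameters of each vector element to obtain its sum, and taking the vector elements whose sums rank before a set position as the vector elements with a specific contribution.
The set position may be chosen according to the total number of sums, for example 30% of that total.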
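A minimal sketch of this rank-sum rule follows. It assumes that rank 1 means most important, so a smaller sum means a more important element, and that every element is ranked in every split; the function name `rank_sum_top` and the toy rankings are hypothetical.

```python
def rank_sum_top(rankings, top_fraction=0.3):
    """Sum each element's rank over all splits and keep the best-ranked fraction.

    rankings: one dict per split, mapping vector element -> rank (1 = most important).
    """
    totals = {}
    for ranking in rankings:
        for element, rank in ranking.items():
            totals[element] = totals.get(element, 0) + rank
    ordered = sorted(totals, key=lambda e: (totals[e], e))  # smallest sum first
    keep = max(1, int(len(ordered) * top_fraction))
    return ordered[:keep]


# Toy rankings of four vector elements over three splits.
rankings = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 2, "b": 1, "c": 4, "d": 3},
    {"a": 1, "b": 3, "c": 2, "d": 4},
]
print(rank_sum_top(rankings, top_fraction=0.5))
```

In this toy case the sums are 4, 6, 9, and 11, so the top half consists of elements a and b.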
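In another example, searching the methylation vector of each specific sequencing interval for the vector elements with a specific contribution, giving multiple target elements, includes:
in each split, obtaining the importance score that each specific model outputs for each vector element of the specific vector, giving multiple scores for each vector element; where a specific model is the evaluation model of a specific sequencing interval and a specific vector is the methylation vector of a specific sequencing interval;
taking the vector elements whose number of non-zero importance scores is greater than or equal to a count threshold as the vector elements with a specific contribution.
The count threshold may be chosen according to the number of splits, for example 20% of the number of splits.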
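The non-zero-count rule can be sketched in the same style. The function name `frequent_nonzero` and the toy scores are hypothetical; elements missing from a split are treated as having a score of zero.

```python
def frequent_nonzero(scores_per_split, min_fraction=0.2):
    """Keep elements whose importance score is non-zero in enough splits.

    Returns the kept elements ordered by how many splits gave them a non-zero score.
    """
    needed = min_fraction * len(scores_per_split)
    hits = {}
    for scores in scores_per_split:
        for element, value in scores.items():
            if value != 0:
                hits[element] = hits.get(element, 0) + 1
    kept = {e: n for e, n in hits.items() if n >= needed}
    return sorted(kept, key=lambda e: (-kept[e], e))


# Toy importance scores of three vector elements over five splits.
scores_per_split = [
    {"a": 0.4, "b": 0.0, "c": 0.0},
    {"a": 0.3, "b": 0.1, "c": 0.0},
    {"a": 0.5, "b": 0.0, "c": 0.0},
    {"a": 0.2, "b": 0.0, "c": 0.0},
    {"a": 0.6, "b": 0.0, "c": 0.0},
]
print(frequent_nonzero(scores_per_split, min_fraction=0.4))
```

With five splits and a 40% requirement, an element must be non-zero in at least two splits, so only element a is kept here.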
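In both examples, during each split and the building of its evaluation model, the AUC measured on the test set (the test vectors) characterizes the generalization ability of the evaluation model, while modeling on the training set (the training vectors) yields an importance ordering of the elements of the methylation vector (for example the importance ranking parameter and importance score of each element). This shows which elements contribute most to the evaluation model. Because the examples model on the ordered methylation vector as features, the important methylation sites (or combinations of methylation sites) can be located from this feature-importance ordering.
Specifically, take 100 splits with training and testing as an example. With 100 non-repeating splits, each methylation sequencing interval has one evaluation model per split and one corresponding importance ordering of its features, so the orderings over the 100 splits must be considered together; the two example methods above can be used to rank the vector elements and find the relatively important ones. The first method adds up each feature's ranks (importance ranking parameters) over the 100 splits: the feature with the smallest sum is defined as the most important over the 100 splits, the feature with the second-smallest sum as the second most important, and so on. The second method takes the importance score of each feature from the evaluation model in every split and defines a feature as important if, over 100 splits, its importance score is non-zero in at least 20 splits. These important features are ranked as follows: a feature whose importance is non-zero in all 100 splits ranks first, one with 99 ranks second, and so on, with ties allowed. In this way the relatively important elements of the methylation vector of a specific sequencing interval are determined and the important methylation sites (or methylation site combinations) are located.
In one embodiment, referring to Figure 7, the number of sites of each methylation sequencing interval is the same across samples, so each computed vector has the same length in every sample. Suppose there are m samples in total and each sample has n features in a given interval; then X ij denotes feature j of sample i, and Y i denotes the attribute label of sample i (for example benign or malignant).
After the sample data are divided evenly into a training set and a test set, feature screening can be carried out with various classification models, such as a tree-based model (for example xgboost) or a regression-based model (for example lasso). Feature screening may include the following steps.
Step one: determining the methylation intervals (specific sequencing intervals) that contain the important features. Each methylation interval is modeled on the training set and evaluated on the test set. Each methylation interval yields an evaluation metric AUC (Area Under Curve) on each test set, and the mean or median across the test sets is taken as its representative value for ranking. A specific threshold is chosen, and the intervals with the highest overall AUC are selected.
Step two: determining the positions of the important features within the selected methylation intervals. For the intervals with the highest overall AUC, the model evaluation on the training set is revisited, and the most important features in the interval are picked out according to each model's way of judging importance. For example, for a methylation region of length 18 with a reading interval of 8 and an interval overlap of 3 (a sliding step of 5), a numeric vector of length 768 is obtained. The contribution of each of features 1-768 of this vector to building the model is known from the model's training on the training set.
The final result of the two-step selection is a number of important methylation regions (specific sequencing intervals) and a number of important features (methylation sites or methylation site combinations) within them.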
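Specifically, the detection method for specific sequencing intervals is illustrated by the following three application examples.
Application example one:
A tree-based model (for example xgboost) or a regression-based model (for example lasso) is applied to a given methylation panel (11939 methylation detection intervals) and benign/malignant breast cancer tissue samples (257 cases) for feature screening. The tissue samples are split 6:4 into a training set and a test set (an actual sample ratio of 154:103; the split is repeated 100 times), and the resulting evaluation models are tested; the corresponding ROC curves are shown in Figures 8 and 9. Figures 8 and 9, together with Figure 19, show that of the top 100 methylation detection intervals found by the xgboost model and by the lasso model, 31 are selected by both. Over the 100 repeated splits, the xgboost model gives a highest mean AUC of 0.981 (standard deviation 0.012) and a highest median of 0.983 (IQR 0.015); the lasso model gives a highest mean AUC of 0.977 (standard deviation 0.012) and a highest median of 0.976 (IQR 0.016). All four of these highest mean or median AUC values belong to the same methylation detection interval. This shows that the tree-based and regression-based models find the same optimal methylation detection interval.
Application example two:
To study the weight of each site in the optimal methylation interval in more depth, the case in application example one can be evaluated further. The optimal methylation detection interval has length 13; with a reading interval of 8 and an interval overlap of 3, the feature construction produces 512 features in the specified order (numbered 1-512). Examining the feature-evaluation data for the optimal interval under this scheme shows the following. When the tree-based model xgboost is used to evaluate the optimal interval, feature importance on the training set (importance greater than 0 in a single split, holding in more than 20 repeated splits) gives 40 important features (numbers 16, 32, 48, 56, 64, 80, 95, 96, 112, 128, 144, 156, 158, 160, 176, 188, 192, 208, 224, 232, 240, 248, 252, 254, 255, 256, 384, 416, 432, 444, 448, 464, 480, 488, 492, 496, 504, 508, 511, 512). When the regression-based model lasso is used, feature importance on the training set (importance non-zero in a single split, holding in more than 20 repeated splits) gives 19 important features (numbers 23, 28, 32, 56, 64, 80, 95, 96, 112, 144, 158, 160, 166, 176, 188, 208, 240, 256, 492). Of these, 16 features are selected by both models (numbers 32, 56, 64, 80, 95, 96, 112, 144, 158, 160, 176, 188, 208, 240, 256, 492). The effect of these 16 features on the optimal methylation interval can be expressed by the following vectors: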
32:1 1 1 0 0 0 0 0 0 0 0 0 0
56:1 1 0 0 1 0 0 0 0 0 0 0 0
64:1 1 0 0 0 0 0 0 0 0 0 0 0
80:1 0 1 1 0 0 0 0 0 0 0 0 0
95:1 0 1 0 0 0 0 1 0 0 0 0 0
96:1 0 1 0 0 0 0 0 0 0 0 0 0
112:1 0 0 1 0 0 0 0 0 0 0 0 0
144:0 1 1 1 0 0 0 0 0 0 0 0 0
158:0 1 1 0 0 0 1 0 0 0 0 0 0
160:0 1 1 0 0 0 0 0 0 0 0 0 0
176:0 1 0 1 0 0 0 0 0 0 0 0 0
188:0 1 0 0 0 1 0 0 0 0 0 0 0
208:0 0 1 1 0 0 0 0 0 0 0 0 0
240:0 0 0 1 0 0 0 0 0 0 0 0 0
256:0 0 0 0 0 0 0 0 0 0 0 0 0
492:0 0 0 0 0 0 0 0 1 0 1 0 0
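These marks contain only 0 and 1, and the number of digits equals the number of methylation sites in the methylation region. A 0 means the feature regards the site as not needing methylation, and a 1 means methylation is needed. Merging the effects of the 16 features on the interval gives the vector: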
7 8 8 6 1 1 1 1 1 0 1 0 0
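From this analysis, the most important sites in the optimal methylation interval are sites 2 and 3, followed by site 1, and the least important are sites 10, 12, and 13.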
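The mapping from a feature number to its 0/1 site vector can be reproduced with a short sketch. It assumes that features are numbered from 1, that patterns within each window are in alphabetical order with C before T, and that C marks a methylated site; under these assumptions the sketch reproduces the 16 vectors listed above and their merged vector. The function name `feature_mask` is hypothetical.

```python
def feature_mask(feature_no, n_sites, window=8, overlap=3):
    """Map a 1-based feature number to a 0/1 mask over the sites of the region."""
    step = window - overlap
    offset = feature_no - 1          # 0-based position in the methylation vector
    start = 0                        # 0-based first site of the current window
    while True:
        width = min(window, n_sites - start)
        if offset < 2 ** width:
            break
        if start + width == n_sites:
            raise ValueError("feature number exceeds the vector length")
        offset -= 2 ** width         # skip the block of this window
        start += step
    # In alphabetical order, the binary digits of the offset spell the pattern:
    # 0 stands for C (methylated) and 1 stands for T (unmethylated).
    bits = format(offset, "0{}b".format(width))
    mask = [0] * n_sites
    for i, bit in enumerate(bits):
        if bit == "0":
            mask[start + i] = 1
    return mask


shared = [32, 56, 64, 80, 95, 96, 112, 144, 158, 160, 176, 188, 208, 240, 256, 492]
merged = [sum(col) for col in zip(*(feature_mask(f, 13) for f in shared))]
print(merged)
```

For example, feature 492 falls in the second window (sites 6-13) and decodes to the pattern TTTCTCTT, which marks sites 9 and 11 of the region.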
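Application example three:
For comparison with the detection method for specific sequencing intervals used in application examples one and two, this example also uses several methylation characterizations reported in the literature. These are methylation entropy, methylation epi-polymorphism, and methylation frequency.
For each sample, the corresponding methylation characterization value (range 0-1) is computed for each of the 11939 methylation detection intervals, so the 257 benign/malignant breast cancer tissue samples give an observations × variables matrix of 257 × 11939. The samples are split with the same 100 6:4 splits and tested with the xgboost model and the lasso model, with the following results:
Methylation entropy: the xgboost model has a mean AUC of 0.970 with a standard deviation of 0.024; the lasso model has a mean AUC of 0.984 with a standard deviation of 0.012. The corresponding ROC curves are shown in Figures 10 and 11.
Methylation epi-polymorphism: the xgboost model has a mean AUC of 0.974 with a standard deviation of 0.018; the lasso model has a mean AUC of 0.986 with a standard deviation of 0.009. The corresponding ROC curves are shown in Figures 12 and 13.
Methylation frequency: the xgboost model has a mean AUC of 0.972 with a standard deviation of 0.021; the lasso model has a mean AUC of 0.969 with a standard deviation of 0.017. The corresponding ROC curves are shown in Figures 14 and 15.
These evaluation results have mean AUCs and standard deviations close to those of the detection method for specific sequencing intervals provided in this embodiment. The method of this embodiment uses only a single optimal methylation interval, yet achieves the effect that the traditional methods obtain by modeling on the entire methylation panel. The ROC curves show that, taking the stability of the ROC curves across splits as the criterion, the algorithm is markedly better than the traditional family of methylation characterizations.
To assess importance across intervals, the detection method of this embodiment can sort the intervals in descending order of the median AUC obtained over the 100 splits, giving the top 10, 25, 50, and 100 important intervals for the xgboost model and for the lasso model. Intersecting these sets gives the Venn diagrams in Figures 16, 17, 18, and 19. The intersections account for 25%, 19%, 19%, and 18% of the unions, respectively.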
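The percentage quoted for each Venn diagram is the size of the intersection divided by the size of the union. A small sketch, using placeholder interval identifiers rather than real ones:

```python
def overlap_percent(a, b):
    """Size of the intersection as a percentage of the size of the union."""
    a, b = set(a), set(b)
    return 100 * len(a & b) / len(a | b)


# Placeholder identifiers: two top-100 lists that share 31 intervals.
top_xgboost = range(0, 100)
top_lasso = range(69, 169)
print(round(overlap_percent(top_xgboost, top_lasso)))
```

Two top-100 lists with 31 shared intervals have a union of 169 intervals, which gives about 18% and agrees with the figure stated for the top 100.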
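By contrast, with the traditional characterizations, interval importance on the methylation panel can be assessed as follows. In the xgboost model, important intervals are obtained by screening feature importance on the training set (importance greater than 0 in a single split, holding in more than 20 repeated splits). In the lasso model, important intervals are obtained by screening feature importance on the training set (importance non-zero in a single split, holding in more than 20 repeated splits). On this basis, the important intervals judged by the xgboost model and by the lasso model are obtained for each of the three traditional characterizations: methylation entropy, methylation epi-polymorphism, and methylation frequency. Intersecting the important intervals of the two machine learning models for each characterization gives the Venn diagrams in Figures 20, 21, and 22. The intersections account for 8%, 8%, and 7% of the unions, respectively.
This evaluation shows that, when interval importance is assessed with the xgboost and lasso models, the algorithm constructed in this application gives markedly higher agreement between models than the traditional methylation characterizations. In addition, earlier methods summarize a single methylation interval as one value and so lose most of the information in the interval, whereas the data recording of the method constructed in this application allows the influence of each site in a single interval to be analyzed one by one. These advantages can be regarded as the innovations of the algorithm constructed in this application.
In one embodiment, referring to Figure 23, a detection device for specific sequencing intervals is provided, including:
a second acquisition module 61, configured to obtain the methylation vector of each methylation sequencing interval in each sample with the vectorized characterization device for methylation levels described in any of the above embodiments;
a division module 62, configured to divide the samples into training samples and test samples; take the methylation vectors of the methylation sequencing intervals in the training samples as training vectors and those in the test samples as test vectors; treat the methylation sequencing intervals that share the same interval order as one group, giving multiple groups of methylation sequencing intervals; feed each group of training vectors into a classification model for training to obtain an evaluation model for each group; and test each evaluation model on its corresponding group of test vectors to determine the evaluation metric of each methylation sequencing interval; where the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group;
a third determination module 63, configured to determine the methylation sequencing intervals whose evaluation metric is greater than or equal to a specific threshold as specific sequencing intervals.
For the specific limitations of the detection device for specific sequencing intervals, refer to the limitations of the detection method above; they are not repeated here. Each module of the detection device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a recording method for specific sequencing intervals is provided, including:
obtaining the specific sequencing intervals of each sample with the detection method for specific sequencing intervals described in any of the above embodiments, and setting a dictionary for each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that interval; the dictionaries are recorded in a specific format so as to record each specific sequencing interval.
The specific format may be JSON. Recording the dictionaries in JSON makes them easy for the people handling the data to read and write, easy for machines to parse and generate, and efficient to transmit over a network.
This embodiment sets a dictionary for each specific sequencing interval, with the key representing the interval and the value representing its methylation vector, and records the dictionaries in a specific format such as JSON that is convenient for the people handling the data and easy for machines to parse and generate. This further improves the efficiency of subsequent analysis of the specific sequencing intervals.
In one example, the specific sequencing intervals (important methylation regions) of each sample and the features screened within them (such as methylation sites or methylation site combinations) can be recorded in two parts.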
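Referring to Figure 24, the important methylation regions in Figure 24 are the specific sequencing intervals. Each dictionary represents one important methylation region, and the dictionary value can also record the important arrangement forms of that region; an important arrangement form may include the arrangement of the sites or site combinations within it. The file shown in Figure 24 is in JSON format and records a dictionary whose keys are the important methylation regions and whose values are lists of the important arrangement forms in each region. For example, for a methylation region of length 18 with a reading interval of 8 and an interval overlap of 3, a numeric vector of length 768 is obtained; the dictionary key is then the name of that length-18 methylation region, and the dictionary value gives the names of the features that contribute most among the 768 elements of the vector.
Figure 25 shows a preferred way of storing the file information. The above JSON file is placed in the root directory together with a parent folder containing more detailed information. Each level-1 subfolder is named after an important methylation region, and each level-2 subfolder is named with the number of an important feature in that region; in other words, the number reflects the position of the feature within 1-768. Each level-2 folder contains two files: a JSON file that records the impact of the feature on each sample, and a txt file that records the overall impact of the feature on the methylation region.
Figure 26 shows the form of the JSON file in a level-2 subfolder. This file records a dictionary. The dictionary key is the name of a sample used to evaluate the impact, and the dictionary value is a list with four parts: the affected Reads sequence, the position coordinates of the impact, the sequence obtained by cutting the Reads at those coordinates (in theory identical to the important feature), and the measured count of that Read in the sample. If only one Read is affected, the list is a single-level list; if several are affected, it is a nested list; if no Reads are affected, an empty list "[]" is still kept as a placeholder.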
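The directory layout described above can be sketched with the standard library. Everything concrete here is an assumption made for illustration: the file and folder names (`important_regions.json`, `details`, `samples.json`, `region_mask.txt`), the region and sample names, the coordinate format `[first, last]`, and the example reads are hypothetical and are not taken from Figures 24-26.

```python
import json
import tempfile
from pathlib import Path


def write_records(root, important, per_sample, masks):
    """Write the root JSON plus one level-2 folder per (region, feature) pair."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # Root-level file: important region -> list of important feature numbers.
    (root / "important_regions.json").write_text(json.dumps(important, indent=2))
    for region, features in important.items():
        for feature in features:
            folder = root / "details" / region / str(feature)  # level 1 / level 2
            folder.mkdir(parents=True, exist_ok=True)
            (folder / "samples.json").write_text(
                json.dumps(per_sample[region][feature], indent=2)
            )
            (folder / "region_mask.txt").write_text(
                " ".join(str(b) for b in masks[region][feature]) + "\n"
            )
    return root


important = {"region_A": [32]}
per_sample = {"region_A": {32: {
    # One affected read: a single-level list of four parts.
    "sample_1": ["CCCTTTTTTTTTT", [1, 8], "CCCTTTTT", 12],
    # Several affected reads: a nested list.
    "sample_2": [["CCCTTTTTCCTTT", [1, 8], "CCCTTTTT", 4],
                 ["CCCTTTTTTTTCC", [1, 8], "CCCTTTTT", 7]],
    # No affected reads: an empty list as a placeholder.
    "sample_3": [],
}}}
masks = {"region_A": {32: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}

with tempfile.TemporaryDirectory() as tmp:
    root = write_records(tmp, important, per_sample, masks)
    level2 = root / "details" / "region_A" / "32"
    loaded_top = json.loads((root / "important_regions.json").read_text())
    loaded_samples = json.loads((level2 / "samples.json").read_text())
    mask_text = (level2 / "region_mask.txt").read_text().strip()
```

Keeping the per-feature detail in separate folders means the small root file can be read on its own, while the per-sample records are loaded only for the features of interest.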
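Figure 27 shows the form of the txt file in a level-2 subfolder. This file records the influence of one feature in the methylation region on the interval as a whole. The file contains only 0 and 1; the number of digits equals the number of methylation sites in the methylation region, 0 means the feature regards the site as not needing methylation, and 1 means methylation is needed. Again taking a methylation region of length 18 with a reading interval of 8 and an interval overlap of 3, Figure 27 shows three examples: the first indicates that one feature emphasizes the importance of methylation sites 2, 3, 6, and 9 of the region; the second that another feature emphasizes sites 6, 8, 9, 10, 11, and 13; and the third that a third feature emphasizes sites 11, 12, 14, 15, and 17 (the three examples are stored in three separate files). Assuming the methylation region has exactly these three important features, the method considers that sites 2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 15, and 17 of the whole region need to be methylated, and that the three sites 6, 9, and 11 carry a higher weight (for example a weight of 2, with the remaining sites set to 1).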
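Combining the three txt files into per-site weights amounts to a column-wise sum. The sketch below assumes a space-separated 0/1 layout for the file contents, which is an assumption about Figure 27; the helper names are hypothetical.

```python
def mask_text(positions, n_sites=18):
    """Build the 0/1 line of a txt file from 1-based site positions."""
    return " ".join("1" if i in positions else "0" for i in range(1, n_sites + 1))


def site_weights(texts):
    """Sum the 0/1 masks of several features column by column."""
    masks = [[int(token) for token in text.split()] for text in texts]
    return [sum(column) for column in zip(*masks)]


# The three example features described for Figure 27.
files = [
    mask_text({2, 3, 6, 9}),
    mask_text({6, 8, 9, 10, 11, 13}),
    mask_text({11, 12, 14, 15, 17}),
]
weights = site_weights(files)
required = [i + 1 for i, w in enumerate(weights) if w > 0]
heavier = [i + 1 for i, w in enumerate(weights) if w > 1]
print(required, heavier)
```

The sites with a positive weight are the twelve sites listed above, and sites 6, 9, and 11 are the ones marked by two features.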
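With the recording method for specific sequencing intervals provided in this example, the JSON file at the same level as the parent folder gives the names of the features that contribute most. The JSON file in a level-2 subfolder gives the affected Reads sequence, the position coordinates of the impact, the sequence obtained by cutting the Reads at those coordinates, and the measured count of that Read in the sample. The txt file in a level-2 subfolder gives the influence of one feature in the methylation region on the interval as a whole. With this information, the strengths and weaknesses of the corresponding methylation sites, and potential problems in subsequent experiments, can be evaluated in detail and from every angle.
In one embodiment, a recording device for specific sequencing intervals is provided, including:
a second acquisition module, configured to obtain the specific sequencing intervals of each sample with the detection device for specific sequencing intervals described in any of the above embodiments, and to set a dictionary for each specific sequencing interval, so that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that interval; the dictionaries are recorded in a specific format so as to record each specific sequencing interval.
For the specific limitations of the recording device for specific sequencing intervals, refer to the limitations of the recording method above; they are not repeated here. Each module of the recording device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
Based on the examples above, in one embodiment a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Figure 28. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor provides computation and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a vectorized characterization method for methylation levels, a detection method for specific sequencing intervals, or a recording method for specific sequencing intervals. The display screen may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in Figure 28 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
Accordingly, one embodiment also provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements any of the vectorized characterization methods for methylation levels, detection methods for specific sequencing intervals, or recording methods for specific sequencing intervals of the above embodiments.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments above. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination involves no contradiction, it should be regarded as within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make a number of variations and improvements without departing from the concept of this application, and these all fall within its scope of protection. The scope of protection of this patent application is therefore defined by the appended claims.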

Claims (18)

  1. 一种甲基化水平的向量化表征方法,其特征在于,所述方法包括:
    S10,获取检测样本的各个甲基化测序区间的甲基化信息;其中,所述检测样本包括多个甲基化测序区间;
    S20,根据各个甲基化测序区间的甲基化信息确定预设的阅读区间中各类测序结果的Reads数目;其中,所述Reads数目为相应类别的测序结果在相应甲基化测序区间的甲基化信息中的出现次数;所述阅读区间中各类测序结果的排列顺序预先设定;
    S30,将滑动窗口在所述阅读区间按照滑动步长从第一个位点至最后一个位点滑动,根据各类测序结果的Reads数目获取各次窗口读取过程中,滑动窗口的各个序列组合的出现次数;其中,所述滑动窗口在读取各个序列组合的出现次数后,按照所述滑动步长向后滑动;
    S40,根据各个甲基化测序区间中各个序列组合的出现次数确定各个甲基化测序区间在各次窗口读取过程中产生的次数向量;
    S50,将各个甲基化测序区间在各次窗口读取过中产生的次数向量拼接为各个甲基化测序区间的甲基化向量。
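  2. The method according to claim 1, characterized in that sliding the sliding window over the reading interval from the first site to the last site according to the sliding step size, and obtaining, from the Reads counts of the types of sequencing results, the number of occurrences of each sequence combination of the sliding window in each window reading comprises:
    S31, in the m-th window reading, placing the first position of the sliding window at position (s(m-1)+1) of the reading interval, and reading, from the Reads counts of the types of sequencing results, the number of occurrences of each sequence combination in the sliding window, wherein the initial value of m is 1 and s denotes the sliding step size;
    S32, if, in the m-th window reading, the last position of the sliding window is not the last position of the reading interval, updating m to m+1 and returning to step S31, until the last position of the sliding window is the last position of the reading interval.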
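  3. The method according to claim 2, characterized in that, if the last position of the sliding window extends beyond the reading interval, reading, from the Reads counts of the types of sequencing results, the number of occurrences of each sequence combination in the sliding window comprises:
    shortening the sliding window so that its last position falls on the last position of the reading interval, and determining, from the Reads counts of the types of sequencing results, the number of occurrences of each sequence combination in the current sliding window.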
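  4. The method according to claim 2, characterized in that determining, from the number of occurrences of each sequence combination, the count vector produced in each window reading comprises:
    setting the order in which the sequence combinations of the sliding window are arranged, to obtain a combination order;
    in each window reading, arranging the numbers of occurrences of the sequence combinations according to the combination order, to obtain the count vector.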
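  5. The method according to claim 2, characterized in that concatenating the count vectors produced in the window readings into the methylation vector comprises:
    joining the count vectors produced in the window readings end to end in window reading order, to obtain the methylation vector.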
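  6. The method according to any one of claims 1 to 5, characterized in that obtaining the methylation information for each methylation sequencing interval of the test sample comprises:
    performing bisulfite sequencing on each methylation sequencing interval of the test sample using a preset panel, so that methylated sites in each methylation sequencing interval are represented by a first identifier and unmethylated sites are represented by a second identifier, to obtain the methylation information of each site in each methylation sequencing interval.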
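  7. An apparatus for vectorized characterization of methylation levels, characterized by comprising:
    a first acquisition module, configured to obtain methylation information for each methylation sequencing interval of a test sample, wherein the test sample comprises a plurality of methylation sequencing intervals;
    a first determination module, configured to determine, from the methylation information of each methylation sequencing interval, the Reads count of each type of sequencing result in a preset reading interval, wherein the Reads count is the number of occurrences of the corresponding type of sequencing result in the methylation information of the corresponding methylation sequencing interval, and the order in which the types of sequencing results are arranged in the reading interval is preset;
    a sliding module, configured to slide a sliding window over the reading interval from the first site to the last site according to a sliding step size, and to obtain, from the Reads counts of the types of sequencing results, the number of occurrences of each sequence combination of the sliding window in each window reading, wherein after reading the number of occurrences of each sequence combination, the sliding window slides onward by the sliding step size;
    a second determination module, configured to determine, from the number of occurrences of each sequence combination in each methylation sequencing interval, the count vector produced by each methylation sequencing interval in each window reading;
    a reading module, configured to concatenate the count vectors produced by each methylation sequencing interval in the window readings into the methylation vector of that methylation sequencing interval.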
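  8. A method for detecting specific sequencing intervals, characterized in that the method comprises:
    obtaining the methylation vector of each methylation sequencing interval in each sample by the method for vectorized characterization of methylation levels according to any one of claims 1 to 6;
    dividing the samples into training samples and test samples; determining the methylation vector of each methylation sequencing interval in the training samples as a training vector and the methylation vector of each methylation sequencing interval in the test samples as a test vector; determining the methylation sequencing intervals having the same interval order as one group of methylation sequencing intervals, to obtain a plurality of groups of methylation sequencing intervals; inputting each group of training vectors into a classification model for training, to obtain an evaluation model for each group of methylation sequencing intervals; and testing each evaluation model against the corresponding group of test vectors, to determine the evaluation index of each methylation sequencing interval; wherein the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group;
    determining each methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold as a specific sequencing interval.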
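  9. The method according to claim 8, characterized in that dividing the samples into training samples and test samples, determining the methylation vector of each methylation sequencing interval in the training samples as a training vector, determining the methylation vector of each methylation sequencing interval in the test samples as a test vector, determining the methylation sequencing intervals having the same interval order as one group of methylation sequencing intervals to obtain a plurality of groups of methylation sequencing intervals, inputting each group of training vectors into a classification model for training to obtain an evaluation model for each group of methylation sequencing intervals, and testing each evaluation model against the corresponding group of test vectors to determine the evaluation index of each methylation sequencing interval comprises:
    dividing the samples into training samples and test samples a plurality of times;
    in each division, determining the methylation vector of each methylation sequencing interval in the training samples as a training vector and the methylation vector of each methylation sequencing interval in the test samples as a test vector; determining the methylation sequencing intervals having the same interval order as one group of methylation sequencing intervals, to obtain a plurality of groups of methylation sequencing intervals; inputting each group of training vectors into a classification model for training, to obtain an evaluation model for each group of methylation sequencing intervals; and testing each evaluation model against the corresponding group of test vectors, to obtain the AUC of each evaluation model in that division;
    obtaining the AUC of each evaluation model in each division, to obtain a plurality of AUCs corresponding to each methylation sequencing interval, and determining the evaluation index of each methylation sequencing interval from the mean or the median of the AUCs corresponding to that methylation sequencing interval.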
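  10. The method according to claim 9, characterized by further comprising:
    searching the methylation vector of each specific sequencing interval for vector elements that have a specific contribution to the training of the corresponding evaluation model, to obtain a plurality of target elements;
    detecting the site or site combination to which each target element corresponds in the corresponding specific sequencing interval, and determining the detected site or site combination as a methylation site or methylation site combination of that specific sequencing interval.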
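  11. The method according to claim 10, characterized in that searching the methylation vector of each specific sequencing interval for vector elements that have a specific contribution to the training of the corresponding evaluation model, to obtain a plurality of target elements, comprises:
    in each division, obtaining the importance ranking parameter output by each specific model for each vector element of the specific vector, to obtain a plurality of importance ranking parameters for each vector element, wherein the specific model is the evaluation model corresponding to a specific sequencing interval, and the specific vector is the methylation vector corresponding to a specific sequencing interval;
    adding up the importance ranking parameters of each vector element to obtain a sum for each vector element, and determining the vector elements whose sums rank ahead of a set position as the vector elements having the specific contribution.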
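  12. The method according to claim 10, characterized in that searching the methylation vector of each specific sequencing interval for vector elements that have a specific contribution to the training of the corresponding evaluation model, to obtain a plurality of target elements, comprises:
    in each division, obtaining the importance score output by each specific model for each vector element of the specific vector, to obtain a plurality of scores for each vector element, wherein the specific model is the evaluation model corresponding to a specific sequencing interval, and the specific vector is the methylation vector corresponding to a specific sequencing interval;
    determining the vector elements for which the number of non-zero importance scores is greater than or equal to a count threshold as the vector elements having the specific contribution.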
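  13. The method according to any one of claims 8 to 12, characterized in that the specific threshold is determined by:
    determining a set proportion of the maximum of the evaluation indices as the specific threshold.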
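  14. An apparatus for detecting specific sequencing intervals, characterized by comprising:
    a second acquisition module, configured to obtain the methylation vector of each methylation sequencing interval in each sample using the apparatus for vectorized characterization of methylation levels according to claim 7;
    a division module, configured to divide the samples into training samples and test samples; determine the methylation vector of each methylation sequencing interval in the training samples as a training vector and the methylation vector of each methylation sequencing interval in the test samples as a test vector; determine the methylation sequencing intervals having the same interval order as one group of methylation sequencing intervals, to obtain a plurality of groups of methylation sequencing intervals; input each group of training vectors into a classification model for training, to obtain an evaluation model for each group of methylation sequencing intervals; and test each evaluation model against the corresponding group of test vectors, to determine the evaluation index of each methylation sequencing interval; wherein the interval order of the methylation sequencing intervals in each sample is preset, and the methylation vectors of a group of methylation sequencing intervals comprise the group of training vectors and the group of test vectors corresponding to that group;
    a third determination module, configured to determine each methylation sequencing interval whose evaluation index is greater than or equal to a specific threshold as a specific sequencing interval.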
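  15. A method for recording specific sequencing intervals, characterized in that the method comprises:
    obtaining the specific sequencing intervals of each sample by the method for detecting specific sequencing intervals according to any one of claims 8 to 13; setting up a dictionary for each specific sequencing interval, such that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that specific sequencing interval; and recording each dictionary in a specific format, so as to record each specific sequencing interval.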
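  16. An apparatus for recording specific sequencing intervals, characterized by comprising:
    a second acquisition module, configured to obtain the specific sequencing intervals of each sample using the apparatus for detecting specific sequencing intervals according to claim 14; set up a dictionary for each specific sequencing interval, such that the key of each dictionary represents the corresponding specific sequencing interval and the value of each dictionary represents the methylation vector of that specific sequencing interval; and record each dictionary in a specific format, so as to record each specific sequencing interval.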
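  17. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for vectorized characterization of methylation levels according to any one of claims 1 to 6, the method for detecting specific sequencing intervals according to any one of claims 8 to 13, or the method for recording specific sequencing intervals according to claim 15.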
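  18. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for vectorized characterization of methylation levels according to any one of claims 1 to 6, the method for detecting specific sequencing intervals according to any one of claims 8 to 13, or the method for recording specific sequencing intervals according to claim 15.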
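
The selection and recording steps recited in claims 9, 13, and 15 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the claimed implementation: the per-division AUC values are taken as given (training of the classification model is omitted), JSON is assumed as the "specific format", and the function names, the `ratio` parameter, and the interval labels used as dictionary keys are hypothetical.

```python
import json
from statistics import mean, median


def evaluation_index(aucs, use_median=False):
    """Collapse the AUCs from repeated train/test divisions into one index."""
    return median(aucs) if use_median else mean(aucs)


def select_specific_intervals(aucs_by_interval, ratio=0.9, use_median=False):
    """Keep the intervals whose index reaches ratio * (largest index)."""
    index = {
        name: evaluation_index(aucs, use_median)
        for name, aucs in aucs_by_interval.items()
    }
    # Threshold taken as a set proportion of the maximum evaluation index.
    threshold = ratio * max(index.values())
    return [name for name, value in index.items() if value >= threshold]


def record_intervals(selected, vectors_by_interval):
    """Serialize one {interval: methylation vector} dictionary per interval.

    Assumption: JSON stands in for the unspecified recording format.
    """
    return [json.dumps({name: vectors_by_interval[name]}) for name in selected]
```

Each returned string holds a single dictionary, so the records can be written one per line and parsed back independently.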
PCT/CN2021/086169 2020-05-27 2021-04-09 Vectorized characterization of methylation levels, and method and apparatus for detecting specific sequencing intervals WO2021238441A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010462199.5A CN111627499B (zh) 2020-05-27 2020-05-27 Vectorized characterization of methylation levels, and method and apparatus for detecting specific sequencing intervals
CN202010462199.5 2020-05-27

Publications (1)

Publication Number Publication Date
WO2021238441A1 (zh) 2021-12-02

Family

ID=72271903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086169 WO2021238441A1 (zh) 2020-05-27 2021-04-09 Vectorized characterization of methylation levels, and method and apparatus for detecting specific sequencing intervals

Country Status (2)

Country Link
CN (1) CN111627499B (zh)
WO (1) WO2021238441A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627499B (zh) * 2020-05-27 2020-12-08 广州市基准医疗有限责任公司 Vectorized characterization of methylation levels, and method and apparatus for detecting specific sequencing intervals

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
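CN102061337A (zh) * 2010-11-24 2011-05-18 深圳华大基因科技有限公司 Method and system for detecting tissue-specific differentially methylated regions
CN109637583A (zh) * 2018-12-20 2019-04-16 中国科学院昆明植物研究所 Method for detecting differentially methylated regions in plant genomes
WO2019209884A1 (en) * 2018-04-23 2019-10-31 Grail, Inc. Methods and systems for screening for conditions
CN111627499A (zh) * 2020-05-27 2020-09-04 广州市基准医疗有限责任公司 Vectorized characterization of methylation levels, and method and apparatus for detecting specific sequencing intervals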

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
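CN103103624B (zh) * 2011-11-15 2014-12-31 深圳华大基因科技服务有限公司 Method for constructing a high-throughput sequencing library and application thereof
CN103136587B (zh) * 2013-03-07 2015-12-09 武汉大学 Support vector machine-based method for classifying and identifying distribution network operating states
CN103233072B (zh) * 2013-05-06 2014-07-02 中国海洋大学 High-throughput whole-genome DNA methylation detection technique
CN103559423B (zh) * 2013-10-31 2017-02-15 深圳先进技术研究院 Method and apparatus for predicting methylation
CN107273663B (zh) * 2017-05-22 2018-12-11 人和未来生物科技(长沙)有限公司 Method for computing and interpreting DNA methylation sequencing data
CN110211633B (zh) * 2019-05-06 2021-08-31 臻和精准医学检验实验室无锡有限公司 Method for detecting MGMT gene promoter methylation, and method and apparatus for processing sequencing data
CN110334748A (zh) * 2019-06-14 2019-10-15 大连理工大学 Cancer subtype classification method integrating multi-omics data based on D-S evidence theory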

Also Published As

Publication number Publication date
CN111627499B (zh) 2020-12-08
CN111627499A (zh) 2020-09-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21814295

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21814295

Country of ref document: EP

Kind code of ref document: A1