CN116168761B - Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium - Google Patents

Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium


Publication number
CN116168761B
Authority
CN
China
Prior art keywords
nucleic acid
acid sequence
sample
window
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310409461.3A
Other languages
Chinese (zh)
Other versions
CN116168761A (en
Inventor
吕行
邝英兰
黄萌
叶莘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Livzon Cynvenio Diagnostics Ltd
Original Assignee
Zhuhai Livzon Cynvenio Diagnostics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Livzon Cynvenio Diagnostics Ltd
Priority to CN202310409461.3A
Publication of CN116168761A
Application granted
Publication of CN116168761B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides a method, a device, electronic equipment and a storage medium for determining a characteristic region of a nucleic acid sequence, belonging to the technical field of biological data processing. The method comprises: determining the nucleic acid sequence data within each candidate window; constructing a feature matrix based on the area values of the methylation score curves; and performing feature selection to obtain the characteristic regions. In this method, a plurality of candidate windows are obtained, the feature matrix is constructed from the area values of the methylation scores corresponding to the candidate windows, and feature selection is then performed on matrix data drawn by resampling with replacement, so that the downstream model can focus on feature selection for the samples. This overcomes the shortcomings of purely biological and purely statistical approaches, and obtaining a suitable number of effective nucleic acid sequence region features improves the generalization ability of the downstream model.

Description

Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium
Technical Field
The present invention relates to the field of biological data processing technologies, and in particular, to a method and apparatus for determining a characteristic region of a nucleic acid sequence, an electronic device, and a storage medium.
Background
For sequence type data such as biological gene information, electrocardiographic information and voice information, a machine learning or deep learning method is generally adopted to input data for classification model construction when classification or abnormality detection is carried out. With the development of computer and sequencing technologies, more and more large-scale biological data are generated, and DNA methylation is used as an epigenetic marker for wide research and plays a vital role in tumor discovery.
Currently available methylation data suffer from a severe imbalance between the sample size and the number of differential methylation regions. In the prior art, related models perform classification, clustering and the like based on differential methylation regions, but an excess of differential methylation region data disturbs the training direction of the classification model. Because the amount of data is large while the sample size is relatively small, building a classification model directly on the data easily leads to overfitting. Therefore, how to efficiently obtain suitable training samples, so as to help the downstream model improve its generalization ability in identifying key nucleic acid sequence features of biological tissue, is an urgent problem to be solved.
Disclosure of Invention
The invention provides a method, a device, electronic equipment and a storage medium for determining a characteristic region of a nucleic acid sequence, which address the defect in the prior art that directly using a large amount of methylation data causes the model to overfit, and which obtain a suitable number of effective feature values of differential methylation regions so as to improve the generalization ability of the downstream model.
The invention provides a method for determining a characteristic region of a nucleic acid sequence, which comprises the following steps:
dividing the nucleic acid sequence data of each target differential methylation region by adopting windows with preset lengths according to preset step sizes, screening candidate windows from each window according to preset conditions, and determining the nucleic acid sequence data in each candidate window; the nucleic acid sequence data of the target differential methylation region are obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample;
determining methylation scores of CpG sites of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on an area value formed by methylation score curves and coordinate axes of each candidate window and classification labels of biological samples corresponding to the nucleic acid sequence data of each candidate window;
extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening training sample windows from the candidate windows to obtain the characteristic regions corresponding to the training sample windows.
According to the method for determining the characteristic region of the nucleic acid sequence, the row data of each row of the characteristic matrix corresponds to one biological sample, the column data of each column of the characteristic matrix corresponds to the area value corresponding to each biological sample in one candidate window and the classification label of the corresponding biological sample, and the classification label comprises information belonging to a first sample or information belonging to a second sample.
According to the method for determining the characteristic region of the nucleic acid sequence, the nucleic acid sequence data of the target differential methylation region is obtained by the following steps:
obtaining nucleic acid sequence data of a plurality of first samples and second samples from a nucleic acid sequence database;
performing nucleic acid sequence fragment quality calculation and quality score filtering on the obtained nucleic acid sequence data to obtain high-throughput nucleic acid sequence data;
determining the methylation state of CpG sites in the high-throughput nucleic acid sequence data by alignment against genomic sequence data, and performing differential methylation region analysis on the high-throughput nucleic acid sequence data to obtain the nucleic acid sequence data of the target differential methylation region.
According to the method for determining the characteristic region of the nucleic acid sequence provided by the invention, the preset condition is: the number of CpG sites in the nucleic acid sequence data within the candidate window is greater than or equal to a first threshold.
According to the method for determining the characteristic region of the nucleic acid sequence provided by the invention, performing feature selection on the extracted matrix data and screening training sample windows from the candidate windows to obtain the characteristic regions corresponding to the training sample windows comprises the following steps:
calculating the inter-group variance and the intra-group variance of the area values corresponding to the first sample and the second sample among the extracted biological samples;
if the inter-group variance corresponding to the first sample and the second sample in the currently extracted candidate window is larger than the intra-group variance, determining the currently extracted candidate window as an alternative training sample window;
and screening the training sample window from the alternative training sample windows based on a Lasso regression analysis method to obtain a characteristic region corresponding to the training sample window.
According to the method for determining the characteristic region of the nucleic acid sequence provided by the invention, the method for screening the training sample window from the alternative training sample windows based on the Lasso regression analysis method comprises the following steps:
performing Lasso regression analysis on the corresponding area values in the alternative training sample window to construct a feature selection model;
and determining the training sample window based on the area value of the candidate training sample window with the regression coefficient not being 0 in the constructed feature selection model.
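The Lasso-based screening described above can be illustrated with a minimal sketch. The source does not name a solver, a penalty strength, or any preprocessing, so everything below is an assumption made for illustration: the solver is plain coordinate descent, `lam` is a hypothetical penalty parameter, and the area values and labels are assumed to be centered beforehand. Windows whose coefficient is not 0 are kept.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam (the Lasso proximal step)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by coordinate descent.

    X: rows are biological samples, columns are alternative windows
       (centered area values); y: centered labels. Assumed layout.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # prediction from all columns except column j
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w

def selected_windows(X, y, lam):
    """Indices of windows whose regression coefficient is not 0."""
    w = lasso_coordinate_descent(X, y, lam)
    return [j for j, coef in enumerate(w) if coef != 0.0]

# Toy data: column 0 tracks the label, column 1 is unrelated to it.
X = [[-1.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [1.0, -0.1]]
y = [-1.0, -1.0, 1.0, 1.0]
print(selected_windows(X, y, lam=0.1))
```

In practice a library implementation with cross-validated penalty selection would likely be preferable; the hand-written solver is used here only to keep the sketch self-contained.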
According to the method for determining the characteristic region of the nucleic acid sequence provided by the invention, after extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening training sample windows from the candidate windows to obtain the characteristic regions corresponding to the training sample windows, the method further comprises the following steps:
obtaining regional feature data within each training sample window from sample nucleic acid sequence data;
dividing the sample nucleic acid sequence data into a training set and a testing set, training a classification model by utilizing regional characteristic data corresponding to the sample nucleic acid sequence data in the training set, and classifying the testing set by utilizing the trained classification model;
and evaluating the classification performance of the classification model according to the classification result and the classification label of the sample nucleic acid sequence data in the test set.
The invention also provides a nucleic acid sequence characteristic region determining device, which comprises:
the first processing module is used for dividing the window with the preset length for the nucleic acid sequence data of each target differential methylation region according to the preset step length, screening candidate windows from each window according to the preset conditions, and determining the nucleic acid sequence data in each candidate window; the nucleic acid sequence data of the target differential methylation region are obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample;
the second processing module is used for determining methylation scores of all CpG sites of the nucleic acid sequence data in all candidate windows and constructing a feature matrix based on an area value formed by methylation score curves and coordinate axes of all candidate windows and classification labels of biological samples corresponding to the nucleic acid sequence data of all candidate windows;
the third processing module is used for extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening training sample windows from the candidate windows to obtain the characteristic regions corresponding to the training sample windows.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the nucleic acid sequence characteristic region determination method as any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of determining a characteristic region of a nucleic acid sequence as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of determining a region of a nucleic acid sequence feature as described in any one of the above.
According to the method, the device, the electronic equipment and the storage medium for determining the characteristic region of the nucleic acid sequence provided by the invention, the target differential methylation regions are segmented into windows and preliminarily screened to obtain a plurality of candidate windows, the feature matrix is constructed from the area values of the methylation scores corresponding to the candidate windows, and feature selection is performed on the matrix data drawn by resampling with replacement, so that the downstream model can focus on feature selection for the samples. This mitigates both the inaccurate feature selection of a purely biological approach and the overfitting that a small sample size causes in a purely statistical approach, and obtaining a suitable number of effective nucleic acid sequence region features improves the generalization ability of the downstream model.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for determining a characteristic region of a nucleic acid sequence according to the present invention;
FIG. 2 is a second flow chart of a method for determining a characteristic region of a nucleic acid sequence according to the present invention;
FIG. 3 is a schematic diagram showing the structure of a nucleic acid sequence characteristic region determining apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The nucleic acid sequence characteristic region determination method, apparatus, electronic device, and storage medium of the present invention are described below with reference to fig. 1 to 4.
Currently available methylation data suffer from a severe imbalance between the sample size and the number of methylation sites. In the prior art, related models perform classification and clustering based on differential methylation regions, but an excess of differential methylation region data disturbs the training direction of the classification model. Because the amount of data is large and the sample size is relatively small, building a classification model directly on the differential methylation region data in existing data easily leads to overfitting. The method for determining the characteristic region of the nucleic acid sequence according to the embodiment of the invention can obtain a suitable amount of feature data from a limited set of samples; in particular, when the limited samples contain a large amount of differential methylation region data, it identifies the important regions and extracts the nucleic acid sequence characteristic region data, which is then used to train the related model and can markedly improve that model's generalization ability.
The method for determining a characteristic region of a nucleic acid sequence according to an embodiment of the present invention will be described below. The execution subject of the method for determining a characteristic region of a nucleic acid sequence according to the embodiment of the present invention may be a processor, or may be a server, and the specific type of the execution subject is not limited herein. The following describes a method for determining a characteristic region of a nucleic acid sequence according to an embodiment of the present invention using an execution body as a processor.
As shown in fig. 1, the method for determining a characteristic region of a nucleic acid sequence according to an embodiment of the present invention mainly includes step 110, step 120, and step 130.
Step 110, dividing the nucleic acid sequence data of each target differential methylation region by using windows with preset lengths according to preset step sizes, screening candidate windows from each window according to preset conditions, and determining the nucleic acid sequence data in each candidate window.
The nucleic acid sequence data of the target differential methylation region was obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample.
It is understood that the first sample and the second sample may be biological tissue samples collected from different populations of people, such as tumor samples and non-tumor samples, or samples in different physiological states, and the like, and the specific types of the first sample and the second sample are not limited herein.
In some embodiments, differential methylation regions can be screened from the nucleic acid sequence data of the first sample and the second sample. Once a plurality of differential methylation regions has been obtained, a further screening can quickly filter out the weakly discriminating ones to obtain the target differential methylation regions. In this way, features that discriminate more strongly between the nucleic acid sequences of the first sample and the second sample are selected, which improves the accuracy and efficiency of feature selection.
In this example, the nucleic acid sequence data of the target differential methylation region is obtained in the following manner.
The nucleic acid sequence data of a plurality of first samples and second samples may first be obtained from a nucleic acid sequence database, for example a public database. The first and second samples may be tissue, blood, or other samples. For example, the obtained nucleic acid sequence data of the first sample and the second sample may be a methylation chip matrix, methylation high-throughput sequencing data, or the like.
On this basis, nucleic acid sequence fragment quality calculation and quality score filtering can be performed on the obtained nucleic acid sequence data to obtain high-throughput nucleic acid sequence data. The methylation state of the CpG sites in the high-throughput nucleic acid sequence data is then determined by alignment against genomic sequence data, that is, it is determined whether a methylation change has occurred at each CpG site in the sequencing data. Differential methylation region analysis is then performed on the high-throughput nucleic acid sequence data to obtain the nucleic acid sequence data of the target differential methylation region.
When the differential methylation region analysis is carried out, in view of the characteristics of the methylation qPCR reaction and the design principles of the related primers and probes, the probe interval can be set to no more than a certain number of bp (base pairs), for example no more than 50bp, 100bp, 200bp or 500bp, and each differential methylation region includes at least a certain number of CpG sites, for example at least 2-10 CpG sites. On this basis, a plurality of differential methylation regions can be obtained, from which the target differential methylation regions can be screened according to certain conditions.
In this embodiment, the requirements of the downstream molecular biology analysis platform, such as primer and probe requirements, are taken into account: the probe interval and the number of probes used in the differential methylation region analysis are set reasonably, and the length of the differential methylation regions selected for analysis is set accordingly, which improves the practical usefulness of the features subsequently extracted from the nucleic acid sequence data.
After the target differential methylation regions are obtained, the nucleic acid sequence data of each target differential methylation region is segmented using windows of a preset length at a preset step size, candidate windows are screened from the resulting windows according to a preset condition, and the nucleic acid sequence data within each candidate window is determined.
It will be appreciated that the specific size of the preset length and preset step size may be set according to the requirements of the subsequent downstream molecular biological analysis platform.
For example, the preset length of the window may be set to 200 CpG sites and the preset step size to 1 CpG site; a set of sliding window views is created for each target differential methylation region, yielding a plurality of sliding windows, and the number of CpG sites is counted for each sliding window view.
In this case, considering the significance of the methylation features within the sliding windows, a preset condition may be set to screen the sliding windows, yielding a certain number of candidate windows and the nucleic acid sequence data within them.
In some embodiments, the preset condition may be that the number of CpG sites in the nucleic acid sequence data within the candidate window is greater than or equal to a first threshold.
For example, the first threshold may be 6. In that case, sliding windows containing fewer than 6 CpG sites are filtered out, which removes weakly discriminating nucleic acid sequence fragments and leaves the candidate windows, ensuring that the nucleic acid sequence data within the candidate windows has clear discriminating power.
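The window segmentation and CpG-count screening of step 110 can be sketched as follows. This is only an illustration under stated assumptions: a window is opened at every CpG site (a step of 1 CpG site, as in the example above), but the window span is measured here in base pairs through the hypothetical parameter `window_bp`, so that the CpG count per window can vary; the function name and input layout are likewise illustrative and not taken from the source.

```python
from bisect import bisect_left

def candidate_windows(cpg_positions, window_bp=200, min_cpgs=6):
    """Return (start, end, n_cpgs) for windows passing the CpG-count filter.

    cpg_positions: sorted genomic coordinates of the CpG sites in one
    target differential methylation region (assumed input layout).
    Each window covers the half-open interval [start, start + window_bp).
    """
    windows = []
    for i, start in enumerate(cpg_positions):
        end = start + window_bp
        # index of the first CpG site at or beyond the window end
        j = bisect_left(cpg_positions, end, lo=i)
        n_cpgs = j - i
        if n_cpgs >= min_cpgs:  # preset condition: at least the first threshold
            windows.append((start, end, n_cpgs))
    return windows

positions = [100, 120, 135, 150, 170, 190, 400, 410, 900]
print(candidate_windows(positions))  # [(100, 300, 6)]
```

Only the window opened at position 100 contains six CpG sites, so it is the only candidate window retained for this toy region.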
Step 120, determining methylation scores of each CpG site of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on the area value formed by the methylation score curve and the coordinate axis of each candidate window and the classification label of the biological sample corresponding to the nucleic acid sequence data of each candidate window.
In this embodiment, the area value enclosed by the methylation score curve of the CpG sites within each candidate window and the coordinate axis reflects the overall methylation level of all nucleic acid sequence fragments at the same CpG site or region within an individual.
It should be noted that a single CpG site is independent of other CpG sites, and a differential methylation region may contain multiple CpG sites; the methylation score of a single CpG site is obtained by averaging or summing. For example, for the methylation data of the multiple nucleic acid sequence fragments present in any window, the number of methylation events at each CpG site can be counted first, and the average or the sum of these counts is then calculated for each CpG site. This yields the methylation score curve of the CpG sites in each candidate window, from which the area value enclosed by the curve and the coordinate axis is determined.
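A minimal sketch of the score and area computation just described, under explicit assumptions: each nucleic acid sequence fragment is represented as a mapping from CpG position to 1 (methylated) or 0 (unmethylated), the per-site score is the average over fragments, and the area is computed with the trapezoidal rule. The source does not specify the integration rule or the data layout, so both are illustrative choices.

```python
def methylation_scores(fragments):
    """Average methylation state per CpG site across fragments.

    fragments: list of dicts {cpg_position: 1 or 0} (assumed layout).
    Returns {cpg_position: score} ordered by position.
    """
    totals, counts = {}, {}
    for fragment in fragments:
        for pos, state in fragment.items():
            totals[pos] = totals.get(pos, 0) + state
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}

def curve_area(scores):
    """Trapezoidal area between the score curve and the position axis."""
    pts = sorted(scores.items())
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

fragments = [
    {100: 1, 110: 1, 130: 0},
    {100: 1, 110: 0, 130: 0},
]
scores = methylation_scores(fragments)
print(scores)              # {100: 1.0, 110: 0.5, 130: 0.0}
print(curve_area(scores))  # 12.5
```

The area is 10 * (1.0 + 0.5) / 2 + 20 * (0.5 + 0.0) / 2 = 12.5; this single number per window and sample is what would enter the feature matrix.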
It can be understood that each candidate window nucleic acid sequence data corresponds to identification information of a biological sample and a classification tag of the biological sample, the classification tag can be information belonging to a first sample or information belonging to a second sample, and the classification tag can be used as a tag of a different candidate window nucleic acid sequence fragment so as to facilitate type recognition of a subsequent extracted feature.
On this basis, a feature matrix can be constructed from the area value corresponding to each candidate window and the classification label of the biological sample corresponding to the nucleic acid sequence data of each candidate window. Each row of the feature matrix corresponds to one biological sample, and each column corresponds to the area values of the biological samples in one candidate window together with the classification labels of those samples. The classification label serves as the label of each area value, distinguishing whether the area value corresponds to a first sample or a second sample.
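One possible in-memory layout of such a feature matrix is sketched below. The nested-dictionary input, the sample and window identifiers, and the choice to store the classification label as a final column are all assumptions made for illustration; the source only fixes that rows correspond to biological samples and columns to candidate windows with associated labels.

```python
def build_feature_matrix(areas, labels, window_ids):
    """Assemble one row per biological sample.

    areas:      {sample_id: {window_id: area_value}}  (hypothetical layout)
    labels:     {sample_id: 1 for a first sample, 0 for a second sample}
    window_ids: column order of the candidate windows
    """
    header = list(window_ids) + ["label"]
    rows = []
    for sample_id in sorted(areas):
        row = [areas[sample_id][w] for w in window_ids]
        rows.append(row + [labels[sample_id]])
    return header, rows

areas = {
    "S1": {"w1": 12.5, "w2": 3.0},
    "S2": {"w1": 2.0, "w2": 3.5},
}
labels = {"S1": 1, "S2": 0}
header, rows = build_feature_matrix(areas, labels, ["w1", "w2"])
print(header)  # ['w1', 'w2', 'label']
print(rows)    # [[12.5, 3.0, 1], [2.0, 3.5, 0]]
```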
Step 130, extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening training sample windows from the candidate windows to obtain the characteristic regions corresponding to the training sample windows.
It can be understood that the number of biological samples, comprising the first samples and the second samples, is small, while the number of candidate windows obtained by the screening above is still large. In this embodiment, resampling with replacement may therefore be used to repeatedly extract matrix data from the constructed feature matrix, and feature selection is performed on the extracted data, so as to mitigate the overfitting caused by the small number of samples.
It should be noted that in each round of extraction, 80% of the matrix data can be extracted for analysis. After the analysis is completed, the data extracted in that round is put back and the next round of extraction and analysis is carried out. Based on the results of all rounds, the candidate windows that were selected in many rounds are chosen as training sample windows. The number of rounds may be set according to actual needs, for example 100, and is not limited here.
In this embodiment, resampling with replacement reduces the degree to which the screened regions overfit because of the small number of samples and ensures that the extracted candidate windows are representative. Screening the candidate windows further reduces their number and yields more accurate model training data, which further reduces overfitting and increases both the robustness of feature selection and the performance of downstream tasks.
In this embodiment, the selection of important candidate windows is recast as a feature selection problem. This addresses two weaknesses of the conventional biological approach (such as screening by the hypermethylated or hypomethylated state of the region data): its relatively poor generalization ability, and the inaccurate window selection that results when windows are screened with prior biological knowledge, because the methylation states presented by different samples are diverse and subject to atypical interference. By constructing the feature matrix, the classification labels of the biological samples are associated with the methylation feature scores of the nucleic acid sequence fragments in the candidate windows, and feature selection lets the downstream model focus on the important characteristic regions. This mitigates the overfitting that a small sample size causes in a purely statistical approach, and it allows the trained downstream model to be interpreted and its data presented using easily recognizable descriptions, which improves the interpretability of the downstream model.
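The repeated draw-analyze-return procedure can be sketched as below. Several details are assumptions, since the source gives only the 80% fraction and the example of 100 rounds: rows are drawn without duplication within a round and returned to the pool between rounds, `select` is a hypothetical per-round feature selector, and `min_share` is a hypothetical cut-off for how often a window must be selected. The `mean_gap_selector` is a placeholder for demonstration, not the selection rule of the method.

```python
import random
from collections import Counter

def stable_windows(rows, n_windows, select, rounds=100, frac=0.8,
                   min_share=0.5, seed=0):
    """Return indices of windows selected in at least min_share of rounds.

    rows: feature-matrix rows, area values first and the label last.
    select(subset, n_windows): returns the window indices chosen in a round.
    """
    rng = random.Random(seed)
    k = max(1, int(len(rows) * frac))
    hits = Counter()
    for _ in range(rounds):
        # drawn rows go back into the pool before the next round
        subset = rng.sample(rows, k)
        hits.update(select(subset, n_windows))
    return [w for w in range(n_windows) if hits[w] >= min_share * rounds]

def mean_gap_selector(subset, n_windows, gap=0.5):
    """Placeholder selector: keep windows whose group means differ by > gap."""
    chosen = []
    for w in range(n_windows):
        a = [r[w] for r in subset if r[-1] == 1]
        b = [r[w] for r in subset if r[-1] == 0]
        if a and b and abs(sum(a) / len(a) - sum(b) / len(b)) > gap:
            chosen.append(w)
    return chosen

# Window 0 separates the two groups; window 1 is identical in both.
rows = [[1.0, 0.3, 1] for _ in range(5)] + [[0.0, 0.3, 0] for _ in range(5)]
print(stable_windows(rows, 2, mean_gap_selector))  # [0]
```

Counting how often each window survives across rounds is what makes the final choice less sensitive to any single small subsample.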
According to the method for determining the characteristic region of the nucleic acid sequence, the window segmentation and preliminary screening are carried out on the target differential methylation region to obtain a plurality of candidate windows, the area value of the methylation fraction corresponding to each candidate window is utilized to construct and obtain the characteristic matrix, and the characteristic selection is carried out on the candidate windows obtained in a replaced resampling mode, so that the downstream model can focus on the characteristic selection of a sample, the phenomenon of overfitting caused by inaccurate characteristic selection in a pure biological mode and small sample quantity in a pure statistical mode is improved, and the characteristic of a proper amount of effective nucleic acid sequence region is obtained to improve the generalization capability of the downstream model.
In some embodiments, as shown in Fig. 2, the step of performing feature selection on the extracted matrix data and screening a training sample window from the candidate windows further includes step 131, step 132, and step 133.
Step 131: for each candidate window, calculate the inter-group variance and the intra-group variance of the area values corresponding to the first sample and the second sample in the extracted biological samples.
Step 132: if the inter-group variance corresponding to the first sample and the second sample in the current candidate window is greater than the intra-group variance, determine the current candidate window as an alternative training sample window.
Step 133: screen the training sample window from the alternative training sample windows based on the Lasso regression analysis method to obtain the characteristic region corresponding to the training sample window.
It will be appreciated that, once the extracted matrix data (that is, the data corresponding to the biological samples) has been obtained, each nucleic acid sequence fragment can be assigned to the first sample or the second sample according to the classification label in the matrix data.
For a given candidate window in the sampled data, all data carrying the first sample label can be placed in one group and all data carrying the second sample label in another group, so that the inter-group variance and the intra-group variance of the area values corresponding to the first sample and the second sample can be calculated.
Specifically, the overall average is calculated from the area values of the first sample group and the second sample group taken together, and the sum of the squared differences between each of these area values and the overall average is taken as the overall deviation.
Further, the average of the first sample group and the average of the second sample group are calculated from the area values in the respective groups. The sum of the squared differences between the area values in the first sample group and the average of the first sample group is taken as the intra-group deviation of that group, and the intra-group deviation of the second sample group is obtained in the same way. The intra-group deviations of the two groups are summed to obtain the intra-group variance. On this basis, the difference between the overall deviation and the intra-group variance is taken as the inter-group variance corresponding to the first sample group and the second sample group.
If the inter-group variance of a candidate window is greater than its intra-group variance, this indicates that the distribution of at least one group lies far from that of the other, and the first sample group and the second sample group are judged to differ significantly in that candidate window. The candidate window is then regarded as an important window that can be used for training the subsequent model; that is, the extracted candidate window is determined to be an alternative training sample window, from which the characteristic region corresponding to the training sample window is subsequently obtained.
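The calculation described above can be sketched as follows. The quantities the text calls "variance" are computed here exactly as described, that is, as sums of squared deviations; the function names and the example area values are hypothetical.

```python
def sum_of_squares(values, center):
    """Sum of squared differences between each value and `center`."""
    return sum((v - center) ** 2 for v in values)


def between_and_within(first_group, second_group):
    """Return (inter-group, intra-group) sums of squares for one candidate window."""
    all_values = first_group + second_group
    overall_mean = sum(all_values) / len(all_values)
    overall = sum_of_squares(all_values, overall_mean)  # overall deviation

    first_mean = sum(first_group) / len(first_group)
    second_mean = sum(second_group) / len(second_group)
    within = (sum_of_squares(first_group, first_mean)
              + sum_of_squares(second_group, second_mean))

    between = overall - within  # inter-group part is what remains
    return between, within


def is_alternative_window(first_group, second_group):
    """Keep the window when the inter-group value exceeds the intra-group value."""
    between, within = between_and_within(first_group, second_group)
    return between > within


# Hypothetical area values of one candidate window, grouped by classification label.
separated = is_alternative_window([0.80, 0.82, 0.78], [0.20, 0.22, 0.18])
overlapping = is_alternative_window([0.40, 0.60, 0.50], [0.45, 0.55, 0.50])
```

In the first example the two groups are far apart relative to their spread, so the window is kept; in the second the group means coincide, so it is discarded.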
On this basis, the data of the alternative training sample windows are analyzed by the Lasso regression analysis method, which performs variable screening and complexity adjustment while fitting a generalized linear model. Feature screening is thereby refined further, so that the features of the screened training sample windows generalize better.
In some embodiments, screening the training sample window from the alternative training sample windows based on the Lasso regression analysis method to obtain the characteristic region corresponding to the training sample window includes: performing Lasso regression analysis on the area values corresponding to the alternative training sample windows to construct a feature selection model; and determining the training sample window based on the area values of the alternative training sample windows whose regression coefficients in the constructed feature selection model are not 0.
It should be noted that regression analysis may be performed on the area values corresponding to the alternative training sample windows by Lasso regression to obtain a corresponding regression model. The dependent variable in the regression analysis is the classification label of each area value datum, that is, whether it belongs to the first sample or the second sample, and the independent variables are the area values corresponding to the respective alternative training sample windows.
It can be understood that performing variable screening and complexity adjustment while fitting the generalized linear model compresses some of the regression coefficients to zero, which realizes feature screening. Specifically, an alternative training sample window whose regression coefficient in the constructed regression model is not 0 can be determined as a training sample window.
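The way an L1 penalty drives coefficients to exactly zero can be illustrated with a small coordinate-descent sketch. This is only an illustration under stated assumptions: it uses a squared-error loss on 0/1 labels rather than whichever generalized linear model the embodiment fits, and the penalty strength `alpha`, the iteration count, and the toy data are hypothetical. In practice a library implementation would normally be used instead.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, labels, alpha, iterations=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 on centered data."""
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    y_mean = sum(labels) / n
    y = [v - y_mean for v in labels]
    weights = [0.0] * p
    for _ in range(iterations):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Prediction using every feature except feature j.
                partial = sum(x[i][k] * weights[k] for k in range(p) if k != j)
                rho += x[i][j] * (y[i] - partial)
            rho /= n
            z = sum(x[i][j] ** 2 for i in range(n)) / n
            weights[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return weights


# Hypothetical area values: rows are samples, columns are alternative training sample windows.
rows = [
    [0.80, 0.50, 0.30],
    [0.82, 0.40, 0.31],
    [0.78, 0.60, 0.29],
    [0.20, 0.50, 0.30],
    [0.22, 0.60, 0.31],
    [0.18, 0.40, 0.29],
]
labels = [1, 1, 1, 0, 0, 0]

weights = lasso_coordinate_descent(rows, labels, alpha=0.01)
selected = [j for j, w in enumerate(weights) if w != 0.0]  # windows kept for training
```

Only the first window separates the two label groups in this toy data, so it is the only one whose coefficient survives the penalty; the other two are compressed to zero and dropped.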
In some embodiments, after the target number of matrix data are extracted from the feature matrix by resampling with replacement, feature selection is performed on the extracted matrix data, and the training sample window is screened from the candidate windows to obtain the characteristic region corresponding to the training sample window, the method for determining a nucleic acid sequence characteristic region according to the embodiment of the invention further includes: obtaining regional characteristic data in each training sample window from sample nucleic acid sequence data; dividing the sample nucleic acid sequence data into a training set and a test set, training a classification model using the regional characteristic data corresponding to the sample nucleic acid sequence data in the training set, and classifying the test set using the trained classification model to obtain a classification result for the sample nucleic acid sequence data in the test set; and evaluating the classification performance of the classification model according to the classification result and the classification labels of the sample nucleic acid sequence data in the test set.
The regional characteristic data corresponding to the sample nucleic acid sequence data serves as the training data, and the sample nucleic acid sequence data carries classification labels annotated in advance.
The sample nucleic acid sequence data can be divided into a training set and a test set in a certain proportion (for example, 8:2). After the characteristic region corresponding to the training sample window is obtained, the regional characteristic data of the sample nucleic acid sequence data within the coordinate range of the characteristic region on the nucleic acid sequence is obtained.
On this basis, the classification model is trained using the regional characteristic data corresponding to the sample nucleic acid sequence data in the training set, the test set is classified using the trained classification model, and the classification performance of the classification model is evaluated according to the classification result and the classification labels of the sample nucleic acid sequence data in the test set.
The classification model may be a random forest model, a decision tree, or the like. For example, where a random forest model is employed as the classification model, its effect may be assessed by the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). If the AUC value meets the preset performance standard, the selected characteristic region performs well; if the AUC value does not meet the preset performance standard, the values of some parameters in the process of determining the nucleic acid sequence characteristic region can be adjusted and optimized.
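The split and the AUC evaluation can be sketched as below. The classifier itself is deliberately left out: the scores passed to `auc` stand in for the output of whichever trained model is used (for example, a random forest), and the function names, seed, and example values are hypothetical.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at the requested proportion (8:2 by default)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a positive
    sample receives a higher score than a negative one (ties count one half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical sample identifiers, split 8:2.
samples = list(range(10))
train_set, test_set = split_train_test(samples)

# Hypothetical test-set labels and classifier scores.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the preset performance standard mentioned above would typically be a threshold between those two values.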
The nucleic acid sequence characteristic region determining apparatus provided by the present invention is described below. The apparatus described below and the method described above may be read with reference to each other.
As shown in Fig. 3, the nucleic acid sequence characteristic region determining apparatus according to the embodiment of the present invention mainly includes a first processing module 310, a second processing module 320, and a third processing module 330.
The first processing module 310 is configured to segment the nucleic acid sequence data of each target differential methylation region using windows of a preset length according to a preset step length, screen candidate windows from the windows according to a preset condition, and determine the nucleic acid sequence data in each candidate window; the nucleic acid sequence data of the target differential methylation region is obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample.
The second processing module 320 is configured to determine the methylation score of each CpG site of the nucleic acid sequence data in each candidate window, and to construct a feature matrix based on the area value enclosed by the methylation score curve of each candidate window and the coordinate axis, together with the classification label of the biological sample corresponding to the nucleic acid sequence data in each candidate window.
The third processing module 330 is configured to extract a target number of matrix data from the feature matrix by resampling with replacement, perform feature selection on the extracted matrix data, and screen a training sample window from the candidate windows to obtain the characteristic region corresponding to the training sample window.
According to the nucleic acid sequence characteristic region determining apparatus provided by the embodiment of the invention, window segmentation and preliminary screening are performed on the target differential methylation regions to obtain a plurality of candidate windows; a feature matrix is constructed from the area value of the methylation scores corresponding to each candidate window; and feature selection is performed on the data drawn by resampling with replacement. The downstream model can therefore focus on feature selection over the samples, which mitigates both the inaccurate feature selection of purely biological approaches and the overfitting caused by small sample sizes in purely statistical approaches, and yields a suitable number of effective nucleic acid sequence region features that improve the generalization capability of the downstream model.
In some embodiments, each row of the feature matrix corresponds to one biological sample, and the columns of the feature matrix hold the area value of each biological sample within one candidate window together with the classification label of that biological sample; the classification label comprises information belonging to the first sample or information belonging to the second sample.
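The layout of the feature matrix can be sketched as follows. The text specifies only that the area value is the area enclosed by the methylation score curve and the coordinate axis; the trapezoidal rule used here, the dictionary layout of a sample, and the window identifiers `w1` and `w2` are assumptions made for illustration.

```python
def trapezoid_area(positions, scores):
    """Area between a piecewise-linear methylation score curve and the position axis."""
    area = 0.0
    for i in range(1, len(positions)):
        width = positions[i] - positions[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area


def build_feature_matrix(samples, window_ids):
    """One row per biological sample: an area value per candidate window, then the label."""
    matrix = []
    for sample in samples:
        row = [trapezoid_area(*sample["windows"][w]) for w in window_ids]
        row.append(sample["label"])
        matrix.append(row)
    return matrix


# Hypothetical input: per window, CpG positions and their methylation scores.
samples = [
    {"label": 1, "windows": {"w1": ([100, 110, 130], [0.5, 1.0, 0.5]),
                             "w2": ([200, 220], [0.0, 0.5])}},
    {"label": 0, "windows": {"w1": ([100, 110, 130], [0.0, 0.2, 0.0]),
                             "w2": ([200, 220], [0.5, 0.5])}},
]

matrix = build_feature_matrix(samples, window_ids=["w1", "w2"])
```

Placing the label in the last column keeps each row self-describing, so rows drawn later by resampling carry their classification label with them.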
In some embodiments, the first processing module 310 is further configured to: obtain nucleic acid sequence data of a plurality of first samples and second samples from a nucleic acid sequence database; perform quality calculation and quality-score filtering on the nucleic acid sequence fragments of the obtained nucleic acid sequence data to obtain high-throughput nucleic acid sequence data; and determine the methylation state of CpG sites in the high-throughput nucleic acid sequence data by alignment against genomic sequence data, and perform differential methylation region analysis on the high-throughput nucleic acid sequence data to obtain the nucleic acid sequence data of the target differential methylation region.
In some embodiments, the preset conditions are: the number of CpG sites of the nucleic acid sequence data within the candidate window is greater than or equal to the first threshold.
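Window segmentation followed by this preset condition can be sketched as below. The sketch counts CpG dinucleotides directly in a reference sequence string and discards any trailing partial window; both choices, along with the function names and the example sequence, are assumptions for illustration and are not taken from the patent text.

```python
def segment_windows(region_start, region_end, window_length, step):
    """Slide a fixed-length window across [region_start, region_end) in fixed steps."""
    windows = []
    start = region_start
    while start + window_length <= region_end:
        windows.append((start, start + window_length))
        start += step
    return windows


def count_cpg(sequence):
    """Number of CpG dinucleotides (a C immediately followed by a G) in the sequence."""
    return sequence.upper().count("CG")


def candidate_windows(sequence, region_start, window_length, step, min_cpg):
    """Keep windows whose CpG count is greater than or equal to the first threshold."""
    region_end = region_start + len(sequence)
    kept = []
    for start, end in segment_windows(region_start, region_end, window_length, step):
        fragment = sequence[start - region_start:end - region_start]
        if count_cpg(fragment) >= min_cpg:
            kept.append((start, end))
    return kept


# Hypothetical 20-base region starting at coordinate 1000.
region_sequence = "ACGCGTTTTTACGTCGCGAA"
kept = candidate_windows(region_sequence, region_start=1000,
                         window_length=10, step=5, min_cpg=2)
```

With a step shorter than the window length the windows overlap, so a CpG-rich stretch that straddles one window boundary can still fall wholly inside a neighbouring window.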
In some embodiments, the third processing module 330 is further configured to: calculate, for each candidate window, the inter-group variance and the intra-group variance of the area values corresponding to the first sample and the second sample in the extracted biological samples; if the inter-group variance corresponding to the first sample and the second sample in the current candidate window is greater than the intra-group variance, determine the current candidate window as an alternative training sample window; and screen a training sample window from the alternative training sample windows based on the Lasso regression analysis method to obtain the characteristic region corresponding to the training sample window.
In some embodiments, the third processing module 330 is further configured to: perform Lasso regression analysis on the area values corresponding to the alternative training sample windows to construct a feature selection model; and determine the training sample window based on the area values of the alternative training sample windows whose regression coefficients in the constructed feature selection model are not 0.
In some embodiments, the nucleic acid sequence feature region determining apparatus of the embodiments of the present invention further includes a fourth processing module for obtaining region feature data within each training sample window from the sample nucleic acid sequence data; dividing sample nucleic acid sequence data into a training set and a testing set, training a classification model by using regional characteristic data corresponding to the sample nucleic acid sequence data in the training set, and classifying the testing set by using the trained classification model; and evaluating the classification performance of the classification model according to the classification result and the classification label of the sample nucleic acid sequence data in the test set.
Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a nucleic acid sequence characteristic region determining method comprising: segmenting the nucleic acid sequence data of each target differential methylation region using windows of a preset length according to a preset step length, screening candidate windows from the windows according to a preset condition, and determining the nucleic acid sequence data in each candidate window, wherein the nucleic acid sequence data of the target differential methylation region is obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample; determining the methylation scores of the CpG sites of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on the area value enclosed by the methylation score curve of each candidate window and the coordinate axis and on the classification labels of the biological samples corresponding to the nucleic acid sequence data of each candidate window; and extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening a training sample window from the candidate windows to obtain the characteristic region corresponding to the training sample window.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium. When executed by a processor, the computer program performs the nucleic acid sequence characteristic region determining method provided above, the method comprising: segmenting the nucleic acid sequence data of each target differential methylation region using windows of a preset length according to a preset step length, screening candidate windows from the windows according to a preset condition, and determining the nucleic acid sequence data in each candidate window, wherein the nucleic acid sequence data of the target differential methylation region is obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample; determining the methylation scores of the CpG sites of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on the area value enclosed by the methylation score curve of each candidate window and the coordinate axis and on the classification labels of the biological samples corresponding to the nucleic acid sequence data of each candidate window; and extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening a training sample window from the candidate windows to obtain the characteristic region corresponding to the training sample window.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the nucleic acid sequence characteristic region determining method provided above, the method comprising: segmenting the nucleic acid sequence data of each target differential methylation region using windows of a preset length according to a preset step length, screening candidate windows from the windows according to a preset condition, and determining the nucleic acid sequence data in each candidate window, wherein the nucleic acid sequence data of the target differential methylation region is obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample; determining the methylation scores of the CpG sites of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on the area value enclosed by the methylation score curve of each candidate window and the coordinate axis and on the classification labels of the biological samples corresponding to the nucleic acid sequence data of each candidate window; and extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening a training sample window from the candidate windows to obtain the characteristic region corresponding to the training sample window.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for determining a characteristic region of a nucleic acid sequence, comprising:
dividing the nucleic acid sequence data of each target differential methylation region by adopting windows with preset lengths according to preset step sizes, screening candidate windows from each window according to preset conditions, and determining the nucleic acid sequence data in each candidate window; the nucleic acid sequence data of the target differential methylation region are obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample;
determining methylation scores of CpG sites of the nucleic acid sequence data in each candidate window, and constructing a feature matrix based on an area value formed by methylation score curves and coordinate axes of each candidate window and classification labels of biological samples corresponding to the nucleic acid sequence data of each candidate window;
extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening a training sample window from the candidate windows to obtain a characteristic region corresponding to the training sample window;
the data of each row of the feature matrix corresponds to a biological sample, the data of each column of the feature matrix corresponds to an area value corresponding to each biological sample in a candidate window and a classification label corresponding to the biological sample, and the classification label comprises information belonging to a first sample or information belonging to a second sample.
2. The method for determining a characteristic region of a nucleic acid sequence according to claim 1, wherein the nucleic acid sequence data of the target differential methylation region is obtained by:
obtaining nucleic acid sequence data of a plurality of first samples and second samples from a nucleic acid sequence database;
performing quality calculation and quality-score filtering on the nucleic acid sequence fragments of the obtained nucleic acid sequence data to obtain high-throughput nucleic acid sequence data;
determining the methylation state of CpG sites in the high-throughput nucleic acid sequence data by alignment against genomic sequence data, and performing differential methylation region analysis on the high-throughput nucleic acid sequence data to obtain the nucleic acid sequence data of the target differential methylation region.
3. The method for determining a characteristic region of a nucleic acid sequence according to claim 1, wherein the preset condition is: the number of CpG sites of the nucleic acid sequence data in the candidate window is greater than or equal to a first threshold.
4. The method for determining a characteristic region of a nucleic acid sequence according to any one of claims 1 to 3, wherein the feature selection of the extracted matrix data, and the selection of a training sample window from the candidate windows, to obtain the characteristic region corresponding to the training sample window, comprises:
calculating the inter-group variance and the intra-group variance of the area values corresponding to the first sample and the second sample in each extracted biological sample;
if the inter-group variance corresponding to the first sample and the second sample in the current candidate window is larger than the intra-group variance, determining the current candidate window as an alternative training sample window;
and screening the training sample window from the alternative training sample windows based on a Lasso regression analysis method to obtain a characteristic region corresponding to the training sample window.
5. The method of claim 4, wherein the screening the training sample window from the alternative training sample windows based on the Lasso regression analysis method comprises:
performing Lasso regression analysis on the corresponding area values in the alternative training sample windows to construct a feature selection model;
and determining the training sample window based on the area values of the alternative training sample windows whose regression coefficients in the constructed feature selection model are not 0.
6. The method for determining a characteristic region of a nucleic acid sequence according to any one of claims 1 to 3, wherein, after extracting the target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening the training sample window from the candidate windows to obtain the characteristic region corresponding to the training sample window, the method further comprises:
obtaining regional characteristic data in each training sample window from sample nucleic acid sequence data;
dividing the sample nucleic acid sequence data into a training set and a testing set, training a classification model by utilizing regional characteristic data corresponding to the sample nucleic acid sequence data in the training set, and classifying the testing set by utilizing the trained classification model;
and evaluating the classification performance of the classification model according to the classification result and the classification label of the sample nucleic acid sequence data in the test set.
7. A nucleic acid sequence characteristic region determining apparatus, comprising:
the first processing module is used for segmenting the nucleic acid sequence data of each target differential methylation region using windows of a preset length according to a preset step length, screening candidate windows from the windows according to a preset condition, and determining the nucleic acid sequence data in each candidate window; the nucleic acid sequence data of the target differential methylation region are obtained by differential methylation region analysis of the nucleic acid sequence data of the first sample and the second sample;
the second processing module is used for determining methylation scores of all CpG sites of the nucleic acid sequence data in all candidate windows and constructing a feature matrix based on an area value formed by methylation score curves and coordinate axes of all candidate windows and classification labels of biological samples corresponding to the nucleic acid sequence data of all candidate windows;
the third processing module is used for extracting a target number of matrix data from the feature matrix by resampling with replacement, performing feature selection on the extracted matrix data, and screening a training sample window from the candidate windows to obtain a characteristic region corresponding to the training sample window;
the data of each row of the feature matrix corresponds to a biological sample, the data of each column of the feature matrix corresponds to an area value corresponding to each biological sample in a candidate window and a classification label corresponding to the biological sample, and the classification label comprises information belonging to a first sample or information belonging to a second sample.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for determining a characteristic region of a nucleic acid sequence according to any one of claims 1 to 6.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the nucleic acid sequence characteristic region determination method according to any one of claims 1 to 6.
CN202310409461.3A 2023-04-18 2023-04-18 Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium Active CN116168761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310409461.3A CN116168761B (en) 2023-04-18 2023-04-18 Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310409461.3A CN116168761B (en) 2023-04-18 2023-04-18 Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116168761A CN116168761A (en) 2023-05-26
CN116168761B true CN116168761B (en) 2023-06-30

Family

ID=86414883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310409461.3A Active CN116168761B (en) 2023-04-18 2023-04-18 Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116168761B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982253A (en) * 2011-09-02 2013-03-20 深圳华大基因科技有限公司 Detection method and device of methylation difference of multiple samples
CN109637583A (en) * 2018-12-20 2019-04-16 中国科学院昆明植物研究所 A kind of detection method in Plant Genome differential methylation region
WO2021154009A1 (en) * 2020-01-28 2021-08-05 주식회사 젠큐릭스 Composition using cpg methylation changes in specific genes to diagnose bladder cancer, and use thereof
CN115497561A (en) * 2022-09-01 2022-12-20 北京吉因加医学检验实验室有限公司 Method and device for layering screening of methylation markers

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3127894A1 (en) * 2019-02-05 2020-08-13 Grail, Inc. Detecting cancer, cancer tissue of origin, and/or a cancer cell type
NZ784999A (en) * 2019-08-16 2022-08-26 Univ Hong Kong Chinese Determination of base modifications of nucleic acids
WO2021119471A1 (en) * 2019-12-13 2021-06-17 Grail, Inc. Cancer classification using patch convolutional neural networks
WO2021202423A1 (en) * 2020-03-31 2021-10-07 Grail, Inc. Cancer classification with genomic region modeling
WO2021202752A1 (en) * 2020-03-31 2021-10-07 Guardant Health, Inc. Determining tumor fraction for a sample based on methyl binding domain calibration data
CN112735531B (en) * 2021-03-30 2021-07-02 臻和(北京)生物科技有限公司 Methylation analysis method and device of circulating cell-free nucleosome active region, terminal equipment and storage medium
WO2022214051A1 (en) * 2021-04-08 2022-10-13 The Chinese University Of Hong Kong Cell-free dna methylation and nuclease-mediated fragmentation
CN112951418B (en) * 2021-05-17 2021-08-06 臻和(北京)生物科技有限公司 Method and device for evaluating methylation of linked regions based on liquid biopsy, terminal equipment and storage medium
CN114171115B (en) * 2021-11-12 2022-07-29 深圳吉因加医学检验实验室 Differential methylation region screening method and device thereof

Also Published As

Publication number Publication date
CN116168761A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
US10127351B2 (en) Accurate and fast mapping of reads to genome
CN104302781B (en) A kind of method and device detecting chromosomal structural abnormality
CN111276252B (en) Construction method and device of tumor benign and malignant identification model
CN108256289B (en) Method for capturing and sequencing genome copy number variation based on target region
US20190287646A1 (en) Identifying copy number aberrations
WO2021061473A1 (en) Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data
CN115620812B (en) Resampling-based feature selection method and device, electronic equipment and storage medium
CN111180013B (en) Device for detecting blood disease fusion gene
CN112289376A (en) Method and device for detecting somatic cell mutation
CN107463797B (en) Biological information analysis method and device for high-throughput sequencing, equipment and storage medium
CN114613430A (en) Filtering method and computing equipment for false positive nucleotide variation sites
KR20220076444A (en) Method and apparatus for classifying variation candidates within whole genome sequence
CN108460248B (en) Method for detecting long tandem repeat sequence based on Bionano platform
KR102124193B1 (en) Method for screening markers for predicting depressive disorder or suicide risk using machine learning, markers for predicting depressive disorder or suicide risk, method for predicting depressive disorder or suicide risk
CN111696622B (en) Method for correcting and evaluating detection result of mutation detection software
CN116168761B (en) Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium
US20190108311A1 (en) Site-specific noise model for targeted sequencing
CN116469462A (en) Ultra-low frequency DNA mutation identification method and device based on double sequencing
CN113981081A (en) Breast cancer molecular marker based on RNA editing level and diagnosis model
CN113823356A (en) Methylation site identification method and device
CN115066503A (en) Using bulk sequencing data to guide analysis of single cell sequencing data
CN114703263B (en) Group chromosome copy number variation detection method and device
US20220259657A1 (en) Method for discovering marker for predicting risk of depression or suicide using multi-omics analysis, marker for predicting risk of depression or suicide, and method for predicting risk of depression or suicide using multi-omics analysis
KR102404947B1 (en) Method and apparatus for machine learning based identification of structural variants in cancer genomes
KR102519739B1 (en) Non-invasive prenatal testing method and devices based on double Z-score

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant