CN117558348A - Method, device, equipment and medium for predicting fluctuation degree of sequencing data


Publication number
CN117558348A
Authority
CN
China
Prior art keywords
data
sequencing
fluctuation
genome
fluctuation degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311562929.9A
Other languages
Chinese (zh)
Inventor
栗海波
尹泽宇
余伟师
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Saifu Medical Laboratory Co ltd
Original Assignee
Suzhou Saifu Medical Laboratory Co ltd
Application filed by Suzhou Saifu Medical Laboratory Co ltd filed Critical Suzhou Saifu Medical Laboratory Co ltd
Priority to CN202311562929.9A priority Critical patent/CN117558348A/en
Publication of CN117558348A publication Critical patent/CN117558348A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method, device, equipment and medium for predicting the fluctuation degree of sequencing data, belonging to the technical field of high-throughput sequencing. The method comprises: mapping the raw data to be tested to a reference genome to obtain a genome alignment file; processing the raw sequencing data in each target interval of the genome in the genome alignment file to generate a sequencing file to be processed; obtaining characteristic parameters of each autosome to be tested within the target intervals in the sequencing file to be processed; and inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the model predicts the fluctuation degree of the raw data to be tested based on the characteristic parameters and outputs the corresponding fluctuation degree prediction result. Because the preset data fluctuation degree prediction model directly outputs the fluctuation degree prediction result for the raw data to be tested, the degree of data fluctuation is screened automatically, quickly, accurately and efficiently, review efficiency is improved, and the method is highly general.

Description

Method, device, equipment and medium for predicting fluctuation degree of sequencing data
Technical Field
The invention relates to the technical field of high-throughput sequencing, in particular to a method, a device, equipment and a medium for predicting fluctuation degree of sequencing data.
Background
Many NGS (Next-Generation Sequencing) methods are currently used to detect CNV (Copy Number Variation), including whole genome sequencing (WGS), whole exome sequencing (WES), low-depth genome sequencing, and the like. Each of these methods obtains the final genome sequencing data through a series of complex experimental procedures. Because of this complexity, the final sequencing data show a certain amount of fluctuation: even when the same sample goes through the same experimental procedure, the resulting data distributions are not guaranteed to be identical. Given that data fluctuation cannot be avoided, the degree of fluctuation is strongly associated with the accuracy of subsequent CNV detection: the stronger the fluctuation, the lower the accuracy, and the weaker the fluctuation, the higher the accuracy.
Typically, after CNV analysis, a professional manually reviews the detected CNV signal and the signal distribution map on the chromosomes to judge whether the data fluctuate and whether the CNV result is genuine and reliable. This approach has several disadvantages. Quality control standards are not uniform: different professionals perceive the data or images differently and there is no explicit quality control index, so the same data can receive different quality control results from different reviewers, and the outcome depends on the reviewer's experience. It is labor-intensive: each reviewer spends tens of minutes reviewing a single data set. The impact of the fluctuation cannot be measured: manual review yields only a rough description of the fluctuation intensity, which cannot be quantified, and the accuracy of the CNV result cannot be evaluated from an unquantified assessment.
In summary, how to efficiently and automatically predict the degree of data fluctuation of gene sequencing data, so as to determine whether the data adversely affect CNV results, is a technical problem to be solved in the art.
Disclosure of Invention
In view of the above, the present invention aims to provide a method, a device and a medium for predicting the fluctuation degree of sequencing data, which can efficiently and automatically predict the fluctuation degree of the data of gene sequencing data so as to determine whether the gene sequencing data has an adverse effect on the CNV result. The specific scheme is as follows:
In a first aspect, the present application discloses a method for predicting the fluctuation degree of sequencing data, comprising:
mapping the raw data to be tested to a reference genome to obtain a genome alignment file;
processing the raw sequencing data in each target interval of the genome in the genome alignment file to generate a sequencing file to be processed;
obtaining characteristic parameters of each autosome to be tested within the target intervals in the sequencing file to be processed;
and inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the raw data to be tested based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
Optionally, processing the raw sequencing data in each target interval of the genome in the genome alignment file to generate a sequencing file to be processed comprises:
determining a target interval category based on the sequencing type of the genome alignment file;
selecting the genome intervals corresponding to the target interval category as the target intervals;
counting the number of raw sequencing reads in each target interval of the genome in the genome alignment file;
correcting the read count of each genome target interval based on the GC content of the raw sequencing data in the target interval and the total amount of raw sequencing data in the genome alignment file to generate the sequencing file to be processed;
or correcting the read count of each genome target interval based on the GC content of the raw sequencing data and the median read count of the raw sequencing data in the target intervals to generate the sequencing file to be processed.
Optionally, obtaining the characteristic parameters of each autosome to be tested within the target intervals in the sequencing file to be processed comprises:
extracting the corrected read count of each autosome to be tested in each target interval from the sequencing file to be processed;
and calculating the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set, so as to extract the characteristic parameters of each autosome to be tested within the target intervals.
Optionally, calculating the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set, so as to extract the characteristic parameters of each autosome to be tested within the target intervals, comprises:
calculating the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set to obtain an interval Ratio value for each target interval, and fitting the interval Ratio values of the target intervals on each chromosome to be tested to obtain fitted interval Ratio values;
and extracting the interval Ratio values and the fitted interval Ratio values of each autosome to be tested to obtain the characteristic parameters of each autosome to be tested within the target intervals.
Optionally, before the characteristic parameters are input into the preset data fluctuation degree prediction model, the method further comprises:
setting the learning rate and the number of iterations of an initial data fluctuation degree prediction model over gradient intervals to obtain a target data fluctuation degree prediction model;
and training the target data fluctuation degree prediction model using a training data set to obtain the trained preset data fluctuation degree prediction model.
Optionally, before training the target data fluctuation degree prediction model using the training data set to obtain the trained preset data fluctuation degree prediction model, the method further comprises:
determining the training characteristic parameters and the chromosome distribution map for each test interval using the training-data read count statistics file for the test interval and the control read count median matrix file of the control data set for the test interval;
labeling the corresponding training data with a grade classification based on the data fluctuation information in the chromosome distribution map to obtain grade classification labels for the training data;
and merging all training characteristic parameters of the training data with the corresponding grade classification labels to obtain the training data set.
Optionally, the fluctuation degree prediction result includes: a normal fluctuation degree prediction result, a discrete fluctuation degree prediction result, a fluctuation type fluctuation degree prediction result, and a degradation type fluctuation degree prediction result.
In a second aspect, the present application discloses a fluctuation degree prediction apparatus for sequencing data, comprising:
a data mapping module, configured to map the raw data to be tested to a reference genome to obtain a genome alignment file;
a data processing module, configured to process the raw sequencing data in each target interval of the genome in the genome alignment file to generate a sequencing file to be processed;
a parameter extraction module, configured to obtain characteristic parameters of each autosome to be tested within the target intervals in the sequencing file to be processed;
a result prediction module, configured to input the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the raw data to be tested based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
In a third aspect, the present application discloses an electronic device comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the previously disclosed method for predicting the degree of fluctuation of sequencing data.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the steps of the previously disclosed method for predicting the extent of fluctuation of sequencing data.
As can be seen, the present application discloses a method for predicting the fluctuation degree of sequencing data, comprising: mapping the raw data to be tested to a reference genome to obtain a genome alignment file; processing the raw sequencing data in each target interval of the genome in the genome alignment file to generate a sequencing file to be processed; obtaining characteristic parameters of each autosome to be tested within the target intervals in the sequencing file to be processed; and inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the model predicts the fluctuation degree of the raw data to be tested based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result. The fluctuation degree prediction model is thus used to predict the fluctuation degree of the raw data to be tested and directly outputs the corresponding prediction result, so that the degree of data fluctuation can be screened automatically, quickly, accurately and efficiently, manual review is replaced, and review efficiency is improved. In addition, because the input to the preset data fluctuation degree prediction model is the characteristic parameters of the raw data to be tested, the model can predict the fluctuation degree of different types of raw data to be tested, which makes the method highly general.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for predicting the degree of fluctuation of sequencing data disclosed herein;
FIG. 2 is a flow chart of a method for generating a sequencing file to be processed disclosed in the present application;
FIG. 3 is a flowchart of a method for generating a statistical matrix file for a control data set disclosed in the present application;
FIG. 4 is a flow chart of a method for predicting the extent of fluctuation of sequencing data in accordance with the disclosure herein;
FIG. 5 is a flowchart of a training and evaluating method for a predictive model of the fluctuation degree of preset data disclosed in the present application;
FIG. 6 is a flow chart of a method for hierarchical prediction of clinical specimen data fluctuation levels disclosed herein;
FIG. 7 is a schematic structural diagram of a device for predicting the fluctuation degree of gene sequencing data disclosed in the present application;
FIG. 8 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Many NGS methods are currently used to detect CNV, including whole genome sequencing, whole exome sequencing, low-depth genome sequencing, and the like. Each of these methods obtains the final genome sequencing data through a series of complex experimental procedures. Because of this complexity, the final sequencing data show a certain amount of fluctuation: even when the same sample goes through the same experimental procedure, the resulting data distributions are not guaranteed to be identical. Given that data fluctuation cannot be avoided, the degree of fluctuation is strongly associated with the accuracy of subsequent CNV detection: the stronger the fluctuation, the lower the accuracy, and the weaker the fluctuation, the higher the accuracy.
Typically, after CNV analysis, a professional manually reviews the detected CNV signal and the signal distribution map on the chromosomes to judge whether the data fluctuate and whether the CNV result is genuine and reliable. This approach has several disadvantages. Quality control standards are not uniform: different professionals perceive the data or images differently and there is no explicit quality control index, so the same data can receive different quality control results from different reviewers, and the outcome depends on the reviewer's experience. It is labor-intensive: each reviewer spends tens of minutes reviewing a single data set. The impact of the fluctuation cannot be measured: manual review yields only a rough description of the fluctuation intensity, which cannot be quantified, and the accuracy of the CNV result cannot be evaluated from an unquantified assessment.
Therefore, the invention provides a fluctuation degree prediction method of sequencing data, which can realize the prediction of the data fluctuation degree of gene sequencing data in a high-efficiency and automatic manner so as to determine whether the gene sequencing data has adverse effect on CNV results.
Referring to fig. 1, the embodiment of the invention discloses a method for predicting fluctuation degree of sequencing data, which comprises the following steps:
Step S11: mapping the original data to be detected to a reference genome to obtain a genome comparison file.
In this embodiment, the raw data in the Fastq file of a clinical sample are used as the raw data to be tested and are aligned to the reference genome by preset alignment software to obtain the genome alignment file in BAM (Binary Alignment Map) format. The preset alignment software may include, but is not limited to, BWA, Bowtie, MAQ, SOAP2, and the like. The reference genome version may be hg19 or hg38, chosen according to actual needs. Note that the reference genome must be the same as the one used for the control set data.
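As an illustration of step S11, the following minimal sketch assembles the command lines for one common way of producing a sorted, indexed BAM file with BWA and samtools. The file names, the thread count and the helper names `build_alignment_commands` and `as_shell_pipeline` are hypothetical; the embodiment only requires that some preset alignment software produce the BAM file. The sketch builds the commands without running them.

```python
import shlex
from typing import List


def build_alignment_commands(reference: str, fastq_files: List[str],
                             out_bam: str, threads: int = 4) -> List[List[str]]:
    """Return argv lists for: align reads, sort the alignments, index the BAM."""
    if not 1 <= len(fastq_files) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) Fastq files")
    # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-").
    align = ["bwa", "mem", "-t", str(threads), reference, *fastq_files]
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    index = ["samtools", "index", out_bam]
    return [align, sort, index]


def as_shell_pipeline(commands: List[List[str]]) -> str:
    """Render the three commands as a single shell line for inspection."""
    align, sort, index = commands
    return f"{shlex.join(align)} | {shlex.join(sort)} && {shlex.join(index)}"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmds = build_alignment_commands(
        "hg38.fa", ["sample_R1.fq.gz", "sample_R2.fq.gz"], "sample.bam")
    print(as_shell_pipeline(cmds))
```

The reference passed here would have to be the same build (hg19 or hg38) as the one used for the control set.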
Step S12: and carrying out data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed.
In this embodiment, referring to FIG. 2, a target interval category is determined based on the sequencing type of the genome alignment file; the genome intervals corresponding to the target interval category are selected as the target intervals; the number of raw sequencing reads in each target interval of the genome in the genome alignment file is counted; and the read count of each genome target interval is corrected based on the GC content of the raw sequencing data in the target interval and the total amount of raw sequencing data in the genome alignment file to generate the sequencing file to be processed, or based on the GC content of the raw sequencing data and the median read count of the raw sequencing data in the target intervals to generate the sequencing file to be processed. The target interval category is determined by the sequencing type of the genome alignment file. Specifically, when the sequencing type covers the whole genome, such as whole genome sequencing (WGS) or low-depth whole genome sequencing (CNVseq), the window intervals obtained by dividing the genome into windows are taken as the target interval category; when the sequencing type is whole exome sequencing (WES) or a custom gene Panel, the probe intervals defined by the capture reagents are taken as the target interval category.
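The interval selection and read counting can be sketched as follows, assuming fixed-size windows for whole-genome data and assuming that a read is assigned to the interval containing its start position; neither the window size nor the assignment rule is specified in the text. For WES or Panel data the list of probe intervals would simply be passed in place of the generated windows. The names `make_windows` and `count_reads` are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based, half-open


def make_windows(chrom_lengths: Dict[str, int], size: int) -> List[Interval]:
    """Divide each chromosome into consecutive windows of at most `size` bases."""
    windows = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, size):
            windows.append((chrom, start, min(start + size, length)))
    return windows


def count_reads(intervals: List[Interval],
                read_starts: Dict[str, List[int]]) -> List[int]:
    """Count reads whose start position lies in each interval (assumed rule)."""
    sorted_starts = {c: sorted(p) for c, p in read_starts.items()}
    counts = []
    for chrom, start, end in intervals:
        pos = sorted_starts.get(chrom, [])
        # Number of positions p with start <= p < end.
        counts.append(bisect_left(pos, end) - bisect_left(pos, start))
    return counts


if __name__ == "__main__":
    windows = make_windows({"chr1": 250}, 100)
    print(windows)
    print(count_reads(windows, {"chr1": [5, 99, 100, 150, 249]}))
```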
The genome intervals corresponding to the target interval category are then selected as the target intervals, and the number of raw sequencing Reads in each genome target interval is counted. The Reads count of each target interval is corrected based on the GC content of the genome target interval, using Loess as the correction method, to obtain a GC-corrected Reads count statistics file. The GC-corrected Reads count statistics file is then corrected either by the total Reads data volume of the sequencing sample or by the median Reads count of the target intervals of the sequencing sample, finally generating the sequencing file to be processed. The method is therefore applicable to a variety of high-throughput sequencing formats.
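The two-stage correction can be sketched as follows. To stay dependency-free, the sketch does not use a Loess fit for the GC step; it swaps in a simpler median-per-GC-bin scaling, which is a different technique with the same intent (removing the dependence of read count on GC content). The bin width, the per-million scale and all function names are assumptions for illustration.

```python
from statistics import median
from typing import Dict, List


def gc_correct(counts: List[float], gc: List[float],
               bin_width: float = 0.01) -> List[float]:
    """Scale counts so that every GC bin has the same median as the whole sample.

    Stand-in for a Loess-based GC correction; requires at least one count > 0.
    """
    overall = median(c for c in counts if c > 0)
    bins: Dict[int, List[float]] = {}
    for c, g in zip(counts, gc):
        if c > 0:
            bins.setdefault(round(g / bin_width), []).append(c)
    bin_median = {b: median(v) for b, v in bins.items()}
    corrected = []
    for c, g in zip(counts, gc):
        m = bin_median.get(round(g / bin_width))
        corrected.append(c * overall / m if m else 0.0)
    return corrected


def normalize_by_total(corrected: List[float], total_reads: int,
                       scale: float = 1_000_000) -> List[float]:
    """Option 1: express each count per `scale` reads of the whole sample."""
    return [c * scale / total_reads for c in corrected]


def normalize_by_median(corrected: List[float]) -> List[float]:
    """Option 2: divide each count by the median non-zero interval count."""
    m = median(c for c in corrected if c > 0)
    return [c / m for c in corrected]


if __name__ == "__main__":
    counts = [100, 200, 100, 200]   # raw Reads per target interval
    gc = [0.40, 0.60, 0.40, 0.60]   # GC fraction per target interval
    corrected = gc_correct(counts, gc)
    print(corrected, normalize_by_median(corrected))
```

Either normalization makes samples of different sequencing depth comparable before they are divided by the control set medians.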
Step S13: and obtaining characteristic parameters of each autosome to be detected in the sequencing file to be processed in the target interval.
In this embodiment, the corrected read count of each autosome to be tested in each target interval is extracted from the sequencing file to be processed, and the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set is calculated, so as to extract the characteristic parameters of each autosome to be tested within the target intervals. In other words, the target-interval Reads counts of each autosome in the sequencing file to be processed are extracted and compared proportionally with the control set samples to extract the characteristic parameters of each autosome.
Specifically, calculating the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set, so as to extract the characteristic parameters of each autosome to be tested within the target intervals, comprises: calculating the ratio of the corrected read count to the control read count of the corresponding target interval in the control data set to obtain an interval Ratio value for each target interval, and fitting the interval Ratio values of the target intervals on each chromosome to be tested to obtain fitted interval Ratio values; and extracting the interval Ratio values and the fitted interval Ratio values of each autosome to be tested to obtain the characteristic parameters of each autosome to be tested within the target intervals. In other words, the Reads count of the sequencing sample in each target interval is divided by the median Reads count of that target interval in the control set, and the data fluctuation Ratio is calculated according to the following formula:
$$\mathrm{Ratio}(i) = \log_2\!\left(\frac{\mathrm{Reads}(i)}{\mathrm{CK\_Reads}(i)}\right)$$
where i denotes the i-th genome target interval; Ratio(i) denotes the log2 ratio value of the i-th genome target interval; Reads(i) denotes the Reads count of the test sample in the i-th genome target interval; and CK_Reads(i) denotes the median Reads count of the control data set samples in the i-th genome target interval.
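A minimal sketch of the Ratio calculation follows. Returning `None` for intervals where either count is zero is an assumption made here so that the logarithm is always defined; the text does not say how such intervals are handled for the test sample.

```python
import math
from typing import List, Optional


def log2_ratio(reads: List[float],
               ck_reads: List[float]) -> List[Optional[float]]:
    """Ratio(i) = log2(Reads(i) / CK_Reads(i)) for each target interval."""
    ratios: List[Optional[float]] = []
    for r, ck in zip(reads, ck_reads):
        if r > 0 and ck > 0:
            ratios.append(math.log2(r / ck))
        else:
            ratios.append(None)  # undefined ratio (assumed handling)
    return ratios


if __name__ == "__main__":
    # Equal to the control, half the control, double the control, no coverage.
    print(log2_ratio([100, 50, 200, 0], [100, 100, 100, 100]))
```

A Ratio of 0 means the interval matches the control median, while values near -1 or +1 correspond to half or double the control coverage.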
Then, for each chromosome of the genome, Loess fitting is performed on the Ratio values of the target intervals against their positions on the chromosome, yielding a fitted Ratio value for each target interval. The Ratio values before and after Loess fitting are extracted for each chromosome, and 6 related statistical parameters, namely the characteristic parameters, are obtained. The cumulative sum of the Ratio values, that is, the sum of all Ratio values on the same chromosome, is:
$$\mathrm{Sum\_Raw} = \sum_{i=1}^{n} \mathrm{Ratio}(i)$$
where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio(i) denotes the log2 ratio value of the i-th genome target interval; and Sum_Raw denotes the cumulative sum of the Ratio values of all target intervals.
The cumulative sum of the Loess-fitted Ratio values, that is, the sum of all fitted Ratio values on the same chromosome, is:
$$\mathrm{Sum\_Fit} = \sum_{i=1}^{n} \mathrm{Ratio\_Fit}(i)$$
where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio_Fit(i) denotes the log2 ratio value of the i-th genome target interval after Loess fitting; and Sum_Fit denotes the cumulative sum of the Loess-fitted Ratio values of all target intervals.
The cumulative sum of the absolute values of the Ratio values, taken over all Ratio values on the same chromosome, is:
$$\sum_{i=1}^{n} \lvert \mathrm{Ratio}(i) \rvert$$
where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio(i) denotes the log2 ratio value of the i-th genome target interval; and the sum is the cumulative sum of the absolute Ratio values of all target intervals.
The cumulative sum of the absolute values of the Loess-fitted Ratio values, taken over all fitted Ratio values on the same chromosome, is:
$$\sum_{i=1}^{n} \lvert \mathrm{Ratio\_Fit}(i) \rvert$$
where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio_Fit(i) denotes the log2 ratio value of the i-th genome target interval after Loess fitting; and the sum is the cumulative sum of the absolute Loess-fitted Ratio values of all target intervals.
The standard deviation of the Ratio values on the same chromosome, Sd_Raw, is calculated about their mean μ, where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio(i) denotes the log2 ratio value of the i-th genome target interval; μ denotes the mean of the log2 ratio values of the genome target intervals; and Sd_Raw denotes the standard deviation of the Ratio values of all target intervals.
The standard deviation of the Loess-fitted Ratio values on the same chromosome is calculated about their mean μ_Fit, where i denotes the i-th genome target interval; n denotes the number of target intervals on the target chromosome; Ratio_Fit(i) denotes the log2 ratio value of the i-th genome target interval after Loess fitting; μ_Fit denotes the mean of the log2 ratio values of the genome target intervals after Loess fitting; and the result is the standard deviation of the Loess-fitted Ratio values of all target intervals.
The human reference genome sequence consists of autosomes 1 to 22 and the two sex chromosomes X and Y, 24 chromosomes in total. The sex chromosomes of a male sample are XY and those of a female sample are XX. To avoid interference from the sex chromosomes, only the 6 characteristic parameters of each of autosomes 1 to 22 are extracted, giving 22 × 6 = 132 characteristic parameters per sample.
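The six per-chromosome statistics and the 132-element feature vector can be sketched as follows. The Loess step is implemented here as a basic locally weighted linear regression with tricube weights; the span `frac`, the use of the population standard deviation (`pstdev`), the `chr1` to `chr22` naming and all function names are assumptions, since the text does not give the Loess settings or the normalization of the standard deviation.

```python
import math
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple


def loess_fit(x: Sequence[float], y: Sequence[float],
              frac: float = 0.3) -> List[float]:
    """Basic Loess: local linear fit with tricube weights (O(n^2) sketch)."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # points in each neighbourhood
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def chromosome_features(positions: Sequence[float], ratios: Sequence[float],
                        frac: float = 0.3) -> List[float]:
    """Sum, absolute sum and standard deviation of raw and fitted Ratio values."""
    fit = loess_fit(positions, ratios, frac)
    return [
        sum(ratios), sum(fit),
        sum(abs(r) for r in ratios), sum(abs(f) for f in fit),
        pstdev(ratios), pstdev(fit),  # population SD is an assumption
    ]


AUTOSOMES = [f"chr{i}" for i in range(1, 23)]  # sex chromosomes excluded


def sample_features(per_chrom: Dict[str, Tuple[Sequence[float], Sequence[float]]],
                    frac: float = 0.3) -> List[float]:
    """Concatenate the 6 features of autosomes 1 to 22 into 132 values."""
    vector: List[float] = []
    for chrom in AUTOSOMES:
        positions, ratios = per_chrom[chrom]
        vector.extend(chromosome_features(positions, ratios, frac))
    return vector


if __name__ == "__main__":
    pos, ratios = [0, 1, 2, 3], [0.1, -0.1, 0.2, -0.2]
    print(chromosome_features(pos, ratios, frac=0.5))
    print(len(sample_features({c: (pos, ratios) for c in AUTOSOMES})))
```

Intuitively, the raw statistics capture interval-to-interval scatter, while the fitted statistics capture slow drifts along the chromosome that survive smoothing.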
In this embodiment, referring to FIG. 3, the Reads statistical matrix file of the control data set samples is generated as follows. Historical samples are grouped by disease status, and the normal disease-free samples form the control group. The Reads count statistics files of the multiple samples in the control group are merged by matching identical genome intervals. For each genome interval, the median of the Reads counts across the samples is calculated, and intervals whose median is 0 are removed: a median of 0 indicates that the interval has no Reads coverage in most samples and cannot be used for subsequent statistical analysis. Finally, a target-interval Reads count median matrix file is generated, namely the Reads statistical matrix file of the control data set. Thus, by grouping the historical samples, assigning the normal disease-free samples to the control group, and merging the corrected Reads statistics files of all control samples, quality control and median statistics are performed for each target interval, and the resulting file serves as the control for the comparative statistical analysis of each sequencing sample.
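The control set construction can be sketched as follows, assuming each control sample is available as a mapping from interval to corrected Reads count and assuming that "merging by identical genome intervals" means keeping only the intervals present in every sample. The function name `control_median_matrix` is hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def control_median_matrix(
        samples: List[Dict[Interval, float]]) -> Dict[Interval, float]:
    """Per-interval median Reads count across control samples.

    Intervals missing from any sample are skipped (assumed merge rule);
    intervals whose median is 0 are removed, as described in the text.
    """
    shared = set(samples[0])
    for sample in samples[1:]:
        shared &= set(sample)
    medians: Dict[Interval, float] = {}
    for interval in sorted(shared):
        m = median(sample[interval] for sample in samples)
        if m > 0:
            medians[interval] = m
    return medians


if __name__ == "__main__":
    a, b, c = ("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300)
    controls = [
        {a: 90.0, b: 0.0, c: 120.0},
        {a: 100.0, b: 0.0, c: 100.0},
        {a: 110.0, b: 5.0, c: 80.0},
    ]
    print(control_median_matrix(controls))
```

The resulting medians play the role of CK_Reads(i) when the Ratio of a test sample is calculated.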
Step S14: and inputting the characteristic parameters into a preset data fluctuation degree prediction model so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
In this embodiment, the characteristic parameters of a sequencing sample are input into the preset data fluctuation degree prediction model to obtain the data fluctuation grade predicted by the model; wherein the fluctuation degree prediction result comprises: a normal fluctuation degree prediction result, a discrete fluctuation degree prediction result, a fluctuation type fluctuation degree prediction result, and a degradation type fluctuation degree prediction result.
As can be seen, the present application discloses a method for predicting the fluctuation degree of sequencing data, comprising: mapping the original data to be detected to a reference genome to obtain a genome comparison file; performing data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed; acquiring characteristic parameters of each autosome to be detected in the target interval in the sequencing file to be processed; and inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result. Because the prediction model directly outputs the fluctuation degree prediction result for the original data to be detected, the fluctuation degree of the data can be screened automatically, quickly, accurately and efficiently, which replaces manual review and improves review efficiency. In addition, because the input of the preset data fluctuation degree prediction model is the characteristic parameters of the original data to be detected, the model can predict the fluctuation degree of different types of original data to be detected and is therefore highly general.
Referring to fig. 4, an embodiment of the present invention discloses a specific method for predicting the fluctuation degree of sequencing data, and compared with the previous embodiment, the present embodiment further describes and optimizes the technical scheme.
Specifically:
step S21: mapping the original data to be detected to a reference genome to obtain a genome comparison file; performing data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed; and obtaining characteristic parameters of each autosome to be detected in the sequencing file to be processed in the target interval.
For the detailed processing in step S21, refer to the embodiments disclosed above; it is not repeated here.
Step S22: determining training characteristic parameters and chromosome distribution diagrams in each test interval by using the training data number statistical file in the test interval and the comparison data number median matrix file of the comparison data set in the test interval; performing grade typing labeling on the corresponding training data based on the data fluctuation information in the chromosome distribution diagram to obtain grade typing label information of the training data; and merging all training characteristic parameters of the training data with the corresponding grade typing label information to obtain a training data set.
In this embodiment, referring to fig. 5, the data of multiple training samples, each comprising characteristic parameters and a data fluctuation grade, are divided into two groups: one group is defined as the test set and is used for comparing algorithms and optimizing model parameters; the other is defined as the training data set. The training characteristic parameters of the training data set are extracted in the same way as the characteristic parameters in step S13, which is not described again here. The data fluctuation grade of each training sample is labeled manually according to the data fluctuation information in the chromosome distribution diagram. Specifically, the Ratio values of the different chromosomes are plotted to obtain the chromosome distribution diagram, the plotted result is passed to a professional manual review team, and each training sample is given a grade typing label according to the established data fluctuation grade typing standard. The extracted training characteristic parameters of all training samples are then merged with the grade typing label information to obtain the training data set.
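Merging feature vectors with manual labels and splitting the result into two groups might look like the sketch below. The function names, the 80/20 split and the fixed random seed are assumptions made for illustration; the application does not state the split proportion.

```python
import random

LABELS = ("Normal", "Discrete", "Volatility", "Degradable")


def build_dataset(features_by_sample, label_by_sample):
    """Join each sample's feature vector with its manually assigned grade."""
    rows = []
    for sample_id in sorted(features_by_sample):
        label = label_by_sample[sample_id]
        if label not in LABELS:
            raise ValueError(f"unknown grade {label!r} for sample {sample_id}")
        rows.append((sample_id, features_by_sample[sample_id], label))
    return rows


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and return (training rows, test rows).

    test_fraction=0.2 is an assumed value, not one given in the text.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: ten samples with 132 features each and cycling labels.
features = {f"S{i:02d}": [0.0] * 132 for i in range(10)}
labels = {f"S{i:02d}": LABELS[i % 4] for i in range(10)}
train_rows, test_rows = split_dataset(build_dataset(features, labels))
```

Validating the label against the four known grades at merge time catches typing errors in the manual review output before they reach model training.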
It should be noted that the data fluctuation grade typing standard is derived from the quality control experience accumulated on historical CNV results. Data fluctuation is divided into 4 grades, from low to high: normal-type fluctuation, discrete-type fluctuation, fluctuation-type fluctuation and degradation-type fluctuation. Normal-type fluctuation (Normal): in the data distribution diagram of this type of sample, the data signal points are distributed near the 0 baseline with no obvious fluctuation, and the CNV analysis results are relatively accurate. Discrete-type fluctuation (Discrete): in the data distribution diagram of this type of sample, the data signal points are distributed near the 0 baseline, but some outlying signal points deviate clearly from the baseline; these outliers usually cause false-positive small-fragment CNVs to be detected, and the results are relatively accurate once the small-fragment CNVs are filtered out. Fluctuation-type fluctuation (Volatility): in the data distribution diagram of this type of sample, the data signal points form a wave-like pattern around the 0 baseline and the fluctuation is obvious; the CNV detection results contain a large number of false positives, and a relatively accurate result cannot be obtained by filtering. Degradation-type fluctuation (Degradable): in the data distribution diagram of this type of sample, all data signal points deviate clearly from the 0 baseline, and CNVs cannot be analyzed accurately because the data fluctuation is abnormally severe.
It can be understood that, under the sample data fluctuation typing standard established above, the CNV results detected for Normal samples can be used directly; the CNV analysis results of Discrete samples require the small-fragment CNVs to be filtered out; and the CNV analysis results of Volatility and Degradable samples cannot be used for subsequent analysis, so repeating the experiment and sequencing is recommended.
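The mapping from predicted grade to downstream handling stated above can be written as a small lookup. The names below (`Grade`, `downstream_action`, `needs_resequencing`) are hypothetical and the action strings merely paraphrase the text; this is a sketch, not part of the disclosed system.

```python
from enum import Enum


class Grade(Enum):
    NORMAL = "Normal"
    DISCRETE = "Discrete"
    VOLATILITY = "Volatility"
    DEGRADABLE = "Degradable"


# Downstream handling for each grade, paraphrasing the typing standard above.
ACTIONS = {
    Grade.NORMAL: "use CNV results directly",
    Grade.DISCRETE: "filter small-fragment CNVs, then use results",
    Grade.VOLATILITY: "do not analyze; repeat sequencing",
    Grade.DEGRADABLE: "do not analyze; repeat sequencing",
}


def downstream_action(predicted):
    """Return the handling recommended for a predicted grade name."""
    return ACTIONS[Grade(predicted)]


def needs_resequencing(predicted):
    """True for the two grades whose CNV results cannot be used."""
    return Grade(predicted) in (Grade.VOLATILITY, Grade.DEGRADABLE)
```

Passing an unrecognized grade name raises `ValueError` from the `Grade(...)` lookup, so a misspelled model output is never silently treated as usable.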
Step S23: setting the learning rate, the number of iterations and the step interval of an initial data fluctuation degree prediction model to obtain a target data fluctuation degree prediction model; and training the target data fluctuation degree prediction model by using the training data set to obtain a trained preset data fluctuation degree prediction model.
In this embodiment, a classification model is constructed for each preset classification algorithm to serve as an initial data fluctuation degree prediction model, and the same learning rate and number of iterations are then set for every initial model, for example a learning rate of 0.1 and 80 iterations, to obtain the target data fluctuation degree prediction model corresponding to each preset classification algorithm; each target data fluctuation degree prediction model is then trained with the training data set to obtain a trained model. The preset classification algorithms are: random forest (Forest), K-nearest neighbors (KNN), LightGBM (Light Gradient Boosting Machine), logistic regression (LR), decision tree (TREE), and the XGBoost (Extreme Gradient Boosting) classifier XGBClassifier.
In this embodiment, the test characteristic parameters in the test set are input into each of the six target data fluctuation degree prediction models to obtain each model's predicted grade for every test sample; the predicted grades are compared with the grade typing label information of the samples to evaluate the prediction performance of each model. Specifically, the evaluation uses the receiver operating characteristic curve (ROC curve): algorithm performance is measured by the area under the ROC curve (AUC), and a larger area indicates better performance. In this comparison the LightGBM algorithm classified better than the other algorithms, so it was selected for subsequent model construction. The hyperparameters of the target data fluctuation degree prediction model corresponding to LightGBM were then set again: the number of iterations was varied from 80 to 200 with a step interval of 20, and the final evaluation showed that the optimal number of iterations was 180. Finally, the optimal number of iterations, the step interval and the other parameters are combined with the LightGBM algorithm to obtain the target data fluctuation degree prediction model that serves as the preset data fluctuation degree prediction model. Table 1 is the contingency table of the test-set sample predictions:
Table 1
Table 2 shows the model prediction performance for the different grade types and the overall accuracy:
Table 2
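The evaluation and tuning loop described in this step can be sketched in plain Python. Several points are assumptions: the application does not say how the ROC curve is extended to four grades, so a macro-averaged one-vs-rest AUC is used here; `evaluate` is a hypothetical caller-supplied function that trains a model with the given number of iterations and returns its score on the test set; and all function names are illustrative.

```python
def binary_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels are 1 for the positive class and 0 otherwise; a tie between a
    positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(prob_rows, true_labels, classes):
    """Average the one-vs-rest AUC over all classes (an assumed convention)."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [1 if y == cls else 0 for y in true_labels]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


def best_iteration_count(evaluate, start=80, stop=200, step=20):
    """Score every iteration count on the grid; return (best count, grid)."""
    grid = list(range(start, stop + 1, step))
    return max(grid, key=evaluate), grid
```

The default grid reproduces the range stated in the text, 80 to 200 in steps of 20, which yields seven candidate iteration counts.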
Step S24: inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
Referring to fig. 6, a functional module is developed on the basis of the preset data fluctuation degree prediction model to perform typing prediction on clinical sample data. That is, the model is packaged as an automated functional module that runs the prediction on samples during clinical production, obtains the data fluctuation typing result, determines the degree of influence on CNV detection, and guides the downstream processing of the samples. The specific flow is as follows:
1. Genome comparison file acquisition: the Fastq file of the clinical sample's original data is aligned to a reference genome to obtain a genome comparison file. The reference genome must be the same as the reference genome used for the control set data;
2. The number of original sequencing data Reads in each genome target interval is counted and corrected for GC content and data volume; the genome target intervals and the correction method must be the same as those used for the control set samples;
3. The ratio of the sample's target-interval Reads counts to those of the control set is calculated, and the characteristic parameters of each chromosome are extracted;
4. The characteristic parameters of the sample are input into the model for prediction to obtain the predicted data fluctuation grade typing result of the sample.
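Step 3 of the flow above, the ratio calculation against the control set, can be sketched as follows. The function names are hypothetical, intervals are represented as (chromosome, start, end) tuples, and skipping intervals with a zero sample count is an assumption made because log2(0) is undefined; the application may handle such intervals differently.

```python
import math


def log2_ratios(sample_counts, control_medians):
    """Return {interval: log2(sample count / control median)}.

    Only intervals present in the control set are used. Intervals with a
    zero sample count are skipped (an assumption), since log2(0) is undefined.
    """
    ratios = {}
    for interval, control in control_medians.items():
        count = sample_counts.get(interval, 0)
        if count > 0 and control > 0:
            ratios[interval] = math.log2(count / control)
    return ratios


def group_by_chromosome(ratios):
    """Collect ratio values per chromosome, ordered by interval position."""
    by_chrom = {}
    for (chrom, start, end), value in sorted(ratios.items()):
        by_chrom.setdefault(chrom, []).append(value)
    return by_chrom


# Toy data: a doubled interval, a halved interval, and one with no coverage.
control = {("1", 0, 100): 20, ("1", 100, 200): 20, ("2", 0, 100): 6}
sample = {("1", 0, 100): 40, ("1", 100, 200): 10, ("2", 0, 100): 0}
ratios = log2_ratios(sample, control)
per_chrom = group_by_chromosome(ratios)
```

A sample count twice the control median gives a log2 ratio of 1, and half the control median gives -1, so values near 0 correspond to signal points on the 0 baseline of the distribution diagram.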
Therefore, the data fluctuation degree prediction method places low demands on server computing resources: an ordinary server with 8 cores and 32 GB of memory can run the processing tasks for dozens of target genes simultaneously. The model can be deployed simply within a system and is convenient to use and operate; the whole analysis flow can be completed by deploying only the relevant computing nodes, so that the degree to which data fluctuation affects CNV detection can be screened efficiently.
Referring to FIG. 7, the invention also discloses a device for predicting the fluctuation degree of sequencing data, which comprises:
the data mapping module 11 is used for mapping the original data to be detected to a reference genome to obtain a genome comparison file;
a data processing module 12, configured to perform data processing on the raw sequencing data in the target interval of each genome in the genome comparison file, so as to generate a sequencing file to be processed;
The parameter extraction module 13 is used for obtaining characteristic parameters of each autosome to be detected in the target interval in the sequencing file to be processed;
the result prediction module 14 is configured to input the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
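The four-module structure of the device can be illustrated as a simple composition of callables. This is a sketch only: the class and field names are hypothetical, and the stub functions in the example stand in for the real alignment, read counting, feature extraction and model inference steps.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class FluctuationPredictionDevice:
    """Chains the four modules in the order given in the text."""
    map_to_reference: Callable[[str], Any]               # data mapping module 11
    process_alignment: Callable[[Any], Any]              # data processing module 12
    extract_features: Callable[[Any], Sequence[float]]   # parameter extraction module 13
    predict_grade: Callable[[Sequence[float]], str]      # result prediction module 14

    def run(self, fastq_path: str) -> str:
        alignment = self.map_to_reference(fastq_path)
        to_process = self.process_alignment(alignment)
        features = self.extract_features(to_process)
        return self.predict_grade(features)


# Stub modules for illustration only; none of them performs real analysis.
device = FluctuationPredictionDevice(
    map_to_reference=lambda path: path + ".bam",
    process_alignment=lambda alignment: {"alignment": alignment},
    extract_features=lambda processed: [0.0] * 132,
    predict_grade=lambda feats: "Normal" if len(feats) == 132 else "Degradable",
)
result = device.run("sample.fastq")
```

Keeping each module behind a plain callable lets any one of them be replaced, for example by swapping the prediction model, without touching the others.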
As can be seen, the present application discloses: mapping the original data to be detected to a reference genome to obtain a genome comparison file; performing data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed; acquiring characteristic parameters of each autosome to be detected in the target interval in the sequencing file to be processed; and inputting the characteristic parameters into a preset data fluctuation degree prediction model, so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result. Because the prediction model directly outputs the fluctuation degree prediction result for the original data to be detected, the fluctuation degree of the data can be screened automatically, quickly, accurately and efficiently, which replaces manual review and improves review efficiency. In addition, because the input of the preset data fluctuation degree prediction model is the characteristic parameters of the original data to be detected, the model can predict the fluctuation degree of different types of original data to be detected and is therefore highly general.
Further, the embodiment of the present application further discloses an electronic device, and fig. 8 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 8 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the method for predicting the fluctuation degree of sequencing data disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The processor 21 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one of the hardware forms DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 21 may also comprise a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit), which renders and draws the content to be shown on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer program 222, so that the processor 21 can operate on and process the mass data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program used to perform the method for predicting the fluctuation degree of sequencing data executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs used to perform other specific tasks. The data 223 may include data received by the electronic device from external devices as well as data collected through its own input/output interface 25.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program when executed by the processor implements the previously disclosed method for predicting the degree of fluctuation of sequencing data. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both; to illustrate this interchangeability of hardware and software clearly, the units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM (Compact Disc Read-Only Memory), or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The method, device, equipment and medium for predicting the fluctuation degree of sequencing data provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (10)

1. A method for predicting the degree of fluctuation of sequencing data, comprising:
mapping the original data to be detected to a reference genome to obtain a genome comparison file;
performing data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed;
acquiring characteristic parameters of each autosome to be detected in the target interval in the sequencing file to be processed;
and inputting the characteristic parameters into a preset data fluctuation degree prediction model so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
2. The method of claim 1, wherein the performing data processing on the original sequencing data in the target interval of each genome in the genome comparison file to generate a sequencing file to be processed comprises:
determining a target interval category based on the sequencing type of the genome comparison file;
selecting a genome interval corresponding to the target interval category as the target interval;
Counting the number of original sequencing data in the target interval for each genome in the genome comparison file;
performing data volume correction processing on the number of original sequencing data of each genome based on the GC content of the original sequencing data in the target interval and the total data volume of the original sequencing data in the genome comparison file to generate a sequencing file to be processed;
or, performing data volume correction processing on the number of the original sequencing data of each genome based on the GC content of the original sequencing data and the median of the original sequencing data in the target interval to generate a sequencing file to be processed.
3. The method for predicting the fluctuation degree of sequencing data according to claim 1, wherein the step of obtaining the characteristic parameters of each autosome to be tested in the target interval in the sequencing file to be processed comprises the steps of:
extracting the number of corrected original sequencing data of each autosome to be detected in the target interval in the sequencing file to be processed;
and calculating the proportion of the number of the corrected original data to the number of the comparison data in the corresponding target interval in the comparison data set so as to extract the characteristic parameters of each autosome to be detected in the target interval.
4. The method according to claim 3, wherein the calculating the proportion of the number of the corrected original data to the number of the comparison data in the corresponding target interval in the comparison data set so as to extract the characteristic parameters of each autosome to be detected in the target interval comprises:
performing proportion calculation on the number of the corrected original data and the number of the comparison data in the corresponding target interval in the comparison data set to obtain an interval proportion value of the target interval, and fitting the interval proportion values of the target intervals in each autosome to be detected to obtain fitted interval proportion values;
and extracting the interval proportion value of each autosome to be detected and the fitted interval proportion value to obtain the characteristic parameters of each autosome to be detected in the target interval.
5. The method for predicting the fluctuation degree of sequencing data according to claim 1, wherein before the characteristic parameter is input into a preset data fluctuation degree prediction model, the method further comprises:
setting the learning rate, the number of iterations and the step interval of an initial data fluctuation degree prediction model to obtain a target data fluctuation degree prediction model;
And training the target data fluctuation degree prediction model by using a training data set to obtain a trained preset data fluctuation degree prediction model.
6. The method according to claim 5, wherein before training the target data fluctuation degree prediction model by using a training data set to obtain a trained preset data fluctuation degree prediction model, further comprises:
determining training characteristic parameters and chromosome distribution diagrams in each test interval by using the training data number statistical file in the test interval and the comparison data number median matrix file of the comparison data set in the test interval;
performing grade type marking on corresponding training data based on the data fluctuation information in the chromosome distribution diagram to obtain grade type marking information of the test data;
and merging all training characteristic parameters of the training data with the corresponding grade typing marking information to obtain a training data set.
7. The method for predicting the degree of fluctuation of sequencing data according to any one of claims 1 to 6, wherein the result of predicting the degree of fluctuation comprises: a normal fluctuation degree prediction result, a discrete fluctuation degree prediction result, a fluctuation type fluctuation degree prediction result, and a degradation type fluctuation degree prediction result.
8. A fluctuation degree prediction apparatus of sequencing data, comprising:
the data mapping module is used for mapping the original data to be detected to a reference genome to obtain a genome comparison file;
the data processing module is used for carrying out data processing on the original sequencing data in the target interval of each genome in the genome comparison file so as to generate a sequencing file to be processed;
the parameter extraction module is used for obtaining characteristic parameters of each autosome to be detected in the target interval in the sequencing file to be processed;
the result prediction module is used for inputting the characteristic parameters into a preset data fluctuation degree prediction model so that the preset data fluctuation degree prediction model predicts the fluctuation degree of the original data to be detected based on the characteristic parameters and outputs a corresponding fluctuation degree prediction result.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the fluctuation degree prediction method of sequencing data as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program; wherein the computer program when executed by a processor implements the steps of the method for predicting the extent of fluctuation of sequencing data as claimed in any one of claims 1 to 7.
CN202311562929.9A 2023-11-22 2023-11-22 Method, device, equipment and medium for predicting fluctuation degree of sequencing data Pending CN117558348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311562929.9A CN117558348A (en) 2023-11-22 2023-11-22 Method, device, equipment and medium for predicting fluctuation degree of sequencing data

Publications (1)

Publication Number Publication Date
CN117558348A true CN117558348A (en) 2024-02-13

Family

ID=89816324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311562929.9A Pending CN117558348A (en) 2023-11-22 2023-11-22 Method, device, equipment and medium for predicting fluctuation degree of sequencing data

Country Status (1)

Country Link
CN (1) CN117558348A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination