CN116364294A - Database system establishment and application thereof - Google Patents
- Publication number
- CN116364294A, CN202111635764.4A
- Authority
- CN
- China
- Prior art keywords
- data
- module
- database
- database system
- screening
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to a method for establishing a database system, comprising the following steps: step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory; step two, converting the data file into a processable data file using software; step three, processing the processable data using a data conversion method; step four, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; step five, if the data requirements are met, cleaning the sample characteristic data; step six, performing data modeling and prediction on the processed characteristic data using an algorithm, and screening out key peaks and their abundance data; step seven, training a machine learning model in advance using accurately hand-labeled data; and step eight, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
Description
Technical Field
The invention relates to a method for processing the detection-result data produced when an analytical instrument such as a mass spectrometer, a gas chromatograph or a liquid chromatograph is used for proteomics, peptidomics, metabolomics and similar analyses of a biological liquid biopsy sample.
Background
In the field of in vitro diagnostics, the progression of a disease is always accompanied by abnormalities in proteins, polypeptides and metabolites. For example, in the serum of cancer patients, some polypeptides persist at very low levels (e.g., all FPA fragments in various cancer patients and 3C3f fragments in breast cancer patients), while others persist at high levels (e.g., several C3f fragments in bladder and prostate cancers and one FPA fragment in breast cancer). From a data-analysis perspective, the various attributes of the detected objects must be acquired and normalized into feature vectors of equal length, and the feature vectors are then analyzed by computational means in order to identify markers of diseases such as tumors and coronary heart disease and thereby diagnose those diseases. In this process, marker extraction is particularly important, because it directly affects the accuracy of the diagnostic result.
Experimental data processing is a commonly used scientific computing method, widely applied in production and scientific research, and is an important tool for product design, quality management and scientific research. By analyzing the data obtained from detection methods such as spectroscopy, chromatography and mass spectrometry with dedicated computational means, markers of chronic diseases such as tumors, coronary heart disease, hypertension and diabetes can be identified rapidly.
Disclosure of Invention
In order to analyze the obtained experimental data through rigorous and accurate data processing, discover the underlying regularities, and provide a basis for the diagnosis of chronic diseases, the invention provides a method for processing the detection-result data obtained when a mass spectrometer, a gas chromatograph or a liquid chromatograph is used for proteomics, peptidomics, metabolomics and similar analyses of a biological liquid biopsy sample.
Specifically, the present invention includes the following embodiments.
1. A method for establishing a database system, comprising the following steps:
step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;
step two, converting the fid data file into a processable data file using CompassXport software;
step three, processing the processable data using a data conversion method to obtain the accurate relative abundance of each component;
step four, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step five, if the data requirements are met, cleaning the sample characteristic data;
step six, performing data modeling and prediction on the processed characteristic data using an algorithm, and screening out key peaks and their abundance data;
step seven, training a machine learning model in advance using accurately hand-labeled data;
and step eight, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
2. The method for establishing a database system according to claim 1, wherein,
in step one, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
3. The method for establishing a database system according to claim 1, wherein,
in step one, the biological liquid biopsy sample is any one selected from serum, urine, interstitial fluid, saliva, plasma and joint fluid.
4. The method for establishing a database system according to claim 1, wherein in step two a multi-process processing method is adopted to improve the efficiency of data conversion.
5. The method for establishing a database system according to claim 1, wherein in step two the processable data file is a data file in mzML, txt or csv format.
6. The method for establishing a database system according to claim 1, wherein
the data conversion method in step three comprises the following steps:
S1, processing the data using a square root method;
S2, smoothing the data using the Savitzky-Golay method;
S3, correcting the data using the SNIP method;
S4, calculating the peak intensity.
7. The method for establishing a database system according to claim 1, wherein the data cleaning in step five is one or more selected from deleting abnormal data, data filling and feature screening.
8. The method for establishing a database system according to claim 1, wherein the algorithm in step six is a random forest, an SVM, a neural network or a Bayesian network.
The invention also includes the following embodiments:
9. A system for establishing a database, comprising the following modules:
an acquisition module, which detects a biological liquid biopsy sample with an analytical instrument and acquires the raw fid data file generated in the laboratory;
a data conversion module, which converts the fid data file into a processable data file using CompassXport software;
a data processing module, which processes the processable data using a data conversion method to obtain the accurate relative abundance of each component;
a data analysis module, which performs QC analysis on the experiment according to the sample characteristic data and judges whether the data requirements are met; if not, the cause is analyzed and the experiment is repeated; if the data requirements are met, the sample characteristic data are cleaned;
a data modeling prediction module, which performs data modeling and prediction on the processed characteristic data using an algorithm and screens out key peaks and their abundance data;
a computer training module, which trains a machine learning model in advance using accurately hand-labeled data;
and a database auxiliary treatment construction module, which inputs the screened data into the pre-trained machine learning model and outputs a corresponding prediction result to assist clinical diagnosis.
10. The system for establishing a database according to claim 9, wherein,
in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
11. The system for establishing a database according to claim 9, wherein,
in the acquisition module, the biological liquid biopsy sample is any one selected from serum, urine, interstitial fluid, saliva, plasma and joint fluid.
12. The system for establishing a database according to claim 9, wherein,
in the data conversion module, a multi-process processing method is adopted to improve the efficiency of data conversion.
13. The system for establishing a database according to claim 9, wherein in the data conversion module the processable data file is a data file in mzML, txt or csv format.
14. The system for establishing a database according to claim 9, wherein
the data conversion method in the data processing module comprises the following steps:
S1, processing the data using a square root method;
S2, smoothing the data using the Savitzky-Golay method;
S3, correcting the data using the SNIP method;
S4, calculating the peak intensity.
15. The system for establishing a database according to claim 9, wherein in the data analysis module the data cleaning is one or more selected from deleting abnormal data, data filling and feature screening.
16. The system for establishing a database according to claim 9, wherein in the data modeling prediction module the algorithm is a random forest, an SVM, a neural network or a Bayesian network.
Drawings
Fig. 1 is a flow chart of the establishment of the database system of the invention.
Fig. 2 is a spectrum obtained by analyzing a serum sample from a healthy human under conventional conditions using a mass spectrometer.
Fig. 3 is a spectrum obtained by analyzing a serum sample from a healthy human under conventional conditions using a spectrometer.
Fig. 4 is a spectrum obtained by analyzing a urine sample from a healthy human under conventional conditions using a liquid chromatograph.
Detailed Description
One aspect of the invention relates to a method for establishing a database system, comprising the following steps:
step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory, wherein the analytical instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine and interstitial fluid;
step two, converting the fid data file into a processable mzML, txt or csv data file using CompassXport software, preferably with a multi-process processing method to improve the data conversion efficiency;
step three, processing the mzML, txt, csv or other data using a data conversion method to obtain the accurate relative abundance of each component;
step four, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step five, if the data requirements are met, cleaning the sample characteristic data, for example by deleting abnormal data, filling data and screening features;
step six, performing data modeling and prediction on the processed characteristic data, mainly using the following algorithms: random forests, SVMs, neural networks, Bayesian networks, etc., and screening out key peaks and their abundance data;
step seven, training a machine learning model in advance using accurately hand-labeled data;
and step eight, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
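The multi-process conversion preferred in step two could be organized roughly as follows. This is only a sketch: `plan_jobs`, `convert_one` and `convert_all` are hypothetical names, and `convert_one` is a stand-in that does not call CompassXport, because the source does not give the converter's command-line options.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def plan_jobs(fid_paths, out_dir, fmt="mzML"):
    """Pair each raw fid file with an output path in the target format."""
    out_dir = Path(out_dir)
    jobs = []
    for src in map(Path, fid_paths):
        # Assumption: a file literally named "fid" is identified by its folder.
        name = src.parent.name if src.name == "fid" else src.stem
        jobs.append((str(src), str(out_dir / f"{name}.{fmt}")))
    return jobs


def convert_one(job):
    """Hypothetical stand-in for one converter call.

    A real pipeline would launch the vendor converter here; its options are
    not stated in the source, so none are guessed.
    """
    src, dst = job
    return src, dst


def convert_all(jobs, workers=4):
    """Run the per-file conversions in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, jobs))
```

Because each file is converted independently of the others, the jobs can be spread over worker processes without any shared state; `convert_all(plan_jobs(paths, "converted"))` would be called from a script's main entry point.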
The data conversion method in step three comprises the following steps:
1. processing the data using a square root method to reduce data noise;
2. smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
3. correcting the baseline of the data using the SNIP method;
4. calculating the peak intensity.
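The first two conversion steps can be illustrated with a minimal sketch. It assumes a 5-point window with the classic quadratic Savitzky-Golay weights; the source does not state a window size or polynomial order, and the function names are illustrative only.

```python
import math

# Classic 5-point quadratic Savitzky-Golay smoothing weights, divided by 35.
SG5 = (-3, 12, 17, 12, -3)
SG5_NORM = 35


def sqrt_transform(intensities):
    """Square-root transform; negative readings are clipped to 0 first."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]


def savgol5(values):
    """Smooth with the 5-point filter; the two points at each edge are kept."""
    n = len(values)
    out = list(values)
    for i in range(2, n - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG5, window)) / SG5_NORM
    return out


def transform_and_smooth(intensities):
    """Apply the square-root transform, then the smoothing filter."""
    return savgol5(sqrt_transform(intensities))
```

The square-root transform is commonly used to damp the intensity dependence of the noise, and the Savitzky-Golay filter fits a low-order polynomial in each window, so it tends to preserve peak shape better than a plain moving average.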
Another aspect of the invention relates to a system for establishing a database, comprising the following modules:
an acquisition module, which detects a biological liquid biopsy sample with an analytical instrument to acquire the raw fid data file generated in the laboratory, wherein the analytical instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid;
a data conversion module, which converts the fid data file into a processable data file using CompassXport software, wherein the processable data file is preferably in mzML, txt or csv format, and a multi-process processing method is preferably adopted to improve the data conversion efficiency;
a data processing module, which processes the data using a data conversion method to obtain the accurate relative abundance of each component;
a data analysis module, which performs QC analysis on the experiment according to the sample characteristic data and judges whether the data requirements are met; if not, the cause is analyzed and the experiment is repeated; if the data requirements are met, the sample characteristic data are cleaned, the data cleaning preferably being one or more selected from deleting abnormal data, filling data and screening features;
a data modeling prediction module, which performs data modeling and prediction on the processed characteristic data using an algorithm, preferably mainly one of the following: random forests, SVMs, neural networks, Bayesian networks, etc., and screens out key peaks and their abundance data;
a computer training module, which trains a machine learning model in advance using accurately hand-labeled data;
and a database auxiliary treatment construction module, which inputs the screened data into the pre-trained machine learning model and outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in the data processing module comprises the following steps:
S1, processing the data using a square root method to reduce data noise;
S2, smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
S3, correcting the baseline of the data using the SNIP method;
S4, calculating the peak intensity.
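Step S3 can be sketched as follows. SNIP estimates a baseline by iterative peak clipping; this minimal version uses a window that grows by one point per pass, and the iteration count is an assumed parameter that the source does not specify.

```python
def snip_baseline(values, iterations=20):
    """Estimate a baseline by iterative peak clipping (SNIP-style)."""
    base = list(values)
    n = len(base)
    for p in range(1, iterations + 1):
        prev = base[:]  # each pass clips against the previous pass only
        for i in range(p, n - p):
            mean = (prev[i - p] + prev[i + p]) / 2.0
            # A point above the mean of its two neighbours at distance p
            # is treated as part of a peak and clipped down to that mean.
            base[i] = min(prev[i], mean)
    return base


def remove_baseline(values, iterations=20):
    """Subtract the estimated baseline, flooring the result at zero."""
    base = snip_baseline(values, iterations)
    return [max(v - b, 0.0) for v, b in zip(values, base)]
```

The iteration count acts as the widest half-window considered, so it would normally be chosen larger than the half-width of the broadest peak expected in the data.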
Examples
Example 1
Processing of mass-to-charge ratio data obtained by analyzing a serum sample with a mass spectrometer:
step one, analyzing the serum sample with a mass spectrometer to obtain the mass-to-charge ratio data of each component after ionization; obtaining the raw fid data file generated by the mass spectrometer and converting it into mzML format data using CompassXport;
step two, reading the mzML data and performing baseline-removal conversion by the data conversion method to obtain the accurate relative ion abundance of each component;
step three, performing QC analysis on the experiment according to the obtained data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step four, performing data cleaning, applying a random forest algorithm according to the sample type, and screening out key peaks and their ion abundance data in combination with previous model training data;
step five, training a machine learning model in advance using accurately hand-labeled data;
and step six, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. processing the data using a square root method to reduce data noise;
2. smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
3. correcting the baseline of the data using the SNIP method;
4. calculating the peak intensity.
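The final conversion step, calculating the peak intensity, might look like the sketch below. It takes a peak to be a point that is the unique maximum of a small window and lies above a threshold; the window size and the threshold are assumptions, since the source does not say how peaks are detected.

```python
def pick_peaks(mz, intensity, half_window=2, min_intensity=0.0):
    """Return (m/z, intensity) pairs for unique local maxima above a threshold."""
    peaks = []
    n = len(intensity)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensity[lo:hi]
        top = max(window)
        is_unique_max = intensity[i] == top and window.count(top) == 1
        if is_unique_max and intensity[i] > min_intensity:
            peaks.append((mz[i], intensity[i]))
    return peaks
```

Run on a smoothed, baseline-corrected spectrum, the returned pairs would form the per-sample peak list that the later cleaning and screening steps operate on.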
Example 2
Processing of detection-result data obtained by analyzing a serum sample with a spectrometer:
step one, analyzing a serum sample with a spectrometer and obtaining the raw fid data file generated in the laboratory;
step two, converting the fid data file into a processable txt data file using CompassXport software;
step three, processing the txt data using a data conversion method to obtain the accurate relative abundance of each component;
step four, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step five, if the data requirements are met, deleting abnormal data from the sample characteristic data;
step six, performing data modeling and prediction on the processed characteristic data using an SVM algorithm, and screening out key peaks and their abundance data;
step seven, training a machine learning model in advance using accurately hand-labeled data;
and step eight, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step three comprises the following steps:
1. processing the data using a square root method to reduce data noise;
2. smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
3. correcting the baseline of the data using the SNIP method;
4. calculating the peak intensity.
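The deletion of abnormal data in step five of this example is not defined further in the source. One possible reading, shown here purely as an assumption, is to drop samples whose total intensity is a robust outlier, judged by the median absolute deviation (MAD); the cutoff of 3.5 is a conventional choice, not a value from the source.

```python
from statistics import median


def drop_abnormal(samples, cutoff=3.5):
    """Keep samples whose total intensity is not a robust outlier.

    `samples` maps a sample id to its list of peak intensities.
    """
    totals = {sid: sum(values) for sid, values in samples.items()}
    med = median(totals.values())
    mad = median(abs(t - med) for t in totals.values())
    if mad == 0:
        return dict(samples)  # no spread to judge against
    kept = {}
    for sid, values in samples.items():
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * abs(totals[sid] - med) / mad
        if score <= cutoff:
            kept[sid] = values
    return kept
```

A median-based score is used because a single extreme sample would distort a mean and standard deviation computed from the same batch.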
Example 3
Processing of mass-to-charge ratio data obtained by analyzing a urine sample with a mass spectrometer:
step one, analyzing the urine sample with a mass spectrometer to obtain the mass-to-charge ratio data of each component after ionization; obtaining the raw fid data file generated by the mass spectrometer and converting it into mzML format data using CompassXport;
step two, processing the mzML data using a data conversion method to obtain the accurate relative ion abundance of each component;
step three, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step four, if the data requirements are met, deleting abnormal data from the sample characteristic data and filling in missing data;
step five, performing data modeling and prediction on the processed characteristic data using a Bayesian network algorithm, and screening out key peaks and their ion abundance data;
step six, training a machine learning model in advance using accurately hand-labeled data;
and step seven, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. processing the data using a square root method to reduce data noise;
2. smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
3. correcting the baseline of the data using the SNIP method;
4. calculating the peak intensity.
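The data filling mentioned in step four of this example could be done in several ways, and the source does not name one. The sketch below assumes per-feature median imputation over a samples-by-peaks table in which `None` marks a missing intensity.

```python
from statistics import median


def fill_missing(rows):
    """Replace None in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: a peak never observed in any sample is filled with 0.
        fill = median(observed) if observed else 0.0
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled
```

Filling after the abnormal samples have been removed keeps outliers from shifting the fill values.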
Example 4
Processing of protein, polypeptide and metabolite detection-result data obtained by analyzing a urine sample with a liquid chromatograph:
step one, analyzing the urine sample with a liquid chromatograph to obtain the detection data of each component; obtaining the raw fid data file generated by the liquid chromatograph and converting it into txt format using CompassXport;
step two, processing the txt data using a data conversion method to obtain the accurate relative abundance of each component;
step three, performing QC analysis on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step four, if the data requirements are met, deleting abnormal data from the sample characteristic data and screening features;
step five, performing data modeling and prediction on the processed characteristic data using a random forest algorithm, and screening out key peaks and their abundance data;
step six, training a machine learning model in advance using accurately hand-labeled data;
and step seven, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. processing the data using a square root method to reduce data noise;
2. smoothing the data using the Savitzky-Golay method to improve the regularity of the data;
3. correcting the baseline of the data using the SNIP method;
4. calculating the relative intensity of each substance.
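The relative intensity calculation in the last conversion step can be sketched as below. It assumes scaling to the largest signal (set to 100), which is one common convention; the source does not state which reference value is used.

```python
def relative_intensity(intensities):
    """Scale intensities so that the largest one equals 100."""
    top = max(intensities)
    if top <= 0:
        raise ValueError("no positive signal to normalise against")
    return [100.0 * v / top for v in intensities]
```

For example, intensities of 5, 20 and 10 become 25, 100 and 50, which makes samples measured at different absolute signal levels comparable.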
Example 5
The spectrometer detects saliva samples to obtain detection result data processing:
step one, detecting saliva samples by using a spectrometer, and obtaining a fid original data file generated in a laboratory;
step two, converting the fid data file into a txt data file which can be processed by using CompassXport software;
thirdly, processing txt data by using a data conversion method to obtain accurate relative abundance of each component;
analyzing and judging the QC of the experiment according to the condition of the sample characteristic data, judging whether the data requirement is met or not, if not, analyzing the reason, and re-experiment;
step five, if the data requirement is met, deleting abnormal data from the sample characteristic data;
and step six, carrying out data modeling prediction on the processed characteristic data by using a neural network algorithm, and screening out key peaks and abundance data thereof.
Step seven, data with accurate manual labeling are used in advance and used for training a machine learning model;
and step eight, inputting the data obtained by screening into a pre-trained machine learning model, and giving out a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step three comprises the following steps:
1. transforming the data with the square root method to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make regular patterns in the data clearer;
3. baseline-correcting the data with the SNIP method;
4. calculating the relative intensity of each substance.
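Conversion steps 3 and 4 above (SNIP correction and relative intensity) can be sketched as follows. SNIP is an iterative peak-clipping baseline estimator; this minimal version uses an increasing window and omits refinements found in production implementations. Normalizing each intensity to the total is an assumption, since the source does not say whether intensities are relative to the total or to the largest peak, and all function names are hypothetical.

```python
def snip_baseline(values, iterations=20):
    """Estimate the baseline by iterative peak clipping (basic SNIP variant)."""
    baseline = list(values)
    n = len(baseline)
    for p in range(1, iterations + 1):
        prev = list(baseline)
        for i in range(p, n - p):
            # Replace a point by the mean of its neighbours at distance p
            # whenever that mean is lower, which clips peaks but keeps the floor.
            baseline[i] = min(prev[i], (prev[i - p] + prev[i + p]) / 2)
    return baseline


def baseline_correct(values, iterations=20):
    """Subtract the SNIP baseline estimate from the signal."""
    base = snip_baseline(values, iterations)
    return [v - b for v, b in zip(values, base)]


def relative_intensities(peak_intensities):
    """Express each peak as a fraction of the summed intensity (assumption)."""
    total = sum(peak_intensities)
    if total == 0:
        return [0.0] * len(peak_intensities)
    return [v / total for v in peak_intensities]
```

The `iterations` parameter bounds the widest peak that is clipped away, so it should be chosen larger than the half-width (in data points) of the broadest peak of interest.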
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part in software, they take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope of protection.
Claims (16)
1. A method of database system creation, comprising the steps of:
step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;
step two, converting the fid data file into a processable data file using CompassXport software;
step three, processing the processable data with a data conversion method to obtain the accurate relative abundance of each component;
step four, analyzing the sample characteristic data to judge the QC of the experiment and whether the data requirements are met; if not, analyzing the cause and repeating the experiment;
step five, if the data requirements are met, cleaning the sample characteristic data;
step six, performing data modeling and prediction on the processed characteristic data using an algorithm, and screening out key peaks and their abundance data;
step seven, using data with accurate manual labels in advance to train a machine learning model;
and step eight, inputting the screened data into the pre-trained machine learning model, which gives a corresponding prediction result to assist clinical diagnosis.
2. The method for building a database system according to claim 1, wherein,
in the first step, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
3. The method for building a database system according to claim 1, wherein,
in step one, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.
4. The method for building a database system according to claim 1, wherein in the second step, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.
5. The method of creating a database system according to claim 1, wherein in the second step, the processable data file is a data file in mzML, txt, or csv format.
6. The method for building a database system according to claim 1, wherein
the data conversion method in the third step comprises the following steps:
S1, processing the data using a square root method;
S2, smoothing the data using the Savitzky-Golay method;
S3, correcting the data using the SNIP method;
S4, calculating peak intensity.
7. The method of building a database system according to claim 1, wherein the data cleansing in step five is selected from one or more of deleting anomalous data, data padding, and feature screening.
8. The method of building a database system according to claim 1, wherein the algorithm in step six is a random forest, SVM, neural network, or Bayesian network.
9. A system for building a database, comprising the following modules:
the acquisition module, used for detecting a biological liquid biopsy sample with an analytical instrument and acquiring the raw fid data file generated in the laboratory;
the data conversion module, used for converting the fid data file into a processable data file using CompassXport software;
the data processing module, used for processing the processable data with a data conversion method to obtain the accurate relative abundance of each component;
the data analysis module, used for analyzing the sample characteristic data to judge the QC of the experiment and whether the data requirements are met; if not, analyzing the cause and repeating the experiment; if the data requirements are met, cleaning the sample characteristic data;
the data modeling prediction module, used for performing data modeling and prediction on the processed characteristic data using an algorithm, and screening out key peaks and their abundance data;
the computer training module, used for training a machine learning model in advance using data with accurate manual labels;
and the database auxiliary treatment construction module, used for inputting the screened data into the pre-trained machine learning model and giving a corresponding prediction result to assist clinical diagnosis.
10. The system for creating a database as claimed in claim 9, wherein,
in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
11. The system for creating a database as claimed in claim 9, wherein,
in the acquisition module, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.
12. The system for creating a database as claimed in claim 9, wherein,
in the data conversion module, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.
13. The system for creating a database according to claim 9, wherein in the data conversion module, the processable data file is a data file in mzML, txt, or csv format.
14. The system for creating a database as recited in claim 9, wherein
the data conversion method in the data processing module comprises the following steps:
S1, processing the data using a square root method;
S2, smoothing the data using the Savitzky-Golay method;
S3, correcting the data using the SNIP method;
S4, calculating peak intensity.
15. The system for creating a database according to claim 9, wherein in the data analysis module, the data cleaning is selected from one or more of deleting anomalous data, data padding, and feature screening.
16. The system for creating a database according to claim 9, wherein in the data modeling prediction module, the algorithm is a random forest, SVM, neural network, or Bayesian network.
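Claims 7 and 15 list three data cleaning options: deleting anomalous data, data padding, and feature screening. The sketch below illustrates one simple way each could be done on a sample-by-feature table in which missing values are stored as `None`. The specific criteria (a missing-value fraction threshold, median padding, dropping constant features) and all function names are assumptions chosen for illustration; the claims do not specify them.

```python
from statistics import median


def drop_abnormal(rows, max_missing_fraction=0.5):
    """Delete samples in which too many features are missing (hypothetical criterion)."""
    return [r for r in rows
            if sum(v is None for v in r) / len(r) <= max_missing_fraction]


def pad_missing(rows):
    """Fill each missing value with the median of the observed values in its column."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        # A column with no observed values at all is padded with 0.0 (assumption).
        fills.append(median(observed) if observed else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def screen_features(rows):
    """Drop features that take the same value in every sample.

    Returns the kept column indices and the reduced rows.
    """
    keep = [j for j, col in enumerate(zip(*rows)) if len(set(col)) > 1]
    return keep, [[r[j] for j in keep] for r in rows]
```

Applying the three functions in the order shown avoids letting a nearly empty sample distort the column medians used for padding.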
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111635764.4A CN116364294A (en) | 2021-12-28 | 2021-12-28 | Database system establishment and application thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111635764.4A CN116364294A (en) | 2021-12-28 | 2021-12-28 | Database system establishment and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116364294A (en) | 2023-06-30 |
Family ID: 86928936
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111635764.4A Pending CN116364294A (en) | 2021-12-28 | 2021-12-28 | Database system establishment and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116364294A (en) |
2021-12-28: CN application CN202111635764.4A, published as CN116364294A (en), status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||