CN116364294A - Database system establishment and application thereof - Google Patents

Publication number
CN116364294A
Authority
CN
China
Prior art keywords
data
module
database
database system
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111635764.4A
Other languages
Chinese (zh)
Inventor
傅博
韩嘉宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Gurong Biotechnology Co ltd
Original Assignee
Shanghai Gurong Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Gurong Biotechnology Co ltd
Priority to CN202111635764.4A
Publication of CN116364294A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to a method for establishing a database system, comprising the following steps: step one, analyzing a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory; step two, converting the data file into a processable data file using software; step three, processing the processable data with a data conversion method; step four, performing quality control (QC) based on the sample feature data to judge whether the data requirements are met; step five, if the data requirements are met, cleaning the sample feature data; step six, modeling the processed feature data with an algorithm and screening out the key peaks and their abundance data; step seven, training a machine learning model in advance on accurately, manually labeled data; and step eight, inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.

Description

Database system establishment and application thereof
Technical Field
The invention relates to a method for processing the data produced when an analytical instrument, such as a mass spectrometer, a gas chromatograph or a liquid chromatograph, analyzes a biological liquid biopsy sample in proteomics, peptidomics, metabolomics and similar studies.
Background
In the field of in vitro diagnostics, the progression of a disease is typically accompanied by abnormalities in proteins, polypeptides and metabolites. For example, in the serum of cancer patients, some polypeptides persist at very low levels (e.g., all FPA fragments in various cancer patients and 3C3f fragments in breast cancer patients), while others persist at high levels (e.g., several C3f fragments in bladder and prostate cancers and one FPA fragment in breast cancer). From the perspective of data analysis, the various attributes of the detected objects must be acquired and normalized into feature vectors of equal length, and the feature vectors are then analyzed by computational means to identify markers of diseases such as tumors and coronary heart disease and thereby diagnose those diseases. In this process, marker extraction is particularly important because it directly affects the accuracy of the diagnostic result.
Experimental data processing is a commonly used scientific computing method that is widely applied in production and scientific research, and it is an important tool for product design, quality management and scientific research. By analyzing the data obtained from detection methods such as spectroscopy, chromatography and mass spectrometry with dedicated computational means, markers of chronic diseases such as tumors, coronary heart disease, hypertension and diabetes can be identified rapidly.
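By way of illustration only, the normalization of detected attributes into feature vectors of equal length could be sketched as follows in Python. The fixed-width binning scheme, the function name to_feature_vector, the m/z range, the bin width and the peak values are all assumptions introduced for this sketch; the source does not specify how the equal-length vectors are built.

```python
def to_feature_vector(peaks, mz_min, mz_max, bin_width):
    """Map (m/z, intensity) peaks onto fixed-width bins, keeping the max intensity per bin.

    Hypothetical helper: every sample yields a vector of the same length,
    regardless of how many peaks it contains.
    """
    n_bins = int((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[index] = max(vector[index], intensity)
    return vector


# Invented peak lists for two samples with different numbers of peaks.
sample_a = [(1020.4, 35.0), (1465.7, 80.0), (2021.1, 12.5)]
sample_b = [(1465.9, 60.0), (1778.0, 22.0)]

vec_a = to_feature_vector(sample_a, 1000.0, 3000.0, 500.0)
vec_b = to_feature_vector(sample_b, 1000.0, 3000.0, 500.0)
print(vec_a)
print(vec_b)
```

Both samples end up as vectors with four entries, so they can be compared position by position even though their peak lists differ in length.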
Disclosure of Invention
In order to analyze the obtained experimental data through rigorous and accurate data processing, uncover the underlying regularities, and provide a basis for the diagnosis of chronic diseases, the invention provides a method for processing the data produced when a mass spectrometer, a gas chromatograph or a liquid chromatograph analyzes a biological liquid biopsy sample in proteomics, peptidomics, metabolomics and similar studies.
Specifically, the present invention includes the following embodiments.
1. A method for establishing a database system, comprising the following steps:
step one: analyzing a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;
step two: converting the fid data file into a processable data file using CompassXport software;
step three: processing the processable data with a data conversion method to obtain an accurate relative abundance for each component;
step four: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step five: if the data requirements are met, cleaning the sample feature data;
step six: modeling the processed feature data with an algorithm and screening out the key peaks and their abundance data;
step seven: training a machine learning model in advance on accurately, manually labeled data;
and step eight: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
2. The method for establishing a database system according to embodiment 1, wherein
in step one, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
3. The method for establishing a database system according to embodiment 1, wherein
in step one, the biological liquid biopsy sample is any one selected from serum, urine, interstitial fluid, saliva, plasma and joint fluid.
4. The method for establishing a database system according to embodiment 1, wherein in step two, a multi-process processing method is adopted to improve the efficiency of data conversion.
5. The method for establishing a database system according to embodiment 1, wherein in step two, the processable data file is a data file in mzML, txt or csv format.
6. The method for establishing a database system according to embodiment 1, wherein
the data conversion method in step three comprises the following steps:
S1: applying a square-root transformation to the data;
S2: smoothing the data with the Savitzky-Golay method;
S3: correcting the data with the SNIP method;
S4: calculating the peak intensities.
7. The method for establishing a database system according to embodiment 1, wherein the data cleaning in step five is one or more selected from deleting abnormal data, data filling and feature screening.
8. The method for establishing a database system according to embodiment 1, wherein the algorithm in step six is a random forest, SVM, neural network or Bayesian network.
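By way of non-limiting illustration, the multi-process conversion of item 4 above might be organized as in the following Python sketch. It is a minimal sketch under stated assumptions: a pool of worker threads each launches its own external converter process, so several conversions run in separate operating-system processes at once. The command line of CompassXport is not given in the source and is not assumed here; the converter is passed in as a command template, and STAND_IN is a hypothetical placeholder command used only so that the sketch runs. The function names and file names are likewise hypothetical.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_one(job, command_template):
    """Run one external converter process for a (source, destination) pair."""
    src, dst = job
    cmd = [part.format(src=src, dst=dst) for part in command_template]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return src, dst, result.returncode


def convert_all(jobs, command_template, workers=4):
    """Convert many files at once; each worker thread drives its own converter process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: convert_one(job, command_template), jobs))


# Stand-in converter (hypothetical): a Python one-liner that upper-cases a text file.
# Replace it with the real converter command taken from that tool's own documentation.
STAND_IN = [sys.executable, "-c",
            "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
            "{src}", "{dst}"]

with tempfile.TemporaryDirectory() as tmp:
    jobs = []
    for i in range(3):
        src = Path(tmp) / f"sample_{i}.raw"
        src.write_text(f"spectrum {i}")
        jobs.append((str(src), str(Path(tmp) / f"sample_{i}.mzML")))
    results = convert_all(jobs, STAND_IN)
    return_codes = [code for _, _, code in results]
    converted = [Path(dst).read_text() for _, dst, _ in results]
```

Returning the exit code of every job lets a caller re-queue only the conversions that failed instead of repeating the whole batch.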
The invention also includes the following embodiments:
9. A system for establishing a database, comprising the following modules:
an acquisition module, which analyzes a biological liquid biopsy sample with an analytical instrument and acquires the raw fid data file generated in the laboratory;
a data conversion module, which converts the fid data file into a processable data file using CompassXport software;
a data processing module, which processes the processable data with a data conversion method to obtain an accurate relative abundance for each component;
a data analysis module, which performs quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met; if they are not met, the cause is analyzed and the experiment is repeated; if they are met, the sample feature data are cleaned;
a data modeling prediction module, which models the processed feature data with an algorithm and screens out the key peaks and their abundance data;
a computer training module, which trains a machine learning model in advance on accurately, manually labeled data;
and a database auxiliary treatment construction module, which inputs the screened data into the pre-trained machine learning model and outputs a corresponding prediction result to assist clinical diagnosis.
10. The system for establishing a database according to embodiment 9, wherein
in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
11. The system for establishing a database according to embodiment 9, wherein
in the acquisition module, the biological liquid biopsy sample is any one selected from serum, urine, interstitial fluid, saliva, plasma and joint fluid.
12. The system for establishing a database according to embodiment 9, wherein
in the data conversion module, a multi-process processing method is adopted to improve the efficiency of data conversion.
13. The system for establishing a database according to embodiment 9, wherein in the data conversion module, the processable data file is a data file in mzML, txt or csv format.
14. The system for establishing a database according to embodiment 9, wherein
the data conversion method in the data processing module comprises the following steps:
S1: applying a square-root transformation to the data;
S2: smoothing the data with the Savitzky-Golay method;
S3: correcting the data with the SNIP method;
S4: calculating the peak intensities.
15. The system for establishing a database according to embodiment 9, wherein in the data analysis module, the data cleaning is one or more selected from deleting abnormal data, data filling and feature screening.
16. The system for establishing a database according to embodiment 9, wherein in the data modeling prediction module, the algorithm is a random forest, SVM, neural network or Bayesian network.
Drawings
FIG. 1 is a flow chart of the establishment of the database system of the present invention.
FIG. 2 is a spectrum obtained by analyzing a serum sample from a healthy human under conventional conditions using a mass spectrometer.
FIG. 3 is a spectrum obtained by analyzing a serum sample from a healthy human under conventional conditions using a spectrometer.
FIG. 4 is a spectrum obtained by analyzing a urine sample from a healthy human under conventional conditions using a liquid chromatograph.
Detailed Description
One aspect of the invention relates to a method for establishing a database system, comprising the following steps:
step one: analyzing a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory, wherein the analytical instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine and interstitial fluid;
step two: converting the fid data file into a processable mzML, txt or csv data file using CompassXport software, preferably with a multi-process processing method to improve the data conversion efficiency;
step three: processing the mzML, txt, csv or other data with a data conversion method to obtain an accurate relative abundance for each component;
step four: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step five: if the data requirements are met, cleaning the sample feature data, for example by deleting abnormal data, filling data and screening features;
step six: modeling the processed feature data, mainly with one of the following algorithms: random forest, SVM, neural network, Bayesian network, etc., and screening out the key peaks and their abundance data;
step seven: training a machine learning model in advance on accurately, manually labeled data;
and step eight: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step three comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the peak intensities.
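Purely as an illustrative sketch, the four data conversion steps listed above could be chained as follows in Python, using only the standard library. Several choices here are assumptions not stated in the source: the 5-point quadratic Savitzky-Golay window, the basic increasing-window form of SNIP, the number of SNIP iterations, the local-maximum peak picker with its height threshold, and the synthetic spectrum itself. In practice a library routine such as scipy.signal.savgol_filter could take the place of the hand-written smoother.

```python
import math


def sqrt_transform(y):
    """Step 1: square-root transform to damp intensity-dependent noise."""
    return [math.sqrt(max(v, 0.0)) for v in y]


def savgol5(y):
    """Step 2: 5-point quadratic Savitzky-Golay smoothing (edge points left unchanged)."""
    coeffs = (-3.0, 12.0, 17.0, 12.0, -3.0)
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(c * y[i + j - 2] for j, c in enumerate(coeffs)) / 35.0
    return out


def snip_baseline(y, iterations):
    """Step 3: SNIP baseline estimate by iterative peak clipping with a growing window."""
    base = list(y)
    n = len(base)
    for p in range(1, iterations + 1):
        prev = list(base)
        for i in range(p, n - p):
            base[i] = min(prev[i], 0.5 * (prev[i - p] + prev[i + p]))
    return base


def pick_peaks(y, min_height):
    """Step 4: local maxima above a threshold, returned as (index, intensity) pairs."""
    return [(i, y[i]) for i in range(1, len(y) - 1)
            if y[i] > y[i - 1] and y[i] >= y[i + 1] and y[i] >= min_height]


# Synthetic spectrum (invented): sloping baseline plus two Gaussian peaks at 60 and 140.
raw = [100.0 + 0.5 * i
       + 400.0 * math.exp(-((i - 60) / 3.0) ** 2)
       + 900.0 * math.exp(-((i - 140) / 3.0) ** 2) for i in range(200)]

smoothed = savgol5(sqrt_transform(raw))
baseline = snip_baseline(smoothed, iterations=20)
corrected = [max(s - b, 0.0) for s, b in zip(smoothed, baseline)]
peaks = pick_peaks(corrected, min_height=2.0)
peak_positions = [i for i, _ in peaks]
```

The number of SNIP iterations should exceed the half-width of the widest peak in data points; otherwise part of the peak is absorbed into the estimated baseline.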
Another aspect of the invention relates to a system for establishing a database, comprising the following modules:
an acquisition module, which analyzes a biological liquid biopsy sample with an analytical instrument and acquires the raw fid data file generated in the laboratory, wherein the analytical instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid;
a data conversion module, which converts the fid data file into a processable data file using CompassXport software, wherein the processable data file is preferably in mzML, txt or csv format, and a multi-process processing method is preferably adopted to improve the data conversion efficiency;
a data processing module, which processes the data with a data conversion method to obtain an accurate relative abundance for each component;
a data analysis module, which performs quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met; if they are not met, the cause is analyzed and the experiment is repeated; if they are met, the sample feature data are cleaned, the data cleaning preferably being one or more selected from deleting abnormal data, filling data and screening features;
a data modeling prediction module, which models the processed feature data with an algorithm, preferably mainly one of the following: random forest, SVM, neural network, Bayesian network, etc., and screens out the key peaks and their abundance data;
a computer training module, which trains a machine learning model in advance on accurately, manually labeled data;
and a database auxiliary treatment construction module, which inputs the screened data into the pre-trained machine learning model and outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in the data processing module comprises the following steps:
S1: applying a square-root transformation to the data to reduce data noise;
S2: smoothing the data with the Savitzky-Golay method to make the data more regular;
S3: correcting the baseline of the data with the SNIP method;
S4: calculating the peak intensities.
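As a hedged illustration of the data cleaning carried out by the data analysis module (deleting abnormal data, filling data and screening features), a minimal Python sketch is given below. The specific rules are assumptions made for this sketch only: a sample is treated as abnormal when too few of its features were measured, missing values are filled with the per-feature median, and features are screened by a variance threshold. The function names, thresholds and values are hypothetical.

```python
import statistics


def drop_anomalous(rows, min_present):
    """Delete samples that have too few measured (non-None) features."""
    return [r for r in rows if sum(v is not None for v in r) >= min_present]


def fill_missing(rows):
    """Fill each missing value with the median of that feature over the measured samples.

    Assumes every feature was measured in at least one remaining sample.
    """
    columns = list(zip(*rows))
    medians = [statistics.median([v for v in col if v is not None]) for col in columns]
    return [[m if v is None else v for v, m in zip(r, medians)] for r in rows]


def screen_features(rows, min_variance):
    """Return the indices of features whose variance across samples exceeds a threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if statistics.pvariance(col) > min_variance]


# Invented abundance table: one row per sample, one column per peak.
rows = [
    [10.0, 5.0, 1.0],
    [12.0, None, 1.0],
    [None, None, None],  # failed acquisition, removed as abnormal
    [30.0, 7.0, 1.0],
]

kept_rows = drop_anomalous(rows, min_present=2)
filled = fill_missing(kept_rows)
kept_features = screen_features(filled, min_variance=0.5)
```

The third peak is constant across the remaining samples, so the variance screen removes it; such a feature cannot help any downstream model separate the groups.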
Examples
Example 1
Processing of the mass-to-charge ratio data obtained by analyzing a serum sample with a mass spectrometer:
step one: analyzing the serum sample with a mass spectrometer to obtain mass-to-charge ratio data for each ionized component, obtaining the raw fid data file generated by the mass spectrometer, and converting it into mzML format data using CompassXport;
step two: reading the mzML data and performing baseline removal with the data conversion method to obtain an accurate relative ion abundance for each component;
step three: performing quality control (QC) on the experiment based on the obtained data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step four: cleaning the data, applying a random forest algorithm according to the sample type, and screening out the key peaks and their ion abundance data in combination with previous model training data;
step five: training a machine learning model in advance on accurately, manually labeled data;
step six: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the peak intensities.
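The QC judgment of step three in this example is not specified in detail in the source. Purely as a hypothetical sketch, a QC gate over a list of picked peaks could look as follows in Python. The two criteria (a minimum number of peaks and a minimum ratio of the strongest peak to the noise level), the function name qc_check, and all thresholds and values are assumptions made for illustration.

```python
def qc_check(peaks, min_peaks, min_signal_to_noise, noise_level):
    """Return (passed, reasons) for one spectrum given its (m/z, intensity) peak list.

    Hypothetical criteria: enough peaks, and a strongest peak well above the noise.
    """
    reasons = []
    if len(peaks) < min_peaks:
        reasons.append(f"only {len(peaks)} peaks, need {min_peaks}")
    strongest = max((intensity for _, intensity in peaks), default=0.0)
    if noise_level <= 0 or strongest / noise_level < min_signal_to_noise:
        reasons.append("strongest peak too close to the noise level")
    return (not reasons), reasons


# Invented peak lists: one acceptable spectrum and one poor spectrum.
good = [(1020.4, 50.0), (1465.7, 80.0), (2021.1, 30.0)]
poor = [(1020.4, 3.0)]

good_result = qc_check(good, min_peaks=3, min_signal_to_noise=5.0, noise_level=4.0)
poor_result = qc_check(poor, min_peaks=3, min_signal_to_noise=5.0, noise_level=4.0)
```

Returning the list of reasons alongside the pass/fail flag supports the "analyzing the cause" part of the step before the experiment is repeated.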
Example 2
Processing of the detection result data obtained by analyzing a serum sample with a spectrometer:
step one: analyzing the serum sample with a spectrometer and obtaining the raw fid data file generated in the laboratory;
step two: converting the fid data file into a processable txt data file using CompassXport software;
step three: processing the txt data with the data conversion method to obtain an accurate relative abundance for each component;
step four: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step five: if the data requirements are met, deleting abnormal data from the sample feature data;
step six: modeling the processed feature data with an SVM algorithm and screening out the key peaks and their abundance data;
step seven: training a machine learning model in advance on accurately, manually labeled data;
and step eight: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step three comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the peak intensities.
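How the SVM of step six is used to screen key peaks is not detailed in the source. One possible reading, sketched below in Python under stated assumptions, is to fit a linear SVM and rank the peaks by the magnitude of their weights. The batch sub-gradient training loop, the absence of a bias term (the features are assumed to be centered), the hyperparameters and the toy data are all assumptions for illustration; peak 1 is constructed to separate the two groups, while peaks 0 and 2 take the same values in both groups.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights of a linear SVM (hinge loss + L2 penalty) by batch sub-gradient descent.

    X: list of centered feature vectors; y: labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # sample violates the margin, so its hinge term is active
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """Classify by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Invented, centered abundances for three peaks in four samples.
X = [[0.2, 1.0, -0.1], [-0.2, 1.2, 0.1], [0.2, -1.0, -0.1], [-0.2, -1.2, 0.1]]
y = [1, 1, -1, -1]

w = train_linear_svm(X, y)
ranking = sorted(range(len(w)), key=lambda j: -abs(w[j]))  # most influential peak first
predictions = [predict(w, x) for x in X]
```

Weight-based ranking is only meaningful when the features share a common scale, which is one reason to normalize abundances before modeling.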
Example 3
Processing of the mass-to-charge ratio data obtained by analyzing a urine sample with a mass spectrometer:
step one: analyzing the urine sample with a mass spectrometer to obtain mass-to-charge ratio data for each ionized component, obtaining the raw fid data file generated by the mass spectrometer, and converting it into mzML format data using CompassXport;
step two: processing the mzML data with the data conversion method to obtain an accurate relative ion abundance for each component;
step three: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step four: if the data requirements are met, deleting abnormal data from the sample feature data and filling missing data;
step five: modeling the processed feature data with a Bayesian network algorithm and screening out the key peaks and their ion abundance data;
step six: training a machine learning model in advance on accurately, manually labeled data;
and step seven: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the peak intensities.
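The structure of the Bayesian network used in this example is not given in the source. As a stand-in for illustration only, the Python sketch below uses Gaussian naive Bayes, which corresponds to the simplest Bayesian network structure (the class node is the sole parent of every feature node), to show training on manually labeled data and prediction for a new sample. The labels "control" and "case", the function names and all abundance values are hypothetical.

```python
import math
from collections import defaultdict


def fit_naive_bayes(X, y):
    """Estimate the class prior and per-feature Gaussian parameters for each label."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for label, rows in groups.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # A small constant keeps the variance positive for constant features.
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_label(model, x):
    """Return the label with the highest posterior log-probability."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Invented abundances of two key peaks in manually labeled training samples.
X_train = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.8], [3.0, 5.1], [3.3, 5.2], [2.9, 4.9]]
y_train = ["control", "control", "control", "case", "case", "case"]

model = fit_naive_bayes(X_train, y_train)
first = predict_label(model, [1.1, 5.0])
second = predict_label(model, [3.1, 5.0])
```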
Example 4
Processing of the protein, polypeptide and metabolite detection result data obtained by analyzing a urine sample with a liquid chromatograph:
step one: analyzing the urine sample with a liquid chromatograph to obtain detection data for each component, obtaining the raw fid data file generated by the liquid chromatograph, and converting it into txt format using CompassXport;
step two: processing the txt data with the data conversion method to obtain an accurate relative abundance for each component;
step three: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step four: if the data requirements are met, deleting abnormal data from the sample feature data and screening features;
step five: modeling the processed feature data with a random forest algorithm and screening out the key peaks and their abundance data;
step six: training a machine learning model in advance on accurately, manually labeled data;
and step seven: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step two comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the relative intensity of each substance.
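The source does not state which normalization is used for the relative intensities. One common convention, shown in the hedged Python sketch below, divides each component's intensity by the total intensity so that the values sum to 1; normalizing to the strongest peak would be an equally plausible alternative. The function name, component names and values are hypothetical.

```python
def relative_abundance(intensities):
    """Scale component intensities so that they sum to 1 (total-intensity normalization)."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return {name: value / total for name, value in intensities.items()}


# Invented integrated intensities for three components of one sample.
measured = {"component_a": 50.0, "component_b": 30.0, "component_c": 20.0}
relative = relative_abundance(measured)
```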
Example 5
Processing of the detection result data obtained by analyzing a saliva sample with a spectrometer:
step one: analyzing the saliva sample with a spectrometer and obtaining the raw fid data file generated in the laboratory;
step two: converting the fid data file into a processable txt data file using CompassXport software;
step three: processing the txt data with the data conversion method to obtain an accurate relative abundance for each component;
step four: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step five: if the data requirements are met, deleting abnormal data from the sample feature data;
step six: modeling the processed feature data with a neural network algorithm and screening out the key peaks and their abundance data;
step seven: training a machine learning model in advance on accurately, manually labeled data;
and step eight: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
The data conversion method in step three comprises the following steps:
1. applying a square-root transformation to the data to reduce data noise;
2. smoothing the data with the Savitzky-Golay method to make the data more regular;
3. correcting the baseline of the data with the SNIP method;
4. calculating the relative intensity of each substance.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
The foregoing description of the preferred embodiments is not intended to limit the invention; any modifications, equivalents and alternatives falling within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims (16)

1. A method for establishing a database system, comprising the following steps:
step one: analyzing a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;
step two: converting the fid data file into a processable data file using CompassXport software;
step three: processing the processable data with a data conversion method to obtain an accurate relative abundance for each component;
step four: performing quality control (QC) on the experiment based on the sample feature data to judge whether the data requirements are met and, if they are not met, analyzing the cause and repeating the experiment;
step five: if the data requirements are met, cleaning the sample feature data;
step six: modeling the processed feature data with an algorithm and screening out the key peaks and their abundance data;
step seven: training a machine learning model in advance on accurately, manually labeled data;
and step eight: inputting the screened data into the pre-trained machine learning model, which outputs a corresponding prediction result to assist clinical diagnosis.
2. The method for building a database system according to claim 1, wherein,
in the first step, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
3. The method for building a database system according to claim 1, wherein,
in step one, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.
4. The method for building a database system according to claim 1, wherein in the second step, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.
5. The method of creating a database system according to claim 1, wherein in the second step, the data file that can be processed is a data file in mzml, txt, csv format.
6. The method for building a database system according to claim 1, wherein
the data conversion method in step three comprises the following steps:
s1, processing data by using a square root method;
s2, smoothing data by using a Savitzky-Golay method;
s3, correcting the data by using an SNIP method;
s4, calculating peak intensity.
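Steps s1 to s4 can be sketched in plain Python as follows. The claim names the methods but no parameters, so everything numeric here is an assumption: a 5-point quadratic Savitzky-Golay window, 20 SNIP iterations with an increasing clipping window, and a simple local-maximum rule for reading off peak intensities. SNIP is applied as a baseline estimate that is then subtracted. All function names are hypothetical.

```python
import math

# 5-point quadratic Savitzky-Golay smoothing weights (window and order assumed).
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def sqrt_transform(intensities):
    """s1: square-root transform; negative readings are clipped to zero."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]


def savgol_smooth(values, weights=SG5):
    """s2: Savitzky-Golay smoothing, repeating the edge values as padding."""
    half = len(weights) // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [
        sum(w * padded[i + k] for k, w in enumerate(weights))
        for i in range(len(values))
    ]


def snip_baseline(values, iterations=20):
    """s3 (part 1): SNIP baseline estimate by iterative peak clipping."""
    base = list(values)
    n = len(base)
    for p in range(1, iterations + 1):
        prev = base[:]
        for i in range(p, n - p):
            # Clip each point to the mean of its neighbours p steps away.
            base[i] = min(prev[i], (prev[i - p] + prev[i + p]) / 2)
    return base


def baseline_correct(values, iterations=20):
    """s3 (part 2): subtract the SNIP baseline, clipping at zero."""
    base = snip_baseline(values, iterations)
    return [max(v - b, 0.0) for v, b in zip(values, base)]


def peak_intensities(mz, values, min_height=0.0):
    """s4: report (m/z, intensity) at local maxima above min_height."""
    peaks = []
    for i in range(1, len(values) - 1):
        is_local_max = values[i] > values[i - 1] and values[i] >= values[i + 1]
        if values[i] > min_height and is_local_max:
            peaks.append((mz[i], values[i]))
    return peaks


def preprocess(mz, intensities, iterations=20, min_height=0.0):
    """Chain s1 to s4 for one spectrum."""
    y = sqrt_transform(intensities)
    y = savgol_smooth(y)
    y = baseline_correct(y, iterations)
    return peak_intensities(mz, y, min_height)
```

Dividing each reported intensity by the sum of all peak intensities in the spectrum would be one way to arrive at a relative abundance; the source does not state which normalization is intended.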
7. The method of building a database system according to claim 1, wherein the data cleaning in step five is selected from one or more of deleting anomalous data, filling in missing data, and feature screening.
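The three cleaning operations named in claim 7 might look like this in a minimal Python sketch. The concrete rules are assumptions, since the claim does not define them: a sample counts as anomalous when more than half of its peak abundances are missing, gaps are filled with the per-peak median, and a peak is screened out when its abundance is constant across samples. Missing values are represented as `None`, and all function names and thresholds are hypothetical.

```python
import statistics


def drop_anomalous_samples(rows, max_missing_fraction=0.5):
    """Delete samples in which too many peak abundances are missing (None)."""
    kept = []
    for row in rows:
        missing = sum(v is None for v in row)
        if missing / len(row) <= max_missing_fraction:
            kept.append(list(row))
    return kept


def fill_missing(rows):
    """Fill each remaining gap with the median of that peak across samples."""
    medians = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        medians.append(statistics.median(present) if present else 0.0)
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]


def screen_features(rows, min_variance=1e-8):
    """Keep only peaks whose abundance varies across samples.

    Returns the reduced rows and the indices of the peaks that were kept.
    """
    keep = [
        j for j, col in enumerate(zip(*rows))
        if statistics.pvariance(col) > min_variance
    ]
    return [[row[j] for j in keep] for row in rows], keep


def clean(rows):
    """Apply the three cleaning operations in sequence."""
    rows = drop_anomalous_samples(rows)
    rows = fill_missing(rows)
    return screen_features(rows)
```

Returning the kept peak indices alongside the cleaned rows makes it possible to apply the same column selection to samples that arrive later.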
8. The method of building a database system according to claim 1, wherein the algorithm in step six is a random forest, SVM, neural network or bayesian network.
9. A system for building a database, comprising the following modules:
the acquisition module is used for detecting a biological liquid biopsy sample by using an analytical instrument and acquiring the raw fid data file generated in the laboratory;
the data conversion module is used for converting the fid data file into a processable data file by using CompassXport software;
the data processing module is used for processing the processable data file by using a data conversion method to obtain an accurate relative abundance of each component;
the data analysis module is used for judging the QC of the experiment according to the sample characteristic data to determine whether the data requirement is met; if not, analyzing the cause and repeating the experiment; if the data requirement is met, cleaning the sample characteristic data;
the data modeling prediction module is used for carrying out data modeling prediction on the processed characteristic data by using an algorithm and screening out key peaks and their abundance data;
the computer training module is used for training a machine learning model in advance by using accurately manually labeled data;
and the database auxiliary treatment construction module is used for inputting the data obtained by screening into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.
10. The system for creating a database as claimed in claim 9, wherein,
in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.
11. The system for creating a database as claimed in claim 9, wherein,
in the acquisition module, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.
12. The system for creating a database as claimed in claim 9, wherein,
in the data conversion module, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.
13. The system for creating a database as claimed in claim 9, wherein in the data conversion module, the processable data file is a data file in mzML, txt or csv format.
14. The system for creating a database as recited in claim 9, wherein
the data conversion method in the data processing module comprises the following steps:
s1, processing data by using a square root method;
s2, smoothing data by using a Savitzky-Golay method;
s3, correcting the data by using an SNIP method;
s4, calculating peak intensity.
15. The system for creating a database as claimed in claim 9, wherein in the data analysis module, the data cleaning is selected from one or more of deleting anomalous data, filling in missing data, and feature screening.
16. The system for creating a database as claimed in claim 9, wherein in the data modeling prediction module, the algorithm is a random forest, SVM, neural network, or Bayesian network.
CN202111635764.4A 2021-12-28 2021-12-28 Database system establishment and application thereof Pending CN116364294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111635764.4A CN116364294A (en) 2021-12-28 2021-12-28 Database system establishment and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111635764.4A CN116364294A (en) 2021-12-28 2021-12-28 Database system establishment and application thereof

Publications (1)

Publication Number Publication Date
CN116364294A true CN116364294A (en) 2023-06-30

Family

ID=86928936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111635764.4A Pending CN116364294A (en) 2021-12-28 2021-12-28 Database system establishment and application thereof

Country Status (1)

Country Link
CN (1) CN116364294A (en)

Similar Documents

Publication Publication Date Title
López-Fernández et al. Mass-Up: an all-in-one open software application for MALDI-TOF mass spectrometry knowledge discovery
AU2002245043B2 (en) Method for analyzing mass spectra
US20020193950A1 (en) Method for analyzing mass spectra
AU2002241535A1 (en) Method for analyzing mass spectra
JP2006522340A (en) Analyzing mass spectrometry data
WO2004097581A2 (en) Computational method and system for mass spectral analysis
US20100036791A1 (en) Examination value predicting device using electrophoresis waveform, prediction method, and predicting program
CN114414704B (en) System, model and kit for evaluating malignancy degree or probability of thyroid nodule
JP2009505231A (en) System, method, and computer program for comparing and editing metabolite data obtained from a plurality of samples using a computer system database
Mantini et al. Independent component analysis for the extraction of reliable protein signal profiles from MALDI-TOF mass spectra
Sun et al. Recent advances in computational analysis of mass spectrometry for proteomic profiling
JP6179600B2 (en) Mass spectrometry data analyzer
CN116364294A (en) Database system establishment and application thereof
US10937525B2 (en) System that generates pharmacokinetic analyses of oligonucleotide total effects from full-scan mass spectra
Johann Jr et al. Novel approaches to visualization and data mining reveals diagnostic information in the low amplitude region of serum mass spectra from ovarian cancer patients
JP7207171B2 (en) SEARCH SUPPORT METHOD FOR MARKER SUBSTANCE, SEARCH SUPPORT PROGRAM AND SEARCH SUPPORT DEVICE
WO2006130368A2 (en) Iterative base peak framing of mass spectrometry data
Atlas et al. A statistical technique for monoisotopic peak detection in a mass spectrum
Sellers et al. Feature detection techniques for preprocessing proteomic data
CN115831369A (en) Method, device, equipment and medium for processing early screening data and constructing early screening model
Hartman et al. Peptimetric: Quantifying and visualizing differences in peptidomic data
Capelo et al. Prescriptomics: the next frontier in medicine
Guzzi et al. Database Community and Health Related Data: Experiences Through the Last Decade
Gullo et al. MaSDA: a system for analyzing mass spectrometry data
CN118553429A (en) Intelligent prediction model training and category prediction method for metabolic diseases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination