CN116364294A

CN116364294A - Database system establishment and application thereof

Info

Publication number: CN116364294A
Application number: CN202111635764.4A
Authority: CN
Inventors: 傅博; 韩嘉宸
Original assignee: Shanghai Gurong Biotechnology Co ltd
Current assignee: Shanghai Gurong Biotechnology Co ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2023-06-30

Abstract

The invention relates to a method for establishing a database system, comprising the following steps: step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory; step two, converting the data file into a processable data file using software; step three, processing the processable data using a data conversion method; step four, performing QC on the experiment according to the sample characteristic data and judging whether the data requirements are met; step five, if the data requirements are met, cleaning the sample characteristic data; step six, performing data modeling and prediction on the processed characteristic data with an algorithm, and screening out key peaks and their abundance data; step seven, training a machine learning model in advance on accurately hand-labeled data; and step eight, inputting the screened data into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.

Description

Database system establishment and application thereof

Technical Field

The invention relates to a method for processing the detection-result data produced by an analytical instrument, such as a mass spectrometer, a gas chromatograph or a liquid chromatograph, in proteomic, peptidomic, metabolomic and similar analyses of a biological liquid biopsy sample.

Background

In the field of in vitro diagnostics, the progression of a disease is generally accompanied by abnormal levels of proteins, polypeptides and metabolites. For example, in the serum of cancer patients, some polypeptides persist at very low levels (e.g., all FPA fragments in patients with various cancers and three C3f fragments in breast cancer patients), while others are present at high levels (e.g., several C3f fragments in bladder and prostate cancers and one FPA fragment in breast cancer). From the perspective of data analysis, the various attributes of the detected objects must be acquired and normalized into feature vectors of equal length, and the feature vectors are then analyzed by computational means to identify markers of diseases such as tumors and coronary heart disease and to support diagnosis. In this process, the extraction of the markers is particularly important because it directly affects the accuracy of the diagnostic result.
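As a minimal sketch of what normalizing detection results into equal-length feature vectors can look like, the Python snippet below bins (m/z, intensity) pairs onto a fixed grid. The function name `bin_spectrum`, the m/z window and the bin count are illustrative assumptions; none of them is specified in this document.

```python
def bin_spectrum(peaks, mz_min, mz_max, n_bins):
    """Map (m/z, intensity) pairs onto a fixed-length vector.

    Each bin keeps the largest intensity observed inside it, so every
    sample yields a vector of exactly n_bins values.
    """
    width = (mz_max - mz_min) / n_bins
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if not (mz_min <= mz < mz_max):
            continue  # ignore signals outside the shared m/z window
        index = min(int((mz - mz_min) / width), n_bins - 1)
        vector[index] = max(vector[index], intensity)
    return vector


# Two samples with different numbers of detected peaks (made-up values).
sample_a = [(1020.4, 5.0), (1465.7, 12.5), (2021.1, 3.2)]
sample_b = [(1466.1, 8.0)]

vec_a = bin_spectrum(sample_a, 1000.0, 3000.0, 4)
vec_b = bin_spectrum(sample_b, 1000.0, 3000.0, 4)
print(len(vec_a) == len(vec_b))  # both vectors have the same length
```

Keeping the maximum per bin is only one possible choice; summing the intensities in each bin would be an equally simple alternative.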

Experimental data processing is a commonly used scientific computing method that is widely applied in production and scientific research, and it is an important tool for product design, quality management and scientific research. By analyzing the data obtained from detection methods such as spectroscopy, chromatography and mass spectrometry with dedicated computational means, markers of chronic diseases such as tumors, coronary heart disease, hypertension and diabetes can be identified rapidly.

Disclosure of Invention

In order to analyze the obtained experimental data through rigorous and accurate data processing, uncover the underlying patterns, and provide a basis for the diagnosis of chronic diseases, the invention provides a method for processing the detection-result data produced by a mass spectrometer, a gas chromatograph or a liquid chromatograph in proteomic, peptidomic, metabolomic and similar analyses of a biological liquid biopsy sample.

Specifically, the present invention includes the following embodiments.

1. A method for establishing a database system, comprising the following steps:

step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;

step two, converting the fid data file into a processable data file using CompassXport software;

step three, processing the processable data using a data conversion method to obtain the accurate relative abundance of each component;

step four, performing QC on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;

step five, if the data requirements are met, cleaning the sample characteristic data;

step six, performing data modeling and prediction on the processed characteristic data with an algorithm, and screening out key peaks and their abundance data;

step seven, training a machine learning model in advance on accurately hand-labeled data;

and step eight, inputting the screened data into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.

2. The method for building a database system according to claim 1, wherein,

in the first step, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.

3. The method for building a database system according to claim 1, wherein,

in step one, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.

4. The method for building a database system according to claim 1, wherein in the second step, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.

5. The method for building a database system according to claim 1, wherein in the second step, the processable data file is a data file in mzml, txt or csv format.

6. The method for building a database system according to claim 1,

the data conversion method in the third step comprises the following steps:

s1, processing data by using a square root method;

s2, smoothing data by using a Savitzky-Golay method;

s3, correcting the data by using an SNIP method;

s4, calculating peak intensity.

7. The method of building a database system according to claim 1, wherein the data cleansing in step five is selected from one or more of deleting anomalous data, data padding, and feature screening.

8. The method of building a database system according to claim 1, wherein the algorithm in step six is a random forest, SVM, neural network or bayesian network.

The invention also includes the following embodiments:

9. a system for building a database, comprising the following modules:

the acquisition module is used for detecting the biological liquid biopsy sample by using an analysis instrument and acquiring a fid original data file generated in a laboratory;

the data conversion module uses CompassXport software to convert the fid data file into a data file which can be processed;

the data processing module is used for processing the processable data by using a data conversion method so as to obtain accurate relative abundance of each component;

the data analysis module is used for analyzing and judging QC of the experiment according to the condition of the sample characteristic data, judging whether the data requirement is met or not, if not, analyzing the reason, and re-experiment; if the data requirements are met, cleaning the sample characteristic data;

the data modeling prediction module is used for carrying out data modeling prediction on the processed characteristic data by using an algorithm and screening out key peaks and abundance data thereof;

the computer training module is used for training a machine learning model by using data with accurate manual labels in advance;

and the database auxiliary treatment construction module inputs the data obtained by screening into a pre-trained machine learning model and gives out corresponding prediction results to assist clinical diagnosis.

10. The system for creating a database as claimed in claim 9, wherein,

in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.

11. The system for creating a database as claimed in claim 9, wherein,

in the collection module, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.

12. The system for creating a database as claimed in claim 9, wherein,

in the data conversion module, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.

13. The system for creating a database as claimed in claim 9, wherein in the data conversion module, the processable data file is a data file in mzml, txt or csv format.

14. The system for creating a database as recited in claim 9,

the data conversion method in the data processing module comprises the following steps:

s1, processing data by using a square root method;

s2, smoothing data by using a Savitzky-Golay method;

s3, correcting the data by using an SNIP method;

s4, calculating peak intensity.

15. The system for creating a database as claimed in claim 9, wherein in the data analysis module, the data cleansing is selected from one or more of deleting anomalous data, data padding, and feature screening.

16. The system for creating a database as claimed in claim 9, wherein in the data modeling prediction module, the algorithm is a random forest, SVM, neural network, or Bayesian network.

Drawings

FIG. 1 is a flow chart of the establishment of the database system of the present invention.

FIG. 2 is a spectrum obtained by detecting a serum sample from a healthy human subject under conventional conditions using a mass spectrometer.

FIG. 3 is a spectrum obtained by detecting a serum sample from a healthy human subject under conventional conditions using a spectrometer.

FIG. 4 is a spectrum obtained by detecting a urine sample from a healthy human subject under conventional conditions using a liquid chromatograph.

Detailed Description

One aspect of the invention relates to a method of establishing a database system, comprising the steps of:

step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory, wherein the analytical instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine and tissue fluid;

step two, converting the fid data file into a processable mzml, txt or csv data file using CompassXport software, preferably with a multi-process processing method to improve the data conversion efficiency;

step three, processing the mzml, txt, csv or other data using a data conversion method to obtain the accurate relative abundance of each component;

step five, if the data requirement is met, cleaning the sample characteristic data, for example by deleting abnormal data, filling data and screening features;

step six, performing data modeling and prediction on the processed characteristic data, mainly using algorithms such as random forests, SVMs, neural networks and Bayesian networks, and screening out key peaks and their abundance data.
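The multi-process conversion mentioned in step two might be organized as in the following sketch, which starts one external converter process per raw file from a pool of workers. The CompassXport command-line arguments are not given in this document, so `COMMAND_TEMPLATE` is a hypothetical placeholder that must be replaced with the arguments from the vendor documentation; the helper names and the output naming scheme are likewise assumptions.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# HYPOTHETICAL template: the real CompassXport arguments are not stated in
# the source and must be taken from the vendor documentation.
COMMAND_TEMPLATE = ["CompassXport", "{input}", "{output}"]


def build_command(fid_path, out_dir, template=COMMAND_TEMPLATE):
    """Fill the template for one raw file; return (command, output path)."""
    fid_path = Path(fid_path)
    # Assumption: name the output after the folder that holds the fid file.
    out_path = Path(out_dir) / (fid_path.parent.name + ".mzML")
    command = [part.format(input=str(fid_path), output=str(out_path))
               for part in template]
    return command, out_path


def convert_one(job):
    """Run one conversion in a worker process; return (output, success)."""
    command, out_path = job
    result = subprocess.run(command, capture_output=True)
    return str(out_path), result.returncode == 0


def convert_all(fid_paths, out_dir, workers=4):
    """Convert many raw files in parallel, one converter run per file."""
    jobs = [build_command(path, out_dir) for path in fid_paths]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, jobs))
```

A call such as `convert_all(["runs/sample01/fid", "runs/sample02/fid"], "converted")` would then return one `(output path, success)` pair per file. On platforms that spawn worker processes, the call should sit under an `if __name__ == "__main__":` guard.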

The data conversion method in the third step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the peak intensity.
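The first two conversion steps listed above (square root processing and Savitzky-Golay smoothing) can be illustrated with the short, dependency-free sketch below. The 5-point window is an assumption chosen for brevity, since the document does not state a window length or polynomial order; in practice a library routine such as `scipy.signal.savgol_filter` would typically be used instead.

```python
import math

# Standard 5-point Savitzky-Golay smoothing weights (quadratic/cubic fit).
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def sqrt_transform(intensities):
    """Square root transform; negative readings are clipped to zero first."""
    return [math.sqrt(max(value, 0.0)) for value in intensities]


def savgol5(values):
    """Smooth with a 5-point Savitzky-Golay window.

    The two points at each end are left unchanged (a simplification).
    """
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(w * v for w, v in zip(SG5, window))
    return smoothed


raw = [4.0, 9.0, 16.0, 400.0, 25.0, 9.0, 4.0]  # made-up intensities
processed = savgol5(sqrt_transform(raw))
```

The square root compresses large intensities more than small ones, and the smoothing filter then reduces point-to-point fluctuation while reproducing locally quadratic signal shapes exactly.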

Another aspect of the invention relates to a system for creating a database:

a system for building a database, comprising the following modules:

the acquisition module is used for detecting a biological liquid biopsy sample by using an analysis instrument to acquire a fid original data file generated by a laboratory, wherein the analysis instrument is preferably a mass spectrometer, a spectrometer or a chromatograph, and the biological liquid biopsy sample is preferably any one of serum, urine, tissue fluid, saliva, blood plasma and joint fluid;

the data conversion module converts the fid data file into a processable data file by using CompassXport software, preferably, the processable data file is in mzml, txt or csv format, and in order to improve the data conversion efficiency, preferably, a multi-process processing method is adopted;

the data processing module is used for processing the data by using a data conversion method so as to obtain accurate relative abundance of each component;

the data analysis module is used for analyzing and judging QC of the experiment according to the condition of the sample characteristic data, judging whether the data requirement is met or not, if not, analyzing the reason, and re-experiment; if the data requirement is met, carrying out data cleaning on the sample characteristic data, wherein the data cleaning is preferably selected from more than one of deleting abnormal data, filling data and screening characteristics;

the data modeling prediction module performs data modeling prediction on the processed characteristic data by using an algorithm, preferably mainly using the following algorithm: random forests, SVMs, neural networks, bayesian networks, etc., and screening out key peaks and their abundance data.

The data conversion method in the data processing module comprises the following steps:

s1, processing the data using a square root method to reduce data noise;

s2, smoothing the data using the Savitzky-Golay method to improve data regularity;

s3, correcting the baseline of the data using the SNIP method;

s4, calculating the peak intensity.
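The way the modules described above hand data to one another, including the QC decision in the data analysis module, could be sketched as follows. The QC rule (a minimum peak count and a minimum strongest-peak intensity) and all names are hypothetical; the document does not state the actual acceptance criteria.

```python
from dataclasses import dataclass


@dataclass
class QCResult:
    passed: bool
    reason: str = ""


def qc_check(peaks, min_peaks=3, min_max_intensity=10.0):
    """HYPOTHETICAL QC rule: enough peaks and a strong enough top peak."""
    if len(peaks) < min_peaks:
        return QCResult(False, f"only {len(peaks)} peaks detected")
    if max(intensity for _, intensity in peaks) < min_max_intensity:
        return QCResult(False, "signal too weak")
    return QCResult(True)


def run_pipeline(peaks, clean, screen, predict):
    """Chain the modules; stop and ask for a repeat when QC fails."""
    qc = qc_check(peaks)
    if not qc.passed:
        return {"status": "repeat experiment", "reason": qc.reason}
    features = screen(clean(peaks))
    return {"status": "ok", "prediction": predict(features)}


# Stand-in callables for the cleaning, screening and prediction modules.
good = [(1000.0, 50.0), (1500.0, 20.0), (2000.0, 5.0)]
outcome = run_pipeline(
    good,
    clean=lambda p: p,
    screen=lambda p: p[:2],
    predict=lambda f: "class A" if len(f) == 2 else "unknown",
)
```

Passing the cleaning, screening and prediction stages in as callables keeps the QC gate independent of whichever algorithm is chosen for the later modules.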

Examples

Example 1

Processing of mass-to-charge ratio data obtained by detecting a serum sample with a mass spectrometer:

step one, detecting the serum sample with a mass spectrometer to obtain mass-to-charge ratio data of each ionized component, then obtaining the raw fid data file generated by the mass spectrometer and converting it into mzml format data using CompassXport;

step two, reading the mzml data and performing baseline-removal conversion with a data conversion method to obtain the accurate relative ion abundance of each component;

step three, performing QC on the experiment according to the obtained data and judging whether the data requirement is met; if not, analyzing the cause and repeating the experiment;

step four, cleaning the data, applying a random forest algorithm according to the sample type, and screening out key peaks and their ion abundance data in combination with earlier model training data;

step five, training a machine learning model in advance on accurately hand-labeled data;

step six, inputting the screened data into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.
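Steps four to six of this example (random forest screening of key peaks, training on labeled data, and prediction) might look like the sketch below. It assumes scikit-learn and NumPy are available and uses purely synthetic data in which only one peak differs between the two classes; the sample counts, the number of retained peaks and the model settings are illustrative and are not taken from the document.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic abundance matrix: rows are samples, columns are peaks.
rng = np.random.default_rng(0)
n_samples, n_peaks = 120, 6
X = rng.normal(loc=10.0, scale=1.0, size=(n_samples, n_peaks))
y = rng.integers(0, 2, size=n_samples)   # stand-in for manual labels
X[:, 2] += 5.0 * y                       # only peak 2 separates the classes

# Screening: rank peaks by random forest feature importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
key_peaks = ranking[:2]                  # keep the two top-ranked peaks

# Training on the screened peaks, then predicting a new sample.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X[:, key_peaks], y)

new_sample = np.full((1, n_peaks), 10.0)
new_sample[0, 2] = 15.0                  # elevated abundance at peak 2
prediction = model.predict(new_sample[:, key_peaks])
```

With real data, the screening and the final model would be evaluated on samples held out from training, which this sketch omits for brevity.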

The data conversion method in the second step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the peak intensity.
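The last two conversion steps (SNIP correction and peak intensity calculation) can be sketched in plain Python as follows. The snippet uses the basic SNIP formulation with an increasing clipping window and a simple local-maximum peak picker; the iteration count, the threshold and the function names are assumptions, since the document does not give these details.

```python
def snip_baseline(values, iterations=20):
    """Estimate the baseline by iterative peak clipping (basic SNIP)."""
    baseline = list(values)
    n = len(baseline)
    for p in range(1, iterations + 1):
        previous = baseline[:]
        for i in range(p, n - p):
            neighbour_mean = (previous[i - p] + previous[i + p]) / 2
            baseline[i] = min(previous[i], neighbour_mean)
    return baseline


def subtract_baseline(values, baseline):
    """Remove the estimated baseline; results are floored at zero."""
    return [max(v - b, 0.0) for v, b in zip(values, baseline)]


def pick_peaks(mz, corrected, min_intensity=0.0):
    """Return (m/z, intensity) for local maxima above a threshold."""
    peaks = []
    for i in range(1, len(corrected) - 1):
        rising = corrected[i] > corrected[i - 1]
        not_exceeded = corrected[i] >= corrected[i + 1]
        if rising and not_exceeded and corrected[i] > min_intensity:
            peaks.append((mz[i], corrected[i]))
    return peaks


# Made-up trace: constant offset of 2.0 with one peak on top of it.
mz = [1000.0 + i for i in range(15)]
signal = [2.0] * 6 + [6.0, 10.0, 6.0] + [2.0] * 6

baseline = snip_baseline(signal, iterations=4)
corrected = subtract_baseline(signal, baseline)
peaks = pick_peaks(mz, corrected)
```

The number of iterations controls the widest feature that is treated as a peak rather than as baseline, so it is normally chosen in relation to the expected peak width.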

Example 2

Processing of detection-result data obtained by detecting a serum sample with a spectrometer:

step one, detecting a serum sample with a spectrometer and obtaining the raw fid data file generated in the laboratory;

step two, converting the fid data file into a processable txt data file using CompassXport software;

step three, processing the txt data using a data conversion method to obtain the accurate relative abundance of each component;

step five, if the data requirement is met, deleting abnormal data from the sample characteristic data;

step six, performing data modeling and prediction on the processed characteristic data using an SVM algorithm, and screening out key peaks and their abundance data.

The data conversion method in the third step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the peak intensity.
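The deletion of abnormal data in step five of this example is not specified further in the document. One possible reading, shown here purely as an assumption, is to drop samples whose total abundance is a robust outlier, using the modified z-score based on the median absolute deviation (MAD) with the commonly used cut-off of 3.5.

```python
from statistics import median


def drop_abnormal_samples(samples, threshold=3.5):
    """Remove samples whose total abundance is a robust outlier.

    `samples` maps a sample id to its list of peak abundances.
    The MAD-based rule and the threshold are assumptions, not the
    document's method.
    """
    totals = {sid: sum(values) for sid, values in samples.items()}
    centre = median(totals.values())
    mad = median(abs(total - centre) for total in totals.values())
    if mad == 0:
        return dict(samples)  # no spread, so nothing can be flagged
    kept = {}
    for sid, values in samples.items():
        score = 0.6745 * abs(totals[sid] - centre) / mad
        if score <= threshold:
            kept[sid] = values
    return kept


samples = {
    "s1": [10, 12, 11],
    "s2": [11, 12, 10],
    "s3": [9, 13, 12],
    "s4": [10, 11, 13],
    "s5": [100, 120, 110],  # made-up abnormal run
}
cleaned = drop_abnormal_samples(samples)
```

A median-based rule is used here because a single extreme sample would distort a mean and standard deviation computed from the same small batch.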

Example 3

Processing of mass-to-charge ratio data obtained by detecting a urine sample with a mass spectrometer:

step one, detecting the urine sample with a mass spectrometer to obtain mass-to-charge ratio data of each ionized component, then obtaining the raw fid data file generated by the mass spectrometer and converting it into mzml format data using CompassXport;

step two, processing the mzml data using a data conversion method to obtain the accurate relative ion abundance of each component;

step four, if the data requirement is met, deleting abnormal data from the sample characteristic data and filling in missing data;

step five, performing data modeling and prediction on the processed characteristic data using a Bayesian network algorithm, and screening out key peaks and their ion abundance data;

step six, training a machine learning model in advance on accurately hand-labeled data;

step seven, inputting the screened data into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.

The data conversion method in the second step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the peak intensity.
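The data filling mentioned in this example could, for instance, replace a missing abundance with the median of the same peak across the other samples. The document does not say which filling rule is used, so the median rule, the use of `None` for missing values and the function name are all assumptions in this sketch.

```python
from statistics import median


def fill_missing(rows):
    """Replace None entries with the per-peak median across samples.

    `rows` is a list of samples, each a list of peak abundances in the
    same peak order. A peak with no observed value at all is filled
    with 0.0 (an arbitrary choice for this sketch).
    """
    n_features = len(rows[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        medians.append(median(observed) if observed else 0.0)
    return [
        [medians[j] if row[j] is None else row[j] for j in range(n_features)]
        for row in rows
    ]


table = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 9.0],
]
filled = fill_missing(table)
```

After filling, every sample has a value for every peak, which the modeling algorithms named in the document generally require.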

Example 4

Processing of the detection-result data for proteins, polypeptides and metabolites obtained by detecting a urine sample with a liquid chromatograph:

step one, detecting the urine sample with a liquid chromatograph to obtain detection data of each component, then obtaining the raw fid data file generated by the liquid chromatograph and converting it into txt format using CompassXport;

step two, processing the txt data using a data conversion method to obtain the accurate relative abundance of each component;

step four, if the data requirements are met, deleting abnormal data from the sample characteristic data and screening features;

step five, performing data modeling and prediction on the processed characteristic data using a random forest algorithm, and screening out key peaks and their abundance data.

The data conversion method in the second step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the relative intensity of each substance.
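The relative intensity calculation in the final conversion step can be illustrated as below. Two common conventions are shown: scaling to the strongest signal (set to 100) and scaling to the summed signal. Which convention the document intends is not stated, so both options and the function name are assumptions.

```python
def relative_abundance(intensities, reference="base_peak"):
    """Express intensities as percentages of a reference value.

    reference="base_peak": strongest signal becomes 100.
    reference="total":     values sum to 100.
    """
    if reference == "base_peak":
        scale = max(intensities)
    elif reference == "total":
        scale = sum(intensities)
    else:
        raise ValueError("reference must be 'base_peak' or 'total'")
    if scale <= 0:
        raise ValueError("no positive signal to normalise against")
    return [100.0 * value / scale for value in intensities]


by_base_peak = relative_abundance([50.0, 200.0, 100.0])
by_total = relative_abundance([10.0, 30.0, 60.0], reference="total")
```

Either scaling makes abundances comparable between runs whose absolute signal levels differ.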

Example 5

Processing of detection-result data obtained by detecting a saliva sample with a spectrometer:

step one, detecting a saliva sample with a spectrometer and obtaining the raw fid data file generated in the laboratory;

step three, processing the txt data using a data conversion method to obtain the accurate relative abundance of each component;

step five, if the data requirement is met, deleting abnormal data from the sample characteristic data;

step six, performing data modeling and prediction on the processed characteristic data using a neural network algorithm, and screening out key peaks and their abundance data.

The data conversion method in the third step comprises the following steps:

1. processing the data using a square root method to reduce data noise;

2. smoothing the data using the Savitzky-Golay method to improve data regularity;

3. correcting the baseline of the data using the SNIP method;

4. calculating the relative intensity of each substance.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims

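1. A method for establishing a database system, comprising the following steps:

step one, detecting a biological liquid biopsy sample with an analytical instrument and obtaining the raw fid data file generated in the laboratory;

step two, converting the fid data file into a processable data file using CompassXport software;

step three, processing the processable data using a data conversion method to obtain the accurate relative abundance of each component;

step four, performing QC on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment;

step five, if the data requirements are met, cleaning the sample characteristic data;

step six, performing data modeling and prediction on the processed characteristic data with an algorithm, and screening out key peaks and their abundance data;

step seven, training a machine learning model in advance on accurately hand-labeled data;

and step eight, inputting the screened data into the pre-trained machine learning model and outputting a corresponding prediction result to assist clinical diagnosis.

2. The method for building a database system according to claim 1, wherein,

in the first step, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.

3. The method for building a database system according to claim 1, wherein,

in step one, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.

4. The method for building a database system according to claim 1, wherein in the second step, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.

5. The method for building a database system according to claim 1, wherein in the second step, the processable data file is a data file in mzml, txt or csv format.

6. The method for building a database system according to claim 1,

the data conversion method in the third step comprises the following steps:

s1, processing data by using a square root method;

s2, smoothing data by using a Savitzky-Golay method;

s3, correcting the data by using an SNIP method;

s4, calculating peak intensity.

7. The method for building a database system according to claim 1, wherein the data cleansing in step five is selected from one or more of deleting anomalous data, data padding, and feature screening.

8. The method for building a database system according to claim 1, wherein the algorithm in step six is a random forest, SVM, neural network or Bayesian network.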

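9. A system for building a database, comprising the following modules:

the acquisition module is used for detecting the biological liquid biopsy sample by using an analysis instrument and acquiring the raw fid data file generated in the laboratory;

the data conversion module uses CompassXport software to convert the fid data file into a processable data file;

the data processing module is used for processing the processable data by using a data conversion method so as to obtain the accurate relative abundance of each component;

the data analysis module is used for performing QC on the experiment according to the sample characteristic data and judging whether the data requirements are met; if not, analyzing the cause and repeating the experiment; if the data requirements are met, cleaning the sample characteristic data;

the data modeling prediction module is used for performing data modeling and prediction on the processed characteristic data by using an algorithm and screening out key peaks and their abundance data;

the computer training module is used for training a machine learning model in advance on accurately hand-labeled data;

and the database auxiliary treatment construction module inputs the screened data into the pre-trained machine learning model and outputs a corresponding prediction result to assist clinical diagnosis.

10. The system for creating a database as claimed in claim 9, wherein,

in the acquisition module, the analytical instrument is a mass spectrometer, a gas chromatograph or a liquid chromatograph.

11. The system for creating a database as claimed in claim 9, wherein,

in the acquisition module, the biological fluid biopsy sample is selected from any one of serum, urine, interstitial fluid, saliva, plasma and joint fluid.

12. The system for creating a database as claimed in claim 9, wherein,

in the data conversion module, in order to improve the efficiency of data conversion, a multi-process processing method is adopted.

13. The system for creating a database as claimed in claim 9, wherein in the data conversion module, the processable data file is a data file in mzml, txt or csv format.

14. The system for creating a database as recited in claim 9,

the data conversion method in the data processing module comprises the following steps:

s1, processing data by using a square root method;

s2, smoothing data by using a Savitzky-Golay method;

s3, correcting the data by using an SNIP method;

s4, calculating peak intensity.

15. The system for creating a database as claimed in claim 9, wherein in the data analysis module, the data cleansing is selected from one or more of deleting anomalous data, data padding, and feature screening.

16. The system for creating a database as claimed in claim 9, wherein in the data modeling prediction module, the algorithm is a random forest, SVM, neural network, or Bayesian network.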