CN115221152A - Distributed node sharing method and system for biological sample database data

Info

Publication number: CN115221152A
Application number: CN202210840621.5A
Authority: CN
Inventors: 黄杰玞; 黄晓
Original assignee: Bioit Guangzhou Biological Information Technology Co ltd
Current assignee: Bioit Guangzhou Biological Information Technology Co ltd
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2022-10-21
Also published as: CN109635026A

Abstract

The invention discloses a distributed node sharing method and system for biological sample database data. The method comprises the following steps: synchronizing each biological sample library to a cloud server and merging them to obtain a merged database; standardizing the data in the merged database to obtain standardized data; synchronizing the standardized data to a cloud distributed node database; modeling the standardized data and performing feature engineering to automatically generate a medical sample analysis result; and visualizing the analysis result. By performing deep learning through modeling, the invention makes artificial intelligence processing of large-scale biological sample data from different institutions possible, reduces labor cost and the error rate of manual judgment, ensures the accuracy of the data, improves the quality of medical research work, and effectively ensures that every piece of data and every node in the biological sample library sharing system is publicly verifiable, traceable and highly transparent.

Description

Distributed node sharing method and system for biological sample database data

This application is a divisional application of the patent application entitled 'A distributed node sharing method, system and device for biological sample library data'. The filing date of the original application is November 29, 2018, and its application number is 201811447402.0.

Technical Field

The invention relates to the technical field of data processing, in particular to a distributed node sharing method and system for biological sample database data.

Background

A biological sample library, also called a biobank, performs standardized management of the collection, processing, storage and application of various biological samples and manages the information related to those samples, such as clinical information, patient follow-up information and sample quality management information. Over the last century, more and more biological sample libraries have been established, and they play an increasingly important role in genomics research and precision medicine research.

As biological sample libraries continue to develop, managing biological samples becomes increasingly difficult. The traditional manual management mode can hardly meet the management requirements of biological sample libraries, and the acquisition and processing of data also face great challenges. Large samples and big data are defining characteristics of modern life science research. Biological sample data comes from different medical institutions and scientific research institutions and is stored in various offline databases, forming many isolated data islands. The traditional way of transferring data depends on manual operation and manual interpretation, so data is easily lost, corrupted or rendered unusable, sharing cannot be achieved, and effective use of the biological samples is hindered.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide a distributed node sharing method and system for biological sample database data that can obtain biological sample information data from databases of various sources, such as medical institutions and scientific research institutions, while ensuring the authenticity, integrity and availability of that data. Through modeling and deep learning, the invention makes artificial intelligence processing of large-sample, big-data biological sample information possible, greatly reduces labor cost and the error rate of manual judgment, ensures the accuracy of the biological sample information data, enables that data to be shared, and promotes precision medicine research on large samples and big data. The invention performs pathological feature engineering through modeling and can automatically generate medical sample analysis results, which reduces the workload of medical researchers and greatly improves research quality and efficiency.

The technical scheme adopted by the invention is as follows:

a distributed node sharing method for biological sample library data comprises the following steps:

synchronizing each biological sample library into a cloud server, and merging to obtain a merged database;

standardizing the data in the merged database to obtain processed standardized data;

synchronizing the standardized data to a cloud distributed node database;

modeling the standardized data and performing feature engineering on it to obtain an analysis result;

carrying out visualization processing on the analysis result;

the standardized data are modeled based on the database type and the medical research purpose of the accessing user with respect to a disease: different targeted questions are formulated, and dedicated data variables are captured and processed for each targeted question; a classification algorithm is adopted during modeling, the classification algorithm comprising naive Bayes, the AdaBoost iterative algorithm or a support vector machine algorithm.
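As a rough illustration of the first step listed above (merging several sample libraries into one merged database), the following Python sketch renames each library's local fields to a shared schema and tags every record with its source. The field names, library names and the `merge_libraries` helper are all hypothetical; the patent does not specify a schema or a merge procedure.

```python
def merge_libraries(libraries, field_maps):
    """Merge records from several sample libraries into one list.

    libraries:  {library_name: [record_dict, ...]}
    field_maps: {library_name: {local_field: unified_field}}
    Both structures are assumptions made for this sketch.
    """
    merged = []
    for source, records in libraries.items():
        mapping = field_maps.get(source, {})
        for record in records:
            # Rename local fields to unified names; unmapped fields are kept as-is.
            unified = {mapping.get(key, key): value for key, value in record.items()}
            # Keep provenance so each merged row can be traced back to its library.
            unified["source_library"] = source
            merged.append(unified)
    return merged


# Hypothetical input: two libraries with different local field names.
libraries = {
    "hospital_a": [{"id": "A-001", "type": "serum"}],
    "institute_b": [{"sample_no": "B-17", "specimen": "tissue"}],
}
field_maps = {
    "hospital_a": {"id": "sample_id", "type": "sample_type"},
    "institute_b": {"sample_no": "sample_id", "specimen": "sample_type"},
}
merged_database = merge_libraries(libraries, field_maps)
```

Keeping a `source_library` field is one simple way to preserve traceability once records from different institutions sit in the same table.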

Optionally, the step of standardizing the data in the merged database to obtain the processed standardized data includes:

converting the data in the merged database into text data to obtain sample data, and importing the sample data into the back end for processing;

carrying out data cleaning processing on the sample data to obtain standardized data;

the sample data comprises fixed-value coded data and free text data;

the cleaning types of the fixed-value coded data are data anomalies and missing data;

when the cleaning type is missing data, if the object is random numerical data, the missing values are filled with the mean, the median, or the sum of the mean and a random standard deviation;

if the object is categorical data, the missing values are filled with a category assigned according to frequency of occurrence;

when the cleaning type is data anomaly, if the object is a unit-related numerical anomaly, unit conversion is performed on the object using naive Bayes and a binary decision tree;

if the object is outlier data, the outliers are eliminated using a kernel density estimation algorithm and principal component analysis;

the processing of the free text data comprises: first capturing preliminary keywords, creating a new variable column, and assigning preliminary codes to the new variable column.

Optionally, the step of performing data cleaning processing on the sample data to obtain standardized data specifically includes:

performing feature engineering on the sample data to obtain processed sample data;

and performing data cleaning processing in a corresponding mode according to the data type of the processed sample data and the cleaning type required to be performed.

In order to achieve the above object, the present invention further provides a distributed node sharing system for biological sample database data, including:

the merging unit is used for synchronizing all the biological sample libraries to the cloud server and merging the biological sample libraries to obtain a merged database;

the standardization unit is used for standardizing the data in the merged database to obtain processed standardized data;

the node synchronization unit is used for synchronizing the standardized data to the cloud distributed node database;

the analysis unit is used for modeling the standardized data and performing feature engineering on it to obtain an analysis result;

the visualization unit is used for performing visualization processing on the analysis result.

Optionally, the standardization unit specifically includes:

the conversion unit is used for converting the data in the merged database into text data to obtain sample data and importing the sample data into the back end for processing;

the cleaning unit is used for performing data cleaning on the sample data to obtain standardized data;

the sample data comprises fixed-value coded data and free text data;

when the cleaning type is data anomaly, if the object is a unit-related numerical anomaly, unit conversion is performed on the object using naive Bayes and a binary decision tree;

if the object is outlier data, the outliers are eliminated using a kernel density estimation algorithm and principal component analysis;

the processing of the free text data comprises: first capturing preliminary keywords, creating a new variable column, and assigning preliminary codes to the new variable column.

Optionally, the cleaning unit specifically includes:

the feature processing unit is used for performing unsupervised sample and pathological feature engineering on the sample data to obtain the processed sample data;

and the data clearing unit is used for performing data cleaning in a corresponding manner according to the data type of the processed sample data and the cleaning type required.

The invention has the beneficial effects that:

With the distributed node sharing method and system for biological sample library data, deep learning is performed through modeling, which makes artificial intelligence possible in the life science field, greatly reduces labor cost and the error rate of manual judgment, and ensures the accuracy of the data. Modeling can also be used to perform pathological feature engineering and automatically generate medical sample analysis results, which reduces the workload of medical researchers and greatly improves research quality and efficiency. Moreover, a blockchain distributed deployment mode is adopted, which effectively ensures that every piece of data and every node in the biological sample library sharing system is publicly verifiable, traceable and highly transparent.

Drawings

FIG. 1 is a flowchart illustrating steps of a distributed node sharing method for data of a biological sample database according to the present invention;

FIG. 2 is a block diagram of a distributed node sharing system for biological sample database data according to the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the accompanying drawings:

Referring to FIG. 1, the distributed node sharing method for biological sample database data of the present invention includes the following steps:

synchronizing each biological sample library to a cloud server, and merging to obtain a merged database; the merged objects are a plurality of biological sample banks disposed in different hospitals and research institutions;

and synchronizing the standardized data to a cloud distributed node database.

The biological sample data comprises sample type, collection site, collection time, freezing conditions, and analysis data produced for scientific research using new technical methods; the analysis data comprises sequencing data and proteomics data. The biological sample libraries located in hospitals and research institutions are synchronized to a cloud server and merged, which prepares unified data for subsequent data processing. This step is a key prerequisite for sharing across the biological sample library system: deploying the database nodes uniformly for data acquisition greatly improves database list consistency and data utilization, and reduces the space and time cost of subsequent data processing. The distributed node database resides in a blockchain, the distributed nodes are the nodes of that blockchain, and the distributed node database as a whole is deployed in the blockchain system.
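The patent does not describe the blockchain layer in detail. As a minimal sketch of the general idea only, the following Python code replicates each batch of standardized records to every node as a hash-linked block, so that any later modification of a stored record can be detected. The `Node` class, the record fields and the absence of any consensus protocol are simplifying assumptions for illustration, not the patented design.

```python
import copy
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(index, prev_hash, records):
    """Hash a block's contents deterministically with SHA-256."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "records": records},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Node:
    """One distributed node holding a full copy of the chain (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.chain = []

    def append(self, records):
        """Add a batch of records as a new block linked to the previous one."""
        prev_hash = self.chain[-1]["hash"] if self.chain else GENESIS_HASH
        index = len(self.chain)
        records = copy.deepcopy(records)  # each node stores its own copy
        self.chain.append({
            "index": index,
            "prev_hash": prev_hash,
            "records": records,
            "hash": block_hash(index, prev_hash, records),
        })

    def is_valid(self):
        """Recompute every hash and check each link to the previous block."""
        prev_hash = GENESIS_HASH
        for block in self.chain:
            expected = block_hash(block["index"], prev_hash, block["records"])
            if block["prev_hash"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


def synchronize(nodes, records):
    """Append the same batch of standardized records to every node."""
    for node in nodes:
        node.append(records)


nodes = [Node("hospital_a"), Node("institute_b")]
synchronize(nodes, [{"sample_id": "S1", "sample_type": "serum"}])
synchronize(nodes, [{"sample_id": "S2", "sample_type": "tissue"}])
```

Because every block's hash covers the previous block's hash, editing an old record on one node invalidates that node's chain while the other copies remain intact, which is the property behind the verifiability and traceability described above.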

Further as a preferred embodiment, the method further comprises the following steps:

and carrying out visualization processing on the analysis result.

In this embodiment, the standardized data is modeled based on the database type and the medical research purpose of the accessing user with respect to a disease: different targeted questions are formulated, and dedicated data variables are then captured and processed for those questions. In preliminary scheme preparation, because biological samples are associated with disease types, the most commonly used algorithms are classification algorithms such as naive Bayes, the AdaBoost iterative algorithm and support vector machines. After a series of preliminary models has been built, K-fold cross-validation is performed on the three models multiple times to find the one with the best accuracy. The invention performs deep learning feedback through modeling, which makes artificial intelligence possible in the life science field and greatly reduces labor cost and the error rate of manual judgment. Finally, the analysis results produced by statistical analysis are output visually according to user requirements, so that users can conveniently compare and check the results and present results used for research.
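The model selection step can be sketched as follows. This is only an illustration under stated assumptions: it implements K-fold cross-validation and a small Gaussian naive Bayes classifier in plain Python, and uses a trivial majority-class baseline as a stand-in for the other candidates, since AdaBoost and support vector machine implementations are omitted for brevity. The synthetic two-feature data set and all class and function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            params = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small floor keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                params.append((mean, var))
            self.stats[label] = (len(rows) / len(X), params)
        return self

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        best_label, best_score = None, -math.inf
        for label, (prior, params) in self.stats.items():
            score = math.log(prior)
            for value, (mean, var) in zip(row, params):
                score -= 0.5 * math.log(2 * math.pi * var)
                score -= (value - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class MajorityClass:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


def k_fold_accuracy(make_model, X, y, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        predictions = model.predict([X[i] for i in fold])
        scores.append(sum(p == y[i] for p, i in zip(predictions, fold)) / len(fold))
    return sum(scores) / k


# Synthetic, well-separated two-class data (purely illustrative).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = ["control"] * 50 + ["disease"] * 50

candidates = {"naive_bayes": GaussianNaiveBayes, "majority": MajorityClass}
scores = {name: k_fold_accuracy(cls, X, y) for name, cls in candidates.items()}
best_model = max(scores, key=scores.get)
```

Repeating `k_fold_accuracy` with several `seed` values and averaging would correspond to running the cross-validation multiple times, as the embodiment describes.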

As a preferred embodiment, the step of standardizing the data in the merged database to obtain the processed standardized data specifically includes:

and carrying out data cleaning processing on the sample data to obtain standardized data.

Further as a preferred embodiment, the step of performing data cleaning processing on the sample data to obtain standardized data specifically includes:

carrying out unsupervised sample and pathological feature engineering processing on the sample data to obtain processed sample data;

In the embodiment of the invention, the sample data can be divided into fixed-value coded data and free text data.

For fixed-value coded data, the cleaning types are data anomalies and missing data. The invention applies a corresponding algorithm to each type of variable.

When the cleaning type is missing data, if the object is random numerical data such as age or height, the missing values are filled with the mean, the median, or the sum of the mean and a random standard deviation.

If the object is categorical data such as blood type or gender, the missing values are filled with a category assigned according to frequency of occurrence.
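The filling rules for missing data can be sketched in Python as below. Two readings are assumptions made for this illustration: "the sum of the mean and a random standard deviation" is taken to mean the mean plus a random fraction (between -1 and 1) of one standard deviation, and filling by frequency of occurrence is taken to mean using the most frequent category. The function names and sample values are hypothetical.

```python
import random
import statistics
from collections import Counter


def impute_numeric(values, strategy="mean", seed=0):
    """Fill None entries of a numeric column using the chosen strategy."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    median = statistics.median(observed)
    std = statistics.stdev(observed)  # needs at least two observed values
    rng = random.Random(seed)
    filled = []
    for v in values:
        if v is not None:
            filled.append(v)
        elif strategy == "mean":
            filled.append(mean)
        elif strategy == "median":
            filled.append(median)
        elif strategy == "mean_plus_random_std":
            # Assumed reading: mean plus a random fraction of one standard deviation.
            filled.append(mean + rng.uniform(-1.0, 1.0) * std)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return filled


def impute_categorical(values):
    """Fill None entries of a categorical column with the most frequent category."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


ages = [34, None, 52, 41, None, 61]
blood_types = ["A", "O", None, "O", "B", None]

ages_mean = impute_numeric(ages, "mean")
ages_median = impute_numeric(ages, "median")
ages_noisy = impute_numeric(ages, "mean_plus_random_std")
blood_filled = impute_categorical(blood_types)
```

Adding a random component avoids stacking every filled value on a single point, which would otherwise shrink the variance of the column.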

When the cleaning type is data anomaly, if the object is a unit-related numerical anomaly, unit conversion is performed on the object using naive Bayes and a binary decision tree. If the object is outlier data, the outliers are eliminated using a kernel density estimation algorithm and principal component analysis.
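For the outlier case, a minimal one-dimensional sketch of density-based elimination is given below. It covers only the kernel density estimation part; the principal component analysis step, which would matter for multi-column data, is omitted. The Gaussian kernel, the Silverman rule-of-thumb bandwidth, the leave-one-out evaluation and the 10% density cutoff are all assumptions chosen for illustration, not values taken from the patent.

```python
import math
import statistics


def leave_one_out_density(i, data, bandwidth):
    """Gaussian kernel density at data[i], estimated from all other points."""
    others = data[:i] + data[i + 1:]
    norm = 1.0 / (len(others) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((data[i] - x) / bandwidth) ** 2) for x in others
    )


def remove_outliers(data, density_ratio=0.1):
    """Drop points whose density is below density_ratio times the peak density."""
    # Silverman's rule of thumb for the kernel bandwidth.
    bandwidth = 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)
    densities = [
        leave_one_out_density(i, data, bandwidth) for i in range(len(data))
    ]
    cutoff = density_ratio * max(densities)
    return [x for x, d in zip(data, densities) if d >= cutoff]


# Hypothetical height column in centimetres with one implausible entry.
heights_cm = [158, 162, 165, 167, 170, 171, 173, 175, 178, 181, 260]
cleaned = remove_outliers(heights_cm)
```

Evaluating each point against the others (leave-one-out) matters for small columns: otherwise a point's own kernel would give it a non-negligible density even when it is isolated.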

For free text data, this embodiment first captures preliminary keywords from the text, creates a new variable column, and assigns preliminary codes to that column. The specific code values depend on the code dictionary of the specific sample library.
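A simple way to picture this keyword capture and coding is the Python sketch below. The code dictionary, the field names and the use of code 0 for text with no recognized keyword are hypothetical; as the description notes, real code values come from each sample library's own dictionary.

```python
import re

# Hypothetical code dictionary; keys must be lowercase keywords.
CODE_DICTIONARY = {
    "diabetes": 1,
    "hypertension": 2,
    "carcinoma": 3,
}
UNCODED = 0  # assumed placeholder for text with no recognized keyword


def code_free_text(records, text_field, new_field, dictionary):
    """Add `new_field` holding the code of the first keyword found in `text_field`."""
    pattern = re.compile(
        "|".join(re.escape(keyword) for keyword in dictionary), re.IGNORECASE
    )
    coded = []
    for record in records:
        match = pattern.search(record.get(text_field) or "")
        code = dictionary[match.group(0).lower()] if match else UNCODED
        # Copy the record so the original rows are left unchanged.
        coded.append({**record, new_field: code})
    return coded


records = [
    {"sample_id": "S1", "diagnosis_note": "Patient with type 2 Diabetes, stable."},
    {"sample_id": "S2", "diagnosis_note": "Hepatocellular carcinoma, stage II"},
    {"sample_id": "S3", "diagnosis_note": "No relevant history"},
]
coded_records = code_free_text(
    records, "diagnosis_note", "diagnosis_code", CODE_DICTIONARY
)
```

The new `diagnosis_code` column then behaves like fixed-value coded data and can go through the same cleaning rules as the other coded variables.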

Referring to FIG. 2, the present invention provides a distributed node sharing system for biological sample library data, including:

and the node synchronization unit is used for synchronizing the standardized data to the cloud distributed node database.

Further, as a preferred embodiment, the system further includes:

the analysis unit is used for modeling the standardized data and performing feature engineering on it to obtain an analysis result;

and the visualization unit is used for performing visualization processing on the analysis result.

As a further preferred embodiment, the standardization unit specifically includes:

and the cleaning unit is used for cleaning the data of the sample data to obtain standardized data.

As a preferred embodiment, the cleaning unit specifically includes:

The specific embodiment of the invention uses innovative feature engineering algorithms, deep learning algorithms and other algorithms to standardize biological sample data from different sources and enhance its authenticity in the virtual cloud computing terminal, and in the subsequent steps uses the distributed deployment technology of the blockchain, so that the biological sample data and the clinical data of patients are impartial, secure and traceable, and data loss in cross-domain diagnosis and research is greatly reduced.

From the above, the invention can realize deep learning through modeling, which makes artificial intelligence possible in the life science field, greatly reduces labor cost and the error rate of manual judgment, and ensures the accuracy of the data. Modeling can also be used to perform pathological feature engineering and automatically generate medical sample analysis results, which reduces the workload of medical researchers and greatly improves research quality and efficiency. In addition, every piece of data and every node in the biological sample library sharing system can be effectively ensured to be publicly verifiable, traceable and highly transparent.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for sharing distributed nodes of biological sample library data is characterized by comprising the following steps:

synchronizing each biological sample library to a cloud server, and merging to obtain a merged database;

synchronizing the standardized data to a cloud distributed node database;

carrying out visualization processing on the analysis result;

the standardized data are modeled based on the database type and the purpose of the accessing user: different targeted questions are formulated, and dedicated data variables are captured and processed for each targeted question; a classification algorithm is adopted during modeling, the classification algorithm comprising a naive Bayes algorithm, an AdaBoost iterative algorithm or a support vector machine algorithm; and an unsupervised sample and pathological feature engineering algorithm is adopted for the feature engineering processing.

2. The distributed node sharing method for biological sample library data as claimed in claim 1, wherein: the step of standardizing the data in the merged database to obtain the processed standardized data specifically comprises:

the sample data is fixed value coding data and free text data;

3. The distributed node sharing method for biological sample library data as claimed in claim 2, wherein: the step of performing data cleaning processing on the sample data to obtain standardized data specifically comprises the following steps:

4. A distributed node sharing system for biological sample library data, comprising:

the node synchronization unit is used for synchronizing the standardized data to the cloud distributed node database;

the standardized data are modeled based on the database type and the purpose of the accessing user: different targeted questions are formulated, and dedicated data variables are captured and processed for each targeted question; a classification algorithm is adopted during modeling, the classification algorithm comprising a naive Bayes algorithm, an AdaBoost iterative algorithm or a support vector machine algorithm; and the feature engineering processing adopts an unsupervised sample and pathological feature engineering algorithm.

5. The distributed node sharing system for biological sample library data as claimed in claim 4, wherein: the standardization unit specifically comprises:

the sample data is fixed value coding data and free text data;

6. The system of claim 5, wherein: the cleaning unit specifically comprises: