CN114530205A

CN114530205A - Organ chip database vectorization scheme for artificial intelligence algorithm

Info

Publication number: CN114530205A
Application number: CN202110986435.8A
Authority: CN
Inventors: 马欣; 林文斌
Original assignee: Tianjin Polytechnic University
Current assignee: Tianjin Polytechnic University
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2022-05-24

Abstract

This patent describes a method for vectorizing an organ chip database. The database stores biological scaffold materials, reagents, cell lines, drugs, organ chip models, organ chip configuration parameters (reagent and drug concentrations, cell types, and the like), time information, and experimental results (cell metabolite concentrations, cell count and survival rate, pH of the microenvironment in the chip, temperature, oxygen concentration, carbon dioxide concentration, transepithelial electrical resistance (TEER), air pressure, whether drugs are added, drug release rate, and degradation rate). The experimental results serve as label data: training on the data yields the weight matrix of a deep learning model, and once the input information is supplied the model can predict the label data automatically. Before the data can be fed to the model, its format must be converted, because the types and formats of the data stored in the organ chip database are not uniform. Text, numeric, and even image information all occur, and each must be converted into vectors that a machine learning algorithm can process. This patent addresses that problem: how to vectorize organ chip data.

Description

Organ chip database vectorization scheme for artificial intelligence algorithm

Technical Field

The invention lies at the intersection of biomedical engineering and computer science and technology, and provides a vectorization method suitable for an organ chip database. The database stores biological scaffold materials, reagents, cell lines, drugs, organ chip models, organ chip configuration parameters (reagent and drug concentrations, cell types, and the like), time information, and experimental results (cell metabolite concentrations, cell count and survival rate, pH of the microenvironment in the chip, temperature, oxygen concentration, carbon dioxide concentration, TEER, air pressure, whether drugs are added, drug release rate, and degradation rate). The experimental results serve as label data: training on the data yields the weight matrix of a deep learning model, and once the input information is supplied the model can predict the label data automatically. Before the data can be fed to the model, its format must be converted, because the types and formats of the data stored in the organ chip database are not uniform. Text, numeric, and even image information all occur, and each must be converted into vectors that a machine learning algorithm can process. This patent addresses that problem: how to vectorize organ chip data.

Background

An organ chip is a physiological organ microsystem built on a chip. With a microfluidic chip at its core, it combines methods from cell biology, biomaterials, engineering, and other fields to construct an in vitro tissue and organ microenvironment that simulates and reflects the main structural and functional characteristics of human tissues and organs. Such a model can reproduce the physiological and pathological activity of human organs in vitro with good fidelity, lets researchers observe and study biological behavior in ways that were not previously possible, and can be used to predict the human body's response to drugs or other external stimuli. It therefore has broad application value in life science research, disease modeling, new drug development, and related fields.

Culturing organ chips and running experiments on them generates a large amount of experimental data. In previous research, however, the associations within the data have not been analyzed carefully. In particular, data are not shared between different organ chip experiments, so associations across experiments go unexamined. Attention is paid only to the final results, and data produced during the experiment, especially dynamic data, are lost. Researchers attend only to their own experimental data; they lack the time, energy, and tools to find similar experiments done by others and to compare their own design parameters with them. A suitable data analysis method is therefore needed to analyze and model the data. Before any such analysis, the data in the organ chip database must be converted into vectors so that they can be input into an artificial intelligence model. This patent aims to provide a solution to this problem.

Disclosure of Invention

The purpose of the invention is as follows:

Before an artificial intelligence model is built, the data in the organ chip database must be converted into vectors. The actual data (text and non-text information such as the names, molecular formulas, and components of biological scaffold materials and pharmaceutical agents) must be converted into numeric information that a deep learning model can interpret and compute on. This numeric information is an encoded representation in numeric format, which facilitates computation by the artificial intelligence model.

The organ chip database should contain the following data tables: a drug information table (storing the drug name, molecular formula, two-dimensional and three-dimensional structural formulas, target protein, SMILES expression, MOL2VEC code, and so on; the target protein information requires a protein data table and a target data table to express drug-target interaction, DTI for short); a cell information table (storing the cell line name, source, gene sequence, GENE2VEC code, and so on); a biological scaffold material information table (storing the molecular formula, structural formula, coded expression, and so on); a biological reagent information table (storing the components, ratios, concentrations, chemical formula, structural formula, code, and so on); an organ chip model table (storing the chip model ID, the organ chip type as an enumerated variable, the developer, the organization, the article name and link, the official website introduction link, a description of the chip structure and components, a description of the working principle, the WORD2VEC code, and so on); an organ chip parameter configuration table (storing the parameter configuration ID, which links this table to the experimental result table so that one row of parameter configuration corresponds to multiple time-stamped rows of experimental results, together with the organ chip model ID, the preparation information for one or more drugs, the biological reagent preparation information, the scaffold material preparation information, the cell lines used, and so on); and an experimental result table with time information (storing cell metabolite concentration, cell count and survival rate, pH, temperature, oxygen concentration, carbon dioxide concentration, TEER, air pressure, whether drugs are added, drug release rate, degradation rate, and so on).

The information in these tables may bear a direct or indirect relation to the experimental result data. Artificial intelligence methods can exploit these relations for large-scale data learning and pattern recognition, and in turn for predicting experimental results.

(1) For the drug information table (storing the drug name, molecular formula, two-dimensional and three-dimensional structural formulas, target protein, SMILES expression, MOL2VEC code, and so on, where the target protein information requires a protein data table and a target data table to express drug-target interaction, DTI for short), the drug molecular formula can be converted into fingerprint information with the Morgan algorithm. Because the resulting fingerprint has a very large number of digits, it can be passed through a second, trained model to obtain a vector, for example a BERT model. Alternatively, the drug molecular formula can be converted directly into a vector with the Mol2Vec algorithm. The resulting numeric string can be stored directly in the drug information table. The amino acid sequence of the target protein can be vectorized with the PSSM (position-specific scoring matrix) method. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with One-hot.
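
The two generic encodings named above, normalization to the range 0 to 1 and One-hot, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the example columns, and the example values are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels: Sequence[str]) -> Dict[str, List[int]]:
    """Map each distinct label to a binary vector containing a single 1."""
    vocabulary = sorted(set(labels))
    index = {label: i for i, label in enumerate(vocabulary)}
    return {
        label: [1 if i == index[label] else 0 for i in range(len(vocabulary))]
        for label in vocabulary
    }


# Hypothetical example values, not taken from the patent.
molecular_weights = [180.16, 151.16, 206.28]
routes = ["oral", "injection", "oral"]

scaled = min_max_normalize(molecular_weights)
codes = one_hot_encode(routes)
print(scaled)
print([codes[r] for r in routes])
```

In practice the minimum and maximum would be computed once on the training data and reused for new rows, so that the same raw value always maps to the same normalized value.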

The PSSM representation is obtained as follows:

First, find the FASTA sequence of the protein and the FASTA sequences of its homologous proteins, and align them row by row so that corresponding positions fall in the same column. Second, count how often each amino acid occurs at each position across the aligned sequences to obtain the position frequency matrix (PFM), an L × 20 matrix, where 20 is the number of standard amino acids and L is the length of the protein sequence. Third, normalize this matrix to obtain the position probability matrix (PPM). Fourth, compute the PSSM from the PPM according to the scoring formula; the PSSM is also an L × 20 matrix. The matrix represents both the protein itself and the potential for the amino acid at each position to mutate into each of the other amino acids. The amino acid with the largest value in each row is the residue at that position, and reading these residues off row by row gives the protein represented by the matrix. Each element indicates how likely the amino acid at that position is to be replaced by the amino acid of that column: the larger the value, the more likely the substitution.
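
The four steps can be sketched in plain Python as follows. The text does not state the scoring formula, so this sketch assumes the common log-odds form, log2(p / b), with a uniform background frequency b = 1/20 and a pseudocount of 1 to avoid taking the logarithm of zero. The function names and the toy alignment are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
import math
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned: Sequence[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return an L x 20 scoring matrix as a list of per-position rows."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned[0])):
        # Count each amino acid at this position (one row of the count matrix).
        counts = {aa: 0 for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[pos] in counts:  # gaps and non-standard letters are skipped
                counts[seq[pos]] += 1
        # Normalize the counts to probabilities, smoothed by the pseudocount.
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        # Convert each probability to a log-odds score against the background.
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm


def consensus(pssm: List[Dict[str, float]]) -> str:
    """Read off the highest-scoring amino acid in each row."""
    return "".join(max(row, key=row.get) for row in pssm)


# Hypothetical toy alignment of a protein and two homologs.
alignment = ["MKV", "MKI", "MRV"]
matrix = build_pssm(alignment)
print(len(matrix), len(matrix[0]))
print(consensus(matrix))
```

Production pipelines usually obtain the alignment and the matrix from a dedicated tool such as PSI-BLAST rather than computing them by hand; the sketch only illustrates the arithmetic.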

(2) For the cell information table (storing the cell line name, source, gene sequence, GENE2VEC code, and so on), the cell gene sequence can be vectorized with the Gene2Vec method and the result stored in the cell information table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with One-hot.

(3) For the biological scaffold material information table (storing the molecular formula, structural formula, coded expression, and so on), the molecular formula and structural formula can be vectorized with the Mol2Vec method and the result stored in the scaffold material information table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

(4) For the biological reagent information table (storing the components, ratios, concentrations, chemical formula, structural formula, code, and so on), the chemical formula can be vectorized with the Mol2Vec method and the result stored in the reagent information table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

(5) For the organ chip model table (storing the chip model ID, the organ chip type as an enumerated variable, the developer, the organization, the article name and link, the official website introduction link, a description of the chip structure and components, a description of the working principle, the WORD2VEC code, and so on), some fields, such as the official website link, the developer, and the article name, are irrelevant to predicting experimental results. These fields are not input into the artificial intelligence model and therefore need no vectorization. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec.
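
Excluding the irrelevant fields before vectorization can be sketched as a simple filter. The column names and the example row below are hypothetical; only the three fields named explicitly in the text are excluded here, and further fields could be added to the set in the same way.

```python
from typing import Any, Dict

# Hypothetical column names for the fields the text names as irrelevant.
EXCLUDED_FIELDS = {"official_site_link", "developer", "article_name"}


def select_model_inputs(row: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields that will be vectorized and fed to the model."""
    return {name: value for name, value in row.items() if name not in EXCLUDED_FIELDS}


# Hypothetical example row, not taken from the patent.
chip_row = {
    "model_id": 7,
    "chip_type": "liver",
    "developer": "Example Lab",
    "article_name": "Example article",
    "official_site_link": "https://example.org/chip",
    "working_principle": "perfused microchannel with a porous membrane",
}

inputs = select_model_inputs(chip_row)
print(sorted(inputs))
```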

(6) For the organ chip parameter configuration table (storing the parameter configuration ID, which links this table to the experimental result table so that one row of parameter configuration corresponds to multiple time-stamped rows of experimental results, together with the organ chip model ID, the preparation information for one or more drugs, the biological reagent preparation information, the scaffold material preparation information, the cell lines used, and so on), numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.
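
The one-to-many link between a parameter configuration row and its time-stamped result rows can be sketched as a join that yields one training sample per result row. The field names (`config_id`, `time_h`, and the rest) and the example values are hypothetical; the sketch assumes that the configuration fields plus the time value form the model input and that the remaining result fields form the labels.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def build_samples(configs: List[Row], results: List[Row]) -> List[Tuple[Row, Row]]:
    """Pair every result row with its configuration as (features, labels)."""
    by_id = {config["config_id"]: config for config in configs}
    samples = []
    for result in results:
        config = by_id.get(result["config_id"])
        if config is None:
            continue  # skip result rows whose configuration is missing
        features = {k: v for k, v in config.items() if k != "config_id"}
        features["time_h"] = result["time_h"]
        labels = {k: v for k, v in result.items() if k not in ("config_id", "time_h")}
        samples.append((features, labels))
    return samples


# Hypothetical example rows, not taken from the patent.
configs = [{"config_id": 1, "chip_model_id": 7, "drug_conc_um": 10.0, "cell_line": "HepG2"}]
results = [
    {"config_id": 1, "time_h": 0.0, "ph": 7.4, "survival_rate": 0.98},
    {"config_id": 1, "time_h": 24.0, "ph": 7.2, "survival_rate": 0.91},
    {"config_id": 2, "time_h": 0.0, "ph": 7.4, "survival_rate": 0.99},
]

samples = build_samples(configs, results)
for features, labels in samples:
    print(features, labels)
```

Each sample would still need the numeric and text encodings described above before it can be passed to a model.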

(7) For the experimental result table with time information (storing cell metabolite concentration, cell count and survival rate, pH of the microenvironment in the chip, temperature, oxygen concentration, carbon dioxide concentration, TEER, air pressure, whether drugs are added, drug release rate, degradation rate, and so on), numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot. If the data set available for training the model is small, it is recommended to convert the numeric experimental results into graded classes. For example, an oxygen content below 19.5% is class 1, an oxygen content between 19.5% and 24% is class 2, and an oxygen content above 24% is class 3.
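
The oxygen content example can be sketched as a small binning function. The text does not say which class the boundary values 19.5% and 24% themselves belong to; this sketch assumes both fall into class 2. The function name is hypothetical.

```python
def oxygen_class(oxygen_percent: float) -> int:
    """Map an oxygen content in percent to class 1, 2, or 3."""
    if oxygen_percent < 19.5:
        return 1
    if oxygen_percent <= 24.0:
        # Assumption: the boundary values 19.5 and 24 are assigned to class 2.
        return 2
    return 3


# Hypothetical readings, not taken from the patent.
readings = [18.0, 21.0, 30.0]
print([oxygen_class(r) for r in readings])
```

Converting a continuous label into a few classes turns the prediction task from regression into classification, which typically needs fewer examples to train.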

The advantages of the invention are:

(1) it solves the problem of vectorizing the organ chip database;

(2) it applies several vectorized representation methods to provide effective data for computation by the artificial intelligence model.

Detailed Description

(4) For the biological reagent information table (storing the components, ratios, concentrations, chemical formula, structural formula, code, and so on), the chemical formula can be vectorized with the Mol2Vec method and the result stored in the reagent information table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

The above are merely representative examples of the many specific applications of the present invention and do not limit its scope of protection in any way. All technical solutions formed by modification or equivalent substitution fall within the scope of protection of the present invention.

Claims

1. An organ chip database vectorization scheme for artificial intelligence algorithms, characterized in that: the data in the organ chip database are converted into vectors, whereby the actual data (text and non-text information such as the names, molecular formulas, and components of biological scaffold materials and pharmaceutical agents) are converted into numeric information that a deep learning model can interpret and compute on, the numeric information being an encoded representation in numeric format that facilitates computation by the artificial intelligence model.

2. The organ chip database vectorization scheme according to claim 1, wherein the data mainly comprise: a data table related to drug information, a data table related to cell information, a data table related to biological scaffold material information, a data table related to biological reagent information, a data table related to the organ chip model, an organ chip parameter configuration data table, and an experimental result data table with time information.

3. The organ chip database vectorization scheme for artificial intelligence algorithms according to claim 2, wherein, for the data table related to drug information, the drug molecular formula is converted into fingerprint information by the Morgan algorithm and, because the fingerprint information has too many digits, is further converted into a vector by a trained model, for example by the BERT algorithm; or the drug molecular formula is converted directly into a vector by the Mol2Vec algorithm; and the resulting numeric string is stored directly in the drug information table. The amino acid sequence of the target protein can be vectorized with the PSSM method. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with One-hot.

4. The organ chip database vectorization scheme for artificial intelligence algorithms according to claim 2, wherein, for the data table related to cell information, the cell gene sequence is vectorized by the Gene2Vec method and the result is stored in the cell information data table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with One-hot.

5. The organ chip database vectorization scheme for artificial intelligence algorithms according to claim 2, wherein, for the data table related to biological scaffold material information, the molecular formula and structural formula are vectorized by the Mol2Vec method and the result is stored in the scaffold material information data table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

6. The organ chip database vectorization scheme for artificial intelligence algorithms according to claim 2, wherein, for the data table related to biological reagent information, the chemical formula is vectorized by the Mol2Vec method and the result is stored in the reagent information data table. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

7. The organ chip database vectorization scheme according to claim 2, wherein, for the data table related to the organ chip model, some of the data, such as the official website link, the developer, and the article name, are irrelevant to predicting experimental results, are not input into the artificial intelligence model, and therefore need no vectorization. Of the remaining information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec.

8. The organ chip database vectorization scheme according to claim 2, wherein, for the organ chip parameter configuration data table, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot.

9. The organ chip database vectorization scheme according to claim 2, wherein, for the experimental result data table with time information, numeric fields can be encoded by normalization to the range 0 to 1, and text fields can be encoded with Word2Vec or One-hot. If the data set available for training the model is small, it is recommended to convert the numeric experimental results into graded classes. For example, an oxygen content below 19.5% is class 1, an oxygen content between 19.5% and 24% is class 2, and an oxygen content above 24% is class 3.