CN117079716B

CN117079716B - Deep learning prediction method of tumor drug administration scheme based on gene detection

Info

Publication number: CN117079716B
Application number: CN202311177095.XA
Authority: CN
Inventors: 于文龙; 顾忠泽; 胡天牧
Original assignee: Jiangsu Institute Of Sports Health
Current assignee: Jiangsu Institute Of Sports Health
Priority date: 2023-09-13
Filing date: 2023-09-13
Publication date: 2024-04-05
Anticipated expiration: 2043-09-13
Also published as: CN117079716A

Abstract

The invention discloses a deep learning method for predicting tumor medication regimens based on gene detection. It relates to the technical fields of deep learning and organ-on-a-chip systems, and addresses the technical problem that the predictions of deep learning models are insufficiently interpretable in clinical decision-making and biomedical research.

Description

Deep learning prediction method of tumor drug administration scheme based on gene detection

Technical Field

The application relates to the technical fields of deep learning and organ-on-a-chip systems, and in particular to a deep learning method for predicting tumor medication regimens based on gene detection.

Background

Malignant tumors have become a serious threat to human life and health. By 2021, deaths from malignant tumors accounted for more than 20% of all deaths in China, and the proportion is still rising. Tumors place a heavy burden not only on individual patients but also on the medical system and social resources, so their prevention and treatment have become a societal problem.

In clinical tumor control, traditional surgery, radiotherapy and chemotherapy can control tumor growth to some extent, but their effectiveness is unstable in some cases, they are often accompanied by serious side effects, and patient compliance is low. As targeted drugs have gradually entered the clinic, a new paradigm for formulating tumor treatment has emerged: a precisely designed, personalized regimen is drawn up for each patient from objective evidence of the patient's genetic background and tumor phenotype, combined with the experience-based judgment of a professional clinical team, and targeted drugs are a key link in this paradigm. However, the diversity of gene regulation and the complexity of biological signaling pathways make it impossible to give a reliable prognosis from a single or isolated target. In addition, the varying effects of random mutations are difficult to classify with limited prior knowledge, and subtypes that are difficult to classify reduce the value of targeted drugs. Effectively integrating the overall trend of the mutations and combining drug prediction with wet-lab experimental results is one direction that can improve the predictive power of gene detection. When gene detection and genotype-based prediction are tightly combined with functional verification by in vitro experiments, the two support and complement each other, and the actual effectiveness of a drug in the patient can be evaluated more accurately.

The core technology of precision medicine is genomic detection, which identifies the genetic mutations in a patient's tumor sample. When these mutations are matched to drugs developed against them, they provide strong supporting evidence for the effect of targeted drugs. With advances in molecular biology, genomics research and detection technology, the cost of characterizing a patient's genetic background has fallen sharply, and the concept of precision medicine has received increasing attention. Together, these factors have made gene detection one of the important sources of information for clinical treatment decisions.

As patient surrogates, in vitro models for studying tumor drug response have also been under development for many years. Before the development of organoid and organ-on-a-chip technology, in vitro tumor models relied on traditional cell culture of the patient's tumor tissue (patient-derived cell lines, PDC), on xenograft-based animal models (patient-derived xenografts, PDX), and the like.

Traditional two-dimensional cell culture extracts primary human cells and cultures them in vitro. Because it allows human cells to be treated and observed in vitro, it is of great importance in drug development. However, two-dimensional culture cannot accurately reflect the microenvironment and heterogeneity of real tissue in vivo. In tumor research and applications in particular, tumor heterogeneity and the missing microenvironment mean that the phenotype of two-dimensional cells agrees poorly with the in vivo phenotype, so the clinical usefulness of the method is extremely low.

Patient-derived xenograft (PDX) animal models are xenograft tumor models formed by transplanting tissue fragments or primary cell samples from tumor patients into immunodeficient animals. Compared with two-dimensional cell culture, xenograft animal models preserve the heterogeneity and microenvironment of the tumor tissue well, reproduce the phenotype and function of the in vivo tumor tissue, and are widely accepted. However, PDX models are difficult to construct: a PDX system based on immunodeficient mice has a culture period of at least two months, and a complete drug-sensitivity test costs about 200,000 yuan on average. The long turnaround and high cost make PDX systems poorly accessible, so they cannot be applied effectively in the clinic.

Organoids constructed by sampling and culturing a patient's tumor tissue, and the organ-on-a-chip systems built on them, effectively overcome these problems. Organoid technology is an in vitro culture model optimized from traditional two-dimensional cell culture: media such as Matrigel are used to arrange human cells into an ordered three-dimensional structure in vitro, so that the structure of the corresponding organ in the human body is simulated as closely as possible and part of the organ's in vivo physiological characteristics is reproduced. Organ-on-a-chip technology is an in vitro experimental platform that combines biology, materials science and engineering; it uses microfluidics to assemble one or more cell types, organoid tissues, microenvironments and the like into a single system on a miniature chip. The chip and the microfluidic device can accurately simulate the biomechanical and chemical environment. This yields data closer to in vivo experiments and reduces the need for animal experiments, and it has become an important direction for in vitro research in biomedicine.

Organoids derived from tumor tissue show both phenotypic and functional consistency with the tissue of origin, and their culture period and cost are significantly lower than those of xenograft animal models. Building on organoids, organ-on-a-chip platforms that use microfluidics, including platforms connecting several organs in series, can better simulate the whole process of absorption, distribution, metabolism, excretion and toxicity of a drug in the human body and reflect the drug's effect more comprehensively. Using organoids as surrogates for tumor patients to evaluate and predict drug regimens in vitro has been accepted in the clinical and research communities, and a preliminary consensus has formed.

Traditional drug-sensitivity screening, whether combined with two-dimensional cell culture or three-dimensional organoid culture, requires a long experimental cycle that includes primary sample extraction, primary organoid culture, and end-point detection after the dosing period. The experimental result is then used as the input of a prediction model to decide on a medication regimen. The actual turnaround for drug-sensitivity screening is often about one calendar month and at least two weeks. From the patient's perspective, every moment counts in deciding on a treatment regimen: the earlier the regimen takes effect, the higher the eventual clinical survival. Gene detection services in the diagnostic field have reached considerable scale, their data delivery is becoming standardized, and their turnaround is clearly shorter than the experimental cycle. If the wet-lab data are encoded into the model during training, so that prediction requires only gene detection data, the model is far more timely and its application value improves markedly.

Gene detection data and organ-on-a-chip experimental data are both high-throughput: the feature dimension is large and the output volume is high. Using these data reasonably and effectively and extracting their clinically relevant core features is a difficulty both in data processing and in putting the technology into practice. Thanks to the rapid development of computer hardware and deep learning research, high-dimensional feature data can be analyzed and explored effectively with deep neural networks. However, training a deep learning model is data driven, and the automation of feature extraction and of the iterative training process makes the model a black box: even if the model predicts well, the basis of its predictions is difficult to explain. The interpretability of deep learning predictions in clinical decision-making and biomedical research therefore needs to be improved.

Disclosure of Invention

The application provides a deep learning method for predicting tumor medication regimens based on gene detection. Its technical purpose is to improve the interpretability of deep learning predictions in clinical decision-making and biomedicine.

The technical aim of the application is achieved through the following technical scheme:

A method for deep learning prediction of a tumor dosing regimen based on gene detection, comprising:

S1: constructing the model structure framework of a pre-training model, wherein the pre-training model comprises a first encoding module, a second encoding module, a genetic information decoding module, a compound structure decoding module, a fully connected layer and an output layer;

S2: constructing a pre-training mutation drug-sensitivity data set, and training the pre-training model on it to obtain a pan-cancer pre-training model;

S3: screening, from an organoid sample library, samples of the cancer type targeted by the target prediction model, and resuscitating the corresponding organoids;

S4: designing a plurality of medication regimens according to the cancer-type samples;

S5: performing wet-lab experiments with the different medication regimens on organ-on-a-chip systems matched to the selected organoids to obtain a wet-lab experimental data set;

S6: performing transfer learning on the pan-cancer pre-training model with the wet-lab experimental data set until a target prediction model applicable to the target domain is obtained;

S7: predicting the rationality of a tumor medication regimen with the target prediction model.

Further, the step S2 includes:

S21: inputting the pre-training mutation drug-sensitivity data set into the first encoding module and the second encoding module for encoding, to obtain a mutation information encoding and a compound structure encoding respectively;

S22: inputting the mutation information encoding into the genetic information decoding module for decoding and output, and inputting the compound structure encoding into the compound structure decoding module for decoding and output;

S23: concatenating, in the fully connected layer, the features output by the genetic information decoding module and the compound structure decoding module, and outputting the result through the output layer;

S24: repeating steps S21 to S23 until training of the pre-training model is completed, to obtain the pan-cancer pre-training model;

wherein the first encoding module and the second encoding module each use a cross-attention mechanism based on the Transformer architecture; the cross-attention mechanism associates the mutation information encoding with the hidden-layer features of the compound structure encoding, weighting the compound structure encoding by the mutation information encoding while weighting the mutation information encoding by the compound structure encoding.

Further, the first encoding module is an inverted KO module, which maps the mutation profile of each sample in the pre-training mutation drug-sensitivity data set into a mutation information encoding by one-hot encoding; the second encoding module is a Morgan fingerprint encoding module, which encodes every intervention compound in the pre-training mutation drug-sensitivity data set according to its compound structure to obtain the compound structure encoding.

Further, in step S4 there are at least 12 medication regimens, each medication regimen produces a concentration gradient of at least 5 different dose levels, and for drugs with no specially specified conditions the concentration gradient is based on [0, 0.016, 0.08, 0.4, 2, 10.0] micromolar.

Further, step S5 includes:

S51: according to the number of medication regimens, plating organoid chambers in (number of medication regimens + 1) organ-on-a-chip systems with samples that have been resuscitated for 5 to 14 days and have passed quality control for organoid viability and count;

S52: treating a separate organ-on-a-chip with each medication regimen and its concentration gradient as specified in the regimen design, using an organ-on-a-chip system with no compound added as the control group, and culturing the organ-on-a-chip systems for 7 days;

S53: collecting the genetic information of the samples by combining the sample library data with detection performed during organ-on-a-chip culture;

S54: on day 7 after dosing, measuring the viability of the organoids on the organ-on-a-chip systems and judging the effectiveness of each medication regimen from the cell viability data, to obtain an effectiveness ranking of the medication regimens;

S55: the sample genetic information and the effectiveness ranking of the medication regimens together form the wet-lab experimental data set.
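
As a non-limiting illustration, the effectiveness ranking of step S54 can be sketched as follows. The inhibition measure (one minus viability relative to the untreated control), the function name `rank_regimens` and the readout values are assumptions made for this example; the application does not specify how viability is converted into a ranking.

```python
from typing import Dict, List


def rank_regimens(viability: Dict[str, float], control: float) -> List[str]:
    """Order regimens from most to least effective by relative inhibition.

    Assumption: effectiveness = 1 - (treated viability / control viability).
    """
    if control <= 0:
        raise ValueError("control viability must be positive")
    inhibition = {name: 1.0 - value / control for name, value in viability.items()}
    # Highest inhibition first.
    return sorted(inhibition, key=inhibition.get, reverse=True)


# Hypothetical day-7 viability readouts in arbitrary units.
readouts = {"regimen_A": 40.0, "regimen_B": 90.0, "regimen_C": 15.0}
ranking = rank_regimens(readouts, control=100.0)
print(ranking)
```

The resulting ordered list plays the role of the experimental effectiveness ranking that is later compared with the model's predicted ranking.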

Further, in step S2, the pre-training mutation drug-sensitivity data set is divided into a training set, a test set and a verification set in the ratio [0.7, 0.2, 0.1]. The pre-training model is trained on the training set to obtain a first pre-training model; the performance of the first pre-training model is evaluated on the test set, and its hyperparameters are adjusted iteratively according to that performance to obtain a second pre-training model; the performance level of the second pre-training model is then verified on the verification set. If the verification result reaches a preset standard, the second pre-training model is the pan-cancer pre-training model; otherwise, training of the second pre-training model continues until the pan-cancer pre-training model is obtained.

Further, in step S6, the wet-lab experimental data set is divided into a training set and a verification set in a ratio of no less than [0.9, 0.1]. Transfer learning is performed on the pan-cancer pre-training model with the training set to obtain a first target prediction model, and the performance level of the first target prediction model is verified on the verification set. If the verification result reaches a preset standard, the first target prediction model is the target prediction model; otherwise, training of the first target prediction model continues until the target prediction model is obtained.

Further, the validation set validates the performance level of the first target prediction model, including:

S61: predicting the effectiveness of different medication regimens for the tumor with the first target prediction model, and sorting the regimens from high to low effectiveness to obtain a predicted effectiveness ranking;

S62: obtaining the experimental effectiveness ranking of the same medication regimens from the organ-on-a-chip experiment;

S63: calculating the consistency between the predicted effectiveness ranking and the experimental effectiveness ranking, expressed as:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}$$

wherein ρ represents the consistency coefficient; d_i represents the difference between the rank of the i-th item in the predicted effectiveness ranking and its rank in the experimental effectiveness ranking; and n represents the total number of data items;

S64: when the consistency coefficient ρ is greater than 0, performing a statistical t-test on ρ to obtain a test result and judging whether the test result is smaller than the significance threshold; if so, the predicted effectiveness ranking is consistent with the experimental effectiveness ranking.
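
Steps S61 to S64 can be sketched as follows, by way of illustration only. The sketch assumes that the consistency coefficient is Spearman's rank correlation (which matches the definitions of d_i and n given in S63), that ranks contain no ties, and that the significance threshold is 0.05; the threshold value and all function names are assumptions of this example. The p-value is obtained by numerically integrating the Student t density so that only the standard library is needed.

```python
import math
from typing import Sequence


def spearman_rho(pred_ranks: Sequence[int], exp_ranks: Sequence[int]) -> float:
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); valid for ranks without ties."""
    n = len(pred_ranks)
    if n != len(exp_ranks) or n < 3:
        raise ValueError("need two rank lists of equal length >= 3")
    d_squared = sum((p - e) ** 2 for p, e in zip(pred_ranks, exp_ranks))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def t_statistic(rho: float, n: int) -> float:
    """t = rho * sqrt((n - 2) / (1 - rho^2)), with n - 2 degrees of freedom."""
    if abs(rho) >= 1.0:
        return math.copysign(math.inf, rho)
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def two_sided_p(t: float, dof: int, steps: int = 20000) -> float:
    """Two-sided p-value via Simpson integration of the Student t density."""
    if math.isinf(t):
        return 0.0
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))

    def pdf(x: float) -> float:
        return coef * (1.0 + x * x / dof) ** (-(dof + 1) / 2)

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2.0 * total * h / 3.0   # probability of |T| <= |t|
    return max(0.0, 1.0 - central_mass)


def rankings_consistent(pred_ranks, exp_ranks, alpha: float = 0.05) -> bool:
    """S64: require rho > 0 and a t-test p-value below the threshold alpha."""
    rho = spearman_rho(pred_ranks, exp_ranks)
    if rho <= 0:
        return False
    n = len(pred_ranks)
    return two_sided_p(t_statistic(rho, n), n - 2) < alpha


# Hypothetical ranks of six regimens (1 = most effective).
predicted = [1, 2, 3, 4, 5, 6]
experimental = [1, 3, 2, 4, 6, 5]
rho = spearman_rho(predicted, experimental)
print(rho, rankings_consistent(predicted, experimental))
```

In this toy case two adjacent pairs are swapped, giving a sum of squared rank differences of 4 and a positive coefficient; whether real data pass the test depends on the chosen threshold and the number of regimens.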

The beneficial effects of the application are as follows. The wet-lab data obtained from the automated high-throughput organ-on-a-chip platform represent real-world wet-lab results; compared with traditional two-dimensional cell experiments they are closer to the phenotype and function of organs in the human body, and they avoid the inconsistencies that species differences introduce in animal models. Because the deep learning model takes wet-lab data as input, the connection between the model and biological meaning is effectively strengthened.

In addition, the neural network is constructed by inverting the association map between genes and their functional pathways, which has been manually curated from prior biological research. As a result, the flow of features through the trained model conforms more closely to biological meaning, and interpretability at the biological level can be obtained by analyzing the weights and feature importance of the network.

Furthermore, on the one hand, multiple prior two-dimensional cell experiment results are combined to build the pan-cancer pre-training model; on the other hand, a fine-tuning data set is built from small-scale organoid and organ-on-a-chip experiments. Starting from the pre-training model built on two-dimensional cell culture, domain-transfer fine-tuning is performed with three-dimensional organoid and organ-on-a-chip data. This preserves the pre-training model's capacity to capture the volume and features of the data while allowing it to be applied, at reasonable cost, to three-dimensional organoid structures whose data are closer to the in vivo organ level. This technical route balances model coverage, accuracy, consistency between the in vitro platform and real in vivo results, and cost, and can effectively provide evaluation and prediction of in vivo anti-tumor effect.

At inference time, the input data comprise only mutation detection data and no wet-lab step is involved, so the data are produced faster and the model's inference turnaround is shorter; the method suits application scenarios that require rapid inference results and is more readily usable. Moreover, the method is not limited to mutations of individual targets: by integrating all available information on biological functional pathways, it uses broader gene mutation information together with gene interaction relationships, and draws on broader knowledge at the level of biological functional pathways, to make more accurate judgments.

In summary, the application combines gene sequencing with the close association between existing clinical drugs and their targets, builds on the large volume of data and experimental verification provided by organoid and organ-on-a-chip wet-lab technology, and processes the data with deep learning. By linking these fields, it can effectively develop an in vitro patient surrogate model and provide an evaluation, grounded in wet-lab experiments in an in vitro environment, of how effective a medication regimen is.

Drawings

FIG. 1 is a full flow chart of a method for deep learning prediction of a tumor dosing regimen based on gene detection as described herein;

FIG. 2 is a frame diagram of an inverted KO module;

FIG. 3 is a schematic structural diagram of a pre-training model according to an embodiment of the present application;

FIG. 4 is a graph comparing the predicted and experimental real tumor suppression abilities of the TP_Trial_01_0002 sample model.

Detailed Description

The technical scheme of the application will be described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the method for deep learning prediction of tumor drug administration scheme based on gene detection described in the present application comprises:

S1: The model structure framework of a pre-training model is constructed. As shown in FIG. 3, the pre-training model comprises a first encoding module, a second encoding module, a genetic information decoding module, a compound structure decoding module, a fully connected layer and an output layer.

S2: A pre-training mutation drug-sensitivity data set is constructed, and the pre-training model is trained on it to obtain a pan-cancer pre-training model.

As a specific example, GDSC (Genomics of Drug Sensitivity in Cancer) is an open-source data set comprising 1939 cancer cell lines from different tissue and organ sources combined with 621 different compound interventions, for a total of 570,000 independent two-dimensional cell culture drug-sensitivity records. The open-source two-dimensional cell drug-sensitivity data in this data set serve as the training data source for the pre-training model. Specifically, all mutation data detected by the technical platforms in the data set are aggregated, and 1,048,575 mutation calls from 307 cell line samples are classified and stored per cell line.

With the combination of a single sample and a single compound as the granularity of the data set, the pre-training mutation drug-sensitivity data set is divided into a training set, a test set and a verification set in the ratio [0.7, 0.2, 0.1]. The pre-training model is trained on the training set to obtain a first pre-training model; the performance of the first pre-training model is evaluated on the test set, and its hyperparameters are adjusted iteratively according to that performance to obtain a second pre-training model; the performance level of the second pre-training model is then verified on the verification set. If the verification result reaches a preset standard, the second pre-training model is the pan-cancer pre-training model; otherwise, training of the second pre-training model continues until the pan-cancer pre-training model is obtained.
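
A minimal sketch of the [0.7, 0.2, 0.1] division at sample-compound granularity is given below for illustration. The random shuffling, the fixed seed, the rounding rule and the record names are assumptions of this example; the application states only the ratio and the granularity.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence, ratios=(0.7, 0.2, 0.1),
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the records and cut them into training, test and verification sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verification = shuffled[n_train + n_test:]  # remainder
    return train, test, verification


# Hypothetical records: one entry per (sample, compound) combination.
records = [(f"sample_{s}", f"compound_{c}") for s in range(10) for c in range(10)]
train, test, verification = split_dataset(records)
print(len(train), len(test), len(verification))
```

Splitting per sample-compound pair means that the same cell line may appear in more than one subset with different compounds, which follows from the granularity stated above.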

Specifically, step S2 includes:

S21: The pre-training mutation drug-sensitivity data set is input into the first encoding module and the second encoding module for encoding, to obtain a mutation information encoding and a compound structure encoding respectively.

Specifically, the first encoding module is an inverted KO module, which maps the mutation profile of each sample in the pre-training mutation drug-sensitivity data set into a mutation information encoding by one-hot encoding.

As a specific embodiment, a vector whose length, 17047, equals the input length of the constructed neural network framework serves as the template, with 0 as the initial value denoting no mutation. All samples in the pre-training mutation drug-sensitivity data set are traversed, and if a mutation is detected in a sample's gene at the corresponding position, that position is set to 1. This yields 307 binary mutation information encoding vectors of length 17047, one per sample.
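
The encoding of this embodiment can be sketched as follows, for illustration only. The five-gene template and the gene names are hypothetical stand-ins for the template of length 17047, and ignoring genes that fall outside the template is an assumption of this example.

```python
from typing import Dict, Iterable, List


def encode_mutations(gene_index: Dict[str, int],
                     mutated_genes: Iterable[str]) -> List[int]:
    """Return a binary vector with 1 at every position whose gene is mutated."""
    vector = [0] * len(gene_index)      # 0 = no mutation detected
    for gene in mutated_genes:
        position = gene_index.get(gene)
        if position is not None:        # genes outside the template are skipped
            vector[position] = 1
    return vector


# Toy template of 5 positions; the embodiment uses 17047 positions.
gene_index = {"TP53": 0, "KRAS": 1, "EGFR": 2, "BRAF": 3, "PIK3CA": 4}
sample_vector = encode_mutations(gene_index, ["KRAS", "PIK3CA", "NOT_IN_TEMPLATE"])
print(sample_vector)  # [0, 1, 0, 0, 1]
```

Applying the function to each of the 307 samples would produce one such vector per sample.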

The KO module is a manually curated classification of genes and proteins by functional pathway. Based on prior knowledge from wet-lab studies, genes are assigned to functional pathways at successively finer levels according to functional homology. Functional pathways exist at different levels, and genes within the same functional pathway tend to be more strongly associated.

The KO module hierarchy has 4 levels. The first level, BRITE, represents the broadest categories and comprises six top-level classes: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, and human diseases. The second and third levels contain Module information and Orthology Group information respectively; each refines the functional classification of the level above, grouping genes and molecules into finer functional classes according to their synergistic effects and interactions. The fourth level, Orthology, is the most detailed functional level and corresponds to single genes: each Orthology entry represents a gene or molecule generalized by homology together with its function, and because one gene may have several functions, a gene may correspond to one or more Orthology entries.

An inverted neural network framework is then built from the membership relations between KO module genes and functional pathways. Specifically, the 4 levels of the inverted KO module form a feed-forward network of 4 layers in total, including the input layer. The input layer of the feed-forward network corresponds to the fourth (Orthology) level of the KO module, with one input neuron per gene in the Orthology level. The hidden layers are built in turn from the remaining levels of the KO module, and the final output layer corresponds to the BRITE level, that is, it contains 6 neurons, one per top-level class. After the 4 layers of neurons are created, connections between neurons of different layers are established according to the membership of the KO module genes. The result is a neural network framework with functional-taxonomic biological meaning that follows the functional assignment of genes in the KEGG database, as shown in FIG. 2.
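
One way to realize membership-based connections is a connection mask that keeps a weight only where a lower-level node belongs to a higher-level node. The sketch below illustrates a single such layer under stated assumptions: the node names, the toy membership table, the unit weights and the ReLU activation are all hypothetical and are not taken from the application.

```python
from typing import Dict, List, Sequence


def build_mask(children: List[str], parents: List[str],
               membership: Dict[str, List[str]]) -> List[List[int]]:
    """mask[p][c] is 1 only when child node c belongs to parent node p."""
    parent_pos = {name: i for i, name in enumerate(parents)}
    mask = [[0] * len(children) for _ in parents]
    for c, child in enumerate(children):
        for parent in membership.get(child, []):
            mask[parent_pos[parent]][c] = 1
    return mask


def masked_forward(inputs: Sequence[float], weights: List[List[float]],
                   mask: List[List[int]]) -> List[float]:
    """One layer: weighted sum over permitted connections, then ReLU (assumed)."""
    outputs = []
    for w_row, m_row in zip(weights, mask):
        total = sum(w * m * x for w, m, x in zip(w_row, m_row, inputs))
        outputs.append(max(0.0, total))
    return outputs


# Hypothetical toy hierarchy: three Orthology entries feeding two groups.
orthologies = ["K1", "K2", "K3"]
groups = ["G1", "G2"]
membership = {"K1": ["G1"], "K2": ["G1", "G2"], "K3": ["G2"]}

mask = build_mask(orthologies, groups, membership)
weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
activations = masked_forward([1, 0, 0], weights, mask)   # only K1 is mutated
print(mask, activations)
```

Because K1 belongs only to G1, a mutation in K1 activates G1 and leaves G2 at zero, so each hidden unit can be read as the activity of a named functional category. Stacking three such masked layers would give the four-level structure described above.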

The second encoding module is a Morgan fingerprint encoding module, which encodes every intervention compound in the pre-training mutation drug-sensitivity data set according to its compound structure to obtain the compound structure encoding. A Morgan fingerprint takes a single atom as the starting point and center, expands the radius step by step, and records the atoms enclosed at each radius as substructures. This encoding effectively captures the connectivity and relationships among the atoms of the compound as well as information about its local functional groups, and it is a common compound encoding in cheminformatics. If a treatment involves the combined action of two or more compounds, each compound is encoded separately as a Morgan fingerprint, and the fingerprints are then added element by element to form a feature cross, which serves as the composite feature of the combined treatment.

Specifically, the SMILES expression of each compound structure is obtained, a Mol object corresponding to the compound is obtained with the rdkit.Chem.MolFromSmiles() function, and the compound is converted into a Morgan fingerprint with RDKit's GetMorganFingerprintAsBitVect() function. The fingerprint is finally converted into a binary bit string in which each position indicates whether a preset sub-functional-group unit is present in the structure of the compound: 1 if present, 0 otherwise. The Morgan fingerprint thus stores the structural information of the compound as a sparse vector.
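
The element-by-element addition used for combined treatments can be sketched as follows, for illustration only. The two 8-bit vectors are hypothetical stand-ins for real Morgan fingerprints, which would be produced by RDKit and are far longer; the function name `combine_fingerprints` is likewise an assumption.

```python
from typing import List, Sequence


def combine_fingerprints(fingerprints: Sequence[Sequence[int]]) -> List[int]:
    """Add equally long bit vectors element by element."""
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    return [sum(bits) for bits in zip(*fingerprints)]


# Hypothetical 8-bit fingerprints of two compounds used together.
drug_a = [1, 0, 1, 0, 0, 1, 0, 0]
drug_b = [0, 0, 1, 1, 0, 0, 0, 1]
combined = combine_fingerprints([drug_a, drug_b])
print(combined)  # [1, 0, 2, 1, 0, 1, 0, 1]
```

A value of 2 marks a substructure shared by both compounds, so the combined feature is no longer strictly binary; a single compound passes through unchanged.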

S22: The mutation information encoding is input into the genetic information decoding module for decoding and output, and the compound structure encoding is input into the compound structure decoding module for decoding and output.

S23: The fully connected layer concatenates the features output by the genetic information decoding module and the compound structure decoding module, and the result is output through the output layer.

The final output of the output layer is the IC50 regression value corresponding to the record in the data set. Thus, given a sample with known genetic information and a compound structure, a regression prediction of the IC50 value of the drug's killing effect is built from the accumulated data of the prior data set.

S24: Steps S21 to S23 are repeated until training of the pre-training model is completed, to obtain the pan-cancer pre-training model.

The first encoding module and the second encoding module both use a cross-attention mechanism based on the Transformer architecture. The cross-attention mechanism associates the mutation information encoding with the hidden-layer features of the compound structure encoding: the compound structure encoding is weighted by the mutation information encoding and, at the same time, the mutation information encoding is weighted by the compound structure encoding, so that the separately constructed decoding layers exchange information across data types.
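
The bidirectional weighting can be illustrated with plain scaled dot-product attention, as sketched below. This is a simplified sketch and not the model of the application: learned query, key and value projections, multiple heads and residual connections are omitted, and the hidden feature values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)                              # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """For each query row, return an attention-weighted mix of keys_values rows."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * row[j] for w, row in zip(weights, keys_values))
            for j in range(len(keys_values[0]))
        ])
    return attended


# Hypothetical hidden features: 3 mutation tokens and 2 compound tokens, width 2.
mutation_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
compound_hidden = [[0.5, 0.5], [1.0, 0.0]]

# Compound features weighted by the mutation side, and the reverse direction.
compound_weighted = cross_attention(mutation_hidden, compound_hidden)
mutation_weighted = cross_attention(compound_hidden, mutation_hidden)
print(compound_weighted, mutation_weighted)
```

Each output row is a convex combination of rows from the other modality, which is what lets one data type modulate the representation of the other.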

During training of the pre-training model, the mean squared error (MSE) between the model's regression prediction and the true value is used as the loss, Adam is used as the optimizer, and hyperparameters such as learning rate lr=1E-05, batch size batch=16 and number of epochs epoch=300 are set. The performance of the model is evaluated on the test set, and the hyperparameters are adjusted iteratively according to that performance. After training is completed, the verification set is used to obtain the performance level of the model on unseen data. The performance of the model before and after training is shown in Table 1; the model converges during training and finally generalizes adequately on the verification data set.
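
For illustration, the stated settings and the MSE loss can be written down as follows. The `TrainingConfig` container and the sample values are assumptions of this example; the application does not name a deep learning framework, so the optimizer appears here only as a recorded setting.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in the embodiment."""
    learning_rate: float = 1e-5
    batch_size: int = 16
    epochs: int = 300
    optimizer: str = "Adam"
    loss: str = "MSE"


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error between predicted and recorded IC50 values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


config = TrainingConfig()
# Hypothetical predictions and targets: squared errors 0, 1 and 4.
loss = mse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(config, loss)
```

The toy loss is the mean of the three squared errors, that is 5/3.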

TABLE 1

S3: Samples of the cancer type addressed by the target prediction model are screened from the organoid sample library, and the corresponding organoids are resuscitated.

Because the pan-cancer pre-training model is built on a two-dimensional cell culture database, its predicted values cannot fully reflect how a compound actually acts at the organ level in vivo. Therefore, organoid-level experiments combining samples and compounds are designed on the organ chip, and the pan-cancer pre-training model is fine-tuned with this small-scale data so that the model is transferred to the target domain and captures the real effect of the compound at the level of three-dimensional organ structure.

Specifically, according to the cancer type of the current target prediction model, the tissue or organ type is selected for experiments in light of the samples available in the organoid tissue sample library. Solid-tumor cancer types include breast cancer, cervical cancer, colorectal cancer, esophageal cancer, liver cancer, lung cancer, pancreatic cancer, prostate cancer and gastric cancer; any one of them is selected and its samples are screened in the sample library. The screened cryopreserved organoids are resuscitated with a culture medium dedicated to human tumor organoids. Samples with good activity, as well as samples with complete prior mutation information in the tissue sample library, can be included in the subsequent organ-chip experiments for that cancer type. The experimental sample cohort for a single target cancer type should contain at least 30 samples.

S4: Multiple medication schemes are designed according to the cancer-type samples.

Specifically, for the selected cancer type and with reference to the opinions of a clinical expert team, the medication schemes commonly used in current guideline recommendations and in clinical development are collected and organized. They include targeted drugs and chemotherapeutic drugs, marketed drugs and, in special cases, drugs still in clinical trials, and cover single-drug, multi-drug and other treatment schemes of different dimensions. In the end, at least 12 clinically derived medication schemes are established for the target cancer type, and each medication scheme is given a concentration gradient of at least 5 different dose levels; for drugs without special requirements, the concentration gradient is [0, 0.016, 0.08, 0.4, 2, 10.0] μmol/L. This produces a training data set that is large enough along the compound dimension to ensure that the model can be trained.

S5: Wet experiments are carried out with the different medication schemes on the organ-chip system matched to the selected organoids to obtain a wet experiment data set.

Specifically, samples that grow stably and show good activity after cryopreservation and resuscitation are screened from the organoid tissue sample library, and the organ-chip platform and consumables are prepared for the selected target cancer type. With reference to the number of medication schemes in the design, samples that have been resuscitated for 5 to 14 days and have passed organoid viability quality control are plated into the organoid chambers of (number of medication schemes + 1) organ-chip systems. The additional organ chip, to which no compound is added, serves as the Control reference group, and each of the other organ chips is dosed according to one medication scheme and its concentration gradient. Depending on the tissue type and the organ-chip platform technology, the organ chips are cultured using microfluidic simulation of the organ-chip microenvironment together with gravity-driven drug-sensitivity organ-chip culture; the dosing date is recorded as Day 0, and the organ chips are cultured for 7 days.

Depending on how complete the genetic data of the selected sample are in the organoid tissue sample library, the genetic information of the sample is gathered from the sample library records together with detection performed during organ-chip culture. Specifically, if the organoid tissue sample library already holds complete prior mutation information for the sample, that information is recorded in the corresponding experimental database. If the tissue sample library has no genetic information, then after 5 days of the organoid resuscitation cycle, organoid samples on the order of 5 × 10^6 or more from the organoid chamber are washed, whole-exome sequencing is used to detect, analyze and organize the mutation information of the sample, and the mutation information is finally recorded in the experimental database on a per-sample basis.

The activity of the organoids on the organ-chip platform is measured on Day 7 after dosing. The cell activity data are used to judge how the corresponding medication scheme intervenes in and affects the activity of the tumor organoids, and hence how effective the medication scheme is. Specifically, using a cell activity assay such as CCK8 or ATP, the optical density of the sample is measured with a microplate reader at a specific wavelength. For each medication scheme, a regression model is used to fit a best-fit curve to the optical density values at the different concentrations; the drug concentration at which tumor organoid activity is exactly 50% of the maximum value is calculated from the fitted curve and recorded as the EC50 of the medication scheme. After all medication schemes have been processed, they are ranked by their EC50 results in order of their effectiveness in inhibiting the tumor organoids, giving the medication scheme effectiveness ranking.
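The EC50 calculation described above can be sketched as follows. The text does not name the regression model, so this example assumes a two-parameter Hill curve fitted by a simple least-squares grid search; the control reading, the synthetic optical density values and the function names (`hill`, `fit_ec50`) are hypothetical.

```python
def hill(conc, ec50, slope):
    """Fraction of control activity remaining at a given concentration."""
    return 1.0 / (1.0 + (conc / ec50) ** slope)

def fit_ec50(concs, viability):
    """Least-squares grid search over EC50 (log-spaced) and Hill slope."""
    best = None
    for k in range(251):                      # EC50 candidates from 1e-3 to 1e2 µmol/L
        ec50 = 10 ** (-3 + 0.02 * k)
        for s in range(5, 31):                # Hill slope candidates from 0.5 to 3.0
            slope = s / 10
            sse = sum((hill(c, ec50, slope) - y) ** 2 for c, y in zip(concs, viability))
            if best is None or sse < best[0]:
                best = (sse, ec50, slope)
    return best[1], best[2]

concs = [0.016, 0.08, 0.4, 2.0, 10.0]         # µmol/L, non-zero doses of the gradient in the text
control_od = 6000.0                            # hypothetical untreated (Control) reading
# Synthetic readings generated from a known curve (EC50 = 0.4 µmol/L, slope = 1).
treated_od = [control_od * hill(c, 0.4, 1.0) for c in concs]

# Normalize to the control so that 1.0 corresponds to maximum activity.
viability = [od / control_od for od in treated_od]
ec50, slope = fit_ec50(concs, viability)
```

Sorting the medication schemes by the fitted `ec50` in ascending order would then give an effectiveness ranking of the kind described in the text, with the lowest EC50 ranked as the most effective.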

In summary, the sample genetic information and the medication scheme effectiveness ranking together form the wet experiment data set.

S6: Transfer learning is performed on the pan-cancer pre-training model with the wet experiment data set until a target prediction model suited to the target domain is obtained.

Specifically, the pan-cancer pre-training model has been trained on the prior data set; its parameters are retained as initial parameters, the organ-chip wet experiment data set is used as the fine-tuning data set, and the pan-cancer pre-training model is trained further. The input of the pan-cancer pre-training model is the coding vector of the sample mutation detection information from the experimental database together with the code of the compound structure under each medication scheme; for multi-drug combination schemes, the feature-crossing processing mode is used. The output of the model is the EC50 concentration value that evaluates how strongly the medication scheme inhibits tumor organoid activity.

The wet experiment data set is divided into a training set and a verification set at a ratio of not lower than [0.9, 0.1]. The training set is used to perform transfer learning on the pan-cancer pre-training model to obtain a first target prediction model. The verification set is used to verify the performance level of the first target prediction model: if the verification result reaches a preset standard, the first target prediction model is the target prediction model; otherwise, training of the first target prediction model continues until the target prediction model is obtained. In training, MSE is used as the loss between the model's regression predictions and the true values, Adam is used as the optimizer, and hyperparameters such as learning rate, batch size and number of epochs are set. The performance of the model is evaluated on the verification set, and the hyperparameters are adjusted iteratively according to that performance. Finally, inference is run with both the original parameters of the pan-cancer pre-training model and the fine-tuned parameters to compare the performance of the model before and after fine-tuning. When model training is complete, the target prediction model is obtained.

In this embodiment, non-small cell lung cancer is selected as the target cancer type. From the organoid tissue sample library, 47 cryopreserved non-small cell lung cancer samples with complete mutation information are screened. A human tumor organoid culture medium is used that contains growth factors, nutrients and small-molecule inhibitors specially optimized for lung cancer cells, matrix gel or Matrigel for organoid cluster development, and other components. The cryopreserved samples are resuscitated and subcultured; 34 samples are successfully resuscitated, pass activity quality control and enter the organoid and organ-chip data set cohort.

Further, medication schemes consistent with clinical practice are selected. In this example, with reference to the recommendations of the clinical oncology expert group, the following 20 single-drug schemes, covering targeted drugs and chemotherapeutic drugs, were finally confirmed:

Gefitinib, Dacomitinib, Osimertinib, Alectinib, Brigatinib, Crizotinib, Entrectinib, Larotrectinib, Ensartinib, Savolitinib, Selpercatinib, Vandetanib, Apatinib, Afatinib, Erlotinib, Paclitaxel, Gemcitabine and Cisplatin, with two DMSO replicates used as controls. The dosing concentration gradient of each compound was set to 0 μmol/L, 0.016 μmol/L, 0.08 μmol/L, 0.4 μmol/L, 2.0 μmol/L and 10.0 μmol/L.

On day 7 after resuscitation of the enrolled samples, organoid activity quality control was performed, confirming that the 34 organoid samples in the cohort passed. With 20 compounds per sample, 680 experimental designs in total were carried out, and the wet experiments were performed on lung organoid chips. Specifically, the organoid pellet is separated using the digestion solution matched to the matrix gel or Matrigel, through digestion, centrifugation, cell collection and related steps. After the organoid pellet has been counted and quality-controlled with a cell counter, matrix gel or Matrigel is added again and the suspension is pipetted until uniform, and the organoids are finally plated into the organoid chambers on the organ chip. During plating, care is taken to digest the organoids as uniformly as possible while keeping a small proportion of them in clustered structures, and to plate them evenly.
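The count of 680 experimental designs follows from pairing each of the 34 samples with each of the 20 schemes (18 compounds plus two DMSO replicates). A short sketch of that enumeration is given below; the sample identifiers and the DMSO replicate labels are placeholders invented for the example.

```python
from itertools import product

# Placeholder identifiers for the 34 samples that passed quality control.
samples = [f"sample_{i:02d}" for i in range(1, 35)]

# 18 compounds from the embodiment plus two DMSO control replicates (labels assumed).
regimens = [
    "Gefitinib", "Dacomitinib", "Osimertinib", "Alectinib", "Brigatinib",
    "Crizotinib", "Entrectinib", "Larotrectinib", "Ensartinib", "Savolitinib",
    "Selpercatinib", "Vandetanib", "Apatinib", "Afatinib", "Erlotinib",
    "Paclitaxel", "Gemcitabine", "Cisplatin", "DMSO_rep1", "DMSO_rep2",
]

# Concentration gradient applied to each compound, in µmol/L.
concentrations_um = [0, 0.016, 0.08, 0.4, 2.0, 10.0]

# One experimental design per (sample, scheme) pair.
designs = list(product(samples, regimens))
print(len(designs))  # 34 * 20 = 680
```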

Further, on day 4 after plating, the compound at the corresponding concentration is added to the organoid chamber of each organ-chip system according to the experimental design. The organoids are cultured for 7 days after dosing, with the medium changed every 2 days, in a sterile environment at 37 °C throughout the culture period.

Day 7 of the dosing culture is the experimental end point. The activity of the organoids is measured with a standard ATP cell activity assay kit, and the optical density of each organoid sample is read with a microplate reader at a specific wavelength.

Table 2 shows the ATP activity detection data of organ-chip sample TP 0302.

Compound	0.016 μM	0.08 μM	0.4 μM	2 μM	10 μM
Gefitinib	7365	6544	8868	7559	6531
Dacomitinib	7165	5959	7638	6208	7264
Osimertinib	6908	7498	8764	7678	7936
Alectinib	8057	6763	7915	8485	7508
Brigatinib	8494	7384	6774	7209	7787
Crizotinib	6880	7821	9351	5618	6877
Entrectinib	7838	9041	9450	5556	8104
Larotrectinib	9306	8899	9988	8329	9738
Ensartinib	9702	8913	9241	9505	9732
Savolitinib	7658	6481	7187	6660	7731
Selpercatinib	6713	6489	7509	6110	7239
Vandetanib	8451	9075	8333	9110	7712
Apatinib	7974	9720	7789	8937	8855
Afatinib	9472	7482	8442	7398	7103
Erlotinib	9256	7670	9298	9811	9395
Cisplatin	6263	5956	8812	8659	8148
Gemcitabine	7385	7636	8218	7595	8053
Paclitaxel	6157	7303	8584	6246	6970
DMSO/Control	6885	6211	5735	5907	6672

TABLE 2

Taking the TP 0302 sample results in Table 2 as an example, the activity values of each sample under the action of the different compounds are collated. For each medication scheme, a regression model is used to fit a best-fit curve to the optical density values at the different concentrations; the drug concentration at which tumor organoid activity is exactly 50% of the maximum value is calculated from the fitted curve and recorded as the EC50 of the medication scheme. After all medication schemes have been processed, they are ranked by their EC50 results in order of their effectiveness in inhibiting the tumor organoids, giving the medication scheme effectiveness ranking.

In the embodiment of the application, the data are divided at a ratio of [0.8, 0.2], with the verification set held out, and the training set is used to train the pan-cancer pre-training model. The input of the model is the coding vector of the sample mutation detection information from the experimental database together with the code of the compound structure under each medication scheme; for multi-drug combination schemes, the feature-crossing processing mode is used. The output target value of the model is the measured EC50 of the medication scheme for inhibition of tumor organoid activity. In training, MSE is used as the loss between the model's regression predictions and the true values, Adam is used as the optimizer, and hyperparameters such as learning rate, batch size and number of epochs are set. The performance of the model before and after fine-tuning, recorded as the MSE-based loss, is shown in Table 3.
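The [0.8, 0.2] division can be sketched as below. The sketch assumes that the split is made at the level of whole samples, which is consistent with the 34 samples of the cohort and the 7 verification samples reported for this embodiment, but the text does not state how the split was drawn; the sample identifiers, the random seed and the function name `split_samples` are hypothetical.

```python
import random

def split_samples(sample_ids, val_fraction=0.2, seed=0):
    """Shuffle sample identifiers and hold out a fraction as the verification set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]           # (training set, verification set)

# Placeholder identifiers for the 34 samples in the cohort.
sample_ids = [f"sample_{i:02d}" for i in range(1, 35)]
train_ids, val_ids = split_samples(sample_ids)
print(len(train_ids), len(val_ids))  # 27 7
```

Splitting by sample rather than by individual measurement keeps all medication schemes of a held-out sample out of training, so the verification set measures performance on unseen samples.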

MSE	Pre-training model weights	Weights after organ-chip fine-tuning
Training data set	0.5732	0.0718
Verification data set	0.4175	0.0953

TABLE 3

As Table 3 shows, the pan-cancer pre-training model already has some predictive ability before fine-tuning, which indicates that a pre-training model based on a two-dimensional cell experiment system is to some degree consistent with results from the organoid platform, although the error is still large. After fine-tuning with organoid data, the target prediction model is clearly adapted to the domain of organoid and organ-chip data and predicts experimental results on organ-like structures more accurately.

S63: The consistency of the prediction effectiveness ranking and the experimental effectiveness ranking is calculated by Spearman's rank correlation method, expressed as:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}$$

where ρ (rho) represents the consistency coefficient, d_i represents the difference between the ranks of the i-th item in the prediction effectiveness ranking and in the experimental effectiveness ranking, and n represents the total number of data points;

S64: When the consistency coefficient ρ (rho) is greater than 0, a statistical t-test is performed on it to obtain a test result, and it is judged whether the test result is below the significance threshold; if so, the order in the prediction effectiveness ranking corresponding to the consistency coefficient ρ (rho) is consistent with the order in the experimental effectiveness ranking. The significance threshold in the embodiment of the present application is 0.05.
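Steps S63 and S64 can be sketched in plain Python as follows. The sketch assumes rankings without ties and uses the usual t-statistic for a rank correlation, t = ρ·sqrt((n − 2)/(1 − ρ²)) with n − 2 degrees of freedom, with the two-sided p-value obtained by numerically integrating the Student's t density; the text does not spell out these details, and the example values and function names are hypothetical.

```python
import math

def ranks(values):
    """Rank 1 = smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(predicted, measured):
    """Spearman's rho from rank differences d_i (no-ties formula)."""
    n = len(predicted)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(predicted), ranks(measured)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def t_test_p_value(rho, n, steps=2000):
    """Two-sided p-value for rho under Student's t with n - 2 degrees of freedom."""
    if abs(rho) >= 1.0:
        return 0.0
    df = n - 2
    t = abs(rho) * math.sqrt(df / (1 - rho * rho))
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: norm * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps                              # Simpson's rule on [0, t]
    area = pdf(0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1 - 2 * area)

# Hypothetical predicted and measured EC50 values for six medication schemes.
predicted = [0.2, 1.5, 0.9, 3.0, 2.2, 0.5]
measured = [0.3, 1.1, 1.4, 2.5, 2.9, 0.4]

rho = spearman_rho(predicted, measured)
p = t_test_p_value(rho, len(predicted))
consistent = rho > 0 and p < 0.05              # significance threshold from the text
```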

In particular, Spearman's rank correlation does not assume that the data are linearly related and measures only the relationship between rank orders. It is therefore better suited to judging consistency between the target prediction model's outputs and the experimentally measured values, which may differ in scale and distribution, and it effectively reflects within-sample differences between medication schemes.

A performance test is carried out on the target prediction model using the verification set held out during fine-tuning, comparing the effectiveness ranking of medication schemes given by the model's predicted values with that actually measured in the wet experiments. Table 4 shows the Spearman correlation coefficients and the significance of their statistical t-tests for the 7 samples in the verification set:

Sample	Spearman's rho	P-value
TP_Validation_01_0892	0.41008916	0.046557884
TP_Validation_02_0084	0.593043478	0.002256206
TP_Validation_03_0138	0.368695652	0.076249688
TP_Validation_04_0756	0.474016102	0.019281635
TP_Validation_05_0100	0.31826087	0.129608777
TP_Validation_06_0262	0.505217391	0.011795445
TP_Validation_07_0619	0.406956522	0.048424943

TABLE 4

In Table 4, every sample in the verification set has Spearman's rho > 0, meaning that the model predictions correlate positively with the experimentally measured values. After the statistical t-test, 5 of the 7 samples fall below the significance threshold of 0.05 and are statistically significant, and the remaining two samples also have small P-values. Further, comparing the top-ranked medication schemes in the predicted and measured effectiveness rankings, the 6 samples in the verification data set other than TP_Validation_05_0100 have an identical first-ranked medication scheme, and 4 samples have identical top three medication schemes. This means that the target prediction model has sufficient value for guiding the choice of the most effective medication.

In the prediction and inference stage, a cryopreserved sample that has not undergone mutation detection is sequenced to obtain mutation data, and the inference results obtained from the target prediction model are then evaluated against the actual organ-chip wet experiment results.

Step S7 is the application stage of the target prediction model. For a sample that requires prediction of a tumor-killing medication scheme, the tissue is washed and given preliminary quality control, and the DNA of the sample is then extracted with a DNA extraction kit suited to the tissue type. The sample DNA is quality-controlled using methods such as a DNA concentration meter, a purity detector and agarose gel electrophoresis. The DNA is fragmented to about 300 bp, followed by end repair and adapter ligation. PCR amplification and the necessary purification steps are performed, and sequencing is completed on a compatible instrument of an NGS platform. The raw FASTQ data produced by sequencing are quality-controlled with a tool such as fastp and aligned to a reference genome with a tool such as bwa. Mutation analysis is performed on the alignment results with a mutation detection tool, and finally an annotation tool is used to annotate the detected mutations at the gene level with reference to a mutation database such as COSMIC. The mutation information of the sample is organized and summarized, low-frequency mutations are filtered out according to a preset threshold on in-sample mutation frequency, and the queue of mutated genes is finally recorded.
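The final filtering step, removing low-frequency mutations and recording the mutated gene queue, might look like the sketch below. The threshold value of 0.05, the record layout with `gene` and `vaf` (variant allele frequency) fields and the example calls are all assumptions for illustration; the text only states that a preset mutation frequency threshold is applied.

```python
def filter_mutations(calls, min_vaf=0.05):
    """Keep genes with at least one mutation at or above the frequency threshold.

    `min_vaf` is a hypothetical threshold; the text does not give a value.
    Genes are returned in first-seen order without duplicates.
    """
    genes = []
    for call in calls:
        if call["vaf"] >= min_vaf and call["gene"] not in genes:
            genes.append(call["gene"])
    return genes

# Hypothetical annotated mutation calls for one sample.
calls = [
    {"gene": "EGFR", "vaf": 0.31},
    {"gene": "KRAS", "vaf": 0.01},   # below threshold, filtered out
    {"gene": "TP53", "vaf": 0.22},
    {"gene": "EGFR", "vaf": 0.08},   # same gene again, not duplicated
]

mutated_gene_queue = filter_mutations(calls)
print(mutated_gene_queue)  # ['EGFR', 'TP53']
```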

Specifically, the medication schemes corresponding to the sample are organized, and a queue containing multiple candidate medication schemes is output. The medication schemes and the codes of their compounds are selected from the queue in turn and input into the target prediction model together with the mutation information code derived from the sample's mutated gene queue. Model inference yields the tumor inhibition effect of each medication scheme, and the medication schemes are ranked by inhibition effectiveness.
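The inference loop described above can be sketched as follows. The function `toy_predict` is only a stand-in for the target prediction model so that the example runs; the codes, the scheme names and the scoring rule are invented for illustration. The sketch assumes that a lower predicted value means stronger tumor inhibition.

```python
def rank_regimens(mutation_code, regimen_codes, predict):
    """Score every candidate scheme and sort from strongest to weakest inhibition.

    `predict` is any callable taking (mutation_code, compound_code) and
    returning a predicted EC50-like value, where lower means more effective.
    """
    scored = [(name, predict(mutation_code, code)) for name, code in regimen_codes.items()]
    return sorted(scored, key=lambda item: item[1])

def toy_predict(mutation_code, compound_code):
    """Hypothetical stand-in for the trained model: more overlap, lower value."""
    overlap = sum(m * c for m, c in zip(mutation_code, compound_code))
    return 3.0 - overlap

# Hypothetical binary codes for one sample and three candidate schemes.
mutation_code = [1, 0, 1]
regimen_codes = {
    "regimen_A": [1, 0, 0],
    "regimen_B": [0, 1, 0],
    "regimen_C": [1, 0, 1],
}

ranking = rank_regimens(mutation_code, regimen_codes, toy_predict)
print([name for name, _ in ranking])  # ['regimen_C', 'regimen_A', 'regimen_B']
```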

In the embodiment of the application, the target prediction model obtained by fine-tuning on the wet experiment data set built for non-small cell lung cancer is used. The patient's mutation information is input into the model for encoding, and the compound queue of candidate medication schemes is input into the model to infer the tumor inhibition effect of each compound. The results output by the model are ranked, and recommendations on the medication scheme are derived from the ranking.

Further, to verify the reliability of the prediction for the sample, the consistency of the results is tested. The line chart in FIG. 4 shows the drug sensitivity results of the sample, comparing the model-predicted and experimentally measured tumor inhibition of the compounds for sample tp_three_01_0002. The ordinate represents tumor inhibition, where a lower value means stronger tumor inhibition; the abscissa represents the different medication schemes. Of the two lines, the dashed line shows the experimentally measured values, with the compounds on the abscissa arranged in descending order of the measured values, and the solid line shows the model's predicted values.

Further, the three medication schemes with the strongest inhibitory effect on the non-small cell lung cancer sample are checked; the top three in the experimentally measured ranking and in the target prediction model's predicted ranking are identical, namely: (1) Entrectinib, (2) Larotrectinib, (3) Paclitaxel.

The foregoing is an exemplary embodiment of the present application, the scope of which is defined by the claims and their equivalents.

Claims

1. The deep learning prediction method of the tumor drug administration scheme based on gene detection is characterized by comprising the following steps of:

S2: constructing a pre-training mutation drug sensitive data set based on a two-dimensional drug sensitivity database, and training a pre-training model with the pre-training mutation drug sensitive data set to obtain a pan-cancer pre-training model;

S3: screening samples of the cancer type of the target prediction model from an organoid sample library and resuscitating the corresponding organoids;

S4: designing multiple medication schemes according to the cancer-type samples;

S7: predicting the rationality of a tumor medication scheme through the target prediction model;

wherein, the step S2 includes:

wherein the first coding module and the second coding module each use a cross-attention mechanism based on the Transformer architecture, the cross-attention mechanism correlating the hidden-layer features of the mutation information codes with those of the compound structure codes, weighting the compound structure codes by the mutation information codes while weighting the mutation information codes by the compound structure codes;

the first coding module is an inverted KO module, and the inverted KO module performs mapping conversion on the mutation map of each sample in the pre-training mutation drug sensitive data set in a One-Hot coding mode to obtain the mutation information codes; the second coding module is a Morgan fingerprint coding module, and the Morgan fingerprint coding module codes all intervention compounds involved in the pre-training mutation drug sensitive data set according to the corresponding compound structures to obtain the compound structure codes.

2. The deep learning prediction method of claim 1, wherein in step S4, at least 12 of the dosage regimens are used, each of the dosage regimens producing a concentration gradient of at least 5 different concentration levels, the concentration gradient being based on [0,0.016,0.08,0.4,2,10.0] micromolar for the non-specific drug.

3. The deep learning prediction method of claim 1, wherein step S5 includes:

4. The deep learning prediction method of claim 1, wherein in step S2, the pre-training mutation drug sensitive data set is divided into a training set, a test set and a verification set according to the ratio of [0.7, 0.2, 0.1], and the training set trains the pre-training model to obtain a first pre-training model; the test set evaluates the performance of the first pre-training model, and the hyperparameters of the first pre-training model are iteratively adjusted according to the performance to obtain a second pre-training model; and the verification set verifies the performance level of the second pre-training model, and if the verification result reaches the preset standard, the second pre-training model is the pan-cancer pre-training model; otherwise, the second pre-training model continues to be trained until the pan-cancer pre-training model is obtained.

5. The deep learning prediction method according to claim 1, wherein in step S6, the wet experiment data set is divided into a training set and a verification set according to a ratio not lower than [0.9, 0.1], and the training set performs transfer learning on the pan-cancer pre-training model to obtain a first target prediction model; and the verification set verifies the performance level of the first target prediction model, and if the verification result reaches a preset standard, the first target prediction model is the target prediction model; otherwise, the first target prediction model continues to be trained until the target prediction model is obtained.

6. The deep learning prediction method of claim 5, wherein the validation set validates a performance level of the first target prediction model, comprising: