CN116894191A - Algorithm model training and matching method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN116894191A
Authority
CN
China
Prior art keywords
sample
training
algorithm model
vector
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311161425.6A
Other languages
Chinese (zh)
Other versions
CN116894191B (en)
Inventor
王聃
许春萍
刘清君
杨帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yicon Beijing Medical Science And Technology Co ltd
Original Assignee
Yicon Beijing Medical Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yicon Beijing Medical Science And Technology Co ltd filed Critical Yicon Beijing Medical Science And Technology Co ltd
Priority to CN202311161425.6A priority Critical patent/CN116894191B/en
Publication of CN116894191A publication Critical patent/CN116894191A/en
Application granted granted Critical
Publication of CN116894191B publication Critical patent/CN116894191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00ICT specially adapted for the handling or processing of medical references
    • G16H70/20ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00ICT specially adapted for the handling or processing of medical references
    • G16H70/40ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Primary Health Care (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Toxicology (AREA)
  • Medicinal Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides an algorithm model training and matching method, an apparatus, an electronic device and a medium, relating to the technical field of data processing. The method comprises the following steps: creating a pre-training data set comprising a number of clinical samples and a migration training data set comprising a number of preclinical samples; pre-training a matching algorithm model by using the pre-training data set and a loss function to obtain a pre-trained matching algorithm model; and performing migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migration-trained matching algorithm model, wherein the migration-trained matching algorithm model is used for matching recommendation between clinical samples to be matched and preclinical samples to be matched.

Description

Algorithm model training and matching method and device, electronic equipment and medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an algorithm model training and matching method, an algorithm model matching device, electronic equipment and a medium.
Background
A preclinical model is an experimental model used for drug efficacy, toxicity and pharmacokinetics studies before a drug enters a clinical trial, and is a key element in the drug development process. Preclinical models are used to simulate the biological or physiological characteristics of cells or tissues in patients with clinical disease: for example, tumor cell lines simulate the growth and proliferation characteristics of human tumor cells, patient-derived xenograft (PDX) models simulate the biological characteristics of human tumor cells and tumor tissue, and immune-system-humanized PDX models (e.g., PBMC-PDX models) simulate the interactions and effects between the human immune system and tumor tissue.
As the stage preceding a clinical trial, the preclinical stage requires the preclinical model employed to match the clinical condition and biological characteristics of the subsequent clinical trial subjects as accurately and effectively as possible. However, differences exist between current preclinical models and actual clinical patients. How to overcome these differences, achieve accurate matching between preclinical models and clinical patients, and bridge the preclinical and clinical trial stages is therefore an important problem to be solved urgently.
In the prior art, features are selected manually and preclinical models are matched to clinical patients manually. This approach is inefficient, carries a degree of subjective bias, and cannot achieve accurate and effective matching between preclinical models and clinical patients.
Disclosure of Invention
The present invention aims to solve the above-mentioned technical problems in the related art by providing an algorithm model training method, a sample matching method, corresponding devices, an electronic device and a medium.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the invention is as follows:
in a first aspect, an embodiment of the present invention provides an algorithm model training method, including:
creating a pre-training data set comprising a number of clinical samples and a migration training data set comprising a number of preclinical samples;
each clinical sample in the number of clinical samples comprises a feature vector of the clinical sample and a label vector of the clinical sample, and each preclinical sample in the number of preclinical samples comprises a feature vector of the preclinical sample and a label vector of the preclinical sample, wherein the feature vector is represented as a plurality of preset feature sets, the data of the feature vector is omics data, and the label vector is represented as a plurality of preset category labels; the clinical samples and the preclinical samples serve as input samples, and the feature vectors and label vectors are used in a matching algorithm model as the feature vector of the input sample and the label vector of the input sample;
pre-training the matching algorithm model by using the pre-training data set and a loss function to obtain a pre-trained matching algorithm model;
performing migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migration-trained matching algorithm model, wherein the migration-trained matching algorithm model is used for matching recommendation between a clinical sample to be matched and a preclinical sample to be matched;
The matching algorithm model comprises an encoding module, a fusion module, a decoding module and a discrimination module. The encoding module encodes the feature vector of an input sample to obtain an encoded feature vector of the input sample; the fusion module fuses the encoded feature vectors to obtain a fused feature vector of the input sample; the discrimination module obtains a predicted label vector of the input sample from the fused feature vector and compares it against the label vector of the input sample; the decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample.
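The data flow through these four modules (encoding, fusion, decoding, discrimination) can be traced with a minimal sketch. Everything below is a purely illustrative assumption made for exposition — the linear maps, random fixed weights, dimensions and class names are not the patented implementation; each preset feature set gets its own encoder, the encoded vectors are fused, and the fused vector is both decoded back toward the inputs and passed to the discriminator.

```python
import random

random.seed(0)

def linear(x, w):
    """Apply a weight matrix w (one row per output dimension) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

class MatchingModel:
    """Toy encoder -> fusion -> (decoder, discriminator) pipeline."""
    def __init__(self, in_dims, code_dim, fused_dim, n_labels):
        # one encoder per preset feature set
        self.encoders = [rand_matrix(code_dim, d) for d in in_dims]
        self.fuser = rand_matrix(fused_dim, code_dim * len(in_dims))
        self.decoders = [rand_matrix(d, fused_dim) for d in in_dims]
        self.discriminator = rand_matrix(n_labels, fused_dim)

    def forward(self, feature_sets):
        encoded = [linear(x, w) for x, w in zip(feature_sets, self.encoders)]
        concat = [v for code in encoded for v in code]
        fused = linear(concat, self.fuser)           # fused feature vector
        decoded = [linear(fused, w) for w in self.decoders]
        logits = linear(fused, self.discriminator)   # predicted label vector
        return fused, decoded, logits

# two preset feature sets of dims 6 and 4, fused into a 2-d vector, 5 labels
model = MatchingModel(in_dims=[6, 4], code_dim=3, fused_dim=2, n_labels=5)
fused, decoded, logits = model.forward([[0.1] * 6, [0.2] * 4])
print(len(fused), [len(d) for d in decoded], len(logits))  # → 2 [6, 4] 5
```

In practice each module would be a trainable neural network; here the weights are fixed random matrices only to show how the shapes of the encoded, fused, decoded and predicted vectors relate.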
Optionally, the loss function is obtained by combining one or more of a plurality of sub-loss functions; the plurality of sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch effect correction sub-loss function and a data total amount correction sub-loss function;
the reconstruction sub-loss function is used for correcting the difference between the feature vector of the input sample and the decoded feature vector;
the discrimination sub-loss function is used for correcting the difference between the predicted label vector of the input sample and the label vector of the input sample;
the batch effect correction sub-loss function is used for correcting batch effects among the fused feature vectors;
the data total amount correction sub-loss function is used for correcting the difference between the total amount of data of the decoded feature vector and the total amount of data of the feature vector of the input sample.
Optionally, when the loss function is obtained by combining the plurality of sub-loss functions, the weight of each sub-loss function is dynamically adjusted through an attention mechanism layer and weight regularization is applied, yielding the combination of the plurality of sub-loss functions.
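As a rough illustration of how such a combined loss might look, the sketch below computes four toy sub-losses (reconstruction, discrimination, batch-effect correction, data-total correction) and combines them with softmax ("attention-style") dynamic weights plus a regularization term on the weight logits. Every concrete formula here is an assumption for exposition, not the patent's definition:

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(probs, label_idx):
    return -math.log(max(probs[label_idx], 1e-12))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def combined_loss(x, decoded, probs, label_idx, fused_batches, weight_logits, reg=1e-3):
    l_recon = mse(x, decoded)                      # reconstruction sub-loss
    l_disc = cross_entropy(probs, label_idx)       # discrimination sub-loss
    # batch-effect term: distance between per-batch centroids of fused vectors
    centroids = [[sum(col) / len(col) for col in zip(*batch)] for batch in fused_batches]
    l_batch = mse(centroids[0], centroids[1])
    # data-total term: difference between total decoded and total input signal
    l_total = abs(sum(decoded) - sum(x))
    weights = softmax(weight_logits)               # attention-style dynamic weights
    l2 = reg * sum(w * w for w in weight_logits)   # weight regularization
    subs = [l_recon, l_disc, l_batch, l_total]
    return sum(w * l for w, l in zip(weights, subs)) + l2

loss = combined_loss(
    x=[1.0, 0.0], decoded=[0.9, 0.1],
    probs=[0.7, 0.3], label_idx=0,
    fused_batches=[[[0.1, 0.2]], [[0.2, 0.1]]],
    weight_logits=[0.0, 0.0, 0.0, 0.0],  # equal weights before any adjustment
)
print(round(loss, 4))  # → 0.0942
```

With zero logits the four sub-losses are weighted equally; during training the logits (and hence the weights) would be adjusted dynamically together with the model parameters.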
Optionally, the migration training is as follows: the parameters of the encoding module included in the pre-trained matching algorithm model are first frozen, and the migration training data set and the loss function are then used to fine-tune the parameters of the fusion module, the discrimination module and the decoding module included in the pre-trained matching algorithm model.
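This freeze-then-fine-tune procedure can be sketched as follows. The parameter names and the toy gradient step are illustrative assumptions; the point is only that encoder parameters are excluded from updates during migration training while the remaining modules are fine-tuned:

```python
# each parameter carries a 'trainable' flag; migration training freezes the encoder
params = {
    "encoder.w":       {"value": [0.5, -0.2], "trainable": True},
    "fusion.w":        {"value": [0.1, 0.3],  "trainable": True},
    "decoder.w":       {"value": [0.7, 0.0],  "trainable": True},
    "discriminator.w": {"value": [0.2, 0.4],  "trainable": True},
}

def freeze(params, prefix):
    """Mark all parameters whose name starts with `prefix` as non-trainable."""
    for name, p in params.items():
        if name.startswith(prefix):
            p["trainable"] = False

def sgd_step(params, grads, lr=0.1):
    """Update only trainable parameters (i.e., fine-tune the unfrozen modules)."""
    for name, p in params.items():
        if p["trainable"]:
            p["value"] = [v - lr * g for v, g in zip(p["value"], grads[name])]

freeze(params, "encoder.")             # step 1: freeze the pre-trained encoder
grads = {name: [1.0, 1.0] for name in params}
sgd_step(params, grads)                # step 2: fine-tune the remaining modules
print(params["encoder.w"]["value"])    # → [0.5, -0.2] (frozen, unchanged)
```

In a deep-learning framework the same effect is typically achieved by disabling gradient tracking on the encoder's parameters before running the fine-tuning loop.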
Optionally, the clinical sample is a tumor sample of a clinical patient, and the preclinical sample is a sample of a preclinical in vitro or in vivo tumor model;
the preclinical sample comprises any one or more of the following: an in vitro tumor cell line model, an in vivo PDX model, an in vivo CDX model, a humanized animal model sample, and an in vitro organoid model.
Optionally, the plurality of preset feature sets includes a coding gene feature set and a tumor gene feature set; the plurality of preset category labels comprise a tumor type label, the label vector comprises a tumor type label, and the predicted label vector comprises a predicted tumor type label; the omics data is transcriptomic data; the encoding module in the matching algorithm model comprises a coding gene feature set encoding submodule and a tumor gene feature set encoding submodule;
the pre-training of the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model comprises:
encoding, by the coding gene feature set encoding submodule in the matching algorithm model, the coding gene feature set data in the feature vector of the clinical sample to obtain a pre-training coding gene feature set encoded feature of the clinical sample;
encoding, by the tumor gene feature set encoding submodule in the matching algorithm model, the tumor gene feature set data in the feature vector of the clinical sample to obtain a pre-training tumor gene feature set encoded feature of the clinical sample;
fusing, by the fusion module in the matching algorithm model, the pre-training coding gene feature set encoded feature and the pre-training tumor gene feature set encoded feature to obtain a pre-training fused feature of the clinical sample;
decoding, by the decoding module in the matching algorithm model, the pre-training fused feature to obtain a pre-training coding gene feature set decoded feature of the clinical sample and a pre-training tumor gene decoded feature of the clinical sample;
and obtaining, by the discrimination module in the matching algorithm model, a predicted tumor type label in the predicted label vector of the clinical sample from the pre-training fused feature, and comparing the predicted tumor type label against the tumor type label in the label vector of the clinical sample.
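The enumerated steps above can be traced end to end with stub sub-modules. All functions and values below are illustrative placeholders, not the patent's actual encoders or discriminator:

```python
# Stub sub-modules: each encoder maps a feature-set vector to a tiny code.
def encode_coding_genes(x):   return [sum(x) / len(x)]
def encode_tumor_genes(x):    return [max(x)]
def fuse(codes):              return [v for c in codes for v in c]
def decode(fused, out_dim):   return [sum(fused) / out_dim] * out_dim
def predict_tumor_type(fused, n_types=3):
    # toy discriminator: bucket the fused signal into one of n_types labels
    return min(n_types - 1, max(0, int(sum(fused) * n_types) % n_types))

# one clinical sample: a feature vector made of two preset feature sets
sample = {"coding_genes": [0.2, 0.4, 0.6], "tumor_genes": [0.1, 0.9]}

codes = [encode_coding_genes(sample["coding_genes"]),   # step 1
         encode_tumor_genes(sample["tumor_genes"])]     # step 2
fused = fuse(codes)                                     # step 3
decoded = {k: decode(fused, len(v)) for k, v in sample.items()}  # step 4
predicted = predict_tumor_type(fused)                   # step 5
assert len(decoded["coding_genes"]) == 3  # decoded back to input dimensionality
```

During pre-training, the predicted tumor type label would be compared against the sample's true tumor type label and the difference fed into the discrimination sub-loss.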
Optionally, the plurality of preset feature sets further includes an immune gene feature set, the plurality of preset category labels further comprise an immune type label, the label vector further comprises an immune type label, and the predicted label vector further comprises a predicted immune type label; the encoding module in the matching algorithm model further comprises an immune gene feature set encoding submodule;
the pre-training of the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model further comprises:
encoding, by the immune gene feature set encoding submodule in the matching algorithm model, the immune gene feature set data in the feature vector of the clinical sample to obtain a pre-training immune gene feature set encoded feature;
fusing, by the fusion module in the matching algorithm model, the pre-training coding gene feature set encoded feature, the pre-training tumor gene feature set encoded feature and the pre-training immune gene feature set encoded feature to obtain the pre-training fused feature of the clinical sample;
decoding, by the decoding module in the matching algorithm model, the pre-training fused feature to obtain the pre-training coding gene feature set decoded feature, the pre-training tumor gene decoded feature and a pre-training immune gene decoded feature;
and obtaining, by the discrimination module in the matching algorithm model, a predicted immune type label in the predicted label vector of the clinical sample from the pre-training fused feature, and comparing the predicted immune type label against the immune type label in the label vector of the clinical sample.
In a second aspect, an embodiment of the present invention further provides a sample matching method, including:
inputting the feature vector of an input sample into a matching algorithm model to obtain a fused feature vector of a clinical sample and a fused feature vector of a preclinical sample, wherein the feature vector of the input sample comprises the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched, and the matching algorithm model is obtained by the algorithm model training method according to any one of the first aspect;
and matching the fused feature vector of the clinical sample against the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample.
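A common choice for such a similarity is cosine similarity between the fused feature vectors; the sketch below ranks candidate preclinical samples for one clinical sample this way. The vectors, sample names, and the use of cosine similarity itself are assumptions for illustration — the text above only specifies that the matching result comprises a similarity:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# fused feature vectors (made-up values) as produced by the trained model
clinical = {"patient_1": [0.9, 0.1, 0.3]}
preclinical = {"pdx_a": [0.8, 0.2, 0.4], "cell_line_b": [-0.5, 0.9, 0.1]}

for pid, cvec in clinical.items():
    # rank preclinical samples by similarity to this clinical sample
    ranked = sorted(preclinical.items(),
                    key=lambda kv: cosine_similarity(cvec, kv[1]),
                    reverse=True)
    best, _ = ranked[0]
    print(pid, "->", best)  # → patient_1 -> pdx_a
```

The top-ranked preclinical sample (here the PDX model, whose fused vector points in nearly the same direction as the patient's) would be returned as the matching recommendation.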
In a third aspect, an embodiment of the present invention further provides an algorithm model training apparatus, including:
the system comprises a creation module, a pre-training data set and a migration training data set, wherein the pre-training data set comprises a plurality of clinical samples, and the migration training data set comprises a plurality of preclinical samples; each clinical sample in each clinical sample comprises a feature vector of the clinical sample and a label vector of the clinical sample, each preclinical sample in each preclinical sample comprises a feature vector of the preclinical sample and a label vector of the preclinical sample, wherein the feature vector is expressed as a plurality of preset feature sets, data of the feature vector is histologic data, and the label vector is expressed as a plurality of preset category labels; the clinical sample and the preclinical sample are used as input samples, and the feature vector and the label vector are used in a matching algorithm model as the feature vector of the input samples and the label vector of the input samples;
a training module, configured to pre-train the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model, and to perform migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migration-trained matching algorithm model, wherein the migration-trained matching algorithm model is used for matching recommendation between a clinical sample to be matched and a preclinical sample to be matched;
the matching algorithm model comprises an encoding module, a fusion module, a decoding module and a judging module; the encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample; the fusion module fuses the coded feature vectors to obtain fused feature vectors of the input samples; the judging module obtains a predicted tag vector of the input sample according to the fused feature vector and compares and judges the predicted tag vector with the tag vector of the input sample; the decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample;
The loss function is obtained by combining one or a plurality of sub-loss functions; the plurality of sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch effect syndrome loss function and a data total syndrome loss function;
the reconstruction sub-loss function is used for correcting the difference between the characteristic vector of the input sample and the decoded characteristic vector;
the discriminant sub-loss function is used for correcting the difference between the predicted label vector of the input sample and the label vector of the input sample;
the batch effect syndrome loss function is used for correcting batch effects among the fused feature vectors;
the data total amount syndrome loss function for correcting a difference between a total amount of data of the decoded feature vector and a total amount of data of the feature vector of the input sample;
the migration training is as follows: and firstly freezing parameters of the coding module included in the pre-trained matching algorithm model, and further using the migration training data set and the loss function to finely adjust parameters of the fusion module, parameters of the judging module and parameters of the decoding module included in the pre-trained matching algorithm model.
In a fourth aspect, an embodiment of the present invention further provides a sample matching apparatus, including:
an extraction module, configured to input the feature vector of an input sample into a matching algorithm model to obtain a fused feature vector of the clinical sample and a fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched, and the matching algorithm model is obtained by the algorithm model training method according to any one of the first aspect;
and a matching module, configured to match the fused feature vector of the clinical sample against the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program executable by the processor, and the processor, when executing the computer program, implements the algorithm model training method according to any one of the first aspect or the sample matching method according to the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a computer readable medium storing a computer program, where the computer program, when executed, implements the algorithm model training method according to any one of the first aspect or the sample matching method according to the second aspect.
The beneficial effects of the invention are as follows:
(1) The invention creates a pre-training data set comprising a number of clinical samples and a migration training data set comprising a number of preclinical samples; each clinical sample comprises a feature vector and a label vector of the clinical sample, and each preclinical sample comprises a feature vector and a label vector of the preclinical sample, wherein the feature vector is represented as a plurality of preset feature sets, the data of the feature vector is omics data, and the label vector is represented as a plurality of preset category labels. By pre-training on clinical samples and then performing migration training on preclinical samples, the method effectively addresses the imbalance in data scale between clinical and preclinical samples, and the matching algorithm model learns both the general feature information in clinical samples and the specific feature information in preclinical samples, thereby achieving accurate matching between clinical samples and preclinical samples;
(2) The feature vectors of input samples are represented by a plurality of preset feature sets, so that the patterns of, and differences among, input samples are characterized comprehensively and effectively at both the global level and the local level of each individual preset feature set, improving the feature representation capability for input samples;
(3) The matching algorithm model comprises an encoding module, a fusion module, a decoding module and a discrimination module. The encoding module efficiently compresses the high-dimensional feature vector of an input sample into a low-dimensional representation; the decoding module ensures the data consistency and accuracy of this compression; the fusion module efficiently fuses the plurality of preset feature sets of the input sample; and the discrimination module effectively retains the information of the plurality of preset category labels. By combining unsupervised and supervised learning in this way, information fusion and dimensionality reduction of the feature vector of the input sample are achieved while reducing information loss;
(4) The dynamically weighted combination of the plurality of sub-loss functions effectively fuses the features of the plurality of preset feature sets so as to better represent input samples: the discrimination sub-loss function effectively retains the preset category label information of input samples; the reconstruction sub-loss function guarantees consistency between the low-dimensional encoded feature vector and the high-dimensional decoded feature vector, further improving the accuracy of the fused feature vector; the batch effect correction sub-loss function and the data total amount correction sub-loss function eliminate deviations caused by batch effects and differences in data totals among input samples; and the dynamically weighted combination of sub-loss functions can be automatically optimized toward the best overall performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting the scope; a person of ordinary skill in the art may obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of an algorithm model training method provided by the invention;
FIG. 2 is a schematic diagram of a module structure of a matching algorithm model provided by the invention;
fig. 3 is a schematic diagram of a module structure of a matching algorithm model according to embodiment 1 of the present invention;
FIG. 4 is a graph schematically illustrating a change of a loss function of a test data set with the number of pre-training iterations in a pre-training process of a matching algorithm model according to embodiment 1 of the present invention;
FIG. 5 is a graph schematically illustrating the variation of the decision coefficients of the test data set with the number of pre-training iterations in the pre-training process of the matching algorithm model according to embodiment 1 of the present invention;
fig. 6 is a schematic block diagram of a matching algorithm model according to embodiment 2 of the present invention;
FIG. 7 is a graph illustrating the change of the loss function of the test data set with the number of pre-training iterations in the pre-training process of another matching algorithm model according to embodiment 2 of the present application;
FIG. 8 is a schematic diagram of the curve of the decision coefficient of the test data set versus the number of pre-training iterations in the pre-training process of the matching algorithm model according to embodiment 2 of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the description of the present application, it should be noted that terms such as "upper" and "lower", if indicating an orientation or positional relationship, are based on the orientation or positional relationship shown in the drawings or on that in which the product of the application is conventionally used. They are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be configured and operated in a specific orientation, and thus should not be construed as limiting the present application.
Furthermore, the terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
To address the problems in the related art that manual matching is subjective and that the characteristics of clinical samples and preclinical samples cannot be accurately determined, so that clinical samples and preclinical samples cannot be accurately matched, an embodiment of the present application provides an algorithm model training method.
The embodiment of the application provides an algorithm model training method, which is applied to electronic equipment, wherein the electronic equipment can be terminal equipment or a server, and the terminal equipment can be any one of the following: desktop computers, notebook computers, tablet computers, smart phones, etc.
The algorithm model training method provided by the application is explained below.
FIG. 1 is a schematic flow chart of an algorithm model training method provided by the application, which comprises the following steps:
S101: Create a pre-training dataset comprising a number of clinical samples and a migration training dataset comprising a number of preclinical samples.
Each clinical sample in the pre-training dataset comprises a feature vector of the clinical sample and a label vector of the clinical sample, and each preclinical sample in the migration training dataset comprises a feature vector of the preclinical sample and a label vector of the preclinical sample. The feature vector is represented by a number of preset feature sets, the data of the feature vector are omics data, and the label vector is represented by a number of preset category labels. The clinical samples and preclinical samples are used as input samples, and the feature vectors and the label vectors are used in the matching algorithm model as the feature vectors and the label vectors of the input samples.
It should be noted that there may be a plurality of clinical samples and preclinical samples; the clinical samples may be derived from samples of clinical patients with different diseases, and the preclinical samples may be derived from samples of preclinical in vivo or in vitro models. Of course, the input samples and the matching algorithm model may be selected according to actual requirements, which is not particularly limited in the embodiment of the present application.
The omics data include, but are not limited to, transcriptome, proteome, and methylation data. Transcriptome data quantitatively reflect gene expression at the transcriptional level and can be obtained by any of RNA-seq (RNA sequencing), microarray chips, and the like. Proteome data quantitatively reflect protein expression at the protein level and can be obtained by any proteome sequencing technology such as mass spectrometry or antibody chips. The data can optionally be processed by preprocessing methods such as missing-value imputation, outlier removal, logarithmic transformation, standardization, and normalization.
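As an illustration of these optional preprocessing steps, the following sketch (the function and parameter names are hypothetical, not from the application) applies missing-value imputation, a log2 pseudo-count transform, and per-feature standardization to a small expression matrix:

```python
import numpy as np

def preprocess_expression(tpm, impute_value=0.0):
    """Illustrative preprocessing: impute missing values, apply
    log2(x + 1), then z-score standardize each feature (gene)."""
    x = np.asarray(tpm, dtype=float)
    x = np.where(np.isnan(x), impute_value, x)   # missing-value imputation
    x = np.log2(x + 1.0)                         # pseudo-count of 1, then log2
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0                            # guard against constant features
    return (x - mu) / sd                         # per-gene standardization

# toy matrix: 3 samples x 2 genes, with one missing value
mat = np.array([[10.0, 0.0], [np.nan, 3.0], [100.0, 7.0]])
z = preprocess_expression(mat)
print(z.shape)  # (3, 2)
```

After standardization each gene column has mean 0 and unit standard deviation, which puts genes with very different expression scales on a comparable footing.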
S102: pre-training the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model;
S103: and performing migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migration trained matching algorithm model.
The migration training matching algorithm model is used for matching recommendation between the clinical samples to be matched and the preclinical samples to be matched;
it should be noted that, the pre-training process and the migration training process of the matching algorithm model are similar except that the pre-training data and the migration training data are different, and the updated model parameter ranges in the training process are different.
Fig. 2 is a schematic block diagram of a matching algorithm model provided by the present application. As shown in fig. 2, the matching algorithm model includes an encoding module 101, a fusion module 102, a decoding module 103, and a discriminating module 104. The encoding module 101 encodes the feature vector of an input sample to obtain the encoded feature vector of the input sample. The fusion module 102 fuses the encoded feature vectors to obtain the fused feature vector of the input sample. The discriminating module 104 obtains a predicted label vector of the input sample from the fused feature vector and compares it with the label vector of the input sample for discrimination. The decoding module 103 decodes the fused feature vector to obtain the decoded feature vector of the input sample. The input samples include preclinical samples and clinical samples.
The loss function of the matching algorithm model is obtained by using one or a combination of several sub-loss functions. The sub-loss functions comprise a reconstruction sub-loss function, a discriminant sub-loss function, a batch-effect correction sub-loss function, and a total-data-amount correction sub-loss function. The reconstruction sub-loss function is used to correct the difference between the feature vector of the input sample and the decoded feature vector. The discriminant sub-loss function is used to correct the difference between the predicted label vector of the input sample and the label vector of the input sample. The batch-effect correction sub-loss function is used to correct batch effects between the fused feature vectors of the input samples. The total-data-amount correction sub-loss function is used to correct the difference between the total amount of data of the decoded feature vector of the input sample and the total amount of data of the feature vector of the input sample.
The application includes specific embodiment 1 and specific embodiment 2, which are used to solve the problem of effectively matching preclinical samples and clinical samples in the development of new tumor treatment drugs, thereby effectively bridging the preclinical and clinical stages of new drug development. Specific embodiment 1 focuses on tumor drugs that target tumor cells directly, while specific embodiment 2 focuses on tumor immune drugs that treat tumors by leveraging the immune system. The clinical samples are tumor samples of clinical patients, and the preclinical samples are samples of preclinical tumor in vitro or in vivo models. The preclinical samples include tumor cell line in vitro models, PDX in vivo models, CDX in vivo models, immune humanized animal model samples, and organoid in vitro models. Preclinical in vivo or in vitro models fall into two main categories: preclinical models that contain the human immune system and preclinical models that do not.
Preclinical models that do not include the human immune system include tumor cell line in vitro models, PDX in vivo models, CDX in vivo models, and the like. PDX stands for Patient-Derived Xenograft, an experimental model for studying tumor biology and drug efficacy. In this model, tumor tissue excised directly from the patient is transplanted into an immunodeficient mouse. Compared with the traditional tumor cell line transplantation model, the PDX model more faithfully reflects the biological characteristics of the original clinical patient's tumor, including tumor cell heterogeneity, gene mutations, and the like. The main advantage of the PDX model in tumor studies is its high clinical relevance: since the tumors in the model originate directly from the patient, they are more likely to reflect the patient's response to a particular treatment. PDX is therefore widely used in preclinical drug screening and in the development of personalized therapeutic strategies. The PDX model also has limitations, such as long modeling time, high cost, and the absence of the human immune system. Nevertheless, the PDX model is still considered an important tool in modern tumor research because it can provide valuable information for tumor treatment. CDX stands for Cell Line-Derived Xenograft, a common experimental model in tumor research. In the CDX model, human tumor cell lines cultured in vitro are transplanted into immunodeficient mice to form visible tumors.
Although CDX uses human-derived tumor cell lines that have undergone long-term in vitro passaging, it offers a higher proliferation rate and stability while still retaining the biological characteristics of the human tumor cells. It is fast to establish and relatively low in cost, and it allows the efficacy and pharmacokinetics of tumor drugs to be studied in an in vivo environment; CDX therefore remains a commonly used in vivo model for large-scale drug screening.
Models such as PDX and CDX mainly carry human-derived tumor tissue in vivo and are therefore widely applied in tumor drug development. However, considering the xenogeneic biological differences between animals and humans, even though the tumor tissue in PDX and CDX models is human-derived, its gene expression still differs from that of tumor tissue in the human body. How to reduce these differences as much as possible and improve the matching degree between models such as PDX and CDX and real clinical tumor patients is an important problem in the tumor drug development process.
Preclinical models comprising the human immune system mainly include PBMC-PDX in vivo models, PBMC-CDX in vivo models, and the like. An in vivo model containing the human immune system carries both human tumor cells and human immune cells; it can simulate the biological characteristics of tumor cells in the human body and better simulate the interaction between the immune system and tumor cells, so it is often used in the development of tumor immune drugs and the evaluation of tumor immunotherapies.
In the PBMC-PDX model, one of the PDX models described above is first established, and then human peripheral blood mononuclear cells (PBMC, Peripheral Blood Mononuclear Cell) are injected. PBMCs mainly include T cells, B cells, NK cells, monocytes, and the like, which circulate in the peripheral blood and are critical to the immune response of the human immune system. Thus, the PBMC-PDX model not only contains the patient's tumor cells but also mimics the human immune environment, and it is widely used to evaluate how tumor immune drugs activate or inhibit human immune cells to enhance the therapeutic effect on tumors. The PBMC-CDX model is similar to the PBMC-PDX model, except that a CDX model established using a cell line is built first, followed by injection of PBMCs to mimic the immune environment. These preclinical models containing the human immune system provide an important bridge, helping researchers understand how immunotherapy works in humans and why some patients respond well to immunotherapy while others do not.
Models such as PBMC-PDX and PBMC-CDX carry both human-derived tumor tissue and a human-derived immune system, and are therefore widely applied in the development of tumor immune drugs. However, considering the xenogeneic biological differences between animals and humans, even though models such as PBMC-PDX and PBMC-CDX carry human-derived tumor tissue and immune systems, their gene expression still differs from that of tumor tissue and the immune system in the human body. How to reduce these differences as much as possible and improve the matching degree between models such as PBMC-PDX and PBMC-CDX and real clinical tumor patients is an important problem in the tumor drug development process.
Meanwhile, in terms of the scale of currently available omics sequencing data, as more and more tumor research and clinical patient cohorts at home and abroad are sequenced with high-throughput sequencing, large-scale omics data of different tumor types have been accumulated for basic tumor research, clinical diagnosis, medication, and the like. Compared with the omics data scale of clinical samples from tumor patients, and considering that preclinical models are complex to construct, have a high technical threshold, and are costly, the number of existing preclinical models and the scale of the corresponding sequencing omics data are relatively small. How to balance matching "large"-scale clinical tumor patient omics data with "small"-scale preclinical animal model omics data is thus also an important issue to consider. In addition, because omics data are influenced by factors such as the sequencing platform, library preparation, and sequencing depth, batch effects exist between different batches of omics data and between different samples, so batch effects between samples need to be comprehensively considered when matching clinical samples from real clinical tumor patients with preclinical samples from preclinical models.
Specific example 1 is mainly aimed at solving the matching of preclinical and clinical samples in the development of oncological drugs. Tumor drugs herein refer primarily to drugs that target tumor cells directly without utilizing the immune system of the tumor patient, primarily by blocking critical signaling pathways required by tumor cells, e.g., by inhibiting specific enzymes or receptors in tumor cells, such that the tumor cells are not able to acquire the signals required for growth and differentiation. Therefore, in the development stage of tumor drugs, the molecular level gene expression, particularly the expression of tumor related genes, is very important for understanding the biological mechanism of tumorigenesis and development and the pharmacological and pharmacodynamic effects of drug action, and is also a key consideration for effectively matching a preclinical sample with a clinical sample and opening up the preclinical stage and the clinical stage of drug development.
In specific embodiment 1, the clinical samples in the pre-training dataset used for training the matching algorithm model are RNA-seq sequencing data of tissue or blood samples from clinical tumor patients. The preclinical samples in the migration training dataset are RNA-seq sequencing data of tissue or blood samples from preclinical models that do not contain the human immune system (e.g., PDX and CDX models). RNA-seq is a high-throughput sequencing technique for transcriptomes used to quantitatively determine the genome-wide RNA expression level of a sample; the resulting transcriptome data quantitatively reflect the expression level of each gene across the whole genome of the sample. Each clinical sample comprises a feature vector of the clinical sample and a label vector of the clinical sample, and each preclinical sample comprises a feature vector of the preclinical sample and a label vector of the preclinical sample, wherein the feature vector is represented by a number of preset feature sets, the data of the feature vector are transcriptome data obtained by RNA-seq sequencing, and the label vector is represented by a number of preset category labels. The clinical samples in the pre-training dataset and the preclinical samples in the migration training dataset are used as input samples, and the feature vectors and the label vectors are used in the matching algorithm model as the feature vectors and the label vectors of the input samples.
In specific embodiment 1, the feature vectors of the clinical samples in the pre-training dataset and of the preclinical samples in the migration training dataset are represented by preset feature sets including a coding gene feature set and a tumor gene feature set. The coding gene feature set globally represents, at the level of all coding genes of the transcriptome, the patterns of and differences in gene expression between clinical or preclinical samples, while the tumor gene feature set represents, from the tumor dimension, the patterns of and differences in tumor cell and tumor gene expression within a sample. A coding gene is a gene that encodes and expresses a protein at the protein level, and a tumor gene is a gene closely related to mechanisms such as tumorigenesis, tumor development, and tumor suppression (e.g., driver genes, oncogenes, etc.). The matching algorithm model in embodiment 1 mainly focuses on preclinical models that do not include the human immune system, so immune-related information is not considered in the preset category labels, which include only tumor category labels.
In embodiment 1, the matching algorithm model includes an encoding module, a fusion module, a decoding module, and a discrimination module. The encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample. And the fusion module fuses the coded feature vectors to obtain fused feature vectors of the input samples. The judging module obtains a predicted label vector of the input sample according to the fused feature vector and compares and judges the predicted label vector with the label vector of the input sample, and only focuses on the tumor type label. The decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample.
In embodiment 1, the loss function of the matching algorithm model is obtained by using one or, preferably, a combination of several sub-loss functions. The sub-loss functions comprise a reconstruction sub-loss function, a discriminant sub-loss function, a batch-effect correction sub-loss function, and a total-data-amount correction sub-loss function. The reconstruction sub-loss function is used to correct the difference between the feature vector of the input sample and the decoded feature vector. The discriminant sub-loss function is used to correct the difference between the predicted label vector of the input sample and the label vector. The batch-effect correction sub-loss function is used to correct batch effects between the fused feature vectors of the input samples. The total-data-amount correction sub-loss function is used to correct the difference between the total amount of data of the decoded feature vector of the input sample and the total amount of data of the feature vector of the input sample. When the several sub-loss functions are combined into the loss function, the weight of each sub-loss function is dynamically adjusted through an attention mechanism layer and weight regularization is performed, yielding the combination of sub-loss functions.
In embodiment 1, the training of the matching algorithm model is divided into two major parts: pre-training and migration training. The pre-training process pre-trains the matching algorithm model using the pre-training dataset and the loss function to obtain a pre-trained matching algorithm model. The migration training process performs migration training on the pre-trained matching algorithm model using the migration training dataset and the loss function to obtain a migration-trained matching algorithm model, which is used for matching recommendation between clinical samples to be matched and preclinical samples to be matched. Two separate training processes, pre-training and migration training, are adopted for the matching algorithm model mainly for the following reasons. On the one hand, compared with preclinical samples, clinical samples are available at a relatively larger data scale and a relatively lower cost, so a method of pre-training on a large number of clinical samples and then migration training on a small number of preclinical samples can be adopted. On the other hand, although both clinical samples and preclinical samples involve human tumor tissue and human gene expression (especially human tumor gene expression), the host organisms differ (the clinical sample comes from a human, while the preclinical sample comes from a mouse or another animal), so biological differences in human gene expression still exist between them. Pre-training on clinical samples followed by migration training on preclinical samples therefore allows the matching algorithm model to first learn general information about tumor gene features in clinical samples and then learn specific information about tumor gene features in preclinical samples.
In embodiment 1, migration training first freezes the parameters of the encoding module of the pre-trained matching algorithm model, and then uses the migration training dataset and the loss function to fine-tune the parameters of the fusion module, the discriminating module, and the decoding module of the pre-trained matching algorithm model.
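A minimal sketch of this freeze-then-fine-tune scheme, assuming a PyTorch implementation with a single layer standing in for each module (the class and attribute names are illustrative, not from the application):

```python
import torch
from torch import nn

# Single-layer stand-ins for the four modules; real encoders/decoders
# are multi-layer as described in the text.
class MatchingModel(nn.Module):
    def __init__(self, in_dim=8, hid=4, n_labels=3):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hid)
        self.fusion = nn.Linear(hid, hid)
        self.decoder = nn.Linear(hid, in_dim)
        self.discriminator = nn.Linear(hid, n_labels)

model = MatchingModel()

# Migration training: freeze the pre-trained encoder so only the fusion,
# decoding, and discriminating modules receive gradient updates.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # fusion/decoder/discriminator parameters only
```

An optimizer for the fine-tuning stage would then be built from `filter(lambda p: p.requires_grad, model.parameters())`, so the frozen encoder weights stay fixed.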
Further, the algorithm model training method of embodiment 1 includes:
s201: a pre-training dataset and a migration training dataset are created.
The pre-training dataset of the matching algorithm model consists of clinical samples, i.e., RNA-seq sequencing data of tumor tissue or blood from clinical patients; here, about 10,000 samples from patients with different tumor types in the TCGA (The Cancer Genome Atlas) research project are used. The preset category labels of the label vectors of the clinical samples in the pre-training dataset comprise tumor type labels, which are derived from the given tumor type labels corresponding to the clinical samples in TCGA. TCGA is a research project covering multiple major tumor types; it comprises clinical samples of many different tumor types and corresponding omics data, such as genetic variation, copy number variation, methylation, and transcriptome data, including transcriptome data obtained by RNA-seq sequencing of clinical samples from tumor patients.
The migration training dataset consists of RNA-seq sequencing data of preclinical samples from preclinical models. The preset category labels of the label vectors of the preclinical samples in the migration training dataset comprise tumor type labels, which are obtained by manual pre-labeling.
S202: preprocessing of data of feature vectors of input samples.
After removing adapters and low-quality reads (reads are short sequence fragments derived from the DNA or RNA of the input sample) from the original FastQ file produced by RNA-seq sequencing of the input sample, a clinical sample is aligned with the sequence alignment software STAR using the human reference genome sequence as the reference; the read count of each gene is then obtained with software such as RSEM or featureCounts, and the quantitative expression value of each gene is calculated in TPM (Transcripts Per Million). Because a preclinical sample comes from a preclinical model, its RNA-seq sequencing data also contain reads from mice or other animals. Therefore, the reference genome sequences of both human and mouse (or other animals) are used as references, the reads in the FastQ file are aligned with STAR, the mouse or other-animal reads are removed with software such as xenome, and then the reads aligned to each gene across the whole genome are counted with RSEM or featureCounts to obtain the read count of each gene and calculate its quantitative expression value in TPM. Here, xenome is software that can distinguish human reads from other-animal reads in RNA-seq sequencing data and retain the human reads, while RSEM and featureCounts are software that count the reads aligned to each gene.
The TPM of the i-th gene is calculated as TPM_i = (c_i / l_i) / (Σ_j c_j / l_j) × 10^6, where c_i is the read count of the i-th gene and l_i is the effective length of the i-th gene. After the TPM is obtained, a pseudo-count of 1 is added and a log2 transformation is applied, i.e., the final value is log2(TPM_i + 1).
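The TPM calculation can be sketched directly from the formula (the gene counts and lengths below are toy values):

```python
import numpy as np

def tpm(counts, lengths):
    """TPM: per-gene read rate (reads per base of effective length),
    normalized so all genes in the sample sum to one million."""
    rate = np.asarray(counts, dtype=float) / np.asarray(lengths, dtype=float)
    return rate / rate.sum() * 1e6

counts = np.array([100, 200, 300])      # reads aligned to each gene
lengths = np.array([1000, 2000, 1500])  # effective gene lengths in bases
vals = tpm(counts, lengths)
log_vals = np.log2(vals + 1.0)          # pseudo-count of 1, then log2, per the text
print(vals)  # rates 0.1, 0.1, 0.2 -> TPM 250000, 250000, 500000
```

Because every sample's TPM values sum to the same constant, expression levels become comparable across samples with different sequencing depths.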
S203: and constructing a coding gene feature set and a tumor gene feature set in the preset feature set.
The preset feature sets representing the feature vectors of the clinical samples and preclinical samples include a coding gene feature set and a tumor gene feature set. The coding gene feature set includes the roughly 20,000 genes on the human genome that encode and express proteins. The tumor gene feature set is integrated from three public tumor gene databases (OncoKB, COSMIC, and IntOGen) and from different types of genes closely related to tumor biology, such as driver genes and tumor suppressor genes, reported in the relevant published literature; about 1,200 tumor genes are obtained in total to construct the tumor gene feature set.
S204: and constructing a matching algorithm model.
Fig. 3 is a schematic diagram of a module structure of a matching algorithm model provided in embodiment 1 of the present invention, and as shown in fig. 3, the matching algorithm model includes an encoding module 101, a fusion module 102, a decoding module 103 and a discriminating module 104. The encoding module 101 encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample. The fusion module 102 fuses the encoded feature vectors to obtain fused feature vectors of the input samples. The discriminating module 104 obtains a predicted tag vector of the input sample according to the fused feature vector and compares and discriminates the predicted tag vector with the tag vector of the input sample. The decoding module 103 decodes the fused feature vector to obtain a decoded feature vector of the input sample.
As shown in fig. 3, coding module 101 includes a coding gene feature set coding submodule 1011 and a tumor gene feature set coding submodule 1012. The encoding gene feature set encoding sub-module 1011 comprises an input layer 10 and a plurality of fully connected layers 11 with sequentially reduced feature dimensions, the tumor gene feature set encoding sub-module comprises an input layer 20 and a plurality of fully connected layers 21 with sequentially reduced feature dimensions, and the output feature dimensions of the last fully connected layers of the encoding gene feature set encoding sub-module 1011 and the tumor gene feature set encoding sub-module 1012 are equal, wherein the output feature dimension is selected to be 200.
As shown in fig. 3, the fusion module 102 includes two fully connected layers 40. It first performs a concatenation (concat) operation on the outputs of the last fully connected layers of the coding gene feature set coding sub-module 1011 and the tumor gene feature set coding sub-module 1012 (the resulting feature dimension is 200×2=400), then performs feature fusion through a fully connected layer (whose input and output feature dimensions are both 400), and finally performs an equal split operation, after which the two split feature vectors (each with feature dimension 200) are output simultaneously to the coding gene feature set decoding sub-module and the tumor gene feature set decoding sub-module of the decoding module.
Optionally, the fusion module 102 may also adopt a variational autoencoder (VAE) design: the encoding module is designed with a mean output sub-module and a variance output sub-module, the output means and variances are concatenated and fused through the fully connected layers, and finally a sample is drawn from a normal distribution parameterized by the output mean and variance and passed to the decoding module and the discriminating module. Alternatively, the fusion module may additionally employ a multi-head attention mechanism design.
As shown in fig. 3, decoding module 103 includes a coding gene feature set decoding submodule 1031 and a tumor gene feature set decoding submodule 1032. The encoding gene feature set decoding submodule 1031 comprises a plurality of full-connection layers 70 and output layers 71 with sequentially increased feature dimensions, the tumor gene feature set decoding submodule 1032 comprises a plurality of full-connection layers 80 and output layers 81 with sequentially increased feature dimensions, the final output feature dimension of the output layers 71 of the encoding gene feature set decoding submodule 1031 is equal to the input feature dimension of the encoding gene feature set encoding submodule input layers in the encoding module, and the final output feature dimension of the output layers 81 of the tumor gene feature set decoding submodule 1032 is equal to the input feature dimension of the tumor gene feature set encoding submodule input layers in the encoding module.
As shown in fig. 3, the discriminating module 104 includes one discrimination sub-module, the tumor type discrimination sub-module 1041, comprising a plurality of fully connected layers 50 and a Softmax layer 51. It takes the fused feature vector obtained by the fusion module as input and outputs a feature vector whose dimension equals the number of tumor type labels, i.e., the predicted tumor type label 52 of the input sample.
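Putting the modules of fig. 3 together, the following is a minimal PyTorch sketch of the architecture. The 200-dim encoder outputs, 400-dim concat/fuse/split, and softmax discriminator follow the text; the hidden-layer depths, the use of the fused 400-dim vector as the discriminator input, and all names are assumptions for illustration:

```python
import torch
from torch import nn

class MatchingModel(nn.Module):
    def __init__(self, coding_dim=20000, tumor_dim=1200, latent=200, n_tumor_types=30):
        super().__init__()
        self.latent = latent
        # two encoder sub-modules with decreasing feature dimensions
        self.enc_coding = nn.Sequential(nn.Linear(coding_dim, 2 * latent), nn.ReLU(),
                                        nn.Linear(2 * latent, latent))
        self.enc_tumor = nn.Sequential(nn.Linear(tumor_dim, 2 * latent), nn.ReLU(),
                                       nn.Linear(2 * latent, latent))
        # fusion: concat (latent*2) -> fully connected -> equal split
        self.fusion = nn.Linear(2 * latent, 2 * latent)
        # two decoder sub-modules with increasing feature dimensions
        self.dec_coding = nn.Sequential(nn.Linear(latent, 2 * latent), nn.ReLU(),
                                        nn.Linear(2 * latent, coding_dim))
        self.dec_tumor = nn.Sequential(nn.Linear(latent, 2 * latent), nn.ReLU(),
                                       nn.Linear(2 * latent, tumor_dim))
        # discriminator: fully connected layer(s) + softmax over tumor types
        self.discriminate = nn.Sequential(nn.Linear(2 * latent, n_tumor_types),
                                          nn.Softmax(dim=-1))

    def forward(self, x_coding, x_tumor):
        fused = self.fusion(torch.cat([self.enc_coding(x_coding),
                                       self.enc_tumor(x_tumor)], dim=-1))
        z_c, z_t = torch.split(fused, self.latent, dim=-1)  # equal split
        return self.dec_coding(z_c), self.dec_tumor(z_t), self.discriminate(fused)

# toy dimensions in place of ~20,000 coding genes / ~1,200 tumor genes
model = MatchingModel(coding_dim=50, tumor_dim=20, latent=8, n_tumor_types=5)
rec_c, rec_t, pred = model(torch.randn(2, 50), torch.randn(2, 20))
print(rec_c.shape, rec_t.shape, pred.shape)
```

The reconstruction sub-loss would compare `rec_c`/`rec_t` with the inputs, and the discriminant sub-loss would compare `pred` with the tumor type label.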
S205: Create the loss function of the matching algorithm model.
The loss function of the matching algorithm model is obtained by combining several sub-loss functions. The sub-loss functions comprise a reconstruction sub-loss function, a discriminant sub-loss function, a batch-effect correction sub-loss function, and a total-data-amount correction sub-loss function.
The reconstruction sub-loss function is used to correct the difference between the eigenvectors of the input samples and the decoded eigenvectors, optionally using a mean square error loss function (MSE) or a mean absolute error loss function (MAE).
The discriminant sub-loss function is used to correct the difference between the predicted label vector and the label vector of the input sample; a cross-entropy loss function may be employed, here covering only the tumor type labels.
The batch-effect correction sub-loss function is used to correct batch effects between the fused feature vectors of the input samples; a maximum mean discrepancy loss function (MMD, Maximum Mean Discrepancy) may optionally be used.
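A minimal sketch of a biased MMD estimate with an RBF kernel (the kernel choice and bandwidth `gamma` are illustrative; the application does not fix them):

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Biased estimate of squared MMD between sample sets x and y,
    using an RBF (Gaussian) kernel."""
    def k(a, b):
        sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dist)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
# two batches from the same distribution vs. a mean-shifted "batch effect"
same = mmd_rbf(rng.normal(0, 1, (50, 3)), rng.normal(0, 1, (50, 3)))
shifted = mmd_rbf(rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3)))
print(same < shifted)  # True: MMD grows as the two batches diverge
```

Minimizing such a term pushes the fused feature vectors of different batches toward the same distribution, which is exactly the batch-correction effect described above.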
The total-data-amount correction sub-loss function is used to correct the difference between the total amount of data of the decoded feature vector of the input sample and the total amount of data of the feature vector of the input sample. It is calculated as L_total = |S_in − S_out|^d, where S_in is the sum of the input TPM values of a single input sample, S_out is the sum of the output TPM values of that sample, and d is the order, optionally equal to 1, 2, or 3.
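The formula can be sketched directly (the TPM sums below are toy values):

```python
import numpy as np

def total_amount_loss(x_in, x_out, d=1):
    """|sum(input TPM) - sum(output TPM)|**d for a single sample,
    where d is the order (1, 2, or 3)."""
    return abs(float(np.sum(x_in)) - float(np.sum(x_out))) ** d

# input sums to 8.0, decoded output sums to 6.5; with d=2 the loss is 1.5**2
loss = total_amount_loss([5.0, 3.0], [4.0, 2.5], d=2)
print(loss)  # 2.25
```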
Optionally, if the fusion module of the matching algorithm model adopts the variational-autoencoder-based design, a KL divergence function may be added as a further sub-loss function to measure the difference between the posterior distribution and the prior distribution of the output fused feature vectors.
When combining the several sub-loss functions, the weight of each sub-loss function can be dynamically adjusted through an attention mechanism layer with weight regularization; alternatively, the weights can be set by simple averaging or dynamically adjusted by the homoscedastic-uncertainty loss weighting method. The combination of the sub-loss functions is then obtained.
An optional implementation of dynamically adjusting and regularizing the weight of each sub-loss function through the attention mechanism layer is as follows:
Let the n sub-loss functions be L_1, L_2, ..., L_i, ..., L_n. The current values of the sub-loss functions are spliced to obtain a query vector q = [L_1, L_2, ..., L_n], and a set of key vectors {k_1, k_2, ..., k_n} is defined. The weight corresponding to the i-th sub-loss function is then w_i = exp(q·k_i) / Σ_j exp(q·k_j), i.e. the softmax of the attention scores q·k_i, and the loss function L obtained by the weighted combination of the several sub-loss functions and their corresponding weights is L = Σ_{i=1}^{n} w_i·L_i, where i indexes the sub-loss functions.
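The attention-based weighting described above can be sketched as follows, assuming the scores are dot products of the spliced loss-value query vector with one key vector per sub-loss, normalized by a softmax; the fixed key values used in the example are illustrative stand-ins for learnable parameters:

```python
import math

# Sketch of attention-layer loss weighting: splice the current sub-loss
# values into a query vector q, score q against each key vector k_i,
# softmax-normalise the scores into weights w_i, and return the
# weighted combination L = sum_i w_i * L_i together with the weights.
def combined_loss(sub_losses, keys):
    q = list(sub_losses)                                   # query vector
    scores = [sum(qj * kj for qj, kj in zip(q, k)) for k in keys]
    m = max(scores)                                        # stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]                        # sum to 1
    total = sum(w * l for w, l in zip(weights, sub_losses))
    return total, weights
```

Because the weights are softmax-normalized they always sum to one, which acts as the weight regularization mentioned above.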
It should be noted that, during the pre-training and migration training of the matching algorithm model, the values of the sub-loss functions change dynamically, so the weights of the sub-loss functions change correspondingly; therefore the loss function of the matching algorithm model, being the weighted combination of the sub-loss functions, also changes dynamically.
In summary, the loss function of the matching algorithm model is obtained by combining the several sub-loss functions. The fusion module can thus effectively fuse the features of the coding gene feature set and the tumor gene feature set to better represent the input sample; the fusion module together with the discrimination sub-loss function effectively retains the tumor category label information of the input sample; and the reconstruction sub-loss function ensures consistency between the encoded feature vector and the decoded feature vector, further improving the accuracy of the fused feature vector. The batch effect and the total-data-amount deviation between input samples are further well eliminated by the batch-effect correction sub-loss function and the total-data-amount correction sub-loss function. Through the dynamically weighted combination of the sub-loss functions, the matching algorithm model can automatically and globally optimize the weight of each sub-loss function during pre-training and migration training to obtain the optimal weights and the resulting weighted loss function.
S206: the matching algorithm model is pre-trained using the pre-training dataset.
In the process of pre-training the matching algorithm model, the encoding gene feature set encoding submodule in the matching algorithm model is adopted to encode encoding gene feature set data in the feature vector of the clinical sample, so as to obtain the pre-training encoding gene feature set encoding feature of the clinical sample. And adopting a tumor gene feature set coding submodule in the matching algorithm model to code tumor gene feature set data in feature vectors of the clinical samples, so as to obtain pre-training tumor gene feature set coding features of the clinical samples. And fusing the coding features of the pre-training coding gene feature set and the coding features of the pre-training tumor gene feature set by adopting a fusion module in the matching algorithm model to obtain the pre-training fusion features of the clinical sample. And decoding the pre-training fusion characteristic by adopting a decoding module in the matching algorithm model to obtain the pre-training encoding gene characteristic set decoding characteristic of the clinical sample and the pre-training tumor gene decoding characteristic of the clinical sample. And a judging module in the matching algorithm model is adopted, a predicted tumor type label in a predicted label vector of the clinical sample is obtained according to the pre-training fusion characteristic, and the predicted tumor type label and a tumor type label in a label vector of the clinical sample are compared and judged.
In the pre-training process, the hyper-parameter settings comprise an initial learning rate of 0.001, a batch size of 16 and an epoch number of 200. The learning rate adopts a step-based dynamic decay strategy, and the loss function is calculated either by dynamically adjusting and regularizing the weight of each sub-loss function through the attention mechanism layer, or by dynamically updating the weights of the different sub-loss functions based on the homoscedastic-uncertainty loss method, the weighted combination being taken as the loss function. After a certain number of epoch iterations, if the difference between the total loss functions of two consecutive iterations is smaller than a threshold value (e.g. 1E-6), pre-training is stopped early and the pre-training of the matching algorithm model is completed, yielding the pre-trained matching algorithm model.
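The early-stopping rule described here can be sketched as follows; `run_epoch` is a stand-in for one real training epoch returning the total loss:

```python
# Sketch of epoch-level early stopping: train for at most max_epochs,
# but stop as soon as the absolute change of the total loss between
# two consecutive epochs falls below the threshold (1e-6 in the text).
def train_with_early_stopping(run_epoch, max_epochs=200, tol=1e-6):
    prev_loss = None
    loss = None
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)          # one epoch over all batches
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            return epoch + 1, loss       # converged early
        prev_loss = loss
    return max_epochs, loss              # ran the full budget
```

The same rule applies unchanged to the migration training phase, which uses the same threshold.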
In order to reduce overfitting of the matching algorithm model during training and improve its generalization capability, a small portion (e.g. 5%-20%) of the clinical samples in the pre-training data set is randomly extracted as a test data set. The test data set does not participate in the pre-training of the matching algorithm model, but only in the evaluation of the model during the pre-training iterations. At each model training iteration, the change of the loss function and of the coefficient of determination is calculated on the test data set to dynamically measure the performance change of the matching algorithm model during pre-training.
Fig. 4 is a graph schematically showing a change of a loss function of a test data set with the number of pre-training iterations in the pre-training process of the matching algorithm model in embodiment 1, where the x-axis is the training iteration epoch and the y-axis is the value of the loss function.
FIG. 5 is a graph showing the variation of the coefficient of determination of the test data set with the number of pre-training iterations during the pre-training of the matching algorithm model of embodiment 1, with the x-axis being the training iteration epoch and the y-axis being the coefficient of determination.
S207: and performing migration training on the matching algorithm model by using the migration training data set.
Migration training of the matching algorithm model is as follows: the parameters of the coding module included in the pre-trained matching algorithm model are frozen first, and then the parameters of the fusion module, the parameters of the judging module and the parameters of the decoding module included in the pre-trained matching algorithm model are further finely adjusted by using the preclinical samples and the loss function in the migration training data set.
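The freeze-then-fine-tune scheme can be sketched as follows; the flat parameter dictionary, the module names and the plain SGD update are illustrative assumptions, not the patent's implementation:

```python
# Illustrative sketch of migration training: parameters are grouped by
# module name prefix, the encoder group is frozen, and a gradient step
# only updates parameters whose module is not in the frozen set.
def sgd_step(params, grads, frozen, lr=0.001):
    return {
        name: value if name.split(".")[0] in frozen
        else value - lr * grads[name]
        for name, value in params.items()
    }

# Pre-trained parameters (stand-in scalars) for the four modules.
params = {"encoder.w": 1.0, "fusion.w": 1.0,
          "decoder.w": 1.0, "discriminator.w": 1.0}
grads = {name: 0.5 for name in params}
updated = sgd_step(params, grads, frozen={"encoder"})
```

After the step, `encoder.w` is unchanged while the fusion, decoding and discrimination parameters have moved, mirroring the fine-tuning described above.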
In the migration training process, the hyper-parameter settings comprise an initial learning rate of 0.001, a batch size of 16 and an epoch number of 200. The learning rate adopts a step-based dynamic decay strategy, and the loss function is calculated either by dynamically adjusting and regularizing the weight of each sub-loss function through the attention mechanism layer, or by dynamically updating the weights of the different sub-loss functions based on the homoscedastic-uncertainty loss method, the weighted combination being taken as the loss function. After a certain number of epoch iterations, if the difference between the total loss functions of two consecutive iterations is smaller than a threshold value (e.g. 1E-6), migration training is stopped early and the migration training of the matching algorithm model is completed, yielding the migration-trained matching algorithm model.
S208: matching between the preclinical and clinical samples is performed using a migration trained matching algorithm model.
The feature vectors of the input samples comprise the feature vectors of the clinical samples to be matched and the feature vectors of the preclinical samples to be matched; these are input into the migration-trained matching algorithm model to obtain the fused feature vectors of the clinical samples and the fused feature vectors of the preclinical samples. Here, the data of the feature vectors of the input samples may employ a data preprocessing method similar to step S202.
Matching the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample yields a matching result between the clinical sample and the preclinical sample. The matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, and the similarity can be quantitatively calculated using methods such as cosine similarity, mutual information, distance metrics or correlation coefficients.
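Taking cosine similarity as one of the listed options, the matching step can be sketched as scoring a clinical sample's fused vector against each candidate preclinical fused vector and ranking the candidates:

```python
import math

# Minimal sketch of the matching step: cosine similarity between the
# fused feature vector of a clinical sample and each fused feature
# vector of candidate preclinical samples, ranked highest first.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_matches(clinical_vec, preclinical_vecs):
    sims = [(i, cosine_similarity(clinical_vec, v))
            for i, v in enumerate(preclinical_vecs)]
    return sorted(sims, key=lambda t: t[1], reverse=True)
```

The top-ranked preclinical sample is the recommended match; mutual information, distance metrics or correlation coefficients could be substituted for the scoring function.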
The technical scheme in the specific embodiment 1 has the following technical advantages:
1) By adopting pre-training on clinical samples and migration training on preclinical samples, the problem of the unbalanced data scale of clinical and preclinical samples is effectively solved, and the matching algorithm model simultaneously learns the general information of tumor gene features in clinical samples and the specific information of tumor gene features in preclinical samples, thereby realizing accurate matching between clinical samples and preclinical samples;
2) The feature vector of the input sample is represented by two preset feature sets, namely the coding gene feature set and the tumor gene feature set, which comprehensively and effectively characterize the patterns of and differences between input samples at both the global and the local level of tumor gene features, thereby improving the feature representation capability for the input samples;
3) The adopted matching algorithm model comprises an encoding module, a fusion module, a decoding module and a discrimination module. Efficient high-to-low-dimensional compression of the feature vector of the input sample is realized by the encoding module; the data consistency and accuracy of this compression are ensured by the decoding module; efficient fusion between the global coding gene features and the local tumor gene features of the input sample is realized by the fusion module; and the tumor category label information is effectively retained by the discrimination module. By combining unsupervised and supervised learning in this way, information fusion and dimensionality reduction of the feature vector of the input sample are realized while information loss is reduced;
4) Through the dynamically weighted combination of several sub-loss functions, the features of the coding gene feature set and the tumor gene feature set are effectively fused so that the input sample is better represented; the tumor category label information of the input sample is effectively retained by the discrimination sub-loss function; the consistency between the low-dimensional encoded feature vector and the high-dimensional decoded feature vector is ensured by the reconstruction sub-loss function, further improving the accuracy of the fused feature vector; in addition, the batch effect and the total-data-amount deviation between input samples are further well eliminated by the batch-effect correction sub-loss function and the total-data-amount correction sub-loss function; and finally, through the dynamically weighted combination of the sub-loss functions, the matching algorithm model can be automatically optimized to the globally optimal performance.
Specific embodiment 2 is primarily aimed at solving the matching of preclinical and clinical samples in the development of tumor immune drugs. Here, a tumor immune drug mainly refers to a drug which, rather than acting directly on tumor cells, utilizes the immune system of the tumor patient to identify and attack cancer cells; current examples mainly comprise immune checkpoint inhibitors, tumor vaccines, CAR-T cell therapies, tumor infiltrating lymphocyte (TIL) therapies and the like. Therefore, in the tumor drug development stage, gene expression at the molecular level, particularly the expression of tumor-related genes and immune-related genes, is very important for understanding the mechanism of action between the immune system and the tumor in the human body as well as the pharmacological efficacy and pharmacodynamics of tumor immune drugs, and is also a key consideration for effectively matching preclinical samples with clinical samples and bridging the preclinical and clinical stages of drug development.
In specific embodiment 2, the clinical samples in the pre-training data set used for training the matching algorithm model are RNA-seq sequencing data of tissue or blood samples of clinical tumor patients. The preclinical samples in the migration training data set are RNA-seq sequencing data of tissue or blood samples from preclinical models containing a human immune system (e.g. PBMC-PDX and PBMC-CDX models). Each clinical sample comprises a feature vector and a label vector of the clinical sample, and each preclinical sample comprises a feature vector and a label vector of the preclinical sample, wherein the feature vector is expressed as several preset feature sets and the data of the feature vector is transcriptome data obtained by RNA-seq sequencing. The label vector is represented as several preset category labels. The clinical samples in the pre-training data set and the preclinical samples in the migration training data set serve as input samples, and their feature vectors and label vectors serve as the feature vectors and label vectors of the input samples in the matching algorithm model.
In specific embodiment 2, the feature vector representation of the clinical sample in the pre-training data set and the pre-clinical sample in the migration training data set adopts a preset feature set including a coding gene feature set, a tumor gene feature set and an immune gene feature set, wherein the coding gene feature set globally represents the pattern and the difference of gene expression between the clinical sample or the pre-clinical sample from a transcript system corresponding to all coding gene levels, the tumor gene feature set represents the pattern and the difference of tumor cell and tumor gene expression in the sample from a tumor dimension, and the immune gene feature set represents the pattern and the difference of immune cell activity, immune function, immune response, immune related gene expression and the like in the sample from a dimension of immune system function in a human body. The coding gene refers to a gene capable of coding and expressing a protein at a protein level, the tumor gene refers to a gene closely related to mechanisms such as tumorigenesis, development, inhibition and the like (e.g., driving gene, cancer suppressor gene and the like), and the immune gene refers to a gene directly involved in or closely related to immune function. The matching algorithm model in embodiment 2 mainly focuses on a preclinical model including a human immune system, so that immune related information needs to be considered in a preset class label, and includes a tumor class label and an immune class label.
In embodiment 2, the matching algorithm model includes an encoding module, a fusion module, a decoding module, and a discrimination module. The encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample. And the fusion module fuses the coded feature vectors to obtain fused feature vectors of the input samples. And the judging module obtains a predicted tag vector of the input sample according to the fused feature vector and compares and judges the predicted tag vector with the tag vector of the input sample. The decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample.
In embodiment 2, the loss function of the matching algorithm model is obtained using one or several sub-loss functions in combination, preferably several sub-loss functions. The sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch-effect correction sub-loss function and a total-data-amount correction sub-loss function. The reconstruction sub-loss function corrects the difference between the feature vector of the input sample and the decoded feature vector. The discrimination sub-loss function corrects the difference between the predicted label vector and the label vector of the input sample; here both the tumor type label and the immune type label are considered. The batch-effect correction sub-loss function corrects batch effects between the fused feature vectors of the input samples. The total-data-amount correction sub-loss function corrects the difference between the total amount of data of the decoded feature vector and the total amount of data of the feature vector of the input sample. When combining the several sub-loss functions, the weight of each sub-loss function is dynamically adjusted and regularized through an attention mechanism layer, thereby obtaining the combination of the sub-loss functions.
In embodiment 2, the training of the matching algorithm model is divided into two parts, pre-training and migration training. The pre-training process uses the pre-training data set and the loss function to pre-train the matching algorithm model, yielding the pre-trained matching algorithm model. The migration training process uses the migration training data set and the loss function to perform migration training on the pre-trained matching algorithm model, yielding the migration-trained matching algorithm model, which is used for matching recommendation between clinical samples to be matched and preclinical samples to be matched. Two separate training processes, pre-training and migration training, are adopted for the matching algorithm model mainly for the following reasons. On the one hand, compared with preclinical samples, clinical samples are available at a relatively larger data scale and relatively lower cost, so a method of pre-training on a large number of clinical samples and then migration training on a small number of preclinical samples can be adopted. On the other hand, although both the clinical samples and the preclinical samples concern human tumor tissue and the human immune system, and both measure the expression of human genes, in particular human tumor and immune genes, the host organisms differ (clinical samples are from humans, while preclinical samples are from mice or other animals), so biological differences in human gene expression remain between them. Therefore, by pre-training on clinical samples and then migration training on preclinical samples, the matching algorithm model can first learn the general information of the tumor gene features and immune gene features in clinical samples, and then learn the specific information of the tumor gene features and immune gene features in preclinical samples.
In embodiment 2, the migration training is to freeze the parameters of the encoding module included in the pre-trained matching algorithm model first, and then further use the migration training data set and the loss function to fine tune the parameters of the fusion module, the parameters of the discriminating module and the parameters of the decoding module included in the pre-trained matching algorithm model.
Further, the algorithm model training method of embodiment 2 includes:
S301: a pre-training data set and a migration training data set are created.
The pre-training data set of the matching algorithm model consists of clinical samples, i.e. RNA-seq sequencing data from tumor tissue or blood of clinical patients; here about 10000 samples from patients with different tumor types in the TCGA (The Cancer Genome Atlas) research program are used. The preset category labels of the label vectors of the clinical samples in the pre-training data set include tumor type labels and immune type labels.
The migration training data set consists of RNA-seq sequencing data of preclinical samples from preclinical models. The preset category labels of the label vectors of the preclinical samples in the migration training data set comprise tumor type labels and immune type labels, which are obtained by manual pre-labeling.
S302: preprocessing of data of feature vectors of input samples.
After removing adapters and low-quality reads (reads here referring to the short sequence fragments derived from the DNA or RNA of the input sample) from the original FastQ file of the RNA-seq sequencing of the input sample, the reads of clinical samples are aligned with the sequence alignment software STAR using the human reference genome sequence as reference, and the read count of each gene is obtained with the software RSEM or featureCounts and converted into a quantitative expression value for each gene in TPM (Transcripts Per Million). Because the preclinical samples come from preclinical models, their RNA-seq sequencing data also contain reads from mice or other animals; therefore, the reference genome sequences of human and of mouse or other animals are used as references, the reads in the FastQ files are aligned with STAR, the reads of mouse or other animal origin are removed with software such as xenome, and the read count of each gene is then obtained with RSEM or featureCounts and converted into a quantitative expression value for each gene across the whole genome, in TPM. Here, xenome is a known software tool that can distinguish human reads from other animal reads in RNA-seq data and retain the human reads, while RSEM and featureCounts are known software tools that count the reads aligned to each gene.
The TPM of the i-th gene is specifically calculated as TPM_i = (c_i / l_i) / Σ_j (c_j / l_j) × 10^6, where c_i is the read count of the i-th gene and l_i is the effective length of the i-th gene. The TPM values are then log2-transformed after adding a pseudo-count of 1, i.e. log2(TPM + 1).
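A minimal sketch of this quantification, assuming the standard TPM definition consistent with the read counts and effective lengths described above:

```python
import math

# Sketch of TPM quantification: read counts are divided by effective
# gene length, the resulting rates are normalised to sum to one
# million, and the TPM values are log2-transformed after adding a
# pseudo-count of 1.
def tpm(counts, effective_lengths):
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

def log2_tpm(tpm_values):
    return [math.log2(t + 1) for t in tpm_values]
```

Because TPM values always sum to one million per sample, samples with different sequencing depths become directly comparable before the log transform.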
S303: and constructing a coding gene feature set and a tumor gene feature set in the preset feature set.
The preset feature sets representing the feature vectors of the clinical samples and the preclinical samples include the coding gene feature set and the tumor gene feature set. The coding gene feature set includes the roughly 20,000 genes on the human genome that encode and express proteins. The tumor gene feature set is integrated from three public tumor gene databases (ONCOKB, COSMIC and INTOGEN) together with different types of genes closely related to tumor biology (e.g. driver genes, tumor suppressor genes) reported in the related published literature, giving about 1200 tumor genes in total for constructing the tumor gene feature set. The immune gene feature set is integrated from the CRI iAtlas public database of tumor immunology research together with genes closely related to the immune system and immune function in the related published literature, giving about 2200 immune genes in total for constructing the immune gene feature set.
S304: and constructing a matching algorithm model.
Fig. 6 is a schematic block diagram of a matching algorithm model provided in embodiment 2 of the present invention, and as shown in fig. 6, the matching algorithm model includes an encoding module 101, a fusion module 102, a decoding module 103 and a discriminating module 104. The encoding module 101 encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample. The fusion module 102 fuses the encoded feature vectors to obtain fused feature vectors of the input samples. The discriminating module 104 obtains a predicted tag vector of the input sample according to the fused feature vector and compares and discriminates the predicted tag vector with the tag vector of the input sample. The decoding module 103 decodes the fused feature vector to obtain a decoded feature vector of the input sample.
As shown in fig. 6, the encoding module 101 includes an encoding gene feature set encoding submodule 1011, a tumor gene feature set encoding submodule 1012, and an immune gene feature set encoding submodule 1013. The encoding gene feature set encoding submodule 1011 comprises an input layer 10 and a full-connection layer 11 with a plurality of feature dimensions reduced in sequence, the tumor gene feature set encoding submodule 1012 comprises an input layer 20 and a full-connection layer 21 with a plurality of feature dimensions reduced in sequence, the immune gene feature set encoding submodule 1013 comprises an input layer 30 and a full-connection layer 31 with a plurality of feature dimensions reduced in sequence, and the output feature dimensions of the last full-connection layers of the encoding gene feature set encoding submodule 1011, the tumor gene feature set encoding submodule 1012 and the immune gene feature set encoding submodule 1013 are equal, wherein the output feature dimension is 200.
As shown in fig. 6, the fusion module 102 comprises three fully connected layers 40. First, the outputs of the last fully connected layers of the coding gene feature set encoding submodule 1011, the tumor gene feature set encoding submodule 1012 and the immune gene feature set encoding submodule 1013 are spliced by a concatenation operation (the feature dimension is correspondingly 200×3=600); feature compression is then performed by the first fully connected layer (input feature dimension 600, output feature dimension 200), feature fusion by the second fully connected layer (input feature dimension 200, output feature dimension 200), and feature expansion by the third fully connected layer (input feature dimension 200, output feature dimension 600); finally an equal split operation is performed, and the three split feature vectors (feature dimension 200 each) are simultaneously output to the coding gene feature set decoding submodule 1031, the tumor gene feature set decoding submodule 1032 and the immune gene feature decoding submodule 1033 of the decoding module;
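The data flow of this fusion module can be sketched structurally as follows; the random weights and the ReLU activation are illustrative stand-ins, since the real layers are learned during training:

```python
import numpy as np

# Structural sketch of the fusion module of Fig. 6: three 200-dim
# encodings -> concatenate to 600 -> compress to 200 -> fuse at 200
# -> expand to 600 -> split back into three 200-dim vectors for the
# three decoding submodules. Weights are random stand-ins.
rng = np.random.default_rng(0)

def fc(x, dim_in, dim_out):
    w = rng.standard_normal((dim_in, dim_out)) * 0.01
    return np.maximum(x @ w, 0.0)            # linear layer + ReLU

def fusion_forward(enc_coding, enc_tumor, enc_immune, d=200):
    x = np.concatenate([enc_coding, enc_tumor, enc_immune])  # splice: 3*d
    x = fc(x, 3 * d, d)                      # feature compression
    x = fc(x, d, d)                          # feature fusion
    x = fc(x, d, 3 * d)                      # feature expansion
    return np.split(x, 3)                    # equal split for decoders
```

The middle 200-dimensional activation is the fused feature vector that is also passed to the discrimination module and used for matching.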
Optionally, the fusion module 102 may also adopt a variational autoencoder (VAE) design: the fusion module is designed with a mean output submodule and a variance output submodule, the spliced encodings are passed through several fully connected layers for the mean and the variance respectively, and finally a sample is drawn from a normal distribution parameterized by the output mean and variance and transmitted to the decoding module and the discrimination module. Alternatively, the fusion module may additionally employ a multi-head attention mechanism design.
As shown in fig. 6, the decoding module 103 includes a coding gene feature set decoding submodule 1031, a tumor gene feature set decoding submodule 1032 and an immune gene feature decoding submodule 1033. The coding gene feature set decoding submodule 1031 comprises several fully connected layers 70 with sequentially increasing feature dimensions and an output layer 71; the tumor gene feature set decoding submodule 1032 comprises several fully connected layers 80 with sequentially increasing feature dimensions and an output layer 81; and the immune gene feature set decoding submodule 1033 comprises several fully connected layers 90 with sequentially increasing feature dimensions and an output layer 91. The final output feature dimension of output layer 71 of the coding gene feature set decoding submodule 1031 equals the input feature dimension of the input layer of the coding gene feature set encoding submodule 1011 in the encoding module; the final output feature dimension of output layer 81 of the tumor gene feature set decoding submodule 1032 equals the input feature dimension of the input layer of the tumor gene feature set encoding submodule 1012; and the final output feature dimension of output layer 91 of the immune gene feature set decoding submodule 1033 equals the input feature dimension of the input layer of the immune gene feature set encoding submodule 1013.
As shown in fig. 6, the discrimination module 104 includes two discrimination sub-modules, namely a tumor type discrimination sub-module 1041 and an immune type discrimination sub-module 1042. The tumor type discriminating submodule 1041 includes a plurality of full connection layers 50 and Softmax layers 51, and outputs a dimension equal to the number of tumor type labels, and takes the fused feature vector obtained by the fusion module 102 as an input of the tumor type discriminating submodule 1041, and outputs a predicted tumor type label 52 of the input sample. The immunity type discrimination submodule 1042 comprises a plurality of full-connection layers 60 and a Softmax layer 61, the output dimension is equal to the number of immunity type labels, the fused feature vector obtained by the fusion module is used as the input of the immunity type discrimination submodule, and the predicted immunity type label 62 of the input sample is output.
S305: a loss function is created that matches the algorithm model.
The loss function of the matching algorithm model is obtained by combining several sub-loss functions. The sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch-effect correction sub-loss function and a total-data-amount correction sub-loss function.
The reconstruction sub-loss function is used to correct the difference between the feature vector of the input sample and the decoded feature vector, optionally using a mean square error (MSE) loss function or a mean absolute error (MAE) loss function.
The discrimination sub-loss function is used to correct the difference between the predicted label vector and the label vector of the input sample, and a cross entropy loss function may be employed; here both the tumor type label and the immune type label are included.
The batch-effect correction sub-loss function is used to correct batch effects between the fused feature vectors of the input samples, optionally using a Maximum Mean Discrepancy (MMD) loss function.
The total-data-amount correction sub-loss function is used to correct the difference between the total amount of data of the decoded feature vector of the input sample and the total amount of data of the feature vector of the input sample. Its calculation formula is L_total = |T_in - T_out|^D, where T_in is the sum of the input TPM values of a single input sample, T_out is the sum of the output TPM values of that single input sample, and D is the order, optionally equal to 1, 2 or 3.
Optionally, if the fusion module of the matching algorithm model adopts a design based on a variational encoder, a KL divergence function may be added to the sub-loss functions to further measure the difference between the posterior distribution and the prior distribution of the output fused feature vector.
When combining the several sub-loss functions, the weight of each sub-loss function can be dynamically adjusted and regularized through an attention mechanism layer; alternatively, the weights can be set by equal-weight averaging or dynamically adjusted through a homoscedastic uncertainty loss method. The weighted sub-loss functions are then combined to obtain the loss function.
The optional implementation process of dynamically adjusting the weight of each sub-loss function and carrying out weight regularization through the attention mechanism layer is as follows:
Let the n sub-loss functions be L_1, L_2, ..., L_i, ..., L_n. The sub-loss function values are concatenated to obtain a query vector Q, and a set of key vectors K is defined; the attention between Q and K yields the weight w_i corresponding to the i-th sub-loss function. The loss function L based on the weighted combination of the several sub-loss functions and their corresponding weights is L = Σ_{i=1}^{n} w_i · L_i, where i ranges from 1 to n.
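A toy version of this attention-based weighting, with identity key vectors standing in for the learned ones (the actual keys and scoring function are not specified in the text), could read:

```python
import numpy as np

def attention_weighted_loss(sub_losses, keys):
    """Combine n sub-loss values into one loss L = sum_i w_i * L_i.
    The sub-loss values form the query vector Q; each key vector scores
    one sub-loss, and a softmax turns the scores into normalized weights
    (the normalization acts as the weight regularization)."""
    q = np.asarray(sub_losses, dtype=float)      # query vector Q
    scores = np.asarray(keys, dtype=float) @ q   # one attention score per key
    e = np.exp(scores - scores.max())            # stable softmax
    w = e / e.sum()                              # weights are positive, sum to 1
    return float(w @ q), w

# reconstruction, discrimination, batch effect, total-amount sub-losses (made up)
sub_losses = [0.5, 1.2, 0.1, 2.0]
keys = np.eye(4)                                 # hypothetical key vectors
L, w = attention_weighted_loss(sub_losses, keys)
```

Because the weights come from a softmax, the combined loss is a convex combination of the sub-losses, and re-running the attention each iteration is what lets the weights track the dynamically changing sub-loss values.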
It should be noted that, during the pre-training and migration training of the matching algorithm model, the values of the sub-loss functions change dynamically, so the weights of the sub-loss functions change correspondingly; the loss function of the matching algorithm model, as the combination of the sub-loss functions, therefore also changes dynamically.
In summary, the loss function of the matching algorithm model is obtained through the combination of the several sub-loss functions. The fusion module can thereby effectively fuse the features of the coding gene feature set, the tumor gene feature set and the immune gene feature set, so as to better represent the input sample. The fusion module and the discrimination sub-loss function effectively retain the tumor type label information and immune type label information of the input sample. The reconstruction sub-loss function ensures the consistency of the encoded feature vector and the decoded feature vector, further improving the accuracy of the fused feature vector. In addition, the batch effect correction sub-loss function and the total data amount correction sub-loss function further eliminate the batch effect and the deviation in total data amount between input samples. Through the dynamic weighted combination of the sub-loss functions, the matching algorithm model can automatically and globally optimize the weight of each sub-loss function during pre-training and migration training, so as to obtain the optimal weights and the resulting weighted loss function.
S306: the matching algorithm model is pre-trained using the pre-training dataset.
In the process of pre-training the matching algorithm model, the encoding gene feature set encoding submodule in the matching algorithm model is adopted to encode encoding gene feature set data in the feature vector of the clinical sample, so as to obtain the pre-training encoding gene feature set encoding feature of the clinical sample. And adopting a tumor gene feature set coding submodule in the matching algorithm model to code tumor gene feature set data in feature vectors of the clinical samples, so as to obtain pre-training tumor gene feature set coding features of the clinical samples. And adopting an immune gene feature set coding submodule in the matching algorithm model to code immune gene feature set data in feature vectors of the clinical samples, so as to obtain pre-training immune gene feature set coding features of the clinical samples. And adopting a fusion module in the matching algorithm model to fuse the pre-training encoding gene feature set encoding features, the pre-training tumor gene feature set encoding features and the pre-training immune gene feature set encoding features to obtain the pre-training fusion features of the clinical sample. And decoding the pre-training fusion characteristic by adopting a decoding module in the matching algorithm model to obtain a pre-training encoding gene characteristic set decoding characteristic of the clinical sample, a pre-training tumor gene decoding characteristic of the clinical sample and a pre-training immunity gene characteristic set decoding characteristic. 
The discrimination module in the matching algorithm model obtains the predicted tumor type label in the predicted label vector of the clinical sample according to the pre-training fusion feature, and compares it with the tumor type label in the label vector of the clinical sample; meanwhile, it obtains the predicted immune type label in the predicted label vector of the clinical sample according to the pre-training fusion feature, and compares it with the immune type label in the label vector of the clinical sample.
During pre-training, the hyper-parameter settings include an initial learning rate of 0.001, a batch size of 16, and 200 epochs. The learning rate adopts a step-based dynamic decay strategy. For the loss function calculation, either an attention mechanism layer dynamically adjusts and regularizes the weight of each sub-loss function, or the weights of the different sub-loss functions are dynamically updated based on a homoscedastic uncertainty loss method; the weighted sub-loss functions are then combined into the loss function. After a certain number of epoch iterations, if the difference between the total loss functions of two successive iterations is smaller than a threshold (e.g., 1E-6), pre-training stops early and the pre-training of the matching algorithm model is complete, yielding the pre-trained matching algorithm model.
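The step-based learning-rate decay and the early-stopping rule can be sketched as follows; the decay step size and factor are assumptions, since the text fixes only the initial learning rate and the stopping threshold:

```python
def step_decay_lr(epoch, initial_lr=0.001, step_size=50, gamma=0.1):
    # step-based dynamic decay of the learning rate; step_size and gamma
    # are illustrative assumptions (only initial_lr = 0.001 is given)
    return initial_lr * gamma ** (epoch // step_size)

def should_stop_early(total_losses, tol=1e-6):
    # stop once the total losses of two successive iterations differ
    # by less than the threshold (e.g., 1E-6)
    return len(total_losses) >= 2 and abs(total_losses[-1] - total_losses[-2]) < tol
```

In a training loop these would be queried once per epoch, so training runs for at most the configured 200 epochs but can finish as soon as the total loss plateaus.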
To reduce overfitting of the matching algorithm model during training and improve its generalization ability, a small portion of the clinical samples (for example, 5%-20%) is randomly extracted from the pre-training data set as a test data set. The test data set does not participate in the pre-training of the matching algorithm model; it is used only to evaluate the model during the pre-training iterations. At each training iteration, the change of the loss function and of the coefficient of determination is calculated on the test data set to dynamically measure the performance change of the matching algorithm model during pre-training.
Fig. 7 is a schematic diagram of a curve of a loss function of a test data set along with the number of pre-training iterations in a pre-training process of a matching algorithm model according to embodiment 2 of the present invention, where an x-axis is a training iteration epoch and a y-axis is a value of the loss function.
FIG. 8 is a schematic diagram of the curve of the coefficient of determination of the test data set with the number of pre-training iterations during the pre-training of the matching algorithm model according to embodiment 2 of the present invention, where the x-axis is the training iteration epoch and the y-axis is the coefficient of determination.
S307: and performing migration training on the matching algorithm model by using the migration training data set.
The migration training of the matching algorithm model is as follows: the parameters of the encoding module included in the pre-trained matching algorithm model are frozen first, and the parameters of the fusion module, the discrimination module and the decoding module included in the pre-trained matching algorithm model are then fine-tuned using the preclinical samples in the migration training data set and the loss function.
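A framework-agnostic sketch of this freezing step, with module names taken from the text but a made-up parameter container (the "trainable" flag stands in for a real framework's requires_grad), might be:

```python
# Modules of the pre-trained matching algorithm model; weights are placeholders.
model = {
    "encoding":       {"weights": [0.1, 0.2], "trainable": True},
    "fusion":         {"weights": [0.3],      "trainable": True},
    "discrimination": {"weights": [0.4],      "trainable": True},
    "decoding":       {"weights": [0.5],      "trainable": True},
}

def freeze_for_migration(model):
    # freeze only the encoding module; the fusion, discrimination and
    # decoding parameters remain trainable for fine-tuning on
    # preclinical samples
    for name, module in model.items():
        module["trainable"] = (name != "encoding")
    return model

frozen = freeze_for_migration(model)
trainable_modules = [n for n, m in frozen.items() if m["trainable"]]
```

Freezing the encoder preserves the general representation learned from the large clinical data set, while the smaller preclinical data set only has to adapt the downstream modules.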
During migration training, the hyper-parameter settings include an initial learning rate of 0.001, a batch size of 16, and 200 epochs. The learning rate adopts a step-based dynamic decay strategy. For the loss function calculation, either an attention mechanism layer dynamically adjusts and regularizes the weight of each sub-loss function, or the weights of the different sub-loss functions are dynamically updated based on a homoscedastic uncertainty loss method; the weighted sub-loss functions are combined into the loss function. After a certain number of epoch iterations, if the difference between the total loss functions of two successive iterations is smaller than a threshold (e.g., 1E-6), migration training stops early and the migration training of the matching algorithm model is complete, yielding the migration-trained matching algorithm model.
S308: matching between the preclinical and clinical samples is performed using a migration trained matching algorithm model.
Inputting the feature vector of the input sample into a migration trained matching algorithm model to obtain a fused feature vector of the clinical sample and a fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises: the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched. Here, the data of the feature vector of the input sample may employ a data preprocessing method similar to step S202.
Matching the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, and the similarity can be quantitatively calculated by using methods such as cosine similarity, mutual information, distance measurement, correlation coefficient and the like.
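For the cosine similarity option, a small matching sketch (sample names and vectors are invented for illustration; mutual information, distance metrics or correlation coefficients would slot in the same way) could be:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two fused feature vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_preclinical(clinical_vec, preclinical_vecs):
    """Rank preclinical samples by similarity of their fused feature
    vectors to one clinical sample's fused feature vector."""
    sims = {name: cosine_similarity(clinical_vec, v)
            for name, v in preclinical_vecs.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)

clinical = [1.0, 0.5, 0.0]                    # fused vector, clinical sample
preclinical = {"PDX-1": [2.0, 1.0, 0.0],      # same direction -> similarity 1
               "CDX-7": [0.0, 0.0, 1.0]}      # orthogonal -> similarity 0
ranked = match_preclinical(clinical, preclinical)
```

The highest-ranked preclinical sample is then the matching recommendation for the clinical sample.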
The technical scheme in the specific embodiment 2 has the following technical advantages:
1) By pre-training on clinical samples and migration-training on preclinical samples, the method not only effectively resolves the imbalance in data scale between clinical samples and preclinical samples, but also lets the matching algorithm model learn both the general information of tumor gene features and immune gene features in clinical samples and their specific information in preclinical samples, thereby achieving accurate matching between clinical samples and preclinical samples;
2) The feature vector of the input sample is represented by three preset feature sets, namely the coding gene feature set, the tumor gene feature set and the immune gene feature set, which comprehensively and effectively characterize the patterns of and differences between input samples at the global level and at the local levels of tumor gene features and immune gene features, improving the feature representation ability for the input samples;
3) The matching algorithm model comprises an encoding module, a fusion module, a decoding module and a discrimination module. The encoding module efficiently compresses the feature vector of the input sample from high to low dimension; the decoding module guarantees the data consistency and accuracy of this compression; the fusion module efficiently fuses the global coding gene features and the local tumor gene features of the input sample; and the discrimination module effectively retains the tumor type label and immune type label information. By combining unsupervised and supervised learning, information fusion and dimensionality reduction of the feature vector of the input sample are achieved with reduced information loss;
4) Through the dynamic weighted combination of the several sub-loss functions, the features of the coding gene feature set and the tumor gene feature set are effectively fused and the input samples are better represented; the discrimination sub-loss function effectively retains the tumor type label information and immune type label information of the input samples; the reconstruction sub-loss function guarantees the consistency of the low-dimensional encoded feature vectors and the high-dimensional decoded feature vectors, further improving the accuracy of the fused feature vectors; in addition, the batch effect correction sub-loss function and the total data amount correction sub-loss function further eliminate the deviation of the batch effect and the total data amount between the input samples; finally, through the dynamic weighted combination of the sub-loss functions, the matching algorithm model can be automatically optimized to the globally optimal performance.
Optionally, the invention further provides an algorithm model training device, which comprises:
a creation module, used for creating a pre-training data set and a migration training data set, wherein the pre-training data set comprises a plurality of clinical samples, and the migration training data set comprises a plurality of preclinical samples; each of the clinical samples comprises a feature vector of the clinical sample and a label vector of the clinical sample, and each of the preclinical samples comprises a feature vector of the preclinical sample and a label vector of the preclinical sample, wherein the feature vector is expressed as a plurality of preset feature sets, the data of the feature vector is omics data, and the label vector is expressed as a plurality of preset category labels; the clinical samples and the preclinical samples are used as input samples, and the feature vectors and the label vectors are used in a matching algorithm model as the feature vectors of the input samples and the label vectors of the input samples;
the training module is used for pre-training the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model; performing migration training on the pre-trained matching algorithm model by using a migration training data set and a loss function to obtain a migration trained matching algorithm model, wherein the migration trained matching algorithm model is used for matching recommendation between a clinical sample to be matched and a pre-clinical sample to be matched;
The matching algorithm model comprises an encoding module, a fusion module, a decoding module and a judging module; the encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample; the fusion module fuses the coded feature vectors to obtain fused feature vectors of the input samples; the judging module obtains a predicted tag vector of the input sample according to the fused feature vector and compares and judges the predicted tag vector with the tag vector of the input sample; the decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample;
the loss function is obtained by combining one or several sub-loss functions; the several sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch effect correction sub-loss function and a total data amount correction sub-loss function;
the reconstruction sub-loss function is used for correcting the difference between the feature vector of the input sample and the decoded feature vector;
the discrimination sub-loss function is used for correcting the difference between the predicted label vector of the input sample and the label vector;
the batch effect correction sub-loss function is used for correcting batch effects between the fused feature vectors of the input samples;
the total data amount correction sub-loss function is used for correcting the difference between the total amount of data of the decoded feature vector of the input sample and the total amount of data of the feature vector of the input sample;
the migration training is as follows: the parameters of the encoding module included in the pre-trained matching algorithm model are frozen first, and the parameters of the fusion module, the parameters of the discrimination module and the parameters of the decoding module included in the pre-trained matching algorithm model are then fine-tuned using the migration training data set and the loss function.
Optionally, the invention further provides a sample matching method, which comprises the following steps:
inputting the feature vector of the input sample into a matching algorithm model to obtain a fused feature vector of the clinical sample and a fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises: the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched, and the matching algorithm model is obtained by adopting any one of the algorithm model training methods described above;
matching the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, and the similarity can be quantitatively calculated by using methods such as cosine similarity, mutual information, distance measurement, correlation coefficient and the like.
Optionally, the present invention further provides a sample matching device, including:
the extraction module inputs the feature vector of the input sample into the matching algorithm model to obtain the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises: the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched are matched algorithm models obtained by adopting any algorithm model training method;
the matching module is used for matching the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, and the similarity can be quantitatively calculated by using methods such as, but not limited to, cosine similarity, mutual information, distance measurement, correlation coefficient and the like.
Optionally, the present invention further provides an electronic device (computer, server, smart phone, network device, etc.), including a memory and a processor, where the memory is configured to store a computer program executable by the processor, and the processor implements the method embodiment or the apparatus embodiment when executing the computer program.
Optionally, the present invention also provides a program product, such as a computer readable storage medium, comprising a program for performing the above-described method or apparatus embodiments when being executed by a processor.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional units are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods of the embodiments of the invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. An algorithm model training method, comprising:
creating a pre-training data set comprising a number of clinical samples and a migration training data set comprising a number of preclinical samples;
each clinical sample in the plurality of clinical samples comprises a feature vector of the clinical sample and a label vector of the clinical sample, wherein the feature vector is expressed as a plurality of preset feature sets, the data of the feature vector is omics data, and the label vector is expressed as a plurality of preset category labels; the clinical sample and the preclinical sample are used as input samples, and the feature vector and the label vector are used in a matching algorithm model as the feature vector of the input samples and the label vector of the input samples;
pre-training the matching algorithm model by using a pre-training data set and a loss function to obtain a pre-trained matching algorithm model;
performing migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migrated matching algorithm model, wherein the migrated matching algorithm model is used for matching recommendation between a clinical sample to be matched and a pre-clinical sample to be matched;
The matching algorithm model comprises an encoding module, a fusion module, a decoding module and a judging module; the encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample; the fusion module fuses the coded feature vectors to obtain fused feature vectors of the input samples; the judging module obtains a predicted tag vector of the input sample according to the fused feature vector, and compares and judges the predicted tag vector with the tag vector of the input sample; the decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample.
2. The algorithm model training method of claim 1, wherein the loss function is obtained by combining one or several sub-loss functions; the several sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch effect correction sub-loss function and a total data amount correction sub-loss function;
the reconstruction sub-loss function is used for correcting the difference between the feature vector of the input sample and the decoded feature vector;
the discrimination sub-loss function is used for correcting the difference between the predicted label vector of the input sample and the label vector of the input sample;
the batch effect correction sub-loss function is used for correcting batch effects among the fused feature vectors;
the total data amount correction sub-loss function is used for correcting the difference between the total amount of data of the decoded feature vector and the total amount of data of the feature vector of the input sample.
3. The method for training the algorithm model according to claim 2, wherein in the process of using the plurality of sub-loss functions for the loss function, the weight of each sub-loss function is dynamically adjusted through an attention mechanism layer and weight regularization is performed, so as to obtain a plurality of sub-loss function combinations.
4. A method of training an algorithmic model according to any one of claims 1 to 3, wherein the migration training is: and firstly freezing parameters of the coding module included in the pre-trained matching algorithm model, and further using the migration training data set and the loss function to finely adjust parameters of the fusion module, parameters of the judging module and parameters of the decoding module included in the pre-trained matching algorithm model.
5. The algorithmic model training method according to any of claims 1 to 4, wherein the clinical samples are tumor samples of clinical patients; the preclinical sample is a preclinical tumor in vitro or in vivo model sample;
the preclinical sample comprises any one or more of the following: in vitro tumor cell line models, in vivo PDX models, in vivo CDX models, samples of humanized animal models, and in vitro organoid models.
6. The algorithm model training method according to any one of claims 1 to 4, wherein the plurality of preset feature sets includes a coding gene feature set and a tumor gene feature set; the plurality of preset category labels comprise tumor type labels, the label vector comprises a tumor category label, and the predictive label vector comprises a predictive tumor category label; the omics data is transcriptomic data; the coding module in the matching algorithm model comprises a coding gene feature set coding submodule and a tumor gene feature set coding submodule;
the pre-training the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model, comprising:
adopting a coding gene feature set coding submodule in the matching algorithm model to code coding gene feature set data in the feature vector of the clinical sample to obtain a pre-training coding gene feature set coding feature of the clinical sample;
Adopting a tumor gene feature set coding submodule in the matching algorithm model to code tumor gene feature set data in feature vectors of the clinical samples to obtain pre-training tumor gene feature set coding features of the clinical samples;
adopting a fusion module in the matching algorithm model to fuse the pre-training encoding gene feature set encoding features and the pre-training tumor gene feature set encoding features to obtain pre-training fusion features of the clinical sample;
decoding the pre-training fusion characteristics by adopting a decoding module in the matching algorithm model to obtain pre-training encoding gene characteristic set decoding characteristics of the clinical samples and pre-training tumor gene decoding characteristics of the clinical samples;
and a judging module in the matching algorithm model is adopted, a predicted tumor type label in a predicted label vector of the clinical sample is obtained according to the pre-training fusion characteristic, and the predicted tumor type label and a tumor type label in a label vector of the clinical sample are compared and judged.
7. The algorithm model training method of claim 6, wherein the plurality of preset feature sets further comprises an immune gene feature set, the preset class labels further comprise an immune type label, the label vector further comprises an immune type label, and the predictive label vector further comprises a predictive immune type label; the coding module in the matching algorithm model also comprises an immune gene feature set coding submodule;
The pre-training the matching algorithm model by using the pre-training data set and the loss function to obtain a pre-trained matching algorithm model, and the method further comprises the following steps:
adopting an immune gene feature set coding submodule in the matching algorithm model to code immune gene feature set data in feature vectors of the clinical samples, so as to obtain pre-training immune gene feature set coding features;
adopting a fusion module in the matching algorithm model to fuse the pre-training encoding gene feature set encoding feature, the pre-training tumor gene feature set encoding feature and the pre-training immunity gene feature set encoding feature to obtain a pre-training fusion feature of the clinical sample;
decoding the pre-training fusion characteristic by adopting a decoding module in the matching algorithm model to obtain the pre-training encoding gene characteristic set decoding characteristic, the pre-training tumor gene decoding characteristic and the pre-training immune gene decoding characteristic;
and a judging module in the matching algorithm model is adopted, a predicted immunity type label in a predicted label vector of the clinical sample is obtained according to the pre-training fusion characteristic, and the predicted immunity type label and an immunity type label in a label vector of the clinical sample are compared and judged.
8. A method of sample matching, comprising:
inputting the feature vector of the input sample into a matching algorithm model to obtain the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises: the feature vector of the clinical sample to be matched and the feature vector of the pre-clinical sample to be matched, wherein the matching algorithm model is a matching algorithm model obtained by adopting the algorithm model training method of any one of the claims 1-7;
and matching the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample.
9. An algorithm model training apparatus, comprising:
a creation module, configured to create a pre-training data set and a migration training data set, wherein the pre-training data set comprises a plurality of clinical samples and the migration training data set comprises a plurality of preclinical samples; each of the clinical samples comprises a feature vector of the clinical sample and a label vector of the clinical sample, and each of the preclinical samples comprises a feature vector of the preclinical sample and a label vector of the preclinical sample, wherein the feature vector is expressed as a plurality of preset feature sets, the data of the feature vector is omics data, and the label vector is expressed as a plurality of preset category labels; the clinical samples and the preclinical samples serve as input samples, and the feature vectors and the label vectors are used in a matching algorithm model as feature vectors of the input samples and label vectors of the input samples;
a training module, configured to pre-train the matching algorithm model by using the pre-training data set and a loss function to obtain a pre-trained matching algorithm model, and to perform migration training on the pre-trained matching algorithm model by using the migration training data set and the loss function to obtain a migrated matching algorithm model, wherein the migrated matching algorithm model is used for matching recommendation between a clinical sample to be matched and a preclinical sample to be matched;
wherein the matching algorithm model comprises an encoding module, a fusion module, a decoding module and a judging module; the encoding module encodes the feature vector of the input sample to obtain an encoded feature vector of the input sample; the fusion module fuses the encoded feature vectors to obtain a fused feature vector of the input sample; the judging module obtains a predicted label vector of the input sample according to the fused feature vector and compares it with the label vector of the input sample; the decoding module decodes the fused feature vector to obtain a decoded feature vector of the input sample;
the loss function is obtained by combining one or more sub-loss functions; the sub-loss functions comprise a reconstruction sub-loss function, a discrimination sub-loss function, a batch effect correction sub-loss function and a data total amount correction sub-loss function;
the reconstruction sub-loss function is used for correcting the difference between the feature vector of the input sample and the decoded feature vector;
the discrimination sub-loss function is used for correcting the difference between the predicted label vector of the input sample and the label vector of the input sample;
the batch effect correction sub-loss function is used for correcting batch effects among the fused feature vectors;
the data total amount correction sub-loss function is used for correcting the difference between the total amount of data of the decoded feature vector and the total amount of data of the feature vector of the input sample;
the migration training is as follows: first freezing the parameters of the encoding module included in the pre-trained matching algorithm model, and then using the migration training data set and the loss function to fine-tune the parameters of the fusion module, the judging module and the decoding module included in the pre-trained matching algorithm model.
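The four sub-loss functions and the freeze-then-fine-tune migration training of claim 9 can be sketched as follows. This is a minimal illustration under assumed forms (mean squared error for reconstruction, cross-entropy for discrimination, distance between batch means for the batch effect term, absolute difference of sums for the data total term); the patent does not specify these exact formulas or the weights `w`:

```python
import numpy as np

def reconstruction_loss(x, x_dec):
    # Difference between the input feature vector and the decoded feature vector.
    return float(np.mean((x - x_dec) ** 2))

def discrimination_loss(p_pred, y_true):
    # Cross-entropy between predicted label probabilities and the true label index.
    return float(-np.log(p_pred[y_true] + 1e-12))

def batch_effect_loss(fused_batch_a, fused_batch_b):
    # Penalise the distance between mean fused vectors of two batches.
    return float(np.sum((fused_batch_a.mean(0) - fused_batch_b.mean(0)) ** 2))

def data_total_loss(x, x_dec):
    # Difference between the total amount of data before and after decoding.
    return float(abs(x.sum() - x_dec.sum()))

def total_loss(x, x_dec, p_pred, y_true, fused_a, fused_b, w=(1.0, 1.0, 1.0, 1.0)):
    # Combine the sub-loss functions; the weighting is an assumption.
    terms = (reconstruction_loss(x, x_dec),
             discrimination_loss(p_pred, y_true),
             batch_effect_loss(fused_a, fused_b),
             data_total_loss(x, x_dec))
    return sum(wi * t for wi, t in zip(w, terms))

# Migration training: freeze the encoding module, fine-tune the rest.
trainable = {"encoder": False, "fusion": True, "judge": True, "decoder": True}
```

In a real framework, freezing would typically be done by marking the encoder's parameters as non-trainable before running the fine-tuning optimizer over the remaining modules.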
10. A sample matching device, comprising:
an extraction module, configured to input the feature vector of an input sample into a matching algorithm model to obtain the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample, wherein the feature vector of the input sample comprises: the feature vector of the clinical sample to be matched and the feature vector of the preclinical sample to be matched, and the matching algorithm model is a matching algorithm model obtained by the algorithm model training method of any one of claims 1 to 7;
and a matching module, configured to match the fused feature vector of the clinical sample with the fused feature vector of the preclinical sample to obtain a matching result between the clinical sample and the preclinical sample, wherein the matching result comprises the similarity between the fused feature vector of the clinical sample and the fused feature vector of the preclinical sample.
11. An electronic device, comprising a memory and a processor, wherein the memory is configured to store a computer program executable by the processor, and the processor, when executing the computer program, implements the algorithm model training method of any one of claims 1 to 7, or the sample matching method of claim 8.
12. A computer readable medium, having a computer program stored thereon, wherein the computer program, when executed, implements the algorithm model training method of any one of claims 1 to 7, or the sample matching method of claim 8.
CN202311161425.6A 2023-09-11 2023-09-11 Algorithm model training and matching method and device, electronic equipment and medium Active CN116894191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311161425.6A CN116894191B (en) 2023-09-11 2023-09-11 Algorithm model training and matching method and device, electronic equipment and medium


Publications (2)

Publication Number Publication Date
CN116894191A true CN116894191A (en) 2023-10-17
CN116894191B CN116894191B (en) 2023-12-01

Family

ID=88313758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311161425.6A Active CN116894191B (en) 2023-09-11 2023-09-11 Algorithm model training and matching method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116894191B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061700A (en) * 2019-11-12 2020-04-24 山大地纬软件股份有限公司 Hospitalizing migration scheme recommendation method and system based on similarity learning
CN113487579A (en) * 2021-07-14 2021-10-08 广州柏视医疗科技有限公司 Multi-mode migration method for automatically sketching model
CN113948165A (en) * 2021-12-20 2022-01-18 易临云(深圳)科技有限公司 Subject screening method, system, device and computer-readable storage medium
US20220130134A1 (en) * 2019-02-06 2022-04-28 Google Llc Training Machine-Learned Models for Perceptual Tasks Using Biometric Data
CN115116624A (en) * 2022-06-29 2022-09-27 广西大学 Drug sensitivity prediction method and device based on semi-supervised transfer learning
CN115131630A (en) * 2022-07-20 2022-09-30 元码基因科技(苏州)有限公司 Model training method, microsatellite state prediction method, electronic device and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG SHI, ET AL: "Semi-Supervised Deep Transfer Learning for Benign-Malignant Diagnosis of Pulmonary Nodules in Chest CT Images", IEEE TRANSACTIONS ON MEDICAL IMAGING *
ZENG HANGQI; HUANG GUIXIN; WU JUNXIAN; ZHANG WUJUN: "Construction of a Hadoop-based Intelligent Auxiliary Diagnosis and Treatment Platform for Medical Big Data", CHINA DIGITAL MEDICINE, no. 09 *
XIAO ERLIANG; ZHOU YING; JIAN XIANZHONG: "Medical Image Fusion Model Combining Transfer Learning and GAN", JOURNAL OF CHINESE COMPUTER SYSTEMS, no. 09 *

Also Published As

Publication number Publication date
CN116894191B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Sun et al. Species groups distributed across elevational gradients reveal convergent and continuous genetic adaptation to high elevations
Palkopoulou et al. A comprehensive genomic history of extinct and living elephants
Hanschen et al. The Gonium pectorale genome demonstrates co-option of cell cycle regulation during the evolution of multicellularity
Chang et al. Genomic insights into the evolutionary origin of Myxozoa within Cnidaria
Leite et al. Neotropical forest expansion during the last glacial period challenges refuge hypothesis
Sun et al. Whole-genome sequence of the Tibetan frog Nanorana parkeri and the comparative evolution of tetrapod genomes
Patin et al. Inferring the demographic history of African farmers and Pygmy hunter–gatherers using a multilocus resequencing data set
NZ759880A (en) Aberrant splicing detection using convolutional neural networks (cnns)
Thome et al. Phylogeographic model selection leads to insight into the evolutionary history of four-eyed frogs
Sarai et al. Dinoflagellates with relic endosymbiont nuclei as models for elucidating organellogenesis
Passamonti et al. The quest for genes involved in adaptation to climate change in ruminant livestock
Prebus Phylogenomic species delimitation in the ants of the Temnothorax salvini group (Hymenoptera: Formicidae): an integrative approach
Jackson et al. The mitochondrial genomes of the glaucophytes Gloeochaete wittrockiana and Cyanoptyche gloeocystis: multilocus phylogenetics suggests a monophyletic Archaeplastida
Li et al. Functional networks of highest-connected splice isoforms: from the chromosome 17 human proteome project
Lee et al. MaizeNet: a co‐functional network for network‐assisted systems genetics in Zea mays
Chen et al. Improved Cladocopium goreaui genome assembly reveals features of a facultative coral symbiont and the complex evolutionary history of dinoflagellate genes
Iqbal et al. Predicting plant Rubisco kinetics from RbcL sequence data using machine learning
Su et al. Miocene diversification and high-altitude adaptation of Parnassius butterflies (Lepidoptera: Papilionidae) in Qinghai–Tibet plateau revealed by large-scale transcriptomic data
Weber et al. Speciation dynamics and extent of parallel evolution along a lake-stream environmental contrast in African cichlid fishes
Zhang et al. A phylogenomic framework and divergence history of Cephalochordata amphioxus
Zhang et al. Identifying novel fruit-related genes in Arabidopsis thaliana based on the random walk with restart algorithm
Segura et al. Progress and pitfalls in finding the ‘missing proteins’ from the human proteome map
Orlandi et al. Topiary: Pruning the manual labor from ancestral sequence reconstruction
Deng et al. The impact of sequencing depth and relatedness of the reference genome in population genomic studies: A case study with two caddisfly species (Trichoptera, Rhyacophilidae, Himalopsyche)
Hua Diverse evolution in 111 plant genomes reveals purifying and dosage balancing selection models for F-box genes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant