CN116072210A - Model training method, device, equipment and storage medium based on gene pair - Google Patents

Info

Publication number
CN116072210A
Authority
CN
China
Prior art keywords
gene
model
training
islet
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310209045.9A
Other languages
Chinese (zh)
Other versions
CN116072210B (en
Inventor
丁辉
谢雪琴
林昊
邓科君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310209045.9A priority Critical patent/CN116072210B/en
Publication of CN116072210A publication Critical patent/CN116072210A/en
Application granted granted Critical
Publication of CN116072210B publication Critical patent/CN116072210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physiology (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a model training method, device, equipment and storage medium based on gene pairs, and relates to the technical field of gene pair identification. The specific implementation scheme comprises the following steps: acquiring and processing a training sample set, wherein processing means converting the plurality of training samples of the training sample set from the gene expression matrix format into islet data comprising a code corresponding to the expression pattern of each reverse gene pair; a reverse gene pair is a pair of genes whose expression patterns are different or opposite in different classes of samples; the islet data is determined from the expression pattern of each reverse gene pair, the expression pattern of a gene pair being the relative magnitude relationship between the expression data values of the two genes that make up the gene pair; and training a preset initial model based on the training sample set to obtain a target model, the target model being used to predict the type of islet data from the reverse gene pairs. With this method, the training samples of the machine learning model for predicting islet data types are simple to acquire and stable.

Description

Model training method, device, equipment and storage medium based on gene pair
Technical Field
The invention relates to the technical field of gene pair identification, in particular to a model training method, device and equipment based on gene pairs and a storage medium.
Background
Insulin is a protein hormone.
Currently, the type of islet data used to characterize insulin levels is often predicted with machine learning models based on metabolomic data (e.g., biomarkers of the metabolome).
However, acquiring metabolomic data is time-consuming and laborious, and the data depend strongly on the acquisition method and conditions, so they may vary considerably.
Disclosure of Invention
The present invention aims to solve the above-mentioned drawbacks of the prior art and to provide a method, a device, an apparatus and a storage medium for model training based on gene pairs. The training samples used in the training method are easy to obtain, are not influenced by other conditions, and are stable.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a method for model training based on gene pairs, the method comprising:
acquiring and processing a training sample set; the training sample set comprises a plurality of training samples, and each training sample is converted from the original gene expression matrix format into islet data comprising codes corresponding to a plurality of reverse gene pairs; a reverse gene pair is a pair of genes whose expression patterns are different or opposite in two different classes of samples; each code is determined based on the expression pattern of the corresponding reverse gene pair; the expression pattern of a reverse gene pair is the relative order relationship between the expression data values of the two genes composing the gene pair; training a preset initial model based on the training sample set to obtain a target model; the target model is used to predict the type of islet data based on the reverse gene pairs.
In one possible implementation, obtaining a training sample set includes: acquiring a plurality of initial matrices from a public database; each initial matrix is a gene expression matrix from islet transcriptome sequencing; each row of each initial matrix is the gene expression data of one gene, and each column is an islet sample of a first type or a second type; each initial matrix includes transcriptome gene expression data of the first type of islet samples and transcriptome gene expression data of the second type of islet samples; the transcriptome gene expression data of the first type of islet samples differ from those of the second type; determining a training set submatrix of each initial matrix according to the plurality of initial matrices; each training set submatrix comprises transcriptome gene expression data of the first type of islet samples and of the second type of islet samples; screening out, from the training set submatrix of each initial matrix, the gene expression data of a plurality of candidate gene pairs corresponding to the first type of islet samples and the second type of islet samples; wherein a candidate gene pair is a gene pair that has the same expression pattern in a proportion of the samples reaching a preset proportion threshold; the expression pattern of a gene pair is the relative expression ordering (REO) of the gene pair in a sample (that is, the relative order relationship between the expression data values of the two genes constituting the gene pair); screening out the gene expression data of a plurality of reverse gene pairs of each initial matrix from the gene expression data of the plurality of candidate gene pairs of that initial matrix; wherein the expression pattern of a reverse gene pair is different or opposite in the two different types of islet samples; taking the intersection of the reverse gene pairs of all initial matrices to obtain a plurality of reverse gene pairs shared by the plurality of initial matrices; encoding the gene expression data of the shared reverse gene pairs in the plurality of initial matrices according to a preset encoding rule to obtain, for each initial matrix, islet data comprising a code corresponding to the expression pattern of each reverse gene pair; each row of the islet data represents a training sample, and each column represents the expression pattern of one shared reverse gene pair; and obtaining the training sample set based on the islet data corresponding to each of the plurality of initial matrices.
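The screening and intersection steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function names, the dictionary layout of a matrix (gene name mapped to one expression value per sample), the 0.9 threshold and the toy values are all hypothetical, and a candidate pair is taken to be one whose ordering holds in at least the threshold fraction of the samples of one class.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Matrix = Dict[str, List[float]]   # gene name -> expression values, one per sample
GenePair = Tuple[str, str]


def stable_pairs(matrix: Matrix, threshold: float) -> Dict[GenePair, int]:
    """Map each candidate pair to 1 (gene_a > gene_b) or 0 (gene_a < gene_b).

    A pair is kept only if one ordering holds in at least `threshold`
    (a fraction above 0.5) of the samples in this class.
    """
    result: Dict[GenePair, int] = {}
    for a, b in combinations(sorted(matrix), 2):
        n = len(matrix[a])
        greater = sum(x > y for x, y in zip(matrix[a], matrix[b]))
        less = sum(x < y for x, y in zip(matrix[a], matrix[b]))
        if greater / n >= threshold:
            result[(a, b)] = 1
        elif less / n >= threshold:
            result[(a, b)] = 0
    return result


def reverse_pairs(class1: Matrix, class2: Matrix, threshold: float) -> Set[GenePair]:
    """Pairs that are stable in both classes but with opposite orderings."""
    s1 = stable_pairs(class1, threshold)
    s2 = stable_pairs(class2, threshold)
    return {pair for pair in s1.keys() & s2.keys() if s1[pair] != s2[pair]}


def shared_reverse_pairs(datasets: List[Tuple[Matrix, Matrix]],
                         threshold: float = 0.9) -> Set[GenePair]:
    """Intersect the reverse pairs found in every (class1, class2) dataset."""
    found = [reverse_pairs(c1, c2, threshold) for c1, c2 in datasets]
    return set.intersection(*found) if found else set()


# Toy data: two datasets, each split into class-1 and class-2 samples.
dataset_a = ({"G1": [5, 6, 7], "G2": [1, 2, 3], "G3": [4, 4, 4]},
             {"G1": [1, 2, 1], "G2": [5, 6, 7], "G3": [4, 4, 4]})
dataset_b = ({"G1": [9, 8], "G2": [2, 3], "G3": [5, 5]},
             {"G1": [2, 1], "G2": [7, 9], "G3": [1, 0]})

shared = shared_reverse_pairs([dataset_a, dataset_b])
print(sorted(shared))  # [('G1', 'G2'), ('G2', 'G3')]
```

In the toy data the pair (G1, G3) reverses in the first dataset only, so the intersection drops it.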
In one possible implementation, training the preset initial model based on the training sample set, where the training sample set is obtained from the gene expression data of the reverse gene pair intersection in the plurality of initial matrices, includes: using a minimum redundancy maximum relevance (mRMR) algorithm to rank by importance, within each encoded islet data, the reverse gene pairs shared by the plurality of initial matrices, obtaining a ranking result of each reverse gene pair in the respective islet data; weighting each reverse gene pair according to its ranking results in the respective islet data to obtain an importance ranking result of each reverse gene pair in the training sample set; determining the target feature dimension of the training sample set according to the importance ranking result of each reverse gene pair in the training sample set; the target feature dimension indicates the number of feature dimensions to be learned by the initial model, namely the number of reverse gene pairs in the training sample set; according to the importance ranking of the reverse gene pairs in the training sample set from strong to weak, selecting successively increasing numbers of feature dimensions (different reverse gene pairs) from the training sample set, and sequentially inputting the training sample sets comprising the different reverse gene pairs into the preset initial model for training.
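The weighting step can be illustrated with a short sketch. The text does not give the weighting formula, so the weighted mean rank position used here is an assumption, as are the function name `combine_rankings` and the example weights; the per-dataset rankings are taken as given (for example, as produced by an mRMR run, which is not implemented here).

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def combine_rankings(rankings: List[List[GenePair]],
                     weights: List[float]) -> List[GenePair]:
    """Merge per-dataset rankings (most important first) into one ranking.

    Assumption: each pair is scored by its weighted mean rank position,
    and every ranking contains the same shared reverse gene pairs.
    """
    total = sum(weights)
    scores: Dict[GenePair, float] = {}
    for ranking, weight in zip(rankings, weights):
        for position, pair in enumerate(ranking):
            scores[pair] = scores.get(pair, 0.0) + weight * position / total
    # A lower weighted mean position means higher importance; ties break by name.
    return sorted(scores, key=lambda pair: (scores[pair], pair))


p1, p2, p3 = ("G1", "G2"), ("G2", "G3"), ("G1", "G3")
ranking_a = [p1, p2, p3]   # hypothetical ranking from the first islet data
ranking_b = [p2, p1, p3]   # hypothetical ranking from the second islet data

# Hypothetical weights, e.g. proportional to the number of samples per dataset.
overall = combine_rankings([ranking_a, ranking_b], weights=[2.0, 1.0])
print(overall)  # [('G1', 'G2'), ('G2', 'G3'), ('G1', 'G3')]
```

Other aggregation schemes (for example, rank sums or score averaging) would fit the description equally well.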
In one possible implementation, determining the target feature dimension of the training sample set based on the importance ranking result of each reverse gene pair in the training sample set includes: using an incremental feature selection (IFS) method, increasing the number of reverse gene pairs in the training sample set one by one in order of importance from strong to weak, and inputting them into the initial model to obtain the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each number of feature dimensions; the number of feature dimensions corresponding to the maximum AUC is taken as the target feature dimension (i.e., the optimal set of reverse gene pairs).
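A minimal sketch of the IFS loop follows. It assumes a caller-supplied scoring function that trains the initial model on the given features and returns an AUC; `incremental_feature_selection`, `score_fn` and the toy AUC values are hypothetical names and numbers used only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[int, float, List[float]]:
    """Evaluate the top-1, top-2, ... top-n features and keep the best count.

    `ranked_features` must be ordered from most to least important.
    Returns (best feature count, its score, all scores in order).
    """
    if not ranked_features:
        raise ValueError("ranked_features must not be empty")
    scores = [score_fn(list(ranked_features[:k]))
              for k in range(1, len(ranked_features) + 1)]
    # max() returns the first maximum, so ties favor the smaller feature set.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores[best_index], scores


# Toy scoring function: pretends the AUC depends only on the feature count.
toy_auc = {1: 0.80, 2: 0.91, 3: 0.95, 4: 0.93}
features = ["pair_1", "pair_2", "pair_3", "pair_4"]

best_k, best_auc, curve = incremental_feature_selection(
    features, lambda selected: toy_auc[len(selected)])
print(best_k, best_auc)  # 3 0.95
```

In practice `score_fn` would wrap model training and AUC evaluation on the selected columns of the islet data.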
Optionally, the method further comprises: before using the incremental feature selection (IFS) method to increase the number of reverse gene pairs in the training sample set one by one in order of importance from strong to weak and to obtain the AUC corresponding to each number of feature dimensions, inputting the training sample set comprising all reverse gene pairs into each of a plurality of candidate models to obtain the AUC of the ROC curve corresponding to each candidate model; and taking the candidate model corresponding to the maximum AUC as the initial model.
Optionally, the candidate model includes any one of the following: a support vector machine model, a logistic regression model, a random forest model, or an extreme gradient boosting (XGBoost) tree model.
Optionally, the method further comprises: after a plurality of initial matrixes are obtained from a public database, determining a test set sub-matrix of each initial matrix according to the plurality of initial matrixes; obtaining a test sample set according to the respective test set submatrices of each initial matrix; the test sample set includes a plurality of test samples; training a preset initial model based on a training sample set to obtain a target model, wherein the training comprises the following steps: training the initial model based on the training sample set to obtain a model to be tested; based on the test sample set, testing the model to be tested to obtain a test result; and if the test result meets the preset condition, taking the model to be tested as a target model.
In a second aspect, the present invention provides a model training apparatus based on gene pairs, the apparatus comprising: an acquisition module and a processing module.
The acquisition module is used for acquiring and processing the training sample set; the training sample set comprises a plurality of training samples, and each training sample is processed into islet data comprising a plurality of corresponding codes of the reverse gene pairs by an original gene expression matrix format; a pair of genes is a pair of genes whose expression patterns are different or opposite in two different classes of samples; each coding is determined based on the expression pattern of each of the pairs of genes; the expression pattern of a pair of genes is the relative size order relationship between the expression data values of the two genes that make up the pair.
The processing module is used for training a preset initial model based on the training sample set to obtain a target model; the target model is used to predict the type of islet data based on the reverse gene pairs.
Optionally, the acquisition module is specifically configured to: acquire a plurality of initial matrices from a public database; each initial matrix is a gene expression matrix from islet transcriptome sequencing; each row of each initial matrix is the gene expression data of one gene, and each column is an islet sample of a first type or a second type; each initial matrix includes transcriptome gene expression data of the first type of islet samples and transcriptome gene expression data of the second type of islet samples; the transcriptome gene expression data of the first type of islet samples differ from those of the second type; determine a training set submatrix of each initial matrix according to the plurality of initial matrices; each training set submatrix comprises transcriptome gene expression data of the first type of islet samples and of the second type of islet samples; screen out, from the training set submatrix of each initial matrix, the gene expression data of a plurality of candidate gene pairs corresponding to the first type of islet samples and the second type of islet samples; wherein a candidate gene pair is a gene pair that has the same expression pattern in a proportion of the samples reaching a preset proportion threshold; the expression pattern of a gene pair is the relative expression ordering (REO) of the gene pair in a sample (that is, the relative order relationship between the expression data values of the two genes constituting the gene pair); screen out the gene expression data of a plurality of reverse gene pairs of each initial matrix from the gene expression data of the plurality of candidate gene pairs of that initial matrix; wherein the expression pattern of a reverse gene pair is different or opposite in the two different types of islet samples; take the intersection of the reverse gene pairs of all initial matrices to obtain a plurality of reverse gene pairs shared by the plurality of initial matrices; encode the gene expression data of the shared reverse gene pairs in the plurality of initial matrices according to a preset encoding rule to obtain, for each initial matrix, islet data comprising a code corresponding to the expression pattern of each reverse gene pair; each row of the islet data represents a training sample, and each column represents the expression pattern of one shared reverse gene pair; and obtain the training sample set based on the islet data corresponding to each of the plurality of initial matrices.
Optionally, the processing module is specifically configured to: use a minimum redundancy maximum relevance (mRMR) algorithm to rank by importance, within each encoded islet data, the reverse gene pairs shared by the plurality of initial matrices, obtaining a ranking result of each reverse gene pair in the respective islet data; weight each reverse gene pair according to its ranking results in the respective islet data to obtain an importance ranking result of each reverse gene pair in the training sample set; determine the target feature dimension of the training sample set according to the importance ranking result of each reverse gene pair in the training sample set; the target feature dimension indicates the number of feature dimensions to be learned by the initial model, namely the number of reverse gene pairs in the training sample set; according to the importance ranking of the reverse gene pairs in the training sample set from strong to weak, select successively increasing numbers of feature dimensions (different reverse gene pairs) from the training sample set, and sequentially input the training sample sets comprising the different reverse gene pairs into the preset initial model for training.
Optionally, the processing module is specifically configured to use an incremental feature selection (IFS) method to increase the number of reverse gene pairs in the training sample set one by one in order of importance from strong to weak, and input them into the initial model to obtain the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each number of feature dimensions; the number of feature dimensions corresponding to the maximum AUC is taken as the target feature dimension (i.e., the optimal set of reverse gene pairs).
Optionally, the processing module is further configured to, before using the IFS method to increase the number of reverse gene pairs in the training sample set one by one in order of importance from strong to weak and to obtain the AUC corresponding to each number of feature dimensions, input the training sample set comprising all reverse gene pairs into each of a plurality of candidate models to obtain the AUC of the ROC curve corresponding to each candidate model; and take the candidate model corresponding to the maximum AUC as the initial model.
Optionally, the candidate model includes any one of the following: a support vector machine model, a logistic regression model, a random forest model, or an extreme gradient boosting (XGBoost) tree model.
Optionally, the acquiring module is further configured to determine, after acquiring a plurality of initial matrices from the public database, a respective test set sub-matrix of each initial matrix according to the plurality of initial matrices; obtaining a test sample set according to the respective test set submatrices of each initial matrix; the test sample set includes a plurality of test samples; the processing module is specifically used for training the initial model based on the training sample set to obtain a model to be tested; based on the test sample set, testing the model to be tested to obtain a test result; and if the test result meets the preset condition, taking the model to be tested as a target model.
In a third aspect, the present invention provides an electronic device comprising: the system comprises a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor and the storage medium are communicated through the bus, and the processor executes the machine-readable instructions to execute the steps of any method in the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods of the first aspect described above.
The beneficial effects of the invention are as follows: the model training method based on gene pairs can use islet reverse gene pair data as training samples to train a target model for predicting islet data types; compared with the existing approach of using metabolomic data as training samples, islet transcriptome gene data are simpler to acquire, do not change with the acquisition method and acquisition conditions, and are more stable.
In addition, the prediction of islet data is complex, and a single gene cannot accurately reflect the characteristics of islet data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a model training method based on gene pairs according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another method for training a model based on gene pairs according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of another embodiment of a model training method based on gene pairs;
FIG. 4 is a schematic diagram of model performance provided by an embodiment of the present invention;
FIG. 5 is a schematic view of tissue specificity provided by an embodiment of the present invention;
FIG. 6 is a schematic flow chart of another embodiment of a method for training a model based on gene pairs;
FIG. 7 is a schematic diagram of the model training device based on gene pairs according to the embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Description of the embodiments
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
Insulin is a protein hormone.
Currently, islet data used to characterize insulin levels are often predicted with machine learning models based on metabolomic data (e.g., biomarkers of the metabolome).
However, acquiring metabolomic data is time-consuming and laborious, and the data depend strongly on the acquisition method and conditions, so they may vary considerably.
Based on the above, the embodiments of the invention provide a model training method, device, equipment and storage medium based on gene pairs, in which islet data of islet transcriptome gene pairs can be used as training samples to train a target model; the target model can predict islet data types based on transcriptome data, and its training samples are stable.
The execution subject of the model training method based on gene pairs provided by the embodiments of the invention may be a model training device based on gene pairs. Optionally, the model training device may be an electronic device with data processing capability, such as a desktop computer, a notebook computer, a server, a cloud server, an intelligent terminal or a tablet computer; alternatively, the model training device may be a processor (e.g., a central processing unit (CPU)) in the aforementioned electronic device; still alternatively, the model training device may be an application (APP) installed in the electronic device for executing the model training method based on gene pairs; alternatively, the model training device may be a functional module in the electronic device that has the gene-pair-based model training function. The embodiments of the present invention are not limited in this regard.
For simplicity of description, the following description will take a model training device based on gene pairs as an example of an electronic device.
FIG. 1 is a flow chart of a model training method based on gene pairs according to an embodiment of the present invention. As shown in fig. 1, the method includes S101 to S102.
S101, the electronic equipment acquires a training sample set.
The training sample set comprises a plurality of training samples, and each training sample is converted from the original gene expression matrix format into islet data comprising a code corresponding to the expression pattern of each reverse gene pair. A reverse gene pair is a pair of genes whose expression patterns are different or opposite in two different classes of samples; each code is determined based on the expression pattern of the corresponding reverse gene pair; the expression pattern of a reverse gene pair is the relative order relationship between the expression data values of the two genes composing the gene pair. For example, for a gene pair comprising gene 1 and gene 2, the expression pattern of the gene pair means that the expression value of gene 1 is stably greater than (or stably less than) the expression value of gene 2.
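As an illustration of how one sample might be turned into such codes, the sketch below encodes each gene pair as 1 when the first gene's expression value exceeds the second's and as 0 otherwise. The text only refers to a preset encoding rule, so this 1/0 rule, the function name `encode_sample` and the gene names and values are assumptions.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def encode_sample(expression: Dict[str, float],
                  pairs: List[GenePair]) -> List[int]:
    """Encode one sample: 1 if gene_a > gene_b for the pair, otherwise 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in pairs]


# Hypothetical expression values for one islet sample.
sample = {"GENE1": 8.2, "GENE2": 3.1, "GENE3": 5.0}
pairs = [("GENE1", "GENE2"), ("GENE2", "GENE3")]

encoded = encode_sample(sample, pairs)
print(encoded)  # [1, 0]
```

Because only the ordering within each pair is used, the code is unchanged by any monotonic rescaling of a sample's expression values.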
The specific process of S101 may be described in the following embodiments, and will not be described herein.
S102, the electronic equipment trains a preset initial model based on the training sample set to obtain a target model.
The target model can predict the type of islet data based on the reverse gene pairs.
For example, as described above, the training sample set may include a plurality of training samples. The electronic device first inputs the training sample set containing all feature dimensions into each of several different preset initial models for initial model selection, so as to select the classification algorithm best suited to the data, and trains it to obtain the initial model. The electronic device then optimizes the trained initial model in two respects. The first is parameter adjustment: using cross-validation, the electronic device repeatedly divides the training sample set into two parts, one used to train the model and the other used to evaluate the influence of different parameters on model performance, so as to adjust the parameters of the initial model and obtain the optimal parameters. The second is feature selection: the electronic device ranks the features (the reverse gene pairs) by importance and uses incremental feature selection (IFS) to select the optimal feature subset. The initial model is then optimized based on the optimal feature subset and the optimal parameters to obtain the optimal model to be tested.
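The cross-validated parameter adjustment described above can be sketched as follows. The number of folds, the interleaved fold assignment, the function names and the toy evaluation function are assumptions; a real `evaluate` would train the initial model on the training indices with the given parameter value and return a performance score on the validation indices.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def tune_parameter(
    candidates: Sequence[float],
    n_samples: int,
    k: int,
    evaluate: Callable[[float, List[int], List[int]], float],
) -> Tuple[float, float]:
    """Return the candidate value with the best mean validation score."""
    best_value, best_score = candidates[0], float("-inf")
    for value in candidates:
        fold_scores = [evaluate(value, train, validation)
                       for train, validation in k_fold_indices(n_samples, k)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score


# Toy evaluation: pretends performance peaks when the parameter equals 1.0.
def toy_evaluate(value: float, train: List[int], validation: List[int]) -> float:
    return 1.0 - abs(value - 1.0)


best_value, best_score = tune_parameter([0.1, 1.0, 10.0], n_samples=6, k=3,
                                        evaluate=toy_evaluate)
print(best_value, best_score)  # 1.0 1.0
```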
The prediction performance of the model to be tested on the training sample set is presented as an ROC curve and its AUC. Inputting each training sample into the model to be tested yields a predicted value (the type of islet data predicted by the model to be tested) and the probability that the training sample belongs to a given type of islet data. From these probabilities and the actual types of the samples, the ROC curve can be drawn and the area under the curve (AUC) can be calculated.
The model training method based on gene pairs provided by the embodiments of the invention can use islet data of islet transcriptome gene pairs as training samples to train a target model; compared with the existing approach of using metabolomic data as training samples, islet transcriptome gene data are simpler to acquire, do not change with the acquisition method and acquisition conditions, and are more stable.
In addition, prediction of islet data is complex, and a single gene cannot accurately reflect the characteristics of islet data; the model training method provided by the embodiments of the invention obtains training samples from the gene expression data of multiple gene pairs and can therefore reflect the characteristics of islet data more comprehensively.
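The ROC curve and AUC computation can be sketched as below, using a threshold sweep over the predicted probabilities and the trapezoidal rule for the area. The function names and the toy labels and probabilities are hypothetical; label 1 stands for the islet data type treated as positive.

```python
from typing import List, Tuple


def roc_curve_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Integrate the curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: actual types and predicted probabilities of the positive type.
labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve_points(labels, probabilities)
auc_value = area_under_curve(points)
print(round(auc_value, 4))  # 0.8889
```

The result equals 8/9 because 8 of the 9 positive-negative sample pairs are ordered correctly by the predicted probabilities.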
The following describes S101.
In some possible embodiments, fig. 2 is another flow chart of a model training method based on gene pairs according to an embodiment of the present invention. As shown in fig. 2, S101 may specifically include S1011 to S1016.
S1011, the electronic equipment acquires a plurality of initial matrixes from the public database.
The public database may be the Gene Expression Omnibus (GEO) database. Each initial matrix is a gene expression matrix from islet transcriptome sequencing: each row is the gene expression data of one gene, and each column is an islet sample of a first type or a second type. Each initial matrix thus comprises transcriptome gene expression data of first-type islet samples and of second-type islet samples, and the gene expression data of the two types differ.
S1012, the electronic equipment determines a training set submatrix of each initial matrix according to the plurality of initial matrices.
For example, as described above, each row of each initial matrix is the gene expression data of one gene and each column is an islet sample of the first type or the second type. The electronic device takes the gene expression data of a first proportion of the samples in an initial matrix as the training set submatrix of that initial matrix.
The first proportion may be preset in the electronic device by a manager. For example, the first proportion may be 70% or 75%. The embodiment of the present invention does not limit the specific value of the first proportion.
Optionally, the electronic device may further take the gene expression data of a second proportion of the samples in an initial matrix as the test set submatrix of that initial matrix, and obtain a test sample set from the test set submatrix to test the trained model. The specific procedure is described in the following embodiments and is not repeated here.
As another example, as described above, each initial matrix includes gene expression data of first-type islet samples and of second-type islet samples. The electronic device may select the first proportion of samples separately from the first-type samples and from the second-type samples of an initial matrix, and take the gene expression data of the samples selected from both types as the training set submatrix of that initial matrix.
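The per-type selection described above can be sketched as follows. This is an illustrative sketch only: the function name `stratified_split`, the random shuffling, the seed, and the rounding of the sample count are assumptions, since the text does not state how the samples of the first proportion are chosen.

```python
import random


def stratified_split(sample_types, first_proportion=0.7, seed=0):
    """Split sample IDs into training and test lists, type by type.

    sample_types: dict mapping a sample ID (matrix column) to its islet type.
    first_proportion: share of each type assigned to the training set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_type = {}
    for sample_id, islet_type in sample_types.items():
        by_type.setdefault(islet_type, []).append(sample_id)

    train_ids, test_ids = [], []
    for islet_type in sorted(by_type):
        ids = sorted(by_type[islet_type])
        rng.shuffle(ids)
        n_train = round(first_proportion * len(ids))  # assumed rounding rule
        train_ids.extend(ids[:n_train])
        test_ids.extend(ids[n_train:])
    return train_ids, test_ids
```

The columns named in `train_ids` would form the training set submatrix, and the remaining columns could serve as the test set submatrix.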
S1013, the electronic equipment respectively screens out gene expression data of a plurality of candidate gene pairs corresponding to the first type of islet samples and the second type of islet samples from the training set submatrices of each initial matrix.
A candidate gene pair is a gene pair whose expression pattern is the same in at least a preset proportion threshold of the samples of the same type. The expression pattern of a gene pair is its within-sample relative expression ordering (REO), that is, the relative order of magnitude between the expression values of the two genes that make up the pair. The proportion threshold may be preset in the electronic device by the manager; for example, it may be 60% or 65%. As the proportion threshold increases, the stability of the resulting candidate gene pairs also increases. The embodiment of the invention does not limit the specific value of the proportion threshold.
For a gene pair (e.g., G1-G2), if its expression pattern (i.e., G1 > G2 or G1 < G2) is stable in most samples of the same type, the pair is said to have a stable REO; here, "most samples of the same type" means that the share of such samples reaches the above preset proportion threshold. In general, the higher the preset proportion threshold, the more stable the expression pattern of the gene pair.
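The screening of stable REOs within one sample type can be sketched as below. The function name `stable_reos`, the dictionary-based data layout, and the use of an inclusive comparison (`>=`) against the threshold are assumptions for illustration; the sketch also assumes a threshold above 50%, so that the two orderings cannot both qualify.

```python
from itertools import combinations


def stable_reos(expression, threshold=0.6):
    """Return {(g1, g2): +1 or -1} for gene pairs with a stable ordering.

    expression maps a gene name to its expression values across the samples
    of ONE islet type (same sample order for every gene).
    +1 means g1 > g2, and -1 means g1 < g2, in at least `threshold` of the samples.
    """
    stable = {}
    for g1, g2 in combinations(sorted(expression), 2):
        values = list(zip(expression[g1], expression[g2]))
        if not values:
            continue
        frac_gt = sum(a > b for a, b in values) / len(values)
        frac_lt = sum(a < b for a, b in values) / len(values)
        if frac_gt >= threshold:
            stable[(g1, g2)] = 1
        elif frac_lt >= threshold:
            stable[(g1, g2)] = -1
    return stable
```

Note that enumerating all pairs grows quadratically with the number of genes, so a real implementation on a full transcriptome would likely use vectorized matrix operations instead.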
S1014, the electronic equipment screens out the gene expression data of the plurality of inverse gene pairs of each initial matrix from the gene expression data of the plurality of candidate gene pairs of each initial matrix.
The reverse gene pairs in S1014 are selected from the gene pairs with stable REOs screened in S1013, so the resulting reverse gene pairs are themselves stable. A reverse gene pair is a gene pair whose expression pattern is different or opposite in the two types of islet samples. That is, when a gene pair (e.g., G1-G2) has a stable REO in the first-type samples and a stable REO in the second-type samples, and the two expression patterns are opposite, the pair may be called a stable reverse gene pair (i.e., a reverse gene pair selected in S1014).
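A minimal sketch of this selection follows, assuming the stable REOs of each sample type are already available as dictionaries mapping a gene pair to +1 (G1 > G2) or -1 (G1 < G2). The function name `reverse_gene_pairs` and this representation are illustrative assumptions.

```python
def reverse_gene_pairs(reos_type1, reos_type2):
    """Return the gene pairs that are stable in both types with opposite orderings.

    Each argument maps (g1, g2) -> +1 or -1, the stable ordering in one islet type.
    """
    return {
        pair
        for pair, sign in reos_type1.items()
        if pair in reos_type2 and reos_type2[pair] == -sign
    }


# Toy example: only A-B is stable in both types AND flips its ordering.
type1 = {("A", "B"): 1, ("A", "C"): -1, ("B", "C"): 1}
type2 = {("A", "B"): -1, ("A", "C"): -1, ("C", "D"): 1}
example_reverse = reverse_gene_pairs(type1, type2)
```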
S1015, the electronic equipment encodes gene expression data of intersections of the reverse gene pairs in the plurality of initial matrixes according to a preset encoding rule to obtain islet data which respectively correspond to the plurality of initial matrixes and comprise codes corresponding to the expression modes of the reverse gene pairs.
The electronic device takes the intersection of the reverse gene pairs of the individual initial matrices obtained in S1014 to obtain the reverse gene pairs shared by all of the initial matrices. It then encodes the gene expression data of these shared reverse gene pairs in each initial matrix according to the preset encoding rule, obtaining, for each initial matrix, islet data comprising the codes corresponding to the expression patterns of the reverse gene pairs.
Wherein each row in the encoded islet data represents a training sample and each column in the islet data represents the expression pattern of a gene pair.
Illustratively, the preset encoding rule may be: 1 denotes G1 > G2, -1 denotes G1 < G2, and 0 denotes that at least one of G1 and G2 of the reverse gene pair is not present, where G1 and G2 represent the expression levels of gene 1 and gene 2, respectively.
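The intersection and encoding steps can be sketched as follows. The function names `shared_pairs` and `encode_sample` are hypothetical, and mapping a tie (G1 equal to G2) to 0 is an assumption, since the rule above does not say how ties are handled.

```python
def shared_pairs(pair_sets):
    """Reverse gene pairs present in every initial matrix, in a fixed order."""
    if not pair_sets:
        return []
    return sorted(set.intersection(*(set(pairs) for pairs in pair_sets)))


def encode_sample(expression, pairs):
    """Encode one sample as a list of codes, one per reverse gene pair.

    expression: dict mapping gene name -> expression value for this sample.
    Codes: 1 if G1 > G2, -1 if G1 < G2, 0 if either gene is missing.
    """
    codes = []
    for g1, g2 in pairs:
        if g1 not in expression or g2 not in expression:
            codes.append(0)
        elif expression[g1] > expression[g2]:
            codes.append(1)
        elif expression[g1] < expression[g2]:
            codes.append(-1)
        else:
            codes.append(0)  # assumption: ties are not covered by the stated rule
    return codes
```

Each encoded sample corresponds to one row of the islet data, and each position in the list corresponds to one column (one gene pair).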
S1016, the electronic device obtains a training sample set based on islet data corresponding to each of the plurality of initial matrices.
In the gene-pair-based model training method provided by the embodiment of the invention, the initial gene pair extraction stage obtains the reverse gene pairs of each initial matrix separately and then takes their intersection, rather than simply merging the initial matrices and then extracting features. As a result, the subsequently constructed model generalizes better and is more stable.
In addition, because the method relies on the relative expression order between genes, it is not affected by batch effects between the gene expression levels of different initial matrices, which increases the credibility of the target model.
In some possible embodiments, the electronic device may also filter the number of feature dimensions entered into the initial model. In this case, the step S102 may specifically include the following steps:
Step 1, the electronic device uses the minimum redundancy maximum relevance (mRMR) algorithm to rank, within the encoded islet data of each initial matrix, the importance of the reverse gene pairs shared by the plurality of initial matrices, obtaining a ranking result for each reverse gene pair in each set of islet data.
Based on mutual information between variables, the mRMR algorithm computes the importance ranking of the gene pairs from the redundancy among different gene pairs and the relevance between each gene pair and the islet sample type.
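To make the idea concrete, a greedy mRMR ranking over discrete codes can be sketched as below. The text does not specify which mRMR variant is used; this sketch assumes the common "relevance minus mean redundancy" (mutual information difference) criterion, and the function names are illustrative.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr_rank(features, labels):
    """Order feature names from most to least important.

    features: dict mapping a gene-pair name to its list of codes (1, -1, 0).
    labels: islet type of each sample, in the same sample order.
    """
    relevance = {name: mutual_information(codes, labels)
                 for name, codes in features.items()}
    remaining = sorted(features)
    selected = []
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With this criterion, an exact copy of an already selected gene pair is pushed down the ranking even if it is individually informative, because it adds redundancy without adding new information.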
Step 2, the electronic device weights each reverse gene pair according to its ranking result in each set of islet data, obtaining the importance ranking result of each reverse gene pair in the training sample set.
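The text does not state the weighting scheme. As one hypothetical possibility, the per-dataset rankings could be combined by a weighted mean rank, with each dataset weighted for example by its number of samples. The function name `aggregate_rankings` and the choice of weights are assumptions made purely for illustration.

```python
def aggregate_rankings(rankings, weights):
    """Combine several best-first rankings into one by weighted mean rank.

    rankings: list of lists, each ordering the same gene pairs best-first.
    weights: one non-negative weight per ranking (assumed: dataset sample counts).
    """
    total = float(sum(weights))
    mean_rank = {}
    for order, weight in zip(rankings, weights):
        for position, pair in enumerate(order, start=1):
            mean_rank[pair] = mean_rank.get(pair, 0.0) + weight * position / total
    # A lower weighted mean rank means higher importance; names break ties.
    return sorted(mean_rank, key=lambda pair: (mean_rank[pair], pair))
```

Other schemes, such as weighting by per-dataset model performance, would fit the description equally well.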
Step 3, the electronic device determines the target feature dimension of the training sample set according to the importance ranking result of each reverse gene pair in the training sample set.
The target feature dimension is used for indicating the number of feature dimensions that the initial model needs to learn, namely the number of reverse gene pairs in the training sample set.
Illustratively, the importance ranking of the reverse gene pairs in the training sample set may be as shown in Table 1 below (only the 15 top-ranked reverse gene pairs are shown):
TABLE 1
Importance ranking    Gene pair
1                     SYS1|TMEM37
2                     ARG2|POLB
3                     PFKFB2|TSTY5
4                     MMD|SVOP
5                     MYOM1|FAXC
6                     ARG2|DCK
7                     PDE4B|SOX5
8                     APCDD1L|SOCS1
9                     SETD7|INHBA
10                    ARG2|BRI3BP
11                    ARG2|VAMP4
12                    PCSK9|RHBDF2
13                    HS6ST2|RILPL1
14                    ATP2A3|PON2
15                    CHL1|PKIA
As shown in Table 1, the table includes an importance ranking item, containing the ranks 1 to 15, and a gene pair item, containing 15 gene pairs. For example, the gene pair SYS1|TMEM37 ranks 1 in importance, the gene pair ARG2|POLB ranks 2, and the gene pair CHL1|PKIA ranks 15.
Optionally, the step 3 may specifically include the following steps:
Step 3.1, using the incremental feature selection (IFS) method, the electronic device adds reverse gene pairs to the training sample set one at a time in descending order of importance, and inputs each resulting training sample set, containing a different number of reverse gene pairs, into the initial model to obtain the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each number of feature dimensions.
Step 3.2, the electronic device takes the number of feature dimensions corresponding to the maximum AUC as the target feature dimension (i.e., the optimal set of reverse gene pairs).
Step 4, the electronic device inputs the training sample set containing the target feature dimension, i.e., restricted to the optimal reverse gene pairs, into the model for training.
The gene-pair-based model training method provided by the embodiment of the invention uses the AUC to determine the target feature dimension and thereby controls the number of feature dimensions the initial model learns. Training samples with the target feature dimension found at the maximum AUC let the initial model retain a high AUC, so that the target model achieves better performance.
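The IFS loop of Steps 3.1 and 3.2 can be sketched as follows. The model training and AUC computation are abstracted behind a caller-supplied function `evaluate_auc`, which is a hypothetical placeholder (for example, a cross-validated AUC of the initial model on the given gene pairs); keeping the smallest subset when several reach the same maximum AUC is also an assumption.

```python
def incremental_feature_selection(ranked_pairs, evaluate_auc):
    """Grow the feature set one gene pair at a time and keep the best prefix.

    ranked_pairs: gene pairs ordered from most to least important.
    evaluate_auc: callable taking a list of gene pairs and returning an AUC.
    Returns (best_subset, best_auc, curve), where curve lists (k, AUC) pairs.
    """
    best_k, best_auc, curve = 0, float("-inf"), []
    for k in range(1, len(ranked_pairs) + 1):
        auc = evaluate_auc(ranked_pairs[:k])
        curve.append((k, auc))
        if auc > best_auc:  # strict '>' keeps the smallest subset on ties
            best_k, best_auc = k, auc
    return ranked_pairs[:best_k], best_auc, curve
```

The length of the returned subset is the target feature dimension, and the `curve` values are what would be plotted to show AUC against the number of gene pairs.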
Optionally, before the step 3.1, the method may further include: the electronic equipment inputs all training sample sets containing all the reverse gene pairs into a plurality of candidate models in sequence to obtain the AUC of the ROC curve corresponding to each candidate model; and the electronic equipment takes the candidate model corresponding to the maximum AUC as an initial model.
Optionally, the candidate models include any of the following: a support vector machine model, a logistic regression model, a random forest model, or an extreme gradient boosting (XGBoost) model.
It should be noted that, in experiments, the areas under the receiver operating characteristic (ROC) curve (AUC) of the support vector machine model, the logistic regression model, the random forest model and the XGBoost model were 0.964, 0.959, 0.956 and 0.950, respectively. The support vector machine model therefore has the best predictive performance and can be used as the initial model for training.
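The selection rule amounts to taking the candidate with the largest AUC. A small sketch using the four AUC values reported above (the function name `select_initial_model` is illustrative):

```python
# AUC values reported in the text for the four candidate models.
candidate_aucs = {
    "support vector machine": 0.964,
    "logistic regression": 0.959,
    "random forest": 0.956,
    "XGBoost": 0.950,
}


def select_initial_model(aucs):
    """Return the name of the candidate model with the maximum AUC."""
    if not aucs:
        raise ValueError("at least one candidate model is required")
    return max(aucs, key=aucs.get)


initial_model_name = select_initial_model(candidate_aucs)
```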
In some possible embodiments, as described in S1012 above, the electronic device may take the gene expression data of a first proportion of the samples in an initial matrix as the training set submatrix of that initial matrix, and take the gene expression data of a second proportion of the samples in that initial matrix as its test set submatrix. In this case, fig. 3 is a further schematic flow chart of the gene-pair-based model training method provided by an embodiment of the present invention. As shown in fig. 3, after S1011 described above, the method may further include S1017 to S1018, and S102 may specifically include S1021 to S1023.
S1017, the electronic equipment determines a test set submatrix of each initial matrix according to the plurality of initial matrices.
The specific process of S1017 may be described with reference to S1012 above, and will not be described here again.
S1018, the electronic device obtains a test sample set according to the respective test set submatrices of each initial matrix.
The test sample set includes a plurality of test samples, and specific content of the test samples may be described with reference to the above training samples, which are the same as the training samples and will not be described herein.
S1021, the electronic equipment trains the initial model based on the training sample set to obtain a model to be tested.
S1021 may refer to the above description of S102, and the optimized model is taken as the model to be tested, which is not described herein.
S1022, the electronic equipment tests the model to be tested based on the test sample set to obtain a test result.
The test result obtained in S1022 is still presented in the form of ROC curve and AUC, and may be described with reference to the predicted value and the predicted probability obtained by training in S102, which are not described in detail.
S1023, if the test result meets the preset condition, the electronic equipment takes the model to be tested as a target model.
For example, the electronic device may calculate an AUC according to the test result, and when the AUC is greater than a preset threshold, the electronic device may consider that the model to be tested passes the test, and output the model to be tested as the target model.
Illustratively, fig. 4 is a schematic diagram of model performance provided by an embodiment of the present invention. As shown in fig. 4, taking as an example the model trained on the training sample set containing the top 15 reverse gene pairs in the importance ranking result, the AUC of the model was 0.981 on the training sample set and 0.847 on the test sample set.
Illustratively, fig. 5 is a schematic diagram of the tissue specificity of the model performance provided by an embodiment of the present invention. As shown in fig. 5, the target model achieved AUCs of 0.915, 0.879, 0.850 and 0.802 on four independent external islet datasets. In contrast to this good predictive performance on islet data, when the target model was applied to expression profile data of non-islet samples its predictive performance was no longer stable: its AUCs on two independent non-islet datasets were 0.613 and 0.539. The genes constituting the above reverse gene pairs can therefore be considered tissue-specific.
Based on the above understanding, fig. 6 is a schematic flow chart of another model training method based on gene pairs according to an embodiment of the present invention. As shown in FIG. 6, the method comprises a stable reverse gene pair extraction stage, a feature importance ranking stage, a model construction and optimization stage, a model evaluation stage and a statistical analysis stage.
The extraction stage of the stable and reverse gene pair can be described with reference to the above-mentioned steps S1011 to S1015, and will not be described again.
The feature importance ranking stage may be described with reference to the first 2 steps specifically included in S102, which will not be described again.
The model construction and optimization stage specifically includes three parts, model selection, feature selection, and model construction. The model selection may be described with reference to the support vector machine model in S102, and will not be described again. The feature selection may be described with reference to the above-mentioned selecting the first M target inverse gene pairs from the importance ranking result to determine the target feature dimension of the training sample set, which is not described herein. The model construction may be described with reference to S102 above, and will not be described again.
The model evaluation and statistical analysis stage specifically comprises two parts of model evaluation and single cell data statistical analysis. Model evaluation may be described with reference to S1022 and S1023, and will not be described again.
For the single cell statistical analysis part, the electronic device may utilize single cell data to perform statistical analysis on the first several inverse gene pairs (i.e., the optimal inverse gene pair sets) of the target feature dimensions finally determined in the feature selection, so as to further illustrate the importance of the finally selected optimal inverse gene pair set, including checking whether the order relationship of each inverse gene pair in the optimal inverse gene pair set will also exist in the single cell data, and checking the differential expression condition of the single genes included in the optimal inverse gene pair set in the single cell transcriptome, where specific processes may be described in the related art and are not repeated.
The foregoing has described the solution provided by the embodiments of the present invention mainly from the perspective of the method. To implement the above functions, the electronic device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is implemented as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
In an exemplary embodiment, the embodiment of the invention further provides a model training device based on the gene pair, and fig. 7 is a schematic diagram of the composition of the model training device based on the gene pair. As shown in fig. 7, the apparatus includes: an acquisition module 71 and a processing module 72.
An acquisition module 71, configured to acquire and process a training sample set. The training sample set comprises a plurality of training samples, and each training sample is converted from the original gene expression matrix format into islet data containing the codes of a plurality of reverse gene pairs. Each code is determined by the relative magnitude of the gene expression data of the two genes in the corresponding reverse gene pair; the genes in each reverse gene pair are islet transcriptome genes, and the expression pattern of each reverse gene pair is opposite or different between the types of islet samples.
A processing module 72, configured to train a preset initial model based on the training sample set to obtain a target model; the target model is used to predict the type of islet data based on reverse gene pairs, i.e., gene pairs whose expression patterns are opposite or different.
In some possible embodiments, the acquisition module 71 is specifically configured to: acquire a plurality of initial matrices from a public database, where each initial matrix is a gene expression matrix from islet transcriptome sequencing, each row of each initial matrix is the gene expression data of one gene, each column is an islet sample of a first type or a second type, each initial matrix includes transcriptome gene expression data of first-type islet samples and of second-type islet samples, and the gene expression data of the two types differ; determine the training set submatrix of each initial matrix according to the plurality of initial matrices; screen out, from the training set submatrix of each initial matrix, the gene expression data of a plurality of candidate gene pairs for the first-type islet samples and for the second-type islet samples respectively, where a candidate gene pair is a gene pair whose expression pattern is the same in at least a preset proportion threshold of the samples of the same type, and the expression pattern of a gene pair is its within-sample relative expression ordering (REO), that is, the relative order of magnitude between the expression values of its two genes; screen out the gene expression data of a plurality of reverse gene pairs of each initial matrix from the gene expression data of its candidate gene pairs, where the expression pattern of a reverse gene pair is different or opposite in the two types of islet samples; take the intersection of the reverse gene pairs of the individual initial matrices to obtain the reverse gene pairs shared by the plurality of initial matrices; encode the gene expression data of the shared reverse gene pairs in each initial matrix according to a preset encoding rule to obtain, for each initial matrix, islet data comprising the codes corresponding to the expression patterns of the reverse gene pairs, where each row of the islet data represents a training sample and each column represents the expression pattern of one gene pair; and obtain the training sample set based on the islet data corresponding to each of the plurality of initial matrices.
In some possible embodiments, the processing module 72 is specifically configured to: use the minimum redundancy maximum relevance (mRMR) algorithm to rank, within the encoded islet data of each initial matrix, the importance of the reverse gene pairs shared by the plurality of initial matrices, obtaining a ranking result for each reverse gene pair in each set of islet data; weight each reverse gene pair according to its ranking result in each set of islet data to obtain the importance ranking result of each reverse gene pair in the training sample set; and determine the target feature dimension of the training sample set according to that importance ranking result. The target feature dimension indicates the number of feature dimensions the initial model needs to learn, i.e., the number of reverse gene pairs in the training sample set.
In some possible embodiments, the processing module 72 is specifically configured to: using the incremental feature selection (IFS) method, add reverse gene pairs to the training sample set one at a time in descending order of importance and input each resulting training sample set, containing a different number of reverse gene pairs, into the initial model to obtain the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each number of feature dimensions; and take the number of feature dimensions corresponding to the maximum AUC as the target feature dimension (i.e., the optimal set of reverse gene pairs).
In some possible embodiments, the processing module 72 is further configured to, before using the IFS method to add reverse gene pairs one at a time in descending order of importance and determine the target feature dimension, perform initial model selection: input the full training sample set containing all reverse gene pairs into each of a plurality of candidate models in turn to obtain the AUC of the ROC curve corresponding to each candidate model, and take the candidate model corresponding to the maximum AUC as the initial model.
In some possible embodiments, the candidate models include any of the following: a support vector machine model, a logistic regression model, a random forest model, or an extreme gradient boosting (XGBoost) model.
In some possible embodiments, the acquisition module 71 is further configured to, after acquiring the plurality of initial matrices from the public database, determine the test set submatrix of each initial matrix according to the plurality of initial matrices, and obtain a test sample set from the test set submatrices of the initial matrices; the test sample set includes a plurality of test samples. The processing module 72 is specifically configured to train the initial model based on the training sample set to obtain a model to be tested; test the model to be tested based on the test sample set to obtain a test result; and, if the test result meets a preset condition, take the model to be tested as the target model.
The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
The above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. As another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
The embodiment of the invention also provides an electronic device, and fig. 8 is a schematic structural diagram of the electronic device provided by the embodiment of the invention. As shown in fig. 8, the electronic device includes: a processor 81, a computer readable storage medium 82, and a bus 83, wherein: the electronic device may comprise one or more processors 81, a computer-readable storage medium 82 for storing machine-readable instructions, the processor 81 being communicatively coupled to the storage medium 82 via a bus 83, the processor 81 executing the machine-readable instructions stored by the storage medium 82 to perform the steps of the methods described in the method embodiments above.
The electronic device may be a general purpose computer, a server, a mobile terminal, or the like, without limitation. The electronic device is configured to implement the method according to the above-described method embodiment of the present invention.
It is noted that processor 81 may include one or more processing cores (e.g., a single-core processor or a multi-core processor). By way of example only, the Processor may include a central processing unit (Central Processing Unit, CPU), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), special instruction set Processor (Application Specific Instruction-set Processor, ASIP), graphics processing unit (Graphics Processing Unit, GPU), physical processing unit (Physics Processing Unit, PPU), digital signal Processor (Digital Signal Processor, DSP), field programmable gate array (Field Programmable Gate Array, FPGA), programmable logic device (Programmable Logic Device, PLD), controller, microcontroller unit, reduced instruction set computer (Reduced Instruction Set Computing, RISC), microprocessor, or the like, or any combination thereof.
The storage medium 82 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), or the like, or any combination thereof. By way of example, mass storage may include magnetic disks, optical disks, solid state drives, and the like; removable storage may include flash drives, floppy disks, optical disks, memory cards, zip disks, magnetic tape, and the like; volatile read-write memory may include random access memory (RAM); RAM may include dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), and the like. By way of example, ROM may include mask ROM (MROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), compact disk ROM (CD-ROM), digital versatile disk ROM, and the like.
For ease of illustration, only one processor 81 is described in the electronic device. It should be noted, however, that the electronic device of the present invention may also include a plurality of processors 81, so steps described as performed by one processor may also be performed by a plurality of processors jointly or separately. For example, if the processor 81 of the electronic device performs steps A and B, it should be understood that steps A and B may also be performed jointly by two different processors or separately in one processor. For example, a first processor performs step A and a second processor performs step B, or the first processor and the second processor jointly perform steps A and B.
Optionally, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A method of model training based on gene pairs, the method comprising:
acquiring and processing a training sample set; wherein the training sample set comprises a plurality of training samples, and the processing refers to converting each training sample from an original gene expression matrix format into islet data comprising a code corresponding to the expression pattern of each reverse gene pair; the expression pattern of a reverse gene pair is different or opposite in two different types of samples; each code is determined based on the expression pattern of the corresponding gene pair; and the expression pattern of a gene pair is the relative order relation between the expression data values of the two genes composing the gene pair;
training a preset initial model based on the training sample set to obtain a target model; wherein the target model is capable of predicting the type of islet data based on the reverse gene pairs.
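As an illustration of the coding step in claim 1, the following minimal Python sketch converts one sample's expression values into codes for a list of gene pairs. The claim only says that each code is determined by the relative order of the two expression values; the specific 1/0 rule, the function name `encode_sample`, and the gene names and values are assumptions made here for illustration.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def encode_sample(expression: Dict[str, float], pairs: List[GenePair]) -> List[int]:
    """Encode one sample as a vector of relative expression orderings."""
    # Assumed rule: 1 if the first gene is expressed above the second, else 0.
    return [1 if expression[a] > expression[b] else 0 for a, b in pairs]


# Hypothetical expression values for one islet sample.
sample = {"GENE_A": 8.2, "GENE_B": 3.1, "GENE_C": 5.0}
reverse_pairs = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]

codes = encode_sample(sample, reverse_pairs)
print(codes)  # [1, 0]
```

Because only the within-sample ordering is used, the code vector does not change under any monotone rescaling of that sample's expression values.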
2. The method of claim 1, wherein the acquiring and processing a training sample set comprises:
acquiring a plurality of initial matrices from a public database; wherein each initial matrix is a gene expression matrix obtained by sequencing islet transcriptomes; each row of each initial matrix is the gene expression data of one gene, and each column is an islet sample of a first type or a second type; each initial matrix comprises gene expression data of transcriptome genes of islet samples of the first type and gene expression data of transcriptome genes of islet samples of the second type; and the transcriptome gene expression data of the first type of islet samples differ from those of the second type of islet samples;
determining a training set sub-matrix of each initial matrix from the plurality of initial matrices; wherein each training set sub-matrix comprises gene expression data of transcriptome genes of the first type of islet samples and of the second type of islet samples;
screening out, from each training set sub-matrix, gene expression data of a plurality of candidate gene pairs corresponding to the first type of islet samples and to the second type of islet samples; wherein a candidate gene pair is a gene pair whose expression pattern is the same in a number of samples reaching a preset proportion threshold; and the expression pattern of a gene pair is the relative expression ordering (REO) of the gene pair within a sample;
screening out gene expression data of a plurality of reverse gene pairs of each initial matrix from the gene expression data of the plurality of candidate gene pairs of that initial matrix; wherein the expression pattern of a reverse gene pair is different or opposite in the two different types of islet samples; and taking the intersection of the reverse gene pairs of the respective initial matrices to obtain a plurality of overlapping reverse gene pairs shared by the plurality of initial matrices;
encoding, according to a preset encoding rule, the gene expression data of the intersection of reverse gene pairs in each of the plurality of initial matrices to obtain islet data that correspond to each initial matrix and comprise a code corresponding to the expression pattern of each reverse gene pair; wherein each row in the islet data represents a training sample, and each column in the islet data represents the expression pattern of one overlapping reverse gene pair;
and obtaining the training sample set based on the islet data corresponding to each of the plurality of initial matrices.
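The screening, intersection, and encoding steps of claim 2 can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the 0.9 proportion threshold, the 1/0 encoding rule, the treatment of ties as "not greater", the helper names, and the toy values are all hypothetical, and the two toy matrices are assumed to be training set sub-matrices already.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Matrix = Dict[str, List[float]]  # gene -> expression values, one per sample column
GenePair = Tuple[str, str]


def stable_pairs(matrix: Matrix, columns: List[int], threshold: float) -> Dict[GenePair, int]:
    """Map each pair with a consistent ordering in `columns` to its direction (1: a > b, 0: otherwise)."""
    stable: Dict[GenePair, int] = {}
    for a, b in combinations(sorted(matrix), 2):
        fraction = sum(matrix[a][j] > matrix[b][j] for j in columns) / len(columns)
        if fraction >= threshold:
            stable[(a, b)] = 1
        elif 1 - fraction >= threshold:
            stable[(a, b)] = 0
    return stable


def reverse_pairs(matrix: Matrix, labels: List[int], threshold: float = 0.9) -> Set[GenePair]:
    """Pairs that are stable within each sample type but ordered oppositely between the types."""
    first = stable_pairs(matrix, [j for j, y in enumerate(labels) if y == 0], threshold)
    second = stable_pairs(matrix, [j for j, y in enumerate(labels) if y == 1], threshold)
    return {pair for pair in first if pair in second and first[pair] != second[pair]}


def encode(matrix: Matrix, pairs: List[GenePair]) -> List[List[int]]:
    """One row per sample, one 0/1 column per gene pair."""
    n_samples = len(next(iter(matrix.values())))
    return [[1 if matrix[a][j] > matrix[b][j] else 0 for a, b in pairs] for j in range(n_samples)]


# Two toy training sub-matrices (hypothetical values); columns 0-1 are type 0, columns 2-3 are type 1.
cohorts = [
    ({"G1": [9, 8, 2, 1], "G2": [3, 4, 7, 6], "G3": [5, 5, 5, 5]}, [0, 0, 1, 1]),
    ({"G1": [7, 9, 3, 2], "G2": [2, 1, 8, 9], "G3": [1, 2, 4, 1]}, [0, 0, 1, 1]),
]

# Keep only the reverse gene pairs shared by every matrix, then encode each matrix with them.
shared = sorted(set.intersection(*(reverse_pairs(m, y) for m, y in cohorts)))
X = [row for m, _ in cohorts for row in encode(m, shared)]
y = [label for _, labels in cohorts for label in labels]
print(shared, X, y)
```

In this toy data all three pairs are reverse gene pairs in the first matrix, but only `("G1", "G2")` is also stable and reversed in the second, so it is the only pair that survives the intersection.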
3. The method of claim 2, wherein training a preset initial model based on the training sample set comprises:
ranking the importance of the overlapping shared reverse gene pairs separately in each of the plurality of initial matrices using a minimum redundancy maximum relevance (mRMR) algorithm, to obtain a ranking result of each reverse gene pair in the respective islet data;
weighting each reverse gene pair according to its ranking result in the respective islet data, to obtain an importance ranking result of each reverse gene pair in the training sample set;
determining a target feature dimension of the training sample set according to the importance ranking result of each reverse gene pair in the training sample set; wherein the target feature dimension of the training sample set indicates the number of feature dimensions to be learned by the initial model, namely the number of reverse gene pairs in the training sample set;
and selecting, from the training sample set, successively increasing numbers of feature dimensions in order of the importance ranking results of the reverse gene pairs from strong to weak, and sequentially inputting the resulting training sample sets comprising different reverse gene pairs into the preset initial model for training.
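The ranking and weighting steps of claim 3 might look like the sketch below. It uses a simple greedy mRMR variant (mutual information with the label minus mean mutual information with already selected features), which is one common formulation and not necessarily the one used by the inventors. The claim does not state how the per-dataset rankings are weighted; weighting by sample count is an assumption made here, and the second cohort's ranking is a hypothetical placeholder.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def mrmr_order(X: List[List[int]], y: List[int]) -> List[int]:
    """Greedy mRMR: repeatedly pick the feature with the best relevance minus redundancy."""
    columns = [[row[j] for row in X] for j in range(len(X[0]))]
    relevance = [mutual_information(col, y) for col in columns]
    selected: List[int] = []
    remaining = list(range(len(columns)))

    def score(j: int) -> float:
        if not selected:
            return relevance[j]
        redundancy = sum(mutual_information(columns[j], columns[k]) for k in selected)
        return relevance[j] - redundancy / len(selected)

    while remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def combine_rankings(orders: List[List[int]], weights: List[float]) -> List[int]:
    """Weighted mean rank across datasets (assumed weighting: sample count per dataset)."""
    total = sum(weights)
    mean_rank = [0.0] * len(orders[0])
    for order, weight in zip(orders, weights):
        for rank, feature in enumerate(order):
            mean_rank[feature] += weight * rank / total
    return sorted(range(len(mean_rank)), key=lambda j: mean_rank[j])


# Toy encoded islet data for one cohort: 8 samples, 3 reverse gene pairs (hypothetical).
X_a = [[0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0],
       [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 0, 1]]
y_a = [0, 0, 0, 0, 1, 1, 1, 1]

order_a = mrmr_order(X_a, y_a)
order_b = [2, 0, 1]  # hypothetical ranking from a second cohort of 4 samples
combined = combine_rankings([order_a, order_b], weights=[8, 4])
print(order_a, combined)
```

Here column 1 is ranked last even though it is as relevant as column 2, because it largely duplicates column 0; that penalty on redundancy is the point of mRMR.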
4. The method of claim 3, wherein determining the target feature dimension of the training sample set according to the importance ranking result of each reverse gene pair in the training sample set comprises: using an incremental feature selection (IFS) method, increasing the number of reverse gene pairs in the training sample set one by one in order of the importance ranking results from strong to weak, and inputting the training sample sets comprising different reverse gene pairs into the initial model respectively, to obtain the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each number of feature dimensions;
and taking the number of feature dimensions corresponding to the maximum AUC as the target feature dimension of the training sample set.
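The incremental feature selection loop of claim 4 can be sketched as below. The claim does not say on which data the AUC is computed, so a separate validation split is assumed here; the difference-of-centroids scorer is only a dependency-free stand-in for the initial model, and the data and ranking are hypothetical.

```python
from typing import List, Tuple


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(X_train: List[List[int]], y_train: List[int],
                    X_eval: List[List[int]]) -> List[float]:
    """Stand-in model: project samples onto the difference between the two class means."""
    pos = [r for r, y in zip(X_train, y_train) if y == 1]
    neg = [r for r, y in zip(X_train, y_train) if y == 0]
    weights = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
               for j in range(len(X_train[0]))]
    return [sum(w * x for w, x in zip(weights, row)) for row in X_eval]


def take_columns(X: List[List[int]], cols: List[int]) -> List[List[int]]:
    return [[row[j] for j in cols] for row in X]


def incremental_feature_selection(X_train, y_train, X_val, y_val,
                                  ranking: List[int]) -> Tuple[int, List[float]]:
    """Add features one at a time in ranked order and return the size with the highest AUC."""
    best_k, best_auc, curve = 0, -1.0, []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        scores = centroid_scores(take_columns(X_train, cols), y_train, take_columns(X_val, cols))
        value = auc(scores, y_val)
        curve.append(value)
        if value > best_auc:  # strict comparison keeps the smallest feature set on ties
            best_k, best_auc = k, value
    return best_k, curve


# Hypothetical encoded data: columns are reverse gene pairs, ranking comes from a prior step.
X_train = [[0, 1, 0], [0, 0, 1], [1, 1, 1], [1, 0, 1]]
y_train = [0, 0, 1, 1]
X_val = [[0, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
y_val = [0, 0, 1, 1]

best_k, curve = incremental_feature_selection(X_train, y_train, X_val, y_val, ranking=[0, 2, 1])
print(best_k, curve)  # 2 [0.5, 0.75, 0.75]
```

With these toy values, adding the second-ranked pair raises the AUC from 0.5 to 0.75 and the third adds nothing, so two feature dimensions are selected.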
5. The method of claim 4, wherein the method further comprises:
before the number of reverse gene pairs in the training sample set is increased one by one using the incremental feature selection (IFS) method to obtain the AUC corresponding to each number of feature dimensions, sequentially inputting the training sample set comprising all of the reverse gene pairs into a plurality of candidate models to obtain the area under the ROC curve (AUC) corresponding to each candidate model;
and taking the candidate model corresponding to the maximum AUC as the initial model.
6. The method of claim 5, wherein the candidate models comprise any one of: a support vector machine model, a logistic regression model, a random forest model, or an extreme gradient boosting (XGBoost) tree model.
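The candidate-model comparison of claims 5 and 6 reduces to scoring each candidate on the full feature set and keeping the one with the largest AUC. In the sketch below the two candidates are deliberately trivial, dependency-free stand-ins; in practice each callable would wrap a support vector machine, logistic regression, random forest, or XGBoost model from a library of choice. The callable interface, the validation split, and the data are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[List[int]]
Scorer = Callable[[Rows, List[int], Rows], List[float]]  # (X_train, y_train, X_eval) -> scores


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))


def centroid_scores(X_train: Rows, y_train: List[int], X_eval: Rows) -> List[float]:
    """Stand-in candidate 1: projection onto the difference between the class means."""
    pos = [r for r, y in zip(X_train, y_train) if y == 1]
    neg = [r for r, y in zip(X_train, y_train) if y == 0]
    weights = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
               for j in range(len(X_train[0]))]
    return [sum(w * x for w, x in zip(weights, row)) for row in X_eval]


def first_feature_scores(X_train: Rows, y_train: List[int], X_eval: Rows) -> List[float]:
    """Stand-in candidate 2: uses only the first reverse gene pair and ignores training data."""
    return [float(row[0]) for row in X_eval]


def select_initial_model(candidates: Dict[str, Scorer], X_train: Rows, y_train: List[int],
                         X_val: Rows, y_val: List[int]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate on all features and return the name with the highest AUC."""
    aucs = {name: auc(fit_score(X_train, y_train, X_val), y_val)
            for name, fit_score in candidates.items()}
    return max(aucs, key=aucs.get), aucs


# Hypothetical encoded training and validation data (all reverse gene pairs included).
X_train = [[0, 1, 0], [0, 0, 1], [1, 1, 1], [1, 0, 1]]
y_train = [0, 0, 1, 1]
X_val = [[0, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
y_val = [0, 0, 1, 1]

candidates = {"centroid": centroid_scores, "first_feature": first_feature_scores}
best_name, aucs = select_initial_model(candidates, X_train, y_train, X_val, y_val)
print(best_name, aucs)
```

Keeping the candidates behind one callable signature means the selection logic does not change when a stand-in is replaced by a real classifier.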
7. The method of claim 2, wherein the method further comprises:
after the plurality of initial matrices are acquired from the public database, determining a test set sub-matrix of each initial matrix from the plurality of initial matrices;
obtaining a test sample set from the test set sub-matrices of the initial matrices; wherein the test sample set comprises a plurality of test samples;
and wherein training a preset initial model based on the training sample set to obtain a target model comprises:
training the initial model based on the training sample set to obtain a model to be tested;
testing the model to be tested based on the test sample set to obtain a test result;
and if the test result meets a preset condition, taking the model to be tested as the target model.
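Claim 7 splits each initial matrix into training and test sub-matrices and accepts the trained model only if a test result meets a preset condition. The sketch below assumes a stratified random split of sample columns and a condition expressed as minimum values for accuracy, sensitivity, and specificity; the claim specifies neither, and the function names, test fraction, and thresholds are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def stratified_split(labels: List[int], test_fraction: float = 0.25,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample column indices into train/test, keeping both sample types in each part."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [j for j, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Basic test metrics; assumes both sample types occur in y_true."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def meets_preset_condition(metrics: Dict[str, float], minimums: Dict[str, float]) -> bool:
    """Accept the model to be tested only if every required metric reaches its minimum."""
    return all(metrics[name] >= minimum for name, minimum in minimums.items())


# Hypothetical column labels of one initial matrix: 8 samples of each type.
labels = [0] * 8 + [1] * 8
train_cols, test_cols = stratified_split(labels)

# Hypothetical predictions of a model to be tested on four test samples.
metrics = evaluate(y_true=[0, 0, 1, 1], y_pred=[0, 0, 1, 0])
accepted = meets_preset_condition(metrics, {"accuracy": 0.7})
print(len(train_cols), len(test_cols), metrics, accepted)
```

Splitting before the gene pairs are screened keeps the test samples out of feature selection, which is why the split operates on the columns of the initial matrix rather than on the encoded data.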
8. A gene-pair-based model training apparatus, the apparatus comprising an acquisition module and a processing module;
the acquisition module is configured to acquire a training sample set; wherein the training sample set comprises a plurality of training samples; each of the plurality of training samples comprises a plurality of codes for respective reverse gene pairs; each code is determined based on the relative magnitude relation between the gene expression data of the two genes in the corresponding reverse gene pair; and the genes in each reverse gene pair are islet transcriptome genes whose gene-pair expression patterns are opposite;
the processing module is configured to train a preset initial model based on the training sample set to obtain a target model; wherein the target model predicts the type of islet data based on the reverse gene pairs.
9. An electronic device comprising a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device is operating, the processor and the storage medium communicate over the bus, and the processor executes the machine-readable instructions to perform the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310209045.9A CN116072210B (en) 2023-03-07 2023-03-07 Model training method, device, equipment and storage medium based on gene pair

Publications (2)

Publication Number Publication Date
CN116072210A true CN116072210A (en) 2023-05-05
CN116072210B CN116072210B (en) 2023-08-18

Family

ID=86171629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310209045.9A Active CN116072210B (en) 2023-03-07 2023-03-07 Model training method, device, equipment and storage medium based on gene pair

Country Status (1)

Country Link
CN (1) CN116072210B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766695B (en) * 2017-10-20 2019-03-08 中国科学院北京基因组研究所 A kind of method and device obtaining peripheral blood genetic model training data
CN113113150A (en) * 2021-04-15 2021-07-13 上海交通大学医学院附属第九人民医院 Lymph node metastasis prediction model construction and training method, device, equipment and medium
CN114038507A (en) * 2021-10-28 2022-02-11 上海商汤智能科技有限公司 Prediction method, training method of prediction model and related device
CN114496083A (en) * 2022-01-26 2022-05-13 腾讯科技(深圳)有限公司 Cell type determination method, device, equipment and storage medium
CN115394358B (en) * 2022-08-31 2023-05-12 西安理工大学 Single-cell sequencing gene expression data interpolation method and system based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467596A (en) * 2023-04-11 2023-07-21 广州国家现代农业产业科技创新中心 Training method of rice grain length prediction model, morphology prediction method and apparatus
CN116467596B (en) * 2023-04-11 2024-03-26 广州国家现代农业产业科技创新中心 Training method of rice grain length prediction model, morphology prediction method and apparatus

Also Published As

Publication number Publication date
CN116072210B (en) 2023-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant