CN117912570A

CN117912570A - Classification feature determining method and system based on gene co-expression network

Info

Publication number: CN117912570A
Application number: CN202410313618.7A
Authority: CN
Inventors: 艾冬梅; 张天鹏; 张林桐; 杜洋; 李雨珈; 邢永莲
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2024-03-19
Filing date: 2024-03-19
Publication date: 2024-04-19
Anticipated expiration: 2044-03-19
Also published as: CN117912570B

Abstract

The invention provides a classification feature determining method and system based on a gene co-expression network, relating to the technical field of data processing. The method comprises: acquiring sample data; extracting gene features of the sample data; determining, by analysis, the gene features that differ between the normal group and the cancer group; calculating correlations between the gene features; eliminating redundant gene features through a Markov blanket filtering algorithm to obtain an optimal feature set; constructing a weighted gene co-expression network; determining pivot genes; performing cluster analysis on the pivot genes and summarizing the expression levels of the gene features in each gene module to obtain module characteristic genes; combining the pivot genes, the module characteristic genes, the cancer tissue infiltrating microorganism content, and the immune cell content to obtain multiple combined features; classifying the sample data based on the combined features by a logistic regression model; and determining, based on the classification results, the combined feature with the best classification effect.

Description

Classification feature determining method and system based on gene co-expression network

Technical Field

The invention relates to the technical field of data processing, in particular to a classification characteristic determining method and system based on a gene co-expression network.

Background

Liver cancer has the second highest mortality rate of any cancer worldwide, and accurate early evaluation of liver cancer is important for timely treatment of patients. The occurrence of liver cancer involves a variety of genetic and epigenetic changes, and identifying the key characteristic genes associated with liver cancer is critical to understanding its underlying molecular mechanisms and to developing effective therapies.

Gene expression levels are considered a key indicator for cancer classification, but processing high-dimensional gene expression data often runs into feature redundancy. Traditional tumor classification methods based on gene data are easily affected by redundant features and prone to high false positive rates, so the accuracy of tumor classification is low.

Meanwhile, most current studies use only a single gene feature as a biomarker for cancer assessment. However, the occurrence and development of a disease are rarely determined by a single factor; tumor tissue microorganisms and infiltrating immune cells also have an important influence on the occurrence and development of liver cancer. Evaluation based on gene features alone therefore further lowers the accuracy of tumor classification.

Disclosure of Invention

The prior art has two technical problems: traditional tumor classification methods based on gene data are easily affected by redundant features and prone to high false positive rates, so the accuracy of tumor classification is low; and using a single gene feature as the biomarker for cancer evaluation lowers that accuracy further. To solve these problems, the embodiments of the invention provide a classification feature determining method and system based on a gene co-expression network. The technical solution is as follows:

In a first aspect, a method for determining classification characteristics based on a gene co-expression network is provided, including:

S1: obtaining sample data, the sample data comprising a normal group and a cancer group;

S2: extracting gene features of the sample data;

S3: determining, by analysis, the gene features that differ between the normal group and the cancer group;

S4: calculating correlations between the gene features;

S5: according to the correlations between the gene features, eliminating redundant gene features through a Markov blanket filtering algorithm to obtain an optimal feature set;

S6: constructing a weighted gene co-expression network according to the gene features in the optimal feature set;

S7: determining pivot genes according to the weighted gene co-expression network;

S8: performing cluster analysis on the pivot genes to obtain a plurality of gene modules, and summarizing the expression levels of the pivot genes in each gene module to obtain a characteristic value of each gene module, namely a module characteristic gene;

S9: combining the pivot genes, the module characteristic genes, the cancer tissue infiltrating microorganism content, and the immune cell content to obtain multiple combined features;

S10: classifying the sample data based on the combined features by a logistic regression model;

S11: determining, based on the classification results, the combined feature with the best classification effect.

Optionally, the correlation between the gene characteristics in S4 is calculated as follows formula (1):

（1）

wherein X represents a first gene feature, Z represents a second gene feature, and Y represents the sample class; formula (1) gives the correlation between the first gene feature X and the second gene feature Z; I(X; Z) represents the mutual information between X and Z; I(X; Z | Y) represents the mutual information between X and Z given the sample class Y; H(·) represents entropy, with H(X) the entropy of X and H(Z) the entropy of Z; and max represents the maximum value;

wherein, given the sample class Y, the mutual information I(X; Z | Y) between the first gene feature X and the second gene feature Z is calculated as:

$$I(X; Z \mid Y) = \sum_{x} \sum_{z} \sum_{y} p(x, z, y) \log \frac{p(x, z \mid y)}{p(x \mid y)\, p(z \mid y)}$$ （2）

wherein p(x, z, y) represents the joint probability distribution of the first gene feature X, the second gene feature Z, and the sample class Y; p(x, z | y) represents the joint probability distribution of X and Z given the sample class Y; p(x | y) represents the conditional probability distribution of X given the sample class Y; and p(z | y) represents the conditional probability distribution of Z given the sample class Y.

Optionally, in S5, eliminating redundant gene features through a Markov blanket filtering algorithm according to the correlations between the gene features, to obtain an optimal feature set, includes:

S501: calculating the correlation between each gene feature and the sample class;

S502: judging whether the correlation between each gene feature and the sample class is greater than a threshold; if so, keeping the gene feature in a relevant feature subset; otherwise, removing the gene feature;

S503: taking the first element of the relevant feature subset, namely the one with the highest correlation with the sample class, as the target gene, the target gene being the starting point of the redundancy analysis;

S504: selecting, in turn, each of the other gene features in the relevant feature subset and judging whether it forms a Markov blanket with the target gene; if so, putting that gene feature into a redundant feature set; otherwise, putting it into the optimal feature set;

S505: selecting the next element as the target gene and repeating S504 until all gene features have been screened.

Optionally, judging whether the other gene features and the target gene form a Markov blanket includes:

When the condition of the following formula (3) is satisfied, gene feature X_i and gene feature X_j form the Markov blanket:

$$F_{jc} \ge F_{ic} \quad \text{and} \quad F_{ij} \ge F_{ic}$$ （3）

wherein F_jc represents the correlation between gene feature X_j and the sample class c, F_ic represents the correlation between gene feature X_i and the sample class c, and F_ij represents the correlation between gene feature X_i and gene feature X_j.

Optionally, in S6, constructing a weighted gene co-expression network according to the gene features in the optimal feature set includes:

S601: constructing a correlation coefficient matrix F according to the correlations between the gene features in the optimal feature set and the following formula (4):

$$F = \left[ f_{ij} \right]$$ （4）

wherein f_ij represents the correlation between gene feature X_i and gene feature X_j;

S602: introducing a soft threshold according to the correlation coefficient matrix and the following formula (5), and constructing an adjacency matrix:

$$A = \left[ a_{ij} \right], \quad a_{ij} = \left| f_{ij} \right|^{\gamma}$$ （5）

wherein A represents the adjacency matrix, a_ij represents the adjacency value between gene feature X_i and gene feature X_j, and γ represents the soft threshold;

S603: constructing a topology matrix according to the adjacency matrix and the following formula (6):

$$W = \left[ w_{ij} \right], \quad w_{ij} = \frac{\sum_{\mu \neq i,j} a_{i\mu} a_{\mu j} + a_{ij}}{\min\left( \sum_{\mu \neq i} a_{i\mu},\ \sum_{\mu \neq j} a_{\mu j} \right) + 1 - a_{ij}}$$ （6）

wherein W represents the topology matrix; w_ij represents the association strength between gene feature X_i and gene feature X_j, and also the weight between them; w_ij = 1 indicates that X_i and X_j are directly connected or that there is a gene connected to both X_i and X_j; conversely, w_ij = 0 indicates that there is neither a direct connection between X_i and X_j nor an indirect connection through other genes; a_iμ represents the adjacency value between gene feature X_i and gene feature X_μ; and a_μj represents the adjacency value between gene feature X_μ and gene feature X_j;

S604: constructing the weighted gene co-expression network according to the topology matrix.

Optionally, in S7, determining the pivot genes according to the weighted gene co-expression network includes:

S701: determining the connectivity of each gene feature in the weighted gene co-expression network;

S702: determining a gene feature to be a pivot gene when its connectivity is greater than a preset connectivity.

Optionally, in S8, performing cluster analysis on the pivot genes to obtain a plurality of gene modules includes:

S801: performing cluster analysis on the pivot genes;

S802: dynamically pruning the cluster analysis result through the cutreeDynamic function to obtain a plurality of gene modules.

Optionally, in S11, determining the combined feature with the best classification effect based on the classification results includes:

S111: constructing a classification objective function;

S112: calculating the value of the classification objective function from the classification results under each combined feature;

S113: determining the combined feature for which the value of the classification objective function is smallest as the combined feature with the best classification effect.

Optionally, the classification objective function is given by the following formula (7):

（7）

wherein x_i represents the i-th combined feature, y_i represents the classification result of the i-th combined feature, β represents the regression coefficient parameter, β₀ represents the initial value of the regression coefficient parameter, η represents the weight coefficient of the regularization term, α represents the weight coefficient of the L₁ regularization, n represents the total number of combined features, and (·)^T represents the matrix transpose operation.
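As an illustration only, the following Python sketch evaluates an elastic-net-penalized logistic regression objective of the kind the symbols above suggest. The exact form of formula (7) is not reproduced in this text, so the form used here (mean negative log-likelihood plus η·[α·‖β‖₁ + (1−α)·‖β‖₂²/2]), the function name, the treatment of β₀ as an intercept, and the 0/1 label encoding are all assumptions rather than the patented formula.

```python
from math import exp, log1p


def elastic_net_logistic_objective(X, y, beta0, beta, eta, alpha):
    """Assumed objective: mean logistic loss + elastic-net penalty.

    X     : list of feature rows (one row per observation)
    y     : list of 0/1 labels
    beta0 : intercept (assumption: this is the patent's beta_0)
    beta  : list of regression coefficients
    eta   : weight of the regularization term
    alpha : share of the L1 penalty (1 - alpha goes to the L2 penalty)
    """
    n = len(X)
    loss = 0.0
    for row, label in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, row))
        # Numerically stable log(1 + e^z), then subtract y * z.
        softplus = max(z, 0.0) + log1p(exp(-abs(z)))
        loss += softplus - label * z
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return loss / n + eta * (alpha * l1 + (1.0 - alpha) * l2 / 2.0)
```

Under this assumed form, a smaller value indicates a better fit after penalizing large coefficients, which matches the selection rule in S113 of taking the combined feature with the minimum function value.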

In a second aspect, a classification characteristic determination system based on a gene co-expression network is provided, comprising a processor and a memory for storing processor executable instructions; the processor is configured to invoke the instructions stored in the memory to perform the method of classification feature determination based on a gene co-expression network of the first aspect.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

(1) In the invention, redundant gene features are eliminated by the Markov blanket filtering algorithm according to the correlations between the gene features to obtain the optimal feature set, and tumor classification is then carried out on the gene features in the optimal feature set, which reduces false positives and improves the accuracy of tumor classification.

(2) In the invention, the pivot genes, the module characteristic genes, the cancer tissue infiltrating microorganism content, and the immune cell content are combined, and the combined feature with the best classification effect is determined from the classification results of the logistic regression model. Because these four kinds of features are considered together, the accuracy of tumor classification is further improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a classification characteristic determining method based on a gene co-expression network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a technical framework of a classification feature determination method based on a gene co-expression network according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a classification characteristic determining system based on a gene co-expression network according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is described below with reference to the accompanying drawings.

In embodiments of the invention, words such as "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of the word "example" is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, "and/or" may mean both, or either one of the two.

In the embodiments of the present invention, "image" and "picture" may be sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized. "of", "corresponding (corresponding, relevant)" and "corresponding (corresponding)" are sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized.

In embodiments of the present invention, a subscript form such as W₁ may sometimes be written in a non-subscript form such as W1; the meaning is the same when the distinction is not emphasized.

In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a classification characteristic determining method based on a gene co-expression network, which can be realized by a classification characteristic determining system based on the gene co-expression network, wherein the classification characteristic determining system based on the gene co-expression network can be a terminal or a server.

Referring to fig. 1 of the specification, a flow chart of a classification characteristic determining method based on a gene co-expression network according to an embodiment of the present invention is shown. Referring to fig. 2 of the specification, a schematic diagram of a technical framework of a classification characteristic determining method based on a gene co-expression network according to an embodiment of the present invention is shown. The process flow of the classification characteristic determining method based on the gene co-expression network can comprise the following steps:

S1: acquiring sample data.

Wherein the sample data comprises a normal group and a cancer group.

Specifically, the TCGA-LIHC dataset was obtained from the official database of The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer.gov/projects/TCGA-LIHC). This dataset contains 50 healthy tissue samples (denoted as normal samples) and 374 tumor tissue samples (denoted as cancer samples) from primary liver cancer, for a total of 424 patient samples. In total, transcriptome data for the 424 samples, somatic mutation data for 374 samples, and clinical data for 377 samples were downloaded. In addition, RNA-seq sequencing data for 98 samples, comprising 49 normal tissue samples and 49 tumor tissue samples, were obtained from the Seven Bridges Cancer Genomics Cloud (CGC) platform for the subsequent study of liver cancer tissue microorganisms.

S2: extracting the gene features of the sample data.

Specifically, the gene characteristics include gene expression level, gene expression pattern, gene variation, gene subtype, transcription factor activity, and the like.

S3: determining, by analysis, the gene features that differ between the normal group and the cancer group.

Specifically, the gene features that differ between the normal group and the cancer group are determined through techniques such as mean difference analysis, non-parametric tests, gene expression transformation, correlation analysis, and nonlinear relationship modeling.

In the present invention, the combination of these techniques helps reveal the biological differences between the normal and cancer groups at the molecular level, providing important information for the intensive study of disease pathogenesis and development of potential therapeutic strategies.
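As one hedged illustration of the mean difference analysis named above, the following Python sketch keeps the genes whose mean expression differs between the two groups by more than a cutoff. The function name, the dictionary layout, the assumption that expression values are already on a comparable (for example log-transformed) scale, and the cutoff value are all hypothetical; the other listed techniques are not shown.

```python
from statistics import mean


def differential_genes(normal, cancer, min_abs_diff):
    """Return {gene: mean(cancer) - mean(normal)} for genes whose
    absolute mean difference exceeds min_abs_diff.

    normal, cancer : dict mapping gene name -> list of expression values
    min_abs_diff   : hypothetical cutoff chosen by the analyst
    """
    selected = {}
    for gene in normal:
        diff = mean(cancer[gene]) - mean(normal[gene])
        if abs(diff) > min_abs_diff:
            selected[gene] = diff
    return selected


# Toy data: gene "A" shifts between groups, gene "B" does not.
normal_group = {"A": [1.0, 1.2, 0.8], "B": [5.0, 5.1, 4.9]}
cancer_group = {"A": [3.0, 3.2, 2.8], "B": [5.0, 5.2, 4.8]}
changed = differential_genes(normal_group, cancer_group, min_abs_diff=1.0)
```

In practice a mean difference filter of this kind would normally be paired with a statistical test, which is why the text lists non-parametric tests alongside it.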

S4: calculating the correlations between the gene features.

In one possible implementation, the invention provides a new way of calculating the correlation between gene features in S4. Specifically, the correlation between the gene features in S4 is calculated according to the following formula (1):

（1）

wherein X represents a first gene feature, Z represents a second gene feature, and Y represents the sample class; formula (1) gives the correlation between the first gene feature X and the second gene feature Z; I(X; Z) represents the mutual information between X and Z; I(X; Z | Y) represents the mutual information between X and Z given the sample class Y; H(·) represents entropy, with H(X) the entropy of X and H(Z) the entropy of Z; and max represents the maximum value.

Wherein, given the sample class Y, the mutual information I(X; Z | Y) between the first gene feature X and the second gene feature Z is calculated as:

$$I(X; Z \mid Y) = \sum_{x} \sum_{z} \sum_{y} p(x, z, y) \log \frac{p(x, z \mid y)}{p(x \mid y)\, p(z \mid y)}$$ （2）

It should be noted that I(X; Z) represents the mutual information between the first gene feature X and the second gene feature Z and measures the degree of information overlap between them, i.e., the amount of information the two genes carry in common. I(X; Z | Y) represents the mutual information between X and Z given the sample class Y, i.e., the degree of association between them when the sample type is known. In feature screening, the larger the values of I(X; Z) and I(X; Z | Y), the stronger the correlation between the gene features and the sample types, and therefore the more likely these features are to be selected into the final feature set.

In the invention, by considering sample category information, normalizing and maximizing the amount of information, a method for more comprehensively and accurately evaluating the relevance between gene features is provided. In gene feature selection and biological research, the relationship among genes can be better understood, and more depth information can be provided for further analysis and research.
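The quantities that enter the correlation measure, namely the entropies H(X) and H(Z), the mutual information between X and Z, and the mutual information between X and Z given Y, can be estimated from counts once the expression values are discretized. The sketch below is a minimal plug-in estimator under that assumption; the function names and the base-2 logarithm are illustrative choices, and the final combination of these quantities into formula (1) is deliberately left out because that formula is not reproduced in this text.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy H(X), in bits, of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, zs):
    """I(X; Z) = H(X) + H(Z) - H(X, Z)."""
    return entropy(xs) + entropy(zs) - entropy(list(zip(xs, zs)))


def conditional_mutual_information(xs, zs, ys):
    """I(X; Z | Y) = sum p(x,z,y) * log[p(x,z|y) / (p(x|y) p(z|y))]."""
    n = len(ys)
    c_xzy = Counter(zip(xs, zs, ys))
    c_xy = Counter(zip(xs, ys))
    c_zy = Counter(zip(zs, ys))
    c_y = Counter(ys)
    total = 0.0
    for (x, z, y), c in c_xzy.items():
        # In counts: p(x,z|y) / (p(x|y) p(z|y)) = c_xzy * c_y / (c_xy * c_zy)
        ratio = c * c_y[y] / (c_xy[(x, y)] * c_zy[(z, y)])
        total += (c / n) * log2(ratio)
    return total


# Toy discretized expression levels for two genes and the sample class.
gene_x = [0, 0, 1, 1]
gene_z = [0, 0, 1, 1]
label = [0, 1, 0, 1]
h_x = entropy(gene_x)
i_xz = mutual_information(gene_x, gene_z)
i_xz_given_y = conditional_mutual_information(gene_x, gene_z, label)
```

Plug-in estimates of this kind are biased for small samples, so with a few hundred samples the number of discretization bins should be kept small.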

S5: according to the correlations between the gene features, eliminating redundant gene features through a Markov blanket filtering algorithm to obtain an optimal feature set.

Wherein the Markov blanket filtering algorithm (Markov Blanket Filter) is a feature selection method that uses the Markov blanket concept from probabilistic graphical models to identify a minimal subset of variables for a given target variable. This subset contains all the information relevant to the target variable, so that, given the subset, the target variable is conditionally independent of all other variables.

It should be noted that the advantage of the Markov blanket filtering algorithm is that, by taking the conditional independence between variables into account and selecting the smallest feature set, it avoids redundant information and redundant features. This helps to improve the interpretability and generalization of the model, reduce the risk of overfitting, and reduce computational overhead. Because the algorithm selects features by exploiting conditional independence within a probabilistic graphical model, it is particularly suitable for high-dimensional data.

In one possible implementation, eliminating redundant gene features in S5 through the Markov blanket filtering algorithm according to the correlations between the gene features, to obtain the optimal feature set, includes the sub-steps S501 to S505:

S501: calculating the correlation between each gene feature and the sample class.

S502: judging whether the correlation between each gene feature and the sample class is greater than a threshold. If so, the gene feature is kept in a relevant feature subset. Otherwise, the gene feature is removed.

S503: taking the first element of the relevant feature subset, namely the one with the highest correlation with the sample class, as the target gene, the target gene being the starting point of the redundancy analysis.

S504: selecting, in turn, each of the other gene features in the relevant feature subset and judging whether it forms a Markov blanket with the target gene. If so, that gene feature is put into a redundant feature set. Otherwise, it is put into the optimal feature set.

S505: selecting the next element as the target gene and repeating S504 until all gene features have been screened.

In the invention, using the Markov blanket filtering algorithm for feature selection improves model efficiency, generalization ability, and interpretability while reducing computational cost and improving model robustness. It is particularly suitable for classification problems on high-dimensional gene expression data.

In one possible embodiment, judging whether the other gene features form a Markov blanket with the target gene includes:

When the condition of the following formula (3) is satisfied, gene feature X_i and gene feature X_j form a Markov blanket:

$$F_{jc} \ge F_{ic} \quad \text{and} \quad F_{ij} \ge F_{ic}$$ （3）

In the present invention, this condition ensures that the correlation between the gene features and the sample class is taken into account when selecting the Markov blanket. Requiring that the correlation between X_j and the sample class c be at least equal to the correlation between X_i and the sample class c ensures that the selected feature has the higher correlation with the target class. At the same time, F_ij is required to be greater than or equal to F_ic, that is, the correlation between gene feature X_i and gene feature X_j must be at least equal to the correlation between X_i and the sample class c. This helps to capture complex relationships among multiple variables, so that the selected Markov blanket reflects the interactions between genes more comprehensively.
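The screening loop of S501 to S505 together with the two inequalities described above (F_jc ≥ F_ic and F_ij ≥ F_ic) can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function name, the dictionary and callback interfaces, and the example scores are hypothetical, and the correlation values are assumed to have been computed beforehand.

```python
def markov_blanket_filter(relevance, pairwise, threshold):
    """Greedy redundancy filter over precomputed correlation scores.

    relevance : dict gene -> correlation with the sample class (F_ic)
    pairwise  : function (gene_a, gene_b) -> correlation between genes
    threshold : relevance cutoff used in S502
    Returns (optimal_features, redundant_features).
    """
    # S501-S502: keep only genes whose relevance exceeds the threshold.
    kept = [g for g, r in relevance.items() if r > threshold]
    # S503: most relevant gene first.
    kept.sort(key=lambda g: relevance[g], reverse=True)

    optimal, redundant = [], []
    while kept:
        target = kept.pop(0)          # current target gene X_j
        optimal.append(target)
        survivors = []
        for cand in kept:             # candidate gene X_i
            # S504: X_i is redundant if F_jc >= F_ic and F_ij >= F_ic.
            if (relevance[target] >= relevance[cand]
                    and pairwise(target, cand) >= relevance[cand]):
                redundant.append(cand)
            else:
                survivors.append(cand)
        kept = survivors              # S505: move on to the next target
    return optimal, redundant


# Hypothetical scores for four genes.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.5, "g4": 0.05}
pairs = {frozenset(("g1", "g2")): 0.85,
         frozenset(("g1", "g3")): 0.10,
         frozenset(("g2", "g3")): 0.20}
best, dropped = markov_blanket_filter(
    scores, lambda a, b: pairs[frozenset((a, b))], threshold=0.1)
```

In this toy input, g4 falls below the relevance threshold, g2 is strongly correlated with the more relevant g1 and is treated as redundant, and g3 survives because it is only weakly correlated with g1.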

S6: constructing a weighted gene co-expression network according to the gene features in the optimal feature set.

In one possible implementation, constructing the weighted gene co-expression network according to the gene features in the optimal feature set in S6 includes the sub-steps S601 to S604:

S601: constructing a correlation coefficient matrix F according to the correlations between the gene features in the optimal feature set and the following formula (4):

$$F = \left[ f_{ij} \right]$$ （4）

Wherein f_ij represents the correlation between gene feature X_i and gene feature X_j.

The correlation coefficient matrix F was constructed by taking into consideration the correlation between the characteristics of the genes. This helps to capture the interrelationship between genes, providing comprehensive information for subsequent network construction.

S602: introducing a soft threshold according to the correlation coefficient matrix and the following formula (5), and constructing an adjacency matrix:

$$A = \left[ a_{ij} \right], \quad a_{ij} = \left| f_{ij} \right|^{\gamma}$$ （5）

Wherein A represents the adjacency matrix, a_ij represents the adjacency value between gene feature X_i and gene feature X_j, and γ represents the soft threshold.

The soft threshold is a nonlinear transformation mode commonly used in mathematical operation, and is generally used in the fields of signal processing, image processing, statistics and the like. In the context of constructing an adjacency matrix, soft thresholds are used to adjust the elements in the correlation coefficient matrix to introduce a sparsity that highlights the portions of the network that have significant correlation.

It should be noted that, soft threshold values are introduced as a construction mechanism of the adjacency matrix, and by performing soft threshold processing on absolute values of correlations, gene features with significant correlations in the network are emphasized, and connections with weaker correlations are weakened. This can make the network more sparse, focusing on critical genetic relationships, helping to simplify the network structure.
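The effect of soft thresholding can be seen in a small Python sketch. It assumes the power form commonly used for weighted gene co-expression networks, in which each adjacency value is the absolute correlation raised to the power γ; the function name and the example value γ = 6 are illustrative assumptions.

```python
def soft_threshold_adjacency(F, gamma):
    """Raise the absolute value of every correlation to the power gamma.

    F     : square correlation matrix as a list of lists
    gamma : soft threshold exponent (assumed >= 1)
    """
    return [[abs(f) ** gamma for f in row] for row in F]


# Toy correlation matrix for three gene features.
F = [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]]
A = soft_threshold_adjacency(F, gamma=6)
```

With γ = 6 a correlation of 0.8 keeps an adjacency of about 0.26, while correlations of 0.2 and 0.1 shrink to below 0.0001, so weak connections fade without a hard cutoff having to be chosen.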

S603: constructing a topology matrix according to the adjacency matrix and the following formula (6):

$$W = \left[ w_{ij} \right], \quad w_{ij} = \frac{\sum_{\mu \neq i,j} a_{i\mu} a_{\mu j} + a_{ij}}{\min\left( \sum_{\mu \neq i} a_{i\mu},\ \sum_{\mu \neq j} a_{\mu j} \right) + 1 - a_{ij}}$$ （6）

Wherein W represents the topology matrix; w_ij represents the association strength between gene feature X_i and gene feature X_j, and also the weight between them; w_ij = 1 indicates that X_i and X_j are directly connected or that there is a gene connected to both X_i and X_j; conversely, w_ij = 0 indicates that there is neither a direct connection between X_i and X_j nor an indirect connection through other genes; a_iμ represents the adjacency value between gene feature X_i and gene feature X_μ; and a_μj represents the adjacency value between gene feature X_μ and gene feature X_j.

It should be noted that, constructing the topology matrix according to the adjacency matrix, by introducing the association strength and the weight, the connection situation between the gene features can be described more precisely. Topology matrices provide a deeper understanding of the network structure, including direct and indirect connections between genes.
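How direct and indirect connections enter the topology matrix can be illustrated with the topological overlap measure commonly used in weighted gene co-expression network analysis. The sketch below assumes that standard measure (shared-neighbor term plus direct adjacency, normalized by the smaller connectivity); the function name and the convention of setting the diagonal to 1 are assumptions.

```python
def topological_overlap(A):
    """Topological overlap matrix from a symmetric adjacency matrix A
    with entries in [0, 1], given as a list of lists."""
    n = len(A)
    # Connectivity of each node, excluding the self-connection.
    k = [sum(A[i][u] for u in range(n) if u != i) for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                W[i][j] = 1.0  # convention: a node fully overlaps itself
                continue
            # Indirect connection strength through every other node u.
            shared = sum(A[i][u] * A[u][j]
                         for u in range(n) if u not in (i, j))
            W[i][j] = (shared + A[i][j]) / (min(k[i], k[j]) + 1.0 - A[i][j])
    return W


# Path graph 0 - 1 - 2: nodes 0 and 2 are linked only through node 1.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
W = topological_overlap(path)
```

For the path graph, the directly linked pair (0, 1) gets an overlap of 1, while the pair (0, 2), which is connected only through node 1, gets 0.5. This shows how the measure credits indirect connections that a plain adjacency value would miss.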

S604: constructing the weighted gene co-expression network according to the topology matrix.

It should be noted that, in the weighted gene co-expression network constructed from the topology matrix, the weights reflect the association strength between gene features. Such a network reflects the interactions between genes more accurately and provides a powerful tool for further analysis.

In the invention, the weighted gene co-expression network comprehensively considers the information such as correlation, intensity, weight and the like, comprehensively reflects the relation among gene characteristics, and provides a tool with more information quantity and accuracy for biological research and analysis.

S7: determining the pivot genes according to the weighted gene co-expression network.

In one possible embodiment, determining the pivot genes according to the weighted gene co-expression network in S7 comprises the sub-steps S701 and S702:

S701: determining connectivity of individual gene features in the weighted gene co-expression network.

Where connectivity is a measure of the number or strength of connections of one node to other nodes. By calculating the connectivity of each gene feature in the weighted gene co-expression network, key nodes in the network, i.e., gene features with higher connectivity, can be identified. These genetic features may play a biologically important role, for example, in regulating networks.

S702: determining a gene feature to be a pivot gene when its connectivity is greater than a preset connectivity.

A pivot gene (also called a hub gene) is a node with high connectivity in the network and is generally considered to play a key role in network structure and function. Pivot genes may be key regulatory factors in regulatory networks, signaling pathways, or biological processes.

In the invention, screening out the gene features with higher connectivity as pivot genes simplifies network analysis. Concentrating on these gene features directs attention to the biologically most important parts of the network and improves understanding of the network structure.
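Sub-steps S701 and S702 can be sketched in Python as follows, taking connectivity to be the sum of a gene's connection weights to all other genes. That definition, the function names, and the example threshold are assumptions; the text only requires a connectivity measure and a preset cutoff.

```python
def connectivity(W):
    """Sum of each node's weights to all other nodes (S701)."""
    n = len(W)
    return [sum(W[i][j] for j in range(n) if j != i) for i in range(n)]


def select_pivot_genes(genes, W, min_connectivity):
    """Keep genes whose connectivity exceeds the preset value (S702)."""
    return [g for g, k in zip(genes, connectivity(W)) if k > min_connectivity]


# Toy weight matrix: g4 is only weakly connected to the other genes.
genes = ["g1", "g2", "g3", "g4"]
W = [[1.0, 0.6, 0.5, 0.1],
     [0.6, 1.0, 0.4, 0.0],
     [0.5, 0.4, 1.0, 0.1],
     [0.1, 0.0, 0.1, 1.0]]
pivots = select_pivot_genes(genes, W, min_connectivity=0.9)
```

Here g1, g2, and g3 each have a connectivity of about 1.0 or more and are kept, while g4, at about 0.2, is excluded.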

To test the accuracy and effectiveness of the feature screening and redundant feature elimination method, liver cancer samples were classified, based on liver cancer gene expression data, using three classical machine learning models (support vector machine, random forest, and logistic regression). Comparing the performance of these classifiers on the gene feature set shows which classification model, when given the selected pivot genes, classifies the samples accurately.

S8: performing cluster analysis on the pivot genes to obtain a plurality of gene modules, and summarizing the expression levels of the pivot genes in each gene module to obtain a characteristic value of each gene module, namely the module characteristic gene.

It should be noted that cluster analysis organizes similar gene features into modules to form a modular network structure. This helps to understand the modular nature of gene regulatory networks, i.e., genes with similar functions or biological significance are grouped together to form a module.

In the invention, a large number of gene characteristics are integrated into a plurality of modules through cluster analysis, so that the complexity of a network can be simplified. This allows researchers to more easily understand the overall structure and function of the network, reducing the complexity of direct analysis of large numbers of genes.
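As a non-limiting illustration, summarizing the expression levels within each gene module can be sketched as follows. The sketch assumes that the summary is the per-sample mean of the standardized expression of the module members; this is a simplification chosen for brevity (the first principal component of the module expression is another common choice), and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's expression across samples."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def module_summary(expression, modules):
    """expression: gene -> per-sample values; modules: module -> member genes.
    Returns module -> per-sample summary values."""
    summaries = {}
    for name, members in modules.items():
        standardized = [zscore(expression[g]) for g in members]
        # Average the standardized profiles sample by sample.
        summaries[name] = [mean(column) for column in zip(*standardized)]
    return summaries


# Toy expression data for three samples (hypothetical values).
expression = {"G1": [1.0, 2.0, 3.0], "G2": [2.0, 4.0, 6.0], "G3": [5.0, 5.0, 5.0]}
modules = {"M1": ["G1", "G2"], "M2": ["G3"]}
summaries = module_summary(expression, modules)
print(summaries)
```

Each module is thereby reduced to one value per sample, which can then be used as a single feature in the later classification steps.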

In one possible embodiment, performing cluster analysis on the pivot genes in S8 to obtain a plurality of gene modules comprises:

S801: performing cluster analysis on the pivot genes.

In the invention, the details of the clustering result can be adjusted through dynamic pruning, so that a compact gene module is obtained. The compact module is easier to understand and interpret, helps to reduce redundant information, and makes the module structure clearer. At the same time, dynamic pruning allows a hierarchical structure of modules to be formed, i.e. sub-modules may exist inside one module. This helps to more closely understand the network structure, identify sub-functions or subdivisions within the modules.
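As a non-limiting illustration, grouping genes into modules by hierarchical clustering can be sketched as follows. The sketch uses average-linkage agglomerative clustering with a fixed cut height and a minimum module size as a simplified stand-in for dynamic pruning, which adapts the cut to the shape of the dendrogram; the dissimilarity values, the cut height and the function name are hypothetical.

```python
def average_linkage_modules(names, dissimilarity, cut_height, min_size=2):
    """Merge the two closest clusters until the closest pair is farther apart
    than cut_height. Clusters smaller than min_size are reported as unassigned."""
    clusters = [[i] for i in range(len(names))]

    def distance(a, b):
        # Average pairwise dissimilarity between two clusters.
        return sum(dissimilarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(distance(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        d, p, q = min(pairs)
        if d > cut_height:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    modules = [[names[i] for i in sorted(c)] for c in clusters if len(c) >= min_size]
    unassigned = [names[i] for c in clusters if len(c) < min_size for i in c]
    return modules, unassigned


# Toy dissimilarity matrix, e.g. 1 minus a network similarity (hypothetical values).
genes = ["G1", "G2", "G3", "G4", "G5"]
dissimilarity = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
modules, unassigned = average_linkage_modules(genes, dissimilarity, cut_height=0.5)
print(modules, unassigned)
```

Lowering the cut height or raising the minimum size yields smaller, more compact modules, which mirrors the adjustment of clustering detail described above.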

S9: combining the junction genes, the module characteristic genes, the cancer tissue infiltrating microorganism content and the immune cell content to obtain various combined features.

In the invention, features from multiple sources, such as the pivot genes, the module characteristic genes, the cancer tissue infiltrating microorganism content and the immune cell content, are combined, which helps to take information from different levels into account. This integration yields a more comprehensive classification model that can better capture the complex features of a sample.
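As a non-limiting illustration, forming the combined features can be sketched as follows. The sketch assumes that every non-empty combination of the four feature groups is formed by concatenating the per-sample feature vectors of the selected groups; whether all combinations or only some are evaluated is not specified here, and the group names and values are hypothetical.

```python
from itertools import combinations


def combine_features(feature_groups):
    """feature_groups: group name -> list of per-sample feature vectors.
    Yields (group names, per-sample rows) for every non-empty combination."""
    names = list(feature_groups)
    n_samples = len(next(iter(feature_groups.values())))
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            rows = []
            for s in range(n_samples):
                row = []
                for group in combo:
                    # Concatenate this group's features for sample s.
                    row.extend(feature_groups[group][s])
                rows.append(row)
            yield combo, rows


# Two samples, four feature groups (hypothetical values).
groups = {
    "hub": [[0.9, 0.1], [0.2, 0.8]],
    "module": [[0.5], [-0.5]],
    "microbe": [[0.03], [0.10]],
    "immune": [[0.4], [0.6]],
}
combos = list(combine_features(groups))
print(len(combos))
```

With four feature groups this produces fifteen candidate combined features, each of which can be passed to the classifier in turn.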

S10: sample data is classified based on various combined features by a logistic regression model.

Specifically, classifying the sample data refers to classifying the sample data into a tumor class or a normal class.

In the invention, a logistic regression model is constructed using the various combined features, which helps to improve classification accuracy. Different types of features may reflect the state of a sample at different levels; by considering this information together, the classification model can better capture the differences between samples. In addition, the logistic regression model can be used for feature selection, identifying the features that have an important influence on the target variable. The linear form of logistic regression also helps to understand the relative contribution of each feature to the classification result, which improves the interpretability of the model.
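As a non-limiting illustration, classifying samples into the tumor class or the normal class with logistic regression can be sketched as follows. The sketch fits the model by plain batch gradient descent with a small L2 penalty; the learning rate, number of epochs, penalty strength and toy data are hypothetical choices, not parameters of the embodiment.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Batch gradient descent on the mean log-loss with an L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wi - lr * (g / n + 2 * l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, row):
    """Return 1 for the tumor class and 0 for the normal class."""
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) >= 0.5 else 0


# Toy data: feature 0 separates the classes, feature 1 is nearly uninformative.
X = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3], [-1.7, 0.2], [-2.1, -0.1], [-1.6, 0.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, r) == t for r, t in zip(X, y)) / len(y)
print(w, b, accuracy)
```

Because the model is linear in the features, the magnitude of each fitted coefficient gives a rough indication of that feature's contribution, provided the features are on comparable scales.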

In the present invention, by comparing the classification effects of logistic regression models under different combined features, it can be determined which combined feature is most effective for solving a particular problem. This helps to select the most appropriate combination of features in practical applications.

In a possible implementation manner, the determining the combined feature with the best classification effect based on the classification result in S11 includes substeps S111 to S113:

S111: constructing a classification objective function.

Optionally, the classification objective function is given by the following formula (7):

（7）

In the invention, the introduction of regularization terms (L1 regularization and L2 regularization) helps to control the complexity of the model and prevent overfitting. The regularization terms penalize the regression coefficients, so that the model tends to select a small number of important features, is prevented from overfitting the training set, and generalizes better.
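As a non-limiting illustration, one common way to write a logistic regression objective that carries both an L1 and an L2 regularization term is given below. It is an assumed general form and is not asserted to be identical to formula (7); the symbols n (number of samples), x_i (combined feature vector of sample i), y_i (class label, 1 for tumor and 0 for normal), w and b (regression coefficients and intercept), and lambda_1, lambda_2 (regularization strengths) are introduced here only for this illustration.

```latex
J(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]
  + \lambda_1 \lVert \mathbf{w} \rVert_1
  + \lambda_2 \lVert \mathbf{w} \rVert_2^2,
\qquad
\hat{p}_i = \frac{1}{1 + e^{-(\mathbf{w}^{\top} \mathbf{x}_i + b)}}
```

In this form the L1 term drives some coefficients to exactly zero, which favors a small set of features, while the L2 term shrinks all coefficients and stabilizes the fit when features are correlated.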

S112: calculating the function value of the classification objective function based on the classification results under the various combined features.

It should be noted that, by calculating the function values of the classification objective function under various combination features, the combination feature corresponding to the case where the model performance is the best can be found. This helps to determine the best feature combination and improves the effectiveness of the classification model.
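As a non-limiting illustration, comparing function values across combined features can be sketched as follows. The sketch assumes an objective consisting of the mean log-loss plus L1 and L2 penalties, and assumes that a smaller function value indicates a better combined feature; the coefficients below are hypothetical, already-fitted values, and the combination names are placeholders.

```python
import math


def log_loss(w, b, X, y):
    """Mean negative log-likelihood of a logistic model."""
    total = 0.0
    for row, label in zip(X, y):
        z = b + sum(wi * xi for wi, xi in zip(w, row))
        # log(1 + exp(z)) - label * z, computed in a numerically stable way.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - label * z
    return total / len(y)


def objective(w, b, X, y, lambda1, lambda2):
    """Log-loss plus L1 and L2 penalties on the coefficients."""
    l1 = sum(abs(wi) for wi in w)
    l2 = sum(wi * wi for wi in w)
    return log_loss(w, b, X, y) + lambda1 * l1 + lambda2 * l2


def best_combination(fitted, y, lambda1=0.01, lambda2=0.01):
    """fitted: combination name -> (coefficients, intercept, feature rows).
    Returns the name with the smallest function value and all values."""
    values = {name: objective(w, b, X, y, lambda1, lambda2)
              for name, (w, b, X) in fitted.items()}
    return min(values, key=values.get), values


# Hypothetical fitted models for two combined features on four samples.
y = [1, 1, 0, 0]
fitted = {
    "hub": ([1.0], 0.0, [[2.0], [1.0], [-1.0], [-2.0]]),
    "hub+module": ([1.0, 1.0], 0.0,
                   [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]),
}
best, values = best_combination(fitted, y)
print(best, values)
```

Here the richer combination attains the lower function value even after paying a larger penalty for its extra coefficient, so it is the one selected.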

(2) In the invention, the junction genes, the module characteristic genes, the cancer tissue infiltrating microorganism content and the immune cell content are combined into features, and the combined feature with the best classification effect is determined from the classification results of the logistic regression model. Because these four kinds of information are considered together, the accuracy of tumor classification is further improved.

Referring to fig. 3 of the specification, a schematic structural diagram of a classification characteristic determining system based on a gene co-expression network according to an embodiment of the present invention is shown.

The embodiment of the invention provides a classification characteristic determining system 20 based on a gene co-expression network, which comprises a first processor 2001 and a memory 2002; the memory 2002 has stored thereon computer readable instructions which, when executed by the first processor, implement the classification characteristic determination method based on a gene co-expression network described in the above method embodiment.

Optionally, the classification characteristic determination system 20 based on the gene co-expression network may further comprise a transceiver 2003.

The first processor 2001 may be connected to the memory 2002 and the transceiver 2003, for example, via a communication bus.

The following describes the respective constituent elements of the classification characteristic determination system 20 based on the gene coexpression network in detail with reference to fig. 3:

The first processor 2001 is the control center of the classification characteristic determining system 20 based on the gene co-expression network, and may be a single processor or a collective term for a plurality of processing elements. For example, the first processor 2001 may be one or more central processing units (CPUs), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSPs) or one or more field programmable gate arrays (FPGAs).

Alternatively, the first processor 2001 may perform various functions of the classification characteristic determination system 20 based on the gene co-expression network by running or executing a software program stored in the memory 2002 and invoking data stored in the memory 2002.

In a specific implementation, first processor 2001 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 3, as an example.

In a specific implementation, as an embodiment, the classification characteristic determination system 20 based on a gene co-expression network may also include a plurality of processors, such as the first processor 2001 and the second processor 2004 shown in fig. 3. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

The memory 2002 is used for storing a software program for executing the solution of the present invention, and is controlled by the first processor 2001 to execute the solution, and the specific implementation may refer to the above method embodiment, which is not described herein.

Alternatively, the memory 2002 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 2002 may be integrated with the first processor 2001, or may exist independently and be coupled to the first processor 2001 through an interface circuit (not shown in fig. 3) of the classification characteristic determination system 20 based on a gene co-expression network, which is not particularly limited in the embodiment of the present invention.

The transceiver 2003 is used for communicating with a network device or with a terminal device.

Alternatively, transceiver 2003 may include a receiver and a transmitter (not separately shown in fig. 3). The receiver is used for realizing the receiving function, and the transmitter is used for realizing the transmitting function.

Alternatively, the transceiver 2003 may be integrated with the first processor 2001, or may exist independently and be coupled to the first processor 2001 through an interface circuit (not shown in fig. 3) of the classification characteristic determination system 20 based on a gene co-expression network, which is not particularly limited in the embodiment of the present invention.

It should be noted that the structure of the classification characteristic determination system 20 based on the gene co-expression network shown in fig. 3 does not constitute a limitation on the system, and an actual system may include more or fewer components than those illustrated, may combine some components, or may arrange the components differently.

In addition, the technical effects of the classification characteristic determining system 20 based on the gene co-expression network may refer to the technical effects of the classification characteristic determining method based on the gene co-expression network described in the above method embodiment, and will not be described herein.

It is to be appreciated that the first processor 2001 in embodiments of the invention may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.

It should also be appreciated that the memory in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.

It should be understood that the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A alone, both A and B, or B alone, wherein A and B may be singular or plural. In addition, the character "/" herein generally indicates that the associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which can be understood from the context.

In the present invention, "at least one" means one or more, and "a plurality" means two or more. "At least one of" or similar expressions means any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, and c may each be single or plural.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other various media capable of storing program code.

The foregoing is merely a specific embodiment of the present invention, and the protection scope of the present invention is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for determining classification characteristics based on a gene co-expression network, the method comprising:

S2: extracting genetic features of the sample data;

S4: calculating correlations between the genetic features;

2. The method for determining classification characteristics based on gene co-expression network according to claim 1, wherein the correlation between the gene characteristics in S4 is calculated by the following formula (1):

（1）

（2）

wherein p(X, Z, Y) represents the joint probability distribution of the first gene signature X, the second gene signature Z and the sample class Y; p(X, Z | Y) represents the joint probability distribution of the first gene signature X and the second gene signature Z given the sample class Y; p(X | Y) represents the conditional probability distribution of the first gene signature X given the sample class Y; and p(Z | Y) represents the conditional probability distribution of the second gene signature Z given the sample class Y.

3. The method for determining classification characteristics based on a gene co-expression network according to claim 1, wherein performing redundant feature elimination on the gene features by the Markov blanket filtering algorithm according to the correlations between the gene features in S5 to obtain an optimal feature set comprises:

4. The method for determining classification characteristics based on a gene co-expression network according to claim 3, wherein said determining whether the remaining gene characteristics constitute a Markov blanket with the target gene comprises:

（3）

5. The method for determining classification characteristics based on gene co-expression networks according to claim 1, wherein the constructing a weighted gene co-expression network according to the gene characteristics in the optimal characteristic set in S6 comprises:

（4）

（5）

（6）

6. The method for determining classification characteristics based on gene co-expression networks according to claim 1, wherein the determining a junction gene according to the weighted gene co-expression network of S7 comprises:

7. The method for determining classification characteristics based on gene coexpression network according to claim 1, wherein the step of performing cluster analysis on the junction genes in S8 to obtain a plurality of gene modules comprises:

S801: performing cluster analysis on the junction genes;

8. The method for determining classification characteristics based on gene co-expression network according to claim 1, wherein the determining the combined characteristics with the best classification effect based on the classification result in S11 comprises:

S111: constructing a classification objective function;

9. The method for determining classification characteristics based on a gene co-expression network according to claim 8, wherein the classification objective function is represented by the following formula (7):

（7）

10. A classification characteristic determination system based on a gene co-expression network for implementing the classification characteristic determination method based on a gene co-expression network according to any one of claims 1 to 9, characterized in that the system comprises:

a first processor;

a memory having stored thereon computer readable instructions which, when executed by the first processor, implement the classification characteristic determination method based on a gene co-expression network of any one of claims 1 to 9.