CN111145831A

CN111145831A - Method and device for constructing genetic subtype prediction model and computer equipment

Info

Publication number: CN111145831A
Application number: CN201911415078.9A
Authority: CN
Inventors: 黄庆生; 梁会营; 钟嘉泳; 高欢; 李宽荣
Original assignee: Guangzhou Women and Childrens Medical Center
Current assignee: Guangzhou Women and Childrens Medical Center
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-12
Anticipated expiration: 2039-12-31
Also published as: CN111145831B

Abstract

The present application relates to a method, apparatus, computer device and storage medium for constructing a genetic subtype prediction model. The method comprises the following steps: receiving a gene expression profile and a corresponding genetic subtype of a training sample; classifying and integrating the gene expression profiles according to the predetermined classification number to obtain target gene expression profiles; the target gene expression profile comprises categories corresponding to the classification numbers; selecting a target class from the classes of the target gene expression profile, and constructing a corresponding relation between the target class and the genetic subtype; and outputting a genetic subtype prediction model according to the corresponding relation. By adopting the method, the genetic subtype prediction model can be obtained based on the flexible corresponding relation constructed between the gene expression profile and the genetic subtype.

Description

Method and device for constructing genetic subtype prediction model and computer equipment

Technical Field

The present application relates to the field of biotechnology, and in particular, to a method, an apparatus, a computer device, and a storage medium for constructing a genetic subtype prediction model.

Background

In the traditional technology, cytogenetic abnormality can be determined according to characteristics such as gene expression level, immunophenotype or fluorescence in situ hybridization, and then a corresponding relation between the characteristics and genetic subtypes is obtained by analyzing a plurality of cases to construct a genetic subtype prediction model. With the development of biotechnology, research into organisms can be carried out to the molecular level of the genome and transcriptome. Therefore, it is necessary to construct more flexible and representative correspondences between gene expression profiles and genetic subtypes in transcriptomes.

Disclosure of Invention

In view of the above, there is a need to provide a method, an apparatus, a computer device and a storage medium for constructing a genetic subtype prediction model, which can construct a more flexible correspondence between a gene expression profile and a genetic subtype.

In a first aspect, a method for constructing a genetic subtype prediction model is provided, which includes:

receiving a gene expression profile and a corresponding genetic subtype of a training sample;

classifying and integrating the gene expression profiles according to the predetermined classification number to obtain target gene expression profiles; the target gene expression profile comprises categories corresponding to the classification numbers;

selecting a target class from the classes of the target gene expression profile, and constructing a corresponding relation between the target class and the genetic subtype;

and outputting a genetic subtype prediction model according to the corresponding relation.

In one embodiment, the step of selecting a target class from the classes in the target gene expression profile and constructing the corresponding relationship between the target class and the genetic subtype includes:

obtaining the category value of each category in the target gene expression profile;

determining the category corresponding to the maximum category value in the category values as the target category;

and constructing the corresponding relation between the target class and the genetic subtype.

In one embodiment, when the training sample includes a plurality of subsamples, the step of constructing the correspondence between the target class and the genetic subtype includes:

respectively constructing, according to the category values of the subsamples, sub-correspondences between the sub-genetic subtypes and the sub-target categories of the subsamples; each sub-correspondence corresponds to a different subsample;

selecting, from the sub-correspondences, the sub-correspondences for the same sub-target category;

selecting, from the sub-genetic subtypes of the selected sub-correspondences, the sub-genetic subtype that occurs most often;

and constructing the corresponding relation between that sub-target category and the selected sub-genetic subtype.

In one embodiment, when the training sample comprises a plurality of subsamples, the step of performing classification integration on the gene expression profile according to the predetermined classification number comprises:

classifying and integrating the gene expression profiles expressed in a matrix form according to a predetermined classification number in a non-negative matrix factorization mode to obtain a weight matrix of the class and a class value matrix of the class; the category value matrix is a target gene expression profile in a matrix form;

the step of outputting a genetic subtype prediction model according to the correspondence includes:

mapping the correspondence in a row of the weight matrix;

performing singular value decomposition processing on the weight matrix;

and performing pseudo-inverse processing on the weight matrix subjected to the singular value decomposition processing to obtain a genetic subtype prediction model in the form of a pseudo-inverse matrix.

In one embodiment, after the step of performing pseudo-inverse processing on the weight matrix subjected to the singular value decomposition processing to obtain a genetic subtype prediction model in the form of a pseudo-inverse matrix, the method further includes:

acquiring a gene expression profile matrix; the gene expression profile matrix is formed by the gene expression profiles of test samples;

performing matrix multiplication processing on the gene expression profile matrix and the genetic subtype prediction model in the form of the pseudo-inverse matrix to obtain a classification result matrix of the test sample; the classification result matrix comprises test class values corresponding to the classes;

determining a category value with the maximum value from the test category values of the classification result matrix;

and using the category corresponding to the determined category value as the reference category of the test sample.

In one embodiment, after the step of using the category corresponding to the determined category value as the reference category of the test sample, the method further comprises:

and taking the genetic subtype corresponding to the reference category as the genetic subtype of the test sample according to the corresponding relation.

In one embodiment, the method further comprises: determining the number of classifications according to the minimum description length criterion.

In a second aspect, there is provided an apparatus for constructing a genetic subtype prediction model, comprising:

the information acquisition module is used for receiving a gene expression profile and a corresponding genetic subtype of a training sample;

the classification integration module is used for classifying and integrating the gene expression profiles according to the predetermined classification number to obtain target gene expression profiles; the target gene expression profile comprises categories corresponding to the classification numbers;

the corresponding relation construction module is used for selecting a target class from the classes of the target gene expression profile and constructing the corresponding relation between the target class and the genetic subtype;

and the prediction model output module is used for outputting a genetic subtype prediction model according to the corresponding relation.

In a third aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

According to the method, the device, the computer equipment and the storage medium for constructing the genetic subtype prediction model, the gene expression profiles of the training samples are classified and integrated according to the preset classification number to obtain the classes corresponding to the classification number. Each class can therefore comprise the characteristics of multiple genes, that is, the genes corresponding to a class are not unique, and when an unknown gene type appears it can be classified and integrated together with other known gene types, so that the classes include genes of unknown type. A target class is then selected from the classes and the corresponding relation between the target class and the genetic subtype is constructed, so that the genetic subtype can correspond to multiple genes, including those of unknown type, and a flexible corresponding relation is thereby constructed between the gene expression profiles and the genetic subtype.

Drawings

FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;

FIG. 2 is a schematic flow diagram of a method for constructing a genetic subtype prediction model in one embodiment;

FIG. 3 is a schematic flow chart showing a method for constructing a genetic subtype prediction model in another embodiment;

FIG. 4 is a block diagram showing an example of the structure of an apparatus for constructing a genetic subtype prediction model.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The method for constructing the genetic subtype prediction model provided by the application can be applied to computer equipment shown in figure 1. In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 1. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of constructing a genetic subtype prediction model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, as shown in fig. 2, a method for constructing a genetic subtype prediction model is provided, which is described by taking the method as an example of being applied to the computer device in fig. 1, and it is understood that the method can be applied to a server, a terminal, and a system comprising the terminal and the server, and is implemented by interaction between the terminal and the server. In this embodiment, the method comprises the following steps:

step S202, receiving a gene expression profile and a corresponding genetic subtype of a training sample.

The training sample may include one case sample or a plurality of case samples; when it includes a plurality of case samples, the training sample may be considered to include a plurality of subsamples. A case sample may include gene information of a case with a known genetic subtype, and the gene information may be gene expression values; for example, if the genetic subtype of a certain case is i, the gene information of that case may be used as a case sample. The gene expression value for a given gene may differ between case samples; the genetic subtype of each case sample has been determined in advance. The gene expression profile may comprise the gene expression values of one case sample or of a plurality of case samples; a single case sample in the gene expression profile may have a plurality of gene expression values, each corresponding to a different gene type.

Further, the gene expression profile may be represented in matrix form. Specifically, the gene expression values of the case samples are arranged in a specific format to obtain a gene expression profile matrix. The gene expression values of each case sample may be arranged vertically, so that the number of case samples is the number of columns and the number of gene types is the number of rows; each gene type corresponds to a gene expression value. When the case samples do not all cover the same gene types, the gene types are merged and the number of distinct gene types after merging is used as the number of rows. For example, suppose there are 3 case samples (n = 3), where the gene expression values of one case sample correspond to m1 genes and those of the other two case samples both correspond to m2 genes. The number of gene types found only among the m2 (or m1) genes is determined and added to m1 (or m2) to give m3 gene types, and a gene expression profile matrix of m3 rows and n columns is constructed. It is understood that the gene expression values of each case sample may instead be arranged horizontally; taking n cases and m genes as an example, a gene expression profile matrix of n rows and m columns is then constructed.
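The matrix construction described above can be illustrated with a minimal Python sketch. The function name `build_expression_matrix`, the gene names, and the choice to fill genes that a case does not report with 0.0 are assumptions made for illustration only; the source does not specify how missing gene types are filled.

```python
import numpy as np

def build_expression_matrix(case_samples, fill_value=0.0):
    """Arrange per-case gene expression values into an m x n matrix.

    case_samples: list of dicts mapping gene name -> expression value.
    Rows are the merged (union) gene types, columns are case samples.
    fill_value for genes absent from a case is an assumption of this sketch.
    """
    # Merge the gene types of all cases; the union size is the row count (m3).
    genes = sorted(set().union(*(s.keys() for s in case_samples)))
    row_of = {gene: i for i, gene in enumerate(genes)}
    V = np.full((len(genes), len(case_samples)), fill_value, dtype=float)
    for col, sample in enumerate(case_samples):
        for gene, value in sample.items():
            V[row_of[gene], col] = value
    return genes, V

# Hypothetical data: one case covers genes {A, B}, the other two cover {B, C}.
samples = [
    {"geneA": 2.0, "geneB": 5.0},
    {"geneB": 1.0, "geneC": 3.0},
    {"geneB": 4.0, "geneC": 0.5},
]
genes, V = build_expression_matrix(samples)
```

Here m1 = 2 and m2 = 2, one gene type occurs only in the second group, so m3 = 3 and `V` has 3 rows and 3 columns. Transposing `V` gives the horizontal arrangement.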

In this step, the computer device may obtain the gene expression profile and the genetic subtype of the training sample by means of user input or by means of searching from an online database.

Step S204, classifying and integrating the gene expression profiles according to the predetermined classification number to obtain target gene expression profiles; the target gene expression profile includes categories corresponding to the number of classifications.

The categories can be obtained after the gene expression profiles are classified and integrated, the classification number can be understood as the number of the categories, and the classification number is used for determining the number of the categories obtained by classifying and integrating the gene expression profiles, that is, determining the number of the categories into which the gene expression profiles are classified and integrated.

In this step, after obtaining the predetermined classification number, the computer device classifies and integrates the gene expression profiles according to that number to obtain a target gene expression profile that includes categories corresponding to the classification number. The classification and integration may be performed on the gene expression values in the gene expression profile; it can be understood that, in doing so, the gene types corresponding to those gene expression values are classified and integrated as well. For example, if the gene expression values in the gene expression profile correspond to m gene types and the classification number is 3, the m genes are classified and integrated into 3 categories, giving a target gene expression profile that includes 3 categories. The categories can be identified by labels such as Roman numerals (I, II, III) or English letters (A, B, C). Optionally, the classification number may be determined by a minimum description length criterion. The number of case samples in the gene expression profile and in the target gene expression profile may be the same.

Further, when the gene expression profile is represented as a matrix, the classification and integration performed by the computer device can be understood as a compression transformation of the matrix: the number of rows (or columns) of the gene expression profile matrix is compressed to the number of rows (or columns) given by the classification number, yielding a target gene expression profile matrix. For example, if the gene expression profile matrix is built from n cases and m genes with m rows and n columns, compressing it according to the classification number r gives a target gene expression profile matrix with r rows and n columns. Alternatively, the gene expression profile matrix may have n rows and m columns, in which case the target gene expression profile matrix has n rows and r columns. In addition, when the training sample comprises a plurality of cases (a plurality of subsamples), the gene expression profile matrix is built from the plurality of cases, and the compression transformation can be realized by non-negative matrix factorization (NMF). When the gene expression profile is expressed as a matrix, the classification number can also be determined by a consensus matrix visualization method, for example one based on multiple NMF decompositions.

And S206, selecting a target class from the classes of the target gene expression profiles, and constructing a corresponding relation between the target class and the genetic subtype.

The target category may be a category selected from all categories for constructing a correspondence with the genetic subtype. The target category may be selected from all categories by random selection or by category value selection.

In this step, after obtaining the target gene expression profile through the classification and integration processing, the computer device selects a target class from the classes included in the target gene expression profile and constructs a corresponding relationship between the target class and the genetic subtype according to the obtained genetic subtype. For example, when the target gene expression profile comprises 3 categories, category I is selected as the target category, and a correspondence I-i between target category I and genetic subtype i is constructed according to the acquired genetic subtype i.

After the gene expression values in the gene expression profile are classified and integrated, a numerical value corresponding to the category can be obtained, the numerical value can be understood as a category value, and the category value can be used for representing the degree of closeness of the association between the corresponding category and the genetic subtype. In this step, the target category may be determined according to the category value corresponding to the category, and specifically, the computer device may obtain the category value of each category in the target gene expression profile, select the category value with the largest value, determine the category corresponding to the largest category value, use the selected category as the target category, and further construct the correspondence between the target category and the genetic subtype.
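As an illustrative sketch of selecting the target category by its category value, the following Python snippet picks the category with the largest value and pairs it with the known genetic subtype. The function name, the category values, and the labels are hypothetical.

```python
import numpy as np

def select_target_category(category_values, labels):
    """Return the label of the category with the largest category value."""
    idx = int(np.argmax(category_values))
    return labels[idx]

# Hypothetical category values of one case sample for categories I, II, III.
h = np.array([0.7, 0.1, 0.2])
target = select_target_category(h, ["I", "II", "III"])

# The known genetic subtype of this case is assumed to be "i".
correspondence = (target, "i")
```

Note that `np.argmax` returns the first index when several values are tied for the maximum; the source does not say how such ties should be handled.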

In addition, in this step, when the training sample includes a plurality of case samples, that is, a plurality of subsamples, the genetic subtype, the selected category, and the correspondence of each case sample may differ. To distinguish this case, the genetic subtype, selected category, and correspondence of each case sample are referred to as the sub-genetic subtype, sub-target category, and sub-correspondence, respectively. The sub-correspondence of each subsample may be constructed according to the category values of that subsample. Taking one subsample as an example, the computer device obtains the category values of the subsample, selects the maximum category value, takes the category corresponding to the maximum category value as the sub-target category, and constructs the sub-correspondence between the sub-genetic subtype and the sub-target category of the subsample. The sub-correspondences of the other subsamples may be constructed in the same way, which is not described again here.

After constructing the sub-correspondences of the plurality of subsamples, the computer device selects from them the sub-correspondences that share the same sub-target category and takes that shared sub-target category as the target category, that is, as the category for which the correspondence is constructed. The selected sub-correspondences share one sub-target category but may include one or more sub-genetic subtypes; the computer device selects the sub-genetic subtype that occurs most often among them and constructs the correspondence between the shared sub-target category and that sub-genetic subtype.

For example, suppose there are 10 subsamples, each with its own sub-correspondence, as listed in the table below. To determine the correspondence of target category I, the sub-correspondences of subsamples 1, 2, 4, 5, 8, 9, and 10 are selected, and the numbers of occurrences of the sub-genetic subtypes in these sub-correspondences are counted and ranked. In this example, sub-genetic subtype i occurs most often (5 times out of 7), so the correspondence is constructed between target category I and sub-genetic subtype i.

Case sample	Sub-correspondence	Case sample	Sub-correspondence
Subsample 1	I-i	Subsample 6	III-i
Subsample 2	I-i	Subsample 7	III-i
Subsample 3	II-i	Subsample 8	I-i
Subsample 4	I-ii	Subsample 9	I-i
Subsample 5	I-iii	Subsample 10	I-i

It is understood that when two or more sub-genetic subtypes are tied for the most occurrences, one of them may be selected at random and a correspondence constructed between it and the target category. Alternatively, when two or more sub-genetic subtypes are tied for the most occurrences, each of them may be associated with the target category.
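The majority selection over sub-correspondences can be sketched as follows, using the ten sub-correspondences from the table above. The function name `build_correspondence` is hypothetical, and this sketch handles ties by keeping every tied sub-genetic subtype (the second option described above) rather than choosing one at random.

```python
from collections import Counter, defaultdict

def build_correspondence(sub_correspondences):
    """Map each sub-target category to its most frequent sub-genetic subtype(s).

    sub_correspondences: iterable of (category, subtype) pairs, one per subsample.
    Returns a dict: category -> sorted list of the subtypes tied for most frequent.
    """
    counts_by_category = defaultdict(Counter)
    for category, subtype in sub_correspondences:
        counts_by_category[category][subtype] += 1

    result = {}
    for category, counts in counts_by_category.items():
        top = max(counts.values())
        result[category] = sorted(s for s, c in counts.items() if c == top)
    return result

# Sub-correspondences of subsamples 1 to 10, in order, as given in the table.
subs = [
    ("I", "i"), ("I", "i"), ("II", "i"), ("I", "ii"), ("I", "iii"),
    ("III", "i"), ("III", "i"), ("I", "i"), ("I", "i"), ("I", "i"),
]
mapping = build_correspondence(subs)
```

For category I, subtype i occurs 5 times against one occurrence each of ii and iii, so `mapping["I"]` contains only `"i"`.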

And S208, outputting a genetic subtype prediction model according to the corresponding relation.

The genetic subtype prediction model can be used for purposes other than disease diagnosis. For example, the model can be used to process a test sample, and the genetic subtype obtained can be compared with the genetic subtype determined in other ways to verify the predictive performance of those other ways; as another example, the model can be used to process test samples and classify them according to the genetic subtypes obtained.

In the step, after the computer equipment obtains the corresponding relation between the target category and the genetic subtype, the corresponding relation is mapped into a genetic subtype prediction model, and the genetic subtype prediction model obtained through mapping processing is output; when a plurality of corresponding relations exist, the computer device can also output corresponding genetic subtype prediction models according to the plurality of corresponding relations.

In the method for constructing the genetic subtype prediction model, the gene expression profiles of the training samples are classified and integrated according to the preset classification number to obtain classes corresponding to the classification number, so that each class can comprise the characteristics of multiple genes, that is, the genes corresponding to a class are not unique; when an unknown gene type appears, it can be classified and integrated together with other known gene types, so that the classes include genes of unknown type.

In one embodiment, when the gene expression profile is expressed in matrix form, it can be classified and integrated according to the predetermined classification number by non-negative matrix factorization to obtain a weight matrix of the classes and a class value matrix of the classes; the class value matrix is the target gene expression profile in matrix form. For example, a gene expression profile made up of m genes from n cases can be represented by an m × n matrix V. With a predetermined classification number r, the NMF algorithm approximates V by the product of two non-negative matrices, that is, classifying and integrating V gives V ≈ WH, where W is an m × r non-negative matrix that can be used to characterize the weight matrix of the classes, and H is an r × n non-negative matrix that can be used to characterize the class value matrix of the classes. Further, the classification number r may be determined using a minimum description length criterion. Further, to reduce randomness and improve the repeatability of the classes obtained by classification and integration, the NMF method can be applied multiple times to obtain multiple (W, H) pairs; the approximation error between V and each product WH is measured using the Euclidean distance or the Kullback-Leibler distance, the pairs whose approximation error reaches a preset value are selected, and the averages of the selected W and H are calculated.

When selecting the (W, H) pairs whose approximation error reaches the preset value, the pairs can be sorted by approximation error and those ranked before a preset position selected, which is equivalent to selecting the pairs that reach the preset value. For example, the NMF method can be run 60 times to obtain 60 (W, H) pairs, which are sorted by approximation error from small to large, and the first 20 pairs are selected.
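The repeated factorization and selection can be sketched in Python as below. The source does not name a particular NMF update rule; this sketch assumes the standard multiplicative updates for the Euclidean objective, and the function names, iteration count, and random test matrix are illustrative assumptions. Only the Euclidean error is shown, not the Kullback-Leibler variant.

```python
import numpy as np

def nmf(V, r, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative m x n matrix V by W (m x r) times H (r x n).

    Uses multiplicative updates for the Euclidean objective (an assumption;
    the source only says that an NMF algorithm is used).
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def best_runs(V, r, n_runs=60, keep=20):
    """Run NMF n_runs times and keep the `keep` runs with the smallest error."""
    runs = []
    for seed in range(n_runs):
        W, H = nmf(V, r, seed=seed)
        err = np.linalg.norm(V - W @ H)  # Euclidean (Frobenius) approximation error
        runs.append((err, W, H))
    runs.sort(key=lambda run: run[0])
    return runs[:keep]

# Small hypothetical example: 8 genes, 6 cases, r = 3 classes, 6 runs, keep 2.
V = np.random.default_rng(42).random((8, 6))
selected = best_runs(V, r=3, n_runs=6, keep=2)

# Average the selected factors as the text describes. This sketch does not
# align the class order between runs, which is a simplification.
W_avg = np.mean([W for _, W, _ in selected], axis=0)
H_avg = np.mean([H for _, _, H in selected], axis=0)
```

Calling `best_runs(V, r, n_runs=60, keep=20)` corresponds to the 60-run, first-20 example in the text.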

Because the gene expression profiles are classified and integrated according to the classification number by matrix factorization to obtain the weight matrix W and the class value matrix H, the row elements in the rows of the weight matrix and the column elements in the columns of the class value matrix can be considered to be in one-to-one correspondence with the classes. Therefore, after the weight matrix W is obtained, a correspondence is constructed between the row elements in the rows of the weight matrix and the genetic subtypes according to the correspondence between the classes and the genetic subtypes, that is, the correspondence is mapped in the rows of the weight matrix. Singular value decomposition is then performed on the weight matrix, that is, W = S D T^T, where S and T are orthogonal matrices, T^T is the transpose of T, and D is a diagonal matrix whose diagonal entries are the singular values. Pseudo-inverse processing is then performed on the weight matrix to obtain a pseudo-inverse matrix W⁺, which is used as the genetic subtype prediction model; in other words, the genetic subtype prediction model is a model in the form of a pseudo-inverse matrix. The pseudo-inverse processing may be Moore-Penrose pseudo-inverse processing, with W⁺ = T D^-1 S^T.
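A minimal NumPy sketch of the singular value decomposition and Moore-Penrose pseudo-inverse steps follows. The function name, the tolerance `rcond`, and the example weight matrix are assumptions for illustration; singular values at or below the tolerance are treated as zero, which is the usual convention and is not spelled out in the source.

```python
import numpy as np

def pseudo_inverse_via_svd(W, rcond=1e-12):
    """Compute W+ = T D^-1 S^T from the decomposition W = S D T^T."""
    # np.linalg.svd returns S, the singular values d, and T^T.
    S, d, Tt = np.linalg.svd(W, full_matrices=False)
    d_inv = np.zeros_like(d)
    keep = d > rcond * d.max()       # ignore (near-)zero singular values
    d_inv[keep] = 1.0 / d[keep]
    return Tt.T @ np.diag(d_inv) @ S.T

# Hypothetical 3 x 2 weight matrix (3 genes, 2 classes).
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
W_pinv = pseudo_inverse_via_svd(W)
```

The result agrees with NumPy's built-in `np.linalg.pinv(W)`, which could be used directly in practice.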

In one embodiment, when the maximum class value of a test sample is determined using the genetic subtype prediction model in the form of a pseudo-inverse matrix, a gene expression profile matrix V' of the test sample may be obtained, the gene expression profile matrix being formed from the gene expression profiles of the test sample. The pseudo-inverse matrix and the gene expression profile matrix are then multiplied, that is, W⁺V' is computed, to obtain a classification result matrix H' of the test sample, where H' = W⁺V'. The classification result matrix comprises test class values corresponding to the classes; the largest class value is selected from the test class values, the class corresponding to that value is determined, and that class is used as the reference class of the test sample. For example, if in the j-th column of the classification result matrix H' the class value h'_{i,j} is the maximum value of that column, then h'_{i,j} is selected, the class corresponding to h'_{i,j} is determined to be I, and class I is used as the reference class of the test sample. When H' has only one column (j = 1), there is only one test case sample; when H' has a plurality of columns (j greater than 1), there are a plurality of test case samples, in which case the maximum class value of each column may be selected, the class corresponding to the maximum class value of each column determined, and each determined class used as the reference class of the corresponding test sample.
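The prediction step can be sketched as a matrix product followed by a column-wise maximum. The function name, the class labels, and the small matrices below are hypothetical; the test matrix is built so that its class values are known in advance.

```python
import numpy as np

def reference_categories(W_pinv, V_test, labels):
    """Compute H' = W+ V' and pick the class with the largest value per column."""
    H_test = W_pinv @ V_test               # r x k classification result matrix
    winners = np.argmax(H_test, axis=0)    # row index of the maximum in each column
    return H_test, [labels[i] for i in winners]

# Hypothetical weight matrix: 3 genes, 2 classes.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
# Two test case samples generated from known class values (one per column).
V_test = W @ np.array([[3.0, 0.5],
                       [1.0, 2.0]])
H_test, cats = reference_categories(np.linalg.pinv(W), V_test, ["I", "II"])
```

In this example the first test case sample is assigned reference class I and the second reference class II.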

The weight matrix W is transformed into the pseudo-inverse matrix W⁺, and the gene expression profile matrix V' is then multiplied by the pseudo-inverse matrix W⁺. This can be understood as a division operation between the gene expression profile matrix V' of the test sample and the weight matrix W: because V' ≈ WH', and both V' and W are known, H' could be obtained by division in ordinary arithmetic. Because matrices cannot be divided directly, however, the weight matrix W must first be transformed into the pseudo-inverse matrix W⁺, which is then multiplied by the gene expression profile matrix V'.
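The "division" described here can be written as a least-squares problem. The following short derivation is a sketch under the assumption, not stated in the application, that W has full column rank, with W = SDT^T taken as the thin singular value decomposition:

```latex
\begin{aligned}
H' &= \arg\min_{H} \lVert V' - W H \rVert_F^{2} = W^{+} V', \\
W^{+} &= (W^{T} W)^{-1} W^{T}
       = (T D^{2} T^{T})^{-1} \, T D S^{T}
       = T D^{-2} T^{T} T D S^{T}
       = T D^{-1} S^{T}.
\end{aligned}
```

Note that H' obtained in this way is not constrained to be non-negative, unlike the matrix H produced by the factorization of the training data.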

In another embodiment, after the computer device determines the reference category corresponding to the maximum category value, the genetic subtype corresponding to the reference category is used, according to the correspondence, as the genetic subtype of the test sample. For example, if the reference category corresponding to the maximum category value is I, then according to the correspondence I-I it can be determined that the genetic subtype corresponding to reference category I is I, and genetic subtype I is used as the genetic subtype of the test sample. When the test sample includes multiple test case samples, the genetic subtype of each test case sample may be determined by the method of this embodiment, which is not described again here.
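The lookup described in this embodiment amounts to applying a class-to-subtype table to each reference category. A minimal sketch follows; the function name `assign_subtypes`, the 0-based class indices, and the subtype labels are hypothetical.

```python
def assign_subtypes(reference_classes, class_to_subtype):
    """Map each reference class to its genetic subtype; None if no correspondence exists."""
    return [class_to_subtype.get(c) for c in reference_classes]

# Hypothetical correspondence for 4 classes (labels are illustrative only).
class_to_subtype = {0: "I", 1: "II", 2: "III", 3: "IV"}
predicted = assign_subtypes([0, 3, 1], class_to_subtype)
```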

In conventional genetic typing methods, when a test sample lacks known genetic material, it is difficult to determine the genetic subtype of the test sample when risk stratification is performed through genetic subtypes. This indicates that conventional genetic subtypes correspond only to known genetic material and not to unknown genetic material, so the correspondence between genetic subtypes and genetic material is inflexible. For this reason, in order to improve the flexibility of the correspondence between genetic subtypes and genetic material, the method for constructing a genetic subtype prediction model is provided. To aid understanding of the above method, an application example of the method for constructing the genetic subtype prediction model according to the present application is described below with reference to fig. 3.

In this embodiment, in step S302, the number of classifications r is determined by using the minimum description length criterion, which is equivalent to determining the optimal value of the number of classifications; the classification number r defines the number of classes, which is equivalent to the number of genetic subtypes;

Step S304, performing non-negative matrix factorization (NMF) typing on the gene expression profile. The gene expression profile of the training sample consists of m genes from n cases. The NMF algorithm approximates the gene expression profile by the product of two non-negative matrices: V ≈ WH, where V is an m × n matrix of gene expression profiles, W is an m × r non-negative matrix, and H is an r × n non-negative matrix. Further, since no NMF algorithm that directly obtains the best approximation is currently available, 60 independent NMF runs may be performed during NMF decomposition in order to reduce randomness and improve the repeatability of the grouping, and the W and H matrices of the 20 runs with the smallest approximation error are extracted and averaged. The H matrix encodes the subtypes of the cases: if the genetic subtype of case sample j is i and the class value h_i,j is the maximum value in the j-th column, then the class corresponding to h_i,j is associated with genetic subtype i, and a correspondence can be constructed between the class corresponding to h_i,j and genetic subtype i. The correspondence between categories and genetic subtypes obtained by NMF decomposition is determined by the genetic typing of the great majority of case samples in each category; that is, among the correspondences of samples 1, 2, 4, 5, 8, 9 and 10 in Table 1, the predominant genetic typing is I, so the correspondence between category I and genetic typing I can be constructed. The training sample may include 207 cases, and the number of classifications determined by using the minimum description length criterion may be 4.
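The repeated-run procedure in step S304 can be sketched in NumPy as follows. The application does not name the NMF update rule, so the multiplicative updates for the Frobenius loss used here are an assumption, as are the function names, the iteration count, and the toy data. The sketch averages the retained runs directly, as the text describes, and does not attempt to align the order of the classes between runs.

```python
import numpy as np

def nmf_once(V, r, rng, n_iter=200, eps=1e-9):
    """One NMF run V ~ W H using multiplicative updates (assumed update rule)."""
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    err = np.linalg.norm(V - W @ H)          # Frobenius approximation error
    return W, H, err

def averaged_nmf(V, r, n_runs=60, n_keep=20, seed=0):
    """Run NMF n_runs times, keep the n_keep runs with the smallest error, average W and H."""
    rng = np.random.default_rng(seed)
    runs = [nmf_once(V, r, rng) for _ in range(n_runs)]
    runs.sort(key=lambda run: run[2])
    best = runs[:n_keep]
    W_avg = np.mean([w for w, _, _ in best], axis=0)
    H_avg = np.mean([h for _, h, _ in best], axis=0)
    return W_avg, H_avg

# Toy non-negative expression matrix (assumed sizes: m = 8 genes, n = 10 cases).
V = np.random.default_rng(1).random((8, 10))
W_avg, H_avg = averaged_nmf(V, r=2, n_runs=6, n_keep=2)
```

With the figures given in the text, the call would use `n_runs=60`, `n_keep=20` and `r=4`; the smaller values above only keep the example fast.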

Step S306, after obtaining the weight matrix W, solving the pseudo-inverse matrix of the weight matrix W. Specifically, singular value decomposition is performed on W: W = SDT^T, where S and T are orthogonal matrices, T^T denotes the transpose of T, and D is a diagonal matrix whose diagonal entries are the singular values. The Moore-Penrose pseudo-inverse of the weight matrix W is then obtained as W⁺ = TD^-1S^T;

Step S308, processing the gene expression profile matrix V' of the test sample by using the weight matrix W to obtain a typing result matrix H' = W⁺V'. The H' matrix encodes the subtype of the test sample and satisfies the relation V' ≈ WH'.

Further, for test case sample j, the maximum class value in the column vector h'_j (that is, in the j-th column) is looked up and found to be h'_i,j. It can further be determined that h'_i,j corresponds to class I, and the genetic subtype of the test case sample is determined to be I based on the correspondence I-I.

In the method of any of the above embodiments, the genetic subtype prediction model is constructed from case samples with known genetic subtypes, which is equivalent to performing data analysis on the genetic subtypes and gene expression profiles of the case samples to obtain the genetic subtype prediction model. As described in the embodiments, the resulting genetic subtype prediction model can be used in non-disease-diagnosis fields such as verifying the accuracy of other prediction methods and classifying test samples. In addition, when a test sample with an unknown genetic subtype is to be predicted, the gene expression profile of the test sample must also be acquired and processed with the genetic subtype prediction model to obtain a reference class, and the genetic subtype corresponding to the maximum class value is then determined according to the correspondence. The cases and case samples mentioned in this application can be understood as one and the same concept.

In the above embodiment, the gene expression profiles of the training samples are classified and integrated according to the preset classification number to obtain classes corresponding to the classification number, so that each class can include the characteristics of multiple genes; that is, the genes corresponding to a class are not unique. When an unknown gene class appears, it can be classified and integrated together with other known gene classes, so that the classes also include genes of the unknown class. A target class is then selected from the classes, and the correspondence between the target class and the genetic subtype is constructed, so that a genetic subtype can correspond to multiple genes, including those of the unknown class, which improves the flexibility of the correspondence between genetic subtypes and genes. The method therefore does not require the genetic material changes of a genetic subtype to be completely consistent, and when the genetic material is unknown, a sample can be assigned to the nearest genetic subtype according to the similarity of gene expression profiles, thereby guiding risk stratification.

It should be understood that although the steps in the flow charts of fig. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application.

Based on the same idea as the method of constructing a genetic subtype prediction model in the above-described embodiments, the present application also provides an apparatus for constructing a genetic subtype prediction model, which can be used to perform the above-described method. For convenience of illustration, the schematic structural diagram of the apparatus embodiment shows only the parts related to the embodiments of the present application. Those skilled in the art will understand that the illustrated structure does not constitute a limitation of the apparatus, which may include more or fewer components than those illustrated, combine some components, or arrange the components differently.

In one embodiment, as shown in fig. 4, there is provided an apparatus 400 for constructing a genetic subtype prediction model, comprising: an information obtaining module 402, a classification integration module 404, a corresponding relation construction module 406 and a prediction model output module 408, wherein:

an information obtaining module 402, configured to receive a gene expression profile and a corresponding genetic subtype of a training sample;

a classification integration module 404, configured to perform classification integration on the gene expression profiles according to a predetermined classification number to obtain target gene expression profiles; the target gene expression profile comprises categories corresponding to the classification number;

a corresponding relationship construction module 406, configured to select a target category from categories of the target gene expression profile, and construct a corresponding relationship between the target category and a genetic subtype;

and a prediction model output module 408, configured to output a genetic subtype prediction model according to the correspondence.

In one embodiment, the correspondence construction module 406 is further configured to obtain a category value of each category in the target gene expression profile; determining the category corresponding to the maximum category value in the category values as a target category; and constructing the corresponding relation between the target class and the genetic subtype.

In an embodiment, when the training sample includes a plurality of subsamples, the correspondence construction module 406 is further configured to construct, according to the category values of the plurality of subsamples, sub-correspondences between the sub-genetic subtypes and the sub-target categories of the subsamples, each sub-correspondence corresponding to a different subsample; select, from the sub-correspondences, those directed to the same sub-target category; select, from the sub-genetic subtypes of the selected sub-correspondences, the sub-genetic subtype that occurs most often; and construct the correspondence between that sub-target category and the selected sub-genetic subtype.
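The majority-vote construction of the correspondence can be sketched as follows. The function name `build_correspondence`, the 0-based class indices, and the toy labels are assumptions for illustration; ties are resolved in favour of the subtype encountered first, which is a choice of this sketch and not something the application specifies.

```python
from collections import Counter
import numpy as np

def build_correspondence(H, subtypes):
    """Map each class to the genetic subtype that occurs most often among its subsamples."""
    target = np.argmax(H, axis=0)            # sub-target class of each subsample (column)
    votes = {}
    for cls, subtype in zip(target.tolist(), subtypes):
        votes.setdefault(cls, Counter())[subtype] += 1
    return {cls: counts.most_common(1)[0][0] for cls, counts in votes.items()}

# Toy class value matrix (2 classes, 5 subsamples) with known subtypes (assumed labels).
H = np.array([[0.9, 0.8, 0.1, 0.7, 0.2],
              [0.1, 0.2, 0.9, 0.3, 0.8]])
subtypes = ["I", "I", "II", "II", "II"]
mapping = build_correspondence(H, subtypes)
```

In this toy case the fourth subsample has subtype II but falls in class 0, where subtype I is in the majority, so class 0 is still mapped to subtype I.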

In one embodiment, when the training sample includes a plurality of subsamples, the classification and integration module 404 is further configured to perform classification and integration on the gene expression profiles expressed in the form of a matrix according to a predetermined classification number by a non-negative matrix factorization manner, so as to obtain a weight matrix of the class and a class value matrix of the class; the category value matrix is a target gene expression profile in a matrix form;

the prediction model output module 408 is further configured to map the correspondence in a row of the weight matrix; performing singular value decomposition processing on the weight matrix; and performing pseudo-inverse processing on the weight matrix subjected to the singular value decomposition processing to obtain a genetic subtype prediction model in the form of a pseudo-inverse matrix.

In one embodiment, the prediction model output module 408 is further configured to obtain a gene expression profile matrix, the gene expression profile matrix being formed by the gene expression profiles of test samples; perform matrix multiplication processing on the gene expression profile matrix and the pseudo-inverse matrix to obtain a classification result matrix of the test sample, the classification result matrix comprising test class values corresponding to the classes; and determine the category value with the maximum value from the test category values of the classification result matrix.

In one embodiment, the prediction model output module 408 is further configured to determine the category corresponding to the category value with the largest value as the reference category; and taking the genetic subtype corresponding to the reference category as the genetic subtype of the test sample according to the corresponding relation.

In one embodiment, the classification integration module 404 is further configured to determine the number of classifications according to a minimum description length criterion.

For specific limitations regarding the apparatus for constructing the genetic subtype prediction model, reference may be made to the above limitations regarding the method for constructing the genetic subtype prediction model, which are not described again here. The modules in the apparatus for constructing the genetic subtype prediction model can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the respective method embodiment as described above.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but any combination of these technical features should be considered to be within the scope of the present specification as long as it involves no contradiction.

The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of constructing a genetic subtype prediction model, comprising:

2. The method of claim 1, wherein the step of selecting a target class from the classes in the target gene expression profile and constructing the corresponding relationship between the target class and the genetic subtype comprises:

3. The method of claim 2, wherein when the training sample comprises a plurality of subsamples, the step of constructing the correspondence between the target class and the genetic subtype comprises:

determining the sub-target categories of the sub-samples according to the category values of the sub-samples, and constructing the sub-corresponding relation between the sub-genetic subtypes and the sub-target categories of the sub-samples;

selecting the sub-corresponding relations for the same sub-target category from the sub-corresponding relations, and taking that sub-target category as the target category;

and constructing a corresponding relation between the target category and the selected sub-genetic subtype.

4. The method of claim 1, wherein when the training sample comprises a plurality of subsamples, the step of performing a classification integration of the gene expression profile according to a predetermined number of classifications comprises:

mapping the correspondence in a row of the weight matrix;

performing singular value decomposition processing on the weight matrix;

5. The method according to claim 4, wherein after the step of performing pseudo-inverse processing on the weight matrix subjected to the singular value decomposition processing to obtain the genetic subtype prediction model in the form of a pseudo-inverse matrix, the method further comprises:

6. The method of claim 5, further comprising, after the step of using the class corresponding to the determined class value as a reference class for the test sample:

7. The method of claim 1, further comprising:

and determining the number of the classifications according to the minimum description length criterion.

8. An apparatus for constructing a genetic subtype prediction model, comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.