CN111145831B

CN111145831B - Method, device and computer equipment for constructing genetic subtype prediction model

Info

Publication number: CN111145831B
Application number: CN201911415078.9A
Authority: CN
Inventors: 黄庆生; 梁会营; 钟嘉泳; 高欢; 李宽荣
Original assignee: Guangzhou Women and Childrens Medical Center
Current assignee: Guangzhou Women and Childrens Medical Center
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-11-24
Anticipated expiration: 2039-12-31
Also published as: CN111145831A

Abstract

The application relates to a method, a device, computer equipment, and a storage medium for constructing a genetic subtype prediction model. The method comprises: receiving a gene expression profile and a corresponding genetic subtype of a training sample; classifying and integrating the gene expression profile according to a predetermined classification number to obtain a target gene expression profile, the target gene expression profile comprising categories whose number corresponds to the classification number; selecting a target category from the categories of the target gene expression profile, and constructing a correspondence between the target category and the genetic subtype; and outputting a genetic subtype prediction model according to the correspondence. With this method, the genetic subtype prediction model is obtained from a flexible correspondence constructed between the gene expression profile and the genetic subtype.

Description

Method, device and computer equipment for constructing genetic subtype prediction model

Technical Field

The present application relates to the field of biotechnology, and in particular, to a method, an apparatus, a computer device, and a storage medium for constructing a genetic subtype prediction model.

Background

In the conventional technology, cytogenetic abnormalities can be determined from features such as gene expression level, immunophenotype, or fluorescence in situ hybridization, and the correspondence between these features and the genetic subtype is then obtained by analyzing a plurality of cases, so that a genetic subtype prediction model is constructed. However, such a correspondence simply pairs the features of the cases with their genetic subtypes. Its flexibility is poor, no representative correspondence can be extracted from the features, and the resulting genetic subtype prediction model has a narrow range of application. With the development of biotechnology, research into organisms can reach the molecular level of the genome and the transcriptome. It is therefore highly necessary to construct a more flexible and representative correspondence between the gene expression profile of the transcriptome and the genetic subtype.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, apparatus, computer device, and storage medium for constructing a genetic subtype prediction model that can construct a more flexible correspondence between gene expression profiles and genetic subtypes.

In a first aspect, there is provided a method of constructing a genetic subtype predictive model comprising:

receiving a gene expression profile and a corresponding genetic subtype of a training sample;

classifying and integrating the gene expression profile according to a predetermined classification number to obtain a target gene expression profile; the target gene expression profile comprises categories whose number corresponds to the classification number;

selecting a target class from the classes of the target gene expression profile, and constructing a corresponding relation between the target class and the genetic subtype;

and outputting a genetic subtype prediction model according to the corresponding relation.

In one embodiment, the step of selecting a target class from the classes in the target gene expression profile and constructing a correspondence between the target class and the genetic subtype includes:

obtaining class values of all classes in the target gene expression profile;

determining a category corresponding to the maximum category value in the category values as the target category;

and constructing the corresponding relation between the target category and the genetic subtype.

In one embodiment, when the training sample includes a plurality of subsamples, the step of constructing the correspondence between the target class and the genetic subtype includes:

respectively constructing sub-corresponding relations between sub-genetic subtypes of the sub-samples and sub-target categories according to the category values of the plurality of sub-samples; each sub-corresponding relation corresponds to a different sub-sample;

selecting, from the sub-corresponding relations, the sub-corresponding relations directed at the same sub-target category;

selecting the sub-genetic subtype with the largest number of occurrences from the sub-genetic subtypes of the selected sub-corresponding relations;

and constructing the corresponding relation between that same sub-target category and the selected sub-genetic subtype.

In one embodiment, when the training sample includes a plurality of sub-samples, the step of classifying and integrating the gene expression profile according to a predetermined number of classifications includes:

classifying and integrating the gene expression profile expressed in a matrix form according to the predetermined classification number by non-negative matrix factorization to obtain a weight matrix of the categories and a category value matrix of the categories; the category value matrix is the target gene expression profile in a matrix form;

The step of outputting a genetic subtype prediction model according to the correspondence relation comprises the following steps:

mapping the corresponding relation in the row of the weight matrix;

singular value decomposition processing is carried out on the weight matrix;

and performing pseudo-inverse processing on the weight matrix subjected to singular value decomposition processing to obtain a genetic subtype prediction model in a pseudo-inverse matrix form.

In one embodiment, after the step of performing pseudo-inverse processing on the weight matrix subjected to the singular value decomposition processing to obtain the genetic subtype prediction model in the form of a pseudo-inverse matrix, the method further includes:

obtaining a gene expression profile matrix; the gene expression profile matrix consists of gene expression profiles of test samples;

performing matrix multiplication processing on the gene expression profile matrix and the genetic subtype prediction model in a pseudo-inverse matrix form to obtain a classification result matrix of the test sample; the classification result matrix comprises test class values corresponding to the classes;

determining a class value with the largest numerical value from the test class values of the classification result matrix;

and taking the category corresponding to the determined category value as a reference category of the test sample.

In one embodiment, after the step of taking the category corresponding to the determined category value as the reference category of the test sample, the method further comprises:

and taking the genetic subtype corresponding to the reference category as the genetic subtype of the test sample according to the corresponding relation.

In one embodiment, the method further comprises: determining the classification number according to the minimum description length criterion.

In a second aspect, there is provided an apparatus for constructing a genetic subtype predictive model, comprising:

the information acquisition module is used for receiving the gene expression profile and the corresponding genetic subtype of the training sample;

the classifying and integrating module is used for classifying and integrating the gene expression profile according to the predetermined classification number to obtain a target gene expression profile; the target gene expression profile comprises categories whose number corresponds to the classification number;

the corresponding relation construction module is used for selecting a target class from the classes of the target gene expression profile and constructing a corresponding relation between the target class and the genetic subtype;

and the prediction model output module is used for outputting a genetic subtype prediction model according to the corresponding relation.

In a third aspect, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:

In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

According to the above method, device, computer equipment, and storage medium for constructing a genetic subtype prediction model, the gene expression profiles of the training samples are classified and integrated according to a preset classification number, so that each category can comprise the characteristics of a plurality of genes; that is, the genes corresponding to a category are not unique. When an unknown gene type appears, it can be classified and integrated together with known gene types, so that the categories comprise genes of the unknown type. A target category is then selected from the categories, and a correspondence between the target category and the genetic subtype is constructed. The genetic subtype can therefore correspond to a plurality of genes, including those of unknown type, and a flexible correspondence is constructed between the gene expression profile and the genetic subtype.

Drawings

FIG. 1 is an internal block diagram of a computer device in one embodiment;

FIG. 2 is a flow diagram of a method of constructing a genetic subtype predictive model in one embodiment;

FIG. 3 is a flow chart of a method of constructing a genetic subtype predictive model in another embodiment;

FIG. 4 is a block diagram of an apparatus for constructing a genetic subtype predictive model in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will appreciate, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

The method for constructing the genetic subtype prediction model can be applied to the computer device shown in FIG. 1. In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in FIG. 1. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through Wi-Fi, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a method of constructing a genetic subtype prediction model. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen; a key, a trackball, or a touchpad arranged on the housing of the computer device; or an external keyboard, touchpad, or mouse.

It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely a block diagram of the part of the structure relevant to the present solution and does not limit the computer device to which the present solution may be applied; a particular computer device may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components.

In one embodiment, as shown in FIG. 2, a method for constructing a genetic subtype prediction model is provided. The method is described here as applied to the computer device in FIG. 1. It is understood that the method can be applied to a server, to a terminal, or to a system comprising a terminal and a server, in which case it is implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:

step S202, receiving a gene expression profile and a corresponding genetic subtype of a training sample.

The training sample may include one case sample or a plurality of case samples; when it includes a plurality of case samples, the training sample may be considered to include a plurality of subsamples. A case sample may include the genetic information of a case whose genetic subtype is known, and the genetic information may be gene expression values; for example, if the genetic subtype of a case is i, the genetic information of that case may be taken as a case sample. For a given gene, the gene expression value may differ between case samples; the genetic subtype of each case sample has been determined in advance. The gene expression profile may include the gene expression values of one case sample or of a plurality of case samples; for one case sample in the gene expression profile, there may be a plurality of gene expression values, each corresponding to a different kind of gene.

Further, the gene expression profile may be represented in matrix form. Specifically, the gene expression values of the case samples are arranged in a specific format to obtain a gene expression profile matrix. The gene expression values of each case sample may be arranged vertically, that is, the number of case samples is taken as the number of columns and the number of gene types as the number of rows, each gene type corresponding to its gene expression values. When the gene types corresponding to the gene expression values are the same for every case sample, the number of gene types is used directly as the number of rows; for example, if n case samples each correspond to m genes, a gene expression profile matrix with m rows and n columns is constructed. When the gene types differ between case samples, the gene types are combined and the combined number of gene types is used as the number of rows. For example, with 3 case samples, where the gene expression values of one case sample correspond to m1 genes and those of the other two correspond to m2 genes, the number of gene types that exist only among the m2 (or m1) genes is determined and added to m1 (or m2) to obtain m3 gene types, and a gene expression profile matrix with m3 rows and n columns (here n = 3) is constructed. It will be appreciated that the gene expression values of each case sample may instead be arranged horizontally; taking the above n case samples with m genes as an example, a gene expression profile matrix with n rows and m columns may be constructed.
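The matrix construction described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `build_expression_matrix`, the gene names, and the choice to fill a gene that is missing from a sample with 0.0 are all assumptions (the text does not state how missing values are handled).

```python
def build_expression_matrix(samples, fill_value=0.0):
    """Arrange per-sample gene expression values into a genes x samples matrix.

    samples: list of dicts mapping gene name -> expression value, one dict
    per case sample. fill_value is an assumed placeholder for genes that a
    sample does not report.
    """
    # The union of gene types across all samples gives the row count (m3 in the text).
    genes = sorted({gene for sample in samples for gene in sample})
    # One row per gene type, one column per case sample.
    matrix = [[sample.get(gene, fill_value) for sample in samples] for gene in genes]
    return genes, matrix


# Three hypothetical case samples whose gene types only partly overlap.
samples = [
    {"GENE_A": 5.0, "GENE_B": 1.2},
    {"GENE_B": 0.7, "GENE_C": 3.1},
    {"GENE_B": 2.4, "GENE_C": 0.3},
]
genes, V = build_expression_matrix(samples)
```

Here the first sample reports 2 gene types and the other two report 2 gene types, one of which is new, so the combined matrix has 3 rows and 3 columns.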

In this step, the computer device may obtain the gene expression profile and the genetic subtype of the training sample by means of user input or by means of a search from an on-line database.

Step S204, classifying and integrating the gene expression profile according to the predetermined classification number to obtain a target gene expression profile; the target gene expression profile includes a category corresponding to the number of categories.

The classification number can be understood as the number of the classes, and the classification number is used for determining the number of the classes obtained by classifying and integrating the gene expression profiles, that is, determining how many classes the gene expression profiles are classified and integrated.

In this step, after the computer device obtains the predetermined classification number, it classifies and integrates the gene expression profile according to the classification number to obtain the target gene expression profile, which includes categories whose number corresponds to the classification number. Classifying and integrating the gene expression profile according to the classification number may mean classifying and integrating the gene expression values in the gene expression profile. It can be understood that, in this process, the gene types corresponding to the gene expression values are classified and integrated as well, so that a target gene expression profile comprising the categories is obtained. For example, if the gene expression values in the gene expression profile correspond to m gene types and the classification number is 3, classifying and integrating the gene expression values according to the classification number is equivalent to classifying and integrating the m genes into 3 categories, yielding a target gene expression profile comprising those 3 categories. Further, each category may be denoted by an identifier, which may be a Roman numeral (e.g., I, II, and III), an English letter (e.g., A, B, and C), or the like. Additionally, the classification number may be determined by the minimum description length criterion. In addition, the number of case samples included in the gene expression profile may be the same as the number of case samples included in the target gene expression profile.

Further, when the gene expression profile is represented in matrix form, the process by which the computer device classifies and integrates the gene expression profile according to the classification number may be understood as a compression transformation of the matrix: the number of rows (or columns) of the gene expression profile matrix is compressed to the number of rows (or columns) corresponding to the classification number, giving a target gene expression profile matrix. For example, if the gene expression profile matrix is constructed from n cases and m genes, with m rows and n columns, compressing it according to the classification number r gives a target gene expression profile matrix with r rows and n columns. When the gene expression profile matrix has n rows and m columns, the target gene expression profile matrix has n rows and r columns. Additionally, when the training sample includes a plurality of cases (a plurality of subsamples), the gene expression profile matrix is constructed from the plurality of cases, and the compression transformation may be achieved by non-negative matrix factorization (NMF). Additionally, when the gene expression profile is expressed in matrix form, the classification number can be determined by visualizing a consensus matrix; further, it can be determined by visualizing the consensus matrix obtained from a plurality of NMF decompositions.
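The text names the consensus matrix but does not spell out how it is built. A common construction, assumed here rather than taken from the application, records for each pair of samples the fraction of NMF runs in which both samples were assigned to the same category. The sketch below takes the per-run assignments as given; the function name `consensus_matrix` and the example assignments are hypothetical.

```python
import numpy as np


def consensus_matrix(assignments):
    """Fraction of runs in which each pair of samples shares a category.

    assignments: array-like of shape (n_runs, n_samples); entry [k, j] is the
    category index assigned to sample j in run k (for example, the row of the
    largest class value in column j of that run's class value matrix).
    """
    assignments = np.asarray(assignments)
    n_runs, n_samples = assignments.shape
    consensus = np.zeros((n_samples, n_samples))
    for labels in assignments:
        # Connectivity matrix of one run: 1 where two samples share a category.
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs


# Three hypothetical runs over four samples. Category indices are arbitrary
# per run, so only co-membership matters.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
C = consensus_matrix(runs)
```

Entries close to 0 or 1 indicate stable co-assignment. Under this assumed construction, a classification number whose consensus matrix shows clean blocks when visualized would be preferred.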

Step S206, selecting a target class from the classes of the target gene expression profile, and constructing the corresponding relation between the target class and the genetic subtype.

The target category may be a category selected from all the categories and is used for constructing a correspondence with the genetic subtype. The target category can be selected from the categories randomly or according to the category values.

In this step, after the computer device obtains the target gene expression profile through the classification and integration processing, it selects a target category from the categories included in the target gene expression profile and constructs a correspondence between the target category and the acquired genetic subtype. For example, when the target gene expression profile contains 3 categories, one of them, category I, is selected as the target category, and the correspondence I-i between target category I and the acquired genetic subtype i is constructed.

After the gene expression values in the gene expression profile are classified and integrated, a numerical value corresponding to each category is obtained. This value can be understood as a category value, and it can be used to represent how closely the corresponding category is associated with the genetic subtype. In this step, the target category may be determined according to the category values. Specifically, the computer device obtains the category value of each category in the target gene expression profile, selects the largest category value, determines the category corresponding to it, and uses that category as the target category, thereby constructing the correspondence between the target category and the genetic subtype.
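Selecting the category with the largest category value can be sketched as below. The layout is an assumption for illustration: the class value matrix `H` has one row per category and one column per case sample, and the numbers and labels are made up.

```python
import numpy as np

# Hypothetical class value matrix: 3 categories (rows) x 3 case samples (columns).
H = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.7, 0.1],
    [0.1, 0.2, 0.8],
])
class_labels = ["I", "II", "III"]

# For each sample (column), take the row index of the largest class value.
target_idx = H.argmax(axis=0)
sub_target_categories = [class_labels[i] for i in target_idx]
```

Each entry of `sub_target_categories` is the target category of one sample; pairing it with that sample's known genetic subtype gives the correspondence described in the text.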

In addition, in this step, when the training sample includes a plurality of case samples, that is, a plurality of subsamples, the genetic subtype, the selected category, and the correspondence may differ from one case sample to another. To highlight this case, the genetic subtype, the selected category, and the correspondence of each case sample are referred to as the sub-genetic subtype, the sub-target category, and the sub-correspondence, respectively. The sub-correspondence of each subsample may be constructed from its category values as described above. Taking one subsample as an example, the computer device obtains the category values of the subsample, selects the largest category value, and uses the category corresponding to it as the sub-target category, thereby constructing the sub-correspondence between the sub-genetic subtype and the sub-target category of that subsample. The sub-correspondences of the other subsamples are constructed in the same way, which is not repeated here.

After the computer device has constructed the sub-correspondences of the plurality of subsamples, it selects from them the sub-correspondences that have the same sub-target category, that is, the sub-correspondences directed at the same sub-target category. That shared sub-target category is taken as the target category, which is equivalent to using it as the category for which the correspondence is constructed. At this point, the selected sub-correspondences share one sub-target category and may include one or more sub-genetic subtypes. The computer device selects, from the one or more sub-genetic subtypes, the sub-genetic subtype that occurs most often, and constructs a correspondence between the shared sub-target category and that sub-genetic subtype.

For example, as listed in Table 1, each of 10 subsamples has its own sub-correspondence. To determine the correspondence of target category I, the sub-correspondences of subsamples 1, 2, 4, 5, 8, 9, and 10 are selected, and the numbers of occurrences of the sub-genetic subtypes they contain are counted and ranked. In this case, sub-genetic subtype i occurs most often, so the correspondence is constructed between target category I and sub-genetic subtype i.

TABLE 1

Case sample	Sub-correspondence	Case sample	Sub-correspondence
Subsample 1	I-i	Subsample 6	III-i
Subsample 2	I-i	Subsample 7	III-i
Subsample 3	II-i	Subsample 8	I-i
Subsample 4	I-ii	Subsample 9	I-i
Subsample 5	I-iii	Subsample 10	I-i

It will be appreciated that when two or more sub-genetic subtypes tie for the largest number of occurrences, one of them may be randomly selected and a correspondence constructed between it and the target category. Alternatively, each of the tied sub-genetic subtypes may be associated with the target category.
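The counting step can be sketched with the sub-correspondences of Table 1. The function name `build_correspondence` is hypothetical, and returning every tied sub-genetic subtype as a list is one of the two tie-handling options the text mentions; a caller could instead pick one entry at random.

```python
from collections import Counter, defaultdict

# Sub-correspondences from Table 1: subsample -> (sub-target category, sub-genetic subtype).
sub_correspondences = {
    1: ("I", "i"), 2: ("I", "i"), 3: ("II", "i"), 4: ("I", "ii"), 5: ("I", "iii"),
    6: ("III", "i"), 7: ("III", "i"), 8: ("I", "i"), 9: ("I", "i"), 10: ("I", "i"),
}


def build_correspondence(sub_correspondences):
    """Map each category to its most frequent sub-genetic subtype(s)."""
    counts = defaultdict(Counter)
    for category, subtype in sub_correspondences.values():
        counts[category][subtype] += 1
    mapping = {}
    for category, counter in counts.items():
        top = max(counter.values())
        # Keep all subtypes that tie for the largest number of occurrences.
        mapping[category] = sorted(s for s, c in counter.items() if c == top)
    return mapping, counts


correspondence, counts = build_correspondence(sub_correspondences)
```

For category I, subtype i occurs 5 times against 1 each for ii and iii, so category I is paired with subtype i, matching the worked example.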

And step S208, outputting a genetic subtype prediction model according to the corresponding relation.

The genetic subtype prediction model may be used in fields other than disease diagnosis. For example, a test sample may be processed with the genetic subtype prediction model and the resulting genetic subtype compared with a genetic subtype determined by other means, in order to verify the predictive performance of those other means. As another example, test samples may be processed with the genetic subtype prediction model and classified according to the resulting genetic subtypes.

In the step, after the corresponding relation between the target class and the genetic subtype is obtained by the computer equipment, mapping the corresponding relation into a genetic subtype prediction model, and outputting the genetic subtype prediction model obtained by mapping processing; when there are a plurality of correspondence relations, the computer device may further output the corresponding genetic subtype prediction models, respectively, according to the plurality of correspondence relations.

In the above method for constructing the genetic subtype prediction model, the gene expression profile of the training sample is classified and integrated according to the preset classification number to obtain the categories corresponding to the classification number, so that each category can comprise the characteristics of a plurality of genes; that is, the genes corresponding to a category are not unique. When an unknown gene type appears, it can be classified and integrated together with known gene types, so that the categories comprise genes of the unknown type. A target category is then selected from the categories, and the correspondence between the target category and the genetic subtype is constructed. The genetic subtype can therefore correspond to a plurality of genes, including those of unknown type, which improves the flexibility of the correspondence between the genetic subtype and the gene expression profile.

In one embodiment, when the gene expression profile is represented in matrix form, it may be classified and integrated by non-negative matrix factorization according to the predetermined classification number to obtain a weight matrix of the categories and a class value matrix of the categories, where the class value matrix is the target gene expression profile in matrix form. For example, suppose the gene expression profile consists of m genes for n cases and is represented by an $m \times n$ matrix $V$, and the predetermined classification number is $r$. The NMF algorithm approximates $V$ by the product of two non-negative matrices; that is, classifying and integrating $V$ with the NMF algorithm gives $V \approx WH$, where $W$ is an $m \times r$ non-negative matrix representing the weight matrix of the categories and $H$ is an $r \times n$ non-negative matrix representing the class value matrix of the categories. Further, the classification number $r$ may be determined using the minimum description length criterion. Further, to reduce randomness and improve the repeatability of the categories obtained by classification and integration, the NMF method can be run multiple times to obtain multiple pairs $(W, H)$. The approximation error between the matrix $V$ and each product $WH$ is measured with the Euclidean distance or the Kullback-Leibler distance, the pairs whose approximation error reaches a preset value are selected, and the averages of the selected $W$ and $H$ are calculated. Selecting the pairs whose approximation error reaches the preset value may be done by ranking: the pairs up to a preset rank are selected according to the magnitude of the approximation error. For example, the NMF method may be run 60 times to obtain 60 pairs, which are sorted by approximation error from small to large, and the first 20 pairs are selected.
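The repeated-NMF procedure can be sketched as follows. Several points are assumptions rather than details from the text: the factorization uses the standard multiplicative update rules for the Euclidean objective, the iteration count and random initialization are arbitrary, and the function names `nmf` and `averaged_nmf` are hypothetical. The sketch averages the selected factors directly, as the text describes; it does not attempt to align the order of categories between runs, which can differ from one random start to another.

```python
import numpy as np


def nmf(V, r, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative m x n matrix V by W (m x r) times H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def averaged_nmf(V, r, n_runs=60, n_keep=20):
    """Run NMF n_runs times and average the n_keep runs with the smallest error."""
    runs = []
    for seed in range(n_runs):
        W, H = nmf(V, r, seed=seed)
        err = np.linalg.norm(V - W @ H)  # Euclidean (Frobenius) approximation error
        runs.append((err, W, H))
    runs.sort(key=lambda run: run[0])
    best = runs[:n_keep]
    W_avg = np.mean([W for _, W, _ in best], axis=0)
    H_avg = np.mean([H for _, _, H in best], axis=0)
    return W_avg, H_avg


# Made-up non-negative expression matrix: 8 genes x 6 case samples, r = 3 categories.
V = np.random.default_rng(42).random((8, 6))
W_avg, H_avg = averaged_nmf(V, r=3)
```

The defaults mirror the 60 runs and 20 selected pairs given as an example in the text.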

The weight matrix $W$ and the class value matrix $H$ are obtained by classifying and integrating the gene expression profile according to the classification number by matrix decomposition, so the row elements in the rows of the weight matrix and the column elements in the columns of the class value matrix can be considered to be in one-to-one correspondence with the categories. Thus, after the weight matrix $W$ is obtained, the row elements in the rows of the weight matrix are associated with the genetic subtypes according to the correspondence between the categories and the genetic subtypes; that is, the correspondence is mapped into the rows of the weight matrix. Singular value decomposition is then performed on the weight matrix, i.e., $W = S D T^{T}$, where $S$ and $T$ are orthogonal matrices, $T^{T}$ denotes the transpose of $T$, and $D$ is a diagonal matrix whose diagonal entries are the singular values. Pseudo-inverse processing is then performed on the weight matrix to obtain the pseudo-inverse matrix $W^{+}$, which is taken as the genetic subtype prediction model, that is, a model in the form of a pseudo-inverse matrix. The pseudo-inverse processing may be Moore-Penrose pseudo-inverse processing, with $W^{+} = T D^{-1} S^{T}$.
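The pseudo-inverse via singular value decomposition can be sketched with NumPy as below. The function name `pseudo_inverse`, the tolerance used to skip zero singular values, and the example weight matrix are assumptions for illustration; `numpy.linalg.svd` and `numpy.linalg.pinv` are existing NumPy routines, and the latter is used only as a cross-check.

```python
import numpy as np


def pseudo_inverse(W, tol=1e-12):
    """Moore-Penrose pseudo-inverse of W computed from its SVD, W = S D T^T."""
    S, d, Tt = np.linalg.svd(W, full_matrices=False)  # d holds the singular values
    d_inv = np.zeros_like(d)
    nonzero = d > tol
    d_inv[nonzero] = 1.0 / d[nonzero]  # invert only the non-zero singular values
    return Tt.T @ np.diag(d_inv) @ S.T  # W+ = T D^-1 S^T


# Hypothetical weight matrix: 3 genes x 2 categories.
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
W_pinv = pseudo_inverse(W)
```

For a weight matrix with full column rank, as in this example, multiplying `W_pinv` by `W` returns the identity matrix, which is what allows the class values to be recovered from an expression profile.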

In one embodiment, to determine the maximum class value of a test sample using the genetic subtype prediction model in pseudo-inverse matrix form, a gene expression profile matrix $V'$ of the test sample may be obtained, the matrix being formed from the gene expression profile of the test sample. The pseudo-inverse matrix is then multiplied by the gene expression profile matrix to obtain the classification result matrix $H'$ of the test sample, where $H' = W^{+} V'$. The classification result matrix contains a test class value for each category; the largest of the test class values is selected, the category corresponding to it is determined, and that category is taken as the reference category of the test sample. For example, if the class value $h'_{i,j}$ is the maximum in column j of $H'$, then $h'_{i,j}$ is selected, the category corresponding to $h'_{i,j}$ is determined to be I, and category I is taken as the reference category of the test sample. Here j may be 1, meaning that $H'$ has only one column, that is, there is only one test case sample; j may also be greater than 1, meaning that $H'$ has multiple columns, that is, there are multiple test case samples. When j is greater than 1, the maximum class value of each column may be selected, the category corresponding to each column's maximum determined, and the determined category taken as the reference category of the corresponding test sample.
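The selection of the reference category can be sketched as a matrix product followed by a column-wise maximum. The function name `reference_categories` and the small matrices below are illustrative assumptions only; categories are identified by their zero-based row index in the classification result matrix.

```python
import numpy as np

def reference_categories(W_pinv, V_test):
    """Return H' = W^+ V' and, for each test case (column), the row index
    of the largest test class value, i.e. its reference category."""
    H_test = W_pinv @ V_test                   # r x k classification result matrix
    return H_test, np.argmax(H_test, axis=0)

# Illustrative data: 3 genes, 2 categories, 2 test cases.
W_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])
V_demo = W_demo @ np.array([[3.0, 1.0],
                            [1.0, 4.0]])
H_demo, ref_demo = reference_categories(np.linalg.pinv(W_demo), V_demo)
```

When there is a single test case, `V_test` is simply a matrix with one column and the returned index array has one entry.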

The reason for converting the weight matrix $W$ into the pseudo-inverse matrix $W^{+}$ and then multiplying the gene expression profile matrix $V'$ by $W^{+}$ can be understood as follows: the operation amounts to dividing the gene expression profile matrix $V'$ of the test sample by the weight matrix $W$. Since $V' \approx W H'$ and both $V'$ and $W$ are known, $H'$ could, from the viewpoint of elementary arithmetic, be obtained by division. Because division is not defined for matrices, however, the weight matrix $W$ must first be converted into the pseudo-inverse matrix $W^{+}$, and $V'$ is then multiplied by $W^{+}$.
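The division reading above can be made explicit with a short derivation. It assumes, which the text does not state, that $W$ has full column rank r and that $W = S D T^{T}$ is a thin singular value decomposition, so that $S^{T} S = I_r$, $T$ is an $r \times r$ orthogonal matrix, and $D$ is an invertible $r \times r$ diagonal matrix:

```latex
\begin{aligned}
W^{+} W &= T D^{-1} S^{T} \, S D T^{T} = T D^{-1} D T^{T} = T T^{T} = I_r, \\
W^{+} V' &\approx W^{+} W H' = H'.
\end{aligned}
```

Under the same assumption, $H' = W^{+} V'$ is also the least-squares solution of $V' \approx W H'$. Unlike the matrix $H$ produced by NMF, its entries are not guaranteed to be non-negative.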

In another embodiment, after the reference category corresponding to the maximum class value is determined, the computer device takes the genetic subtype corresponding to the reference category, according to the correspondence, as the genetic subtype of the test sample. For example, if the reference category corresponding to the maximum class value is I, then according to the correspondence I-I the genetic subtype corresponding to reference category I is determined to be I, and genetic subtype I is taken as the genetic subtype of the test sample. When the test sample includes multiple test case samples, the genetic subtype of each test case sample may be determined by the method of this embodiment, which is not repeated here.

In conventional genetic typing methods, when a test sample lacks known genetic material, it is difficult to determine its genetic subtype by risk stratification based on genetic subtypes. That is, conventional genetic subtypes correspond only to known genetic material and have no correspondence with unknown genetic material, so the correspondence between genetic subtypes and genetic material is inflexible. For this reason, the present application provides a method for constructing a genetic subtype prediction model that improves the flexibility of the correspondence between genetic subtypes and genetic material. For a better understanding of the above method, an application example of the method of the present application for constructing a genetic subtype prediction model is described below with reference to FIG. 3.

In this embodiment, step S302: the classification number r is determined using the minimum description length criterion, which is equivalent to determining the optimal value of the classification number; r defines the number of categories, which corresponds to the number of defined genetic subtypes;

Step S304: typing by non-negative matrix factorization (NMF) of the gene expression profile. The gene expression profile of the training sample consists of m genes for n cases. The NMF algorithm approximates the gene expression profile by the product of two non-negative matrices: $V \approx WH$, where $V$ is the $m \times n$ gene expression profile matrix, $W$ is an $m \times r$ non-negative matrix, and $H$ is an $r \times n$ non-negative matrix. Further, since no NMF algorithm currently yields the best approximation directly during decomposition, 60 independent NMF runs may be performed to reduce randomness and improve the repeatability of the grouping; the $W$ and $H$ matrices of the 20 runs with the smallest approximation errors are extracted, and the averages of $W$ and $H$ are calculated. The $H$ matrix encodes the subtype of each case: if the genetic subtype of case sample j is i and the class value $h_{i,j}$ is the maximum in column j, then a correspondence may be constructed between the category corresponding to $h_{i,j}$ and genetic subtype i. The correspondence between categories and genetic subtypes obtained by NMF is determined by the genetic typing of the majority of case samples in each category; for example, among samples 1, 2, 4, 5, 8, 9 and 10 in Table 1 the dominant genetic typing is I, so the correspondence between category I and genetic typing I can be constructed. The training sample may include 207 cases, and the classification number determined using the minimum description length criterion may be 4.
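The majority rule for building the correspondence between categories and genetic subtypes can be sketched as follows. The function name `build_correspondence` and the example data are hypothetical; categories are identified by their zero-based row index in H, and a tie is resolved in favour of the subtype encountered first, which is an assumption because the text does not say how ties are handled.

```python
from collections import Counter

import numpy as np

def build_correspondence(H, subtypes):
    """Map each category (row of H) to the most frequent known subtype among
    the training cases whose largest class value lies in that category."""
    target = np.argmax(H, axis=0)              # target category of each case (column)
    votes = {}
    for category, subtype in zip(target, subtypes):
        votes.setdefault(int(category), Counter())[subtype] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

# Illustrative data: 2 categories, 4 training cases with known subtypes.
H_train = np.array([[0.9, 0.8, 0.1, 0.7],
                    [0.1, 0.2, 0.9, 0.3]])
known_subtypes = ["I", "I", "II", "II"]
correspondence = build_correspondence(H_train, known_subtypes)
```

In this example the last case falls into the first category although its known subtype is II; the majority of cases in that category have subtype I, so the first category is mapped to subtype I.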

Step S306: after the weight matrix $W$ is obtained, the pseudo-inverse of $W$ is computed. Specifically, singular value decomposition is performed on $W$: $W = S D T^{T}$, where $S$ and $T$ are orthogonal matrices, $T^{T}$ denotes the transpose of $T$, and $D$ is a diagonal matrix whose diagonal entries are the singular values; the Moore-Penrose pseudo-inverse of the weight matrix $W$ is then obtained as $W^{+} = T D^{-1} S^{T}$;

Step S308: the gene expression profile matrix $V'$ of the test sample is processed using the pseudo-inverse $W^{+}$ of the weight matrix $W$ to obtain the typing result matrix $H' = W^{+} V'$. The $H'$ matrix encodes the subtype of the test sample and satisfies the relationship $V' \approx W H'$.

Further, for test case sample j, the maximum class value in the column vector $h'_{j}$ (that is, in column j) is found to be $h'_{i,j}$; it can then be determined that $h'_{i,j}$ corresponds to category I, and the genetic subtype of the test case sample is determined to be I according to the correspondence I-I.
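Taken together, the prediction steps of this example can be sketched in a few lines. The function name `predict_subtypes`, the dictionary form of the correspondence and the small matrices are illustrative assumptions; `np.linalg.pinv` is used here as a stand-in for the pseudo-inverse obtained from the singular value decomposition.

```python
import numpy as np

def predict_subtypes(W, V_test, correspondence):
    """Predict a genetic subtype for every column of V_test.

    W: m x r weight matrix from training.
    correspondence: dict mapping category index (row of H') to subtype label.
    """
    H_test = np.linalg.pinv(W) @ V_test        # typing result matrix H'
    reference = np.argmax(H_test, axis=0)      # largest class value per test case
    return [correspondence[int(c)] for c in reference]

# Illustrative data: 3 genes, 2 categories, 2 test cases.
W_example = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0]])
V_example = W_example @ np.array([[3.0, 1.0],
                                  [1.0, 4.0]])
predicted = predict_subtypes(W_example, V_example, {0: "I", 1: "II"})
```

A category that never received a subtype during training would raise a `KeyError` in this sketch; the text does not describe how such a case is treated.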

In the method of any of the above embodiments, the genetic subtype prediction model is constructed from case samples with known genetic subtypes, which is equivalent to performing data analysis on the genetic subtypes and gene expression profiles of the case samples to obtain the model. As described in the embodiments, the resulting genetic subtype prediction model can be used in non-disease-diagnosis fields such as verifying the accuracy of other prediction methods and classifying test samples. In addition, when the genetic subtype prediction model is used to predict a test sample of unknown genetic subtype, the gene expression profile of the test sample must first be acquired and processed by the model to obtain a reference category, that is, the category with the maximum class value, after which the genetic subtype corresponding to that category is determined according to the correspondence. The terms "case" and "case sample" in the present application can be understood as the same concept.

In the above embodiment, the gene expression profile of the training sample is classified and integrated according to the preset classification number to obtain that number of categories, so each category can include the characteristics of multiple genes; that is, the genes corresponding to a category are not unique. When a gene of an unknown type appears, it can be classified and integrated together with genes of known types, so that the categories include the gene of the unknown type. A target category is then selected from the categories and the correspondence between the target category and the genetic subtype is constructed, so that a genetic subtype can correspond to multiple genes, including genes of unknown types, which improves the flexibility of the correspondence between genetic subtypes and genes. It can be seen that the above method does not require the genetic material changes within a genetic subtype to be completely consistent; instead, when the genetic material is unknown, a sample can be assigned to a neighboring genetic subtype according to the similarity of its gene expression profile, thereby guiding risk stratification.

It should be understood that, although the steps in the flowcharts of FIGS. 2-3 are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these steps or stages necessarily performed in sequence, and they may be performed in turn or alternately with other steps or with at least some of the steps or stages of other steps.

It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application.

Based on the same idea as the method for constructing a genetic subtype prediction model in the above embodiments, the present application also provides an apparatus for constructing a genetic subtype prediction model, which can be used to perform the above method. For ease of illustration, the schematic structural diagram of the apparatus embodiment shows only the parts related to the embodiments of the present application; those skilled in the art will appreciate that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.

In one embodiment, as shown in FIG. 4, an apparatus 400 for constructing a genetic subtype predictive model is provided, comprising: an information acquisition module 402, a classification integration module 404, a correspondence construction module 406, and a predictive model output module 408, wherein:

an information acquisition module 402 for receiving the gene expression profile and the corresponding genetic subtype of the training sample;

the classification integration module 404 is configured to perform classification integration on the gene expression profile according to a predetermined classification number to obtain a target gene expression profile; the target gene expression profile comprises categories corresponding to the number of categories;

the correspondence construction module 406 is configured to select a target class from the classes of the target gene expression profile, and construct a correspondence between the target class and the genetic subtype;

and a prediction model output module 408, configured to output a genetic subtype prediction model according to the correspondence.

In one embodiment, the correspondence construction module 406 is further configured to obtain a class value of each class in the target gene expression profile; determining a category corresponding to the maximum category value in the category values as a target category; and constructing the corresponding relation between the target category and the genetic subtype.

In one embodiment, when the training sample includes a plurality of sub-samples, the correspondence construction module 406 is further configured to: construct, according to the class values of the plurality of sub-samples, sub-correspondences between the sub-genetic subtypes of the sub-samples and sub-target categories, each sub-correspondence corresponding to a different sub-sample; select, from the sub-correspondences, those directed to the same sub-target category; select, from the sub-genetic subtypes of the selected sub-correspondences, the sub-genetic subtype that occurs most often; and construct the correspondence between that same sub-target category and the selected sub-genetic subtype.

In one embodiment, when the training sample includes a plurality of sub-samples, the classification integration module 404 is further configured to perform classification integration on the gene expression profile represented in the matrix form according to a predetermined number of classifications by using a non-negative matrix factorization manner, to obtain a weight matrix of a category and a category value matrix of the category; the class value matrix is a target gene expression profile in a matrix form;

the prediction model output module 408 is further configured to map the correspondence relationship in a row of the weight matrix; singular value decomposition is carried out on the weight matrix; and performing pseudo-inverse processing on the weight matrix subjected to singular value decomposition processing to obtain a genetic subtype prediction model in a pseudo-inverse matrix form.

In one embodiment, the prediction model output module 408 is further configured to obtain a gene expression profile matrix; the gene expression profile matrix consists of gene expression profiles of the test samples; performing matrix multiplication on the gene expression profile matrix and the pseudo-inverse matrix to obtain a classification result matrix of the test sample; the classification result matrix comprises test class values corresponding to the classes; and determining the class value with the largest numerical value from the test class values of the classification result matrix.

In one embodiment, the prediction model output module 408 is further configured to determine a class corresponding to the class value with the largest numerical value as the reference class; and taking the genetic subtype corresponding to the reference category as the genetic subtype of the test sample according to the corresponding relation.

In one embodiment, the classification integration module 404 is further configured to determine the number of classifications according to a minimum description length criterion.

For specific limitations on the apparatus for constructing the genetic subtype prediction model, reference may be made to the limitations of the method for constructing the genetic subtype prediction model above, which are not repeated here. Each of the modules in the apparatus for constructing a genetic subtype prediction model may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the processor executes the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the respective method embodiments described above.

Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.

The above embodiments represent only a few implementations of the present application. Their description is relatively specific and detailed, but it is not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims

1. A method of constructing a genetic subtype predictive model comprising:

receiving a gene expression profile and a corresponding genetic subtype of a training sample; the training sample includes a plurality of subsamples;

classifying and integrating, by non-negative matrix factorization and according to a predetermined classification number, the gene expression profile expressed in matrix form to obtain a weight matrix of the categories and a class value matrix of the categories; the class value matrix is a target gene expression profile in matrix form; the target gene expression profile comprises categories corresponding to the classification number;

mapping the corresponding relation in the row of the weight matrix;

singular value decomposition processing is carried out on the weight matrix;

performing pseudo-inverse treatment on the weight matrix subjected to singular value decomposition treatment to obtain a genetic subtype prediction model in a pseudo-inverse matrix form;

2. The method of claim 1, wherein the step of selecting a target class from classes in the target gene expression profile, and constructing a correspondence between the target class and the genetic subtype, comprises:

Obtaining class values of all classes in the target gene expression profile;

3. The method of claim 2, wherein when the training sample comprises a plurality of subsamples, the step of constructing the correspondence between the target class and the genetic subtype comprises:

determining sub-target categories of the sub-samples according to the category values of the plurality of sub-samples, and constructing sub-corresponding relations between sub-genetic subtypes and the sub-target categories of the sub-samples;

selecting, from the sub-corresponding relations, the sub-corresponding relations directed to the same sub-target category, and taking that same sub-target category as the target category;

and constructing the corresponding relation between the target category and the selected sub-genetic subtype.

4. The method according to claim 1, further comprising, after the step of taking the class corresponding to the determined class value as the reference class of the test sample:

5. The method as recited in claim 1, further comprising:

and determining the number of the classifications according to the minimum description length criterion.

6. An apparatus for constructing a genetic subtype predictive model, comprising:

the information acquisition module is used for receiving the gene expression profile and the corresponding genetic subtype of the training sample; the training sample includes a plurality of subsamples;

the classification integration module is used for classifying and integrating, by non-negative matrix factorization and according to a predetermined classification number, the gene expression profile expressed in matrix form to obtain a weight matrix of the categories and a class value matrix of the categories; the class value matrix is a target gene expression profile in matrix form; the target gene expression profile comprises categories corresponding to the classification number;

the prediction model output module is used for mapping the corresponding relation in the row of the weight matrix; singular value decomposition processing is carried out on the weight matrix; performing pseudo-inverse treatment on the weight matrix subjected to singular value decomposition treatment to obtain a genetic subtype prediction model in a pseudo-inverse matrix form;

the prediction model output module is also used for acquiring a gene expression profile matrix; the gene expression profile matrix consists of gene expression profiles of test samples; performing matrix multiplication processing on the gene expression profile matrix and the genetic subtype prediction model in pseudo-inverse matrix form to obtain a classification result matrix of the test sample; the classification result matrix comprises a test class value corresponding to each category; determining the class value with the largest numerical value from the test class values of the classification result matrix; and taking the category corresponding to the determined class value as a reference category of the test sample.

7. The apparatus of claim 6, wherein the correspondence construction module is further configured to: obtaining class values of all classes in the target gene expression profile; determining a category corresponding to the maximum category value in the category values as the target category; and constructing the corresponding relation between the target category and the genetic subtype.

8. The apparatus of claim 7, wherein when the training sample comprises a plurality of subsamples, the correspondence construction module is further configured to: determine sub-target categories of the sub-samples according to the category values of the plurality of sub-samples, and construct sub-corresponding relations between the sub-genetic subtypes of the sub-samples and the sub-target categories; select, from the sub-corresponding relations, the sub-corresponding relations directed to the same sub-target category, and take that same sub-target category as the target category; select the sub-genetic subtype with the largest number of occurrences from the sub-genetic subtypes of the selected sub-corresponding relations; and construct the corresponding relation between the target category and the selected sub-genetic subtype.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.