CN115641956A

CN115641956A - Phenotype analysis method for disease prediction

Info

Publication number: CN115641956A
Application number: CN202211320189.3A
Authority: CN
Inventors: 王飞; 徐勇军
Original assignee: Zhongke Xiamen Data Intelligence Research Institute
Current assignee: Zhongke Xiamen Data Intelligence Research Institute
Priority date: 2022-10-26
Filing date: 2022-10-26
Publication date: 2023-01-24

Abstract

The invention relates to the technical field of disease prediction and discloses a phenotype analysis method for disease prediction. The method comprises: constructing databases; constructing a 'rare disease-common disease' common database by measuring the phenotype similarity between rare diseases and common diseases while referring to the difference between their gene data; processing patient data to find the optimal combination of phenotypic characteristics; calculating disease matching scores; calculating the cross-entropy loss of the phenotypic feature matching model for disease prediction; using the weighted sum of the classification loss function of the 'rare disease-common disease' common database and the cross-entropy loss of the phenotypic feature matching model as the total loss function of the disease prediction model; and outputting a prediction result. Effective phenotypic characteristics are extracted with a graph convolutional network for comparing rare disease data with common disease data, and disease matching differences are extracted from the commonalities between rare diseases and common diseases to predict diseases. This addresses the problem that such diseases are easily confused, and avoids misdiagnosis and missed diagnosis to a certain extent.

Description

Phenotype analysis method for disease prediction

Technical Field

The invention relates to the technical field of disease prediction, in particular to a phenotype analysis method for disease prediction.

Background

At present, disease prediction mostly relies on deep learning, with computer technology assisting diagnosis on the basis of the prediction results. To improve the accuracy of disease prediction and diagnosis, complex multi-modal medical data must be fully exploited to extract the useful information hidden in it. Such data comprises medical imaging data and the corresponding non-imaging phenotypic characteristics, such as a patient's age, height and body functions, and is difficult to process with traditional deep learning methods. Not every phenotypic characteristic contributes to disease prediction; screening out the phenotypic characteristics that negatively affect the prediction result can effectively improve the accuracy of a disease prediction model. This is particularly significant for predicting rare diseases, because rare diseases and common diseases have a great deal in common.

For general diagnosis, an adaptive multi-layer aggregation graph convolutional network can be used to predict diseases: an encoder is designed to automatically select the optimal combination of phenotypic characteristics; a multi-layer aggregation graph convolutional network with multiple aggregation modes is introduced to select advantageous structural information for each node; and a group graph structure is designed according to the spatial distribution and text similarity of the phenotypic characteristics, so that each effective phenotypic characteristic contributes positively to the prediction result and the optimal phenotypic information can be found automatically at each layer of the disease prediction model. This improves diagnostic accuracy to a certain extent. For rare diseases, however, no classification detection among diseases is performed during disease classification. Because rare diseases and common diseases are highly similar in their phenotypic characteristics, the difference between a rare disease and the corresponding common disease is hard to distinguish, which easily leads to misdiagnosis and missed diagnosis of rare diseases. How to extract effective phenotypic information for diagnosing rare diseases is therefore of great significance for medical disease prediction.

Disclosure of Invention

The invention aims to provide a phenotype analysis method for disease prediction. Effective phenotypic characteristics are extracted based on a graph convolutional network and used to compare rare disease data with common disease data; the rare disease data and common disease data are fused to construct a 'rare disease-common disease' common database; the patient's phenotypic characteristics are matched against the databases; and disease matching differences are extracted from the commonalities between rare diseases and common diseases. The problems described in the background can thereby be effectively addressed.

In order to achieve the purpose, the invention provides the following technical scheme:

a phenotypic analysis method oriented to disease prediction comprising the following analysis steps:

step one: constructing databases, including constructing a rare disease database, a common disease database and a patient information database;

step two: performing database comparison and fusion treatment, namely constructing a 'rare disease-common disease' common database by measuring the phenotype similarity of rare disease phenotype characteristics and common disease phenotype characteristics and simultaneously referring to the difference between rare disease gene data and common disease gene data;

step three: patient information data processing: defining the set of patient phenotypic characteristics $H$ as $K = \{K_h\}$, comprising the patient's basic personal information, family genetic medical history and physical signs; and using an adjacency matrix to search for suitable phenotypic characteristics and calculate the corresponding phenotypic characteristic selection scores, and further calculating the edge weights to obtain the optimal combination of phenotypic characteristics;

step four: inputting the optimal combination of the patient's phenotypic characteristics into the rare disease database, the common disease database and the 'rare disease-common disease' common database, and calculating the disease matching score of the optimal combination in each of these databases; the output $Z = \{Z_i\} \in \mathbb{R}^{n \times p}$ represents the matching scores of all data, and $Z_i$ represents the matching score of node $i$ in the database;

step five: calculating the cross-entropy loss $L_w$ of the phenotypic feature matching model for disease prediction, where $Y_{ij}$ represents the label information of the data;

step six: using the weighted sum of the classification loss function $L_{H-C}$ of the 'rare disease-common disease' common database and the cross-entropy loss $L_w$ of the phenotypic feature matching model as the total loss function of the disease prediction model; the smaller the value of the total loss function, the higher the accuracy of the prediction result;

step seven: and outputting a prediction result.
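
Steps four to six can be illustrated with a minimal Python sketch. It assumes the standard softmax cross-entropy for $L_w$ (the patent's own formula is not reproduced in this text), treats the value of $L_{H-C}$ as a number that is simply passed in, and introduces a hypothetical weight `lam` for the weighted sum. All function names are illustrative.

```python
import math

def log_softmax(row):
    """Log-probabilities for one row of matching scores Z_i."""
    m = max(row)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]

def cross_entropy(scores, labels):
    """Mean cross-entropy -(1/n) * sum_i sum_j Y_ij * log P_ij (assumed form of L_w)."""
    total = 0.0
    for z_row, y_row in zip(scores, labels):
        total -= sum(y * lp for y, lp in zip(y_row, log_softmax(z_row)))
    return total / len(scores)

def total_loss(l_hc, l_w, lam=0.5):
    """Weighted sum of L_{H-C} and L_w; lam is a hypothetical weight in [0, 1]."""
    return lam * l_hc + (1.0 - lam) * l_w

# Toy inputs: n = 2 nodes, p = 2 classes, one-hot labels Y_ij.
Z = [[2.0, 0.0], [0.0, 2.0]]
Y = [[1, 0], [0, 1]]
l_w = cross_entropy(Z, Y)
loss = total_loss(l_hc=0.3, l_w=l_w, lam=0.5)
```

A smaller `loss` corresponds to the statement in step six that a smaller total loss indicates a more accurate prediction; how the two terms are actually weighted is not specified in the text.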

As a still further scheme of the invention: the rare disease database in step one comprises rare disease categories, rare disease genetic information, rare disease phenotypic characteristics and rare disease gene data; by combining the existing rare disease knowledge base with the corresponding rare disease cases, different phenotypes and gene sequences are mapped to the corresponding rare disease entries. The common disease database comprises common disease categories, common disease genetic information, common disease phenotypic characteristics and common disease gene data; by combining common disease pathogenic genes with clinical medical cases, different phenotypes and gene sequences are added to the corresponding common disease entries. The patient information database comprises the patient's disease symptoms and medical examination data.
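
One possible in-memory layout for these databases is sketched below. The class names, field names and placeholder values are all hypothetical; the patent lists only the kinds of information each database holds, not a schema.

```python
from dataclasses import dataclass, field

@dataclass
class DiseaseEntry:
    """One entry of the rare or common disease database (hypothetical fields)."""
    category: str
    genetic_info: str
    phenotypes: set = field(default_factory=set)   # phenotypic characteristics
    gene_data: dict = field(default_factory=dict)  # gene name -> sequence

@dataclass
class PatientRecord:
    """One entry of the patient information database (hypothetical fields)."""
    symptoms: set = field(default_factory=set)
    exam_data: dict = field(default_factory=dict)

def add_case(db, name, entry, phenotypes, gene_data):
    """Create the entry if it is missing, then attach the phenotypes and
    gene sequences observed in one case to that disease entry."""
    stored = db.setdefault(name, entry)
    stored.phenotypes |= set(phenotypes)
    stored.gene_data.update(gene_data)
    return stored

# Placeholder data only; two cases of the same disease are merged into one entry.
rare_db = {}
add_case(rare_db, "disease_A", DiseaseEntry("category_1", "placeholder"),
         ["phenotype_1"], {"GENE_1": "ATGC"})
add_case(rare_db, "disease_A", DiseaseEntry("category_1", "placeholder"),
         ["phenotype_2"], {})
```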

As a still further scheme of the invention: in step two, when the data in the rare disease database is classified, the samples are up-sampled on the basis of a clustering algorithm, which increases the accuracy of data classification and reduces the negative influence on classification caused by the crossing and overlapping between the disease categories of the rare disease database.
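
The text names neither the clustering algorithm nor the up-sampling rule. As a minimal sketch, the following assumes k-means clustering and synthesizes new samples by interpolating between a cluster member and its centroid; both choices, and the function names, are assumptions of this example rather than the patent's method.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means; returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def upsample(points, k, seed=0):
    """Grow every non-empty cluster to the size of the largest cluster."""
    rng = random.Random(seed)
    assign, centroids = kmeans(points, k, seed=seed)
    clusters = [[p for p, a in zip(points, assign) if a == c] for c in range(k)]
    target = max(len(m) for m in clusters)
    out = []
    for members, centroid in zip(clusters, centroids):
        if not members:
            continue
        out.extend(members)
        for _ in range(target - len(members)):
            p = rng.choice(members)
            t = rng.random()
            # Synthetic sample on the segment between p and the centroid.
            out.append([a + t * (b - a) for a, b in zip(p, centroid)])
    return out

# Four samples near the origin and two far away: the small cluster is grown.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11]]
balanced = upsample(points, k=2)
```

Interpolating toward the centroid keeps synthetic samples inside their own cluster, which is one way to avoid adding points in the overlap region between disease categories.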

As a still further scheme of the invention: in step two, the phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is defined as sim, and a phenotype similarity $\mathrm{sim}(x, y)$ is defined for a rare disease $x$ and a common disease $y$.

The phenotype similarity $\mathrm{sim}(x, y)$ of the rare disease $x$ and the common disease $y$ is taken as prior information for detecting gene differences, and the difference between the rare disease gene data and the common disease gene data is

$$\mathrm{var}(x, y)^{n} = \alpha \, W_p \, \mathrm{var}(x, y)^{n-1} + (1 - \alpha)\, \mathrm{sim}(x, y),$$

where $\alpha$ represents a weight value and $W_p$ represents the gene interaction network.
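
The recurrence for var(x, y) can be iterated numerically. The sketch below reads the superscript $n$ as an iteration index, represents var(x, y) as one value per gene, and represents $W_p$ as a weight matrix over those genes; these readings, the stopping tolerance and the function name are assumptions, since the text does not state the shapes involved.

```python
def propagate_gene_difference(var0, w_p, sim, alpha=0.5, steps=100, tol=1e-12):
    """Iterate var^n = alpha * W_p @ var^(n-1) + (1 - alpha) * sim.

    var0 -- initial per-gene difference values (list of floats)
    w_p  -- gene interaction network as a square weight matrix (list of rows)
    sim  -- scalar phenotype similarity sim(x, y), used as the prior term
    """
    var = list(var0)
    for _ in range(steps):
        new = [
            alpha * sum(w * v for w, v in zip(row, var)) + (1.0 - alpha) * sim
            for row in w_p
        ]
        # Stop once successive iterates no longer change.
        if max(abs(a - b) for a, b in zip(new, var)) < tol:
            return new
        var = new
    return var

# Two genes with asymmetric interaction weights (illustrative values only).
W_p = [[0.0, 1.0],
       [0.5, 0.0]]
result = propagate_gene_difference([0.0, 0.0], W_p, sim=0.8, alpha=0.5)
```

The iteration settles to a fixed point when alpha times the largest row sum of `W_p` is below 1, as in this example; the fixed point then depends on `sim` and `W_p` but not on `var0`, which matches the role of sim(x, y) as prior information.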

As a still further scheme of the invention: the classification loss function $L_{H-C}$ of the 'rare disease-common disease' common database is defined in terms of the following quantities: $N$ represents the number of samples in the database that participate in the classification; one term is the loss used to verify the correlation between disease $x$ and disease $y$; another term is the matching degree of disease $x$ and disease $y$; $\gamma$ represents an influencing factor in the optimization process; $d$ is the dimension of the feature vectors of disease $x$ and disease $y$; $C_x$ and $C_y$ represent the $d$-dimensional covariance matrices of the feature vectors of disease $x$ and disease $y$; and $\phi$ is an input initial constant.
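
The loss formula itself does not survive in this text. The surviving symbols (a feature dimension $d$ and covariance matrices $C_x$, $C_y$) resemble the ingredients of the well-known correlation alignment (CORAL) distance, $\lVert C_x - C_y \rVert_F^2 / (4 d^2)$. The sketch below computes that distance purely as an assumption about what a matching degree built from these quantities could look like; it is not confirmed to be the patent's formula, and the function names are illustrative.

```python
def covariance(rows):
    """Sample covariance matrix (d x d) of n feature vectors of dimension d."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def coral_distance(fx, fy):
    """Squared Frobenius norm of C_x - C_y, scaled by 1 / (4 d^2)."""
    d = len(fx[0])
    cx, cy = covariance(fx), covariance(fy)
    frob_sq = sum((cx[i][j] - cy[i][j]) ** 2 for i in range(d) for j in range(d))
    return frob_sq / (4 * d * d)

# Feature vectors for disease x and disease y (illustrative values only).
features_x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
features_y = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
distance = coral_distance(features_x, features_y)
```

Under this assumption a distance of zero means the two diseases have identical second-order feature statistics, so a smaller value would indicate a closer match.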

As a still further scheme of the invention: in the adjacency matrix of step three, $\alpha_h$ is the phenotype selection score of phenotypic characteristic $K_h$, and $\gamma$ is evaluated on the phenotypic characteristic $K_h$ of two nodes $v, w \in H$, wherein the sample-count term is the number of samples for which phenotypic characteristic $K_h$ meets the requirement.
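
The adjacency matrix formula is not reproduced in this text. A common way to build such a graph from phenotypic data is to weight the edge between two nodes by $\sum_h \alpha_h \, \gamma(K_h(v), K_h(w))$; the sketch below assumes that form. The two `gamma_*` functions, the `tol` parameter, the feature names and the scores are hypothetical and are not the patent's own definitions.

```python
def gamma_categorical(a, b):
    """1 when two nodes share the same category, else 0 (assumed form)."""
    return 1.0 if a == b else 0.0

def gamma_quantitative(a, b, tol):
    """1 when two values differ by at most tol, else 0 (assumed form)."""
    return 1.0 if abs(a - b) <= tol else 0.0

def edge_weight(v, w, features):
    """Sum of alpha_h * gamma(K_h(v), K_h(w)) over the selected features.

    features -- list of (feature name, selection score alpha_h, gamma function)
    """
    return sum(alpha * gamma(v[name], w[name]) for name, alpha, gamma in features)

def adjacency(nodes, features):
    """Symmetric adjacency matrix with a zero diagonal."""
    n = len(nodes)
    return [
        [0.0 if i == j else edge_weight(nodes[i], nodes[j], features)
         for j in range(n)]
        for i in range(n)
    ]

# Illustrative patients with one categorical and one quantitative feature.
patients = [
    {"family_history": "yes", "age": 30},
    {"family_history": "yes", "age": 33},
    {"family_history": "no", "age": 60},
]
features = [
    ("family_history", 0.7, gamma_categorical),
    ("age", 0.3, lambda a, b: gamma_quantitative(a, b, tol=5)),
]
A = adjacency(patients, features)
```

Features with a higher selection score contribute more to the edge weight, which is how a feature-selection step can shape the graph that a graph convolutional network later operates on.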

As a still further scheme of the invention: when $K_h$ is a non-quantitative phenotypic characteristic, a function of the threshold $\theta$ is defined on the numbers of samples in categories $p$ and $u$ of $K_h$ that meet the requirement, and $\gamma = 1$. When $K_h$ is a quantitative phenotypic characteristic, a function of the threshold $\delta$ is defined on the number of samples in category $p$ of $K_h$; a closed interval $D = [\alpha, \beta] \in \{K_h\}$ is defined, together with the number of qualifying samples of category $p$ of $K_h$ that do not belong to the closed interval $D$.

Compared with the prior art, the invention has the following beneficial effects:

Effective phenotypic characteristics are extracted based on a graph convolutional network and used to compare rare disease data with common disease data; the rare disease data and common disease data are fused to construct a 'rare disease-common disease' common database; the patient's phenotypic characteristics are matched against the databases; and disease matching differences are extracted from the commonalities between rare diseases and common diseases to predict diseases. This addresses the problem that such diseases are easily confused, and avoids misdiagnosis and missed diagnosis of rare diseases to a certain extent.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a phenotype analysis method for disease prediction.

Detailed Description

In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it.

Example 1:

Referring to FIG. 1, in an embodiment of the present invention, a phenotype analysis method for disease prediction includes the following steps:

step two: database comparison and fusion: the phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is measured while referring to the difference between rare disease gene data and common disease gene data, and the 'rare disease-common disease' common database is constructed. The phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is defined as sim, and a phenotype similarity $\mathrm{sim}(x, y)$ is defined for a rare disease $x$ and a common disease $y$.

Taking $\mathrm{sim}(x, y)$ as prior information for detecting gene differences, the difference between the rare disease gene data and the common disease gene data is

$$\mathrm{var}(x, y)^{n} = \alpha \, W_p \, \mathrm{var}(x, y)^{n-1} + (1 - \alpha)\, \mathrm{sim}(x, y),$$

where $\alpha$ represents a weight value and $W_p$ represents the gene interaction network.

The classification loss function $L_{H-C}$ of the 'rare disease-common disease' common database is defined in terms of the following quantities: $N$ represents the number of samples in the database that participate in the classification; one term is the loss used to verify the correlation between disease $x$ and disease $y$; another term is the matching degree of disease $x$ and disease $y$; $\gamma$ represents an influencing factor in the optimization process; $d$ is the dimension of the feature vectors of disease $x$ and disease $y$; $C_x$ and $C_y$ represent the $d$-dimensional covariance matrices of the feature vectors of disease $x$ and disease $y$; and $\phi$ is an input initial constant;

step three: patient information data processing: defining the set of patient phenotypic characteristics $H$ as $K = \{K_h\}$; using an adjacency matrix to search for suitable phenotypic characteristics and calculate the corresponding phenotypic characteristic selection scores, and further calculating the edge weights to obtain the optimal combination of phenotypic characteristics, where $\alpha_h$ is the phenotype selection score of phenotypic characteristic $K_h$, $\gamma$ is evaluated on the phenotypic characteristic $K_h$ of two nodes $v, w \in H$, and the sample-count term is the number of samples for which $K_h$ meets the requirement. When $K_h$ is a non-quantitative phenotypic characteristic, a function of the threshold $\theta$ is defined on the numbers of samples in categories $p$ and $u$ of $K_h$, and $\gamma = 1$. When $K_h$ is a quantitative phenotypic characteristic, a function of the threshold $\delta$ is defined on the number of samples in category $p$ of $K_h$; a closed interval $D = [\alpha, \beta] \in \{K_h\}$ is defined, together with the number of qualifying samples of category $p$ of $K_h$ that do not belong to the closed interval $D$;

$Y_{ij}$ represents the label information of the data;

step six: using the weighted sum of the classification loss function $L_{H-C}$ of the 'rare disease-common disease' common database and the cross-entropy loss $L_w$ of the phenotypic feature matching model as the total loss function of the disease prediction model; the smaller the value of the total loss function, the higher the accuracy of the prediction result;

step seven: outputting the prediction result.

By adopting this technical scheme: effective phenotypic characteristics are extracted based on a graph convolutional network and used to compare rare disease data with common disease data; the rare disease data and common disease data are fused to construct a 'rare disease-common disease' common database; the patient's phenotypic characteristics are matched against the databases; and disease matching differences are extracted from the commonalities between rare diseases and common diseases to predict diseases. This addresses the problem that such diseases are easily confused, and avoids misdiagnosis and missed diagnosis of rare diseases to a certain extent.

Example 2:

step one: constructing databases, including constructing a rare disease database, a common disease database and a patient information database. The rare disease database comprises rare disease categories, rare disease genetic information, rare disease phenotypic characteristics and rare disease gene data; by combining the existing rare disease knowledge base with the corresponding rare disease cases, different phenotypes and gene sequences are mapped to the corresponding rare disease entries. The common disease database comprises common disease categories, common disease genetic information, common disease phenotypic characteristics and common disease gene data; by combining common disease pathogenic genes with clinical medical cases, different phenotypes and gene sequences are added to the corresponding common disease entries. The patient information database comprises the patient's disease symptoms and medical examination data;

step two: database comparison and fusion: the phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is measured while referring to the difference between rare disease gene data and common disease gene data, and the 'rare disease-common disease' common database is constructed. The phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is defined as sim, and a phenotype similarity $\mathrm{sim}(x, y)$ is defined for a rare disease $x$ and a common disease $y$.

The phenotype similarity $\mathrm{sim}(x, y)$ is used as prior information for detecting gene differences, and the difference between the rare disease gene data and the common disease gene data is

$$\mathrm{var}(x, y)^{n} = \alpha \, W_p \, \mathrm{var}(x, y)^{n-1} + (1 - \alpha)\, \mathrm{sim}(x, y),$$

where $\alpha$ represents a weight value and $W_p$ represents the gene interaction network.

The classification loss function $L_{H-C}$ of the 'rare disease-common disease' common database is defined in terms of the following quantities: one term is the loss used to verify the correlation between disease $x$ and disease $y$; another term is the matching degree of disease $x$ and disease $y$; $\gamma$ represents an influencing factor in the optimization process; $d$ is the dimension of the feature vectors of disease $x$ and disease $y$; $C_x$ and $C_y$ represent the $d$-dimensional covariance matrices of the feature vectors of disease $x$ and disease $y$; and $\phi$ is an input initial constant;

step three: patient information data processing: defining the set of patient phenotypic characteristics $H$ as $K = \{K_h\}$, comprising the patient's basic personal information, family genetic medical history and physical signs; using an adjacency matrix to search for suitable phenotypic characteristics and calculate the corresponding phenotypic characteristic selection scores, and further calculating the edge weights to obtain the optimal combination of phenotypic characteristics, where $\alpha_h$ is the phenotype selection score of phenotypic characteristic $K_h$, $\gamma$ is evaluated on the phenotypic characteristic $K_h$ of two nodes $v, w \in H$, and the sample-count term is the number of samples for which $K_h$ meets the requirement. When $K_h$ is a non-quantitative phenotypic characteristic, a function of the threshold $\theta$ is defined; when $K_h$ is a quantitative phenotypic characteristic, a function of the threshold $\delta$ is defined;

$Y_{ij}$ represents the label information of the data;

step seven: outputting the prediction result and evaluating it: the accuracy of the prediction result is verified by checking it back against the patient's phenotypic characteristics. Because the rare disease database has few samples and widely distributed disease categories, the error rate or accuracy of disease classification changes to some extent as the disease data in the database grows. The numbers of rare disease samples and common disease samples are therefore kept as balanced as possible during disease classification, so that classification performance does not become unstable. The sample set in the rare disease data is $\{H_i\}$ and the sample set in the common disease data is $\{C_i\}$; the matching degree of samples $H_i$ and $C_i$ is verified, where $d$ characterizes samples $H_i$ and $C_i$. In the improved classification loss function, $N$ represents the total number of samples, $\varepsilon$ is an adjustment coefficient, and $P_i$ represents the prediction probability of the sample feature.

By adopting this technical scheme: the prediction result is reversely verified; the databases are fused to extract phenotypic characteristics; data balancing is added to the data classification process; and the loss function of the model used to verify the prediction result is improved in order to optimize that model. Comparing the verification result with the patient's phenotypic characteristics confirms the accuracy of the prediction result, and balancing the classified data avoids the negative influence of excessively different sample sizes on data classification. The accuracy of the prediction result is thereby improved to a certain extent, and the problem that rare diseases and common diseases are easily confused is addressed.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A phenotypic analysis method oriented to disease prediction, comprising the following analysis steps:

step three: patient information data processing: defining the set of patient phenotypic characteristics $H$ as $K = \{K_h\}$, comprising the patient's basic personal information, family genetic medical history and physical signs; and using an adjacency matrix to search for suitable phenotypic characteristics and calculate the corresponding phenotypic characteristic selection scores, and further calculating the edge weights to obtain the optimal combination of phenotypic characteristics;

step four: inputting the optimal combination of the patient's phenotypic characteristics into the rare disease database, the common disease database and the 'rare disease-common disease' common database, and calculating the disease matching score of the optimal combination in each of these databases; the output $Z = \{Z_i\} \in \mathbb{R}^{n \times p}$ represents the matching scores of all data, and $Z_i$ represents the matching score of node $i$ in the database;

$Y_{ij}$ represents the label information of the data;

step seven: and outputting a prediction result.

2. The phenotype analysis method for disease prediction according to claim 1, wherein the rare disease database in step one comprises rare disease categories, rare disease genetic information, rare disease phenotypic characteristics and rare disease gene data, and different phenotypes and gene sequences are mapped to the corresponding rare disease entries by combining the existing rare disease knowledge base with the corresponding rare disease cases; the common disease database comprises common disease categories, common disease genetic information, common disease phenotypic characteristics and common disease gene data, and different phenotypes and gene sequences are added to the corresponding common disease entries by combining common disease pathogenic genes with clinical medical cases; and the patient information database comprises the patient's disease symptoms and medical examination data.

3. The phenotype analysis method for disease prediction according to claim 1, wherein in step two, when the data in the rare disease database is classified, the samples are up-sampled on the basis of a clustering algorithm, which increases the accuracy of data classification and reduces the negative influence on classification caused by the crossing and overlapping between the disease categories of the rare disease database.

4. The phenotype analysis method for disease prediction according to claim 1, wherein the phenotype similarity between rare disease phenotypic characteristics and common disease phenotypic characteristics is defined as sim, and a phenotype similarity $\mathrm{sim}(x, y)$ is defined for a rare disease $x$ and a common disease $y$;

the phenotype similarity $\mathrm{sim}(x, y)$ is used as prior information for detecting gene differences, and the difference between the rare disease gene data and the common disease gene data is

$$\mathrm{var}(x, y)^{n} = \alpha \, W_p \, \mathrm{var}(x, y)^{n-1} + (1 - \alpha)\, \mathrm{sim}(x, y),$$

where $\alpha$ represents a weight value and $W_p$ represents the gene interaction network.

5. The phenotype analysis method for disease prediction according to claim 4, wherein the classification loss function of the 'rare disease-common disease' common database is defined in terms of the following quantities: one term is the loss used to verify the correlation between disease $x$ and disease $y$; another term is the matching degree of disease $x$ and disease $y$; $\gamma$ represents an influencing factor in the optimization process; $d$ is the dimension of the feature vectors of disease $x$ and disease $y$; $C_x$ and $C_y$ represent the $d$-dimensional covariance matrices of the feature vectors of disease $x$ and disease $y$; and $\phi$ is an input initial constant.

6. The phenotype analysis method for disease prediction according to claim 1, wherein, in the adjacency matrix of step three, the sample-count term is the number of samples for which phenotypic characteristic $K_h$ meets the requirement.

7. The phenotype analysis method for disease prediction according to claim 5, wherein, when $K_h$ is a non-quantitative phenotypic characteristic, a function of the threshold $\theta$ is defined on the numbers of samples in categories $p$ and $u$ of $K_h$ that meet the requirement, and $\gamma = 1$; and when $K_h$ is a quantitative phenotypic characteristic, a function of the threshold $\delta$ is defined,