WO2021169203A1

WO2021169203A1 - Monogenic disease name recommendation method and system based on multi-level structural similarity

Info

Publication number: WO2021169203A1
Application number: PCT/CN2020/111130
Authority: WO
Inventors: 马旭; 曹宗富; 陈翠霞; 喻浴飞; 蔡瑞琨; 李乾; 罗敏娜
Original assignee: 国家卫生健康委科学技术研究所
Priority date: 2020-02-27
Filing date: 2020-08-25
Publication date: 2021-09-02
Also published as: CN111341458B; CN111341458A

Abstract

A monogenic disease name recommendation method and system based on multi-level structural similarity, capable of intelligently and accurately recommending a matching monogenic disease name. The method comprises: constructing a standardized clinical feature phenotype tree of monogenic diseases according to a feature relational database of monogenic disease names; labeling, on the nodes of the phenotype tree, the clinical features in a feature set I inputted by a user; traversing the nth monogenic disease name in the feature relational database and labeling, on the nodes of the phenotype tree, the standard clinical features in the feature set A corresponding to the nth monogenic disease name; matching, from the feature set A, the optimal standard clinical feature corresponding to each clinical feature in the feature set I; calculating a set similarity value between the feature set I and the current feature set A; and setting n = n + 1 and repeating the traversal until all the monogenic disease names in the feature relational database have been traversed, then summarizing and sorting the set similarity values between the feature set I and each feature set A, and outputting the monogenic disease name corresponding to the maximum similarity value.

Description

Single gene disease name recommendation method and system based on multi-level structure similarity

Technical field

The present invention relates to the field of medical information technology, in particular to a method and system for recommending names of single-gene diseases based on multi-level structural similarity.

Background art

Monogenic disease is a common class of disease. It is caused by mutations in a pair of alleles and is also known as Mendelian genetic disease. Its characteristics are as follows:

1. There are many types of monogenic diseases, and more than 8,000 monogenic diseases have been discovered;

2. The phenotypes of single-gene diseases are complex: the phenotype of the same single-gene disease is highly heterogeneous, and the clinical features of different single-gene diseases overlap with each other;

3. The inheritance patterns of single-gene diseases are diverse: the same single-gene disease may show different inheritance patterns, and different single-gene diseases may show the same inheritance pattern;

4. The incidence of most monogenic diseases is very low, so they are relatively rare.

These complex factors make it difficult for clinicians to be familiar with all the phenotypes of monogenic diseases, and bring great difficulties to the clinical diagnosis and treatment of monogenic diseases. Existing technology has established a Chinese database of monogenic diseases and their clinical features. On this basis, it recommends possible monogenic diseases according to the clinical features of patients and provides a convenient auxiliary diagnostic tool that gives clinicians diagnostic clues, thereby improving the diagnostic accuracy of clinicians and reducing the probability of missed diagnosis and misdiagnosis. Specifically, based on the case characteristics and standardized phenotypes entered by the user, Elastic similarity and Fisher's exact test enrichment analysis are used to recommend the names of single-gene diseases. The Elastic similarity measures the similarity of the input text and cannot take the meaning of keywords into account; for example, for "hypohidrosis" and "hyperhidrosis", it may rank disease names with the opposite phenotype first. The disadvantage of Fisher's exact test is that the accuracy of its results depends heavily on whether the input phenotype is accurate. Because the phenotypes of single-gene diseases are complex, it is difficult for doctors to guarantee that the input phenotype is the standardized phenotype of the disease, and if the input is only an approximate phenotype, the recommended results may be wrong.

Summary of the invention

The purpose of the present invention is to provide a single gene disease name recommendation method and system based on multi-level structural similarity, which reduces the input restriction requirements for doctors, and intelligently and accurately recommends the matched single gene disease name.

In order to achieve the above objectives, one aspect of the present invention provides a method for recommending names of single-gene diseases based on multi-level structural similarity, including:

Construct a standardized clinical feature phenotype tree of monogenic diseases based on the characteristic relational database of the names of monogenic diseases;

Mark the nodes of the clinical features in the feature set I input by the user on the standardized clinical feature phenotype tree;

Traverse the name of the nth monogenic disease in the feature relational database, and mark the node of the standard clinical feature in the corresponding feature set A on the standardized clinical feature phenotype tree, and the initial value of n is 1;

Based on the node labels on the standardized clinical feature phenotype tree, the best standard clinical feature corresponding to each clinical feature in feature set I is matched from feature set A;

According to the similarity value between each clinical feature and the corresponding best standard clinical feature, calculate the set similarity value between feature set I and current feature set A;

Let n=n+1 and traverse the name of the nth monogenic disease in the feature relational database again, until all the monogenic disease names in the feature relational database have been traversed; then summarize and sort the set similarity values between the feature set I and each feature set A, and output the name of the single-gene disease corresponding to the highest similarity value.

Preferably, the method for building the feature relational database of single-gene disease names includes:

Obtain the names of known monogenic diseases and their corresponding standard clinical features from public databases and literature databases of monogenic diseases;

Based on the known names of single-gene diseases and their corresponding standard clinical features, establish a feature relationship database between the names of single-gene diseases and standard clinical features;

Calculate the contribution c_i of each standard clinical feature corresponding to each single-gene disease name to that single-gene disease.

Preferably, the method for constructing a standardized clinical feature phenotype tree of a single gene disease includes:

Obtain data from the characteristic relational database, and construct a standardized clinical characteristic phenotype tree of monogenic diseases based on HPO;

The standardized clinical feature phenotype tree is composed of multiple stem nodes and at least one branch node associated with each stem node. Each branch node is used to represent a standardized clinical feature, and each stem node is used to represent the index of its associated standardized clinical features.

Further, the method of matching the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A based on the node labels on the standardized clinical feature phenotype tree includes:

The feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;

Traverse the i-th clinical feature in the feature set I, and select from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1;

Let i=i+1 and traverse the i-th clinical feature in the feature set I again until all the clinical features in the feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the name of the n-th monogenic disease.

Preferably, the method for selecting the standard clinical feature with the highest similarity to the i-th clinical feature from the feature set A includes:

Traverse the j-th standard clinical feature in the feature set A, and determine, based on the established index, whether the j-th standard clinical feature and the i-th clinical feature have the same stem node B_t, the initial value of j being 1;

If the judgment result is no, it is considered that the similarity value between the j-th standard clinical feature and the i-th clinical feature is zero;

If the judgment result is yes, calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature based on a multi-level structure similarity algorithm;

Let j=j+1, traverse the j-th standard clinical feature in the feature set A again, and continue to perform the similarity calculation between the j-th standard clinical feature and the i-th clinical feature until all the standard clinical features in the feature set A have been traversed, thereby obtaining multiple similarity values, one for each standard clinical feature in the feature set A;

The standard clinical feature corresponding to the maximum of the multiple similarity values is selected as the best standard clinical feature corresponding to the i-th clinical feature.

Preferably, the method for calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature based on a multi-level structure similarity algorithm includes:

Based on the node labeling on the standardized clinical feature phenotype tree, obtain the directed set IB of all nodes in the path connecting the i-th clinical feature and the same stem node B_t, and obtain the directed set AB of all nodes in the path connecting the j-th standard clinical feature and the same stem node B_t. The length of the directed set IB is the number of nodes in its path, L_IB, and the length of the directed set AB is the number of nodes in its path, L_AB;

Extract the intersection IAB of the nodes in the directed set IB and the directed set AB; the length of the intersection IAB is the number of common nodes in the two paths, L_IAB;

The formula S_IiAj = β·SM + (1 - β)·SI is used to calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature; where,

The SM represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at multiple levels of the phenotype tree;

The SI represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at the same level of the phenotype tree, and the β is a weighting coefficient.

For example, the calculation formula of the SM is SM = L_IAB / max(L_AB, L_IB), and the calculation formula of the SI is SI = 1 / (L_AB + L_IB - 2·L_IAB + 1).
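As a worked illustration of these formulas, the values L_IB = 4, L_AB = 3, L_IAB = 2 and β = 0.5 are assumed below; they are made up for the example and do not come from the source.

```latex
\begin{aligned}
SM &= \frac{L_{IAB}}{\max(L_{AB}, L_{IB})} = \frac{2}{\max(3, 4)} = 0.5,\\
SI &= \frac{1}{L_{AB} + L_{IB} - 2L_{IAB} + 1} = \frac{1}{3 + 4 - 4 + 1} = 0.25,\\
S_{I_i A_j} &= \beta \cdot SM + (1 - \beta) \cdot SI = 0.5 \cdot 0.5 + 0.5 \cdot 0.25 = 0.375.
\end{aligned}
```

When the two paths coincide (L_IB = L_AB = L_IAB), both SM and SI equal 1, so the similarity reaches its maximum of 1 for any β.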

Preferably, the method for calculating the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature includes:

Use the contribution c_i of the i-th clinical feature to weight the maximum similarity value corresponding to its best standard clinical feature in the feature set A;

Let i=i+1 and weight the maximum similarity value of the best standard clinical feature corresponding to the i-th clinical feature in feature set A again, until all the best standard clinical features selected from feature set A have been weighted; then accumulate the weighted maximum similarity values corresponding to all the best standard clinical features in the feature set A to obtain the set similarity value between the feature set I and the current feature set A.

Compared with the prior art, the single gene disease name recommendation method based on multi-level structural similarity provided by the present invention has the following beneficial effects:

In the single-gene disease name recommendation method based on multi-level structural similarity provided by the present invention, a standardized clinical feature phenotype tree of single-gene diseases is first constructed based on the feature relational database of single-gene disease names. The clinical features in the feature set I input by the user are then marked at the corresponding nodes of the standardized clinical feature phenotype tree, the nth monogenic disease name in the feature relational database is traversed, and the standard clinical features in the feature set A corresponding to the current nth monogenic disease name are marked at their nodes on the tree. Next, according to the node labels on the standardized clinical feature phenotype tree, the best standard clinical features in one-to-one correspondence with the clinical features in feature set I are matched from feature set A, and the set similarity value between feature set I and the current feature set A is calculated from the similarity value between each clinical feature and its best standard clinical feature. After that, n is set to n+1 and the nth monogenic disease name in the feature relational database is traversed again, until all the monogenic disease names in the feature relational database have been traversed; the set similarity values between feature set I and each feature set A are then summarized and sorted, and the name of the single-gene disease with the highest similarity value is output.

It can be seen that the single gene disease name recommendation method based on multi-level structural similarity provided by the present invention is convenient and friendly to use. Standardized clinical features can be entered conveniently through instant search and the phenotype tree, and users are allowed to enter similar clinical features, which relaxes the restrictions on user input and raises the degree of intelligent diagnosis. After the query is clicked, the recommended single-gene disease names are output quickly, which improves the accuracy and efficiency of the diagnosis of single-gene diseases.

Another aspect of the present invention provides a single gene disease name recommendation system based on multi-level structural similarity, including:

The phenotype tree unit is used to construct a standardized clinical feature phenotype tree of the single-gene disease according to the feature relation database of the name of the single-gene disease;

The input unit is used to mark the nodes of the clinical features in the feature set I input by the user on the standardized clinical feature phenotype tree;

The traversal unit is used to traverse the name of the nth monogenic disease in the feature relational database, and mark the nodes of the standard clinical features in the corresponding feature set A on the standardized clinical feature phenotype tree, the initial value of n being 1;

The retrieval unit, based on the node labels on the standardized clinical feature phenotype tree, matches the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A;

The calculation unit is used to calculate the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature;

The judging unit is used to let n=n+1 and invoke the traversal unit again until all the single gene disease names in the feature relational database have been traversed;

The output unit is used to summarize and sort the set similarity values corresponding to the feature set I and each feature set A, and output the name of the single gene disease corresponding to the highest similarity value.

Compared with the prior art, the beneficial effects of the single-gene disease name recommendation system based on multi-level structure similarity provided by the present invention are the same as the beneficial effects of the single-gene disease name recommendation method based on multi-level structure similarity provided by the above technical solution, and are not repeated here.

The third aspect of the present invention provides a computer-readable storage medium, for example, a non-volatile computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are run by a processor, the steps of the above-mentioned method for recommending names of single-gene diseases based on multi-level structural similarity are executed.

Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the present invention are the same as the beneficial effects of the single-gene disease name recommendation method based on multi-level structural similarity provided by the above technical solutions, and will not be repeated here.

Description of the drawings

The drawings described here are used to provide a further understanding of the present invention and constitute a part of the present invention. The exemplary embodiments of the present invention and the description thereof are used to explain the present invention, and do not constitute an improper limitation of the present invention. In the drawings:

Fig. 1 is a schematic flow chart of a method for recommending names of single-gene diseases based on similarity of multi-level structure in the first embodiment;

Fig. 2 is an example diagram of node labels on the standardized clinical feature phenotype tree in Embodiment 1 of the present invention;

Fig. 3 is a structural block diagram of a single gene disease name recommendation system based on multi-level structural similarity in the second embodiment;

Fig. 4 is a schematic diagram of the environment architecture of the application of the single gene disease name recommendation method based on multi-level structural similarity in the fourth embodiment of the present invention;

FIG. 5 is an example diagram of an environment architecture for the application of the single-gene disease name recommendation method based on multi-level structural similarity in the fourth embodiment of the present invention.

Detailed description

In order to make the above objectives, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

Embodiment 1

Referring to Fig. 1, this embodiment provides a method for recommending names of single-gene diseases based on multi-level structural similarity, including:

Construct a standardized clinical feature phenotype tree of single gene diseases according to the feature relational database of single gene disease names; mark the clinical features in the feature set I input by the user at the nodes of the standardized clinical feature phenotype tree; traverse the name of the n-th monogenic disease in the feature relational database, and mark the standard clinical features in the corresponding feature set A at the nodes of the standardized clinical feature phenotype tree, the initial value of n being 1; based on the node marks on the standardized clinical feature phenotype tree, match from feature set A the best standard clinical feature corresponding to each clinical feature in feature set I; according to the similarity value between each clinical feature and its best standard clinical feature, calculate the set similarity value between feature set I and the current feature set A; let n=n+1 and traverse the nth monogenic disease name in the feature relational database again until all the monogenic disease names in the feature relational database have been traversed, then summarize and sort the set similarity values between feature set I and each feature set A, and output the single gene disease name corresponding to the highest similarity value.
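The outer traversal over disease names can be sketched as follows. This is a minimal illustration and not the patented implementation: `rank_diseases`, `overlap_score` and the toy database are hypothetical names, and the scoring function is a simple overlap placeholder standing in for the set similarity defined in this embodiment.

```python
def rank_diseases(input_features, disease_db, set_similarity):
    """Score every disease against feature set I and sort by score, descending.

    disease_db maps a disease name to its feature set A (a hypothetical
    in-memory stand-in for the feature relational database);
    set_similarity(I, A) is supplied by the caller.
    """
    scores = [
        (name, set_similarity(input_features, features))
        for name, features in disease_db.items()  # n = 1, 2, ...
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


def overlap_score(input_features, disease_features):
    """Placeholder score: share of input features found verbatim in A."""
    shared = set(input_features) & set(disease_features)
    return len(shared) / len(input_features)


toy_db = {
    "disease_1": ["feature_a"],
    "disease_2": ["feature_a", "feature_b"],
    "disease_3": ["feature_c"],
}
ranking = rank_diseases(["feature_a", "feature_b"], toy_db, overlap_score)
top_name, top_score = ranking[0]
```

Passing the scoring function as a parameter keeps the traversal independent of how the set similarity is computed, so the placeholder can be swapped for the weighted multi-level measure.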

In the single-gene disease name recommendation method based on multi-level structural similarity provided in this embodiment, a standardized clinical feature phenotype tree of single-gene diseases is first constructed based on the feature relational database of single-gene disease names. The clinical features in the feature set I input by the user are then marked at the corresponding nodes of the standardized clinical feature phenotype tree, the nth monogenic disease name in the feature relational database is traversed, and the standard clinical features in the feature set A corresponding to the current nth monogenic disease name are marked at their nodes on the tree. Next, according to the node labels on the standardized clinical feature phenotype tree, the best standard clinical features in one-to-one correspondence with the clinical features in feature set I are matched from feature set A, and the set similarity value between feature set I and the current feature set A is calculated from the similarity value between each clinical feature and its best standard clinical feature. After that, n is set to n+1 and the n-th monogenic disease name in the feature relational database is traversed again, until all the monogenic disease names in the feature relational database have been traversed; the set similarity values between feature set I and each feature set A are then summarized and sorted, and the name of the single-gene disease with the highest similarity value is output.

It can be seen that the single-gene disease name recommendation method based on multi-level structural similarity provided by this embodiment is convenient and friendly to use. Standardized clinical features can be entered conveniently through instant search and the phenotype tree, and users are allowed to enter similar clinical features, which relaxes the restrictions on user input and raises the degree of intelligent diagnosis. After the query is clicked, the recommended single gene disease names are output quickly, which improves the accuracy and efficiency of single gene disease diagnosis.

Specifically, the method for building the feature relational database of single gene disease names in the above embodiment includes:

Obtain the names of known monogenic diseases and their corresponding standard clinical features from public databases and literature databases of monogenic diseases; based on the known names of monogenic diseases and their corresponding standard clinical features, establish a feature relational database between monogenic disease names and standard clinical features; and calculate the contribution c_i of each standard clinical feature corresponding to each single-gene disease name to that single-gene disease.

Preferably, it is also necessary to translate the foreign language information in the characteristic relational database into Chinese information with reference to the Chinese Human Phenotype Standard Phrase Consortium, so as to realize the identification and matching of the Chinese version of the medical record data.

In specific implementation, the public database is the MedGen database, and the literature database is the PubMed database. The feature relational database includes matched single gene disease names, clinical features in foreign languages, identifiers of the clinical features in the human phenotype standard term database (HPO IDs), and clinical features in Chinese. This embodiment can provide clues and theoretical support for the clinical diagnosis and identification of single gene diseases, and also provides data support for further narrowing the scope of genetic testing. The clinical feature relational database established in this embodiment covers more than 8,600 single gene diseases, more than 11,000 clinical features of single gene disease phenotypes, and more than 90,000 relationship records between phenotypes and clinical features, and incorporates the latest database versions and literature reports on single gene disease research.

Specifically, the contribution c_i of each standard clinical feature corresponding to each single-gene disease name to that single-gene disease is calculated as follows:

In the feature relational database, assume that there are a standard clinical features in total, that the standard clinical features appear N times in total in the feature relational database, and that the i-th standard clinical feature appears a_i times. The frequency of occurrence of each standard clinical feature in the feature relational database is then f_i, calculated as:

f_i = a_i / N;
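The frequency calculation can be sketched in a few lines. This is an illustration under stated assumptions: the database is modeled as a hypothetical dictionary from disease name to feature list, a feature is counted once per disease, and N is taken to be the total number of disease-feature records, which is one reading of the text above.

```python
from collections import Counter


def feature_frequencies(disease_features):
    """Return f_i = a_i / N for every standard clinical feature.

    disease_features maps a disease name to the list of its standard
    clinical features (hypothetical stand-in for the relational database).
    """
    counts = Counter(
        feature
        for features in disease_features.values()
        for feature in set(features)  # count a feature once per disease
    )
    total = sum(counts.values())  # N, assumed to be the total record count
    return {feature: count / total for feature, count in counts.items()}


toy_db = {
    "disease_1": ["feature_a", "feature_b"],
    "disease_2": ["feature_a", "feature_c"],
    "disease_3": ["feature_a"],
}
freqs = feature_frequencies(toy_db)
```

Under this reading the frequencies sum to 1, so a feature shared by many diseases receives a high f_i and a disease-specific feature a low one.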

For a certain monogenic disease in the feature relational database, assuming that it has m standard clinical features whose distribution frequencies in the feature relational database are f_1, f_2, ..., f_m, the calculation formula for the contribution c_i of a standard clinical feature to the monogenic disease is:

In the above formula, k is the correction factor, and k>1, and the characteristic relation database is used as a reference database.

Feature set I, that is, the collected clinical feature information, can be entered in standardized form in two ways through visualization tools. In the first way, keywords are entered, each keyword being equivalent to a clinical feature, and instant search provides a drop-down menu of related standardized phenotype information for the user to choose from, thereby realizing the input of standardized clinical feature information. In the second way, the relevant standardized clinical feature information is clicked directly on the phenotype tree.

The method for constructing a standardized clinical feature phenotype tree of a single gene disease in the above embodiment includes:

Obtain data from the feature relational database, and construct a standardized clinical feature phenotype tree for monogenic diseases based on HPO, where the standardized clinical feature phenotype tree consists of multiple stem nodes and at least one branch node associated with each stem node. Each branch node is used to represent a standardized clinical feature, and each stem node is used to represent an index of the associated standardized clinical features. HPO refers to the hp.obo file.
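How a tree might be read out of an OBO file can be sketched as below. This is a minimal sketch, not the patent's construction: it handles only the `[Term]`, `id:` and `is_a:` lines of the OBO flat-file format, uses a made-up toy fragment with `T:` identifiers instead of real HPO content, and follows only the first `is_a` parent of each term, which is a simplifying assumption because HPO terms can have several parents.

```python
def parse_obo_parents(obo_text):
    """Map each term id to the list of its is_a parent ids."""
    parents = {}
    current_id = None
    in_term = False
    for raw_line in obo_text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas describe phenotypes.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
            parents.setdefault(current_id, [])
        elif in_term and current_id and line.startswith("is_a:"):
            # "is_a: T:0 ! root" -> keep only the id before the "!".
            parent_id = line[len("is_a:"):].split("!")[0].strip()
            parents[current_id].append(parent_id)
    return parents


def path_to_root(term_id, parents):
    """Follow the first is_a parent upward; returns [term, ..., root]."""
    path = [term_id]
    while parents.get(path[-1]):
        path.append(parents[path[-1]][0])
    return path


TOY_OBO = """\
format-version: 1.2

[Term]
id: T:0
name: root

[Term]
id: T:1
name: stem
is_a: T:0 ! root

[Term]
id: T:2
name: branch
is_a: T:1 ! stem

[Typedef]
id: part_of
"""

toy_parents = parse_obo_parents(TOY_OBO)
toy_path = path_to_root("T:2", toy_parents)
```

The resulting child-to-parent map is enough to recover the path from any branch node up to its ancestors, which is the information the path-based similarity relies on.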

In the above embodiment, the method of matching the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A based on the node labels on the standardized clinical feature phenotype tree includes:

Feature set I includes multiple clinical features, and feature set A includes multiple standard clinical features; traverse the i-th clinical feature in feature set I, and select from feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, the initial value of i being 1; let i=i+1 and traverse the i-th clinical feature in the feature set I again until all the clinical features in the feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the name of the n-th monogenic disease.

Further, the method for selecting the standard clinical feature with the highest similarity to the i-th clinical feature from feature set A includes:

Traverse the j-th standard clinical feature in the feature set A, and judge, based on the established index, whether the j-th standard clinical feature and the i-th clinical feature have the same stem node B_t, the initial value of j being 1; if the judgment result is no, the similarity between the j-th standard clinical feature and the i-th clinical feature is considered to be zero; if the judgment result is yes, the similarity value between the j-th standard clinical feature and the i-th clinical feature is calculated based on the multi-level structure similarity algorithm; let j=j+1, traverse the j-th standard clinical feature in the feature set A again, and continue to perform the similarity calculation between the j-th standard clinical feature and the i-th clinical feature until all the standard clinical features in feature set A have been traversed, thereby obtaining multiple similarity values, one for each standard clinical feature in feature set A; the standard clinical feature corresponding to the maximum of the multiple similarity values is selected as the best standard clinical feature corresponding to the i-th clinical feature.
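The inner loop over j can be sketched as follows. The names `best_match`, `stem_of` and `similarity` are hypothetical, the stem lookup is modeled as a plain dictionary, and the similarity function is supplied by the caller; the toy scores are made up for illustration.

```python
def best_match(input_feature, disease_features, stem_of, similarity):
    """Return (best_feature, best_score) for one input feature I_i.

    stem_of maps a feature to its stem node; similarity(a, b) returns the
    similarity of two features that share a stem node.
    """
    stem_i = stem_of.get(input_feature)
    best_feature, best_score = None, 0.0
    for candidate in disease_features:  # j = 1, 2, ..., n
        if stem_i is None or stem_of.get(candidate) != stem_i:
            score = 0.0  # different stem nodes: similarity is zero
        else:
            score = similarity(input_feature, candidate)
        if best_feature is None or score > best_score:
            best_feature, best_score = candidate, score
    return best_feature, best_score


toy_stems = {"in_1": "stem_1", "a_1": "stem_1", "a_2": "stem_1", "a_3": "stem_2"}
toy_scores = {("in_1", "a_1"): 0.4, ("in_1", "a_2"): 0.9}


def toy_similarity(a, b):
    """Look up a made-up similarity value for the pair (a, b)."""
    return toy_scores[(a, b)]


match, score = best_match("in_1", ["a_1", "a_2", "a_3"], toy_stems, toy_similarity)
```

Checking the stem node first means the path-based similarity is only evaluated for candidates that can score above zero.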

In the foregoing embodiment, the method for calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature based on the multi-level structure similarity algorithm includes:

Based on the node labeling on the standardized clinical feature phenotype tree, obtain the directed set IB of all nodes in the path connecting the i-th clinical feature and the same stem node B_t, and obtain the directed set AB of all nodes in the path connecting the j-th standard clinical feature and the same stem node B_t; the length of the directed set IB is the number of nodes in its path, L_IB, and the length of the directed set AB is the number of nodes in its path, L_AB; extract the intersection IAB of the nodes in the directed set IB and the directed set AB, the length of the intersection IAB being the number of common nodes in the two paths, L_IAB; the formula S_IiAj = β·SM + (1 - β)·SI is used to calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature;

where SM represents the similarity value between the j-th standard clinical feature and the i-th clinical feature across multiple levels of the phenotype tree, SI represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at the same level of the phenotype tree, and β is the weight coefficient.

In specific implementation, the feature set A corresponding to a single gene disease name in the feature relational database consists of n elements A _j , which are A ₁ , A ₂ , ..., A _n , that is, A=[A ₁ , A ₂ ,...,A _j ...,A _n ], each gene disease name in the characteristic relational database corresponds to a set A. If the standardized feature set I input by a patient with a monogenic disease is _{composed of m clinical I i} , the corresponding feature set I = [I ₁ , I ₂ , ..., _Im ]. If _{the stem nodes of I i} and A _j are not the same, then the similarity between _{I i} and A _j _{is considered to be 0. If the stem nodes of I i} and A _j are the same, as shown in Figure 2, the same stem node is B _t , Then calculate _{the similarity between I i} and A _j , the calculation method is: _{all nodes in the connecting path between I i} and B _t form a directed set IB, the number of elements in the directed set IB is denoted as N _IB , the directed set The length of IB is defined as the number of nodes on the path, denoted as L _IB , and L _IB =N _IB ;

All nodes in the connecting path between A_j and B_t form a directed set AB. The number of elements in the directed set AB is denoted N_AB, and the length of the directed set AB is defined as the number of nodes on the path, denoted L_AB, so that L_AB = N_AB;

The intersection of the directed set IB and the directed set AB is denoted IAB. The number of elements in the intersection IAB is denoted N_IAB, and the length of the set IAB is defined as the number of nodes on the common path, denoted L_IAB, so that L_IAB = N_IAB. Then SM = L_IAB/max(L_AB, L_IB) and SI = 1/(L_AB + L_IB - 2·L_IAB + 1), where β is the weight coefficient, β ∈ (0,1). The similarity between I_i and A_j therefore lies in the range S_IiAj ∈ [0,1].
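As an illustration of the calculation above, the following is a minimal Python sketch, not the patented implementation. It assumes the phenotype tree is stored as a child-to-parent map (the hypothetical `PARENT` dictionary, with invented node names), that a node whose parent is `None` is a stem node, and that each directed set contains both the stem node and the feature node itself. The function names `path_to_stem` and `pair_similarity` and the default β = 0.5 are chosen for illustration only.

```python
from typing import Dict, List, Optional

# Hypothetical toy tree: child -> parent; a parent of None marks a stem node.
PARENT: Dict[str, Optional[str]] = {
    "B1": None, "X": "B1", "X1": "X", "X2": "X", "Y": "B1",
    "B2": None, "Z": "B2",
}


def path_to_stem(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes from the stem node down to `node`, both inclusive.

    Assumes `parent` describes a tree (no cycles).
    """
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def pair_similarity(i_node: str, a_node: str,
                    parent: Dict[str, Optional[str]],
                    beta: float = 0.5) -> float:
    """Compute S_IiAj = beta*SM + (1 - beta)*SI for two feature nodes."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in the open interval (0, 1)")
    ib = path_to_stem(i_node, parent)   # directed set IB
    ab = path_to_stem(a_node, parent)   # directed set AB
    if ib[0] != ab[0]:
        return 0.0                      # different stem nodes
    l_ib, l_ab = len(ib), len(ab)
    l_iab = len(set(ib) & set(ab))      # nodes shared by both paths
    sm = l_iab / max(l_ab, l_ib)
    si = 1.0 / (l_ab + l_ib - 2 * l_iab + 1)
    return beta * sm + (1.0 - beta) * si
```

With this toy tree, two identical features give a similarity of 1 and features under different stem nodes give 0, which agrees with the range S_IiAj ∈ [0,1] stated above.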

Further, in the foregoing embodiment, the method for calculating the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature includes:

Use the contribution degree c_i of the i-th clinical feature to weight the maximum similarity value of the corresponding best standard clinical feature in the feature set A; let i = i+1 and repeat the weighting for the maximum similarity value of the best standard clinical feature corresponding to the next clinical feature, until all the best standard clinical features selected from feature set A have been weighted; then accumulate the weighted maximum similarity values of all the best standard clinical features in feature set A to obtain the set similarity value between the feature set I and the current feature set A.

In specific implementation, for each input clinical feature I_i, the standard clinical feature A_j with the greatest similarity can be found in the feature set A; that is, each clinical feature I_i yields one similarity value with respect to the feature set A. The similarity between the feature set I and the feature set A is defined as the sum of the similarities between each clinical feature I_i in the feature set I and the feature set A.

Considering that each clinical feature contributes differently to a single-gene disease, the corresponding maximum similarity value needs to be weighted, and the calculation formula is

S_IiA = c_i · max_{1≤j≤n} S_IiAj

where S_IiA represents the similarity value between the clinical feature I_i and the feature set A. The similarity value between feature set I and feature set A is defined as the sum of the similarities between each clinical feature I_i in feature set I and feature set A, and its calculation formula is

S_IA = Σ_{i=1}^{m} S_IiA

where S_IA represents the similarity value between feature set I and feature set A.
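The weighting and accumulation described above can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the pairwise similarity is passed in as a callable `sim`, the contribution degrees c_i are supplied as a plain dictionary keyed by input feature (how c_i is computed is not shown here), and the names `best_match` and `set_similarity` are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# sim(I_i, A_j) -> pairwise similarity value S_IiAj (supplied by the caller)
PairSim = Callable[[str, str], float]


def best_match(i_feat: str, set_a: Sequence[str],
               sim: PairSim) -> Tuple[Optional[str], float]:
    """Return the best standard feature A_j for I_i and its similarity.

    On ties the first maximum in `set_a` is kept; an empty `set_a`
    yields (None, 0.0).
    """
    best_feat: Optional[str] = None
    best_val = 0.0
    for a_feat in set_a:
        value = sim(i_feat, a_feat)
        if best_feat is None or value > best_val:
            best_feat, best_val = a_feat, value
    return best_feat, best_val


def set_similarity(set_i: Sequence[str], set_a: Sequence[str],
                   sim: PairSim, contribution: Dict[str, float]) -> float:
    """Accumulate the weighted maximum similarities over feature set I.

    Each term is c_i times the maximum S_IiAj over A, i.e. S_IiA;
    the returned sum is S_IA.
    """
    total = 0.0
    for i_feat in set_i:
        _, s_max = best_match(i_feat, set_a, sim)
        total += contribution[i_feat] * s_max   # S_IiA
    return total
```

Passing the pairwise similarity as a parameter keeps the accumulation step independent of how S_IiAj is obtained.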

The advantages of this embodiment are: 1. A user-friendly client is provided, in which users can search for and enter standardized clinical features in real time by clicking with the mouse or typing keywords, which is very convenient; 2. By calculating the multi-level structural similarity between the clinical features in feature set I and the feature set A, the multi-level structure similarity algorithm matches the input phenotype in a fuzzy manner, which relaxes the input requirements on the doctor and makes the input process more friendly and intelligent. Combined with the input information, the customized multi-level structure similarity algorithm calculates the strength of association with each single-gene disease name, indicates the single-gene diseases the patient may have according to that strength, and accurately recommends the single-gene disease name.

Example two

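Referring to FIG. 3, this embodiment provides a single-gene disease name recommendation system based on multi-level structural similarity, including a phenotype tree unit, an input unit, a traversal unit, a retrieval unit, a calculation unit, a judgment unit and an output unit.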

In one embodiment, the aforementioned single-gene disease name recommendation system is applied to a computer device that includes a processor and a memory connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The network interface of the single-gene disease name recommendation system is used to communicate with external sensors. When the computer-readable instructions are executed by the processor, the steps of the above single-gene disease name recommendation method based on multi-level structural similarity are implemented; for example, the above phenotype tree unit, input unit, traversal unit, retrieval unit, calculation unit, judgment unit and output unit implement the steps of that method.

Compared with the prior art, the beneficial effects of the single-gene disease name recommendation system based on multi-level structural similarity provided by this embodiment of the present invention are the same as those of the single-gene disease name recommendation method based on multi-level structural similarity provided in the first embodiment above, and will not be repeated here.

Example three

This embodiment provides a computer-readable storage medium, for example a non-volatile computer-readable storage medium, on which computer-readable instructions are stored; when executed by a processor, the computer-readable instructions perform the steps of the above method for recommending names of single-gene diseases based on multi-level structural similarity.

Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this embodiment are the same as the beneficial effects of the single-gene disease name recommendation method based on multi-level structural similarity provided by the above technical solutions, and will not be repeated here.

Example four

Based on the foregoing embodiment, referring to FIG. 4 and FIG. 5, a schematic diagram of an environment architecture of an application scenario is provided.

An application software can be developed to implement the single-gene disease name recommendation method based on multi-level structural similarity in the foregoing embodiment, and the application software can be installed in a user terminal, and the user terminal is connected to the server to realize communication.

The user terminal may be any smart device such as a computer or a tablet computer, and this embodiment only uses a computer as an example for description.

For example, the user opens the relevant application on the smart device and uses an input module, such as a keyboard or a mouse, to enter the clinical features of feature set I, so that the clinical features are entered into the application in standardized form. The features are sent to a database retrieval module, such as a server. The database retrieval module uses the multi-level structure similarity algorithm to traverse the feature relational database and calculate the similarity value between the feature set I and the feature set A corresponding to each single-gene disease name, and then summarizes and sorts the results to obtain the single-gene disease name with the highest similarity value. That name is then presented to the user through an output module, such as a display.
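The traverse, summarize, sort and output flow of the database retrieval module can be sketched as follows. This is a minimal Python sketch under stated assumptions: the feature relational database is modelled as a dictionary from disease name to feature set A, the set similarity is passed in as a callable, and `exact_overlap` is only a placeholder scorer used to make the example runnable; it is not the multi-level structure similarity algorithm. All function names and disease names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# set_sim(feature set I, feature set A) -> set similarity value S_IA
SetSim = Callable[[Sequence[str], Sequence[str]], float]


def rank_diseases(set_i: Sequence[str],
                  feature_db: Dict[str, Sequence[str]],
                  set_sim: SetSim) -> List[Tuple[str, float]]:
    """Score every disease name in the database and sort by similarity.

    The sort is descending; Python's sort is stable, so tied diseases
    keep their database order.
    """
    scores = [(name, set_sim(set_i, set_a))
              for name, set_a in feature_db.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


def recommend(set_i: Sequence[str],
              feature_db: Dict[str, Sequence[str]],
              set_sim: SetSim) -> Optional[str]:
    """Return the disease name with the highest set similarity, if any."""
    ranked = rank_diseases(set_i, feature_db, set_sim)
    return ranked[0][0] if ranked else None


def exact_overlap(set_i: Sequence[str], set_a: Sequence[str]) -> float:
    """Placeholder scorer: number of input features found verbatim in A."""
    return float(len(set(set_i) & set(set_a)))
```

Returning the full ranked list as well as the top name lets a client display several candidate disease names instead of only the best one.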

A person of ordinary skill in the art can understand that all or part of the steps of the above inventive method can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method in the foregoing embodiments. The storage medium may be a ROM/RAM, a magnetic disk, an optical disk, a memory card, or the like.

The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. A method for recommending names of single-gene diseases based on multi-level structural similarity, comprising:

Construct a standardized clinical feature phenotype tree of monogenic diseases based on the feature relational database of monogenic disease names;

Mark, on the standardized clinical feature phenotype tree, the nodes of the clinical features in the feature set I input by the user;

Traverse the n-th monogenic disease name in the feature relational database, and mark the nodes of the standard clinical features in the corresponding feature set A on the standardized clinical feature phenotype tree, where the initial value of n is 1;

Based on the node labels on the standardized clinical feature phenotype tree, the best standard clinical feature corresponding to each clinical feature of feature set I is matched from feature set A;

According to the similarity value between each clinical feature and the corresponding best standard clinical feature, calculate the set similarity value between feature set I and the current feature set A; and

Let n = n+1 and traverse the n-th monogenic disease name in the feature relational database again, until all monogenic disease names in the feature relational database have been traversed; then summarize and sort the set similarity values between the feature set I and each feature set A, and output the single-gene disease name corresponding to the highest similarity value.
2. The method according to claim 1, wherein the method of establishing the feature relational database of the names of single-gene diseases comprises:

Obtain the names of known monogenic diseases and their corresponding standard clinical features from public databases and literature databases of monogenic diseases;

Based on the known names of single-gene diseases and their corresponding standard clinical features, establish a feature relational database between the names of single-gene diseases and standard clinical features; and

Calculate the contribution degree c_i of each standard clinical feature corresponding to each single-gene disease name to that single-gene disease.
3. The method according to claim 1 or 2, wherein the method of constructing the standardized clinical feature phenotype tree of single-gene diseases comprises:

Obtain data from the feature relational database, and construct the standardized clinical feature phenotype tree of monogenic diseases based on HPO;

The standardized clinical feature phenotype tree is composed of multiple stem nodes and at least one branch node associated with each stem node; each branch node represents a standardized clinical feature, and each stem node represents the index of its associated standardized clinical features.
4. The method according to any one of claims 1 to 3, wherein the method of matching, from feature set A, the best standard clinical feature corresponding to each clinical feature in feature set I based on the node labels on the standardized clinical feature phenotype tree comprises:

The feature set I includes multiple clinical features, and the feature set A includes multiple standard clinical features;

Traverse the i-th clinical feature in the feature set I, and select from the feature set A the standard clinical feature with the highest similarity to the i-th clinical feature as the best standard clinical feature corresponding to the i-th clinical feature, where the initial value of i is 1; and

Let i = i+1 and traverse the i-th clinical feature in the feature set I again, until all clinical features in the feature set I have been traversed, so that multiple best standard clinical features in one-to-one correspondence with the clinical features in feature set I are selected from the feature set A corresponding to the n-th monogenic disease name.
5. The method according to claim 4, wherein the method of selecting the standard clinical feature with the highest similarity to the i-th clinical feature from the feature set A comprises:

Traverse the j-th standard clinical feature in the feature set A, and determine, based on the established index, whether the j-th standard clinical feature and the i-th clinical feature have the same stem node B_t, where the initial value of j is 1;

If the judgment result is no, it is considered that the similarity value between the j-th standard clinical feature and the i-th clinical feature is zero;

If the judgment result is yes, calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature based on a multi-level structure similarity algorithm;

Let j = j+1, traverse the j-th standard clinical feature in the feature set A again, and continue to calculate the similarity between the j-th standard clinical feature and the i-th clinical feature, until all standard clinical features in the feature set A have been traversed and the similarity values corresponding to the standard clinical features in feature set A have been obtained; and

Select, from the multiple similarity values, the standard clinical feature corresponding to the maximum value as the best standard clinical feature corresponding to the i-th clinical feature.
6. The method according to claim 5, wherein the method of calculating the similarity value between the j-th standard clinical feature and the i-th clinical feature based on a multi-level structure similarity algorithm comprises:

Based on the node labels on the standardized clinical feature phenotype tree, obtain the directed set IB of all nodes in the path connecting the i-th clinical feature to the common stem node B_t, and obtain the directed set AB of all nodes in the path connecting the j-th standard clinical feature to the same stem node B_t, where the length of the directed set IB is the number of nodes in its path, L_IB, and the length of the directed set AB is the number of nodes in its path, L_AB;

Extract the intersection IAB of the nodes in the directed set IB and the directed set AB, where the length of the intersection IAB is the number of common nodes in the two paths, L_IAB; and

Use the formula S_IiAj = β·SM + (1-β)·SI to calculate the similarity value between the j-th standard clinical feature and the i-th clinical feature; wherein,

SM represents the similarity value between the j-th standard clinical feature and the i-th clinical feature across multiple levels of the phenotype tree; and

SI represents the similarity value between the j-th standard clinical feature and the i-th clinical feature at the same level of the phenotype tree, and β is a weighting coefficient.
7. The method according to claim 6, wherein the calculation formula of SM is SM = L_IAB/max(L_AB, L_IB), and the calculation formula of SI is SI = 1/(L_AB + L_IB - 2·L_IAB + 1).
8. The method according to any one of claims 1 to 7, wherein the method of calculating the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature comprises:

Use the contribution degree c_i of the i-th clinical feature to weight the maximum similarity value corresponding to the best standard clinical feature in the feature set A; and

Let i = i+1 and weight the maximum similarity value of the best standard clinical feature corresponding to the i-th clinical feature in feature set A again, until all the best standard clinical features selected from feature set A have been weighted; then accumulate the weighted maximum similarity values corresponding to all the best standard clinical features in the feature set A to obtain the set similarity value between the feature set I and the current feature set A.
9. A single-gene disease name recommendation system based on multi-level structural similarity, comprising:

The phenotype tree unit is used to construct a standardized clinical feature phenotype tree of single-gene diseases according to the feature relational database of single-gene disease names;

The input unit is used to mark, on the standardized clinical feature phenotype tree, the nodes of the clinical features in the feature set I input by the user;

The traversal unit is used to traverse the n-th monogenic disease name in the feature relational database, and mark the nodes of the standard clinical features in the corresponding feature set A on the standardized clinical feature phenotype tree, where the initial value of n is 1;

The retrieval unit, based on the node labels on the standardized clinical feature phenotype tree, matches the best standard clinical feature corresponding to each clinical feature in feature set I from feature set A;

The calculation unit is used to calculate the set similarity value between the feature set I and the current feature set A according to the similarity value between each clinical feature and the corresponding best standard clinical feature;

The judging unit is used to let n = n+1 and invoke the traversal unit again, until all single-gene disease names in the feature relational database have been traversed; and

The output unit is used to summarize and sort the set similarity values corresponding to the feature set I and each feature set A, and output the name of the single gene disease corresponding to the highest similarity value.
10. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, perform the steps of the method according to any one of claims 1 to 8.
11. A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions, wherein the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 8.